08-14-2019, 12:24 PM   #16
ilovejedd
hopeless n00b
Posts: 5,111
Karma: 19597086
Join Date: Jan 2009
Location: in the middle of nowhere
Device: PW4, PW3, Libra H2O, iPad 10.5, iPad 11, iPad 12.9
Quote:
Originally Posted by j.p.s View Post
My method does not interact with Amazon servers; it extracts notes from the files in the .sdr directories for the books on your Kindle and writes them as plain text. Depending on how pretty you want the format of the notes, my method might work for you.

You might also look at jhowell's kindle reader data store KRDS https://www.mobileread.com/forums/sh...d.php?t=322172
Question, is this capable of extracting the actual text of highlights? I don't think those are stored in the .sdr files (just location).
08-14-2019, 01:17 PM   #17
j.p.s
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
Quote:
Originally Posted by ilovejedd View Post
Question, is this capable of extracting the actual text of highlights? I don't think those are stored in the .sdr files (just location).
The notes_insert.pl script does, as part of modifying the rawml file to reflect highlighting.

It would be pretty easy to add an option to both the azw3r C program and the Perl script to extract the text of the highlights. Part of why I haven't done it yet is that I am unsure how useful the highlights are out of context, and that they would contain any HTML markup within the highlight text. (The latter surprised me when it showed up in some short highlights.)
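
For readers who want the highlight text without the embedded markup, here is a minimal Python sketch. It is not part of azw3r; the function name `strip_markup` is hypothetical, and the approach assumes the fragment contains only whole tags (a highlight that starts or ends inside a tag is not handled).

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect character data and drop every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(fragment: str) -> str:
    """Return the text of an HTML fragment with the tags removed."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# A stray closing tag, as can happen when a highlight starts inside an element.
print(strip_markup("On a bright</span> Monday in January "))
```

The standard-library parser tolerates unbalanced tags, which matters here because a highlight can begin or end in the middle of an element.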
08-14-2019, 11:20 PM   #18
j.p.s
I'm attaching a PDF of a book with inserted highlights and notes to this post along with associated files to make it. It turns out that the utility html2ps does not choke on the XML in a rawml file like my web browser does, so there was no need to comment out the XML. What is also surprising to me is that the TOC in the PDF works. This is not meant to be a book with highlights and notes, but rather the highlights and notes shown in context.

The source book is the EPUB of The Humbugs of the World by P T Barnum from the Mobileread Library. I used kindlegen to make a dual mobi, then used kindleunpack to extract the rawml and the azw3. I copied the azw3 to a Kindle and quickly made 9 highlights with bogus notes.

Then I copied the azw3r file and dumped the notes, which also gives the start and end of each highlight. Next I used the notes_insert.pl from the first post to modify the rawml, then ran html2ps and ps2pdf. You can search the PDF for '[HL]' or '[Note:' to find the highlights and notes.
Attached Files
File Type: zip HighlightsNotes_in_pdf.zip (1.01 MB, 453 views)
08-17-2019, 01:53 PM   #19
j.p.s
Quote:
Originally Posted by ilovejedd View Post
Question, is this capable of extracting the actual text of highlights? I don't think those are stored in the .sdr files (just location).
The latest C and Perl versions now have an option to extract the highlighted text from the rawml file. When the -h argument is supplied along with -r filename.rawml, a fourth column is printed containing the highlighted text in single quotes.

These changes, along with an expanded README, are in the latest release, v0.1.4, at https://github.com/jps-e/azw3r and the attachments azw3r.c.gz and azw3r.pl.gz have been updated in post #1 of this thread.
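
If you want to post-process the four-column output, a minimal Python sketch follows. It is not part of azw3r. It assumes whitespace-separated columns in the order start, end, type, text, which matches the sample output posted later in this thread; the names `Highlight` and `parse_line` are hypothetical.

```python
from typing import NamedTuple


class Highlight(NamedTuple):
    start: int
    end: int
    kind: str
    text: str


def parse_line(line: str) -> Highlight:
    """Split one four-column output line into start, end, type, and text."""
    # maxsplit=3 keeps any whitespace inside the quoted text intact.
    start, end, kind, quoted = line.rstrip("\r\n").split(None, 3)
    # Assumption: the text is wrapped in exactly one pair of single quotes.
    if len(quoted) >= 2 and quoted[0] == "'" and quoted[-1] == "'":
        quoted = quoted[1:-1]
    return Highlight(int(start), int(end), kind.rstrip(":"), quoted)


sample = "1259\t1269\tHighlight:\t'a thousand '"
print(parse_line(sample))
```

Lines with fewer than four columns (for example, output produced without -h) are not handled by this sketch and will raise a ValueError.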
08-17-2019, 05:57 PM   #20
j.p.s
Quote:
Originally Posted by Luca2903 View Post
HI JPS, very interesting work.

Could you please be so kind to try and help me a little bit?

I have this problem here, and I'd like to understand more if your solution is able to help me.

https://www.mobileread.com/forums/sh...44#post3878444

Thanks!
Luca2903,

You have not said what Kindle format your books are in. If they are KF8 (azw3), then I think my scripts are in good enough shape for just about anyone to extract both notes and highlights as text into separate files and/or insert them into the text of the book for context.
08-17-2019, 06:20 PM   #21
j.p.s
KF8 notes and highlights embedded in text of book for context as PDF

When I first started playing with this I thought Calibre could not convert a kindleunpack rawml file to PDF, but when I tack an .html extension onto the input file name, Calibre makes a usable PDF, even if the XML has not been commented out. It has the added advantage over html2ps that the font is larger. (However, the Calibre-generated PDF does not have a clickable TOC. Also, for me, using xpdf to view it, clicking on a link internal to the PDF causes an empty web browser window to open.)
08-18-2019, 10:56 AM   #22
rzikaou
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2019
Device: Kindle
Hi j.p.s,

First, I'd like to thank you for the work you've done so far.

I was in the process of reverse engineering azw3r files and then found your project and it's been super helpful.

I'm not sure if this is a known issue, but when I use your tool to extract highlights (with no notes) from a book, the highlight text it extracts is not correct.

I suspect the rawml file I'm providing might not be the right one.

I used KindleUnpack and it gave me 2 different rawml files: mobi7/book.rawml and mobi8/book.rawml.

I tried running `azw3r -i book.azw3r -h -r book.rawml` with both of them and the extracted highlight text is incorrect.

Any ideas?

Thanks!
08-18-2019, 12:18 PM   #23
j.p.s
hi rzikaou,

I'm sorry to hear that it is not working for you.

Can you post the exact sequence of commands you are using?

It would also be good if you can pick some public domain book (preferably short with few or no images) in KF8 format (azw3) and make some highlights in it and post the output of the azw3r program here along with the book.azw3r file used to extract the highlights. Then I can try to reproduce your problem.
08-18-2019, 02:39 PM   #24
rzikaou
The document I'm using is an HTML document that I converted to `epub`, then to `mobi` (using `kindlegen`). I then e-mailed it to my Kindle using the special `@kindle.com` address, which resulted in the `azw3` document that is on my Kindle.

I've made the following test highlights on the first page of the article (without any notes):

"On a bright Monday in January"

"a thousand"

"They packed themselves into a cheerful courtyard outside"

But this doesn't seem to be what the tool is returning.

I'm trying to remember how I generated the rawml because now running `kindleunpack` doesn't give me any `.rawml` files.
Attached Files
File Type: zip test-article.zip (50.8 KB, 380 views)
08-18-2019, 04:10 PM   #25
j.p.s
Thanks, rzikaou, for uploading your example. It turns out that using your own file, rather than following my suggestion to use a book, was important. I've never understood why there is a 14-byte difference between Amazon's offset into the rawml and where the text actually is. It turns out that for your azw3 the offset is 166 bytes. I have added a -o option to the azw3r.c and azw3r.pl attached to the first post and made a new GitHub release.

I don't know whether it is possible to tell in advance what the offset is. I had to get yours experimentally. I moved your azw3r and rawml files into the same directory so that the command would merely be way too long instead of impossibly long.
Code:
azw3r -h -o 166 -i "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWNa1bd4a78ed253ba5271d0cb7df407fda.azw3r" -r "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWN.rawml"
1259    1269    Highlight:      'a thousand '
1184    1220    Highlight:      'On a bright</span> Monday in January '
1462    1518    Highlight:      'They packed themselves into a cheerful courtyard outside '
I have shown the "-o 166" at the beginning of the command for clarity. During experimentation it would be best at the end.
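
Finding the offset experimentally can be sketched in Python as below. This is not how azw3r does it; it is only an illustration under two assumptions: you know the exact bytes of one highlight, and the offset is the rawml position minus the start position stored in the azw3r file. The function name `find_offset` is hypothetical.

```python
from typing import Optional


def find_offset(rawml: bytes, start: int, known_text: bytes) -> Optional[int]:
    """Return (position of known_text in rawml) - start, or None if absent."""
    index = rawml.find(known_text)
    if index < 0:
        return None
    return index - start


# Synthetic stand-in for a rawml file: 166 extra bytes of header,
# then filler so that the highlight would sit at position 1259 without them.
rawml = b"#" * 166 + b"." * 1259 + b"a thousand " + b"more text"
print(find_offset(rawml, 1259, b"a thousand"))
```

Only the first match is used, so pick a highlight whose text is unlikely to occur earlier in the file.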
08-18-2019, 06:03 PM   #26
jhowell
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
Quote:
Originally Posted by j.p.s View Post
I don't know whether it is possible to tell in advance what the offset is. I had to get yours experimentally.
"kindleunpack -d" will produce a file named "assembled_text.dat" in the mobi8 folder containing a subset of the rawml corresponding to the actual book content (flow 0). I think you will find that the position number offsets are indexed into this data without any correction needed.
08-18-2019, 07:46 PM   #27
rzikaou
Thanks for debugging this, j.p.s.

jhowell, as far as I can tell, you are correct. Using the offsets against the "assembled_text.dat" gives the expected result!
08-18-2019, 07:49 PM   #28
j.p.s
Quote:
Originally Posted by jhowell View Post
"kindleunpack -d" will produce a file named "assembled_text.dat" in the mobi8 folder containing a subset of the rawml corresponding to the actual book content (flow 0). I think you will find that the position number offsets are indexed into this data without any correction needed.
And so it does. But it seems a bit magic. Somewhat early on, the rawml has the 14-byte string "</body></html>", which is not in assembled_text.dat. Beyond that, the two files have unaligned sets of opening and closing html and body tags, which somehow do not affect the byte offsets of the book text.

rzikaou's rawml file has extra header tags not in the assembled_text.dat file.

So the C azw3r and the perl azw3r.pl can be used as is with
-r assembled_text.dat -o 0
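
To illustrate what "no correction needed" means, here is a minimal Python sketch that slices a highlight directly out of the file contents. It is not the azw3r implementation. It assumes the end position is inclusive (consistent with the sample output above, where 1259 to 1269 yields the 11 characters of 'a thousand ') and that the data is UTF-8; the function name `extract` is hypothetical.

```python
def extract(data: bytes, start: int, end: int, offset: int = 0) -> str:
    """Return the bytes from start to end (inclusive), shifted by offset."""
    chunk = data[start + offset : end + offset + 1]
    # Assumption: UTF-8 text; undecodable bytes are replaced, not fatal.
    return chunk.decode("utf-8", errors="replace")


# Synthetic stand-in for assembled_text.dat; with a real file, use
# pathlib.Path("assembled_text.dat").read_bytes() instead.
data = b"." * 1259 + b"a thousand " + b"more text"
print(repr(extract(data, 1259, 1269)))

# The same positions against data with 166 extra leading bytes need offset=166.
print(repr(extract(b"#" * 166 + data, 1259, 1269, offset=166)))
```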

Quote:
Originally Posted by odamizu View Post
Thank you jhowell! As always, you are a wonderful source of enlightenment
Ditto.
09-07-2019, 06:18 PM   #29
j.p.s
New release, process KRDS JSON and use defaults assuming assembled_text.dat format

There is a new release at GitHub that makes the default rawml offset 0 in the C and Perl utilities, so kindleunpack -d should be used to make assembled_text.dat instead of kindleunpack -r to make <book>.rawml.

Also, there is a new utility named krdsJSON2notes.pl that processes the <book>.json file produced by jhowell's KRDS parser https://www.mobileread.com/forums/sh...d.php?t=322172 into the same format used by notes_insert.pl to highlight and insert notes into a rawml file (assembled_text.dat) suitable for converting to PDF.

So now human-readable personal notes can be extracted from all Kindle books, and personal highlights can be extracted from KF8 (azw3) and probably mobi books.

The current latest release is attached as azw3r-0.1.7.zip to post #1 in this thread.
10-09-2019, 05:17 PM   #30
Luca2903
Junior Member
Posts: 9
Karma: 10
Join Date: Jul 2019
Device: Kindle PW 2
Quote:
Originally Posted by j.p.s View Post
Luca2903,

You have not said what Kindle format your books are in. If they are KF8 (azw3), then I think my scripts are in good enough shape for just about anyone to extract both notes and highlights as text into separate files and/or insert them into the text of the book for context.
Hello man, the format is .kfx.

Thanks.