azw3r highlight and note extraction info - Page 2

ilovejedd · 08-14-2019, 12:24 PM

Quote:

Originally Posted by j.p.s

My method does not interact with amazon servers, but extracts notes from the files in the .sdr directories for your books on your kindle to plain text output. Depending on how pretty you want the format of the notes, my method might work for you.

You might also look at jhowell's kindle reader data store KRDS https://www.mobileread.com/forums/sh...d.php?t=322172

Question, is this capable of extracting the actual text of highlights? I don't think those are stored in the .sdr files (just location).

j.p.s · 08-14-2019, 01:17 PM

Quote:

Originally Posted by ilovejedd

Question, is this capable of extracting the actual text of highlights? I don't think those are stored in the .sdr files (just location).

The notes_insert.pl script does as part of modifying the rawml file to reflect highlighting.

It would be pretty easy to add an option to both the azw3r C program and perl script to extract the text of the highlights. Part of why I haven't done it yet is because I am unsure how useful those are out of context and because they would contain any HTML markup within the highlight text. (The latter surprised me when it showed up in some short highlights.)
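
For readers who want the highlight text without the embedded markup mentioned above, here is a minimal sketch using only the Python standard library. It is not part of azw3r; the names `TextOnly` and `strip_markup` are made up for illustration, and the input is just an example fragment with a stray closing tag.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and drop every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; entities are already decoded.
        self.parts.append(data)


def strip_markup(fragment: str) -> str:
    """Return the fragment with HTML tags removed."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_markup("On a bright</span> Monday in January "))
```

A highlight that starts or ends in the middle of a tag may still leave debris behind, so treat the result as a convenience rather than a guarantee.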

j.p.s · 08-14-2019, 11:20 PM

I'm attaching a PDF of a book with inserted highlights and notes to this post along with associated files to make it. It turns out that the utility html2ps does not choke on the XML in a rawml file like my web browser does, so there was no need to comment out the XML. What is also surprising to me is that the TOC in the PDF works. This is not meant to be a book with highlights and notes, but rather the highlights and notes shown in context.

The source book is EPUB of The Humbugs of the World by P T Barnum from the Mobileread Library. I used kindlegen to make a dual mobi and used kindleunpack to extract the rawml and azw3, which I copied to a kindle and quickly made 9 highlights with bogus notes.

Then I copied the azw3r and dumped the notes, which also gives the start and end of each highlight. Next I used the notes_insert.pl from the first post to modify the rawml, then html2ps and ps2pdf. You can search the PDF for '[HL]' or '[Note:' to find the highlights and notes.

j.p.s · 08-17-2019, 01:53 PM

Quote:

Originally Posted by ilovejedd

Question, is this capable of extracting the actual text of highlights? I don't think those are stored in the .sdr files (just location).

The latest C and perl versions now have an option to extract the highlighted text from the rawml file. When the -h argument is supplied along with -r filename.rawml a fourth column will be printed consisting of the highlighted text in single quotes.
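
If you want to post-process that four-column output, a rough Python sketch could look like the following. The line layout (start, end, a label ending in a colon, then the text in single quotes, separated by whitespace) is assumed from sample output posted in this thread, not taken from the azw3r source; `Row` and `parse_row` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: start, end, label ending in ':', then the text in single quotes.
ROW_RE = re.compile(r"^(\d+)\s+(\d+)\s+(\w+):\s+'(.*)'\s*$")


class Row(NamedTuple):
    start: int
    end: int
    kind: str
    text: str


def parse_row(line: str) -> Optional[Row]:
    """Parse one output line; return None if it does not match the assumed layout."""
    match = ROW_RE.match(line)
    if match is None:
        return None
    start, end, kind, text = match.groups()
    return Row(int(start), int(end), kind, text)


sample = "1259    1269    Highlight:      'a thousand '"
print(parse_row(sample))
```

The greedy match up to the last single quote means an apostrophe inside the highlight does not cut the text short. Highlights that span several lines are not handled by this sketch.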

These changes, along with an expanded README are in the latest release, v0.1.4, at https://github.com/jps-e/azw3r and the attachments azw3r.c.gz and azw3r.pl.gz have been updated in post #1 of this thread.

j.p.s · 08-17-2019, 05:57 PM

Quote:

Originally Posted by Luca2903

HI JPS, very interesting work.

Could you please be so kind to try and help me a little bit?

I have this problem here, and I'd like to understand more if your solution is able to help me.

https://www.mobileread.com/forums/sh...44#post3878444

Thanks!

Luca2903,

you have not said what kindle format your books are in. If they are KF8 (azw3), then I think my scripts are in good enough shape for just about anyone to extract both notes and highlights as text as separate files and/or insert them into the text of the book for context.

j.p.s · 08-17-2019, 06:20 PM

When I first started playing with this I thought Calibre could not convert a kindleunpack rawml file to PDF, but when I tack an .html extension onto the input file name, Calibre makes a usable PDF, even if the XML has not been commented out. It has the added advantage over html2ps that the font is larger. (However, the Calibre generated PDF does not have a clickable TOC. Also, for me using xpdf to view it, clicking on a link internal to the PDF causes an empty web browser window to open.)

rzikaou · 08-18-2019, 10:56 AM

Hi j.p.s,

First, I'd like to thank you for the work you've done so far.

I was in the process of reverse engineering azw3r files and then found your project and it's been super helpful.

I'm not sure if this is a known issue, but I tried using your tool to extract highlights (with no notes) from a book, but the highlight text it is extracting is not correct.

I suspect the rawml file I'm providing it might not be the right one.

I used KindleUnpack and it gave me 2 different rawml files: mobi7/book.rawml and mobi8/book.rawml.

I tried running `azw3r -i book.azw3r -h -r book.rawml` with both of them and the extracted highlight text is incorrect.

Any ideas?

Thanks!

j.p.s · 08-18-2019, 12:18 PM

hi rzikaou,

I'm sorry to hear that it is not working for you.

Can you post the exact sequence of commands you are using?

It would also be good if you can pick some public domain book (preferably short with few or no images) in KF8 format (azw3) and make some highlights in it and post the output of the azw3r program here along with the book.azw3r file used to extract the highlights. Then I can try to reproduce your problem.

rzikaou · 08-18-2019, 02:39 PM

The document I'm using is an html document that I've converted to `epub`, then to `mobi` (using `kindlegen`) and then I e-mailed it to my Kindle with the special `@kindle.com` email which resulted in the `azw3` document that is on my kindle.

I've made the following test highlights on the first page of the article (without any notes):

"On a bright Monday in January"

"a thousand"

"They packed themselves into a cheerful courtyard outside"

But this doesn't seem to be what the tool is returning.

I'm trying to remember how I generated the rawml because now running `kindleunpack` doesn't give me any `.rawml` files.

j.p.s · 08-18-2019, 04:10 PM

Thanks rzikaou for uploading your example. It turns out that was important rather than my suggestion to use a book. I've never understood why there is a 14 byte difference between Amazon's offset into the rawml and where the text actually is. It turns out that for your azw3 the offset is 166 bytes. I have added a -o option to the azw3r.c and azw3r.pl attached to the first post and made a new github release.

I don't know whether it is possible to tell in advance what the offset is. I had to get yours experimentally. I moved your azw3r and rawml files into the same directory so that the command would merely be way too long instead of impossibly long.

Code:

azw3r -h -o 166 -i "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWNa1bd4a78ed253ba5271d0cb7df407fda.azw3r" -r "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWN.rawml"
1259    1269    Highlight:      'a thousand '
1184    1220    Highlight:      'On a bright</span> Monday in January '
1462    1518    Highlight:      'They packed themselves into a cheerful courtyard outside '

I have shown the "-o 166" at the beginning of the command for clarity. During experimentation it would be best at the end.
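
Finding the offset by hand can be approximated in a few lines of Python when you know the text of at least one highlight. This is only a sketch under assumptions: `candidate_offsets` is a hypothetical helper, the data is synthetic, and the offset is defined here as actual position minus reported start. The thread does not say whether azw3r adds or subtracts its -o value, so check the sign against real output.

```python
from typing import List


def candidate_offsets(data: bytes, reported_start: int, known_text: bytes) -> List[int]:
    """Return (actual position - reported start) for every occurrence of known_text."""
    offsets = []
    pos = data.find(known_text)
    while pos != -1:
        offsets.append(pos - reported_start)
        pos = data.find(known_text, pos + 1)
    return offsets


# Synthetic stand-ins for the files (not real Kindle data).
body = b"<p>On a bright Monday in January, a thousand people gathered.</p>"
extra_header = b"<html><head><title>x</title></head>"
rawml_like = extra_header + body

# Position the reader would report if it indexed into the body alone.
reported = body.index(b"a thousand")
print(candidate_offsets(rawml_like, reported, b"a thousand"))
```

A short phrase can occur more than once, which yields several candidates. Running the same search for two or three highlights and keeping the value they have in common narrows it down.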

jhowell · 08-18-2019, 06:03 PM

Quote:

Originally Posted by j.p.s

I don't know whether it is possible to tell in advance what the offset is. I had to get yours experimentally.

"kindleunpack -d" will produce a file named "assembled_text.dat" in the mobi8 folder containing a subset of the rawml corresponding to the actual book content (flow 0). I think you will find that the position number offsets are indexed into this data without any correction needed.

rzikaou · 08-18-2019, 07:46 PM

Thanks for debugging this j.p.s.

jhowell as far as I can tell, you are correct. Using the offsets against the "assembled_text.dat" gives the expected result!

j.p.s · 08-18-2019, 07:49 PM

Quote:

Originally Posted by jhowell

"kindleunpack -d" will produce a file named "assembled_text.dat" in the mobi8 folder containing a subset of the rawml corresponding to the actual book content (flow 0). I think you will find that the position number offsets are indexed into this data without any correction needed.

And so it does. But, it seems a bit magic. Somewhat early on, the rawml has the 14 byte string "</body></html>" not in assembled_text.dat, then somehow the two files have unaligned sets of opening and closing html and body tags which somehow do not affect the byte offsets of book text.

rzikaou's rawml file has extra header tags not in the assembled_text.dat file.

So the C azw3r and the perl azw3r.pl can be used as is with
-r assembled_text.dat -o 0
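
To see what "no correction needed" means in practice, here is a small Python sketch that slices a byte range straight out of the data. It is an illustration, not the azw3r code. The end position is treated as inclusive, which is inferred from the sample output earlier in the thread (1259 to 1269 yields the 11 bytes of 'a thousand '). UTF-8 is an assumption, and `slice_highlight` and `read_highlight` are hypothetical names.

```python
from pathlib import Path


def slice_highlight(data: bytes, start: int, end: int, offset: int = 0) -> bytes:
    """Return bytes start..end shifted by offset; end is assumed to be inclusive."""
    return data[start + offset : end + offset + 1]


def read_highlight(path: str, start: int, end: int, offset: int = 0) -> str:
    """Read the whole file and decode one highlight; UTF-8 is assumed here."""
    data = Path(path).read_bytes()
    return slice_highlight(data, start, end, offset).decode("utf-8", errors="replace")


# Synthetic data standing in for assembled_text.dat.
demo = b"<p>On a bright Monday in January, a thousand people</p>"
start = demo.index(b"a thousand")
end = start + len(b"a thousand ") - 1  # inclusive end position
print(repr(slice_highlight(demo, start, end)))
```

With a real file the call would be something like `read_highlight("assembled_text.dat", 1259, 1269)`, leaving the offset at 0.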

Quote:

Originally Posted by odamizu

Thank you jhowell! As always, you are a wonderful source of enlightenment

Ditto.

j.p.s · 09-07-2019, 06:18 PM

There is a new release at github that makes the default rawml offset 0 in the C and perl utilities, so kindleunpack -d should be used to make assembled_text.dat instead of kindleunpack -r to make <book>.rawml.

Also, there is a new utility named krdsJSON2notes.pl that processes the <book>.json file produced by jhowell's KRDS parser https://www.mobileread.com/forums/sh...d.php?t=322172 into the same format used by notes_insert.pl to highlight and insert notes into a rawml file (assembled_text.dat) suitable for converting to PDF.

So now human readable personal notes can be extracted from all Kindle books and personal highlights can be extracted from KF8 (azw3) and probably mobi books.

The current latest release is attached as azw3r-0.1.7.zip to post #1 in this thread.

Luca2903 · 10-09-2019, 05:17 PM

Quote:

Originally Posted by j.p.s

Luca2903,

you have not said what kindle format your books are in. If they are KF8 (azw3), then I think my scripts are in good enough shape for just about anyone to extract both notes and highlights as text as separate files and/or insert them into the text of the book for context.

Hello man, the format is .Kfx.

Thanks.
