08-14-2019, 12:24 PM | #16 | |
hopeless n00b
Posts: 5,110
Karma: 19597086
Join Date: Jan 2009
Location: in the middle of nowhere
Device: PW4, PW3, Libra H2O, iPad 10.5, iPad 11, iPad 12.9
|
Quote:
|
|
08-14-2019, 01:17 PM | #17 | |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
Quote:
It would be pretty easy to add an option to both the azw3r C program and perl script to extract the text of the highlights. Part of why I haven't done it yet is because I am unsure how useful those are out of context and because they would contain any HTML markup within the highlight text. (The latter surprised me when it showed up in some short highlights.) |
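As a rough illustration of the markup concern, here is a minimal Python sketch that strips tags from an extracted highlight. It is not part of azw3r; the names `strip_markup` and `_TextOnly` are made up for this example, and it assumes the highlight has already been decoded to a string.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for plain text between tags; tags themselves are dropped.
        self.chunks.append(data)


def strip_markup(fragment: str) -> str:
    """Return the fragment with HTML tags removed (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(repr(strip_markup("On a bright</span> Monday in January ")))
```

A highlight that starts or ends in the middle of a tag would probably need extra handling that this sketch does not attempt.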
|
08-14-2019, 11:20 PM | #18 |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
I'm attaching to this post a PDF of a book with inserted highlights and notes, along with the associated files used to make it. It turns out that the utility html2ps does not choke on the XML in a rawml file like my web browser does, so there was no need to comment out the XML. What is also surprising to me is that the TOC in the PDF works. This is not meant to be a book with highlights and notes, but rather the highlights and notes shown in context.
The source book is the EPUB of The Humbugs of the World by P T Barnum from the Mobileread Library. I used kindlegen to make a dual mobi and used kindleunpack to extract the rawml and azw3, which I copied to a kindle and quickly made 9 highlights with bogus notes. Then I copied the azw3r and dumped the notes, which also gives the start and end of each highlight. Next I used the notes_insert.pl from the first post to modify the rawml, then html2ps and ps2pdf. You can search the PDF for '[HL]' or '[Note:' to find the highlights and notes. |
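To make the insertion step a bit more concrete, here is a minimal Python sketch of the general idea of splicing markers into text at known positions. It is not notes_insert.pl and does not claim to match its output: the function name `insert_markers`, the closing `[/HL]` marker, and the `(start, end, note)` tuple layout are assumptions for illustration only. Only the `[HL]` and `[Note:` strings come from the post above.

```python
def insert_markers(data: bytes, annotations):
    """Splice highlight and note markers into data (hypothetical helper).

    annotations is a list of (start, end, note) tuples, where start and end
    are byte positions (end assumed inclusive) and note is a str or None.
    """
    out = bytearray(data)
    # Work from the last annotation to the first so that inserting bytes
    # does not shift the positions of annotations still to be processed.
    for start, end, note in sorted(annotations, key=lambda a: a[0], reverse=True):
        tail = b"[/HL]"  # assumed closing marker, not taken from the real script
        if note:
            tail += b"[Note: " + note.encode("utf-8") + b"]"
        out[end + 1:end + 1] = tail
        out[start:start] = b"[HL]"
    return bytes(out)


if __name__ == "__main__":
    text = b"On a bright Monday in January a thousand people"
    print(insert_markers(text, [(0, 10, None), (30, 39, "big")]).decode())
```

Overlapping highlights are not handled here; they would need a rule for how markers nest.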
08-17-2019, 01:53 PM | #19 | |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
Quote:
These changes, along with an expanded README are in the latest release, v0.1.4, at https://github.com/jps-e/azw3r and the attachments azw3r.c.gz and azw3r.pl.gz have been updated in post #1 of this thread. |
|
08-17-2019, 05:57 PM | #20 | |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
Quote:
You have not said what Kindle format your books are in. If they are KF8 (azw3), then I think my scripts are in good enough shape for just about anyone to extract both notes and highlights as text as separate files and/or insert them into the text of the book for context. |
|
08-17-2019, 06:20 PM | #21 |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
KF8 notes and highlights embedded in text of book for context as PDF
When I first started playing with this I thought Calibre could not convert a kindleunpack rawml file to PDF, but when I tack an .html extension onto the input file name, Calibre makes a usable PDF, even if the XML has not been commented out. It has the added advantage over html2ps that the font is larger. (However, the Calibre generated PDF does not have a clickable TOC. Also, for me using xpdf to view it, clicking on a link internal to the PDF causes an empty web browser window to open.)
|
08-18-2019, 10:56 AM | #22 |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2019
Device: Kindle
|
Hi j.p.s,
First, I'd like to thank you for the work you've done so far. I was in the process of reverse engineering azw3r files and then found your project and it's been super helpful. I'm not sure if this is a known issue, but I tried using your tool to extract highlights (with no notes) from a book, and the highlight text it extracts is not correct. I suspect the rawml file I'm providing might not be the right one. I used KindleUnpack and it gave me 2 different rawml files: mobi7/book.rawml and mobi8/book.rawml. I tried running `azw3r -i book.azw3r -h -r book.rawml` with both of them and the extracted highlight text is incorrect. Any ideas? Thanks! |
08-18-2019, 12:18 PM | #23 |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
hi rzikaou,
I'm sorry to hear that it is not working for you. Can you post the exact sequence of commands you are using? It would also be good if you can pick some public domain book (preferably short with few or no images) in KF8 format (azw3) and make some highlights in it and post the output of the azw3r program here along with the book.azw3r file used to extract the highlights. Then I can try to reproduce your problem. |
08-18-2019, 02:39 PM | #24 |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2019
Device: Kindle
|
The document I'm using is an html document that I've converted to `epub`, then to `mobi` (using `kindlegen`) and then I e-mailed it to my Kindle with the special `@kindle.com` email which resulted in the `azw3` document that is on my kindle.
I've made the following test highlights on the first page of the article (without any notes): "On a bright Monday in January" "a thousand" "They packed themselves into a cheerful courtyard outside" But this doesn't seem to be what the tool is returning. I'm trying to remember how I generated the rawml because now running `kindleunpack` doesn't give me any `.rawml` files. |
08-18-2019, 04:10 PM | #25 |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
Thanks rzikaou for uploading your example. It turns out that was important rather than my suggestion to use a book. I've never understood why there is a 14 byte difference between Amazon's offset into the rawml and where the text actually is. It turns out that for your azw3 the offset is 166 bytes. I have added a -o option to the azw3r.c and azw3r.pl attached to the first post and made a new github release.
I don't know whether it is possible to tell in advance what the offset is. I had to get yours experimentally. I moved your azw3r and rawml files into the same directory so that the command would merely be way too long instead of impossibly long.
Code:
azw3r -h -o 166 -i "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWNa1bd4a78ed253ba5271d0cb7df407fda.azw3r" -r "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWN.rawml"
1259 1269 Highlight: 'a thousand '
1184 1220 Highlight: 'On a bright</span> Monday in January '
1462 1518 Highlight: 'They packed themselves into a cheerful courtyard outside ' |
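For anyone trying to follow what the offset does, here is a minimal Python sketch of how a start/end pair might be turned into text. It is not the azw3r code. It assumes the positions are byte offsets, that the offset is added to both positions, and that the text is UTF-8. The inclusive end is inferred from the output above, where 'a thousand ' is 11 characters long while 1269 - 1259 is 10. The name `extract_highlight` is made up for this example.

```python
def extract_highlight(data: bytes, start: int, end: int, offset: int = 0) -> str:
    """Return the text between start and end, shifted by offset (hypothetical helper).

    Assumptions: byte positions, inclusive end, offset added to both ends.
    """
    begin = start + offset
    stop = end + offset + 1  # +1 because the end position is treated as inclusive
    # Replace undecodable bytes rather than fail if a slice cuts a character in half.
    return data[begin:stop].decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = b"HHHHH" + b"hello world"  # 5 bytes of stand-in header, then text
    print(repr(extract_highlight(sample, 0, 4, offset=5)))
```

With a real file, `data` would be the contents of the rawml read in binary mode, and the offset would be whatever value makes the known highlights line up.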
08-18-2019, 06:03 PM | #26 |
Grand Sorcerer
Posts: 6,670
Karma: 86234809
Join Date: Nov 2011
Location: Charlottesville, VA
Device: Kindles
|
"kindleunpack -d" will produce a file named "assembled_text.dat" in the mobi8 folder containing a subset of the rawml corresponding to the actual book content (flow 0). I think you will find that the position number offsets are indexed into this data without any correction needed.
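One way to check this for a given file, or to find the correction for a file that does need one, is to search for the text of a known highlight and compare where it is found with the position reported for it. The Python sketch below is only an illustration under the assumption that positions are byte offsets into UTF-8 text; `estimate_offset` is a made-up name and not part of azw3r or KindleUnpack.

```python
def estimate_offset(data: bytes, known_text: str, reported_start: int):
    """Return candidate offsets, one per occurrence of known_text (hypothetical helper).

    Each candidate is (position where the text was found) - (reported start).
    A result of [0] means the reported position needs no correction.
    """
    needle = known_text.encode("utf-8")
    candidates = []
    pos = data.find(needle)
    while pos != -1:
        candidates.append(pos - reported_start)
        pos = data.find(needle, pos + 1)
    return candidates


if __name__ == "__main__":
    body = b"some text a thousand more"
    with_header = b"<head></head>" + body  # stand-in for extra leading tags
    print(estimate_offset(body, "a thousand", 10))
    print(estimate_offset(with_header, "a thousand", 10))
```

A short phrase can occur more than once, so comparing the candidates from two or three highlights and keeping the value they share is safer. A highlight that spans markup will not be found by a plain text search.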
|
08-18-2019, 07:46 PM | #27 |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2019
Device: Kindle
|
Thanks for debugging this j.p.s.
jhowell as far as I can tell, you are correct. Using the offsets against the "assembled_text.dat" gives the expected result! |
08-18-2019, 07:49 PM | #28 | |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
Quote:
rzikaou's rawml file has extra header tags not in the assembled_text.dat file. So the C azw3r and the perl azw3r.pl can be used as is with -r assembled_text.dat -o 0 Ditto. |
|
09-07-2019, 06:18 PM | #29 |
Grand Sorcerer
Posts: 5,421
Karma: 99236514
Join Date: Apr 2011
Device: pb360
|
New release, process KRDS JSON and use defaults assuming assembled_text.dat format
There is a new release at github that makes the default rawml offset 0 in the C and perl utilities, so kindleunpack -d should be used to make assembled_text.dat instead of kindleunpack -r to make <book>.rawml
Also, there is a new utility named krdsJSON2notes.pl that processes the <book>.json file produced by jhowell's KRDS parser https://www.mobileread.com/forums/sh...d.php?t=322172 into the same format used by notes_insert.pl to highlight and insert notes into a rawml file (assembled_text.dat) suitable for converting to PDF. So now human readable personal notes can be extracted from all Kindle books and personal highlights can be extracted from KF8 (azw3) and probably mobi books. The current latest release is attached as azw3r-0.1.7.zip to post #1 in this thread. |
10-09-2019, 05:17 PM | #30 | |
Junior Member
Posts: 9
Karma: 10
Join Date: Jul 2019
Device: Kindle PW 2
|
Quote:
Thanks. |
|