Old 07-27-2019, 01:52 PM   #1
j.p.s
azw3r highlight and note extraction info

I've figured out enough of the azw3r format to extract personal highlights, notes, and maybe bookmarks. (All strictly by inspection.) I've also written a C program to extract highlights and notes (in a text format possibly most suitable as an intermediate stage) and a perl script that uses the extracted highlights and notes to mark up the rawml for the book. azw3r.pl is a perl alternative to the C program which takes the same arguments and produces the same output. Both of these can now extract highlighted text from the book's rawml file. Both might also be used with yjr files from KFX books, but without the capability to extract highlighted text.

Since jhowell's KRDS parser krds.py https://www.mobileread.com/forums/sh...d.php?t=322172 is general and complete, I've put the details of my partial reverse engineering in spoiler tags.
Spoiler:

As I write this up, I see that the structures are saved AVL interval trees, which is meaningless to me, and the results of a web search don't look interesting. This particular file is a strange mix of binary and text. (Of course the notes are in text, but see the following.)

Each highlight begins (for my purposes) with the string "annotation.personal.highlight" followed by 4 bytes. The first byte is always 0x03 (^C); the next 3 bytes seem to give the length of the text string that follows, which holds the rawml byte offset of the beginning of the highlight. The same structure then repeats to give the byte offset of the end of the highlight, followed by a couple dozen bytes of (as far as I am concerned) junk.
Code:
annotation.personal.highlight^C^@^@^G1191325^C^@^@^G1191337^B^@^@^A...
                              3 0 0 7        3 0 0 7
((0*256) + 0)*256 + 7 = 7
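
For anyone who wants to poke at this without building the C program, here is a minimal Python sketch of the highlight layout described above. It is not the code from azw3r.c or azw3r.pl; the function names (read_offset, find_highlights) are made up for illustration, and it assumes the 3 length bytes are big-endian, as the arithmetic above suggests. Markers that are not followed by the expected 0x03 byte are simply skipped.

```python
MARKER = b"annotation.personal.highlight"

def read_offset(data, pos):
    """Read one 0x03-tagged, length-prefixed decimal string at pos.

    Returns (offset_value, position_after_field).
    Assumption: the 3 length bytes are big-endian.
    """
    if pos + 4 > len(data) or data[pos] != 0x03:
        raise ValueError("expected a 0x03 tag byte")
    length = int.from_bytes(data[pos + 1:pos + 4], "big")
    text = data[pos + 4:pos + 4 + length]
    if len(text) != length or not text.isdigit():
        raise ValueError("expected an ASCII decimal string")
    return int(text), pos + 4 + length

def find_highlights(data):
    """Return a list of (start, end) rawml byte offsets, one per highlight."""
    results = []
    pos = data.find(MARKER)
    while pos != -1:
        after = pos + len(MARKER)
        try:
            start, nxt = read_offset(data, after)
            end, nxt = read_offset(data, nxt)
            results.append((start, end))
        except ValueError:
            # Not laid out as described; resume searching after the marker.
            nxt = after
        pos = data.find(MARKER, nxt)
    return results

# The bytes from the dump above, with ^C = 0x03, ^@ = 0x00, ^G = 0x07.
sample = (b"annotation.personal.highlight"
          b"\x03\x00\x00\x071191325"
          b"\x03\x00\x00\x071191337"
          b"\x02\x00\x00\x01")
print(find_highlights(sample))
```

On a real file you would read the whole azw3r into memory in binary mode and pass the bytes to find_highlights.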

Personal notes are similar to highlights. They begin with the string "annotation.personal.note", followed by the rawml byte offset of the highlight associated with the note. This is followed by more "junk", then binary (only) length of the note, then the text of the note itself.
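
Since the offsets in both highlights and notes are byte offsets into the rawml, pulling out the highlighted text is just a byte slice. A rough Python sketch follows. The function name is hypothetical, and two things are assumptions on my part rather than facts from inspection: that the end offset is exclusive, and that the rawml is UTF-8.

```python
def highlighted_text(rawml, start, end):
    """Return the text between two rawml byte offsets.

    Assumptions: end is exclusive and the rawml is UTF-8 encoded.
    Slicing is done on bytes, not characters, because the offsets
    are byte offsets; undecodable bytes are replaced, not fatal.
    """
    return rawml[start:end].decode("utf-8", errors="replace")

# Tiny stand-in for a book's rawml.
rawml = b"<p>Call me Ishmael.</p>"
print(highlighted_text(rawml, 3, 19))
```

If the end offset turns out to be inclusive, the slice needs end + 1 instead.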

Bookmarks look similar to highlights, but I have not investigated.

The C code and perl scripts are in github at https://github.com/jps-e/azw3r and attached here along with a sed script to make the rawml viewable in a web browser.

ETA: The C and perl code have been updated.

ETA: New release attached as azw3r-0.1.7.zip to this post. See post #29 for details of added features.
Attached Files
File Type: gz notes_insert.pl.gz (492 Bytes, 871 views)
File Type: gz unxml.sed.gz (78 Bytes, 1009 views)
File Type: gz azw3r.pl.gz (822 Bytes, 852 views)
File Type: gz azw3r.c.gz (1.0 KB, 893 views)
File Type: zip azw3r-0.1.7.zip (4.2 KB, 749 views)

Last edited by j.p.s; 09-07-2019 at 06:25 PM. Reason: New release 0.1.7