Old 07-01-2010, 11:52 AM   #15
Man Eating Duck
Addict
Posts: 254
Karma: 69786
Join Date: May 2006
Location: Oslo, Norway
Device: Kobo Aura, Sony PRS-650
Quote:
Originally Posted by thingumybob
... I was hoping calibre would permit it by now? If not I will do it myself if somebody can lead me to where the file/s to decrypt are hidden! Thanks.
I actually had a look at this earlier, as I'd also like this feature. While I'm not a programmer by trade, I have fairly extensive experience with scripting languages, although not so much with Python.

Sony seems to use three different schemes for saving annotations (I have only looked at epub files; other formats may use a different syntax, since a PDF, after all, does not conform to an XML structure):

* Freehand scribbles: Saved as SVG, plus an entry in a .annot file that indicates the position relative to the page. Saved, for instance, in \database\markup\database\media\books\Miguel de Cervantes\The Story of Don Quixote\The Story of Don Quixote - Miguel de Cervantes_45.epub\1277997910049.445.svg on the reader.

* Highlighted text: Entries indicate the start and end of a "marked" section. The reference includes the filename, but a highlight that starts and ends in different xhtml files appears not to be supported (the annotation won't show on the reader).

* Bookmarks: As for highlighted text, but the position indicates a single character only (probably the first character of the relevant page).

The positions are saved in a .annot file which looks suspiciously like an XML file, for instance \Digital Editions\Annotations\database\media\books\Miguel de Cervantes\The Story of Don Quixote\The Story of Don Quixote - Miguel de Cervantes_45.epub.annot

The thing is, the marked text is not itself contained in the XML; it needs to be extracted from the files in the epub. An example reference to a highlight is:

<target>
<fragment start="29468/0.html#point(/1/4/1/1)" end="29468/0.html#point(/1/4/1/1:27)"/>
</target>

When I examined the xpointer-ish entry, it didn't seem to be valid for the indicated file; it's probably a reference to an internal rendering of the text. All points seem to start with "/1/4/....." as indicated above, but when I try to look this entry up via xpointer in the file, the reference doesn't lead me to the correct text. For the example it's easy (it's the very first 27 characters of the body), but "29468/0.html#point(/1/4/34/1)", for instance, does not refer to the 34th xhtml element of the body. It seems to ignore certain tags when counting, but I don't know which ones (again, it probably refers to some internal rendering).
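
Even though resolving the path is the unsolved part, splitting such a reference into its pieces is mechanical. The following is only a rough sketch in Python, based solely on the single example above: the function names parse_point and parse_fragments are made up for illustration, a missing ":N" suffix is assumed to mean character offset 0, and real .annot files may contain namespaces or attributes not shown here.

```python
import re
import xml.etree.ElementTree as ET

# Pattern inferred from the one example in this post, e.g.
# "29468/0.html#point(/1/4/1/1:27)". Other variants may exist.
POINT_RE = re.compile(
    r"^(?P<href>[^#]+)#point\((?P<path>(?:/\d+)+)(?::(?P<offset>\d+))?\)$"
)


def parse_point(ref):
    """Split a reference into (file inside the epub, path steps, char offset)."""
    match = POINT_RE.match(ref)
    if match is None:
        raise ValueError("unrecognised reference: %r" % ref)
    steps = [int(step) for step in match.group("path").split("/") if step]
    # Assumption: no ":N" suffix means offset 0 (start of the node).
    offset = int(match.group("offset")) if match.group("offset") else 0
    return match.group("href"), steps, offset


def parse_fragments(xml_text):
    """Return a list of (start, end) tuples for every <fragment> element."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter():
        # Strip a possible "{namespace}" prefix before comparing tag names.
        if element.tag.rsplit("}", 1)[-1] == "fragment":
            result.append(
                (parse_point(element.get("start")), parse_point(element.get("end")))
            )
    return result
```

Fed the <target> snippet above, this yields the file name 29468/0.html, the steps 1, 4, 1, 1 and the offsets 0 and 27. What the steps actually count is still the open question.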

In addition, I'm not familiar with how I could access the relevant text in an epub from code within Calibre, nor how to handle the content with the Calibre data structures if I did manage to extract it. If a more accomplished Python coder would like to try reverse-engineering it, I'll help in any way that I can.
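
Outside of Calibre, at least, getting at the text is not hard, since an epub is an ordinary ZIP archive. Here is a minimal sketch using only the Python standard library; it does not use any Calibre API, the function name body_text is hypothetical, and it assumes the content file is well-formed XML (a file using HTML-only entities such as &nbsp; would make the parser fail).

```python
import zipfile
import xml.etree.ElementTree as ET


def body_text(epub, href):
    """Return the plain text of <body> for one content file in an epub.

    epub may be a path or a file-like object; href is the file part of a
    reference, e.g. "29468/0.html".
    """
    with zipfile.ZipFile(epub) as archive:
        # The file may sit in a subfolder of the archive, so match on suffix.
        names = [
            name for name in archive.namelist()
            if name == href or name.endswith("/" + href)
        ]
        if not names:
            raise KeyError("no file matching %r in the epub" % href)
        root = ET.fromstring(archive.read(names[0]))
    for element in root.iter():
        # Ignore the XHTML namespace prefix when looking for <body>.
        if element.tag.rsplit("}", 1)[-1] == "body":
            return "".join(element.itertext())
    raise ValueError("no <body> element found in %r" % names[0])
```

For the easy example, body_text(path, "29468/0.html")[:27] should then correspond to the highlighted text, give or take leading whitespace in the markup. It does nothing to solve the general path-counting problem.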

Edit: I also found a file, \database\cache\cacheExt.xml, which seems to contain references in a more readable format. It also stores up to a hundred characters of text from highlights, although it's possible to highlight more than that on the reader. I suspect Sony truncates the text to avoid issues with copyright holders. Getting at these would definitely be an improvement.

Last edited by Man Eating Duck; 07-01-2010 at 12:15 PM.