Old 06-21-2013, 06:17 PM   #17
-axel-
Junior Member
Posts: 6
Karma: 3088
Join Date: Apr 2013
Device: Kindle Paperwhite
Kindle international

Quote:
Originally Posted by GRiker
@-axel-,
I fixed the Position/Location error, version 1.1.2 should work properly for you.
G
Hi,

Sorry, but it did not work for me. I took a deeper look at the Kindle clippings format and learned some Python along the way. I'll try to attach my version of Kindle.py, in which the core parsing is done in a separate file, ParseKindleMyClippingsTxt.py. (Actually, I attached a complete plugin zip file, so you can easily figure out which version I worked with, and maybe it even runs out of the box.)

You are very welcome to use it and integrate it into your plugin, but please resist the temptation to refactor ParseKindleMyClippingsTxt.py into your own style and interface conventions. That way, if I use it for my own project, we can both benefit from improvements.

It turned out that locale is worthless for a module that parses a multi-language file, because it is neither thread-safe nor really portable across operating systems: Windows uses different locale names than Linux, and Python does not supply an abstraction layer for that. But the killer problem was thread safety: the locale is a process-wide setting, so a module is not allowed to change it, not even for a short time.

I also had a look at the Calibre code you were using for date parsing. Funnily enough, that module contains a helper function that replaces French and German month names with their English equivalents. Inspired by this, I did something similar and implemented a simple multilingual parser. It works for the format my Paperwhite generates, for some examples of My Clippings.txt content I found via Google (most of them here on the forum), and for a few format variations I considered worth covering even without examples.
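To illustrate the idea of table-based month handling without touching locale, here is a minimal sketch. It is not the code from the attached ParseKindleMyClippingsTxt.py; the table MONTHS, the function parse_date and the sample date strings are made up for this example, and the table is deliberately incomplete.

```python
import re
from datetime import date

# Hypothetical, incomplete table: lower-case month name -> month number.
# Names shared between languages (e.g. "april", "mai") appear only once.
MONTHS = {
    # English
    "january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
    "june": 6, "july": 7, "august": 8, "september": 9, "october": 10,
    "november": 11, "december": 12,
    # German
    "januar": 1, "februar": 2, "märz": 3, "mai": 5, "juni": 6,
    "juli": 7, "oktober": 10, "dezember": 12,
    # French
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}


def parse_date(text):
    """Return a date for strings such as '21. Juni 2013', or None.

    Assumes the string contains one known month name, a four-digit
    year and a day number; word order and punctuation are ignored.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())  # alphabetic words only
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    month = next((MONTHS[w] for w in words if w in MONTHS), None)
    years = [n for n in numbers if n >= 1000]
    days = [n for n in numbers if 1 <= n <= 31]
    if month is None or not years or not days:
        return None
    return date(years[0], month, days[0])


print(parse_date("Freitag, 21. Juni 2013"))
print(parse_date("Friday, June 21, 2013"))
```

Because the lookup is a plain dictionary, it never calls setlocale and behaves the same on Windows and Linux. Adding a language means adding table rows, not code.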

The multilingual parsing concept is a mix of table-based and procedural encoding. I hope that this gives it the flexibility to be easily extended to cover future variations of Amazon's My Clippings.txt format, including additional languages.

Here are a few notes on limitations I observed for "My Clippings.txt":

It seems to be an append-only file. Changing the Kindle language does not change the existing content of the file; only new annotations are affected by the new language setting.

Also, if a note, mark or bookmark is deleted, it will *not* be deleted from the txt file. The same holds for edited notes.

Using the timestamp as a unique ID does not always work, even if the date can be read perfectly. I had a case where two annotations, made a few seconds apart, got the same timestamp.
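One possible way around the timestamp collision is to build the ID from several fields instead of the timestamp alone. The following is only a sketch under that assumption; the Annotation tuple, its field names and the sample values are hypothetical and not taken from the plugin.

```python
import hashlib
from collections import namedtuple

# Hypothetical record layout for one parsed annotation.
Annotation = namedtuple("Annotation", "title kind location timestamp text")


def annotation_key(a):
    """Build a composite key so that two annotations sharing a
    timestamp still differ if their location or text differs."""
    digest = hashlib.sha1(a.text.encode("utf-8")).hexdigest()[:12]
    return (a.title, a.kind, a.location, a.timestamp, digest)


first = Annotation("Some Book", "highlight", "100-101",
                   "2013-06-21 18:17:00", "first passage")
second = Annotation("Some Book", "highlight", "105-106",
                    "2013-06-21 18:17:00", "second passage")

print(annotation_key(first) != annotation_key(second))
```

This still cannot tell apart two truly identical entries, and an edited note gets a new key because its text changed.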

The format of My Clippings.txt is ambiguous. For example, I can add a note that looks like the separator line "==========". You may think that is a pretty academic case, but consider that My Clippings.txt is itself a Kindle document: you can open it and highlight parts of it. With a bit of bad luck, the highlight will even be indistinguishable from a separate entry in My Clippings.txt. I gave up on that last case, but I tried to cover the simpler cases with my parser.
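To make the ambiguity concrete, here is a small sketch of a naive splitter and one possible repair heuristic. It is not the logic of my parser. The heuristic assumes that every real entry has a title line followed by a metadata line starting with "- "; that assumption, the function names and the sample text are only for illustration.

```python
SEPARATOR = "=========="


def split_entries(text):
    """Naively cut the file at every line that equals the separator."""
    entries, current = [], []
    for line in text.splitlines():
        if line.strip() == SEPARATOR:
            entries.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


def looks_like_entry(chunk):
    """Assumption: line 1 is a title, line 2 is metadata starting with '- '."""
    lines = chunk.split("\n")
    return len(lines) >= 2 and lines[1].startswith("- ")


def merge_fragments(chunks):
    """Glue a chunk back onto its predecessor if it cannot be an entry."""
    merged = []
    for chunk in chunks:
        if merged and not looks_like_entry(chunk):
            merged[-1] = merged[-1] + "\n" + SEPARATOR + "\n" + chunk
        else:
            merged.append(chunk)
    return merged


# One note whose text contains a line that looks like the separator.
sample = "Book A\n- note meta\n\nfirst part\n==========\nsecond part\n==========\n"

print(len(split_entries(sample)))
print(len(merge_fragments(split_entries(sample))))
```

The naive split reports two entries for this single note, and the merge step restores it to one. A highlight taken from My Clippings.txt itself can contain a valid title and metadata line after the separator, so it would pass looks_like_entry and defeat this heuristic.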

Not having such brain-twisting ambiguities in log files was one of the major reasons XML was invented. Which brings me to my last remark: there is also an XML file, system/userannotlog, in the hidden but readable system folder of the Kindle, which contains a log of all annotation operations. From this file I can figure out deletions, edits, etc. Timestamps are in an easy web format, with no language translations involved. Locations are in full resolution, i.e. they identify the start and end character of each highlight.

So that would be an interesting alternative, but not a perfect one. The log does not contain the highlighted text itself, and I have no information on how stable the format has been or will be over the years. For example, Google found me a file named userannotlog.0, which indicates that there is or was some splitting strategy or naming convention to prevent the file from becoming too large. Does anyone know more?
Attached Files
File Type: zip Annotations.zip (598.9 KB, 435 views)