MobileRead Forums - View Single Post

jhowell · 08-26-2023, 12:32 PM

Quote:

Originally Posted by willemml

Hello, I am trying to write a script that will let me extract PDFs from the Kindle in their annotated form without contacting amazon. Do the annotation "notebooks" (each write-on PDF on the Kindle seems to have a corresponding notebook folder that when put through KFX Input gives me the SVG of all my annotations.)

The notebook folder name is based on the metadata of the KFX book being annotated. It is composed of the content_id, the cde_content_type, and the string "notebook", all separated by "!!". For example: "EBEA035E6DB444159EF42DA7E5EEF8F6!!PDOC!!notebook".
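As a rough sketch of that naming rule in Python (the function name is just for illustration, and how you obtain content_id and cde_content_type from the book's metadata is left out):

```python
def notebook_folder_name(content_id: str, cde_content_type: str) -> str:
    """Build the annotation notebook folder name for a KFX book.

    The three parts are joined with "!!" as described above.
    """
    return "!!".join([content_id, cde_content_type, "notebook"])


if __name__ == "__main__":
    # Values taken from the example folder name above.
    print(notebook_folder_name("EBEA035E6DB444159EF42DA7E5EEF8F6", "PDOC"))
```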

Quote:

Originally Posted by willemml

Do these "notebooks" contain info on which SVG goes with which page of each PDF? (Even if it is not in the corresponding epub file and only tells me which pen traces go on each page in the KFX nbk file.) If so where is this data stored?

The EPUB produced by KFX Input from a Scribe annotation notebook contains one XHTML file per annotation, each linking to an SVG image. The connection between a book page (in KFX format produced from PDF) and the associated annotation notebook page is provided by a file with the extension .yjr found in the .sdr folder associated with the KFX book. That file can be converted to JSON using KRDS - A parser for Kindle reader data store files.

Each annotated page will have an entry such as:

Code:

    "annotation.cache.object": {
        "annotation.personal.handwritten_note": [
            {
                "startPosition": "201.0:13974",
                "endPosition": "201.0:13974",
                "creationTime": "2023-08-26T09:09:38.130000",
                "lastModificationTime": "2023-08-26T09:09:38.130000",
                "template": "0\ufffc0",
                "handwritten_note_nbk_ref": "crEq-GhRTSa63nk5j3KC6Qw0"
            }
        ]
    },

The startPosition is a KFX position number that corresponds to the book page being annotated. To find the page number, take the part of the position number following the colon and look it up in the content JSON file that the CLI of the KFX Input plugin can optionally produce. (The number will match a type 2 entry. Count type 2 entries in the file to find the page number.)
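A minimal sketch of that lookup in Python follows. Note the assumptions: I am treating the content JSON as a list of entries, each a dict with "type" and "position" keys, and I am counting pages from 1. Those key names and the 1-based count are guesses for illustration, so check them against the file you actually get and pass different key names if needed.

```python
def page_number_for_position(entries, position,
                             type_key="type", position_key="position"):
    """Return the page number for a position, or None if no match.

    entries: list of dicts loaded from the content JSON (assumed layout).
    position: integer taken from after the colon in startPosition.
    type_key / position_key: assumed key names, not confirmed.
    """
    page = 0
    for entry in entries:
        if entry.get(type_key) != 2:
            continue  # only type 2 entries are counted as pages
        page += 1
        if entry.get(position_key) == position:
            return page
    return None


if __name__ == "__main__":
    # Made-up entries, only to show the counting.
    sample = [
        {"type": 1, "position": 0},
        {"type": 2, "position": 100},
        {"type": 2, "position": 13974},
    ]
    print(page_number_for_position(sample, 13974))
```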

The handwritten_note_nbk_ref is the KFX section ID of the associated annotation page in the notebook. Currently those IDs are not reflected in the EPUB generated by the KFX Input plugin for an annotation notebook. I will update the plugin to include this data in the EPUB so that these can be matched.
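For a script, pulling the position and notebook reference out of the KRDS JSON could look something like the sketch below. It assumes the JSON has already been loaded into a dict (for example with json.load) and that "annotation.cache.object" sits at the top level of that dict, as in the fragment above. The function name is only illustrative.

```python
def handwritten_note_refs(krds_data):
    """Return a list of (position, nbk_ref) pairs, one per annotated page.

    position is the integer following the colon in startPosition.
    nbk_ref is the handwritten_note_nbk_ref string.
    """
    cache = krds_data.get("annotation.cache.object", {})
    notes = cache.get("annotation.personal.handwritten_note", [])
    pairs = []
    for note in notes:
        # "201.0:13974" -> keep the part after the colon
        _, _, after_colon = note["startPosition"].partition(":")
        pairs.append((int(after_colon), note["handwritten_note_nbk_ref"]))
    return pairs


if __name__ == "__main__":
    # Same data as the example entry above.
    data = {
        "annotation.cache.object": {
            "annotation.personal.handwritten_note": [
                {
                    "startPosition": "201.0:13974",
                    "endPosition": "201.0:13974",
                    "handwritten_note_nbk_ref": "crEq-GhRTSa63nk5j3KC6Qw0",
                }
            ]
        }
    }
    print(handwritten_note_refs(data))
```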

The margins of the PDF page may have been trimmed during conversion to KFX format for delivery to the Scribe. Also, the SVG produced will have the aspect ratio of the Scribe screen, which might not match the PDF page. Because of this, some image manipulation may be needed to properly overlay the SVG image onto the original PDF page.
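As a possible starting point for that manipulation, here is a small Python sketch that computes a uniform scale and centering offset for placing an SVG of one size onto a page of another. Scaling uniformly and centering are my assumptions here, not something taken from the Scribe or the plugin. It does not account for trimmed margins, so expect to adjust the offsets by hand.

```python
def fit_overlay(svg_w, svg_h, page_w, page_h):
    """Return (scale, dx, dy) to fit the SVG inside the page.

    scale preserves the SVG aspect ratio; dx and dy center the
    scaled SVG on the page. Trimmed margins are NOT handled.
    """
    scale = min(page_w / svg_w, page_h / svg_h)
    dx = (page_w - svg_w * scale) / 2
    dy = (page_h - svg_h * scale) / 2
    return scale, dx, dy


if __name__ == "__main__":
    # Made-up dimensions, only to show the arithmetic.
    print(fit_overlay(100, 200, 300, 300))
```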