extract node text at epubcfi/last_read_position

wwfn · 06-18-2022, 10:22 PM

The epubcfi stored in last_read_positions is exciting metadata! I'm hoping to play around with it, starting with extracting the node/text at the last read position. Is this reasonable/possible with code already in calibre?

I think I'm stuck on building concatenated HTML from an epub container. I imagine there is already a container method to generate this, but I haven't found it yet. Or maybe I'm approaching it all wrong. Any pointers? (initial attempt below)

If that's possible, I'd also like to generate a fragment identifier given a node of an epub tree. Is this something that can be done from Python? That code looks like it's in the pyj files (?)

Thanks!

Code:

import init_calibre
import calibre
from calibre.ebooks.oeb.polish.container import get_container
from calibre.ebooks.epub.cfi.parse import parser as cfi_parser, decode_cfi
from calibre.ebooks.oeb.polish.parsing import parse as parse_book


# select path from book where id = 296;
fname_epub = '/path/to/my/file296.epub'
# select cfi from last_read_positions where book = 296;
cfi_str = '/36/2/4[x9780525538332_EPUB-16]/2/6/1:46'
container = get_container(fname_epub, tweak_mode=False)
cfi = cfi_parser().parse_path(cfi_str)

# calibre/gui2/tweak_book/boss.py uses editor.get_raw_data()
# maybe combine container.mime_map and then calibre.ebooks.oeb.polish.parsing?
raw_data = ...  # ? not sure how to build this from the container
root = parse_book(
    raw_data, decoder=lambda x: x.decode('utf-8'),
    line_numbers=True, linenumber_attribute='data-lnum')

node = decode_cfi(root, cfi)

kovidgoyal · 06-18-2022, 11:41 PM

There is no pre-existing Python code to do that; it is done in the viewer, in JavaScript, via cfi.pyj. You would need to write that yourself.

First, get a container object from the path to the epub, then get the individual file using spine_index. Get the root of that file using Container.parsed(). Then translate the cfi to a node, which should be pretty easy if you ignore the text part of it and just stop at the containing node.
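
The steps above can be sketched roughly as follows. This is a minimal sketch, not calibre's own code: the helper names split_cfi and containing_element are made up for illustration, and the standard library's ElementTree stands in for the root that Container.parsed() would return, so the snippet runs without calibre. It assumes, going by the sample CFI in the first post, that the first step encodes the spine position as 2 * (spine_index + 1), that the next step selects the root element of the file, and that an odd-numbered step is a text node where the walk should stop.

```python
import re
import xml.etree.ElementTree as ET

# One CFI step: "/<number>" with an optional "[id]" assertion.
# Text offsets such as ":46" are deliberately not captured.
STEP_RE = re.compile(r'/(\d+)(?:\[([^\]]*)\])?')


def split_cfi(cfi):
    """Return (spine_index, steps); steps is a list of (num, id_or_None).

    Assumption: the first step is 2 * (spine_index + 1).
    """
    if cfi.startswith('epubcfi(') and cfi.endswith(')'):
        cfi = cfi[len('epubcfi('):-1]
    steps = [(int(num), node_id or None) for num, node_id in STEP_RE.findall(cfi)]
    if not steps:
        raise ValueError('no steps found in CFI: %r' % cfi)
    return steps[0][0] // 2 - 1, steps[1:]


def containing_element(root, steps):
    """Walk the even (element) steps and stop at the first odd (text) step.

    Assumption: the first step is relative to the document, so /2 is the
    root element itself. Returns None if a step does not resolve.
    """
    node = None
    candidates = [root]
    for num, node_id in steps:
        if num % 2 == 1:  # text node: stay on the containing element
            break
        index = num // 2 - 1
        if not 0 <= index < len(candidates):
            return None
        node = candidates[index]
        if node_id and node.get('id') != node_id:
            # The id assertion disagrees with the index; prefer the id if it exists.
            by_id = [el for el in root.iter() if el.get('id') == node_id]
            if by_id:
                node = by_id[0]
        # Only element children count; comments etc. have a non-string tag.
        candidates = [c for c in node if isinstance(c.tag, str)]
    return node


if __name__ == '__main__':
    sample = ET.fromstring(
        '<html><head><title>t</title></head>'
        '<body id="b"><div><p>one</p><p>two</p><p>Hello <i>world</i></p></div></body></html>')
    spine_index, steps = split_cfi('/36/2/4[b]/2/6/1:46')
    element = containing_element(sample, steps)
    print(spine_index, ''.join(element.itertext()))
```

For the CFI in the first post, the leading /36 would correspond to spine index 17 under that assumption. Inside calibre, the root would come from Container.parsed() on the matching spine file rather than from ET.fromstring(), and the rest of the walk should carry over since lxml elements support the same iteration and get() calls used here.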
