06-18-2022, 10:22 PM | #1 |
Junior Member
Posts: 1
Karma: 10
Join Date: Jun 2022
Device: emacs
|
extract node text at epubcfi/last_read_position
The epubcfi in last_read_positions is exciting metadata! I'm hoping to play around with it, starting by extracting the node/text at the identifier/last read position. Is this reasonable/possible with code already in calibre?
I think I'm stuck on building concatenated HTML from an epub container. I imagine there is already a container method to generate this, but I haven't found it yet. Or maybe I'm approaching it all wrong. Any pointers? (Initial attempt below.)
If that's possible, I'd also like to generate a fragment identifier given a node of an epub tree. Is this something that can be done from Python? That code looks like it's in the .pyj files (?)
Thanks!
Code:
import init_calibre
import calibre
from calibre.ebooks.oeb.polish.container import get_container
from calibre.ebooks.epub.cfi.parse import parser as cfi_parser, decode_cfi
from calibre.ebooks.oeb.polish.parsing import parse as parse_book

# select path from book where id = 296;
fname_epub = '/path/to/my/file296.epub'
# select cfi from last_read_positions where book = 296;
cfi_str = '/36/2/4[x9780525538332_EPUB-16]/2/6/1:46'

container = get_container(fname_epub, tweak_mode=False)
cfi = cfi_parser().parse_path(cfi_str)

# calibre/gui2/tweak_book/boss.py uses editor.get_raw_data()
# maybe combine container.mime_map and then calibre.ebooks.oeb.polish.parsing?
raw_data = .... #?
root = parse_book(
    raw_data, decoder=lambda x: x.decode('utf-8'),
    line_numbers=True, linenumber_attribute='data-lnum')
node = decode_cfi(root, cfi)
|
06-18-2022, 11:41 PM | #2 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
There is no pre-existing Python code to do that, as it is done in the viewer, in JavaScript via cfi.pyj. You would need to write that yourself.
First, get a container object from the path to the epub, then get the individual file using spine_index. Get its root using Container.parsed(). Then translate the CFI to a node, which should be pretty easy if you ignore the text part of it and just stop at the containing node. |
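The last step described above (walking the CFI steps down to the containing element and ignoring the text part) could be sketched roughly as follows. This is a minimal illustration, not calibre code: `split_cfi`, `element_children`, and `resolve_element` are hypothetical helper names, the sample document is invented, and the stdlib `xml.etree.ElementTree` stands in for the tree that `Container.parsed()` would return. It also assumes the first step encodes the spine position as `2 * (spine_index + 1)` and that the next step (`/2`) addresses the root `html` element; both assumptions should be checked against cfi.pyj.

```python
import re
import xml.etree.ElementTree as ET

# One CFI step: "/<number>" with an optional "[id]" assertion.
# Escaped characters inside the brackets are not handled in this sketch.
STEP_RE = re.compile(r'/(\d+)(?:\[([^\]]*)\])?')


def split_cfi(cfi_path):
    """Return (spine_index, in_document_steps) for a path like '/36/2/4[id]/2/6/1:46'."""
    steps = [(int(num), ident) for num, ident in STEP_RE.findall(cfi_path)]
    if not steps:
        raise ValueError('no steps found in CFI path: %r' % cfi_path)
    # Assumption: the first step addresses the spine item as 2 * (index + 1).
    spine_index = steps[0][0] // 2 - 1
    return spine_index, steps[1:]


def element_children(elem):
    # Keep element children only; lxml exposes comments and processing
    # instructions as children whose tag is not a string.
    return [child for child in elem if isinstance(child.tag, str)]


def resolve_element(root, steps):
    """Follow even (element) steps from the root; stop at the first text step."""
    if not steps or steps[0][0] != 2:
        raise ValueError('expected the first in-document step to be /2 (the root element)')
    node = root
    for num, ident in steps[1:]:
        if num % 2 or num < 2:
            # Odd steps address text nodes: stop at the containing element.
            break
        children = element_children(node)
        index = num // 2 - 1  # step /2 is the first element child, /4 the second, ...
        if index >= len(children):
            raise LookupError('step /%d has no matching child element' % num)
        node = children[index]
        if ident and node.get('id') != ident:
            raise LookupError('id assertion %r does not match' % ident)
    return node


# Invented sample document, shaped to match the CFI from the first post.
doc = (
    '<html><head><title>t</title></head>'
    '<body id="x9780525538332_EPUB-16"><div>'
    '<p>one</p><p>two</p><p>three <i>x</i></p>'
    '</div></body></html>'
)
root = ET.fromstring(doc)
spine_index, steps = split_cfi('/36/2/4[x9780525538332_EPUB-16]/2/6/1:46')
node = resolve_element(root, steps)
print(spine_index, node.tag, repr(node.text))
```

With calibre, `root` would presumably be the parsed tree of the spine item selected by `spine_index` rather than a string parsed here. The child-walking logic only relies on iterating over children and `get('id')`, so it should carry over to an lxml tree, though that has not been verified against a real book.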