#1
Junior Member
Posts: 1
Karma: 10
Join Date: Jun 2022
Device: emacs
The epubcfi stored in last_read_positions is exciting metadata! I'd like to play around with it, starting with extracting the node/text at the last read position. Is this reasonable/possible with code already in calibre?

I think I'm stuck on building concatenated HTML from an EPUB container. I imagine there is already a container method that generates this, but I haven't found it yet. Or maybe I'm approaching it all wrong. Any pointers? (Initial attempt below.)

If that's possible, I'd also like to generate a fragment identifier given a node of an EPUB tree. Can this be done from Python? That code looks like it lives in the .pyj files.

Thanks!

Code:
import init_calibre
import calibre
from calibre.ebooks.oeb.polish.container import get_container
from calibre.ebooks.epub.cfi.parse import parser as cfi_parser, decode_cfi
from calibre.ebooks.oeb.polish.parsing import parse as parse_book
# select path from book where id = 296;
fname_epub = '/path/to/my/file296.epub'
# select cfi from last_read_positions where book = 296;
cfi_str = '/36/2/4[x9780525538332_EPUB-16]/2/6/1:46'
container = get_container(fname_epub, tweak_mode=False)
cfi = cfi_parser().parse_path(cfi_str)
# calibre/gui2/tweak_book/boss.py uses editor.get_raw_data()
# maybe combine container.mime_map and then calibre.ebooks.oeb.polish.parsing?
raw_data = ...  # ?
root = parse_book(
    raw_data, decoder=lambda x: x.decode('utf-8'),
    line_numbers=True, linenumber_attribute='data-lnum')
node = decode_cfi(root, cfi)
#2
creator of calibre
Posts: 45,633
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There is no pre-existing Python code to do that; it is done in the viewer, in JavaScript, via cfi.pyj. You would need to write it yourself.
First, get a container object from the path to the EPUB, then get the individual file using spine_index. Get its root using Container.parsed(). Then translate the CFI to a node, which should be pretty easy if you ignore the text part of it and just stop at the containing node.
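The steps above can be sketched roughly as follows. This is only an illustration, not calibre code: the helper names (`parse_steps`, `split_spine_step`, `containing_element`) are made up for this example, and the sample document is parsed with the standard library's `xml.etree.ElementTree` so the sketch runs without calibre. In calibre the root would instead come from Container.parsed(), as described above. It also assumes that the first step of the stored CFI encodes the spine item as 2 * (index + 1), and that the remaining steps start at the document node, so `/2` is the `<html>` element. Both assumptions are worth checking against cfi.pyj.

```python
import re
import xml.etree.ElementTree as ET

# One CFI step: "/<number>" with an optional "[id]" assertion.
STEP_RE = re.compile(r'/(\d+)(?:\[([^\]]*)\])?')


def parse_steps(cfi):
    """Split a CFI path into [(number, id_or_None), ...] plus the unparsed tail."""
    if cfi.startswith('epubcfi(') and cfi.endswith(')'):
        cfi = cfi[len('epubcfi('):-1]
    steps, pos = [], 0
    while True:
        m = STEP_RE.match(cfi, pos)
        if m is None:
            break
        steps.append((int(m.group(1)), m.group(2)))
        pos = m.end()
    return steps, cfi[pos:]


def split_spine_step(steps):
    """ASSUMPTION: the first step encodes the spine item as 2 * (index + 1)."""
    first, _ = steps[0]
    return first // 2 - 1, steps[1:]


def containing_element(root, doc_steps):
    """Follow element steps from the document; stop at the first text step."""
    node = None  # None stands for the document node; its only element child is root
    for number, _id in doc_steps:
        if number % 2:  # odd step: a text node, so the current element contains it
            break
        if node is None:
            children = [root]
        else:
            # Skip comments and processing instructions (their tag is not a string).
            children = [c for c in node if isinstance(c.tag, str)]
        index = number // 2 - 1  # even step /N is the (N/2)-th child element
        if not 0 <= index < len(children):
            raise ValueError(f'step /{number} does not match the document')
        node = children[index]
    return node


# Hypothetical stand-in for one spine file of the book.
SAMPLE = (
    '<html><head><title>t</title></head>'
    '<body id="ch1"><div><p>one</p><p>two</p><p>three</p></div></body></html>'
)

steps, tail = parse_steps('/36/2/4[ch1]/2/6/1:2')
spine_index, doc_steps = split_spine_step(steps)
node = containing_element(ET.fromstring(SAMPLE), doc_steps)
print(spine_index, node.tag, node.text, tail)
```

The `[id]` assertions are parsed but not checked here; a more careful version could compare them against the element's `id` attribute and fall back to an id lookup when the child counts no longer match. The tail (`:2` in the example) is the character offset into the text node, which this sketch deliberately leaves unresolved.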