06-18-2022, 10:22 PM   #1
wwfn
extract node text at epubcfi/last_read_position

The epubcfi stored in last_read_positions is exciting metadata! I'm hoping to play around with it, starting by extracting the node/text at the last read position it identifies. Is this reasonable/possible with code already in calibre?

I think I'm stuck on building concatenated HTML from an EPUB container. I imagine there is already a container method to generate this, but I haven't found it yet. Or maybe I'm approaching it all wrong. Any pointers? (Initial attempt below.)

If that's possible, I'd also like to generate a fragment identifier given a node of an EPUB tree. Can this be done from Python? That code looks like it's in the .pyj files (?)

Thanks!

Code:
import init_calibre
import calibre
from calibre.ebooks.oeb.polish.container import get_container
from calibre.ebooks.epub.cfi.parse import parser as cfi_parser, decode_cfi
from calibre.ebooks.oeb.polish.parsing import parse as parse_book


# select path from book where id = 296;
fname_epub = '/path/to/my/file296.epub'
# select cfi from last_read_positions where book = 296;
cfi_str = '/36/2/4[x9780525538332_EPUB-16]/2/6/1:46'
container = get_container(fname_epub, tweak_mode=False)
cfi = cfi_parser().parse_path(cfi_str)

# calibre/gui2/tweak_book/boss.py uses editor.get_raw_data()
# maybe combine container.mime_map and then calibre.ebooks.oeb.polish.parsing?
raw_data = ...  # ? placeholder: this is the part I'm missing
root = parse_book(
    raw_data, decoder=lambda x: x.decode('utf-8'),
    line_numbers=True, linenumber_attribute='data-lnum')

node = decode_cfi(root, cfi)
06-18-2022, 11:41 PM   #2
kovidgoyal
creator of calibre
There is no pre-existing Python code to do that; it is done in the viewer, in JavaScript, via cfi.pyj. You would need to write it yourself.

First, get a container object from the path to the EPUB, then get the individual file using spine_index. Get its root using Container.parsed(). Then translate the CFI to a node, which should be pretty easy if you ignore the text part of it and just stop at the containing node.
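
The "translate the CFI to a node" step can be sketched roughly as below. This is only an illustration, not calibre code: it uses the standard library's ElementTree on a made-up sample document so that it runs on its own, and the helper names (split_cfi, spine_index_and_steps, containing_element) are hypothetical. It assumes that the first step of the stored CFI encodes the spine position as 2 * (spine_index + 1), and that the remaining steps are counted from the document node, so the first even step selects the root element. Even steps select child elements, an [id] assertion is preferred when an element with that id exists, and the walk stops at the first odd (text node) step, ignoring the :offset part. In calibre, the root would presumably come from Container.parsed() for the spine item at that index; that part is not exercised here.

```python
import re
import xml.etree.ElementTree as ET

# One CFI step: "/<number>" with an optional "[id]" assertion.
STEP_RE = re.compile(r'/(\d+)(?:\[([^\]]*)\])?')


def split_cfi(cfi):
    """Return [(number, id_or_None), ...]; anything after the steps (":46") is ignored."""
    if cfi.startswith('epubcfi(') and cfi.endswith(')'):
        cfi = cfi[len('epubcfi('):-1]
    steps, pos = [], 0
    while True:
        m = STEP_RE.match(cfi, pos)
        if m is None:
            break
        steps.append((int(m.group(1)), m.group(2)))
        pos = m.end()
    return steps


def spine_index_and_steps(cfi):
    """Split off the leading spine step (assumed to be 2 * (spine_index + 1))."""
    steps = split_cfi(cfi)
    if not steps:
        raise ValueError('no steps found in CFI: %r' % cfi)
    return steps[0][0] // 2 - 1, steps[1:]


def containing_element(root, steps):
    """Follow element steps from a virtual document node; stop at the first text step."""
    children = [root]  # the document node has the root element as its only child
    node = None
    for num, node_id in steps:
        if num % 2:  # odd step: a text node, so the current element contains it
            break
        target = None
        if node_id:  # prefer the id assertion when an element with that id exists
            target = next((el for el in root.iter() if el.get('id') == node_id), None)
        if target is None:
            index = num // 2 - 1
            if not 0 <= index < len(children):
                return None
            target = children[index]
        node = target
        children = list(node)
    return node


# Made-up stand-in for one parsed spine item.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>\n'
    '<body id="x9780525538332_EPUB-16"><div><p>one</p><p>two</p><p>three</p></div>'
    '</body></html>'
)

cfi_str = '/36/2/4[x9780525538332_EPUB-16]/2/6/1:46'
spine_index, steps = spine_index_and_steps(cfi_str)
node = containing_element(ET.fromstring(SAMPLE), steps)
print(spine_index, node.text)
```

With an lxml tree, comments and processing instructions also show up as children, so the child list would need to be restricted to elements (for example with iterchildren('*')) before indexing.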