Old 02-15-2023, 09:23 AM   #4
isarl
Instead of using Calibre's objects, I find it simplest to use the Python library ebooklib. Calibre's container types work with exact MIME types, whereas ebooklib simply lets me ask for all ITEM_DOCUMENTs in an ebook. Here is some sample code I wrote that demonstrates using it to read ebook contents:

Code:
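import ebooklib
from ebooklib import epub  # "import ebooklib" alone does not load the epub submodule
from lxml import etree     # likewise, lxml.etree has to be imported explicitly

book = epub.read_epub("path/to/book.epub")
# collect every content document (the XHTML files) in the book
docs = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
# beware non-UTF8 content! E.g. you might need to .decode("latin1"), or some other encoding, instead.
doctree = etree.fromstring(docs[0].get_body_content().decode())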
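On the encoding caveat in the comment above: one way to cope is to try a short list of encodings in order. This is only a sketch; the helper name `decode_markup` and the fallback order are my own assumptions, not part of ebooklib, and a wrong guess can still give you mojibake rather than an error.

```python
def decode_markup(raw: bytes, fallbacks=("utf-8", "cp1252", "latin-1")) -> str:
    """Decode raw bytes with the first encoding in `fallbacks` that succeeds.

    Hypothetical helper: the encoding list is a guess, adjust it for your books.
    """
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if no fallback worked (latin-1 accepts every byte, so this
    # happens only when it was left out of the list): substitute bad bytes.
    return raw.decode("utf-8", errors="replace")
```

You would then pass the item's bytes through `decode_markup(...)` instead of calling `.decode()` directly.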
If you are interested in counting words then I recommend Calibre's calibre.spell.break_iterator.count_words function, which reuses logic from ICU (International Components for Unicode) to get it “right” (± locale and quality of input text).
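
If your script runs outside Calibre's own Python environment, where calibre.spell.break_iterator may not be importable, a rough stand-in might look like the sketch below. To be clear, this is a plain regex approximation and not ICU word breaking; `rough_word_count` and `_TextCollector` are names I made up for illustration, and it will miscount languages that do not separate words with spaces.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of a piece of (X)HTML markup, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def rough_word_count(markup: str) -> int:
    """Approximate word count of the visible text in `markup`.

    Hypothetical helper: a regex heuristic, not a substitute for ICU.
    Text inside <script> or <style> would be counted too.
    """
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    # Joining with spaces keeps "<p>one</p><p>two</p>" as two words, at the
    # cost of splitting a word that has inline tags in the middle of it.
    text = " ".join(collector.chunks)
    # A word is a run of word characters, optionally joined by ' ’ or -
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))
```

For example, `rough_word_count(docs[0].get_body_content().decode())` would give a ballpark figure for the first document from the code above.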

Good luck with your project.

Last edited by isarl; 02-15-2023 at 09:25 AM.