MobileRead Forums - View Single Post

isarl · 02-15-2023, 09:23 AM

Instead of using Calibre's objects I find it simplest to use the Python library ebooklib. Calibre's container types work with exact MIME types whereas ebooklib simply lets me ask for all ITEM_DOCUMENTs in an ebook. Here is some sample code I have written which demonstrates using it to read ebook contents:

Code:

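import ebooklib
from ebooklib import epub  # "import ebooklib" alone does not load the epub submodule
from lxml import etree  # likewise, "import lxml" alone does not load lxml.etree

book = epub.read_epub("path/to/book.epub")
# every XHTML content document in the book, regardless of its exact MIME type
docs = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
# beware non-UTF8 content! E.g. you might need to .decode("latin1"), or some other encoding, instead.
doctree = etree.fromstring(docs[0].get_body_content().decode())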
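
The encoding caveat in the comment above can be handled with a small fallback helper. This is only a minimal sketch using the standard library: the name decode_with_fallback and the order of encodings tried are illustrative assumptions, not part of ebooklib or Calibre.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin1")) -> str:
    """Try each encoding in turn and return the first successful decode.

    Hypothetical helper: the encoding list is a guess at what commonly
    shows up in older ebooks, so adjust it for your own library.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reachable with a custom encoding list, since latin1 accepts any byte.
    return raw.decode("utf-8", errors="replace")
```

With something like this, the last line of the sample could pass docs[0].get_body_content() through the helper instead of calling .decode() directly. Note that a successful decode does not prove the guess was right; it only means the bytes were valid in that encoding.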

If you are interested in counting words, then I recommend Calibre's calibre.spell.break_iterator.count_words function, which reuses logic from ICU (International Components for Unicode) to get it “right” (± locale and quality of input text).
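
For a rough comparison outside of Calibre's environment, here is a deliberately naive word counter built only on the standard library. It is a stand-in and not how count_words works: it strips markup with html.parser and counts regex matches, so it will get many languages and edge cases wrong where ICU's word-break rules would not. The names TextExtractor and naive_word_count are made up for this sketch.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an (X)HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def naive_word_count(markup: str) -> int:
    """Count runs of word characters, keeping internal apostrophes together."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps adjacent paragraphs apart, at the cost of
    # splitting a word that is interrupted by an inline tag.
    text = " ".join(parser.chunks)
    return len(re.findall(r"\w+(?:['’]\w+)*", text))
```

Feeding it the decoded body content of each document and summing the results gives a ballpark figure that can be checked against Calibre's own count.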

Good luck with your project.

Last edited by isarl; 02-15-2023 at 09:25 AM.