Old 02-15-2023, 04:54 AM   #1
sgmoore
Connoisseur
Posts: 59
Karma: 52636
Join Date: Mar 2021
Device: Kindle Voyage
Getting Text content of book

What is the proper way of extracting the text from a book from a plugin?

I know I can do
Code:
os.system('ebook-convert' , ...
to create a temporary txt file and then read the contents of the temporary file into a string, but there is probably a better and faster way.
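A minimal sketch of the temporary-file approach described above, using subprocess rather than os.system. It assumes ebook-convert is on the PATH; the helper names build_convert_command and extract_text_via_convert are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(src, dst):
    """Return the argument list for: ebook-convert <src> <dst>."""
    return ["ebook-convert", str(src), str(dst)]


def extract_text_via_convert(book_path):
    """Convert the book to a temporary .txt file and return its contents."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "book.txt"
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(build_convert_command(book_path, out),
                       check=True, capture_output=True)
        # Assumption: the TXT output is UTF-8; errors="replace" guards
        # against a stray byte in case it is not.
        return out.read_text(encoding="utf-8", errors="replace")
```

Passing the arguments as a list avoids shell quoting problems with book paths that contain spaces.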
Old 02-15-2023, 06:16 AM   #2
kovidgoyal
creator of calibre
Posts: 43,909
Karma: 22669818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That way works fine.
Old 02-15-2023, 06:27 AM   #3
jackie_w
Grand Sorcerer
Posts: 6,212
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
I don't know whether it's better or faster but the calibre plugin 'Count Pages' contains some code for extracting book text into a big string. It uses it when calculating a wordcount for the book.
Old 02-15-2023, 09:23 AM   #4
isarl
Addict
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
Instead of using Calibre's objects I find it simplest to use the Python library ebooklib. Calibre's container types work with exact MIME types whereas ebooklib simply lets me ask for all ITEM_DOCUMENTs in an ebook. Here is some sample code I have written which demonstrates using it to read ebook contents:

Code:
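from ebooklib import ITEM_DOCUMENT, epub
from lxml import html

book = epub.read_epub("path/to/book.epub")
# every content document (XHTML file) in the book
docs = list(book.get_items_of_type(ITEM_DOCUMENT))
# beware non-UTF8 content! E.g. you might need to .decode("latin1"), or some other encoding, instead.
# lxml.html.fromstring also copes with body content that has more than one top-level element
doctree = html.fromstring(docs[0].get_body_content().decode())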
If you are interested in counting words then I recommend Calibre's calibre.spell.break_iterator.count_words function, which reuses logic from ICU (International Components for Unicode) to get it “right” (± locale and quality of input text).

Good luck with your project.

Last edited by isarl; 02-15-2023 at 09:25 AM.
Old 02-15-2023, 10:09 AM   #5
sgmoore
Connoisseur
Quote:
Originally Posted by jackie_w View Post
I don't know whether it's better or faster but the calibre plugin 'Count Pages' contains some code for extracting book text into a big string. It uses it when calculating a wordcount for the book.
It's definitely faster (a quick one-off test shows that spawning ebook-convert is about five times slower).

Unfortunately it is not better and indeed not good enough. I have some files which look like they have been generated as epub files by Microsoft Word, and the count_pages algorithm produces text which is about four times larger than ebook-convert. (A quick glance shows thousands of font-family entries which have not been removed by count_pages).
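The font-family entries mentioned above suggest that stylesheet content is being counted as text. As a rough illustration (not the Count Pages code, which I have not checked), here is a standard-library sketch that drops the contents of style and script elements before collecting text. The names VisibleTextParser and visible_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <style> or <script>."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(markup):
    """Return the text of `markup` with whitespace collapsed to single spaces."""
    parser = VisibleTextParser()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps adjacent paragraphs apart, at the cost of
    # splitting a word that is broken by inline markup.
    return " ".join(" ".join(parser.parts).split())
```

For example, visible_text('<style>p { font-family: Arial; }</style><p>Hello <b>world</b></p>') keeps only the paragraph text.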
Old 02-15-2023, 12:52 PM   #6
sgmoore
Connoisseur
Quote:
Originally Posted by isarl View Post
Instead of using Calibre's objects I find it simplest to use the Python library ebooklib.
Looks like that only works with epub files.
Old 02-15-2023, 02:18 PM   #7
isarl
Addict
Quote:
Originally Posted by sgmoore View Post
Looks like that only works with epub files.
Yes, file format is going to affect how you solve this problem. The more disparate types you need to handle, the more work you need to do to ensure that each one is being handled correctly. TANSTAAFL
Old 02-15-2023, 03:43 PM   #8
BetterRed
null operator (he/him)
Posts: 20,605
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by sgmoore View Post
It's definitely faster (a quick one-off test shows that spawning ebook-convert is about five times slower).

Unfortunately it is not better and indeed not good enough. I have some files which look like they have been generated as epub files by Microsoft Word, and the count_pages algorithm produces text which is about four times larger than ebook-convert. (A quick glance shows thousands of font-family entries which have not been removed by count_pages).
AFAIK Microsoft Word itself is incapable of creating EPUB files directly.

The ePUBTools Word addin can create EPUBs from within Word, and there are a number of tools, including calibre, that will convert MS Word's native format DOCX files to EPUB.

Added: InDesign is a more likely candidate as the source of poorly formed EPUBs.

BR

Last edited by BetterRed; 02-15-2023 at 03:57 PM.
Old 02-15-2023, 08:41 PM   #9
kovidgoyal
creator of calibre
If you care about speed use the extract_text() function from calibre.db.fts.text
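A hedged sketch of calling that function from plugin code. It assumes extract_text() takes the path to a book file and returns its text as a string; check calibre/db/fts/text.py in your calibre version before relying on that. The wrapper book_text and its extractor parameter are hypothetical, added only so the code can be exercised outside calibre.

```python
def book_text(path, extractor=None):
    """Return the plain text of the book at `path`.

    `extractor` is a hypothetical hook for testing; by default the
    calibre function named in the post above is used.
    """
    if extractor is None:
        # Imported lazily: this only resolves inside calibre's Python
        # environment (a plugin, or a script run with calibre-debug).
        from calibre.db.fts.text import extract_text as extractor
    return extractor(path)
```

Inside a plugin this would be called as book_text(path_to_book_file), with no extractor argument.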
Old 02-16-2023, 09:00 AM   #10
sgmoore
Connoisseur
Quote:
Originally Posted by kovidgoyal View Post
If you care about speed use the extract_text() function from calibre.db.fts.text
Thank you. That looks to do what I want. It does not produce exactly the same results as ebook-convert, but ignoring formatting and white-space issues it is extremely close. I tried it on about 1000 books and the worst case was still 99% similar, and the vast majority of them were over 99.9% similar.

It also takes only about 1/20 of the time needed to call ebook-convert.

Thanks again.
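The post does not say how the similarity figures were measured. One way such a comparison could be sketched, assuming whitespace is normalised first and using difflib from the standard library (the function names are illustrative):

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse every run of whitespace to a single space."""
    return " ".join(text.split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means identical after normalisation."""
    # autojunk=False turns off difflib's heuristic that discards very
    # frequent characters in long sequences.
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()
```

SequenceMatcher can be slow on whole books, so for a large library it may be more practical to compare lists of words or a sample of each text.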
Old 02-16-2023, 05:22 PM   #11
Quoth
the rook, bossing Never.
Posts: 11,228
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Calibre's docx to epub conversion works better than the plugins I've tried for Word and LO Writer. Judging by the commercial ebooks from big publishers, it also seems to work better than InDesign and other commercial tools.
InDesign should only be used for fancy colour coffee-table books and glossy magazines.

For novels, Calibre is just about perfect given a properly formatted docx (made by Word, or by an extra Save As in LO Writer). Also, for ordinary novels, direct PDF export from a differently formatted copy of the word-processor file beats InDesign too. And you can now only rent InDesign.