Old 12-02-2022, 10:26 AM   #1
isarl
Addict
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
Counting is hard: word breaks and XPath

Hello, thank you for taking the time to read my post.

I am interested in counting words in ebooks. “Simple!” you might respond. “Use the Count Pages plugin!” If this is your suggestion, then thank you – I already do! However, I am specifically interested in counting the words in only part of an ebook.

I got this idea when I started using KoboUtilities and examined the format in which the Kobo stores my “last reading position”, which looks like this:
Code:
Text/Chapter02.xhtml#kobo.27.1
In other words, it stores the location as the name of the document in the book's spine, and an XHTML tag ID (these tags and their ID values are inserted by the Kobo driver when converting to .kepub). This means that I can use the OEB container data types provided by Calibre to do something like:

Code:
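from calibre.ebooks.oeb.polish.container import get_container
from calibre.spell.break_iterator import count_words

# Open the book and parse the spine document named in the reading position
book = get_container("/path/to/book.kepub")
chapter = book.parsed("Text/Chapter02.xhtml")
# Concatenate the text inside every element that follows the one with the given ID
excerpt = ''.join(chapter.xpath("//*[@id='kobo.27.1']/following::*//text()"))
count_words(excerpt)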
The above code counts all words in the document (“chapter”/section/split) of that name after the indicated XHTML span. However, because str.join is unaware of display rules for HTML block elements, successive paragraph tags are joined without any space, and usually the first word of the next paragraph is counted along with the last word of the preceding paragraph (although some punctuation results in the correct count). Example:

Code:
>>> count_words("Hello.How are you?")
3
>>> count_words("‘Hello.’How are you?")
4
I am hesitant to roll my own word counting function as I strongly suspect that the ICU code core to count_words is much better than anything I can come up with. Is there perhaps a better XPath query I can use for this, or some other mechanism to excerpt the content I wish to count? Should I still use XPath, only be more intelligent about how I count words? (Perhaps I can drop the "//text()" suffix and be smart about iterating over the returned nodeset, e.g. counting words for each paragraph tag separately? But I'm not sure how I would do this without exhaustively enumerating every possible block-type tag name I might have to consider, and this also completely ignores that an individual book might have style rules which change one or more block elements to display inline.)
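
One possible way to keep paragraphs from running together, without changing the word counter itself, is to insert a space at block boundaries while collecting the text. The sketch below is an illustration only, not calibre code: it uses the standard library's ElementTree instead of the lxml tree that calibre returns, the `text_after` helper and the `BREAKING_TAGS` set are hypothetical names, and the tag set is a deliberately incomplete assumption that ignores CSS `display` rules.

```python
import xml.etree.ElementTree as ET

# Assumption: a small, non-exhaustive set of tags that end a run of text.
# Stylesheets that change an element's display type are not considered.
BREAKING_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
                 "li", "blockquote", "td", "tr", "br"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rpartition("}")[2]


def text_after(root, start_id):
    """Collect the text that follows the end of the element whose id is
    start_id, inserting a space wherever a breaking element opens or closes."""
    parts = []
    started = False

    def walk(elem):
        nonlocal started
        breaks = local_name(elem.tag) in BREAKING_TAGS
        if started and breaks:
            parts.append(" ")
        if started and elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
        if elem.get("id") == start_id:
            started = True  # collect only what comes after this element closes
        if started and breaks:
            parts.append(" ")
        if started and elem.tail:
            parts.append(elem.tail)

    walk(root)
    return "".join(parts)


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><span id="kobo.1.1">Skipped.</span></p>'
    '<p><span id="kobo.2.1">Hello.</span></p>'
    '<p><span id="kobo.3.1">How are <i>y</i>ou?</span></p>'
    '</body></html>'
)
root = ET.fromstring(SAMPLE)
excerpt = text_after(root, "kobo.1.1")
print(excerpt.split())  # ['Hello.', 'How', 'are', 'you?']
```

The resulting string could then be handed to count_words as before. Inline markup inside a word (the `<i>y</i>ou` above) stays joined, because spaces are added only around the listed tags.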

My ultimate goal with this code is to take two reading positions like the ones Kobo stores, and count the words between them. There is extra logic involved in determining, “Are the starting and ending positions in the same document? Do the named documents exist in this book? Do the named tags exist in their documents?” which I have omitted here for the sake of brevity.
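
As a rough sketch of the checks described above, the position string can be split into its document name and element ID, and the documents spanned by two positions can be looked up in the reading order. Everything here is hypothetical naming for illustration (`KoboPosition`, `parse_position`, `documents_between`), and the spine is assumed to be available as a plain list of document names. Checking that the element ID exists inside its document is left out.

```python
from typing import List, NamedTuple


class KoboPosition(NamedTuple):
    document: str    # spine document name, e.g. "Text/Chapter02.xhtml"
    element_id: str  # ID inside that document, e.g. "kobo.27.1"


def parse_position(position: str) -> KoboPosition:
    """Split a 'document#id' reading position into its two parts."""
    document, sep, element_id = position.partition("#")
    if not sep or not document or not element_id:
        raise ValueError(f"not a 'document#id' position: {position!r}")
    return KoboPosition(document, element_id)


def documents_between(spine: List[str], start: KoboPosition,
                      end: KoboPosition) -> List[str]:
    """Return the spine documents from start to end, inclusive.

    Assumption: spine is a list of document names in reading order.
    """
    try:
        first = spine.index(start.document)
        last = spine.index(end.document)
    except ValueError as err:
        raise ValueError("position names a document that is not in the spine") from err
    if first > last:
        raise ValueError("start position comes after end position")
    return spine[first:last + 1]


spine = ["Text/Chapter01.xhtml", "Text/Chapter02.xhtml", "Text/Chapter03.xhtml"]
start = parse_position("Text/Chapter02.xhtml#kobo.27.1")
end = parse_position("Text/Chapter03.xhtml#kobo.4.2")
print(documents_between(spine, start, end))
# ['Text/Chapter02.xhtml', 'Text/Chapter03.xhtml']
```

When both positions fall in the same document the returned list has a single entry, which is the case where the count has to stop at the end ID instead of running to the end of the file.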

Thank you again for taking the time to read my post! Even if you can't help, I appreciate your time, and I hope you have a lovely day.

~isarl

Last edited by isarl; 12-02-2022 at 10:28 AM.
Old 12-02-2022, 12:17 PM   #2
kovidgoyal
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
See the code in polish/spell.py for many different ways of counting words, some of which you should be able to adapt.
Old 12-02-2022, 01:16 PM   #3
isarl
Thank you Kovid! I appreciate the pointer and look forward to exploring the classes and methods available.
Tags
plugin development, word break, word count

