Quote:
Originally Posted by JSWolf
What did change was the word count. I take it the word count is now more accurate?
Well, I would hope so.
Previously the plugin took a crude approach: it extracted the html body content and stripped the html tags from the raw text using regular expressions, which for a rough estimate is usually perfectly fine.
However, as with any shortcut, a crude approach creates edge cases where you can get outlier results. In this case the problem is the encoding of html entities - one book (or even two formats of the same book) might have text like "it&apos;s" while the other has "it's". Previously I was not "decoding" the first case, so if you were counting characters in the body the number is obviously larger, and if you were counting words then, depending on how word boundaries were defined, it might count as two words rather than one because of the semi-colon in the entity.
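To illustrate the idea (just a sketch, not the plugin's actual code, and the exact numbers depend on how you define a word boundary):

Code:
import re

def crude_word_count(html):
    # Old-style approach: strip tags with a regex, then count word-ish runs
    text = re.sub(r'<[^>]+>', ' ', html)
    return len(re.findall(r"[\w']+", text))

print(crude_word_count("<p>it&apos;s</p>"))  # undecoded entity -> 3 ("it", "apos", "s")
print(crude_word_count("<p>it's</p>"))       # decoded text -> 1 ("it's")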
Now, however, I am using the BeautifulSoup parser to process the html and then asking it to give me the body text. It can be told to decode all those entities (and to strip out all the other meaningless html tags like spans etc.), so I have a consistent starting point - the decoded text of the second example - from which to calculate page and word counts.
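In other words, something roughly like this (a simplified sketch using the bs4 package, not the exact plugin code):

Code:
from bs4 import BeautifulSoup

def body_text(html):
    # Parsing decodes character references such as &apos; along the way
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup
    # get_text() throws away the tags (spans etc.) and returns plain text
    return body.get_text(" ", strip=True)

print(body_text("<body><p>it&apos;s a <span>test</span></p></body>"))
# -> it's a test

The word and page counts are then calculated from that decoded, tag-free text.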
Provided BeautifulSoup is given fairly well-formed html it should indeed all work fine. Give it books with bad html and it will probably give terrible results - but then if the html is that awful, many ereaders might struggle with it too...