Old 10-10-2023, 09:17 AM   #1696
ownedbycats
Custom User Title
Posts: 8,877
Karma: 62040409
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
Originally Posted by kiwidude View Post
Well I would hope so

Previously the plugin had a crude approach of extracting the html body content and stripping html tags using the raw text and regular expressions. Which for a rough estimate is usually perfectly fine.
I suspect this did not work as expected on "HTML for Dummies"
Old 10-10-2023, 11:39 AM   #1697
DNSB
Bibliophagist
Posts: 37,024
Karma: 148321038
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by ownedbycats View Post
I suspect this did not work as expected on "HTML for Dummies"
Wait till you do word and page counts on an ePub with Base64 encoded images.
Old 10-10-2023, 02:18 PM   #1698
JSWolf
Resident Curmudgeon
Posts: 74,713
Karma: 130140792
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by kiwidude View Post
Well I would hope so

Previously the plugin had a crude approach of extracting the html body content and stripping html tags using the raw text and regular expressions. Which for a rough estimate is usually perfectly fine.

However as with any shortcuts a crude approach creates edge cases where you can get outlier results. In this case the problem is the encoding of html entities - one book (or even two formats of the same book) might have text like "it&#39;s" and the other has "it's". Previously I was not "decoding" the first case, so if you were counting characters in the body obviously the number is larger, and if counting words then the way word boundaries were defined it might count as two words rather than one based on the semi-colon being in there.

Now however I am using the BeautifulSoup parser to process the html and then asking it to give me the body text. It can be told to decode all those entities (and strip out all the other meaningless html tags like spans etc), so I have a consistent starting point of the second example text to calculate page and word counts for.

Provided BeautifulSoup is given fairly well formed html it should indeed all work fine. Give it bad html books and it will probably give terrible results - but then if the html is that awful then many ereaders might struggle with it too...
Wouldn't converting the eBook to text and then counting the words give a fairly accurate word count and in most cases an exact word count?
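For anyone following along, here is a rough sketch of the entity problem described in the quote above. It is not the plugin's code: it uses the Python standard library's `html.unescape` instead of BeautifulSoup, and the tag-stripping regex, the word-boundary rule (split on whitespace and semicolons) and the sample strings are all assumptions made up for illustration.

```python
import re
from html import unescape

# Crude tag stripper, standing in for the old regex approach (assumed pattern).
TAG_RE = re.compile(r"<[^>]+>")
# Hypothetical word-boundary rule: split on whitespace and semicolons.
BOUNDARY_RE = re.compile(r"[\s;]+")


def count(text):
    """Return (character count, word count) for already-extracted text."""
    words = [w for w in BOUNDARY_RE.split(text) if w]
    return len(text), len(words)


def crude_text(body_html):
    """Strip tags with a regex but leave entities encoded."""
    return TAG_RE.sub("", body_html).strip()


def decoded_text(body_html):
    """Strip tags, then decode entities so both spellings look the same."""
    return unescape(crude_text(body_html))


# Two made-up snippets that render identically in a reader.
encoded = "<p>I think it&#39;s <span>fine</span></p>"
plain = "<p>I think it's <span>fine</span></p>"

print(count(crude_text(encoded)))    # entity left in place
print(count(crude_text(plain)))      # literal apostrophe
print(count(decoded_text(encoded)))  # entity decoded first
```

Without decoding, the encoded snippet comes out with more characters and one extra "word", because the semicolon splits "it&#39;s" in two. After decoding, both snippets give the same counts.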
Old 10-10-2023, 03:07 PM   #1699
ownedbycats
Custom User Title
Posts: 8,877
Karma: 62040409
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
Originally Posted by JSWolf View Post
Wouldn't converting the eBook to text and then counting the words give a fairly accurate word count and in most cases an exact word count?
eBooks are made up of... HTML files...
Old 10-10-2023, 04:29 PM   #1700
JSWolf
Resident Curmudgeon
Posts: 74,713
Karma: 130140792
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by ownedbycats View Post
eBooks are made up of... HTML files...
And when you convert them to text, you get rid of all of the HTML code. Plus, you convert the entities to symbols. Then you count the words and there you go.
Old 10-11-2023, 12:24 AM   #1701
DNSB
Bibliophagist
Posts: 37,024
Karma: 148321038
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
I've attached an epub that, according to the Adobe SPN algorithm, contains 27 words and 3,115 pages.

A pathological case.

Personally, I find Count Pages to be good enough if not perfect.
Attached Files
File Type: epub Base64 Images - Ann Onymous.epub (3.30 MB, 23 views)
Old 10-11-2023, 01:56 AM   #1702
kiwidude
Calibre Plugins Developer
Posts: 4,654
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Quote:
Originally Posted by JSWolf View Post
And when you convert them to text, you get rid of all of the HTML code. Plus, you convert the entities to symbols. Then you count the words and there you go.
The plugin would be orders of magnitude slower, with all the additional disk activity, temp files etc. It won't be changing to do that.

Quote:
Originally Posted by DNSB View Post
I've attached an epub that, according to the Adobe SPN algorithm, contains 27 words and 3,115 pages.

A pathological case.

Personally, I find Count Pages to be good enough if not perfect.
Therein lies the weakness in the simplistic approach of the ADE algorithm: inlined Base64 images result in enormous html file sizes, and file size is what ADE bases its page estimates on. That is also why I don't use it myself: unless you are in the habit of opening every file to check whether its images have been inlined this way (rather than linked as external image files), you would never know whether it is a genuinely big book or an artifact of this flaw in the approach.
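A toy illustration of that point, not the real ADE algorithm: the bytes-per-page constant, the regex for inline data URIs and the sample page are all made up for this sketch. It only shows how a purely size-based estimate balloons once an image is inlined as Base64.

```python
import base64
import re

# Hypothetical constant, for illustration only; not the actual ADE figure.
BYTES_PER_PAGE = 1024

# Matches src attributes that carry an inline data URI (assumed simple form).
DATA_URI_RE = re.compile(r'src="data:[^"]*"')


def pages_by_size(html):
    """Estimate pages purely from the byte size of the html (ceiling division)."""
    size = len(html.encode("utf-8"))
    return max(1, -(-size // BYTES_PER_PAGE))


def strip_inline_images(html):
    """Blank out inline data URIs so only markup and text contribute to size."""
    return DATA_URI_RE.sub('src=""', html)


# Fake 300,000-byte "image"; Base64 encoding grows it by about a third.
fake_image = base64.b64encode(bytes(300_000)).decode("ascii")
page = (
    "<html><body><p>Just five words of text.</p>"
    f'<img src="data:image/png;base64,{fake_image}"/>'
    "</body></html>"
)

print(pages_by_size(page))                       # hundreds of "pages"
print(pages_by_size(strip_inline_images(page)))  # a single page
```

The same five words of text come out as hundreds of pages or one page depending only on how the image is stored, which is why a size-based number cannot tell a big book from an inlined-image book.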
Old 10-11-2023, 11:35 AM   #1703
JSWolf
Resident Curmudgeon
Posts: 74,713
Karma: 130140792
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by DNSB View Post
I've attached an epub that, according to the Adobe SPN algorithm, contains 27 words and 3,115 pages.

A pathological case.

Personally, I find Count Pages to be good enough if not perfect.
I only count 15 words. Where are the other 12 words? I converted this to text and it is only 15 words.

I'll have to try the older version of Count Pages on my Surface and see what result I get. I'd be very interested if the older version gives a more accurate number of words.

Last edited by JSWolf; 10-11-2023 at 11:43 AM.
Old 10-11-2023, 12:35 PM   #1704
ownedbycats
Custom User Title
Posts: 8,877
Karma: 62040409
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
Originally Posted by JSWolf View Post
I only count 15 words. Where are the other 12 words? I converted this to text and it is only 15 words.
If you open it in ebook-editor, there are a few h3s that are invisible due to CSS. 12 words there.
Old 10-11-2023, 01:05 PM   #1705
DNSB
Bibliophagist
Posts: 37,024
Karma: 148321038
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by kiwidude View Post
Therein lies the weakness in the simplistic approach of the ADE algorithm: inlined Base64 images result in enormous html file sizes, and file size is what ADE bases its page estimates on. That is also why I don't use it myself: unless you are in the habit of opening every file to check whether its images have been inlined this way (rather than linked as external image files), you would never know whether it is a genuinely big book or an artifact of this flaw in the approach.
I've only found Base64 encoded images in 2 ePubs in the wild. For both, I converted them back to image files. In both cases, the issues became very noticeable just from the number of pages reported. The sample ePub I attached was generated during a discussion @davidfor and I had quite a while back about some claims that Base64 encoded images rendered faster. I did modify the posted file, though, to leave only the Base64 encoded image pages.
Old 10-11-2023, 01:10 PM   #1706
JSWolf
Resident Curmudgeon
Posts: 74,713
Karma: 130140792
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by ownedbycats View Post
If you open it in ebook-editor, there are a few h3s that are invisible due to CSS. 12 words there.
But they don't count as they are not displayed. So my idea of converting to text and then counting is most accurate.
Old 10-11-2023, 01:17 PM   #1707
DNSB
Bibliophagist
Posts: 37,024
Karma: 148321038
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by JSWolf View Post
But they don't count as they are not displayed. So my idea of converting to text and then counting is most accurate.
If you convert to text, they would no longer be hidden and would be counted.

I somehow doubt that anyone is going to start looking for display: none, height: 0 and whatever other methods of hiding text can be used, excluding words in the ePub 3 navigation document even when it is visible, etc., since that way lies extreme slowness.
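To make the hidden-text point concrete, here is a small sketch using the standard library's `html.parser` (not BeautifulSoup, and not how Count Pages or calibre do it). The sample markup is invented, and the `skip_hidden` option only understands an inline `style="display: none"` on well-formed XHTML. Text hidden through a stylesheet, as in the attached epub, would need the CSS to be resolved as well, which is the kind of extra work described above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, optionally skipping inline display:none elements."""

    def __init__(self, skip_hidden=False):
        super().__init__(convert_charrefs=True)
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if not self.skip_hidden:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        # Track nesting so children of a hidden element stay hidden.
        if self.hidden_depth or "display:none" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)


def word_count(html, skip_hidden=False):
    """Count whitespace-separated words in the extracted text."""
    parser = TextExtractor(skip_hidden=skip_hidden)
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts).split())


# Invented sample: one hidden heading, one visible paragraph.
sample = (
    '<body><h3 style="display: none">Hidden chapter heading</h3>'
    "<p>Only visible words here</p></body>"
)

print(word_count(sample))                    # tags stripped, everything counted
print(word_count(sample, skip_hidden=True))  # inline-hidden heading skipped
```

Simply removing the tags counts the hidden heading along with the visible paragraph. Skipping it takes extra bookkeeping even in this trivial inline-style case.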
Old 10-11-2023, 01:45 PM   #1708
JSWolf
Resident Curmudgeon
Posts: 74,713
Karma: 130140792
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by DNSB View Post
If you convert to text, they would no longer be hidden and would be counted.

I somehow doubt that anyone is going to start looking for display: none, height: 0 and whatever other methods of hiding text can be used, excluding words in the ePub 3 navigation document even when it is visible, etc., since that way lies extreme slowness.
calibre did not convert the hidden text. Thus, only 15 words are ever displayed.
Old 10-11-2023, 03:59 PM   #1709
DNSB
Bibliophagist
Posts: 37,024
Karma: 148321038
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by JSWolf View Post
calibre did not convert the hidden text. Thus, only 15 words are ever displayed.
Now we are trying to add calibre conversion into the mix? Might have been nice of you to mention that earlier. Have you never noticed that calibre's conversion to txt seems to lose text in headers that are display: none?

OTOH, did you count the words in my original epub with 15 words in the nav.xhtml and the other 12 words in the hidden headers? Again, please note that when you remove the html tags, all 27 words are visible.

Perhaps knocking off the what-if-isms would make it easier for you to communicate with others.

Anyhow, enough said on this off-topic topic. If you wish to continue this, take it to a vent and rant thread.

Last edited by DNSB; 10-11-2023 at 04:25 PM.
Old 10-11-2023, 04:03 PM   #1710
ownedbycats
Custom User Title
Posts: 8,877
Karma: 62040409
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
Originally Posted by DNSB View Post
Now we are trying to add calibre conversion into the mix? Might have been nice of you to mention that earlier.
He keeps claiming that converting to text gets a more accurate word count. Never mind the impracticality of it.