Go Back   MobileRead Forums > E-Book Software > Calibre > Plugins
Old 12-07-2019, 11:16 PM   #1321
NiLuJe
BLAM!
Posts: 13,506
Karma: 26047202
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
@davidfor: The marking appears to happen in a timely manner on my end.
Old 12-07-2019, 11:56 PM   #1322
snarkophilus
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by davidfor View Post
My statement was badly phrased. It's about how long it takes to do the spelling check and mark the words. I'd expect this to take a long time for a decent-sized file for you. What you could try is to turn off the spelling option, reopen the file and go to the bottom. Add a misspelled word and turn the option on again. For me, the word gets marked in a few seconds. If what I think is going on is right, it might take minutes for that to happen.
I joined the two Dorian Gray files together to make a 453kB file, closed that file, turned off spell check, reopened the file, added the word flubbermunger to the last paragraph at the end, and enabled spell check. Using the highly accurate "one one thousand, two one thousand, ..." stopwatch, it took about 19 seconds to highlight flubbermunger once I closed the Preferences tab. That's a bit quicker than you were expecting?
Old 12-08-2019, 01:16 AM   #1323
DNSB
Bibliophagist
Posts: 47,944
Karma: 174315098
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by davidfor View Post
I just grabbed "Complete Works Oscar Wilde". It took 9 seconds on my laptop (Win10, 16GB, i7, NVIDIA graphics).

Could you post the log from a count? It doesn't matter which, I just want to check in case it shows anything strange.
My laptop (Win10, 32GB, i7, Intel/NVidia graphics) took 4 seconds for the Complete Works of Oscar Wilde and 2 seconds for Gremlins, Go Home.


Code:
Logfile for book ID 8195 (Complete Works)
	Found 1155998 words
	Method of counting _page_count_mode=Estimate _download_sources=[]
	results= {u'PageCount': 4259, u'WordCount': 1155998}
	Found 4259 pages
8195
do_statistics_for_book:  C:\Users\David\AppData\Local\Temp\calibre_c9bwbz\uvamax_count_pages\8195.epub 0 Estimate [] [u'WordCount', u'PageCount'] 1500 True
	Estimated accurate page count
	  Lines: 132049  Divs: 773  Paras: 42789
	  Accurate count: 4259  Fast count: 3122
	Page count: 4259
	Word count using icu_wordcount - trying to count_words
	Word count - used count_words: 1155998
	Word count: 1155998

Logfile for book ID 8171 (Gremlins Go Home)
	Found 35742 words
	Method of counting _page_count_mode=Estimate _download_sources=[]
	results= {u'WordCount': 35742, u'PageCount': 133}
	Found 133 pages
8171
do_statistics_for_book:  C:\Users\David\AppData\Local\Temp\calibre_c9bwbz\bqeky5_count_pages\8171.epub 0 Estimate [] [u'WordCount', u'PageCount'] 1500 True
	Estimated accurate page count
	  Lines: 4151  Divs: 20  Paras: 1339
	  Accurate count: 133  Fast count: 102
	Page count: 133
	Word count using icu_wordcount - trying to count_words
	Word count - used count_words: 35742
	Word count: 35742
Old 12-08-2019, 01:31 AM   #1324
DNSB
Bibliophagist
Posts: 47,944
Karma: 174315098
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
I was looking at the posted result logs. I noticed that in snarkophilus's log he had the following:

Code:
For this book, using language=eng
	Flesch Reading Ease: 79.8637432468
	Flesch Kincade Grade: 6.59164101285
	Gunning Fog: 10.694577889
which I had not enabled. After adding custom columns for those 3 calculations, calibre is currently showing the job at 1% after 4 minutes and 45 seconds. I'll edit this message after the job completes to give the final time.

Edit: final time was 12 minutes, 17 seconds.

Code:
Logfile for book ID 8195 (Complete Works)
	Method of counting _page_count_mode=Estimate _download_sources=[]
	results= {u'FleschGrade': 6.591641012852087, u'FleschReading': 79.86374324684013, u'PageCount': 4259, u'WordCount': 1155998, u'GunningFog': 10.694577889041557}
	Found 4259 pages
	Computed 79.9 Flesch Reading
	Computed 6.6 Flesch-Kincaid Grade
	Found 1155998 words
	Computed 10.7 Gunning Fog Index
8195
do_statistics_for_book:  C:\Users\David\AppData\Local\Temp\calibre_bqkn7c\9ohcwo_count_pages\8195.epub 0 Estimate [] [u'PageCount', u'FleschReading', u'FleschGrade', u'WordCount', u'GunningFog'] 1500 True
	Estimated accurate page count
	  Lines: 132049  Divs: 773  Paras: 42789
	  Accurate count: 4259  Fast count: 3122
	Page count: 4259
	Word count using icu_wordcount - trying to count_words
	Word count - used count_words: 1155998
	Word count: 1155998
	Results of NLTK text analysis:
	  Number of characters: 5468651
	  Number of words: 1251081
	  Number of sentences: 68486
	  Number of syllables: 1607495
	  Number of complex words: 109300
	  Average words per sentence: 18
For this book, using language=eng
	Flesch Reading Ease: 79.8637432468
	Flesch Kincade Grade: 6.59164101285
	Gunning Fog: 10.694577889
Hope this is of some help.
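
As a cross-check on the log above, the three scores can be recomputed from the NLTK counts with the standard published formulas. This is only a sketch, not the plugin's code: the function name is made up for illustration, and truncating the words-per-sentence average to the integer 18 printed in the log is an inference from the numbers rather than something confirmed from the source.

```python
def readability_scores(words, sentences, syllables, complex_words):
    """Standard Flesch, Flesch-Kincaid and Gunning Fog formulas.

    Assumption: average words per sentence is truncated to an integer,
    as the "Average words per sentence: 18" log line suggests.
    """
    avg_words = words // sentences            # 18 for the Complete Works log
    syll_per_word = syllables / float(words)
    pct_complex = 100.0 * complex_words / words

    return {
        "FleschReading": 206.835 - 1.015 * avg_words - 84.6 * syll_per_word,
        "FleschGrade": 0.39 * avg_words + 11.8 * syll_per_word - 15.59,
        "GunningFog": 0.4 * (avg_words + pct_complex),
    }


# Counts copied from the "Results of NLTK text analysis" section of the log.
scores = readability_scores(
    words=1251081, sentences=68486, syllables=1607495, complex_words=109300
)
for name in sorted(scores):
    print(name, round(scores[name], 4))
```

With the truncated average the results line up with the logged 79.86, 6.59 and 10.69. With an untruncated average they come out slightly different, which is why the truncation is assumed here.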

Last edited by DNSB; 12-08-2019 at 01:39 AM.
Old 12-08-2019, 04:41 AM   #1325
snarkophilus
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by DNSB View Post
I was looking at the posted result logs. I noticed that in snarkophilus's log he had the following:

Code:
For this book, using language=eng
	Flesch Reading Ease: 79.8637432468
	Flesch Kincade Grade: 6.59164101285
	Gunning Fog: 10.694577889
which I had not enabled. After adding custom columns for those 3 calculations, calibre is currently showing the job at 1% after 4 minutes and 45 seconds. I'll edit this message after the job completes to give the final time.

Edit: final time was 12 minutes, 17 seconds.
Bingo! Remove those, and my Oscar Wilde page count is down from over 25 minutes to 7 seconds! My IOO Classic Books I which previously took about 2.5 hours is down to 23 seconds too.

To be honest, I think I only enabled those out of some sort of curiosity value. I've certainly never ever used them, and now that I check, I don't even have those columns visible by default anyway.
Old 12-08-2019, 07:29 AM   #1326
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by snarkophilus View Post
Bingo! Remove those, and my Oscar Wilde page count is down from over 25 minutes to 7 seconds! My IOO Classic Books I which previously took about 2.5 hours is down to 23 seconds too.

To be honest, I think I only enabled those out of some sort of curiosity value. I've certainly never ever used them, and now that I check, I don't even have those columns visible by default anyway.
That probably explains the difference for Windows machines. But, the original report explicitly mentioned using the ICU option for the word count. And my testing at work on a Linux box was only for the word count. I can't do much for this.

I'll have a look at the other stats when I have time. But, that isn't likely to happen soon.
Old 12-08-2019, 03:14 PM   #1327
NiLuJe
BLAM!
Posts: 13,506
Karma: 26047202
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
Okay, finally let it run to completion, and it indeed took ~30min over here.

(That's with the NLTK stuff disabled).

Code:
Count Page/Word Statistics
        do_count_statistics - book_path=/tmp/calibre_4.5.0_tmp_Cq86l_/1PH1R6_count_pages/6379.epub, pages_algorithm=0, page_count_mode=Estimate, statistics_to_run=[u'WordCount', u'PageCount'], custom_chars_per_page=1500, icu_wordcount=True
        do_count_statistics - job started for file book_path=/tmp/calibre_4.5.0_tmp_Cq86l_/1PH1R6_count_pages/6379.epub
        -------------------------------
        Logfile for book ID 6379 (Complete Works)
                Found 1155998 words
                Method of counting _page_count_mode=Estimate _download_sources=[]
                results= {u'WordCount': 1155998, u'PageCount': 4259}
                Found 4259 pages
        6379
        do_statistics_for_book:  /tmp/calibre_4.5.0_tmp_Cq86l_/1PH1R6_count_pages/6379.epub 0 Estimate [] [u'WordCount', u'PageCount'] 1500 True
                Estimated accurate page count
                  Lines: 132049  Divs: 773  Paras: 42789
                  Accurate count: 4259  Fast count: 3122
                Page count: 4259
                Word count using icu_wordcount - trying to count_words
                Word count - used count_words: 1155998
                Word count: 1155998
Replicated @snarkophilus's experiment with the editor (i.e., a bigger DorianGray.htm), and that still "only" takes at most 6 or 8s to highlight.

Last edited by NiLuJe; 12-08-2019 at 04:36 PM.
Old 12-08-2019, 03:48 PM   #1328
JSWolf
Resident Curmudgeon
Posts: 80,650
Karma: 150249619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by snarkophilus View Post
Nothing hugely useful at a glance:

Spoiler:
Code:
Count Page/Word Statistics
do_count_statistics - book_path=C:\Users\simon\AppData\Local\Temp\calibre_6yaerj\9cqwln_count_pages\5433.epub, pages_algorithm=2, page_count_mode=Estimate, statistics_to_run=[u'PageCount', u'GunningFog', u'WordCount', u'FleschGrade', u'FleschReading'], custom_chars_per_page=1500, icu_wordcount=True
do_count_statistics - job started for file book_path=C:\Users\simon\AppData\Local\Temp\calibre_6yaerj\9cqwln_count_pages\5433.epub
-------------------------------
Logfile for book ID 5433 (Complete Works)
	Method of counting _page_count_mode=Estimate _download_sources=[]
	results= {u'GunningFog': 10.694577889041557, u'WordCount': 1155998, u'FleschGrade': 6.591641012852087, u'FleschReading': 79.86374324684013, u'PageCount': 2504.0}
	Found 2504 pages
	Computed 10.7 Gunning Fog Index
	Found 1155998 words
	Computed 6.6 Flesch-Kincaid Grade
	Computed 79.9 Flesch Reading
5433
do_statistics_for_book:  C:\Users\simon\AppData\Local\Temp\calibre_6yaerj\9cqwln_count_pages\5433.epub 2 Estimate [] [u'PageCount', u'GunningFog', u'WordCount', u'FleschGrade', u'FleschReading'] 1500 True
	Page count: 2504.0
	Word count using icu_wordcount - trying to count_words
	Word count - used count_words: 1155998
	Word count: 1155998
	Results of NLTK text analysis:
	  Number of characters: 5468651
	  Number of words: 1251081
	  Number of sentences: 68486
	  Number of syllables: 1607495
	  Number of complex words: 109300
	  Average words per sentence: 18
For this book, using language=eng
	Flesch Reading Ease: 79.8637432468
	Flesch Kincade Grade: 6.59164101285
	Gunning Fog: 10.694577889


It was faster this time - 26m06s today vs 26m36s yesterday

I'm using the current Calibre Windows 64-bit build. I update every time Calibre reminds me there's a new version.

Are there any Calibre debug options that might give a more verbose log file?

EDIT: I'm using the ADE page count algorithm if that's relevant.

EDIT2: Attached screenshot of my Count Pages configuration
One difference between your settings and mine is you have the readability settings on. I do not have them on. Turn them off and try again with the Monte Cristo book from MR.
Old 12-08-2019, 05:54 PM   #1329
snarkophilus
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by davidfor View Post
That probably explains the difference for Windows machines. But, the original report explicitly mentioned using the ICU option for the word count. And my testing at work on a Linux box was only for the word count. I can't do much for this.
Ahh, true. We (I!!) have gone off on a tangent. I've confirmed that normal vs ICU doesn't really affect run time for me.

Quote:
I'll have a look at the other stats when I have time. But, that isn't likely to happen soon.
Almost all the time is spent counting syllables. I added a bit of timing stuff to the plugin (yay, my first actual working change to anything Calibre related!) and I see in my log of Oscar Wilde

Code:
count syllables in all words
 .... count syllables done --- 1539.17500019 seconds ---
and total run time was just over 25 minutes again.

If I insert a return 1607495 right before the for word in words: loop in nltk_lite/textanalyzer.py, then it only takes 29 seconds instead of nearly half an hour.

Counting syllables is difficult?!
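
For anyone wanting to repeat this kind of measurement, timing instrumentation of the sort described above might look roughly like the sketch below. The helper name `timed` and the placeholder workload are hypothetical and are not taken from the plugin. Only the log line format imitates the output shown above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log=print):
    """Log how long the wrapped block took, in a format like the log above."""
    start = time.time()
    try:
        yield
    finally:
        log(" .... %s done --- %s seconds ---" % (label, time.time() - start))


messages = []
with timed("count syllables", log=messages.append):
    # Placeholder work standing in for the real syllable-counting loop.
    total = sum(len(w) for w in ["stand", "in", "workload"])

print(messages[0])
```

Wrapping each suspect step in its own `with timed(...)` block is one way to narrow down which loop is eating the time.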
Old 12-08-2019, 07:04 PM   #1330
JSWolf
Resident Curmudgeon
Posts: 80,650
Karma: 150249619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by snarkophilus View Post
Ahh, true. We (I!!) have gone off on a tangent. I've confirmed that normal vs ICU doesn't really affect run time for me.



Almost all the time is spent counting syllables. I added a bit of timing stuff to the plugin (yay, my first actual working change to anything Calibre related!) and I see in my log of Oscar Wilde

Code:
count syllables in all words
 .... count syllables done --- 1539.17500019 seconds ---
and total run time was just over 25 minutes again.

If I insert a return 1607495 right before the for word in words: loop in nltk_lite/textanalyzer.py, then it only takes 29 seconds instead of nearly half an hour.

Counting syllables is difficult?!
Maybe counting syllables is that difficult. Or maybe the routine used is inefficient. You could take a look and see if you can improve it.
Old 12-08-2019, 08:07 PM   #1331
snarkophilus
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by JSWolf View Post
Maybe counting syllables is that difficult. Or maybe the routine used is inefficient. You could give a look and see if you can improve it.
It turns out that counting syllables isn't hard on its own, but it looks like counting the syllables in each word separately to determine the complex word count (words with >= 3 syllables) is harder:

Code:
count all syllables
 .... count all syllables = 270010 done --- 1.28500008583 seconds ---
count syllables in all words for complex words
 .... count syllables done --- 43.6440000534 seconds ---
Turns out that hunch was also incorrect. I dug a bit deeper, and this appears to be the culprit:

Code:
                    for sentence in sentences:
                        if str(sentence).startswith(word):
                            found = True
                            break
If I understand that correctly, for every word we loop over around 13,000 sentences (that's for Endymion, which has only 200,000ish words and is faster to work with) to check if that word appears at the start of a sentence, so we're potentially doing approx 3.5 billion compares?! Give or take a few for early matches of a word at the beginning of a sentence. For Oscar we're potentially doing around 79 billion compares. No wonder this isn't fast.

I'm very new to Python. If this were in Perl I'd think about storing each first word of a sentence in a hash (an associative array) and instead of looping over all sentences for each word just check if the hash value exists. Is this type of thing possible in Python?
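
The worst-case figure for Oscar can be sanity-checked with a couple of lines. This is a back-of-the-envelope sketch that assumes the word and sentence counts from the Complete Works log posted earlier in the thread, and that no word ever matches early enough to cut the inner loop short.

```python
# Counts taken from the Complete Works log posted earlier in the thread.
words = 1155998
sentences = 68486

# Worst case: every word is compared against the start of every sentence.
worst_case = words * sentences
print("%.1f billion startswith() calls" % (worst_case / 1e9))
```

That lands at roughly 79 billion, in line with the estimate above.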
Old 12-08-2019, 08:23 PM   #1332
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by snarkophilus View Post
Ahh, true. We (I!!) have gone off on a tangent. I've confirmed that normal vs ICU doesn't really affect run time for me.
Yes, there are two issues. The ICU count time seems to be a Linux problem. I don't know if it is something in the build, or something in Linux. I'm not really set up to test that.
Quote:
Almost all the time is spent counting syllables. I added a bit of timing stuff to the plugin (yay, my first actual working change to anything Calibre related!) and I see in my log of Oscar Wilde

Code:
count syllables in all words
 .... count syllables done --- 1539.17500019 seconds ---
and total run time was just over 25 minutes again.

If I insert a return 1607495 right before the for word in words: loop in nltk_lite/textanalyzer.py, then it only takes 29 seconds instead of nearly half an hour.

Counting syllables is difficult?!
Not really, it just takes a while. I hadn't looked at that code before, but, it is looking at every character in word, and then running 25 regexes against the word. It does cache the count for each word, though it looks like there is an error with this if the word ends in "e".
Old 12-08-2019, 09:20 PM   #1333
NiLuJe
BLAM!
Posts: 13,506
Karma: 26047202
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
I'm using completely custom source builds, so it's not specific to the binary releases, at least.

If you have that on hand, could you point me to the relevant bit of code? Does that rely on the ICU shim built by Calibre?
Old 12-08-2019, 09:27 PM   #1334
snarkophilus
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Ok, I had a play with trying to use a Python dictionary to store the starts of sentences, and the results are very encouraging.

I'm defining the end of the first word in a sentence using the regex [ ,\.\?:;] which I'm sure can be improved upon. Then store the results of that for every sentence in a dictionary, and check if a word is in that dictionary instead of looping over every sentence.

Because I've changed the rules about what constitutes a sentence start, the number of Complex Words varies slightly from the previous code. It seems to have only a small impact on the scores though. And as mentioned above, this was a rough attempt to define the start of a sentence.
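
The approach described here can be sketched roughly as follows. It is not the work-in-progress code itself: the function names are hypothetical, and a set is used where the post mentions a dictionary, since only membership matters. The delimiter regex is the one quoted above.

```python
import re

# Delimiter set taken from the post above: space , . ? : ;
FIRST_WORD_END = re.compile(r"[ ,.?:;]")


def sentence_starts(sentences):
    """Collect the first word of every sentence into a set, built once."""
    return {
        FIRST_WORD_END.split(str(s).lstrip(), maxsplit=1)[0] for s in sentences
    }


def starts_a_sentence(word, starts):
    """Hash lookup instead of looping over every sentence for every word."""
    return word in starts


sentences = ["Beauty is a form of genius.", "Nothing, he said; nothing at all."]
starts = sentence_starts(sentences)
print(sorted(starts))
print(starts_a_sentence("Beauty", starts), starts_a_sentence("genius", starts))
```

One behavioural difference worth noting: `startswith(word)` also matches a word that is merely a prefix of a sentence's first word, whereas the set lookup only matches whole first words. That may be part of why the complex-word counts shift slightly.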

For Oscar the time comes down from 26 minutes to 37 seconds, and for IOO Classic Books I the time comes down from 2 hours 30 minutes to 1 minute 49 seconds. Some details:

Spoiler:
Code:
Endymion orig code (52 seconds):
        Results of NLTK text analysis:
          Number of complex words: 19600
For this book, using language=eng
        Flesch Reading Ease: 83.5260184816
        Flesch Kincade Grade: 5.58397141746
        Gunning Fog: 10.0747645854

Endymion new code (11 seconds):
        Results of NLTK text analysis:
          Number of complex words: 19449
For this book, using language=eng
        Flesch Reading Ease: 83.5260184816
        Flesch Kincade Grade: 5.58397141746
        Gunning Fog: 10.046453899


Oscar orig code (26 min)
        Results of NLTK text analysis:
          Number of complex words: 109300
  For this book, using language=eng
        Flesch Reading Ease: 79.8637432468
        Flesch Kincade Grade: 6.59164101285
        Gunning Fog: 10.694577889

Oscar new code (37 seconds)
        Results of NLTK text analysis:
          Number of complex words: 108737
  For this book, using language=eng
        Flesch Reading Ease: 79.8637432468
        Flesch Kincade Grade: 6.59164101285
        Gunning Fog: 10.6765774558


IOO Classic Books I orig code (2h30m):
        Flesch Reading Ease: 74.5 (from stats, don't have original log)
        Flesch Kincade Grade: 7.7 (from stats, don't have original log)
        Gunning Fog: 11.8 (from stats, don't have original log)
IOO Classic Books I new code (1m49s):
        Results of NLTK text analysis:
          Number of complex words: 362162
  For this book, using language=eng
        Flesch Reading Ease: 74.5418275204
        Flesch Kincade Grade: 7.83079710708
        Gunning Fog: 11.7300776014


I would guess that this needs much cleanup. I'm sure I've broken many rules in my first attempt to write something vaguely meaningful in Python.

Work in progress is available here
Old 12-08-2019, 09:35 PM   #1335
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by snarkophilus View Post
It turns out that counting syllables isn't hard on its own, but it looks like counting the syllables in each word separately to determine the complex word count (words with >= 3 syllables) is harder:

Code:
count all syllables
 .... count all syllables = 270010 done --- 1.28500008583 seconds ---
count syllables in all words for complex words
 .... count syllables done --- 43.6440000534 seconds ---
Turns out that hunch was also incorrect. I dug a bit deeper, and this appears to be the culprit:

Code:
                    for sentence in sentences:
                        if str(sentence).startswith(word):
                            found = True
                            break
If I understand that correctly, for every word (using Endymion, which has only 200,000ish words and is faster to work with) we loop over around 13,000 sentences to check whether that word appears at the start of a sentence, so we're potentially doing approx 3.5 billion compares?! Give or take a few for early matches of a word at the beginning of a sentence. For Oscar we're potentially doing around 79 billion compares. No wonder this isn't fast.

I'm very new to Python. If this were in Perl I'd think about storing each first word of a sentence in a hash (an associative array) and instead of looping over all sentences for each word just check if the hash value exists. Is this type of thing possible in Python?
(should have refreshed before posting)

You could build a dictionary of the first word in each sentence. That is basically the same as a Perl hash. But I'm not sure that is the issue, and there is a problem with this approach: what is the first word? That should be easy, but the code we are looking at is all about splitting up the words in the correct way. Using "startswith" means you don't need to do that.
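For what it's worth, here is a minimal sketch of that lookup idea. It is not code from the plugin: first_words is a made-up helper, and it assumes the first word can be found by splitting on whitespace and stripping punctuation, which is exactly the part that may not hold for real text. Note also that it tests for an exact first word, while "startswith" is a prefix test, so the two can give different answers.

```python
import string


def first_words(sentences):
    """Collect the first word of each sentence into a set (hypothetical helper)."""
    firsts = set()
    for sentence in sentences:
        tokens = str(sentence).split()
        if not tokens:
            continue
        # Assumption: stripping surrounding punctuation is enough to get the word.
        word = tokens[0].strip(string.punctuation)
        if word:
            firsts.add(word)
    return firsts


sentences = ["The cat sat.", "\"Endymion,\" he said.", "Nothing else happened."]
starts = first_words(sentences)

# One hash lookup per word instead of a loop over every sentence.
print("Endymion" in starts)
print("cat" in starts)
```

A set is enough here because only membership matters; a dict would work the same way if you wanted to store something per first word.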

But I think the real issue is that every occurrence of each word is checked. By caching the result per word, you don't need to count the syllables or check the beginning of the sentences more than once.

I have just tried the following:
Code:
    # This method must be enhanced. At the moment it only
    # considers the number of syllables in a word, which
    # often results in too many complex words being detected.
    def countComplexWords(self, text='', sentences=None, words=None):
        if not sentences:
            sentences = self.getSentences(text)
        if not words:
            words = self.getWords(text)
        complexWords = 0
        # Cache of word -> 1 (complex) or 0 (not complex), so the syllable
        # count and the sentence scan run only once per distinct word.
        cWords = {}

        for word in words:
            is_complex = cWords.get(word, -1)
            if is_complex > -1:
                complexWords += is_complex
                continue

            is_complex = 0
            if self.countSyllables(word) >= 3:
                # Checking proper nouns. If a word starts with a capital letter
                # and is NOT at the beginning of a sentence we don't add it
                # as a complex word.
                if not word[0].isupper():
                    is_complex = 1
                else:
                    for sentence in sentences:
                        if str(sentence).startswith(word):
                            is_complex = 1
                            break

            # Store the final decision and count the word exactly once.
            cWords[word] = is_complex
            complexWords += is_complex
        return complexWords
Just doing the readability stats, the results for the Oscar Wilde book are:
Code:
Count Page/Word Statistics
do_count_statistics - book_path=/tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub, pages_algorithm=2, page_count_mode=Estimate, statistics_to_run=[u'FleschReading', u'GunningFog', u'FleschGrade'], custom_chars_per_page=1500, icu_wordcount=False
do_count_statistics - job started for file book_path=/tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub
-------------------------------
Logfile for book ID 71 (Complete Works)
	Computed 84.6 Flesch Reading
	Computed 10.9 Gunning Fog Index
	Computed 5.9 Flesch-Kincaid Grade
71
do_statistics_for_book:  /tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub 2 Estimate [] [u'FleschReading', u'GunningFog', u'FleschGrade'] 1500 False
	Results of NLTK text analysis:
	  Number of characters: 5468651
	  Number of words: 1251081
	  Number of sentences: 68486
	  Number of syllables: 1538135
	  Number of complex words: 116847
	  Average words per sentence: 18
For this book, using language=eng
DEBUG:    0.0 	Flesch Reading Ease: 84.5539719371
DEBUG:    0.0 	Flesch Kincade Grade: 5.93744835866
DEBUG:    0.0 	Gunning Fog: 10.9358732168
Elapsed time was 2m24s. The attached beta version has this change. I haven't checked if it produces the same stats as before. I didn't want to wait for the results. I'll have to reinstall the released version and test it.
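
As a cross-check, the three scores in the log can be recomputed from the logged counts with the standard published formulas. This is only a sketch, not the plugin's code, and the function names are made up for illustration. It assumes the average words per sentence is the rounded-down 18 shown in the log, because that value (rather than the exact ratio) is what appears to reproduce the logged scores.

```python
# Counts copied from the log above for book ID 71.
WORDS = 1251081
SENTENCES = 68486
SYLLABLES = 1538135
COMPLEX_WORDS = 116847


def flesch_reading_ease(words_per_sentence, syllables_per_word):
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def gunning_fog(words_per_sentence, complex_ratio):
    return 0.4 * (words_per_sentence + 100.0 * complex_ratio)


# Assumption: the average is truncated to a whole number, as the log's
# "Average words per sentence: 18" suggests.
avg_words = WORDS // SENTENCES
avg_syllables = SYLLABLES / float(WORDS)
complex_ratio = COMPLEX_WORDS / float(WORDS)

print("Flesch Reading Ease: %.4f" % flesch_reading_ease(avg_words, avg_syllables))
print("Flesch-Kincaid Grade: %.4f" % flesch_kincaid_grade(avg_words, avg_syllables))
print("Gunning Fog: %.4f" % gunning_fog(avg_words, complex_ratio))
```

With the exact ratio of words to sentences (about 18.27) instead of 18, the scores come out slightly different, so the rounding is worth keeping in mind when comparing against other tools.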
Attached Files
File Type: zip Count Pages-beta.zip (291.5 KB, 335 views)