#1321 |
BLAM!
Posts: 13,506
Karma: 26047202
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
|
@davidfor: The marking appears to happen in a timely manner on my end.
|
#1322 | |
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
|
Quote:
|
|
#1323 | |
Bibliophagist
Posts: 46,828
Karma: 169712582
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Quote:
Code:
Logfile for book ID 8195 (Complete Works)
Found 1155998 words
Method of counting _page_count_mode=Estimate _download_sources=[]
results= {u'PageCount': 4259, u'WordCount': 1155998}
Found 4259 pages
8195 do_statistics_for_book: C:\Users\David\AppData\Local\Temp\calibre_c9bwbz\uvamax_count_pages\8195.epub 0 Estimate [] [u'WordCount', u'PageCount'] 1500 True
Estimated accurate page count
Lines: 132049 Divs: 773 Paras: 42789
Accurate count: 4259 Fast count: 3122
Page count: 4259
Word count using icu_wordcount - trying to count_words
Word count - used count_words: 1155998
Word count: 1155998

Logfile for book ID 8171 (Gremlins Go Home)
Found 35742 words
Method of counting _page_count_mode=Estimate _download_sources=[]
results= {u'WordCount': 35742, u'PageCount': 133}
Found 133 pages
8171 do_statistics_for_book: C:\Users\David\AppData\Local\Temp\calibre_c9bwbz\bqeky5_count_pages\8171.epub 0 Estimate [] [u'WordCount', u'PageCount'] 1500 True
Estimated accurate page count
Lines: 4151 Divs: 20 Paras: 1339
Accurate count: 133 Fast count: 102
Page count: 133
Word count using icu_wordcount - trying to count_words
Word count - used count_words: 35742
Word count: 35742
|
#1324 |
Bibliophagist
Posts: 46,828
Karma: 169712582
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
I was looking at the posted result logs. I noticed that in snarkophilus's log he had the following:
Code:
For this book, using language=eng
Flesch Reading Ease: 79.8637432468
Flesch Kincade Grade: 6.59164101285
Gunning Fog: 10.694577889

Edit: final time was 12 minutes, 17 seconds.

Code:
Logfile for book ID 8195 (Complete Works)
Method of counting _page_count_mode=Estimate _download_sources=[]
results= {u'FleschGrade': 6.591641012852087, u'FleschReading': 79.86374324684013, u'PageCount': 4259, u'WordCount': 1155998, u'GunningFog': 10.694577889041557}
Found 4259 pages
Computed 79.9 Flesch Reading
Computed 6.6 Flesch-Kincaid Grade
Found 1155998 words
Computed 10.7 Gunning Fog Index
8195 do_statistics_for_book: C:\Users\David\AppData\Local\Temp\calibre_bqkn7c\9ohcwo_count_pages\8195.epub 0 Estimate [] [u'PageCount', u'FleschReading', u'FleschGrade', u'WordCount', u'GunningFog'] 1500 True
Estimated accurate page count
Lines: 132049 Divs: 773 Paras: 42789
Accurate count: 4259 Fast count: 3122
Page count: 4259
Word count using icu_wordcount - trying to count_words
Word count - used count_words: 1155998
Word count: 1155998
Results of NLTK text analysis:
Number of characters: 5468651
Number of words: 1251081
Number of sentences: 68486
Number of syllables: 1607495
Number of complex words: 109300
Average words per sentence: 18
For this book, using language=eng
Flesch Reading Ease: 79.8637432468
Flesch Kincade Grade: 6.59164101285
Gunning Fog: 10.694577889

Last edited by DNSB; 12-08-2019 at 01:39 AM.
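For anyone wanting to check the three scores against the counts in the log, here is a minimal sketch using the standard published Flesch Reading Ease, Flesch-Kincaid Grade and Gunning Fog formulas. The function name `readability` is made up for illustration, and the whole-number words-per-sentence figure is an assumption inferred from the logged "Average words per sentence: 18" line, not something confirmed from the plugin source.

```python
def readability(words, sentences, syllables, complex_words):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog).

    Hypothetical helper, not the plugin's own code.
    """
    # ASSUMPTION: words per sentence is truncated to a whole number,
    # as suggested by the logged "Average words per sentence: 18".
    words_per_sentence = words // sentences
    syllables_per_word = syllables / float(words)

    flesch_reading = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    flesch_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    gunning_fog = 0.4 * (words_per_sentence + 100.0 * complex_words / words)
    return flesch_reading, flesch_grade, gunning_fog


# Counts taken from the "Results of NLTK text analysis" lines in the log above.
reading, grade, fog = readability(
    words=1251081, sentences=68486, syllables=1607495, complex_words=109300
)
print(reading, grade, fog)
```

With the truncated average the results agree with the logged 79.86 / 6.59 / 10.69. Using the untruncated average (about 18.27) would shift the Reading Ease figure by roughly 0.3, which is why the truncation assumption is worth stating.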
#1325 | |
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
|
Quote:
To be honest, I think I only enabled those out of some sort of curiosity value. I've certainly never ever used them, and now that I check, I don't even have those columns visible by default anyway. |
|
#1326 | |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
I'll have a look at the other stats when I have time. But, that isn't likely to happen soon.
|
#1327 |
BLAM!
Posts: 13,506
Karma: 26047202
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
|
Okay, finally let it run to completion, and it indeed took ~30min over here.
(That's with the NLTK stuff disabled).

Code:
Count Page/Word Statistics
do_count_statistics - book_path=/tmp/calibre_4.5.0_tmp_Cq86l_/1PH1R6_count_pages/6379.epub, pages_algorithm=0, page_count_mode=Estimate, statistics_to_run=[u'WordCount', u'PageCount'], custom_chars_per_page=1500, icu_wordcount=True
do_count_statistics - job started for file book_path=/tmp/calibre_4.5.0_tmp_Cq86l_/1PH1R6_count_pages/6379.epub
-------------------------------
Logfile for book ID 6379 (Complete Works)
Found 1155998 words
Method of counting _page_count_mode=Estimate _download_sources=[]
results= {u'WordCount': 1155998, u'PageCount': 4259}
Found 4259 pages
6379 do_statistics_for_book: /tmp/calibre_4.5.0_tmp_Cq86l_/1PH1R6_count_pages/6379.epub 0 Estimate [] [u'WordCount', u'PageCount'] 1500 True
Estimated accurate page count
Lines: 132049 Divs: 773 Paras: 42789
Accurate count: 4259 Fast count: 3122
Page count: 4259
Word count using icu_wordcount - trying to count_words
Word count - used count_words: 1155998
Word count: 1155998

Last edited by NiLuJe; 12-08-2019 at 04:36 PM.
#1328 | |
Resident Curmudgeon
Posts: 79,929
Karma: 146918083
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
#1329 | ||
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
|
Quote:
Quote:
Code:
count syllables in all words ....
count syllables done
--- 1539.17500019 seconds ---

If I insert a return 1607495 right before the for word in words: loop in nltk_lite/textanalyzer.py, then it only takes 29 seconds instead of nearly half an hour. Counting syllables is difficult?!
||
#1330 | |
Resident Curmudgeon
Posts: 79,929
Karma: 146918083
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
#1331 | |
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
|
Quote:
Code:
count all syllables ....
count all syllables = 270010
done --- 1.28500008583 seconds ---
count syllables in all words for complex words ....
count syllables done
--- 43.6440000534 seconds ---

Code:
for sentence in sentences:
    if str(sentence).startswith(word):
        found = True
        break

I'm very new to Python. If this were in Perl I'd think about storing each first word of a sentence in a hash (an associative array) and instead of looping over all sentences for each word just check if the hash value exists. Is this type of thing possible in Python?
|
#1332 | ||
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
Quote:
|
||
#1333 |
BLAM!
Posts: 13,506
Karma: 26047202
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
|
I'm using completely custom source builds, so it's not specific to the binary releases, at least.
If you have that on hand, could you point me to the relevant bit of code? Does that rely on the ICU shim built by Calibre? |
#1334 |
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
|
Ok, I had a play with trying to use a Python dictionary to store the starts of sentences, and the results are very encouraging.
I'm defining the end of the first word in a sentence using the regex [ ,\.\?:;] which I'm sure can be improved upon. I then store the result for every sentence in a dictionary, and check whether a word is in that dictionary instead of looping over every sentence.

Because I've changed the rules about what constitutes a sentence start, the number of Complex Words varies slightly from the previous code. It seems to have only a small impact on the scores though. And as mentioned above, this was a rough attempt to define the start of a sentence.

For Oscar the time comes down from 26 minutes to 37 seconds, and for IOO Classic Books I the time comes down from 2 hours 30 minutes to 1 minute 49 seconds.

Some details:
Spoiler:

I would guess that this needs much cleanup. I'm sure I've broken many rules in my first attempt to write something vaguely meaningful in Python.

Work in progress is available here
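The lookup described in this post can be sketched roughly as follows. This is not the code from the work-in-progress link; the names `first_words` and `starts_a_sentence` are hypothetical, and a set is used rather than a dictionary because only membership matters here (a dictionary would behave the same way for this purpose).

```python
import re

# Delimiters taken from the post: space, comma, full stop,
# question mark, colon, semicolon.
FIRST_WORD_END = re.compile(r"[ ,\.\?:;]")


def first_words(sentences):
    """Collect the first word of every sentence into a set (hypothetical helper)."""
    starts = set()
    for sentence in sentences:
        # Split once at the first delimiter and keep what precedes it.
        word = FIRST_WORD_END.split(str(sentence), maxsplit=1)[0]
        if word:
            starts.add(word)
    return starts


def starts_a_sentence(word, starts):
    """Constant-time membership test instead of a loop over all sentences."""
    return word in starts


sentences = ["Then he left.", "Alexander, the king, arrived.", "Who is there?"]
starts = first_words(sentences)

print(starts_a_sentence("Alexander", starts))
print(starts_a_sentence("The", starts))
```

One behavioural difference to keep in mind: a startswith test also accepts prefixes, so "The" matches a sentence beginning "Then he left.", whereas the set lookup only accepts the whole first word. That would be consistent with the slightly different Complex Words count mentioned in the post.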
#1335 | |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
You could build a dictionary of the first word in each sentence. That is basically the same as a Perl hash. But I'm not sure that is the issue, and there is a problem with it: what is the first word? That should be easy, but the code we are looking at is all about splitting up the words in the correct way. Using "startswith" means you don't need to do that.

I think the real issue is that every occurrence of every word is checked. By caching the result, you don't need to count the syllables or check the beginning of the sentences more than once per distinct word. I have just tried the following:

Code:
#This method must be enhanced. At the moment it only
#considers the number of syllables in a word.
#This often results in that too many complex words are detected.
def countComplexWords(self, text='', sentences=[], words=[]):
    if not sentences:
        sentences = self.getSentences(text)
    if not words:
        words = self.getWords(text)
#    words = set(words)
    complexWords = 0
    found = False;
    #Just for manual checking and debugging.
    #cWords = []
    curWord = []
    cWords = {}

    for word in words:
        is_complex = cWords.get(word, -1)
        if is_complex > -1:
            complexWords += is_complex
            continue
#        curWord.append(word)
        if self.countSyllables(word)>= 3:
            complexWords += 1
            cWords[word] = 1
            #Checking proper nouns. If a word starts with a capital letter
            #and is NOT at the beginning of a sentence we don't add it
            #as a complex word.
            if not(word[0].isupper()):
                complexWords += 1
                #cWords.append(word)
            else:
                for sentence in sentences:
                    if str(sentence).startswith(word):
                        found = True
                        break
                if found:
                    complexWords+=1
                    found = False
#            curWord.remove(word)
        else:
            cWords[word] = 0
    #print cWords
    return complexWords

Code:
Count Page/Word Statistics
do_count_statistics - book_path=/tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub, pages_algorithm=2, page_count_mode=Estimate, statistics_to_run=[u'FleschReading', u'GunningFog', u'FleschGrade'], custom_chars_per_page=1500, icu_wordcount=False
do_count_statistics - job started for file book_path=/tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub
-------------------------------
Logfile for book ID 71 (Complete Works)
Computed 84.6 Flesch Reading
Computed 10.9 Gunning Fog Index
Computed 5.9 Flesch-Kincaid Grade
71 do_statistics_for_book: /tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub 2 Estimate [] [u'FleschReading', u'GunningFog', u'FleschGrade'] 1500 False
Results of NLTK text analysis:
Number of characters: 5468651
Number of words: 1251081
Number of sentences: 68486
Number of syllables: 1538135
Number of complex words: 116847
Average words per sentence: 18
For this book, using language=eng
DEBUG: 0.0 Flesch Reading Ease: 84.5539719371
DEBUG: 0.0 Flesch Kincade Grade: 5.93744835866
DEBUG: 0.0 Gunning Fog: 10.9358732168
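The caching idea on its own, separated from the rest of the method, might look like the sketch below. It is a simplified stand-alone version under stated assumptions, not the plugin's code: `count_syllables` is a hypothetical vowel-group stand-in for the NLTK-lite syllable counter, and `is_complex` and `count_complex_words` are illustrative names. Unlike the quick test above, the cached value is stored only after the proper-noun check has run, so every occurrence of a word adds exactly the cached 0 or 1.

```python
def count_syllables(word):
    """HYPOTHETICAL stand-in: counts vowel groups, not the plugin's counter."""
    count = 0
    previous_was_vowel = False
    for ch in word.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    return count


def is_complex(word, sentences):
    """Return 1 if the word counts as complex, else 0."""
    if count_syllables(word) < 3:
        return 0
    if not word[0].isupper():
        return 1
    # Capitalised word: only complex if some sentence starts with it,
    # otherwise it is treated as a proper noun and skipped.
    return int(any(str(s).startswith(word) for s in sentences))


def count_complex_words(words, sentences):
    """Evaluate each distinct word once, then reuse the cached verdict."""
    cache = {}
    total = 0
    for word in words:
        if word not in cache:
            cache[word] = is_complex(word, sentences)
        total += cache[word]
    return total


words = ["beautiful", "Alexander", "cat", "beautiful", "Alexander"]
print(count_complex_words(words, ["The beautiful cat.", "Alexander slept."]))
print(count_complex_words(words, ["The beautiful cat."]))
```

The expensive sentence scan then runs at most once per distinct capitalised word rather than once per occurrence, which is where most of the saving would be expected to come from on a large book.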
|
Tags |
count, count pages, page count, pages, plugin |
|