MobileRead Forums - View Single Post

davidfor · 12-08-2019, 09:35 PM

Quote:

Originally Posted by snarkophilus

It turns out that counting syllables isn't hard on its own, but it looks like counting the syllables in each word separately, to determine the complex word count (words with >= 3 syllables), is much harder:

Code:

count all syllables
 .... count all syllables = 270010 done --- 1.28500008583 seconds ---
count syllables in all words for complex words
 .... count syllables done --- 43.6440000534 seconds ---

Turns out that hunch was also incorrect. I dug a bit deeper, and this appears to be the culprit:

Code:

                    for sentence in sentences:
                        if str(sentence).startswith(word):
                            found = True
                            break

If I understand that correctly, for every word we loop over around 13,000 sentences (for Endymion, which has only 200,000ish words and is faster to work with) to check whether that word appears at the start of a sentence, so we're potentially doing approx 3.5 billion compares?! Give or take a few for early matches of a word at the beginning of a sentence. For Oscar we're potentially doing around 79 billion compares. No wonder this isn't fast.

I'm very new to Python. If this were in Perl I'd think about storing each first word of a sentence in a hash (an associative array) and instead of looping over all sentences for each word just check if the hash value exists. Is this type of thing possible in Python?

(should have refreshed before posting)

You could build a dictionary of the first word in each sentence, which is basically the same as a Perl hash. I'm not sure that is the issue, though. There is also a problem with the approach: what is the first word? That should be easy, but the code we are looking at is all about splitting the text into words correctly. Using "startswith" means you don't need to do that.
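For what it's worth, the dictionary idea can be sketched like this. It is only an illustration: first_words is a made-up helper, the sample sentences are invented, and splitting on whitespace is a naive assumption about what the "first word" is. A set is enough here since only membership matters.

```python
def first_words(sentences):
    """Collect the first whitespace-delimited token of each sentence."""
    firsts = set()
    for sentence in sentences:
        tokens = str(sentence).split()
        if tokens:
            # Naive split: punctuation stays attached, so "Yes," is stored, not "Yes".
            firsts.add(tokens[0])
    return firsts


# Invented sample data, not taken from either book.
sentences = ["There was a pause.", "Dorian looked up.", "Yes, said Basil."]
firsts = first_words(sentences)

print("Dorian" in firsts)                           # True
print("Yes" in firsts)                              # False, the stored token is "Yes,"
print("The" in firsts)                              # False
print(any(s.startswith("The") for s in sentences))  # True, "There" starts with "The"
```

The last two lines show why this is not a drop-in replacement: "startswith" matches on a prefix, so the set lookup and the existing loop can disagree unless the first word is split out the same way the word list is.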

I think the real issue is that every occurrence of each word is checked. By caching the result per word, you don't need to count the syllables or check the beginnings of the sentences more than once for the same word.

I have just tried the following:

Code:

    #This method must be enhanced. At the moment it only
    #considers the number of syllables in a word.
    #As a result, too many complex words are often detected.
    def countComplexWords(self, text='', sentences=[], words=[]):
        if not sentences:
            sentences = self.getSentences(text)
        if not words:
            words = self.getWords(text)
#        words = set(words)
        complexWords = 0
        found = False
        #Just for manual checking and debugging.
        #cWords = []
        curWord = []
        cWords = {}

        for word in words:
            is_complex = cWords.get(word, -1)
            if is_complex > -1:
                complexWords += is_complex
                continue
#            curWord.append(word)
            if self.countSyllables(word)>= 3:
                complexWords += 1
                cWords[word] = 1

                #Checking proper nouns. If a word starts with a capital letter
                #and is NOT at the beginning of a sentence we don't add it
                #as a complex word.
                if not(word[0].isupper()):
                    complexWords += 1
                    #cWords.append(word)
                else:
                    for sentence in sentences:
                        if str(sentence).startswith(word):
                            found = True
                            break

                    if found:
                        complexWords+=1
                        found = False

#            curWord.remove(word)
            else:
                cWords[word] = 0
        #print cWords
        return complexWords
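
One thing to watch with a cache like this is what gets stored. In the method as posted, complexWords is incremented and cWords[word] is set to 1 as soon as the syllable count reaches 3, before the proper-noun check runs, so the total can differ from the uncached version. Below is a minimal sketch of a variant that stores the final decision per word instead. It is a standalone function rather than the plugin's method: count_complex_words and rough_syllables are made-up names, rough_syllables is a crude vowel-run counter standing in for self.countSyllables, and the sample data is invented.

```python
import re


def rough_syllables(word):
    # Crude stand-in for self.countSyllables: counts runs of vowels.
    return len(re.findall(r"[aeiouy]+", word.lower()))


def count_complex_words(words, sentences, count_syllables):
    """Count complex words, deciding once per distinct word."""
    sentence_strs = [str(s) for s in sentences]  # convert once, not per word
    decided = {}  # word -> 1 if it counts as complex, else 0
    total = 0
    for word in words:
        if word not in decided:
            is_complex = 0
            if count_syllables(word) >= 3:
                if not word[0].isupper():
                    is_complex = 1
                elif any(s.startswith(word) for s in sentence_strs):
                    # Capitalised, but it begins some sentence,
                    # so it is not treated as a proper noun.
                    is_complex = 1
            decided[word] = is_complex
        # Every occurrence adds the cached decision exactly once.
        total += decided[word]
    return total


# Invented sample data.
sentences = ["Beautiful things are beautiful to Algernon.", "Beautiful!"]
words = ["Beautiful", "things", "are", "beautiful", "to", "Algernon", "Beautiful"]

print(count_complex_words(words, sentences, rough_syllables))  # 3
```

Here "Beautiful" (twice) and "beautiful" are counted, while "Algernon" has three vowel runs but never starts a sentence, so it is skipped as a proper noun. The sentence scan still happens, but at most once per distinct capitalised word.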

Just doing the readability stats, the results for the Oscar Wilde book are:

Code:

Count Page/Word Statistics
do_count_statistics - book_path=/tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub, pages_algorithm=2, page_count_mode=Estimate, statistics_to_run=[u'FleschReading', u'GunningFog', u'FleschGrade'], custom_chars_per_page=1500, icu_wordcount=False
do_count_statistics - job started for file book_path=/tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub
-------------------------------
Logfile for book ID 71 (Complete Works)
	Computed 84.6 Flesch Reading
	Computed 10.9 Gunning Fog Index
	Computed 5.9 Flesch-Kincaid Grade
71
do_statistics_for_book:  /tmp/calibre_4.5.0_tmp_gQE8_N/A8eL2p_count_pages/71.epub 2 Estimate [] [u'FleschReading', u'GunningFog', u'FleschGrade'] 1500 False
	Results of NLTK text analysis:
	  Number of characters: 5468651
	  Number of words: 1251081
	  Number of sentences: 68486
	  Number of syllables: 1538135
	  Number of complex words: 116847
	  Average words per sentence: 18
For this book, using language=eng
DEBUG:    0.0 	Flesch Reading Ease: 84.5539719371
DEBUG:    0.0 	Flesch Kincade Grade: 5.93744835866
DEBUG:    0.0 	Gunning Fog: 10.9358732168

Elapsed time was 2m24s. The attached beta version has this change. I haven't checked if it produces the same stats as before. I didn't want to wait for the results. I'll have to reinstall the released version and test it.