12-08-2019, 09:27 PM   #1334
snarkophilus
Ok, I had a play with trying to use a Python dictionary to store the starts of sentences, and the results are very encouraging.

I'm defining the end of the first word in a sentence using the regex [ ,\.\?:;] which I'm sure can be improved upon. I then store the first word of every sentence in a dictionary, and check whether a word is in that dictionary instead of looping over every sentence.
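To make the idea concrete, here is a minimal sketch of the dictionary approach, not the actual plugin code. The function name build_sentence_starts, the sample sentences, and the choice to store a count as the dictionary value are all assumptions made for illustration; only the regex comes from the description above.

```python
import re

# Characters treated as ending the first word of a sentence (regex from the post).
FIRST_WORD_END = re.compile(r"[ ,\.\?:;]")


def build_sentence_starts(sentences):
    """Map each sentence's first word to the number of sentences starting with it.

    Hypothetical helper: storing a count as the value is an assumption,
    any value would do if only membership is tested.
    """
    starts = {}
    for sentence in sentences:
        stripped = sentence.strip()
        if not stripped:
            continue
        # Split once at the first delimiter and keep what precedes it.
        first_word = FIRST_WORD_END.split(stripped, maxsplit=1)[0]
        if first_word:
            starts[first_word] = starts.get(first_word, 0) + 1
    return starts


# Made-up sample sentences, purely for illustration.
sentences = ["Endymion woke early.", "Yes, he said.", "Endymion slept."]
starts = build_sentence_starts(sentences)

# A dictionary lookup replaces a loop over every sentence.
print("Endymion" in starts)
print("woke" in starts)
```

The dictionary is built once per book, so each later check is a single hash lookup rather than a pass over all sentences, which is presumably where the speed-up comes from.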

Because I've changed the rules about what constitutes a sentence start, the number of Complex Words varies slightly from the previous code. It seems to have only a small impact on the scores though. And as mentioned above, this was a rough attempt to define the start of a sentence.
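The small impact is what the standard readability formulas would predict: of the three scores, only Gunning Fog uses the complex word count, while the two Flesch scores depend on syllables per word. The sketch below uses the textbook formulas, which may differ in detail from what the plugin computes, and the counts are made-up numbers chosen only to show the effect.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Standard Gunning Fog index: the only score that uses complex words."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


# Hypothetical counts, not taken from any of the books in this post.
words, sentences, syllables = 1000.0, 50.0, 1400.0

# Changing only the complex word count moves Gunning Fog and nothing else.
fog_before = gunning_fog(words, sentences, complex_words=100.0)
fog_after = gunning_fog(words, sentences, complex_words=90.0)
ease = flesch_reading_ease(words, sentences, syllables)
grade = flesch_kincaid_grade(words, sentences, syllables)

print(fog_before, fog_after, ease, grade)
```

Under these formulas, a change in what counts as a sentence start can only show up in Gunning Fog, as long as the word, sentence and syllable totals stay the same.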

For Oscar the time comes down from 26 minutes to 37 seconds, and for IOO Classic Books I the time comes down from 2 hours 30 minutes to 1 minute 49 seconds. Some details:

Spoiler:
Code:
Endymion orig code (52 seconds):
        Results of NLTK text analysis:
          Number of complex words: 19600
For this book, using language=eng
        Flesch Reading Ease: 83.5260184816
        Flesch Kincade Grade: 5.58397141746
        Gunning Fog: 10.0747645854

Endymion new code (11 seconds):
        Results of NLTK text analysis:
          Number of complex words: 19449
For this book, using language=eng
        Flesch Reading Ease: 83.5260184816
        Flesch Kincade Grade: 5.58397141746
        Gunning Fog: 10.046453899


Oscar orig code (26 min)
        Results of NLTK text analysis:
          Number of complex words: 109300
  For this book, using language=eng
        Flesch Reading Ease: 79.8637432468
        Flesch Kincade Grade: 6.59164101285
        Gunning Fog: 10.694577889

Oscar new code (37 seconds)
        Results of NLTK text analysis:
          Number of complex words: 108737
  For this book, using language=eng
        Flesch Reading Ease: 79.8637432468
        Flesch Kincade Grade: 6.59164101285
        Gunning Fog: 10.6765774558


IOO Classic Books I orig code (2h30m):
        Flesch Reading Ease: 74.5 (from stats, don't have original log)
        Flesch Kincade Grade: 7.7 (from stats, don't have original log)
        Gunning Fog: 11.8 (from stats, don't have original log)
IOO Classic Books I new code (1m49s):
        Results of NLTK text analysis:
          Number of complex words: 362162
  For this book, using language=eng
        Flesch Reading Ease: 74.5418275204
        Flesch Kincade Grade: 7.83079710708
        Gunning Fog: 11.7300776014


I would guess that this needs much cleanup. I'm sure I've broken many rules in my first attempt to write something vaguely meaningful in Python.

Work in progress is available here