01-05-2020, 06:11 PM   #88
jackie_w
@Kovid and @davidfor,

I've been working my way through my Editor plugins and I think I've found something a bit odd with the count_words function in calibre.spell.break_iterator in calibre 4.99.2.

I sometimes use count_words in my Editor plugins before and after a plugin's mass cleanups, to raise an immediate alarm if text content was accidentally removed. It runs much slower in 4.99.2 than in 4.8, and I wonder whether you can shed any light on why that might be?
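A minimal sketch of that kind of before/after check, under stated assumptions: naive_count_words is a hypothetical regex-based stand-in so the snippet runs outside calibre, text_was_lost is a hypothetical helper name, and the 1% tolerance is an arbitrary choice. Inside a plugin the counter argument would be calibre's count_words instead.

```python
import re


def naive_count_words(text):
    # Hypothetical stand-in for calibre.spell.break_iterator.count_words,
    # used only so this sketch runs without calibre installed.
    return len(re.findall(r"\w+", text))


def text_was_lost(before_text, after_text, counter=naive_count_words, tolerance=0.01):
    # Compare word counts taken before and after a cleanup.
    # Returns True when the count dropped by more than `tolerance`
    # (a fraction of the original count; 1% is an assumed default).
    before = counter(before_text)
    after = counter(after_text)
    if before == 0:
        return False  # nothing to lose
    return (before - after) / float(before) > tolerance


if __name__ == "__main__":
    original = "It was the best of times, it was the worst of times"
    cleaned = "It was the best of times"
    if text_was_lost(original, cleaned):
        print("WARNING: word count dropped after cleanup")
```

Because the check runs twice per cleanup, any slowdown in the counting function is felt directly in the plugin.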

@davidfor, I know count_words is also an integral part of the Count Pages plugin.

I ran some tests to illustrate the problem. I selected 4 books (2 long, 1 medium, 1 short) in EPUB2 format. They all validate cleanly in both calibre CheckBook and EpubCheck.

For each EPUB I counted the words (Python script in the spoiler below) using 3 different versions of calibre in debug mode:
- Win 64bit 4.99.2
- Win 32bit 4.8
- Win 64bit 4.99.2 run from source (fully up-to-date; shown as 4.99.2* in the results)

These are the results (elapsed times in seconds). As you can see, v4.99.2 runs roughly 10 to 35 times slower than v4.8, while the word counts are identical in every run:
Code:
1. Alexandre Dumas - The Count of Monte Cristo
    DEBUG:    0.0 Start: calibre: 4.99.2 [64bit]; ispy3: True
    DEBUG:    7.4   End: Wordcount: 496791

    DEBUG:    0.0 Start: calibre: 4.8; ispy3: False
    DEBUG:    0.7   End: Wordcount: 496791

    DEBUG:    0.0 Start: calibre: 4.99.2* [64bit]; ispy3: True
    DEBUG:    7.4   End: Wordcount: 496791

2. Peter F Hamilton - The Naked God
    DEBUG:    0.0 Start: calibre: 4.99.2 [64bit]; ispy3: True
    DEBUG:   21.9   End: Wordcount: 455220

    DEBUG:    0.0 Start: calibre: 4.8; ispy3: False
    DEBUG:    0.6   End: Wordcount: 455220

    DEBUG:    0.0 Start: calibre: 4.99.2* [64bit]; ispy3: True
    DEBUG:   22.0   End: Wordcount: 455220

3. EF Benson - Mapp and Lucia
    DEBUG:    0.0 Start: calibre: 4.99.2 [64bit]; ispy3: True
    DEBUG:    2.8   End: Wordcount: 114695

    DEBUG:    0.0 Start: calibre: 4.8; ispy3: False
    DEBUG:    0.2   End: Wordcount: 114695

    DEBUG:    0.0 Start: calibre: 4.99.2* [64bit]; ispy3: True
    DEBUG:    2.9   End: Wordcount: 114695

4. Evelyn Waugh - Vile Bodies
    DEBUG:    0.0 Start: calibre: 4.99.2 [64bit]; ispy3: True
    DEBUG:    1.3   End: Wordcount: 76836

    DEBUG:    0.0 Start: calibre: 4.8; ispy3: False
    DEBUG:    0.1   End: Wordcount: 76836

    DEBUG:    0.0 Start: calibre: 4.99.2* [64bit]; ispy3: True
    DEBUG:    1.3   End: Wordcount: 76836
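To put numbers on the slowdown, the ratios can be worked out from the timings listed above. This is only a rough sketch: the dictionary simply copies the 4.99.2 and 4.8 times from the results, and slowdown is a hypothetical helper name. Since the times are reported to 0.1 s, the ratios for the shorter books are quite coarse.

```python
# (4.99.2 seconds, 4.8 seconds), copied from the results above
timings = {
    "The Count of Monte Cristo": (7.4, 0.7),
    "The Naked God": (21.9, 0.6),
    "Mapp and Lucia": (2.8, 0.2),
    "Vile Bodies": (1.3, 0.1),
}


def slowdown(new_seconds, old_seconds):
    # How many times longer the newer version took.
    return new_seconds / old_seconds


ratios = {title: slowdown(new, old) for title, (new, old) in timings.items()}

for title in sorted(ratios, key=ratios.get, reverse=True):
    print("{0}: about {1:.1f}x slower".format(title, ratios[title]))
```

The slowdown is not proportional to word count: The Naked God has fewer words than The Count of Monte Cristo but takes about three times as long in 4.99.2.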
This was the simple script I used:
Spoiler:
Code:
from __future__ import (unicode_literals, division, absolute_import, print_function)
import sys

from calibre.constants import ispy3, get_version
from calibre.devices.usbms.driver import debug_print
from calibre.ebooks.oeb.base import xml2text
from calibre.ebooks.oeb.polish.container import get_container
from calibre.spell.break_iterator import count_words

if __name__ == "__main__":
    # Path to the EPUB is the first command-line argument
    pathtoepub = sys.argv[1]
    container = get_container(pathtoepub)

    # debug_print prefixes each message with the elapsed time in seconds
    debug_print('Start: calibre: {0}; ispy3: {1}'.format(get_version(), ispy3))
    wordcount = 0
    # spine_names yields (name, is_linear) for each file in the spine
    for name, is_linear in container.spine_names:
        root = container.parsed(name)
        text = xml2text(root)  # text content of the parsed file
        wordcount += count_words(text)

    debug_print('  End: Wordcount: {0}'.format(wordcount))