Old 04-17-2011, 12:25 PM   #13
kiwidude
Calibre Plugins Developer
Cool, thx Kovid.

FYI, I decided to override get_title_tokens, as it caused problems in a few too many edge cases with the Goodreads search. The issue is that it replaces all of those punctuation characters with a space. Some examples: "Charlotte's Web" becomes "charlotte+s+web", "1,000 places" becomes "1+000+places", "Catch-22" becomes "Catch 22", and so on. I also strip off anything in parentheses to get rid of things like "(Omnibus)" and "(2010)" in the title.

So at the moment my function has something like this:
Code:
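import re

# Order matters: the specific patterns run before the final catch-all.
title_patterns = [(re.compile(pat, re.IGNORECASE), repl) for pat, repl in
            [
                (r'\([^()]*\)', ''),      # drop each parenthesised group, e.g. "(Omnibus)", "(2010)"
                (r'(?<=\d),(?=\d)', ''),  # drop only the comma between digits: "1,000" -> "1000"
                (r'\s-', ' '),            # whitespace followed by a hyphen becomes a space
                (r"'", ''),               # drop apostrophes: "Charlotte's" -> "Charlottes"
                (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' ')  # remaining punctuation and whitespace -> space
            ]]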
for pat, repl in title_patterns:
    title = pat.sub(repl, title)
It obviously isn't perfect, but hopefully it does more good than harm.
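
For anyone who wants to try the idea outside the plugin, here is a minimal self-contained sketch. It is not the actual plugin code: clean_title is a hypothetical name of my own, the number pattern is assumed to remove only the comma between digits, and the parenthesis pattern is assumed to strip each group separately. It splits the cleaned title on whitespace to get the search tokens.

```python
import re

# Hypothetical helper for illustration only; not part of calibre's plugin API.
TITLE_PATTERNS = [(re.compile(pat, re.IGNORECASE), repl) for pat, repl in
    [
        (r'\([^()]*\)', ''),      # assumption: strip each "(...)" group on its own
        (r'(?<=\d),(?=\d)', ''),  # assumption: remove only the comma inside a number
        (r'\s-', ' '),            # whitespace followed by a hyphen becomes a space
        (r"'", ''),               # drop apostrophes instead of turning them into spaces
        (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' '),  # everything else -> space
    ]]


def clean_title(title):
    """Apply the patterns in order and return the list of search tokens."""
    for pat, repl in TITLE_PATTERNS:
        title = pat.sub(repl, title)
    # split() with no arguments collapses the runs of spaces left behind.
    return title.split()


if __name__ == '__main__':
    for raw in ["Charlotte's Web", "1,000 Places (2010)", "Catch-22", "Dune (Omnibus)"]:
        print(raw, '->', '+'.join(clean_title(raw)))
```

Because the hyphen is not in the final character class, "Catch-22" stays as a single token, and the apostrophe and thousands comma are removed rather than split on.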