MobileRead Forums - View Single Post

kiwidude · 04-17-2011, 01:25 PM

Cool, thx Kovid.

FYI, I decided to override get_title_tokens because it caused problems with the Goodreads search in a few too many edge cases. The issue is that it replaces all of those characters with a space. Some examples: "Charlotte's Web" becomes "charlotte+s+web", "1,000 places" becomes "1+000+places", "Catch-22" becomes "Catch 22" and so on. I also strip off anything in parentheses to get rid of things like "(Omnibus)" and "(2010)" in the title.
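To make the problem concrete, here is a rough sketch of the effect. It is not calibre's actual code: naive_query and its character list are hypothetical stand-ins, assuming only that every punctuation character is swapped for a space and the resulting tokens are joined with "+" for the search URL.

```python
import re

# Hypothetical stand-in for the default behaviour (an assumption, not
# calibre's real implementation): every punctuation character, including
# apostrophes, commas and hyphens, is replaced by a space.
PUNCTUATION = re.compile(r'''[-:,;+!@#$%^&*(){}.`~"'\s\[\]/]''')

def naive_query(title):
    # Replace punctuation with spaces, lowercase, then join tokens with "+".
    tokens = PUNCTUATION.sub(' ', title).lower().split()
    return '+'.join(tokens)

for example in ("Charlotte's Web", "1,000 places", "Catch-22"):
    print(example, '->', naive_query(example))
```

The stray "s" and the split "1" / "000" tokens are what throw the search off.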

So at the moment my function has something like this:

Code:

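import re

# Applied in order: each substitution runs on the output of the previous one.
title_patterns = [(re.compile(pat, re.IGNORECASE), repl) for pat, repl in
            [
                (r'(\(.*\))', ''),          # drop "(Omnibus)", "(2010)" etc.
                (r'(?<=\d),(?=\d)', ''),    # drop only the comma: "1,000" -> "1000"
                (r'(\s-)', ' '),            # " -" -> " ", leaves "Catch-22" alone
                (r'''[']''', ''),           # "Charlotte's" -> "Charlottes"
                (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' ')
            ]]
for pat, repl in title_patterns:
    title = pat.sub(repl, title)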

It obviously isn't perfect, but hopefully it does more good than harm.
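For anyone who wants to try the idea outside the plugin, here is a self-contained sketch run against the examples above. The wrapper name clean_title is hypothetical, and the comma rule uses lookarounds so that only the comma between digits is dropped. That is my reading of the intent rather than a verbatim copy of the snippet.

```python
import re

TITLE_PATTERNS = [(re.compile(pat, re.IGNORECASE), repl) for pat, repl in
    [
        (r'\(.*\)', ''),            # strip parenthesised suffixes
        (r'(?<=\d),(?=\d)', ''),    # remove thousands separators only
        (r'\s-', ' '),              # whitespace + hyphen becomes a space
        (r"'", ''),                 # drop apostrophes without splitting the word
        (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' '),  # remaining punctuation
    ]]

def clean_title(title):
    # Hypothetical helper: apply each pattern in order, then split into tokens.
    for pat, repl in TITLE_PATTERNS:
        title = pat.sub(repl, title)
    return title.split()

for example in ("Charlotte's Web", "1,000 Places", "Catch-22",
                "Dune (Omnibus)", "Title - Subtitle"):
    print(example, '->', clean_title(example))
```

Note that a hyphen with no whitespace before it (as in "Catch-22") is left untouched, since the hyphen is deliberately not in the final character class.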
