Cool, thx Kovid.
FYI, I decided to override get_title_tokens because the default version caused problems in too many edge cases with the Goodreads search. The issue is that it replaces all of those punctuation characters with a space. Some examples: "Charlotte's Web" becomes "charlotte+s+web", "1,000 places" becomes "1+000+places", "Catch-22" becomes "Catch 22", and so on. I also strip anything in parentheses to get rid of things like "(Omnibus)" and "(2010)" in the title.
So at the moment my function has something like this:
Code:
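import re

title_patterns = [(re.compile(pat, re.IGNORECASE), repl) for pat, repl in
    [
        # Strip parenthesised text such as "(Omnibus)" or "(2010)"
        (r'(\(.*\))', ''),
        # Drop the comma inside numbers: "1,000" -> "1000"
        (r'(?<=\d),(?=\d)', ''),
        # Whitespace followed by a hyphen becomes a single space
        (r'(\s-)', ' '),
        # Drop apostrophes: "Charlotte's" -> "Charlottes"
        (r'''[']''', ''),
        # Any remaining punctuation or whitespace becomes a space
        (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' ')
    ]]
for pat, repl in title_patterns:
    title = pat.sub(repl, title)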
It obviously isn't perfect, but hopefully it does more good than harm.
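In case it helps anyone, here is a rough standalone sketch of how that cleanup behaves on the example titles. The clean_title name and the final split into tokens are just for illustration and are not part of calibre; the real logic sits inside my get_title_tokens override. I am also assuming the number pattern should only drop the comma between digits rather than the whole number.

```python
import re

# Hypothetical standalone helper, for illustration only.
TITLE_PATTERNS = [(re.compile(pat, re.IGNORECASE), repl) for pat, repl in [
    # Strip parenthesised text such as "(Omnibus)" or "(2010)"
    (r'\(.*\)', ''),
    # Assumption: drop only the comma between digits, "1,000" -> "1000"
    (r'(?<=\d),(?=\d)', ''),
    # Whitespace followed by a hyphen becomes a single space
    (r'\s-', ' '),
    # Drop apostrophes so "Charlotte's" stays one token
    (r"'", ''),
    # Any remaining punctuation or whitespace becomes a space
    (r'''[:,;+!@#$%^&*(){}.`~"\s\[\]/]''', ' '),
]]


def clean_title(title):
    """Apply each pattern in order, then split the result into tokens."""
    for pat, repl in TITLE_PATTERNS:
        title = pat.sub(repl, title)
    return title.split()


if __name__ == '__main__':
    for t in ["Charlotte's Web", "1,000 Places", "Catch-22", "Dune (Omnibus)"]:
        print(t, '->', clean_title(t))
```

With these patterns the apostrophe and the thousands comma are removed without splitting the word or number, "Catch-22" is left alone because a hyphen with no space before it is not touched, and "(Omnibus)" is dropped. Note that the parenthesis pattern is greedy, so a title with two parenthesised parts loses everything between the first "(" and the last ")".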