Quote:
Originally Posted by arspr
See my previous post (now edited). I can fully replicate the issue with your sample _before.epub
|
I'll be damned. I can replicate it with your exception file too! But in all fairness... that file is one weird puppy.
It starts off with a UTF-8 byte order mark (which is being included with the first 'bout' entry); each line is terminated with a CR/LF, and then there's an additional LF character in between every entry. What did you use to create it?
I'll see if I can't come up with something to scrub the file of the UTF-8 BOM and the additional LF characters before processing.
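For what it's worth, here's a rough Python sketch of the kind of scrubbing I have in mind. It's not the tool's actual code, and the name load_exceptions is made up for illustration. It assumes the exception file is UTF-8 with one entry per line.

```python
def load_exceptions(raw: bytes) -> list[str]:
    """Return the entries of an exception file, one per non-blank line.

    Hypothetical helper: assumes UTF-8 input with one entry per line.
    """
    # 'utf-8-sig' drops a leading byte order mark if one is present
    # and behaves like plain UTF-8 when there is none.
    text = raw.decode("utf-8-sig")
    # splitlines() breaks on CR/LF, lone LF and lone CR (among other
    # line boundaries), so the stray extra LFs just become empty
    # strings, which are filtered out here.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Fed a file shaped like the one described above (BOM, CR/LF line endings, an extra LF between entries), this should hand back just the entries with nothing stuck to the first one.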
Quote:
OTOH, how do you manage to get ' correctly converted before numeric decades? Because any automated decision you make can be really risky.
|
I don't DO anything. SmartyPants looks for a straight single quote immediately followed by two digits that are immediately followed by the lower-case letter 's' ... and changes that straight single quote to a curly right single quote. There are a few other details that take care of unique situations, but that's the gist of it. I don't really see the "risk" in that. Instead of worrying about it, why not offer up a situation where my tool gets 'XXs wrong? Or calibre's smartener, for that matter. They're no different in that regard. Just straight-up SmartyPants.
Besides ... the decades thing is easy enough to double-check with a regex search.
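To make that concrete, here's a minimal Python sketch of the decades rule as I described it, plus the sort of regex search you could use to double-check the result. This is an illustration, not the actual SmartyPants or plugin source, and the names smarten_decades and find_suspect_decades are made up for the example.

```python
import re

# A straight single quote immediately followed by two digits and a
# lower-case 's'. The lookahead leaves the digits and the 's' untouched.
DECADE = re.compile(r"'(?=\d{2}s)")

# A curly LEFT single quote sitting in front of a decade, which is
# almost certainly a mis-smartened apostrophe.
SUSPECT = re.compile("\u2018\\d{2}s")


def smarten_decades(text: str) -> str:
    """Swap the straight quote in '80s-style decades for a curly right quote."""
    return DECADE.sub("\u2019", text)


def find_suspect_decades(text: str) -> list[str]:
    """List decades that ended up with a left curly quote in front."""
    return SUSPECT.findall(text)
```

Anything that find_suspect_decades turns up is worth a look by eye; an empty list means no decade in the text was given an opening quote by mistake.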
I'm not trying to offer up any new, infallible quotation-smartening logic here. All the caveats for algorithmic quotation-smartening still apply. I'm just looking to add more control over WHAT you want to smarten, to lessen the number of 'tis, 'bout, and 'cept'n foul-ups, and to do so without affecting any code in the document that doesn't pertain to the punctuation being smartened.