MobileRead Forums - View Single Post

Snowman · 02-07-2009, 08:20 PM

I've noticed recently that a surprising number of my ebooks have a mixture of curly quotes and "straight" quotes. Most of the start-of-para quotes are open-curly (“ ), along with a few in-para open-curly. However, most of the close-quotes are the straight version, rather than the close-curly (” ). This is clearly the result of a broken conversion algorithm and bad (or no) copy-editing, and is quite distracting: once you have spotted it, you notice every instance.

So let's defy R.W. Emerson and have a little consistency, please.

It's easy to do a global replace of all quotes to the straight version, but I quite like curly quotes. The reverse conversion is trickier, and not really do-able with a normal programmer's editor. MS Word 2003 apparently can do it, but I don't have that.

So I wrote a quick-and-dirty brute-force style perl script to do it for me. It seems to work quite well, so I've cleaned it up a bit, and have attached it for anybody to use/modify/hack. On a very large book like The Count of Monte Cristo, it runs through in 25 seconds on a 2nd-hand 5 year old machine.

Prerequisites: Perl must be installed on your machine. I am using ActiveState Perl v5.8.8.
The Input: An html file that is reasonably "clean", with one paragraph per line. I use "html tidy" to achieve this. A copy of this file is saved as 'inputfilename.curlybak'. e.g. cristo.html is saved as cristo.html.curlybak.
The Double-quote algorithm: On any one line, the quotes must be balanced in left-right pairs. This will break if a quoted passage spans more than one line, because of an intervening line break, for example.
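To make the pairing rule concrete, here is a minimal sketch in Python (not the attached Perl script). The function name `curl_double_quotes` is made up for illustration, and two things are assumptions on my part: quotes inside html tags (attribute values) are left alone, and quotes that are already curly are counted and re-paired along with the straight ones.

```python
import re

# Splitting on this pattern keeps the tags: even pieces are text, odd pieces are tags.
TAG = re.compile(r'(<[^>]*>)')

def curl_double_quotes(line):
    """Pair double quotes left/right on one line.

    Returns the converted line, or None if the quotes are unbalanced.
    Assumption: existing curly double quotes are treated like straight ones.
    """
    opening = True   # the next quote we meet is an opening quote
    count = 0
    out = []
    for i, piece in enumerate(TAG.split(line)):
        if i % 2 == 1:            # an html tag: copy it through untouched
            out.append(piece)
            continue
        for ch in piece:
            if ch in '"\u201c\u201d':
                out.append('\u201c' if opening else '\u201d')
                opening = not opening
                count += 1
            else:
                out.append(ch)
    if count % 2:                 # odd number of quotes: cannot pair them
        return None
    return ''.join(out)
```

With this sketch, `<p class="x">"Hello," he said.</p>` keeps its attribute quotes and gets a curly pair around Hello, while a line with a single stray quote returns None so the caller can flag it.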
The Single-quote algorithm: This is difficult, because the single quote has several uses. Rather than try for balance, I look at the preceding character (ignoring html tags). If the quote is at the start-of-line, or the preceding character is a space, a left-double-quote, or an open-paren, then the output is a left-curly (‘ ), otherwise it is a right-curly (’ ).

This will break for the (rare) instances of a leading apostrophe (’ ), such as 'ware for "beware". And I'm sure that there are one or two other places it will go wrong.
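The preceding-character rule is short enough to sketch in a few lines of Python. This is only an illustration of the rule as described, not the Perl script itself; the names `OPENERS` and `curl_single_quotes` are hypothetical, and the sketch only touches straight single quotes.

```python
import re

# Characters after which a single quote is taken to be an opening quote:
# space, left double curly quote, open paren.
OPENERS = {' ', '\u201c', '('}

def curl_single_quotes(line):
    out = []
    prev = None   # last visible (non-tag) character; None means start of line
    for i, piece in enumerate(re.split(r'(<[^>]*>)', line)):
        if i % 2 == 1:            # an html tag: copy it, do not update prev
            out.append(piece)
            continue
        for ch in piece:
            if ch == "'":
                # left curly at start of line or after an opener, else right curly
                ch = '\u2018' if prev is None or prev in OPENERS else '\u2019'
            out.append(ch)
            prev = ch
    return ''.join(out)
```

Feeding it `<p>'Don't go,' she said.</p>` gives a left-curly at the start and right-curlies for the apostrophe and the closing quote. A line such as `Mind, 'ware the dog` shows the weak spot: the leading apostrophe follows a space, so it comes out as a left-curly.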
Error Handling: The two errors that the script can detect are "unbalanced double quotes" and "html tag opened but not closed". In both of these cases, the entire line is transcribed unchanged, but prefixed with an anchor like
<a id="baddoublequote-12"></a>. This makes the erroneous lines easy to find and manually correct. An index to these anchors is also appended to the end of the output file just before the </body> tag.
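Here is a rough Python sketch of how the marking and the index could fit together. It is a guess at the shape, not a copy of the script: the function name `mark_bad_lines`, the running counter used in the anchor ids, and the layout of the index (a paragraph of links) are all assumptions.

```python
def mark_bad_lines(lines, convert):
    """Apply convert() to each line; anchor the lines it rejects.

    convert(line) must return the converted line, or None on error.
    Assumptions: ids are numbered with a running counter, and the
    index is a single paragraph of links placed before </body>.
    """
    out, ids = [], []
    for line in lines:
        fixed = convert(line)
        if fixed is None:
            anchor_id = f'baddoublequote-{len(ids) + 1}'
            ids.append(anchor_id)
            # Leave the line unchanged, just prefix the anchor.
            out.append(f'<a id="{anchor_id}"></a>{line}')
        else:
            out.append(fixed)
    if ids:
        links = ' '.join(f'<a href="#{i}">{i}</a>' for i in ids)
        index = f'<p>{links}</p>'
        for n, line in enumerate(out):
            if '</body>' in line:
                out[n] = line.replace('</body>', index + '</body>', 1)
                break
        else:
            out.append(index)     # no </body> found: append at the very end
    return out
```

Any per-line converter that returns None on failure can be plugged in as `convert`, so the same loop would serve for the broken-tag check as well.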
Protection: If you want to protect part of the html file from being mangled by the script, you can use a pair of dummy html tags that contain the strings "curly-off" and "curly-on"

For example, with dummy tags such as <curly-off> and <curly-on>: ... lots of text <curly-off>text with manually tailored quoting<curly-on> more text ...
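The on/off switching can be sketched in Python as below. This is an illustration only: `convert_unprotected` is a hypothetical name, the marker is taken to be any tag whose text contains "curly-off" or "curly-on", and the sketch converts each unprotected stretch separately, which is a simplification (a per-line balance check would need to see the stretches together).

```python
import re

# Any tag containing one of the two marker strings.
MARKER = re.compile(r'(<[^>]*curly-(?:off|on)[^>]*>)')

def convert_unprotected(lines, convert):
    """Yield lines with convert() applied only outside protected regions."""
    protected = False             # state carries over from line to line
    for line in lines:
        out = []
        for i, piece in enumerate(MARKER.split(line)):
            if i % 2 == 1:        # a marker tag: copy it and switch state
                protected = 'curly-off' in piece
                out.append(piece)
            else:
                out.append(piece if protected else convert(piece))
        yield ''.join(out)
```

Because the flag lives outside the per-line loop, a protected region can start on one line and end several lines later.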
Running: This is a command-line tool. From a command window, navigate to your working directory. If you have only ONE .html file in this directory, simply call the script with no parameters, otherwise pass the filename in as its only parameter.

The script can be re-run on the same file multiple times with no problems.
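For anyone porting the tool, the file selection and backup naming could look something like this in Python. It is a sketch under assumptions: the names `choose_input` and `backup_name` are invented, and stopping with a message when there is no single .html file is my guess at sensible behaviour, not something taken from the script.

```python
from pathlib import Path

def choose_input(argv, directory='.'):
    """Pick the input file: the one parameter, or the only .html file."""
    if len(argv) > 1:
        return Path(argv[1])
    candidates = sorted(Path(directory).glob('*.html'))
    if len(candidates) != 1:
        # Assumption: refuse to guess when there is no unique candidate.
        raise SystemExit(f'expected exactly one .html file, found {len(candidates)}')
    return candidates[0]

def backup_name(path):
    """cristo.html -> cristo.html.curlybak (same directory)."""
    return path.with_name(path.name + '.curlybak')
```

A caller would pass `sys.argv` to `choose_input`, rename the chosen file to `backup_name(path)`, and then write the converted text under the original name.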
Unicode: All my data files are utf-8. If yours are not, then remove the two 'binmode' statements from the script.

This is what a typical run looks like:

Code:

D:\E-Library\work\processing>..\_bin\curly.pl

Readin The_Count_of_Monte_Cristo.html
Input file renamed to The_Count_of_Monte_Cristo.html.curlybak

STATS
-----
lines=12372
double-quote count=30595
lines with unmatched double-quotes=211
single-quote count=4218, lsquotes output=654, rsquotes output=3564
html tag count=25395
lines with broken tags=0
processing time 25 seconds

total time taken 26 seconds

D:\E-Library\work\processing>

So here I will have to manually correct 211 paragraphs - all marked with an anchor so that they are easily found. It won't take long. For a smaller book, I rarely get an error count above 1 or 2.

Good luck

Snowman