Old 02-07-2009, 07:20 PM   #1
Snowman
Connoisseur
Posts: 97
Karma: 196
Join Date: Aug 2008
Location: London UK
Device: iPhone 5, Kindle K3
A tool for converting to curly quotes

I've noticed recently that a surprising number of my ebooks contain a mixture of curly quotes and "straight" quotes. Most of the start-of-paragraph quotes are open-curly (“ ), along with a few open-curly quotes within paragraphs. However, most of the close-quotes are the straight version rather than the close-curly (” ). This is clearly the result of a broken conversion algorithm and bad (or no) copy-editing, and it is quite distracting: once you have spotted it, you notice every instance.

So let's defy R.W. Emerson and have a little consistency, please.

It's easy to do a global replace of all quotes to the straight version, but I quite like curly quotes. The reverse conversion is more tricky, and not really do-able with a normal programmer's editor. MS Word 2003 apparently can, but I don't have that.

So I wrote a quick-and-dirty, brute-force Perl script to do it for me. It seems to work quite well, so I've cleaned it up a bit and attached it for anybody to use/modify/hack. On a very large book like The Count of Monte Cristo, it runs through in 25 seconds on a second-hand, 5-year-old machine.
  • Prerequisites: Perl must be installed on your machine. I am using ActiveState Perl v5.8.8.
  • The Input: An html file that is reasonably "clean", with one paragraph per line. I use "html tidy" to achieve this. A copy of this file is saved as 'inputfilename.curlybak'. e.g. cristo.html is saved as cristo.html.curlybak.
  • The Double-quote algorithm: On any one line, the quotes must be balanced in left-right pairs. This will break if the quotes span a line because of an intervening <br>, for example.
  • The Single-quote algorithm: This is difficult, because the single quote has multiple usage. Rather than try for balance, I look at the preceding character (ignoring html tags). If at the start-of-line, or the preceding character is a space, a left-double-quote, or an open-paren, then the output is a left-curly (&lsquo; ), otherwise it is a right-curly (&rsquo; ).

    This will break for the (rare) instances of a leading apostrophe (&rsquo; ) in cases such as 'ware for "beware" for example. And I'm sure that there are one or two other places it will go wrong.
  • Error Handling: The two errors that the script can detect are "unbalanced double quotes" and "html tag opened but not closed". In both of these cases, the entire line is transcribed unchanged, but prefixed with an anchor like
    <a id="baddoublequote-12"></a>. This makes the erroneous lines easy to find and manually correct. An index to these anchors is also appended to the end of the output file just before the </body> tag.
  • Protection: If you want to protect part of the html file from being mangled by the script, you can use a pair of dummy html tags that contain the strings "curly-off" and "curly-on"

    For example: ... lots of text<span class="curly-off"></span>text with manually tailored quoting<span class="curly-on"></span> more text ...
  • Running: This is a command-line tool. From a command window, navigate to your working directory. If you have only ONE .html file in this directory, simply call the script with no parameters, otherwise pass the filename in as its only parameter.

    The script can be re-run on the same file multiple times with no problems.
  • Unicode: All my data files are utf-8. If yours are not, then remove the two 'binmode' statements from the script.
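
The double-quote and single-quote rules above can be sketched in a few lines. This is a Python illustration, not the attached Perl script: the name curl_line, the tag-splitting regex, and the step that first turns existing curly double quotes back into straight ones are all assumptions made for the example.

```python
import re

# Assumption: tags are well formed and never span a line.
TAG_SPLIT = re.compile(r"(<[^>]*>)")   # the capture group keeps tags as pieces
CURLY_DOUBLES = ("\u201c", "\u201d", "&ldquo;", "&rdquo;")

def curl_line(line):
    """Return (new_line, balanced) for one paragraph-per-line string."""
    original = line
    # Assumption: normalise existing curly doubles so mixed lines can be paired.
    for curly in CURLY_DOUBLES:
        line = line.replace(curly, '"')
    pieces = TAG_SPLIT.split(line)     # even indices are text, odd are tags
    text_only = "".join(p for i, p in enumerate(pieces) if i % 2 == 0)
    if text_only.count('"') % 2:
        return original, False         # unbalanced: leave the line untouched

    out = []
    opening = True                     # the next double quote is an opener
    prev = ""                          # last text character seen, tags ignored
    for i, piece in enumerate(pieces):
        if i % 2:                      # a tag: copy through unchanged
            out.append(piece)
            continue
        for ch in piece:
            if ch == '"':
                out.append("&ldquo;" if opening else "&rdquo;")
                prev = "\u201c" if opening else "\u201d"
                opening = not opening
            elif ch == "'":
                # Left quote at start of line or after space, "(" or a left double quote.
                left = prev in ("", " ", "(", "\u201c")
                out.append("&lsquo;" if left else "&rsquo;")
                prev = ch
            else:
                out.append(ch)
                prev = ch
    return "".join(out), True
```

A line such as `"'Tis," he said.` shows the leading-apostrophe weakness directly: the single quote follows a left double quote, so this sketch emits &lsquo; where &rsquo; is wanted.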

This is what a typical run looks like:
Code:
D:\E-Library\work\processing>..\_bin\curly.pl

Readin The_Count_of_Monte_Cristo.html
Input file renamed to The_Count_of_Monte_Cristo.html.curlybak

STATS
-----
lines=12372
double-quote count=30595
lines with unmatched double-quotes=211
single-quote count=4218, lsquotes output=654, rsquotes output=3564
html tag count=25395
lines with broken tags=0
processing time 25 seconds

total time taken 26 seconds

D:\E-Library\work\processing>
So here I will have to manually correct 211 paragraphs - all marked with an anchor so that they are easily found. It won't take long. For a smaller book, I rarely get an error count above 1 or 2.
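
For anyone wanting to reproduce the anchor marking in another tool, here is a rough Python sketch of the idea. It is not taken from the script: the function name mark_bad_lines, the is_ok predicate, the sequential numbering of the anchors, and the layout of the index paragraph are all assumptions.

```python
def mark_bad_lines(lines, is_ok):
    """Prefix failing lines with a numbered anchor; add an index before </body>."""
    out, bad_ids = [], []
    for line in lines:
        if is_ok(line):
            out.append(line)
            continue
        # Assumption: anchors are numbered in order of appearance.
        anchor_id = f"baddoublequote-{len(bad_ids) + 1}"
        bad_ids.append(anchor_id)
        out.append(f'<a id="{anchor_id}"></a>{line}')   # line itself is unchanged

    if bad_ids:
        # Assumption: the index is a single paragraph of links to the anchors.
        links = " ".join(f'<a href="#{i}">{i}</a>' for i in bad_ids)
        index = f"<p>{links}</p>"
        for pos in range(len(out) - 1, -1, -1):
            if "</body>" in out[pos]:
                out.insert(pos, index)                  # just before </body>
                break
        else:
            out.append(index)                           # no </body> found
    return out
```

A predicate as simple as `lambda line: line.count('"') % 2 == 0` is enough to try it out; searching the output for "baddoublequote-" then jumps from one problem paragraph to the next.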

Good luck

Snowman
Attached Files
File Type: pl curly.pl (6.0 KB, 164 views)
Old 02-08-2009, 03:40 AM   #2
Jellby
frumious Bandersnatch
Posts: 5,974
Karma: 4346919
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by Snowman View Post
The Double-quote algorithm: On any one line, the quotes must be balanced in left-right pairs. This will break if the quotes span a line because of an intervening <br>, for example.
Or if a quoted text spans several paragraphs. In that case each paragraph starts with an opening quote mark, but only the last one ends with a closing quote mark.

Quote:
The Single-quote algorithm: This is difficult, because the single quote has multiple usage. Rather than try for balance, I look at the preceding character (ignoring html tags). If at the start-of-line, or the preceding character is a space, a left-double-quote, or an open-paren, then the output is a left-curly (&lsquo; ), otherwise it is a right-curly (&rsquo; ).

This will break for the (rare) instances of a leading apostrophe (&rsquo; ) in cases such as 'ware for "beware" for example. And I'm sure that there are one or two other places it will go wrong.
Not so rare, depending on the text. There are books with lots of 'tis, 'twas, 'em, 'im, etc. It can also be a bit tricky when you have a preceding em-dash... And worse, there are books that use single quotes for top-level quote marks (mainly British, I think).

I think I'll continue using partially manual search and replace with vim. I also try to distinguish between closing single quotes (&rsquo;) and curly apostrophes (&#8217;). They are both the same character (glyph), but using different codes in the source HTML allows me to easily exchange single and double quotes without affecting apostrophes, if needed.

I usually first replace every instance of ([letter]'[letter]) with the apostrophe, then search for (s') and put in the apostrophe where needed. After that I search one by one for every remaining (") or (') and replace it with an opening or closing single or double quote, or an apostrophe. Each case is bound to a single key, so it's relatively quick and easy, and I can keep track of nesting levels and multi-paragraph quotes. As a bonus, I can also detect many cases of missing or wrong quote marks!
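
The first two passes of that workflow translate fairly directly into regular expressions. The following is a Python approximation rather than the actual vim commands; the function names are made up for the example, and "letter" is taken to mean any Unicode letter.

```python
import re

APOSTROPHE = "&#8217;"        # numeric code for apostrophes, kept distinct from &rsquo;
LETTER = r"[^\W\d_]"          # any Unicode letter

def mark_inner_apostrophes(text):
    """Pass 1: a straight quote between two letters is always an apostrophe."""
    return re.sub(rf"(?<={LETTER})'(?={LETTER})", APOSTROPHE, text)

def trailing_s_candidates(text):
    """Pass 2: offsets of s' that need a human decision (possessive or close quote)."""
    return [m.start() for m in re.finditer(rf"s'(?!{LETTER})", text)]
```

On "the boys' hats, 'yes' and it's" the second function reports both boys' and yes', which is exactly why that pass stays manual: one is a possessive apostrophe and the other a closing quote.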

Last edited by Jellby; 02-08-2009 at 08:39 AM.
Old 02-08-2009, 04:10 AM   #3
HarryT
eBook Enthusiast
Posts: 62,558
Karma: 40125235
Join Date: Nov 2006
Location: UK
Device: PW2, iPad Retina Mini, iPhone 4, MS Surface Pro, Onyx T68, N7,
Quote:
Originally Posted by Jellby View Post
Or if a quoted text spans several paragraphs, in this case each paragraph starts with an opening quote mark but only the last one ends with a closing quote mark.
This is extremely common, especially in 19th century novels.

Quote:
And worse, there are books that use single quotes for top-level quote marks (mainly British, I think).
Yes - virtually all the Dickens books I've uploaded to MR use this convention.
Old 02-08-2009, 07:02 AM   #4
nrapallo
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530531
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
Quote:
Originally Posted by Jellby View Post
I think I'll continue using partially manual search and replace with vim. I also try to distinguish between closing single quotes (&rsquo;) and curly apostrophes (&#8217;).
Just in case you're wondering, those wink smilies can be disabled for any post by choosing the "Disable smilies in text" option in the Additional Options section below the Message Composition window. It's an all-or-nothing option, though. FYI. :)

Miscellaneous Options
o Show your signature
o Automatically parse links in text
o Disable smilies in text
Old 02-08-2009, 07:53 AM   #5
tompe
Grand Sorcerer
Posts: 6,992
Karma: 3726689
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Nexus 7, Nexus 4, iPad 2, Notion Ink Adam Qi, Kindle WiFi, Kindle PW
How is this solved in ePub? The real solution ought to be to mark up the text so that all the different variants can be generated depending on country, language, and the reader's wishes.

So you would have something like: <q> This is a nested <q>example</q>.</q>

which would be translated to e.g: ``This is a nested `example'.''
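
To make the idea concrete, here is a small Python sketch of such a translation step. Nothing in it comes from ePub or from any reader software: the STYLES table, its keys, and the function name render_q are invented for the example, and it only handles bare <q> tags without attributes.

```python
import re

# Hypothetical style table: (open, close) pairs for nesting levels 1 and 2.
STYLES = {
    "tex":   [("``", "''"), ("`", "'")],
    "en-US": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "en-GB": [("\u2018", "\u2019"), ("\u201c", "\u201d")],
}

def render_q(text, style="en-US"):
    """Replace nested <q>...</q> markup with the quote marks of one style."""
    pairs = STYLES[style]
    depth = 0
    out = []
    for token in re.split(r"(</?q>)", text):
        if token == "<q>":
            out.append(pairs[depth % len(pairs)][0])   # opener for this level
            depth += 1
        elif token == "</q>":
            depth -= 1
            out.append(pairs[depth % len(pairs)][1])   # closer for this level
        else:
            out.append(token)
    return "".join(out)
```

With the "tex" style, `render_q("<q>This is a nested <q>example</q>.</q>", "tex")` gives the ``...'' form shown above, and switching the key to "en-GB" swaps single and double marks without touching the text.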
Old 02-08-2009, 08:46 AM   #6
Jellby
Quote:
Originally Posted by nrapallo View Post
Just in case you're wondering, those wink smilies can be disabled
Thanks, I knew that, I just forgot in that post (and didn't preview it)

Quote:
Originally Posted by tompe View Post
How is this solved in ePub. The real solution ought to be to mark up the text so that all different variants can be generated dependent on country, language and wishes from the reader.

So you would have something like: <q> This is a nested <q>example</q>.</q>

which would be translated to e.g: ``This is a nested `example'.''
That would be good indeed, but I don't think it's possible with current ePUB. With CSS I've used the ":after" and ":before" pseudo-elements together with the "quotes" property and "content: open-quote" or "content: close-quote", but I don't think the ePUB specification supports these things now. I believe our only choice is hard-coding the quote marks, which is not so bad after all; a line has to be drawn somewhere in user-configurability...
Old 02-08-2009, 10:43 AM   #7
Snowman
Thanks for the replies; I agree with what you say. There is no way such a simple-minded tool could cope with the many different 'house' styles around, especially the older ones. And I'm only using it on those ebooks where a half-baked conversion has been done, leaving a nasty mix of straight and curly quotes. These are usually etext conversions of books published from around 1970 onwards (often those with the title and author centered between rows of hyphens on page 1).

The single quote section was a bit of an afterthought, and I'm still dithering about whether I will retain it.

I used 'Cristo' as the example as this happened to be the largest book I had conveniently available, and gave a good idea of the runtime. The usual book normally runs through in less than 5 secs.

I clearly misnamed the thread. 'Checker' or 'correction' would have conveyed the intention better than 'converting'.

Still, there it is. I leave it to the reader whether to use it as another tool in the armoury, to use it as a framework to build something better, or to use it as an awful example of how not to do it.

regards

Snowman
Old 02-08-2009, 12:22 PM   #8
Patricia
Reader
Posts: 11,520
Karma: 2199070
Join Date: May 2007
Location: South Wales, UK
Device: Sony PRS-500, PRS-505, Asus EEEpc 4G
I find that whatever I use, it is still essential to check all instances of quotation marks after emdashes; and a good idea to check quotation marks before emdashes just in case.
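
That check is easy to turn into a list of lines to review. Below is a minimal Python sketch; the function name lines_to_check is made up, and treating a double hyphen as an em-dash is an assumption that may not suit every text.

```python
import re

# Em-dash as a character, as an entity, or (assumption) typed as a double hyphen.
DASH = "(?:\u2014|&mdash;|&#8212;|--)"
# Straight or curly quote characters, or the four quote entities.
QUOTE = "(?:[\"'\u2018\u2019\u201c\u201d]|&[lr][sd]quo;)"
NEAR_DASH = re.compile(f"{DASH}{QUOTE}|{QUOTE}{DASH}")

def lines_to_check(lines):
    """Return 1-based numbers of lines where a quote mark touches an em-dash."""
    return [n for n, line in enumerate(lines, 1) if NEAR_DASH.search(line)]
```

It only reports where to look; whether the mark should open or close is still a decision for the person doing the checking.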