Old 05-16-2009, 01:06 PM   #16
JSWolf
Resident Curmudgeon
Posts: 73,983
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by pepak View Post
Actually, quotes are quite doable using regular expressions. This regexp won't work 100%, but it will work most of the time:

Search = ([>_])’(.*?[^a-z_])’([<_])
Replace = $1opening_quote$2closing_quote$3

(note: underscore in the search string represents a space)
What happens if there is a missing quote?
Old 05-16-2009, 02:57 PM   #17
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
The script should also identify apostrophes, in words like 'em, 'tis, and other transcriptions of spoken language (much too often one finds an opening quote in those cases instead). I also try to properly mark single quotes and apostrophes, so that I could convert single quotes into double quotes without fear of ruining apostrophes.

I do this with a mix of regexp and manual search and replace (each occurrence with the right character, which I map to hotkeys so that it's relatively easy to run along the text). This also helps locate possible missing quotes, and at the end there's always the reading phase, to confirm everything's right.

If you are going the LaTeX way, you should also check the spacing after full stops and question/exclamation marks. By convention LaTeX puts a wider space after those, which you'd have to suppress (with a \@ after the sign) in abbreviations or other not-end-of-sentence cases, and force (with a \@ before) when they come after a capital letter. I did that for my Lewis Carroll PDFs, and it's time consuming, but the result looks great (to me).
Old 05-16-2009, 03:35 PM   #18
pepak
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
Quote:
Originally Posted by JSWolf View Post
What happens if there is a missing quote?
Nothing - it doesn't get converted at all.
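A minimal Python sketch of that behaviour, for anyone who wants to try it. It assumes the underscore in the posted pattern is written as a real space, that Python's `re` module treats this pattern the same way the original tool does, and that `&lsquo;`/`&rsquo;` stand in for the opening_quote/closing_quote placeholders. The name `convert_quotes` is made up for illustration.

```python
import re

# The posted pattern, with "_" written as a literal space.
# The curly apostrophe (U+2019) is kept exactly as posted.
QUOTE_RE = re.compile(r"([> ])’(.*?[^a-z ])’([< ])")

def convert_quotes(text):
    """Replace matched quote pairs; unmatched quotes are left untouched."""
    return QUOTE_RE.sub(r"\1&lsquo;\2&rsquo;\3", text)

paired = "<p>He said ’xylophone,’ did he?</p>"
missing = "<p>He said ’xylophone, did he?</p>"

print(convert_quotes(paired))   # the pair is converted
print(convert_quotes(missing))  # no closing quote, so nothing changes
```

Note that a stray quote is only left alone when no later quote on the same line can close it; if there is one, the two may still be paired up.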

Quote:
Originally Posted by Jellby
The script should also identify apostrophes, in words like 'em, 'tis, and other transcriptions of spoken language (much too often one finds an opening quote in those cases instead).
My regexp does that, except in the case when the apostrophed word appears before the actual opening quote. (That is, it will work fine if 'em is inside quotes and will work fine if there are no quotes after 'em).
Old 05-17-2009, 05:51 AM   #19
Jellby
frumious Bandersnatch
Quote:
Originally Posted by pepak View Post
My regexp does that, except in the case when the apostrophed word appears before the actual opening quote. (That is, it will work fine if 'em is inside quotes and will work fine if there are no quotes after 'em).
What if it comes after a word like o' or callin' ?

I know it looks like I'm just trying to find the most difficult case, but I have actually found this in some of the Wodehouse books I've made, so it's a real case.
Old 05-17-2009, 06:04 AM   #20
pepak
Guru
Quote:
Originally Posted by Jellby View Post
What if it comes after a word like o' or callin' ?
It works, of course. Why don't you try it?
Old 05-17-2009, 06:17 AM   #21
Jellby
frumious Bandersnatch
Quote:
Originally Posted by pepak View Post
It works, of course. Why don't you try it?
I would, but it's not vim-regexp... would you care to explain it?

What would it do with:

Code:
'Don't come callin' 'em so late'
I'd like to turn it into:

Code:
&lsquo;Don&#8217;t come callin&#8217; &#8217;em so late&rsquo;
Old 05-17-2009, 06:28 AM   #22
pepak
Guru
Quote:
Originally Posted by Jellby View Post
I would, but it's not vim-regexp... would you care to explain it?
It's about as standard regexp as they come. I use specifically FAR Manager's Regular Expression Search And Replace plugin, but the same code would work with e.g. PHP's ereg(i).

Quote:
What would it do with:
Code:
'Don't come callin' 'em so late'
It wouldn't do anything because the sentence has no proper punctuation and no HTML paragraph tags. If you wanted to convert:
Code:
<p>'Don't come callin' 'em so late.'</p>
or even
Code:
<p>'Don't come callin' 'em so late,' he said angrily.</p>
it would convert it into:
Code:
<p>{left_quote}Don't come callin' 'em so late.{right_quote}</p>
After that, you could do a simple search-and-replace for changing apostrophes to &#8217;. Though I would postpone that until after I read the book through to make sure there were no forgotten quote-apostrophes left.
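The two steps can be tried out with a short Python sketch. It assumes the straight-apostrophe form of the pattern, a literal space for the underscore, and Python's `re` module in place of the FAR Manager plugin; the function names are hypothetical.

```python
import re

# Straight-apostrophe form of the pattern, "_" written as a space.
QUOTE_RE = re.compile(r"([> ])'(.*?[^a-z ])'([< ])")

def mark_quotes(text):
    # Step 1: replace only the outer quote pair with placeholders.
    return QUOTE_RE.sub(r"\1{left_quote}\2{right_quote}\3", text)

def apostrophes_to_entity(text):
    # Step 2: every ' still left is assumed to be an apostrophe.
    return text.replace("'", "&#8217;")

bare = "'Don't come callin' 'em so late'"
para = "<p>'Don't come callin' 'em so late,' he said angrily.</p>"

print(mark_quotes(bare))   # unchanged: no punctuation, no tags
print(mark_quotes(para))
print(apostrophes_to_entity(mark_quotes(para)))
```

Running step 2 on the bare sentence would turn its outer quotes into apostrophes as well, which is one reason to leave that step until after a read-through.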
Old 05-17-2009, 06:37 AM   #23
Jellby
frumious Bandersnatch
Quote:
Originally Posted by pepak View Post
It wouldn't do anything because the sentence has no proper punctuation and no HTML paragraph tags.
Ah, I see... so it needs some punctuation before the closing quote. I'll see how that works for my next project.

Ensuring properly nested single and double quotes when the source is not consistent would be harder, though.
Old 05-17-2009, 06:56 AM   #24
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
@pepak re: CSS

I think I need to study up on it. I'm using a sort of bastardized HTML mix. I've been using clips (macros) in NoteTab for a long time now, so I re-checked what I'm doing. I actually use both <a name="chapter_ChapterNumber"> and <h3 id="chapter_ChapterNumber" class="chapter" align="center"> in my files to mark a chapter. A combination of overkill and ignorance.

I think it comes from haphazardly learning it as I needed it, and from working towards the obsolete REB1100. (Did you know that if you want a cover to appear on the first page of an REB1100 ebook, you must wrap it in a <center> tag? Otherwise, invisible!)

Providing the CSS with the markup allows someone to simply change the CSS file to suit their needs. Awesome. Currently, without a new reader, I can't think of a way to write the code in such a way as to ensure forward-compatibility.

Another excuse to buy a new reader!

And you've inspired me to consider rewriting my macros to include such things as <em class="psionic">. Which just reads cool.

As for <span>, I'm still a little confused; I get the <div> styles thing -- open and close a style on what is otherwise a normal something-or-other, only distinguished by its class. And I get that the sections thing makes sense for, say, auto-searching the structure of a document, and offering an outline or some-such.

But SPANs? Are we talking something that parallels <h> and <p>? Open a <span> on something and close it so that you can apply a sub-style there too? ie: <span class="dialogue"> or <span class="paragraph">? Or is it lesser? For instance: <span class="italic">? Or am I missing something completely? Which is entirely likely.

m a r
Old 05-17-2009, 07:03 AM   #25
Jellby
frumious Bandersnatch
Quote:
Originally Posted by rogue_ronin View Post
As for <span>, I'm still a little confused; I get the <div> styles thing -- open and close a style on what is otherwise a normal something-or-other, only distinguished by its class. And I get that the sections thing makes sense for, say, auto-searching the structure of a document, and offering an outline or some-such.
<div> is a generic block container, it could be the same as <p>, <h3>, <blockquote>.

<span> is a generic in-line container, it's like <em>, <code> or <strong> but without any default meaning.
Old 05-17-2009, 08:02 AM   #26
pepak
Guru
Quote:
Originally Posted by Jellby View Post
Ah, I see... so it needs some punctuation before the closing quote. I'll see how that works for my next project.
Well, without punctuation there isn't anything to work with. You would have to build a dictionary of apostrophable words and use that.
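A rough Python sketch of what such a dictionary pass might look like, run before any quote conversion. The word lists, the `&#8217;` target and the function name are all assumptions for illustration; a real list would have to be built up book by book.

```python
import re

# Hypothetical word lists; extend them as new cases turn up.
LEADING = {"em", "tis", "twas"}   # written 'em, 'tis, 'twas
TRAILING = {"callin", "o"}        # written callin', o'
APOSTROPHE = "&#8217;"

def mark_apostrophes(text):
    # An apostrophe between two letters (Don't) is never a quote mark.
    text = re.sub(r"(?<=[A-Za-z])'(?=[A-Za-z])", APOSTROPHE, text)

    def leading(match):
        word = match.group(1)
        return APOSTROPHE + word if word.lower() in LEADING else match.group(0)

    def trailing(match):
        word = match.group(1)
        return word + APOSTROPHE if word.lower() in TRAILING else match.group(0)

    # 'em style: apostrophe at the start of a listed word.
    text = re.sub(r"(?<![A-Za-z])'([A-Za-z]+)\b", leading, text)
    # callin' style: apostrophe at the end of a listed word.
    text = re.sub(r"\b([A-Za-z]+)'(?![A-Za-z])", trailing, text)
    return text

print(mark_apostrophes("'Don't come callin' 'em so late'"))
```

Only the two outer marks remain as straight quotes afterwards. The pass cannot tell a listed word that happens to end a real quotation from one with a trailing apostrophe, so its output still needs checking.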

Quote:
Originally Posted by rogue_ronin View Post
I actually use both <a name="chapter_ChapterNumber"> and <h3 id="chapter_ChapterNumber" class="chapter" align="center"> in my files to mark a chapter. A combination of overkill and ignorance.
Not only is it overkill, you are actually producing invalid files - names/ids should be unique.
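One way to catch this before it spreads through a whole library is a small check like the Python sketch below. It assumes the rule meant here is the HTML 4 / XHTML one, where `id` values and the `name` of an `a` element share a single namespace; the class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Counts how many elements claim each id / <a name> value."""

    def __init__(self):
        super().__init__()
        self.anchors = Counter()

    def handle_starttag(self, tag, attrs):
        # Use a set so <a id="x" name="x"> on ONE element counts once.
        values = {
            value for key, value in attrs
            if value and (key == "id" or (key == "name" and tag == "a"))
        }
        self.anchors.update(values)

def duplicate_anchors(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return sorted(name for name, count in collector.anchors.items() if count > 1)

sample = ('<a name="chapter_1"></a>'
          '<h3 id="chapter_1" class="chapter" align="center">One</h3>')
print(duplicate_anchors(sample))
```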

Quote:
Providing the CSS with the markup allows someone to simply change the CSS file to suit their needs. Awesome. Currently, without a new reader, I can't think of a way to write the code in such a way as to ensure forward-compatibility.
You could use an intermediate converter for that. Calibre, for example, tries to convert CSS into (bastardized) HTML when building LRF output, because LRF only supports a tiny subset of CSS.

Quote:
And you've inspired me to consider rewriting my macros to include such things as <em class="psionic">. Which just reads cool.
It not only looks cool, you can also use one font for psionics and a different font for other emphasis.

Quote:
As for <span>, I'm still a little confused; I get the <div> styles thing
SPAN is pretty much the same thing as DIV, except that DIV works with blocks (creates a newline before and after, among other things) while SPAN works inline, within a line of text.

E.g.
Code:
<div class="block">
<p>first paragraph</p>
<p>second paragraph</p>
<h2>header</h2>
<p>more paragraphs</p>
</div>
vs.
Code:
<p>this is a <span style="text-decoration: underline;">sentence</span> where I want the word "sentence" underlined, which I can't do with the U tag in XHTML because it is deprecated.</p>
Old 05-17-2009, 08:15 AM   #27
rogue_ronin
Banned
Thanks Jellby! Guess I was a little vague, there.

BTW, I just tried pepak's regex. Worked pretty well, finding left and right single-quotes (apostrophes, literally -- I just replaced the rsquo's from the example), when someone is quoting something inside dialogue. (ie: "He said 'xylophone,' did he?" asked Boojum.) Let me easily switch to lsquo and rsquo.

Only had one false positive. There was a positive there too, but some short distance preceding it was an 'em and it was lumped into the positive, ie:
Quote:
I saw 'em blah blah 'positive quote.'
was all highlighted (except "I saw"). Pretty easy to recognize, though. It's a keeper!

But I was working in an HTML document, so I modified it slightly afterward:
Quote:
([>_;])'(.*?[^a-z_])'([_&<])
adding ; and &, which allows for things like: &quot;'Xylophone,' yup, 'xylophone.'&quot; he responded.

Worked awesomely, had only one similar false positive (that contained two positives) and beat my prior search regex:
Quote:
'(.*?)'
with an ugly stick; the old one required me to make a lot more decisions, found many (hundreds) more false positives.

But I did the modified run after initially running his regex. Anyone see a reason why it might not work straight-up? I'm having trouble imagining a sentence that would false positive because of ; and &...
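For comparison, here is a small Python sketch that runs both patterns over the two cases described above. It assumes a literal space for the underscore, `&lsquo;`/`&rsquo;` as the replacement entities, and Python's `re` module rather than NoteTab's engine, so treat the results as indicative only.

```python
import re

ORIGINAL = re.compile(r"([> ])'(.*?[^a-z ])'([< ])")
MODIFIED = re.compile(r"([> ;])'(.*?[^a-z ])'([ &<])")  # adds ; and &
REPLACEMENT = r"\1&lsquo;\2&rsquo;\3"

# The false positive: 'em is taken for the opening quote.
lumped = "<p>I saw 'em blah blah 'positive quote.'</p>"
# Single quotes sitting directly against &quot; entities.
nested = "&quot;'Xylophone,' yup, 'xylophone.'&quot; he responded."

print(ORIGINAL.sub(REPLACEMENT, lumped))
print(ORIGINAL.sub(REPLACEMENT, nested))  # unchanged
print(MODIFIED.sub(REPLACEMENT, nested))  # both pairs converted
```

In this sketch the original pattern leaves the `&quot;` line alone, while the modified one handles it in a single pass, so running the modified pattern straight away looks plausible for text like this.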

m a r

Last edited by rogue_ronin; 05-17-2009 at 08:18 AM.
Old 05-17-2009, 09:05 AM   #28
pepak
Guru
Quote:
Originally Posted by rogue_ronin View Post
Only had one false positive. There was a positive there too, but some short distance preceding it was an 'em and it was lumped into the positive, ie:
Yep, that's what I mentioned above:

Quote:
Originally Posted by pepak View Post
My regexp does that, except in the case when the apostrophed word appears before the actual opening quote. (That is, it will work fine if 'em is inside quotes and will work fine if there are no quotes after 'em).
Quote:
... and beat my prior search regex with an ugly stick
Your old regex may still be useful for replacing double-quotes. It is pretty poor with single-quotes, though. That's why I suggested my regexp :-)

Quote:
But I did the modified run after initially running his regex. Anyone see a reason why it might not work straight-up? I'm having trouble imagining a sentence that would false positive because of ; and &...
The starting semicolon could give you trouble with documents containing non-English characters. E.g. if your character's name ends with &ccedil;, it would get recognized as the end of a sentence.

I approach your problem from the other side - I always insert a space between apostrophe/single-quote and quote/double-quote. Not only does it make my regexp work just fine, it is far nicer visually, too. Later on, when all quotes are converted, you can either remove the space or (better yet) convert it to a non-breaking space.
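A Python sketch of that three-step workflow, under several assumptions: double quotes are literal `"` characters, the single quotes are straight apostrophes, the target entities are `&lsquo;`, `&rsquo;` and `&nbsp;`, and the function names are invented for the example.

```python
import re

QUOTE_RE = re.compile(r"([> ])'(.*?[^a-z ])'([< ])")

def separate(text):
    # Step 1: put a space between adjacent double and single quotes.
    text = text.replace("\"'", "\" '")
    return text.replace("'\"", "' \"")

def convert(text):
    # Step 2: the unmodified pattern now has a space to anchor on.
    return QUOTE_RE.sub(r"\1&lsquo;\2&rsquo;\3", text)

def glue(text):
    # Step 3: turn the helper spaces into non-breaking spaces.
    text = text.replace("\" &lsquo;", "\"&nbsp;&lsquo;")
    return text.replace("&rsquo; \"", "&rsquo;&nbsp;\"")

line = "\"'Xylophone,' yup, 'xylophone.'\" he responded."
print(glue(convert(separate(line))))
```

Step 3 is deliberately naive: it would also glue a closing single quote to an unrelated opening double quote that follows it after a space, so it is worth reviewing the replacements.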
Old 05-17-2009, 09:09 AM   #29
pepak
Guru
Anyway, I have more useful regexps prepared for my documents. At the moment their description is written in Czech only, but maybe it would be useful for others if I translated the post into English? Also, many (well, some :-)) of the regexps will be recognizable to the trained eye right away - here (about in the middle of the page). They are geared towards fixing errors made by FineReader 9.
Old 05-17-2009, 12:31 PM   #30
rogue_ronin
Banned
You sure did mention that. But I wasn't paying enough attention...

I was thinking about starting a regex thread -- but you should do it. I may have a couple to contribute, after seeing yours. For instance:
Quote:
([a-zI]+)&rsquo;([a-z]+)
which I replace with:
Quote:
$1'$2
because it's an apostrophe, not a quote mark in words like: I'm, we'll, could've...
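In Python the same search and replace could be sketched as follows. The only changes are the replacement syntax (`\1` instead of `$1`) and a made-up function name; NoteTab's own engine may differ in details.

```python
import re

# A letter run (or capital I), then &rsquo;, then more lowercase letters.
WORD_APOSTROPHE = re.compile(r"([a-zI]+)&rsquo;([a-z]+)")

def restore_apostrophes(text):
    return WORD_APOSTROPHE.sub(r"\1'\2", text)

sample = "I&rsquo;m sure we&rsquo;ll say &lsquo;no.&rsquo;"
print(restore_apostrophes(sample))
```

A real closing quote survives because it is not followed directly by a letter. Trailing and leading apostrophes such as callin&rsquo; or &rsquo;em are left alone for the same reason, so they still need separate handling.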

Is &apos; well-supported now?

I'm of the mind that you should quote and apostrophe, etc. with either the entity-name tags (in HTML) or with the ASCII/Unicode character (in text) and not mix them up. But I cannot find that in real life much. Since a blanket search-and-replace of ' with &rsquo; does visually improve a text, I understand why it happens.

As for the '{space}" layout you mention, I do try to change things to that -- but the texts I find are not always so neat. Therefore, I have to do it the hard way sometimes.

Your regex was a little difficult to use on one text I did this afternoon: it used ’.” ’?” ’!” at the end of sentences and both ‘ ’ and “ ” Unicode characters. A simple search/replace on individual characters probably would have been smarter -- and in fact I had to do that at the end. Then switch the rsquo and the punctuation. I've been rushing a bit to complete a goal, so I'm not taking enough time to figure it out ahead. (A set of 56 short stories [some are novellas] by a single author.)

But it might not be the regex -- I'm running Win2k in a virtual machine to support NoteTab, and who knows what that can lead to. At one point I was getting only part of what I would copy to the clipboard. (Restarted, of course.)

It's funny -- someone went to a lot of trouble to use curly quotes in this text, but did no work on mdashes vs. hyphens, ellipses, or to clean up blockquotes, or even spell-check thoroughly. Quite a haphazard use of italics, too. Weird.

m a r
Tags
conversion, typography