10-10-2010, 01:12 PM   #5
ldolse
Quote:
Originally Posted by cybmole
ok -

a little play with the wizard indicates that this will work for title
<p class="calibre1">[0-9]+ Mexico</p>

but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names ?
I need something that takes out stuff like
<p class="calibre1">The Cactus and the Maguey 11</p> where the text can be any phrase fragment which is followed by a number ?

still, it's a start, thanks.
inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors so various corruptions of Mexico still sneak through.
I read elsewhere that there are no good sources for this & for several other books by James Michener, & no kindle versions on sale either -
The tutorial does cover your scenario, but since you went to the effort of reading it, here's your answer. You probably really don't want to match 'any phrase' followed by a number; I wouldn't risk it (too big a chance of removing real content). If you do want to try it, the regex would look like this:
Code:
<p\sclass="calibre1"?>([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>
Be sure to check every single match in the wizard if you do that to make sure it doesn't overmatch. With a Michener book you might be checking for a while....
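
For a rough idea of how far that broad pattern reaches before stepping through it in the wizard, a small Python sketch like the one below can list every match for review. The sample HTML is made up for illustration (it is not from the book), `list_matches` is just a hypothetical helper name, and the pattern is written with the capture group closed before `</p>`. Treat it as a sketch under those assumptions, not as what calibre does internally.

```python
import re

# Broad pattern: either "<number> Mexico" running headers, or any run of
# letters/spaces followed by a number (the risky "any phrase" alternative).
BROAD = re.compile(
    r'<p\sclass="calibre1"?>([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>'
)

# Hypothetical sample text, invented for this example.
SAMPLE = (
    '<p class="calibre1">12 Mexico</p>\n'
    '<p class="calibre1">The Cactus and the Maguey 11</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
    '<p class="calibre1">A normal paragraph of story text.</p>\n'
)

def list_matches(pattern, html):
    """Return the full text of every match so each can be checked by eye."""
    return [m.group(0) for m in pattern.finditer(html)]

for hit in list_matches(BROAD, SAMPLE):
    print(hit)
```

In this made-up sample the third paragraph is real content that happens to end in a number, and the broad pattern catches it along with the two headers, which is exactly the kind of overmatch to watch for.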

That regex could still help you find all the chapter names for the safer approach, though.

A MUCH safer regex would look only for the known chapter names, with just the beginning of each one in your pattern:
Code:
<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(The\sCactus|Start\s*of\s*Two|Start\s*of\s*Three).*?\d+\s*)</p>
You can see there that the starting words of each chapter are separated by | and surrounded by parentheses.
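
As a minimal sketch of the same idea in Python, the chapter openings can be kept in a list and joined with | to build the pattern. `build_pattern` and the sample HTML are hypothetical, "Start of Two" and "Start of Three" are only the placeholders from the pattern above, and the group is closed before `</p>`. It is meant for trying the pattern out on a scrap of text, not as a replacement for checking matches in the wizard.

```python
import re

# Openings of the known chapter titles, written as regex fragments.
# The last two are placeholders, as in the pattern above.
CHAPTER_STARTS = [r"The\sCactus", r"Start\s*of\s*Two", r"Start\s*of\s*Three"]

def build_pattern(chapter_starts):
    """Join the chapter openings with | inside one parenthesised group."""
    alternatives = "|".join(chapter_starts)
    return re.compile(
        r'<p\sclass="calibre1">([0-9]+\s*Mexico\s*|('
        + alternatives
        + r').*?\d+\s*)</p>'
    )

# Hypothetical sample text, invented for this example.
SAMPLE = (
    '<p class="calibre1">12 Mexico</p>\n'
    '<p class="calibre1">The Cactus and the Maguey 11</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
)

SAFER = build_pattern(CHAPTER_STARTS)

# Remove the matched header paragraphs and keep everything else.
cleaned = SAFER.sub("", SAMPLE)
print(cleaned)
```

Here the two header lines are removed while the ordinary paragraph ending in a number is left alone, because it does not begin with one of the listed chapter openings.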