MobileRead Forums - View Single Post

ldolse · 10-10-2010, 02:12 PM

Quote:

Originally Posted by cybmole

ok -

a little play with the wizard indicates that this will work for title
<p class="calibre1">[0-9]+ Mexico</p>

but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names ?
I need something that takes out stuff like
<p class="calibre1">The Cactus and the Maguey 11</p> where the text can be any phrase fragment which is followed by a number ?

still, it's a start, thanks.
inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors so various corruptions of Mexico still sneak through.
I read elsewhere that there are no good sources for this & for several other books by James Michener, & no Kindle versions on sale either -

The tutorial does cover your scenario, but since you went to the effort of reading it, here's your answer. You probably don't want to match 'any phrase' followed by a number; I wouldn't risk it (too big a chance of removing real content). If you do go that way, you would want a regex that looks like this:

Code:

<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>

Be sure to check every single match in the wizard if you do that to make sure it doesn't overmatch. With a Michener book you might be checking for a while....
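If you'd rather eyeball the matches outside the wizard, here is a rough sketch in Python (the regex flavour calibre uses). It uses the broad pattern with its group closed before `</p>`, and the sample lines are made up for illustration, not taken from the book. `list_matches` is just a helper name I picked.

```python
import re

# The broad pattern from this post, with the alternation group closed before </p>.
BROAD = re.compile(
    r'<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>'
)

# Hypothetical sample lines; real input would be the converted book's HTML.
SAMPLE = "\n".join([
    '<p class="calibre1">12 Mexico</p>',
    '<p class="calibre1">The Cactus and the Maguey 11</p>',
    '<p class="calibre1">He was born in 1920</p>',
    '<p class="calibre1">An ordinary sentence with no digits.</p>',
])


def list_matches(pattern, html):
    """Return the text inside each matched paragraph, for manual review."""
    return [m.group(1).strip() for m in pattern.finditer(html)]


matches = list_matches(BROAD, SAMPLE)
for text in matches:
    print(text)
```

Note that the made-up line "He was born in 1920" gets matched too, which is exactly the kind of real content you don't want removed.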

That regex could still help you find all the chapter names for the safer approach, though.

A MUCH safer regex would be to look for all the chapter names and put just the beginning of each one in your pattern:

Code:

<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(The\sCactus|Start\s*of\s*Two|Start\s*of\s*Three).*?\d+\s*)</p>

You can see there that the starting words of each chapter are separated by | and surrounded by parentheses.
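With a long chapter list, typing the alternation by hand gets error-prone, so here is a rough Python sketch that builds the safer pattern from a list of chapter openings. The list holds only the placeholder openings from this post, `build_safe_pattern` is a name I made up, and the sample lines are invented for illustration. I used a non-capturing group `(?:...)` for the chapter names, which matches the same text as plain parentheses.

```python
import re

# Chapter openings from this post's example; replace with the book's real ones.
CHAPTER_STARTS = [r"The\sCactus", r"Start\s*of\s*Two", r"Start\s*of\s*Three"]


def build_safe_pattern(chapter_starts):
    """Join the chapter openings with | inside a group and compile the pattern."""
    alternatives = "|".join(chapter_starts)
    return re.compile(
        r'<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(?:'
        + alternatives
        + r').*?\d+\s*)</p>'
    )


SAFE = build_safe_pattern(CHAPTER_STARTS)

# Hypothetical sample lines, one paragraph per line.
SAMPLE = "\n".join([
    '<p class="calibre1">12 Mexico</p>',
    '<p class="calibre1">The Cactus and the Maguey 11</p>',
    '<p class="calibre1">He was born in 1920</p>',
])

# Remove only the headers the pattern recognises; other paragraphs stay.
cleaned = SAFE.sub("", SAMPLE)
print(cleaned)
```

Here the invented "He was born in 1920" paragraph survives, because it doesn't start with one of the listed chapter openings.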