Quote:
Originally Posted by cybmole
ok -
a little play with the wizard indicates that this will work for title
<p class="calibre1">[0-9]+ Mexico</p>
but I'm not sure how to generalise it so that it also takes out the variable phrases which form chapter names ?
I need something that takes out stuff like
<p class="calibre1">The Cactus and the Maguey 11</p> where the text can be any phrase fragment which is followed by a number ?
still, it's a start, thanks.
inspecting the .mobi output, the above regex has done its job, but I also now see that the text is littered with OCR scan errors so various corruptions of Mexico still sneak through.
I read elsewhere that there are no good sources for this & for several other books by James Michener, & no kindle versions on sale either -
The tutorial does cover your scenario, but since you went to the effort of reading it, here's your answer. You probably don't want to match 'any phrase' followed by a number. I wouldn't risk it (too big a chance of removing real content). If you do go that way, though, you would want a regex that looks like this:
Code:
<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>
Be sure to check every single match in the wizard if you do that to make sure it doesn't overmatch. With a Michener book you might be checking for a while....
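If you want a feel for how easily this overmatches before going through the wizard, here is a rough Python sketch. It assumes Python's re module treats the pattern the same way the wizard does, it uses a version of the broad pattern with the group closed before </p>, and the sample paragraphs are made up for illustration, not taken from the book.

```python
import re

# Broad pattern: a running title ("12 Mexico") or any run of letters/spaces
# followed by a number, inside a calibre1 paragraph.
BROAD = re.compile(
    r'<p\sclass="calibre1">([0-9]+\s*Mexico\s*|[a-zA-Z\s]*?\d+\s*)</p>'
)

# Hypothetical sample text (an assumption, not real book content).
sample = (
    '<p class="calibre1">12 Mexico</p>\n'
    '<p class="calibre1">The Cactus and the Maguey 11</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
    '<p class="calibre1">An ordinary sentence of real content.</p>\n'
)

def list_matches(html):
    """Return the inner text of every paragraph the broad pattern would remove."""
    return [m.group(1) for m in BROAD.finditer(html)]

for text in list_matches(sample):
    print(repr(text))
```

Note that the third sample paragraph is ordinary prose that happens to end in a number, and the broad pattern takes it out along with the headers. That is the kind of match to watch for.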
That broad regex can still help you find all the chapter names for the safer approach, though.
A MUCH safer regex would be to look for the actual chapter names and just put the beginning of each one in your pattern:
Code:
<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(The\sCactus|Start\s*of\s*Two|Start\s*of\s*Three).*?\d+\s*)</p>
You can see there that the starting words of each chapter are separated by | and surrounded by parentheses.
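With a long list of chapters it can be easier to build that alternation from a list than to type it by hand. A minimal Python sketch, again assuming Python's re module behaves like the wizard for this pattern; build_safe_pattern and strip_headers are hypothetical helper names, the chapter starts are the placeholders from the pattern above, and the sample text is made up.

```python
import re

# Openings of each chapter title, as used in the pattern above.
# Replace these with the real chapter names from the book.
CHAPTER_STARTS = [r'The\sCactus', r'Start\s*of\s*Two', r'Start\s*of\s*Three']

def build_safe_pattern(starts):
    """Join the chapter openings with | inside one group of the header pattern."""
    chapters = '|'.join(starts)
    return re.compile(
        r'<p\sclass="calibre1">([0-9]+\s*Mexico\s*|(?:'
        + chapters
        + r').*?\d+\s*)</p>'
    )

def strip_headers(html, starts=CHAPTER_STARTS):
    """Remove running titles and known chapter headers, leave everything else."""
    return build_safe_pattern(starts).sub('', html)

# Hypothetical sample text (an assumption, not real book content).
sample = (
    '<p class="calibre1">12 Mexico</p>\n'
    '<p class="calibre1">The Cactus and the Maguey 11</p>\n'
    '<p class="calibre1">He was born in 1910</p>\n'
)

print(strip_headers(sample))
```

Here the paragraph of ordinary prose ending in a number survives, because it does not begin with one of the listed chapter openings.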