01-26-2011, 05:26 PM   #1
Manichean
Using the Search & Replace feature

The search and replace feature uses regular expressions to describe what text to replace. If you need an introduction, there's one available here.

You can use the search & replace feature in the conversion options to replace strings of text with other strings. This can be used, for example, to remove headers, footers, or page numbers. Note that search & replace operates on the XHTML Calibre produces during conversion, not on the original file.
You enter a regular expression that describes the text to be replaced during the conversion. The neat part is the wizard: click on the wizard staff and you get a preview of what Calibre "sees" during the conversion process, namely the previously mentioned XHTML. Find the string you want to replace and construct your regex accordingly. Hit the button labeled "Test" and Calibre highlights the parts it would replace if you used that regexp. Once you're satisfied, hit OK, enter your replacement text, and convert. If you supply an empty string as the replacement text, Calibre will simply delete the strings matching the regular expression.
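To get a feel for what an empty replacement does before trying it in the wizard, here is a minimal sketch in plain Python. It assumes Python's re module behaves like Calibre's search & replace, and the XHTML fragment and the "pagenum" class are made up for illustration; your book's markup will differ.

```python
import re

# Hypothetical XHTML fragment: a page-number paragraph between two text paragraphs.
xhtml = (
    '<p>End of one page.</p>\n'
    '<p class="pagenum">Page 12</p>\n'
    '<p>Start of the next.</p>'
)

# Match a paragraph that contains nothing but "Page <number>",
# plus any whitespace that follows it.
pattern = r'<p class="pagenum">\s*Page\s+\d+\s*</p>\s*'

# An empty replacement string deletes every match.
cleaned = re.sub(pattern, '', xhtml)
print(cleaned)
```

The page-number paragraph disappears and the two text paragraphs are left untouched.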

Practical examples

Removing header/footer strings:
Code:
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
(Shamelessly ripped out of this thread.)
You want to remove the ABC Amber LIT Converter advert that's embedded in the text. To do that, you'll have to remove some of the tags as well. In this example, I'd recommend beginning with the tag <b class="calibre2"> and ending with the corresponding closing tag, which is simply the next </b> in this case. (Opening tags look like <tag>, closing tags like </tag>; refer to a good HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using the regex <b.*?>, the closing tag using </b>, thus we could remove everything between those tags using
Code:
<b.*?>.*?</b>
But using this expression would be a bad idea, because it removes everything enclosed by <b> tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'd remove portions of the book this way. Instead, include the beginning of the enclosed string as well, making the regular expression
Code:
<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>
The \s with quantifiers are used here instead of literal spaces to catch any variations of the string that might occur. Whenever you test a new expression, remember to check everything Calibre would remove, to make sure you don't lose any portions you want to keep. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you actually wanted to, Calibre tries to repair the damaged code after doing the header/footer removal.
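The difference between the two expressions can be tried outside Calibre. The following is only a sketch: it assumes Python's re module matches the way Calibre does, it shortens the book text, and the <b>two</b> in the last line is something I added to the sample to show what the broad expression would destroy.

```python
import re

# Fragment from the example above, shortened. The <b>two</b> is a
# hypothetical addition standing in for legitimate bold text in the book.
xhtml = (
    'New laws don\'t change that." </p><p class="calibre4">\n'
    '<b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
    'erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">\n'
    'It had only been <b>two</b> years since Addison v. Clark.'
)

too_broad = r'<b.*?>.*?</b>'
specific = r'<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>'

# The broad expression removes the advert AND the bold word "two".
with_broad = re.sub(too_broad, '', xhtml)

# The specific expression removes only the advert.
with_specific = re.sub(specific, '', xhtml)

# Listing every match before replacing is the script equivalent of
# checking all highlights after hitting "Test" in the wizard.
for match in re.finditer(specific, xhtml):
    print(repr(match.group(0)))
```

Printing the matches first is a cheap way to catch a mismatch elsewhere in the text before anything is deleted.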


Moving footnotes:

Consider a book where the footnotes are presented at the end of a paragraph or of what was once a physical page, as may be the case when the source of your book is an OCR'd paper book. An example may look like this:
Code:
<br>
Das Mädchen war über sich selbst hinausgewachsen. Johnny konnte es <br>
nur bewundern. Trotz der Angst war Kathy* nicht durchgedreht. Beide <br>
hatten genau das Richtige getan. Nicht die Helden spielen, sondern das <br>
Haus verlassen und flüchten. Kathy hatte sich einfach ein Rad <br>
<br>
* Siehe John Sinclair Nr. 1027: „Der Traum vom Schwarzen Tod“ <br>
<hr>
(From this thread) The first asterisk marks where the footnote, the text after the second asterisk, should go. We want to insert the footnote inside a pair of parentheses. The assumptions we make for finding the footnotes are that
  • the footnotes contain no markup (no HTML tags)
  • the footnote starts at the second asterisk and should be inserted at the position of the first asterisk
  • there's only one footnote per page
(These are made just for convenience's sake to show a proof of concept. If you have a more complicated case than the one presented here, adapt your regular expression accordingly.)
The search expression would then, for example, be
Code:
(?s)\*\s*(?P<text>[^*]*?)\s*\*\s*(?P<footnote>[^<]*)
with the replacement text
Code:
(\g<footnote>) \g<text>
This would yield, after the search & replace finishes, the result
Code:
<br>
Das Mädchen war über sich selbst hinausgewachsen. Johnny konnte es <br>
nur bewundern. Trotz der Angst war Kathy(Siehe John Sinclair Nr. 1027: „Der Traum vom Schwarzen Tod“ ) nicht durchgedreht. Beide <br>
hatten genau das Richtige getan. Nicht die Helden spielen, sondern das <br>
Haus verlassen und flüchten. Kathy hatte sich einfach ein Rad <br>
<br><br>
<hr>
Of note here is that, in the regular expression, we use named groups for backreferences. This is preferable to numbered backreferences, as it is easier to read and makes it clearer what actually happens.
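If you'd like to try the footnote expression outside Calibre, here is a rough sketch in plain Python, again assuming the re module behaves the same way as Calibre's search & replace. The page is a shortened copy of the example, and the numbered variant is only there for comparison.

```python
import re

# Shortened copy of the example page.
page = (
    '<br>\n'
    'nur bewundern. Trotz der Angst war Kathy* nicht durchgedreht. Beide <br>\n'
    'Haus verlassen und flüchten. Kathy hatte sich einfach ein Rad <br>\n'
    '<br>\n'
    '* Siehe John Sinclair Nr. 1027: „Der Traum vom Schwarzen Tod“ <br>\n'
    '<hr>'
)

# Named groups: the replacement says which part goes where.
named = r'(?s)\*\s*(?P<text>[^*]*?)\s*\*\s*(?P<footnote>[^<]*)'
by_name = re.sub(named, r'(\g<footnote>) \g<text>', page)

# Same expression with numbered groups: it works, but you have to
# count parentheses to see that \2 is the footnote and \1 the text.
numbered = r'(?s)\*\s*([^*]*?)\s*\*\s*([^<]*)'
by_number = re.sub(numbered, r'(\2) \1', page)

print(by_name)
```

Both variants should produce the same output; the named one just stays readable once an expression grows beyond two groups.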

Last edited by Manichean; 01-28-2011 at 12:44 PM.