MobileRead Forums - View Single Post

PHC · 08-16-2014, 01:02 PM

Though it is daunting at first, learning regex is the best thing you could do, as it comes in handy in many situations. Sigil's regex is more forgiving than 'official' regex: it accepts expressions that stricter tools would reject. An example that I am using right now:

I wanted to find the places where a calibre-converted epub (made from a PDF) was splitting sentences at every PDF page break, and remove the split. All I had to do was find the first example of this in the HTML file in Sigil, highlight everything from the </p> tag at the end of the split line up to (but not including) the first character of the next word in the sentence, copy it, and paste it in the 'Find:' box:

Original:

Code:

opportunity to play with the girl, fleetingly and unbeknownst to Phyllis, before what Izzy called “nights out” each evening. Izzy was the only</p>

  <p class="whitespace">&nbsp;</p>

  <p class="calibre1">person who Petra could talk to about magic, although they had to keep it a sworn secret. Izzy loved Petra’s stories about

Highlighted:

Code:

</p>

  <p class="whitespace">&nbsp;</p>

  <p class="calibre1">

Then I added some regex code ([a-z0-9]) at the end to find the first character of the next word in the split line:

Code:

</p>

  <p class="whitespace">&nbsp;</p>

  <p class="calibre1">([a-z0-9])

This finds the split code plus the first character.

For 'Replace:', I just use ' \1' -- note the leading <space>. This replaces all the HTML code with a space and the found first character.
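For anyone who wants to try the same find/replace outside Sigil, here is a rough Python sketch of it. The sample text is a shortened version of the snippet above. Using `\s*` in place of the pasted line breaks and indentation is my own assumption (in Sigil the pasted whitespace was matched as-is), and the variable names are just for illustration.

```python
import re

# The split sentence from the example above (shortened).
html = (
    'Izzy was the only</p>\n'
    '\n'
    '  <p class="whitespace">&nbsp;</p>\n'
    '\n'
    '  <p class="calibre1">person who Petra could talk to about magic'
)

# Same 'Find:' expression, with \s* standing in for the pasted line breaks
# and indentation (an assumption, so the spacing need not match exactly).
find = re.compile(
    r'</p>\s*<p class="whitespace">&nbsp;</p>\s*<p class="calibre1">([a-z0-9])'
)

# Same 'Replace:' expression: a leading space, then whatever (...) captured.
replace = r' \1'

fixed = find.sub(replace, html)
print(fixed)
```

The capture group matters here: without putting the first character back with \1, the replace would eat the first letter of "person".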

Whatever text is found within (…) is copied over by \1. So this basically 'unsplits' the line. The search/replace can be repeated with 'Find', 'Replace', 'Replace/Find', or if you're pretty sure this won't do something unexpected to code you'd rather keep, 'Replace All' -- be careful with this.
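If you want to see what 'Replace All' would touch before committing to it, something like this Python sketch lists every match first. The sample text and the `preview` helper are made up for illustration; this is not a Sigil feature, just the same idea as stepping through with 'Find' before trusting 'Replace All'.

```python
import re

# Hypothetical sample: one real sentence split, one paragraph break to keep.
html = (
    'was the only</p>\n\n  <p class="whitespace">&nbsp;</p>\n\n'
    '  <p class="calibre1">person who could talk.</p>\n\n'
    '  <p class="whitespace">&nbsp;</p>\n\n'
    '  <p class="calibre1">The next chapter began.</p>'
)

find = re.compile(
    r'</p>\s*<p class="whitespace">&nbsp;</p>\s*<p class="calibre1">([a-z0-9])'
)

def preview(text, width=15):
    """Return 'text before | captured character' for every match; text is not changed."""
    hits = []
    for m in find.finditer(text):
        before = text[max(0, m.start() - width):m.start()]
        hits.append(f'{before} | {m.group(1)}')
    return hits

hits = preview(html)
for hit in hits:
    print(hit)

# Only after checking the list, do the equivalent of 'Replace All'.
fixed, count = find.subn(r' \1', html)
print(count, 'replacement(s) made')
```

Because ([a-z0-9]) only matches a lowercase letter or a digit, the paragraph that starts with a capital letter is left alone, which is usually what you want for a real paragraph break.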

As for tools to test your regex, first of all Sigil is the best tool you have because it highlights the results immediately. Or you can go to http://regex101.com/ and use their online tool. It is the very best one I found out of the dozens that I tried. It color-codes your regular expressions, highlights your errors, and tells you what you did wrong when you hover the mouse cursor over the highlighted error. Fantastic. It is my go-to regex tester. I have RegexBuddy and RegexMagic on Windows, and RegExRX and Reggy on OS X, and I never use them because this online tool is so much better.

Note that Sigil didn't require the '\' escape character for the '/' in </p>, while the online tester did.
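The '/' only needs escaping in tools that use it to mark the start and end of the pattern. As far as I can tell that is why an online tester asks for '\/', but treat that explanation as my assumption. Python, like Sigil, has no such delimiters, so this small sketch shows both spellings matching the same text.

```python
import re

line = 'Izzy was the only</p>'

# No delimiters around the pattern, so a bare '/' is fine.
plain = re.search(r'</p>', line)

# The escaped form is also accepted and means exactly the same thing.
escaped = re.search(r'<\/p>', line)

print(plain.span(), escaped.span())
```

So a pattern written with '\/' for a stricter tool should still work when pasted somewhere that does not need the escape.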

The best interactive regex tutorial I found is RegexOne - Learn regular expressions with interactive examples. It guides you through the basics and quizzes you at each step. The best complete tutorial and reference, from the makers of RegexBuddy, is Regular Expression Tutorial.

Post #8. Last edited by Dr. Drib; 08-17-2014 at 06:40 AM.