MobileRead Forums - View Single Post - PDF to Kindle: The unobtainable Holy Grail of ebooks

Serpentine · 10-19-2011, 06:19 PM

Quote:

Originally Posted by Blossom

I can see how these will come in handy. I just need to figure out a program they will work on. What html editor do you use? I also have RegexBuddy but it's Greek to me.

I just use Sigil and RegexBuddy - from time to time Notepad++, but it's not all that useful for regex as it's got a very limited and strange engine for that.

I do as much as I can to make the file super simple, stripping most stuff so that only the essentials for the desired CSS are left. From there it's just a matter of writing expressions that do what you want. RegexBuddy is great for this part.

If you go to the 'Test' tab, put your file in the bottom pane, then enter the regex at the top, and it will highlight the matches. There's a little dropdown box at the top that lets you select your regex flavour. JGsoft is very easy and allows some nice stuff if you use lookaheads and such. Python and Perl are pretty much the same if you need to share things with friends.

Working out what things do is the tricky part, but regex is rather easy to understand - a lot of people focus on trivial stuff that wildcards can work just as well with. I find the most handy things that make a person suddenly get regex are back references and lookarounds (sometimes only referred to as lookahead/lookbehind).

If you right click in the top pane above the test text, it has a nice context menu which allows you to add things that you might not know the name of. If you paste in regex which you don't understand, you can also swap to the 'Create' tab - this explains the regex, but don't expect straightforward explanations.

I'd suggest just using the test area for testing, use the 'grep' tab to apply regex to your files - handy for epubs with multiple xhtml files, also makes it easy to preview replacements - always preview.

Back to two things I think most people miss:
Back references are really easy - say we want to find simple formatting tags:
<([sbui])>([^<]+)</\1>
The first part, <([sbui])>, finds a tag with a single letter from the set {s,b,u,i} in it, e.g. <s>. The (brackets) around the character class mean that the matched letter is stored as group 1.
The middle part, ([^<]+), finds characters as they appear that are not a "<" - this stops us running into the next tag by mistake.
The last part, </\1>, finds a closing tag with the same letter as the one we captured in the first group (\1), i.e. if we get a 'b' match for <b>, we can reference back and use it to find its matching </b>.
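
If you'd rather poke at this outside RegexBuddy, here is a rough Python sketch of the same expression using the standard re module. The sample string and the variable names are made up for illustration; only the pattern itself comes from above.

```python
import re

# Pattern from the post: group 1 stores the tag letter, group 2 stores the
# text, and \1 demands the same letter again in the closing tag.
TAG_PATTERN = re.compile(r"<([sbui])>([^<]+)</\1>")

# Made-up sample: two well-formed tags and one mismatched pair.
sample = "A <b>bold</b> word, an <i>italic</i> one, and a <b>broken</i> pair."

# Collect (tag letter, inner text) for every match.
matches = [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(sample)]
print(matches)

# Replace each match with just its inner text (group 2), dropping the tags.
stripped = TAG_PATTERN.sub(r"\2", sample)
print(stripped)
```

The mismatched <b>...</i> pair is left alone, because \1 insists that the closing tag carries the same letter as the opening one.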

Lookarounds are something that very few short intros/tuts ever explain well. They're actually damn easy. They're used to 'look around', i.e. you use them to decide whether something is what you want or not, without making the thing you checked part of the match. If it's something that signifies you want to look closer, it's positive; if it's something that says you don't want to look at it, it's negative. If the deciding factor comes before the potentially interesting stuff (i.e. it is on the left-hand side of what you want to get), it's called a lookbehind; if it comes after, it's a lookahead.

For example, if we only want to match stuff in an italic tag when it is found after a comma: (?<=,)<i>.+?</i>
Now you're saying "But why not just use: ,<i>.+?</i>"
Well, if you do that you're including the comma in your match, so if you were removing the text you found, you'd need to replace it with a comma to make sure the comma isn't removed too. Lookarounds also allow you to match quite specific things. For example, if you only wanted to match it when there was an exclamation mark somewhere earlier in the line: (?<=!.*)<i>.+?</i> // this requires the JGsoft syntax, which allows repetition in lookarounds.
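
To see the comma difference in practice, here is a small Python sketch. It assumes the italic tag is written as <i>...</i>, and the sample string and variable names are invented for the demo. Note that Python's built-in re module only accepts fixed-width lookbehinds, so the exclamation mark version with .* inside the lookbehind is rejected there.

```python
import re

# Made-up sample: one italic tag right after a comma, one after a space.
sample = "First,<i>second</i> then <i>third</i>"

# Lookbehind: the comma must be there, but it is not part of the match.
with_lookbehind = re.compile(r"(?<=,)<i>.+?</i>")
# Plain version: the comma is swallowed into the match.
without_lookbehind = re.compile(r",<i>.+?</i>")

# Remove whatever each pattern matched.
kept_comma = with_lookbehind.sub("", sample)
lost_comma = without_lookbehind.sub("", sample)
print(kept_comma)
print(lost_comma)

# Python's re refuses variable-width lookbehind such as (?<=!.*)
try:
    re.compile(r"(?<=!.*)<i>.+?</i>")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

With the lookbehind the comma survives the removal; without it you would have to put the comma back in the replacement text.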

Those are examples of positive use, where you are looking for a specific something nearby. You could use the negative form to avoid something instead, e.g. (?<!,) only matches where the preceding character is not a comma.
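
A quick Python sketch of the negative case, again assuming <i>...</i> for the italic tag and using an invented sample string: (?<!,) is the negative lookbehind, so the tag is only matched when it does not directly follow a comma.

```python
import re

# Made-up sample: the first italic tag follows a comma, the second does not.
sample = "One,<i>two</i> and <i>three</i>"

# Negative lookbehind: refuse to match if the previous character is a comma.
not_after_comma = re.compile(r"(?<!,)<i>.+?</i>")

found = not_after_comma.findall(sample)
print(found)
```

Only the second tag is reported; the one right after the comma is skipped.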

I'd just throw a pile of random html into the test window and see what you can do.