Need help with RegEx

Gary Friedman · 01-25-2017, 04:52 AM

I am using Calibre to convert a .docx file with a complex layout (lots of figures, tables, etc.) into .epub and .mobi. While the conversion succeeds I have used some RegEx expressions to find and replace some formatting irregularities.

The RegEx expressions I've written work most of the time but still miss about 30% of the things I'm trying to fix, leaving me to go through each .html file and fix things by hand. My books are often 500+ pages and this process is getting tedious. What I would like to do is share with you my input file, the RegEx expressions I'm using, and ask if anyone can make suggestions on how to make these expressions more bullet-proof.

Here's the process I use. (Links to files appear below.) I start with a complex-layout docx like the kind attached. Before conversion I'll replace non-standard characters (like in-line arrows and smiley faces) with ASCII-character equivalents; I also replace multiple-images-in-tables with just one image. Then I use calibre to convert to epub.

From there I run the following regex expressions:

This widens all tables to fill the reader width:
Find: <table class="table_.*">
Replace with: <table width="100%">

Next I want to enlarge all images that appear within tables / figures:

Find: <table width="100%">((.|\n)*?)src=(.*?)class=(.*)/>(.*?)Figure((.|\n)*?)/table>
Replace with: <table width="100%"> \1 src= \3 width="100%"/> \5 Figure \6/table>

This works for most but not all images within figures. The red arrows in the upper-right-hand corner of http://friedmanarchives.com/~downloa...escription.jpg show examples of where it fails.

The files are too large to upload but you can download them from my server:

1) The original .docx file (so you can see the complex layout as it was intended for printed form): http://friedmanarchives.com/~downloa.../Original.docx

2) The .epub version after calibre had converted it (and after I started to fix things by hand): http://friedmanarchives.com/~downloa...ed_output.epub

I'd even be willing to pay someone to help create more bulletproof regexes and to help fix other formatting anomalies that I currently have to tweak by hand in HTML.

Sorry for the long post; hopefully some of you can be of help!

Sincerely, Gary

Phssthpok · 01-25-2017, 01:00 PM

The best thing to do is to undo the change on the file you show, and do "Find" to see what gets highlighted before you hit "Replace". I suspect that you should use "dot-all" instead of "(.|\n)", and use a much more specific regex (e.g. look specifically for "<img"). I've often tripped up with things like "<.*?>x" when I wanted a tag followed by x, and it matched "<...>...<...>x" instead -- but changing it to "<[^>]*>x" worked fine.
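To make the point about over-matching concrete, here is a minimal Python sketch (calibre's search uses Python-style regular expressions). The HTML fragment and the class names in it are made up for illustration and are not taken from Gary's book. It compares the greedy pattern from the first post with a version that uses negated character classes, and then rewrites the `<img` tag specifically.

```python
import re

# Hypothetical one-line fragment, loosely modelled on calibre output.
html = ('<table class="table_3"><tr><td class="block_1">'
        '<img src="a.jpg" class="calibre5"/></td></tr></table>')

# Greedy ".*" runs on to the LAST '">' on the line.
greedy = re.compile(r'<table class="table_.*">')
# '[^"]*' cannot leave the attribute value, so the match stops at the first tag.
specific = re.compile(r'<table class="table_[^"]*">')

greedy_match = greedy.search(html).group(0)
specific_match = specific.search(html).group(0)
print(greedy_match)    # swallows the <tr> and <td> tags as well
print(specific_match)  # only the <table ...> tag

# Widen the table, then swap the image's class attribute for a width.
fixed = specific.sub('<table width="100%">', html)
fixed = re.sub(r'<img\b([^>]*?)\s+class="[^"]*"([^>]*)>',
               r'<img\1 width="100%"\2>', fixed)
print(fixed)
```

Running the greedy pattern as a replace on this fragment would delete the `<tr>` and `<td>` tags, which is the kind of silent damage that checking "Find" before "Replace" is meant to catch.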

Gary Friedman · 01-25-2017, 02:57 PM

Quote:

Originally Posted by Phssthpok

The best thing to do is to undo the change on the file you show, and do "Find" to see what gets highlighted before you hit "Replace".

That's actually what I'm doing now but it still takes forever and I know deep in my heart that there's a better way to do this.

THANK YOU for your insightful answer! I will play with this some more tonight and let you know. (And when you say "dot-all" you really mean ".*", right?)

Gary

Tex2002ans · 01-26-2017, 01:37 AM

Quote:

Originally Posted by Gary Friedman

I am using Calibre to convert a .docx file with a complex layout (lots of figures, tables, etc.) into .epub and .mobi. While the conversion succeeds I have used some RegEx expressions to find and replace some formatting irregularities.

This is a VERY complicated source document.

Quote:

Originally Posted by Gary Friedman

The RegEx expressions I've written work most of the time but still miss about 30% of the things I'm trying to fix, leaving me to go through each .html file and fix things by hand.

I suspect you are trying to bite off too much in one go. Typically, with these very complicated documents, you have to take your Regex in very small bites/passes. For example:

Step 1

Take Figure 1-12:

Spoiler:

Clean up the code for Figure Images first:

Spoiler:

Step 2

Then you can use that as a basis to make your next Regex easier. You can now look for something like class="figureimage" and you KNOW that you are dealing with Figures.

So now clean up some of the Caption code:

Spoiler:

Step 3

Then clean up the bold Figure Text:

Spoiler:

Step 4

Then just toss those hard-coded italics in the garbage and use CSS instead (you will thank me later when you want to change the look of the captions):

Spoiler:

CSS:

Spoiler:

Step 5

Do a pass to look through the book and see what sort of Figures were missed (because of inconsistent code, multi-image figures, etc. etc.).

Side Note: A hell of a lot of your life would have been saved if you had used Styles in your original source document.

Step 6

Move on to cleaning up the next problem! (Cleaning up Table code, making human-readable filenames, etc. etc.) :P

Maybe in the end you might end up with something infinitely more maintainable, like this:

Spoiler:
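The small-passes idea in Steps 1 to 4 can be sketched in Python. This is only an illustration under loud assumptions: the sample markup and the class names `block_23`, `block_22`, `calibre8` and `calibre9` are invented, `caption` is a hypothetical target class, and these are not the patterns from the collapsed spoilers above. Each pass does one narrow job and reports how many replacements it made, so a pass that matches nothing (or far too much) is easy to spot.

```python
import re

# Hypothetical calibre-style markup; every class name here is an assumption.
html = (
    '<p class="block_23"><img src="fig1.jpg" class="calibre5"/></p>\n'
    '<p class="block_22"><b class="calibre8">Figure 1-12</b> '
    '<i class="calibre9">A caption.</i></p>\n'
)

PASSES = [
    # Pass 1: give paragraphs that hold only an image one meaningful class.
    (r'<p class="block_23">(<img\b[^>]*>)</p>',
     r'<p class="figureimage">\1</p>'),
    # Pass 2: the paragraph right after a figure image must be its caption.
    (r'(<p class="figureimage">.*?</p>\s*)<p class="[^"]*">',
     r'\1<p class="caption">'),
    # Pass 3: strip the generated class from the bold figure label.
    (r'<b class="[^"]*">', r'<b>'),
    # Pass 4: drop hard-coded italics; a CSS rule such as
    # ".caption { font-style: italic; }" can style captions instead.
    (r'<i class="[^"]*">([^<]*)</i>', r'\1'),
]

def run_passes(text, passes):
    """Apply each (pattern, replacement) in order, printing a count per pass."""
    for pattern, replacement in passes:
        text, count = re.subn(pattern, replacement, text)
        print(f"{count:3d} replacement(s) for {pattern}")
    return text

cleaned = run_passes(html, PASSES)
print(cleaned)
```

Pass 2 only works because Pass 1 has already run, which is the point of Step 2: each cleaned-up class becomes a reliable anchor for the next, simpler regex.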

Quote:

Originally Posted by Gary Friedman

I'd even be willing to pay someone to help create more bulletproof regexes and to help fix other formatting anomalies that I currently have to tweak by hand in HTML.

Once you throw in a Calibre conversion all bets are off. It will generate a bajillion different calibre## and block_## classes (in the case of this document, there are over 1300 classes created).

And instead of trying to use straight Regex, tools like Diap's Editing Toolbag can make your life easier when trying to remove some hideous nested HTML:

https://www.mobileread.com/forums/sh...d.php?t=251365

Anyway, there are a few professional conversion people on the boards who do this as full-time jobs—one even starts with "Tex".

Quote:

Originally Posted by Gary Friedman

My books are often 500+ pages and this process is getting tedious.

That is what happens when you don't plan for the future when creating the source document... or don't use Styles consistently! You would have saved yourself a heck of a lot of future headaches! :P

And this conversion stuff is pretty hard when you start adding in Cross-References, complicated tables, Sidebars, Indexes, and all sorts of other fun formatting!

Plus you have to simplify a lot of this code so things work on your basic e-ink devices, so many print-first decisions should be reworked into more ebook-friendly ones:

complicated tables -> simpler lists
tables with images in them -> normal text
multi-image figures -> single-image figures (maybe with text captions inside?)
floating boxes -> non-floating + sitting within text
[...]

Quote:

Originally Posted by Gary Friedman

Sorry for the long post; hopefully some of you can be of help!

Pffffffff... you don't know long posts! Your post is a baby compared to some of mine! :P

Side Note: So I found a few typos/mistakes in your source document while I was looking.

There is an accidental space at the very beginning of these paragraphs:

(There are quite a few more, but I don't know Word's variant of Regular Expressions enough to tell you how to catch them within Word):

Quote:

4.2.1

[...]

You can’t do much when you run the app by itself, but I encourage you to do so just so you can change one setting.

4.4.1

Next you get an unintuitive Windows Firewall setting screen, which is essentially telling you that it’s unwise to do this Wi-Fi upload thing at Starbuck’s (or other public hot spots) because it opens your computer up to potential security threats.

5.40.11

[...]

Night Portrait mode (Figure 5 86) is the same thing as using the flash in Manual mode with a long shutter speed.

The period here is accidentally underlined+blue. Also, I am not sure if this was intended, but many of the periods after links have spaces around them:

Quote:

14.1.7

[...]

The Anker Astro E5 1600 mAh battery available here: http://amzn.to/1JPliUr .

The first quote in '60's is actually the wrong way (it should be a RIGHT single quote):

Quote:

14.1.4

If you grew up with the famous “<span class="text_94">N</span>Ever-ready” leather cases in the 1950’s and ‘60’s then you’ll love this custom made leather case by Gariz:

There is an accidental opening quote:

Quote:

3.1.3

[...]

This does the same thing as going to “MENU  6  Creative Style  Saturation and going from +3 to -3, but I must admit this new way is much easier.

The inches measurement should be "dumb quotes" and not smart quotes:

Quote:

13.10

[...]

An RX-100 has a 20 megapixel sensor which produces images that are about 76" x 50" x 72 dpi out of the camera. Taking the exact same set of pixels and changing to print resolution (300 dpi), the dimensions change to 18.2” x 12.1” x 300 dpi.

[...]

If you wanted to make the image twice as large, you could decrease the dpi to 150 dpi and end up with an image 36.4” x 24.3" in size.

Missing an opening quote:

Quote:

8.2

[...]

2. Service Availability” which tries to download (via your pre-established Wi-Fi connection to your home router) a list of countries you can use this feature in.

Gary Friedman · 01-26-2017, 02:28 AM

Tex2002ans,

Wow! Okay, that answer was very thorough but it also left me a little confused. I DO use Word styles consistently (which is how I get a consistent look in the printed edition). I'm not a programmer and although I understand your suggested approach at a high level I'm not certain how I would get there.

Plus you said "Once you throw in a Calibre conversion all bets are off." Not sure how I should interpret that. Do you mean it's hopeless?

GF

BetterRed · 01-26-2017, 03:02 AM

Quote:

Originally Posted by Gary Friedman

I am using Calibre to convert a .docx file with a complex layout (lots of figures, tables, etc.) into .epub and .mobi.

Somehow I glossed over that bit when I first read your post. It was Tex2002ans' post that gave me the hint.

Sigil has an Import DOCX plugin that you may want to consider using in the future. The plugin is a wrapper for the Mammoth DOCX to HTML converter.

It's not a silver bullet: to use it effectively you have to 'code' the mapping between your Word Template Styles and CSS styles. This will require considerable effort for complex templates (i.e. quite a few hours over a few days), but assuming you use a common template for your books the mapping is reusable.

Mammoth only works effectively if you do not use Word as if it were a typewriter. It is most effective if you do all your formatting with styles from an attached template, rather than ad-hoc in-line styling. The same is true of calibre's DOCX conversion facility. Mammoth goes the extra step of providing the bookmaker with the wherewithal of crafting a mapping betwixt Word Template Styles and W3C CSS Styles.
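As a rough idea of what that mapping looks like, here is a sketch using Mammoth's Python package directly rather than the Sigil plugin. The Word style names ("Figure Caption", "Tip") and the CSS classes are hypothetical placeholders, not styles from the book in this thread, and the conversion function needs the third-party `mammoth` package installed.

```python
# Hypothetical Word style names mapped to the HTML element/class they become.
STYLE_TARGETS = {
    "Figure Caption": "p.caption",
    "Tip": "p.tip",
    "Heading 1": "h1",
}

def build_style_map(targets):
    """Return a Mammoth style map: one "p[style-name='...'] => ..." rule per line."""
    return "\n".join(
        f"p[style-name='{word_style}'] => {html_target}:fresh"
        for word_style, html_target in targets.items()
    )

def convert_docx(path, style_map):
    """Convert a DOCX to HTML with Mammoth (assumes `pip install mammoth`)."""
    import mammoth  # third-party; imported here so the map can be built without it
    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    # result.value is the HTML; result.messages lists warnings,
    # for example about Word styles that have no mapping yet.
    return result.value, result.messages

style_map = build_style_map(STYLE_TARGETS)
print(style_map)
```

The warnings returned alongside the HTML are a convenient to-do list for styles that still need a rule, which is where most of the up-front effort goes.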

Added: I convert between 10 and 20 relatively straightforward public domain DOCXs a day via calibre. Most of them adhere to the above guidelines, so I don't get gadzillions of .calibre and .block styles in the epub's CSS. The only thing I've found 'better' than calibre, from an XHTML coding purist's perspective, is to reduce the documents to plain text or very simple markdown and redo the formatting manually in an epub editor. That requires more effort, and the end product would not be substantially different code-wise, and no different from the reader's perspective. These documents are rarely above 50 A4/letter pages.

For really complex documents like yours I don't bother converting to EPUB, they're almost always PDF's which introduces a new set of problems. All the people who consume my 'stuff' have decent tablets, so PDF's are not such a big deal. And guess what, a good proportion of them print the documents, scribble their 'action items' in the margins, which then they give to their lackeys to deal with.

I sometimes rename the .calibre and .block CSS styles to the original Word style names - not for any particular reason, merely to give me a mindless task while my mind is elsewhere -- like listening to someone droning on about the former president-elect.

BR

Tex2002ans · 01-26-2017, 05:17 AM

Quote:

Originally Posted by Gary Friedman

Wow! Okay, that answer was very thorough but it also left me a little confused. I DO use Word styles consistently (which is how I get a consistent look in the printed edition).

Hmm, I will have to take a look again at the source file.

IF Word Styles were used properly throughout, to my knowledge, Calibre would have drastically cut down on the 1300+ "calibre##" + "block_##" classes, and instead had many classes named "MsoNormal" + "MsoNormalTable" + Word's naming conventions (you can see a lot of the Word classes if you do a Word -> Save As -> Filtered HTML).

You may have accidentally introduced some Direct Formatting somewhere along the line (WYSIWYG Editors are notorious for introducing hidden cruft).

Quote:

Originally Posted by Gary Friedman

Plus you said "Once you throw in a Calibre conversion all bets are off." Not sure how I should interpret that. Do you mean it's hopeless?

Heh... no... I meant it like this.

Code from your specific DOCX -> EPUB conversion:

Spoiler:

but I took your Original.docx -> Calibre -> EPUB and my conversion got this slightly different code:

Spoiler:

(Maybe this was due to different Calibre settings/versions, maybe you tweaked the DOCX slightly before conversion, etc. etc.)

It just so happens to be that some of your Figures/Captions used these calibre## + block_## classes:

calibre2 = Maybe Figures
block_ = Maybe Figure Image
block_17 = Maybe the entire Figure Caption
calibre7 = Maybe the italic Caption Text
[...]

but MY Calibre conversion came up with:

calibre3 = Maybe Figures
block_23 = Maybe Figure Image
block_22 = Maybe the entire Figure Caption
calibre8 = Maybe the italic Caption Text
[...]

So all of YOUR 1300+ classes do not match up with all of MY 1300+ classes. Any sort of specific Regex I come up with would not be easily copyable to your EPUB. Mine might be looking for class="frame_" while yours is looking for class="frame_1".

The ONLY way to figure it out is to look at the code and see what CSS class does what... and then come up with Regex+ways to clean it up from there.

Side Note: Also, once you create this DOCX/EPUB divide, all work isn't easily transferable BACK to the source document. For example:

There are quite a bit of 'Dumb Single Quotes'+"Dumb Double Quotes" that have to be changed to proper ‘Single Quotes’+“Double Quotes”.
Many of your "TIP:"s are missing the double space after the colon.

These sorts of mass fixes are more easily made in the source document, THEN you can generate your DOCX -> EPUB.

You don't want to:

Spend 10+ hours on EPUB-specific tweaks/fixes...
Then have to do 10+ hours of reduplicating corrections in your original DOCX.
And then: "Oh crap... I have to generate a new DOCX -> EPUB and now the tens/hundreds of Regex I came up with for specific calibre## classes don't work any more"

Quote:

Originally Posted by Gary Friedman

I'm not a programmer and although I understand your suggested approach at a high level I'm not certain how I would get there.

If you are going to be cleaning up/editing the EPUB, you should at least know basic HTML+CSS.

I find the Calibre/Sigil Reports functionality is very helpful in spotting all the different classes:

Calibre: Tools -> Reports -> Style Classes
Sigil: Tools -> Reports... -> Style Classes in HTML Files
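For anyone who wants a similar overview outside the editors, the following Python sketch counts how often each class name appears in the HTML files of an EPUB. It is a rough approximation of what those reports show, not how calibre or Sigil implement them, and the function names and sample markup are made up for this example.

```python
import re
import zipfile
from collections import Counter

CLASS_ATTR = re.compile(r'\bclass="([^"]*)"')

def count_classes_in_text(markup):
    """Count every class name used in class="..." attributes."""
    counts = Counter()
    for attr_value in CLASS_ATTR.findall(markup):
        counts.update(attr_value.split())  # an attribute may list several classes
    return counts

def count_classes_in_epub(epub_path):
    """Sum class usage over every HTML/XHTML file inside the EPUB (a zip file)."""
    totals = Counter()
    with zipfile.ZipFile(epub_path) as book:
        for name in book.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                text = book.read(name).decode("utf-8", errors="replace")
                totals += count_classes_in_text(text)
    return totals

# Small self-contained demonstration on hypothetical markup.
sample = ('<p class="block_22"><i class="calibre8">A</i></p>'
          '<p class="block_22 calibre8">B</p>')
sample_counts = count_classes_in_text(sample)
for class_name, uses in sample_counts.most_common():
    print(class_name, uses)
```

Classes used only once or twice are often one-off direct formatting, while the heavily used ones are the best candidates for renaming to something meaningful first.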

[Image: CalibreReports.png]

And then there really is nothing that can replace just going through the entire book with multiple passes, figuring out what each class is doing, and "fixing" it:

[Image: LookAtPreview.png]

[Image: LookAtHTML.png]

And in Sigil, I much prefer right clicking on a class and pressing "Go To Link Or Style". This jumps you directly to the CSS class:

[Image: LookAtCSS.png]

So in that case, calibre10 is useless, so you can get rid of all references in the EPUB.
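Removing a useless class by hand is tedious, so here is a minimal Python sketch of the idea. The helper name `drop_class` and the sample markup are invented for illustration. It removes one class name from every class attribute and deletes the attribute entirely when nothing else is left in it.

```python
import re

def drop_class(markup, class_name):
    """Remove class_name from every class attribute; drop attributes left empty."""
    def fix_attr(match):
        kept = [c for c in match.group(1).split() if c != class_name]
        if not kept:
            return ""  # the attribute held only the unwanted class
        return ' class="' + " ".join(kept) + '"'
    return re.sub(r'\s+class="([^"]*)"', fix_attr, markup)

# Hypothetical markup in which calibre10 does nothing useful.
sample = '<p class="calibre10">Text <span class="calibre10 bold">x</span></p>'
stripped = drop_class(sample, "calibre10")
print(stripped)
```

The matching `.calibre10` rule would still need deleting from the stylesheet, and because this is a plain text substitution it should be checked on a copy of the book first.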

As you can see, there is an absolute TON of cruft introduced... so depending on the book, different workflows might be faster (maybe Calibre might be best, maybe Word Filtered HTML, maybe BetterRed's recommendation of Mammoth, [...]).

This book's layout is very complicated... so any of these workflows will be time- + labor-intensive, and you might lose certain functionality depending on which workflow you use (for example, linked Indexes go poof with Word's Filtered HTML). It will be a beast to convert no matter which way you slice it.

BetterRed · 01-26-2017, 05:34 AM

Quote:

Originally Posted by BetterRed

Sigil has an Import DOCX plugin that you may want to consider using in the future. The plugin is a wrapper for the Mammoth DOCX to HTML converter.

I thought I was in the Sigil forum; here's a link to Mammoth ==>> Convert Word documents (.docx files) to HTML

If someone was minded they could probably take DiapDealer's plugin for Sigil (it's a wrapper) and transform it into a similar calibre editor plugin without too much effort.

A feature of the calibre editor I find useful, which Tex2002ans may not have mentioned, is the Live CSS view - basically it allows you to put the cursor in the code and see the composite 'style' that will be used at that place, and where each element comes from.

BR

Phssthpok · 01-26-2017, 08:37 AM

Quote:

Originally Posted by Gary Friedman

THANK YOU for your insightful answer! I will play with this some more tonight and let you know. (And when you say "dot-all" you really mean ".*", right?)

Gary

There is a checkbox below the "replace all" button labelled "dot all". Normally "." matches any character except "\n"; ticking this box makes it match "\n" as well, and "\s" (match a white-space character) will also match "\n". I generally keep it turned on, since line breaks are not necessarily at predictable places in HTML code. Plan B is to run a preliminary set of edits to put each para on a separate line, then remove any line breaks inside the paras so that each para is a single line.
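The effect of that checkbox can be reproduced in Python with the `re.DOTALL` flag, which presumably is what the option switches on. The sketch below uses a made-up table fragment to show a pattern that fails across line breaks without the flag and succeeds with it, followed by a rough version of Plan B; the helper name `unwrap_paragraphs` is hypothetical.

```python
import re

# Hypothetical table that spans several lines.
html = '<table width="100%">\n<tr><td><img src="a.jpg"/></td></tr>\n</table>'

pattern = r'<table\b[^>]*>(.*?)</table>'

without_dotall = re.search(pattern, html)                # "." stops at "\n"
with_dotall = re.search(pattern, html, flags=re.DOTALL)  # "." also matches "\n"

print(without_dotall)        # no match
print(with_dotall.group(1))  # the rows, line breaks included

def unwrap_paragraphs(markup):
    """Plan B: collapse line breaks inside each <p>...</p> onto a single line."""
    return re.sub(
        r'<p\b[^>]*>.*?</p>',
        lambda m: re.sub(r'\s*\n\s*', ' ', m.group(0)),
        markup,
        flags=re.DOTALL,
    )

print(unwrap_paragraphs('<p class="x">one\n  two</p>\n<p>three</p>'))
```

With dot-all on, the lazy `.*?` matters even more than usual, because a greedy `.*` can now run across the whole file to the last `</table>`.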
