01-25-2017, 04:52 AM | #1 |
Member
Posts: 12
Karma: 10
Join Date: Oct 2014
Device: none
|
Need help with RegEx
I am using Calibre to convert a .docx file with a complex layout (lots of figures, tables, etc.) into .epub and .mobi. The conversion succeeds, but I then use regular expressions to find and replace some formatting irregularities.

The expressions I've written work most of the time but still miss about 30% of the things I'm trying to fix, leaving me to go through each .html file and fix things by hand. My books are often 500+ pages and this process is getting tedious. I would like to share my input file and the expressions I'm using, and ask if anyone can suggest how to make them more bulletproof.

Here's the process I use. (Links to files appear below.)

I start with a complex-layout docx like the one linked below. Before conversion I replace non-standard characters (like in-line arrows and smiley faces) with ASCII equivalents; I also replace multiple images in a table with just one image. Then I use calibre to convert to epub. From there I run the following expressions.

This widens all tables to fill the reader width:
Find: <table class="table_.*">
Replace with: <table width="100%">

Next I want to enlarge all images that appear within tables / figures:
Find: <table width="100%">((.|\n)*?)src=(.*?)class=(.*)/>(.*?)Figure((.|\n)*?)/table>
Replace with: <table width="100%"> \1 src= \3 width="100%"/> \5 Figure \6/table>

This works for most but not all images within figures. The red arrows in the upper-right-hand corner of http://friedmanarchives.com/~downloa...escription.jpg show examples of where it fails.

The files are too large to upload but you can download them from my server:
1) The original .docx file (so you can see the complex layout as it was intended for printed form): http://friedmanarchives.com/~downloa.../Original.docx
2) The .epub version after calibre had converted it (and after I started to fix things by hand): http://friedmanarchives.com/~downloa...ed_output.epub

I'd even be willing to pay someone to help create more bulletproof regexes and to help fix other formatting anomalies that I currently have to tweak by hand in HTML.

Sorry for the long post; hopefully some of you can be of help!

Sincerely, Gary |
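One likely source of misses in the first expression is the greedy `.*` inside the class attribute. The sketch below uses Python's `re` module and a made-up one-line sample (the class names `table_3`, `td_1` and `calibre7` are assumptions, not taken from the actual book) to show how `.*` can run past the end of the `<table>` tag when another `">` follows on the same line, and how `[^"]*` keeps the match inside the attribute.

```python
import re

# Hypothetical sample: several tags sharing one line, as converted HTML sometimes has.
html = ('<table class="table_3"><tr><td class="td_1">'
        '<img src="a.jpg" class="calibre7"/></td></tr></table>')

# Original style: greedy .* runs to the LAST '">' on the line.
greedy = re.compile(r'<table class="table_.*">')
# More specific: [^"]* cannot leave the class attribute value.
specific = re.compile(r'<table class="table_[^"]*">')

greedy_result = greedy.sub('<table width="100%">', html)
specific_result = specific.sub('<table width="100%">', html)

print(greedy_result)    # the <tr> and <td> opening tags are swallowed
print(specific_result)  # only the <table ...> tag is rewritten
```

The same reasoning would apply to `class=(.*)/>` in the second expression, where a character class such as `[^>]*` is a safer way to stay inside a single tag.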
01-25-2017, 01:00 PM | #2 |
Age improves with wine.
Posts: 558
Karma: 95229
Join Date: Nov 2014
Device: Kindle Oasis, Kobo Libra II
|
The best thing to do is to undo the change on the file you show, and do "Find" to see what gets highlighted before you hit "Replace". I suspect that you should use "dot-all" instead of "(.|\n)", and use a much more specific regex (e.g. look specifically for "<img"). I've often tripped up with things like "<.*?>x" when I wanted a tag followed by x, and it matched "<...>...<...>x" instead -- but changing it to "<[^>]*>x" worked fine.
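The difference between `<.*?>x` and `<[^>]*>x` can be reproduced in a few lines of Python. This is only a sketch on an invented sample string; the point is that a lazy `.*?` still starts at the first `<` it finds and is allowed to cross `>` characters, while `[^>]*` is not.

```python
import re

# Invented sample: three tags, only the last one is directly followed by "x".
html = '<b>Figure</b> 1-12 <i>caption</i>x'

lazy = re.search(r'<.*?>x', html)      # lazy, but may still span several tags
strict = re.search(r'<[^>]*>x', html)  # cannot cross a ">" inside the tag

print(lazy.group(0))    # everything from the first "<" up to the final "x"
print(strict.group(0))  # just the one tag before "x"
```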
Last edited by Phssthpok; 01-25-2017 at 01:08 PM. |
01-25-2017, 02:57 PM | #3 |
Member
Posts: 12
Karma: 10
Join Date: Oct 2014
Device: none
|
Quote (Phssthpok):
The best thing to do is to undo the change on the file you show, and do "Find" to see what gets highlighted before you hit "Replace".
That's actually what I'm doing now but it still takes forever and I know deep in my heart that there's a better way to do this. THANK YOU for your insightful answer! I will play with this some more tonight and let you know. (And when you say "dot-all" you really mean ".*", right?) Gary |
01-26-2017, 01:37 AM | #4 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
Step 1 Take Figure 1-12: Spoiler:
Clean up the code for Figure Images first: Spoiler:
Step 2 Then you can use that as a basis to make your next Regex easier. You can now look for something like class="figureimage" and you KNOW that you are dealing with Figures. So now clean up some of the Caption code: Spoiler:
Step 3 Then clean up the bold Figure Text: Spoiler:
Step 4 Then just toss those hard-coded italics in the garbage and use CSS instead (you will thank me later when you want to change the look of the captions): Spoiler:
CSS: Spoiler:
Step 5 Do a pass to look through the book and see what sort of Figures were missed (because of inconsistent code, multi-image figures, etc. etc.).
Side Note: A hell of a lot of your life would have been saved if you had used Styles in your original source document.
Step 6 Move on to cleaning up the next problem! (Cleaning up Table code, making human-readable filenames, etc. etc.) :P
Maybe in the end you might end up with something infinitely more maintainable, like this: Spoiler:
Quote:
And instead of trying to use straight Regex, tools like Diap's Editing Toolbag can make your life easier when trying to remove some hideous nested HTML: https://www.mobileread.com/forums/sh...d.php?t=251365 Anyway, there are a few professional conversion people on the boards who do this as full-time jobs—one even starts with "Tex". Quote:
And this conversion stuff is pretty hard when you start adding in Cross-References, complicated tables, Sidebars, Indexes, and all sorts of other fun formatting! Plus you have to simplify a lot of this code so things work on your basic e-ink devices, so many print-first decisions should be reformatted for more ebook-friendly decisions:
Quote:
Side Note: So I found a few typos/mistakes in your source document while I was looking. There is an accidental space at the very beginning of these paragraphs: (There are quite a few more, but I don't know Word's variant of Regular Expressions enough to tell you how to catch them within Word): Quote:
Last edited by Tex2002ans; 01-26-2017 at 02:15 AM. |
01-26-2017, 02:28 AM | #5 |
Member
Posts: 12
Karma: 10
Join Date: Oct 2014
Device: none
|
Tex2002ans,
Wow! Okay, that answer was very thorough but it also left me a little confused. I DO use Word styles consistently (which is how I get a consistent look in the printed edition). I'm not a programmer and although I understand your suggested approach at a high level I'm not certain how I would get there. Plus you said "Once you throw in a Calibre conversion all bets are off." Not sure how I should interpret that. Do you mean it's hopeless? GF |
01-26-2017, 03:02 AM | #6 |
null operator (he/him)
Posts: 20,539
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
Sigil has an Import DOCX plugin that you may want to consider using in the future. The plugin is a wrapper for the Mammoth DOCX to HTML converter. It's not a silver bullet: to use it effectively you have to 'code' the mapping between your Word Template Styles and CSS styles. This will require considerable effort for complex templates (i.e. quite a few hours over a few days), but assuming you use a common template for your books the mapping is reusable.

Mammoth only works effectively if you do not use Word as if it were a typewriter. It is most effective if you do all your formatting with styles from an attached template, rather than ad-hoc in-line styling. The same is true of calibre's DOCX conversion facility. Mammoth goes the extra step of providing the bookmaker with the wherewithal to craft a mapping betwixt Word Template Styles and W3C CSS Styles.

Added: I convert between 10 and 20 relatively straightforward public domain DOCXs a day via calibre. Most of them adhere to the above guidelines, so I don't get gadzillions of .calibre and .block styles in the epub's CSS. The only thing I've found 'better' than calibre, from an XHTML coding purist's perspective, is to reduce the documents to plain text or very simple markdown and redo the formatting manually in an epub editor. That requires more effort, and the end product would not be substantially different code-wise, and no different from the reader's perspective. These documents are rarely above 50 A4/letter pages.

For really complex documents like yours I don't bother converting to EPUB; they're almost always PDFs, which introduces a new set of problems. All the people who consume my 'stuff' have decent tablets, so PDFs are not such a big deal. And guess what, a good proportion of them print the documents and scribble their 'action items' in the margins, which they then give to their lackeys to deal with.
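The Word-style-to-CSS mapping that Mammoth relies on can be sketched roughly as follows. This calls the Mammoth Python library directly rather than going through the Sigil plugin, and the Word style names on the left-hand side ("Figure Caption" and so on) are hypothetical placeholders that would have to be replaced with the names in your own template.

```python
# Each line maps a Word style (left) to an HTML element, optionally with a class (right).
# The style names here are assumptions for illustration only.
STYLE_MAP = "\n".join([
    "p[style-name='Figure Caption'] => p.caption:fresh",
    "p[style-name='Heading 1'] => h1:fresh",
    "r[style-name='Emphasis'] => em",
])


def convert_docx(path, style_map=STYLE_MAP):
    """Convert a DOCX file to HTML with Mammoth; returns (html, messages)."""
    import mammoth  # third-party package, imported here so the map can be edited without it

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    # result.messages lists warnings, e.g. Word styles that had no mapping.
    return result.value, result.messages
```

Once a map like this covers every style in the template, it can be reused unchanged for each book built from that template.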
I sometimes rename the .calibre and .block CSS styles to the original Word style names - not for any particular reason, merely to give me a mindless task while my mind is elsewhere -- like listening to someone droning on about the former president-elect.
BR
Last edited by BetterRed; 01-26-2017 at 06:36 AM. |
01-26-2017, 05:17 AM | #7 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
IF Word Styles were used properly throughout, to my knowledge, Calibre would have drastically cut down on the 1300+ "calibre##" + "block_##" classes, and instead produced many classes named "MsoNormal" + "MsoNormalTable" + Word's naming conventions (you can see a lot of the Word classes if you do a Word -> Save As -> Filtered HTML). You may have accidentally introduced some Direct Formatting somewhere along the line (WYSIWYG editors are pretty crappy about introducing hidden cruft). Quote:
Code from your specific DOCX -> EPUB conversion: Spoiler:
but I took your Original.docx -> Calibre -> EPUB and my conversion got this slightly different code: Spoiler:
(Maybe this was due to different Calibre settings/versions, maybe you tweaked the DOCX slightly before conversion, etc. etc.) It just so happens to be that some of your Figures/Captions used these calibre## + block_## classes:
but MY Calibre conversion came up with:
So all of YOUR 1300+ classes do not match up with all of MY 1300+ classes. Any sort of specific Regex I come up with would not be easily copyable to your EPUB. Mine might be looking for class="frame_" while yours is looking for class="frame_1". The ONLY way to figure it out is to look at the code and see what CSS class does what... and then come up with Regex+ways to clean it up from there. Side Note: Also, once you create this DOCX/EPUB divide, all work isn't easily transferable BACK to the source document. For example:
These sorts of mass fixes are more easily made in the source document, THEN you can generate your DOCX -> EPUB. You don't want to:
Quote:
I find the Calibre/Sigil Reports functionality is very helpful in spotting all the different classes:
And then there really is nothing that can replace just going through the entire book with multiple passes, figuring out what each class is doing, and "fixing" it:

And in Sigil, I much prefer right clicking on a class and pressing "Go To Link Or Style". This jumps you directly to the CSS class:

So in that case, calibre10 is useless, so you can get rid of all references in the EPUB.

As you can see, there is an absolute TON of cruft introduced... so depending on the book, different workflows might be faster (maybe Calibre might be best, maybe Word Filtered HTML, maybe BetterRed's recommendation of Mammoth, [...]).

This book's layout is very complicated... so any of these workflows will be time- + labor-intensive, and you might lose certain functionality depending on which workflow you use (for example, linked Indexes go poof with Word's Filtered HTML). It will be a beast to convert no matter which way you slice it.

Last edited by Tex2002ans; 01-26-2017 at 05:29 AM. |
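For anyone who wants a quick class inventory outside the editor, the idea behind the Reports view can be approximated with a short Python sketch. It is only an illustration: the sample markup and class names below are invented, and a regex over `class="..."` is a rough stand-in for what the editors actually do.

```python
import re
from collections import Counter


def count_classes(html_text):
    """Count how often each CSS class name appears in class="..." attributes."""
    counts = Counter()
    for value in re.findall(r'class="([^"]*)"', html_text):
        counts.update(value.split())  # one attribute may hold several class names
    return counts


# Invented sample in the style of calibre's generated class names.
sample = (
    '<p class="block_12"><span class="calibre10 italic">Figure 1-12</span></p>\n'
    '<p class="block_12">Some text</p>\n'
    '<img class="calibre7" src="a.jpg"/>'
)

class_counts = count_classes(sample)
for name, n in class_counts.most_common():
    print(name, n)
```

Classes that show up only once or twice are good candidates to inspect first, since they often come from stray direct formatting.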
01-26-2017, 05:34 AM | #8 |
null operator (he/him)
Posts: 20,539
Karma: 26944418
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
If someone were so minded, they could probably take DiapDealer's plugin for Sigil (it's a wrapper) and transform it into a similar calibre editor plugin without too much effort. A feature of the calibre editor I find useful, which Tex2002ans may not have mentioned, is the Live CSS view: basically it allows you to put the cursor in the code and see the composite 'style' that will be used at that place, and where each element comes from.
BR
Last edited by BetterRed; 01-26-2017 at 06:47 AM. |
01-26-2017, 08:37 AM | #9 |
Age improves with wine.
Posts: 558
Karma: 95229
Join Date: Nov 2014
Device: Kindle Oasis, Kobo Libra II
|
There is a checkbox below the "replace all" button labelled "dot all". Normally "." matches any character except "\n"; ticking this box makes it match "\n" as well. ("\s", which matches a white-space character, matches "\n" whether or not the box is ticked.) I generally keep it turned on, since line breaks are not necessarily at predictable places in HTML code. Plan B is to run a preliminary set of edits to put each para on a separate line, then remove any line breaks inside the paras so that each para is a single line.
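The effect of the "dot all" checkbox corresponds to the `re.DOTALL` flag in Python's `re` module, so it can be tried out in a small sketch. The caption markup here is an invented sample with a line break in the middle of a paragraph.

```python
import re

# Invented sample: a paragraph whose text wraps onto a second line.
html = '<p class="caption">Figure 1-12:\nA sample caption</p>'
pattern = r'<p class="caption">(.*?)</p>'

without_dotall = re.search(pattern, html)                # "." stops at the newline
with_dotall = re.search(pattern, html, flags=re.DOTALL)  # "." may cross the newline
whitespace = re.search(r'1-12:\sA', html)                # \s matches "\n" even without the flag

print(without_dotall)        # None: no match without dot-all
print(with_dotall.group(1))  # the full caption text, newline included
print(whitespace is not None)
```

With dot-all on, lazy quantifiers and negated character classes matter even more, because `.` no longer has line ends as a natural stopping point.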
Last edited by Phssthpok; 01-26-2017 at 09:12 AM. |
Tags |
conversion, regex |