Quote:
Originally Posted by Gary Friedman
I am using Calibre to convert a .docx file with a complex layout (lots of figures, tables, etc.) into .epub and .mobi. While the conversion succeeds I have used some RegEx expressions to find and replace some formatting irregularities.
|
This is a VERY complicated source document.
Quote:
Originally Posted by Gary Friedman
The RegEx expressions I've written work most of the time but still miss about 30% of the things I'm trying to fix, leaving me to go through each .html file and fix things by hand.
|
I suspect you are trying to bite off too much in one go. With these very complicated documents, you typically have to run your Regex in very small bites/passes. For example:
Step 1
Take Figure 1-12:
Clean up the code for Figure Images first:
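A rough sketch of what such a first pass could look like in Python. The sample markup, the generated class names, and the "figureimage" wrapper are made up for illustration; the real converted code will differ:

```python
import re

# Hypothetical Calibre-style output; the real class names will differ.
html = (
    '<p class="block_12"><img src="images/image12.png" class="calibre5" alt=""/></p>\n'
    '<p class="block_13">Some body text.</p>\n'
)

# Pass 1: a paragraph that holds nothing but a single <img> becomes a figure wrapper.
FIGURE_IMAGE = re.compile(
    r'<p class="[^"]*">\s*<img\b[^>]*?\bsrc="([^"]+)"[^>]*/>\s*</p>'
)

def tag_figure_images(text):
    """Replace image-only paragraphs with a clean, predictable wrapper."""
    return FIGURE_IMAGE.sub(
        r'<div class="figureimage"><img alt="" src="\1"/></div>', text
    )

cleaned = tag_figure_images(html)
print(cleaned)
```

The pass only touches paragraphs containing exactly one image, so anything odd (multi-image figures, images mixed with text) is left alone for a later pass.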
Step 2
Then you can use that as a basis to make your next Regex easier. You can now look for something like class="figureimage" and you KNOW that you are dealing with Figures.
So now clean up some of the Caption code:
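For example, a second pass might anchor on that wrapper and only touch the paragraph right after it. This is a sketch under assumptions: the "figurecaption" class, the sample caption text, and the idea that the caption is always the very next paragraph are all hypothetical:

```python
import re

# Hypothetical markup: a cleaned-up figure wrapper followed by its caption.
html = (
    '<div class="figureimage"><img alt="" src="images/image12.png"/></div>\n'
    '<p class="block_14"><b>Figure 1-12:</b> <i>The mode dial.</i></p>\n'
    '<p class="block_15">Body text.</p>\n'
)

# Pass 2: only the paragraph immediately after a figure wrapper is a caption.
CAPTION_AFTER_FIGURE = re.compile(
    r'(<div class="figureimage"><img\b[^>]*/></div>\s*)<p class="[^"]*">(.*?)</p>'
)

def tag_captions(text):
    """Give the paragraph following each figure a stable caption class."""
    return CAPTION_AFTER_FIGURE.sub(r'\1<p class="figurecaption">\2</p>', text)

captioned = tag_captions(html)
print(captioned)
```

Because the pattern starts from the known wrapper, ordinary body paragraphs never match, no matter what generated class they carry.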
Step 3
Then clean up the bold Figure Text:
Step 4
Then just toss those hard-coded italics in the garbage and use CSS instead (you will thank me later when you want to change the look of the captions):
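As a sketch of the "toss the italics" pass in Python: it strips <i>/<em> tags only inside caption paragraphs, so the look can then be set once in the stylesheet. The "figurecaption" class and the sample text are hypothetical:

```python
import re

# Hypothetical markup: one caption with hard-coded italics, one normal paragraph.
html = (
    '<p class="figurecaption"><b>Figure 1-12:</b> <i>The mode dial.</i></p>\n'
    '<p>Keep <i>this</i> italic.</p>\n'
)

CAPTION = re.compile(r'(<p class="figurecaption">)(.*?)(</p>)')
ITALIC_TAGS = re.compile(r'</?(?:i|em)>')

def strip_caption_italics(text):
    """Remove <i>/<em> tags inside caption paragraphs only."""
    def clean(match):
        opening, body, closing = match.groups()
        return opening + ITALIC_TAGS.sub('', body) + closing
    return CAPTION.sub(clean, text)

plain_captions = strip_caption_italics(html)
print(plain_captions)
```

Italics elsewhere in the book (emphasis, titles) are left untouched, since only caption paragraphs are rewritten.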
CSS:
Step 5
Do a pass to look through the book and see what sort of Figures were missed (because of inconsistent code, multi-image figures, etc. etc.).
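A quick way to do that kind of check is to list every image that never ended up inside a figure wrapper. A minimal sketch, assuming the hypothetical "figureimage" wrapper from the earlier passes and made-up sample markup:

```python
import re

# Hypothetical markup: one figure that was tagged, one multi-image figure that was not.
html = (
    '<div class="figureimage"><img alt="" src="images/image12.png"/></div>\n'
    '<p class="block_20"><img src="images/image13.png" class="calibre7"/>'
    '<img src="images/image14.png" class="calibre7"/></p>\n'
)

ALL_IMAGES = re.compile(r'<img\b[^>]*?\bsrc="([^"]+)"')
TAGGED_IMAGES = re.compile(r'<div class="figureimage"><img\b[^>]*?\bsrc="([^"]+)"')

def missed_images(text):
    """Return image sources that are not inside a figureimage wrapper."""
    tagged = set(TAGGED_IMAGES.findall(text))
    return [src for src in ALL_IMAGES.findall(text) if src not in tagged]

missed = missed_images(html)
print(missed)
```

Whatever shows up in the list is what the next, more specific Regex (or a manual fix) has to handle.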
Side Note: A hell of a lot of your life would have been saved if you had used Styles in your original source document.
Step 6
Move on to cleaning up the next problem! (Cleaning up Table code, making human-readable filenames, etc. etc.) :P
In the end, you might end up with something infinitely more maintainable, like this:
Quote:
Originally Posted by Gary Friedman
I'd even be willing to pay someone to help create more bulletproof REGEX' and to help fix other formatting anomolies that I currently have to tweak by hand in HTML.
|
Once you throw in a Calibre conversion all bets are off. It will generate a bajillion different calibre## and block_## classes (in the case of this document, there are over 1300 classes created).
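If you want to see how bad it is in your own files, a small script can tally the generated classes. This is only a sketch: it assumes the classes follow the calibre## / block_## naming mentioned above, and the sample markup is made up:

```python
import re
from collections import Counter

# Hypothetical snippet of converted HTML.
html = (
    '<p class="block_12"><span class="calibre5 bold">a</span></p>'
    '<p class="block_12">b</p>'
)

CLASS_ATTR = re.compile(r'class="([^"]*)"')
GENERATED_CLASS = re.compile(r'(?:calibre\d*|block_\d+)')

def generated_class_usage(text):
    """Count how often each auto-generated class name is used."""
    counts = Counter()
    for attr in CLASS_ATTR.findall(text):
        for name in attr.split():
            if GENERATED_CLASS.fullmatch(name):
                counts[name] += 1
    return counts

usage = generated_class_usage(html)
print(len(usage), "generated classes:", usage.most_common())
```

Classes used only once or twice are usually the first candidates to merge or delete.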
And instead of trying to use straight Regex, tools like Diap's Editing Toolbag can make your life easier when trying to remove some hideous nested HTML:
https://www.mobileread.com/forums/sh...d.php?t=251365
Anyway, there are a few professional conversion people on the boards who do this as full-time jobs—one even starts with "Tex".
Quote:
Originally Posted by Gary Friedman
My books are often 500+ pages and this process is getting tedious.
|
That is what happens when you don't plan for the future when creating the source document... or use Styles consistently! You would have saved yourself a heck of a lot of future headaches! :P
And this conversion stuff is pretty hard when you start adding in Cross-References, complicated tables, Sidebars, Indexes, and all sorts of other fun formatting!
Plus you have to simplify a lot of this code so things work on your basic e-ink devices, so many print-first decisions should be swapped for more ebook-friendly ones:
- complicated tables -> simpler lists
- tables with images in them -> normal text
- multi-image figures -> single-image figures (maybe with text captions inside?)
- floating boxes -> non-floating + sitting within text
- [...]
Quote:
Originally Posted by Gary Friedman
Sorry for the long post; hopefully some of you can be of help!
|
Pffffffff... you don't know long posts! Your post is a baby compared to some of mine! :P
Side Note: So I found a few typos/mistakes in your source document while I was looking.
There is an accidental space at the very beginning of these paragraphs:
(There are quite a few more, but I don't know Word's variant of Regular Expressions well enough to tell you how to catch them within Word):
Quote:
4.2.1
[...]
You can’t do much when you run the app by itself, but I encourage you to do so just so you can change one setting.
4.4.1
Next you get an unintuitive Windows Firewall setting screen, which is essentially telling you that it’s unwise to do this Wi-Fi upload thing at Starbuck’s (or other public hot spots) because it opens your computer up to potential security threats.
5.40.11
[...]
Night Portrait mode (Figure 5 86) is the same thing as using the flash in Manual mode with a long shutter speed.
|
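On the HTML side (not in Word), stray leading spaces like these are easy to catch. A minimal Python sketch, assuming the space sits directly after the opening <p> tag; the sample markup is made up, and paragraphs that start with a nested <span> would need a slightly wider pattern:

```python
import re

# Hypothetical converted paragraphs; the first one starts with a stray space.
html = (
    '<p class="block_1"> You can’t do much when you run the app by itself.</p>\n'
    '<p class="block_1">This one is fine already.</p>\n'
)

# Spaces, tabs, or non-breaking spaces right after the opening <p ...> tag.
LEADING_SPACE = re.compile(r'(<p\b[^>]*>)[ \t\u00a0]+')

def strip_leading_spaces(text):
    """Remove whitespace at the very beginning of each paragraph."""
    return LEADING_SPACE.sub(r'\1', text)

trimmed = strip_leading_spaces(html)
print(trimmed)
```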
The period here is accidentally underlined+blue. Also, I am not sure if this was intended, but many of the periods after links have spaces around them:
The first quote in '60's faces the wrong way (it should be a RIGHT single quote):
Quote:
14.1.4
If you grew up with the famous “<span class="text_94">N</span>Ever-ready” leather cases in the 1950’s and ‘60’s then you’ll love this custom made leather case by Gariz:
|
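This kind of decade apostrophe can be caught with a lookahead. A small sketch in Python, assuming the pattern is always a left single quote followed by a two-digit decade ending in 0 and "’s" (other forms would need their own pass):

```python
import re

# A left single quote directly before a decade like 60’s should be a right single quote.
DECADE_QUOTE = re.compile(r'‘(?=\d0’s)')

def fix_decade_apostrophes(text):
    """Turn ‘60’s into ’60’s while leaving real opening quotes alone."""
    return DECADE_QUOTE.sub('’', text)

sample = 'in the 1950’s and ‘60’s then you’ll love this'
fixed = fix_decade_apostrophes(sample)
print(fixed)
```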
There is an accidental opening quote:
Quote:
3.1.3
[...]
This does the same thing as going to “MENU 6 Creative Style Saturation and going from +3 to -3, but I must admit this new way is much easier.
|
The inches measurement should be "dumb quotes" and not smart quotes:
Quote:
13.10
[...]
An RX-100 has a 20 megapixel sensor which produces images that are about 76" x 50" x 72 dpi out of the camera. Taking the exact same set of pixels and changing to print resolution (300 dpi), the dimensions change to 18.2” x 12.1” x 300 dpi.
[...]
If you wanted to make the image twice as large, you could decrease the dpi to 150 dpi and end up with an image 36.4” x 24.3" in size.
|
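A Regex pass can straighten these out too. A minimal Python sketch, assuming a curly double quote directly after a digit is always an inch mark; that assumption can misfire on a real closing quote after a number, so eyeball the matches before replacing:

```python
import re

# A curly double quote immediately after a digit is treated as an inch mark.
CURLY_INCH = re.compile(r'(?<=\d)[“”]')

def straighten_inch_marks(text):
    """Replace curly quotes after digits with a straight double quote."""
    return CURLY_INCH.sub('"', text)

sample = 'the dimensions change to 18.2” x 12.1” x 300 dpi'
straightened = straighten_inch_marks(sample)
print(straightened)
```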
Missing an opening quote:
Quote:
8.2
[...]
2. Service Availability” which tries to download (via your pre-established Wi-Fi connection to your home router) a list of countries you can use this feature in.
|