Old 01-26-2017, 05:17 AM   #7
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Gary Friedman
Wow! Okay, that answer was very thorough but it also left me a little confused. I DO use Word styles consistently (which is how I get a consistent look in the printed edition).
Hmm, I will have to take a look again at the source file.

IF Word Styles were used properly throughout, then to my knowledge Calibre would have generated far fewer than the 1300+ "calibre##" + "block_##" classes, and you would instead see many classes named "MsoNormal" + "MsoNormalTable" + Word's other naming conventions (you can see a lot of the Word classes if you do a Word -> Save As -> Filtered HTML).

You may have accidentally introduced some Direct Formatting somewhere along the line (WYSIWYG editors are notorious for introducing hidden cruft).

Quote:
Originally Posted by Gary Friedman
Plus you said "Once you throw in a Calibre conversion all bets are off." Not sure how I should interpret that. Do you mean it's hopeless?
Heh... no... I meant it like this.

Code from your specific DOCX -> EPUB conversion:

Spoiler:
Code:
  <table class="table_14">
    <tbody class="calibre2">
      <tr class="calibre3">
        <td class="td_23">
          <p class="block_"><img src="../Images/image22.png" alt="Image" class="calibre36"/></p>
        </td>
      </tr>

      <tr class="calibre3">
        <td class="td_23">
          <div class="frame_1">
            <p class="block_17"><span class="text_21">Figure 1-12:</span><i class="calibre7"><span class="calibre5"> </span></i><i class="calibre7"><span class="calibre5">A reference shot taken in Program Mode at ISO 3200. </span></i><a href="#id_Ref336091007" class="calibre10"><span class="text_21">Figure 1-13</span></a><i class="calibre7"> shows comparison close-ups of the yellow square.</i></p>
          </div>
        </td>
      </tr>
    </tbody>
  </table>


but I took your Original.docx -> Calibre -> EPUB and my conversion got this slightly different code:

Spoiler:
Code:
  <table width="100%">
    <tbody class="calibre2">
      <tr class="calibre3">
        <td class="td_24">
          <p class="block_23"><img alt="Image" src="images/image35.png" class="calibre49"/></p>
        </td>
      </tr>

      <tr class="calibre3">
        <td class="td_24">
          <div class="frame_">
            <p class="block_22"><span class="text_14">Figure 1-12:</span><i class="calibre8"><span class="calibre5"> </span></i><i class="calibre8"><span class="calibre5">A reference shot taken in Program Mode at ISO 3200. </span></i><a href="index_split_011.html#id_Ref336091007" class="calibre11"><span class="text_14">Figure 1-13</span></a><i class="calibre8"> shows comparison close-ups of the yellow square.</i></p>
          </div>
        </td>
      </tr>
    </tbody>
  </table>


(Maybe this was due to different Calibre settings/versions, maybe you tweaked the DOCX slightly before conversion, etc. etc.)

It just so happens to be that some of your Figures/Captions used these calibre## + block_## classes:
  • calibre2 = Maybe Figures
  • block_ = Maybe Figure Image
  • block_17 = Maybe the entire Figure Caption
  • calibre7 = Maybe the italic Caption Text
  • [...]

but MY Calibre conversion came up with:
  • calibre3 = Maybe Figures
  • block_23 = Maybe Figure Image
  • block_22 = Maybe the entire Figure Caption
  • calibre8 = Maybe the italic Caption Text
  • [...]

So YOUR 1300+ classes do not match up with MY 1300+ classes. Any specific Regex I come up with would not be easily copyable to your EPUB. Mine might be looking for class="frame_" while yours would need to look for class="frame_1".
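
To make the mismatch concrete, here is a minimal Python sketch (my own illustration, not something either tool generates). The two strings are trimmed down from the snippets above, and the pattern is a hypothetical one written against MY conversion's class names:

```python
import re

# Trimmed from the two conversions quoted above.
yours = '<div class="frame_1"><p class="block_17">caption</p></div>'
mine = '<div class="frame_"><p class="block_22">caption</p></div>'

# Hypothetical cleanup pattern, written against MY class names.
pattern = re.compile(r'<div class="frame_">\s*<p class="block_22">')

matches_mine = bool(pattern.search(mine))
matches_yours = bool(pattern.search(yours))

print("matches my EPUB:  ", matches_mine)
print("matches your EPUB:", matches_yours)
```

The pattern only finds the caption in the file it was written for, which is why every Regex has to be rebuilt per conversion.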

The ONLY way to figure it out is to look at the code and see what CSS class does what... and then come up with Regex+ways to clean it up from there.

Side Note: Also, once you create this DOCX/EPUB divide, all work isn't easily transferable BACK to the source document. For example:
  • There are quite a few 'Dumb Single Quotes'+"Dumb Double Quotes" that have to be changed to proper ‘Single Quotes’+“Double Quotes”.
  • Many of your "TIP:"s are missing the double space after the colon.

These sorts of mass fixes are more easily made in the source document, THEN you can regenerate your DOCX -> EPUB.

You don't want to:
  • Spend 10+ hours on EPUB-specific tweaks/fixes...
  • Then have to spend 10+ hours duplicating those corrections in your original DOCX.
  • And then: "Oh crap... I have to generate a new DOCX -> EPUB and now the tens/hundreds of Regex I came up with for specific calibre## classes don't work any more"

Quote:
Originally Posted by Gary Friedman
I'm not a programmer and although I understand your suggested approach at a high level I'm not certain how I would get there.
If you are going to be cleaning up/editing the EPUB, you should at least know basic HTML+CSS.

I find the Calibre/Sigil Reports functionality is very helpful in spotting all the different classes:
  • Calibre: Tools -> Reports -> Style Classes
  • Sigil: Tools -> Reports... -> Style Classes in HTML Files

[Attached image: CalibreReports.png]
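
If you ever want the same kind of tally outside the editors, a rough Python sketch along these lines should work. It uses only the standard library. `ClassCounter`, `count_classes`, and the sample markup (loosely based on your conversion above) are names and data I made up for illustration; they are not part of Calibre or Sigil:

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts how often each CSS class name appears in class="..." attributes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img .../>.
        for name, value in attrs:
            if name == "class" and value:
                self.counts.update(value.split())


def count_classes(html_text):
    parser = ClassCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Sample markup, simplified from the first conversion quoted earlier.
sample = """
<tr class="calibre3">
  <td class="td_23">
    <p class="block_"><img src="../Images/image22.png" alt="Image" class="calibre36"/></p>
  </td>
</tr>
<tr class="calibre3">
  <td class="td_23">
    <p class="block_17"><i class="calibre7">A reference shot.</i><i class="calibre7"> More text.</i></p>
  </td>
</tr>
"""

counts = count_classes(sample)
for class_name, n in counts.most_common():
    print(class_name, n)
```

In a real book you would feed it each HTML file from the unzipped EPUB. The built-in Reports are still the easier route, since they let you jump straight to each usage.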

And then there really is nothing that can replace just going through the entire book with multiple passes, figuring out what each class is doing, and "fixing" it:

[Attached images: LookAtPreview.png, LookAtHTML.png]

And in Sigil, I much prefer right clicking on a class and pressing "Go To Link Or Style". This jumps you directly to the CSS class:

[Attached image: LookAtCSS.png]

So in that case, calibre10 does nothing useful, and you can get rid of every reference to it in the EPUB.
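
As a sketch of what "getting rid of all references" means in the HTML, here is a small Python helper. `strip_class` is a hypothetical name of my own, and the Regex assumes double-quoted class attributes like the ones Calibre wrote in the snippets above. You would still delete the matching rule from the CSS file by hand:

```python
import re


def strip_class(html_text, class_name):
    """Remove one class name from every class="..." attribute.

    If the attribute ends up empty, the whole attribute is dropped.
    Assumes double-quoted attributes, as in the Calibre output above.
    """

    def fix(match):
        # Compare whole tokens so "calibre10" never touches "calibre100".
        kept = [c for c in match.group(1).split() if c != class_name]
        if not kept:
            return ""
        return ' class="' + " ".join(kept) + '"'

    return re.sub(r'\s+class="([^"]*)"', fix, html_text)


before = '<a href="#id_Ref336091007" class="calibre10"><span class="text_21">Figure 1-13</span></a>'
after = strip_class(before, "calibre10")
print(after)
```

Run something like this on a copy of the EPUB first, since a Regex cannot tell real markup from text that merely looks like an attribute.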

As you can see, there is an absolute TON of cruft introduced... so depending on the book, different workflows might be faster (maybe Calibre might be best, maybe Word Filtered HTML, maybe BetterRed's recommendation of Mammoth, [...]).

This book's layout is very complicated... so any of these workflows will be time- + labor-intensive, and you might lose certain functionality depending on which workflow you use (for example, linked Indexes go poof with Word's Filtered HTML). It will be a beast to convert no matter which way you slice it.

Last edited by Tex2002ans; 01-26-2017 at 05:29 AM.