MobileRead Forums - View Single Post

BetterRed · 01-09-2014, 01:39 PM

I think what you get in the ePUB XHTML depends on what you do in Word with cut & paste and editing. In that shot I posted, I changed the ". Not" to "… not" in Word and then did a conversion.

The fragmentation of the word "not" that you see in the ePUB XHTML reflects what's in the Word DOCX XML:

ePUB XHTML

Code:

<i class="calibre1">n</i><span class="text1">ot</span>

DOCX XML

Code:

         <w:r w:rsidR="00AB4F90" w:rsidRPr="00160E46">
            <w:t>n</w:t>
         </w:r>
         <w:r w:rsidRPr="00160E46">
            <w:t>ot.</w:t>
         </w:r>

So my conclusion is that the <span class="text1">blah blah</span> sequences stem directly from the XML that Word creates in its DOCX files, and that the more editing one does on the DOCX, the more disorderly the XML becomes. After conversion, that results in less than optimal XHTML, i.e. Garbage In, Garbage Out.
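If you want to check this on your own file, here is a rough Python sketch (standard library only) that lists the text of each run in a paragraph. It uses the DOCX fragment above, wrapped in a w:p element and given the WordprocessingML namespace declaration so that it parses on its own. The helper name run_texts is my own invention, not part of Word or calibre.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# The fragment from the post, wrapped in a paragraph (w:p) so it is
# well-formed by itself. The wrapper is an assumption for illustration.
FRAGMENT = f"""
<w:p xmlns:w="{W_NS}">
  <w:r w:rsidR="00AB4F90" w:rsidRPr="00160E46">
    <w:t>n</w:t>
  </w:r>
  <w:r w:rsidRPr="00160E46">
    <w:t>ot.</w:t>
  </w:r>
</w:p>
"""


def run_texts(paragraph):
    """Return the text of each w:r run directly inside a w:p element (hypothetical helper)."""
    texts = []
    for run in paragraph.findall("w:r", NS):
        # A run can hold more than one w:t element, so join them.
        texts.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return texts


paragraph = ET.fromstring(FRAGMENT)
runs = run_texts(paragraph)
print(runs)           # ['n', 'ot.']
print("".join(runs))  # not.
```

For a real document you would read word/document.xml out of the DOCX (it is a zip archive) with the zipfile module and run the same helper over every w:p element. A word that comes back as several short runs is a likely candidate for the fragmented spans in the converted XHTML.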

One way of ensuring better consistency might be to paste plain, unformatted text into the DOCX. You can set this up via Word Options -> Advanced -> Cut, copy, and paste. You'd then have to do all the font styling manually.

If the examples you posted originate from LIT, it might be interesting to see the XHTML that a LIT to EPUB conversion creates.

BR

Post #18. Last edited by BetterRed; 01-09-2014 at 04:08 PM. Reason: did a Tidy on the XML fragment