tag vs tag - Page 2

BetterRed · 01-09-2014, 02:02 AM

Hmmmm - very bizarre, see attachment. I'll need to sleep on it; maybe someone will solve it while I'm doing that.

I don't think it's an Editor issue, I'm pretty sure it's a Conversion issue.

BR

Sablerose · 01-09-2014, 04:34 AM

And here is what I got for that specific phrase.

BetterRed · 01-09-2014, 01:39 PM

I think the issue of what you get in the ePUB XHTML depends on 'what you do' in Word with cut & paste and editing - in that shot I posted I changed the ". Not" to "… not" in Word and did a conversion.

The fragmentation of the word 'not' that you see in the ePUB XHTML reflects what's in the Word DOCX XML:

ePUB XHTML

Code:

<i class="calibre1">n</i><span class="text1">ot</span>

DOCX XML

Code:

         <w:r w:rsidR="00AB4F90" w:rsidRPr="00160E46">
            <w:t>n</w:t>
         </w:r>
         <w:r w:rsidRPr="00160E46">
            <w:t>ot.</w:t>
         </w:r>

So my conclusion is that the <span class="text1">blah blah</span> sequences stem directly from the XML that Word creates in its DOCX files, and that the more editing one does on the DOCX, the more disorderly the XML becomes. After conversion that results in less than optimal XHTML - i.e. Garbage In, Garbage Out.
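To make the run splitting visible outside Word, here is a minimal Python sketch that parses the two runs quoted above and lists the text each one carries. The enclosing `w:p` element and the helper name `run_texts` are assumptions added for illustration; they are not part of the posted fragment or of Calibre.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used in a DOCX word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The two runs from the post; the w:p wrapper is an assumption so the
# fragment parses as a standalone document.
FRAGMENT = f"""
<w:p xmlns:w="{W_NS}">
  <w:r w:rsidR="00AB4F90" w:rsidRPr="00160E46">
    <w:t>n</w:t>
  </w:r>
  <w:r w:rsidRPr="00160E46">
    <w:t>ot.</w:t>
  </w:r>
</w:p>
"""

def run_texts(paragraph_xml):
    """Return the text of each w:r run, in document order (hypothetical helper)."""
    paragraph = ET.fromstring(paragraph_xml.strip())
    texts = []
    for run in paragraph.iter(f"{{{W_NS}}}r"):
        # A run may hold several w:t elements; join them per run.
        texts.append("".join(t.text or "" for t in run.iter(f"{{{W_NS}}}t")))
    return texts

pieces = run_texts(FRAGMENT)
print(pieces)           # one list entry per run
print("".join(pieces))  # the text as a reader sees it
```

Each list entry corresponds to one `w:r` element, so a word that shows up as more than one entry has been fragmented in the DOCX itself, before any conversion takes place.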

One way of ensuring better consistency might be to paste plain ASCII text into the DOCX - you can achieve this via the Word Options->Advanced->Cut, copy and Paste settings. You'd then have to do all the font styling manually.

If the examples you posted originate from LIT it might be interesting to see the XHTML that a LIT to EPUB conversion creates.

BR

PeterT · 01-09-2014, 02:06 PM

You can also "strip" formatting via selecting the text in question and doing both a ctrl-q and ctrl-space keyboard commands (at least on Windows).

Sablerose · 01-09-2014, 02:44 PM

Well, it might be easiest to change my Word defaults to strip out the formatting as I paste text in. Since I verify all italics as I go already, I might as well redo them, and see how that affects the conversion results.

The ctrl-q and ctrl-space idea, I hesitate at, since I don't know what those functions do. But I'll take a look and see.

As far as the idea of a direct LIT conversion, I don't know for sure what type of file these docx files came from. I had already done a cut/paste from the original to make the docx file I converted. And I get my eBooks in all types of files, including PDFs, which I always remake to docx before I try them in Calibre.

I think as a test case, I will do a couple chapters of a fresh book (to see which way it goes from the formatting as is), then redo the same text, with Word stripping the formatting and my redoing it myself.

I'll let you all know my results.

Oh, and another reason I want this to be consistent. I have ~absolutely no~ idea what the code line you got in the doc XML means. And I don't know anything about regex either. Besides, simple is better and removing the problem will make redoing things faster.

BetterRed · 01-09-2014, 04:11 PM

Quote:

Originally Posted by Sablerose

Well, it might be easiest to change my Word defaults to strip out the formatting as I paste text in. Since I verify all italics as I go already, I might as well redo them, and see how that affects the conversion results.

The ctrl-q and ctrl-space idea, I hesitate at, since I don't know what those functions do. But I'll take a look and see.

@Sablerose - see "300+ Useful Word 2007 KB Shortcuts". That site also has shortcuts for Word 2010 and many other programs.

Quote:

Originally Posted by Sablerose

Oh, and another reason I want this to be consistent. I have ~absolutely no~ idea what the code line you got in the doc XML means. And I don't know anything about regex either. Besides, simple is better and removing the problem will make redoing things faster.

@Sablerose - I only posted the XML from the DOCX to demonstrate that the weird XHTML in the ePUB is a direct result of converting the similarly weird XML in the DOCX. The 'weird' is that 'not' is split into 'n' and 'ot'.

I updated the XML fragment I posted earlier - after a Tidy.

The 'n' and 'ot' are at the beginning and end of both the XHTML and the XML. I don't expect anyone to actually comprehend the DOCX XML - except maybe Kovid.

BR

buffaloseven · 01-14-2014, 04:46 PM

I only briefly scanned this and thought it worth mentioning: using an <i> tag is no longer a valid way to present italics in an HTML file. Italics now use <em>, and any converter that is converting an <i> tag to a span is doing what it should (since it's inline within a paragraph). Try using <em> instead and see if the problem persists.

Toxaris · 01-15-2014, 02:26 AM

Quote:

Originally Posted by buffaloseven

I only briefly scanned this and thought it worth mentioning: using an <i> tag is no longer a valid way to present italics in an HTML file. Italics now use <em>, and any converter that is converting an <i> tag to a span is doing what it should (since it's inline within a paragraph). Try using <em> instead and see if the problem persists.

And that is a painfully wrong decision. The <i> tag is about style and the <em> tag is about semantics. They are different things. That most (or perhaps all current) readers/browsers render <em> the same as <i> is another story.
If you really wanted to do this right, identify the semantic use of the tag, create a class in the stylesheet for that use, and determine its style. It might be emphasis, it might be thoughts, it might be a letter, etc. If you really want to make semantic use, follow it through.
The <i> tag will be supported for a long time, even if it is deprecated. I would not even be surprised if it is restored.

WordML/OpenXML is not that difficult to understand. It is just very big, with a lot of options and functions, and it is all documented quite well. There can be various reasons for these split-ups. Most likely part of the word was edited with a slightly different style, which would cause this behavior. It would be correct, only not very useful for further processing. That is the main reason I use a different way of creating the HTML export from Word...
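One possible way to tidy such split-ups before conversion is to merge neighbouring runs whose formatting (`w:rPr`) is identical, ignoring the `w:rsid*` revision attributes on the runs. The Python sketch below illustrates only that idea under stated assumptions: the helper names `fingerprint` and `merge_runs` are hypothetical, the `<w:i/>` formatting in the sample is invented for the demonstration (the posted fragment shows no `w:rPr`), and this is not the method Calibre or any poster in this thread uses.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def fingerprint(elem):
    """Comparable summary of an element subtree; None if the element is absent."""
    if elem is None:
        return None
    return (elem.tag, tuple(sorted(elem.attrib.items())),
            (elem.text or "").strip(), tuple(fingerprint(c) for c in elem))

def merge_runs(paragraph):
    """Merge adjacent simple w:r children with identical w:rPr, in place."""
    previous = None
    for child in list(paragraph):
        texts = child.findall(f"{W}t")
        # Only touch simple runs: an optional w:rPr plus exactly one w:t.
        simple = (child.tag == f"{W}r" and len(texts) == 1 and
                  all(c.tag in (f"{W}rPr", f"{W}t") for c in child))
        if not simple:
            previous = None
            continue
        same_format = (previous is not None and
                       fingerprint(previous.find(f"{W}rPr")) ==
                       fingerprint(child.find(f"{W}rPr")))
        if same_format:
            target = previous.find(f"{W}t")
            target.text = (target.text or "") + (texts[0].text or "")
            target.set(XML_SPACE, "preserve")  # keep any joined spaces
            paragraph.remove(child)
        else:
            previous = child
    return paragraph

# Hypothetical sample: two italic runs that differ only in rsid, then a plain run.
SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:r w:rsidR="00AB4F90" w:rsidRPr="00160E46"><w:rPr><w:i/></w:rPr><w:t>n</w:t></w:r>
  <w:r w:rsidRPr="00160E46"><w:rPr><w:i/></w:rPr><w:t>ot.</w:t></w:r>
  <w:r><w:t> Plain text</w:t></w:r>
</w:p>
"""

paragraph = merge_runs(ET.fromstring(SAMPLE.strip()))
merged = [run.find(f"{W}t").text for run in paragraph.findall(f"{W}r")]
print(merged)
```

Runs that contain anything other than a single `w:t` (tabs, breaks, fields and so on) are deliberately left alone, since merging those safely needs more care than this sketch takes.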

