Old 01-09-2014, 02:02 AM   #16
BetterRed
null operator (he/him)
Posts: 20,572
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Hmmmm, very bizarre (see attachment). I'll need to sleep on it; maybe someone will solve it while I'm doing that.

I don't think it's an Editor issue; I'm pretty sure it's a Conversion issue.

BR
Attachment: Capture.JPG (32.2 KB)
Old 01-09-2014, 04:34 AM   #17
Sablerose
Enthusiast
Posts: 42
Karma: 10
Join Date: Dec 2010
Location: Arizona USA
Device: iPod Touch 6G
And here is what I got for that specific phrase.
Attachment: Conversion.jpg (37.8 KB)
Old 01-09-2014, 01:39 PM   #18
BetterRed
null operator (he/him)
I think what you get in the ePUB XHTML depends on what you do in Word with cut & paste and editing. For the shot I posted, I changed the ". Not" to "… not" in Word and then did a conversion.

The fragmentation of the word "not" that you see in the ePUB XHTML reflects what's in the Word DOCX XML:

ePUB XHTML
Code:
<i class="calibre1">n</i><span class="text1">ot</span>
DOCX XML
Code:
         <w:r w:rsidR="00AB4F90" w:rsidRPr="00160E46">
            <w:t>n</w:t>
         </w:r>
         <w:r w:rsidRPr="00160E46">
            <w:t>ot.</w:t>
         </w:r>
So my conclusion is that the <span class="text1">blah blah</span> sequences stem directly from the XML that Word creates in its DOCX files, and that the more editing one does on the DOCX, the more disorderly the XML becomes. After conversion that results in less than optimal XHTML, i.e. Garbage In, Garbage Out.
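
To make the run fragmentation concrete, here is a minimal Python sketch that reads a paragraph like the DOCX fragment above and joins adjacent runs whose formatting is identical. The wrapping <w:p> element, the helper names (run_format, run_text, merged_runs) and the idea of comparing only the direct children of <w:rPr> are assumptions made for illustration; this is not how Calibre or Word handle runs. The fragment as posted shows no <w:rPr>, so both runs compare as "same format" here; in a file where the "n" really is italic and the "ot" is not, the two runs would stay separate.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in DOCX files
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The fragment from the post, wrapped in a paragraph element (an assumption)
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r w:rsidR="00AB4F90" w:rsidRPr="00160E46"><w:t>n</w:t></w:r>
  <w:r w:rsidRPr="00160E46"><w:t>ot.</w:t></w:r>
</w:p>"""


def run_format(run):
    """Simplified formatting key: the direct children of <w:rPr>, if any.

    The rsid attributes on <w:r> are deliberately ignored, since they only
    record editing sessions, not formatting.
    """
    rpr = run.find(f"{{{W}}}rPr")
    if rpr is None:
        return ()
    return tuple((child.tag, tuple(sorted(child.attrib.items()))) for child in rpr)


def run_text(run):
    """Concatenate the text of all <w:t> elements in one run."""
    return "".join(t.text or "" for t in run.findall(f"{{{W}}}t"))


def merged_runs(paragraph):
    """Return [(format_key, text)] with adjacent same-format runs joined."""
    merged = []
    for run in paragraph.findall(f"{{{W}}}r"):
        fmt, text = run_format(run), run_text(run)
        if merged and merged[-1][0] == fmt:
            merged[-1] = (fmt, merged[-1][1] + text)
        else:
            merged.append((fmt, text))
    return merged


paragraph = ET.fromstring(SAMPLE)
print([run_text(r) for r in paragraph.findall(f"{{{W}}}r")])  # ['n', 'ot.']
print(merged_runs(paragraph))                                  # [((), 'not.')]
```

The sketch only reports what a merge would look like; it does not write anything back into the DOCX.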

One way of ensuring better consistency might be to paste plain (unformatted) text into the DOCX. You can set this up via Word Options -> Advanced -> Cut, copy, and paste. You'd then have to do all the font styling manually.

If the examples you posted originate from LIT, it might be interesting to see the XHTML that a LIT to EPUB conversion creates.

BR

Last edited by BetterRed; 01-09-2014 at 04:08 PM. Reason: did a Tidy on the XML fragment
Old 01-09-2014, 02:06 PM   #19
PeterT
Grand Sorcerer
Posts: 12,168
Karma: 73448616
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
You can also "strip" formatting by selecting the text in question and pressing both Ctrl+Q and Ctrl+Space (at least on Windows).
Old 01-09-2014, 02:44 PM   #20
Sablerose
Enthusiast
Easiest alternative

Well, it might be easiest to change my Word defaults to strip out the formatting as I paste text in. Since I verify all italics as I go already, I might as well redo them, and see how that affects the conversion results.

The ctrl-q and ctrl-space idea, I hesitate at, since I don't know what those functions do. But I'll take a look and see.

As far as the idea of a direct LIT conversion, I don't know for sure what type of file these docx files came from. I had already done a cut/paste from the original to make the docx file I converted. And I get my eBooks in all types of files, including PDFs, which I always remake to docx before I try them in Calibre.

I think as a test case, I will do a couple chapters of a fresh book (to see which way it goes from the formatting as is), then redo the same text, with Word stripping the formatting and my redoing it myself.

I'll let you all know my results.

Oh, and another reason I want this to be consistent. I have ~absolutely no~ idea what the code line you got in the doc XML means. And I don't know anything about regex either. Besides, simple is better and removing the problem will make redoing things faster.
Old 01-09-2014, 04:11 PM   #21
BetterRed
null operator (he/him)
Quote:
Originally Posted by Sablerose
Well, it might be easiest to change my Word defaults to strip out the formatting as I paste text in. Since I verify all italics as I go already, I might as well redo them, and see how that affects the conversion results.

The ctrl-q and ctrl-space idea, I hesitate at, since I don't know what those functions do. But I'll take a look and see.
@Sablerose - 300+ Useful Word 2007 KB Shortcuts. That site also has shortcuts for Word 2010, and for many other programs.

Quote:
Originally Posted by Sablerose
Oh, and another reason I want this to be consistent. I have ~absolutely no~ idea what the code line you got in the doc XML means. And I don't know anything about regex either. Besides, simple is better and removing the problem will make redoing things faster.
@Sablerose - I only posted the XML from the DOCX to demonstrate that the weird XHTML in the ePUB is a direct result of converting the similarly weird XML in the DOCX. The 'weird' is that 'not' is split into 'n' and 'ot'.

I updated the XML fragment I posted earlier, after a Tidy.

The 'n' and 'ot' are at the beginning and end of both the XHTML and the XML. I don't expect anyone to actually comprehend the DOCX XML, except maybe Kovid.
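
For anyone who would rather spot these splits in the converted XHTML than read the DOCX XML, here is a minimal Python sketch that looks for a word broken across two adjacent inline elements, like the 'n' / 'ot' case. The pattern, the list of inline tags (i, em, b, span) and the function name find_split_words are assumptions for illustration; it is a rough detector, not a general HTML parser, and it only reports candidates without changing the file.

```python
import re

# Letters that end one inline element, directly followed (no space in
# between) by letters that start the very next inline element.
SPLIT_WORD = re.compile(
    r"(\w+)</(i|em|b|span)>"          # e.g.  n</i>
    r"<(i|em|b|span)\b[^>]*>(\w+)"    # e.g.  <span class="text1">ot
)


def find_split_words(xhtml):
    """Return the re-joined words that are split across inline tags."""
    return [m.group(1) + m.group(4) for m in SPLIT_WORD.finditer(xhtml)]


sample = '<p>… <i class="calibre1">n</i><span class="text1">ot</span> that.</p>'
print(find_split_words(sample))  # ['not']
```

Each hit still needs a human decision: whether the whole word should be italic, or none of it.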

BR
Old 01-14-2014, 04:46 PM   #22
buffaloseven
Watching the Sky
Posts: 234
Karma: 634112
Join Date: Sep 2012
Location: Winnipeg, MB
Device: Kobo Aura
I only briefly scanned this and thought it worth mentioning: using an <i> tag is no longer a valid way to present italics in an HTML file. Italics now use <em> and any converter that is converting an <i> tag to a span is doing what it should (since it's inline within a paragraph). Try using <em> instead and see if the problem persists
Old 01-15-2014, 02:26 AM   #23
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
Quote:
Originally Posted by buffaloseven
I only briefly scanned this and thought it worth mentioning: using an <i> tag is no longer a valid way to present italics in an HTML file. Italics now use <em> and any converter that is converting an <i> tag to a span is doing what it should (since it's inline within a paragraph). Try using <em> instead and see if the problem persists
And that is a painfully wrong decision. The <i> tag is about style and the <em> tag is about semantics; they are different things. That most (or perhaps all current) readers and browsers render <em> the same way as <i> is another story.
If you really want to do this right, identify the semantic use of each <i> tag, create a class in the stylesheet for that use, and define its style. It might be emphasis, it might be thoughts, it might be a letter, etc. If you really want to make semantic use of the markup, follow it through.
The <i> tag will be supported for a long time, even if it is deprecated. I would not even be surprised if it is restored.

WordML/OpenXML is not that difficult to understand. It is just very big, with a lot of options and functions, and it is all documented quite well. These splits can have various causes. Most likely part of the word was edited with a slightly different style, which would cause this behavior. The result is correct, just not very useful for further processing. That is the main reason I use a different way of creating the HTML export from Word...