02-09-2014, 04:25 PM   #103
unboggling
Wizard
Posts: 1,065
Karma: 858115
Join Date: Jan 2011
Device: Kobo Clara, Kindle Paperwhite 10
Quote:
Originally Posted by LadyKate
I did a quick look at the markdown but... does it maintain the italics and bold settings? Didn't notice that.

I don't use it. I noticed in its documentation that it handles emphasis, but I don't know how that translates to bold or italic, if at all.

Quote:
Originally Posted by LadyKate
The freebie tool (not around that I could find but still works well) HTML Book Fixer, strips the excess spans BUT it also manages to remove the italics if they are in a span. Most irritating.

With excess nested spans it is darn near impossible to find the matching open / close tags that refer to italics using regex and a royal pain to "eyeball" the italics in the original.

I don't know why modern word processors don't allow the option to clean up the underlying code that is used to create pdf and html files. The main reason I find pdf files so hard to clean up is because most were created in a wysiwyg program. From the underlying code I get in the html it is usually word or a word clone that uses the horrid "<p class=MsoNormal><span style='mso-fareast-font-family:"MS Mincho"'>" often skipping the quotes around the class name. (note the font family/name is whatever font the doc used.)

I think all that excess code can lead to problems in conversions when nested too deep. I had one problem caused by not cleaning up a file because I had not noticed that one of the nested div tags was class="chapter" and around the entire chapter and another was class="chapterHead" and around the Chapter whatever.

Okay, I think I've been misunderstanding what you're trying to do. Do I have the following sequence straight?
  1. You are starting with a PDF. (That is a horrible format to start with.)
  2. Then you convert it to what? HTML? EPUB? RTF? I'm confused about that.
  3. Then you want to clean up the underlying code. Why? Is it just to eliminate formatting problems that interfere with readability? Or are you producing books for other people that must have clean code?
  4. At the end of the sequence, what format should it be, EPUB?

This is what I do, using the "ignore underlying code" approach. If bold or italic formatting is there to begin with in the original format, that formatting is preserved all the way through to the end (except in cases where I fix inappropriate use of bold or italic).
  1. I buy a book. Say AZW3. In this case its file size is 405 KB. Add it to calibre, download metadata.
  2. Convert to EPUB. This EPUB is 315 KB. Assess it in calibre Viewer. It has some bold and italic formatting. If there are no annoying formatting problems, enter a format quality rating; otherwise:
  3. Convert to RTF. Fix formatting problems in Word, without worrying about underlying code. Deal with Word's presentation of markup (if necessary, replace interfering linefeeds with paragraph marks (pilcrows), delete awkward page breaks, etc.). Fix other annoying problems. The appropriate bold and italic formatting is still there. Save as DOCX.
  4. Convert to EPUB. This EPUB is 310 KB. Assess in Viewer, enter format quality rating. The bold and italic are still there. Then just read the book, unbothered by any extra unnecessary code in the format because it is invisible and irrelevant while reading.
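
The conversion steps in that list don't have to go through the calibre GUI; calibre also ships a command-line tool, ebook-convert, that picks the output format from the file extension. A minimal sketch in Python, assuming calibre is installed and ebook-convert is on the PATH. The function names and the "_fixed" filename suffix are made up for illustration, and the Word editing between the two passes stays manual:

```python
import shutil
import subprocess
from pathlib import Path


def convert_cmd(src, dst):
    """Build the ebook-convert command; the output format comes from dst's extension."""
    return ["ebook-convert", str(src), str(dst)]


def convert(src, dst, dry_run=False):
    """Run one conversion, or only return the command when dry_run is set."""
    cmd = convert_cmd(src, dst)
    if not dry_run:
        if shutil.which("ebook-convert") is None:
            raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
        subprocess.run(cmd, check=True)
    return cmd


def first_pass(book, dry_run=False):
    """Steps 2 and 3: source -> EPUB for assessment, then EPUB -> RTF for Word."""
    book = Path(book)
    epub = book.with_suffix(".epub")
    rtf = book.with_suffix(".rtf")
    cmds = []
    if book.suffix.lower() != ".epub":  # nothing to do if we already have an EPUB
        cmds.append(convert(book, epub, dry_run))
    cmds.append(convert(epub, rtf, dry_run))
    return cmds


def final_pass(docx, dry_run=False):
    """Step 4: the DOCX saved from Word -> final EPUB (hypothetical "_fixed" name
    so the first EPUB is not overwritten)."""
    docx = Path(docx)
    return convert(docx, docx.with_name(docx.stem + "_fixed.epub"), dry_run)


if __name__ == "__main__":
    # Dry run: print the commands without needing calibre installed.
    for cmd in first_pass("book.azw3", dry_run=True):
        print(" ".join(cmd))
    print(" ".join(final_pass("book.docx", dry_run=True)))
```

Run it with dry_run=True first to see the commands, then drop the flag once the paths look right.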

Now, instead of starting with AZW3, let's say I start with PDF. I would do the same sequence: convert to EPUB and assess it, convert to RTF, fix in Word (get rid of headers/footers, deal with Word's presentation of markup, fix other annoying problems) and save as DOCX, convert to EPUB, assess it in Viewer. Ignore the underlying code. Read the book.

Admittedly this works best for simply formatted text-based books, starting from EPUB, AZW3, or MOBI. PDF conversions usually have more problems, so for me they're more trouble than they're worth.
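
If you do want cleaner code rather than ignoring it, the matching open/close problem you describe is easier with an HTML parser than with regex, because the parser tells you which close tag belongs to which open tag. A rough sketch using Python's standard html.parser, assuming the italics are set with an inline font-style: italic on the span (if your files use a class name instead, the check would need to change). SpanFlattener and flatten_spans are names I made up, and this version silently drops comments and doctype lines:

```python
from html.parser import HTMLParser


class SpanFlattener(HTMLParser):
    """Drop every <span>, but turn spans with an inline italic style into <i>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.span_stack = []  # one bool per open span: was it italic?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").replace(" ", "").lower()
            italic = "font-style:italic" in style  # assumption: inline style only
            self.span_stack.append(italic)
            if italic:
                self.out.append("<i>")
        else:
            # Re-emit any other tag exactly as it appeared in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: keep them, except empty spans.
        if tag != "span":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span":
            # Close <i> only if the matching open span was italic.
            if self.span_stack and self.span_stack.pop():
                self.out.append("</i>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def flatten_spans(html):
    """Return html with spans removed and italic spans rewritten as <i>...</i>."""
    parser = SpanFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        "<p class=MsoNormal><span style='mso-fareast-font-family:\"MS Mincho\"'>"
        "<span style=\"font-style: italic\">word</span> rest</span></p>"
    )
    print(flatten_spans(sample))
```

The stack is what regex can't give you: each closing span pops the entry for its own opening span, so nesting depth stops mattering.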

Last edited by unboggling; 02-10-2014 at 09:20 PM.