Old 12-22-2010, 11:30 AM   #14
ldolse
Quote:
Originally Posted by kiwidude
Interesting, could you share the things you do as "preprocess code" and how? While I use Calibre I haven't yet gone through the huge amount of options to figure out what ones might be most useful to me.
The option is under structure detection in Calibre and is disabled by default, so enable it before converting your source to epub. It doesn't work on epub documents, primarily because the conversion pipeline assumes that an epub is already a well-formatted document. If your source is epub, just rename it from .epub to .zip and import it into Calibre as zipped html, which you can then convert to epub with the feature enabled.
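If you'd rather not rename the original by hand, something like this rough sketch makes a .zip copy and leaves the epub alone. The function name `epub_to_zip_copy` is just made up for illustration, it isn't part of Calibre:

```python
import shutil
from pathlib import Path


def epub_to_zip_copy(epub_path):
    """Copy book.epub to book.zip in the same folder and return the new path.

    Hypothetical helper, not a Calibre function. The original file is kept.
    """
    src = Path(epub_path)
    if src.suffix.lower() != ".epub":
        raise ValueError(f"expected an .epub file, got {src.name}")
    dst = src.with_suffix(".zip")
    shutil.copyfile(src, dst)  # an epub is already a zip archive, so a plain copy is enough
    return dst
```

The resulting .zip can then be added to Calibre the same way as any other zipped html source.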

The code is here if you want to review it. It's mostly regex-based, so it's pretty easy to understand if you're familiar with regex. If you've got things that would work across a wide variety of docs and that you would like to see added, let me know.

There are a bunch of things it does:
  • Removes empty span and formatting tags
  • Checks to see if the doc is just a giant text file in <pre> tags and, if so, marks the individual lines up with <p> tags.
  • Searches for faux indents made with nbsp and replaces them with a 3% text indent style
  • De-hyphenates the source document to get rid of 99-100% of the hyphens that shouldn't be there, while retaining the ones that should. (currently only at line breaks, I've been considering doing all hyphenated content in the doc)
  • Unwraps hard line breaks
  • Searches for numerous kinds of common chapter headings and wraps each heading in an <h2> tag; it also searches for common titles following the headings and wraps them in <h3> tags. A lot of logic went into this to prevent false positives, though they can still creep in (it's still easier to fix a couple of false positives after the fact than to split and mark up the whole book by hand)
  • Centers common soft break markers if they're not centered
  • Deletes empty paragraphs in cases where there is an empty paragraph between every other paragraph, but only when a user also enables the 'remove paragraph spacing' option under look and feel (still need to tune this one to detect/retain soft-breaks)
  • Probably a couple other things I don't remember
  • The only really destructive thing it does is remove all non-breaking spaces, which is needed for the regexes to work correctly. Non-breaking spaces in empty paragraphs that were used for spacing are put back at the end of the process, but the others are eliminated permanently. This isn't often an issue, but it might screw up a bit of formatting in one book out of fifty.
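To give a feel for what a few of these steps look like, here is a simplified sketch in Python. These are not the actual Calibre regexes: the function names, the patterns, and the "joined word appears elsewhere in the doc" test for de-hyphenation are all assumptions made for illustration, and the real code handles far more cases.

```python
import re


def remove_empty_tags(html):
    """Drop empty span/formatting tag pairs, repeating so nested ones go too."""
    pattern = re.compile(r"<(span|b|i|em|strong)\b[^>]*>\s*</\1>", re.IGNORECASE)
    previous = None
    while previous != html:
        previous = html
        html = pattern.sub("", html)
    return html


def replace_nbsp_indents(html):
    """Turn two or more leading non-breaking spaces in a <p> into a 3% text indent."""
    pattern = re.compile(r"<p>(?:&nbsp;|\u00a0){2,}", re.IGNORECASE)
    return pattern.sub('<p style="text-indent: 3%">', html)


def dehyphenate_line_breaks(text):
    """Join words split by a hyphen at a line break.

    Assumed heuristic: drop the hyphen only if the joined word also occurs
    unhyphenated somewhere in the text; otherwise keep the hyphen.
    """
    known_words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}

    def join(match):
        first, second = match.group(1), match.group(2)
        if (first + second).lower() in known_words:
            return first + second
        return first + "-" + second

    return re.sub(r"([A-Za-z]+)-\s*\n\s*([A-Za-z]+)", join, text)


def mark_chapter_headings(html):
    """Wrap simple 'Chapter 12' / 'Part IV' style paragraphs in <h2> tags."""
    pattern = re.compile(
        r"<p>\s*((?:Chapter|Part|Book)\s+(?:\d+|[IVXLC]+))\s*</p>",
        re.IGNORECASE,
    )
    return pattern.sub(r"<h2>\1</h2>", html)
```

For example, `mark_chapter_headings("<p>Chapter 12</p><p>Text here.</p>")` returns `<h2>Chapter 12</h2><p>Text here.</p>`. A pattern this loose will produce false positives on real books, which is why so much extra logic goes into that step.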

Once I do all that, I find the work I need to do in Sigil is a lot less. I've been thinking about fixing the chapter markup routine to work a little better with Sigil as well, by adding the 'not in sigil toc' id so that Sigil uses only the heading or the title instead of both.

Last edited by ldolse; 12-22-2010 at 11:40 AM.