Quote:
Originally Posted by unboggling
I don't use it. I noticed in its documentation that it handles emphasis, but I don't know how that translates to bold or italic, if at all.
Okay, I think I've been misunderstanding what you're trying to do. Do I have the following sequence straight?
- You are starting with a PDF. (That is a horrible format to start with.)
- Then you convert it to what? HTML? EPUB? RTF? I'm confused about that.
- Then you want to clean up the underlying code. Why? Is it just to eliminate formatting problems that interfere with readability? Or are you producing books for other people that must have clean code?
- At the end of the sequence, what format should it be, EPUB?
OK, I tend to treat HTML as the starting point for cleanup.
HTML can be obtained by opening an ePub or a Mobi file from Calibre, or by saving an RTF, DOC or DOCX file as HTML in an editor that supports it.
A PDF file can be converted to HTML using Acrobat Pro (I only have version 7, lol; I don't use it enough to buy a newer version), a word processor that can export to HTML, or Mobipocket Creator, which generates an HTML file as part of the process of building the PRC.
In other words, I use any method I can find to get my original document into HTML, perhaps even taking an old text file and going through it adding tags. (I can't find the PHP files I had that used a bunch of rules for creating paragraphs out of a flat text file. It took me quite a while to write them and figure out the regex for matching all the characters found in a paragraph.)
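Since the PHP files are lost, here is a rough Python sketch of the same general idea, not a reconstruction of the original rules. It assumes paragraphs in the text file are separated by blank lines and that single line breaks are just hard wrapping; the function name text_to_paragraphs is made up for illustration.

```python
import html
import re


def text_to_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <p> tags."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines mark a paragraph boundary.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Rejoin hard-wrapped lines into a single line of text.
        joined = " ".join(
            line.strip() for line in block.split("\n") if line.strip()
        )
        if joined:
            # Escape &, < and > so the text is safe inside HTML.
            paragraphs.append(f"<p>{html.escape(joined, quote=False)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line\nwrapped.\n\nSecond & last."
    print(text_to_paragraphs(sample))
```

Text files that mark paragraphs by indentation rather than blank lines would need a different splitting rule.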
Sometimes, if the HTML is horrid with nested spans and garbage, I will strip all the markup from it and then use the original file to search for italics and bold or strong. Using two editors, I search the original file for the tag, copy enough of the text in or around the tag to find the same passage in the "clean" copy, and put in clean tags there.
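For anyone who would rather script that than do it by hand in two editors, here is a minimal Python sketch of an automated variant. It is not the manual method described above: it assumes the emphasis in the original is marked with actual i, em, b or strong tags, throws away every other tag and all attributes, and keeps the text. The names KEEP, EmphasisKeeper and keep_emphasis_only are invented for this example.

```python
from html import escape
from html.parser import HTMLParser

# Map both presentational and semantic tags onto one clean pair.
KEEP = {"i": "em", "em": "em", "b": "strong", "strong": "strong"}


class EmphasisKeeper(HTMLParser):
    """Collect text, keeping only italic/bold tags in a clean form."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are dropped on purpose.
        if tag in KEEP:
            self.out.append(f"<{KEEP[tag]}>")

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.out.append(f"</{KEEP[tag]}>")

    def handle_data(self, data):
        # Entities were decoded by the parser, so re-escape the text.
        self.out.append(escape(data, quote=False))


def keep_emphasis_only(markup: str) -> str:
    parser = EmphasisKeeper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    messy = '<p class="x"><span style="a"><i>Hello</i> <b>world</b></span></p>'
    print(keep_emphasis_only(messy))
```

Italics or bold that are expressed only through span styles or CSS classes will not be caught by this, so those still need the manual search. Paragraph tags are also dropped here and would have to be added back separately.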
Argh, I must apologize. This has become much too rambling and has probably bored everyone.