01-01-2010, 10:50 AM   #6
tyche
Addict
Posts: 227
Karma: 2530
Join Date: Dec 2009
Device: PRS-505, iPad
Jackie_w, that is some good info. I'll just add a few of my usual tricks for fixing messed-up formatting. Even a good .lit file needs some massaging to make a good epub. I find it's worth a little effort to fix up a text before reading it, and while I'm reading one book I can be working on another. Once you get better at it, you can fix even the most messed-up text in about 30 minutes.


Do a search for " " (closing quote, space, opening quote), or whatever marks are used for the spoken text. This will find lines with two different people talking (the end of the first person and the beginning of the second). Then break those lines up so the story flows better. I hate when two people are talking on the same line :0
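If you'd rather do this outside Word, here is a rough Python sketch of the same idea. It assumes plain text with straight double quotes around speech, and the function name split_speakers is just made up for illustration. It will also split things that aren't really two speakers, so look over the result.

```python
import re

def split_speakers(text):
    # A closing quote, one space, then an opening quote usually means two
    # speakers share a line, so start the second speaker on a new line.
    return re.sub(r'" "', '"\n"', text)

sample = '"Are you coming?" "Not tonight," she said.'
print(split_speakers(sample))
```

If the book uses curly quotes or single quotes for speech, the pattern would need to change to match.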

Another big one is joining broken paragraphs. The obvious sign is a hard return followed by a lowercase letter (or a lowercase letter followed by a hard return). Using Word's wildcard search (its version of regex), you can find these and either mass change the results or just find and fix them one at a time. Be careful with the mass change, though. For example, things like lyrics or special messages would get caught in this find, but you wouldn't want to join those lines.

For example, a search for a hard return, ^13, followed by a lowercase letter, [a-z], would be ^13([a-z])

With the parentheses you can reuse what was found in the replacement. The replace for this would be ' \1' (without the quotes), i.e. a space then \1. This replaces the return with a space and puts back the lowercase letter it found, so the lines join up.
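The same find and replace can be sketched in Python for a plain text file. This is only a sketch, assuming the file uses \n line endings; the name join_before_lowercase is hypothetical. Python's \n stands in for Word's ^13, and \1 works the same way.

```python
import re

def join_before_lowercase(text):
    # Python version of Word's find ^13([a-z]) / replace " \1":
    # a line break followed by a lowercase letter becomes a space
    # followed by that same letter.
    return re.sub(r'\n([a-z])', r' \1', text)

broken = "He walked to the\nstore.\nThen he went home."
print(join_before_lowercase(broken))
```

Like the Word version, this is a mass change, so lyrics or verse with lowercase line starts would get joined too.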

Doing the search the other way, ([a-z])^13, helps find broken lines as well as missing endings like periods. Its replace format would be '\1 ' (without the quotes), i.e. \1 then a space.
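Here is a rough Python sketch of the reverse search, again assuming \n line endings. Both function names are invented for the example. The first one only lists the suspect lines so you can find and fix by hand; the second does the mass change.

```python
import re

def find_lowercase_endings(text):
    # List (line number, line) for every line that ends in a lowercase
    # letter and is followed by a line break.
    lines = text.split("\n")
    hits = []
    # The final element has no line break after it, so skip it.
    for number, line in enumerate(lines[:-1], start=1):
        if re.search(r"[a-z]$", line):
            hits.append((number, line))
    return hits

def join_after_lowercase(text):
    # Python version of Word's find ([a-z])^13 / replace "\1 ".
    return re.sub(r"([a-z])\n", r"\1 ", text)

sample = "It was a dark\nand stormy night.\nThe end\n"
for number, line in find_lowercase_endings(sample):
    print(number, line)
```

The listing is the safer of the two, since a line that ends in a lowercase letter might be a real paragraph end that is just missing its period.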

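Then clean up the extra spacing by searching for ^p^p (2 returns, or as many as you are looking for) and replacing with ^p. Note that Word only accepts ^p in the Find box with wildcards switched off; with wildcards on, use ^13 there instead. You can then select the whole text and set the margins and line spacing to something you like.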
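In Python the cleanup of doubled returns can be done in one pass. A minimal sketch, assuming \n line endings and a made-up function name:

```python
import re

def collapse_returns(text):
    # Any run of two or more line breaks becomes a single line break,
    # the same result as repeating the ^p^p -> ^p replace until no
    # matches are left.
    return re.sub(r"\n{2,}", "\n", text)

spaced = "First paragraph.\n\n\nSecond paragraph.\nThird paragraph."
print(collapse_returns(spaced))
```

Run this after joining the broken lines, not before, or you lose the blank lines that mark where the real paragraphs end.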

In Word I find it better to save a copy as HTML (filtered). Even with the crappy MS additions, Calibre will build a very accurate epub from it. You can also copy and paste the text into OpenOffice and save it as .html, which carries even less baggage, but I've not seen any benefit in the resulting epub. Or use something like Notepad++ and, with some experience, wipe out all the extraneous HTML tagging. I usually leave it as the MS Word filtered file unless I want a standard .html file.

Be sure to look at the Calibre options for removing spaces between paragraphs. Even with your HTML page looking right, this can help keep extra spaces from creeping in.

Good luck!