Clean HTML or PDF before MOBI conversion in Calibre
I recently bought a DVD that contains 1000+ books. Having preordered a Kindle 3, I'd like to create MOBI files from some of this content, but I have run into some formatting challenges. I'd appreciate any advice on how to resolve these and automate the solution as much as possible.
So far I have found two methods to export content from the DVD.
1. Create a PDF of the book. Most formatting is preserved in the PDF, but importing it into Calibre results in garbled text, missing text, duplicated text and other formatting issues.
2. Copy the book content to a text editor. The formatting is always lost this way, and every heading in the text now appears twice, like this:
instead of this:
Option 1 appears to be the only way to retain most of the text's formatting, but it is only useful if the content is then copied from the PDF into another editor or extracted some other way. Copying and pasting from the PDF into Word 2003 and saving as a filtered web page causes a new problem: the last few lines of each PDF page now appear twice in the HTML file.
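The repeated lines at each page boundary could in principle be removed with a small script rather than by hand. The following is only a minimal sketch: it assumes the file has already been split into a list of text lines and that each repeat is verbatim and sits directly after the first copy, which has not been checked against the sample files. The function name and the `max_block` limit are made up for illustration.

```python
def drop_adjacent_repeats(lines, max_block=5):
    """Remove a run of 1..max_block lines that is immediately repeated verbatim.

    Assumption: the duplicate copy directly follows the original copy.
    Blank-only runs are left alone so paragraph spacing is not collapsed.
    """
    out = []
    i = 0
    while i < len(lines):
        skipped = False
        # Try the longest candidate block first, then shorter ones.
        for k in range(max_block, 0, -1):
            block = lines[i:i + k]
            is_full = len(block) == k
            has_text = any(s.strip() for s in block)
            if is_full and has_text and block == lines[i + k:i + 2 * k]:
                out.extend(block)  # keep the first copy
                i += 2 * k         # skip the second copy
                skipped = True
                break
        if not skipped:
            out.append(lines[i])
            i += 1
    return out
```

A line that is legitimately repeated in the book (a refrain, for instance) would also be dropped, so the result would still need a quick visual check.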
Option 2 produces OK content except for the duplicated headings and the missing formatting. I found that using wildcards in "find and replace" in Word can help here:
The option for using wildcards and the bold formatting in the replace field also need to be enabled. However, the problem with this search-and-replace solution is that Word freezes when it is applied to documents longer than 5 pages! Is there another way to get rid of the second heading in brackets and bold the first one?
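One alternative to Word might be a regular expression run from a script, which should not choke on long documents. This is a sketch under a stated assumption: the duplicate is taken to be the same heading text repeated in round or square brackets, either on the same line or on the next one (for example `Chapter One (Chapter One)`). The real layout in the exported text may differ, and the pattern would need adjusting to match. Bold is written as HTML `<b>` tags here, on the assumption that the cleaned text is fed to Calibre as HTML.

```python
import re

# Assumed layout (not confirmed by the sample files): a heading followed by
# the same heading repeated in round or square brackets, on the same line or
# on the line directly below it.
DUPLICATE_HEADING = re.compile(
    r"^(?P<title>[^\n()\[\]]+?)[ \t]*\n?[ \t]*[(\[](?P=title)[)\]][ \t]*$",
    re.MULTILINE,
)

def bold_and_dedupe_headings(text: str) -> str:
    """Drop the bracketed repeat and wrap the remaining heading in <b> tags."""
    return DUPLICATE_HEADING.sub(
        lambda m: f"<b>{m.group('title').strip()}</b>", text
    )
```

Lines with ordinary bracketed remarks are left untouched, because the pattern only matches when the bracketed text is identical to the text before it.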
I have attached a sample PDF and HTM file, and would appreciate any help on how to process these files to produce clean MOBI files.
Edit: it seems I can't upload HTM files...
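For the automation part, the conversion step itself could probably be batched with Calibre's `ebook-convert` command-line tool, which takes an input file and an output file. The sketch below assumes that tool is installed and on the PATH, and that the cleaned files sit in one folder; the folder layout and function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

def build_convert_command(src: Path, out_dir: Path) -> list[str]:
    """Return the ebook-convert call that turns one cleaned file into a .mobi."""
    target = out_dir / (src.stem + ".mobi")
    return ["ebook-convert", str(src), str(target)]

def convert_all(in_dir: Path, out_dir: Path) -> None:
    """Convert every .htm/.html file in in_dir (hypothetical folder layout)."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre on the PATH?")
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.iterdir()):
        if src.suffix.lower() in {".htm", ".html"}:
            # check=True stops the batch at the first failed conversion.
            subprocess.run(build_convert_command(src, out_dir), check=True)
```

Calling `convert_all(Path("cleaned"), Path("mobi"))` would then write one MOBI file per cleaned HTML file.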