After some experimenting I finally found a way to convert my PDFs.
It's a multi-step process, but in the end I get a mobi file that meets my needs.
- First I run the PDF through Abbyy PDF-Transformer with the layout set to "text flow" and create an RTF file (I suppose any other tool that does a good job of PDF -> RTF conversion is fine as well).
- The RTF file is then opened in MS Word, where I do a spellcheck (it doesn't take more than a few minutes per book and makes sure there are no hyphe-nations left from the original text).
- I save the file as filtered HTML in Word.
- Finally I run the file through a PHP script that does the following:
Code:
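// $text holds the filtered HTML. Mark <body> and <p> tags so they survive strip_tags().
$text = str_replace(array("<body","</body>","<p","</p>"), array("{body}<body","</body>{/body}","{p}<p","</p>{/p}"), $text);
// Drop every tag except basic formatting, headings and the document skeleton.
$text = strip_tags($text, "<b><i><u><html><head><title><h1><h2><h3><h4><h5><h6>");
// Restore the marked tags (now without attributes) and glue a detached "n" back onto the preceding word.
$text = str_replace(array("{body}","{/body}","{p}","{/p}"," n "," n."," n,"), array("<body>","</body>","<p>","</p>","n ","n.","n,"), $text);
// Remove whitespace between adjacent tags.
$text = preg_replace('/>\s*</','><',$text);
// Collapse runs of two or more whitespace characters into a single space.
$text = preg_replace('/\s{2,}/',' ',$text);
// Join two paragraphs when the first ends in a letter, digit or comma, i.e. a page break split the sentence.
$text = preg_replace('/([a-zA-Z\d\,])<\/p><p>/','$1 ',$text);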
The HTML file I end up with still has bold, italic and underlined text as well as headings.
Paragraph breaks that were only created by page breaks are removed.
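To make the effect on a page-break paragraph concrete, here is a small self-contained sketch that wraps the same steps in a function and runs it on a tiny fragment. The function name clean_word_html and the sample markup are made up for illustration; real filtered HTML from Word is far messier, so treat this as a rough demonstration rather than a tested tool.

```php
<?php
// Sketch only: clean_word_html() is a hypothetical wrapper around the script's steps.
function clean_word_html(string $text): string
{
    // Mark <body> and <p> tags so they survive strip_tags().
    $text = str_replace(
        array("<body", "</body>", "<p", "</p>"),
        array("{body}<body", "</body>{/body}", "{p}<p", "</p>{/p}"),
        $text
    );
    // Keep only basic formatting, headings and the document skeleton.
    $text = strip_tags($text, "<b><i><u><html><head><title><h1><h2><h3><h4><h5><h6>");
    // Restore the marked tags and rejoin a detached "n".
    $text = str_replace(
        array("{body}", "{/body}", "{p}", "{/p}", " n ", " n.", " n,"),
        array("<body>", "</body>", "<p>", "</p>", "n ", "n.", "n,"),
        $text
    );
    $text = preg_replace('/>\s*</', '><', $text);   // no whitespace between tags
    $text = preg_replace('/\s{2,}/', ' ', $text);   // collapse whitespace runs
    // Merge paragraphs that were split in the middle of a sentence.
    return preg_replace('/([a-zA-Z\d\,])<\/p><p>/', '$1 ', $text);
}

// Made-up sample: the first two paragraphs are one sentence cut by a page break.
$sample = '<body lang=EN-US><p class=MsoNormal>The sentence was cut</p>' . "\n"
        . '<p class=MsoNormal>by a <span>page</span> break.</p>' . "\n"
        . '<p class=MsoNormal>A <b>new</b> paragraph.</p></body>';

echo clean_word_html($sample), "\n";
// prints: <body><p>The sentence was cut by a page break.</p><p>A <b>new</b> paragraph.</p></body>
```

The first two paragraphs are merged because "cut" ends in a letter, while the break after "break." stays because the paragraph ends in a full stop.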
Importing this HTML file with Mobipocket Creator gives a great result.
It definitely takes more time than a fully automated conversion, but the result is also a lot better.
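The spellcheck step could probably be backed up by a quick scan for leftover line-break hyphens before importing. A rough sketch, assuming the text is UTF-8; the function name find_hyphenation_candidates is made up, and the pattern also flags legitimate compounds such as "multi-step", so the output is only a list to review by hand.

```php
<?php
// Sketch only: find_hyphenation_candidates() is a hypothetical helper.
function find_hyphenation_candidates(string $text): array
{
    // A run of letters, a hyphen, then a run of lowercase letters.
    $count = preg_match_all('/\p{L}+-\p{Ll}+/u', $text, $matches);
    if ($count === false) {
        return array(); // e.g. the input was not valid UTF-8
    }
    // Report each candidate once, in order of first appearance.
    return array_values(array_unique($matches[0]));
}

print_r(find_hyphenation_candidates(
    'There are no hyphe-nations left, and no conver-sions either.'
));
```

For the sample sentence this lists "hyphe-nations" and "conver-sions".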