08-27-2008, 06:52 AM   #1
Pulp
Palm Addict
Device: Cybook Gen3 [512mb, FW: 1.5]
PDF->mobi "my way"

After some experimenting I finally found a way to convert my PDFs.

It's a multi-step process, but in the end I get a mobi file that satisfies my needs.
  • First I run the PDF through Abbyy PDF-Transformer with the layout set to "text flow" and create an rtf-file (I suppose any other tool that does a good job at pdf->rtf is fine as well).
  • The rtf-file is then opened in MSWord, where I do a spellcheck (it doesn't take more than a few minutes per book and makes sure there are no hyphe-nations left from the original text).
  • I save the file as filtered html in Word.
  • Finally I run the file through a php-script that does the following:
    Code:
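    // Load the filtered HTML saved from Word ("book.html" is a placeholder name).
    $text = file_get_contents("book.html");
    // Mark <body> and <p> so they survive strip_tags() without their Word attributes.
    $text = str_replace(array("<body","</body>","<p","</p>"), array("{body}<body","</body>{/body}","{p}<p","</p>{/p}"), $text);
    $text = strip_tags($text, "<b><i><u><html><head><title><h1><h2><h3><h4><h5><h6>");
    // Put clean tags back, and re-attach a stray " n" to the word before it.
    $text = str_replace(array("{body}","{/body}","{p}","{/p}"," n "," n."," n,"), array("<body>","</body>","<p>","</p>","n ","n.","n,"), $text);
    $text = preg_replace('/>\s*</','><',$text);      // remove whitespace between tags
    $text = preg_replace('/\s{2,}/',' ',$text);      // collapse runs of whitespace
    $text = preg_replace('/([a-zA-Z\d,])<\/p><p>/','$1 ',$text); // join paragraphs split by page breaks
    // Write the result ("book_clean.html" is a placeholder name).
    file_put_contents("book_clean.html", $text);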
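
In case anyone wonders about the {p} and {body} markers in the script above: strip_tags() leaves the attributes of any tag on its allow-list untouched, so simply allowing <p> would keep all of Word's class and style attributes. Here is a minimal sketch of the difference. The sample paragraph is made up, not taken from a real book:

```php
<?php
// Made-up sample of what one paragraph in Word's filtered HTML might look like.
$sample = '<p class=MsoNormal style="margin:0cm">Some <b>bold</b> text</p>';

// Allowing <p> directly keeps the tag together with all of its attributes.
$kept = strip_tags($sample, "<b><p>");

// Marker trick: tag the <p>, let strip_tags() remove it, then put a clean <p> back.
$marked = str_replace(array("<p", "</p>"), array("{p}<p", "</p>{/p}"), $sample);
$clean  = strip_tags($marked, "<b>");
$clean  = str_replace(array("{p}", "{/p}"), array("<p>", "</p>"), $clean);

echo $kept, "\n";   // same as $sample, attributes still there
echo $clean, "\n";  // prints: <p>Some <b>bold</b> text</p>
```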

The html-file I end up with still has bold, italic and underlined text as well as headings.
Paragraph breaks that were only created by page breaks in the PDF are merged away.
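
To illustrate the paragraph merging: the last preg_replace joins two paragraphs whenever the first one ends in a letter, a digit or a comma, which usually means the sentence was cut by a page break rather than finished. A small sketch with made-up text follows. The function name merge_broken_paragraphs is only for this illustration and is not part of my script:

```php
<?php
// Hypothetical helper wrapping the last regex of the script, for illustration only.
function merge_broken_paragraphs($html) {
    // A paragraph ending in a letter, digit or comma is treated as cut off mid-sentence.
    return preg_replace('/([a-zA-Z\d,])<\/p><p>/', '$1 ', $html);
}

$html = '<p>The sentence runs over the</p><p>page break.</p><p>A new paragraph.</p>';
echo merge_broken_paragraphs($html), "\n";
// prints: <p>The sentence runs over the page break.</p><p>A new paragraph.</p>
```

A paragraph that ends without punctuation on purpose (a short line of verse, say) gets merged with the next one too, so a quick look over the result is worthwhile.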

Importing this html-file with Mobipocket Creator gives a great result.

It definitely takes more time than a fully automated conversion, but the result is also a lot better.