Old 01-28-2014, 07:18 PM   #534
Difflugia
Testate Amoeba
Posts: 3,049
Karma: 27300000
Join Date: Sep 2012
Device: Many Android devices, Kindle 2, Toshiba e755 PocketPC
Quote:
Originally Posted by Blossom
I use Prince but do have Inkscape; however, it's just easier for me to let Prince do it and then use Crop Pages in Acrobat 8 Pro to get rid of the white margins. You also want to clear the comments from the PDF if you do use Prince, and I also run the optimizer to get rid of all the hidden stuff that isn't needed, which helps the OCR process. It takes 10 minutes using Prince on a 500-page book, which saves a lot of time, and about 10-15 minutes in Acrobat to get the final PDF. I use Prince 8, as 9 is too buggy and I don't like it.

So, to summarize, my process is like this:

1. Extract Topaz to SVG images
2. Open the XML files in Notepad++ and use batch find and replace to change the background to white and remove the navigation script and border.
3. Use Prince to convert xml files to PDFs
4. Use Acrobat Pro to merge PDFs to a single PDF
5. Use Crop Pages to get nice, even, smaller white margins
6. Remove all comments
7. Remove all links
8. Run PDF Optimizer and Save final PDF
9. Run through OCR program (Able2Extract Pro)
10. Open PSPad and run it through Tidy (upgrading to CSS), which moves the absolute location tags out of the span and div tags into their own style tag area. I then delete that and the script.
11. Copy the inline style sheet to an external one and link it in the HTML.
12. Remove all font-family, font-size and color references from the style sheet.
13. Open HTML in Word 2003 and fix formatting, broken sentences, chapter headers and Save.
14. Run through Calibre and convert to epub and mobi.

It takes about 3 hours for me to do all this. The most time-consuming parts are the OCR and then editing it in Word. I decided to forget fixing the hyphenation and just remove it, so if a few words are run together I can live with that.
Thanks for the details! I hadn't seen Able2Extract before. It's a little pricey, but if it works that well, it's probably worth it. I've downloaded the free trial to give it a shot.
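For anyone who'd rather script step 12 of the quoted list than do it by hand, here's a rough Python sketch. It's only my guess at what that step amounts to: it drops every font-family, font-size and color declaration from a stylesheet with a regular expression. The function name is made up for illustration, and since this is plain pattern matching rather than a real CSS parser, a value with a semicolon inside quotes would trip it up.

```python
import re

# Matches "font-family: ...;", "font-size: ...;" and "color: ...;" declarations.
# The lookbehind keeps properties such as background-color from matching, and
# requiring a colon right after the name leaves font-size-adjust alone.
_DROP = re.compile(
    r"(?<![\w-])(?:font-family|font-size|color)\s*:[^;{}]*;?",
    re.IGNORECASE,
)


def strip_font_and_color(css: str) -> str:
    """Return css with font-family, font-size and color declarations removed."""
    return _DROP.sub("", css)
```

To use it, read the external stylesheet from step 11 into a string, pass it through strip_font_and_color, and write the result back out. Keep a copy of the original in case the pattern removes more than you wanted.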