MobileRead Forums - View Single Post - Converting pdf file of a scanned book to epub format

Tex2002ans · 03-03-2017, 06:18 AM

Quote:

Originally Posted by fcemari

1. Before or after ocr I have to manually deselect all the page numbers and volume or author names which appear on the most upper part of pages. I couldn't find any script or any solution for that. Do you know any easy method?

I find Finereader does a pretty decent job of detecting the Headers/Footers and never exporting them to the output format in the first place. There is the occasional book it has problems with (mostly when the Header/Footer is extremely close to the text block), but for 95+% of books it works nearly perfectly.

See how close the Header is in Figure A:

[Image: FigureA.png]

Finereader has a serious problem with detecting the Header in that book... you can see how it could easily be seen as a part of the body text.

And see what your typical book looks like in Figure B:

[Image: FigureB.png]

Finereader has absolutely zero problems with that. Only when the Header is as close as in Figure A (or closer) might Finereader start to become inaccurate with its guesses (it may still handle the Headers perfectly though... each book is different).

Potential Solution

Not too sure if Finereader on Mac is the same, but there is a "Save Area Template" under Area > Save Area Template...:

http://help.abbyy.com/FineReader/Fin...hTemplates.htm

I think the Area Templates were intended more for scanning in documents of the same exact type (like hundreds of forms that all have the exact same layout).

You may be able to hack an Area Template together for yourself on a per-book basis. I personally haven't found it to be too useful in the case of books, but your case may be different.

Quote:

Originally Posted by fcemari

2. The reference numbers in superscript at the end of sentences have to be manually linked. Is there a way to do it automatically?

No.

Depending on the export format, Finereader does try to do its best, but it botches the "linking back/forth footnotes" pretty badly. The only way to handle it properly is to manually correct them.

There are some tools to kind of help speed up the process though:

1. If you have Microsoft Word, you can use Finereader to export to DOCX (Formatted), and then run Toxaris's EPUB Tools add-in (doesn't work in the Mac version):

https://toxaris.nl/en/

Toxaris specialized his tool for a lot of Finereader cleanup (and a ton of other helpful things). If you use his tool and press "Preparation", it can clean up a lot of the Finereader DOCX cruft. You can then fix the document in Word, or export from there and do more thorough cleaning.

2. Taking the HTML and doing lots of fancy Regex (each book is different).

Some generic rules can apply though, like searching a book for all superscripted numbers (##), since these are most likely footnote references sitting in the text. Or searching for paragraphs starting with a superscript number (these are most likely the footnotes themselves):

[Image: FootnotesInText.png]

Things can get a little hairier if you have a complex book (like one with formulas) or OCR errors (maybe a ” [Right Double Quote] might be OCRed as a 9 or a ° [degree] might be OCRed as 0).
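
As a rough sketch of those two generic searches in Python: this assumes the exported HTML marks superscripts with plain <sup> tags, which your export may not do (some exports use styled spans instead). The sample text and the function names are made up for illustration, and the patterns will need adjusting per book.

```python
import re

# Hypothetical sample of exported HTML (an assumption, not real Finereader output).
html = """<p>The treaty was signed in 1648.<sup>12</sup> It ended the war.</p>
<p>Water boils at 100<sup>°</sup> at sea level.</p>
<p><sup>12</sup> See Smith, p. 45.</p>"""

# Any superscripted number: most likely a footnote reference or a footnote marker.
SUP_NUMBER = re.compile(r"<sup>(\d+)</sup>")

# A paragraph that starts with a superscripted number: most likely a footnote.
FOOTNOTE_PARA = re.compile(r"<p\b[^>]*>\s*<sup>(\d+)</sup>\s*(.*?)</p>", re.DOTALL)


def find_footnotes(text):
    """Return (number, footnote text) for paragraphs opening with a superscript number."""
    return [(m.group(1), m.group(2).strip()) for m in FOOTNOTE_PARA.finditer(text)]


def find_references(text):
    """Return superscript numbers that sit inside running text, not at a paragraph start."""
    footnote_starts = {m.start(1) for m in FOOTNOTE_PARA.finditer(text)}
    return [
        m.group(1)
        for m in SUP_NUMBER.finditer(text)
        if m.start(1) not in footnote_starts
    ]


print(find_references(html))
print(find_footnotes(html))
```

It only looks for digits, so a degree sign is skipped, but a quote mark that was OCRed as a superscript 9 would still show up as a false hit and has to be checked by eye.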

Quote:

Originally Posted by fcemari

3. Footnotes at the end of every page should be either deselected or manually transferred to the end of the pages in order not to compromise the book's reading in epub format. Is there any automatic solution for that?

No.

You will most likely have to manually fix/check the links and place them in the proper order/location.

Side Note: You will also have to keep an eye out for footnotes that are missing text or large footnotes that carry on to a second/third page... these will have to be manually stitched back together.
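
A script can't do the stitching for you, but it can point at where to look. Below is a minimal sketch, assuming you already have a list of reference numbers found in the body text and a list of (number, text) footnotes, and that numbers are unique within the chunk you check (per chapter, if the numbering restarts). The function name and sample data are hypothetical, and the "does not end in sentence punctuation" test is only my guess at what a cut-off footnote looks like.

```python
import re

# Text ending in . ! or ? (optionally followed by a closing quote or bracket).
TERMINAL = re.compile(r"[.!?][\"')\]]*\s*$")


def audit_footnotes(references, footnotes):
    """List items worth checking by hand; nothing is fixed automatically."""
    ref_numbers = set(references)
    note_numbers = {number for number, _ in footnotes}
    return {
        # Referenced in the text, but no footnote with that number was found.
        "refs_without_note": sorted(ref_numbers - note_numbers, key=int),
        # Footnote exists, but nothing in the text points to it.
        "notes_without_ref": sorted(note_numbers - ref_numbers, key=int),
        # Footnote text stops mid-sentence: may continue on the next page.
        "possibly_cut_off": [n for n, text in footnotes if not TERMINAL.search(text)],
    }


# Made-up sample data for illustration.
references = ["1", "2", "3"]
footnotes = [
    ("1", "See Smith, p. 45."),
    ("2", "Compare the longer discussion in"),
    ("4", "Ibid."),
]

report = audit_footnotes(references, footnotes)
for problem, numbers in report.items():
    print(problem, numbers)
```

Anything it flags still has to be fixed manually; it just shortens the hunt.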