Old 03-03-2017, 04:38 AM   #1
fcemari
Converting pdf file of a scanned book to epub format

Hi,

I am trying to convert my books to epub format so that I can easily have them with me when I am mobile.
I don't have any problems with OCR programs. I mostly use ABBYY FineReader for Mac.

1. Before or after OCR I have to manually deselect all the page numbers and the volume or author names that appear at the very top of the pages. I couldn't find any script or other solution for that. Do you know of an easy method?

2. The reference numbers in superscript at the end of sentences have to be manually linked. Is there a way to do it automatically?

3. Footnotes at the end of every page should be either deselected or manually transferred to the end of the pages in order not to compromise the book's reading in epub format. Is there any automatic solution for that?
Old 03-03-2017, 06:18 AM   #2
Tex2002ans
Quote:
Originally Posted by fcemari View Post
1. Before or after OCR I have to manually deselect all the page numbers and the volume or author names that appear at the very top of the pages. I couldn't find any script or other solution for that. Do you know of an easy method?
I find Finereader does a pretty decent job of detecting the Headers/Footers and never exporting them to the output format in the first place. There is the occasional book it has problems with (mostly when the Header/Footer is extremely close to the text block), but for 95+% of books it works nearly perfectly.

See how close the Header is in Figure A:

[Image: FigureA.png]

Finereader has a serious problem with detecting the Header in that book... you can see how it could easily be seen as a part of the body text.

And see what your typical book looks like in Figure B:

[Image: FigureB.png]

Finereader has absolutely zero problems with that. Only when the Header is as close or closer than Figure A might Finereader start to become inaccurate with its guesses (maybe it will handle the Headers perfectly though.... each book is different).

Potential Solution

Not too sure if Finereader on Mac is the same, but there is a "Save Area Template" under Area > Save Area Template...:

http://help.abbyy.com/FineReader/Fin...hTemplates.htm

I think the Area Templates were intended more for scanning in documents of the same exact type (like hundreds of forms that all have the exact same layout).

You may be able to hack an Area Template together for yourself on a per-book basis. I personally haven't found it to be too useful in the case of books, but your case may be different.
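If the Area Template route doesn't pan out, a small script over the exported text is another option. This is NOT a Finereader feature, just a rough Python sketch. It assumes you can get the OCR output as plain text with one string per page, and that a running header is the first non-blank line of the page. The names `strip_running_headers` and `min_share` are made up for illustration.

```python
import re
from collections import Counter

# A line holding nothing but a page number, e.g. "  14  ".
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def normalize(line):
    # Drop digits so "12  CHAPTER ONE" and "14  CHAPTER ONE" compare equal.
    return re.sub(r"\d+", "", line).strip().lower()


def strip_running_headers(pages, min_share=0.3):
    """pages: list of page texts (assumption: one string per scanned page).

    A first line counts as a running header when the same text (ignoring
    digits and case) opens at least min_share of all pages.
    """
    first_lines = Counter()
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        if lines:
            key = normalize(lines[0])
            if key:
                first_lines[key] += 1

    threshold = max(2, int(len(pages) * min_share))
    headers = {k for k, n in first_lines.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = page.splitlines()
        # Skip leading blank lines, then drop one header or page-number line.
        while lines and not lines[0].strip():
            lines.pop(0)
        if lines and (PAGE_NUMBER.match(lines[0]) or normalize(lines[0]) in headers):
            lines.pop(0)
        # Same for a bare page number at the bottom of the page.
        while lines and not lines[-1].strip():
            lines.pop()
        if lines and PAGE_NUMBER.match(lines[-1]):
            lines.pop()
        cleaned.append("\n".join(lines))
    return cleaned
```

It only looks at a single line at the top and bottom of each page, so check the result against the scan. Chapter-opening pages and books with two-line headers will need extra care.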

Quote:
Originally Posted by fcemari View Post
2. The reference numbers in superscript at the end of sentences have to be manually linked. Is there a way to do it automatically?
No.

Depending on the export format, Finereader does try to do its best, but it botches the "linking back/forth footnotes" pretty badly. The only way to handle it properly is to manually correct them.

There are some tools to kind of help speed up the process though:

1. If you have Microsoft Word, you can use Finereader to export to DOCX (Formatted), and then run Toxaris's EPUB Tools add-in (doesn't work in the Mac version):

https://toxaris.nl/en/

Toxaris specialized his tool for a lot of Finereader cleanup (and a ton of other helpful things). If you use his tool and press "Preparation", it can clean up a lot of the Finereader DOCX cruft. You can then fix the document in Word, or export from there and do more thorough cleaning.

2. Taking the HTML and doing lots of fancy Regex (each book is different).

Some generic rules can apply though, like searching a book for all <sup>##</sup> (these are most likely superscript footnotes sitting in the text). Or searching for paragraphs starting with a superscript number (this is most likely a footnote):

[Image: FootnotesInText.png]

Things can get a little hairier if you have a complex book (like one with formulas) or OCR errors (maybe a ” [Right Double Quote] might be OCRed as a <sup>9</sup> or a ° [degree] might be OCRed as <sup>0</sup>).
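The two generic searches above can be sketched in Python roughly like this. It is only an illustration: it assumes the export uses plain `<sup>N</sup>` markup and `<p>` paragraphs, and the function names are my own. The `suspicious` helper is a crude way to catch the OCR cases just mentioned, where a quote or degree sign turns into a stray superscript digit.

```python
import re

# Any superscript number, e.g. <sup>12</sup>.
SUP_REF = re.compile(r"<sup>(\d+)</sup>")
# A paragraph that STARTS with a superscript number: most likely a footnote.
NOTE_PARA = re.compile(r"<p\b[^>]*>\s*<sup>(\d+)</sup>")


def find_footnote_candidates(html):
    """Return (in_text_refs, note_paragraphs) as lists of footnote numbers."""
    note_matches = list(NOTE_PARA.finditer(html))
    notes = [int(m.group(1)) for m in note_matches]
    note_starts = {m.start(1) for m in note_matches}
    # Everything that is not at the start of a paragraph is an in-text reference.
    refs = [int(m.group(1)) for m in SUP_REF.finditer(html)
            if m.start(1) not in note_starts]
    return refs, notes


def suspicious(numbers):
    """List numbers that break the running 1, 2, 3... sequence.

    A restart at 1 is accepted (numbering per page or per chapter).
    """
    flagged = []
    expected = None
    for n in numbers:
        if expected is not None and n != expected and n != 1:
            flagged.append(n)
        else:
            expected = n + 1
    return flagged
```

Note that if the OCR dropped a real footnote number, everything after the gap gets listed as well, which is at least a hint about where to start looking.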

Quote:
Originally Posted by fcemari View Post
3. Footnotes at the end of every page should be either deselected or manually transferred to the end of the pages in order not to compromise the book's reading in epub format. Is there any automatic solution for that?
No.

You will most likely have to manually fix/check the links and place them in the proper order/location.

Side Note: You will also have to keep an eye out for footnotes that are missing text or large footnotes that carry on to a second/third page... these will have to be manually stitched back together.
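To speed up the manual pass a little, a script can link the easy cases and hand you a list of the rest. A minimal Python sketch, under some loud assumptions: everything sits in one HTML file, footnote numbers are unique within that file, and the markup is plain `<sup>N</sup>`. The `fn`/`ref` ids and the function name are invented for illustration. Anything that doesn't pair up one-to-one is left untouched and reported, so you still have to check it by hand.

```python
import re
from collections import Counter

# Optional group 1 is the opening <p> tag: present for a footnote paragraph,
# absent for a reference sitting inside the running text.
TOKEN = re.compile(r"(<p\b[^>]*>\s*)?<sup>(\d+)</sup>")


def link_footnotes(html):
    """Return (new_html, unmatched_numbers)."""
    notes, refs = Counter(), Counter()
    for m in TOKEN.finditer(html):
        (notes if m.group(1) else refs)[m.group(2)] += 1

    # Only link numbers with exactly one reference and exactly one note.
    linkable = {n for n in notes if notes[n] == 1 and refs[n] == 1}

    def repl(m):
        n = m.group(2)
        if n not in linkable:
            return m.group(0)  # leave as-is for manual review
        if m.group(1):
            # Footnote paragraph: link back to the reference.
            return f'{m.group(1)}<sup><a id="fn{n}" href="#ref{n}">{n}</a></sup>'
        # In-text reference: link forward to the footnote.
        return f'<sup><a id="ref{n}" href="#fn{n}">{n}</a></sup>'

    unmatched = sorted((set(notes) | set(refs)) - linkable, key=int)
    return TOKEN.sub(repl, html), unmatched
```

This does nothing about footnotes split across pages or placed in the wrong spot. Those still need stitching and moving by hand before a script like this is of any use.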

Last edited by Tex2002ans; 03-03-2017 at 07:10 AM.
Old 03-03-2017, 05:50 PM   #3
BetterRed
@fcemari - Sigil has a plugin for tidying ePub files created from scans and PDFs ==>> ePub Tidy

I don't use the Sigil PI because I have Word on Windows, so I use the epub-tools add-in that Tex2002ans has already mentioned. From its description, the Sigil PI appears to do something similar to epub-tools Preparation. As well as correcting Latin text, it also corrects Greek text.

BR

Last edited by BetterRed; 03-03-2017 at 05:52 PM.