OCR to EPUB Best Workflow - Page 2

DSpider · 02-23-2013, 12:39 PM

That's what happens with PDF files that contain text (FineReader converts them into images), so I'm a bit sceptical that it somehow extracts the JPGs from the PDF without further processing...

Toxaris · 02-23-2013, 02:44 PM

It is a structurally different PDF, so I would not be surprised. The tools to extract the images (JPEGs) from a PDF are open source and easy to get. Why would ABBYY not incorporate those algorithms? It is not that hard.
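
To give a rough idea of why it is not that hard: in an unencrypted PDF, an image stored with the DCTDecode filter is simply a complete JPEG file sitting between `stream` and `endstream`. Below is a minimal Python sketch under that assumption. The function name `extract_jpegs` is made up for this example, and it ignores encryption, object streams and chained filters, which real tools (Poppler's pdfimages, for instance) deal with properly.

```python
import re

# A DCTDecode stream is a raw JPEG: it begins with the SOI marker (FF D8)
# and ends with the EOI marker (FF D9), directly before "endstream".
JPEG_STREAM = re.compile(
    rb"stream\r?\n(\xff\xd8.*?\xff\xd9)\s*endstream", re.DOTALL
)


def extract_jpegs(pdf_bytes: bytes) -> list[bytes]:
    """Return the raw bytes of every JPEG-looking stream in the PDF data."""
    return [match.group(1) for match in JPEG_STREAM.finditer(pdf_bytes)]
```

Each returned item can be written to a `.jpg` file as-is, with no re-encoding, so nothing is lost compared to the image inside the PDF.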

parkher · 01-23-2014, 04:41 PM

So far I have not been able to find anything better than the PerfectEpub extension for OpenOffice.
So this is what I do:

OCR in FineReader -> save as odt -> open in OpenOffice -> run PerfectEpub (after possibly other cleaning with regex find/replace, etc.) -> writer2ePub (or save as odt or as html and then use Calibre converter - whichever works better) -> SIGIL (where you again can do regex find/replace, merge/split if necessary, etc.)

However, it is better to get rid of any page numbers and headers before running PerfectEpub.
FineReader 11 is pretty good at recognizing headers/footers so they are not much of a problem.
PerfectEpub joins wrongly split lines (paragraphs) with one click and also splits wrongly joined lines, etc. I don't understand why FineReader can't do this itself, though. If it can, I need to find out how...

I use PerfectEpub on already made epubs and other formats too, if they have wrongly split lines or wrongly joined lines in them.
For an epub, I do this: epub -> htmlz -> extract files -> open in OpenOffice -> run PerfectEpub -> save back to html (or run writer2ePub)
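
The "extract files" step needs no special tool, because an HTMLZ file is an ordinary zip archive. A minimal Python sketch follows. `extract_htmlz` is a hypothetical helper name, and the exact files inside the archive (typically one HTML file plus stylesheet, metadata and images) depend on how it was produced.

```python
import zipfile


def extract_htmlz(htmlz_path: str, out_dir: str) -> list[str]:
    """Unpack an HTMLZ archive into out_dir and return the member names."""
    with zipfile.ZipFile(htmlz_path) as archive:
        archive.extractall(out_dir)  # creates out_dir if it does not exist
        return archive.namelist()
```

The extracted HTML file is the one to open in OpenOffice for the PerfectEpub step.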

When the information about the original pages is no longer available, the line joining / splitting can also be done directly in Sigil with regex find/replace. But that takes several regular expressions, and different ones for pretty much each epub, so PerfectEpub is a much quicker solution.
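
To illustrate the kind of regex involved, here is a minimal Python sketch of one joining rule: a paragraph that ends without closing punctuation, followed by a paragraph that starts with a lowercase letter, is treated as one wrongly split paragraph. This is only an assumed heuristic, not what PerfectEpub actually does, and the names `SPLIT_PARA` and `join_split_paragraphs` are invented for the example. It does not handle words hyphenated across the break, which is one reason a single expression is rarely enough.

```python
import re

# Last character of the first paragraph must not be closing punctuation or
# whitespace; the next paragraph must start with a lowercase ASCII letter.
SPLIT_PARA = re.compile(r'([^.!?:;")\s])\s*</p>\s*<p[^>]*>\s*(?=[a-z])')


def join_split_paragraphs(html: str) -> str:
    """Merge paragraphs that look wrongly split; leave all others untouched."""
    return SPLIT_PARA.sub(r"\1 ", html)
```

For example, `<p>The quick brown</p>` followed by `<p>fox jumped.</p>` becomes `<p>The quick brown fox jumped.</p>`, while a paragraph ending in a full stop is left alone.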

mav8rick · 04-19-2014, 12:40 AM

For me, I have tried so many, many PDF-to-EPUB converters that I despaired of trying one more. However, after reading around on the internet, I decided to look up ABBYY FineReader and, lo and behold, they actually have a direct PDF-to-EPUB converter called ABBYY PDF Converter.
After trying the trial (converts 100 pages max), I decided to plonk down the money for the full version.

End result: I think I converted > 40 books and ONLY one didn't convert properly. Most converted with graphics intact and > 30% had their TOC links done properly!

I regret spending all that time on all those other PDF-to-EPUB converters. You name it, I've tried it (Calibre, EPUB Converter, Doremisoft, 3DPageFlip, Vibosoft, PDFMate, iStonsoft, Go4ePub, which is an online site...). For some reason a lot of them look suspiciously alike, so either they share the same underlying product with a customized look and feel, or some of them pirated it from a main source and spun it off as their own.

The reason I gave ABBYY a chance is that they're a reputable OCR software vendor too, so I figured if they can do OCR well, they can surely do something about the pesky PDF internal structure/markup.

I have absolutely no business interest in ABBYY, other than wondering why they didn't market this product better. If you guys are still vexed over the conversion, you should take a look.

Hitch · 04-22-2014, 03:05 PM

Quote:

Originally Posted by mav8rick

For me, I have tried so many, many PDF-to-EPUB converters that I despaired of trying one more. However, after reading around on the internet, I decided to look up ABBYY FineReader and, lo and behold, they actually have a direct PDF-to-EPUB converter called ABBYY PDF Converter.
After trying the trial (converts 100 pages max), I decided to plonk down the money for the full version.

End result: I think I converted > 40 books and ONLY one didn't convert properly. Most converted with graphics intact and > 30% had their TOC links done properly!

I regret spending all that time on all those other PDF-to-EPUB converters. You name it, I've tried it (Calibre, EPUB Converter, Doremisoft, 3DPageFlip, Vibosoft, PDFMate, iStonsoft, Go4ePub, which is an online site...). For some reason a lot of them look suspiciously alike, so either they share the same underlying product with a customized look and feel, or some of them pirated it from a main source and spun it off as their own.

The reason I gave ABBYY a chance is that they're a reputable OCR software vendor too, so I figured if they can do OCR well, they can surely do something about the pesky PDF internal structure/markup.

I have absolutely no business interest in ABBYY, other than wondering why they didn't market this product better. If you guys are still vexed over the conversion, you should take a look.

Well...pretty much everyone here that does this either in large quantities, or professionally, as I do, already uses Abbyy (Fine Reader). While it's viable for the initial conversion, the "cruft" left underneath the file, in the coding, isn't very attractive, and requires a fair amount of clean-up. I don't have any reason to think that Abbyy's "ABBYY PDF Converter" is anything different than the PDF-->Converter that's in AFR 11, and I'd be surprised if it were.

One of the things that Abbyy is very good about is making everything look pretty good on the surface, but once you delve into it--lo, here there be dragons. Have you spent a lot of time looking at the coding of the ePUBS that you've created, just out of curiosity?

Hitch
