|02-23-2013, 12:39 PM||#16|
Join Date: Nov 2009
Device: iPod touch 2G (16 GB)
That's what happens with PDF files that contain text (FineReader converts them into images), so I'm a bit sceptical that it somehow extracts the JPGs from the PDF without further processing...
|02-23-2013, 02:44 PM||#17|
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
It is a structurally different PDF, so I would not be surprised. The tools to extract the images (JPGs) from a PDF are easy to get and open source. Why would ABBYY not incorporate those algorithms? It is not that hard.
|01-23-2014, 04:41 PM||#18|
Join Date: Nov 2010
Device: Kobo Aura HD, Sony PRS (T1,T2), PocketBook 902
So far I have not been able to find anything better than the PerfectEpub extension for OpenOffice.
Therefore I do this:
OCR in FineReader -> save as ODT -> open in OpenOffice -> run PerfectEpub (possibly after other cleanup with regex find/replace, etc.) -> writer2ePub (or save as ODT or HTML and then use the Calibre converter, whichever works better) -> Sigil (where you can again do regex find/replace, merge/split if necessary, etc.)
However, it is better to remove any page numbers and headers before running PerfectEpub.
FineReader 11 is pretty good at recognizing headers/footers so they are not much of a problem.
PerfectEpub joins wrongly split lines (paragraphs) with one click and also splits wrongly joined lines, etc. I don't understand why FineReader can't do this itself, though. If it can, I need to find out how...
I also use PerfectEpub on existing epubs and other formats if they contain wrongly split or wrongly joined lines.
For an epub, I do this: epub -> htmlz -> extract files -> open in OpenOffice -> run PerfectEpub -> save back to html (or run writer2ePub)
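The first two steps of that chain (epub -> htmlz -> extract files) can be scripted. This is only a rough sketch, assuming Calibre's `ebook-convert` command-line tool is installed and on the PATH; the function names are made up for illustration. An HTMLZ file is a zip archive, so the standard `zipfile` module is enough to unpack it.

```python
import subprocess
import zipfile
from pathlib import Path


def build_convert_command(epub_path, htmlz_path):
    """Command line for Calibre's converter; the output format follows the extension."""
    return ["ebook-convert", str(epub_path), str(htmlz_path)]


def extract_htmlz(htmlz_path, out_dir):
    """Unpack an HTMLZ (zip) archive and return the sorted list of member names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(htmlz_path) as archive:
        archive.extractall(out)
        return sorted(archive.namelist())


def epub_to_html_files(epub_path, work_dir):
    """Hypothetical helper: convert an epub to HTMLZ, then unpack it for OpenOffice."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    htmlz = work / (Path(epub_path).stem + ".htmlz")
    # Assumption: ebook-convert is installed; check=True raises if the conversion fails.
    subprocess.run(build_convert_command(epub_path, htmlz), check=True)
    return extract_htmlz(htmlz, work / "extracted")
```

Calling `epub_to_html_files("book.epub", "work")` should leave the HTML file(s) under `work/extracted`, ready to open in OpenOffice and run through PerfectEpub.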
When the information about the original pages is no longer available, line joining and splitting can be done directly in Sigil with regex find/replace, but that takes several regular expressions, and they differ for pretty much every epub, so PerfectEpub is a much quicker solution.
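To give an idea of the regex approach, here is a minimal sketch of the line-joining part in Python. It is not what PerfectEpub does internally (I don't know that); the two patterns and the function name are my own assumptions, and as said above they would need tuning for each book. The heuristic: a paragraph that ends without sentence punctuation, followed by one that starts with a lowercase letter, was probably split by mistake.

```python
import re

# A word hyphenated across the break: "exam-</p><p>ple" -> "example".
# Assumption: a trailing hyphen is a line-break hyphen, which is wrong for
# genuine compounds such as "well-" + "known".
HYPHEN_SPLIT = re.compile(r'(\w)-\s*</p>\s*<p[^>]*>\s*([a-z])')

# A paragraph ending without sentence punctuation, followed by a paragraph
# that starts with a lowercase letter.
PLAIN_SPLIT = re.compile(r'([^.!?:;"\s>])\s*</p>\s*<p[^>]*>\s*([a-z])')


def join_split_paragraphs(html: str) -> str:
    """Merge paragraphs that look like one sentence broken in two."""
    html = HYPHEN_SPLIT.sub(r'\1\2', html)   # rejoin hyphenated words first
    return PLAIN_SPLIT.sub(r'\1 \2', html)   # then join the rest with a space


sample = (
    "<p>This line was split in the</p>\n"
    "<p>middle of a sentence.</p>\n"
    "<p>A new paragraph.</p>"
)
print(join_split_paragraphs(sample))
```

In Sigil the same two patterns could be tried in the find/replace dialog in regex mode, though the replacement syntax there may differ from Python's.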
Last edited by parkher; 01-23-2014 at 04:47 PM.