pdf to epub

crutledge · 07-05-2015, 12:16 PM

Has anyone developed the tools and procedures to go from a book scan to an epub?

Arios · 07-05-2015, 03:53 PM

Unfortunately, as far as I know, there are no such easy tools or procedures. Even with ABBYY FineReader and a very clean pdf file (with no other languages, good scan, perfect letters for example), there are a lot of things to check out.

Moreover, I think a direct export to epub format is risky (to say the least!). I'd prefer to clean up the text in a familiar editor (Writer, Word or AWP) and then export that file to epub format. (With Word 2007 and up you can use the tools created by Toxaris.)

Anyway, if you have ABBYY, have you tried it? It can do what you want (book scan in pdf format to epub), but the resulting file generally does not pass epubcheck.

crutledge · 07-05-2015, 05:24 PM

Quote:

Originally Posted by Arios

Unfortunately, as far as I know, there are no such easy tools or procedures. Even with ABBYY FineReader and a very clean pdf file (with no other languages, good scan, perfect letters for example), there are a lot of things to check out.

Moreover, I think a direct export to epub format is risky (to say the least!). I'd prefer to clean up the text in a familiar editor (Writer, Word or AWP) and then export that file to epub format. (With Word 2007 and up you can use the tools created by Toxaris.)

Anyway, if you have ABBYY, have you tried it? It can do what you want (book scan in pdf format to epub), but the resulting file generally does not pass epubcheck.

This has been my experience. I use Adobe Acrobat. Is ABBYY FineReader any better or easier to use?

Arios · 07-05-2015, 06:27 PM

I can't speak for all of them, but AFAIK, this is the best software of its kind.

But it has a price (I use version 11). You can check around for a good price. As I said, at the end of the road there is a lot of work to do anyway.

Sorry, but I can't really compare it with Acrobat, as I only used that as a trial, and I prefer ABBYY.

JSWolf · 07-05-2015, 07:51 PM

Adobe Acrobat Pro cannot convert without errors.

Hitch · 07-08-2015, 06:45 PM

Quote:

Originally Posted by crutledge

This has been my experience. I use Adobe Acrobat. Is ABBYY FineReader any better or easier to use?

Charlie:

To me, those are night and day. Abbyy actually gives you, by and large, what's in the book. It takes some learning, etc., natch, and it's not free. (Although you can use their online, in-browser version for nowt, for...up to X pages, or something. Don't remember.) Acrobat outputs utter crap. I can't believe you're using it, it's so bad. When you use Acrobat Pro, it's the iceberg that hit the Titanic. You can see this by copy-pasting almost ANYTHING from a Word file output by Acrobat into the SEARCH window (not the file--the search box) in the Document pane or the regular search box, and you'll see what you're getting in the murk. Make sure you try words like "fiat" (for example), or anything that has symbols or characters for punctuation, etc. It's fugly, on a massive scale.

(n.b.: although, having said that, when you look at your HTML, are you getting CLEAN results? I mean, I admit, I'd be gobsmacked if you are, but if you are....)

I get far, far, FAR better results with Abbyy than Acrobat Pro. They're just not the same thing. Abbyy understands and tries to compensate for that second layer in PDFs, whereas all Acrobat cares about is outputting a Word file that LOOKS like the PDF source but isn't necessarily remotely the same as it.

STRONGLY recommend Abbyy over Acrobat for this purpose. Acrobat does a lot of things really well, but this isn't one of them, IMHO.

Hitch

crutledge · 07-09-2015, 04:29 AM

Hitch, Good to hear from you.

Abbyy has several products. It seems that I need the Fine Reader. Is this correct?

Charlie

Toxaris · 07-09-2015, 05:00 AM

Correct!

Hitch · 07-09-2015, 05:05 AM

Quote:

Originally Posted by Toxaris

Correct!

Yup, what he said! ;-) Tox is quite the Abbyy expert, seriously. Far more so than I.

Hitch

Notjohn · 07-09-2015, 06:03 AM

Quote:

Originally Posted by crutledge

It seems that I need the Fine Reader. Is this correct?

Yes, it's streets ahead of the OCR software that comes bundled with a $100 scanner. But note that even if the scanning is 99.9% correct, that still leaves a whole lot of proofreading to be done. Happily, the errors tend to fall into a pattern, so you can search for them. I once scanned a backlist book about skiing, in which the term ski bum probably appeared a hundred times. Finereader mistook M for RN, so I wound up with a hundred occurrences of SKI BURN.

There were other mistooks, too, which could only be found by a Closereader.

crutledge · 07-09-2015, 08:57 AM

Quote:

Originally Posted by Toxaris

Correct!

I feel like I've gone to heaven

I would like to set up the same layout as in Acrobat. I found it once by accident.

The list of pages on the left, a single page in the middle, and the tools display on the right.

I hit it once but I'm not sure what I did.

Attached is a view of Acrobat. I achieved essentially the same view in FineReader but have been unable to repeat it.

Will someone please tell me how to achieve this view in FineReader?

Thanks

crutledge · 07-17-2015, 04:09 PM

After about three days of wrestling with FineReader, I finally got down to what might be considered an acceptable ePub.

I have much to learn. If anyone is familiar with Verification and the Character Table, I sure would like to talk.

The ePub is attached if anyone would like to throw rocks.

I would also like to know the sequence used to get from PDF to ePub.

The FineReader documentation is lacking in details I need.

Tex2002ans · 07-18-2015, 03:59 AM

Quote:

Originally Posted by crutledge

After about three days of wrestling with FineReader, I finally got down to what might be considered an acceptable ePub.

Glad to hear you finally figured it out. :P

Quote:

Originally Posted by crutledge

I have much to learn. If anyone is familiar with Verification and the Character Table, I sure would like to talk.

Character Table (?): What version of Finereader are you using? Perhaps they changed the name slightly in the newer ones. Are you talking about the "Pattern Editor" where you manually recreate the OCR for hard-to-OCR fonts?

Verification: May be helpful, depending on your workflow. I personally never use it, but I guess SOMEONE is getting some benefit out of it. It is one of those things they expanded in Finereader 12.

I would leave more thorough spellchecking for a much later step, with different tools (I much prefer the Spellcheck lists in Sigil + Calibre Editor). And you may prefer spellchecking in your favorite word processor (Word, LibreOffice, etc. etc.).

Quote:

Originally Posted by crutledge

The ePub is attached if anyone would like to throw rocks.

Hard to tell how well you did without a link to the original PDF.

One quick thing of note that I did find was some spacing before/after Right/Left quotation marks. One of the final steps I typically do in Finereader is a search for "LEFT DOUBLE QUOTE + SPACE" and "SPACE + RIGHT DOUBLE QUOTE", and replace with the unspaced version.

I know there is also a setting buried in Finereader to automatically (?) fix those spacing errors, but I never used that checkmark. I always do that as one of the very last "rounds of fixes" (in many books, the left/right single/double quotation marks may be notoriously bad when OCRed).
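For anyone who would rather run that quote-spacing pass outside Finereader (say, on the exported text or XHTML), here is a minimal Python sketch of the same search and replace. The function name `fix_quote_spacing` is made up for illustration, it only touches the double curly quotes mentioned above (single quotes double as apostrophes, so they are riskier to automate), and it is not how Finereader's own setting works.

```python
import re

LEFT_DOUBLE = "\u201c"   # left double quotation mark
RIGHT_DOUBLE = "\u201d"  # right double quotation mark


def fix_quote_spacing(text: str) -> str:
    """Remove spaces or tabs just inside double curly quotes (hypothetical helper)."""
    # "LEFT DOUBLE QUOTE + SPACE" -> "LEFT DOUBLE QUOTE"
    text = re.sub(LEFT_DOUBLE + r"[ \t]+", LEFT_DOUBLE, text)
    # "SPACE + RIGHT DOUBLE QUOTE" -> "RIGHT DOUBLE QUOTE"
    text = re.sub(r"[ \t]+" + RIGHT_DOUBLE, RIGHT_DOUBLE, text)
    return text
```

It is probably still worth eyeballing the changes afterwards, since a badly OCRed quote can point the wrong way and the spacing then tells you something.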

Quote:

Originally Posted by crutledge

I would also like to know the sequence used to get from PDF to ePub.

Way back in 2013, I did post a rough draft of an Outline I had written of my workflow at the time, and things to pay attention to while OCRing (planning to put together some sort of PDF -> EPUB Tutorial). It is Post #10 in this topic:

https://www.mobileread.com/forums/sho...d.php?t=223817

Some things have changed, most things haven't... and I would DEFINITELY expand lots of areas since then.

There are also some other alternative workflows/tools that can be used later, like going Finereader -> DOC(X) -> Word -> Toxaris's EPUB Tools -> EPUB. Toxaris initially built up his macros/tools to clean up a lot of Finereader cruft, to really speed up the monotonous merging of paragraphs, and other OCR errors that creep in.

I personally don't use Toxaris's Tools for the Finereader cleaning, but I DO use it for the other fantastic things, like Dialogue Check, which is far and away the best tool for fixing mismatching quotation marks (and now mismatching parenthesis/brackets too).


I personally still do A LOT of the cleaning and A/B comparison in Finereader, and then do the Finereader -> EPUB -> Manual cleanup with Sigil workflow.

Quote:

Originally Posted by crutledge

The FineReader documentation is lacking in details I need.

If you want to chat over webcam, I could teach you my Finereader ways. Perhaps teaching an interested pupil would revive my 2013 project. I have been looking for an interested "guinea pig" for years!! :P

I have written a heck of a lot on the subject over the many years, but it is scattered over a ton of different topics/posts. Mostly with how I deal with tackling an individual subject X, Y, or Z (Tables, Footnotes, Equations/Formulas, etc. etc.).

Last I remember was one of those massive posts I always point back to:

https://www.mobileread.com/forums/sho...d.php?t=234146

Nibbling away at certain pieces here and there (answering the person's questions, and doing my usual expanding into semi-relevant/semi-related topics).

And I never did get around to expanding that Outline at all since 2013. It has mostly just been continually growing and refining in my head.

Notjohn · 07-18-2015, 05:54 AM

>And you may prefer spellchecking in your favorite word processor (Word, LibreOffice, etc. etc.).

IMHO, flagging spelling errors is the single best feature of Word. I just last week pasted a Sigil epub into Word for just that purpose. Even the green underlining for usage is sometimes helpful. Just as one sometimes writes for the 9th grade level of comprehension, I figure writing for Bill Gates's reading level might sometimes be called for. There's no point in puzzling people unnecessarily.

Tex2002ans · 07-18-2015, 08:23 AM

Quote:

Originally Posted by Notjohn

IMHO, flagging spelling errors is the single best feature of Word. I just last week pasted a Sigil epub into Word for just that purpose.

I believe I sent PMs/emails a few years back discussing this topic. I forget if I posted about this on the forums. I did a quick search and couldn't find any old posts of mine talking about it, but I could have sworn I did!

My Thoughts on Dictionaries:

Using a dictionary of words is a balancing act between including TOO MANY words and too few. Too many will cause you to miss a lot of errors, and too few will cause you to waste a lot of manhours checking perfectly spelled words.

As an example of missing errors because of the dictionary, I always go back to the simple case of hyphenated or accented words:

"cooperate" + "co-operate" + "coöperate"

are all perfectly legitimate versions of the word, but Word might say all three are correct, when in reality one or two of the variations in this book may have been an OCR error. A common OCR error is accidentally introducing accented characters that were never there (dust, markings on the page, [...]). Another is soft hyphens in the PDF being incorrectly turned into hard hyphens (words split at the end of lines/pages).

Side Note: I wouldn't rely on the Spellcheck in Word alone... the English dictionary that comes with Sigil/Calibre is not as extensive as the one in Word, but I find this to be a good thing. Many errors live in that crack between the two sets of dictionaries. I would much rather go with the dictionary with too few words, over the one with too many (letting errors slip by).

I would also prefer one that doesn't treat the word on each side of a hyphen as its own word ("jumpstart" + "jump-start" + "jump start").
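To make the "crack between the two dictionaries" idea concrete, here is a small Python sketch. It assumes you can get both word lists as plain Python sets (how you export them from Word or Sigil/Calibre is left out), and the function name `words_in_the_gap` is invented for this example. It simply lists words in the book that the bigger dictionary accepts and the smaller one does not, which are the ones worth a second look.

```python
import re

# Letters only, optionally joined by a hyphen or apostrophe ("co-operate", "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'\u2019][^\W\d_]+)*")


def words_in_the_gap(text, big_dictionary, small_dictionary):
    """Return book words the big dictionary accepts but the small one rejects."""
    words = {w.lower() for w in WORD_RE.findall(text)}
    return sorted(
        w for w in words
        if w in big_dictionary and w not in small_dictionary
    )
```

With the example from above, a book containing "co-operate" and "coöperate" would get both flagged if only the plain "cooperate" is in the smaller list.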

Side Note: Although I do agree with you about the grammar checking (the green squiggly lines) being very helpful. That sometimes helps catch a whole class of errors that can't be caught using just normal Spellchecking, such as wrong usages of "than" + "then".

Some Classes of OCR Errors and My Solutions:

Accents: Toxaris has a "Check Accents" button that checks the document for accented characters. Although I prefer looking up each accented character using the Spellcheck lists in Sigil and Calibre's Editor to just get an easy-to-see list of every word in the EPUB with that character. Then I just easily A/B compare in Finereader with the original PDF.

Side Note: I work mostly on English works, so that solution might not be the greatest if you work on languages with MANY accents.
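If you don't have the Spellcheck lists handy, the same kind of "every word with an accented character" list can be approximated in a few lines of Python. This is only a sketch under my own assumptions: `accented_words` is a made-up name, the text is assumed to be already extracted from the EPUB as a plain string, and "accented" is taken loosely to mean any letter outside plain ASCII.

```python
import re
import unicodedata
from collections import Counter

# Letters only, optionally joined by a hyphen or apostrophe.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'\u2019][^\W\d_]+)*")


def accented_words(text):
    """Count every word that contains at least one non-ASCII letter."""
    # Normalize so "e" + combining accent is treated as one accented letter.
    text = unicodedata.normalize("NFC", text)
    counts = Counter()
    for word in WORD_RE.findall(text):
        if any(ord(ch) > 127 and ch.isalpha() for ch in word):
            counts[word] += 1
    return counts
```

Each word in the result is then a candidate for the A/B comparison against the original PDF.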

Hyphenation: I prefer just typing "-" into the search box in Sigil/Calibre's Spellcheck list to get a list of every single word with a hyphen in it, and then go through the hyphenated words to see if I can spot any blatant errors. At least one pass with "Show misspelled words" on and one off.

Side Note: I did create a program I personally use that helps in this regard. I name this whole class of errors "Hyphenation Inconsistencies".

The program compares all words with hyphens with their non-hyphenated versions, and tells me if there are any matches in the same book (a book using "non-hyphenated" and "nonhyphenated" at the same time is most likely an error, or at least has to be looked into).
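To be clear, the following is not that program, just a minimal Python sketch of the comparison described above, with an invented function name (`hyphenation_inconsistencies`). It assumes the book text is available as one plain string, and it only compares a hyphenated word with its fully joined form; the spaced variant ("jump start") and prefix/suffix handling are not attempted.

```python
import re
from collections import Counter

# Letters only, optionally joined by hyphens ("non-hyphenated", "mother-in-law").
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def hyphenation_inconsistencies(text):
    """Report hyphenated words whose unhyphenated form also occurs in the text.

    Returns sorted tuples: (hyphenated, count, joined, count).
    """
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    report = []
    for word, n in counts.items():
        if "-" not in word:
            continue
        joined = word.replace("-", "")
        if joined in counts:
            report.append((word, n, joined, counts[joined]))
    return sorted(report)
```

The counts are included because the rarer spelling is usually (though not always) the one that needs checking against the source.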

But the code is bad, it is buggy (can't handle UTF-8 as well as I would like, can't handle words properly with two or more hyphens, [...]), and I don't want to release it to the public like that. :P There is also A TON more that has to be programmed to solve this "Hyphenation Problem" (handling inconsistent prefixes/suffixes, comparing spaced/unspaced, [...]).

Also, according to all my testing, it seems like every book I worked on has 0-8 of these "hyphenation inconsistencies", a handful of which were problems in the books/documents themselves. Seems to me like this is a very common error that humans make in large works, and nobody really has a way to automatically check or notify you of this stuff.

I have also been running it on all the DOC(X)/InDesign files of new books that have been coming my way, and reporting the 0-8 errors to the authors/publishers. In one book, I caught 4 of these hyphenation inconsistencies in the Preface itself!

Right But Wrong: There are also whole classes of common OCR errors that can't be caught by dictionaries, because the mistakes are also correctly spelled words:

"modern" + "modem"
"corn" + "com"
[...]

Most of these require manual checking, and can't just be fully automated.
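One partial aid, sketched here in Python under my own assumptions, is to flag word pairs in the same book that differ only by an "rn" versus "m" swap, as in the two examples above. The function name `confusable_pairs` and the `confusions` parameter are hypothetical, and the only confusion listed is the one shown in this post; it produces a list to check by hand, not a fix.

```python
import re
from collections import Counter


def confusable_pairs(text, confusions=(("rn", "m"),)):
    """Find words that turn into another word in the text after one OCR-style swap.

    Returns sorted tuples: (word, count, variant, count).
    """
    counts = Counter(w.lower() for w in re.findall(r"[^\W\d_]+", text))
    pairs = set()
    for word in counts:
        for wide, narrow in confusions:
            start = word.find(wide)
            while start != -1:
                # Swap one occurrence at a time, e.g. "modern" -> "modem".
                variant = word[:start] + narrow + word[start + len(wide):]
                if variant in counts:
                    pairs.add((word, counts[word], variant, counts[variant]))
                start = word.find(wide, start + 1)
    return sorted(pairs)
```

The obvious limitation is that it only fires when both spellings occur in the same book; a word that was misread every single time would need a check against a dictionary instead.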

The only thing I can think of at the moment of a semi-automated solution would be Toxaris's "Search/Replace" functionality:

http://www.toxaris.nl/helpen/index.h...ek_vervang.htm

combined with his "Replacerules" that can be found here:

http://toxaris.nl/en/

Although this requires someone to actually go through and create the proper Search/Replace lists... I just haven't put together the time to figure out Word's version of Regex and do it myself (although I do have tens/hundreds of this class of words written down on pieces of paper over the years, and in my head).

If Toxaris reads this, I am sorry for never getting around to it... it is on my backburner though (for a very long time now).

