07-05-2015, 12:16 PM | #1 |
eBook FANatic
Posts: 18,301
Karma: 16071131
Join Date: Apr 2008
Location: Alabama, USA
Device: HP ipac RX5915 Wife's Kindle
|
pdf to epub
Has anyone developed the tools and procedures to go from a book scan to an epub?
|
07-05-2015, 03:53 PM | #2 |
A curiosus lector!
Posts: 463
Karma: 2015140
Join Date: Jun 2012
Device: Sony PRS-T1, Kobo Touch
|
Unfortunately, as far as I know, there are no easy tools or procedures for this. Even with ABBYY FineReader and a very clean pdf file (a single language, a good scan, crisp letters, for example), there is a lot to check.
Moreover, I think a direct export to epub format is risky (at least!). I'd prefer to clean up the text in a familiar editor (Writer, Word or AWP) and then export that file to epub format. (With Word 2007 and up you can use the tools created by Toxaris.) Anyway, have you tried ABBYY, if you own it? With it you can do what you want (book scan in pdf format to epub), but generally the resulting file does not pass epubcheck. |
|
07-05-2015, 05:24 PM | #3 | |
eBook FANatic
Posts: 18,301
Karma: 16071131
Join Date: Apr 2008
Location: Alabama, USA
Device: HP ipac RX5915 Wife's Kindle
|
Quote:
|
|
07-05-2015, 06:27 PM | #4 |
A curiosus lector!
Posts: 463
Karma: 2015140
Join Date: Jun 2012
Device: Sony PRS-T1, Kobo Touch
|
I can't speak for all of them, but AFAIK, this is the best software of its kind.
But it has a price (I use version 11). You can check around for a good price. As I said, at the end of the road there is a lot of work to do anyway. Sorry, but I can't compare it with Acrobat: I only used Acrobat as a trial, and I prefer ABBYY. |
07-05-2015, 07:51 PM | #5 |
Resident Curmudgeon
Posts: 74,037
Karma: 129333114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Adobe Acrobat Pro cannot convert without errors.
|
|
07-08-2015, 06:45 PM | #6 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
To me, those are night and day. Abbyy actually gives you, by and large, what's in the book. It takes some learning, etc., natch, and it's not free. (Although you can use their online, in-browser version for nowt, for...up to X pages, or something. Don't remember.)
Acrobat outputs utter crap. I can't believe you're using it, it's so bad. When you use Acrobat Pro, it's the iceberg that hit the Titanic. You can see this by copy-pasting almost ANYTHING from a Word file output by Acrobat into the SEARCH window (not the file--the search box) in the Document pane or the regular search box, and you'll see what you're getting in the murk. Make sure you try words like "fiat", for example, or anything that has symbols or characters for punctuation, etc. It's fugly, on a massive scale. (n.b.: although, having said that, when you look at your HTML, are you getting CLEAN results? I mean, I admit, I'd be gobsmacked if you are, but if you are....)
I get far, far, FAR better results with Abbyy than Acrobat Pro. They're just not the same thing. Abbyy understands and tries to compensate for that second layer in PDFs, whereas all Acrobat cares about is outputting a Word file that LOOKS like the PDF source, which isn't necessarily remotely the same as the PDF source. STRONGLY recommend Abbyy over Acrobat for this purpose. Acrobat does a lot of things really well, but this isn't one of them, IMHO.
Hitch |
|
07-09-2015, 04:29 AM | #7 |
eBook FANatic
Posts: 18,301
Karma: 16071131
Join Date: Apr 2008
Location: Alabama, USA
Device: HP ipac RX5915 Wife's Kindle
|
Hitch, Good to hear from you.
Abbyy has several products. It seems that I need FineReader. Is this correct? Charlie |
07-09-2015, 05:00 AM | #8 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Correct!
|
07-09-2015, 05:05 AM | #9 |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
|
07-09-2015, 06:03 AM | #10 |
mostly an observer
Posts: 1,515
Karma: 987654
Join Date: Dec 2012
Device: Kindle
|
Yes, it's streets ahead of the OCR software that comes bundled with a $100 scanner. But note that even if the scanning is 99.9% correct, that still leaves a whole lot of proof-reading to be done. Happily the errors tend to fall into a pattern, so you can search for them. I once scanned a backlist book about skiing, in which the term ski bum probably appeared a hundred times. Finereader mistook M for RN, so I wound up with a hundred occurrences of SKI BURN.
There were other mistooks, too, which could only be found by a Closereader. |
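For anyone who wants to hunt that kind of patterned error outside FineReader, here is a minimal Python sketch, not anything the posters above actually used. It assumes you have the OCR output as plain text and that you supply your own set of words the book is known to use; the function name `suspect_words`, the `CONFUSIONS` list and the `known_words` parameter are all made up for illustration.

```python
import re

# Hypothetical confusion pairs: (what the OCR produced, what was probably printed).
# Only the M -> RN mix-up from the post above is listed; extend as you find more.
CONFUSIONS = [("rn", "m")]

def suspect_words(text, known_words, confusions=CONFUSIONS):
    """Return (ocr_word, likely_word) pairs that deserve a manual look.

    A word is flagged when swapping a confusion pair turns it into a word
    from known_words (a set of lowercase words you expect in this book).
    """
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0)
        lower = word.lower()
        for wrong, right in confusions:
            if wrong in lower:
                candidate = lower.replace(wrong, right)
                if candidate in known_words:
                    hits.append((word, candidate))
    return hits

# Usage: "burn" is a real word, so a spellchecker will not complain about it.
print(suspect_words("one more SKI BURN", {"bum"}))  # [('BURN', 'bum')]
```

The sketch only reports candidates; each hit still has to be checked against the scan, since "burn" may well be what the page says.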
07-09-2015, 08:57 AM | #11 | |
eBook FANatic
Posts: 18,301
Karma: 16071131
Join Date: Apr 2008
Location: Alabama, USA
Device: HP ipac RX5915 Wife's Kindle
|
Quote:
Will someone please tell me how to achieve this view in FineReader? Thanks
Last edited by crutledge; 07-09-2015 at 04:54 PM. |
|
07-17-2015, 04:09 PM | #12 |
eBook FANatic
Posts: 18,301
Karma: 16071131
Join Date: Apr 2008
Location: Alabama, USA
Device: HP ipac RX5915 Wife's Kindle
|
I finally did it.
After about three days of wrestling with FineReader, I finally got down to what might be considered an acceptable ePub.
I have much to learn. If anyone is familiar with Verification and the Character Table, I sure would like to talk. The ePub is attached if anyone would like to throw rocks. I would also like to know the sequence used to get from PDF to ePub. The FineReader documentation is lacking in the details I need. |
07-18-2015, 03:59 AM | #13 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
Verification: May be helpful, depending on your workflow. I personally never use it, but I guess SOMEONE is getting some benefit out of it. It is one of those things they expanded in Finereader 12.
I would leave more thorough spellchecking for a much later step, with different tools (I much prefer the Spellcheck lists in Sigil + Calibre Editor). And you may prefer spellchecking in your favorite word processor (Word, LibreOffice, etc. etc.).
Hard to tell how well you did without a link to the original PDF.
One quick thing of note that I did find was some spacing before/after Right/Left quotation marks. One of the final steps I typically do in Finereader is a search for "LEFT DOUBLE QUOTE + SPACE" and "SPACE + RIGHT DOUBLE QUOTE", and replace with the unspaced version. I know there is also a setting buried in Finereader to automatically (?) fix those spacing errors, but I never used that checkmark. I always do that as one of the very last "rounds of fixes" (in many books, the left/right single/double quotation marks may be notoriously bad when OCRed).
Quote:
https://www.mobileread.com/forums/sho...d.php?t=223817
Some things have changed, most things haven't... and I would DEFINITELY expand lots of areas since then.
There are also some other alternative workflows/tools that can be used later, like going Finereader -> DOC(X) -> Word -> Toxaris's EPUB Tools -> EPUB. Toxaris initially built up his macros/tools to clean up a lot of Finereader cruft, to really speed up the monotonous merging of paragraphs, and to fix other OCR errors that creep in. I personally don't use Toxaris's Tools for the Finereader cleaning, but I DO use them for the other fantastic things, like Dialogue Check, which is far and away the best tool for fixing mismatching quotation marks (and now mismatching parentheses/brackets too).
I personally still do A LOT of the cleaning and A/B comparison in Finereader, and then do the Finereader -> EPUB -> Manual cleanup with Sigil workflow.
If you want to chat over webcam, I could teach you my Finereader ways. Perhaps teaching an interested pupil would revive my 2013 project. I have been looking for an interested "guinea pig" for years!! :P
I have written a heck of a lot on the subject over the many years, but it is scattered over a ton of different topics/posts, mostly about how I tackle an individual subject X, Y, or Z (Tables, Footnotes, Equations/Formulas, etc. etc.). The last I remember was one of those massive posts I always point back to: https://www.mobileread.com/forums/sho...d.php?t=234146
I keep nibbling away at certain pieces here and there (answering the person's questions, and doing my usual expanding into semi-relevant/semi-related topics). And I never did get around to expanding that Outline at all since 2013. It has mostly just been continually growing and refining in my head.
Last edited by Tex2002ans; 07-18-2015 at 04:28 AM. |
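The quotation-mark spacing fix described above is done inside Finereader's own search dialog. As a rough equivalent for text that has already been exported, here is a small Python sketch; it assumes the book uses the curly double quotes U+201C and U+201D, and the function name `tighten_quotes` is hypothetical.

```python
import re

LEFT_DQ = "\u201c"   # left double quotation mark
RIGHT_DQ = "\u201d"  # right double quotation mark

def tighten_quotes(text):
    """Remove stray spaces just inside curly double quotes."""
    # "LEFT DOUBLE QUOTE + SPACE" -> left quote with no space after it.
    text = re.sub(LEFT_DQ + r"[ \t]+", LEFT_DQ, text)
    # "SPACE + RIGHT DOUBLE QUOTE" -> right quote with no space before it.
    text = re.sub(r"[ \t]+" + RIGHT_DQ, RIGHT_DQ, text)
    return text

sample = "He said, \u201c Hello there. \u201d Then he left."
print(tighten_quotes(sample))
```

Only spaces and tabs are removed, so a quote that opens at the start of a line keeps its line break. Single quotes are left alone on purpose, because a right single quote doubles as an apostrophe and needs a human eye.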
|||
07-18-2015, 05:54 AM | #14 |
mostly an observer
Posts: 1,515
Karma: 987654
Join Date: Dec 2012
Device: Kindle
|
>And you may prefer spellchecking in your favorite word processor (Word, LibreOffice, etc. etc.).
IMHO, flagging spelling errors is the single best feature of Word. I just last week pasted a Sigil epub into Word for just that purpose. Even the green underlining for usage is sometimes helpful. Just as one sometimes writes for the 9th grade level of comprehension, I figure writing for Bill Gates's reading level might sometimes be called for. There's no point in puzzling people unnecessarily. |
07-18-2015, 08:23 AM | #15 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
My Thoughts on Dictionaries: Using a dictionary of words is a balancing act between including TOO MANY words and too few. Too many will cause you to miss a lot of errors, and too few will cause you to waste a lot of manhours checking perfectly spelled words.
As an example of missing errors because of the dictionary, I always go back to the simple case of hyphenated or accented words: "cooperate" + "co-operate" + "coöperate" are all perfectly legitimate versions of the word, and Word might say all three are correct, when in reality one or two of the variations in this book may have been an OCR error. A common OCR error is accidentally introducing accented characters that were never there (dust, markings on the page, [...]), or soft hyphens in the PDF might have been incorrectly turned into hard hyphens (words split at the end of lines/pages).
Side Note: I wouldn't rely on the Spellcheck in Word alone... the English dictionary that comes with Sigil/Calibre is not as extensive as the one in Word, but I find this to be a good thing. Many errors live in that crack between the two sets of dictionaries. I would much rather go with the dictionary with too few words, over the one with too many (letting errors slip by). I would also prefer one that doesn't treat each word on either side of a hyphen as its own word ("jumpstart" + "jump-start" + "jump start").
Side Note: I do agree with you about the grammar checking (the green squiggly lines) being very helpful. That sometimes helps catch a whole class of errors that can't be caught using just normal Spellchecking, such as wrong usages of "than" + "then".
Some Classes of OCR Errors and My Solutions:
Accents: Toxaris has a "Check Accents" button that checks the document for accented characters. I prefer looking up each accented character using the Spellcheck lists in Sigil and Calibre's Editor, to get an easy-to-see list of every word in the EPUB with that character. Then I just A/B compare in Finereader with the original PDF.
Side Note: I work mostly on English works; that solution might not be the greatest if you work on languages with MANY accents.
Hyphenation: I prefer just typing "-" into the search box in Sigil/Calibre's Spellcheck list to get a list of every single word with a hyphen in it, and then going through the hyphenated words to see if I can spot any blatant errors. At least one pass with "Show misspelled words" on and one with it off.
Side Note: I did create a program I personally use that helps in this regard. I name this whole class of errors "Hyphenation Inconsistencies". The program compares all words with hyphens with their non-hyphenated versions, and tells me if there are any matches in the same book (a book using "non-hyphenated" and "nonhyphenated" at the same time is most likely an error, or at least has to be looked into). But the code is bad, it is buggy (can't handle UTF-8 as well as I would like, can't handle words properly with two or more hyphens, [...]), and I don't want to release it to the public like that. :P There is also A TON more that has to be programmed to solve this "Hyphenation Problem" (handling inconsistent prefixes/suffixes, comparing spaced/unspaced, [...]).
Also, according to all my testing, it seems like every book I worked on has 0-8 of these "hyphenation inconsistencies", a handful of which were problems in the books/documents themselves. Seems to me like this is a very common error that humans make in large works, and nobody really has a way to automatically check or notify you of this stuff. I have also been running it on all the DOC(X)/InDesign files of new books that come my way, and reporting the 0-8 errors to the authors/publishers. In one book, I caught 4 of these hyphenation inconsistencies in the Preface itself!
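The "Hyphenation Inconsistencies" idea is easy to sketch, though what follows is NOT the unreleased program mentioned above, just a minimal Python illustration of the same comparison. It assumes the book text is available as one plain string; the names `WORD_RE` and `hyphenation_inconsistencies` are made up, and words with several hyphens are handled only by stripping every hyphen at once.

```python
import re
from collections import Counter

# A "word" here is a run of letters, optionally joined by single hyphens.
# [^\W\d_] matches any Unicode letter, so accented words stay in one piece.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def hyphenation_inconsistencies(text):
    """List hyphenated words whose unhyphenated twin also occurs in text.

    Returns tuples of (hyphenated, count, joined, count), sorted by word.
    """
    counts = Counter(word.lower() for word in WORD_RE.findall(text))
    report = []
    for word, n in sorted(counts.items()):
        if "-" in word:
            joined = word.replace("-", "")
            if joined in counts:
                report.append((word, n, joined, counts[joined]))
    return report

sample = ("A non-hyphenated word, then a nonhyphenated word. "
          "Non-hyphenated again, and a well-known one.")
for row in hyphenation_inconsistencies(sample):
    print(row)  # ('non-hyphenated', 2, 'nonhyphenated', 1)
```

The counts help decide which spelling the book actually intends. The sketch does not compare spaced variants ("jump start") or inconsistent prefixes/suffixes, which the post lists as still unsolved.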
Right But Wrong: There are also whole classes of common OCR errors that can't be caught by dictionaries, because the mistakes are also correctly spelled words:
"modern" + "modem"
"corn" + "com"
[...]
Most of these require manual checking, and can't just be fully automated. The only semi-automated solution I can think of at the moment would be Toxaris's "Search/Replace" functionality: http://www.toxaris.nl/helpen/index.h...ek_vervang.htm combined with his "Replacerules" that can be found here: http://toxaris.nl/en/
Although this requires someone to actually go through and create the proper Search/Replace lists... I just haven't put together the time to figure out Word's version of Regex and do it myself (although I do have tens/hundreds of this class of words written down on pieces of paper over the years, and in my head). If Toxaris reads this, I am sorry for never getting around to it... it is on my backburner though (and has been for a very long time now).
Last edited by Tex2002ans; 07-18-2015 at 09:15 AM. |
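For the "Right But Wrong" class, a short script can at least collect every occurrence of a known troublemaker with some surrounding text, so the manual check against the scan goes faster. This is a Python sketch under assumptions, not Toxaris's Search/Replace or Replacerules: the `PAIRS` list holds only the two pairs named in the post, and `flag_confusables` is a hypothetical name.

```python
import re

# The two pairs named in the post; a real list would be much longer.
PAIRS = [("modern", "modem"), ("corn", "com")]

def flag_confusables(text, pairs=PAIRS, width=30):
    """Return (found_word, possible_intended_word, context) for each hit.

    Both words of every pair are flagged, because either one may be
    what the printed page really says.
    """
    partner = {}
    for a, b in pairs:
        partner[a] = b
        partner[b] = a
    alternatives = "|".join(re.escape(word) for word in partner)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        found = match.group(0)
        hits.append((found, partner[found.lower()], text[start:end]))
    return hits

for found, other, context in flag_confusables("The modem age began when corn was cheap."):
    print(f"{found!r} (or {other!r}?): ...{context}...")
```

Nothing is replaced automatically, in keeping with the point above that these cases need a human decision.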
|
|