Thread: pdf to epub
Old 07-18-2015, 08:23 AM   #15
Tex2002ans
Quote:
Originally Posted by Notjohn
IMHO, flagging spelling errors is the single best feature of Word. I just last week pasted a Sigil epub into Word for just that purpose.
I believe I sent PMs/emails a few years back discussing this topic. I forget if I posted about this on the forums. I did a quick search and couldn't find any old posts of mine talking about it, but I could have sworn I did!

My Thoughts on Dictionaries:

Using a dictionary of words is a balancing act between including TOO MANY words and too few. Too many will cause you to miss a lot of errors, and too few will cause you to waste a lot of hours checking perfectly spelled words.

As an example of missing errors because of the dictionary, I always go back to the simple case of hyphenated or accented words:

"cooperate" + "co-operate" + "coöperate"

are all perfectly legitimate versions of the word, but Word might say all three are correct, when in reality one or two of the variations in this book may be OCR errors. A common OCR error is accidentally introducing accented characters that were never there (dust, markings on the page, [...]). Another is soft hyphens in the PDF being incorrectly turned into hard hyphens (words split at the end of lines/pages).

Side Note: I wouldn't rely on the Spellcheck in Word alone... the English dictionary that comes with Sigil/Calibre is not as extensive as the one in Word, but I find this to be a good thing. Many errors live in that crack between the two sets of dictionaries. I would much rather go with the dictionary with too few words, over the one with too many (letting errors slip by).

I would also prefer one that doesn't treat each word on either side of a hyphen as its own word ("jumpstart" + "jump-start" + "jump start").
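To make the "crack between the two dictionaries" idea concrete, here is a minimal Python sketch. It is only an illustration: the two word sets are tiny hypothetical stand-ins (not the real Word or Sigil/Calibre dictionaries), and the function name is made up. The tokenizer keeps hyphenated words whole instead of splitting them at the hyphen.

```python
import re

# Hypothetical stand-ins for a large and a small dictionary.
LARGE_DICT = {"cooperate", "co-operate", "coöperate", "modern", "modem", "the"}
SMALL_DICT = {"cooperate", "modern", "the"}

# Runs of letters, optionally joined by hyphens or apostrophes,
# so "co-operate" stays one token.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")

def words_in_the_crack(text, large=LARGE_DICT, small=SMALL_DICT):
    """Return words the large dictionary accepts but the small one does not."""
    seen = {w.lower() for w in WORD_RE.findall(text)}
    return sorted(w for w in seen if w in large and w not in small)

print(words_in_the_crack("The modem age requires us to coöperate and co-operate."))
```

Every word this prints would pass silently under the larger dictionary, so it is a short list worth checking by eye against the original.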

Side Note: I do agree with you about the grammar checking (the green squiggly lines) being very helpful, though. It sometimes helps catch a whole class of errors that can't be caught using just normal Spellchecking, such as wrong usages of "than" + "then".

Some Classes of OCR Errors and My Solutions:

Accents: Toxaris has a "Check Accents" button that checks the document for accented characters. I prefer looking up each accented character using the Spellcheck lists in Sigil and Calibre's Editor, which gives an easy-to-see list of every word in the EPUB with that character. Then I just A/B compare in Finereader with the original PDF.

Side Note: I work mostly on English-language works, so that solution might not be the greatest if you work on languages with MANY accents.
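Outside of Sigil/Calibre, the same kind of list can be approximated with a few lines of Python. This is a rough sketch under stated assumptions: the text has already been extracted from the EPUB as a plain string, "accented" is taken to mean "any non-ASCII letter", and the function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Runs of letters, optionally joined by hyphens or apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")

def has_non_ascii_letter(word):
    """True if the word contains a letter outside plain ASCII (é, ï, ö, ...)."""
    return any(ch.isalpha() and ord(ch) > 127 for ch in word)

def accented_words(text):
    """Count every word that contains at least one non-ASCII letter."""
    # Compose base letter + combining mark into one character first,
    # so the tokenizer does not split such words apart.
    text = unicodedata.normalize("NFC", text)
    return Counter(w for w in WORD_RE.findall(text) if has_non_ascii_letter(w))

for word, count in accented_words("A naïve café, a cafe, a café.").most_common():
    print(count, word)
```

For a mostly English book the output tends to be short enough to compare against the scan one word at a time.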

Hyphenation: I prefer just typing "-" into the search box in Sigil/Calibre's Spellcheck list to get a list of every single word with a hyphen in it, and then going through the hyphenated words to see if I can spot any blatant errors. I do at least one pass with "Show misspelled words" on and one with it off.

Side Note: I did create a program I personally use that helps in this regard. I name this whole class of errors "Hyphenation Inconsistencies".

The program compares all words with hyphens with their non-hyphenated versions, and tells me if there are any matches in the same book (a book using "non-hyphenated" and "nonhyphenated" at the same time is most likely an error, or at least has to be looked into).

But the code is bad, it is buggy (can't handle UTF-8 as well as I would like, can't handle words properly with two or more hyphens, [...]), and I don't want to release it to the public like that. :P There is also A TON more that has to be programmed to solve this "Hyphenation Problem" (handling inconsistent prefixes/suffixes, comparing spaced/unspaced, [...]).

Also, according to all my testing, it seems like every book I worked on has 0-8 of these "hyphenation inconsistencies", a handful of which were problems in the books/documents themselves. Seems to me like this is a very common error that humans make in large works, and nobody really has a way to automatically check or notify you of this stuff.

I have also been running it on all the DOC(X)/InDesign files of new books that have been coming my way, and reporting the 0-8 errors to the authors/publishers. In one book, I caught 4 of these hyphenation inconsistencies in the Preface itself!
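For readers who want to try the idea, here is a minimal Python sketch of a "hyphenation inconsistency" check. It is not the program described above, just an illustration of the same comparison under simple assumptions: plain-text input, single-hyphen words only, and a hypothetical function name. It also looks at the spaced form ("jump start"), which will produce some false positives.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by hyphens ("jump-start" is one token).
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def hyphenation_inconsistencies(text):
    """Report hyphenated words whose joined or spaced form also occurs."""
    tokens = [w.lower() for w in WORD_RE.findall(text)]
    counts = Counter(tokens)
    # Adjacent plain-word pairs, to catch the spaced form. Punctuation
    # between the two words is ignored, so this is deliberately loose.
    pairs = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if "-" not in a and "-" not in b
    )
    report = {}
    for word, n in counts.items():
        parts = word.split("-")
        if len(parts) != 2:
            continue  # plain and multi-hyphen words are skipped in this sketch
        joined = counts.get("".join(parts), 0)
        spaced = pairs.get(tuple(parts), 0)
        if joined or spaced:
            report[word] = {"hyphenated": n, "joined": joined, "spaced": spaced}
    return report

sample = "We jump-start the car. A jumpstart helps. Then jump start again."
print(hyphenation_inconsistencies(sample))
```

The counts help with the judgment call: a form that appears once against dozens of the other is the likelier mistake. A book that mixes only the joined and spaced forms, with no hyphenated form at all, is not caught by this sketch.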

Right But Wrong: There are also whole classes of common OCR errors that can't be caught by dictionaries, because the mistakes are also correctly spelled words:

"modern" + "modem"
"corn" + "com"
[...]

Most of these require manual checking, and can't just be fully automated.

The only semi-automated solution I can think of at the moment would be Toxaris's "Search/Replace" functionality:

http://www.toxaris.nl/helpen/index.h...ek_vervang.htm

combined with his "Replacerules" that can be found here:

http://toxaris.nl/en/

This does require someone to actually go through and create the proper Search/Replace lists, though... I just haven't found the time to figure out Word's version of Regex and do it myself (although I do have tens/hundreds of this class of words written down on pieces of paper over the years, and in my head).

If Toxaris reads this, I am sorry for never getting around to it... it is on my backburner though (for a very long time now).
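As a rough illustration of what such a list could look like outside of Word, here is a minimal Python sketch. It is not Toxaris's tool and does not use Word's regex flavor. The two entries are just the examples from this post, the direction (the second word is what the scan probably said) is an assumption, and the function name is hypothetical. It only flags candidates with some surrounding context; deciding whether each one is really an error stays a manual job.

```python
import re

# Hypothetical starter list: suspect word -> what the original may have said.
# Assumes the usual "rn" misread as "m" direction.
CONFUSABLES = {"modem": "modern", "com": "corn"}

def flag_confusables(text, confusables=CONFUSABLES, context=30):
    """Yield (suspect, possible_original, snippet) tuples for manual review."""
    if not confusables:
        return
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, confusables)) + r")\b", re.IGNORECASE
    )
    for m in pattern.finditer(text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        yield m.group(0), confusables[m.group(0).lower()], text[start:end]

for suspect, maybe, snippet in flag_confusables("The modem world grows com in fields."):
    print(f"{suspect!r} (perhaps {maybe!r}): ...{snippet}...")
```

Nothing is replaced automatically, since "modem" and "com" are both perfectly good words in the right book.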

Last edited by Tex2002ans; 07-18-2015 at 09:15 AM.