#16 |
null operator (he/him)
Posts: 21,736
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@Doitsu - thanks for the tips re use of Grammar Checker, I was planning on trying it over the weekend.
Having cut my teeth on Algol and various assemblers, I find it 'interesting' that your sample combines the ultra-verbosity of XML with the maxi-terseness of regex. Not your fault of course, it's the way of the world as it is - more given to extremes.
FWIW - I use the calibre editor Reports to: a) eyeball the Words list filtered on '-' (this morning I found 'them-selves' and 'Wag-nails'), b) scan the Character list for 'odd-ball' characters. The ability to sort the various lists on frequency is helpful, as is the facility to save a list to a csv.
BR
#17 |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
Thanks for all your responses - I've been so busy proofreading that I haven't been back on the forum. I obviously have a lot to read through and digest.
I may need to consider my method of doing ebooks. I don't use Word; I use LibreOffice. I don't use Sigil; I use the CoffeeCup HTML editor I used when I dabbled in website design years ago - though I often use Notepad++ first. CoffeeCup does have a spell checker.
My usual practice is to find a book I want to do on the Internet Archive or elsewhere, and download the pdf and ePub 'ebook' files. Then I open up the ePub ebook file and edit the HTML files within.
Another thing I've learned recently is to take more trouble to find the best possible original file. Some of the pdfs on the Internet Archive are just awful, making it much more likely that I'll read a comma as a period, and vice versa and so on. So where I can, I use the HathiTrust version to check against - their version is usually excellent, and they seem to have nearly everything I want. But of course one cannot download the files.
Thanks for the suggestions about fonts; I normally use Amasis when proofreading, but will experiment with the other font families and see if that helps.
#18 | |||||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I must admit, I haven't touched LibreOffice in a while (I just use Notepad++ for all my writing). But the more types of tools you can throw at it, the better (certain tools might catch errors that others might miss).
Quote:
The plugin felt quite rough:
Quote:
Quote:
Yes, this is one of the first steps I do after I OCR the book. Who knows what crazy characters might have snuck in (or accents on characters). I then go through the book and check every odd/accented character to double-check they are correct. Doing this pass also helps you potentially catch inconsistencies like "vis-à-vis" + "vis-a-vis" existing in the same book.
Side Note: Before Toxaris comes swooping in here, yes, his EPUB Tools also has "Check Accents".
Quote:
Again, just a different way to visualize the data might make discrepancies stand out like a sore thumb.
Last edited by Tex2002ans; 07-09-2016 at 12:38 AM.
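For anyone who would rather script the odd-character pass than use the Reports tool, here is a minimal sketch in Python. It assumes the book text has already been extracted to a plain string, and the function name `odd_characters` is made up for this example. It simply counts every non-ASCII character and lists them by frequency, so stray accents stand out.

```python
import unicodedata
from collections import Counter


def odd_characters(text):
    """Return (character, Unicode name, count) for each non-ASCII character,
    most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = "vis-\u00e0-vis and vis-a-vis in the same caf\u00e9, caf\u00e9"
    for ch, name, n in odd_characters(sample):
        print(f"{ch}  {name}  x{n}")
```

A character that shows up only once or twice in a whole book is usually the one worth checking against the page scan.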
#19 | |
frumious Bandersnatch
Posts: 7,549
Karma: 19500001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Quote:
Is there something "wrong" in your system? I don't think so either. I haven't actually tried the font (maybe once long ago), but it takes a fair amount of work and knowledge to make a nice and smooth font, and/or sophisticated software. I guess the creators of the font didn't have either of them.
#20 | |||
Grand Sorcerer
Posts: 5,730
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
It is indeed a bit rough, but it was the best I could do with my very limited Python skills.
BTW, I found a Windows bug related to the ngram spellcheck feature that required a minor update. If you want to experiment with ngrams, you'll need to install the latest version.
As for your questions:
Quote:
This feature might be easier to implement in Calibre, because it's based on Python. Maybe Kovid Goyal will implement it, if you ask him nicely.
I'll also ask KevinH whether he could add some kind of Python-accessible highlight function, but since that would probably require a lot of work and not that many people are interested in this plugin, it's not very likely to happen. Unfortunately, the software module used for validation messages doesn't support multi-line text.
Quote:
Code:
"allFiles": true
It really slows LanguageTool down, but it did find some problems. It all depends on the texts that you want to check. Quote:
Code:
{
  "enabledOnly": true,
  "enabledRules": "CONFUSION_RULE",
  "ngramIndexDir": "C:/ngrams",
  "ltPath": "C:/Program Files/LanguageTool-3.3/languagetool-commandline.jar",
  "allFiles": true
}
If you want to experiment with the ngram spellcheck feature, you'll need to create a folder with an en subfolder in it and extract the ngram data files to that en folder. For example, on my machine the ngram files are in C:\ngrams\en (e.g. C:\ngrams\en\1grams). As far as LanguageTool is concerned, the parent folder (C:\ngrams here, not the en subfolder) is the ngram folder that you'll need to specify via ngramIndexDir.
Note also that you'll need to replace backslashes in folder names with slashes or write the backslash twice. For example:
Code:
"ngramIndexDir": "C:/ngrams",
Code:
"ngramIndexDir": "C:\\ngrams",
Last edited by Doitsu; 07-12-2016 at 07:07 AM. Reason: New version attached
|||
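The backslash rule in post #20 is ordinary JSON string escaping, so one way to avoid getting it wrong by hand is to let a script produce the settings. The sketch below is only an illustration: the key names and paths are the ones from the post, but `build_config` is a made-up helper, and where the plugin keeps its settings file is not stated in the thread, so the script just prints the JSON.

```python
import json
from pathlib import PureWindowsPath


def build_config(ngram_dir, lt_jar):
    """Build the settings shown in the post, with Windows paths
    converted to forward slashes so no escaping is needed."""
    return {
        "enabledOnly": True,
        "enabledRules": "CONFUSION_RULE",
        "ngramIndexDir": PureWindowsPath(ngram_dir).as_posix(),
        "ltPath": PureWindowsPath(lt_jar).as_posix(),
        "allFiles": True,
    }


if __name__ == "__main__":
    config = build_config(
        r"C:\ngrams",
        r"C:\Program Files\LanguageTool-3.3\languagetool-commandline.jar",
    )
    print(json.dumps(config, indent=2))
    # If the backslashes are kept instead, json.dumps doubles them itself:
    print(json.dumps({"ngramIndexDir": r"C:\ngrams"}))
```

Either form parses back to the same folder; the forward-slash version is just easier to read and to edit by hand later.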
#21 |
Obsessively Dedicated...
Posts: 3,221
Karma: 35037583
Join Date: May 2011
Location: PA {back in the usa!}
Device: Sony PRS-T2, ADE on PC
|
Well, after all the advanced technical discussions, this post is a bit like a mouse screaming at a lion, but here is a short list of frequent OCR errors I have come across. There are many more I have never noted down, but just fixed on the fly.
Maybe more folks can share their "little lists" for the edification of us all. Some of these will be caught with spell-check, but not all, by any means ... OCR VILLAINS: Spoiler:
#22 | |||||||||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
Quote:
Most of what I have directly on hand is the actual book typos I have come across over the years. I stopped writing down OCR errors many years ago, and now could probably only gather them with code comparisons between EPUB versions as I worked on them.
Quote:
Quote:
There probably aren't many actual words in the book with a capital U in them, so they stick out like a sore thumb... especially if you sort the Spellcheck List by Case Sensitive Sort. Anything that starts with a lowercase letter and has an uppercase "U" in it is a mistake 99% of the time.
Side Note: That type of search is better in Calibre's Spellcheck because you can do a Case Sensitive Search.
Quote:
Side Note: I even caught this in quite a few InDesign files as well. This is an easy error to slip by even in purely digital files. Quote:
Search: \s[b-z]\s
Similarly, I run this one too to catch all capital letters that are by themselves that are not "A" or "I":
Search: \s[B-HJ-Z]\s
Those basic Regexes do miss the odd case of that occurring right next to an HTML tag though. So they would miss:
<p>B ob said to go outside!</p>
or:
<p><i>Then </i>S uzy told Bob to jump over the fence.</p>
But if the book is riddled with them, then I make sure to look much more closely (those typically get caught at other passes, or I just write up a custom Regex to catch that error).
Side Note: I don't use the capitals one too often because many of the books I work on have text along these lines: "Product C and Product D" + "Person X and Y".
Quote:
Search: ‘“
Replace: “‘
Search: ”’
Replace: ’”
Use those on a case-by-case basis though (don't just do a huge Replace All).
Side Note: Quotation marks typically require some scrutiny, because a huge number of actual book typos have crept in due to wrong nesting.
As a related note, I found that parentheses + brackets follow the same rules, and also have a relatively large number of nesting errors. This was an entire class of errors that I missed until I used Toxaris's "Dialogue Check" (pure Regex is not as good).
Quote:
Search: ‘(Em|em|Til|til|Tis|tis|Twas|twas)
Replace: ’\1
Related is the RIGHT single quote before shortened years:
Search: ‘([0-9])
Replace: ’\1
or the RIGHT single quote before + after the "n":
Rock ’n’ Roll
Last edited by Tex2002ans; 07-11-2016 at 05:52 AM.
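The lone-letter searches and the apostrophe replacements from post #22 can also be run outside an editor. The sketch below is a rough Python version, not the poster's workflow: the four patterns are the ones quoted in the post (with a capture group added so the letter's position can be reported), while the `tag_aware` option and all function names are assumptions added for this example to cover the "next to an HTML tag" case the post says the basic searches miss.

```python
import re

# Searches from the post; the parentheses are added only to capture the letter.
STRAY_LOWER = re.compile(r"\s([b-z])\s")
STRAY_UPPER = re.compile(r"\s([B-HJ-Z])\s")
LEADING_ELISION = re.compile("\u2018(Em|em|Til|til|Tis|tis|Twas|twas)")
SHORT_YEAR = re.compile("\u2018([0-9])")

# Assumption, not from the post: a lone letter directly after a closing ">".
STRAY_AFTER_TAG = re.compile(r">([b-zB-HJ-Z])\s")


def find_stray_letters(text, tag_aware=False):
    """Return (offset, letter) for each lone letter, in document order."""
    patterns = [STRAY_LOWER, STRAY_UPPER]
    if tag_aware:
        patterns.append(STRAY_AFTER_TAG)
    hits = []
    for pattern in patterns:
        for match in pattern.finditer(text):
            hits.append((match.start(1), match.group(1)))
    return sorted(hits)


def fix_leading_apostrophes(text):
    """Turn a LEFT single quote into a RIGHT one before the listed
    contractions and before digits (shortened years)."""
    text = LEADING_ELISION.sub("\u2019\\1", text)
    return SHORT_YEAR.sub("\u2019\\1", text)


if __name__ == "__main__":
    print(find_stray_letters("He went t o the store and B ob left."))
    print(find_stray_letters("<p>B ob said to go outside!</p>", tag_aware=True))
    print(fix_leading_apostrophes("\u2018Twas the summer of \u201869"))
```

As with the editor searches, the hits are candidates to review rather than fixes to apply blindly. The contraction pattern as posted would also match the start of a quoted word such as ‘Tissue, so a word boundary after the group may be worth adding.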
#23 |
Grand Sorcerer
Posts: 5,730
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
I had a look at the documentation for the Hunspell library, which appears to have been written by a programmer who does his taxes in binary, and found out that it's possible to add custom letter replacements to get better spelling suggestions.
Replacements need to be defined in the affix file (e.g. en_US.aff for US English), which is a plain text file that can be edited with a programmer's editor, e.g. Notepad++. The format is as follows:
Code:
REP {number of following entries}
REP {OLD} {NEW}
Code:
REP 94
REP nt n't
...
...
REP shun tion
REP shun sion
REP shun cion
Spoiler:
With this change in place, the first suggestion for "ahnost" is no longer stenost, but almost and the suggestion for "hke" is like instead of hike.
If you want to test my modified file:
1. Go to C:\Program Files\Sigil\hunspell_dictionaries
2. Create a backup copy of en_US.aff.
3. Overwrite en_US.aff with the attached version. (You'll need to confirm a system warning.)
Last edited by Doitsu; 07-22-2016 at 03:17 AM.
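For those who would rather add their own replacements than overwrite the file, the REP table can be extended with a few lines of Python. This is only a sketch under stated assumptions: it works on the affix file's text as a string (reading and writing the file with the encoding named in its SET line is left out), `add_rep_entries` is a made-up helper, and the pairs `hn -> lm` and `h -> li` are guesses at entries that would help with "ahnost" and "hke". They are not taken from the attached file, whose contents are not shown in the thread.

```python
def add_rep_entries(aff_text, new_pairs):
    """Append (old, new) pairs to the REP table and update its entry count."""
    lines = aff_text.splitlines()
    header_idx = None   # index of the "REP <count>" line
    last_idx = None     # index of the last "REP old new" line
    existing = []
    for i, line in enumerate(lines):
        parts = line.split()
        if not parts or parts[0] != "REP":
            continue
        if header_idx is None and len(parts) == 2 and parts[1].isdigit():
            header_idx = i
        elif len(parts) >= 3:
            existing.append((parts[1], parts[2]))
            last_idx = i

    # Skip pairs that are already present or repeated in new_pairs.
    to_add = []
    for pair in new_pairs:
        if pair not in existing and pair not in to_add:
            to_add.append(pair)
    if not to_add:
        return aff_text

    new_lines = [f"REP {old} {new}" for old, new in to_add]
    if header_idx is None:
        # No REP table yet: start one at the end of the file.
        lines += [f"REP {len(new_lines)}"] + new_lines
    else:
        lines[header_idx] = f"REP {len(existing) + len(new_lines)}"
        insert_at = (last_idx if last_idx is not None else header_idx) + 1
        lines[insert_at:insert_at] = new_lines
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "SET UTF-8\nREP 2\nREP nt n't\nREP shun tion\n"
    # Hypothetical OCR-oriented pairs, for illustration only.
    print(add_rep_entries(sample, [("hn", "lm"), ("h", "li")]))
```

Keeping the count on the header line in step with the number of entries is the part that is easy to forget when editing the table by hand.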
#24 |
frumious Bandersnatch
Posts: 7,549
Karma: 19500001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
DP (Distributed Proofreaders) has a list of words that a spell checker will not flag but that are most probably OCR errors (scannos), among them the infamous "arid" (for "and") and "modem" (for "modern"):
http://www.pgdp.net/c/faq/wordcheck-...ite_word_lists |
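Since these words pass a spell check, the only way to find them is to look for them by name. A minimal Python sketch of that idea follows. The two entries are the examples given above; a real run would load a fuller list such as the DP one, and the names `SCANNOS` and `flag_scannos` are invented for this example.

```python
import re
from collections import Counter

# Valid dictionary words that are often OCR misreads: suspect -> likely intent.
# Only the two examples from the post; extend from a proper scanno list.
SCANNOS = {"arid": "and", "modem": "modern"}


def flag_scannos(text, scannos=SCANNOS):
    """Return {suspect word: (likely intended word, occurrences)}."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(w for w in words if w in scannos)
    return {w: (scannos[w], n) for w, n in counts.items()}


if __name__ == "__main__":
    sample = "The modem reader, arid the publisher, agreed."
    for word, (intended, n) in flag_scannos(sample).items():
        print(f"{word!r} x{n}: did the page say {intended!r}?")
```

Every hit still needs a human look, because a book about deserts or dial-up networking will use both words legitimately.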
#25 |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
Thanks, Tex2002ans, #22. I'm afraid I haven't kept a record. As I remember, many of them were , instead of . and vice versa, and I instead of ! and vice versa. But many of them just shouldn't have been there at all.
The pdf originals from which the ePub files I used were made were of quite poor quality - though that's no excuse.
#26 | |||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
I tend to mark all of my files with [YYYY.MM.DD] and just save them as I go along. Therefore in the future, I could easily use code comparison tools on the EPUBs to see exactly what has changed between versions. Quote:
Side Note: Here are a few common OCR errors I ran into tonight:
o£ -> of
tbe -> the
lias -> has
Roman Numeral Problems with the "V" OCRing as "Y":
Chapter XY -> Chapter XV
Chapter Y -> Chapter V
Chapter XYI -> Chapter XVI
CHAPTER XXIY -> CHAPTER XXIV
CHAPTER XXYI -> CHAPTER XXVI
Punctuation Errors (em dash + hyphen):
—- -> —
-— -> —
You may also want to look out for hyphens followed by a space. This needs to be decided on a case-by-case basis, because many of these are valid. Example: "This is a one- or two-hyphen error." In many cases it is either a badly recognized soft hyphen (end of line or end of page), a speck of dust, or an actual OCR error.
You may also want to make a pass looking for <sup> or <sub> tags. Sometimes OCR just goes crazy and inserts these into the text.
Last edited by Tex2002ans; 07-15-2016 at 04:48 AM.
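The fixes listed in post #26 are regular enough to script. The Python sketch below covers the three word errors, the "Y for V" chapter numerals, and the em dash + hyphen pairs. The helper name `fix_ocr_errors` and the exact patterns are assumptions made for this example, not something from the post, and the hyphen-plus-space and <sup>/<sub> checks are left out because the post says those need case-by-case review.

```python
import re

# Word-level misreads listed in the post.
WORD_FIXES = {"o\u00a3": "of", "tbe": "the", "lias": "has"}
WORD_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(w) for w in WORD_FIXES) + r")(?!\w)"
)

# "Chapter" in any case, followed by a Roman numeral that may contain a stray Y.
CHAPTER = re.compile(r"\b((?i:chapter)\s+)([IVXLCY]+)\b")

# An em dash with a hyphen stuck to either side.
DASHES = re.compile("\u2014-|-\u2014")


def fix_ocr_errors(text):
    """Apply the word, chapter-numeral, and dash fixes in that order."""
    text = WORD_PATTERN.sub(lambda m: WORD_FIXES[m.group(1)], text)
    text = CHAPTER.sub(
        lambda m: m.group(1) + m.group(2).replace("Y", "V"), text
    )
    return DASHES.sub("\u2014", text)


if __name__ == "__main__":
    for line in ("CHAPTER XXIY", "o\u00a3 tbe house", "wait\u2014-what"):
        print(fix_ocr_errors(line))
```

The chapter rule only touches a run of numeral letters directly after the word "Chapter", so a heading like "Chapter Yes" is left alone. The word list is riskier ("lias" is a real, if rare, word), so it is worth diffing the result against the original before keeping it.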
#27 |
Scanning Services
Posts: 2
Karma: 10
Join Date: May 2014
Location: Missouri
Device: multiple
|
When proofing against a scanned and converted Word doc, try bringing up an image-only PDF file on one half of the screen and the Word doc on the other. Then slowly go through, check the doc against the PDF, and apply corrections. When you're finished, have another pair of eyes do the same thing. That's how we do it. We call it corrective editing.
Stan
www.pdfdocument.com has more information for those who are interested.
#28 | |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
Quote:
|
#29 |
Gregg Bell
Posts: 2,266
Karma: 3917598
Join Date: Jan 2013
Location: Itasca, Illinois
Device: Kindle Touch 7, Sony PRS300, Fire HD8 Tablet
|
I'll second the vote for Balabolka. And I don't know if it was mentioned or not, but when the word is spoken aloud the text for that word is also highlighted.
Now I use Linux, and there is a similar program to Balabolka named eSpeak. I use the LibreOffice spell checker, but I also find it helpful to borrow a Windows computer and use the Word spell checker. (I find that the Word spell (and grammar) checker catches things LibreOffice doesn't, like:)
John went to the store fro a gallon of milk.
Last edited by Gregg Bell; 09-25-2016 at 03:56 PM.
#30 | |
Grand Sorcerer
Posts: 5,730
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
If you check your sample sentence with it, you'll get the following error message: Did you mean "for" or "from"? |