07-19-2020, 06:45 PM | #31 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
pdftotext did the job. Thanks!
|
07-20-2020, 01:23 PM | #32 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
As a way to re-add them after a PDF was OCRed, do you know of a PDF reader for Windows that can find all words in italics?
SumatraPDF: NO
Foxit Reader: NO
Acrobat Reader: NO
XpdfReader: NO |
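If no reader offers a "find italics" search, a script is one possible workaround. Below is a minimal sketch, assuming the third-party PyMuPDF library (imported as `fitz`) is installed and that the PDF's fonts actually carry style information. The function names `italic_spans` and `find_italics` are made up for illustration. On an OCRed or ClearScan PDF the text layer may use synthetic fonts with no italic flag, in which case this will likely find nothing.

```python
ITALIC_FLAG = 1 << 1  # PyMuPDF sets bit 1 of a span's "flags" for italic text


def italic_spans(page_dict):
    """Yield the text of each italic span in one page's "dict" extraction."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                by_flag = span.get("flags", 0) & ITALIC_FLAG
                # Fallback heuristic (an assumption): many italic fonts
                # say so in their name, e.g. "Times-Italic".
                name = span.get("font", "").lower()
                by_name = "italic" in name or "oblique" in name
                text = span.get("text", "").strip()
                if (by_flag or by_name) and text:
                    yield text


def find_italics(pdf_path):
    """Return (page_number, text) pairs for the italic spans in a PDF."""
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    hits = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            for text in italic_spans(page.get_text("dict")):
                hits.append((number, text))
    return hits
```

Calling `find_italics("book.pdf")` (a placeholder filename) would give a list you can search for in the OCRed text and re-mark by hand.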
07-20-2020, 05:36 PM | #33 |
null operator (he/him)
Posts: 20,457
Karma: 26645808
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Word or Writer can read some PDFs, and find strings with specific formats.
BR |
07-20-2020, 05:55 PM | #34 | |
Addict
Posts: 378
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Ubuntu, Jutoh,Kobo Forma
|
Quote:
|
|
07-21-2020, 09:05 AM | #35 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
How does pdftotext show italics or bold?
There's no difference in the output when using "-layout". No extra spaces I could use to find those. Last edited by Shohreh; 07-21-2020 at 09:08 AM. |
07-21-2020, 11:50 AM | #36 | |
Addict
Posts: 378
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Ubuntu, Jutoh,Kobo Forma
|
Quote:
And no italics or bold--text only. No magic, just another tool. |
|
07-21-2020, 03:38 PM | #37 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Finereader should have carried over italics and other formatting for you. Quote:
Again, there's a reason why PDF is the absolute worst input format. I even wrote a lot about this back in 2013: "Best way to copy text from a PDF or MOBI?". Quote:
Adobe's ClearScan only messes with that Frontend layer. It takes a Bitmap/Scanned image, then creates "custom fonts" based on the shapes themselves. So you might have dozens of scanned 'g'-looking shapes: https://blogs.adobe.com/acrolaw/file...law/003b_G.GIF
It will replace every "scanned g" with a "digital g": https://blogs.adobe.com/acrolaw/file...law/003a_G.GIF
Next, it'll run across a tilted g (italics), etc. It does this for thousands of unique shapes, and assigns them to digital/vector fonts.
This is why I said it's still a scanned document. ClearScan doesn't change the nature of the PDF. It looks digital, like a purely vector document, but it isn't. In many cases, it's even worse than just having the original scan, because ClearScan may botch the document even worse than expected. Here's an 'm' scanned at 300dpi, then run through ClearScan: https://blogs.adobe.com/acrolaw/file..._300_dpi_m.PNG
Potential distortions add up, and you might get other serious errors that creep in. I don't have a ClearScan document on hand (and I don't use Adobe Acrobat), but here's an example of what I'm talking about when scan->digital goes awry: You can see:
Last edited by Tex2002ans; 07-21-2020 at 03:57 PM. |
|||
07-24-2020, 03:38 AM | #38 | |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
Thanks much for the infos about layers.
Quote:
Turns out it's still a bit of work to…
|
|
07-24-2020, 03:49 PM | #39 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
What did you export as?
You should see italics/bold showing up in the right half of Finereader: the Left half should display the original document, and the Right half should show all the actual OCRed text. Did you select Document Layout: "Formatted Text"? In the dropdown, you can also select DOCX. (Personally, I keep everything on "Exact Copy" until I'm ready to export the document. This makes the Left/Right halves match much more closely, making it easier to make corrections.) Quote:
It's almost always better to re-OCR and work from scratch (see the PDF+OCR topics I previously linked to). Yes, exactly, which is why you want the computer doing that. Finereader does a better job than any other tool at carrying over this (along with superscript/subscript/tables, etc. etc.). Quote:
Still a lot of manual correction needs to be done though, and that's where you use some of the tricks I listed in Post #15. Spellcheck Lists are a fantastic way to catch/correct these things, along with Regex. Yep, that one's a pain, but there are methods. Finereader should detect all that, and if not, you adjust the recognition boxes. I explained some of this back in 2014: Post #5 in "Problems converting K2PDF Opt files to EPUB". As long as your headings are marked fine (<h1> <h2> <h3> ...), you regenerate that from Sigil. Last edited by Tex2002ans; 07-24-2020 at 03:58 PM. |
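To illustrate the Spellcheck List + Regex idea, here is a rough sketch of how such a list could be built outside of any editor. This is not how Sigil or Finereader do it. The function name `spellcheck_list` and the two patterns are assumptions chosen for the example, and they will produce false positives ("1st", "McDonald"), so treat the output as a list to review, not to auto-fix.

```python
import re
from collections import Counter

# A word: letters/digits, optionally with inner apostrophes ("don't").
WORD = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")

# Shapes that often come from misread characters (illustrative, not exhaustive).
SUSPICIOUS = [
    re.compile(r"[A-Za-z]\d|\d[A-Za-z]"),  # digit glued to letters: "c1ear", "0f"
    re.compile(r"[a-z][A-Z]"),             # capital in mid-word: "tHe"
]


def spellcheck_list(text):
    """Return (word, count) pairs for suspicious-looking words, rarest first."""
    counts = Counter(WORD.findall(text))
    flagged = [
        (word, n)
        for word, n in counts.items()
        if any(pattern.search(word) for pattern in SUSPICIOUS)
    ]
    return sorted(flagged, key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    sample = "The c1ear sky. The clear sky, tHe end 0f it."
    for word, n in spellcheck_list(sample):
        print(n, word)
```

Sorting by count puts one-off oddities at the top, which is usually where the OCR errors are. A word that shows up hundreds of times is more likely a real name or term.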
||
07-26-2020, 07:50 AM | #40 |
Member
Posts: 24
Karma: 4472
Join Date: Jan 2011
Device: Kindle
|
Recommended Sigil plugin: Epub Tidy Tool
The Sigil plugin Epub Tidy Tool does a decent job at fixing incorrect line breaks. If you install the text file "IncorrectWords.txt" provided by the author, it will also fix a lot of common OCR errors.
Best to use early in the process, before the thorough proofreading.
Other Tips:
- Think about what quality you want/need in the end. 80/20 applies to OCRing: you can spend way more than 80% of your time finding the last spelling or formatting errors that don't really make a big difference to the reader. For books that I might read more than once, I tend to go with a fairly rough first version, highlighting problems on my Kindle (and fixing them later in Sigil), then doing another iteration before reading the book again a few years later.
- Finereader works well for me. It's worth exploring the options; good settings (e.g. remove headers/footers) save a lot of fixing later.
- Think about what formatting you'd want to keep. OCR does a pretty lousy job if asked to preserve all formatting. You'll end up with lots of text boxes, italics, and superscripts that should not be there and make a mess out of conversions.
- For fiction with no footnotes and little or no bold and italic, you could even consider converting to .txt, formatting chapter headings in Word or LibreOffice, and fine-tuning the ToC and page breaks in Sigil after a conversion in Calibre. And be done in a few hours. |
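For anyone curious what "fixing incorrect line breaks" and a word list amount to on plain text, here is a small sketch. It is not the plugin's code, and the format of "IncorrectWords.txt" is not described in this thread, so the `corrections` dict below is a hypothetical stand-in. It also assumes paragraphs are separated by blank lines.

```python
import re


def fix_line_breaks(text):
    """Merge the lines of each paragraph; blank lines mark paragraph breaks."""
    fixed = []
    for para in re.split(r"\n\s*\n", text.strip()):
        # Rejoin words split by a hyphen at a line end: "stor-\nmy" -> "stormy".
        # Caveat: this also glues real compounds ("well-\nknown" -> "wellknown").
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        lines = (line.strip() for line in para.splitlines())
        fixed.append(" ".join(line for line in lines if line))
    return "\n\n".join(fixed)


def apply_corrections(text, corrections):
    """Replace whole words found in `corrections` (wrong -> right)."""
    if not corrections:
        return text
    # Longest first so a longer entry wins over a shorter one it contains.
    words = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda match: corrections[match.group(1)], text)


if __name__ == "__main__":
    raw = "It was a dark and stor-\nmy night,\nand tbe rain fell.\n\nNext para-\ngraph here."
    corrections = {"tbe": "the", "wbich": "which"}  # hypothetical entries
    print(apply_corrections(fix_line_breaks(raw), corrections))
```

Only put words in the list that are never legitimate. A blind replace of something that is also a real word will create new errors instead of fixing old ones.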
07-28-2020, 12:45 AM | #41 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Above, I shared information about Tesseract. These are the latest (1.7.2020) data I am using. Same for English.
Last edited by roger64; 07-28-2020 at 08:22 PM. Reason: English |
08-03-2020, 10:59 AM | #42 | |
Bookmaker & Cat Slave
Posts: 11,447
Karma: 157030631
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Take new Word file, cleaned-up, export to PDF, lather-rinse-repeat. Yes, it's tedious and all that, but it's a shedload less tedious than trying to find all the OCR errors yourself manually. Does it find everything? Oh, hells, no, but it's an option that most people overlook. Offered FWIW. Hitch |
|
08-07-2020, 07:35 AM | #43 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
Thanks for the infos.
|