Quote:
Originally Posted by Shohreh
As a way to re-add them after a PDF was OCRed, do you know of a PDF reader for Windows that can find all words in italics?
[...]
|
Didn't you already say in Post #28 that you ran this PDF through Finereader?
Finereader should have carried over italics and other formatting for you.
Quote:
Originally Posted by Shohreh
How does pdftotext show italics or bold?
There's no difference in the output when using "-layout". No extras spaces I could use to find those.
|
pdftotext is plaintext only... as are most "get the text out of PDF" tools.
Again, there's a reason why PDF is the absolute worst input format. I even wrote a lot about this back in 2013: "Best way to copy text from a PDF or MOBI?".
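If you'd rather hunt for italics programmatically, some PDF libraries expose the per-span font information that plaintext tools drop. Here's a rough Python sketch, assuming the third-party PyMuPDF package (imported as `fitz`) and its `page.get_text("dict")` layout, where each span carries a `flags` integer (bit value 2 for italic, 16 for bold) and a `font` name. The helper names `styled_spans` and `report` are made up for illustration. Big caveat: this only finds anything if the PDF's fonts actually declare themselves italic/bold, and OCRed PDFs may not, so treat the output as hints.

```python
# Bit values PyMuPDF documents for a span's "flags" field.
ITALIC = 2   # bit 1
BOLD = 16    # bit 4


def styled_spans(page_dict):
    """Yield (text, is_italic, is_bold) for italic or bold spans.

    page_dict is shaped like the result of PyMuPDF's page.get_text("dict").
    """
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                flags = span.get("flags", 0)
                font = span.get("font", "").lower()
                # Assumption: fall back on the font name when flags are unset.
                italic = bool(flags & ITALIC) or "italic" in font or "oblique" in font
                bold = bool(flags & BOLD) or "bold" in font
                text = span.get("text", "").strip()
                if text and (italic or bold):
                    yield text, italic, bold


def report(pdf_path):
    """Print page number and text of every italic/bold span in a PDF."""
    import fitz  # PyMuPDF; imported lazily so styled_spans works without it

    doc = fitz.open(pdf_path)
    try:
        for number, page in enumerate(doc, start=1):
            for text, italic, bold in styled_spans(page.get_text("dict")):
                style = "+".join(s for s, on in (("italic", italic), ("bold", bold)) if on)
                print(f"p{number} [{style}] {text}")
    finally:
        doc.close()
```

Calling `report("book.pdf")` (hypothetical filename) would list the styled spans page by page, which you could then use to re-add the italics by hand.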
Quote:
Originally Posted by Shohreh
That's why it looked like scanned pages, but the text is still selectable like text PDF.
|
PDFs potentially have two layers:
- Frontend
  - Bitmap/Image
    - Like a scanned document.
    - You can zoom in and see speckles/defects/lower-resolution.
  - Vector
    - Like a purely digital document (DOCX, InDesign, [...]).
    - You can zoom in and the text/graphics are perfectly crisp.
- Backend (Text) (Optional)
  - This is the invisible layer you search/copy/paste from.
  - An OCR program is going to create this.
  - Note: There is also such a thing as a "Tagged PDF" file, which does carry over information like headings/italics/bold, but it's rare that people even create these types (let alone tagged properly).
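To make the layer idea concrete, here's a crude Python sketch that guesses which combination a page has, going only by how many images and how many extractable characters it contains. The function name `classify_page` and its labels are made up for illustration, and `survey` assumes the third-party PyMuPDF package (`fitz`). It's only a heuristic: a digital page with an embedded photo would be mislabeled as a bitmap page with a text layer.

```python
def classify_page(n_images, n_text_chars):
    """Guess a page's layers from image and text counts (heuristic only)."""
    if n_images and n_text_chars:
        return "bitmap frontend + text backend"
    if n_images:
        return "bitmap frontend, no text backend"
    if n_text_chars:
        return "vector frontend + text backend"
    return "no images and no text"


def survey(pdf_path):
    """Print the guessed layer combination for every page of a PDF."""
    import fitz  # PyMuPDF; imported lazily so classify_page works without it

    doc = fitz.open(pdf_path)
    try:
        for number, page in enumerate(doc, start=1):
            n_images = len(page.get_images(full=True))  # images referenced by the page
            n_chars = len(page.get_text().strip())      # characters you could copy/paste
            print(number, classify_page(n_images, n_chars))
    finally:
        doc.close()
```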
Adobe's ClearScan only messes with that Frontend layer. It takes a Bitmap/Scanned image, then creates "custom fonts" based on the shapes themselves.
So you might have dozens of scanned 'g'-looking shapes:
https://blogs.adobe.com/acrolaw/file...law/003b_G.GIF
It will replace every "scanned g" with a "digital g":
https://blogs.adobe.com/acrolaw/file...law/003a_G.GIF
Next, it'll run across a tilted g (italics), etc. It does this for thousands of unique shapes, and assigns them to digital/vector fonts.
This is why I said it's still a scanned document. It doesn't change the nature of the PDF. It looks digital, like a purely vector document, but it isn't.
In many cases, it's even worse than just having the original scan, because ClearScan may botch the document more badly than you'd expect.
Here's an 'm' scanned at 300dpi, then run through ClearScan:
https://blogs.adobe.com/acrolaw/file..._300_dpi_m.PNG
Potential distortions add up, and other serious errors might creep in.
I don't have a ClearScan document on hand (and I don't use Adobe Acrobat), but here's an example of what I'm talking about when scan->digital goes awry:
You can see:
- "of" is squished
- the em dashes plus period were inconsistently recognized
- weird random bolding
- and kerning is especially awful around italics