MobileRead Forums - View Single Post

Tex2002ans · 07-21-2020, 03:38 PM

Quote:

Originally Posted by Shohreh

As a way to re-add them after a PDF was OCRed, do you know of a PDF reader for Windows that can find all words in italics?

[...]

Didn't you already say in Post #28 that you ran this PDF through FineReader?

FineReader should have carried over italics and other formatting for you.

Quote:

Originally Posted by Shohreh

How does pdftotext show italics or bold?

There's no difference in the output when using "-layout". No extra spaces I could use to find those.

pdftotext is plaintext only... as are most "get the text out of PDF" tools.
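If the fonts inside the PDF happen to carry usable style names, a font-aware export can sometimes surface italics where plaintext can't. Here's a rough sketch, assuming Poppler's pdftohtml is installed and that its "-xml" output wraps italic runs in <i> tags inside <text> elements (that's how I understand the output; check it against your own file). The function names are made up for illustration. On an OCRed or ClearScan PDF the fonts are often custom-built, so this may find nothing at all.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_xml(pdf_path):
    """Run Poppler's pdftohtml and return its XML output as bytes."""
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        capture_output=True,
        check=True,
    )
    return result.stdout


def italic_runs(xml_data):
    """Yield (page_number, text) for every <i>...</i> run in the XML.

    Assumes the layout <page number="N"><text ...><i>words</i></text></page>.
    """
    root = ET.fromstring(xml_data)
    for page in root.iter("page"):
        number = int(page.get("number", "0"))
        for text_el in page.iter("text"):
            # iter("i") also finds italics nested inside <b>...</b>.
            for italic in text_el.iter("i"):
                run = "".join(italic.itertext()).strip()
                if run:
                    yield number, run


if __name__ == "__main__":
    import sys

    # Usage: python find_italics.py book.pdf
    for page_number, run in italic_runs(pdf_to_xml(sys.argv[1])):
        print(f"p.{page_number}: {run}")
```

Treat the hits as a to-do list for re-adding italics by hand, not as something to trust blindly.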

Again, there's a reason why PDF is the absolute worst input format. I even wrote a lot about this back in 2013: "Best way to copy text from a PDF or MOBI?".

Quote:

Originally Posted by Shohreh

That's why it looked like scanned pages, but the text is still selectable like text PDF.

PDFs potentially have two layers:

Frontend
- Bitmap/Image
  - Like a scanned document.
  - You can zoom in and see speckles/defects/lower-resolution.
- Vector
  - Like a purely digital document (DOCX, InDesign, [...]).
  - You can zoom in and the text/graphics are perfectly crisp.
Backend (Text) (Optional)
- This is the invisible layer you search/copy/paste from.
- An OCR program is going to create this.
- Note: There is also such a thing as a "Tagged PDF" file, which does carry over information like headings/italics/bold, but people rarely create these (let alone tag them properly).
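To get a rough idea of which layers a given PDF has, you could survey each page for images and for extractable text. This is only a crude sketch, assuming the third-party PyMuPDF library (imported as fitz) is installed; the function names and labels are my own invention. Note that a ClearScan-style file will land in the "text only" bucket even though it started life as a scan, so it can't tell you the nature of the document.

```python
def classify_page(has_images, has_text):
    """Map two yes/no observations about a page to a rough label."""
    if has_images and has_text:
        return "image + text (possibly a scan with an OCR layer)"
    if has_images:
        return "image only (scan without a text layer)"
    if has_text:
        return "text only (digital, or a ClearScan-style conversion)"
    return "empty"


def survey(pdf_path):
    """Yield (page_number, label) for each page of the PDF."""
    import fitz  # PyMuPDF; assumed to be installed

    with fitz.open(pdf_path) as doc:
        for page in doc:
            has_images = bool(page.get_images())
            has_text = bool(page.get_text().strip())
            # page.number is zero-based, so add 1 for display.
            yield page.number + 1, classify_page(has_images, has_text)


if __name__ == "__main__":
    import sys

    # Usage: python survey_layers.py book.pdf
    for number, label in survey(sys.argv[1]):
        print(f"p.{number}: {label}")
```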

Adobe's ClearScan only messes with that Frontend layer. It takes a Bitmap/Scanned image, then creates "custom fonts" based on the shapes themselves.

So you might have dozens of scanned 'g'-looking shapes:

https://blogs.adobe.com/acrolaw/file...law/003b_G.GIF

It will replace every "scanned g" with a "digital g":

https://blogs.adobe.com/acrolaw/file...law/003a_G.GIF

Next, it'll run across a tilted g (italics), etc. It does this for thousands of unique shapes, and assigns them to digital/vector fonts.

This is why I said it's still a scanned document. It doesn't change the nature of the PDF. It looks digital, like a purely vector document, but it isn't.

In many cases, it's even worse than just having the original scan, because ClearScan may botch the document in ways you wouldn't expect.

Here's an 'm' scanned at 300dpi, then run through ClearScan:

https://blogs.adobe.com/acrolaw/file..._300_dpi_m.PNG

Potential distortions add up, and other serious errors can creep in.

I don't have a ClearScan document on hand (and I don't use Adobe Acrobat), but here's an example of what I'm talking about when scan->digital goes awry:

[Image: Scanned.to.Digital.Distortions.png]

You can see:

- "of" is squished
- the em dashes plus period were inconsistently recognized
- weird random bolding
- and kerning is especially awful around italics