Old 07-21-2020, 03:38 PM   #37
Tex2002ans
Wizard
Quote:
Originally Posted by Shohreh View Post
As a way to re-add them after a PDF was OCRed, do you know of a PDF reader for Windows that can find all words in italics?

[...]
Didn't you already say in Post #28 that you ran this PDF through Finereader?

Finereader should have carried over italics and other formatting for you.

Quote:
Originally Posted by Shohreh View Post
How does pdftotext show italics or bold?

There's no difference in the output when using "-layout". No extras spaces I could use to find those.
pdftotext is plaintext only... as are most "get the text out of PDF" tools.

Again, there's a reason why PDF is the absolute worst input format. I even wrote a lot about this back in 2013: "Best way to copy text from a PDF or MOBI?".

Quote:
Originally Posted by Shohreh View Post
That's why it looked like scanned pages, but the text is still selectable like text PDF.
PDFs potentially have two layers:
  • Frontend
    • Bitmap/Image
      • Like a scanned document.
      • You can zoom in and see speckles/defects/lower-resolution.
    • Vector
      • Like a purely digital document (DOCX, InDesign, [...]).
      • You can zoom in and the text/graphics are perfectly crisp.
  • Backend (Text) (Optional)
    • This is the invisible layer you search/copy/paste from.
    • An OCR program is going to create this.
    • Note: There is also such a thing as a "Tagged PDF" file, which does carry over information like headings/italics/bold, but it's rare that people even create these types (let alone tagged properly).
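To make the layers concrete, here's a minimal Python sketch of that model. The class and field names (`Frontend`, `PdfPage`, `has_text_layer`, `is_tagged`) are made up for illustration; they don't come from any PDF library, and a real PDF is messier than three fields.

```python
from dataclasses import dataclass
from enum import Enum


class Frontend(Enum):
    BITMAP = "bitmap"  # scanned page image
    VECTOR = "vector"  # purely digital fonts and graphics


@dataclass
class PdfPage:
    frontend: Frontend        # what you see on screen
    has_text_layer: bool      # the optional, invisible backend
    is_tagged: bool = False   # rare: "Tagged PDF" structure information


def describe(page: PdfPage) -> str:
    """Summarize what you can expect to get out of a page (simplified)."""
    if not page.has_text_layer:
        return "no text layer: nothing to search or copy until it is OCRed"
    if page.is_tagged:
        return "tagged text: formatting may survive, if it was tagged properly"
    return "plain text layer: searchable, but plain-text tools drop italics/bold"


scan = PdfPage(Frontend.BITMAP, has_text_layer=False)
ocred_scan = PdfPage(Frontend.BITMAP, has_text_layer=True)

print(describe(scan))
print(describe(ocred_scan))
```

An OCRed scan is `PdfPage(Frontend.BITMAP, has_text_layer=True)` in this model: it still looks scanned, but you can search and copy from it.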

Adobe's ClearScan only messes with that Frontend layer. It takes a Bitmap/Scanned image, then creates "custom fonts" based on the shapes themselves.

So you might have dozens of scanned 'g'-looking shapes:

https://blogs.adobe.com/acrolaw/file...law/003b_G.GIF

It will replace every "scanned g" with a "digital g":

https://blogs.adobe.com/acrolaw/file...law/003a_G.GIF

Next, it'll run across a tilted g (italics), etc. It does this for thousands of unique shapes, and assigns them to digital/vector fonts.
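As a rough mental model of that replacement step, here's a toy Python sketch. To be clear, this is not Adobe's actual algorithm; the 4x4 bitmaps, the `max_diff` threshold, and the function names are all invented for illustration. It only groups look-alike shapes and lets each group share one glyph in a custom font.

```python
from __future__ import annotations


def hamming(a: str, b: str) -> int:
    """Count differing pixels between two equally sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def build_custom_font(shapes: list[str], max_diff: int = 2) -> tuple[list[str], list[int]]:
    """Greedy grouping: a shape reuses the first glyph within max_diff pixels,
    otherwise it becomes a new glyph in the (hypothetical) custom font."""
    glyphs: list[str] = []
    assignment: list[int] = []
    for shape in shapes:
        for index, glyph in enumerate(glyphs):
            if hamming(shape, glyph) <= max_diff:
                assignment.append(index)
                break
        else:
            # No existing glyph was close enough, so add a new one.
            glyphs.append(shape)
            assignment.append(len(glyphs) - 1)
    return glyphs, assignment


# Made-up 4x4 "scans" of a g, flattened row by row ('#' = ink).
clean_g = "####" + "#..#" + "####" + "...#"
speckled_g = "####" + "#..#" + "####" + "..##"  # one stray pixel
tilted_g = ".###" + "#..#" + "###." + "..#."    # stand-in for an italic g

shapes = [clean_g, speckled_g, tilted_g, clean_g]

glyphs, assignment = build_custom_font(shapes, max_diff=2)
print(len(glyphs), assignment)  # 2 [0, 0, 1, 0]

loose_glyphs, loose_assignment = build_custom_font(shapes, max_diff=4)
print(len(loose_glyphs), loose_assignment)  # 1 [0, 0, 0, 0]
```

With the stricter threshold, the speckled g and the clean g share one glyph while the tilted g keeps its own. Loosen the threshold and the tilted g gets folded into the upright one, which is the kind of merge that would lose an italic.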

This is why I said it's still a scanned document. It doesn't change the nature of the PDF. It looks digital, like a purely vector document, but it isn't.

In many cases, it's even worse than just having the original scan, because ClearScan can botch the document more than you'd expect.

Here's an 'm' scanned at 300dpi, then run through ClearScan:

https://blogs.adobe.com/acrolaw/file..._300_dpi_m.PNG

Potential distortions add up, and other serious errors can creep in.

I don't have a ClearScan document on hand (and I don't use Adobe Acrobat), but here's an example of what I'm talking about when scan->digital goes awry:

[Attached image: Scanned.to.Digital.Distortions.png]

You can see:
  • "of" is squished
  • the em dashes plus period were inconsistently recognized
  • weird random bolding
  • and kerning is especially awful around italics

Last edited by Tex2002ans; 07-21-2020 at 03:57 PM.