Fixing garbage when pasting from PDF?

Shohreh · 09-13-2025, 07:06 AM

Hello,

With some PDFs, I get this kind of garbage when copy-pasting:

Code:

��������������������������������������������������������
����������������������������������������������������������������������������������
���������������������������������������������������������������������������������
�����������������������������������������������������������������������
����������������������������������������������������

FWIW, here are the fonts used by one such PDF:

Code:

Fonts: Bauhaus93 (Type1; embedded)
Calibri (Type1; embedded)
Calibri,Italic (TrueType (CID); Identity-H)
Calibri-Bold (Type1; embedded)
Calibri-Bold-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
Calibri-BoldItalic-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
Calibri-Italic (Type1; embedded)
Calibri-Italic-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
Calibri-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
NirmalaUI-Bold (Type1; embedded)

Do you know of a fix?

Thank you.

Quoth · 09-13-2025, 02:28 PM

OCR the image?

Shohreh · 09-13-2025, 05:40 PM

I thought about it, but first I'd like to 1) understand what the problem is and 2) check whether the PDF can be doctored to solve the problem at the root (change fonts?)

Sarmat89 · 09-14-2025, 07:10 AM

Most PDF tools cannot work with identity-encoded fonts. I found the PDFMiner Python package can.
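
For context: Identity-H is a predefined PDF CMap that passes two-byte character codes straight through as CIDs (glyph selectors), so the codes carry no Unicode meaning of their own, and text extraction depends on the font also providing a ToUnicode table. Below is a minimal sketch of trying PDFMiner, assuming the maintained pdfminer.six fork is installed (`pip install pdfminer.six`). The helper `replacement_ratio` and the file name are made up for illustration, and nothing here guarantees a usable result for any particular PDF.

```python
def extract_with_pdfminer(pdf_path: str) -> str:
    """Return the text layer of a PDF as pdfminer.six sees it."""
    # Imported lazily so this file still loads where pdfminer.six is missing.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def replacement_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == "\ufffd" for ch in visible) / len(visible)


if __name__ == "__main__":
    import sys
    # Usage (the path is a placeholder): python check_pdf.py sample.pdf
    if len(sys.argv) == 2:
        text = extract_with_pdfminer(sys.argv[1])
        print(f"{replacement_ratio(text):.0%} replacement characters")
        print(text[:500])
```

A ratio near 100% would suggest the extractor had no Unicode mapping to work with. Extractors may also signal unmapped glyphs with other placeholders rather than U+FFFD, so it is worth looking at the printed sample as well as the number.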

Quoth · 09-14-2025, 07:42 AM

Also try exporting with Ghostscript (or its Ghostview GUI).
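
A rough sketch of the Ghostscript route, assuming Ghostscript is installed and on the PATH as `gs` (or `gswin64c` on Windows). It uses the `txtwrite` output device; the function names and file names are placeholders, and whether the output is readable depends on the PDF having usable Unicode mappings.

```python
import shutil
import subprocess


def build_gs_command(pdf_path: str, txt_path: str, gs_binary: str = "gs") -> list[str]:
    """Build a Ghostscript command that writes the PDF's text to txt_path."""
    return [
        gs_binary,
        "-q",                 # suppress the startup banner
        "-sDEVICE=txtwrite",  # text-extraction output device
        "-o", txt_path,       # output file; also implies -dBATCH -dNOPAUSE
        pdf_path,
    ]


def export_text(pdf_path: str, txt_path: str) -> bool:
    """Run Ghostscript if it can be found; return True on success."""
    gs_binary = shutil.which("gs") or shutil.which("gswin64c")
    if gs_binary is None:
        return False
    result = subprocess.run(build_gs_command(pdf_path, txt_path, gs_binary))
    return result.returncode == 0
```

The equivalent shell command would be `gs -q -sDEVICE=txtwrite -o out.txt in.pdf`.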

Shohreh · 09-15-2025, 07:13 AM

"identity-encoded fonts": What's that?

Before I investigate, would you have the commands handy?

willus · 09-15-2025, 09:50 PM

Can you attach a page or two of the PDF that gives you the "garbage" cut and paste result?

Shohreh · 09-16-2025, 09:12 AM

Here's one:

Karellen · 09-16-2025, 03:33 PM

Out of curiosity I ran it through gImageReader and it found and exported the text correctly...

Shohreh · 09-19-2025, 10:06 PM

Why doesn't SumatraPDF display it correctly?

Karellen · 09-20-2025, 12:41 AM

Quote:

Originally Posted by Shohreh

Why doesn't SumatraPDF display it correctly?

Sorry, no idea. I am not familiar with that software.

willus · 09-21-2025, 03:22 PM

Thanks for posting an example PDF. Do you have any software that correctly copies and pastes the text from your sample (garbage.pdf)? I tried:

1. Loading the PDF directly into MS Word
2. Extracting the text with k2pdfopt
3. Copying and pasting from SumatraPDF v3.5.2
4. Copying and pasting from Adobe Reader
5. Copying and pasting from ABBYY FineReader v16

All of them showed the same thing: the UTF-8 byte sequence 0xEF 0xBF 0xBD (U+FFFD, the Unicode replacement character), repeated over and over.
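
To see why that byte pattern carries no recoverable text, here is a small standard-library sketch. It decodes the three bytes and then shows that substituting the replacement character is lossy: different undecodable inputs collapse to the same output. The sample byte values `0x93` and `0xFE` are arbitrary and do not come from the PDF.

```python
import unicodedata

# The byte pattern reported above, repeated three times.
pasted = bytes([0xEF, 0xBF, 0xBD]) * 3

decoded = pasted.decode("utf-8")
names = {unicodedata.name(ch) for ch in decoded}
print(len(decoded), names)

# Two different invalid bytes both decode to the same placeholder,
# so the original value cannot be worked out from the pasted text.
a = b"\x93".decode("utf-8", errors="replace")
b = b"\xfe".decode("utf-8", errors="replace")
print(a == b == "\ufffd")
```

So the clipboard contents only say "no Unicode value was available for this glyph"; any repair has to start from the PDF itself, not from the pasted text.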

I think the PDF itself is likely encoded incorrectly.

-Will

PS. gImageReader is a Tesseract OCR front-end. I don't believe it is extracting the text layer from the PDF. I think it's doing OCR on the sample to get the text.

Jellby · 09-22-2025, 12:31 PM

I think the font is intentionally garbled, possibly to make copy-paste impossible or very inconvenient. With some patience and a hex editor, it may be possible to work out a one-to-one mapping from the font's codes to characters.
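
A sketch of what applying such a hand-built table could look like, assuming the font uses two-byte codes (as Identity-H fonts do) and that the table was worked out by comparing codes with the rendered glyphs. The codes and the table below are entirely hypothetical, not taken from the actual PDF, and the raw codes would have to come from the PDF's content streams rather than from the pasted text.

```python
# HYPOTHETICAL table: code -> character, built by hand for illustration only.
GLYPH_TO_CHAR = {0x0024: "H", 0x0048: "e", 0x004F: "l", 0x0052: "o"}


def decode_codes(raw: bytes, table: dict[int, str]) -> str:
    """Decode big-endian two-byte codes with a hand-built table."""
    if len(raw) % 2:
        raise ValueError("expected two bytes per code")
    out = []
    for i in range(0, len(raw), 2):
        code = int.from_bytes(raw[i:i + 2], "big")
        out.append(table.get(code, "\ufffd"))  # unmapped codes stay visible
    return "".join(out)


# Made-up sample string of five two-byte codes.
sample = bytes.fromhex("00240048004F004F0052")
print(decode_codes(sample, GLYPH_TO_CHAR))  # Hello
```

Leaving unmapped codes as U+FFFD makes it easy to see which entries are still missing from the table while it is being filled in.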

Here's what pdftotext and fontforge give for the text and font.

Shohreh · 09-23-2025, 06:58 AM

Thanks. I recently had the same problem with a PDF from a different source. I opened both in SumatraPDF, since it's my default PDF/EPUB reader.
