#1 |
Addict
Posts: 219
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
|
Hello,
With some PDFs, I get this kind of garbage when copy-pasting:
Code:
�������������������������������������������������������� ���������������������������������������������������������������������������������� ��������������������������������������������������������������������������������� ����������������������������������������������������������������������� ���������������������������������������������������� Code:
Fonts:
Bauhaus93 (Type1; embedded)
Calibri (Type1; embedded)
Calibri,Italic (TrueType (CID); Identity-H)
Calibri-Bold (Type1; embedded)
Calibri-Bold-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
Calibri-BoldItalic-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
Calibri-Italic (Type1; embedded)
Calibri-Italic-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
Calibri-KSCms-UHC-H (Type1 (CID); Identity-H; embedded)
NirmalaUI-Bold (Type1; embedded)
Thank you.
|
#2 |
Still reading
Posts: 14,671
Karma: 109269703
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
OCR the image?
|
#3 |
Addict
Posts: 219
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
|
I thought about it, but first I'd like to 1) understand what the problem is and 2) check whether the PDF can be doctored to solve the problem at the root (change fonts?).
|
#4 |
Fanatic
Posts: 531
Karma: 2268308
Join Date: Nov 2015
Device: none
|
Most PDF tools cannot work with identity-encoded fonts. I found the PDFMiner Python package can.
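A minimal sketch of what the PDFMiner route might look like, assuming the maintained fork `pdfminer.six` is installed (`pip install pdfminer.six`). `extract_text` is its high-level helper; `garbled_fraction` is a hypothetical helper added here to check whether the result is still mostly U+FFFD. Whether this recovers usable text from any particular PDF is not guaranteed.

```python
def extract_pdf_text(path):
    """Return the text layer of a PDF using pdfminer.six (assumed installed)."""
    # Imported lazily so garbled_fraction() below works without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def garbled_fraction(text):
    """Fraction of non-whitespace characters that are U+FFFD (hypothetical helper)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == "\ufffd" for ch in visible) / len(visible)
```

Calling `garbled_fraction(extract_pdf_text("some.pdf"))` (placeholder file name) gives a value near 1.0 if the extraction is still garbage and near 0.0 if real text came out.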
|
#5 |
Still reading
Posts: 14,671
Karma: 109269703
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
Also try exporting with Ghostscript (or the Ghostview GUI for it).
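For reference, a sketch of the Ghostscript route, written as a small Python wrapper so the command line is explicit. It assumes a `gs` executable is on the PATH (on Windows the console binary is usually named `gswin64c`) and uses Ghostscript's `txtwrite` text-extraction device. The file names are placeholders, and nobody in this thread has confirmed that this gets better text out of the problem PDF.

```python
import subprocess


def build_gs_command(pdf_path, txt_path, gs_binary="gs"):
    """Build a Ghostscript command line that writes the PDF's text to a file."""
    return [
        gs_binary,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # don't pause after each page
        "-dBATCH",            # exit when processing is done
        "-sDEVICE=txtwrite",  # text-extraction output device
        f"-sOutputFile={txt_path}",
        pdf_path,
    ]


def pdf_to_text(pdf_path, txt_path, gs_binary="gs"):
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_gs_command(pdf_path, txt_path, gs_binary), check=True)
```

`pdf_to_text("in.pdf", "out.txt")` would then write whatever text Ghostscript can recover to `out.txt`.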
|
#6 |
Addict
Posts: 219
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
|
"identity-encoded fonts": What's that?
Before I investigate, would you have the commands handy? |
#7 |
Fuzzball, the purple cat
Posts: 1,306
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Can you attach a page or two of the PDF that gives you the "garbage" cut and paste result?
|
#8 |
Addict
Posts: 219
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
|
Here's one:
|
#9 |
Wizard
Posts: 1,643
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Out of curiosity I ran it through gImageReader, and it found and exported the text correctly...
|
#10 |
Addict
Posts: 219
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
|
Why doesn't SumatraPDF display it correctly?
|
#11 |
Wizard
Posts: 1,643
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Quote:
|
|
#12 |
Fuzzball, the purple cat
Posts: 1,306
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Thanks for posting an example PDF. Do you have any software that correctly copies and pastes the text from your sample (garbage.pdf)? I tried:
1. Loading the PDF directly into MS Word
2. Extracting the text with k2pdfopt
3. Copying and pasting from SumatraPDF v3.5.2
4. Copying and pasting from Adobe Reader
5. Copying and pasting from ABBYY FineReader v16
All of them showed the same thing: basically the UTF-8 bytes 0xEF 0xBF 0xBD repeated over and over. I think the PDF itself is likely encoded incorrectly.
-Will
PS. gImageReader is a Tesseract OCR front-end. I don't believe it is extracting the text layer from the PDF. I think it's doing OCR on the sample to get the text.
Last edited by willus; 09-21-2025 at 03:31 PM.
|
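For what it's worth, the byte sequence 0xEF 0xBF 0xBD is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which software emits when it has no valid character to output. A quick check in Python (the `pasted` string is made up for illustration, not copied from the sample):

```python
import unicodedata

raw = bytes([0xEF, 0xBF, 0xBD])
char = raw.decode("utf-8")

# Show the code point and its official Unicode name.
print(f"U+{ord(char):04X}", unicodedata.name(char))

# A pasted run of replacement characters encodes to the same three bytes repeated.
pasted = "\ufffd" * 4  # hypothetical clipboard content
print(pasted.encode("utf-8").hex(" "))
```

In other words, the pasted text itself carries no information about the original letters; each one had already been replaced by the time it was copied.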
#13 |
frumious Bandersnatch
Posts: 7,564
Karma: 20150435
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I think the font is intentionally garbled, possibly to make copy-paste impossible or very inconvenient. With some patience and a hex editor, it may be possible to find a one-to-one equivalence to characters.
Here's what pdftotext and fontforge give for the text and font. |
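If such a one-to-one equivalence exists, applying it is the easy part. A sketch in Python, where the glyph codes and the mapping are entirely made up for illustration; the real table would have to be worked out by hand from the embedded font, and the codes would presumably have to come from the PDF's content stream rather than from the clipboard, which here only holds U+FFFD.

```python
# Hypothetical mapping from the font's internal glyph codes to real characters.
# These values are invented; they are NOT taken from the sample PDF.
GLYPH_TO_CHAR = {
    0x0003: " ",
    0x0024: "A",
    0x0025: "B",
    0x0026: "C",
}


def decode_glyphs(codes, table=GLYPH_TO_CHAR, unknown="\ufffd"):
    """Translate a sequence of glyph codes to text, marking unmapped codes."""
    return "".join(table.get(code, unknown) for code in codes)


print(decode_glyphs([0x0024, 0x0025, 0x0003, 0x0026]))  # AB C
```

Codes missing from the table come out as U+FFFD, which makes it easy to see which glyphs still need to be identified.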
#14 |
Addict
Posts: 219
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
|
Thanks. I had the same problem recently with a PDF from a different source. I opened both in SumatraPDF, since it's my default PDF/EPUB reader.
|