words problems in PDF

yuxi_kelly · 12-07-2010, 01:38 AM

When I copy a word from a PDF and paste it into Microsoft Word, something strange happens: the word becomes a question mark or a square.

Or if I convert some PDFs to Microsoft Word, the same thing always happens.

What is wrong?
And how can I find a PDF's encoding? Would that help?

frabjous · 12-07-2010, 11:40 AM

Have you noticed any patterns for when it happens? Is it for certain words, like those with ligatures (typically ff, ffl, ffi, fi and fl)? Or special symbols?

Does it happen with every PDF viewer, or just some? Have you tried direct conversion (e.g., calibre for PDF to RTF)?
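
If ligatures turn out to be the pattern, one possible cleanup is Unicode compatibility normalization. This is only a minimal sketch in Python, and it assumes the ligatures survived the copy as real Unicode ligature characters (U+FB00 to U+FB06) rather than as question marks or boxes. The function name `expand_ligatures` is made up for this example.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as the 'fi' ligature (U+FB01)
    with their plain-letter equivalents using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

# "e" + ffi ligature + "cient", then fi ligature + "le"
sample = "e\ufb03cient \ufb01le"
print(expand_ligatures(sample))  # efficient file
```

Be aware that NFKC also rewrites other compatibility characters (superscript digits and non-breaking spaces, for instance), so it may change more than just the ligatures.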

yuxi_kelly · 12-09-2010, 02:54 AM

It happens only for some PDFs, and with a lot of different words. When you convert them to any format, words fall apart into two, or two words bump together, or some letters become question marks. I have run into many different situations.

My only goal is this: how can I ensure the quality of the text when converting a PDF to any other format?

frabjous · 12-09-2010, 07:54 AM

You cannot.

PDF is not designed as an import format to be converted to another format. It is designed as an output format. PDFs are always created from other source documents; the thought has always been that if you wanted to change it or convert it, you'd go back and make changes to the source document. The fact that PDFs can ever be converted at all with useable results is somewhat surprising.

A PDF is designed to emulate a printed page and the purpose of a PDF is to look exactly the same for everyone who views it, much like a printed page would. Typically, a PDF only contains information about the exact placement of characters, images, and vectors on a page and nothing else. It does not even maintain the information from the original source document such as where one word begins and another ends, much less where paragraphs begin and end. When you convert a PDF to another format, it is up to the artificial intelligence of the converter to try to "reconstruct" this information. There is no easy way to make this work well in every case.
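
To give a feel for what that "reconstruction" involves, here is a rough Python sketch of one common kind of heuristic: guessing word boundaries from the horizontal gaps between positioned glyphs. Everything in it is an assumption for illustration (the glyph tuples, the name `rebuild_line`, and the `gap_ratio` value); real converters are far more elaborate.

```python
from typing import List, Tuple

# Each glyph: (character, x position of its left edge, advance width),
# all in the same units. These tuples are invented for illustration.
Glyph = Tuple[str, float, float]

def rebuild_line(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Guess word boundaries from horizontal gaps between glyphs."""
    if not glyphs:
        return ""
    glyphs = sorted(glyphs, key=lambda g: g[1])
    avg_width = sum(g[2] for g in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for prev, cur in zip(glyphs, glyphs[1:]):
        # Gap between the right edge of the previous glyph and this one
        gap = cur[1] - (prev[1] + prev[2])
        if gap > gap_ratio * avg_width:
            out.append(" ")  # wide gap: assume a word boundary
        out.append(cur[0])
    return "".join(out)

line = [("t", 0, 5), ("o", 5, 5), ("b", 13, 5), ("e", 18, 5)]
print(rebuild_line(line))  # to be
```

The weak point is the threshold: with loosely spaced text such as `[("h", 0, 5), ("i", 7, 5)]` the same function returns "h i", which is exactly the kind of split word described in this thread.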

You get spaces between letters, or places where words bump together, when the converter incorrectly reads where word boundaries are. You get boxes or wrong characters when the glyphs from its fonts don't match the glyphs from the fonts you are converting into, or there are differences in the fonts' character encodings.
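
If you want to see how badly a given extraction suffered from the character problem, one option is to scan the pasted or converted text for code points that usually signal a failed mapping. A minimal sketch in Python; the function name and the choice of categories are my own assumptions, not part of any converter.

```python
import unicodedata
from collections import Counter

def suspicious_characters(text: str) -> Counter:
    """Count characters that usually signal a failed glyph-to-character mapping."""
    found = Counter()
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are fine
        category = unicodedata.category(ch)
        # U+FFFD is the replacement character; Co = private use,
        # Cn = unassigned, Cc = control character
        if ch == "\ufffd" or category in ("Co", "Cn", "Cc"):
            found[f"U+{ord(ch):04X} ({category})"] += 1
    return found

print(suspicious_characters("of\ue001ce \ufffd"))
```

This cannot catch a literal "?" that replaced a letter, since that is indistinguishable from a real question mark, but a high count of private-use or replacement characters is a strong hint that the PDF's fonts use a nonstandard encoding.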

You really can't expect perfection here; you're working against the grain, trying to make PDF do what it was never designed to do. I'm afraid you're going to have to live with that, or else learn a lot of programming and try to write a superior artificial intelligence algorithm.

Nevertheless, if you don't have access to the source document, converting is still often preferable to retyping everything from scratch, but do be prepared for a fair amount of manual fixing.

dwig · 12-09-2010, 08:10 AM

Quote:

Originally Posted by frabjous

...
You get spaces between letters, or places where words bump together, when the converter incorrectly reads where word boundaries are. ...

The converter "decoding" the PDF isn't really to blame. The root cause is in the app that built the PDF. The breaks frequently occur when the original document has some non-default letter spacing (kerning) in a word. The PDF generator will often break the word in two so that it can reposition the character following the custom kerning. The poor app trying to make sense of the PDF encounters part of the word in one data block and the rest in another. It's difficult for it to recognize that the two need to be reassembled without a space or hard return.
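
To make this concrete: one way a kerned word can appear in a PDF content stream is as an array for the `TJ` operator, with string fragments separated by numeric position adjustments in thousandths of a text-space unit, where a negative number widens the gap. Below is a rough Python sketch of how a converter might glue such fragments back together. It assumes the array has already been parsed into a Python list, and the name `join_tj` and the threshold of 200 are arbitrary choices for illustration, not values taken from any real converter.

```python
from typing import List, Union

def join_tj(items: List[Union[str, float]], space_threshold: float = 200) -> str:
    """Join TJ-style fragments, inserting a space only for large gaps."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif -item > space_threshold:
            # A large negative adjustment pushes the next glyph far to the
            # right, so treat it as a word space.
            out.append(" ")
        # Small adjustments are ordinary kerning: no space is inserted.
    return "".join(out)

print(join_tj(["A", 120, "W", 80, "AY", -350, "home"]))  # AWAY home
```

A threshold that is too low splits kerned words apart, and one that is too high runs separate words together, which matches both symptoms reported above.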

frabjous · 12-09-2010, 08:27 AM

Quote:

Originally Posted by dwig

The converter "decoding" the PDF isn't really to blame. The root cause is in the app that built the PDF. The breaks frequently occur when the original document has some non-default letter spacing (kerning) in a word.

What do you mean by that? Most fonts have pair kerning information, that is, information about how the spacing between individual pairs of letters needs to be adjusted to accommodate their exact shapes -- this information is usually used by quality typesetting software, but is not used by default by web browsers or word processors like MS Word (though you can turn it on), nor by most ebook software for other formats, which is why they don't look as good.

Is the default to use pair kerning, or not to use pair kerning? Using it is typographically superior, but since the exact shapes of the glyphs are not stored in the PDF, use of pair kerning probably makes it more difficult for a converter to do its job properly. But that doesn't mean there's a flaw in the software that creates the PDF -- quite the contrary!

Moreover, someone might alter the kerning for a variety of typographical reasons. Kerning is often greater in titles, for example. None of this is anyone's "fault" or points to a flaw in the software. Indeed, software that is able to do this is superior in my mind to software that cannot.

If someone begins arguing that people ought to forget about quality typography for the sake of convenience in wrong-way conversions, well, let me just register my strong disagreement. As much as I like mobile devices for reading, the typography is usually much worse unless possibly you're reading a PDF. Let's not advocate moving software away from the possibility of providing one of its best features!

(To be clear, I'm not accusing you of arguing for that. It's just the kind of thing I've heard people suggest before...)

Faster · 02-18-2011, 04:02 PM

The way I get round this problem after it has happened with Adobe Reader is to open the file in Foxit Phantom, select the text icon, select all with Ctrl+A, copy with Ctrl+C, and paste into Notepad.
Then open the Notepad text file with Word and use VBA macros to modify the story.
As suggested earlier, you could try different PDF readers.
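
For anyone who would rather not write VBA, the same kind of cleanup on the pasted text can be sketched in Python instead (this is a substitute for the macros, not a translation of them). It assumes the text uses plain "\n" line endings and that blank lines mark paragraph breaks; `tidy_pasted_text` is a made-up name.

```python
import re

def tidy_pasted_text(raw: str) -> str:
    """Undo common line-wrapping damage in text pasted from a PDF."""
    # Rejoin words hyphenated across a line break: "conver-\nsion" -> "conversion"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Unwrap single newlines but keep blank lines as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

raw = "PDF conver-\nsion is  rarely\nperfect.\n\nExpect manual fixes."
print(tidy_pasted_text(raw))
```

Note that the first rule also removes genuine hyphens that happen to fall at a line end (a compound like "well-known" split across lines would become "wellknown"), so the result still needs proofreading.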

DaleDe · 02-18-2011, 04:17 PM

Another way to handle the problem is to OCR the PDF. ABBYY has a tool for that, or for $10 they have a tool that will capture an image off the screen and OCR it.

Dale

