#1 |
Enthusiast
Posts: 37
Karma: 10
Join Date: Oct 2010
Device: ipad
|
words problems in PDF
When I copy a word from a PDF and paste it into Microsoft Word, something strange happens: the word becomes a question mark or a square.
Or if I convert some PDFs to Microsoft Word, the same thing always happens. What is wrong? And how can I find a PDF's encoding? Does this work? |
#2 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Have you noticed any patterns for when it happens? Is it for certain words, like those with ligatures (typically ff, ffl, ffi, fi and fl)? Or special symbols?
Does it happen with every PDF viewer, or just some? Have you tried direct conversion (e.g., calibre for PDF to RTF)? |
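If ligatures turn out to be the culprit, the pasted text may contain single ligature characters (such as U+FB01 for "fi") that the target font cannot display. A minimal Python sketch, assuming the copied text really does contain those Unicode ligature characters rather than unmapped glyphs; the helper name `expand_ligatures` is made up for illustration:

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as the 'fi' ligature with plain letters."""
    # NFKC applies Unicode compatibility decomposition, then recomposes.
    return unicodedata.normalize("NFKC", text)

# "efficient file" written with the ffi (U+FB03) and fi (U+FB01) ligatures
sample = "e\ufb03cient \ufb01le"
cleaned = expand_ligatures(sample)
print(cleaned)
```

Note that NFKC also rewrites other compatibility characters (superscript digits, for example), so it is worth checking the result on text where those matter.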
#3 |
Enthusiast
Posts: 37
Karma: 10
Join Date: Oct 2010
Device: ipad
|
It happens only for some PDFs, but with a lot of different words. When you convert to any format, words fall apart into two, or two words bump together, or some letters become question marks. I run into so many different situations.
My only aim is to find out how I can preserve the quality of the text when converting a PDF to any other format. |
#4 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
You cannot.
PDF is not designed as an import format to be converted to another format. It is designed as an output format. PDFs are always created from other source documents; the thought has always been that if you wanted to change or convert one, you'd go back and make the changes in the source document. The fact that PDFs can ever be converted at all with usable results is somewhat surprising.
A PDF is designed to emulate a printed page, and its purpose is to look exactly the same for everyone who views it, much like a printed page would. Typically, a PDF contains only information about the exact placement of characters, images, and vectors on a page and nothing else. It does not even keep information from the original source document such as where one word begins and another ends, much less where paragraphs begin and end.
When you convert a PDF to another format, it is up to the artificial intelligence of the converter to try to "reconstruct" this information. There is no easy way to make this work well in every case. You get spaces between letters, or places where words bump together, when the converter misjudges where the word boundaries are. You get boxes or wrong characters when the glyphs in the PDF's fonts don't match the glyphs in the fonts you are converting into, or when the fonts' character encodings differ.
You really can't expect perfection here; you're working against the grain, trying to make PDF do what it was never designed to do. I'm afraid you're going to have to live with that, or else learn a lot of programming and try to write a superior artificial intelligence algorithm. Nevertheless, if you don't have access to the source document, converting is still often preferable to retyping everything from scratch, but do be prepared for a fair amount of manual fixing.
Last edited by frabjous; 12-09-2010 at 07:57 AM. |
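To make the word-boundary guessing concrete, here is a rough Python sketch of the kind of heuristic a converter might use. It is not taken from any particular converter: the `Glyph` record, the `glyphs_to_text` function, and the `gap_ratio` cutoff are all assumptions made for illustration. All it has to work with is where each character sits on the line.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # advance width of the glyph

def glyphs_to_text(glyphs, gap_ratio=0.3):
    """Join the glyphs of one line, inserting a space wherever the gap is wide.

    gap_ratio is an assumed tuning value: a gap wider than this fraction
    of the average glyph width is treated as a word boundary.
    """
    if not glyphs:
        return ""
    glyphs = sorted(glyphs, key=lambda g: g.x)
    avg_width = sum(g.width for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)

# "to be" set normally: a gap of 3 units between the two words.
normal = [Glyph("t", 0, 5), Glyph("o", 5, 5), Glyph("b", 13, 5), Glyph("e", 18, 5)]
# The same letters letter-spaced, as in a title: every gap is 2 or 3 units.
tracked = [Glyph("t", 0, 5), Glyph("o", 7, 5), Glyph("b", 15, 5), Glyph("e", 22, 5)]

print(glyphs_to_text(normal))                 # to be
print(glyphs_to_text(normal, gap_ratio=1.0))  # tobe
print(glyphs_to_text(tracked))                # t o b e
```

The same positions give a correct result, bumped-together words, or a word broken into letters depending on the cutoff and the spacing, which is why no single setting works for every PDF.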
#5 |
Wizard
Posts: 1,613
Karma: 6718541
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
|
The converter "decoding" the PDF isn't really to blame. The root cause is in the app that built the PDF. The breaks frequently occur when the original document has some non-default letter spacing (kerning) in a word. The PDF generator will often break the word in two so that it can reposition the character following the custom kerning. The poor app trying to make sense of the PDF encounters part of the word in one data block and the rest in another. It's difficult for it to recognize that the two need to be reassembled without a space or hard return.
|
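One place this shows up is the PDF `TJ` text operator, which takes an array of string pieces separated by numeric position adjustments (in thousandths of a unit of text space, subtracted from the position, so a negative number widens the gap). The Python sketch below shows one way a reader might reassemble such an array. The function name `tj_to_text` and the `space_threshold` cutoff are assumptions for illustration, not values from the PDF specification or any real tool.

```python
def tj_to_text(items, space_threshold=200):
    """Rebuild text from a TJ-style array of strings and numeric adjustments.

    Small adjustments are treated as kerning and ignored; an adjustment that
    widens the gap by more than space_threshold (an assumed cutoff) is
    treated as a word space.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item > space_threshold and parts and not parts[-1].endswith(" "):
            parts.append(" ")
    return "".join(parts)

# Pair kerning: positive numbers pull the next piece closer, so no space.
kerned = ["A", 80, "V", 60, "ATAR"]
# A large negative number pushes the next piece away, acting as a word space.
spaced = ["PDF", -280, "file"]

print(tj_to_text(kerned))  # AVATAR
print(tj_to_text(spaced))  # PDF file
```

A reader that treats every piece as a separate word would produce "A V ATAR" here, which is the split-word symptom described in this thread.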
#6 | |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Quote:
Is the default to use pair kerning, or not to use pair kerning? Using it is typographically superior, but since the exact shapes of the glyphs are not stored in the PDF, use of pair kerning probably makes it more difficult for a converter to do its job properly. But that doesn't mean there's a flaw in the software that creates the PDF; quite the contrary!
Moreover, someone might alter the kerning for a variety of typographical reasons. Kerning is often greater in titles, for example. None of this is anyone's "fault" or points to a flaw in the software. Indeed, software that is able to do this is superior in my mind to software that cannot.
If someone begins arguing that people ought to forget about quality typography for the sake of convenience in wrong-way conversions, well, let me just register my strong disagreement. As much as I like mobile devices for reading, the typography is usually much worse unless possibly you're reading a PDF. Let's not advocate moving software away from the possibility of providing one of its best features!
(To be clear, I'm not accusing you of arguing for that. It's just the kind of thing I've heard people suggest before...)
Last edited by frabjous; 12-09-2010 at 08:39 AM. |
|
#7 |
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
|
The way I get round this problem after it has happened with Adobe Reader is to open the file in Foxit Phantom, select the text icon, select all with Ctrl+A, copy with Ctrl+C, and paste into Notepad.
Then I open the Notepad text file with Word and use VBA macros to clean up the text. As suggested earlier, you could also try different PDF readers. |
#8 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Another way to handle the problem is to OCR the PDF. ABBYY has a tool for that, or for $10 they have a tool that will capture an image off the screen and OCR it.
Dale |