Old 12-07-2010, 01:38 AM   #1
yuxi_kelly
Enthusiast
yuxi_kelly began at the beginning.
 
Posts: 37
Karma: 10
Join Date: Oct 2010
Device: ipad
words problems in PDF

When I copy a word from a PDF and paste it into Microsoft Word, something strange happens: the word becomes a question mark or a square.

If I convert some PDFs to Microsoft Word, the same thing always happens.

What is wrong?
And how can I find out a PDF's encoding? Would that help?
Old 12-07-2010, 11:40 AM   #2
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Have you noticed any patterns for when it happens? Is it for certain words, like those with ligatures (typically ff, ffl, ffi, fi and fl)? Or special symbols?

Does it happen with every PDF viewer, or just some? Have you tried direct conversion (e.g., calibre for PDF to RTF)?
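To check the ligature theory on a piece of pasted text, something like the following could be used. It is only a sketch: the function names are made up for illustration, and it only helps when the pasted text actually contains the Unicode ligature code points (U+FB00 to U+FB04). If the PDF's font carries no usable Unicode mapping, the paste shows boxes or question marks instead and this will not recover anything.

```python
# Unicode "presentation form" ligatures and their plain-letter spellings.
LIGATURE_MAP = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_LIGATURE_TABLE = str.maketrans(LIGATURE_MAP)


def find_ligatures(text):
    """Return the ligature characters found in text, in order of appearance."""
    return [ch for ch in text if ch in LIGATURE_MAP]


def expand_ligatures(text):
    """Replace each ligature character with its ordinary letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    pasted = "o\ufb03ce \ufb01le"  # "office file" with ffi and fi ligatures
    print(len(find_ligatures(pasted)), "ligature(s) found")
    print(expand_ligatures(pasted))
```

If `find_ligatures` returns nothing for a word that visibly breaks, the ligatures are probably not the cause for that PDF.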
Old 12-09-2010, 02:54 AM   #3
yuxi_kelly
Enthusiast
yuxi_kelly began at the beginning.
 
Posts: 37
Karma: 10
Join Date: Oct 2010
Device: ipad
It happens only for some PDFs, but with a lot of different words. When you convert the PDF to any other format, words split in two, or two words run together, or some letters become question marks. I have run into many different situations.

My only goal is this: how can I preserve the quality of the text when converting a PDF to another format?
Old 12-09-2010, 07:54 AM   #4
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
You cannot.

PDF is not designed as an import format to be converted to another format. It is designed as an output format. PDFs are always created from other source documents; the thought has always been that if you wanted to change it or convert it, you'd go back and make changes to the source document. The fact that PDFs can ever be converted at all with useable results is somewhat surprising.

A PDF is designed to emulate a printed page, and its purpose is to look exactly the same for everyone who views it, much like a printed page would. Typically, a PDF only contains information about the exact placement of characters, images, and vectors on a page and nothing else. It does not even maintain information from the original source document such as where one word begins and another ends, much less where paragraphs begin and end. When you convert a PDF to another format, it is up to the artificial intelligence of the converter to try to "reconstruct" this information. There is no easy way to make this work well in every case.
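To make the "reconstruction" concrete, here is a rough sketch of the kind of guess a converter has to make. The `Glyph` class, the `rebuild_line` function and the `gap_ratio` threshold are all hypothetical; real converters use more elaborate heuristics. The idea is simply that a space is inferred wherever the horizontal gap between two glyphs looks large.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page, in points
    width: float  # advance width of the glyph, in points


def rebuild_line(glyphs, gap_ratio=0.3):
    """Join positioned glyphs into text, guessing where the spaces are."""
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    avg_width = sum(g.width for g in ordered) / len(ordered)
    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Assumed rule: a gap wider than gap_ratio * average width is a space.
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# Normal spacing: only the gap between "o" and "b" is wide.
normal = [Glyph("t", 0, 5), Glyph("o", 5, 5), Glyph("b", 13, 5), Glyph("e", 18, 5)]
# Letter-spaced text: every gap is wide, so the guess goes wrong.
spaced = [Glyph("t", 0, 5), Glyph("o", 7, 5), Glyph("b", 14, 5), Glyph("e", 21, 5)]

if __name__ == "__main__":
    print(rebuild_line(normal))
    print(rebuild_line(spaced))
```

The second call shows the failure mode: the same threshold that finds the real space in the first line splits every letter apart in the letter-spaced one.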

You get spaces between letters, or places where words bump together, when the converter misjudges where the word boundaries are. You get boxes or wrong characters when the glyphs in the PDF's fonts don't match the glyphs in the fonts you are converting into, or when the fonts' character encodings differ.
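The encoding side of this can be illustrated with a toy example. In a PDF, text is stored as character codes that select glyphs in a font; getting readable text back requires a table from those codes to Unicode (the font's encoding, or a ToUnicode map when one is present). The tables and codes below are invented for illustration and `decode_codes` is a hypothetical helper, not part of any real library.

```python
REPLACEMENT = "\ufffd"  # the character usually shown as a box or question mark


def decode_codes(codes, to_unicode):
    """Translate font character codes to text; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Hypothetical subset font: codes were assigned in order of first use.
codes = [1, 2, 3, 3, 4]
complete_map = {1: "H", 2: "e", 3: "l", 4: "o"}
partial_map = {1: "H", 3: "l"}  # mapping for "e" and "o" is missing

if __name__ == "__main__":
    print(decode_codes(codes, complete_map))
    print(decode_codes(codes, partial_map))
```

With the complete table the codes come out as readable text; with the partial one, the missing entries turn into replacement characters even though the page still looks fine on screen, because drawing the glyphs never needed the Unicode table.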

You really can't expect perfection here; you're working against the grain, trying to make PDF do something it was never designed to do. I'm afraid you're going to have to live with that, or else learn a lot of programming and try to write a superior artificial intelligence algorithm.

Nevertheless, if you don't have access to the source document, converting is still often preferable to retyping everything from scratch, but do be prepared for a fair amount of manual fixing.

Last edited by frabjous; 12-09-2010 at 07:57 AM.
Old 12-09-2010, 08:10 AM   #5
dwig
Wizard
dwig ought to be getting tired of karma fortunes by now.
 
Posts: 1,613
Karma: 6718541
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
Quote:
Originally Posted by frabjous View Post
...
You get spaces between letters, or places where words bump together, when the converter incorrectly reads where word boundaries are. ...
The converter "decoding" the PDF isn't really to blame. The root cause is in the app that built the PDF. The breaks frequently occur when the original document has some non-default letter spacing (kerning) in a word. The PDF generator will often break the word in two so that it can reposition the character following the custom kerning. The poor app trying to make sense of the PDF encounters part of the word in one data block and the rest in another. It's difficult for it to recognize that the two need to be reassembled without a space or hard return.
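One common form this takes inside a PDF is an array of string pieces with position adjustments between them, given in thousandths of a text-space unit; the adjustment is subtracted from the current position, so a negative number widens the gap. The sketch below models such an array as a plain Python list. The function name and the threshold of -200 are assumptions chosen for illustration, not values from any particular converter.

```python
def join_pieces(items, space_threshold=-200):
    """Join string pieces, treating only large negative adjustments as spaces.

    items mixes strings and numbers, e.g. ["W", 80, "ord"].
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif item <= space_threshold:
            # A big rightward shift is assumed to be a word gap.
            parts.append(" ")
        # Smaller adjustments are assumed to be kerning and are ignored.
    return "".join(parts)


def naive_join(items):
    """What a careless converter does: a space between every piece."""
    return " ".join(item for item in items if isinstance(item, str))


if __name__ == "__main__":
    kerned = ["W", 80, "ord"]        # one word, split to tighten "Wo"
    two = ["two", -300, "words"]     # a real word gap
    print(naive_join(kerned), "|", join_pieces(kerned))
    print(naive_join(two), "|", join_pieces(two))
```

The naive version produces exactly the "word falls apart into two" symptom described in this thread; the threshold version fixes that case but will still guess wrong whenever a document uses unusually wide or narrow spacing.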
Old 12-09-2010, 08:27 AM   #6
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Quote:
Originally Posted by dwig View Post
The converter "decoding" the PDF isn't really to blame. The root cause is in the app that built the PDF. The breaks frequently occur when the original document has some non-default letter spacing (kerning) in a word.
What do you mean by that? Most fonts have pair kerning information, that is, information about how the spacing for individual pairs of letters needs to be adjusted to accommodate their exact shapes -- this information is usually used by quality typesetting software, but is not used by default by web browsers or word processors like MS Word (though you can turn it on), nor by most ebook software for other formats, which is why they don't look as good.

Is the default to use pair kerning, or not to use pair kerning? Using it is typographically superior, but since the exact shapes of the glyphs are not stored in the PDF, use of pair kerning probably makes it more difficult for a converter to do its job properly. But that doesn't mean there's a flaw in the software that creates the PDF -- quite the contrary!
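For anyone unfamiliar with pair kerning, here is a small sketch of how it changes glyph positions. The advance widths and kerning values are invented (in arbitrary font units) and `glyph_positions` is a hypothetical helper; real fonts carry their own tables.

```python
# Hypothetical metrics in font units.
ADVANCE = {"A": 600, "V": 600, "T": 550, "o": 500}
KERN = {("A", "V"): -80, ("T", "o"): -60}  # negative = pull the pair together


def glyph_positions(text, use_kerning=True):
    """Return the x position of each glyph in text."""
    x = 0
    positions = []
    for i, ch in enumerate(text):
        if use_kerning and i > 0:
            # Adjust for this specific pair of neighbouring letters.
            x += KERN.get((text[i - 1], ch), 0)
        positions.append(x)
        x += ADVANCE[ch]
    return positions


if __name__ == "__main__":
    print(glyph_positions("AV", use_kerning=False))
    print(glyph_positions("AV", use_kerning=True))
```

With kerning on, the second letter no longer sits at a position that follows from the first letter's width alone, which is why the generator has to place it explicitly and why a converter sees an adjustment in the middle of the word.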

Moreover, someone might alter the kerning for a variety of typographical reasons. Kerning is often greater in titles, for example. None of this is anyone's "fault" or points to a flaw in the software. Indeed, software that is able to do this is superior in my mind to software that cannot.

If someone begins arguing that people ought to forget about quality typography for the sake of convenience in wrong-way conversions, well, let me just register my strong disagreement. As much as I like mobile devices for reading, the typography is usually much worse unless possibly you're reading a PDF. Let's not advocate moving software away from the possibility of providing one of their best features!

(To be clear, I'm not accusing you of arguing for that. It's just the kind of thing I've heard people suggest before...)

Last edited by frabjous; 12-09-2010 at 08:39 AM.
Old 02-18-2011, 04:02 PM   #7
Faster
Connoisseur
Faster is a glorious beacon of light
 
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
The way I get round this problem after it has happened with Adobe Reader is to open the file in Foxit Phantom, select the text icon, select all with Ctrl+A, copy with Ctrl+C, and paste into Notepad.
Then open the Notepad text file with Word and use VBA macros to modify the story.
As suggested earlier, you could try different PDF readers.
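The post does not say what the VBA macros do, so the following is only a guess at a typical cleanup pass, written in Python rather than VBA. The function name is made up. It assumes the pasted text has a hard line break at the end of every printed line and a blank line between paragraphs.

```python
import re


def clean_pasted_text(raw):
    """Rejoin hyphenated words and unwrap hard line breaks within paragraphs."""
    # "conver-\nsion" -> "conversion" (also joins real compounds such as
    # "well-\nknown", so check the result by hand).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Blank lines separate paragraphs; keep those breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse line breaks and runs of spaces.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "A PDF conver-\nsion often\nbreaks lines.\n\nNext para-\ngraph."
    print(clean_pasted_text(sample))
```

This only tidies line breaks and hyphenation; it cannot repair characters that were already pasted as boxes or question marks.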
Old 02-18-2011, 04:17 PM   #8
DaleDe
Grand Sorcerer
DaleDe ought to be getting tired of karma fortunes by now.
 
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Another way to handle the problem is to OCR the PDF. ABBYY has a tool for that, or for $10 they have a tool that will capture an image off the screen and OCR it.

Dale