11-01-2011, 08:05 PM   #11
frostschutz
There are a thousand reasons why text extraction from PDF does not work properly, and they all come down to the same thing: PDF does not care about structure or even machine readability. All it cares about is how the page looks on a sheet of paper (whether that sheet is real or just displayed on screen in the given, fixed dimensions). It's possible to construct a PDF so that all the text you extract comes out backwards or completely garbled; that's just how it is.
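
To make that a bit more concrete, here is a rough sketch in Python. It is only a toy model, not a real PDF parser: the `fragments` list, its coordinates, and both function names are made up for illustration. The idea is that a page is just a bag of positioned text pieces, and nothing forces them to be stored in reading order.

```python
# Hypothetical model of a PDF page: each fragment is (x, y, text), listed in
# the order the drawing commands appear in the file, not in reading order.
# y grows upward, as in PDF page coordinates.
fragments = [
    (130, 700, "world"),
    (72, 680, "Second line"),
    (72, 700, "Hello"),
]


def extract_in_stream_order(frags):
    """Naive extraction: join the text in the order it was drawn."""
    return " ".join(text for _, _, text in frags)


def extract_by_position(frags):
    """Layout-based guess: read top to bottom, then left to right."""
    lines = {}
    for x, y, text in frags:
        # Group fragments that share the same baseline into one line.
        lines.setdefault(y, []).append((x, text))
    # Higher y means higher on the page, so sort descending by y.
    ordered = sorted(lines.items(), key=lambda item: -item[0])
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in ordered
    )


stream_text = extract_in_stream_order(fragments)
layout_text = extract_by_position(fragments)
print(stream_text)
print(layout_text)
```

The first result comes out scrambled even though the page would look perfectly fine when rendered. The second one only works because this toy page is a single column with exact baselines; real files with multiple columns, rotated text, or custom font encodings defeat this kind of guessing quickly.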

There may be an easy way to fix your particular problem, but it's hard to say without access to the PDF file itself.

If all else fails, you could always try running it through OCR. As long as the text looks clean on the page, this will work reasonably well.