MobileRead Forums - View Single Post

frostschutz · 11-01-2011, 09:05 PM

There are a thousand reasons why text extraction from PDF does not work properly, and they all come down to the same thing: PDF does not care about structure or even machine readability. All it cares about is how the page looks on a sheet of paper (whether that sheet is real or just displayed on screen at the given, fixed dimensions). It's possible to construct a PDF so that all the text you extract comes out backwards or completely garbled; that's just how it is.
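To make that a bit more concrete, here is a toy sketch in Python. It does not read a real PDF and is not how any particular extraction tool works; the `fragments` list and both function names are made up for illustration. It only assumes that a page is a bag of text fragments with coordinates, drawn in whatever order the producing program liked.

```python
# Each tuple is (x, y, text): a hypothetical, simplified stand-in for the
# positioned text fragments on a PDF page. y grows upward, as in PDF.
fragments = [
    (100, 700, "world"),   # drawn first, although it sits to the right
    (50, 680, "Second"),
    (50, 700, "Hello"),
    (100, 680, "line"),
]


def extract_in_stream_order(frags):
    """Naive extraction: emit text in the order it was drawn."""
    return " ".join(text for _, _, text in frags)


def extract_in_reading_order(frags):
    """Heuristic: sort top-to-bottom, then left-to-right, grouping by y."""
    ordered = sorted(frags, key=lambda f: (-f[1], f[0]))
    lines = {}
    for _, y, text in ordered:
        # Fragments sharing the exact same y are treated as one line.
        lines.setdefault(y, []).append(text)
    return "\n".join(" ".join(words) for words in lines.values())


print(extract_in_stream_order(fragments))
print(extract_in_reading_order(fragments))
```

The first call gives the words in drawing order, which is garbage; the second only recovers the text because this toy page has neat rows. With multiple columns, rotated text, or baselines that don't line up exactly, a position heuristic like this breaks down as well.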

There may be an easy way to fix your particular problem, but it's hard to say without having access to the PDF file itself.

If all else fails, you could always try running it through OCR. As long as the text is clean in appearance, this will work reasonably well.
