There are a thousand reasons why text extraction from a PDF may not work properly. The underlying cause is that PDF does not care about structure or even machine readability. All it cares about is how the content looks on a sheet of paper (whether that sheet is real or just displayed at the given, fixed dimensions on screen). It's possible to construct a PDF so that all the text you extract comes out backwards or completely garbled; that's just how it is.
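To make the "backwards" case concrete, here is a toy sketch in Python. It does not parse a real PDF. It assumes a simplified model in which a page is just a list of (x, y, character) drawing operations, and all function names are made up for illustration. The point is only that the order in which glyphs are drawn need not match the order in which they are read.

```python
from typing import List, Tuple

# Hypothetical model of a page: one (x, y, char) entry per drawn glyph,
# listed in the order the drawing operations appear in the file.
Glyph = Tuple[float, float, str]


def draw_backwards(text: str, y: float = 700.0, step: float = 10.0) -> List[Glyph]:
    """Place each character at its normal position, but emit the
    drawing operations right to left. The rendered page looks the same."""
    glyphs = [(i * step, y, ch) for i, ch in enumerate(text)]
    return list(reversed(glyphs))


def extract_stream_order(glyphs: List[Glyph]) -> str:
    """Naive extraction: concatenate characters in drawing order."""
    return "".join(ch for _, _, ch in glyphs)


def extract_by_position(glyphs: List[Glyph]) -> str:
    """Layout-aware extraction: sort top to bottom (larger y first,
    as in PDF coordinates), then left to right."""
    ordered = sorted(glyphs, key=lambda g: (-g[1], g[0]))
    return "".join(ch for _, _, ch in ordered)


page = draw_backwards("Hello PDF")
print(extract_stream_order(page))  # FDP olleH
print(extract_by_position(page))   # Hello PDF
```

Sorting by position rescues this simple case, but real documents add columns, rotated text, and fonts whose character codes do not map back to readable characters, so no single trick works everywhere.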
There may be an easy way to fix your particular problem, but it's hard to say without access to the PDF file itself.
If all else fails, you could always try running it through OCR. As long as the text is clean in appearance, this will work reasonably well.