Thank you so much, ldolse.
That really helped me understand what is going on. I'd guessed that it had something to do with the way the page-breaking heuristics work.
It seems like I've been using Poppler's pdftohtml since the 1990s (probably have been), so it's about time I looked into it properly.
Thanks again.