The only information I get about a PDF file from the MuPDF library is a list of characters and their X,Y positions on the page. Nothing marks a new line or a paragraph; that has to be inferred solely from the character positions. To tell whether a paragraph break was intended, I'd have to deduce it from the line spacing or indentation, which would be quite error-prone. It's probably easier to hand-edit the Unicode text output by k2pdfopt (-ocrout option).
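
To give a feel for the guesswork involved, here is a rough Python sketch of that kind of inference. It is not k2pdfopt or MuPDF code. The input format (a list of `(char, x, y)` tuples with y increasing down the page and spaces present as characters), the function names, and the thresholds `y_tol`, `gap_factor`, and `indent_tol` are all assumptions made up for illustration.

```python
from statistics import median


def group_lines(chars, y_tol=2.0):
    """Group (char, x, y) tuples into text lines.

    Assumption: characters on one line have nearly equal y values
    (within y_tol), and y grows toward the bottom of the page.
    """
    lines = []
    for ch, x, y in sorted(chars, key=lambda c: (c[2], c[1])):
        if lines and abs(y - lines[-1]["y"]) <= y_tol:
            lines[-1]["chars"].append((x, ch))
        else:
            lines.append({"y": y, "chars": [(x, ch)]})
    for line in lines:
        line["chars"].sort()                   # left-to-right order
        line["x0"] = line["chars"][0][0]       # x of first character
        line["text"] = "".join(ch for _, ch in line["chars"])
    return lines


def split_paragraphs(lines, gap_factor=1.5, indent_tol=5.0):
    """Guess paragraph breaks from line spacing and indentation.

    A line starts a new paragraph if the gap above it is much larger
    than the median line gap, or if it is indented past the left margin.
    Both rules are heuristics and will misfire on centered text,
    multi-column pages, headings, and so on.
    """
    if not lines:
        return []
    gaps = [b["y"] - a["y"] for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0.0
    left_margin = min(line["x0"] for line in lines)

    paragraphs = [[lines[0]["text"]]]
    for prev, cur in zip(lines, lines[1:]):
        big_gap = (cur["y"] - prev["y"]) > gap_factor * typical_gap
        indented = (cur["x0"] - left_margin) > indent_tol
        if big_gap or indented:
            paragraphs.append([cur["text"]])
        else:
            paragraphs[-1].append(cur["text"])
    return [" ".join(p) for p in paragraphs]
```

Calling `split_paragraphs(group_lines(chars))` returns one string per guessed paragraph. Even in this toy form the result depends on hand-picked thresholds, which is why the approach is fragile on real documents.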