Ahhhh ... It was user_none who gave me the kick in the head and got me realigned ... I used to dabble in PostScript in the eighties, and now I DO have at least a clue about what's going on. I've been thinking too much about text files and not enough about how PostScript works.
After wandering thru piles and piles of PDF files over the last couple of years, I feel that they fall into three cases:
1. The "perfect" PDF file:
These files have both paragraph indents and paragraph spacing. It ought to be simple to analyze the PostScript code for text positioning, and, given numbers for normal line spacing, paragraph spacing and indent spacing, it ought to be a piece of cake to properly format text from these files with absolutely no wrap/unwrap errors.
2. The "nice" PDF file:
These files have either paragraph indents or paragraph spacing but not both. It still ought to be simple to properly format text from these files without wrapping errors.
3. The "bad" PDF file:
These files have neither paragraph indents nor paragraph spacing, so one is stuck with looking only at punctuation and end-of-line position to find where paragraphs break.
I have seen a "4th" case, where the file was one complete glob of text with no breaks whatsoever ... I just throw those files away.
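To make the three cases concrete, here is a rough Python sketch of how a file might be sorted into them. It assumes something has already pulled out the left edge and vertical position of every text line on one single-column page; the function name `classify_layout`, the `(x_left, y_top)` input format, and the `tol` tolerance are all made up for illustration and are not anything Calibre actually provides.

```python
from collections import Counter

def classify_layout(lines, tol=1.0):
    """Sort a page into the "perfect" / "nice" / "bad" cases.

    lines: list of (x_left, y_top) per text line, in reading order,
           with y growing downward (an assumed input format).
    tol:   slack, in the same units, before a difference counts.
    """
    if len(lines) < 2:
        return "unknown"
    xs = [round(x) for x, _ in lines]
    gaps = [round(b[1] - a[1]) for a, b in zip(lines, lines[1:])]
    # The most common left edge is taken as the body margin, and the
    # most common vertical step as the normal line spacing.
    margin = Counter(xs).most_common(1)[0][0]
    normal_gap = Counter(gaps).most_common(1)[0][0]
    has_indents = any(x > margin + tol for x in xs)
    has_spacing = any(g > normal_gap + tol for g in gaps)
    if has_indents and has_spacing:
        return "perfect"   # case 1
    if has_indents or has_spacing:
        return "nice"      # case 2
    return "bad"           # case 3 (and the "4th" case too)

page = [(72, 100), (60, 112), (60, 124), (72, 144), (60, 156), (60, 168)]
print(classify_layout(page))
```

A real converter would have to cope with multiple columns, page breaks, headers and centered text, none of which this sketch tries to handle.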
Please note that I got all the way to case 3 before even mentioning punctuation or end-of-line position. Unfortunately, this seems to be the only method Calibre's PDF converter uses to format text, without even considering the first two cases.
I went to a folder with almost 200 PDFs in it and tallied up the first 60 files (and then gave up!). I found that 29 files matched case 1, 25 files matched case 2, and only 6 files matched case 3. In other words, 54 of the 60 (90%) had indents, paragraph spacing, or both to work from. I could probably go for a larger statistical base, but this still argues for a better way to analyze PDF files.
It would be really nice if the PDF converter first looked for paragraph indents and paragraph spacing and used those for controlling wrapping when possible, falling back to the worst case of punctuation and end-of-line position only when the other two failed.
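Here is a minimal sketch of that "layout first, punctuation last" order, again in Python. It is only an illustration under stated assumptions: the `unwrap` name, the `(x_left, y_top, text)` line format, and the `tol` and `short` thresholds are hypothetical, and this is not how Calibre's converter is written.

```python
from collections import Counter

def unwrap(lines, tol=1.0, short=0.8):
    """Join wrapped lines into paragraphs.

    lines: list of (x_left, y_top, text), single column, reading order,
           y growing downward (an assumed input format).
    short: a line shorter than this fraction of the longest line counts
           as "ending early" in the punctuation fallback.
    """
    if not lines:
        return []
    xs = [round(x) for x, _, _ in lines]
    gaps = [round(b[1] - a[1]) for a, b in zip(lines, lines[1:])]
    margin = Counter(xs).most_common(1)[0][0]
    normal_gap = Counter(gaps).most_common(1)[0][0] if gaps else 0
    indent = [x > margin + tol for x in xs]
    spaced = [False] + [g > normal_gap + tol for g in gaps]
    # Cases 1 and 2: trust indents and/or spacing if any were found.
    use_layout = any(indent[1:]) or any(spaced)
    longest = max(len(t) for _, _, t in lines)

    paragraphs = [lines[0][2].strip()]
    for i in range(1, len(lines)):
        text = lines[i][2].strip()
        if use_layout:
            new_para = indent[i] or spaced[i]
        else:
            # Case 3 fallback: previous line ends a sentence AND stops short.
            prev = lines[i - 1][2].rstrip()
            new_para = (prev.endswith((".", "!", "?", '"'))
                        and len(prev) < short * longest)
        if new_para:
            paragraphs.append(text)
        else:
            paragraphs[-1] += " " + text
    return paragraphs
```

Note that when indents or spacing are present, a line that merely ends with a period is still wrapped into the paragraph, which is the whole point. Words hyphenated across a line break are not rejoined here.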
Idolse: I understand the point you're making with your example. But after scrubbing thru piles of converted files, I would have to say that in 99.99% of the cases that match your example, the lines should be wrapped without a hard line break.