View Single Post
Old 01-04-2009, 01:51 PM   #14
tompe
Grand Sorcerer
tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.tompe ought to be getting tired of karma fortunes by now.
 
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
Quote:
Originally Posted by Flinx View Post
No, that is not really useful for the most standard PDFs. The text object in a PDF file does not contain a real line break. It contains the position where on the page it has to drawn and a number of characters. The result is a line of text.
The progam that makes the conversion has to estimate from the positions of the text objects in which order the lines come. Simple converters like the most available (including Acrobat) use one text object, convert it to text and set a line break at the end, resulting in one line of the output text. The better converters can try to join the separate text objects, if their horizontal start position is identical and the line is long enough. But this is a difficult job, and I have not yet found a program that works good enough for me.
That might be the case but there is no functional different between encoding paragraphs with two line breaks or one. What you are talking about is how go a converter is detecting a paragraph break but that has no necessary connection to how the encoding is done. You can argue that you loose information if you do not keep the line breaks in a paragraph since they are impossible to recreate but it is trivial to take a paragraph specified by using double line breaks and convert it to one line.
tompe is offline   Reply With Quote