MobileRead Forums - View Single Post - PDF to epub conversion grief; keeping indentation

frostschutz · 10-31-2010, 05:09 PM

In general, a PDF knows very little about the formatting and content of a document; instead, it is a set of drawing instructions like "draw a line from point A to B" or "place letter X in size Y at coordinates Z". So you're lucky to get even simple things such as paragraphs, chapters, or headings out of a PDF file. While the indentation is certainly visible to you as a human, that information is not readily available in the PDF file, since the file is closer to an image of a page layout than to a record of the formatting rules that produced that layout.

It's not impossible to convert it, but you'd have to write a custom script to do it. The script would have to be smart enough to recognize the Python snippets and deduce the indentation from where the text is positioned on the page.
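As a rough idea of what the indentation part of such a script could look like, here is a minimal sketch. It assumes you have already pulled each line of a snippet out of the PDF together with the x-coordinate of its left edge (how you get that depends on your extraction tool), and that the snippet is set in a monospace font whose character width you know. The function name `reindent`, the `(x, text)` input format, and the sample coordinates are all made up for illustration.

```python
def reindent(lines, char_width):
    """Rebuild leading spaces from horizontal positions.

    lines: list of (x, text) pairs in reading order, where x is the
           left edge of the line on the page (hypothetical input format).
    char_width: width of one character in the same units as x
                (assumes a monospace font).
    """
    if not lines:
        return []
    # Treat the leftmost line as column 0.
    left_margin = min(x for x, _ in lines)
    result = []
    for x, text in lines:
        # Convert the horizontal offset into a whole number of columns.
        columns = round((x - left_margin) / char_width)
        result.append(" " * columns + text.strip())
    return result


# Made-up positions for a small snippet, 6 units per character.
snippet = [
    (72.0, "def greet(name):"),
    (96.0, "if name:"),
    (120.0, 'print("Hello, " + name)'),
    (96.0, "else:"),
    (120.0, 'print("Hello")'),
]
print("\n".join(reindent(snippet, char_width=6.0)))
```

The hard part in practice is everything this sketch skips: deciding which lines belong to a code snippet at all, and coping with snippets that continue across a page break.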

This is one of the two things OCR has to do: one is recognizing the characters, which you can skip with most (but not all) PDFs; the other is recognizing the layout.

If there aren't too many snippets in the book, it'd probably be faster to just reindent them manually.
