PDF to epub conversion grief; keeping indentation

Aia · 10-31-2010, 03:35 PM

Converting PDF to epub for my nook using Calibre is easy, except that I cannot find a way to keep the proper spacing (indentation) when the PDF contains code examples, such as Python, that require proper block indentation.
How can I make a conversion to epub where the output file displays the Python indentation properly?

Instead of the resulting output

Code:

def _find_note(self, note_id):
'''Locate the note with the given id.'''
for note in self.notes:
if str(note.id) == str(note_id):
return note
return None

I would like to have

Code:

def _find_note(self, note_id):
    '''Locate the note with the given id.'''
    for note in self.notes:
        if str(note.id) == str(note_id):
            return note
    return None

I have tried

Code:

p { white-space: pre; }

in the Look and Feel -> Extra CSS box of the conversion wizard. Even so, the indentation is not preserved.

Is there anything else I can do, or is this the final state of affairs in the conversion world?

kovidgoyal · 10-31-2010, 04:57 PM

No, the space is clobbered in the PDF input stage itself.

frostschutz · 10-31-2010, 05:09 PM

In general, a PDF carries very little information about the formatting and content of a document; instead it is a set of instructions like "draw line from point A to B" or "place letter X in size Y on coordinates Z". So you're lucky to even get simple things such as paragraphs, chapters, or headings out of a PDF file. While the indentation is certainly visible to you as a human, this information is not readily available in a PDF file, because the file is more an image of a page layout than a record of the formatting rules that led to that layout.

It's not impossible to convert it, but you'd have to write a custom script that does it. It would have to be smart enough to recognize the Python snippets and deduce the indentation from how the text is positioned.
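A minimal sketch of the "deduce the indentation from the position" step, assuming you already have each line's left edge from some PDF text extractor. The `(x, text)` input format, the coordinates in `sample`, and the function name `reindent` are made up for illustration; recognizing which lines belong to a code snippet is not covered here.

```python
from typing import List, Tuple

# Hypothetical input: one (x, text) pair per line, where x is the left edge
# of the line on the page. The numbers used below are invented.
Line = Tuple[float, str]


def reindent(lines: List[Line], spaces_per_level: int = 4,
             tolerance: float = 1.0) -> str:
    """Rebuild leading whitespace from each line's horizontal position."""
    if not lines:
        return ""
    # The leftmost line is treated as indentation level 0.
    left_margin = min(x for x, _ in lines)
    offsets = [x - left_margin for x, _ in lines]
    # Assume the smallest clearly non-zero offset is one indentation step.
    steps = [o for o in offsets if o > tolerance]
    step = min(steps) if steps else 1.0
    out = []
    for offset, (_, text) in zip(offsets, lines):
        level = round(offset / step)
        out.append(" " * (spaces_per_level * level) + text.strip())
    return "\n".join(out)


sample = [
    (72.0, "def _find_note(self, note_id):"),
    (96.0, "'''Locate the note with the given id.'''"),
    (96.0, "for note in self.notes:"),
    (120.0, "if str(note.id) == str(note_id):"),
    (144.0, "return note"),
    (96.0, "return None"),
]
print(reindent(sample))
```

With these invented coordinates the output has the indentation asked for in the first post. Real PDFs are messier: a snippet that starts in the middle of a block, or that continues across a page break, would need a step size fixed in advance instead of one guessed from the smallest offset.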

This is one of the two things OCR has to do: one is recognizing the characters (you can skip that step with most, but not all, PDFs); the other is recognizing the layout.

If there aren't too many snippets in the book, it'd probably be faster to just reindent them manually.
