MobileRead Forums - View Single Post - PDF -> HTML conversion

roffLOL · 10-03-2011, 07:07 AM

On the subject of speed, I'm pretty sure I can cut execution time by 25% with a simple optimization, and cut most of the memory consumption as well. I will not aim for a faster conversion than that. For a non-repeated task, it is good enough for me.

The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. The converter removes '-' at the end of lines and joins the lines. It appends paragraphs that span pages (those not ending in a full stop), keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), retains font information (even for single words in the middle of sentences), retains line and paragraph spacing, and translates all special characters to their HTML equivalents. For some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out.
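For anyone following along, here is a rough Python sketch of three of those steps: joining hyphenated line ends, merging paragraphs that span a page break, and translating special characters. It is not the actual converter code. The function names and the input shape (plain strings per line and per paragraph) are made up for illustration, and the "ends with a full stop" test is the simplest reading of the rule described above.

```python
import html


def join_hyphenated(lines):
    """Join the lines of one paragraph into a single string.

    A trailing '-' is treated as a line-break hyphen: it is dropped and the
    next line is glued on without a space. Assumption: this also swallows
    genuine hyphens that happen to fall at a line end.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            out = out[:-1] + line   # drop the hyphen, no space
        elif out:
            out += " " + line       # ordinary line break becomes a space
        else:
            out = line
    return out


def merge_across_pages(pages):
    """Flatten pages (lists of paragraph strings) into one paragraph list.

    If the last paragraph so far does not end with a full stop, the first
    paragraph of the next page is appended to it instead of starting anew.
    """
    merged = []
    for page in pages:
        for i, para in enumerate(page):
            if i == 0 and merged and not merged[-1].rstrip().endswith("."):
                merged[-1] = merged[-1].rstrip() + " " + para
            else:
                merged.append(para)
    return merged


def to_html(para):
    """Wrap a paragraph in <p> tags, escaping &, <, > and non-ASCII chars."""
    escaped = html.escape(para, quote=False)
    # Non-ASCII characters become numeric character references.
    escaped = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return "<p>" + escaped + "</p>"
```

Font information, indentation ratio and spacing are left out here, since they depend on how the PDF is parsed in the first place.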

However, the whole implementation will likely blow up in my face when tried with another document =). Much work is left to be done.
