PDF -> HTML conversion

roffLOL · 09-30-2011, 05:57 AM

You wouldn't be interested in a PDF -> HTML converter? I'm currently developing one. For single page (one page per page, not those documents with double columns), justified PDF documents it will be able to:

retain:
Fonts
Paragraphs and indentations
alignment
PDF's general logical structure with TOC
[graphics]

remove:
Page numbering
[possibly header and footer]

However, I have developed this library out of need, and as such, will not develop it further as soon as I get it working for the case described (single page, justified PDF document).

Current status is 90% finished. Only 90% of the development time left, in other words.

Say, a month.

The library is in pure python (2.6?).

//Humble greetings,
roffLOL

ldolse · 10-02-2011, 09:46 AM

This sounds like it would be very cool.

drMerry · 10-02-2011, 09:56 AM

Especially the remove function seems to be a very nice function.

user_none · 10-02-2011, 10:34 AM

Quote:

Originally Posted by roffLOL

You wouldn't be interested in a PDF -> HTML converter?

Link?

Quote:

Originally Posted by roffLOL

The library is in pure python (2.6?).

What is the performance? E.g. memory and CPU usage? Time to complete? A while back I evaluated a few pure Python PDF libraries for replacing the current PDF engine and found them to be up to 60x slower than the current engine without much gain in terms of output quality. This is partly why a large part of the new PDF engine is being written in C++.

roffLOL · 10-02-2011, 03:44 PM

Quote:

Originally Posted by user_none

Link?

What is the performance? E.g. memory and CPU usage? Time to complete? A while back I evaluated a few pure Python PDF libraries for replacing the current PDF engine and found them to be up to 60x slower than the current engine without much gain in terms of output quality. This is partly why a large part of the new PDF engine is being written in C++.

I am not done and have no reliable numbers. However, my current implementation takes 20 seconds to convert a 350 page book and write it to disk (before any optimization). I can't put it in relation to the current implementation of Calibre's converter, since I can't measure how much time Calibre itself spends on parsing the HTML; however, for the same book Calibre's conversion takes about as long as mine. As for quality, mine IS MUCH better (unless you have improved it greatly in the last couple of months). I haven't yet done any quantitative quality tests. I will run the conversion on nearly 500 PDFs, and will not be satisfied before it can handle all of them nearly perfectly. I am very pedantic about my reading experience.

I will upload the project somewhere, maybe next week or the week after that. When it comes to 'not coding matters', I am a lazy bastard.
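Editor's note: since no reliable numbers exist yet, here is a minimal sketch of how wall-clock time and peak Python memory could be measured for a single conversion call. It assumes Python 3 (the thread mentions 2.6, which lacks `tracemalloc`), and `measure` is a hypothetical helper, not part of roffLOL's library or of Calibre.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once; return (result, seconds elapsed, peak traced bytes).

    tracemalloc only sees memory allocated through Python's allocator and
    slows the run down, so the timing is pessimistic.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    # Stand-in workload; a real run would pass the converter's entry point.
    _, seconds, peak_bytes = measure(sorted, list(range(100000, 0, -1)))
    print("%.3f s, peak %.1f KiB" % (seconds, peak_bytes / 1024.0))
```

For a fair comparison between two converters, both would need to be measured on the same book and the same machine, ideally with memory tracing switched off for the timing run.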

user_none · 10-02-2011, 04:27 PM

Quote:

Originally Posted by roffLOL

I am not done and have no reliable numbers. However, my current implementation takes 20 seconds to convert a 350 page book and write it to disk (before any optimization).

That's not bad.

My opinion is up to 30 seconds is acceptable. 1 minute is a bit much. 2 minutes is excessive. 5+ minutes is completely unacceptable.

At 20 seconds I'll give this a serious look for use as a replacement for the current PDF input engine. Especially with this:

Quote:

Originally Posted by roffLOL

As for quality, mine IS MUCH better

Quote:

Originally Posted by roffLOL

I will upload the project somewhere, maybe next week or the week after that. When it comes to 'not coding matters', I am a lazy bastard.

I'm a big fan of Launchpad and Google Code.

kiwidude · 10-02-2011, 04:56 PM

FWIW, particularly if it was made available as an "option" a bit like you can switch between the PDF engines currently, I don't care if it takes 10 minutes. In comparison to the *hours* it takes to manually edit calibre converted PDFs currently to fix up the paragraph splitting issues with the existing engine you are going to be way ahead of the curve.

If/when the new C++ engine is finished and provides quality at blinding speed then that could be even better of course. However I think there would be plenty of users who would be willing to sacrifice some conversion speed if the quality stacks up to the advertised goals. Well I certainly would anyway.

roffLOL · 10-03-2011, 07:07 AM

On the subject of speed, I'm pretty sure I will be able to cut 25% off the execution time with a simple optimization, and cut most of the memory consumption as well. I will not aim for a faster conversion than that. For a non-repeated task, it is good enough for me.

The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the end of lines and connects the lines. It appends non-fullstop paragraphs that span pages, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out.

However, the whole implementation will likely blow up in my face when tried with another document =). Much work left to be done.
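Editor's note: the two joining rules described above (drop a '-' at the end of a line, and append a paragraph that does not end in a full stop to its continuation on the next page) can be sketched as follows. This is a minimal illustration in Python 3, not roffLOL's code; the function names and the `SENTENCE_END` set are assumptions (the post only mentions full stops).

```python
# Characters treated as a sentence end. Assumption: the post only names
# the full stop; '!' and '?' are added here for illustration.
SENTENCE_END = (".", "!", "?")


def join_lines(lines):
    """Join the lines of one paragraph, dropping end-of-line hyphens.

    Naive rule: every trailing '-' is treated as a line-break hyphen.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            text = text[:-1] + line      # "conver-" + "sion" -> "conversion"
        elif text:
            text += " " + line
        else:
            text = line
    return text


def merge_across_pages(pages):
    """pages is a list of pages, each a list of paragraph strings.

    The first paragraph of a page is appended to the last paragraph of the
    previous page when that one does not end a sentence.
    """
    merged = []
    for page in pages:
        for index, para in enumerate(page):
            unfinished = bool(merged) and not merged[-1].rstrip().endswith(SENTENCE_END)
            if index == 0 and unfinished:
                merged[-1] = join_lines([merged[-1], para])
            else:
                merged.append(para)
    return merged
```

Note that `join_lines(["a well-", "known author"])` returns `"a wellknown author"`: the naive rule cannot tell a line-break hyphen from one that belongs to the word.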

ldolse · 10-03-2011, 08:24 AM

Quote:

Originally Posted by roffLOL

The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the end of lines and connects the lines. It appends non-fullstop paragraphs that span pages, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out.

Sounds pretty promising, looking forward to seeing it. One note - you can't remove '-' from line endings wholesale - this is one area I spent a fair amount of time on, creating a 'dehyphenate' routine which uses the raw document as a dictionary. Feel free to borrow that code/concept from Calibre. The only weakness in the function is that I perform naive 'stemming', which focuses mainly on the English language, to increase the likelihood of a dictionary match.

Now that Calibre has the ability to allow the user to specify the language for a book I've been thinking about an industrial strength multi-language stemmer - the most appropriate one from my searches would be Snowball, which has a Python wrapper.
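Editor's note: the idea of using the raw document as its own dictionary can be sketched roughly like this. It is a simplified illustration in Python 3, not Calibre's actual routine: the names `build_vocabulary` and `dehyphenate` are hypothetical here, and the naive stemming step ldolse mentions is left out.

```python
import re

# A word: letters only, optionally with inner hyphens ("well-known").
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# A word fragment, a hyphen, a line break, then another word fragment.
BREAK_RE = re.compile(r"([^\W\d_]+)-[ \t]*\n[ \t]*([^\W\d_]+)")


def build_vocabulary(text):
    """Collect every lower-cased word that occurs in the document."""
    return set(word.lower() for word in WORD_RE.findall(text))


def dehyphenate(text):
    """Remove a line-break hyphen only if the joined word occurs elsewhere.

    Unknown joins keep their hyphen, which errs on the side of
    false negatives rather than false positives.
    """
    vocabulary = build_vocabulary(text)

    def fix(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in vocabulary:
            return joined
        return left + "-" + right

    return BREAK_RE.sub(fix, text)


if __name__ == "__main__":
    sample = ("The conversion is fast.\n"
              "A second conver-\nsion ran. A well-\nknown author.")
    print(dehyphenate(sample))
```

In the sample, "conver-/sion" is joined because "conversion" appears unbroken elsewhere in the text, while "well-/known" keeps its hyphen because "wellknown" never occurs. A word that only ever appears broken across a line would also keep its hyphen, which is where stemming or an external word list could help.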

roffLOL · 10-03-2011, 08:42 AM

Quote:

Originally Posted by ldolse

Sounds pretty promising, looking forward to seeing it. One note - you can't remove '-' from line endings wholesale - this is one area I spent a fair amount of time on, creating a 'dehyphenate' routine which uses the raw document as a dictionary.

I guessed as much, but figured that it will work in the majority of cases, at least for most(?) major languages. I have left room for altering these behaviours, if needed. In which cases are words hyphenated, other than for spreading them across lines? Hyphens are not too common a phenomenon inside actual words.

ldolse · 10-03-2011, 09:56 AM

Maybe it's dependent on language, but hyphenated compound words are quite common in English texts. If you turn on full debugging with Calibre it will actually log which hyphens are removed/retained to give you an idea of how frequently it happens. I've also seen numerous cases where em-dash and en-dash got dumbed down to a regular hyphen. A hyphen is also used as a replacement for an ellipsis in many docs, so removing it would result in sentences being merged together. Numeric strings are other candidates that are often intentionally hyphenated. Line breaking algorithms aggressively leverage existing hyphens.

My own preference has always been for false negative vs. false positive. For a while Calibre was removing hyphens wholesale, and I have to say that was rather annoying to me personally, though I have a perfectionist streak. That's what drove me to add the dictionary approach - my Python skills aren't great but it was simple logic to add.
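Editor's note: some of the cases listed above can be caught with a cheap check before any dictionary lookup. The sketch below (Python 3) covers only the dash stand-ins and numeric strings; compound words still need the dictionary approach. The function name `must_keep_hyphen` and the exact rules are assumptions made for illustration, not Calibre's logic.

```python
def must_keep_hyphen(line, next_line):
    """Return True if the trailing '-' on `line` should clearly be kept.

    Lines without a trailing hyphen return False. A False result only
    means "still a candidate for removal", not "safe to remove".
    """
    stripped = line.rstrip()
    if not stripped.endswith("-"):
        return False
    before = stripped[:-1]          # text in front of the trailing hyphen
    after = next_line.lstrip()
    if not before or not after:
        return True                 # nothing to join with
    if before[-1].isspace() or before.endswith("-"):
        return True                 # " -" or "--": a dash or ellipsis stand-in
    if before[-1].isdigit() or after[0].isdigit():
        return True                 # numeric strings such as "555-1234"
    if not before[-1].isalpha() or not after[0].isalpha():
        return True                 # punctuation on either side
    return False


if __name__ == "__main__":
    for pair in [("call 555-", "1234 now"),
                 ("he paused -", "then left"),
                 ("the conver-", "sion")]:
        print(pair, must_keep_hyphen(*pair))
```

Anything this check lets through would then go to a dictionary-based decision, keeping with the stated preference for false negatives over false positives.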

roffLOL · 10-03-2011, 10:46 AM

Thanks for the shared insight. I will look into this matter when I start my testing spree for realz. My parser makes loads of assumptions about the text to be parsed; for one, it expects to parse a literary book (this will be easily extendable for a coder in need of a 'bible parsing mode', or whatever; i.e., the assumptions about the logical structure of the text are [hopefully] pretty much separated from the rest of the code).

I have another reason for this aggressive approach. My reader's software dictionary is rendered unusable by wrongly hyphenated words, and I read in a couple of languages not my own [This is in fact the sole reason I'm writing this program. Some fucked up laws and royalties prevent me from buying e-books, and most... ehrm... less commercial books come as PDFs]. So rather too few hyphens than too many.

I'm not a Python programmer either, in fact this is my first Python project =)

drMerry · 10-03-2011, 07:47 PM

All sounds very interesting.
Looking forward to a preview version.
If it's a ZIP, you could add it to the forum if you want. But if you want to continue developing the project, I think Launchpad would be the better option in the end, even if you continue alone (version control).

JSWolf · 10-03-2011, 08:05 PM

Quote:

Originally Posted by roffLOL

It removes '-' at end of lines and connects the lines.

What happens if the dash is supposed to be there and not removed? How does it handle that? I don't want dashes removed that are part of the word if it happens to fall at the end of a line.

PeterT · 10-03-2011, 08:55 PM

Quote:

Originally Posted by JSWolf

What happens if the dash is supposed to be there and not removed? How does it handle that? I don't want dashes removed that are part of the word if it happens to fall at the end of a line.

How on earth do you expect a program to tell whether or not the hyphen SHOULD be there!
