|  09-30-2011, 04:57 AM | #1 | 
| Member            Posts: 10 Karma: 1538 Join Date: Sep 2011 Location: Sweden Device: Sony PRS-350 | 
				
				PDF -> HTML conversion
			 
			
			Would anyone be interested in a PDF -> HTML converter? I'm currently developing one. For single-page (one page per page, not those documents with double columns), justified PDF documents it will be able to:
			retain: fonts; paragraphs and indentations; alignment; the PDF's general logical structure with TOC; [graphics]
			remove: page numbering; [possibly header and footer]
			However, I have developed this library out of need, and as such, I will stop developing it as soon as I get it working for the case described (single-page, justified PDF document). Current status is 90% finished. Only 90% development time left, in other words. Say, a month. The library is in pure Python (2.6?). //Humble greetings, roffLOL Last edited by roffLOL; 09-30-2011 at 08:17 AM. Reason: Clarification. Written before coffee o'clock. | 
|   |   | 
|  10-02-2011, 08:46 AM | #2 | 
| Wizard            Posts: 1,337 Karma: 123457 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone | 
			
			This sounds like it would be very cool.
		 | 
|   |   | 
|  10-02-2011, 08:56 AM | #3 | 
| Addict            Posts: 293 Karma: 21022 Join Date: Mar 2011 Location: NL Device: Sony PRS-650 | 
			
			Especially the remove function seems to be a very nice function.
		 | 
|   |   | 
|  10-02-2011, 09:34 AM | #4 | 
| Sigil & calibre developer            Posts: 2,487 Karma: 1063785 Join Date: Jan 2009 Location: Florida, USA Device: Nook STR | 
			
			Link? What is the performance? E.g., memory and CPU usage? Time to complete? A while back I evaluated a few pure Python PDF libraries for replacing the current PDF engine and found them to be up to 60x slower than the current engine without much gain in output quality. This is partly why a large part of the new PDF engine is being written in C++. | 
|   |   | 
|  10-02-2011, 02:44 PM | #5 | |
| Member            Posts: 10 Karma: 1538 Join Date: Sep 2011 Location: Sweden Device: Sony PRS-350 | Quote: 
 I will upload the project somewhere, maybe next week or the week after that. When it comes to 'not coding matters', I am a lazy bastard. | |
|   |   | 
|  10-02-2011, 03:27 PM | #6 | |
| Sigil & calibre developer            Posts: 2,487 Karma: 1063785 Join Date: Jan 2009 Location: Florida, USA Device: Nook STR | Quote: 
 My opinion is that up to 30 seconds is acceptable. 1 minute is a bit much. 2 minutes is excessive. 5+ minutes is completely unacceptable. At 20 seconds I'll give this a serious look for use as a replacement for the current PDF input engine. Especially with this: I'm a big fan of Launchpad and Google Code. | |
|   |   | 
|  10-02-2011, 03:56 PM | #7 | 
| Calibre Plugins Developer            Posts: 4,735 Karma: 2197770 Join Date: Oct 2010 Location: Australia Device: Kindle Oasis | 
			
			FWIW, particularly if it were made available as an "option", a bit like you can currently switch between the PDF engines, I don't care if it takes 10 minutes. Compared to the *hours* it currently takes to manually edit calibre-converted PDFs to fix the paragraph-splitting issues of the existing engine, you are going to be way ahead of the curve. If/when the new C++ engine is finished and provides quality at blinding speed, then that could be even better, of course. However, I think there would be plenty of users willing to sacrifice some conversion speed if the quality stacks up to the advertised goals. Well, I certainly would anyway. | 
|   |   | 
|  10-03-2011, 06:07 AM | #8 | 
| Member            Posts: 10 Karma: 1538 Join Date: Sep 2011 Location: Sweden Device: Sony PRS-350 | 
			
			On the subject of speed, I'm pretty sure I will be able to cut 25% off the execution time with a simple optimization, and cut most of the memory consumption as well. I will not aim for a faster conversion than that. For a non-repeated task, it is good enough for me. The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the end of lines and joins the lines. It joins paragraphs that span pages when the text before the page break does not end in a full stop, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out. However, the whole implementation will likely blow up in my face when tried with another document =). Much work left to be done. | 
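
A rough illustration of the line joining, cross-page paragraph merging and character translation described in this post. This is a minimal sketch in Python 3, not roffLOL's code (which was not posted in the thread); the function names `join_lines`, `merge_across_pages` and `to_html_text`, and the set of sentence-ending characters, are assumptions made for the example.

```python
import html

# Assumption: a paragraph is "finished" if it ends with one of these characters.
TERMINATORS = (".", "!", "?", '"', "\u201d")


def join_lines(lines):
    """Join physical lines into one string, gluing words split by a trailing '-'."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            out = out[:-1] + line      # drop the hyphen and glue the two halves
        elif out:
            out += " " + line          # ordinary line break becomes a space
        else:
            out = line
    return out


def merge_across_pages(pages):
    """Merge a page's first paragraph into the previous page's last paragraph
    when that paragraph does not end with a sentence terminator."""
    merged = []
    for page in pages:
        for i, para in enumerate(page):
            if i == 0 and merged and not merged[-1].endswith(TERMINATORS):
                prev = merged[-1]
                if prev.endswith("-"):
                    merged[-1] = prev[:-1] + para
                else:
                    merged[-1] = prev + " " + para
            else:
                merged.append(para)
    return merged


def to_html_text(text):
    """Escape HTML metacharacters and turn non-ASCII characters into numeric entities."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Removing every line-end hyphen is the aggressive choice; it will also glue together words that were meant to keep their hyphen.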
|   |   | 
|  10-03-2011, 07:24 AM | #9 | |
| Wizard            Posts: 1,337 Karma: 123457 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone | Quote: 
 Now that Calibre has the ability to let the user specify the language of a book, I've been thinking about an industrial-strength multi-language stemmer. The most appropriate one from my searches would be Snowball, which has a Python wrapper. Last edited by ldolse; 10-03-2011 at 07:29 AM. | |
|   |   | 
|  10-03-2011, 07:42 AM | #10 | 
| Member            Posts: 10 Karma: 1538 Join Date: Sep 2011 Location: Sweden Device: Sony PRS-350 | 
			
			I guessed as much, but figured that it will work in the majority of cases, at least for most(?) major languages. I have left room for altering these behaviours, if needed. In which cases are words hyphenated, other than for splitting them across lines? Hyphens are not too common a phenomenon inside actual words.
		 | 
|   |   | 
|  10-03-2011, 08:56 AM | #11 | 
| Wizard            Posts: 1,337 Karma: 123457 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone | 
			
			Maybe it's dependent on language, but hyphenated compound words are quite common in English texts.  If you turn on full debugging in Calibre, it will actually log which hyphens are removed/retained, to give you an idea of how frequently it happens.  I've also seen numerous cases where an em-dash or en-dash got dumbed down to a regular hyphen.  A hyphen is also used as a replacement for an ellipsis in many docs, so removing it would result in sentences being merged together.  Numeric strings are other candidates that are often intentionally hyphenated.  Line-breaking algorithms aggressively leverage existing hyphens. My own preference has always been for false negatives over false positives. For a while Calibre was removing hyphens wholesale, and I have to say that was rather annoying to me personally. I have a perfectionist streak, though, and that's what drove me to add the dictionary approach. My Python skills aren't great, but it was simple logic to add. | 
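
The dictionary idea mentioned in this post could look roughly like the following. This is a hedged sketch in Python 3 and not Calibre's actual implementation; the function `resolve_hyphen`, the `known_words` set and the `keep_when_unsure` flag are all hypothetical names chosen for the example.

```python
def resolve_hyphen(left, right, known_words, keep_when_unsure=True):
    """Decide how to join 'left-' at a line end with 'right' on the next line.

    known_words is an assumed set of lower-case words for the book's language.
    With keep_when_unsure=True the hyphen is kept whenever the evidence is
    unclear (false negatives preferred over false positives).
    """
    joined = left + right
    hyphenated = left + "-" + right

    # A known hyphenated compound ("well-known") always keeps its hyphen.
    if hyphenated.lower() in known_words:
        return hyphenated
    # A known plain word means the hyphen was only a line-break artifact.
    if joined.lower() in known_words:
        return joined
    # Neither form is known: numbers, names, dashes typed as hyphens, and so on.
    return hyphenated if keep_when_unsure else joined
```

For example, with `known_words = {"conversion", "well-known"}`, `resolve_hyphen("conver", "sion", known_words)` gives `"conversion"`, while `resolve_hyphen("1999", "2001", known_words)` keeps the hyphen. Passing `keep_when_unsure=False` flips the default for a reader who would rather have too few hyphens than too many.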
|   |   | 
|  10-03-2011, 09:46 AM | #12 | 
| Member            Posts: 10 Karma: 1538 Join Date: Sep 2011 Location: Sweden Device: Sony PRS-350 | 
			
			Thanks for the shared insight. I will look into this matter when I start my testing spree for realz. My parser makes loads of assumptions about the text to be parsed. For one, it expects to parse a literary book (this will be easy to extend for a coder in need of a 'bible parsing mode', or whatever; i.e., the assumptions about the logical structure of the text are [hopefully] pretty much separated from the rest of the code). I have another reason for this aggressive approach. My reader's software dictionary is rendered unusable by wrongly hyphenated words, and I read in a couple of languages not my own [This is in fact the sole reason I'm writing this program. Some fucked up laws and royalties prevent me from buying e-books, and most... ehrm... less commercial books come as PDFs]. So rather too few hyphens than too many. I'm not a Python programmer either; in fact, this is my first Python project =) | 
|   |   | 
|  10-03-2011, 06:47 PM | #13 | 
| Addict            Posts: 293 Karma: 21022 Join Date: Mar 2011 Location: NL Device: Sony PRS-650 | 
			
			All sounds very interesting. Looking forward to a preview version. If it's a ZIP, you could attach it to the forum if you want. But if you want to continue developing the project, I think Launchpad would be the better option in the end, even if you continue alone (version control). | 
|   |   | 
|  10-03-2011, 07:05 PM | #14 | 
| Resident Curmudgeon            Posts: 80,677 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | |
|   |   | 
|  10-03-2011, 07:55 PM | #15 | 
| Grand Sorcerer            Posts: 13,685 Karma: 79983758 Join Date: Nov 2007 Location: Toronto Device: Libra H2O, Libra Colour | 
			
			How on earth do you expect a program to tell whether or not the hyphen SHOULD be there?
		 | 
|   |   | 
|  | 
| Tags | 
| conversion, pdf | 