09-30-2011, 04:57 AM | #1 |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
PDF -> HTML conversion
You wouldn't be interested in a PDF -> HTML converter? I'm currently developing one. For single page (one page per page, not those documents with double columns), justified PDF documents it will be able to:
retain: fonts; paragraphs and indentations; alignment; the PDF's general logical structure with TOC; [graphics]
remove: page numbering; [possibly header and footer]
However, I have developed this library out of need, and as such, will not develop it further once I get it working for the case described (single page, justified PDF document). Current status is 90% finished. Only 90% of the development time left, in other words. Say, a month. The library is in pure Python (2.6?).
//Humble greetings, roffLOL
Last edited by roffLOL; 09-30-2011 at 08:17 AM. Reason: Clarification. Written before coffee o'clock. |
10-02-2011, 08:46 AM | #2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
This sounds like it would be very cool.
|
10-02-2011, 08:56 AM | #3 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Especially the remove function seems to be a very nice function.
|
10-02-2011, 09:34 AM | #4 |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Link?
What is the performance? E.g., memory and CPU usage? Time to complete? A while back I evaluated a few pure Python PDF libraries for replacing the current PDF engine and found them to be up to 60x slower than the current engine without much gain in terms of output quality. This is partly why a large part of the new PDF engine is being written in C++. |
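For anyone wanting to answer those questions with numbers, here is a minimal sketch of how wall time, CPU time and peak memory could be measured for a single conversion run. It is only an illustration: `measure` and `fake_convert` are hypothetical names, `fake_convert` stands in for whatever conversion function is being tested, and `tracemalloc` only sees memory allocated through Python, not memory used by C extensions. It is written for Python 3, not the 2.6 mentioned earlier in the thread.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds, peak_bytes)."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = func(*args, **kwargs)
    finally:
        # Collect the numbers even if the conversion raises.
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, wall, cpu, peak


def fake_convert(n_pages):
    # Hypothetical stand-in for a real PDF -> HTML conversion call.
    return ["<p>page %d</p>" % i for i in range(n_pages)]


if __name__ == "__main__":
    html, wall, cpu, peak = measure(fake_convert, 1000)
    print("wall %.3fs, cpu %.3fs, peak %.1f KiB" % (wall, cpu, peak / 1024.0))
```

Running the same harness against two engines on the same document would give directly comparable figures.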
10-02-2011, 02:44 PM | #5 | |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
Quote:
I will upload the project somewhere, maybe next week or the week after that. When it comes to 'not coding matters', I am a lazy bastard. |
|
10-02-2011, 03:27 PM | #6 | |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
My opinion is that up to 30 seconds is acceptable. 1 minute is a bit much. 2 minutes is excessive. 5+ minutes is completely unacceptable. At 20 seconds I'll give this a serious look for use as a replacement for the current PDF input engine. Especially with this: I'm a big fan of Launchpad and Google Code. |
|
10-02-2011, 03:56 PM | #7 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
FWIW, particularly if it was made available as an "option", a bit like you can switch between the PDF engines currently, I don't care if it takes 10 minutes. Compared to the *hours* it currently takes to manually edit calibre-converted PDFs to fix the paragraph splitting issues of the existing engine, you are going to be way ahead of the curve.
If/when the new C++ engine is finished and provides quality at blinding speed then that could be even better, of course. However, I think there would be plenty of users who would be willing to sacrifice some conversion speed if the quality stacks up to the advertised goals. Well, I certainly would anyway. |
10-03-2011, 06:07 AM | #8 |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
On the subject of speed, I'm pretty sure I will be able to cut 25% execution time off with a simple optimization, and cut most of the memory consumption as well. I will not aim for a faster conversion than that. For a non-repeated task, it is good enough for me.
The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the end of lines and joins the lines. It merges paragraphs that span pages when the first part does not end in a full stop, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out. However, the whole implementation will likely blow up in my face when tried with another document =). Much work left to be done. |
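The two text-joining rules described in this post (drop a '-' at the end of a line, and continue a paragraph across a page break when it does not end in a full stop) can be sketched roughly as follows. This is not roffLOL's code, which had not been published at this point in the thread; the function names, the list-of-pages input shape and the set of sentence terminators are all assumptions made for illustration.

```python
def join_lines(lines):
    """Join the physical lines of one paragraph, rejoining words split by a trailing '-'."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Aggressive rule: a trailing hyphen is always treated as a line-break split.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def merge_across_pages(pages):
    """pages: list of pages, each a list of paragraph strings.

    The first paragraph of a page is appended to the last paragraph of the
    previous page when that paragraph does not end in sentence-final punctuation.
    """
    terminators = (".", "!", "?", '."', ".'", "\u201d")  # assumed set, adjust per language
    paragraphs = []
    for page in pages:
        for index, para in enumerate(page):
            continues = (
                index == 0
                and bool(paragraphs)
                and not paragraphs[-1].endswith(terminators)
            )
            if continues:
                paragraphs[-1] = join_lines([paragraphs[-1], para])
            else:
                paragraphs.append(para)
    return paragraphs
```

For example, `join_lines(["It was a con-", "version."])` gives `"It was a conversion."`. The unconditional hyphen removal is the aggressive variant; it will also turn a genuine compound that happens to break at a line end into one word.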
10-03-2011, 07:24 AM | #9 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
Now that Calibre has the ability to allow the user to specify the language for a book, I've been thinking about an industrial-strength multi-language stemmer. The most appropriate one from my searches would be Snowball, which has a Python wrapper. Last edited by ldolse; 10-03-2011 at 07:29 AM. |
|
10-03-2011, 07:42 AM | #10 |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
I guessed as much, but figured that it will work in the majority of cases, at least for most(?) major languages. I have left room for altering these behaviours, if needed. In which cases are words hyphenated, when not for spreading them across lines? Hyphens are not too common a phenomenon inside actual words.
|
10-03-2011, 08:56 AM | #11 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Maybe it's dependent on language, but hyphenated compound words are quite common in English texts. If you turn on full debugging with Calibre it will actually log which hyphens are removed/retained to give you an idea of how frequently it happens. I've also seen numerous cases where em-dash and en-dash got dumbed down to a regular hyphen. Hyphen is also used as a replacement for an ellipsis in many docs, so it would result in sentences being merged together. Numeric strings are other candidates that are often intentionally hyphenated. Line breaking algorithms aggressively leverage existing hyphens.
My own preference has always been for false negatives over false positives. For a while Calibre was removing hyphens wholesale, and I have to say that was rather annoying to me personally, though I do have a perfectionist streak. That's what drove me to add the dictionary approach; my Python skills aren't great, but it was simple logic to add. |
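One way a dictionary-style check could look is sketched below. This is not Calibre's actual implementation, only an illustration of the general idea: build a vocabulary from words already seen in the document, then remove a line-end hyphen only when the joined form is a known word, and keep it otherwise (favouring the false negative). `build_vocabulary` and `resolve_hyphen` are hypothetical names.

```python
import re

# A word is a run of letters, optionally continued by hyphen-separated runs of letters.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def build_vocabulary(text):
    """Collect the lowercase words (including hyphenated compounds) found in text."""
    return set(WORD_RE.findall(text.lower()))


def resolve_hyphen(left, right, vocabulary):
    """Decide how to join 'left-' at a line end with 'right' on the next line."""
    joined = left + right
    hyphenated = left + "-" + right
    if hyphenated.lower() in vocabulary:
        return hyphenated  # known compound such as "well-known"
    if joined.lower() in vocabulary:
        return joined      # known word that was split for line breaking
    return hyphenated      # unknown: keep the hyphen rather than risk a wrong merge
```

With a vocabulary built from `"The conversion is well-known."`, the pair `("con", "version")` resolves to `conversion`, while `("well", "known")` stays `well-known`. In practice the vocabulary should come from text that is not itself affected by line-end splits, or from an external word list for the book's language.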
10-03-2011, 09:46 AM | #12 |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
Thanks for the shared insight. I will look into this matter when I start my testing spree for realz. My parser makes loads of assumptions about the text to be parsed. For one, it expects to parse a literary book (this will be easily extendable for a coder in need of a 'bible parsing mode', or whatever; i.e., the assumptions about the logical structure of the text are [hopefully] pretty much separated from the rest of the code).
I have another reason for this aggressive approach. My reader's software dictionary is rendered unusable by wrongly hyphenated words, and I read in a couple of languages not my own [This is in fact the sole reason I'm writing this program. Some fucked up laws and royalties prevent me from buying e-books, and most... ehrm... less commercial books come as PDFs]. So rather too few hyphens than too many. I'm not a Python programmer either, in fact this is my first Python project =) |
10-03-2011, 06:47 PM | #13 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
All sounds very interesting.
Looking forward to a preview version. If it's a ZIP, you could attach it to the forum if you want. But if you want to continue developing the project, I think Launchpad would be the better option in the end, even if you continue alone (for the version control). |
10-03-2011, 07:05 PM | #14 |
Resident Curmudgeon
Posts: 75,860
Karma: 134368292
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
|
10-03-2011, 07:55 PM | #15 |
Grand Sorcerer
Posts: 12,600
Karma: 74358024
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
How on earth do you expect a program to tell whether or not the hyphen SHOULD be there!
|