Old 09-30-2011, 04:57 AM   #1
roffLOL
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
PDF -> HTML conversion

You wouldn't be interested in a PDF -> HTML converter? I'm currently developing one. For single page (one page per page, not those documents with double columns), justified PDF documents it will be able to:

retain:
Fonts
Paragraphs and indentations
Alignment
PDF's general logical structure with TOC
[graphics]


remove:
Page numbering
[possibly header and footer]

However, I have developed this library out of need, and as such, will not develop it further as soon as I get it working for the case described (single page, justified PDF document).

Current status is 90% finished. Only 90% of the development time left, in other words. Say, a month.

The library is in pure Python (2.6?).

//Humble greetings,
roffLOL

Last edited by roffLOL; 09-30-2011 at 08:17 AM. Reason: Clarification. Written before coffee o'clock.
Old 10-02-2011, 08:46 AM   #2
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
This sounds like it would be very cool.
Old 10-02-2011, 08:56 AM   #3
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Especially the remove function seems to be a very nice function.
Old 10-02-2011, 09:34 AM   #4
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by roffLOL View Post
You wouldn't be interested in a PDF -> HTML converter?
Link?

Quote:
Originally Posted by roffLOL View Post
The library is in pure Python (2.6?).
What is the performance? E.g. memory and CPU usage? Time to complete? A while back I evaluated a few pure Python PDF libraries for replacing the current PDF engine and found them to be up to 60x slower than the current engine without much gain in terms of output quality. This is partly why a large part of the new PDF engine is being written in C++.
Old 10-02-2011, 02:44 PM   #5
roffLOL
Quote:
Originally Posted by user_none View Post
Link?


What is the performance? E.g. memory and CPU usage? Time to complete? A while back I evaluated a few pure Python PDF libraries for replacing the current PDF engine and found them to be up to 60x slower than the current engine without much gain in terms of output quality. This is partly why a large part of the new PDF engine is being written in C++.
I am not done and have no reliable numbers. However, my current implementation takes 20 seconds to convert a 350 page book and write it to disk (before any optimization). I can't put it in relation to the current implementation of Calibre's converter, since I can't measure how much time Calibre itself spends on parsing the HTML; however, for the same book the conversion in Calibre takes as long as mine. As for quality, mine IS MUCH better (if you haven't improved it greatly in the last couple of months). I haven't yet done any quantitative quality tests. I will run the conversion on nearly 500 PDFs, and will not be satisfied before it can handle all of them nearly perfectly. I am very pedantic about my reading experience.

I will upload the project somewhere, maybe next week or the week after that. When it comes to 'not coding matters', I am a lazy bastard.
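
For the numbers asked about above (time to complete, memory), a rough measurement harness could look like the sketch below. It is only an illustration: it assumes Python 3 (the library in this thread targets 2.6), `measure` is a made-up helper name, and `tracemalloc` only sees memory allocated through Python and slows the run down noticeably, so the timing and the memory figure are best taken in separate runs.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, seconds, peak_bytes).

    Hypothetical helper: wall-clock time comes from time.perf_counter(),
    peak memory from tracemalloc (Python-level allocations only).
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        seconds = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, seconds, peak


if __name__ == "__main__":
    # Stand-in workload; a real run would call the converter on a PDF instead.
    result, seconds, peak = measure(sum, range(1000000))
    print("%.3f s, peak %.1f KiB" % (seconds, peak / 1024.0))
```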
Old 10-02-2011, 03:27 PM   #6
user_none
Quote:
Originally Posted by roffLOL View Post
I am not done and have no reliable numbers. However, my current implementation takes 20 seconds to convert a 350 page book and write it to disk (before any optimization).
That's not bad.

My opinion is up to 30 seconds is acceptable. 1 minute is a bit much. 2 minutes is excessive. 5+ minutes is completely unacceptable.

At 20 seconds I'll give this a serious look as a replacement for the current PDF input engine. Especially with this:

Quote:
Originally Posted by roffLOL View Post
As for quality, mine IS MUCH better
Quote:
Originally Posted by roffLOL View Post
I will upload the project somewhere, maybe next week or the week after that. When it comes to 'not coding matters', I am a lazy bastard.
I'm a big fan of Launchpad and Google Code.
Old 10-02-2011, 03:56 PM   #7
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
FWIW, particularly if it was made available as an "option", a bit like you can switch between the PDF engines currently, I don't care if it takes 10 minutes. In comparison to the *hours* it currently takes to manually edit calibre-converted PDFs to fix up the paragraph splitting issues with the existing engine, you are going to be way ahead of the curve.

If/when the new C++ engine is finished and provides quality at blinding speed then that could be even better of course. However, I think there would be plenty of users who would be willing to sacrifice some conversion speed if the quality stacks up to the advertised goals. Well, I certainly would anyway.
Old 10-03-2011, 06:07 AM   #8
roffLOL
On the subject of speed, I'm pretty sure I will be able to cut 25% execution time off with a simple optimization, and cut most of the memory consumption as well. I will not aim for a faster conversion than that. For a non-repeated task, it is good enough for me.

The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the end of lines and connects the lines. It appends non-fullstop paragraphs that span pages, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out.

However, the whole implementation will likely blow up in my face when tried with another document =). Much work left to be done.
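
The line joining, page-spanning paragraphs and character translation described in this post could be sketched roughly as below. This is not the actual library code (which has not been published in this thread). It assumes Python 3, plain-text input already split into lines and paragraphs, and the function names `join_lines`, `merge_across_pages` and `to_html` are invented for the example.

```python
import html


def join_lines(lines):
    """Join the lines of one paragraph, dropping a trailing '-' wholesale."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            text = text[:-1] + line  # glue the two word halves together
        elif text:
            text += " " + line
        else:
            text = line
    return text


def merge_across_pages(pages):
    """pages is a list of pages, each a list of paragraph strings.

    Assumption: a paragraph that ends a page without sentence-ending
    punctuation continues in the first paragraph of the next page.
    """
    merged = []
    carry_over = False
    for page in pages:
        for i, para in enumerate(page):
            if i == 0 and carry_over and merged:
                merged[-1] = join_lines([merged[-1], para])
            else:
                merged.append(para)
        if page:
            carry_over = not merged[-1].rstrip().endswith((".", "!", "?"))
    return merged


def to_html(paragraphs):
    """Escape markup characters and turn non-ASCII characters into numeric entities."""
    out = []
    for para in paragraphs:
        escaped = html.escape(para).encode("ascii", "xmlcharrefreplace").decode("ascii")
        out.append("<p>" + escaped + "</p>")
    return "\n".join(out)
```

With this naive version, `join_lines(["a well-", "known book"])` gives `a wellknown book`, which is exactly the kind of case where wholesale removal goes wrong.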
Old 10-03-2011, 07:24 AM   #9
ldolse
Quote:
Originally Posted by roffLOL View Post
The advertised goals were met yesterday. I have created a 98% replica of the PDF document that serves as my test case. It removes '-' at the end of lines and connects the lines. It appends non-fullstop paragraphs that span pages, it keeps an indentation ratio for paragraph beginnings equal to that of the original document (if it has one), it retains font information (even for a single word in the middle of a sentence), it retains line and paragraph spacing, it translates all special characters to their HTML equivalents, and for some weird reason, the line numbering seems to automagically vanish, even though I can't remember implementing anything to leave it out.
Sounds pretty promising, looking forward to seeing it. One note - you can't remove '-' from line endings wholesale. This is one area I spent a fair amount of time on, creating a 'dehyphenate' routine which uses the raw document as a dictionary. Feel free to borrow that code/concept from Calibre. The only weakness in the function is that I perform naive 'stemming', which focuses mainly on the English language, to increase the likelihood of a dictionary match.
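
The "raw document as a dictionary" idea can be illustrated with a small sketch. This is not Calibre's actual routine, just a minimal version of the concept under some assumptions: Python 3, plain text with `\n` line breaks, no stemming at all, and made-up names (`build_vocabulary`, `dehyphenate`). A line-end hyphen is removed only if the joined word also occurs unhyphenated somewhere else in the same document; otherwise the hyphen is kept.

```python
import re

# A word is a run of letters, optionally with internal hyphens ("well-known").
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# A word fragment, a hyphen, a line break, and the next fragment.
BREAK_RE = re.compile(r"([^\W\d_]+)-\n([^\W\d_]+)")


def build_vocabulary(text):
    """Collect every word that appears whole somewhere in the document."""
    return {word.lower() for word in WORD_RE.findall(text)}


def dehyphenate(text):
    """Remove a line-end hyphen only when the document itself backs the joined word."""
    vocab = build_vocabulary(text)

    def fix(match):
        left, right = match.group(1), match.group(2)
        joined = (left + right).lower()
        hyphenated = (left + "-" + right).lower()
        if joined in vocab and hyphenated not in vocab:
            return left + right  # ordinary line-break hyphen
        return left + "-" + right  # keep the hyphen, drop only the line break

    return BREAK_RE.sub(fix, text)
```

For example, `conver-` / `sion` is joined if "conversion" appears elsewhere in the text, while `half-` / `baked` stays "half-baked" when "halfbaked" is never seen. Words that occur only once, at a line break, keep their hyphen, so the sketch errs on the side of leaving hyphens in.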

Now that Calibre has the ability to allow the user to specify the language for a book, I've been thinking about an industrial-strength multi-language stemmer. The most appropriate one from my searches would be Snowball, which has a Python wrapper.

Last edited by ldolse; 10-03-2011 at 07:29 AM.
Old 10-03-2011, 07:42 AM   #10
roffLOL
Quote:
Originally Posted by ldolse View Post
Sounds pretty promising, looking forward to seeing it. One note - you can't remove '-' from line endings wholesale. This is one area I spent a fair amount of time on, creating a 'dehyphenate' routine which uses the raw document as a dictionary.
I guessed as much, but figured that it will work in the majority of cases, at least for most(?) major languages. I have left room for altering these behaviours, if needed. In which cases are words hyphenated, other than for splitting them across lines? Hyphens are not too common a phenomenon inside actual words.
Old 10-03-2011, 08:56 AM   #11
ldolse
Maybe it's dependent on language, but hyphenated compound words are quite common in English texts. If you turn on full debugging with Calibre it will actually log which hyphens are removed/retained to give you an idea of how frequently it happens. I've also seen numerous cases where em-dash and en-dash got dumbed down to a regular hyphen. A hyphen is also used as a replacement for an ellipsis in many docs, so removing it would result in sentences being merged together. Numeric strings are other candidates that are often intentionally hyphenated. Line breaking algorithms aggressively leverage existing hyphens.

My own preference has always been for false negatives over false positives. For a while Calibre was removing hyphens wholesale, and I have to say that was rather annoying to me personally, though I have a perfectionist streak. That's what drove me to add the dictionary approach - my Python skills aren't great but it was simple logic to add.
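
The cases listed above (dashes dumbed down to hyphens, numeric strings, compounds) can be screened out before any dictionary lookup. The sketch below is an assumption-laden illustration, not code from Calibre: the function name, the three labels and the rules themselves (for instance treating a capitalised continuation as a compound) are invented for the example, and it leans towards false negatives by only ever marking a hyphen as a *candidate* for removal.

```python
def classify_line_end_hyphen(line, next_line):
    """Classify the hyphen at the end of `line`.

    Returns "none" (no trailing hyphen), "dash" (hyphen standing in for a
    dash or ellipsis), "keep" (intentional hyphen) or "candidate" (possibly
    a line-break hyphen, to be checked against a dictionary).
    """
    stripped = line.rstrip()
    following = next_line.lstrip()
    if not stripped.endswith("-"):
        return "none"
    before = stripped[:-1]
    # "word -" or "word--": a dash or ellipsis substitute, not part of a word.
    if not before or before[-1].isspace() or before.endswith("-"):
        return "dash"
    # Numeric strings such as phone numbers or page ranges.
    if before[-1].isdigit() or following[:1].isdigit():
        return "keep"
    # A capitalised continuation suggests a compound like "Anglo-Saxon".
    if following[:1].isupper():
        return "keep"
    # Anything that is not letter-hyphen-letter is left alone.
    if not before[-1].isalpha() or not following[:1].isalpha():
        return "keep"
    return "candidate"
```

Only the "candidate" results would then go on to a dictionary check; everything else keeps its hyphen.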
Old 10-03-2011, 09:46 AM   #12
roffLOL
Thanks for the shared insight. I will look into this matter when I start my testing spree for realz. My parser makes loads of assumptions about the text to be parsed. For one, it expects to parse a literary book (this will be easily extendable for a coder in need of a 'bible parsing mode', or whatever; i.e., the assumptions about the logical structure of the text are [hopefully] pretty much separated from the rest of the code).

I have another reason for this aggressive approach. My reader's software dictionary is rendered unusable by wrongly hyphenated words, and I read in a couple of languages not my own. [This is in fact the sole reason I'm writing this program. Some fucked up laws and royalties prevent me from buying e-books, and most... ehrm... less commercial books come as PDFs.] So rather too few hyphens than too many.

I'm not a Python programmer either, in fact this is my first Python project =)
Old 10-03-2011, 06:47 PM   #13
drMerry
All sounds very interesting.
Looking forward to a preview version.
If it's a ZIP, you could add it to the forum if you want. But if you want to continue developing the project, I think Launchpad would be a better option in the end, even if you want to continue alone (for version control).
Old 10-03-2011, 07:05 PM   #14
JSWolf
Resident Curmudgeon
Posts: 73,998
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by roffLOL View Post
It removes '-' at the end of lines and connects the lines.
What happens if the dash is supposed to be there and not removed? How does it handle that? I don't want dashes removed that are part of the word if it happens to fall at the end of a line.
Old 10-03-2011, 07:55 PM   #15
PeterT
Grand Sorcerer
Posts: 12,168
Karma: 73448616
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
Quote:
Originally Posted by JSWolf View Post
What happens if the dash is supposed to be there and not removed? How does it handle that? I don't want dashes removed that are part of the word if it happens to fall at the end of a line.
How on earth do you expect a program to tell whether or not the hyphen SHOULD be there!
Tags
conversion, pdf

