Old 10-03-2011, 09:32 PM   #16
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by PeterT View Post
How on earth do you expect a program to tell whether or not the hyphen SHOULD be there!
Did you miss the post about how Calibre does this already today? You use the document as a dictionary to see if the word already exists without a hyphen. This technique automatically handles all languages and made-up/obscure words.
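For anyone following along, a minimal sketch of the document-as-dictionary idea might look like the following. This is not calibre's actual code: the function name `dehyphenate` and the regular expression are assumptions made for illustration, and it only looks at hyphens that fall at a line end.

```python
import re

# A word split by a hyphen at the end of a line, e.g. "exam-\nple".
# (Illustrative pattern, not taken from calibre.)
LINE_END_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphenate(text):
    """Join line-end hyphenated words only if the joined form occurs in the text."""
    # The document itself serves as the dictionary.
    vocabulary = {word.lower() for word in re.findall(r"\w+", text)}

    def join_if_known(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        if joined.lower() in vocabulary:
            # The unhyphenated word exists elsewhere, so drop the hyphen.
            return joined
        # Unknown word: keep the hyphen and only remove the line break.
        return first + "-" + second

    return LINE_END_HYPHEN.sub(join_if_known, text)


sample = "An example of the prob-\nlem: a well-\nknown problem with exam-\nple text."
print(dehyphenate(sample))
```

Here "prob-lem" and "exam-ple" are joined because "problem" and "example" appear unhyphenated elsewhere in the sample, while "well-known" keeps its hyphen because "wellknown" never appears.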
Old 10-03-2011, 09:39 PM   #17
PeterT
Grand Sorcerer
Posts: 12,143
Karma: 73448616
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
Yes... but I hate to say there are still cases where a dictionary approach will fail; and JS wants perfection...
Old 10-03-2011, 09:47 PM   #18
ldolse
Wizard
Ah, I interpreted JS's comment as another vote for false negatives vs. false positives. Using the document as a dict can't guarantee you'll remove every hyphen that should be removed, but it's an excellent technique to ensure that all the ones which are supposed to stay will stay.

Implementing proper multi-language stemming and adding an optional external dictionary would increase the detection rate even more, but it's debatable whether that's worth the effort.
Old 10-03-2011, 10:50 PM   #19
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by ldolse View Post
Quote:
Originally Posted by PeterT View Post
Yes... but I hate to say there are still cases where a dictionary approach will fail; and JS wants perfection...
Ah, I interpreted JS's comment as another vote for false negatives vs. false positives. Using the document as a dict can't guarantee you'll remove every hyphen that should be removed, but it's an excellent technique to ensure that all the ones which are supposed to stay will stay.
@PeterT - While JS is often a perfectionist in certain things (like aspect ratio), saying JS wants perfection based on his small input into this conversation seems akin to mind reading to me. We would all like perfection, but I too interpreted JS's comment as another vote for false negatives vs. false positives. As long as the wrong hyphens are not deleted, individual editing afterwards can deliver perfection.
Old 10-14-2011, 03:00 AM   #20
MrWarper
Zealot
Posts: 133
Karma: 2142
Join Date: Oct 2011
Location: Spain
Device: I'm an iRex man: 8x DR1000S, 4x DR800SG, 4x DR800S
I'd be very interested in giving this a try if it generated clean HTML. Does it? What's the current status?
Old 10-16-2011, 04:11 PM   #21
roffLOL
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
Quote:
Originally Posted by MrWarper View Post
I'd be very interested in giving this a try if it generated clean HTML. Does it? What's the current status?
Of course. The current implementation still outperforms my solution greatly both in speed and quality for some cases. Looks like I will need more development time than expected.
Old 10-17-2011, 08:48 AM   #22
MrWarper
Zealot
I don't get it. Your solution is still greatly outperformed by the current implementation of what? Calibre?

Wrt speed, I'd say interpreted languages like Python are not your friends, but anyway... Maybe adding to what outperforms you is a better option than trying to replace it, unless you have a good reason to code independently.
Old 10-17-2011, 08:54 AM   #23
DoctorOhh
US Navy, Retired
Quote:
Originally Posted by MrWarper View Post
I don't get it.
There isn't anything to get. roffLOL is developing this, and in his opinion calibre's current implementation is still better and faster in some cases.

Quote:
"The current implementation still outperforms my solution greatly both in speed and quality for some cases"
I would like to see roffLOL take a look at calibre's "in development" PDF conversion engine that handles 2 columns and see if he could add the finishing touches to that conversion engine.
Old 10-21-2011, 05:13 AM   #24
roffLOL
Member
Quote:
Originally Posted by MrWarper View Post
Wrt speed, I'd say interpreted languages like Python are not your friends, but anyway.
Sure, but I feel that speed comes more often from design (and slowness from the lack of it) than from a language's benchmarks =)

Quote:
Originally Posted by MrWarper View Post
Maybe adding to what outperforms you is a better option over trying to replace it, unless you have a good reason to code independently.
Good reason? I felt like it. Competition and whatnot. Recreation after a sucky workday at a sucky workplace producing sucky code. It's not about doing the optimal thing, but about having fun. Even if my work were to be wasted, at least my time wasn't.
Besides, I'm not beaten yet. Some cases is not all cases, and some cases may be fixed.

If I cannot match calibre's current implementation, I will work on it instead.
To be honest I haven't even looked at it yet, but it has shown some weird errors (like dropping doubles of tightly spaced l's (L)), which makes me suspect that our implementation approaches differ on quite a low level. There is value in trying different approaches too.

Quote:
Originally Posted by dwanthny View Post
I would like to see roffLOL take a look at calibre's "in development" PDF conversion engine that handles 2 columns and see if he could add the finishing touches to that conversion engine.
Are double columns even in use? I have found only a single book laid out in that manner.
Old 10-21-2011, 06:10 AM   #25
itimpi
Wizard
Posts: 4,552
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
Quote:
Originally Posted by roffLOL View Post
Are double columns even in use? I have found a single book with a layout in that manner.
Probably not very common in fiction-oriented books. It is very common in magazines, and users often want to convert those for a reader device. I have also seen it in books that are technical in nature, and these are again likely conversion candidates.

Of course the real solution is to not start with PDF, but often this is the only format available.
Old 10-21-2011, 06:27 AM   #26
roffLOL
Member
But for magazines you can't really expect a double column either; it might as well be three or more. Magazines also often follow a weird logical structure, so even if the columns were identified, appending them in the correct order would be error-prone.
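To make the ordering problem concrete, here is a minimal sketch of the naive approach. It is not how calibre or my code does it: the `Block` type, the `reading_order` function and the assumption of a fixed number of equal-width columns are all hypothetical, and that last assumption is exactly what magazine layouts tend to break.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical extracted text block with its top-left position."""
    x: float   # left edge, measured from the left of the page
    y: float   # top edge, increasing downward
    text: str


def reading_order(blocks, page_width, columns=2):
    """Order blocks column by column, top to bottom within each column."""
    column_width = page_width / columns

    def sort_key(block):
        # Assign the block to a column by its left edge, clamped to the last column.
        column = min(int(block.x // column_width), columns - 1)
        return (column, block.y)

    return sorted(blocks, key=sort_key)


page = [
    Block(310, 50, "C"),
    Block(20, 400, "B"),
    Block(20, 50, "A"),
    Block(310, 400, "D"),
]
print([block.text for block in reading_order(page, page_width=600)])
```

This handles a clean two-column page, but a headline spanning the full width, a pull quote, or an article that continues on another page would all end up in the wrong place.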
Old 10-21-2011, 10:04 AM   #27
ldolse
Wizard
Quote:
Originally Posted by roffLOL View Post
But for magazines you can't really expect a double column either, might as well be three or more, and magazines often follow a weird logical structure, so even if the columns were identified, appending them in correct order would be error-prone.
For the most popular magazines this is true, but there are classes of mags where two columns are quite common. Science fiction and other writing-oriented mags, for example. It's also a very common format in academic journals. I believe that the only thing the new pdf engine attempts to support is two-column.

Not sure which part of your progress is hitting snags, but the new pdf engine in Calibre does an initial conversion from pdf to xml using compiled code. The XML retains all the critical formatting information. The output Calibre produces today does not use the XML I'm talking about. You need to use calibre from the CLI with debug enabled - add the argument --new-pdf-engine if you want to see what I'm talking about.

Last edited by ldolse; 10-21-2011 at 10:07 AM.
Old 10-21-2011, 01:38 PM   #28
roffLOL
Member
Thanks! I shall try it. If it is for the benefit of academics and sci-fi readers, it should certainly be supported, no matter the cost.

Any source for such sci-fi-magz?
Old 10-21-2011, 01:56 PM   #29
ldolse
Wizard
Well the example I was thinking of is here:
http://www.starshipsofa.com/anthology/ebook/

Not sure of other good sources, just know that I've seen the two column format used in print for this type of content.

Edit - I don't think these use two column, but since you seem to be interested in other scifi sources:
http://www.hubfiction.com/
http://www.heliotropemag.com/category/heliotrope-issue/


Last edited by ldolse; 10-21-2011 at 02:11 PM.
Old 10-30-2011, 02:32 PM   #30
MrWarper
Zealot
The "real thing".

If you need ideas, I'd have a look at PDF.js. After all, I doubt conversion from PDF to HTML can go beyond that.

Last edited by MrWarper; 10-30-2011 at 02:32 PM. Reason: title, typo
Tags
conversion, pdf