Old 07-26-2010, 11:40 AM   #1
bjo
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2010
Device: none
Ebook AI for LISP hackers?

I'm an occasional ebook publisher/converter, so I've used Calibre a few times, but I still seem to find myself wrestling with regexps and manually tweaking formatting errors for hours...

As a rusty old LISP hacker, I find myself fantasizing about a LISP library that scans an etext for common formatting patterns, and automatically detects and corrects/converts the most common ones.

But that's what Calibre is supposed to do, yes? If I've had to do a lot of post-processing, does that just mean I'm using it wrong?

(I can't give particular examples without going back and doing some experiments, but I thought I'd inquire here first, and see if the consensus is that Calibre should/shouldn't be meeting those needs already.)
Old 07-26-2010, 02:27 PM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by bjo View Post
I thought I'd inquire here first, and see if the consensus is that Calibre should/shouldn't be meeting those needs already.)
Calibre is great, but not yet perfect. There's a new Add Books Wizard on the ToDo list. It should be able to improve its ability to get metadata from the filename. There's a new PDF engine under development, etc.

My answer is that it should be meeting those needs, but not "already."
Old 07-26-2010, 06:14 PM   #3
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by bjo View Post
But that's what Calibre is supposed to do, yes? If I've had to do a lot of post-processing, does that just mean I'm using it wrong?
You're not using it wrong. Structure detection is still a work in progress for some formats.

Also, it's not as easy as you might think. Take the TXT input, for instance. There is no differentiation in the text other than new lines, so detecting the structure becomes very hard. There are heuristics that can be used, but they quickly fall short when you move away from the language the heuristic is based on. We could do language-specific processing... As Kovid is fond of saying, patches are welcome. Python isn't LISP, but it's a very easy and fun language.
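
To make the point about language-bound heuristics concrete, here is a minimal sketch in Python. It is not calibre's code: the pattern `HEADING_RE` and the helpers `split_blocks` and `classify` are hypothetical names invented for illustration, and the rule only knows English heading words.

```python
import re

# Hypothetical English-only rule: "Chapter 3", "Part IV", "Book 2", ...
# A text in another language needs its own pattern.
HEADING_RE = re.compile(r"^(chapter|part|book)\s+([0-9]+|[ivxlcdm]+)\b", re.IGNORECASE)


def split_blocks(text):
    """Split plain text into blocks separated by one or more blank lines."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def classify(block):
    """Label a block 'heading' or 'paragraph' from its first line only."""
    first_line = block.splitlines()[0]
    return "heading" if HEADING_RE.match(first_line) else "paragraph"


sample = (
    "Chapter 1\n\n"
    "It was a dark night.\nThe rain fell.\n\n"
    "Kapitel 2\n\n"
    "Es war Nacht."
)

# Pair each block's label with its first line for easy inspection.
labels = [(classify(b), b.splitlines()[0]) for b in split_blocks(sample)]
```

With this sample, "Chapter 1" is labelled a heading, while the German "Kapitel 2" is labelled a paragraph, which is the failure mode described above.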
Old 07-26-2010, 06:16 PM   #4
bjo
(What's under the hood?)

"There's a new Add Books Wizard on the ToDo list. It should be able to improve its ability to get metadata from the filename."

What's frustrating to me is the sense that Calibre gives me no say in the details of the conversion. E.g., if my starting point is an OCRed PDF where the space after a close-quote is often lost, I can theoretically go back and attack it with regexps, but this is pre- or post-processing, not central to Calibre's mechanisms.

But isn't Calibre effectively a library of regexps for searching and replacing particular patterns?

And couldn't it be 'opened up' so the user can preview each pattern Calibre thinks it's detected, and how it will be modified, and tweak this if it's not quite right? Or submit new patterns to its database?
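
For illustration, the lost-space-after-close-quote case could be written as a single previewable pattern roughly like this. It is a sketch in Python (calibre's language), not anything calibre ships: `MISSING_SPACE_RE`, `find_fixes`, and `apply_fixes` are hypothetical names, and the rule assumes a quote is a closing quote only when it directly follows sentence punctuation.

```python
import re

# Closing double quote (curly or straight) right after punctuation and
# directly followed by a letter, i.e. the space after it was lost.
# Assumption: quotes not preceded by . , ! ? are left alone, so an
# opening quote such as: said "Hello  is never touched.
MISSING_SPACE_RE = re.compile(r'([.,!?][”"])(?=[A-Za-z])')


def find_fixes(text, context=10):
    """Return (offset, before, after) tuples so each change can be previewed."""
    fixes = []
    for m in MISSING_SPACE_RE.finditer(text):
        start = max(m.start() - context, 0)
        before = text[start:m.end() + context]
        after = text[start:m.end()] + " " + text[m.end():m.end() + context]
        fixes.append((m.start(), before, after))
    return fixes


def apply_fixes(text):
    """Insert the missing space after every match."""
    return MISSING_SPACE_RE.sub(r"\1 ", text)


ocr_text = '"Go away,"she said. He said "Hello" and left.'
preview = find_fixes(ocr_text)
fixed = apply_fixes(ocr_text)
```

Here `preview` holds one proposed change with a little surrounding context, and `fixed` is the text with that change applied; a user could reject or edit the pattern after looking at the preview.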
Old 07-26-2010, 06:22 PM   #5
user_none
Quote:
Originally Posted by bjo View Post
But isn't Calibre effectively a library of regexps for searching and replacing particular patterns?
No. Regexes are used when necessary, but they are often avoided because they are slow. Tree processing, stream processing, and looping techniques are all used. That statement also doesn't take into account the fact that many formats are binary files that need to be read and written. It only looks at it from the perspective of shifting text markup.
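
As a rough illustration of what tree processing means here, the sketch below renames `<i>` elements to `<em>` by walking a parsed tree instead of pattern-matching the markup. It uses Python's standard-library `xml.etree.ElementTree` as a stand-in and is not taken from calibre; the function name `italics_to_emphasis` is hypothetical.

```python
import xml.etree.ElementTree as ET


def italics_to_emphasis(xhtml):
    """Rename every <i> element to <em> by walking the parsed tree."""
    root = ET.fromstring(xhtml)
    for elem in root.iter("i"):
        # Attributes, nested children, and trailing text stay attached to
        # the element, so nothing has to be re-matched by hand.
        elem.tag = "em"
    return ET.tostring(root, encoding="unicode")


doc = '<p>An <i class="t">old</i> tale, <b><i>retold</i></b>.</p>'
result = italics_to_emphasis(doc)
```

A regex doing the same job has to cope with attributes and nesting itself; on a parsed tree both come for free.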

Quote:
Originally Posted by bjo View Post
And couldn't it be 'opened up' so the user can preview each pattern Calibre thinks it's detected, and how it will be modified, and tweak this if it's not quite right? Or submit new patterns to its database?
calibre does most of the processing itself but also incorporates third-party tools to help. PDF input, for instance, is first handled by pdftohtml. It would be possible to open it up more. However, it's an open source application, and you can add your own processing plugins at various stages.
Old 07-26-2010, 06:23 PM   #6
bjo
Quote:
Originally Posted by user_none View Post
...Detecting the structure becomes very hard. There are heuristics that can be used but they quickly fall short when you move away from the language the heuristic is based on. We could do language specific processing...
Do you mean Computer language or Natural language?

I just favor LISP because I know it, but the database of patterns I'm dreaming of would just be like regexps.

If you mean natural languages, that's WAY out of my sphere of thinking.
Old 07-26-2010, 06:26 PM   #7
user_none
Quote:
Originally Posted by bjo View Post
Do you mean Computer language or Natural language?
Natural language. As in, what does a chapter heading look like, and how does it differ from a paragraph? This sounds easy, but one of the issues I ran into with the PDF line unwrapping code was that a chapter title looks remarkably similar to a poem.
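
The ambiguity shows up even in a toy unwrapper. The following Python sketch is not calibre's unwrapping code: `unwrap_lines` and its `min_wrap_len` threshold are assumptions made up for this example. It joins a line to the next one only when the line is long and does not end a sentence.

```python
def unwrap_lines(lines, min_wrap_len=40):
    """Join hard-wrapped lines into paragraphs.

    A line counts as wrapped (and is joined to the next one) when it is at
    least min_wrap_len characters long and does not end a sentence.
    """
    paragraphs, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        wrapped = len(line) >= min_wrap_len and not line.endswith((".", "!", "?", '"'))
        if not wrapped:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


page = [
    "The Road Home",
    "It had been raining for three days when the coach finally",
    "reached the village.",
    "The wind is low",
    "the night is long",
]
unwrapped = unwrap_lines(page)
```

The wrapped prose sentence is rejoined correctly, but the chapter title and the two verse lines all come out as short standalone lines: length and punctuation alone cannot tell a title from a poem.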
Old 07-26-2010, 07:06 PM   #8
kovidgoyal
creator of calibre
Posts: 45,300
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I encourage you to get your hands dirty with some calibre code. I'm sure you will find many areas where the code can be made more intelligent.

Instructions on getting started with calibre development are in the User Manual. If you need more help, feel free to ask.
All times are GMT -4.