Ebook AI for LISP hackers?

bjo · 07-26-2010, 11:40 AM

I'm an occasional ebook publisher/converter, so I've used Calibre a few times, but I still seem to find myself wrestling with regexps and manually tweaking formatting errors for hours...

As a rusty old LISP hacker, I find myself fantasizing about a LISP library that scans an etext for common formatting patterns, and automatically detects and corrects/converts the most common.

But that's what Calibre is supposed to do, yes? If I've had to do a lot of post-processing, does that just mean I'm using it wrong?

(I can't give particular examples without going back and doing some experiments, but I thought I'd inquire here first, and see if the consensus is that Calibre should/shouldn't be meeting those needs already.)

Starson17 · 07-26-2010, 02:27 PM

Quote:

Originally Posted by bjo

I thought I'd inquire here first, and see if the consensus is that Calibre should/shouldn't be meeting those needs already.

Calibre is great, but not yet perfect. There's a new Add Books Wizard on the ToDo list. It should be able to improve its ability to get metadata from the filename. There's a new PDF engine under development, etc.

My answer is that it should be meeting those needs, but not "already."

user_none · 07-26-2010, 06:14 PM

Quote:

Originally Posted by bjo

But that's what Calibre is supposed to do, yes? If I've had to do a lot of post-processing, does that just mean I'm using it wrong?

You're not using it wrong. Structure detection is still a work in progress for some formats.

Also, it's not as easy as you might think. Take the TXT input, for instance. There is no differentiation other than newlines. Detecting the structure becomes very hard. There are heuristics that can be used, but they quickly fall short when you move away from the language the heuristic is based on. We could do language specific processing... As Kovid is fond of saying, patches are welcome. Python isn't LISP, but it's a very easy and fun language.
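To illustrate why a heuristic tied to one natural language falls short, here is a minimal Python sketch. It is not calibre's actual code: the names `ENGLISH_HEADING`, `split_blocks`, and `classify` are made up for this example, and it assumes blocks in the TXT file are separated by blank lines.

```python
import re

# Hypothetical English-only rule: "Chapter 3", "Part IV", "Book 2", etc.
ENGLISH_HEADING = re.compile(
    r"^(chapter|part|book)\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE
)


def split_blocks(text):
    """Split plain text into blocks on blank lines (the only structure TXT offers)."""
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]


def classify(block):
    """Label a block 'heading' or 'paragraph' using the English-only rule."""
    lines = block.splitlines()
    # Only single-line blocks are considered as heading candidates.
    if len(lines) == 1 and ENGLISH_HEADING.match(lines[0]):
        return "heading"
    return "paragraph"


def classify_text(text):
    """Return (label, block) pairs for every block in the text."""
    return [(classify(block), block) for block in split_blocks(text)]
```

Run on a text that contains both "Chapter 1" and "Kapitel 2", this labels only the English line as a heading. The German heading is treated as an ordinary paragraph, which is the kind of failure described above.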

bjo · 07-26-2010, 06:16 PM

"There's a new Add Books Wizard on the ToDo list. It should be able to improve its ability to get metadata from the filename."

What's frustrating to me is the sense that Calibre gives me no say in the details of the conversion, e.g. if my starting point is an OCRed PDF where the space after a close-quote is often lost, I can theoretically go back and attack it with regexps, but this is pre- or post-processing, not central to Calibre's mechanisms.
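For concreteness, the close-quote case could be handled by a regexp along these lines. This is only a sketch under stated assumptions: the OCR output uses curly double quotes, and a closing quote that follows sentence punctuation and runs straight into a word character has lost its space. The function name is invented for the example.

```python
import re

# Assumption: closing quotes are the curly character U+201D. Straight quotes
# are left alone because an opening and a closing straight quote look identical.
MISSING_SPACE_AFTER_CLOSE_QUOTE = re.compile("([.,!?]\u201d)(?=\\w)")


def restore_space_after_close_quote(text):
    """Insert a space after punctuation + closing quote when a word follows directly."""
    return MISSING_SPACE_AFTER_CLOSE_QUOTE.sub(r"\1 ", text)
```

Text that already has the space is left unchanged, so the function can be run repeatedly without piling up spaces.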

But isn't Calibre effectively a library of regexps for searching and replacing particular patterns?

And couldn't it be 'opened up' so the user can preview each pattern Calibre thinks it's detected, and how it will be modified, and tweak this if it's not quite right? Or submit new patterns to its database?
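To make the idea concrete, a preview-then-apply rule set might look like the Python sketch below. Everything in it (`Rule`, `Proposal`, `preview`, `apply_rules`) is hypothetical and not part of Calibre; it only shows what "see each detected pattern and its proposed change before accepting it" could mean.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    pattern: str       # regular expression to search for
    replacement: str   # replacement template, may use \1, \2, ...


@dataclass
class Proposal:
    rule: str
    start: int         # offset of the match in the text
    before: str        # matched text
    after: str         # what it would become


def preview(text, rules):
    """Yield every change the rules would make, without modifying the text."""
    for rule in rules:
        for m in re.finditer(rule.pattern, text):
            yield Proposal(rule.name, m.start(), m.group(0), m.expand(rule.replacement))


def apply_rules(text, rules, accept=lambda proposal: True):
    """Apply the rules in order, keeping only the changes that `accept` approves."""
    for rule in rules:
        def repl(m, rule=rule):
            p = Proposal(rule.name, m.start(), m.group(0), m.expand(rule.replacement))
            return p.after if accept(p) else p.before
        text = re.sub(rule.pattern, repl, text)
    return text
```

A user-submitted pattern would then just be another `Rule` entry, and the `accept` callback is where a preview dialog could plug in.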

user_none · 07-26-2010, 06:22 PM

Quote:

Originally Posted by bjo

But isn't Calibre effectively a library of regexps for searching and replacing particular patterns?

No. Regexes are used when necessary, but they are often avoided because they are slow. Tree processing, stream processing, and looping techniques are all used. That statement also doesn't take into account the fact that many formats are binary files that need to be read and written. It only looks at it from the perspective of shifting text markup.

Quote:

Originally Posted by bjo

And couldn't it be 'opened up' so the user can preview each pattern Calibre thinks it's detected, and how it will be modified, and tweak this if it's not quite right? Or submit new patterns to its database?

calibre does most of the processing itself but also incorporates third party tools to help. PDF input for instance is first handled by pdftohtml. It would be possible to open it up more. However, it's an open source application and you can add your own processing plugins at various stages.

bjo · 07-26-2010, 06:23 PM

Quote:

Originally Posted by user_none

...Detecting the structure becomes very hard. There are heuristics that can be used but they quickly fall short when you move away from the language the heuristic is based on. We could do language specific processing...

Do you mean Computer language or Natural language?

I just favor LISP because I know it, but the database of patterns I'm dreaming of would just be like regexps.

If you mean natural languages, that's WAY out of my sphere of thinking.

user_none · 07-26-2010, 06:26 PM

Quote:

Originally Posted by bjo

Do you mean Computer language or Natural language?

Natural language. As in, what does a chapter heading look like and how does it differ from a paragraph? This sounds easy, but one of the issues I ran into with the PDF line unwrapping code was that a chapter title looks remarkably similar to a poem.

kovidgoyal · 07-26-2010, 07:06 PM

I encourage you to get your hands dirty with some calibre code. I'm sure you will find many areas where the code can be made more intelligent.

Instructions on getting started with calibre development are in the User Manual. If you need more help, feel free to ask.



