#1
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2010
Device: none
Ebook AI for LISP hackers?
I'm an occasional ebook publisher/converter, so I've used Calibre a few times, but I still seem to find myself wrestling with regexps and manually tweaking formatting errors for hours...
As a rusty old LISP hacker, I find myself fantasizing about a LISP library that scans an etext for common formatting patterns, and automatically detects and corrects/converts the most common ones. But that's what Calibre is supposed to do, yes? If I've had to do a lot of post-processing, does that just mean I'm using it wrong? (I can't give particular examples without going back and doing some experiments, but I thought I'd inquire here first, and see if the consensus is that Calibre should/shouldn't be meeting those needs already.)
#2
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
My answer is that it should be meeting those needs, but not "already."
#3
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Also, it's not as easy as you might think. Take the TXT input, for instance. There is no differentiation other than new lines, so detecting the structure becomes very hard. There are heuristics that can be used, but they quickly fall short when you move away from the language the heuristic is based on. We could do language-specific processing... As Kovid is fond of saying, patches are welcome. Python isn't LISP, but it's a very easy and fun language.
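To give a rough idea of the kind of heuristic meant here, the following is a minimal sketch of guessing how a plain-text file marks its paragraphs. It is not calibre's actual TXT input code; the function names `guess_paragraph_style` and `split_paragraphs` and the 0.2 thresholds are assumptions made up for illustration.

```python
import re

# Lines that start with a tab or at least two spaces count as "indented".
INDENT_RE = re.compile(r"^(\t| {2,})\S")


def guess_paragraph_style(text):
    """Guess how paragraphs are marked (hypothetical helper, arbitrary thresholds)."""
    lines = text.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    if not non_blank:
        return "empty"
    blank_ratio = (len(lines) - len(non_blank)) / len(lines)
    indented_ratio = sum(1 for ln in non_blank if INDENT_RE.match(ln)) / len(non_blank)
    if blank_ratio >= 0.2:
        return "block"        # paragraphs separated by blank lines
    if indented_ratio >= 0.2:
        return "indented"     # paragraphs start with an indented line
    return "single-line"      # every line is its own paragraph


def split_paragraphs(text):
    """Split text into paragraphs according to the guessed style."""
    style = guess_paragraph_style(text)
    if style == "block":
        chunks = re.split(r"\n\s*\n", text.strip())
        return [" ".join(chunk.split()) for chunk in chunks]
    if style == "indented":
        paragraphs = []
        for ln in text.splitlines():
            if not ln.strip():
                continue
            if INDENT_RE.match(ln) or not paragraphs:
                paragraphs.append(ln.strip())
            else:
                paragraphs[-1] += " " + ln.strip()
        return paragraphs
    return [ln.strip() for ln in text.splitlines() if ln.strip()]
```

Even this toy version has to pick thresholds out of thin air, and nothing in it knows about headings, verse, or language-specific conventions, which is where such heuristics start to fall short.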
#4
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2010
Device: none
(What's under the hood?)
"There's a new Add Books Wizard on the ToDo list. It should be able to improve its ability to get metadata from the filename."
What's frustrating to me is the sense that Calibre gives me no say in the details of the conversion. E.g., if my starting point is an OCRed PDF where the space after a close-quote is often lost, I can theoretically go back and attack it with regexps, but this is pre- or post-processing, not central to Calibre's mechanisms. But isn't Calibre effectively a library of regexps for searching and replacing particular patterns? And couldn't it be 'opened up' so the user can preview each pattern Calibre thinks it's detected, and how it will be modified, and tweak this if it's not quite right? Or submit new patterns to its database?
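To make the idea concrete, here is a minimal sketch of such a "pattern database" with a preview step, using the lost space after a close-quote as the single example rule. None of this is calibre code: the `Rule` class, the `preview` and `apply_rules` functions, and the regexp itself are assumptions for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One entry in a hypothetical user-editable pattern database."""
    name: str
    pattern: str
    replacement: str


RULES = [
    # Punctuation + closing quote (straight or curly) glued to the next word.
    Rule("space after close-quote", r'([.,!?]["\u201d])(?=[A-Za-z])', r"\1 "),
]


def preview(rule, text, context=15):
    """Return (before, after) snippets for every match, without changing text."""
    snippets = []
    for m in re.finditer(rule.pattern, text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        before = text[start:end]
        after = text[start:m.start()] + m.expand(rule.replacement) + text[m.end():end]
        snippets.append((before, after))
    return snippets


def apply_rules(text, rules, approved):
    """Apply only the rules whose names the user has approved."""
    for rule in rules:
        if rule.name in approved:
            text = re.sub(rule.pattern, rule.replacement, text)
    return text


sample = '"Stop,"he said. "Why?"she asked.'
for before, after in preview(RULES[0], sample):
    print(repr(before), "->", repr(after))
fixed = apply_rules(sample, RULES, approved={"space after close-quote"})
```

The user would look over the before/after pairs, tweak the pattern if it catches the wrong things, and only then approve the rule by name.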
#5
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
calibre does most of the processing itself but also incorporates third-party tools to help. PDF input, for instance, is first handled by pdftohtml. It would be possible to open it up more. However, it's an open source application, and you can add your own processing plugins at various stages.
#6
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2010
Device: none
I just favor LISP because I know it, but the database of patterns I'm dreaming of would just be like regexps. If you mean natural languages, that's WAY out of my sphere of thinking.
#7
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Natural language, as in: what does a chapter heading look like, and how does it differ from a paragraph? This sounds easy, but one of the issues I ran into with the PDF line unwrapping code was that a chapter title looks remarkably similar to a poem.
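A toy illustration of that ambiguity, assuming a simple length-based unwrapping rule. This is not the actual calibre unwrapping code; the `unwrap` function, the `ratio` parameter, and the sample text are made up for the example.

```python
def unwrap(lines, ratio=0.7):
    """Join hard-wrapped lines; a short line is assumed to end a paragraph.

    The longest line stands in for the wrap width, and any line shorter
    than ratio * that width is treated as the last line of a paragraph.
    """
    stripped_lines = [ln.strip() for ln in lines]
    lengths = [len(ln) for ln in stripped_lines if ln]
    if not lengths:
        return []
    cutoff = ratio * max(lengths)
    paragraphs, current = [], []
    for ln in stripped_lines:
        if not ln:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(ln)
        if len(ln) < cutoff:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    "Chapter One",
    "The rain had been falling since early morning and the",
    "streets were empty except for a single grey cat that",
    "watched from a doorway.",
    "Roses are red,",
    "violets are blue.",
]
result = unwrap(sample)
```

The prose lines are rejoined into one paragraph, but the chapter title and the two lines of verse all come out as short standalone paragraphs. Line length alone gives no way to tell which is which.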
#8
creator of calibre
Posts: 45,300
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I encourage you to get your hands dirty with some calibre code. I'm sure you will find many areas where the code can be made more intelligent.
Instructions on getting started with calibre development are in the User Manual. If you need more help, feel free to ask.