Is it possible: support another db backend?

Barafu · 05-11-2014, 12:44 PM

Good day. I have some experience in Python. Before starting to learn materials on Calibre plugins I want to ask if the thing I want to create is possible to implement at all, without rewriting too much.
There is a free database of articles, published by the government. Currently it contains 300 000 documents totaling 200 GB. Some have covers: the logo of the department they originate from. The format of the db is as follows: a text file with basic metadata (no covers, comments and so on) and a hundred 2 GB zip archives containing TXT and FB2 files.
I want to browse the db with Calibre. First, I tried a script that imports files directly into Calibre, but it got through barely 15% in 5 days of running 24/7. What's worse, this database is updated monthly, and I don't want to be constantly importing something. (And an extra 200 GB for a second copy of the data isn't a trifle, either.)
I want to try to make Calibre understand this db in place, in read-only mode. I want searching, reading locally, sending to email and maybe to ebook readers. Calibre developers, please tell me your opinion: is this possible with reasonable effort, or should I start my own application from scratch? Any advice on where to start would be appreciated.

aleyx · 05-11-2014, 02:52 PM

Right. As much as I love Calibre, I don't think it's the right tool for this particular job. I'm afraid that your best bet is to make your own application.

If you have more experience in Python than in, say, PHP, you could start by looking at CherryPy, which is the Python-based web server framework I use for small to medium custom projects. There are others, of course.

eschwartz · 05-11-2014, 02:52 PM

I think processing 200 GB worth of information is the real problem; that is never going to be fast.

Plaintext will not make for a very fast database either.

I'd recommend writing a script to determine when and where articles have been updated or added, then importing only the changes into calibre. After the first import there will be much less new material, so each update will take less time.
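
A minimal sketch of such a change-detection script, comparing last month's index with the current one. The layout of the index file is not given in the thread, so the tab-separated format with the article id in the first column is an assumption, and `load_index` and `diff_indexes` are hypothetical helper names.

```python
import csv
import hashlib
import io


def load_index(text):
    """Map article id -> hash of its full metadata row.

    Assumes one tab-separated record per line with the id in column 0;
    the real index format is not described in the thread.
    """
    rows = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        digest = hashlib.sha1("\t".join(row).encode("utf-8")).hexdigest()
        rows[row[0]] = digest
    return rows


def diff_indexes(old, new):
    """Return (added, changed, removed) lists of ids, each sorted."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return added, changed, removed


# Made-up sample data standing in for two monthly snapshots of the index
old_text = (
    "a1\tFirst title\tarchive_001.zip\n"
    "a2\tSecond title\tarchive_001.zip\n"
)
new_text = (
    "a1\tFirst title\tarchive_001.zip\n"
    "a2\tSecond title (rev.)\tarchive_001.zip\n"
    "a3\tThird title\tarchive_002.zip\n"
)

added, changed, removed = diff_indexes(load_index(old_text), load_index(new_text))
print(added, changed, removed)
```

Only the ids in `added` and `changed` would then need to go through the import.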

eschwartz · 05-11-2014, 02:58 PM

If you do end up writing your own application, calibre is licensed under the GPL, so you can use its code wherever useful. calibre has a lot of mature code for its ebook-viewer.exe component that may save you lots of time, for instance.

Barafu · 05-11-2014, 03:35 PM

Thank you for the feedback.
About plaintext being bad for metadata: sure, it can be converted to anything, as long as the process takes hours, not weeks. What I want to avoid is unpacking the original article archives.
By the way, I tried to speed up the Calibre import at first. I found out that the process is bound by HDD I/O, and that moving it to tmpfs speeds things up at least three times. Maybe there is a way to do a "fast" import of files, avoiding the usual import procedure? That would be a workaround for the problem.

eschwartz · 05-11-2014, 03:45 PM

You could use an SSD, even just for the calibre db alone.

Barafu · 05-11-2014, 04:51 PM

I already do; that doesn't help much. The best I could achieve is 1.5 files per second at the start, slowing down after that.
Setting aside the import variant and the "write my own app" variant (which I can always fall back to), there is one idea left. Maybe I can create some virtual device that pretends to be a reader with all these books on it?
I can create a standalone script that would present the metadata in any form. If only I could teach Calibre to take the actual books (and, preferably, covers) from the archives…
BTW, that format for text DBs is rather popular in the ex-USSR. Government publications, advertisements and books often come in that form. Supporting it, plus the recent movement to ban Windows from many organizations (including schools), would make this addition to Calibre rather popular.
P.S. Instead of my own app I could also use LibreOffice Base plus a set of scripts. At least I hope so. Calibre, however, offers book conversion, ereader support and other neat things.
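
A minimal sketch of taking one article straight out of an archive without unpacking the rest, using Python's standard `zipfile` module. The member names and the helper `read_member` are made up for illustration, and the tiny in-memory archive only stands in for one of the real 2 GB files.

```python
import io
import zipfile


def read_member(archive, member_name):
    """Return the bytes of a single member; other members stay packed."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member_name)


# Build a tiny in-memory archive so the sketch runs on its own
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("article_0001.txt", "Text of the first article")
    zf.writestr("article_0002.fb2", "<FictionBook/>")

text = read_member(buf, "article_0001.txt").decode("utf-8")
print(text)
```

With a real archive, `archive` would be a path on disk. `zipfile` looks members up through the archive's central directory, so reading one article does not require decompressing the others.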

chaley · 05-11-2014, 05:25 PM

One way to do this would be to build an application that constructs a complete calibre library from the database without using calibre. You would use the same schema that calibre uses along with compatible file naming conventions. Matching the schema would be made easier by starting with an empty calibre library.

There is no doubt that doing this would be a lot of work, but it is orders of magnitude less work than reinventing calibre.

On the other hand, is calibre the right target? Perhaps an academic bibliography manager would be more appropriate? I suspect that generating BibTeX would be a lot easier and possibly more useful.

BetterRed · 05-11-2014, 06:48 PM

I'm wondering if you can populate the library with content progressively.

A Calibre book folder does not have to contain a format file; some users use this feature as a means of recording 'books to get', or recording their paper books.

Perhaps you could create a viable (albeit empty) library of 300,000 articles from the text file you mention, which I assume is an index that includes a reference to the archive in which each article is located.

Maybe that reference could be 'munged' into a file URI (file:///thearchives/archive_002.zip) and popped into a custom comments-like column that you display in the Book Details area (normally to the right of the 'book' list). Then you could click on it to open the archive as required. Most (all?) archive utilities will let you open an archive member from within an archive: they extract it to a temporary folder and hand that file off to the relevant program.
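
A minimal sketch of that 'munging' step, assuming the archives live under an absolute POSIX directory and that the comments-like column renders HTML links. The directory, the file names and the helpers `archive_uri` and `archive_link` are hypothetical.

```python
from html import escape
from urllib.parse import quote
import posixpath


def archive_uri(archive_dir, archive_name):
    """Build a file URI for an archive under an absolute POSIX directory."""
    if not archive_dir.startswith("/"):
        raise ValueError("archive_dir must be an absolute POSIX path")
    # quote() percent-encodes spaces and other unsafe characters, keeping "/"
    return "file://" + quote(posixpath.join(archive_dir, archive_name))


def archive_link(archive_dir, archive_name):
    """Wrap the URI in an HTML anchor suitable for a comments-like column."""
    uri = archive_uri(archive_dir, archive_name)
    return '<a href="{}">{}</a>'.format(escape(uri, quote=True), escape(archive_name))


print(archive_uri("/thearchives", "archive_002.zip"))
print(archive_link("/thearchives", "archive 002.zip"))
```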

And of course you could drag the article (text file, fb2, cover) content out of the archive and drop onto the Book details (probably indirectly - archive->scratchpad->calibre)

You might be able to create that initial database using the Import List Plugin.

BR

kovidgoyal · 05-11-2014, 10:24 PM

You definitely should do this in two steps:

1) Create the empty records. Write a small script in Python to do that, using calibre APIs, and run it with calibre-debug script.py

2) Write a script to transfer the book files that avoids the calibre APIs and uses file renames (as opposed to file copies/moves) plus direct access to the data table in the calibre db to populate it with the entries.

It should be about a day's work, and should finish importing all 300K books in a few hours. Do it first with a few thousand books to get a sense of the performance and feasibility.

Sample code for (1)

Code:

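from calibre.library import db
from calibre.ebooks.metadata.book.base import Metadata

# Authors must be given as a list, even when there is only one
books = [Metadata('title1', ['author1']), Metadata('title2', ['author2'])]  # ... and so on
cache = db('path to library folder').new_api

for mi in books:
    # Creates the record only; no format files are added at this stage
    cache.create_book_entry(mi, apply_import_tags=False)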

For (2) you just need to create entries in the data table in metadata.db, which should be trivial, and rename the files into the calibre library using a naming scheme similar to the one calibre uses for its files.

However, running calibre with 300K entries is not going to be very performant. I suggest splitting up your archive into 5-10 libraries.
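
A rough sketch of step (2), under several assumptions that should be checked against a freshly created empty library before touching real data: that the `data` table has the columns `book`, `format`, `uncompressed_size` and `name`, that `books.path` holds each book's folder relative to the library, and that the article file has already been extracted onto the same filesystem as the library. The stand-in tables, the file naming and the helper `attach_format` are hypothetical, and calibre should not be running while its database is edited this way.

```python
import os
import sqlite3
import tempfile


def attach_format(conn, library_dir, book_id, src_file, fmt):
    """Rename src_file into the book's folder and record it in `data`."""
    (rel_path,) = conn.execute(
        "SELECT path FROM books WHERE id = ?", (book_id,)
    ).fetchone()
    book_dir = os.path.join(library_dir, rel_path)
    os.makedirs(book_dir, exist_ok=True)
    # Hypothetical naming: reuse the book folder name as the file stem
    name = os.path.basename(rel_path)
    dest = os.path.join(book_dir, name + "." + fmt.lower())
    size = os.path.getsize(src_file)
    os.rename(src_file, dest)  # a rename, not a copy: same filesystem only
    conn.execute(
        "INSERT INTO data (book, format, uncompressed_size, name)"
        " VALUES (?, ?, ?, ?)",
        (book_id, fmt.upper(), size, name),
    )
    return dest


# Stand-in library (assumed, simplified schema) so the sketch runs without calibre
library = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(library, "metadata.db"))
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, path TEXT)")
conn.execute(
    "CREATE TABLE data (id INTEGER PRIMARY KEY, book INTEGER, format TEXT,"
    " uncompressed_size INTEGER, name TEXT, UNIQUE(book, format))"
)
conn.execute("INSERT INTO books (id, path) VALUES (1, 'author1/title1 (1)')")

src = os.path.join(library, "incoming.txt")
with open(src, "w") as f:
    f.write("article text")

dest = attach_format(conn, library, 1, src, "txt")
conn.commit()
row = conn.execute(
    "SELECT book, format, uncompressed_size, name FROM data"
).fetchone()
print(dest, row)
```

In a real run the loop over all records would sit inside a single transaction, since committing once per book is likely to bring back the I/O bottleneck described earlier in the thread.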

Barafu · 05-12-2014, 02:18 AM

I guess I could use FUSE to present the archives as folders. But I will not be able to create metadata.opf for every book this way.
OK, I will read the manuals on db, try some experiments and be back in a few days.
