Database Fork

devils_add · 12-16-2013, 08:08 PM

Hi,
I am in the early stages of planning a fork of the Calibre e-book database.
The first part is the redesign of database storage/organization. The way I envision it, each book record, instead of living in "<author>\<Title>\", will be a single zip archive with an almost random numerical name, <some number>.zip.
Inside the archive I will have the ebooks and other data files associated with the record.

The second part is the data itself. I am thinking of moving it to an almost HTML-style format, like this:

<book>
  <file>filename</file>
  <format>
    <format:1>pdf</format:1>
    <format:2>djvu</format:2>
  </format>
  <title>Some Title</title>
  <author>
    <author:1>first middle last</author:1>
    <author:2>first middle last</author:2>
  </author>
</book>

That way it will be easier to add functionality in the future, and it will be easy to stay backward and forward compatible just by ignoring unknown parts.
This will also allow for nesting tags and nesting other titles, and leaves room for future expansion of functionality.
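
A minimal sketch of the "ignore unknown parts" idea, not the poster's actual design. Numbered tags such as <format:1> are not well-formed XML, so this sketch assumes repeated child elements instead; the <formats>, <authors> and <future-feature> element names and the read_book helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: repeated <format>/<author> children replace the
# numbered tags from the post, which a standard XML parser would reject.
RECORD = """
<book>
  <file>filename</file>
  <formats>
    <format>pdf</format>
    <format>djvu</format>
  </formats>
  <title>Some Title</title>
  <authors>
    <author>first middle last</author>
    <author>first middle last</author>
  </authors>
  <future-feature>added by a newer version</future-feature>
</book>
"""


def read_book(xml_text):
    """Read only the fields this reader knows about; skip everything else."""
    root = ET.fromstring(xml_text)
    return {
        "file": root.findtext("file"),
        "title": root.findtext("title"),
        "formats": [e.text for e in root.findall("formats/format")],
        "authors": [e.text for e in root.findall("authors/author")],
    }


book = read_book(RECORD)
print(book)
```

An older reader built this way simply never looks at <future-feature>, which is the forward compatibility the post is after. It says nothing yet about how fast such records can be searched.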

aleyx · 12-17-2013, 04:46 AM

Hm. I don't quite understand.

Are you trying to develop an alternate, drop-in replacement for the metadata.db + filesystem that Calibre uses for storage?

devils_add · 12-17-2013, 02:56 PM

Quote:

Originally Posted by aleyx

Hm. I don't quite understand.

Are you trying to develop an alternate, drop-in replacement for the metadata.db + filesystem that Calibre uses for storage?

To some extent, yes. The reason is that with the correct redesign of the filesystem, Calibre will be able to organize almost everything. Therefore, it will need a more robust metadata.db. In addition, the reason for archiving is that it makes it easy to transfer items between libraries without having to worry that you will import them wrong and have to edit the metadata again, as everything will be inside that archive.

aleyx · 12-17-2013, 03:24 PM

Ah.

You do realize that with a segmented XML database (what you call "almost html style formatting") hidden away in .zip files, performance will take a pummelling not seen since Wile E. Coyote still tried to get himself a side serving of roasted roadrunner?

See, changing the filesystem hierarchy is one thing. In the end, it's just strings. But getting away from an RDBMS? That is not, I repeat, _not_, something you want to do.

eschwartz · 12-17-2013, 04:49 PM

Quote:

Originally Posted by devils_add

To some extent, yes. The reason is that with the correct redesign of the filesystem, Calibre will be able to organize almost everything. Therefore, it will need a more robust metadata.db. In addition, the reason for archiving is that it makes it easy to transfer items between libraries without having to worry that you will import them wrong and have to edit the metadata again, as everything will be inside that archive.

I'm pretty sure we already have that with the currently working system.

aleyx · 12-17-2013, 05:28 PM

I think he wants to make it the main (and only) database. Now I'm no DBA, but it has me very scared.

Because you see, devils_add, if your DB is scattered into thousands of little XML files inside thousands of .zip files, then you'll have to open all of those .zip files and read all of those XML files every time you want to do anything, like, say, list titles. If you want to _search_, it's even worse, because then you'll have to open it all up again, _then_ make cross-references for basically every single field of every single XML file.

That's pure insanity. There's a reason DBMSs have been around since the '70s. It's because they _work_.

Now XML/OPF files have their use, but database queries ain't it.

As someone who once had to convert an old, OLD flat-file-based DB to Access (which is still not a real database but less wrong), I beg you: spare yourself the pain.
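
To make the cost argument concrete, here is a rough sketch under stated assumptions: each book is a <number>.zip holding a metadata.xml (a hypothetical layout based on the proposal), and the RDBMS side is a one-table SQLite database, not Calibre's real metadata.db schema. Listing titles from the archives means opening and parsing every file; the database answers with one query.

```python
import sqlite3
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def make_library(folder, titles):
    """Write one <number>.zip per book, each with a metadata.xml (assumed layout)."""
    for number, title in enumerate(titles, start=1):
        with zipfile.ZipFile(folder / f"{number}.zip", "w") as archive:
            archive.writestr("metadata.xml", f"<book><title>{title}</title></book>")


def titles_from_zips(folder):
    """List titles the slow way: open every archive and parse every XML file."""
    titles = []
    for path in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(path) as archive:
            root = ET.fromstring(archive.read("metadata.xml"))
            titles.append(root.findtext("title"))
    return titles


def titles_from_db(connection):
    """List titles with a single query against an indexed table."""
    return [row[0] for row in connection.execute("SELECT title FROM books ORDER BY id")]


sample = ["Some Title", "Another Title", "Third Title"]

with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp)
    make_library(library, sample)
    zip_titles = titles_from_zips(library)

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
connection.executemany("INSERT INTO books (title) VALUES (?)", [(t,) for t in sample])
db_titles = titles_from_db(connection)

print(zip_titles, db_titles)
```

Both paths return the same list here; the difference is that the archive path does one file open, one decompression and one XML parse per book, for every listing or search.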

eschwartz · 12-17-2013, 05:42 PM

We already have xml backups saved with the book. As backups, which is where metadata xml belongs. Why on earth should the database be replaced to use this instead, purely for the purpose of fixing an imaginary problem?

What do you think databases were invented for anyway?

devils_add · 12-17-2013, 06:46 PM

Guys, you are forgetting about the DMG file extension in Apple, where everything the program needs to run is inside that file (which is an archive). So, what I am proposing is to have just the metadata file associated with a record, and the record itself, inside the archive, which will be added to the main database. The only time the archive is written to is when files are added or when metadata is changed, and at all other times it is opened only to extract a needed file to read it or to send it to the device.
Therefore, the main database file will be outside, as it is right now. Also, you can have the main database link to virtual library databases for different, incompatible formats. In addition, this will allow for creating a single database for everything, with different front-ends, so that you can have all your collections managed by just one database.
Sorry for the wordiness.

Also, with the archive architecture, you can keep some PDF books split by chapter and combine them on the fly as requested (resources permitting), so you don't have to download the full book, just the chapters you need.
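
One way to read this proposal is as a hybrid: a self-contained archive per book, plus a main database outside that is filled from the archives once, so browsing never opens them. The sketch below assumes that reading; the metadata.xml member name, the books table and the build_archive / index_archive helpers are all hypothetical.

```python
import io
import sqlite3
import zipfile
import xml.etree.ElementTree as ET


def build_archive(title, files):
    """Return the bytes of a per-book zip: metadata.xml plus the book files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        meta = ET.Element("book")
        ET.SubElement(meta, "title").text = title
        archive.writestr("metadata.xml", ET.tostring(meta))
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def index_archive(connection, archive_name, archive_bytes):
    """Copy the archive's metadata into the main database (done once, on import)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("metadata.xml"))
        formats = sorted(
            name.rsplit(".", 1)[-1]
            for name in archive.namelist()
            if name != "metadata.xml"
        )
    connection.execute(
        "INSERT INTO books (archive, title, formats) VALUES (?, ?, ?)",
        (archive_name, root.findtext("title"), ",".join(formats)),
    )


main_db = sqlite3.connect(":memory:")
main_db.execute("CREATE TABLE books (archive TEXT, title TEXT, formats TEXT)")

payload = build_archive("Some Title", {"book.pdf": b"%PDF-", "book.djvu": b"AT&T"})
index_archive(main_db, "1001.zip", payload)

row = main_db.execute("SELECT archive, title, formats FROM books").fetchone()
print(row)
```

With this split, the archive is the transfer unit and the database stays the query engine, which is close to how an RDBMS plus per-book metadata backups already divide the work.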

BetterRed · 12-17-2013, 07:17 PM

Quote:

Originally Posted by devils_add

Calibre will be able to organize almost everything. Therefore, it will need a more robust metadata.db.

Calibre can already do that to some degree

Calibre could be more general purpose in a user-friendly sense, if the user could define the labels for the Author/Book Title entities to whatever suits their purpose - e.g. Architect/Building; Software Package/Program; Director/Movie Name; Producer/Game etc.

If one could also add a third entity into the hierarchy then that would probably be enough to cover 90% of potential uses.

==================

I'm puzzled by what you mean by a 'more robust metadata.db'.

I've been using calibre for 2-3 years, and I use it one way or another on most days of the week. It has never crashed, I've never had to rebuild a database nor have I ever had to reinstall calibre.

I wish I could say the same for some other programs that I use - e.g. web browsers, editors, IDEs, photo and music library managers - even the file manager I use crashes at least once a week.

The only time performance has been an issue, was related to a custom column based on a union of 4 other custom columns - each of which was a list of Names. I intuited at the time I did it that I was pushing the edge of the envelope, so I had a Plan B for what to do when the envelope tore.

====================

I'm also intrigued as to how you would envisage implementing a multi-user server based implementation of your schema on different server platforms.

BR

Addendum: @devils_add - I missed seeing your most recent post before I posted this, vagaries of phone interruptus

aleyx · 12-18-2013, 05:40 AM

Quote:

Originally Posted by devils_add

Guys, you are forgetting about the DMG file extension in Apple.

Dude, you're forgetting the concept of platform-agnosticism. I can install Calibre on pretty much anything. You're suggesting that only MacOS is worthy of Calibre?

Quote:

Originally Posted by devils_add

Where everything the program needs to run is inside that file (which is an archive). So, what I am proposing is to have just the metadata file associated with a record, and the record itself, inside the archive, which will be added to the main database.

Soooo... Everything in one file? That's even _worse_. To access your bit of database (Hah! bit! Get it? ^_^), you'll have to look into a big proprietary file, then into a small .zip, then into an XML file? And that big proprietary file can be as small as a few dozen MB for a few books, up to several GB for sizable libraries? My own library is ~950MB and I only have about 1500 books. There are libraries out there with tens of thousands. I/O will be a nightmare.

Quote:

Originally Posted by devils_add

The only time the archive is written to is when files are added or when metadata is changed,

Which is basically every time you use Calibre.

Quote:

Originally Posted by devils_add

and at all other times it is opened only to extract a needed file to read it or to send it to the device. Therefore, the main database file will be outside, as it is right now. Also, you can have the main database link to virtual library databases for different, incompatible formats. In addition, this will allow for creating a single database for everything, with different front-ends, so that you can have all your collections managed by just one database.

Sorry, I really can't picture it. Is it one database, or databases with "links to virtual library databases"? What are you calling "database" in this context? The way you talk about it, I think you mean "one big Apple-format archive in which there's one .zip per book, and in each .zip there's all the formats and one XML file describing the metadata for the book".

That's not a database, that's a .tar backup.

If that's not it, you really really need to do some kind of ASCII art of your file hierarchy, because right now I'm in the dark.

Quote:

Originally Posted by devils_add

Sorry for the wordiness.

I'm really sorry, but the problem here is not the words, it's the concept.

Quote:

Originally Posted by devils_add

Also, with the archive architecture, you can keep some PDF books split by chapter and combine them on the fly as requested (resources permitting), so you don't have to download the full book, just the chapters you need.

First rule of arch rewrite: aim for feature equality, _then_ build upon it. Calibre's granularity doesn't go down to the chapter. That's why it can't do that. If it did, it would.

Also, the resources you may (I say _may_) save with that system are utterly dwarfed by the resources you'll use just to read your database.

There's no two ways of doing database-driven file management on consumer hardware. There's only one. There's one system, made of an RDBMS on one side and a filesystem on the other. Some of those files, if they're text files (as opposed to binary files) can be compressed, but that's as far as it can go.

I know that because I've already tried it all, ever since I first discovered databases twenty years ago. Your system? I made one like that, more or less, when I was 16. At the time it was a VB4-based management system for the fanfiction I downloaded from R.A.A.C. (ah, those were the times...), and I was trying to put Eyrie Productions' Undocumented Features into some sort of reading order with bookmarks, because even back then it was _huge_. It took me a few weeks before I scrapped the idea of a text-file-based DB and turned to Access (there was no SQLite in those far away times...). I had much better results.

So, learn from the mistakes of an old (well, 36-year-old) pro and use an RDBMS. That's what they're made for. They're good at it.

DoctorOhh · 12-18-2013, 05:56 AM

Quote:

Originally Posted by devils_add

To some extent, yes. The reason is that with the correct redesign of the filesystem, Calibre will be able to organize almost everything. Therefore, it will need a more robust metadata.db.

Whether I fully understand your goal or not makes no difference. When you're ready for folks to test it, I'll be happy to give it a try. Even if you decide that your idea wasn't the cat's pajamas, you may end up writing code that can be merged into the current codebase with new features or speed enhancements. Quality contributors are always welcome, and you have to cut your teeth on the code somehow.

Quote:

Originally Posted by devils_add

In addition, the reason for archiving is that it makes it easy to transfer items between libraries without having to worry that you will import them wrong and have to edit the metadata again, as everything will be inside that archive.

Calibre already has this capability in the Copy to library feature. There is no "worry that you will import it wrong" since it is a direct copy of the record from one library to another.

Good Luck with your fork.

At_Libitum · 12-28-2013, 11:08 PM

I can see where the idea comes from. Using one big container file with its own internal 'filesystem' was, and still is, common in a lot of games, and most of these games were also released on several platforms. So in that respect the idea is not that strange. The one thing that is different here is the dynamic nature of a library compared to the static environment of game resources. You'd have to go in the direction of virtual HD files or something similar, and let the current RDBMS use the container file to write to and read from instead of trying to recreate the RDBMS. But, like physical HDs, virtual file systems tend to get fragmented the same way, with the same side effects. That means reorganization is needed, which means needing at least as much free disk space as the size of the container file, preferably double that.

It may look like a good idea, but it has one helluva drawback. If something, however small, breaks in the container file, it's bye-bye library. At least in the current situation, all books stay intact. That means you probably want to maintain some kind of parity system for repairs if worst comes to worst. In the end, the risk of some mishap occurring to a virtual file system is much higher than to a physical one. Files get damaged much more often than HDs.
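
As a small illustration of the damage concern, the sketch below uses Python's zipfile.ZipFile.testzip(), which re-reads every member and checks its CRC. It only detects damage; an actual parity or repair scheme, as suggested above, would need extra redundancy data that is not shown here. The member names and the build_archive / first_damaged_member helpers are hypothetical.

```python
import io
import zipfile


def build_archive(files):
    """Return the bytes of a small container archive."""
    buffer = io.BytesIO()
    # ZIP_STORED keeps members uncompressed so the demo can damage them predictably.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def first_damaged_member(archive_bytes):
    """Return the name of the first member whose CRC check fails, or None."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.testzip()


good = build_archive({"book.txt": b"chapter one", "notes.txt": b"extra material"})

# Simulate a single flipped character inside one stored member.
bad = good.replace(b"chapter one", b"chapter 0ne")

print(first_damaged_member(good))
print(first_damaged_member(bad))
```

A check like this tells you which member went bad, not how to get it back, which is why a backup or parity layer would still be needed for a single large container.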

devils_add · 01-21-2014, 07:25 PM

Quote:

Originally Posted by At_Libitum

I can see where the idea comes from. Using one big container file with its own internal 'filesystem' was, and still is, common in a lot of games, and most of these games were also released on several platforms. So in that respect the idea is not that strange. The one thing that is different here is the dynamic nature of a library compared to the static environment of game resources. You'd have to go in the direction of virtual HD files or something similar, and let the current RDBMS use the container file to write to and read from instead of trying to recreate the RDBMS. But, like physical HDs, virtual file systems tend to get fragmented the same way, with the same side effects. That means reorganization is needed, which means needing at least as much free disk space as the size of the container file, preferably double that.

It may look like a good idea, but it has one helluva drawback. If something, however small, breaks in the container file, it's bye-bye library. At least in the current situation, all books stay intact. That means you probably want to maintain some kind of parity system for repairs if worst comes to worst. In the end, the risk of some mishap occurring to a virtual file system is much higher than to a physical one. Files get damaged much more often than HDs.

Actually, I am not proposing one big encapsulation of the whole database, just of the items in it. So if you look at the database as a tree and the data inside the final folder as a leaf, I am proposing to encapsulate that final folder (OK, I will probably add some file structure to it so that, for example, you could store a PDF book broken into chapters and combine it as needed, keep audio books in one place, or have a folder for supplementary material which sometimes comes with a book).
The only thing I will have in a big file will be the general database, so that I don't have to re-scan everything again. However, even that might not be true, as I am planning for the database to be locatable on different hard drives, not just different folders. Therefore, each location will have a local backup database, which the main database will load from and refer to.
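
A rough sketch of the "main database loads from local databases" idea, assuming each location keeps a small SQLite file with a books table. The file name local.db, the drive_a / drive_b folders and both helpers are hypothetical; SQLite's ATTACH DATABASE is used to read each local file into an in-memory main index.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_local_db(path, titles):
    """Create a per-location database listing the books stored there."""
    connection = sqlite3.connect(str(path))
    connection.execute("CREATE TABLE books (title TEXT)")
    connection.executemany("INSERT INTO books VALUES (?)", [(t,) for t in titles])
    connection.commit()
    connection.close()


def load_main_index(local_paths):
    """Build the main index by attaching each local database in turn."""
    main = sqlite3.connect(":memory:")
    main.execute("CREATE TABLE books (location TEXT, title TEXT)")
    for path in local_paths:
        main.execute("ATTACH DATABASE ? AS loc", (str(path),))
        main.execute(
            "INSERT INTO books SELECT ?, title FROM loc.books",
            (path.parent.name,),
        )
        main.commit()  # DETACH is not allowed inside an open transaction
        main.execute("DETACH DATABASE loc")
    return main


with tempfile.TemporaryDirectory() as tmp:
    locations = []
    for folder, titles in [("drive_a", ["Some Title"]), ("drive_b", ["Another Title"])]:
        directory = Path(tmp) / folder
        directory.mkdir()
        make_local_db(directory / "local.db", titles)
        locations.append(directory / "local.db")

    main_index = load_main_index(locations)
    rows = main_index.execute(
        "SELECT location, title FROM books ORDER BY location"
    ).fetchall()

print(rows)
```

The sketch leaves open what happens when a drive is missing or a local database disagrees with the main one, which is the part the thread keeps asking about.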

aleyx · 01-22-2014, 04:20 AM

So, if I understand correctly, you have:
- One directory with as many .zip as you have books,
- One .zip with a "general database".

What's in the latter?
