I am trying to import a random collection of ebooks with a filename structure from hell. I have customized the regex to parse <author> - <title> as best I could, based on roughly half of the patterns I found. But there are many variations in my filenames: dashes inside the titles themselves, author/title flipped, author first/last names flipped, and so on. The result is a Calibre directory structure full of meaningless author/title information. The fact that Calibre discards the original filename makes it that much harder to correct.
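For reference, here is a minimal sketch of the kind of pattern I mean, using the named-group style of Calibre's filename regex. The file names and the exact pattern are only an illustration of my situation, not anything from Calibre itself:

    import re
    from pathlib import Path

    # Illustrative pattern for "Author - Title.ext" file names
    # (named groups in the style of Calibre's filename-pattern option).
    FILENAME_PATTERN = re.compile(r'(?P<author>.+?) - (?P<title>.+)')

    def guess_metadata(filename):
        """Guess (author, title) from a file name, or return None if it doesn't match."""
        m = FILENAME_PATTERN.match(Path(filename).stem)
        if m is None:
            return None
        return m.group('author').strip(), m.group('title').strip()

    # Works as intended: ('Asimov', 'Foundation and Empire')
    print(guess_metadata('Asimov - Foundation and Empire.epub'))

    # Silently wrong for a flipped "Title - Author" name:
    # returns ('Foundation and Empire', 'Asimov') with the fields swapped.
    print(guess_metadata('Foundation and Empire - Asimov.epub'))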
Here is an idea which I'm sure has been suggested a million times.
Instead of bothering with a directory structure that no user should see anyway, Calibre could just dump all files onto disk in their original form.
The database would keep a reference to each file.
There would be no need to rename folders or move files around, which should also be more efficient performance-wise.
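To be concrete about what I mean, something like the record below. This is purely hypothetical and not Calibre's actual schema; the library entry just points at the untouched file wherever it was dumped:

    # Hypothetical record layout: the file is never renamed or moved,
    # only the database row carries the corrected metadata.
    book_record = {
        'id': 42,
        'path': 'imports/Foundation and Empire - Asimov.epub',  # untouched original
        'author': 'Asimov',
        'title': 'Foundation and Empire',
    }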
To help people with bad original filenames like mine, allow iterative reimporting with the following workflow:
1. The user imports the original set, say 10,000 ebooks.
2. Half of them have title/author information that does not match the original regex pattern.
3. The user selects a swath of rows in Calibre, right-clicks, and chooses a menu option "Reimport".
4. The user is asked to create a new regex.
5. Calibre knows the original filename of each item in the table.
6. Calibre iterates through the list and reimports using the new regex pattern (a rough sketch of what I mean follows this list).
7. The user continues to refine the import, returning to step 1.
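A rough sketch of what I imagine steps 5 and 6 doing. All names here (Book, reimport) are invented for illustration, not Calibre internals; the point is that because the original filename was kept in the database, a new pattern can be applied to the selected rows without ever touching the files:

    import re
    from dataclasses import dataclass
    from pathlib import Path

    # Hypothetical sketch of steps 5-6; not Calibre code.
    @dataclass
    class Book:
        original_filename: str      # kept in the database at import time (step 5)
        author: str = ''
        title: str = ''

    def reimport(selected_books, new_pattern):
        """Re-apply a new filename regex to the selected rows (step 6)."""
        pattern = re.compile(new_pattern)
        for book in selected_books:
            m = pattern.match(Path(book.original_filename).stem)
            if m is None:
                continue            # leave rows the new pattern can't parse untouched
            book.author = m.group('author').strip()
            book.title = m.group('title').strip()
            # only the database record changes; the file itself never moves

    # e.g. a second pass over the flipped "Title - Author" half of the collection
    rows = [Book('Foundation and Empire - Asimov.epub')]
    reimport(rows, r'(?P<title>.+) - (?P<author>[^-]+)')
    print(rows[0].author, '/', rows[0].title)   # Asimov / Foundation and Empire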
What does the community think?