|05-21-2009, 02:25 AM||#16|
Join Date: Sep 2008
Device: Nokia 770 (fbreader)
|05-29-2009, 06:39 PM||#17|
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
Calibre: Import book metadata from filename
Unlike the other Regular Expression samples provided in this topic, this one is intended to be used on a book's filename, not on the text of the book. Many books carry no metadata inside the file, so that information is kept in the filename instead. By using this expression you can import the book into Calibre and have the metadata fields filled in automatically.
Users should still review all books that are automatically imported in this manner. Despite the complex testing and accuracy of this expression, errors can and will occur in unique circumstances.
Please note that the Regular Expression which follows is the work of multiple people: Gwynevans, Darkmonk (and the unix forum that assisted him) and myself. The unix forum undoubtedly gets the lion's share of accolades.
Description: Automates metadata entry into Calibre when adding a book based on the filename.
Example of input: John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf
Example of output: (The output is what data Calibre has in its fields after import is completed.)
<author> = John D. Smith-Jones
<series> = Bibliographic Perfection
<series_index> = 189
<title> = The Perfect Book
Requirements: This regular expression is used in Calibre under the user-defined Preferences>Advanced>Regular Expression option. The imported filenames must follow the format "<author> - <series> <series_index> - <title>", using exactly space dash space between fields and a single space between the series name and its index (as in the example input above). The series and series index portions are optional; filenames without them will still be imported correctly.
* All filenames must follow the explicit format mentioned in Requirements, above. Otherwise they will not import correctly.
* Does not test for name order (lastname, firstname).
* Does not test for multiple authors, but will accept multiples as a single entry in the author field.
* Extra " - "s within the <author> or <series> fields mangle importation.
* Hyphenated names like "Smith-Jones" are fine; "Smith - Jones" mangles importation.
* Leetspeak has limited importation - depending on the exact character combination used.
* Titles generally are not affected by the above since the regex allows all characters once the series name and index is obtained.
* Since titles are not tested at all, serious typographic errors will go undetected on import. Manual review is the only means of correcting.
(Note: This is a complex regular expression and a literal translation would be difficult. It makes extensive use of negative lookahead, nested groups and capturing of data elements. Because of this a more generic translation is provided.)
* (?P<author>((?!\s-\s).)*): Accept all characters up to, but not including " - ", as a part of the author's name. Allows for hyphenated names, names with apostrophes, multiple authors (A & B; A and B; A, B; etc).
* \s-: Moves testing past the first " - " delimiter.
* (?:\s: Initial test for the presence of the optional elements.
* ((?P<series>.+): Through multiple tests here and in the next two portions, accepts everything up to the next " - " delimiter in a captured group. Then it just extracts the series name from that group, using a single space between text and numbers as the delimiter.
* (?P<series_index>\d+): Pulls the numbers out of the captured group as the index. Will accept any number.
* ((?!\s-\s).)*): Finds the end of the optional elements.
* \s-)?\s: Moves testing past the optional elements and the last recognized delimiter. After this the use of the " - " delimiter no longer matters as it is not tested for in the remaining expression.
* (?P<title>.*): All remaining characters are accepted as part of the title. If the book includes " - " as a legitimate part of the title, it will be accepted without errors.
Regexp-modifiers: case-insensitive, single-line, greedy - but regularly gives back
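The full expression is not reproduced here, so the following is only a sketch of how the fragments listed above fit together when tried outside Calibre with Python's re module. The \s between the series and series_index groups is an assumption (it is not one of the listed fragments, but the "single space" remark and the example input imply it). The names FILENAME_RE and parse_filename are made up for illustration, and stripping the extension by hand is also an assumption about what happens on import.

```python
import os
import re

# Fragments from the breakdown above, joined in order.
# ASSUMPTION: the r"\s" between series and series_index is not in the
# listed fragments; it is added so that "Series Name 189" splits correctly.
FILENAME_RE = re.compile(
    r"(?P<author>((?!\s-\s).)*)"   # everything before the first " - "
    r"\s-"                         # step past the first delimiter
    r"(?:\s"                       # start of the optional series part
    r"((?P<series>.+)"             # series name (greedy, backtracks)
    r"\s"                          # assumed single-space separator
    r"(?P<series_index>\d+)"       # numeric series index
    r"((?!\s-\s).)*)"              # anything left before the next " - "
    r"\s-)?\s"                     # end of optional part, last delimiter
    r"(?P<title>.*)",              # the rest is the title
    re.IGNORECASE | re.DOTALL,     # case-insensitive, single-line
)


def parse_filename(filename):
    """Return a dict of metadata fields, or None if nothing matches.

    Expects a filename that ends in an extension such as ".pdf".
    """
    stem = os.path.splitext(filename)[0]
    match = FILENAME_RE.match(stem)
    if match is None:
        return None
    fields = ("author", "series", "series_index", "title")
    return {name: match.group(name) for name in fields}


with_series = parse_filename(
    "John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf"
)
without_series = parse_filename("Jane Doe - Standalone Novel.epub")
print(with_series)
print(without_series)
```

For a filename without a series, the series and series_index entries come back as None. A name written as "Smith - Jones" ends up split between author and title, which matches the limitation noted above.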
Last edited by Sabardeyn; 05-29-2009 at 06:52 PM. Reason: Fixed layout and typos.
|06-17-2013, 02:56 AM||#19|
Join Date: Jan 2012
Location: South Africa
Device: Kindle 4
I don't want to copy/paraphrase your entry, so I'll ask you to add this regex. It works exactly the same as yours, except that it expects the filename format:
author - [series series_index] title
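The regex from this post is not reproduced here, so what follows is not the poster's expression. It is only a hypothetical sketch of one way the bracketed format could be matched in Python, assuming a single space between the series name and its index and a single space after the closing bracket. BRACKET_RE and parse_bracket_name are illustrative names.

```python
import re

# HYPOTHETICAL pattern for "author - [series series_index] title".
# Not the regex from the post; the exact spacing rules are assumptions.
BRACKET_RE = re.compile(
    r"(?P<author>((?!\s-\s).)+)"   # everything before the first " - "
    r"\s-\s"                       # the only " - " delimiter
    r"(?:\["                       # optional "[series index] " part
    r"(?P<series>.+?)"             # series name (lazy)
    r"\s(?P<series_index>\d+)"     # single space, then the index
    r"\]\s)?"
    r"(?P<title>.+)",              # the rest is the title
    re.IGNORECASE,
)


def parse_bracket_name(stem):
    """Parse a filename whose extension has already been removed."""
    match = BRACKET_RE.match(stem)
    if match is None:
        return None
    fields = ("author", "series", "series_index", "title")
    return {name: match.group(name) for name in fields}


bracketed = parse_bracket_name(
    "John D. Smith-Jones - [Bibliographic Perfection 189] The Perfect Book"
)
plain = parse_bracket_name("Jane Doe - Standalone Novel")
print(bracketed)
print(plain)
```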
|06-17-2013, 08:51 AM||#20|
Join Date: Sep 2010
Device: prs-t1, phone/Cool Reader, tablet/BlueFire, Nook Simple
What it does:
Flags typical OCR errors (at least in tesseract)
([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])
Limitations: False positives, especially abbreviations which are followed by lower-case letters.
Syntax: Perl & similar. Add \ before (, ) and | to get a "regular" (basic) regexp.
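As a rough illustration, the expression above can be run over a piece of text with Python's re module to list the suspicious spots. The function name flag_ocr_suspects and the sample strings are made up for this sketch.

```python
import re

# The expression from the post, unchanged.
OCR_SUSPECT_RE = re.compile(r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])")


def flag_ocr_suspects(text):
    """Return (offset, matched_text) pairs for every suspicious spot."""
    return [(m.start(), m.group(0)) for m in OCR_SUSPECT_RE.finditer(text)]


# Lower-case followed by upper-case, letter followed by digit,
# full stop followed by lower-case, space before ";" or ":".
sample = "The catSat on l0g. then left ; fine."
hits = flag_ocr_suspects(sample)
print(hits)

# An abbreviation followed by a lower-case word is flagged too
# (the false positive mentioned above).
abbrev_hits = flag_ocr_suspects("e.g. this")
print(abbrev_hits)
```

Each hit still has to be checked by eye; the pattern only points at places worth a look.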