MobileRead Forums - View Single Post

Sabardeyn · 05-29-2009, 06:39 PM

Unlike the other Regular Expression samples provided in this topic, this one is intended to be used on a book's filename, not on the text of the book. Many books carry no metadata inside the file, so that information is kept in the filename instead. By using this expression you can import the book into Calibre and have the data entered automatically.

Users should still review all books that are automatically imported in this manner. Despite the complex testing and accuracy of this expression, errors can and will occur in unique circumstances.

Please note that the Regular Expression which follows is the work of multiple people: Gwynevans, Darkmonk (and the unix forum that assisted him) and myself. The unix forum undoubtedly gets the lion's share of accolades.

Description: Automates metadata entry into Calibre when adding a book based on the filename.

Example of input: John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf

Example of output: (The output is what data Calibre has in its fields after import is completed.)
<author> = John D. Smith-Jones
<series> = Bibliographic Perfection
<series_index> = 189
<title> = The Perfect Book

Requirements: This regular expression is used in Calibre under the user-defined Preferences>Advanced>Regular Expression option. The imported filenames must follow the format "<author> - <series> <series_index> - <title>": fields are separated by space dash space only, and the series name and its index are separated by a single space, as in the example above. The series and series index portions are optional; filenames without them will be imported correctly.

Faults:
* All filenames must follow the explicit format mentioned in Requirements, above. Otherwise they will not import correctly.
* Does not test for name order (lastname, firstname).
* Does not test for multiple authors, but will accept multiples as a single entry in the author field.
* Extra " - "s within the <author> or <series> fields mangle importation.
* Hyphenated names like "Smith-Jones" are fine; "Smith - Jones" mangles importation.
* Leetspeak has limited importation, depending on the exact character combination used.
* Titles generally are not affected by the above, since the regex allows all characters once the series name and index are obtained.
* Since titles are not tested at all, serious typographic errors will go undetected on import. Manual review is the only means of correcting them.
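
A few of these faults can be reproduced outside Calibre with Python's re module, using the expression listed under Regexp. This is only a sketch: the filenames are invented, the helper name fields is hypothetical, and the extension is left off on the assumption that the expression sees the name without it.

```python
import re

# The expression from this post, split across lines only for readability.
PATTERN = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)


def fields(stem):
    """Hypothetical helper: return the named groups as a dict, or None."""
    match = PATTERN.match(stem)
    return match.groupdict() if match else None


# Hyphenated surname: the author field is read as intended.
hyphenated = fields("Mary Smith-Jones - Some Series 2 - A Title")

# Spaced dash inside the author: the name is cut at the first " - "
# and the remainder leaks into the series field.
spaced = fields("Mary Smith - Jones - Some Series 2 - A Title")

# A " - " inside the title is harmless once series and index are found.
dashed_title = fields("Mary Smith-Jones - Some Series 2 - A Title - Part One")

if __name__ == "__main__":
    for result in (hyphenated, spaced, dashed_title):
        print(result)
```

In the second case the author comes back as "Mary Smith" and the series as "Jones - Some Series", which is the kind of mangling the list above warns about.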

Regexp:

Code:

(?P<author>((?!\s-\s).)*)\s-(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?\s(?P<title>.*)
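
As a rough check outside Calibre, the expression can be tried with Python's re module. This is a sketch, not Calibre's own import code: the helper name parse_filename is made up for illustration, and stripping the extension with os.path.splitext is an assumption about what the expression is matched against.

```python
import os
import re

# The expression from the post, unchanged (split only for line length).
FILENAME_RE = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)


def parse_filename(filename):
    """Return the named fields as a dict, or None if nothing matches.

    Hypothetical helper, not part of Calibre.
    """
    # Assumption: the extension is removed before matching.
    stem, _ext = os.path.splitext(filename)
    match = FILENAME_RE.match(stem)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    # The example filename from this post.
    print(parse_filename(
        "John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf"
    ))
    # An invented filename without the optional series portion.
    print(parse_filename("Jane Doe - A Standalone Novel.epub"))
```

For a filename without the series portion, the series and series_index entries come back as None, while author and title are still filled in.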

Regexp-translation:
(Note: This is a complex regular expression and a literal translation would be difficult. It makes extensive use of negative lookahead, backtracking and capturing of data elements. Because of this a more generic translation is provided.)

* (?P<author>((?!\s-\s).)*): Accepts all characters up to, but not including, " - " as part of the author's name. Allows for hyphenated names, names with apostrophes, and multiple authors (A & B; A and B; A, B; etc.).
* \s-: Moves testing past the first " - " delimiter.
* (?:\s: Initial test for the presence of the optional elements.
* ((?P<series>.+): Through multiple tests here and in the next two portions, accepts everything up to the next " - " delimiter in a captured group. It then extracts the series name from that group, using a single space between text and numbers as the delimiter.
* (?P<series_index>\d+): Pulls the numbers out of the captured group as the index. Will accept any run of digits.
* ((?!\s-\s).)*): Finds the end of the optional elements.
* \s-)?\s: Moves testing past the optional elements and the last recognized delimiter. After this the use of the " - " delimiter no longer matters, as it is not tested for in the remaining expression.
* (?P<title>.*): All remaining characters are accepted as part of the title. If the book includes " - " as a legitimate part of the title, it will be accepted without errors.
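
The same breakdown can be written into the expression itself with Python's re.VERBOSE flag, which ignores whitespace and allows comments. This is a sketch for reading and testing only: the name VERBOSE_RE is invented, and the literal space between series name and index is written as [ ] because verbose mode would otherwise discard it.

```python
import re

# Same expression as in the post, laid out with one piece per line.
VERBOSE_RE = re.compile(
    r"""
    (?P<author>((?!\s-\s).)*)      # author: everything before the first " - "
    \s-                            # step past the first delimiter
    (?:\s                          # optional series block starts here
        (
            (?P<series>.+)         # series name (greedy, then gives back)
            [ ]                    # single space between name and index
            (?P<series_index>\d+)  # series index: one or more digits
            ((?!\s-\s).)*          # anything else before the next " - "
        )
    \s-)?                          # delimiter that closes the optional block
    \s(?P<title>.*)                # title: all remaining characters
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    m = VERBOSE_RE.match(
        "John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book"
    )
    print(m.groupdict())
```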

Regexp-modifiers: case-insensitive, single-line, greedy (but regularly backtracks to give characters back)

Regexp-syntax: Python

Post #17 · Calibre: Import book metadata from filename · Last edited by Sabardeyn; 05-29-2009 at 06:52 PM. Reason: Fixed layout and typos.