Tyrannosaurus Regex - Page 2

rogue_ronin · 05-21-2009, 01:25 AM

Quote:

Originally Posted by pepak

A suggestion:

Maintain a list of regexps (with links to their posts) in the first post. That way, when someone opens the thread, he/she can quickly find the needed regexp.

Done.

Hope that it gets difficult to keep up.

m a r

Sabardeyn · 05-29-2009, 05:39 PM

Unlike the other Regular Expression samples provided in this topic, this one is intended to be used on a book's filename, not on the text of the book. Many books do not have metadata included inside the file so that information is included in the filename. By using this formula you can import the book into Calibre and automatically have the data entered correctly.

Users should still review all books that are automatically imported in this manner. Despite the complex testing and accuracy of this expression, errors can and will occur in unique circumstances.

Please note that the Regular Expression which follows is the work of multiple people: Gwynevans, Darkmonk (and the unix forum that assisted him) and myself. The unix forum undoubtedly gets the lion's share of accolades.

Description: Automates metadata entry into Calibre when adding a book based on the filename.

Example of input: John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf

Example of output: (The output is what data Calibre has in its fields after import is completed.)
<author> = John D. Smith-Jones
<series> = Bibliographic Perfection
<series_index> = 189
<title> = The Perfect Book

Requirements: This regular expression is used in Calibre under the user-defined Preferences>Advanced>Regular Expression option. The imported filenames must follow the format "<author> - <series> <series_index> - <title>": space dash space between fields, and a single space between the series name and its index. The series and series index portions are optional; filenames without them will be imported correctly.

Faults:
* All filenames must follow the explicit format mentioned in Requirements, above. Otherwise they will not import correctly.
* Does not test for name order (lastname, firstname).
* Does not test for multiple authors, but will accept multiples as a single entry in the author field.
* Extra " - "s within the <author> or <series> fields mangle importation.
* Hyphenated names like "Smith-Jones" are fine, but "Smith - Jones" mangles importation.
* Leetspeak has limited importation - depending on the exact character combination used.
* Titles generally are not affected by the above since the regex allows all characters once the series name and index are obtained.
* Since titles are not tested at all, serious typographic errors will go undetected on import. Manual review is the only means of correcting.

Regexp:

Code:

(?P<author>((?!\s-\s).)*)\s-(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?\s(?P<title>.*)

Regexp-translation:
(Note: This is a complex regular expression and a literal translation would be difficult. It makes extensive use of negative lookahead, backtracking and capturing of data elements. Because of this a more generic translation is provided.)

* (?P<author>((?!\s-\s).)*): Accept all characters up to, but not including " - ", as a part of the author's name. Allows for hyphenated names, names with apostrophes, multiple authors (A & B; A and B; A, B; etc).
* \s-: Moves testing past the first " - " delimiter.
* (?:\s: Initial test for the presence of the optional elements.
* ((?P<series>.+): Through multiple tests here and in the next two portions, accepts everything up to the next " - " delimiter in a captured group. Then it just extracts the series name from that group, using a single space between text and numbers as the delimiter.
* (?P<series_index>\d+): Pulls the numbers out of the captured group as the index. Accepts any whole number (digits only, so no decimals).
* ((?!\s-\s).)*): Finds the end of the optional elements.
* \s-)?\s: Moves testing past the optional elements and the last recognized delimiter. After this the use of the " - " delimiter no longer matters as it is not tested for in the remaining expression.
* (?P<title>.*): All remaining characters are accepted as part of the title. If the book includes " - " as a legitimate part of the title, it will be accepted without errors.

Regexp-modifiers: case-insensitive, single-line, greedy - but regularly gives back

Regexp-syntax: Python
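
For anyone who wants to try the expression outside Calibre, here is a minimal Python sketch. The pattern is copied unchanged from the post; the helper name parse_filename and the step that strips the file extension before matching are assumptions made for this illustration, not a description of what Calibre does internally.

```python
import os
import re

# Pattern copied verbatim from the post above, split over three lines for readability.
FILENAME_RE = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)


def parse_filename(filename):
    """Return the four named fields, or None if the name does not match.

    Hypothetical helper: the extension is dropped first (an assumption),
    otherwise it would end up inside the title.
    """
    stem, _ext = os.path.splitext(filename)
    match = FILENAME_RE.match(stem)
    if match is None:
        return None
    return {name: match.group(name)
            for name in ("author", "series", "series_index", "title")}


if __name__ == "__main__":
    samples = [
        # The example from the post.
        "John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf",
        # Series and index omitted: both fields come back as None.
        "Jane Doe - A Standalone Novel.epub",
        # Extra " - " in the author: the surplus text leaks into the series.
        "John Smith - Jones - Saga 3 - Title.pdf",
    ]
    for name in samples:
        print(name, "->", parse_filename(name))
```

The third sample reproduces the fault listed above: the author comes back as "John Smith" and the series as "Jones - Saga", so a manual review after import is still worthwhile.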

rogue_ronin · 05-29-2009, 08:47 PM

I'll need that in the near future...

m a r

macnab69 · 06-17-2013, 01:56 AM

@Sabardeyn:

I don't want to copy/paraphrase your entry, so I'll ask you to add this regex. Works exactly the same as yours, except it works with filename format:
author - [series series_index] title

The regex:

Code:

^(?P<author>((?!\s-\s).)+)\s-\s(?:(?:\[\s*)?(?P<series>.+)\s(?P<series_index>[\d\.]+)(?:\s*\])?\s)?(?P<title>[^(]+)(?:\(.*\))?

Thanks.
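
A quick way to check this variant is a short Python sketch like the one below. The pattern is taken unchanged from the post; the helper name parse_bracketed is made up for this example, it expects the filename without its extension, and stripping surrounding whitespace from each field is an added convenience, since the title group keeps the space that precedes a trailing "(...)".

```python
import re

# Pattern copied verbatim from the post above, split over three lines for readability.
BRACKETED_RE = re.compile(
    r"^(?P<author>((?!\s-\s).)+)\s-\s"
    r"(?:(?:\[\s*)?(?P<series>.+)\s(?P<series_index>[\d\.]+)(?:\s*\])?\s)?"
    r"(?P<title>[^(]+)(?:\(.*\))?"
)


def parse_bracketed(stem):
    """Parse 'author - [series series_index] title'; returns None on no match.

    Hypothetical helper: 'stem' is assumed to be the filename without its
    extension. Fields are whitespace-stripped (an addition for this sketch).
    """
    match = BRACKETED_RE.match(stem)
    if match is None:
        return None
    fields = {}
    for name in ("author", "series", "series_index", "title"):
        value = match.group(name)
        fields[name] = value.strip() if value is not None else None
    return fields


if __name__ == "__main__":
    # Decimal indexes are accepted because the index group is [\d\.]+.
    print(parse_bracketed(
        "John D. Smith-Jones - [Bibliographic Perfection 1.5] The Perfect Book"))
    # No series; the trailing parenthetical is matched but left out of the title.
    print(parse_bracketed("Jane Doe - A Standalone Novel (retail)"))
```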

SBT · 06-17-2013, 07:51 AM

What it does:
Flags typical OCR errors (at least in tesseract)

Capital letter immediately after lower case letter
Digit after letter
Lower case after full stop
Space before colon/semicolon

Regexp:

Code:

([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])

Faults:
False positives, especially abbreviations that are followed by lower-case letters.

Regex variant:
Perl & similar. Add \ before (,),| to get a "regular" regexp.
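
As a rough illustration, the expression can be run over a line of text in Python to list each suspicious spot with its position. The helper name find_suspects and the sample sentence are invented for this sketch; the pattern itself is unchanged and must be used case-sensitively.

```python
import re

# Pattern copied verbatim from the post above. Do not add re.IGNORECASE:
# the first alternative depends on telling lower case from upper case.
OCR_SUSPECT_RE = re.compile(r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])")


def find_suspects(text):
    """Return (offset, matched_text) for every suspicious spot in 'text'."""
    return [(m.start(), m.group(0)) for m in OCR_SUSPECT_RE.finditer(text)]


if __name__ == "__main__":
    line = "He saw tHe cat. it was l0vely ; really"
    for offset, snippet in find_suspects(line):
        print(f"{offset:3d}: {snippet!r}")
    # Prints:
    #   7: 'tH'
    #  14: '. i'
    #  23: 'l0'
    #  29: ' ;'
```

Running the same helper on a phrase such as "i.e. this" flags both full stops, which is the false-positive case mentioned under Faults.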
