Old 05-21-2009, 01:25 AM   #16
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Quote:
Originally Posted by pepak View Post
A suggestion:

Maintain a list of regexps (with links to their posts) in the first post. That way, when someone opens the thread, he/she can quickly find the needed regexp.
Done.

Hope that it gets difficult to keep up.

m a r
Old 05-29-2009, 05:39 PM   #17
Sabardeyn
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
Calibre: Import book metadata from filename

Unlike the other regular expressions in this thread, this one is meant to be run against a book's filename, not against the text of the book. Many books carry no metadata inside the file, so that information ends up in the filename instead. With this expression you can import such a book into Calibre and have the fields filled in automatically.

Users should still review every book imported this way. However well the expression has been tested, errors can and will occur in unusual cases.

Please note that the Regular Expression which follows is the work of multiple people: Gwynevans, Darkmonk (and the unix forum that assisted him) and myself. The unix forum undoubtedly gets the lion's share of accolades.


Description: Automates metadata entry into Calibre when adding a book based on the filename.

Example of input: John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf

Example of output: (The output is what data Calibre has in its fields after import is completed.)
<author> = John D. Smith-Jones
<series> = Bibliographic Perfection
<series_index> = 189
<title> = The Perfect Book

Requirements: This regular expression is used in Calibre under the user-defined Preferences>Advanced>Regular Expression option. The imported filenames must follow the format "<author> - <series> <series_index> - <title>" (space dash space between fields, and a single space between the series name and its index). The series and series index portions are optional; filenames without them will be imported correctly.

Faults:
* All filenames must follow the explicit format mentioned in Requirements, above. Otherwise they will not import correctly.
* Does not test for name order (lastname, firstname).
* Does not test for multiple authors, but will accept multiples as a single entry in the author field.
* Extra " - "s within the <author> or <series> fields mangle importation.
* Hyphenated names like "Smith-Jones" are fine; "Smith - Jones" mangles importation.
* Leetspeak imports only partially, depending on the exact character combination used.
* Titles generally are not affected by the above since the regex allows all characters once the series name and index is obtained.
* Since titles are not tested at all, serious typographic errors will go undetected on import. Manual review is the only means of correcting.
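
A few of the faults above can be reproduced outside Calibre with Python's re module. This is only a sketch: the expression is copied from this post, the filenames are made up for illustration, and the helper name parse is hypothetical. It also assumes the text being matched is the filename without its extension.

```python
import re

# Expression copied from this post (split over lines only for readability).
FILENAME_RE = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)

def parse(stem):
    """Return the named groups for a filename stem, or None if no match."""
    match = FILENAME_RE.match(stem)
    return match.groupdict() if match else None

# Hyphenated surname: no spaces around the dash, so it stays in the author.
ok = parse("Mary Smith-Jones - Saga 3 - The Title")
# Spaced dash inside the author: the second half leaks into the series.
bad = parse("Mary Smith - Jones - Saga 3 - The Title")
# Several authors are kept as one string; nothing splits them.
pair = parse("A. Writer & B. Author - The Title")
# " - " inside the title is harmless once series and index are consumed.
dashed = parse("Mary Smith - Saga 3 - Part One - The End")

for label, result in [("ok", ok), ("bad", bad), ("pair", pair), ("dashed", dashed)]:
    print(label, result)
```

In the "bad" case the author comes back as "Mary Smith" and the series as "Jones - Saga", which is the mangling described in the list.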

Regexp:
Code:
(?P<author>((?!\s-\s).)*)\s-(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?\s(?P<title>.*)
Regexp-translation:
(Note: This is a complex regular expression and a literal translation would be difficult. It makes extensive use of negative lookahead, backtracking and named capture groups. Because of this a more generic translation is provided.)

* (?P<author>((?!\s-\s).)*): Accepts all characters up to, but not including, " - " as part of the author's name. Allows for hyphenated names, names with apostrophes, and multiple authors (A & B; A and B; A, B; etc).
* \s-: Moves testing past the first " - " delimiter.
* (?:\s: Initial test for the presence of the optional elements.
* ((?P<series>.+): Through multiple tests here and in the next two portions, accepts everything up to the next " - " delimiter in a captured group. Then it extracts just the series name from that group, using a single space between text and numbers as the delimiter.
* (?P<series_index>\d+): Pulls the numbers out of the captured group as the index. Accepts any whole number.
* ((?!\s-\s).)*): Finds the end of the optional elements.
* \s-)?\s: Moves testing past the optional elements and the last recognized delimiter. After this the " - " delimiter no longer matters, as the remaining expression does not test for it.
* (?P<title>.*): All remaining characters are accepted as part of the title. If the book includes " - " as a legitimate part of the title, it will be accepted without errors.
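
To check the translation against the example filename, the expression can be run with Python's re module. This is a minimal sketch, not how Calibre itself calls it: the helper name parse_filename is hypothetical, and stripping the extension before matching is an assumption made here so that ".pdf" does not end up in the title.

```python
import os
import re

# Expression copied from this post (split over lines only for readability).
FILENAME_RE = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)

def parse_filename(filename):
    """Return the named groups for a filename, or None if nothing matches."""
    # Assumption: the extension is not part of the text being matched.
    stem, _ext = os.path.splitext(filename)
    match = FILENAME_RE.match(stem)
    return match.groupdict() if match else None

fields = parse_filename(
    "John D. Smith-Jones - Bibliographic Perfection 189 - The Perfect Book.pdf"
)
for name in ("author", "series", "series_index", "title"):
    print(f"<{name}> = {fields[name]}")

# Series and index are optional; both groups come back as None here.
standalone = parse_filename("Jane Doe - A Standalone Novel.epub")
print(standalone)
```

The four printed fields should line up with the "Example of output" list earlier in this post.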

Regexp-modifiers: case-insensitive, single-line, greedy (but regularly gives back, i.e. backtracks)

Regexp-syntax: Python

Last edited by Sabardeyn; 05-29-2009 at 05:52 PM. Reason: Fixed layout and typos.
Old 05-29-2009, 08:47 PM   #18
rogue_ronin
I'll need that in the near future...

m a r
Old 06-17-2013, 01:56 AM   #19
macnab69
Zealot
Posts: 129
Karma: 5754
Join Date: Jan 2012
Location: South Africa
Device: Kindle 4
@Sabardeyn:

I don't want to copy/paraphrase your entry, so I'll ask you to add this regex. It works exactly the same as yours, except that it expects the filename format:
author - [series series_index] title

The regex:
Code:
^(?P<author>((?!\s-\s).)+)\s-\s(?:(?:\[\s*)?(?P<series>.+)\s(?P<series_index>[\d\.]+)(?:\s*\])?\s)?(?P<title>[^(]+)(?:\(.*\))?
Thanks.
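
A quick way to try this variant is again Python's re module. The sketch below copies the expression from this post; the filenames and the helper name parse_bracketed are made up for illustration. One thing it shows: because the title group is [^(]+, a space before a trailing "(...)" stays in the title, so the helper strips whitespace afterwards (whether Calibre does the same is not checked here).

```python
import re

# Expression copied from this post (split over lines only for readability).
BRACKET_RE = re.compile(
    r"^(?P<author>((?!\s-\s).)+)\s-\s"
    r"(?:(?:\[\s*)?(?P<series>.+)\s(?P<series_index>[\d\.]+)(?:\s*\])?\s)?"
    r"(?P<title>[^(]+)(?:\(.*\))?"
)

def parse_bracketed(stem):
    """Return stripped named groups for a filename stem, or None."""
    match = BRACKET_RE.match(stem)
    if match is None:
        return None
    # Strip each captured value; groups that did not take part stay None.
    return {k: v.strip() if v else v for k, v in match.groupdict().items()}

with_series = parse_bracketed("Jane Doe - [Star Saga 2.5] The Long Night (retail)")
without_series = parse_bracketed("Jane Doe - The Long Night")

print(with_series)
print(without_series)
```

Note that [\d\.]+ lets the index be a decimal such as 2.5, which the \d+ in the earlier expression would not accept.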
Old 06-17-2013, 07:51 AM   #20
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
Proofreading regex

What it does:
Flags typical OCR errors (at least in tesseract)
  • Capital letter immediately after lower case letter
  • Digit after letter
  • Lower case after full stop
  • Space before colon/semicolon

Regexp:
Code:
([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])
Faults:
False positives, especially abbreviations that are followed by lower-case letters.

Regex variant:
Perl and similar (Python included). For a basic ("regular") regexp as used by grep or sed, add \ before (, ) and |.
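
As a rough illustration of how the expression might be used, the sketch below runs it line by line over some text and reports where it fires. The sample text and the helper name flag_suspects are invented for this example; the expression itself is the one posted above.

```python
import re

# Expression copied from this post.
OCR_RE = re.compile(r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])")

def flag_suspects(text):
    """Yield (line_number, column, matched_text) for every suspect spot."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in OCR_RE.finditer(line):
            # Columns are 1-based so they match what most editors display.
            yield line_no, match.start() + 1, match.group(0)

sample = (
    "The quick brownFox jumped.\n"
    "He saw l0ts of them. then he left ;\n"
    "See e.g. the appendix.\n"
)

hits = list(flag_suspects(sample))
for line_no, column, found in hits:
    print(f"line {line_no}, col {column}: {found!r}")
```

The third sample line contains no OCR error, yet "e.g. the" is flagged twice, which is the abbreviation false positive mentioned under Faults.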