Old 08-15-2011, 07:33 AM   #1
mightymouse2045
Enthusiast
mightymouse2045 began at the beginning.
 
Posts: 30
Karma: 10
Join Date: May 2011
Device: xoom
Regular expression builder plugin?

Hi,

Would anyone be able to write a regular expression plugin that lets us select the various fields, i.e. (title), (author), (series) etc., and arrange them with an AND or OR in between so that the result matches multiple criteria? Once we have what we want, the plugin would populate a text field with the regular expression that matches it.

i.e. (title) AND - (author) OR _ (series) OR (series_index)

Would give a regular expression that would match the following:

title - author
title - author_series
title - author_ series
title - author _series
title - author _ series
title - author_series seriesindex
title - author _series seriesindex
title - author_ seriesseriesindex
title - author _ series seriesindex
title - author _ seriesseriesindex

And obviously it should handle any combination of fields joined with AND or OR, or whatever else someone could come up with, and maybe allow a NOT in front of a field.
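A minimal sketch of what such a builder could generate, assuming the AND/OR idea boils down to "required" and "optional" fields with a literal separator in front of each. The names `FIELD_PATTERNS` and `build_pattern` and the per-field sub-patterns are hypothetical, not part of Calibre or any existing plugin.

```python
import re

# Hypothetical sub-patterns for each field; real filenames may need stricter ones.
FIELD_PATTERNS = {
    "title": r"(?P<title>.+?)",
    "author": r"(?P<author>.+?)",
    "series": r"(?P<series>.+?)",
    "series_index": r"(?P<series_index>[0-9.]+)",
}


def build_pattern(parts):
    """Build a regex from (separator, field, required) tuples in filename order.

    The separator is the literal text expected before the field ("" for none);
    spaces around it are always treated as optional.
    """
    pieces = ["^"]
    for sep, field, required in parts:
        chunk = r"\s*"
        if sep:
            chunk += re.escape(sep) + r"\s*"
        chunk += FIELD_PATTERNS[field]
        # Optional fields are wrapped in a non-capturing optional group.
        pieces.append(chunk if required else "(?:" + chunk + ")?")
    pieces.append("$")
    return "".join(pieces)


# (title) - (author), optionally followed by _ (series) and a series index.
pattern = build_pattern([
    ("", "title", True),
    ("-", "author", True),
    ("_", "series", False),
    ("", "series_index", False),
])

for name in ["Awakened Mage - Karen Miller",
             "Awakened Mage - Karen Miller_Kingmaker",
             "Awakened Mage - Karen Miller _ Kingmaker 2"]:
    print(re.match(pattern, name).groupdict())
```

The lazy `.+?` fields keep the generator simple, but they also mean a hyphen inside a title would be treated as the separator, so the output would still need checking against real filenames.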

It's just that I often get a lot of books in completely different naming formats, and it is a constant pain to go through and rewrite a regular expression, because I am just not geeky enough to know where to put a ? or a \s or whatever it is I am matching against. An hour or two later I realise it would have been easier to just rename the 30 or 50 books I've imported instead of fumbling around for a regular expression that captures exactly what I need without further changes after importing them.

Pretty please
Old 08-15-2011, 08:51 AM   #2
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
This is not as trivial as you may think, because it requires one expression to match all variations of a title occurring in the book, one for all variations of author names etc. I don't believe it's possible to do this robustly enough and still retain the simplicity of just having to write (title) or something; at best, you'd end up with some sort of inferior copy of regular expressions.

Bottom line: Learn about regexes; it helps in more places than just using Calibre.
Old 08-15-2011, 10:24 AM   #3
mightymouse2045
Enthusiast
mightymouse2045 began at the beginning.
 
Posts: 30
Karma: 10
Join Date: May 2011
Device: xoom
Quote:
Originally Posted by Manichean View Post
This is not as trivial as you may think, because it requires one expression to match all variations of a title occurring in the book, one for all variations of author names etc. I don't believe it's possible to do this robustly enough and still retain the simplicity of just having to write (title) or something; at best, you'd end up with some sort of inferior copy of regular expressions.

Bottom line: Learn about regexes; it helps in more places than just using Calibre.
Yeah, I am learning a lot about regex, but you know how it is: three months down the track you get a whole bunch of files you want to import and have to recall the best way to do it, so that it captures all or most of the files without having to edit them all after importing. After trial and error, 1 to 2 hours later, it's a pain in the butt.

I don't mean one regex that can capture all possible combinations. What I mean is some sort of plugin that pops up a box and lets you drag the fields you want to match into the order you want, and lets you drag an AND, OR or some other operator in between each field. You then click a create button that populates a text field with the correct regex for that combination, which you can copy and paste into, for example, the 'Adding Books' preferences.

I downloaded another person's library today, and they haven't exported the books or just haven't put them in a very import-friendly structure, so for example I have:

1 Divine by Mistake - P.C. Cast.epub
1 How to Train Your Dragon - Cressida Cowell.epub
Kingmaker, Kingbreaker 02_ Awakened Mage - Karen Miller.epub

So basically some are
title - author

and others are
series series index_ title - author

What I have to do now is split up the files and then write two regexes to capture the two variations, while also allowing for oddities in the names, e.g. some have a space after the _ and some don't. That's what I'm not so clever with :P

It would be fantastic if some clever regex guru out there could write a regex wizard that does even 80% of the job, leaving something that us not-so-brilliant bookworms can play with and fine-tune for our purposes. A wizard clever enough to do the whole thing would of course be even better.

I'm not saying it's easy - just putting the thought out there and hoping to capture the interest of someone who might be talented enough to do that.
Old 08-15-2011, 11:37 AM   #4
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by mightymouse2045 View Post
I don't mean one regex that can capture all possible combinations. What I mean is some sort of plugin that pops up a box and lets you drag the fields you want to match into the order you want, and lets you drag an AND, OR or some other operator in between each field. You then click a create button that populates a text field with the correct regex for that combination, which you can copy and paste into, for example, the 'Adding Books' preferences.
Ah. That's somewhat easier, but basically would require the user to input the same information he would when typing the regex himself, so, apart from not having to learn the syntax, there wouldn't be much improvement. This may be of interest, by the way.
Moderator Notice
Since this is basically a plugin request, I'm moving it to the appropriate forum.
Old 08-15-2011, 01:47 PM   #5
mightymouse2045
Enthusiast
mightymouse2045 began at the beginning.
 
Posts: 30
Karma: 10
Join Date: May 2011
Device: xoom
Quote:
Originally Posted by Manichean View Post
Ah. That's somewhat easier, but basically would require the user to input the same information he would when typing the regex himself, so, apart from not having to learn the syntax, there wouldn't be much improvement. This may be of interest, by the way.

Since this is basically a plugin request, I'm moving it to the appropriate forum.
Thanks for moving it to the appropriate forum.

Yes, there are tools out there that help you build a regular expression, and I have looked at Kodos and some others, but it would be nice to do it within Calibre, because it is a pain reordering the fields and getting the syntax right, with the brackets in place, while ensuring it's still flexible.

Like this one I use for the main part:

Code:
^(?P<author>[^-]+)(\s*-\s*(\[?(?P<series>[^-0-9]+)\s*(?P<series_index>[0-9.]+)?]?)?)?.*?-\s*(?P<title>[^\]{[()]+\w)
Matches a fair bit, but if I want to reorder the fields, it escapes me what I should move where...

But I will play around with those builders anyway and see if they can assist for now. Cheers
Old 08-17-2011, 02:04 PM   #6
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by mightymouse2045 View Post
Matches a fair bit, but if I want to reorder the fields, it escapes me what I should move where...
The reason it's not easy to see what to move is that your expression is trying to match lots of variations.
Code:
^(?P<author>[^-]+)(\s*-\s*(\[?(?P<series>[^-0-9]+)\s*(?P<series_index>[0-9.]+)?]?)?)?.*?-\s*(?P<title>[^\]{[()]+\w)
That regex isn't even doing lookaheads, as for example in this one:
Code:
^(?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b)(\s*-\s*)?(?P<title>[a-zA-Z1-9 ]) (\[(?P<series>[^0-9\-]+) (- )?\#?(?P<series_index>[0-9.]+)\])
It's easy, however, to see the order of the fields in each.
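For anyone who would rather not read the field order by eye, here is a small sketch using Python's `re` module, which Calibre's `(?P<name>...)` syntax comes from. The helper name `field_order` is made up for illustration; `groupindex` is the standard mapping from group name to group number on a compiled pattern.

```python
import re

# The two expressions quoted above, copied verbatim.
FIRST = r"^(?P<author>[^-]+)(\s*-\s*(\[?(?P<series>[^-0-9]+)\s*(?P<series_index>[0-9.]+)?]?)?)?.*?-\s*(?P<title>[^\]{[()]+\w)"
SECOND = r"^(?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b)(\s*-\s*)?(?P<title>[a-zA-Z1-9 ]) (\[(?P<series>[^0-9\-]+) (- )?\#?(?P<series_index>[0-9.]+)\])"


def field_order(pattern):
    """Return the named groups of a regex in the order they appear in it."""
    index = re.compile(pattern).groupindex  # maps group name -> group number
    return sorted(index, key=index.get)


print(field_order(FIRST))   # ['author', 'series', 'series_index', 'title']
print(field_order(SECOND))  # ['author', 'title', 'series', 'series_index']
```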
I keep several of the complex expressions stored, and match up the order between my book and my regex.
I know that as long as the filename has author immediately followed by title I should try the second, and if it has author followed by series, I should try the first one.

If the test works, fine. If not, and it's not immediately obvious how to modify it, I also keep several very simple regexes, and those are easy to change to match. Later, if I get time, I go back and rewrite the complex expression to also capture the new file so it will work the next time. The trick is just to match the order of the fields in the filename to the order in the regex.
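The "keep several stored expressions and try them against the filename" routine can be sketched in a few lines. The two patterns below are deliberately simple, hypothetical stand-ins (they are not the complex expressions from this thread), listed with the most specific one first.

```python
import re

# Hypothetical simple patterns, most specific first.
PATTERNS = [
    ("series index - title - author",
     r"^(?P<series>.+?)\s+(?P<series_index>[0-9.]+)\s+-\s+(?P<title>.+?)\s+-\s+(?P<author>.+)$"),
    ("title - author",
     r"^(?P<title>.+?)\s+-\s+(?P<author>.+)$"),
]


def first_match(name):
    """Return (label, captured fields) for the first stored pattern that matches."""
    for label, pattern in PATTERNS:
        m = re.match(pattern, name)
        if m:
            return label, m.groupdict()
    return None, {}


for name in ["Kingmaker, Kingbreaker 02 - Awakened Mage - Karen Miller",
             "How to Train Your Dragon - Cressida Cowell"]:
    print(first_match(name))
```

Ordering matters here: the looser "title - author" pattern would also match the three-part filename, so it has to come last.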
Old 08-17-2011, 03:32 PM   #7
kacir
Wizard
Posts: 2,742
Karma: 2920103
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by Starson17 View Post
I know that as long as the filename has author immediately followed by title I should try the second, and if it has author followed by series, I should try the first one.
Code:
(?P<author>[^-]+)(( - | *-- *)[[(]?(?P<series>[^-]+)[[( ]+(?P<series_index>[0-9.]+)?[])]?)?( - | *-- *)(?P<title>.+)
In this one the series is optional.
It works with a series and *also* without.
It also allows optional parentheses or brackets around the series and/or the series number.
So it matches:
Sir Arthur Conan Doyle - Sherlock Holmes 1 - Study in red.doc
Sir Arthur Conan Doyle - Study in red.doc
Sir Arthur Conan Doyle -- Sherlock Holmes 1 - Study in red.doc
Sir Arthur Conan Doyle -- Study in red.doc
Sir Arthur Conan Doyle--Sherlock Holmes 1.0--Study in red.doc
Sir Arthur Conan Doyle--Study in red.doc
Sir Arthur Conan Doyle - (Sherlock Holmes 1) - Study in red.doc
Sir Arthur Conan Doyle - [Sherlock Holmes 1] - Study in red.doc
Sir Arthur Conan Doyle - Sherlock Holmes (1) - Study in red.doc
Sir Arthur Conan Doyle - Sherlock Holmes [1] - Study in red.doc
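The list above can be checked outside Calibre with a short Python script. This is only a sketch under a few assumptions: that Calibre drops the file extension and searches the remaining name (mirrored here with `os.path.splitext` and `search`), and that surrounding whitespace in a captured value can be stripped afterwards, which the `parse` helper (a made-up name) does explicitly. Recent Python versions may emit a FutureWarning about a "possible nested set" for the `[[(]` construct, so the sketch silences it while compiling.

```python
import os
import re
import warnings

# The expression from the post above, copied verbatim.
KACIR = r"(?P<author>[^-]+)(( - | *-- *)[[(]?(?P<series>[^-]+)[[( ]+(?P<series_index>[0-9.]+)?[])]?)?( - | *-- *)(?P<title>.+)"

with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)  # "[[(" looks like a nested set
    pattern = re.compile(KACIR)

FILENAMES = [
    "Sir Arthur Conan Doyle - Sherlock Holmes 1 - Study in red.doc",
    "Sir Arthur Conan Doyle - Study in red.doc",
    "Sir Arthur Conan Doyle -- Sherlock Holmes 1 - Study in red.doc",
    "Sir Arthur Conan Doyle -- Study in red.doc",
    "Sir Arthur Conan Doyle--Sherlock Holmes 1.0--Study in red.doc",
    "Sir Arthur Conan Doyle--Study in red.doc",
    "Sir Arthur Conan Doyle - (Sherlock Holmes 1) - Study in red.doc",
    "Sir Arthur Conan Doyle - [Sherlock Holmes 1] - Study in red.doc",
    "Sir Arthur Conan Doyle - Sherlock Holmes (1) - Study in red.doc",
    "Sir Arthur Conan Doyle - Sherlock Holmes [1] - Study in red.doc",
]


def parse(filename):
    """Return the captured fields (whitespace-stripped), or None if nothing matches."""
    stem = os.path.splitext(filename)[0]  # assumption: the extension is not matched
    m = pattern.search(stem)
    if m is None:
        return None
    return {k: v.strip() for k, v in m.groupdict().items() if v is not None}


for name in FILENAMES:
    print(parse(name))
```

The stripping is not cosmetic: in the `--` and `(1)` / `[1]` variants the raw `author` or `series` capture ends with a space.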

If you find a case that my expression doesn't cover, do not hesitate to post; we can try to craft another, even more complex RE.

Here is another take on the problem
Code:
^(?P<author>((?!\s-\s).)+)\s-\s(?:(?:\[\s*)?(?P<series>.+)\s(?P<series_index>[\d\.]+)(?:\s*\])?\s-\s)?(?P<title>[^(]+)(?:\(.*\))?
This one doesn't cover parentheses around the series number, as in:
Sir Arthur Conan Doyle - Sherlock Holmes (1) - Study in red.doc
Old 08-17-2011, 08:16 PM   #8
mightymouse2045
Enthusiast
mightymouse2045 began at the beginning.
 
Posts: 30
Karma: 10
Join Date: May 2011
Device: xoom
Quote:
Originally Posted by kacir View Post

If you find a case that my expression doesn't cover, do not hesitate to post; we can try to craft another, even more complex RE.
How about

# title - author
# series ##_ title - author
series ##_ title - author

But strip the number # at the beginning?
Old 08-19-2011, 01:02 PM   #9
kacir
Wizard
Posts: 2,742
Karma: 2920103
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by mightymouse2045 View Post
How about

# title - author
# series ##_ title - author
series ##_ title - author

But strip the number # at the beginning?
OK.
Let's have a look at the RE from my previous post, split across lines for better readability:
Code:
(?P<author>[^-]+) 
( 
        ( - | *-- *) 
        [[(]? 
        (?P<series>[^-]+) 
        [[( ]+ 
        (?P<series_index>[0-9.]+)? 
        [])]? 
)? 
( - | *-- *) 
(?P<title>.+)
Now, we shall rearrange various elements like so
Code:
(
        [[(]?
        (?P<series>[^-]+)
        [[( ]+
        (?P<series_index>[0-9.]+)? 
        [])]? 
        ( - | *-- *) 
)? 
(?P<author>[^-]+) 
( - | *-- *) 
(?P<title>.+)
Now it matches
series seriesnumber - author - title
author - title
Please note: if there is a series, it must be followed by a series number.
I think it is possible to construct an RE that makes the series number optional, but I do not know if it would be useful that way, and my regular expression is complicated enough as it is.
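The split-across-lines layout can also be kept in a form that actually runs, using Python's `re.VERBOSE` flag. This is a sketch rather than something to paste into Calibre's preferences as is: in verbose mode whitespace outside a character class is ignored, so every literal space in a delimiter is written as `[ ]`, and the opening bracket inside a class is escaped as `\[` to avoid Python's nested-set warning. The name `REARRANGED` is made up for illustration.

```python
import re

REARRANGED = re.compile(r"""
    (
        [\[(]?                          # optional opening bracket or parenthesis
        (?P<series>[^-]+)               # series name: anything up to a hyphen
        [\[( ]+                         # space and/or opening bracket before the number
        (?P<series_index>[0-9.]+)?      # series number
        [\])]?                          # optional closing bracket or parenthesis
        ([ ]-[ ]|[ ]*--[ ]*)            # delimiter between series and author
    )?
    (?P<author>[^-]+)                   # author: anything up to a hyphen
    ([ ]-[ ]|[ ]*--[ ]*)                # delimiter between author and title
    (?P<title>.+)                       # title: the rest of the name
""", re.VERBOSE)

for name in ["Sherlock Holmes 1 - Sir Arthur Conan Doyle - Study in red",
             "Sir Arthur Conan Doyle - Study in red"]:
    print(REARRANGED.match(name).groupdict())
```

Reordering fields then means moving whole commented lines, as in the rearrangement above.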

Let's add the regular expression
[0-9 ]*
at the beginning of the new RE, so it "eats up" any numbers and spaces at the beginning.
If there are dots in the number, put this at the beginning instead:
[0-9. ]*

--------- doesn't work ---------
Now, we need to put underscore among possible delimiters, together with ' - ', '--', ' -- '.
So instead of
( - | *-- *)
at the end of the series, we put
( - | *-- *| *_ *)
Now possible delimiters are ' - ', '--', ' -- ', '-- ', ' --', '_',' _','_ ',' _ '.
-------- end of doesn't work -------
The above construction doesn't work, because you would also have to change (?P<series>[^-]+) to (?P<series>[^-_]+). An even bigger problem is that Calibre automatically replaces underscores in filenames with spaces. Is there a setting to switch that off?

I recommend replacing the underscore with ' - ' in the filenames before processing the files in Calibre.
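One way to do that replacement in bulk is a small Python script run on the folder before importing. This is a sketch only: the function names are made up, the folder is whatever you pass in, and `rename_all` defaults to a dry run that just prints what it would do.

```python
import re
from pathlib import Path


def with_hyphen_delimiter(filename):
    """Replace each underscore (and any spaces around it) with ' - ', keeping the extension."""
    path = Path(filename)
    return re.sub(r"\s*_\s*", " - ", path.stem) + path.suffix


def rename_all(folder, dry_run=True):
    """Rename every file in `folder` whose name contains an underscore delimiter."""
    for path in sorted(Path(folder).iterdir()):
        new_name = with_hyphen_delimiter(path.name)
        if path.is_file() and new_name != path.name:
            print(path.name, "->", new_name)
            if not dry_run:
                path.rename(path.with_name(new_name))


print(with_hyphen_delimiter("Kingmaker, Kingbreaker 02_ Awakened Mage - Karen Miller.epub"))
```

Calling `rename_all(some_folder)` first and reading the printed list before passing `dry_run=False` is the safer order.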

Here is the result
Code:
[0-9 ]*([[(]?(?P<series>[^-]+)[[( ]+(?P<series_index>[0-9.]+)?[])]?( - | *-- *))?(?P<author>[^-]+)( - | *-- *)(?P<title>.+)
I will leave extensive testing of the regular expression as an exercise for the reader ;-)
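As a start on that exercise, here is a sketch that runs the result against the three filenames from earlier in the thread, with the extension removed and the underscore already replaced by ' - ' as recommended. The `captures` helper is a made-up name, and stripping whitespace from the captured values is an addition of this sketch.

```python
import re
import warnings

# The resulting expression, copied verbatim.
FINAL = r"[0-9 ]*([[(]?(?P<series>[^-]+)[[( ]+(?P<series_index>[0-9.]+)?[])]?( - | *-- *))?(?P<author>[^-]+)( - | *-- *)(?P<title>.+)"

with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)  # "[[(" looks like a nested set
    pattern = re.compile(FINAL)

NAMES = [
    "1 Divine by Mistake - P.C. Cast",
    "1 How to Train Your Dragon - Cressida Cowell",
    "Kingmaker, Kingbreaker 02 - Awakened Mage - Karen Miller",
]


def captures(name):
    """Return the fields the expression captures, with whitespace stripped."""
    m = pattern.match(name)
    return {k: v.strip() for k, v in m.groupdict().items() if v is not None}


for name in NAMES:
    print(captures(name))
```

The leading number is dropped and the series is split off as intended. Note that the expression reads the remaining parts as "author - title", so for these particular files, which are named "title - author", the book title lands in the `author` group and the author in `title`; swapping the two group names in the expression would fit them.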
Old 08-20-2011, 06:02 AM   #10
mightymouse2045
Enthusiast
mightymouse2045 began at the beginning.
 
Posts: 30
Karma: 10
Join Date: May 2011
Device: xoom
Thanks a lot for your explanation - that worked a treat. I can now do that with anything else in future.
Old 08-20-2011, 09:52 AM   #11
kacir
Wizard
Posts: 2,742
Karma: 2920103
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by mightymouse2045 View Post
Thanks a lot for your explanation - that worked a treat. I can now do that with anything else in future.
Complicated regular expressions (like the one above) can be quite ... intimidating. This is because the regular expression language is very, very condensed and has developed over many years, so some metacharacters need to be "escaped" and others do not, and the syntax for traditional metacharacters differs from that for relatively recently introduced ones. The syntax can also change from tool to tool.

If you "take the RE apart", like I did above, it becomes much clearer.
As I said, the RE language is very "dense", so such expressions are very often jokingly referred to as "write only". It means it can be easier to write one than to understand an expression that somebody else wrote.

I strongly recommend that you read the following post:
http://www.mobileread.com/forums/sho...d.php?t=118569
It is the result of a very interesting thread started by Manichean, with many contributors, and is now part of the Calibre documentation.

If you are interested in further learning, get a book called Mastering Regular Expressions by Jeffrey E.F. Friedl.