Closed Thread
 
Old 09-23-2010, 11:57 AM   #31
chaley
"chaley", not "charley"
Posts: 5,061
Karma: 802238
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
Originally Posted by Starson17 View Post
The first rule of taking a test, even one for fun: read the problem statement very carefully. How about this regular expression for an answer:
Code:
.*
I believe it succeeds at matching "any palindrome."
And I, as the marker, would have to give you credit. Wouldn't be the first time I was caught out for writing bad questions. The good part is that it is almost always the good students who figure out the ambiguity.
chaley is offline  
Old 09-23-2010, 12:14 PM   #32
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by chaley View Post
And I, as the marker, would have to give you credit.
So you are that proverbial literally minded professor
http://www.snopes.com/college/exam/choice.asp
Starson17 is offline  
Old 09-23-2010, 01:11 PM   #33
chaley
"chaley", not "charley"
Posts: 5,061
Karma: 802238
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
Originally Posted by Starson17 View Post
So you are that proverbial literally minded professor
http://www.snopes.com/college/exam/choice.asp
Way

Well, it depended on the situation, the question, and the student, but yes, I did sometimes give credit in situations like this. My upper-division classes were reasonably small (10 to 20 people), so I could know the students well enough to tell if someone was jerking my chain or really had no clue.

I once had a situation where I demonstrated that three students had plagiarized the final project of a fourth. I wrote individual final exams, where for the three students the first question (10% of the exam) was 'explain in detail why your code is identical to XXX's'. Two answered 'because I copied it', and I gave them the exam points. They failed the project, though. The third failed both. Several students told me that I had a reputation for very hard, but fair, grading.
chaley is offline  
Old 09-23-2010, 01:36 PM   #34
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Manichean View Post
I know about flags, but as far as I know, Calibre doesn't allow for them to be used, am I right?
To get this thread back on track: you will find flags, such as DOTALL, used in many of the recipes. Customizing recipes inhabits the same sort of advanced-user middle ground as the advanced conversion options do.
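For readers who have not met DOTALL before, here is a minimal sketch of what the flag changes. It is plain Python and not taken from any recipe; the sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical two-line snippet of markup.
html = "<div>first line\nsecond line</div>"

# Without DOTALL, '.' stops at a newline, so the pattern cannot span both lines.
plain = re.search(r"<div>.*</div>", html)

# With DOTALL, '.' also matches '\n', so the whole element is found.
dotall = re.search(r"<div>.*</div>", html, re.DOTALL)

print(plain is None)
print(dotall.group(0) == html)
```

The same flag can be combined with others using `|`, for example `re.DOTALL | re.IGNORECASE`.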
Starson17 is offline  
Old 09-23-2010, 03:28 PM   #35
Calibreuser
Junior Member
Posts: 2
Karma: 10
Join Date: Sep 2010
Device: nook
Question: REGEX help with filename parsing

I am fairly new to REGEX, but I think I have a handle on it so far, in Calibre anyway! Forgive me for skipping ahead: I skimmed the forum and didn't see my question, so if I missed it elsewhere, just point me to it. Thanks.

I have some books named like so
Lauthor, Fauthor - series ##- title.ext
Grant, Maxwell - The Shadow 331 - Mark Of The Shadow(b).txt

My best so far is (?P<author>.+) - (?P<series>.+) - (?P<title>[^_]+)

but now the series index is part of the series.

What is the variable name for the series index?
And how do I change the title section to drop the junk at the end, like (b)?
I can rename files as needed in most cases.

Can anyone help me get author, series, index, and title out of this:

Grant, Maxwell - [The Shadow 331] - Mark Of The Shadow(b).txt

Thanks CalibreUser
Calibreuser is offline  
Old 09-23-2010, 03:31 PM   #36
lost66615
Enthusiast
Posts: 38
Karma: 134
Join Date: Feb 2010
Location: ENGLAND
Device: kindle dx
Just go to Edit metadata and change all of these manually. I would also suggest signing up on the ISBN website; it's free and easy.
lost66615 is offline  
Old 09-23-2010, 03:43 PM   #37
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Calibreuser View Post
I have some books named like so
Lauthor, Fauthor - series ##- title.ext
Grant, Maxwell - The Shadow 331 - Mark Of The Shadow(b).txt
Try this:
Code:
^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9\(]+)
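To see what the expression above captures, it can be run outside Calibre with Python's `re` module. This is only a sketch: the pattern is copied unchanged, but the helper name `parse_filename`, the stripping of surrounding spaces, and the removal of the extension before matching are assumptions made for the illustration.

```python
import re

# Pattern copied unchanged from the post above, split across lines for readability.
PATTERN = re.compile(
    r"^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?"
    r"((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?"
    r"(?P<title>[^\-_0-9\(]+)"
)


def parse_filename(filename):
    """Return the named groups that matched, with surrounding spaces removed."""
    # Assumption: the extension is dropped before matching.
    stem = filename.rsplit(".", 1)[0]
    match = PATTERN.match(stem)
    if match is None:
        return {}
    return {key: value.strip() for key, value in match.groupdict().items() if value}


fields = parse_filename("Grant, Maxwell - The Shadow 331 - Mark Of The Shadow(b).txt")
for key, value in fields.items():
    print(f"{key}: {value}")
```

Note that the title class excludes digits, hyphens, underscores and '(', which is what drops the trailing (b); it also means a title that contains any of those characters is cut off at that point.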
Starson17 is offline  
Old 09-23-2010, 05:01 PM   #38
kacir
Wizard
Posts: 2,742
Karma: 2920103
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Great post.
I have a few suggestions.
At the very beginning of the first post you might put something like:
This is the fourth version of the guide, and it was amended using various suggestions in subsequent posts.
I do not suggest this to get some credit for a suggestion or two, but without this explanation a first time reader might find some of the following posts ... superfluous.

I also suggest that the text of this post should be included in official Calibre documentation, or at the very least, Calibre documentation should point to this post.
Another good place for preserving this thread would be our Wiki.
Regular expressions are very widespread, and yet GOOD documentation that explains regular expressions from a beginner's point of view is relatively hard to find. The documentation for a programming language or a text editor is usually written as a reference manual, describing all the options in a rather terse, concentrated manner. As you can see for yourself, writing even a relatively simple description of a few selected features gets quite lengthy.

My favourite tool for using regular expressions is the Vim text editor. It also has some of the very best documentation I have seen. Unfortunately, its syntax is a little different from Python REs, but the principle remains the same.

----------------

Now, let's see how we can improve the introduction.

First of all, now that you have introduced the pipe '|' for providing different branches, you have to explain the rules of precedence a little bit ;-)
A pipe - '|' has the lowest precedence. So if you write RE 'abcd|efgh' it will match the whole 'abcd' string OR 'efgh' and not 'abc' followed by either 'd' or 'e' and then followed by 'fgh'. If we wanted to do that, we would have to write 'abc(d|e)fgh'.
I know, it should be obvious from your example, but there are a few interesting twists here.
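The precedence rule for the pipe can be checked quickly in Python. This is a small sketch using `re.fullmatch`, so that only complete matches count; the variable names are illustrative.

```python
import re

# The pipe has the lowest precedence: this is 'abcd' OR 'efgh'.
whole = re.compile(r"abcd|efgh")

# Parentheses limit the alternation to 'd' or 'e'.
grouped = re.compile(r"abc(d|e)fgh")

print(bool(whole.fullmatch("abcd")))        # matches the first branch
print(bool(whole.fullmatch("efgh")))        # matches the second branch
print(bool(whole.fullmatch("abcdfgh")))     # neither branch covers the whole string
print(bool(grouped.fullmatch("abcdfgh")))   # 'abc' + 'd' + 'fgh'
print(bool(grouped.fullmatch("abcefgh")))   # 'abc' + 'e' + 'fgh'
```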

Now, I can hear you asking: so instead of '[1234]' I can write '(1|2|3|4)'? Well, yes, you can. '[1234]+' will match strings like '1212' or '444' or '34': any member of the set [1234] followed by any other members of the set. '(1|2|3|4)+' matches exactly the same strings, because the plus quantifier repeats the whole group and the alternation is tried afresh on every repetition; it does not insist on repeating the text that was matched the first time. The difference is that the parentheses also capture what they matched (only the last repetition is remembered). If you really want "the same digit again", as in '111' or '44444' but not '12' or '34', you need a back-reference: '(1|2|3|4)\1*'.
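A short Python check of how a quantified set and a quantified group behave. It is a sketch with made-up variable names: `char_set` and `alternation` are the two spellings discussed here, and `same_digit` adds a back-reference, which is the construct that actually forces the same character to repeat.

```python
import re

char_set = re.compile(r"[1234]+")
alternation = re.compile(r"(1|2|3|4)+")
# \1 refers back to the text captured by group 1.
same_digit = re.compile(r"(1|2|3|4)\1*")

samples = ["1212", "444", "34"]

# The quantifier re-evaluates the group on each repetition,
# so both spellings accept mixed digits.
set_results = [bool(char_set.fullmatch(s)) for s in samples]
alt_results = [bool(alternation.fullmatch(s)) for s in samples]

# Only the back-reference version insists on one repeated digit.
same_results = [bool(same_digit.fullmatch(s)) for s in samples]

print(set_results)
print(alt_results)
print(same_results)
```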

Let's get back to the precedence rules.
Quantifiers apply only to the preceding atom.
An atom (and that should have been explained at the very beginning, but we did not want to scare the reader away) is:
- a literal character, such as 'a', 'q', '2' or ';', that simply matches itself
- a dot '.', which stands for any character
- a special escape sequence, such as '\t' (a tab character) or '\D' (a non-digit character)
- a character set, such as [a-zA-Z] or [^>]
- several atoms that you want to turn into one atom, enclosed in a pair of parentheses, such as (<[^>]+>)
So. If I write RE 'ab+', it will match 'ab', or 'abbbbbb', but not 'abab', because the plus quantifier only applies to the preceding atom. If we wanted to match 'abab' or 'ababab' we would need to write Regular expression like this: '(ab)+'
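The difference between quantifying one atom and quantifying a parenthesized group, as a small Python sketch (variable names are illustrative, and `re.fullmatch` is used so that only complete matches count):

```python
import re

single_atom = re.compile(r"ab+")    # '+' applies to 'b' only
grouped = re.compile(r"(ab)+")      # '+' applies to the whole group 'ab'

print(bool(single_atom.fullmatch("abbbbbb")))  # 'a' followed by several 'b'
print(bool(single_atom.fullmatch("abab")))     # the second 'a' is not allowed
print(bool(grouped.fullmatch("abab")))         # 'ab' twice
print(bool(grouped.fullmatch("ababab")))       # 'ab' three times
print(bool(grouped.fullmatch("abbb")))         # extra 'b' is not a full 'ab'
```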

I will continue later. Right now I am going to sleep, but there are a few things that still need to be explained, such as:
- referencing parenthesized groups using the \1, \3 notation
- anchors
- interesting extensions (? ... )
- more quantifiers {m,n} (not that I consider them particularly useful in the regular expressions typically used in Calibre.)
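Until those explanations are written, here is a rough Python sketch of each item in the list. The sample strings are invented for the illustration; only standard `re` syntax is used.

```python
import re

# Back-reference: \1 must repeat exactly what group 1 captured.
doubled = re.search(r"(\w+) \1", "it was was late")

# Anchor: '^' only matches at the start of the string.
at_start = re.search(r"^Chapter", "Chapter 1")
not_at_start = re.search(r"^Chapter", "See Chapter 1")

# Extension syntax: (?P<name>...) gives a group a name.
named = re.search(r"(?P<index>\d+)", "The Shadow 331")

# Counted quantifier: {2,3} means two or three repetitions.
two_or_three = [bool(re.fullmatch(r"\d{2,3}", s)) for s in ["3", "33", "331", "3310"]]

print(doubled.group(0))
print(at_start is not None, not_at_start is None)
print(named.group("index"))
print(two_or_three)
```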

We should also develop a few very typical examples, useful for an ordinary user, such as processing a filename that *might* contain series information (here we will use the pipe '|' to process several branches, with and without series info).
So, please, if you want to solve your typical problem, post it here, so we could develop some examples using real-life situations.

Disclaimer: Please feel free to use any portion of my text for improvement of the "introduction"
kacir is offline  
Old 09-23-2010, 05:58 PM   #39
kovidgoyal
creator of calibre
Posts: 25,451
Karma: 4961459
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I do intend to add this to the User Manual (with Manichean's permission) when he is done updating it.
kovidgoyal is offline  
Old 09-23-2010, 06:48 PM   #40
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by kacir View Post
Great post.
I have a few suggestions.
At the very beginning of the first post you might put something like:
This is the fourth version of the guide, and it was amended using various suggestions in subsequent posts.
I do not suggest this to get some credit for a suggestion or two, but without this explanation a first time reader might find some of the following posts ... superfluous.
I listed the edit history at the end, though I admit I didn't credit contributors; it's more of a "you know who you are" thing. I think I'll do that retrospectively.
I don't know about the superfluous following posts, though. I see the first post to be kind of stand-alone and the thread to be a discussion of what can and should be improved. I hoped to have made that point through the introductory and final comments, do you think I should clarify?

Quote:
Originally Posted by Starson17 View Post
To get this thread back on track: you will find flags, such as DOTALL, used in many of the recipes. Customizing recipes inhabits the same sort of advanced-user middle ground as the advanced conversion options do.
Yes, I know about usage in Python code, but wanted this introduction to mainly be about what we most often see as help requests: conversion header/footer removal and matching of metadata. Somehow I managed to skip the part of the Python manual that explained the use of flags inside the expression itself... as you'll see, I've since added at least two of the flags.

Quote:
Originally Posted by kovidgoyal View Post
I do intend to add this to the User Manual (with Manichean's permission) when he is done updating it.
With pleasure. Though I can't tell you when that will be, because everyone seems to constantly come up with new and valid points.
Manichean is offline  
Old 09-23-2010, 06:56 PM   #41
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by kacir View Post
I will continue later. Right now I am going to sleep, but there are a few things that still need to be explained, such as:
- referencing parenthesized groups using the \1, \3 notation
- anchors
- interesting extensions (? ... )
- more quantifiers {m,n} (not that I consider them particularly useful in the regular expressions typically used in Calibre.)
Do you really think this is useful to the average user? This is really more about getting people started than explaining every little detail regular expression syntax offers.

Quote:
Originally Posted by kacir View Post
We should also develop a few very typical examples, useful for an ordinary user, such as processing a filename that *might* contain series information (here we will use the pipe '|' to process several branches, with and without series info).
So, please, if you want to solve your typical problem, post it here, so we could develop some examples using real-life situations.
I very much agree. This is what is lacking most of all at this point. I should be able to get around to it in the next few days (there's a weekend coming up, after all).
Manichean is offline  
Old 09-23-2010, 07:02 PM   #42
theducks
Grand Sorcerer
Posts: 14,282
Karma: 5495472
Join Date: Aug 2009
Location: The (original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
My 2 cents
Put a CURRENT Revision Date at the Beginning (or in the title)

Follow that with a note that this is a living document and that the first post will be revised based on input received later in the thread
(no need to drill down to see if there are changes).
theducks is offline  
Old 09-23-2010, 07:10 PM   #43
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by theducks View Post
My 2 cents
Put a CURRENT Revision Date at the Beginning (or in the title)

Follow that with a note that this is a living document and that the first post will be revised based on input received later in the thread
(no need to drill down to see if there are changes).
That makes sense. I'll do that.
Manichean is offline  
Old 09-23-2010, 07:48 PM   #44
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I think moving the flags discussion to the front brings up a rather advanced topic a little early. IGNORECASE is handy, but it can be worked around easily enough, and re.DOTALL is only useful in specific cases. I'd put them at the end as an addendum.

Also, you repeat this example twice:
Code:
Hello, World!(?is)
I think you meant the second to be
Code:
(?is)Hello, World!
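To illustrate the placement, here is a small Python sketch with the inline flags at the very start of the expression. The sample text and variable names are assumptions. As far as I know, recent Python versions (3.11 and later) reject global inline flags that are not at the start, so the leading position is the safe one.

```python
import re

# (?i) ignores case and (?s) lets '.' match a newline; both apply to the whole pattern.
inline = re.compile(r"(?is)hello,.world!")

# The same flags passed as arguments instead of inline.
explicit = re.compile(r"hello,.world!", re.IGNORECASE | re.DOTALL)

text = "He said: HELLO,\nWORLD!"

inline_match = inline.search(text)
explicit_match = explicit.search(text)

print(repr(inline_match.group(0)))
print(inline_match.group(0) == explicit_match.group(0))
```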
Expanding the header removal section would probably help. Right now it basically just says grab everything between the <p></p> tags, which interpreted most simply would match every single line in the book. I think you could continue working with that example and build the regex piece by piece.

This was the first example:
Code:
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
If you look at that simple example, you might create a regex that looks like this:
Code:
<p[^>]*>\s*.*?Generated\s+by\s+ABC?\s+Amber.*?</p>
This will work well with that single example, but it's really important to look across the entire book to see what matches with the Magic Wand. As I noted, in the actual book there are other examples where that recommendation would have done very bad things. Here is an example:
Code:
<p class="calibre4">I looked directly at him for a moment. His eyes were still brown. He caught me looking, and I looked down at my desk.</p>
<p class="calibre4">Willie laughed, a wheezing <b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b>snicker of a sound. The laugh hadn't changed. "Geez, I love it. You're afraid of me."</p>
<p class="calibre4">"Not afraid, just cautious."</p>
You can see that in this example the header has been injected into the middle of an actual paragraph, so you can't get rid of everything between the <p></p> tags. We need to do it based on the surrounding bold tags instead:
Code:
<b[^>]*>.*?Generated\s+by\s+ABC?\s+Amber.*?</b>
The final regex I proposed in the thread you linked was a bit more complicated for a couple reasons. I tried to make it generic so it also supports pdf. I also try to stay away from .* as it can easily match unintended text. That required me to write extra patterns to accommodate the rest of the header. Finally it seems that there are different versions of the Amber tool creating slightly different header variations, so I tried to cover all the ones I'd seen examples of.
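The difference between the paragraph-based and the bold-tag-based approach can be tried out in Python. This is a sketch under assumptions: the two samples are shortened versions of the excerpts above, the opening tags are written as `<p[^>]*>` and `<b[^>]*>` so that tag attributes are skipped, and the variable names are invented.

```python
import re

# The injected header, as it appears in both excerpts.
HEADER = (
    '<b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
    'erter, http://www.processtext.com/abclit.html</a></b>'
)

# Header in a paragraph of its own (shortened first excerpt).
standalone = (
    'New laws do not change that. </p><p class="calibre4">\n'
    + HEADER
    + '</p><p class="calibre4">\nIt had only been two years.'
)

# Header injected into the middle of real text (shortened second excerpt).
inline = (
    '<p class="calibre4">Willie laughed, a wheezing '
    + HEADER
    + 'snicker of a sound.</p>'
)

by_paragraph = re.compile(r"<p[^>]*>\s*.*?Generated\s+by\s+ABC?\s+Amber.*?</p>")
by_bold = re.compile(r"<b[^>]*>.*?Generated\s+by\s+ABC?\s+Amber.*?</b>")

# The paragraph pattern is fine for the standalone header ...
print(by_paragraph.sub("", standalone))
# ... but deletes the whole paragraph, story text included, in the inline case.
print(repr(by_paragraph.sub("", inline)))
# The bold-tag pattern removes only the header and keeps the sentence.
print(by_bold.sub("", inline))
```

In the standalone case the bold-tag pattern leaves an empty paragraph behind, which is one reason a production expression ends up more elaborate than either of these.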
ldolse is offline  
Old 09-23-2010, 09:24 PM   #45
DoctorOhh
US Navy, Retired
Posts: 8,782
Karma: 12516053
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by Manichean View Post
I'm talking about using the search feature with regular expressions, which, according to the program's help, uses the Scintilla regexp engine and is fixed to per-line matching. Does changing the language change this behaviour?
Short answer: No, I didn't.
I don't know if it makes a difference in the search, but as you change the language from HTML to C++ to Java to Python, the editor changes its behavior.
DoctorOhh is online now  
Tags
regexp calibre tutorial
