Structure Detection - Remove Header (or Footer) Regex - Page 3

Starson17 · 09-27-2010, 11:56 AM

Quote:

Originally Posted by Manichean

Without intending any offense, at least in one point, it's the latter: You should have a look at where you put your quantifiers (they repeat the preceding characters).

This is actually a very common mistake. It's based on familiarity with wildcards, where the "*" is a character, whereas in regex it's a quantifier for something else. An ab initio reading of the explanation of regex "*" and "+" sometimes causes the user to think they are wildcards for "zero or more characters" and "one or more characters" instead of quantifiers meaning "zero or more of the preceding character(s)" and "one or more of the preceding character(s)."

It's a very understandable error for anyone familiar with wildcards, but not quantifiers. (Perhaps it's worth a brief comment in your excellent beginner's tutorial about the difference between wildcards and quantifiers.)

varmemester · 09-28-2010, 07:21 AM

I have a lot of lines looking like this:
0465002214_Cochran 11/20/08 2:41 PM Page xi

What should the regex look like to remove those? And where do I put it?

In Structure Detection I have ticked 'Remove Header' and 'Remove Footer'. I wonder what the chapter mark options do ('pagebreak', 'rule', 'both', 'none')?

Manichean · 09-28-2010, 07:46 AM

Quote:

Originally Posted by varmemester

I have a lot of lines looking like this:
0465002214_Cochran 11/20/08 2:41 PM Page xi

What should the regex look like to remove those? And where do I put it?

You put it into either the header regular expression field or the footer regular expression field. Also, if you only want to remove one line, only use one of the removal options.
You could try the regular expression

Code:

\d+_Cochran\s+\d+/\d+/\d+\s+\d+:\d+\s+PM\s+Page\s+\w+

For further information see the tutorial.
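As a quick check outside calibre, the expression above can be tried against the sample line with Python's `re` module. This is only a sketch: the `strip_header` helper and the surrounding test text are made up for illustration, and calibre's own handling of the field may differ.

```python
import re

# Pattern from the post above; the sample line is the one quoted in the thread.
HEADER_RE = re.compile(r"\d+_Cochran\s+\d+/\d+/\d+\s+\d+:\d+\s+PM\s+Page\s+\w+")

sample = "0465002214_Cochran 11/20/08 2:41 PM Page xi"


def strip_header(text: str) -> str:
    """Remove every occurrence of the running header from text (hypothetical helper)."""
    return HEADER_RE.sub("", text)


match = HEADER_RE.search(sample)
print(match.group(0) if match else "no match")

# Hypothetical page boundary with the header sitting between two lines of text.
page = "end of page\n" + sample + "\nstart of next page"
print(repr(strip_header(page)))
```

Note that the pattern hard-codes `PM`, so a line stamped `AM` would be left alone; replacing `PM` with `[AP]M` would cover both, if such lines exist in the book.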

Quote:

Originally Posted by varmemester

In Structure Detection I have ticked 'Remove Header' and 'Remove Footer'. I wonder what the chapter mark options do ('pagebreak', 'rule', 'both', 'none')?

You only need to tick one of the removal options and then customize the regular expression to fit. The chapter mark option selects how detected chapter breaks are marked: either with a new page, a horizontal line, both, or none of the above. You should also get a helpful text explaining what a certain option does when you hover your mouse cursor above said option.

Techracer · 01-06-2011, 02:56 AM

Hi,
I'm also having trouble with this implementation of regex. I've looked at the tutorial links, and I've used regex elsewhere before. I'm used to using ^ as meaning the start of the line and this is not working for me.

I'm converting from PDF; the book in question is a free Doctor Who short story on the BBC web site. It has the BBC logo, the web URL and another icon as header or footer on every page. It also has the page number, but I want to get rid of that.
Trying to use (^\d+<br>) to match
2<br>
at the start of a line only, but it isn't working. If I remove the ^ it finds it, but also page numbers from the contents page, which are at the end of the line.

Is there some other indicator of "start of line" that I should use?

Cheers,
Damian

ldolse · 01-06-2011, 05:47 AM

Try looking for a newline character before the text you want to remove:

Code:

\n\d+<br>
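The difference between `^` and a literal `\n` can be seen in plain Python. The sketch below assumes the conversion text behaves like one long string searched without the multiline flag; the thread does not confirm how calibre applies the expression, and the `html` fragment is a made-up example.

```python
import re

# Hypothetical fragment of converted PDF markup: a contents entry whose page
# number sits at the end of a line, and a page number on a line of its own.
html = "Chapter One ..... 2<br>\nsome story text<br>\n2<br>\nmore story text<br>"

# Without re.MULTILINE, ^ only matches at the very start of the whole string.
no_flag = re.findall(r"^\d+<br>", html)
# Unanchored, the contents-page number is caught as well.
unanchored = re.findall(r"\d+<br>", html)
# Requiring a preceding newline, as suggested above.
after_newline = re.findall(r"\n\d+<br>", html)
# With the inline multiline flag, ^ matches at the start of each line.
multiline = re.findall(r"(?m)^\d+<br>", html)

print(no_flag, unanchored, after_newline, multiline)
```

The `\n` form also consumes the newline itself and cannot match a page number on the very first line of the text, which is usually harmless for header and footer removal.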

Confuzzled · 01-19-2011, 03:38 AM

Hey everyone, I'm pretty new to regex coding as well, but I've been reading about it and trying to figure it out... I have 2 things I'm trying to get rid of: the ABC Amber LIT Converter lines and that aa bb pdf transform.

For the ABC Amber LIT Converter lines I've tried every piece of code on this thread and others I've found, and tried all the logic I can think of, but it still just doesn't work.

My other issue is that people talk about things going yellow when they are to be removed, but nothing ever gets highlighted in my calibre 7.40, even when I use an example from Kovid. Is that my issue or is that just a setting? (I've ticked the remove header box, obviously.)

Can someone please just give me a copy-paste piece of code so I can sort this out? It's KILLING ME!

Manichean · 01-19-2011, 03:45 AM

Quote:

Originally Posted by Confuzzled

Hey everyone, I'm pretty new to regex coding as well, but I've been reading about it and trying to figure it out... I have 2 things I'm trying to get rid of: the ABC Amber LIT Converter lines and that aa bb pdf transform. For the ABC Amber LIT Converter lines I've tried every piece of code on this thread and others I've found, and tried all the logic I can think of, but it still just doesn't work. Also, people talk about things going yellow when they are to be removed, but nothing ever gets highlighted in my calibre 7.40, even when I use an example from Kovid. Is that my issue or is that just a setting? (I've ticked the remove header box, obviously.) Can someone please just give me a copy-paste piece of code so I can sort this out? It's KILLING ME!

You did read the tutorial, didn't you? Also, the Amber LIT converter headers are notorious for changing their markup (what's written in the XHTML) from document to document, sometimes even inside one document. So in order to help you, we'd at least need an example of the XHTML you want removed. A more precise description than "things don't turn yellow" would also help: in addition to the regexes you found and tested, what "additional logic", as you say, did you try?

itimpi · 01-19-2011, 04:13 AM

Quote:

Originally Posted by Confuzzled

My other issue is that people talk about things going yellow when they are to be removed, but nothing ever gets highlighted in my calibre 7.40, even when I use an example from Kovid. Is that my issue or is that just a setting? (I've ticked the remove header box, obviously.)

This refers to when you are using the Wizard (the button to the right of the box holding the regex expression) and have pressed the 'Test' button in the wizard. The text that matches the regex (if any) is then highlighted in yellow in the main window of the wizard.

Confuzzled · 01-19-2011, 05:45 AM

Quote:

Originally Posted by itimpi

This refers to when you are using the Wizard (the button to the right of the box holding the regex expression) and have pressed the 'Test' button in the wizard. The text that matches the regex (if any) is then highlighted in yellow in the main window of the wizard.

Yeah, I did use the test wizard, sorry if I wasn't clear... That's the oddity: no yellow, even when I just put in a simple string which, according to the user manual, should come up.

The code I played with is modifications of this code:
<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*? which, as far as I'm aware, should match. I came up with something to this effect, but using e.g. a page break instead of bold. I wasn't sure of my defining structure though, so I took this Kovid structure and then, when that didn't work, tried to edit it until it did.

I also tried removing the <a>, i.e. the HTML link, but it didn't work either. Delphi was always my preference to Python.

My problem is this repeating code:
Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre2">erter, http://www.processtext.com/abclit.html</a>

Thanks so much

Manichean · 01-19-2011, 05:57 AM

That's weird. I just tested the regex you gave with the string you gave, and as I expected, it matches with no problems. Are there any linebreaks in the XHTML that you edited out?
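For anyone wanting to reproduce this kind of check outside the wizard, here is a minimal sketch using Python's `re` module. The `<b class="calibre1">` wrapper is an assumption added for illustration; the thread only shows the inner text, and the wizard may be fed different markup.

```python
import re

pattern = re.compile(r"<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?")

# The line exactly as pasted in the thread, with no surrounding markup.
pasted = (
    'Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" '
    'class="calibre2">erter, http://www.processtext.com/abclit.html</a>'
)

# Hypothetical surrounding markup: the <b class="calibre1"> wrapper is an assumption.
wrapped = '<b class="calibre1">' + pasted + "</b>"

bare_match = pattern.search(pasted)
wrapped_match = pattern.search(wrapped)

print(bare_match)
print(wrapped_match.group(0))
```

Under Python's `re`, the pattern needs an opening tag starting with `<b` in front of the text, so the test string has to include that markup. The trailing lazy `.*?` matches as little as possible, which here is nothing, so the match stops right after `LIT` and the rest of the line would be left in place.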

Confuzzled · 01-19-2011, 06:00 AM

Another example of my issue: if I'm trying to remove a standard page number, not bold, shouldn't
(Page [0-9]+)
work? It doesn't.

Confuzzled · 01-19-2011, 06:02 AM

I know! I'm not crazy, there is something weird, hey? Should I reinstall calibre, do you think?

Confuzzled · 01-19-2011, 06:03 AM

Quote:

Originally Posted by Manichean

That's weird. I just tested the regex you gave with the string you gave, and as I expected, it matches with no problems. Are there any linebreaks in the XHTML that you edited out?

No, that's exactly as it is in the code!

Manichean · 01-19-2011, 06:08 AM

Quote:

Originally Posted by Confuzzled

Another example of my issue: if I'm trying to remove a standard page number, not bold, shouldn't
(Page [0-9]+)
work? It doesn't.

Again, that quite depends on the markup the page number actually has. The regex should work for any page number that is preceded by the word "Page ", which, actually, may be a bit too indiscriminate, as there might be references to pages in the text. Anyway, you're absolutely sure that you're typing the regex correctly in the text box of the wizard, actually pressing the test button, and scrolling down to see if anything gets highlighted?
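The "too indiscriminate" point can be illustrated with a short Python sketch. The sample text and the `<p>` wrapper in the stricter variant are assumptions; the real markup has to be read from the converted file.

```python
import re

# Hypothetical converted text: one standalone page marker and one in-text reference.
text = "<p>Page 12</p>\n<p>As noted on Page 3 of the report, sales rose.</p>"

# The expression from the post: matches the word "Page " followed by digits anywhere.
loose = re.compile(r"(Page [0-9]+)")
# Stricter variant (an assumption about the markup): the marker must fill its own <p>.
strict = re.compile(r"<p>\s*Page [0-9]+\s*</p>")

loose_hits = loose.findall(text)
strict_hits = strict.findall(text)
cleaned = strict.sub("", text)

print(loose_hits)
print(strict_hits)
print(repr(cleaned))
```

The loose form would also delete the reference inside the sentence, while the stricter one only removes the marker that stands alone in its paragraph.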

Confuzzled · 01-19-2011, 06:37 AM

Hundred percent sure... I copied the code from the bar into my reply. Could I possibly attach the source document so you can try it in your calibre?
