Structure Detection - Remove Header (or Footer) Regex - Page 4

Manichean · 01-19-2011, 07:39 AM

Depends on whether it is copyrighted or not. Also, it depends on whether I have time to mess around with it or not, which is currently not looking too good. If the document is not copyrighted, though, you could attach it and hope that someone else tries.

Confuzzled · 01-19-2011, 07:50 AM

ok actually i think i have it working

itimpi · 01-19-2011, 07:56 AM

Quote:

Originally Posted by Confuzzled

hundred percent sure.... i copied the code from the bar into my reply.... could i possibly attach the source document so u can try it in your calibre?

Only if it is not covered by copyright. Posting copyrighted files is likely to get you banned from the forum.

Confuzzled · 01-19-2011, 08:00 AM

is it possible to chain regex codes in the header bar? e.g. set up a standard code that removes page numbers, abc, pdf transform etc in one? right now i can only do one at a time...

Manichean · 01-19-2011, 08:12 AM

See this part of the tutorial I linked to earlier.

Confuzzled · 01-19-2011, 08:24 AM

1 more question sorry everyone

i tried using
(<p.*?><a.*?></a></p>)
to get rid of the Abby stuff and it highlights in yellow but doesn't remove. am i making a mistake?

Manichean · 01-19-2011, 08:35 AM

Quote:

Originally Posted by Confuzzled

1 more question sorry everyone

i tried using
(<p.*?><a.*?></a></p>)
to get rid of the Abby stuff and it highlights in yellow but doesn't remove. am i making a mistake?

That only matches the tags, not what's enclosed in them. Or, to be more precise, it matches the tags without anything enclosed in them, so it shouldn't even highlight anything from the Abby stuff. I'm beginning to think there's something seriously wonky with your Calibre.

Confuzzled · 01-19-2011, 08:39 AM

really? i was under the impression it would remove any page break with an html link inside no matter what the html link is (or at least that's what i was going for and what gets highlighted). any help with it then?

Manichean · 01-19-2011, 08:44 AM

The <p> stuff is a paragraph, not a page break. Just saying because it's been driving me crazy.
Regexes only do string matching, with no interpretation of what the strings are, do or mean. There's no text to match in between your opening and closing tags, thus, nothing should get matched except for an empty link tag inside a paragraph tag.
It seems to me that while you obviously know something about regular expressions, you may have missed the concept. Think about the string matching part of the second paragraph a little.
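
A minimal Python sketch of the point about empty tags, assuming calibre's regexes behave like Python's `re` module. The sample markup is hypothetical, since the actual document was not attached to the thread:

```python
import re

# Hypothetical markup: one empty anchor and one anchor with text inside it.
html = '<p><a name="page1"></a></p><p><a href="#n1">ABBYY note</a></p>'

# Nothing is allowed between <a ...> and </a>, so only empty links can match.
empty_link = re.compile(r'<p.*?><a.*?></a></p>')

# The extra .*? between the tags lets the link text match as well.
any_link = re.compile(r'<a.*?>.*?</a>')

empty_matches = empty_link.findall(html)
any_matches = any_link.findall(html)

print(empty_matches)  # ['<p><a name="page1"></a></p>']
print(any_matches)    # ['<a name="page1"></a>', '<a href="#n1">ABBYY note</a>']
```

The first pattern skips the second paragraph because the literal text `ABBYY note` sits between the tags and nothing in the pattern accounts for it.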

Confuzzled · 01-19-2011, 08:57 AM

thanks i'd rather know... sorry it just made more sense to be a page break than a paragraph in the context of the links but in general context paragraph does make more sense so thanx... i think my main problem is bringing my previous programming knowledge into regex.
i know the purpose is to match strings but the tutorial makes references to matching all strings no matter what the actual string is as long as it's within a particular function (it's generally referred to as a problem e.g. trying to remove a bold page number and removing every bold string in the document) therefore i thought the same concept could be applied to the html links... I suppose i'm wrong oh well

thanks very much manichean u've bn really helpful even tho i've bn a really slow learner

Confuzzled · 01-19-2011, 09:04 AM

dont worry got it working its supposed to be:
<a.*?>.*?</a>
thanks alot manichean

Manichean · 01-19-2011, 09:14 AM

Quote:

Originally Posted by Confuzzled

i know the purpose is to match strings but the tutorial makes references to matching all strings no matter what the actual string is as long as it's within a particular function (it's generally referred to as a problem e.g. trying to remove a bold page number and removing every bold string in the document) therefore i thought the same concept could be applied to the html links... I suppose i'm wrong oh well

I hope it doesn't say it quite that way, as that would be wrong. Could you quote the paragraph you've understood to say that, please? It might have to be clarified in the tutorial.

Quote:

Originally Posted by Confuzzled

dont worry got it working its supposed to be:
<a.*?>.*?</a>

Yeah, that should work. Be aware, though, that this removes all links from the document. Depending on what you convert, that may not be desirable.
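
To illustrate the caveat about losing every link, here is a small Python sketch. The markup and the `name="abbyy..."` attribute are made-up assumptions for illustration, not taken from the poster's document:

```python
import re

# Hypothetical markup: one link worth keeping, one unwanted anchor.
html = ('<p>See <a href="http://example.com">the site</a>.</p>'
        '<p><a name="abbyy1">ABBYY FineReader</a></p>')

# The pattern from the thread removes every link, wanted or not.
stripped_all = re.sub(r'<a.*?>.*?</a>', '', html)

# A narrower pattern (an assumption about what the unwanted links look like)
# only removes anchors whose name attribute starts with "abbyy".
stripped_some = re.sub(r'<a name="abbyy[^"]*">.*?</a>', '', html)

print(stripped_all)   # <p>See .</p><p></p>
print(stripped_some)  # <p>See <a href="http://example.com">the site</a>.</p><p></p>
```

Anchoring the pattern on something specific to the unwanted links is one way to keep the rest of the document's links intact.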

Confuzzled · 01-19-2011, 10:11 AM

"thus we could remove everything between those tags using <b.*?>.*?</b>"
I know but i wanted the code for a particular set.... thanks again

Manichean · 01-19-2011, 10:29 AM

Quote:

Originally Posted by Confuzzled

"thus we could remove everything between those tags using <b.*?>.*?</b>"
I know but i wanted the code for a particular set.... thanks again

Notice, though, that there is a wildcard followed by a quantifier in between those two tags. They weren't there in the regex you posted earlier.
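
The difference the wildcard and quantifier make can be sketched in Python as follows. The sample markup is hypothetical, and the `[^>]*` and `\d+` variant is a substitution added here to show one way of targeting "a particular set"; it is not something the tutorial or the thread prescribes:

```python
import re

# Hypothetical markup: a bold heading and a bold page number.
html = '<p><b>Chapter One</b></p><p><b>12</b></p>'

# No wildcard between the tags: only an empty <b></b> could match.
no_wildcard = re.findall(r'<b.*?></b>', html)

# Wildcard plus quantifier between the tags: every bold string matches.
with_wildcard = re.findall(r'<b.*?>.*?</b>', html)

# Narrower variant: only bold runs made up entirely of digits.
# [^>]* keeps the match inside a single opening tag.
digits_only = re.findall(r'<b[^>]*>\d+</b>', html)

print(no_wildcard)    # []
print(with_wildcard)  # ['<b>Chapter One</b>', '<b>12</b>']
print(digits_only)    # ['<b>12</b>']
```

Using `[^>]*` instead of `.*?` in the opening tag matters here: with `.*?`, the digits-only pattern could start at the first `<b` and stretch across the heading to reach the page number.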

CazMar · 02-01-2011, 02:18 AM

I notice the "remove Header" and "remove footer" options have gone in Calibre - could someone point me in the right direction of how to do this very useful job? I presume there is a new way.


