Structure Detection - Remove Header (or Footer) Regex

DarkKipper · 03-02-2010, 05:42 AM

Is there any good way of referencing variables like the title of the book in the regular expression?

I've noticed a lot of books, particularly if converted from PDF, have the book title in the header of every page, interfering with the flow of the text, like

title</p><p>

I have quite a good regex set up to remove the common file path footer, page numbers alone on a line, and traces of the abbyy and amber abc converters, and it would be nice to automatically remove a repeated title. I know I can always manually add the actual string for a specific conversion, but it'd be great to do it automatically.

Any thoughts?

kovidgoyal · 03-02-2010, 12:16 PM

No, I'm afraid there isn't.
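Since calibre offers no title variable here, the manual route DarkKipper mentions (typing in the actual string for each book) can at least be made less error-prone. The following is a minimal sketch that runs outside calibre, in plain Python. The helper name `title_header_pattern` and the sample title are made up for illustration. It escapes the title and optionally swallows the paragraph break that follows it, and the resulting string can then be pasted into the header regex box by hand.

```python
import re

def title_header_pattern(title: str) -> str:
    """Hypothetical helper: build a regex string for a repeated title header."""
    # re.escape neutralises regex metacharacters in the title (e.g. '?', '.', '(').
    # The optional group also consumes a </p><p ...> break right after the title.
    return re.escape(title) + r"\s*(</p>\s*<p[^>]*>)?"

# Made-up title and text, shaped like the "title</p><p>" case in the question.
pattern = title_header_pattern("Example Book Title")
text = "she walked to the Example Book Title</p><p> door and opened it."
cleaned = re.sub(pattern, "", text)
print(pattern)
print(cleaned)
```

The pattern still has to be regenerated for every book, so this is a convenience rather than the automatic feature asked for.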

Starson17 · 03-02-2010, 12:55 PM

Quote:

Originally Posted by DarkKipper

I have quite a good regex set up to remove the common file path footer, page numbers alone on a line, and traces of the abbyy and amber abc converters

Will you share it here?

TheBard · 03-10-2010, 02:46 AM

Well, I'm not DarkKipper, but here are a few regular expressions I use. They have worked on my test files, but could probably be improved or modified:

Delete header/footer that starts with "file:///" and ends with either ".txt" or ".htm" or ".html"
file:///.+\.(txt|html|htm)

Delete line that starts with "file:///" and ends with numbers
file:///.+\d

Combine the two above
file:///.+(\d|(txt|html|htm))

Delete a segment of a line in which the segment ends with a specific string
.* - Baroness Orczy
(the " - Baroness Orczy" is in the line)

Here is one that seems to work, but might need a bit of tweaking. It looks for EITHER a line that starts with "file:///" and ends with numbers, OR a line that starts with a specified string, and deletes the found string. Quite handy when looking for headers / footers that may vary somewhat across a subdirectory
(file:///.+\d|Baroness Orczy.*)

Header with "Generated By ABC ... etc .html" (the ABC Amber header)
Generated by.+html
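To see how much of a line each of these expressions actually consumes, they can be tried in plain Python. This is only a sketch: the sample lines and file paths are invented, and calibre's own conversion may feed the regex different text than shown here.

```python
import re

# The expressions listed above, keyed by a short description.
patterns = {
    "ends in extension": r"file:///.+\.(txt|html|htm)",
    "ends in digit": r"file:///.+\d",
    "combined": r"file:///.+(\d|(txt|html|htm))",
    "ABC Amber header": r"Generated by.+html",
}

# Invented sample lines for illustration.
samples = [
    "file:///C:/books/example.htm",
    "file:///C:/books/example.htm (3 of 12)",
    "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html",
]

for name, pat in patterns.items():
    for line in samples:
        m = re.search(pat, line)
        # Show exactly which part of the line would be removed, if any.
        print(f"{name:18} | {line!r} -> {m.group(0) if m else None!r}")
```

One thing the output makes visible: a pattern that ends in an extension or a digit stops matching there, so anything after it on the same line (such as a closing parenthesis) is left in the text.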

Google "The Regex Coach" for a very nice freeware that is extremely helpful in designing regexes.

Hope these help!

Wreybies · 08-19-2010, 06:57 PM

Bard, those were excellent! Thank you because I was as clueless as Alicia Silverstone after her career nosedived.

Now, one footer that still bugbears me is when there is this on the end:

file:/// blah blah blah.txt (1 of 129) [2/4/03 9:31:57 PM]

When I run the regex to make that footer go away, and test it before the actual conversion, the whole line of offending footer goes yellow as if it is going to go bye-bye, but in the end result, everything from the "(1 of" part to the "PM]" remains in the final conversion. Even when I run it again, epub to epub this time to debug, I still get it even though the test makes it look as if it will delete it.

What am I doing wrong? Anyone?

I used: file:///.+.PM]
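As a quick sanity check, the posted expression can be run against the sample footer in plain Python. This sketch only confirms what the preview shows, namely that the expression covers the whole line. It does not reproduce calibre's conversion pipeline, so it cannot say why the tail survived in the output. The second pattern is an assumed variant with the bracket escaped explicitly.

```python
import re

footer = "file:/// blah blah blah.txt (1 of 129) [2/4/03 9:31:57 PM]"

as_posted = r"file:///.+.PM]"    # exactly as posted above
escaped = r"file:///.+\sPM\]"    # assumed variant: bracket and space spelled out

for pat in (as_posted, escaped):
    m = re.search(pat, footer)
    # Print the matched portion so it can be compared with the full line.
    print(pat, "->", repr(m.group(0)) if m else None)
```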

PCreighton · 09-11-2010, 08:27 PM

So I understand you use regex, but where do you place this line to exclude the header and footer?

ldolse · 09-11-2010, 09:04 PM

Quote:

Originally Posted by PCreighton

So I understand you use regex, but where do you place this line to exclude the header and footer?

In the conversion options go to Structure detection, there is a text box to place regexes for headers/footers, as well as a preview function so you can write the regex and see where it matches in the file itself.

Wreybies · 09-17-2010, 11:12 AM

Quote:

Originally Posted by ldolse

In the conversion options go to Structure detection, there is a text box to place regexes for headers/footers, as well as a preview function so you can write the regex and see where it matches in the file itself.

The correct area of Calibre looks like this:

Make sure to check the box for Remove Header. You needn't bother with Remove Footer; I have found that it doesn't really work. The Remove Header area can be used to remove both headers and footers. When the program searches for strings that match your regex, it makes no distinction as to where a string is physically located. If you tap the magic wand lookin' thingie to the right, you get a preview of the text with all the html tags in place, and you can put your regex in the area provided at the top of the preview window to test whether it will flag for removal the items you really want to remove. Don't get frustrated. This is a trial and error process when there are variable strings, and you may need to do the process more than once if there are different kinds of strings that you want gone.

EDIT ~ And on a side note: Quite often the removal of a header or footer will cause inappropriate paragraph breaks. The string of the header or footer has been removed, but if you don't also remove the html tags that surround it, the leftover tags may well cause paragraph breaks or extra carriage returns. If you are a picky bugger like me, then you will want to take those tags into account when you are creating the regex to make go bye-bye the things you want gone.
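The point about surrounding tags can be sketched in plain Python. The HTML fragment and both patterns below are invented for illustration and are not taken from calibre. The first pattern removes only the footer text and leaves an empty paragraph behind, while the second also consumes the `<p ...>` and `</p>` that wrap it.

```python
import re

# Invented fragment: a footer sitting in its own paragraph between two pages.
html = (
    '<p>end of one page</p>'
    '<p class="calibre4">file:///C:/books/example.htm (3 of 12)</p>'
    '<p>start of the next page</p>'
)

bare = r"file:///[^<]+"                      # footer text only
with_tags = r"<p[^>]*>\s*file:///[^<]+</p>"  # footer plus its enclosing paragraph tags

print(re.sub(bare, "", html))       # leaves an empty <p class="calibre4"></p>
print(re.sub(with_tags, "", html))  # removes the whole paragraph
```

Using `[^<]+` instead of `.+` keeps the match from running past the footer into the next tag.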

cybmole · 09-19-2010, 04:46 AM

thanks - I've been searching a while for how to remove the "generated by...."

so I get that I should put
file:///.+(\d|(txt|html|htm))
into preferences structure detection, but what happens to the default expression that's already in there ( which I don't really understand).

do I overwrite it with the above, and if so, what do I lose, i.e. what was the default expression doing that yours may not do?

cybmole · 09-19-2010, 05:00 AM

i can't get this to work:
I have a book in epub with lots of instances of Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html

so I change header detection to file:///.+(\d|(txt|html|htm)) & tick boxes as per above instructions, then force a conversion from epub to epub - the offending spam is still there ???

also, I've screwed up - I copied the default regex to note pad, so that I could put it back again, but I did not grab the entire line, how do I restore the default expression please. is it the same as the footer expression default ?

ldolse · 09-19-2010, 05:12 AM

You overwrite the default expression. As long as you do it from the convert dialog and not the preferences dialog you don't lose anything, it's only lost for that book. Anyway the default is just an example, it generally needs to be edited in order for anything to match.

file:///.+(\d|(txt|html|htm))

won't get rid of the Amber Lit converter message, that will get rid of headers/footers inserted by browsers when pdf printing html. You'll need a different regex for Amber Lit. Try this:

Code:

(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?
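Why the file:/// expression cannot touch the Amber message is easy to check in plain Python: it requires the literal text "file:///", which the Amber line never contains. A small sketch, with an invented browser footer for contrast:

```python
import re

file_footer = r"file:///.+(\d|(txt|html|htm))"

browser_footer = "file:///C:/books/example.html"  # invented example of a browser footer
amber_line = "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html"

# The browser-style footer is found; the Amber line is not, despite ending in "html".
print(bool(re.search(file_footer, browser_footer)))
print(bool(re.search(file_footer, amber_line)))
```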

cybmole · 09-19-2010, 05:20 AM

Quote:

Originally Posted by ldolse

You overwrite the default expression. As long as you do it from the convert dialog and not the preferences dialog you don't lose anything, it's only lost for that book. Anyway the default is just an example, it generally needs to be edited in order for anything to match.

file:///.+(\d|(txt|html|htm))

won't get rid of the Amber Lit converter message, that will get rid of headers/footers inserted by browsers when pdf printing html. You'll need a different regex for Amber Lit. Try this:

Code:

<A name=\d+>\s*</a>\s*(<[biu]>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?</a>)?\s*(</[ibu]>)?\s*<br>

thanks - but it did not work - I pasted in everything inside of [code]...[/code] on the convert - structure detection page from above & forced a reconvert - but the amber lit stuff is still there

as I'm a total noob with regex, could you please look at the attached .epub book & tell me what will work - thanks.

ldolse · 09-19-2010, 05:33 AM

You can't put copyrighted books on Mobileread, I suggest you edit your post and delete it. All you need to do is click on the structure detection wizard and click the magic wand. Find one instance of the 'generated by' message and just copy/paste that text and a few surrounding lines - paste it into a phpbb code block. The epub is worthless for analysis, as a fair amount of processing happens between the footer removal stage and the epub output stage.

cybmole · 09-19-2010, 05:45 AM

Well I do own the paper copy! but OK -
I never knew what the magic wand was for ...

so I follow your instructions as far as locate an instance of the offending spam, then I'm stuck.

is this what you need to see:
[code]
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
[/code]

NB I only have the book in this epub format - that's the format that I found it in. The whole series has this spam throughout.

ldolse · 09-19-2010, 05:56 AM

I thought you were converting from pdf to epub?

Code:

(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?

That should take care of more variants of the spam, including yours.
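The expression above can be tried against a line shaped like cybmole's sample in plain Python. This is a sketch under assumptions: the sample is condensed onto one line with its `<p>`/`<b>` wrappers, and calibre may compile or apply the expression differently during conversion.

```python
import re

# ldolse's expression, split across two string literals only for readability.
amber = re.compile(
    r"(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*"
    r"(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?"
)

# Condensed version of the posted sample, including the surrounding tags.
sample = (
    '<p class="calibre4"> <b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
    'erter, http://www.processtext.com/abclit.html</a></b></p>'
    '<p class="calibre4"> It had only been two years since Addison v. Clark.'
)

cleaned = amber.sub("", sample)
print(cleaned)
```

In this sketch the spam, its link and the bold tags are removed, but the enclosing `<p class="calibre4"></p>` stays behind as an empty paragraph, which is the leftover-tag effect Wreybies described earlier in the thread.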

Please remove the book from your previous post - it doesn't matter whether you own it, the problem is posting it to a public bulletin board that doesn't condone piracy.
