|  03-02-2010, 05:42 AM | #1 | 
| Junior Member  Posts: 1 Karma: 10 Join Date: Mar 2010 Location: London Device: iPhone | 
				
				Structure Detection - Remove Header (or Footer) Regex
			 
			
			Is there any good way of referencing variables, like the title of the book, in the regular expression? I've noticed that a lot of books, particularly those converted from PDF, have the book title in the header of every page, interfering with the flow of the text, like title</p><p>. I have quite a good regex set up to remove the common file path footer, page numbers alone on a line, and traces of the ABBYY and ABC Amber converters, and it would be nice to automatically remove a repeated title too. I know I can always manually add the actual string for a specific conversion, but it'd be great to do it automatically. Any thoughts? | 
|   |   | 
|  03-02-2010, 12:16 PM | #2 | 
| creator of calibre            Posts: 45,604 Karma: 28548974 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			No, I'm afraid there isn't.
		 | 
|   |   | 
|  03-02-2010, 12:55 PM | #3 | 
| Wizard            Posts: 4,004 Karma: 177841 Join Date: Dec 2009 Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T | |
|   |   | 
|  03-10-2010, 02:46 AM | #4 | 
| Bifocal Wearer            Posts: 49 Karma: 38902 Join Date: Jan 2010 Location: USA Device: Kobo Touch, Aura, Clara ... | 
			
			Well, I'm not DarkKipper, but here are a few regular expressions I use. They have worked on my test files, but could probably be improved or modified:

			Delete a header/footer that starts with "file:///" and ends with either ".txt" or ".htm" or ".html":
			file:///.+\.(txt|html|htm)

			Delete a line that starts with "file:///" and ends with numbers:
			file:///.+\d

			Combine the two above:
			file:///.+(\d|(txt|html|htm))

			Delete a segment of a line in which the segment ends with a specific string (here the " - Baroness Orczy" is in the line):
			.* - Baroness Orczy

			Here is one that seems to work, but might need a bit of tweaking. It looks for EITHER a line that starts with "file:///" and ends with numbers, OR a line that starts with a specified string, and deletes the found string. Quite handy when looking for headers/footers that may vary somewhat across a subdirectory:
			(file:///.+\d|Baroness Orczy.*)

			Header with "Generated by ABC ... etc .html" (the ABC Amber header):
			Generated by.+html

			Google "The Regex Coach" for a very nice freeware tool that is extremely helpful in designing regexes. Hope these help! | 
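For anyone who wants to try these patterns outside calibre first, here is a minimal sketch using Python's `re` module. The pattern strings are copied from the post above; the dictionary keys, the `matching_patterns` helper, and the sample lines are all made up for illustration, so treat it as a rough test harness rather than a picture of what calibre does internally.

```python
import re

# Patterns quoted from the post above; the key names are hypothetical labels.
PATTERNS = {
    "file_ext": r"file:///.+\.(txt|html|htm)",
    "file_digits": r"file:///.+\d",
    "file_combined": r"file:///.+(\d|(txt|html|htm))",
    "author_suffix": r".* - Baroness Orczy",
    "file_or_author": r"(file:///.+\d|Baroness Orczy.*)",
    "abc_amber": r"Generated by.+html",
}

# Invented sample lines (not taken from any real book).
SAMPLES = [
    "file:///C:/books/pimpernel.txt",
    "The Scarlet Pimpernel - Baroness Orczy",
    "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html",
    "An ordinary line of story text.",
]


def matching_patterns(line):
    """Return the labels of every pattern that matches somewhere in the line."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, line)]


for line in SAMPLES:
    print(f"{line!r} -> {matching_patterns(line)}")
```

A line of ordinary story text should come back with an empty list; if it does not, the pattern is too greedy and would eat real content.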
|   |   | 
|  08-19-2010, 06:57 PM | #5 | 
| M.P.  Posts: 7 Karma: 10 Join Date: Aug 2009 Location: Puerto Rico Device: iPad 64GB WiFi | 
			
			Bard, those were excellent! Thank you, because I was as clueless as Alicia Silverstone after her career nosedived.

			Now, one footer that still bugbears me is when there is this on the end:
			file:/// blah blah blah.txt (1 of 129) [2/4/03 9:31:57 PM]

			When I run the regex to make that footer go away, and test it before the actual conversion, the whole line of offending footer goes yellow as if it is going to go bye-bye. But in the end result, everything from the "(1 of" part to the "PM]" remains in the final conversion. Even when I run it again, epub to epub this time to debug, I still get it, even though the test makes it look as if it will be deleted. What am I doing wrong? Anyone?

			I used:
			file:///.+.PM]

			Last edited by Wreybies; 08-19-2010 at 06:59 PM. | 
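One way to narrow this kind of problem down is to check the pattern in isolation. The sketch below runs the posted regex, plus a more explicit variant, against the footer as a single plain-text line in Python's `re` module. The `explicit` pattern, the `strip_footer` helper, and the surrounding `page` text are assumptions added for illustration. It only shows that the pattern covers the whole footer when the footer sits on one line; it does not explain the conversion result, which may depend on how the text is laid out in the HTML at the point where removal runs.

```python
import re

# Footer text as quoted in the post above.
footer = "file:/// blah blah blah.txt (1 of 129) [2/4/03 9:31:57 PM]"

# The pattern exactly as posted. In Python an unpaired "]" is a literal.
as_posted = re.compile(r"file:///.+.PM]")

# Hypothetical variant: escape the brackets and spell out the timestamp part.
explicit = re.compile(r"file:///.+?\[[^\]]*[AP]M\]")


def strip_footer(pattern, text):
    """Remove every match of the compiled pattern from the text."""
    return pattern.sub("", text)


# Invented surrounding text, so the effect of the removal is visible.
page = "end of page text " + footer + " start of next page"

print(strip_footer(as_posted, page))
print(strip_footer(explicit, page))
```

If both variants strip the footer cleanly here but the converted book still shows part of it, the next thing to look at is the raw HTML in the preview window, since tags or line breaks inside the footer would stop `.+` from spanning it.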
|   |   | 
|  09-11-2010, 08:27 PM | #6 | 
| Enthusiast  Posts: 27 Karma: 10 Join Date: Aug 2010 Location: Ontario Canada Device: Kindle 2; Kindle WIFI 6";IPAD 2 | 
				
				sorry really new
			 
			
			So I understand you use a regex, but where do you place this line to exclude the header and footer?
		 Last edited by PCreighton; 09-11-2010 at 08:30 PM. | 
|   |   | 
|  09-11-2010, 09:04 PM | #7 | 
| Wizard            Posts: 1,337 Karma: 123457 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone | 
			
			In the conversion options go to Structure detection, there is a text box to place regexes for headers/footers, as well as a preview function so you can write the regex and see where it matches in the file itself.
		 | 
|   |   | 
|  09-17-2010, 11:12 AM | #8 | |
| M.P.  Posts: 7 Karma: 10 Join Date: Aug 2009 Location: Puerto Rico Device: iPad 64GB WiFi | Quote: 
  Make sure to check the box for Remove Header. You needn't bother with the Remove Footer; I have found that it doesn't really work. The Remove Header area can be used to remove both headers and footers. When the program is searching for strings that match your regex, it makes no distinction as to where that string is physically located.

  If you tap the magic wand lookin' thingie to the right, you get a preview of the text with all the html tags in place, and you can put your regex string in the area provided at the top of the preview window to test whether it will flag for removal the items you really want to remove.

  Don't get frustrated. This is a trial and error process when there are variable strings, and you may need to do the process more than once if there are different kinds of strings that you want gone.

  EDIT ~ And on a side note: quite often the removal of a header or footer will cause inappropriate paragraph breaks. Though the string of the header or footer has been removed, if you don't also remove the html tags that surround that header/footer, this may well cause paragraph breaks or extra carriage returns. If you are a picky bugger like me, then you will want to take those tags into account when you are creating the strings of regex to make go bye-bye the things you want gone.

  Last edited by Wreybies; 09-17-2010 at 02:08 PM. | |
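The point about surrounding tags can be illustrated with a small Python sketch. The HTML snippet, both patterns, and the `remove` helper are hypothetical and assume the footer sits in its own `<p>` element; real files will often wrap it differently, so the tag part of the pattern needs adjusting to whatever the preview window shows.

```python
import re

# Invented HTML: a footer sitting in its own paragraph between two text paragraphs.
html = (
    "<p>He turned the corner and</p>"
    "<p>file:///C:/book.htm (3 of 129)</p>"
    "<p>walked on.</p>"
)

# Matches only the footer text, leaving its <p>...</p> wrapper behind.
bare = r"file:///.+?\(\d+ of \d+\)"

# Also consumes the paragraph tags wrapped around the footer.
with_tags = r"<p[^>]*>\s*file:///.+?\(\d+ of \d+\)\s*</p>"


def remove(pattern, text):
    """Delete every match of the pattern from the text."""
    return re.sub(pattern, "", text)


print(remove(bare, html))       # an empty <p></p> is left in the middle
print(remove(with_tags, html))  # the wrapper goes along with the footer
```

Even the second version leaves the sentence split across two paragraphs; joining them again would mean also matching the `</p>` before the footer and the `<p>` after it, which is only safe when the footer really does interrupt a sentence.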
|   |   | 
|  09-19-2010, 04:46 AM | #9 | 
| Wizard            Posts: 3,720 Karma: 1759970 Join Date: Sep 2010 Device: none | 
			
			Thanks - I've been searching a while for how to remove the "Generated by...." text. So I gather that I should put file:///.+(\d|(txt|html|htm)) into Preferences, Structure Detection, but what happens to the default expression that's already in there (which I don't really understand)? Do I overwrite it with the above, and if so, what do I lose, i.e. what was the default expression doing that yours may not do?

			Last edited by cybmole; 09-19-2010 at 05:09 AM. | 
|   |   | 
|  09-19-2010, 05:00 AM | #10 | 
| Wizard            Posts: 3,720 Karma: 1759970 Join Date: Sep 2010 Device: none | 
			
			I can't get this to work: I have a book in epub with lots of instances of

			Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html

			So I change header detection to file:///.+(\d|(txt|html|htm)) and tick the boxes as per the above instructions, then force a conversion from epub to epub, and the offending spam is still there???

			Also, I've screwed up. I copied the default regex to Notepad so that I could put it back again, but I did not grab the entire line. How do I restore the default expression, please? Is it the same as the default footer expression?

			Last edited by cybmole; 09-19-2010 at 05:06 AM. | 
|   |   | 
|  09-19-2010, 05:12 AM | #11 | 
| Wizard            Posts: 1,337 Karma: 123457 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone | 
			
			You overwrite the default expression. As long as you do it from the convert dialog and not the preferences dialog, you don't lose anything; it's only changed for that book. Anyway, the default is just an example, and it generally needs to be edited in order for anything to match.

			file:///.+(\d|(txt|html|htm)) won't get rid of the Amber Lit converter message; it gets rid of headers/footers inserted by browsers when pdf printing html. You'll need a different regex for Amber Lit. Try this:

			Code:
			(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?

			Last edited by ldolse; 09-19-2010 at 06:31 AM. | 
|   |   | 
|  09-19-2010, 05:20 AM | #12 | |
| Wizard            Posts: 3,720 Karma: 1759970 Join Date: Sep 2010 Device: none | Quote: 
 As I'm a total noob with regex, could you please look at the attached .epub book and tell me what will work? Thanks. Last edited by cybmole; 09-19-2010 at 06:00 AM. | |
|   |   | 
|  09-19-2010, 05:33 AM | #13 | 
| Wizard            Posts: 1,337 Karma: 123457 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone | 
			
			You can't put copyrighted books on MobileRead; I suggest you edit your post and delete it. All you need to do is click on the structure detection wizard and click the magic wand. Find one instance of the 'generated by' message, copy that text and a few surrounding lines, and paste it into a phpbb code block. The epub is worthless for analysis, as a fair amount of processing happens between the footer removal stage and the epub output stage.
		 | 
|   |   | 
|  09-19-2010, 05:45 AM | #14 | 
| Wizard            Posts: 3,720 Karma: 1759970 Join Date: Sep 2010 Device: none | 
			
			Well, I do own the paper copy! But OK - I never knew what the magic wand was for... So I follow your instructions as far as locating an instance of the offending spam, then I'm stuck. Is this what you need to see:

			[code]
			"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4"> <b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4"> It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
			[/code]

			NB I only have the book in this epub format - that's the format that I found it in. The whole series has this spam throughout.

			Last edited by cybmole; 09-19-2010 at 05:47 AM. | 
|   |   | 
|  09-19-2010, 05:56 AM | #15 | 
| Wizard            Posts: 1,337 Karma: 123457 Join Date: Apr 2009 Location: Malaysia Device: PRS-650, iPhone | 
			
			I thought you were converting from pdf to epub?

			Code:
			(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?

			Please remove the book from your previous post. It doesn't matter whether you own it; the problem is posting it to a public bulletin board that doesn't condone piracy.

			Last edited by ldolse; 09-19-2010 at 06:32 AM. | 
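As a rough check, the regex above can be run against the snippet posted in #14 using Python's `re` module. The pattern and the HTML are copied from the thread (the HTML slightly trimmed); the variable names are made up, and this only tests the pattern on its own, not the conversion pipeline.

```python
import re

# The Amber regex as posted above, split over two raw strings for readability.
AMBER = re.compile(
    r"(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*"
    r"(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?"
)

# Snippet from post #14, shortened at both ends.
sample = (
    'New laws don\'t change that." </p><p class="calibre4"> '
    '<b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
    'erter, http://www.processtext.com/abclit.html</a></b></p>'
    '<p class="calibre4"> It had only been two years since Addison v. Clark.'
)

# Remove the converter message and show what is left.
cleaned = AMBER.sub("", sample)
print(cleaned)
```

On this sample the message, its link, and the bold tags are all removed, but the paragraph that held them stays behind as an empty `<p class="calibre4"></p>`, which is the leftover-tag effect described in post #8.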
|   |   | 