03-02-2010, 06:42 AM | #1 |
Junior Member
Posts: 1
Karma: 10
Join Date: Mar 2010
Location: London
Device: iPhone
|
Structure Detection - Remove Header (or Footer) Regex
Is there any good way of referencing variables like the title of the book in the regular expression?
I've noticed that a lot of books, particularly those converted from PDF, have the book title in the header of every page, which interrupts the flow of the text, like title</p><p>. I have quite a good regex set up to remove the common file path footer, page numbers alone on a line, and traces of the ABBYY and ABC Amber converters, and it would be nice to automatically remove a repeated title as well. I know I can always manually add the actual string for a specific conversion, but it'd be great to do it automatically. Any thoughts? |
03-02-2010, 01:16 PM | #2 |
creator of calibre
Posts: 44,556
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
No, I'm afraid there isn't.
|
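Since the conversion dialog cannot reference the title, one workaround is to build the pattern yourself in a small script run outside calibre. The following is only a sketch under assumptions: the function names, the sample text, and the sample title are all hypothetical, and it assumes the header looks like the "title</p><p>" case described in the first post. `re.escape` keeps punctuation in the title from being read as regex syntax.

```python
import re


def title_header_pattern(title: str) -> re.Pattern:
    """Build a regex for 'title</p><p>' headers from a literal title.

    Hypothetical helper: re.escape makes characters such as '?' or '('
    in the title match literally instead of acting as regex operators.
    """
    return re.compile(r"\s*" + re.escape(title) + r"\s*</p>\s*<p\b[^>]*>\s*")


def strip_title_header(html: str, title: str) -> str:
    """Replace each title header with a single space so the sentence rejoins."""
    return title_header_pattern(title).sub(" ", html)


if __name__ == "__main__":
    # Invented sample: a sentence split in two by a page header.
    sample = "<p>It was a dark What If? (Vol. 2)</p><p>and stormy night.</p>"
    print(strip_title_header(sample, "What If? (Vol. 2)"))
```

The escaped title could also be printed and pasted into the header regex box by hand, which keeps the rest of the conversion inside calibre.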
03-02-2010, 01:55 PM | #3 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
03-10-2010, 03:46 AM | #4 |
Bifocal Wearer
Posts: 49
Karma: 38902
Join Date: Jan 2010
Location: USA
Device: Kobo Touch, Aura, Clara ...
|
Well, I'm not DarkKipper, but here are a few regular expressions I use. They have worked on my test files, but could probably be improved or modified:
Delete a header/footer that starts with "file:///" and ends with ".txt", ".htm", or ".html":
file:///.+\.(txt|html|htm)

Delete a line that starts with "file:///" and ends with numbers:
file:///.+\d

Combine the two above:
file:///.+(\d|(txt|html|htm))

Delete a segment of a line in which the segment ends with a specific string (here " - Baroness Orczy" is in the line):
.* - Baroness Orczy

Here is one that seems to work, but might need a bit of tweaking. It looks for EITHER a line that starts with "file:///" and ends with numbers, OR a line that starts with a specified string, and deletes the found string. Quite handy when looking for headers / footers that may vary somewhat across a subdirectory:
(file:///.+\d|Baroness Orczy.*)

Header with "Generated By ABC ... etc .html" (the ABC Amber header):
Generated by.+html

Google "The Regex Coach" for a very nice freeware that is extremely helpful in designing regexes. Hope these help! |
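The patterns above can be tried outside calibre with Python's `re` module. This is a minimal sketch, not calibre's own code: the sample lines and the helper name `strip_matches` are invented for illustration, and deleting matches with `re.sub` is only assumed to approximate what header removal does.

```python
import re

# Patterns quoted from the post above.
PATTERNS = {
    "file_ext": r"file:///.+\.(txt|html|htm)",
    "file_digits": r"file:///.+\d",
    "combined": r"file:///.+(\d|(txt|html|htm))",
    "author_tail": r".* - Baroness Orczy",
    "either": r"(file:///.+\d|Baroness Orczy.*)",
    "abc_amber": r"Generated by.+html",
}

# Hypothetical sample lines, invented here purely for illustration.
SAMPLES = {
    "file_ext": "file:///C:/books/pimpernel.html",
    "file_digits": "file:///C:/books/pimpernel.txt 42",
    "combined": "file:///C:/books/pimpernel.htm",
    "author_tail": "The Scarlet Pimpernel - Baroness Orczy",
    "either": "Baroness Orczy: The Scarlet Pimpernel",
    "abc_amber": "Generated by ABC Amber LIT Converter, "
                 "http://www.processtext.com/abclit.html",
}


def strip_matches(pattern: str, text: str) -> str:
    """Delete every match of pattern from text."""
    return re.sub(pattern, "", text)


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        # Each sample is matched in full, so an empty string is printed.
        print(f"{name:12} {strip_matches(pattern, SAMPLES[name])!r}")
```

Because `.+` is greedy and `.` does not cross line breaks by default, each file:/// pattern removes everything from "file:///" up to the last possible ending on the same line.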
08-19-2010, 07:57 PM | #5 |
M.P.
Posts: 7
Karma: 10
Join Date: Aug 2009
Location: Puerto Rico
Device: iPad 64GB WiFi
|
Bard, those were excellent! Thank you because I was as clueless as Alicia Silverstone after her career nosedived.
Now, one footer that still bugbears me is when there is this on the end:
file:/// blah blah blah.txt (1 of 129) [2/4/03 9:31:57 PM]
When I run the regex to make that footer go away, and test it before the actual conversion, the whole line of offending footer goes yellow as if it is going to go bye-bye, but in the end result, everything from the "(1 of" part to the "PM]" remains in the final conversion. Even when I run it again, epub to epub this time to debug, I still get it even though the test makes it look as if it will delete it. What am I doing wrong? Anyone?
I used:
file:///.+.PM]
Last edited by Wreybies; 08-19-2010 at 07:59 PM. |
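The thread never settles why the tail of this footer survives, and the sketch below does not diagnose it either. It only shows that in plain Python `re` the posted pattern does match the whole footer when it sits on one line, and offers a more explicit variant that spells out the "(n of m)" counter and the bracketed timestamp. The sample path and the names `POSTED`, `EXPLICIT`, and `strip_footer` are assumptions made for illustration.

```python
import re

# Pattern as posted. In Python's re module an unescaped "]" outside a
# character class is a literal character, so this pattern is valid.
POSTED = re.compile(r"file:///.+.PM]")

# More explicit variant (an assumption, not taken from the thread):
# path, then "(n of m)", then a bracketed timestamp ending in AM or PM.
EXPLICIT = re.compile(r"file:///.+?\(\d+ of \d+\)\s*\[[^\]]*[AP]M\]")

# Hypothetical footer modelled on the one quoted in the post.
FOOTER = "file:///C:/books/story.txt (1 of 129) [2/4/03 9:31:57 PM]"


def strip_footer(pattern: re.Pattern, text: str) -> str:
    """Delete every match of the compiled pattern from text."""
    return pattern.sub("", text)


if __name__ == "__main__":
    print(repr(strip_footer(POSTED, FOOTER)))
    print(repr(strip_footer(EXPLICIT, "before " + FOOTER + " after")))
```

If both patterns clear the footer here but not in a conversion, the difference probably lies in how the footer appears in the text calibre actually processes, which is worth checking in the preview window.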
09-11-2010, 09:27 PM | #6 |
Enthusiast
Posts: 27
Karma: 10
Join Date: Aug 2010
Location: Ontario Canada
Device: Kindle 2; Kindle WIFI 6";IPAD 2
|
sorry really new
So I understand you use a regex, but where do you put this line to exclude the header and footer?
Last edited by PCreighton; 09-11-2010 at 09:30 PM. |
09-11-2010, 10:04 PM | #7 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
In the conversion options go to Structure detection, there is a text box to place regexes for headers/footers, as well as a preview function so you can write the regex and see where it matches in the file itself.
|
09-17-2010, 12:12 PM | #8 | |
M.P.
Posts: 7
Karma: 10
Join Date: Aug 2009
Location: Puerto Rico
Device: iPad 64GB WiFi
|
Make sure to check the box for Remove Header. You needn't bother with the Remove Footer. I have found that it doesn't really work. The Remove Header area can be used to remove both headers and footers. When the program is searching for the strings that match your regex, it makes no distinction as to where that string is physically located.

If you tap the magic wand lookin' thingie to the right, then you get a preview of the text with all the html tags in place, and you can put your regex string in the area provided at the top of the preview window to test whether the string will flag for removal the items you really want to remove.

Don't get frustrated. This is a trial and error process when there are variable strings, and you may need to do the process more than once if there are different kinds of strings that you want gone.

EDIT ~ And on a side note: Quite often the removal of a header or footer will cause inappropriate paragraph breaks. Though the string of the header or footer has been removed, if you don't also remove the html tags that surround that header/footer, they may well leave paragraph breaks or extra carriage returns behind. If you are a picky bugger like me, then you will want to take those tags into account when you are creating the strings of regex to make go bye-bye the things you want gone.
Last edited by Wreybies; 09-17-2010 at 03:08 PM. |
|
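The side note about surrounding tags can be illustrated with a small Python sketch. Everything here is an assumption made for demonstration: the sample fragment, the header text, and the two pattern names are invented, and `re.sub` merely stands in for whatever calibre does with a match.

```python
import re

# Hypothetical fragment: a header sits in its own paragraph between
# two halves of a sentence.
SAMPLE = ("<p>It was a dark</p>"
          "<p>The Scarlet Pimpernel</p>"
          "<p>and stormy night.</p>")

# Matches the header text only; the <p>...</p> wrapper is left behind.
TEXT_ONLY = r"The Scarlet Pimpernel"

# Matches the header together with the paragraph tags around it.
WITH_TAGS = r"<p\b[^>]*>\s*The Scarlet Pimpernel\s*</p>\s*"


def remove(pattern: str, html: str) -> str:
    """Delete every match of pattern from html."""
    return re.sub(pattern, "", html)


if __name__ == "__main__":
    print(remove(TEXT_ONLY, SAMPLE))  # leaves an empty <p></p> in the middle
    print(remove(WITH_TAGS, SAMPLE))  # the header paragraph is gone entirely
```

The first result keeps an empty paragraph, which is the kind of stray break described above; the second removes the wrapper along with the text.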
09-19-2010, 05:46 AM | #9 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
thanks - I've been searching a while for how to remove the "generated by...."
So I get that I should put file:///.+(\d|(txt|html|htm)) into the structure detection preferences, but what happens to the default expression that's already in there (which I don't really understand)? Do I overwrite it with the above, and if so, what do I lose, i.e. what was the default expression doing that yours may not do?
Last edited by cybmole; 09-19-2010 at 06:09 AM. |
09-19-2010, 06:00 AM | #10 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
I can't get this to work:
I have a book in epub with lots of instances of
Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html
so I changed header detection to file:///.+(\d|(txt|html|htm)) and ticked the boxes as per the above instructions, then forced a conversion from epub to epub. The offending spam is still there???
Also, I've screwed up: I copied the default regex to Notepad so that I could put it back again, but I did not grab the entire line. How do I restore the default expression, please? Is it the same as the footer expression default?
Last edited by cybmole; 09-19-2010 at 06:06 AM. |
09-19-2010, 06:12 AM | #11 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
You overwrite the default expression. As long as you do it from the convert dialog and not the preferences dialog you don't lose anything; it's only lost for that book. Anyway, the default is just an example, and it generally needs to be edited in order for anything to match.
file:///.+(\d|(txt|html|htm)) won't get rid of the Amber Lit converter message; it gets rid of the headers/footers that browsers insert when printing HTML to PDF. You'll need a different regex for Amber Lit. Try this:
Code:
(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?
Last edited by ldolse; 09-19-2010 at 07:31 AM. |
09-19-2010, 06:20 AM | #12 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
as I'm a total noob with regex, could you please look at the attached .epub book & tell me what will work - thanks. Last edited by cybmole; 09-19-2010 at 07:00 AM. |
|
09-19-2010, 06:33 AM | #13 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
You can't put copyrighted books on MobileRead, so I suggest you edit your post and delete it. All you need to do is click on the structure detection wizard and click the magic wand. Find one instance of the 'generated by' message, copy that text and a few surrounding lines, and paste it into a phpBB code block. The epub is worthless for analysis, as a fair amount of processing happens between the footer removal stage and the epub output stage.
|
09-19-2010, 06:45 AM | #14 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Well I do own the paper copy! but OK -
I never knew what the magic wand was for... So I followed your instructions as far as locating an instance of the offending spam, then I'm stuck. Is this what you need to see?
[code]
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4"> <b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4"> It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
[/code]
NB: I only have the book in this epub format; that's the format that I found it in. The whole series has this spam throughout.
Last edited by cybmole; 09-19-2010 at 06:47 AM. |
09-19-2010, 06:56 AM | #15 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I thought you were converting from pdf to epub?
Code:
(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?
Please remove the book from your previous post. It doesn't matter whether you own it; the problem is posting it to a public bulletin board that doesn't condone piracy.
Last edited by ldolse; 09-19-2010 at 07:32 AM. |
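For anyone who wants to check the Amber regex against the markup posted earlier, here is a minimal Python sketch. The regex and the HTML around the "Generated by" line are copied from this thread; the surrounding "text before" and "text after" placeholders, the variable names, and the use of `re.sub` as a stand-in for calibre's removal step are assumptions.

```python
import re

# The Amber regex from the post above, split across lines for readability.
AMBER = re.compile(
    r"(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*"
    r"Generated\s+by\s+(ABC)?\s+Amber[^<]*"
    r"(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?"
    r"(</[ibu]>)?\s*(<br>\s*)?"
)

# The browser-footer regex discussed earlier in the thread, for comparison.
FILE_FOOTER = re.compile(r"file:///.+(\d|(txt|html|htm))")

# Markup as posted, with placeholder text standing in for the book text.
SAMPLE = (
    'text before </p><p class="calibre4"> '
    '<b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
    'erter, http://www.processtext.com/abclit.html</a></b>'
    '</p><p class="calibre4"> text after'
)

CLEANED = AMBER.sub("", SAMPLE)

if __name__ == "__main__":
    print(FILE_FOOTER.search(SAMPLE))  # None: there is no "file:///" in the sample
    print(CLEANED)
```

On this sample the Amber regex removes the bold "Generated by" span and the link inside it, but the now-empty <p class="calibre4"></p> wrapper stays, so a stray blank paragraph may remain unless the pattern is extended to cover the paragraph tags as well.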