Old 09-07-2010, 07:51 AM   #1
neonbible
Groupie
Posts: 199
Karma: 10802
Join Date: Sep 2010
Device: Kindle Paperwhite, iPhone 5, iPad Air, Nexus 7
Regex to remove header from PDF

I have been reading through a few posts about using Regex to remove headers and footers.

I have successfully managed to remove page numbers and static strings. But the particular PDF I am using has the chapter title as the header, so the header text changes from chapter to chapter.

How do I write a regex that matches a phrase/string?
Old 09-07-2010, 08:12 AM   #2
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Try this:
<br>\s*(title1|title2|title3|title\s+with\s+spaces)\s*<br> - replace the titles with whatever your chapters are called.
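
As a rough illustration of the pattern above, here is a minimal Python sketch. It assumes the regex is evaluated with Python's re syntax (which calibre uses) and that the header sits between two <br> tags in the converted HTML. The chapter titles and the sample text are made up for the example.

```python
import re

# Hypothetical chapter titles; replace with the ones in your book.
titles = ["Prologue", "The Long Road", "Epilogue"]

# Escape each word, then let any run of whitespace stand in for a space,
# which mirrors the title\s+with\s+spaces form in the pattern above.
alternatives = [
    r"\s+".join(re.escape(word) for word in title.split())
    for title in titles
]
header = re.compile(r"<br>\s*(" + "|".join(alternatives) + r")\s*<br>")

# Made-up sample of the HTML that the PDF is converted to.
sample = "end of a page<br>\nThe Long  Road\n<br>start of the next page"
cleaned = header.sub("", sample)
print(cleaned)
```

Printing header.pattern gives the finished expression, which can then be pasted into the header removal field.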

You can also add <hr>\s* or \s*<hr> to the beginning or end (depending on whether it's a header or a footer) to tie the pattern more accurately to the page headers. If you tie it to the <hr> tag you might be able to get away with something like this:

<br>\s*(\w+\s*)+\s*<br>\s*<hr>

Use the test function with some of those examples to see if you can get what you need.
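
To see what the <hr>-anchored pattern does before trying it in calibre, it can be run against a small sample. This is only a sketch: the sample HTML is invented, and real pdftohtml output may be laid out differently.

```python
import re

# The generic pattern from above: a line made only of words, sitting
# between <br> tags and followed by the <hr> that marks the page break.
generic_header = re.compile(r"<br>\s*(\w+\s*)+\s*<br>\s*<hr>")

# Made-up sample: a body line, then a header line right before the <hr>.
sample = "<br>body text<br>\nChapter One\n<br>\n<hr>next page"

matched = generic_header.search(sample).group(0)
cleaned = generic_header.sub("", sample)
print(repr(matched))
print(repr(cleaned))
```

Only the line directly before the <hr> is removed; the earlier "body text" line survives because no <hr> follows it. Note that \w does not match punctuation, so a title containing an apostrophe or a colon would not be caught by this pattern.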

http://www.regular-expressions.info/ is the best place to read up on how to use regex.

edit - here's a sample regex I used for a file which also had chapter title headers:
Code:
((Castello\s|The\s(Phleg|nun|night|prince\sof\smus|garden|secret\spalac)|Epilogu|Prefac|Four\scarnival|Amalf|La\sSiren|Marriage\sto|Montevergin|Spaccanapol|A\sstiletto|Gesualdo\sC)[^<]+<br>\s*)?(\d|[xvi])+<br>\s*(The\sD\s*e\s*v\s*i\s*l\s*[^<]+<br>)?\s*((Bh|27)[^<]+<br>\s*){4,4}\s*<hr>\s*<A name=\d+></a>
I believe in this case it was a footer. <A name=\d+></a> also shows up at every page break, so including it in the pattern is another way to tie the regex to the header or footer.
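
A much simplified sketch of the same idea, keeping only the page number, the <hr> and the named anchor from the pattern above. The sample text is invented, and the exact anchor markup is assumed to look like the one in the regex.

```python
import re

# Simplified footer: a page number (arabic or roman) directly before <br>,
# then the <hr> and the named anchor that appear at each page break.
footer = re.compile(r"(\d|[xvi])+<br>\s*<hr>\s*<A name=\d+></a>")

# Made-up sample: the body text also ends in "12<br>", but no <hr> follows it.
sample = (
    "last line of page 12<br>\n"
    "12<br>\n<hr>\n<A name=13></a>"
    "first line of page 13<br>"
)
cleaned = footer.sub("", sample)
print(repr(cleaned))
```

The "12" inside the body text is left alone because it is not followed by the <hr> and the anchor, which is the point of tying the pattern to the page break.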

Last edited by ldolse; 09-07-2010 at 08:29 AM.
Old 09-07-2010, 08:35 AM   #3
neonbible
Thanks. I take it | means OR, so I just type out all the chapter headings.
Old 09-07-2010, 09:17 AM   #4
itimpi
Wizard
Posts: 4,068
Karma: 777825
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
Yes, | means OR. It is a regex operator, part of the regular expression itself, and it lets a single expression match any one of several strings.
Old 09-07-2010, 10:08 AM   #5
ldolse
Correct, and as itimpi noted, you need to use it correctly in the context of a regex, which primarily means surrounding all the OR'd items with parentheses for that particular operator.
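
A small Python sketch of why the parentheses matter, using two made-up titles. Without them, | splits the entire pattern in two rather than just the titles.

```python
import re

sample = "<br>Preface<br>"

# Without parentheses the alternatives are
# "<br>\s*Preface" OR "Epilogue\s*<br>".
ungrouped = re.compile(r"<br>\s*Preface|Epilogue\s*<br>")

# With parentheses only the titles are alternatives.
grouped = re.compile(r"<br>\s*(Preface|Epilogue)\s*<br>")

ungrouped_match = ungrouped.search(sample).group(0)
grouped_match = grouped.search(sample).group(0)
print(ungrouped_match)  # <br>Preface
print(grouped_match)    # <br>Preface<br>
```

The ungrouped version leaves the trailing <br> behind, and for "Epilogue" it would leave the leading one.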

Make sure to include the <br> tags in your pattern as well (at a minimum) so that you don't delete words from the book text.
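
To illustrate the risk, here is a sketch comparing a bare title pattern with one wrapped in <br> tags. The sentence and the title are invented for the example.

```python
import re

# Made-up sample: the title's words also appear inside a body sentence.
sample = "<br>The garden was quiet.<br>\nThe garden\n<br>\n<hr>"

bare = re.compile(r"The\s+garden")
anchored = re.compile(r"<br>\s*The\s+garden\s*<br>")

bare_result = bare.sub("", sample)
anchored_result = anchored.sub("", sample)
print(repr(bare_result))      # the body sentence has lost its first words
print(repr(anchored_result))  # only the header line is gone
```

The bare pattern strips "The garden" from the body sentence as well, while the anchored one only matches where the title stands alone between <br> tags.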
All times are GMT -4.