Structure Detection - Remove Header (or Footer) Regex - Page 5

ldolse · 02-01-2011, 01:25 AM

These have been replaced by the Search and Replace panel in conversion. If you want to delete text just specify a search and leave the replace box empty.

CazMar · 02-01-2011, 06:10 PM

Quote:

Originally Posted by ldolse

These have been replaced by the Search and Replace panel in conversion. If you want to delete text just specify a search and leave the replace box empty.

Ok - thanks for the help.

Luiz Braga · 02-21-2011, 10:40 AM

I'm really a novice in Calibre, just beginning with my first two book conversions.
I'm trying to convert PDF books to Mobi. The only problem is that the footer and page numbers are embedded in the last line of each page in the Mobi output. I did not try, or even understand from the replies above, how to remove those footers. Any help?
I see that if I convert from PDF to RTF I can of course eliminate these footers manually, but it is troublesome.

Manichean · 02-21-2011, 10:48 AM

Quote:

Originally Posted by Luiz Braga

I'm really a novice in Calibre, just beginning with my first two book conversions.
I'm trying to convert PDF books to Mobi. The only problem is that the footer and page numbers are embedded in the last line of each page in the Mobi output. I did not try, or even understand from the replies above, how to remove those footers. Any help?
I see that if I convert from PDF to RTF I can of course eliminate these footers manually, but it is troublesome.

You'll have to use the search & replace feature in the conversion settings; there's a brief tutorial available. The search and replace uses regular expressions to describe the text to replace. If you're not comfortable using those, there's a tutorial available on them as well, which I'd suggest you start with if needed.
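As a minimal sketch of the idea (not Calibre's own code), the same search-and-delete can be tried in plain Python, whose `re` module uses the regular expression syntax Calibre builds on. The sample text and the footer wording "My Book Title" are made up for illustration; a real footer needs its own expression.

```python
import re

# Hypothetical text as it might come out of a PDF: the footer and page
# number are fused onto the last line of each page. The footer wording
# ("My Book Title") is an assumption for illustration only.
text = (
    "and so the story went on My Book Title 12\n"
    "into the next page of the book My Book Title 13\n"
)

# Search expression describing the footer plus the page number.
footer = re.compile(r"\s*My Book Title\s+\d+")

# An empty replacement deletes every match, which mirrors leaving
# the replace box empty in the Search and Replace panel.
cleaned = footer.sub("", text)
print(cleaned)
```

Testing the expression on a short sample like this before running a full conversion makes it easier to see whether it removes too much or too little.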

HornGs · 04-16-2011, 03:33 PM

I've read the tutorial as well as this thread, and I've found the information very useful. I just can't find out how to extend my selection. For example, we have

file://something0%01something0%0 (page 23 of 9000) [April 99, 1903]

I just want to match file://something0%0 and extend my match to the end of line. Or I could match of 9000) and extend to the previous end of line.

Is there a simple way to do that ?

edit: let me clarify ... I mean, how do I match to the end of the line in a PDF when there are no end-of-line tags?

I use file://.+br> when it's an html document.

Manichean · 04-16-2011, 06:20 PM

Try something a little more specific, for example:

Code:

file://something0%01something0%0\s+\(page\s+\d+\s+of\s+9000\)\s+\[April\s+99,\s+1903\]

ldolse · 04-16-2011, 09:11 PM

This should work too:

Code:

file://.*?\]

PDF does have end-of-line tags, though, and file:// is already built into pdf as an internal pattern. Admittedly, I just checked the code and it seems two slashes aren't as common as three or four slashes, so I've just tweaked the number to look for.
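For comparison, here is a small sketch that runs both suggested expressions against HornGs' sample in plain Python. The surrounding "Some body text" and "more body text" are invented filler; only the footer itself comes from the thread.

```python
import re

# The footer from the thread, embedded in invented body text.
line = (
    "Some body text "
    "file://something0%01something0%0 (page 23 of 9000) [April 99, 1903]"
    " more body text"
)

# The very specific expression: every part of the footer is spelled out.
specific = (
    r"file://something0%01something0%0\s+\(page\s+\d+\s+of\s+9000\)"
    r"\s+\[April\s+99,\s+1903\]"
)

# The short expression: start at file:// and stop at the first "]".
# The "?" makes ".*" non-greedy, so it does not run on to a later "]".
lazy = r"file://.*?\]"

m_specific = re.search(specific, line)
m_lazy = re.search(lazy, line)
print(m_specific.group(0))
print(m_lazy.group(0))

# Deleting the match leaves only the body text.
print(re.sub(lazy, "", line))
```

On this sample both expressions select the same span. The short one is more forgiving of changing page numbers and dates, but it would also swallow body text if a file:// ever appeared without a closing bracket soon after it.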

miquele · 09-01-2012, 04:00 PM

hello,

line spacing is off at conversion, so I would like to remove
 <br>\n
(read: blank, bracket, newline)
but only if the character in front of the blank is not a dot; otherwise it should remain.
The RegEx identifying the correct places is
[a-z] <br>\n
but now, obviously, one letter too many is replaced. Can I get this character back through a variable to be put into the Replacement Text line?
Otherwise, how can I tell Calibre to replace only when there is no dot in front of the matching RegEx?
Thanks for your help,
miquele
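Two common regex techniques fit this question, sketched here in plain Python under the assumption that Calibre's replace field accepts backreferences the way Python's `re.sub` does. The sample string is invented, and the single space used as replacement is also an assumption (it keeps the joined words from running together).

```python
import re

# Invented sample: one break in mid-sentence, one after a full stop.
html = "a line that breaks <br>\nin mid-sentence. <br>\nA new paragraph"

# Option 1: capture the letter in a group and put it back with \1.
with_group = re.sub(r"([a-z]) <br>\n", r"\1 ", html)

# Option 2: a lookbehind checks for the letter without consuming it,
# so nothing needs to be put back.
with_lookbehind = re.sub(r"(?<=[a-z]) <br>\n", " ", html)

print(with_group)
print(with_lookbehind)
```

Both leave the break after the dot untouched. To delete the break outright instead of leaving a space, the replacement would be `\1` for the first option and empty for the second.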

HeyPretty · 08-19-2013, 01:04 AM

Quote:

Originally Posted by Confuzzled

Yeah, I did use the test wizard, sorry if I wasn't clear. That's the oddity: no yellow, even when I just put in a simple string, which according to the user manual should come up.

The code I played with is a modification of this code:
<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*? which, as far as I'm aware, should match. I came up with something to this effect but using, e.g., a page break instead of bold. I wasn't sure of my defining structure though, so I took this structure of Kovid's, and when that didn't work I tried to edit it until it did.

I also tried removing the <a>, i.e. the HTML link, but that didn't work either. Delphi was always my preference to Python.

My problem is this repeating code:
Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre2">erter, http://www.processtext.com/abclit.html</a>

Thanks so much.

Your regex worked perfectly for me! Thanks so much for posting it. I'm so glad I searched fervently before I started trying to make my own.

esc7 · 11-09-2013, 12:21 PM

Hello everyone!
I'm downloading webpages with the Mozilla Print Pages 2 PDF add-on, and when converting to mobi with Calibre I'd like to remove the parts after the main body text (usually related posts, links, ads). These are of course different for every post and every category. The only pattern I noticed that occurs on every single page is a block that starts with "Inshare(1-9)" and ends with the word "Albumclose ", like here:

Last paragraph of text <br>
inShare2<br>
Related post1 <br>
Related ad1 <br>
Related post2<br>
Related post3<br>
Related ad2 <br>
Albumclose <br>
Is there any chance I can delete this whole block of text automatically in every post with Calibre's Search & Replace feature, or is that impossible? I looked at this part of the manual, but it didn't really work. Any help is appreciated.
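A block like this spans several lines, so the expression has to let `.` match newlines. A minimal sketch in plain Python, assuming the converted text looks like the sample above (markers spelled exactly "inShare" plus a digit and "Albumclose"); the variable names are made up:

```python
import re

# The sample from the post, as one multi-line string.
page = (
    "Last paragraph of text <br>\n"
    "inShare2<br>\n"
    "Related post1 <br>\n"
    "Related ad1 <br>\n"
    "Related post2<br>\n"
    "Related post3<br>\n"
    "Related ad2 <br>\n"
    "Albumclose <br>\n"
)

# (?s) lets "." match newlines as well; ".*?" is non-greedy, so the
# match stops at the first "Albumclose" after the start marker.
# The tail also swallows the <br> and whitespace after "Albumclose".
block = re.compile(r"(?s)inShare\d.*?Albumclose\s*(?:<br>)?\s*")

cleaned = block.sub("", page)
print(cleaned)
```

Because the `(?s)` flag is written inside the expression itself, the same expression can be pasted into a search box that only takes the pattern text. If the start marker is sometimes capitalized differently, `(?si)` would make the match case-insensitive as well.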

