Remove Footer

cdecaf · 10-14-2009, 03:44 AM

Hi,

I have been tinkering with this for two days now and couldn't find a clue.

My problem is that I want to convert a purchased PDF to EPUB (so sorry, no sample, but I'll try to find one). The page numbers of the PDF show up in the middle of the text, so I want them gone.

In the regex window (the one that comes up after clicking the magic wand next to "remove footer") the page numbers show up as

Code:

45 </p><p>

so I try

Code:

\d+ *</p><p>

Code:

[0-9]+ *</p><p>

and

Code:

\d+ *\<\/p\>\<p\>

all of these mark the correct regions in the window - but none of these show an effect in the produced epub.

Just as a test I tried

Code:

\d+

- and yes it removes ALL numbers from the converted text.

Also I tried

Code:

ebook-convert book.pdf .epub --debug-pipeline

but it seems the option is no longer available - all I get is the help for ebook-convert.

So does anybody have an idea what I could try?
Thanks.

btw. I'm using calibre 0.6.17 on OSX Snow Leopard.

user_none · 10-14-2009, 05:59 AM

The header removal does appear to be broken. Can you open a ticket so I don't forget to fix it?

Also, --debug-pipeline has been renamed to --debug.

user_none · 10-14-2009, 06:47 AM

I take that back, it is working. Make sure you have the remove footer checkbox checked above the regex.

cdecaf · 10-14-2009, 10:08 AM

Ah that --debug helps a lot.

If I look at the produced directories, I think the regex window shows the "parsed" subdirectory. But the regexp acts on the "input".
Also, I noticed that in the regexp I need to replace all the "&nbsp;" with "\xA0".

cdecaf · 10-14-2009, 10:48 AM

Ok now I'm happy. I found it in the "input" produced by debug.

The regexp that worked for me:

Code:

\d+\xA0*<br>
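
A minimal sketch of why the plain-space patterns have no effect while the \xA0 version does, using Python's standard re module (calibre's regexes follow Python syntax). The sample string is an assumption modeled on the debug output, not text from the actual book:

```python
import re

# Assumed sample: a page number followed by a non-breaking space (U+00A0) and <br>.
sample = "end of page text 45\xa0<br>start of next page"

# A literal space in the pattern does not match U+00A0.
plain_space = re.compile(r"\d+ *<br>")
# \xA0 in the pattern matches the non-breaking space itself.
nbsp = re.compile(r"\d+\xA0*<br>")

print(plain_space.search(sample))   # no match object is found
print(nbsp.sub("", sample))         # page number, nbsp and <br> are removed
```

The two characters look identical in most viewers, which is why a pattern can appear to mark the right region and still fail on the real data.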

user_none · 10-14-2009, 05:18 PM

Quote:

Originally Posted by cdecaf

If I look at the produced directories, I think the regex window shows the "parsed" subdirectory. But the regexp acts on the "input".

I will correct this.

cdecaf · 10-14-2009, 05:33 PM

That would be great.

Thanks for the quick pointers also.

matthias · 11-19-2009, 03:21 PM

Hello,

I found this topic while looking for help with much the same problem.

I tried it yesterday at work with an older version (I think it was 0.6.22) and it worked without any problems, but today, on another PC, I tried the same regexp and it won't work (even with the same PDF file).

In the input section of the debug output it shows up like this:

Code:

11&nbsp;&nbsp;<br>

The regexp I'm trying to use to remove the page numbers is

Code:

(\d+\xA0*<br>)

The exact same regexp worked with the older version, but it won't today. Have there been any changes since that older version, or is something wrong in the regexp? I've now tried everything I can think of, but I can't find the answer (although I'm not very comfortable with regexps).

user_none · 11-19-2009, 04:28 PM

Quote:

Originally Posted by matthias

In the input section of the debug output it shows up like this:

Code:

11&nbsp;&nbsp;<br>

The regexp I'm trying to use to remove the page numbers is

Code:

(\d+\xA0*<br>)

The exact same regexp worked with the older version, but it won't today. Have there been any changes since that older version, or is something wrong in the regexp?

The regex is applied later in the conversion pipeline in newer versions. 0.6.22 sounds about right for when the change was made. Entities such as &nbsp; are now converted to the character they represent before the regex is applied.
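
To illustrate the ordering described above, here is a rough sketch that uses Python's standard html.unescape as a stand-in for calibre's own entity handling (it is not calibre's code). The sample line is the one from the debug output. It covers only the entity step; any other changes made to the markup before the regex runs are outside this sketch:

```python
import html
import re

raw = "11&nbsp;&nbsp;<br>"      # as seen in the debug "input" folder
decoded = html.unescape(raw)    # entities replaced by the characters they represent

# A pattern that spells out the entity only matches before decoding ...
entity_pattern = re.compile(r"\d+(?:&nbsp;)*<br>")
# ... while a pattern using \xA0 only matches after decoding.
char_pattern = re.compile(r"\d+\xA0*<br>")

for label, text in (("raw", raw), ("decoded", decoded)):
    print(label,
          bool(entity_pattern.fullmatch(text)),
          bool(char_pattern.fullmatch(text)))
```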

matthias · 11-19-2009, 04:38 PM

So now the regexp is applied to the "parsed" output in the debug folder?

In my case, the new regexp should be

Code:

(\d+\s*</p><p>)

since the file shows

Code:

11  </p><p>

but it still won't work.

kovidgoyal · 11-19-2009, 04:48 PM

EDIT: Never mind

matthias · 11-20-2009, 03:31 AM

I still can't get it to work...

In the debug input folder it shows me the following string:

Code:

11&nbsp;&nbsp;<br>

In the debug parsed folder it then becomes

Code:

11  </p><p>

Earlier (now that I've checked, I was using 0.6.16 when it worked) I used

Code:

\d+\xA0*<br>

which doesn't work anymore, since it seems something has changed in the pipeline. But replacing the \xA0 with \s (which should match any whitespace character) doesn't work either.

I tried \d+\s+, which works and seems to be applied somewhere between the input and the parsed stage, but it isn't quite what I wanted, since it removes every number in the file that is followed by whitespace.

When exactly is the regex applied to the file? Directly on the input file, or are there steps in the pipeline that change the markup before the regex is applied?
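
A small sketch of the over-matching problem with \d+\s+, using Python's re module. The sample sentence is made up for illustration:

```python
import re

# Made-up sample: two numbers that belong to the text, then a page number
# followed by two non-breaking spaces.
text = "In 1999 there were 3 editions. 11\xa0\xa0more text"

# \d+\s+ matches any run of digits followed by whitespace
# (in Python 3, \s also matches the non-breaking space U+00A0).
loose = re.compile(r"\d+\s+")

print(loose.findall(text))   # every number followed by whitespace is matched
print(loose.sub("", text))   # so ordinary numbers in the text are removed too
```

Anchoring the pattern on something that only follows a page number, such as a tag, keeps ordinary numbers in the text intact.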

matthias · 11-22-2009, 01:29 PM

Isn't anyone else having trouble removing page numbers from their PDFs?

Before the regex is applied to the input section, what happens to the tag "<br>"?
It seems like it's not there anymore when the regex is applied.

How do I have to change my previous regex (used in 0.6.16) to get it to work with the newer versions of calibre (which is a great program in my eyes)?

matthias · 11-23-2009, 06:15 AM

I have been able to resolve my problem using:

Code:

(\d+\s+<p>)

which seems a little strange to me, since the text doesn't look like this in either the input or the parsed folder.
To me it looks like the regex is applied somewhere between reading the input and producing the parsed output, but certainly before the tags are closed (which in my eyes isn't very user-friendly, or is at least confusing, since no one seems to know about it and it's shown in neither the input nor the parsed folder).

Also, I think the preconfigured regex (the default right after installation) should be adjusted to the new behavior, because I don't think the standard regex will remove any footers or headers anymore.

I hope one of the developers will take care of this problem, or at least explain the thinking behind it, so we can better understand when exactly the regex is applied.
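
The observation above can be checked with a short sketch in Python's re module. The first two strings follow the forms shown in the debug folders (with the whitespace written as U+00A0, which is an assumption). The third, "intermediate", is a hypothetical form that does not appear in any debug folder; it is assumed here only because it is what the working regex would need to see:

```python
import re

pattern = re.compile(r"(\d+\s+<p>)")

candidates = {
    "input": "11\xa0\xa0<br>",          # debug "input" form, entities decoded
    "parsed": "11\xa0\xa0</p><p>",      # debug "parsed" form
    "intermediate": "11\xa0\xa0<p>",    # hypothetical, not shown by --debug
}

for name, text in candidates.items():
    print(name, bool(pattern.search(text)))
```

Only the hypothetical intermediate form matches, which is consistent with the regex being applied at a stage that neither debug folder displays.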

Ydieh · 11-23-2009, 03:02 PM

Quote:

Originally Posted by matthias

I have been able to resolve my problem using:

Code:

(\d+\s+<p>)

which seems a little strange to me, since the text doesn't look like this in either the input or the parsed folder.
To me it looks like the regex is applied somewhere between reading the input and producing the parsed output, but certainly before the tags are closed (which in my eyes isn't very user-friendly, or is at least confusing, since no one seems to know about it and it's shown in neither the input nor the parsed folder).

Also, I think the preconfigured regex (the default right after installation) should be adjusted to the new behavior, because I don't think the standard regex will remove any footers or headers anymore.

I hope one of the developers will take care of this problem, or at least explain the thinking behind it, so we can better understand when exactly the regex is applied.

Thanks for the solution to my problem.
I've tried " " and "" in my expression and they both didn't work.
Using only "" works.

