Remove Footer - Page 2

user_none · 11-23-2009, 08:33 PM

Quote:

Originally Posted by matthias

what seems a little strange to me, since neither in the input-, nor in the parsed-folder it is like this.
To me it looks like the regex is applied somewhere in between converting the input to the parsed output, but for sure before closing the tags (which in my eyes isn't very user-friendly, or at least is confusing, since no one seems to know it, and it's shown neither in the input nor the parsed folder)

Honestly, it's gone through so many revisions I've pretty much lost track of it. For some reason I have the feeling it's matching input after all of the internal regexes are applied. I'll get it cleaned up and working properly soonish.

Quote:

Originally Posted by matthias

Also, I think the preconfigured regex (immediately after the installation) should be adjusted to the new practice, because I don't think any footer or header removal will be done with the standard regex anymore.

Another change I have on my todo list. I plan on removing the option to have it applied. The default will be removed and it will automatically apply the regex if one is present.

Eventually the new PDF input engine will be complete and removing the header and footer will be automatic. This regex based system will be renamed to "remove content".

matthias · 11-24-2009, 04:13 AM

It is just a bit confusing since it's not shown anywhere in the debug-folder (probably not well documented or explained).
Probably an (easy) example with explanations somewhere near the place to insert the regex would help finding the right regex (I've had to check out various sites to find the right regex-syntax, and probably not everyone has the patience or time to do so).

I actually like the possibility to check or uncheck the "remove footer" checkbox (or even a "remove content" checkbox), since I don't want to apply the regex to every book (some have page numbers, others don't), and the way it is I don't have to copy/delete the regex I generally use, I just have to check or uncheck that box.
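As a rough idea of what such an annotated example could look like, here is a minimal sketch in Python's `re` syntax (which calibre's regexes follow). The pattern, the name `page_number` and the sample text are made up for illustration and are not taken from calibre's defaults or documentation.

```python
import re

# Verbose layout so each piece can carry its own explanation.
page_number = re.compile(r"""
    \s*     # optional whitespace before the number
    \d+     # one or more digits: the page number itself
    \s*     # optional whitespace after the number
    <br>    # the literal line-break tag that ends the line
""", re.VERBOSE)

# Hypothetical text with a page number on a line of its own.
sample = "last line of text<br>\n42<br>\nfirst line of next page<br>"

print(page_number.sub("", sample))
```

The single-line form of the same pattern is `\s*\d+\s*<br>`. Note that it would also remove a number that ends an ordinary sentence right before a `<br>`, so it is only a starting point.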

ac4lt · 01-12-2010, 07:26 PM

I'm having problems with this as well. Here are the details:

Using calibre 0.6.33, I'm trying to convert a pdf to an epub.

In the pdf the last line of the page is a line number. I'm trying to write a regex to remove this.

Setting the debug for conversion I've been able to look at the input, parsed and processed directories.

An example from the input directory shows the last couple of lines of a page:

Code:

The stranger had clambered through the ditch and up the bank,<br>
8<br>

Looking at the parsed directory, I see this:

Code:

The stranger had clambered through the ditch and up the bank, 8</p><p>

My headers have also been removed, even though we're only at the parsed step, and the PDF line unwrapping appears to have been done.

The processed directory shows:

Code:

The stranger had clambered through the ditch and up the bank, 8</p><p class="calibre1">

Though the header was no problem, I can't find a regex to remove the footer. I've tried:

Code:

\d+<br>

and

Code:

\d+</p><p>

and lots of other variations that don't work.

It's not clear to me when the regex processing is done. That is to say, whether it is before or after the conversion to xhtml and line unwrapping have occurred. Speculating, I'd say it's after.

The problem is that it appears impossible to refer to P tags in the regex. They never work.

I've tried everything suggested in this thread and so far nothing has worked.

Anyone have any ideas?
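One way to narrow this down outside of calibre is to run each candidate pattern against the snippets from the three debug folders. The sketch below uses Python's `re` module with the strings quoted above; the names `stages`, `candidates` and `matching_stages` are made up for illustration. It only shows which pattern can match which stage's text, and says nothing about the stage at which calibre actually applies the footer regex.

```python
import re

# Snippets copied from the debug output quoted in the post above.
stages = {
    "input": "The stranger had clambered through the ditch and up the bank,<br>\n8<br>\n",
    "parsed": "The stranger had clambered through the ditch and up the bank, 8</p><p>",
    "processed": 'The stranger had clambered through the ditch and up the bank, 8</p><p class="calibre1">',
}

candidates = [r"\d+<br>", r"\d+</p><p>"]

def matching_stages(pattern):
    """Return the names of the stages whose text contains a match."""
    compiled = re.compile(pattern)
    return [name for name, text in stages.items() if compiled.search(text)]

for pattern in candidates:
    print(pattern, "->", matching_stages(pattern))
```

Each pattern fits exactly one stage here, so if neither removes anything during conversion, the text the regex is applied to presumably looks different from all three snapshots.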

poodlemama · 01-13-2010, 05:25 PM

WoW I am so glad to have found this thread! Thank yall!

matthias · 01-14-2010, 03:58 AM

Hello ac4lt,

Try this one. I had a similar problem, and when I tried it out it worked for me (that was in a previous version, but I don't think it has changed yet):

Code:

\d+<p>

or even

Code:

\s*\d+<p>

depending on how the result looks (double blanks are removed with the second one).
I hope it still works, but I can't test it at the moment because I don't have calibre installed at work.
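The difference between the two patterns can be seen on a small sample. This is a sketch using Python's `re` module; the `sample` string is a guess at what the intermediate text might look like, not something taken from calibre's debug output.

```python
import re

# Hypothetical intermediate text; the real text calibre matches against may differ.
sample = "up the bank, 8<p> He stopped."

# Removes only the digits and the tag, leaving the blank before the number.
without_ws = re.sub(r"\d+<p>", "", sample)

# Also removes any whitespace directly in front of the number.
with_ws = re.sub(r"\s*\d+<p>", "", sample)

print(repr(without_ws))
print(repr(with_ws))
```

With the first pattern the blank before the number and the blank after the tag end up next to each other; the second pattern swallows the leading blank as well.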

poodlemama · 01-14-2010, 12:19 PM

Quote:

Originally Posted by matthias

so now the regexp is applied to the "parsed" output of the debug folder?

in my case, the new regexp should be

Code:

(\d+\s*</p><p>)

since the file shows

Code:

11  </p><p>

but it still won't work.

I have been applying the above to one of my ebooks and it starts off highlighting the appropriate areas in the book, but when I convert it still doesn't work... I have the remove footers checked, so what am I skipping?
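For reference, the pattern itself does match the quoted text when tested with Python's `re` module, which suggests the problem lies in what text the conversion applies it to rather than in the regex syntax. The sketch below also shows a lookahead variant that removes the number but leaves the tags in place; the `sample` string is made up around the quoted `11  </p><p>`.

```python
import re

# Hypothetical text built around the "11  </p><p>" quoted above.
sample = "end of the page 11  </p><p>Start of the next page"

# The quoted pattern removes the number together with the paragraph tags.
with_tags = re.sub(r"(\d+\s*</p><p>)", "", sample)

# A lookahead requires the tags to follow but does not consume them.
keep_tags = re.sub(r"\d+\s*(?=</p><p>)", "", sample)

print(with_tags)
print(keep_tags)
```

Whether the tags are present at all in the text the conversion sees is exactly what is in question in this thread, so neither form is guaranteed to work there.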

ac4lt · 01-14-2010, 01:53 PM

As with poodlemama neither works for me.

poodlemama · 01-14-2010, 05:59 PM

OK let me ask if this is the correct process...

I have a PDF book... I add it to my Calibre library... I go to convert ebook and update all metadata and then go to Page set up... make sure input and output are correct... go to structure detection... click on "Remove Footer" and click on the wizard tool... type in (\d+\s*</p><p>) and see the highlighted page numbers and codes... push ok and then click ok to start the conversion... what am I missing?

Sydney's Mom · 01-14-2010, 07:07 PM

I have not been able to get pdf conversion to work with headers and footers. I convert to prc with Mobipocket Creator, and then import into Calibre.

matthias · 01-15-2010, 06:36 AM

Try leaving the </p> away, it won't be highlighted in the preview, but for me it worked in the conversion.

ac4lt · 01-15-2010, 07:32 AM

Quote:

Originally Posted by matthias

Try leaving the </p> away, it won't be highlighted in the preview, but for me it worked in the conversion.

I did leave out the '</p>' when I tried your suggestions. Unfortunately, it didn't help.

DoctorOhh · 01-15-2010, 08:21 AM

Quote:

Originally Posted by ac4lt

Unfortunately, it didn't help.

I can't help you get it to work in Calibre, I use a work around.

I import the pdf into the free download of Mobipocket Creator. Importing it strips out some trashy headers and creates a html file.

UPDATE: It was too long ago for my memory to be reliable, so I just grabbed a couple of horrible PDF files and reinstalled Mobipocket Creator. I discovered I was wrong.

Mobipocket Creator removed many atrocious headers for me in the past so I fooled my old memory into thinking it worked all the time. Even when it removed a bad text header I had to manually remove page numbers or other junk using wordpad or MS Word. I primarily used Mobipocket Creator to remove some trash headers and change a PDF to HTML so I could do a quick clean up of the html file.

Sorry for the confusion.... now where else did I post this error?

ac4lt · 01-15-2010, 08:24 AM

I'll take a look at that option. Thanks for the info!

mago55 · 01-17-2010, 09:49 PM

Quote:

Originally Posted by dwanthny

I can't help you get it to work in Calibre, I use a work around.

I import the pdf into the free download of Mobipocket Creator. Importing it strips out headers and footers and creates a html file. Now you can either drag the html to calibre or you can have it build an ebook. Building an ebook will create a PRC file. The prc file can be dragged to Calibre and converted to the format of your choice.

The html and the prc file end up in My Documents\My Publications by default.

It is quick and easy and does what I need while keeping my sanity intact.

Hi everyone! I've been going mad over those headers and footers. English is not my first language and I have to work hard to understand all those terms.
I have no idea what a "header regular expression" or a "regex" is, or how to work the wand button, although I have tried many times.
Same as everyone, I'm trying to convert a PDF, but in my case it's to LRF, the Sony ebook format.
I've read what dwanthny said about Mobipocket Creator, and I think it sounds much easier than removing headers and footers with Calibre.
My questions are: what exactly does it mean to "build an ebook"? Once I create an HTML file with Mobipocket from a PDF file, can I just drag that HTML into Calibre and convert it into an LRF for my Sony reader? Or do I have to convert the HTML into a PRC first and then to LRF?
I apologize if my questions are not very clear, and I thank you for any help with this!

DoctorOhh · 01-17-2010, 10:45 PM

Quote:

Originally Posted by mago55

What exactly is to "build an ebook"?, once i create an html with mobipocket from a pdf file, can i just drag that html into Calibre and convert it into an lrf for my sony ebook? or do i have to convert html into a prc first and then to lrf?.

I was wrong. It was too long ago for my memory to be reliable, so I just grabbed a couple of horrible PDF files and reinstalled Mobipocket Creator.

Mobipocket Creator removed many atrocious headers for me in the past so I fooled my old memory into thinking it worked all the time. Even when it removed a bad text header I had to manually remove page numbers or other junk using wordpad or MS Word. I primarily used Mobipocket Creator to remove some trash headers and change a PDF to HTML so I could do a quick clean up of the html file.

Sorry for the confusion.... now where else did I post this error?
