Regexp and Alternate Page Header/Footer

adad · 01-15-2011, 10:11 AM

I am new to ebooks and Calibre. I am only starting to understand some of the great things it can do, and I am still on the simple parts. I do have some Unix background -- but more from the grep era than Python. Clearly not used to the message board formatting yet either.

I am stuck on the regexp to remove the alternating headers. I more or less dumped elegance and went for brute force, as I don't fully understand the syntax [I actually have the title and the author manually typed in]. The original PDF has a different header on odd and even pages -- information in the top corners only.

Quote:

1. On the PDF Page the header is:

TITLE [white space] [page no] - odd pages
[page no] [white space] AUTHOR - even pages

2. On the Calibre display page in "Structure Detection"
for the odd pages
<hr>
<A name=7></a>TITLE 
3 

for the even pages
<hr>
<A name=6></a>2 
AUTHOR 

3. Header regular expression
(?im)(<hr>((\s*<a name=\d+></a>\s*\d+\s* \s*AUTHOR\s* )|(\s*<a name=\d+></a>TITLE\s* \s*\d+\s* )))

On the display, what I want to delete is highlighted in yellow when I test.
I have the delete header box checked but when I run the conversion (to MOBI) I get effectively the page numbers inserted in place of the header. Is the string returning some sort of value or match number, and how do I stop it from inserting?
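A minimal sketch for checking a pattern like this outside the wizard, using Python's re module (Calibre itself is written in Python). The sample text is modelled on the "Structure Detection" display quoted above; the two body sentences are invented for illustration, and the pattern is a simplified variant of the one in point 3, not the exact expression from the post.

```python
import re

# Sample modelled on the display quoted in the post; the body lines are made up.
sample = (
    "<hr>\n"
    "<A name=7></a>TITLE \n"
    "3 \n"
    "First line of an odd page.\n"
    "<hr>\n"
    "<A name=6></a>2 \n"
    "AUTHOR \n"
    "First line of an even page.\n"
)

# One alternation covers both layouts:
#   odd pages:  TITLE followed by the page number
#   even pages: the page number followed by AUTHOR
# (?i) makes <a ...> match <A ...>; \s* absorbs spaces and line breaks.
header = re.compile(
    r"(?im)<hr>\s*<a name=\d+></a>\s*(?:TITLE\s*\d+|\d+\s*AUTHOR)\s*"
)

# Replacing with an empty string removes the whole match, page number included.
cleaned = header.sub("", sample)
print(cleaned)
```

If the page number survives a substitution like this, the pattern is not covering it; if it is removed here but still shows up in the converted book, the problem is more likely somewhere other than the expression.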

itimpi · 01-15-2011, 10:51 AM

I am guessing that your regex is incomplete and the headers actually have the page number there as well - so you are deleting all except the page number.

I always use the wizard button next to the regex field so that I can check that I have matched sufficient text.

Manichean · 01-15-2011, 10:53 AM

From what I see, that shouldn't be happening. Did you make sure the page number is highlighted as well?
You could also try to use the header removal for one type of the headers and the footer removal for the other. There may be some funky effect in that regex that I failed to notice.

adad · 01-15-2011, 08:40 PM

No, definitely used the wizard -- I am glad I found it. The page numbers are definitely with the yellow background.

If it is of any help, when I go back to the MOBI version it has translated to:
[quote]
2 

[/quote]
-- where two is the page number, and
[quote]
3 
[/quote]
where three is the page number.

They are reflexive, but actually do look the same in the kindle reader.

adad · 01-15-2011, 08:51 PM

Page numbers are included in the highlighted yellow.

Tried the footer and header thing. I took out the (?im) and then could drop all the brackets. With a little care on the case, everything highlighted in the wizard (for the appropriate footer and header) but the result was the same.

That eliminated most of the rules I was not familiar with -- what is left is a pretty straightforward regexp. Shame I cannot use grep.
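A rough sketch of what that split might look like, again in plain Python rather than inside Calibre. The names odd_header, even_header and strip_all are hypothetical, the sample body lines are invented, and the patterns are a guess at the simplified form described above (no (?im), no grouping brackets, case matched by hand).

```python
import re

# Without (?i), the case must match the source exactly: <A ...> but </a>.
odd_header = r"<hr>\s*<A name=\d+></a>TITLE\s*\d+\s*"    # e.g. the header field
even_header = r"<hr>\s*<A name=\d+></a>\d+\s*AUTHOR\s*"  # e.g. the footer field

def strip_all(text, patterns):
    """Apply each pattern in turn, deleting every match."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text

sample = (
    "<hr>\n<A name=7></a>TITLE \n3 \nOdd page text.\n"
    "<hr>\n<A name=6></a>2 \nAUTHOR \nEven page text.\n"
)

result = strip_all(sample, [odd_header, even_header])
print(result)
```

Each pattern on its own is close to what grep would accept, apart from the \s and \d shorthands.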

adad · 01-15-2011, 09:03 PM

Thank you for the suggestion. I tried it and it did not work -- but you got me to thinking. I eventually ended up deleting the old MOBI file and then it did work.

I looked at some of my other conversions, and it may be that Calibre resets the input file to the old MOBI file if one exists. When I used the wizard, I had a choice and chose the PDF file. Perhaps the conversion was actually running the regexp against the old MOBI file, not the PDF file. Something else I will have to watch for in future efforts.

Thank you both for the assistance.
