remove PDF footer containing variable?

irisclara · 08-19-2009, 05:29 AM

If this question is answered elsewhere, I apologize for wasting time. I did look.

I have a number of pdf files with footers (and sometimes headers) like

file://quickbrownfox/(65 of 296)

so the page number counts up to the total.

I've looked at python regular expressions but I can't figure out how to tell Calibre to leave these lines out when I convert to rtf.

Alternately, does anyone know of a way to remove sequential page numbers in a rtf? Then I could remove the parts of the line that stay the same with the extended replace function in TED notepad and the page numbers some other way.

Thanks.

DerSchwarzePrinz · 08-19-2009, 02:26 PM

Quote:

Originally Posted by irisclara

If this question is answered elsewhere, I apologize for wasting time. I did look.

I have a number of pdf files with footers (and sometimes headers) like

file://quickbrownfox/(65 of 296)

so the page number counts up to the total.

I've looked at python regular expressions but I can't figure out how to tell Calibre to leave these lines out when I convert to rtf.

Alternately, does anyone know of a way to remove sequential page numbers in a rtf? Then I could remove the parts of the line that stay the same with the extended replace function in TED notepad and the page numbers some other way.

Thanks.

Try the following:

file://.+\)

This should remove every occurrence that starts with "file://", followed by any characters (".+"), up to a closing parenthesis ("\)").

Removing numbers is very easy with "\d+", but it removes every number in the document.
Perhaps someone out there knows a solution?
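For anyone who wants to try the pattern outside Calibre, here is a minimal Python sketch of the same idea. The function name, the sample text, and the stricter pattern are made up for illustration; only the first regex comes from the post above. It drops whole lines that contain the footer rather than just the matched text, so no empty line is left behind.

```python
import re

# Pattern from the post above: "file://", any characters, then a closing parenthesis.
FOOTER = re.compile(r"file://.+\)")

# Hypothetical stricter variant: also require the "(N of M)" page counter.
STRICT_FOOTER = re.compile(r"file://.*\(\d+ of \d+\)")


def strip_footers(text, pattern=FOOTER):
    """Return text without the lines that contain a footer match."""
    kept = [line for line in text.splitlines() if not pattern.search(line)]
    return "\n".join(kept)


sample = (
    "end of one page\n"
    "file://quickbrownfox/(65 of 296)\n"
    "start of the next page"
)
cleaned = strip_footers(sample)
print(cleaned)
```

Because "." does not match a newline by default, the match never runs past the end of the footer line. If a footer shares a line with real text, dropping the whole line is too aggressive and `pattern.sub("", line)` would be the safer choice.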

=X= · 08-19-2009, 02:42 PM

Yeah I usually use "\n\d+\n". -- ignore quotes

This will remove numbers that lie on a single line but leave numbers in the body of the text.
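A rough Python sketch of the same idea, in case it helps. One caveat with the literal pattern: replacing "\n\d+\n" with nothing also swallows the newline before the number, so the neighbouring lines run together. The version below anchors on line boundaries instead; the function name and sample text are only for illustration.

```python
import re

# With MULTILINE, ^ matches at the start of every line, so only lines that
# consist of digits (plus optional spaces or tabs) are matched and removed.
PAGE_NUMBER_LINE = re.compile(r"^[ \t]*\d+[ \t]*(?:\r?\n|\Z)", re.MULTILINE)


def strip_page_number_lines(text):
    """Remove lines that hold nothing but a number; numbers in prose stay."""
    return PAGE_NUMBER_LINE.sub("", text)


sample = "Chapter 2 began in 1984.\n65\nIt rained for 40 days.\n"
cleaned = strip_page_number_lines(sample)
print(cleaned)
```

Any line in the book that legitimately contains only a number (a bare chapter number, say) would be removed as well, so it is worth checking the result.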

irisclara · 08-20-2009, 11:11 PM

Thank you very much DerSchwarzePrinz and =X=. I will try what you suggest.

irisclara · 10-16-2009, 02:03 PM

I'm using a variation of DerSchwarzePrinz' suggestion with great success. Thanks again folks.

nerys · 02-27-2010, 11:39 PM

hey!! the file://.+\) worked a treat thanks!

right now I use a pdf to txt tool to convert the pdf to text

and then I use notepad++ to do all of the following

remove file://.+\:53 (this pattern changes from document to document)

then I replace \r\n\r\n with a unique word

I then eliminate all \r\n

and then convert the unique word back to \r\n\r\n

this gets rid of sentence \r\n but KEEPS paragraph \r\n

alas SOME books "lose" the double \r\n when converted.

GRRRRR :-)

maybe adding in a conversion for .\r\n to \r\n\r\n might work. (yep it worked)
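The placeholder steps above can be sketched in Python roughly like this. It is an approximation and not the exact Notepad++ procedure: it splits on blank lines instead of using a unique word, and it joins wrapped lines with a single space (my assumption, so that words at line ends do not get glued together). The ".\r\n" fallback for books that lose the double break is not included. The function name is hypothetical.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines but keep blank-line paragraph breaks."""
    # Normalise Windows and old Mac line endings so only "\n" remains.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (possibly holding stray spaces) marks a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse line breaks and whitespace runs to one space.
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in joined if p)


sample = (
    "Take a brain and put\r\nit in a bottle.\r\n\r\n"
    "Brains in bottles\r\nstop engaging."
)
result = unwrap_lines(sample)
print(result)
```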

Basically you have to look over the document and do all kinds of custom replace strings to get it to look right.

I wish there was an INTELLIGENT converter that would retain the critical formatting for readability and get rid of the "junk" like footers and headers that are not RELEVANT on an ereader. etc.. etc..

extra spacing which is just "lost" formatting in the conversion process etc..

nerys · 02-28-2010, 12:11 AM

One problem I have is a section of this one book (the Accelerando license says I can repost its contents)

looks like this

***

Take a brain and put it in a bottle. Better: take a map of the brain and put

it in a map of a bottle -- or of a body -- and feed signals to it that mimic

its neurological inputs. Read its outputs and route them to a model body

in a model universe with a model of physical laws, closing the loop. René

Descartes would understand. That's the state of the passengers of the

Field Circus in a nutshell. Formerly physical humans, their neural

software (and a map of the intracranial wetware it runs on) has been

transferred into a virtual machine environment executing on a honking

great computer, where the universe they experience is merely a dream

within a dream

Brains in bottles -- empowered ones, with total, dictatorial, control over

the reality they are exposed to -- sometimes stop engaging in activities

that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting,

angina, exhaustion, and cramp are all optional. So is meatdeath, the

decomposition of the corpus. But some activities don't cease, because

people (even people who have been converted into a software

description, squirted through a high-bandwidth laser link, and ported into

a virtualization stack) don't want them to stop. Breathing is wholly

unnecessary, but suppression of the breathing reflex is disturbing unless

you hack your hypothalamic map, and most homomorphic uploads don't

want to do that. Then there's eating -- not to avoid starvation, but for

pleasure: Feasts on sautéed dodo seasoned with silphium are readily

available here, and indeed, why not? It seems the human addiction to

sensory input won't go away. And that's without considering sex, and the

technical innovations that become possible when the universe -- and the

bodies within it -- are mutable

***

I have NO IDEA why it's formatted like that or the proper way to REFORMAT it.

so I just get rid of the double breaks and make it this

***
Take a brain and put it in a bottle. Better: take a map of the brain and put
it in a map of a bottle -- or of a body -- and feed signals to it that mimic
its neurological inputs. Read its outputs and route them to a model body
in a model universe with a model of physical laws, closing the loop. René
Descartes would understand. That's the state of the passengers of the
Field Circus in a nutshell. Formerly physical humans, their neural
software (and a map of the intracranial wetware it runs on) has been
transferred into a virtual machine environment executing on a honking
great computer, where the universe they experience is merely a dream
within a dream
Brains in bottles -- empowered ones, with total, dictatorial, control over
the reality they are exposed to -- sometimes stop engaging in activities
that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting,
angina, exhaustion, and cramp are all optional. So is meatdeath, the
decomposition of the corpus. But some activities don't cease, because
people (even people who have been converted into a software
description, squirted through a high-bandwidth laser link, and ported into
a virtualization stack) don't want them to stop. Breathing is wholly
unnecessary, but suppression of the breathing reflex is disturbing unless
you hack your hypothalamic map, and most homomorphic uploads don't
want to do that. Then there's eating -- not to avoid starvation, but for
pleasure: Feasts on sautéed dodo seasoned with silphium are readily
available here, and indeed, why not? It seems the human addiction to
sensory input won't go away. And that's without considering sex, and the

technical innovations that become possible when the universe -- and the
bodies within it -- are mutable
***

but it's clear that's not quite right. I don't have a better way right now, so it's what I am using.
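One possible heuristic for text like the sample above, sketched in Python. It assumes every line is followed by a blank line, so blank lines cannot mark paragraphs, and it guesses that a line much shorter than the longest line ends a paragraph (as "within a dream" does). The function name and the 0.6 ratio are arbitrary choices for illustration, and the guess will be wrong for some books.

```python
def reflow_double_spaced(text, short_ratio=0.6):
    """Rebuild paragraphs from text where every line is followed by a blank line."""
    # Drop the blank lines and surrounding whitespace; keep only real lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""
    longest = max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        # Assumption: a line much shorter than the longest one ends a paragraph.
        if len(ln) < short_ratio * longest:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


sample = (
    "great computer, where the universe they experience is merely a dream\n\n"
    "within a dream\n\n"
    "Brains in bottles -- empowered ones, with total, dictatorial, control over\n\n"
    "the reality they are exposed to -- sometimes stop engaging in activities\n"
)
result = reflow_double_spaced(sample)
print(result)
```

Short separator lines such as "***" end up as their own paragraphs under this rule, which happens to suit the sample.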

ahh it is pretty close. HERE is the same section in the PDF. Very odd formatting.

I have also attached the PDF and the TXT I made from it. (the author gave permission to do this in his licensing)

Suggestions on better ways to automate the conversion? It does not have to be txt, but something close in its simplicity is preferred, i.e. readable on the pc without special software and readable natively on the sony reader.

irisclara · 03-05-2010, 03:18 AM

What I do is remove headers and footers and convert my pdfs with Calibre. The resulting txt or rtf usually contains a lot of residual formatting. To get rid of this I use the automated editing tools in Machine Age Reader to remove things like extra lines and spaces and to connect broken lines and paragraphs. When I get it as clean as I can with the auto tools I proof the text (still using Machine Age Reader) and fix any remaining problems manually. In case you can't tell, I really love this program. The tools are perfect for this kind of work. It has saved me many miserable hours manually removing line breaks and extra lines. I also like the fact that it lets me edit while displaying the text in proper, 2 page, computer ebook format. Reading in notepad always gave me a headache.

I'll try running your file through Calibre and Machine Age Reader tomorrow. I'll let you know how long and how many steps it takes.

nerys · 03-05-2010, 10:50 AM

how do you remove the headers and footers?

irisclara · 03-06-2010, 06:38 PM

Calibre can remove headers and footers as part of the conversion process. I, however, am not having much luck getting Calibre to remove the headers and footers on your Charles Stross file. I consider myself lucky that the formula this thread gave me works for most of my pdfs, and the ones it doesn't work for I can usually fix ok with Machine Age Reader. It's just more tedious.

guess I need to learn regex

nerys · 03-06-2010, 10:53 PM

yes this file://.+\:53 works GREAT but on "some" headers and footers it just does not work. From what I understand notepad++ only partially supports regular expressions, so something likely breaks down in some strings and it gets confused.

thankfully those issues are rare :-) The combo of notepad++ to clean up, amazon for higher res covers, and calibre to crunch out lrf's gives a very pleasing result.

txt files read fine and require no "formatting" in advance like lrf requires, but then I get no covers and I have to wait for it to "format" the file when I load it, but that's only the first time.
