MobileRead Forums > E-Book Software > Calibre

Old 08-19-2009, 05:29 AM   #1
irisclara
Member
Posts: 17
Karma: 60
Join Date: Aug 2009
Device: eee 900ha, CM7 Nook Color
remove PDF footer containing variable?

If this question is answered elsewhere, I apologize for wasting time. I did look.

I have a number of pdf files with footers (and sometimes headers) like

file://quickbrownfox/(65 of 296)

so the page number counts up to the total.

I've looked at python regular expressions but I can't figure out how to tell Calibre to leave these lines out when I convert to rtf.

Alternately, does anyone know of a way to remove sequential page numbers in a rtf? Then I could remove the parts of the line that stay the same with the extended replace function in TED notepad and the page numbers some other way.

Thanks.
Old 08-19-2009, 02:26 PM   #2
DerSchwarzePrinz
Enthusiast
Posts: 25
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
Try the following:

file://.+\)

This should remove every occurrence that starts with "file://", followed by any characters (".+"), up to a closing bracket ("\)").

Removing numbers is very easy with "\d+", but it removes every number in the document.
Perhaps someone out there knows a solution?
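
For anyone who wants to try the pattern outside Calibre first, here is a minimal Python sketch. The sample text is made up; only the footer shape comes from the first post. The second pattern, STRICT_FOOTER, is an assumption of mine rather than something suggested in this thread: it insists on the "(N of M)" counter and also swallows the line break, so no empty line is left behind.

```python
import re

# Made-up sample; only the footer line follows the shape quoted in the first post.
sample = (
    "The quick brown fox jumps over\n"
    "file://quickbrownfox/(65 of 296)\n"
    "the lazy dog (twice).\n"
)

# Pattern from the post: "file://", then any characters, up to a closing bracket.
# ".+" is greedy, but by default "." does not match a line break,
# so a match never runs past the end of the footer line.
FOOTER = re.compile(r"file://.+\)")

# Stricter variant (assumption, not from the thread): the line must start with
# "file://" and contain "(N of M)"; the trailing line break is removed too.
STRICT_FOOTER = re.compile(r"^file://.*\(\d+ of \d+\)[ \t]*\n?", re.MULTILINE)


def strip_footers(text, pattern=FOOTER):
    """Return text with every match of pattern deleted."""
    return pattern.sub("", text)


cleaned = strip_footers(sample)                        # leaves an empty line behind
strict_cleaned = strip_footers(sample, STRICT_FOOTER)  # removes the whole line
```

Because ".+" is greedy, the first pattern runs to the last closing bracket on the footer line, so anything sharing that line with the footer goes too.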
Old 08-19-2009, 02:42 PM   #3
=X=
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
Yeah I usually use "\n\d+\n". -- ignore quotes


This will remove numbers that sit alone on their own line but leave numbers in the body of the text.
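
A small Python sketch of the same idea, with made-up sample text. One detail to watch: the pattern includes the line breaks on both sides of the number, so the replacement here is a single "\n" rather than nothing, otherwise the neighbouring lines get glued together. The second form, using re.MULTILINE, is an alternative I am assuming would suit cases where several number-only lines follow each other; it is not from the thread.

```python
import re

# Made-up sample: "65" is a page number on its own line,
# while 1984, 3 and 2 are numbers inside the body text.
sample = "In 1984 there were 3 foxes.\n65\nThe 2 dogs slept.\n"

# Pattern from the post. Replacing with "\n" keeps one line break
# between the lines before and after the page number.
no_page_numbers = re.sub(r"\n\d+\n", "\n", sample)

# Alternative (assumption): anchor at the start of each line instead of
# consuming the preceding line break, so consecutive number-only lines all match.
no_page_numbers_alt = re.sub(r"^\d+[ \t]*\n", "", sample, flags=re.MULTILINE)
```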
Old 08-20-2009, 11:11 PM   #4
irisclara
Thank you very much DerSchwarzePrinz and =X=. I will try what you suggest.
Old 10-16-2009, 02:03 PM   #5
irisclara
I'm using a variation of DerSchwarzePrinz' suggestion with great success. Thanks again folks.
Old 02-27-2010, 11:39 PM   #6
nerys
Addict
Posts: 243
Karma: 48
Join Date: Dec 2006
Device: PRS 500 - REB 1200
Hey!! The file://.+\) pattern worked a treat, thanks!

Right now I use some pdf-to-txt tool to convert the PDF to text.

Then I use Notepad++ to do all of the following:

file://.+\:53 (this changes from document to document)

then I replace \r\n\r\n with a unique word

I then eliminate all \r\n

and then convert the unique word back to \r\n\r\n

This gets rid of the sentence \r\n but KEEPS the paragraph \r\n.

Alas, SOME books "lose" the double \r\n when converted.

GRRRRR :-)

Maybe adding in a conversion of .\r\n to .\r\n\r\n might work. (Yep, it worked.)

Basically you have to look over the document and do all kinds of custom replace strings to get it to look right.

I wish there was an INTELLIGENT converter that would retain the critical formatting for readability and get rid of the "junk", like footers and headers, that is not RELEVANT on an ereader, along with the extra spacing that is just "lost" formatting from the conversion process, etc.
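
The placeholder steps above can also be scripted. This is only a sketch of the same idea in Python, not what was actually run in Notepad++. PLACEHOLDER and unwrap are names made up for illustration, and joining wrapped lines with a space (rather than deleting the line break outright) is my assumption so that words do not run together.

```python
import re

# Hypothetical marker; any string that cannot occur in the book text will do.
PLACEHOLDER = "\x00PARA\x00"


def unwrap(text, fix_lost_breaks=False):
    """Join wrapped lines while keeping paragraph breaks (double line breaks)."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    if fix_lost_breaks:
        # Heuristic from the post: a full stop at the end of a line is
        # treated as the end of a paragraph when the double break was lost.
        text = re.sub(r"\.\n(?!\n)", ".\n\n", text)
    text = text.replace("\n\n", PLACEHOLDER)  # protect paragraph breaks
    text = text.replace("\n", " ")            # join wrapped lines (space is an assumption)
    return text.replace(PLACEHOLDER, "\n\n")  # restore paragraph breaks


sample = "The quick brown\r\nfox jumps.\r\n\r\nThe lazy dog\r\nsleeps."
unwrapped = unwrap(sample)

lost = "The fox jumps.\nThe dog\nsleeps."
repaired = unwrap(lost, fix_lost_breaks=True)
```

The full-stop heuristic will also split a paragraph wherever a wrapped line merely happens to end with a full stop, so the result still needs a read-through.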
Old 02-28-2010, 12:11 AM   #7
nerys
One problem I have is a section of this one book (the Accelerando license says I can repost its contents).

It looks like this:

***

Take a brain and put it in a bottle. Better: take a map of the brain and put

it in a map of a bottle -- or of a body -- and feed signals to it that mimic

its neurological inputs. Read its outputs and route them to a model body

in a model universe with a model of physical laws, closing the loop. René

Descartes would understand. That's the state of the passengers of the

Field Circus in a nutshell. Formerly physical humans, their neural

software (and a map of the intracranial wetware it runs on) has been

transferred into a virtual machine environment executing on a honking

great computer, where the universe they experience is merely a dream

within a dream

Brains in bottles -- empowered ones, with total, dictatorial, control over

the reality they are exposed to -- sometimes stop engaging in activities

that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting,

angina, exhaustion, and cramp are all optional. So is meatdeath, the

decomposition of the corpus. But some activities don't cease, because

people (even people who have been converted into a software

description, squirted through a high-bandwidth laser link, and ported into

a virtualization stack) don't want them to stop. Breathing is wholly

unnecessary, but suppression of the breathing reflex is disturbing unless

you hack your hypothalamic map, and most homomorphic uploads don't

want to do that. Then there's eating -- not to avoid starvation, but for

pleasure: Feasts on sautéed dodo seasoned with silphium are readily

available here, and indeed, why not? It seems the human addiction to

sensory input won't go away. And that's without considering sex, and the



technical innovations that become possible when the universe -- and the

bodies within it -- are mutable

***

I have NO IDEA why it's formatted like that or the proper way to REFORMAT it.

So I just get rid of the double breaks and make it this:

***
Take a brain and put it in a bottle. Better: take a map of the brain and put
it in a map of a bottle -- or of a body -- and feed signals to it that mimic
its neurological inputs. Read its outputs and route them to a model body
in a model universe with a model of physical laws, closing the loop. René
Descartes would understand. That's the state of the passengers of the
Field Circus in a nutshell. Formerly physical humans, their neural
software (and a map of the intracranial wetware it runs on) has been
transferred into a virtual machine environment executing on a honking
great computer, where the universe they experience is merely a dream
within a dream
Brains in bottles -- empowered ones, with total, dictatorial, control over
the reality they are exposed to -- sometimes stop engaging in activities
that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting,
angina, exhaustion, and cramp are all optional. So is meatdeath, the
decomposition of the corpus. But some activities don't cease, because
people (even people who have been converted into a software
description, squirted through a high-bandwidth laser link, and ported into
a virtualization stack) don't want them to stop. Breathing is wholly
unnecessary, but suppression of the breathing reflex is disturbing unless
you hack your hypothalamic map, and most homomorphic uploads don't
want to do that. Then there's eating -- not to avoid starvation, but for
pleasure: Feasts on sautéed dodo seasoned with silphium are readily
available here, and indeed, why not? It seems the human addiction to
sensory input won't go away. And that's without considering sex, and the

technical innovations that become possible when the universe -- and the
bodies within it -- are mutable
***

But it's clear that's not quite right. I don't have a better way right now, so it's what I am using.
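
Judging from the before and after excerpts, the cleanup amounts to halving every run of line breaks: the double breaks between lines become single ones, and the longer gap in the middle shrinks to one blank line. A minimal Python sketch of that reading follows. The function name halve_line_breaks is made up, and the sample reuses only a few words from the excerpt; this is one possible way to automate the step, not a known-good recipe for this book.

```python
import re


def halve_line_breaks(text):
    """Replace each run of n line breaks with n // 2 line breaks (at least one)."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    return re.sub(
        r"\n+",
        lambda m: "\n" * max(1, len(m.group()) // 2),
        text,
    )


# Short sample in the same shape as the excerpt: double breaks between
# lines, and a run of four line breaks where the larger gap sits.
sample = "Take a brain and put\n\nit in a map of a bottle\n\n\n\ntechnical innovations"
halved = halve_line_breaks(sample)
```

This still leaves every line hard-wrapped, and it cannot tell a paragraph break from an ordinary line break when both were doubled in the same way.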

ahh it is pretty close. HERE is the same section in the PDF. Very odd formatting.

I have also attached the PDF and the TXT I made from it. (The author gave permission to do this in his licensing.)

Any suggestions on better ways to automate the conversion? It does not have to be txt, but something close in its simplicity is preferred, i.e. readable on the PC without special software and readable natively on the Sony Reader.
Attached Thumbnails
Clipboard01.jpg (150.1 KB)
Attached Files
File Type: pdf Charles Stross - B1 Accelerando.pdf (1.08 MB)
File Type: txt Charles Stross - B1 Accelerando.txt (865.5 KB)

Last edited by nerys; 02-28-2010 at 12:23 AM.
Old 03-05-2010, 03:18 AM   #8
irisclara
What I do is remove headers and footers and convert my pdfs with Calibre. The resulting txt or rtf usually contains a lot of residual formatting. To get rid of this I use the automated editing tools in Machine Age Reader to remove things like extra lines and spaces and to connect broken lines and paragraphs. When I get it as clean as I can with the auto tools I proof the text (still using Machine Age Reader) and fix any remaining problems manually. In case you can't tell, I really love this program. The tools are perfect for this kind of work. It has saved me many miserable hours manually removing line breaks and extra lines. I also like the fact that it lets me edit while displaying the text in proper, 2 page, computer ebook format. Reading in notepad always gave me a headache.

I'll try running your file through Calibre and Machine Age Reader tomorrow. I'll let you know how long and how many steps it takes.

Last edited by irisclara; 03-05-2010 at 03:33 AM. Reason: more info to add
Old 03-05-2010, 10:50 AM   #9
nerys
how do you remove the headers and footers?
Old 03-06-2010, 06:38 PM   #10
irisclara
Calibre can remove headers and footers as part of the conversion process. I, however, am not having much luck getting Calibre to remove the headers and footers on your Charles Stross file. I consider myself lucky that the formula this thread gave me works for most of my pdfs, and the ones it doesn't work for I can usually fix OK with Machine Age Reader. It's just more tedious.

guess I need to learn regex
Old 03-06-2010, 10:53 PM   #11
nerys
Yes, this file://.+\:53 works GREAT, but on "some" headers and footers it just does not work. From what I understand, Notepad++ only partially supports regular expressions, so something likely breaks down on some strings and it gets confused.

Thankfully those issues are rare :-) The combo of Notepad++ to clean up, Amazon for higher-res covers, and Calibre to crunch out LRFs gives a very pleasing result.

Txt files read fine and require no "formatting" in advance the way LRF does, but then I get no covers and I have to wait for the reader to "format" the file when I load it, though that's only the first time.

All times are GMT -4.