08-19-2009, 05:29 AM | #1 |
Member
Posts: 17
Karma: 60
Join Date: Aug 2009
Device: eee 900ha, CM7 Nook Color
|
remove PDF footer containing variable?
If this question is answered elsewhere, I apologize for wasting time. I did look.
I have a number of PDF files with footers (and sometimes headers) like file://quickbrownfox/(65 of 296), where the page number counts up to the total. I've looked at Python regular expressions, but I can't figure out how to tell Calibre to leave these lines out when I convert to RTF. Alternatively, does anyone know of a way to remove sequential page numbers in an RTF? Then I could remove the parts of the line that stay the same with the extended replace function in TED Notepad and the page numbers some other way. Thanks. |
08-19-2009, 02:26 PM | #2 | |
Enthusiast
Posts: 25
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
|
Quote:
file://.+\) This should remove all occurrences starting with "file://", then any characters ".+", up to a closing bracket "\)". Removing numbers is very easy with "\d+", but that removes every number in the document. Perhaps someone out there knows a solution? |
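As a rough sketch of how the pattern in this post behaves in Python (the first post mentions Python regular expressions): the sample text and the helper name strip_footers are made up for illustration, and the tighter pattern that insists on the "(N of M)" counter is only an assumption based on the footer quoted in the first post.

```python
import re

# Loose pattern from the post: "file://", then anything, up to the last ")" on the line.
LOOSE = re.compile(r"file://.+\)")

# Tighter variant (an assumption): also require the "(N of M)" page counter,
# so an unrelated line that merely contains "file://" and ")" is left alone.
TIGHT = re.compile(r"file://\S*\(\d+ of \d+\)")

def strip_footers(text, pattern=TIGHT):
    """Remove every match of the footer pattern and return the cleaned text."""
    return pattern.sub("", text)

# Hypothetical sample: body text with a footer line in the middle.
sample = (
    "The quick brown fox (the fast one) jumped.\n"
    "file://quickbrownfox/(65 of 296)\n"
    "Chapter 12 continues here.\n"
)
cleaned = strip_footers(sample)
print(cleaned)
```

Because "." does not match a line break by default, neither pattern can run past the end of the footer line. Both leave an empty line where the footer used to be, which a later cleanup pass can remove.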
08-19-2009, 02:42 PM | #3 |
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
Yeah I usually use "\n\d+\n". -- ignore quotes
This will remove numbers that sit on a line by themselves but leave numbers in the body of the text. |
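A small Python sketch of this idea, with made-up sample text. One caveat: replacing "\n\d+\n" with nothing also swallows both line breaks and joins the neighbouring lines, so the sketch uses the multiline form ^\d+\n instead. That is a substitution on my part, not the exact expression from the post.

```python
import re

# A line consisting only of digits, together with its line break.
# re.MULTILINE makes ^ match at the start of every line, not just the text.
PAGE_NUMBER_LINE = re.compile(r"^\d+\n", re.MULTILINE)

def drop_page_number_lines(text):
    """Delete lines that contain nothing but a number."""
    return PAGE_NUMBER_LINE.sub("", text)

# Hypothetical sample: page numbers on their own lines, other numbers inline.
sample = "It was 1984 when it began.\n65\nThe 2 of them left.\n66\nThe end.\n"
result = drop_page_number_lines(sample)
print(result)
```

The numbers inside sentences ("1984", "2") survive because they are not alone on their line.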
08-20-2009, 11:11 PM | #4 |
Member
Posts: 17
Karma: 60
Join Date: Aug 2009
Device: eee 900ha, CM7 Nook Color
|
Thank you very much DerSchwarzePrinz and =X=. I will try what you suggest.
|
10-16-2009, 02:03 PM | #5 |
Member
Posts: 17
Karma: 60
Join Date: Aug 2009
Device: eee 900ha, CM7 Nook Color
|
I'm using a variation of DerSchwarzePrinz' suggestion with great success. Thanks again folks.
|
02-27-2010, 11:39 PM | #6 |
Addict
Posts: 243
Karma: 48
Join Date: Dec 2006
Device: PRS 500 - REB 1200
|
Hey! The file://.+\) worked a treat, thanks!
Right now I use a PDF-to-text tool to convert the PDF to text, then I use Notepad++ to do all of the following: first I remove file://.+\:53 (this changes from document to document), then I replace \r\n\r\n with a unique word, then I eliminate all remaining \r\n, and finally I convert the unique word back to \r\n\r\n. This gets rid of the sentence \r\n but KEEPS the paragraph \r\n.
Alas, SOME books "lose" the double \r\n when converted. GRRRRR :-) Maybe adding in a conversion from .\r\n to \r\n\r\n might work. (Yep, it worked.)
Basically you have to look over the document and do all kinds of custom replace strings to get it to look right. I wish there was an INTELLIGENT converter that would retain the critical formatting for readability and get rid of the "junk", like footers and headers that are not RELEVANT on an ereader, and the extra spacing that is just "lost" formatting from the conversion process. |
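The replace sequence described in this post (protect the double breaks, drop the single ones, restore the doubles) can be sketched in Python roughly as follows. Instead of a placeholder word, the sketch splits the text on blank lines, which should have the same effect. The function name unwrap_lines and the sample text are hypothetical, and it assumes paragraphs really are separated by blank lines.

```python
import re

def unwrap_lines(text):
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalise Windows line endings so only "\n" has to be handled.
    text = text.replace("\r\n", "\n")
    # One or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n{2,}", text.strip())
    # Inside a paragraph, turn single line breaks and runs of spaces into one space.
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(joined)

# Hypothetical sample: two hard-wrapped paragraphs with Windows line endings.
sample = "The fox ran\r\nover the hill.\r\n\r\nThe dog\r\nslept.\r\n"
result = unwrap_lines(sample)
print(result)
```

For books that lose the double break during conversion, this alone is not enough; the extra rule from the post (treating a full stop at the end of a line as a paragraph end) would have to be applied before the split.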
02-28-2010, 12:11 AM | #7 |
Addict
Posts: 243
Karma: 48
Join Date: Dec 2006
Device: PRS 500 - REB 1200
|
One problem I have is a section of this one book (Accelerando; the license says I can repost its contents) that looks like this:
***
Take a brain and put it in a bottle. Better: take a map of the brain and put it in a map of a bottle -- or of a body -- and feed signals to it that mimic its neurological inputs. Read its outputs and route them to a model body in a model universe with a model of physical laws, closing the loop. René Descartes would understand. That's the state of the passengers of the Field Circus in a nutshell. Formerly physical humans, their neural software (and a map of the intracranial wetware it runs on) has been transferred into a virtual machine environment executing on a honking great computer, where the universe they experience is merely a dream within a dream Brains in bottles -- empowered ones, with total, dictatorial, control over the reality they are exposed to -- sometimes stop engaging in activities that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting, angina, exhaustion, and cramp are all optional. So is meatdeath, the decomposition of the corpus. But some activities don't cease, because people (even people who have been converted into a software description, squirted through a high-bandwidth laser link, and ported into a virtualization stack) don't want them to stop. Breathing is wholly unnecessary, but suppression of the breathing reflex is disturbing unless you hack your hypothalamic map, and most homomorphic uploads don't want to do that. Then there's eating -- not to avoid starvation, but for pleasure: Feasts on sautéed dodo seasoned with silphium are readily available here, and indeed, why not? It seems the human addiction to sensory input won't go away. And that's without considering sex, and the technical innovations that become possible when the universe -- and the bodies within it -- are mutable
***
I have NO IDEA why it's formatted like that or what the proper way to REFORMAT it is, so I just get rid of the double breaks and make it this:
***
Take a brain and put it in a bottle.
Better: take a map of the brain and put it in a map of a bottle -- or of a body -- and feed signals to it that mimic its neurological inputs. Read its outputs and route them to a model body in a model universe with a model of physical laws, closing the loop. René Descartes would understand. That's the state of the passengers of the Field Circus in a nutshell. Formerly physical humans, their neural software (and a map of the intracranial wetware it runs on) has been transferred into a virtual machine environment executing on a honking great computer, where the universe they experience is merely a dream within a dream Brains in bottles -- empowered ones, with total, dictatorial, control over the reality they are exposed to -- sometimes stop engaging in activities that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting, angina, exhaustion, and cramp are all optional. So is meatdeath, the decomposition of the corpus. But some activities don't cease, because people (even people who have been converted into a software description, squirted through a high-bandwidth laser link, and ported into a virtualization stack) don't want them to stop. Breathing is wholly unnecessary, but suppression of the breathing reflex is disturbing unless you hack your hypothalamic map, and most homomorphic uploads don't want to do that. Then there's eating -- not to avoid starvation, but for pleasure: Feasts on sautéed dodo seasoned with silphium are readily available here, and indeed, why not? It seems the human addiction to sensory input won't go away. And that's without considering sex, and the technical innovations that become possible when the universe -- and the bodies within it -- are mutable
***
It's clear that's not quite right, but I don't have a better way right now, so it's what I am using. Ahh, it is pretty close. HERE is the same section in the PDF. Very odd formatting. I have also attached the PDF and the TXT I made from it.
(The author gave permission to do this in his licensing.) Any suggestions on better ways to automate the conversion? It does not have to be TXT, but something close in its simplicity is preferred, i.e. readable on the PC without special software and readable natively on the Sony Reader. Last edited by nerys; 02-28-2010 at 12:23 AM. |
03-05-2010, 03:18 AM | #8 |
Member
Posts: 17
Karma: 60
Join Date: Aug 2009
Device: eee 900ha, CM7 Nook Color
|
What I do is remove headers and footers and convert my pdfs with Calibre. The resulting txt or rtf usually contains a lot of residual formatting. To get rid of this I use the automated editing tools in Machine Age Reader to remove things like extra lines and spaces and to connect broken lines and paragraphs. When I get it as clean as I can with the auto tools I proof the text (still using Machine Age Reader) and fix any remaining problems manually. In case you can't tell, I really love this program. The tools are perfect for this kind of work. It has saved me many miserable hours manually removing line breaks and extra lines. I also like the fact that it lets me edit while displaying the text in proper, 2 page, computer ebook format. Reading in notepad always gave me a headache.
I'll try running your file through Calibre and Machine Age Reader tomorrow. I'll let you know how long and how many steps it takes. Last edited by irisclara; 03-05-2010 at 03:33 AM. Reason: more info to add |
03-05-2010, 10:50 AM | #9 |
Addict
Posts: 243
Karma: 48
Join Date: Dec 2006
Device: PRS 500 - REB 1200
|
how do you remove the headers and footers?
|
03-06-2010, 06:38 PM | #10 |
Member
Posts: 17
Karma: 60
Join Date: Aug 2009
Device: eee 900ha, CM7 Nook Color
|
Calibre can remove headers and footers as part of the conversion process. I, however, am not having much luck getting Calibre to remove the headers and footers on your Charles Stross file. I consider myself lucky that the formula this thread gave me works for most of my PDFs, and the ones it doesn't work for I can usually fix OK with Machine Age Reader. It's just more tedious.
Guess I need to learn regex. |
03-06-2010, 10:53 PM | #11 |
Addict
Posts: 243
Karma: 48
Join Date: Dec 2006
Device: PRS 500 - REB 1200
|
Yes, this file://.+\:53 works GREAT, but on "some" headers and footers it just does not work. From what I understand, Notepad++ only partially supports regular expressions, so something likely breaks down in some strings and it gets confused.
Thankfully those issues are rare :-) The combo of Notepad++ to clean up, Amazon for higher-res covers, and Calibre to crunch out LRFs gives a very pleasing result. TXT files read fine and require no "formatting" in advance like LRF requires, but then I get no covers and I have to wait for it to "format" the file when I load it, though that's only the first time. |