irisclara
08-19-2009, 05:29 AM
If this question is answered elsewhere, I apologize for wasting time. I did look.
I have a number of pdf files with footers (and sometimes headers) like
file://quickbrownfox/(65 of 296)
so the page number counts up to the total.
I've looked at python regular expressions but I can't figure out how to tell Calibre to leave these lines out when I convert to rtf.
Alternately, does anyone know of a way to remove sequential page numbers in an rtf? Then I could remove the parts of the line that stay the same with the extended replace function in TED Notepad, and the page numbers some other way.
Thanks.
DerSchwarzePrinz
08-19-2009, 02:26 PM
Try the following:
file://.+\)
This should remove all occurrences starting with "file://", then any characters ".+", up to a closing bracket "\)".
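For anyone who wants to test the expression outside Calibre first, here is a minimal Python sketch. The pattern is the one from this post; anchoring it to whole lines with `^` and `re.MULTILINE`, the trailing-newline handling, and the name `strip_footers` are my own assumptions, not anything Calibre does internally.

```python
import re

# Pattern from the thread: "file://", then any characters, up to a closing ")".
# Assumption: the footer sits on a line of its own, so the pattern is anchored
# to the line start and also swallows the line ending that follows it.
FOOTER = re.compile(r"^file://.+\)[ \t]*\r?\n?", re.MULTILINE)

def strip_footers(text: str) -> str:
    """Remove every line that looks like file://.../(N of M)."""
    return FOOTER.sub("", text)

sample = (
    "First page text (with a bracket)\n"
    "file://quickbrownfox/(65 of 296)\n"
    "Second page text\n"
)
print(strip_footers(sample))
```

Because of the line anchor, brackets in the body text are left alone; only lines that begin with "file://" are dropped.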
Removing numbers is very easy with "\d+", but it removes every number in the document.
Perhaps someone out there knows a solution?
Yeah I usually use "\n\d+\n". -- ignore quotes
This will remove numbers that lie on a single line but leave numbers in the body of the text.
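A rough Python sketch of the same idea, in case it helps. It is a variation rather than the exact expression above: instead of matching the newline on both sides, it anchors to the start of a line, so two page-number lines in a row are both caught and the neighbouring lines are not glued together. The function name is made up for illustration.

```python
import re

# A line that contains nothing but digits (optionally padded with spaces/tabs).
# Assumption: page numbers sit alone on their own line, as described above.
PAGE_NUMBER_LINE = re.compile(r"^[ \t]*\d+[ \t]*\r?\n", re.MULTILINE)

def strip_page_number_lines(text: str) -> str:
    """Drop digit-only lines; numbers inside body text are untouched."""
    return PAGE_NUMBER_LINE.sub("", text)

sample = (
    "Chapter 3 begins here\n"
    "65\n"
    "66\n"
    "It was 1984 when the story started\n"
)
print(strip_page_number_lines(sample))
```

One caveat: any legitimate line that is only a number (a bare chapter number, for example) would be removed too, so it is worth checking the result.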
irisclara
08-20-2009, 11:11 PM
Thank you very much DerSchwarzePrinz and =X=. I will try what you suggest.
irisclara
10-16-2009, 02:03 PM
I'm using a variation of DerSchwarzePrinz' suggestion with great success. Thanks again folks.
nerys
02-27-2010, 11:39 PM
hey!! the file://.+\) worked a treat thanks!
right now I use a pdf to txt tool to convert the pdf to text
then I use notepad++ to do all of the following
file://.+\:53 (this changed from docu to docu)
then I replace \r\n\r\n with a unique word
I then eliminate all \r\n
and then convert the unique word back to \r\n\r\n
this gets rid of sentence \r\n but KEEPS paragraph \r\n
alas SOME books "lose" the double \r\n when converted.
GRRRRR :-)
maybe adding in a conversion for .\r\n to \r\n\r\n might work. (yep it worked)
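The placeholder steps above can be sketched in Python roughly like this. It is only an illustration under a few assumptions: line endings are normalised to "\n" first, single line breaks are replaced with a space (so words on adjacent lines do not run together), and the marker string and function name are invented for the example.

```python
def unwrap_lines(text: str, marker: str = "\x00PARA\x00") -> str:
    """Join wrapped lines into paragraphs while keeping blank-line breaks."""
    # Assumption: treat Windows and Unix line endings the same way.
    text = text.replace("\r\n", "\n")
    # Step 1: protect paragraph breaks with a marker that cannot occur in the text.
    text = text.replace("\n\n", marker)
    # Step 2: remove the remaining (sentence-level) line breaks.
    text = text.replace("\n", " ")
    # Step 3: turn the marker back into a paragraph break.
    return text.replace(marker, "\n\n")

sample = "line one\r\nline two\r\n\r\nnext para\r\nmore"
print(unwrap_lines(sample))
```

This only works when the converted text still has the double line break between paragraphs; for books that lose it, some extra rule (like the period one mentioned above) is still needed.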
Basically you have to look over the document and do all kinds of custom replace strings to get it to look right.
I wish there was an INTELLIGENT converter that would retain the critical formatting for readability and get rid of the "junk" like footers and headers that are not RELEVANT on an ereader. etc.. etc..
extra spacing which is just "lost" formatting in the conversion process etc..
nerys
02-28-2010, 12:11 AM
One problem I have is a section of this one book (the Accelerando license says I can repost its contents)
looks like this
***
Take a brain and put it in a bottle. Better: take a map of the brain and put
it in a map of a bottle -- or of a body -- and feed signals to it that mimic
its neurological inputs. Read its outputs and route them to a model body
in a model universe with a model of physical laws, closing the loop. René
Descartes would understand. That's the state of the passengers of the
Field Circus in a nutshell. Formerly physical humans, their neural
software (and a map of the intracranial wetware it runs on) has been
transferred into a virtual machine environment executing on a honking
great computer, where the universe they experience is merely a dream
within a dream
Brains in bottles -- empowered ones, with total, dictatorial, control over
the reality they are exposed to -- sometimes stop engaging in activities
that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting,
angina, exhaustion, and cramp are all optional. So is meatdeath, the
decomposition of the corpus. But some activities don't cease, because
people (even people who have been converted into a software
description, squirted through a high-bandwidth laser link, and ported into
a virtualization stack) don't want them to stop. Breathing is wholly
unnecessary, but suppression of the breathing reflex is disturbing unless
you hack your hypothalamic map, and most homomorphic uploads don't
want to do that. Then there's eating -- not to avoid starvation, but for
pleasure: Feasts on sautéed dodo seasoned with silphium are readily
available here, and indeed, why not? It seems the human addiction to
sensory input won't go away. And that's without considering sex, and the
technical innovations that become possible when the universe -- and the
bodies within it -- are mutable
***
I have NO IDEA why it's formatted like that or the proper way to REFORMAT it.
so I just get rid of the double breaks and make it this
***
Take a brain and put it in a bottle. Better: take a map of the brain and put
it in a map of a bottle -- or of a body -- and feed signals to it that mimic
its neurological inputs. Read its outputs and route them to a model body
in a model universe with a model of physical laws, closing the loop. René
Descartes would understand. That's the state of the passengers of the
Field Circus in a nutshell. Formerly physical humans, their neural
software (and a map of the intracranial wetware it runs on) has been
transferred into a virtual machine environment executing on a honking
great computer, where the universe they experience is merely a dream
within a dream
Brains in bottles -- empowered ones, with total, dictatorial, control over
the reality they are exposed to -- sometimes stop engaging in activities
that brains in bodies can't avoid. Menstruation isn't mandatory. Vomiting,
angina, exhaustion, and cramp are all optional. So is meatdeath, the
decomposition of the corpus. But some activities don't cease, because
people (even people who have been converted into a software
description, squirted through a high-bandwidth laser link, and ported into
a virtualization stack) don't want them to stop. Breathing is wholly
unnecessary, but suppression of the breathing reflex is disturbing unless
you hack your hypothalamic map, and most homomorphic uploads don't
want to do that. Then there's eating -- not to avoid starvation, but for
pleasure: Feasts on sautéed dodo seasoned with silphium are readily
available here, and indeed, why not? It seems the human addiction to
sensory input won't go away. And that's without considering sex, and the
technical innovations that become possible when the universe -- and the
bodies within it -- are mutable
***
but it's clear that's not quite right but I don't have a better way right now so it's what I am using.
ahh it is pretty close. HERE is the same section in the PDF. Very odd formatting.
I have also attached the PDF and the TXT I made from it. (the author gave permission to do this in his licensing)
Suggestions on better ways to automate the conversion? It does not have to be txt, but something close in its simplicity is preferred, i.e. readable on the pc without special software and readable natively on the sony reader.
irisclara
03-05-2010, 03:18 AM
What I do is remove headers and footers and convert my pdfs with Calibre. The resulting txt or rtf usually contains a lot of residual formatting. To get rid of this I use the automated editing tools in Machine Age Reader (http://download.cnet.com/Machine-Age-Reader/3000-2125_4-10421956.html) to remove things like extra lines and spaces and to connect broken lines and paragraphs. When I get it as clean as I can with the auto tools I proof the text (still using Machine Age Reader) and fix any remaining problems manually. In case you can't tell, I really love this program. The tools are perfect for this kind of work. It has saved me many miserable hours manually removing line breaks and extra lines. I also like the fact that it lets me edit while displaying the text in proper, 2 page, computer ebook format. Reading in notepad always gave me a headache.
I'll try running your file through Calibre and Machine Age Reader tomorrow. I'll let you know how long and how many steps it takes.
nerys
03-05-2010, 10:50 AM
how do you remove the headers and footers?
irisclara
03-06-2010, 06:38 PM
Calibre can remove headers and footers as part of the conversion process. I, however, am not having much luck getting Calibre to remove the headers and footers on your Charles Stross file. I consider myself lucky that the formula this thread gave me works for most of my pdfs, and the ones it doesn't work for I can usually fix ok with Machine Age Reader. It's just more tedious.
guess I need to learn regex
nerys
03-06-2010, 10:53 PM
yes this file://.+\:53 works GREAT but on "some" headers and footers it just does not work. From what I understand notepad++ only partially supports regular expressions, so something likely breaks down in some strings and it gets confused.
thankfully those issues are rare :-) The combo of notepad++ to clean up, amazon for higher res covers and calibre to crunch out lrf's gives a very pleasing result.
txt files read fine and require no "formatting" in advance like lrf requires, but then I get no covers and I have to wait for it to "format" it when I load it, but that's only the first time.