Remove linefeeds?


nekokami
03-10-2007, 04:18 PM
I have a number of PDF and other files that have hard linefeeds at the end of each line. When I convert these for my eBookwise reader (through intermediate steps of HTML or RTF), I end up with extra line breaks in the middle of every other line, approximately, which is very annoying.

I guess the best I can do is to get or create a utility that removes the linefeed from the end of any line longer than, say, 70 characters, replacing it with a space. This would occasionally remove a linefeed that I want to keep, but I think overall the files would be more readable than they are now.

Does anyone know of an available utility to do this? I could write it in Perl, but if there's one out there already, I'd just as soon use it. (And it's not quite as simple as just counting the characters, anyway, because formatting characters shouldn't be included in the total.)

Thanks,
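
The length heuristic described in this post could be sketched roughly as follows in Python. This is only an illustration, not an existing utility: the function names are made up, the 70-character threshold is the figure suggested above, and "formatting characters" are assumed to be HTML-style tags, which may not match the actual intermediate files.

```python
import re

# Assumption: formatting shows up as HTML-style tags such as <b> or </i>.
TAG_RE = re.compile(r"<[^>]+>")


def visible_length(line):
    """Return the length of a line with HTML-style tags removed."""
    return len(TAG_RE.sub("", line))


def join_long_lines(text, limit=70):
    """Replace the linefeed after every line longer than `limit`
    visible characters with a space; shorter lines keep their linefeed."""
    lines = text.splitlines()
    pieces = []
    for i, line in enumerate(lines):
        line = line.rstrip()
        pieces.append(line)
        if i < len(lines) - 1:
            # Long lines are assumed to be wrapped mid-paragraph.
            pieces.append(" " if visible_length(line) > limit else "\n")
    return "".join(pieces)
```

As noted in the post, a short line in the middle of a paragraph, or a long final line of a paragraph, will be handled wrongly by any purely length-based rule.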

RWood
03-10-2007, 09:10 PM
Try Stingo's Word Macro from the MobileRead Wiki Conversion page (http://wiki.mobileread.com/wiki/E-book_conversion). I have used it for RTF files on all sorts of Gutenberg books and other texts. You are right, it does make the results more readable.

While perhaps not as critical on an eBookWise as it is for a Sony Reader, having the end-of-line mark only at the end of a paragraph allows for better text flow as the font size is changed.

nekokami
03-10-2007, 09:49 PM
Thanks, but this Word macro only works if there are two paragraph marks at the end of a paragraph. Unfortunately, the files I need to fix don't have this feature. In the original PDF file, there's an indent on the first line of each new paragraph, but this isn't preserved when I convert the file with ABC or PDFtoHTML. Does anyone know of a PDF converter that preserves line indents in some way? If I could get that indent to convert to a tab, I could search and replace ^p^t with ^p^p and then use Stingo's macro (or modify the macro to do this automatically).

ashkulz
03-10-2007, 11:09 PM
nekokami: if you're trying to read PDFs on the eBookWise, why not read the PDF directly? I use a script similar to PDFRasterFarian to convert PDF => images which fit exactly onto the REB 1100 which I have (alex_d also helped me to get image dilation working).

It's currently Linux-specific, and if you're on Ubuntu you need to run
sudo apt-get install pdftk python-imaging xpdf-utils

The script is at http://puggy.symonds.net/~ashish/downloads/build-pdf.py

You'll probably have to change the script a bit, because the eBookWise has slightly lower resolution and cannot read in landscape. You will probably have to change line 67 from cropped.save to cropped.rotate(-90).save and remove the REB1100 specific stuff at the end.

I was planning on posting a generic script sometime when time permitted....

henkvdg
03-11-2007, 09:32 AM
What I do to remove linefeeds (not perfect, but it helps):

- I extract the text from the PDF (or web page).
- I copy that to Word.
- I replace all double LFs with &&.
- I replace all remaining LFs with spaces.
- I change all && back to double LFs.
- I do whatever I want with the text.

I hope this is clear
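
The same three replacements can be done outside Word. A minimal Python sketch, assuming the text uses plain LF or CRLF line endings and does not already contain the && placeholder (the function name is invented for this example):

```python
def unwrap_paragraphs(text, placeholder="&&"):
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalise Windows line endings so only LF remains.
    text = text.replace("\r\n", "\n")
    # Step 1: protect paragraph breaks (double LFs) with a placeholder.
    text = text.replace("\n\n", placeholder)
    # Step 2: every remaining LF is a mid-paragraph wrap; make it a space.
    text = text.replace("\n", " ")
    # Step 3: turn the placeholder back into a double LF.
    return text.replace(placeholder, "\n\n")
```

If the text might contain && already, pass a different placeholder string that cannot occur in it.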

nekokami
03-11-2007, 10:41 AM
@ashkulz, thanks, I'll give that a try. I have an Ubuntu system available. (apt-get is a wonderful thing.) How large do the files tend to be via this method?

@henkvdg, I've used this method before, too. Unfortunately, the file I'm working on doesn't have double linefeeds at the ends of paragraphs. The only paragraph indication is an indented first line.

NatCh
03-11-2007, 11:22 AM
You could try searching for the indentation characters, I mean either a Tab mark or a set of five spaces (or however many it is), and replacing each one with a LF followed by itself.

i.e. (assuming actual tab chars, which it probably isn't but this is easier to type out) replace "^t" with "^p^t" (where ^t = Tab and ^p = a paragraph mark in Word).

nekokami
03-11-2007, 11:33 AM
Right. The current methods I have of converting the PDF are not preserving the indents as tabs or spaces. So I need to find a way of converting the PDF such that these indents are preserved, then I can do what you are suggesting. :)

NatCh
03-11-2007, 04:37 PM
Ahhhh. Well, that does complicate matters. :(

RWood
03-11-2007, 05:07 PM
And I take it we can safely assume that the documents are either too long to re-paragraph by hand or, if each one is short enough, that there are too many of them to do the whole lot that way.

I did some PDF conversion a while back with ABBYY Transform and it seemed to keep the paragraph indents as 4 spaces. I don't use it much as I find myself fighting its version of formatting and reserve it almost only for PDFs that need OCR.
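
If a converter does keep the indents as leading spaces in plain text, the paragraphs can be rebuilt from the indents alone. A rough Python sketch, assuming a four-space indent marks the first line of each paragraph as described above (the function name and the exact indent are assumptions, not anything ABBYY documents):

```python
def reflow_indented(text, indent="    "):
    """Join wrapped lines into paragraphs, starting a new paragraph
    at every line that begins with `indent`. Returns one paragraph per line."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no information here
        if line.startswith(indent) and current:
            # An indented line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)
```

Text that is indented for other reasons, such as block quotes or verse, would be split into one-line paragraphs by this rule.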

ashkulz
03-12-2007, 02:45 AM
Originally Posted by nekokami: How large do the files tend to be via this method?

Well, you can roughly expect the file size to be double or slightly less. That's because I'm using mostly text-based PDFs. If you have lots of graphics, then it should be equal to or smaller than the original file size.

nekokami
03-12-2007, 01:01 PM
This program will export to Word and preserve the indents: http://www.convert-in.com/pdf2word.htm

However, it uses line indenting rather than a tab. Anyone know how to use Word to search for lines with a specific indenting? Then I need to insert some identifiable mark right before each of those lines, then get rid of all the paragraph marks, then replace the special mark with a paragraph mark. I can write a macro to do all that, but only if I can identify the indented lines to start with.

I'm giving ABBYY a try next, though it's pretty expensive if I need to keep using it.

Edit: ABBYY does the job if you pick "Text Flow" rather than "Original Layout." Apparently it's smart enough to figure out that an indent should be treated as a new paragraph, and ignores the other linefeeds. Great. Now I have to decide if that's worth US$99, or if I want to just settle for writing a linefeed removal program. :shrug:

nekokami
03-12-2007, 01:31 PM
Victory! ABC Amber PDF Converter will do it, you just need to go into "settings" and click "advanced extraction." Woohoo!

NatCh
03-12-2007, 01:38 PM
"And there was much rejoicing!" :guitarist

RWood
03-12-2007, 04:08 PM
Originally Posted by nekokami: Victory! ABC Amber PDF Converter will do it, you just need to go into "settings" and click "advanced extraction." Woohoo!
I've used the program for so long that I forgot about that. Now that you have jogged my memory I did use that once when I first got the program and then forgot about it. Great find. Thanks, I will go and play with it now. :D