#1 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Remove linefeeds?
I have a number of PDF and other files that have hard linefeeds at the end of each line. When I convert these for my eBookwise reader (through intermediate steps of HTML or RTF), I end up with extra line breaks in the middle of every other line, approximately, which is very annoying.
I guess the best I can do is to get or create a utility that removes the linefeed from the end of any line longer than, say, 70 characters, replacing it with a space. This would occasionally remove a linefeed that I don't want removed, but I think overall the files would be more readable than they are now. Does anyone know of an available utility to do this? I could write it in Perl, but if there's one out there already, I'd just as soon use it. (And it's not quite as simple as just counting the characters, anyway, because formatting characters shouldn't be included in the total.)
Thanks,
Last edited by nekokami; 03-10-2007 at 03:24 PM. Reason: taking formatting into account
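A minimal sketch of the length-based heuristic described above, in Python rather than Perl. It assumes the "formatting characters" are HTML-style tags (the post does not say which markup is involved), and the names `visible_length` and `unwrap_long_lines` are made up for illustration.

```python
import re

# Assumption: formatting is HTML-style markup such as <b>...</b>.
TAG_RE = re.compile(r"<[^>]+>")


def visible_length(line):
    """Return the length of a line with tag-like markup left out of the count."""
    return len(TAG_RE.sub("", line))


def unwrap_long_lines(text, threshold=70):
    """Replace the linefeed after every line longer than `threshold` with a space.

    Shorter lines keep their linefeed, on the guess that they end a paragraph.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    pieces = []
    for i, line in enumerate(lines):
        pieces.append(line)
        if i < len(lines) - 1:
            # A long line was probably wrapped mid-paragraph, so join it
            # to the next line; a short line keeps its break.
            pieces.append(" " if visible_length(line) > threshold else "\n")
    return "".join(pieces)
```

As the post notes, this will sometimes join lines that should stay separate, for example a paragraph whose last line happens to run past the threshold.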
#2 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Try Stingo's Word Macro from the MobileRead Wiki Conversion page. I have used it for RTF files on all sorts of Gutenberg books and other texts. You are right, it does make the results more readable.
While perhaps not as critical on an eBookWise as it is for a Sony Reader, having the end-of-line mark only at the end of a paragraph allows for better text flow as the font size is changed.
#3 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Thanks, but this Word macro only works if there are two paragraph marks at the end of a paragraph. Unfortunately, the files I need to fix don't have this feature. In the original PDF file, there's an indent on the first line of each new paragraph, but this isn't preserved when I convert the file with ABC or PDFtoHTML. Does anyone know of a PDF converter that preserves line indents in some way? I could search and replace ^p^t with ^p^p and then use Stingo's macro (or modify the macro to do this automatically), if I could get that indent to convert to a tab.
#4 | |
Addict
Posts: 350
Karma: 705
Join Date: Dec 2006
Location: Mumbai, India
Device: Kindle 1/REB 1200
|
nekokami: if you're trying to read PDFs on the eBookWise, why not read the PDF directly? I use a script similar to PDFRasterFarian to convert PDF => images which fit exactly on to the REB 1100 which I have (alex_d also helped me to get image dilation working).
It's currently Linux-specific, and if you're on Ubuntu you need to do Quote:
You'll probably have to change the script a bit, because the eBookWise has slightly lower resolution and cannot read in landscape. You will probably have to change line 67 from cropped.save to cropped.rotate(-90).save and remove the REB1100-specific stuff at the end. I was planning on posting a generic script sometime when time permitted....
Last edited by ashkulz; 03-10-2007 at 10:13 PM.
#5 |
Groupie
Posts: 180
Karma: 66830
Join Date: Oct 2006
Device: IREX iLiad, Pocketbook Pro 903
|
What I do to remove linefeeds (not perfect, but it helps):
- I extract the text from the PDF (or web page).
- I copy that to Word
- I replace all double LF's with &&
- I replace all LF's with spaces
- I change back all && to double LF's
- I do whatever I want with the text.
I hope this is clear
Last edited by henkvdg; 03-11-2007 at 08:35 AM.
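The same placeholder trick can be done outside Word. The following is only a sketch in Python: the function name `unwrap_with_placeholder` is made up, and it assumes the text never contains the marker `&&` on its own and that paragraphs are separated by exactly two linefeeds.

```python
def unwrap_with_placeholder(text, marker="&&"):
    """Join wrapped lines while keeping double-linefeed paragraph breaks."""
    text = text.replace("\r\n", "\n")      # normalise Windows line endings
    text = text.replace("\n\n", marker)    # protect the paragraph breaks
    text = text.replace("\n", " ")         # join the remaining wrapped lines
    return text.replace(marker, "\n\n")    # put the paragraph breaks back


sample = "one\ntwo\n\nthree\nfour"
print(unwrap_with_placeholder(sample))
```

If the source might contain `&&`, a marker that cannot occur in the text (an unused control character, for example) would be a safer choice.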
#6 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
@ashkulz, thanks, I'll give that a try. I have an Ubuntu system available. (apt-get is a wonderful thing.) How large do the files tend to be via this method?
@henkvdg, I've used this method before, too. Unfortunately, the file I'm working on doesn't have double linefeeds at the ends of paragraphs. The only paragraph indication is an indented first line.
#7 |
Gizmologist
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
|
You could try searching for the indentation characters, meaning either a tab mark or a set of five spaces (or however many it is), and replacing that with itself preceded by a LF.
i.e. (assuming actual tab chars, which it probably isn't, but this is easier to type out) replace "^t" with "^p^t" (where ^t = tab and ^p = a paragraph mark in Word).
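For plain text outside Word, a variation on the same idea could look like the sketch below. It only helps if the conversion step keeps the indent as a real tab or a run of spaces, which is an assumption here; `reflow_by_indent` and its `min_spaces` parameter are hypothetical names, not part of any tool mentioned in this thread.

```python
def reflow_by_indent(text, min_spaces=5):
    """Start a new paragraph at each indented line and join everything else.

    A line counts as indented if it begins with a tab or with at least
    `min_spaces` spaces. Blank lines also end the current paragraph.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        is_blank = not line.strip()
        is_indented = line.startswith("\t") or line.startswith(" " * min_spaces)
        if (is_blank or is_indented) and current:
            # Close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
        if not is_blank:
            current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    # One paragraph per output line, with the indent itself dropped.
    return "\n".join(paragraphs)
```

The output has one linefeed per paragraph, so the text can reflow when the font size changes.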
#8 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Right. The current methods I have of converting the PDF are not preserving the indents as tabs or spaces. So I need to find a way of converting the PDF such that these indents are preserved, then I can do what you are suggesting.
#9 |
Gizmologist
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
|
Ahhhh. Well, that does complicate matters.
#10 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
And I assume it is safe to say that the books/documents are either too long to re-paragraph by hand or, if each one is short enough, that there are too many of them to do the entire lot that way.
I did some PDF conversion a while back with ABBYY Transform and it seemed to keep the paragraph indents as 4 spaces. I don't use it much, as I find myself fighting its version of formatting, and I reserve it almost exclusively for PDFs that need OCR.
#11 | |
Addict
Posts: 350
Karma: 705
Join Date: Dec 2006
Location: Mumbai, India
Device: Kindle 1/REB 1200
|
Quote:
#12 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
This program will export to Word and preserve the indents: http://www.convert-in.com/pdf2word.htm
However, it uses line indenting rather than a tab. Anyone know how to use Word to search for lines with a specific indenting? Then I need to insert some identifiable mark right before each of those lines, then get rid of all the paragraph marks, then replace the special mark with a paragraph mark. I can write a macro to do all that, but only if I can identify the indented lines to start with.
I'm giving ABBYY a try next, though it's pretty expensive if I need to keep using it.
Edit: ABBYY does the job if you pick "Text Flow" rather than "Original Layout." Apparently it's smart enough to figure out that an indent should be treated as a new paragraph, and ignores the other linefeeds. Great. Now I have to decide if that's worth US$99, or if I want to just settle for writing a linefeed removal program.
Last edited by nekokami; 03-12-2007 at 12:09 PM.
#13 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Victory! ABC Amber PDF Converter will do it; you just need to go into "settings" and click "advanced extraction." Woohoo!
#14 |
Gizmologist
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
|
"And there was much rejoicing!"
#15 | |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Quote: