Old 03-10-2007, 03:18 PM   #1
nekokami
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
Remove linefeeds?

I have a number of PDF and other files that have hard linefeeds at the end of each line. When I convert these for my eBookwise reader (through intermediate steps of HTML or RTF), I end up with extra line breaks in the middle of every other line, approximately, which is very annoying.

I guess the best I can do is to get or create a utility to remove linefeeds from the end of any line longer than, say, 70 characters, replacing them with a space. This would occasionally remove linefeeds that I want to keep, but I think overall the files would be more readable than they are now.

Does anyone know of an available utility to do this? I could write it in Perl, but if there's one out there already, I'd just as soon use it. (And it's not quite as simple as just counting the characters, anyway, because formatting characters shouldn't be included in the total.)
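A minimal sketch of the heuristic described above, in Python rather than Perl. It assumes the formatting characters show up as HTML-style tags (as they might in the intermediate HTML step); the names `visible_length` and `unwrap` and the tag pattern are illustrative, not taken from any existing utility.

```python
import re

# Assumption: "formatting characters" are HTML-style tags such as <b> or </i>.
TAG = re.compile(r"<[^>]+>")

def visible_length(line):
    """Length of the line with formatting tags left out of the count."""
    return len(TAG.sub("", line))

def unwrap(text, threshold=70):
    """Replace the linefeed after any line longer than `threshold` with a space."""
    out = []
    for line in text.split("\n"):
        out.append(line)
        # Long lines are assumed to be wrapped mid-paragraph; short ones end a paragraph.
        out.append(" " if visible_length(line) > threshold else "\n")
    return "".join(out[:-1])  # drop the separator added after the final line
```

As noted in the post, this will sometimes join lines that should stay apart (for example, a long last line of a paragraph), so the threshold probably needs tuning per file.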

Thanks,

Last edited by nekokami; 03-10-2007 at 03:24 PM. Reason: taking formatting into account
Old 03-10-2007, 08:10 PM   #2
RWood
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
Try Stingo's Word Macro from the MobileRead Wiki Conversion page. I have used it for RTF files on all sorts of Gutenberg books and other texts. You are right; it does make the results more readable.

While perhaps not as critical on an eBookWise as it is for a Sony Reader, having the end-of-line mark only at the end of a paragraph allows for better text flow as the font size is changed.
Old 03-10-2007, 08:49 PM   #3
nekokami
Thanks, but this word macro only works if there are two paragraph marks at the end of a paragraph. Unfortunately, the files I need to fix don't have this feature. In the original PDF file, there's an indent on the first line of each new paragraph, but this isn't preserved when I convert the file with ABC or PDFtoHTML. Does anyone know of a PDF converter that preserves line indents in some way? I could search and replace ^p^t with ^p^p and then use Stingo's macro (or modify the macro to do this automatically), if I could get that indent to convert to a tab.
Old 03-10-2007, 10:09 PM   #4
ashkulz
Addict
Posts: 350
Karma: 705
Join Date: Dec 2006
Location: Mumbai, India
Device: Kindle 1/REB 1200
nekokami: if you're trying to read PDFs on the eBookWise, why not read the PDF directly? I use a script similar to PDFRasterFarian to convert PDF => images which fit exactly on to the REB 1100 which I have (alex_d also helped me to get image dilation working).

It's currently Linux-specific, and if you're on Ubuntu you need to run:
sudo apt-get install pdftk python-imaging xpdf-utils
The script is at http://puggy.symonds.net/~ashish/downloads/build-pdf.py

You'll probably have to change the script a bit, because the eBookWise has slightly lower resolution and cannot read in landscape. You will probably have to change line 67 from cropped.save to cropped.rotate(-90).save and remove the REB1100 specific stuff at the end.

I was planning on posting a generic script sometime when time permitted....

Last edited by ashkulz; 03-10-2007 at 10:13 PM.
Old 03-11-2007, 08:32 AM   #5
henkvdg
Groupie
Posts: 180
Karma: 66830
Join Date: Oct 2006
Device: IREX iLiad, Pocketbook Pro 903
What I do to remove linefeeds (not perfect, but it helps):

- I extract the text from the PDF (or web page).
- I copy that to Word
- I replace all double LF's with &&
- I replace all LF's with spaces
- I change back all && to double LF's
- I do whatever I want with the text.

I hope this is clear
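The steps listed above can be sketched outside Word as well. This is only an illustration in Python, assuming the extracted text uses a double linefeed between paragraphs and does not already contain the `&&` placeholder; the function name `unwrap_paragraphs` is made up for the example.

```python
def unwrap_paragraphs(text, placeholder="&&"):
    """Keep double linefeeds as paragraph breaks and turn single ones into spaces."""
    text = text.replace("\r\n", "\n")         # normalise Windows line endings first
    text = text.replace("\n\n", placeholder)  # protect the paragraph breaks
    text = text.replace("\n", " ")            # remaining linefeeds are mid-paragraph
    return text.replace(placeholder, "\n\n")  # restore the paragraph breaks
```

For example, `unwrap_paragraphs("one\ntwo\n\nthree\nfour")` gives `"one two\n\nthree four"`. If the text itself might contain `&&`, a different placeholder should be passed in.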

Last edited by henkvdg; 03-11-2007 at 08:35 AM.
Old 03-11-2007, 09:41 AM   #6
nekokami
@ashkulz, thanks, I'll give that a try. I have an Ubuntu system available. (apt-get is a wonderful thing.) How large do the files tend to be via this method?

@henkvdg, I've used this method before, too. Unfortunately, the file I'm working on doesn't have double linefeeds at the ends of paragraphs. The only paragraph indication is an indented first line.
Old 03-11-2007, 10:22 AM   #7
NatCh
Gizmologist
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
You could try searching for the indentation characters, meaning either a tab mark or a set of five spaces (or however many it is), and replacing each with itself preceded by a LF.

i.e. (assuming actual tab chars, which it probably isn't but this is easier to type out) replace "^t" with "^p^t" (where ^t = Tab and ^p = a paragraph mark in Word).
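The same replacement can be sketched on plain text with a regular expression. This is a rough Python illustration, assuming the indent survives extraction as either a tab or exactly five leading spaces; the pattern and the name `mark_paragraphs` are assumptions for the example.

```python
import re

# Assumption: a paragraph's first line starts with a tab or with five spaces.
INDENT = re.compile(r"^(\t| {5})", flags=re.MULTILINE)

def mark_paragraphs(text):
    """Put an extra linefeed in front of every indented line, keeping the indent."""
    return INDENT.sub(r"\n\1", text)
```

After this step each paragraph is preceded by a double linefeed, so a double-linefeed method like the one described earlier in the thread can be applied to the result.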
Old 03-11-2007, 10:33 AM   #8
nekokami
Right. The current methods I have of converting the PDF are not preserving the indents as tabs or spaces. So I need to find a way of converting the PDF such that these indents are preserved, then I can do what you are suggesting.
Old 03-11-2007, 03:37 PM   #9
NatCh
Ahhhh. Well, that does complicate matters.
Old 03-11-2007, 04:07 PM   #10
RWood
And I take it we can safely assume that the books/documents are either too long to re-paragraph by hand or, if each one is short enough, too numerous for the whole lot to be done that way.

I did some PDF conversion a while back with ABBYY Transform and it seemed to keep the paragraph indents as 4 spaces. I don't use it much as I find myself fighting its version of formatting and reserve it almost only for PDFs that need OCR.
Old 03-12-2007, 01:45 AM   #11
ashkulz
Quote:
Originally Posted by nekokami
@ashkulz, thanks, I'll give that a try. I have an Ubuntu system available. (apt-get is a wonderful thing.) How large do the files tend to be via this method?
Well, you can roughly expect the file size to be double or slightly less. That's because I'm using mostly text-based PDFs. If you have lots of graphics, then it should be equal to or smaller than the original file size.
Old 03-12-2007, 12:01 PM   #12
nekokami
This program will export to Word and preserve the indents: http://www.convert-in.com/pdf2word.htm

However, it uses line indenting rather than a tab. Anyone know how to use Word to search for lines with a specific indenting? Then I need to insert some identifiable mark right before each of those lines, then get rid of all the paragraph marks, then replace the special mark with a paragraph mark. I can write a macro to do all that, but only if I can identify the indented lines to start with.

I'm giving ABBYY a try next, though it's pretty expensive if I need to keep using it.

Edit: ABBYY does the job if you pick "Text Flow" rather than "Original Layout." Apparently it's smart enough to figure out that an indent should be treated as a new paragraph, and ignores the other linefeeds. Great. Now I have to decide if that's worth US$99, or if I want to just settle for writing a linefeed removal program.

Last edited by nekokami; 03-12-2007 at 12:09 PM.
Old 03-12-2007, 12:31 PM   #13
nekokami
Victory! ABC Amber PDF Converter will do it, you just need to go into "settings" and click "advanced extraction." Woohoo!
Old 03-12-2007, 12:38 PM   #14
NatCh
"And there was much rejoicing!"
Old 03-12-2007, 03:08 PM   #15
RWood
Quote:
Originally Posted by nekokami
Victory! ABC Amber PDF Converter will do it, you just need to go into "settings" and click "advanced extraction." Woohoo!
I've used the program for so long that I forgot about that. Now that you have jogged my memory I did use that once when I first got the program and then forgot about it. Great find. Thanks, I will go and play with it now.