07-25-2008, 03:08 PM   #1
kahn10
Junior Member
kahn10 began at the beginning.
 
Posts: 3
Karma: 10
Join Date: May 2007
Tool for removing line breaks in text documents

I wrote this to remove the extra line breaks in books from Gutenberg.org. It works fairly well, processing about 10-20 pages a minute. It is a client-side web app, so its speed depends on your machine.

http://www.allthingscomp.com/breakerbreaker.html

If you want to run it locally or move it to another site, you will also need this file for the statistics feature to work.

http://www.allthingscomp.com/Concurr...ll-20080319.js

If anyone has suggestions or requests please let me know. Also, if there is already a utility that does the same thing but faster, please let me know about that.
07-25-2008, 03:49 PM   #2
NatCh
Gizmologist
NatCh ought to be getting tired of karma fortunes by now.
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
Stingo wrote a macro some time back to do something similar, but it requires MS Word to run, which is a drawback for some folks. No idea how it would compare on speed, though.

Thanks for putting this together and sharing it.
07-25-2008, 04:01 PM   #3
thorswitch
Clueless (but nice!) Newb
thorswitch is less competitive than you.
Posts: 58
Karma: 14701
Join Date: Jun 2008
Device: Kindle Fire HD
Does either of these work to remove the hard page breaks in PDFs? I'm not having any trouble with line breaks, but the page breaks are another matter.
07-25-2008, 04:03 PM   #4
JSWolf
Resident Curmudgeon
JSWolf ought to be getting tired of karma fortunes by now.
Posts: 73,897
Karma: 128597114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
PDFs are an entirely different issue. Converting PDFs is problematic: which problems you get depends on the tool you use to convert them, and even with the same tool, different PDFs can produce different problems. I don't know of any program that will convert PDFs without errors. Even Adobe Acrobat cannot do it error-free.
07-25-2008, 05:12 PM   #5
Gideon
Wearer of Pants
Gideon knows the square root of minus one.
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
I use TextMate on my Mac to remove line breaks from documents. It's built in: just hit a button and it's done in a few seconds. Works great!
07-25-2008, 05:39 PM   #6
DMcCunney
New York Editor
DMcCunney ought to be getting tired of karma fortunes by now.
Posts: 6,384
Karma: 16540415
Join Date: Aug 2007
Device: PalmTX, Pocket eDGe, Alcatel Fierce 4, RCA Viking Pro 10, Nexus 7
Quote:
Originally Posted by kahn10
If anyone has suggestions or requests please let me know. Also, if there is already a utility that does the same thing but faster, please let me know about that.
Locally, I'd probably look at using a sed script.

(Sed is the Unix "Stream Editor", intended for doing scripted edits on files in a pipeline. It supports regular expressions, and is capable of sophisticated operations. It's provided with Unix/Linux/BSD systems, and ports exist for DOS and Windows.)
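
As a rough illustration of the kind of regular-expression edit such a script would perform, here is a minimal sketch in Python rather than sed. It assumes paragraphs are separated by blank lines, as in typical Gutenberg plain-text files. The function name `unwrap_paragraphs` is made up for this example and is not taken from any tool mentioned in this thread.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalize Windows (CRLF) and old Mac (CR) line endings to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline with a non-newline character on both sides is a break in
    # the middle of a paragraph: replace it with a single space.
    # Assumption: blank lines (two newlines in a row) mark paragraph ends.
    return re.sub(r"(?<=[^\n])\n(?=[^\n])", " ", text)
```

Because the pattern requires a character on both sides of the newline, the blank lines between paragraphs and the final newline of the file are left alone.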
______
Dennis

Last edited by DMcCunney; 07-25-2008 at 05:41 PM.
07-26-2008, 04:19 AM   #7
leandroide
Zealot
leandroide has learned how to buy an e-book online
Posts: 144
Karma: 92
Join Date: May 2006
Location: Vigo, Spain
Device: Papyre 6.1 (Hanlin V3)
Quote:
Originally Posted by Gideon
I use TextMate on my Mac to remove line breaks from documents. It's built in: just hit a button and it's done in a few seconds. Works great!
What button?
08-21-2010, 05:59 PM   #8
medved13
Junior Member
medved13 began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Aug 2010
Device: none
Tool for removing line breaks online

texthandler.com is an online service that I use for removing line breaks.
08-22-2010, 07:11 PM   #9
tweety
Enthusiast
tweety ought to be getting tired of karma fortunes by now.
Posts: 34
Karma: 2016606
Join Date: Jun 2008
Device: Kindle Scribe
InterParse4

InterParse4 will remove all CRLFs (see "Remove All CRLFs" below) and a lot more:

From Options Help:

CRLF = Carriage Return and Line Feed, or, ASCII Characters 13 and 10
A proper Blank Line is nothing more than an empty line that contains nothing but a single carriage return character and a single linefeed character. Thus, removing or adding blank lines is doing nothing more than adding or deleting the CRLF character pair at that location.

Remove Blank Lines
Any line with only a CRLF will be deleted. If the line has even one space or tab, it will not be deleted. To clear all undisplayable characters from a blank line, use the Left Justify and Convert Tabs options before running the Remove Blank Lines option.

Remove Extra Blank Lines
All blank lines will be deleted except for those that follow a CRLF. This gives the effect of removing all white space in the document except for the single blank line following a paragraph. Again, if a particular line does not get deleted, it probably has a space or tab embedded.
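
The two blank-line options above can be sketched in Python roughly as follows. This is only an approximation of the behavior the help text describes, not InterParse4's actual code. The function names are made up, and reading "Remove Extra Blank Lines" as "collapse each run of empty lines to one" is an assumption.

```python
def remove_blank_lines(text: str) -> str:
    """Delete every line that holds nothing but its line ending."""
    lines = text.splitlines(keepends=True)
    # A line with even one space or tab is NOT empty and is kept,
    # matching the caveat in the help text.
    return "".join(line for line in lines if line.rstrip("\r\n") != "")


def remove_extra_blank_lines(text: str) -> str:
    """Collapse each run of empty lines down to a single empty line."""
    kept = []
    previous_blank = False
    for line in text.splitlines(keepends=True):
        blank = line.rstrip("\r\n") == ""
        # Drop an empty line only when the line before it was also empty.
        if blank and previous_blank:
            continue
        kept.append(line)
        previous_blank = blank
    return "".join(kept)
```

A line containing a single space breaks up a run of empty lines here, which is the same reason the help text suggests running Left Justify and Convert Tabs first.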

Insert Blank Line After Paragraph
This actually inserts another CRLF after each existing CRLF. If the parsed result is double spaced then you have a document with a CRLF after every visible line. You need to parse the document to get the paragraphs delimited, then use this option. This function is very useful for documents to be read on a handheld reader where only a portion of any paragraph can be seen at one time.

Insert Blank Line Before Any Indention
Sometimes you will get a document in which the only formatting is paragraph indentions. This routine will allow you to inject some proper formatting.

Left Justify
This will remove any jagged edges from the left edge of the text. More importantly, it will convert a line with nothing but spaces to a line with a single CRLF. It will not justify a line with a tab character. Before using this option, you need to have the paragraphs properly delineated with CRLFs.

Trim Trailing Spaces
Some documents have the end of each line padded with spaces out to a fixed right margin. If the Remove CRLF option is used, these spaces show up as huge voids in the text. This option will remove all spaces but one from the right end of each line. Then the lines can be joined with the Remove CRLF function.

Trim All Trailing Spaces
This will remove ALL trailing spaces.
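
A minimal Python sketch of the two trimming options, for anyone who wants the same effect without the program. The function name and the `keep_one` parameter are made up for illustration. It assumes that a line with no trailing spaces is left untouched rather than having a space added.

```python
def trim_trailing_spaces(text: str, keep_one: bool = True) -> str:
    """Strip padding spaces from the right end of every line.

    keep_one=True  mimics "Trim Trailing Spaces" (leave one space).
    keep_one=False mimics "Trim All Trailing Spaces".
    """
    out = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")      # the visible part of the line
        ending = line[len(body):]       # its CRLF / LF, if any
        stripped = body.rstrip(" ")
        # Assumption: keep a single space only if padding was present.
        if keep_one and len(stripped) < len(body):
            stripped += " "
        out.append(stripped + ending)
    return "".join(out)
```

Leaving one space behind means the words do not run together when the lines are later joined.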

Indent Paragraph to "n" Spaces
Assuming that the document has been parsed to having a CRLF after every paragraph, this will add "n" number of spaces to the start of the paragraph, giving a paragraph indention.
The value of "n" spaces is determined by the number in the Space Value box.

Convert Tabs to "n" Spaces
Because tab values are dependent on the reading software, it is easier to parse a document if they are converted to spaces. This will replace each tab character with the number of spaces given in the Space Value box.

Remove All CRLFs (except Paragraph end)
Before using this option, you must have your paragraphs delineated with indentions or blank lines. This will join all intra paragraph lines into a proper paragraph.
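
Here is a hedged Python sketch of this joining step for text whose paragraphs are marked by blank lines or by indentation. It is not InterParse4's implementation. The function name is made up, and treating whitespace-only lines as blank is a simplifying assumption that differs from the program's stricter rule.

```python
def join_paragraph_lines(text: str) -> str:
    """Join the lines inside each paragraph into one long line."""
    paragraphs = []   # finished output lines (paragraphs and blank lines)
    current = []      # pieces of the paragraph being collected

    def flush():
        # Emit the collected paragraph, if any, as a single line.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        if line.strip() == "":
            flush()
            paragraphs.append("")          # keep the blank separator line
        elif line[0] in " \t":
            flush()                        # indented line starts a paragraph
            current.append(line.rstrip())  # keep its indentation
        else:
            current.append(line.strip())   # continuation of the paragraph
    flush()
    return "\n".join(paragraphs)
```

The output uses a plain line feed between paragraphs, so it would need converting back if CRLF endings are required.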

Remove Hyphenated Line End
This is handy for scanned text that uses hyphenated word breaks. The "-" will be removed in preparation for joining the lines by removing the CRLFs.
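
A small Python sketch of the same idea. Note that, unlike the option described above, this version removes the hyphen and the line break in one step so the word is rejoined immediately. The function name is made up, and the rule "letter, hyphen, line break, lowercase letter" is an assumption about what counts as a hyphenated word break.

```python
import re


def remove_line_end_hyphens(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    # Match a hyphen that follows a letter, sits at the very end of a
    # line (LF or CRLF), and is followed by a lowercase letter.
    return re.sub(r"(?<=[A-Za-z])-\r?\n(?=[a-z])", "", text)
```

One caveat: a genuine compound such as "well-known" that happens to break at its hyphen will lose the hyphen, so the result may need a spell-check pass.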

Delete Lines with only Numeric
Occasionally used to remove the page numbers from a scanned document. Note that if you are parsing a Math or Engineering document, you probably do not want to use this function.
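
In Python, a rough equivalent might look like the sketch below. The function name is made up, and it assumes a "numeric" line is one that contains only digits once surrounding whitespace is ignored.

```python
def delete_numeric_lines(text: str) -> str:
    """Drop lines that contain nothing but digits (e.g. page numbers)."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        # str.isdigit() is False for empty strings, so blank lines survive.
        if not line.strip().isdigit()
    ]
    return "".join(kept)
```

Lines such as "Chapter 12" are kept because they contain more than digits.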

Put Blank Line if Indention is exactly "n" Spaces
Any line with exactly the number of preceding spaces as specified in the Space Value box will have a blank line inserted before it.
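
A minimal Python sketch of this rule, assuming indentation is made of spaces only (tabs already converted). The function name and the parameter `n` standing in for the Space Value box are made up for illustration.

```python
def blank_before_indent(text: str, n: int) -> str:
    """Insert a blank line before every line indented by exactly n spaces."""
    out = []
    for line in text.splitlines():
        indent = len(line) - len(line.lstrip(" "))
        # Skip whitespace-only lines, and do not start the output
        # with a blank line.
        if line.strip() and indent == n and out:
            out.append("")
        out.append(line)
    return "\n".join(out)
```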

Put Blank Line if line length is "n" characters less than Average
This is a last-ditch option for text that has no formatting at all, but which has shorter lines at the end of most paragraphs. This routine only works on a document with a CRLF after every visible line. It will find the average length of all the lines and put a blank line after any line that is shorter than the average by Space Value or more. For example, suppose a text has an average line length of 55 characters. Any line shorter than 55 minus Space Value (say, 50 when Space Value is 5) will be considered to be the end of a paragraph. This routine usually either works great or not at all.
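
The heuristic can be sketched in Python as follows. This is an illustration of the idea only, not the program's code. The function name is made up, `space_value` stands in for the Space Value box, and ignoring blank lines when computing the average is an assumption.

```python
def mark_short_line_paragraphs(text: str, space_value: int) -> str:
    """Put a blank line after lines much shorter than the average line."""
    lines = text.splitlines()
    # Assumption: blank lines do not count toward the average.
    lengths = [len(line) for line in lines if line.strip()]
    if not lengths:
        return text
    average = sum(lengths) / len(lengths)
    out = []
    for line in lines:
        out.append(line)
        # A line at least space_value characters shorter than the
        # average is taken to be the last line of a paragraph.
        if line.strip() and len(line) <= average - space_value:
            out.append("")
    return "\n".join(out)
```

As the help text warns, this depends on the text having fairly uniform line lengths. Short lines of dialogue or verse will trigger false paragraph breaks.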

Show CRLFs
This is to allow a visual indication of the distribution of line feeds and carriage returns in a document. It is usually used after a text file refuses to cooperate in being made readable. In Version 4, these line feed and carriage return indicators are not really in the text and you may continue parsing without worrying about removing them. However, if you Save Viewer at this point, they will be inserted into the text file.
08-22-2010, 10:05 PM   #10
DMcCunney
New York Editor
DMcCunney ought to be getting tired of karma fortunes by now.
Posts: 6,384
Karma: 16540415
Join Date: Aug 2007
Device: PalmTX, Pocket eDGe, Alcatel Fierce 4, RCA Viking Pro 10, Nexus 7
Quote:
Originally Posted by tweety
InterParse4 will remove all CRLFs (see "Remove All CRLFs" below) and a lot more:
So it will, but the home page for the program has been offline since 2007, so getting it is a challenge.

I put a copy of the last distribution archive for the program here:
https://sites.google.com/site/texted...nterparse4.zip
______
Dennis