07-25-2008, 03:08 PM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2007
|
Tool for removing line breaks in text documents
I wrote this to remove all the extra line breaks in books from Gutenberg.org. It works fairly well, processing about 10 to 20 pages a minute. It is a client-side web app, so the speed depends on your machine.
http://www.allthingscomp.com/breakerbreaker.html
If you want to run it locally or move it to another site, you will also need this file for the statistics aspect to work: http://www.allthingscomp.com/Concurr...ll-20080319.js
If anyone has suggestions or requests, please let me know. Also, if there is already a utility that does the same thing but faster, please let me know about that. |
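For anyone who would rather script this than use the web page, here is a minimal Python sketch of the general idea. It is not the code behind the tool above. It assumes the common Gutenberg plain-text layout, where lines are hard-wrapped and paragraphs are separated by at least one blank line. The function name `unwrap_paragraphs` is made up for this example.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, treating blank lines as paragraph breaks.

    Assumption: paragraphs are separated by one or more blank
    (or whitespace-only) lines, as in typical Gutenberg text files.
    """
    # Normalize Windows (CRLF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split the text wherever one or more blank lines occur.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    # Inside each paragraph, collapse every run of whitespace
    # (including the line breaks) into a single space.
    joined = [" ".join(p.split()) for p in paragraphs]
    # Put exactly one blank line between the surviving paragraphs.
    return "\n\n".join(p for p in joined if p)


if __name__ == "__main__":
    sample = (
        "It was the best of times,\r\n"
        "it was the worst of times.\r\n"
        "\r\n"
        "Second paragraph\r\n"
        "continues here.\r\n"
    )
    print(unwrap_paragraphs(sample))
```

Texts that mark paragraphs only by indentation, with no blank lines, would come out as one giant paragraph with this approach, so it is only a starting point.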
07-25-2008, 03:49 PM | #2 |
Gizmologist
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
|
Stingo wrote a macro some time back to do something similar, but it requires MS Word to run, which is a drawback for some folks. No idea how it would compare on speed, though.
Thanks for putting this together and sharing it. |
07-25-2008, 04:01 PM | #3 |
Clueless (but nice!) Newb
Posts: 58
Karma: 14701
Join Date: Jun 2008
Device: Kindle Fire HD
|
Does either of these work to remove the hard page breaks in PDFs? I'm not having any trouble with line breaks, but the page breaks are another matter.
|
07-25-2008, 04:03 PM | #4 |
Resident Curmudgeon
Posts: 75,901
Karma: 134368292
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
PDFs are an entirely different issue. Converting PDF is problematic: which problems you get depends on the tool you use to convert, and even with the same tool you can get different problems with different PDFs. I don't know of any program that will convert PDF without errors. Even Adobe Acrobat cannot do it error-free.
|
07-25-2008, 05:12 PM | #5 |
Wearer of Pants
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
|
I use TextMate on my Mac to remove line breaks from documents. It's built in: just hit a button and it's done in a few seconds. Works great!
|
07-25-2008, 05:39 PM | #6 | |
New York Editor
Posts: 6,384
Karma: 16540415
Join Date: Aug 2007
Device: PalmTX, Pocket eDGe, Alcatel Fierce 4, RCA Viking Pro 10, Nexus 7
|
Quote:
(Sed is the Unix "Stream Editor", intended for doing scripted edits on files in a pipeline. It supports regular expressions, and is capable of sophisticated operations. It's provided with Unix/Linux/BSD systems, and ports exist for DOS and Windows.) ______ Dennis Last edited by DMcCunney; 07-25-2008 at 05:41 PM. |
|
07-26-2008, 04:19 AM | #7 |
Zealot
Posts: 144
Karma: 92
Join Date: May 2006
Location: Vigo, Spain
Device: Papyre 6.1 (Hanlin V3)
|
|
08-21-2010, 05:59 PM | #8 |
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2010
Device: none
|
Tool for removing line breaks online
texthandler.com - an online service that I use to remove line breaks
|
08-22-2010, 07:11 PM | #9 |
Enthusiast
Posts: 34
Karma: 2016606
Join Date: Jun 2008
Device: Kindle Scribe
|
InterParse4
InterParse4 will remove all CRLFs (see "Remove All CRLFs" later in this post) and do a lot more:
From Options Help:

CRLF = Carriage Return and Line Feed, or ASCII characters 13 and 10.

A proper blank line is nothing more than an empty line that contains nothing but a single carriage return character and a single linefeed character. Thus, removing or adding blank lines is doing nothing more than adding or deleting the CRLF character pair at that location.

Remove Blank Lines
Any line with only a CRLF will be deleted. If the line has even one space or tab, it will not be deleted. To clear all undisplayable characters from a blank line, use the Left Justify and the Convert Tabs options before running the Remove Blank Lines option.

Remove Extra Blank Lines
All blank lines will be deleted except for those that follow a CRLF. This gives the effect of removing all white space in the document except for the single line following a paragraph. Again, if a particular line does not get deleted, it probably has a space or tab embedded.

Insert Blank Line After Paragraph
This actually inserts another CRLF after each existing CRLF. If the parsed result is double spaced, then you have a document with a CRLF after every visible line. You need to parse the document to get the paragraphs delimited, then use this option. This function is very useful for documents to be read on a handheld reader, where only a portion of any paragraph can be seen at one time.

Insert Blank Line Before Any Indention
Sometimes you will get a document in which the only formatting is paragraph indentions. This routine will allow you to inject some proper formatting.

Left Justify
This will remove any jagged edges from the left edge of the text. More importantly, it will convert a line with nothing but spaces to a line with a single CRLF. It will not justify a line with a tab character. Before using this option, you need to have the paragraphs properly delineated with CRLFs.

Trim Trailing Spaces
Some documents have the end of each line padded with spaces out to a fixed right margin. If the Remove CRLF option is used, these spaces show up as huge voids in the text. This option will remove all spaces but one to the right of each line. Then the lines can be joined with the Remove CRLF function.

Trim All Trailing Spaces
This will remove ALL trailing spaces.

Indent Paragraph to "n" Spaces
Assuming that the document has been parsed to have a CRLF after every paragraph, this will add "n" spaces to the start of the paragraph, giving a paragraph indention. The value of "n" is determined by the number in the Space Value box.

Convert Tabs to "n" Spaces
Because tab values are dependent on the reading software, it is easier to parse a document if they are converted to spaces. This will replace each tab character with spaces, using the number in the Space Value box.

Remove All CRLFs (except Paragraph end)
Before using this option, you must have your paragraphs delineated with indentions or blank lines. This will join all intra-paragraph lines into a proper paragraph.

Remove Hyphenated Line End
This is handy for scanned text that uses hyphenated word breaks. The "-" will be removed in preparation for joining the lines by removing the CRLFs.

Delete Lines with only Numeric
Occasionally used to remove the page numbers from a scanned document. Note that if you are parsing a Math or Engineering document, you probably do not want to use this function.

Put Blank Line if Indention is exactly "n" Spaces
Any line with exactly the number of preceding spaces specified in the Space Value box will have a blank line inserted before it.

Put Blank Line if line length is "n" characters less than Average
This is a last-ditch option for text that has no formatting at all, but which has shorter lines at the end of most paragraphs. This routine only works on a document with a CRLF after every visible line. It will find the average length of all the lines and put a blank line after any line that is shorter than the average by Space Value or more. For example, if a text has an average line length of 55 characters, any line shorter than 55 minus Space Value (say, 50) will be considered to be the end of the paragraph. This routine usually either works great or not at all.

Show CRLFs
This is to allow a visual indication of the distribution of line feeds and carriage returns in a document. It is usually used after a text file refuses to cooperate in being made readable. In Version 4, these line feed and carriage return indicators are not really in the text, and you may continue parsing without worrying about removing them. However, if you Save Viewer at this point, they will be inserted into the text file. |
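The "line length less than Average" heuristic from the help text is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as the help describes it, not InterParse4's actual code. The function name `mark_short_line_paragraphs` and the default `space_value` of 5 are assumptions made for the example, and a line counts as "short" here when it is at least `space_value` characters below the average.

```python
def mark_short_line_paragraphs(text: str, space_value: int = 5) -> str:
    """Insert a blank line after every line that is much shorter than average.

    Assumption: the input has a line break after every visible line and
    no other paragraph formatting, as the help text requires.
    """
    # Drop trailing spaces so padding does not distort the lengths.
    lines = [ln.rstrip() for ln in text.splitlines()]
    non_blank = [ln for ln in lines if ln]
    if not non_blank:
        return text
    # Average length over the visible (non-blank) lines only.
    average = sum(len(ln) for ln in non_blank) / len(non_blank)
    out = []
    for ln in lines:
        out.append(ln)
        # A short line is taken to be the last line of a paragraph.
        if ln and len(ln) <= average - space_value:
            out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    demo = "\n".join(["a" * 60, "b" * 60, "c" * 20, "d" * 60, "e" * 60, "f" * 30])
    print(mark_short_line_paragraphs(demo))
```

As the help text warns, this tends to either work great or not at all: dialogue-heavy text with many short lines will get far too many paragraph breaks.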
08-22-2010, 10:05 PM | #10 |
New York Editor
Posts: 6,384
Karma: 16540415
Join Date: Aug 2007
Device: PalmTX, Pocket eDGe, Alcatel Fierce 4, RCA Viking Pro 10, Nexus 7
|
So it will, but the home page for the program has been offline since 2007, so getting it is a challenge.
I put a copy of the last distribution archive for the program here: https://sites.google.com/site/texted...nterparse4.zip ______ Dennis |
|