Removing Returns, Preserving Paragraphs - Page 2

Argel · 06-02-2009, 05:09 PM

I've had to do this many hundreds of times over the years and never found it necessary to use specialist tools. The sequence with any editor is basically as ahi describes above.

First identify whatever character(s) mark a paragraph end and globally replace them with a unique marker - '|' will do the job but so will "&&&&" or any character combination that doesn't otherwise occur in the text.

You can now remove all the remaining return/newline characters BUT FIRST globally replace them with the same character plus a space. This is because some lines may end [space][newline] and others may not have the space. If you simply strip out the return/newline the last/first words will run together and if you don't notice immediately you're in a mess.

Now repeatedly search globally for [return/newline][two spaces] and replace with [return/newline][one space]. When you no longer find the target you know that every line now terminates with [return/newline][space], so just replace that character combination with [space]. Of course, as Jellby points out, if you know how to search for a return/newline with an arbitrary number of whitespace characters, you can do this in one operation.

You now have one impenetrable text block but all you have to do is globally replace your original marker, e.g. &&&&, with a paragraph return and hey presto.
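For anyone who would rather script it, here is a minimal Python sketch of the sequence above. It uses the one-operation shortcut (a newline with any amount of surrounding spaces or tabs becomes a single space). The function name `unwrap`, the `para_break` parameter and the assumption that paragraphs end in a blank line are illustrative only; swap in whatever actually marks a paragraph end in your file.

```python
import re

MARKER = "&&&&"  # assumed not to occur anywhere in the text


def unwrap(text, para_break="\n\n"):
    """Join hard-wrapped lines while keeping paragraph breaks.

    para_break is whatever marks a paragraph end in this file
    (a blank line is only an assumption).
    """
    if MARKER in text:
        raise ValueError("marker already occurs in the text; pick another")
    # Normalise line endings so only "\n" is left to deal with.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 1. Protect the paragraph ends with the unique marker.
    text = text.replace(para_break, MARKER)
    # 2. Turn every remaining newline, plus any spaces or tabs around it,
    #    into exactly one space so words never run together.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # 3. Put the paragraph breaks back.
    return text.replace(MARKER, "\n\n").strip()


sample = "The quick brown \nfox jumps\nover.\n\nSecond para\n  continues here."
print(unwrap(sample))
```

The marker check at the top is there because the whole trick falls apart if the marker already appears in the manuscript.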

Sabardeyn · 06-03-2009, 02:05 AM

Quote:

Originally Posted by Argel

I've had to do this many hundreds of times over the years

Wouldn't it have been substantially easier if you just wrote it once and ran it as needed?

Gideon,
Most Search & Replace functions are a limited form of regular expressions (regex). What both Jellby and Argel have just posted are forms of regex. Granted, Jellby went hardcore and Argel gave more generalized info, but they're still regex.

I mentioned RegexBuddy because it's a decent way to learn. It allows you to create a formula and save it in a library for future use. It also explains what the formula is doing in English, a feature whose value cannot be stressed enough - try re-reading Jellby's formulas again. Complex regex is not easy to understand. (Try understanding the expression halfway down this post if you think I'm kidding...)

You could create a Search & Replace (that is, regex) expression for every change needed in this file, and then save them all individually. You could start a new regex formula, load every one of the expressions you just made into one humongous formula, find the correct "stacking order" so they're all processed correctly, and issue one command to fix the whole file. This humongous formula could be saved as well. And, if you ever need it again, you could load, run and be done almost instantly (at least for the "hands on" portion of the work, conversion would take a bit of time, of course).
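As a rough illustration of the "stacking order" idea, here is a small Python sketch that keeps the individual expressions in an ordered list and applies them in series. The rule set, the `&&&&` marker and the name `apply_rules` are made up for the example, and the rules assume a blank line marks a paragraph end; a real manuscript will need its own list.

```python
import re

# Hypothetical rule set. Order matters: each rule sees the output
# of the one before it.
RULES = [
    (r"\r\n?", "\n"),      # normalise line endings
    (r"[ \t]+\n", "\n"),   # strip trailing whitespace from every line
    (r"\n{2,}", "&&&&"),   # protect paragraph breaks (blank lines assumed)
    (r"\n[ \t]*", " "),    # join the remaining hard-wrapped lines
    (r"&&&&", "\n\n"),     # restore the paragraph breaks
]


def apply_rules(text, rules=RULES):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_rules("one \r\ntwo\r\n\r\nthree\nfour"))
```

Saving the list in a file is the "write it once" part; reordering or dropping a rule is how you would adapt it to a file that misbehaves.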

You do not have to use any of the software that has been mentioned. You can, as others have mentioned, very easily use almost any existing software that you are comfortable with - providing it can perform the necessary tasks.

Argel · 06-03-2009, 05:39 AM

Quote:

Originally Posted by Sabardeyn

Wouldn't it have been substantially easier if you just wrote it once and ran it as needed?

No, not really, since manuscripts are so idiosyncratic that special tools often fail - as Gideon found.

I have some useful Word macros but it's often quicker to do this kind of stuff manually. There are two kinds of people in life - those who spend hours fiddling around with quicker ways to do stuff and those who just do stuff.

HarryT · 06-03-2009, 12:19 PM

Quote:

Originally Posted by Argel

No, not really, since manuscripts are so idiosyncratic that special tools often fail - as Gideon found.

The reason that "textify" failed in Gideon's case is that his file has NO indication of "end of paragraph" - there is a carriage return at the end of every line, and no special indication of the start of a paragraph. There is literally nothing for any tool to "pick up on". It's a horribly-formatted file.

Textify works beautifully on the overwhelming majority of files. I must have used it on literally hundreds of files, with great success. It also has the useful feature of being able to create HTML output, with _ _ replaced with italics.

ahi · 06-03-2009, 12:22 PM

Quote:

Originally Posted by HarryT

The reason that "textify" failed in Gideon's case is that his file has NO indication of "end of paragraph" - there is a carriage return at the end of every line, and no special indication of the start of a paragraph. There is literally nothing for any tool to "pick up on". It's a horribly-formatted file.

Textify works beautifully on the overwhelming majority of files. I must have used it on literally hundreds of files, with great success. It also has the useful feature of being able to create HTML output, with _ _ replaced with italics.

... how does Gideon know where the paragraph breaks are?

If he knows where they are, the same way he can ascertain it, so ought a computer program be able to... I think?

- Ahi

HarryT · 06-03-2009, 12:27 PM

Quote:

Originally Posted by ahi

... how does Gideon know where the paragraph breaks are?

If he knows where they are, the same way he can ascertain it, so ought a computer program be able to... I think?

- Ahi

Because people can easily do "pattern recognition" tasks which are extremely difficult for a computer.

You could say "If a line starts in a capital letter then it's probably a new paragraph", I suppose. It wouldn't be 100% reliable, but it would be a good start.

DaleDe · 06-03-2009, 12:34 PM

Quote:

Originally Posted by HarryT

Because people can easily do "pattern recognition" tasks which are extremely difficult for a computer.

You could say "If a line starts in a capital letter then it's probably a new paragraph", I suppose. It wouldn't be 100% reliable, but it would be a good start.

That the previous line was shorter than usual would also be a clue. Unfortunately the length of the line has become a rather poor indicator, because the original often assumed mono-spaced fonts and currently this is almost never the case. But if you assume it was mono-spaced you can count characters and determine whether the next word would have fitted. It is certainly beyond simple regular expressions.
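A rough Python sketch of the two clues combined, assuming the text was wrapped for a mono-spaced font: a line starts a new paragraph if it begins with a capital letter and its first word would still have fitted on the previous line. The name `guess_paragraphs` and the choice of the longest line as the wrap width are assumptions for the example; it will be fooled by paragraphs that open with a quote mark or a digit, so expect to check the result by eye.

```python
def guess_paragraphs(lines, width=None):
    """Guess paragraph breaks from line length and capitalisation."""
    lines = [ln.rstrip() for ln in lines if ln.strip()]
    if not lines:
        return []
    if width is None:
        # Assumption: the longest line shows the original wrap width.
        width = max(len(ln) for ln in lines)
    paragraphs = [[lines[0]]]
    for prev, cur in zip(lines, lines[1:]):
        first_word = cur.split()[0]
        # Would the next word (plus a space) have fitted on the previous line?
        would_have_fitted = len(prev) + 1 + len(first_word) <= width
        starts_upper = cur.lstrip()[0].isupper()
        if would_have_fitted and starts_upper:
            paragraphs.append([cur])       # probably a new paragraph
        else:
            paragraphs[-1].append(cur)     # probably just a wrapped line
    return [" ".join(p) for p in paragraphs]


wrapped = [
    "It was a dark and stormy night and",
    "the rain fell in torrents.",
    "Nobody was about at that hour of",
    "the night.",
]
for para in guess_paragraphs(wrapped):
    print(para)
```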

Dale

ahi · 06-03-2009, 12:43 PM

Quote:

Originally Posted by HarryT

Because people can easily do "pattern recognition" tasks which are extremely difficult for a computer.

You could say "If a line starts in a capital letter then it's probably a new paragraph", I suppose. It wouldn't be 100% reliable, but it would be a good start.

Speaking in general terms, you are right. This ought very rarely be the case with plaintext though.

Are you guessing, by the way, or have you seen the file? I myself would not claim to be able to tell where the paragraph breaks are if my only indicators were lines starting with capitals... an indicator that is, by the way, trivial to identify and process via a script.

Same question to DaleDe: are you guessing, or is what you are saying specifically the issue with Gideon's file?

- Ahi

ahi · 06-03-2009, 01:12 PM

Assuming that the paragraph break information in Gideon's file is not impossible to accurately retrieve, throwing together a python script that works thusly might help:

In sequentially parsing the text file, build a list that consists of (1) non-whitespace character sequence strings, and (2) numbers indicating consecutive whitespace weights. As explained in my "text processing ideas" post, the weighting ought to assign a value of 1 for each space character, and 1000 for each linebreak (which can be chr(10), chr(13), or the two together [which should still be counted as a single linebreak]). (Tabs, I suppose, could be counted as having a weight of 4 or 8 or even larger.)

1 linebreak + 5 spaces = 1005
2 linebreaks + 0 spaces = 2000
1 space + 2 linebreaks + 5 spaces = 2006

Once this is done, you basically have a list that you could process sequentially to recreate the input file (save for the whitespaces).

At that time, take all the whitespace weights and put them in a separate list, and get the mode of that list. It will be the weight of whitespace used to separate words.

Remove from the list of whitespace weights all instances of the mode weight, and take the mode of what remains. It will be the weight of whitespace used to separate lines.

Then remove from the list of whitespace weights all instances of the mode weight, and take the mode of what again remains. It will be the weight of whitespace used to separate paragraphs.

Output the list, replacing the whitespace weights with the appropriate characters (space for word spacing, space for line spacing, linebreak for paragraph spacing).
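A quick Python sketch of the steps above, to make the idea concrete. The names `weight` and `reflow`, the tab weight of 4, and the choice to turn any whitespace run that matches none of the three modes into a plain space are assumptions for the example, not a finished script.

```python
import re
from collections import Counter


def weight(ws):
    """1 per space, 1000 per linebreak (CR, LF or CRLF), 4 per tab (assumed)."""
    ws = ws.replace("\r\n", "\n").replace("\r", "\n")
    return 1000 * ws.count("\n") + ws.count(" ") + 4 * ws.count("\t")


def reflow(text):
    # Split into word, whitespace, word, ... keeping the whitespace runs.
    tokens = re.split(r"([ \t\r\n]+)", text.strip())
    weights = [weight(t) for t in tokens[1::2]]

    # Take the mode three times, removing each one before the next pass:
    # word spacing, then line spacing, then paragraph spacing.
    remaining = Counter(weights)
    modes = []
    for _ in range(3):
        if not remaining:
            break
        mode = remaining.most_common(1)[0][0]
        modes.append(mode)
        del remaining[mode]
    para_weight = modes[2] if len(modes) == 3 else None

    out = []
    for i, tok in enumerate(tokens):
        if i % 2 == 0:
            out.append(tok)      # a word: copy it through
        elif weight(tok) == para_weight:
            out.append("\n")     # paragraph spacing becomes a linebreak
        else:
            out.append(" ")      # word or line spacing becomes a space
    return "".join(out)


assert weight("\n     ") == 1005
assert weight("\n\n") == 2000
assert weight(" \n\n     ") == 2006
```

Note that the third mode is only trustworthy when paragraph breaks really are the third most common kind of whitespace; in a short or inconsistently spaced file it may pick up something else.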

Of course, such an approach won't work if the text file is as my fellow forum members suggest it to be. But if they are guessing or mistaken, I might throw this script together this evening and post it...

- Ahi

DaleDe · 06-03-2009, 02:31 PM

Quote:

Originally Posted by ahi

Speaking in general terms, you are right. This ought very rarely be the case with plaintext though.

Are you guessing, by the way, or have you seen the file? I myself would not claim to be able to tell where the paragraph breaks are if my only indicators were lines starting with capitals... an indicator that is, by the way, trivial to identify and process via a script.

Same question to DaleDe: are you guessing, or is what you are saying specifically the issue with Gideon's file?

- Ahi

My comment was an add-on to Harry's to provide a little more information. He has the file. In general this is an issue that I have also faced from time to time.

Dale

Sabardeyn · 06-03-2009, 03:01 PM

Argel,

In the vast majority of cases files of any kind will conform to their type. Text, python, HTML, C++, whatever, will use certain patterns. Once you start to understand the pattern, you can make use of it in any regex process. You know this - it is exactly what you suggested in your search & replace example.

However there will always be cases where a particular user, file or piece of software does not follow the standard pattern for some reason. In those cases you can still run regex, but on a more limited basis and with greater oversight. So, in my example, I would run the "humongous regex" on it first to see what happened. If the result was a really garbled mess, I would revert to the original file and apply each individual expression and look the file over.

Chances are it will become more uniform in general although several errors will creep in because of the regex. (No automated process is infallible.) Since this is something Gideon would read anyway, he can note any remaining errors as he goes. Either he can correct these one at a time or create a new regex to handle it (correcting similar errors that exist in later portions of the text).

This is what Ahi, Dale and Harry are talking about right now. The file in question sounds as though it's one that does not conform to a known and recognized pattern. So you either have to customize any regex or perform a substantial amount of the corrections manually.

Without seeing the file none of us are capable of providing assistance on this matter. Guessing only gets us so far.

rogue_ronin · 06-03-2009, 10:20 PM

If the OP is still following this, the app I linked to in my previous post is basically a tool to apply various preconfigured regex in series. It then displays the results. If you like it, continue to apply various regex, etc, until satisfied, then export the result to any file you like, including the original.

It's got some patterns that search for the things being discussed in the thread.

One of the things that you can search for, and may not be obvious to someone just looking at the text, is a space preceding a hard-return as a marker for end-of-paragraph. I'm surprised how frequently that has turned up.
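To show what that looks like in practice, here is a short Python sketch that treats a space before a hard return as the end-of-paragraph marker and joins every other line break. The function name and the `&&&&` marker are made up for the example, and it assumes the trailing space really is used consistently in the file.

```python
import re


def unwrap_trailing_space(text):
    """Treat [space][newline] as a paragraph end, plain [newline] as a wrap."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Protect the real paragraph ends (space before the hard return).
    text = re.sub(r" +\n", "&&&&", text)
    # Every newline that is left is only a soft wrap.
    text = text.replace("\n", " ")
    return text.replace("&&&&", "\n")


sample = "First line\nof para one. \nSecond para\nhere. \n"
print(unwrap_trailing_space(sample))
```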

Luck,

m a r

Gideon · 06-04-2009, 12:17 AM

I actually use a mac, but can dip into windows as needed (reluctantly.)

Me and regular expressions have had some run-ins before, and I really don't care enough to spend the time to learn how to do them. And this file, as Harry said... it's a mess.

Outside of working out a system where a shorter line precedes a new paragraph, you don't have a lot to work with.

In cases of double carriage returns it is certainly a doable thing to fix the file (I usually use TextMate) but this one is not so fortunate.

HarryT · 06-04-2009, 02:38 AM

Quote:

Originally Posted by Sabardeyn

Without seeing the file none of us are capable of providing assistance on this matter. Guessing only gets us so far.

I have seen the file. As Gideon says, it's a mess, and I don't think that any of the "usual tools" would be able to do anything with it.

rogue_ronin · 06-04-2009, 10:45 AM

Well, you can always try to find another source... Mr. Google and the Keyword "torrent" are a good match.

m a r

