Well, one of the first things I do is to convert multi-line documents to single-line paragraphs. Then most of your problems go away. That's why the Buddha invented word-wrap.
I usually do it with a macro, because my text editor has a line-join function. But that shouldn't be too hard to code as regex, let's see...
I'll assume a text file, and I'll assume two or more carriage returns mark the end of a paragraph. If you don't have that, just replace the carriage returns with whatever marks the end of your paragraphs. (If you don't have anything -- no tab, or space, nothing -- there's very little you can do, you're working with a text blob.)
Thinking about it, it's probably safer/easier/smarter to do it in three runs:
==============================
Format: PCRE
What it does: Joins line-broken paragraphs into single lines.
Best used on: Text.
Run #1
Find: \n\n+
Replace: |PARAGRAPH|
Run #2
Find: \n
Replace: (one literal space -- \s only works in the Find box, not in Replace)
Run #3
Find: \|PARAGRAPH\|
Replace: \n\n
Translation: Run #1 finds all sequences of two or more hard returns and replaces them with the marker "|PARAGRAPH|".
Run #2 finds all remaining returns and replaces them with a space. You need a space, as otherwise you will merge two words together. If you have multiple spaces, that is easily corrected -- spell-checking several hundred mis-combined words is not.
Run #3 finds all the "|PARAGRAPH|" markers and replaces them with two hard returns.
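If you would rather script it than run three searches by hand, here is a rough Python sketch of the same three runs. It assumes plain \n line endings and that the marker text never occurs in your document; the function name join_paragraphs is made up for illustration. Note that the pipes get escaped in the last step, since a bare | means "or" in a regex.

```python
import re

MARKER = "|PARAGRAPH|"  # placeholder from the recipe; assumed absent from the text


def join_paragraphs(text):
    """Join line-broken paragraphs (hypothetical helper, not from any library)."""
    # Run #1: two or more hard returns become the paragraph marker.
    text = re.sub(r"\n\n+", MARKER, text)
    # Run #2: every remaining hard return becomes a single literal space.
    text = re.sub(r"\n", " ", text)
    # Run #3: markers go back to two hard returns. re.escape() escapes the
    # pipes, which would otherwise be read as regex alternation.
    return re.sub(re.escape(MARKER), "\n\n", text)


sample = "The quick\nbrown fox.\n\nSecond\nparagraph here.\n\n\nThird."
print(join_paragraphs(sample))
```

Three or more returns in a row collapse to a plain paragraph break here, which is exactly the behaviour the recipe describes.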
Variants/Comments: On some systems, like mine, a hard return is \r\n (carriage return, line feed) rather than \n. If that's you, use \r\n wherever the recipe says \n.
If you have 3 or 4 hard returns marking chapters or sections, etc., like with Gutenberg texts, mark those first with their own marker. Otherwise Run #1 will collapse the 3 or 4 returns into an ordinary paragraph marker and you lose the section breaks.
If you want to do this with HTML, and your paragraphs are already marked with </p>, just do Run #2 on selected text, and skip both #1 and #3. (Be careful: #2 replaces every hard return with a space. If you have HTML with indenting and returns, and you change the whole document, it will ruin the layout of the source. It will still work, but it will be unreadable to edit.)
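A rough Python sketch of the first two variants, in case it helps. The marker "|SECTION|", the function name, and the choice to restore a section break as three hard returns are all my own assumptions here, not part of the recipe; adjust to whatever your text actually uses.

```python
import re

SECTION = "|SECTION|"      # hypothetical marker for chapter/section breaks
PARAGRAPH = "|PARAGRAPH|"  # paragraph marker; both assumed absent from the text


def join_lines_keep_sections(text):
    """Join line-broken paragraphs while keeping section breaks (illustrative)."""
    # Normalize \r\n line endings to \n so one set of patterns covers both.
    text = text.replace("\r\n", "\n")
    # Mark section breaks first: three or more hard returns in a row.
    text = re.sub(r"\n{3,}", SECTION, text)
    # Only runs of exactly two returns are left; those are paragraph breaks.
    text = re.sub(r"\n\n", PARAGRAPH, text)
    # Every remaining single return is a mid-paragraph line break.
    text = re.sub(r"\n", " ", text)
    # Restore the markers, escaping the pipes so they are matched literally.
    text = re.sub(re.escape(SECTION), "\n\n\n", text)
    return re.sub(re.escape(PARAGRAPH), "\n\n", text)


sample = "Chapter 1\r\n\r\n\r\nFirst\r\nline.\r\n\r\nSecond."
print(repr(join_lines_keep_sections(sample)))
```

The output uses plain \n throughout; if you need \r\n back, convert it as a last step.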
==================================
@kurochka: you must have some things that you do repeatedly. Some things that you can generalize.
More later, like generic chapter headings.
m a r
Last edited by rogue_ronin; 05-21-2009 at 12:45 AM.
Reason: Caution about destroying HTML formatting...