Old 05-20-2009, 11:10 PM   #12
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Well, one of the first things I do is to convert hard-wrapped, multi-line paragraphs to single-line paragraphs. Then most of your problems go away. That's why the Buddha invented word-wrap.

I usually do it with a macro, because my text editor has a line-join function. But that shouldn't be too hard to code as a regex. Let's see...

I'll assume a text file, and I'll assume two or more hard returns mark the end of a paragraph. If you don't have that, just replace the hard returns in the Run #1 pattern with whatever marks the end of your paragraphs. (If you don't have anything -- no tab, or space, nothing -- there's very little you can do: you're working with a text blob.)

Thinking about it, it's probably safer/easier/smarter to do it in three runs:

==============================
Format: PCRE
What it does: Joins line-broken paragraphs.
Best used on: Text.

Run #1
Find: \n\n+
Replace: |PARAGRAPH|

Run #2
Find: \n
Replace: (a single space character; \s is only valid in the Find pattern, not in the replacement)

Run #3
Find: \|PARAGRAPH\|   (escape the pipes; a bare | means "or" in a regex)
Replace: \n\n

Translation: Run #1 finds all sequences of multiple hard returns (2 or more) and replaces each sequence with the marker "|PARAGRAPH|".

Run #2 finds all remaining returns and replaces them with a space. You need a space, as otherwise you will merge two words together. If you have multiple spaces, that is easily corrected -- spell-checking several hundred mis-combined words is not.

Run #3 finds all the "|PARAGRAPH|" markers and replaces them with two hard returns.
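
For anyone who would rather script it than run three passes in an editor, here is a minimal sketch of the three runs in Python. It assumes \n line endings and that the string "|PARAGRAPH|" never occurs in the text itself; the function name join_paragraph_lines is made up for illustration.

```python
import re

# Marker from the recipe; assumed not to appear anywhere in the real text.
MARKER = "|PARAGRAPH|"


def join_paragraph_lines(text: str) -> str:
    """Join hard-wrapped lines, keeping blank-line paragraph breaks."""
    # Run #1: two or more hard returns become the paragraph marker.
    text = re.sub(r"\n\n+", MARKER, text)
    # Run #2: every remaining hard return becomes a single space.
    text = re.sub(r"\n", " ", text)
    # Run #3: the marker becomes two hard returns again.
    # re.escape() escapes the pipes so they are matched literally.
    return re.sub(re.escape(MARKER), "\n\n", text)


sample = "First line\nof a paragraph.\n\nSecond\nparagraph.\n\n\nThird."
print(join_paragraph_lines(sample))
```

Note that a lone hard return at the very end of the file also turns into a space, so you may want to trim the result.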

Variants/Comments: On some systems, like mine, a hard return is \r\n (carriage return, line feed) rather than \n, so use \r\n wherever the recipe has \n.

If you have 3 or 4 hard returns marking chapters or sections, as in Gutenberg texts, mark those first with their own marker. Otherwise Run #1 will collapse the 3 or 4 returns into an ordinary paragraph marker.

If you want to do this with HTML, and your paragraphs are already marked with </p>, just do Run #2 on selected text, and skip both #1 and #3. (Be careful: #2 replaces every hard return with a space. If your HTML uses indenting and returns and you change the whole document, it will ruin the layout. The HTML will still work, but it will be unreadable to edit.)
==================================
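
A rough sketch of the variants, again in Python and only under some assumptions: \r\n is normalized to \n first, 3 or more hard returns count as a section break, and section breaks are restored as exactly three returns (an arbitrary choice; the original count is not remembered). The "|SECTION|" marker and the function name are hypothetical, not part of the recipe above.

```python
import re

PARA = "|PARAGRAPH|"   # marker from the recipe
SECTION = "|SECTION|"  # hypothetical extra marker for chapter/section breaks


def join_lines_keep_sections(text: str) -> str:
    """Join hard-wrapped lines, keeping paragraph and section breaks apart."""
    # Normalize CR+LF line endings so the rest only has to deal with \n.
    text = text.replace("\r\n", "\n")
    # Mark section breaks first (3 or more returns), before Run #1 can
    # collapse them into ordinary paragraph markers.
    text = re.sub(r"\n{3,}", SECTION, text)
    # Run #1: whatever double returns are left are paragraph breaks.
    text = re.sub(r"\n\n", PARA, text)
    # Run #2: remaining single returns become spaces.
    text = re.sub(r"\n", " ", text)
    # Run #3: restore both kinds of break (pipes escaped via re.escape).
    text = re.sub(re.escape(PARA), "\n\n", text)
    return re.sub(re.escape(SECTION), "\n\n\n", text)


sample = "CHAPTER 1\n\n\nIt was\r\na dark night.\r\n\r\nThe end."
print(repr(join_lines_keep_sections(sample)))
```

If your section breaks use a different number of returns, adjust the {3,} and the restored string to match.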

@kurochka: you must have some things that you do repeatedly. Some things that you can generalize.

More later, like generic chapter headings.

m a r

Last edited by rogue_ronin; 05-21-2009 at 12:45 AM. Reason: Caution about destroying HTML formatting...