MobileRead Forums - View Single Post - reformatting: text with unwanted linebreaks

kiwidude · 12-20-2010, 06:30 PM

Quote:

Originally Posted by tscamera

What I am trying to do:
Merge two lines of code where the first line does not end with . ! ? " etc.
That is, merge lines that are broken in the middle back into one complete sentence.

what i did:
The template line is this
little sentence
using a find/replace with regex:
[a-zA-Z0-9] -will find: s
but the: is the only one, i need to delete.

request 1:
How can I truncate the search result?
please help with the complete regex-formula to find the "" within the primary search result "s"
(grouping, lookahead, lookbehind, atomic group...???)

request2:
Once that is done, how can I get access to the beginning of the second line,
which also needs to be deleted, to join both lines into one?
[a-zA-Z0-9] doesn't help.
Searching for won't help either, because it's not significant enough.

request3:
So, does anybody know whether it's possible to search across two lines of source code?

please help

I've done this a lot to "repair" the results of PDF conversions.

What you want to do is something like this:
Find: ([a-z])\s+
Replace: \1
In the replace expression, it is \1 followed by a single space.

That will find any line ending with a lowercase a-z, strip the paragraph end/beginning that follows it, and put back that same last character plus a single space. Putting the () brackets around the expression in the Find turns it into a group, which you then access in the Replace with \1.
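
For anyone who would rather script this than use an editor's find/replace, here is a minimal sketch of the same expression in Python's re module. The function name join_broken_lines and the sample text are made up for illustration, and it assumes plain text where the break between the two lines is only whitespace; if your source has paragraph markup between the lines, the pattern would need to cover that as well.

```python
import re

def join_broken_lines(text: str) -> str:
    # ([a-z]) captures the last lowercase letter before the break.
    # \s+ consumes the break itself (newlines, spaces, tabs).
    # The replacement writes the captured letter back, then one space.
    return re.sub(r"([a-z])\s+", r"\1 ", text)

sample = "The template line is this\nlittle sentence."
print(join_broken_lines(sample))  # The template line is this little sentence.
```

A line that ends in a full stop is left alone, because "." is not in [a-z]. Note that \s+ also collapses any run of spaces after a lowercase letter into one space.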

You might find in really bad PDF conversions that sometimes a word is split across the paragraph boundary. In that case you don't want the replace expression to have a space, or else the word will end up with a space in it. What I do is manually step through all the matches rather than doing Replace All, so that I can catch any exceptions.
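
If you are working in a script instead of an editor, one rough way to imitate stepping through the matches is to list every candidate break with some context and look them over before replacing anything. This is only a sketch: list_candidates, the context width and the sample text are all hypothetical, and it again assumes plain text with ordinary newlines.

```python
import re

# A lowercase letter, then a line break (with any stray spaces around it).
BREAK = re.compile(r"([a-z])[ \t]*\n\s*")

def list_candidates(text, context=15):
    """Return (offset, snippet) pairs, one per break that follows a lowercase letter."""
    found = []
    for m in BREAK.finditer(text):
        head = text[max(0, m.end(1) - context):m.end(1)]
        tail = text[m.end():m.end() + context].replace("\n", " ")
        # Mark the break with " | " so a word split in two is easy to spot.
        found.append((m.start(), head + " | " + tail))
    return found

sample = "A word split in the mid\ndle of a line.\nNew sentence here\ncontinues."
for offset, snippet in list_candidates(sample):
    print(offset, snippet)
```

In the sample, the first candidate shows "mid | dle", which is a split word that should be joined without a space, while the second is an ordinary mid-sentence break that does want one.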

You may also want to check for other characters like commas and hyphens in that initial ([a-z]). You can also check for paragraphs that start with a lowercase word using similar expressions:
Find: \s+([a-z])
Replace: \1 (a space followed by \1)
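
Both variations can be sketched in Python as well. The names join_after and join_before and the sample text are made up for illustration, and plain text with ordinary newlines is assumed. In the character class the hyphen goes last so that it is read as a literal hyphen and not as a range.

```python
import re

def join_after(text: str) -> str:
    # Join when the first line ends in a lowercase letter, a comma or a hyphen.
    return re.sub(r"([a-z,-])\s+", r"\1 ", text)

def join_before(text: str) -> str:
    # Join when the second line starts with a lowercase letter.
    return re.sub(r"\s+([a-z])", r" \1", text)

sample = "First line ends with a comma,\nsecond line starts lowercase.\nThird."
print(join_after(sample))
print(join_before(sample))
```

Both calls join the first two lines and leave the break before "Third." in place. A line ending in a hyphen is often a word split in two, so those matches are worth checking by hand, since the replacement adds a space after the hyphen.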