Old 12-20-2010, 05:30 PM   #2
kiwidude
Calibre Plugins Developer
Quote:
Originally Posted by tscamera
What I am trying to do:
Merge two lines of code where the first line does not end with . ! ? " etc.
That is, merge lines that are broken in the middle back into one complete sentence.

What I did:
<p class="calibre2">The template line is this</p>
<p class="calibre2">little sentence</p>
Using find/replace with the regex:
[a-zA-Z0-9]</p> (this finds: s</p>)
But the </p> is the only part I need to delete.

Request 1:
How can I truncate the search result?
Please help with the complete regex to find the "</p>" within the primary search result "s</p>"
(grouping, lookahead, lookbehind, atomic group...???)

Request 2:
Once that is done, how can I get at the <p class="calibre2"> at the beginning of the second line,
which also needs to be deleted to join both lines into one?
[a-zA-Z0-9]</p> <p class="calibre2"> doesn't help.
Searching for <p class="calibre2"> won't help either, because it is not specific enough.

Request 3:
So, does anybody know whether it is possible to search across two lines of source code?

Please help.
I've done this a lot to "repair" the results of PDF conversions.

What you want to do is something like this:
Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1
In the replace expression, it is \1 followed by a single space.

That will find any paragraph ending with a lowercase a-z, strip the paragraph end and the following paragraph start, and replace them with that same last character followed by a space. Putting the () brackets around part of the Find expression makes it a group, which you then access in the replace with \1. The \s+ matches any run of whitespace, including the line break, which is how the search spans two lines of source.
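
To try the expression outside the calibre editor, here is a minimal sketch in Python, whose re module accepts the same pattern and the same \1 group reference. The helper name join_broken_paragraphs is made up for this illustration, and the sample text is the one from the question; this is not a description of what calibre does internally.

```python
import re

# Pattern from the post: a lowercase letter, the paragraph end, whitespace
# (which includes the line break), then the next paragraph start.
BROKEN = re.compile(r'([a-z])</p>\s+<p class="calibre2">')

def join_broken_paragraphs(html):
    # Hypothetical helper: keep the captured letter (\1) and add one space.
    return BROKEN.sub(r'\1 ', html)

sample = (
    '<p class="calibre2">The template line is this</p>\n'
    '<p class="calibre2">little sentence</p>'
)
print(join_broken_paragraphs(sample))
# <p class="calibre2">The template line is this little sentence</p>
```

A paragraph that ends with a full stop is left alone, because the character before </p> is not in [a-z].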

You might find in really bad PDF conversions that sometimes a word is split across the paragraph boundary. In that case you don't want the replace expression to have a space, or else the word will end up with a space in it. What I do is manually step through all the matches rather than doing Replace All, so that I can catch any exceptions.
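
If you are working on the HTML outside the editor, the "step through the matches" idea can be approximated by listing each match with a little context before deciding on a replacement. This is only a sketch: the helper name list_candidates and its width parameter are invented here, and telling a split word from a split sentence still needs a human eye.

```python
import re

BROKEN = re.compile(r'([a-z])</p>\s+<p class="calibre2">')

def list_candidates(html, width=15):
    """Return (text_before, text_after) around each match, for manual review."""
    found = []
    for m in BROKEN.finditer(html):
        # The match starts at the captured letter, so include it in "before".
        before = html[max(0, m.start() - width):m.start() + 1]
        after = html[m.end():m.end() + width]
        found.append((before, after))
    return found

sample = (
    '<p class="calibre2">a broken sen</p>\n'
    '<p class="calibre2">tence and a broken</p>\n'
    '<p class="calibre2">line</p>'
)
for before, after in list_candidates(sample):
    print(repr(before), '|', repr(after))
```

In this sample the first match is a split word ("sen" / "tence"), which should be joined without a space, while the second is an ordinary broken line that wants one.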

You may also want to allow other characters such as commas and hyphens in that initial ([a-z]) group. You can also catch paragraphs that start with a lowercase word using a similar expression:
Find: </p>\s+<p class="calibre2">([a-z])
Replace: \1 (a space followed by \1)
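
Both variations can be sketched in Python as well. The extended character class [a-z,-] is my assumption of how commas and hyphens might be added (the hyphen goes last so it is taken literally), and the function name merge is hypothetical.

```python
import re

# Assumed extension of the post's ([a-z]) group: also allow ',' and '-'.
ENDS_MID_SENTENCE = re.compile(r'([a-z,-])</p>\s+<p class="calibre2">')
# Second pattern from the post: next paragraph starts with a lowercase letter.
STARTS_LOWERCASE = re.compile(r'</p>\s+<p class="calibre2">([a-z])')

def merge(html):
    # First pass: \1 followed by a space.
    html = ENDS_MID_SENTENCE.sub(r'\1 ', html)
    # Second pass: a space followed by \1.
    return STARTS_LOWERCASE.sub(r' \1', html)

sample = (
    '<p class="calibre2">First, second,</p>\n'
    '<p class="calibre2">third. The END</p>\n'
    '<p class="calibre2">follows here.</p>'
)
print(merge(sample))
# <p class="calibre2">First, second, third. The END follows here.</p>
```

A line ending in a hyphen is usually a split word, so inserting a space there is probably wrong; those matches are the ones most worth checking by hand.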

Last edited by kiwidude; 12-20-2010 at 05:37 PM.