Old 03-06-2013, 07:28 PM   #4
Tex2002ans
Converting a document from PDF is the WORST case scenario. It will take a lot of elbow grease to fix the document after conversion.

I personally use these two Regexes to help combine broken paragraphs:

Search #1:

Code:
-</p>\s+<p>
Replace #1: (empty)

Search #2:

Code:
([^>”\?\!\.])</p>\s+<p>
Replace #2: (note the single space following the \1)

Code:
\1
Search #1 will take a line that ends with a hyphen, erase the hyphen, and combine it with the next line. You may or may not want to keep the hyphen, so I replace one match at a time to make sure the hyphen is not needed.

Search #2 will look for a paragraph NOT ending with any of the characters inside the square brackets (>, ”, ?, !, or .), and will combine it with the next paragraph.
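
If you would rather script this than use an editor's search and replace, here is a minimal Python sketch of the same two steps. The function name and the sample text are made up for illustration. Unlike the one-at-a-time approach described above, it replaces every hyphen match in one go, so only use it where you are sure the hyphens are not needed.

```python
import re

# Search #1: a paragraph ending in a hyphen. Drop the hyphen and join.
HYPHEN_BREAK = re.compile(r'-</p>\s+<p>')

# Search #2: a paragraph NOT ending in >, ”, ?, ! or . Join with a space.
SOFT_BREAK = re.compile(r'([^>”\?\!\.])</p>\s+<p>')


def join_broken_paragraphs(html):
    """Apply Search/Replace #1, then Search/Replace #2, to the whole string."""
    html = HYPHEN_BREAK.sub('', html)      # Replace #1: (empty)
    html = SOFT_BREAK.sub(r'\1 ', html)    # Replace #2: \1 followed by a space
    return html


sample = (
    '<p>The conver-</p>\n'
    '<p>sion broke this sentence</p>\n'
    '<p>in two places.</p>\n'
    '<p>A new paragraph.</p>'
)
print(join_broken_paragraphs(sample))
```

The first three paragraphs of the sample are merged into one, while the last stays separate because the paragraph before it ends with a period.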

For cleaning up directly from calibre's output, you may need to use these Regexes for search instead:

Code:
-</p>\s+<p class="calibre[0-9]+">
Code:
([^>”\?\!\.])</p>\s+<p class="calibre[0-9]+">
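
As a rough sketch, the calibre variants can be scripted in Python the same way, using the same two replacements (empty, and \1 followed by a space). The function name and sample markup are hypothetical, and it assumes the paragraphs really do carry class="calibreN" attributes as in the patterns above.

```python
import re

# Shared tail of both calibre patterns: a paragraph break followed by
# an opening tag with a numbered calibre class.
CALIBRE_BREAK = r'</p>\s+<p class="calibre[0-9]+">'

HYPHEN_BREAK = re.compile('-' + CALIBRE_BREAK)
SOFT_BREAK = re.compile(r'([^>”\?\!\.])' + CALIBRE_BREAK)


def join_calibre_paragraphs(html):
    """Join paragraphs split mid-sentence in calibre-style markup."""
    html = HYPHEN_BREAK.sub('', html)       # drop the hyphen and join
    return SOFT_BREAK.sub(r'\1 ', html)     # join with a single space


sample = (
    '<p class="calibre1">A para-</p>\n'
    '<p class="calibre1">graph split by the PDF</p>\n'
    '<p class="calibre2">layout.</p>'
)
print(join_calibre_paragraphs(sample))
```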