MobileRead Forums - View Single Post

theducks · 07-30-2012, 03:12 AM

Quote:

Originally Posted by XayneP_G

This problem has probably been addressed elsewhere, and I suspect it has an easy solution. However, I have very little experience with coding and at this point I'm stuck.

I used calibre to convert a PDF file to EPUB. The resulting file had paragraph breaks (<P>) where each line of text ended on the PDF. This means a lot of blank lines through the ebook I was trying to read.

I found I was able to delete the lines manually with Sigil, however it would be a very time-consuming process to go through the entire text. As the superfluous paragraph breaks are indistinguishable from the genuine ones, a simple find and replace in the code is not an option either. Is there an easy solution to this problem?

Use a regex find and replace in Sigil's Code View (CV).

Think: what common pattern distinguishes most false line ends?
A line that ends in a lower-case letter or a comma, with the next line starting in lower case. (Not perfect: false breaks that fall before a quote or a proper name (capital) will be ignored.)

search: (?sm)([a-z,])</p>\s+<p\b[^>]*>([a-z])
replace: \1 \2
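
If you want to try the idea outside Sigil first, here is a minimal Python sketch of the same find and replace. The sample text, the `calibre1` class name and the function name `join_false_breaks` are made up for illustration. The opening tag is matched with `<p\b[^>]*>`, which is my assumption about the intent: it keeps the match inside a single tag, so one hit cannot swallow several paragraphs.

```python
import re

# Hypothetical sample: one sentence that a PDF conversion split across
# several <p> elements, followed by a genuine new paragraph.
SAMPLE = (
    '<p class="calibre1">The quick brown fox jumped over the</p>\n'
    '<p class="calibre1">lazy dog, and then,</p>\n'
    '<p class="calibre1">without stopping, ran on.</p>\n'
    '<p class="calibre1">A new paragraph starts here.</p>\n'
)

# Group 1: the lower-case letter or comma that ends the first line.
# Group 2: the lower-case letter that starts the next line.
# <p\b[^>]*> matches one opening <p> tag, with or without attributes.
FALSE_BREAK = re.compile(r'([a-z,])</p>\s+<p\b[^>]*>([a-z])')


def join_false_breaks(html: str) -> str:
    """Join paragraphs that look like mid-sentence line breaks."""
    # Same replacement as in the post: keep both letters, put a space between.
    return FALSE_BREAK.sub(r'\1 \2', html)


if __name__ == "__main__":
    print(join_false_breaks(SAMPLE))
```

On this sample the first three paragraphs are merged into one and the last is left alone, because the line before it ends in a full stop. As noted above, a false break before a capital letter or a quote will still slip through, so check the result by eye.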