Thanks a lot ElfWreck! Actually, I spent some more time trying to learn about Regular Expressions (used by most text editors for Search and Replace) and I ended up doing this:
Converted PDF -> HTML
Now all the unwanted mid-sentence page breaks are basically those that look like *</p> where * is some character other than a period (since a period indicates end of sentence and probably end of para). I used Komodo Edit, which is a free and powerful text/html editor, to open the HTML file. Then I used the Edit->Replace feature (Ctrl-H) and entered the following:
(Make sure the following boxes are checked: Regex, Multiline and Replace)
Enter the following in the section - Find what:
Code:
([^\.'"!?:\)])</span></p>
<p><span class=font3>
Enter the following in the section - Replace with:
Code:
\1 
(Note: \1 above actually has a space after the 1)
In my particular HTML file, paragraphs end with </span></p> and then <p><span class=font3> starts the next para.
What the regex above does is find only those paragraph breaks that do not have one of the characters . ' " ! ? : ) just preceding the paragraph break (since those would indicate a complete sentence and probably the end of a legit paragraph). The parentheses around the character class capture that preceding character so that \1 can put it back in the replacement.
Now you can keep using the Find and Replace feature to rapidly cycle through all instances and find the faulty paragraph breaks. If a break is indeed faulty, you just hit Replace and the whole code section is replaced by a space (the character before it is kept), thus bridging the broken sentence together. This ended up working really well and I managed to fix a 300 page book (300 PDF pages, that is) in 10 minutes! You can go even faster if you just use the Replace All feature, although you might end up taking out a couple of legit paragraph breaks (for example, some paragraph-ending sentences might end with a comma or some other character not being checked for).
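For anyone who would rather script the Replace All step than click through an editor, here is a rough Python sketch of the same idea. It is not what I ran in Komodo: the function name and the sample text are made up for illustration, and I am assuming \s* between the two tags (instead of a literal line break) so that different line endings still match.

```python
import re

# Same character class as the Find pattern above: any character that is NOT
# sentence-ending punctuation, captured so it can be put back.
# Assumption: \s* stands in for the literal line break between the tags.
BROKEN_BREAK = re.compile(
    r"""([^.'"!?:)])</span></p>\s*<p><span class=font3>"""
)

def join_broken_paragraphs(html):
    """Replace each suspicious paragraph break with the kept character plus a space."""
    return BROKEN_BREAK.sub(r"\1 ", html)

# Hypothetical sample: the first break is mid-sentence, the second is legit.
sample = (
    "<p><span class=font3>The quick brown</span></p>\n"
    "<p><span class=font3>fox jumped.</span></p>\n"
    "<p><span class=font3>A new paragraph.</span></p>"
)
result = join_broken_paragraphs(sample)
print(result)
```

Like Replace All, this joins every match without asking, so the same warning about legit breaks applies.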
Note- You can easily change the literal (non-regex) part of the code above, i.e. the tags after the parenthesized character class, to match how paragraphs end and start in your particular HTML file.
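As a sketch of that adaptation in Python (again an illustration, not something from my Komodo workflow): the helper name and the example tags below are hypothetical, and re.escape is used so that any special characters in your own tags are treated literally.

```python
import re

def build_break_pattern(para_end, para_start):
    """Build the break-finding regex for a given pair of end/start markup strings."""
    # Character that must NOT be sentence-ending punctuation, captured as group 1.
    keep = r"""([^.'"!?:)])"""
    # re.escape makes the markup literal; \s* (an assumption) covers the line break.
    return re.compile(keep + re.escape(para_end) + r"\s*" + re.escape(para_start))

# Hypothetical markup for a different HTML file.
pattern = build_break_pattern("</p>", '<p class="body">')
joined = pattern.sub(r"\1 ", 'half a</p>\n<p class="body">sentence.</p>')
print(joined)
```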
I hope this makes sense. It worked really well for me. I also find that HTML makes doing all the little tweaks really quick and painless.
Also, I'd like to mention that for some reason this regex refused to work in Notepad++ for me (which is why I moved to Komodo). If anyone can get it to work in Notepad++, do let me know.
Cheers