Quote:
Originally Posted by exaltedwombat
But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts.
But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?
One word:
Toxaris's EPUB Tools
Its Postprocess OCR feature should combine broken sentences.
Note: It looks like you have Fiction, so quotation marks at the end of lines may be tricky.
* * *
Or I use these regexes to combine the broken paragraphs:
Note: DO NOT Replace All. Only do these one-by-one.
Regex #1
This searches for a hyphen at the end of a line:
Search: -</p>\s+<p>
Replace: (nothing)
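As a rough illustration of what Regex #1 does, here is a minimal Python sketch. The sample text and the function name join_first_hyphen_break are made up for this example; only the search pattern comes from above. It replaces a single match per call, in the spirit of the one-by-one advice.

```python
import re

# Pattern from Regex #1: a hyphen right before a paragraph break.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

def join_first_hyphen_break(text):
    """Join only the first hyphen-split paragraph pair (hypothetical helper)."""
    # count=1 mimics pressing Replace once instead of Replace All.
    return HYPHEN_BREAK.sub("", text, count=1)

# Hypothetical sample: OCR split the word "conversion" across two paragraphs.
sample = "<p>The con-</p>\n<p>version went well.</p>"
print(join_first_hyphen_break(sample))
```

A genuinely hyphenated word that happens to break at its hyphen (say "well-" followed by "known") would lose the hyphen here, which is one reason to check each match by hand.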
Regex #2
This searches for paragraphs that DON'T end on a closing punctuation mark (period, exclamation point, question mark, [...]).
Search: ([^>””\?\!\.])</p>\s+<p>
Replace: \1(put-a-space-here)
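To make Regex #2 concrete, here is a small Python sketch under the same assumptions: the sample text and the name join_first_open_paragraph are invented for illustration, and the pattern is copied from the Search line above. The replacement is the captured character plus one space.

```python
import re

# Pattern from Regex #2: a paragraph whose last character is not a tag end,
# a closing quote, or sentence-ending punctuation.
OPEN_PARAGRAPH = re.compile(r"([^>””\?\!\.])</p>\s+<p>")

def join_first_open_paragraph(text):
    """Merge only the first paragraph that ends mid-sentence (hypothetical helper)."""
    # r"\1 " keeps the captured character and adds the single space.
    return OPEN_PARAGRAPH.sub(r"\1 ", text, count=1)

# Hypothetical sample: the first paragraph ends without punctuation.
sample = "<p>She walked slowly</p>\n<p>toward the door.</p>\n<p>It was locked.</p>"
print(join_first_open_paragraph(sample))
```

The break after "door." is left alone because the period is in the excluded set, so only the unfinished sentence gets merged.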
Regex #3
This searches for a lowercase letter at the beginning of a paragraph.
Search: <p>[a-z]
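Regex #3 is search-only, so a sketch of it just lists candidates for manual review. The function name find_lowercase_starts and the sample text below are assumptions made for this example.

```python
import re

# Pattern from Regex #3: a paragraph that starts with a lowercase letter.
# Do not add re.IGNORECASE, or every paragraph starting with a letter matches.
LOWERCASE_START = re.compile(r"<p>[a-z]")

def find_lowercase_starts(text):
    """Return (line_number, line) pairs worth checking by hand (hypothetical helper)."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if LOWERCASE_START.search(line):
            hits.append((number, line))
    return hits

# Hypothetical sample: the second paragraph is probably a continuation.
sample = "<p>He paused.</p>\n<p>and then went on.</p>\n<p>Nothing else happened.</p>"
for number, line in find_lowercase_starts(sample):
    print(number, line)
```

Each hit still needs a human decision about whether it really belongs to the previous paragraph.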
That'll get you most of the way there, but the rest will have to be manually checked/corrected. Edge cases that can go either way (like a paragraph ending in a colon) will also have to be checked.
But like I mentioned in the Note, Fiction + quotation marks gets a bit trickier. You can tweak Regex #2 above to catch some of those cases, but Fiction requires a lot more manual checking.