MobileRead Forums - View Single Post

Tex2002ans · 12-10-2018, 09:39 PM

Quote:

Originally Posted by exaltedwombat

But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts.

But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?

One word:

Toxaris's EPUB Tools

Its Postprocess OCR feature should combine the broken sentences.

Note: It looks like you have Fiction, though, so quotation marks at the end of lines may be tricky.

* * *

Or I use these regexes to combine the broken paragraphs:

Note: DO NOT use Replace All. Only apply these one match at a time.

Regex #1

This searches for a hyphen at the end of a line:

Search: -</p>\s+<p>
Replace: (nothing)
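
For anyone who would rather see the effect outside an editor, here is a minimal Python sketch of the same idea. It assumes the text is HTML where every paragraph sits in a bare <p>...</p> with no attributes, so the break between two paragraphs is matched as </p>\s+<p>. The sample text and the name join_hyphenated are made up for the example.

```python
import re

# A hyphen at the end of one paragraph, followed by the start of the next.
# Assumption: bare <p> tags with no attributes; adjust for your markup.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

def join_hyphenated(html: str, count: int = 1) -> str:
    """Rejoin a word that was split by a hyphen across a paragraph break.

    count=1 replaces only the first match, which mirrors the
    one-by-one advice: a real compound such as "well-known" would
    lose its hyphen, so each hit deserves a look before moving on.
    """
    return HYPHEN_BREAK.sub("", html, count=count)

sample = "<p>The conver-</p>\n<p>sion went fine.</p>"
print(join_hyphenated(sample))
# <p>The conversion went fine.</p>
```

Calling it repeatedly and checking each result is the scripted equivalent of pressing Replace once per match instead of Replace All.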

Regex #2

This searches for paragraphs that DON'T end on a closing punctuation mark (period, exclamation point, question mark, [...]).

Search: ([^>”"\?\!\.])</p>\s+<p>
Replace: \1 followed by a single space
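
A rough Python sketch of this step, under the same assumption that paragraphs are wrapped in bare <p>...</p> tags with no attributes. The function name join_next and the sample text are invented for illustration, and the character class only covers the punctuation listed in the pattern itself.

```python
import re

# One character that is NOT a tag end, a closing quote, or sentence-ending
# punctuation, followed by a paragraph break.
# Assumption: bare <p> tags with no attributes.
OPEN_ENDED = re.compile(r'([^>”"?!.])</p>\s+<p>')

def join_next(html: str) -> str:
    """Merge the first paragraph lacking closing punctuation with the next.

    Only one match is replaced per call, so every join can be checked
    against the original before continuing.
    """
    return OPEN_ENDED.sub(r"\1 ", html, count=1)

sample = (
    "<p>The rain had stopped by the time</p>\n"
    "<p>they reached the station.</p>\n"
    "<p>Nobody was waiting.</p>"
)
print(join_next(sample))
```

The second break in the sample is left alone because the paragraph before it ends in a period.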

Regex #3

This searches for a lowercase letter at the beginning of a paragraph.

Search: <p>[a-z]
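
Since this one is a search with no replacement, a Python sketch of it can simply list the suspects for manual review. It again assumes bare <p> tags with no attributes; lowercase_starts, its context parameter, and the sample text are hypothetical names and data chosen for the example.

```python
import re

# A paragraph whose first character is a lowercase ASCII letter.
# Assumption: bare <p> tags; accented letters are not covered by [a-z].
LOWER_START = re.compile(r"<p>[a-z]")

def lowercase_starts(html: str, context: int = 30):
    """Yield (offset, snippet) for each paragraph that starts in lowercase."""
    for m in LOWER_START.finditer(html):
        start = max(0, m.start() - context)
        # Include some text on both sides so the hit can be judged in place.
        yield m.start(), html[start:m.end() + context]

sample = "<p>It was late and</p>\n<p>the shop was shut.</p>\n<p>We left.</p>"
for offset, snippet in lowercase_starts(sample):
    print(offset, repr(snippet))
```

Nothing is changed here; the output is only a list of places to look at, which fits the check-each-one approach.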

That'll get you most of the way there, but the rest will have to be manually checked and corrected. Edge cases that could go either way (like a paragraph ending in a colon) will also have to be checked.

But like I mentioned in the Note, Fiction + quotation marks gets a bit trickier. You can tweak Regex #2 above to catch some of those cases, but Fiction requires a lot more manual checking.