12-10-2018, 09:39 PM   #2
Tex2002ans
Quote:
Originally Posted by exaltedwombat
But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts.

But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?
One word:

Toxaris's EPUB Tools

Its Postprocess OCR function should combine broken sentences.

Note: Since it looks like you have Fiction, quotation marks at the end of lines may be tricky.

* * *

Or I use these regexes to combine the broken paragraphs:

Note: DO NOT use Replace All. Apply these one match at a time and check each one.

Regex #1

This searches for a hyphen at the end of a line:

Search: -</p>\s+<p>
Replace: (nothing)

Before:

Spoiler:
Code:
<p>This is a sen-</p>

<p>tence.</p>


After:

Spoiler:
Code:
<p>This is a sentence.</p>
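
If you'd rather script it than work in an editor's Find & Replace, here is a minimal Python sketch of Regex #1. The function name and sample text are made up for illustration. It joins only the first match per call, to stay in line with the "no Replace All" advice, and it assumes the hyphen really is a line-break hyphen (a genuinely hyphenated word split there would lose its hyphen, which is why each match needs a look).

```python
import re

# Regex #1 from the post: a hyphen right before </p>, whitespace, then <p>.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")


def join_first_hyphen_break(html):
    """Join only the first hyphen-split paragraph (hypothetical helper)."""
    # count=1 mimics pressing "Replace" once instead of "Replace All".
    return HYPHEN_BREAK.sub("", html, count=1)


sample = "<p>This is a sen-</p>\n\n<p>tence.</p>"
print(join_first_hyphen_break(sample))  # <p>This is a sentence.</p>
```

Call it again on the result to move on to the next match, checking the text in between.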


Regex #2

This searches for paragraphs that DON'T end on a closing punctuation mark (period, exclamation point, question mark, [...]).

Search: ([^>””\?\!\.])</p>\s+<p>
Replace: \1(put-a-space-here)

Before:

Spoiler:
Code:
<p>This is a</p>

<p>sentence.</p>

<p>One, Two, Three,</p>

<p>Four.</p>


After:

Spoiler:
Code:
<p>This is a sentence.</p>

<p>One, Two, Three, Four.</p>
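
A rough Python sketch of Regex #2, for anyone who prefers a script. The helper names are hypothetical and the pattern is copied as posted. One function lists the suspected breaks with some context so you can eyeball them first, the other joins a single break per call.

```python
import re

# Regex #2 as posted: any character other than ">", a closing curly quote,
# "?", "!" or "." right before </p>, then whitespace, then <p>.
UNFINISHED = re.compile(r"([^>””\?\!\.])</p>\s+<p>")


def candidates(html, context=20):
    """Return each suspected break with surrounding text, for manual review."""
    found = []
    for m in UNFINISHED.finditer(html):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(html))
        found.append(html[start:end])
    return found


def join_one(html):
    """Join only the first suspected break: keep the captured character
    and put a single space where the paragraph boundary was."""
    return UNFINISHED.sub(r"\1 ", html, count=1)


sample = (
    "<p>This is a</p>\n\n<p>sentence.</p>\n\n"
    "<p>One, Two, Three,</p>\n\n<p>Four.</p>"
)

for snippet in candidates(sample):
    print(repr(snippet))

step1 = join_one(sample)
step2 = join_one(step1)
print(step2)
```

On the sample this finds two candidates (after "a" and after the comma), and two calls to join_one give the "After" text shown above.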


Regex #3

This searches for a lowercase letter at the beginning of a paragraph.

Search: <p>[a-z]

Before:

Spoiler:
Code:
<p>this is an example.</p>

<p>of broken paragraph.</p>
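
Regex #3 is search-only, so a script version just reports where to look. A small Python sketch, with a made-up function name and sample; Python's re is case-sensitive by default, which is what this search needs.

```python
import re

# Regex #3 from the post: a paragraph whose first character is a-z.
LOWERCASE_START = re.compile(r"<p>[a-z]")


def lowercase_starts(html, width=30):
    """Return (offset, snippet) for each paragraph opening with a lowercase
    letter. Nothing is changed; these are just places to check by hand."""
    return [
        (m.start(), html[m.start():m.start() + width])
        for m in LOWERCASE_START.finditer(html)
    ]


sample = (
    "<p>this is an example.</p>\n\n"
    "<p>of broken paragraph.</p>\n\n"
    "<p>This one is fine.</p>"
)

for offset, snippet in lowercase_starts(sample):
    print(offset, repr(snippet))
```

The third paragraph starts with a capital letter, so only the first two are reported.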


That'll get you most of the way there, but the rest will have to be checked and corrected manually. The same goes for edge cases that can go either way (like a colon at the end of a line).

But like I mentioned in the Note, Fiction + quotation marks gets a bit trickier. You can tweak Regex #2 above to catch some of those cases, but Fiction requires a lot more manual checking.

Last edited by Tex2002ans; 12-10-2018 at 09:42 PM.