While figuring out my embarrassing user error yesterday, Karellen observed:
Quote:
Originally Posted by Karellen
As a side note, hyphenation has been removed from the document. So when you run the regex, you are going to get a lot of split words.
lack of intel</p>
<p>lectual integration
will end up as
lack of intel lectual integration
There is no quick fix to that, as far as I am aware.
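To make Karellen's point concrete, here is a minimal Python sketch of the failure. The pattern below is only a guess at the kind of paragraph-merging regex involved, not the one actually used, and `merge_broken_paragraphs` is a hypothetical name.

```python
import re

# Hypothetical paragraph-merging regex (an assumption, not the thread's regex):
# when a paragraph ends in a lowercase letter and the next one starts with a
# lowercase letter, treat the break as a mid-sentence split and replace the
# </p><p> pair with a single space.
MERGE_PATTERN = re.compile(r"(?<=[a-z])</p>\s*<p>(?=[a-z])")


def merge_broken_paragraphs(html: str) -> str:
    """Join paragraphs that were split mid-sentence."""
    return MERGE_PATTERN.sub(" ", html)


sample = "<p>lack of intel</p>\n<p>lectual integration</p>"
print(merge_broken_paragraphs(sample))
# prints: <p>lack of intel lectual integration</p>
```

The regex cannot tell a break between two words from a break inside one word, because the hyphen that used to mark the difference is gone.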
So I decided to experiment using Gemini and found that the following prompt works fairly well.
******
Role: You are a professional ePub formatting specialist and copyeditor.
Task: I am going to provide you with HTML code from an ePub file that was imported from a .docx file. The text contains "line-split" artifacts where words were broken across lines and spaces were inserted (e.g., "or ganic" instead of "organic").
Please perform the following steps:
1. Join Split Words: Identify and fix words that are clearly split by a space (e.g., "inter woven" → "interwoven", "suffi ciently" → "sufficiently").
2. Remove Leading Whitespace: Delete any &nbsp; entities or spaces at the immediate beginning of <p> tags.
3. Fix Punctuation Artifacts: Ensure sentences end with a period if the merge accidentally left one out.
4. Preserve HTML: Keep all tags like <em>, <strong>, and <a> exactly as they are.
5. Output: Provide the corrected text in a single code block so I can easily copy it back into Sigil.
Do you understand? If so, please ask me to provide the HTML code.
******
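Since a chatbot can quietly change more than you asked for, it may be worth checking the result before pasting it back into Sigil. Below is a rough Python sketch of such a check, using only the standard library. The function names (`fingerprint`, `check_llm_output`) are made up for illustration. It assumes the only legitimate changes are removed spaces and added punctuation, so the tag sequence and the sequence of letters and digits should be identical before and after.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collect the tag sequence and the text content of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))

    def handle_data(self, data):
        self.text.append(data)


def fingerprint(html):
    """Return (tag sequence, letters and digits only) for a fragment."""
    parser = _Collector()
    parser.feed(html)
    parser.close()
    letters = "".join(ch for ch in "".join(parser.text) if ch.isalnum())
    return parser.tags, letters


def check_llm_output(original, corrected):
    """Report whether tags and letters survived the round trip unchanged."""
    orig_tags, orig_letters = fingerprint(original)
    new_tags, new_letters = fingerprint(corrected)
    return {
        "tags_preserved": orig_tags == new_tags,
        "letters_preserved": orig_letters == new_letters,
    }


before = "<p>&nbsp; lack of intel lectual <em>integration</em></p>"
after = "<p>lack of intellectual <em>integration</em>.</p>"
print(check_llm_output(before, after))
```

If either value comes back False, the model dropped a tag or rewrote some wording, and that chunk deserves a manual look. The check will not catch a wrong join (two real words glued together), since that also only removes a space.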
If you try this, comment to let us know how it works (or doesn't).
Cheers