02-09-2026, 09:00 PM   #1
MikeMaloney
Using AI to clean up docx imports

While I was figuring out my embarrassing user error yesterday, Karellen observed:

Quote:
Originally Posted by Karellen
As a side note, hyphenation has been removed from the document. So when you run the regex, you are going to get a lot of split words.

lack of intel</p>

<p>lectual integration


will end up as

lack of intel lectual integration

There is no quick fix to that, as far as I am aware.
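
To make the problem in the quote concrete, here is a minimal Python sketch. The pattern is my own guess for illustration, not the regex actually used in the earlier thread: it assumes a bad break is a paragraph boundary sitting between two lowercase letters. The names BAD_BREAK and merge_paragraphs are made up for this example.

```python
import re

# Hypothetical pattern, not the regex from the original thread: it treats a
# paragraph break between two lowercase letters as a bad break.
BAD_BREAK = re.compile(r"(?<=[a-z])</p>\s*<p>(?=[a-z])")


def merge_paragraphs(html: str, joiner: str = " ") -> str:
    """Replace each suspected bad paragraph break with `joiner`."""
    return BAD_BREAK.sub(joiner, html)


sample = "<p>lack of intel</p>\n\n<p>lectual integration</p>"
print(merge_paragraphs(sample))      # <p>lack of intel lectual integration</p>
print(merge_paragraphs(sample, ""))  # <p>lack of intellectual integration</p>
```

Joining with an empty string fixes this particular sample, but it would also glue genuinely separate words together ("lack of" followed by "integration" becomes "lack ofintegration"). Since the hyphens are gone, a regex alone cannot tell the two cases apart.
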
So I decided to experiment using Gemini and found that the following prompt works fairly well.
******

Role: You are a professional ePub formatting specialist and copyeditor.
Task: I am going to provide you with HTML code from an ePub file that was imported from a .docx file. The text contains "line-split" artifacts where words were broken across lines and spaces were inserted (e.g., "or ganic" instead of "organic").
Please perform the following steps:
1. Join Split Words: Identify and fix words that are clearly split by a space (e.g., "inter woven" → "interwoven", "suffi ciently" → "sufficiently").
2. Remove Leading Whitespace: Delete any &nbsp; or spaces at the immediate beginning of <p> tags.
3. Fix Punctuation Artifacts: Ensure sentences end with a period if the merge accidentally left one out.
4. Preserve HTML: Keep all tags like <em>, <strong>, and <a> exactly as they are.
5. Output: Provide the corrected text in a single code block so I can easily copy it back into Sigil.
Do you understand? If so, please ask me to provide the HTML code.

******
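
Two of the steps in the prompt can be handled or double-checked without the AI. The Python sketch below is only a rough idea under my own assumptions: strip_leading_whitespace does step 2 with a plain regex, and tags_preserved checks step 4 by comparing the sequence of tags before and after the AI pass. Both function names and the TagCollector class are hypothetical helpers written for this post, not part of Sigil or Gemini.

```python
import re
from html.parser import HTMLParser

# Step 2: leading &nbsp; entities or whitespace right after an opening <p ...> tag.
LEADING = re.compile(r"(<p\b[^>]*>)(?:&nbsp;|\s)+")


def strip_leading_whitespace(html: str) -> str:
    """Remove &nbsp; and whitespace at the very start of each paragraph."""
    return LEADING.sub(r"\1", html)


class TagCollector(HTMLParser):
    """Record every start and end tag (with attributes) in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))


def tag_sequence(html: str) -> list:
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tags_preserved(original: str, corrected: str) -> bool:
    """Step 4 check: True if both versions contain the same tags in the same order."""
    return tag_sequence(original) == tag_sequence(corrected)


original = "<p>&nbsp;lack of <em>intel lectual</em> integration</p>"
good = "<p>lack of <em>intellectual</em> integration</p>"
bad = "<p>lack of intellectual integration</p>"  # the <em> pair was dropped

print(strip_leading_whitespace(original))
print(tags_preserved(original, good))
print(tags_preserved(original, bad))
```

A check like this only confirms that the markup survived. It says nothing about whether the words were joined correctly, so the text itself still needs a read-through.
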
If you try this, comment to let us know how it works (or doesn't).

Cheers