Old 02-09-2026, 09:00 PM   #1
MikeMaloney
Member
MikeMaloney began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Jan 2021
Device: iBooks
Using AI to clean up docx imports

While figuring out my embarrassing user error yesterday, Karellen observed:

Quote:
Originally Posted by Karellen
As a side note, hyphenation has been removed from the document. So when you run the regex, you are going to get a lot of split words.

lack of intel</p>

<p>lectual integration


will end up as

lack of intel lectual integration

There is no quick fix to that, as far as I am aware.

So I decided to experiment using Gemini and found that the following prompt works fairly well.
******

Role: You are a professional ePub formatting specialist and copyeditor.
Task: I am going to provide you with HTML code from an ePub file that was imported from a .docx file. The text contains "line-split" artifacts where words were broken across lines and spaces were inserted (e.g., "or ganic" instead of "organic").
Please perform the following steps:
1. Join Split Words: Identify and fix words that are clearly split by a space (e.g., "inter woven" → "interwoven", "suffi ciently" → "sufficiently").
2. Remove Leading Whitespace: Delete any &nbsp; or spaces at the immediate beginning of <p> tags.
3. Fix Punctuation Artifacts: Ensure sentences end with a period if the merge accidentally left one out.
4. Preserve HTML: Keep all tags like <em>, <strong>, and <a> exactly as they are.
5. Output: Provide the corrected text in a single code block so I can easily copy it back into Sigil.
Do you understand? If so, please ask me to provide the HTML code.

******
If you try this, comment to let us know how it works (or doesn't).

Cheers
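
Step 2 of the prompt (removing leading whitespace inside <p> tags) is mechanical enough that it could plausibly be handled without an AI. The following is a minimal Python sketch, not something posted in the thread; the function name `strip_leading_whitespace` is made up for illustration, and it assumes the paragraphs are ordinary `<p>` elements with whitespace written as spaces, no-break spaces, or `&nbsp;` / `&#160;` entities.

```python
import re

# Matches an opening <p ...> tag followed by a run of whitespace.
# \s already covers a literal no-break space (U+00A0) in Python's
# Unicode mode; the entity spellings are listed explicitly.
# \b keeps the pattern from matching other tags such as <pre>.
LEADING_SPACE = re.compile(r"(<p\b[^>]*>)(?:\s|&nbsp;|&#160;)+")

def strip_leading_whitespace(xhtml):
    """Remove whitespace immediately after each opening <p> tag."""
    return LEADING_SPACE.sub(r"\1", xhtml)

if __name__ == "__main__":
    sample = '<p>&nbsp; lectual integration</p>'
    print(strip_leading_whitespace(sample))
```

Only the whitespace directly after the opening tag is touched, so inline tags such as <em> and the rest of the paragraph are left as they were.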
Old 02-09-2026, 09:19 PM   #2
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.
 
Posts: 9,413
Karma: 6733754
Join Date: Nov 2009
Device: many
Or just run it through a spellchecker to find previously hyphenated words with a space in between and fix them.
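
The spellchecker idea can be sketched roughly as follows. This is an illustration only, not a description of how Sigil's spellchecker works: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real dictionary, and `find_split_words` is a made-up name. It flags two adjacent words when their joined form is a dictionary word and at least one of the two pieces is not. It expects plain text, so tags would need to be stripped or skipped first.

```python
import re

# Hypothetical stand-in for a real dictionary; in practice this would be
# loaded from a word list of your choosing.
KNOWN_WORDS = {
    "lack", "of", "intellectual", "integration",
    "organic", "or", "in", "to", "into", "the",
}

# A word, one space, then (via lookahead) the next word. The lookahead
# lets each word also serve as the left half of the following pair.
WORD_PAIR = re.compile(r"\b([A-Za-z]+) (?=([A-Za-z]+)\b)")

def find_split_words(text, known_words):
    """Return (left, right, joined) for pairs that look like one split word."""
    candidates = []
    for match in WORD_PAIR.finditer(text):
        left, right = match.group(1), match.group(2)
        joined = (left + right).lower()
        # Flag only when the joined form is known and at least one piece
        # is not, so ordinary pairs such as "in to" are left alone.
        if joined in known_words and (
            left.lower() not in known_words
            or right.lower() not in known_words
        ):
            candidates.append((left, right, left + right))
    return candidates

if __name__ == "__main__":
    print(find_split_words("lack of intel lectual integration", KNOWN_WORDS))
```

The output is a list of candidates to review by hand rather than an automatic fix, since a pair like "or ganic" is only caught because "ganic" is not a word, and some genuine splits will have two valid halves.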

And to muddy the waters, make sure you are the author or copyright owner of the ebook before you submit its contents to a network-based (non-private) AI like Google's Gemini. Otherwise you are technically violating its copyright, and you could be charged for it in some countries.

To make that clearer, many newly created epubs now explicitly include a statement that submitting a work in whole or part to any AI violates copyright, right in the text of the epub. Portions of that epub's text will be added to that AI's body of knowledge and with the right prompts, can be extracted. Effectively you have allowed a digital copy to be made that can be discovered by anyone.

If you feel you must use an AI, then look for a local AI engine that does not phone home with your (or the author's) valuable intellectual property.

And fyi, your prompt should have specified XHTML, not HTML. They do differ. Hopefully you made a checkpoint before you submitted it and compared your result against that checkpoint to make sure nothing was inadvertently added or removed.
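
One way to do that kind of comparison is sketched below, under stated assumptions: the names `check_against_checkpoint`, `tag_sequence`, and `squeezed_text` are invented for this example, and the tag matching is a simple regex rather than a real XHTML parser. The idea is that rejoining split words should only remove whitespace, so the sequence of tags and the non-whitespace characters should be identical before and after.

```python
import re

TAG = re.compile(r"<[^>]+>")

def tag_sequence(xhtml):
    """All tags in document order, exactly as written."""
    return TAG.findall(xhtml)

def squeezed_text(xhtml):
    """Text with tags, &nbsp; entities, and all whitespace removed."""
    text = TAG.sub("", xhtml).replace("&nbsp;", "")
    return re.sub(r"\s+", "", text)

def check_against_checkpoint(before, after):
    """Return a list of problems; an empty list means nothing suspicious."""
    problems = []
    if tag_sequence(before) != tag_sequence(after):
        problems.append("tags were added, removed, or changed")
    if squeezed_text(before) != squeezed_text(after):
        problems.append("characters other than whitespace were added or removed")
    return problems

if __name__ == "__main__":
    before = "<p>lack of intel lectual <em>integration</em></p>"
    after = "<p>lack of intellectual <em>integration</em></p>"
    print(check_against_checkpoint(before, after))
```

Any intentional change beyond whitespace, such as a period added by step 3 of the prompt, would be reported too, which is useful as a prompt to review it by eye.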

Last edited by KevinH; 02-09-2026 at 10:08 PM.
Old 02-09-2026, 10:23 PM   #3
MikeMaloney
Good points all.
Old 02-10-2026, 02:47 PM   #4
MikeMaloney
Quote:
Originally Posted by KevinH

If you feel you must use an AI, then look for a local AI engine that does not phone home with your (or the author's) valuable intellectual property.

I work on a Mac and this morning I remembered that the Writing Tools AI is available system-wide. After selecting a paragraph of text I ran the proofread option and it fixed broken words and punctuation in one go. It also ignored any embedded tags such as <em>.

FYI - The texts I work with are mostly from the late 19th and early 20th centuries and in the public domain. I'm turning them into ePubs mostly to force myself to read them more closely during the proofreading step and to learn the markup language. I'm exploring AI very carefully, mostly asking the various LLMs about things I'm actively studying so that I can verify the responses. I don't blindly accept assertions from humans either.
Old 02-10-2026, 03:03 PM   #5
KevinH
I have no idea if Apple's Writing Tools AI phones home to work or is local.

FWIW, I have no problems with using AI with my own text, but I am very, very careful when working with other people's copyrighted data.

My tests using AI are mainly limited to trying to jump-start programming projects, but so far the results are pretty broken/bad, with missing API use, non-existent methods and routines being invoked, etc. So far using AI for programming has been more work than it is worth imho.

Last edited by KevinH; 02-10-2026 at 03:07 PM.
All times are GMT -4.