#1
Member
Posts: 21
Karma: 10
Join Date: Jan 2021
Device: iBooks

Using AI to clean up docx imports
While figuring out my embarrassing user error yesterday, Karellen observed:
Quote:
******
Role: You are a professional ePub formatting specialist and copyeditor.

Task: I am going to provide you with HTML code from an ePub file that was imported from a .docx file. The text contains "line-split" artifacts where words were broken across lines and spaces were inserted (e.g., "or ganic" instead of "organic").

Please perform the following steps:
1. Join Split Words: Identify and fix words that are clearly split by a space (e.g., "inter woven" → "interwoven", "suffi ciently" → "sufficiently").
2. Remove Leading Whitespace: Delete any &nbsp; or spaces at the immediate beginning of <p> tags.
3. Fix Punctuation Artifacts: Ensure sentences end with a period if the merge accidentally left one out.
4. Preserve HTML: Keep all tags like <em>, <strong>, and <a> exactly as they are.
5. Output: Provide the corrected text in a single code block so I can easily copy it back into Sigil.

Do you understand? If so, please ask me to provide the HTML code.
******

If you try this, comment to let us know how it works (or doesn't).

Cheers
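For anyone who would rather handle step 2 of the prompt without an AI, here is a minimal Python sketch. It assumes the unwanted leading whitespace consists of ordinary whitespace characters or the `&nbsp;` / `&#160;` entities, and that paragraphs open with a plain `<p ...>` tag. The function name `strip_leading_whitespace` is made up for illustration; it is not a Sigil feature.

```python
import re

# Matches an opening <p> tag (with any attributes) followed by one or more
# whitespace characters or non-breaking-space entities.
# \b keeps the pattern from matching other tags such as <pre>.
LEADING_WS = re.compile(r"(<p\b[^>]*>)(?:\s|&nbsp;|&#160;)+")


def strip_leading_whitespace(xhtml: str) -> str:
    """Remove whitespace that immediately follows an opening <p> tag.

    Hypothetical helper: everything else, including inline tags such as
    <em> and <strong>, is left untouched.
    """
    return LEADING_WS.sub(r"\1", xhtml)


if __name__ == "__main__":
    sample = '<p class="a">  &nbsp; Hello <em>world</em></p>'
    print(strip_leading_whitespace(sample))
    # prints: <p class="a">Hello <em>world</em></p>
```

Run it on a copy of the file first, since a regex knows nothing about context such as intentionally indented verse.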

#2
Sigil Developer
Posts: 9,413
Karma: 6733754
Join Date: Nov 2009
Device: many

Or just run it through a spellchecker to find previously hyphenated words with a space in between and fix them.
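A rough Python sketch of that spellchecker idea, for illustration only. It assumes you supply a word list; the tiny `known_words` set below is a stand-in for a real dictionary, and `find_split_words` is a hypothetical helper, not part of Sigil or any spellchecker. It only reports candidates, so the actual fix stays a manual decision.

```python
import re

# A word followed by exactly one space and another word. The second word sits
# in a lookahead so that overlapping pairs ("a b", "b c") are all examined.
PAIR = re.compile(r"\b([A-Za-z]+) (?=([A-Za-z]+)\b)")


def find_split_words(text, known_words):
    """Return (left, right, joined) tuples for likely split words.

    A pair is reported when the joined form is a known word and at least one
    of the two halves is not, e.g. "or ganic" -> "organic".
    """
    hits = []
    for match in PAIR.finditer(text):
        left, right = match.group(1), match.group(2)
        joined = left + right
        both_halves_known = (
            left.lower() in known_words and right.lower() in known_words
        )
        if joined.lower() in known_words and not both_halves_known:
            hits.append((left, right, joined))
    return hits


if __name__ == "__main__":
    # Stand-in word list; a real run would load a full dictionary file.
    known_words = {"the", "or", "organic", "garden", "woven", "interwoven"}
    print(find_split_words("the or ganic garden", known_words))
    # prints: [('or', 'ganic', 'organic')]
```

Pairs where both halves are real words (for example "in to") are deliberately skipped, because joining them automatically would create new errors.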
And to muddy the waters: make sure you are the author or copyright owner of the ebook before you submit its contents to a network-based (non-private) AI like Google's Gemini. Otherwise you are technically violating its copyright, and you could be charged for it in some countries. To make that clearer, many newly created epubs now explicitly include a statement, right in the text of the epub, that submitting the work in whole or in part to any AI violates copyright. Portions of that epub's text will be added to that AI's body of knowledge and, with the right prompts, can be extracted. Effectively, you have allowed a digital copy to be made that can be discovered by anyone.

If you feel you must use an AI, then look for a local AI engine that does not phone home with your (or the author's) valuable intellectual property.

And FYI, your prompt should have specified XHTML, not HTML. They do differ. Hopefully you made a checkpoint before you submitted it and compared your result to that earlier checkpoint to make sure nothing was inadvertently added or removed.

Last edited by KevinH; 02-09-2026 at 10:08 PM.
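The two checks mentioned above (XHTML rather than HTML, and comparing against a checkpoint) can be sketched in Python using only the standard library. This is an illustration under assumptions, not Sigil's own checkpoint mechanism: `is_well_formed` and `changed_words` are hypothetical names, and the parser used here rejects named entities such as `&nbsp;` unless they are declared, so treat a failure as a prompt to look closer.

```python
import difflib
import re
import xml.etree.ElementTree as ET


def is_well_formed(xhtml: str) -> bool:
    """True if the text parses as XML, which XHTML must (unlike loose HTML)."""
    try:
        ET.fromstring(xhtml.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


def words(xhtml: str):
    """Crudely strip tags and split the remaining text into words."""
    return re.sub(r"<[^>]+>", " ", xhtml).split()


def changed_words(before: str, after: str):
    """List every word-level difference between checkpoint and result.

    Each entry is (operation, words_before, words_after), where operation is
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    a, b = words(before), words(after)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    checkpoint = "<p>The or ganic garden</p>"
    returned = "<p>The organic garden</p>"
    print(is_well_formed(returned))
    for change in changed_words(checkpoint, returned):
        print(change)
```

Every reported change should be an intended join such as "or ganic" becoming "organic"; anything else was added or removed along the way.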

#3
Member
Posts: 21
Karma: 10
Join Date: Jan 2021
Device: iBooks

Good points all.

#4
Member
Posts: 21
Karma: 10
Join Date: Jan 2021
Device: iBooks

Quote:
FYI: The texts I work with are mostly from the late 19th and early 20th centuries and in the public domain. I'm turning them into ePubs mostly to force myself to read them more closely during the proofreading step and to learn the markup language. I'm exploring AI very carefully, mostly asking the various LLMs about things I'm actively studying so that I can verify the responses. I don't blindly accept assertions from humans either.

#5
Sigil Developer
Posts: 9,413
Karma: 6733754
Join Date: Nov 2009
Device: many

I have no idea if Apple's Writing Tools AI phones home to work or is local.
FWIW, I have no problems with using AI with my own text, but I am very, very careful when working with other people's copyrighted data. My tests using AI are mainly limited to trying to fast-start programming projects, but so far the results are pretty broken/bad, with missing API use, non-existent methods and routines being invoked, etc. So far, using AI for programming has been more work than it is worth, IMHO.

Last edited by KevinH; 02-10-2026 at 03:07 PM.