MobileRead Forums

-   -   Sigil plug-in idea (https://www.mobileread.com/forums/showthread.php?t=265830)

martyger 09-29-2015 09:12 PM

Sigil plug-in idea
 
Many of us do epub conversions from old pulp magazines -- mysteries from the 20s and 30s, SF from the 30s and 40s -- tens of thousands of stories that have never been republished and don't deserve to die. Even with the best software, the OCR generates many errors that need to be corrected manually. (Yellowed pages, ink bleeding, and old typefaces are the main causes.)

This can be done (laboriously) in Sigil with spellcheck...but it could be streamlined to a few seconds with a simple Sigil plug-in. Most of the errors recur with frightening regularity -- things like weU (well) presendy (presently) '/ (,") iie (he) Td ("I'd) bom (born) bum (burn) hps (lips) gendy (gently) and so on.

I can literally supply a list of many hundreds of these non-words that recur in nearly every pulp conversion. If we could run a plug-in that would automatically correct *all* of these errors *before* we spellcheck, we could cut proofing time by a huge margin. The plug-in would access a database that provides a list of error-words and the corresponding fixes.

I'm sure that we could come up with an initial list of many hundreds of errors...and if the plug-in could access a text file that the user can modify, they can add words for specialized conversions (medical, scientific, etc).

I hope someone thinks this is a good idea -- it sure as heck would help me.

Thanks.

Turtle91 09-29-2015 10:15 PM

That is a good idea. It seems there are similarly functioning plugins out there - checking words against a pre-made list - like the spell check function. I would recommend having the option to confirm with the user for words that actually are real words ("bum") before automatically changing them.

KevinH 09-29-2015 10:45 PM

If you can supply a list of words in a text file, one pair per line separated by a vertical pipe character:

Td|I'd

I would be happy to write a small program to sort and index the list and then walk the text of every xhtml file parsing the text word by word, and looking in the list to see if the word needs to be replaced and if so doing the replacement. Please make the list case sensitive.

The hardest part will actually be where to split the text of a sentence into words and dealing with all the punctuation pieces stuck to the end.
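A minimal sketch of the word-by-word pass described above, in Python (the language Sigil plugins are written in). It is not KevinH's program: the function names and the definition of a "word" (a run of letters, optionally with internal apostrophes) are assumptions made for illustration. It works on plain text only; a real plugin would also have to skip over XHTML tags and attributes.

```python
import re

def load_pairs(lines):
    """Parse 'wrong|right' lines into a dict (case sensitive)."""
    pairs = {}
    for line in lines:
        line = line.strip()
        if "|" not in line:
            continue  # skip blank or malformed lines
        wrong, right = line.split("|", 1)
        pairs[wrong] = right
    return pairs

# Assumed word definition: letters, with optional internal straight or
# curly apostrophes. Punctuation stuck to the end is left outside the match.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def fix_words(text, pairs):
    """Replace every word found in pairs; leave everything else untouched."""
    return WORD_RE.sub(lambda m: pairs.get(m.group(0), m.group(0)), text)

pairs = load_pairs(["Td|I'd", "weU|well", "presendy|presently"])
print(fix_words("Td go, weU, presendy.", pairs))
```

Because the regular expression only matches the letters, trailing commas and periods never become part of the lookup key, which sidesteps most of the punctuation problem.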

KevinH

theducks 09-29-2015 11:02 PM

This sounds wonderful.
But I would like to see it handled as 2 cases:
1) Sure-thing fixes (red-line words)
2) Fixes that need a context check (step through only), e.g. is it bum or burn?

Replace Options:
Curly/straight quotes

KevinH 09-29-2015 11:25 PM

Hi,
Easiest would be two plugins. The first handles non-word to word corrections fully automatically.

The second searches a list of word to word corrections, where it presents the word and its sentence to you and you say replace or not.

Alternatively for the word to word conversion, you could add a condition such as only replace if any of a short list of other keywords are within say 5 words of the target

Something along the lines of

bum|burn:hot,fire,ignite,scald,inferno,blaze,flame,heat

Effectively you are generating an automatic but context sensitive replacement.
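As a rough sketch of how such a rule could be applied, assuming the `wrong|right:keyword,keyword` format shown above and a window of 5 words on either side: the function names and the word pattern are hypothetical, and the code works on plain text rather than XHTML.

```python
import re

# Assumed word definition: letters with optional internal apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def parse_rule(line):
    """Split 'wrong|right:kw1,kw2' into (wrong, right, set of keywords)."""
    wrong, rest = line.strip().split("|", 1)
    right, _, kw = rest.partition(":")
    keywords = {k.strip().lower() for k in kw.split(",") if k.strip()}
    return wrong, right, keywords

def apply_rule(text, rule, window=5):
    """Replace 'wrong' only when a keyword occurs within `window` words."""
    wrong, right, keywords = rule
    matches = list(WORD_RE.finditer(text))
    words = [m.group(0) for m in matches]
    out, last = [], 0
    for i, m in enumerate(matches):
        if words[i] != wrong:
            continue
        nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        if any(w.lower() in keywords for w in nearby):
            out.append(text[last:m.start()])
            out.append(right)
            last = m.end()
    out.append(text[last:])
    return "".join(out)

rule = parse_rule("bum|burn:hot,fire,ignite,scald,inferno,blaze,flame,heat")
print(apply_rule("The fire will bum all night.", rule))
print(apply_rule("The old bum slept on a bench.", rule))
```

A rule with no keywords after the colon never fires in this sketch, so doubtful pairs stay untouched until someone supplies context words.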

A final proofread would always be needed, but you could have the plugin wrap each replaced word in span tags that turn it red. Then, before writing out the epub, remove those tags.
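One way the marking and later clean-up could look, as a sketch only. The class name `ocrfix` and the inline style are made up for this example, and the removal step assumes the marker was written exactly as `mark()` writes it and wraps a plain word with no nested tags.

```python
import re

# Hypothetical marker; any unique class name would do.
MARK_OPEN = '<span class="ocrfix" style="color: red;">'
MARK_CLOSE = "</span>"

def mark(word):
    """Wrap a replaced word so it shows up in red while proofreading."""
    return MARK_OPEN + word + MARK_CLOSE

MARK_RE = re.compile(
    re.escape(MARK_OPEN) + r"(.*?)" + re.escape(MARK_CLOSE), re.DOTALL
)

def unmark(xhtml):
    """Strip the proofreading markers, keeping the words inside them."""
    return MARK_RE.sub(r"\1", xhtml)

marked = "<p>It will " + mark("burn") + " tonight.</p>"
print(marked)
print(unmark(marked))
```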

Creating such word lists could in fact be crowdsourced.

KevinH

P.S. Things like this are why I designed and added the plugin interface to begin with. It is perfect for automating cleanups.

BetterRed 09-30-2015 04:02 AM

IIRC, Toxaris said he might think about 'porting' some of the features of his EPUB Tools Word Addin to a Sigil plugin. Its Search and Replace and Dialogue Checker features are obvious candidates.

gipsy 09-30-2015 11:17 AM

The "Plugin for tidying ePub files" fixes many of those errors.
We can simply see what else we can fix without problems :)

For example, in my last test with the Π fixes with a dictionary...
I must find a way to bypass the fix for some words.
My code finds the word "ΓΙΟΥ" and changes it to "ΠΟΥ". But they are both correct.

CalibUser 09-30-2015 02:26 PM

I wrote the "Plugin for tidying ePub files" for this reason. I have magazines and old books that I read in as PDF files and then need to correct a set of common misspellings. This plugin includes the ability to correct some misspelt words such as Tve, Fd, Til, Fve, Fm, Vm, Tm, tlieir, lli, words that should not be hyphenated, apostrophes that are the wrong way round and other fixes. It should be possible to extend this to work with customised lists of words.

I will look at extending it so that it reads a list of common errors in misspelt words from a file and corrects them when I have time....I am working on a different project at the moment.

martyger 09-30-2015 06:38 PM

I have been keeping a running list, but I have asked some friends for additions.

Also, I think most changes should be automatic, while others should offer a "spellcheck-like" set of options. For example, Td might be "I'd or I'd. Also, straight and curly quotes would have to be taken into account.

The list will have to have many case-specific fixes such as weU and WeU (but I can build that into the master list). I'm sure others will think of other things as well.

Also, a frequent error is a word ending with a capital L -- alL -- this is always all.

A search and replace of [a-z]L to l would be nice too.
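For reference, the same idea as a Python regular expression. This is a sketch with one added assumption: the `\b` word boundary limits the fix to a capital L at the very end of a word, as in alL, so an L in the middle of a word is left alone.

```python
import re

# A capital L that follows a lowercase letter and ends the word.
TRAILING_L = re.compile(r"(?<=[a-z])L\b")

def fix_trailing_L(text):
    """Turn a word-final capital L after a lowercase letter into 'l'."""
    return TRAILING_L.sub("l", text)

print(fix_trailing_L("That is alL, he said."))
```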

rubeus 10-01-2015 12:43 AM

Simple word replacements can be stored as saved searches, added to a group and then executed as a whole by executing the whole group. Can't see a plug-in for this as this functionality is already present.

martyger 10-01-2015 09:19 AM

Quote:

Originally Posted by rubeus (Post 3180300)
Simple word replacements can be stored as saved searches, added to a group and then executed as a whole by executing the whole group. Can't see a plug-in for this as this functionality is already present.

This might work. It's going to be a little laborious to do the initial population of two or three hundred, but I'll give it a shot:

2\Name="pulp errors/bom"
2\Find=" bom "
2\Replace=" born "
3\Name=pulp errors/L
3\Find=([a-z])L
3\Replace=\\1l.

If no one sees any flaws in this, I'll create the list and post the Pulp Errors Group text here so folks can just pop it into their sigil_searches.ini file.
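To take some of the labor out of populating the group, the entries could be generated from a word list. The sketch below only reproduces the three `N\Name`, `N\Find`, `N\Replace` lines in the shape shown above; the function name is hypothetical, and whatever else sigil_searches.ini needs around those lines (section headers, entry counts) is not covered here and would have to be checked against a real file.

```python
def to_saved_searches(pairs, group="pulp errors", start=1):
    """Build N\\Name / N\\Find / N\\Replace lines for simple word pairs."""
    lines = []
    for n, (wrong, right) in enumerate(pairs, start):
        for value in (wrong, right):
            # Quotes and backslashes would need escaping; not handled here.
            if '"' in value or "\\" in value:
                raise ValueError("unsupported character in: " + value)
        lines.append('{}\\Name="{}/{}"'.format(n, group, wrong))
        lines.append('{}\\Find=" {} "'.format(n, wrong))
        lines.append('{}\\Replace=" {} "'.format(n, right))
    return "\n".join(lines)

print(to_saved_searches([("bom", "born"), ("hps", "lips")], start=2))
```

One thing to watch with the space-delimited Find strings: a literal search for " bom " will not match the word when it is followed by a comma or period, or when it sits at the start or end of a paragraph.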

CalibUser 10-01-2015 02:39 PM

I have updated my plugin "Plugin for tidying ePub files" at https://www.mobileread.com/forums/sho...d.php?t=264378 to enable a list of commonly misspelt words and their corrections in a separate file to be processed. The plugin uses the convention suggested by KevinH. Currently this plugin will change words that have been misspelt automatically, but not words where a context check is needed.

@theducks: DiapDealer has developed a plugin that will turn straight quotes to curly quotes at https://www.mobileread.com/forums/sho...d.php?t=247088


While I appreciate that there is little point in using a plugin solely for correcting words when Sigil has a built-in function for executing a group of saved searches, my plugin can do more than this. For example, it can process chapter headings (making them uppercase, mixed case, etc.) and strip out unwanted tags at the same time, allowing the user to apply different options to different ePubs. It also has a "bolt-in" image resizer to change the size of an image if it is too small for the cover page.

martyger 10-01-2015 03:30 PM

Quote:

Originally Posted by CalibUser (Post 3180642)
I have updated my plugin "Plugin for tidying ePub files" at https://www.mobileread.com/forums/sho...d.php?t=264378 to enable a list of commonly misspelt words and their corrections in a separate file to be processed. The plugin uses the convention suggested by KevinH. Currently this plugin will change words that have been misspelt automatically, but not words where a context check is needed.

@theducks: DiapDealer has developed a plugin that will turn straight quotes to curly quotes at https://www.mobileread.com/forums/sho...d.php?t=247088


While I appreciate that there is little point in using a plugin solely for correcting words when Sigil has a built-in function for executing a group of saved searches, my plugin can do more than this. For example, it can process chapter headings (making them uppercase, mixed case, etc.) and strip out unwanted tags at the same time, allowing the user to apply different options to different ePubs. It also has a "bolt-in" image resizer to change the size of an image if it is too small for the cover page.

Sounds perfect. Thank you. But I tried adding it and got an Invalid Plug-in Error. I have Python 2.7 and 3.5 and both are pointing to the correct executable. (Also, tried it on Sigil 8.7 and 8.9.)

CalibUser 10-01-2015 04:01 PM

@martyger: Another user has reported the same problem. It worked on my Windows 7 system. I will have another look at the code to find out what is happening.

CalibUser 10-01-2015 04:12 PM

The plugin should work now. The filename was wrong: it did not match the name given in the plugin's XML file.


All times are GMT -4.
