Old 09-29-2015, 08:12 PM   #1
martyger
Member
Posts: 11
Karma: 10
Join Date: Dec 2013
Device: none
Sigil plug-in idea

Many of us do epub conversions from old pulp magazines -- mysteries from the 20s and 30s, SF from the 30s and 40s -- tens of thousands of stories that have never been republished and don't deserve to die. Even with the best software, the OCR generates many errors that need to be corrected manually. (Yellowed pages, ink bleeding, and old typefaces are the main causes.)

This can be done (laboriously) in Sigil with spellcheck...but it could be streamlined to a few seconds with a simple Sigil plug-in. Most of the errors recur with frightening regularity -- things like weU (well) presendy (presently) '/ (,") iie (he) Td ("I'd) bom (born) bum (burn) hps (lips) gendy (gently) and so on.

I can literally supply a list of many hundreds of these non-words that recur in nearly every pulp conversion. If we could run a plug-in that would automatically correct *all* of these errors *before* we spellcheck, we could cut proofing time by a huge margin. The plug-in would access a database that provides a list of error-words and the corresponding fix.

I'm sure that we could come up with an initial list of many hundreds of errors...and if the plug-in could access a text file that the user can modify, they can add words for specialized conversions (medical, scientific, etc).

I hope someone thinks this is a good idea -- it sure as heck would help me.

Thanks.
Old 09-29-2015, 09:15 PM   #2
Turtle91
A Hairy Wizard
Posts: 3,394
Karma: 20212733
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
That is a good idea. It seems there are similarly functioning plugins out there - checking words against a pre-made list - like the spell check function. I would recommend having the option to confirm with the user for words that actually are real words ("bum") before automatically changing them.
Old 09-29-2015, 09:45 PM   #3
KevinH
Sigil Developer
Posts: 9,070
Karma: 6361556
Join Date: Nov 2009
Device: many
If you can supply a list of words in a text file, one pair per line separated by a vertical pipe character:

Td|I'd

I would be happy to write a small program to sort and index the list and then walk the text of every xhtml file parsing the text word by word, and looking in the list to see if the word needs to be replaced and if so doing the replacement. Please make the list case sensitive.

The hardest part will actually be where to split the text of a sentence into words and dealing with all the punctuation pieces stuck to the end.
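As a rough illustration of the approach described above, here is a minimal Python sketch. It is not KevinH's program: the function names (`load_pairs`, `fix_text`) and the word pattern are assumptions made for this example, and it works on plain text only. A real plugin would also have to skip XHTML tags and attributes.

```python
import re

def load_pairs(lines):
    """Parse 'wrong|right' lines into a dict, skipping blank or malformed lines."""
    pairs = {}
    for line in lines:
        line = line.strip()
        if not line or "|" not in line:
            continue
        wrong, right = line.split("|", 1)
        pairs[wrong] = right
    return pairs

# Assumed definition of a word: a run of letters, optionally with
# internal apostrophes (I'd, don't). Surrounding punctuation is not part of it.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def fix_text(text, pairs):
    """Replace whole words found in pairs (case-sensitive); leave everything else alone."""
    return WORD_RE.sub(lambda m: pairs.get(m.group(0), m.group(0)), text)

pairs = load_pairs(["Td|I'd", "weU|well", "presendy|presently"])
print(fix_text('"Td say weU, presendy," he said.', pairs))
```

Because the match is on whole words, punctuation stuck to the end of a word stays where it is, and a non-word that happens to sit inside a longer word is not touched.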

KevinH
Old 09-29-2015, 10:02 PM   #4
theducks
Well trained by Cats
Posts: 31,241
Karma: 61360164
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
This sounds wonderful.
But I would like to see it handled as 2 cases:
1) Sure-thing fixes (red-line words)
2) Context check required (step-through only) fixes, e.g. is it bum or burn?


Replace Options:
Curly/straight quotes
Old 09-29-2015, 10:25 PM   #5
KevinH
Sigil Developer
Posts: 9,070
Karma: 6361556
Join Date: Nov 2009
Device: many
Hi,
Easiest would be two plugins. The first handles non-word to word corrections fully automatically.

The second searches a list of word to word corrections, where it presents the word and its sentence to you and you say replace or not.
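The review step of that second plugin might look something like the sketch below. Everything here is an assumption for illustration: the name `find_candidates`, the naive sentence splitter, and the choice to return plain tuples instead of showing a dialog.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
# Naive sentence splitter: text up to the next run of . ! or ?
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")

def find_candidates(text, pairs):
    """Yield (word, suggestion, sentence) for every listed word, for the user to accept or reject."""
    for sentence_match in SENTENCE_RE.finditer(text):
        sentence = sentence_match.group(0).strip()
        for word_match in WORD_RE.finditer(sentence):
            word = word_match.group(0)
            if word in pairs:
                yield word, pairs[word], sentence

text = "He was a bum. She let the candle bum low."
for word, suggestion, sentence in find_candidates(text, {"bum": "burn"}):
    print(f"{word} -> {suggestion}? {sentence}")
```

The actual replacement would only happen after the user answers for each candidate; this sketch covers only the part that collects the questions.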

Alternatively, for the word to word conversion, you could add a condition such as only replacing if any of a short list of other keywords is within, say, 5 words of the target.

Something along the lines of

bum|burn:hot,fire,ignite,scald,inferno,blaze,flame,heat

Effectively you are generating an automatic but context sensitive replacement.
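A minimal sketch of how such a rule could be parsed and applied, assuming the rule syntax shown above. The names `parse_rule` and `apply_rule`, the word pattern, and the case-insensitive keyword matching are choices made for this example only.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def parse_rule(line):
    """Split 'wrong|right:kw1,kw2,...' into (wrong, right, set of lowercase keywords)."""
    pair, _, keyword_part = line.partition(":")
    wrong, right = pair.split("|", 1)
    keywords = {k.strip().lower() for k in keyword_part.split(",") if k.strip()}
    return wrong, right, keywords

def apply_rule(text, rule, window=5):
    """Replace 'wrong' only when a keyword occurs within 'window' words on either side."""
    wrong, right, keywords = rule
    matches = list(WORD_RE.finditer(text))
    words = [m.group(0).lower() for m in matches]
    out, last = [], 0
    for i, m in enumerate(matches):
        if m.group(0) != wrong:
            continue
        nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        if keywords.intersection(nearby):
            out.append(text[last:m.start()])
            out.append(right)
            last = m.end()
    out.append(text[last:])
    return "".join(out)

rule = parse_rule("bum|burn:hot,fire,ignite,scald,inferno,blaze,flame,heat")
print(apply_rule("The fire would bum all night.", rule))
print(apply_rule("The bum slept on a bench.", rule))
```

The first sentence is changed because "fire" is two words away from the target; the second is left alone because no keyword is nearby.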

A final proofread would always be needed, but you could have the plugin wrap the replaced word in span tags that turn it red. Then, before writing out the epub, remove those created tags.
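One possible way to do the marking and the later cleanup, sketched in Python. The class name `ocr-fix` and the helper names are made up for this example; the cleanup assumes the marked text is a plain word with no nested tags.

```python
import re

# Hypothetical marker: a span with a class the plugin can find again later.
MARK_OPEN = '<span class="ocr-fix" style="color:red">'
MARK_CLOSE = "</span>"

def mark(word):
    """Wrap a replaced word so it shows up in red while proofreading."""
    return f"{MARK_OPEN}{word}{MARK_CLOSE}"

# Matches only spans carrying the marker class; other spans are left alone.
STRIP_RE = re.compile(r'<span class="ocr-fix"[^>]*>(.*?)</span>', re.DOTALL)

def strip_marks(xhtml):
    """Remove the marker spans, keeping the word inside them."""
    return STRIP_RE.sub(r"\1", xhtml)

marked = f"<p>She was {mark('born')} in <span>1901</span>.</p>"
print(marked)
print(strip_marks(marked))
```

Keying the cleanup on a dedicated class means spans that were already in the book survive untouched.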

Creating such wordlists could in fact be crowdsourced.

KevinH

PS. Things like this are why I designed and added the plugin interface to begin with. It is perfect for automating cleanups.

Last edited by KevinH; 09-29-2015 at 10:27 PM.
Old 09-30-2015, 03:02 AM   #6
BetterRed
null operator (he/him)
Posts: 22,006
Karma: 30277294
Join Date: Mar 2012
Location: Sydney Australia
Device: none
IIRC, Toxaris said he might think about 'porting' some of the features of his EPUB Tools Word Addin to a Sigil plugin. Its Search and Replace and Dialogue Checker features are obvious candidates.
Old 09-30-2015, 10:17 AM   #7
gipsy
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
The Plugin for tidying ePub files fixes many of those errors.
We can simply see what else we can fix without problems.

For example, in my last test with the Π fixes with dictionary...
I must find a way to bypass the fix for some words.
My code finds the word "ΓΙΟΥ" and changes it to "ΠΟΥ". But they are both correct.

Last edited by gipsy; 09-30-2015 at 10:24 AM.
Old 09-30-2015, 01:26 PM   #8
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
I wrote the "Plugin for tidying ePub files" for this reason. I have magazines and old books that I read in as PDF files and then need to correct a set of common misspellings. This plugin includes the ability to correct some misspelt words such as Tve, Fd, Til, Fve, Fm, Vm, Tm, tlieir, lli, words that should not be hyphenated, apostrophes that are the wrong way round and other fixes. It should be possible to extend this to work with customised lists of words.

I will look at extending it so that it reads a list of common errors in misspelt words from a file and corrects them when I have time....I am working on a different project at the moment.
Old 09-30-2015, 05:38 PM   #9
martyger
Member
Posts: 11
Karma: 10
Join Date: Dec 2013
Device: none
I have been keeping a running list, but I have asked some friends for additions.

Also, I think most changes should be automatic, while others should offer a "spellcheck-like" set of options. For example, Td might be "I'd or I'd. Also, straight and curly quotes would have to be taken into account.

The list will have to have many case-specific fixes such as weU and WeU (but I can build that into the master list). I'm sure others will think of other things as well.

Also, a frequent error is a word ending with a capital L -- alL -- this is always all.

A search and replace of [a-z]L to l would be nice too.
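For anyone who wants to try that last fix outside Sigil, here is a small Python sketch. The word-boundary check is an addition that is not in the post above; it is there so that a capital L in the middle of a name is left alone.

```python
import re

# A capital L that follows a lowercase letter and ends the word.
# The \b (end of word) condition is an assumption added for safety.
TRAILING_L = re.compile(r"(?<=[a-z])L\b")

def fix_trailing_l(text):
    """Turn a word-final capital L after a lowercase letter into a lowercase l."""
    return TRAILING_L.sub("l", text)

print(fix_trailing_l("It was alL good, said McLean."))
```

Here "alL" becomes "all", while "McLean" is untouched because its L is not at the end of the word.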

Last edited by martyger; 09-30-2015 at 08:01 PM. Reason: potential added feature to the plug-in
Old 09-30-2015, 11:43 PM   #10
rubeus
Banned
Posts: 272
Karma: 1224588
Join Date: Sep 2014
Device: Sony PRS 650
Simple word replacements can be stored as saved searches, added to a group and then executed as a whole by executing the whole group. Can't see a plug-in for this as this functionality is already present.
Old 10-01-2015, 08:19 AM   #11
martyger
Member
Posts: 11
Karma: 10
Join Date: Dec 2013
Device: none
Quote:
Originally Posted by rubeus View Post
Simple word replacements can be stored as saved searches, added to a group and then executed as a whole by executing the whole group. Can't see a plug-in for this as this functionality is already present.
This might work. It's going to be a little laborious to do the initial population of two or three hundred, but I'll give it a shot:

2\Name="pulp errors/bom"
2\Find=" bom "
2\Replace=" born "
3\Name=pulp errors/L
3\Find=([a-z])L
3\Replace=\\1l

If no one sees any flaws in this, I'll create the list and post the Pulp Errors Group text here so folks can just pop it into their sigil_searches.ini file.
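To cut down on the typing for the initial two or three hundred entries, the list could be generated from pairs with a short Python script. This is only a sketch: it copies the entry layout of the snippet above (space-padded Find and Replace, one number per entry), the function name `to_search_entries` is made up, and whether Sigil accepts the result unchanged has not been tested here. Pairs containing a double quote or a backslash are skipped, because this sketch does not handle the escaping they would need.

```python
def to_search_entries(pairs, group="pulp errors", start=1):
    """Build saved-search lines in the same layout as the hand-written entries."""
    lines = []
    n = start
    for wrong, right in pairs:
        if any(ch in wrong + right for ch in '"\\'):
            continue  # escaping for these characters is not covered by this sketch
        lines.append(f'{n}\\Name="{group}/{wrong}"')
        lines.append(f'{n}\\Find=" {wrong} "')
        lines.append(f'{n}\\Replace=" {right} "')
        n += 1
    return lines

for line in to_search_entries([("bom", "born"), ("hps", "lips")], start=2):
    print(line)
```

Note that the space-padded form only matches a word with a space on both sides, so a word followed by a comma or a period would be missed.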

Last edited by martyger; 10-01-2015 at 08:30 AM.
Old 10-01-2015, 01:39 PM   #12
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
I have updated my plugin "Plugin for tidying ePub files" at https://www.mobileread.com/forums/sho...d.php?t=264378 to enable a list of commonly misspelt words and their corrections in a separate file to be processed. The plugin uses the convention suggested by KevinH. Currently this plugin will change words that have been misspelt automatically, but not words where a context check is needed.

@theducks: DiapDealer has developed a plugin that will turn straight quotes to curly quotes at https://www.mobileread.com/forums/sho...d.php?t=247088


While I appreciate that there is little point in using a plugin solely for correcting words when Sigil has a built-in function for executing a group of saved searches, my plugin can do more than this. For example, it can process chapter headings, making them uppercase, mixed case etc. and strip out unwanted tags at the same time, allowing the user to apply different options to different ePubs. It also has a "bolt-in" image resizer to change the size of an image if it is too small for the cover page.
Old 10-01-2015, 02:30 PM   #13
martyger
Member
Posts: 11
Karma: 10
Join Date: Dec 2013
Device: none
Quote:
Originally Posted by CalibUser View Post
I have updated my plugin "Plugin for tidying ePub files" at https://www.mobileread.com/forums/sho...d.php?t=264378 to enable a list of commonly misspelt words and their corrections in a separate file to be processed. The plugin uses the convention suggested by KevinH. Currently this plugin will change words that have been misspelt automatically, but not words where a context check is needed.

@theducks: DiapDealer has developed a plugin that will turn straight quotes to curly quotes at https://www.mobileread.com/forums/sho...d.php?t=247088


While I appreciate that there is little point in using a plugin solely for correcting words when Sigil has a built-in function for executing a group of saved searches, my plugin can do more than this. For example, it can process chapter headings, making them uppercase, mixed case etc. and strip out unwanted tags at the same time, allowing the user to apply different options to different ePubs. It also has a "bolt-in" image resizer to change the size of an image if it is too small for the cover page.
Sounds perfect. Thank you. But I tried adding it and got an Invalid Plug-in Error. I have Python 2.7 and 3.5 and both are pointing to the correct executable. (Also, tried it on Sigil 8.7 and 8.9.)
Old 10-01-2015, 03:01 PM   #14
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
@martyger: Another user has reported the same problem. It worked on my Windows 7 system. I will have another look at the code to find out what is happening.
Old 10-01-2015, 03:12 PM   #15
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
The plugin should work now - there was an error in the filename, which did not match the XML file in the plugin.