Old 07-16-2020, 07:16 AM   #196
democrite
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
I speak not of common typos as you could deduce from previous words. As for correcting hyphenation by using the EPUB itself as a dictionary, as mentioned, many academic and scientific works could include innumerable terms not in any general dictionary, thus I made a word list from the EPUB itself to use as a custom dictionary. Such would be nice to automate someday if you'd consider it. For the work I recently converted, the log reported 20,000+ hyphens removed.
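A minimal sketch of how such a word list could be compiled automatically, assuming the EPUB's XHTML files have already been read into strings. The names `TextExtractor` and `build_word_list` are hypothetical and are not part of the plugin.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an XHTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def build_word_list(xhtml_documents):
    """Return the sorted, lower-cased words found in the given XHTML strings."""
    words = set()
    for xhtml in xhtml_documents:
        parser = TextExtractor()
        parser.feed(xhtml)
        parser.close()
        # Join without separators so an inline tag inside a word does not split it.
        text = "".join(parser.chunks)
        # [^\W\d_]+ matches runs of letters, including accented ones.
        words.update(w.lower() for w in re.findall(r"[^\W\d_]+", text))
    return sorted(words)


sample = ["<p>Areca triandra grows in <em>humid</em> forests.</p>"]
print(build_word_list(sample))
```

The resulting list could be saved one word per line and used as a custom dictionary; whether the plugin accepts a file in that form is an assumption.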
Old 07-17-2020, 04:31 AM   #197
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
Quote:
Originally Posted by democrite View Post
For the work I recently converted, the log reported 20,000+ hyphens removed.
I can see the problem - 20,000 words is a lot to cope with. I don't have time at the moment to work on this. Rather than adding more code to the original plugin, I may, at some point in the future, create an auxiliary plugin that will create a dictionary of acceptable hyphenated words automatically that can be used with this plugin.
Old 07-20-2020, 11:49 AM   #198
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
Sigil update 20th July 2020

Quote:
Originally Posted by democrite View Post
For the work I recently converted, the log reported 20,000+ hyphens removed.
I have managed to update the plugin so that it compiles a list of hyphenated words from an ePub and appends these to an existing file of hyphenated words. I have placed the updated plugin in the first post in this thread. I am not sure how quick it will be in going through 20,000 words, but I hope this will meet your requirements.
Old 07-21-2020, 10:35 PM   #199
democrite
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
Thank you very much. I have been confused a bit by what you mean by hyphenated words. My request has been for removing hyphens followed by a space from words that are not in a dictionary but do occur in the source EPUB, e.g. "Areca tri- andra", which comes from PDF line breaks. Such is what your plugin does for dictionary words; I merely ask for inclusion of terms from the source EPUB. As for terms that should be hyphenated, there could be some such as "green-yel- low" and perhaps that's what you've been referring to. There could also be cases such as "anti- inflammatory", where in my recent case the EPUB uses both spellings, with and without hyphenation, and I'd prefer correction to the most commonly used one if possible. As the author may use such orthography, perhaps in general it might be better someday to use the source EPUB first for corrections before matching against a dictionary.
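The request above could be sketched roughly as follows: count every spelling that occurs in the book, then resolve each "hyphen plus space" break to whichever form (joined or hyphenated) the book itself uses most, leaving the break alone when the book gives no evidence. This is only an illustration under those assumptions; `fix_line_break_hyphens` is a hypothetical name and the code works on plain text, not on the plugin's data.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by internal hyphens (e.g. "green-yellow").
TOKEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# A word fragment, a hyphen, whitespace, then the continuation of the word.
BROKEN = re.compile(r"([^\W\d_]+(?:-[^\W\d_]+)*)-\s+([^\W\d_]+)")


def fix_line_break_hyphens(text):
    """Rejoin 'tri- andra' style breaks using spellings found elsewhere in text."""
    counts = Counter(t.lower() for t in TOKEN.findall(text))

    def resolve(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        n_joined = counts[joined.lower()]
        n_hyphenated = counts[hyphenated.lower()]
        if n_joined == 0 and n_hyphenated == 0:
            return match.group(0)  # no evidence in the book: leave untouched
        # Prefer the spelling the book uses more often (ties go to the joined form).
        return joined if n_joined >= n_hyphenated else hyphenated

    return BROKEN.sub(resolve, text)


sample = (
    "Areca triandra is a palm. The fruit of Areca tri- andra is "
    "green-yellow, rarely green-yel- low. A rem- edy."
)
print(fix_line_break_hyphens(sample))
```

In this sample "tri- andra" and "green-yel- low" are repaired because "triandra" and "green-yellow" occur elsewhere in the text, while "rem- edy" is left as it is because neither "remedy" nor "rem-edy" appears; a general dictionary would be needed as a fallback for such cases.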

Last edited by democrite; 07-21-2020 at 10:44 PM.
Old 07-22-2020, 08:06 AM   #200
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
@democrite:

Quote:
Originally Posted by democrite View Post
I have been confused a bit by what you mean by hyphenated words.
Quote:
Originally Posted by democrite View Post
As for terms that should be hyphenated, there could be some such as "green-yel- low" and perhaps that's what you've been referring to.
By hyphenated words I am referring to two correct words that are joined by a hyphen, e.g. non-stop. I was not referring to "green-yel- low" or "Areca tri- andra" as examples of hyphenated words, as these contain errors.

Quote:
Originally Posted by democrite View Post
My request has been for removing hyphens followed by a space from words that are not in a dictionary but do occur in the source EPUB, e.g. "Areca tri- andra", which comes from PDF line breaks.
I cannot find any mention of removing hyphens followed by a space from words in any of your previous posts. However, you did write, in two previous posts:

Quote:
Originally Posted by democrite View Post
I recently found and used this plugin solely for hyphenation. On that note, calibre uses the eBook itself, scanning for words and compiling a dictionary. Would you someday consider such a feature?
and

Quote:
Originally Posted by democrite View Post
As for correcting hyphenation by using the EPUB itself as a dictionary, as mentioned, many academic and scientific works could include innumerable terms not in any general dictionary, thus I made a word list from the EPUB itself to use as a custom dictionary. Such would be nice to automate someday if you'd consider it.
This is why I enabled the plugin to make a list of hyphenated words from an ePub and then allow the user to select which of these should be added to a dictionary (file) of hyphenated words in which the hyphen needs to be retained.
Old 07-22-2020, 09:24 PM   #201
democrite
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
I'm sorry for the mixup. I had not read the description and posts carefully enough to understand that the plugin checks hyphenated words and removes the hyphen if the joined term is in the dictionary. I have always been referring to end-of-line PDF hyphens, which result in a hyphen followed by a space. Many such words might not be in a dictionary, as mentioned, and handling them has been what I've wanted all along.

I would guess then the recent changes do not compile a list of unhyphenated terms from the source EPUB and then check for hyphens followed by a space to see if such should be corrected?

As for what the plugin primarily does, check hyphenated words and remove the hyphen if such a term is in a dictionary, I am not sure why people would want such a thing. OCR apps as far as I know in the years that I've used them do not make such errors.
Old 07-23-2020, 05:28 AM   #202
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
@democrite:

Quote:
Originally Posted by democrite View Post
I'm sorry for the mixup.
No worries.

Quote:
Originally Posted by democrite View Post
I would guess then the recent changes do not compile a list of unhyphenated terms from the source EPUB and then check for hyphens followed by a space to see if such should be corrected?
Correct - recent changes compile a list of hyphenated words from the source EPUB.

Quote:
Originally Posted by democrite View Post
As for what the plugin primarily does, check hyphenated words and remove the hyphen if such a term is in a dictionary, I am not sure why people would want such a thing. OCR apps as far as I know in the years that I've used them do not make such errors.
The plugin is designed for as many different OCR readers as possible; some OCR software can hyphenate words that are not normally hyphenated. One feature of the plugin is to examine hyphenated words and find out if, when removing the hyphen and joining the two separate words together, the word that is formed exists in the Hunspell dictionary. If it does, the plugin assumes that the hyphenated word should not be hyphenated and replaces it with the non-hyphenated version.

I find this is particularly useful with older publications that include hyphens where we would not use them now. For example, 'today' appears as 'to-day' in older books and magazines; the hyphen is not used in modern texts, so my plugin would reduce the word to 'today'. Unfortunately, the plugin can remove hyphens where they need to be retained, so the latest version of the plugin gives the option of adding hyphenated words to a list of words in which the hyphen must be retained. This also means that if, for example, one wants to keep the original format (with hyphens) of the scanned text, one could create a file of these words with the latest version of the plugin, so that the hyphen in, for example, 'to-day' can be retained for historical reasons.
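The behaviour described above can be illustrated with a small sketch. It is not the plugin's code: a plain Python set stands in for the Hunspell lookup, and `dehyphenate`, `dictionary` and `keep` are hypothetical names chosen for the example.

```python
import re

# Two runs of letters joined by a single hyphen, e.g. "to-day".
HYPHENATED = re.compile(r"\b([^\W\d_]+)-([^\W\d_]+)\b")


def dehyphenate(text, dictionary, keep):
    """Drop the hyphen when the joined word is in `dictionary`, unless listed in `keep`."""

    def replace(match):
        original = match.group(0)
        joined = match.group(1) + match.group(2)
        if original.lower() in keep:
            return original  # user asked for this hyphen to be retained
        if joined.lower() in dictionary:
            return joined  # the joined form is a known word
        return original

    return HYPHENATED.sub(replace, text)


dictionary = {"today", "tomorrow"}  # stand-in for a Hunspell dictionary
print(dehyphenate("We leave to-day, non-stop.", dictionary, keep=set()))
print(dehyphenate("We leave to-day, non-stop.", dictionary, keep={"to-day"}))
```

With an empty `keep` set, 'to-day' becomes 'today' while 'non-stop' is untouched because 'nonstop' is not in the stand-in dictionary; adding 'to-day' to `keep` preserves the original spelling.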

There is a fairly simple solution to removing spaces after hyphens using Sigil's own search/replace facility. Use:

Find: [ ]?-[ ]?

Replace: -

This will remove a single space on either side of a hyphen. You could add this to Sigil's Saved Searches so that you can retrieve it when you need it.
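For anyone who wants to try the expression outside Sigil, the same pattern can be run with Python's `re` module; for a pattern this simple the behaviour should match Sigil's regex search, though that equivalence is assumed here and not tested in Sigil. The function name `tighten_hyphens` is made up for the example.

```python
import re


def tighten_hyphens(text):
    """Apply the Find/Replace pair above: drop one optional space on each side of a hyphen."""
    return re.sub(r"[ ]?-[ ]?", "-", text)


print(tighten_hyphens("Areca tri- andra"))     # Areca tri-andra
print(tighten_hyphens("a pause - like this"))  # a pause-like this
```

Note that the hyphen itself is kept, so "tri- andra" becomes "tri-andra" and not "triandra", and that a deliberately spaced hyphen is closed up as well.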

I did not include code to do this in the plugin because some books use (perhaps incorrectly) the normal hyphen with spaces in front and behind the hyphen in the text on purpose.
Old 07-23-2020, 09:56 AM   #203
thosp
Member
Posts: 11
Karma: 10
Join Date: Dec 2014
Device: laptop & tablet
Following are two examples of errors produced by ePubTidyTool_v3.0.1.0...; in both cases the trailing space AFTER a </em> tag is removed. This occurred in Sigil 1.2.0 on Windows 10.

BEFORE
“The damned hurry” was the <em>size</em> of the anomaly at Siple Island. The agency deputy director sat at his desk, reviewing the report summary.

AFTER
“The damned hurry” was the <em>size</em>of the anomaly at Siple Island. The agency deputy director sat at his desk, reviewing the report summary.

***********
BEFORE
<em>Damned right, he will, </em> Deputy director Jameson thought to himself.

AFTER
<em>Damned right, he will,</em>Deputy director Jameson thought to himself.

Thank You
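For comparison, here is a sketch of a whitespace clean-up that would not glue words together in the two examples above: it moves a space sitting just inside a closing </em> or </i> to just outside the tag and otherwise leaves the text alone. This is an assumption about what a safe rule could look like, not the plugin's actual code, and `move_trailing_space_out` is a hypothetical name.

```python
import re

# Spaces or tabs immediately before a closing </em> or </i>, plus any right after it.
TRAILING_SPACE_IN_TAG = re.compile(r"[ \t]+(</(?:em|i)>)[ \t]*")


def move_trailing_space_out(xhtml):
    """Move whitespace found just before </em> or </i> to just after the tag."""
    return TRAILING_SPACE_IN_TAG.sub(r"\1 ", xhtml)


print(move_trailing_space_out("<em>Damned right, he will, </em> Deputy director Jameson"))
print(move_trailing_space_out("was the <em>size</em> of the anomaly"))
```

The first line comes out as "<em>Damned right, he will,</em> Deputy director Jameson" and the second is unchanged. A closing tag followed directly by punctuation would still need separate handling.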
Old 07-23-2020, 11:23 AM   #204
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
@thosp: Thank you for pointing out this issue. I will put a correction in the first post of this thread.

Last edited by CalibUser; 07-23-2020 at 11:29 AM.
Old 07-23-2020, 05:44 PM   #205
democrite
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
Quote:
Originally Posted by CalibUser View Post
The plugin is designed for as many different OCR readers as possible; some OCR software can hyphenate words that are not normally hyphenated. …
I now understand that such is perhaps the original design of the plugin. Would you consider making such optional? As I've mostly used FineReader, and I think most do, I do not recall it ever making such decisions. I also prefer to keep any and all terms as they exist in the original; I do not know if I am in the minority though if it were optional, more would have a choice.

Quote:
Originally Posted by CalibUser View Post
There is a fairly simple solution to removing spaces after hyphens using Sigil's own search/replace facility. Use:

Find: [ ]?-[ ]?

Replace: - …
With hyphens followed by a space that occur at line breaks, one can never know whether the term itself is hyphenated or not; terms could be hyphenated in either case, e.g., "rem- edy", "green- yellow", or something not in a dictionary like "Alisma plantago- aquatica". Such is why I originally asked to use the source EPUB as a dictionary correction. Perhaps such you might still consider.
Old 07-26-2020, 04:54 AM   #206
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
@democrite:

Quote:
Originally Posted by democrite View Post
Such is why I originally asked to use the source EPUB as a dictionary correction. Perhaps such you might still consider.
I have been thinking about your suggestion and have concluded that it is not viable to include it in my plugin.

Taking your previous example, "green-yel- low", it seems that you are looking for a plugin that will find this as an error in an ePub and then correct it to "green-yellow" using a dictionary. As there are so many different variations of where the hyphen(s) and space(s) could occur in just this one example (e.g. "gre -en- yell- ow", etc.), this could slow down my plugin considerably.

You may need a separate plugin for this.
Old 07-26-2020, 10:03 AM   #207
JSWolf
Resident Curmudgeon
Posts: 79,856
Karma: 146918083
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by thosp View Post
Following are two examples of errors produced by ePubTidyTool_v3.0.1.0...; in both cases the trailing space AFTER a </em> tag is removed. This occurred in Sigil 1.2.0 on Windows 10.

BEFORE
“The damned hurry” was the <em>size</em> of the anomaly at Siple Island. The agency deputy director sat at his desk, reviewing the report summary.

AFTER
“The damned hurry” was the <em>size</em>of the anomaly at Siple Island. The agency deputy director sat at his desk, reviewing the report summary.

***********
BEFORE
<em>Damned right, he will, </em> Deputy director Jameson thought to himself.

AFTER
<em>Damned right, he will,</em>Deputy director Jameson thought to himself.

Thank You
If you use <i></i> instead of <em></em> does this bug show up?
Old 07-26-2020, 01:26 PM   #208
democrite
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
Quote:
Originally Posted by CalibUser View Post
Taking your previous example, "green-yel- low", it seems that you are looking for a plugin that will find this as an error in an ePub and then correct it to "green-yellow" using a dictionary. As there are so many different variations of where the hyphen(s) and space(s) could occur in just this one example (e.g. "gre -en- yell- ow", etc.), this could slow down my plugin considerably.

You may need a separate plugin for this.
If you could consider only the case of not touching hyphenated words as an option or default, that would be nice. In my current EPUB, on botany as you can tell, I find so many hyphenated words that might occur only once, variations in references (articles or books), regional names, e.g., "sa-ke", and so forth. I can modify the script by changing the regex which is fine, though perhaps others would like it as well.

Last edited by democrite; 07-26-2020 at 01:46 PM.
Old 07-28-2020, 04:31 AM   #209
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
@democrite
Quote:
Originally Posted by democrite View Post
If you could consider only the case of not touching hyphenated words as an option or default, that would be nice. In my current EPUB, on botany as you can tell, I find so many hyphenated words that might occur only once, variations in references (articles or books), regional names, e.g., "sa-ke", and so forth. I can modify the script by changing the regex which is fine, though perhaps others would like it as well.
I think that this would require a different plugin.

@JSWolf:

Quote:
Originally Posted by JSWolf View Post
If you use <i></i> instead of <em></em> does this bug show up?
The plugin has been updated to version ePubTidyTool_v3.0.1.1 (first post in this thread) to correct the <em></em> error and to ensure that <i></i> is processed correctly.
Old 07-28-2020, 05:00 PM   #210
thosp
Member
Posts: 11
Karma: 10
Join Date: Dec 2014
Device: laptop & tablet
Greetings,

When Sigil's check report came up AFTER I ran the ePub Tidy Tool, it asked me if I wanted to continue, with these details: "Incorrect XHTML: OEBPS/Text/Section0001.xhtml Line/Col 737,7 Opening and ending tag mismatch." If I say yes, the following:

<p>“Problems everywhere. Whatever happened to ‘relaxing over the summer?’”</p>

is changed to:

<p>“Problems everywhere. Whatever happened to’re laxing over the summer?’”</p>

?????