#196 |
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
|
I speak not of common typos as you could deduce from previous words. As for correcting hyphenation by using the EPUB itself as a dictionary, as mentioned, many academic and scientific works could include innumerable terms not in any general dictionary, thus I made a word list from the EPUB itself to use as a custom dictionary. Such would be nice to automate someday if you'd consider it. For the work I recently converted, the log reported 20,000+ hyphens removed.
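The word list described above could be built roughly as follows. This is only a sketch under assumptions, not how the plugin works: the function names `read_xhtml_documents` and `build_word_counts` are hypothetical, tags are stripped with a simple regex rather than a real parser, and every `.xhtml`/`.html`/`.htm` file in the archive is treated as book text.

```python
import re
import zipfile
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
# A word is a run of letters, optionally continued by "-letters" with no space.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def read_xhtml_documents(epub_file):
    """Return the decoded text files of an EPUB (path or file-like object)."""
    with zipfile.ZipFile(epub_file) as archive:
        return [
            archive.read(name).decode("utf-8", errors="replace")
            for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]


def build_word_counts(xhtml_documents):
    """Count every word (lower-cased) found in the given markup strings."""
    counts = Counter()
    for markup in xhtml_documents:
        text = TAG_RE.sub(" ", markup)  # crude tag removal, assumption
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts
```

The keys of the returned counter can then be written out one per line and used as a custom dictionary; a broken form such as "tri- andra" is counted as the two fragments "tri" and "andra", so it does not pollute the list with a false joined word.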
|
#197 |
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
I can see the problem - 20,000 words is a lot to cope with. I don't have time at the moment to work on this. Rather than adding more code to the original plugin, I may, at some point in the future, create an auxiliary plugin that will create a dictionary of acceptable hyphenated words automatically that can be used with this plugin.
|
#198 |
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
Sigil update 20th July 2020
I have managed to update the plugin so that it compiles a list of hyphenated words from an ePub and appends these to an existing file of hyphenated words. I have placed the updated plugin in the first post in this thread. I am not sure how quick it will be in going through 20,000 words, but I hope this will meet your requirements.
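For readers who want to see the idea in code, here is a minimal sketch of compiling hyphenated words from text and appending only the new ones to an existing word file. It is not the plugin's actual code; the names `collect_hyphenated` and `append_new_words`, and the one-word-per-line file layout, are assumptions made for illustration.

```python
import re
from pathlib import Path

# Letters joined by one or more hyphens, with no spaces (e.g. "to-day").
HYPHENATED_RE = re.compile(r"\b[^\W\d_]+(?:-[^\W\d_]+)+\b")


def collect_hyphenated(text):
    """Return the distinct hyphenated words in text, lower-cased and sorted."""
    return sorted({m.group(0).lower() for m in HYPHENATED_RE.finditer(text)})


def append_new_words(list_path, words):
    """Append words not already listed in the file; return what was added."""
    path = Path(list_path)
    existing_text = path.read_text(encoding="utf-8") if path.exists() else ""
    existing = {line.strip() for line in existing_text.splitlines() if line.strip()}
    new_words = [word for word in words if word not in existing]
    if new_words:
        with path.open("a", encoding="utf-8") as handle:
            # Start on a fresh line if the file lacks a trailing newline.
            if existing_text and not existing_text.endswith("\n"):
                handle.write("\n")
            for word in new_words:
                handle.write(word + "\n")
    return new_words
```

Because the existing entries are held in a set, checking 20,000 words against the file should take well under a second in a sketch like this; the cost is dominated by reading the book text.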
|
#199 |
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
|
Thank you very much. I have been a bit confused by what you mean by hyphenated words. My request has been for removing hyphens followed by a space from words that are not in a dictionary but do appear in the source EPUB, e.g. "Areca tri- andra", which comes from PDF line breaks. That is what your plugin does for dictionary words; I merely ask for inclusion of terms from the source EPUB. As for terms that should be hyphenated, there could be some such as "green-yel- low", and perhaps that's what you've been referring to. There could also be cases such as "anti- inflammatory", where in my recent case the EPUB uses both spellings, with and without hyphenation, and I'd prefer correction to the most commonly used form if possible. As the author may use such orthography, perhaps in general it might be better someday to use the source EPUB first for corrections before matching against a dictionary.
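To make the request concrete, here is a rough sketch of the behaviour being asked for. It is not part of the plugin: `repair_line_break_hyphens` is a hypothetical name, and `source_counts` is assumed to be a mapping from lower-cased words found elsewhere in the same EPUB to how often they occur. A "hyphen plus space" is repaired only when the joined or the hyphenated form is attested in the book, and the more frequent form wins.

```python
import re

# left part (may itself contain hyphens), a hyphen, whitespace, right part
LINE_BREAK_HYPHEN_RE = re.compile(
    r"\b([^\W\d_]+(?:-[^\W\d_]+)*)-\s+([^\W\d_]+)\b"
)


def repair_line_break_hyphens(text, source_counts):
    """Close up 'word- word' breaks using spellings attested in the source."""

    def choose(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        joined_n = source_counts.get(joined.lower(), 0)
        hyphenated_n = source_counts.get(hyphenated.lower(), 0)
        if joined_n == 0 and hyphenated_n == 0:
            return match.group(0)  # neither form is attested: leave as is
        # Ties go to the joined form (an arbitrary choice in this sketch).
        return joined if joined_n >= hyphenated_n else hyphenated

    return LINE_BREAK_HYPHEN_RE.sub(choose, text)
```

With counts of `{"triandra": 2}`, "Areca tri- andra" becomes "Areca triandra"; with `{"anti-inflammatory": 4, "antiinflammatory": 1}`, "anti- inflammatory" becomes "anti-inflammatory"; a phrase such as "pre- and post-war" is left alone because neither "preand" nor "pre-and" occurs in the book.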
Last edited by democrite; 07-21-2020 at 10:44 PM. |
#200 |
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
@democrite:
Quote:
Quote:
Quote:
Quote:
Quote:
|
#201 |
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
|
I'm sorry for the mix-up. I had not read the description and posts carefully enough to understand that the plugin checks hyphenated words and removes the hyphen if the resulting term is in the dictionary. I have always been referring to end-of-line PDF hyphens, which result in a hyphen followed by a space. As mentioned, many such words might not be in a dictionary, and handling those is what I've wanted all along.
I would guess, then, that the recent changes do not compile a list of unhyphenated terms from the source EPUB and then check hyphens followed by a space to see whether they should be corrected? As for what the plugin primarily does, checking hyphenated words and removing the hyphen if the term is in a dictionary, I am not sure why people would want such a thing. OCR apps, as far as I know from the years that I've used them, do not make such errors.
#202 |
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
@democrite:
No worries. Quote:
Quote:
I find this is particularly useful with older publications that include hyphens where we would not use them now. For example, 'today' in older books/magazines appears as 'to-day'; the hyphen in this word is not used in modern texts, so my plugin would reduce the word to 'today'.
Unfortunately, the plugin can remove hyphens where these need to be retained, so the latest version of the plugin gives the option of adding hyphenated words to a list of words in which the hyphen must be retained. This also means that if, for example, one wants to keep the original format (with hyphens) of the scanned text, one could create a file of these words with the latest version of the plugin, so that the hyphen in, for example, 'to-day' can be retained for historical reasons.
There is a fairly simple solution to removing spaces after hyphens using Sigil's own search/replace facility. Use:
Find: [ ]?-[ ]?
Replace: -
This will remove spaces around hyphens. You could add this to Sigil Saved Searches so that you can retrieve it when you need it. I did not include code to do this in the plugin because some books use (perhaps incorrectly) the normal hyphen with spaces in front and behind the hyphen in the text on purpose.
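For anyone who prefers to try the search/replace outside Sigil first, the same pattern behaves identically in Python's `re` module. This is only an illustration; `close_up_hyphens` is a hypothetical helper name, not something in the plugin.

```python
import re

# Same pattern as the Sigil search: an optional space, a hyphen, an optional space.
SPACED_HYPHEN_RE = re.compile(r"[ ]?-[ ]?")


def close_up_hyphens(text):
    """Remove a single space on either side of every hyphen."""
    return SPACED_HYPHEN_RE.sub("-", text)


print(close_up_hyphens("Areca tri- andra"))     # Areca tri-andra
print(close_up_hyphens("a pause - like this"))  # a pause-like this
```

The second line shows the caveat mentioned above: a spaced hyphen used deliberately as a dash is closed up as well. Note too that the hyphen itself is kept, so "tri- andra" becomes "tri-andra" rather than "triandra".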
#203 |
Member
Posts: 11
Karma: 10
Join Date: Dec 2014
Device: laptop & tablet
|
Following are two examples of errors produced by ePubTidyTool_v3.0.1.0...; in both cases the trailing space AFTER a </em> tag is removed. This occurred in Sigil 1.2.0 on Windows 10.
BEFORE
“The damned hurry” was the <em>size</em> of the anomaly at Siple Island. The agency deputy director sat at his desk, reviewing the report summary.
AFTER
“The damned hurry” was the <em>size</em>of the anomaly at Siple Island. The agency deputy director sat at his desk, reviewing the report summary.
***********
BEFORE
<em>Damned right, he will, </em> Deputy director Jameson thought to himself.
AFTER
<em>Damned right, he will,</em>Deputy director Jameson thought to himself.
Thank You
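The result expected in both examples can be sketched as below. This is a guess at the desired behaviour, not the plugin's code: `tidy_inline_close` is a hypothetical name, and only `</em>` and `</i>` are handled. Whitespace found just inside the closing tag is moved outside it instead of being dropped, and a space that already follows the tag is left in place.

```python
import re

TRAILING_SPACE_IN_TAG_RE = re.compile(r"[ \t]+(</(?:em|i)>)")
RUN_OF_SPACES_RE = re.compile(r"[ ]{2,}")


def tidy_inline_close(markup):
    """Move whitespace before </em> or </i> to just after the tag."""
    moved = TRAILING_SPACE_IN_TAG_RE.sub(r"\1 ", markup)
    # Collapse the doubled space that appears when one already followed the tag.
    return RUN_OF_SPACES_RE.sub(" ", moved)
```

Applied to the two BEFORE lines, the first is returned unchanged and the second becomes `<em>Damned right, he will,</em> Deputy director Jameson thought to himself.`, with the space between the tag and "Deputy" kept. The space-collapsing step is applied to the whole string, so it would need guarding for preformatted text.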
#204 |
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
@thosp: Thank you for pointing out this issue. I will put a correction in the first post of this thread.
Last edited by CalibUser; 07-23-2020 at 11:29 AM. |
#205 |
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
|
Quote:
With hyphens followed by a space that occur at line breaks, one can never know whether the term itself is hyphenated or not; terms could be hyphenated in either case, e.g., "rem- edy", "green- yellow", or something not in a dictionary like "Alisma plantago- aquatica". That is why I originally asked to use the source EPUB as a dictionary for correction. Perhaps you might still consider it.
|
#206 |
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
@democrite:
Quote:
Taking your previous example, "green-yel- low", it seems that you are looking for a plugin that will find this as an error in an ePub and then correct it to "green-yellow" using a dictionary. As there are so many different variations of where the hyphen(s) and space(s) could occur in just this one example (e.g. "gre -en- yell- ow", etc.), this could slow down my plugin considerably. You may need a separate plugin for this.
|
#207 |
Resident Curmudgeon
Posts: 79,856
Karma: 146918083
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
#208 |
Evangelist
Posts: 441
Karma: 77256
Join Date: Sep 2011
Device: none
|
Quote:
Last edited by democrite; 07-26-2020 at 01:46 PM. |
|
#209 |
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
@democrite
Quote:
@JSWolf: The plugin has been updated to version ePubTidyTool_v3.0.1.1 (first post in this thread) to correct the <em></em> error and to ensure that <i></i> is processed correctly.
|
#210 |
Member
Posts: 11
Karma: 10
Join Date: Dec 2014
Device: laptop & tablet
|
Greetings,
When Sigil's Check Report came up AFTER I ran the ePub Tidy Tool, it asked me if I wanted to continue, with these details:
Incorrect XHTML: OEBPS/Text/Section0001.xhtml Line/Col 737,7 Opening and ending tag mismatch.
If I say yes, the following:
<p>“Problems everywhere. Whatever happened to ‘relaxing over the summer?’”</p>
is changed to:
<p>“Problems everywhere. Whatever happened to’re laxing over the summer?’”</p>
?????