Old 07-26-2016, 08:20 AM   #61
KevinH
Sigil Developer
Posts: 7,705
Karma: 5444398
Join Date: Nov 2009
Device: many
When cleaning the text and parsing it before passing it to the spell checker, it should be easy to filter out entities like &shy; and its numeric equivalents (&#173; and &#xAD;). But truly, soft-hyphenating words is probably best left to the very last step, after all other changes, including spellchecking, are completed.

So I recommend removing all soft-hyphens from the document using search and replace, until the text and epub are in an "as desired" state and then using a hyphenation library to add back in soft-hyphens if and only if you are producing an epub for readers that support them.
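
A minimal sketch of that search-and-replace step, in Python rather than in Sigil's own Find & Replace. The function name strip_soft_hyphens is made up for illustration; the pattern covers the named entity, the decimal and hexadecimal numeric references, and the raw U+00AD character.

```python
import re

# Named entity, decimal/hex numeric references, and the raw U+00AD character.
SOFT_HYPHEN = re.compile(r"&shy;|&#0*173;|&#[xX]0*[aA][dD];|\u00ad")


def strip_soft_hyphens(text: str) -> str:
    """Return text with every form of soft hyphen removed (hypothetical helper)."""
    return SOFT_HYPHEN.sub("", text)


sample = "hy&shy;phen&#173;ation li&#xAD;brary so\u00adft"
print(strip_soft_hyphens(sample))
```

Running this over every XHTML file before spellchecking leaves the words intact for the dictionary lookup; a hyphenation library can re-insert the soft hyphens at the very end.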
Old 07-26-2016, 06:14 PM   #62
varlog
actually it is /var/log
Posts: 341
Karma: 2994236
Join Date: Sep 2012
Location: usually Europa
Device: prs t1
Nice to see that the blog can manage without me...

Quote:
Originally Posted by JSWolf View Post
Thank you for that link. It's perfect!
You are welcome. I got it from one of your posts... then I saw this post, from yesterday!, and I got a headache...

Quote:
Originally Posted by KevinH View Post
If you are just playing around for yourself, please do whatever you want. ...
I'm playing around for myself, of course, hoping to contribute - if I get something worthy... I'm pragmatic, too.

Quote:
Originally Posted by brolny View Post
- There are no sub-tags xml:lang="und" and xml:lang="zxx" in your “multilanguage.epub” file. (https://www.w3.org/International/que...e#undetermined)
...
Yes, there are. I must admit, though, that, after your post, I had to correct "xzz" into "zxx". Thanks.
Old 07-30-2016, 06:56 PM   #63
varlog
actually it is /var/log
In The Source: The State of Art

(sic)

It is the XXI century, C++11... forget the lambdas...
I find myself using labels and goto statements in my QSHParser.
It is slowly shaping up... I get my languages... but lose some words... (wt...!?).
See picture.
But I have the feeling that this is not going to end well...
The current Sigil is not HTML5 conformant. It does not allow attributes without quoted values... It does not allow unclosed tags, see "8.1.2.4 Optional tags"... so I cannot even try to provide for it. But a casual look at what is actually going on in the source (doing git rebase today) tells me HTML5 is coming...
... and the way the QtGod((c)varlog) entity named QTextEditor wants to update its highlighting does not, in principle, agree with serial parsing...
Am I, again, missing something obvious? Some insights, Kevin?

Latest version of my novella included.


tbc...?
Attached Thumbnails: SCE_next04.jpg
Attached Files: Multilanguage.epub (7.8 KB)
Old 07-30-2016, 11:02 PM   #64
KevinH
Sigil Developer
Sigil uses gumbo for parsing, and gumbo is a fully HTML5-compliant parser. So your concerns about HTML5 compliance are really nothing to worry about, as you are parsing XHTML, not HTML, just to get the tag name, any lang-related attribute, and the text.

All you need to do with your Quickparser is simply follow the exact logic and flow of the python quickparser.py. Repeated calls after loading the parser will return tags, tag type and dict of tag attributes separate from text.

For each starting tag, you use the tag attributes and look for xml:lang or lang and push a tuple of start tag name and current language into the end of a list. When a closing tag happens, you pop off the last tag name and the language. (Start the list with the metadata language).

When text comes, you split it at word boundaries, as is done now, and you simply look at the last entry (the bottom) of that list to determine the current language associated with that text, passing the word and language to the spellcheck engine.

It probably would be good to store the offset of each word as well, which you track in the parser.
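
The push/pop logic described above can be sketched roughly as follows. This is not quickparser.py and does not use its API: it is a self-contained stand-in built on regular expressions, the names words_with_language, TAG_RE, LANG_RE and WORD_RE are made up for illustration, and it assumes well-formed XHTML with no entities in the text and no ">" inside attribute values.

```python
import re

# Start/end tags, or comments/doctype/processing instructions (second alternative).
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)([^>]*)>|<[!?][^>]*>")
# xml:lang="..." or lang="..." (but not hreflang="...").
LANG_RE = re.compile(r"""(?:^|\s)(?:xml:)?lang\s*=\s*["']([^"']*)["']""")
# Runs of letters, optionally joined by an apostrophe or hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def words_with_language(xhtml, default_lang):
    """Yield (word, language, offset) for each word found outside tags."""
    stack = [("", default_lang)]  # seeded with the metadata language
    pos = 0
    for m in TAG_RE.finditer(xhtml):
        # Text before this tag belongs to the language at the end of the list.
        for w in WORD_RE.finditer(xhtml, pos, m.start()):
            yield w.group(), stack[-1][1], w.start()
        pos = m.end()
        closing, name, rest = m.group(1), m.group(2), m.group(3)
        if name is None:  # comment, doctype, or processing instruction
            continue
        if closing:
            if len(stack) > 1 and stack[-1][0] == name:
                stack.pop()
        elif not rest.rstrip().endswith("/"):  # self-closing tags hold no text
            lang = LANG_RE.search(rest)
            stack.append((name, lang.group(1) if lang else stack[-1][1]))
    for w in WORD_RE.finditer(xhtml, pos):  # text after the last tag
        yield w.group(), stack[-1][1], w.start()


doc = '<p>Hello <span xml:lang="fr">bonjour monde</span> world</p>'
for word, lang, offset in words_with_language(doc, "en"):
    print(word, lang, offset)
```

Each start tag inherits the language of its parent unless it declares its own, so looking at the last list entry is enough when text arrives. The offset is the position of the word in the whole file.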

Does that help at all? I know Python is not a strength for you yet, so I would be happy to go through the logic of quickparser.py line by line if need be, or answer any questions you might have.

Hope this helps,

KevinH
Old 07-31-2016, 05:51 AM   #65
varlog
actually it is /var/log
Quote:
Originally Posted by KevinH View Post
...
For each starting tag, you use the tag attributes and look for xml:lang or lang and push a tuple of start tag name and current language into the end of a list. When a closing tag happens, you pop off the last tag name and the language. (Start the list with the metadata language).

When text comes you split it at word boundaries, as is done now, and you simply look at the bottom of that list to determine the current language associated with that text, passing the word and language to the spellcheck engine.
...
I'm doing something like this: at the moment SetNewBook initializes the parser with a parentTag whose lang attribute is taken from dc:language.
The parser is used by HtmlSpellCheck::GetMultilanguageMisspelledWords, my version of GetMisspelledWords, to build the tag stack with attributes.
XHTMLHighlighter::highlightBlock calls XHTMLHighlighter::CheckSpelling, which calls, through HtmlSpellCheck::GetMisspelledWords, HtmlSpellCheck::Get(Multilanguage)MisspelledWords, giving it chunks (lines) of text. All is well as long as it provides the whole text: the parser, being serial, manages the chunks. If a chunk does not contain a whole tag, the parser waits for the next one (and the next, and the next...). It builds the stack and sets the appropriate language.
But when you start to edit something in CodeView, the XHTMLHighlighter delivers only the line being changed. It could be something like this: " \t\t\t\t</a> Merici! ".

So what to do? Silently abort the parser and wait for better times? Temporarily disable highlighting? Other ideas?
Old 07-31-2016, 09:05 AM   #66
KevinH
Sigil Developer
I did not realize the highlighting worked like that. You will need the offset of all misspelled words in the file and somehow use that info to decide what to highlight for any one line.

Can you push your changes to your GitHub repo so I can see how things work and play with it a bit to see how to handle the highlighting?

Last edited by KevinH; 07-31-2016 at 11:28 AM.
Old 07-31-2016, 02:11 PM   #67
varlog
actually it is /var/log
I'll do that, as soon as I finish the current rewrite.
I realized something that should have been clear to me since this post.
My QSHParser is at the moment a singleton: but it cannot be! Because of all this time travel and parallel universes, the parser craps out as soon as there is more than one html file in the book.
As I understand it now, there has to be a separate instance for every resource file for which spell checking is invoked. Or a separate wrapper class instance... I have to think a bit about it...
Old 07-31-2016, 04:53 PM   #68
KevinH
Sigil Developer
Yes, QtConcurrent is used to spellcheck multiple files at the same time, so each instance/thread must use its own Quickparser object.
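
As a rough analogy only (Python's concurrent.futures standing in for QtConcurrent, and a made-up TinyParser class standing in for the real parser), the rule is that every task builds its own parser object instead of reaching for a shared one:

```python
from concurrent.futures import ThreadPoolExecutor


class TinyParser:
    """Hypothetical stand-in for a stateful serial parser."""

    def __init__(self, default_lang):
        self.default_lang = default_lang
        self.words = []  # per-instance state; a shared singleton would mix files

    def feed(self, text):
        self.words.extend((w, self.default_lang) for w in text.split())


def check_file(item):
    name, (lang, text) = item
    parser = TinyParser(lang)  # fresh instance per file and per task
    parser.feed(text)
    return name, parser.words


files = {"ch1.xhtml": ("en", "one two"), "ch2.xhtml": ("fr", "un deux")}
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(check_file, files.items()))
print(results)
```

Because the word list lives on the instance, two files processed at the same time cannot corrupt each other's tag or language state.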
Old 08-02-2016, 06:38 PM   #69
varlog
actually it is /var/log
In The Source: Excession

this or that...

Good news, bad news...
I've changed QSHP from a singleton to a normal class and create instances whenever needed. It seems to be working, with the exception of XHTMLHighlighter, which of course doesn't get the languages right, sending some random garbage instead of proper text... I had to disable checking end-begin of the tags in the stack to get some action.

Bad news.
I loaded "A Memory of Light" (some 362 908 words) and, even though the normal Sigil is quicker, the time lag in the SpellcheckEditor when seeking a word was subjectively OK. Then I loaded "ESV Bible", some 1 250 882 words, and things got weird. Instead of the original ~2-3 s I got up to 10 s of lag.

Not acceptable. Some optimization is due...

But it (mostly) works like magic.


tbc...?
Old 08-03-2016, 11:06 AM   #70
KevinH
Sigil Developer
Great news. You are definitely getting closer!

That said, the XHTMLHighlighter issue is going to be a bug to fix. As someone edits a file (adds and removes lines, fixes spelling, etc.), offsets into the file change constantly, so there is no easy way to pass state information to the spellchecker that would let it check a word outside the default language on the fly when it is passed only a single line of text.

So we either let the highlighter spellcheck in all languages, or only in the default language, to decide whether to wavy-underline a word, and then let the formal spellcheck mechanism catch these instances properly.

Or we figure out some way to pass along more information that doesn't become outdated as soon as one character is added or deleted, making all offsets meaningless. I am still not sure how to do that, or whether it is even possible. It is a shame we cannot attach a hidden language attribute to every word of text!

I think the wavy-line spellcheck hint set as we type should simply be based on looking up the text outside of tags in the default dictionary, or alternatively in all open dictionaries (a word passes if it is valid in any open dictionary), until we figure out a way to determine a word's language from just a single-line snippet/fragment.

KevinH
Old 08-04-2016, 09:38 AM   #71
KevinH
Sigil Developer
Hi varlog,

I thought about highlighting a bit more:

The only sane way to handle it when given only a snippet of text is to look up the word in the list of words already found to be misspelled for that document and, if it is found, mark it as wavy. Otherwise, since the specific language is not known at that point (and cannot be known), look up the word in all open dictionaries to decide whether to mark it temporarily as something to look at.
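
A minimal sketch of that decision, with plain Python sets standing in for the real dictionaries. The function name should_underline and the data shapes are assumptions for illustration, not Sigil code.

```python
def should_underline(word, known_misspelled, open_dictionaries):
    """Decide whether a word from an isolated line gets a wavy underline.

    known_misspelled: words flagged by the last full, language-aware check.
    open_dictionaries: mapping of language code -> set of valid lowercase words
    (a stand-in for real spellcheck dictionaries).
    """
    if word in known_misspelled:
        return True
    # Language unknown here: accept the word if any open dictionary knows it.
    return not any(word.lower() in words for words in open_dictionaries.values())


known = {"Merici"}
dicts = {"en": {"hello", "world"}, "fr": {"bonjour", "merci"}}
for candidate in ("Merici", "bonjour", "wrold"):
    print(candidate, should_underline(candidate, known, dicts))
```

The first lookup keeps the highlighter consistent with the formal spellcheck; the second is deliberately permissive, since a wrongly accepted word is still caught by the next full check.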

What do you think? Is this approach doable?

KevinH
Old 08-04-2016, 03:50 PM   #72
varlog
actually it is /var/log
There is a way to know the language: at the moment my QSHParser (optionally) builds the tag DOM, with the position of the start of each tag. The algorithm could be like this:

on opening TextTab:
    get DOM and keep it;
on user action:
    find the current tag, according to the position of the cursor in text;
    walk down the DOM to find the tag with language;
    do something with this knowledge;
    update DOM;
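
The lookup part of that outline might look roughly like this. It is a sketch under assumptions: the Node class and language_at function are invented for illustration, each node is assumed to store both its start and end offsets, and the walk is done from the innermost tag towards its ancestors. The "update" step is not shown.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    start: int            # offset of the opening tag
    end: int              # offset just past the closing tag
    lang: Optional[str]   # value of xml:lang/lang, if the tag declares one
    parent: Optional["Node"] = None


def language_at(nodes, cursor, default_lang):
    """Return the language in effect at cursor; nodes are sorted by start."""
    starts = [n.start for n in nodes]
    i = bisect_right(starts, cursor) - 1  # last tag opened at or before cursor
    node = nodes[i] if i >= 0 else None
    # Climb until we reach an element that actually contains the cursor.
    while node is not None and cursor >= node.end:
        node = node.parent
    # Then climb to the nearest element that declares a language.
    while node is not None and node.lang is None:
        node = node.parent
    return node.lang if node is not None else default_lang


# Offsets for: <p>Hello <span xml:lang="fr">bonjour monde</span> world</p>
p = Node("p", 0, 59, None)
span = Node("span", 9, 49, "fr", parent=p)
nodes = [p, span]
print(language_at(nodes, 30, "en"), language_at(nodes, 52, "en"))
```

The binary search keeps the per-keystroke cost low; the hard part remains keeping the offsets current after every edit.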


Of course "user action" would have to be carefully defined to avoid races. My knowledge about it is non-existent.

Anyway, I'll do some cleaning now and, perhaps at the end of next week, make the source public. It won't be finished (no Preferences, no Settings and others that I don't know of...), but it will be something to play with, if somebody wants to.

In the meantime I had another "shoot my foot" idea. The overhead for multi-language checking is considerable, and it will be used relatively seldom. Sigil should keep its "one language mode" and offer multi-language checking as an option.
Old 08-09-2016, 06:16 PM   #73
varlog
actually it is /var/log
In The Source: The Divine Invasion

Philip... perhaps a little bit overstretched... but that's my mood now...

... back from my [REALITY] break.
Actually I should be cleaning the code... I've even started to... but then I just looked closer at XHTMLHighlighter, which is Sigil's implementation of QSyntaxHighlighter: my debugger doesn't even know what's calling it, because it's an internal Qt thing.
But QtC knows: it's CodeViewEditor... and this guy has some spell-check-relevant functions. AddSpellCheckContextMenu, for instance... which depends on XHTMLHighlighter...

I have to investigate it and think about it...

But I have a deadline now. It is 21.08.2016.
Old 08-10-2016, 02:02 AM   #74
brolny
Connoisseur
Posts: 64
Karma: 10
Join Date: Sep 2015
Location: Yerevan, Armenia
Device: none
Quote:
Originally Posted by varlog View Post
As I see, you now work with the spell-check window. Can you please pay attention to this? For European languages everything is OK, but for oriental letters there are 2 problems.
- The field for editing the word is too short; English has very short words and easy-to-read fonts.
- And the most uncomfortable one: the font is too small, so it's very difficult to read some words, and some letter combinations are absolutely unreadable now.

PS
I can change the font size of the lists in Win10, but not of edits, labels..., whose font is too small in Sigil for Windows, IMHO.
Attached Thumbnails: ddddd.png, sss.png, fs.png

Last edited by brolny; 08-10-2016 at 02:52 AM.
Old 08-10-2016, 09:36 AM   #75
KevinH
Sigil Developer
Admittedly, that word length has to be the worst case, as it is much longer than any other word on the list. I assume simply growing the dialog size does not help? Do you by chance use a high-dpi monitor (high-res display)? If so, Windows is simply horrible at scaling for high dpi. We plan to move to Qt 5.6 in the hope that its autoscaling is improved.