07-30-2016, 11:02 PM   #64
KevinH
Sigil Developer
Sigil uses gumbo for parsing, and gumbo is a fully html5 compliant parser. So your concerns about html5 compliance are really nothing to worry about: you are parsing xhtml, not html, and only to get the tag name, any language-related attribute, and the text.

All you need to do with your Quickparser is follow the exact logic and flow of the python quickparser.py. Once the parser is loaded, repeated calls to it return the tag name, the tag type, and a dict of the tag attributes, separately from the text.

For each start tag, look in the tag attributes for xml:lang or lang and push a tuple of the start tag name and the current language onto the end of a list. If the tag has neither attribute, the current language is simply the one already at the end of the list. When a closing tag comes, pop the last tag name and language off the list. (Start the list with the metadata language.)
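
To make the push and pop steps concrete, here is a minimal Python sketch of that list. The class name LanguageStack and its method names are made up for illustration and are not taken from quickparser.py or Sigil. The guard that keeps the metadata entry from being popped is my own assumption.

```python
class LanguageStack:
    """Hypothetical helper: tracks the language in effect while walking tags."""

    def __init__(self, metadata_lang):
        # Seed the list with the metadata language so it is never empty.
        self._stack = [(None, metadata_lang)]

    def current(self):
        # The last entry always holds the language in effect.
        return self._stack[-1][1]

    def open_tag(self, name, attrs):
        # Prefer xml:lang, then lang, otherwise inherit the current language.
        lang = attrs.get("xml:lang") or attrs.get("lang") or self.current()
        self._stack.append((name, lang))

    def close_tag(self, name):
        # Pop the last tag and its language, but never the metadata seed
        # (an assumption, to survive a stray closing tag).
        if len(self._stack) > 1:
            self._stack.pop()


langs = LanguageStack("en")
langs.open_tag("p", {})
langs.open_tag("span", {"xml:lang": "fr"})
print(langs.current())  # fr
langs.close_tag("span")
print(langs.current())  # en
```

Self-closing tags would neither push nor pop, since they never get a matching closing tag.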

When text comes, you split it at word boundaries, as is done now, and simply look at the last entry of that list to determine the current language for that text, then pass each word and its language to the spellcheck engine.
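
As a rough sketch of the text step, assuming the list holds (tag name, language) tuples as described above: the regular expression below is only a stand-in for whatever word splitting Sigil does now, and the function name words_with_language is hypothetical.

```python
import re

# Stand-in word splitter (an assumption, not Sigil's real one):
# runs of letters, optionally joined by an apostrophe.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def words_with_language(text, lang_stack):
    """Pair every word in text with the language at the end of lang_stack."""
    current_lang = lang_stack[-1][1]
    return [(word, current_lang) for word in WORD_RE.findall(text)]


lang_stack = [(None, "en"), ("p", "en"), ("span", "fr")]
for word, lang in words_with_language("Bonjour le monde", lang_stack):
    print(word, lang)
```

Each (word, language) pair is what you would hand to the spellcheck engine.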

It probably would be good to store the offset of each word as well, which you track in the parser.
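
One possible way to get the offsets, sketched in Python: if the parser knows where the current text run starts in the file, each word's offset is that start plus the word's position inside the run. The name words_with_offsets and the text_start parameter are hypothetical, and the regular expression is again only a stand-in for the existing word splitting. This also assumes the text is the raw source slice, so character positions map one-to-one.

```python
import re

# Stand-in word splitter (an assumption, not Sigil's real one).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def words_with_offsets(text, text_start):
    """Return (word, offset in file) for each word in a text run.

    text_start is the position of the text run in the whole file,
    which the parser is assumed to track.
    """
    return [(m.group(0), text_start + m.start()) for m in WORD_RE.finditer(text)]


print(words_with_offsets("Hello, world", 100))  # [('Hello', 100), ('world', 107)]
```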

Does that help at all? I know python is not a strength for you yet, so I would be happy to go through the logic of quickparser.py line by line if need be, or answer any questions you might have.

Hope this helps,

KevinH