01-02-2021, 01:33 PM   #27
KevinH
Sigil Developer
Okay, just to see whether basic querying works on top of gumbo, I integrated gumbo-query into Sigil (only locally, in my own tree), fixed it to work with our version of sigilgumbo, and ran the following test cases:

Code:
#if TEST_GUMBO_QUERY
            // Fragment from inside main(); needs <cstdio>, <iostream>, <string>
            // and the gumbo-query headers.

            // Test 1: descendant + class selector. Print the node's text, then
            // the matching slice of the source via startPos()/endPos().
            if (1) {
                std::string page("<h1><a>wrong link</a><a class=\"special\"\\>some link</a></h1>");
                CDocument doc;
                doc.parse(page.c_str());

                CSelection c = doc.find("h1 a.special");
                CNode node = c.nodeAt(0);
                printf("Node: %s\n", node.text().c_str());
                std::string content = page.substr(node.startPos(), node.endPos() - node.startPos());
                printf("Node: %s\n", content.c_str());
            }
            // Test 2: plain element selector on a node with nested children.
            if (1) {
                std::string page = "<html><div><span>1\n</span>2\n</div></html>";
                CDocument doc;
                doc.parse(page.c_str());
                CNode pNode = doc.find("div").nodeAt(0);
                std::string content = page.substr(pNode.startPos(), pNode.endPos() - pNode.startPos());
                printf("Node: #%s#\n", content.c_str());
            }
            // Test 3: attribute selector whose quoted value contains an apostrophe.
            if (1) {
                std::string page = "<html><div><span id=\"that's\">1\n</span>2\n</div></html>";
                CDocument doc;
                doc.parse(page.c_str());
                CNode pNode = doc.find("span[id=\"that's\"]").nodeAt(0);
                std::string content = page.substr(pNode.startPos(), pNode.endPos() - pNode.startPos());
                printf("Node: #%s#\n", content.c_str());
            }
            // Test 4: simple descendant selector.
            if (1) {
                std::string page("<h1><a>some link</a></h1>");
                CDocument doc;
                doc.parse(page.c_str());

                CSelection c = doc.find("h1 a");
                std::cout << c.nodeAt(0).text() << std::endl; // some link
            }
#endif
And it all seemed to pass with flying colours. So it appears we could easily add gumbo-query to our Sigil project (it is available under an MIT License) and use it to test CSS selectors to see whether they actually match any nodes in a document.
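
As a rough illustration of that idea (not code from the Sigil tree), the check could be reduced to "count the matches for each selector and report the ones with none". The names `MatchCounter` and `findUnusedSelectors` below are made up for this sketch. The counting step is passed in as a callable so the example stays self-contained; in Sigil it would presumably wrap `doc.find(selector)` from the test code above, and how the number of matched nodes is read from a `CSelection` is left as an assumption here.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical type: returns how many nodes a selector matches in some
// already-parsed document (e.g. a thin wrapper around doc.find(selector)).
using MatchCounter = std::function<std::size_t(const std::string&)>;

// Hypothetical helper: collect the selectors that match nothing at all.
std::vector<std::string> findUnusedSelectors(const std::vector<std::string>& selectors,
                                             const MatchCounter& countMatches) {
    std::vector<std::string> unused;
    for (const std::string& sel : selectors) {
        if (countMatches(sel) == 0) {
            unused.push_back(sel);  // no node matched this selector
        }
    }
    return unused;
}
```

Keeping the query engine behind a callable also means the same loop would work unchanged if gumbo-query were later swapped for something else.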

So all we need now is the ability to extract and better parse the selector rules themselves.

We could do that in css-parser via a Python interface, or we could try to write a better, simpler CSS parser ourselves (along the lines of what we did for XHTML parsing with QuickParser.cpp).

It would not be difficult to parse CSS in C++ with a simple state machine: look for the first occurrence of non-whitespace, then check for special characters such as "@", ";", "{" and "}" to determine the state, with state-specific parsing from there. It need not validate; it really just needs to generate the selector rules properly.
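
A state machine along those lines might look roughly like the sketch below. It is illustrative only and not Sigil code: every function name (`trim`, `skipString`, `skipBlock`, `stripComments`, `extractSelectors`) is made up, it assumes only `@media` and `@supports` blocks contain nested rules, it compares at-rule names case-sensitively, and it does no validation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Trim ASCII whitespace from both ends of s.
std::string trim(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// css[i] is an opening quote; return the index just past the closing quote.
std::size_t skipString(const std::string& css, std::size_t i) {
    const char quote = css[i++];
    while (i < css.size() && css[i] != quote) {
        if (css[i] == '\\' && i + 1 < css.size()) ++i;  // skip escaped char
        ++i;
    }
    return i < css.size() ? i + 1 : i;
}

// css[i] is '{'; return the index just past the matching '}'.
std::size_t skipBlock(const std::string& css, std::size_t i) {
    int depth = 0;
    while (i < css.size()) {
        const char c = css[i];
        if (c == '"' || c == '\'') { i = skipString(css, i); continue; }
        if (c == '{') ++depth;
        if (c == '}' && --depth == 0) return i + 1;
        ++i;
    }
    return i;  // unterminated block: stop at end of input
}

// Replace /* ... */ comments with a space, leaving quoted strings untouched.
std::string stripComments(const std::string& css) {
    std::string out;
    std::size_t i = 0;
    while (i < css.size()) {
        if (css[i] == '"' || css[i] == '\'') {
            const std::size_t j = skipString(css, i);
            out.append(css, i, j - i);
            i = j;
        } else if (css.compare(i, 2, "/*") == 0) {
            const std::size_t end = css.find("*/", i + 2);
            i = (end == std::string::npos) ? css.size() : end + 2;
            out += ' ';
        } else {
            out += css[i++];
        }
    }
    return out;
}

// Return every selector that appears in front of a declaration block.
std::vector<std::string> extractSelectors(const std::string& input) {
    const std::string css = stripComments(input);
    std::vector<std::string> selectors;
    std::size_t i = 0;
    while (i < css.size()) {
        const char c = css[i];
        // State: between rules. A '}' here closes an enclosing @media block.
        if (std::isspace(static_cast<unsigned char>(c)) || c == '}' || c == ';') { ++i; continue; }
        // State: in a prelude (selector list or at-rule header).
        const std::size_t start = i;
        std::vector<std::string> parts;
        std::string cur;
        int nest = 0;  // depth inside () or [], where commas do not split
        while (i < css.size() && css[i] != '{' && css[i] != ';') {
            const char ch = css[i];
            if (ch == '"' || ch == '\'') {
                const std::size_t j = skipString(css, i);
                cur.append(css, i, j - i);
                i = j;
                continue;
            }
            if (ch == '(' || ch == '[') ++nest;
            if (ch == ')' || ch == ']') --nest;
            if (ch == ',' && nest == 0) { parts.push_back(trim(cur)); cur.clear(); }
            else cur += ch;
            ++i;
        }
        parts.push_back(trim(cur));
        if (i >= css.size()) break;            // prelude never reached a block
        if (css[i] == ';') { ++i; continue; }  // statement such as @import
        if (c == '@') {
            const bool grouping = css.compare(start, 6, "@media") == 0 ||
                                  css.compare(start, 9, "@supports") == 0;
            // Descend into grouping rules; skip any other at-rule block.
            i = grouping ? i + 1 : skipBlock(css, i);
            continue;
        }
        for (const std::string& p : parts)
            if (!p.empty()) selectors.push_back(p);
        i = skipBlock(css, i);  // State: declaration block, contents ignored
    }
    return selectors;
}
```

Each string returned by `extractSelectors` could then be handed to `doc.find()` in the same way as the selectors in the test cases above.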

What we need to find out is whether gumbo-query can handle all of the selectors available in CSS3.
Since there are no docs of any sort, we really just have to study the source code.

To make this easier, I will push everything to my own github tree: https://github.com/kevinhendricks/Sigil for anyone who wants to play around with it at all.

The gumbo-query code, modified to work with Sigil, will live in Sigil/src/Query/. For the time being the test code is run out of Sigil/src/main.cpp and can be played with there.

It is being built directly into Sigil for the time being, but we could easily change it to a standalone C++ shared or static library.

I will push what I have now in case anyone wants to play around with it.

If anyone is interested in gumbo, you might also want to check out Sigil/src/Misc/GumboInterface.cpp and GumboInterface.h, which is our C++/Qt based interface to the gumbo parsing C library.

If we like gumbo-query, I will integrate it with QString/Qt to make it very easy to use.


I just pushed all of these test changes to https://github.com/kevinhendricks/Sigil

KevinH