12-31-2020, 04:14 PM | #16 |
Wizard
Posts: 1,632
Karma: 724945
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
|
I skimmed the code a bit and I share @wrCisco's superficial impression of what it seems to be doing.
Cool! |
12-31-2020, 09:47 PM | #17 |
Sigil Developer
Posts: 7,727
Karma: 5444398
Join Date: Nov 2009
Device: many
|
Okay, it seems that the CSSInfo parser of Sigil does not handle combinators at all, nor pseudo classes, nor @media rules.
To properly test a css selector that uses adjacent, child, or descendant combinators, we need some kind of css selector based query or xpath-like interface for gumbo, Sigil's html5 repair parser. As far as I know, these simply do not exist in C++ or C. I will continue to search for one. The closest I can find is a jQuery-like interface for gumbo here: https://github.com/lazytiger/gumbo-query but it appears to be 5 years old with no real updates.
If I cannot find anything useful, we must then turn to python and its css-parser, cssselect, and lxml to do this properly. But that means we would pretty much be duplicating wrCisco's plugin, only internal to Sigil and using pyqt5 in place of tk. That seems to be wasteful duplication. Perhaps we should delete the unused class removal feature from Sigil and instead point people to wrCisco's plugin for that functionality completely.
Ideas? Thoughts? |
01-01-2021, 03:10 AM | #18 | |
Wizard
Posts: 1,632
Karma: 724945
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
|
I'm guessing you already evaluated Qt's CSS parser and determined it was unsuitable for the purpose?
|
01-01-2021, 06:41 AM | #19 |
Grand Sorcerer
Posts: 27,602
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Couldn't we incorporate (with his permission, of course) wrCisco's python code into Sigil's python3lib and use the C++ embedded python interface to access it, thus skipping the need to use PyQt at all for the gui? I'm not certain what else the existing plugin might provide, but even if we don't bring it entirely "in house" (eliminating the need for the third-party plugin altogether), surely we can come up with an interface to the portions we DO need to access via the embedded python interpreter, while still exposing those same absorbed parts to plugins via the plugin framework? Thus avoiding duplication.
|
01-01-2021, 09:59 AM | #20 | |
A Hairy Wizard
Posts: 3,119
Karma: 18727091
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Quote:
If wrCisco doesn't object, it seems like incorporating BOTH of those plugins into the same Sigil function (with all the appropriate user selections) would make sense.
As a very minor nit - the Remove Unused Selectors does not combine leftover CSS.
Last edited by Turtle91; 01-01-2021 at 10:01 AM. |
|
01-01-2021, 10:54 AM | #21 |
Sigil Developer
Posts: 7,727
Karma: 5444398
Join Date: Nov 2009
Device: many
|
Really, all of this depends on what wrCisco wants. But yes, we could use the python3lib direct interface and do the gui parts in qt.
As for Qt, there is no css parser for public use in Qt at all. They have their variant for qcss which we could extract and use but it is for their own version of css which is not compliant with css3. So there is no real point. The only other css parser in Qt is inside QWebEngine but it is very closely integrated with their internal DOM so extracting it for reuse would be a major major pain. I will take a look at the gumbo query code which would be nice to have anyway. If we can get that to work, we just need to better handle parsing css selectors to get what we need. |
01-01-2021, 10:58 AM | #22 |
Sigil Developer
Posts: 7,727
Karma: 5444398
Join Date: Nov 2009
Device: many
|
FWIW, combining CSS is not easy to do, especially without the specificity rule calculations for css. You cannot even sort the css selectors, as order is important. That is really the domain of a css optimizer.
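To illustrate the point about ordering, here is a minimal sketch, not Sigil code. It assumes a deliberately simplified model in which every matching selector has the same specificity, so the last declaration in source order wins. The names Rule, resolve and sorted_by_selector are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified rule: one selector, one property, one value.
struct Rule {
    std::string selector;
    std::string property;
    std::string value;
};

// Resolve a property for an element that matches every selector listed in
// `matching`. Assumption: all selectors have equal specificity, so the last
// applicable declaration in source order wins.
std::string resolve(const std::vector<Rule>& rules,
                    const std::vector<std::string>& matching,
                    const std::string& property) {
    std::string result;
    for (const Rule& r : rules) {
        bool applies = std::find(matching.begin(), matching.end(),
                                 r.selector) != matching.end();
        if (applies && r.property == property) {
            result = r.value;  // a later rule overrides an earlier one
        }
    }
    return result;
}

// Return a copy of the rules sorted alphabetically by selector text.
std::vector<Rule> sorted_by_selector(std::vector<Rule> rules) {
    std::stable_sort(rules.begin(), rules.end(),
                     [](const Rule& a, const Rule& b) {
                         return a.selector < b.selector;
                     });
    return rules;
}
```

For an element with class="note intro", a stylesheet that declares .note { color: red } and then .intro { color: blue } resolves to blue. After sorting by selector the same element resolves to red, which is why alphabetizing a stylesheet is not a safe transformation.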
|
01-01-2021, 11:20 AM | #23 |
Grand Sorcerer
Posts: 27,602
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
For the record: I'm also not entirely opposed to just ditching Sigil's "Delete Unused Style Classes" and letting plugins handle it. Seeing as how the existing plugin already does it better, that approach has a simplistic/minimalist aspect to it that appeals greatly to me, if I'm being totally honest. I'm just spitballing, here.
|
01-01-2021, 01:53 PM | #24 | |
just an egg
Posts: 1,597
Karma: 4798866
Join Date: Mar 2015
Device: Kindle, iOS
|
Quote:
|
|
01-01-2021, 05:36 PM | #25 |
Enthusiast
Posts: 34
Karma: 467802
Join Date: Apr 2016
Device: none
|
Well, the idea of integrating the plugin's code into Sigil intrigues me. I'm almost tempted to try to write the glue code myself, if I may...
(As to the practical problem of how to integrate the code: one would need to put the appropriate python module in the python3lib directory and then write one or more c++ methods for the PythonRoutines class, which could afterward be called from elsewhere in Sigil, right? QVariants as arguments seem sufficiently harmless if one restricts oneself to numbers and strings - or lists of numbers and strings - I guess...)
But the integration wouldn't be a completely straightforward change:
- The code of the builtin Delete Unused Classes is a sort of specialization of Sigil's Reports tool (which has the same flaws as the "Delete..." functionality in "Classes Used" and "CSS Classes"), so the code of the plugin would require some refactoring if we want to serve that as well.
- The python package cssselect, which now is only required to run plugins, will be required to run Sigil itself.
- There are two things in which the builtin Delete... is more thorough than the plugin: it looks for matches only in the xhtml files where a stylesheet is linked, and it collects selectors from <style> tags too. The plugin is more conservative: it always looks for matches in every xhtml and xml file, and never considers <style> tags. The philosophy has always been to stay on the safe side of the error: it's better to leave some useless cruft than to remove some useful code. This, however, would probably not be good enough for a complete report on usages (so, more refactoring, or accepting a suboptimal system).
As to integrating cssUndefinedClasses as well, I'm not opposed in principle, and I understand that from a user's point of view the two plugins do something very similar, but from the developer's point of view they are two different beasts (apart from a few GUI functions, the only thing that they have in common is the css parsing).
Combining rules, as Kevin pointed out, is in general trickier than it seems: it would be safe and not overly complicated only if the selectors were an exact match and the rules were one right after the other (as in the example of Turtle91, but that is the only case).
So, I'm not 100% sure of what would be the best course of action now: maybe wait to see what Kevin can squeeze out of the gumbo query library? |
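The one safe case described in the post above (identical selector text, rules directly adjacent) can be sketched roughly as follows. This is only an illustration under stated assumptions: CssRule and merge_adjacent are hypothetical names, the rules are assumed to be a flat list taken from the same context (no merging across @media boundaries), and selectors are compared verbatim rather than normalized.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flat representation of one style rule.
struct CssRule {
    std::string selector;      // selector text, compared verbatim
    std::string declarations;  // e.g. "color: red; margin: 0;"
};

// Merge a rule into its predecessor only when the selector text is identical
// and the two rules are directly adjacent. Declarations are appended in
// source order, so later declarations still override earlier ones.
std::vector<CssRule> merge_adjacent(const std::vector<CssRule>& rules) {
    std::vector<CssRule> out;
    for (const CssRule& r : rules) {
        if (!out.empty() && out.back().selector == r.selector) {
            std::string& d = out.back().declarations;
            if (!d.empty() && d.back() != ';') d += ';';  // close last declaration
            if (!d.empty()) d += ' ';
            d += r.declarations;
        } else {
            out.push_back(r);
        }
    }
    return out;
}
```

Two consecutive p rules collapse into one, while a later p rule that follows an h1 rule is left alone, since moving it could change which declaration wins.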
01-01-2021, 06:48 PM | #26 | |
Sigil Developer
Posts: 7,727
Karma: 5444398
Join Date: Nov 2009
Device: many
|
Please do take a shot at the glue code if you have any interest. Yes, QVariants are used to pass things. And PythonRoutines is an example of how we do the interface with the embedded python, at least for a few functions. A few places in the code skip PythonRoutines, but not many.
The embedded python interface/bridge does have a few restrictions: it will not pass maps, but it will pass lists of strings, lists of lists, a pointer to a python Object, etc. To pass maps, I pass a keys list and a values list and build the map on the other side of the bridge. Just let me know if you have any questions.
Having cssselect be required for Sigil to function is not really an issue, as all of lxml is already required for parsing and fixing pure xml files like the opf and the ncx during Sigil start-up.
That said, if we can get a gumbo based query working, we can instead focus on a better selector parser in C++, or even do the parsing in python in css_parser and pass back what to query for. There are lots of ways we could take this. I will try to get gumbo-query building and running with our sigil specific gumbo and at least try a query to see if its jQuery-like selector interface works.
Thanks,
KevinH
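The keys-list/values-list workaround described in this post can be sketched as below. This is a standalone illustration, not the Sigil bridge itself: it uses std::vector and std::map as stand-ins for the Qt list and map types, and split_map and join_lists are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using StringList = std::vector<std::string>;
using StringMap = std::map<std::string, std::string>;

// Sending side: flatten a map into two parallel lists, which a bridge that
// only understands lists can carry.
std::pair<StringList, StringList> split_map(const StringMap& m) {
    StringList keys;
    StringList values;
    for (const auto& kv : m) {
        keys.push_back(kv.first);
        values.push_back(kv.second);
    }
    return {keys, values};
}

// Receiving side: rebuild the map from the two lists. If the lists differ in
// length, the extra entries of the longer one are ignored.
StringMap join_lists(const StringList& keys, const StringList& values) {
    StringMap m;
    const std::size_t n = std::min(keys.size(), values.size());
    for (std::size_t i = 0; i < n; ++i) {
        m[keys[i]] = values[i];
    }
    return m;
}
```

A round trip through split_map and join_lists gives back the original map, which is all the workaround needs to guarantee.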
|
|
01-02-2021, 01:33 PM | #27 |
Sigil Developer
Posts: 7,727
Karma: 5444398
Join Date: Nov 2009
Device: many
|
Okay, just to see if basic queries work on gumbo, I have integrated gumbo-query into Sigil (only locally in my tree), fixed it to work with our version of sigilgumbo, and ran the following test cases:
Code:
    #if TEST_GUMBO_QUERY
        // Descendant combinator plus class selector.
        if (1) {
            std::string page("<h1><a>wrong link</a><a class=\"special\"\\>some link</a></h1>");
            CDocument doc;
            doc.parse(page.c_str());
            CSelection c = doc.find("h1 a.special");
            CNode node = c.nodeAt(0);
            printf("Node: %s\n", node.text().c_str());
            std::string content = page.substr(node.startPos(), node.endPos()-node.startPos());
            printf("Node: %s\n", content.c_str());
        };
        // Plain type selector.
        if (1) {
            std::string page = "<html><div><span>1\n</span>2\n</div></html>";
            CDocument doc;
            doc.parse(page.c_str());
            CNode pNode = doc.find("div").nodeAt(0);
            std::string content = page.substr(pNode.startPos(), pNode.endPos() - pNode.startPos());
            printf("Node: #%s#\n", content.c_str());
        };
        // Attribute selector with a quoted value.
        if (1) {
            std::string page = "<html><div><span id=\"that's\">1\n</span>2\n</div></html>";
            CDocument doc;
            doc.parse(page.c_str());
            CNode pNode = doc.find("span[id=\"that's\"]").nodeAt(0);
            std::string content = page.substr(pNode.startPos(), pNode.endPos() - pNode.startPos());
            printf("Node: #%s#\n", content.c_str());
        };
        // Descendant combinator.
        if (1) {
            std::string page("<h1><a>some link</a></h1>");
            CDocument doc;
            doc.parse(page.c_str());
            CSelection c = doc.find("h1 a");
            std::cout << c.nodeAt(0).text() << std::endl; // some link
        }
    #endif
So all we need now is the ability to extract and better parse the selector rules themselves. We could do that in css-parser via a python interface, or we could try to write a better, simpler css parser ourselves (along the lines of what we did for xhtml parsing with QuickParser.cpp).
It would not be difficult to parse css in c++, basically with a simple state machine that looks for the first occurrence of non-whitespace and then checks for special chars like "@", ";", "{" and "}" to determine the state, with state specific parsing. It need not validate and really just needs to properly generate the selector rules.
What we need to see is whether gumbo-query can handle all of the css selectors available in css3. Since there are no docs of any sort, we really just have to study the source code.
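As a rough idea of what the state machine described above could look like, here is a minimal, non-validating sketch. It is not QuickParser or any existing Sigil code. The function name extract_selectors is hypothetical, and the sketch assumes that no quoted string in the stylesheet contains a brace or semicolon. It descends into @media and @supports groups, skips other at-rule blocks, and returns each selector list as one untouched string (so "h1, h2" is not split).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Trim ASCII whitespace from both ends of a string.
static std::string trim(const std::string& s) {
    std::size_t b = 0;
    std::size_t e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// Collect the selector text of every style rule, in source order.
std::vector<std::string> extract_selectors(const std::string& css) {
    enum class State { Prelude, Block };
    State state = State::Prelude;
    int depth = 0;        // brace depth while skipping a block
    std::string prelude;  // text gathered since the last rule ended
    std::vector<std::string> selectors;

    for (std::size_t i = 0; i < css.size(); ++i) {
        // Comments are skipped in every state.
        if (css.compare(i, 2, "/*") == 0) {
            std::size_t end = css.find("*/", i + 2);
            if (end == std::string::npos) break;  // unterminated comment
            i = end + 1;
            continue;
        }
        const char c = css[i];
        if (state == State::Block) {
            // Inside a declaration block (or a skipped at-rule block):
            // only track braces until the block closes.
            if (c == '{') {
                ++depth;
            } else if (c == '}' && --depth == 0) {
                state = State::Prelude;
            }
            continue;
        }
        if (c == ';' || c == '}') {
            // ';' ends a statement at-rule such as @import or @charset;
            // '}' ends an @media or @supports group.
            prelude.clear();
        } else if (c == '{') {
            const std::string p = trim(prelude);
            prelude.clear();
            const bool group = p.compare(0, 6, "@media") == 0 ||
                               p.compare(0, 9, "@supports") == 0;
            if (group) continue;  // rules inside the group are parsed normally
            if (!p.empty() && p[0] != '@') selectors.push_back(p);
            state = State::Block;
            depth = 1;
        } else {
            prelude += c;
        }
    }
    return selectors;
}
```

The selector strings this produces are the kind of input a selector query such as "h1 a.special" in the tests above would take. Splitting comma-separated lists and handling quoted strings would be the obvious next steps for real use.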
To make this easier, I will push everything to my own github tree: https://github.com/kevinhendricks/Sigil for anyone who wants to play around with it. The gumbo-query code, modified to work with Sigil, will live in Sigil/src/Query/ and the test code is, for the time being, run out of Sigil/src/main.cpp and can be played with there. It is being built directly into Sigil for now, but we could easily change it to a standalone c++ shared or static library.
If anyone is interested in gumbo, you might also want to check out Sigil/src/Misc/GumboInterface.cpp / .h which is our C++/Qt based interface to the gumbo parsing c library. If we like gumbo-query, I will integrate it with QString/Qt to make it very easy to use.
I just pushed all of these test changes to https://github.com/kevinhendricks/Sigil
KevinH |
01-02-2021, 03:34 PM | #28 |
Sigil Developer
Posts: 7,727
Karma: 5444398
Join Date: Nov 2009
Device: many
|
From eyeballing the CSelector.cpp and CParse.cpp code in Sigil/src/Query/ it appears that gumbo-query handles all of the combinators and many pseudo classes and pseudo elements, so gumbo-query can be a valuable addition for Sigil even on its own. So I will integrate a find_by_selector() method directly into our GumboInterface class so that css selectors can be used to find gumbo nodes.
|
01-02-2021, 03:56 PM | #29 | |
Wizard
Posts: 1,632
Karma: 724945
Join Date: Oct 2014
Location: Antwerp
Device: Kobo Aura H2O
|
Quote:
https://github.com/lazytiger/gumbo-q....cpp#L382-L523
Btw, if you try something like this (p:first-child:first-of-type) it gives you a segmentation fault:
Code:
    void test_html()
    {
        std::string page = "<html><div class=\"chapter\"><p class=\"flush\">"
                           "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
                           "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"
                           "</p><p>second child</p></div></html>";
        CDocument doc;
        doc.parse(page.c_str());
        CNode pNode = doc.find(".chapter > p:first-child:first-of-type").nodeAt(0);
        std::string content = page.substr(pNode.startPos(), pNode.endPos() - pNode.startPos());
        printf("Node: #%s#\n", content.c_str());
    }
|
|
01-02-2021, 04:17 PM | #30 |
Sigil Developer
Posts: 7,727
Karma: 5444398
Join Date: Nov 2009
Device: many
|
We just have to check whether the vector of Selections or GumboNode * that was found is empty. The code above will not be the interface we employ; instead we will add a find_by_selector routine to our existing GumboInterface class and remove the use of CDocument and the other query wrapper code completely.
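The empty-result check mentioned above might look something like this sketch. The types here are hypothetical stand-ins, not the gumbo-query or GumboInterface API: Node stands in for whatever wraps a GumboNode *, and Selection for a query result. The sketch only shows the guard, and assumes the crash comes from indexing an empty result rather than from the selector parser itself.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a matched node (real code would hold a GumboNode *).
struct Node {
    std::string text;
};

// Hypothetical stand-in for a query result.
struct Selection {
    std::vector<Node> nodes;

    // Bounds-checked access: an out-of-range index yields no node instead of
    // reading past the end of the vector.
    std::optional<Node> nodeAt(std::size_t i) const {
        if (i >= nodes.size()) return std::nullopt;
        return nodes[i];
    }
};

// Return the text of the first match, or `fallback` when nothing matched.
std::string first_text_or(const Selection& sel, const std::string& fallback) {
    const std::optional<Node> node = sel.nodeAt(0);
    return node ? node->text : fallback;
}
```

With a guard like this, a selector that matches nothing produces the fallback value instead of a crash at the call site.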
Of course, if the segfault happens in CParser, we would need to harden it.
Last edited by KevinH; 01-02-2021 at 04:49 PM. |
|