Proposal: CSS "normalisation" functionality in Sigil

Man Eating Duck · 08-14-2013, 04:34 AM

This suggestion is related to issue #2019 (http://code.google.com/p/sigil/issues/detail?id=2019); however, I propose a few additional features. I'm aware that this might be a serious undertaking, but I'm willing to do whatever I can to help. My coding skills are rudimentary, but I intend to see what I can do. If the maintainers are averse to the whole idea I will rethink it, otherwise any suggestions and tips are *very* welcome.

Background: a very common use case for me is that I want to clean up purchased books, and from what I see in forums and articles I'm not alone. A surprising portion of commercial ebooks have atrocious code quality, due to some combination of: incompetence on the part of the creator, too much reliance on automated tools, and bad source material.

A tedious first step is to locate and gather style declarations in one place. A set of tools to gather and normalize all style information in one css file would be of great help.

As is mentioned in #2019, some of this might be done with regex*. This is far from optimal, as you really need a proper XML parser to parse xhtml, which I believe Sigil has access to in Xerces. I propose (and would like to assist with) a new set of tools, some of which are also suggested in #2019. These tools would use the proper xml parsing engine:

1: convert inline styling to css classes and move them to a style file, i.e. <span style="font-style:italic"> -> <span class="sgc-1"> with corresponding css span.sgc-1 {font-style:italic;}. This could use existing styles with identical declarations if they exist, or just consistently add new style classes to the css file to be cleaned up later.

2: move embedded styles in html documents (i.e. <style type="text/css">) to a css file, renaming classes in case of conflicts.

3: Merge css files. This would combine multiple (all?) css files into one, renaming conflicting class names as necessary. Style occurrences in XHTML would naturally also need to be renamed.

Now all styling information should be present in only one css file.

4: Merge identical styles into a single instance. The above steps would very likely leave the css file with heaps of styles with identical declarations. These should be merged into a single style for each unique set of declarations and tag types. This step *might* also include a set of predefined class names, such that a class with only an italic declaration would be named "italic". I don't know if there is any point to the latter, it might be better to leave renaming to the user who can more reliably judge what a class is intended to do.

5: Provide some tools to rename and delete css styles with corresponding tags as outlined in #2019, as well as converting tags with specific classes to other tags (<p class="h1"> -> <h1>).

Steps 1-4 are intended to be relatively non-destructive, retaining all information and leaving all actual tags in place. You would lose distinctions between different classes with identical declarations in step 4, but I seldom see semantic information stored in css classes anyway, they're often just randomly generated from some editing tool or another. 1-4 could potentially be done in a single operation (for instance with a dialog with checkboxes for each step). In step five the user would actually make changes to declarations, along with renaming and removing classes and tags. This might of course be destructive, but any loss of distinctions between similar classes used for different purposes would always be initiated by the user.
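To make step 1 concrete, here is a minimal sketch in Python rather than in Sigil's own C++/Xerces code. It assumes a well-formed fragment without XML namespaces, and the function name `extract_inline_styles` and the `sgc-` prefix handling are illustrative choices, not anything Sigil provides.

```python
import xml.etree.ElementTree as ET

def extract_inline_styles(xhtml, prefix="sgc-"):
    """Replace style="" attributes with generated classes.

    Returns (new_xhtml, css_text). Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xhtml)
    classes = {}  # (tag, normalised declarations) -> generated class name
    for elem in root.iter():
        style = elem.attrib.pop("style", None)
        if style is None:
            continue
        # Normalise whitespace and trailing semicolons so that
        # textually different but identical styles share one class.
        decls = ";".join(
            ":".join(part.strip() for part in d.split(":", 1))
            for d in style.split(";") if d.strip()
        )
        key = (elem.tag, decls)
        if key not in classes:
            classes[key] = f"{prefix}{len(classes) + 1}"
        existing = elem.get("class")
        elem.set("class", f"{existing} {classes[key]}" if existing else classes[key])
    css = "\n".join(
        f"{tag}.{name} {{{decls};}}" for (tag, decls), name in classes.items()
    )
    return ET.tostring(root, encoding="unicode"), css

new_xhtml, css = extract_inline_styles(
    '<p><span style="font-style: italic;">a</span> '
    '<span style="font-style:italic">b</span></p>'
)
print(new_xhtml)
print(css)
```

One caveat worth keeping in mind: an inline style outranks a class rule in the cascade, so a conversion like this can change rendering if another rule targets the same element with higher specificity than the generated class.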

* Apart from regex being complicated to grasp for casual users, it is also theoretically impossible to reliably parse html with regex. I won't go into much detail, but a trivial example:

<p>
<span class="empty">A paragraph with <span class="italic">italics</span> in it.</span>
</p>

I've actually seen this very structure in the wild, with a corresponding .empty{}. If you want to remove the useless "empty" spans, an intuitive regex might be something like (?U)<span class="empty">(.*)</span>, replace with \1. In the example above this would extend the italic span to encompass the rest of the paragraph.
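The failure can be reproduced in a few lines. This is a sketch in Python, whose `re` module has no ungreedy flag, so a lazy `(.*?)` stands in for the ungreedy `(.*)`; the behaviour for this input is the same.

```python
import re

html = ('<p> <span class="empty">A paragraph with '
        '<span class="italic">italics</span> in it.</span> </p>')

# The lazy group stops at the FIRST </span>, which closes the inner
# italic span rather than the outer "empty" one.
naive = re.sub(r'<span class="empty">(.*?)</span>', r'\1', html)
print(naive)
# -> <p> A paragraph with <span class="italic">italics in it.</span> </p>
```

The result is still well-formed, which is what makes the mistake easy to miss: the closing tag of the "empty" span has silently become the closing tag of the italic span.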

DiapDealer · 08-14-2013, 07:32 AM

Quote:

Originally Posted by Man Eating Duck

* Apart from regex being complicated to grasp for casual users, it is also theoretically impossible to reliably parse html with regex. I won't go into much detail, but a trivial example:

<p>
<span class="empty">A paragraph with <span class="italic">italics</span> in it.</span>
</p>

I've actually seen this very structure in the wild, with a corresponding .empty{}. If you want to remove the useless "empty" spans, an intuitive regex might be something like (?U)<span class="empty">(.*)</span>, replace with \1. In the example above this would extend the italic span to encompass the rest of the paragraph.

Which is why you would include the closing </p> in the match to make sure you only got the all-encompassing span.

Code:

(?U)<span class="empty">(.*)</span>\s+</p>

Replace with: \1\n

I'm not arguing that a true parser wouldn't do a more effective (safer) job. It would. I just don't think it would be a very simple task to provide an end user with a configurable, flexible interface to the parser in order to inform it of their desires (without actually writing code themselves).

cybmole · 08-14-2013, 08:53 AM

So how does this improve on simply running a book through calibre's epub to epub conversion - At 1st glance that already seems to do almost all of what you specified ?

why reinvent the wheel ?

Man Eating Duck · 08-14-2013, 09:03 AM

Quote:

Originally Posted by DiapDealer

Which is why you would include the closing </p> in the match to make sure you only got the all-encompassing span.

Code:

(?U)<span class="empty">(.*)</span>\s+</p>

Replace with: \1\n

I should have known that including a regex example, even in a footnote, would be distracting and a bad idea

Quote:

Originally Posted by DiapDealer

I'm not arguing that a true parser wouldn't do a more effective (safer) job. It would. I just don't think it would be a very simple task to provide an end user with a configurable, flexible interface to the parser in order to inform it of their desires (without actually writing code themselves).

This is not only an attempt to avoid HTML/Regex madness, but to semi-automate a set of tasks I find myself doing manually over and over again. It might also be helpful for the people creating the epubs in the first place.

theducks · 08-14-2013, 09:19 AM

I would love to see the REGEX (or any embedded tool) that can deal with matching up closing tags. (My REGEX-fu is basic)

The example only works for the case shown

I have seen (IMHO Word Processor? garbage)

Code:

<p>
<span class="normal"><span>A paragraph with</span></span> <span class="italic">italics</span> <span class="normal"><span>in it.</span></span>
</p>

If you run the cleanup above, you end up with broken first and last spans

DiapDealer · 08-14-2013, 09:20 AM

Quote:

Originally Posted by Man Eating Duck

I should have known that including a regex example, even in a footnote, would be distracting and a bad idea

Sorry, I couldn't resist.
Especially considering your use of the phrase "an intuitive regex." I just found it amusing that the solution to the problem you presented was one that was very intuitive for me.

Quote:

Originally Posted by Man Eating Duck

This is not only an attempt to avoid HTML/Regex madness, but to semi-automate a set of tasks I find myself doing manually over and over again. It might also be helpful for the people creating the epubs in the first place.

Understood. I hope you get your wish -- as long as it doesn't break something else I rely on.

Quote:

Originally Posted by theducks

If you run the cleanup above, you end up with broken first and last spans

Yep. You'd have to run two or three passes to clean that up:

Code:

<span>([^>]*?)</span>

followed by

Code:

<span class="normal">([^>].*?)</span>

Both replaced with \1
Provided the spans were being applied consistently, of course.
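For reference, the two passes can be checked outside Sigil. This is only a sketch using Python's `re` module on the sample paragraph quoted above; the patterns are copied as given, and Sigil's own regex engine may differ in other respects.

```python
import re

html = (
    '<p>\n'
    '<span class="normal"><span>A paragraph with</span></span> '
    '<span class="italic">italics</span> '
    '<span class="normal"><span>in it.</span></span>\n'
    '</p>'
)

# Pass 1: unwrap the attribute-less inner spans.
pass1 = re.sub(r'<span>([^>]*?)</span>', r'\1', html)
# Pass 2: unwrap the class="normal" spans, which now hold only text.
pass2 = re.sub(r'<span class="normal">([^>].*?)</span>', r'\1', pass1)
print(pass2)
```

The order matters: pass 2 is only safe once pass 1 has left the "normal" spans free of nested spans, otherwise the lazy `.*?` would pair the opening tag with the wrong `</span>`.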

Sorry ... last hijack. I promise.

Man Eating Duck · 08-14-2013, 09:41 AM

Quote:

Originally Posted by cybmole

So how does this improve on simply running a book through calibre's epub to epub conversion - At 1st glance that already seems to do almost all of what you specified ?

why reinvent the wheel ?

Even if the conversion provides a lot of options, you have little control over what calibre actually does under the hood. In my experience converting epub -> epub almost always has unintended consequences and is not .

More importantly, the intention is to add functionality for simplifying handling css classes in a "style-like" way to Sigil, as it would make it more useful as an epub editor IMO. Other tools like calibre, Dreamweaver or even a custom-made script can probably also do these transformations in a "proper" DOM-based manner, but being able to do them in Sigil would simplify that workflow and hopefully be useful to a lot of people, not only me.

My proposal is also just an idea, maybe a generalised interface to manipulate the DOM would be more flexible and useful in the general case, but I have no idea how it could/should be implemented.

Man Eating Duck · 08-14-2013, 10:07 AM

Quote:

Originally Posted by theducks

I would love to see the REGEX (or any embedded tool) that can deal with matching up closing tags. (My REGEX-fu is basic)

The example only works for the case shown

I have seen (IMHO Word Processor? garbage)

Code:

<p>
<span class="normal"><span>A paragraph with</span></span> <span class="italic">italics</span> <span class="normal"><span>in it.</span></span>
</p>

If you run the cleanup above, you end up with broken first and last spans

Yes, the point is that while regex *can* be useful to do very basic things with html, it is fundamentally incapable of "parsing" it in a reliable manner, which you'll see mentioned all over the internet if you google it. XML parsers, on the other hand, can do it properly. Cleaning up the mess left behind by epub generating tools and poor html understanding is the main purpose of my idea, and using the XML library included in Sigil is one way to approach it.
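As an illustration of the parser-based approach, here is a minimal sketch using Python's standard `xml.etree.ElementTree` instead of Xerces. The helper names `unwrap` and `strip_spans` and the `useless_classes` parameter are hypothetical, and the input is assumed to be a well-formed fragment without namespaces.

```python
import xml.etree.ElementTree as ET

def unwrap(parent, child):
    """Remove `child` from `parent` but keep its text and children in place."""
    index = list(parent).index(child)
    kids = list(child)

    def attach(text, before_index):
        # Append text to whatever node precedes position `before_index`.
        if not text:
            return
        if before_index == 0:
            parent.text = (parent.text or "") + text
        else:
            prev = parent[before_index - 1]
            prev.tail = (prev.tail or "") + text

    parent.remove(child)
    attach(child.text, index)               # text at the start of the span
    for offset, kid in enumerate(kids):     # re-home the span's children
        parent.insert(index + offset, kid)
    attach(child.tail, index + len(kids))   # text that followed the span

def strip_spans(elem, useless_classes):
    """Unwrap spans that have no attributes or only a 'useless' class."""
    for child in list(elem):
        strip_spans(child, useless_classes)  # innermost spans first
        if child.tag != "span":
            continue
        attrs = child.attrib
        only_class = set(attrs) == {"class"} and attrs["class"] in useless_classes
        if not attrs or only_class:
            unwrap(elem, child)

root = ET.fromstring(
    '<p><span class="normal"><span>A paragraph with</span></span> '
    '<span class="italic">italics</span> '
    '<span class="normal"><span>in it.</span></span></p>'
)
strip_spans(root, {"normal"})
print(ET.tostring(root, encoding="unicode"))
```

Because the tree already knows which closing tag belongs to which span, the same call handles both this sample and the nested "empty" example in a single pass, with no dependence on how consistently the spans were applied.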

Edit: I see that the regex store already delivered on this one as well. Maybe we could just drop implementing the features, and instead have Sigil submitting the files to DiapDealer for regex'ing?

DiapDealer · 08-14-2013, 11:06 AM

Sorry. It's a disease I tell ya! I can't stop myself.

(more [on topic, I promise] later when I don't have to "type" on a tiny screen)

DiapDealer · 08-14-2013, 02:52 PM

On the subject of using a true "parser" vs regex to modify x?(ht)?ml:
(I know you've said your request isn't JUST about avoiding Html/Regex madness and its inability to "parse" the code but I still need to comment briefly [Ha!] on just that)

I've taken part (in the past) in the parser vs regex debate that rages continually all around the internet, but I honestly think that the "regex can't 'parse' html" argument, while true, isn't really saying a whole big bunch, when it's all said and done. It's almost--but not quite--apples and oranges. At best, it's a bit disingenuous... at worst, it's programmer snobbery ("Go away kid. You need to use a parser.").

While it's perfectly true that regex can't "parse" markup (because it doesn't understand the syntax), that fact isn't really very relevant to a neophyte asking, "what regex can I use to do X?" For the experts to tell those DIYers, "Parser! You must use a parser!" is, for all practical purposes, the same thing as telling that person, "you can't do what you want to do. Not today, anyway. Give it up."

Why? Because they can't just download a "parser" and tell it to fix their code. They have to create some sort of interface using a programming language they don't know how to use. And chances are, even if they figure that part out, they're not going to be able to integrate it into the application they actually want to do the editing in. So they're either going to have to give up or beg the application's devs to do it for them. Which even if you have the nicest, most accommodating devs in the world, ain't going to happen overnight (and you'll still be at their mercy for any updates, tweaks and new requests to be added to the miracle parser's bag o' tricks).

So for the DIYer whose coding skills are somewhat limited, I still say regex has the most bang for their buck. It's already included in a crap-ton of editing applications, and as long as you understand that it's bone-dead stupid about syntax, you can start to figure out how to think FOR it. Sure you may not be able to do everything you want to do with the click of a button, but with a good plan and a multi-pass approach if necessary (which Sigil makes painless), I still say there's very little you can't do with it that a parser can. And if you struggle mightily to get a handle on regex, I hear there's people on the internet that like the challenge of finding regex solutions. So even the uninitiated can get pretty quick results.

Anyway, back to the thread! Sorry for the diatribe. I promise this time I'm done.

Man Eating Duck · 08-14-2013, 07:05 PM

Quote:

Originally Posted by DiapDealer

On the subject of using a true "parser" vs regex to modify x?(ht)?ml:
<diatribe>.*</diatribe>

<Offtopic>
I love regex, it's a wonderful tool in my belt, and I use it all the time for everything from Indesign styling to converting SQL schemas between RDBMS varieties. I have used it almost daily at my place of employment for more than ten years, and consider myself pretty proficient. I also know its limitations. The much regretted example was something you (well, obviously not you or I, but you understand what I mean) might expect to work, but it doesn't.
</Offtopic>

This is almost completely irrelevant to what I want to accomplish, though, which is to do all these operations *moving css definitions* in bulk, including between file types, with the click of a button. Without humans crafting any queries whatsoever, be it xpath or regex. When I just want to get on with reading the damn book, large parts of this cleaning are just a chore, better left to computer logic. Sigil does include the XML parser Xerces, but no CSS parser I could spot in the source, although there is some regex-based logic which parses at least parts of CSS (like class names). I will make a serious attempt at implementing what I mention, but whether I can do much that is useful remains to be seen.

Jellby · 08-15-2013, 04:03 AM

Cleaning up CSS automatically may be considerably difficult if the stylesheet includes "advanced" selectors or rules like children, adjacent siblings, precedence, multiple classes, etc.

But there could be an option like the "flatten CSS" used by calibre, where every defined selector or combination is converted into a class (I believe that's what it does), then it's easier to "simplify", but you lose much of the charm of CSS. Nevertheless, the coding of most books needing a cleanup has no charm at all, so there's not much to lose.

cybmole · 08-15-2013, 06:43 AM

the "logic" for detecting functionally identical classes would be ridiculously complicated, because there are so many ways to say the same thing in shorthand. e.g. margin can be specified in up to 4 separate lines: margin-top, margin-bottom... or in a single margin line with 4 parameters.

the best you could probably hope for is detecting identical blocks of definitions; even coping with the individual css lines being in a different order within different definitions would be challenging

there are "master sets" of "in-house" CSS out there, with a definition for everything you could possibly need - inspect any book that uses the adobe approach - but I much prefer to see only what is actually needed for the given book, within the style sheet, not a load of excess stuff.
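For what it's worth, declaration order and the margin shorthand specifically are fairly tractable. Below is a rough Python sketch, with hypothetical helper names (`expand_margin`, `normalise`, `group_identical`), that expands `margin` into its four longhands and compares declaration blocks as unordered sets. It deliberately ignores other shorthands, `!important`, and equivalent values such as `0` versus `0px`.

```python
from collections import defaultdict

def expand_margin(value):
    """Expand a `margin` shorthand value into its four longhand properties."""
    parts = value.split()
    if len(parts) == 1:
        top = right = bottom = left = parts[0]
    elif len(parts) == 2:
        top = bottom = parts[0]
        right = left = parts[1]
    elif len(parts) == 3:
        top, right, bottom = parts
        left = right
    elif len(parts) == 4:
        top, right, bottom, left = parts
    else:
        raise ValueError(f"unexpected margin value: {value!r}")
    return {"margin-top": top, "margin-right": right,
            "margin-bottom": bottom, "margin-left": left}

def normalise(block):
    """Turn 'a: b; c: d' into an order-independent set of (property, value)."""
    decls = {}
    for decl in block.split(";"):
        if ":" not in decl:
            continue
        prop, value = decl.split(":", 1)
        prop, value = prop.strip().lower(), value.strip()
        if prop == "margin":
            decls.update(expand_margin(value))
        else:
            decls[prop] = value  # a later declaration overrides an earlier one
    return frozenset(decls.items())

def group_identical(rules):
    """rules maps selector -> declaration text; returns groups of duplicates."""
    groups = defaultdict(list)
    for selector, block in rules.items():
        groups[normalise(block)].append(selector)
    return [selectors for selectors in groups.values() if len(selectors) > 1]

rules = {
    "p.a": "margin: 0 1em; text-indent: 0",
    "p.b": "text-indent:0; margin-top:0; margin-right:1em; "
           "margin-bottom:0; margin-left:1em;",
    "p.c": "margin: 0",
}
print(group_identical(rules))
```

Note that this groups on declarations alone; a real merge would also have to check that the selectors apply to the same tag type before collapsing them into one class.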

Man Eating Duck · 08-15-2013, 08:54 AM

Quote:

Originally Posted by Jellby

Cleaning up CSS automatically may be considerably difficult if the stylesheet includes "advanced" selectors or rules like children, adjacent siblings, precedence, multiple classes, etc.

But there could be an option like the "flatten CSS" used by calibre, where every defined selector or combination is converted into a class (I believe that's what it does), then it's easier to "simplify", but you lose much of the charm of CSS. Nevertheless, the coding of most books needing a cleanup has no charm at all, so there's not much to lose.

Yes, it's more complicated than I thought at first, but I don't see a lot of it in actual use (IIRC I've seen multiple classes in one attribute in one of your books, but they don't need cleanup). As long as I don't actually change the definitions in any classes, and always leave (or add) a class in place of style="" attributes, it should be functionally identical. Advanced selectors are better left alone to start with. The first task will be to gather all definitions in one place, but I'll also have a look at how calibre flattens css.

Man Eating Duck · 08-15-2013, 09:22 AM

Quote:

Originally Posted by cybmole

the "logic" for detecting functionally identical classes would be ridiculously complicated, because there are so many ways to say the same thing in shorthand. e.g. margin can be specified in up to 4 separate lines: margin-top, margin-bottom... or in a single margin line with 4 parameters.

the best you could probably hope for is detecting identical blocks of definitions; even coping with the individual css lines being in a different order within different definitions would be challenging

Yes, first priority is to gather all css definitions in one place (a css file). That wouldn't necessarily require advanced css parsing. Merging css classes is more complicated and will probably require a parser. Unfortunately it doesn't seem that Qt provides an API for the webkit css parser, which would have been great.

Quote:

Originally Posted by cybmole

there are "master sets" of "in-house" CSS out there, with a definition for everything you could possibly need - inspect any book that uses the adobe approach - but I much prefer to see only what is actually needed for the given book, within the style sheet, not a load of excess stuff.

If extraneous classes bother you: Sigil already has a "Delete unused stylesheet classes" function, have you tried it?

