MobileRead Forums - View Single Post - Proposal: CSS "normalisation" functionality in Sigil

Man Eating Duck · 08-14-2013, 04:34 AM

This suggestion is related to issue #2019 http://code.google.com/p/sigil/issues/detail?id=2019, although I propose a few additional features. I'm aware that this might be a serious undertaking, and I'm willing to do whatever I can to help. My coding skills are rudimentary, but I intend to see what I can do. If the maintainers are averse to the whole idea I will rethink it; otherwise any suggestions and tips are *very* welcome.

Background: a very common use case for me is cleaning up purchased books, and from what I see in forums and articles I'm not alone. A surprising portion of commercial ebooks have atrocious code quality, due to some combination of incompetence on the part of the creator, too much reliance on automated tools, and bad source material.

A tedious first step is to locate and gather style declarations in one place. A set of tools to gather and normalize all style information in one css file would be of great help.

As is mentioned in #2019, some of this might be done with regex*. This is far from optimal, as you really need a proper XML parser to parse xhtml, which I believe Sigil has access to in Xerces. I propose (and would like to assist with) a new set of tools, some of which are also suggested in #2019. These tools would use the proper xml parsing engine:

1: convert inline styling to css classes and move them to a style file, e.g. <span style="font-style:italic"> -> <span class="sgc-1"> with corresponding css span.sgc-1 {font-style:italic;}. This could use existing styles with identical declarations if they exist, or just consistently add new style classes to the css file to be cleaned up later.

2: move embedded styles in html documents (i.e. <style type="text/css">) to a css file, renaming classes in case of conflicts.

3: Merge css files. This would combine multiple (all?) css files into one, renaming conflicting class names as necessary. Style occurrences in XHTML would naturally also need to be renamed.

Now all styling information should be present in only one css file.

4: Merge identical styles into a single instance. The above steps would very likely leave the css file with heaps of styles with identical declarations. These should be merged into a single style for each unique set of declarations and tag types. This step *might* also include a set of predefined class names, such that a class with only an italic declaration would be named "italic". I don't know if there is any point to the latter, it might be better to leave renaming to the user who can more reliably judge what a class is intended to do.

5: Provide some tools to rename and delete css styles with corresponding tags as outlined in #2019, as well as converting tags with specific classes to other tags (<p class="h1"> -> <h1>).
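To make step 1 a bit more concrete, here is a minimal sketch in Python (not Sigil code, and not how Sigil would necessarily do it). It assumes well-formed XHTML, uses the standard library's ElementTree in place of Xerces, and splits style attributes naively on ";". The function names and the "sgc-" prefix default are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def normalise_declarations(style):
    """Strip whitespace so 'font-style: italic;' equals 'font-style:italic'.

    Assumption: no ';' inside values (e.g. data URIs). Declaration order is
    kept as-is, because reordering could change which declaration wins.
    """
    pairs = []
    for decl in style.split(";"):
        if ":" not in decl:
            continue
        prop, value = decl.split(":", 1)
        pairs.append((prop.strip().lower(), value.strip()))
    return "".join(f"{p}:{v};" for p, v in pairs)


def extract_inline_styles(xhtml, prefix="sgc-"):
    """Replace style="..." attributes with generated classes.

    Returns (new_xhtml, css_text). Elements with the same tag and the same
    declarations share one class.
    """
    ET.register_namespace("", XHTML_NS)  # serialise without ns0: prefixes
    root = ET.fromstring(xhtml)
    classes = {}  # (tag, declarations) -> generated class name
    for el in root.iter():
        style = el.attrib.pop("style", None)
        if style is None:
            continue
        decls = normalise_declarations(style)
        if not decls:
            continue
        tag = el.tag.split("}")[-1]  # drop the '{namespace}' part
        name = classes.setdefault((tag, decls), f"{prefix}{len(classes) + 1}")
        existing = el.get("class")
        el.set("class", f"{existing} {name}" if existing else name)
    css = "\n".join(
        f"{tag}.{name} {{{decls}}}" for (tag, decls), name in classes.items()
    )
    return ET.tostring(root, encoding="unicode"), css


if __name__ == "__main__":
    doc = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Some '
        '<span style="font-style:italic">italic</span> text.</p></body></html>'
    )
    new_doc, css = extract_inline_styles(doc)
    print(css)  # span.sgc-1 {font-style:italic;}
    print(new_doc)
```

A real implementation would also have to check the generated names against classes already present in the book's stylesheets.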

Steps 1-4 are intended to be relatively non-destructive, retaining all information and leaving all actual tags in place. You would lose distinctions between different classes with identical declarations in step 4, but I seldom see semantic information stored in css classes anyway; they're often just randomly generated by some editing tool or another. Steps 1-4 could potentially be done in a single operation (for instance with a dialog with checkboxes for each step). In step 5 the user would actually make changes to declarations, along with renaming and removing classes and tags. This might of course be destructive, for instance by losing distinctions between similar classes that serve different purposes, but it would be initiated by the user in all cases.
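Step 4 could look roughly like the sketch below, again in Python and only as an illustration. It assumes the stylesheet has already been parsed into (tag, class, declarations) tuples in file order, keeps the first class seen for each unique tag/declaration pair, and returns a rename map to apply to the XHTML. The function names and the class names in the example are hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_identical_rules(rules):
    """Collapse rules that share both tag and declarations.

    rules: list of (tag, class_name, declarations) in stylesheet order.
    Returns (merged_rules, renames), where renames maps
    (tag, old_class) -> the class that was kept.
    """
    first_seen = {}  # (tag, declarations) -> class that is kept
    merged, renames = [], {}
    for tag, cls, decls in rules:
        key = (tag, decls)
        if key in first_seen:
            renames[(tag, cls)] = first_seen[key]
        else:
            first_seen[key] = cls
            merged.append((tag, cls, decls))
    return merged, renames


def apply_renames(root, renames):
    """Rewrite class attributes in a parsed XHTML tree using the rename map."""
    for el in root.iter():
        classes = el.get("class")
        if not classes:
            continue
        tag = el.tag.split("}")[-1]  # ignore the XML namespace, if any
        updated = []
        for cls in classes.split():
            cls = renames.get((tag, cls), cls)
            if cls not in updated:  # avoid class="a a" after merging
                updated.append(cls)
        el.set("class", " ".join(updated))


if __name__ == "__main__":
    rules = [
        ("span", "sgc-1", "font-style:italic;"),
        ("span", "calibre7", "font-style:italic;"),  # duplicate of sgc-1
        ("p", "sgc-2", "font-style:italic;"),        # different tag, kept
    ]
    merged, renames = merge_identical_rules(rules)
    root = ET.fromstring('<p class="sgc-2"><span class="calibre7">x</span></p>')
    apply_renames(root, renames)
    print(merged)
    print(ET.tostring(root, encoding="unicode"))
```

One caveat with this approach: dropping a rule changes the order of the remaining ones, which can matter for elements carrying several classes, so a careful tool would probably want to flag those cases instead of merging them silently.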

* Apart from regex being complicated to grasp for casual users, it is also theoretically impossible to reliably parse html with regex. I won't go into much detail, but a trivial example:

<p>
<span class="empty">A paragraph with <span class="italic">italics</span> in it.</span>
</p>

I've actually seen this very structure in the wild, with a corresponding .empty{}. If you want to remove the useless "empty" spans, an intuitive regex might be something like (?U)<span class="empty">(.*)</span>, replace with \1. In the example above the match would end at the first </span>, which belongs to the italic span, so the replacement would extend the italic span to encompass the rest of the paragraph.
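The failure is easy to reproduce. The sketch below uses Python's re module, where the lazy quantifier .*? plays the role of (?U)(.*) in PCRE, and then shows what a parser-based removal could look like with the standard library's ElementTree. The unwrap helper is made up for this example, and the paragraph is written without the whitespace around the outer span.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = (
    '<p><span class="empty">A paragraph with '
    '<span class="italic">italics</span> in it.</span></p>'
)

# Regex attempt: the lazy match stops at the FIRST </span>, which closes the
# italic span, so the wrong closing tag is removed.
broken = re.sub(r'<span class="empty">(.*?)</span>', r"\1", SOURCE)
print(broken)
# <p>A paragraph with <span class="italic">italics in it.</span></p>


def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap(parent, tag, cls):
    """Remove <tag class="cls"> elements below parent, keeping their content."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag == tag and child.get("class") == cls:
            text, tail, kids = child.text, child.tail, list(child)
            del parent[i]
            _append_text(parent, i, text)
            for offset, kid in enumerate(kids):
                parent.insert(i + offset, kid)
            if kids:
                kids[-1].tail = (kids[-1].tail or "") + (tail or "")
            else:
                _append_text(parent, i, tail)
            # i is not advanced: the spliced-in children are examined next
        else:
            unwrap(child, tag, cls)
            i += 1


root = ET.fromstring(SOURCE)
unwrap(root, "span", "empty")
print(ET.tostring(root, encoding="unicode"))
# <p>A paragraph with <span class="italic">italics</span> in it.</p>
```

Because the parser knows which closing tag belongs to which element, nesting depth stops being a problem. For real XHTML files the tag comparison would need to take the XHTML namespace into account.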
