|  08-07-2013, 05:43 AM | #1 | 
| Addict            Posts: 264 Karma: 9246 Join Date: Feb 2010 Location: Berlin, Germany Device: Kobo H20, iPhone 6+, Macbook Pro | 
				
				Cleaning ePubs: automatically, fast and with as many generic rules as possible
			 
			
Typical ePub XHTML is of extremely poor quality: inline styles, DIVs, SPANs, redundant class or id attributes, a lack of semantic markup, and so on. In other words, a horror for friends of high-quality semantic HTML. The quality of the CSS files is no better. Consistent use of elegant selectors? No such luck. Any intention on the part of the CSS authors to present a clear and lean set of rules? No such luck.

So you have to help yourself. Of course there are many ways and tools for cleaning HTML and CSS manually, but that takes too much time. One important goal of the editing is not to lose important markup through radical automatic cleaning. Typical information you want to keep is:

"This is a header"
"This is emphasized text"
"This is an unordered list"

Therefore you need a tool that allows an easy analysis of the markup. The tool should list every element in use, both with each distinct set of attributes and without attributes. Example:

div
div class="foo"
span id="01zot"
span id="02zot"
p class="text-indent"
h1 id="01bar"
span style="italic"
samp

Then the tool should offer actions for a highlighted group of entries. Example:

1 Convert div class="foo" into p
2 Delete all attributes
3 Convert all span style="italic" into em
4 Convert all H3 into H4
5 Delete samp (but keep its content)
...

Of course this is just a sketch to explain what I would like to achieve.

Which tools do you use, or can you recommend, for highly efficient manual cleaning of ePubs? Which generic actions do you apply automatically (e.g. via scripts or plugins), before or after the manual cleaning steps?

My goal is simply lean, semantic, beautiful XHTML like this:

Code:
<h1>Lorem ipsum</h1>
<p>Dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et <em>dolore</em> magna aliqua.</p>
<p><img src="/images/01.jpg" /></p>
<p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>

Thanks. Last edited by ibu; 08-07-2013 at 08:11 AM. | 
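The element inventory described in this post can be sketched in a few lines of Python. This is only a minimal sketch using the standard-library `xml.etree.ElementTree` parser, not an existing Sigil or plugin feature; the function names `signature` and `inventory` and the sample document are made up for illustration. It assumes the file is well-formed XHTML; files that rely on named entities such as `&nbsp;` would probably need preprocessing before this parser accepts them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def local_name(tag):
    """Strip the XHTML namespace that ElementTree prepends to tag names."""
    return tag[len(XHTML_NS):] if tag.startswith(XHTML_NS) else tag


def signature(elem):
    """Return e.g. 'span id="01zot"', or plain 'div' if there are no attributes."""
    attrs = " ".join(f'{k}="{v}"' for k, v in sorted(elem.attrib.items()))
    name = local_name(elem.tag)
    return f"{name} {attrs}" if attrs else name


def inventory(xhtml_text):
    """Count every distinct element/attribute combination in one document."""
    root = ET.fromstring(xhtml_text)
    return Counter(signature(e) for e in root.iter())


# Hypothetical sample document, only for demonstration.
sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div><p class="text-indent">One</p><p class="text-indent">Two</p></div>
<div class="foo"><span id="01zot">x</span><samp>y</samp></div>
</body></html>"""

for sig, count in sorted(inventory(sample).items()):
    print(f"{count:3d}  {sig}")
```

Counting occurrences alongside each entry gives a rough idea of which entries are worth a bulk action and which are one-off cases better checked by hand.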
|   |   | 
|  08-07-2013, 07:14 AM | #2 | 
| Grand Sorcerer            Posts: 28,869 Karma: 207000000 Join Date: Jan 2010 Device: Nexus 7, Kindle Fire HD | |
|   |   | 
|  08-07-2013, 07:27 AM | #3 | 
| Addict            Posts: 264 Karma: 9246 Join Date: Feb 2010 Location: Berlin, Germany Device: Kobo H20, iPhone 6+, Macbook Pro | 
			
			@DiapDealer So there aren't even any tools to help an editor with the manual tasks I listed in my examples? And you're right, I'm not looking for regular expressions. | 
|   |   | 
|  08-07-2013, 07:33 AM | #4 | |
| Guru            Posts: 776 Karma: 2751519 Join Date: Jul 2010 Location: UK Device: PW2, Nexus7 | Quote: 
 However, you could almost certainly write your own plugin to do the sort of analysis you describe and then use regex to make the changes (get the plugin to generate the regex to avoid typing it manually). I can't see it being a speedy process, and I think there would be plenty of scope for corrupting the ePub. | |
|   |   | 
|  08-07-2013, 08:01 AM | #5 | |
| Grand Sorcerer            Posts: 28,869 Karma: 207000000 Join Date: Jan 2010 Device: Nexus 7, Kindle Fire HD | Quote: 
 It sounds like you're looking for something that parses xhtml/css in order to clean it. I know of nothing like that, neither of the automatic nor of the manual-assist variety (though Sigil's "Reports" can help you find unused css classes). Most parsers are used to create consistent (albeit usually "cluttered") markup, not to clean it. "Clean" epub code has always been the purview of the ebook's creator. Anybody else will be expected to take the time to dive in and get their hands dirty cleaning someone else's code, mainly because it's just not important enough to enough people to warrant the development of such a beast. | |
|   |   | 
|  08-07-2013, 08:11 AM | #6 | 
| Addict            Posts: 264 Karma: 9246 Join Date: Feb 2010 Location: Berlin, Germany Device: Kobo H20, iPhone 6+, Macbook Pro | 
			
			@Agama I'm not a programmer, so I depend on existing tools. OK, I see that my hope for sophisticated cleaning scripts (with extensive heuristics etc.) has to die. Let's concentrate on the task "efficient manual cleaning with Sigil". | 
|   |   | 
|  08-07-2013, 08:22 AM | #7 | 
| Addict            Posts: 264 Karma: 9246 Join Date: Feb 2010 Location: Berlin, Germany Device: Kobo H20, iPhone 6+, Macbook Pro | 
			
			@DiapDealer Yes, you are right. I'm looking for a tool that parses a valid xhtml document. Only then is cleaning both safe and easy (without complex and risky regex). The HTML authoring tool Dreamweaver, for example, offers some commands in its GUI that perform some of the tasks I mentioned. But I don't want to unzip an epub, edit all the files in Dreamweaver, pack it again as an epub, and then perform the rest of the cleaning inside Sigil (generate the TOC, edit the OPF, ...). I understand all your arguments about "not important enough". My hope was that in the community of epub friends there are many others looking for ways to clean existing epubs, because it is not rare for cluttered source code to be the cause of presentation problems. And there is no hope at all that the producers will deliver quality code. | 
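To illustrate why working on a parsed tree is safer than regex, here is a minimal Python sketch of three of the actions listed in the first post: converting one element into another, deleting attributes, and deleting an element while keeping its content. It uses only the standard-library `xml.etree.ElementTree`. The helper names (`rename`, `strip_attributes`, `unwrap`) and the `keep` whitelist are hypothetical, and "converting" is assumed here to mean that the matched element also loses its attributes.

```python
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", NS)  # serialize without ns0: prefixes


def q(name):
    """Qualify a plain tag name with the XHTML namespace."""
    return f"{{{NS}}}{name}"


def rename(root, old, new, attrs=None):
    """Turn every <old> carrying the given attributes into a bare <new>."""
    attrs = attrs or {}
    for elem in list(root.iter(q(old))):
        if all(elem.get(k) == v for k, v in attrs.items()):
            elem.tag = q(new)
            elem.attrib.clear()  # assumption: converted elements drop attributes


def strip_attributes(root, keep=("src", "href", "alt")):
    """Delete all attributes except an (assumed) whitelist of functional ones."""
    for elem in root.iter():
        for key in list(elem.attrib):
            if key not in keep:
                del elem.attrib[key]


def _add_text_before(parent, index, text):
    """Append text to whatever precedes position index inside parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap(parent, name):
    """Remove every <name> below parent but keep its text and children."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != q(name):
            unwrap(child, name)  # descend into elements that stay
            i += 1
            continue
        kids = list(child)
        _add_text_before(parent, i, child.text)
        if kids:
            kids[-1].tail = (kids[-1].tail or "") + (child.tail or "")
        else:
            _add_text_before(parent, i, child.tail)
        parent[i:i + 1] = kids  # splice the children in; they are checked next


doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="foo">Some <span style="italic">fine</span> '
    '<samp>sample</samp> text</div>'
    '</body></html>'
)
rename(doc, "div", "p", {"class": "foo"})
rename(doc, "span", "em", {"style": "italic"})
unwrap(doc, "samp")
print(ET.tostring(doc, encoding="unicode"))
```

Because the edits operate on elements rather than on text patterns, the result stays well-formed by construction, which is the main risk a regex-based approach carries.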
|   |   | 
|  08-07-2013, 08:57 AM | #8 | |
| Grand Sorcerer            Posts: 28,869 Karma: 207000000 Join Date: Jan 2010 Device: Nexus 7, Kindle Fire HD | Quote: 
 But in my experience (if I'm 100% honest): "cluttered" source code is rarely the cause of true presentation problems. The problem is that a very small subset of readers/hobbyists are extremely picky about the way they like their books presented on their particular device(s). So they make the effort to learn how to tweak the epub to meet their expectations (I'm one of these people). Convoluted and messy code makes it harder to see what needs to be done to accomplish your goals, but it's still not impossible. The intersection of the group of people who want to change how the epub is presented, and the group who recognize messy code but don't already have the skills necessary to fix it, has to be pretty darn small. And even then... I believe "clean" code is mostly just more aesthetically pleasing than anything. "Cluttered" markup can display just as well as clean markup can. So don't give up hope! Surely anybody who sees the value of clean markup and efficient css can't be above learning a scripting language to help achieve that goal? Sounds like the "Open With" feature of Sigil might be helpful to you. It will allow you to use external editors on files within the epub without all the unzip/rezip nonsense. | |
|   |   | 
|  08-07-2013, 08:57 AM | #9 | 
| Guru            Posts: 776 Karma: 2751519 Join Date: Jul 2010 Location: UK Device: PW2, Nexus7 | 
			
			I may be able to extend my plugin. What do you consider the worst offenders in terms of poor quality? DIV and SPAN are not necessarily poor quality, it depends how they are used. What do you mean by "Lack of semantic markup"? I would have thought that any ePub file, being a valid XHTML document, is bound to contain semantic markup. Last edited by Agama; 08-07-2013 at 10:09 AM. Reason: typo | 
|   |   | 
|  08-07-2013, 10:07 AM | #10 | 
| frumious Bandersnatch            Posts: 7,570 Karma: 20150435 Join Date: Jan 2008 Location: Spaniard in Sweden Device: Cybook Orizon, Kobo Aura | Code: <h1>Chapter <span class="number">4</span></h1> <div class="summary">where a sample is given</div> <p>This is the true starting point of the chapter...</p> | 
|   |   | 
|  08-07-2013, 10:09 AM | #11 | 
| Guru            Posts: 776 Karma: 2751519 Join Date: Jul 2010 Location: UK Device: PW2, Nexus7 | 
			
			Looks good to me; hence my questions to the OP regarding poor quality.
		 | 
|   |   | 
|  08-07-2013, 10:22 AM | #12 | 
| frumious Bandersnatch            Posts: 7,570 Karma: 20150435 Join Date: Jan 2008 Location: Spaniard in Sweden Device: Cybook Orizon, Kobo Aura | 
			
			Yes my post was directed to the OP too    | 
|   |   | 
|  08-07-2013, 11:23 AM | #13 | 
| Addict            Posts: 264 Karma: 9246 Join Date: Feb 2010 Location: Berlin, Germany Device: Kobo H20, iPhone 6+, Macbook Pro | 
			
@Jellby Well, to be honest, I don't want to go too deep into a general discussion of "what is the best markup". But because two of you asked, I'll of course answer briefly. Sure, there are markup situations where a span/div is appropriate. Because there is no specific element for "number", if you really need to distinguish between the word "chapter" and the number, then
Code: <h1>Chapter <span>4</span></h1>
is enough. The attribute
Code: class="number"
is not needed. If you need a selector, "h1 span { }" does the job.

I would prefer different markup:
Code: <h1>Chapter <em>4</em></h1>
In such a heading the number is the specific part; the word "chapter" is the general part. For the eye it is important to recognize the number, so it is good to emphasize it.

@Agama OK, some examples of bad markup:
Code:
<p class="h1"><samp>...</samp></p>
<h2 class="h2" id="heading_id_2">...</h2>
<span class="italic">...</span> (just to emphasize words inside sentences)
<p class="text_noindent_top">...</p> (an adjacent sibling selector can match it)
<p class="text_indent">...</p> (each paragraph has that useless class)
<p class="text_indent">...</p>

"Semantic" means what is defined in the W3C specs. To talk about semantics in our context, the semantics must be declared publicly and authoritatively. Only when they are authoritative can producers of interfaces for, say, blind people offer useful output for things like headings. Everything an author adds with classes or ids is private fun. Last edited by ibu; 08-07-2013 at 12:14 PM. | 
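One way to find candidates such as the `text_indent` class above is to measure how large a share of each element type carries a given class. The following is a rough Python sketch under that assumption; the function name `dominant_classes`, the 0.5 threshold, and the sample markup are illustrative choices, and a high share only marks a class as worth reviewing, not as definitely redundant.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def dominant_classes(xhtml_text, threshold=0.5):
    """Report (tag, class, share) where one class marks most elements of a tag."""
    root = ET.fromstring(xhtml_text)
    totals = Counter()
    by_class = defaultdict(Counter)
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        totals[tag] += 1
        for cls in elem.get("class", "").split():
            by_class[tag][cls] += 1
    report = []
    for tag, classes in by_class.items():
        for cls, n in classes.items():
            share = n / totals[tag]
            if share >= threshold:
                report.append((tag, cls, share))
    return sorted(report)


# Hypothetical sample modelled on the bad-markup examples in this post.
sample = """<body>
<h2 class="h2" id="heading_id_2">Title</h2>
<p class="text_noindent_top">First</p>
<p class="text_indent">Second</p>
<p class="text_indent">Third</p>
<p class="text_indent">Fourth</p>
</body>"""

for tag, cls, share in dominant_classes(sample):
    print(f"{tag}.{cls}: {share:.0%} of all <{tag}> elements")
```

A class that sits on nearly every `p` can usually be replaced by a plain `p { }` rule, with the exceptions handled by a sibling selector.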
|   |   | 
|  08-07-2013, 11:59 AM | #14 | 
| Grand Sorcerer            Posts: 28,869 Karma: 207000000 Join Date: Jan 2010 Device: Nexus 7, Kindle Fire HD | 
			
			You sure seem to have specific ideas about how things "should" be coded, for someone who has proclaimed themselves "not a programmer."    | 
|   |   | 
|  08-07-2013, 12:12 PM | #15 | 
| Addict            Posts: 264 Karma: 9246 Join Date: Feb 2010 Location: Berlin, Germany Device: Kobo H20, iPhone 6+, Macbook Pro | 
			
			@DiapDealer I'm open to criticism. Please tell me which of my "ideas" are strange in your opinion. "Proclaiming"? It's just the simple truth, and it's no coquetry. I'm not bad at markup, css, usability and accessibility. But I don't know programming. | 
|   |   | 