MobileRead Forums - View Single Post - Cleaning ePubs: automatically, fast and with as many generic rules as possible

ibu · 08-07-2013, 05:43 AM

Typical ePub XHTML is of extremely poor quality.

Inline styles.
DIVs.
SPANs.
Redundant class or id attributes.
Lack of semantic markup.
...

In other words: a horror for any friend of high-quality semantic HTML.

The quality of the CSS files is no better.

Consistent use of elegant selectors? Nowhere to be found.
Any effort by the CSS authors to write a clear, lean set of rules? Nowhere to be found either.

Therefore:
you have to help yourself.

Of course there are many ways and tools for cleaning HTML and CSS manually.
But that takes too much time.

One important goal of the editing is:
Not to lose important markup through radical automatic cleaning.

Typical kinds of information you want to keep:
"This is a header"
"This is emphasized text"
"This is an unordered list"

Therefore you need a tool that makes it easy to analyze the markup.

The tool should list every element used in the book, both with its attributes and without.

Example:

div
div class="foo"
span id="01zot"
span id="02zot"
p class="text-indent"
h1 id="01bar"
span style="italic"
samp
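
A listing like the one above could be produced with a few lines of Python. This is only a minimal sketch using the standard library: the names `inventory`, `local_name` and `SAMPLE` are made up for illustration, and it assumes the file is well-formed XHTML without named HTML entities such as `&nbsp;`, which the plain XML parser would reject.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(name):
    """Strip a '{namespace}' prefix from an ElementTree tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def inventory(xhtml):
    """Count each distinct element/attribute signature in an XHTML string."""
    counts = Counter()
    for el in ET.fromstring(xhtml).iter():
        attrs = " ".join(
            f'{local_name(key)}="{value}"' for key, value in sorted(el.attrib.items())
        )
        counts[f"{local_name(el.tag)} {attrs}".strip()] += 1
    return counts


# Hypothetical sample document, modelled on the entries in the example list.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="foo">One</div>
<div class="foo">Two</div>
<p class="text-indent">Three <span style="italic">four</span></p>
<samp>five</samp>
</body></html>"""

if __name__ == "__main__":
    for signature, n in sorted(inventory(SAMPLE).items()):
        print(f"{n:3d}  {signature}")
```

Counting the signatures, rather than only listing them, shows at a glance which combinations are worth a rule and which are one-off noise.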

Then the tool should offer actions for a highlighted group of entries:

Example:

1
Convert div class="foo" into p

2
Delete all attributes

3
Convert all span style="italic" into em

4
Convert all H3 into H4

5
Delete samp (but keep its content)

...
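
The five example actions could be scripted roughly as follows. Again a sketch under assumptions, not a finished tool: `clean`, `unwrap`, `unwrap_all`, `KEEP_ATTRS` and `SAMPLE` are hypothetical names, the input must be well-formed XHTML, and the literal match on style="italic" follows the shorthand used in the list (real files more likely contain something like "font-style: italic"). Two details matter: the attribute-based conversions (1 and 3) have to run before the attributes are deleted (2), and "delete all attributes" is softened to a keep-list here so that img src survives.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
KEEP_ATTRS = {"src", "alt", "href"}  # assumption: attributes that must survive


def q(name):
    """Qualify a local element name with the XHTML namespace."""
    return f"{{{XHTML_NS}}}{name}"


def unwrap(parent, child):
    """Remove `child` from `parent` but keep its text and children in place."""
    index = list(parent).index(child)
    kids = list(child)

    def append_text(text):
        # Attach text to whatever precedes the child: parent text or sibling tail.
        if index == 0:
            parent.text = (parent.text or "") + text
        else:
            prev = parent[index - 1]
            prev.tail = (prev.tail or "") + text

    append_text(child.text or "")
    if kids:
        kids[-1].tail = (kids[-1].tail or "") + (child.tail or "")
    else:
        append_text(child.tail or "")
    parent.remove(child)
    for offset, kid in enumerate(kids):
        parent.insert(index + offset, kid)


def unwrap_all(root, name):
    """Unwrap every `name` element below root, rescanning after each change."""
    while True:
        hit = next(((p, c) for p in root.iter() for c in p if c.tag == q(name)), None)
        if hit is None:
            return
        unwrap(*hit)


def clean(root):
    """Apply the example actions; attribute-based matches run before stripping."""
    for el in root.iter():
        if el.tag == q("div") and el.get("class") == "foo":
            el.tag = q("p")    # action 1
        elif el.tag == q("span") and el.get("style") == "italic":
            el.tag = q("em")   # action 3
        elif el.tag == q("h3"):
            el.tag = q("h4")   # action 4
        for attr in list(el.attrib):
            if attr not in KEEP_ATTRS:
                del el.attrib[attr]  # action 2, limited by the keep-list
    unwrap_all(root, "samp")   # action 5
    return root


# Hypothetical input for illustration.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="foo">Dolor <span style="italic">sit</span> amet, '
    '<samp>consectetur</samp> elit.</div>'
    '<h3 id="01bar">Lorem</h3>'
    '</body></html>'
)

if __name__ == "__main__":
    ET.register_namespace("", XHTML_NS)  # serialize without an ns0: prefix
    print(ET.tostring(clean(ET.fromstring(SAMPLE)), encoding="unicode"))
```

ElementTree keeps no parent pointers, which is why `unwrap` takes the parent explicitly and has to move the text and tail strings by hand.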

Of course this is just a sketch to explain what I would like to achieve.

Which tools do you use for cleaning ePubs manually but highly efficiently? Which can you recommend?

Which generic actions do you apply automatically (e.g. via scripts or plugins), before or after the manual cleaning?

My goal is just lean, semantic, beautiful XHTML like this:

Code:

<h1>Lorem ipsum</h1>

<p>Dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et <em>dolore</em> magna aliqua.</p>

<p><img src="/images/01.jpg" alt="" /></p>

<p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>

And I don't want to spend more than, let's say, 3 minutes on a single book.

Thanks.

Last edited by ibu; 08-07-2013 at 08:11 AM.