Old 08-07-2013, 05:43 AM   #1
ibu
Addict
Posts: 264
Karma: 9246
Join Date: Feb 2010
Location: Berlin, Germany
Device: Kobo H20, iPhone 6+, Macbook Pro
Cleaning ePubs: automatically, fast and with as many generic rules as possible

Typical ePub XHTML is of extremely poor quality.

Inline styles.
DIVs.
SPANs.
Redundant class or id attributes.
Lack of semantic markup.
...

In other words: a horror for any friend of high-quality semantic HTML.

The quality of the CSS files is no better.

Consistent use of elegant selectors? Negative report.
Any intention on the part of the CSS authors to present a clear and lean set of rules? Negative report.


Therefore:
you have to help yourself.

Of course there are many ways and tools for users to clean HTML and CSS manually.
But that takes too much time.

One important goal for the editing is
not to lose important markup through radical automatic cleaning.

Typical kinds of information you want to keep are:
"This is a heading"
"This is emphasized text"
"This is an unordered list"

Therefore you need a tool which allows an easy analysis of the markup.

The tool should list every element in use, both with its attributes and without.

Example:

div
div class="foo"
span id="01zot"
span id="02zot"
p class="text-indent"
h1 id="01bar"
span style="italic"
samp
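
An inventory of this kind could be produced by a short script. The following is only a rough sketch in Python, assuming each chapter file is well-formed XHTML that the standard `xml.etree.ElementTree` parser accepts (files that use named entities such as `&nbsp;` would need extra handling). The names `inventory` and `local_name` and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on XHTML tags."""
    return tag.rsplit("}", 1)[-1]


def inventory(xhtml):
    """Count each distinct element-plus-attributes combination in the document."""
    counts = Counter()
    for el in ET.fromstring(xhtml).iter():
        attrs = " ".join(f'{k}="{v}"' for k, v in sorted(el.attrib.items()))
        counts[f"{local_name(el.tag)} {attrs}".strip()] += 1
    return counts


# Hypothetical sample chapter, only for demonstration.
sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="foo">A</div>
<div class="foo">B</div>
<p class="text-indent">C <span style="italic">d</span></p>
<samp>e</samp>
</body></html>"""

for entry, n in sorted(inventory(sample).items()):
    print(n, entry)
```

Each printed line is one entry of the list above, preceded by how often it occurs, which is enough to decide which groups are worth an action.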

Then the tool should offer actions for a highlighted group of entries:

Example:

1
Convert div class="foo" into p

2
Delete all attributes

3
Convert all span style="italic" into em

4
Convert all H3 into H4

5
Delete samp (but keep its content)

...
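
Actions like these operate on the parsed document tree rather than on the text, which is what makes them safer than search and replace. A minimal sketch in Python, again assuming well-formed XHTML and the standard `xml.etree.ElementTree` module; the helper names `rename`, `strip_attributes` and `unwrap` are hypothetical, not taken from any existing tool.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML)  # serialize without "ns0:" prefixes


def q(name):
    """Return the namespace-qualified tag name ElementTree uses for XHTML."""
    return f"{{{XHTML}}}{name}"


def rename(root, old, new, attrs=None):
    """Turn every <old> carrying the given attribute values into <new>,
    dropping the attributes that were used for the match."""
    attrs = attrs or {}
    for el in root.iter(q(old)):
        if all(el.get(k) == v for k, v in attrs.items()):
            el.tag = q(new)
            for k in attrs:
                del el.attrib[k]


def strip_attributes(root, name):
    """Delete all attributes from every <name> element."""
    for el in root.iter(q(name)):
        el.attrib.clear()


def _add_text(parent, i, text):
    """Append text directly before child position i of parent."""
    if not text:
        return
    if i == 0:
        parent.text = (parent.text or "") + text
    else:
        parent[i - 1].tail = (parent[i - 1].tail or "") + text


def unwrap(root, name):
    """Delete every <name> element but keep its text and children in place."""
    while True:
        hit = next(((p, c) for p in root.iter() for c in p if c.tag == q(name)), None)
        if hit is None:
            return
        parent, child = hit
        i = list(parent).index(child)
        kids = list(child)
        parent.remove(child)
        _add_text(parent, i, child.text)  # leading text joins what precedes it
        for offset, kid in enumerate(kids):  # children move up one level
            parent.insert(i + offset, kid)
        _add_text(parent, i + len(kids), child.tail)  # trailing text follows them


# Hypothetical sample chapter, only for demonstration.
sample = f"""<html xmlns="{XHTML}"><body>
<div class="foo">One <span style="italic">two</span></div>
<h3 id="x">Three</h3>
<p>a <samp>b <em>c</em> d</samp> e</p>
</body></html>"""

root = ET.fromstring(sample)
rename(root, "div", "p", {"class": "foo"})       # action 1
rename(root, "span", "em", {"style": "italic"})  # action 3
rename(root, "h3", "h4")                         # action 4
strip_attributes(root, "h4")                     # action 2, limited to one group
unwrap(root, "samp")                             # action 5
print(ET.tostring(root, encoding="unicode"))
```

Stripping attributes is deliberately restricted to a named element here, because a blanket "delete all attributes" would also remove things like the `src` of an image.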


Of course this is just a sketch to explain what I would like to achieve.

Which tools do you use, or can you recommend, for cleaning epubs manually with high efficiency?

Which generic actions do you apply automatically (e.g. via scripts or plugins), before or after the manual cleaning?


My goal is simply lean, semantic, beautiful XHTML like this:

Code:
<h1>Lorem ipsum</h1>

<p>Dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et <em>dolore</em> magna aliqua.</p>

<p><img src="/images/01.jpg" /></p>

<p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
And I don't want to spend more than, let's say, 3 minutes on a single book.

Thanks.

Last edited by ibu; 08-07-2013 at 08:11 AM.
Old 08-07-2013, 07:14 AM   #2
DiapDealer
Grand Sorcerer
Posts: 27,547
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by ibu View Post
Which tools do you use, or can you recommend, for cleaning epubs manually with high efficiency?

And I don't want to spend more than, let's say, 3 minutes on a single book.
Negative Report.

They're all manual, inefficient and time consuming.
Old 08-07-2013, 07:27 AM   #3
ibu
Addict
@DiapDealer

And there are not even any tools to help an editor with the manual tasks I listed in my examples?
To be clear, I'm not looking for regular expressions.
Old 08-07-2013, 07:33 AM   #4
Agama
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
Quote:
Originally Posted by ibu View Post
Which tools do you use, or can you recommend, for cleaning epubs manually with high efficiency?

And I don't want to spend more than, let's say, 3 minutes on a single book.

Thanks.
I use an automated plugin which works with calibre as a conversion post-processor, but since I am always working from Markdown plain-text sources, the output is predictable and therefore easy to tidy.

However, you could almost certainly write your own plugin to achieve the sort of analysis that you describe and then use regex to make the changes (get the plugin to generate the regex to avoid typing it manually). I can't see it being a speedy process, and I think there would be plenty of scope for corrupting the ePub.
Old 08-07-2013, 08:01 AM   #5
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by ibu View Post
And there are not even any tools to help an editor with the manual tasks I listed in my examples?
To be clear, I'm not looking for regular expressions.
Regular expressions represent the vast bulk of my arsenal for cleaning epub markup. But they're not generic. They always have to be tuned/tweaked for each book.

It sounds like you're looking for something that parses xhtml/css in order to clean it. I know of nothing like that, of either the automatic or the manual-assist variety (though Sigil's "Reports" can help you find unused css classes). Most parsers are used to create consistent (albeit usually "cluttered") markup, not clean it.

"Clean" epub code has always been the purview of the ebook's creator. Anybody else will be expected to take time diving in and getting their hands dirty to clean someone else's code. Mainly because it's just not important enough to enough people to warrant the development of such a beast.
Old 08-07-2013, 08:11 AM   #6
ibu
Addict
@Agama
I'm not a programmer. So I depend on existing tools.

OK, I see that my hope for sophisticated cleaning scripts (with extensive heuristics etc.) has to die.

Let's concentrate on the task "efficient manual cleaning with Sigil".
Old 08-07-2013, 08:22 AM   #7
ibu
Addict
@DiapDealer
Yes, you are right. I'm looking for a tool which parses a valid XHTML document. Only then is safe and easy cleaning (without complex and risky regex) possible.

The HTML authoring tool Dreamweaver, for example, offers some commands in its GUI that perform some of the tasks I mentioned.
But I don't want to unzip an epub, edit all the files in Dreamweaver, pack it again as an epub, and then perform the rest of the cleaning inside Sigil (generate the TOC, edit the OPF, ...).

I understand all your arguments about "not important enough".
My hope was that in the community of epub friends there are many others looking for ways to clean existing epubs, because it is not rare for cluttered source code to be the cause of many presentation problems.

And there's no hope at all, that the producers will deliver quality code.
Old 08-07-2013, 08:57 AM   #8
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by ibu View Post
I understand all your arguments about "not important enough".
My hope was that in the community of epub friends there are many others looking for ways to clean existing epubs, because it is not rare for cluttered source code to be the cause of many presentation problems.
I don't mean to squash your hope completely. Someone may be working on something like that now. Or you may be able to convince someone that it's worthwhile to develop something that does what you wish. It would be nice.

But in my experience (if I'm 100% honest): "cluttered" source code is rarely the cause of true presentation problems. The problem is that a very small subset of readers/hobbyists are extremely picky about the way they like their books presented on their particular device(s). So they make the effort to learn how to tweak the epub to meet their expectations (I'm one of these people). Convoluted and messy code makes it harder to see what needs to be done to accomplish your goals, but it's still not impossible.

The intersection of the group of people who want to change how the epub is presented, and the group who recognize messy code, but don't already have the skills necessary to fix it, has to be pretty darn small. And even then... I believe "clean" code is mostly just more aesthetically pleasing than anything. "Cluttered markup" can display just as well as clean markup can.

So don't give up hope! Surely anybody who sees the value of clean markup and efficient css can't be above learning a scripting language to help achieve that goal?

Quote:
Originally Posted by ibu View Post
But I don't want to unzip an epub, edit all the files in Dreamweaver, pack it again as an epub, and then perform the rest of the cleaning inside Sigil (generate the TOC, edit the OPF, ...).
Sounds like the "Open With" feature of Sigil might be helpful to you. It will allow you to use external editors on files within the epub without all the unzip/rezip nonsense.
Old 08-07-2013, 08:57 AM   #9
Agama
Guru
Quote:
Originally Posted by ibu View Post
@Agama
I'm not a programmer. So I depend on existing tools.
I may be able to extend my plugin. What do you consider the worst offenders in terms of poor quality?

DIV and SPAN are not necessarily poor quality, it depends how they are used.

What do you mean by "Lack of semantic markup"? I would have thought that any ePub file, being a valid XHTML document, is bound to contain semantic markup.

Last edited by Agama; 08-07-2013 at 10:09 AM. Reason: typo
Old 08-07-2013, 10:07 AM   #10
Jellby
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Code:
<h1>Chapter <span class="number">4</span></h1>

<div class="summary">where a sample is given</div>

<p>This is the true starting point of the chapter...</p>
DIVs and SPANs there, isn't that an appropriate use of them? Isn't that semantic enough markup?
Old 08-07-2013, 10:09 AM   #11
Agama
Guru
Looks good to me; hence my questions to the OP regarding poor quality.
Old 08-07-2013, 10:22 AM   #12
Jellby
frumious Bandersnatch
Yes, my post was directed to the OP too.
Old 08-07-2013, 11:23 AM   #13
ibu
Addict
@Jellby
Well, to be honest, I don't want to go too deep into a general discussion of "what is the best markup".

But since two of you asked, I will of course answer briefly.

Sure, there are markup situations where a span/div is appropriate.

Because there's no specific element for "number", if you really need to distinguish between the word "chapter" and "the number", then

Code:
<h1>Chapter <span>4</span></h1>
is OK.

Code:
class="number"
is superfluous.

If you need a selector, "h1 span { }" does the job.

I would prefer different markup.


Code:
<h1>Chapter <em>4</em></h1>
For the best markup you have to ask what you want to express.

In such a heading the number is the specific part. The word "chapter" is the general part.

For the eye it is important to recognize the number.
Therefore it's good to emphasize it.

@Agama

OK, some examples of bad markup:

Code:
<p class="h1"><samp>...</samp></p>

<h2 class="h2" id="heading_id_2">...</h2>

<span class="italic">...</span> (just to emphasize words inside sentences)

<p class="text_noindent_top">...</p> (an adjacent sibling selector can handle this)
<p class="text_indent">...</p> (every paragraph carries that useless class)
<p class="text_indent">...</p>
The list could be continued endlessly.
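
One rough way to spot such classes automatically is to count, per element, how many instances carry each class. A class that sits on nearly every `p` is a candidate for replacement by a plain element selector or an adjacent sibling selector. The sketch below is in Python and assumes well-formed XHTML that `xml.etree.ElementTree` can parse; `class_coverage` is a made-up name and the sample is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def class_coverage(xhtml):
    """Map (tag, class) to the fraction of that tag's elements carrying the class."""
    totals, tagged = Counter(), Counter()
    for el in ET.fromstring(xhtml).iter():
        tag = el.tag.rsplit("}", 1)[-1]  # strip the XHTML namespace prefix
        totals[tag] += 1
        for cls in set(el.get("class", "").split()):
            tagged[(tag, cls)] += 1
    return {key: n / totals[key[0]] for key, n in tagged.items()}


# Hypothetical sample chapter, only for demonstration.
sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h2 class="h2" id="heading_id_2">Title</h2>
<p class="text_noindent_top">First</p>
<p class="text_indent">Second</p>
<p class="text_indent">Third</p>
<p class="text_indent">Fourth</p>
</body></html>"""

coverage = class_coverage(sample)
for (tag, cls), share in sorted(coverage.items()):
    print(f"{tag}.{cls}: {share:.0%}")
```

A high share only hints that a class may be redundant; whether it really is still depends on what the stylesheet does with it.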

"Semantic" means what is defined in the W3C specs.
To talk about semantics in our context, the semantics must be declared publicly and authoritatively.

Only when it is authoritative can producers of interfaces for, say, blind people offer useful output for things like headings.


Everything an author adds with classes or ids is his private pleasure.

Last edited by ibu; 08-07-2013 at 12:14 PM.
Old 08-07-2013, 11:59 AM   #14
DiapDealer
Grand Sorcerer
You sure seem to have specific ideas about how things "should" be coded, for someone who has proclaimed themselves "not a programmer."
Old 08-07-2013, 12:12 PM   #15
ibu
Addict
@DiapDealer
I'm open to criticism.

Please tell me which of my "ideas" seem strange to you.

"Proclaiming"? It's just the simple truth. And it's no coquetry. I'm not bad at markup, CSS, usability and accessibility. But I don't know programming.
All times are GMT -4. The time now is 02:06 AM.


MobileRead.com is a privately owned, operated and funded community.