Easiest way to clean an ePub file?


mtrahan
04-20-2011, 04:44 PM
Hello everyone,

I've been the happy owner of a Kindle 3 for about 6 months now. I love it, and it's great that there are so many books available in the public domain for free. Only thing, as you probably all know, the formatting isn't always perfect in the books you can find around the web. So after spending some time reading them as-is, I quickly started spending some time trying to fix simple things in the ebooks I found—mainly adding TOC/correct chapter breaks, small tweaks in the CSS to change the indents or alignment, etc. (with Sigil that is, and doing the ePub/mobi conversion with Calibre).

I guess you can see where I'm going: the more I learned about ebooks, the more unsatisfied I was with the ones I had, and the more things I wanted to change in them.

A couple of days ago, I finally found a text I had been searching for for some time, in ePub format. The problem: when I opened it, I noticed it was all in bold. So okay, I figured, I will just change the CSS. Then I noticed something strange (you will notice how little I know about ePubs): the "font-weight: bolder" declaration was not in the "calibre2" class, which was used for every paragraph of the book, but in a "calibre3" class (which consisted only of this "font-weight: bolder" declaration).

When looking at the code view in Sigil, I noticed that every single paragraph of the book started with those 2 classes: <p class="calibre2"><b class="calibre3">. My first reaction to get rid of the bold problem was simply to change the "font-weight" from "bolder" to "normal". Doing this kind of fixes the problem, but not in a very satisfying way, I must admit: now I have a "calibre3" class that is completely useless and is still applied at the beginning of every paragraph of the book...

My question is simple: what do you do in these circumstances? Remove all the <b class="calibre3">? Start a new clean file with the plain text?

Actually, I wouldn't really be asking if the problem was only this. Thing is, I realized that every paragraph was full of multiple and repetitive class calls (I don't know how to name this)... For example, here is one paragraph from the code view in Sigil:

<p class="calibre2"><b class="calibre3">— Fatime, dit-il à ma compagne, je suppose que cette jeune et jolie personne est</b> <b class="calibre3">au fait ; il ne me reste donc plus qu'à vous prévenir que nous avons pour convives</b> <b class="calibre3">deux vieux Allemands, à Paris depuis un mois, et qui brûlent du désir de connaître</b> <b class="calibre3">quelques jolies filles. L'un d'eux a pour vingt mille écus de diamants sur lui :</b> <b class="calibre3">Fatime, je te le recommande. L'autre, qui désire acheter une maison dans ce village,</b> <b class="calibre3">et à qui j'ai persuadé que je lui en trouverais une à très bon marché s'il apportait de</b> <b class="calibre3">quoi la payer comptant, aura sûrement plus de quarante mille francs dans sa poche,</b> <b class="calibre3">soit en or, soit en lettres à vue : Juliette, ce sera votre lot ; acquittez-vous bien de la</b> <b class="calibre3">mission et je vous ferai souvent faire de semblables parties.</b></p>

I have no idea how an ePub can end up with such repetitive formatting... Since the book needs no special formatting (it's all just regular text), I want every paragraph to have only a single class, set once on the <p> tag. Is that the correct way to do this?

So at this point, what do you suggest is the simplest way to get a clean book? Just copy the text from Sigil's "book view" and start a new ePub file from the plain text? Or is there a way to remove all the unnecessary formatting?

Thanks in advance for your help (and for all the great information available on the forums), and sorry if I made some grammar/syntax mistakes. As you can maybe guess from the quoted paragraph, English isn't my first language.

Michael

ATDrake
04-20-2011, 05:55 PM
Honestly, it looks like a serious auto-conversion error. I have no idea why your book is bolding all over like that, but I'm pretty sure it's not supposed to be. Likely someone started out with a shoddy source file and fed it into Calibre, which can only try its best with what it's given.

Since the bolding doesn't seem to do anything useful, I'd just get rid of it by doing a find/replace for

</b> <b class="calibre3">

and replace it with a single space, then replace

<p class="calibre2"><b class="calibre3">

and

</b></p>

with the plain versions of the markup: <p class="whatever"> for the first and </p> for the second.

It's how I clean the cruft from my own e-books when I decide to redo the formatting to my liking, and usually a lot faster and easier than trying to build a new book from scratch using cut-and-pasted text.
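If you would rather script it than click through the find/replace dialog, the same three replacements could look roughly like this in Python. This is only a sketch: the function name strip_bold_wrappers and the plain_class parameter are made up for illustration, and it assumes the markup matches the sample paragraph exactly (same spacing, same class names).

```python
def strip_bold_wrappers(html: str, plain_class: str = "calibre2") -> str:
    """Apply the three literal find/replace steps, in order.

    plain_class is a hypothetical parameter: the class to leave on <p>.
    """
    # 1. Join adjacent bold runs with a single space.
    html = html.replace('</b> <b class="calibre3">', " ")
    # 2. Open each paragraph without the bold wrapper.
    html = html.replace(
        '<p class="calibre2"><b class="calibre3">',
        f'<p class="{plain_class}">',
    )
    # 3. Drop the bold end tag that closes each paragraph.
    html = html.replace("</b></p>", "</p>")
    return html


sample = (
    '<p class="calibre2"><b class="calibre3">— Fatime, dit-il</b> '
    '<b class="calibre3">au fait.</b></p>'
)
cleaned = strip_bold_wrappers(sample)
print(cleaned)
```

The order matters: step 1 has to run before step 3, otherwise the inner end tags would still be sitting in the middle of each paragraph.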

Since you're converting from ePub to Mobi, I'd say when in doubt, discard or at least comment out the parts of the CSS file you don't understand, which don't seem to do anything useful.

Often Calibre conversions put in a lot of redundant stuff which you just don't need and can't use, given Mobi's limited display capabilities.

I hope this helps, and welcome to MobileRead!

eping
04-21-2011, 08:54 AM
You can use mass replacement to clear them.
For more advanced replacements, you can use a regex (regular expression) tool.
But it's neither easy nor pleasant work.

If you have ever seen the source of HTML exported from Word, you will know that your example is rather clean and optimized by comparison.
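
As a rough illustration of the regex route, here is a small Python sketch using the standard re module. The pattern and the unwrap_bold name are assumptions for the example, not something a particular tool gives you. It assumes the <b class="calibre3"> elements are never nested, and it unwraps each one while keeping its text.

```python
import re

# Non-greedy match of one <b class="calibre3">...</b> element.
# re.DOTALL lets the bold run span line breaks.
BOLD_RUN = re.compile(r'<b class="calibre3">(.*?)</b>', re.DOTALL)


def unwrap_bold(html: str) -> str:
    """Replace each calibre3 bold element with just its inner text."""
    return BOLD_RUN.sub(r"\1", html)


sample = (
    '<p class="calibre2"><b class="calibre3">au fait ;</b> '
    '<b class="calibre3">il ne me reste</b></p>'
)
print(unwrap_bold(sample))
```

Because the space between two bold runs sits outside the tags, it survives on its own, so no separate "replace with a space" step is needed here.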

mtrahan
04-21-2011, 11:27 AM
Thanks for your help. I guess I'll stick with my current file and clean it as I can. I don't know much about regex tools, so I'll just do as ATDrake suggested:


replace </b> <b class="calibre3"> with a single space
replace <p class="calibre2"><b class="calibre3"> with <p class="calibre"> (which is the "basic" class in the stylesheet)
replace </b></p> with only </p>
etc.


I'll also keep in mind the tip about commenting out the parts of the CSS file that I don't understand and that don't seem to do anything useful. Merci!

I appreciate the fast and friendly help—great forum!

Michael

Update: Just spent some time cleaning the book as you suggested, and it looks scarier than it is. It's actually very easy to do and in a couple of minutes the file looks MUCH better. I guess I should have tried that "find & replace" thingy way earlier... Thanks again!

bfollowell
04-21-2011, 03:28 PM
ATDrake wrote:
"Since the bolding doesn't seem to do anything useful, I'd just get rid of it by doing a find/replace for

</b> <b class="calibre3">

and replace it with a single space, then replace

<p class="calibre2"><b class="calibre3">

and

</b></p>"

Actually, I wouldn't replace with a space. Just perform a search for:

<b class="calibre3">

and replace it with nothing at all. That removes all the opening bold tags, and once you save the file or switch to book view and back (or something like that), Sigil will automatically delete all the corresponding bold end tags as well. If you set the search/replace scope to all HTML files, it should take you all of about five to twenty seconds, depending on the size of the book.

Remember to always make a working copy and save your original to fall back on if you really mess things up and need to start fresh.
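
For anyone working outside Sigil, the "all HTML files, on a working copy" idea can be sketched in Python, since an ePub is a zip archive. Treat this as an assumption-laden example, not the Sigil workflow: clean_epub is a made-up name, the files are assumed to be UTF-8, and because there is no Sigil here to tidy up orphaned end tags, the sketch removes each opening tag together with its matching </b>. It writes a new file and never touches the original.

```python
import re
import zipfile
from pathlib import Path

# One <b class="calibre3">...</b> element, assumed never to be nested.
BOLD_RUN = re.compile(r'<b class="calibre3">(.*?)</b>', re.DOTALL)
HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def clean_epub(src: Path, dst: Path) -> int:
    """Write a cleaned copy of src to dst; return how many files changed."""
    changed = 0
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        # infolist() keeps the archive order, so "mimetype" stays first.
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.lower().endswith(HTML_SUFFIXES):
                text = data.decode("utf-8")  # assumption: UTF-8 content
                new_text = BOLD_RUN.sub(r"\1", text)
                if new_text != text:
                    changed += 1
                data = new_text.encode("utf-8")
            # The ePub container expects "mimetype" to be stored uncompressed.
            if info.filename == "mimetype":
                compression = zipfile.ZIP_STORED
            else:
                compression = zipfile.ZIP_DEFLATED
            zout.writestr(info, data, compress_type=compression)
    return changed
```

Calling clean_epub(Path("book.epub"), Path("book-clean.epub")) (both file names are placeholders) leaves book.epub as your fallback copy, which is the point of the advice above.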

Wa la, no more bold text.

Toxaris
04-22-2011, 06:16 AM
I would recommend spending some time learning RegEx. It will help you!

mtrahan
04-22-2011, 08:45 AM
Toxaris wrote: "I would recommend spending some time learning RegEx. It will help you!"

I will. It's just that I'm learning one thing at a time, and for now this was already enough to experiment with. It's actually the first book I've reformatted this much (doing the cover, endnotes, etc.). Now that I'm learning to get more out of it, I find Sigil really awesome.

Thanks again for the help.

susan_cassidy
04-22-2011, 03:57 PM
bfollowell wrote: "Wa la, no more bold text."

I hope you were joking. It's voilà, not 'wa la'.

bfollowell
04-27-2011, 08:06 AM
susan_cassidy wrote: "I hope you were joking. It's voilà, not 'wa la'."

Yes, I just didn't feel like going looking for the a with the little accent symbol over it.

Obviously, you knew what I meant though, correct?

I'm really glad you felt the need to spend more time "correcting" my post rather than trying to help the OP though. Way to go. :thumbsup: