Old 04-05-2023, 02:38 AM   #1
akita328
Member
Posts: 13
Karma: 10
Join Date: Aug 2019
Device: kindle, iPad Marvin
removing excessive <class> and other formatting horrors on epub

hi all,

wasn't sure where this sort of question should be posted, so I'm posting here...

I bought an ebook from Amazon and noticed it had pretty bad formatting errors that were making the book very difficult to read on my Kindle, so I thought I'd try to fix it myself.

but when I used Calibre to convert the azw3 to ePub, I saw this horrible coding where almost every single word has its own class/span/etc!!!! and I can't make heads or tails of how to remove it and fix the weird indentation and margin problems. (I think I can fix the pagination problems, but that's minor compared to what I saw below)

here's an excerpt of the code. the entire book is like this:

<p class="block_21">“How<span class="text_14"> </span>can<span class="text_14"> </span>I<span class="text_14"> </span>persuade<span class="text_14"> </span>you<span class="text_14"> </span>that<span class="text_14"> </span>I<span class="text_14"> </span>mean<span class="text_14"> </span>you<span class="text_14"> </span>no<span class="text_14"> </span>harm?”<span class="text_14"> </span>he<span class="text_14"> </span>asked.<span class="text_14"> </span>“I<span class="text_14"> </span>swear to you that I will do nothing to you.”</p>

<p class="block_22">“Will<span class="text_18"> </span>you<span class="text_18"> </span>swear<span class="text_18"> </span>by<span class="text_18"> </span>the<span class="text_18"> </span>Blessed<span class="text_18"> </span>Virgin<span class="text_18"> </span>Mary?”<span class="text_18"> </span>she<span class="text_18"> </span>asked<span class="text_18"> </span>disbelievingly. “I swear it.”</p>

Each paragraph has its own block_# class, plus a span with class="text_#" around the space after almost every single word in the paragraph. (And the block_#/text_# pairing is different from paragraph to paragraph...)

I took a peek at the original azw3 file, and it is just as bad. So azw-->epub conversion didn't do this. it's the horrible amazon encoding...

is there any way to clean up a mess like the above, stripping most of the junk down to something resembling a sane text file, so that I can fix the incorrect pagination, margins, and linefeeds? if I have to delete these things manually, it would be faster to retype the entire book from scratch. I have the paperback copy of the book as well as the eBook version, so... I have a reference for what it's SUPPOSED to look like..

I'm hoping there is a way to export to some other format that can be converted back into a simpler ePub with the excessive class attributes stripped... (I'm a novice and can edit an existing ePub file, but wouldn't know where to begin if I had to start from scratch..)

I do use Sigil for minor cleaning up texts/typos and some formatting, etc, and use Calibre on MacOS... I'm generally computer savvy, but far from expert on stuff like this.

any help is appreciated...
Old 04-05-2023, 08:09 AM   #2
jhowell
Grand Sorcerer
Posts: 6,498
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
Try converting to MOBI format and then from MOBI back to EPUB. That will remove a lot of excess formatting.
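For anyone who prefers to script the round trip rather than click through the GUI, here is a minimal sketch using calibre's `ebook-convert` command-line tool. It assumes calibre's command-line tools are on your PATH; the file names and the `_clean` suffix are placeholders I made up, not anything from the book in question.

```python
import subprocess
from pathlib import Path


def round_trip_commands(source: str) -> list[list[str]]:
    """Build the two ebook-convert calls for AZW3 -> MOBI -> EPUB."""
    src = Path(source)
    mobi = src.with_suffix(".mobi")
    # Hypothetical naming choice: write the result next to the source file.
    epub = src.with_name(src.stem + "_clean.epub")
    return [
        ["ebook-convert", str(src), str(mobi)],
        ["ebook-convert", str(mobi), str(epub)],
    ]


def run_round_trip(source: str) -> None:
    """Run both conversions in order, stopping at the first failure."""
    for cmd in round_trip_commands(source):
        subprocess.run(cmd, check=True)
```

Calling `run_round_trip("book.azw3")` would leave `book.mobi` and `book_clean.epub` beside the original. Work on a copy, since the result still needs checking by eye.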

Quote:
Originally Posted by akita328 View Post
I took a peek at the original azw3 file, and it is just as bad. So azw-->epub conversion didn't do this. it's the horrible amazon encoding...
It's really down to the publisher. Garbage in, garbage out. Can you share which title this is from Amazon?
Old 04-05-2023, 08:53 AM   #3
phossler
Wizard
Posts: 1,076
Karma: 412718
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
Diap's Toolbox plugin will take care of those very easily

https://www.mobileread.com/forums/sh...40#post2980740
Old 04-05-2023, 09:53 AM   #4
theducks
Well trained by Cats
Posts: 29,817
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
I second the Toolbox plugin. (There is a version for Sigil, too.)
BUT it only removes or modifies a specified tag, matched by specified criteria.
e.g.
  • Remove a span that has no attributes (naked).
  • Modify span class="slanty" and make it <i> (optionally keeping the attribute).

But it is not intelligent. You need to understand WHY the code IS the way it is. (Your example is extreme.)

The Editor has a CSS clean function that combines all the identical CSS entries (but does not change the code in the book).
What I do is run the clean/combine, then use the toolbox to reduce the combined items to 1 selector (a bit tedious, but fairly quick each pass).
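To see what "identical CSS entries" means in practice, here is a rough Python sketch that groups selectors whose declarations match. It is not how the Editor does it internally (I don't know that); it assumes a flat stylesheet with no comments, no @media blocks, and no nested rules, and the function name is my own.

```python
import re
from collections import defaultdict


def group_identical_rules(css: str) -> list[list[str]]:
    """Return groups of selectors that share exactly the same declarations."""
    groups = defaultdict(list)
    # Matches simple "selector { declarations }" rules only.
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        decls = set()
        for decl in body.split(";"):
            prop, sep, value = decl.partition(":")
            if sep:
                # Normalize spacing so "color:black" equals "color: black".
                decls.add(prop.strip().lower() + ":" + " ".join(value.split()))
        groups[frozenset(decls)].append(selector.strip())
    # Only groups with more than one selector are candidates for merging.
    return [names for names in groups.values() if len(names) > 1]


css = """
.text_14 { font-size: 1em; color: black }
.text_18 { color:black; font-size:1em; }
.block_21 { margin: 0 0 0 1em }
"""
print(group_identical_rules(css))
```

Each group it reports is a set of classes that could be collapsed to a single selector, which is the tedious part the toolbox passes then take care of.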
Old 04-05-2023, 09:57 AM   #5
JSWolf
Resident Curmudgeon
Posts: 74,037
Karma: 129333114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
One thing to do is modify the CSS. If you don't want text_14, delete its rule from the CSS. Then when you remove unused CSS, all those classes will go away. You can also remove empty spans using Diap's Editing Toolbag.

There's more you can do but that would depend on the code.
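The idea of "delete the rule, then let the unused classes fall away" can be sketched in a few lines of Python. This is only an illustration of the logic, not what calibre's tool does internally; it assumes class attributes are written with double quotes and that the stylesheet has no comments. The function names are hypothetical.

```python
import re


def defined_classes(css: str) -> set[str]:
    """Class names that still appear in at least one selector."""
    # Look only at the text before each "{" so values like 1.5em are ignored.
    selectors = " ".join(re.findall(r"([^{}]+)\{", css))
    return set(re.findall(r"\.([A-Za-z_][\w-]*)", selectors))


def drop_undefined_classes(html: str, css: str) -> str:
    """Remove class names that no remaining CSS rule refers to."""
    keep = defined_classes(css)

    def fix(match: re.Match) -> str:
        names = [n for n in match.group(1).split() if n in keep]
        joined = " ".join(names)
        # Drop the whole attribute when nothing is left in it.
        return f' class="{joined}"' if names else ""

    return re.sub(r'\s+class="([^"]*)"', fix, html)


html = '<p class="block_21">How<span class="text_14"> </span>can</p>'
css = ".block_21 { margin: 0 }"  # the text_14 rule has already been deleted
print(drop_undefined_classes(html, css))
```

With the `text_14` rule gone, the span is left "naked" (`<span> </span>`), which is exactly the kind of tag a removal tool can then strip safely.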
Old 04-05-2023, 11:31 AM   #6
jhowell
I was able to find the book on Amazon from the snippet of text provided. It is For Love of Evil by Piers Anthony. I looked at the free sample and it shows signs of being a conversion based on a PDF and does not appear to be professionally done. That book is part of a series that is not legitimately available as e-books as far as I know. I suspect that the rights holder will eventually notify Amazon and it will be taken down.
Old 04-05-2023, 12:08 PM   #7
JSWolf
Quote:
Originally Posted by jhowell View Post
I was able to find the book on Amazon from the snippet of text provided. It is For Love of Evil by Piers Anthony. I looked at the free sample and it shows signs of being a conversion based on a PDF and does not appear to be professionally done. That book is part of a series that is not legitimately available as e-books as far as I know. I suspect that the rights holder will eventually notify Amazon and it will be taken down.
Actually, the series is partly available as eBooks. It's an 8 book series and only the first 5 books are available as eBooks.

The series is Incarnations of Immortality.
Old 04-05-2023, 03:52 PM   #8
akita328
Hi everybody,

thank you so much for your input. I will first try azw3-->mobi-->epub and see if that helps... if not, I will try learning how to use Diap's Toolbox..

and yes, the excerpt is from "For Love of Evil". I have all 7 of the "Incarnations of Immortality" books in paperback, and books 1-6 in Kindle versions. I'm waiting for book 7 to be ebook-ed, but no luck yet. (There are a few other books I have in hardcover that I would LOVE to get hold of as ebooks.. but unfortunately only German versions exist, and there's no sign of English ones...) As far as I know, Anthony had some contract fallout (or something) with Del Rey (?) after publishing the first 5 books, and changed publishers for the last two. The first 5 books were fine... but this 6th was atrocious. If I had realized that the formatting problem was this bad and wouldn't be a simple tweak in Sigil, I would have returned the book... but.. too late now!

when learning Diap's Toolbox, I will work on a bunch of copies until I understand what it's doing, via trial and error. I usually use Sigil for the simple edits I do (usually fixing typos or strange linefeeds, etc), but have never attempted this level of cleanup... so... I may have to come back to get help later.

thanks!
Old 04-05-2023, 04:13 PM   #9
akita328
Quote:
Originally Posted by JSWolf View Post
One thing to do is modify the CSS. If you don't want text_14, delete its rule from the CSS. Then when you remove unused CSS, all those classes will go away. You can also remove empty spans using Diap's Editing Toolbag.

There's more you can do but that would depend on the code.
Woo Hoo! I just did:
azw3-->mobi-->epub

and that consolidated most of the per-word styling, so now it's mostly per-paragraph styling.

then I tried commenting out one of the classes (font color, since it's easy to see the change), ran "remove unused CSS rules" and poof! it's gone!!!

yay!

thank you everybody! I think I can get pretty far cleaning up (or at least understand why some paragraphs have very strange margins...)
Old 04-05-2023, 05:44 PM   #10
Sarmat89
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
You shouldn't just remove every span blindly, as some books use nested span tags to represent straight text within a larger italicized section.
Old 04-05-2023, 07:58 PM   #11
Karellen
Wizard
Posts: 1,104
Karma: 4911876
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
Quote:
Originally Posted by Sarmat89 View Post
You shouldn't just remove every span blindly, as some books use nested span tags to represent straight text within a larger italicized section.
Yes, I agree. You need to spend 10min or so figuring out what the classes are doing, especially when they are nested. You might also be removing blockquotes, centering, right aligned, and lots of other styling.

I have never used a plugin to fix these problems. A few well placed regexes can either remove the code or find&replace the convoluted code with your own simpler classes.

In your example a simple regex would have fixed that
Find... <span class="text_14">(.*?)</span>
Replace... \1
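
For reference, the same find/replace can be tried outside the editor with Python's `re` module, which is handy for testing a pattern on a pasted snippet before running it on the whole book. A minimal sketch; the helper name `unwrap_spans` is my own, and the sample line is shortened from the excerpt in the first post.

```python
import re


def unwrap_spans(html: str, class_name: str) -> str:
    """Remove <span class="class_name">...</span> wrappers, keeping their content."""
    pattern = re.compile(
        r'<span class="' + re.escape(class_name) + r'">(.*?)</span>'
    )
    # \1 puts back whatever was inside the span (here, a single space).
    return pattern.sub(r"\1", html)


sample = '<p class="block_21">How<span class="text_14"> </span>can<span class="text_14"> </span>I</p>'
print(unwrap_spans(sample, "text_14"))
# prints: <p class="block_21">How can I</p>
```

Note that `.` does not match a newline by default, so a span whose content is broken across lines is simply left alone.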

Note: if you have nested <span>s, the above regex will not work correctly, because the non-greedy match stops at the first closing </span> it finds (the inner one), not the one that actually closes the text_14 span.
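
A quick way to see the nesting problem is to run the pattern on a made-up nested snippet in Python. The class `text_9` and the sample text are invented for the demonstration; the "safer" pattern is just one possible workaround, assuming you only want to unwrap spans that contain no other tags.

```python
import re

naive = re.compile(r'<span class="text_14">(.*?)</span>')
# Safer variant: only match spans whose content contains no "<" at all.
safe = re.compile(r'<span class="text_14">([^<]*)</span>')

nested = '<span class="text_14">one <span class="text_9">two</span> three</span>'

# The naive pattern pairs the outer opening tag with the INNER closing tag,
# so the text_9 span ends up wrapping " three" as well.
broken = naive.sub(r"\1", nested)
print(broken)
# prints: one <span class="text_9">two three</span>

# The safe pattern leaves nested markup untouched for manual review.
untouched = safe.sub(r"\1", nested)
print(untouched == nested)
# prints: True
```

The result of the naive pattern is still well-formed, which is what makes it dangerous: the styling has silently moved, and nothing flags it.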

Then look at what is leftover and figure out what it does and either leave it, replace it or remove it.

Last edited by Karellen; 04-09-2023 at 08:34 PM.
Old 04-06-2023, 02:51 AM   #12
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by phossler View Post
Diap's Toolbox plugin will take care of those very easily
Yes. I love Diap's "Editing Toolbag" (Calibre) / "TagMechanic" (Sigil). It makes cleaning this stuff up so much easier.

I've written a handful of mini-tutorials on how to use it:

That will help you change garbage like:

Code:
<p class="junk123"><span class="italics123">This</span><span class="junk456"> </span>is<span class="junk456"> </span>an<span class="junk456"> </span>example.</p>
into:

Code:
<p><em>This</em> is an example.</p>
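
For readers who think in code, the same before/after can be reproduced with three regex passes in Python. This is only a sketch of the idea, not how Editing Toolbag or TagMechanic work internally; the class names come from the example above, and deciding that `italics123` really means italics requires checking the CSS first.

```python
import re


def simplify(html: str) -> str:
    """Apply three cleanup passes to the junk example."""
    # 1. Unwrap spans that contain nothing but whitespace, keeping the space.
    html = re.sub(r'<span class="[^"]*">(\s+)</span>', r"\1", html)
    # 2. Turn the span known to mean italics into <em> (no nested tags allowed).
    html = re.sub(r'<span class="italics123">([^<]*)</span>', r"<em>\1</em>", html)
    # 3. Drop the paragraph class that carries no useful styling.
    html = html.replace('<p class="junk123">', "<p>")
    return html


junk = (
    '<p class="junk123"><span class="italics123">This</span>'
    '<span class="junk456"> </span>is<span class="junk456"> </span>an'
    '<span class="junk456"> </span>example.</p>'
)
print(simplify(junk))
# prints: <p><em>This</em> is an example.</p>
```

The order matters: whitespace-only spans go first so that the remaining spans are the ones that actually carry meaning.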
And in 2021, I wrote even more tricks:

Quote:
Originally Posted by jhowell View Post
Try converting to MOBI format and then from MOBI back to EPUB. That will remove a lot of excess formatting.
No. It's much better to do a Calibre EPUB->EPUB. This will consolidate a lot of the horrible code.

There are a few bells and whistles you can select in the EPUB conversion to try to remove junk like extra:
  • Colors
  • Fonts
  • Font-size
  • Margins
  • [...]

This will make the CSS cleaner + be much less full of "useless" stuff, making your manual cleanup steps much easier.
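
On the command line, the closest equivalent I know of is the `--filter-css` option of calibre's `ebook-convert`, which takes a comma-separated list of CSS properties to strip from every rule. A minimal Python sketch that builds such a call; the file names and the particular property list are examples only, so pick the properties that are actually junk in your book.

```python
import subprocess


def epub_to_epub_command(source: str, target: str, drop: list[str]) -> list[str]:
    """Build an ebook-convert call that strips the given CSS properties."""
    return ["ebook-convert", source, target, "--filter-css", ",".join(drop)]


# Example property list (an assumption, not a recommendation for every book).
cmd = epub_to_epub_command(
    "messy.epub",
    "cleaner.epub",
    ["color", "font-family", "font-size", "margin-left", "margin-right"],
)
print(" ".join(cmd))

# Uncomment to actually run the conversion (needs calibre installed):
# subprocess.run(cmd, check=True)
```

Filtering is blunt: if the book uses margins for block quotes or font-size for headings, stripping those properties removes the meaningful cases along with the junk.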

I explained these methods in much more detail in:

when RbnJrg asked how to consolidate 26 books/EPUBs, with very similar formatting, into more manageable HTML+CSS code.

About a year later, I wrote:

which summarized more + explained some of the best, bleeding-edge methods.

- - -

Side Note: Some of these most-advanced cleanup tools are still in the works though...

But the pieces/concepts are all there.

Sigil 1.9.10+ added the "Advanced Find/Replace (List-Based)" method I was describing.

For more info on that, see KevinH's:

And the CSSToolbox is still in the works. (I think? I haven't talked with KevinH in a while.)

Last edited by Tex2002ans; 04-06-2023 at 03:09 AM.
Old 04-06-2023, 03:22 AM   #13
akita328
Quote:
Originally Posted by Karellen View Post
Yes, I agree. You need to spend 10min or so figuring out what the classes are doing, especially when they are nested. You might also be removing blockquotes, centering, right aligned, and lots of other styling.

I have never used a plugin to fix these problems. A few well placed regexes can either remove the code or find&replace the convoluted code with your own simpler classes.

In your example a simple regex would have fixed that
Find... <span class="text_14">(.*?)</span>
Replace... \1

Then look at what is leftover and figure out what it does and either leave it, replace it or remove it.
Not to worry. I have looked at all the classes, and verified what they are doing. many of them were duplicates... so I did a find/replace to consolidate several different labels into a single one..

everything looks much neater now, and I can see where all the bad linefeeds are, so I am systematically fixing them..

sigh
Old 04-06-2023, 03:26 AM   #14
akita328
Quote:
Originally Posted by Tex2002ans View Post
Yes. I love Diap's "Editing Toolbag" (Calibre) / "TagMechanic" (Sigil). It makes cleaning this stuff up so much easier.

I've written a handful of mini-tutorials on how to use it:
[snip]
Wow! this is a goldmine of information! thank you!

will be bookmarking these and learning more.
Old 04-06-2023, 03:55 AM   #15
Tex2002ans
Quote:
Originally Posted by akita328 View Post
Wow! this is a goldmine of information! thank you!

will be bookmarking these and learning more.


Here's one more too:

You can always type stuff like this into your favorite search engines too:

Code:
cleanup CSS EPUB Tex2002ans site:mobileread.com
regular expressions Tex2002ans site:mobileread.com
That will lead you to any answers you need. (It's what I do whenever I need to dig up links to my old answers.)

I've written entire encyclopedias of tips/tricks/problems/anything-to-do with ebooks + ebook cleanup!

Quote:
Originally Posted by akita328 View Post
Not to worry. I have looked at all the classes, and verified what they are doing. many of them were duplicates... so I did a find/replace to consolidate several different labels into a single one.
That's exactly what KevinH's CSSToolbox was meant to accomplish!

It would let you:
  • Select a class
  • See all 100% exact (or very similar) CSS classes

then let you merge any "duplicates" into 1 class at the push of a button.

Instead of you manually doing multiple rounds of this, via Diap's TagMechanic + conversions + lots of elbow grease, it would cut down the cleanup work dramatically.
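
To be clear, what follows is NOT CSSToolbox, just a rough Python sketch of the "find exact and almost-identical classes" idea so you can see what such a tool would be comparing. It scores each pair of rules by the share of declarations they have in common (Jaccard similarity); the 0.5 threshold, the function names, and the sample CSS are all my own assumptions, and it only handles a flat stylesheet without comments or @media blocks.

```python
import re
from itertools import combinations


def parse_rules(css: str) -> dict[str, frozenset[str]]:
    """Map each selector to its normalized set of declarations."""
    rules = {}
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        decls = set()
        for decl in body.split(";"):
            prop, sep, value = decl.partition(":")
            if sep:
                decls.add(prop.strip().lower() + ":" + " ".join(value.split()))
        rules[selector.strip()] = frozenset(decls)
    return rules


def similar_pairs(css: str, threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """List selector pairs whose declarations overlap by at least threshold."""
    pairs = []
    for (a, da), (b, db) in combinations(parse_rules(css).items(), 2):
        union = da | db
        score = len(da & db) / len(union) if union else 1.0
        if score >= threshold:
            pairs.append((a, b, score))
    # Exact duplicates (score 1.0) come first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


css = """
.junk123 { font-style: italic; color: black }
.junk456 { font-style: italic; color: black }
.junk789 { font-style: italic }
.block_1 { margin: 0 }
"""
for a, b, score in similar_pairs(css):
    print(f"{a} ~ {b}: {score:.2f}")
```

Pairs scoring 1.0 are safe merge candidates; anything lower is an "almost-twin" that still needs a human to decide whether the difference matters.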

- - -

Side Note: KevinH's CSSToolbox isn't out yet, but will be in the near-future!

For now, you have to settle for rounds of Calibre EPUB->EPUB conversions with manual trash removal in the middle.

- - -

Right-Click Trick (to Rename Classes)

Another timesaving trick you can do in Sigil/Calibre is:
  • Right-Click on a class name in your HTML.

Now you can "Rename" that class to something more human-readable.

For example:

Code:
<p>This is a word in <span class="junk123">italics</span>.</p>
<p>Even <span class="junk123">more</span> in <span class="junk123">italics</span>.</p>
Right-Click on the class="" part of it, then you can
  • Change "junk123" -> "italics"

and it will auto-replace all others in the book too:

Code:
<p>This is a word in <span class="italics">italics</span>.</p>
<p>Even <span class="italics">more</span> in <span class="italics">italics</span>.</p>
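
If you ever need the same book-wide rename outside Sigil/Calibre, it can be approximated in Python. A sketch only, with hypothetical function names; it assumes double-quoted class attributes, and the CSS pattern is naive enough that it would also touch a matching `.junk123` inside a url() or a comment.

```python
import re


def rename_class_in_html(html: str, old: str, new: str) -> str:
    """Rename one class inside class="..." attributes, leaving others alone."""
    def fix(match: re.Match) -> str:
        names = [new if name == old else name for name in match.group(1).split()]
        return 'class="' + " ".join(names) + '"'

    return re.sub(r'class="([^"]*)"', fix, html)


def rename_class_in_css(css: str, old: str, new: str) -> str:
    """Rename .old to .new in selectors, without touching .old-ish names."""
    # The lookahead stops ".junk123" from matching inside ".junk1234".
    return re.sub(r"\." + re.escape(old) + r"(?![\w-])", "." + new, css)


html = '<p>Even <span class="junk123">more</span> in <span class="junk123 big">italics</span>.</p>'
css = ".junk123 { font-style: italic } .junk1234 { color: red }"

print(rename_class_in_html(html, "junk123", "italics"))
print(rename_class_in_css(css, "junk123", "italics"))
```

Both the HTML and the stylesheet have to be renamed together, otherwise the text loses its styling; the editors' built-in Rename does that for you.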
Now, when you're doing your Diap rounds later, it's so much easier to see exactly what you marked as the code's intent. Then you could easily convert those spans into:

Code:
<p>This is a word in <i>italics</i>.</p>
<p>Even <i>more</i> in <i>italics</i>.</p>
- - -

Side Note #2: I manually did a lot of that Find/Replace stuff way back when... then the Right-Click trick was initially in Calibre, then Sigil added it recently.

Now it's so much faster than it used to be!

And that's pretty much how I initially came up with the concept for CSSToolbox...

Instead of manually flipping back-and-forth through the classes + CSS, then renaming the stuff...

I'll soon be able to yell at CSSToolbox:

"Hey! You know those dozen classes junk123, junk456, junk789... that are all effectively italics? How about you just find/rename those all for me in one shot?"

And:

"Hey! Help find my almost-twins! How about you merge/rename those for me too?"

Then it'll turn hundreds of Right-Click > Renames + lots of cross-eyed CSS compares into a few button presses too!

Last edited by Tex2002ans; 04-06-2023 at 05:05 AM.