#1 |
Member
Posts: 13
Karma: 10
Join Date: Aug 2019
Device: kindle, iPad Marvin
|
removing excessive <class> and other formatting horrors on epub
hi all,
I wasn't sure where this sort of question should be posted, so I'm posting here...
I bought an ebook from Amazon and noticed it had pretty bad formatting errors that made the book very difficult to read on my Kindle, so I thought I'd try to fix it myself. But when I used Calibre to convert the azw3 to ePub, I saw horrible code where almost every single word has its own class/span/etc., and I can't make heads or tails of how to remove it and fix the weird indentation and margin problems. (I think I can fix the pagination problems, but that's minor compared to what's below.)
Here's an excerpt of the code. The entire book is like this:
Code:
<p class="block_21">“How<span class="text_14"> </span>can<span class="text_14"> </span>I<span class="text_14"> </span>persuade<span class="text_14"> </span>you<span class="text_14"> </span>that<span class="text_14"> </span>I<span class="text_14"> </span>mean<span class="text_14"> </span>you<span class="text_14"> </span>no<span class="text_14"> </span>harm?”<span class="text_14"> </span>he<span class="text_14"> </span>asked.<span class="text_14"> </span>“I<span class="text_14"> </span>swear to you that I will do nothing to you.”</p>
<p class="block_22">“Will<span class="text_18"> </span>you<span class="text_18"> </span>swear<span class="text_18"> </span>by<span class="text_18"> </span>the<span class="text_18"> </span>Blessed<span class="text_18"> </span>Virgin<span class="text_18"> </span>Mary?”<span class="text_18"> </span>she<span class="text_18"> </span>asked<span class="text_18"> </span>disbelievingly. “I swear it.”</p>
Each paragraph has its own block_# class, with its own class="text_#" on almost every single word in the paragraph (and the block_# and text_# pairings differ from paragraph to paragraph...).
I took a peek at the original azw3 file, and it is just as bad, so the azw3-->epub conversion didn't do this. It's the horrible Amazon encoding...
Is there any way to clean up a mess like the above, stripping most of the junk down to something resembling a sane text file, so that I can fix the incorrect pagination, margins, and linefeeds? If I have to delete these things manually, it would be faster to retype the entire book from scratch. I have the paperback copy of the book as well as the ebook version, so I have a reference for what it's SUPPOSED to look like.
I'm hoping there is a way to export to some other format that can be converted back into a simpler ePub with the excessive use of <class> stripped... (I'm a novice and can edit an existing ePub file, but wouldn't know where to start if I had to start from scratch.) I use Sigil for minor text cleanup, typos, and some formatting, and Calibre on macOS. I'm generally computer savvy, but far from an expert on stuff like this.
Any help is appreciated... |
#2 |
Grand Sorcerer
Posts: 7,023
Karma: 90000009
Join Date: Nov 2011
Location: Charlottesville, VA
Device: Kindles
|
Try converting to MOBI format and then from MOBI back to EPUB. That will remove a lot of excess formatting.
It's really down to the publisher. Garbage in, garbage out. Can you share which title this is from Amazon? |
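For anyone who would rather script the MOBI round trip suggested above, here is a minimal Python sketch. It assumes Calibre's `ebook-convert` command-line tool is on your PATH (on macOS it ships inside the Calibre application bundle, so you may have to add it yourself). The function names and the `_clean` suffix are made up for illustration.

```python
import subprocess
from pathlib import Path


def roundtrip_commands(source):
    """Build the two ebook-convert calls: source -> MOBI -> EPUB."""
    src = Path(source)
    mobi = src.with_suffix(".mobi")
    # Hypothetical naming choice: write the result next to the source file.
    epub = src.with_name(src.stem + "_clean.epub")
    return [
        ["ebook-convert", str(src), str(mobi)],
        ["ebook-convert", str(mobi), str(epub)],
    ]


def roundtrip(source):
    """Run both conversions, stopping if either one fails."""
    for cmd in roundtrip_commands(source):
        subprocess.run(cmd, check=True)
```

Calling `roundtrip("book.azw3")` would leave `book.mobi` and `book_clean.epub` beside the original. How much junk this strips depends on the book, so always work on a copy.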
#3 |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Diap's Toolbox plugin will take care of those very easily
https://www.mobileread.com/forums/sh...40#post2980740 |
#4 |
Well trained by Cats
Posts: 30,891
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
I second the Toolbox plugin (also available for Sigil).
BUT it removes a specified tag by specified criteria, e.g. remove a span that has no attributes (naked), or change span class="slanty" into <i> (optionally keeping the attribute). It is not intelligent, though: you need to understand WHY the code IS the way it is. (Your example is extreme.) The Editor has a CSS clean function that combines all the identical CSS entries (but does not change the code in the book). |
#5 |
Resident Curmudgeon
Posts: 79,029
Karma: 144284074
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
One thing to do is modify the CSS. If you don't want text_14, delete its rule from the CSS. Then, when you remove unused CSS, all those classes will go away. You can also remove empty spans using Diap's Editing Toolbag.
There's more you can do but that would depend on the code. |
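Before deleting rules from the CSS, it can help to see which classes the book actually uses and how often. A rough Python sketch, assuming you have the chapter's HTML as a string; `count_classes` is a made-up helper, not part of Calibre or Sigil, and the regex only handles double-quoted class attributes:

```python
import re
from collections import Counter

# Matches class="..." attributes; single-quoted attributes are not handled.
CLASS_ATTR = re.compile(r'class="([^"]*)"')


def count_classes(html):
    """Count how many times each class name appears in the markup."""
    counts = Counter()
    for value in CLASS_ATTR.findall(html):
        # An attribute may list several classes separated by whitespace.
        counts.update(value.split())
    return counts


sample = ('<p class="block_21">How<span class="text_14"> </span>'
          'can<span class="text_14"> </span>I</p>')
for name, n in count_classes(sample).most_common():
    print(name, n)
```

Classes that show up thousands of times between words (like text_14 in the excerpt) are the obvious candidates to look up in the stylesheet first.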
#6 |
Grand Sorcerer
Posts: 7,023
Karma: 90000009
Join Date: Nov 2011
Location: Charlottesville, VA
Device: Kindles
|
I was able to find the book on Amazon from the snippet of text provided. It is For Love of Evil by Piers Anthony. I looked at the free sample and it shows signs of being a conversion based on a PDF and does not appear to be professionally done. That book is part of a series that is not legitimately available as e-books as far as I know. I suspect that the rights holder will eventually notify Amazon and it will be taken down.
|
#7 | |
Resident Curmudgeon
Posts: 79,029
Karma: 144284074
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
The series is Incarnations of Immortality. |
|
#8 |
Member
Posts: 13
Karma: 10
Join Date: Aug 2019
Device: kindle, iPad Marvin
|
Hi everybody,
Thank you so much for your input. I will first try azw3-->mobi-->epub and see if that helps. If not, I will try learning how to use Diap's Toolbox.
And yes, the excerpt is from "For Love of Evil". I have all 7 of the "Incarnations of Immortality" books in paperback, and books 1-6 in Kindle versions. I'm waiting for book 7 to be ebook-ed, but no luck yet. (There are a few other books I have in hardcover that I would LOVE to get hold of in ebook versions, but unfortunately only a German version exists, with no sign of English versions...)
As far as I know, Anthony had some contract fallout (or something) with Del Rey (?) after publishing the first 5 books, and changed publishers for the last two. The first 5 books were fine, but this 6th was atrocious. It is bad enough that if I had realized the formatting problem was this bad and wouldn't be a simple tweak in Sigil, I would have returned the book... but it's too late now!
When learning Diap's Toolbox, I will work on a bunch of copies until I understand what it's doing via trial and error. I usually use Sigil for the simple edits I do (usually fixing typos or strange linefeeds, etc.), but have never attempted this level of cleanup, so I may have to come back to get help later. Thanks! |
#9 | |
Member
Posts: 13
Karma: 10
Join Date: Aug 2019
Device: kindle, iPad Marvin
|
Quote:
I tried azw3-->mobi-->epub, and that consolidated most of the per-word styling, so now it's mostly per-paragraph styling. Then I tried commenting out one of the classes (font color, since it's easy to see the change), looked for "remove unused CSS rules", and poof! It's gone!!! Yay! Thank you everybody! I think I can get pretty far cleaning up (or at least understand why some paragraphs have very strange margins...) |
|
#10 |
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
|
You shouldn't just remove every span blindly, as some books use nested span tags to represent straight text within a larger italicized section.
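To illustrate the point about nesting: a small Python sketch that unwraps only the spans whose class you name and leaves every other span (for example, one marking upright text inside italics) alone. It uses the standard library's `html.parser`; the `SpanUnwrapper` class and `unwrap` function are made-up names, and this is only a sketch for body fragments (doctype and processing instructions are not preserved, and the class attribute must match exactly).

```python
from html.parser import HTMLParser


class SpanUnwrapper(HTMLParser):
    """Drop <span> tags whose class is in `junk`, keeping their contents."""

    def __init__(self, junk):
        super().__init__(convert_charrefs=False)
        self.junk = set(junk)
        self.out = []
        self.stack = []  # one flag per open <span>: True if it was dropped

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            drop = dict(attrs).get("class") in self.junk
            self.stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only swallow the </span> that closes a span we dropped.
        if tag == "span" and self.stack and self.stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def unwrap(html, junk):
    parser = SpanUnwrapper(junk)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


nested = ('<i>one <span class="text_14">two '
          '<span class="upright">three</span> four</span></i>')
print(unwrap(nested, {"text_14"}))
```

Because the parser tracks which closing tag belongs to which opening tag, the inner span keeps wrapping exactly the text it wrapped before.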
|
#11 | |
Wizard
Posts: 1,604
Karma: 9500498
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Quote:
I have never used a plugin to fix these problems. A few well-placed regexes can either remove the code or find & replace the convoluted code with your own simpler classes. In your example, a simple regex would have fixed that:
Find... <span class="text_14">(.*?)</span>
Replace... \1
Note: if you have nested <span>s, the above regex will not work correctly, as it will stop at the first </span>, not the one that actually closes the outer span.
Then look at what is left over and figure out what it does, and either leave it, replace it, or remove it.
Last edited by Karellen; 04-09-2023 at 08:34 PM. |
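The same find/replace can be tried outside the editor. A minimal Python sketch of that regex, including the nested case the note warns about; `unwrap_spans` is a made-up helper name:

```python
import re

# The Find pattern from the post: lazily capture the contents of one span class.
SPAN_RE = re.compile(r'<span class="text_14">(.*?)</span>')


def unwrap_spans(html):
    """Replace each matched span with its contents (the \\1 in the post)."""
    return SPAN_RE.sub(r"\1", html)


flat = '<p class="block_21">How<span class="text_14"> </span>can<span class="text_14"> </span>I</p>'
print(unwrap_spans(flat))

# Nested case: the lazy match ends at the FIRST </span>, which belongs to the
# inner span, so the inner span ends up wrapping text it did not wrap before.
nested = '<span class="text_14">a<span class="x">b</span>c</span>'
print(unwrap_spans(nested))
```

In the flat case the junk spans collapse to plain spaces. In the nested case the tags still balance, but the inner span now covers "bc" instead of just "b", which is exactly the kind of silent damage to check for afterwards.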
|
#12 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Yes. I love Diap's "Editing Toolbag" (Calibre) / "TagMechanic" (Sigil). It makes cleaning this stuff up so much easier.
I've written a handful of mini-tutorials on how to use it:
That will help you change garbage like:
Code:
<p class="junk123"><span class="italics123">This</span><span class="junk456"> </span>is<span class="junk456"> </span>an<span class="junk456"> </span>example.</p>
into:
Code:
<p><em>This</em> is an example.</p>
Quote:
There are a few bells and whistles you can select in the EPUB conversion to try to remove junk like extra:
This will make the CSS cleaner + be much less full of "useless" stuff, making your manual cleanup steps much easier.
I explained these methods in much more detail in: when RbnJrg asked how to consolidate 26 books/EPUBs, with very similar formatting, into more manageable HTML+CSS code.
About a year later, I wrote: which summarized more + explained some of the best, bleeding-edge methods.
- - -
Side Note: Some of these most-advanced cleanup tools are still in the works though... But the pieces/concepts are all there.
Sigil 1.9.10+ added the "Advanced Find/Replace (List-Based)" method I was describing. For more info on that, see KevinH's:
And the CSSToolbox is still in the works. (I think? I haven't talked with KevinH in a while.)
Last edited by Tex2002ans; 04-06-2023 at 03:09 AM. |
|
#13 | |
Member
Posts: 13
Karma: 10
Join Date: Aug 2019
Device: kindle, iPad Marvin
|
Quote:
now everything looks much neater. sigh |
|
#14 | |
Member
Posts: 13
Karma: 10
Join Date: Aug 2019
Device: kindle, iPad Marvin
|
Quote:
I will be bookmarking these and learning more. |
|
#15 | ||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Here's one more too:
You can always type stuff like this into your favorite search engines too:
Code:
cleanup CSS EPUB Tex2002ans site:mobileread.com
regular expressions Tex2002ans site:mobileread.com
I've written entire encyclopedias of tips/tricks/problems/anything-to-do with ebooks + ebook cleanup!
Quote:
It would let you:
then let you merge any "duplicates" into 1 class at the push of a button.
Instead of you manually doing multiple rounds of this, via Diap's TagMechanic + conversions + lots of elbow grease, it would cut down the cleanup work dramatically.
- - -
Side Note: KevinH's CSSToolbox isn't out yet, but will be in the near future! For now, you have to settle for rounds of Calibre EPUB->EPUB conversions with manual trash removal in the middle.
- - -
Right-Click Trick (to Rename Classes)
Another timesaving trick you can do in Sigil/Calibre is:
Now you can "Rename" that class to something more human-readable. For example:
Code:
<p>This is a word in <span class="junk123">italics</span>.</p>
<p>Even <span class="junk123">more</span> in <span class="junk123">italics</span>.</p>
and it will auto-replace all others in the book too:
Code:
<p>This is a word in <span class="italics">italics</span>.</p>
<p>Even <span class="italics">more</span> in <span class="italics">italics</span>.</p>
Code:
<p>This is a word in <i>italics</i>.</p>
<p>Even <i>more</i> in <i>italics</i>.</p>
Side Note #2: I manually did a lot of that Find/Replace stuff way back when... then the Right-Click trick was initially in Calibre, then Sigil added it recently. Now it's so much faster than it used to be!
And that's pretty much how I initially came up with the concept for CSSToolbox... Instead of manually flipping back-and-forth through the classes + CSS, then renaming the stuff... I'll soon be able to yell at CSSToolbox:
"Hey! You know those dozen classes junk123, junk456, junk789... that are all effectively italics? How about you just find/rename those all for me in one shot?"
And:
"Hey! Help find my almost-twins! How about you merge/rename those for me too?"
Then it'll turn hundreds of Right-Click > Renames + lots of cross-eyed CSS compares into a few button presses too!
Last edited by Tex2002ans; 04-06-2023 at 05:05 AM. |
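Until a tool like that exists, the "find my duplicates" step can be roughly approximated by hand. The Python sketch below is NOT how CSSToolbox works (it isn't released); it is just a naive illustration that groups simple `.class { ... }` rules with identical declarations. `find_duplicates` and `normalize` are made-up names, and the regex ignores compound selectors, media queries, and comments.

```python
import re
from collections import defaultdict

# Naive matcher for simple rules like ".junk123 { font-style: italic }".
RULE = re.compile(r'\.([\w-]+)\s*\{([^}]*)\}')


def normalize(body):
    """Turn 'a: b; c: d' into an order-independent set of (property, value)."""
    pairs = set()
    for decl in body.split(";"):
        if ":" in decl:
            prop, value = decl.split(":", 1)
            pairs.add((prop.strip().lower(), " ".join(value.split())))
    return frozenset(pairs)


def find_duplicates(css):
    """Return lists of class names whose declarations are identical."""
    groups = defaultdict(list)
    for name, body in RULE.findall(css):
        groups[normalize(body)].append(name)
    return [names for names in groups.values() if len(names) > 1]


css = """
.junk123 { font-style: italic }
.junk456 { font-style:italic; }
.junk789 { font-weight: bold }
"""
print(find_duplicates(css))
```

Each group it reports is a candidate for the Right-Click > Rename trick above: rename every class in the group to one human-readable name, then let the editor's CSS cleanup merge the identical rules.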
||