08-17-2015, 08:22 AM | #1 |
Bookworm
Posts: 975
Karma: 768585
Join Date: Aug 2010
Location: Netherlands
Device: Sony prs-650, Kobo Glo HD (2x), Kobo Glo
|
Is there Epub cleaning software (so much unneeded code inside)
Dear friends.
When I opened a commercial ebook that I bought (social DRM), I noticed the book is really slow. So I looked at the code, and it was terrible. There is no way I can clean this by hand. Is there software (free, I hope, because so far it is for only one book) or an online tool that can clean it for me? Look at the spoiler for an example. In reality these are only a couple of lines from the book. Spoiler:
|
08-17-2015, 08:33 AM | #2 | |
350 Hoarder
Posts: 3,574
Karma: 8281267
Join Date: Dec 2010
Location: Midwest USA
Device: Sony PRS-350, Kobo Glo & Glo HD, PW2
|
This looks to be a typical paragraph from your selection, and if it's the same throughout the book, I think you could clean it by hand pretty easily using Sigil's Find and Replace:
Quote:
Then search for <span class="dlct-007"> and leave the replace box blank so that it deletes all instances of it. Note the number of instances it finds here for the next step. Then search for </span> and do the same thing: leave the replace box blank so they'll all be deleted. Check that the number of instances found matches the previous step; if it doesn't, there might be other span classes somewhere, some of which you might want to keep. You can find those after this step is done by just searching for "span" and seeing what comes up. If there are a lot of other instances of "dlct-###" with various numbers, you could use regex the same way and get them all. Then just let Sigil clean up the stylesheet to delete any unused styles and try it. I've never found any automated software that can make the proper decisions about what stays and what should go. |
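For anyone who would rather script the count-then-delete check described above than click through Sigil, here is a minimal Python sketch. The function name `strip_span` and the sample markup are hypothetical, made up for illustration; the real book's HTML will look different.

```python
def strip_span(html, css_class):
    """Remove <span class="css_class"> wrappers, but only if the counts line up."""
    open_tag = f'<span class="{css_class}">'
    close_tag = '</span>'
    opens = html.count(open_tag)
    closes = html.count(close_tag)
    if opens != closes:
        # Other span classes are probably present: leave the text untouched.
        return html, opens, closes
    cleaned = html.replace(open_tag, '').replace(close_tag, '')
    return cleaned, opens, closes


# Hypothetical sample, not taken from the actual book.
sample = ('<p class="dlct-025"><span class="dlct-007">Once upon</span> '
          '<span class="dlct-007">a time</span></p>')

cleaned, opens, closes = strip_span(sample, 'dlct-007')
print(opens, closes)
print(cleaned)
```

The mismatch check is the same safety net as comparing the two "instances found" numbers by hand: if the counts differ, nothing is deleted.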
|
08-17-2015, 08:47 AM | #3 | |
Bookworm
Posts: 975
Karma: 768585
Join Date: Aug 2010
Location: Netherlands
Device: Sony prs-650, Kobo Glo HD (2x), Kobo Glo
|
Quote:
Due to my dyslexia I really don't understand the regex way of searching. In the very beginning Sigil worked with wildcards like dlct*, but that isn't the case anymore. Also, the book doesn't use any .css; all the rules are in the HTML: Spoiler:
This is only a small part of one HTML file, and they use margin-right to position the words (or so it seems). Maybe it is better to put it all in Calibre, convert it to txt, and then rebuild it properly with Sigil, if that is possible. I really don't know what the publisher is trying to do with it. Yes, I really tried to understand the regex method, but it just doesn't stay in the part of my brain I reserved for it. And imagine what happens with all the classes when I send it through the KoboTouchExtended driver to make a kepub... Last edited by Nick_1964; 08-17-2015 at 08:53 AM. |
|
08-17-2015, 08:58 AM | #4 |
Gnu
Posts: 1,222
Karma: 15625359
Join Date: Jul 2009
Location: UK
Device: BeBook,JetBook Lite,PRS-300-350-505-650,+ran out of space to type
|
In Sigil, use the Regex mode for find/replace.
In the find box: <span class="dlct-007">(.*?)</span>
In the replace box: \1
Then Replace All. |
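The same find/replace pair can be tried outside Sigil to see what it does. This is a small Python sketch using the standard `re` module; the sample markup is hypothetical, and the `re.DOTALL` flag is an assumption on my part (it lets a span run across line breaks), not something the post specifies.

```python
import re

# Hypothetical sample markup; the real book will differ.
html = ('<p class="dlct-025"><span class="dlct-007">Once upon</span> '
        '<span class="dlct-007">a time</span></p>')

# (.*?) captures the text between the tags; \1 puts it back without them.
pattern = re.compile(r'<span class="dlct-007">(.*?)</span>', re.DOTALL)
unwrapped = pattern.sub(r'\1', html)
print(unwrapped)
```

One caveat: if spans are nested inside each other, the first `</span>` found may belong to an inner span, so it is worth checking a few paragraphs by eye afterwards.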
08-17-2015, 09:08 AM | #5 | |
Bookworm
Posts: 975
Karma: 768585
Join Date: Aug 2010
Location: Netherlands
Device: Sony prs-650, Kobo Glo HD (2x), Kobo Glo
|
Quote:
The dlct-007 is just one of them. They start with 1 and end with... I don't even know where, and every one is different and in use. Some contain larger text for chapters, so the layout would be gone anyway. I would rather scan a book with ABBYY than deal with this mess. |
|
08-17-2015, 09:19 AM | #6 | |
Gnu
Posts: 1,222
Karma: 15625359
Join Date: Jul 2009
Location: UK
Device: BeBook,JetBook Lite,PRS-300-350-505-650,+ran out of space to type
|
Quote:
Search for this: <span class="dlct-007">, then any combination of characters, (.*?), ending with this: </span>. The brackets around .*? say "save this text for later use". To use it in the replace box, start with \ followed by the number of the saved group; as it's the first (and only) group in your case, use \1 for the replace. From the look of the sample you posted, I would just try replacing 007 for now, as it seems to be the default paragraph style and won't affect the layout. |
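To make the "save this text for later use" idea concrete, here is a short Python sketch that prints what the brackets actually capture, and what goes wrong if the `?` is left out. The sample line is hypothetical.

```python
import re

# Hypothetical line with two spans of the same class.
html = '<span class="dlct-007">one</span> and <span class="dlct-007">two</span>'

# With the ?, the match stops at the first </span> it meets (one group per span).
lazy = re.findall(r'<span class="dlct-007">(.*?)</span>', html)

# Without the ?, the match runs on to the LAST </span> on the line.
greedy = re.findall(r'<span class="dlct-007">(.*)</span>', html)

print(lazy)
print(greedy)
```

The first list holds the two pieces of text you want to keep; the second holds one long chunk that still has tags inside it, which is why the pattern uses `.*?` rather than `.*`.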
|
08-17-2015, 09:40 AM | #7 | |
Bookworm
Posts: 975
Karma: 768585
Join Date: Aug 2010
Location: Netherlands
Device: Sony prs-650, Kobo Glo HD (2x), Kobo Glo
|
Quote:
|
|
08-17-2015, 10:00 AM | #8 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Moved to the "Workshop" forum, where such questions belong.
|
08-17-2015, 10:07 AM | #9 |
Banned
Posts: 272
Karma: 1224588
Join Date: Sep 2014
Device: Sony PRS 650
|
Delete <span class="dlct-\d\d\d"> in Sigil and let tidy do the rest.
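As a rough illustration of what the `\d\d\d` pattern matches, the Python sketch below first counts which numbered classes occur (handy for deciding whether some should be kept) and then deletes the opening tags. The sample markup is hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample with two different numbered classes.
html = ('<p class="dlct-025"><span class="dlct-007">Some </span>'
        '<span class="dlct-013">text</span></p>')

# \d\d\d matches any three digits, so one pattern covers every numbered span class.
pattern = re.compile(r'<span class="dlct-(\d\d\d)">')

counts = Counter(pattern.findall(html))  # how often each class number is used
stripped = pattern.sub('', html)         # opening tags removed

print(counts)
print(stripped)
```

Note that `stripped` still contains the now-orphaned `</span>` tags; the post above relies on Sigil's tidy step to deal with those, and this sketch does not attempt it.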
|
08-17-2015, 10:08 AM | #10 | |
Bookworm
Posts: 975
Karma: 768585
Join Date: Aug 2010
Location: Netherlands
Device: Sony prs-650, Kobo Glo HD (2x), Kobo Glo
|
Excuse me, I didn't find that one.
Quote:
But I am trying to figure out how regex works, so examples like these are gold for me. Last edited by Nick_1964; 08-17-2015 at 10:29 AM. |
|
08-17-2015, 10:43 AM | #11 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
If you want, Nick, I can clean it up for you. Just give me a sign.
|
08-17-2015, 10:54 AM | #12 |
Guru
Posts: 631
Karma: 7544080
Join Date: Apr 2013
Location: Berlin
Device: PRS 350, Kobo Aura
|
Is it a novel? If so, maybe use calibre to convert it to, for example, txt with markdown and then back to epub. Or convert it to docx and use Toxaris' Word add-in. Good luck, it really looks terrible.
|
08-17-2015, 10:57 AM | #13 | ||
Bookworm
Posts: 975
Karma: 768585
Join Date: Aug 2010
Location: Netherlands
Device: Sony prs-650, Kobo Glo HD (2x), Kobo Glo
|
Quote:
But I am a bit further now, only to find out that there is a blank line between almost every pair of lines. When I remove them with rules in a .css file I added, they are gone, but so are the paragraph breaks that do belong there. Even visually I can't tell what is a blank line caused by wrong HTML and what is a real paragraph break; almost all of them start with <p class="dlct-025">, and that is also there at the places where I expect a paragraph. The files are not split into sections where a new chapter begins (I always do that); instead the chapters are divided by a bunch of line breaks (</br>). Man oh man. Even worse, all the hyphens (-) are there (I suppose they are in the real paper book) right in the middle of lines, but some text is also marked off like - txt here - so I can't just remove all the hyphens. And guess what: she also asked me to buy the 2 other parts, and they are made exactly the same way. Quote:
They are 3 children's books. As far as I can see, it is a story about a couple of kids who are bookkeepers (dunno, it seems a bit odd, just like the mess the books are in) and have to fight dragons and other things in an underworld. I don't know why or to what end, and I don't want to know either. Last edited by Nick_1964; 08-17-2015 at 11:15 AM. |
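On the hyphen problem described in this post: one possible approach, sketched here in Python and resting on the assumption that the leftover print hyphens always sit directly between two letters while the " - txt here - " dashes always have spaces around them, is to remove only the first kind. The sample sentence is invented.

```python
import re

# Invented sample: two leftover print hyphens and one pair of spaced dashes.
text = 'The drag-on roared - very loudly - at the chil-dren.'

# [^\W\d_] matches a single letter (accented letters included).
# The lookbehind/lookahead require a letter on BOTH sides of the hyphen,
# so " - " with spaces around it is left alone.
joined = re.sub(r'(?<=[^\W\d_])-(?=[^\W\d_])', '', text)
print(joined)
```

Be careful: this also joins genuine compound words that are meant to keep their hyphen, so it is safer to step through the matches one by one than to replace everything in one go.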
||
08-17-2015, 01:41 PM | #14 |
Banned
Posts: 272
Karma: 1224588
Join Date: Sep 2014
Device: Sony PRS 650
|
|
08-17-2015, 01:45 PM | #15 |
Bookworm
Posts: 975
Karma: 768585
Join Date: Aug 2010
Location: Netherlands
Device: Sony prs-650, Kobo Glo HD (2x), Kobo Glo
|
|
|