#1 |
Junior Member
Posts: 8
Karma: 10
Join Date: Jun 2011
Device: kindle touch
|
How do you remove class="whitespace"?
A lot of my epub books have class="whitespace" or class="softbreak" where there were once page breaks, left over from conversions in their past life. I want to remove them, especially when they fall mid-sentence.
I seem to have no problems using regex to unwrap lines with other calibre classes, but it never works with these classes. For example, if I test for the following regex, calibre finds 223 instances:
([a-z0-9-,])(</p>)(\s)(<p class="whitespace"> </p> <p class="calibre3">)
I want to replace it with:
\1\3
But the conversion never works. Nothing happens. After conversion calibre still finds 223 instances and the book looks the same.
Can any expert out there tell me what I'm missing here? Sorry if this question has been asked and answered before, but I couldn't find anything in the search. Thanks so much in advance for any help!!
#2 | |
Well trained by Cats
Posts: 30,889
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Personally, I use Sigil so I can see exactly what a search finds (and the results of my replace).
#3 |
Junior Member
Posts: 8
Karma: 10
Join Date: Jun 2011
Device: kindle touch
|
Thanks for your answer, but it didn't work: calibre still finds all the instances but won't replace them. I also tried
([a-z0-9-,])(</p>)(\s)(<p class="whitespace">\s?</p>\s? <p class="calibre3">)
but that didn't work either. I might investigate Sigil. It's just frustrating that Calibre does most of what I want, but I can't figure this one out.
#4 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
In my experience, whitespace / softbreak are usually there to separate scenes within chapters, i.e. they correspond to where you'd see a blank line in a paper book, so irreversibly removing all of them without previewing the book may not be a good idea.
If you go the Sigil route you could redefine them in CSS rather than remove them completely.
PS: unless the book was originally a PDF, in which case a PDF to epub conversion may have added them between "pages". In that case I "think" the remove spaces between paragraphs option in calibre's look and feel preferences will nuke them.
Last edited by cybmole; 07-02-2011 at 06:01 AM.
#5 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Sounds like heuristics was enabled to detect softbreaks but line unwrapping wasn't enabled or configured to catch the broken line breaks. This happens when the book is mostly formatted properly but just a small percentage of lines are broken, e.g. page breaks from an OCR conversion.
The first thing I'd try is to make sure line unwrapping is enabled and reduce the line unwrap factor to where those lines on page breaks get 'unwrapped'. You might need it as low as .1 if the book is as I described above.
If there are no softbreaks in the book to be preserved, you can disable scene break detection completely under heuristics so that the whitespace and softbreak classes are never created.
theducks' suggestion would work if you're dealing with someone else's Calibre conversion. If you're doing the conversion yourself then the culprit is having softbreak detection enabled under heuristics, as heuristics occurs after search and replace.
Also, you should be using \s* in your regex and generally make it more generic:
Code:
([a-z0-9-,])</p>\s*<p class="(whitespace|softbreak)">\s*</p><p[^>]*>
Last edited by ldolse; 07-02-2011 at 08:07 AM.
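For anyone who wants to try the generic pattern outside a conversion, here is a minimal Python sketch of the same idea. It is not calibre's own code. The function name `join_split_paragraphs` and the sample markup are made up for illustration, and it assumes whitespace may also sit between the empty paragraph and the next opening tag, so an extra \s* is added there. The replacement keeps the last character of the first half and inserts a single space.

```python
import re

# Assumption: whitespace (including newlines) may appear in every gap between
# tags, so \s* is allowed in each one. In Python 3 str patterns, \s also
# matches a non-breaking space inside the empty paragraph.
SPLIT = re.compile(
    r'([a-z0-9,-])</p>\s*'
    r'<p class="(?:whitespace|softbreak)">\s*</p>\s*'
    r'<p[^>]*>'
)


def join_split_paragraphs(html: str) -> str:
    """Merge two paragraphs separated mid-sentence by an empty spacer paragraph.

    Hypothetical helper for illustration only.
    """
    # \1 is the last character before the break; one space rejoins the sentence.
    return SPLIT.sub(r'\1 ', html)


sample = (
    '<p class="calibre3">The quick brown fox</p>\n'
    '<p class="whitespace"> </p> <p class="calibre3">jumps over the dog.</p>'
)
print(join_split_paragraphs(sample))
```

Because the character class only accepts a lowercase letter, digit, comma or hyphen before the closing tag, a spacer that follows a full stop is left alone, which keeps genuine scene breaks intact.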
#6 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
I am curious:
The heuristics options page refers to scene breaks and also to soft scene breaks, but not to softbreaks. Can you please define and explain how these three terms are detected and processed within the heuristics logic?
#7 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Sorry, wasn't referring to the manual when I wrote that last reply. There are only two types of breaks - scene breaks - i.e. breaks which have some sort of repeating non-alphanumeric text, and soft scene breaks/softbreaks - breaks of whitespace in between scenes.
The 'whitespace' class comes from a function that runs before scene break detection to prevent extra whitespace around headings, blockquotes, etc. from being accidentally detected/formatted as scene breaks.
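To make the two definitions concrete, here is a rough Python sketch. It is not calibre's heuristics code, just an illustration of the distinction described above. The helper `classify_break` is hypothetical, and it simplifies "repeating non-alphanumeric text" to "contains no letters or digits at all".

```python
import re

# Matches a single <p>...</p> element; the input is assumed to be one paragraph.
PARA = re.compile(r'<p[^>]*>(.*?)</p>', re.DOTALL)


def classify_break(paragraph_html: str) -> str:
    """Label one paragraph as a scene break, a soft scene break, or text.

    Hypothetical helper: a simplified stand-in for the two break types.
    """
    m = PARA.fullmatch(paragraph_html.strip())
    if m is None:
        return "text"
    # Drop inline tags and treat the &nbsp; entity as an ordinary space.
    inner = re.sub(r'<[^>]+>', '', m.group(1)).replace('&nbsp;', ' ')
    if not inner.strip():
        # Nothing but whitespace: a blank line between scenes.
        return "soft scene break"
    if not any(ch.isalnum() for ch in inner):
        # Only symbols such as "* * *": a marked scene break.
        return "scene break"
    return "text"


print(classify_break('<p class="scenebreak">∗ ∗ ∗</p>'))
print(classify_break('<p class="whitespace"> </p>'))
print(classify_break('<p class="calibre3">Some prose.</p>'))
```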
#8 |
Junior Member
Posts: 8
Karma: 10
Join Date: Jun 2011
Device: kindle touch
|
Thanks, that's definitely helped. I've now only got 57 instances of mid-sentence splits. And thanks also for the explanation.
Another related question... The book was originally a PDF which has been cropped to remove the page numbering, and then converted to epub. I check 'remove spaces between paragraphs' and enable heuristics (now with line unwrapping set to 1), but I still end up with a lot of whitespace where I don't want it. Even if I put something as simple as
<p class="whitespace"> </p>
in my regex, Calibre just will not replace it. It's always still there after the conversion.
I tried using the scenebreak replace to see which whitespace was scenebreak, and I then got 379 occurrences of <p class="whitespace"> </p> and 76 occurrences of <p class="scenebreak">∗ ∗ ∗</p>. I could then remove the 76 scenebreaks by using regex and turning scenebreak detection off. But I still cannot find a way to get rid of the remaining pesky whitespaces.
I can live with them, the book formatting is much improved, but if there is an explanation, or a way to remove them during calibre conversion, I'd love to hear it.
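Since heuristics runs after search and replace (as noted earlier in the thread), one option is to clean the markup after the conversion instead of during it. The sketch below assumes you already have the HTML of a chapter as a string, for example copied out of Sigil. The function names and the sample text are invented for illustration. It counts paragraphs per class, then strips the empty whitespace paragraphs.

```python
import re
from collections import Counter

CLASS_ATTR = re.compile(r'<p class="([^"]+)"')
# An empty spacer paragraph plus any whitespace that trails it.
EMPTY_WHITESPACE = re.compile(r'<p class="whitespace">\s*</p>\s*')


def count_paragraph_classes(html: str) -> Counter:
    """Count <p> elements per class attribute value (hypothetical helper)."""
    return Counter(CLASS_ATTR.findall(html))


def strip_whitespace_paragraphs(html: str) -> str:
    """Remove every empty class="whitespace" paragraph (hypothetical helper)."""
    return EMPTY_WHITESPACE.sub('', html)


sample = (
    '<p class="calibre3">One.</p>\n'
    '<p class="whitespace"> </p>\n'
    '<p class="scenebreak">∗ ∗ ∗</p>\n'
    '<p class="whitespace"> </p>\n'
    '<p class="calibre3">Two.</p>'
)
print(count_paragraph_classes(sample))
print(strip_whitespace_paragraphs(sample))
```

Counting first is a cheap sanity check: if the number of whitespace paragraphs is far higher than the number of real scene breaks you expect, most of them are probably page-break leftovers.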
#9 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Not sure whether it was a typo, but you don't want line unwrap set to '1', you want it set to '.1'; note the decimal point. And you only need it set that low for books where just a small percentage of lines have broken line breaks. The defaults should be good on a 'typical' book with hard breaks (e.g. PDF, many text files, some OCR sources). Setting it that low on a book where every line has hard breaks will cause poetry, etc. to be unwrapped.
It's possible that's the cause of your remaining splits. If the original source was a PDF, it's possible that the pattern used to remove the headers and footers didn't delete the surrounding tags associated with those headers and footers. That leftover garbage can trip up the later work Calibre does to try to fix the line breaks, etc. If you're not the one doing the initial conversion there's not much you can do to fix it at this point...