How do you remove class="whitespace"?

greenlees · 07-01-2011, 04:46 PM

A lot of my epub books have class="whitespace" or class="softbreak" where there once were page breaks from conversions in their past life. And I want to remove them, especially when they are mid sentence.

I seem to have no problems using regex to unwrap lines with other calibre classes. But it never works with these classes.

eg

if I test for the following regex calibre finds 223 instances:

([a-z0-9-,])(</p>)(\s)(<p class="whitespace"> </p> <p class="calibre3">)

I want to replace it with

\1\3

But the conversion never works. Nothing happens. After conversion calibre still finds 223 instances and the book looks the same.

Can any expert out there tell me what I'm missing here?

sorry if this question has been asked and answered before, but I couldn't find anything in the search.

thanks so much in advance for any help!!

theducks · 07-01-2011, 04:52 PM

Quote:

Originally Posted by greenlees

A lot of my epub books have class="whitespace" or class="softbreak" where there once were page breaks from conversions in their past life. And I want to remove them, especially when they are mid sentence.

I seem to have no problems using regex to unwrap lines with other calibre classes. But it never works with these classes.

eg

if I test for the following regex calibre finds 223 instances:

([a-z0-9-,])(</p>)(\s)(<p class="whitespace">\s?</p> <p class="calibre3">)

I want to replace it with

\1\3

But the conversion never works. Nothing happens. After conversion calibre still finds 223 instances and the book looks the same.

Can any expert out there tell me what I'm missing here?

sorry if this question has been asked and answered before, but I couldn't find anything in the search.

thanks so much in advance for any help!!

I added more flexible space detection (the \s? above, shown in red in the original post).
Personally, I use Sigil so I can see exactly what a search finds (and the results of my replace).

greenlees · 07-01-2011, 05:19 PM

thanks for your answer but it didn't work, still finds all instances, but won't replace them. I also tried

([a-z0-9-,])(</p>)(\s)(<p class="whitespace">\s?</p>\s? <p class="calibre3">)

but that didn't work either

I might investigate Sigil. It's just frustrating that Calibre does most of what I want, but I can't figure this one out.

cybmole · 07-02-2011, 01:05 AM

in my experience, whitespace / softbreak are usually there to separate scenes within chapters, i.e. they correspond to where you'd see a blank line in a paper book; so irreversibly removing all of them without previewing the book may not be a good idea.

if you go the sigil route you could redefine them in CSS rather than remove them completely.

PS: unless the book was originally a PDF, in which case a PDF-to-epub conversion may have added them between "pages".
In that case I "think" the 'remove spaces between paragraphs' option in calibre's look and feel preferences will nuke them.

ldolse · 07-02-2011, 07:54 AM

Sounds like heuristics was enabled to detect softbreaks but line unwrapping wasn't enabled or configured to catch the broken line breaks. This happens when the book is mostly formatted properly but just a small percentage of lines are broken - e.g. page breaks from an OCR conversion.

The first thing I'd try is make sure line unwrapping is enabled and reduce the line unwrap factor to where those lines on page breaks get 'unwrapped'. You might need it as low as .1 if the book is as I described above.

If there are no softbreaks in the book to be preserved then in order to not have the whitespace or softbreak classes created you can disable scene break detection completely under heuristics.

theducks' suggestion would work if you're dealing with someone else's Calibre conversion - if you're doing the conversion yourself then the culprit is having softbreak detection enabled under heuristics, as heuristics occurs after search and replace.
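The ordering problem described above can be illustrated with a toy pipeline. This is only a sketch: `search_and_replace` and `add_break_markup` are hypothetical stand-ins, not calibre functions. The only assumption, taken from the explanation above, is that the user's regex runs before heuristics adds the class.

```python
import re

# Hypothetical stand-ins for two conversion stages; this is not calibre's code.
PATTERN = re.compile(r'<p class="(whitespace|softbreak)">\s*</p>')

def search_and_replace(html):
    """Stage 1: the user's regex, applied to the markup as it exists so far."""
    return PATTERN.sub("", html)

def add_break_markup(html):
    """Stage 2 (stand-in for heuristics): tag empty paragraphs as whitespace."""
    return re.sub(r"<p>\s*</p>", '<p class="whitespace"> </p>', html)

source = "<p>first half of a</p><p> </p><p>sentence.</p>"

after_regex = search_and_replace(source)  # nothing to match yet
final = add_break_markup(after_regex)     # the class only appears now

print(len(PATTERN.findall(source)))  # matches before the regex stage
print(len(PATTERN.findall(final)))   # matches left in the output
```

In this toy model the regex stage is a no-op because the markup it targets is created by a later stage, which matches the symptom of a pattern that never gets replaced during conversion.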

Also, you should be using \s* in your regex and generally make it more generic:

Code:

([a-z0-9-,])</p>\s*<p class="(whitespace|softbreak)">\s*</p><p[^>]*>

With the above the replacement should be '\1 '.
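For anyone who wants to try the pattern outside calibre first, here is a minimal sketch using Python's `re` module (calibre's search and replace uses Python regex syntax). The pattern and replacement are the ones given above; the sample markup is made up for illustration.

```python
import re

# Pattern and replacement as given in the post above.
pattern = re.compile(
    r'([a-z0-9-,])</p>\s*<p class="(whitespace|softbreak)">\s*</p><p[^>]*>'
)
replacement = r"\1 "

# Made-up sample: a sentence split in two by a leftover page break.
sample = (
    '<p class="calibre3">the sentence was cut in</p>\n'
    '<p class="whitespace"> </p><p class="calibre3">half by a page break.</p>'
)

fixed, count = pattern.subn(replacement, sample)
print(count)  # 1
print(fixed)  # <p class="calibre3">the sentence was cut in half by a page break.</p>
```

Because the first group only matches a lowercase letter, digit, hyphen or comma, a paragraph that ends with a full stop is left alone, so breaks at real sentence ends are not joined.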

cybmole · 07-02-2011, 08:13 AM

I am curious:
the heuristics options page refers to scene breaks and also to soft scene breaks, but not to softbreaks.

can you please define and explain how these 3 terms are detected and processed within the heuristics logic?

ldolse · 07-02-2011, 03:40 PM

Sorry, wasn't referring to the manual when I wrote that last reply. There are only two types of breaks - scene breaks - i.e. breaks which have some sort of repeating non-alphanumeric text, and soft scene breaks/softbreaks - breaks of whitespace in between scenes.

The 'whitespace' class comes from a function that runs before scene break detection to prevent extra whitespace around headings, blockquotes, etc from being accidentally detected/formatted as scene breaks.

greenlees · 07-02-2011, 04:48 PM

thanks,

that's definitely helped, I've now only got 57 instances of mid-sentence splits. And thanks also for the explanation.

Another related question...

The book was originally a PDF which has been cropped to remove the page numbering, and then converted to epub.

I check the 'remove spaces between paragraphs' and enable heuristics (now with line unwrapping set to 1) but I still end up with a lot of whitespace where I don't want it.

Even if I put something as simple as

<p class="whitespace"> </p>

in my regex

Calibre just will not replace it. It's always still there after the conversion.

I tried using the scenebreak replace to see which whitespace was scenebreak and I then got

379 occurrences of <p class="whitespace"> </p>

and 76 occurrences of <p class="scenebreak">∗ ∗ ∗</p>

I could then remove the 76 scenebreaks by using regex and turning scenebreak detection off.

But I still cannot find a way to get rid of the remaining pesky whitespaces.

I can live with them, the book formatting is much improved, but if there is an explanation, or a way to remove them during calibre conversion, I'd love to hear it.
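A tally like the one above can also be taken outside calibre. The sketch below assumes the book's HTML has been extracted from the epub into a string. `count_paragraph_classes` is a hypothetical helper name and the sample markup is made up.

```python
import re
from collections import Counter

def count_paragraph_classes(html):
    """Count <p> elements per class attribute value."""
    return Counter(re.findall(r'<p class="([^"]+)"', html))

# Made-up sample markup with two whitespace paragraphs and one scene break.
sample = (
    '<p class="calibre3">Some text.</p>'
    '<p class="whitespace"> </p>'
    '<p class="scenebreak">∗ ∗ ∗</p>'
    '<p class="calibre3">More text.</p>'
    '<p class="whitespace"> </p>'
)

counts = count_paragraph_classes(sample)
print(counts["whitespace"], counts["scenebreak"])  # 2 1
```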

ldolse · 07-03-2011, 02:54 AM

Not sure whether it was a typo, but you don't want line un-wrap set to '1', you want it set to '.1', note the decimal point. And you only need it set that low for books where only a small percentage of lines have hard breaks; the defaults should be good on a 'typical' book with hard breaks (e.g. pdf, many text files, some OCR sources). Setting it that low on a book where every line has hard breaks will cause poetry, etc. to be un-wrapped.

It's possible that's the cause of your remaining splits.
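To see why '1' and '.1' behave so differently, here is a deliberately simplified toy model of a length-based unwrap factor. It is not calibre's implementation: `unwrap_threshold` and `is_unwrap_candidate` are hypothetical helpers, and the assumption is only that the factor picks a point in the sorted line lengths, with lines at least that long and not ending a sentence treated as candidates for unwrapping.

```python
def unwrap_threshold(line_lengths, factor):
    """Toy model: pick the length found at `factor` of the sorted lengths."""
    ordered = sorted(line_lengths)
    index = max(int(len(ordered) * factor) - 1, 0)
    return ordered[index]

def is_unwrap_candidate(line, threshold):
    """Toy rule: long enough, and not ending in sentence-final punctuation."""
    return len(line) >= threshold and not line.rstrip().endswith((".", "!", "?"))

# Made-up line lengths for a book that is mostly formatted properly.
lengths = [40, 120, 200, 310, 350, 400, 420, 480, 520, 600]

# A 200-character line cut mid-sentence by a page break.
broken = "x" * 199 + "a"

for factor in (0.1, 0.4, 1):
    threshold = unwrap_threshold(lengths, factor)
    print(factor, threshold, is_unwrap_candidate(broken, threshold))
```

In this model only the 0.1 setting puts the threshold low enough to catch the broken line, while a factor of 1 pushes it up to the longest line so almost nothing qualifies. The flip side, as noted above, is that a very low threshold on a book where every line has a hard break would also catch short lines such as poetry.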

If the original source was a pdf, it's possible that the pattern used to remove the headers and footers didn't delete the surrounding tags associated with those headers and footers - that leftover garbage can trip up the later work Calibre does to try and fix the line breaks, etc. If you're not the one doing the initial conversion there's not much you can do to fix it at this point...

