Old 07-01-2011, 04:46 PM   #1
greenlees
Junior Member
Posts: 8
Karma: 10
Join Date: Jun 2011
Device: kindle touch
How do you remove class="whitespace"?

A lot of my EPUB books have class="whitespace" or class="softbreak" paragraphs where page breaks used to be, left over from earlier conversions. I want to remove them, especially when they fall mid-sentence.

I have no problems using regex to unwrap lines with other calibre classes, but it never works with these classes.

E.g., if I test the following regex, calibre finds 223 instances:

([a-z0-9-,])(</p>)(\s)(<p class="whitespace"> </p>
<p class="calibre3">)

I want to replace it with

\1\3

But the replacement never takes effect. After conversion calibre still finds 223 instances and the book looks the same.

Can any expert out there tell me what I'm missing here?

Sorry if this question has been asked and answered before, but I couldn't find anything in the search.

Thanks so much in advance for any help!
Old 07-01-2011, 04:52 PM   #2
theducks
Well trained by Cats
Posts: 29,800
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by greenlees
([a-z0-9-,])(</p>)(\s)(<p class="whitespace">\s?</p>
<p class="calibre3">)
I added more flexible space detection: the \s? inside the whitespace paragraph (marked in red).
Personally, I use Sigil so I can see exactly what a search finds (and the results of my replace).
Old 07-01-2011, 05:19 PM   #3
greenlees
Thanks for your answer, but it didn't work: calibre still finds all the instances but won't replace them. I also tried

([a-z0-9-,])(</p>)(\s)(<p class="whitespace">\s?</p>\s?
<p class="calibre3">)

but that didn't work either.

I might investigate Sigil. It's just frustrating that Calibre does most of what I want, but I can't figure this one out.
Old 07-02-2011, 01:05 AM   #4
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
In my experience, whitespace / softbreak paragraphs are usually there to separate scenes within chapters, i.e. they correspond to where you'd see a blank line in a paper book. So irreversibly removing all of them without previewing the book may not be a good idea.

If you go the Sigil route you could redefine them in CSS rather than remove them completely.

P.S. The exception is a book that was originally a PDF, in which case the PDF-to-EPUB conversion may have added them between "pages".
In that case I think the "remove spaces between paragraphs" option in calibre's look and feel preferences will nuke them.

Last edited by cybmole; 07-02-2011 at 06:01 AM.
Old 07-02-2011, 07:54 AM   #5
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Sounds like heuristics was enabled to detect softbreaks, but line unwrapping wasn't enabled or wasn't configured to catch the broken line breaks. This happens when the book is mostly formatted properly but a small percentage of lines are broken, e.g. at page breaks from an OCR conversion.

The first thing I'd try is to make sure line unwrapping is enabled and reduce the line unwrap factor until those lines on page breaks get 'unwrapped'. You might need it as low as .1 if the book is as I described above.

If there are no softbreaks in the book to be preserved, you can disable scene break detection completely under heuristics so that the whitespace and softbreak classes are never created.

theducks' suggestion would work if you're dealing with someone else's Calibre conversion. If you're doing the conversion yourself, then the culprit is having softbreak detection enabled under heuristics, because heuristics runs after search and replace.
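To see why the ordering matters, here is a toy Python model of the two stages. It is only a sketch: `search_and_replace` and `heuristics` are hypothetical stand-ins, not calibre's actual code, and the assumption that heuristics tags blank paragraphs with class="whitespace" is a simplification for illustration.

```python
import re

def search_and_replace(html):
    # Stand-in for a user rule that deletes whitespace paragraphs.
    return re.sub(r'<p class="whitespace">\s*</p>\s*', '', html)

def heuristics(html):
    # Hypothetical stand-in: tag blank paragraphs with the whitespace class.
    return re.sub(r'<p>\s*</p>', '<p class="whitespace"> </p>', html)

source = '<p>the end of a</p>\n<p> </p>\n<p>sentence.</p>'

# Order described in this thread: search and replace first, heuristics second.
converted = heuristics(search_and_replace(source))
print(converted.count('class="whitespace"'))

# Reversed order, shown only for comparison.
reversed_order = search_and_replace(heuristics(source))
print(reversed_order.count('class="whitespace"'))
```

In this model the rule has nothing to match when it runs, because the class is only created afterwards, so the whitespace paragraph survives the conversion. Running the stages in the opposite order removes it.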

Also, you should be using \s* in your regex and generally make it more generic:
Code:
([a-z0-9-,])</p>\s*<p class="(whitespace|softbreak)">\s*</p>\s*<p[^>]*>
With the above, the replacement should be '\1 ' (the captured character followed by a space).
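For anyone who wants to try the pattern outside calibre first, this is a minimal Python sketch of the same idea using the standard `re` module. The sample markup and the function name `join_split_sentences` are made up for illustration. The pattern allows optional whitespace (`\s*`) between every pair of tags, since the markup quoted in this thread has a line break before the following paragraph.

```python
import re

# Same character set as above, with the hyphen moved last for clarity.
PATTERN = re.compile(
    r'([a-z0-9,-])</p>\s*<p class="(?:whitespace|softbreak)">\s*</p>\s*<p[^>]*>'
)

def join_split_sentences(html):
    # Keep the last character of the first paragraph, add one space, and
    # drop the closing tag, the spacer paragraph and the next opening tag.
    return PATTERN.sub(r'\1 ', html)

sample = (
    '<p class="calibre3">He walked to the</p>\n'
    '<p class="whitespace"> </p>\n'
    '<p class="calibre3">door and opened it.</p>\n'
    '<p class="calibre3">The scene ends.</p>\n'
    '<p class="softbreak"> </p>\n'
    '<p class="calibre3">A new scene begins.</p>'
)

result = join_split_sentences(sample)
print(result)
```

Because the pattern only matches when the paragraph before the spacer ends in a lowercase letter, digit, comma or hyphen, the mid-sentence split is joined while the softbreak after "The scene ends." is left alone.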

Last edited by ldolse; 07-02-2011 at 08:07 AM.
Old 07-02-2011, 08:13 AM   #6
cybmole
I am curious:
the heuristics options page refers to scene breaks and also to soft scene breaks, but not to softbreaks.

Can you please define these three terms and explain how they are detected and processed within the heuristics logic?
Old 07-02-2011, 03:40 PM   #7
ldolse
Sorry, wasn't referring to the manual when I wrote that last reply. There are only two types of breaks - scene breaks - i.e. breaks which have some sort of repeating non-alphanumeric text, and soft scene breaks/softbreaks - breaks of whitespace in between scenes.

The 'whitespace' class comes from a function that runs before scene break detection to prevent extra whitespace around headings, blockquotes, etc from being accidentally detected/formatted as scene breaks.
Old 07-02-2011, 04:48 PM   #8
greenlees
Thanks,

that's definitely helped; I've now only got 57 instances of mid-sentence splits. And thanks also for the explanation.

Another related question...

The book was originally a PDF which was cropped to remove the page numbering and then converted to EPUB.

I check 'remove spaces between paragraphs' and enable heuristics (now with line unwrapping set to 1), but I still end up with a lot of whitespace where I don't want it.

Even if I put something as simple as

<p class="whitespace"> </p>

in my regex

Calibre just will not replace it. It's always still there after the conversion.

I tried using the scenebreak replace to see which whitespace was scenebreak and I then got

379 occurrences of <p class="whitespace"> </p>

and 76 occurrences of <p class="scenebreak">∗ ∗ ∗</p>

I could then remove the 76 scenebreaks by using regex and turning scenebreak detection off.

But I still cannot find a way to get rid of the remaining pesky whitespaces.

I can live with them, the book formatting is much improved, but if there is an explanation, or a way to remove them during calibre conversion, I'd love to hear it.
Old 07-03-2011, 02:54 AM   #9
ldolse
Not sure whether it was a typo, but you don't want line un-wrap set to '1', you want it set to '.1' (note the decimal point). And you only need it that low for books where a small percentage of lines need unwrapping; the defaults should be good on a 'typical' book with hard breaks (e.g. PDF, many text files, some OCR sources). Setting it that low on a book where every line has a hard break will cause poetry, etc. to be un-wrapped.

It's possible that's the cause of your remaining splits.
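A rough way to picture the effect of the factor is the toy Python model below. It is a simplified sketch, not calibre's implementation: `unwrap_threshold` and `lines_to_unwrap` are hypothetical names, and treating the factor as a quantile of the line lengths is an assumption made only for this example.

```python
def unwrap_threshold(line_lengths, factor):
    # Assumed model: the factor picks a quantile of the line lengths.
    ordered = sorted(line_lengths)
    index = min(int(len(ordered) * factor), len(ordered) - 1)
    return ordered[index]

def lines_to_unwrap(lines, factor):
    # A line is a candidate when it is at least as long as the threshold
    # and does not end in sentence-ending punctuation.
    threshold = unwrap_threshold([len(line) for line in lines], factor)
    endings = ('.', '!', '?', '"')
    return [line for line in lines
            if len(line) >= threshold and not line.rstrip().endswith(endings)]

# Mostly well-formed text with a single line broken mid-sentence.
lines = ["Yes.", "No.", "Why not?"]
lines += ["x" * 80 + "." for _ in range(6)]
lines += ["He walked to the"]

print(lines_to_unwrap(lines, 0.4))
print(lines_to_unwrap(lines, 0.1))
```

In this model the broken line is too short to qualify at 0.4 but is picked up at 0.1. The same lowered threshold would also catch short unpunctuated lines such as poetry, which matches the warning above.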

If the original source was a PDF, it's possible that the pattern used to remove the headers and footers didn't delete the surrounding tags associated with those headers and footers. That leftover garbage can trip up the later work Calibre does to try and fix the line breaks, etc. If you're not the one doing the initial conversion there's not much you can do to fix it at this point...