Using search&replace for blank lines.

mehetabelo · 04-14-2017, 08:04 PM

There's an issue I've run into that I can't understand. It *may* be a bug, but it may be something else.

I am trying to replace blank lines in a document using the Search & Replace function. I did some research in the forums and online and found a function that will work for what I want, with a few minor modifications on my part. I even used the wizard to make sure it was working. The code is as follows:

Code:

<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>
or
<(p[^>]*|span[^>]*)>(\s|&nbsp;|</?\s?br\s?/?>)*</?(p|span)>

When I have it input and then do a 'test', the code it should replace is:

Code:

<p class="calibre5"> </p>
or 
<p class="calibre5"></p>
or
<p class="calibre5"><span class="calibre4"> </span></p>

They both work and find numerous matches in the particular document I'm looking at. The first one caught metadata tags, so I abandoned it for the second, just throwing it in there for thoughts.

Of note, just in case, Heuristic Processing is *not* on. With Heuristic Processing on, it works fine for the first 2 examples. However, if there's a span tag (like the third example), then it doesn't strip out the 'blank' line.

Anyway, the regex should get rid of all of the above. In the case of the last one, it should, at minimum remove the span tag and then I could rerun it, or set a second scan to remove the empty P tag. (I've tried it both ways).
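A minimal sketch of the "rerun it" idea, assuming Python's `re` module as a stand-in for the Search & Replace engine. The helper name `strip_blank_blocks` is hypothetical, and `\u00a0` is used in place of the `&nbsp;` alternative. It reapplies the second pattern until nothing changes, so an empty span is removed on one pass and the now-empty paragraph on the next:

```python
import re

# The poster's second pattern, with \u00a0 standing in for the &nbsp; alternative.
BLANK = re.compile(r'<(p[^>]*|span[^>]*)>(\s|\u00a0|</?\s?br\s?/?>)*</?(p|span)>')

def strip_blank_blocks(markup):
    """Hypothetical helper: reapply the pattern until the text stops changing."""
    previous = None
    while previous != markup:
        previous = markup
        markup = BLANK.sub('', markup)
    return markup

samples = [
    '<p class="calibre5"> </p>',
    '<p class="calibre5"></p>',
    '<p class="calibre5"><span class="calibre4"> </span></p>',
]
cleaned = [strip_blank_blocks(s) for s in samples]
```

A single pass over the third sample only removes the inner span, which is why the loop (or a second Search & Replace entry) is needed.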

The replace doesn't seem to work at all. Despite the fact that it matches them when tested, when I look at the code of the new epub file, there's no change to the tags; they all remain the same. Is there something I'm doing wrong? If needed I can provide an example epub. It was initially downloaded with Fanficfare, and it has already been converted epub-to-epub once; that's why it has the calibre tags.

I've made a test epub by stripping it down to almost nothing in the chapters, just enough to test the regex. I can provide it, if needed.

kovidgoyal · 04-14-2017, 11:03 PM

A test file is always helpful.

mehetabelo · 04-15-2017, 12:04 AM

I uploaded it to zippyshare if that's acceptable.

Zippyshare

I included both the current epub and the original, the one I stripped down prior to the test run (on this particular file). So the .epub is the one I ran with the regex previously mentioned.

kovidgoyal · 04-15-2017, 12:14 AM

Your problems are almost certainly caused by the non-breaking space -- you cannot match it with &nbsp; as the processing pipeline converts it to the unicode character. Use \u00a0 instead.
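The effect can be sketched in plain Python. This is only an illustration: `html.unescape` is used as a stand-in for whatever entity decoding the conversion pipeline performs, which is an assumption and not calibre's actual code. Once the entity has been decoded, a pattern containing the entity text finds nothing, while `\u00a0` matches:

```python
import html
import re

raw = '<p class="calibre5"><span class="calibre4">&nbsp;</span></p>'

# Stand-in for the pipeline step: entities become real unicode characters.
decoded = html.unescape(raw)

# Pattern written against the entity text, as in the original attempt.
entity_pattern = re.compile(r'<span[^>]*>&nbsp;</span>')
# Pattern written against the decoded character.
char_pattern = re.compile(r'<span[^>]*>\u00a0</span>')

entity_hits = entity_pattern.findall(decoded)
char_hits = char_pattern.findall(decoded)
```

Here `entity_hits` is empty and `char_hits` contains the span, because the text being searched no longer holds the six characters of the entity.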

mehetabelo · 04-15-2017, 11:09 PM

I just tried it with both:

Code:

<(p[^>]*|span[^>]*)>(\s|\u00a0|</?\s?br\s?/?>)*</?(p|span)>
then
<(p[^>]*|span[^>]*)>(\s|</?\s?br\s?/?>)*</?(p|span)>

The second was to remove &nbsp; completely. Neither one worked. The empty tags still remain in the epub after conversion.

kovidgoyal · 04-16-2017, 03:17 AM

Works for me with
<p[^>]*><span[^>]*>.</span></p>
or to match only a nbsp
<p[^>]*><span[^>]*> </span></p>
where there is a literal nbsp between the span tags (don't copy-paste the expression above, as MR has trouble with literal nbsp characters)

kovidgoyal · 04-16-2017, 03:34 AM

Oh and if you want to use \s to match nbsp characters, use

Code:

(?u)<p[^>]*><span[^>]*>\s</span></p>
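A short sketch of what the flag changes, assuming the engine behaves like Python's `re` module. In Python 3, string patterns are already Unicode-aware, so `(?u)` is redundant there; the ASCII-only behaviour is reproduced with `(?a)` purely for contrast:

```python
import re

# A paragraph whose span holds a single non-breaking space (U+00A0).
line = '<p class="calibre5"><span class="calibre4">\u00a0</span></p>'

# Unicode-aware matching: \s covers U+00A0.
unicode_pattern = re.compile(r'(?u)<p[^>]*><span[^>]*>\s</span></p>')
# ASCII-only matching: \s is limited to space, tab, newline, and similar.
ascii_pattern = re.compile(r'(?a)<p[^>]*><span[^>]*>\s</span></p>')

unicode_match = unicode_pattern.search(line) is not None
ascii_match = ascii_pattern.search(line) is not None
```

With the Unicode flag the paragraph matches; with ASCII-only matching it does not, which is consistent with a bare `\s` missing the nbsp.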

mehetabelo · 04-16-2017, 05:58 PM

That worked well... I made a few adaptions, but it was close enough to get me where I wanted to be. I wonder why the initial regex didn't work, even though it matched when I checked it?

Anyway, I know you have a busy schedule. I didn't actually expect you to be the one to answer the questions the whole time. I truly appreciate the time you spend helping, and the enormous amount of time you've spent working on the program. It is an amazing piece of work and is software I literally use daily.



