Old 03-19-2011, 09:43 AM   #3
theducks
Quote:
Originally Posted by JustinD View Post
I have an epub book that has the following formatting:

Since long before the coming of Gods and mortals, the great rock of Krasnegar<br class="calibre1" />
had stood amid the storms and ice of the Winter Ocean, resolute and eternal.<br class="calibre1" />
Throughout long arctic nights it glimmered under the haunted dance of aurora and<br class="calibre1" />
the rays of the cold, sad moon, while the icepack ground in useless anger around<br class="calibre1" />


So, each line is unnaturally shortened by the <br class="calibre1" />

How do I edit to remove this while keeping my actual paragraphs? Sorry for what is a simple question but I am new to this. I am hoping I could have a regex but given there doesn't seem to be anything distinguishing the end of para from line I am a bit stumped.

any thoughts?
Justin
Jellby has it correct (and I had forgotten about those terribly formatted exceptions).

TEST your REPLACE code on a few matches before using the 'Replace All' button.

Save your work before starting the NEXT whole-document replace.
If a replace goes wrong: File, 1 (reopen the book from the Recent list) and 'Discard' is your friend.


Step 1: run a COUNT search first (all HTML files) to get an idea of how bad it is.

Regex:
Code:
<br class="calibre1" />\s+<br class="calibre1" />
This shows whether they used doubled BRs for that type of paragraph break.
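
For anyone who would rather count outside the editor, here is a rough Python sketch of the same count. It assumes you have already read one HTML file's text into a string; the sample text and the function name count_double_br are made up for illustration.

```python
import re

# The doubled-BR search: two BR tags separated only by whitespace.
DOUBLE_BR = re.compile(r'<br class="calibre1" />\s+<br class="calibre1" />')

def count_double_br(html: str) -> int:
    """Return how many doubled-BR pairs appear in one file's text."""
    return len(DOUBLE_BR.findall(html))

# Hypothetical sample, not taken from the book.
sample = (
    'first line<br class="calibre1" />\n'
    'second line<br class="calibre1" />\n'
    '<br class="calibre1" />\n'
    'new paragraph<br class="calibre1" />\n'
)
print(count_double_br(sample))
```

Only the BR after "second line" is followed by another BR with nothing but whitespace between them, so that is the only pair counted.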

If the file already has a lot of </p> tags (more than a few per section split), the doubled BRs are probably just scene breaks.

Step 1.5: change the scene breaks to a scene marker (your choice).
This is the REPLACE for the search term above:
Code:
</p> <p class="scenebreak">* * *</p> <p class="whatever...">
Notes: scenebreak is the name of your CSS styling selector. The first </p> closes the previous <p> tag. The last <p class="whatever..."> should be whatever was used to start the original P tag, so that the next paragraph starts properly. Tidy will make the code pretty, so don't worry about newlines.
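
The same scene-break replace can be sketched in Python, in case you want to try it on a copy of the text first. The class names "scenebreak" and "calibre2" are placeholders (assumptions); use whatever your stylesheet and the book's own P tags actually use.

```python
import re

# The doubled-BR search used for counting.
DOUBLE_BR = re.compile(r'<br class="calibre1" />\s+<br class="calibre1" />')

# Placeholder class names: "scenebreak" is your own CSS selector,
# "calibre2" stands in for whatever class the book's P tags really use.
SCENE_MARKER = '</p> <p class="scenebreak">* * *</p> <p class="calibre2">'

def mark_scene_breaks(html: str) -> str:
    """Swap every doubled BR for a close-P, scene marker, open-P sequence."""
    return DOUBLE_BR.sub(SCENE_MARKER, html)

# Hypothetical sample, not taken from the book.
sample = (
    '<p class="calibre2">end of scene<br class="calibre1" />\n'
    '<br class="calibre1" />next scene</p>'
)
print(mark_scene_breaks(sample))
```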

Step 2: now replace the lone BR.

Note: don't try to get all cases in a single pass, and take real care to ONLY replace your current target case.
Search:
Code:
(\w)<br class="calibre1" />\s+<br class="calibre1" />(\w)
Replace:
Code:
\1</p> <p class="whatever...">\2
The \1 and \2 put back whatever was matched before and after the BR, with an end P and a start-of-next P replacing the BR.
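
Here is a small Python sketch of how the \1 and \2 groups behave, using the same search and replace. Python's re module happens to use the same backreference syntax; "calibre2" is a placeholder class name (an assumption), and both sample strings are made up.

```python
import re

# Search: a word character, the BR pair, then another word character.
PARA_BREAK = re.compile(
    r'(\w)<br class="calibre1" />\s+<br class="calibre1" />(\w)'
)

# Replace: put both captured characters back around a paragraph break.
# "calibre2" is a placeholder for the book's real paragraph class.
REPLACEMENT = r'\1</p> <p class="calibre2">\2'

def split_paragraphs(html: str) -> str:
    """Turn BR pairs that sit between two word characters into P breaks."""
    return PARA_BREAK.sub(REPLACEMENT, html)

# Hypothetical samples, not taken from the book.
plain = 'ends here<br class="calibre1" />\n<br class="calibre1" />Next starts'
with_period = 'ends here.<br class="calibre1" />\n<br class="calibre1" />Next starts'

print(split_paragraphs(plain))        # the BR pair becomes a paragraph break
print(split_paragraphs(with_period))  # unchanged: "." is not a word character
```

The second sample comes back untouched because \w does not match the period, so punctuation cases need their own pass.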

Step 3:
You may have to create additional searches to handle punctuation and quote combinations (remember to escape wildcards in the search).
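
As one possible example of such an extra search, here is a Python sketch for lines that end in sentence punctuation. The set of punctuation marks and the "calibre2" class name are my own guesses (assumptions), so adjust them to what your count searches actually turn up.

```python
import re

BR_PAIR = r'<br class="calibre1" />\s+<br class="calibre1" />'

# A bare . matches any character and a bare ? is a quantifier,
# so both are escaped here to match the literal punctuation.
PUNCT_BREAK = re.compile(r'(\.|\?|!)' + BR_PAIR + r'(\w)')

# "calibre2" is a placeholder for the book's real paragraph class.
REPLACEMENT = r'\1</p> <p class="calibre2">\2'

def split_after_punctuation(html: str) -> str:
    """Turn BR pairs that follow . ? or ! into paragraph breaks."""
    return PUNCT_BREAK.sub(REPLACEMENT, html)

# Hypothetical sample, not taken from the book.
sample = 'Really?<br class="calibre1" />\n<br class="calibre1" />Yes, really'
print(split_after_punctuation(sample))

# re.escape does the escaping for you when you are unsure.
print(re.escape('?'))
```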

Take your time to learn what works