Old 04-24-2011, 05:10 PM   #1
Faster
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
Post Regex

This thread is for those wanting to use Regex and needing some starters.
It isn't intended to teach Regex but hopefully the expressions can be usefully adapted to your needs.

To understand the purpose of this Regex you should know that I modify novels solely for our own use.
Also my missus and I are happy with the default values of styles for standard elements such as paras and headings. Only minor tweaks are needed.
Why do it? Because I like the idea of something being as trim and efficient as possible. Maybe it improves speed of loading and page turning.

Here's the CSS stylesheet which is used for all our novels:
Code:
@namespace h "http://www.w3.org/1999/xhtml";
@page {
    margin-top: 12pt;
    margin-bottom: 1pt
    }
body {
    margin-left: 1%;
    margin-right: 1%
    }
p { 
    margin: 0;
    text-indent: 1em;
    font-family: "Times New Roman",serif;
    }
h1, h2 {
    margin: 0;
    text-align: center;
    }
.italic {
    font-style: italic;
    }
.bold {
    font-weight: bold
    }
.image { 
     margin-top: 1em; 
     margin-bottom: 1em; 
     margin-left: 1.2em; 
     text-align: center; 
     max-height: 100% 
    }
blockquote {
    font-family: Arial, sans-serif;
    font-style: italic;
    }
If you want this stylesheet, copy it into Notepad. Save as "stylesheet.css".
Modify as required; however, its use of defaults makes it fine for most novels.
Possible changes are extra leading (line-height) and a bottom margin on the p element.
(We prefer a first-line indent to paragraph spacing so that we get more text on the page.)

Here's what to add to the 'p' block:
Code:
line-height: 1.2em;
margin-bottom: 0.5em;
But no text-align: justify; please! It's bad for the eyes.

After a brief look at the novel's original stylesheet, in particular to see how italics and any images have been tagged, I trash the original stylesheet and replace it with the stylesheet above.
How?
Code:
Right click on stylesheet.css - "Remove". Right click on any folder "Add Existing Items...". Locate your new 'stylesheet.css'.
Now to simplify the HTML files that will use this stylesheet.
Here's where the Regex comes in. (Regex = Regular Expression)

'Match Case' and 'Minimal matching' can be left unchecked. They don't matter here.
Carry out all operations in Code View.
If there are multiple HTML files, select 'All HTML Files' rather than 'Current File'.

The description has turned out to be somewhat long-winded, sorry, but there's a summary for copying at the end.

--------------------------------------------------------------
TO FIND AND DELETE ALMOST ALL calibre TAGS
--------------------------------------------------------------

Here's an example of some original code:
Code:
</head>
<body class="calibre" style="">
  <div class="calibre1">
    <a class="calibre3"></a><br class="calibre1" />
    <br class="calibre1" />
    <br class="calibre1" id="filepos11907" />
    <p class="calibre4"><a class="calibre9" href="../Text/Novel_split_004.html#filepos3789"><span class="calibre5 bold calibre10 calibre11 sgc-1">One</span></a></p><a class="calibre3"></a><br class="calibre1" />
    <p class="calibre12"><br /></p><a class="calibre3"></a>
    <p class="calibre12">"Can you check under KCPS-TV? Or Cavanaugh. Check <span class="italic">Cavanaugh,</span>" I repeated, louder, in that stupid way people do when they're talking to foreigners, as if saying something louder is going to make it easier to understand.</p><a class="calibre3"></a>
Here's the Find/Replace:
Code:
Find:	(<\w+) class="calibre(\d+)?"?[^>]*(>)
Repl:	<p>
After Replace All, it first becomes:-

Code:
</head>
<p>
  <p>
    <p></a><p>
    <p>
    <p>
    <p><p><p>One</span></a></p><p></a><p>
    <p><br /></p><p></a>
    <p>"Can you check under KCPS-TV? Or Cavanaugh. Check <span class="italic">Cavanaugh,</span>" I repeated, louder, in that stupid way people do when they're talking to foreigners, as if saying something louder is going to make it easier to understand.</p><p></a>
Now switch to Book View and then back to Code View.

Sigil corrects the first <p> back to <body> and removes the superfluous <p>s.
Finally it becomes:-

Code:
</head>
<body>
  <p>One</p>
  <p>"Can you check under KCPS-TV? Or Cavanaugh. Check <span class="italic">Cavanaugh,</span>" I repeated, louder, in that stupid way people do when they're talking to foreigners, as if saying something louder is going to make it easier to understand.</p>
Compare this with what we started with!
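
For anyone who wants to try the expression before letting it loose on a book, here is a rough Python sketch of the same Find/Replace. It assumes Python's re module treats this pattern the way Sigil's regex engine does, which should hold for such a simple expression but is not something stated in this thread. The function name strip_calibre_tags and the sample markup are made up for illustration. It only performs the replacement step, not the clean-up Sigil does when you switch to Book View and back.

```python
import re

# The Find expression from the post, copied unchanged.
CALIBRE_TAG = re.compile(r'(<\w+) class="calibre(\d+)?"?[^>]*(>)')


def strip_calibre_tags(html: str) -> str:
    """Replace each opening tag whose first attribute is class="calibre..." with a bare <p>."""
    return CALIBRE_TAG.sub("<p>", html)


# Made-up sample in the style of the original code shown above.
sample = (
    '<body class="calibre" style="">\n'
    '<p class="calibre12">Check <span class="italic">Cavanaugh,</span></p>'
    '<a class="calibre3"></a>\n'
    '<p><img alt="" class="calibre32" src="../Images/00002.jpg" /></p>'
)

result = strip_calibre_tags(sample)
print(result)
```

The img line comes through untouched because class is not the first attribute of that tag, so the expression never gets a chance to match it.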

The span class="italic" is preserved; however, if italics are tagged using calibre classes, you need to do a Find/Replace before the one above.

Here's how:
In Book View select some text in italics. Go to Code View and note the tag applied.
Example: some text <span class="italic calibre7"> some text </span> some text
In THIS case use:

Code:
Find:	<span class="italic calibre7">
Repl:	<span class="italic">
or just
Find:	italic calibre7
Repl:	italic
The calibre label in the following line will not be deleted:
Code:
<p><img alt="" class="calibre32" src="../Images/00002.jpg" /></p>
Either add details of 'calibre32' to the stylesheet if it's important or do a Find/Replace to delete all the occurrences of class="calibre32"
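
The two small clean-ups above can be sketched in Python as follows. This is only an illustration under assumptions: the helper names simplify_italic and drop_class are invented here, and the class names are the ones from the examples in this post, so substitute whatever your own book uses.

```python
import re


def simplify_italic(html: str) -> str:
    """Plain-text version of the shorter Find/Replace: 'italic calibre7' becomes 'italic'."""
    return html.replace("italic calibre7", "italic")


def drop_class(html: str, class_name: str) -> str:
    """Delete every class attribute with exactly this value, plus the whitespace before it."""
    return re.sub(r'\s+class="' + re.escape(class_name) + r'"', "", html)


italic_line = 'some text <span class="italic calibre7"> some text </span> some text'
image_line = '<p><img alt="" class="calibre32" src="../Images/00002.jpg" /></p>'

print(simplify_italic(italic_line))
print(drop_class(image_line, "calibre32"))
```

In Sigil itself the second helper corresponds to a Find for the class attribute with an empty Replace box.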

--------------------------------------------------------------
INSERT HEADING TAGS PLUS CHAPTER SPLITS IF REQUIRED
--------------------------------------------------------------

If the word 'CHAPTER' or 'Chapter' is present

Note: if 'class' and 'span' are still present they will be removed.

Here are some examples of original code:
Code:
	<p>Chapter 5</p>
or	<h3>Chapter five</h3>
or	<p class="calibre4"><span class="calibre7"><span class="calibre15">CHAPTER FIVE</span></span></p>
(A) if already split into chapters select 'All HTML Files'
Here's what happens to the first example:-

Code:
	<h2>Chapter 5</h2>
(B) if currently a single html file and is to be split later by pressing F6
Here's what happens to the second example:-

Code:
	<hr class="sigilChapterBreak" /><h2>Chapter five</h2>
Here's the Find/Replace:
Code:
Find:	<(p|h\d)[^>]*>(<span[^>]*>)*((Chapt|CHAPT|chapt)[^</]*)(</span>)*<(/p|/h\d)>

Repl(A):	<h2>\3</h2>
 
Repl(B):	<hr class="sigilChapterBreak" /><h2>\3</h2>
Note: You could uncheck 'Match Case' and simplify (Chapt|CHAPT|chapt) to (Chapt) - but I'd never remember to do that. It's easier to Copy the longer expression and Paste into the Find box.
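
Here is a hedged Python sketch that runs the Find above against the three example lines, using both replacements. It assumes Python's re module behaves like Sigil's engine for this pattern. The variable names are invented for the sketch, and the second pattern is simply the shortened (Chapt) form with case ignored, as the note suggests.

```python
import re

# Find expression from the post, copied unchanged (case-sensitive).
CHAPTER = re.compile(
    r'<(p|h\d)[^>]*>(<span[^>]*>)*((Chapt|CHAPT|chapt)[^</]*)(</span>)*<(/p|/h\d)>'
)
# Shortened form from the note; IGNORECASE stands in for ignoring case in Sigil.
CHAPTER_SHORT = re.compile(
    r'<(p|h\d)[^>]*>(<span[^>]*>)*((Chapt)[^</]*)(</span>)*<(/p|/h\d)>',
    re.IGNORECASE,
)

REPL_A = r'<h2>\3</h2>'
REPL_B = r'<hr class="sigilChapterBreak" /><h2>\3</h2>'

examples = [
    '<p>Chapter 5</p>',
    '<h3>Chapter five</h3>',
    '<p class="calibre4"><span class="calibre7">'
    '<span class="calibre15">CHAPTER FIVE</span></span></p>',
]

converted_a = [CHAPTER.sub(REPL_A, line) for line in examples]
converted_b = [CHAPTER.sub(REPL_B, line) for line in examples]
converted_short = [CHAPTER_SHORT.sub(REPL_A, line) for line in examples]

for line in converted_a + converted_b:
    print(line)
```

Group 3 holds the heading text only, which is why the classes and spans disappear in the result.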

To retain all the classes and spans use the following Find and Replace:
Code:
Find:	<p([^>]*>(<span[^>]*>)*(Chapt|CHAPT|chapt)[^</]*(</span>)*</)p>

Repl(A):	<h2\1h2>

Repl(B):	<hr class="sigilChapterBreak" /><h2\1h2>
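
To see how the retain version keeps everything between the tag names, here is a small Python sketch. It assumes Python's re handles the pattern and the \1 back-reference the same way Sigil does. The sample line and the names KEEP_ALL and kept are made up for illustration.

```python
import re

# Find expression from the post: group 1 captures everything between "<p" and the final "p>".
KEEP_ALL = re.compile(
    r'<p([^>]*>(<span[^>]*>)*(Chapt|CHAPT|chapt)[^</]*(</span>)*</)p>'
)

source = '<p class="calibre4"><span class="calibre7">CHAPTER FIVE</span></p>'

# Repl(A): only the tag name changes; attributes and spans ride along inside group 1.
kept = KEEP_ALL.sub(r'<h2\1h2>', source)
print(kept)
```

The trick is that only the two tag names sit outside the captured group, so swapping p for h2 leaves the rest of the line exactly as it was.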
The CHAPTER is labelled by DIGITS only

Code:
	<p>2</p> 
becomes:-
	<h2>Chapter 2</h2>
Here's the Find/Replace: (It's long because it's dual purpose - see below)
Code:
Find:	<(p|h\d)[^>]*>(<span[^>]*>)*((\d+)[^</]*)(</span>)*<(/p|/h\d)>

Repl(A):	<h2>Chapter \3</h2>
Here's another example: chapter 5 as it may appear in code, still with calibre tags, which the process will remove.

Code:
	<p class="calibre4"><span class="calibre7">5</span></p>
becomes:-
Code:
	<hr class="sigilChapterBreak" /><h2>Chapter 5</h2>
Note the addition of the word 'Chapter ' in the Replacement - remove it if not wanted.
Use the same Find code as above:
Code:
Find:	<(p|h\d)[^>]*>(<span[^>]*>)*((\d+)[^</]*)(</span>)*<(/p|/h\d)>

Repl(B):	<hr class="sigilChapterBreak" /><h2>Chapter \3</h2>
Note: the Replacement includes the code for sigilChapterBreak. Remove <hr class="sigilChapterBreak" /> if the HTMLs are already split.
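
A quick Python sketch of the digits version follows, again assuming Python's re matches this pattern the way Sigil does. The names and the third sample line are invented. That third line is there as a caution: any paragraph that merely starts with a digit and contains no inner tags is matched too.

```python
import re

# Find expression from the post, copied unchanged.
DIGITS = re.compile(
    r'<(p|h\d)[^>]*>(<span[^>]*>)*((\d+)[^</]*)(</span>)*<(/p|/h\d)>'
)

REPL_A = r'<h2>Chapter \3</h2>'
REPL_B = r'<hr class="sigilChapterBreak" /><h2>Chapter \3</h2>'

plain = DIGITS.sub(REPL_A, '<p>2</p>')
tagged = DIGITS.sub(REPL_B, '<p class="calibre4"><span class="calibre7">5</span></p>')

# Caution: ordinary text that begins with a number is converted as well.
false_hit = DIGITS.sub(REPL_A, '<p>1984 was a bad year.</p>')

print(plain)
print(tagged)
print(false_hit)
```

Stepping through with Find Next and Replace rather than Replace All is one way to avoid the false hits.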

To retain all the classes and spans use the following Find and Replace:
Code:
Find:	<p([^>]*>(<span[^>]*>)*(\d+)[^</]*(</span>)*</)p>

Repl(A):	<h2\1h2>

Repl(B):	<hr class="sigilChapterBreak" /><h2\1h2>

The CHAPTER heading shown by NUMBERS in WORDs only

This finds a single word or hyphenated words with no spaces in p... /p or hx.../hx tags
eg Two, Thirty, Forty-five, Sixty-Nine - and puts it in Heading 2 tags

Here are some examples of original code:
Code:
	<p>Twenty-one</p>
	<p class="calibre2"><span class="calibre3"><span class="calibre4">Twenty-one</span></span></p>
become:-
Code:
	<h2>Twenty-one</h2>
and

Code:
	<h2 class="chapterNumber" id="heading_id_2" style="text-indent: 0%;"><span class="bold">Thirty-four</span></h2>
becomes:-
Code:
	<h2>Thirty-four</h2>
Use this Find/Replace:
Code:
Find:	<(p|h\d)[^>]*>(<span[^>]*>)*([A-Za-z]+[\-]?([a-z]*)?)(</span>)*<(/p|/h\d)>

Repl(A):	<h2>\3</h2>

Repl(B):	<hr class="sigilChapterBreak" /><h2>\3</h2>
Of course any other single word paragraph will also be selected - so use with discretion.
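
The behaviour described above can be checked with this Python sketch. Assumptions: Python's re stands in for Sigil's engine, and re.IGNORECASE stands in for leaving 'Match Case' unchecked. The letter class is written [A-Za-z] here, because a range written [A-z] would also accept the punctuation characters that sit between Z and a in ASCII, such as the underscore. All names in the sketch are invented.

```python
import re

WORDS = re.compile(
    r'<(p|h\d)[^>]*>(<span[^>]*>)*([A-Za-z]+[\-]?([a-z]*)?)(</span>)*<(/p|/h\d)>',
    re.IGNORECASE,  # mirrors 'Match Case' left unchecked
)
REPL_A = r'<h2>\3</h2>'

samples = [
    '<p>Twenty-one</p>',
    '<h2 class="chapterNumber" id="heading_id_2" style="text-indent: 0%;">'
    '<span class="bold">Thirty-four</span></h2>',
    '<p>Sixty-Nine</p>',
    '<p>Yes</p>',          # any single-word paragraph is caught as well
    '<p>Two words</p>',    # a space stops the match, so this is left alone
]
converted = [WORDS.sub(REPL_A, line) for line in samples]
for line in converted:
    print(line)

# Why [A-Za-z] rather than [A-z]: the wider range lets an underscore through.
underscore_in_wide_range = re.fullmatch(r'[A-z]+', 'a_b') is not None
underscore_in_letter_class = re.fullmatch(r'[A-Za-z]+', 'a_b') is not None
```

The 'Yes' line shows why discretion is needed: a one-word line of dialogue ends up as a heading.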

To retain all the classes and spans use the following Find and Replace:
Code:
Find:	<p([^>]*>(<span[^>]*>)*[A-Za-z]+[\-]?([a-z]*)?(</span>)*</)p>

Repl(A):	<h2\1h2>

Repl(B):	<hr class="sigilChapterBreak" /><h2\1h2>
-------------------------------------------------------------
INCLUDING THE NEW STYLESHEET
-------------------------------------------------------------
This removes the style descriptions from <head>...</head> and replaces them with the link to the stylesheet. (Remember 'All HTML Files'!)

Code:
Find:	(<style)[^</style>].*(</style>)

Repl:	<link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
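
A Python sketch of this step is below, with two assumptions flagged: re.DOTALL is used on the assumption that the dot in Sigil's engine is allowed to run across line breaks, and the second pattern (STYLE_ALT) is an alternative suggested here, not part of the post. It is included because [^</style>] is a character class that matches one single character which is none of < / s t y l e >, so the posted expression needs at least one attribute after <style and does not match a bare <style> tag.

```python
import re

LINK = '<link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />'

# Find expression from the post, copied unchanged.
STYLE_POSTED = re.compile(r'(<style)[^</style>].*(</style>)', re.DOTALL)
# Hypothetical alternative: any attributes (or none), then the shortest run up to </style>.
STYLE_ALT = re.compile(r'<style[^>]*>.*?</style>', re.DOTALL)

typed = '<head>\n<style type="text/css">\np { margin: 0 }\n</style>\n</head>'
bare = '<head><style>p { margin: 0 }</style></head>'

typed_posted = STYLE_POSTED.sub(LINK, typed)
bare_posted = STYLE_POSTED.sub(LINK, bare)   # left unchanged: ">" follows "<style"
typed_alt = STYLE_ALT.sub(LINK, typed)
bare_alt = STYLE_ALT.sub(LINK, bare)

print(typed_posted)
print(bare_posted)
print(bare_alt)
```

For the usual calibre output, where the tag reads <style type="text/css">, both patterns give the same result.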
--------------------------------------------------------------
Here it is again briefly to put in Notepad for Copy/Pasting:
--------------------------------------------------------------
Code:
Delete all Calibre tags
Find:	(<\w+) class="calibre(\d+)?"?[^>]*(>)
Repl:	<p>
Then:	Book View then Code View
--------------------------------------------------------------
Wrap in <h2> tags when 'CHAPTER' or 'Chapter' word present:
Find:	<(p|h\d)[^>]*>(<span[^>]*>)*((Chapt|CHAPT|chapt)[^</]*)(</span>)*<(/p|/h\d)>
(A) Repl:	<h2>\3</h2>
(B) Repl:	<hr class="sigilChapterBreak" /><h2>\3</h2>
Retain cr@p:
Find:	<p([^>]*>(<span[^>]*>)*(Chapt|CHAPT|chapt)[^</]*(</span>)*</)p>
Repl(A):	<h2\1h2>
Repl(B):	<hr class="sigilChapterBreak" /><h2\1h2>
--------------------------------------------------------------
Wrap in <h2> tags when DIGITS only
Find:	<(p|h\d)[^>]*>(<span[^>]*>)*((\d+)[^</]*)(</span>)*<(/p|/h\d)>
(A) Repl:	<h2>Chapter \3</h2>
(B) Repl:	<hr class="sigilChapterBreak" /><h2>Chapter \3</h2>
Retain cr@p:
Find:	<p([^>]*>(<span[^>]*>)*(\d+)[^</]*(</span>)*</)p>
Repl(A):	<h2\1h2>
Repl(B):	<hr class="sigilChapterBreak" /><h2\1h2>
--------------------------------------------------------------
Wrap in <h2> tags when NUMBERS in WORDs
Find:	<(p|h\d)[^>]*>(<span[^>]*>)*([A-Za-z]+[\-]?([a-z]*)?)(</span>)*<(/p|/h\d)>
(A) Repl:	<h2>\3</h2>
(B) Repl:	<hr class="sigilChapterBreak" /><h2>\3</h2>
Retain cr@p:
Find:	<p([^>]*>(<span[^>]*>)*[A-Za-z]+[\-]?([a-z]*)?(</span>)*</)p>
Repl(A):	<h2\1h2>
Repl(B):	<hr class="sigilChapterBreak" /><h2\1h2>
--------------------------------------------------------------
Include Stylesheet
Find:	(<style)[^</style>].*(</style>)
Repl:	<link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
--------------------------------------------------------------
Well thanks for reading - if you got this far!!!
Your best approach is to grab a copy of an epub with lots of 'calibre' tags and have fun trying out the expressions.
Remember to use the down arrows at the right-hand side of the Find and Replace boxes to recall recently used expressions.
Old 04-24-2011, 05:46 PM   #2
Pablo
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
Thanks for sharing this!!!
Old 04-24-2011, 09:08 PM   #3
Jabby
Jr. - Junior Member
Posts: 586
Karma: 2000358
Join Date: Aug 2010
Location: Alabama
Device: Archos, Asus, HP, Lenovo, Nexus and Samsung tablets in 7,8 and 10"
I am new to regex, limping along at the most basic level. This will be both useful and instructive.

Many thanks - John