04-24-2011, 05:10 PM | #1 |
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
|
Regex
This thread is for those wanting to use Regex and needing some starters.
It isn't intended to teach Regex but hopefully the expressions can be usefully adapted to your needs. To understand the purpose of this Regex you should know that I modify novels solely for our own use. Also my missus and I are happy with the default values of styles for standard elements such as paras and headings. Only minor tweaks are needed. Why do it? Because I like the idea of something being as trim and efficient as possible. Maybe it improves speed of loading and page turning. Here's the CSS stylesheet which is used for all our novels: Code:
@namespace h "http://www.w3.org/1999/xhtml";
@page { margin-top: 12pt; margin-bottom: 1pt }
body { margin-left: 1%; margin-right: 1% }
p { margin: 0; text-indent: 1em; font-family: "Times New Roman",serif; }
h1, h2 { margin: 0; text-align: center; }
.italic { font-style: italic; }
.bold { font-weight: bold }
.image { margin-top: 1em; margin-bottom: 1em; margin-left: 1.2em; text-align: center; max-height: 100% }
blockquote { font-family: Arial, sans-serif; font-style: italic; }
Modify as required; however, its use of defaults makes it fine for most novels. Possible changes are leading (line-height) and a bottom margin in the p element. (We prefer a first-line indent to paragraph spacing so that we get more text on the page.) Here's what to add to the 'p' block: Code:
line-height: 1.2em; margin-bottom: 0.5em;
After a brief look at the original stylesheet of the novel, in particular at how italics and images, if any, have been tagged, the original stylesheet is trashed and replaced with the stylesheet above. How? Code:
Right click on stylesheet.css - "Remove".
Right click on any folder - "Add Existing Items...".
Locate your new 'stylesheet.css'.
Here's where the Regex comes in. (Regex = Regular Expression)
'Match Case' and 'Minimal matching' can be left unchecked. They don't matter here.
Carry out all operations in Code View.
If there are multiple HTMLs select 'All HTML Files' rather than 'Current File'.
The description has turned out to be somewhat long-winded, sorry, but there's a summary for copying at the end.
--------------------------------------------------------------
TO FIND AND DELETE ALMOST ALL calibre TAGS
--------------------------------------------------------------
Here's an example of some original code: Code:
</head>
<body class="calibre" style="">
<div class="calibre1">
<a class="calibre3"></a><br class="calibre1" />
<br class="calibre1" />
<br class="calibre1" id="filepos11907" />
<p class="calibre4"><a class="calibre9" href="../Text/Novel_split_004.html#filepos3789"><span class="calibre5 bold calibre10 calibre11 sgc-1">One</span></a></p><a class="calibre3"></a><br class="calibre1" />
<p class="calibre12"><br /></p><a class="calibre3"></a>
<p class="calibre12">"Can you check under KCPS-TV? Or Cavanaugh. Check <span class="italic">Cavanaugh,</span>" I repeated, louder, in that stupid way people do when they're talking to foreigners, as if saying something louder is going to make it easier to understand.</p><a class="calibre3"></a>
Code:
Find: (<\w+) class="calibre(\d+)?"?[^>]*(>)
Repl: <p>
Code:
</head> <p> <p> <p></a><p> <p> <p> <p><p><p>One</span></a></p><p></a><p> <p> <p> <p>"Can you check under KCPS-TV? Or Cavanaugh. Check <span class="italic">Cavanaugh,</span>" I repeated, louder, in that stupid way people do when they're talking to foreigners, as if saying something louder is going to make it easier to understand.</p><p></a>
When you switch to Book View and back to Code View, Sigil corrects <p> back to <body> and removes superfluous <p>s. Finally it becomes:- Code:
</head>
<body>
<p>One</p>
<p>"Can you check under KCPS-TV? Or Cavanaugh. Check <span class="italic">Cavanaugh,</span>" I repeated, louder, in that stupid way people do when they're talking to foreigners, as if saying something louder is going to make it easier to understand.</p>
The span class="italic" is preserved; however, if italic style is tagged using calibre tags you would need to do a Find/Replace prior to the above. Here's how:
In Book View select some text in italics. Go to Code View and note the tag applied.
Example: some text <span class="italic calibre7"> some text </span> some text
In THIS case use: Code:
Find: <span class="italic calibre7">
Repl: <span class="italic">
or just
Find: italic calibre7
Repl: italic
Code:
<p><img alt="" class="calibre32" src="../Images/00002.jpg" /></p>
--------------------------------------------------------------
INSERT HEADING TAGS PLUS CHAPTER SPLITS IF REQUIRED
--------------------------------------------------------------
If the word 'CHAPTER' or 'Chapter' is present
Note: if 'class' and 'span' are still present they will be removed.
Here are some examples of original code: Code:
<p>Chapter 5</p>
or
<h3>Chapter five</h3>
or
<p class="calibre4"><span class="calibre7"><span class="calibre15">CHAPTER FIVE</span></span></p>
Here's what happens to the first example:- Code:
<h2>Chapter 5</h2>
Here's what happens to the second example:- Code:
<hr class="sigilChapterBreak" /><h2>Chapter five</h2>
Code:
Find: <(p|h\d)[^>]*>(<span[^>]*>)*((Chapt|CHAPT|chapt)[^</]*)(</span>)*<(/p|/h\d)>
Repl(A): <h2>\3</h2>
Repl(B): <hr class="sigilChapterBreak" /><h2>\3</h2>
Use Repl(A) for the heading only, or Repl(B) to put a Sigil chapter break in front of the heading as well.
To retain all the classes and spans use the following Find and Replace: Code:
Find: <p([^>]*>(<span[^>]*>)*(Chapt|CHAPT|chapt)[^</]*(</span>)*</)p>
Repl(A): <h2\1h2>
Repl(B): <hr class="sigilChapterBreak" /><h2\1h2>
Code:
<p>2</p> becomes:- <h2>Chapter 2</h2> Code:
Find: <(p|h\d)[^>]*>(<span[^>]*>)*((\d+)[^</]*)(</span>)*<(/p|/h\d)>
Repl(A): <h2>Chapter \3</h2>
Code:
<p class="calibre4"><span class="calibre7">5</span></p> Code:
<hr class="sigilChapterBreak" /><h2>Chapter 2</h2>
Use the same Find code as above: Code:
Find: <(p|h\d)[^>]*>(<span[^>]*>)*((\d+)[^</]*)(</span>)*<(/p|/h\d)>
Repl(B): <hr class="sigilChapterBreak" /><h2>Chapter \3</h2>
To retain all the classes and spans use the following Find and Replace: Code:
Find: <p([^>]*>(<span[^>]*>)*(\d+)[^</]*(</span>)*</)p>
Repl(A): <h2\1h2>
Repl(B): <hr class="sigilChapterBreak" /><h2\1h2>
The CHAPTER heading shown by NUMBERS in WORDs only
This finds a single word, or a hyphenated pair of words with no spaces, in p.../p or hx.../hx tags (e.g. Two, Thirty, Forty-five, Sixty-Nine) and puts it in Heading 2 tags. It will also match any other one-word paragraph, so check each match before replacing.
Here are some examples of original code: Code:
<p>Twenty-one</p>
<p class="calibre2"><span class="calibre3"><span class="calibre4">Twenty-one</span></span></p>
Code:
<h2>Twenty-one</h2> Code:
<h2 class="chapterNumber" id="heading_id_2" style="text-indent: 0%;"><span class="bold">Thirty-four</span></h2> Code:
<h2>Thirty-four</h2> Code:
Find: <(p|h\d)[^>]*>(<span[^>]*>)*([A-Za-z]+[\-]?([A-Za-z]*)?)(</span>)*<(/p|/h\d)>
Repl(A): <h2>\3</h2>
Repl(B): <hr class="sigilChapterBreak" /><h2>\3</h2>
To retain all the classes and spans use the following Find and Replace: Code:
Find: <p([^>]*>(<span[^>]*>)*[A-Za-z]+[\-]?([A-Za-z]*)?(</span>)*</)p>
Repl(A): <h2\1h2>
Repl(B): <hr class="sigilChapterBreak" /><h2\1h2>
INCLUDING THE NEW STYLESHEET
-------------------------------------------------------------
This removes the style descriptions from <head>...</head> and replaces them with a link to the stylesheet. (Remember 'All HTML Files'!) Code:
Find: <style[^>]*>.*</style>
Repl: <link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
Here it is again briefly to put in Notepad for Copy/Pasting:
--------------------------------------------------------------
Code:
Delete all Calibre tags
Find: (<\w+) class="calibre(\d+)?"?[^>]*(>)
Repl: <p>
Then: Book View then Code View
--------------------------------------------------------------
Wrap in <h2> tags when 'CHAPTER' or 'Chapter' word present:
Find: <(p|h\d)[^>]*>(<span[^>]*>)*((Chapt|CHAPT|chapt)[^</]*)(</span>)*<(/p|/h\d)>
(A) Repl: <h2>\3</h2>
(B) Repl: <hr class="sigilChapterBreak" /><h2>\3</h2>
Retain cr@p:
Find: <p([^>]*>(<span[^>]*>)*(Chapt|CHAPT|chapt)[^</]*(</span>)*</)p>
Repl(A): <h2\1h2>
Repl(B): <hr class="sigilChapterBreak" /><h2\1h2>
--------------------------------------------------------------
Wrap in <h2> tags when DIGITS only
Find: <(p|h\d)[^>]*>(<span[^>]*>)*((\d+)[^</]*)(</span>)*<(/p|/h\d)>
(A) Repl: <h2>Chapter \3</h2>
(B) Repl: <hr class="sigilChapterBreak" /><h2>Chapter \3</h2>
Retain cr@p:
Find: <p([^>]*>(<span[^>]*>)*(\d+)[^</]*(</span>)*</)p>
Repl(A): <h2\1h2>
Repl(B): <hr class="sigilChapterBreak" /><h2\1h2>
--------------------------------------------------------------
Wrap in <h2> tags when NUMBERS in WORDs
Find: <(p|h\d)[^>]*>(<span[^>]*>)*([A-Za-z]+[\-]?([A-Za-z]*)?)(</span>)*<(/p|/h\d)>
(A) Repl: <h2>\3</h2>
(B) Repl: <hr class="sigilChapterBreak" /><h2>\3</h2>
Retain cr@p:
Find: <p([^>]*>(<span[^>]*>)*[A-Za-z]+[\-]?([A-Za-z]*)?(</span>)*</)p>
Repl(A): <h2\1h2>
Repl(B): <hr class="sigilChapterBreak" /><h2\1h2>
--------------------------------------------------------------
Include Stylesheet
Find: <style[^>]*>.*</style>
Repl: <link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
--------------------------------------------------------------
Your best approach is to grab a copy of an epub with lots of 'calibre' tags and have fun trying out the expressions. Remember to use the down arrows at the right-hand side of the Find and Replace boxes to recall recently used expressions. |
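If you'd like to try the chapter-heading expressions on a scrap of HTML before letting them loose on a whole book, here is a minimal Python sketch. It only covers the 'CHAPTER word present' Find with Repl(A)/(B) and its retain-everything variant. Note the assumptions: Python's re module is not the same regex engine as Sigil's, so treat this as a rough check rather than a guarantee of what Sigil will do, and the helper name preview() and the sample strings are made up for illustration.

```python
import re

# Find/Replace pairs copied from the post ("CHAPTER word present" case).
CHAPTER_FIND = r'<(p|h\d)[^>]*>(<span[^>]*>)*((Chapt|CHAPT|chapt)[^</]*)(</span>)*<(/p|/h\d)>'
REPL_A = r'<h2>\3</h2>'                                  # heading only
REPL_B = r'<hr class="sigilChapterBreak" /><h2>\3</h2>'  # heading plus chapter split

# The "retain all the classes and spans" variant from the post.
RETAIN_FIND = r'<p([^>]*>(<span[^>]*>)*(Chapt|CHAPT|chapt)[^</]*(</span>)*</)p>'
RETAIN_REPL_A = r'<h2\1h2>'


def preview(find, repl, html):
    """Hypothetical helper: apply one Find/Replace pair to a string of HTML."""
    return re.sub(find, repl, html)


if __name__ == "__main__":
    # Sample lines in the style of the examples in the post.
    samples = [
        '<p>Chapter 5</p>',
        '<h3>Chapter five</h3>',
        '<p class="calibre4"><span class="calibre7">'
        '<span class="calibre15">CHAPTER FIVE</span></span></p>',
        '<p>An ordinary paragraph that should be left alone.</p>',
    ]
    for html in samples:
        print('before  :', html)
        print('Repl(A) :', preview(CHAPTER_FIND, REPL_A, html))
        print('Repl(B) :', preview(CHAPTER_FIND, REPL_B, html))
        print('retain  :', preview(RETAIN_FIND, RETAIN_REPL_A, html))
        print()
```

The retain variant only looks at p tags, so the h3 sample passes through it unchanged; that matches the Find as written in the post.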
04-24-2011, 05:46 PM | #2 |
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
|
Thanks for sharing this!!!
|
04-24-2011, 09:08 PM | #3 |
Jr. - Junior Member
Posts: 586
Karma: 2000358
Join Date: Aug 2010
Location: Alabama
Device: Archos, Asus, HP, Lenovo, Nexus and Samsung tablets in 7,8 and 10"
|
I am new to regex, limping along at the most basic level. This will be both useful and instructive.
Many thanks - John |