09-27-2012, 07:35 PM | #151 | |
Grand Sorcerer
Posts: 27,550
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
If I have the original text to proof against, I sometimes find it easier (and less frustrating) just to blast ALL the spans away. Every single one. And then redo any italic and/or other special formatting using the physical copy as a guide. It's drastic, yes, but sometimes it's less drastic than fixing the havoc that a regex run on nested spans can wreak. In one fell swoop, all span tags (opening and closing) ... gone (when you replace it with nothing of course):
Code:
</?span[^>]*?>
Use with an appropriate level of trepidation, of course...
Last edited by DiapDealer; 09-27-2012 at 08:44 PM. |
|
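For anyone who wants to try the span-stripping expression outside an editor, here is a minimal Python sketch. The function name `strip_spans` and the sample markup are made up for illustration; only the pattern comes from the post above.

```python
import re

# Pattern quoted in the post: any opening or closing span tag,
# with or without attributes.
SPAN_TAG = re.compile(r"</?span[^>]*?>")

def strip_spans(html: str) -> str:
    """Delete every <span ...> and </span> tag, keeping the text between them."""
    return SPAN_TAG.sub("", html)

# Hypothetical sample with nested spans.
sample = '<p><span class="c1">One <span class="c2">nested</span> word</span></p>'
print(strip_spans(sample))  # <p>One nested word</p>
```

Because both opening and closing tags are removed in one pass, nesting does not matter here. The price is that all span-based formatting is lost, as the post warns.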
09-28-2012, 12:15 AM | #152 |
Evangelist
Posts: 473
Karma: 15000
Join Date: Jul 2008
Device: Various and sundry
|
|
09-28-2012, 04:01 AM | #153 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I'd first identify the spans that do something (search for "<span "), replace them with something more meaningful (<i>, <strong>...) or with some other temporary mark, then delete the remaining bogus spans.
|
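The two-pass idea above could be sketched in Python roughly as follows. The class name `italic` is purely hypothetical (a real file will use whatever names the converter generated), and the sketch assumes the meaningful spans do not contain further spans.

```python
import re

# Hypothetical class name for spans that carry real formatting.
ITALIC_SPAN = re.compile(r'<span class="italic">(.*?)</span>', re.DOTALL)
# Whatever span tags are left after the first pass.
ANY_SPAN_TAG = re.compile(r"</?span[^>]*>")

def simplify_spans(html: str) -> str:
    # Pass 1: turn the meaningful spans into something more meaningful.
    html = ITALIC_SPAN.sub(r"<i>\1</i>", html)
    # Pass 2: delete the remaining bogus span tags.
    return ANY_SPAN_TAG.sub("", html)

sample = '<p><span class="calibre3">A </span><span class="italic">fine</span> day</p>'
print(simplify_spans(sample))  # <p>A <i>fine</i> day</p>
```

If an italic span did contain another span, the non-greedy `(.*?)` would stop at the inner `</span>`, so nested cases still need checking by hand.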
09-29-2012, 03:01 PM | #154 |
♫
Posts: 660
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
I would remove ALL <span> (without anything behind) and let Tidy remove the corresponding closing spans.
Then search for </p> <p class="calibre">([a-z]) and replace it with _\1 (_ = blank).
Also search for ([a-zA-Z,])</p> <p class="calibre"> and replace it with \1_ (_ = blank). |
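A small Python sketch of the two paragraph-joining replacements, for testing on a copy of the text. One deliberate change from the post: the single blank between `</p>` and `<p class="calibre">` is loosened to `\s*` on the assumption that real files may have a line break there. The function name and sample are invented.

```python
import re

# Paragraph boundary as in the post, with the whitespace loosened to \s*.
BREAK = r'</p>\s*<p class="calibre">'

def join_split_paragraphs(html: str) -> str:
    # Next paragraph starts with a lowercase letter: merge, keeping one blank.
    html = re.sub(BREAK + r"([a-z])", r" \1", html)
    # Previous paragraph ends with a letter or a comma: merge, keeping one blank.
    html = re.sub(r"([a-zA-Z,])" + BREAK, r"\1 ", html)
    return html

sample = ('<p class="calibre">The rain fell,</p> <p class="calibre">And the wind</p> '
          '<p class="calibre">blew hard.</p> <p class="calibre">Next day</p>')
print(join_split_paragraphs(sample))
```

In the sample the first three paragraphs are merged into one, while the break after "blew hard." survives because the text before it ends with a period and the text after it starts with a capital.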
10-23-2012, 07:14 AM | #155 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Just a very simple expression for finding instances of a period, followed by a space, followed by a lowercase letter, caused by poor OCR.
\. ([a-z])
Not a candidate for automatic search and replace because it matches abbreviations, too. |
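Since this expression is meant for review rather than automatic replacement, one way to use it from Python is to list each hit with a little surrounding context. This is only a sketch; `find_suspects` and the `context` parameter are invented names.

```python
import re

# Expression from the post: period, space, lowercase letter.
OCR_SUSPECT = re.compile(r"\. ([a-z])")

def find_suspects(text: str, context: int = 15) -> list[str]:
    """Return each match with some surrounding text, for manual review."""
    hits = []
    for m in OCR_SUSPECT.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        hits.append(text[start:end])
    return hits

text = "He stopped. then he ran, i.e. quickly. The end."
for hit in find_suspects(text):
    print(repr(hit))
```

The sample yields two hits: the genuine OCR-style error after "stopped." and the false positive after the abbreviation "i.e.", which is exactly why a blind replace-all is risky.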
10-24-2012, 01:10 AM | #156 | |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Quote:
It is never a good idea to trust Tidy to make the right choice.... |
|
10-25-2012, 12:58 AM | #157 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Now that Sigil has a nice Search Editor, I can add some more regex.
I would like to set up this one. It's about superscript text: the text me, mes, lle, lles, er, e, o placed within a sup tag and followed by a normal space should instead be followed by a non-breaking space. Say, <sup>me</sup>(normal space) should be replaced by <sup>me</sup>(non-breaking space). (me, lle are the superscript endings used when abbreviating Madame, Mademoiselle...) I hope I have been clear enough... |
10-25-2012, 05:29 AM | #158 |
Guru
Posts: 655
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
Search (put a space at end after </sup>)
Code:
<sup>(me|mes|lle|lles|er|e|o)</sup>
Replace (put a non-breaking space at end after </sup>)
Code:
<sup>\1</sup> |
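The same search and replace, sketched in Python so it can be tried on a sample first. The non-breaking space is written as the character U+00A0; writing the entity `&#160;` instead would be an equally plausible choice. The function name and the sample sentence are invented.

```python
import re

NBSP = "\u00a0"  # non-breaking space character (assumption: entity not required)

# Search expression from the post, including the trailing normal space.
SUP_THEN_SPACE = re.compile(r"<sup>(me|mes|lle|lles|er|e|o)</sup> ")

def protect_sup_abbreviations(html: str) -> str:
    """Swap the normal space after a superscript ending for a non-breaking one."""
    return SUP_THEN_SPACE.sub(r"<sup>\1</sup>" + NBSP, html)

sample = "M<sup>me</sup> Dupont, le 1<sup>er</sup> mai; me voici"
print(repr(protect_sup_abbreviations(sample)))
```

Anchoring on the `<sup>` tags is what keeps ordinary words such as "me" in running text untouched.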
10-25-2012, 05:57 AM | #159 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
@Perkin
Thanks a lot for your help. I did not know how to deal with the "false" words like me, mes... Already in use. Last edited by roger64; 10-25-2012 at 09:30 AM. |
12-23-2012, 08:47 AM | #160 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
To replace hyphen with non-breaking hyphen
Hi
I am trying to set up a regex for the French language. We have some acronyms joined with hyphens (8208), like J.-C, P.-D.G. (and the list can grow). They always end up awkwardly broken at the hyphen, and it would be much better if they were not. That's why I would like to replace their hyphens with non-breaking hyphens (8209). I do not know how to set up this regex. Ideally, I would like to be able to add a new word easily. I think there must be something better than this (I wrote only 8208 instead of the full &#...):
Search: (J.|P.)8208(C.|D.G)
Replace: \18209\2 |
12-23-2012, 09:43 AM | #161 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Are there instances of a hyphen after a period that you do not want to replace? If there aren't, you can just replace all ".-" with ".¬" (where I use ¬ for the non-breaking hyphen), with appropriate escaping of the period if needed.
|
12-23-2012, 09:46 AM | #162 |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
I'm sure that the Regex gurus will come up with a much more efficient Regex, but I'd simply search for a capital letter with a period followed by ‐ and another capital letter followed by a period:
Find: ([[:upper:]]\.)‐([[:upper:]]\.)
Replace: \1‑\2
This should work in Sigil and any other editor with PCRE support. |
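A rough Python equivalent, for anyone scripting this outside Sigil. Python's `re` module does not support the POSIX class `[[:upper:]]`, so this sketch swaps in `[A-Z]`, which only covers unaccented capitals. It also accepts the ASCII hyphen-minus as well as U+2010, on the assumption that either may occur in the source. The function name and sample are invented.

```python
import re

HYPHEN = "\u2010"           # the ordinary hyphen (8208)
NB_HYPHEN = "\u2011"        # the non-breaking hyphen (8209)

# Capital + period, a hyphen (ASCII "-" or U+2010), capital + period.
INITIALS_HYPHEN = re.compile(r"([A-Z]\.)[-" + HYPHEN + r"]([A-Z]\.)")

def protect_initials(text: str) -> str:
    """Replace the hyphen between dotted capitals with a non-breaking hyphen."""
    return INITIALS_HYPHEN.sub(r"\1" + NB_HYPHEN + r"\2", text)

sample = "avant J.-C. et le P.-D.G. arrive"
print(protect_initials(sample).encode("unicode_escape").decode("ascii"))
# avant J.\u2011C. et le P.\u2011D.G. arrive
```

Printing with `unicode_escape` just makes the otherwise invisible difference between the two hyphens visible.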
12-23-2012, 10:46 PM | #163 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Hi
As with many things, I gather experience book after book. After preparing a history book, I realized that using a hyphen for J.-C. (70 occurrences of it in one book) was NOT a nice idea. I have no idea how many words of this kind I may find, and I am really not sure that all occurrences of .- deserve the same treatment. That's why I first thought of adding them one by one. But, in fact, I realize there does not seem to be a very big risk in trying your solutions. So I will try them. Thanks for them. And enjoy a Merry Christmas. |
12-24-2012, 03:36 AM | #164 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Well, try searching for ".-" first and see which occurrences you find. With any luck you'll see they all want to be non-breaking, or you may see a pattern (like Doitsu's suggestion) and find some typos.
|
01-09-2013, 09:08 AM | #165 |
Groupie
Posts: 171
Karma: 86271
Join Date: Feb 2012
Device: iPad, Kindle Touch, Sony PRS-T1
|
found myself parsing messy html today, removing empty <p> tags, or <p> tags containing only a non-breaking space, or <p><i></i></p>, <p><b> </b></p> etc. so that i could space the paragraphs consistently in css, and, inspired by this thread, thought i'd share the snippet in case anyone has a use for it.
i realize it could probably be more concise, and i wouldn't just blindly replace all, but it seems to do the job. it removes <p> tags that may also contain <b>, <i>, <span>, and that have no content, or 1 or more spaces, or a <br>, <br/>, <br />.
Code:
<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>| |\s*)((</\w+[^>]*>)+)?</p>
Last edited by mzmm; 01-09-2013 at 09:15 AM. |
|
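A Python sketch for trying the empty-paragraph expression on a sample before running it on a book. One assumption is baked in: the blank alternative in the middle of the posted pattern is written here as the entity `&nbsp;`, which may or may not be what the original contained. The function name and sample markup are invented.

```python
import re

# Pattern from the post, with the middle alternative assumed to be "&nbsp;".
EMPTY_P = re.compile(
    r"<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>|&nbsp;|\s*)((</\w+[^>]*>)+)?</p>"
)

def drop_empty_paragraphs(html: str) -> str:
    """Delete paragraphs holding only inline tags, blanks, &nbsp; or a <br>."""
    return EMPTY_P.sub("", html)

sample = (
    "<p>Keep me</p>"
    "<p></p>"
    '<p class="x"><br /></p>'
    "<p><b> </b></p>"
    "<p>&nbsp;</p>"
    "<p><i>Also kept</i></p>"
)
print(drop_empty_paragraphs(sample))  # <p>Keep me</p><p><i>Also kept</i></p>
```

One reason not to replace blindly: a paragraph holding nothing but an unclosed `<img src="x.png">` tag also matches, because the image tag is accepted as one of the leading inline tags.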