Old 09-27-2012, 07:35 PM   #151
DiapDealer
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
The HTML code looks like:

Code:
<p class="calibre"><span>bad policy to answer a</span></p>

<p class="calibre"><span>direct question. He kept shaking his head like a china figure.
Ugh. Those empty spans surrounding literally everything are always a pain in the ass. You'll almost surely need to get rid of them first. The problem is ... there can be nested spans (italics/bolds/etc) within them. And that makes it quite painful to regex them away (without funkifying your "real" formatting spans).

If I have the original text to proof against, I sometimes find it easier (and less frustrating) just to blast ALL the spans away. Every single one. And then redo any italic and/or other special formatting using the physical copy as a guide. It's drastic, yes, but sometimes it's less drastic than fixing the havoc that a regex run on nested spans can wreak.

In one fell swoop, all span tags (opening and closing) ... gone (when you replace it with nothing of course):
Code:
</?span[^>]*?>
It all depends on the complexity of the book's formatting, of course. I may not always opt for the "nuclear" span removal approach, but I've done it quite a few times.

Use with an appropriate level of trepidation, of course...
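
For anyone scripting this outside Sigil, here is a minimal Python sketch of the "nuclear" removal described above. The function name and the sample markup (including the "italic" class) are made up for illustration; only the pattern itself comes from the post.

```python
import re

# The pattern from the post: any opening or closing span tag, with or without attributes.
SPAN_TAG = re.compile(r"</?span[^>]*?>")

def strip_all_spans(html):
    """Delete every <span ...> and </span> tag, keeping the text between them."""
    return SPAN_TAG.sub("", html)

# Hypothetical sample with a nested formatting span.
sample = '<p class="calibre"><span>bad policy to <span class="italic">answer</span> a</span></p>'
print(strip_all_spans(sample))
```

Note that the italic span disappears along with the empty one, which is exactly why the formatting has to be redone by hand afterwards.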

Last edited by DiapDealer; 09-27-2012 at 08:44 PM.
Old 09-28-2012, 12:15 AM   #152
JMikeD
Evangelist
Posts: 473
Karma: 15000
Join Date: Jul 2008
Device: Various and sundry
Quote:
Originally Posted by DiapDealer View Post

Use with an appropriate level of trepidation, of course...
It's probably just as easy to export the entire thing to RTF, clean everything up in OpenOffice and use the ePub Export extension in OO. That gives pretty clean results.
Old 09-28-2012, 04:01 AM   #153
Jellby
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
I'd first identify the spans that do something (search for "<span ", replace them with something more meaningful (<i>, <strong>...), or with some other temporary mark), then delete the remaining bogus spans.
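
A rough Python sketch of that two-step approach, assuming the meaningful spans carry a class called "italic" (a hypothetical name; the real one depends on the book's stylesheet) and contain plain text only:

```python
import re

# Hypothetical class name; check the book's CSS for the real ones.
ITALIC_SPAN = re.compile(r'<span class="italic">([^<]*)</span>')
ANY_SPAN_TAG = re.compile(r"</?span[^>]*>")

def keep_italics_drop_rest(html):
    # Step 1: turn text-only italic spans into <i> so they survive step 2.
    html = ITALIC_SPAN.sub(r"<i>\1</i>", html)
    # Step 2: delete every span tag that is still left.
    return ANY_SPAN_TAG.sub("", html)

sample = '<p class="calibre"><span>He kept <span class="italic">shaking</span> his head.</span></p>'
print(keep_italics_drop_rest(sample))
```

Because of the `[^<]*` in step 1, an italic span that itself contains other tags is not converted and would be stripped in step 2, so it is worth searching for leftovers before running the second step.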
Old 09-29-2012, 03:01 PM   #154
WS64
Posts: 660
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
I would remove ALL bare <span> tags (those with no attributes after the tag name) and let Tidy remove the corresponding closing spans.

Then search for
Code:
</p>

<p class="calibre">([a-z])
and replace it with
Code:
_\1
(where _ stands for a single space)

Also search for
Code:
([a-zA-Z,])</p>

<p class="calibre">
and replace it with
Code:
\1_
(where _ stands for a single space)
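
The two searches above could be scripted roughly as follows. This is only a sketch: it assumes the spans are already gone, and it uses `\s*` in place of the literal blank line between the tags, which is my assumption about the whitespace in the file.

```python
import re

# \s* stands in for the blank line between the closing and opening tags (an assumption).
STARTS_LOWER = re.compile(r'</p>\s*<p class="calibre">([a-z])')
ENDS_OPEN = re.compile(r'([a-zA-Z,])</p>\s*<p class="calibre">')

def join_broken_paragraphs(html):
    # A paragraph starting with a lowercase letter continues the previous one.
    html = STARTS_LOWER.sub(r" \1", html)
    # A paragraph ending in a letter or comma continues into the next one.
    html = ENDS_OPEN.sub(r"\1 ", html)
    return html

sample = (
    '<p class="calibre">bad policy to answer a</p>\n\n'
    '<p class="calibre">direct question. He kept shaking his head,</p>\n\n'
    '<p class="calibre">"Like a china figure."</p>'
)
print(join_broken_paragraphs(sample))
```

Paragraphs that end with a period and are followed by a capital letter are left alone by both patterns.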
Old 10-23-2012, 07:14 AM   #155
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Just a very simple expression for finding a period followed by a space and then a lowercase letter, a pattern often caused by poor OCR.

\. ([a-z])

Not a candidate for auto search and replace because it matches abbreviations, too.
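
Since this one needs a human eye, a small Python sketch that only lists the hits with some context, rather than replacing anything, might look like this. The function name, the context width and the sample sentence are all made up for illustration.

```python
import re

SUSPECT = re.compile(r"\. ([a-z])")

def suspects(text, width=15):
    """Return a short context snippet around each match for manual review."""
    snippets = []
    for m in SUSPECT.finditer(text):
        start = max(0, m.start() - width)
        snippets.append(text[start:m.end() + width])
    return snippets

sample = "He kept shaking. his head like a china figure, i.e. slowly."
for snippet in suspects(sample):
    print(repr(snippet))
```

The sample deliberately contains both a real OCR-style error and an abbreviation ("i.e."), and both are reported, which is the reason for reviewing by hand.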
Old 10-24-2012, 01:10 AM   #156
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
Quote:
Originally Posted by WS64 View Post
I would remove ALL bare <span> tags (those with no attributes after the tag name) and let Tidy remove the corresponding closing spans.
I would strongly recommend against doing that. If there are nested spans, Tidy doesn't always remove the correct closing span. That can make a real mess out of your book.

It is never a good idea to trust Tidy to make the right choice....
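
One way to avoid relying on Tidy for the closing tags is to pair each closing span with its own opening span while removing. The sketch below is not something Sigil or Tidy does; it is a small stack-based illustration in Python that assumes well-formed markup and treats only an exact `<span>` as "bare".

```python
import re

SPAN_TOKEN = re.compile(r"<span\b[^>]*>|</span>")

def drop_bare_spans(html):
    """Remove attribute-less <span> tags together with their own closing tags."""
    out = []
    stack = []  # one entry per open span: True if that span is being removed
    pos = 0
    for m in SPAN_TOKEN.finditer(html):
        out.append(html[pos:m.start()])
        pos = m.end()
        tag = m.group()
        if tag == "</span>":
            remove = stack.pop() if stack else False
            if not remove:
                out.append(tag)
        else:
            remove = tag == "<span>"
            stack.append(remove)
            if not remove:
                out.append(tag)
    out.append(html[pos:])
    return "".join(out)

sample = '<p><span>He kept <span class="italic">shaking</span> his head.</span></p>'
print(drop_bare_spans(sample))
```

The stack is what keeps the inner closing tag attached to the italic span instead of being mistaken for the closer of the outer, empty one.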
Old 10-25-2012, 12:58 AM   #157
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Now that Sigil has a nice Search Editor, I can add some more regex.
I would like to set up this one:

It's about superscript text:

When me, mes, lle, lles, er, e or o is placed within a sup tag and followed by a normal space, that space should be replaced by a &nbsp;

For example, <sup>me</sup>(normal space) should be replaced by
<sup>me</sup>&nbsp;

(me and lle are the superscript parts of the abbreviations Mme (Madame) and Mlle (Mademoiselle)...)

I hope I have been clear enough...
Old 10-25-2012, 05:29 AM   #158
Perkin
Guru
Posts: 655
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
Search (put a space at end after </sup>)
Code:
<sup>(me|mes|lle|lles|er|e|o)</sup>
Replace
Code:
<sup>\1</sup>&nbsp;
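
The same search and replace, written as a small Python sketch for anyone who wants to run it in a script. The function name and the sample sentence are invented; the pattern and replacement are the ones above, including the trailing space in the search.

```python
import re

# Note the single space after </sup> at the end of the pattern.
SUP_SPACE = re.compile(r"<sup>(me|mes|lle|lles|er|e|o)</sup> ")

def nbsp_after_sup(html):
    return SUP_SPACE.sub(r"<sup>\1</sup>&nbsp;", html)

sample = "M<sup>me</sup> Bovary et M<sup>lle</sup> Rouault, le 1<sup>er</sup> mai, x<sup>2</sup> fois"
print(nbsp_after_sup(sample))
```

Superscripts that are not in the list, such as the `2` in the sample, keep their normal space.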
Old 10-25-2012, 05:57 AM   #159
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
@Perkin

Thanks a lot for your help. I did not know how to deal with the "false" words like me, mes...



Already in use.

Last edited by roger64; 10-25-2012 at 09:30 AM.
Old 12-23-2012, 08:47 AM   #160
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
To replace hyphen with non-breaking hyphen

Hi

I am trying to set up a regex for French text.

We have some acronyms joined with hyphens (8208), like
J.-C., P.-D.G. (and the list can grow). Unfortunately, they always get broken across lines at the hyphen, and it would be much better if they were not. That's why I would like to replace their hyphens with non-breaking hyphens (8209).
I do not know how to set up this regex. Ideally, I would like to be able to add a new word easily.

I think there must be something better than this.
I wrote only 8208 instead of the full &#...:

Search: (J.|P.)8208(C.|D.G)
Replace: \18209\2
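
If the "easy to add one more word" part matters most, a script can build the search from a plain list. The following Python sketch is one possible way, not a Sigil feature: the list, the function names and the sample text are hypothetical, and the entities are assumed to be written out as &#8208; and &#8209; in the source. `re.escape` takes care of the periods, which are unescaped wildcards in the attempt above.

```python
import re

HYPHEN = "&#8208;"     # the ordinary hyphen entity assumed to be in the book
NB_HYPHEN = "&#8209;"  # the non-breaking hyphen entity

# Add new acronyms here, written with a plain "-" for readability.
ACRONYMS = ["J.-C.", "P.-D.G."]

def build_pattern(acronyms):
    # Escape each acronym so that "." matches a literal period only.
    alternatives = [re.escape(a.replace("-", HYPHEN)) for a in acronyms]
    return re.compile("|".join(alternatives))

def protect_acronyms(html, acronyms=ACRONYMS):
    pattern = build_pattern(acronyms)
    return pattern.sub(lambda m: m.group().replace(HYPHEN, NB_HYPHEN), html)

sample = "avant J.&#8208;C., le P.&#8208;D.G. est porte&#8208;parole"
print(protect_acronyms(sample))
```

Only the listed acronyms are touched; ordinary hyphenated words keep their breaking hyphen.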
Old 12-23-2012, 09:43 AM   #161
Jellby
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Are there instances of hyphen after a period that you do not want to replace? If there aren't you can just replace all ".-" with ".¬" (where I use ¬ for the non-breaking hyphen), with appropriate escaping of the period if needed.
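
In a script, this blanket replacement does not even need a regex. A minimal Python sketch, assuming the hyphens are stored as the entities &#8208; and &#8209; (the function name and sample are made up):

```python
def protect_after_period(html):
    # Plain string replacement, so the period needs no escaping here.
    return html.replace(".&#8208;", ".&#8209;")

sample = "J.&#8208;C. et porte&#8208;parole"
print(sample.count(".&#8208;"), "occurrence(s) to replace")
print(protect_after_period(sample))
```

Counting first gives a quick idea of how many places will change before committing to the replacement.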
Old 12-23-2012, 09:46 AM   #162
Doitsu
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
I'm sure that the Regex gurus will come up with a much more efficient Regex, but I'd simply search for a capital letter with a period followed by &#8208; and another capital letter followed by a period:

Find: ([[:upper:]]\.)&#8208;([[:upper:]]\.)
Replace: \1&#8209;\2

This should work in Sigil and any other editor with PCRE support.
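
A note for anyone trying this in Python rather than a PCRE editor: Python's built-in `re` module does not support POSIX classes such as `[[:upper:]]`, so the sketch below swaps in `[A-Z]`. That is a substitution on my part, and it will not match accented capitals. The function name and sample are hypothetical.

```python
import re

# [A-Z] replaces [[:upper:]], which Python's re module does not understand.
INITIALS = re.compile(r"([A-Z]\.)&#8208;([A-Z]\.)")

def protect_initials(html):
    return INITIALS.sub(r"\1&#8209;\2", html)

sample = "J.&#8208;C. P.&#8208;D.G. porte&#8208;parole"
print(protect_initials(sample))
```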
Old 12-23-2012, 10:46 PM   #163
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Hi

As with many things, I gather experience book after book. After preparing a history book, I realized that using a plain hyphen for J.-C. (70 occurrences in one book) was NOT a good idea.

I have no idea how many words of this kind I may find, and I am really not sure that all occurrences of .- deserve the same treatment. That's why I first thought of adding them one by one.

But, in fact, I realize there does not seem to be much risk in trying your solutions. So I will try them. Thanks for them.

And have a Merry Christmas.
Old 12-24-2012, 03:36 AM   #164
Jellby
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Well, try searching for ".-" first and see which occurrences you find. With any luck you'll see they all want to be non-breaking, or you may see a pattern (like Doitsu's suggestion) and find some typos.
Old 01-09-2013, 09:08 AM   #165
mzmm
Groupie
Posts: 171
Karma: 86271
Join Date: Feb 2012
Device: iPad, Kindle Touch, Sony PRS-T1
found myself parsing messy html today, removing empty <p> tags, or <p> tags containing &nbsp;, or <p><i></i></p>, <p><b> </b></p> etc. so that i could space the paragraphs consistently in css, and, inspired by this thread, thought i'd share the snippet in case anyone has a use for it.

i realize it could probably be more concise, and i wouldn't just blindly replace all, but it seems to do the job. it removes <p> tags that may also contain <b>, <i>, <span>, have no content, or 1 or more spaces, or a <br>,<br/>,<br />.

Code:
<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>|&nbsp;|\s*)((</\w+[^>]*>)+)?</p>
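
A quick way to try the pattern on a few cases before trusting it on a whole book is a small Python sketch like the one below. The function name and the sample markup are invented; the regex is copied unchanged from the snippet above.

```python
import re

EMPTY_P = re.compile(
    r"<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>|&nbsp;|\s*)((</\w+[^>]*>)+)?</p>"
)

def drop_empty_paragraphs(html):
    return EMPTY_P.sub("", html)

sample = (
    "<p>Keep me.</p><p></p><p>&nbsp;</p>"
    "<p><i></i></p><p><b> </b></p><p><br /></p>"
)
print(drop_empty_paragraphs(sample))
```

Paragraphs with real text, including text wrapped in a span, do not match and are left in place.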

Last edited by mzmm; 01-09-2013 at 09:15 AM.
All times are GMT -4.