Old 09-23-2011, 08:12 AM   #1
ghostyjack
Guru
Posts: 718
Karma: 1085610
Join Date: Mar 2009
Location: Bristol, England
Device: PRS-T1, 1825PT, Galaxy Tab, One X, TF700T, Aura HD, Nexus 7
RegEx Help needed

I've just bought a book and it is full of sentences that have random words that begin with a capital letter in them.

As it's a book I've read before in paper form, I realised that these are in fact not random capitalised words: they are actually the start of the next sentence, and the previous word is missing its full stop.

I'm not at all familiar with regex, so I was wondering what I would need to put in the S&R boxes to look for words that begin with a capital letter but are not preceded by a full stop and a space? And what do I need to put in to add the full stop?

Also there are a few words that not only had the full stop missing, the space is also missing and the words are joined together. So I also need to know what to put in the S&R boxes to search for words that have a capital in them so that I can add a full stop and a space at that point.

There are also other issues within it, like combined paragraphs that have speech from two people where they should in fact be separated, and speech marks not next to the spoken word (i.e. they have a space on either side of them) that end up appearing on lines by themselves. These, though, I can figure out myself.
Old 09-23-2011, 08:53 AM   #2
charleski
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
Sentences that miss both the full stop and the space are easy:
Find:
([a-z])([A-Z])
Replace:
\1. \2
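
For anyone who would rather try this outside Sigil first, here is a minimal Python sketch of the same find/replace. The function name and the sample sentence are made up for illustration; only the pattern and the replacement come from the post above.

```python
import re

# A lowercase letter immediately followed by an uppercase letter: the sentence
# boundary has lost both its full stop and its space.
JOINED = re.compile(r"([a-z])([A-Z])")


def split_joined_sentences(text: str) -> str:
    # \1 is the last letter of the first sentence, \2 the first letter of the next.
    return JOINED.sub(r"\1. \2", text)


print(split_joined_sentences("He left the roomShe stayed behind."))
# He left the room. She stayed behind.
```

Note that the pattern cannot tell a joined sentence from a name with an internal capital, so "McDonald" would become "Mc. Donald". Stepping through the hits is safer than Replace All if the book has such names.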

Sentences that only miss the full stop are harder, as the capitalised word might be a name. If this doesn't occur too often it's easiest just to step through with Find Next and hit Replace on those instances that aren't names. (Obviously a sentence can end in a variety of ways; this only finds hits where there is no punctuation at all between the words.)
Find:
([a-z]) ([A-Z])
Replace:
\1. \2

Alternatively, if you can list all the names in your book, first change them all to something like '#1', '#2', etc, then Replace All using the above code and then change the # codes back to the correct names.
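
The placeholder idea can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function name is hypothetical, the placeholders are written as '#0#', '#1#' (with a closing '#' so that '#1' cannot be confused with '#10' when changing them back), and the book text is assumed not to contain such tokens already.

```python
import re

# Lowercase letter, one space, uppercase letter: a possible missing full stop.
MISSING_STOP = re.compile(r"([a-z]) ([A-Z])")


def add_full_stops(text, names):
    # Step 1: hide every listed name behind a placeholder. Longer names go
    # first so that "Rob" does not clobber part of "Robert".
    tokens = {}
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        token = f"#{i}#"
        tokens[token] = name
        text = text.replace(name, token)
    # Step 2: the equivalent of Replace All with the pattern above.
    text = MISSING_STOP.sub(r"\1. \2", text)
    # Step 3: change the placeholders back to the real names.
    for token, name in tokens.items():
        text = text.replace(token, name)
    return text


print(add_full_stops("She waved at Robert and smiled He waved back.", ["Robert"]))
# She waved at Robert and smiled. He waved back.
```

One limitation of this approach: a sentence that starts with a listed name after a missing full stop is also hidden behind a placeholder, so it will not be fixed and still needs a manual pass.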

Remember to save to a different file at each step so you can go back if you make a mistake.
Old 09-23-2011, 12:22 PM   #3
susan_cassidy
Wizard
Posts: 2,251
Karma: 3720310
Join Date: Jan 2009
Location: USA
Device: Kindle, iPad (not used much for reading)
Don't you need a \s+ or something between the lower case expression and the upper-case expression, or is this some special flavor of regexes, peculiar to Calibre? In the second place that you have the regex, it looks like a space, but I think a \s+ would be better, and more reliable.
Old 09-23-2011, 01:44 PM   #4
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
\s matches any white space character; a literal space matches only a space. \s is only necessary if the letters might be separated by something other than a single space, such as a tab.
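
A small Python comparison may make the difference concrete. The sample text is invented; it has one boundary separated by a tab and one separated by a plain space.

```python
import re

text = "first line\tSecond line\nthird line Fourth"

# A literal space in the pattern matches only a space character.
literal_space = re.findall(r"[a-z] [A-Z]", text)

# \s+ matches one or more whitespace characters: spaces, tabs, line breaks.
any_whitespace = re.findall(r"[a-z]\s+[A-Z]", text)

print(literal_space)   # ['e F']
print(any_whitespace)  # ['e\tS', 'e F']
```

So for ordinary book text, where words are separated by single spaces, the plain space is enough; \s+ only finds extra hits when tabs, line breaks or runs of spaces sit between the two letters.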
Old 09-23-2011, 09:09 PM   #5
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You can get a bit fancier with this regex and use positive and negative lookahead/lookbehind, which gives you a chance to take care of the most common proper names in the book as well:

(?<=[a-z])(?= [A-Z](?!(obert|avid|ebecca)))

Use the trailing characters of the proper name (i.e. remove the first letter) in the last set of parentheses, separated by '|' in case that's not obvious above. Then you only need to use this for replacement:
.
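
Here is a rough Python sketch of the same lookaround idea, in case it helps to see it run. The helper name and the sample sentence are assumptions; the name endings are the ones from the pattern above. Because the pattern matches a position rather than any characters, the replacement is just the full stop.

```python
import re

# Name endings with the first letter removed, as described above.
NAME_TAILS = ["obert", "avid", "ebecca"]

# Match the empty position after a lowercase letter and before " X", unless
# the capital X is followed by one of the listed name endings.
BOUNDARY = re.compile(
    r"(?<=[a-z])(?= [A-Z](?!(?:%s)))" % "|".join(map(re.escape, NAME_TAILS))
)


def insert_full_stops(text: str) -> str:
    # Nothing is consumed by the match, so "." is simply inserted there.
    return BOUNDARY.sub(".", text)


print(insert_full_stops("He saw Robert leave She waved"))
# He saw Robert leave. She waved
```

The check only looks at the letters after the capital, so with this list any word ending the same way (for example a hypothetical "Tobert") is skipped as well.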
Old 10-29-2011, 06:33 PM   #6
shamanNS
Guru
Posts: 879
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
Can someone help me create a RegEx for deleting a couple of lines in all the html files that are in the same folder (or maybe all html files currently open in Notepad++)? I have about 200 ebooks that need to be edited, and each chapter is a separate html file, so manually editing them would take ages...


Here is an example of which lines should be deleted (the ones in bold letters):
link

The part in red (p1) is the only thing that varies in those html files, and it's incremental (p1, p2, p3, ...).
The <p> in blue at the beginning of the line should not be deleted, only the rest of the line.

Here is one set of html files (one ebook) for testing purposes: link.

P.S. I have Python installed on my Windows 7 machine, so the solution can be in the form of a Python script, if that's easier.

Thanks in advance.
Old 10-29-2011, 07:40 PM   #7
theducks
Well trained by Cats
Posts: 29,703
Karma: 54369092
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
The code is malformed to boot.
Code:
<P class="next">
<A HREF=p2.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>
</P>
It is missing a closing </P>

Anyway, Sigil can do this on a complete book (all HTML files).

Search:
Code:
<P class="next">\s+<A HREF=p\d+\.html><IMG BORDER=0 SRC=\.\./\.\./graphics/next\.gif></A>
The \s+ handles the line split (it matches the line break and any other white space)
The \d+ matches one or more digits
Replace with <nothing>
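
As a quick way to check the search on a single chapter before running it over a whole book, here is a small Python sketch. The function name and the sample string are made up; the dots are escaped in this version so that they match literal dots only.

```python
import re

# \s+ covers the line break after <P class="next">, \d+ the page number.
NEXT_LINK = re.compile(
    r'<P class="next">\s+<A HREF=p\d+\.html>'
    r"<IMG BORDER=0 SRC=\.\./\.\./graphics/next\.gif></A>"
)


def remove_next_link(html: str) -> str:
    # Replace with nothing, as in the Sigil search above.
    return NEXT_LINK.sub("", html)


sample = (
    '<P class="next">\n'
    "<A HREF=p12.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>\n"
    "</P>"
)
print(repr(remove_next_link(sample)))  # '\n</P>'
```

Whatever follows the link (here the closing tag) is left in place, so it is worth checking what remains in the real files after the replace.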
Old 10-29-2011, 07:47 PM   #8
shamanNS
Guru
Posts: 879
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
Thanks. But what about the other bolded lines? The first one ending with .css, and the others with some table tags and the center tag?
Old 10-29-2011, 07:59 PM   #9
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Are you planning on some sort of conversion for these files, i.e. to xhtml or html4? I ask since they don't have many closing tags and the general markup is pretty bad. All of this regex is Python (PCRE if you remove the mode flags); Notepad++ has a horrible syntax, really not worth the effort.

It's no problem to do what you propose, it just needs something like :
Code:
(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)
It's super messy, but it's the least effort - It assumes you don't have any other tables which you would like to keep! (that could be fixed, but I have a feeling these don't use (m)any tables)
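
A minimal Python sketch of applying that pattern, for testing on one file's text. The pattern is copied unchanged; the re.IGNORECASE flag is an assumption added here because the markup quoted in this thread mixes upper- and lower-case tags, and the sample fragment is hypothetical.

```python
import re

# The cleanup pattern from the post above. IGNORECASE is an added assumption.
CLEANUP = re.compile(
    r"(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>"
    r"|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)",
    re.IGNORECASE,
)


def strip_layout_markup(html: str) -> str:
    # Every alternative is simply deleted; the surrounding text is untouched.
    return CLEANUP.sub("", html)


# Hypothetical chapter fragment using the kinds of tags mentioned in the thread.
sample = (
    "<LINK REL=stylesheet HREF=style.css>\n"
    "<center><table width=97%>\n"
    "<td><tr>\n"
    "<H2>Chapter 1</H2>\n"
    "</TD></TR></TABLE>\n"
)
print(strip_layout_markup(sample).strip())  # <H2>Chapter 1</H2>
```

The deleted tags leave empty lines behind, which is harmless for HTML but can be tidied with a further replace if it bothers you.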

If you are planning to convert, something like :
Quote:
<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;I assume this is paragraph text
should become :
Quote:
<p>I assume this is paragraph text</p>
And here's some regex for that :
Code:
Find : (?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)
Replace : <p>\g<paratex></p>
It preserves blank lines as empty paragraphs. It does not, however, capture the final </p> tag at the bottom of the file; you could clean that up relatively easily with another match.
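
To see what the paragraph regex does, here is a short Python sketch with an invented three-line sample. It assumes Python 3.5 or later, where a group that did not take part in the match (here paratex, on a bare <br> line) is replaced with an empty string instead of raising an error.

```python
import re

# The Find expression from above; (?mi) turns on multiline and ignore-case.
PARAGRAPH = re.compile(
    r"(?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)"
)


def to_paragraphs(html: str) -> str:
    # On a bare <br> line the paratex group is empty, giving <p></p>.
    return PARAGRAPH.sub(r"<p>\g<paratex></p>", html)


sample = (
    "<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;I assume this is paragraph text\n"
    "<BR>\n"
    "<BR>&nbsp;Second paragraph"
)
print(to_paragraphs(sample))
# <p>I assume this is paragraph text</p>
# <p></p>
# <p>Second paragraph</p>
```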

The next one will do pretty much the same thing for headings.
Code:
Find : (?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$
Replace : <\g<tag>>\g<heading></\g<tag>>
After that it should be in good enough shape to run through HTML Tidy or whatever.
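
The heading regex can be tried the same way. This is only a sketch with a made-up sample: one heading with an attribute and no closing tag, and one that is already closed.

```python
import re

# The Find expression from above. \1 refers back to the tag name (group 1),
# so an existing closing tag is swallowed and then written out again.
HEADING = re.compile(
    r"(?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$"
)


def fix_headings(html: str) -> str:
    # Attributes are dropped and every heading gets a matching closing tag.
    return HEADING.sub(r"<\g<tag>>\g<heading></\g<tag>>", html)


sample = "<H2 ALIGN=CENTER>Chapter One\n<h3>A Subtitle</h3>"
print(fix_headings(sample))
# <H2>Chapter One</H2>
# <h3>A Subtitle</h3>
```

The tag name is written back exactly as it was found, so upper-case H2 stays upper-case; a later pass through a tidy tool can normalise that.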

If you are still going slow on Monday, I'll write something to do this in batches or something - contract work til then :/

Last edited by Serpentine; 10-29-2011 at 08:01 PM. Reason: mention the heading regex
Old 10-29-2011, 08:45 PM   #10
shamanNS
Guru
Posts: 879
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
Those html files are all from a book publisher, distributed on CD, intended to be viewed in a web browser. In the browser it looks OK; if I enlarge the letters with Ctrl+ it reflows nicely. There is no visible table. I don't know why they use table tags; maybe because there is always a visible TOC in the left corner... maybe they used tables instead of frames. As far as I noticed there is only one pair of <td><tr> tags, one before the chapter title and one after all the book text, so it's like all of a chapter's text is in one "table" (html-code-wise; there is no visible table while viewing the html in a web browser). So no, there are no tables I need to keep.

Anyway, I am converting it to .mobi with Calibre (import the zipped html into Calibre, convert that to mobi) to read on my Kindle. I've only converted one book so far, and it looks normal on the Kindle: text reflows when I change the text size, and paragraphs have indents. The book has a TOC, and I can also navigate through chapters by pressing the left/right buttons on the Kindle.
I manually removed those bolded lines and then converted it for the Kindle.

Before that I tried importing the book into Calibre without touching those html files, via one html file that references all the chapter html files, and converted it to .mobi, but the book had 2 problems when viewed on the Kindle or in Mobipocket Reader for PC:
1) when moving the cursor I realized that the whole "page" is actually in one box of text (it displayed a border around the whole "page" text)
2) when I press the next page button on the Kindle it shows a blank page, and on the second button press it skips to the next chapter.

That is why I tried editing the html files and deleting the <td> and <tr> tags. After that the book looks normal, as if it were proper html. No need to clean more of the html junk in them.

I'm off to sleep now (it's 2 AM here), so I will test your regex tomorrow.

Last edited by shamanNS; 10-29-2011 at 08:47 PM.
Old 10-29-2011, 10:07 PM   #11
theducks
Well trained by Cats
Posts: 29,703
Karma: 54369092
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Tables are an old-school (AND lazy) way to quickly force the position of content.

Be very careful when carving out chunks. You could break things big time.
Tags are usually in pairs. You want to check that you have the correct (nested) pairing in what you CARVE OUT.
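
One rough way to check the pairing before and after carving is to count opening and closing tags per tag name. The Python sketch below does only that: it does not verify nesting order, the list of tags treated as unpaired is an assumption, and all names in it are hypothetical.

```python
import re
from collections import Counter

# Captures the optional slash and the tag name of every tag.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")
# Tags that never have a closing partner (assumed list, extend as needed).
VOID = {"br", "img", "link", "hr", "meta", "input"}


def unbalanced_tags(html: str) -> dict:
    opened, closed = Counter(), Counter()
    for slash, name in TAG.findall(html):
        name = name.lower()
        if name in VOID:
            continue
        (closed if slash else opened)[name] += 1
    # Positive number: more opening tags; negative: more closing tags.
    return {
        name: opened[name] - closed[name]
        for name in set(opened) | set(closed)
        if opened[name] != closed[name]
    }


print(unbalanced_tags("<center><table><td><tr>text</TD></TR></TABLE>"))
# {'center': 1}
```

An empty result only says the counts agree; with old markup that routinely omits closing tags (such as <P>), expect some noise.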
Old 10-29-2011, 11:31 PM   #12
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by theducks
Be very careful when carving out chunks. You could break things big time.
Tags are usually in pairs. You want to check that you have the correct (nested) pairing in what you CARVE OUT.
Haha - not this, no, this is old school. Closing tags? You expect far too much! (I'm surprised browsers ever rendered things correctly back in the day.)

That said, I'm pretty sure I can write a little script to convert these books into (simple) epub without too much effort, and since they're all from the same publisher at a set date, I'd be surprised if the files didn't have the exact same patterns. I'll try to find some time on Monday/Tuesday.
Old 10-30-2011, 05:23 AM   #13
shamanNS
Guru
Posts: 879
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
Yeah, all the html files have the same awful markup, but for ebooks from 2001 from a small publisher I'm not surprised.
shamanNS is offline   Reply With Quote
Old 11-01-2011, 12:58 PM   #14
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Did you resolve this problem?

I have some free time to see if I can make a little conversion script if you'd like.
Old 11-02-2011, 10:22 AM   #15
shamanNS
Guru
Posts: 879
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
I've managed a workflow with Notepad++ and its "Find in files" search option. It's not one-click / one find & replace, but it does not take much longer. I replace them one by one with these:

Code:
<A HREF=.*t.gif><\/A>
<LINK.*
<center><table width=97%>
<td><tr>
<\/TD><\/TR><\/TABLE>
<A HREF=.*t.gif><\/A>
<\/CENTER>
After that I import the html file(s)/zip into Calibre, set extra CSS for p, h2, h3, set TOC detection for h2 and h3 tags, and the result is perfect on my Kindle.
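
Since Python came up earlier in the thread, the same list of replacements could also be run over a whole folder in one go. The sketch below is only one possible way to do it: the function names and folder layout are hypothetical, matching is case-insensitive on the assumption that the Notepad++ search was too, the dot in "t.gif" is escaped, and the files are decoded as latin-1 purely so that every byte survives the round trip unchanged.

```python
import re
from pathlib import Path

# The search strings from the list above, each replaced with nothing.
PATTERNS = [
    r"<A HREF=.*t\.gif></A>",
    r"<LINK.*",
    r"<center><table width=97%>",
    r"<td><tr>",
    r"</TD></TR></TABLE>",
    r"</CENTER>",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def clean_html(text: str) -> str:
    for pattern in COMPILED:
        text = pattern.sub("", text)
    return text


def clean_folder(src, dst) -> int:
    """Write cleaned copies of every .html file in src to dst; originals stay."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.html")):
        # latin-1 maps each byte to one character, so nothing is lost.
        text = path.read_bytes().decode("latin-1")
        (dst / path.name).write_bytes(clean_html(text).encode("latin-1"))
        count += 1
    return count
```

Calling clean_folder("book1", "book1_clean") (both folder names are placeholders) would leave the original chapter files untouched, which fits the earlier advice to save to a different file at each step.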


It seems faster to me than importing the html files into Sigil, doing find/replace there, saving the epub, importing the epub into Calibre, and converting to mobi.
All times are GMT -4. The time now is 06:49 AM.

