RegEx Help needed

ghostyjack · 09-23-2011, 08:12 AM

I've just bought a book and it is full of sentences that have random words that begin with a capital letter in them.

As it's a book I've read before in paper form I realised that these are in fact not random words with capital letters, they are actually the start of the next sentence and the previous word is missing the full stop.

I'm not at all familiar with regex, so I was wondering what I would need to put in the S&R boxes to look for words that begin with a capital letter but are not preceded by a full stop and a space? And what do I need to put in to add the full stop?

Also there are a few words that not only had the full stop missing, the space is also missing and the words are joined together. So I also need to know what to put in the S&R boxes to search for words that have a capital in them so that I can add a full stop and a space at that point.

There are also other issues within it, like combined paragraphs that have speech from two people where they should in fact be separated, and speech marks that are not next to the spoken word (i.e. they have a space on either side of them) and end up appearing on lines by themselves. These, though, I can figure out myself.

charleski · 09-23-2011, 08:53 AM

Sentences that miss both the full stop and the space are easy:
Find:
([a-z])([A-Z])
Replace:
\1. \2

Sentences that only miss the full stop are harder, as the capitalised word might be a name. If this doesn't occur too often it's easiest just to step through with Find Next and hit Replace on those instances that aren't names. (Obviously a sentence can end in a variety of ways, this searches for hits where there is no punctuation between the words.)
Find:
([a-z]) ([A-Z])
Replace:
\1. \2

Alternatively, if you can list all the names in your book, first change them all to something like '#1', '#2', etc, then Replace All using the above code and then change the # codes back to the correct names.

Remember to save to a different file at each step so you can go back if you make a mistake.
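For anyone who would rather script this than click through a Search & Replace dialog, the two patterns above can be tried out in Python. This is only a sketch: the function names `fix_joined` and `fix_missing_stop` are made up for illustration, the sample sentences are invented, and the placeholder is written as `#1#` rather than `#1` (an assumption, so that `#1` cannot be confused with `#10` when the names are restored).

```python
import re


def fix_joined(text):
    """Insert '. ' where a lowercase letter runs straight into a capital."""
    return re.sub(r"([a-z])([A-Z])", r"\1. \2", text)


def fix_missing_stop(text, names=()):
    """Add a full stop before a capitalised word, leaving listed names alone."""
    # Swap each known name for a placeholder that contains no capital letter.
    placeholders = {f"#{i}#": name for i, name in enumerate(names, start=1)}
    for token, name in placeholders.items():
        text = text.replace(name, token)
    # The second pattern from the post: lowercase, one space, uppercase.
    text = re.sub(r"([a-z]) ([A-Z])", r"\1. \2", text)
    # Put the real names back.
    for token, name in placeholders.items():
        text = text.replace(token, name)
    return text


print(fix_joined("It was the endThe next day came"))
print(fix_missing_stop("He left the room She saw Robert there", names=["Robert"]))
```

Like the dialog version, `fix_joined` will also split names such as McDonald, so it is worth reviewing the result rather than trusting a blind Replace All.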

susan_cassidy · 09-23-2011, 12:22 PM

Don't you need a \s+ or something between the lower case expression and the upper-case expression, or is this some special flavor of regexes, peculiar to Calibre? In the second place that you have the regex, it looks like a space, but I think a \s+ would be better, and more reliable.

user_none · 09-23-2011, 01:44 PM

\s matches any whitespace character. A literal space matches only a space. \s is only necessary if the letters might be separated by other whitespace, such as a tab, instead of (or as well as) a space.
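The difference is easy to check in Python. A small sketch with invented sample strings, comparing a literal space against `\s+`:

```python
import re

space_separated = "the end The start"
tab_separated = "the end\tThe start"

literal_space = re.compile(r"[a-z] [A-Z]")
any_whitespace = re.compile(r"[a-z]\s+[A-Z]")

# A literal space in the pattern matches a space in the text...
print(bool(literal_space.search(space_separated)))
# ...but it does not match a tab.
print(bool(literal_space.search(tab_separated)))
# \s+ matches the tab (and a plain space) alike.
print(bool(any_whitespace.search(tab_separated)))
```

Note that `\s+` also matches line breaks, which may or may not be wanted in a book where each paragraph sits on its own line.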

ldolse · 09-23-2011, 09:09 PM

You can get a bit fancier with this regex and use positive and negative lookahead/lookbehind, which gives you a chance to take care of the most common proper names in the book as well:

(?<=[a-z])(?= [A-Z](?!(obert|avid|ebecca)))

Use the trailing characters of the proper name (i.e. remove the first letter) in the last set of parentheses, separated by '|' in case that's not obvious above. Then you only need to use this for replacement:
.
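Python's `re` module supports the same lookbehind and lookahead syntax, so the idea can be tested outside the S&R dialog. A minimal sketch, where `NAME_TAILS` and `add_full_stops` are illustrative names and the sentence is invented:

```python
import re

# Name endings to skip: "Robert", "David" and "Rebecca" minus their first letter.
NAME_TAILS = ["obert", "avid", "ebecca"]

pattern = re.compile(
    r"(?<=[a-z])(?= [A-Z](?!(" + "|".join(NAME_TAILS) + r")))"
)


def add_full_stops(text):
    # The match is zero-width (nothing is consumed), so the "." is simply
    # inserted between the lowercase letter and the following space.
    return pattern.sub(".", text)


print(add_full_stops("He left the room She saw Robert there"))
```

Because only lookarounds are used, no capture groups or backreferences are needed in the replacement.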

shamanNS · 10-29-2011, 06:33 PM

Can someone help me create a regex for deleting a couple of lines in all the html files in the same folder (or maybe in all html files currently open in Notepad++)? I have about 200 ebooks that need editing, and each chapter is a separate html file, so manually editing them would take ages...

Here is an example of which lines should be deleted (the ones in bold letters):
link

The part in red (p1) is the only thing that varies between those html files, and it's incremental (p1, p2, p3...).
The <p> in blue at the beginning of the line should not be deleted, only the rest of the line.

Here is one set of html files (one ebook) for testing purposes: link.

P.S. I have Python installed on my Windows 7 machine, so the solution can be in the form of a Python script, if that's easier.

Thanks in advance.

theducks · 10-29-2011, 07:40 PM

The code is malformed to boot.

Code:

<P class="next">
<A HREF=p2.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>
</P>

It is missing a closing </P>

Anyway, Sigil can do this on a complete book (all HTML files).

Search:

Code:

<P class="next">\s+<A HREF=p\d+.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>

The \s+ handles the line split
the \d+ grabs any number of digits
Replace with <nothing>
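Since Python was mentioned as an option, the same search can be tried there. This is a sketch rather than a tested solution for the actual files: the name `remove_next_link` is made up, the sample HTML is modelled on the snippet above, and the dots are escaped here (an unescaped dot matches any character, so the pattern as posted also works, just more loosely).

```python
import re

NEXT_LINK = re.compile(
    r'<P class="next">\s+<A HREF=p\d+\.html>'
    r'<IMG BORDER=0 SRC=\.\./\.\./graphics/next\.gif></A>'
)


def remove_next_link(html):
    """Delete the 'next page' paragraph opener and its image link."""
    return NEXT_LINK.sub("", html)


sample = (
    "<p>Last line of the chapter.\n"
    '<P class="next">\n'
    "<A HREF=p12.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>\n"
    "</P>\n"
)
print(remove_next_link(sample))
```

The `\d+` is what lets one pattern cover p1.html, p2.html, p12.html and so on.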

shamanNS · 10-29-2011, 07:47 PM

Thanks. But what about the other bolded lines? The first one ending with .css, and the others with some table tags and the center tag?

Serpentine · 10-29-2011, 07:59 PM

Are you planning on some sort of conversion for these files, i.e. to xhtml or html4? I ask since they don't have many closing tags and the general markup is pretty bad. All of this regex is Python (PCRE if you remove the mode flags); Notepad++ has a horrible syntax, really not worth the effort.

It's no problem to do what you propose, it just needs something like :

Code:

(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)

It's super messy, but it's the least effort - It assumes you don't have any other tables which you would like to keep! (that could be fixed, but I have a feeling these don't use (m)any tables)
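To see what that pattern strips, it can be run against a small sample in Python. A sketch under stated assumptions: the sample markup is invented from the tags mentioned in this thread, `strip_layout` is a hypothetical name, and `re.IGNORECASE` is added because the files appear to use upper-case tags while the pattern is mostly lower-case.

```python
import re

CLEANUP = re.compile(
    r"(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>"
    r"|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)",
    re.IGNORECASE,  # assumption: tag case varies between files
)


def strip_layout(html):
    """Remove the stylesheet link, table scaffolding and 'next' link block."""
    return CLEANUP.sub("", html)


sample = """<LINK REL=stylesheet HREF=../../style.css>
<center><table width=97%>
<td><tr>
<h2>Chapter 1</h2>
<p>Some text.
</TD></TR></TABLE>
<P class="next">
<A HREF=p2.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>
</CENTER>
"""
print(strip_layout(sample))
```

The heading and paragraph text survive; the now-empty lines are left behind and can be ignored or collapsed in a later pass.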

If you are planning to convert, something like :

Quote:

I assume this is paragraph text

should become :

Quote:

<p>I assume this is paragraph text</p>

And here's some regex for that :

Code:

Find : (?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)
Replace : <p>\g<paratex></p>

It preserves bare <br> lines as blank paragraphs; however, it does not capture the final tag at the bottom of the file. You could clean that up relatively easily with another match.
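In Python the Find/Replace pair above maps directly onto `re.sub`, with `\g<paratex>` referring to the named group. A minimal sketch: `to_paragraphs` is an illustrative name, and the sample lines are a guess at what the source files look like, since the original example markup did not survive in the thread.

```python
import re

PARA = re.compile(
    r"(?mi)^(<(?P<tag>br|p)>(?P<spaces>(?:&nbsp;| )+)(?P<paratex>[^\n\r]+)$|^<br>$)"
)


def to_paragraphs(html):
    # A bare <br> line matches the second alternative; "paratex" is then
    # unmatched and substitutes as an empty string, giving "<p></p>".
    return PARA.sub(r"<p>\g<paratex></p>", html)


sample = "<br>&nbsp;&nbsp;I assume this is paragraph text\n<br>\n<p> Another line"
print(to_paragraphs(sample))
```

The `(?m)` flag is what makes `^` and `$` work per line rather than once for the whole file.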

The next one will do pretty much the same thing for headings.

Code:

Find : (?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$
Replace :<\g<tag>>\g<heading></\g<tag>>
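The heading version works the same way in Python. Again a sketch, with `tidy_headings` as a made-up name and invented sample headings:

```python
import re

HEADING = re.compile(
    r"(?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$"
)


def tidy_headings(html):
    # Drop any attributes on the opening tag and make sure a closing tag
    # is present, reusing whichever tag name (h1, h2, ...) was matched.
    return HEADING.sub(r"<\g<tag>>\g<heading></\g<tag>>", html)


print(tidy_headings("<H2 align=center>Chapter One"))
print(tidy_headings("<h3>Part Two</h3>"))
```

Here `\1` in the pattern refers back to the first group, which is the named `tag` group, so an existing closing tag is only swallowed when it matches the opening one.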

After that it should be in good enough shape to run through HTML Tidy or whatever.

If you are still going slow on Monday, I'll write something to do this in batches or something - contract work til then :/

shamanNS · 10-29-2011, 08:45 PM

Those html files are all from a book publisher, distributed on CD, intended to be viewed in a web browser. And in the browser it looks OK; if I enlarge the letters with Ctrl+ it reflows nicely. There is no visible table, and I don't know why they use table tags; maybe because there is always a visible TOC in the left corner... maybe they used tables instead of frames. As far as I noticed there is only one pair of <td><tr> tags, one before the chapter title and one after all the book text, so it's like all the chapter text is in one "table" (html code wise; there is no visible table while viewing the html in a web browser). So no, there are no tables I need to keep.

Anyway, I am converting it to .mobi with Calibre (import the zipped html to Calibre, convert that to mobi) and reading it on my Kindle. I've only converted one book so far. It looks normal on the Kindle: text reflows when I change the text size, and paragraphs have indents. The book has a TOC, and I can also navigate through chapters by pressing the left/right buttons on the Kindle.
I manually removed those bolded lines and then converted it for Kindle.

Before that I tried importing the book into Calibre without touching those html files, via one html file that references all the chapter html files, and converted it to .mobi, but the book had 2 problems when viewed on the Kindle or in Mobipocket Reader for PC:
1) when moving the cursor I realized that the whole "page" is actually in one box of text (it displayed a border around the whole "page" text)
2) when I press the next page button on the Kindle it shows a blank page, and on the second button press it skips to the next chapter.

That is why I tried editing the html files and deleting the <td> and <tr> tags. After that the book looks normal, as if it were proper html. No need to clean more of the html junk in them.

I'm off to sleep now (it's 2 AM here), so I will test your regex tomorrow.

theducks · 10-29-2011, 10:07 PM

Tables are an old school (AND lazy) way to quickly force the position of content.

Be very careful when carving out chunks. You could break things big time.
Tags are usually in pairs. You want to check that you have the correct (nested) pairing in what you CARVE OUT.

Serpentine · 10-29-2011, 11:31 PM

Quote:

Originally Posted by theducks

Be very careful when carving out chunks. You could break things big time.
Tags are usually in pairs. You want to check that you have the correct (nested) pairing in what you CARVE OUT.

Haha - not this, no this is oldschool. Closing tags? you expect far too much! (I'm surprised browsers ever rendered things correctly back in the day)

That said, I'm pretty sure I can write a little script to convert these books into (simple) epub without too much effort, and since they're all from the same publisher at a set date, I'd be surprised if the files didn't have the exact same patterns. I'll try to find some time on Monday/Tuesday.

shamanNS · 10-30-2011, 05:23 AM

Yeah, all the html files have the same awful markup, but for ebooks from 2001 and from a small publisher I'm not surprised.

Serpentine · 11-01-2011, 12:58 PM

Did you resolve this problem?

I have some free time to see if I can make a little conversion script if you'd like.

shamanNS · 11-02-2011, 10:22 AM

I've managed a workflow with Notepad++ and the "Find in Files" search option. It's not a one-click/one find & replace, but it does not take much longer. I replace them one by one with these:

Code:

<A HREF=.*t.gif><\/A>
<LINK.*
<center><table width=97%>
<td><tr>
<\/TD><\/TR><\/TABLE>
<A HREF=.*t.gif><\/A>
<\/CENTER>

After that I import the html file(s)/zip into Calibre, set extra CSS for p, h2 and h3, set TOC detection for the h2 and h3 tags, and the result is perfect on my Kindle.

It seems faster to me than importing the html files into Sigil, doing the find/replace there, saving the epub, importing the epub into Calibre, and converting to mobi.
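For roughly 200 books, the same one-by-one replacements could also be scripted in Python so each folder is processed in a single pass. This is a rough sketch, not something tested on the real files: `strip_junk` and `clean_folder` are hypothetical names, the patterns are the ones listed above with the dots escaped and `\/` written as `/`, case-insensitive matching is an assumption, and so is the UTF-8 default (files from a 2001 CD may well use another encoding). Cleaned copies go to a separate folder so the originals stay untouched.

```python
import re
from pathlib import Path

# The patterns from the Notepad++ workflow, applied in the same order.
PATTERNS = [
    r"<A HREF=.*t\.gif></A>",
    r"<LINK.*",
    r"<center><table width=97%>",
    r"<td><tr>",
    r"</TD></TR></TABLE>",
    r"</CENTER>",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def strip_junk(html):
    """Apply every pattern in turn, replacing each match with nothing."""
    for pattern in COMPILED:
        html = pattern.sub("", html)
    return html


def clean_folder(source, target, encoding="utf-8"):
    """Write a cleaned copy of every .html file in source into target."""
    source, target = Path(source), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(source.glob("*.html")):
        text = path.read_text(encoding=encoding)
        (target / path.name).write_text(strip_junk(text), encoding=encoding)
        count += 1
    return count
```

A call such as `clean_folder("book1", "book1_clean")` (folder names invented) would return the number of files processed; the cleaned folder can then be zipped and imported into Calibre as before.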
