09-23-2011, 08:12 AM | #1 |
Guru
Posts: 718
Karma: 1085610
Join Date: Mar 2009
Location: Bristol, England
Device: PRS-T1, 1825PT, Galaxy Tab, One X, TF700T, Aura HD, Nexus 7
|
RegEx Help needed
I've just bought a book and it is full of sentences containing seemingly random words that begin with a capital letter.
As it's a book I've read before in paper form, I realised that these are not random capitalised words: each one is actually the start of the next sentence, and the previous word is missing its full stop. I'm not at all familiar with regex, so I was wondering what I would need to put in the S&R boxes to look for words that begin with a capital letter but are not preceded by a full stop and a space, and what I need to put in to add the full stop. There are also a few places where not only the full stop is missing but the space as well, so the two words are joined together. So I also need to know what to put in the S&R boxes to search for words that have a capital letter inside them, so that I can add a full stop and a space at that point. There are other issues as well, such as combined paragraphs containing speech from two people that should be separated, and speech marks that are not next to the spoken words (i.e. they have a space on either side) and end up on lines by themselves. Those, though, I can figure out myself. |
09-23-2011, 08:53 AM | #2 |
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
|
Sentences that miss both the full stop and the space are easy:
Find: ([a-z])([A-Z])
Replace: \1. \2
Sentences that only miss the full stop are harder, as the capitalised word might be a name. If this doesn't occur too often it's easiest just to step through with Find Next and hit Replace on those instances that aren't names. (Obviously a sentence can end in a variety of ways; this searches for hits where there is no punctuation between the words.)
Find: ([a-z]) ([A-Z])
Replace: \1. \2
Alternatively, if you can list all the names in your book, first change them all to something like '#1', '#2', etc, then Replace All using the above code and then change the # codes back to the correct names. Remember to save to a different file at each step so you can go back if you make a mistake. |
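For anyone who would rather script this than click through an S&R dialog, here is a minimal Python sketch of the two replacements above plus the name-placeholder trick. The function names and the '#0#' placeholder format are made up for illustration, and it assumes the book text contains no '#' followed by digits. Note that the joined-word pattern will also hit names like "McDonald", so it is worth reviewing the result.

```python
import re

# The two patterns from the post above; [a-z] and [A-Z] cover ASCII letters only.
JOINED = re.compile(r"([a-z])([A-Z])")         # full stop AND space missing
MISSING_STOP = re.compile(r"([a-z]) ([A-Z])")  # only the full stop missing
REPLACEMENT = r"\1. \2"


def fix_joined(text):
    """Insert '. ' between a lower-case letter and a capital joined to it."""
    return JOINED.sub(REPLACEMENT, text)


def fix_missing_stop(text):
    """Insert '.' before a space that is followed by a capital letter."""
    return MISSING_STOP.sub(REPLACEMENT, text)


def fix_missing_stop_keeping_names(text, names):
    """Hide known names behind placeholders, fix, then restore the names.

    A name that genuinely starts a sentence with a missing full stop is
    protected too, so those cases still need fixing by hand.
    """
    # Longest names first so 'Rob' cannot clobber part of 'Robert'.
    ordered = sorted(names, key=len, reverse=True)
    for i, name in enumerate(ordered):
        text = text.replace(name, "#%d#" % i)
    text = fix_missing_stop(text)
    for i, name in enumerate(ordered):
        text = text.replace("#%d#" % i, name)
    return text
```

For example, fix_missing_stop_keeping_names("He saw Robert there She waved", ["Robert"]) leaves "Robert" alone and only adds the full stop after "there".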
09-23-2011, 12:22 PM | #3 |
Wizard
Posts: 2,251
Karma: 3720310
Join Date: Jan 2009
Location: USA
Device: Kindle, iPad (not used much for reading)
|
Don't you need a \s+ or something between the lower case expression and the upper-case expression, or is this some special flavor of regexes, peculiar to Calibre? In the second place that you have the regex, it looks like a space, but I think a \s+ would be better, and more reliable.
|
09-23-2011, 01:44 PM | #4 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
\s matches any whitespace character, whereas a literal space in the pattern matches only a space. \s is only necessary if the letters might be separated by some other whitespace character, such as a tab, rather than just a space.
|
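A quick way to see the difference is to try both forms in Python's re module (used here only as a convenient test bed; the sample strings are invented). One thing to keep in mind is that \s also matches line breaks, so \s+ can reach across two lines.

```python
import re

literal_space = re.compile(r"([a-z]) ([A-Z])")     # exactly one space
any_whitespace = re.compile(r"([a-z])\s+([A-Z])")  # spaces, tabs, newlines...

with_space = "end Next"
with_tab = "end\tNext"
with_newline = "end\nNext"

# The literal space only matches the first sample.
space_hits = [bool(literal_space.search(s)) for s in (with_space, with_tab, with_newline)]

# \s+ matches all three, including the one split across two lines.
whitespace_hits = [bool(any_whitespace.search(s)) for s in (with_space, with_tab, with_newline)]
```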
09-23-2011, 09:09 PM | #5 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
You can get a bit fancier with this regex and use positive and negative lookahead/lookbehind, which gives you a chance to take care of the most common proper names in the book as well:
(?<=[a-z])(?= [A-Z](?!(obert|avid|ebecca)))
Use the trailing characters of each proper name (i.e. remove the first letter) in the last set of parentheses, separated by '|' in case that's not obvious above. Then you only need to use a single full stop as the replacement:
. |
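Here is a rough Python sketch of the lookaround approach, in case it helps to see it run. The helper names are hypothetical; the pattern itself is the one from the post, built from a list of names. Because the match is zero-width, the replacement is just "." inserted before the space. Bear in mind that the name check only looks at what follows the capital letter, so "obert" also protects any other capitalised word ending that way.

```python
import re


def build_pattern(names):
    """Build the zero-width pattern from full names such as 'Robert'."""
    if not names:
        # No names to protect: plain lookbehind/lookahead only.
        return re.compile(r"(?<=[a-z])(?= [A-Z])")
    # Drop the first letter of each name, as described in the post.
    tails = "|".join(re.escape(name[1:]) for name in names)
    return re.compile(r"(?<=[a-z])(?= [A-Z](?!(?:%s)))" % tails)


def add_full_stops(text, names):
    """Insert '.' where a lower-case letter is followed by ' Capital'."""
    return build_pattern(names).sub(".", text)


sample = "He saw Robert today She left"
fixed = add_full_stops(sample, ["Robert", "David", "Rebecca"])
```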
10-29-2011, 06:33 PM | #6 |
Guru
Posts: 886
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
|
Can someone help me create a regex for deleting a couple of lines in all the html files in the same folder (or maybe in all html files currently open in Notepad++)? I have about 200 ebooks that need to be edited, and each chapter is a separate html file, so manually editing them would take ages...
Here is an example of the lines that should be deleted (the ones in bold letters): link The part in red (p1) is the only thing that varies between those html files, and it's incremental (p1, p2, p3...). The <p> in blue at the beginning of the line should not be deleted, only the rest of the line. Here is one set of html files (one ebook) for testing purposes: link. P.S. I have Python installed on my Windows 7 machine, so the solution can be in the form of a Python script if that's easier. Thanks in advance. |
10-29-2011, 07:40 PM | #7 |
Well trained by Cats
Posts: 29,800
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
The code is malformed to boot.
Code:
<P class="next">
<A HREF=p2.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>
</P>
Anyway, Sigil can do this on a complete book (all HTML files).
Search:
Code:
<P class="next">\s+<A HREF=p\d+.html><IMG BORDER=0 SRC=../../graphics/next.gif></A>
The \d+ grabs one or more digits.
Replace with <nothing> |
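Since a Python script was mentioned as an option, here is a minimal sketch that applies the same search to every .html file in a folder. It is only an illustration: the function names, the ".bak" backup suffix and the latin-1 encoding are assumptions, and the dots in the pattern are escaped here so they match literal dots only. Like the Sigil search, it leaves the closing </P> behind.

```python
import re
from pathlib import Path

# Same search as above, with the literal dots escaped.
NEXT_LINK = re.compile(
    r'<P class="next">\s+<A HREF=p\d+\.html>'
    r'<IMG BORDER=0 SRC=\.\./\.\./graphics/next\.gif></A>'
)


def strip_next_link(html):
    """Remove the 'next page' link block from one file's text."""
    return NEXT_LINK.sub("", html)


def clean_folder(folder, encoding="latin-1"):
    """Clean every .html file in `folder`; return the names that changed.

    The encoding is a guess for old CD-era files. Each changed file is
    first copied to <name>.bak so a bad run can be undone.
    """
    changed = []
    for path in sorted(Path(folder).glob("*.html")):
        original = path.read_text(encoding=encoding)
        cleaned = strip_next_link(original)
        if cleaned != original:
            backup = path.with_name(path.name + ".bak")
            backup.write_text(original, encoding=encoding)
            path.write_text(cleaned, encoding=encoding)
            changed.append(path.name)
    return changed
```

Calling clean_folder on a copy of one book's folder first is the safer way to try it.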
10-29-2011, 07:47 PM | #8 |
Guru
Posts: 886
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
|
Thanks. But what about the other bolded lines? The first one ending with .css, and the others with some table tags and the center tag?
|
10-29-2011, 07:59 PM | #9 | ||
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Are you planning on some sort of conversion for these files, i.e. to xhtml or html4? They don't have many closing tags and the general markup is pretty bad. All of this regex is Python (PCRE if you remove the mode flags); Notepad++ has a horrible syntax, really not worth the effort.
It's no problem to do what you propose, it just needs something like:
Code:
(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>|<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)
If you are planning to convert, something like:
Code:
Find : (?mi)^(<(?P<tag>br|p)>(?P<spaces>(?: | )+)(?P<paratex>[^\n\r]+)$|^<br>$)
Replace : <p>\g<paratex></p>
The next one will do pretty much the same thing for headings.
Code:
Find : (?mi)^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$
Replace : <\g<tag>>\g<heading></\g<tag>>
If you are still going slow on Monday, I'll write something to do this in batches or something - contract work til then :/
Last edited by Serpentine; 10-29-2011 at 08:01 PM. Reason: mention the heading regex |
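As a rough illustration of how the junk-removal and heading expressions behave in Python, here is a small sketch. The constant and function names are invented, and the inline (?mi) flags are passed as re.IGNORECASE and re.MULTILINE instead; the junk pattern is assumed to be meant case-insensitively, since the files mix upper- and lower-case tags.

```python
import re

# Junk-removal expression from the post, assumed case-insensitive.
JUNK = re.compile(
    r"(<link[^<>]+>|<center>(?=\s*<table)|</?tr>|</?td>|</?table[^<>]*>|"
    r"<P[^<>]*next[^<>]*>\s*<a[^<>]*>\s*<img[^<>]*>\s*</a>\s*</center>)",
    re.IGNORECASE,
)

# Heading expression; flags replace the inline (?mi).
HEADING = re.compile(
    r"^\s*<(?P<tag>h\d)[^<>]*>(?P<heading>[^<>]+)(</\1>)?$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_junk(html):
    """Delete link, table, tr and td tags and the navigation block."""
    return JUNK.sub("", html)


def tidy_headings(html):
    """Drop heading attributes and add the closing tag if it is missing."""
    return HEADING.sub(r"<\g<tag>>\g<heading></\g<tag>>", html)
```

For instance, tidy_headings("<H2 align=center>Chapter One") gives a heading with no attributes and a matching closing tag.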
||
10-29-2011, 08:45 PM | #10 |
Guru
Posts: 886
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
|
Those html files are all from a book publisher, distributed on CD and intended to be viewed in a web browser. In the browser it looks OK; if I enlarge the text with Ctrl+ it reflows nicely. There is no visible table, and I don't know why they use table tags; maybe because there is always a TOC visible in the left corner... maybe they used tables instead of frames. As far as I noticed there is only one pair of <td><tr> tags, one before the chapter title and one after all the book text, so it's as if all of a chapter's text is in one "table" (html code-wise; there is no visible table when viewing the html in a web browser). So no, there are no tables I need to keep.
Anyway, I am converting it to .mobi with Calibre (import the zipped html to Calibre, convert that to mobi) to read on my Kindle. I've only converted one book so far. It looks normal on the Kindle: the text reflows when I change the text size, and paragraphs have indents. The book has a TOC, and I can also navigate through chapters by pressing the left/right buttons on the Kindle. I manually removed those bolded lines and then converted the book for the Kindle. Before that I tried importing the book into Calibre without touching the html files, via the one html file that references all the chapter html files, and converted it to .mobi, but the book had two problems when viewed on the Kindle or in Mobipocket Reader for PC: 1) when moving the cursor I realised that the whole "page" is actually in one box of text (it displayed a border around the whole "page" text); 2) when I press the next-page button on the Kindle it shows a blank page, and on the second press it skips to the next chapter. That is why I tried editing the html files and deleting the <td> and <tr> tags. After that the book looks normal, as if it were proper html. No need to clean more of the html junk in them. I'm off to sleep now (it's 2 AM here), so I will test your regex tomorrow. Last edited by shamanNS; 10-29-2011 at 08:47 PM. |
10-29-2011, 10:07 PM | #11 |
Well trained by Cats
Posts: 29,800
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Tables are an old-school (AND lazy) way to quickly force the position of content.
Be very careful when carving out chunks. You could break things big time. Tags are usually in pairs. You want to check that you have the correct (nested) pairing in what you CARVE OUT. |
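One cheap sanity check before and after carving things out is to count opening and closing tags. The sketch below is only a counter with a made-up function name; it does not verify nesting order, and in old HTML like this some tags (p, br) are legitimately left unclosed, so a non-zero result is a prompt to look, not proof of damage.

```python
import re


def tag_balance(html, tag_names):
    """Return {tag: opening count minus closing count} for each tag name.

    Zero means the counts agree; it says nothing about nesting order.
    """
    balance = {}
    for name in tag_names:
        escaped = re.escape(name)
        # \b stops 'tr' from also matching a longer tag such as 'track'.
        opens = re.findall(r"<%s\b[^<>]*>" % escaped, html, re.IGNORECASE)
        closes = re.findall(r"</%s\s*>" % escaped, html, re.IGNORECASE)
        balance[name] = len(opens) - len(closes)
    return balance


snippet = "<center><table width=97%>\n<td><tr>text</TD></TR></TABLE>"
result = tag_balance(snippet, ["center", "table", "td", "tr"])
```

Here the unmatched <center> shows up as the only non-zero entry.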
10-29-2011, 11:31 PM | #12 | |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
That said, I'm pretty sure I can write a little script to convert these books into (simple) epub without too much effort, and since they're all from the same publisher at a set date, I'd be surprised if the files didn't have the exact same patterns. I'll try to find some time on Monday/Tuesday. |
|
10-30-2011, 05:23 AM | #13 |
Guru
Posts: 886
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
|
Yeah, all the html files have the same awful markup, but for ebooks from 2001 from a small publisher I'm not surprised.
|
11-01-2011, 12:58 PM | #14 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Did you resolve this problem?
I have some free time to see if I can make a little conversion script if you'd like. |
11-02-2011, 10:22 AM | #15 |
Guru
Posts: 886
Karma: 10113994
Join Date: Feb 2010
Location: Serbia
Device: Kindle PW5 [bricked], Kindle PW1
|
I've managed a workflow with Notepad++ and its "Find in Files" search option. It's not a one-click, single find & replace, but it doesn't take much longer. I replace them one by one with these:
Code:
<A HREF=.*t.gif><\/A>
<LINK.*
<center><table width=97%>
<td><tr>
<\/TD><\/TR><\/TABLE>
<A HREF=.*t.gif><\/A>
<\/CENTER>
It seems faster to me than importing the html files into Sigil, doing the find/replace there, saving the epub, importing the epub into Calibre and converting to mobi. |
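For comparison, the same one-by-one sequence could be scripted roughly as below. This is a sketch, not a drop-in replacement for the Notepad++ workflow: the function name is invented, case-insensitive matching is an assumption, the repeated pattern is listed once, and it reports how many replacements each pattern made so an unexpected zero stands out.

```python
import re

# The distinct patterns from the post, in the order they were applied.
PATTERNS = [
    r"<A HREF=.*t.gif><\/A>",
    r"<LINK.*",
    r"<center><table width=97%>",
    r"<td><tr>",
    r"<\/TD><\/TR><\/TABLE>",
    r"<\/CENTER>",
]


def apply_patterns(html, patterns=PATTERNS):
    """Apply each pattern in turn; return (cleaned text, counts per pattern)."""
    counts = []
    for pattern in patterns:
        # '.' does not cross line breaks here, so '.*' stays on one line.
        html, n = re.subn(pattern, "", html, flags=re.IGNORECASE)
        counts.append(n)
    return html, counts


page = (
    "<LINK REL=stylesheet HREF=style.css>\n"
    "<center><table width=97%>\n"
    "<td><tr>\n"
    "Body text\n"
    "</TD></TR></TABLE>\n"
    "<A HREF=p2.html><IMG SRC=next.gif></A>\n"
    "</CENTER>\n"
)
cleaned, counts = apply_patterns(page)
```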
|