01-06-2011, 02:35 AM | #1 |
Groupie
Posts: 152
Karma: 474196
Join Date: Jan 2011
Location: Ottawa
Device: Kobo Aura H2O
|
Fixing broken sentences.
Hi, all. I've found out through a few other threads how to fix broken sentences left by conversions from PDF to ePub formats. Currently, I'm using:
Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1_
(The _ being a space.)
I was wondering if there is a way to add something to skip over breaks where the first letter of the second line is a capital. For example, I'd like to find this:
...blahblah</p>
<p class="calibre2">blahblah...
But not this:
...blahblah</p>
<p class="calibre2">Blahblah...
Basically, this would help me a lot while trying to fix things like scripts or screenplays, or books with multi-line chapter titles, such as:
CHAPTER 6: The Plot Thickens Ottawa
Any help would be much appreciated. Thanks in advance. |
01-06-2011, 02:44 AM | #2 | |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
Find: ([a-z])</p>\s+<p class="calibre2">([a-z])
Replace: \1_\2
(Replacing the underscore with a space.) With matching case turned on, of course.
Note you may also want to join sentences ending in commas, colons, etc. That is why some of the other expressions in threads here are more complex than just looking for paragraphs ending with [a-z].
Last edited by kiwidude; 01-06-2011 at 02:48 AM. |
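For anyone who wants to try the expression outside Sigil first, here is a minimal sketch using Python's `re` module. The sample strings and the `join_broken` helper are made up for illustration; Python matches case-sensitively by default, which corresponds to having "match case" turned on.

```python
import re

# Lowercase letter, paragraph break, lowercase letter (the expression from the post).
PATTERN = re.compile(r'([a-z])</p>\s+<p class="calibre2">([a-z])')

def join_broken(html: str) -> str:
    # \1 and \2 restore the two captured letters with a single space between them.
    return PATTERN.sub(r"\1 \2", html)

# Hypothetical sample: a sentence split across two paragraphs.
broken = '<p class="calibre2">the quick brown</p>\n<p class="calibre2">fox jumped</p>'
# Hypothetical sample: second paragraph starts with a capital, so it is left alone.
heading = '<p class="calibre2">blahblah</p>\n<p class="calibre2">Blahblah</p>'

joined = join_broken(broken)
untouched = join_broken(heading)
print(joined)
print(untouched)
```

The second sample comes back unchanged because `[a-z]` refuses the capital B.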
|
01-06-2011, 03:25 AM | #3 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
I have been using this a lot with good results. With some sources you have to inspect the code (via Sigil), as sometimes calibre2 has to be changed to calibre[some other number], and sometimes there is also a span class thingie to search for
e.g. Code:
</p>\s+<p class="calibre2"><span class="none">([a-z]) |
01-06-2011, 05:05 AM | #4 |
frumious Bandersnatch
Posts: 7,539
Karma: 19001081
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I don't know if the regexp in Sigil allows something like "[^A-Z]" to match anything but an uppercase letter (which would match lowercase letters, as well as quote marks, parentheses, dashes...).
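Whether Sigil's search accepts it is not confirmed here, but in Python's `re` module a negated class behaves as described. A small sketch, with invented sample text:

```python
import re

# Second group accepts anything that is NOT an uppercase ASCII letter:
# lowercase letters, quote marks, parentheses, dashes and so on.
NOT_UPPER = re.compile(r'([a-z])</p>\s+<p class="calibre2">([^A-Z])')

def join_unless_capital(html: str) -> str:
    return NOT_UPPER.sub(r"\1 \2", html)

# Hypothetical samples: one continuation starts with a curly quote, one with a capital.
quoted = '<p class="calibre2">she said</p>\n<p class="calibre2">“quietly” and left</p>'
capital = '<p class="calibre2">she left</p>\n<p class="calibre2">Then it rained</p>'

joined_quote = join_unless_capital(quoted)
kept_capital = join_unless_capital(capital)
print(joined_quote)
print(kept_capital)
```

One thing to watch: `[^A-Z]` also accepts `<`, so a following paragraph that opens with another tag would be joined as well.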
|
01-06-2011, 11:26 AM | #5 |
Well trained by Cats
Posts: 30,490
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
My un-wrap line Regex
Find: ([\w",])</p>\s+<p class="calibre2">([\w"“…])
Replace: \1 \2
Letters, commas, (curly) quotes. Not perfect.
Code:
ask Samuel if |
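A rough Python equivalent of this un-wrap expression, for experimenting on a copy of the text. The sample paragraphs are invented around the "ask Samuel if" fragment; note that in Python `\w` covers letters of either case, digits and the underscore, so a continuation starting with a capital is joined too.

```python
import re

# Paragraph ends in a word character, straight quote or comma, and the next one
# starts with a word character, straight quote, curly opening quote or ellipsis.
UNWRAP = re.compile(r'([\w",])</p>\s+<p class="calibre2">([\w"“…])')

def unwrap(html: str) -> str:
    return UNWRAP.sub(r"\1 \2", html)

# Hypothetical sample: three fragments of one paragraph, then a real new paragraph.
sample = (
    '<p class="calibre2">I will ask</p>\n'
    '<p class="calibre2">Samuel if he knows,</p>\n'
    '<p class="calibre2">“or not.” The end.</p>\n'
    '<p class="calibre2">Next paragraph.</p>'
)
result = unwrap(sample)
print(result)
```

The break after "The end." survives because a full stop is not in the first character class.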
01-07-2011, 06:15 AM | #6 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Is it safe to wildcard the calibre2 bit?
e.g. would this work?
([\w",])</p>\s+<p class="calibre\d+">([\w"“…])
Or will that cause it to mess with titles & chapter headers?
I see that different books have different class names. Some do not even have calibre+digit(s); they have a different naming structure, e.g. I have seen class="MsoPlainText". So maybe find
([\w",])</p>\s+<p class="[A-Za-z2-9]*">([\w"“…])
Will that exclude calibre1?
On a related issue, I have a book with far too much space between the chapter header & the start of the text. The code uses 3 consecutive instances of
<p class="MsoPlainText"> </p>
How do I test for 3 consecutive instances of that line, and replace them with only 1, or maybe 2 instances?
Last edited by cybmole; 01-07-2011 at 06:32 AM. |
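To see what each class-name wildcard actually lets through, the two fragments can be tried against a handful of class names in Python. The list of names below is made up for illustration; only calibre2 and MsoPlainText come from the thread.

```python
import re

# The two class-name fragments from the post, tested on their own.
DIGIT_SUFFIX = re.compile(r'calibre\d+')
NO_ZERO_ONE = re.compile(r'[A-Za-z2-9]*')

# Hypothetical class names of the kind a converted book might contain.
classes = ["calibre1", "calibre2", "calibre10", "calibre12", "MsoPlainText"]

# fullmatch() requires the whole class name to fit the pattern.
digit_hits = [c for c in classes if DIGIT_SUFFIX.fullmatch(c)]
no01_hits = [c for c in classes if NO_ZERO_ONE.fullmatch(c)]

print(digit_hits)
print(no01_hits)
```

So `[A-Za-z2-9]*` does skip calibre1, but it also skips every class name containing a 0 or a 1 anywhere, such as calibre10 or calibre12.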
01-07-2011, 07:31 AM | #7 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
You ask "will this work". Like any regular expression related to paragraph matching, there is always the possibility of edge cases it catches that you don't want it to. The "wider" you make your regex, the more likely that is to happen. If you intend to step through each find/replace one at a time so you can undo any that you don't want, then you can experiment with it. However, I find each and every document is different depending on how many times it has been converted in the past, manual editing, what its original format was, what settings/program was used to convert it along the way, etc. So long as you don't expect to stumble onto the holy grail of regexes that fixes all the problems for every document... it doesn't exist.
In answer to your second question, yes you can do it. Just paste the text you want to find three times, separated by \s+. e.g.
<p class="MsoPlainText"> </p>\s+<p class="MsoPlainText"> </p>\s+<p class="MsoPlainText"> </p> |
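The longhand approach can be sketched in Python like this. It assumes the empty paragraphs contain a literal space, as shown in the post (in a real file it may be a non-breaking space or entity instead), and the surrounding heading and text are invented.

```python
import re

# One "blank" paragraph, exactly as quoted in the thread (assumed literal space inside).
BLANK = '<p class="MsoPlainText"> </p>'

# Longhand: the same text three times, separated by \s+.
THREE_BLANKS = re.compile(
    re.escape(BLANK) + r'\s+' + re.escape(BLANK) + r'\s+' + re.escape(BLANK)
)

# Hypothetical chapter opening with three blank paragraphs after the heading.
html = (
    '<h2>Chapter 1</h2>\n'
    + '\n'.join([BLANK] * 3)
    + '\n<p class="MsoPlainText">It began.</p>'
)

# Put a single blank paragraph back in place of the three.
result = THREE_BLANKS.sub(BLANK, html)
print(result)
```

Using `BLANK + '\n' + BLANK` as the replacement would leave two instances instead of one.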
01-07-2011, 08:04 AM | #8 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
Then I have to put the expression once only into replace - no shorthand for that? PS: I ask only because I am trying to learn shorthand expressions, not because it will save a lot of time. The code you have given me has worked perfectly - thanks. |
|
01-07-2011, 08:32 AM | #9 | |
frumious Bandersnatch
Posts: 7,539
Karma: 19001081
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Quote:
where *** stands for whatever expression you want matched 3 times, but you have to take newlines into account. Have a look here |
|
01-07-2011, 08:48 AM | #10 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Thinking it through - the annoying section structure is ABABA, where A is the expression and B is the line feed stuff.
So if I find (AB){2} and replace with nothing, I should end up with only one instance of expression A. I already fixed up the text using your longhand version though, so I cannot easily test that now. |
01-07-2011, 05:13 PM | #11 | |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
However, yes, I believe you could also do something like
(<p class="MsoPlainText"> </p>\s+){3}
In this case you don't have to worry about ABABAB, because Sigil reformats the document anyway, so it does not matter if the last B (the spaces rendered as a newline in code view) gets replaced. |
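Both shorthand ideas, (AB){2} replaced with nothing and (AB){3} replaced with a single A, can be compared in Python. This is only a sketch on an invented sample; it puts the newline back by hand, since Sigil's own reformatting is not reproduced here.

```python
import re

# A is the blank paragraph from the thread; B is the whitespace after it.
A = '<p class="MsoPlainText"> </p>'
AB = '(' + re.escape(A) + r'\s+)'

# Hypothetical sample: three blank paragraphs followed by real text.
html = '\n'.join([A, A, A, '<p class="MsoPlainText">It began.</p>'])

# Remove two A+B pairs and leave the third A where it is.
drop_two = re.sub(AB + '{2}', '', html)

# Match all three A+B pairs and write one A plus a newline back.
keep_one = re.sub(AB + '{3}', A + '\n', html)

print(drop_two)
print(keep_one)
```

On this sample both routes end with exactly one blank paragraph before the text.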
|
01-09-2011, 03:14 AM | #12 |
Groupie
Posts: 152
Karma: 474196
Join Date: Jan 2011
Location: Ottawa
Device: Kobo Aura H2O
|
Awesome. Using a variety of these seems to be working well for me. Thanks a million, guys.
|
01-13-2011, 06:23 PM | #13 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
anything that does not should not be followed by a </p>. Previously I'd been looking for lines that began mid-sentence, i.e. that began with a lowercase letter, but really there is no need to test the 1st character of the next line; just test the previous "line" end to determine if it is a true "end". So I am now getting good results with this:
Find: ([Ia-z,])</p>\s*<p>
Replace: \1 plus a single space
which bypasses the calibre tags issue. I could expand the range to test for digits / capitalized words but have not yet needed to. |
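A quick way to check this "test only the line end" idea is to run it in Python on a small sample. The text below is invented; the pattern is the one from the post.

```python
import re

# Paragraph ends in a capital I, a lowercase letter or a comma,
# followed by a bare <p> with no class attribute.
LINE_END = re.compile(r'([Ia-z,])</p>\s*<p>')

def join_on_line_end(html: str) -> str:
    # \1 keeps the final character; the trailing space replaces the tags.
    return LINE_END.sub(r'\1 ', html)

# Hypothetical sample with bare <p> tags.
plain = '<p>When I</p>\n<p>arrived, the door</p><p>was open.</p>\n<p>Nobody answered.</p>'
# Hypothetical sample where the paragraphs carry a class attribute.
classed = '<p class="calibre2">one</p>\n<p class="calibre2">two</p>'

joined_plain = join_on_line_end(plain)
joined_classed = join_on_line_end(classed)
print(joined_plain)
print(joined_classed)
```

The second sample is untouched: a literal `<p>` in the pattern does not match `<p class="calibre2">`, so the expression only helps when the paragraph tags carry no attributes.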
|
01-13-2011, 07:24 PM | #14 | |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
The theory of what you say is indeed what the OP on this thread was doing with their first post. However, as has been mentioned before, there are other "line endings" you would need to test for, such as punctuation characters (colons, semi-colons, hyphens), numeric amounts, etc. Your regex also wouldn't include uppercase words, foreign language characters and so on. Also, unless you step through each match, any poems laid out line by line in your book will get trashed.
Expressions earlier in this thread and in other similar ones can improve the readability of most paragraphs. However, imho people do need to be reminded that the expressions in this thread will not catch "every" situation, nor should they just blindly do "Replace All" because they saw a regex in a thread that someone said worked for them.
Last edited by kiwidude; 01-13-2011 at 07:30 PM. |
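One way to "step through" outside Sigil is to list every would-be join with some context before replacing anything. The sketch below does that in Python; the `preview` helper, its `width` parameter and the three-line poem are all invented for illustration.

```python
import re

# A line-end style expression of the kind discussed in this thread.
PATTERN = re.compile(r'([a-z,])</p>\s*<p>')

def preview(html: str, width: int = 20) -> list:
    """Return (text before, text after) for each place a join would happen."""
    hits = []
    for m in PATTERN.finditer(html):
        before = html[max(0, m.start() - width):m.start() + 1]
        after = html[m.end():m.end() + width]
        hits.append((before, after))
    return hits

# Hypothetical poem laid out one line per paragraph.
poem = (
    '<p>Rain on the roof,</p>\n'
    '<p>a kettle on the stove</p>\n'
    '<p>and nowhere to be.</p>'
)

hits = preview(poem)
for before, after in hits:
    print(repr(before), '->', repr(after))
```

Both line breaks of the poem show up as candidates, which is exactly the kind of match a blind "Replace All" would flatten into one line.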
|
01-14-2011, 03:44 AM | #15 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
otherwise the expression should be something like
([Ia-z,])</p>\s*<p class="calibre2">
replace with \1 and a trailing space after the \1.
Points taken about poems, & about blindly applying - I usually do a few find-replace cycles before hitting Replace All, & if I do screw up I close Sigil, discarding all changes, & start over. |
|