10-29-2012, 05:57 AM | #1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2012
Device: pc, kindle
|
Regex find and replace
Hi, I'm trying to clean up a PDF converted to EPUB in Calibre that is filled with line breaks. Using Sigil's regex search/replace, is there any expression I could use in Replace that will keep a character from the original string? I can fix most of the line breaks instantly with Replace All, but I also lose the character found by the search. See below: I'm looking for a line break followed by a lowercase letter (and whatever formatting code is in between).
Search: </p> <p class="calibre1">[(a-w)]
Replace: " "
Result:
The cat drank from the milk bowl
The cat drank from the ilk bowl |
10-29-2012, 06:11 AM | #2 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Search: </p> <p class="calibre1">([a-w])
Replace: \1
There's a space before the \1.
Do you purposely mean to exclude words that start with a lowercase x, y, or z? Why not [a-z]? |
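For anyone following along outside Sigil, here is a minimal sketch of the same capture-and-put-back idea using Python's `re` module. The sample string is made up from the example in the first post, and Python's regex flavor is assumed to behave like Sigil's for this simple pattern.

```python
import re

# Made-up sample: a paragraph break that splits one sentence in two.
html = 'The cat drank from the</p> <p class="calibre1">milk bowl'

# The parentheses capture the lowercase letter so the replacement can restore it.
pattern = r'</p> <p class="calibre1">([a-z])'

# Replacement: one space, then the captured letter (\1).
joined = re.sub(pattern, r' \1', html)
print(joined)
```

The letter after the break is kept because it is written back by `\1` rather than being swallowed by the match.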
10-29-2012, 06:25 AM | #3 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2012
Device: pc, kindle
|
Oh, I meant a-z; it was a typo...
I tried what you said and it replaced the character with a literal \1, like '\1ilk'. |
10-29-2012, 06:40 AM | #4 |
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
If you get \1 in your text after a replace, it means you haven't correctly specified the () parentheses in your Find text. For instance, in the very first text you posted, you had the brackets the wrong way around: it should have been ([a-z]) instead of the [(a-z)] you typed.
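To illustrate the difference, here is a rough sketch in Python rather than Sigil. One caveat: where Sigil leaves a literal \1 in the text, Python's `re` module raises an error when the replacement refers to a group that does not exist, so the symptom differs even though the cause is the same.

```python
import re

text = 'the</p> <p class="calibre1">milk'

# [(a-z)] is a character class: it matches ONE character that is "(", ")" or a-z.
# It captures nothing, so there is no group 1 to refer back to.
wrong = re.compile(r'</p> <p class="calibre1">[(a-z)]')

# ([a-z]) is a capture group wrapped around a character class.
right = re.compile(r'</p> <p class="calibre1">([a-z])')

print(wrong.groups)  # number of capture groups in the first pattern
print(right.groups)  # number of capture groups in the second pattern

# With no group to reference, Python refuses the \1 replacement outright.
try:
    wrong.sub(r' \1', text)
    backreference_failed = False
except re.error:
    backreference_failed = True

fixed = right.sub(r' \1', text)
print(fixed)
```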
If you are using Sigil 0.6, then I instead recommend you right-click on the "Find" box and, under "Example Searches", choose "Join Paragraphs". It isn't quite the same as the case you are looking to catch, but most of the time it will achieve the same thing (or improve upon it). The difference is that the expression in this example search looks for sentences that have unfinished endings, whereas you are looking for sentences that have unfinished beginnings. There are still some edge cases it will not catch, such as conversation text that has a finished sentence but unclosed quotes, but it is better than most. And unlike your approach, it will catch a situation like this:
<p>The reason</p> <p>Bob did this was...
Of course, since the original PDF may have OCR errors (like stray commas), or there may be genuine reasons for the text having a new paragraph (like poetry), you should never do a blanket Replace All with such an expression, but it is better than starting from scratch.
Last edited by kiwidude; 10-29-2012 at 06:57 AM. Reason: Missing slash |
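As a rough illustration of the "unfinished endings" idea, here is a small Python sketch. The pattern is a hypothetical one written for this example and is not the expression that ships with Sigil's "Join Paragraphs" search; it only treats a paragraph as unfinished when the character before </p> is not sentence-ending punctuation, a closing quote, or the end of an inline tag.

```python
import re

# Hypothetical pattern (NOT Sigil's built-in one): capture the last character
# of a paragraph when it is not ., !, ?, :, a closing quote, or ">",
# followed by </p>, optional whitespace, and the next opening <p ...> tag.
UNFINISHED_END = re.compile(r'([^.!?:"”>])</p>\s*<p[^>]*>')

def join_unfinished(html: str) -> str:
    """Join paragraphs whose text stops without closing punctuation."""
    # Put the captured character back and replace the break with one space.
    return UNFINISHED_END.sub(r'\1 ', html)

sample = '<p>The reason</p> <p>Bob did this was...</p> <p>Next one.</p>'
result = join_unfinished(sample)
print(result)
```

Note that this catches "The reason" / "Bob did this was..." even though the second paragraph starts with a capital letter, which a lowercase-after-break search would miss. It is still something to step through match by match rather than Replace All.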
10-29-2012, 06:53 AM | #5 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2012
Device: pc, kindle
|
Wow, thanks. I did have the parentheses and brackets mixed up.
Yeah, I have searched both ways, with lowercase on either end of the line break, and I also run searches for conversations (,"), commas, and hyphens, and then I run a search for uppercase letters that I'll probably have to review by hand, because it could be a dropped period rather than a line break. Normally I try to do it all by hand, but this is a 900-page document. I'll probably ruin poetry if I Replace All, but I can manually fix it, I suppose. But thanks, this will make it much quicker now.
For the ..., if I confine myself to (a-z) and ," I won't mess that up. I usually run a manual search for ... and capital letters and see if it is the end of a sentence or a mid-sentence pause.
Last edited by SanatyrZeo; 10-29-2012 at 06:57 AM. Reason: additions |
10-29-2012, 07:03 AM | #6 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2012
Device: pc, kindle
|
Omg 3000 changes at once, thought Sigil was going to hang for a second!
Also I think searching for lowercase letters AFTER a line break will help avoid joining lines that are meant to end with no punctuation like quote attributions or chapter names. Last edited by SanatyrZeo; 10-29-2012 at 07:06 AM. Reason: ps |
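A small Python sketch of that last point, using made-up sample markup: a break followed by a capital letter (such as the one after a chapter name) is left alone, while a break followed by a lowercase letter is joined. `subn` is used here so the number of changes can be checked before trusting the result.

```python
import re

# Same idea as the search in this thread: a break followed by a lowercase letter.
LOWER_START = re.compile(r'</p>\s*<p class="calibre1">([a-z])')

# Made-up sample: a chapter name with no punctuation, then a split sentence.
sample = (
    '<p class="calibre1">Chapter One</p> '
    '<p class="calibre1">The cat drank from the</p> '
    '<p class="calibre1">milk bowl.</p>'
)

# subn returns the new text plus how many replacements were made.
joined, count = LOWER_START.subn(r' \1', sample)
print(count)
print(joined)
```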