12-06-2012, 01:48 PM | #1 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
regex (.*) not liking hidden characters
Trying to fix a book where div has been used rather than p throughout, so the book layout is thousands of lines/paragraphs like these:

<div class="c3">
some body text beginning on a new line, followed by the closing div tag, also on a new line
</div>

I would expect this to work:
find <div class="c3">(.*)</div>
replace all <p class="c3">\1</p>
but I get no matches.

To get the regex to work, I have to carefully copy & paste in whatever hidden characters are separating the div tags from the body text, i.e. whatever is causing the line breaks. The (.*) regex then works as expected once it is within the linebreak characters.

So is this
a) just a very badly formatted source
b) some side effect of pretty print / tidy settings
c) a bug in the regex engine or (more likely!) in my understanding of how it should work?

Now I think (from limited testing) that pretty print has no issues with <div> all on one line example </div> layouts, so it is probably not option b)?
|
12-06-2012, 02:00 PM | #2 |
calibre/Sigil Developer
Posts: 4,601
Karma: 2092290
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
You need to tick the "DotAll" option, or add (?s). It is not a bug, it is just how PCRE works for multiline expressions.
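A minimal sketch of that behaviour, using Python's re module as a stand-in for Sigil's PCRE engine (an assumption made only for illustration; the sample markup is invented to mirror the layout in the first post). In both engines "." skips newlines unless DotAll is on:

```python
import re

# Invented sample: the body text sits on its own line between the tags.
source = '<div class="c3">\nsome body text\n</div>\n'
pattern = r'<div class="c3">(.*)</div>'

# Without DotAll, "." does not match a newline, so there is no match.
no_match = re.search(pattern, source)

# With DotAll as a flag, "." also matches the line breaks.
fixed = re.sub(pattern, r'<p class="c3">\1</p>', source, flags=re.DOTALL)

# The inline form (?s) placed at the start of the pattern does the same thing.
fixed_inline = re.sub('(?s)' + pattern, r'<p class="c3">\1</p>', source)

print(no_match)
print(fixed)
```

So the "hidden characters" are just the newlines around the body text, which "." refuses to cross by default.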
|
12-06-2012, 03:03 PM | #3 |
♫
Posts: 660
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
And actually I would just replace <div with <p and let tidy do the rest
|
12-06-2012, 03:24 PM | #4 | |
Grand Sorcerer
Posts: 27,465
Karma: 192992430
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Just as easy for me to replace:
Code:
<(/?)div([^>]*?)>
with:
Code:
<\1p\2>
Last edited by DiapDealer; 12-06-2012 at 03:42 PM. |
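For anyone wanting to try that pair outside Sigil, here is a rough sketch with Python's re module (an assumption for illustration only; the sample text is made up). Group 1 picks up the optional "/" of a closing tag and group 2 picks up any attributes, so opening and closing tags are both handled by one expression and no DotAll is needed:

```python
import re

# Invented sample with one classed div and one bare div.
source = '<div class="c3">\nfirst paragraph\n</div>\n<div>\nsecond\n</div>\n'

# \1 is "" for an opening tag and "/" for a closing tag; \2 is the attribute text.
converted = re.sub(r'<(/?)div([^>]*?)>', r'<\1p\2>', source)

print(converted)
```

The text between the tags is never matched at all, so line breaks inside the div do not matter.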
|
12-06-2012, 03:50 PM | #5 |
Evangelist
Posts: 490
Karma: 1665031
Join Date: Nov 2010
Location: Vancouver Island, Nanaimo
Device: K2 (retired), Kobo Touch (passed to the wife), KGlo, Galaxy TabPro
|
This is one I use for most of my search and replace where there is a start and end tag or character (such as quotes):
Find: (?s)<div(.*?)</div>
Replace: <p\1</p>

Things I have learned from those more familiar with Regex and Sigil than myself:
(?s) lets the search run over multiple lines (the dot will also match line breaks).
(.*?) matches whatever comes next and stops at the first instance found of what follows it. In the above, it looks for the </div> and stops the search at the very 1st one found.
Without the ? I have had instances where it does not stop at the first instance found, and I have ended up with 2 or 3 paragraphs and sometimes the entire chapter highlighted. |
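The difference between stopping at the first </div> and running on to the last one can be seen in a small sketch. This uses Python's re module rather than Sigil (an assumption for illustration; the two-paragraph sample is invented):

```python
import re

# Invented sample: two consecutive div "paragraphs".
source = '<div class="c3">\none\n</div>\n<div class="c3">\ntwo\n</div>\n'

# Greedy (.*) runs to the LAST </div>, so both paragraphs become one match.
greedy = re.findall(r'(?s)<div(.*)</div>', source)

# Lazy (.*?) stops at the FIRST </div>, giving one match per paragraph.
lazy = re.findall(r'(?s)<div(.*?)</div>', source)

converted = re.sub(r'(?s)<div(.*?)</div>', r'<p\1</p>', source)

print(len(greedy), len(lazy))
print(converted)
```

With the greedy form the whole sample is swallowed as a single match, which is the "entire chapter highlighted" effect.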
12-06-2012, 04:11 PM | #6 |
Grand Sorcerer
Posts: 27,465
Karma: 192992430
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I used to use similar F&R, Danger, but that approach burned me too many times when stuff was nested--as divs can be. Plus, I've learned to be appropriately afraid of relying too heavily on the potential greediness of (.*?). So now, when the actual tags are what is needing replaced, I don't waste time trying to match/capture any-and-all text those tags might contain. I just match/capture/replace the tags themselves. To each their own though... that's the beauty of regex.
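A rough sketch of the nesting problem, again using Python's re module in place of Sigil (an assumption for illustration; the nested sample is invented). The paired find stops at the first </div> it meets, which belongs to the inner div, so the outer opening tag gets paired with the wrong closing tag:

```python
import re

# Invented sample: one div nested inside another.
source = '<div class="outer"><div class="c3">text</div></div>'

# Matching open tag + content + close tag pairs the OUTER <div with the INNER </div>.
paired = re.sub(r'(?s)<div(.*?)</div>', r'<p\1</p>', source)

# Replacing only the tags themselves keeps every open/close pair in step.
tags_only = re.sub(r'<(/?)div([^>]*?)>', r'<\1p\2>', source)

print(paired)
print(tags_only)
```

The first result leaves a stray inner <div and a stray </div> behind. The second is at least balanced, though a p nested inside a p is still something to review by hand afterwards.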
|
12-06-2012, 05:02 PM | #7 |
Evangelist
Posts: 490
Karma: 1665031
Join Date: Nov 2010
Location: Vancouver Island, Nanaimo
Device: K2 (retired), Kobo Touch (passed to the wife), KGlo, Galaxy TabPro
|
Hmm, never thought of nested tags. Yah I can see where that would burn you if you don't pay attention or do a blind find/replace all. Something I very quickly learned NOT to do unless I am absolutely positive it will be ok.
Thanks for the heads up, so far I haven't had any nested tags in the books I've recently been fixing up but I do know I have some books that do have them that I will be fixing. Always learning something here |
12-16-2012, 06:44 PM | #8 | |
Addict
Posts: 254
Karma: 69786
Join Date: May 2006
Location: Oslo, Norway
Device: Kobo Aura, Sony PRS-650
|
Quote:
Version your files, and always do a visual inspection + validate immediately after a replace even though you're sure it will be OK. Regexes are too useful not to be applied to html, even if you might invoke a few elder horrors |
|