Need help for a regex

wobohohoho · 12-29-2012, 01:14 AM

Hello.
I'm trying to find and replace elements in HTM documents from a decompiled CHM to make chapter headings in order to create a TOC. The unique identifiers for sub-chapters are as follows:

Code:

      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>

I used find: <div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div> and replace: <div class="TLV1" \1<h2 class="HD">\1</h2> for these instances, but it's not perfect.
The expression above also matches a few non-subchapter elements (box items), which look like this:

Code:

        <div class="SIDEBAR BOX">
          <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
            <div class="HD">
              The jugular venous systems
            </div><a id="T5-2"></a>

The constant part of the class name is SIDEBAR (there are also SIDEBAR LIST, etc.).

To add to the complexity, there is one more form of sub-chapter heading, which my original search string does not pick up:

Code:

      
      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door
        </div>

What search string would match both of the forms I want while ignoring elements marked SIDEBAR?

wobohohoho · 12-29-2012, 01:52 AM

Quote:

Originally Posted by wobohohoho

I used find: <div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div> and replace: <div class="TLV1" \1<h2 class="HD">\1</h2> for these instances, but it's not perfect.

Sorry, the replace string should be <div class="TLV1" \1<h2 class="HD">\2</h2>.
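
For anyone who wants to try the find/replace outside an editor, here is a minimal Python sketch using the find string from the first post and both replace strings. It assumes Python's `re` module behaves like the editor's regex engine for this pattern, which may not hold for every tool. The names `SAMPLE`, `FIND`, `REPLACE` and `TYPO` are only illustrative.

```python
import re

# Sample markup copied from the first post.
SAMPLE = '''      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>
'''

# Find string as posted (including the literal space before the HD div).
FIND = r'<div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div>'

# \1 is the TLV1 attribute text (up to and including ">"), \2 is the heading text.
REPLACE = r'<div class="TLV1" \1<h2 class="HD">\2</h2>'
# The replace string as first posted: \1 twice, so the heading text is lost.
TYPO = r'<div class="TLV1" \1<h2 class="HD">\1</h2>'

result = re.sub(FIND, REPLACE, SAMPLE)
typo_result = re.sub(FIND, TYPO, SAMPLE)

print(result)
print(typo_result)
```

With the corrected string the heading text ends up inside the h2; with the first version the attribute text is repeated there instead.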

wobohohoho · 12-29-2012, 02:26 AM

I've partly solved the second part of the problem: matching the sub-chapters whose markup looks like this:

Code:

      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door
        </div>

With find:
<div class="TLV1"\s+(.*?)\s+<div class="HD"(\s+(.*?)\s+)(\s+(.*?)\s+)</div>
And replace:
<div class="TLV1" \1<h2 class="HD"\2\4</h2>

But this picks up SIDEBAR elements as well. So any search string that ignores the word SIDEBAR should work for both forms.
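
One possible way to skip the sidebars, sketched in Python: in the sidebar sample from the first post, the inner TLV1 div has SIDEBAR in its own id_xpath attribute, so a negative lookahead over the attribute text can reject it. This is only a sketch. It assumes every sidebar TLV1 carries SIDEBAR in its attributes like that sample does, and that the regex engine in use supports lookaheads. `HEADING` and `REPLACE` are illustrative names, not something from the thread.

```python
import re

HEADING = re.compile(
    # \1: TLV1 attributes, rejected as soon as "SIDEBAR" appears in them
    r'<div class="TLV1"((?:(?!SIDEBAR)[^>])*)>'
    # \2: HD attributes, which may be empty
    r'\s*<div class="HD"([^>]*)>'
    # \3: heading text, trimmed of surrounding whitespace
    r'\s*(.*?)\s*</div>',
    re.DOTALL,
)
REPLACE = r'<div class="TLV1"\1><h2 class="HD"\2>\3</h2>'

# The three samples from the thread, with indentation shortened.
PLAIN = '''<div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
  <div class="HD">
    Taking a history
  </div>'''

WITH_ATTRS = '''<div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
  <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
    On being busy: Corrigan's secret door
  </div>'''

SIDEBAR = '''<div class="SIDEBAR BOX">
  <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
    <div class="HD">
      The jugular venous systems
    </div><a id="T5-2"></a>'''

for sample in (PLAIN, WITH_ATTRS, SIDEBAR):
    print(HEADING.sub(REPLACE, sample))
    print("---")
```

The first two samples come out with an h2 (keeping any attributes the HD div had), and the sidebar sample is left as it was.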

Serpentine · 12-31-2012, 11:25 PM

I did one of these conversions a while back. I'm not sure why you're trying to preserve the sidebar; it's not going to render correctly on most readers anyway. Strip it all out and just use a real ToC instead.

wobohohoho · 01-02-2013, 04:42 AM

Thanks for the tip.

I'm just preserving the markup because I'm not too sure what it's for, though it may be related to the JavaScript used, particularly for the inline CHM TOC (which didn't work in the CHM anyway, for some reason).

However, for my particular case I decided to just mark up all SIDEBAR headings as h3, making them sub-elements of the sub-chapters. That lets me use the regexes I've already found to work, rather than needing one all-encompassing regex to differentiate them.
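
As a rough illustration of the h2/h3 split, here is a Python sketch that does it in one pass with a replacement function. This is not necessarily how it was done in the editor. It assumes sidebar headings can be recognised by SIDEBAR appearing in the TLV1 div's attributes, as in the sample from the first post, and `promote` and `promote_headings` are hypothetical helper names.

```python
import re

HEADING = re.compile(
    r'<div class="TLV1"([^>]*)>\s*<div class="HD"([^>]*)>\s*(.*?)\s*</div>',
    re.DOTALL,
)

def promote(match):
    """Build the replacement: h3 for sidebar headings, h2 for the rest."""
    tlv_attrs, hd_attrs, title = match.groups()
    # Assumption: "SIDEBAR" in the TLV1 attributes marks a box item.
    tag = "h3" if "SIDEBAR" in tlv_attrs else "h2"
    return f'<div class="TLV1"{tlv_attrs}><{tag} class="HD"{hd_attrs}>{title}</{tag}>'

def promote_headings(html):
    """Rewrite every TLV1/HD pair found in the given markup."""
    return HEADING.sub(promote, html)

DOC = '''<div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
  <div class="HD">
    Taking a history
  </div>
<div class="SIDEBAR BOX">
  <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
    <div class="HD">
      The jugular venous systems
    </div><a id="T5-2"></a>'''

out = promote_headings(DOC)
print(out)
```

Sub-chapter headings become h2 and box headings become h3, so a TOC generator that works from heading levels can nest the boxes under their sub-chapter.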



