MobileRead Forums - Need help for a regex

wobohohoho · 12-29-2012, 01:14 AM

Hello.
I'm trying to find and replace elements in HTM documents from a decompiled CHM to make chapter headings in order to create a TOC. The unique identifiers for sub-chapters are as follows:

Code:

      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>

I used find: <div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div> and replace: <div class="TLV1" \1<h2 class="HD">\2</h2> for these instances, but it's not perfect.
The expression above also matches a few non-subchapter elements (box items), which look like this:

Code:

        <div class="SIDEBAR BOX">
          <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
            <div class="HD">
              The jugular venous systems
            </div><a id="T5-2"></a>

The constant part of the class name is SIDEBAR (there are also SIDEBAR LIST, etc.).

To add to the complexity, there is one more form of sub-chapter heading, which my original search string does not pick up:

Code:

      
      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door
        </div>

What search string would match both sub-chapter forms and ignore elements marked SIDEBAR?
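
One possible approach, sketched in Python rather than in any particular editor's find/replace dialog. The pattern and the names HEADING and promote_headings are hypothetical, not taken from an existing tool. It relies on an assumption drawn only from the samples above: every TLV1 block nested in a box carries the word SIDEBAR in its id_xpath attribute, so a negative lookahead on the opening tag can skip it. Allowing optional attributes after class="HD" covers the second sub-chapter form.

```python
import re

# Hypothetical pattern (an assumption, not a tested solution for the whole book).
# It assumes sidebar TLV1 blocks always have "SIDEBAR" inside their opening tag,
# as in the id_xpath of the sample box item.
HEADING = re.compile(
    r'(<div class="TLV1"(?![^>]*SIDEBAR)[^>]*>\s*)'  # group 1: TLV1 tag without SIDEBAR
    r'<div class="HD"([^>]*)>\s*'                     # group 2: optional HD attributes
    r'(.*?)\s*</div>',                                # group 3: heading text
    re.DOTALL,  # let the heading text span more than one line
)


def promote_headings(html: str) -> str:
    """Turn sub-chapter HD divs into h2 elements, leaving sidebar boxes alone."""
    return HEADING.sub(r'\1<h2 class="HD"\2>\3</h2>', html)


# The three samples from the post, joined into one string.
sample = '''
<div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
  <div class="HD">
    Taking a history
  </div>
<div class="SIDEBAR BOX">
  <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
    <div class="HD">
      The jugular venous systems
    </div><a id="T5-2"></a>
<div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
  <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
    On being busy: Corrigan's secret door
  </div>
'''

result = promote_headings(sample)
print(result)
```

The same expression should carry over to any regex engine that supports negative lookahead, provided "dot matches newline" is enabled or the heading text stays on one line. If some sidebar blocks turn out not to mention SIDEBAR in the TLV1 tag itself, the lookahead will not catch them and a different test would be needed.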
