Old 12-29-2012, 02:14 AM   #1
wobohohoho began at the beginning.
Posts: 15
Karma: 10
Join Date: Dec 2012
Location: KL, Malaysia
Device: Freda (WP 7.8) EPUB reader app
Need help for a regex

I'm trying to find and replace elements in HTM documents from a decompiled CHM to make chapter headings in order to create a TOC. The unique identifiers for sub-chapters are as follows:

      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
I used this find expression: <div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div> with this replacement: <div class="TLV1" \1<h2 class="HD">\2</h2> for these instances, but it's not perfect.
The expression also matches a few non-subchapter elements (box items), which look like this:
        <div class="SIDEBAR BOX">
          <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
            <div class="HD">
              The jugular venous systems
            </div><a id="T5-2"></a>
The constant part of the class name is SIDEBAR (there are also SIDEBAR LIST, etc.).
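
To make the false positive concrete, here is a minimal Python sketch. It assumes Python's re module treats the find expression quoted above the same way the editor's regex engine does, and it adds a closing </div> to the first sample for illustration. The helper name heading_of is hypothetical.

```python
import re

# The find expression from the post, unchanged.
ORIGINAL = re.compile(
    r'<div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div>'
)

# Wanted: a real sub-chapter heading (closing </div> added as an assumption).
subchapter = '''      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>'''

# Not wanted: a TLV1 block nested inside a SIDEBAR box.
sidebar = '''        <div class="SIDEBAR BOX">
          <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
            <div class="HD">
              The jugular venous systems
            </div><a id="T5-2"></a>'''


def heading_of(html):
    """Return the heading text captured by the original pattern, or None."""
    match = ORIGINAL.search(html)
    return match.group(2) if match else None


print(heading_of(subchapter))  # the wanted heading is found
print(heading_of(sidebar))     # the sidebar heading is found too, which is the problem
```

Both calls return a heading, because nothing in the expression looks at the SIDEBAR part of the id_xpath attribute.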

To add to the complexity, there is one more form of sub-chapter heading, which my original search string does not pick up because the HD div carries extra attributes:

      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door

What search string would match both sub-chapter forms while ignoring elements inside a SIDEBAR?
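
One possible direction, sketched in Python as an untested-on-the-real-files guess. It assumes that every unwanted TLV1 block has SIDEBAR somewhere in its id_xpath attribute (as in the sample above) and that every HD block is closed by </div>. The names HEADING and promote_headings are hypothetical.

```python
import re

# Assumptions (not confirmed against the real files): every unwanted TLV1 has
# "SIDEBAR" in its id_xpath attribute, and every HD block ends with </div>.
HEADING = re.compile(
    r'(<div class="TLV1"(?![^>]*SIDEBAR)[^>]*>\s*)'  # TLV1 tag with no SIDEBAR in its attributes
    r'<div class="HD"([^>]*)>\s*'                    # HD tag, with or without extra attributes
    r'(.*?)\s*</div>',                               # heading text up to the closing </div>
    re.DOTALL,
)


def promote_headings(html):
    """Turn the HD div of each non-sidebar TLV1 block into an <h2>."""
    return HEADING.sub(r'\1<h2 class="HD"\2>\3</h2>', html)


# Sample built from the snippets in this post; closing </div> tags are assumed.
sample = '''<div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
  <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
    On being busy: Corrigan's secret door
  </div>
<div class="SIDEBAR BOX">
  <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
    <div class="HD">
      The jugular venous systems
    </div><a id="T5-2"></a>'''

print(promote_headings(sample))
```

The negative lookahead (?![^>]*SIDEBAR) only inspects the TLV1 opening tag itself, so it depends on the id_xpath attribute rather than on the surrounding SIDEBAR div. The ([^>]*) group carries any id and id_xpath attributes of the HD div over to the new h2.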