12-29-2012, 01:14 AM   #1
wobohohoho
Member
Posts: 11
Karma: 10
Join Date: Dec 2012
Location: KL, Malaysia
Device: Freda (Windows Phone) EPub reader app
Need help for a regex

Hello.
I'm trying to find and replace elements in the HTM files from a decompiled CHM, turning sub-chapter titles into headings so that I can generate a TOC. Sub-chapters are marked up as follows:

Code:
      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>
I used find: <div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div> and replace: <div class="TLV1" \1<h2 class="HD">\1</h2> for these instances, but it's not perfect.
The expression above also matches a few elements that are not sub-chapters (box items). For example, they look like this:
Code:
        <div class="SIDEBAR BOX">
          <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
            <div class="HD">
              The jugular venous systems
            </div><a id="T5-2"></a>
The constant part is the word SIDEBAR (there's also SIDEBAR LIST, etc.).

To add to the complexity, there is a second sub-chapter pattern that my original search string cannot pick up:

Code:
      
      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door
        </div>

What search string would match both sub-chapter patterns while ignoring elements marked SIDEBAR?
12-29-2012, 01:52 AM   #2
wobohohoho
Quote:
Originally Posted by wobohohoho
I used find: <div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div> and replace: <div class="TLV1" \1<h2 class="HD">\1</h2> for these instances, but it's not perfect.
Sorry, the replace should be <div class="TLV1" \1<h2 class="HD">\2</h2> (the second backreference is \2, not \1).
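For anyone who wants to check the corrected pair outside Sigil, here is a minimal sketch in Python. It runs the find expression from the first post, with the corrected replace, against the two sub-chapter samples. The names PLAIN, WITH_ATTRS, FIND, REPLACE and convert are made up for illustration, and Python's re module is only a stand-in for Sigil's own regex engine, so behaviour there may differ.

```python
import re

# First sub-chapter sample from the thread: the HD div has no attributes.
PLAIN = '''      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>
'''

# Second sub-chapter sample: the HD div carries id and id_xpath attributes.
WITH_ATTRS = '''      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door
        </div>
'''

# Find expression exactly as posted (note the literal space before <div class="HD">).
FIND = r'<div class="TLV1"\s+(.*?)\s+ <div class="HD">\s+(.*?)\s+</div>'
# Corrected replace: \1 is the TLV1 attribute text, \2 is the heading title.
REPLACE = r'<div class="TLV1" \1<h2 class="HD">\2</h2>'


def convert(html: str) -> str:
    """Apply the find/replace pair to a chunk of markup."""
    return re.sub(FIND, REPLACE, html)


print(convert(PLAIN))
# The second form is left untouched, because the pattern requires a bare <div class="HD">.
print(convert(WITH_ATTRS) == WITH_ATTRS)
```

The first sample ends up with its title wrapped in an h2, while the second sample comes back unchanged, which is the gap described in the first post.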
12-29-2012, 02:26 AM   #3
wobohohoho
I've partly solved the second part of the problem, matching the sub-chapters whose markup looks like this:

Code:
      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door
        </div>
With find:
<div class="TLV1"\s+(.*?)\s+<div class="HD"(\s+(.*?)\s+)(\s+(.*?)\s+)</div>
And replace:
<div class="TLV1" \1<h2 class="HD"\2\4</h2>

But this picks up SIDEBAR elements as well. So what I still need is a search string that ignores anything marked SIDEBAR and works with both sub-chapter patterns.
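One possible way to cover both patterns and skip the box items is a negative lookahead on the TLV1 tag. This is only a sketch, and it rests on an assumption drawn from the samples in this thread: a box item's TLV1 div mentions SIDEBAR in its id_xpath attribute, while a real sub-chapter's does not. The sample constants, FIND, REPLACE and convert are illustrative names, and the code is Python rather than Sigil. Sigil's regex engine should accept the same lookahead syntax, but that is untested here.

```python
import re

PLAIN = '''      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>
'''

WITH_ATTRS = '''      <div class="TLV1" id="B01306002.0-90" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]">
        <div class="HD" id="H10-1" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[18]/HD[1]">
          On being busy: Corrigan's secret door
        </div>
'''

BOXED = '''        <div class="SIDEBAR BOX">
          <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
            <div class="HD">
              The jugular venous systems
            </div><a id="T5-2"></a>
'''

# (?![^>]*SIDEBAR) rejects a TLV1 tag whose attributes mention SIDEBAR.
# ([^>]*) after class="HD" makes the HD attributes optional, covering both forms.
FIND = (
    r'<div class="TLV1"(?![^>]*SIDEBAR)([^>]*)>\s*'
    r'<div class="HD"([^>]*)>\s*(.*?)\s*</div>'
)
# \1 = TLV1 attributes, \2 = HD attributes (may be empty), \3 = heading title.
REPLACE = r'<div class="TLV1"\1><h2 class="HD"\2>\3</h2>'


def convert(html: str) -> str:
    """Turn sub-chapter HD divs into h2 headings, leaving SIDEBAR items alone."""
    return re.sub(FIND, REPLACE, html)


for sample in (PLAIN, WITH_ATTRS, BOXED):
    print(convert(sample))
```

The title is matched with a plain `(.*?)`, so a heading that wraps over several lines would be missed. That fits the samples here but may not fit every file in the book.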
12-31-2012, 11:25 PM   #4
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
I did one of these conversions a while back. I'm not sure why you're trying to preserve the sidebar; it's not going to render correctly on most readers anyway. Strip it all out and just use a real ToC instead.
01-02-2013, 04:42 AM   #5
wobohohoho
Thanks for the tip.

I'm only preserving the markup because I'm not sure what it's for. It may be there for the JavaScript the CHM uses, particularly for its inline TOC (which didn't work in the CHM anyway, for some reason).

For my particular case, however, I decided to just turn all SIDEBAR headings into h3, which makes them sub-elements of the sub-chapters. That lets me keep using the regexes I've already found to work, rather than having to tell the two apart with one all-encompassing regex.
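The h2/h3 split described above could also be done in a single pass if the conversion is scripted rather than run through Sigil's Find & Replace. The following Python sketch is one way to do it, not necessarily how it was done in this case. It assumes, based on the samples in this thread, that box items can be recognised by SIDEBAR appearing in the TLV1 div's attributes. HEADING, promote, convert and DOC are hypothetical names.

```python
import re

# Matches a TLV1 div followed by its HD div, with or without HD attributes.
HEADING = re.compile(
    r'<div class="TLV1"([^>]*)>\s*<div class="HD"([^>]*)>\s*(.*?)\s*</div>'
)


def promote(match: re.Match) -> str:
    """Build the replacement: h3 for box items, h2 for real sub-chapters."""
    tlv1_attrs, hd_attrs, title = match.groups()
    # Assumption: SIDEBAR in the TLV1 attributes (id_xpath) marks a box item.
    tag = "h3" if "SIDEBAR" in tlv1_attrs else "h2"
    return f'<div class="TLV1"{tlv1_attrs}><{tag} class="HD"{hd_attrs}>{title}</{tag}>'


def convert(html: str) -> str:
    return HEADING.sub(promote, html)


# A sub-chapter followed by a box item, both taken from the samples in the thread.
DOC = '''      <div class="TLV1" id="B01306002.0-103" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[2]">
        <div class="HD">
          Taking a history
        </div>
        <div class="SIDEBAR BOX">
          <div class="TLV1" id="B01306002.0-167" id_xpath="/CHAPTER[1]/TBD[1]/TLV1[7]/SIDEBAR[2]/TLV1[1]">
            <div class="HD">
              The jugular venous systems
            </div><a id="T5-2"></a>
'''

print(convert(DOC))
```

Using a replacement function instead of a replacement string is what lets the tag name depend on the matched attributes, which a plain find/replace pair cannot do in one go.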
All times are GMT -4.

