MobileRead Forums


Doitsu 03-29-2020 04:49 AM

Searching within tags
 
Quote:

Originally Posted by carmenchu (Post 3969408)
Well: so far, in 'non greedy' mode,
(?<=\>)\b([^<]+)(?=\</) selects between tags, not nested
(?<=\>)\b([^<]+)(?=\<) skips tags.
Useful when the mouse gets temperamental, and one wishes to manually extract/move some text.
:thumbsup: for the Sigil User Guide and the links to regex references

If you have basic programming skills, you could also write an ad-hoc Sigil plugin using the BeautifulSoup library, which is bundled with Sigil, to manipulate tags. (The Sigil API documentation is here.)
This will save you the hassle of coming up with complex regular expressions.

For example the following minimal plugin code:

Spoiler:
Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, os
from sigil_bs4 import BeautifulSoup

def run(bk):

    # get all html files
    for (html_id, href) in bk.text_iter():
        file_name = os.path.basename(href)
        html = bk.readfile(html_id)
       
        # convert html to soup
        soup = BeautifulSoup(html, 'html.parser')
        orig_html = str(soup)
       
        # get all span tags
        spans = soup.find_all('span')
        for span in spans:
            if 'class' in span.attrs:
                if 'Calibre13' in span['class']:
                    # remove class attribute
                    del span['class']
                    # change <span> to <b>
                    span.name = 'b'
                else:
                    # unwrap <span> tags with other classes (tags removed, contents kept)
                    span.unwrap()
            else:
                # unwrap <span> tags w/o classes (tags removed, contents kept)
                span.unwrap()

        # update file with changes
        if str(soup) != orig_html:
            bk.writefile(html_id, str(soup))
            print(file_name, 'updated')

    print('Done')
    return 0



will look for <span> tags with a Calibre13 class and replace them with <b> tags. (All other <span> tags will be unwrapped, i.e. the tags themselves are removed but their contents are kept.)

Before:

Code:

<p>This should be <span class="Calibre6 Calibre13 Calibre2">bolded</span>. <span class="Calibre2">This span is redundant</span> <span>and this span should also be deleted.</span></p>

After:

Code:

<p>This should be <b>bolded</b>. This span is redundant and this span should also be deleted.</p>

If you want to test the plugin code:
  • Create a MyPlugin folder in the Sigil plugins folder
  • Save the plugin code as plugin.py in that folder.
  • Create a plugin.xml file with the following contents:
    Spoiler:
    Code:

    <?xml version="1.0" encoding="UTF-8"?>
    <plugin>
        <name>MyPlugin</name>
        <type>edit</type>
        <autostart>true</autostart>
        <author>carmenchu</author>
        <description>bs4 test</description>
        <engine>python3.4</engine>
        <version>0.0.1</version>
        <oslist>unx,win,osx</oslist>
    </plugin>


    and also save it in the MyPlugin folder.
(To run the plugin, select Plugins > Edit > MyPlugin.)
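
The same steps can also be scripted. What follows is only a rough sketch: scaffold_plugin is a made-up helper name, and the location of the Sigil plugins folder is an assumption you have to supply yourself (it differs per operating system and is not hard-coded here). The plugin.xml contents are the ones listed above.

```python
import os

# plugin.xml template, copied from the listing above.
PLUGIN_XML = """<?xml version="1.0" encoding="UTF-8"?>
<plugin>
    <name>{name}</name>
    <type>edit</type>
    <autostart>true</autostart>
    <author>{author}</author>
    <description>bs4 test</description>
    <engine>python3.4</engine>
    <version>0.0.1</version>
    <oslist>unx,win,osx</oslist>
</plugin>
"""


def scaffold_plugin(plugins_dir, plugin_code, name='MyPlugin', author='carmenchu'):
    """Create <plugins_dir>/<name>/ containing plugin.py and plugin.xml.

    plugins_dir is the Sigil plugins folder (hypothetical parameter: pass
    whatever path your own installation uses). Returns the new folder path.
    """
    folder = os.path.join(plugins_dir, name)
    os.makedirs(folder, exist_ok=True)

    # Step 2: save the plugin code as plugin.py
    with open(os.path.join(folder, 'plugin.py'), 'w', encoding='utf-8') as f:
        f.write(plugin_code)

    # Step 3: save plugin.xml next to it
    with open(os.path.join(folder, 'plugin.xml'), 'w', encoding='utf-8') as f:
        f.write(PLUGIN_XML.format(name=name, author=author))

    return folder
```

Sigil may need a restart before the new entry shows up in the Plugins menu.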

carmenchu 04-06-2020 11:16 AM

Quote:

Originally Posted by Doitsu (Post 3969485)
If you have basic programming skills, you could also write an ad-hoc Sigil plugin using the BeautifulSoup library, which is bundled with Sigil, to manipulate tags. (The Sigil API documentation is here.)...

Thanks, this is very useful for the plugin I am trying to write.
I do need a little help with the syntax to make this modified code work, though:
Spoiler:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, os
from sigil_bs4 import BeautifulSoup

def run(bk):

    # get all html files
    for (html_id, href) in bk.text_iter():
        file_name = os.path.basename(href)
        html = bk.readfile(html_id)

        # convert html to soup
        soup = BeautifulSoup(html, 'html.parser')
        orig_html = str(soup)

        # get all i tags
        italics = soup.find_all('i') # how for 'i', 'b', 'small', 'br', 'h1/2/3...'
        for i in italics:
            if 'class' in i.attrs:
                print(file_name, 'found') # finds
                if 'calibre' in i['class']:
                    # remove class attribute
                    print(file_name, 'found attrib') # doesn't find "calibre3"
                    del i['class']
                    # # change <span> to <b>
                    # span.name = 'b'
                # else:
                    # # delete <span> tags with other classes
                    # span.unwrap()
            # else:
                # # delete <span> tags w/o classes
                # span.unwrap()

        # update file with changes
        if str(soup) != orig_html:
            bk.writefile(html_id, str(soup))
            print(file_name, 'updated')

    print('Done')
    return 0

1. how to pass to soup.find_all() a list of tags as argument
2. how to rework
Code:

if 'calibre' in tag['class']
so that it would match a substring, i.e., 'calibre15'.
3. Would the code work as well for selecting <meta... /> tag by 'name' and deleting it? How?
Maybe it's trivial, but I am green: Python 2.x for GIMP is the farthest I have gone, and I couldn't make anything of your link :(
Thanks!

* Sorry for the delay: too many irons in the fire...
** Is this getting off topic? (Would it be better in the plugins forum?)

Doitsu 04-06-2020 11:45 AM

Quote:

Originally Posted by carmenchu (Post 3972970)
1. how to pass to soup.find_all() a list of tags as argument

You can use a list as the search parameter. For example:
Code:

tags = soup.find_all(['i', 'b', 'small', 'br'])
Quote:

Originally Posted by carmenchu (Post 3972970)
2. how to rework
Code:

if 'calibre' in tag['class']
so that it would match a substring, i.e., 'calibre15'.

You can use a regular expression with find_all().
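
A minimal sketch of how that might look, assuming sigil_bs4 behaves like stock BeautifulSoup 4 here (it is a fork, and this has not been checked against every Sigil release). The original test fails because class is a multi-valued attribute: tag['class'] is a list such as ['calibre3'], so 'calibre' in tag['class'] looks for a list item equal to 'calibre', not for a substring. The names CALIBRE_CLASS and drop_calibre_classes are made up for this example.

```python
import re

# Assumption: any class name containing "calibre" should go
# (calibre, calibre3, calibre15, ...). Anchor the pattern if that is too broad.
CALIBRE_CLASS = re.compile(r'calibre')


def drop_calibre_classes(soup, tag_names, pattern=CALIBRE_CLASS):
    """Strip class names matching `pattern` from the given tags.

    Other class names on the same tag are kept. Returns the number of
    tags that were changed.
    """
    changed = 0
    # class_ accepts a compiled regex; it is tested against each class name.
    for tag in soup.find_all(tag_names, class_=pattern):
        kept = [c for c in tag['class'] if not pattern.search(c)]
        if kept:
            tag['class'] = kept
        else:
            # no classes left, so drop the attribute entirely
            del tag['class']
        changed += 1
    return changed


def run(bk):
    # Imported here so the helper above can also be tried outside Sigil.
    from sigil_bs4 import BeautifulSoup

    for (html_id, href) in bk.text_iter():
        soup = BeautifulSoup(bk.readfile(html_id), 'html.parser')
        if drop_calibre_classes(soup, ['i', 'b', 'small']):
            bk.writefile(html_id, str(soup))
            print(href, 'updated')
    return 0
```

If you only need the check inside an existing loop, any(CALIBRE_CLASS.search(c) for c in tag['class']) does the same substring test without a second find_all() call.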

Quote:

Originally Posted by carmenchu (Post 3972970)
3. Would the code work as well for selecting <meta... /> tag by 'name'

Yes, they're treated like all other tags.

Quote:

Originally Posted by carmenchu (Post 3972970)
[...]and deleting it? How?

You can delete tags with decompose(). However, since this modifies the "soup," it should be done last.
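
A short sketch of the <meta> case, again assuming sigil_bs4 matches stock BeautifulSoup 4 for these calls. One catch: name is already the first parameter of find_all() (the tag name), so an attribute that happens to be called name has to be passed through attrs. remove_meta is a hypothetical helper and 'generator' is only an illustrative value.

```python
def remove_meta(soup, meta_name):
    """Delete every <meta name="meta_name" ... /> tag.

    Returns the number of tags removed.
    """
    # find_all() returns a plain list, so removing tags from the soup
    # while looping over that list is safe.
    metas = soup.find_all('meta', attrs={'name': meta_name})
    for meta in metas:
        # decompose() removes the tag and its contents from the tree
        meta.decompose()
    return len(metas)


# Inside run(bk), after all other edits to the soup have been made:
#     remove_meta(soup, 'generator')
```

Unlike unwrap(), decompose() throws away the tag's contents as well, which is what you want for an empty element like <meta />.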

Quote:

Originally Posted by carmenchu (Post 3972970)
** Is this getting off topic? (Would it be better in the plugins forum?)

I also think it should be moved to plugins.

