MobileRead Forums


KevinH 12-14-2015 02:49 PM

Post your Useful Plugin Code Fragments Here
 
Please reserve this thread for plugin developers and others to share their code fragments useful for Sigil plugins. Any questions about them should be directed to the Plugin Development "sticky" thread.

Thanks!

KevinH

KevinH 12-15-2015 10:54 AM

Using the built-in QuickParser to parse OPF metadata
 
Code:

    # Example of using the provided stream based QuickParser
    # to parse metadataxml (to look for cover id)
    # Also rebuilds the metadata xml in res
    ps = bk.qp
    ps.setContent(bk.getmetadataxml())
    res = []
    coverid = None
    # parse the metadataxml, store away cover_id and rebuild it
    for text, tagprefix, tagname, tagtype, tagattr in ps.parse_iter():
        if text is not None:
            # print(text)
            res.append(text)
        else:
            # print(tagprefix, tagname, tagtype, tagattr)
            if tagname == "meta" and tagattr.get("name",'') == "cover":
                coverid = tagattr["content"]
            res.append(ps.tag_info_to_xml(tagname, tagtype, tagattr))
    original_metadata = "".join(res)
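
For comparison, the same cover lookup can be sketched with the standard library alone. This is not the QuickParser API: it assumes the string returned by bk.getmetadataxml() is a well-formed metadata element with its namespace prefixes declared, and find_cover_id is a hypothetical helper name. Unlike the loop above, it does not rebuild the metadata xml.

```python
import xml.etree.ElementTree as ET


def find_cover_id(metadata_xml):
    """Return the content of <meta name="cover" .../>, or None if absent."""
    root = ET.fromstring(metadata_xml)
    for elem in root.iter():
        # Strip any "{namespace}" prefix so that both plain and
        # namespaced <meta> elements are matched.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'meta' and elem.get('name') == 'cover':
            return elem.get('content')
    return None


# Hypothetical sample data standing in for bk.getmetadataxml()
sample = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Example</dc:title>'
    '<meta name="cover" content="cover-img"/>'
    '</metadata>'
)
print(find_cover_id(sample))
```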


rubeus 12-15-2015 02:21 PM

How to get width and height from an image?
 
You need either:

Python 3 or later with the PIL (Pillow) library installed

or

Sigil's internal (bundled) Python interpreter, available from Sigil 0.9.0 and up.

Code:

from PIL import Image
from io import BytesIO

Code:

    for (id, href, mime) in bk.image_iter():
        im = Image.open(BytesIO(bk.readfile(id)))
        (width, height) = im.size
        print('id={} href={} mime={} width={} height={}'.format(id, href, mime, width, height))
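
If PIL is not available, the size of a PNG can also be read straight from its header. The following is a stdlib-only sketch that handles PNG and nothing else (no JPEG, GIF or SVG). png_size is a hypothetical helper name, and the input is assumed to be the raw bytes of the image, such as the bytes passed to BytesIO above.

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_size(data):
    """Return (width, height) from PNG bytes, or None if not a PNG."""
    # Layout: 8-byte signature, 4-byte chunk length, b'IHDR',
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b'IHDR':
        return None
    return struct.unpack('>II', data[16:24])


# Build just enough of a PNG header to exercise the function.
header = (PNG_SIGNATURE + struct.pack('>I', 13) + b'IHDR'
          + struct.pack('>II', 640, 480))
print(png_size(header))
```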


DiapDealer 01-02-2016 02:51 PM

Creating self-deleting temp folders with python's contextmanager:

Code:

from contextlib import contextmanager

@contextmanager
def make_temp_directory():
    import tempfile
    import shutil
    temp_dir = tempfile.mkdtemp()
    yield temp_dir
    shutil.rmtree(temp_dir)

Then in your plugin, you can simply do something like:
Code:

with make_temp_directory() as temp_dir:
    # do stuff with things in the temp_dir, for example:
    print(temp_dir)

It's not perfect, but barring any untrapped errors (or platform-specific permission problems), "temp_dir" will delete itself after completion of the with statement.
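
A possible variant that also cleans up when the body of the with statement raises is sketched below. It wraps the yield in try/finally and passes ignore_errors=True to shutil.rmtree, so a file that cannot be removed is skipped rather than raising. Python 3.2 and later also ship tempfile.TemporaryDirectory, which can be used as a context manager in much the same way.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def make_temp_directory():
    temp_dir = tempfile.mkdtemp()
    try:
        yield temp_dir
    finally:
        # Runs whether the with-body finished normally or raised.
        # ignore_errors=True skips files that cannot be removed.
        shutil.rmtree(temp_dir, ignore_errors=True)


# Simulate an error inside the with block and check the cleanup.
seen = None
try:
    with make_temp_directory() as temp_dir:
        seen = temp_dir
        raise RuntimeError('simulated failure inside the with block')
except RuntimeError:
    pass
print(os.path.exists(seen))
```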

slowsmile 12-17-2016 06:12 AM

Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:


Code:

import os

try:
    from sigil_bs4 import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup


def fixHTML(work_dir, file):

    output = os.path.join(work_dir, 'clean_html.htm')
    with open(file, 'rt', encoding='utf-8') as infp:
        html = infp.read()

    soup = BeautifulSoup(html, 'html.parser')

    # remove all unwanted proprietary attributes from the html file
    search_tags = ['p', 'span', 'div', 'body', 'a', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'br']
    search_attribs = ['dir', 'name', 'title', 'link', 'id', 'text', 'lang', 'clear']
    for tag in soup.findAll(search_tags):
        for attribute in search_attribs:
            del tag[attribute]

    # write the cleaned html to a temporary file in work_dir
    with open(output, 'wt', encoding='utf-8') as outfp:
        outfp.write(str(soup))

    # swap the cleaned copy in for the original file
    os.remove(file)
    os.rename(output, file)
    return file


DiapDealer 12-17-2016 09:29 AM

Quote:

Originally Posted by slowsmile (Post 3444379)
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:

Nice example of deleting attributes from tags with bs4, but why would "id" or "lang" attributes be considered garbage (or proprietary)? Removing "id", for instance, could break a whole bunch of links in files (html toc and ncx included). Seems a very odd attribute to want to nuke ("name" should probably be converted to "id" to prevent any possible link breakage, as well).

slowsmile 12-17-2016 07:06 PM

The 'lang' and 'id' attributes are garbage in what I'm doing at the moment. I'm currently writing a plugin to convert opendoc html to epub. This means that you have to initially remove all bookmarks and the TOC from the html as part of the html clean up process. My plugin app then regenerates a new TOC on conversion to epub. And apart from the lang declaration in the html header namespace, the lang attributes within the html code itself also seem to be completely superfluous. I've never seen 'lang' used in epubs within the html code.

I've also read that the 'name' attribute is now also deprecated, which is why 'id' should always be used in epubs now.

DiapDealer 12-17-2016 08:18 PM

Quote:

Originally Posted by slowsmile (Post 3444651)
The 'lang' and 'id' attributes are garbage in what I'm doing at the moment. I'm currently writing a plugin to convert opendoc html to epub. This means that you have to initially remove all bookmarks and the TOC from the html as part of the html clean up process. My plugin app then regenerates a new TOC on conversion to epub.

No problem. As I said, it's a very useful snippet for deleting attributes with bs4, I was just nervous about folks thinking of the "id" attribute as garbage or proprietary. :)

Quote:

Originally Posted by slowsmile (Post 3444651)
And apart from the lang declaration in the html header namespace, the lang attributes within the html code itself also seem to be completely superfluous. I've never seen 'lang' used in epubs within the html code.

Multi-language epubs (or epubs that just display other languages) can make use of it extensively. It's why Sigil's spellchecking is being enhanced to parse the lang attribute in the html. You might not ever encounter it, but it's not really that rare.

Quote:

Originally Posted by slowsmile (Post 3444651)
I've also read that the 'name' attribute is now also deprecated, which is why 'id' should always be used in epubs now.

It is deprecated, but it will often still "work." That's why converting "names" to "id" can be beneficial when working with cluttered/proprietary/old html.
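
The name-to-id conversion suggested here could look roughly like the sketch below. It uses the standard library's ElementTree rather than bs4, so it assumes the input is well-formed XHTML, and anchor_names_to_ids is a hypothetical helper name. Only a elements are touched, so that name attributes on elements such as meta are left alone.

```python
import xml.etree.ElementTree as ET


def anchor_names_to_ids(root):
    """Turn <a name="x"> into <a id="x">; return how many ids were added."""
    changed = 0
    for elem in root.iter():
        # Ignore any "{namespace}" prefix on the tag name.
        if elem.tag.rsplit('}', 1)[-1] != 'a' or 'name' not in elem.attrib:
            continue
        name = elem.attrib.pop('name')
        # Keep an existing id; otherwise reuse the old name as the id.
        # (If both were present with different values, links to the old
        # name would need fixing separately.)
        if 'id' not in elem.attrib:
            elem.set('id', name)
            changed += 1
    return changed


root = ET.fromstring(
    '<body><p><a name="ch1"/>Chapter 1</p><a href="#ch1">go</a></body>'
)
print(anchor_names_to_ids(root))
print(ET.tostring(root, encoding='unicode'))
```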

slowsmile 12-17-2016 10:16 PM

@DiapDealer...Thanks for the info. I was unaware that 'lang' was used that much in epubs so I guess I've learned something. I know that the html text is in utf-8 whereas I think the tag text is more or less ascii. So I'm slightly surprised that you need the 'lang' attribute everywhere in the html because I thought that utf-8 could be defined regionally for different languages within the epub html with the help of python. I guess that utf-8 isn't used like that when you use python in an html app.

Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5. I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.

Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.

Doitsu 12-18-2016 05:21 AM

Quote:

Originally Posted by slowsmile (Post 3444379)
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:

BTW, bs4 returns the attributes as an attrs dictionary and if you're absolutely sure that you don't need any of them you could delete them all at once by assigning an empty dictionary to attrs.

Here's a minimalist proof-of-concept example:

Spoiler:
Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from sigil_bs4 import BeautifulSoup

def run(bk):
    # get all (X)HTML files
    for (html_id, href) in bk.text_iter():
        html = bk.readfile(html_id)
        soup = BeautifulSoup(html, 'html.parser')
        orig_soup = str(soup)
       
        for tag in soup.find_all(True):
            if tag.name not in ['style', 'a', 'nav', 'link', 'html', 'svg', 'image', 'meta'] and tag.attrs != {}:
                tag.attrs = {}

        if str(soup) != orig_soup:
            bk.writefile(html_id, str(soup))
            print(bk.id_to_href(html_id) + ' updated.')
   
    return 0

def main():
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())



Quote:

Originally Posted by slowsmile (Post 3444732)
So I'm slightly surprised that you need the 'lang' attribute everywhere in the html [...]

You don't need to use lang attributes, unless you create a multilingual epub book, however, if you do use it, the IDPF recommends using both lang and xml:lang attributes.
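
As an illustration of keeping the two attributes in step, here is a small stdlib sketch. It uses ElementTree rather than bs4, so it assumes well-formed XHTML input, and mirror_lang is a hypothetical helper name. ElementTree stores xml:lang under the reserved XML namespace and writes it back out as xml:lang.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the reserved XML namespace.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'


def mirror_lang(root):
    """Copy lang to xml:lang (and vice versa) where only one is present."""
    for elem in root.iter():
        lang = elem.get('lang')
        xml_lang = elem.get(XML_LANG)
        if lang and not xml_lang:
            elem.set(XML_LANG, lang)
        elif xml_lang and not lang:
            elem.set('lang', xml_lang)


root = ET.fromstring('<p>He said <span lang="fr">bonjour</span>.</p>')
mirror_lang(root)
print(ET.tostring(root, encoding='unicode'))
```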

Quote:

Originally Posted by slowsmile (Post 3444732)
Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5.

The epub 2.0.1 standard is based on XHTML 1.1 and XHTML 1.1 no longer allows the use of name attributes as fragment identifiers.

Quote:

Originally Posted by slowsmile (Post 3444732)
I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.

Just because MS Word doesn't generate XHTML 1.1 compliant output doesn't mean it's OK to use it as is, even though many epub apps can handle name attributes as fragment identifiers.

Quote:

Originally Posted by slowsmile (Post 3444732)
Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.

Amazon indeed supports the upload of ebooks with MS Word generated html files, however, IMHO, that doesn't mean that they officially condone the use of the name attribute. IIRC, the Kindle Publishing Guidelines recommend using only well-formed (X)HTML files.
Based on strings found in the kindlegen binary, it also looks like KindleGen uses HTMLTidy internally to clean up all HTML files.

DiapDealer 12-18-2016 06:27 AM

For the record; I wasn't supporting the use of "name" in epubs, I was suggesting that when working with alternative content that is going to be massaged into an epub, it's better to convert any "name" attributes to "id", rather than just delete them. Parsing the content for hrefs that contain the "name" attributes as fragments should be trivial enough to determine which ones can be safely deleted.
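
One way to sketch that href check with the standard library is shown below. AnchorScanner and unreferenced_targets are hypothetical names. The example scans a single html string; in a real book the hrefs of every file (including the html toc and ncx) would have to be collected before deciding that a target is unused.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorScanner(HTMLParser):
    """Collect link targets and the fragments that hrefs point at."""

    def __init__(self):
        super().__init__()
        self.targets = set()     # id values, plus name values on <a>
        self.fragments = set()   # fragment parts found in href values

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if not value:
                continue
            if key == 'id' or (key == 'name' and tag == 'a'):
                self.targets.add(value)
            elif key == 'href':
                # A fragment pointing into another file is counted too,
                # which errs on the side of keeping a target.
                fragment = urldefrag(value).fragment
                if fragment:
                    self.fragments.add(fragment)


def unreferenced_targets(html):
    """Return the targets that no href in this html refers to."""
    scanner = AnchorScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.targets - scanner.fragments


html = ('<p><a name="used"></a><a name="unused"></a>'
        '<a href="#used">x</a><a href="other.html#elsewhere">y</a></p>')
print(unreferenced_targets(html))
```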

slowsmile 12-18-2016 06:32 AM

@Doitsu...Interesting what you say about Kindle. Theirs is a proprietary format that is closely related to epub with some peculiar quirks, similar to iBooks' proprietary version of epub. You can do that if you are a mammoth company like those two.

Here's another piece of BS code for html that I've found very useful:

Spoiler:
Code:

    # remove all anchors but preserve
    # all anchors with internet links   
    for m in soup.findAll('a'):
        if 'href="http:' in str(m) or \
          'href="https:' in str(m) or \
          'mailto:' in str(m) or \
          '@' in str(m):
            pass           
        else:
            m.replaceWithChildren()
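
The string tests above look at the whole tag, so an '@' anywhere in the link text also counts as a match. A stricter check on the href value alone might look like the sketch below. is_external_link is a hypothetical helper name, and it deliberately differs from the snippet above in one respect: a bare e-mail address without the mailto: scheme is not treated as external.

```python
from urllib.parse import urlsplit

EXTERNAL_SCHEMES = {'http', 'https', 'mailto'}


def is_external_link(href):
    """True for http, https and mailto targets; False for everything else."""
    if not href:
        return False
    return urlsplit(href.strip()).scheme.lower() in EXTERNAL_SCHEMES


for href in ('https://www.mobileread.com/', 'mailto:someone@example.com',
             '#bookmark1', 'chapter2.xhtml#sec'):
    print(href, is_external_link(href))
```

In the loop above, the condition would then become is_external_link(m.get('href')).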



In my conversion plugin, I've also noticed significant differences between ODF html rendered from OO and LO. One problem I had was clearing out all the myriad FONT, FACE and SIZE declarations in these two different ODF html versions.

I used this code to remove all SIZE = 3 attributes from the html because it was causing problems. Notice that OO uses an integer while LO uses a string numeric for the size value.

Spoiler:
Code:

    # remove all 'size = 3' font declarations from OO or LO html
    for x in soup.findAll('font'):
        if x.has_attr('size'):
            if x['size'] == "3" or x['size'] == 3:
                x.replaceWithChildren()



Both Tidy and BS have saved my bacon on many occasions. They are both remarkably useful and easy to use for processing html.

Doitsu 12-18-2016 07:03 AM

Quote:

Originally Posted by slowsmile (Post 3444852)
In my conversion plugin, I've also noticed significant differences between ODF html rendered from OO and LO. One problem I had was clearing out all the myriad FONT, FACE and SIZE declarations in these two different ODF html versions.

Before you re-invent the wheel, you might want to have a look at Writer2xhtml/Writer2LaTeX and my ODT import plugin.

slowsmile 12-18-2016 09:50 AM

I don't think that I'm re-inventing the wheel really. And even though my plugin converter will give a full conversion (upload ready) I do not regard that as its main purpose. It's just a plugin that will save you a lot of time in your conversion workflow by automatically doing all the drudge jobs like re-styling your new epub from scratch, adding metadata, adding images, creating a stylesheet etc. The plugin's main purpose is to quickly bring the plugin user to a point where he or she can just concentrate on finishing-off tasks in Sigil like final epub re-styling, embedding fonts, adding extra images, fixed layout tasks etc.

I'm also guessing that people will probably criticize the plugin and perhaps say, "Why bother when there are already good converters like Calibre, Scrivener, Jutoh etc.?" The main difference between those converters and my plugin converter is that those converters have editors, toc editors, complex settings, stylers, menus, sub-menus and pre-compiler options etc. They are complex apps that take some time to learn. The only editor my plugin app uses to style epubs is LibreOffice or OpenOffice because the plugin ports all styles -- default styles, heading styles, font styles and named styles to the epub stylesheet. It can do this because it also ports all in-tag styling to the CSS as well. So with my plugin all you have to do is style your ebook in LO or OO as you like and then, after filling in the metadata in the dialog window, just push the OK button and your html doc will convert to epub -- whose layout and styling should exactly mimic the layout and styling of the ODT version. The plugin also has a very simple interface which anyone can learn to use quickly.

DiapDealer 12-18-2016 10:50 AM

On a side-note: I sent you an email about testing your plugin on other platforms, @slowsmile. Did you receive it?

slowsmile 12-18-2016 06:12 PM

@DiapDealer...That's strange. I haven't received your email yet. I've also checked my spam etc.

I'll send you another private email giving you my email address again, just in case I typed it wrong.

slowsmile 12-18-2016 09:49 PM

@DiapDealer...I received your second email without a problem and have just emailed you some info + plugin + test file. Much thanks for your help.

slowsmile 12-22-2016 06:39 AM

Here's another interesting BeautifulSoup snippet that I've just successfully used:

Code:

    # convert all html text to block text format
    for tag in soup.findAll('p'):
        if tag.has_attr('style') and 'text-align: center' not in tag['style'].lower():
            del tag['style']
            tag['class'] = 'BlockText'

The above four lines of code delete the 'style' attribute from every p tag that has one and add the BlockText class in its place. Centered text is not affected, and neither are p tags that have no 'style' attribute.
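
The substring test only matches the exact spelling 'text-align: center', so 'text-align:center' (no space) would be treated as not centered. A more tolerant check could be sketched as follows. parse_style and is_centered are hypothetical helper names, and the splitting is naive: it does not handle semicolons inside quoted values or url(), and a value such as 'center !important' is not recognised.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(';'):
        if ':' not in part:
            continue
        prop, value = part.split(':', 1)
        # Lower-cased for comparison only; do not write these back out.
        declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def is_centered(style):
    """True if the style string sets text-align to center."""
    return parse_style(style).get('text-align') == 'center'


for style in ('text-align: center', 'TEXT-ALIGN:center; margin: 0',
              'text-align: left', ''):
    print(repr(style), is_centered(style))
```

In the loop above, the second half of the condition would then become not is_centered(tag['style']).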

DiapDealer 01-27-2020 04:50 PM

Fairly self-contained Python code to make your PyQt5-based GUI plugin match Sigil's light/dark theme. It should be compatible with any version of Sigil that supports PyQt5 plugins. The dark theme just won't appear unless you're using Sigil 1.1.0 or higher.

Code:

def dark_palette(bk, app):
    supports_theming = (bk.launcher_version() >= 20200117)
    if not supports_theming:
        return
    if bk.colorMode() != "dark":
        return
    try:
        from PyQt5.QtCore import Qt
        from PyQt5.QtGui import QColor, QPalette
        from PyQt5.QtWidgets import QStyleFactory
    except ImportError:
        return

    p = QPalette()
    sigil_colors = bk.color
    dark_color = QColor(sigil_colors("Window"))
    disabled_color = QColor(127,127,127)
    dark_link_color = QColor(108, 180, 238)
    text_color = QColor(sigil_colors("Text"))
    p.setColor(p.Window, dark_color)
    p.setColor(p.WindowText, text_color)
    p.setColor(p.Base, QColor(sigil_colors("Base")))
    p.setColor(p.AlternateBase, dark_color)
    p.setColor(p.ToolTipBase, dark_color)
    p.setColor(p.ToolTipText, text_color)
    p.setColor(p.Text, text_color)
    p.setColor(p.Disabled, p.Text, disabled_color)
    p.setColor(p.Button, dark_color)
    p.setColor(p.ButtonText, text_color)
    p.setColor(p.Disabled, p.ButtonText, disabled_color)
    p.setColor(p.BrightText, Qt.red)
    p.setColor(p.Link, dark_link_color)
    p.setColor(p.Highlight, QColor(sigil_colors("Highlight")))
    p.setColor(p.HighlightedText, QColor(sigil_colors("HighlightedText")))
    p.setColor(p.Disabled, p.HighlightedText, disabled_color)

    app.setStyle(QStyleFactory.create("Fusion"))
    app.setPalette(p)

Then after you initialize your QApplication and before you show/exec it:

Code:

app = QApplication(sys.argv)

Add the following function call that takes the BookContainer object and QApplication object as parameters:

Code:

dark_palette(bk, app)

That's about it. Building your PyQt5 Application Widgets is up to you.

I may add some platform-specific tweaks and bug workarounds from time to time. I know for a fact there are some Mac color issues.


All times are GMT -4.
