Old 02-24-2012, 08:18 AM   #1
nimblebooks
Enthusiast
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
Understanding html input plugin

Can someone point me to the documentation or source for the html input plugin? I need to understand better what it is doing. Sorry for the stupid questions, but I am learning as I go. Deeply appreciative that Calibre exists and of what it does. Fred
Old 02-24-2012, 08:37 AM   #2
kovidgoyal
creator of calibre
Posts: 25,787
Karma: 4998511
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Look for plugins/html_input.py in the source code.
Old 02-25-2012, 09:06 PM   #3
nimblebooks
So it looks to me like the function doing all the file rewriting is rewrite_links, which is defined in base.py and removes all absolute links. Then it looks for CSS and has cssutils parse it. Is this broadly correct?


Code:
def rewrite_links(root, link_repl_func, resolve_base_href=False):
    '''
    Rewrite all the links in the document.  For each link
    ``link_repl_func(link)`` will be called, and the return value
    will replace the old link.

    Note that links may not be absolute (unless you first called
    ``make_links_absolute()``), and may be internal (e.g.,
    ``'#anchor'``).  They can also be values like
    ``'mailto:email'`` or ``'javascript:expr'``.

    If the ``link_repl_func`` returns None, the attribute or
    tag text will be removed completely.
    '''
    from cssutils import parseString, parseStyle, replaceUrls, log
    log.setLevel(logging.WARN)

    if resolve_base_href:
        resolve_base_href(root)
    for el, attrib, link, pos in iterlinks(root, find_links_in_css=False):
        new_link = link_repl_func(link.strip())
        if new_link == link:
            continue
        if new_link is None:
            # Remove the attribute or element content
            if attrib is None:
                el.text = ''
            else:
                del el.attrib[attrib]
            continue
        if attrib is None:
            new = el.text[:pos] + new_link + el.text[pos+len(link):]
            el.text = new
        else:
            cur = el.attrib[attrib]
            if not pos and len(cur) == len(link):
                # Most common case
                el.attrib[attrib] = new_link
            else:
                new = cur[:pos] + new_link + cur[pos+len(link):]
                el.attrib[attrib] = new

    def set_property(v):
        if v.CSS_PRIMITIVE_VALUE == v.cssValueType and \
           v.CSS_URI == v.primitiveType:
                v.setStringValue(v.CSS_URI,
                        link_repl_func(v.getStringValue()))

    for el in root.iter():
        try:
            tag = el.tag
        except UnicodeDecodeError:
            continue

        if tag == XHTML('style') and el.text and \
                (_css_url_re.search(el.text) is not None or '@import' in
                        el.text):
            stylesheet = parseString(el.text)
            replaceUrls(stylesheet, link_repl_func)
            repl = stylesheet.cssText
            if isbytestring(repl):
                repl = repl.decode('utf-8')
            el.text = '\n'+ repl + '\n'

        if 'style' in el.attrib:
            text = el.attrib['style']
            if _css_url_re.search(text) is not None:
                try:
                    stext = parseStyle(text)
                except:
                    # Parsing errors are raised by cssutils
                    continue
                for p in stext.getProperties(all=True):
                    v = p.cssValue
                    if v.CSS_VALUE_LIST == v.cssValueType:
                        for item in v:
                            set_property(item)
                    elif v.CSS_PRIMITIVE_VALUE == v.cssValueType:
                        set_property(v)
                repl = stext.cssText.replace('\n', ' ').replace('\r',
                        ' ')
                if isbytestring(repl):
                    repl = repl.decode('utf-8')
                el.attrib['style'] = repl
Old 02-26-2012, 01:06 AM   #4
kovidgoyal
That is the resolving of linked resources, yes. You are probably more interested in HTML parsing. Look at parse_utils.py and preprocess.py for that.
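
For anyone reading along, the callback contract described in the docstring quoted above can be tried out in isolation. What follows is only a simplified sketch, not calibre's code: `rewrite_link_attrs`, `LINK_ATTRS` and `drop_remote` are hypothetical names, it uses the standard library's ElementTree in place of lxml, it only looks at `href` and `src` attributes, and it skips the CSS handling entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for rewrite_links: only href/src
# attributes are considered and CSS is ignored.
LINK_ATTRS = ("href", "src")


def rewrite_link_attrs(root, link_repl_func):
    """Call link_repl_func on each link and apply its return value."""
    for el in root.iter():
        for attrib in LINK_ATTRS:
            link = el.attrib.get(attrib)
            if link is None:
                continue
            new_link = link_repl_func(link.strip())
            if new_link == link:
                # Callback left the link unchanged
                continue
            if new_link is None:
                # None means: remove the attribute completely
                del el.attrib[attrib]
            else:
                el.attrib[attrib] = new_link


def drop_remote(link):
    """Example callback: drop http(s) links, keep everything else."""
    if link.startswith(("http://", "https://")):
        return None
    return link


doc = ET.fromstring(
    '<div><a href="http://example.com/x">remote</a>'
    '<a href="chapter2.html#s1">local</a>'
    '<img src=" cover.png "/></div>'
)
rewrite_link_attrs(doc, drop_remote)
print(ET.tostring(doc, encoding="unicode"))
```

The point of the sketch is that the function itself has no policy about absolute links. Whether a link is kept, changed or removed is decided by whatever callback the caller passes in.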
All times are GMT -4.