12-18-2016, 04:21 AM   #10
Doitsu
Grand Sorcerer
Quote:
Originally Posted by slowsmile View Post
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:
BTW, bs4 exposes each tag's attributes as a dictionary (tag.attrs). If you're absolutely sure that you don't need any of them, you can delete them all at once by assigning an empty dictionary to attrs.

Here's a minimalist proof-of-concept example:

Spoiler:
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from sigil_bs4 import BeautifulSoup

def run(bk):
    # process all (X)HTML files in the book
    for (html_id, href) in bk.text_iter():
        html = bk.readfile(html_id)
        soup = BeautifulSoup(html, 'html.parser')
        # keep the serialized original so changes can be detected later
        orig_soup = str(soup)

        # drop every attribute from all tags except the ones listed here
        for tag in soup.find_all(True):
            if tag.name not in ['style', 'a', 'nav', 'link', 'html', 'svg', 'image', 'meta'] and tag.attrs:
                tag.attrs = {}

        # write the file back only if something actually changed
        if str(soup) != orig_soup:
            bk.writefile(html_id, str(soup))
            print(bk.id_to_href(html_id) + ' updated.')

    return 0

def main():
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())
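
If you want to try the attribute stripping outside of Sigil first, something like the following rough sketch should work. It assumes the regular bs4 package instead of sigil_bs4, and strip_attrs, SKIP_TAGS and the sample markup are made-up names for illustration only:

```python
from bs4 import BeautifulSoup

# Tags whose attributes are left untouched (same list as in the plugin code).
SKIP_TAGS = {'style', 'a', 'nav', 'link', 'html', 'svg', 'image', 'meta'}

def strip_attrs(html, skip_tags=SKIP_TAGS):
    """Return html with all attributes removed from tags not in skip_tags."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(True):
        # tag.attrs is a plain dict; an empty dict means "no attributes"
        if tag.name not in skip_tags and tag.attrs:
            tag.attrs = {}
    return str(soup)

# Hypothetical Word-style markup, for illustration only.
sample = '<p class="MsoNormal" style="margin:0"><a href="#n1">note</a></p>'
cleaned = strip_attrs(sample)
print(cleaned)
```

The p tag loses its class and style attributes, while the a tag keeps its href because 'a' is in the skip list.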


Quote:
Originally Posted by slowsmile View Post
So I'm slightly surprised that you need the 'lang' attribute everywhere in the html [...]
You don't need lang attributes unless you're creating a multilingual epub. However, if you do use them, the IDPF recommends using both lang and xml:lang.
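
If you do keep language tags, a small helper could make sure that both attributes are always present with the same value. This is only a sketch, assuming the regular bs4 package; mirror_lang and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup

def mirror_lang(html):
    """Give every tag that has lang or xml:lang both attributes, same value."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(True):
        # prefer an existing lang value, fall back to xml:lang
        lang = tag.get('lang') or tag.get('xml:lang')
        if lang:
            tag['lang'] = lang
            tag['xml:lang'] = lang
    return str(soup)

# Hypothetical sample, for illustration only.
sample = '<p>Guten Tag, <span lang="en">hello</span></p>'
result = mirror_lang(sample)
print(result)
```

Tags without any language attribute are left alone, so this doesn't add lang attributes where there were none.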

Quote:
Originally Posted by slowsmile View Post
Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does infer that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5.
The EPUB 2.0.1 standard is based on XHTML 1.1, and XHTML 1.1 no longer allows the use of name attributes as fragment identifiers.
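
For Word-generated files, the name anchors could be converted to id anchors with a few lines of bs4. Again, this is just a sketch that assumes the regular bs4 package; names_to_ids and the sample markup are made up, and it doesn't check whether the resulting id values are unique or valid XML ids:

```python
from bs4 import BeautifulSoup

def names_to_ids(html):
    """Replace <a name="x"> with <a id="x"> so fragment links keep working."""
    soup = BeautifulSoup(html, 'html.parser')
    # attrs={'name': True} matches anchors that have a name attribute at all
    for anchor in soup.find_all('a', attrs={'name': True}):
        if not anchor.has_attr('id'):
            anchor['id'] = anchor['name']
        # caveat: if an id already existed and differs from the name,
        # links that point at #name will need to be fixed by hand
        del anchor['name']
    return str(soup)

# Hypothetical sample, for illustration only.
sample = '<p><a name="ch1"></a>Chapter 1</p>'
result = names_to_ids(sample)
print(result)
```

Existing links such as href="#ch1" don't need to be changed, because the fragment identifier itself stays the same.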

Quote:
Originally Posted by slowsmile View Post
I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.
The fact that MS Word doesn't generate XHTML 1.1-compliant output doesn't make it OK to use that output as is, even though many epub apps can handle name attributes as fragment identifiers.

Quote:
Originally Posted by slowsmile View Post
Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.
Amazon does indeed support uploading ebooks with MS Word-generated HTML files. However, IMHO, that doesn't mean that they officially condone the use of the name attribute. IIRC, the Kindle Publishing Guidelines recommend using only well-formed (X)HTML files.
Based on strings found in the kindlegen binary, it also looks like KindleGen uses HTMLTidy internally to clean up all HTML files.