Quote:
Originally Posted by slowsmile
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:
|
BTW, bs4 returns the attributes as an attrs dictionary, and if you're absolutely sure that you don't need any of them, you could delete them all at once by assigning an empty dictionary to attrs.
Here's a minimalist proof-of-concept example:
Spoiler:
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from sigil_bs4 import BeautifulSoup

def run(bk):
    # process all (X)HTML files in the book
    for (html_id, href) in bk.text_iter():
        html = bk.readfile(html_id)
        soup = BeautifulSoup(html, 'html.parser')
        orig_soup = str(soup)
        # strip all attributes, except on tags that need theirs
        for tag in soup.find_all(True):
            if tag.name not in ['style', 'a', 'nav', 'link', 'html', 'svg', 'image', 'meta'] and tag.attrs != {}:
                tag.attrs = {}
        # write the file back only if something actually changed
        if str(soup) != orig_soup:
            bk.writefile(html_id, str(soup))
            print(bk.id_to_href(html_id) + ' updated.')
    return 0

def main():
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())
Quote:
Originally Posted by slowsmile
So I'm slightly surprised that you need the 'lang' attribute everywhere in the html [...]
|
You don't need lang attributes unless you're creating a multilingual epub book. If you do use them, however, the IDPF recommends using both lang and xml:lang attributes.
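As a rough illustration of keeping the two attributes in sync, here's a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of bs4 purely to stay dependency-free, so it only works on well-formed XHTML. The function name mirror_lang and the sample markup are made up for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the reserved XML namespace
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

def mirror_lang(xhtml_fragment):
    """Make sure every element that has lang or xml:lang has both (hypothetical helper)."""
    root = ET.fromstring(xhtml_fragment)
    for elem in root.iter():
        lang = elem.get('lang')
        xml_lang = elem.get(XML_LANG)
        if lang and not xml_lang:
            elem.set(XML_LANG, lang)
        elif xml_lang and not lang:
            elem.set('lang', xml_lang)
    return ET.tostring(root, encoding='unicode')

# made-up sample: two elements with lang only, one with xml:lang only
lang_sample = '<div lang="en"><p lang="fr">Bonjour</p><p xml:lang="de">Hallo</p></div>'
mirrored = mirror_lang(lang_sample)
print(mirrored)
```

Elements that already carry both attributes are left untouched, even if the two values disagree; resolving such conflicts is outside the scope of this sketch.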
Quote:
Originally Posted by slowsmile
Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5.
|
The epub 2.0.1 standard is based on XHTML 1.1, and XHTML 1.1 no longer allows the use of name attributes as fragment identifiers.
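If you have old files with name anchors, converting them is fairly mechanical. Here's a minimal sketch, again using the standard library's xml.etree.ElementTree rather than bs4 just to avoid dependencies, so it assumes well-formed XHTML without a default namespace. The function name name_to_id and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def name_to_id(xhtml_fragment):
    """Replace name attributes on <a> elements with id attributes (hypothetical helper)."""
    root = ET.fromstring(xhtml_fragment)
    for elem in root.iter('a'):
        # remove the name attribute, remembering its value
        name = elem.attrib.pop('name', None)
        # keep an existing id; otherwise reuse the old name as the id
        if name is not None and 'id' not in elem.attrib:
            elem.set('id', name)
    return ET.tostring(root, encoding='unicode')

# made-up sample: a name anchor and a link pointing at it
anchor_sample = '<div><a name="ch1"></a><p>See <a href="#ch1">chapter 1</a>.</p></div>'
converted = name_to_id(anchor_sample)
print(converted)
```

Links such as href="#ch1" keep working because the fragment now resolves against the id. Note that this sketch doesn't check whether the old name value is actually a valid id.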
Quote:
Originally Posted by slowsmile
I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.
|
Just because MS Word doesn't generate XHTML 1.1 compliant output doesn't mean it's OK to use it as is, even though many epub apps can handle name attributes as fragment identifiers.
Quote:
Originally Posted by slowsmile
Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.
|
Amazon does indeed support uploading ebooks with MS Word-generated html files. However, IMHO, that doesn't mean that they officially condone the use of the name attribute. IIRC, the Kindle Publishing Guidelines recommend using only well-formed (X)HTML files.
Based on strings found in the kindlegen binary, it also looks like KindleGen uses HTMLTidy internally to clean up all HTML files.