MobileRead Forums - View Single Post - Post your Useful Plugin Code Fragments Here

Doitsu · 12-18-2016, 05:21 AM

Quote:

Originally Posted by slowsmile

Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:

BTW, bs4 exposes each tag's attributes as a dictionary (attrs), and if you're absolutely sure that you don't need any of them, you can delete them all at once by assigning an empty dictionary to attrs.

Here's a minimalist proof-of-concept example:

Spoiler:
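
A rough sketch of that idea (not necessarily the snippet originally posted), assuming bs4 is installed and using Python's built-in html.parser. The function name strip_all_attributes and its keep parameter are made up for illustration:

```python
from bs4 import BeautifulSoup


def strip_all_attributes(html, keep=()):
    """Drop every attribute from every tag, except names listed in keep."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        if keep:
            # Rebuild the dictionary with only the whitelisted attributes.
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
        else:
            # Assigning an empty dictionary removes all attributes at once.
            tag.attrs = {}
    return str(soup)


sample = (
    '<p class="MsoNormal" style="margin:0" id="p1">'
    'Hello <span lang="EN-US">world</span></p>'
)
cleaned = strip_all_attributes(sample)
kept_ids = strip_all_attributes(sample, keep=("id",))
```

Passing keep=("id",) leaves link targets intact, which is usually safer than wiping everything if the file contains internal links.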

Quote:

Originally Posted by slowsmile

So I'm slightly surprised that you need the 'lang' attribute everywhere in the html [...]

You don't need to use lang attributes unless you're creating a multilingual epub book. If you do use them, however, the IDPF recommends using both lang and xml:lang attributes.
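
If you want to keep the two attributes in sync, something like the following might do. It's only a sketch: it assumes bs4 with the built-in html.parser, and the helper name mirror_lang is hypothetical:

```python
from bs4 import BeautifulSoup


def mirror_lang(html):
    """Give every tag that has lang or xml:lang both attributes, same value."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        # Prefer an existing lang value, fall back to xml:lang.
        value = tag.get("lang") or tag.get("xml:lang")
        if value:
            tag["lang"] = value
            tag["xml:lang"] = value
    return soup


mirrored = mirror_lang('<p lang="fr">Bonjour</p><p>Hello</p>')
first, second = mirrored.find_all("p")
```

Tags without either attribute are left alone, so monolingual paragraphs don't pick up anything new.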

Quote:

Originally Posted by slowsmile

Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5.

The epub 2.0.1 standard is based on XHTML 1.1, and XHTML 1.1 no longer allows name attributes to be used as fragment identifiers.
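
For files that still contain name anchors, a conversion along these lines could work. This is a minimal sketch assuming bs4 and html.parser; name_to_id is a hypothetical helper, and it doesn't check that the resulting ids are unique or valid XML names:

```python
from bs4 import BeautifulSoup


def name_to_id(html):
    """Turn <a name="x"> into <a id="x"> so fragment links keep working."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={"name": True} matches <a> tags that carry a name attribute;
    # a plain name= keyword would be read as the tag name instead.
    for anchor in soup.find_all("a", attrs={"name": True}):
        if not anchor.has_attr("id"):
            anchor["id"] = anchor["name"]
        # If an id already exists it is kept as is; links that point at
        # the old name would then need fixing by hand.
        del anchor["name"]
    return str(soup)


converted = name_to_id('<a name="ch1"></a><p>Chapter 1</p>')
```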

Quote:

Originally Posted by slowsmile

I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.

The fact that MS Word doesn't generate XHTML 1.1-compliant output doesn't make it OK to use that output as is, even though many epub apps can handle name attributes as fragment identifiers.

Quote:

Originally Posted by slowsmile

Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.

Amazon does indeed support uploading ebooks with MS Word-generated html files; however, IMHO, that doesn't mean that they officially condone the use of the name attribute. IIRC, the Kindle Publishing Guidelines recommend using only well-formed (X)HTML files.
Based on strings found in the kindlegen binary, it also looks like KindleGen uses HTMLTidy internally to clean up all HTML files.