HTML5 parsing

nickredding · 08-07-2012, 07:21 PM

This is a heads-up for people converting content that has HTML5 tags in it. I've discovered that BeautifulSoup rearranges HTML code fragments that have a <p> tag within a <figcaption> tag. It yanks the <p> tag out of the <figcaption> tag and puts it after the closing </figure> (in the example below it lands after the enclosing </header> as well). The following output from the Python interpreter illustrates this.

Code:

>>> y=BeautifulSoup('<html><body><div><header><figure><img /><figcaption><p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print y
<html><body><div><header><figure><img /><figcaption></figcaption></figure></header><p>caption text</p></div></body></html>

If the <figcaption> text is enclosed directly there is no problem.

Code:

>>> z=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>')
>>> print z
<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>
>>>

There are newspaper websites that are using <p> inside <figcaption>, so be aware that BeautifulSoup will rearrange your HTML in these cases.

Very strange... there is no reason not to use <p> inside <figcaption> as far as I can see from the HTML specification.

If you enclose text directly within a <figcaption> as well as a <p> tag, BeautifulSoup only moves the <p> tag, as can be seen below.

Code:

>>> w=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>initial text<p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print w
<html><body><div><header><figure><img /><figcaption>initial text</figcaption></figure></header><p>caption text</p></div></body></html>
>>>

kovidgoyal · 08-08-2012, 12:13 AM

IIRC, BeautifulSoup needs to be told which tags are nestable, and it is not HTML5-aware. Simply add the HTML5 tags to the NESTABLE_TAGS field in the BeautifulSoup class in ebooks/BeautifulSoup.py and that should fix it.

nickredding · 08-08-2012, 11:03 AM

Thanks. There is another solution that doesn't involve running calibre from source (as far as I can see, that is the only way to extend BeautifulSoup): change the HTML5 tags to DIV with preprocess_regexps. This conversion is done in _postprocess_html anyway, but by that point it's too late, since BeautifulSoup has already "fixed" the HTML.
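
A minimal sketch of the regex approach described above, using only the standard re module. The (compiled pattern, replacement function) pair follows the shape calibre documents for preprocess_regexps entries; the tag list and the names HTML5_TO_DIV and apply_regexps are assumptions for illustration, and apply_regexps merely stands in for the substitution step that runs before the HTML is parsed.

```python
import re

# Tags to rewrite; this list is an assumption, extend it as needed.
HTML5_TAGS = ['article', 'aside', 'header', 'footer', 'nav',
              'figcaption', 'figure', 'section']

# One pattern matching the opening and closing forms of any listed tag.
# Group 1 captures the optional slash, group 2 any attributes.
_TAG_RE = re.compile(
    r'<(/?)(?:%s)\b([^>]*)>' % '|'.join(HTML5_TAGS),
    re.IGNORECASE)

# Same shape as a preprocess_regexps entry:
# a list of (compiled pattern, function taking a match object).
HTML5_TO_DIV = [
    (_TAG_RE, lambda m: '<%sdiv%s>' % (m.group(1), m.group(2))),
]

def apply_regexps(html, rules):
    """Hypothetical stand-in for the pre-parse substitution step."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html

sample = ('<figure class="x"><img /><figcaption><p>caption text</p>'
          '</figcaption></figure>')
converted = apply_regexps(sample, HTML5_TO_DIV)
```

In a recipe, the same list would presumably be assigned to the preprocess_regexps class attribute, so that the parser only ever sees DIVs.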

kovidgoyal · 08-08-2012, 11:11 AM

There's no need to run calibre from source. Simply import BeautifulSoup and modify the class variable.

BeautifulSoup.NESTABLE_TAGS

nickredding · 08-08-2012, 12:06 PM

I learned some Python today! I was under the impression class variables could only be accessed via class inheritance or through an object instance.

kovidgoyal · 08-08-2012, 02:24 PM

In Python, everything, including a class, is just an object. A class is an object of type "type".
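
A small illustration of that point, as a sketch only. Parser here is a made-up class, not the real BeautifulSoup; it just shows that a class attribute can be changed through the class object itself and that every instance sees the change.

```python
class Parser(object):
    # Class attribute: one list shared by the class and all its instances.
    NESTABLE_BLOCK_TAGS = ['blockquote', 'div']

# The class itself is an object, so its attributes can be read and
# mutated directly, with no subclass and no instance involved.
Parser.NESTABLE_BLOCK_TAGS.append('figure')

first = Parser()
second = Parser()

# A class is an instance of "type".
class_is_type = type(Parser) is type
# Both instances look up the very same list object on the class.
shared = first.NESTABLE_BLOCK_TAGS is second.NESTABLE_BLOCK_TAGS
# The change made through the class is visible from an instance.
seen_by_instance = 'figure' in first.NESTABLE_BLOCK_TAGS
```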

nickredding · 08-08-2012, 05:42 PM

For anyone else having this issue, place the following code in your recipe:

Code:

from calibre.ebooks.BeautifulSoup import BeautifulSoup

# Register the HTML5 block tags in the parser's nesting tables.
for x in ['article', 'aside', 'header', 'footer', 'nav', 'figcaption', 'figure', 'section']:
    BeautifulSoup.NESTABLE_BLOCK_TAGS.append(x)
    BeautifulSoup.RESET_NESTING_TAGS[x] = None
    BeautifulSoup.NESTABLE_TAGS[x] = []

This will cause the HTML5 tags to be treated like DIVs by the parser (which is what they get replaced with later in the conversion process).
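
One possible refinement, offered as a sketch rather than a tested calibre change: append adds a duplicate list entry every time the loop runs in the same process, so a guarded version may be safer if the recipe code can be executed more than once. The names register_block_tags and FakeSoup are hypothetical; FakeSoup is only a stand-in carrying the same three attribute names so the snippet runs without calibre, and its initial dictionary contents are made up.

```python
HTML5_BLOCK_TAGS = ['article', 'aside', 'header', 'footer', 'nav',
                    'figcaption', 'figure', 'section']

def register_block_tags(parser_cls, tags=HTML5_BLOCK_TAGS):
    """Mark each tag as a nestable block tag on parser_cls, once only."""
    for tag in tags:
        # Guard the list so repeated calls do not add duplicates.
        if tag not in parser_cls.NESTABLE_BLOCK_TAGS:
            parser_cls.NESTABLE_BLOCK_TAGS.append(tag)
        # Dictionary assignments are naturally idempotent.
        parser_cls.RESET_NESTING_TAGS[tag] = None
        parser_cls.NESTABLE_TAGS[tag] = []

class FakeSoup(object):
    """Stand-in with the same three attributes; NOT the real parser."""
    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
    RESET_NESTING_TAGS = {'div': None}
    NESTABLE_TAGS = {'div': []}

register_block_tags(FakeSoup)
register_block_tags(FakeSoup)   # second call adds nothing new
```

With the real class, the call would be register_block_tags(BeautifulSoup) after the import shown in the recipe code.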

kovidgoyal · 08-09-2012, 01:08 AM

I can add that to calibre for the next release. Just to make sure I get it right, the patch needed is:

Code:

=== modified file 'src/calibre/ebooks/BeautifulSoup.py'
--- src/calibre/ebooks/BeautifulSoup.py 2010-04-17 16:37:28 +0000
+++ src/calibre/ebooks/BeautifulSoup.py 2012-08-09 05:06:42 +0000
@@ -1454,7 +1454,8 @@
     #According to the HTML standard, these block tags can contain
     #another tag of the same type. Furthermore, it's common
     #to actually use these tags this way.
-    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
+    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del',
+            'article', 'aside', 'header', 'footer', 'nav', 'figcaption', 'figure', 'section']
 
     #Lists can contain other lists, but there are restrictions.
     NESTABLE_LIST_TAGS = { 'ol' : [],

nickredding · 08-09-2012, 09:50 AM

Thanks. That's going to avoid issues for recipes as a lot of newspapers are using HTML5 these days.
