Thread: HTML5 parsing
Old 08-07-2012, 07:21 PM   #1
nickredding
onlinenewsreader.net
HTML5 parsing

This is a heads-up for anyone converting content that has HTML5 tags in it. I've discovered that BeautifulSoup rearranges HTML fragments that have a <p> tag within a <figcaption> tag. It yanks the <p> tag out of the <figcaption> and puts it after the closing </header> (and therefore after the closing </figure> as well), leaving it as a direct child of the enclosing <div>. The following output from the Python interpreter illustrates this.
Code:
>>> y=BeautifulSoup('<html><body><div><header><figure><img /><figcaption><p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print y
<html><body><div><header><figure><img /><figcaption></figcaption></figure></header><p>caption text</p></div></body></html>
If the caption text is placed directly inside the <figcaption>, with no <p> tag, there is no problem.
Code:
>>> z=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>')
>>> print z
<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>
>>>
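Since direct text inside <figcaption> is left alone, one possible workaround is to strip the <p> tags out of each <figcaption> before handing the markup to BeautifulSoup. The following is only a minimal sketch: the function name unwrap_p_in_figcaption is made up for illustration, it uses regular expressions rather than a real parser, and I have not checked it against every page layout.

```python
import re

# Matches one <figcaption>...</figcaption> element (non-greedy, case-insensitive).
FIGCAPTION_RE = re.compile(
    r'(<figcaption\b[^>]*>)(.*?)(</figcaption\s*>)',
    re.IGNORECASE | re.DOTALL)

# Matches opening or closing <p> tags, with or without attributes.
P_TAG_RE = re.compile(r'</?p\b[^>]*>', re.IGNORECASE)


def unwrap_p_in_figcaption(html):
    """Hypothetical helper: drop <p> and </p> tags found inside a <figcaption>.

    The paragraph text is kept; only the tags themselves are removed.
    Markup outside <figcaption> elements is returned unchanged.
    """
    def fix(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + P_TAG_RE.sub('', body) + close_tag

    return FIGCAPTION_RE.sub(fix, html)
```

Run on the first fragment above, this should produce the second fragment, which BeautifulSoup leaves untouched. If a caption mixes direct text with a paragraph, the two pieces of text end up adjacent with no separator, so you may want to insert a space or a <br /> instead of an empty string.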
Some newspaper websites use <p> inside <figcaption>, so be aware that BeautifulSoup will rearrange your HTML in these cases.
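To find out whether a given page is affected before converting it, you can count the <p> tags that sit inside a <figcaption> in the raw source. This is a rough sketch using only the standard library parser (not BeautifulSoup); the class and function names are invented for this example.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FigcaptionParagraphFinder(HTMLParser):
    """Hypothetical helper: counts <p> start tags seen inside <figcaption>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0  # number of <figcaption> elements currently open
        self.hits = 0   # number of <p> tags found inside a <figcaption>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from the parser.
        if tag == 'figcaption':
            self.depth += 1
        elif tag == 'p' and self.depth > 0:
            self.hits += 1

    def handle_endtag(self, tag):
        if tag == 'figcaption' and self.depth > 0:
            self.depth -= 1


def count_p_in_figcaption(html):
    """Return how many <p> tags appear inside <figcaption> elements."""
    finder = FigcaptionParagraphFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

A non-zero count means the page contains the pattern that gets rearranged in the examples in this post.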

Very strange... as far as I can see from the HTML specification, there is no reason not to use <p> inside <figcaption>.

If a <figcaption> contains both direct text and a <p> tag, BeautifulSoup moves only the <p> tag, as can be seen below.

Code:
>>> w=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>initial text<p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print w
<html><body><div><header><figure><img /><figcaption>initial text</figcaption></figure></header><p>caption text</p></div></body></html>
>>>