This is a heads-up for anyone converting content that has HTML5 tags in it. I've discovered that BeautifulSoup rearranges HTML fragments that have a <p> tag inside a <figcaption> tag. It yanks the <p> tag out of the <figcaption> tag and, in the example below, puts it after the closing </header>, outside the <figure> entirely. The following output from the Python interpreter illustrates this.
Code:
>>> y=BeautifulSoup('<html><body><div><header><figure><img /><figcaption><p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print y
<html><body><div><header><figure><img /><figcaption></figcaption></figure></header><p>caption text</p></div></body></html>
If the <figcaption> text is enclosed directly there is no problem.
Code:
>>> z=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>')
>>> print z
<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>
>>>
There are newspaper websites that are using <p> inside <figcaption> so be aware that BeautifulSoup will rearrange your HTML in these cases.
Very strange... There is no reason not to use <p> inside <figcaption>, as far as I can see from the HTML specification.
If a <figcaption> contains both direct text and a <p> tag, BeautifulSoup moves only the <p> tag and leaves the direct text in place, as can be seen below.
Code:
>>> w=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>initial text<p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print w
<html><body><div><header><figure><img /><figcaption>initial text</figcaption></figure></header><p>caption text</p></div></body></html>
>>>
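If you want to know in advance whether a page will be affected, something like the following may help. It is only a rough sketch: it uses the html.parser module from the Python 3 standard library (not BeautifulSoup itself) to count <p> start tags that appear while a <figcaption> is open. The names FigcaptionParagraphFinder and count_p_in_figcaption are made up for this example, and it assumes every <figcaption> in the input is explicitly closed.
Code:

```python
from html.parser import HTMLParser


class FigcaptionParagraphFinder(HTMLParser):
    """Counts <p> start tags seen while a <figcaption> is still open."""

    def __init__(self):
        super().__init__()
        self.figcaption_depth = 0  # number of currently open <figcaption> tags
        self.hits = 0              # number of <p> tags found inside a <figcaption>

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "figcaption":
            self.figcaption_depth += 1
        elif tag == "p" and self.figcaption_depth > 0:
            self.hits += 1

    def handle_endtag(self, tag):
        if tag == "figcaption" and self.figcaption_depth > 0:
            self.figcaption_depth -= 1


def count_p_in_figcaption(html):
    """Return how many <p> tags in the html string sit inside a <figcaption>."""
    finder = FigcaptionParagraphFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

A non-zero result means the page has the pattern shown in the transcripts above, so the BeautifulSoup output is worth checking by eye.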