08-07-2012, 07:21 PM | #1 |
onlinenewsreader.net
Posts: 324
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
|
HTML5 parsing
This is a heads-up for anyone converting content that contains HTML5 tags. I've discovered that BeautifulSoup rearranges HTML fragments that have a <p> tag within a <figcaption> tag. It yanks the <p> tag out of the <figcaption> tag and puts it after the closing </figure> (in the example below it ends up after the enclosing </header>). The following output from the Python interpreter illustrates this.
Code:
>>> y=BeautifulSoup('<html><body><div><header><figure><img /><figcaption><p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print y
<html><body><div><header><figure><img /><figcaption></figcaption></figure></header><p>caption text</p></div></body></html>
Code:
>>> z=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>')
>>> print z
<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>
>>>
Very strange... there is no reason not to use <p> inside <figcaption>, as far as I can see from the HTML specification. If you enclose text directly within a <figcaption> as well as a <p> tag, BeautifulSoup only moves the <p> tag, as can be seen below.
Code:
>>> w=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>initial text<p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print w
<html><body><div><header><figure><img /><figcaption>initial text</figcaption></figure></header><p>caption text</p></div></body></html>
>>>
|
08-08-2012, 12:13 AM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
IIRC, BeautifulSoup needs to be told which tags are nestable, and it is not HTML5 aware. Simply add the HTML5 tags to the NESTABLE_TAGS field in the BeautifulSoup class in ebooks/BeautifulSoup.py and that should fix it.
|
08-08-2012, 11:03 AM | #3 |
onlinenewsreader.net
|
Thanks. There is another solution that doesn't involve running calibre from source (which is the only way to extend BeautifulSoup, as far as I can see): change the HTML5 tags to DIV with preprocess_regexps. This is done in _postprocess_html anyway, but at that point it's too late, since BeautifulSoup has already "fixed" the HTML.
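A rough sketch of that regexp approach follows. It assumes the usual recipe convention that preprocess_regexps is a list of (compiled pattern, replacement callable) pairs; the tag list, the pattern, and the apply_rules helper are my own illustration, and apply_rules only mimics what calibre would do with the list so the snippet can be run outside a recipe.

```python
import re

# HTML5 structural tags to rewrite (assumed list, same names used later in this thread)
HTML5_TAGS = ('article', 'aside', 'header', 'footer', 'nav',
              'figcaption', 'figure', 'section')

# Matches an opening or closing tag for any name above.
# Group 1 is the optional leading slash, group 2 is any attribute text.
_HTML5_TAG_RE = re.compile(
    r'<(/?)(?:%s)\b([^>]*)>' % '|'.join(HTML5_TAGS), re.IGNORECASE)

# In a recipe this list would be a class attribute of the recipe class.
preprocess_regexps = [
    (_HTML5_TAG_RE, lambda m: '<%sdiv%s>' % (m.group(1), m.group(2))),
]


def apply_rules(html, rules=preprocess_regexps):
    """Hypothetical stand-in for calibre applying the rules to raw HTML."""
    for pattern, repl in rules:
        html = pattern.sub(repl, html)
    return html


sample = ('<header><figure><img /><figcaption><p>caption text</p>'
          '</figcaption></figure></header>')
print(apply_rules(sample))
# <div><div><img /><div><p>caption text</p></div></div></div>
```

Since div is already treated as nestable, the <p> should stay where it is once the parser sees the rewritten markup. The trade-off is that the original tag names are lost, so any CSS selectors that target figure or figcaption would need adjusting.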
|
08-08-2012, 11:11 AM | #4 |
creator of calibre
|
There's no need to run calibre from source. Simply import BeautifulSoup and modify the class variable.
BeautifulSoup.NESTABLE_TAGS |
08-08-2012, 12:06 PM | #5 |
onlinenewsreader.net
|
I learned some Python today! I was under the impression class variables could only be accessed via class inheritance or through an object instance.
|
08-08-2012, 02:24 PM | #6 |
creator of calibre
|
In Python, everything, including a class, is just an object. A class is an object of type "type".
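A minimal sketch of what that means in practice, using a hypothetical ToyParser class rather than calibre's BeautifulSoup: the class attribute can be read and changed directly through the class name, and every instance sees the change.

```python
class ToyParser(object):
    # Class-level attribute shared by all instances
    # (hypothetical stand-in for a field like NESTABLE_TAGS)
    NESTABLE_TAGS = {'div': []}

    def is_nestable(self, name):
        # Attribute lookup on the instance falls back to the class
        return name in self.NESTABLE_TAGS


# No subclass and no instance needed: the class itself is an object
ToyParser.NESTABLE_TAGS['figure'] = []

parser = ToyParser()
print(parser.is_nestable('figure'))  # True
print(type(ToyParser) is type)       # True
```

Note that this changes the class for everything in the same process that uses it, which is exactly what is wanted here but can be surprising elsewhere.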
|
08-08-2012, 05:42 PM | #7 |
onlinenewsreader.net
|
Solution code
For anyone else having this issue, place the following code in your recipe:
Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup
for x in ['article', 'aside', 'header', 'footer', 'nav', 'figcaption', 'figure', 'section']:
    BeautifulSoup.NESTABLE_BLOCK_TAGS.append(x)
    BeautifulSoup.RESET_NESTING_TAGS[x]=None
    BeautifulSoup.NESTABLE_TAGS[x]=[]
|
08-09-2012, 01:08 AM | #8 |
creator of calibre
|
I can add that to calibre for the next release. Just to make sure I get it right, the patch needed is:
Code:
=== modified file 'src/calibre/ebooks/BeautifulSoup.py'
--- src/calibre/ebooks/BeautifulSoup.py 2010-04-17 16:37:28 +0000
+++ src/calibre/ebooks/BeautifulSoup.py 2012-08-09 05:06:42 +0000
@@ -1454,7 +1454,8 @@
     #According to the HTML standard, these block tags can contain
     #another tag of the same type. Furthermore, it's common
     #to actually use these tags this way.
-    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
+    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del',
+        'article', 'aside', 'header', 'footer', 'nav', 'figcaption', 'figure', 'section']
 
     #Lists can contain other lists, but there are restrictions.
     NESTABLE_LIST_TAGS = { 'ol' : [],
|
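A note on why the recipe code in post #7 updates three attributes while the patch only touches NESTABLE_BLOCK_TAGS. One plausible reason, assuming RESET_NESTING_TAGS and NESTABLE_TAGS are built from that list when the class body runs, is that a later append to the list does not reach the maps already derived from it. The sketch below shows that general Python behaviour with a hypothetical ToyBlockParser class and a hypothetical build_tag_map helper; it is not the actual BeautifulSoup source.

```python
def build_tag_map(default, *sources):
    """Flatten lists of tag names into one dict (hypothetical helper)."""
    built = {}
    for source in sources:
        for name in source:
            built[name] = default
    return built


class ToyBlockParser(object):
    NESTABLE_BLOCK_TAGS = ['blockquote', 'div']
    # Derived once, at class-definition time, from the list above
    NESTABLE_TAGS = build_tag_map([], NESTABLE_BLOCK_TAGS)


# Runtime change to the source list only
ToyBlockParser.NESTABLE_BLOCK_TAGS.append('figure')
stale = 'figure' in ToyBlockParser.NESTABLE_TAGS
print(stale)  # False: the derived map was not rebuilt

# So a runtime fix has to update the derived map as well
ToyBlockParser.NESTABLE_TAGS['figure'] = []
print('figure' in ToyBlockParser.NESTABLE_TAGS)  # True
```

Editing the list in the source file, as the patch does, avoids the problem because the derived maps are then computed from the already extended list.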
08-09-2012, 09:50 AM | #9 |
onlinenewsreader.net
|
Thanks. That's going to avoid issues for recipes as a lot of newspapers are using HTML5 these days.
|