Old 08-07-2012, 07:21 PM   #1
nickredding
onlinenewsreader.net
Posts: 324
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
HTML5 parsing

This is a heads up for people converting stuff that has HTML5 tags in it. I've discovered that BeautifulSoup rearranges HTML code fragments that have a <p> tag within a <figcaption> tag. It yanks the <p> tag out of the <figcaption> tag and puts it after the closing </figure>. The following output from the Python interpreter illustrates this.
Code:
>>> y=BeautifulSoup('<html><body><div><header><figure><img /><figcaption><p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print y
<html><body><div><header><figure><img /><figcaption></figcaption></figure></header><p>caption text</p></div></body></html>
If the <figcaption> text is enclosed directly there is no problem.
Code:
>>> z=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>')
>>> print z
<html><body><div><header><figure><img /><figcaption>caption text</figcaption></figure></header></div></body></html>
>>>
There are newspaper websites that are using <p> inside <figcaption> so be aware that BeautifulSoup will rearrange your HTML in these cases.

Very strange... there is no reason not to use <p> inside <figcaption> as far as I can see from the HTML specification.

If you enclose text directly within a <figcaption> as well as a <p> tag, BeautifulSoup only moves the <p> tag, as can be seen below.

Code:
>>> w=BeautifulSoup('<html><body><div><header><figure><img /><figcaption>initial text<p>caption text</p></figcaption></figure></header></div></body></html>')
>>> print w
<html><body><div><header><figure><img /><figcaption>initial text</figcaption></figure></header><p>caption text</p></div></body></html>
>>>
Old 08-08-2012, 12:13 AM   #2
kovidgoyal
creator of calibre
Posts: 43,840
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
IIRC, BeautifulSoup needs to be told which tags are nestable, and it is not HTML5 aware. Simply add the HTML5 tags to the NESTABLE_TAGS field in the BeautifulSoup class in ebooks/BeautifulSoup.py and that should fix it.
Old 08-08-2012, 11:03 AM   #3
nickredding
onlinenewsreader.net
Thanks. There is another solution that doesn't involve running calibre from source (as far as I can see, that is the only way to extend BeautifulSoup): change the HTML5 tags to DIVs with preprocess_regexps. calibre does this anyway in _postprocess_html, but by that point it's too late because BeautifulSoup has already "fixed" the HTML.
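A minimal sketch of that regexp approach, written for Python 3 and runnable without calibre. The list has the shape calibre recipes use for preprocess_regexps (pairs of a compiled pattern and a function that takes the match and returns the replacement). The tag list, the two patterns, and the helper apply_preprocess_regexps are my own illustrations, not calibre code, and regexps like these will not cope with every piece of real-world markup.

```python
import re

# HTML5 tags to rewrite; the list itself is an assumption for illustration.
HTML5_TAGS = 'article|aside|header|footer|nav|figcaption|figure|section'

# Same shape as a recipe's preprocess_regexps attribute:
# (compiled pattern, function taking the match and returning the replacement).
preprocess_regexps = [
    # Opening tags, attributes preserved: <figure class="x"> -> <div class="x">
    (re.compile(r'<(?:%s)\b([^>]*)>' % HTML5_TAGS, re.IGNORECASE),
     lambda m: '<div' + m.group(1) + '>'),
    # Closing tags: </figure> -> </div>
    (re.compile(r'</(?:%s)\s*>' % HTML5_TAGS, re.IGNORECASE),
     lambda m: '</div>'),
]


def apply_preprocess_regexps(html, rules):
    """Hypothetical helper: apply each rule in order, as a stand-in for
    what happens to the raw HTML before it reaches the parser."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


raw = ('<header><figure><img /><figcaption><p>caption text</p>'
       '</figcaption></figure></header>')
converted = apply_preprocess_regexps(raw, preprocess_regexps)
print(converted)
# <div><div><img /><div><p>caption text</p></div></div></div>
```

Because the substitution runs on the raw text, the parser only ever sees DIVs, which it already knows how to nest.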
Old 08-08-2012, 11:11 AM   #4
kovidgoyal
creator of calibre
There's no need to run calibre from source. Simply import BeautifulSoup and modify the class variable.

BeautifulSoup.NESTABLE_TAGS
Old 08-08-2012, 12:06 PM   #5
nickredding
onlinenewsreader.net
I learned some Python today! I was under the impression class variables could only be accessed via class inheritance or through an object instance.
Old 08-08-2012, 02:24 PM   #6
kovidgoyal
creator of calibre
In Python, everything, including classes, is just an object. A class is an object of type "type".
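A small illustration of that point, assuming Python 3. The Parser class here is hypothetical and only borrows the attribute name from the discussion; it shows that a class attribute can be changed through the class object itself, and that the change is seen by instances created before and after it.

```python
class Parser:
    """Hypothetical parser with a class-level table of nestable tags."""
    # Class attribute: one dict shared by the class and all its instances.
    NESTABLE_TAGS = {'div': []}

    def is_nestable(self, tag):
        # Attribute lookup on the instance falls back to the class.
        return tag in self.NESTABLE_TAGS


before = Parser()

# The class is itself an object (of type "type"), so its attributes can be
# reached directly, without subclassing and without an instance.
print(type(Parser) is type)  # True
Parser.NESTABLE_TAGS['figure'] = []

after = Parser()
print(before.is_nestable('figure'), after.is_nestable('figure'))  # True True
```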
Old 08-08-2012, 05:42 PM   #7
nickredding
onlinenewsreader.net
Solution code

For anyone else having this issue, place the following code in your recipe:

Code:
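from calibre.ebooks.BeautifulSoup import BeautifulSoup

# Register the HTML5 block tags with the parser's nesting tables
for x in ['article', 'aside', 'header', 'footer', 'nav', 'figcaption', 'figure', 'section']:
    BeautifulSoup.NESTABLE_BLOCK_TAGS.append(x)   # may nest inside themselves, like <div>
    BeautifulSoup.RESET_NESTING_TAGS[x] = None    # resets nesting of enclosed tags
    BeautifulSoup.NESTABLE_TAGS[x] = []           # no restriction on the parent tag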
This will cause the HTML5 tags to be treated like DIVs by the parser (which is what they get replaced with later in the conversion process).
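A sketch of the same registration written so that it is safe to run more than once, assuming Python 3. StandInSoup is a hypothetical stand-in that only mirrors the three attribute names used above so the example runs without calibre, and register_html5_tags is an illustrative helper of my own. As far as I can tell, BeautifulSoup 3 builds the two lookup tables from the block tag list once, when the class is defined, which would explain why all three attributes need updating when the change is made at runtime.

```python
class StandInSoup:
    """Hypothetical stand-in mirroring three BeautifulSoup 3 class attributes."""
    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
    # Built once from the list above, at class-definition time (assumption
    # about how the real class behaves; simplified here).
    NESTABLE_TAGS = {tag: [] for tag in NESTABLE_BLOCK_TAGS}
    RESET_NESTING_TAGS = {tag: None for tag in NESTABLE_BLOCK_TAGS}


HTML5_BLOCK_TAGS = ['article', 'aside', 'header', 'footer', 'nav',
                    'figcaption', 'figure', 'section']


def register_html5_tags(parser_cls, tags=HTML5_BLOCK_TAGS):
    """Make parser_cls treat the given tags like <div>, without duplicates."""
    for tag in tags:
        # Guard the append so repeated calls do not grow the list.
        if tag not in parser_cls.NESTABLE_BLOCK_TAGS:
            parser_cls.NESTABLE_BLOCK_TAGS.append(tag)
        # Dict assignments are naturally idempotent.
        parser_cls.RESET_NESTING_TAGS[tag] = None
        parser_cls.NESTABLE_TAGS[tag] = []


# Calling it twice leaves a single entry per tag.
register_html5_tags(StandInSoup)
register_html5_tags(StandInSoup)
print(StandInSoup.NESTABLE_BLOCK_TAGS.count('figure'))  # 1
```

With the real class, the call would be register_html5_tags(BeautifulSoup) after the import shown above.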
Old 08-09-2012, 01:08 AM   #8
kovidgoyal
creator of calibre
I can add that to calibre for the next release. Just to make sure I get it right, the patch needed is:

Code:
=== modified file 'src/calibre/ebooks/BeautifulSoup.py'
--- src/calibre/ebooks/BeautifulSoup.py 2010-04-17 16:37:28 +0000
+++ src/calibre/ebooks/BeautifulSoup.py 2012-08-09 05:06:42 +0000
@@ -1454,7 +1454,8 @@
     #According to the HTML standard, these block tags can contain
     #another tag of the same type. Furthermore, it's common
     #to actually use these tags this way.
-    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
+    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del',
+            'article', 'aside', 'header', 'footer', 'nav', 'figcaption', 'figure', 'section']
 
     #Lists can contain other lists, but there are restrictions.
     NESTABLE_LIST_TAGS = { 'ol' : [],
Old 08-09-2012, 09:50 AM   #9
nickredding
onlinenewsreader.net
Thanks. That's going to avoid issues for recipes as a lot of newspapers are using HTML5 these days.
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.