03-31-2011, 03:38 PM   #1
oneillpt
use_embedded_content, encoding, a warning illustrated by recipe for inter.it

This post follows from another thread (Reciper for www.inter.it - some letters are wrong!) at https://www.mobileread.com/forums/sho...d.php?t=126886, where the encoding needed for inter.it is discussed. As the setting for use_embedded_content also plays a role in creating this recipe, and raises an issue which may be relevant for other recipes, this short post discusses the embedded content issue and summarises the encoding issue as a single post in a new thread.

The content for inter.it can be used to create an e-book either directly from embedded content (use_embedded_content = True):

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1300997108(BasicNewsRecipe):
    title          = u'Inter'
    encoding       = 'UTF-8'   # matches the embedded feed content
    use_embedded_content = True
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
    # One list for both filters: a second remove_tags assignment would replace the first
    remove_tags    = [dict(name='div', attrs={'class':'piccolowww'}),
                      dict(name='span', attrs={'style':'padding-left:120px;'})]


or by following links to individual pages (use_embedded_content = False):

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1300997108(BasicNewsRecipe):
    title          = u'Inter'
    encoding       = 'Windows-1252'   # needed for the individual article pages
    use_embedded_content = False
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
    # One list for both filters: a second remove_tags assignment would replace the first
    remove_tags    = [dict(name='div', attrs={'class':'piccolowww'}),
                      dict(name='span', attrs={'style':'padding-left:120px;'})]


While checking this e-book over a number of days to verify which encoding should be used, before I had added an explicit value for use_embedded_content, I noticed that the default guessing uses the embedded content on some days and follows the links to individual pages on other days. As the two cases need different encodings (and the content also differs slightly), it is important to set use_embedded_content explicitly to True or False, and to specify the corresponding encoding, rather than rely on the default guessing. Only then is the output consistent, with the correct content and no invalid or incorrect characters.

And so the warning: when embedded content is available, it may be necessary to specify use_embedded_content to obtain the correct output.
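
To see why a mismatched encoding produces wrong letters, here is a minimal sketch in plain Python, independent of Calibre. The sample string is made up for illustration and is not taken from inter.it. It decodes the same UTF-8 bytes with each of the two encodings used in the recipes above:

```python
# Hypothetical sample text, invented for this illustration.
sample = "Perché l'Inter è in città"

# Bytes as a UTF-8 source (such as the embedded feed content) would deliver them.
utf8_bytes = sample.encode("utf-8")

# Decoding with the matching encoding round-trips cleanly.
right = utf8_bytes.decode("utf-8")

# Decoding the same bytes as Windows-1252 turns each accented letter
# into two wrong characters.
wrong = utf8_bytes.decode("windows-1252")

print(right)
print(wrong)
```

If the guessing switches between embedded content and page links from day to day while the encoding stays fixed, this is the kind of damage that can appear on the days when the two do not match.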

Examination of the two recipe versions above shows that UTF-8 encoding (the default, which need not be explicitly specified) is appropriate when the embedded content is used, but not when links to individual pages are followed, where encoding = 'Windows-1252' has been specified. For these individual pages Firefox Page Info reports the encoding as ISO-8859-1. Windows-1252 has been specified instead because it includes every printable character found in ISO-8859-1 and adds some characters, such as the Euro symbol (€), which are missing from ISO-8859-1. ISO-8859-1 is often declared as the encoding for pages which contain these added characters, and which are thus in fact Windows-1252.

The Euro symbol is also available in ISO-8859-15, but there it replaces another character found in ISO-8859-1. Since inter.it has used the HTML entity, &euro;, rather than a single character in the pages I have seen so far, it is possible that the correct encoding is ISO-8859-15 rather than Windows-1252. I will add a further post here if I can confirm the correct encoding by spotting, in future content, a character such as the Euro symbol which differs between the two encodings.
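
The difference between the three single-byte encodings can be checked with Python's standard codecs. This is only a sketch; the two bytes are chosen because they are where the Euro symbol sits in Windows-1252 (0x80) and in ISO-8859-15 (0xA4), and the variable names are mine:

```python
# 0x80 is the Euro sign in Windows-1252 but a control character in the ISO encodings.
b80 = b"\x80"
# 0xA4 is the Euro sign in ISO-8859-15 but the generic currency sign in the other two.
ba4 = b"\xa4"

decoded = {}
for name in ("iso-8859-1", "windows-1252", "iso-8859-15"):
    # Store what each byte becomes under this encoding.
    decoded[name] = (b80.decode(name), ba4.decode(name))
    print(name, repr(decoded[name][0]), repr(decoded[name][1]))
```

A page containing either of these bytes would therefore settle which encoding is really in use, whereas a page that writes the Euro symbol as an HTML entity decodes identically under all three.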

It may be worth noting that news content from the Euro zone is likely to use the Euro symbol, and so, if not UTF-8, is likely to be encoded as Windows-1252 or ISO-8859-15, rather than ISO-8859-1, which does not include the Euro symbol.
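
As a rough check of this point, the sketch below asks which of the three encodings can represent a given piece of text at all. The helper name encodings_that_fit and the sample strings are hypothetical, made up for this illustration:

```python
def encodings_that_fit(text, candidates=("iso-8859-1", "windows-1252", "iso-8859-15")):
    """Return the candidate encodings able to represent every character in text."""
    fitting = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character has no slot in this encoding.
            continue
        fitting.append(name)
    return fitting

with_euro = encodings_that_fit("Prezzo: 25 €")
without_euro = encodings_that_fit("Perché")
print(with_euro)
print(without_euro)
```

Text containing the Euro symbol rules out ISO-8859-1, while text with only accented letters fits all three and so gives no hint about which one the site actually uses.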

(In the case of inter.it, the date and time for each article, which was the content that gave rise to the initial question regarding encoding, appears only when the page links are followed, so use_embedded_content = False is needed to obtain this item.)
Tags
encoding, inter.it, use_embedded_content

