07-20-2011, 08:45 PM   #5
oneillpt
Device: Kindle 3 (cracked screen!); PW1; Oasis
Quote:
Originally Posted by tylau0
This plugin overrides the default mobi periodical generation routine with another that makes use of the kindlegen program available at Amazon.com
A plugin or updated mobi generation is clearly the way to go, as this will allow automatic uploading on connection of the Kindle.

In case you are continuing to work on this while waiting for a Calibre update with the desired mobi indexing, or the feedback is useful for any other projects you have under way: I have tested the plugin with six recipes for which Calibre generates correct epub and mobi output (although of course without proper back-button behaviour), using both the current and the previous version of kindlegen. Unfortunately the recipe I included to test the masthead was one of those which failed, so I could just as well have tested with the current version only.

Four of the six recipes generated azw output files; the other two failed. Of the four which produced azw output, two had correct back-button behaviour. The other two produced azw files which could be viewed with Kindle for PC, but which, opened on the Kindle itself, showed a table of contents together with a message box: "The selected item could not be opened. If you purchased this item from Amazon, delete the item and download it from Archived Items." More comments on this below.

I tested using the command-line ebook-convert with "--test -vv --debug-pipeline" to generate small e-books, producing epub, azw and mobi versions to compare. In one case the azw output lost some text from the first of the four articles extracted, compared with the epub and mobi versions:

The recipe used was:
Spoiler:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

class IrishTimes(BasicNewsRecipe):
    title          = u'The Irish TimesAZW'
    encoding  = 'ISO-8859-1'
    __author__    = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns"
    language = 'en_IE'
    timefmt = ' (%A, %B %d, %Y)'


    oldest_article = 1.0
    max_articles_per_feed  = 100
    no_stylesheets = True
    simultaneous_downloads= 5

    # Match either an irishtimes.com or a feedsportal.com/c article URL
    r = re.compile(r'.*(?P<url>http://(www\.irishtimes\.com|rss\.feedsportal\.com/c)/.*\.html?).*')
    remove_tags    = [dict(name='div', attrs={'class':'footer'})]
    extra_css      = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em } .headline {font-size: large;} \n .fact { padding-top: 10pt  }'

    feeds          = [
                      ('Frontpage', 'http://www.irishtimes.com/feeds/rss/newspaper/index.rss'),
                      ('Ireland', 'http://www.irishtimes.com/feeds/rss/newspaper/ireland.rss'),
                      ('World', 'http://www.irishtimes.com/feeds/rss/newspaper/world.rss'),
                      ('Finance', 'http://www.irishtimes.com/feeds/rss/newspaper/finance.rss'),
                      ('Features', 'http://www.irishtimes.com/feeds/rss/newspaper/features.rss'),
                      ('Sport', 'http://www.irishtimes.com/feeds/rss/newspaper/sport.rss'),
                      ('Opinion', 'http://www.irishtimes.com/feeds/rss/newspaper/opinion.rss'),
                      ('Letters', 'http://www.irishtimes.com/feeds/rss/newspaper/letters.rss'),
                      ('Magazine', 'http://www.irishtimes.com/feeds/rss/newspaper/magazine.rss'),
                      ('Health', 'http://www.irishtimes.com/feeds/rss/newspaper/health.rss'),
                      ('Education & Parenting', 'http://www.irishtimes.com/feeds/rss/newspaper/education.rss'),
                      ('Motors', 'http://www.irishtimes.com/feeds/rss/newspaper/motors.rss'),
                      ('An Teanga Bheo', 'http://www.irishtimes.com/feeds/rss/newspaper/anteangabheo.rss'),
                      ('Commercial Property', 'http://www.irishtimes.com/feeds/rss/newspaper/commercialproperty.rss'),
                      ('Science Today', 'http://www.irishtimes.com/feeds/rss/newspaper/sciencetoday.rss'),
                      ('Property', 'http://www.irishtimes.com/feeds/rss/newspaper/property.rss'),
                      ('The Tickets', 'http://www.irishtimes.com/feeds/rss/newspaper/theticket.rss'),
                      ('Weekend', 'http://www.irishtimes.com/feeds/rss/newspaper/weekend.rss'),
                      ('News features', 'http://www.irishtimes.com/feeds/rss/newspaper/newsfeatures.rss'),
                      ('Obituaries', 'http://www.irishtimes.com/feeds/rss/newspaper/obituaries.rss'),
                    ]


    def print_version(self, url):
        if url.count('rss.feedsportal.com'):
            u = url.replace('0Bhtml/story01.htm','_pf0Bhtml/story01.htm')
        else:
            u = url.replace('.html','_pf.html')
        return u

    def get_article_url(self, article):
        return article.link
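As an aside, the print_version hook in this recipe just rewrites each feed URL to its printer-friendly variant. A standalone sketch of that rewriting (the URLs below are illustrative, not taken from a real feed):

```python
# Standalone sketch of the URL rewriting done by print_version above.
def to_print_url(url):
    if 'rss.feedsportal.com' in url:
        # feedsportal links encode the path, so the '_pf' marker is
        # spliced in before the encoded '.html' suffix
        return url.replace('0Bhtml/story01.htm', '_pf0Bhtml/story01.htm')
    # plain irishtimes.com links just get a '_pf' suffix on the page name
    return url.replace('.html', '_pf.html')

# illustrative URL, not a real article
print(to_print_url('http://www.irishtimes.com/newspaper/world/2011/0720/story.html'))
# -> http://www.irishtimes.com/newspaper/world/2011/0720/story_pf.html
```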


The second recipe which produced a usable azw file (no loss of text was noticed in this case, though it remains possible, of course, when more articles are extracted) was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1301251451(BasicNewsRecipe):
    title          = u'Depeche du MidiAZW'
    encoding  = 'Windows-1252'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_javascript     = True
    keep_only_tags = [dict(name='div', attrs={'class':'article'})]
    remove_tags_after = [dict(name='iframe', attrs={'scrolling':'no'})]

    feeds          = [(u'Accueil', u'http://www.ladepeche.fr/rss/39.rss'),
	(u'Ariege', u'http://www.ladepeche.fr/rss/63.rss'), 
	(u'Aude', u'http://www.ladepeche.fr/rss/64.rss'), 
	(u'Haute-Garonne', u'http://www.ladepeche.fr/rss/66.rss'), 
	(u'Lot', u'http://www.ladepeche.fr/rss/68.rss'), 
	(u'Hautes-Pyrenees', u'http://www.ladepeche.fr/rss/70.rss'), 
	(u'Pyrenees', u'http://www.ladepeche.fr/rss/484.rss'), 
	(u'Actu', u'http://www.ladepeche.fr/rss/75.rss'), 
	(u'A la Une', u'http://www.ladepeche.fr/rss/76.rss'), 
	(u"L'evenement", u'http://www.ladepeche.fr/rss/77.rss'), 
	(u'France', u'http://www.ladepeche.fr/rss/164.rss'), 
	(u'Monde', u'http://www.ladepeche.fr/rss/165.rss'), 
	(u'Faits divers', u'http://www.ladepeche.fr/rss/167.rss'), 
	(u'Insolite', u'http://www.ladepeche.fr/rss/168.rss'), 
	(u'Politique', u'http://www.ladepeche.fr/rss/171.rss'), 
	(u'High Tech / Sciences', u'http://www.ladepeche.fr/rss/389.rss'), 
	(u'Sortir a', u'http://www.ladepeche.fr/rss/83.rss'), 
	(u'Meteo', u'http://www.ladepeche.fr/rss/100.rss')
	]


The third recipe which produced an azw file, with the problems described above, was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1311043192(BasicNewsRecipe):
    title          = u'AvuiAZW'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Avui', u'http://www.avui.cat/puigcerda/nacional.feed?type=rss')]
    
    keep_only_tags = [dict(name='div', attrs={'id':'article-complet'})]
    remove_tags = [dict(name='div', attrs={'class':['botonera']})]

This recipe at first failed to produce an azw file at all, as an initial version returned the complete page. The faulty azw file was only generated once keep_only_tags and remove_tags were added to restrict the text extracted. With nickredding's code I found that more azw files were generated, but the extra azw files (beyond the first two which worked here) were also faulty and showed the same message box.

The fourth recipe which produced a faulty azw file was:
Spoiler:
Code:
__license__  = 'GPL v3'
__copyright__ = '2011 Phil Burns'
'''
TheJournal.ie
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class TheJournal(BasicNewsRecipe):

    __author__             = 'Phil Burns'
    title                  = u'TheJournal.ieAZW'
    oldest_article        = 1
    max_articles_per_feed  = 100
    encoding              = 'utf8'
    language              = 'en_IE'
    timefmt                = ' (%A, %B %d, %Y)'

    no_stylesheets        = True
    remove_tags            = [dict(name='div', attrs={'class':'footer'}),
                          dict(name=['script', 'noscript'])]

    extra_css              = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em }'

    feeds                  = [
                          (u'Latest News', u'http://www.thejournal.ie/feed/')]


The two recipes which completely failed were:
Spoiler:
Code:
import re
from calibre import strftime
from time import gmtime
from calibre.web.feeds.news import BasicNewsRecipe

class HaaretzPrint_en(BasicNewsRecipe):
    title                 = 'Haaretz - print editAZW'
    __author__            = 'Darko Miletic'
    description           = "Haaretz.com is the world's leading English-language Website for real-time news and analysis of Israel and the Middle East."
    publisher             = 'Haaretz'
    category              = "news, Haaretz, Israel news, Israel newspapers, Israel business news, Israel financial news, Israeli news,Israeli newspaper, Israeli newspapers, news from Israel, news in Israel, news Israel, news on Israel, newspaper Israel, Israel sports news, Israel diplomacy news"
    oldest_article        = 2
    max_articles_per_feed = 25
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    language              = 'en_IL'
    publication_type      = 'newspaper'
    PREFIX                = 'http://www.haaretz.com'
    masthead_url          = PREFIX + '/images/logos/logoGrey.gif'
    extra_css             = ' body{font-family: Verdana,Arial,Helvetica,sans-serif } '

    preprocess_regexps = [(re.compile(r'</body>.*?</html>', re.DOTALL|re.IGNORECASE),lambda match: '</body></html>')]

    conversion_options = {
                          'comment'  : description
                        , 'tags'     : category
                        , 'publisher': publisher
                        , 'language' : language
                        }

    keep_only_tags    = [dict(attrs={'id':'threecolumns'})]
    remove_attributes = ['width','height']
    remove_tags       = [
                           dict(name=['iframe','link','object','embed'])
                          ,dict(name='div',attrs={'class':'rightcol'})
                        ]


    feeds = [
              (u'News'          , PREFIX + u'/print-edition/news'         )
             ,(u'Opinion'       , PREFIX + u'/print-edition/opinion'      )
             ,(u'International', PREFIX + u'/news/international'      )
             ,(u'Defense and Diplomacy', PREFIX + u'/news/diplomacy-defense'      )
             ,(u'Features'      , PREFIX + u'/print-edition/features'     )
             ,(u'Business'      , PREFIX + u'/print-edition/business'     )
             ,(u'Real estate'   , PREFIX + u'/print-edition/real-estate'  )
             ,(u'Sports'        , PREFIX + u'/print-edition/sports'       )
             ,(u'Travel'        , PREFIX + u'/print-edition/travel'       )
             ,(u'Books'         , PREFIX + u'/print-edition/books'        )
             ,(u'Food & Wine'   , PREFIX + u'/print-edition/food-wine'    )
             ,(u'Arts & Leisure', PREFIX + u'/print-edition/arts-leisure' )
             #,(u'A Special Place in Hell', PREFIX + u'/blogs/a-special-place-in-hell'     )
             #,(u'Strenger than Fiction', PREFIX + u'/blogs/strenger-than-fiction'     )
             #,(u'MESS Report'      , PREFIX + u'/blogs/mess-report'     )
            ]


    def print_version(self, url):
        article = url.rpartition('/')[2]
        return 'http://www.haaretz.com/misc/article-print-page/' + article

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll(attrs={'class':'text'}):
                sp = item.find('span',attrs={'class':'h3 font-weight-normal'})
                desc = item.find('p')
                description = ''
                if sp:
                    if desc:
                       description = self.tag_to_string(desc)
                    link        = sp.a
                    url         = self.PREFIX + link['href']
                    title       = self.tag_to_string(link)
                    times        = strftime('%a, %d %b %Y %H:%M:%S +0000',gmtime())
                    articles.append({
                                          'title'      :title
                                         ,'date'       :times
                                         ,'url'        :url
                                         ,'description':description
                                        })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds


    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        return soup
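As an aside on the Haaretz recipe above: its print_version simply takes the last URL segment and appends it to the print-page base. Traced by hand with an illustrative (made-up) article slug:

```python
# Sketch of the Haaretz print_version mapping above (illustrative slug,
# not a real article URL).
url = 'http://www.haaretz.com/print-edition/news/example-article-1.374183'
article = url.rpartition('/')[2]   # everything after the last '/'
print('http://www.haaretz.com/misc/article-print-page/' + article)
# -> http://www.haaretz.com/misc/article-print-page/example-article-1.374183
```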

This recipe could have tested the masthead with kindlegen 1.1, had it generated any output. The second failing recipe was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1311083909(BasicNewsRecipe):
    title          = u'DiarioAltoAragonAZW'
    oldest_article = 7
    max_articles_per_feed = 101

    feeds          = [(u'Portada', u'http://www.diariodelaltoaragon.es/rss.aspx')]
    
    keep_only_tags = [dict(name='div', attrs={'id':'bloquenoticia'})]
    remove_tags = [
       dict(name='div', attrs={'id':['imagen_sin_bordes', 'ctl00ContentPlaceHolder1_pnPopUp', 
          'ctl00ContentPlaceHolder1_divGoogle', 'ctl00_ContentPlaceHolder1_UpdatePanelVotos']}),
       dict(name='iframe'),
       dict(name='a', attrs={'id':['click']}),
       dict(name='a', attrs={'class':['twitter-share-button']})
    ]


As all six recipes produced epub and mobi versions, my suspicion is that the problem lies in the HTML extraction: either Calibre removes content which would prove problematic but which is left in here (the lost text with the first recipe suggests that comparing the HTML extracted by Calibre with the HTML extracted here could be useful; I will report if I find anything of interest in this respect), or kindlegen is simply more sensitive to unwanted or unsupported HTML than ebook-convert. kindlegen seems to be based on MobiPocket's mobigen, which I called without difficulty in my own extended version of the MobiPocket Web Companion (I continued to develop and use it after Amazon bought MobiPocket and dropped the Web Companion, until I bought a Kindle in January and started to use Calibre for news generation). I am therefore more inclined to suspect that something in the HTML passed to kindlegen causes the failure: five of these six recipes are for publications which extracted without difficulty when I used mobigen in my own software.
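To follow up on the comparison idea, here is a minimal sketch of a helper I might use (my own code, not part of Calibre; the directory layout is whatever two --debug-pipeline runs leave behind) to diff the HTML the two pipelines produce:

```python
# Hypothetical helper (not part of Calibre) for comparing the HTML files
# left behind by two ebook-convert --debug-pipeline runs, to spot content
# one pipeline strips and the other keeps.
import difflib
from pathlib import Path

def diff_debug_html(dir_a, dir_b):
    """Return unified-diff lines for same-named .html files in two trees."""
    lines = []
    root_a, root_b = Path(dir_a), Path(dir_b)
    for file_a in sorted(root_a.rglob('*.html')):
        file_b = root_b / file_a.relative_to(root_a)
        if not file_b.exists():
            lines.append('only in %s: %s' % (dir_a, file_a.name))
            continue
        lines.extend(difflib.unified_diff(
            file_a.read_text(errors='replace').splitlines(),
            file_b.read_text(errors='replace').splitlines(),
            fromfile=str(file_a), tofile=str(file_b), lineterm=''))
    return lines
```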