07-20-2011, 08:45 PM   #5
oneillpt
Device: Kindle 3 (cracked screen!); PW1; Oasis
Quote:
Originally Posted by tylau0
This plugin overrides the default mobi periodical generation routine with another that makes use of the kindlegen program available at Amazon.com
A plugin or updated mobi generation is clearly the way to go, as this will allow automatic uploading on connection of the Kindle.

In case you are continuing to work on this while waiting for a Calibre update with the desired mobi indexing, or the feedback is useful for any other projects you have under way: I have tested the plugin with six recipes for which Calibre generates correct epub and mobi output (although of course without proper back-button behaviour), using both the current and the previous version of kindlegen. Unfortunately the recipe I included to test the masthead was one of those which failed, so I could just as well have tested with the current version only.

Four of the six recipes generated azw output files; the other two failed. Of the four which produced azw output, two had correct back-button behaviour. The other two produced azw files which could be viewed with Kindle for PC, but which, opened on the Kindle itself, showed a table of contents together with a message box: "The selected item could not be opened. If you purchased this item from Amazon, delete the item and download it from Archived Items." More comments on this below.

I tested using the command-line ebook-convert with "--test -vv --debug-pipeline" to generate small e-books, producing epub, azw and mobi versions to compare. In one case the azw output lost some text from the first of the four articles extracted, compared with the epub and mobi versions:

The recipe used was:
Spoiler:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe

class IrishTimes(BasicNewsRecipe):
    title          = u'The Irish TimesAZW'
    encoding  = 'ISO-8859-1'
    __author__    = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns"
    language = 'en_IE'
    timefmt = ' (%A, %B %d, %Y)'


    oldest_article = 1.0
    max_articles_per_feed  = 100
    no_stylesheets = True
    simultaneous_downloads= 5

    # Match either an irishtimes.com or a feedsportal.com/c article URL
    r = re.compile(r'.*(?P<url>http://(www\.irishtimes\.com|rss\.feedsportal\.com/c)/.*\.html?).*')
    remove_tags    = [dict(name='div', attrs={'class':'footer'})]
    extra_css      = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em } .headline {font-size: large;} \n .fact { padding-top: 10pt  }'

    feeds          = [
                      ('Frontpage', 'http://www.irishtimes.com/feeds/rss/newspaper/index.rss'),
                      ('Ireland', 'http://www.irishtimes.com/feeds/rss/newspaper/ireland.rss'),
                      ('World', 'http://www.irishtimes.com/feeds/rss/newspaper/world.rss'),
                      ('Finance', 'http://www.irishtimes.com/feeds/rss/newspaper/finance.rss'),
                      ('Features', 'http://www.irishtimes.com/feeds/rss/newspaper/features.rss'),
                      ('Sport', 'http://www.irishtimes.com/feeds/rss/newspaper/sport.rss'),
                      ('Opinion', 'http://www.irishtimes.com/feeds/rss/newspaper/opinion.rss'),
                      ('Letters', 'http://www.irishtimes.com/feeds/rss/newspaper/letters.rss'),
                      ('Magazine', 'http://www.irishtimes.com/feeds/rss/newspaper/magazine.rss'),
                      ('Health', 'http://www.irishtimes.com/feeds/rss/newspaper/health.rss'),
                      ('Education & Parenting', 'http://www.irishtimes.com/feeds/rss/newspaper/education.rss'),
                      ('Motors', 'http://www.irishtimes.com/feeds/rss/newspaper/motors.rss'),
                      ('An Teanga Bheo', 'http://www.irishtimes.com/feeds/rss/newspaper/anteangabheo.rss'),
                      ('Commercial Property', 'http://www.irishtimes.com/feeds/rss/newspaper/commercialproperty.rss'),
                      ('Science Today', 'http://www.irishtimes.com/feeds/rss/newspaper/sciencetoday.rss'),
                      ('Property', 'http://www.irishtimes.com/feeds/rss/newspaper/property.rss'),
                      ('The Tickets', 'http://www.irishtimes.com/feeds/rss/newspaper/theticket.rss'),
                      ('Weekend', 'http://www.irishtimes.com/feeds/rss/newspaper/weekend.rss'),
                      ('News features', 'http://www.irishtimes.com/feeds/rss/newspaper/newsfeatures.rss'),
                      ('Obituaries', 'http://www.irishtimes.com/feeds/rss/newspaper/obituaries.rss'),
                    ]


    def print_version(self, url):
        if url.count('rss.feedsportal.com'):
            u = url.replace('0Bhtml/story01.htm','_pf0Bhtml/story01.htm')
        else:
            u = url.replace('.html','_pf.html')
        return u

    def get_article_url(self, article):
        return article.link
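As an aside, the print_version hook in this recipe just rewrites each feed URL to its printer-friendly variant. A standalone sketch of that rewriting (the URLs below are illustrative, not taken from a real feed):

```python
# Standalone sketch of the URL rewriting done by print_version above.
def to_print_url(url):
    if 'rss.feedsportal.com' in url:
        # feedsportal links encode the path, so the '_pf' marker is
        # spliced in before the encoded '.html' suffix
        return url.replace('0Bhtml/story01.htm', '_pf0Bhtml/story01.htm')
    # plain irishtimes.com links just get a '_pf' suffix on the page name
    return url.replace('.html', '_pf.html')

# illustrative URL, not a real article
print(to_print_url('http://www.irishtimes.com/newspaper/world/2011/0720/story.html'))
# -> http://www.irishtimes.com/newspaper/world/2011/0720/story_pf.html
```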


The second recipe which produced a usable azw file (no loss of text was noticed in this case, though it remains possible, of course, when more articles are extracted) was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1301251451(BasicNewsRecipe):
    title          = u'Depeche du MidiAZW'
    encoding  = 'Windows-1252'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_javascript     = True
    keep_only_tags = [dict(name='div', attrs={'class':'article'})]
    remove_tags_after = [dict(name='iframe', attrs={'scrolling':'no'})]

    feeds          = [(u'Accueil', u'http://www.ladepeche.fr/rss/39.rss'),
	(u'Ariege', u'http://www.ladepeche.fr/rss/63.rss'), 
	(u'Aude', u'http://www.ladepeche.fr/rss/64.rss'), 
	(u'Haute-Garonne', u'http://www.ladepeche.fr/rss/66.rss'), 
	(u'Lot', u'http://www.ladepeche.fr/rss/68.rss'), 
	(u'Hautes-Pyrenees', u'http://www.ladepeche.fr/rss/70.rss'), 
	(u'Pyrenees', u'http://www.ladepeche.fr/rss/484.rss'), 
	(u'Actu', u'http://www.ladepeche.fr/rss/75.rss'), 
	(u'A la Une', u'http://www.ladepeche.fr/rss/76.rss'), 
	(u"L'evenement", u'http://www.ladepeche.fr/rss/77.rss'), 
	(u'France', u'http://www.ladepeche.fr/rss/164.rss'), 
	(u'Monde', u'http://www.ladepeche.fr/rss/165.rss'), 
	(u'Faits divers', u'http://www.ladepeche.fr/rss/167.rss'), 
	(u'Insolite', u'http://www.ladepeche.fr/rss/168.rss'), 
	(u'Politique', u'http://www.ladepeche.fr/rss/171.rss'), 
	(u'High Tech / Sciences', u'http://www.ladepeche.fr/rss/389.rss'), 
	(u'Sortir a', u'http://www.ladepeche.fr/rss/83.rss'), 
	(u'Meteo', u'http://www.ladepeche.fr/rss/100.rss')
	]


The third recipe which produced an azw file, with the problems described above, was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1311043192(BasicNewsRecipe):
    title          = u'AvuiAZW'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Avui', u'http://www.avui.cat/puigcerda/nacional.feed?type=rss')]
    
    keep_only_tags = [dict(name='div', attrs={'id':'article-complet'})]
    remove_tags = [dict(name='div', attrs={'class':['botonera']})]

This recipe at first failed to produce an azw file at all, as an initial version returned the complete page. The faulty azw file was only generated once keep_only_tags and remove_tags were added to restrict the text extracted. With nickredding's code I found that more azw files were generated, but the extra azw files (beyond the first two which worked here) were also faulty and showed the same message box.

The fourth recipe which produced a faulty azw file was:
Spoiler:
Code:
__license__  = 'GPL v3'
__copyright__ = '2011 Phil Burns'
'''
TheJournal.ie
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class TheJournal(BasicNewsRecipe):

    __author__             = 'Phil Burns'
    title                  = u'TheJournal.ieAZW'
    oldest_article        = 1
    max_articles_per_feed  = 100
    encoding              = 'utf8'
    language              = 'en_IE'
    timefmt                = ' (%A, %B %d, %Y)'

    no_stylesheets        = True
    remove_tags            = [dict(name='div', attrs={'class':'footer'}),
                          dict(name=['script', 'noscript'])]

    extra_css              = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em }'

    feeds                  = [
                          (u'Latest News', u'http://www.thejournal.ie/feed/')]


The two recipes which completely failed were:
Spoiler:
Code:
import re
from calibre import strftime
from time import gmtime
from calibre.web.feeds.news import BasicNewsRecipe

class HaaretzPrint_en(BasicNewsRecipe):
    title                 = 'Haaretz - print editAZW'
    __author__            = 'Darko Miletic'
    description           = "Haaretz.com is the world's leading English-language Website for real-time news and analysis of Israel and the Middle East."
    publisher             = 'Haaretz'
    category              = "news, Haaretz, Israel news, Israel newspapers, Israel business news, Israel financial news, Israeli news,Israeli newspaper, Israeli newspapers, news from Israel, news in Israel, news Israel, news on Israel, newspaper Israel, Israel sports news, Israel diplomacy news"
    oldest_article        = 2
    max_articles_per_feed = 25
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    language              = 'en_IL'
    publication_type      = 'newspaper'
    PREFIX                = 'http://www.haaretz.com'
    masthead_url          = PREFIX + '/images/logos/logoGrey.gif'
    extra_css             = ' body{font-family: Verdana,Arial,Helvetica,sans-serif } '

    preprocess_regexps = [(re.compile(r'</body>.*?</html>', re.DOTALL|re.IGNORECASE),lambda match: '</body></html>')]

    conversion_options = {
                          'comment'  : description
                        , 'tags'     : category
                        , 'publisher': publisher
                        , 'language' : language
                        }

    keep_only_tags    = [dict(attrs={'id':'threecolumns'})]
    remove_attributes = ['width','height']
    remove_tags       = [
                           dict(name=['iframe','link','object','embed'])
                          ,dict(name='div',attrs={'class':'rightcol'})
                        ]


    feeds = [
              (u'News'          , PREFIX + u'/print-edition/news'         )
             ,(u'Opinion'       , PREFIX + u'/print-edition/opinion'      )
             ,(u'International', PREFIX + u'/news/international'      )
             ,(u'Defense and Diplomacy', PREFIX + u'/news/diplomacy-defense'      )
             ,(u'Features'      , PREFIX + u'/print-edition/features'     )
             ,(u'Business'      , PREFIX + u'/print-edition/business'     )
             ,(u'Real estate'   , PREFIX + u'/print-edition/real-estate'  )
             ,(u'Sports'        , PREFIX + u'/print-edition/sports'       )
             ,(u'Travel'        , PREFIX + u'/print-edition/travel'       )
             ,(u'Books'         , PREFIX + u'/print-edition/books'        )
             ,(u'Food & Wine'   , PREFIX + u'/print-edition/food-wine'    )
             ,(u'Arts & Leisure', PREFIX + u'/print-edition/arts-leisure' )
             #,(u'A Special Place in Hell', PREFIX + u'/blogs/a-special-place-in-hell'     )
             #,(u'Strenger than Fiction', PREFIX + u'/blogs/strenger-than-fiction'     )
             #,(u'MESS Report'      , PREFIX + u'/blogs/mess-report'     )
            ]


    def print_version(self, url):
        article = url.rpartition('/')[2]
        return 'http://www.haaretz.com/misc/article-print-page/' + article

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll(attrs={'class':'text'}):
                sp = item.find('span',attrs={'class':'h3 font-weight-normal'})
                desc = item.find('p')
                description = ''
                if sp:
                    if desc:
                       description = self.tag_to_string(desc)
                    link        = sp.a
                    url         = self.PREFIX + link['href']
                    title       = self.tag_to_string(link)
                    times        = strftime('%a, %d %b %Y %H:%M:%S +0000',gmtime())
                    articles.append({
                                          'title'      :title
                                         ,'date'       :times
                                         ,'url'        :url
                                         ,'description':description
                                        })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds


    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        return soup
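As an aside on the Haaretz recipe above: its print_version simply takes the last URL segment and appends it to the print-page base. Traced by hand with an illustrative (made-up) article slug:

```python
# Sketch of the Haaretz print_version mapping above (illustrative slug,
# not a real article URL).
url = 'http://www.haaretz.com/print-edition/news/example-article-1.374183'
article = url.rpartition('/')[2]   # everything after the last '/'
print('http://www.haaretz.com/misc/article-print-page/' + article)
# -> http://www.haaretz.com/misc/article-print-page/example-article-1.374183
```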

This recipe could have tested the masthead with kindlegen 1.1, had it generated any output. The second failing recipe was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1311083909(BasicNewsRecipe):
    title          = u'DiarioAltoAragonAZW'
    oldest_article = 7
    max_articles_per_feed = 101

    feeds          = [(u'Portada', u'http://www.diariodelaltoaragon.es/rss.aspx')]
    
    keep_only_tags = [dict(name='div', attrs={'id':'bloquenoticia'})]
    remove_tags = [
       dict(name='div', attrs={'id':['imagen_sin_bordes', 'ctl00ContentPlaceHolder1_pnPopUp', 
          'ctl00ContentPlaceHolder1_divGoogle', 'ctl00_ContentPlaceHolder1_UpdatePanelVotos']}),
       dict(name='iframe'),
       dict(name='a', attrs={'id':['click']}),
       dict(name='a', attrs={'class':['twitter-share-button']})
    ]


As all six recipes produced epub and mobi versions, my suspicion is that the problem lies in the HTML extraction: either Calibre removes content which would prove problematic but which is left in here (the lost text with the first recipe suggests that comparing the HTML extracted by Calibre with the HTML extracted here could be useful; I will report if I find anything of interest in this respect), or kindlegen is simply more sensitive to unwanted or unsupported HTML than ebook-convert. kindlegen seems to be based on MobiPocket's mobigen, which I called without difficulty in my own extended version of the MobiPocket Web Companion (I continued to develop and use it after Amazon bought MobiPocket and dropped the Web Companion, until I bought a Kindle in January and started to use Calibre for news generation). I am therefore more inclined to suspect that something in the HTML passed to kindlegen causes the failure: five of these six recipes are for publications which extracted without difficulty when I used mobigen in my own software.
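To follow up on the comparison idea, here is a minimal sketch of a helper I might use (my own code, not part of Calibre; the directory layout is whatever two --debug-pipeline runs leave behind) to diff the HTML the two pipelines produce:

```python
# Hypothetical helper (not part of Calibre) for comparing the HTML files
# left behind by two ebook-convert --debug-pipeline runs, to spot content
# one pipeline strips and the other keeps.
import difflib
from pathlib import Path

def diff_debug_html(dir_a, dir_b):
    """Return unified-diff lines for same-named .html files in two trees."""
    lines = []
    root_a, root_b = Path(dir_a), Path(dir_b)
    for file_a in sorted(root_a.rglob('*.html')):
        file_b = root_b / file_a.relative_to(root_a)
        if not file_b.exists():
            lines.append('only in %s: %s' % (dir_a, file_a.name))
            continue
        lines.extend(difflib.unified_diff(
            file_a.read_text(errors='replace').splitlines(),
            file_b.read_text(errors='replace').splitlines(),
            fromfile=str(file_a), tofile=str(file_b), lineterm=''))
    return lines
```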