MobileRead Forums - View Single Post

fenuks · 01-25-2014, 11:37 AM

Thank you for the advice, Kovid. Unfortunately, this doesn't solve my problem.

Let's say I have this tag:

Code:

<img class="alignright size-medium wp-image-48200" width="250" height="360" src="image-48200.png" alt="title" style="width: 250px; height: 360px;"></img>

Only the alignright class has CSS, which is:

Code:

float: right;
margin: 0px 0px 5px 10px;

so my extra_css is:

Code:

extra_css = '.alignright {float: right; margin: 0px 0px 5px 10px;}'

It works: ebook-convert keeps the alignright class and adds my custom CSS. But if there is another image, in this or another article, with the same classes but a different width or height attribute, its class is renamed to alignright{NUMBER}. I tried removing the width and height attributes with

Code:

remove_attributes = ['width', 'height']

and adding auto width and height to my extra_css. But if I remove these attributes, calibre renames the class to something like calibre2 and doesn't add my extra_css.

There's one more problem. It seems that calibre preserves only the first class name. If the class string of an element is "size-medium alignright wp-image-48200" instead of "alignright size-medium wp-image-48200", then in the output ebook the element will have the class size-medium rather than alignright as expected.
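
One workaround I could imagine for the class-order problem is to reduce each class string to the single alignment class before calibre processes it. This is only a minimal sketch: pick_alignment_class and ALIGNMENT_CLASSES are hypothetical names of my own, not part of calibre, and I have not verified that it stops the renaming.

Code:

```python
# Hypothetical helper, not part of calibre: keep only the alignment class
# that extra_css actually styles, wherever it sits in the class string.
ALIGNMENT_CLASSES = ('alignright', 'aligncenter', 'alignleft')


def pick_alignment_class(class_value):
    """Return the first known alignment class in class_value, or ''.

    Accepts either a space-separated string or a list of class names,
    since parsers differ in how they expose the class attribute.
    """
    if hasattr(class_value, 'split'):
        names = class_value.split()
    else:
        names = list(class_value)
    for name in names:
        if name in ALIGNMENT_CLASSES:
            return name
    return ''


# Possible use inside preprocess_html (assumption, untested in a recipe):
#     for img in soup.findAll('img'):
#         img['class'] = pick_alignment_class(img.get('class') or '')
```

With this, both "size-medium alignright wp-image-48200" and "alignright size-medium wp-image-48200" would collapse to alignright, so the order of the class names would no longer matter.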

Here's the full recipe, in case you want to test it yourself or find my explanation unclear:

Spoiler:

Code:

# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
import re

class FilmOrgPl(BasicNewsRecipe):
    title = u'Film.org.pl'
    __author__ = 'fenuks'
    description = u"Recenzje, analizy, artykuły, rankingi - wszystko o filmie dla miłośników kina. Opisy efektów specjalnych, wersji reżyserskich, remake'ów, sequeli. No i forum filmowe. Jedne z największych w Polsce."
    category = 'film'
    language = 'pl'
    cover_url = 'http://film.org.pl/wp-content/themes/KMF/images/logo_kmf10.png'
    ignore_duplicate_articles = {'title', 'url'}
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    remove_empty_feeds = True
    use_embedded_content = False

    remove_attributes = ['width', 'height', 'style']
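    # Cut everything from the "Przeczytaj także" (read also) heading to the end of the body.
    # u'' instead of ur'' so the literal is also valid on Python 3; the pattern has no backslashes.
    preprocess_regexps = [(re.compile(u'<h3>Przeczytaj także:</h3>.*</body>', re.IGNORECASE|re.DOTALL), lambda m: '</body>'),]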
    extra_css = '.alignright {float: right; margin: 0px 0px 5px 10px;} .aligncenter {margin: 0px auto; display: block;} .alignleft {float:left; margin-right:5px;}'

    keep_only_tags = [dict(attrs={'class':['content_recenzja']})]

    feeds = [(u'Recenzje', u'http://film.org.pl/r/recenzje/feed/'),
            #(u'Artyku\u0142', u'http://film.org.pl/a/artykul/feed/'),
            #(u'Analiza', u'http://film.org.pl/a/analiza/feed/'),
            #(u'Ranking', u'http://film.org.pl/a/ranking/feed/'),
            #(u'Blog', u'http://film.org.pl/kmf/blog/feed/'),
            #(u'Ludzie', u'http://film.org.pl/a/ludzie/feed/'),
            #(u'Seriale', u'http://film.org.pl/a/seriale/feed/'),
            #(u'Oceanarium', u'http://film.org.pl/a/ocenarium/feed/'),
            #(u'VHS', u'http://film.org.pl/a/vhs-a/feed/')
    ]
                
    def preprocess_html(self, soup):
        for c in soup.findAll('h11'):
            c.name = 'h1'
        for c in soup.findAll('h16'):
            c.name = 'h2'
        for c in soup.findAll('h17'):
            c.name = 'h3'
        for r in soup.findAll('br'):
            r.extract()
        for tag in soup.findAll('h8'):
            tag_index = tag.parent.contents.index(tag)
            tag.parent.insert(tag_index+1, BeautifulSoup('<br></br>'))
        return soup