Custom recipes (archive, read-only) - Page 168

Starson17 · 08-24-2010, 04:40 PM

Quote:

Originally Posted by poluk

I am trying to adapt the Financial Times recipe to Lloyd's List, and I get this error.
Could you tell me what to change in "log-in-box", given the login part of the webpage source?

You didn't post your recipe or the login page you are trying to access, so it's a bit hard to advise you. However, from the error, it looks like your recipe probably attempts to find the login form by "name" and you have used the "id."

I don't do many login recipes, but it's been my experience that if the form is not identified by "name=" in the html, you need to use this:

Code:

br.select_form(nr=0) 
or 
br.select_form(nr=1)

to find the form by sequential number on the page instead of :

Code:

br.select_form(name='log-in-box')

Starson17 · 08-24-2010, 04:53 PM

Quote:

Originally Posted by kerrware

It seemed to download the first two articles into separate directories, each with an index.html first and an image subdirectory. Displaying the index file in Firefox shows the article data is being downloaded OK.
When I run the recipe in Calibre I get the index summary pages OK, but all the articles referred to just contain header (Next Link, etc.) and footer lines (downloaded by Calibre, etc.).
Have I missed something out?

What happens when you click on the index.html in the first directory? Does Firefox allow you to click through to the articles and see the article content? (As dwanthny said, if you had used CODE tags, it would have been easier to run your recipe to check it out.)

cisaak · 08-24-2010, 05:25 PM

In my newspaper recipe, I have replaced the standard Kindle masthead with "MYTEXT" using the following command:

def get_masthead_title(self):
    return 'MYTEXT'

Unfortunately, MYTEXT is truncated when viewed on my Kindle's screen. Apparently, I must use a CSS command to format the substitute masthead. I have used CSS to format other tags, e.g., the body of the article, but I do not know how to apply a CSS to the masthead. Can anyone help?

poluk · 08-24-2010, 05:43 PM

Thanks for your help Starson17 !
Here is the recipe code:

Code:

#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2008, Darko Miletic <darko.miletic at gmail.com>'
'''
Lloyds
'''

from calibre.web.feeds.news import BasicNewsRecipe

class Lloyd(BasicNewsRecipe):
    title                 = u'Lloyd'
    __author__            = 'Darko Miletic and Sujata Raman'
    description           = 'Shipping News'
    oldest_article        = 2
    language = 'en'

    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    needs_subscription    = True
    simultaneous_downloads= 1
    delay                 = 1

    LOGIN = 'http://www.lloydslist.com/ll/login.htm'

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open(self.LOGIN)
            br.select_form(nr=0) 
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

   

    feeds = [
        (u'Containers', u'http://www.lloydslist.com/ll/sector/containers/?service=rss'),
        (u'Dry Cargo', u'http://www.lloydslist.com/ll/sector/dry-cargo/?service=rss'),
        (u'Finance', u'http://www.lloydslist.com/ll/sector/finance/?service=rss'),
        (u'Insurance', u'http://www.lloydslist.com/ll/sector/insurance/?service=rss'),
        (u'Port and Logistic', u'http://www.lloydslist.com/ll/sector/ports-and-logistics/?service=rss'),
        (u'Regulation', u'http://www.lloydslist.com/ll/sector/regulation/?service=rss'),
        (u'Ship Operation', u'http://www.lloydslist.com/ll/sector/ship-operations/?service=rss'),
    ]

    def preprocess_html(self, soup):
        content_type = soup.find('meta', {'http-equiv':'Content-Type'})
        if content_type:
            content_type['content'] = 'text/html; charset=utf-8'
        return soup

As you suggested, I changed the way the form is looked up, and now I get a new error (so we are making progress, thanks to you!):

Quote:

ClientForm.ControlNotFoundError: no control matching name 'username'

Starson17 · 08-24-2010, 05:52 PM

Quote:

Originally Posted by poluk

As you suggested, I changed the way the form is looked up, and now I get a new error (so we are making progress, thanks to you!):

Code:

ClientForm.ControlNotFoundError: no control matching name 'username'

Your username control is not named 'username'. Find the form and determine the name of the control that is submitted as the username.

IOW, this is wrong:

Code:

br['username'] = self.username

it should be:

Code:

br['something_else_not_username'] = self.username

You probably have the name of the password control wrong, too.
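One way to find the right names is to list every form on the login page together with its controls. The following is only a rough sketch using the Python 3 standard library, run outside calibre; the sample markup and the control names in it (user_email, user_pass) are made up for illustration, since the real Lloyd's List login page was not posted. Save the real login page source and feed it in instead of SAMPLE.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect each <form> on a page with its name, id, and control names."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            # A new form starts; remember how it can be identified.
            self.forms.append({'name': attrs.get('name'),
                               'id': attrs.get('id'),
                               'controls': []})
        elif tag in ('input', 'select', 'textarea') and self.forms:
            # Record the control under the most recently opened form.
            self.forms[-1]['controls'].append(attrs.get('name'))


def list_forms(html):
    parser = FormLister()
    parser.feed(html)
    return parser.forms


# Hypothetical login page, NOT the real Lloyd's List markup.
SAMPLE = '''
<form id="log-in-box" action="/login" method="post">
  <input type="text" name="user_email">
  <input type="password" name="user_pass">
  <input type="submit" value="Log in">
</form>
'''

for nr, form in enumerate(list_forms(SAMPLE)):
    print(nr, form['name'], form['id'], form['controls'])
```

The position printed first is what goes into br.select_form(nr=...), and the control names are what go inside br['...']. A form that shows None for its name, like the one in this sample, cannot be selected with name=.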

Starson17 · 08-24-2010, 06:01 PM

Quote:

Originally Posted by cisaak

In my newspaper recipe, I have replaced the standard Kindle masthead with "MYTEXT" ...

Unfortunately, MYTEXT is truncated when viewed on my Kindle's screen. Apparently, I must use a CSS command to format the substitute masthead. I have used CSS to format other tags, e.g., the body of the article, but I do not know how to apply a CSS to the masthead. Can anyone help?

Not without a Kindle (anyone want to send me one?), as I'm not sure where in the recipe the Kindle is picking up the masthead.

However, the masthead is only used in a few places in an EPUB. Open the EPUB, find the masthead and change the css file to modify its properties, then convert the EPUB to whatever format Kindle uses and see if that fixes it. If so, modify the extra_css in your recipe to make the same change.

If you have a problem understanding this, take it a step at a time, and let me know which step you have trouble with.

TonytheBookworm · 08-24-2010, 08:17 PM

I know in the calibre preferences under conversion and mobi output there is a dropdown that allows you to pick the font you wish to use. It would be good to have a user customized size as well in there.

naisren · 08-24-2010, 11:25 PM

Calibre gives us many choices for customizing news from almost any site, so I use Calibre to get news instead of Mobipocket Reader.
I ran into several issues while using Calibre; could you kindly help solve them?
1. Menu in the navigation part of each article
Clicking a menu link pops up an error on both PC and PDA.

2. How can I avoid or reduce warnings like "Property: Invalid value for "CSS Level 2.1" property: 225 [85:1: width]" in the recipe output?

DoctorOhh · 08-24-2010, 11:43 PM

This recipe wasn't working due to a redirected feed. I corrected the recipe. Removed one old feed and added two new feeds.

naisren · 08-24-2010, 11:46 PM

Code:

<li><a href="/Business_Etiquette_1.html" />Business Etiquette</a></li>

As you can see, there is a "/" at the end of the opening tag

Code:

<a href="/Business_Etiquette_1.html" />

and another "/" in the closing tag

Code:

</a>

Browsers handle this as if the first "/" were not there, i.e.

Code:

<li><a href="/Business_Etiquette_1.html">Business Etiquette</a></li>

Calibre does not seem to handle this the way Firefox or IE do; it skips the rest after meeting the first "/".

The "a" link tag is one case; the div tag has the same problem, for example

Code:

<div id="text"/>......</div>

How can I deal with such markup in a recipe? I can't get any links using
soup.find(id='text').findAll('a') on the code above.

kerrware · 08-25-2010, 04:09 AM

Thanks for feedback - new to forum so still learning.
Hopefully I've added the recipe code correctly this time.

Quote:

Originally Posted by Starson17
What happens when you click on the index.html in the first directory? Does Firefox allow you to click through to the articles and see the article content? (As dwanthny said, if you had used CODE tags, it would have been easier to run your recipe to check it out.)

Yes, Firefox does allow me to click through to the articles and see the article content. I've since created a second recipe for another site (which does not require a login so only used the Basic Add New Recipe page in Calibre) and that worked first time (apart from possibly needing a bit of pruning).

Code:

from calibre.web.feeds.news import BasicNewsRecipe
import re

class AdvancedUserRecipe1282596648(BasicNewsRecipe):
    title          = u'Ilkeston Advertiser'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://auth.jpress.co.uk/login.aspx?ReturnURL=http%3a%2f%2fwww.ilkestonadvertiser.co.uk%2ftemplate%2fRegister.aspx%3fReturnURL%3dhttp%3a%2f%2fwww.ilkestonadvertiser.co.uk%2ffrontpage.aspx&SiteRef=IAS')
            br.select_form(name='Form1')
            br['ctl00$txtEmailAddress']  = self.username
            br['ctl00$txtPassword'] = self.password
            br.submit()
        return br

    feeds          = [(u'Ilkeston Today - News', u'http://www.ilkestonadvertiser.co.uk/getfeed.aspx?sectionid=795&format=rss')]

Starson17 · 08-25-2010, 08:49 AM

Quote:

Originally Posted by kerrware

Yes, Firefox does allow me to click through to the articles and see the article content.

If you are seeing the article content stored locally (when running ebook-convert), and you can click through from the initial index.html to the index.html files in the folders to see that content, then I see no reason why you should have problems converting the html structure, with article content, to an EPUB. Where is the problem occurring? I'd check it for you, but have no username/password for the site.

Starson17 · 08-25-2010, 09:06 AM

Quote:

Originally Posted by naisren

As you can see, there is a "/" at the end of the opening tag

Code:

<a href="/Business_Etiquette_1.html" />

and another "/" in the closing tag

Code:

</a>

Calibre does not seem to handle this the way Firefox or IE do; it skips the rest after meeting the first "/".
The "a" link tag is one case; the div tag has the same problem, for example

Code:

<div id="text"/>......</div>

How can I deal with such markup in a recipe? I can't get any links using
soup.find(id='text').findAll('a') on the code above.

Sorry, but I can't quite follow your question. Are you saying you can't reference tags by "id" or "href," etc.?

I've never run into the trailing slashes inside opening tags like you've posted, so I have no first hand experience. I would still expect normal referencing to work, but if it doesn't, you have various options. You can try search and replace to remove them with preprocess_regexps. You could remove just the slashes, or modify the whole tag with S&R, or use pre or postprocess_html and Beautiful Soup to identify the tag and extract or modify it. It's possible the slashes are confusing Beautiful Soup, so printing the results (see code in my post above on how to do this) might help you figure out what the recipe is seeing and where it's being confused.
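A rough sketch of the search-and-replace option, untested against the actual site: preprocess_regexps takes a list of (compiled pattern, callable) pairs, and the rule below rewrites a self-closed a, div or span opening tag into a normal opening tag. The apply_rules helper and the sample string are only there so the rule can be tried outside calibre; they are not part of any calibre API. In a recipe, only the preprocess_regexps list would be copied into the class body.

```python
import re

# Rules in the (compiled pattern, callable) shape that preprocess_regexps expects.
# Only a, div and span are rewritten; void tags such as <br/> are left alone.
preprocess_regexps = [
    (re.compile(r'<(a|div|span)\b([^>]*?)\s*/>', re.IGNORECASE),
     lambda m: '<%s%s>' % (m.group(1), m.group(2))),
]


def apply_rules(html, rules=preprocess_regexps):
    """Apply each rule in order (hypothetical helper for testing outside calibre)."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Sample built from the snippets posted in this thread.
sample = ('<div id="text"/><a href="/Business_Etiquette_1.html" />'
          'Business Etiquette</a><br/></div>')
print(apply_rules(sample))
```

Limiting the pattern to a short list of tag names is deliberate: a blanket rule that strips every "/>" would also change legitimately self-closed tags such as br and img.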

More info would be needed to advise further.

naisren · 08-25-2010, 11:47 AM

Quote:

Originally Posted by Starson17

Sorry, but I can't quite follow your question. Are you saying you can't reference tags by "id" or "href," etc.?

I've never run into the trailing slashes inside opening tags like you've posted, so I have no first hand experience. I would still expect normal referencing to work, but if it doesn't, you have various options. You can try search and replace to remove them with preprocess_regexps. You could remove just the slashes, or modify the whole tag with S&R, or use pre or postprocess_html and Beautiful Soup to identify the tag and extract or modify it. It's possible the slashes are confusing Beautiful Soup, so printing the results (see code in my post above on how to do this) might help you figure out what the recipe is seeing and where it's being confused.

More info would be needed to advise further.

Thanks for your help, and sorry for my confusing explanation.

The following is part of the page source from which I am trying to get the feed.

Code:

<div id="rightContainer" />
<span id="list" />
<ul><li><a href="/Health_Report_1.html" target="_blank">[ <font color=#E43026>Health Report</font> ] </a> <a href="/lrc/201008/se-health-cancer-developing-world-25aug10.lrc" target=_blank><img src=/images/lrc.gif border=0></a> <a href="/VOA_Special_English/Experts-Urge-More-Efforts-to-Fight-Cancer-in-Poor-Countries-38652_1.html" target="_blank"><img src=/images/yi.gif border=0></a> <a href="/VOA_Special_English/Experts-Urge-More-Efforts-to-Fight-Cancer-in-Poor-Countries-38652.html" target="_blank">Experts Urge More Efforts to Fight Cancer in Poor Countries  (2010-8-25)</a></li></ul>
</span>
</div>

My recipe is

Code:

import re
from calibre.web.feeds.news import BasicNewsRecipe

class VOA(BasicNewsRecipe):

    title      = 'VOA News'
    __author__ = 'voa'
    description = 'VOA through 51'
    language = 'en'
    remove_javascript = True

    remove_tags_before = dict(id=['rightContainer'])
    remove_tags_after  = dict(id=['listads'])
    remove_tags        = [
                          dict(id=['contentAds']), dict(id=['playbar']), dict(id=['menubar']), 
                         ]    
    no_stylesheets = True
    extra_css = '''
                '''


    def parse_index(self):
        soup = self.index_to_soup('http://www.51voa.com/')
        feeds = []
        section = []
        title = None

        #for x in soup.find(id='list').findAll('a'):
        for x in soup.find(id='rightContainer').findAll('a'):
            # Keep only links that point at article pages.
            if '/VOA_Special_English/' in x['href'] or '/VOA_Standard_English/' in x['href']:
                article = {
                    # href already starts with '/', so the host has no trailing slash
                    'url' : 'http://www.51voa.com' + x['href'],
                    'title' : self.tag_to_string(x),
                    'date': '',
                    'description': '',
                }
                section.append(article)

        feeds.append(('Newest', section))

        return feeds

I use this recipe to fetch the feed from the page source above, but I get no links. Could you give an example of how to use "regexps" to deal with the odd markup here, and also for the case where a

Code:

<br/>

tag comes in? Thanks a lot for your help.

miangue · 08-25-2010, 01:26 PM

Quote:

Originally Posted by Starson17

extra_css is used to control formatting. Search this thread for some samples and read here.

Thanks, Starson. I added the "extra_css" line and it came out like this:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282450582(BasicNewsRecipe):
    title          = u'LaRepublica.com'
    oldest_article = 7
    max_articles_per_feed = 100
    use_embedded_content   = False
    no_stylesheets = True
    extra_css = '''
                    .titulo {font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    .periodista {font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    .fecha_publicacion {font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    keep_only_tags = [
        dict(name='div', attrs={'id':['noticia']})
    ]
    remove_tags = [
        dict(name='div', attrs={'id':['iconos', 'relacionados', 'documentos_adjuntos']}),
        dict(name='span', attrs={'id':['comentarios']})
    ]

    feeds          = [(u'Noticias', u'http://www.larepublica.com.co/rss/larepublica.xml')]

But it still does not work. What could I be doing wrong?

Can anyone help me, please?

I should clarify that the tags whose formatting I want to change are:

Code:

<div id="titulo">
<div id="periodista">
<div id="fecha_publicacion">

THANK YOU!!!
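One thing that may be worth checking, offered as a guess because only the three opening tags were posted: the extra_css rules above use class selectors (a leading "."), while the tags listed carry id attributes, which CSS matches with a leading "#". A minimal sketch of the same rules rewritten with id selectors, assuming the divs really have no class attribute of the same name:

```python
# Same rules as in the recipe above, but with id selectors (#) instead of
# class selectors (.), on the assumption that the page uses
# <div id="titulo">, <div id="periodista"> and <div id="fecha_publicacion">.
extra_css = '''
    #titulo {font-family:Arial,Helvetica,sans-serif; font-weight:bold; font-size:large;}
    #periodista {font-family:Arial,Helvetica,sans-serif; font-weight:normal; font-size:small;}
    #fecha_publicacion {font-family:Helvetica,Arial,sans-serif; font-size:small;}
'''
```

In the recipe this string would replace the existing extra_css value inside the class body.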
