Custom recipes (archive, read-only) - Page 167

French · 08-22-2010, 01:26 PM

Well I've attempted this but not getting what I expected.

Anyone care to take on the Baltimore Sun?

http://www.baltimoresun.com/about/bl...62819.htmlpage

TonytheBookworm · 08-22-2010, 01:30 PM

Quote:

Originally Posted by Starson17

Based on your quoted html from above, have you tried the correct href? IOW, have you tried this:

Code:

remove_tags = [dict(name='a', attrs={'href':'#comments_controls'})]

(it's always the little things that trip you up!)

Yeah it is always the little things. What happened is that the thing was underlined in the source in Firefox, and at first glance it looked like a space. I should have thought about it, and did after you showed me: hrefs don't have spaces, haha.

Anyway, live and learn. Thanks again. I hope someone actually enjoys reading that blog. Oh, while I'm at it, is there a way in calibre to use a regular expression to insert or remove a line?
For example:

Code:

 <p class="calibre9" This is the line I wish to keep </p>
 <p class="calibre9" This is the line I wish to delete </p>
 <p class="calibre9" Some more stuff </p>

I want to do something like
[code ]
var string = "This is the line I wish to delete"
remove_tag [
where contains(string)

or
var string = "This is the line I wish to delete"
var replacestring = "Calibre Rocks"
replace_tag [
replace_where(string,replacestring)
]

[/code]

the above is just pseudo code but I hope you understand my logic.

Starson17 · 08-22-2010, 01:52 PM

Quote:

Originally Posted by TonytheBookworm

Yeah it is always the little things. What happened is that the thing was underlined in the source in Firefox, and at first glance it looked like a space

There was also a missing "s." "#comment_controls" makes more sense, but they used "#comments_controls." I've given up on anything but copy/paste. FireBug makes it easy.

Quote:

Oh, while I'm at it, is there a way in calibre to use a regular expression to insert or remove a line?
For example:

Code:

 <p class="calibre9" This is the line I wish to keep </p>
 <p class="calibre9" This is the line I wish to delete </p>
 <p class="calibre9" Some more stuff </p>

I want to do something like

Code:

   var string = "This is the line I wish to delete"
   remove_tag  [
                      where contains(string)

  or 
   var string = "This is the line I wish to delete"
   var replacestring = "Calibre Rocks"
   replace_tag [
                    replace_where(string,replacestring)
                    ]

the above is just pseudo code but I hope you understand my logic.

You're posting stuff that has class="calibreN" type labels. I suspect you know this, but those are created after the recipe has finished - so you really need to be looking at the web page html, not the final html or epub.

Also, your examples are missing the closing tag marker ">" after <p class="calibre9"

However, assuming that you're just using that as an example (I.e., you're as lazy as I am and didn't want to go back and open up the original site), the answer to your question is "yes - it's possible to insert or remove a line." and "yes, I understand your pseudo code."

Am I correct in thinking that your next question is "How?"
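
One possible shape of an answer, as a rough sketch rather than anything taken from the spoiler: preprocess_regexps takes (compiled pattern, replacement function) pairs, so both cases in the pseudo code can be written as such pairs. The sample strings come from the question. `apply_rules` is a hypothetical stand-in for the way calibre is assumed to run each pair over the page source in turn, and is only there so the snippet can be tried on its own.

```python
import re

TARGET = "This is the line I wish to delete"
REPLACEMENT = "Calibre Rocks"

# One whole <p ...>...</p> element whose text contains TARGET.
# (?:(?!</p>).)*? stops the match from crossing into a neighbouring paragraph.
PARAGRAPH_WITH_TARGET = re.compile(
    r'<p\b[^>]*>(?:(?!</p>).)*?' + re.escape(TARGET) + r'(?:(?!</p>).)*?</p>',
    re.DOTALL)

# Same (pattern, function) shape that preprocess_regexps expects.
remove_rules = [(PARAGRAPH_WITH_TARGET, lambda match: '')]
replace_rules = [(re.compile(re.escape(TARGET)), lambda match: REPLACEMENT)]


def apply_rules(html, rules):
    """Hypothetical helper: run each (pattern, function) pair over the HTML."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


sample = (
    '<p class="x">This is the line I wish to keep</p>\n'
    '<p class="x">This is the line I wish to delete</p>\n'
    '<p class="x">Some more stuff</p>\n'
)

removed = apply_rules(sample, remove_rules)    # middle paragraph dropped
replaced = apply_rules(sample, replace_rules)  # its text swapped for REPLACEMENT
```

In a recipe, the pairs would go straight into preprocess_regexps. As noted above, the patterns have to be written against the web page's own HTML, not the class="calibreN" output.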

Spoiler:

TonytheBookworm · 08-22-2010, 06:24 PM

Quote:

Am I correct in thinking that your next question is "How?"

Spoiler:

Man, I wish there was a way I could ask questions without flooding this board and all.

Let's say in every parse I get something that has a doubleclick.net ad in it.
I tried

Code:

filter_regexps = [r'feedads\.g\.doubleclick\.net']

and yeah, I didn't see any indent errors this time.
Then I thought, well, maybe if I use preprocess_regexps and remove all the instances of doubleclick first.
So then I looked in the Beautiful Soup documentation, and after a big headache I'm still kinda lost.

I tried this as well...

Code:

preprocess_regexps     = [(re.compile(r'feedads\.g\.doubleclick\.net', re.DOTALL), lambda m: '')]

thanks again

TonytheBookworm · 08-22-2010, 10:34 PM

Quote:

Originally Posted by French

Well I've attempted this but not getting what I expected.

Anyone care to take on the Baltimore Sun?

http://www.baltimoresun.com/about/bl...62819.htmlpage

Not sure exactly what you expect. But I just used 2 of the feeds (you will have to put the ones that you want into the recipe under the feed section).

This works for me. The only issue I have is that, for the life of me, I can't figure out how to get it to remove the doubleclick.net ad that it puts on some of the articles. Maybe someone can help you/me out on that one. I have tried filter_regexps with no luck. Anyway, enjoy...

Trickery · 08-23-2010, 06:19 AM

Made these w/ icons. Hopefully they help and get added to the main program. I made a .py for all the Gawker Media Brand websites and Consumerist and added the icons for good measure.

Gawker.com
deadspin.com
io9.com
jalopnik.com
jezebel.com
kotaku.com
lifehacker.com
fleshbot.com
Consumerist.com

Consumerist is done and working well, just can't figure out how to remove a lone image that shows up on each page for twitter. It's small though, so not a big detractor.

marco_polo · 08-23-2010, 11:29 AM

Hello, I'm new here. Can somebody help me create a recipe for www.europasur.es? I tried editing the recipe for El Pais, but the RSS links are very different ... Thank you

Starson17 · 08-23-2010, 01:15 PM

Quote:

Originally Posted by TonytheBookworm

Man, I wish there was a way I could ask questions without flooding this board and all.

I've worried about that issue, too, but most responses I received said that they didn't mind when I asked if people thought there was too much of this type of "how to" in this thread. It does provide a lot of good info, which is searchable and helps others write recipes, so I wouldn't worry too much. Just try to use the code and spoiler tags to keep the indents and the length of posts minimized. Personally, I like to read kiklop's recipes to see how he approaches certain problems, then I can ask him when I don't understand something.

Quote:

Let's say in every parse I get something that has a doubleclick.net ad in it.
I tried

Code:

filter_regexps = [r'feedads\.g\.doubleclick\.net']

and yeah, I didn't see any indent errors this time.
Then I thought, well, maybe if I use preprocess_regexps and remove all the instances of doubleclick first.
So then I looked in the Beautiful Soup documentation, and after a big headache I'm still kinda lost.

I tried this as well...

Code:

preprocess_regexps     = [(re.compile(r'feedads\.g\.doubleclick\.net', re.DOTALL), lambda m: '')]

I've never needed to use either of those methods to remove doubleclick ads. For me, it's always been possible to define either a keep_only or a remove. As to Beautiful Soup, I'm still learning, and I've read that page at least 50 times. I expect I'll end up reading it another 50 times eventually.

Let's start with filter_regexps. I've only used it once. It's used to prevent a link from being followed. Most of the time, you're not following a link because recursion is off and Calibre isn't following links on the pages. What you normally want to do is remove the link or graphic from your page, not prevent it from being followed by Calibre.
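
As a rough illustration of that distinction (a sketch of the idea, not calibre's actual code): filter_regexps is matched against link URLs when deciding whether to fetch them, so it never touches markup that is already in the page. `link_is_wanted` and the sample URLs are hypothetical.

```python
import re

filter_regexps = [r'feedads\.g\.doubleclick\.net']
compiled_filters = [re.compile(p, re.IGNORECASE) for p in filter_regexps]


def link_is_wanted(url):
    """Hypothetical stand-in: a link is skipped if any filter matches its URL."""
    return not any(f.search(url) for f in compiled_filters)


page = '<p>Story text</p><a href="http://feedads.g.doubleclick.net/~a/x">ad</a>'

follow_ad = link_is_wanted('http://feedads.g.doubleclick.net/~a/x')
follow_story = link_is_wanted('http://www.example.com/story.html')

# The filter only answers "follow this link or not"; the ad markup in
# `page` is unchanged either way, which is why the ad still shows up.
```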

OTOH, I use preprocess_regexps a lot - but as a sort of last resort. It's simply a powerful search and replace on the HTML. You could do most of your remove_tags with preprocess_regexps if you wanted to. But it's not tag-aware, so remove_tags is better in most cases (it won't be confused if there's a div tag inside a div tag, where S&R might find the open div tag of an outer tag and the close div of an inner tag). Why don't you show me the actual page source for the doubleclick you want to deal with, or give me a link, so I can understand what you are trying to remove?
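
The nested-div pitfall is easy to reproduce with plain `re` (on a made-up snippet of HTML, not taken from any site discussed here): a non-greedy search from the opening tag of the outer div stops at the first closing div it meets, which belongs to the inner one.

```python
import re

html = '<div id="outer"><div id="inner">ad</div>tail</div><p>story</p>'

# Search-and-replace is not tag-aware: .*? ends at the FIRST </div>.
naive = re.compile(r'<div id="outer">.*?</div>', re.DOTALL)
result = naive.sub('', html)

# Left behind: the rest of the outer div and its now-unmatched closing tag.
print(result)  # tail</div><p>story</p>
```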

BTW, If you look at page source with your browser, it may not be the same as what Calibre sees. It may also be wrong if you look at it with FireBug. To see it as Calibre will see it I like to do this:

Code:

    def preprocess_html(self, soup):
        print 'The soup is: ', soup
        return soup

If you add this code, it does nothing, but the print statement sends the html in cleaned-up Beautiful Soup form into your textfile.txt as Calibre will see it. (you are using ebook-convert ....>textfile.txt format - right?)

poluk · 08-23-2010, 02:19 PM

Hi
I'm trying to adapt the Financial Times recipe to Lloyd's List,
and I get this error

Quote:

mechanize._mechanize.FormNotFoundError: no form matching name 'log-in-box'

Could you tell me what to use in place of "log-in-box", given this part of the webpage source for the login?

Code:

"<div class="grid_4 prefix_2 controls-container">

    <div class="grid_4 first last common-box last-in-row" id="log-in-box">

        <h2 class="common-box-header">Please Log In</h2>

        

        <form class="log-in" method="post" action="/ll/security_check">
            <fieldset>
                <label for="j_username">Username:</label>
                <input class="common-field log-in-page" type="text" name="j_username" id="j_username"
                       value="" tabindex="1"/>

                <label for="j_password">Password:</label>

                <input class="common-field log-in-page" type="password" name="j_password" id="j_password" tabindex="2"/>

                <input class="submit log-in-page" type="submit" value="Log In" tabindex="4"/>

                <label for="_spring_security_remember_me"><input type="checkbox" id="_spring_security_remember_me" name="_spring_security_remember_me" tabindex="3"/>Remember me</label>

                <a class="pwd-reminder" href="/ll/forgotten-password.htm">Forgotten your password?</a>


            </fieldset>

The website I'm trying to make a recipe for is: http://www.lloydslist.com/ll/
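
Going only by the HTML pasted above: "log-in-box" is the id of the enclosing div, and the form element itself has no name attribute, which would explain why a lookup by that name finds nothing. The sketch below just lists the forms in a trimmed copy of that fragment using the standard library; `FormLister` is a hypothetical helper, not part of calibre or mechanize.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Hypothetical helper: record the attributes of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.forms.append(dict(attrs))


# Trimmed copy of the fragment quoted in the post.
fragment = '''
<div class="grid_4 first last common-box last-in-row" id="log-in-box">
  <form class="log-in" method="post" action="/ll/security_check">
    <input type="text" name="j_username" id="j_username"/>
    <input type="password" name="j_password" id="j_password"/>
  </form>
</div>
'''

lister = FormLister()
lister.feed(fragment)
for index, attrs in enumerate(lister.forms):
    # A name of None means select_form(name=...) has nothing to match.
    print(index, attrs.get('name'), attrs.get('action'))
```

mechanize can also select a form by position, e.g. br.select_form(nr=0), after which the fields would be br['j_username'] and br['j_password']. Whether the login form really is the first form on the live page is an assumption that needs checking.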

miangue · 08-23-2010, 04:01 PM

Let's see if someone can help me.
I made this recipe for a newspaper in Colombia (larepublica.com.co). Everything comes out the way I want, with one problem: the title of each story comes out in the same font as the article text, and I want it to come out big and bold. How can I do this? What command should I add? ... Thanks!

Here's the recipe:

Quote:

class AdvancedUserRecipe1282450582(BasicNewsRecipe):
    title = u'LaRepublica.com'
    oldest_article = 7
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True

    keep_only_tags = [
        dict(name='div', attrs={'id':['noticia']})
    ]
    remove_tags = [
        dict(name='div', attrs={'id':['iconos', 'relacionados', 'documentos_adjuntos']}),
        dict(name='span', attrs={'id':['comentarios']})
    ]

    feeds = [(u'Noticias', u'http://www.larepublica.com.co/rss/larepublica.xml')]

And here is the part of the page source that contains the title of the news story:

Quote:

<div id="noticia">

<div id="titulo">
Interés de inversionistas sube el Igbc hasta 13.602,04 unidades
</div>

<div id="info">

TonytheBookworm · 08-23-2010, 07:24 PM

Quote:

Code:

    def preprocess_html(self, soup):
        print 'The soup is: ', soup
        return soup

If you add this code, it does nothing, but the print statement sends the html in cleaned-up Beautiful Soup form into your textfile.txt as Calibre will see it. (you are using ebook-convert ....>textfile.txt format - right?)

Yes, I'm using the ebook-convert string you gave me, and it works great for debugging. As for preprocess_html, thanks for that method. I will use it in the future as well to test my code.

As for the issue where I had the doubleclick: it was the Baltimore Sun. I took a stab at it for that guy/gal who wanted someone to look at it. For the most part everything is fine with the RSS feed, except that it puts that Google ad on some of the pages, generally the first article. When I look at the original source it has ad.doubleclick.net in it, and after it is rendered with calibre it is feedads.g.doubleclick.net. Here is the recipe I am currently using for it...

Spoiler:

Personally, I don't like the RSS feed of that site. I have considered trying to make a feed myself from this...
http://www.baltimoresun.com/services...print-edition/

which actually gives you some nice pretty images and so forth. I figured in that case I would simply use recursions = 1 and then somehow strip what I didn't want using maybe keep_only or remove_tags. Or I could make a print_version that looks for the text "Print" inside an <a> tag, gets that URL, and passes it back. The only issue with using the print version is that I lose the photos, which I don't want to do. It is just something I'm playing with to learn and to help someone else in the process.

thanks for taking the time to teach me and answer my questions. I really appreciate it.

kerrware · 08-24-2010, 06:54 AM

Been trying to create my first simple recipe for a local paper - Ilkeston Advertiser (Derbyshire, England) with Free RSS Feeds. Managed to get the logon process working and ran the recipe in test mode. It seemed to download the first two articles into separate directories, each with an index.html file and an image subdirectory. Displaying the index file in Firefox shows the article data is being downloaded ok.
When I run the recipe in Calibre I get the index summary pages ok, but all the articles referred to just contain header (Next Link, etc.) and footer lines (downloaded by Calibre, etc.).
Have I missed something out?

Thanks.

Spoiler:

DoctorOhh · 08-24-2010, 07:12 AM

Quote:

Originally Posted by kerrware

Have I missed something out?

I applaud you using the spoiler tags, but first you have to wrap your recipe in the code tags (the # above) then wrap that with the spoiler tags. Placing your recipe in the code tags keeps your recipe intact with the critical spaces in their proper places. This makes trying your recipe and reviewing it easier on those that have the needed skills to assist you.

Starson17 · 08-24-2010, 07:53 AM

Quote:

Originally Posted by TonytheBookworm

As for the issue where I had the doubleclick: it was the Baltimore Sun.

I looked at a page from that site. The doubleclick ads seemed to be inside <noscript> tags. If that's it, why not just remove those tags?
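
A sketch of what that might look like, assuming the ads really do sit inside <noscript> elements as they seemed to on the page checked above. The first assignment is the usual remove_tags form; the regex rule below it is a rough equivalent in the preprocess_regexps shape, included only so the effect can be tried outside calibre on a made-up snippet.

```python
import re

# Tag-aware form, as it would appear in the recipe class body.
remove_tags = [dict(name='noscript')]

# Rough search-and-replace equivalent; noscript elements should not nest,
# so a non-greedy match up to the first closing tag is enough here.
noscript_rule = (re.compile(r'<noscript\b[^>]*>.*?</noscript>',
                            re.DOTALL | re.IGNORECASE),
                 lambda match: '')

sample = ('<p>Article text</p>'
          '<noscript><a href="http://ad.doubleclick.net/x">'
          '<img src="http://ad.doubleclick.net/y"></a></noscript>'
          '<p>More text</p>')

pattern, func = noscript_rule
cleaned = pattern.sub(func, sample)
print(cleaned)  # <p>Article text</p><p>More text</p>
```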

Quote:

I have considered trying to make a feed myself from ...

Sometimes building the feed yourself is best.

Quote:

thanks for taking the time to teach me and answer my questions. I really appreciate it.

You're welcome.

Starson17 · 08-24-2010, 07:59 AM

Quote:

Originally Posted by miangue

the title of each story comes out in the same font as the article text, and I want it to come out big and bold. How can I do this? What command should I add?

extra_css is used to control formatting. Search this thread for some samples and read here.
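
For the page source quoted by miangue, where the headline sits in <div id="titulo">, a minimal sketch might be the following. The selector comes from that quoted source; the particular font size is just a guess at "big", and the result has not been checked against the live site.

```python
# Goes in the recipe class body, alongside no_stylesheets = True.
extra_css = '''
    #titulo { font-size: x-large; font-weight: bold; }
'''
```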

kerrware's recipe (post #2502, "My Recipe fails to place Articles data in epub", contents of the spoiler):

Code:

from calibre.web.feeds.news import BasicNewsRecipe
import re

class AdvancedUserRecipe1282596648(BasicNewsRecipe):
    title = u'Ilkeston Advertsier'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://auth.jpress.co.uk/login.aspx?ReturnURL=http%3a%2f%2fwww.ilkestonadvertiser.co.uk%2ftemplate%2fRegister.aspx%3fReturnURL%3dhttp%3a%2f%2fwww.ilkestonadvertiser.co.uk%2ffrontpage.aspx&SiteRef=IAS')
            br.select_form(name='Form1')
            br['ctl00$txtEmailAddress'] = self.username
            br['ctl00$txtPassword'] = self.password
            br.submit()
        return br

    feeds = [(u'Ilkeston Today - News', u'http://www.ilkestonadvertiser.co.uk/getfeed.aspx?sectionid=795&format=rss')]

