Old 03-29-2010, 01:48 PM   #1681
MichaelMSeattle
Enthusiast
MichaelMSeattle began at the beginning.
 
Posts: 30
Karma: 16
Join Date: Sep 2009
Device: sony prs-505/600
Quote:
Originally Posted by Starson17 View Post
Over the weekend I ran all comics of the GoComics.com recipe at size 1200 and 4 strips from each. I have the 200+ comics available broken up into four groups (four recipes) A-F, G-M, N-Z and Editorial comics. They all ran fine. However, I ran them at 8 hour intervals, not in sequence, and I set the delay option to 2 and the simultaneous connections option to 1 to minimize server load. I have seen occasional failures in the past that may be related to server load or anti-scraping tools on their server.
Hi again. Sorry to be out of touch this weekend (my home PC was down).
I appreciate your patience with this. Would you kindly post one of the four recipes you tested with, so I can try it as well?

I made the changes you suggested, set the delay to 5 seconds, and had it return only one day. It still hung at about 19%. I also commented out the IF statement that worked with tag processing, and when I ran it again I got another "a" error. That's why I'd like to try your recipe.

Thanks again for your assistance,
-Mike
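
For anyone following along, the throttling Starson17 describes (a delay of 2 and a single connection) would normally be set as class attributes on the recipe. This is a minimal sketch, assuming the "delay" and "simultaneous connections" options in his post correspond to the delay and simultaneous_downloads attributes of BasicNewsRecipe. The class name and title are made up, and the stand-in base class is there only so the snippet runs outside calibre:

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in used only when calibre is not importable (e.g. a plain Python shell).
    class BasicNewsRecipe(object):
        delay = 0
        simultaneous_downloads = 5


class ThrottledGoComics(BasicNewsRecipe):  # hypothetical recipe name
    title = 'GoComics (throttled sketch)'
    # Seconds to wait between consecutive downloads.
    delay = 2
    # Fetch one page at a time to keep server load low.
    simultaneous_downloads = 1
    # Same preference name as in Starson17's recipe; 1 = only the latest strip.
    num_comics_to_get = 1
```

These settings only slow the recipe down. They would not by themselves explain or fix a hang at a fixed percentage.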
Old 03-29-2010, 02:43 PM   #1682
gambarini
Connoisseur
gambarini began at the beginning.
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by kiklop74 View Post
You are overcomplicating this. Calibre already extracts the appropriate link from the feed (feedburner:origLink). You just need to append the print suffix, 'print/'. So the correct code would be:

Code:
def print_version(self, url):
    return url + 'print/'
Thanks!

Any ideas for this feed?
http://www.electronista.com/rss/electronista.rss

There is a "print" version for every article, but I am not able to determine the "print" link from the article link.

Example:
Article link:
"http://www.electronista.com/articles/10/03/26/geforce.gtx.480.and.470.finally.official/"

Print version:
"http://www.electronista.com/print/73919"

I cannot derive the ID (73919 in this case) from the article URL.
Old 03-29-2010, 03:22 PM   #1683
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by gambarini View Post
There is a "print" version for every article, but I am not able to determine the "print" link from the article link.
You may want to look at the 4 links in my post # 1658 in this thread. One of them has a discussion of solving issues like this.

Update: I was thinking of this.

Old 03-29-2010, 03:35 PM   #1684
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.
 
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Quote:
Originally Posted by gambarini View Post
Any ideas for this feed?
http://www.electronista.com/rss/electronista.rss

There is a "print" version for every article, but I am not able to determine the "print" link from the article link.
Starson17 already gave you the general idea, but I think the print version of an article is not always the best approach. In the end you are downloading two pages to get one. In this particular case I recommend not using the print version and just scraping the regular page. It will be a faster and cleaner approach. Always strive for the simple solution.
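
To make the "two pages to get one" cost concrete: the print ID is not part of the article URL, so the only way to find it is to download the article page and look for the print link inside it. This is a rough sketch of that lookup, assuming the article HTML contains a link of the form /print/<digits>. That pattern is a guess based on the example URL above and has not been checked against the real site markup, and find_print_url is a made-up helper name:

```python
import re

# Hypothetical pattern: assumes the article page links to its print version
# as /print/<numeric id>. Not verified against the real site.
PRINT_LINK = re.compile(r'href="(?P<path>/print/\d+)"')


def find_print_url(article_html, base='http://www.electronista.com'):
    """Return the absolute print URL found in article_html, or None."""
    match = PRINT_LINK.search(article_html)
    if match is None:
        return None
    return base + match.group('path')


# Usage with a hand-written HTML fragment standing in for a downloaded page:
sample = '<div><a href="/print/73919">Print</a></div>'
print(find_print_url(sample))
```

Inside a recipe the HTML would have to come from fetching the article URL first, which is exactly the extra download that makes scraping the regular page the simpler option here.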
Old 03-29-2010, 03:49 PM   #1685
gambarini
Quote:
Originally Posted by kiklop74 View Post
Starson17 already gave you the general idea, but I think the print version of an article is not always the best approach. In the end you are downloading two pages to get one. In this particular case I recommend not using the print version and just scraping the regular page. It will be a faster and cleaner approach. Always strive for the simple solution.
Thanks again... I agree with you.
Old 03-29-2010, 03:51 PM   #1686
gambarini
Quote:
Originally Posted by Starson17 View Post
You may want to look at the 4 links in my post # 1658 in this thread. One of them has a discussion of solving issues like this.

Update: I was thinking of this.

Oh yes... thanks.
Old 03-29-2010, 04:26 PM   #1687
Starson17
Quote:
Originally Posted by kiklop74 View Post
I think the print version of an article is not always the best approach. In the end you are downloading two pages to get one. In this particular case I recommend not using the print version and just scraping the regular page. It will be a faster and cleaner approach. Always strive for the simple solution.
I tend to focus on answering the question asked, because it often presents an interesting puzzle. Many times, at the back of my mind, I suspect that the question I am answering may not be the question that should have been asked.

I notice that kiklop often looks more broadly at the big picture - the problem to be solved - and gives answers that result in a better final result. He has far more experience than I, so listen carefully to what he suggests.
Old 03-29-2010, 04:58 PM   #1688
gambarini
Now I have yet another problem, this time with this feedsportal RSS feed.

http://feeds.punto-informatico.it/c/...8866/index.rss

...
I am working on this now.

Old 03-29-2010, 09:28 PM   #1689
kiklop74
Quote:
Originally Posted by gambarini View Post
Now I have yet another problem, this time with this feedsportal RSS feed.

http://feeds.punto-informatico.it/c/...8866/index.rss
This is a classic case of obfuscated links. But let me explain a few things first. This January, Kovid and I exchanged several emails about slow feed downloads. After some experiments I found that the main culprit was the use of obfuscated links from the feed. The solution was to update the default implementation of get_article_url to take into account not only the link tag but also feedburner:origLink, which (if it exists) contains the real, non-obfuscated link.

However, this solution does not cover all cases. Sometimes feeds do not have an origLink tag but use the guid tag instead. In those cases a recipe developer should override get_article_url and read the value of the guid tag. That way we get the maximum download speed, and optionally we can work on a print URL if the site offers one.

punto-informatico.it does not offer a special print page, so you will need to scrape the default page. Just add this to your recipe to get the real links:

Code:
def get_article_url(self, article):
    return article.get('guid', None)
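
The lookup order described above (origLink first, then guid, then the plain link) can be sketched as a standalone function. This is only an illustration, not calibre's actual implementation: pick_article_url is a made-up name, the key 'feedburner_origlink' is an assumption about how the feedburner:origLink tag shows up on a parsed feed entry, and the URLs are placeholders.

```python
def pick_article_url(article):
    """Return the first usable URL from a feed entry, or None.

    `article` is assumed to behave like a dict (as parsed feed entries do).
    """
    # Order matters: de-obfuscated links first, the plain link last.
    for key in ('feedburner_origlink', 'guid', 'link'):
        value = article.get(key, None)
        if value:
            return value
    return None


# Usage with a plain dict standing in for a feed entry:
entry = {'link': 'http://feeds.example.com/obfuscated/123',
         'guid': 'http://example.com/real-article'}
print(pick_article_url(entry))
```

In a recipe, the same logic would go in the body of get_article_url(self, article). Note that a guid is not guaranteed to be a URL in every feed, so this only helps when the feed actually puts the real link there.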
Old 03-29-2010, 09:52 PM   #1690
Starson17
Quote:
Originally Posted by MichaelMSeattle View Post
Will you kindly post one of the four [GoComics.com] recipes you tested with so I can try it as well?
This ran Sunday.
Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = 'Copyright 2010 Starson17'
'''
www.gocomics.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, NavigableString
import urllib, re, mechanize

class GoComics(BasicNewsRecipe):
    title               = 'GoComicsA-C'
    __author__          = 'Starson17' 
    __version__         = '1.01'
    __date__            = '13 March 2010'
    description         = '200+ Comics - Customize for more days/comics: Defaults to 7 days, 15 comics - 10 general, 5 editorial.'
    language            = 'en'
    use_embedded_content= False
    no_stylesheets      = True
    remove_javascript   = True
    cover_url           = 'http://paulbuckley14059.files.wordpress.com/2008/06/calvin-and-hobbes.jpg'

    ####### USER PREFERENCES - COMICS, IMAGE SIZE AND NUMBER OF COMICS TO RETRIEVE ########
    # num_comics_to_get - I've tried up to 99 on Calvin&Hobbes
    num_comics_to_get = 4
    # comic_size 300 is small, 600 is medium, 900 is large, 1500 is extra-large
    comic_size = 1200
    # CHOOSE COMIC STRIPS BELOW - REMOVE COMMENT '# ' FROM IN FRONT OF DESIRED STRIPS 
    # Please do not overload their servers by selecting all comics and 1000 strips from each!
    
    keep_only_tags     = [dict(name='div', attrs={'class':['feature','banner']}),
                          ]

    remove_tags = [dict(name='a', attrs={'class':['beginning','prev','cal','next','newest']}),
                   dict(name='div', attrs={'class':['tag-wrapper']}),
                   dict(name='ul', attrs={'class':['share-nav','feature-nav']}),
                   ]
    
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        # Wrap open_novisit so every request is sent with a GoComics Referer header.
        orig_open_novisit = br.open_novisit
        def my_open_no_visit(url, **kwargs):
            req = mechanize.Request(
                    url,
                    headers = {
                        'Referer':'http://www.gocomics.com/',
                        })
            return orig_open_novisit(req)
        br.open_novisit = my_open_no_visit
        return br
         
    def parse_index(self):
        feeds = []
        for title, url in [
                            ######## COMICS - GENERAL ########
                            (u"2 Cows and a Chicken", u"http://www.gocomics.com/2cowsandachicken"),
                            (u"9 to 5", u"http://www.gocomics.com/9to5"),
                            (u"The Academia Waltz", u"http://www.gocomics.com/academiawaltz"),
                            (u"Adam@Home", u"http://www.gocomics.com/adamathome"),
                            (u"Agnes", u"http://www.gocomics.com/agnes"),
                            (u"Andy Capp", u"http://www.gocomics.com/andycapp"),
                            (u"Animal Crackers", u"http://www.gocomics.com/animalcrackers"),
                            (u"Annie", u"http://www.gocomics.com/annie"),
                            (u"The Argyle Sweater", u"http://www.gocomics.com/theargylesweater"),
                            (u"Ask Shagg", u"http://www.gocomics.com/askshagg"),
                            (u"B.C.", u"http://www.gocomics.com/bc"),
                            (u"Back in the Day", u"http://www.gocomics.com/backintheday"),
                            (u"Bad Reporter", u"http://www.gocomics.com/badreporter"),
                            (u"Baldo", u"http://www.gocomics.com/baldo"),
                            (u"Ballard Street", u"http://www.gocomics.com/ballardstreet"),
                            (u"Barkeater Lake", u"http://www.gocomics.com/barkeaterlake"),
                            (u"The Barn", u"http://www.gocomics.com/thebarn"),
                            (u"Basic Instructions", u"http://www.gocomics.com/basicinstructions"),
                            (u"Bewley", u"http://www.gocomics.com/bewley"),
                            (u"Big Top", u"http://www.gocomics.com/bigtop"),
                            (u"Biographic", u"http://www.gocomics.com/biographic"),
                            (u"Birdbrains", u"http://www.gocomics.com/birdbrains"),
                            (u"Bleeker: The Rechargeable Dog", u"http://www.gocomics.com/bleeker"),
                            (u"Bliss", u"http://www.gocomics.com/bliss"),
                            (u"Bloom County", u"http://www.gocomics.com/bloomcounty"),
                            (u"Bo Nanas", u"http://www.gocomics.com/bonanas"),
                            (u"Bob the Squirrel", u"http://www.gocomics.com/bobthesquirrel"),
                            (u"The Boiling Point", u"http://www.gocomics.com/theboilingpoint"),
                            (u"Boomerangs", u"http://www.gocomics.com/boomerangs"),
                            (u"The Boondocks", u"http://www.gocomics.com/boondocks"),
                            (u"Bottomliners", u"http://www.gocomics.com/bottomliners"),
                            (u"Bound and Gagged", u"http://www.gocomics.com/boundandgagged"),
                            (u"Brainwaves", u"http://www.gocomics.com/brainwaves"),
                            (u"Brenda Starr", u"http://www.gocomics.com/brendastarr"),
                            (u"Brewster Rockit", u"http://www.gocomics.com/brewsterrockit"),
                            (u"Broom Hilda", u"http://www.gocomics.com/broomhilda"),
                            (u"Calvin and Hobbes", u"http://www.gocomics.com/calvinandhobbes"),
                            (u"Candorville", u"http://www.gocomics.com/candorville"),
                            (u"Cathy", u"http://www.gocomics.com/cathy"),
                            (u"C'est la Vie", u"http://www.gocomics.com/cestlavie"),
                            (u"Chuckle Bros", u"http://www.gocomics.com/chucklebros"),
                            (u"Citizen Dog", u"http://www.gocomics.com/citizendog"),
                            (u"The City", u"http://www.gocomics.com/thecity"),
                            (u"Cleats", u"http://www.gocomics.com/cleats"),
                            (u"Close to Home", u"http://www.gocomics.com/closetohome"),
                            (u"Compu-toon", u"http://www.gocomics.com/compu-toon"),
                            (u"Cornered", u"http://www.gocomics.com/cornered"),
                            (u"Cul de Sac", u"http://www.gocomics.com/culdesac"),
                            # (u"Daddy's Home", u"http://www.gocomics.com/daddyshome"),
                            # (u"Deep Cover", u"http://www.gocomics.com/deepcover"),
                            # (u"Dick Tracy", u"http://www.gocomics.com/dicktracy"),
                            # (u"The Dinette Set", u"http://www.gocomics.com/dinetteset"),
                            # (u"Dog Eat Doug", u"http://www.gocomics.com/dogeatdoug"),
                            # (u"Domestic Abuse", u"http://www.gocomics.com/domesticabuse"),
                            # (u"Doodles", u"http://www.gocomics.com/doodles"),
                            # (u"Doonesbury", u"http://www.gocomics.com/doonesbury"),
                            # (u"The Doozies", u"http://www.gocomics.com/thedoozies"),
                            # (u"The Duplex", u"http://www.gocomics.com/duplex"),
                            # (u"Eek!", u"http://www.gocomics.com/eek"),
                            # (u"The Elderberries", u"http://www.gocomics.com/theelderberries"),
                            # (u"Flight Deck", u"http://www.gocomics.com/flightdeck"),
                            # (u"Flo and Friends", u"http://www.gocomics.com/floandfriends"),
                            # (u"The Flying McCoys", u"http://www.gocomics.com/theflyingmccoys"),
                            # (u"For Better or For Worse", u"http://www.gocomics.com/forbetterorforworse"),
                            # (u"For Heaven's Sake", u"http://www.gocomics.com/forheavenssake"),
                            # (u"Fort Knox", u"http://www.gocomics.com/fortknox"),
                            # (u"FoxTrot", u"http://www.gocomics.com/foxtrot"),
                            # (u"FoxTrot Classics", u"http://www.gocomics.com/foxtrotclassics"),
                            # (u"Frank & Ernest", u"http://www.gocomics.com/frankandernest"),
                            # (u"Fred Basset", u"http://www.gocomics.com/fredbasset"),
                            # (u"Free Range", u"http://www.gocomics.com/freerange"),
                            # (u"Frog Applause", u"http://www.gocomics.com/frogapplause"),
                            # (u"The Fusco Brothers", u"http://www.gocomics.com/thefuscobrothers"),
                            # (u"Garfield", u"http://www.gocomics.com/garfield"),
                            # (u"Garfield Minus Garfield", u"http://www.gocomics.com/garfieldminusgarfield"),
                            # (u"Gasoline Alley", u"http://www.gocomics.com/gasolinealley"),
                            # (u"Gil Thorp", u"http://www.gocomics.com/gilthorp"),
                            # (u"Ginger Meggs", u"http://www.gocomics.com/gingermeggs"),
                            # (u"Girls & Sports", u"http://www.gocomics.com/girlsandsports"),
                            # (u"Haiku Ewe", u"http://www.gocomics.com/haikuewe"),
                            # (u"Heart of the City", u"http://www.gocomics.com/heartofthecity"),
                            # (u"Heathcliff", u"http://www.gocomics.com/heathcliff"),
                            # (u"Herb and Jamaal", u"http://www.gocomics.com/herbandjamaal"),
                            # (u"Home and Away", u"http://www.gocomics.com/homeandaway"),
                            # (u"Housebroken", u"http://www.gocomics.com/housebroken"),
                            # (u"Hubert and Abby", u"http://www.gocomics.com/hubertandabby"),
                            # (u"Imagine This", u"http://www.gocomics.com/imaginethis"),
                            # (u"In the Bleachers", u"http://www.gocomics.com/inthebleachers"),
                            # (u"In the Sticks", u"http://www.gocomics.com/inthesticks"),
                            # (u"Ink Pen", u"http://www.gocomics.com/inkpen"),
                            # (u"It's All About You", u"http://www.gocomics.com/itsallaboutyou"),
                            # (u"Joe Vanilla", u"http://www.gocomics.com/joevanilla"),
                            # (u"La Cucaracha", u"http://www.gocomics.com/lacucaracha"),
                            # (u"Last Kiss", u"http://www.gocomics.com/lastkiss"),
                            # (u"Legend of Bill", u"http://www.gocomics.com/legendofbill"),
                            # (u"Liberty Meadows", u"http://www.gocomics.com/libertymeadows"),
                            # (u"Lio", u"http://www.gocomics.com/lio"),
                            # (u"Little Dog Lost", u"http://www.gocomics.com/littledoglost"),
                            # (u"Little Otto", u"http://www.gocomics.com/littleotto"),
                            # (u"Loose Parts", u"http://www.gocomics.com/looseparts"),
                            # (u"Love Is...", u"http://www.gocomics.com/loveis"),
                            # (u"Maintaining", u"http://www.gocomics.com/maintaining"),
                            # (u"The Meaning of Lila", u"http://www.gocomics.com/meaningoflila"),
                            # (u"Middle-Aged White Guy", u"http://www.gocomics.com/middleagedwhiteguy"),
                            # (u"The Middletons", u"http://www.gocomics.com/themiddletons"),
                            # (u"Momma", u"http://www.gocomics.com/momma"),
                            # (u"Mutt & Jeff", u"http://www.gocomics.com/muttandjeff"),
                            # (u"Mythtickle", u"http://www.gocomics.com/mythtickle"),
                            # (u"Nest Heads", u"http://www.gocomics.com/nestheads"),
                            # (u"NEUROTICA", u"http://www.gocomics.com/neurotica"),
                            # (u"New Adventures of Queen Victoria", u"http://www.gocomics.com/thenewadventuresofqueenvictoria"),
                            # (u"Non Sequitur", u"http://www.gocomics.com/nonsequitur"),
                            # (u"The Norm", u"http://www.gocomics.com/thenorm"),
                            # (u"On A Claire Day", u"http://www.gocomics.com/onaclaireday"),
                            # (u"One Big Happy", u"http://www.gocomics.com/onebighappy"),
                            # (u"The Other Coast", u"http://www.gocomics.com/theothercoast"),
                            # (u"Out of the Gene Pool Re-Runs", u"http://www.gocomics.com/outofthegenepool"),
                            # (u"Overboard", u"http://www.gocomics.com/overboard"),
                            # (u"Pibgorn", u"http://www.gocomics.com/pibgorn"),
                            # (u"Pibgorn Sketches", u"http://www.gocomics.com/pibgornsketches"),
                            # (u"Pickles", u"http://www.gocomics.com/pickles"),
                            # (u"Pinkerton", u"http://www.gocomics.com/pinkerton"),
                            # (u"Pluggers", u"http://www.gocomics.com/pluggers"),
                            # (u"Pooch Cafe", u"http://www.gocomics.com/poochcafe"),
                            # (u"PreTeena", u"http://www.gocomics.com/preteena"),
                            # (u"The Quigmans", u"http://www.gocomics.com/thequigmans"),
                            # (u"Rabbits Against Magic", u"http://www.gocomics.com/rabbitsagainstmagic"),
                            # (u"Real Life Adventures", u"http://www.gocomics.com/reallifeadventures"),
                            # (u"Red and Rover", u"http://www.gocomics.com/redandrover"),
                            # (u"Red Meat", u"http://www.gocomics.com/redmeat"),
                            # (u"Reynolds Unwrapped", u"http://www.gocomics.com/reynoldsunwrapped"),
                            # (u"Ronaldinho Gaucho", u"http://www.gocomics.com/ronaldinhogaucho"),
                            # (u"Rubes", u"http://www.gocomics.com/rubes"),
                            # (u"Scary Gary", u"http://www.gocomics.com/scarygary"),
                            # (u"Shoe", u"http://www.gocomics.com/shoe"),
                            # (u"Shoecabbage", u"http://www.gocomics.com/shoecabbage"),
                            # (u"Skin Horse", u"http://www.gocomics.com/skinhorse"),
                            # (u"Slowpoke", u"http://www.gocomics.com/slowpoke"),
                            # (u"Speed Bump", u"http://www.gocomics.com/speedbump"),
                            # (u"State of the Union", u"http://www.gocomics.com/stateoftheunion"),
                            # (u"Stone Soup", u"http://www.gocomics.com/stonesoup"),
                            # (u"Strange Brew", u"http://www.gocomics.com/strangebrew"),
                            # (u"Sylvia", u"http://www.gocomics.com/sylvia"),
                            # (u"Tank McNamara", u"http://www.gocomics.com/tankmcnamara"),
                            # (u"Tiny Sepuku", u"http://www.gocomics.com/tinysepuku"),
                            # (u"TOBY", u"http://www.gocomics.com/toby"),
                            # (u"Tom the Dancing Bug", u"http://www.gocomics.com/tomthedancingbug"),
                            # (u"Too Much Coffee Man", u"http://www.gocomics.com/toomuchcoffeeman"),
                            # (u"W.T. Duck", u"http://www.gocomics.com/wtduck"),
                            # (u"Watch Your Head", u"http://www.gocomics.com/watchyourhead"),
                            # (u"Wee Pals", u"http://www.gocomics.com/weepals"),
                            # (u"Winnie the Pooh", u"http://www.gocomics.com/winniethepooh"),
                            # (u"Wizard of Id", u"http://www.gocomics.com/wizardofid"),
                            # (u"Working It Out", u"http://www.gocomics.com/workingitout"),
                            # (u"Yenny", u"http://www.gocomics.com/yenny"),
                            # (u"Zack Hill", u"http://www.gocomics.com/zackhill"),
                            # (u"Ziggy", u"http://www.gocomics.com/ziggy"),
                            ######## COMICS - EDITORIAL ########
                            # ("Lalo Alcaraz","http://www.gocomics.com/laloalcaraz"),
                            # ("Nick Anderson","http://www.gocomics.com/nickanderson"),
                            # ("Chuck Asay","http://www.gocomics.com/chuckasay"),
                            # ("Tony Auth","http://www.gocomics.com/tonyauth"),
                            # ("Donna Barstow","http://www.gocomics.com/donnabarstow"),
                            # ("Bruce Beattie","http://www.gocomics.com/brucebeattie"),
                            # ("Clay Bennett","http://www.gocomics.com/claybennett"),
                            # ("Lisa Benson","http://www.gocomics.com/lisabenson"),
                            # ("Steve Benson","http://www.gocomics.com/stevebenson"),
                            # ("Chip Bok","http://www.gocomics.com/chipbok"),
                            # ("Steve Breen","http://www.gocomics.com/stevebreen"),
                            # ("Chris Britt","http://www.gocomics.com/chrisbritt"),
                            # ("Stuart Carlson","http://www.gocomics.com/stuartcarlson"),
                            # ("Ken Catalino","http://www.gocomics.com/kencatalino"),
                            # ("Paul Conrad","http://www.gocomics.com/paulconrad"),
                            # ("Jeff Danziger","http://www.gocomics.com/jeffdanziger"),
                            # ("Matt Davies","http://www.gocomics.com/mattdavies"),
                            # ("John Deering","http://www.gocomics.com/johndeering"),
                            # ("Bob Gorrell","http://www.gocomics.com/bobgorrell"),
                            # ("Walt Handelsman","http://www.gocomics.com/walthandelsman"),
                            # ("Clay Jones","http://www.gocomics.com/clayjones"),
                            # ("Kevin Kallaugher","http://www.gocomics.com/kevinkallaugher"),
                            # ("Steve Kelley","http://www.gocomics.com/stevekelley"),
                            # ("Dick Locher","http://www.gocomics.com/dicklocher"),
                            # ("Chan Lowe","http://www.gocomics.com/chanlowe"),
                            # ("Mike Luckovich","http://www.gocomics.com/mikeluckovich"),
                            # ("Gary Markstein","http://www.gocomics.com/garymarkstein"),
                            # ("Glenn McCoy","http://www.gocomics.com/glennmccoy"),
                            # ("Jim Morin","http://www.gocomics.com/jimmorin"),
                            # ("Jack Ohman","http://www.gocomics.com/jackohman"),
                            # ("Pat Oliphant","http://www.gocomics.com/patoliphant"),
                            # ("Joel Pett","http://www.gocomics.com/joelpett"),
                            # ("Ted Rall","http://www.gocomics.com/tedrall"),
                            # ("Michael Ramirez","http://www.gocomics.com/michaelramirez"),
                            # ("Marshall Ramsey","http://www.gocomics.com/marshallramsey"),
                            # ("Steve Sack","http://www.gocomics.com/stevesack"),
                            # ("Ben Sargent","http://www.gocomics.com/bensargent"),
                            # ("Drew Sheneman","http://www.gocomics.com/drewsheneman"),
                            # ("John Sherffius","http://www.gocomics.com/johnsherffius"),
                            # ("Small World","http://www.gocomics.com/smallworld"),
                            # ("Scott Stantis","http://www.gocomics.com/scottstantis"),
                            # ("Wayne Stayskal","http://www.gocomics.com/waynestayskal"),
                            # ("Dana Summers","http://www.gocomics.com/danasummers"),
                            # ("Paul Szep","http://www.gocomics.com/paulszep"),
                            # ("Mike Thompson","http://www.gocomics.com/mikethompson"),
                            # ("Tom Toles","http://www.gocomics.com/tomtoles"),
                            # ("Gary Varvel","http://www.gocomics.com/garyvarvel"),
                            # ("ViewsAfrica","http://www.gocomics.com/viewsafrica"),
                            # ("ViewsAmerica","http://www.gocomics.com/viewsamerica"),
                            # ("ViewsAsia","http://www.gocomics.com/viewsasia"),
                            # ("ViewsBusiness","http://www.gocomics.com/viewsbusiness"),
                            # ("ViewsEurope","http://www.gocomics.com/viewseurope"),
                            # ("ViewsLatinAmerica","http://www.gocomics.com/viewslatinamerica"),
                            # ("ViewsMidEast","http://www.gocomics.com/viewsmideast"),
                            # ("Views of the World","http://www.gocomics.com/viewsoftheworld"),
                            # ("Kerry Waghorn","http://www.gocomics.com/facesinthenews"),
                            # ("Dan Wasserman","http://www.gocomics.com/danwasserman"),
                            # ("Signe Wilkinson","http://www.gocomics.com/signewilkinson"),
                            # ("Wit of the World","http://www.gocomics.com/witoftheworld"),
                            # ("Don Wright","http://www.gocomics.com/donwright"),
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        # Collect num_comics_to_get strips by following each page's 'prev' link.
        current_articles = []
        for _ in range(self.num_comics_to_get):
            page_soup = self.index_to_soup(url)
            if not page_soup:
                # Without a page there is no 'prev' link to follow, so stop here.
                break
            strip_title = page_soup.h1.a.string
            date_title = page_soup.find('ul', attrs={'class': 'feature-nav'}).li.string
            title = strip_title + ' - ' + date_title
            page_url = 'http://www.gocomics.com' + page_soup.h1.a['href']
            current_articles.append({'title': title, 'url': page_url, 'description': '', 'date': ''})
            prev_link = page_soup.find('a', attrs={'class': 'prev'})
            if not prev_link:
                # No earlier strip is linked from this page.
                break
            url = 'http://www.gocomics.com' + prev_link['href']
        # Pages were visited newest first; return them oldest first.
        current_articles.reverse()
        return current_articles

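    def preprocess_html(self, soup):
        # Default to an empty date so the heading code below never sees an
        # undefined name when the page has no usable <title>.
        comic_date = ''
        if soup.title:
            title_string = soup.title.string.strip()
            if ',' in title_string:
                # Use the text after the first comma, minus its trailing piece.
                _cd = title_string.split(',', 1)[1]
                comic_date = ' '.join(_cd.split(' ', 4)[0:-1])
        if soup.h1 and soup.h1.span:
            artist = soup.h1.span.string
            soup.h1.span.string.replaceWith(comic_date + artist)
        feature_item = soup.find('p', attrs={'class': 'feature_item'})
        # Skip pages without the strip block instead of failing on '.a'.
        if feature_item and feature_item.a and feature_item.a.img:
            a_tag = feature_item.a
            a_href = a_tag["href"]
            img_tag = a_tag.img
            # Point the image at the full-size strip linked from the thumbnail.
            img_tag["src"] = a_href
            img_tag["width"] = self.comic_size
            img_tag["height"] = None
        return soup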
        
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    img {max-width:100%; min-width:100%;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
Old 03-30-2010, 03:37 AM   #1691
dhiru
Connoisseur
dhiru began at the beginning.
 
Posts: 83
Karma: 10
Join Date: Aug 2009
Device: iphone, Irex iliad, sony prs950, kindle Dx, Ipad
Hi Kovid, the DNA India recipe has a lot of unwanted material in each article. Is it possible to clean it up?
Thanks
Old 03-30-2010, 03:40 AM   #1692
gambarini
Connoisseur
gambarini began at the beginning.
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by kiklop74 View Post
This is a classic case of obfuscated links. But let me explain a few things first. This January Kovid and I exchanged several emails about a problem with slow feed downloads. After some experiments I found that the main culprit was the use of obfuscated links from the feed. The solution was to update the default implementation of get_article_url to take into account not only the link tag but also feedburner:OrigLink, which (if it exists) contains the real, non-obfuscated link. However, this solution does not cover all cases. Sometimes feeds do not have an origlink tag but use the guid tag instead. In those cases a recipe developer should override get_article_url and read the value of the guid tag. That way we get the maximum download speed, and optionally we can work on the print URL if the site offers one.

punto-informatico.it does not offer a special print page, so you will need to scrape the default page. Just add this to your recipe to get the real links:

Code:
def get_article_url(self, article):
    return article.get('guid', None)
Great!
Old 03-30-2010, 07:25 AM   #1693
gambarini
Connoisseur
gambarini began at the beginning.
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by kiklop74 View Post
This is a classic case of obfuscated links. But let me explain a few things first. This January Kovid and I exchanged several emails about a problem with slow feed downloads. After some experiments I found that the main culprit was the use of obfuscated links from the feed. The solution was to update the default implementation of get_article_url to take into account not only the link tag but also feedburner:OrigLink, which (if it exists) contains the real, non-obfuscated link. However, this solution does not cover all cases. Sometimes feeds do not have an origlink tag but use the guid tag instead. In those cases a recipe developer should override get_article_url and read the value of the guid tag. That way we get the maximum download speed, and optionally we can work on the print URL if the site offers one.

punto-informatico.it does not offer a special print page, so you will need to scrape the default page. Just add this to your recipe to get the real links:

Code:
def get_article_url(self, article):
    return article.get('guid', None)
The link returned from get_article_url is correct (great!), but in the EPUB I find only:

This article was downloaded by calibre from
http://punto-informatico.it/2843719/...e-opteron.aspx

Last edited by gambarini; 03-30-2010 at 10:28 AM.
Old 03-30-2010, 02:55 PM   #1694
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kiklop74 View Post
This is a classic case of obfuscated links. But let me explain a few things first. This January Kovid and I exchanged several emails about a problem with slow feed downloads. After some experiments I found that the main culprit was the use of obfuscated links from the feed.
I'd like to pick your brain a bit on this subject. IIRC (I haven't worked on recipes in a while), the obfuscated link problem is solved by setting up a browser inside the recipe and having the browser "click" on the obfuscated link, then feeding the results back to the recipe (by writing them into a local file) for further processing.

Quote:
The solution was to update the default implementation of get_article_url to take into account not only the link tag but also feedburner:OrigLink, which (if it exists) contains the real, non-obfuscated link.
Having looked at miscellaneous information about feedburner, I think this tag may exist in the data the recipe receives from the RSS feed. Are you saying that get_article_url was rewritten to find and use this link in cases where the link tag was missing? Basically, just an improvement in the underlying method of obtaining the link which would previously have required using obfuscated link retrieval methods?

Quote:
However, this solution does not cover all cases. Sometimes feeds do not have an origlink tag but use the guid tag instead.
I'm really not that familiar with the content of RSS feeds, but I've seen the guid.

Quote:
In those cases a recipe developer should override get_article_url and read the value of the guid tag.
Why wasn't get_article_url rewritten to pick up guid if both the origlink and links were missing?

Thanks (in advance) for filling in some blank spots for me.
Old 03-30-2010, 03:03 PM   #1695
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.
 
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Quote:
Originally Posted by Starson17 View Post
I'd like to pick your brain a bit on this subject. IIRC (I haven't worked on recipes in a while), the obfuscated link problem is solved by setting up a browser inside the recipe and having the browser "click" on the obfuscated link, then feeding the results back to the recipe (by writing them into a local file) for further processing.
You are confusing obfuscated links with obfuscated content. An obfuscated link is just an alias for the real link where the content is.

Quote:
Originally Posted by Starson17 View Post
Having looked at miscellaneous information about feedburner, I think this tag may exist in the data the recipe receives from the RSS feed. Are you saying that get_article_url was rewritten to find and use this link in cases where the link tag was missing? Basically, just an improvement in the underlying method of obtaining the link which would previously have required using obfuscated link retrieval methods?

No. As I said in the previous post, the fix was implemented because downloading content through obfuscated links is several times slower than through direct links. The fix simply looks for the feedburner_origlink tag first. If it exists, its content is used. If it does not, the content of the link tag is returned.
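
The lookup order just described can be sketched roughly as follows. This is an illustration, not calibre's actual source: it assumes the parsed feed entry behaves like a dict and exposes the tags under the keys feedburner_origlink and link, and the function name pick_article_url is hypothetical.

```python
def pick_article_url(article):
    """Return the direct link if the feed provides one, else the plain link.

    `article` is assumed to be a dict-like parsed feed entry; the key names
    are assumptions made for this sketch.
    """
    # Prefer the non-obfuscated FeedBurner link when present and non-empty.
    url = article.get('feedburner_origlink')
    if not url:
        # Otherwise fall back to the ordinary link value (may be None).
        url = article.get('link')
    return url


# Example entries with made-up URLs.
entry_with_origlink = {
    'link': 'http://feedproxy.example.com/~r/feed/~3/abc123/',
    'feedburner_origlink': 'http://www.example.com/story.html',
}
entry_plain = {'link': 'http://www.example.com/other.html'}

print(pick_article_url(entry_with_origlink))
print(pick_article_url(entry_plain))
```

Note that guid is deliberately not consulted here, which matches the behaviour described in this post.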

Quote:
Originally Posted by Starson17 View Post
Why wasn't get_article_url rewritten to pick up guid if both the origlink and links were missing?
Because the guid tag is not guaranteed to contain a valid URL to the content, while origlink is. Therefore, if a developer is sure that guid contains valid links, he can override the method in his recipe.
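
For feeds where guid is only sometimes a URL, one cautious variant of the override is to use guid only when it looks like an absolute HTTP(S) address. A rough sketch under stated assumptions: the entry is dict-like, the class GuidRecipe is a stand-in (a real recipe would subclass calibre's BasicNewsRecipe), and the fallback to link is an assumption of this sketch rather than something calibre does for you.

```python
class GuidRecipe(object):
    """Stand-in for a recipe class; a real one would subclass BasicNewsRecipe."""

    def get_article_url(self, article):
        guid = article.get('guid')
        # Only trust guid when it looks like an absolute HTTP(S) URL.
        if guid and guid.startswith(('http://', 'https://')):
            return guid
        # Hypothetical fallback for feeds whose guid is an opaque identifier.
        return article.get('link')


recipe = GuidRecipe()
# Example entries with made-up values.
print(recipe.get_article_url({'guid': 'http://www.example.com/a.html',
                              'link': 'http://feedproxy.example.com/x'}))
print(recipe.get_article_url({'guid': 'tag:example.com,2010:1234',
                              'link': 'http://www.example.com/b.html'}))
```

The prefix check is intentionally crude; it only filters out guids that are plainly not web addresses.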