MobileRead Forums > E-Book Software > Calibre > Recipes

09-12-2011, 11:18 PM   #1
JayKindle
Connoisseur
Posts: 68
Karma: 10
Join Date: Sep 2011
Device: Kindle 3 w/3G + WiFi
Hide banner and ad from a phpBB2 site?

I am trying to fetch an RSS feed from a site I visit. The site is a phpBB2 site.

The RSS feed they have is: http://feeds.feedburner.com/Mixingonbeatcom -- it's supposed to fetch the most recent posts.

The issue is that when I add this to calibre's custom news, it fetches the top logo banner and an ad underneath the main logo banner. I also see a Google ad block; although I don't see a banner there, I hope I can block that as well.

I also think that the most recent topic can't be read. I do see it in a normal RSS reader, and I see it in my Kindle 3's View Articles List, but when I open it to read it, the post is not shown, only the banners I mentioned above. The older posts can be seen.

Please advise.

09-14-2011, 02:48 PM   #2
JayKindle
Perhaps it would be easier if you guys pointed me in the right direction so I can do this myself. I think it may be along the lines of removing tags, but I need a bit of schooling. Thanks in advance.

09-15-2011, 03:32 AM   #3
JayKindle
Okay, a little more info. The site I am trying to fetch has a Print feature, which has a cleaner layout but still has ad banners.

I was trying to follow the recipe for making links fetch the data from a print page instead. But I am having trouble knowing what to add to the code and where, or rather what to put in to make this work.

Here is a normal link to that site:
http://www.mixingonbeat.com/phpbb/viewtopic.php?t=6452

Here is a print link to that site:
http://www.mixingonbeat.com/phpbb/vi...ote=viewresult

Here is their RSS Feed to that page:
http://www.mixingonbeat.com/phpbb/rss.php?t=6452
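
The normal link and the per-topic RSS link above differ only in the script name; both carry the same t= topic id. A minimal sketch of deriving one from the other with the Python standard library follows. The helper names topic_id and rss_url_for_topic are made up for illustration, and the print link is left out on purpose because its URL is truncated in this post.

```python
from urllib.parse import urlparse, parse_qs

# Base of the per-topic feed, taken from the RSS link quoted above.
RSS_BASE = 'http://www.mixingonbeat.com/phpbb/rss.php'


def topic_id(url):
    """Return the value of the t= query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get('t')
    return values[0] if values else None


def rss_url_for_topic(url):
    """Build the per-topic RSS URL for a viewtopic.php link (hypothetical helper)."""
    tid = topic_id(url)
    if tid is None:
        return None
    return '%s?t=%s' % (RSS_BASE, tid)


print(rss_url_for_topic('http://www.mixingonbeat.com/phpbb/viewtopic.php?t=6452'))
```

Inside a recipe, this kind of URL rewriting is what the print_version(self, url) method of BasicNewsRecipe is meant for: it receives the article URL and returns the URL that should be downloaded instead.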

Here is the code I am trying to work with:

Spoiler:
    # Add "from calibre.ptempfile import PersistentTemporaryFile" at the top of
    # the recipe; everything below goes inside the recipe class.
    #
    # We need to find all instances of /content/printVersion/.
    # To do this we set up a temp list and turn on the flag that tells
    # calibre the articles are obfuscated. calibre then calls
    # get_obfuscated_article() to get the real article (in our case the print
    # version). We create a browser and let it follow the first link that
    # matches the regular expression .*?(\/)(content)(\/)(printVersion)(\/),
    # so basically any link that looks like /content/printVersion/. The page
    # is written to a temp HTML file that the recipe/calibre will parse.
    # That is all that is needed for this recipe.

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print('THE CURRENT URL IS: ', url)
        br.open(url)
        # Use a try/except block: try to follow /content/printVersion/, and
        # if that fails, fall back to the original calling URL instead of
        # crashing.
        try:
            response = br.follow_link(url_regex='.*?(\\/)(content)(\\/)(printVersion)(\\/)', nr=0)
            html = response.read()
        except Exception:
            response = br.open(url)
            html = response.read()

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name


I really hope someone can help. Thanks.
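
The regular expression in the snippet only follows links that contain /content/printVersion/, which belongs to the site that recipe was originally written for. Here is a quick check, assuming the browser applies url_regex as a search against each link URL. It suggests that neither of the full mixingonbeat.com links quoted in this thread would match. The example.com URL is hypothetical and only there to show a positive match.

```python
import re

# Pattern copied from the follow_link() call in the snippet above.
PRINT_LINK = re.compile('.*?(\\/)(content)(\\/)(printVersion)(\\/)')

candidates = [
    'http://www.mixingonbeat.com/phpbb/viewtopic.php?t=6452',  # from this thread
    'http://www.mixingonbeat.com/phpbb/rss.php?t=6452',        # from this thread
    'http://example.com/content/printVersion/12345/',          # hypothetical
]

# Keep only the URLs the pattern would accept.
matches = [url for url in candidates if PRINT_LINK.search(url)]

for url in candidates:
    print('match   ' if url in matches else 'no match', url)
```

If that assumption holds, follow_link() finds nothing on this site, the except branch runs every time, and the normal page (banners included) is what gets saved. The pattern would need to be changed to match the site's own print link.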

09-15-2011, 07:40 AM   #4
JayKindle
Okay, I know you guys must be busy or just away for a few days.

Anyhow, I've been looking inside the other recipes and have spent several hours testing bits of code. I think I got it working: I was able to take out the two banners, along with all the rest of the images, and clean up some unwanted content.

Right now, my only issue is that there is a very large gap between the author and his post or reply. The author name is on top and the person's text is a few large blank spaces further down. The problem is that this makes it seem as if the person who wrote it has signed his name under the text above, and this is not correct.

Can anyone tell me what code can clear up that huge gap? Also, if you see any errors or other things that could make this work better, by all means please let me know. And please don't laugh: I don't know anything about Python and have very basic skills in HTML, thanks to MySpace for learning a bit here and there.

Spoiler:
from calibre.web.feeds.news import BasicNewsRecipe

class AutoBlog(BasicNewsRecipe):
    title = u'MixingOnBeat'
    timefmt = ' [%Y%b%d %H%M]'
    language = 'en'
    description = 'newspaper'
    oldest_article = 60
    max_articles_per_feed = 200
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    auto_cleanup = True
    remove_empty_feeds = True

    remove_tags = [
        dict(name='div', attrs={'id':['logo', 'sponsor', 'related_objects', 'inset module', 'footer', 'strip_control', 'header', 'navigation', 'Google']}),
        dict(name='hr'),
        dict(name='img'),
        dict(name='Google'),
        dict(name=['meta', 'link', 'iframe', 'object', 'embed', 'Google']),
        dict(attrs={'class':['logo', 'sponsor', 'googleAd', 'genbox', 'copyright', 'nav', 'thLeft', 'thRight', 'catHead', 'postdetails', 'signature', 'Google']}),
        dict(attrs={'id':['article-promo', 'googleads', 'moduleArticleToolsContainer', 'gallery-subcontent', 'Google']}),
    ]

    feeds = [
        (u'New Topics', u'http://www.mixingonbeat.com/phpbb/rss.php'),
        (u'MOB News / Announcements', u'http://www.mixingonbeat.com/phpbb/rss.php?f=1'),
        (u'MOB Lounge (non-DJ Topics)', u'http://www.mixingonbeat.com/phpbb/rss.php?f=5'),
        (u'Equipment Support (DJs Only)', u'http://www.mixingonbeat.com/phpbb/rss.php?f=6'),
        (u'General Mixing Support', u'http://www.mixingonbeat.com/phpbb/rss.php?f=132'),
        (u'Harmonic Mixing Support', u'http://www.mixingonbeat.com/phpbb/rss.php?f=34'),
        (u'Software Mixing (DJs)', u'http://www.mixingonbeat.com/phpbb/rss.php?f=30'),
        (u'MixMeister Support', u'http://www.mixingonbeat.com/phpbb/rss.php?f=66'),
        (u'Video Mixing', u'http://www.mixingonbeat.com/phpbb/rss.php?f=49'),
        (u'General DJ Discussions', u'http://www.mixingonbeat.com/phpbb/rss.php?f=11'),
        (u'Battle DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=110'),
        (u'Club DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=8'),
        (u'Karaoke DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=32'),
        (u'Mobile DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=9'),
        (u'Radio and Mixshow DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=10'),
    ]
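
Every feed URL in the recipe above is the same rss.php address with a different f= forum id, so the list could also be generated rather than typed out. This is only a sketch: the names BASE and FORUMS are made up for illustration, while the titles and ids are copied from the recipe.

```python
BASE = 'http://www.mixingonbeat.com/phpbb/rss.php'

# (feed title, phpBB forum id) pairs copied from the recipe above.
FORUMS = [
    ('MOB News / Announcements', 1),
    ('MOB Lounge (non-DJ Topics)', 5),
    ('Equipment Support (DJs Only)', 6),
    ('General Mixing Support', 132),
    ('Harmonic Mixing Support', 34),
    ('Software Mixing (DJs)', 30),
    ('MixMeister Support', 66),
    ('Video Mixing', 49),
    ('General DJ Discussions', 11),
    ('Battle DJs', 110),
    ('Club DJs', 8),
    ('Karaoke DJs', 32),
    ('Mobile DJs', 9),
    ('Radio and Mixshow DJs', 10),
]

# The first feed has no f= parameter; the rest are one feed per forum.
feeds = [('New Topics', BASE)] + [
    (title, '%s?f=%d' % (BASE, forum_id)) for title, forum_id in FORUMS
]

print(len(feeds))
```

Inside the recipe class, the same expression could be assigned to the feeds attribute. Adding or dropping a forum then means editing one (title, id) pair.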

09-16-2011, 12:00 PM   #5
JayKindle
Is there a setting I can adjust for how short an article can be? Since this is a forum bulletin site (exactly like the MobileRead forums), an article can sometimes be as short as a single sentence. I am noticing that when it's this short, it gets skipped: only the title of the topic is generated, but the content is not fetched, so it just looks blank.

09-16-2011, 12:06 PM   #6
kovidgoyal
creator of calibre
Posts: 25,952
Karma: 5036099
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
IIRC there is no length limitation on articles.

09-16-2011, 02:08 PM   #7
JayKindle
Thanks for your reply! Wow! I finally get a response! For a moment there, I thought my post was hidden or something.

I had to Google what IIRC meant. ("If I Remember Correctly").

I wonder why I am having this issue with my quest? Do you wish to help me troubleshoot it? Here is the recipe to get news from the site.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AutoBlog(BasicNewsRecipe):
    title          = u'MixingOnBeat'
    language = 'en'
    description = 'blog'
    remove_javascript      = True
    oldest_article = 80
    max_articles_per_feed = 200
    timefmt  = ' '
    no_stylesheets        = True
    use_embedded_content  = False
    auto_cleanup = True
    remove_empty_feeds    = False

    remove_attributes = ['style', 'top']
    remove_tags = [
                     dict(name='div', attrs={'id':['logo', 'sponsor', 'related_objects', 'inset module', 'footer', 'strip_control', 'header', 'navigation', 'Google']}), dict(name='hr'), dict(name='img')
                    ,dict(name=['meta', 'link', 'iframe', 'object', 'embed', 'Google'])
                    ,dict(attrs={'class':['logo', 'sponsor', 'googleAd', 'genbox', 'copyright', 'nav', 'thLeft', 'thRight', 'catHead', 'postdetails', 'signature', 'Google']})
                    ,dict(attrs={'id':['article-promo', 'googleads', 'moduleArticleToolsContainer', 'gallery-subcontent', 'Google']})
                  ]

    feeds          = [(u'New Topics', u'http://www.mixingonbeat.com/phpbb/rss.php'),
	(u'MOB News / Announcements', u'http://www.mixingonbeat.com/phpbb/rss.php?f=1'),
	(u'MOB Lounge (non-DJ Topics)', u'http://www.mixingonbeat.com/phpbb/rss.php?f=5'),
	(u'Equipment Support (DJs Only)', u'http://www.mixingonbeat.com/phpbb/rss.php?f=6'),
	(u'General Mixing Support', u'http://www.mixingonbeat.com/phpbb/rss.php?f=132'),
	(u'Harmonic Mixing Support', u'http://www.mixingonbeat.com/phpbb/rss.php?f=34'),
	(u'Software Mixing (DJs)', u'http://www.mixingonbeat.com/phpbb/rss.php?f=30'),
	(u'MixMeister Support', u'http://www.mixingonbeat.com/phpbb/rss.php?f=66'),
	(u'Video Mixing', u'http://www.mixingonbeat.com/phpbb/rss.php?f=49'),
	(u'General DJ Discussions', u'http://www.mixingonbeat.com/phpbb/rss.php?f=11'),
	(u'Battle DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=110'),
	(u'Club DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=8'),
	(u'Karaoke DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=32'),
	(u'Mobile DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=9'),
	(u'Radio and Mixshow DJs', u'http://www.mixingonbeat.com/phpbb/rss.php?f=10')
	]
Now the topic in question is: MixingOnBeat.com :: View topic - RSS Feeds added to MOB -- only the title appears, but the content is skipped. Here is a direct link to the topic in question: http://www.mixingonbeat.com/phpbb/viewtopic.php?t=6452

I hope it can be fixed. Perhaps my recipe is the problem?
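
One experiment that might narrow this down is sketched below. It is an untested guess, not something confirmed anywhere in this thread. The recipe above sets auto_cleanup = True, which hands article extraction to calibre's readability-style heuristics. Downloading a single feed with that option switched off would show whether those heuristics are what drops the one-sentence post. The class name and the shortened remove_tags list are illustrative. The import fallback is only there so the sketch can be run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be read and run outside calibre.
    BasicNewsRecipe = object


class MixingOnBeatNoAutoCleanup(BasicNewsRecipe):
    # Hypothetical test variant of the recipe above, not a confirmed fix.
    title = u'MixingOnBeat (auto_cleanup off)'
    language = 'en'
    oldest_article = 80
    max_articles_per_feed = 200
    no_stylesheets = True
    use_embedded_content = False
    # The one deliberate change: rely on manual cleanup only.
    auto_cleanup = False
    remove_tags = [dict(name='img'), dict(name='hr')]
    # A single feed keeps the test download small.
    feeds = [(u'New Topics', u'http://www.mixingonbeat.com/phpbb/rss.php')]
```

If the short topic shows up with this variant, the remaining work is manual cleanup with remove_tags or keep_only_tags. If it is still blank, the cause lies elsewhere.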

09-16-2011, 02:13 PM   #8
kovidgoyal
I'm afraid I don't have the time to help with recipe creation, sorry.

09-16-2011, 02:33 PM   #9
JayKindle
Ah, ok. Hopefully someone can.

09-16-2011, 04:48 PM   #10
JayKindle
Update:

A friend told me to see how kindlefeeder.com would convert the above site. I must tell you guys it was very clean. It was sent to me as a MOBI file. I wish there was a way to reverse engineer it and see what kind of recipe they used? Anyone?

09-17-2011, 06:24 PM   #11
JayKindle
Update

I've succeeded in making a good recipe out of this project. The only two issues that are still beyond me are:

1) The original site is split into two columns, with the name of the author of the article on the left and the text of the article on the right. I'd like to learn what code would merge both of them into one column.

2) This is going to be difficult to explain, but here goes anyway. Each article I select on my Kindle 3 seems to have its text placed in a square box. The problem is that if the article is longer than one Kindle page, the rest of it is cropped or skipped when I page forward. To keep reading what is left, I have to change the aA font size to the smallest size, and sometimes even the smallest size is not enough to read the whole article, although the ebook viewer in calibre shows that the text does exist. So I would like to learn what code would stop the text from being put in a square box.

I hope someone can suggest code that I can try. Thanks.
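
Both symptoms would be consistent with the posts being laid out in an HTML table, which is an assumption here since the site's markup is not shown in this thread. If that is the case, one thing to try is calibre's linearize_tables conversion option, which can be set from a recipe through the conversion_options attribute. A minimal sketch follows, with an illustrative class name and a single feed. The import fallback is only there so the sketch can be run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be read and run outside calibre.
    BasicNewsRecipe = object


class MixingOnBeatLinear(BasicNewsRecipe):
    # Hypothetical variant, assuming the columns come from table markup.
    title = u'MixingOnBeat (linearized tables)'
    language = 'en'
    no_stylesheets = True
    use_embedded_content = False
    # Ask the conversion step to flatten table layouts into plain blocks.
    conversion_options = {'linearize_tables': True}
    feeds = [(u'New Topics', u'http://www.mixingonbeat.com/phpbb/rss.php')]
```

With the table flattened, the author name and the post text would follow one another in a single column, and the text would no longer be confined to a table cell that cannot break across Kindle pages.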

All times are GMT -4.