Old 01-03-2012, 02:36 PM   #1
kiavash
Old Linux User
kiavash began at the beginning.
 
Posts: 36
Karma: 12
Join Date: Jan 2012
Device: NST
Recipe for Microwave Journal?

Hi there,

I know this is my first post, but believe me, I did my homework: I searched and read as many posts and pages as I could to solve this myself, without success. Asking for help is my last resort...

I am trying to write a recipe that downloads articles from the Microwave Journal (MWJ) website and converts them to an ebook. Like NYT, MWJ requires a username and password (registration is free, BTW), and it also has RSS feeds. To log in, the site sends you to another site. I think (but am not sure) that once you are logged in, the other site sets cookies and sends the browser back to mwjournal.com. The login page has a "Remember me" checkbox.

With the above foreword, I wrote the following recipe:

Spoiler:
PHP Code:
__license__   = 'GPL v3'

'''
mwjournal.com
'''

class MWJournal(BasicNewsRecipe):

    title          = u'Microwave Journal'

    oldest_article = 30
    max_articles_per_feed = 100
    auto_cleanup = True
    no_stylesheets = True
    remove_javascript = True
    language              = 'en'

    feeds          = [(u'Current Issue', u'http://www.mwjournal.com/rss/Rss.asp?type=99')]

    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.omeda.com/cgi-win/mwjreg.cgi?m=login')
            br.select_form(nr = 0)
            br['EMAIL_ADDRESS']   = self.username
            br['PASSWORD'] = self.password
            br.form.find_control(name='remember_me',type="checkbox").get(nr=0).selected = True
            br.submit()
        return br


I got "nr = 0" by inspecting the HTML of the login page (the first FORM is the one for username/password). I also ticked the "Remember me" box (and tested with it unticked too). Still, when the EPUB is built, the site doesn't treat the user as logged in (and yes, I checked that the username and password are correct).

I added two attachments: the EPUB showing the final result (not logged in) and a TXT file showing the ebook-convert output (I manually deleted the username/password; otherwise they were there correctly).

Any help would be highly appreciated.

PS. omeda.com hosts other magazines as well. I searched the online recipe repository to see whether any of those magazines already has a recipe whose code I could reuse, but I found none.
Attached Files
File Type: txt recipts_output.txt (14.2 KB, 498 views)
File Type: epub mwjournal.epub (106.9 KB, 223 views)

Last edited by kiavash; 01-06-2012 at 01:01 PM. Reason: Adding Attachments
Old 01-05-2012, 12:49 PM   #2
kiavash
Any clues? Please!
Old 01-06-2012, 01:05 PM   #3
kiavash
I modified get_browser and tried setting a cookie policy, but I still get the same result: it doesn't log in. What is happening?

Spoiler:
PHP Code:
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        policy = cookielib.DefaultCookiePolicy(allowed_domains=['.omeda.com','omeda.com','mwjournal.com','.mwjournal.com'])
        cjar   = cookielib.CookieJar(policy)
        br.set_cookiejar(cjar)
        if self.username is not None and self.password is not None:
            br.open('http://www.omeda.com/cgi-win/mwjreg.cgi?m=login')
            br.select_form(nr = 0)
            br['EMAIL_ADDRESS']   = self.username
            br['PASSWORD'] = self.password
            br.submit('submitButtonName')
        return br


Is there a way to dump all the HTTP traffic to a file or folder, so I can see whether the login succeeded and whether the recipe went on to fetch the articles after logging in? I know about --debug, but that only dumps the HTML of the RSS articles. How can I dump the output of get_browser(self) to a file?
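One low-tech way to see what the server sends back is to write each response body to a file and open it in a browser. The sketch below is standalone Python 3, not calibre code: dump_response is a hypothetical helper name, and the sample bytes stand in for whatever `br.submit().read()` returns inside get_browser().

```python
import os
import tempfile


def dump_response(raw, name, directory=None):
    """Write the bytes of one HTTP response to a file and return its path."""
    if directory is None:
        # A fresh, writable folder so nothing lands in the root directory.
        directory = tempfile.mkdtemp(prefix='recipe_debug_')
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:
        f.write(raw)
    return path


# Stand-in for `raw = br.submit().read()` inside get_browser().
raw = b'<html><body>Login response goes here</body></html>'
saved = dump_response(raw, 'after_login.html')
print(saved)
```

Calling the helper once after every open() or submit() gives one file per step, which makes it easy to spot the page where the login stops working.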
Old 01-06-2012, 02:55 PM   #4
kiavash
Another (unsuccessful) try:

This time I checked whether JavaScript is interfering with the login process, so I fired up Firebug and inspected the login page. I found this JavaScript:

Spoiler:

PHP Code:
function setActiveStyleSheet(title) {
  var i, a, main;
  for(i=0; (a = document.getElementsByTagName("link")[i]); i++) {
    if(a.getAttribute("rel").indexOf("style") != -1 && a.getAttribute("title")) {
      a.disabled = true;
      if(a.getAttribute("title") == title) a.disabled = false;
    }
  }
}

function getActiveStyleSheet() {
  var i, a;
  for(i=0; (a = document.getElementsByTagName("link")[i]); i++) {
    if(a.getAttribute("rel").indexOf("style") != -1 && a.getAttribute("title") && !a.disabled) return a.getAttribute("title");
  }
  return null;
}

function getPreferredStyleSheet() {
  var i, a;
  for(i=0; (a = document.getElementsByTagName("link")[i]); i++) {
    if(a.getAttribute("rel").indexOf("style") != -1
       && a.getAttribute("rel").indexOf("alt") == -1
       && a.getAttribute("title")
       ) return a.getAttribute("title");
  }
  return null;
}

function createCookie(name,value,days) {
  if (days) {
    var date = new Date();
    date.setTime(date.getTime()+(days*24*60*60*1000));
    var expires = "; expires="+date.toGMTString();
  }
  else expires = "";
  document.cookie = name+"="+value+expires+"; path=/";
}

function readCookie(name) {
  var nameEQ = name + "=";
  var ca = document.cookie.split(';');
  for(var i=0;i < ca.length;i++) {
    var c = ca[i];
    while (c.charAt(0)==' ') c = c.substring(1,c.length);
    if (c.indexOf(nameEQ) == 0) return c.substring(nameEQ.length,c.length);
  }
  return null;
}

window.onload = function(e) {
  var cookie = readCookie("style");
  var title = cookie ? cookie : getPreferredStyleSheet();
  setActiveStyleSheet(title);
}

window.onunload = function(e) {
  var title = getActiveStyleSheet();
  createCookie("style", title, 365);
}

var cookie = readCookie("style");
var title = cookie ? cookie : getPreferredStyleSheet();
setActiveStyleSheet(title);


Now I am out of my comfort zone, as I don't know this language, but I can see a couple of functions with "cookie" in their names: readCookie(name) and createCookie(name,value,days).

Does get_browser() remove the JavaScript?

I tried removing remove_javascript = True from the recipe, and also changing it to False, but it still didn't log in.

I tried to follow this post,
Spoiler:
Some web developers like to submit their login form using a JavaScript function that, before submitting the form, manipulates some input values (perhaps to prevent the very thing we are discussing here). If you run into that, examine the function and mimic its behavior. If the script is hard to find, use Firebug's find feature to search for it. Here is an example of setting the value of a hidden input:

PHP Code:
        br.select_form(nr = 0)
        # find by index
        ctl_1 = br.find_control(type = 'hidden', nr = 3)
        # or by name
        ctl_2 = br.find_control(type = 'hidden', name = 'meal')
        ctl_1.readonly = False
        ctl_2.readonly = False

        ctl_1.value = 'spam'
        ctl_2.value = 'eggs'
but I am completely lost.

Anybody? Please!
Old 01-06-2012, 03:31 PM   #5
Barty
doofus
Posts: 2,507
Karma: 12615905
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
Hi, what's happening is that when you submit the login form, it returns this:

Code:
<html>
<head>
<title>Redirect to BVD</title>
</head>

<body onLoad="document.forms[0].submit();">
<form method="post" action="http://www.mwjournal.com/default.asp">
<input type="hidden" name="cust_id" value="xxxxxxxxxx">
<input type="hidden" name="status" value="xxxxxxxxxx">
<input type="hidden" name="reqURL" value="xxxxxxxxxx">
<input type="hidden" name="email" value="xxxxxxxxxx">
<input type="hidden" name="password" value="xxxxxxxxxx">
<input type="hidden" name="fname" value="xxxxxxxxxx">
<input type="hidden" name="lname" value="xxxxxxxxxx">
<input type="hidden" name="company" value="xxxxxxxxxx">
<input type="hidden" name="country" value="xxxxxxxxxx">
<input type="hidden" name="job_title" value="xxxxxxxxxx">
<input type="hidden" name="newsletter" value="xxxxxxxxxx">
<input type="hidden" name="microwave_advisor" value="xxxxxxxxxx">
<input type="hidden" name="microview" value="xxxxxxxxxx">
<input type="hidden" name="remember_me" value="xxxxxxxxxx">
<input type="hidden" name="state" value="xxxxxxxxxx">








 
<!--<include>redirect-fields.htm</include>-->
</form>
</body>
</html>
In other words, it is another form that you need to submit. So this will do the trick:

Code:
    def get_browser(self): 
        br = BasicNewsRecipe.get_browser() 
        if self.username is not None and self.password is not None: 
            br.open('http://www.omeda.com/cgi-win/mwjreg.cgi?m=login') 
            br.select_form('login') 
            br['EMAIL_ADDRESS']   = self.username 
            br['PASSWORD'] = self.password 
            html = br.submit().read()
            open('/jwtmp.html','wb').write(html)
            br.open('file:///jwtmp.html')
            br.select_form(nr=0) 
            br.submit()
        return br
I'm writing the result to a temp file, opening it, then submitting the form in it. I don't know if there's a more elegant way to do it without going through the temp file.
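One possible way around the temp file is to pull the hidden fields out of the returned HTML directly and build the POST body by hand. This is only a sketch in standalone Python 3 (standard library only), not something tested against the real site: HiddenFormParser is a hypothetical helper, and the field values in the sample page are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFormParser(HTMLParser):
    """Collect the action URL and the hidden inputs of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self._done:
            self._in_form = True
            self.action = attrs.get('action')
        elif tag == 'input' and self._in_form:
            is_hidden = (attrs.get('type') or '').lower() == 'hidden'
            if is_hidden and attrs.get('name'):
                self.fields.append((attrs['name'], attrs.get('value') or ''))

    def handle_endtag(self, tag):
        if tag == 'form' and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms


# Cut-down copy of the redirect page, with made-up values.
sample = '''<html><body onLoad="document.forms[0].submit();">
<form method="post" action="http://www.mwjournal.com/default.asp">
<input type="hidden" name="cust_id" value="12345">
<input type="hidden" name="remember_me" value="Y">
</form></body></html>'''

parser = HiddenFormParser()
parser.feed(sample)
body = urlencode(parser.fields)
print(parser.action)
print(body)
```

In a recipe, the action URL and the encoded body could then presumably be handed to the browser object as a POST, but that last step is an assumption and is not shown here.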

Note: you will want to clean that up for production code. You probably don't want to write to the root directory like that (permission problems), and you'll want to delete the temp file afterward.
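A minimal sketch of that cleanup, using only the Python 3 standard library: the file goes into the system temp folder instead of the root, and it is deleted even if something fails in between. The sample bytes stand in for the HTML returned by the first submit; the mechanize calls are only mentioned in a comment because they need a live login.

```python
import os
import tempfile
from pathlib import Path

# Stand-in for `html = br.submit().read()`.
raw = (b'<html><body><form method="post" '
       b'action="http://www.mwjournal.com/default.asp"></form></body></html>')

# mkstemp picks a unique name in a folder the user is allowed to write to.
fd, path = tempfile.mkstemp(suffix='.html')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(raw)
    local_url = Path(path).as_uri()  # a file:// URL for the temp file
    # In the recipe, opening local_url, selecting form 0 and submitting it
    # would happen here, while the file still exists.
    with open(path, 'rb') as f:
        round_trip = f.read()
finally:
    os.remove(path)  # always delete the temp file afterward

print(local_url)
```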
Old 01-06-2012, 08:43 PM   #6
kiavash
Hi Barty,

Thanks a lot. This explains everything. A big step forward, thanks to your help.

I am going to study the ESPN recipe closely. Kovid used "TemporaryFile" to avoid writing to the root (or to any folder that may not be writable). Hopefully "TemporaryFile" or "PersistentTemporaryFile" (example) will be the magic bullet.

PHP Code:
from calibre.ptempfile import TemporaryFile
...
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        br.set_handle_refresh(False)
        url = ('https://r.espn.go.com/members/v3_1/login')
        raw = br.open(url).read()
        raw = re.sub(r'(?s)<form>.*?id="regsigninbtn".*?</form>', '', raw)
        with TemporaryFile(suffix='.htm') as fname:
            with open(fname, 'wb') as f:
                f.write(raw)
            br.open_local_file(fname)

        br.form = br.forms().next()
        br.form.find_control(name='username', type='text').value = self.username
        br.form['password'] = self.password
        br.submit().read()
        br.open('http://espn.go.com').read()
        br.set_handle_refresh(True)
        return br
Old 01-08-2012, 12:11 AM   #7
kiavash
The only thing left is the cover. That part is even less documented on the website. More to read...

So far, the script looks like this, with plenty of comments documenting what is happening.

Spoiler:
PHP Code:
'''
Microwave Journal Monthly Magazine
You need to sign up (free) and get username/password.
'''

import re    # Import the regular expressions module.
from calibre.ptempfile import TemporaryFile # we need this for saving to a temp file

class MWJournal(BasicNewsRecipe):
    # Title to use for the ebook.
    title          = u'Microwave Journal'

    #A brief description for the ebook.
    description = u'Microwave Journal web site ebook created using rss feeds.'

    # Set publisher and publication type.
    publisher = 'Horizon House'
    publication_type = 'magazine'
    language = 'en'

    oldest_article = 30        # monthly published magazine
    max_articles_per_feed = 100
    remove_empty_feeds = True
    auto_cleanup = True

    # Disable stylesheets and javascript from site.
    no_stylesheets = True
    remove_javascript = True

    needs_subscription = True    # oh yeah... we need to login btw.

    # Timeout for fetching files from the server in seconds. The default of 120 seconds, seems somewhat excessive.
    timeout = 30

    # Specify extra CSS - overrides ALL other CSS (IE. Added last).
    extra_css = 'body { font-family: verdana, helvetica, sans-serif; } \
                 .introduction, .first { font-weight: bold; } \
                 .cross-head { font-weight: bold; font-size: 125%; } \
                 .cap, .caption { display: block; font-size: 80%; font-style: italic; } \
                 .cap, .caption, .caption img, .caption span { display: block; text-align: center; margin: 5px auto; } \
                 .byl, .byd, .byline img, .byline-name, .byline-title, .author-name, .author-position, \
                    .correspondent-portrait img, .byline-lead-in, .name, .bbc-role { display: block; \
                    text-align: center; font-size: 80%; font-style: italic; margin: 1px auto; } \
                 .story-date, .published { font-size: 80%; } \
                 table { width: 100%; } \
                 td img { display: block; margin: 5px auto; } \
                 ul { padding-top: 10px; } \
                 ol { padding-top: 10px; } \
                 li { padding-top: 5px; padding-bottom: 5px; } \
                 h1 { text-align: center; font-size: 175%; font-weight: bold; } \
                 h2 { text-align: center; font-size: 150%; font-weight: bold; } \
                 h3 { text-align: center; font-size: 125%; font-weight: bold; } \
                 h4, h5, h6 { text-align: center; font-size: 100%; font-weight: bold; }'

    
    remove_tags    = [
                        dict(name='div', attrs={'class':'boxadzonearea350'}), # Removes banner ads
                        dict(name='font', attrs={'class':'footer'}),    # remove fonts if you do like your fonts more! Comment out to use website's fonts
                     ]

    # Remove various tag attributes to improve the look of the ebook pages.
    remove_attributes = [ 'border', 'cellspacing', 'align', 'cellpadding', 'colspan',
                          'valign', 'vspace', 'hspace', 'alt', 'width', 'height' ]

    # Remove the line breaks,
    preprocess_regexps     = [(re.compile(r'<br[ ]*/>', re.IGNORECASE), lambda m: ''),
                              (re.compile(r'<br[ ]*clear.*/>', re.IGNORECASE), lambda m: '')]

    # Select the feeds that you are interested.
    feeds          = [
                        (u'Current Issue', u'http://www.mwjournal.com/rss/Rss.asp?type=99'),
                        (u'Industry News', u'http://mwjournal.com/rss/Rss.asp?type=1'),
                        #(u'Resources', u'http://mwjournal.com/rss/Rss.asp?type=3'),
                        #(u'Buyer"s Guide', u'http://mwjournal.com/rss/Rss.asp?type=5'),
                        (u'Events', u'http://mwjournal.com/rss/Rss.asp?type=2'),
                        #(u'All Updates', u'http://mwjournal.com/rss/Rss.asp?type=0'),
                    ]

    cover_url = 'http://www.mwjournal.com/IssueImg/3_MWJ_CurrIss_CoverImg_12_2011.jpg'

    def print_version(self, url):
        '''
        This function uses the print version of the article. It replaces the URL with that of its print version, so that page is fetched instead.
        '''
        return url.replace('http://mwjournal.com/Journal/article.asp?HH_ID=', 'http://mwjournal.com/Journal/Print.asp?Id=')

    def get_browser(self):
        '''
        The Microwave Journal website sends the login to omeda.com. Once the login info is submitted, omeda.com redirects back to mwjournal.com, where the browser logs in again (hidden from the user). To overcome this obstacle, the first login form is submitted and its response is stored in an HTML file. Then the HTML file is opened and the second login form is submitted (many thanks to Barty, who helped with the second login page).
        '''
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            url = ('http://www.omeda.com/cgi-win/mwjreg.cgi?m=login')  #  main login page.
            br.open(url)    # fetch the 1st login page
            br.select_form('login')        # finds the login form
            br['EMAIL_ADDRESS']   = self.username    # fills the username
            br['PASSWORD'] = self.password        # fills the password
            raw = br.submit().read()        # submit the form and read the 2nd login form
            # save it to an htm temp file (from ESPN recipe written by  Kovid Goyal kovid@kovidgoyal.net
            with TemporaryFile(suffix='.htm') as fname:
                with open(fname, 'wb') as f:
                    f.write(raw)
                br.open_local_file(fname)
            br.select_form(nr=0)    # finds submit on the 2nd form
            didwelogin = br.submit().read()        # submit it and read the return html
            if 'Welcome ' not in didwelogin:    # did it login successfully? Is Username/password correct?
                raise Exception('Failed to login, are you sure your username and password are correct?')
            #login is done
        return br


Actually, it uses the ESPN recipe's technique to dump the response to the first login into a temp file. I am ready to write a couple of paragraphs and add them here, teaching others how to solve the problem of a login that spans two HTML forms.
Old 01-08-2012, 01:53 AM   #8
kiavash
There it is. It fetches the latest cover and adds it to the ebook.

PHP Code:
    #  No magazine is complete without cover. Let's get it then!
    # The function is adapted from the Economist recipe
    def get_cover_url(self):
        cover_url = None
        cover_page_location = 'http://www.mwjournal.com/Journal/'    # Cover image is located on this page
        soup = self.index_to_soup(cover_page_location)
        cover_item = soup.find('img',attrs={'src':lambda x: x and '/IssueImg/3_MWJ_CurrIss_CoverImg' in x})    # There are three files named cover, we want the highest resolution which is the 3rd image. So we look for the pattern. Remember that the name of the cover image changes every month so we cannot search for the complete name. Instead we are searching for the pattern
        if cover_item:
            cover_url = 'http://www.mwjournal.com' + cover_item['src'].strip()    # yeah! we found it. Let's fetch the image file and pass it as cover to calibre
        return cover_url
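The pattern search can be tried outside calibre. The sketch below is standalone Python 3 that uses the standard-library HTML parser in place of index_to_soup; CoverFinder is a hypothetical helper, and the first image name in the sample page is invented to show that only the "3_" file matches.

```python
from html.parser import HTMLParser

COVER_PATTERN = '/IssueImg/3_MWJ_CurrIss_CoverImg'
SITE = 'http://www.mwjournal.com'


class CoverFinder(HTMLParser):
    """Remember the first <img> whose src contains the cover pattern."""

    def __init__(self):
        super().__init__()
        self.cover_url = None

    def handle_starttag(self, tag, attrs):
        if tag != 'img' or self.cover_url is not None:
            return
        src = dict(attrs).get('src')
        # Same test as the lambda above: src must exist and contain the pattern.
        if src and COVER_PATTERN in src:
            self.cover_url = SITE + src.strip()


# Made-up page fragment; the month/year part of the name changes every issue.
page = '''<img src="/IssueImg/1_MWJ_CurrIss_CoverImg_12_2011.jpg">
<img src="/IssueImg/3_MWJ_CurrIss_CoverImg_12_2011.jpg">'''

finder = CoverFinder()
finder.feed(page)
print(finder.cover_url)
```

If no image matches, cover_url stays None, which mirrors the recipe function returning None when the cover is not found.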
Old 01-08-2012, 01:57 AM   #9
kiavash
By the way, the full script is attached. If it is clean enough, I would recommend adding it to the next calibre release so others can use it as well.

I also added it to my ReadBeam.com account; once it is approved by their admin (hopefully soon), I will get my e-magazine automatically every month.

Thank you all for making it happen.

Edit: See the next few posts below for the latest version.

Last edited by kiavash; 01-14-2012 at 04:29 PM.
Old 01-10-2012, 02:05 AM   #10
kiavash
A few more updates... I am documenting all of these so that somebody else can write a new recipe more easily:

This code removes the hyperlinks as well as the line breaks. You cannot find hyperlinks in a real magazine.

PHP Code:
    preprocess_regexps     = [(re.compile(r'<br[ ]*/>', re.IGNORECASE), lambda m: ''),
                              (re.compile(r'<br[ ]*clear.*/>', re.IGNORECASE), lambda m: ''),
                              (re.compile(r'<a.*?>'), lambda h1: ''),
                              (re.compile(r'</a>'), lambda h2: '')
                              ]
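To see what these four patterns do to a page, they can be run over a small string with plain `re`. This is a sketch in standalone Python 3: apply_regexps is a hypothetical helper that applies each (pattern, function) pair in order, which is roughly how I would expect calibre to use the list, and the sample HTML is made up.

```python
import re

preprocess_regexps = [
    (re.compile(r'<br[ ]*/>', re.IGNORECASE), lambda m: ''),
    (re.compile(r'<br[ ]*clear.*/>', re.IGNORECASE), lambda m: ''),
    (re.compile(r'<a.*?>'), lambda m: ''),   # opening <a ...> tags
    (re.compile(r'</a>'), lambda m: ''),     # closing </a> tags
]


def apply_regexps(html):
    """Run every (pattern, function) pair over the text, in list order."""
    for pattern, func in preprocess_regexps:
        html = pattern.sub(func, html)
    return html


sample = 'See <a href="http://example.com/x">this link</a><br /> for details.<BR/>'
cleaned = apply_regexps(sample)
print(cleaned)

# Caveat: '<a.*?>' also matches other tags that start with "a".
stripped = re.sub(r'<a.*?>', '', '<abbr title="x">RF</abbr>')
print(stripped)
```

The link text survives and only the tags go away. Note the caveat at the end: the opening-tag pattern also eats tags such as `<abbr>`, so a stricter pattern like `<a\b[^>]*>` may be safer if the site uses them.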
This code uses the correct print page, because we have a few RSS sources and each has its own print page.

PHP Code:
    def print_version(self, url):
        print url
        if url.find('/Journal/article.asp?HH_ID=') >= 0:
            return self.browser.open_novisit(url).geturl().replace('/Journal/article.asp?HH_ID=', '/Journal/Print.asp?Id=')
        elif  url.find('/News/article.asp?HH_ID=') >= 0:
            return self.browser.open_novisit(url).geturl().replace('/News/article.asp?HH_ID=', '/Journal/Print.asp?Id=')
        elif  url.find('/Resources/TechLib.asp?HH_ID=') >= 0:
            return self.browser.open_novisit(url).geturl().replace('/Resources/TechLib.asp?HH_ID=', '/Resources/PrintRessource.asp?Id=')
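The URL mapping itself can be checked without a network connection. The sketch below is standalone Python 3 and leaves out the open_novisit(...).geturl() step that resolves redirects first; to_print_url and PRINT_RULES are hypothetical names, and the HH_ID values are made up.

```python
# (article path, print path) pairs taken from the function above.
PRINT_RULES = [
    ('/Journal/article.asp?HH_ID=', '/Journal/Print.asp?Id='),
    ('/News/article.asp?HH_ID=', '/Journal/Print.asp?Id='),
    ('/Resources/TechLib.asp?HH_ID=', '/Resources/PrintRessource.asp?Id='),
]


def to_print_url(url):
    """Rewrite an article URL to its print page, if a rule matches."""
    for article_path, print_path in PRINT_RULES:
        if article_path in url:
            return url.replace(article_path, print_path)
    # Assumption of this sketch: unknown URLs are passed through unchanged.
    return url


journal = to_print_url('http://mwjournal.com/Journal/article.asp?HH_ID=AR_1234')
resource = to_print_url('http://mwjournal.com/Resources/TechLib.asp?HH_ID=RES_56')
other = to_print_url('http://mwjournal.com/Events/')
print(journal)
print(resource)
print(other)
```

Passing unmatched URLs through is a choice made for this sketch; the recipe function above has no final return, so it gives back None for any URL that matches none of the three patterns.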
This is the latest code so far. It is running really well, though there may still be a few hiccups.

Spoiler:
PHP Code:
##
## Title:        Microwave Journal RSS recipe
## Contact:      Kiavash (use Mobile Read)
##
## License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
## Copyright:    Kiavash
##
## Written:      Jan 2012
## Last Edited:  Jan 2012
##

__license__   = 'GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html'
__copyright__   = 'Kiavash'
__author__ = 'Kiavash'

'''
Microwave Journal Monthly Magazine
You need to sign up (free) and get username/password.
'''

import re    # Import the regular expressions module.
from calibre.ptempfile import TemporaryFile # we need this for saving to a temp file

class MWJournal(BasicNewsRecipe):
    # Title to use for the ebook.
    title          = u'Microwave Journal'

    #A brief description for the ebook.
    description = u'Microwave Journal web site ebook created using rss feeds.'

    # Set publisher and publication type.
    publisher = 'Horizon House'
    publication_type = 'magazine'

    oldest_article = 31        # monthly published magazine. Some months are 31 days!
    max_articles_per_feed = 100
    remove_empty_feeds = True
    auto_cleanup = True

    # Disable stylesheets and javascript from site.
    no_stylesheets = True
    remove_javascript = True

    asciiize = True    # Converts all non-ASCII characters to their ASCII equivalents

    needs_subscription = True    # oh yeah... we need to login btw.

    # Timeout for fetching files from the server in seconds. The default of 120 seconds, seems somewhat excessive.
    timeout = 30

    # Specify extra CSS - overrides ALL other CSS (IE. Added last).
    extra_css = 'body { font-family: verdana, helvetica, sans-serif; } \
                 .introduction, .first { font-weight: bold; } \
                 .cross-head { font-weight: bold; font-size: 125%; } \
                 .cap, .caption { display: block; font-size: 80%; font-style: italic; } \
                 .cap, .caption, .caption img, .caption span { display: block; margin: 5px auto; } \
                 .byl, .byd, .byline img, .byline-name, .byline-title, .author-name, .author-position, \
                    .correspondent-portrait img, .byline-lead-in, .name, .bbc-role { display: block; \
                    font-size: 80%; font-style: italic; margin: 1px auto; } \
                 .story-date, .published { font-size: 80%; } \
                 table { width: 100%; } \
                 td img { display: block; margin: 5px auto; } \
                 ul { padding-top: 10px; } \
                 ol { padding-top: 10px; } \
                 li { padding-top: 5px; padding-bottom: 5px; } \
                 h1 { font-size: 175%; font-weight: bold; } \
                 h2 { font-size: 150%; font-weight: bold; } \
                 h3 { font-size: 125%; font-weight: bold; } \
                 h4, h5, h6 { font-size: 100%; font-weight: bold; }'

    
    remove_tags    = [
                        dict(name='div', attrs={'class':'boxadzonearea350'}), # Removes banner ads
                        dict(name='font', attrs={'class':'footer'}),    # remove fonts if you do like your fonts more! Comment out to use website's fonts
                        dict(name='div', attrs={'class':'newsarticlead'})
                     ]

    # Remove various tag attributes to improve the look of the ebook pages.
    remove_attributes = [ 'border', 'cellspacing', 'align', 'cellpadding', 'colspan',
                          'valign', 'vspace', 'hspace', 'alt', 'width', 'height' ]

    # Remove the line breaks as well as href links. Books don't have links generally speaking
    preprocess_regexps     = [(re.compile(r'<br[ ]*/>', re.IGNORECASE), lambda m: ''),
                              (re.compile(r'<br[ ]*clear.*/>', re.IGNORECASE), lambda m: ''),
                              (re.compile(r'<a.*?>'), lambda h1: ''),
                              (re.compile(r'</a>'), lambda h2: '')
                              ]

    # Select the feeds that you are interested.
    feeds          = [
                        (u'Current Issue', u'http://www.mwjournal.com/rss/Rss.asp?type=99'),
                        #(u'Industry News', u'http://mwjournal.com/rss/Rss.asp?type=1'),
                        #(u'Resources', u'http://mwjournal.com/rss/Rss.asp?type=3'),
                        #(u'Buyer\'s Guide', u'http://mwjournal.com/rss/Rss.asp?type=5'),
                        (u'Events', u'http://mwjournal.com/rss/Rss.asp?type=2'),
                        #(u'All Updates', u'http://mwjournal.com/rss/Rss.asp?type=0'),
                    ]

    #  No magazine is complete without cover. Let's get it then!
    # The function is adapted from the Economist recipe
    def get_cover_url(self):
        cover_url = None
        cover_page_location = 'http://www.mwjournal.com/Journal/'    # Cover image is located on this page
        soup = self.index_to_soup(cover_page_location)
        cover_item = soup.find('img',attrs={'src':lambda x: x and '/IssueImg/3_MWJ_CurrIss_CoverImg' in x})    # There are three files named cover, we want the highest resolution which is the 3rd image. So we look for the pattern. Remember that the name of the cover image changes every month so we cannot search for the complete name. Instead we are searching for the pattern
        if cover_item:
            cover_url = 'http://www.mwjournal.com' + cover_item['src'].strip()    # yeah! we found it. Let's fetch the image file and pass it as cover to calibre
        return cover_url

    def print_version(self, url):
        print url
        if url.find('/Journal/article.asp?HH_ID=') >= 0:
            return self.browser.open_novisit(url).geturl().replace('/Journal/article.asp?HH_ID=', '/Journal/Print.asp?Id=')
        elif  url.find('/News/article.asp?HH_ID=') >= 0:
            return self.browser.open_novisit(url).geturl().replace('/News/article.asp?HH_ID=', '/Journal/Print.asp?Id=')
        elif  url.find('/Resources/TechLib.asp?HH_ID=') >= 0:
            return self.browser.open_novisit(url).geturl().replace('/Resources/TechLib.asp?HH_ID=', '/Resources/PrintRessource.asp?Id=')

    def get_browser(self):
        '''
        The Microwave Journal website sends the login to omeda.com. Once the login info is submitted, omeda.com redirects back to mwjournal.com, where the browser logs in again (hidden from the user). To overcome this obstacle, the first login form is submitted and its response is stored in an HTML file. Then the HTML file is opened and the second login form is submitted (many thanks to Barty, who helped with the second login page).
        '''
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            url = ('http://www.omeda.com/cgi-win/mwjreg.cgi?m=login')  #  main login page.
            br.open(url)    # fetch the 1st login page
            br.select_form('login')        # finds the login form
            br['EMAIL_ADDRESS']   = self.username    # fills the username
            br['PASSWORD'] = self.password        # fills the password
            raw = br.submit().read()        # submit the form and read the 2nd login form
            # save it to an htm temp file (from ESPN recipe written by  Kovid Goyal kovid@kovidgoyal.net
            with TemporaryFile(suffix='.htm') as fname:
                with open(fname, 'wb') as f:
                    f.write(raw)
                br.open_local_file(fname)
            br.select_form(nr=0)    # finds submit on the 2nd form
            didwelogin = br.submit().read()        # submit it and read the return html
            if 'Welcome ' not in didwelogin:    # did it login successfully? Is Username/password correct?
                raise Exception('Failed to login, are you sure your username and password are correct?')
            #login is done
        return br
Old 01-12-2012, 06:08 PM   #11
kiavash
Maybe I am spending too much time on this recipe, but I have been reading Microwave Journal for a long time, and as my eyesight gets worse I want to be able to keep reading it on my Nook (thanks to its bigger fonts).

Here are a few more tweaks. I posted the latest version here and on my Read Beam account. This time all the tabs are changed to spaces, to be consistent with the rest of calibre's code.

How can I check this into calibre's build without needing to recompile the whole thing?
Spoiler:
PHP Code:
##
## Title:        Microwave Journal RSS recipe
## Contact:      Kiavash (use Mobile Read)
##
## License:      GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html
## Copyright:    Kiavash
##
## Written:      Jan 2012
## Last Edited:  Jan 2012
##

__license__   = 'GNU General Public License v3 - http://www.gnu.org/copyleft/gpl.html'
__copyright__   = 'Kiavash'
__author__ = 'Kiavash'

'''
Microwave Journal Monthly Magazine
You need to sign up (free) and get username/password.
'''

import re    # Import the regular expressions module.
from calibre.ptempfile import TemporaryFile # we need this for saving to a temp file

class MWJournal(BasicNewsRecipe):
    # Title to use for the ebook.
    title          = u'Microwave Journal'

    #A brief description for the ebook.
    description = u'Microwave Journal web site ebook created using rss feeds.'

    # Set publisher and publication type.
    publisher = 'Horizon House'
    publication_type = 'magazine'

    oldest_article = 31        # monthly published magazine. Some months are 31 days!
    max_articles_per_feed = 100
    remove_empty_feeds = True
    auto_cleanup = True

    # Disable stylesheets and javascript from site.
    no_stylesheets = True
    remove_javascript = True

    asciiize = True    # Converts all non-ASCII characters to their ASCII equivalents

    needs_subscription = True    # oh yeah... we need to login btw.

    # Timeout for fetching files from the server in seconds. The default of 120 seconds, seems somewhat excessive.
    timeout = 30

    # Specify extra CSS - overrides ALL other CSS (IE. Added last).
    extra_css = 'body { font-family: verdana, helvetica, sans-serif; } \
                 .introduction, .first { font-weight: bold; } \
                 .cross-head { font-weight: bold; font-size: 125%; } \
                 .cap, .caption { display: block; font-size: 80%; font-style: italic; } \
                 .cap, .caption, .caption img, .caption span { display: block; margin: 5px auto; } \
                 .byl, .byd, .byline img, .byline-name, .byline-title, .author-name, .author-position, \
                    .correspondent-portrait img, .byline-lead-in, .name, .bbc-role { display: block; \
                    font-size: 80%; font-style: italic; margin: 1px auto; } \
                 .story-date, .published { font-size: 80%; } \
                 table { width: 100%; } \
                 td img { display: block; margin: 5px auto; } \
                 ul { padding-top: 10px; } \
                 ol { padding-top: 10px; } \
                 li { padding-top: 5px; padding-bottom: 5px; } \
                 h1 { font-size: 175%; font-weight: bold; } \
                 h2 { font-size: 150%; font-weight: bold; } \
                 h3 { font-size: 125%; font-weight: bold; } \
                 h4, h5, h6 { font-size: 100%; font-weight: bold; }'

    
remove_tags    = [
                        
dict(name='div'attrs={'class':'boxadzonearea350'}), # Removes banner ads
                        
dict(name='font'attrs={'class':'footer'}),    # remove fonts if you do like your fonts more! Comment out to use website's fonts
                        
dict(name='div'attrs={'class':'newsarticlead'})
                     ]
                     
    
# Remove various tag attributes to improve the look of the ebook pages.
    
remove_attributes = [ 'border''cellspacing''align''cellpadding''colspan',
                          
'valign''vspace''hspace''alt''width''height' ]

    
# Remove the line breaks as well as href links. Books don't have links generally speaking
    
preprocess_regexps     = [(re.compile(r'<br[ ]*/>'re.IGNORECASE), lambda m''),
                              (
re.compile(r'<br[ ]*clear.*/>'re.IGNORECASE), lambda m''),
                              (
re.compile(r'<a.*?>'), lambda h1''),
                              (
re.compile(r'</a>'), lambda h2'')
                              ]
    
    
# Select the feeds that you are interested.
    
feeds          = [
                        (
u'Current Issue'u'http://www.mwjournal.com/rss/Rss.asp?type=99'),
                        (
u'Industry News'u'http://www.mwjournal.com/rss/Rss.asp?type=1'),
                        (
u'Resources'u'http://www.mwjournal.com/rss/Rss.asp?type=3'),
                        (
u'Buyer\'s Guide'u'http://www.mwjournal.com/rss/Rss.asp?type=5'),
                        (
u'Events'u'http://www.mwjournal.com/rss/Rss.asp?type=2'),
                        (
u'All Updates'u'http://www.mwjournal.com/rss/Rss.asp?type=0'),
                    ]

    
#  No magazine is complete without cover. Let's get it then!
    # The function is adapted from the Economist recipe
    
def get_cover_url(self):
        
cover_url None
        cover_page_location 
'http://www.mwjournal.com/Journal/'    # Cover image is located on this page
        
soup self.index_to_soup(cover_page_location)    
        
cover_item soup.find('img',attrs={'src':lambda xand '/IssueImg/3_MWJ_CurrIss_CoverImg' in x})    # There are three files named cover, we want the highest resolution which is the 3rd image. So we look for the pattern. Remember that the name of the cover image changes every month so we cannot search for the complete name. Instead we are searching for the pattern
        
if cover_item:
            
cover_url 'http://www.mwjournal.com' cover_item['src'].strip()    # yeah! we found it. Let's fetch the image file and pass it as cover to calibre
        
return cover_url

    def print_version
(selfurl):
        if 
url.find('/Journal/article.asp?HH_ID=') >= 0:
            return 
self.browser.open_novisit(url).geturl().replace('/Journal/article.asp?HH_ID=''/Journal/Print.asp?Id=')
        
elif  url.find('/News/article.asp?HH_ID=') >= 0:
            return 
self.browser.open_novisit(url).geturl().replace('/News/article.asp?HH_ID=''/Journal/Print.asp?Id=')
        
elif  url.find('/Resources/TechLib.asp?HH_ID=') >= 0:
            return 
self.browser.open_novisit(url).geturl().replace('/Resources/TechLib.asp?HH_ID=''/Resources/PrintRessource.asp?Id=')

    
def get_browser(self):
        
'''
        Microwave Journal website, directs the login page to omeda.com once login info is submitted, omeda.com redirects to mwjournal.com with again the browser logs in into that site (hidden from the user). To overcome this obsticle, first login page is fetch and its output is stored to an HTML file. Then the HTML file is opened again and second login form is submitted (Many thanks to Barty which helped with second page login).
        '''
        
br BasicNewsRecipe.get_browser() 
        if 
self.username is not None and self.password is not None:
            
url = ('http://www.omeda.com/cgi-win/mwjreg.cgi?m=login'#  main login page.
            
br.open(url)    # fetch the 1st login page
            
br.select_form('login')        # finds the login form
            
br['EMAIL_ADDRESS']   = self.username    # fills the username
            
br['PASSWORD'] = self.password        # fills the password
            
raw br.submit().read()        # submit the form and read the 2nd login form
            # save it to an htm temp file (from ESPN recipe written by  Kovid Goyal kovid@kovidgoyal.net
            
with TemporaryFile(suffix='.htm') as fname:
                
with open(fname'wb') as f:
                    
f.write(raw)
                
br.open_local_file(fname)
            
br.select_form(nr=0)    # finds submit on the 2nd form
            
didwelogin br.submit().read()        # submit it and read the return html
            
if 'Welcome ' not in didwelogin:    # did it login successfully? Is Username/password correct?
                
raise Exception('Failed to login, are you sure your username and password are correct?')
            
#login is done
        
return br 
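
For anyone who wants to adapt the print_version part, the URL rewriting can be tried outside Calibre. What follows is only a minimal standalone sketch, not part of the recipe: the names PRINT_URL_MAP and to_print_url are made up for illustration, the article id in the example is invented, and unlike the recipe it skips the open_novisit(url).geturl() redirect step and works on the URL string alone.

```python
# Hypothetical helper (not in the recipe): pairs of
# (article URL fragment, print-page URL fragment), mirroring the
# three branches of print_version above.
PRINT_URL_MAP = [
    ('/Journal/article.asp?HH_ID=', '/Journal/Print.asp?Id='),
    ('/News/article.asp?HH_ID=', '/Journal/Print.asp?Id='),
    ('/Resources/TechLib.asp?HH_ID=', '/Resources/PrintRessource.asp?Id='),
]


def to_print_url(url):
    """Return the print-friendly URL, or None if no pattern matches."""
    for article_part, print_part in PRINT_URL_MAP:
        if article_part in url:
            return url.replace(article_part, print_part)
    return None  # like the recipe, unmatched URLs yield None


if __name__ == '__main__':
    # The article id below is invented for the example.
    print(to_print_url('http://www.mwjournal.com/Journal/article.asp?HH_ID=AR_1234'))
```

Keeping the patterns in one list would mean a new section of the site only needs one more pair instead of another elif branch.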
Old 01-14-2012, 04:28 PM   #12
kiavash
It looks like I need to make a zip file so it can be included in Calibre, so here it is, attached. This is the latest and most up-to-date version.

Last edited by kiavash; 02-02-2012 at 12:45 AM. Reason: It doesn't work with the new site
Old 01-14-2012, 09:52 PM   #13
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,771
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Your recipe was included in 0.8.35 already. http://bazaar.launchpad.net/~kovid/c...journal.recipe
Old 01-14-2012, 11:03 PM   #14
kiavash
Cool. Thanks.

I am going to check the recipe every month and update the script here if needed.
Old 02-02-2012, 12:43 AM   #15
kiavash
Needs update!

When Read Beam sent me the e-magazine this month, I noticed that it didn't look right. So I checked the site, and apparently Microwave Journal has changed almost everything (removing the RSS feeds is one of the changes). Stay tuned, as I will update the recipe in the next few days to adapt to the latest site changes!