Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes

Old 08-25-2010, 01:52 PM   #2521
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by naisren View Post
Thanks for your help and sorry for my confusing expression.
That's OK. I looked at your site and ran your recipe (thank you for using code tags - you may also want to add spoiler tags to reduce the length).

I now understand your problem. The site has bad html. The page you are trying to parse to get feeds is seen as a giant NavigableString inside a single tag. There are no other tags within it as far as BeautifulSoup is concerned. I don't know exactly why, but I suspect it isn't solely due to the fact that it is using the " />" format to immediately close div tags, then trying to close them again with the normal </div>, so there are two closings (another bit of bad html.)

Whatever is going on is confusing Beautiful Soup to the extent that it can't find anything except the first surrounding tag. It should still be possible to extract feeds, but it will require much trickier programming to get links out of the giant string which is soup.contents[0].string. You will need to treat it as a string, then extract from the string, instead of trying to find tags within that string (although you may be able to use BS to convert it into a tag-based structure with some trickery).
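
For anyone who wants to try the string route, here is a rough sketch of pulling links straight out of raw text with a regular expression. It is plain Python with no calibre imports; the function name `extract_links` and the sample markup are made up for illustration, and a regex like this is only a stopgap for markup that a parser cannot handle.

```python
import re

# Matches the href value of an <a ...> opening tag, quoted with ' or ".
HREF_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(raw_html, must_contain=()):
    """Return href values found in raw_html, optionally filtered by substring."""
    links = HREF_RE.findall(raw_html)
    if must_contain:
        links = [link for link in links
                 if any(part in link for part in must_contain)]
    return links


# Hypothetical sample imitating the doubly closed tags described above.
sample = ('<div id="list" /><a href="/VOA_Special_English/a.html" />One</a>'
          '<a href="/about.html" />About</a></div>')
print(extract_links(sample, must_contain=('/VOA_Special_English/',)))
```

In a recipe, the raw text would presumably come from soup.contents[0].string, and each extracted href would still need the site prefix added before use.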

It's an interesting problem, and I regret that I don't have time now to attack it. If you solve it, post your solution.
Old 08-25-2010, 04:08 PM   #2522
stewie1
Junior Member
Posts: 4
Karma: 10
Join Date: Aug 2010
Device: Kindle
The Financial Times recipe that's currently posted isn't the complete print edition. I'm a subscriber and trying to put something together that will allow me to get the day's print edition (http://www.ft.com/us-edition). Unfortunately, there is no RSS feed for this.

Can anyone help, either by putting a recipe together or by directing me to a template I could use to give it a shot myself? (Note that I am a complete novice at this.)

Thanks.
Old 08-25-2010, 10:36 PM   #2523
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Starson17, I thought I was getting the knack of this, but apparently not. If it is an indent problem, then hit me. I don't think it is, though, and I can't understand why 'print' is not being appended to the URL. I looked at the New Yorker recipe as an example, and for the most part I use the same print_version().

my code
Spoiler:
Code:
class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'The TMZ'
    __author__ = 'TonytheBookworm'
    description = 'Celeb Gossip and News'
    publisher = 'The TMZ'
    category = 'news, celebrity, USA'
    oldest_article = 1
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    
    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''

    feeds          = [
                       ('TOP 20', 'http://www.tmz.com/rss.xml'),
                       ('Exclusives', 'http://www.tmz.com/category/exclusives/rss.xml')
                    ]

    def print_version(self, url):
       print_url = url +'print'
       print 'print_url is: ', print_url
       return print_url


I am also using the test you gave me:
ebook-convert tmz.recipe output_dir --test -vv > myrecipe.txt

I don't really see any errors in there, and I also don't see the "print_url is: ..." line either.
Sorry to bug you. I will get it one of these days.
Old 08-26-2010, 03:33 AM   #2524
kerrware
Junior Member
Posts: 7
Karma: 10
Join Date: Jun 2010
Device: none
Quote:
Originally Posted by Starson17 View Post
If you are seeing the article content stored locally (when running ebook-convert), and you can click through from the initial index.html to the index.html files in the folders to see that content, then I see no reason why you should have problems converting the html structure, with article content, to an EPUB. Where is the problem occurring? I'd check it for you, but have no username/password for the site.
Thanks for the response. I tried the "Add Book" function in Calibre using the debug output from ebook-convert, then converted it to an EPUB book and got the same result: no article content. I guess this means that the EPUB conversion process can't handle the HTML from the site for some reason.

The only thing I can think of is to introduce some "remove_tags" type code to simplify the HTML so it can be converted. This could take some time (I'm not that familiar with HTML or Python code). Any suggestions as to what I can and can't remove?

Thanks.
Attached Files
File Type: zip ebook-convert output files.zip (312.9 KB, 278 views)
Old 08-26-2010, 07:51 AM   #2525
Starson17
Quote:
Originally Posted by kerrware View Post
I tried the "Add Book" function in Calibre using the debug output from ebook-convert and then converted it to an epub book and got the same result - no article content. I guess this means that the epub conversion process can't handle the html from the site for some reason.
Send me your username/password by PM and I'll take a look.
Old 08-26-2010, 07:58 AM   #2526
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
Starson17, I thought I was getting the knack of this, but apparently not. If it is an indent problem, then hit me. I don't think it is, though, and I can't understand why 'print' is not being appended to the URL. I looked at the New Yorker recipe as an example, and for the most part I use the same print_version().

I am also using the test you gave me:
ebook-convert tmz.recipe output_dir --test -vv > myrecipe.txt

I don't really see any errors in there, and I also don't see the "print_url is: ..." line either.
Sorry to bug you. I will get it one of these days.
I ran your code. It puts these in the output file:
Code:
print_url is:  http://www.tmz.com/2010/08/18/tmz-seinfeld-uncle-leo-crank-call-police-burbank-kim-kardashian-facebook/print
print_url is:  http://www.tmz.com/2010/08/25/mel-gibson-extortion-case-oksana-grigorieva-domestic-violence-sheriff-district-attorney-investigation/print
AFAICT, it's working perfectly. I even tested one of those links, and it works, too. What do you think is busted?
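
One thing that may be worth guarding against: url + 'print' only gives a valid address because the TMZ feed links end in a slash. A slightly more defensive variant, as a sketch only (the helper name `add_print_suffix` is made up, and it assumes the URL carries no query string):

```python
def add_print_suffix(url, suffix='print'):
    """Append suffix as a final path segment, with exactly one slash before it."""
    # rstrip removes any trailing slashes so we never end up with '//print'
    # and never glue 'print' directly onto the last path segment.
    return url.rstrip('/') + '/' + suffix
```

In the recipe, print_version() could then simply return add_print_suffix(url).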

Last edited by Starson17; 08-26-2010 at 11:57 AM.
Old 08-26-2010, 09:11 AM   #2527
cisaak
Member
Posts: 17
Karma: 10
Join Date: Aug 2010
Device: Kindle DX
Quote:
Originally Posted by Starson17 View Post
Not without a Kindle (anyone want to send me one? ) as I'm not sure where in the recipe the Kindle is picking up the masthead.

However, the masthead is only used in a few places in an EPUB. Open the EPUB, find the masthead and change the css file to modify its properties, then convert the EPUB to whatever format Kindle uses and see if that fixes it. If so, modify the extra_css in your recipe to make the same change.

If you have a problem understanding this, take it a step at a time, and let me know which step you have trouble with.
I used calibre to convert my recipe to EPUB form. I do not know how to "open" the EPUB and look for the masthead. I checked the Debug section of the calibre conversion and found the css in the input directory. The term "masthead" is not mentioned. What is my next step?
Old 08-26-2010, 10:15 AM   #2528
Starson17
Quote:
Originally Posted by naisren View Post
My recipe is
Spoiler:
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class VOA(BasicNewsRecipe):

    title      = 'VOA News'
    __author__ = 'voa'
    description = 'VOA through 51'
    language = 'en'
    remove_javascript = True

    remove_tags_before = dict(id=['rightContainer'])
    remove_tags_after  = dict(id=['listads'])
    remove_tags        = [
                          dict(id=['contentAds']), dict(id=['playbar']), dict(id=['menubar']), 
                         ]    
    no_stylesheets = True
    extra_css = '''
                '''


    def parse_index(self):
        soup = self.index_to_soup('http://www.51voa.com/')
        feeds = []
        section = []
        title = None

        # for x in soup.find(id='list').findAll('a'):
        for x in soup.find(id='rightContainer').findAll('a'):
                if '/VOA_Special_English/' in x['href'] or '/VOA_Standard_English/' in x['href']:
                    article = {
                            'url' : 'http://www.51voa.com/' + x['href'],
                            'title' : self.tag_to_string(x),
                            'date': '',
                            'description': '',
                        }
                    section.append(article)

        feeds.append(('Newest', section))

        return feeds
OK, your problem was so interesting, I couldn't resist looking at it further. Your problem is the bad html code in your source page http://www.51voa.com/. The closing angle bracket of each opening tag is ' />', instead of ' >'. That results in each tag being closed twice. Beautiful Soup is confused and sees the entire page as a single [document] element having a single NavigableString of text, not as multiple tags, so none of the tag-based searches or manipulation commands will work. There are no tags for BeautifulSoup to find or work with.

To fix this, you first grab that page (as you have already done in your code):

Code:
        soup = self.index_to_soup('http://www.51voa.com/')
Then, grab the string that is in the contents of the big single [document] element and search and replace the bad closing brackets as follows:
Code:
        rawc = soup.contents[0].string.replace(' />', ' >')
Now it's fixed, but it's still text, so you next convert the string back into a BeautifulSoup object:
Code:
        soup = BeautifulSoup(rawc, fromEncoding=self.encoding)
(Also, add
Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup
to your recipe)

That's it! Put the two extra lines above after your first index_to_soup line. Be aware that any legitimate single element tags, such as <img>, <br> etc. will get mangled with the simple search and replace above. You may have to special case any tags that are allowed to have a closing slash inside the opening tag so they don't get mangled.

Edit: I forgot, you also need this line:
encoding = 'utf-8'
or else the final step will fail.
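
For the special-casing mentioned above, a regular expression can do the replacement while leaving the legitimate self-closing tags alone. This is a sketch in plain Python that has not been run against the site: the name `fix_self_closed` and the list of tags to skip are assumptions, so adjust them to what the page actually uses.

```python
import re

# Tags that may legitimately be written as <tag ... />; assumed list, extend as needed.
VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}

# An opening tag that ends in '/>': group 1 is the tag name, group 2 its attributes.
SELF_CLOSED = re.compile(r'<([A-Za-z][A-Za-z0-9]*)\b([^<>]*?)\s*/>')


def fix_self_closed(raw):
    """Turn '<div ... />' into '<div ...>' but leave void tags such as <br /> alone."""
    def repl(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_TAGS:
            return match.group(0)  # keep legitimate self-closing tags unchanged
        return '<%s%s>' % (name, attrs)
    return SELF_CLOSED.sub(repl, raw)
```

In the recipe it would take the place of the simple replace call, along the lines of rawc = fix_self_closed(soup.contents[0].string).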

Last edited by Starson17; 08-26-2010 at 04:08 PM.
Old 08-26-2010, 10:21 AM   #2529
Starson17
Quote:
Originally Posted by cisaak View Post
I used calibre to convert my recipe to EPUB form. I do not know how to "open" the EPUB and look for the masthead.
Change the file extension from .epub to .zip and unpack it into a directory. You will find the CSS file there and will be able to open the unpacked book with your browser.
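
The same thing can be done from Python without renaming anything, since an EPUB is an ordinary zip archive. A small sketch using only the standard library; the function names are placeholders:

```python
import zipfile


def find_stylesheets(epub_path):
    """List the CSS files stored inside an EPUB (which is a zip archive)."""
    with zipfile.ZipFile(epub_path) as book:
        return sorted(name for name in book.namelist()
                      if name.lower().endswith('.css'))


def unpack_epub(epub_path, dest_dir):
    """Extract every file in the EPUB into dest_dir for inspection."""
    with zipfile.ZipFile(epub_path) as book:
        book.extractall(dest_dir)
```

Calling find_stylesheets() on the converted book shows which CSS files to look at, and unpack_epub() puts the HTML where a browser can open it.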
Old 08-26-2010, 11:38 AM   #2530
Starson17
Quote:
Originally Posted by kerrware View Post
The only thing I can think of is to try and introduce some "remove_tags" type code to try and simplify the html so it can be converted. This could take some time (not that familiar with html or python code). Any suggestions as to what I can and can't remove?
Try this:
Code:
    keep_only_tags = [dict(name='div', attrs={'id':['ds-headline','viewarticle']})]
You may find some other items you want to keep (use Firefox/Firebug to find them), but you're right, there's something in there that's messing up the conversion.

Last edited by Starson17; 08-26-2010 at 11:57 AM.
Old 08-26-2010, 01:13 PM   #2531
TonytheBookworm
Quote:

AFAICT, it's working perfectly. I even tested one of those links, and it works, too. What do you think is busted?
Notice in the attached screenshot that up top I have "Home: Michael Bay....". I know I can remove the tags to get rid of that. Down at the bottom I also have "tags: Michael Bay ..." etc.

The reason I suspected that it is not hitting the print_url is that when I navigate to the page manually, I do not see any of those tags at all (that is, when I look at the original HTML of the /print version). However, I do notice the tags if I do not append /print. Any thoughts?


Notice that when you go to http://www.tmz.com/2010/08/24/michae...r-pistol-whip/
you get the div tag of <breadcrumbs> that I wish to remove.

But when you go to http://www.tmz.com/2010/08/24/michae...tol-whip/print
you do not get the div tag of <breadcrumbs>. Maybe I've got an indentation problem going on here that just isn't obvious to me, but for sure something is going on.



I'm a dumb_____. For whatever reason, the recipe that I posted to you is the correct version. Geany didn't save it, yet I thought it had. I went and opened it back up after rebooting my computer today and noticed I still had the old version, haha. Live and learn. Sorry for the trouble.
Attached Thumbnails
Name: tmztest.jpg (144.6 KB)


Last edited by TonytheBookworm; 08-26-2010 at 02:19 PM. Reason: added more info
Old 08-26-2010, 02:21 PM   #2532
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I'm a dumb_____. For whatever reason, the recipe that I posted to you is the correct version. Geany didn't save it, yet I thought it had. I went and opened it back up after rebooting my computer today and noticed I still had the old version, haha. Live and learn. Sorry for the trouble.
No sweat - I've done the same thing a few times myself.
Old 08-26-2010, 03:05 PM   #2533
TonytheBookworm
Working copy of TMZ

Here is the TMZ recipe.
Attached Files
File Type: rar tmz.rar (959 Bytes, 260 views)
Old 08-26-2010, 05:45 PM   #2534
Ebookerr
Junior Member
Posts: 9
Karma: 18
Join Date: Sep 2008
Device: UTstarcom
New England Journal of Medicine Recipe

The New England Journal of Medicine changed its format, and the recipe in the current Calibre version (0.7.15) no longer works.

Ebookerr
Old 08-26-2010, 09:34 PM   #2535
TonytheBookworm
Starson, I know I asked before, but I'm still not clear on how to do it.
If I have, for instance, the following link (example only):
http://www.testpage.com/doi/abs/10.1...495?ai=rv&af=R
but I need it to be the following:
http://www.testpage.com/doi/full/10....viewType=Print

How would I do this?
I figure it would be some form of search and replace like I have seen in Beautiful Soup, but again I'm kind of clueless (any examples of this in action?).

The abs needs to be replaced with full, and then the ?view part simply appended, but I'm not sure how this is done. Thanks again. I hope to learn something new.

Maybe something like this?
Spoiler:
Code:
def preprocess_html(soup)
     for abs in soup.findall('abs)
     newtext ='full'
     abs.replaceWith(newtext)

the above is just a non-working guess
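
Since the link is just text, plain string handling is probably enough and BeautifulSoup may not be needed at all. A rough sketch, with the assumptions flagged: the helper name `to_print_url` is made up, the /doi/abs/ and /doi/full/ path pieces come from the example links above, and '?viewType=Print' is a guess at the elided query string.

```python
def to_print_url(url, suffix='?viewType=Print'):
    """Rewrite an abstract link into a full-text print link.

    The default suffix is an assumption based on the example URL; the real
    query string may differ.
    """
    base = url.split('?', 1)[0]  # drop any existing query string
    # Swap only the first '/doi/abs/' so the rest of the path is untouched.
    return base.replace('/doi/abs/', '/doi/full/', 1) + suffix
```

Inside a recipe this could be called from print_version(self, url), which calibre calls with each article URL.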

Last edited by TonytheBookworm; 08-26-2010 at 09:48 PM.