MobileRead Forums > E-Book Software > Calibre > Recipes
09-28-2010, 03:57 PM   #1
marbs
maya recipe

This one is hard, and it is Hebrew only.

I want to do this step by step so I understand what I am doing. I want to create a recipe for this page and similar pages.

If you go to the page, you will see a list of articles on the right-hand side. The actual link to the article is the second link in each pair. I have noticed that all the relevant links (and only those) have an id="SubjectHref*" (the * represents some numbers).

The URLs I want to get in stage one are 'http://maya.tase.co.il/' + [the href from the <a> tag with id SubjectHref*]. I then need to do the same on the next page (see the bottom of the page).

This is the code I have so far, and I am a little lost now. It is built on the NZ Herald recipe. Can someone tell me if this is the right way?


Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
import re

class NewZealandHerald(BasicNewsRecipe):

    title       = 'maya recipe'
    __author__  = 'marbs'
    description = 'Daily news'
    timefmt     = ' [%d %b, %Y]'
    language    = 'he'  # calibre expects an ISO language code, not '_Hebrew'

    #TO GET ARTICLES IN SECTION
    def maya_parse_section(self, url):
        soup = self.index_to_soup(url)
        current_articles = []
        # 'SubjectHref'+'*' is not a wildcard -- match ids that start with
        # SubjectHref using a regular expression instead
        for a in soup.findAll('a', attrs={'id': re.compile('^SubjectHref')}):
            if len(current_articles) >= 30:
                break
            title = self.tag_to_string(a)
            url = a.get('href', False)
            if not url or not title:
                continue
            if url.startswith('/'):
                url = 'http://maya.tase.co.il' + url
            self.log('\t\tFound article:', title)
            self.log('\t\t\t', url)
            current_articles.append({'title': title, 'url': url,
                'description': '', 'date': ''})
        return current_articles

    # To GET SECTIONS
    def parse_index(self):
        feeds = []
        for title, url in [
                ('example feed', 'http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press='),
                ]:
            articles = self.maya_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds
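For the id="SubjectHref*" matching specifically, the wildcard idea can be sketched with a plain regular expression. The markup below is an invented stand-in for the listing page, just to show the pattern:

```python
import re

# invented sample of the listing markup: each article has a pair of links,
# and only the real one carries an id of the form SubjectHref<digits>
html = '''
<a id="TitleHref123" href="javascript:OpenPopUp(123)">popup</a>
<a id="SubjectHref123" href="bursa/report.asp?report_cd=570152">full report</a>
<a id="SubjectHref124" href="bursa/report.asp?report_cd=570153">another report</a>
'''

BASE = 'http://maya.tase.co.il/'

# 'SubjectHref*' is a pattern, not a literal string, so match it with a regex
urls = [BASE + href for _id, href in
        re.findall(r'<a id="(SubjectHref\d+)" href="([^"]+)"', html)]
print(urls)
```

The same `re.compile('^SubjectHref')` idea works as the `id` attribute value in BeautifulSoup's findAll.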
09-28-2010, 04:23 PM   #2
TonytheBookworm
On first look at that thing, why not do something along these lines? You said you wanted the second link,
for example: http://maya.tase.co.il/bursa/report....port_cd=570152

It always has report_cd in it, so why not just follow it with a regex match?
Spoiler:

Code:
    from calibre.ptempfile import PersistentTemporaryFile

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        # '?' is a regex metacharacter, so match on 'report_cd' alone
        response = br.follow_link(url_regex='report_cd', nr=0)
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name
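A quick illustration of the regex idea: filtering candidate hrefs down to the ones that carry report_cd (the sample links here are invented):

```python
import re

# invented examples of hrefs found on the search-results page
candidates = [
    'javascript:OpenPopUp(123)',
    'bursa/report.asp?report_cd=570152',
    'index.asp?view=search',
]

# keep only the links that look like report pages
reports = [u for u in candidates if re.search(r'report_cd', u)]
print(reports)
```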


or maybe use something like this:
Spoiler:

Code:
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            
            if not re.search('javascript', item['href']):
              print 'FOUND GOOD URL'
              url = self.INDEX + item['href']
              print 'url is: ', url
              title       = self.tag_to_string(item)
              print 'title is: ', title
            current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

Last edited by TonytheBookworm; 09-28-2010 at 05:25 PM. Reason: edited code
09-30-2010, 03:56 PM   #3
marbs
Hey Tony!

The reason I didn't use a regex to follow the link is that I haven't wrapped my head around the concept yet. I tried running your code.
When I used the first one, I got raw HTML from the feed page.
When I used the second, I got "NameError: global name 're' is not defined".
I'll have to read it a bit more (after a good night's sleep).

I am going to work on it some more...
09-30-2010, 04:09 PM   #4
TonytheBookworm
Quote:
Originally Posted by marbs
The reason I didn't use a regex to follow the link is that I haven't wrapped my head around the concept yet. I tried running your code.
When I used the first one, I got raw HTML from the feed page.
When I used the second, I got "NameError: global name 're' is not defined".
I'll have to read it a bit more (after a good night's sleep).

I am going to work on it some more...
You have to import re.

The second set of code works; I tested it.
It is up to you to clean it up, keep what you want, and get rid of what you don't.
But as far as getting the link you wanted, here is what I did.
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              print 'FOUND GOOD URL'
              url         = self.INDEX + item['href']
              print 'url is: ', url
              title       = self.tag_to_string(item)
              print 'title is: ', title
            current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

Last edited by TonytheBookworm; 09-30-2010 at 04:14 PM. Reason: added code
10-01-2010, 03:12 AM   #5
marbs
I went over it again,

and you are right, it works.
So I wanted to take it to the next step: on the URLs you found, there is the clean version of the reports I am trying to get.
It is the "src" attr from the iframe tag (in some cases; I want to do this step by step).
So I added a sub-function and gave it all the information it needs to do what you did.
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'

    def make_links1(self, url, title, description, date):
        title = 'Temp1'
        current_articles1 = []
        soup = self.index_to_soup(url)
        for item in soup.findAll('iframe'):
             print 'FOUND GOOD URL'
             url         =  item['src']
             print 'url is: ', url

        current_articles1.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this           
        return current_articles1
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
               title       = self.tag_to_string(item)
               print 'title is: ', title
               current_articles=make_links1(url, title, description, date)
  
        return current_articles

When I run it, I get "NameError: global name 'make_links1' is not defined".
It looks right to me; I have no idea what I did wrong.
10-01-2010, 03:14 PM   #6
TonytheBookworm
make_links is a built-in function, so of course make_links1 is not valid.

In other words, put the new stuff in with the old stuff; there is no point in reinventing the wheel.

I don't have time to debug your code, but basically: do a for loop to find some stuff and do whatever you need with it, then do another for loop to find other stuff and append it to the article list like I showed you. You can even rename title to temp1 after the first for loop if you like.
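For what it's worth, the NameError in question is about method lookup: a helper defined on the recipe class has to be called through self, or Python looks it up as a global. A minimal sketch with made-up names:

```python
class Recipe(object):
    def make_links(self, url):
        # a bare make_links1(url) here raises NameError, because the name is
        # looked up as a global; qualify the call with self instead
        return self.make_links1(url)

    def make_links1(self, url):
        # trivial stand-in helper that wraps the url in an article dict
        return [{'title': 'Temp1', 'url': url, 'description': '', 'date': ''}]

articles = Recipe().make_links('http://example.com/report')
print(articles[0]['url'])
```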

Last edited by TonytheBookworm; 10-01-2010 at 03:24 PM.
10-02-2010, 01:27 PM   #7
marbs
I see.

I changed it to fit. My second call of soup is not opening the URL (temp2) and souping the HTML file that the URL leads to; it is just souping the URL itself. What am I doing wrong?
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + item['href']
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
              print 'the new soup is', temp2
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 print 'FOUND GOOD URL'
                 url = item1['src']
                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles
10-07-2010, 04:19 PM   #8
marbs
Is it possible to soup twice?

Can I soup the URL that I found in the first soup?
10-07-2010, 09:42 PM   #9
Starson17
Quote:
Originally Posted by marbs
Can I soup the URL that I found in the first soup?
Yes..
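A sketch of souping twice: the second fetch uses a URL found in the first soup. The pages and markup below are invented in-memory stand-ins for what self.index_to_soup would fetch in a real recipe:

```python
import re

INDEX = 'http://maya.tase.co.il/'

# invented stand-ins for the pages that index_to_soup would fetch
pages = {
    INDEX + 'bursa/index.asp':
        '<a class="A3" href="bursa/report.asp?report_cd=1">r1</a>',
    INDEX + 'bursa/report.asp?report_cd=1':
        '<iframe src="reports/1.htm"></iframe>',
}

def fetch(url):
    # stand-in for self.index_to_soup(url)
    return pages[url]

srcs = []
first_soup = fetch(INDEX + 'bursa/index.asp')          # soup #1: the listing
for href in re.findall(r'href="([^"]+)"', first_soup):
    second_soup = fetch(INDEX + href)                  # soup #2: the report page
    m = re.search(r'<iframe src="([^"]+)"', second_soup)
    if m:
        srcs.append(m.group(1))
print(srcs)
```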
10-08-2010, 02:08 AM   #10
marbs
I found my bug,

but I need some help fixing it.
I marked it HERE in the code. I want to format the URL as u"www....com", but I am giving it a simple string. I tried ' and " and []; I still can't get the syntax right. Can I get some help with that?

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp3= self.INDEX + item['href']                                #HERE
             # temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp3)                            #AND HERE
              print 'the new soup is', temp2
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 print 'FOUND GOOD URL'
                 url = item1['src']
                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles
10-08-2010, 09:13 AM   #11
Starson17
Quote:
Originally Posted by marbs
But I need some help fixing it. I marked it HERE in the code. I want to format the URL as u"www....com", but I am giving it a simple string. I tried ' and " and []; I still can't get the syntax right. Can I get some help with that?
It looks correct to me, except:
Quote:
Code:
    INDEX = '"http://maya.tase.co.il/'
This has an extra leading quote - it should be
Quote:
Code:
    INDEX = 'http://maya.tase.co.il/'
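To see why that stray quote matters, a tiny check (paths are invented):

```python
bad_index  = '"http://maya.tase.co.il/'   # the version with the extra leading quote
good_index = 'http://maya.tase.co.il/'

bad_url  = bad_index + 'bursa/report.asp'
good_url = good_index + 'bursa/report.asp'

print(bad_url)   # starts with '"', so it is not a valid http URL
assert not bad_url.startswith('http')
assert good_url.startswith('http')
```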
And: don't you want 'the new soup is', soup1 instead of 'the new soup is', temp2?
And: I don't see any iframes in soup1?
Quote:
Spoiler:
Code:
            if not re.search('javascript', item['href']):
              temp3= self.INDEX + item['href']                                #HERE
             # temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp3)                            #AND HERE
              print 'the new soup is', temp2
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 print 'FOUND GOOD URL'
                 url = item1['src']
                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles
10-09-2010, 05:12 PM   #12
marbs
Sometimes all you need is someone to hit you over the head with the answer.

So thanks, Starson17. It now downloads my articles (there is still a lot of work, but I get news at the end, and not an error).

I didn't think of this when I started, but can calibre deal with PDF files?
Some of the reports come in PDF form, and I get gibberish where the PDF used to be. Can I do anything about it? Does it matter if my output format is PDF?

This is the code:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = 'http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

#        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 print 'FOUND GOOD URL'
                 url = item1['src']
                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

The second article is a PDF file. (I am working with a feed that is very rarely updated, so I know the page format very well.)
Can I import a library that deals with PDF?
Thanks for the help.

PS: I also wanted to know if you can add an output file type to the recipe itself that will override the calibre default (if the default is PDF, but I want one self-built recipe to come out as EPUB).
10-09-2010, 09:34 PM   #13
Starson17
Quote:
Originally Posted by marbs
I didn't think of this when I started, but can calibre deal with PDF files?
I've never seen it done with a recipe.
Quote:
PS: I also wanted to know if you can add an output file type to the recipe itself that will override the calibre default (if the default is PDF, but I want one self-built recipe to come out as EPUB).
Again, I'm afraid the answer is no. You could put together a script and do it externally with ebook-convert by specifying the desired format.
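A sketch of that external route, assuming the recipe is saved to a local file (the filenames here are invented). ebook-convert accepts a .recipe file as input and picks the output format from the output filename's extension:

```python
import subprocess

# build the ebook-convert command; the .epub extension selects the format
cmd = ['ebook-convert', 'maya.recipe', 'maya.epub']
print(' '.join(cmd))
# uncomment to actually run the conversion (requires calibre on PATH):
# subprocess.call(cmd)
```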
10-09-2010, 10:01 PM   #14
TonytheBookworm
Quote:
Originally Posted by Starson17
Again, I'm afraid the answer is no. You could put together a script and do it externally with ebook-convert by specifying the desired format.
I wish there was more documentation, or a way to view the actual source, to see what options are available under conversion_options. Not trying to go against what you're saying, but based on this from the API documentation I'm led to believe that you can override the conversion defaults in a recipe:

Spoiler:

#: Recipe specific options to control the conversion of the downloaded
#: content into an e-book. These will override any user or plugin specified
#: values, so only use if absolutely necessary. For example::
#:
#:   conversion_options = {
#:     'base_font_size'   : 16,
#:     'tags'             : 'mytag1,mytag2',
#:     'title'            : 'My Title',
#:     'linearize_tables' : True,
#:   }
#:
conversion_options = {}


I'm just not sure what the actual variable name is. Maybe it is 'output_format': 'epub' or something like that. Kovid, can you chime in on this one please?
10-11-2010, 03:19 PM   #15
kovidgoyal
You cannot override the output format from within a recipe.

Trying to extract text from PDFs is not going to be easy. Just try converting your PDF in calibre to see what will happen.