Custom recipes (archive, read-only) - Page 174

bmsleight · 09-02-2010, 03:11 PM

Hi c.espinosas,

Try this recipe for Milenio Diario (Mexican newspaper, http://impreso.milenio.com/Nacional/).

I do not speak Spanish, though, so there may be some errors.

Spoiler:

Can anyone help me with https://www.mobileread.com/forums/sho...postcount=2568 (instructable)?

Regards,
Brendan

TonytheBookworm · 09-02-2010, 03:25 PM

Wow!!!! This is confusing, to say the least... On the howstuffworks page I'm trying to work on, the next-page link is sometimes in pagination and other times in top10pagnation (why the heck can't they stay consistent and make my life easier?). Anyway, can one of you take a look at http://feeds.feedburner.com/Howstuff...ffDailyRssFeed and tell me how you would solve the multipage issue where the next page doesn't always fall under the same tag structure?

thanks

Starson17 · 09-02-2010, 03:33 PM

Quote:

Originally Posted by TonytheBookworm

Wow!!!! This is confusing, to say the least... On the howstuffworks page I'm trying to work on, the next-page link is sometimes in pagination and other times in top10pagnation (why the heck can't they stay consistent and make my life easier?). Anyway, can one of you take a look at http://feeds.feedburner.com/Howstuff...ffDailyRssFeed and tell me how you would solve the multipage issue where the next page doesn't always fall under the same tag structure?

thanks

Is there any reason you can't do something like this to find both:

Code:

soup.find('div',attrs={'class':['pagination', 'top10pagnation']})

kiklop74 · 09-02-2010, 03:57 PM

Quote:

Originally Posted by TonytheBookworm

Wow!!!! This is confusing to say the least...

You are complicating things. This site has printable pages so just add this to your recipe:

Code:

    keep_only_tags = [dict(name='div',attrs={'class':'content'})]

    def print_version(self, url):
        return url + '/printable'

TonytheBookworm · 09-02-2010, 03:57 PM

Quote:

Originally Posted by Starson17

Is there any reason you can't do something like this to find both:

Code:

soup.find('div',attrs={'class':['pagination', 'top10pagnation']})

I tried that with no luck....
Here is what I have thus far.
For some reason it takes a century to finish, even when I use the text command line.

After looking further at the HTML, I discovered that the pagination, even though it is in different tags, always appears to be nested inside articleFooter. So here is what I came up with. Notice my comments. Definitely in the learning process on this one, haha.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'How Stuff Works'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'How stuff works'
    publisher = 'Tony'
    category = 'information'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    #INDEX                 = u'http://www.adventuregamers.com'
    #extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'
    #masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    keep_only_tags    = [
                         dict(name='div', attrs={'class':['articleBody','articleFooter']})
      #                 ,dict(attrs={'id':['cxArticleText','cxArticleBodyText']})
                        ]
    feeds          = [
                      ('AutoStuff', 'http://feeds.feedburner.com/HowstuffworksAutostuffDailyRssFeed'),
                      
                    ]
   
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div',attrs={'class':'articleFooter'}) # articleFooter contains the nextpage navigation 
        print 'the pager soup is: ', pager
        if pager:
           nexturl = pager.a['href']
           print 'THE NEXT URL IS: ', nexturl
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'articleBody'}) # find the content body for the nextpage
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag) 
   
                    
    def preprocess_html(self, soup):
       for item in soup.findAll(style=True):
           del item['style']
       self.append_page(soup, soup.body, 3)
       # don't think i need this then again I'm not sure 
       
       #pager = soup.find('div',attrs={'class':'toolbar_fat'})
       #if pager:
        #  pager.extract()        
       return soup

By the way, once again THANK YOU FOR DEVOTING YOUR TIME IN HELPING ME. Very much appreciated!!! That goes for others as well.

added****
It looks like it gets stuck in an infinite loop.
Notice how it successfully gets the next URL.
Then, when it goes to that URL, it finds the link to the previous page, so it goes back to it. Then it turns around and goes to the next page again, then back, and so on.

Spoiler:

Starson17 · 09-02-2010, 04:50 PM

Quote:

Originally Posted by TonytheBookworm

I tried that with no luck....

I copied your two typos, but did it work without the typos? It's "top10Pagination" not "top10pagnation"

Quote:

Code:

        pager = soup.find('div',attrs={'class':'articleFooter'}) # articleFooter contains the nextpage navigation 
        print 'the pager soup is: ', pager
        if pager:
           nexturl = pager.a['href']

With the code above, you're going for any link tag in the footer. IMHO, that's way too broad. "pager" should only be the tag that has a link to a next page, not something found on every article. It should never be found on the last page, or on a single page article. If you have to hunt for "Next page" (or whatever text appears on the Next page link) then do that, but unless you're certain that the articleFooter never has any other links, your code won't work. I suspect your last page has a first page link or a previous page link in the articleFooter. Either would put your recipe into an endless loop.
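The point about an over-broad "pager" can be illustrated with a minimal standalone sketch. It uses only the Python standard library rather than calibre's BeautifulSoup, and it assumes the pager container classes are 'pagination' and 'top10Pagination' and that the link text contains the word "next"; both are guesses based on this thread, not verified against the site. The helper names (NextLinkFinder, find_next_url, collect_pages) are made up for illustration. The visited set is a second line of defence against the back-and-forth loop described above.

```python
from html.parser import HTMLParser

# Class names assumed from the thread; the real markup may differ.
PAGER_CLASSES = {'pagination', 'top10Pagination'}


class NextLinkFinder(HTMLParser):
    """Record the href of the first pager link whose text mentions 'next'."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # <div> nesting depth inside a pager; 0 = outside
        self.href = None      # href of the <a> currently open inside the pager
        self.text = []        # text fragments of that <a>
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            classes = set((attrs.get('class') or '').split())
            if self.depth or classes & PAGER_CLASSES:
                self.depth += 1
        elif tag == 'a' and self.depth:
            self.href = attrs.get('href')
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
        elif tag == 'a' and self.href is not None:
            label = ''.join(self.text).lower()
            if self.next_url is None and 'next' in label:
                self.next_url = self.href
            self.href = None


def find_next_url(html):
    """Return the next-page URL, or None on the last or only page."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url


def collect_pages(start_url, fetch):
    """Follow 'next' links from start_url, never fetching a URL twice."""
    seen, pages, url = set(), [], start_url
    while url and url not in seen:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        url = find_next_url(html)
    return pages
```

Here fetch is any callable that takes a URL and returns the page HTML. In a recipe the same two ideas (match only the "next" link, and remember visited URLs) would go into append_page.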

TonytheBookworm · 09-02-2010, 06:05 PM

Quote:

Originally Posted by kiklop74

You are complicating things. This site has printable pages so just add this to your recipe:

Code:

    keep_only_tags = [dict(name='div',attrs={'class':'content'})]

    def print_version(self, url):
        return url + '/printable'

I thought that at first as well, but it seems that it isn't consistent. Notice
http://auto.howstuffworks.com/auto-r...-car-feats.htm (unless I'm overlooking something)... Needless to say, the multipage thing is both interesting and challenging, so I'm definitely learning from you guys.

bmsleight · 09-02-2010, 06:36 PM

TonytheBookworm,

The /printable still works on this article.
http://auto.howstuffworks.com/auto-r....htm/printable

TonytheBookworm · 09-02-2010, 06:42 PM

Quote:

Originally Posted by bmsleight

TonytheBookworm,

The /printable still works on this article.
http://auto.howstuffworks.com/auto-r....htm/printable

Alright, I'll just go back in my cave and hide. After hours on end trying to figure this out, it turns out that even though some of the pages don't show a print button, you can just append /printable, like kiklop74 and bmsleight mention.... grrr, live and learn I guess. I did learn a lot about multiple pages though, and almost had it, but the thing is it kept changing from nextbottom to pagerbottom to whatever... print_version makes it soooooooooooooooooooo much easier. Thanks for putting up with me, guys.

TonytheBookworm · 09-02-2010, 07:34 PM

Okay, so the /printable append worked on the AutoStuff feed, but the webmaster for whatever reason can't seem to keep a constant format. So when it comes to the other feeds, for example Computers, the dang URL needs to be modified from
http://feedproxy.google.com/~r/Howst...icroformat.htm
to this
http://computer.howstuffworks.com/mi....htm/printable

Sure, I understand how to change it from previous recipes I have worked on and been helped with. Yet they were consistent. In this recipe it is a case of do this for some, then do that for some others, then do something else again. Do I somehow call the feeds separately within the recipe?

Like

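Spoiler:

Code:

feeds = [('AutoStuff', 'http://feeds.feedburner.com/HowstuffworksAutostuffDailyRssFeed')]

def print_version(self, url):
    return url + '/printable'

####now do some more feeds
feeds = [('Computers', 'http://feeds.feedburner.com/HowstuffworksComputerstuffDailyRssFeed')]

def print_version(self, url):
    .......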

This recipe, if anyone wants to tackle it with me, is very confusing, at least to me, because like I say some of its feeds use this and others use that. Nothing stays the same. Even with print_version the formatting doesn't stay the same: beyond url + /printable, in some cases it will be computer.howstuffworks.com/social-networking/blah.html instead of computer.howstuffworks.com/someothercrap/blah.html. My guess would be some form of regular expression... So, without further flooding the forum about this recipe, if someone out there doesn't mind tackling this, by all means do.

I'm curious to know how you resolve this issue.... I will continue to work and play with it, but again I don't wanna keep flooding the forum on this.

thanks again

added*******
Is there something like a switch statement in Python (BeautifulSoup)?
For instance
If Feed_title = Auto
then do this...
Else If Feed_title = computers
then do this...
Default
do this....

I think that would work in this situation because then for each individual feed I could have it do what it needs to do to get the printurl...
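As a hedged sketch of the switch idea: Python of this era has no switch statement, and the usual substitutes are an if/elif chain or a dictionary that maps a key to a function. Since print_version as shown in this thread receives only the URL, the sketch below dispatches on the URL's hostname rather than on the feed title. The handler names and the PRINT_HANDLERS table are hypothetical, and the sketch assumes the URL has already been resolved from the feedproxy address to the real article address.

```python
from urllib.parse import urlsplit


def printable_suffix(url):
    # Pattern reported in this thread: append /printable to the article URL.
    # Skip it if the suffix is already there.
    if url.endswith('/printable'):
        return url
    return url + '/printable'


def unchanged(url):
    # Default branch: no known print version, so use the article URL as-is.
    return url


# Hypothetical table playing the role of the switch: hostname -> handler.
PRINT_HANDLERS = {
    'auto.howstuffworks.com': printable_suffix,
    'computer.howstuffworks.com': printable_suffix,
}


def print_version(url):
    """Pick the handler for this URL's host, falling back to the default."""
    host = urlsplit(url).hostname or ''
    handler = PRINT_HANDLERS.get(host, unchanged)
    return handler(url)
```

Inside a recipe this would be a method taking (self, url). A feed whose print URLs follow a different pattern would get its own handler function and its own entry in the table.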

TonytheBookworm · 09-03-2010, 12:19 AM

Quote:

Originally Posted by Starson17

I don't have the answer for you, but I have seen a list of Python editors, including some free ones, so a Google might prove helpful. I use UltraEdit. It has three features I really like.

One is the ability to search defined folders and files, including subdirs, for certain text, then open one or more of the located files. I often search *.recipe files in the resource directory for "keep_only" or "parse_index," etc., to see how other working recipes used those commands.

The second feature is having multiple files open for editing. I keep my recipe, my batch file for executing my recipe and my output error file all open.

The last feature is the ability to execute a batch file with a single keystroke. I have the batch file for executing the recipe connected to that key.

Modify recipe, save it, hit execute, read errors in error file, rinse and repeat.

I believe notepad++ is free and will do some of the above.

I have UltraEdit. Could you enlighten me on the batch file you use for executing your recipes? I can see where that would be very useful. And the output error file: does UltraEdit automatically update the changed content, or do you have to reopen it again?

JvdW · 09-03-2010, 03:28 AM

Hello All,

Is there anybody who can help me with the builtin nrcnext recipe?

I have tried to fix it myself but it looks like it doesn't even start downloading articles so debugging the recipe itself isn't going to work.
Following is the output from ebook-convert nrcnext.recipe test --test -vv
More 'v's don't add more information

Thanks in advance,

Joop

Resolved conversion options
calibre version: 0.7.15
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': u'test',
'disable_font_rescaling': False,
'dont_download_recipe': False,
'extra_css': None,
'font_size_mapping': None,
'footer_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+<br>\\s*.*?\\s*)|(\\s* <a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)' ,
'header_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+<br>\\s*.*?\\s*)|(\\s* <a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)' ,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x03C0ECF0>,
'insert_blank_line': False,
'insert_metadata': False,
'isbn': None,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'max_toc_links': 50,
'no_chapters_in_toc': False,
'no_inline_navbars': False,
'output_profile': <calibre.customize.profiles.OutputProfile object at 0x03C0EED0>,
'page_breaks_before': None,
'password': None,
'prefer_metadata_cover': False,
'preprocess_html': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_first_image': False,
'remove_footer': False,
'remove_header': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'series': None,
'series_index': None,
'tags': None,
'test': True,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'use_auto_toc': False,
'username': None,
'verbose': 6}
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
Trying to get latest version of recipe: ncrnext
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
Python function terminated unexpectedly
list index out of range (Error Code: 1)

c.espinosas · 09-03-2010, 05:56 AM

Quote:

Originally Posted by bmsleight

Hi c.espinosas,

Try this recipe for Milenio Diario (Mexican newspaper, http://impreso.milenio.com/Nacional/).

I do not speak Spanish, though, so there may be some errors.

Spoiler:

Anyone help me on https://www.mobileread.com/forums/sho...postcount=2568 (instructable) ?

Regards,
Brendan

Many thanks Brendan!
I already tried it, and it works very well.
Best,
Carlos

c.espinosas · 09-03-2010, 06:00 AM

Quote:

Originally Posted by bmsleight

Hi c.espinosas,

Try this recipe for Milenio Diario (Mexican newspaper, http://impreso.milenio.com/Nacional/).

I do not speak Spanish, though, so there may be some errors.

...
Regards,
Brendan

Many thanks Brendan!
I already tried it, and it works very well.
Best,
Carlos

Starson17 · 09-03-2010, 11:20 AM

Quote:

Originally Posted by TonytheBookworm

I have UltraEdit. Could you enlighten me on the batch file you use for executing your recipes? I can see where that would be very useful. And the output error file: does UltraEdit automatically update the changed content, or do you have to reopen it again?

For AJC recipe - it's simple:

Code:

c:
cd  \Projects\Calibre\Recipes\AJC
ebook-convert AJC_5.recipe AJC_5 --test -vv > AJC.txt
:ebook-convert AJC_5.recipe AJC_5.epub> AJC.txt

Each time I test a new recipe, I copy the old folder that has AJC_N.recipe, AJC.txt and the batch file, rename the new folder from AJC-Copy to NewRecipeName, and update the names of the files inside to

NewRecipeName.txt
NewRecipeName_1.recipe

and set the batch file to:

Code:

c:
cd  \Projects\Calibre\Recipes\NewRecipeName
ebook-convert NewRecipeName_1.recipe NewRecipeName_1 --test -vv > NewRecipeName.txt

When making drastic changes to the Recipe, save it as
NewRecipeName_2.recipe
and update the batch file to

Code:

c:
cd  \Projects\Calibre\Recipes\NewRecipeName
ebook-convert NewRecipeName_2.recipe NewRecipeName_2 --test -vv > NewRecipeName.txt

This lets you step back in case your changes were horrible.
Use the Advanced "Run Windows Program" F10 to run the batch file with a single key press. Output ends up in the \Projects\Calibre\Recipes\NewRecipeName\NewRecipeName_2 folder as HTML.

Keep a master recipe open in UltraEdit with all your previously worked out tricks and techniques for simple cut and paste.
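The same edit, run, read-the-log cycle can also be sketched in Python, for anyone not on Windows or not using batch files. This is only an illustration of the workflow above, not the method used in this thread: build_command and run_test are made-up names, the folder layout is assumed to match the NAME_N.recipe / NAME.txt convention described above, and ebook-convert is assumed to be on the PATH. Unlike the batch file's plain redirect, this version also sends error output to the log.

```python
import subprocess
from pathlib import Path


def build_command(name, version):
    """Return the ebook-convert test command for NAME_VERSION.recipe."""
    stem = '{}_{}'.format(name, version)
    # Same arguments as the batch file: recipe, output folder, --test, -vv.
    return ['ebook-convert', stem + '.recipe', stem, '--test', '-vv']


def run_test(folder, name, version):
    """Run the recipe in test mode; write all output to NAME.txt in folder."""
    log_path = Path(folder) / (name + '.txt')
    with open(log_path, 'w') as log:
        result = subprocess.run(
            build_command(name, version),
            cwd=folder,                 # run from the recipe's own folder
            stdout=log,
            stderr=subprocess.STDOUT,   # assumption: keep errors in the same log
        )
    return result.returncode
```

Calling run_test with the recipe folder, 'NewRecipeName' and 2 would correspond to the last batch file above. Bumping the version number after drastic changes keeps the earlier recipe files around to step back to.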

