Old 08-22-2010, 01:26 PM   #2491
French
Groupie
Posts: 151
Karma: 1002968
Join Date: Dec 2008
Device: none
Well I've attempted this but not getting what I expected.

Anyone care to take on the Baltimore Sun?


http://www.baltimoresun.com/about/bl...62819.htmlpage
French is offline  
Old 08-22-2010, 01:30 PM   #2492
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17 View Post
Based on your quoted html from above, have you tried the correct href? IOW, have you tried this:
Code:
remove_tags = [dict(name='a', attrs={'href':'#comments_controls'})]
(it's always the little things that trip you up!)
Yeah, it is always the little things. What happened is that the href was underlined in the page source in Firefox, so at first glance the underscore looked like a space. I should have realized (and did, after you showed me) that hrefs don't contain spaces, haha. Anyway, live and learn; thanks again. I hope someone actually enjoys reading that blog. Oh, while I'm at it: is there a way in calibre to use a regular expression to insert or remove a line?
For example:
Code:
 <p class="calibre9" This is the line I wish to keep </p>
 <p class="calibre9" This is the line I wish to delete </p>
 <p class="calibre9" Some more stuff </p>
I want to do something like
Code:
   var string = "This is the line I wish to delete"
   remove_tag [
       where contains(string)
   ]

  or
   var string = "This is the line I wish to delete"
   var replacestring = "Calibre Rocks"
   replace_tag [
       replace_where(string, replacestring)
   ]
The above is just pseudocode, but I hope you understand my logic.
TonytheBookworm is offline  
Old 08-22-2010, 01:52 PM   #2493
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
Yeah it is always the little things. what happen is the thing was underlined in the source on firefox and at first glance it looked like it was a space
There was also a missing "s": "#comment_controls" makes more sense, but they used "#comments_controls". I've given up on anything but copy/paste. FireBug makes it easy.


Quote:
Oh while i'm at it is there a way in calibre using a regexpression to insert or remove a line?
For example:
Code:
 <p class="calibre9" This is the line I wish to keep </p>
 <p class="calibre9" This is the line I wish to delete </p>
 <p class="calibre9" Some more stuff </p>
I want to do something like
Code:
   var string = "This is the line I wish to delete"
   remove_tag  [
                      where contains(string)

  or 
   var string = "This is the line I wish to delete"
   var replacestring = "Calibre Rocks"
   replace_tag [
                    replace_where(string,replacestring)
                    ]
the above is just pseudo code but I hope you understand my logic.
You're posting stuff that has class="calibreN" type labels. I suspect you know this, but those are created after the recipe has finished - so you really need to be looking at the web page html, not the final html or epub.

Also, your examples are missing the closing tag marker ">" after <p class="calibre9"

However, assuming that you're just using that as an example (I.e., you're as lazy as I am and didn't want to go back and open up the original site), the answer to your question is "yes - it's possible to insert or remove a line." and "yes, I understand your pseudo code."

Am I correct in thinking that your next question is "How?"

Spoiler:
Many ways. Let me send you here first, then get back to you.

Also remove_tags, preprocess_regexps, preprocess_html or postprocess_html. I've got to go - back later if you have Q's
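For the preprocess_regexps route, here is a minimal sketch of both cases from the pseudocode (delete a paragraph, replace a string). It assumes the markup is well formed (closing ">" present) and that paragraphs are not nested. `apply_regexps`, the constant names and the sample `html` are hypothetical stand-ins so the rules can be tried outside calibre; in a recipe only the `preprocess_regexps` list would be needed.

```python
import re

DELETE_TEXT = 'This is the line I wish to delete'
OLD_TEXT = 'Some more stuff'
NEW_TEXT = 'Calibre Rocks'

# Same shape as a recipe's preprocess_regexps attribute:
# a list of (compiled pattern, function taking a match and returning a string).
preprocess_regexps = [
    # Remove any <p>...</p> whose content contains DELETE_TEXT.
    # (?:(?!</p>).)* consumes characters without running past the closing </p>.
    (re.compile(r'<p\b[^>]*>(?:(?!</p>).)*' + re.escape(DELETE_TEXT)
                + r'(?:(?!</p>).)*</p>\s*', re.DOTALL),
     lambda m: ''),
    # Plain search and replace of one string by another.
    (re.compile(re.escape(OLD_TEXT)), lambda m: NEW_TEXT),
]


def apply_regexps(html, rules):
    """Hypothetical helper: run each rule over the HTML, in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


html = ('<p class="body">This is the line I wish to keep</p>\n'
        '<p class="body">This is the line I wish to delete</p>\n'
        '<p class="body">Some more stuff</p>')
cleaned = apply_regexps(html, preprocess_regexps)
print(cleaned)
```

The first rule drops the whole paragraph, tags included; the second only swaps the text and leaves the surrounding tag alone.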
Starson17 is offline  
Old 08-22-2010, 06:24 PM   #2494
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:

Am I correct in thinking that your next question is "How?"

Spoiler:
Many ways. Let me send you here first, then get back to you.

Also remove_tags, preprocess_regexps, preprocess_html or postprocess_html. I've got to go - back later if you have Q's
Man, I wish there were a way I could ask questions without flooding this board.
Let's say that in every parse I get something that has a doubleclick.net ad in it.
I tried
Code:
filter_regexps = [r'feedads\.g\.doubleclick\.net']
and yeah, I didn't see any indent errors this time.
Then I thought, well, maybe if I use preprocess_regexps I can remove all the instances of doubleclick first.
So I looked in the Beautiful Soup documentation, and after a big headache I'm still kind of lost.
I tried this as well...
Code:
preprocess_regexps     = [(re.compile(r'feedads\.g\.doubleclick\.net', re.DOTALL), lambda m: '')]
Thanks again.
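One thing worth checking with the rule above: a regex substitution replaces only the text the pattern matches. A pattern that matches just the host name therefore deletes the host name and leaves the surrounding <a> and <img> tags in place. The sketch below shows the difference on invented markup (the ad URLs and the `whole_link` pattern are assumptions for illustration, not taken from the real feed).

```python
import re

# Hypothetical feed markup; the ad URLs are invented for illustration.
html = ('<p>Article text</p>'
        '<a href="http://feedads.g.doubleclick.net/~a/abc/0/da">'
        '<img src="http://feedads.g.doubleclick.net/~a/abc/0/di" /></a>')

# The rule from the post: only the host name itself is matched and removed.
host_only = re.compile(r'feedads\.g\.doubleclick\.net', re.DOTALL)
after_host_only = host_only.sub(lambda m: '', html)

# Widened rule: match the whole <a>...</a> whose opening tag mentions the ad host.
whole_link = re.compile(r'<a\b[^>]*doubleclick\.net[^>]*>.*?</a>', re.DOTALL)
after_whole_link = whole_link.sub(lambda m: '', html)

print(after_host_only)   # the tags survive, with the host name cut out of the URLs
print(after_whole_link)  # only the article paragraph is left
```

The widened pattern is still plain text matching, so it only behaves if the ad link contains no nested <a> tags.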
TonytheBookworm is offline  
Old 08-22-2010, 10:34 PM   #2495
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by French View Post
Well I've attempted this but not getting what I expected.

Anyone care to take on the Baltimore Sun?


http://www.baltimoresun.com/about/bl...62819.htmlpage
I'm not sure exactly what you expect, but I just used two of the feeds (you will have to put the ones you want into the recipe's feeds section).

This works for me. The only issue is that, for the life of me, I can't figure out how to remove the doubleclick.net ad that it puts on some of the articles. Maybe someone can help you/me out on that one. I have tried filter_regexps with no luck. Anyway, enjoy...
Attached Files
File Type: rar balt.rar (506 Bytes, 227 views)
TonytheBookworm is offline  
Old 08-23-2010, 06:19 AM   #2496
Trickery
Pew Pew!
Posts: 29
Karma: 270
Join Date: Aug 2010
Device: Kindle v3
Made these w/ icons. Hopefully they help and get added to the main program. I made a .py for all the Gawker Media Brand websites and Consumerist and added the icons for good measure.

Gawker.com
deadspin.com
io9.com
jalopnik.com
jezebel.com
kotaku.com
lifehacker.com
fleshbot.com
Consumerist.com

Consumerist is done and working well, just can't figure out how to remove a lone image that shows up on each page for twitter. It's small though, so not a big detractor.
Attached Files
File Type: zip gawker media and consumerist with icons.zip (36.9 KB, 241 views)
Trickery is offline  
Old 08-23-2010, 11:29 AM   #2497
marco_polo
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2010
Device: PRS 900
Hello, I'm new here. Can somebody help me create a recipe for www.europasur.es? I tried editing the El Pais recipe, but the RSS links are very different... Thank you.
marco_polo is offline  
Old 08-23-2010, 01:15 PM   #2498
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
Man i wish there was a way i could ask questions without flooding this board and all.
I've worried about that issue, too, but most responses I received said that they didn't mind when I asked if people thought there was too much of this type of "how to" in this thread. It does provide a lot of good info, which is searchable and helps others write recipes, so I wouldn't worry too much. Just try to use the code and spoiler tags to keep the indents and the length of posts minimized. Personally, I like to read kiklop's recipes to see how he approaches certain problems, then I can ask him when I don't understand something.

Quote:
lets say in every parse i get something that has a doubleclick.net ad in it
I tried
Code:
filter_regexps = [r'feedads\.g\.doubleclick\.net']
and yeah i didn't see any indent errors this time.
thought well maybe if i use preprocess_regexps and remove all the instances of doubleclick first.
So then i looked in the beautiful soup documentation and after a big headache i'm still kinda lost
I tried this as well...
Code:
preprocess_regexps     = [(re.compile(r'feedads\.g\.doubleclick\.net', re.DOTALL), lambda m: '')]
I've never needed to use either of those methods to remove doubleclick ads. For me, it's always been possible to define either a keep_only or a remove. As to Beautiful Soup, I'm still learning, and I've read that page at least 50 times. I expect I'll end up reading it another 50 times eventually.

Let's start with filter_regexps. I've only used it once. It's used to prevent a link from being followed. Most of the time, you're not following a link because recursion is off and Calibre isn't following links on the pages. What you normally want to do is remove the link or graphic from your page, not prevent it from being followed by Calibre.
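To make the distinction concrete, here is a rough sketch of what a link filter does. `link_is_filtered` is a hypothetical stand-in written for this post, not calibre's own code, and it assumes each pattern is searched anywhere in the link URL, ignoring case. The URLs are invented.

```python
import re

filter_regexps = [r'feedads\.g\.doubleclick\.net']


def link_is_filtered(url, patterns):
    """Hypothetical stand-in for the follow/skip decision on one link URL."""
    return any(re.search(p, url, re.IGNORECASE) for p in patterns)


# Invented URLs for illustration.
ad_url = 'http://feedads.g.doubleclick.net/~a/abc/0/da'
story_url = 'http://www.example.com/news/story.html'

skip_ad = link_is_filtered(ad_url, filter_regexps)
skip_story = link_is_filtered(story_url, filter_regexps)
print(skip_ad, skip_story)
```

The filter only answers "should this link be followed?". It never edits the page that contains the link, so the ad markup stays in the article either way.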

OTOH, I use preprocess_regexps a lot, but as a sort of last resort. It's simply a powerful search and replace on the HTML. You could do most of your remove_tags with preprocess_regexps if you wanted to. But it's not tag-aware, so remove_tags is better in most cases (it won't be confused if there's a div tag inside a div tag, where S&R might find the open div tag of an outer tag and the close div of an inner tag). Why don't you show me the actual page source for the doubleclick you want to deal with, or give me a link, so I can understand what you are trying to remove?
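The nested-div problem is easy to reproduce with plain Python. This is a sketch on invented markup (the `id="ad"` container is an assumption for illustration): a non-greedy search and replace stops at the first closing </div>, which belongs to the inner tag.

```python
import re

# Invented markup: an ad container that itself contains another div.
html = ('<div id="ad"><div class="inner">Buy now</div> more ad text</div>'
        '<p>Story</p>')

# Search and replace from the outer opening tag to the first closing </div>.
naive = re.compile(r'<div id="ad">.*?</div>', re.DOTALL)
leftover = naive.sub('', html)
print(leftover)
```

The output still contains " more ad text</div>", i.e. half of the ad plus a stray closing tag. A tag-aware rule such as remove_tags = [dict(name='div', attrs={'id':'ad'})] removes the whole element instead.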

BTW, If you look at page source with your browser, it may not be the same as what Calibre sees. It may also be wrong if you look at it with FireBug. To see it as Calibre will see it I like to do this:

Code:
    def preprocess_html(self, soup):
        # Debug aid: dump the parsed page as the recipe receives it.
        print('The soup is: %s' % soup)
        return soup
If you add this code, it does nothing, but the print statement sends the html in cleaned-up Beautiful Soup form into your textfile.txt as Calibre will see it. (you are using ebook-convert ....>textfile.txt format - right?)
Starson17 is offline  
Old 08-23-2010, 02:19 PM   #2499
poluk
Enthusiast
Posts: 34
Karma: 54
Join Date: Jul 2008
Device: not yet
Hi,
I'm trying to adapt the Financial Times recipe to Lloyd's List,
and I get this error:

Quote:
mechanize._mechanize.FormNotFoundError: no form matching name 'log-in-box'
Could you tell me what I should put in place of "log-in-box", given the web page source for the login section below?


Code:
<div class="grid_4 prefix_2 controls-container">

    <div class="grid_4 first last common-box last-in-row" id="log-in-box">

        <h2 class="common-box-header">Please Log In</h2>

        

        <form class="log-in" method="post" action="/ll/security_check">
            <fieldset>
                <label for="j_username">Username:</label>
                <input class="common-field log-in-page" type="text" name="j_username" id="j_username"
                       value="" tabindex="1"/>

                <label for="j_password">Password:</label>

                <input class="common-field log-in-page" type="password" name="j_password" id="j_password" tabindex="2"/>

                <input class="submit log-in-page" type="submit" value="Log In" tabindex="4"/>

                <label for="_spring_security_remember_me"><input type="checkbox" id="_spring_security_remember_me" name="_spring_security_remember_me" tabindex="3"/>Remember me</label>

                <a class="pwd-reminder" href="/ll/forgotten-password.htm">Forgotten your password?</a>


            </fieldset>
The website I'm trying to make a recipe for is: http://www.lloydslist.com/ll/
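A note on the error itself: in the quoted source, "log-in-box" is the id of the enclosing <div>, and the <form> tag has no name attribute at all, so a lookup by form name cannot match it. The sketch below uses only the standard library to list the forms in a trimmed copy of that markup and find the login form's position. `FormLister` and `login_nr` are hypothetical names, and the real page may contain other forms ahead of this one, so the index should be checked against the full page source.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the attributes of every <form> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.forms.append(dict(attrs))


# Trimmed copy of the login markup quoted above.
page = '''
<div class="grid_4 first last common-box last-in-row" id="log-in-box">
  <form class="log-in" method="post" action="/ll/security_check">
    <input type="text" name="j_username" id="j_username" value=""/>
    <input type="password" name="j_password" id="j_password"/>
    <input type="submit" value="Log In"/>
  </form>
</div>
'''

lister = FormLister()
lister.feed(page)

# Position of the login form among all forms on the page.
login_nr = next(i for i, form in enumerate(lister.forms)
                if form.get('action') == '/ll/security_check')
print(lister.forms[login_nr].get('name'), login_nr)
```

With the position known, the recipe's get_browser could select the form by index with br.select_form(nr=login_nr) and then fill br['j_username'] and br['j_password'], the field names shown in the source.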
poluk is offline  
Old 08-23-2010, 04:01 PM   #2500
miangue
Junior Member
Posts: 4
Karma: 10
Join Date: Aug 2010
Location: Colombia
Device: Sony PRS-300
How can I change the title font?

Let's see if someone can help me.
I made this recipe for a magazine in Colombia (larepublica.com.co). Everything comes out the way I want except for one problem: the title of each story comes out in the same font as the article text, and I want it to come out big and bold. How can I do this? What command should I add? Thanks!

Here's the recipe:

Quote:
class AdvancedUserRecipe1282450582(BasicNewsRecipe):
    title = u'LaRepublica.com'
    oldest_article = 7
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True

    keep_only_tags = [
        dict(name='div', attrs={'id':['noticia']})
    ]
    remove_tags = [
        dict(name='div', attrs={'id':['iconos', 'relacionados', 'documentos_adjuntos']}),
        dict(name='span', attrs={'id':['comentarios']})
    ]

    feeds = [(u'Noticias', u'http://www.larepublica.com.co/rss/larepublica.xml')]


And here is part of the source code where the title of the news:

Quote:
<div id="noticia">
<!-- Titulo de la noticia -->
<div id="titulo">
Interés de inversionistas sube el Igbc hasta 13.602,04 unidades
</div>
<!-- Info de la noticia -->
<div id="info">



Last edited by miangue; 08-23-2010 at 04:03 PM.
miangue is offline  
Old 08-23-2010, 07:24 PM   #2501
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:

Code:
    def preprocess_html(self, soup):
        # Debug aid: dump the parsed page as the recipe receives it.
        print('The soup is: %s' % soup)
        return soup
If you add this code, it does nothing, but the print statement sends the html in cleaned-up Beautiful Soup form into your textfile.txt as Calibre will see it. (you are using ebook-convert ....>textfile.txt format - right?)
Yes, I'm using the ebook-convert string you gave me; it works great for debugging. As for preprocess_html, thanks for that method. I will use it in the future as well to test my code.


As for the issue where I had the doubleclick ad: it was the Baltimore Sun. I took a stab at it for the guy/gal who wanted someone to look at it. For the most part everything is fine with the RSS feed, except that it puts that Google ad on some of the pages, generally the first article. When I look at the original source it has ad.doubleclick.net in it; then, after it is rendered with calibre, it is feedads.g.doubleclick.net. Here is the recipe I am currently using for it...
Spoiler:

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'The Baltimoresun'
    __author__ = 'TonytheBookworm'
    description = 'Baltimoresun News'
    publisher = 'The Baltimoresun'
    category = 'news, politics, USA'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    filter_regexps = [r'feedads\.g\.doubleclick\.net']
    extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'

    feeds = [
        ('Crime Beat', 'http://feeds.feedburner.com/news_crime_blog'),
        ('Getting There', 'http://feeds2.feedburner.com/gettingthere_blog'),
    ]



Personally, I don't like the RSS feed of that site. I have considered trying to make a feed myself from this...
http://www.baltimoresun.com/services...print-edition/

which actually gives you some nice pretty images and so forth. I figured in that case I would simply use recursions = 1 and then somehow strip what I didn't want, maybe using keep_only_tags or remove_tags. Or I could somehow write a print_version that looks for the text "Print" inside an <a> tag, gets that URL, and passes it back. The only issue with using the print version is that I lose the photos, which I don't want. It is just something I'm playing with to learn, and to help someone else in the process. Thanks for taking the time to teach me and answer my questions. I really appreciate it.
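The "find the link whose text is Print" step could be sketched like this with the standard library alone. Everything here is an assumption for illustration: the `PrintLinkFinder` class, the sample markup and its URLs are invented, and wiring the result into a recipe is left out.

```python
from html.parser import HTMLParser


class PrintLinkFinder(HTMLParser):
    """Record the href of the first <a> whose visible text is 'Print'."""

    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text pieces seen inside that <a>
        self.print_url = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            label = ''.join(self._text).strip().lower()
            if self.print_url is None and label == 'print':
                self.print_url = self._href
            self._href = None


# Invented article markup for illustration.
page = ('<a href="/news/story.html">Home</a> '
        '<a href="/news/story.html?print=1"><span>Print</span></a>')

finder = PrintLinkFinder()
finder.feed(page)
print(finder.print_url)
```

The comparison is on the link's text, not its URL, so it keeps working if the site changes how its print URLs are built but breaks if the label changes.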
TonytheBookworm is offline  
Old 08-24-2010, 06:54 AM   #2502
kerrware
Junior Member
Posts: 7
Karma: 10
Join Date: Jun 2010
Device: none
My recipe fails to place article data in the EPUB.

I've been trying to create my first simple recipe for a local paper, the Ilkeston Advertiser (Derbyshire, England), which has free RSS feeds. I managed to get the logon process working and ran the recipe in test mode. It seemed to download the first two articles into separate directories, each with an index.html and an image subdirectory. Displaying the index file in Firefox shows that the article data is being downloaded OK.
When I run the recipe in Calibre I get the index summary pages OK, but all the articles referred to contain just the header (Next link, etc.) and footer (downloaded by Calibre, etc.) lines.
Have I missed something out?

Thanks.


Spoiler:

from calibre.web.feeds.news import BasicNewsRecipe
import re

class AdvancedUserRecipe1282596648(BasicNewsRecipe):
    title = u'Ilkeston Advertsier'
    oldest_article = 7
    max_articles_per_feed = 100
    needs_subscription = True

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://auth.jpress.co.uk/login.aspx?ReturnURL=http%3a%2f%2fwww.ilkestonadvertiser.co.uk%2ftemplate%2fRegister.aspx%3fReturnURL%3dhttp%3a%2f%2fwww.ilkestonadvertiser.co.uk%2ffrontpage.aspx&SiteRef=IAS')
            br.select_form(name='Form1')
            br['ctl00$txtEmailAddress'] = self.username
            br['ctl00$txtPassword'] = self.password
            br.submit()
        return br

    feeds = [(u'Ilkeston Today - News', u'http://www.ilkestonadvertiser.co.uk/getfeed.aspx?sectionid=795&format=rss')]
kerrware is offline  
Old 08-24-2010, 07:12 AM   #2503
DoctorOhh
US Navy, Retired
Posts: 9,890
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Quote:
Originally Posted by kerrware View Post
Have I missed something out?
I applaud you using the spoiler tags, but first you have to wrap your recipe in the code tags (the # above) then wrap that with the spoiler tags. Placing your recipe in the code tags keeps your recipe intact with the critical spaces in their proper places. This makes trying your recipe and reviewing it easier on those that have the needed skills to assist you.
DoctorOhh is offline  
Old 08-24-2010, 07:53 AM   #2504
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
As for the the issue where I had the doubleclick. It was the baltimoresun.
I looked at a page from that site. The doubleclick ads seemed to be inside <noscript> tags. If that's it, why not just remove those tags?
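In the remove_tags syntax used earlier in the thread, that suggestion would look like the line below. It is a sketch that assumes the ads really are confined to <noscript> elements and that nothing you want to keep lives inside them.

```python
# Recipe class attribute: strip every <noscript> element from each article.
remove_tags = [dict(name='noscript')]
```

Because remove_tags is tag-aware, this removes each <noscript> element together with everything nested inside it.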

Quote:
I have considered trying to make a feed myself from ...
Sometimes building the feed yourself is best.
Quote:
thanks for taking the time to teach me and answer my questions. I really appreciate it.
You're welcome.
Starson17 is offline  
Old 08-24-2010, 07:59 AM   #2505
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by miangue View Post
I get the source of the article and wanted to come out big and bold but How I can do this?, What command should I add?
extra_css is used to control formatting. Search this thread for some samples and read here.
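For the page source quoted in the question, where the headline sits in <div id="titulo">, a minimal sketch could be the line below. The font size and weight are illustrative choices, not values taken from the thread.

```python
# Recipe class attribute: CSS applied to each downloaded article.
# '#titulo' targets <div id="titulo">, the headline container in the quoted source.
extra_css = '#titulo {font-size: x-large; font-weight: bold;}'
```

The line goes in the recipe class body next to no_stylesheets. Since the recipe keeps only div#noticia, and the titulo div is inside it, the selector still has something to match.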
Starson17 is offline  
All times are GMT -4. The time now is 05:42 PM.

