09-19-2011, 08:33 PM | #1 |
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Adding a comic strip to a newspaper's recipe
I'd like to add a comic strip from the index page of a newspaper.
Until now, I've managed to replace the cover with the comic image, using def get_cover_url. But it would be nicer to have the comic inserted as an article. In my recipe, the articles are retrieved with parse_index.
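(For reference, a hedged sketch of the cover-replacement approach described above. The helper name is made up, the div id comes from later posts in this thread, and the standalone bs4 package stands in for the BeautifulSoup bundled with calibre; a real recipe would do this inside get_cover_url().)

```python
from bs4 import BeautifulSoup

def comic_image_url(index_html):
    # Find the comic container on the index page and return its image
    # URL, so that a recipe's get_cover_url() can simply return it.
    soup = BeautifulSoup(index_html, 'html.parser')
    div = soup.find('div', attrs={'id': 'rudy_paz'})
    if div is not None:
        img = div.find('img')
        if img is not None and img.has_attr('src'):
            return img['src']
    return None
```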
09-20-2011, 02:09 PM | #2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
09-20-2011, 11:53 PM | #3 |
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
I've just done something like that, but the article shows only strange characters. I guess that happens because the link points to an image instead of an HTML file. How do I solve that?
Last edited by macpablus; 09-21-2011 at 12:18 AM.
09-21-2011, 04:30 PM | #4 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
If you post your recipe, it would be easier to see what the problem is. You might review some of my comic recipes, such as Arcamax or Gocomics/Comics.com, to see how articles and images interact. Basically, you want a link to an HTML page with an img tag on it that holds your strip. If the site doesn't have a page like that (it should; otherwise, how would you see the strip?), you can build it yourself in the recipe.
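(A hedged sketch of the "build it yourself" idea: generate a minimal wrapper page with a single img tag and hand its file:// URL out as an article URL. The helper name is made up; this is just one way a recipe might do it.)

```python
import os
import tempfile

def make_strip_page(img_url, title='Daily strip'):
    # Build a minimal HTML page whose only content is an <img> tag
    # pointing at the strip, save it to a temp file, and return a
    # file:// URL that parse_index() can use as an article URL.
    html = ('<html><head><title>%s</title></head>'
            '<body><img src="%s"/></body></html>') % (title, img_url)
    fd, path = tempfile.mkstemp(suffix='.html')
    with os.fdopen(fd, 'w') as f:
        f.write(html)
    return 'file://' + path
```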
09-21-2011, 08:42 PM | #5 | ||
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Quote:
Spoiler:
Quote:
So, it seems that I should "build it myself" in the recipe...
09-22-2011, 01:03 PM | #6 |
Enthusiast
Posts: 28
Karma: 10
Join Date: Sep 2011
Device: Sony PRS-350, Kindle Touch
Okay. I see your problem.
In fact, the return value of parse_index(self) is: Code:
[
  ('title', [
      {'title': ..., 'url': ..., 'description': ..., 'date': ...},
      # more dictionaries as above ...
  ]),
  # more tuples with genres ...
]
On each of these pages, the values of remove_tags and so on are executed, resulting in a cleaned HTML page. A working example would be: Spoiler:
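(As a hedged illustration of that structure, a parse_index that returns one feed with a single comic article might look roughly like this; the feed name and URL are made up, and in a real recipe this would be a method of a BasicNewsRecipe subclass.)

```python
def parse_index(self):
    # One feed ('Humor') holding a single article entry; each article
    # is a dict with at least 'title' and 'url' keys.
    articles = [{
        'title': 'Daily strip',
        'url': 'http://example.com/strip.html',  # placeholder URL
        'description': 'The comic strip from the index page',
        'date': '',
    }]
    # The return value is a list of (feed title, article list) tuples
    return [('Humor', articles)]
```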
Last edited by a.peter; 09-22-2011 at 01:21 PM.
09-22-2011, 05:59 PM | #7 | |
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Quote:
The problem is that my complete recipe has other feeds (i.e., the content of the whole newspaper, with many different sections and articles), so keep_only_tags will affect every article. :-(
09-23-2011, 02:16 AM | #8 | |
Enthusiast
Posts: 28
Karma: 10
Join Date: Sep 2011
Device: Sony PRS-350, Kindle Touch
Quote:
The good thing is that the keep_only_tags member is a list of dictionaries, so you may add any other expression you need to parse other pages. If I take a look at an article, e.g. http://www.pagina12.com.ar/diario/el...011-09-22.html, I see that the actual article is embedded in a <div class="nota top12"> tag. A modified keep_only_tags may be: Code:
keep_only_tags = [dict(name='div', attrs={'id':'rudy_paz'}), dict(name='div', attrs={'class':'nota top12'})]
It doesn't matter if they don't appear on the same page: if you pass one page with the comic strip and a list of pages with articles, it will work on both of them. By the way: for convenience, you may replace the second part of a dictionary entry in keep_only_tags with a compiled regular expression, e.g. attrs={'class':re.compile('top.*')}. But don't forget to add a Code:
import re
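(Putting those two pieces together, a hedged sketch of a keep_only_tags list that matches both the comic container and any div whose class contains something starting with 'top', such as the 'nota top12' article container; in a recipe this would be a class attribute.)

```python
import re

# First entry matches the comic strip's container by id; the second
# uses a compiled regex, which BeautifulSoup matches against the
# attribute value, so 'nota top12' qualifies as well.
keep_only_tags = [
    dict(name='div', attrs={'id': 'rudy_paz'}),
    dict(name='div', attrs={'class': re.compile('top.*')}),
]
```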
09-23-2011, 12:39 PM | #9 | |||
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Quote:
http://www.pagina12.com.ar/imprimir/...011-09-22.html The actual article is contained in this tag: <div id="cuerpo">. But before it, there's also more content needed for the articles (title, subtitle, author), with tags like <h5>, <h1>, etc. These would be excluded by keep_only_tags, and if I try to include them too, the page that has the comic strip would show those tags as well, of course. I think the way to go would be, as Starson suggests: Quote:
Quote:
Maybe you know, pete? ;-) |
09-23-2011, 01:54 PM | #10 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
I'm following along. So far, a.peter's comments have been excellent, so I haven't posted anything. One comment I have is that you made it harder to help you by posting only a partial recipe. I suspect you were trying to simplify, but a solution to one part may complicate another part, as you know from the comments on keep_only_tags.

It sounds like you're worried about this interaction, so posting the entire recipe would be good. I'm also not sure exactly where your problem is. You've posted about worrying that using keep_only_tags for the articles will keep the wrong stuff for the comic strip page. That sounds like you've got the recipe working for the page with links to the feed(s) and the page with links to the articles, and your only remaining problem is controlling any excess junk that appears with the strip without affecting the articles.

If that's where you are, then there are many options. If you aren't at that point yet, then we need to get you there. You may want to review BeautifulSoup's extract() and insert(). Those tools will let you modify a page as needed. You can use postprocess_html to identify the page that has the comic strip and process it with BeautifulSoup to do whatever you need, including building a page entirely from scratch if that's needed.
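(A hedged toy demonstration of the extract()/insert() idea on a made-up page: detach the strip's container, empty the body, and put only the strip back. The standalone bs4 package is used here; calibre bundles its own BeautifulSoup with a very similar API.)

```python
from bs4 import BeautifulSoup

page = ('<html><body>'
        '<div id="junk">everything else on the page</div>'
        '<div id="rudy_paz"><img src="strip.gif"/></div>'
        '</body></html>')
soup = BeautifulSoup(page, 'html.parser')

# Detach the strip, empty the body, then re-insert only the strip
comic = soup.find('div', attrs={'id': 'rudy_paz'})
comic.extract()
for child in list(soup.body.children):
    child.extract()
soup.body.insert(0, comic)
```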
09-23-2011, 03:18 PM | #11 | ||
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Quote:
Sorry for that. Here's the entire (original) recipe, which is in fact included in the latest version of calibre: Spoiler:
My goal is to generate a new feed containing only the comic strip from http://www.pagina12.com.ar/diario/ultimas/index.html, which is contained in <div class="top12 center" id="rudy_paz">. So your description seems correct (again!): Quote:
Last edited by macpablus; 09-23-2011 at 03:25 PM.
09-25-2011, 07:48 AM | #12 | |
Enthusiast
Posts: 28
Karma: 10
Join Date: Sep 2011
Device: Sony PRS-350, Kindle Touch
Quote:
First of all I saw that the daily comic is located at http://www.pagina12.com.ar/diario/principal/index.html. All I had to do was add this page as a single feed 'Humor' with a single article. Then I modified postprocess_html: I look for a div with id='rudy_paz'; when it's present, I extract that div from the soup, remove all content from the soup's body, add the image back, and return the soup. Spoiler:
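(A hedged reconstruction of that idea; the real code is in the spoiler above. This standalone sketch uses bs4 and is written as a plain function, whereas in a recipe postprocess_html is a method taking self as its first argument.)

```python
from bs4 import BeautifulSoup

def postprocess_html(soup, first_fetch):
    # On pages that contain the comic container, drop everything else
    # from the body; ordinary article pages pass through untouched.
    comic = soup.find('div', attrs={'id': 'rudy_paz'})
    if comic is not None:
        comic.extract()
        for child in list(soup.body.children):
            child.extract()
        soup.body.insert(0, comic)
    return soup
```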
The remove_tags_before setting didn't seem to work as I expected, so I removed it. The complete recipe is here: Spoiler:
A run in debug mode (only two feeds, with two articles each) produced the following output: Pagina12.epub. I put your name into the copyright and added myself as a co-author.
09-25-2011, 08:08 PM | #13 |
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Great!
Definitely, you deserve the co-author credit. ;-) But now I'm going for more: I'll try to add a second comic strip, to see if I learned something from this. Stay tuned!
09-26-2011, 10:59 AM | #14 |
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
And here it is. Modified postprocess_html that inserts a second comic strip (the one located at the end of the page):
Spoiler:
Now I'm trying to insert an <hr> tag between the two strips, but I can't find a way to do it.
09-26-2011, 12:07 PM | #15 | |
Enthusiast
Posts: 28
Karma: 10
Join Date: Sep 2011
Device: Sony PRS-350, Kindle Touch
Quote:
And a few new things to learn. First of all: programmers are lazy, so always try to do as much as possible inside one loop. To do this, we use findAll instead of find to look for all the images in the page. The good thing is that the second parameter (attrs) accepts lists of values. Code:
images = soup.findAll('div', attrs={'id':['rudy_paz', 'rep']})
Now we have a list which we may iterate over, using Code:
for image in images:
    # do something with the variable image
To create a new Tag you call something like hr = Tag(soup, "hr"). This creates an <hr></hr>. To add it to the soup at a certain position you may call soup.body.insert(0, hr). But because programmers are lazy, they will call something like Code:
soup.body.insert(0, Tag(soup, "hr")) Spoiler:
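(Putting the whole answer together as a hedged, runnable sketch on a toy page. This uses the standalone bs4 package: new_tag() and insert_before() are bs4's spellings of what the calibre-bundled BeautifulSoup does with Tag(soup, 'hr') plus an index-based insert(); the div ids follow the thread.)

```python
from bs4 import BeautifulSoup

page = ('<html><body>'
        '<div id="rudy_paz"><img src="a.gif"/></div>'
        '<div id="rep"><img src="b.gif"/></div>'
        '</body></html>')
soup = BeautifulSoup(page, 'html.parser')

# One findAll picks up both strips, since attrs accepts a list of ids
strips = soup.findAll('div', attrs={'id': ['rudy_paz', 'rep']})

# Put an <hr> before every strip after the first one
for strip in strips[1:]:
    strip.insert_before(soup.new_tag('hr'))
```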