Old 06-16-2010, 01:57 PM   #2116
kiklop74
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
That is as far as I can help you. This is starting to be really complicated and my time is required elsewhere. In your place I'd just leave the links, they do not obstruct the main text so much.
Old 06-16-2010, 02:07 PM   #2117
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by lordvetinari2
We are almost there.
Sorry, but no you're not. The last little bit is often the hardest.

Quote:
The only section that is not displaying perfectly is the Ecosfera. Check this link: http://ecosfera.publico.pt/noticia.aspx?id=1442165
I looked at it.

Quote:
There are a few elements there (ECOSFERA_polaroid and ECOSFERA_link_rel) that I am trying to remove, but within these father elements there are child elements also using ECOSFERA_texto_01. How do I say "keep element X, as long as X is not within Y"?
It can be done, but not with the simple "keep" tag statements you are using. See below.

Quote:
Finally, the links on the bottom right corner under "Legislação" should not appear either. They are not in any specifically named div or table, so I do not know how to deal with them.
If the tags aren't labeled with class or id, etc., they can't easily be referenced for removal or to be kept. There are other ways to reference them, but now you are adding significant complexity. Basically, you use BeautifulSoup and find tags by position relative to other tags.

Read this and this and this and this.
(Particularly the last one on BeautifulSoup)
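
For what it's worth, the "keep X, as long as X is not within Y" test can be sketched in a few lines. This is only an illustration: it uses the standard library's ElementTree on a made-up, well-formed snippet rather than the real page, and the markup around the class names is an assumption. In a recipe the same parent walk would be done on the soup in preprocess_html, where BeautifulSoup has its own calls for looking up a tag's parents.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an Ecosfera article page.
PAGE = """
<div id="page">
  <div class="ECOSFERA_texto_01">Article body</div>
  <div class="ECOSFERA_polaroid">
    <div class="ECOSFERA_texto_01">Photo caption</div>
  </div>
  <div class="ECOSFERA_link_rel">
    <div class="ECOSFERA_texto_01">Related links</div>
  </div>
</div>
"""

KEEP = 'ECOSFERA_texto_01'
UNWANTED_PARENTS = {'ECOSFERA_polaroid', 'ECOSFERA_link_rel'}


def keep_unless_inside(root, keep, unwanted):
    """Return elements of class `keep` that have no ancestor in `unwanted`."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    kept = []
    for el in root.iter():
        if el.get('class') != keep:
            continue
        ancestor = parent_of.get(el)
        while ancestor is not None and ancestor.get('class') not in unwanted:
            ancestor = parent_of.get(ancestor)
        if ancestor is None:  # reached the top without meeting an unwanted parent
            kept.append(el)
    return kept


kept_text = [el.text for el in keep_unless_inside(ET.fromstring(PAGE), KEEP, UNWANTED_PARENTS)]
print(kept_text)
```

Only the first block survives here; the caption and the related links are dropped because of where they sit, not because of their own class.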

Last edited by Starson17; 06-16-2010 at 03:13 PM.
Old 06-16-2010, 02:58 PM   #2118
lordvetinari2
Zealot
Posts: 137
Karma: 61
Join Date: Jun 2006
Location: Gijón, Spain
Device: Kindle 3G+WiFi & Galaxy Note
Quote:
Originally Posted by Starson17
Sorry, but no you're not. The last little bit is often the hardest.
Read this and this and this and this.
(Particularly the last one on BeautifulSoup)
Thanks again! I had read the Recipe API Documentation, of course.
I skimmed through that last link and I kind of understand what you mean. I only did a little Java at University, and I see I am biting off more than I can chew here. I will leave it as it is and submit a ticket for replacing the old recipe with my own, which at least works 95%.
Old 06-16-2010, 03:12 PM   #2119
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by lordvetinari2
Thanks again! I had read the Recipe API Documentation, of course.
I skimmed through that last link and I kind of understand what you mean. I only did a little Java at University, and I see I am biting off more than I can chew here. I will leave it as it is and submit a ticket for replacing the old recipe with my own, which at least works 95%.
You have what looks like a tough site to clean properly. Have you looked for print links? Sometimes they are the easiest way to get a clean feed.
Old 06-16-2010, 03:27 PM   #2120
lordvetinari2
Zealot
Posts: 137
Karma: 61
Join Date: Jun 2006
Location: Gijón, Spain
Device: Kindle 3G+WiFi & Galaxy Note
Quote:
Originally Posted by Starson17
You have what looks like a tough site to clean properly. Have you looked for print links? Sometimes they are the easiest way to get a clean feed.
Indeed, that's the first thing I looked for. The website manages printing in two ways, depending on the section:
1. Open a print dialog box that will print the current page as it shows, with all the icons, comments, menus and other garbage.
2. Open a pop-up window saying there's been a bad server request.

So, not very useful.

Also, the RSS is awful. Sometimes it gives links as www.sociedade.publico.pt, sometimes as www.publico.pt/sociedade, sometimes as www.publico.pt, etc. I cannot make heads or tails of it, really.

I know, this newspaper website is a mess structurally and otherwise. But it's my favourite Portuguese newspaper (very popular there, too) and I gotta keep learning that beautiful language.
Old 06-16-2010, 04:08 PM   #2121
lordvetinari2
Zealot
Posts: 137
Karma: 61
Join Date: Jun 2006
Location: Gijón, Spain
Device: Kindle 3G+WiFi & Galaxy Note
Quote:
Originally Posted by kiklop74
That is as far as I can help you. This is starting to be really complicated and my time is required elsewhere. In your place I'd just leave the links, they do not obstruct the main text so much.
No problem, thanks for your help anyway.

I have just uploaded the (mostly) working recipe to the tracker:

http://bugs.calibre-ebook.com/ticket/5854
Old 06-16-2010, 04:16 PM   #2122
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by lordvetinari2
I know, this newspaper website is a mess structurally and otherwise.
Well, if you decide you want to dip into the soup, let us know. Other than that, I don't see any way to deal with the structure at that site. Even if you decide to go that way, it's quite likely they will change the site and break all your work. The less organized and more random the site organization, the harder it is to make reliable recipes.
Old 06-17-2010, 01:43 PM   #2123
lordvetinari2
Zealot
Posts: 137
Karma: 61
Join Date: Jun 2006
Location: Gijón, Spain
Device: Kindle 3G+WiFi & Galaxy Note
I am afraid I found some more problems. I don't really mind issues 2-4, but would like to solve them if it's easy. Issue 1, however, is more of a critical error.

Issue 1: Some articles show up with completely garbled text (see "gardbledText.jpg"), both in Calibre and in my PRS-300. Every time I download the news, the articles that show up corrupt are different ones, so it's not an issue with a specific article. Problem with the server?

Issue 2: I had to delete the "Ecosfera" feed from the recipe, because it was making my PRS-300 freeze & reboot, although the articles from said feed displayed just fine on Calibre. As a result, some articles from the main feed (which conform to the "Ecosfera" structure) are showing up empty on the resulting ebook. This also happens with articles from other feeds, which are completely empty, such as http://desporto.publico.pt/noticia.aspx?id=1442218 Is there an EASY way to say, "if you find an empty article, delete it from the book and from the TOC"?

Issue 3: Sometimes the feed provides the same article twice. For instance, "Proposta de composição no exame do 9º ano provocou mais um corrupio nas escolas" under the "Educação" section appears twice, with the same URL, the same title and the same exact content. Is there an EASY way to say, "if you find repeated articles, delete all of them except for the newest one"?

Issue 4: Some articles have the "Next" link disabled. Under PRS-300, I cannot navigate to them. Under Calibre, clicking on them makes no difference. This happens with the "Australiano Tim Cahill suspenso por um jogo" (9th) article from the "Desporto" section, for instance. Any EASY way to solve this?

I ran the recipe with the debugging parameters as follows:
Code:
ebook-convert publico_pt_test.recipe .epub -vv --debug-pipeline p --extract-to x

I ran the resulting ePUB through Adobe's Epubcheck (http://code.google.com/p/epubcheck/) and it returned hundreds of errors. Is this normal?

Attached:
1. parsing_debug.zip > Results of debugging with -vv
2. ebook-convert_log.txt > Terminal messages from debugging
3. epubcheck_log.txt > Results of epubcheck for compliance
4. gardbledText.jpg > Garbled text on my Reader
5. publico_pt_test.epub > ePUB with today's news
6. publico_pt_test.txt > Current recipe
Old 06-17-2010, 02:32 PM   #2124
nook.life
Member
Posts: 12
Karma: 10
Join Date: May 2010
Device: Nook
Cyanide And Happiness?

Any progress on the Cyanide & Happiness request?
Here are the links...

The website is http://www.explosm.net/comics/
and the RSS is: http://feeds.feedburner.com/Explosm


I would really, really, really appreciate it if someone could help me with this.

Thanks so much!!!
Old 06-17-2010, 02:37 PM   #2125
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by lordvetinari2
I am afraid I found some more problems.
There's a lot there. I'll take an initial stab at it.

Quote:
Issue 1: Some articles show up with completely garbled text (see "gardbledText.jpg"), both in Calibre and in my PRS-300. Every time I download the news, the articles that show up corrupt are different ones, so it's not an issue with a specific article. Problem with the server?
I've never seen this behavior before. I'd need to reproduce it and run tests. Typically, I use preprocess_html and postprocess_html, then print the soup. That lets me look at the raw html at different stages. I can't do it now.
Quote:
Issue 2: I had to delete the "Ecosfera" feed from the recipe, because it was making my PRS-300 freeze & reboot, although the articles from said feed displayed just fine on Calibre. As a result, some articles from the main feed (which conform to the "Ecosfera" structure) are showing up empty on the resulting ebook. This also happens with articles from other feeds, which are completely empty, such as http://desporto.publico.pt/noticia.aspx?id=1442218 Is there an EASY way to say, "if you find an empty article, delete it from the book and from the TOC"?
No easy way. Are you sure that these articles are empty? Sometimes articles are empty because you have stripped all the contents, sometimes because the content is there but hidden by leftover script or comment tags, etc. Finding the code in the content that is causing the freezing on your PRS might help. If there is bad code, find that, and if you are stripping too strongly with tag control, fix that.
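
One rough way to tell the two cases apart is to check whether a saved article page still has any visible text once script, style and comment content is ignored. The sketch below uses only the standard library; the class name, the helper name and the two sample pages are made up for illustration, and real pages with broken markup may need more care.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text a reader would see, skipping script/style content."""

    HIDDEN = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are ignored automatically.
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Hypothetical samples: one page stripped down to nothing, one with real text.
stripped_too_hard = '<html><body><script>var x = 1;</script><!-- ad --></body></html>'
has_content = '<html><body><!-- <p>old</p> --><p>Texto do artigo</p></body></html>'
print(repr(visible_text(stripped_too_hard)), repr(visible_text(has_content)))
```

An empty result for a page that looks fine on the website points at the stripping rules rather than at the source.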

Quote:
Issue 3: Sometimes the feed provides the same article twice. For instance, "Proposta de composição no exame do 9º ano provocou mais um corrupio nas escolas" under the "Educação" section appears twice, with the same URL, the same title and the same exact content. Is there an EASY way to say, "if you find repeated articles, delete all of them except for the newest one"?
No easy way I know of.
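
The harder way would be to filter the parsed feeds yourself. A minimal sketch of the filtering step, assuming each feed object exposes an articles list and each article a url attribute (the objects below are plain stand-ins, not calibre's own classes). Since the duplicates are identical, it simply keeps the first copy it meets. In a recipe this would presumably be called from an overridden parse_feeds.

```python
from types import SimpleNamespace


def drop_duplicate_articles(feeds):
    """Keep the first article seen for each URL, across all feeds."""
    seen = set()
    for feed in feeds:
        unique = []
        for article in feed.articles:
            if article.url in seen:
                continue  # same URL already kept earlier
            seen.add(article.url)
            unique.append(article)
        feed.articles = unique
    return feeds


# Hypothetical stand-ins for the feed and article objects a recipe would see.
feeds = [SimpleNamespace(title='Educacao', articles=[
    SimpleNamespace(title='Exame do 9.o ano', url='http://example.invalid/noticia?id=1'),
    SimpleNamespace(title='Exame do 9.o ano', url='http://example.invalid/noticia?id=1'),
    SimpleNamespace(title='Outra noticia', url='http://example.invalid/noticia?id=2'),
])]
drop_duplicate_articles(feeds)
remaining = [a.url for a in feeds[0].articles]
print(remaining)
```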
Quote:
Issue 4: Some articles have the "Next" link disabled. Under PRS-300, I cannot navigate to them. Under Calibre, clicking on them makes no difference. This happens with the "Australiano Tim Cahill suspenso por um jogo" (9th) article from the "Desporto" section, for instance. Any EASY way to solve this?
I'd need to look at the link and the recipe. If there is a link on your source page, AFAIK, it won't be followed unless recursion is turned on. Even then, you may want to control following with match_regexps or filter_regexps. For a "Next" link, are you following to the next page or the next article? If the former, I'd be looking at multipage code. If the latter, I'd hope the article was already in the feed.

Quote:
I ran the resulting ePUB through Adobe's Epubcheck (http://code.google.com/p/epubcheck/) and it returned hundreds of errors. Is this normal?
I've never tried it.

Sorry I can't help more.
Old 06-17-2010, 02:45 PM   #2126
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by nook.life
Any progress on the Cyanide & Happiness request?
Here are the links...

The website is http://www.explosm.net/comics/
and the RSS is: http://feeds.feedburner.com/Explosm


I would really, really, really appreciate it if someone could help me with this.

Thanks so much!!!
I took a look at it. I told you I took a look at it. I asked you a question. You didn't respond, so I stopped. I like to know there's really someone out there.
Old 06-17-2010, 03:28 PM   #2127
lordvetinari2
Zealot
Posts: 137
Karma: 61
Join Date: Jun 2006
Location: Gijón, Spain
Device: Kindle 3G+WiFi & Galaxy Note
As always, thanks a lot for your help, Starson17.

Quote:
Originally Posted by Starson17
I'd need to reproduce it and run tests. Typically, I use pre and postprocess_html then print the Soup.
Pre and post stuff is in the ZIP attachment from my previous post. Is that what you mean?

Quote:
Originally Posted by Starson17
Finding the code in the content that is causing the freezing on your PRS might help. If there is bad code, find that, and if you are stripping too strongly with tag control, fix that.
The thing is, content from that feed appears in tag names that are also used for elements that I don't need. One of those is the meta name content, which provides an unclosed tag when parsed. Anyway, I'm guessing it means messing about with some deep BeautifulSoup stuff, so I prefer to remove that feed completely and be done with it.

Quote:
Originally Posted by Starson17
For a "Next" link, are you following to the next page or the next article. If the former, I'd be looking at multipage code. If the latter, I'd hope the article was already in the feed.
It's just going to the next article, there's no multipage used in these feeds. Yes, the article is already in the feed, as I can get there one pageturn at a time.
Old 06-17-2010, 03:58 PM   #2128
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by lordvetinari2
As always, thanks a lot for your help, Starson17.
You're welcome. Be aware, I'm no expert, but I've been able to make the recipes do anything I've really tried to get them to do, so I've wandered through many different parts.

Quote:
Pre and post stuff is in the ZIP attachment from my previous post. Is that what you mean?
What I mean is that I run preprocess_html(soup) with a simple print command:
Code:
print 'The preprocess soup is: ', soup
Then I do it with postprocess_html. This lets me see the html sorted by BeautifulSoup at different stages. Your garbled text is presumably not garbled on the source page, so it's getting garbled during processing. This would help track down where it's happening.
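
Spelled out a little more, the two hooks could look like the sketch below. The class name is made up and it is deliberately free of any calibre import so it can be run on its own; in practice the two methods would go inside the recipe class. It uses the print() function form, on the assumption that the recipe runs under Python 3 as current calibre versions do.

```python
class SoupDumpMixin:
    """Hypothetical helper: print the soup on its way in and out of processing."""

    def preprocess_html(self, soup):
        print('The preprocess soup is: ', soup)
        return soup  # the hook must hand the soup back

    def postprocess_html(self, soup, first_fetch):
        print('The postprocess soup is: ', soup)
        return soup


# Stand-alone demonstration with a plain string in place of a real soup object.
dumper = SoupDumpMixin()
before = dumper.preprocess_html('<p>texto</p>')
after = dumper.postprocess_html(before, True)
```

Comparing the two printouts for an article that comes out garbled should show whether the damage is already there before processing starts.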


Quote:
The thing is, content from that feed appears in tag names that are also used for elements that I don't need. One of those is the meta name content, which provides an unclosed tag when parsed. Anyway, I'm guessing it means messing about with some deep BeautifulSoup stuff, so I prefer to remove that feed completely and be done with it.
All your questions have answers only found in BeautifulSoup. The worse the site, the more you need it. The entire recipe system uses it under the hood, anyway. Each time you asked if there was an easy way to do something, I thought .... not unless you think using Beautiful Soup is easy.

Quote:
It's just going to the next article, there's no multipage used in these feeds. Yes, the article is already in the feed, as I can get there one pageturn at a time.
So if it's just going to the next article, why not strip that "Next" element and not worry about whether it links or not?

Three methods of stripping I typically use:

1) Use the remove_tags, keep_only_tags, etc. This is easy.

2) Use preprocess_html(soup), find your tag, use .extract(). This is only a bit harder.

3) Get down and dirty with preprocess_regexps. You provide a list of regexp substitution rules to run on the downloaded html. Each element of the list is a two element tuple. The first element of the tuple is a compiled regular expression and the second a callable that takes a single match object and returns a string to replace the match. It's basically text-based, not tag-based, search and replace in the html. You can remove tags, change tags, fix broken tags, change links, etc. It's very flexible for difficult situations.
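
To make method 3 concrete, here is a small stand-alone sketch of such a rule list. The two patterns and the apply_rules helper are made up for illustration (the helper only mimics what happens to the downloaded html), and the assumption that ECOSFERA_link_rel is a div class is a guess, not something checked against the site.

```python
import re

# Each rule: (compiled regex, callable taking a match and returning the replacement).
preprocess_regexps = [
    # Drop HTML comments entirely.
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda match: ''),
    # Drop a hypothetical "related links" block identified only by its class.
    # Non-greedy matching stops at the first </div>, so nested divs would need more care.
    (re.compile(r'<div class="ECOSFERA_link_rel">.*?</div>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_rules(html, rules):
    """Run each substitution in order over the raw html text."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html


raw = '<p>Texto</p><!-- ad --><div class="ECOSFERA_link_rel"><a href="#">x</a></div>'
cleaned = apply_rules(raw, preprocess_regexps)
print(cleaned)
```

In a recipe only the preprocess_regexps list itself would be needed, as a class attribute.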
Old 06-17-2010, 07:02 PM   #2129
lordvetinari2
Zealot
Posts: 137
Karma: 61
Join Date: Jun 2006
Location: Gijón, Spain
Device: Kindle 3G+WiFi & Galaxy Note
Quote:
Originally Posted by Starson17
Then I do it with postprocess_html. This lets me see the html sorted by BeautifulSoup at different stages. Your garbled text is presumably not garbled on the source page, so it's getting garbled during processing. This would help track down where it's happening.
I've checked the folders created by the debugging mode and it seems that the articles are corrupted on download. That's weird, because it's always different articles every time.
I tried limiting simultaneous_downloads to 1, but that didn't solve the issue.

Quote:
Originally Posted by Starson17
So if it's just going to the next article, why not strip that "Next" element and not worry about whether it links or not?
I think we are misunderstanding each other. Please check the attached image. The "Next" link I mean is the one at the top, on the navigation menu.
I downloaded the news to LRF instead and noticed that the "Next" text did not even have link formatting in Calibre, while in ePUB it did have link formatting but didn't work. It's like there is no link at all, rather than a non-active link.
Attached: next_navi.png
Old 06-17-2010, 08:22 PM   #2130
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by lordvetinari2
it seems that the articles are corrupted on download. That's weird, because it's always different articles every time.
That is odd. I've never seen it before.

Quote:
I think we are misunderstanding each other. Please check the attached image. The "Next" link I mean is the one at the top, on the navigation menu.
Yes, I misunderstood. I thought you were referring to links on the page, not navigation bar links. The navbar links are created as the html is constructed. Typically, the Next link on feed_0/article_0 is to feed_0/article_1/index.html, which has a Next link to feed_0/article_2/index.html, etc. until the last article in feed_0, where the Next link points to feed_1/index.html.

The last "Next" link is invalid if there is no next feed. I suppose it's a bug, but not one I notice, as I don't use the navbar. If your next article's index.html isn't built, that would make an invalid Next link.
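
The chain described above can be written down as a small function. This is only a model of the layout as I described it, not calibre's actual navbar code; the function name and the feed sizes are made up.

```python
def next_link(feed_index, article_index, articles_per_feed):
    """Return the path the "Next" link would point to, or None if nothing follows."""
    last_article = articles_per_feed[feed_index] - 1
    if article_index < last_article:
        # Still inside the same feed: go to the following article.
        return 'feed_%d/article_%d/index.html' % (feed_index, article_index + 1)
    if feed_index + 1 < len(articles_per_feed):
        # Last article of this feed: jump to the next feed's index page.
        return 'feed_%d/index.html' % (feed_index + 1)
    return None  # last article of the last feed: no valid target


sizes = [3, 2]  # hypothetical: feed_0 has 3 articles, feed_1 has 2
links = [next_link(0, 0, sizes), next_link(0, 2, sizes), next_link(1, 1, sizes)]
print(links)
```

If the target file for one of these paths was never written, the link has nowhere to go, which would match a "Next" that does nothing.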