Quote:
Originally Posted by lordvetinari2
I am afraid I found some more problems.
|
There's a lot there. I'll take an initial stab at it.
Quote:
Issue 1: Some articles show up with completely garbled text (see "gardbledText.jpg"), both in Calibre and in my PRS-300. Every time I download the news, the articles that show up corrupt are different ones, so it's not an issue with a specific article. Problem with the server?
|
I've never seen this behavior before. I'd need to reproduce it and run tests. Typically, I use preprocess_html and postprocess_html and print the soup in each, which lets me look at the raw HTML at different stages. I can't do that right now.
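For reference, here is a minimal sketch of the kind of debugging hooks described above. It is written as a plain mixin so it runs without calibre installed; the names SoupDumpMixin and _dump are made up for illustration, and the idea of mixing it into your recipe class is an assumption about how you would wire it in, not something tested against this recipe.

```python
class SoupDumpMixin:
    """Hypothetical helper: prints the HTML calibre hands to each hook."""

    dump_limit = 2000  # assumed cap so the job log stays readable

    def _dump(self, stage, soup):
        html = str(soup)  # a BeautifulSoup object renders to markup via str()
        print('--- %s: %d chars ---' % (stage, len(html)))
        print(html[:self.dump_limit])

    def preprocess_html(self, soup):
        # Runs on the downloaded page before calibre cleans it up.
        self._dump('preprocess_html', soup)
        return soup

    def postprocess_html(self, soup, first_fetch):
        # Runs after cleanup; garbling visible only here points at the recipe.
        self._dump('postprocess_html', soup)
        return soup
```

In a recipe this would be used as something like class Publico(SoupDumpMixin, BasicNewsRecipe), then the output compared for an article that comes out garbled and one that does not.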
Quote:
Issue 2: I had to delete the "Ecosfera" feed from the recipe, because it was making my PRS-300 freeze & reboot, although the articles from said feed displayed just fine on Calibre. As a result, some articles from the main feed (which conform to the "Ecosfera" structure) are showing up empty on the resulting ebook. This also happens with articles from other feeds, which are completely empty, such as http://desporto.publico.pt/noticia.aspx?id=1442218 Is there an EASY way to say, "if you find an empty article, delete it from the book and from the TOC"?
|
No easy way. Are you sure that these articles are empty? Sometimes an article is empty because you have stripped all of its content; sometimes the content is there but hidden by leftover script or comment tags, etc. Finding the code in the content that freezes your PRS might help. If there is bad code, track it down; if you are stripping too aggressively with tag control, fix that.
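One way to check whether an article is really empty is to look at what text survives once scripts, styles and comments are ignored. The sketch below uses only the Python standard library; visible_text is a made-up helper name, and you would feed it the saved HTML of a suspect article yourself.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside script/style/noscript elements."""

    SKIP = {'script', 'style', 'noscript'}

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many skipped elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are ignored for free.
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the readable text of an HTML string (hypothetical helper)."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)
```

If visible_text() returns an empty string for the downloaded page, the content was probably stripped or never there. If it returns text but the reader shows nothing, leftover markup is the more likely culprit.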
Quote:
Issue 3: Sometimes the feed provides the same article twice. For instance, "Proposta de composição no exame do 9º ano provocou mais um corrupio nas escolas" under the "Educação" section appears twice, with the same URL, the same title and the same exact content. Is there an EASY way to say, "if you find repeated articles, delete all of them except for the newest one"?
|
No easy way I know of.
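That said, one approach that might work is filtering the parsed feeds by URL before the articles are downloaded. This is an untested sketch: drop_duplicate_articles is a made-up name, it assumes each feed object has an articles list whose items carry a url attribute, and it keeps the first copy it sees rather than the newest one (which should not matter when the copies are identical).

```python
def drop_duplicate_articles(feeds):
    """Remove repeated URLs within each feed, keeping the first occurrence.

    Assumes feed.articles is a list and each article has a .url attribute.
    """
    for feed in feeds:
        seen = set()
        kept = []
        for article in feed.articles:
            if article.url in seen:
                continue  # same URL already kept in this section
            seen.add(article.url)
            kept.append(article)
        feed.articles[:] = kept  # edit in place so the feed object is reused
    return feeds


# Assumed wiring inside the recipe class (not tested against this recipe):
#
#     def parse_feeds(self):
#         feeds = BasicNewsRecipe.parse_feeds(self)
#         return drop_duplicate_articles(feeds)
```

The set is reset for every feed, so an article that legitimately appears in two different sections is left alone.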
Quote:
Issue 4: Some articles have the "Next" link disabled. Under PRS-300, I cannot navigate to them. Under Calibre, clicking on them makes no difference. This happens with the "Australiano Tim Cahill suspenso por um jogo" (9th) article from the "Desporto" section, for instance. Any EASY way to solve this?
|
I'd need to look at the link and the recipe. If there is a link on your source page, AFAIK it won't be followed unless recursion is turned on (recursions in the recipe). Even then, you may want to control which links are followed with match_regexps or filter_regexps. For a "Next" link, are you following to the next page or to the next article? If the former, I'd be looking at multipage code. If the latter, I'd hope the article was already in the feed.
I've never tried it.
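If you want to experiment, the settings involved would look roughly like the sketch below. The attribute names recursions, match_regexps and filter_regexps are the recipe options mentioned above; the two patterns are purely hypothetical (I have not checked how this site paginates), and would_follow is only a rough offline check of the patterns, not calibre's own logic.

```python
import re


class FollowLinksSketch:
    """Settings to copy into the recipe class; the values are assumptions."""

    recursions = 1  # follow links one level deep from each article page

    # Hypothetical pattern for "next page" links; adjust to the real URLs.
    match_regexps = [r'noticia\.aspx\?id=\d+&page=\d+']

    # Hypothetical pattern for links that should never be followed.
    filter_regexps = [r'comentarios']


def would_follow(url, cfg=FollowLinksSketch):
    """Rough check: matches a wanted pattern and no unwanted pattern."""
    wanted = any(re.search(p, url) for p in cfg.match_regexps)
    unwanted = any(re.search(p, url) for p in cfg.filter_regexps)
    return wanted and not unwanted
```

Running would_follow() over the links found on one article page is a quick way to see whether the patterns are too loose before starting a full download.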
Sorry I can't help more.