01-17-2011, 01:46 PM   #1
bcollier
Member
FIX: New York Times Recipe

I've updated the New York Times recipe to fix the issue from my previous post, where articles were occasionally missing. The problem was in the postprocess_html function: a minor formatting problem in an article raised an error, and the whole article was dropped as a result. The updated procedure, the recipe, and an explanation are below (this may be happening in other recipes as well).

The caption of one of the photos has a paragraph within a paragraph:


<p class="caption"><p><em>“There’s no doubt in my mind that the whole trial will be about did he know right from wrong.”</em><strong> CLARENCE DUPNIK</strong> Pima County sheriff</p> </p>

The postprocess procedure thinks there are two paragraphs, and the second paragraph is empty. As a result, evaluating caption.contents[0] raises an index out of range error.

for caption in soup.findAll(True, {'class':'caption'}):
    if caption and caption.contents[0]:

The first fix was simply to replace the caption.contents[0] check with len(caption) > 0. The second fix was to wrap every minor change this procedure makes in its own try/except block, so that when small inconsistencies like this one come up, the errors are logged but the article is still included.
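
For anyone who wants to try the same pattern in another recipe, here is a minimal sketch of the two fixes. It is not the code from the attached files. FakeTag, first_caption_child, and run_steps are hypothetical names made up for illustration, and FakeTag only imitates the two things the fix relies on (a .contents list and len() counting the child nodes), so the sketch runs without calibre or BeautifulSoup.

```python
import logging

log = logging.getLogger("recipe_sketch")  # hypothetical logger name


class FakeTag:
    """Hypothetical stand-in for a parsed tag: only .contents and len()."""

    def __init__(self, contents):
        self.contents = list(contents)

    def __len__(self):
        # len(tag) counts child nodes, which is what the first fix relies on.
        return len(self.contents)


def first_caption_child(caption):
    """Return the caption's first child, or None when the tag is empty."""
    # Checking the length first avoids indexing into an empty list,
    # which is what raised the index out of range error.
    if caption is not None and len(caption) > 0:
        return caption.contents[0]
    return None


def run_steps(article, steps):
    """Run each cleanup step inside its own try/except block.

    A failing step is logged and recorded, and the remaining steps
    still run, so the article is kept rather than dropped.
    """
    errors = []
    for step in steps:
        try:
            step(article)
        except Exception as exc:
            log.exception("cleanup step %s failed", step.__name__)
            errors.append((step.__name__, exc))
    return article, errors
```

In a real recipe each step would be one of the small changes postprocess_html makes to the soup, and the logging call would be whatever logger the recipe already has available.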
Attached Files
File Type: txt updated nytimes postprocess_html.txt (3.8 KB, 342 views)
File Type: zip nytimes_sub.zip (7.5 KB, 279 views)

Last edited by bcollier; 01-17-2011 at 01:52 PM.