MobileRead Forums - View Single Post

adinb · 03-29-2007, 03:57 AM

Now that I'm getting better with more complex .Net regexes, I can also articulate potential bug #2 a little more clearly to you:

-when the "apply extractor pattern to linked content" option is enabled, the Link Reformatter field is still using the groupings from the link element (i.e. guid, link, etc.) and not the groupings from the link extractor pattern.

I'll use "The Raw Story" as an example. It's a pretty basic RSS feed with the link element = 'link'. There's a printable version of each story, but you have to follow the link element and use the link extractor pattern on the followed link. (For this example I'll say that we grabbed 'http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html')

On the followed link, I'll apply the regex "action='(http://rawstory.com/printstory.php\?story=\d+)'>" to snag the proper URL for the printable version. With this regex, I should be able to make the link reformatter just {0}, since the group captures the entire link. (yeah, I could optimize the regex, but I like 'em a little more readable, rather than using backreferences, etc.)
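To spell out what I expect to happen, here's a minimal Python sketch (not the tool's actual code, and Python's re rather than .Net, though this pattern behaves the same in both). The page_html snippet and the reformat_link helper are made up for illustration; I'm assuming {0} maps to the first captured group, which is how the field appears to behave.

```python
import re

# Hypothetical fragment of the followed page; the real markup is an assumption.
page_html = "<form method='post' action='http://rawstory.com/printstory.php?story=5513'>"

# Link extractor pattern from the post: one group capturing the whole printable URL.
extractor = r"action='(http://rawstory.com/printstory.php\?story=\d+)'>"


def reformat_link(extractor_pattern, template, content):
    """Hypothetical helper: fill {0}, {1}, ... with the extractor's captured groups."""
    match = re.search(extractor_pattern, content)
    if match is None:
        return None  # pattern did not match the followed page
    return template.format(*match.groups())


# With the whole URL captured, a reformatter of just {0} should be enough.
printable_url = reformat_link(extractor, "{0}", page_html)
print(printable_url)
```

If the groupings really came from the extractor pattern, the log should show the printstory.php URL here.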

Looking in the log, the reformatted link ends up as "http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html" instead of "http://rawstory.com/printstory.php?story=5513".

Doing a little more testing: if I move the parens around to make the regex "action='http://rawstory.com/printstory.php\?story=(\d+)'>" (which makes {0}=5513) and set the link reformatter field to "http://rawstory.com/printstory.php?story={0}" (which should again result in "http://rawstory.com/printstory.php?story=5513"), I get the following reformatted link (copied from the log):
"http://rawstory.com/printstory.php?story=http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html"

Which is why it initially looks like the extractor isn't being applied to linked content.
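Here's a rough Python sketch of what I suspect is going on, purely as a guess from the log output and not based on the tool's source. The page_html snippet and both function names are hypothetical. The "suspected" version fills {0} from the feed's link element, the "expected" version fills it from the extractor's group on the followed page.

```python
import re

# Value of the feed's link element for the example story.
feed_link = "http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html"

# Hypothetical fragment of the followed page; the real markup is an assumption.
page_html = "<form action='http://rawstory.com/printstory.php?story=5513'>"

extractor = r"action='http://rawstory.com/printstory.php\?story=(\d+)'>"
template = "http://rawstory.com/printstory.php?story={0}"


def expected_reformat(template, extractor_pattern, content):
    """What I expect: {0} comes from the extractor's group on the followed page."""
    match = re.search(extractor_pattern, content)
    return template.format(*match.groups()) if match else None


def suspected_reformat(template, link_element_value):
    """What the log suggests: {0} still comes from the link element itself."""
    return template.format(link_element_value)


expected = expected_reformat(template, extractor, page_html)
suspected = suspected_reformat(template, feed_link)
print(expected)
print(suspected)
```

The second string matches what I copied from the log, which is why I think the reformatter is being handed the link element's groupings rather than the extractor's.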

If there's just some sort of undocumented selector to force the link reformatter field to use the link extractor pattern when following the link element, I'd ***love*** to see it.
