Old 08-12-2005, 07:26 PM   #7
hacker
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
I just took 5 minutes and recreated what you did with the HowStuffWorks site, using Perl. Here's the process:

  1. Fetch big.htm
  2. Slice out the links section (bounded by <ul></ul> tag pairs)
  3. Fetch each of the pages linked there by appending /printable to the end of each URL (steps 1-3 are sketched just after this list)
  4. Yank the middle section out of each 'printable' page (bounded by <!-- (start|end) of article body --> comment tags)
  5. Strip out any <script> and <style> tag pairs, including anything in-between them (as well as any attributes they use). No need for any of that on a PDA.
  6. Strip out a few key page elements (the categoryNav, first <center>.*?</center> tag, and the last <td align=right>.*</td> tag)
  7. Convert to Plucker.
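Something like this covers steps 1-3. It's a sketch, not the exact script I ran: the base URL is made up, and I'm assuming the links inside that <ul> are absolute (if they're relative, you'd prepend the host first):

Code:
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical URL; substitute the real location of big.htm
my $base = 'http://www.howstuffworks.com/big.htm';

# Step 1: fetch big.htm
my $index = get($base) or die "couldn't fetch $base\n";

# Step 2: slice out the links section (bounded by <ul></ul> tag pairs)
my ($links) = $index =~ m,<ul>(.*?)</ul>,is;

# Step 3: fetch each linked page, with /printable appended to the URL
my @printables;
for my $href ($links =~ m,href="([^"]+)",gi) {
    (my $url = $href) =~ s,/?$,/printable,;
    my $page = get($url) or next;    # skip anything that fails to fetch
    push @printables, $page;
}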
So far, it looks great, and all in 10 lines of code. Here's some of my magic:

Code:
# Step 5: strip <script> and <style> tag pairs, attributes and contents included
$content =~ s,<(s(?:cript|tyle))[^>]*>.*?</\1>,,gis;

# Step 4: keep only what sits between the article-body comment markers
my ($start, $end) = map "<!-- $_ of article body -->", 'start', 'end';
$content =~ s,.*${start}(.*?)${end}.*,$1,gis;
Not too hard to do at all.
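Step 6 is the only genuinely site-specific bit. Here's roughly what it looks like; the categoryNav pattern is a guess (I'm assuming a <div id="categoryNav"> wrapper), so check the real markup and adjust. Once the content is clean, write it out and let the Plucker distiller (plucker-build) handle step 7:

Code:
# Step 6: strip the HowStuffWorks-specific page elements
$content =~ s,<div[^>]*id="categoryNav".*?</div>,,is;    # category navigation (assumed markup)
$content =~ s,<center>.*?</center>,,is;                  # first <center> block
$content =~ s,(.*)<td align=right>.*</td>,$1,is;         # last <td align=right> block

# Step 7: dump the cleaned HTML to disk and feed it to plucker-build
open my $fh, '>', 'howstuffworks.html' or die $!;
print $fh $content;
close $fh;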


Doing this for every site you want to fetch could be gruesome and painful. Avoid it; there are tools out there that do all of this already, using templates that describe each site's "stomach", or main content area.