Old 08-12-2005, 07:26 PM   #7
hacker
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
I just took 5 minutes and recreated what you did with the HowStuffWorks site, using Perl. Here's the process:

  1. Fetch big.htm
  2. Slice out the links section (bounded by <ul></ul> tag pairs)
  3. Fetch each of the pages linked there by appending /printable to the end of each URL (steps 1-3 are sketched just after this list)
  4. Yank the middle section out of each 'printable' page (bounded by <!-- (start|end) of article body --> comment tags)
  5. Strip out any <script> and <style> tag pairs, including anything in-between them (as well as any attributes they use). No need for any of that on a PDA.
  6. Strip out a few key page elements (the categoryNav, first <center>.*?</center> tag, and the last <td align=right>.*</td> tag)
  7. Convert to Plucker.
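Something like this covers steps 1-3. It's a sketch, not the exact script I ran: the base URL is made up, and I'm assuming the links inside that <ul> are absolute (if they're relative, you'd prepend the host first):

Code:
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical URL; substitute the real location of big.htm
my $base = 'http://www.howstuffworks.com/big.htm';

# Step 1: fetch big.htm
my $index = get($base) or die "couldn't fetch $base\n";

# Step 2: slice out the links section (bounded by <ul></ul> tag pairs)
my ($links) = $index =~ m,<ul>(.*?)</ul>,is;

# Step 3: fetch each linked page, with /printable appended to the URL
my @printables;
for my $href ($links =~ m,href="([^"]+)",gi) {
    (my $url = $href) =~ s,/?$,/printable,;
    my $page = get($url) or next;    # skip anything that fails to fetch
    push @printables, $page;
}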
So far, it looks great, and all in 10 lines of code. Here's some of my magic:

Code:
# Step 5: strip <script> and <style> tag pairs, attributes and contents included
$content =~ s,<(s(?:cript|tyle))[^>]*>.*?</\1>,,gis;

# Step 4: keep only what sits between the article-body comment markers
my ($start, $end) = map "<!-- $_ of article body -->", 'start', 'end';
$content =~ s,.*${start}(.*?)${end}.*,$1,gis;
Not too hard to do at all.
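Step 6 is the only genuinely site-specific bit. Here's roughly what it looks like; the categoryNav pattern is a guess (I'm assuming a <div id="categoryNav"> wrapper), so check the real markup and adjust. Once the content is clean, write it out and let the Plucker distiller (plucker-build) handle step 7:

Code:
# Step 6: strip the HowStuffWorks-specific page elements
$content =~ s,<div[^>]*id="categoryNav".*?</div>,,is;    # category navigation (assumed markup)
$content =~ s,<center>.*?</center>,,is;                  # first <center> block
$content =~ s,(.*)<td align=right>.*</td>,$1,is;         # last <td align=right> block

# Step 7: dump the cleaned HTML to disk and feed it to plucker-build
open my $fh, '>', 'howstuffworks.html' or die $!;
print $fh $content;
close $fh;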


Doing this for every site you want to fetch could be gruesome and painful. Avoid it; there are tools out there that do all of this already, using templates that describe each site's "stomach", or main content area.