07-26-2006, 05:06 AM   #1
goybert
Join Date: Jul 2006
Plucker: Help needed with spidering

Hi,
I have been having trouble spidering the following website: http://southseas.nla.gov.au/refs/falc/contents.html
At first I had several filters enabled and couldn't get past the first page. For testing, I removed all the filters and set max depth to 2, but I still could not get past the first page. Here is the progress text:


---------------------------------------------------------------------
Initializing Plucker spidering engine...

-----------------------------------------------------------
Updating channel: falconer...
-----------------------------------------------------------
Pluckerdir is 'C:\Program Files\Plucker'...
Using proxy '' with authentication for user ''...
ZLib compression turned on
Using exclusion list C:\Program Files\Plucker\exclusionlist.txt
Using exclusion list C:\Program Files\Plucker\exclusionlist.txt
---- 0 collected, 1 to do ----
Processing http://southseas.nla.gov.au/refs/falc/contents.html...
Retrieved ok.
Parsed ok.
---- all 1 pages retrieved and parsed ----
Writing out collected data...
Writing document 'falconer' to file C:\Program Files\Plucker\channels/falconer/falconer.pdb
Converting http://southseas.nla.gov.au/refs/falc/contents.html...
Converted 2: http://southseas.nla.gov.au/refs/falc/contents.html
Default charset is MIBenum 2252 (windows-1252)
New document <PluckerIndexDocument 'plucker:/~special~/index' at 9611924> added
Converted 1: plucker:/~special~/index
New document <PluckerMetadataDocument 'plucker:/~special~/metadata' at 9568372> added
Converted 5: plucker:/~special~/metadata
Wrote 1 <= plucker:/~special~/index
Wrote 2 <= http://southseas.nla.gov.au/refs/falc/contents.html
Wrote 5 <= plucker:/~special~/metadata
Unknown items encountered:
</tbody>: ['http://southseas.nla.gov.au/refs/falc/contents.html']
<tbody>: ['http://southseas.nla.gov.au/refs/falc/contents.html']
Done!
Installing channel output to destinations...
Setting new due date...
Tasks completed for all channels.
---------------------------------------------------------------------


If anyone could point out what I have been doing wrong, I'd be much obliged.


UPD: Well, I have succeeded in spidering the site after downloading Sunrise XP. The one minor setback was that Sunrise turned out to be a sneaky son of a bitch: its regexp filters default to "exclude", so I spent an hour trying to download the entire internet (I got about 18% done, according to the progress bar). So that problem is gone, but another one arose in its place, the problem of thread removal, which I have sadly failed to solve.
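
For anyone who hits the same filter surprise, here is a rough sketch of why the filter mode matters so much. This is not Sunrise's or Plucker's actual code; the function `should_follow` and its `mode` argument are made up for illustration, and the example.com link is a placeholder. It only shows the general idea that the same regexp keeps the spider on one site in "include" mode and sends it everywhere else in "exclude" mode.

```python
import re


def should_follow(url, patterns, mode):
    """Hypothetical link filter, for illustration only.

    mode="include": follow a URL only if it matches one of the patterns.
    mode="exclude": follow every URL except those matching a pattern.
    """
    matched = any(re.search(p, url) for p in patterns)
    return matched if mode == "include" else not matched


# One pattern meant to keep the spider inside the book's directory.
patterns = [r"^http://southseas\.nla\.gov\.au/refs/falc/"]

on_site = "http://southseas.nla.gov.au/refs/falc/contents.html"
off_site = "http://www.example.com/elsewhere.html"  # placeholder URL

for mode in ("include", "exclude"):
    for url in (on_site, off_site):
        print(mode, url, should_follow(url, patterns, mode))
```

With the pattern treated as "exclude", the only links that get dropped are the ones I actually wanted, and every off-site link is followed, which would be consistent with the runaway download described above.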

Last edited by goybert; 07-26-2006 at 06:44 AM.