#1 |
Junior Member
Posts: 4
Karma: 14
Join Date: Oct 2002
Location: Tokyo, Japan
|
http://www.iht.com/articleindex.html
This page lists every article in the paper. Set an exclusion for www.iht.com as a wildcard, then add an exception for www.iht.com/articles as a wildcard, because all of the articles are in this subdirectory. Unfortunately, you will have to scroll down a bit after clicking through each link because the paper's directory appears on every page, but you can get the whole paper in about 400k. |
#2 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jan 2003
|
Just wondering, as a new iSiloX user, what you mean by setting the exclude URLs for www.iht.com as a wildcard and the exceptions for www.iht.com/articles. I hope this isn't a dumb question or otherwise bothersome, but there it is. Thanks
|
#3 |
Junior Member
Posts: 4
Karma: 14
Join Date: Oct 2002
Location: Tokyo, Japan
|
Version 3.3 of iSiloX can exclude specific URLs or URL ranges from the spidered document.
Set the articleindex page as the page to be retrieved. Under the links tab, choose 1. At the bottom of that tab is a button labeled "URL filters". Click on this button, then click on "add exclusion". Type in "www.iht.com/" and set that as a wildcard. Everything beginning with that expression will be excluded, which at this point is everything on the IHT site. Next, add the inclusion filters. www.iht.com/articles will be one, since every article in the paper falls under this subdirectory with a numeric code assigned to it. I also added the homepage www.iht.com/articleindex as a regular expression. I don't know whether that is necessary. |
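The filter setup above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not how iSiloX works internally: the names `EXCLUDE_PREFIXES`, `INCLUDE_PREFIXES`, `normalize` and `should_fetch` are made up for the example, and it assumes that an inclusion filter always overrides an exclusion filter and that a "wildcard" filter means "anything beginning with this text".

```python
from urllib.parse import urlsplit

# Hypothetical filter lists mirroring the settings described in the post.
EXCLUDE_PREFIXES = ["www.iht.com/"]
INCLUDE_PREFIXES = ["www.iht.com/articles", "www.iht.com/articleindex"]


def normalize(url):
    """Drop the scheme so 'http://www.iht.com/x' compares as 'www.iht.com/x'."""
    parts = urlsplit(url if "://" in url else "//" + url)
    return parts.netloc + parts.path


def should_fetch(url, excludes=EXCLUDE_PREFIXES, includes=INCLUDE_PREFIXES):
    """Return True if the spider should follow this link.

    Assumption: inclusions (exceptions) win over exclusions, and anything
    matching neither list is allowed.
    """
    target = normalize(url)
    if any(target.startswith(prefix) for prefix in includes):
        return True
    if any(target.startswith(prefix) for prefix in excludes):
        return False
    return True


if __name__ == "__main__":
    for link in [
        "http://www.iht.com/articleindex.html",
        "http://www.iht.com/articles/12345.html",
        "http://www.iht.com/sports/index.html",
    ]:
        print(link, should_fetch(link))
```

With these lists, the index page and anything under /articles is followed, while every other page on www.iht.com is skipped.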
#4 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jan 2003
|
Great,
Thanks for the info. I am using version 3.25, which I think is most of the reason I didn't understand... |
#5 |
Enthusiast
Posts: 34
Karma: 3184
Join Date: Nov 2002
Location: NYC
Device: Axim x51v;T|X;NX73;SEK750
|
Too bad there's no way to fool the site into thinking iSilo is JavaScript capable; then the headers would be removed. Or is this possible?
|