#1
Zealot
Posts: 111
Karma: 1013536
Join Date: Aug 2005
TomeRaider offers metapublishing
We have just started releasing a new method to get ebooks. The idea is simple: you download a script that will convert an entire website into a neatly formatted e-book in TomeRaider format. We already have some great sites ready for conversion, such as eHow, How Stuff Works, BBC Medical, US City Travel Guide and Strange Stories.

The site is downloaded by your computer and processed on your computer. I think you will agree that, though in its early stages, this is pretty revolutionary. We are calling this process "metapublishing".

If you know of any sites you would like us to write a raid script for, just ask me via email or reply to this thread.

#3
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Sitescooper has been doing this for at least the last 5 years, probably longer, and it supports quite a few output formats (including my favorite).

This isn't revolutionary at all, and it seems everyone is jumping onto the "per-site template" bandwagon: Sunrise, Mobipocket and now TomeRaider. It's a dead-end approach, since the "templates" are very fragile. Add an inner table on the site and your whole template breaks and has to be debugged and rewritten. Rename a page resource on the server side or change a query string, and all of it topples into a pile of goo. It's a dead-end direction for "metapublishing" from the client perspective.

Now, if content providers offered their websites for download as a neatly packaged ebook-type file, that would be a different story entirely, but hardly any do.

The fear (speaking as a content provider) is that we're getting pounded by thousands of users who are all running these tools, crawlers, spiders and other things against our sites (and our clients' sites) without considering the implications of a script that requests 1,600 pages in 5 minutes (as someone did yesterday on one of our servers, trying to get the entire history of the jpilot mailing list with some Java tool). I'm blocking dozens per day, and I'll continue to block them until they begin to adhere to the robots.txt specification and learn to respect Crawl-delay and the If-Modified-Since header for feeds and other content.

Most of the RSS readers out there are another perfect example. The whole point of RSS is to syndicate the news, but instead we get 5,000 people fetching the same feed every hour, even though the feed specifically says not to fetch it more than once a week. Even then, in some cases there are no new items for a couple of weeks, but they still fetch it every hour anyway (ignoring If-Modified-Since). Sorry, now you're blocked.

I wish people who wrote these "tools" would consider the implications of what they're doing. Most do not.
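To make the complaint concrete, here is a minimal sketch of a fetcher that does what this post asks for: it checks robots.txt, waits for the Crawl-delay between requests, and sends If-Modified-Since so an unchanged page costs the server almost nothing. It is written in Python rather than the Perl used elsewhere in the thread. The user-agent name, the 5-second fallback delay and the function names are assumptions made up for illustration; they do not describe how any tool mentioned here works.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser
from email.utils import formatdate

USER_AGENT = "example-ebook-fetcher"  # hypothetical name, not a real tool
DEFAULT_DELAY = 5.0                   # assumed fallback when no Crawl-delay is given


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(rules):
    """Seconds to wait between requests: the site's Crawl-delay, else the fallback."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_DELAY


def conditional_request(url, last_fetch_epoch=None):
    """Build a GET request that asks only for content newer than the last fetch."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    if last_fetch_epoch is not None:
        request.add_header("If-Modified-Since",
                           formatdate(last_fetch_epoch, usegmt=True))
    return request


def fetch_if_changed(url, rules, last_fetch_epoch=None):
    """Return the body, or None when the URL is disallowed or unchanged (HTTP 304)."""
    if not rules.can_fetch(USER_AGENT, url):
        return None
    try:
        with urllib.request.urlopen(conditional_request(url, last_fetch_epoch)) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # not modified since the last fetch
            return None
        raise


def crawl(urls, rules, last_fetch_epoch=None):
    """Fetch each URL in turn, pausing between requests as the site requests."""
    pages = {}
    for url in urls:
        body = fetch_if_changed(url, rules, last_fetch_epoch)
        if body is not None:
            pages[url] = body
        time.sleep(delay_for(rules))
    return pages
```

With a Crawl-delay of 10 seconds, the 1,600 pages mentioned above would take about four and a half hours to fetch instead of 5 minutes.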

#4
Zealot
Posts: 111
Karma: 1013536
Join Date: Aug 2005
Hey Hacker

>>>>Sitescooper has been doing this for at least the last 5 years, probably longer, and it supports quite a few output formats (including my favorite).
Incorrect. Sitescooper merely gets the articles. It doesn't homogenize them, categorize them or index them. It also doesn't provide cross-source searching, filtering or field sorting. If NewsRaider were just an application that leeched news articles, then yes, it would hardly be revolutionary.

>>>>Its a dead-end approach, since the "templates" are very fragile. Add an inner table on the site and your whole template breaks and has to be debugged and written over. Rename a page resource on the server-side, change a query string and all of it topples down in to a pile of goo.
Incorrect with reference to NewsRaider. If a site template changes, we get an alert, change the script and upload it, and it is automatically updated in every client. In general, users won't even notice the change.

>>>>The fear (speaking as a content provider) is that we're getting pounded by thousands of users who are all running these tools, crawlers, spiders and other things against our sites (and our client's sites) without considering the implications of a script that requests 1,600 pages in 5 minutes (as someone did yesterday on one of our servers, trying to get the entire history of the jpilot mailing list with some Java tool).
I'm not sure what sites you are referring to, but NewsRaider is very polite in terms of bandwidth and retrieval. In our tests we find it saves bandwidth compared with RSS and direct browsing. Sure, some of these "other things" are just plain rude, but NewsRaider isn't; quite the contrary. You are jumping the gun a bit with these accusations.

>>>>Sorry, now you're blocked.
From where?

>>>>I wish people who wrote these "tools" would consider the implications of what they're doing. Most do not.
As said, most do not. But we have been in the business of handheld content since way before AvantGo, Mobipocket and iSilo, and we do appreciate all of these concerns. NewsRaider was designed as a site-friendly, user-friendly and bandwidth-friendly application. And I am friendly too.

Mat

#5
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Where in this entire thread was the product "NewsRaider" mentioned at all? I don't see anything in this script that does what you suggest (cross-source searching, filtering or field sorting).

TomeRaider's "new" metapublishing system does nothing of the sort. It simply hits a full page, strips off the extraneous elements and converts it to TomeRaider format, EXACTLY like Sitescooper has been doing for over 5 years... Or did I miss something unstated in this thread, something buried inside HTML comments or in white-on-white text?

#6
Zealot
Posts: 111
Karma: 1013536
Join Date: Aug 2005
Ahhhh... my apologies, hacker.

I only found out about this site today, as someone started a thread on it about NewsRaider. I replied to that and then thought I would let the forum know about the similar functionality in TomeRaider. I got confused between TomeRaider and its sister app NewsRaider. TomeRaider is for "static" ebooks and NewsRaider is for dynamic news.

>>>>TomeRaider's "new" metapublishing system does nothing of the sort. It simply hits a full page, strips off the extraneous elements and converts it to TomeRaider format, EXACTLY like Sitescooper has been doing for over 5 years...
Have you tried it? I think you probably should before you comment because, believe it or not, it's actually pretty impressive. Take the "How stuff works" script that you refer to. Once it has run, you end up with a really great, super-compressed and indexed 20 meg (ish) ebook. This ebook is so handy, so portable and so fast to browse and search. This has never been done before. I know this because I know that only TomeRaider can handle files of this size with such ease. (You might like to try the 1 gigabyte full Wikipedia on your handheld if you really want to see TomeRaider pushing some boundaries.)

So, again, my apologies for getting confused, but the fact remains: both TomeRaider and NewsRaider are revolutionary. Of course, I am biased, but I think the facts speak for themselves.

Best wishes
Mat

#7
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
I just took 5 minutes and recreated what you did with the HowStuffWorks site, using Perl. Here's the process:
Code:
    # Strip <script> and <style> blocks, including their contents
    $content =~ s,<(s(?:cript|tyle))[^>]*>.*?</\1>,,gis;
    # Build the comment markers that bracket the article body
    my ($start, $end) = map "<!-- $_ of article body -->", 'start', 'end';
    # Keep only what lies between the two markers
    $content =~ s,.*${start}(.*?)${end}.*,$1,gis;

Doing this for every site you want to fetch could be gruesome and painful. Avoid it; there are tools out there that do all of this already, using templates that describe each site's "stomach" or main content area.
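For readers who do not read Perl, the same two steps can be sketched in Python. This is an illustration, not a port: it assumes the page carries the "start of article body" and "end of article body" comment markers used in the Perl above, it takes the first pair of markers it finds, and the function name `extract_article` is invented here. Raising an error when the markers are missing also shows the fragility discussed earlier in the thread: a layout change makes the template fail outright.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> block
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>",
                             re.IGNORECASE | re.DOTALL)

# Markers taken from the Perl snippet above
START = "<!-- start of article body -->"
END = "<!-- end of article body -->"


def extract_article(html):
    """Return the markup between the article markers, minus scripts and styles."""
    cleaned = SCRIPT_OR_STYLE.sub("", html)
    begin = cleaned.find(START)
    finish = cleaned.find(END, begin + len(START)) if begin != -1 else -1
    if begin == -1 or finish == -1:
        # The site changed its layout, so this per-site template no longer applies
        raise ValueError("article markers not found; the page layout may have changed")
    return cleaned[begin + len(START):finish].strip()
```

A caller would catch the `ValueError` and flag the site's template for repair rather than emit an empty or mangled article.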

#8
Zealot
Posts: 111
Karma: 1013536
Join Date: Aug 2005
Kindly show me the script, in Perl or any language of your choice, that will take a site such as:
http://www.whfoods.com/foodstoc.php
and turn it into a perfectly formatted e-book, with images and a title page, indexed so it can be searched in an instant and browsed more easily than a webpage. The resultant file will be viewable on Palm OS, Pocket PC, Windows, Symbian and Smartphone. The result must be highly compressed and yet able to be accessed, on any of the above platforms, with unparalleled speed. If you show me this script then my hat is my dinner.

It has taken a long time to get here, but when I say TomeRaider/NewsRaider is revolutionary I really believe that. I have been working on it for 7 years and will, justifiably I think, defend the TomeRaider/NewsRaider corner.
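The combination asked for here, a file that is highly compressed yet quick to open at any article, is commonly achieved by compressing each article separately and keeping a small index of where each one sits. The Python sketch below shows only that general idea. It is not TomeRaider's file format, which this thread does not describe, and the function names and the choice of zlib are assumptions made for illustration.

```python
import zlib


def build_book(articles):
    """Compress each article on its own and record its offset and length.

    `articles` maps a title to its text. Returns the packed bytes and an
    index of lowercase title -> (offset, length).
    """
    blob = bytearray()
    index = {}
    for title in sorted(articles):
        packed = zlib.compress(articles[title].encode("utf-8"), 9)
        index[title.lower()] = (len(blob), len(packed))
        blob += packed
    return bytes(blob), index


def read_article(blob, index, title):
    """Decompress one article without touching the rest of the book."""
    offset, length = index[title.lower()]
    return zlib.decompress(blob[offset:offset + length]).decode("utf-8")


def search_titles(index, prefix):
    """Return the indexed titles that begin with the given prefix."""
    prefix = prefix.lower()
    return [title for title in index if title.startswith(prefix)]
```

Because only the requested slice is decompressed, lookup time depends on the size of one article rather than the size of the whole book.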

#9
Zealot
Posts: 111
Karma: 1013536
Join Date: Aug 2005
How come whenever I write "

#10
Guru
Posts: 914
Karma: 3410461
Join Date: May 2004
Device: Kindle Touch
Hacker, I wish I had your knowledge of Regex. *envy*

#11
Fully Converged
Posts: 18,171
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
Quote:

#12
Guru
Posts: 914
Karma: 3410461
Join Date: May 2004
Device: Kindle Touch
I tried NewsRaider today and it looks fine to me. I think some hardcore users like hacker will always prefer the manual way of writing custom Perl scripts to fine-tune the scraping process. But for non-geeks like myself, NR seems to be the easier and faster solution.

#13
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
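There are several reasons I can't try tools like NewsRaider and TomeRaider:
1. Doesn't run on my OS of choice (Linux or BSD)
2. Doesn't come with source code so I can customize it to my needs
3. Doesn't support my preferred Palm reader applications
4. Uses "Yet Another Proprietary Templating System"

I use Perl, Python, C, whatever happens to do the job, because there simply are no other alternatives.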

#14
Zealot
Posts: 111
Karma: 1013536
Join Date: Aug 2005
Hi Hacker
1. Doesn't run on my OS of choice (Linux or BSD)
Well… I can't change that, but you can. If you want the source to do a Linux version, we could give that a whirl. I'd rather start with NewsRaider first.

2. Doesn't come with source code so I can customize it to my needs
If you write a Linux version of NewsRaider then that could be open source. I come from a freeware/shareware background, so I'll probably need to be told exactly what that would mean.

3. Doesn't support my preferred Palm reader applications
We are currently working on a dedicated Palm version of NewsRaider. TomeRaider is a popular Palm reader with hundreds of files not available in other formats.

4. Uses "Yet Another Proprietary Templating System"
I don't see why that's a bad thing. New systems need to be tried out in order to progress. It's doubtful that we will be using XML or C++ or Python in 100 years' time. New operating systems and languages will be started. Some will drift away like OPL has done and others will evolve. That is surely a good thing for technological progress and understanding?

We realised that for NewsRaider to work and be popular it needed a scripting system that even non-programmers could use. So the language is very simple, hard to mess up and easy to pick up. To a C++ eye it will look immature, but we are already getting total non-programmers giving it a whirl and getting results.

Anyway, thanks for this discussion; it's been informative to me.

Cheers
Mat

#15
Addict
Posts: 214
Karma: 6370
Join Date: May 2003
Location: Asia
Device: Tungsten T5
MobileRead is the place where text reader developers converge, it seems: Hacker (Plucker), Laurens (Sunrise for Plucker), Mobipocket, and now Mat from TomeRaider. All noteworthy in their own little way, and all superlative programs for Palm and Windows Mobile handhelds...
Thread | Thread Starter | Forum | Replies | Last Post |
HTML to TomeRaider 3 | pruss | Reading and Management | 4 | 10-09-2007 04:46 AM |
tomeraider convert | leszcz2 | Workshop | 1 | 08-30-2007 06:47 PM |
TomeRaider 3: P r i n t i s D e a d | sUnShInE | Reading and Management | 7 | 11-16-2004 05:15 PM |
TomeRaider 3 is Final now | Colin Dunstan | Reading and Management | 3 | 10-23-2004 01:20 PM |
TomeRaider 3 first beta available | Colin Dunstan | Reading and Management | 7 | 10-18-2004 12:39 PM |