Old 08-12-2005, 05:02 AM   #1
MatYadabyte
Zealot
Posts: 111
Karma: 1013536
Join Date: Aug 2005
Tomeraider offers metapublishing

Mat from Tomeraider is a nice person and I recommend you check out his link below -Alex.

We have just started releasing a new method to get ebooks. The idea is simple: You download a script that will convert an entire website into a neatly formatted e-book in TomeRaider format. Already we have some great sites ready for conversion, such as: eHow, How stuff works, BBC Medical, US City Travel Guide, Strange Stories.

The site is downloaded by your computer and processed on your computer.

I think you will agree that, though in the early stages, this is pretty revolutionary. We are calling this process "metapublishing". If you know of any sites you would like us to write a raid script for, just ask me via email or reply to this thread.
Old 08-12-2005, 06:46 AM   #2
doctorow
Guru
Posts: 914
Karma: 3410461
Join Date: May 2004
Device: Kindle Touch
Thanks Mat.

Sounds a bit like Mobipocket's eNews, doesn't it?
Old 08-12-2005, 11:22 AM   #3
hacker
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Sitescooper has been doing this for at least the last 5 years, probably longer, and it supports quite a few output formats (including my favorite).

This isn't revolutionary at all, and it seems everyone is jumping onto the "per-site template" bandwagon: Sunrise, Mobipocket and now TomeRaider. It's a dead-end approach, since the "templates" are very fragile. Add an inner table on the site and your whole template breaks and has to be debugged and written over. Rename a page resource on the server side, change a query string and all of it topples down into a pile of goo.

It's a dead-end direction for "metapublishing" from the client perspective. Now, if content providers offered their website for download as a neatly packaged ebook-type-of-file, that would be a different story entirely, but hardly any do.

The fear (speaking as a content provider) is that we're getting pounded by thousands of users who are all running these tools, crawlers, spiders and other things against our sites (and our clients' sites) without considering the implications of a script that requests 1,600 pages in 5 minutes (as someone did yesterday on one of our servers, trying to get the entire history of the jpilot mailing list with some Java tool).

I'm blocking dozens per day, and I'll continue to block them until they begin to adhere to the robots.txt specification and learn to respect Crawl-delay and the If-Modified-Since header for feeds and other content.
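
For tool authors wondering what "respecting robots.txt and Crawl-delay" might look like in practice, here is a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt contents, the "ExampleRaider" user agent, the fallback delay and the helper names are all made up for illustration; a real client would download the site's actual robots.txt and sleep for the stated delay between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the sketch runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

DEFAULT_DELAY = 5.0  # assumed fallback (seconds) when no Crawl-delay is given


def load_rules(robots_text):
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def fetch_plan(parser, agent, urls):
    """Return (url, seconds_to_wait_before_fetching) for each permitted URL."""
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = DEFAULT_DELAY
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    # No wait before the first request, the full delay before every later one.
    return [(u, 0.0 if i == 0 else float(delay)) for i, u in enumerate(allowed)]


rules = load_rules(SAMPLE_ROBOTS)
plan = fetch_plan(rules, "ExampleRaider", [
    "http://example.com/articles/1",
    "http://example.com/private/notes",
    "http://example.com/articles/2",
])
for url, wait in plan:
    print(f"wait {wait:.0f}s, then fetch {url}")
```

With a 10-second delay like the one in this made-up file, the 1,600 pages mentioned above would take roughly four and a half hours to fetch instead of five minutes.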

Most of the RSS readers out there are another perfect example. The whole point of RSS is to syndicate the news, but instead we get 5,000 people fetching the same feed every hour, even though it specifically says not to fetch it more than once a week. Even then, in some cases there are no new items in a couple of weeks, but they still fetch it every hour anyway (ignoring If-Modified-Since).
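
A feed reader can avoid these repeated full downloads by sending If-Modified-Since with the time of its last successful fetch and treating a 304 Not Modified reply as "nothing new". The sketch below only builds the request and does not send anything; the feed URL, client name and function names are illustrative assumptions.

```python
import urllib.request
from email.utils import formatdate

FEED_URL = "http://example.com/feed.rss"  # placeholder, not a real feed


def build_conditional_request(url, last_fetch_epoch=None):
    """Build a GET request that asks for the body only if it has changed."""
    headers = {"User-Agent": "ExampleFeedReader/0.1"}  # hypothetical client name
    if last_fetch_epoch is not None:
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = formatdate(last_fetch_epoch, usegmt=True)
    return urllib.request.Request(url, headers=headers)


def needs_parsing(status_code):
    """304 means the cached copy is still current, so there is nothing to parse."""
    return status_code != 304


request = build_conditional_request(FEED_URL, last_fetch_epoch=0)
# urllib stores header names in capitalized form, hence the spelling below.
print(request.get_header("If-modified-since"))
```

When such a request is sent with urllib.request.urlopen, a 304 reply surfaces as an HTTPError whose code attribute is 304, so a real client would catch that and keep its cached copy.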

Sorry, now you're blocked.

I wish people who wrote these "tools" would consider the implications of what they're doing. Most do not.
Old 08-12-2005, 04:02 PM   #4
MatYadabyte
Hey Hacker

>>>>Sitescooper has been doing this for at least the last 5 years, probably longer, and it supports quite a few output formats (including my favorite).


Incorrect. Sitescooper merely gets the articles. It doesn’t homogenize them, categorize them or index them. It also doesn’t provide cross source searching, filtering or field sorting. If NewsRaider were just an application that leeched news articles, then yes, it would hardly be revolutionary.

>>>>Its a dead-end approach, since the "templates" are very fragile. Add an inner table on the site and your whole template breaks and has to be debugged and written over. Rename a page resource on the server-side, change a query string and all of it topples down in to a pile of goo.

Incorrect with reference to NewsRaider. If a site template changes, we get an alert, change the script and upload it, and it is automatically updated in every client. In general, users won’t even notice the change.

>>>>The fear (speaking as a content provider) is that we're getting pounded by thousands of users who are all running these tools, crawlers, spiders and other things against our sites (and our client's sites) without considering the implications of a script that requests 1,600 pages in 5 minutes (as someone did yesterday on one of our servers, trying to get the entire history of the jpilot mailing list with some Java tool).

I’m not sure what sites you are referring to, but NewsRaider is very polite in terms of bandwidth and retrieval. In our tests we find it saves bandwidth when compared against RSS and direct browsing. Sure, some of these “other things” are just plain rude, but NewsRaider isn’t; quite the contrary. You are jumping the gun a bit on these accusations.

>>>>>Sorry, now you're blocked.

From where?


>>>>I wish people who wrote these "tools" would consider the implications of what they're doing. Most do not.

As said, most do not. But we have been in the business of handheld content since way before AvantGo, Mobipocket and iSilo, and we do appreciate all of these concerns. NewsRaider was designed as a site-friendly, user-friendly and bandwidth-friendly application. And I am friendly too.

Mat
Old 08-12-2005, 04:27 PM   #5
hacker
Where in this entire thread was the product "NewsRaider" mentioned at all? I don't see anything in this script that does what you suggest (cross source searching, filtering or field sorting).

TomeRaider's "new" metapublishing system does nothing of the sort. It simply hits a full page, strips off the extraneous elements and converts it to TomeRaider format, EXACTLY like Sitescooper has been doing for over 5 years...

Or did I miss something unstated in this thread, something buried inside HTML comments or in white-on-white text?
Old 08-12-2005, 06:18 PM   #6
MatYadabyte
Ahhhh... my apologies, hacker.

I only found out about this site today, as someone started a thread on it about NewsRaider, which I replied to, and then thought I would let the forum know about the similar functionality in TomeRaider. I got confused between TomeRaider and its sister app NewsRaider.

TomeRaider is for “static” ebooks and NewsRaider is for dynamic news.


>>>> TomeRaider's "new" metapublishing system does nothing of the sort. It simply hits a full page, strips off the extraneous elements and converts it to TomeRaider format, EXACTLY like Sitescooper has been doing for over 5 years...


Have you tried it? I think you probably should before you comment because, believe it or not, it’s actually pretty impressive. Take the “How stuff works” script that you refer to. Once it has run you end up with a really great, super-compressed and indexed 20 meg (ish) ebook. This ebook is so handy, so portable and so fast to browse and search. This has never been done before.

I know this because I know that only TomeRaider can handle files of this size with such ease. (You might like to try the 1 gigabyte full Wikipedia on your handheld if you really want to see TomeRaider pushing some boundaries.)

So, again, my apologies for getting confused, but the fact remains: both TomeRaider and NewsRaider are revolutionary. Of course, I am biased, but I think the facts speak for themselves.

Best wishes

Mat
Old 08-12-2005, 07:26 PM   #7
hacker
I just took 5 minutes and recreated what you did with the HowStuffWorks site, using Perl. Here's the process:

  1. Fetch big.htm
  2. Slice out the links section (bounded by <ul></ul> tag pairs)
  3. Fetch each of the pages linked there, by appending /printable to the end of each URL
  4. Yank the middle section out of each 'printable' page (bounded by <!-- (start|end) of article body --> comment tags)
  5. Strip out any <script> and <style> tag pairs, including anything in-between them (as well as any attributes they use). No need for any of that on a PDA.
  6. Strip out a few key page elements (the categoryNav, first <center>.*?</center> tag, and the last <td align=right>.*</td> tag)
  7. Convert to Plucker.
So far, it looks great, and all in 10 lines of code. Here's some of my magic:

Code:
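# Strip <script> and <style> elements, including everything between the tags.
$content =~ s,<(s(?:cript|tyle))[^>]*>.*?</\1>,,gis;

# Keep only what sits between the article-body comment markers.
my ($start, $end) = map "<!-- $_ of article body -->", 'start', 'end';
$content =~ s,.*\Q$start\E(.*?)\Q$end\E.*,$1,is;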
Not too hard to do at all.
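
For readers who don't speak Perl, steps 4 and 5 of the process above can be sketched in Python roughly as follows. The two comment markers are the ones named in step 4; the sample page, the function name and the choice to return None when the markers are missing are illustrative assumptions, not part of the original script.

```python
import re

START = "<!-- start of article body -->"
END = "<!-- end of article body -->"

# Matches <script>...</script> and <style>...</style>, attributes included.
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)


def extract_article(html):
    """Return the markup between the article-body markers, minus scripts/styles."""
    begin = html.find(START)
    finish = html.find(END, begin + len(START)) if begin != -1 else -1
    if begin == -1 or finish == -1:
        return None  # markers missing: the site layout has probably changed
    body = html[begin + len(START):finish]
    return SCRIPT_OR_STYLE.sub("", body).strip()


# Invented sample page, standing in for one fetched 'printable' article.
sample = (
    "<html><body><div id='nav'>menu</div>"
    "<!-- start of article body -->"
    "<h1>How Widgets Work</h1>"
    "<script type='text/javascript'>track();</script>"
    "<p>Widgets are simple.</p>"
    "<!-- end of article body -->"
    "<div id='footer'>ads</div></body></html>"
)
print(extract_article(sample))
```

The None branch is the fragility in miniature: the moment a site renames or drops those comment markers, the extraction has nothing to hold on to.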


Doing this for every site you want to fetch could be gruesome and painful. Avoid it; there are tools out there that do all of this already, using templates that describe each site's "stomach" or main content area.
Old 08-12-2005, 07:50 PM   #8
MatYadabyte
Kindly show me the script, in Perl or any language of your choice, that will take a site such as:

http://www.whfoods.com/foodstoc.php

And turn it into a perfectly formatted e-book, with images and a title page, indexed so it can be searched in an instant and browsed more easily than a webpage. The resultant file will be viewable on Palm OS, Pocket PC, Windows, Symbian and Smartphone. The result must be highly compressed and yet able to be accessed, on any of the above platforms, with unparalleled speed.

If you show me this script then my hat is my dinner. It has taken a long time to get here, but when I say TomeRaider/NewsRaider is revolutionary I really believe that. I have been working on it for 7 years and will, justifiably I think, defend the TomeRaider/NewsRaider corner.
Old 08-12-2005, 07:51 PM   #9
MatYadabyte
How come whenever I write "" in Word and paste it into here it becomes "?"?
Old 08-12-2005, 07:57 PM   #10
doctorow
Hacker, I wish I had your knowledge of Regex. *envy*
Old 08-12-2005, 08:01 PM   #11
Alexander Turcic
Fully Converged
Posts: 18,163
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
Quote:
Originally Posted by NewsRaider
How come whenever I write "" in Word and paste it into here it becomes "?"?
Looks strange indeed. Are you using some non-US encoding in Word?
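
One plausible cause, and it is only a guess since nothing in this thread says which character set the forum stores posts in: Word's autocorrect swaps straight quotes for curly ones, and those do not exist in ISO-8859-1, so software that cannot encode them often substitutes "?". A small Python demonstration of that effect:

```python
word_text = "\u201cquoted\u201d"  # curly double quotes, as Word's autocorrect produces

# ISO-8859-1 has no curly quotes; errors="replace" substitutes "?" for each one.
as_latin1 = word_text.encode("latin-1", errors="replace").decode("latin-1")

# UTF-8 can represent them, so they survive a round trip unchanged.
as_utf8 = word_text.encode("utf-8").decode("utf-8")

print(as_latin1)
print(as_utf8 == word_text)
```

If that is what is happening, turning off "smart quotes" in Word or typing straight into the reply box should avoid it.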
Old 08-12-2005, 08:45 PM   #12
doctorow
I tried NewsRaider today and it looks fine to me. I think some hardcore users like hacker will always prefer the manual way of writing custom Perl scripts to fine-tune the scraping process. But for non-geeks like myself, NR seems to be the easier and faster solution.
Old 08-12-2005, 11:21 PM   #13
hacker
There are several reasons I can't try tools like NewsRaider and TomeRaider:

  1. Doesn't run on my OS of choice (Linux or BSD)
  2. Doesn't come with source code so I can customize it to my needs
  3. Doesn't support my preferred Palm reader applications
  4. Uses "Yet Another Proprietary Templating System"
While I'm sure it's a great reader and I'm sure its output looks satisfactory, it's not for me, mostly for the reasons above.

I use Perl, Python, C, whatever happens to do the job, because there simply are no other alternatives.
Old 08-13-2005, 03:42 AM   #14
MatYadabyte
Hi Hacker


1. Doesn't run on my OS of choice (Linux or BSD)

Well… I can’t change that, but you can. If you want the source to do a Linux version, we could give that a whirl. I’d rather start with NewsRaider first.

2. Doesn't come with source code so I can customize it to my needs

If you write a Linux version of NewsRaider then that could be open source. I come from a freeware/shareware background, so I’ll probably need to be told exactly what that would mean.

3. Doesn't support my preferred Palm reader applications

We are currently working on a Palm dedicated version of NewsRaider. TomeRaider is a popular Palm Reader with hundreds of files not available in other formats.

4. Uses "Yet Another Proprietary Templating System"

I don’t see why that’s a bad thing. New systems need to be tried out in order to progress. It’s doubtful that we will be using XML or C++ or Python in 100 years’ time. New operating systems and languages will be started. Some will drift away like OPL has done and others will evolve. That is surely a good thing for technological progress and understanding?

We realised that for NewsRaider to work and be popular it needed a scripting system that even non-programmers could use. So the language is very simple, hard to mess up and easy to pick up. To a C++ eye it will look immature, but we are already getting total non-programmers giving it a whirl and getting results.


Anyway, thanks for this discussion; it’s been informative to me. And if you do want to do a Linux NewsRaider, the offer is there; if not, no worries.

Cheers

Mat
Old 08-13-2005, 01:27 PM   #15
gadgetguru
Addict
Posts: 214
Karma: 6370
Join Date: May 2003
Location: Asia
Device: Tungsten T5
MobileRead is the place where text reader developers converge, it seems: Hacker (Plucker), Laurens (Sunrise for Plucker), Mobipocket, and now Mat from TomeRaider. All noteworthy in their own little way, and all superlative programs for Palm and Windows Mobile handhelds...
All times are GMT -4.