Old 06-18-2005, 02:13 AM   #1
wwang
Connoisseur
wwang began at the beginning.
 
Posts: 62
Karma: 10
Join Date: Jun 2005
Location: California, USA
Device: Palm TX SmartQ 7
Large documents slow and crashing?

I am using Sunrise 0.41n on a Windows XP workstation with 512 MB of RAM and lots of disk space.

I have the heap set to 1296. Yes, this is 1296.

I have some pretty large documents (web sites that I take with me). The sites may be anywhere from 200 to 1500 KB, and they seem slow to crawl at that size. The site is www.candlepowerforums.com. It is a UBB forum. I have Sunrise 0.41n crawl multiple topics within the same site.

If Sunrise manages to crawl the topic, the system does not crash. Sometimes, though, Sunrise just crashes and closes.

Having the heap set this high seems to help, but I really can't go much higher; the system will not allow it. 1300 seems to be the max.

The forum topics on the server seem to be slow (I think the server itself is slow), so each thread seems to take 5-10+ minutes.

Any ideas why it crashes or what I might be able to do about the crashes?

It crashes more often if I try to crawl multiple documents at the same time, so I have scaled back to doing only one document at a time. That makes it slow to finish getting all the documents I crawl: as long as 45 minutes for a full update.

Another question. I know that Sunrise accesses 1-5 documents at a time. I can also see that within each document, it crawls 2 links at a time. Is there any way to change that setting to more than 2 links within a document at one time? This might help me finish faster, because the forums that I follow may pause a long time on some of the links. Running fewer documents at once has seemed to make Sunrise more stable, but at the expense of a full update of all my documents taking 45 minutes. It could potentially make a huge difference if it could do as many as 4 links at a time. And I believe that processing 4 links is a suggested max?

But I would take any ideas you have that might make it more stable and faster.
Old 06-18-2005, 07:23 AM   #2
Laurens
Jah Blessed
Laurens is no ebook tyro.
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Unfortunately, there is no real solution for this. The core JPluck library, responsible for generating Plucker documents, makes wasteful use of memory. This can't be resolved, at least not without rewriting the library from scratch. (Which I'm not going to do, since I'm working on my own product.)

Sunrise is primarily meant for converting smaller content, such as news sites and RSS feeds, and is unsuitable for handling larger content. (The desktop tool for my commercial product does not have these weaknesses.)
Old 06-18-2005, 11:38 PM   #3
wwang
thanks for the help

OK. Is there anything you can help out with on the 2-thread retrieval question? Is there a way to jimmy it to 4 threads retrieved simultaneously instead of just 2?
Old 06-19-2005, 04:00 AM   #4
Laurens
Quote:
Originally Posted by wwang
OK. Is there anything you can help out with on the 2-thread retrieval question? Is there a way to jimmy it to 4 threads retrieved simultaneously instead of just 2?
The number of threads is hardcoded at 2 per host, as stated in the HTTP spec:

Quote:
A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
Old 06-19-2005, 04:07 AM   #5
Alexander Turcic
Fully Converged
Alexander Turcic ought to be getting tired of karma fortunes by now.
Posts: 18,175
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
... too bad many clients don't necessarily follow this rule. In Firefox, you can easily change the number of connections by changing the network.http.max-persistent-connections-per-server value.
Old 06-20-2005, 08:49 AM   #6
wwang
http spec for multiple connections...

I know we SHOULD NOT, but I know that for many slow sites, I can and do open multiple windows to the same site. I don't see how this is much different... I know the site is waiting or doing other processing, so I am not going to overload it by making an outrageous number of connections. Just something more than 2 to get around a limitation in the Plucker code problem listed above in this same thread.

I know before we had Sunrise, we had multiple threads for the same file, and there was a WAY I could get more than 4. It may have been a bug, but there still was a way. Might you share the same exploit for those of us who might be running into a problem? I know that the server is not completely busy, and I know there are many files I am not going to be requesting, because they are not local to the remote host, so I will not be downloading many of the embedded files.

I also don't see the difference between having multiple windows open to the same server, say 3 documents x 2 threads = 6 connections, or 2 documents x 3 threads = 6 connections...

I know that no matter what you do, if you overload the server too much, you are not going to get the data any faster! :-)
Old 06-20-2005, 09:06 AM   #7
hacker
Technology Mercenary
hacker plays well with others
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Quote:
Originally Posted by wwang
I know the site is waiting or doing other processing, so I am not going to overload it by making an outrageous number of connections.
How do you KNOW the site is doing processing, and not just queuing up your requests? The more requests you make, the fewer others can make. Webservers don't just arbitrarily allow an infinite number of connections, you know. If you hit the server with 50 requests at once, and the maximum allowed requests is set to, say, 500, then you just ate a good portion of the requests that other users could have used.

Quote:
Just something more than 2 to get around a limitation in the Plucker code problem listed above in this same thread.
You mean a limitation in Sunrise, not Plucker.

Quote:
I know before we had Sunrise, we had multiple threads for the same file, and there was a WAY I could get more than 4. It may have been a bug, but there still was a way.
Just be careful, you might find your IP or your host locked out if you pound servers too fast with requests. I know I've personally locked out over 900 unique hosts for slamming domains I host with their mobile spider tools, because they didn't add a delay between requests, or they ignored robots.txt and spidered content they weren't allowed to spider, or many other reasons.

I'm not alone here either, lots of system administrators are blocking and locking out misbehaving spiders via IP or netblock. Don't fall into the trap. Make sure you add a delay between requests and make sure you're parsing robots.txt properly with your spidering tool.
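For anyone scripting their own fetches, here is a minimal sketch of the two habits described above: honoring robots.txt and pausing between requests. It uses Python's standard `urllib.robotparser`. The robots.txt text, the user-agent name, the URLs and the helper names (`build_parser`, `polite_fetch_plan`, `crawl`) are all made up for illustration; this is not how Sunrise or any Plucker distiller is implemented.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real spider would download this
# from the site it is about to crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_fetch_plan(parser, user_agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for this user agent.

    The delay is the site's Crawl-delay when it declares one,
    otherwise default_delay (an assumed fallback value).
    """
    delay = parser.crawl_delay(user_agent) or default_delay
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    return allowed, delay


def crawl(parser, user_agent, urls, fetch, sleep=time.sleep):
    """Fetch the allowed URLs one at a time, sleeping between requests."""
    allowed, delay = polite_fetch_plan(parser, user_agent, urls)
    results = []
    for index, url in enumerate(allowed):
        if index > 0:
            sleep(delay)  # pause before every request after the first
        results.append(fetch(url))
    return results
```

The `fetch` and `sleep` arguments are passed in so the logic can be tried without touching the network; in a real run `fetch` would be whatever function actually downloads a page.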

Quote:
I know that the server is not completely busy, and I know there are many files I am not going to be requesting, because they are not local to the remote host, so I will not be downloading many of the embedded files.
Might I ask how you know that the server isn't busy? How do you know it's one server involved? Do you know what processor and RAM it has? Do you run the server?

Seriously, I run a farm of VERY busy servers here doing lots of things (cvs/svn, mail, content generation, dbms, torrent trackers for distributing Free Software projects, mirroring major projects, etc.), and while the webservers may only see 200-300 connections per second, those particular boxes are VERY busy doing other things, which takes away RAM, threads, and so on from the webserver's processing.

Quote:
I know that no matter what you do, if you overload the server too much, you are not going to get the data any faster! :-)
...or at all. Just be careful you don't get yourself, or your provider's netblock banned for slamming a server too fast or too hard.
Old 06-20-2005, 12:26 PM   #8
Laurens
Quote:
Originally Posted by wwang
I also don't see the difference between having multiple windows open to the same server, say 3 documents x 2 threads = 6 connections, or 2 documents x 3 threads = 6 connections...
Sunrise pools HTTP requests across all active updates, meaning that it only performs 2 simultaneous requests to a given host at any time, regardless of the number of updates that are currently running. Even if there are 5 documents updating and connecting to the same host, it will still use only 2 connections.
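To illustrate the idea (not Sunrise's actual code, which is not shown in this thread), a shared per-host limit can be sketched with one semaphore per hostname that every running update goes through. `HostLimiter`, `MAX_PER_HOST` and the `fetch` callback are hypothetical names chosen for this sketch.

```python
import threading
from urllib.parse import urlsplit

MAX_PER_HOST = 2  # the per-host limit discussed in this thread


class HostLimiter:
    """Share one connection budget per host across all running updates."""

    def __init__(self, limit=MAX_PER_HOST):
        self._limit = limit
        self._lock = threading.Lock()
        self._semaphores = {}  # hostname -> BoundedSemaphore

    def _semaphore_for(self, url):
        host = urlsplit(url).hostname or ""
        with self._lock:
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._limit)
            return self._semaphores[host]

    def run(self, url, fetch):
        """Call fetch(url) once a slot for the URL's host is free."""
        with self._semaphore_for(url):
            return fetch(url)
```

Because every document update shares the same `HostLimiter`, starting 5 updates against one host still results in at most 2 requests in flight to that host, while requests to other hosts get their own budget.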
Old 06-21-2005, 09:38 PM   #9
wwang
so can you help with any kind of alternatives to the problem?

Speed would NOT be an issue, I guess, if I could run all the threads that go to different sites at the same time, so I would still be limited by the 2 threads/connections issue. But I run into the problem that the large files make it impossible to do more than 1 document at a time because of the Plucker file size corruption issue. ARGHHHH!!!! So help? You say wait for your commercial product, but for how long? I would like to have something in the meantime that works... Please can you help give me some practical suggestions?
Old 06-21-2005, 10:24 PM   #10
hacker
Quote:
Originally Posted by wwang
Speed would NOT be an issue, I guess, if I could run all the threads that go to different sites at the same time, so I would still be limited by the 2 threads/connections issue.
This is actually very tricky to get right while also avoiding fetching duplicate content in separate threads.

For example, let's say you want to fetch Slashdot and Freshmeat and a Technorati RSS feed. Slashdot has a link in one of its articles to Freshmeat, and Freshmeat links back to Slashdot on its page. If you have Slashdot's fetch running in one thread and Freshmeat's fetch running in another, how do you stop them from fetching each other's content in duplicate (wasting bandwidth and fetch threads)? The answer lies in shared pools of memory in one case (there are others, but this is one possible solution).
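A rough sketch of that shared-memory approach: every fetch thread asks a single lock-protected set whether a URL has already been claimed before downloading it. `SharedVisited`, `crawl_document` and the `.example` URLs are invented for this illustration and are not taken from any of the distillers mentioned here.

```python
import threading


class SharedVisited:
    """A thread-safe set of URLs shared by all fetch threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, url):
        """Return True only for the first caller to claim this URL."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


def crawl_document(urls, visited, fetch):
    """Fetch each URL unless another thread has already claimed it."""
    fetched = []
    for url in urls:
        if visited.claim(url):
            fetched.append(fetch(url))
    return fetched
```

With one `SharedVisited` instance passed to both the Slashdot thread and the Freshmeat thread, whichever thread reaches a cross-linked page first fetches it and the other skips it. Checking and adding under the same lock is what prevents two threads from both deciding a URL is new.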

Quote:
But I run into the problem that the large files make it impossible to do more than 1 document at a time because of the Plucker file size corruption issue. ARGHHHH!!!! So help?
Plucker's Python and C++ distillers don't suffer from these problems. I regularly build 700M+ Plucker documents which work perfectly (though they take a very long time to build, of course).

You might try another distiller; there are about half a dozen of them that output the Plucker format (2 in Java, Sunrise, cplucker, Plucker's Python distiller, a third-party C++ distiller, PDAConvert, jSyncManager and several others).

Quote:
You say wait for your commercial product, but for how long? I would like to have something in the meantime that works... Please can you help give me some practical suggestions?
The commercial product Laurens is writing is not going to be using the Plucker format (unless he changed his mind). When/if that commercial version is released, he is dropping support for Plucker, so you'll have no choice if you've migrated all of your documents to Sunrise.

Either you get stuck with the bugs in Sunrise, or you buy his commercial product. He's hoping you buy his commercial product, obviously. But there's another option... just keep using Plucker, but don't build your documents with the proprietary Sunrise distiller.
Old 06-22-2005, 10:12 PM   #11
wwang
Thanks. I really like Sunrise and am reluctant to give it up, but it seems I am starting to be very limited and may be forced to migrate. Sad but true. Thanks for some of the suggestions.

All times are GMT -4.