#1
Connoisseur
Posts: 62
Karma: 10
Join Date: Jun 2005
Location: California, USA
Device: Palm TX SmartQ 7
Large documents slow and crashing?
I am using 0.41n on an XP workstation with 512 MB of RAM and lots of disk space, and I have the heap set to 1296 (yes, 1296). I have some pretty large documents (web sites that I take with me), anywhere from 200 to 1500 KB. One of them is www.candlepowerforums.com, a UBB forum; I have 0.41n crawl multiple topics within the same site.

If Sunrise can finish crawling a topic, the system does not crash, but sometimes it just crashes and closes Sunrise. Having the heap set this high seems to help, but I really can't go much higher; 1300 seems to be the maximum the system will allow. The forum topics also seem slow to fetch (I think the server is slow), so each thread can take 5–10+ minutes. Any ideas why it crashes, or what I might be able to do about the crashes? It crashes more often if I try to crawl multiple documents at the same time, so I have scaled back to only one document at a time, but that makes it slow to finish getting all the documents I crawl, as long as 45 minutes for a full update.

Another question: I know that Sunrise accesses 1–5 documents at a time, and I can also see that within each document it retrieves 2 links at a time. Is there any way to change that setting to more than 2 links within a document at one time? That might help me finish faster, because the forums I follow can pause a long time on some links. Running fewer documents at once has seemed to make Sunrise more stable, but at the expense of a full update of all my documents taking 45 minutes. It could potentially make a huge difference if it could do as many as 4 links at a time, and I believe 4 is a suggested maximum. But I would welcome any ideas you have that might make it more stable and faster.
#2
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Unfortunately, there is no real solution for this. The core JPluck library, which is responsible for generating the Plucker documents, makes wasteful use of memory, and that can't be resolved without rewriting the library from scratch (which I'm not going to do, since I'm working on my own product).

Sunrise is primarily meant for converting smaller content, such as news sites and RSS feeds, and is unsuitable for handling larger content. (The desktop tool for my commercial product does not have these weaknesses.)
#3
Connoisseur
Posts: 62
Karma: 10
Join Date: Jun 2005
Location: California, USA
Device: Palm TX SmartQ 7
Thanks for the help.
OK. Is there anything you can suggest on the 2-thread retrieval question? Is there a way to jimmy it to retrieve 4 threads simultaneously instead of just 2?
#4
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Quote:
The HTTP/1.1 spec (RFC 2616, section 8.1.4): "Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy."
#5
Fully Converged
Posts: 18,175
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
... too bad many clients don't necessarily follow this rule. In Firefox, you can easily change the number of connections by editing the network.http.max-persistent-connections-per-server preference.
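For reference, that preference can be changed in about:config or persisted in a user.js file in the Firefox profile directory. The value of 4 below is only an illustrative example, not a recommendation:

```javascript
// user.js — raise Firefox's per-server persistent connection limit.
// The default of 2 matches the HTTP/1.1 spec's suggested maximum;
// 4 here is just an example value.
user_pref("network.http.max-persistent-connections-per-server", 4);
```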
#6
Connoisseur
Posts: 62
Karma: 10
Join Date: Jun 2005
Location: California, USA
Device: Palm TX SmartQ 7
HTTP spec for multiple connections...
I know we SHOULD not, but for many slow sites I can and do open multiple browser windows to the same server, and I don't see how this is much different. The site is waiting or doing other processing anyway, and I am not going to overload it with an outrageous number of connections. I just want something more than 2, to get around the limitation in the Plucker code described above in this same thread.

Before we had Sunrise, we had multiple threads for the same file, and there WAS a way to get more than 4. It may have been a bug, but there still was a way. Might you share that exploit for those of us who are running into this problem? I know the server is not completely busy, and many of the embedded files are not local to the remote host, so I will not be downloading them anyway. I also don't see the difference between the two arrangements of windows open to the same server: say 3 documents x 2 threads = 6 connections, versus 2 documents x 3 threads = 6 connections. And I know that no matter what you do, if you overload the server too much, you are not going to get the data any faster! :-)
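The trade-off being argued here (total connections vs. connections per single host) can be sketched with a per-host semaphore. This is an illustrative sketch only, not how Sunrise actually works; the limit of 2 mirrors the HTTP/1.1 suggestion and `do_request` is a hypothetical callable:

```python
import threading
from collections import defaultdict
from urllib.parse import urlparse

PER_HOST_LIMIT = 2  # HTTP/1.1's suggested per-server maximum; a placeholder

# One semaphore per hostname; created lazily on first use.
_host_slots = defaultdict(lambda: threading.Semaphore(PER_HOST_LIMIT))

def fetch(url, do_request):
    """Run do_request(url), but never exceed PER_HOST_LIMIT
    simultaneous requests to any single host, regardless of how
    many documents' worth of threads are running."""
    host = urlparse(url).netloc
    with _host_slots[host]:
        return do_request(url)
```

Under this scheme, 3 documents x 2 threads and 2 documents x 3 threads both collapse to at most 2 in-flight requests per host, which is the poster's point.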
#7
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Quote:
Quote:
Quote:
I'm not alone here either; lots of system administrators are blocking and locking out misbehaving spiders by IP or netblock. Don't fall into that trap: make sure you add a delay between requests, and make sure you're parsing robots.txt properly with your spidering tool. Quote:
Seriously, I run a farm of VERY busy servers here doing lots of things (cvs/svn, mail, content generation, dbms, torrent trackers for distributing Free Software projects, mirroring major projects, etc.), and while the webservers may only see 200–300 connections per second, those particular boxes are VERY busy doing other things, which takes RAM, threads, and so on away from the webserver's processing. Quote:
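The polite-spidering advice above (honor robots.txt, pause between requests) can be sketched with Python's standard library. The robots.txt content, the "MySpider" user-agent, and the 1-second default delay are illustrative assumptions, not anything Sunrise actually does:

```python
import time
from urllib.robotparser import RobotFileParser

def polite_fetch(urls, fetch_one, robots_txt_lines, delay=1.0):
    """Fetch each allowed URL, pausing between requests.

    robots_txt_lines: the site's robots.txt, split into lines.
    fetch_one: a callable that performs the actual request.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt_lines)
    results = []
    for url in urls:
        if not rp.can_fetch("MySpider", url):  # hypothetical user-agent
            continue  # robots.txt disallows this path; skip it
        results.append(fetch_one(url))
        time.sleep(delay)  # be kind to the server between requests
    return results
```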
#8
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Quote:
#9
Connoisseur
Posts: 62
Karma: 10
Join Date: Jun 2005
Location: California, USA
Device: Palm TX SmartQ 7
So can you help with any kind of alternatives to the problem?
Speed would NOT be an issue, I guess, if I could run all the threads that go to different sites at the same time; I would still be limited to 2 threads/connections per site, but that would be fine. The problem is that the large files make it impossible to do more than 1 document at a time because of the Plucker file-size corruption issue. ARGHHHH!!!! So, help? You say wait for your commercial product, but for how long? I would like to have something that works in the meantime. Please, can you give me some practical suggestions?
#10
Technology Mercenary
Posts: 617
Karma: 2561
Join Date: Feb 2003
Location: East Lyme, CT
Device: Direct Neural Implant
Quote:
For example, let's say you want to fetch Slashdot, Freshmeat, and a Technorati RSS feed. Slashdot has a link in one of its articles to Freshmeat, and Freshmeat links back to Slashdot on its page. If you have Slashdot's fetch running in one thread and Freshmeat's fetch running in another, how do you stop them from fetching each other's content in duplicate (wasting bandwidth and fetch threads)? One answer lies in a shared pool of memory (there are other solutions, but this is one possibility). Quote:
You might try another distiller; there are about a half dozen that output the Plucker format (2 in Java, Sunrise, cplucker, Plucker's Python distiller, a third-party C++ distiller, PDAConvert, jSyncManager, and several others). Quote:
Either you get stuck with the bugs in Sunrise, or you buy his commercial product. He's hoping you buy his commercial product, obviously. But there's another option: just keep using Plucker, but don't build your documents with the proprietary Sunrise distiller.
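The shared-memory answer to the duplicate-fetch question above can be sketched as a visited-URL set shared by all fetch threads and guarded by a lock. This is an illustrative sketch, not Sunrise's or Plucker's actual code:

```python
import threading

class VisitedPool:
    """URL pool shared by all fetch threads: each URL is claimed
    at most once, no matter which thread sees it first."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, url):
        """Return True if the caller should fetch url,
        False if another thread already claimed it."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True
```

Each crawl thread would call `pool.claim(url)` before fetching, so the Slashdot thread and the Freshmeat thread can never both download the same cross-linked page.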
#11
Connoisseur
Posts: 62
Karma: 10
Join Date: Jun 2005
Location: California, USA
Device: Palm TX SmartQ 7
Thanks. I really like Sunrise and am reluctant to give it up, but it seems I am starting to be very limited and may be forced to migrate. Sad but true. Thanks for the suggestions.