Apologies for the outage
Alexander Turcic 12-29-2006, 12:03 PM Duh, we just had our first real outage. I think Murphy's Law is involved, because it happened during the only two days when I was unable to access the Internet for other reasons.
I wish I could say who was to blame for the outage, like a squirrel which somehow squeezed its way into our server, but the truth is: I don't know. A note popped up in my e-mail saying that our Apache server went down on December 28, 2006 05:22:23 PST. A simple restart did the trick, alas with a day's delay thanks to my absence. We'll make sure this is not going to happen again!
Thanks to everyone who informed us, including David @ TeleRead, Laurens, Simon, Roland, Daniel, and everyone else!
Anyways, I hope every one of you is saddled up and ready to ride into a Happy New Year!
Hadrien 12-29-2006, 12:34 PM We use another machine to ping the server on Feedbooks. If the ping's not working, it sends us a text message on our cellphone, which is pretty useful. You still need an Internet connection with SSH access to the server, though, to restart everything...
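For anyone who wants to try the idea, here is a minimal sketch of such an external check in Python. It is not the Feedbooks setup: it swaps the ping for a plain HTTP request, and `alert` is a hypothetical callback standing in for whatever sends the text message.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with an HTTP status below 500."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, so it is up unless it reports a 5xx error.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def check_and_alert(url, alert, is_up=site_is_up):
    """Probe `url` once and call `alert(message)` if the probe fails."""
    if is_up(url):
        return True
    alert(f"{url} is not responding")
    return False
```

Run `check_and_alert` from cron on a second machine, passing your own `alert` function. As noted above, someone still has to log in and restart the server.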
tribble 12-29-2006, 12:46 PM If a simple apache restart did the trick, you might want to take a look at this: http://www.tildeslash.com/monit/
:)
Lisana 12-29-2006, 01:02 PM Here I'd just found the forum a few days ago, and you were gone! :o I'm glad that all it took was a simple server restart. :)
Alexander Turcic 12-29-2006, 01:49 PM Thanks for the tips, guys! It's really my fault, because I didn't plan a backup solution for something like this (me not being around, server going down). It looks as if the Apache process didn't do a graceful restart after the logs had been rotated (which happens every day).
It's just weird because we didn't change any settings.
Alexander Turcic 12-29-2006, 02:16 PM OK, here is what happened: Every night, we rotate our log files. After rotation, the Apache server receives a "USR1" signal to be gracefully reloaded. Gracefully means that Apache's children are allowed to finish their current requests before they exit. The problem is that on rare occasions, especially under high system load, some children may still be finishing their requests while the master process is already being reloaded. When this happens, Apache fails to restart.
Googling reveals this link (http://www.apacheweek.com/features/tips):
Sending the parent Apache process a USR1 signal will make it close the current log files, and re-open them, without losing any connections currently in progress. This should be used instead of a HUP signal in any log rotation script. The script should first move the current log files to new names (the logs are still open at this stage). Then it should send a USR1 signal to the parent Apache process. The parent will tell the child process to die when they have finished processing their current request, and will open the log files for newly created children (since the old files have been renamed, the opened files will be newly created). As the old children finish their current requests they will close their handle to the (old) log files, and exit. When all the children are dead you can safely process the old log files (for example, by compressing it). Since you cannot know for definite when the old children have all died, the best way to do this is to make your log rotation script sleep for a while after sending the USR1 signal.
What it doesn't tell me is how long we should make the rotation script sleep. The only other solution I can think of is to send the "HUP" signal instead, which would kill all children and spawn new ones. Alas, it would also mean that all connections that existed during the reload would be dropped.
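The sequence from the quoted tip (rename, USR1, sleep, then compress) can be sketched in Python roughly like this. It is not our actual rotation script; the function name, the parameters and the timestamp suffix are all made up for illustration, and the wait length is left to the caller.

```python
import gzip
import os
import shutil
import signal
import time


def rotate_logs(log_paths, apache_pid, settle_seconds,
                send_signal=os.kill, sleep=time.sleep):
    """Rename logs, ask Apache to reopen them, wait, then compress.

    Returns the list of compressed file paths.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    rotated = []
    # 1. Move the live logs aside; Apache keeps writing to its open handles.
    for path in log_paths:
        if os.path.exists(path):
            new_path = f"{path}.{stamp}"
            os.rename(path, new_path)
            rotated.append(new_path)
    # 2. Graceful restart: the parent reopens the logs and the old
    #    children exit once they have finished their current request.
    send_signal(apache_pid, signal.SIGUSR1)
    # 3. Give the old children time to finish and release the old files.
    sleep(settle_seconds)
    # 4. Only now touch the renamed logs.
    compressed = []
    for path in rotated:
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)
        compressed.append(gz_path)
    return compressed
```

A call might look like `rotate_logs(["/path/to/access_log"], pid, settle_seconds=...)`, with the PID read from Apache's pid file. The sleep only protects the compression step, so it does not by itself prevent a failed restart.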
Leaping Gnome 12-29-2006, 02:22 PM Just tell it to sleep for 5 minutes, that should be plenty of time for anything it is doing.
Glad to see you guys are back up. I wondered what had happened; after the site had been down for over a day, I was wondering if it had been taken down for some reason. I checked a few other eBook sites, but didn't see any notices or posts, so I wasn't sure what was happening.
tribble 12-29-2006, 03:59 PM Alexander, try monit. It lets you monitor Apache and restart it if it doesn't react to HTTP requests. You can make it send you an email if the restart does not work. It's a pretty awesome tool, that monit thingy :) And you can monitor almost anything else on the machine.
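To make the behaviour concrete, here is a rough Python sketch of one check-restart-notify cycle. It is not monit's configuration syntax or its implementation, just the same logic written out. The three callbacks are hypothetical hooks you would supply yourself (an HTTP probe, a restart command, a mailer).

```python
def watchdog_cycle(is_responding, restart, notify):
    """Run one check: restart on failure, notify if the restart doesn't help.

    Returns "ok", "restarted", or "failed".
    """
    if is_responding():
        return "ok"
    # The service did not answer, so try a restart and probe again.
    restart()
    if is_responding():
        return "restarted"
    # The restart did not bring it back; tell a human.
    notify("Apache did not come back after a restart attempt")
    return "failed"
```

A real tool would run this in a loop with a delay between cycles.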
Alexander Turcic 12-31-2006, 06:33 AM Thanks for the tips, guys!
We just installed monit and configured it to make sure that this kind of outage is not going to happen again.
sUnShInE 01-03-2007, 11:21 AM Happy New Year everyone.
Love,
Murphy's Law :D