Quote:
Originally Posted by kovidgoyal
There is some kind of issue with my VPS host where the networking is going down because of hypervisor issues, needing a restart; this is the third time it's happened. If they don't fix it, I guess I will write a script to check connectivity and restart networking automatically.
I'm surprised that you don't have some kind of monitoring system in place. There are plenty of choices, many of them free.
https://en.wikipedia.org/wiki/Compar...toring_systems
I have personally used OpenView, PRTG, and Hobbit/BigBrother/Xymon (they keep renaming the silly thing!). Nagios is another popular one, but I have no personal experience with it.
For what I think you'd need to do, I would recommend Xymon. It is simple to set up and use, and it's free. I monitored hundreds of aspects on hundreds of servers with it in my IT job before I retired. It comes with many monitoring scripts, more can be downloaded, or you can write your own in shell, Perl, Python - whatever you want. For the basics of checking network sanity, the included scripts are enough; you don't need custom ones.
I went the custom route and wrote all my own monitoring and action scripts. I was doing things like logging into RDBMSs and querying for available extents, and that kind of detailed monitoring needs custom scripts. I also wrote custom filesystem monitoring with statistical short-term and long-term trend tracking, so I could predict how long until a filesystem might fill up.
Once your monitoring script detects a problem, you can have a response script act on it. The typical response is to email or page the appropriate person (kind of tough to do if what you're monitoring is the network, and it is down!). So in that case you might have a response script that automatically restarts the network without human intervention.
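If you do end up rolling your own instead of using a framework, the detect-then-respond idea might look roughly like the Python sketch below. This is not how Xymon does it, just a minimal illustration. The probe targets, the threshold, the interval, and especially the restart command are all assumptions; the right command depends on your distro and how networking is managed on the VPS.

```python
import socket
import subprocess
import time

# All of these values are assumptions for illustration; adjust for your host.
PROBE_TARGETS = [("1.1.1.1", 53), ("8.8.8.8", 53)]  # public DNS servers, TCP port 53
RESTART_COMMAND = ["systemctl", "restart", "networking"]  # hypothetical; varies by distro
FAILURE_THRESHOLD = 3  # consecutive failed rounds before restarting
CHECK_INTERVAL = 60    # seconds between rounds


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def network_is_up(targets, probe=can_connect):
    """The network counts as up if at least one target is reachable."""
    return any(probe(host, port) for host, port in targets)


def next_failure_count(previous, is_up):
    """Reset the counter on success, increment it on failure."""
    return 0 if is_up else previous + 1


def watch():
    """Loop forever, restarting networking after repeated failed rounds."""
    failures = 0
    while True:
        failures = next_failure_count(failures, network_is_up(PROBE_TARGETS))
        if failures >= FAILURE_THRESHOLD:
            # Needs root; check=False so a failed restart doesn't kill the loop.
            subprocess.run(RESTART_COMMAND, check=False)
            failures = 0
        time.sleep(CHECK_INTERVAL)
```

Call watch() at the bottom of the script and run it as a service. Requiring several consecutive failures against more than one target keeps a single dropped probe from bouncing the network for no reason.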
It is also handy to monitor from both inside and outside. Check the network from inside your VPS host. But also run monitors from the outside that try to connect to your website through various routes. If access fails via all routes, it's probably your server. But if only one route fails and the others succeed it's probably not under your control to fix (unless the problem router happens to be in your domain). Monitoring from the outside also gives you the ability to email yourself about the problem, something you can't do from the inside if your networking is down.
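The all-routes versus one-route reasoning can be sketched in a few lines of Python. This assumes each outside monitor runs something like site_reachable() against your site and reports a True/False result under its own name; the function names and the vantage-point names are made up for illustration.

```python
import urllib.error
import urllib.request


def site_reachable(url, timeout=10.0):
    """Return True if an HTTP GET to url gets a response with status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False


def diagnose(results):
    """results maps a vantage-point name to True (reached the site) or False."""
    if not results:
        return "no data"
    failed = [name for name, ok in results.items() if not ok]
    if not failed:
        return "all routes ok"
    if len(failed) == len(results):
        return "probably the server"
    return "probably a route problem: " + ", ".join(sorted(failed))
```

For example, diagnose({"home-isp": True, "cloud-east": False}) points at a route problem via cloud-east, while every entry being False points at the server itself.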
Anyway, lots of good stuff you can do with these monitoring frameworks.
I also used OpenView and PRTG in my job, but those are complex and costly. They are way overkill for making sure the Calibre website is up and running and its underlying infrastructure is sane. You don't need a corporate solution designed to monitor 5000 servers for that (unless the Calibre website is a lot bigger than I imagined!).