Hi Alan,

This isn't exactly what you describe for your needs, but have you
considered auto-remediation as an outside-the-box option? I've been
using StackStorm https://stackstorm.com/ for the last year in an
environment of ~1500 physical servers for this purpose, and it's been
quite successful.

It has been handling cases like restarting SNMP daemons that segfault,
Hadoop instances that lose contact with the ZooKeeper cluster, and
nginx daemons that stop responding to requests, detected by analysing
the last write time of nginx's access logs; the list goes on.
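In case it helps, that nginx check boils down to comparing the access
log's mtime against an idle threshold. Here's a minimal standalone
sketch in Python; the log path and threshold are assumptions, and in
StackStorm proper a sensor would emit a trigger rather than restarting
the service directly:

```python
import os
import subprocess
import time

ACCESS_LOG = "/var/log/nginx/access.log"  # assumed log path
MAX_IDLE_SECONDS = 300                    # assumed idle threshold

def nginx_looks_stuck(log_path=ACCESS_LOG, max_idle=MAX_IDLE_SECONDS):
    """Return True if the access log hasn't been written to recently."""
    last_write = os.path.getmtime(log_path)
    return (time.time() - last_write) > max_idle

def remediate(log_path=ACCESS_LOG):
    """Restart nginx when the log check says it has gone quiet."""
    if nginx_looks_stuck(log_path):
        # In StackStorm this restart would be an action fired by a rule;
        # here we call systemctl directly for illustration.
        subprocess.run(["systemctl", "restart", "nginx"], check=True)
```

The mtime check is deliberately crude; it trades precision for having
zero dependencies on nginx's status endpoints.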

StackStorm is an event-driven platform with many integrations
available, allowing it to interact with internal and external service
providers. It's Python based and can use SSH to execute remote
commands, which sounds like an acceptable approach since you're using
Ansible.

Connecting SNMP traps to StackStorm's event bus to trigger automated
responses based on the trap contents would be in line with common use
cases.
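As a rough illustration, a StackStorm rule tying a trap trigger to a
remediation action looks something like the sketch below. The
`snmp.trap` trigger ref and the OID are made up (the real trigger name
depends on the pack or sensor you wire in), while `core.remote` is
StackStorm's built-in run-over-SSH action:

```yaml
---
name: "restart_snmpd_on_trap"
pack: "examples"
description: "Restart snmpd on the host that sent a matching trap."
enabled: true

trigger:
  # Hypothetical trigger ref; depends on your SNMP pack or sensor.
  type: "snmp.trap"

criteria:
  trigger.oid:
    type: "equals"
    pattern: "1.3.6.1.4.1.8072.4"   # made-up example OID

action:
  # core.remote runs a command over SSH on the target host.
  ref: "core.remote"
  parameters:
    hosts: "{{ trigger.host }}"
    cmd: "systemctl restart snmpd"
```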

Regards,
Carlos

On 16/10/17 17:50, Alan McKinnon wrote:
> Nagios and I go way back, way way waaaaaay back. I now recommend it
> never be used unless there really is no other option. There are just
> so many problems with actually using the bloody thing, but let's not
> get into that :-)
>
> I have a full monitoring system that tracks and reports on the state
> of most things, but as it's a monitoring system it is forbidden to
> make changes of any kind at all, and that includes restarting failed
> daemons. It turns out that daemons failing for no good reason are
> becoming more and more common these days, mostly because we treat
> them like cattle, not pets, and use virtualization and containers so
> much. And then there's our old friend the Linux oom-killer....
>
> What I need here is a small app that will be a constrained,
> single-purpose watchdog. If a daemon fails, the watchdog attempts 3
> restarts to get it going, and records the fact that it did so (that
> goes into the big monitoring system as a reportable fact). If the
> restarts fail, then a human needs to attend to it, as it is seriously
> broken or beyond the scope of a watchdog.
>
> Like you, I'm tired of being woken at 2am because something dropped 1
> ping when the nightly database maintenance fired up on the VMware
> cluster :-)
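
P.S. For what it's worth, the constrained watchdog you describe (up to
3 restarts, record the attempt, then hand off to a human) is only a
few lines of Python. A sketch, with the caveat that the systemd unit
name, the restart command, and the reporting hook are all assumptions
standing in for your environment:

```python
import subprocess
import time

MAX_ATTEMPTS = 3

def service_is_running(name):
    """systemctl is-active exits 0 when the unit is active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", name])
    return result.returncode == 0

def report(message):
    # Assumed hook: in practice this would push a reportable fact
    # into the central monitoring system instead of printing.
    print(message)

def watchdog(name, restart=None, is_running=None, delay=5):
    """Attempt up to MAX_ATTEMPTS restarts; escalate if all fail."""
    if restart is None:
        restart = lambda: subprocess.run(
            ["systemctl", "restart", name], check=False)
    if is_running is None:
        is_running = lambda: service_is_running(name)

    for attempt in range(1, MAX_ATTEMPTS + 1):
        restart()
        time.sleep(delay)  # give the daemon a moment to come up
        if is_running():
            report(f"{name}: restarted on attempt {attempt}")
            return True
    report(f"{name}: {MAX_ATTEMPTS} restarts failed, human needed")
    return False
```

The restart and liveness checks are injectable so the same loop can
watch anything, not just systemd units.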