On 16/10/2017 17:41, Mick wrote:
> On Monday, 16 October 2017 16:12:53 BST Alan McKinnon wrote:
>> On 16/10/2017 17:08, Ian Zimmerman wrote:
>>> On 2017-10-16 14:11, Alan McKinnon wrote:
>>>> My needs here are pretty simple:
>>>> local watchdog that checks if a program is running and restart it if
>>>> not. If that fails 3 times or so, alert me.
>>>> Maybe a few file/dir/fifo monitors as well. Not much else.
>>>>
>>>> I don't need any of monit's graphing features or M/monit, I have other
>>>> tools for that. And mostly don't even need it's http API either.
>>>
>>> supervisor (aka supervisord)
>>>
>>> http://supervisord.org/
>>>
>>> python based, not sure if that's okay with you
>>
>> I forgot about supervisord. Like monit, it runs everywhere and might be
>> easier for the team-mates to understand and work with.
>>
>> Python is not a problem, all these hosts are ansible-managed anyway, so
>> they all have to run python-2.7
>>
>> Good find, thanks!
>
> I've used Nagios in the past, but have not kept up with its development and
> the many plugins it provides. It could do any of the above tasks and much
> more. It can run scripts (perl, or bash) via daemons (nrpe) on the remote
> systems to restart applications, et al. The Nagios server possessed the
> ability to set up quite intelligent monitoring and alert hierarchies with
> multilayered comms structures to make sure you are not woken up at 2 a.m. by
> your boss, just because a ping failed to his home NAS. I also found the logs
> which can be also stored on SQL quite useful both in troubleshooting problems
> and in producing reports. It can monitor network connectivity, remote OS
> parameters and applications. Writing your own plugin/module to monitor quite
> specialised use cases is not particularly difficult either.
>
> I expect you may find Nagios more complicated to set up than monit, at least
> initially, but if you don't have the luxury of time to invest on setting up
> Nagios monit may be a better fit. I don't have in depth experience with other
> monitoring software to comment, so something else may suit better your
> specific needs.
>

Nagios and I go way back, way way waaaaaay back. I now recommend it
never be used unless there really is no other option. There are just so
many problems with actually using the bloody thing, but let's not get
into that :-)

I have a full monitoring system that tracks and reports on the state of
most things, but as it's a monitoring system it is forbidden to make
changes of any kind at all, and that includes restarting failed daemons.
Turns out that daemons that fail for no good reason are becoming more
and more common in this day and age, mostly because we treat them like
cattle not pets and use virtualization and containers so much. And
there's our old friend the Linux oom-killer....

What I need here is a small app that will be a constrained,
single-purpose watchdog. If a daemon fails, the watchdog attempts 3
restarts to get it going, and records the fact that it did so (that goes
into the big monitoring system as a reportable fact). If the restarts
fail, then a human needs to attend to it, as the problem is serious or
beyond the scope of a watchdog.
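
To make that concrete, here is a rough sketch of the check in Python.
It is not how monit or supervisord do it internally; every name in it
(watchdog_check, is_running, restart, record, alert, settle_seconds) is
made up for illustration, and the four callables are assumed to be
supplied by whatever wraps the real daemon and the real monitoring
system.

```python
import time
from typing import Callable


def watchdog_check(
    is_running: Callable[[], bool],   # hypothetical probe: is the daemon up?
    restart: Callable[[], None],      # hypothetical action: try to start it
    record: Callable[[str], None],    # hypothetical hook into the monitoring system
    alert: Callable[[str], None],     # hypothetical hook that pages a human
    max_restarts: int = 3,
    settle_seconds: float = 0.0,      # assumed grace period after each restart
) -> bool:
    """Return True if the daemon is up, False if a human had to be alerted."""
    if is_running():
        return True

    for attempt in range(1, max_restarts + 1):
        # Every restart attempt is recorded so it shows up as a reportable fact.
        record(f"restart attempt {attempt} of {max_restarts}")
        restart()
        time.sleep(settle_seconds)
        if is_running():
            record(f"daemon recovered after {attempt} restart(s)")
            return True

    # All attempts used up: this is beyond the scope of the watchdog.
    alert(f"daemon still down after {max_restarts} restarts")
    return False
```

The watchdog itself only ever restarts, records and alerts; anything
else stays with the humans and the big monitoring system.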

Like you, I'm tired of being woken at 2am because something dropped 1
ping when the nightly database maintenance fired up on the VMware
cluster :-)

--
Alan McKinnon
alan.mckinnon@×××××.com