Brian Kroth wrote:
> I have no problems with 2.6.20-r10. I ran it for 4 hours last night and
> some weeks before this. 2.6.20-r6 before that, again no problems.
> 2.6.22-r8 and 2.6.23 both die as soon as cactid or nagios start running.
> I really don't think this is bad ram anymore. I'll see if I can get an
> exact test for others to try. Any other kernel debug tweaks I should try?
>
> Thanks for all your help,
> Brian

I haven't found a way of reproducing this on other machines yet because
it takes a lot of time to set up cacti. In playing around with cactid,
though, what I've found is that the error happens /nearly/ every time I
specify something like this:

cactid --verbosity=5 -f 1 -l 100

but never (so far) with this:

cactid --verbosity=5 -f 1 -l 10

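One way to turn the two command lines above into a narrower test might be
to bisect the -l value between a value that has not failed (10) and one
that usually fails (100). The following is only a sketch under stated
assumptions: it assumes -f and -l bound the range of host IDs polled, and
that any range containing the problematic host will fail. The function
names and the triggers callback are hypothetical, and because the failure
is only /nearly/ reproducible each candidate is retried a few times.

```python
from typing import Callable, List


def cactid_command(last_host: int, first_host: int = 1) -> List[str]:
    """Build the argv used in the post, varying only the -f/-l values."""
    return ["cactid", "--verbosity=5", "-f", str(first_host), "-l", str(last_host)]


def smallest_failing_last(
    triggers: Callable[[int], bool], lo: int, hi: int, tries: int = 3
) -> int:
    """Bisect for the smallest -l value in (lo, hi] that triggers the error.

    Assumptions (not established in the post): `lo` is known good, `hi` is
    known bad, and every value above the first failing one also fails.
    `triggers(n)` is a hypothetical callback that runs cactid with -l n and
    reports whether "Bad page state" showed up in kern.log.
    """

    def fails(last_host: int) -> bool:
        # The error is intermittent, so count a value as bad if any try fails.
        return any(triggers(last_host) for _ in range(tries))

    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # failure reproduced: the culprit is at or below mid
        else:
            lo = mid  # no failure seen: the culprit is above mid
    return hi
```

In practice the triggers callback would run cactid_command(n) and then
check kern.log. Since the kernel reports that a reboot is needed after each
bad page, a reboot between failing runs is probably required for the
results to mean anything.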
With sec monitoring kern.log for "Bad page state in 'cactid'" and
killing cactid when that happens, I've noticed that the last line of
output from cactid is always something like this:

10/31/2007 10:22:32 PM - CACTID: Poller[0] Host[42] DEBUG: The POPEN returned the following File Descriptor 5

The kern.log shows this:

Oct 31 22:30:09 tux-mc Bad page state in process 'cactid'
Oct 31 22:30:09 tux-mc page:c14070c0 flags:0x40000001 mapping:00000000 mapcount:0 count:0
Oct 31 22:30:09 tux-mc Trying to fix it up, but a reboot is needed
Oct 31 22:30:09 tux-mc Backtrace:
Oct 31 22:30:09 tux-mc [<c044bf67>] bad_page+0x63/0x92
Oct 31 22:30:09 tux-mc [<c044c90c>] free_hot_cold_page+0x7c/0x17f
Oct 31 22:30:09 tux-mc [<c0455c24>] do_wp_page+0x223/0x3ed
Oct 31 22:30:09 tux-mc [<c0456f24>] __handle_mm_fault+0x2ad/0x305
Oct 31 22:30:09 tux-mc [<c0414616>] do_page_fault+0x1da/0x7d5
Oct 31 22:30:09 tux-mc [<c041c2d5>] do_fork+0x15d/0x217
Oct 31 22:30:09 tux-mc [<c041443c>] do_page_fault+0x0/0x7d5
Oct 31 22:30:09 tux-mc [<c06e8db5>] error_code+0x75/0x80
Oct 31 22:30:09 tux-mc [<c06e0000>] svc_defer+0xfa/0x139
Oct 31 22:30:09 tux-mc =======================

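For anyone trying this without sec, the log-watching half could be
sketched roughly as below. This is an illustration rather than the sec
rule actually in use: the regular expressions are written against the
kern.log excerpt above, and the function names are made up for the
example. What to do on a match (for instance killing cactid) is left to
the caller.

```python
import re
from typing import Iterable, List, Optional

# Matches the report line, e.g. "... Bad page state in process 'cactid'".
BAD_PAGE = re.compile(r"Bad page state in process '(?P<proc>[^']+)'")

# Matches a backtrace frame, e.g. "[<c044bf67>] bad_page+0x63/0x92".
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<sym>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def bad_page_process(line: str) -> Optional[str]:
    """Return the process name from a 'Bad page state' line, else None."""
    match = BAD_PAGE.search(line)
    return match.group("proc") if match else None


def backtrace_symbols(lines: Iterable[str]) -> List[str]:
    """Collect the function names from any backtrace frames in `lines`."""
    symbols = []
    for line in lines:
        match = FRAME.search(line)
        if match:
            symbols.append(match.group("sym"))
    return symbols
```

Comparing the symbol lists from several failures would show whether every
report goes through the same do_wp_page / do_fork path as the one above.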
The version of cactid in portage is slightly old. After updating from
0.8.6i-r1 to 0.8.6j the problem seems to happen less frequently, but it
still happens. With that in mind, might this actually be a software
problem and not a kernel problem? Shouldn't PaX be preventing userland
software from screwing up the page table?

I can send more kernel output if anyone's interested. Any thoughts on
what else I should be doing to test this?

Thanks,
Brian