pageexec@××××××××.hu wrote:
> On 31 Oct 2007 at 22:46, Brian Kroth wrote:
>
>> but not ever (yet) with this
>>
>> cactid --verbosity=5 -f 1 -l 10
>
> what does the -l switch do?

-f and -l let you limit the range of hostids to scan. Of the examples
I gave, the first scanned hostids 1-100; the one above scans hostids
1-10. I was originally doing this to see if I could pinpoint one
particular host check that was causing it, but it seems to have more
to do with large host scans. I think that might be because of the
number of forks and allocations.
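
For anyone who wants to repeat the range narrowing, here is a minimal
sketch of how the hostid range could be halved step by step. Only
--verbosity, -f and -l are taken from the command quoted above; the
helper names (cactid_cmd, halves) and the halving strategy are
assumptions made for illustration, and the script only builds command
lines, it does not run cactid.

```python
def cactid_cmd(first, last, verbosity=5):
    """Build the argv for scanning hostids first..last (inclusive)."""
    return ["cactid", f"--verbosity={verbosity}",
            "-f", str(first), "-l", str(last)]


def halves(first, last):
    """Split an inclusive hostid range into two inclusive halves."""
    if first >= last:
        # A single hostid cannot be split any further.
        return [(first, last)]
    mid = (first + last) // 2
    return [(first, mid), (mid + 1, last)]


if __name__ == "__main__":
    # Print the two commands covering hostids 1-100 in halves.
    for lo, hi in halves(1, 100):
        print(" ".join(cactid_cmd(lo, hi)))
```

If the problem really depends on the size of the scan rather than on one
host, halving will stop reproducing it at some point, which is itself a
useful data point.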

>
>> The version of cactid in portage is slightly old. After updating from
>> 0.8.6i-r1 to 0.8.6j the problem seems to happen less frequently, but
>> still happens. With that in mind might this actually be a software
>> problem and not a kernel problem? Shouldn't PAX be preventing userland
>> software from screwing up the page table?
>
> i'm almost sure it's a bug somewhere in vma mirroring as that's the
> only thing i changed in .22 and on and it does play with page locking
> (the bad page state is triggered because a to-be-freed page is still
> locked, that means there's a missing unlock somewhere in the code,
> but i couldn't figure it out from the code yet).

Where's the code for this? I'm no kernel guru by any means, but I'd
still be interested to look at it and learn.

>
>> I can send more kernel output if anyone's interested. Any thoughts on
>> what else I should be doing to test this?
>
> i'll need your mm/memory.o from the failing kernel and if it occurred on
> multiple machines or kernels, indicate which of your reports corresponds
> to which .o (well, i can find it out from the disasm eventually, but it
> helps me if i don't have to ;-). then can you send me a /proc/pid/maps file
> from cactid and nagios (if you use grsec make sure that addresses are not
> hidden and preferably not randomized either)?
>

So far this is only on that single machine, and only for nagios and
cacti. I rebuilt the kernel with the attached config. I've basically
turned on a few more debug settings in the kernel and turned off the
randomization features of pax (CONFIG_PAX_ASLR) and the "remove
addresses" feature of grsec (CONFIG_GRKERNSEC_PROC_MEMMAP) like you
asked. I also tweaked my sec script to copy the maps files before
killing the offending processes. Everything should be in the tar. Let
me know if you need anything else.
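
The maps-copying step can be sketched roughly as follows. This is not
the actual sec script, just an illustration of saving a process's maps
file before the process is killed; the function name copy_maps, the
output naming scheme (maps.<pid>) and the proc_root parameter are all
assumptions.

```python
from pathlib import Path


def copy_maps(pid, dest_dir, proc_root="/proc"):
    """Copy <proc_root>/<pid>/maps to <dest_dir>/maps.<pid>.

    Returns the path of the copy. Must be called while the process
    still exists, i.e. before it is killed.
    """
    src = Path(proc_root) / str(pid) / "maps"
    dest = Path(dest_dir) / f"maps.{pid}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Read the whole file explicitly: /proc entries report a size of 0,
    # so the contents are only known once they have been read.
    dest.write_bytes(src.read_bytes())
    return dest
```

Calling copy_maps(pid, "/some/dir") for each offending pid, and only
then sending the kill signal, keeps the address layout available for
later inspection.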

Nov 3 18:27:38 tux-mc IPMI Watchdog: driver initialized
Nov 3 18:32:59 tux-mc Bad page state in process 'nagios'
Nov 3 18:32:59 tux-mc page:c129d620 flags:0x40000001 mapping:00000000
mapcount:0 count:0
Nov 3 18:32:59 tux-mc Trying to fix it up, but a reboot is needed
Nov 3 18:32:59 tux-mc Backtrace:
Nov 3 18:32:59 tux-mc [<c044c150>] bad_page+0x63/0x92
Nov 3 18:32:59 tux-mc [<c044cc18>] free_hot_cold_page+0x7c/0x194
Nov 3 18:32:59 tux-mc [<c0456110>] do_wp_page+0x22e/0x426
Nov 3 18:32:59 tux-mc [<c0457463>] __handle_mm_fault+0x2ad/0x305
Nov 3 18:32:59 tux-mc [<c0414576>] do_page_fault+0x1da/0x7d5
Nov 3 18:32:59 tux-mc [<c041c269>] do_fork+0x15d/0x217
Nov 3 18:32:59 tux-mc [<c041439c>] do_page_fault+0x0/0x7d5
Nov 3 18:32:59 tux-mc [<c06e9525>] error_code+0x75/0x80
Nov 3 18:32:59 tux-mc [<c06e0000>] svc_setup_socket+0x1aa/0x223
Nov 3 18:32:59 tux-mc =======================
Nov 3 18:35:07 tux-mc Bad page state in process 'cactid'
Nov 3 18:35:07 tux-mc page:c12efdc0 flags:0x40000001 mapping:00000000
mapcount:0 count:0
Nov 3 18:35:07 tux-mc Trying to fix it up, but a reboot is needed
Nov 3 18:35:07 tux-mc Backtrace:
Nov 3 18:35:07 tux-mc [<c044c150>] bad_page+0x63/0x92
Nov 3 18:35:07 tux-mc [<c044cc18>] free_hot_cold_page+0x7c/0x194
Nov 3 18:35:07 tux-mc [<c0456110>] do_wp_page+0x22e/0x426
Nov 3 18:35:07 tux-mc [<c0457463>] __handle_mm_fault+0x2ad/0x305
Nov 3 18:35:07 tux-mc [<c0414576>] do_page_fault+0x1da/0x7d5
Nov 3 18:35:07 tux-mc [<c04682d5>] sys_read+0x68/0x6a
Nov 3 18:35:07 tux-mc [<c041439c>] do_page_fault+0x0/0x7d5
Nov 3 18:35:07 tux-mc [<c06e9525>] error_code+0x75/0x80
Nov 3 18:35:07 tux-mc [<c06e0000>] svc_setup_socket+0x1aa/0x223
Nov 3 18:35:07 tux-mc =======================
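
Both reports show flags:0x40000001, which fits the "page is still
locked" explanation quoted earlier if bit 0 of the flags word is the
lock bit. The sketch below pulls the flags value out of such a log line
and tests that bit. The bit position is an assumption (bit 0 being
PG_locked, as I understand 2.6-era include/linux/page-flags.h); the
remaining high bits are not interpreted here, and the function names are
made up for illustration.

```python
import re

# Assumption: bit 0 of page->flags is PG_locked on this kernel.
PG_LOCKED_BIT = 0

# Matches the "page:<addr> flags:0x<hex>" part of a bad-page report.
BAD_PAGE_RE = re.compile(r"page:([0-9a-f]+) flags:0x([0-9a-f]+)")


def parse_flags(line):
    """Return the flags value from a bad-page log line, or None."""
    match = BAD_PAGE_RE.search(line)
    return int(match.group(2), 16) if match else None


def is_locked(flags):
    """True if the assumed lock bit is set in the flags word."""
    return bool(flags & (1 << PG_LOCKED_BIT))


if __name__ == "__main__":
    line = ("Nov 3 18:32:59 tux-mc page:c129d620 "
            "flags:0x40000001 mapping:00000000")
    flags = parse_flags(line)
    print(hex(flags), "locked" if is_locked(flags) else "not locked")
```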

PS - would you like me to take this off list?

Thanks again,
Brian