Hey all,

we hit some nice traffic last night that took our main gateway down.
Pacemaker was configured to fail over to our second one, but that one died
as well.

In a little post-analysis, I found the following in the logs:

Apr 14 21:42:11 cesar1 kernel: [27613652.439846] BUG: soft lockup - CPU#4
stuck for 22s! [swapper/4:0]
Apr 14 21:42:11 cesar1 kernel: [27613652.440319] Stack:
Apr 14 21:42:11 cesar1 kernel: [27613652.440446] Call Trace:
Apr 14 21:42:11 cesar1 kernel: [27613652.440595] <IRQ>
Apr 14 21:42:12 cesar1 kernel: [27613652.440828] <EOI>
Apr 14 21:42:12 cesar1 kernel: [27613652.440979] Code: c1 51 da 03 81 48 c7
c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8
00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not connect to any LDAP
server as cn=admin,dc=rz,dc=dawanda,dc=com - Can't contact LDAP server
Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not search LDAP server
- Server is unavailable
Apr 14 21:42:24 cesar1 crmd: [7287]: ERROR: process_lrm_event: LRM
operation management-gateway-ip1_stop_0 (917) Timed Out (timeout=20000ms)
Apr 14 21:42:48 cesar1 kernel: [27613688.611501] BUG: soft lockup - CPU#7
stuck for 22s! [named:32166]
Apr 14 21:42:48 cesar1 kernel: [27613688.611914] Stack:
Apr 14 21:42:48 cesar1 kernel: [27613688.612036] Call Trace:
Apr 14 21:42:48 cesar1 kernel: [27613688.612200] <IRQ>
Apr 14 21:42:48 cesar1 kernel: [27613688.612408] <EOI>
Apr 14 21:42:48 cesar1 kernel: [27613688.612626] Code: c1 51 da 03 81 48 c7
c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8
00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
Apr 14 21:42:55 cesar1 kernel: [27613695.946295] BUG: soft lockup - CPU#0
stuck for 21s! [ksoftirqd/0:3]
Apr 14 21:42:55 cesar1 kernel: [27613695.946785] Stack:
Apr 14 21:42:55 cesar1 kernel: [27613695.946917] Call Trace:
Apr 14 21:42:55 cesar1 kernel: [27613695.947137] Code: c4 00 00 81 a8 44 e0
ff ff ff 01 00 00 48 63 80 44 e0 ff ff a9 00 ff ff 07 74 36 65 48 8b 04 25
c8 c4 00 00 83 a8 44 e0 ff ff 01 <5d> c3

We're using irqbalance so that the Ethernet card's hardware interrupts
don't all hit the first CPU when traffic comes in (a lesson learned from
the last, much more intensive DDoS).
However, since this hasn't helped, I'd like to find out what else we can do.
Our gateway has to do NAT and needs a few other iptables rules in order to
run OpenStack behind it, so I can't just drop them.
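For reference, this is roughly how we double-check the interrupt spread by
hand when irqbalance misbehaves. The helper function, the eth0 name, and the
CPU numbers below are illustrative, not our exact setup:

```shell
#!/bin/sh
# Hypothetical helper: build the hex bitmask that /proc/irq/<n>/smp_affinity
# expects from a list of CPU numbers.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Example (assumes the NIC shows up as "eth0" in /proc/interrupts):
#   irq=$(awk -F: '/eth0/ { gsub(/ /, "", $1); print $1; exit }' /proc/interrupts)
#   cpus_to_mask 2 3 4 5 > /proc/irq/$irq/smp_affinity

cpus_to_mask 2 3 4 5   # prints "3c", i.e. CPUs 2-5
```

Watching `/proc/interrupts` afterwards shows whether the counters for that
IRQ actually climb on the pinned CPUs.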

Regarding the logs, I can see that something caused the CPU cores to get
stuck across a number of different processes.
Has anyone encountered error messages like the ones I quoted above, or does
anyone know what else one might do to prevent huge unsolicited incoming
traffic from bringing a Linux node down?
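To make the question concrete, this is the kind of knob I mean; the rates,
burst size, and rule name below are made-up placeholders, not anything we
run today:

```shell
# Enable SYN cookies so a SYN flood can't exhaust the accept backlog.
sysctl -w net.ipv4.tcp_syncookies=1

# Drop sources that open new connections faster than an (illustrative)
# per-source budget; requires the xt_hashlimit match.
iptables -A INPUT -p tcp --syn \
    -m hashlimit --hashlimit-above 200/second --hashlimit-burst 500 \
    --hashlimit-mode srcip --hashlimit-name synflood \
    -j DROP
```

I'm unsure how much this helps when the box is already soft-locking in
interrupt context, which is partly why I'm asking.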

Best regards,
Christian.