On 15/04/2013 16:07, Christian Parpart wrote:
> Hey all,
> 
> we hit some nice traffic last night that took our main gateway down.
> Pacemaker was configured to fail over to our second one, but that one
> died as well.
> 
> In a little post-analysis, I found the following in the logs:
> |
> Apr 14 21:42:11 cesar1 kernel: [27613652.439846] BUG: soft lockup -
> CPU#4 stuck for 22s! [swapper/4:0]
> Apr 14 21:42:11 cesar1 kernel: [27613652.440319] Stack:
> Apr 14 21:42:11 cesar1 kernel: [27613652.440446] Call Trace:
> Apr 14 21:42:11 cesar1 kernel: [27613652.440595] <IRQ>
> Apr 14 21:42:12 cesar1 kernel: [27613652.440828] <EOI>
> Apr 14 21:42:12 cesar1 kernel: [27613652.440979] Code: c1 51 da 03 81 48
> c7 c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90
> 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
> Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not connect to any
> LDAP server as cn=admin,dc=rz,dc=dawanda,dc=com - Can't contact LDAP server
> Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not search LDAP
> server - Server is unavailable
> Apr 14 21:42:24 cesar1 crmd: [7287]: ERROR: process_lrm_event: LRM
> operation management-gateway-ip1_stop_0 (917) Timed Out (timeout=20000ms)
> Apr 14 21:42:48 cesar1 kernel: [27613688.611501] BUG: soft lockup -
> CPU#7 stuck for 22s! [named:32166]
> Apr 14 21:42:48 cesar1 kernel: [27613688.611914] Stack:
> Apr 14 21:42:48 cesar1 kernel: [27613688.612036] Call Trace:
> Apr 14 21:42:48 cesar1 kernel: [27613688.612200] <IRQ>
> Apr 14 21:42:48 cesar1 kernel: [27613688.612408] <EOI>
> Apr 14 21:42:48 cesar1 kernel: [27613688.612626] Code: c1 51 da 03 81 48
> c7 c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90
> 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
> Apr 14 21:42:55 cesar1 kernel: [27613695.946295] BUG: soft lockup -
> CPU#0 stuck for 21s! [ksoftirqd/0:3]
> Apr 14 21:42:55 cesar1 kernel: [27613695.946785] Stack:
> Apr 14 21:42:55 cesar1 kernel: [27613695.946917] Call Trace:
> Apr 14 21:42:55 cesar1 kernel: [27613695.947137] Code: c4 00 00 81 a8 44
> e0 ff ff ff 01 00 00 48 63 80 44 e0 ff ff a9 00 ff ff 07 74 36 65 48 8b
> 04 25 c8 c4 00 00 83 a8 44 e0 ff ff 01 <5d> c3
> 
> We're using irqbalance so that incoming traffic does not hit only the
> first CPU with ethernet card hardware interrupts (a lesson learned
> from the last, much more intensive DDoS).
|
To use irqbalance is wise. You could also try using receive packet
steering [1] [2]:
|
#!/bin/bash
# Enable receive packet steering (RPS): size the global flow table and
# allow all CPUs to process packets for every receive queue.
iface='eth*'
flow=16384
echo "$flow" > /proc/sys/net/core/rps_sock_flow_entries
queues=(/sys/class/net/${iface}/queues/rx-*)
for rx in "${queues[@]}"; do
    # Turning every 0 nibble into f yields an all-CPUs bitmask.
    echo "$(sed -e 's/0/f/g' < "$rx/rps_cpus")" > "$rx/rps_cpus"
    echo "$flow" > "$rx/rps_flow_cnt"
done
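
As an aside, the sed trick above rewrites the current mask in place;
the all-CPUs mask can also be computed directly from the CPU count. A
minimal sketch (cpus_to_mask is a name of my own choosing, not part of
any tool):

```shell
# Print a hex bitmask with the low $1 bits set, i.e. an rps_cpus mask
# enabling that many CPUs. Relies on 64-bit shell arithmetic, so it is
# only valid for up to 63 CPUs.
cpus_to_mask() {
    printf '%x\n' "$(( (1 << $1) - 1 ))"
}

cpus_to_mask "$(nproc)"   # e.g. "ff" on an 8-CPU machine
```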
|
I have found this to be beneficial on systems running networking
applications that are subject to a high load, but not for systems that
are simply forwarding packets and processing them entirely in kernel
space.
|
> However, since this has not helped, I'd like to find out what else we
> can do. Our gateway has to do NAT and has a few other iptables rules
> it needs in order to run OpenStack behind it, so I can't just drop it.
> 
> Regarding the logs, I can see that something caused the CPU cores to
> get stuck for a number of different processes. Has anyone ever
> encountered the error messages I quoted above, or knows
|
I used to encounter them but they cleared up at some point during the
3.4 (longterm) kernel series. If you also use the 3.4 series, I would
advise upgrading if running < 3.4.51. If you are not using a longterm
kernel, consider doing so unless there is a feature in a later kernel
that you cannot do without. My experience is that the more recent
'stable' kernels have a tendency to introduce serious regressions.
|
> other things one might want to do in order to prevent huge amounts of
> unsolicited incoming traffic from bringing a Linux node down?
|
If you can, talk with your upstream to see if there is a way in which
such traffic can be throttled there.
|
Be sure to use good quality NICs. In particular, they should support
multiqueue and adjustable interrupt coalescing (preferably on a dynamic
basis). For what it's worth, I'm using Intel 82576 based cards for busy
hosts. These support dynamic interrupt throttling. Even without such a
feature, some cards will allow their behaviour to be altered via
ethtool -C. Google will turn up a lot of information on this topic.
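
For instance, coalescing behaviour can be inspected and changed along
these lines (option support varies by driver, and the values here are
only illustrative, not recommendations):

```shell
ethtool -c eth0                  # show current coalescing settings
ethtool -C eth0 adaptive-rx on   # dynamic throttling, where supported
ethtool -C eth0 rx-usecs 100     # or a fixed delay before interrupting
```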
|
I should add that the stability of the driver is of paramount
importance. Though my Intel cards have been solid, the igb driver
bundled with the 3.4 kernel is not, which took me a long time to figure
out. I now use a local ebuild to compile the igb driver from upstream.
Not only did it improve performance, but it resolved all stability
issues that I had experienced up until then.
|
In the event that you are also using the igb driver, ensure that it is
configured optimally for multiqueue. Here's an example for the upstream
driver (my NIC has 4 ports, each with 8 queues):
|
# cat /etc/modprobe.d/igb.conf
options igb RSS=8,8,8,8
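
To confirm that the queues actually came up, the per-queue interrupt
vectors can be counted in /proc/interrupts. A sketch assuming the usual
igb naming scheme (eth0-TxRx-N); count_queue_irqs is a hypothetical
helper of my own:

```shell
# Count the TxRx interrupt vectors registered for an interface.
# Reads /proc/interrupts unless another file is given (handy for
# testing against a saved copy).
count_queue_irqs() {
    grep -c "$1-TxRx-" "${2:-/proc/interrupts}"
}
```

With RSS=8 per port, count_queue_irqs eth0 should report 8.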
|
Enable I/OAT if your hardware supports it. Some hardware will support it
but fail to expose a BIOS option to enable it, in which case you can try
using dca_force [3] (YMMV). Similarly, make use of x2APIC if supported,
but do not make use of the IOMMU provided by Intel as of Nehalem (boot
with intel_iommu=off if in doubt).
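
On a GRUB 2 system, the boot parameter can be made persistent along
these lines (the file path assumes a Debian-style layout):

```shell
# /etc/default/grub -- run update-grub afterwards to regenerate the
# boot configuration
GRUB_CMDLINE_LINUX="intel_iommu=off"
```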
|
Consider fine-tuning sysctl.conf, especially the settings pertaining to
buffer sizes/limits. I would consider this essential if operating at
gigabit speeds or higher. Examples are widespread, such as in section
3.1 of the Mellanox performance tuning guide [4].
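
By way of illustration, such settings might start out looking like this
(the figures are placeholders to be tuned for your workload, not
recommendations):

```shell
# /etc/sysctl.conf -- raise socket buffer ceilings and the per-CPU
# input queue backlog; apply with sysctl -p
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 30000
```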
|
--Kerin
|
[1] https://lwn.net/Articles/361440/
[2] http://thread.gmane.org/gmane.linux.network/179883/focus=179976
[3] https://github.com/ice799/dca_force
[4] http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters_rev_1_0.pdf