Gentoo Archives: gentoo-user

From: hw <hw@×××××.de>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Networking trouble
Date: Thu, 29 Oct 2015 10:29:31
Message-Id: 5631F4FE.90000@gc-24.de
In Reply to: Re: [gentoo-user] Networking trouble by "J. Roeleveld"
J. Roeleveld wrote:
> On Thursday, October 15, 2015 05:46:07 PM hw wrote:
>> J. Roeleveld wrote:
>>> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
>>>> Hi,
>>>>
>>>> I have a xen host with some HVM guests which becomes unreachable via
>>>> the network after an apparently random amount of time. I have already
>>>> switched the network card to see if that would make a difference,
>>>> and with the card currently installed, it worked fine for over 20 days
>>>> until it became unreachable again. Before switching the network card,
>>>> it would run a week or two before becoming unreachable. The previous
>>>> card was the on-board BCM5764M which uses the tg3 driver.
>>>>
>>>> There are messages like this in the log file:
>>>>
>>>>
>>>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
>>>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
>>>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>>>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>>>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
>>>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>>>> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
>>>> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
>>>> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
>>>> Oct 14 20:58:02 moonflo kernel: Call Trace:
>>>> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>>>> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>>>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
>>>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>>>>
>>>>
>>>> After that, there are lots of messages about the link being up, one
>>>> message every 12 seconds. When you unplug the network cable, you get a
>>>> message that the link is down, and no message when you plug it in again.
>>>>
>>>> I was hoping that switching the network card (to one that uses a
>>>> different driver) might solve the problem, but it did not. Now I can
>>>> only guess that the network card goes to sleep and sometimes cannot
>>>> be woken up again.
>>>>
>>>> I tried to reduce the connection speed to 100Mbit and found that
>>>> accessing the VMs (via RDP) becomes too slow to use them. So I
>>>> disabled the power management of the network card (through sysfs)
>>>> and will have to see if the problem persists.
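
(For the record: "through sysfs" meant something like

  echo on > /sys/bus/pci/devices/0000:37:04.0/power/control

i.e. writing "on" to power/control, which keeps runtime power management
off for that device; the PCI address is the card's address from the log
above. The 100Mbit test was a plain
"ethtool -s enp55s4 speed 100 duplex full autoneg off".)
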
>>>>
>>>> We'll be getting decent network cards in a couple of days, but since
>>>> the problem doesn't seem to be related to a particular
>>>> card/model/manufacturer, that might not fix it, either.
>>>>
>>>> This problem seems to only occur on machines that operate as a xen
>>>> server. Other machines, identical Z800s, not running xen, run just fine.
>>>>
>>>> What would you suggest?
>>>
>>> More info required:
>>>
>>> - Which version of Xen
>>
>> 4.5.1
>>
>> Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
>> -efi -flask -xsm)
>
> Ok, recent one.
>
>>> - Does this only occur with HVM guests?
>>
>> The host has been running only HVM guests every time it happened.
>> It was running a PV guest in between (which I had to shut down
>> because other VMs were migrated, requiring the RAM).
>
> The PV didn't have any issues?

The whole server has the issue, not a particular VM. While the PV guest
was running, the server didn't freeze.

>>> - Which network driver are you using inside the guest?
>>
>> r8169, compiled as a module.
>>
>> The same happened with the tg3 driver when the on-board cards were used.
>> The tg3 driver is completely disabled in the kernel config, i.e.
>> not even compiled as a module.
>
> You have network cards assigned to the guests?

No, they are all connected via a bridge.

I enabled STP on the bridge and the server was ok for a week, then had
to be restarted. I'm seeing lots of messages in the log:


Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating


and sometimes:

Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor 8000.00:00:10:11:12:00 lost


Any idea what this means?

(Google has gone on strike, and another search engine didn't give any useful
findings ...)
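
(As far as I can tell, a "tcn bpdu" is a topology change notification sent
by another bridge or switch on that port, which our bridge then propagates.
I suppose the port states and the topology-change flag could be checked
with something like

  brctl showstp brloc

but I don't know yet what keeps triggering the changes.)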


>>> - Can you connect to the "local" console of the guest?
>>
>> Yes, the host seems to be running fine except for having no network
>> connectivity. There's a keyboard and monitor physically connected to
>> it with which you can log in and do stuff.
>
> The HOST loses network connectivity?

Yes.

Apparently when it became unresponsive yesterday, it was not possible
to log in at the console, either. I wasn't there yesterday, though I've
seen that happen before. We tried to shut it down via acpid by pressing
the power button. It didn't turn off, so it was switched off by holding
the power button. What I can see in the log is:


Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/768
Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge: brctl delif brloc vif2.0 failed
Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge brloc.
Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge: remove type_if=tap XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge: Successful vif-bridge remove for vif2.0-emu, bridge brloc.
Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Oct 28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up; version='3.6.2'


And:


Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic video spl(O) backlight zlib_deflate drm_kms_helper snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: P O 4.0.5-gentoo #3
Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
Oct 24 11:47:42 moonflo kernel: ffffffff8175a77d ffff880124d83d98 ffffffff814da8d8 0000000000000001
Oct 24 11:47:42 moonflo kernel: ffff880124d83de8 ffff880124d83dd8 ffffffff81088850 ffff880124d83e68
Oct 24 11:47:42 moonflo kernel: 0000000000000000 ffff88011efd8000 0000000000000001 ffff8800d4eb5e80
Oct 24 11:47:42 moonflo kernel: Call Trace:
Oct 24 11:47:42 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
Oct 24 11:47:42 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
Oct 24 11:47:42 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
Oct 24 11:47:42 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
Oct 24 11:47:42 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
Oct 24 11:47:42 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
Oct 24 11:47:42 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
Oct 24 11:47:42 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
Oct 24 11:47:42 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
Oct 24 11:47:42 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
236 That was two days before it went down. After that, messages about topology changes
237 are starting to appear.
238
239 I'm not sure if I should call this "progress" ;)
240
>
>> You get no answer when you ping the host while it is unreachable.
>>
>>> - If yes, does it still have no connectivity?
>>
>> It was restarted this morning when it was found to be unreachable.
>>
>>> I saw the same on my lab machine, which was related to:
>>> - Not using correct drivers inside HVM guests
>>
>> There are Windoze 7 guests running that have PV drivers installed.
>> One of those was formerly running on a VMware host and was
>> migrated on Tuesday. I uninstalled the VMware tools from it.
>
> Which PV drivers?

Xen GPL PV Driver Developers
17.09.2014
0.11.0.373
Univention GmbH

> And did you ensure all VMWare related drivers were removed?
> I am not convinced uninstalling the VMWare tools is sufficient.

What would I need to look at to make sure they are removed?

The problem was already there before the VM that had the VMware drivers
installed was migrated to this server, so I don't think they are causing
it.
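
(I suppose I could at least list the remaining third-party driver packages
on the guest from an elevated prompt, e.g.

  pnputil -e

and look for anything published by VMware -- but as said, the problem
predates that VM, so I doubt they are the cause.)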


>> Since Monday, an HVM Linux system (a modified 32-bit Debian) has also
>> been migrated from the VMware host to this one. I don't know if it
>> has VMware tools installed (I guess it does because it could be shut
>> down via VMware) and how those might react now. It's working, and I
>> don't want to touch it.
>>
>> However, the problem already occurred before this migration, when the
>> on-board cards were still used.
>>
>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
>>
>> What might be the reason for the lists becoming too short? Too many
>> devices connected to the network?
>
> No network activity for a while. (clean installs, nothing running)
> Switch forgetting the MAC-address assigned to the VM.
>
> Connecting to the VM-console, I could ping www.google.com and then the
> connectivity re-appeared.

Half of the switches were replaced last week in order to track down
what appears to be a weird network problem: the RDP clients are being
randomly stalled. If it were only that, I'd suspect this server some more,
but the internet connection goes through the same switches and is
apparently also slowed down when the RDP clients are stalled. They also
got randomly stalled when the RDP clients were accessing a totally
different server (the VMware server), so this might be entirely unrelated.

Replacing the switches didn't fix the problem, so I'll probably put them
back into service and replace the other half.

>> The host has been connected to two different switches and showed the
>> problem. Previously, that was an 8-port 1Gb switch, now it's a 24-port
>> 1Gb switch. However, the 8-port switch is also connected to the 24-port
>> switch the host is now connected to. (The 24-port switch connects it
>> "directly" to the rest of the network.)
>
> Assuming it's a managed switch, you could test this.
> Alternatively, check if you can access the VMs from the host.

Good idea, I'll try that the next time it happens while I'm here.
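
(Something like this from the dom0 console should show whether the bridge
side still works -- 192.168.1.50 standing in for an actual VM address:

  ping -c 3 192.168.1.50
  brctl showmacs brloc

If the VMs still answer, the problem would be between the host and the
switch.)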

The network cards have arrived: Intel PRO/1000 dual port, made for IBM.
I hope I get to swap the card today. Those *really* should work.

Hm, I could plug in two of them and give each VM and the host its own
physical card. Do you think that might help?
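
(If I try that, I guess it would mean PCI passthrough via xen-pciback,
which is already loaded according to the module lists above -- roughly,
with a made-up PCI address:

  xl pci-assignable-add 0000:07:00.0

and in the guest config:

  pci = [ '0000:07:00.0' ]

Whether taking the VMs off the bridge helps with the watchdog timeouts
is another question.)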

Replies

Subject                                Author
Re: [gentoo-user] Networking trouble   "J. Roeleveld" <joost@××××××××.org>