Gentoo Archives: gentoo-user

From: "J. Roeleveld" <joost@××××××××.org>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Networking trouble
Date: Thu, 29 Oct 2015 17:26:09
Message-Id: 0F927A0B-7591-42A7-8850-CFAB564E6E17@antarean.org
In Reply to: Re: [gentoo-user] Networking trouble by hw
1 On 29 October 2015 11:29:18 CET, hw <hw@×××××.de> wrote:
2 >J. Roeleveld wrote:
3 >> On Thursday, October 15, 2015 05:46:07 PM hw wrote:
4 >>> J. Roeleveld wrote:
5 >>>> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
6 >>>>> Hi,
7 >>>>>
8 >>>>> I have a xen host with some HV guests which becomes unreachable
9 >via
10 >>>>> the network after apparently random amounts of time. I have
11 >already
12 >>>>> switched the network card to see if that would make a difference,
13 >>>>> and with the card currently installed, it worked fine for over 20
14 >days
15 >>>>> until it became unreachable again. Before switching the network
16 >card,
17 >>>>> it would run a week or two before becoming unreachable. The
18 >previous
19 >>>>> card was the on-board BCM5764M which uses the tg3 driver.
20 >>>>>
21 >>>>> There are messages like this in the log file:
22 >>>>>
23 >>>>>
24 >>>>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
25 >>>>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
26 >>>>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
27 >>>>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
28 >>>>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
29 >>>>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
30 >>>>> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
31 >>>>> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
32 >>>>> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
33 >>>>> Oct 14 20:58:02 moonflo kernel: Call Trace:
34 >>>>> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
35 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
36 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
37 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
38 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
39 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
40 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
41 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
42 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
43 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
44 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
45 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
46 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
47 >>>>> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
48 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
49 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
50 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
51 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
52 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
53 >>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
54 >>>>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
55 >>>>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
110 >>>>>
111 >>>>>
112 >>>>> After that, there are lots of messages about the link being up,
113 >one
114 >>>>> message
115 >>>>> every 12 seconds. When you unplug the network cable, you get a
116 >message
117 >>>>> that the link is down, and no message when you plug it in again.
118 >>>>>
119 >>>>> I was hoping that switching the network card (to one that uses a
120 >>>>> different
121 >>>>> driver) might solve the problem, and it did not. Now I can only
122 >guess
123 >>>>> that
124 >>>>> the network card goes to sleep and sometimes cannot be woken up
125 >again.
126 >>>>>
127 >>>>> I tried to reduce the connection speed to 100Mbit and found that
128 >>>>> accessing
129 >>>>> the VMs (via RDP) becomes too slow to use them. So I disabled the
130 >power
131 >>>>> management of the network card (through sysfs) and will have to
132 >see if
133 >>>>> the
134 >>>>> problem persists.
135 >>>>>
136 >>>>> We'll be getting decent network cards in a couple days, but since
137 >the
138 >>>>> problem doesn't seem to be related to a particular
139 >>>>> card/model/manufacturer,
140 >>>>> that might not fix it, either.
141 >>>>>
142 >>>>> This problem seems to only occur on machines that operate as a xen
143 >>>>> server.
144 >>>>> Other machines, identical Z800s, not running xen, run just fine.
145 >>>>>
146 >>>>> What would you suggest?
147 >>>>
148 >>>> More info required:
149 >>>>
150 >>>> - Which version of Xen
151 >>>
152 >>> 4.5.1
153 >>>
154 >>> Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags
155 >-debug
156 >>> -efi -flask -xsm)
157 >>
158 >> Ok, recent one.
159 >>
160 >>>> - Does this only occur with HVM guests?
161 >>>
162 >>> The host has been running only HVM guests every time it happened.
163 >>> It was running a PV guest in between (which I had to shut down
164 >>> because other VMs were migrated, requiring the RAM).
165 >>
166 >> The PV didn't have any issues?
167 >
168 >The whole server has the issue, not a particular VM. While the PV
169 >guest
170 >was running, the server didn't freeze.
171 >
172 >>>> - Which network-driver are you using inside the guest
173 >>>
174 >>> r8169, compiled as a module
175 >>>
176 >>> Same happened with the tg3 driver when the on-board cards were used.
177 >>> The tg3 driver is completely disabled in the kernel config, i. e.
178 >>> not even compiled as a module.
179 >>
180 >> You have network cards assigned to the guests?
181 >
182 >No, they are all connected via a bridge.
183 >
184 >I enabled STP on the bridge and the server was ok for a week, then had
185 >to be restarted. I'm seeing lots of messages in the log:
186 >
187 >
188 >Oct 28 11:14:05 moonflo kernel: brloc: topology change detected,
189 >propagating
190 >Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn
191 >bpdu
192 >Oct 28 11:14:05 moonflo kernel: brloc: topology change detected,
193 >propagating
194 >Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn
195 >bpdu
196 >Oct 28 11:14:05 moonflo kernel: brloc: topology change detected,
197 >propagating
198 >Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn
199 >bpdu
200 >Oct 28 11:14:05 moonflo kernel: brloc: topology change detected,
201 >propagating
202 >Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn
203 >bpdu
204 >Oct 28 11:14:05 moonflo kernel: brloc: topology change detected,
205 >propagating
206 >
207 >
208 >and sometimes:
209 >
210 >Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor
211 >8000.00:00:10:11:12:00 lost
212 >
213 >
214 >Any idea what this means?
215 >
216 >(Google has gone on strike, and another search engine didn't give any
217 >useful
218 >findings ...)
219 >
220 >
221 >>>> - Can you connect to the "local" console of the guest?
222 >>>
223 >>> Yes, the host seems to be running fine except for having no network
224 >>> connectivity. There's a keyboard and monitor physically connected
225 >to
226 >>> it with which you can log in and do stuff.
227 >>
228 >> The HOST loses network connectivity?
229 >
230 >Yes.
231 >
232 >Apparently when it became unresponsive yesterday, it was not possible
233 >to log in at the console, either. I wasn't there yesterday, though
234 >I've
235 >seen that happen before. We tried to shut it down via acpid by pressing
236 >the
237 >power button. It didn't turn off, so it was switched off by holding the
238 >power
239 >button. What I can see in the log is:
240 >
241 >
242 >Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove
243 >XENBUS_PATH=backend/vbd/2/768
244 >Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge:
245 >offline type_if=vif XENBUS_PATH=backend/vif/2/0
246 >Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge:
247 >brctl delif brloc vif2.0 failed
248 >Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge:
249 >ifconfig vif2.0 down failed
250 >Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge:
251 >Successful vif-bridge offline for vif2.0, bridge brloc.
252 >Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge:
253 >remove type_if=tap XENBUS_PATH=backend/vif/2/0
254 >Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge:
255 >Successful vif-bridge remove for vif2.0-emu, bridge brloc.
256 >Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
257 >^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Oct
258 >28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up;
259 >version='3.6.2'
260 >
261 >
262 >And:
263 >
264 >
265 >Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169):
266 >transmit queue 0 timed out
267 >Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev
268 >br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc
269 >xen_gntdev bridge stp llc zfs(PO) zunicode(PO)
270 >zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek
271 >snd_hda_codec_generic video spl(O) backlight zlib_deflate
272 >drm_kms_helper snd_hda_intel snd_hda_controller
273 >snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore
274 >mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd
275 >aes_x86_64 sha256_generic hid_generic
276 >usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore
277 >usb_common
278 >Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12
279 >Tainted: P O 4.0.5-gentoo #3
280 >Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800
281 >Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
282 >Oct 24 11:47:42 moonflo kernel: ffffffff8175a77d ffff880124d83d98
283 >ffffffff814da8d8 0000000000000001
284 >Oct 24 11:47:42 moonflo kernel: ffff880124d83de8 ffff880124d83dd8
285 >ffffffff81088850 ffff880124d83e68
286 >Oct 24 11:47:42 moonflo kernel: 0000000000000000 ffff88011efd8000
287 >0000000000000001 ffff8800d4eb5e80
288 >Oct 24 11:47:42 moonflo kernel: Call Trace:
289 >Oct 24 11:47:42 moonflo kernel: <IRQ> [<ffffffff814da8d8>]
290 >dump_stack+0x45/0x57
291 >Oct 24 11:47:42 moonflo kernel: [<ffffffff81088850>]
292 >warn_slowpath_common+0x80/0xc0
293 >Oct 24 11:47:42 moonflo kernel: [<ffffffff810888d1>]
294 >warn_slowpath_fmt+0x41/0x50
295 >Oct 24 11:47:42 moonflo kernel: [<ffffffff812b31c5>] ?
296 >add_interrupt_randomness+0x35/0x1e0
297 >Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b819>]
298 >dev_watchdog+0x259/0x270
299 >Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ?
300 >dev_graft_qdisc+0x80/0x80
301 >Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ?
302 >dev_graft_qdisc+0x80/0x80
303 >Oct 24 11:47:42 moonflo kernel: [<ffffffff810d4047>]
304 >call_timer_fn.isra.30+0x17/0x70
305 >Oct 24 11:47:42 moonflo kernel: [<ffffffff810d42a6>]
306 >run_timer_softirq+0x176/0x2b0
307 >Oct 24 11:47:42 moonflo kernel: [<ffffffff8108bd0a>]
308 >__do_softirq+0xda/0x1f0
309 >Oct 24 11:47:42 moonflo kernel: [<ffffffff8108c04e>]
310 >irq_exit+0x7e/0xa0
311 >Oct 24 11:47:42 moonflo kernel: [<ffffffff8130e075>]
312 >xen_evtchn_do_upcall+0x35/0x50
313 >Oct 24 11:47:42 moonflo kernel: [<ffffffff814e1e8e>]
314 >xen_do_hypervisor_callback+0x1e/0x40
315 >Oct 24 11:47:42 moonflo kernel: <EOI> [<ffffffff810013aa>] ?
316 >xen_hypercall_sched_op+0xa/0x20
317 >Oct 24 11:47:42 moonflo kernel: [<ffffffff810013aa>] ?
318 >xen_hypercall_sched_op+0xa/0x20
319 >Oct 24 11:47:42 moonflo kernel: [<ffffffff810459e0>] ?
320 >xen_safe_halt+0x10/0x20
321 >Oct 24 11:47:42 moonflo kernel: [<ffffffff81053979>] ?
322 >default_idle+0x9/0x10
323 >Oct 24 11:47:42 moonflo kernel: [<ffffffff810542da>] ?
324 >arch_cpu_idle+0xa/0x10
325 >Oct 24 11:47:42 moonflo kernel: [<ffffffff810bd170>] ?
326 >cpu_startup_entry+0x190/0x2f0
327 >Oct 24 11:47:42 moonflo kernel: [<ffffffff81047cd5>] ?
328 >cpu_bringup_and_idle+0x25/0x40
329 >Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
330 >Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
331 >
332 >
333 >That was two days before it went down. After that, messages about
334 >topology changes
335 >are starting to appear.
336 >
337 >I'm not sure if I should call this "progress" ;)
338 >
339 >>
340 >>> You get no answer when you ping the host while it is unreachable.
341 >>>
342 >>>> - If yes, does it still have no connectivity?
343 >>>
344 >>> It has been restarted this morning when it was found to be
345 >unreachable.
346 >>>
347 >>>> I saw the same on my lab machine, which was related to:
348 >>>> - Not using correct drivers inside HVM guests
349 >>>
350 >>> There are Windoze 7 guests running that have PV drivers installed.
351 >>> One of those has formerly been running on a VMware host and was
352 >>> migrated on Tuesday. I deinstalled the VMware tools from it.
353 >>
354 >> Which PV drivers?
355 >
356 >Xen GPL PV Driver Developers
357 >17.09.2014
358 >0.11.0.373
359 >Univention GmbH
360 >
361 >> And did you ensure all VMWare related drivers were removed?
362 >> I am not convinced uninstalling the VMWare tools is sufficient.
363 >
364 >What would I need to look at to make sure they are removed?
365 >
366 >The problem has been there before the VM that had VMWare drivers
367 >installed was migrated to this server. So I don't think they are
368 >causing this problem.
369 >
370 >
371 >>> Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
372 >>> been migrated from the VMware host to this one. I don't know if it
373 >>> has VMware tools installed (I guess it does because it could be shut
374 >>> down via VMware) and how those might react now. It's working, and I
375 >>> don't want to touch it.
376 >>>
377 >>> However, the problem already occurred before this migration, when the
378 >>> on-board cards were still used.
379 >>>
380 >>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
381 >>>
382 >>> What might be the reason for the lists becoming too short? Too many
383 >>> devices connected to the network?
384 >>
385 >> No network activity for a while. (clean installs, nothing running)
386 >> Switch forgetting the MAC-address assigned to the VM.
387 >>
388 >> Connecting to the VM-console, I could ping www.google.com and then
389 >the
390 >> connectivity re-appeared.
391 >
392 >Half of the switches have been replaced last week in order to track
393 >down
394 >what appears to be a weird network problem. The problem is that the
395 >RDP
396 >clients are being randomly stalled. If it was only that, I'd suspect
397 >this
398 >server some more, but the internet connection goes through the same
399 >switches
400 >and is apparently also slowed down when the RDP clients are stalled.
401 >They
402 >also got randomly stalled when the RDP clients were accessing a totally
403 >different server (the VMWare server), so this might be entirely
404 >unrelated.
405 >
406 >Replacing the switches didn't fix the problem, so I'll probably put
407 >them
408 >back into service and replace the other half.
409 >
410 >>> The host has been connected to two different switches and showed the
411 >>> problem. Previously, that was an 8-port 1Gb switch, now it's a
412 >24-port
413 >>> 1Gb switch. However, the 8-port switch is also connected to the
414 >24-port
415 >>> switch the host is now connected to. (The 24-port switch connects
416 >it
417 >>> "directly" to the rest of the network.)
418 >>
419 >> Assuming it's a managed switch, you could test this.
420 >> Alternatively, check if you can access the VMs from the host.
421 >
422 >Good idea, I'll try that when it happens when I'm here.
423 >
424 >The network cards have arrived, Intel PRO 1000 dual port, made for IBM.
425 >I hope I get to swap the card today. Those *really* should work.
426 >
427 >Hm, I could plug in two of them and give each VM and the host its own
428 >physical card. Do you think that might help?
429
430 Quick reply from mobile.
431 Will give a more detailed one later.
432
433 Noticed you are using ZFS. Where is your swap partition located?
434
435 On ZFS, or somewhere else?
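
One way to check, as a rough sketch: /proc/swaps lists the active swap areas, and anything backed by a ZFS volume should show up with a zvol device name. The helper name swap_on_zfs is made up for illustration, and the /dev/zdN and /dev/zvol/ patterns assume the usual ZFS on Linux device naming, so adjust them if your setup differs.

```shell
# Print active swap areas that look like they live on a ZFS volume.
# Assumption: zvols appear as /dev/zdN or under /dev/zvol/ (ZFS on Linux naming).
swap_on_zfs() {
    # $1: file in /proc/swaps format (defaults to the live /proc/swaps)
    awk 'NR > 1 && ($1 ~ /^\/dev\/zd[0-9]+/ || $1 ~ /^\/dev\/zvol\//) { print $1 }' \
        "${1:-/proc/swaps}"
}

# Only query the live system if /proc/swaps is readable.
if [ -r /proc/swaps ]; then
    swap_on_zfs
fi
```

If this prints nothing, no active swap area matched the zvol patterns; "cat /proc/swaps" shows the full list for comparison.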
436
437 --
438 Joost
