Gentoo Archives: gentoo-user

From: hw <hw@×××××.de>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Networking trouble
Date: Thu, 15 Oct 2015 15:46:29
Message-Id: 561FCA3F.8090803@gc-24.de
In Reply to: Re: [gentoo-user] Networking trouble by "J. Roeleveld"
1 J. Roeleveld wrote:
2 > On Thursday, October 15, 2015 03:30:01 PM hw wrote:
3 >> Hi,
4 >>
5 >> I have a xen host with some HVM guests which becomes unreachable via
6 >> the network after apparently random amounts of time. I have already
7 >> switched the network card to see if that would make a difference,
8 >> and with the card currently installed, it worked fine for over 20 days
9 >> until it became unreachable again. Before switching the network card,
10 >> it would run a week or two before becoming unreachable. The previous
11 >> card was the on-board BCM5764M which uses the tg3 driver.
12 >>
13 >> There are messages like this in the log file:
14 >>
15 >>
16 >> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
17 >> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at
18 >> net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270() Oct 14 20:58:02
19 >> moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed
20 >> out Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac
21 >> nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables
22 >> xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau
23 >> snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO)
24 >> zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper
25 >> ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd
26 >> soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper
27 >> cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage
28 >> ehci_pci ehci_hcd usbcore usb_common Oct 14 20:58:02 moonflo kernel: CPU:
29 >> 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3 Oct 14
30 >> 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800
31 >> Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013 Oct 14 20:58:02 moonflo
32 >> kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8
33 >> 0000000000000001 Oct 14 20:58:02 moonflo kernel: ffff880124d43de8
34 >> ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8 Oct 14 20:58:02 moonflo
35 >> kernel: 0000000000000000 ffff8800d45f2000 0000000000000001
36 >> ffff8800d5294880 Oct 14 20:58:02 moonflo kernel: Call Trace:
37 >> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>]
38 >> dump_stack+0x45/0x57 Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>]
39 >> warn_slowpath_common+0x80/0xc0 Oct 14 20:58:02 moonflo kernel:
40 >> [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50 Oct 14 20:58:02 moonflo
41 >> kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0 Oct 14
42 >> 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270 Oct
43 >> 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ?
44 >> dev_graft_qdisc+0x80/0x80 Oct 14 20:58:02 moonflo kernel:
45 >> [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80 Oct 14 20:58:02 moonflo
46 >> kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70 Oct 14
47 >> 20:58:02 moonflo kernel: [<ffffffff810d42a6>]
48 >> run_timer_softirq+0x176/0x2b0 Oct 14 20:58:02 moonflo kernel:
49 >> [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0 Oct 14 20:58:02 moonflo
50 >> kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0 Oct 14 20:58:02 moonflo
51 >> kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50 Oct 14
52 >> 20:58:02 moonflo kernel: [<ffffffff814e1e8e>]
53 >> xen_do_hypervisor_callback+0x1e/0x40 Oct 14 20:58:02 moonflo kernel: <EOI>
54 >> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 Oct 14 20:58:02
55 >> moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 Oct
56 >> 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
57 >> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ?
58 >> default_idle+0x9/0x10 Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>]
59 >> ? arch_cpu_idle+0xa/0x10 Oct 14 20:58:02 moonflo kernel:
60 >> [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0 Oct 14 20:58:02
61 >> moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40 Oct
62 >> 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]--- Oct 14
63 >> 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
64 >>
65 >>
66 >> After that, there are lots of messages about the link being up, one message
67 >> every 12 seconds. When you unplug the network cable, you get a message that
68 >> the link is down, and no message when you plug it in again.
69 >>
70 >> I was hoping that switching the network card (to one that uses a different
71 >> driver) might solve the problem, and it did not. Now I can only guess that
72 >> the network card goes to sleep and sometimes cannot be woken up again.
73 >>
74 >> I tried to reduce the connection speed to 100Mbit and found that accessing
75 >> the VMs (via RDP) becomes too slow to use them. So I disabled the power
76 >> management of the network card (through sysfs) and will have to see if the
77 >> problem persists.
78 >>
79 >> We'll be getting decent network cards in a couple days, but since the
80 >> problem doesn't seem to be related to a particular card/model/manufacturer,
81 >> that might not fix it, either.
82 >>
83 >> This problem seems to only occur on machines that operate as a xen server.
84 >> Other machines, identical Z800s, not running xen, run just fine.
85 >>
86 >> What would you suggest?
87 >
88 > More info required:
89 >
90 > - Which version of Xen
91
92 4.5.1
93
94 Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug -efi -flask -xsm)
95
96 > - Does this only occur with HVM guests?
97
98 The host has been running only HVM guests every time it happened.
99 It was running a PV guest in between (which I had to shut down
100 because other VMs were migrated, requiring the RAM).
101
102 > - Which network-driver are you using inside the guest
103
104 r8169, compiled as a module
105
106 Same happened with the tg3 driver when the on-board cards were used.
107 The tg3 driver is completely disabled in the kernel config, i.e.
108 not even compiled as a module.
109
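One way to double-check which driver an interface is bound to, and that
tg3 really is not loaded, is to read sysfs and /proc/modules. The
following is only a minimal Python sketch, assuming the usual Linux
layout (a /sys/class/net/<iface>/device/driver symlink and a
/proc/modules file); the function names are made up for illustration.

```python
import os


def bound_driver(iface, sysfs_root="/sys"):
    """Return the name of the kernel driver bound to iface, or None.

    Assumes the usual layout where
    <sysfs_root>/class/net/<iface>/device/driver is a symlink to the
    driver directory. Virtual interfaces (bridges, lo) have no such link.
    """
    link = os.path.join(sysfs_root, "class", "net", iface, "device", "driver")
    if not os.path.islink(link):
        return None
    # The last path component of the link target is the driver name.
    return os.path.basename(os.readlink(link))


def loaded_modules(proc_modules="/proc/modules"):
    """Return the set of module names listed in a /proc/modules style file."""
    with open(proc_modules) as fh:
        # The module name is the first whitespace-separated field.
        return {line.split()[0] for line in fh if line.strip()}
```

On the host described here, bound_driver("enp55s4") would be expected to
name r8169, and "tg3" in loaded_modules() should be False. Both paths can
be overridden, so the functions can also be tried against copies of the
files.
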
110 > - Can you connect to the "local" console of the guest?
111
112 Yes, the host seems to be running fine except for having no network
113 connectivity. There's a keyboard and monitor physically connected to
114 it with which you can log in and do stuff.
115
116 You get no answer when you ping the host while it is unreachable.
117
118 > - If yes, does it still have no connectivity?
119
120 It has been restarted this morning when it was found to be unreachable.
121
122 > I saw the same on my lab machine, which was related to:
123 > - Not using correct drivers inside HVM guests
124
125 There are Windoze 7 guests running that have PV drivers installed.
126 One of those has formerly been running on a VMware host and was
127 migrated on Tuesday. I deinstalled the VMware tools from it.
128
129 Since Monday, an HVM Linux system (a modified 32-bit Debian) has also
130 been migrated from the VMware host to this one. I don't know if it
131 has VMware tools installed (I guess it does because it could be shut
132 down via VMware) and how those might react now. It's working, and I
133 don't want to touch it.
134
135 However, the problem already occurred before this migration, when the
136 on-board cards were still used.
137
138 > - Switch hardware not keeping the MAC/IP/Port lists long enough
139
140 What might be the reason for the lists becoming too short? Too many
141 devices connected to the network?
142
143 The host has been connected to two different switches and showed the
144 problem. Previously, that was an 8-port 1Gb switch, now it's a 24-port
145 1Gb switch. However, the 8-port switch is also connected to the 24-port
146 switch the host is now connected to. (The 24-port switch connects it
147 "directly" to the rest of the network.)
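
To see when the transmit-queue timeout quoted above recurs, and how many
"link up" messages follow it, the log can be scanned for those two
lines. This is a rough Python sketch, not a finished tool: the regular
expressions are modelled only on the messages quoted in this mail, and
the function names are hypothetical.

```python
import re

# Pattern modelled on the quoted kernel message:
#   NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(text):
    """Return (iface, driver, queue) for each watchdog timeout in text."""
    return [
        (m["iface"], m["driver"], int(m["queue"]))
        for m in WATCHDOG_RE.finditer(text)
    ]


def count_link_up(text, iface):
    """Count '<iface>: link up' messages in text."""
    return len(re.findall(rf"\b{re.escape(iface)}: link up\b", text))
```

The quoted log was re-wrapped by a mail client, so the sketch assumes it
is run against the original log file, where each kernel message sits on
one line.
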
