Gentoo Archives: gentoo-user

From: "J. Roeleveld" <joost@××××××××.org>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Networking trouble
Date: Fri, 16 Oct 2015 05:31:49
Message-Id: 2569546.9sPlulUjpb@andromeda
In Reply to: Re: [gentoo-user] Networking trouble by hw
1 On Thursday, October 15, 2015 05:46:07 PM hw wrote:
2 > J. Roeleveld wrote:
3 > > On Thursday, October 15, 2015 03:30:01 PM hw wrote:
4 > >> Hi,
5 > >>
6 > >> I have a Xen host with some HVM guests which becomes unreachable via
7 > >> the network after apparently random amounts of time. I have already
8 > >> switched the network card to see if that would make a difference,
9 > >> and with the card currently installed, it worked fine for over 20 days
10 > >> until it became unreachable again. Before switching the network card,
11 > >> it would run a week or two before becoming unreachable. The previous
12 > >> card was the on-board BCM5764M which uses the tg3 driver.
13 > >>
14 > >> There are messages like this in the log file:
15 > >>
16 > >>
17 > >> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
18 > >> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
19 > >> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
20 > >> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
21 > >> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
22 > >> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
23 > >> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
24 > >> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
25 > >> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
26 > >> Oct 14 20:58:02 moonflo kernel: Call Trace:
27 > >> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
28 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
29 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
30 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
31 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
32 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
33 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
34 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
35 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
36 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
37 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
38 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
39 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
40 > >> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
41 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
42 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
43 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
44 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
45 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
46 > >> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
47 > >> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
48 > >> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
78 > >>
79 > >>
80 > >> After that, there are lots of messages about the link being up, one
81 > >> message every 12 seconds. When you unplug the network cable, you get a
82 > >> message that the link is down, and no message when you plug it in again.
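
To put numbers on the two symptoms described above (the transmit-queue timeout and the repeating "link up" messages), a log file can be scanned with a short script. This is only a minimal sketch: it assumes a syslog-style file with one entry per line, the function name summarize is made up for illustration, and the two patterns are modelled solely on the messages quoted in this thread, so other drivers may word theirs differently.

```python
import re
from collections import Counter

# Patterns modelled on the two message shapes quoted in this thread;
# other drivers or kernel versions may format them differently.
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)
LINK_UP = re.compile(r"kernel: \S+ \S+ (?P<iface>\S+): link up")


def summarize(lines):
    """Return (list of watchdog timeouts, per-interface "link up" counts)."""
    timeouts = []
    link_up = Counter()
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            timeouts.append(
                (match["iface"], match["driver"], int(match["queue"]))
            )
            continue
        match = LINK_UP.search(line)
        if match:
            link_up[match["iface"]] += 1
    return timeouts, link_up


if __name__ == "__main__":
    sample = [
        "Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): "
        "transmit queue 0 timed out",
        "Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up",
    ]
    timeouts, link_up = summarize(sample)
    print(timeouts)
    print(dict(link_up))
```

Feeding it the real log (for example, the lines of an opened file object) would show whether each timeout is followed by the same run of "link up" messages.
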
84 > >>
85 > >> I was hoping that switching the network card (to one that uses a
86 > >> different driver) might solve the problem, but it did not. Now I can
87 > >> only guess that the network card goes to sleep and sometimes cannot be
88 > >> woken up again.
89 > >>
90 > >> I tried to reduce the connection speed to 100Mbit and found that
91 > >> accessing the VMs (via RDP) becomes too slow to use them. So I disabled
92 > >> the power management of the network card (through sysfs) and will have
93 > >> to see if the problem persists.
94 > >>
95 > >> We'll be getting decent network cards in a couple of days, but since the
96 > >> problem doesn't seem to be related to a particular
97 > >> card/model/manufacturer, that might not fix it, either.
98 > >>
99 > >> This problem seems to only occur on machines that operate as a Xen
100 > >> server. Other machines, identical Z800s, not running Xen, run just fine.
106 > >>
107 > >> What would you suggest?
108 > >
109 > > More info required:
110 > >
111 > > - Which version of Xen
112 >
113 > 4.5.1
114 >
115 > Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
116 > -efi -flask -xsm)
117
118 Ok, recent one.
119
120 > > - Does this only occur with HVM guests?
121 >
122 > The host has been running only HVM guests every time it happened.
123 > It was running a PV guest in between (which I had to shut down
124 > because other VMs were migrated, requiring the RAM).
125
126 The PV guest didn't have any issues?
127
128 > > - Which network-driver are you using inside the guest
129 >
130 > r8169, compiled as a module
131 >
132 > Same happened with the tg3 driver when the on-board cards were used.
133 > The tg3 driver is completely disabled in the kernel config, i. e.
134 > not even compiled as a module.
135
136 You have network cards assigned to the guests?
137
138 > > - Can you connect to the "local" console of the guest?
139 >
140 > Yes, the host seems to be running fine except for having no network
141 > connectivity. There's a keyboard and monitor physically connected to
142 > it with which you can log in and do stuff.
143
144 The HOST loses network connectivity?
145
146 > You get no answer when you ping the host while it is unreachable.
147 >
148 > > - If yes, does it still have no connectivity?
149 >
150 > It has been restarted this morning when it was found to be unreachable.
151 >
152 > > I saw the same on my lab machine, which was related to:
153 > > - Not using correct drivers inside HVM guests
154 >
155 > There are Windoze 7 guests running that have PV drivers installed.
156 > One of those has formerly been running on a VMware host and was
157 > migrated on Tuesday. I uninstalled the VMware tools from it.
158
159 Which PV drivers?
160 And did you ensure all VMware-related drivers were removed?
161 I am not convinced uninstalling the VMware tools is sufficient.
162
163 > Since Monday, an HVM Linux system (a modified 32-bit Debian) has also
164 > been migrated from the VMware host to this one. I don't know if it
165 > has VMware tools installed (I guess it does because it could be shut
166 > down via VMware) and how those might react now. It's working, and I
167 > don't want to touch it.
168 >
169 > However, the problem already occurred before this migration, when the
170 > on-board cards were still used.
171 >
172 > > - Switch hardware not keeping the MAC/IP/Port lists long enough
173 >
174 > What might be the reason for the lists becoming too short? Too many
175 > devices connected to the network?
176
177 No network activity for a while (clean installs, nothing running), so
178 the switch forgets the MAC address assigned to the VM.
179
180 When I connected to the VM console and pinged www.google.com, the
181 connectivity reappeared.
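
If an idle guest dropping out of the switch's MAC table is the suspect, one way to test the theory is to make the guest send a little traffic at regular intervals and see whether the outages stop. The following is a minimal sketch of that idea, not a fix: the function names are invented for illustration, the target 192.0.2.1 is a documentation placeholder for something like the default gateway, and the ping options (-c, -W) assume the Linux iputils ping.

```python
import subprocess
import time


def ping_once(target, timeout_s=2):
    """Send one ICMP echo request via the system ping; True if it answered.

    Assumes the Linux iputils ping (-c count, -W reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keepalive(target, interval_s, rounds, ping=ping_once, sleep=time.sleep):
    """Ping target `rounds` times, pausing interval_s between attempts.

    Returns the list of per-attempt results. The ping and sleep callables
    are parameters so the loop can be tried out without touching the network.
    """
    results = []
    for attempt in range(rounds):
        results.append(ping(target))
        if attempt < rounds - 1:
            sleep(interval_s)
    return results
```

Something like keepalive("192.0.2.1", interval_s=60, rounds=5) run inside the guest would generate one frame a minute. The interval has to be shorter than the switch's aging time, which is switch-specific and would need to be looked up.
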
182
183 > The host has been connected to two different switches and showed the
184 > problem. Previously, that was an 8-port 1Gb switch, now it's a 24-port
185 > 1Gb switch. However, the 8-port switch is also connected to the 24-port
186 > switch the host is now connected to. (The 24-port switch connects it
187 > "directly" to the rest of the network.)
188
189 Assuming it's a managed switch, you could test this.
190 Alternatively, check if you can access the VMs from the host.
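
The check from the host can be scripted so it is quick to repeat the next time the machine drops off the network. A minimal sketch, with several assumptions: the guests are reached over RDP as mentioned earlier in the thread, so it tries a TCP connection to port 3389 (the usual RDP port); the function names are invented for illustration; and the guest names and 192.0.2.x addresses in the usage note are placeholders.

```python
import socket


def tcp_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_guests(guests, port=3389):
    """Map each guest name to whether its address accepts TCP on `port`.

    `guests` is a dict of name -> address; 3389 is the usual RDP port.
    """
    return {name: tcp_reachable(addr, port) for name, addr in guests.items()}
```

For example, check_guests({"win7-a": "192.0.2.10", "debian": "192.0.2.11"}, port=22) would test SSH instead of RDP. If the guests answer from the host while the host itself cannot be reached from outside, the bridge and the guests are probably fine and the problem sits at the physical card or beyond it.
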
191
192 --
193 Joost
