J. Roeleveld wrote:
> On Thursday, October 15, 2015 05:46:07 PM hw wrote:
>> J. Roeleveld wrote:
>>> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
>>>> Hi,
>>>>
>>>> I have a xen host with some HVM guests which becomes unreachable via
>>>> the network after an apparently random amount of time. I have already
>>>> switched the network card to see if that would make a difference,
>>>> and with the card currently installed, it worked fine for over 20 days
>>>> until it became unreachable again. Before switching the network card,
>>>> it would run a week or two before becoming unreachable. The previous
>>>> card was the on-board BCM5764M, which uses the tg3 driver.
>>>>
>>>> There are messages like this in the log file:
>>>>
>>>>
>>>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
>>>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
>>>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>>>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>>>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
>>>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>>>> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
>>>> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
>>>> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
>>>> Oct 14 20:58:02 moonflo kernel: Call Trace:
>>>> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>>>> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>>>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
>>>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>>>>
>>>>
>>>> After that, there are lots of messages about the link being up, one
>>>> message every 12 seconds. When you unplug the network cable, you get
>>>> a message that the link is down, and no message when you plug it in
>>>> again.
>>>>
>>>> I was hoping that switching the network card (to one that uses a
>>>> different driver) might solve the problem, and it did not. Now I can
>>>> only guess that the network card goes to sleep and sometimes cannot
>>>> be woken up again.
>>>>
>>>> I tried to reduce the connection speed to 100Mbit and found that
>>>> accessing the VMs (via RDP) becomes too slow to use them. So I
>>>> disabled the power management of the network card (through sysfs)
>>>> and will have to see if the problem persists.
>>>>
>>>> We'll be getting decent network cards in a couple of days, but since
>>>> the problem doesn't seem to be related to a particular
>>>> card/model/manufacturer, that might not fix it, either.
>>>>
>>>> This problem seems to only occur on machines that operate as a xen
>>>> server. Other machines, identical Z800s, not running xen, run just
>>>> fine.
>>>>
>>>> What would you suggest?
>>>
>>> More info required:
>>>
>>> - Which version of Xen
>>
>> 4.5.1
>>
>> Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
>> -efi -flask -xsm)
>
> Ok, recent one.
>
>>> - Does this only occur with HVM guests?
>>
>> The host has been running only HVM guests every time it happened.
>> It was running a PV guest in between (which I had to shut down
>> because other VMs were migrated, requiring the RAM).
>
> The PV didn't have any issues?

The whole server has the issue, not a particular VM. While the PV guest
was running, the server didn't freeze.

>>> - Which network-driver are you using inside the guest
>>
>> r8169, compiled as a module
>>
>> The same happened with the tg3 driver when the on-board cards were used.
>> The tg3 driver is completely disabled in the kernel config, i.e.
>> not even compiled as a module.
>
> You have network cards assigned to the guests?

No, they are all connected via a bridge.

I enabled STP on the bridge and the server was ok for a week, then had
to be restarted. I'm seeing lots of messages in the log:


Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating


and sometimes:

Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor 8000.00:00:10:11:12:00 lost


Any idea what this means?

(Google has gone on strike, and another search engine didn't give any useful
findings ...)
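
For what it's worth, the bridge's own view of STP can be read straight from
sysfs. A minimal sketch, assuming the bridge is named brloc as in the logs
above (the attribute names come from the kernel's bridge driver):

```shell
# Dump the bridge's STP status from sysfs; "brloc" is the bridge name
# seen in the log messages above. Degrades gracefully if it's absent.
BR=brloc
if [ -d "/sys/class/net/$BR/bridge" ]; then
    echo "stp_state:       $(cat "/sys/class/net/$BR/bridge/stp_state")"
    echo "topology_change: $(cat "/sys/class/net/$BR/bridge/topology_change")"
    echo "root_id:         $(cat "/sys/class/net/$BR/bridge/root_id")"
else
    echo "bridge $BR not present on this machine"
fi
```

`brctl showstp brloc` shows the same information per port; topology_change
repeatedly flipping to 1 would match the flood of tcn bpdu messages.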


>>> - Can you connect to the "local" console of the guest?
>>
>> Yes, the host seems to be running fine except for having no network
>> connectivity. There's a keyboard and monitor physically connected to
>> it with which you can log in and do stuff.
>
> The HOST loses network connectivity?

Yes.

Apparently when it became unresponsive yesterday, it was not possible
to log in at the console, either. I wasn't there yesterday, though I've
seen that happen before. We tried to shut it down via acpid by pressing
the power button. It didn't turn off, so it was switched off by holding
the power button. What I can see in the log is:


Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/768
Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge: brctl delif brloc vif2.0 failed
Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge brloc.
Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge: remove type_if=tap XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge: Successful vif-bridge remove for vif2.0-emu, bridge brloc.
Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Oct 28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up; version='3.6.2'


And:



Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic video spl(O) backlight zlib_deflate drm_kms_helper snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: P O 4.0.5-gentoo #3
Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
Oct 24 11:47:42 moonflo kernel: ffffffff8175a77d ffff880124d83d98 ffffffff814da8d8 0000000000000001
Oct 24 11:47:42 moonflo kernel: ffff880124d83de8 ffff880124d83dd8 ffffffff81088850 ffff880124d83e68
Oct 24 11:47:42 moonflo kernel: 0000000000000000 ffff88011efd8000 0000000000000001 ffff8800d4eb5e80
Oct 24 11:47:42 moonflo kernel: Call Trace:
Oct 24 11:47:42 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
Oct 24 11:47:42 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
Oct 24 11:47:42 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
Oct 24 11:47:42 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
Oct 24 11:47:42 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
Oct 24 11:47:42 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
Oct 24 11:47:42 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
Oct 24 11:47:42 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
Oct 24 11:47:42 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
Oct 24 11:47:42 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up


That was two days before it went down. After that, the messages about
topology changes started to appear.

I'm not sure if I should call this "progress" ;)

>
>> You get no answer when you ping the host while it is unreachable.
>>
>>> - If yes, does it still have no connectivity?
>>
>> It was restarted this morning when it was found to be unreachable.
>>
>>> I saw the same on my lab machine, which was related to:
>>> - Not using correct drivers inside HVM guests
>>
>> There are Windoze 7 guests running that have PV drivers installed.
>> One of those formerly ran on a VMware host and was migrated on
>> Tuesday. I deinstalled the VMware tools from it.
>
> Which PV drivers?

Xen GPL PV Driver Developers
17.09.2014
0.11.0.373
Univention GmbH

> And did you ensure all VMWare related drivers were removed?
> I am not convinced uninstalling the VMWare tools is sufficient.

What would I need to look at to make sure they are removed?

The problem was already there before the VM with the VMware drivers was
migrated to this server, so I don't think they are causing it.


>> Since Monday, an HVM Linux system (a modified 32-bit Debian) has also
>> been migrated from the VMware host to this one. I don't know if it
>> has VMware tools installed (I guess it does, because it could be shut
>> down via VMware) or how those might react now. It's working, and I
>> don't want to touch it.
>>
>> However, the problem already occurred before this migration, when the
>> on-board cards were still in use.
>>
>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
>>
>> What might be the reason for the lists becoming too short? Too many
>> devices connected to the network?
>
> No network activity for a while. (clean installs, nothing running)
> Switch forgetting the MAC-address assigned to the VM.
>
> Connecting to the VM-console, I could ping www.google.com and then the
> connectivity re-appeared.

Half of the switches were replaced last week in order to track down what
appears to be a weird network problem: the RDP clients are being randomly
stalled. If it were only that, I'd suspect this server some more, but the
internet connection goes through the same switches and is apparently also
slowed down while the RDP clients are stalled. The RDP clients also
stalled randomly when they were accessing a totally different server (the
VMware server), so this might be entirely unrelated.

Replacing the switches didn't fix the problem, so I'll probably put them
back into service and replace the other half.

>> The host has been connected to two different switches and showed the
>> problem. Previously, that was an 8-port 1Gb switch; now it's a 24-port
>> 1Gb switch. However, the 8-port switch is also connected to the 24-port
>> switch the host is now connected to. (The 24-port switch connects it
>> "directly" to the rest of the network.)
>
> Assuming it's a managed switch, you could test this.
> Alternatively, check if you can access the VMs from the host.

Good idea, I'll try that when it happens while I'm here.
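
For reference, these are the host-side checks I'd run while the network is
down, printed here as a dry run so they can be copied at the console. A
sketch only: brloc is the bridge from the logs, and 192.168.0.50 is a
placeholder guest address, not one from this thread.

```shell
# Print the host-side checks to run while the network is unreachable.
# "brloc" is the bridge name from the logs; VM_IP is a placeholder.
BR=brloc
VM_IP=192.168.0.50
for cmd in \
    "ip -s link show $BR" \
    "ip neigh show dev $BR" \
    "ping -c 3 -W 2 $VM_IP" \
    "brctl showmacs $BR"
do
    echo "$cmd"
done
```

If the guests still answer from the host while the outside world doesn't,
the bridge and vifs are fine and the fault sits between the physical NIC
and the switch; climbing TX error counters in `ip -s link` would point
back at the watchdog timeouts.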

The network cards have arrived: Intel PRO/1000 dual port, made for IBM.
I hope I get to swap the card today. Those *really* should work.

Hm, I could plug in two of them and give each VM and the host its own
physical card. Do you think that might help?
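
If I did give each guest its own card, the cards would have to be hidden
from dom0 and handed over via PCI passthrough. A sketch of what that would
look like with xl; the address 0000:38:00.0 is a placeholder for whatever
the new cards enumerate as:

```
# On the host, make the device assignable (xen-pciback is already in the
# module lists above):
#   modprobe xen-pciback
#   xl pci-assignable-add 0000:38:00.0
#
# Then in the guest's xl configuration file:
pci = [ '0000:38:00.0' ]
```

That would take the bridge, and with it STP, out of the picture for those
guests, which would at least show whether the bridge is part of the problem.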