J. Roeleveld wrote:
> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
>> Hi,
>>
>> I have a xen host with some HV guests which becomes unreachable via
>> the network after apparently random amounts of time. I have already
>> switched the network card to see if that would make a difference,
>> and with the card currently installed, it worked fine for over 20 days
>> until it became unreachable again. Before switching the network card,
>> it would run a week or two before becoming unreachable. The previous
>> card was the on-board BCM5764M which uses the tg3 driver.
>>
>> There are messages like this in the log file:
>>
>>
>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
>> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
>> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
>> Oct 14 20:58:02 moonflo kernel: Call Trace:
>> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>>
>>
>> After that, there are lots of messages about the link being up, one message
>> every 12 seconds. When you unplug the network cable, you get a message that
>> the link is down, and no message when you plug it in again.
>>
>> I was hoping that switching the network card (to one that uses a different
>> driver) might solve the problem, and it did not. Now I can only guess that
>> the network card goes to sleep and sometimes cannot be woken up again.
>>
>> I tried to reduce the connection speed to 100Mbit and found that accessing
>> the VMs (via RDP) becomes too slow to use them. So I disabled the power
>> management of the network card (through sysfs) and will have to see if the
>> problem persists.
>>
>> We'll be getting decent network cards in a couple days, but since the
>> problem doesn't seem to be related to a particular card/model/manufacturer,
>> that might not fix it, either.
>>
>> This problem seems to only occur on machines that operate as a xen server.
>> Other machines, identical Z800s, not running xen, run just fine.
>>
>> What would you suggest?
>
> More info required:
>
> - Which version of Xen

4.5.1

Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug -efi -flask -xsm)
|
> - Does this only occur with HVM guests?

The host has been running only HVM guests every time it happened.
It was running a PV guest in between (which I had to shut down
because other VMs were migrated, requiring the RAM).
|
> - Which network-driver are you using inside the guest

r8169, compiled as a module

Same happened with the tg3 driver when the on-board cards were used.
The tg3 driver is completely disabled in the kernel config, i.e.
not even compiled as a module.
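
For reference, the power management setting mentioned earlier can be checked per interface. This is only a minimal sketch: it assumes the setting in question is the runtime power management policy file `power/control` under the device's sysfs directory, which holds "on" (device kept powered) or "auto" (kernel may suspend it). The function names and the `sysfs_root` parameter are made up for illustration.

```python
from pathlib import Path


def runtime_pm_state(iface: str, sysfs_root: str = "/sys") -> str:
    """Return the runtime PM policy ("on" or "auto") of a network device.

    Assumption: the policy is exposed at
    <sysfs_root>/class/net/<iface>/device/power/control.
    """
    control = (
        Path(sysfs_root) / "class" / "net" / iface / "device" / "power" / "control"
    )
    return control.read_text().strip()


def runtime_pm_disabled(iface: str, sysfs_root: str = "/sys") -> bool:
    """True if the kernel is told to keep the device powered ("on")."""
    return runtime_pm_state(iface, sysfs_root) == "on"


# Example call on the affected host (interface name taken from the log):
#   runtime_pm_disabled("enp55s4")
```

If this returns False after a reboot, the setting did not persist and would have to be reapplied at boot.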
|
> - Can you connect to the "local" console of the guest?

Yes, the host seems to be running fine except for having no network
connectivity. There's a keyboard and monitor physically connected to
it with which you can log in and do stuff.

You get no answer when you ping the host while it is unreachable.

> - If yes, does it still have no connectivity?

It was restarted this morning when it was found to be unreachable.
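
Since the host gets restarted before anyone can look at it, it may help to pull the watchdog messages out of the log afterwards to see when each stall began. A minimal sketch, assuming a classic syslog layout ("Mon DD HH:MM:SS host kernel: message") with one message per line, as in the log file itself rather than in the wrapped excerpt quoted above. The names `WatchdogTimeout` and `find_watchdog_timeouts` are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class WatchdogTimeout(NamedTuple):
    timestamp: str   # e.g. "Oct 14 20:58:02"
    interface: str   # e.g. "enp55s4"
    driver: str      # e.g. "r8169"
    queue: int       # transmit queue number


# Matches lines such as:
#   Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
_WATCHDOG = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines: Iterable[str]) -> List[WatchdogTimeout]:
    """Collect every transmit-queue timeout reported by the netdev watchdog."""
    events = []
    for line in lines:
        match = _WATCHDOG.match(line)
        if match:
            events.append(
                WatchdogTimeout(
                    match.group("ts"),
                    match.group("iface"),
                    match.group("driver"),
                    int(match.group("queue")),
                )
            )
    return events


# Example (the log path is an assumption, adjust to the local syslog setup):
#   with open("/var/log/messages") as log:
#       for event in find_watchdog_timeouts(log):
#           print(event)
```

Comparing these timestamps across several incidents would show whether the stalls line up with anything else, such as guest migrations or backups.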
|
> I saw the same on my lab machine, which was related to:
> - Not using correct drivers inside HVM guests

There are Windoze 7 guests running that have PV drivers installed.
One of those was formerly running on a VMware host and was
migrated on Tuesday. I uninstalled the VMware tools from it.

Since Monday, an HVM Linux system (a modified 32-bit Debian) has also
been migrated from the VMware host to this one. I don't know if it
has VMware tools installed (I guess it does because it could be shut
down via VMware) and how those might react now. It's working, and I
don't want to touch it.

However, the problem already occurred before this migration, when the
on-board cards were still used.
|
138 |
> - Switch hardware not keeping the MAC/IP/Port lists long enough |
139 |
|
140 |
What might be the reason for the lists becoming too short? Too many |
141 |
devices connected to the network? |
142 |
|
143 |
The host has been connected to two different switches and showed the |
144 |
problem. Previously, that was an 8-port 1Gb switch, now it's a 24-port |
145 |
1Gb switch. However, the 8-port switch is also connected to the 24-port |
146 |
switch the host is now connected to. (The 24-port switch connects it |
147 |
"directly" to the rest of the network.) |