On Thursday, October 15, 2015 05:46:07 PM hw wrote:
> J. Roeleveld wrote:
> > On Thursday, October 15, 2015 03:30:01 PM hw wrote:
> >> Hi,
> >>
> >> I have a Xen host with some HV guests which becomes unreachable via
> >> the network after an apparently random amount of time. I have already
> >> switched the network card to see if that would make a difference,
> >> and with the card currently installed, it worked fine for over 20 days
> >> until it became unreachable again. Before switching the network card,
> >> it would run a week or two before becoming unreachable. The previous
> >> card was the on-board BCM5764M which uses the tg3 driver.
> >>
> >> There are messages like this in the log file:
> >>
> >>
> >> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
> >> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
> >> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
> >> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
> >> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
> >> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
> >> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
> >> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
> >> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
> >> Oct 14 20:58:02 moonflo kernel: Call Trace:
> >> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
> >> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
> >> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
> >> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
> >> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
> >>
> >>
> >> After that, there are lots of messages about the link being up, one
> >> message every 12 seconds. When you unplug the network cable, you get a
> >> message that the link is down, and no message when you plug it in again.
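To get a quick overview of how often this happens, the log can be scanned for the watchdog timeouts and the repeated link messages. What follows is only a minimal Python sketch, assuming syslog lines in the format quoted above; the function name `summarize` and the two patterns are made up for illustration and are not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches the watchdog message format seen in the quoted log, e.g.
# "NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)

# Matches link state messages such as
# "r8169 0000:37:04.0 enp55s4: link up"
LINK_RE = re.compile(r"(?P<iface>\S+): link (?P<state>up|down)\s*$")


def summarize(lines):
    """Return (watchdog timeouts, link event counts) for an iterable of log lines.

    Timeouts are (interface, driver, queue) tuples in order of appearance;
    link events are counted per (interface, state) pair.
    """
    timeouts = []
    link_events = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            timeouts.append(
                (match["iface"], match["driver"], int(match["queue"]))
            )
            continue
        match = LINK_RE.search(line)
        if match:
            link_events[(match["iface"], match["state"])] += 1
    return timeouts, link_events
```

Feeding it the lines of the log file (for example `summarize(open(path))`, with `path` pointing at wherever the kernel messages end up on the host) shows which interface and driver timed out and how many link up/down messages followed.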
> >>
> >> I was hoping that switching the network card (to one that uses a
> >> different driver) might solve the problem, and it did not. Now I can
> >> only guess that the network card goes to sleep and sometimes cannot be
> >> woken up again.
> >>
> >> I tried to reduce the connection speed to 100Mbit and found that
> >> accessing the VMs (via RDP) becomes too slow to use them. So I disabled
> >> the power management of the network card (through sysfs) and will have
> >> to see if the problem persists.
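The post does not say which sysfs file was changed. One common knob is the runtime power management `power/control` attribute of the underlying device, where "on" keeps the device powered and "auto" lets the kernel suspend it. The following Python sketch assumes that attribute is what was meant; the function name `disable_runtime_pm` and the `sysfs_root` parameter are hypothetical and exist only to make the example testable.

```python
from pathlib import Path


def disable_runtime_pm(iface, sysfs_root="/sys"):
    """Write "on" to the interface's power/control file.

    Returns (path of the control file, previous setting).
    Assumes /sys/class/net/<iface>/device/power/control exists,
    which is the case for PCI network cards on typical kernels.
    Needs root privileges when run against the real /sys.
    """
    control = (
        Path(sysfs_root) / "class" / "net" / iface
        / "device" / "power" / "control"
    )
    previous = control.read_text().strip()
    # "on" disables runtime suspend; "auto" would re-enable it.
    control.write_text("on\n")
    return control, previous
```

For the card in the log above this would be called as `disable_runtime_pm("enp55s4")`. The setting does not survive a reboot, so it would have to be reapplied at boot for a test that runs over several weeks.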
> >>
> >> We'll be getting decent network cards in a couple of days, but since the
> >> problem doesn't seem to be related to a particular
> >> card/model/manufacturer, that might not fix it, either.
> >>
> >> This problem seems to only occur on machines that operate as a Xen
> >> server. Other machines, identical Z800s, not running Xen, run just fine.
> >>
> >> What would you suggest?
> >
> > More info required:
> >
> > - Which version of Xen
>
> 4.5.1
>
> Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
> -efi -flask -xsm)

Ok, recent one.

> > - Does this only occur with HVM guests?
>
> The host has been running only HVM guests every time it happened.
> It was running a PV guest in between (which I had to shut down
> because other VMs were migrated, requiring the RAM).

The PV guest didn't have any issues?

> > - Which network-driver are you using inside the guest
>
> r8169, compiled as a module
>
> The same happened with the tg3 driver when the on-board cards were used.
> The tg3 driver is completely disabled in the kernel config, i.e.
> not even compiled as a module.

You have network cards assigned to the guests?

> > - Can you connect to the "local" console of the guest?
>
> Yes, the host seems to be running fine except for having no network
> connectivity. There's a keyboard and monitor physically connected to
> it with which you can log in and do stuff.

The HOST loses network connectivity?

> You get no answer when you ping the host while it is unreachable.
>
> > - If yes, does it still have no connectivity?
>
> It has been restarted this morning when it was found to be unreachable.
>
> > I saw the same on my lab machine, which was related to:
> > - Not using correct drivers inside HVM guests
>
> There are Windoze 7 guests running that have PV drivers installed.
> One of those was formerly running on a VMware host and was
> migrated on Tuesday. I uninstalled the VMware tools from it.

Which PV drivers?
And did you ensure all VMware-related drivers were removed?
I am not convinced uninstalling the VMware tools is sufficient.

> Since Monday, an HVM Linux system (a modified 32-bit Debian) has also
> been migrated from the VMware host to this one. I don't know if it
> has VMware tools installed (I guess it does because it could be shut
> down via VMware) and how those might react now. It's working, and I
> don't want to touch it.
>
> However, the problem already occurred before this migration, when the
> on-board cards were still used.
>
> > - Switch hardware not keeping the MAC/IP/Port lists long enough
>
> What might be the reason for the lists becoming too short? Too many
> devices connected to the network?

No network activity for a while (clean installs, nothing running), so the
switch forgets the MAC address assigned to the VM.

When I connected to the VM console and pinged www.google.com from there,
the connectivity re-appeared.

> The host has been connected to two different switches and showed the
> problem. Previously, that was an 8-port 1Gb switch, now it's a 24-port
> 1Gb switch. However, the 8-port switch is also connected to the 24-port
> switch the host is now connected to. (The 24-port switch connects it
> "directly" to the rest of the network.)

Assuming it's a managed switch, you could test this by checking whether
the VM's MAC address is still in its address table.
Alternatively, check if you can access the VMs from the host.
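
The check from the host can be scripted so it is quick to run the next time the machine drops off the network. This is only a rough Python sketch, assuming a Linux host with the iputils `ping` command; the function names `ping_once` and `check_guests` are made up for illustration, and the guest addresses have to be filled in by hand.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Return True if one ICMP echo request to host is answered.

    Uses the iputils ping options -c (count) and -W (reply timeout
    in seconds); other ping implementations may differ.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_guests(guests, ping=ping_once):
    """Map each guest address to True (reachable) or False."""
    return {guest: ping(guest) for guest in guests}
```

If the guests answer from the host's console while the host itself cannot be reached from the rest of the network, the bridge and the guests are fine and the problem sits between the physical card and the switch.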

--
Joost