J. Roeleveld wrote:
> On Thursday, October 15, 2015 05:46:07 PM hw wrote:
>> J. Roeleveld wrote:
>>> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
>>>> Hi,
>>>>
>>>> I have a xen host with some HVM guests which becomes unreachable via
>>>> the network after an apparently random amount of time. I have already
>>>> switched the network card to see if that would make a difference,
>>>> and with the card currently installed, it worked fine for over 20 days
>>>> until it became unreachable again. Before switching the network card,
>>>> it would run a week or two before becoming unreachable. The previous
>>>> card was the on-board BCM5764M, which uses the tg3 driver.
>>>>
>>>> There are messages like this in the log file:
>>>>
>>>>
>>>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
>>>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
>>>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>>>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>>>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
>>>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>>>> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
>>>> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
>>>> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
>>>> Oct 14 20:58:02 moonflo kernel: Call Trace:
>>>> Oct 14 20:58:02 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>>>> Oct 14 20:58:02 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>>>> Oct 14 20:58:02 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>>>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
>>>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
>>>>
>>>>
>>>> After that, there are lots of messages about the link being up, one
>>>> message every 12 seconds. When you unplug the network cable, you get
>>>> a message that the link is down, and no message when you plug it in
>>>> again.
>>>>
>>>> I was hoping that switching the network card (to one that uses a
>>>> different driver) might solve the problem, and it did not. Now I can
>>>> only guess that the network card goes to sleep and sometimes cannot
>>>> be woken up again.
>>>>
>>>> I tried to reduce the connection speed to 100Mbit and found that
>>>> accessing the VMs (via RDP) becomes too slow to use them. So I
>>>> disabled the power management of the network card (through sysfs)
>>>> and will have to see if the problem persists.
>>>>
>>>> We'll be getting decent network cards in a couple of days, but since
>>>> the problem doesn't seem to be related to a particular
>>>> card/model/manufacturer, that might not fix it, either.
>>>>
>>>> This problem seems to only occur on machines that operate as a xen
>>>> server. Other machines, identical Z800s, not running xen, run just
>>>> fine.
>>>>
>>>> What would you suggest?
>>>
>>> More info required:
>>>
>>> - Which version of Xen
>>
>> 4.5.1
>>
>> Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
>> -efi -flask -xsm)
>
> Ok, recent one.
>
>>> - Does this only occur with HVM guests?
>>
>> The host has been running only HVM guests every time it happened.
>> It was running a PV guest in between (which I had to shut down
>> because other VMs were migrated, requiring the RAM).
>
> The PV didn't have any issues?

The whole server has the issue, not a particular VM. While the PV guest
was running, the server didn't freeze.

>>> - Which network-driver are you using inside the guest
>>
>> r8169, compiled as a module
>>
>> The same happened with the tg3 driver when the on-board cards were used.
>> The tg3 driver is completely disabled in the kernel config, i.e.
>> not even compiled as a module.
>
> You have network cards assigned to the guests?

No, they are all connected via a bridge.

I enabled STP on the bridge and the server was ok for a week, then had
to be restarted. I'm seeing lots of messages in the log:


Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating


and sometimes:

Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor 8000.00:00:10:11:12:00 lost


Any idea what this means?

(Google has gone on strike, and another search engine didn't give any useful
findings ...)
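
For what it's worth, the bridge's own view of STP can be read straight from
sysfs. A minimal sketch, assuming the bridge is named brloc as in the logs
above (the attribute names come from the kernel's bridge driver):

```shell
# Dump the bridge's STP status from sysfs; "brloc" is the bridge name
# seen in the log messages above. Degrades gracefully if it's absent.
BR=brloc
if [ -d "/sys/class/net/$BR/bridge" ]; then
    echo "stp_state:       $(cat "/sys/class/net/$BR/bridge/stp_state")"
    echo "topology_change: $(cat "/sys/class/net/$BR/bridge/topology_change")"
    echo "root_id:         $(cat "/sys/class/net/$BR/bridge/root_id")"
else
    echo "bridge $BR not present on this machine"
fi
```

`brctl showstp brloc` shows the same information per port; topology_change
repeatedly flipping to 1 would match the flood of tcn bpdu messages.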


>>> - Can you connect to the "local" console of the guest?
>>
>> Yes, the host seems to be running fine except for having no network
>> connectivity. There's a keyboard and monitor physically connected to
>> it with which you can log in and do stuff.
>
> The HOST loses network connectivity?

Yes.

Apparently when it became unresponsive yesterday, it was not possible
to log in at the console, either. I wasn't there yesterday, though I've
seen that happen before. We tried to shut it down via acpid by pressing
the power button. It didn't turn off, so it was switched off by holding
the power button. What I can see in the log is:


Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/768
Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge: brctl delif brloc vif2.0 failed
Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge brloc.
Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge: remove type_if=tap XENBUS_PATH=backend/vif/2/0
Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge: Successful vif-bridge remove for vif2.0-emu, bridge brloc.
Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Oct 28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up; version='3.6.2'


And:



Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic video spl(O) backlight zlib_deflate drm_kms_helper snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: P O 4.0.5-gentoo #3
Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
Oct 24 11:47:42 moonflo kernel: ffffffff8175a77d ffff880124d83d98 ffffffff814da8d8 0000000000000001
Oct 24 11:47:42 moonflo kernel: ffff880124d83de8 ffff880124d83dd8 ffffffff81088850 ffff880124d83e68
Oct 24 11:47:42 moonflo kernel: 0000000000000000 ffff88011efd8000 0000000000000001 ffff8800d4eb5e80
Oct 24 11:47:42 moonflo kernel: Call Trace:
Oct 24 11:47:42 moonflo kernel: <IRQ> [<ffffffff814da8d8>] dump_stack+0x45/0x57
Oct 24 11:47:42 moonflo kernel: [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
Oct 24 11:47:42 moonflo kernel: [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
Oct 24 11:47:42 moonflo kernel: [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b819>] dev_watchdog+0x259/0x270
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel: [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
Oct 24 11:47:42 moonflo kernel: [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
Oct 24 11:47:42 moonflo kernel: [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
Oct 24 11:47:42 moonflo kernel: [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
Oct 24 11:47:42 moonflo kernel: [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
Oct 24 11:47:42 moonflo kernel: <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
Oct 24 11:47:42 moonflo kernel: [<ffffffff81053979>] ? default_idle+0x9/0x10
Oct 24 11:47:42 moonflo kernel: [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
Oct 24 11:47:42 moonflo kernel: [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
Oct 24 11:47:42 moonflo kernel: [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up


That was two days before it went down. After that, the messages about
topology changes started to appear.

I'm not sure if I should call this "progress" ;)

>
>> You get no answer when you ping the host while it is unreachable.
>>
>>> - If yes, does it still have no connectivity?
>>
>> It was restarted this morning when it was found to be unreachable.
>>
>>> I saw the same on my lab machine, which was related to:
>>> - Not using correct drivers inside HVM guests
>>
>> There are Windoze 7 guests running that have PV drivers installed.
>> One of those formerly ran on a VMware host and was migrated on
>> Tuesday. I deinstalled the VMware tools from it.
>
> Which PV drivers?

Xen GPL PV Driver Developers
17.09.2014
0.11.0.373
Univention GmbH

> And did you ensure all VMWare related drivers were removed?
> I am not convinced uninstalling the VMWare tools is sufficient.

What would I need to look at to make sure they are removed?

The problem was already there before the VM with the VMware drivers was
migrated to this server, so I don't think they are causing it.


>> Since Monday, an HVM Linux system (a modified 32-bit Debian) has also
>> been migrated from the VMware host to this one. I don't know if it
>> has VMware tools installed (I guess it does, because it could be shut
>> down via VMware) or how those might react now. It's working, and I
>> don't want to touch it.
>>
>> However, the problem already occurred before this migration, when the
>> on-board cards were still in use.
>>
>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
>>
>> What might be the reason for the lists becoming too short? Too many
>> devices connected to the network?
>
> No network activity for a while. (clean installs, nothing running)
> Switch forgetting the MAC-address assigned to the VM.
>
> Connecting to the VM-console, I could ping www.google.com and then the
> connectivity re-appeared.

Half of the switches were replaced last week in order to track down what
appears to be a weird network problem: the RDP clients are being randomly
stalled. If it were only that, I'd suspect this server some more, but the
internet connection goes through the same switches and is apparently also
slowed down while the RDP clients are stalled. The RDP clients also
stalled randomly when they were accessing a totally different server (the
VMware server), so this might be entirely unrelated.

Replacing the switches didn't fix the problem, so I'll probably put them
back into service and replace the other half.

>> The host has been connected to two different switches and showed the
>> problem. Previously, that was an 8-port 1Gb switch; now it's a 24-port
>> 1Gb switch. However, the 8-port switch is also connected to the 24-port
>> switch the host is now connected to. (The 24-port switch connects it
>> "directly" to the rest of the network.)
>
> Assuming it's a managed switch, you could test this.
> Alternatively, check if you can access the VMs from the host.

Good idea, I'll try that when it happens while I'm here.
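
For reference, these are the host-side checks I'd run while the network is
down, printed here as a dry run so they can be copied at the console. A
sketch only: brloc is the bridge from the logs, and 192.168.0.50 is a
placeholder guest address, not one from this thread.

```shell
# Print the host-side checks to run while the network is unreachable.
# "brloc" is the bridge name from the logs; VM_IP is a placeholder.
BR=brloc
VM_IP=192.168.0.50
for cmd in \
    "ip -s link show $BR" \
    "ip neigh show dev $BR" \
    "ping -c 3 -W 2 $VM_IP" \
    "brctl showmacs $BR"
do
    echo "$cmd"
done
```

If the guests still answer from the host while the outside world doesn't,
the bridge and vifs are fine and the fault sits between the physical NIC
and the switch; climbing TX error counters in `ip -s link` would point
back at the watchdog timeouts.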

The network cards have arrived: Intel PRO/1000 dual port, made for IBM.
I hope I get to swap the card today. Those *really* should work.

Hm, I could plug in two of them and give each VM and the host its own
physical card. Do you think that might help?
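
If I did give each guest its own card, the cards would have to be hidden
from dom0 and handed over via PCI passthrough. A sketch of what that would
look like with xl; the address 0000:38:00.0 is a placeholder for whatever
the new cards enumerate as:

```
# On the host, make the device assignable (xen-pciback is already in the
# module lists above):
#   modprobe xen-pciback
#   xl pci-assignable-add 0000:38:00.0
#
# Then in the guest's xl configuration file:
pci = [ '0000:38:00.0' ]
```

That would take the bridge, and with it STP, out of the picture for those
guests, which would at least show whether the bridge is part of the problem.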