Re: [gentoo-user] Networking trouble - gentoo-user

From:	"J. Roeleveld" <joost@××××××××.org>
To:	gentoo-user@l.g.o
Subject:	Re: [gentoo-user] Networking trouble
Date:	Thu, 29 Oct 2015 17:26:09
Message-Id:	`0F927A0B-7591-42A7-8850-CFAB564E6E17@antarean.org`
In Reply to:	Re: [gentoo-user] Networking trouble by hw

On 29 October 2015 11:29:18 CET, hw <hw@×××××.de> wrote:
>J. Roeleveld wrote:
>> On Thursday, October 15, 2015 05:46:07 PM hw wrote:
>>> J. Roeleveld wrote:
>>>> On Thursday, October 15, 2015 03:30:01 PM hw wrote:
>>>>> Hi,
>>>>>
>>>>> I have a Xen host with some HVM guests which becomes unreachable via
>>>>> the network after apparently random amounts of time.  I have already
>>>>> switched the network card to see if that would make a difference,
>>>>> and with the card currently installed, it worked fine for over 20 days
>>>>> until it became unreachable again.  Before switching the network card,
>>>>> it would run a week or two before becoming unreachable.  The previous
>>>>> card was the on-board BCM5764M which uses the tg3 driver.
>>>>>
>>>>> There are messages like this in the log file:

>>>>> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
>>>>> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
>>>>> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>>>>> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>>>>> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P           O    4.0.5-gentoo #3
>>>>> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>>>>> Oct 14 20:58:02 moonflo kernel:  ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
>>>>> Oct 14 20:58:02 moonflo kernel:  ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
>>>>> Oct 14 20:58:02 moonflo kernel:  0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
>>>>> Oct 14 20:58:02 moonflo kernel: Call Trace:
>>>>> Oct 14 20:58:02 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>>>>> Oct 14 20:58:02 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>>>>> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>>>>> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
>>>>> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up

>>>>> After that, there are lots of messages about the link being up, one
>>>>> message every 12 seconds.  When you unplug the network cable, you get a
>>>>> message that the link is down, and no message when you plug it in again.
>>>>>
>>>>> I was hoping that switching the network card (to one that uses a
>>>>> different driver) might solve the problem, but it did not.  Now I can
>>>>> only guess that the network card goes to sleep and sometimes cannot be
>>>>> woken up again.
>>>>>
>>>>> I tried to reduce the connection speed to 100Mbit and found that
>>>>> accessing the VMs (via RDP) becomes too slow to use them.  So I disabled
>>>>> the power management of the network card (through sysfs) and will have
>>>>> to see if the problem persists.
>>>>>
>>>>> We'll be getting decent network cards in a couple of days, but since the
>>>>> problem doesn't seem to be related to a particular
>>>>> card/model/manufacturer, that might not fix it, either.
>>>>>
>>>>> This problem seems to occur only on machines that operate as a Xen
>>>>> server.  Other machines, identical Z800s, not running Xen, run just fine.
>>>>>
>>>>> What would you suggest?

>>>> More info required:
>>>>
>>>> - Which version of Xen
>>>
>>> 4.5.1
>>>
>>> Installed versions:  4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug -efi -flask -xsm)
>>
>> Ok, recent one.
>>
>>>> - Does this only occur with HVM guests?
>>>
>>> The host has been running only HVM guests every time it happened.
>>> It was running a PV guest in between (which I had to shut down
>>> because other VMs were migrated, requiring the RAM).
>>
>> The PV didn't have any issues?
>
>The whole server has the issue, not a particular VM.  While the PV guest
>was running, the server didn't freeze.
>
>>>> - Which network-driver are you using inside the guest
>>>
>>> r8169, compiled as a module
>>>
>>> Same happened with the tg3 driver when the on-board cards were used.
>>> The tg3 driver is completely disabled in the kernel config, i.e.
>>> not even compiled as a module.
>>
>> You have network cards assigned to the guests?
>
>No, they are all connected via a bridge.

>I enabled STP on the bridge and the server was ok for a week, then had
>to be restarted.  I'm seeing lots of messages in the log:
>
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>Oct 28 11:14:05 moonflo kernel: brloc: port 1(enp55s4) received tcn bpdu
>Oct 28 11:14:05 moonflo kernel: brloc: topology change detected, propagating
>
>and sometimes:
>
>Oct 28 10:47:04 moonflo kernel: brloc: port 1(enp55s4) neighbor 8000.00:00:10:11:12:00 lost
>
>Any idea what this means?

>(Google has gone on strike, and another search engine didn't give any
>useful findings ...)
>
>>>> - Can you connect to the "local" console of the guest?
>>>
>>> Yes, the host seems to be running fine except for having no network
>>> connectivity.  There's a keyboard and monitor physically connected to
>>> it with which you can log in and do stuff.
>>
>> The HOST loses network connectivity?
>
>Yes.
>
>Apparently when it became unresponsive yesterday, it was not possible
>to log in at the console, either.  I wasn't there yesterday, though I've
>seen that happen before.  We tried to shut it down via acpid by pressing
>the power button.  It didn't turn off, so it was switched off by holding
>the power button.  What I can see in the log is:

>Oct 28 14:12:33 moonflo logger[20322]: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/768
>Oct 28 14:12:33 moonflo logger[20323]: /etc/xen/scripts/vif-bridge: offline type_if=vif XENBUS_PATH=backend/vif/2/0
>Oct 28 14:12:33 moonflo logger[20347]: /etc/xen/scripts/vif-bridge: brctl delif brloc vif2.0 failed
>Oct 28 14:12:33 moonflo logger[20353]: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
>Oct 28 14:12:33 moonflo logger[20361]: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge brloc.
>Oct 28 14:12:33 moonflo logger[20372]: /etc/xen/scripts/vif-bridge: remove type_if=tap XENBUS_PATH=backend/vif/2/0
>Oct 28 14:12:33 moonflo logger[20391]: /etc/xen/scripts/vif-bridge: Successful vif-bridge remove for vif2.0-emu, bridge brloc.
>Oct 28 14:15:33 moonflo shutdown[20476]: shutting down for system halt
>^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Oct 28 14:17:34 moonflo syslog-ng[4611]: syslog-ng starting up; version='3.6.2'

>And:
>
>Oct 24 11:47:42 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
>Oct 24 11:47:42 moonflo kernel: Modules linked in: xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic video spl(O) backlight zlib_deflate drm_kms_helper snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer r8169 snd ttm soundcore mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
>Oct 24 11:47:42 moonflo kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: P           O    4.0.5-gentoo #3
>Oct 24 11:47:42 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
>Oct 24 11:47:42 moonflo kernel:  ffffffff8175a77d ffff880124d83d98 ffffffff814da8d8 0000000000000001
>Oct 24 11:47:42 moonflo kernel:  ffff880124d83de8 ffff880124d83dd8 ffffffff81088850 ffff880124d83e68
>Oct 24 11:47:42 moonflo kernel:  0000000000000000 ffff88011efd8000 0000000000000001 ffff8800d4eb5e80
>Oct 24 11:47:42 moonflo kernel: Call Trace:
>Oct 24 11:47:42 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
>Oct 24 11:47:42 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
>Oct 24 11:47:42 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
>Oct 24 11:47:42 moonflo kernel: ---[ end trace 320b6f98f8fc070f ]---
>Oct 24 11:47:42 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up

>That was two days before it went down.  After that, messages about
>topology changes started to appear.
>
>I'm not sure if I should call this "progress" ;)
>
>>
>>> You get no answer when you ping the host while it is unreachable.
>>>
>>>> - If yes, does it still have no connectivity?
>>>
>>> It has been restarted this morning when it was found to be unreachable.
>>>
>>>> I saw the same on my lab machine, which was related to:
>>>> - Not using correct drivers inside HVM guests
>>>
>>> There are Windoze 7 guests running that have PV drivers installed.
>>> One of those has formerly been running on a VMware host and was
>>> migrated on Tuesday.  I uninstalled the VMware tools from it.
>>
>> Which PV drivers?
>
>Xen GPL PV Driver Developers
>17.09.2014
>0.11.0.373
>Univention GmbH
>
>> And did you ensure all VMWare related drivers were removed?
>> I am not convinced uninstalling the VMWare tools is sufficient.
>
>What would I need to look at to make sure they are removed?
>
>The problem has been there before the VM that had VMWare drivers
>installed was migrated to this server.  So I don't think they are
>causing this problem.

>>> Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
>>> been migrated from the VMware host to this one.  I don't know if it
>>> has VMware tools installed (I guess it does because it could be shut
>>> down via VMware) and how those might react now.  It's working, and I
>>> don't want to touch it.
>>>
>>> However, the problem already occurred before this migration, when the
>>> on-board cards were still used.
>>>
>>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
>>>
>>> What might be the reason for the lists becoming too short?  Too many
>>> devices connected to the network?
>>
>> No network activity for a while. (clean installs, nothing running)
>> Switch forgetting the MAC-address assigned to the VM.
>>
>> Connecting to the VM-console, I could ping www.google.com and then the
>> connectivity re-appeared.
>
>Half of the switches were replaced last week in order to track down
>what appears to be a weird network problem.  The problem is that the RDP
>clients are being randomly stalled.  If it was only that, I'd suspect this
>server some more, but the internet connection goes through the same switches
>and is apparently also slowed down when the RDP clients are stalled.  They
>also got randomly stalled when the RDP clients were accessing a totally
>different server (the VMWare server), so this might be entirely unrelated.
>
>Replacing the switches didn't fix the problem, so I'll probably put them
>back into service and replace the other half.

>>> The host has been connected to two different switches and showed the
>>> problem.  Previously, that was an 8-port 1Gb switch, now it's a 24-port
>>> 1Gb switch.  However, the 8-port switch is also connected to the 24-port
>>> switch the host is now connected to.  (The 24-port switch connects it
>>> "directly" to the rest of the network.)
>>
>> Assuming it's a managed switch, you could test this.
>> Alternatively, check if you can access the VMs from the host.
>
>Good idea, I'll try that the next time it happens while I'm here.
>
>The network cards have arrived, Intel PRO 1000 dual port, made for IBM.
>I hope I get to swap the card today. Those *really* should work.
>
>Hm, I could plug in two of them and give each VM and the host its own
>physical card.  Do you think that might help?
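
For reference, the two kinds of kernel messages quoted above (the NETDEV WATCHDOG transmit timeouts and the STP topology changes on brloc) could be counted per day to see how they line up in time. What follows is only a minimal sketch: it assumes a syslog-style file with one message per line, as in the unwrapped excerpts, and the names `count_events` and `summarize` as well as the log path passed to `summarize` are hypothetical, not something taken from the thread.

```python
import re
from collections import Counter

# Message patterns copied from the kernel log lines quoted in this thread.
PATTERNS = {
    "tx_timeout": re.compile(
        r"NETDEV WATCHDOG: \S+ \(\S+\): transmit queue \d+ timed out"
    ),
    "topology_change": re.compile(r"topology change detected"),
    "neighbor_lost": re.compile(r"neighbor \S+ lost"),
}


def count_events(lines):
    """Count matching messages per (day, kind); day is the 'Mon DD' prefix."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not a syslog-style line
        day = " ".join(fields[:2])
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(day, kind)] += 1
    return counts


def summarize(path):
    """Print one line per (day, kind); `path` is a hypothetical log location."""
    with open(path, errors="replace") as log:
        for (day, kind), n in sorted(count_events(log).items()):
            print(f"{day}  {kind:16} {n}")
```

Calling `summarize()` with the path of the system log (wherever syslog-ng writes it on that host) would print a small table. If the topology changes only ever show up after a transmit timeout, that would suggest they are a consequence of the link flapping rather than a separate problem.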

Quick reply from mobile.
Will give a more detailed one later.

Noticed you are using ZFS. Where is your swap partition located?
On ZFS, or somewhere else?
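
A quick way to check is to look at /proc/swaps. The sketch below is only an illustration: it assumes the usual /proc/swaps layout (a header row, then one whitespace-separated entry per swap area) and that ZFS on Linux exposes zvols as /dev/zdN with symlinks under /dev/zvol/. The function names are made up for this example.

```python
def swap_devices(proc_swaps_text):
    """Return (device, type) pairs parsed from the contents of /proc/swaps."""
    entries = []
    for line in proc_swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def looks_like_zvol(device):
    """Heuristic only: assumes zvols appear as /dev/zdN or under /dev/zvol/."""
    return device.startswith("/dev/zd") or device.startswith("/dev/zvol/")


def report(path="/proc/swaps"):
    """Print each active swap area and whether it appears to live on a zvol."""
    with open(path) as f:
        for device, kind in swap_devices(f.read()):
            where = "on a ZFS zvol" if looks_like_zvol(device) else "not on ZFS"
            print(f"{device} ({kind}): {where}")
```

Running `report()` on the Xen host lists every active swap area. A swap file stored inside a ZFS dataset would not be caught by the device-name check, so its path would have to be inspected by hand.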

--
Joost
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

367	>installed was migrated to this server. So I don't think they are
368	>causing this problem.
369	>
370	>
371	>>> Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
372	>>> been migrated from the VMware host to this one. I don't know if it
373	>>> has VMware tools installed (I guess it does because it could be shut
374	>>> down via VMware) and how those might react now. It's working, and I
375	>>> don't want to touch it.
376	>>>
377	>>> However, the problem already occurred before this migration, when the
378	>>> on-board cards were still used.
379	>>>
380	>>>> - Switch hardware not keeping the MAC/IP/Port lists long enough
381	>>>
382	>>> What might be the reason for the lists becoming too short? Too many
383	>>> devices connected to the network?
384	>>
385	>> No network activity for a while. (clean installs, nothing running)
386	>> Switch forgetting the MAC-address assigned to the VM.
387	>>
388	>> Connecting to the VM-console, I could ping www.google.com and then
389	>the
390	>> connectivity re-appeared.
391	>
392	>Half of the switches have been replaced last week in order to track
393	>down
394	>what appears to be a weird network problem. The problem is that the
395	>RDP
396	>clients are being randomly stalled. If it was only that, I'd suspect
397	>this
398	>server some more, but the internet connection goes through the same
399	>switches
400	>and is apparently also slowed down when the RDP clients are stalled.
401	>They
402	>got also randomly stalled when the RDP clients were accessing a totally
403	>different server (the VMWare server), so this might be entirely
404	>unrelated.
405	>
406	>Replacing the switches didn't fix the problem, so I'll probably put
407	>them
408	>back into service and replace the other half.
409	>
410	>>> The host has been connected to two different switches and showed the
411	>>> problem. Previously, that was an 8-port 1Gb switch, now it's a
412	>24-port
413	>>> 1Gb switch. However, the 8-port switch is also connected to the
414	>24-port
415	>>> switch the host is now connected to. (The 24-port switch connects
416	>it
417	>>> "directly" to the rest of the network.)
418	>>
419	>> Assuming it's a managed switch, you could test this.
420	>> Alternatively, check if you can access the VMs from the host.
421	>
422	>Good idea, I'll try that when it happens when I'm here.
423	>
424	>The network cards have arrived, Intel PRO 1000 dual port, made for IBM.
425	>I hope I get to swap the card today. Those really should work.
426	>
427	>Hm, I could plug in two of them and give each VM and the host its own
428	>physical card. Do you think that might help?
429
430	Quick reply from mobile.
431	Will give a more detailed one later.
432
433	Noticed you are using ZFS. Where is your swap partition located?
434
435	On ZFS or?
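
A rough sketch of one way to answer that, assuming the usual Linux layout where active swap areas are listed in /proc/swaps and ZFS volumes show up as /dev/zdN (or under /dev/zvol/). The function name and the sample text are made up for illustration; on the real host you would pass in the contents of /proc/swaps instead.

```python
# Hypothetical helper: list swap devices that look like ZFS zvols.
# Assumption: zvol block devices appear as /dev/zdN or /dev/zvol/<pool>/<name>.
ZVOL_PREFIXES = ("/dev/zd", "/dev/zvol/")

# Made-up sample in the format of /proc/swaps (not taken from the real host).
SAMPLE = (
    "Filename                Type        Size     Used  Priority\n"
    "/dev/sda2               partition   8388604  0     -1\n"
    "/dev/zd0                partition   4194300  0     -2\n"
)


def zfs_backed_swaps(proc_swaps_text):
    """Return the swap device paths that start with a known zvol prefix."""
    found = []
    # The first line of /proc/swaps is a column header, so skip it.
    for line in proc_swaps_text.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0].startswith(ZVOL_PREFIXES):
            found.append(fields[0])
    return found


print(zfs_backed_swaps(SAMPLE))
```

An empty list only means that no active swap area matched those prefixes; a swap file stored on a ZFS dataset would not be caught by this check.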
436
437	--
438	Joost
439	--
440	Sent from my Android device with K-9 Mail. Please excuse my brevity.