Re: [gentoo-user] Networking trouble - gentoo-user

From:	"J. Roeleveld" <joost@××××××××.org>
To:	gentoo-user@l.g.o
Subject:	Re: [gentoo-user] Networking trouble
Date:	Fri, 16 Oct 2015 05:31:49
Message-Id:	`2569546.9sPlulUjpb@andromeda`
In Reply to:	Re: [gentoo-user] Networking trouble by hw

On Thursday, October 15, 2015 05:46:07 PM hw wrote:
> J. Roeleveld wrote:
> > On Thursday, October 15, 2015 03:30:01 PM hw wrote:
> >> Hi,
> >>
> >> I have a xen host with some HVM guests which becomes unreachable via
> >> the network after apparently random amounts of time.  I have already
> >> switched the network card to see if that would make a difference,
> >> and with the card currently installed, it worked fine for over 20 days
> >> until it became unreachable again.  Before switching the network card,
> >> it would run a week or two before becoming unreachable.  The previous
> >> card was the on-board BCM5764M which uses the tg3 driver.
> >>
> >> There are messages like this in the log file:
> >>
> >> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
> >> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
> >> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
> >> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
> >> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P           O    4.0.5-gentoo #3
> >> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
> >> Oct 14 20:58:02 moonflo kernel:  ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
> >> Oct 14 20:58:02 moonflo kernel:  ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
> >> Oct 14 20:58:02 moonflo kernel:  0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
> >> Oct 14 20:58:02 moonflo kernel: Call Trace:
> >> Oct 14 20:58:02 moonflo kernel:  <IRQ>  [<ffffffff814da8d8>] dump_stack+0x45/0x57
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81088850>] warn_slowpath_common+0x80/0xc0
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810888d1>] warn_slowpath_fmt+0x41/0x50
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff812b31c5>] ? add_interrupt_randomness+0x35/0x1e0
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b819>] dev_watchdog+0x259/0x270
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8145b5c0>] ? dev_graft_qdisc+0x80/0x80
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d4047>] call_timer_fn.isra.30+0x17/0x70
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810d42a6>] run_timer_softirq+0x176/0x2b0
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108bd0a>] __do_softirq+0xda/0x1f0
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8108c04e>] irq_exit+0x7e/0xa0
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff8130e075>] xen_evtchn_do_upcall+0x35/0x50
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff814e1e8e>] xen_do_hypervisor_callback+0x1e/0x40
> >> Oct 14 20:58:02 moonflo kernel:  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810459e0>] ? xen_safe_halt+0x10/0x20
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81053979>] ? default_idle+0x9/0x10
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810542da>] ? arch_cpu_idle+0xa/0x10
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff810bd170>] ? cpu_startup_entry+0x190/0x2f0
> >> Oct 14 20:58:02 moonflo kernel:  [<ffffffff81047cd5>] ? cpu_bringup_and_idle+0x25/0x40
> >> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
> >> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
> >>
> >> After that, there are lots of messages about the link being up, one
> >> message every 12 seconds.  When you unplug the network cable, you get
> >> a message that the link is down, and no message when you plug it in
> >> again.
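A minimal sketch for spotting this pattern in a saved log: it counts NETDEV WATCHDOG transmit timeouts and "link up" messages per interface. The regular expressions are modelled only on the lines quoted in this thread; the name `scan_log` and the sample lines are illustrative assumptions, and other drivers or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in this thread
# (assumption: other drivers/kernels may format them differently).
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)
LINK_UP_RE = re.compile(r"kernel: \S+ \S+ (?P<iface>\S+): link up")


def scan_log(lines):
    """Count watchdog timeouts and 'link up' messages per interface."""
    timeouts = Counter()
    link_ups = Counter()
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            timeouts[m.group("iface")] += 1
            continue
        m = LINK_UP_RE.search(line)
        if m:
            link_ups[m.group("iface")] += 1
    return timeouts, link_ups


if __name__ == "__main__":
    # Two sample lines copied from the log excerpt in this thread.
    sample = [
        "Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): "
        "transmit queue 0 timed out",
        "Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up",
    ]
    timeouts, link_ups = scan_log(sample)
    print(dict(timeouts), dict(link_ups))
```

Feeding it the lines of the system log would show whether the repeated "link up" messages always follow a watchdog timeout on the same interface.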

> >> I was hoping that switching the network card (to one that uses a
> >> different driver) might solve the problem, and it did not.  Now I can
> >> only guess that the network card goes to sleep and sometimes cannot be
> >> woken up again.
> >>
> >> I tried to reduce the connection speed to 100Mbit and found that
> >> accessing the VMs (via RDP) becomes too slow to use them.  So I
> >> disabled the power management of the network card (through sysfs) and
> >> will have to see if the problem persists.
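The message does not say which sysfs file was changed. Assuming it was the device's runtime power-management `power/control` attribute (which reads `on` when the device is kept powered and `auto` when the kernel may suspend it), a sketch like the following could confirm the setting after a reboot. The function names and the `sysfs_root` parameter are hypothetical helpers for illustration.

```python
from pathlib import Path


def power_control_path(iface, sysfs_root="/sys"):
    """Build the path of the runtime-PM control file for a network interface."""
    return (Path(sysfs_root) / "class" / "net" / iface
            / "device" / "power" / "control")


def runtime_pm_disabled(iface, sysfs_root="/sys"):
    """Return True if the control file reads 'on' (runtime PM disabled),
    False for any other value such as 'auto', None if it cannot be read."""
    try:
        value = power_control_path(iface, sysfs_root).read_text().strip()
    except OSError:
        return None
    return value == "on"


if __name__ == "__main__":
    # 'enp55s4' is the interface name that appears in the quoted log.
    print(runtime_pm_disabled("enp55s4"))
```

The setting does not survive a reboot by itself, so checking it again after the restart mentioned later in the thread would be worthwhile.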

> >> We'll be getting decent network cards in a couple days, but since the
> >> problem doesn't seem to be related to a particular
> >> card/model/manufacturer, that might not fix it, either.
> >>
> >> This problem seems to only occur on machines that operate as a xen
> >> server.  Other machines, identical Z800s, not running xen, run just
> >> fine.
> >>
> >> What would you suggest?
> >
> > More info required:
> >
> > - Which version of Xen
>
> 4.5.1
>
> Installed versions:  4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
> -efi -flask -xsm)

Ok, recent one.

> > - Does this only occur with HVM guests?
>
> The host has been running only HVM guests every time it happened.
> It was running a PV guest in between (which I had to shut down
> because other VMs were migrated, requiring the RAM).

The PV guest didn't have any issues?

> > - Which network-driver are you using inside the guest
>
> r8169, compiled as a module
>
> Same happened with the tg3 driver when the on-board cards were used.
> The tg3 driver is completely disabled in the kernel config, i.e.
> not even compiled as a module.

You have network cards assigned to the guests?

> > - Can you connect to the "local" console of the guest?
>
> Yes, the host seems to be running fine except for having no network
> connectivity.  There's a keyboard and monitor physically connected to
> it with which you can log in and do stuff.

The HOST loses network connectivity?

> You get no answer when you ping the host while it is unreachable.
>
> > - If yes, does it still have no connectivity?
>
> It has been restarted this morning when it was found to be unreachable.
>
> > I saw the same on my lab machine, which was related to:
> > - Not using correct drivers inside HVM guests
>
> There are Windoze 7 guests running that have PV drivers installed.
> One of those has formerly been running on a VMware host and was
> migrated on Tuesday.  I uninstalled the VMware tools from it.

Which PV drivers?
And did you ensure all VMware-related drivers were removed?
I am not convinced uninstalling the VMware tools is sufficient.

> Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
> been migrated from the VMware host to this one.  I don't know if it
> has VMware tools installed (I guess it does because it could be shut
> down via VMware) and how those might react now.  It's working, and I
> don't want to touch it.
>
> However, the problem already occurred before this migration, when the
> on-board cards were still used.
>
> > - Switch hardware not keeping the MAC/IP/Port lists long enough
>
> What might be the reason for the lists becoming too short?  Too many
> devices connected to the network?

No network activity for a while (clean installs, nothing running), so
the switch forgets the MAC address assigned to the VM.

Connecting to the VM console, I could ping www.google.com, and then the
connectivity reappeared.
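To illustrate the aging behaviour described here, a toy model of a switch's MAC address table follows. It is only a sketch: the class name `MacTable`, the MAC address, and the port number are made up, and the 300-second aging time is an assumed typical default that real switches may not use.

```python
class MacTable:
    """Toy model of a switch MAC address table with an aging timer."""

    def __init__(self, aging_seconds=300.0):
        # Assumption: 300 s is a typical default; real switches vary.
        self.aging_seconds = aging_seconds
        self._entries = {}  # mac -> (port, time the MAC was last seen)

    def learn(self, mac, port, now):
        """Record that a frame from `mac` arrived on `port` at time `now`."""
        self._entries[mac] = (port, now)

    def lookup(self, mac, now):
        """Return the port for `mac`, or None if unknown or aged out."""
        entry = self._entries.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen > self.aging_seconds:
            del self._entries[mac]  # entry expired: the switch "forgets"
            return None
        return port


if __name__ == "__main__":
    table = MacTable()
    vm_mac = "00:16:3e:00:00:01"  # made-up address for illustration
    table.learn(vm_mac, port=7, now=0)
    print(table.lookup(vm_mac, now=100))   # VM still known
    print(table.lookup(vm_mac, now=1000))  # idle too long, entry aged out
    table.learn(vm_mac, port=7, now=1001)  # VM sends traffic, e.g. a ping
    print(table.lookup(vm_mac, now=1002))  # learned again
```

In this model, outgoing traffic from the VM (such as the ping above) is what makes the switch learn the address again.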

> The host has been connected to two different switches and showed the
> problem.  Previously, that was an 8-port 1Gb switch, now it's a 24-port
> 1Gb switch.  However, the 8-port switch is also connected to the 24-port
> switch the host is now connected to.  (The 24-port switch connects it
> "directly" to the rest of the network.)

Assuming it's a managed switch, you could test this by checking its MAC
address table.
Alternatively, check if you can access the VMs from the host.

--
Joost
