Gentoo Archives: gentoo-hardened

From: David Sommerseth <gentoo.list@××××××××××××.net>
To: gentoo-hardened@l.g.o
Subject: Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting
Date: Wed, 25 Feb 2009 14:02:41
Message-Id: 49A54F7E.4090603@topphemmelig.net
In Reply to: Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting by atoth@atoth.sote.hu
1 atoth@××××××××××.hu wrote:
2 > On Fri, December 12, 2008 19:09, David Sommerseth wrote:
3 >>
4 >> David Sommerseth wrote:
5 >>> atoth@××××××××××.hu wrote:
6 >>>> PCI-X dual port Broadcom NetXtreme BCM5704 Gigabit Ethernet (rev 03)
7 >>>> adapter is working fine here driven by tg3, 2.6.27-hardened-r1. The
8 >>>> driver
9 >>>> doesn't seem to be borked with my card.
10 >>>>
11 >>>> Did you check out the "error" field of ifconfig's output for the
12 >>>> interface
13 >>>> of your card?
14 >>>>
15 >>>> Regards,
16 >>>> Dw.
17 >>> Hmmm ... No, I have not had that opportunity. The server is located
18 >>> 2000km away from me, and I
19 >>> usually call a guy (who is not a technician) to go in and press
20 >>> CTRL-ALT-DEL on a keyboard. That is
21 >>> the short-time "fix". But I'm going to have a look physically on the
22 >>> server in a couple of weeks,
23 >>> so if I get positive feedbacks from others as well regarding 2.6.27
24 >>> kernel, I'm willing to try that
25 >>> upgrade.
26 >>>
27 >>> This interface is an on-board interface in an IBM eServer. The first
28 >>> time it happened, there had been no
29 >>> problems for about 28 days. Now it was 13 days. So I expect it to
30 >>> happen again, soon enough.
31 >>>
32 >>> I'll try to hack the shutdown scripts to dump the ifconfig info
33 >>> somewhere somehow.
34 >> Then it happened again ... and I have ifconfig stats for the interface:
35 >>
36 >> eth0 Link encap:Ethernet HWaddr 00:14:5e:5d:3c:d0
37 >> inet6 addr: fe80::214:5eff:fe5d:3cd0/64 Scope:Link
38 >> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
39 >> RX packets:10551633 errors:4294967239 dropped:767 overruns:0
40 >> frame:170
41 >> TX packets:9371606 errors:4294967239 dropped:0 overruns:0
42 >> carrier:0
43 >> collisions:4294967239 txqueuelen:1000
44 >> RX bytes:28237000 (26.9 MiB) TX bytes:163377979 (155.8 MiB)
45 >> Interrupt:16
46 >>
47 >> From the kernel log I see this:
48 >>
49 >> Dec 12 12:19:21 fw [74355.059369] tg3: tg3_abort_hw timed out for world,
50 >> TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
51 >> Dec 12 12:19:24 fw [74357.842979] tg3: world: No firmware running.
52 >> Dec 12 12:19:41 fw [74374.992867] tg3: world: Link is down.
53 >>
54 >> I'm surprised by the error and collision numbers here, as I checked them
55 >> the
56 >> other day, and all of them were 0. I also know that the TX and RX values
57 >> were above 3-4GB, but I don't remember which was which.
58 >>
59 >> Could this be an overflow bug of some kind?
60 >>
61 >> I have also found out that IBM have released an updated firmware to this
62 >> network device, so I'll try to upgrade it during Christmas when I'm close
63 >> to the box again. In the mean time I have a little ping-script, which
64 >> restarts network (incl. reloading of the tg3 module) when the network
65 >> dies.
66 >> This restart gives me minimal downtime.
67 >>
68 >> But I do not understand why this box was so rock solid until I upgraded
69 >> from 2.6.22-hardened-r8 to 2.6.25-hardened-r8. The new kernel driver
70 >> obviously does something it didn't do before. Unfortunately I can't find
71 >> anything particular in the kernel git logs for the tg3.[ch] files which
72 >> could pin-point anything particular.
73 >>
74 >>
75 >> Does anyone have any experience with firmware upgrades on these
76 >> cards? The instructions seem pretty straightforward, but if you know about
77 >> anything, whatever, I would appreciate that.
78 >>
79 >>
80 >> kind regards,
81 >>
82 >> David Sommerseth
83 >>
84 >
85 > Rather strange. The collision and error counters show the same value...
86 > It was a long time ago, when I last saw collisions.
87 >
88 > There are several possibilities regarding this symptom. It would be
89 > important to know if the card is connected to a hub, or a switch(ing-hub)?
90 > 1.) There can be a defective device on the subnet, which is connected to
91 > it from time-to-time, or it is present all the time, but doesn't hog the
92 > line constantly
93
94 Pretty confident this is not the case, as this interface is the one
95 connected straight to the router from the ISP.
96
97 > 2.) The switch/hub can have a problem - try reconnecting the card to
98 > another port
99
100 Pretty confident this is also not the case.
101
102 > 3.) The network card can have a problem, which can be software related and
103 > might be solved by a firmware upgrade (unfortunately the card itself
104 > cannot be replaced being an on-board NIC)
105
106 Firmware updated now. I found a firmware update for the Broadcom
107 interface I have in the IBM xSeries server and applied it. I also upgraded
108 the kernel from 2.6.25-hardened-r8 to 2.6.25-hardened-r11. After this, the
109 server has survived 55 days without any issues, which is the longest since
110 I upgraded from 2.6.22-hardened-r8. I strongly believe that it was the
111 firmware update which helped.
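
For reference, the interim workaround mentioned in the quoted mail (a small ping script that restarts the network and reloads the tg3 module when the link dies) could be sketched roughly as below. This is not the script actually used on the server. The ping target, the failure threshold, the check interval and the exact restart commands (a Gentoo-style /etc/init.d/net.eth0 init script plus modprobe) are all assumptions made for illustration.

```python
import subprocess
import time

# All values below are illustrative assumptions, not taken from the real setup.
PING_TARGET = "192.0.2.1"        # hypothetical address of the ISP router
FAILURES_BEFORE_RESTART = 3      # consecutive failed pings before acting
CHECK_INTERVAL_SECONDS = 30
RESTART_COMMANDS = [             # assumed restart sequence, adjust to the system
    ["/etc/init.d/net.eth0", "stop"],
    ["modprobe", "-r", "tg3"],
    ["modprobe", "tg3"],
    ["/etc/init.d/net.eth0", "start"],
]


def link_is_alive(target: str) -> bool:
    """Send one ICMP echo request; True if a reply arrived within 2 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def next_failure_count(previous: int, alive: bool) -> int:
    """Reset the failure counter on success, increment it on failure."""
    return 0 if alive else previous + 1


def restart_network() -> None:
    """Run the assumed restart sequence, continuing even if one step fails."""
    for command in RESTART_COMMANDS:
        subprocess.run(command, check=False)


def watch() -> None:
    """Poll the target forever and restart the network after repeated failures."""
    failures = 0
    while True:
        failures = next_failure_count(failures, link_is_alive(PING_TARGET))
        if failures >= FAILURES_BEFORE_RESTART:
            restart_network()
            failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)
```

Calling watch() as root starts the loop. Requiring several consecutive failures before restarting avoids bouncing the interface on a single lost packet.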
112
113 > 4.) It can even be caused by a driver bug - which we know is all the way
114 > possible since the e1000 issue
115
116 Yeah, and this part scares me more ...
117
118 > I hope it'll turn out soon. I would think about a hardware issue, but it's
119 > a disturbing fact, that these symptoms appeared after a kernel upgrade.
120
121 Exactly!
122
123
124 So my thesis is that between linux-2.6.22-hardened-r8 and
125 2.6.25-hardened-r8 the tg3 driver must have been updated somehow, and now
126 depends on some feature in the firmware which obviously did not work
127 properly. And if the tg3 driver did not change, I've simply been way too
128 lucky not to experience this for over 13 months with the 2.6.22 kernel.
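
As a side note on the overflow question raised in the quoted mail: the value 4294967239 reported for RX errors, TX errors and collisions sits just below 2^32, which is what a small negative number looks like when it is stored in or printed as an unsigned 32-bit counter. The short check below only does that arithmetic. It does not show where in the driver or hardware the value came from, and treating the counter as 32 bits wide is an assumption based on the magnitude of the number.

```python
# Value shown in the quoted ifconfig output for errors and collisions.
COUNTER = 4294967239


def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# How far the counter is below the 32-bit wrap-around point.
distance_below_wrap = (1 << 32) - COUNTER

print(hex(COUNTER))           # 0xffffffc7
print(distance_below_wrap)    # 57
print(as_signed32(COUNTER))   # -57
```

So all three counters are consistent with a value of -57 being shown as unsigned, rather than with four billion real errors.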
129
130 The firmware I upgraded to can be found here:
131 http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5070004&brandind=5000008
132
133 This update upgraded the network card firmware "bootcode" from 3.61 to 3.65
134 and the "IPMI" from 6.20 to 6.25.
135
136
137 kind regards,
138
139 David Sommerseth