atoth@××××××××××.hu wrote:
> On Fri, December 12, 2008 19:09, David Sommerseth wrote:
>>
>> David Sommerseth wrote:
>>> atoth@××××××××××.hu wrote:
>>>> PCI-X dual port Broadcom NetXtreme BCM5704 Gigabit Ethernet (rev 03)
>>>> adapter is working fine here driven by tg3, 2.6.27-hardened-r1. The
>>>> driver doesn't seem to be borked with my card.
>>>>
>>>> Did you check out the "error" field of ifconfig's output for the
>>>> interface of your card?
>>>>
>>>> Regards,
>>>> Dw.
>>> Hmmm ... No, I have not had that opportunity. The server is located
>>> 2000km away from me, and I usually call a guy (who is not a
>>> technician) to go in and press CTRL-ALT-DEL on a keyboard. That is
>>> the short-term "fix". But I'm going to have a look physically at the
>>> server in a couple of weeks, so if I get positive feedback from
>>> others as well regarding the 2.6.27 kernel, I'm willing to try that
>>> upgrade.
>>>
>>> This interface is an on-board interface in an IBM eServer. The first
>>> time it happened, there had been no problems for about 28 days. Now
>>> it was 13 days. So I expect it to happen again, soon enough.
>>>
>>> I'll try to hack the shutdown scripts to dump the ifconfig info
>>> somewhere somehow.
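
One way such a dump could be done, as a rough sketch only: instead of parsing ifconfig output, read the per-interface counters that Linux exposes under /sys/class/net/<iface>/statistics and append them to a file. The function names and the log path in the usage note are made up for illustration; this is not the hack that was actually put into the shutdown scripts.

```python
# Sketch: save an interface's kernel counters to a file, e.g. from a shutdown script.
# Assumed/hypothetical: the function names and any output path passed in.
import os
import time

def read_counters(iface, sysfs_root="/sys/class/net"):
    """Return {counter_name: int} read from <sysfs_root>/<iface>/statistics."""
    stats_dir = os.path.join(sysfs_root, iface, "statistics")
    counters = {}
    for name in sorted(os.listdir(stats_dir)):
        with open(os.path.join(stats_dir, name)) as fh:
            counters[name] = int(fh.read().strip())
    return counters

def dump_counters(iface, out_path, sysfs_root="/sys/class/net"):
    """Append a timestamped snapshot of the counters to out_path."""
    counters = read_counters(iface, sysfs_root)
    with open(out_path, "a") as out:
        out.write(time.strftime("%Y-%m-%d %H:%M:%S") + " " + iface + "\n")
        for name, value in counters.items():
            out.write(f"  {name}: {value}\n")
    return counters
```

Something like `dump_counters("eth0", "/var/log/eth0-stats.log")` (path chosen arbitrarily) could then be called before the interface is taken down.
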
>> Then it happened again ... and I have ifconfig stats for the interface:
>>
>> eth0      Link encap:Ethernet  HWaddr 00:14:5e:5d:3c:d0
>>           inet6 addr: fe80::214:5eff:fe5d:3cd0/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:10551633 errors:4294967239 dropped:767 overruns:0 frame:170
>>           TX packets:9371606 errors:4294967239 dropped:0 overruns:0 carrier:0
>>           collisions:4294967239 txqueuelen:1000
>>           RX bytes:28237000 (26.9 MiB)  TX bytes:163377979 (155.8 MiB)
>>           Interrupt:16
>>
>> From the kernel log I see this:
>>
>> Dec 12 12:19:21 fw [74355.059369] tg3: tg3_abort_hw timed out for world, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
>> Dec 12 12:19:24 fw [74357.842979] tg3: world: No firmware running.
>> Dec 12 12:19:41 fw [74374.992867] tg3: world: Link is down.
>>
>> I'm surprised by the error and collision numbers here, as I checked them
>> the other day, and all of them were 0. I also know that the TX and RX
>> values were above 3-4GB, but I don't remember which was which.
>>
>> Could this be an overflow bug of some kind?
>>
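
A note on the arithmetic behind that question: 4294967239 is exactly 2^32 - 57, which is what -57 looks like when stored in an unsigned 32-bit counter. Whether the tg3 statistics really are 32-bit fields, and why three unrelated counters would end up with the same value, is not established by anything in this thread. The sketch below only shows the arithmetic, and the helper name `as_signed32` is made up for the illustration.

```python
# Illustration only: reinterpret an unsigned 32-bit counter value as signed.
# Assumes the counters are 32 bits wide, which the thread does not confirm.

def as_signed32(value: int) -> int:
    """Return the two's-complement signed reading of a 32-bit value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

reported = 4294967239            # errors/collisions value from the ifconfig dump
print(hex(reported))             # 0xffffffc7
print(as_signed32(reported))     # -57
print(as_signed32(0xFFFFFFFF))   # -1, the MAC_TX_MODE value from the kernel log
```
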
>> I have also found out that IBM have released an updated firmware for this
>> network device, so I'll try to upgrade it during Christmas when I'm close
>> to the box again. In the meantime I have a little ping script, which
>> restarts the network (incl. reloading the tg3 module) when the network
>> dies. This restart gives me minimal downtime.
>>
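
For anyone wanting to try the same workaround, a watchdog of that kind could look roughly like the sketch below. It is not the script actually running on the box. The target address (a documentation placeholder), the failure threshold, the polling interval and the Gentoo-style `/etc/init.d/net.eth0` init script path are all assumptions; only `ping` and `modprobe` with their standard options are taken as given.

```python
# Sketch of a ping watchdog; not the script actually used in this thread.
# Assumed/hypothetical: TARGET, MAX_FAILURES, the interval and the init script path.
import subprocess
import time

TARGET = "192.0.2.1"          # placeholder address, replace with the ISP router
MAX_FAILURES = 3              # consecutive failed pings before restarting
RESTART_CMDS = [
    ["/etc/init.d/net.eth0", "stop"],   # assumed Gentoo-style init script
    ["modprobe", "-r", "tg3"],          # unload the driver
    ["modprobe", "tg3"],                # load it again
    ["/etc/init.d/net.eth0", "start"],
]

def run(cmd):
    """Run a command quietly and return its exit status."""
    return subprocess.call(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def link_alive(runner=run):
    """Send one ICMP echo with a 5 second timeout; True if it was answered."""
    return runner(["ping", "-c", "1", "-W", "5", TARGET]) == 0

def restart_network(runner=run):
    """Run every restart step in order, even if an earlier one fails."""
    return [runner(cmd) for cmd in RESTART_CMDS]

def watchdog(runner=run, sleep=time.sleep, interval=30, rounds=None):
    """Poll the target; restart networking after MAX_FAILURES misses in a row.

    rounds=None loops forever; a number limits the polls (handy for dry runs).
    Returns how many restarts were triggered.
    """
    failures = restarts = 0
    while rounds is None or rounds > 0:
        failures = 0 if link_alive(runner) else failures + 1
        if failures >= MAX_FAILURES:
            restart_network(runner)
            restarts += 1
            failures = 0
        sleep(interval)
        if rounds is not None:
            rounds -= 1
    return restarts
```

Calling `watchdog()` as root starts the loop. Requiring several misses in a row avoids reloading the driver because of a single lost ping.
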
>> But I do not understand why this box was so rock solid until I upgraded
>> from 2.6.22-hardened-r8 to 2.6.25-hardened-r8. The new kernel driver
>> obviously does something it didn't do before. Unfortunately I can't find
>> anything in the kernel git logs for the tg3.[ch] files which could
>> pinpoint the cause.
>>
>>
>> Does anyone have any experience with firmware upgrades on these
>> cards? The instructions seem pretty straightforward, but if you know about
>> anything, whatever, I would appreciate that.
>>
>>
>> kind regards,
>>
>> David Sommerseth
>>
>
> Rather strange. The collision and error counters show the same value...
> It has been a long time since I last saw collisions.
>
> There are several possibilities regarding this symptom. It would be
> important to know whether the card is connected to a hub or a switch(ing-hub).
> 1.) There can be a defective device on the subnet, which is connected to
> it from time to time, or it is present all the time but doesn't hog the
> line constantly

Pretty confident this is not the case, as this interface is the one
connected straight to the router from the ISP.

> 2.) The switch/hub can have a problem - try reconnecting the card to
> another port

Pretty confident this is also not the case.

> 3.) The network card can have a problem, which can be software related and
> might be solved by a firmware upgrade (unfortunately the card itself
> cannot be replaced, being an on-board NIC)

Firmware updated now. I found a firmware update for the Broadcom
interface I have in the IBM xSeries server and applied it. I also upgraded
the kernel from 2.6.25-hardened-r8 to 2.6.25-hardened-r11. After this, the
server has survived 55 days without any issues, which is the longest since
I upgraded from 2.6.22-hardened-r8. I strongly believe that it was the
firmware update which helped.

> 4.) It can even be caused by a driver bug - which we know is entirely
> possible since the e1000 issue

Yeah, and this part scares me more ...

> I hope the cause will turn up soon. I would suspect a hardware issue, but
> it's a disturbing fact that these symptoms appeared after a kernel upgrade.

Exactly!


So my thesis is that between linux-2.6.22-hardened-r8 and
2.6.25-hardened-r8 the tg3 driver must have been changed somehow, so that
it now depends on some feature in the firmware which obviously did not work
properly. And if the tg3 driver did not change, I've simply been way too
lucky not to experience this for over 13 months with the 2.6.22 kernel.

The firmware I upgraded to can be found here:
http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5070004&brandind=5000008

This update upgraded the network card firmware "bootcode" from 3.61 to 3.65
and the "IPMI" from 6.20 to 6.25.


kind regards,

David Sommerseth