I recently rebuilt a SunBlade 2000 system, which had been
running Solaris 8, with Gentoo 2006.0. The system has a
Sun RIO GEM NIC and worked quite well for the first few
days, although we did not put much load on it during that
time. The system's primary task is to host our source
repository, so it needs to be on the network.

The system was initially set up on 3/9/2006 and ran fine
until 3/15/2006, when we started getting the error
messages below:

Mar 15 15:39:25 tsdfft1 NETDEV WATCHDOG: eth0: transmit timed out
Mar 15 15:39:25 tsdfft1 eth0: transmit timed out, resetting
Mar 15 15:39:25 tsdfft1 eth0: TX_STATE[003ffc05:00000001:00000019]
Mar 15 15:39:25 tsdfft1 eth0: RX_STATE[0100c805:00000001:00000021]
Mar 15 15:39:25 tsdfft1 eth0: Link is up at 100 Mbps, half-duplex.
Mar 15 15:39:25 tsdfft1 eth0: Pause is disabled

And:

Mar 15 16:11:58 tsdfft1 eth0: TX MAC xmit underrun.

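For anyone wanting to count how often each of the two failures shows up in a longer log, a minimal sketch follows. It assumes the message text looks exactly like the excerpts above; the labels `watchdog_timeout` and `tx_underrun` and the helper name `tally_errors` are made up for illustration.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpts in this post.
LOG = """\
Mar 15 15:39:25 tsdfft1 NETDEV WATCHDOG: eth0: transmit timed out
Mar 15 15:39:25 tsdfft1 eth0: transmit timed out, resetting
Mar 15 15:39:25 tsdfft1 eth0: TX_STATE[003ffc05:00000001:00000019]
Mar 15 15:39:25 tsdfft1 eth0: RX_STATE[0100c805:00000001:00000021]
Mar 15 15:39:25 tsdfft1 eth0: Link is up at 100 Mbps, half-duplex.
Mar 15 15:39:25 tsdfft1 eth0: Pause is disabled
Mar 15 16:11:58 tsdfft1 eth0: TX MAC xmit underrun.
"""

# Hypothetical labels; the patterns only match the message text as quoted.
PATTERNS = {
    "watchdog_timeout": re.compile(r"NETDEV WATCHDOG: \w+: transmit timed out"),
    "tx_underrun": re.compile(r"\w+: TX MAC xmit underrun"),
}


def tally_errors(text):
    """Count how many lines match each of the known error patterns."""
    counts = Counter()
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


counts = tally_errors(LOG)
print(dict(counts))
```

Running the same function over the full system log (for example, the contents of /var/log/messages read into a string) would show whether the watchdog timeouts and the underruns tend to occur together or independently.
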
We're presently using the 2.6.16 kernel (vanilla) with
sungem driver version 0.98. We have also seen this issue
with the 2.6.15.6 kernel (vanilla) and the 2.4.32_r2
kernel (provided by Gentoo 2006.0).

The first error is sporadic, but recurs from time to time
(the same message every time, apart from the date and
time). The second is the most reproducible: all I have to
do is try to pull down source from the repository (hosted
on Apache2 via WebDAV), and after about 6 MiB of data
transfer the link dies until an ifconfig down/up is done.
It then runs for a while longer before dying again, at
which point it requires a system reboot.

In researching the issue, I concluded that one of two
things is at play: either the card is going bad, or there
is a driver problem. I found a link to an xmit underrun
issue for Solaris, but was unable to access it because it
is locked behind sunsolve.sun.com. So I have no guarantee
that going back to Solaris would solve the issue either.

I have had a hard time finding reports of an xmit underrun
issue under Linux; most searches turn up the place in the
source where the message is generated, not users trying to
find solutions to the problem.

I did, however, notice that there was a similar problem
with overflows on the RX portion of the chip, which was
solved by resetting the chip's RX unit via
gem_rxmac_reset().

My first attempt at a fix was to modify the driver, at the
point where the error is reported, to schedule a reset,
based on code elsewhere in the driver. (See
sungem-fix1.patch.txt.)

At first this patch did not seem to work. However, I have
been running the kernel with it for about a week now, and
at least SSH and Apache seem to keep running. So I do
think it helped to improve the situation, but it does not
solve the problem on the Subversion side (Apache/WebDAV),
which still dies once the errors occur (I just tested to
make sure).

I then tried building a solution based on
gem_rxmac_reset() and the various init functions, and
produced gem_txmac_reset(). However, my first use of it
locked up the kernel. It might just be that I tried to
take a lock when I shouldn't have (I did try to take both
the driver's lock and tx_lock), but I am not sure that I
did it correctly.

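As a rough illustration of the locking concern (not the actual patch, and not kernel code), the sketch below models what happens when a reset helper tries to take a non-recursive lock that its caller already holds. The names `txmac_reset_sketch` and `interrupt_path_sketch` are hypothetical, ordinary Python locks stand in for the driver's lock and tx_lock, and whether the real call path already holds either lock is an assumption that would need checking against the driver source.

```python
import threading

# Stand-ins for the driver's two locks. These are plain Python locks,
# which, like kernel spinlocks, cannot be re-acquired by their holder.
lock = threading.Lock()
tx_lock = threading.Lock()


def txmac_reset_sketch(timeout=0.1):
    """Hypothetical reset helper that takes both locks itself."""
    # A timeout is used so the demo reports the problem instead of hanging.
    if not lock.acquire(timeout=timeout):
        return "would deadlock: lock already held"
    try:
        if not tx_lock.acquire(timeout=timeout):
            return "would deadlock: tx_lock already held"
        try:
            return "reset done"
        finally:
            tx_lock.release()
    finally:
        lock.release()


def interrupt_path_sketch():
    """Hypothetical caller that already holds lock when calling the reset."""
    with lock:
        return txmac_reset_sketch()


print(txmac_reset_sketch())     # called with no locks held
print(interrupt_path_sketch())  # called with lock already held
```

In the kernel there is no timeout to fall back on, so the second case would simply spin forever, which would look like a hard lockup. If that is what happened, the usual alternatives are to have the helper assume its caller holds the lock, or to defer the reset to a context where no lock is held.
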
I would very much appreciate it if someone who is more
familiar with the sungem driver would look at the patches
and verify that (a) it is the correct thing to do, and
(b) I did it correctly.

I am aware that the network the system is running on is
supposed to be full-duplex, 100 Mbps. However, I have
noticed that the card/driver seems to think it is
half-duplex. Could this simply be a duplexing issue? I
have no control over the settings of the switch it is
plugged into, and I have not been able to find a way to
get ifconfig to force the link to full-duplex. (We've
typically built the driver into the kernel.)

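To check whether the link ever comes up at anything other than half-duplex, the link-up messages could be pulled out of the log with something like the sketch below. It assumes the message format shown in the excerpts in this post; `link_states` is a made-up helper name.

```python
import re

# Matches e.g. "eth0: Link is up at 100 Mbps, half-duplex."
# The space after the comma is optional.
LINK_RE = re.compile(r"(\w+): Link is up at (\d+) Mbps,\s*(half|full)-duplex")


def link_states(lines):
    """Return (interface, speed_mbps, duplex) for each link-up message."""
    states = []
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            states.append((match.group(1), int(match.group(2)), match.group(3)))
    return states


sample = [
    "Mar 15 15:39:25 tsdfft1 eth0: Link is up at 100 Mbps, half-duplex.",
    "Mar 15 15:39:25 tsdfft1 eth0: Pause is disabled",
]
states = link_states(sample)

# The network is expected to be full-duplex, so anything else is suspect.
suspect = [state for state in states if state[2] != "full"]
print(suspect)
```

If every link-up message reports half-duplex while the switch port is fixed at full-duplex, a duplex mismatch would be a plausible contributor to the transmit problems, although that alone would not confirm it.
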
If there is any information that I missed which would be
helpful, please let me know and I will be glad to pass on
what I can.

Patches and additional error log information on eth0 are
available at the following URL:
http://tinyurl.com/hxfbp

Summary of system information:
System: Sun Microsystems SunBlade 2000
Purchased: roughly 11/03
Processor: UltraSparcIII+/cheetah+/sparc64
NIC: Sun RIO GEM 10/100, built-in on SunBlade 2000
Linux Distro: Gentoo 2006.0
Kernel Versions: 2.6.16, 2.6.15.6, Gentoo's 2.4.32_r2

Specific error:

NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, resetting
eth0: TX_STATE[003ffc05:00000001:00000019]
eth0: RX_STATE[0100c805:00000001:00000021]
eth0: Link is up at 100 Mbps, half-duplex.
eth0: Pause is disabled
...
eth0: TX MAC xmit underrun.

Any advice, help, etc. would be greatly appreciated.

TIA,

Benjamen R. Meyer

P.S. I also posted to the netdev list at vger.kernel.org,
but I have not heard anything.
--
gentoo-sparc@g.o mailing list