Gentoo Archives: gentoo-sparc

From: BRM <bm_witness@×××××.com>
To: gentoo-sparc@l.g.o
Subject: [gentoo-sparc] Sun Gem (RIO GEM r01) errors...
Date: Mon, 03 Apr 2006 19:29:52
Message-Id: 20060403192955.34673.qmail@web60011.mail.yahoo.com
I recently rebuilt a SunBlade 2000 system, moving it from
Solaris 8 to Gentoo 2006.0. The system sports a Sun RIO
GEM NIC, which worked quite well for the first few days;
however, we didn't load it heavily during that period
either. The system's primary task is to serve as our
source repository, so it needs reliable networking.

The system was initially set up on 3/9/2006 and ran
fine until 3/15/2006, when we started getting the
following error messages:

Mar 15 15:39:25 tsdfft1 NETDEV WATCHDOG: eth0: transmit timed out
Mar 15 15:39:25 tsdfft1 eth0: transmit timed out, resetting
Mar 15 15:39:25 tsdfft1 eth0: TX_STATE[003ffc05:00000001:00000019]
Mar 15 15:39:25 tsdfft1 eth0: RX_STATE[0100c805:00000001:00000021]
Mar 15 15:39:25 tsdfft1 eth0: Link is up at 100 Mbps, half-duplex.
Mar 15 15:39:25 tsdfft1 eth0: Pause is disabled
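For comparing many of these entries: the bracketed hex words are raw register snapshots printed by the driver's transmit-timeout handler. Reading the 2.6-era sungem source, they appear to be the TX DMA config, TX MAC status and TX MAC config registers (RX_STATE holds the RX counterparts), but that mapping is an assumption worth checking against sungem.c. A minimal C sketch (parse_state() is an illustrative name) that pulls the three words out of a log line, so entries can be checked for identical register values:

```c
/* Sketch: extract the three hex register words from a sungem
 * "TX_STATE[a:b:c]" or "RX_STATE[a:b:c]" log line.
 * Which hardware registers the words correspond to is an assumption
 * (see the driver source); this only parses the text. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns 0 and fills words[0..2] on success; -1 if `tag` is not
 * present or the bracketed part is malformed. */
static int parse_state(const char *line, const char *tag, uint32_t words[3])
{
    const char *p = strstr(line, tag);
    if (p == NULL)
        return -1;
    p += strlen(tag); /* now points at the opening '[' */

    /* The literal '[' and ':' characters must match exactly; each
     * %8x conversion reads up to eight hex digits. */
    if (sscanf(p, "[%8" SCNx32 ":%8" SCNx32 ":%8" SCNx32 "]",
               &words[0], &words[1], &words[2]) != 3)
        return -1;
    return 0;
}
```

For example, parse_state(line, "TX_STATE", w) on the TX_STATE entry above yields 0x003ffc05, 0x00000001 and 0x00000019.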

And:

Mar 15 16:11:58 tsdfft1 eth0: TX MAC xmit underrun.

We're presently using the vanilla 2.6.16 kernel with
sungem driver version 0.98. We have also seen this
issue with the vanilla 2.6.15.6 kernel and with the
2.4.32_r2 kernel provided by Gentoo 2006.0.

The first error is sporadic but recurs from time to
time (the message is identical every time, apart from
the date and time). The second is the most reproducible:
all I have to do is pull down source from the
repository (hosted on Apache2 via WebDAV), and after
about 6 MiB of data transfer the link dies. An
ifconfig down/up brings it back for a while longer,
after which the system needs a reboot.

In researching the issue, I found that one of several
things could be at play, such as the card going bad or
a driver problem. I found a link to an xmit underrun
issue for Solaris, but was unable to access it because
it is restricted on sunsolve.sun.com. So I have no
guarantee that going back to Solaris would solve the
issue either.

I have had a hard time finding reports of the xmit
underrun under Linux; most searches turn up references
to where the message is generated in the source, not
users trying to find solutions to the problem.

I did, however, notice that there was a similar
problem with overflows on the RX side of the chip,
which was solved by resetting the chip's RX unit
via gem_rxmac_reset().

My first attempt at a fix was to modify the driver
where the underrun is reported so that it schedules a
reset, based on code elsewhere in the driver. (See
sungem-fix1.patch.txt.)

At first this patch did not seem to work; however, I
have been running the kernel with it for about a week
now, and at least SSH and Apache seem to keep running.
So I do think it has improved the situation, but it
does not solve the problem on the Subversion side
(Apache/WebDAV), which still dies (I just tested again
to make sure).

I then tried building a solution modeled on
gem_rxmac_reset() and the various init functions, and
produced gem_txmac_reset(). However, the first time I
used it, it locked up the kernel. It may simply be that
I tried to take a lock when I shouldn't have (I did try
to acquire both the lock and tx_lock for the driver),
but I am not sure that I did it correctly.

I would very much appreciate it if someone more
familiar with the sungem driver would look at the
patches and verify that (a) this is the correct thing
to do, and (b) I did it correctly.

I am aware that the network the system is running on
is supposed to be full-duplex, 100 Mbps. However, I
have noticed that the card/driver seems to think it is
half-duplex. Could this simply be a duplex issue? I
have no control over the switch it is plugged into (as
far as settings go), and I have not been able to find a
way to get ifconfig to force it to full-duplex. (We've
typically built the driver into the kernel.)

If there is any information I missed that would be
helpful, please let me know and I will be glad to
pass on what I can.

Patches and additional error log information for eth0
are available at the following URL:
http://tinyurl.com/hxfbp

Summary of system information:
System: Sun Microsystems SunBlade 2000
Purchased: roughly 11/03
Processor: UltraSPARC III+/cheetah+/sparc64
NIC: Sun RIO GEM 10/100, built into the SunBlade 2000
Linux Distro: Gentoo 2006.0
Kernel Versions: 2.6.16, 2.6.15.6, Gentoo's 2.4.32_r2

Specific error:

NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, resetting
eth0: TX_STATE[003ffc05:00000001:00000019]
eth0: RX_STATE[0100c805:00000001:00000021]
eth0: Link is up at 100 Mbps, half-duplex.
eth0: Pause is disabled
...
eth0: TX MAC xmit underrun.

Any advice, help, etc. would be greatly appreciated.

TIA,

Benjamen R. Meyer

P.S. I also posted to the netdev list at
vger.kernel.org, but I have not heard anything yet.
--
gentoo-sparc@g.o mailing list

Replies

Subject Author
Re: [gentoo-sparc] Sun Gem (RIO GEM r01) errors... Paul Heinlein <heinlein@××××××.com>