1 |
neysx 05/07/28 13:21:06 |
2 |
|
3 |
Added: xml/htdocs/doc/en/articles hardware-stability-p1.xml |
4 |
hardware-stability-p2.xml |
5 |
Log: |
6 |
#100524 xmlified hardware stability articles |
7 |
|
8 |
Revision Changes Path |
9 |
1.1 xml/htdocs/doc/en/articles/hardware-stability-p1.xml |
10 |
|
11 |
file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo |
12 |
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo |
13 |
|
14 |
Index: hardware-stability-p1.xml |
15 |
=================================================================== |
16 |
<?xml version="1.0" encoding="UTF-8"?> |
17 |
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd"> |
18 |
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p1.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ --> |
19 |
|
20 |
<guide link="/doc/en/articles/hardware-stability-p1.xml" lang="en"> |
21 |
<title>Linux hardware stability guide, Part 1</title> |
22 |
|
23 |
<author title="Author"> |
24 |
<mail link="drobbins@g.o">Daniel Robbins</mail> |
25 |
</author> |
26 |
<author title="Editor"> |
27 |
<mail link="jackdark@×××××.com">Joshua Saddler</mail> |
28 |
</author> |
29 |
|
30 |
<abstract> |
31 |
In this article, Daniel Robbins shows you how to diagnose and fix CPU |
32 |
flakiness, as well as how to test your RAM for defects. By the end of this |
33 |
article, you'll have the skills to ensure that your Linux system is as stable |
34 |
as it possibly can be. |
35 |
</abstract> |
36 |
|
37 |
<!-- The original version of this article was first published on IBM |
38 |
developerWorks, and is property of Westtech Information Services. This |
39 |
document is an updated version of the original article, and contains |
40 |
various improvements made by the Gentoo Linux Documentation team --> |
41 |
|
42 |
<version>1.0</version> |
43 |
<date>2005-07-28</date> |
44 |
|
45 |
<chapter> |
46 |
<title>CPU troubleshooting</title> |
47 |
<section> |
48 |
<body> |
49 |
|
50 |
<note> |
51 |
The original version of this article was first published on IBM developerWorks, |
52 |
and is property of Westtech Information Services. This document is an updated |
53 |
version of the original article, and contains various improvements made by the |
54 |
Gentoo Linux Documentation team. |
55 |
</note> |
56 |
|
57 |
<p> |
58 |
Many of us in the Linux world have been bitten by nasty hardware problems. How |
59 |
many of us have set up a Linux box, installed our favorite distribution, |
60 |
compiled and installed some additional apps, and gotten everything working |
61 |
perfectly only to find that our new system has an (argh!) fatal hardware bug? |
62 |
Whether the symptoms are random segmentation faults, data corruption, hard |
63 |
locks, or lost data is irrelevant -- the hardware glitch effectively makes our |
64 |
normally reliable Linux operating system barely able to stay afloat. In this |
65 |
article, we'll take an in-depth look at how to detect flaky CPUs and RAM -- |
66 |
allowing you to replace the defective parts before they do some serious |
67 |
damage. |
68 |
</p> |
69 |
|
70 |
<p> |
71 |
If you're experiencing instability problems and suspect they are hardware |
72 |
related, I encourage you to test both your CPU and memory to ensure that |
73 |
they're working OK. However, even if you haven't experienced these problems, |
74 |
it's still a good idea to perform these CPU and memory tests. In doing so, you |
75 |
may detect a hardware problem that could have bitten you at an inopportune |
76 |
time, something that could have caused data loss or hours of frustration in a |
77 |
frantic search for the source of the problem. The proper, proactive application |
78 |
of these techniques can help you to avoid a lot of headaches, and if your |
79 |
system passes the tests, you'll have the peace of mind that your system is up |
80 |
to spec. |
81 |
</p> |
82 |
|
83 |
</body> |
84 |
</section> |
85 |
<section> |
86 |
<title>CPU issues</title> |
87 |
<body> |
88 |
|
89 |
<p> |
90 |
If you have a horribly defective CPU, your machine may be unable to boot Linux |
91 |
or may only run for a few minutes before locking up. CPUs in this ragged state |
92 |
are easy to diagnose as defective because the symptoms are so obvious. But |
93 |
there are more subtle CPU defects that aren't so easy to detect; generally, the |
94 |
less obvious errors are the ones that cause machines to either lock up every |
95 |
now and then for no apparent reason, or cause certain processes to die |
96 |
unexpectedly. Most CPU instabilities can be triggered by "exercising" the CPU |
97 |
-- giving it a bunch of work to do, causing it to heat up and possibly flake |
98 |
out. Let's look at some ways to stress-test the CPU. |
99 |
</p> |
100 |
|
101 |
<p> |
102 |
You may be surprised to hear that one of the best tests of CPU stability is |
103 |
built in to Linux -- the kernel compile. The gcc compiler is a great tool for |
104 |
testing general CPU stability, and a kernel build uses gcc a whole lot. By |
105 |
creating and running the following script from your <path>/usr/src/linux</path> |
106 |
directory, you can give your machine an industrial-strength kernel compile |
107 |
stress test: |
108 |
</p> |
109 |
|
110 |
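<pre caption="The cpubuild script">
#!/bin/bash
# Generate dependency information once (required by 2.4-series kernel builds).
make dep
# Rebuild the kernel over and over until one of the builds fails.
while true
do
  # Start every pass from a clean tree so that all files are recompiled.
  make clean
  # Set -j to one more than the number of CPUs in the machine.
  if ! make -j2 bzImage
  then
    echo "OUCH OUCH OUCH OUCH"
    exit 1
  fi
done
</pre>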
124 |
|
125 |
<p> |
126 |
You'll notice that this script <e>repeatedly</e> compiles the kernel. The |
127 |
reason for this is simple -- some CPUs have intermittent glitches, allowing |
128 |
them to compile the kernel perfectly 95% of the time, but causing the kernel |
129 |
compile to bomb out every now and then. Normally, this is because it may take |
130 |
five or more kernel compiles before the processor heats up to the point where |
131 |
it becomes unstable. |
132 |
</p> |
133 |
|
134 |
<p> |
135 |
In the above script, make sure to adjust the <c>-j</c> option so that the
number following it is one greater than the number of CPUs in your system; in
other words, use "2" for uniprocessors, "3" for dual-processors, etc. The
<c>-j</c> option tells <c>make</c> to build the kernel in parallel, so that
there's always at least one gcc process on deck after each source file is
compiled, which keeps the stress on your CPU at its maximum. If your Linux
box is going to be unused for the afternoon, go ahead and run this script, and
let the machine recompile the kernel for a few hours.
143 |
</p> |
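<p>
As a rough sketch, the "CPUs plus one" rule can be computed rather than
hard-coded. The example below assumes a glibc-based system where
<c>getconf _NPROCESSORS_ONLN</c> reports the number of online CPUs; the
<c>jobs_for_cpus</c> helper is a hypothetical name and is not part of the
original script. It only prints the resulting command, so nothing is built:
</p>

<pre caption="Deriving the -j value (illustrative sketch)">
```shell
#!/bin/bash
# Hypothetical helper: print the -j value for a given CPU count (CPUs + 1).
jobs_for_cpus() {
  echo $(( $1 + 1 ))
}

# Assumption: getconf supports _NPROCESSORS_ONLN (true on glibc systems).
cpus=$(getconf _NPROCESSORS_ONLN)

# Show the make invocation that the cpubuild loop would use on this machine.
echo "make -j$(jobs_for_cpus "$cpus") bzImage"
```
</pre>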
144 |
|
145 |
</body> |
146 |
</section> |
147 |
<section> |
148 |
<title>Possible CPU problems</title> |
149 |
<body> |
150 |
|
151 |
<p> |
152 |
If the script runs perfectly for several hours, congratulations! Your CPU has
passed the first test. However, it's possible that the above script dies
unexpectedly. How do you know you're having a CPU problem as opposed to
something else? If gcc spat out an error like the following (signal 11 is
SIGSEGV, a segmentation fault), then there's a very good possibility that your
CPU is defective:
157 |
</p> |
158 |
|
159 |
<pre caption="GCC error"> |
160 |
gcc: Internal compiler error: program cc1 got fatal signal 11 |
161 |
</pre> |
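<p>
If the build output was saved to a file, a check along these lines could find
the message afterwards. This is only a sketch: <c>has_sig11</c> and the log
file name are hypothetical, and the pattern is copied from the message above,
which other gcc versions may word differently.
</p>

<pre caption="Searching a saved build log for signal 11 (illustrative sketch)">
```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) if the file given as the first
# argument contains the "got fatal signal 11" message shown above.
has_sig11() {
  grep -q 'got fatal signal 11' "$1"
}

# Example use with an assumed log file name:
#   if has_sig11 build.log; then echo "signal 11 seen"; fi
```
</pre>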
162 |
|
163 |
<p> |
164 |
At this point, you have about three possibilities as to the state of your CPU: |
165 |
</p> |
166 |
|
167 |
<ul> |
168 |
<li> |
169 |
If you type <c>make bzImage</c> to resume the kernel compilation, and the |
170 |
compiler dies on the exact same file, keep typing <c>make bzImage</c> over |
171 |
and over again. If after about ten tries the build process continues to die |
172 |
on this particular file, then the problem is most likely caused by a (rare) |
173 |
gcc compiler bug that's being triggered by this particular source file, |
174 |
rather than a flaky CPU. However, these days, gcc is quite stable, so this |
175 |
isn't likely to happen. |
176 |
</li> |
177 |
<li> |
178 |
If you type <c>make bzImage</c> to resume kernel compilation, and you get |
179 |
another signal 11 a little bit later, then your CPU is most likely on its |
180 |
last legs. |
181 |
</li> |
182 |
<li> |
183 |
If you type <c>make bzImage</c> to resume kernel compilation and the kernel
compiles successfully, this doesn't mean that your CPU is OK. It usually
means that the glitch shows up only every now and then, typically when the
CPU rises above a certain temperature (a CPU gets hotter when it is used
for an extended period of time, and it may take several kernel compiles to
reach that critical point).
189 |
</li> |
190 |
</ul> |
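<p>
The three cases above can be summarized as a small decision helper. This is a
sketch, not a diagnostic tool: <c>classify_retries</c> is a hypothetical
function, and it assumes that for each retry of <c>make bzImage</c> you noted
either the source file the compiler died on or <c>ok</c> for a completed
build.
</p>

<pre caption="Classifying retry outcomes (illustrative sketch)">
```shell
#!/bin/bash
# Hypothetical helper. Each argument is the outcome of one retry: the name of
# the file the compiler died on, or "ok" if that retry finished the build.
classify_retries() {
  if [ $# -eq 0 ]
  then
    echo "usage: classify_retries OUTCOME..."
    return 1
  fi
  local first="$1" outcome
  for outcome in "$@"
  do
    if [ "$outcome" = "ok" ]
    then
      # The build completed: the glitch is intermittent, probably heat-related.
      echo "intermittent"
      return 0
    fi
    if [ "$outcome" != "$first" ]
    then
      # The failure moved to a different file: the CPU is the likely culprit.
      echo "flaky-cpu"
      return 0
    fi
  done
  # Every retry died on the same file: a compiler bug is more likely.
  echo "compiler-bug"
}
```
</pre>

<p>
The helper does not count retries; the "compiler-bug" verdict only carries
weight if you pass it about ten outcomes, as suggested above.
</p>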
191 |
|
192 |
</body> |
193 |
</section> |
194 |
<section> |
195 |
<title>Rescuing your CPU</title> |
196 |
<body> |
197 |
|
198 |
<p> |
199 |
If your CPU is experiencing random intermittent errors when placed under heavy |
200 |
load, it's possible that your CPU isn't defective at all -- maybe it simply |
201 |
isn't being cooled properly. Here are some things that you can check: |
202 |
</p> |
203 |
|
204 |
<ul> |
205 |
<li>Is your CPU fan plugged in?</li> |
206 |
<li>Is it relatively dust-free?</li> |
207 |
<li> |
208 |
Does the fan actually spin (and spin at the proper speed) when the power is |
209 |
on? |
210 |
</li> |
211 |
<li>Is the heat sink seated properly on the CPU?</li> |
212 |
<li>Is there thermal grease between the CPU and the heat sink?</li> |
213 |
<li>Does your case have adequate ventilation?</li> |
214 |
</ul> |
215 |
|
216 |
|
217 |
|
218 |
1.1 xml/htdocs/doc/en/articles/hardware-stability-p2.xml |
219 |
|
220 |
file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo |
221 |
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo |
222 |
|
223 |
Index: hardware-stability-p2.xml |
224 |
=================================================================== |
225 |
<?xml version="1.0" encoding="UTF-8"?> |
226 |
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd"> |
227 |
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p2.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ --> |
228 |
|
229 |
<guide link="/doc/en/articles/hardware-stability-p2.xml" lang="en"> |
230 |
<title>Linux hardware stability guide, Part 2</title> |
231 |
|
232 |
<author title="Author"> |
233 |
<mail link="drobbins@g.o">Daniel Robbins</mail> |
234 |
</author> |
235 |
<author title="Editor"> |
236 |
<mail link="jackdark@×××××.com">Joshua Saddler</mail> |
237 |
</author> |
238 |
|
239 |
<abstract> |
240 |
In this article, Daniel Robbins shares his experiences in getting his NVIDIA |
241 |
TNT graphics card working under Linux using NVIDIA's accelerated drivers. As he |
242 |
does, he'll show you how to diagnose and fix IRQ and PCI latency timer issues |
243 |
-- techniques you can use to ensure that your systems don't experience |
244 |
lock-ups, inconsistent behavior, or data loss. |
245 |
</abstract> |
246 |
|
247 |
<!-- The original version of this article was first published on IBM |
248 |
developerWorks, and is property of Westtech Information Services. This |
249 |
document is an updated version of the original article, and contains |
250 |
various improvements made by the Gentoo Linux Documentation team --> |
251 |
|
252 |
<version>1.0</version> |
253 |
<date>2005-07-28</date> |
254 |
|
255 |
<chapter> |
256 |
<title>Drivers</title> |
257 |
<section> |
258 |
<title>The many causes of instability</title> |
259 |
<body> |
260 |
|
261 |
<note> |
262 |
The original version of this article was first published on IBM developerWorks, |
263 |
and is property of Westtech Information Services. This document is an updated |
264 |
version of the original article, and contains various improvements made by the |
265 |
Gentoo Linux Documentation team. |
266 |
</note> |
267 |
|
268 |
<p> |
269 |
A stability problem is often not caused by defective hardware, but by improper |
270 |
hardware configuration or flaky drivers. My experience in this area began when |
271 |
I tried to get Linux working on my Diamond Viper V550, an NVIDIA TNT-based AGP |
272 |
card, using NVIDIA's own accelerated drivers. |
273 |
</p> |
274 |
|
275 |
<p> |
276 |
NVIDIA has its own display drivers for Linux, a collaboration between NVIDIA,
SGI, and VA Linux. These drivers have a lot of advantages over the standard
2D-only NVIDIA drivers included with XFree86 4.0. For one, they have full
accelerated 3D support. Even better, they feature an official OpenGL 1.2
implementation, rather than just an enhanced version of Mesa. So, all in all,
these accelerated drivers are the ones you want to be using if you own an
NVIDIA-based graphics card, at least in theory. My attempt to get them working
properly turned out to be an excellent learning experience, to say the least.
284 |
</p> |
285 |
|
286 |
<p> |
287 |
After I installed the accelerated Linux NVIDIA drivers (see <uri
link="#resources">Resources</uri> later in this article), I started up XFree86
and began playing around with all my 3D applications, now wonderfully
accelerated as they should be. Up until that point, I had had to reboot into
Windows NT in order to take advantage of 3D acceleration. Now, while I don't
mind NT, having to reboot to use 3D apps was somewhat annoying, and I was glad
to have one less reason to leave Linux and reboot my machine. However, after
playing around for an hour or so, I experienced a fatal setback to my Linux 3D
aspirations -- my machine locked up. My mouse simply stopped moving and the
screen froze, and I had to reboot my system.
297 |
</p> |
298 |
|
299 |
<p> |
300 |
Yes, I was having some kind of stability problem. But I didn't know exactly |
301 |
what was causing the problem. Did I have flaky hardware, or was the card |
302 |
misconfigured? Or maybe it was a problem with the driver -- did it not like my |
303 |
VIA KT133-based Athlon motherboard? Whatever the problem, I wanted to resolve |
304 |
it quickly. In this article, I'm going to share with you the procedure that I |
305 |
went through to fix my hardware stability problem. Although you may not be |
306 |
struggling with exactly the same issue, the steps that I used to diagnose and |
307 |
(mostly) fix the problem are general in nature and applicable to many different |
308 |
types of Linux hardware problems. |
309 |
</p> |
310 |
|
311 |
</body> |
312 |
</section> |
313 |
<section> |
314 |
<title>First, the hardware</title> |
315 |
<body> |
316 |
|
317 |
<p> |
318 |
The first thing that crossed my mind was that I might have flaky or |
319 |
under-cooled hardware. On the one hand, my Diamond Viper V550 seemed to have no |
320 |
problems under Windows NT. On the other hand, maybe Linux was somehow pushing |
321 |
the chip harder and triggering heat-related lock-ups. My V550 did get |
322 |
<e>extremely</e> hot, and its OEM heatsink seemed at best barely adequate. The |
323 |
combination of the lock-ups and the fact that this card was being marginally |
324 |
cooled convinced me to head over to PC Power and Cooling (see <uri |
325 |
link="#resources">Resources</uri>) to purchase a mini integrated heatsink/fan |
326 |
for my V550. |
327 |
</p> |
328 |
|
329 |
<p> |
330 |
So, I received my Video Cool, popped off the OEM heatsink on the video card
(voiding the warranty), cleaned off the TNT chip and affixed the Video Cool to
the top of the chip. Verdict? My video card didn't get extremely hot anymore,
but the lock-ups continued. The lesson I learned from this particular
experience is this -- if you ensure that your system is adequately cooled to
begin with, you'll never need to worry about components malfunctioning due to
inadequate cooling. This in itself is a good reason to invest some time and
effort in making sure that your workstations and servers run cool. Now that I
had taken care of the heat issue, I knew that the lock-ups were most likely not
due to flaky hardware, and I began to look elsewhere.
340 |
</p> |
341 |
|
342 |
</body> |
343 |
</section> |
344 |
<section> |
345 |
<title>New drivers -- and a possible solution?</title> |
346 |
<body> |
347 |
|
348 |
<p> |
349 |
I partly suspected that NVIDIA's drivers were themselves the cause of the
problem. Fortunately, a new version of the drivers had just been released, so
I immediately upgraded in the hope that this would solve my stability problem.
Unfortunately, it didn't, and after checking with others on the #nvidia channel
on openprojects.net, I found out that while not everyone was able to get the
driver to operate stably, it was working for a lot of people.
355 |
</p> |
356 |
|
357 |
<p> |
358 |
On #nvidia, someone suggested that I make sure that the V550 wasn't sharing an |
359 |
IRQ with another card. Unlike the standard XFree86 driver, the accelerated |
360 |
NVIDIA driver requires an IRQ for proper operation. To see if it had its own |
361 |
dedicated IRQ, I typed <c>cat /proc/interrupts</c>, and lo and behold, my V550 |
362 |
was sharing an interrupt with my IDE controller. Before I explain how I solved |
363 |
this particular problem, I'd like to give you a brief background on IRQs. |
364 |
</p> |
365 |
|
366 |
<p> |
367 |
PCs use IRQs, and hardware interrupts in general, to allow peripheral devices, |
368 |
such as the video card and the disk controllers, to signal the CPU that they |
369 |
have data that's ready to be processed. In the old days before the PCI bus |
370 |
existed, it was critical that each device in the machine had its own, dedicated |
371 |
IRQ. In case you are still using ISA peripherals in your machine, this is still |
372 |
true -- all non-PCI devices should have their own dedicated IRQ. |
373 |
</p> |
374 |
|
375 |
</body> |
376 |
</section> |
377 |
</chapter> |
378 |
|
379 |
<chapter> |
380 |
<title>IRQs</title> |
381 |
<section> |
382 |
<title>IRQs and PCI</title> |
383 |
<body> |
384 |
|
385 |
<p> |
386 |
However, things are a little different with the PCI bus. PCI allocates four
IRQs that can be used by the PCI/AGP cards in your system. In general, these
IRQs <e>can</e> be shared among multiple devices. (If you do this, make sure
that all the devices doing the sharing are PCI and AGP devices.) IRQ sharing is
important, especially for modern machines that may have five PCI slots and one
AGP slot. Without IRQ sharing, you would be unable to have more than four
IRQ-using cards in your system.
393 |
</p> |
394 |
|
395 |
<p> |
396 |
There are, however, some limitations to PCI IRQ sharing. While modern
motherboard BIOSes and Linux kernels generally support PCI IRQ sharing, certain
PCI cards may simply refuse to work properly when sharing an IRQ with another
device. If you're experiencing random system lock-ups, especially lock-ups that
appear to be correlated with the use of a specific hardware device, you may
want to try to get all your PCI devices to use their own IRQs, just to be on
the safe side. The first step is to see if any devices in your system are
sharing IRQs to begin with. To do this, follow these steps:
404 |
</p> |
405 |
|
406 |
<ol> |
407 |
<li> |
408 |
Use the various hardware devices in your system, such as disk, sound, |
409 |
video, SCSI, etc. This ensures that Linux will handle interrupts for these |
410 |
various devices. |
411 |
</li> |
412 |
<li> |
413 |
Type <c>cat /proc/interrupts</c>, which displays a list and count of all
interrupts that the Linux kernel has handled so far. Look at the far-right
column of this list. If two or more devices are listed in a single row,
then they're sharing that particular IRQ.
417 |
</li> |
418 |
</ol> |
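<p>
The check in the second step can be automated, because the kernel separates
the names of devices that share an IRQ with a comma and a space. A minimal
sketch, in which <c>shared_irqs</c> is a hypothetical helper name; a device
name that itself contains a comma would produce a false match:
</p>

<pre caption="Listing shared IRQs (illustrative sketch)">
```shell
#!/bin/bash
# Hypothetical helper: print the rows of an interrupt table that list more
# than one device. Reads /proc/interrupts unless another file is given.
shared_irqs() {
  local file="${1:-/proc/interrupts}"
  # Devices sharing an IRQ are joined by ", " in the last column, e.g.
  #  11:      12345    XT-PIC  ide2, nvidia
  grep ', ' "$file"
}
```
</pre>

<p>
Running <c>shared_irqs</c> without arguments inspects the live system; if it
prints nothing, no row lists more than one device.
</p>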
419 |
|
420 |
<p> |
421 |
If one of the devices in question is a non-PCI device (ISA or other legacy |
422 |
cards) then you've found yourself an IRQ conflict, which you can attempt to fix |
423 |
with your BIOS, the isapnptools package, or the physical jumpers on your |
424 |
|
425 |
|
426 |
|
427 |
-- |
428 |
gentoo-doc-cvs@g.o mailing list |