Gentoo Archives: gentoo-doc-cvs

From: Xavier Neys <neysx@×××××××××××.org>
To: gentoo-doc-cvs@l.g.o
Subject: [gentoo-doc-cvs] cvs commit: hardware-stability-p1.xml
Date: Thu, 28 Jul 2005 13:21:37
Message-Id: 200507281320.j6SDKnQf018739@robin.gentoo.org
1 neysx 05/07/28 13:21:06
2
3 Added: xml/htdocs/doc/en/articles hardware-stability-p1.xml
4 hardware-stability-p2.xml
5 Log:
6 #100524 xmlified hardware stability articles
7
8 Revision Changes Path
9 1.1 xml/htdocs/doc/en/articles/hardware-stability-p1.xml
10
11 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
12 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
13
14 Index: hardware-stability-p1.xml
15 ===================================================================
16 <?xml version="1.0" encoding="UTF-8"?>
17 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
18 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p1.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ -->
19
20 <guide link="/doc/en/articles/hardware-stability-p1.xml" lang="en">
21 <title>Linux hardware stability guide, Part 1</title>
22
23 <author title="Author">
24 <mail link="drobbins@g.o">Daniel Robbins</mail>
25 </author>
26 <author title="Editor">
27 <mail link="jackdark@×××××.com">Joshua Saddler</mail>
28 </author>
29
30 <abstract>
31 In this article, Daniel Robbins shows you how to diagnose and fix CPU
32 flakiness, as well as how to test your RAM for defects. By the end of this
33 article, you'll have the skills to ensure that your Linux system is as stable
34 as it possibly can be.
35 </abstract>
36
37 <!-- The original version of this article was first published on IBM
38 developerWorks, and is property of Westtech Information Services. This
39 document is an updated version of the original article, and contains
40 various improvements made by the Gentoo Linux Documentation team -->
41
42 <version>1.0</version>
43 <date>2005-07-28</date>
44
45 <chapter>
46 <title>CPU troubleshooting</title>
47 <section>
48 <body>
49
50 <note>
51 The original version of this article was first published on IBM developerWorks,
52 and is property of Westtech Information Services. This document is an updated
53 version of the original article, and contains various improvements made by the
54 Gentoo Linux Documentation team.
55 </note>
56
57 <p>
58 Many of us in the Linux world have been bitten by nasty hardware problems. How
59 many of us have set up a Linux box, installed our favorite distribution,
60 compiled and installed some additional apps, and gotten everything working
61 perfectly only to find that our new system has an (argh!) fatal hardware bug?
62 Whether the symptoms are random segmentation faults, data corruption, hard
63 locks, or lost data is irrelevant -- the hardware glitch effectively makes our
64 normally reliable Linux operating system barely able to stay afloat. In this
65 article, we'll take an in-depth look at how to detect flaky CPUs and RAM --
66 allowing you to replace the defective parts before they do some serious
67 damage.
68 </p>
69
70 <p>
71 If you're experiencing instability problems and suspect they are hardware
72 related, I encourage you to test both your CPU and memory to ensure that
73 they're working OK. However, even if you haven't experienced these problems,
74 it's still a good idea to perform these CPU and memory tests. In doing so, you
75 may detect a hardware problem that could have bitten you at an inopportune
76 time, something that could have caused data loss or hours of frustration in a
77 frantic search for the source of the problem. The proper, proactive application
78 of these techniques can help you to avoid a lot of headaches, and if your
79 system passes the tests, you'll have the peace of mind that your system is up
80 to spec.
81 </p>
82
83 </body>
84 </section>
85 <section>
86 <title>CPU issues</title>
87 <body>
88
89 <p>
90 If you have a horribly defective CPU, your machine may be unable to boot Linux
91 or may only run for a few minutes before locking up. CPUs in this ragged state
92 are easy to diagnose as defective because the symptoms are so obvious. But
93 there are more subtle CPU defects that aren't so easy to detect; generally, the
94 less obvious errors are the ones that cause machines to either lock up every
95 now and then for no apparent reason, or cause certain processes to die
96 unexpectedly. Most CPU instabilities can be triggered by "exercising" the CPU
97 -- giving it a bunch of work to do, causing it to heat up and possibly flake
98 out. Let's look at some ways to stress-test the CPU.
99 </p>
100
101 <p>
102 You may be surprised to hear that one of the best tests of CPU stability is
103 built in to Linux -- the kernel compile. The gcc compiler is a great tool for
104 testing general CPU stability, and a kernel build uses gcc a whole lot. By
105 creating and running the following script from your <path>/usr/src/linux</path>
106 directory, you can give your machine an industrial-strength kernel compile
107 stress test:
108 </p>
109
110 <pre caption="The cpubuild script">
111 #!/bin/bash
112 make dep
113 while [ "foo" = "foo" ]
114 do
115 make clean
116 make -j2 bzImage
117 if [ $? -ne 0 ]
118 then
119 echo OUCH OUCH OUCH OUCH
120 exit 1
121 fi
122 done
123 </pre>
124
125 <p>
126 You'll notice that this script <e>repeatedly</e> compiles the kernel. The
127 reason for this is simple -- some CPUs have intermittent glitches, allowing
128 them to compile the kernel perfectly 95% of the time, but causing the kernel
129 compile to bomb out every now and then. Typically, this is because it can take
130 five or more kernel compiles before the processor heats up to the point where
131 it becomes unstable.
132 </p>
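<p>
The repeated-build idea can be generalized into a small loop that also reports
which pass failed. What follows is only a sketch: <c>run_until_failure</c> is a
hypothetical helper name that does not appear in the original script, and the
pass limit is an assumed parameter added so the loop eventually stops on a
healthy machine.
</p>

```shell
#!/bin/bash
# run_until_failure is a hypothetical helper (not from the original article).
# It repeats a command up to max_passes times and reports the pass on which
# the command first returned a non-zero exit status.
run_until_failure() {
    local max_passes="$1" pass=1
    shift
    while [ "$pass" -le "$max_passes" ]; do
        # Run the remaining arguments as the command under test.
        if ! "$@"; then
            echo "failed on pass $pass"
            return 1
        fi
        pass=$(( pass + 1 ))
    done
    echo "completed $max_passes passes"
}
```

<p>
For example, <c>run_until_failure 20 sh -c 'make clean; make -j2 bzImage'</c>
would attempt twenty clean builds in a row. A failure that only appears after
several passes fits the heat-related pattern described above.
</p>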
133
134 <p>
135 In the above script, make sure to adjust the <c>-j</c> option so that the
136 number following it is one greater than the number of CPUs in your system; in
137 other words, use "2" for uniprocessors, "3" for dual-processors, etc. The
138 <c>-j</c> option tells <c>make</c> to build the kernel in parallel, ensuring
139 that there's always at least one gcc process on deck after each source file is
140 compiled -- ensuring that the stress on your CPU is maximized. If your Linux
141 box is going to be unused for the afternoon, go ahead and run this script, and
142 let the machine recompile the kernel for a few hours.
143 </p>
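<p>
The rule for choosing the <c>-j</c> value (number of CPUs plus one) can be
sketched as two small shell functions. Both names, <c>count_cpus</c> and
<c>jobs_for_cpus</c>, are hypothetical. Counting "processor" lines in
<path>/proc/cpuinfo</path> is an assumption that holds on x86 but may not match
the file layout on every architecture.
</p>

```shell
#!/bin/bash
# count_cpus (hypothetical helper) counts the "processor" entries in a
# cpuinfo-style file. The default path is an assumption; pass another file
# as the first argument to override it.
count_cpus() {
    local file="${1:-/proc/cpuinfo}"
    grep -c '^processor' "$file"
}

# jobs_for_cpus (hypothetical helper) applies the article's rule:
# one more job than the number of CPUs.
jobs_for_cpus() {
    local cpus="$1"
    echo $(( cpus + 1 ))
}
```

<p>
With these in place, the build line in the script could be written as
<c>make -j"$(jobs_for_cpus "$(count_cpus)")" bzImage</c> instead of a
hard-coded <c>-j2</c>.
</p>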
144
145 </body>
146 </section>
147 <section>
148 <title>Possible CPU problems</title>
149 <body>
150
151 <p>
152 If the script runs perfectly for several hours, congratulations! Your CPU has
153 passed the first test. However, it's possible that the above script dies
154 unexpectedly. How do you know you're having a CPU problem as opposed to
155 something else? Well, if gcc spat out an error like this, then there's a very
156 good possibility that your CPU is defective:
157 </p>
158
159 <pre caption="GCC error">
160 gcc: Internal compiler error: program cc1 got fatal signal 11
161 </pre>
162
163 <p>
164 At this point, there are three possibilities as to the state of your CPU:
165 </p>
166
167 <ul>
168 <li>
169 If you type <c>make bzImage</c> to resume the kernel compilation, and the
170 compiler dies on the exact same file, keep typing <c>make bzImage</c> over
171 and over again. If after about ten tries the build process continues to die
172 on this particular file, then the problem is most likely caused by a (rare)
173 gcc compiler bug that's being triggered by this particular source file,
174 rather than a flaky CPU. However, these days, gcc is quite stable, so this
175 isn't likely to happen.
176 </li>
177 <li>
178 If you type <c>make bzImage</c> to resume kernel compilation, and you get
179 another signal 11 a little bit later, then your CPU is most likely on its
180 last legs.
181 </li>
182 <li>
183 If you type <c>make bzImage</c> to resume kernel compilation and the kernel
184 compiles successfully, this doesn't mean that your CPU is OK. More likely,
185 the glitch only shows up every now and then, usually
186 only when the CPU rises above a certain temperature (a CPU will get hotter
187 when it is being used for an extended period of time, and may take several
188 kernel compiles to get to that critical point).
189 </li>
190 </ul>
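<p>
The three cases above amount to a simple decision rule, sketched here as a
shell function. <c>classify_retries</c> is a hypothetical helper, and its
output labels are made up for illustration. Each argument is the source file on
which one retry died, or the word <c>ok</c> for a retry that finished. The
sketch does not enforce the "about ten tries" guideline; it only looks at the
results it is given.
</p>

```shell
#!/bin/bash
# classify_retries (hypothetical helper) maps a list of retry results onto
# the three cases from the list above.
#   arguments: the file each retry died on, or "ok" if the retry completed
classify_retries() {
    local first="$1" f same=yes any_fail=no
    for f in "$@"; do
        [ "$f" = "ok" ] || any_fail=yes      # at least one retry died
        [ "$f" = "$first" ] || same=no       # results differ between retries
    done
    if [ "$any_fail" = "no" ]; then
        # Every retry passed: the glitch may simply be intermittent.
        echo "intermittent"
    elif [ "$same" = "yes" ]; then
        # Every retry died on the same file: a compiler bug is more likely.
        echo "compiler-bug"
    else
        # Failures are not reproducible on one file: suspect the CPU.
        echo "flaky-cpu"
    fi
}
```

<p>
For instance, <c>classify_retries fork.c fork.c fork.c</c> points at the
compiler, while <c>classify_retries fork.c sched.c</c> points at the CPU. The
file names here are placeholders only.
</p>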
191
192 </body>
193 </section>
194 <section>
195 <title>Rescuing your CPU</title>
196 <body>
197
198 <p>
199 If your CPU is experiencing random intermittent errors when placed under heavy
200 load, it's possible that your CPU isn't defective at all -- maybe it simply
201 isn't being cooled properly. Here are some things that you can check:
202 </p>
203
204 <ul>
205 <li>Is your CPU fan plugged in?</li>
206 <li>Is it relatively dust-free?</li>
207 <li>
208 Does the fan actually spin (and spin at the proper speed) when the power is
209 on?
210 </li>
211 <li>Is the heat sink seated properly on the CPU?</li>
212 <li>Is there thermal grease between the CPU and the heat sink?</li>
213 <li>Does your case have adequate ventilation?</li>
214 </ul>
215
216
217
218 1.1 xml/htdocs/doc/en/articles/hardware-stability-p2.xml
219
220 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
221 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
222
223 Index: hardware-stability-p2.xml
224 ===================================================================
225 <?xml version="1.0" encoding="UTF-8"?>
226 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
227 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p2.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ -->
228
229 <guide link="/doc/en/articles/hardware-stability-p2.xml" lang="en">
230 <title>Linux hardware stability guide, Part 2</title>
231
232 <author title="Author">
233 <mail link="drobbins@g.o">Daniel Robbins</mail>
234 </author>
235 <author title="Editor">
236 <mail link="jackdark@×××××.com">Joshua Saddler</mail>
237 </author>
238
239 <abstract>
240 In this article, Daniel Robbins shares his experiences in getting his NVIDIA
241 TNT graphics card working under Linux using NVIDIA's accelerated drivers. As he
242 does, he'll show you how to diagnose and fix IRQ and PCI latency timer issues
243 -- techniques you can use to ensure that your systems don't experience
244 lock-ups, inconsistent behavior, or data loss.
245 </abstract>
246
247 <!-- The original version of this article was first published on IBM
248 developerWorks, and is property of Westtech Information Services. This
249 document is an updated version of the original article, and contains
250 various improvements made by the Gentoo Linux Documentation team -->
251
252 <version>1.0</version>
253 <date>2005-07-28</date>
254
255 <chapter>
256 <title>Drivers</title>
257 <section>
258 <title>The many causes of instability</title>
259 <body>
260
261 <note>
262 The original version of this article was first published on IBM developerWorks,
263 and is property of Westtech Information Services. This document is an updated
264 version of the original article, and contains various improvements made by the
265 Gentoo Linux Documentation team.
266 </note>
267
268 <p>
269 A stability problem is often not caused by defective hardware, but by improper
270 hardware configuration or flaky drivers. My experience in this area began when
271 I tried to get Linux working on my Diamond Viper V550, an NVIDIA TNT-based AGP
272 card, using NVIDIA's own accelerated drivers.
273 </p>
274
275 <p>
276 NVIDIA has their own display drivers for Linux, a collaboration between NVIDIA,
277 SGI, and VA Linux. These drivers have a lot of advantages over the standard
278 2D-only NVIDIA drivers included with XFree86 4.0. For one, they have full
279 accelerated 3D support. Even better, they feature an official OpenGL 1.2
280 implementation, rather than just an enhanced version of Mesa. So, all in all,
281 these accelerated drivers are the ones you want to be using if you own an
282 NVIDIA-based graphics card, at least in theory. My attempt to get them working
283 properly turned out to be an excellent learning experience, to say the least.
284 </p>
285
286 <p>
287 After I installed the accelerated Linux NVIDIA drivers (see <uri
288 link="#resources">Resources</uri> later in this article), I started up XFree86
289 and began playing around with all my 3D applications, now wonderfully
290 accelerated as they should be. Up until that point, I had had to reboot into
291 Windows NT in order to take advantage of 3D acceleration. Now, while I don't
292 mind NT, having to reboot to use 3D apps was somewhat annoying, and I was glad
293 to have one less reason to leave Linux and reboot my machine. However, after
294 playing around for an hour or so, I experienced a fatal setback to my Linux 3D
295 aspirations -- my machine locked up. My mouse simply stopped moving and the
296 screen froze, and I had to reboot my system.
297 </p>
298
299 <p>
300 Yes, I was having some kind of stability problem. But I didn't know exactly
301 what was causing the problem. Did I have flaky hardware, or was the card
302 misconfigured? Or maybe it was a problem with the driver -- did it not like my
303 VIA KT133-based Athlon motherboard? Whatever the problem, I wanted to resolve
304 it quickly. In this article, I'm going to share with you the procedure that I
305 went through to fix my hardware stability problem. Although you may not be
306 struggling with exactly the same issue, the steps that I used to diagnose and
307 (mostly) fix the problem are general in nature and applicable to many different
308 types of Linux hardware problems.
309 </p>
310
311 </body>
312 </section>
313 <section>
314 <title>First, the hardware</title>
315 <body>
316
317 <p>
318 The first thing that crossed my mind was that I might have flaky or
319 under-cooled hardware. On the one hand, my Diamond Viper V550 seemed to have no
320 problems under Windows NT. On the other hand, maybe Linux was somehow pushing
321 the chip harder and triggering heat-related lock-ups. My V550 did get
322 <e>extremely</e> hot, and its OEM heatsink seemed at best barely adequate. The
323 combination of the lock-ups and the fact that this card was being marginally
324 cooled convinced me to head over to PC Power and Cooling (see <uri
325 link="#resources">Resources</uri>) to purchase a mini integrated heatsink/fan
326 for my V550.
327 </p>
328
329 <p>
330 So, I received my Video Cool, popped off the OEM heatsink on the video card
331 (voiding the warranty), cleaned off the TNT chip and affixed the Video Cool to
332 the top of the chip. Verdict? My video card didn't get extremely hot anymore,
333 but the lock-ups continued. The lesson I learned from this particular experience
334 is this -- if you ensure that your system is adequately cooled to begin with,
335 you'll never need to worry about components malfunctioning due to inadequate
336 cooling. This in itself is a good reason to invest some time and effort in
337 making sure that your workstations and servers run coolly. Now that I had taken
338 care of the heat issue, I knew that the lock-ups were most likely not due to
339 flaky hardware, and I began to look elsewhere.
340 </p>
341
342 </body>
343 </section>
344 <section>
345 <title>New drivers -- and a possible solution?</title>
346 <body>
347
348 <p>
349 I partly suspected that NVIDIA's drivers were themselves the cause of the
350 problem. Fortunately, a new version of the drivers had just been released, so
351 I immediately upgraded in the hope that this would solve my stability problem.
352 Unfortunately, it didn't, and after checking with others on the #nvidia channel
353 on openprojects.net, I found out that while not everyone was able to get the
354 driver to operate stably, it was working for a lot of people.
355 </p>
356
357 <p>
358 On #nvidia, someone suggested that I make sure that the V550 wasn't sharing an
359 IRQ with another card. Unlike the standard XFree86 driver, the accelerated
360 NVIDIA driver requires an IRQ for proper operation. To see if it had its own
361 dedicated IRQ, I typed <c>cat /proc/interrupts</c>, and lo and behold, my V550
362 was sharing an interrupt with my IDE controller. Before I explain how I solved
363 this particular problem, I'd like to give you a brief background on IRQs.
364 </p>
365
366 <p>
367 PCs use IRQs, and hardware interrupts in general, to allow peripheral devices,
368 such as the video card and the disk controllers, to signal the CPU that they
369 have data that's ready to be processed. In the old days before the PCI bus
370 existed, it was critical that each device in the machine had its own, dedicated
371 IRQ. In case you are still using ISA peripherals in your machine, this is still
372 true -- all non-PCI devices should have their own dedicated IRQ.
373 </p>
374
375 </body>
376 </section>
377 </chapter>
378
379 <chapter>
380 <title>IRQs</title>
381 <section>
382 <title>IRQs and PCI</title>
383 <body>
384
385 <p>
386 However, things are a little different with the PCI bus. PCI allocates four
387 IRQs that can be used by the PCI/AGP cards in your system. In general, these
388 IRQs <e>can</e> be shared among multiple devices. (If you do this, make sure
389 that all the devices doing the sharing are PCI and AGP devices.) IRQ sharing is
390 important, especially for modern machines that may have five PCI slots and one AGP
391 slot. Without IRQ sharing, you would be unable to have more than four IRQ-using
392 cards in your system.
393 </p>
394
395 <p>
396 There are, however, some limitations to PCI IRQ sharing. While modern
397 motherboards' BIOS and Linux kernels generally support PCI IRQ sharing, certain
398 PCI cards may simply refuse to work properly when sharing an IRQ with another
399 device. If you're experiencing random system lockups, especially lockups that
400 appear to be correlated with the use of a specific hardware device, you may
401 want to try to get all your PCI devices to use their own IRQs, just to be on
402 the safe side. The first step is to see if any devices in your system are
403 sharing IRQs to begin with. To do this, follow these steps:
404 </p>
405
406 <ol>
407 <li>
408 Use the various hardware devices in your system, such as disk, sound,
409 video, SCSI, etc. This ensures that Linux will handle interrupts for these
410 various devices.
411 </li>
412 <li>
413 Run <c>cat /proc/interrupts</c>, which displays a list and count of all
414 interrupts which the Linux kernel has handled so far. Look in the far right
415 column in this list. If two or more devices are listed in a single row,
416 then they're sharing that particular IRQ.
417 </li>
418 </ol>
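<p>
Scanning the far right column by eye can be replaced with a one-line filter.
The sketch below assumes that devices sharing an IRQ appear as a
comma-separated list at the end of a row, which is how the kernel normally
prints them. <c>shared_irqs</c> is a hypothetical helper name.
</p>

```shell
#!/bin/bash
# shared_irqs (hypothetical helper) prints every row of an interrupts-style
# file whose device column lists more than one device. The default path is
# /proc/interrupts; pass another file as the first argument to override it.
shared_irqs() {
    local file="${1:-/proc/interrupts}"
    # Rows start with an IRQ number and a colon; a comma followed by a space
    # later in the row marks a second device on the same IRQ.
    grep -E '^ *[0-9]+:.*, ' "$file"
}
```

<p>
Running <c>shared_irqs</c> with no arguments after exercising your hardware
prints only the shared rows. No output means that no numbered IRQ is currently
shared.
</p>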
419
420 <p>
421 If one of the devices in question is a non-PCI device (ISA or other legacy
422 cards) then you've found yourself an IRQ conflict, which you can attempt to fix
423 with your BIOS, the isapnptools package, or the physical jumpers on your
424
425
426
427 --
428 gentoo-doc-cvs@g.o mailing list