[gentoo-doc-cvs] cvs commit: hardware-stability-p1.xml - gentoo-doc-cvs

From:	Xavier Neys <neysx@×××××××××××.org>
To:	gentoo-doc-cvs@l.g.o
Subject:	[gentoo-doc-cvs] cvs commit: hardware-stability-p1.xml
Date:	Thu, 28 Jul 2005 13:21:37
Message-Id:	`200507281320.j6SDKnQf018739@robin.gentoo.org`

neysx       05/07/28 13:21:06

  Added:       xml/htdocs/doc/en/articles hardware-stability-p1.xml
                        hardware-stability-p2.xml
  Log:
  #100524 xmlified hardware stability articles

Revision  Changes    Path
1.1                  xml/htdocs/doc/en/articles/hardware-stability-p1.xml

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: hardware-stability-p1.xml
===================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p1.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ -->

<guide link="/doc/en/articles/hardware-stability-p1.xml" lang="en">
<title>Linux hardware stability guide, Part 1</title>

<author title="Author">
  <mail link="drobbins@g.o">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="jackdark@×××××.com">Joshua Saddler</mail>
</author>

<abstract>
In this article, Daniel Robbins shows you how to diagnose and fix CPU
flakiness, as well as how to test your RAM for defects. By the end of this
article, you'll have the skills to ensure that your Linux system is as stable
as it possibly can be.
</abstract>

<!-- The original version of this article was first published on IBM
developerWorks, and is property of Westtech Information Services. This
document is an updated version of the original article, and contains
various improvements made by the Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-28</date>

<chapter>
<title>CPU troubleshooting</title>
<section>
<body>

<note>
The original version of this article was first published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team.
</note>

<p>
Many of us in the Linux world have been bitten by nasty hardware problems. How
many of us have set up a Linux box, installed our favorite distribution,
compiled and installed some additional apps, and gotten everything working
perfectly only to find that our new system has an (argh!) fatal hardware bug?
Whether the symptoms are random segmentation faults, data corruption, hard
locks, or lost data is irrelevant -- the hardware glitch effectively makes our
normally reliable Linux operating system barely able to stay afloat. In this
article, we'll take an in-depth look at how to detect flaky CPUs and RAM --
allowing you to replace the defective parts before they do some serious
damage.
</p>

<p>
If you're experiencing instability problems and suspect they are hardware
related, I encourage you to test both your CPU and memory to ensure that
they're working OK. However, even if you haven't experienced these problems,
it's still a good idea to perform these CPU and memory tests. In doing so, you
may detect a hardware problem that could otherwise have bitten you at an
inopportune time, causing data loss or hours of frustration in a frantic search
for the source of the problem. Applied proactively, these techniques can help
you avoid a lot of headaches, and if your system passes the tests, you'll have
the peace of mind of knowing that your system is up to spec.
</p>

</body>
</section>
<section>
<title>CPU issues</title>
<body>

<p>
If you have a horribly defective CPU, your machine may be unable to boot Linux
or may only run for a few minutes before locking up. CPUs in this ragged state
are easy to diagnose as defective because the symptoms are so obvious. But
there are more subtle CPU defects that aren't so easy to detect; generally, the
less obvious errors are the ones that cause machines to either lock up every
now and then for no apparent reason, or cause certain processes to die
unexpectedly. Most CPU instabilities can be triggered by "exercising" the CPU
-- giving it a bunch of work to do, causing it to heat up and possibly flake
out. Let's look at some ways to stress-test the CPU.
</p>

<p>
You may be surprised to hear that one of the best tests of CPU stability is
built into Linux -- the kernel compile. The gcc compiler is a great tool for
testing general CPU stability, and a kernel build runs gcc a whole lot. By
creating and running the following script from your <path>/usr/src/linux</path>
directory, you can give your machine an industrial-strength kernel compile
stress test:
</p>

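<pre caption="The cpubuild script">
#!/bin/bash
# Stress-test the CPU by rebuilding the kernel until a build fails.
# Run from /usr/src/linux.

# Needed on 2.4 kernels only; 2.6 and later track dependencies automatically.
make dep

while true
do
  make clean
  # Set -j to the number of CPUs plus one.
  if ! make -j2 bzImage
  then
    echo OUCH OUCH OUCH OUCH
    exit 1
  fi
done
</pre>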

<p>
You'll notice that this script <e>repeatedly</e> compiles the kernel. The
reason for this is simple -- some CPUs have intermittent glitches, allowing
them to compile the kernel perfectly 95% of the time, but causing the kernel
compile to bomb out every now and then. Normally, this is because it may take
five or more kernel compiles before the processor heats up to the point where
it becomes unstable.
</p>

<p>
In the above script, make sure to adjust the <c>-j</c> option so that the
number following it is one greater than the number of CPUs in your system; in
other words, use "2" for uniprocessors, "3" for dual-processors, etc. The
<c>-j</c> option tells <c>make</c> to build the kernel in parallel, so that
there's always at least one gcc process on deck after each source file is
compiled, which keeps the stress on your CPU at its maximum. If your Linux
box is going to be unused for the afternoon, go ahead and run this script, and
let the machine recompile the kernel for a few hours.
</p>
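
<p>
As a rough sketch of the <c>-j</c> rule described above (CPU count plus one),
the value could be computed instead of hard-coded. The function names
<c>jobs_for_cpus</c> and <c>detect_cpus</c> are hypothetical and not part of
the original script, and <c>getconf _NPROCESSORS_ONLN</c> is assumed to be
available, as it is on glibc-based Linux systems.
</p>

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original cpubuild script.

# jobs_for_cpus N: print the -j value for N CPUs (N plus one).
jobs_for_cpus() {
  echo $(( $1 + 1 ))
}

# detect_cpus: print the number of CPUs currently online.
# Assumes getconf supports _NPROCESSORS_ONLN (true on glibc systems).
detect_cpus() {
  getconf _NPROCESSORS_ONLN
}
```

<p>
With these defined, the build line in the loop could read
<c>make -j"$(jobs_for_cpus "$(detect_cpus)")" bzImage</c>.
</p>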

</body>
</section>
<section>
<title>Possible CPU problems</title>
<body>

<p>
If the script runs perfectly for several hours, congratulations! Your CPU has
passed the first test. However, it's possible that the above script dies
unexpectedly. How do you know you're having a CPU problem as opposed to
something else? Well, if gcc spat out an error like this, then there's a very
good possibility that your CPU is defective:
</p>

<pre caption="GCC error">
gcc: Internal compiler error: program cc1 got fatal signal 11
</pre>

<p>
At this point, there are three possibilities as to the state of your CPU:
</p>

<ul>
  <li>
    If you type <c>make bzImage</c> to resume the kernel compilation, and the
    compiler dies on the exact same file, keep typing <c>make bzImage</c> over
    and over again. If after about ten tries the build process continues to die
    on this particular file, then the problem is most likely caused by a (rare)
    gcc compiler bug that's being triggered by this particular source file,
    rather than a flaky CPU. However, these days, gcc is quite stable, so this
    isn't likely to happen.
  </li>
  <li>
    If you type <c>make bzImage</c> to resume kernel compilation, and you get
    another signal 11 a little bit later, then your CPU is most likely on its
    last legs.
  </li>
  <li>
    If you type <c>make bzImage</c> to resume kernel compilation and the kernel
    compiles successfully, this doesn't mean that your CPU is OK. Usually, it
    means that your CPU glitch only shows up every now and then, typically
    only when the CPU rises above a certain temperature (a CPU gets hotter
    when it is used for an extended period of time, and it may take several
    kernel compiles to get to that critical point).
  </li>
</ul>
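
<p>
The first two cases above come down to whether repeated failures land on the
same source file. A minimal sketch of that decision, assuming you have noted by
hand the file each failed attempt died on, might look like this. The function
<c>classify_failures</c> is hypothetical, and it says nothing about the third
case, where a resumed build succeeds.
</p>

```shell
#!/bin/bash
# classify_failures FILE...
# Hypothetical helper: each argument is the source file on which one failed
# build attempt died, as read from the make output.
classify_failures() {
  if [ "$#" -lt 2 ]; then
    # One failure alone cannot tell the cases apart.
    echo "need more runs"
    return
  fi
  first="$1"
  for f in "$@"; do
    if [ "$f" != "$first" ]; then
      # Failures move around between files: not tied to one input.
      echo "intermittent: suspect hardware"
      return
    fi
  done
  # Every attempt died on the same file.
  echo "deterministic: suspect a compiler bug"
}
```

<p>
For example, <c>classify_failures fork.c fork.c fork.c</c> reports the
deterministic case, while <c>classify_failures fork.c sched.c</c> reports the
intermittent one. The file names here are placeholders.
</p>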

</body>
</section>
<section>
<title>Rescuing your CPU</title>
<body>

<p>
If your CPU is experiencing random intermittent errors when placed under heavy
load, it's possible that your CPU isn't defective at all -- maybe it simply
isn't being cooled properly. Here are some things that you can check:
</p>

<ul>
  <li>Is your CPU fan plugged in?</li>
  <li>Is it relatively dust-free?</li>
  <li>
    Does the fan actually spin (and spin at the proper speed) when the power is
    on?
  </li>
  <li>Is the heat sink seated properly on the CPU?</li>
  <li>Is there thermal grease between the CPU and the heat sink?</li>
  <li>Does your case have adequate ventilation?</li>
</ul>
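
<p>
Alongside the physical checks above, it may help to watch temperatures while
the stress test runs. The sketch below assumes a kernel that exposes the
thermal sysfs interface under <path>/sys/class/thermal</path>, which reports
values in millidegrees Celsius; older kernels may not provide it. The function
names <c>millideg_to_c</c> and <c>print_temps</c> are hypothetical.
</p>

```shell
#!/bin/bash
# Hypothetical helpers; assume the thermal sysfs interface is present.

# millideg_to_c VALUE: convert millidegrees Celsius to whole degrees.
millideg_to_c() {
  echo $(( $1 / 1000 ))
}

# print_temps [DIR]: print one line per thermal zone found under DIR
# (default /sys/class/thermal). Prints nothing if no zone is readable.
print_temps() {
  base="${1:-/sys/class/thermal}"
  for zone in "$base"/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    echo "$(basename "$zone"): $(millideg_to_c "$(cat "$zone/temp")") C"
  done
}
```

<p>
Running <c>print_temps</c> every few minutes during the kernel compile loop
shows whether temperatures keep climbing as the builds repeat.
</p>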

1.1                  xml/htdocs/doc/en/articles/hardware-stability-p2.xml

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: hardware-stability-p2.xml
===================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p2.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ -->

<guide link="/doc/en/articles/hardware-stability-p2.xml" lang="en">
<title>Linux hardware stability guide, Part 2</title>

<author title="Author">
  <mail link="drobbins@g.o">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="jackdark@×××××.com">Joshua Saddler</mail>
</author>

<abstract>
In this article, Daniel Robbins shares his experiences in getting his NVIDIA
TNT graphics card working under Linux using NVIDIA's accelerated drivers. As he
does, he'll show you how to diagnose and fix IRQ and PCI latency timer issues
-- techniques you can use to ensure that your systems don't experience
lock-ups, inconsistent behavior, or data loss.
</abstract>

<!-- The original version of this article was first published on IBM
developerWorks, and is property of Westtech Information Services. This
document is an updated version of the original article, and contains
various improvements made by the Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-28</date>

<chapter>
<title>Drivers</title>
<section>
<title>The many causes of instability</title>
<body>

<note>
The original version of this article was first published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team.
</note>

<p>
A stability problem is often not caused by defective hardware, but by improper
hardware configuration or flaky drivers. My experience in this area began when
I tried to get Linux working on my Diamond Viper V550, an NVIDIA TNT-based AGP
card, using NVIDIA's own accelerated drivers.
</p>

<p>
NVIDIA has its own display drivers for Linux, a collaboration between NVIDIA,
SGI, and VA Linux. These drivers have a lot of advantages over the standard
2D-only NVIDIA drivers included with XFree86 4.0. For one, they have full
accelerated 3D support. Even better, they feature an official OpenGL 1.2
implementation, rather than just an enhanced version of Mesa. So, all in all,
these accelerated drivers are the ones you want to be using if you own an
NVIDIA-based graphics card, at least in theory. My attempt to get them working
properly turned out to be an excellent learning experience, to say the least.
</p>

<p>
After I installed the accelerated Linux NVIDIA drivers (see <uri
link="#resources">Resources</uri> later in this article), I started up XFree86
and began playing around with all my 3D applications, now wonderfully
accelerated as they should be. Up until that point, I had had to reboot into
Windows NT in order to take advantage of 3D acceleration. Now, while I don't
mind NT, having to reboot to use 3D apps was somewhat annoying, and I was glad
to have one less reason to leave Linux and reboot my machine. However, after
playing around for an hour or so, I experienced a fatal setback to my Linux 3D
aspirations -- my machine locked up. My mouse simply stopped moving and the
screen froze, and I had to reboot my system.
</p>

<p>
Yes, I was having some kind of stability problem. But I didn't know exactly
what was causing the problem. Did I have flaky hardware, or was the card
misconfigured? Or maybe it was a problem with the driver -- did it not like my
VIA KT133-based Athlon motherboard? Whatever the problem, I wanted to resolve
it quickly. In this article, I'm going to share with you the procedure that I
went through to fix my hardware stability problem. Although you may not be
struggling with exactly the same issue, the steps that I used to diagnose and
(mostly) fix the problem are general in nature and applicable to many different
types of Linux hardware problems.
</p>

</body>
</section>
<section>
<title>First, the hardware</title>
<body>

<p>
The first thing that crossed my mind was that I might have flaky or
under-cooled hardware. On the one hand, my Diamond Viper V550 seemed to have no
problems under Windows NT. On the other hand, maybe Linux was somehow pushing
the chip harder and triggering heat-related lock-ups. My V550 did get
<e>extremely</e> hot, and its OEM heatsink seemed at best barely adequate. The
combination of the lock-ups and the fact that this card was being marginally
cooled convinced me to head over to PC Power and Cooling (see <uri
link="#resources">Resources</uri>) to purchase a mini integrated heatsink/fan
for my V550.
</p>

<p>
So, I received my Video Cool, popped off the OEM heatsink on the video card
(voiding the warranty), cleaned off the TNT chip and affixed the Video Cool to
the top of the chip. Verdict? My video card didn't get extremely hot anymore,
but the lock-ups continued. The lesson I learned from this particular
experience is this -- if you ensure that your system is adequately cooled to
begin with, you'll never need to worry about components malfunctioning due to
inadequate cooling. This in itself is a good reason to invest some time and
effort in making sure that your workstations and servers run cool. Now that I
had taken care of the heat issue, I knew that the lock-ups were most likely not
due to flaky hardware, and I began to look elsewhere.
</p>

</body>
</section>
<section>
<title>New drivers -- and a possible solution?</title>
<body>

<p>
I partly suspected that NVIDIA's drivers were themselves the cause of the
problem. Fortunately, a new version of the drivers had just been released, so
I immediately upgraded in the hope that this would solve my stability problem.
Unfortunately, it didn't, and after checking with others on the #nvidia channel
on openprojects.net, I found out that while not everyone was able to get the
driver to operate stably, it was working for a lot of people.
</p>

<p>
On #nvidia, someone suggested that I make sure that the V550 wasn't sharing an
IRQ with another card. Unlike the standard XFree86 driver, the accelerated
NVIDIA driver requires an IRQ for proper operation. To see if it had its own
dedicated IRQ, I typed <c>cat /proc/interrupts</c>, and lo and behold, my V550
was sharing an interrupt with my IDE controller. Before I explain how I solved
this particular problem, I'd like to give you a brief background on IRQs.
</p>

<p>
PCs use IRQs, and hardware interrupts in general, to allow peripheral devices,
such as the video card and the disk controllers, to signal the CPU that they
have data that's ready to be processed. In the old days before the PCI bus
existed, it was critical that each device in the machine had its own, dedicated
IRQ. In case you are still using ISA peripherals in your machine, this is still
true -- all non-PCI devices should have their own dedicated IRQ.
</p>

</body>
</section>
</chapter>

<chapter>
<title>IRQs</title>
<section>
<title>IRQs and PCI</title>
<body>

<p>
However, things are a little different with the PCI bus. PCI allocates four
IRQs that can be used by the PCI/AGP cards in your system. In general, these
IRQs <e>can</e> be shared among multiple devices. (If you do this, make sure
that all the devices doing the sharing are PCI and AGP devices.) IRQ sharing is
important, especially for modern machines that may have five PCI slots and one
AGP slot. Without IRQ sharing, you would be unable to have more than four
IRQ-using cards in your system.
</p>

<p>
There are, however, some limitations to PCI IRQ sharing. While modern
motherboard BIOSes and Linux kernels generally support PCI IRQ sharing, certain
PCI cards may simply refuse to work properly when sharing an IRQ with another
device. If you're experiencing random system lock-ups, especially lock-ups that
appear to be correlated with the use of a specific hardware device, you may
want to try to get all your PCI devices to use their own IRQs, just to be on
the safe side. The first step is to see if any devices in your system are
sharing IRQs to begin with. To do this, follow these steps:
</p>

<ol>
  <li>
    Use the various hardware devices in your system, such as disk, sound,
    video, SCSI, etc. This ensures that Linux will handle interrupts for these
    various devices.
  </li>
  <li>
    Run <c>cat /proc/interrupts</c>, which displays a list and count of all
    interrupts that the Linux kernel has handled so far. Look in the far right
    column in this list. If two or more devices are listed in a single row,
    then they're sharing that particular IRQ.
  </li>
</ol>
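
<p>
The second step above can be sketched as a small filter that prints only the
rows naming more than one device. This is a heuristic that assumes shared
devices are separated by commas in the last column, and the function name
<c>shared_irqs</c> is hypothetical.
</p>

```shell
#!/bin/bash
# shared_irqs [FILE]
# Hypothetical helper: print every numbered IRQ row of FILE (default
# /proc/interrupts) whose text contains a comma, which is how rows with
# several devices are assumed to list them.
shared_irqs() {
  grep -E '^ *[0-9]+:.*,' "${1:-/proc/interrupts}"
}
```

<p>
If <c>shared_irqs</c> prints nothing, no numbered IRQ row lists more than one
device. Run it only after exercising your hardware, as described in the first
step.
</p>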

<p>
If one of the devices in question is a non-PCI device (ISA or other legacy
cards) then you've found yourself an IRQ conflict, which you can attempt to fix
with your BIOS, the isapnptools package, or the physical jumpers on your

--
gentoo-doc-cvs@g.o mailing list

1	neysx 05/07/28 13:21:06
2
3	Added: xml/htdocs/doc/en/articles hardware-stability-p1.xml
4	hardware-stability-p2.xml
5	Log:
6	#100524 xmlified hardware stability articles
7
8	Revision Changes Path
9	1.1 xml/htdocs/doc/en/articles/hardware-stability-p1.xml
10
11	file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
12	plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
13
14	Index: hardware-stability-p1.xml
15	===================================================================
16	<?xml version="1.0" encoding="UTF-8"?>
17	<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
18	<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p1.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ -->
19
20	<guide link="/doc/en/articles/hardware-stability-p1.xml" lang="en">
21	<title>Linux hardware stability guide, Part 1</title>
22
23	<author title="Author">
24	<mail link="drobbins@g.o">Daniel Robbins</mail>
25	</author>
26	<author title="Editor">
27	<mail link="jackdark@×××××.com">Joshua Saddler</mail>
28	</author>
29
30	<abstract>
31	In this article, Daniel Robbins shows you how to diagnose and fix CPU
32	flakiness, as well as how to test your RAM for defects. By the end of this
33	article, you'll have the skills to ensure that your Linux system is as stable
34	as it possibly can be.
35	</abstract>
36
37	<!-- The original version of this article was first published on IBM
38	developerWorks, and is property of Westtech Information Services. This
39	document is an updated version of the original article, and contains
40	various improvements made by the Gentoo Linux Documentation team -->
41
42	<version>1.0</version>
43	<date>2005-07-28</date>
44
45	<chapter>
46	<title>CPU troubleshooting</title>
47	<section>
48	<body>
49
50	<note>
51	The original version of this article was first published on IBM developerWorks,
52	and is property of Westtech Information Services. This document is an updated
53	version of the original article, and contains various improvements made by the
54	Gentoo Linux Documentation team.
55	</note>
56
57	<p>
58	Many of us in the Linux world have been bitten by nasty hardware problems. How
59	many of us have set up a Linux box, installed our favorite distribution,
60	compiled and installed some additional apps, and gotten everything working
61	perfectly only to find that our new system has an (argh!) fatal hardware bug?
62	Whether the symptoms are random segmentation faults, data corruption, hard
63	locks, or lost data is irrelevant -- the hardware glitch effectively makes our
64	normally reliable Linux operating system barely able to stay afloat. In this
65	article, we'll take an in-depth look at how to detect flaky CPUs and RAM --
66	allowing you to replace the defective parts before they do some serious
67	damage.
68	</p>
69
70	<p>
71	If you're experiencing instability problems and suspect they are hardware
72	related, I encourage you to test both your CPU and memory to ensure that
73	they're working OK. However, even if you haven't experienced these problems,
74	it's still a good idea to perform these CPU and memory tests. In doing so, you
75	may detect a hardware problem that could have bitten you at an inopportune
76	time, something that could have caused data loss or hours of frustration in a
77	frantic search for the source of the problem. The proper, proactive application
78	of these techniques can help you to avoid a lot of headaches, and if your
79	system passes the tests, you'll have the peace of mind that your system is up
80	to spec.
81	</p>
82
83	</body>
84	</section>
85	<section>
86	<title>CPU issues</title>
87	<body>
88
89	<p>
90	If you have a horribly defective CPU, your machine may be unable to boot Linux
91	or may only run for a few minutes before locking up. CPUs in this ragged state
92	are easy to diagnose as defective because the symptoms are so obvious. But
93	there are more subtle CPU defects that aren't so easy to detect; generally, the
94	less obvious errors are the ones that cause machines to either lock up every
95	now and then for no apparent reason, or cause certain processes to die
96	unexpectedly. Most CPU instabilities can be triggered by "exercising" the CPU
97	-- giving it a bunch of work to do, causing it to heat up and possibly flake
98	out. Let's look at some ways to stress-test the CPU.
99	</p>
100
101	<p>
102	You may be surprised to hear that one of the best tests of CPU stability is
103	built in to Linux -- the kernel compile. The gcc compiler is a great tool for
104	testing general CPU stability, and a kernel build uses gcc a whole lot. By
105	creating and running the following script from your <path>/usr/src/linux</path>
106	directory, you can give your machine an industrial-strength kernel compile
107	stress test:
108	</p>
109
110	<pre caption="The cpubuild script">
111	#!/bin/bash
112	make dep
113	while [ "foo" = "foo" ]
114	do
115	make clean
116	make -j2 bzImage
117	if [ $? -ne 0 ]
118	then
119	echo OUCH OUCH OUCH OUCH
120	exit 1
121	fi
122	done
123	</pre>
124
125	<p>
126	You'll notice that this script <e>repeatedly</e> compiles the kernel. The
127	reason for this is simple -- some CPUs have intermittent glitches, allowing
128	them to compile the kernel perfectly 95% of the time, but causing the kernel
129	compile to bomb out every now and then. Normally, this is because it may take
130	five or more kernel compiles before the processor heats up to the point where
131	it becomes unstable.
132	</p>
133
134	<p>
135	In the above script, make sure to adjust the <c>-j</c> option so that the
136	number following it is one greater than the number of CPUs in your system; in
137	other words, use "2" for uniprocessors, "3" for dual-processors, etc. The
138	<c>-j</c> option tells <c>make</c> to build the kernel in parallel, ensuring
139	that there's always at least one gcc process on deck after each source file is
140	compiled -- ensuring that the stress on your CPU is maximized. If your Linux
141	box is going to be unused for the afternoon, go ahead and run this script, and
142	let the machine recompile the kernel for a few hours.
143	</p>
144
145	</body>
146	</section>
147	<section>
148	<title>Possible CPU problems</title>
149	<body>
150
151	<p>
152	If the script runs perfectly for several hours, congratulations! Your CPU has
153	passed the first test. However, it's possible that the above script dies
154	unexpectedly. How do you know you're having a CPU problem as opposed to
155	something else? Well, if gcc spat out an error like this, then there's a very
156	good possibility that your CPU is defective:
157	</p>
158
159	<pre caption="GCC error">
160	gcc: Internal compiler error: program cc1 got fatal signal 11
161	</pre>
162
163	<p>
164	At this point, you have about three possibilities as to the state of your CPU:
165	</p>
166
167	<ul>
168	<li>
169	If you type <c>make bzImage</c> to resume the kernel compilation, and the
170	compiler dies on the exact same file, keep typing <c>make bzImage</c> over
171	and over again. If after about ten tries the build process continues to die
172	on this particular file, then the problem is most likely caused by a (rare)
173	gcc compiler bug that's being triggered by this particular source file,
174	rather than a flaky CPU. However, these days, gcc is quite stable, so this
175	isn't likely to happen.
176	</li>
177	<li>
178	If you type <c>make bzImage</c> to resume kernel compilation, and you get
179	another signal 11 a little bit later, then your CPU is most likely on its
180	last legs.
181	</li>
182	<li>
183	If you type <c>make bzImage</c> to resume kernel compilation and the kernel
184	compiles successfully, this doesn't mean that your CPU is OK. Normally,
185	this means that your CPU glitch only shows up every now and then, normally
186	only when the CPU rises above a certain temperature (a CPU will get hotter
187	when it is being used for an extended period of time, and may take several
188	kernel compiles to get to that critical point).
189	</li>
190	</ul>
191
192	</body>
193	</section>
194	<section>
195	<title>Rescuing your CPU</title>
196	<body>
197
198	<p>
199	If your CPU is experiencing random intermittent errors when placed under heavy
200	load, it's possible that your CPU isn't defective at all -- maybe it simply
201	isn't being cooled properly. Here are some things that you can check:
202	</p>
203
204	<ul>
205	<li>Is your CPU fan plugged in?</li>
206	<li>Is it relatively dust-free?</li>
207	<li>
208	Does the fan actually spin (and spin at the proper speed) when the power is
209	on?
210	</li>
211	<li>Is the heat sink seated properly on the CPU?</li>
212	<li>Is there thermal grease between the CPU and the heat sink?</li>
213	<li>Does your case have adequate ventilation?</li>
214	</ul>
215
216
217
218	1.1 xml/htdocs/doc/en/articles/hardware-stability-p2.xml
219
220	file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
221	plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/hardware-stability-p2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
222
223	Index: hardware-stability-p2.xml
224	===================================================================
225	<?xml version="1.0" encoding="UTF-8"?>
226	<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
227	<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/hardware-stability-p2.xml,v 1.1 2005/07/28 13:21:06 neysx Exp $ -->
228
229	<guide link="/doc/en/articles/hardware-stability-p2.xml" lang="en">
230	<title>Linux hardware stability guide, Part 2</title>
231
232	<author title="Author">
233	<mail link="drobbins@g.o">Daniel Robbins</mail>
234	</author>
235	<author title="Editor">
236	<mail link="jackdark@×××××.com">Joshua Saddler</mail>
237	</author>
238
239	<abstract>
240	In this article, Daniel Robbins shares his experiences in getting his NVIDIA
241	TNT graphics card working under Linux using NVIDIA's accelerated drivers. As he
242	does, he'll show you how to diagnose and fix IRQ and PCI latency timer issues
243	-- techniques you can use to ensure that your systems don't experience
244	lock-ups, inconsistent behavior, or data loss.
245	</abstract>
246
247	<!-- The original version of this article was first published on IBM
248	developerWorks, and is property of Westtech Information Services. This
249	document is an updated version of the original article, and contains
250	various improvements made by the Gentoo Linux Documentation team -->
251
252	<version>1.0</version>
253	<date>2005-07-28</date>
254
255	<chapter>
256	<title>Drivers</title>
257	<section>
258	<title>The many causes of instability</title>
259	<body>
260
261	<note>
262	The original version of this article was first published on IBM developerWorks,
263	and is property of Westtech Information Services. This document is an updated
264	version of the original article, and contains various improvements made by the
265	Gentoo Linux Documentation team.
266	</note>
267
268	<p>
269	A stability problem is often not caused by defective hardware, but by improper
270	hardware configuration or flaky drivers. My experience in this area began when
271	I tried to get Linux working on my Diamond Viper V550, an NVIDIA TNT-based AGP
272	card, using NVIDIA's own accelerated drivers.
273	</p>

<p>
NVIDIA has its own display drivers for Linux, a collaboration between NVIDIA,
SGI, and VA Linux. These drivers have a lot of advantages over the standard
2D-only NVIDIA drivers included with XFree86 4.0. For one, they have full
accelerated 3D support. Even better, they feature an official OpenGL 1.2
implementation, rather than just an enhanced version of Mesa. So, all in all,
these accelerated drivers are the ones you want to be using if you own an
NVIDIA-based graphics card, at least in theory. My attempt to get them working
properly turned out to be an excellent learning experience, to say the least.
</p>

<p>
After I installed the accelerated Linux NVIDIA drivers (see <uri
link="#resources">Resources</uri> later in this article), I started up XFree86
and began playing around with all my 3D applications, now wonderfully
accelerated as they should be. Up until that point, I had had to reboot into
Windows NT in order to take advantage of 3D acceleration. Now, while I don't
mind NT, having to reboot to use 3D apps was somewhat annoying, and I was glad
to have one less reason to leave Linux and reboot my machine. However, after
playing around for an hour or so, I experienced a fatal setback to my Linux 3D
aspirations -- my machine locked up. My mouse simply stopped moving and the
screen froze, and I had to reboot my system.
</p>

<p>
Yes, I was having some kind of stability problem. But I didn't know exactly
what was causing the problem. Did I have flaky hardware, or was the card
misconfigured? Or maybe it was a problem with the driver -- did it not like my
VIA KT133-based Athlon motherboard? Whatever the problem, I wanted to resolve
it quickly. In this article, I'm going to share with you the procedure that I
went through to fix my hardware stability problem. Although you may not be
struggling with exactly the same issue, the steps that I used to diagnose and
(mostly) fix the problem are general in nature and applicable to many different
types of Linux hardware problems.
</p>

</body>
</section>
<section>
<title>First, the hardware</title>
<body>

<p>
The first thing that crossed my mind was that I might have flaky or
under-cooled hardware. On the one hand, my Diamond Viper V550 seemed to have no
problems under Windows NT. On the other hand, maybe Linux was somehow pushing
the chip harder and triggering heat-related lock-ups. My V550 did get
<e>extremely</e> hot, and its OEM heatsink seemed at best barely adequate. The
combination of the lock-ups and the fact that this card was being marginally
cooled convinced me to head over to PC Power and Cooling (see <uri
link="#resources">Resources</uri>) to purchase a mini integrated heatsink/fan
for my V550.
</p>

<p>
So, I received my Video Cool, popped off the OEM heatsink on the video card
(voiding the warranty), cleaned off the TNT chip and affixed the Video Cool to
the top of the chip. Verdict? My video card didn't get extremely hot anymore,
but the lock-ups continued. The lesson I learned from this particular experience
is this -- if you ensure that your system is adequately cooled to begin with,
you'll never need to worry about components malfunctioning due to inadequate
cooling. This in itself is a good reason to invest some time and effort in
making sure that your workstations and servers run cool. Now that I had taken
care of the heat issue, I knew that the lock-ups were most likely not due to
flaky hardware, and I began to look elsewhere.
</p>

</body>
</section>
<section>
<title>New drivers -- and a possible solution?</title>
<body>

<p>
I partly suspected that NVIDIA's drivers were themselves the cause of the
problem. Fortunately, a new version of the drivers had just been released, so
I immediately upgraded in the hope that this would solve my stability problem.
Unfortunately, it didn't, and after checking with others on the #nvidia channel
on openprojects.net, I found out that while not everyone was able to get the
driver to operate stably, it was working for a lot of people.
</p>

<p>
On #nvidia, someone suggested that I make sure that the V550 wasn't sharing an
IRQ with another card. Unlike the standard XFree86 driver, the accelerated
NVIDIA driver requires an IRQ for proper operation. To see if it had its own
dedicated IRQ, I typed <c>cat /proc/interrupts</c>, and lo and behold, my V550
was sharing an interrupt with my IDE controller. Before I explain how I solved
this particular problem, I'd like to give you a brief background on IRQs.
</p>

<p>
PCs use IRQs, and hardware interrupts in general, to allow peripheral devices,
such as the video card and the disk controllers, to signal the CPU that they
have data that's ready to be processed. In the old days before the PCI bus
existed, it was critical that each device in the machine had its own, dedicated
IRQ. If you are still using ISA peripherals in your machine, this is still
true -- all non-PCI devices should have their own dedicated IRQ.
</p>

</body>
</section>
</chapter>

<chapter>
<title>IRQs</title>
<section>
<title>IRQs and PCI</title>
<body>

<p>
However, things are a little different with the PCI bus. PCI allocates four
IRQs that can be used by the PCI/AGP cards in your system. In general, these
IRQs <e>can</e> be shared among multiple devices. (If you do this, make sure
that all the devices doing the sharing are PCI and AGP devices.) IRQ sharing is
important, especially for modern machines that may have five PCI slots and one
AGP slot. Without IRQ sharing, you would be unable to have more than four
IRQ-using cards in your system.
</p>
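
<p>
The arithmetic behind this can be sketched in a few lines of Python. This is
only an illustration: the card names are made up, and the round-robin policy is
an assumption chosen for simplicity, since the real routing depends on how the
motherboard is wired and on what the BIOS decides. It simply shows that six
cards cannot each have their own line when only four are available.
</p>

```python
from collections import defaultdict

PCI_IRQ_COUNT = 4  # the article's figure: four IRQs usable by PCI/AGP cards


def assign_round_robin(cards, irq_count=PCI_IRQ_COUNT):
    """Spread cards over irq_count interrupt lines, one after another.

    Illustrative policy only; real boards and BIOSes route differently.
    """
    lines = defaultdict(list)
    for index, card in enumerate(cards):
        lines[index % irq_count].append(card)
    return dict(lines)


# Hypothetical machine: one AGP card plus five PCI cards.
cards = ["agp-video", "pci-sound", "pci-nic", "pci-scsi", "pci-ide", "pci-tv"]
assignment = assign_round_robin(cards)

for line, users in sorted(assignment.items()):
    note = " (shared)" if len(users) > 1 else ""
    print("line %d: %s%s" % (line, ", ".join(users), note))
```

<p>
With six cards and four lines, two of the lines end up carrying two cards each,
whichever policy is used to hand them out.
</p>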

<p>
There are, however, some limitations to PCI IRQ sharing. While modern
motherboard BIOSes and Linux kernels generally support PCI IRQ sharing, certain
PCI cards may simply refuse to work properly when sharing an IRQ with another
device. If you're experiencing random system lock-ups, especially lock-ups that
appear to be correlated with the use of a specific hardware device, you may
want to try to get all your PCI devices to use their own IRQs, just to be on
the safe side. The first step is to see if any devices in your system are
sharing IRQs to begin with. To do this, follow these steps:
</p>

<ol>
  <li>
    Use the various hardware devices in your system, such as disk, sound,
    video, and SCSI. This ensures that Linux will have handled interrupts for
    each of these devices.
  </li>
  <li>
    Run <c>cat /proc/interrupts</c>, which displays a list and count of all
    interrupts that the Linux kernel has handled so far. Look in the far-right
    column of this list. If two or more devices are listed in a single row,
    then they're sharing that particular IRQ.
  </li>
</ol>
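
<p>
The check in step 2 can also be automated. The following is a minimal Python
sketch, not part of the original procedure. It assumes the classic layout in
which each numbered row ends with a comma-separated list of device names, and
the sample text is hypothetical rather than captured from a real machine.
Device names that contain spaces would be cut down to their last word.
</p>

```python
from pathlib import Path


def shared_irqs(text):
    """Return {irq_number: [device names]} for rows listing several devices."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        # Skip the CPU header row and non-numeric rows such as NMI or ERR.
        if not sep or not label.isdigit():
            continue
        fragments = [part.strip() for part in rest.split(",")]
        if len(fragments) == 1:
            continue  # no comma, so only one device is listed on this IRQ
        # The first fragment still carries the counts and controller name,
        # so keep only its last word as the device name (an assumption).
        first_words = fragments[0].split()
        if not first_words:
            continue
        names = [first_words[-1]] + [name for name in fragments[1:] if name]
        shared[int(label)] = names
    return shared


# Hypothetical sample in the style of a single-CPU system.
SAMPLE = """\
           CPU0
  0:    1203456          XT-PIC  timer
  1:       4021          XT-PIC  keyboard
 10:     530112          XT-PIC  ide2, nvidia
 14:      88213          XT-PIC  ide0
NMI:          0
"""

if __name__ == "__main__":
    source = Path("/proc/interrupts")
    text = source.read_text() if source.exists() else SAMPLE
    for irq, names in sorted(shared_irqs(text).items()):
        print("IRQ %d is shared by: %s" % (irq, ", ".join(names)))
```

<p>
On a Linux system the script reads the real <path>/proc/interrupts</path>;
elsewhere it falls back to the sample text. For the sample, the only row
reported is IRQ 10, shared by the hypothetical <c>ide2</c> and <c>nvidia</c>
entries.
</p>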

<p>
If one of the devices in question is a non-PCI device (ISA or other legacy
cards) then you've found yourself an IRQ conflict, which you can attempt to fix
with your BIOS, the isapnptools package, or the physical jumpers on your