[gentoo-desktop] Re: System problems - some progress - gentoo-desktop

From:	Duncan <1i5t5.duncan@×××.net>
To:	gentoo-desktop@l.g.o
Subject:	[gentoo-desktop] Re: System problems - some progress
Date:	Sat, 26 Mar 2011 08:42:25
Message-Id:	`pan.2011.03.26.08.40.10@cox.net`
In Reply to:	Re: [gentoo-desktop] Re: System problems - some progress by Lindsay Haisley

Lindsay Haisley posted on Fri, 25 Mar 2011 21:46:32 -0500 as excerpted:

> On Fri, 2011-03-25 at 22:59 +0000, Duncan wrote:
>> Simply my experience-educated opinion.  YMMV, as they say.  And of
>> course, it applies to new installations more than your current
>> situation, but as you mentioned that you are planning such a new
>> installation...
>
> Duncan, thanks for your very thorough discussion of current
> disk/RAID/filesystem/etc. technologies.  Wow!  I'm going to have to read
> it through several times to absorb it.  I've gotten to the point at
> which I'm more involved with what I can _do_ with the Linux boxes I set
> up than what I can do that's cool and cutting edge with Linux in setting
> them up, but playing with bleeding-edge stuff has always been tempting.

By contrast, Linux is still my hobby, tho really, a full time one in that
I spend hours a day at it, pretty much 7 days a week.  I'm thinking I
might switch to Linux as a job at some point, perhaps soon, but it's not a
switch I'll make lightly, and it's not something I'll even consider
"selling my soul for" to take -- it'll be on my terms or I might as well
stay with Linux as a hobby -- an arrangement that works and that suits me
fine.

Because Linux is a hobby, I go where my interest leads me.  Even tho I'm
not a developer, I test, bisect and file bugs on the latest git kernels
and am proud to be able to say that a number of bugs were fixed before
release because of my work (and one, a Radeon AGP graphics bug, only
after release: it hit stable and went two kernel releases before it was
ultimately fixed, as reasonably new graphics cards on AGP busses aren't
as common as they once were...).

Slowly, one at a time, I've tackled Bind DNS, NTPD, md/RAID, direct
netfilter/iptables (which interestingly enough were *SIGNIFICANTLY* easier
for me to wrap my mind around than the various so-called "easier" firewall
tools that ultimately use netfilter/iptables at the back end anyway,
perhaps because I already understood network basics and all the "simple"
ones simply obscured the situation for me) and other generally considered
"enterprise" tools.  But while having my head around these normally
enterprise technologies well enough to troubleshoot them may well help me
with a job in the field in the future, that's not why I learned them.  As
a Linux hobbyist, I learned them for much the same reason mountain climber
hobbyists climb mountains, because they were there, and for the personal
challenge.

Meanwhile, as I alluded to earlier, I tried LVM2 (successor to both EVMS
and the original LVM, as you likely already know) on top of md/RAID, but
found that for me, the layering of technologies obscured my
understanding, to the point where I was no longer comfortable with my
ability to recover in a disaster situation in which both the RAID and LVM2
levels were damaged.

Couple that with an experience where I had a broken LVM2 that needed
rebuilding, but with the portage tree on LVM2, and I realized that for
what I was doing, especially since md/raid's partitioned-raid support was
now quite robust, the LVM2 layer just added complexity for very little
real added flexibility or value, particularly since I couldn't put / on it
anyway, without an initr* (one of the technologies I've never taken time
to understand to the point I'm comfortable using it).

That's why I recommended that you pick a storage layer technology that
fits your needs as best you can, get comfortable with it, and avoid if
possible multi-layering.  The keep-it-simple rule really /does/ help avoid
admin-level fat-fingering, which really /is/ a threat to data and system
integrity.  Sure, there's a bit of additional flexibility by layering, but
it's worth the hard look at whether the benefit really does justify the
additional complexity.  In triple-digit or even higher double-digit
storage device situations, basically enterprise level, there are certainly
many scenarios where the multi-level layering adds significant value, but
part of being a good admin, ultimately, is recognizing where that's not
the case, and with a bit of experience under my belt, I realized it wasn't
a good tradeoff for my usage.

Here, I picked md/raid over lvm2 (and over hardware RAID) for a number of
reasons.  First, md/raid for / can be directly configured on the kernel
command line.  No initr* needed.  That allowed me to put off learning
initr* tech for another day, as well as reducing complexity.  As for
md/raid over hardware RAID, there's certainly a place for both,
particularly when md/raid may be layered on hardware raid, but for
low-budget emphasis of the R(edundant) in Redundant Array of Independent
Devices (RAID), there's nothing like being able to plug in not just
another drive, but any standard (SATA in my case) controller, and/or any
mobo, and with at worst a kernel rebuild with the new drivers (since I
choose to build-in what I need and not build what I don't, so a kernel
rebuild would be necessary if it were different controllers), be back up
and running.  No having to find a compatible RAID card...  Plus, the Linux
kernel md/RAID stack has FAR FAR more testing under all sorts of
corner-case situations than any hardware RAID is **EVER** going to get.
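To make "configured on the kernel command line" concrete, here is a minimal Python sketch that assembles such a boot argument.  It assumes the `md=<n>,dev0,dev1,...` form for arrays with persistent superblocks (with a `d` prefix on the number for a partitionable array); that syntax, the helper names and the device names are illustrative assumptions and should be checked against the md documentation shipped with your kernel.

```python
def md_boot_param(md_number, devices, partitionable=False):
    """Build an md= kernel parameter for an array with persistent superblocks.

    Assumed syntax: md=<n>,dev0,dev1,... with a "d" prefix on <n> for a
    partitionable array. Verify against your kernel's md documentation.
    """
    if not devices:
        raise ValueError("at least one member device is required")
    array_id = f"d{md_number}" if partitionable else str(md_number)
    return "md=" + ",".join([array_id, *devices])


def boot_args(md_number, devices, root_device):
    """Join the md= parameter and root= into one command-line fragment."""
    return f"{md_boot_param(md_number, devices)} root={root_device}"


if __name__ == "__main__":
    # Hypothetical two-disk RAID-1 holding /; device names are examples only.
    print(boot_args(0, ["/dev/sda1", "/dev/sdb1"], "/dev/md0"))
```

The resulting string would go on the kernel line of the bootloader config.  It only helps if the md and controller drivers are built into the kernel rather than built as modules, which is the setup described above.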

But as I mentioned, the times they are a changin'.  The latest desktop
environments (kde4.6, gnome3 and I believe the latest gnome2, and xfce?,
at minimum) are leaving the deprecated hal behind in favor of
udev/udisks/upower etc.  udisks in particular depends on device-mapper,
now part of lvm2, and the usual removable device auto-detect/auto-mount
functionality of the desktops depends in turn on udisks.  That probably
won't affect non-X server based installations, and arguably doesn't
/heavily/ (other than non-optional dependencies) affect *ix traditionalist
desktop users who aren't comfortable with auto-mount in the first place
(I'm one, there are known security tradeoffs involved, see recent stories
on auto-triggered vulns due to gnome scanning for icons on auto-mounted
thumbdrives and/or cds, for instance).  But it's a fact that within the
year, most new releases will be requiring that device-mapper and thus lvm2
be installed and device-mapper enabled in the kernel, to support their
automount functionality.  As such, and because lvm2 has at least basic
raid-0 and raid-1 support (tho not the more advanced stuff,
raid5/6/10/50/60 etc, last I checked, but I may well be behind) of its
own, lvm2 is likely to be a better choice for many than md/raid.  That's
particularly so for distributions relying on prebuilt kernels, therefore
modules, therefore initr*s, already, so lvm2's initr* requirement isn't a
factor.
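For those building their own kernels, a quick way to see whether device-mapper and md/raid are enabled is to read the kernel .config.  The sketch below is only an illustration: the symbol names (CONFIG_BLK_DEV_MD, CONFIG_MD_RAID1, CONFIG_BLK_DEV_DM) are the mainline ones as best I know them, and the sample config text is made up.

```python
# Symbols assumed relevant here: md core, the RAID-1 personality, device-mapper.
REQUIRED = ("CONFIG_BLK_DEV_MD", "CONFIG_MD_RAID1", "CONFIG_BLK_DEV_DM")


def parse_kconfig(text):
    """Map CONFIG_* symbols to their values from .config-style text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines are comments and are skipped.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def report(text, required=REQUIRED):
    """Return y (built in), m (module) or n (absent) for each required symbol."""
    options = parse_kconfig(text)
    return {name: options.get(name, "n") for name in required}


if __name__ == "__main__":
    sample = """\
CONFIG_BLK_DEV_MD=y
CONFIG_MD_RAID1=y
# CONFIG_BLK_DEV_DM is not set
"""
    for name, state in report(sample).items():
        print(f"{name}: {state}")
```

A value of `m` matters for the no-initr* case: anything needed to mount / has to be `y`, since modules can't be loaded before the root filesystem is available.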

Meanwhile, while btrfs isn't /yet/ a major choice unless you want still
clearly experimental, by first distro releases next year, it's very likely
to be.  And because it's the clear successor to ext*, and has built-in
multi-device and volume management flexibility of its own, come next year,
both lvm2 and md/raid will lose their place in the spotlight to a large
degree.  Yet btrfs is not yet mature and, while tantalizing in its
closeness, remains an impractical immediate choice.  Plus, there are
likely to be some limitations to its device management abilities that
aren't clear yet, at least not to those not intimately following its
development, and significant questions remain on what filesystem-supported
features will be boot-time-supported as well.

> Some of the stuff you've mentioned, such as btrfs, are totally new to me
> since I haven't kept up with the state of the art.  Some years ago we
> had EVMS, which was developed by IBM here in Austin.  I was a member of
> the Capital Area Central Texas UNIX Society (CACTUS) and we had the EVMS
> developers come and talk about it.  EVMS was great.  It was a layered
> technology with an API for a management client, so you could have a CLI,
> a GUI, a web-based management client, whatever, and all of them using
> the same API to the disk management layer.  It was an umbrella
> technology which covered several levels of Linux MD Raid plus LVM.  You
> could put fundamental storage elements together like tinker-toys and
> slice and dice them any way you wanted to.

The technology is truly wonderful.  Unfortunately, the fact that running /
on it requires an initr* means it's significantly less useful than it
might be.  Were that one requirement removed, the whole equation would be
altered and I may still be running LVM instead of md/raid.  Not that it's
much of a problem for those running a binary-based distribution already
dependent on an initr*, but even in my early days on Linux, back on
Mandrake, I was one of the outliers who learned kernel config
customization and building within months of switching to Linux, and never
looked back.  And once you're doing that, why have an initr* unless you're
absolutely forced into it, which in turn puts a pretty strong negative on
any technology that's going to force you into it...

> EVMS was started from an initrd, which set up the EVMS platform and then
> did a pivot_root to the EVMS-supported result.  I have our SOHO
> firewall/gateway and file server set up with it.  The root fs is on a
> Linux MD RAID-1 array, and what's on top of that I've rather forgotten
> but the result is a drive and partition layout that makes sense for the
> purpose of the box.  I set this up as a kind of proof-of-concept
> exercise because I was taken with EVMS and figured it would be useful,
> which it was.  The down side of this was that some time after that, IBM
> dropped support for the EVMS project and pulled their developers off of
> it.  I was impressed with the fact that IBM was actually paying people
> to develop open source stuff, but when they pulled the plug on it, EVMS
> became an orphaned project.  The firewall/gateway box runs Gentoo, so I
> proceeded with regular updates until one day the box stopped booting.
> The libraries, notably glibc, embedded in the initrd system got out of
> sync, version wise, with the rest of the system, and I was getting some
> severely strange errors early in the boot process followed by a kernel
> panic.  It took a bit of work to even _see_ the errors, since they were
> emitted in the boot process earlier than CLI scroll-back comes into
> play, and then there was further research to determine what I needed to
> do to fix the problem.  I ended up having to mount and then manually
> repair the initrd internal filesystem, manually reconstituting library
> symlinks as required.

That's interesting.  I thought most distributions used or recommended use
of an alternative libc in their initr*, one that was either fully
static-linked, so it didn't need to be included if the binaries were
static-linked, or at least was smaller and more fit for the purpose of a
dedicated limited-space-ram-disk early-boot environment.

But if you're basing the initr* on glibc, which would certainly be easier
and is, now that I think of it, probably the way gentoo handles it, yeah,
I could see the glibc getting stale in the initrd.

... Because if it was an initramfs, it'd be presumably rebuilt and thus
synced with the live system when the kernel was updated, since an
initramfs is appended to the kernel binary file itself.  That seems to me
to be a big benefit to the initramfs system, easier syncing with the built
kernel and the main system, tho it would certainly take longer, and I
expect for that reason that the initramfs rebuild could be
short-circuited, thus allowed to get stale, if desired.  But it should at
least be easier to keep updated if desired, because it /is/ dynamically
attached to the kernel binary at each kernel build.

But as I said I haven't really gotten into initr*s.  In fact, I don't even
build busybox, etc, here, sticking it in package.provided.  If my working
system gets so screwed up that I'd end up with busybox, I simply boot to
the backup / partition instead.  That backup is actually a fully
operational system snapshot taken at the point the backup was made, so it
includes everything the system did at that point.  As such, unlike most
people's limited recovery environments, I have a full-featured X, KDE,
etc, all fully functional and working just as they were on my main system
at the time of the backup.  So if it comes to it, I can simply switch to
it as my main root, and go back to work, updating or building from a new
stage-3 as necessary, at my leisure, not because I have to before I can
get a working system again, because the backup /is/ a working system, as
fully functional (if outdated) as it was on the day I took the backup.

> I've built some Linux boxes for one of my clients - 1U servers and the
> like.  These folks are pretty straight-forward in their requirements,
> mainly asking that the boxes just work.  The really creative work goes
> into the medical research PHP application that lives on the boxes, and
> I've learned beaucoup stuff about OOP in PHP, AJAX, etc. from the main
> programming fellow on the project.  We've standardized on Ubuntu server
> edition on SuperMicro 4 drive 1U boxes.  These boxes generally come with
> RAID supported by a proprietary chipset or two, which never works quite
> right with Linux, so the first job I always do with these is to rip out
> the SATA cabling from the back plane and replace the on-board RAID with
> an LSI 3ware card.  These cards don't mess around - they JUST WORK.
> LSI/3ware has been very good about supporting Linux for their products.
> We generally set these up as RAID-5 boxes.  There's a web-based
> monitoring daemon for Linux that comes with the card, and it just works,
> too, although it takes a bit of dickering.  The RAID has no component in
> user-space (except for the monitor daemon) and shows up as a single SCSI
> drive, which can be partitioned and formatted just as if it were a
> single drive.  The 3ware cards are nice!  If you're using a redundant
> array such as RAID-1 or RAID-5, you can designate a drive as a hot-spare
> and if one of the drives in an array fails, the card will fail over to
> the hot-spare, rebuild the array, and the monitor daemon will send you
> an email telling you that it happened.  Slick!

> The LSI 3ware cards aren't cheap, but they're not unreasonable either,
> and I've never had one fail.  I'm thinking that my drive setup on my
> new desktop box will probably use RAID-1 supported by a 3ware card.

That sounds about as standard as a hardware RAID card can be.  I like my
md/raid because I can literally use any standard SATA controller, but
there are certainly tradeoffs.  I'll have to keep this in mind in case I
ever do need to scale an installation to where hardware RAID is needed
(say if I were layering md/kernel RAID-0 on top of hardware RAID-1 or
RAID-5/6).
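On the md/raid side, the rough counterpart to a card's monitor daemon is watching /proc/mdstat (mdadm also has a monitor mode of its own).  The following is a minimal sketch, not a replacement for real monitoring: it assumes the usual `[expected/active]` status field in mdstat output, and the sample text and function name are illustrative.

```python
import re

ARRAY_RE = re.compile(r"^(md\S*)\s*:")    # e.g. "md0 : active raid1 ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]")  # e.g. "[2/1]" = 2 expected, 1 active


def degraded_arrays(mdstat_text):
    """Return names of arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            expected, active = map(int, status.groups())
            if active < expected:
                degraded.append(current)
            current = None  # ignore later lines (resync progress etc.)
    return degraded


if __name__ == "__main__":
    # Made-up sample; on a live system read /proc/mdstat instead.
    sample = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1]
      488279488 blocks [2/1] [_U]

unused devices: <none>
"""
    print(degraded_arrays(sample))
```

Run from cron with the output mailed on a non-empty result, this gives a crude version of the failure notification described for the 3ware card.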

FWIW, experience again.  I don't believe software RAID-5/6 to be worth
it.  md/raid 0 or 1 or 10, yes, but if I'm doing RAID-5 or 6, it'll be
hardware, likely underneath a kernel level RAID-0 or 10, for a final
RAID-50/60/500/600/510/610, with the left-most digit as the hardware
implementation.  This is because software RAID-5/6 is simply slow.
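To see what such nesting costs in space, here is a small capacity calculator.  It is a sketch under simplifying assumptions: equal-sized devices, conventional minimum device counts, and RAID-10 as two-way mirrored pairs.  The function name and the example sizes are illustrative.

```python
def usable_capacity(level, devices, size):
    """Usable capacity of one array built from equal-sized devices.

    Assumes conventional layouts; RAID-10 is treated as two-way mirrored
    pairs striped together, so it needs an even number of devices.
    """
    minimum = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if devices < minimum[level]:
        raise ValueError(f"RAID-{level} needs at least {minimum[level]} devices")
    if level == 0:
        return devices * size        # striping only, no redundancy
    if level == 1:
        return size                  # every device holds a full copy
    if level == 5:
        return (devices - 1) * size  # one device's worth of parity
    if level == 6:
        return (devices - 2) * size  # two devices' worth of parity
    if devices % 2:
        raise ValueError("this RAID-10 model needs an even device count")
    return (devices // 2) * size


if __name__ == "__main__":
    # RAID-1 across four 300 GB drives leaves one drive's worth of space.
    print(usable_capacity(1, 4, 300))
    # RAID-50: kernel RAID-0 over two hardware RAID-5 sets of four drives.
    print(usable_capacity(0, 2, usable_capacity(5, 4, 300)))
```

Nested levels compose by passing the inner array's capacity as the outer array's device size, which matches the "left-most digit is the hardware layer" reading of RAID-50/60 above.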

Similar experience, Linux md/RAID-1 is /surprisingly/ efficient, MUCH more
so than one might imagine, as the kernel I/O scheduler makes *VERY* good
use of parallel scheduling.  In fact, in many cases it beats RAID-0
performance, unless of course you need the additional space of RAID-0.
Certainly, that has been my experience, at least.

> I'll probably use an ext3 filesystem on the partitions.  I know ext4 is
> under development, but I'm not sure if it would offer me any advantages.

FWIW, ext4 is no longer "under development", it's officially mature and
ready for use, and has been for a number of kernels, now.  In fact,
they're actually planning on killing separate ext2/3 driver support as the
ext4 driver implements it anyway.

As such, I'd definitely recommend considering ext4, noting of course that
you can specifically enable/disable various ext4 features at mkfs and/or
mount time, if desired.  Thus there's really no reason to stick with ext3
now.  Go with ext4, and if your situation warrants, disable one or more of
the ext4 features, making it more ext3-like.

OTOH, I'd specifically recommend evaluating the journal options,
regardless of ext3/ext4 choice.  ext3 defaulted to "ordered" for years,
then for a few kernels, switched to "writeback" by default, then just
recently (2.6.38?) switched back to "ordered".  AFAIK, ext4 has always
defaulted to the faster but less corner-case crash safe "writeback".  The
third and most conservative option is of course "journal" (journal the
data too, as opposed to metadata only, with the other two).

Having lived thru the reiserfs "writeback" era and been OH so glad when
they implemented and defaulted to "ordered" for it, I don't believe I'll
/ever/ trust anything beyond what I'd trust on a RAID-0 without backups,
to "writeback" again, regardless of /what/ the filesystem designers say or
what the default is.

And, I know of at least one person who experienced data integrity issues
with writeback on ext3 when the kernel was defaulting to that, issues that
immediately disappeared when he switched back to ordered.

Bottom line, yeah I believe ext4 is safe, but ext3 or ext4, unless you
really do /not/ care about your data integrity or are going to the extreme
and already have data=journal, DEFINITELY specify data=ordered, both in
your mount options, and by setting the defaults via tune2fs.
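A quick way to audit this is to check which ext3/ext4 filesystems actually carry an explicit data= option.  The sketch below parses text in the /proc/mounts (or fstab) field layout; the sample entries and the function name are made up for illustration.  For the tune2fs side, the default mount option is, to my knowledge, set with `tune2fs -o journal_data_ordered <device>`; check the man page for your e2fsprogs version.

```python
def journal_mode_report(mounts_text):
    """Return {mount_point: data mode or None} for ext3/ext4 entries.

    None means no explicit data= option was listed, so the mode in effect
    is whatever the kernel or the superblock default says.
    """
    report = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] not in ("ext3", "ext4"):
            continue
        mode = None
        for option in fields[3].split(","):
            if option.startswith("data="):
                mode = option[len("data="):]
        report[fields[1]] = mode
    return report


if __name__ == "__main__":
    # Made-up sample; on a live system read /proc/mounts instead.
    sample = """\
/dev/md0 / ext4 rw,noatime,data=ordered 0 0
/dev/md1 /home ext3 rw,data=writeback 0 0
/dev/md2 /var ext4 rw,noatime 0 0
tmpfs /tmp tmpfs rw 0 0
"""
    for mount_point, mode in journal_mode_report(sample).items():
        print(mount_point, mode or "no explicit data= option")
```

Anything reported as writeback, or with no explicit option on a kernel whose default you haven't verified, is a candidate for adding data=ordered.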

If there's one bit of advice in all these posts that I'd have you take
away, it's that.  It's NOT worth the integrity of your data!  Use
data=ordered unless you really do NOT care, to the same degree that you
don't put data you care about on RAID-0, without at least ensuring that
it's backed up elsewhere.  I've seen people lose data needlessly over
this; I've lost it on reiserfs myself before they implemented data=ordered
by default, and truly, just as with RAID-0, data=writeback is NOT worth
whatever performance increase it might bring, unless you really do /not/
care about the data integrity on that filesystem!

> I used reiserfs on some of the partitions on my servers, and on some
> partitions on my desktop box too.  Big mistake!  There was a bug in
> reiserfs support in the current kernel when I built the first server and
> the kernel crapped all over the hard drive one night and the box
> crashed!

IDR the kernel version but there's one that's specifically warned about.
That must have been the one...

But FWIW, I've had no problems, even thru a period of bad-ram resulting in
kernel crashes and the like, since the introduction of data=ordered.
Given the time mentioned above when ext3 defaulted to data=writeback, I'd
even venture to say that for that period and on those kernels, reiserfs
may well have been safer than ext3!

> I was able to fix it and salvage customer data, but it was
> pretty scary.  Hans Reiser is in prison for life for murder, and there's
> like one person on the Linux kernel development group who maintains
> reiserfs.  Ext3/4, on the other hand, is solid, maybe not quite as fast,
> but supported by a dedicated group of developers.

For many years, the kernel person doing most of the reiserfs maintenance
and the one who introduced the previously mentioned data=ordered mode by
default and data=journal mode as an option, was Chris Mason.  I believe he
was employed by SuSE, for years the biggest distribution to default to
reiserfs, even before it was in mainline, I believe.  I'm not sure if he's
still the official reiserfs maintainer due to his current duties, but he
DEFINITELY groks the filesystem.  Those current duties?  He's employed by
Oracle now, and is the lead developer of btrfs.

Now reiserfs does have its warts.  It's not particularly fast any more,
and has performance issues on multi-core systems due to its design around
the BKL (big kernel lock, deprecated for years with users converting to
other lock methods, no current in-tree users with 2.6.38 and set to be
fully removed with 2.6.39).  Tho reiserfs was converted to other locking
a few kernels ago, the single-access-at-a-time assumption and bottleneck
live on.  However, it is and has been quite stable for many years now,
since the intro of data=ordered, to the point that as mentioned above, I
believe it was safer than ext3 during the time ext3 defaulted to
writeback, because reiserfs still had the saner default of data=ordered.

But to each his own.  I'd still argue that a data=writeback default is
needlessly risking data, however, and is far more dangerous, regardless of
whether it's ext3, ext4, or reiserfs, than any of the three themselves
otherwise are.

> So I'm thinking that this new box will have a couple of
> professional-grade (the 5-year warranty type) 1.5 or 2 TB drives and a
> 3ware card.  I still haven't settled on the mainboard, which will have
> to support the 3ware card, a couple of sound cards and a legacy Adaptec
> SCSI card for our ancient but extremely well-built HP scanner.  The
> chipset will have to be well supported in Linux.  I'll probably build
> the box myself once I decide on the hardware.

FWIW, my RAID is 4x SATA 300 gig Seagates, 5 year warranty I expect now
either expired or soon to.  Most of the system is RAID-1 across all four,
however, and I'm backed up to external as well altho I'll admit that
backup's dated, now.  I bought them after having a string of bad luck
with ~1 year failures on both Maxtor (which had previously been quite
dependable for me) and Western Digital (which I had read bad things about
but thought I'd try after Maxtor, only to have the same ~1 year issues).
Obviously, they've long outlasted those, so I've been satisfied.

As I said, I'll keep the 3ware RAID cards in mind.

Mainboard:  If a server board fits your budget, I'd highly recommend
getting a Tyan board that's Linux certified.  The one I'm running in my
main machine is now 8 years old, /long/ out of warranty and beyond further
BIOS updates, but still running solid.  It was a $400 board back then,
reasonable for a dual-socket Opteron.  Not only did it come with Linux
certifications for various distributions, but they had Linux specific
support.  Further, how many boards do /you/ know of that have a
pre-customized sensors.conf file available for download? =:^)  And when
the dual-cores came out, a BIOS update was made available that supported
them.  As a result, while it was $400, that then-leading-edge dual-socket
Opteron board, while no longer leading edge by any means, eight years
later still forms the core for an acceptably decent system:
dual-dual-core Opteron 290s @ 2.8 GHz (topped out the sockets),
currently 6 gig RAM, 3x2-gig as I had one stick die that I've not
replaced, but 8 sockets so I could run 16 gig if I wanted, 4xSATA drives,
only SATA-150, but they're RAIDED, Radeon hd4650 AGP (no PCI-E, tho it
does have PCI-X), etc.  No PCI-E, limited to SATA-150, and no hardware
virtualization instruction support on the CPUs, so it's definitely dated,
but after all, it's an 8-year-old system!

It's likely to be a decade old by the time I actually upgrade it.  Yes,
it's definitely a server-class board and the $400 I paid reflected that,
but 8 years and shooting for 10!  And with the official Linux support
including a custom sensors.conf.  I'm satisfied that I got my money's
worth.

But I don't believe all of Tyan's boards are as completely Linux supported
as that one was, so do your research.

> I'm gonna apply the KISS principle to the OS design for this, and stay
> away from bleeding edge software technologies, although, especially
> after reading your essay, it's very tempting to try some of this stuff
> out to see what the _really_ smart people are coming up with!  I'm
> getting off of the Linux state-of-the art train for a while and go
> walking in the woods.  The kernel will have to be low-latency since I
> may use the box for recording work with Jack and Ardour2, and almost
> certainly for audio editing, and maybe video editing at some point.
> That's where my energy is going to go for this one.

Well, save btrfs for a project a couple years down the line, then.  But
certainly, investigate md/raid vs lvm2 and make your choice, keeping in
mind that while nowadays they overlap features, md/raid doesn't require an
initr* to run / on it, while lvm2 will likely be pulled in as a dependency
for your X desktop, at least kde/gnome/xfce, by later this year, whether
you actually use its lvm features or not.

And do consider ext4, but regardless of ext3/4, be /sure/ you either
choose data=ordered or can give a good reason why you didn't.
(Low-latency writing just might be a reasonable excuse for data=writeback,
but be sure you keep backed up if you do!)  Because /that/ one may well
save your data, someday!

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
