[gentoo-doc-cvs] cvs commit: l-awk1.xml - gentoo-doc-cvs

From:	Xavier Neys <neysx@×××××××××××.org>
To:	gentoo-doc-cvs@l.g.o
Subject:	[gentoo-doc-cvs] cvs commit: l-awk1.xml
Date:	Thu, 28 Jul 2005 08:04:32
Message-Id:	`200507280803.j6S83kpL027109@robin.gentoo.org`

neysx       05/07/28 08:04:04

  Added:       xml/htdocs/doc/en/articles l-awk1.xml l-awk2.xml l-awk3.xml
  Log:
  #99260 xmlified awk articles

Revision  Changes    Path
1.1                  xml/htdocs/doc/en/articles/l-awk1.xml

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: l-awk1.xml
===================================================================

15

<?xml version='1.0' encoding="UTF-8"?>

16

<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk1.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->

17

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

18

19

<guide link="/doc/en/articles/l-awk1.xml">

20

<title>Awk by example, Part 1</title>

21

22

<author title="Author">

23

  <mail link="drobbins@g.o">Daniel Robbins</mail>

24

</author>

25

<author title="Editor">

26

  <mail link="rane@××××××.pl">Łukasz Damentko</mail>

27

</author>

28

29

<abstract>

30

Awk is a very nice language with a very strange name. In this first article of a

31

three-part series, Daniel Robbins will quickly get your awk programming skills

32

up to speed. As the series progresses, more advanced topics will be covered,

33

culminating with an advanced real-world awk application demo.

34

</abstract>

35

36

<!-- The original version of this article was published on IBM developerWorks,

37

and is property of Westtech Information Services. This document is an updated

38

version of the original article, and contains various improvements made by the

39

Gentoo Linux Documentation team -->

40

41

<version>1.0</version>

42

<date>2005-07-15</date>

43

44

<chapter>

45

<title>An intro to the great language with the strange name</title>

46

<section>

47

<title>In defense of awk</title>

48

<body>

49

50

<note>

51

The original version of this article was published on IBM developerWorks, and is

52

property of Westtech Information Services. This document is an updated version

53

of the original article, and contains various improvements made by the Gentoo

54

Linux Documentation team.

55

</note>

56

57

<p>

58

In this series of articles, I'm going to turn you into a proficient awk coder.
I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the
GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with
the language may hear "awk" and think of a mess of code so backwards and
antiquated that it's capable of driving even the most knowledgeable UNIX guru to
the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for
the coffee machine).

65

</p>

66

67

<p>

68

Sure, awk doesn't have a great name. But it is a great language. Awk is geared
toward text processing and report generation, yet offers many well-designed
features that allow for serious programming. And, unlike some languages, awk's
syntax is familiar, and borrows some of the best parts of languages like C,
Python, and bash (although, technically, awk was created before both Python and
bash). Awk is one of those languages that, once learned, will become a key part
of your strategic coding arsenal.

75

</p>

76

77

</body>

78

</section>

79

<section>

80

<title>The first awk</title>

81

<body>

82

83

<p>

84

Run <c>awk '{ print }' /etc/passwd</c> and you should see the contents of your
<path>/etc/passwd</path> file appear before your eyes. Now, for an explanation
of what awk did. When we called awk, we specified <path>/etc/passwd</path> as
our input file. When we executed awk, it evaluated the print command for each
line in <path>/etc/passwd</path>, in order. All output is sent to stdout, and we
get a result identical to catting <path>/etc/passwd</path>.

90

</p>

91

92

<p>

93

Now, for an explanation of the { print } code block. In awk, curly braces are

94

used to group blocks of code together, similar to C. Inside our block of code,

95

we have a single print command. In awk, when a print command appears by itself,

96

the full contents of the current line are printed.

97

</p>

98

99

<pre caption="Printing the current line">
$ <i>awk '{ print $0 }' /etc/passwd</i>
<comment>(By contrast, this prints one blank line for each input line)</comment>
$ <i>awk '{ print "" }' /etc/passwd</i>
</pre>

103

104

<p>

105

In awk, the $0 variable represents the entire current line, so print and print

106

$0 do exactly the same thing.

107

</p>
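
<p>
As a quick check of that claim, here is a minimal sketch that runs both forms
over the same input and compares the results. The two sample lines are made up
for illustration (they are not taken from a real <path>/etc/passwd</path>), and
the variable names are hypothetical.
</p>

<pre caption="Comparing print and print $0 (illustrative sketch)">
```shell
# Two made-up passwd-style lines; not taken from a real system.
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/bin/false'

# Run the same input through both forms of the print command.
plain=$(printf '%s\n' "$sample" | awk '{ print }')
dollar0=$(printf '%s\n' "$sample" | awk '{ print $0 }')

# Both variables should now hold an exact copy of the input.
if [ "$plain" = "$dollar0" ]; then
    echo "print and print \$0 produced identical output"
fi
```
</pre>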

108

109

<pre caption="Filling the screen with some text">

110

$ <i>awk '{ print "hiya" }' /etc/passwd</i>

111

</pre>

112

113

</body>

114

</section>

115

<section>

116

<title>Multiple fields</title>

117

<body>

118

119

<pre caption="print $1 $3">
$ <i>awk -F":" '{ print $1 $3 }' /etc/passwd</i>
halt7
operator11
root0
shutdown6
sync5
bin1
<comment>....etc.</comment>
</pre>

129

130

<pre caption='print $1 " " $3'>
$ <i>awk -F":" '{ print $1 " " $3 }' /etc/passwd</i>
</pre>

133

134

<pre caption="$1$3">
$ <i>awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</i>
username: halt          uid:7
username: operator      uid:11
username: root          uid:0
username: shutdown      uid:6
username: sync          uid:5
username: bin           uid:1
<comment>....etc.</comment>
</pre>

144

145

</body>

146

</section>

147

<section>

148

<title>External scripts</title>

149

<body>

150

151

<pre caption="Sample script">

152

BEGIN { FS=":" }

153

{ print $1 }

154

</pre>

155

156

<p>

157

The difference between these two methods has to do with how we set the field

158

separator. In this script, the field separator is specified within the code

159

itself (by setting the FS variable), while our previous example set FS by

160

passing the -F":" option to awk on the command line. It's generally best to set

161

the field separator inside the script itself, simply because it means you have

162

one less command line argument to remember to type. We'll cover the FS variable

163

in more detail later in this article.

164

</p>
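
<p>
To make the comparison concrete, the sketch below sets the field separator both
ways and checks that the results match. The input lines and the variable names
<c>with_flag</c> and <c>with_begin</c> are made up for this illustration.
</p>

<pre caption="Setting FS two ways (illustrative sketch)">
```shell
# Made-up passwd-style input for the demonstration.
sample='halt:x:7:0:halt:/sbin:/sbin/halt
operator:x:11:0:operator:/root:/bin/bash'

# Field separator given on the command line with -F.
with_flag=$(printf '%s\n' "$sample" | awk -F":" '{ print $1 }')

# Field separator set inside the program, in a BEGIN block.
with_begin=$(printf '%s\n' "$sample" | awk 'BEGIN { FS=":" } { print $1 }')

echo "$with_flag"
if [ "$with_flag" = "$with_begin" ]; then
    echo "both methods split the fields the same way"
fi
```
</pre>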

165

166

</body>

167

</section>

168

<section>

169

<title>The BEGIN and END blocks</title>

170

<body>

171

172

<p>

173

Normally, awk executes each block of your script's code once for each input

174

line. However, there are many programming situations where you may need to

175

execute initialization code before awk begins processing the text from the input

176

file. For such situations, awk allows you to define a BEGIN block. We used a

177

BEGIN block in the previous example. Because the BEGIN block is evaluated before

178

awk starts processing the input file, it's an excellent place to initialize the

179

FS (field separator) variable, print a heading, or initialize other global

180

variables that you'll reference later in the program.

181

</p>

182

183

<p>

184

Awk also provides another special block, called the END block. Awk executes this

185

block after all lines in the input file have been processed. Typically, the END

186

block is used to perform final calculations or print summaries that should

187

appear at the end of the output stream.

188

</p>
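
<p>
Here is a small sketch that uses all three kinds of block together: BEGIN sets
FS and prints a heading, the main block runs once per line, and END prints a
summary. The input data, the <c>count</c> variable, and the report wording are
assumptions chosen for the example.
</p>

<pre caption="BEGIN, main, and END blocks together (illustrative sketch)">
```shell
# Made-up passwd-style input for the demonstration.
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/bin/false
sync:x:5:0:sync:/sbin:/bin/sync'

report=$(printf '%s\n' "$sample" | awk '
BEGIN { FS=":"; print "user accounts:" }   # runs once, before any input is read
      { print "  " $1; count++ }           # runs once for every input line
END   { print count " accounts total" }    # runs once, after the last line
')

echo "$report"
```
</pre>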

189

190

</body>

191

</section>

192

<section>

193

<title>Regular expressions and blocks</title>

194

<body>

195

196

<pre caption="Regular expressions and blocks">

197

/foo/ { print }

198

/[0-9]+\.[0-9]*/ { print }

199

</pre>
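
<p>
The following sketch tries both patterns on a few made-up lines, so you can see
which lines each regular expression selects. The sample text and variable names
are hypothetical.
</p>

<pre caption="Trying the two patterns (illustrative sketch)">
```shell
# Four made-up input lines.
lines='foo bar
no match here
price 3.14
version 7'

# Selects only lines that contain the text "foo".
foo_lines=$(printf '%s\n' "$lines" | awk '/foo/ { print }')

# Selects only lines with one or more digits followed by a literal dot.
number_lines=$(printf '%s\n' "$lines" | awk '/[0-9]+\.[0-9]*/ { print }')

echo "$foo_lines"
echo "$number_lines"
```
</pre>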

200

201

</body>

202

</section>

203

<section>

204

<title>Expressions and blocks</title>

205

<body>

206

207

<pre caption="fredprint">

208

$1 == "fred" { print $3 }

209

</pre>

210

211

<pre caption="root">

212

$5 ~ /root/ { print $3 }

213

</pre>
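
<p>
As a rough illustration, the sketch below applies both conditions to three
made-up passwd-style lines, with FS set to ":" via -F. The account data and the
variable names are assumptions for the example only.
</p>

<pre caption="Conditional blocks on sample data (illustrative sketch)">
```shell
# Made-up passwd-style input; field 5 is the comment (GECOS) field.
sample='fred:x:1001:1001:Fred Example:/home/fred:/bin/bash
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/bin/false'

# Print the uid only when the first field is exactly "fred".
fred_uid=$(printf '%s\n' "$sample" | awk -F":" '$1 == "fred" { print $3 }')

# Print the uid only when the fifth field matches the regular expression root.
root_uid=$(printf '%s\n' "$sample" | awk -F":" '$5 ~ /root/ { print $3 }')

echo "$fred_uid"
echo "$root_uid"
```
</pre>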

1.1                  xml/htdocs/doc/en/articles/l-awk2.xml

218

219

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo

220

plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

221

222

Index: l-awk2.xml

223

===================================================================

224

<?xml version='1.0' encoding="UTF-8"?>

225

<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk2.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->

226

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

227

228

<guide link="/doc/en/articles/l-awk2.xml">

229

<title>Awk by example, Part 2</title>

230

231

<author title="Author">

232

  <mail link="drobbins@g.o">Daniel Robbins</mail>

233

</author>

234

<author title="Editor">

235

  <mail link="rane@××××××.pl">Łukasz Damentko</mail>

236

</author>

237

238

<abstract>

239

In this sequel to his previous intro to awk, Daniel Robbins continues to explore

240

awk, a great language with a strange name. Daniel will show you how to handle

241

multi-line records, use looping constructs, and create and use awk arrays. By

242

the end of this article, you'll be well versed in a wide range of awk features,

243

and you'll be ready to write your own powerful awk scripts.

244

</abstract>

245

246

<!-- The original version of this article was published on IBM developerWorks,

247

and is property of Westtech Information Services. This document is an updated

248

version of the original article, and contains various improvements made by the

249

Gentoo Linux Documentation team -->

250

251

<version>1.0</version>

252

<date>2005-07-27</date>

253

254

<chapter>

255

<title>Records, loops, and arrays</title>

256

<section>

257

<title>Multi-line records</title>

258

<body>

259

260

<note>

261

The original version of this article was published on IBM developerWorks, and is

262

property of Westtech Information Services. This document is an updated version

263

of the original article, and contains various improvements made by the Gentoo

264

Linux Documentation team.

265

</note>

266

267

<p>

268

Awk is an excellent tool for reading in and processing structured data, such as

269

the system's <path>/etc/passwd</path> file. <path>/etc/passwd</path> is the UNIX

270

user database, and is a colon-delimited text file, containing a lot of important

271

information, including all existing user accounts and user IDs, among other

272

things. In <uri link="/doc/en/articles/l-awk1.xml">my previous article</uri>, I

273

showed you how awk could easily parse this file. All we had to do was to set the

274

FS (field separator) variable to ":".

275

</p>

276

277

<p>

278

By setting the FS variable correctly, awk can be configured to parse almost any

279

kind of structured data, as long as there is one record per line. However, just

280

setting FS won't do us any good if we want to parse a record that exists over

281

multiple lines. In these situations, we also need to modify the RS record

282

separator variable. The RS variable tells awk when the current record ends and a

283

new record begins.

284

</p>

285

286

<p>

287

As an example, let's look at how we'd handle the task of processing an address

288

list of Federal Witness Protection Program participants:

289

</p>

290

291

<pre caption="Sample entry from Federal Witness Protection Program participants list">
Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890
</pre>

299

300

<p>

301

Ideally, we'd like awk to recognize each 3-line address as an individual record,

302

rather than as three separate records. It would make our code a lot simpler if

303

awk would recognize the first line of the address as the first field ($1), the

304

street address as the second field ($2), and the city, state, and zip code as

305

field $3. The following code will do just what we want:

306

</p>

307

308

<pre caption="Making one field from the address">

309

BEGIN {

310

    FS="\n"

311

    RS=""

312

}

313

</pre>

314

315

<p>

316

Above, setting FS to "\n" tells awk that each field appears on its own line. By

317

setting RS to "", we also tell awk that each address record is separated by a

318

blank line. Once awk knows how the input is formatted, it can do all the parsing

319

work for us, and the rest of the script is simple. Let's look at a complete

320

script that will parse this address list and print out each address record on a

321

single line, separating each field with a comma.

322

</p>

323

324

<pre caption="Complete script">

325

BEGIN {

326

    FS="\n"

327

    RS=""

328

}

329

{ print $1 ", " $2 ", " $3 }

330

</pre>

331

332

333

<p>

334

If this script is saved as <path>address.awk</path>, and the address data is

335

stored in a file called <path>address.txt</path>, you can execute this script by

336

typing <c>awk -f address.awk address.txt</c>. This code produces the following

337

output:

338

</p>

339

340

<pre caption="The script's output">

341

Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345

342

Big Tony, 200 Incognito Ave., Suburbia, WA 67890

343

</pre>

344

345

</body>

346

</section>

347

<section>

348

<title>OFS and ORS</title>

349

<body>

350

351

<p>

352

In address.awk's print statement, you can see that awk concatenates (joins)

353

strings that are placed next to each other on a line. We used this feature to

354

insert a comma and a space (", ") between the three address fields that appeared

355

on the line. While this method works, it's a bit ugly looking. Rather than

356

inserting literal ", " strings between our fields, we can have awk do it for us

357

by setting a special awk variable called OFS. Take a look at this code snippet.

358

</p>

359

360

<pre caption="Sample code snippet">

361

print "Hello", "there", "Jim!"

362

</pre>

363

364

<p>

365

The commas on this line are not part of the actual literal strings. Instead,

366

they tell awk that "Hello", "there", and "Jim!" are separate fields, and that

367

the OFS variable should be printed between each string. By default, awk produces

368

the following output:

369

</p>

370

371

<pre caption="Output produced by awk">

372

Hello there Jim!

373

</pre>

374

375

<p>

376

This shows us that by default, OFS is set to " ", a single space. However, we

377

can easily redefine OFS so that awk will insert our favorite field separator.

378

Here's a revised version of our original <path>address.awk</path> program that

379

uses OFS to output those intermediate ", " strings:

380

</p>

381

382

<pre caption="Redefining OFS">

383

BEGIN {

384

    FS="\n"

385

    RS=""

386

    OFS=", "

387

}

388

{ print $1, $2, $3 }

389

</pre>

390

391

<p>

392

Awk also has a special variable called ORS, the "output record separator". By
setting ORS, which defaults to a newline ("\n"), we can control the character
that's automatically printed at the end of a print statement. The default ORS
value causes awk to output each new print statement on a new line. If we wanted
to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted
records to be separated by a single space (and no newline), we would set ORS to
" ".

399

</p>
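
<p>
Here is a minimal sketch that combines OFS and ORS on the address list,
producing comma-separated, double-spaced output. It assumes the records in the
input are separated by a blank line; the shell variable names are hypothetical.
</p>

<pre caption="OFS and ORS together (illustrative sketch)">
```shell
# The address list, with a blank line between the two records.
addresses='Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890'

# OFS goes between the fields of each print; ORS goes after each print.
out=$(printf '%s\n' "$addresses" | awk '
BEGIN { FS="\n"; RS=""; OFS=", "; ORS="\n\n" }
{ print $1, $2, $3 }
')

echo "$out"
```
</pre>

<p>
Note that the shell's command substitution strips the trailing newlines, so the
final ORS is only visible if you run awk directly at the prompt.
</p>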

400

401

</body>

402

</section>

403

<section>

404

<title>Multi-line to tabbed</title>

405

<body>

406

407

<p>

408

Let's say that we wrote a script that converted our address list to a

409

single-line per record, tab-delimited format for import into a spreadsheet.

410

After using a slightly modified version of <path>address.awk</path>, it would

411

become clear that our program only works for three-line addresses. If awk

412

encountered the following address, the fourth line would be thrown away and not

413

printed:

414

</p>

415

416

<pre caption="Sample entry">

417

Cousin Vinnie

418

Vinnie's Auto Shop

419

300 City Alley

420

Sosueme, OR 76543

421

</pre>
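
<p>
To see the problem in action, the sketch below feeds this four-line entry to a
tab-delimited variant of the script that prints only $1, $2, and $3. The exact
form of the "slightly modified" script is an assumption here. The built-in NF
variable shows that awk did read four fields, even though only three are
printed.
</p>

<pre caption="The fourth line gets dropped (illustrative sketch)">
```shell
# The four-line sample entry from above.
entry="Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543"

# Assumed tab-delimited variant of address.awk: prints three fields only.
three=$(printf '%s\n' "$entry" | awk '
BEGIN { FS="\n"; RS=""; OFS="\t" }
{ print $1, $2, $3 }
')

# NF holds the number of fields awk found in the record.
nf=$(printf '%s\n' "$entry" | awk 'BEGIN { FS="\n"; RS="" } { print NF }')

echo "$three"
echo "fields in record: $nf"
```
</pre>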

1.1                  xml/htdocs/doc/en/articles/l-awk3.xml

427

428

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo

429

plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

430

431

Index: l-awk3.xml

432

===================================================================

433

<?xml version='1.0' encoding="UTF-8"?>

434

<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk3.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->

435

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

436

437

<guide link="/doc/en/articles/l-awk3.xml">

438

<title>Awk by example, Part 3</title>

439

440

<author title="Author">

441

  <mail link="drobbins@g.o">Daniel Robbins</mail>

442

</author>

443

<author title="Editor">

444

  <mail link="rane@××××××.pl">Łukasz Damentko</mail>

445

</author>

446

447

<abstract>

448

In this sequel to his previous intro to awk, Daniel Robbins continues to explore

449

awk, a great language with a strange name. Daniel will show you how to handle

450

multi-line records, use looping constructs, and create and use awk arrays. By

451

the end of this article, you'll be well versed in a wide range of awk features,

452

and you'll be ready to write your own powerful awk scripts.

453

</abstract>

454

455

<!-- The original version of this article was published on IBM developerWorks,

456

and is property of Westtech Information Services. This document is an updated

457

version of the original article, and contains various improvements made by the

458

Gentoo Linux Documentation team -->

459

460

<version>1.0</version>

461

<date>2005-07-27</date>

462

463

<chapter>

464

<title>String functions and ... checkbooks?</title>

465

<section>

466

<title>Formatting output</title>

467

<body>

468

469

<p>

470

While awk's print statement does do the job most of the time, sometimes more is

471

needed. For those times, awk offers two good old friends called printf() and

472

sprintf(). Yes, these functions, like so many other awk parts, are identical to

473

their C counterparts. printf() will print a formatted string to stdout, while

474

sprintf() returns a formatted string that can be assigned to a variable. If

475

you're not familiar with printf() and sprintf(), an introductory C text will

476

quickly get you up to speed on these two essential printing functions. You can

477

view the printf() man page by typing "man 3 printf" on your Linux system.

478

</p>

479

480

<p>

481

Here's some sample awk sprintf() and printf() code. As you can see, everything

482

looks almost identical to C.

483

</p>

484

485

<pre caption="Sample awk sprintf() and printf() code">
x=1
b="foo"
printf("%s got a %d on the last test\n","Jim",83)
myout=sprintf("%s-%d",b,x)
print myout
</pre>

492

493

<p>

494

This code will print:

495

</p>

496

497

<pre caption="Code output">

498

Jim got a 83 on the last test

499

foo-1

500

</pre>
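
<p>
If you'd like to try this yourself, one way is to wrap the statements in a
BEGIN block so that awk needs no input file. This is only a sketch; the shell
variable <c>out</c> is a hypothetical name used to capture the result.
</p>

<pre caption="Running the printf() and sprintf() sample (illustrative sketch)">
```shell
# A BEGIN-only program runs without reading any input.
out=$(awk 'BEGIN {
    x = 1
    b = "foo"
    printf("%s got a %d on the last test\n", "Jim", 83)   # prints directly
    myout = sprintf("%s-%d", b, x)                        # returns a string
    print myout
}')

echo "$out"
```
</pre>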

501

502

</body>

503

</section>

504

<section>

505

<title>String functions</title>

506

<body>

507

508

<p>

509

Awk has a plethora of string functions, and that's a good thing. In awk, you

510

really need string functions, since you can't treat a string as an array of

511

characters as you can in other languages like C, C++, and Python. For example,

512

if you execute the following code:

513

</p>

514

515

<pre caption="Example code">

516

mystring="How are you doing today?"

517

print mystring[3]

518

</pre>

519

520

<p>

521

You'll receive an error that looks something like this:

522

</p>

523

524

<pre caption="Example code error">

525

awk: string.gawk:59: fatal: attempt to use scalar as array

526

</pre>

527

528

<p>

529

Oh, well. While not as convenient as Python's sequence types, awk's string

530

functions get the job done. Let's take a look at them.

531

</p>

532

533

<p>

534

First, we have the basic length() function, which returns the length of a

535

string. Here's how to use it:

536

</p>

537

538

<pre caption="length() function example">

539

print length(mystring)

540

</pre>

541

542

<p>

543

This code will print the value:

544

</p>

545

546

<pre caption="Printed value">

547

24

548

</pre>

549

550

<p>

551

OK, let's keep going. The next string function is called index(), and will
return the position of the first occurrence of a substring in another string, or
it will return 0 if the substring isn't found. Using mystring, we can call it
this way:

554

</p>

555

556

<pre caption="index() function example">

557

print index(mystring,"you")

558

</pre>

559

560

<p>

561

Awk prints:

562

</p>

563

564

<pre caption="Function output">

565

9

566

</pre>

567

568

<p>

569

We move on to two more easy functions, tolower() and toupper(). As you might

570

guess, these functions will return the string with all characters converted to

571

lowercase or uppercase respectively. Notice that tolower() and toupper() return

572

the new string, and don't modify the original. This code:

573

</p>

574

575

<pre caption="Converting strings to lower or uppercase">

576

print tolower(mystring)

577

print toupper(mystring)

578

print mystring

579

</pre>

580

581

<p>

582

....will produce this output:

583

</p>

584

585

<pre caption="Output">

586

how are you doing today?

587

HOW ARE YOU DOING TODAY?

588

How are you doing today?

589

</pre>

590

591

<p>

592

So far so good, but how exactly do we select a substring or even a single

593

character from a string? That's where substr() comes in. Here's how to call

594

substr():

595

</p>

596

597

<pre caption="substr() function example">

598

mysub=substr(mystring,startpos,maxlen)

599

</pre>

600

601

<p>

602

mystring should be either a string variable or a literal string from which you'd

603

like to extract a substring. startpos should be set to the starting character

604

position, and maxlen should contain the maximum length of the string you'd like

605

to extract. Notice that I said maximum length; if length(mystring) is shorter

606

than startpos+maxlen, your result will be truncated. substr() won't modify the

607

original string, but returns the substring instead. Here's an example:

608

</p>

609

610

<pre caption="Another example">

611

print substr(mystring,9,3)

612

</pre>

613

614

<p>

615

Awk will print:

616

</p>

617

618

<pre caption="What awk prints">

619

you

620

</pre>

621

622

<p>

623

If you regularly program in a language that uses array indices to access parts

624

of a string (and who doesn't), make a mental note that substr() is your awk

625

substitute. You'll need to use it to extract single characters and substrings;

626

because awk is a string-based language, you'll be using it often.

627

</p>
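
<p>
The sketch below pulls the functions covered so far into one runnable program.
The last call asks substr() for up to 100 characters starting at position 21,
to show that the result is simply cut off at the end of the string. The shell
variable <c>out</c> is a hypothetical name.
</p>

<pre caption="length(), index(), toupper(), and substr() together (illustrative sketch)">
```shell
out=$(awk 'BEGIN {
    mystring = "How are you doing today?"
    print length(mystring)                  # number of characters
    print index(mystring, "you")            # position of the first "you"
    print toupper(substr(mystring, 9, 3))   # three characters from position 9
    print substr(mystring, 21, 100)         # maxlen runs past the end of the string
}')

echo "$out"
```
</pre>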

628

629

<p>

630

Now, we move on to some meatier functions, the first of which is called match().

631

match() is a lot like index(), except instead of searching for a substring like

--

636

gentoo-doc-cvs@g.o mailing list
