Gentoo Archives: gentoo-doc-cvs

From: Xavier Neys <neysx@×××××××××××.org>
To: gentoo-doc-cvs@l.g.o
Subject: [gentoo-doc-cvs] cvs commit: l-awk1.xml
Date: Thu, 28 Jul 2005 08:04:32
Message-Id: 200507280803.j6S83kpL027109@robin.gentoo.org
1 neysx 05/07/28 08:04:04
2
3 Added: xml/htdocs/doc/en/articles l-awk1.xml l-awk2.xml l-awk3.xml
4 Log:
5 #99260 xmlified awk articles
6
7 Revision Changes Path
8 1.1 xml/htdocs/doc/en/articles/l-awk1.xml
9
10 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
11 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
12
13 Index: l-awk1.xml
14 ===================================================================
15 <?xml version='1.0' encoding="UTF-8"?>
16 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk1.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->
17 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
18
19 <guide link="/doc/en/articles/l-awk1.xml">
20 <title>Awk by example, Part 1</title>
21
22 <author title="Author">
23 <mail link="drobbins@g.o">Daniel Robbins</mail>
24 </author>
25 <author title="Editor">
26 <mail link="rane@××××××.pl">Łukasz Damentko</mail>
27 </author>
28
29 <abstract>
30 Awk is a very nice language with a very strange name. In this first article of a
31 three-part series, Daniel Robbins will quickly get your awk programming skills
32 up to speed. As the series progresses, more advanced topics will be covered,
33 culminating with an advanced real-world awk application demo.
34 </abstract>
35
36 <!-- The original version of this article was published on IBM developerWorks,
37 and is property of Westtech Information Services. This document is an updated
38 version of the original article, and contains various improvements made by the
39 Gentoo Linux Documentation team -->
40
41 <version>1.0</version>
42 <date>2005-07-15</date>
43
44 <chapter>
45 <title>An intro to the great language with the strange name</title>
46 <section>
47 <title>In defense of awk</title>
48 <body>
49
50 <note>
51 The original version of this article was published on IBM developerWorks, and is
52 property of Westtech Information Services. This document is an updated version
53 of the original article, and contains various improvements made by the Gentoo
54 Linux Documentation team.
55 </note>
56
57 <p>
58 In this series of articles, I'm going to turn you into a proficient awk coder.
59 I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the
60 GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with
61 the language may hear "awk" and think of a mess of code so backwards and
62 antiquated that it's capable of driving even the most knowledgeable UNIX guru to
63 the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for
64 the coffee machine).
65 </p>
66
67 <p>
68 Sure, awk doesn't have a great name. But it is a great language. Awk is geared
69 toward text processing and report generation, yet offers many well-designed
70 features that allow for serious programming. And, unlike some languages, awk's
71 syntax is familiar, and borrows some of the best parts of languages like C,
72 Python, and bash (although, technically, awk was created before both Python and
73 bash). Awk is one of those languages that, once learned, will become a key part
74 of your strategic coding arsenal.
75 </p>
76
77 </body>
78 </section>
79 <section>
80 <title>The first awk</title>
81 <body>
82
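<p>
The command that this section walks through is not reproduced here. Judging from
the explanation below, it is presumably a bare print block run against
<path>/etc/passwd</path>; treat the following as a reconstruction rather than
the original listing:
</p>

<pre caption="A first awk command (reconstructed sketch)">
$ <i>awk '{ print }' /etc/passwd</i>
</pre>
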
83 <p>
84 You should see the contents of your <path>/etc/passwd</path> file appear before
85 your eyes. Now, for an explanation of what awk did. When we called awk, we
86 specified <path>/etc/passwd</path> as our input file. When we executed awk, it
87 evaluated the print command for each line in <path>/etc/passwd</path>, in
88 order. All output is sent to stdout, and we get a result identical to catting
89 <path>/etc/passwd</path>.
90 </p>
91
92 <p>
93 Now, for an explanation of the { print } code block. In awk, curly braces are
94 used to group blocks of code together, similar to C. Inside our block of code,
95 we have a single print command. In awk, when a print command appears by itself,
96 the full contents of the current line are printed.
97 </p>
98
99 <pre caption="Printing the current line">
100 $ <i>awk '{ print $0 }' /etc/passwd</i>
101 $ <i>awk '{ print "" }' /etc/passwd</i> <comment>(prints one blank line per input line)</comment>
102 </pre>
103
104 <p>
105 In awk, the $0 variable represents the entire current line, so print and print
106 $0 do exactly the same thing.
107 </p>
108
109 <pre caption="Filling the screen with some text">
110 $ <i>awk '{ print "hiya" }' /etc/passwd</i>
111 </pre>
112
113 </body>
114 </section>
115 <section>
116 <title>Multiple fields</title>
117 <body>
118
119 <pre caption="print $1 $3">
120 $ <i>awk -F":" '{ print $1 $3 }' /etc/passwd</i>
121 halt7
122 operator11
123 root0
124 shutdown6
125 sync5
126 bin1
127 <comment>....etc.</comment>
128 </pre>
129
130 <pre caption="print $1 and $3 separated by a space">
131 $ <i>awk -F":" '{ print $1 " " $3 }' /etc/passwd</i>
132 </pre>
133
134 <pre caption="Labelling the username and uid">
135 $ <i>awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</i>
136 username: halt uid:7
137 username: operator uid:11
138 username: root uid:0
139 username: shutdown uid:6
140 username: sync uid:5
141 username: bin uid:1
142 <comment>....etc.</comment>
143 </pre>
144
145 </body>
146 </section>
147 <section>
148 <title>External scripts</title>
149 <body>
150
151 <pre caption="Sample script">
152 BEGIN { FS=":" }
153 { print $1 }
154 </pre>
155
156 <p>
157 The difference between these two methods has to do with how we set the field
158 separator. In this script, the field separator is specified within the code
159 itself (by setting the FS variable), while our previous example set FS by
160 passing the -F":" option to awk on the command line. It's generally best to set
161 the field separator inside the script itself, simply because it means you have
162 one less command line argument to remember to type. We'll cover the FS variable
163 in more detail later in this article.
164 </p>
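
<p>
As a rough illustration of the two methods compared above, assume the sample
script has been saved under the hypothetical name <path>myscript.awk</path>.
The two commands below should then print the same list of usernames; the first
sets the separator with -F, the second relies on the FS assignment inside the
script:
</p>

<pre caption="Setting the field separator on the command line or in the script (sketch)">
$ <i>awk -F":" '{ print $1 }' /etc/passwd</i>
$ <i>awk -f myscript.awk /etc/passwd</i>
</pre>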
165
166 </body>
167 </section>
168 <section>
169 <title>The BEGIN and END blocks</title>
170 <body>
171
172 <p>
173 Normally, awk executes each block of your script's code once for each input
174 line. However, there are many programming situations where you may need to
175 execute initialization code before awk begins processing the text from the input
176 file. For such situations, awk allows you to define a BEGIN block. We used a
177 BEGIN block in the previous example. Because the BEGIN block is evaluated before
178 awk starts processing the input file, it's an excellent place to initialize the
179 FS (field separator) variable, print a heading, or initialize other global
180 variables that you'll reference later in the program.
181 </p>
182
183 <p>
184 Awk also provides another special block, called the END block. Awk executes this
185 block after all lines in the input file have been processed. Typically, the END
186 block is used to perform final calculations or print summaries that should
187 appear at the end of the output stream.
188 </p>
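
<p>
To make the two special blocks concrete, here is a small sketch that is not part
of the original article. It prints a heading from BEGIN, one username per input
line, and a total from END. The variable name count and the heading text are
made up for the illustration:
</p>

<pre caption="BEGIN and END blocks working together (sketch)">
BEGIN {
    FS=":"                          # runs once, before any input is read
    print "Accounts:"
}
{
    count++                         # runs once for every input line
    print $1
}
END {
    print count " accounts found"   # runs once, after the last line
}
</pre>

<p>
Saved under a name of your choosing and run against <path>/etc/passwd</path>
with <c>awk -f</c>, it should list every username between the heading and the
final count.
</p>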
189
190 </body>
191 </section>
192 <section>
193 <title>Regular expressions and blocks</title>
194 <body>
195
196 <pre caption="Regular expressions and blocks">
197 /foo/ { print }
198 /[0-9]+\.[0-9]*/ { print }
199 </pre>
200
201 </body>
202 </section>
203 <section>
204 <title>Expressions and blocks</title>
205 <body>
206
207 <pre caption="Printing the third field when the first field is fred">
208 $1 == "fred" { print $3 }
209 </pre>
210
211 <pre caption="Printing the third field when the fifth field contains root">
212 $5 ~ /root/ { print $3 }
213 </pre>
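
<p>
The two conditions above behave differently: == compares the whole field with
the string, while ~ tests whether the regular expression matches anywhere inside
the field. The sketch below combines them in one script. It assumes
colon-separated input such as <path>/etc/passwd</path>, and the account name
fred is only a placeholder:
</p>

<pre caption="Exact comparison versus regular expression match (sketch)">
BEGIN { FS=":" }
$1 == "fred" { print "field 1 is exactly fred, uid " $3 }
$5 ~ /root/  { print "field 5 contains root, uid " $3 }
</pre>

<p>
With these conditions, an account named freddy would not trigger the first
block, but a fifth field such as "the root operator" would trigger the second.
</p>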
214
215
216
217 1.1 xml/htdocs/doc/en/articles/l-awk2.xml
218
219 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
220 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
221
222 Index: l-awk2.xml
223 ===================================================================
224 <?xml version='1.0' encoding="UTF-8"?>
225 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk2.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->
226 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
227
228 <guide link="/doc/en/articles/l-awk2.xml">
229 <title>Awk by example, Part 2</title>
230
231 <author title="Author">
232 <mail link="drobbins@g.o">Daniel Robbins</mail>
233 </author>
234 <author title="Editor">
235 <mail link="rane@××××××.pl">Łukasz Damentko</mail>
236 </author>
237
238 <abstract>
239 In this sequel to his previous intro to awk, Daniel Robbins continues to explore
240 awk, a great language with a strange name. Daniel will show you how to handle
241 multi-line records, use looping constructs, and create and use awk arrays. By
242 the end of this article, you'll be well versed in a wide range of awk features,
243 and you'll be ready to write your own powerful awk scripts.
244 </abstract>
245
246 <!-- The original version of this article was published on IBM developerWorks,
247 and is property of Westtech Information Services. This document is an updated
248 version of the original article, and contains various improvements made by the
249 Gentoo Linux Documentation team -->
250
251 <version>1.0</version>
252 <date>2005-07-27</date>
253
254 <chapter>
255 <title>Records, loops, and arrays</title>
256 <section>
257 <title>Multi-line records</title>
258 <body>
259
260 <note>
261 The original version of this article was published on IBM developerWorks, and is
262 property of Westtech Information Services. This document is an updated version
263 of the original article, and contains various improvements made by the Gentoo
264 Linux Documentation team.
265 </note>
266
267 <p>
268 Awk is an excellent tool for reading in and processing structured data, such as
269 the system's <path>/etc/passwd</path> file. <path>/etc/passwd</path> is the UNIX
270 user database, and is a colon-delimited text file, containing a lot of important
271 information, including all existing user accounts and user IDs, among other
272 things. In <uri link="/doc/en/articles/l-awk1.xml">my previous article</uri>, I
273 showed you how awk could easily parse this file. All we had to do was to set the
274 FS (field separator) variable to ":".
275 </p>
276
277 <p>
278 By setting the FS variable correctly, awk can be configured to parse almost any
279 kind of structured data, as long as there is one record per line. However, just
280 setting FS won't do us any good if we want to parse a record that exists over
281 multiple lines. In these situations, we also need to modify the RS record
282 separator variable. The RS variable tells awk when the current record ends and a
283 new record begins.
284 </p>
285
286 <p>
287 As an example, let's look at how we'd handle the task of processing an address
288 list of Federal Witness Protection Program participants:
289 </p>
290
291 <pre caption="Sample entry from Federal Witness Protection Program participants list">
292 Jimmy the Weasel
293 100 Pleasant Drive
294 San Francisco, CA 12345

295 Big Tony
296 200 Incognito Ave.
297 Suburbia, WA 67890
298 </pre>
299
300 <p>
301 Ideally, we'd like awk to recognize each 3-line address as an individual record,
302 rather than as three separate records. It would make our code a lot simpler if
303 awk would recognize the first line of the address as the first field ($1), the
304 street address as the second field ($2), and the city, state, and zip code as
305 field $3. The following code will do just what we want:
306 </p>
307
308 <pre caption="Making one field from the address">
309 BEGIN {
310 FS="\n"
311 RS=""
312 }
313 </pre>
314
315 <p>
316 Above, setting FS to "\n" tells awk that each field appears on its own line. By
317 setting RS to "", we also tell awk that each address record is separated by a
318 blank line. Once awk knows how the input is formatted, it can do all the parsing
319 work for us, and the rest of the script is simple. Let's look at a complete
320 script that will parse this address list and print out each address record on a
321 single line, separating each field with a comma.
322 </p>
323
324 <pre caption="Complete script">
325 BEGIN {
326 FS="\n"
327 RS=""
328 }
329 { print $1 ", " $2 ", " $3 }
330 </pre>
331
332
333 <p>
334 If this script is saved as <path>address.awk</path>, and the address data is
335 stored in a file called <path>address.txt</path>, you can execute this script by
336 typing <c>awk -f address.awk address.txt</c>. This code produces the following
337 output:
338 </p>
339
340 <pre caption="The script's output">
341 Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
342 Big Tony, 200 Incognito Ave., Suburbia, WA 67890
343 </pre>
344
345 </body>
346 </section>
347 <section>
348 <title>OFS and ORS</title>
349 <body>
350
351 <p>
352 In address.awk's print statement, you can see that awk concatenates (joins)
353 strings that are placed next to each other on a line. We used this feature to
354 insert a comma and a space (", ") between the three address fields that appeared
355 on the line. While this method works, it's a bit ugly looking. Rather than
356 inserting literal ", " strings between our fields, we can have awk do it for us
357 by setting a special awk variable called OFS. Take a look at this code snippet.
358 </p>
359
360 <pre caption="Sample code snippet">
361 print "Hello", "there", "Jim!"
362 </pre>
363
364 <p>
365 The commas on this line are not part of the actual literal strings. Instead,
366 they tell awk that "Hello", "there", and "Jim!" are separate fields, and that
367 the OFS variable should be printed between each string. By default, awk produces
368 the following output:
369 </p>
370
371 <pre caption="Output produced by awk">
372 Hello there Jim!
373 </pre>
374
375 <p>
376 This shows us that by default, OFS is set to " ", a single space. However, we
377 can easily redefine OFS so that awk will insert our favorite field separator.
378 Here's a revised version of our original <path>address.awk</path> program that
379 uses OFS to output those intermediate ", " strings:
380 </p>
381
382 <pre caption="Redefining OFS">
383 BEGIN {
384 FS="\n"
385 RS=""
386 OFS=", "
387 }
388 { print $1, $2, $3 }
389 </pre>
390
391 <p>
392 Awk also has a special variable named ORS, the "output record
393 separator". By setting ORS, which defaults to a newline ("\n"), we can control
394 the character that's automatically printed at the end of a print statement. The
395 default ORS value causes awk to output each new print statement on a new line.
396 If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or,
397 if we wanted records to be separated by a single space (and no newline), we
398 would set ORS to " ".
399 </p>
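
<p>
As a sketch of the double-spaced case just described, the version of
<path>address.awk</path> below adds a single ORS assignment to the OFS example.
Nothing else changes:
</p>

<pre caption="Double-spacing the output with ORS (sketch)">
BEGIN {
    FS="\n"
    RS=""
    OFS=", "
    ORS="\n\n"    # one newline ends the record, the second leaves an empty line
}
{ print $1, $2, $3 }
</pre>

<p>
Each address should now be followed by an empty line in the output.
</p>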
400
401 </body>
402 </section>
403 <section>
404 <title>Multi-line to tabbed</title>
405 <body>
406
407 <p>
408 Let's say that we wrote a script that converted our address list to a
409 single-line per record, tab-delimited format for import into a spreadsheet.
410 After using a slightly modified version of <path>address.awk</path>, it would
411 become clear that our program only works for three-line addresses. If awk
412 encountered the following address, the fourth line would be thrown away and not
413 printed:
414 </p>
415
416 <pre caption="Sample entry">
417 Cousin Vinnie
418 Vinnie's Auto Shop
419 300 City Alley
420 Sosueme, OR 76543
421 </pre>
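
<p>
One way to keep the fourth line, sketched here without claiming it is the
approach the article goes on to use, is to let awk rebuild the record itself.
Assigning to any field, even with its own value, makes awk reassemble $0 with
OFS between all of the fields, however many there are:
</p>

<pre caption="Tab-delimited output for any number of address lines (sketch)">
BEGIN {
    FS="\n"
    RS=""
    OFS="\t"
}
{
    $1 = $1    # touching a field forces awk to rebuild $0 using OFS
    print
}
</pre>

<p>
Because print is called without arguments, it prints the rebuilt $0, so a
four-line address comes out as four tab-separated columns and a three-line
address as three.
</p>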
422
423
424
425
426 1.1 xml/htdocs/doc/en/articles/l-awk3.xml
427
428 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
429 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
430
431 Index: l-awk3.xml
432 ===================================================================
433 <?xml version='1.0' encoding="UTF-8"?>
434 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk3.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->
435 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
436
437 <guide link="/doc/en/articles/l-awk3.xml">
438 <title>Awk by example, Part 3</title>
439
440 <author title="Author">
441 <mail link="drobbins@g.o">Daniel Robbins</mail>
442 </author>
443 <author title="Editor">
444 <mail link="rane@××××××.pl">Łukasz Damentko</mail>
445 </author>
446
447 <abstract>
448 In this conclusion to the three-part awk series, Daniel Robbins turns to
449 formatted output and string handling. Daniel will show you how to use printf()
450 and sprintf(), and then walk you through awk's string functions, such as
451 length(), index(), tolower(), toupper(), substr(), and match().
453 </abstract>
454
455 <!-- The original version of this article was published on IBM developerWorks,
456 and is property of Westtech Information Services. This document is an updated
457 version of the original article, and contains various improvements made by the
458 Gentoo Linux Documentation team -->
459
460 <version>1.0</version>
461 <date>2005-07-27</date>
462
463 <chapter>
464 <title>String functions and ... checkbooks?</title>
465 <section>
466 <title>Formatting output</title>
467 <body>
468
469 <p>
470 While awk's print statement does do the job most of the time, sometimes more is
471 needed. For those times, awk offers two good old friends called printf() and
472 sprintf(). Yes, these functions, like so many other awk parts, are identical to
473 their C counterparts. printf() will print a formatted string to stdout, while
474 sprintf() returns a formatted string that can be assigned to a variable. If
475 you're not familiar with printf() and sprintf(), an introductory C text will
476 quickly get you up to speed on these two essential printing functions. You can
477 view the printf() man page by typing "man 3 printf" on your Linux system.
478 </p>
479
480 <p>
481 Here's some sample awk sprintf() and printf() code. As you can see, everything
482 looks almost identical to C.
483 </p>
484
485 <pre caption="Sample awk sprintf() and printf() code">
486 x=1
487 b="foo"
488 printf("%s got a %d on the last test\n","Jim",83)
489 myout=sprintf("%s-%d",b,x)
490 print myout
491 </pre>
492
493 <p>
494 This code will print:
495 </p>
496
497 <pre caption="Code output">
498 Jim got a 83 on the last test
499 foo-1
500 </pre>
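
<p>
Where printf() really pays off is in lining up columns. The sketch below is not
from the original article. It assumes colon-separated input such as
<path>/etc/passwd</path> and uses the C-style width and "-" (left-justify)
modifiers to pad the username to 12 characters and the uid to 5:
</p>

<pre caption="Aligning columns with printf() (sketch)">
BEGIN {
    FS=":"
    printf("%-12s %5s\n", "username", "uid")    # heading, same widths as the data
}
{ printf("%-12s %5d\n", $1, $3) }
</pre>

<p>
Unlike print, printf() does not append ORS, which is why each format string ends
with an explicit "\n".
</p>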
501
502 </body>
503 </section>
504 <section>
505 <title>String functions</title>
506 <body>
507
508 <p>
509 Awk has a plethora of string functions, and that's a good thing. In awk, you
510 really need string functions, since you can't treat a string as an array of
511 characters as you can in other languages like C, C++, and Python. For example,
512 if you execute the following code:
513 </p>
514
515 <pre caption="Example code">
516 mystring="How are you doing today?"
517 print mystring[3]
518 </pre>
519
520 <p>
521 You'll receive an error that looks something like this:
522 </p>
523
524 <pre caption="Example code error">
525 awk: string.gawk:59: fatal: attempt to use scalar as array
526 </pre>
527
528 <p>
529 Oh, well. While not as convenient as Python's sequence types, awk's string
530 functions get the job done. Let's take a look at them.
531 </p>
532
533 <p>
534 First, we have the basic length() function, which returns the length of a
535 string. Here's how to use it:
536 </p>
537
538 <pre caption="length() function example">
539 print length(mystring)
540 </pre>
541
542 <p>
543 This code will print the value:
544 </p>
545
546 <pre caption="Printed value">
547 24
548 </pre>
549
550 <p>
551 OK, let's keep going. The next string function is called index(), and will return
552 the position of the first occurrence of a substring in another string, or it will
553 return 0 if the substring isn't found. Using mystring, we can call it this way:
554 </p>
555
556 <pre caption="index() function example">
557 print index(mystring,"you")
558 </pre>
559
560 <p>
561 Awk prints:
562 </p>
563
564 <pre caption="Function output">
565 9
566 </pre>
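
<p>
The "returns 0" case is worth seeing as well, because it makes index() usable as
a simple "does it contain" test. The sketch below wraps everything in a BEGIN
block so that it runs without an input file. The search string "zebra" is an
arbitrary example:
</p>

<pre caption="index() with a substring that is not found (sketch)">
BEGIN {
    mystring="How are you doing today?"
    print index(mystring,"you")      # prints 9
    print index(mystring,"zebra")    # prints 0, since "zebra" never occurs
    if (index(mystring,"today"))     # any non-zero position counts as true
        print "found it"
}
</pre>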
567
568 <p>
569 We move on to two more easy functions, tolower() and toupper(). As you might
570 guess, these functions will return the string with all characters converted to
571 lowercase or uppercase respectively. Notice that tolower() and toupper() return
572 the new string, and don't modify the original. This code:
573 </p>
574
575 <pre caption="Converting strings to lower or uppercase">
576 print tolower(mystring)
577 print toupper(mystring)
578 print mystring
579 </pre>
580
581 <p>
582 ....will produce this output:
583 </p>
584
585 <pre caption="Output">
586 how are you doing today?
587 HOW ARE YOU DOING TODAY?
588 How are you doing today?
589 </pre>
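
<p>
Because these functions return a new string, they can be used directly inside a
condition. As a sketch, assuming colon-separated input such as
<path>/etc/passwd</path> and using fred purely as a placeholder name, the
following matches the first field regardless of case and leaves the field itself
untouched:
</p>

<pre caption="Case-insensitive comparison with tolower() (sketch)">
BEGIN { FS=":" }
tolower($1) == "fred" { print $1, $3 }
</pre>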
590
591 <p>
592 So far so good, but how exactly do we select a substring or even a single
593 character from a string? That's where substr() comes in. Here's how to call
594 substr():
595 </p>
596
597 <pre caption="substr() function example">
598 mysub=substr(mystring,startpos,maxlen)
599 </pre>
600
601 <p>
602 mystring should be either a string variable or a literal string from which you'd
603 like to extract a substring. startpos should be set to the starting character
604 position, and maxlen should contain the maximum length of the string you'd like
605 to extract. Notice that I said maximum length; if fewer than maxlen characters
606 remain in mystring from startpos onward, your result will simply be shorter.
607 substr() won't modify the original string, but returns the substring instead. Here's an example:
608 </p>
609
610 <pre caption="Another example">
611 print substr(mystring,9,3)
612 </pre>
613
614 <p>
615 Awk will print:
616 </p>
617
618 <pre caption="What awk prints">
619 you
620 </pre>
621
622 <p>
623 If you regularly program in a language that uses array indices to access parts
624 of a string (and who doesn't?), make a mental note that substr() is your awk
625 substitute. You'll need to use it to extract single characters and substrings;
626 because awk is a string-based language, you'll be using it often.
627 </p>
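
<p>
Two short cases round this out, again as a sketch wrapped in a BEGIN block so
that no input file is needed. The first pulls out a single character, which is
what the failed mystring[3] attempt was after (keeping in mind that awk counts
positions from 1). The second asks for more characters than remain, showing that
maxlen really is only a maximum:
</p>

<pre caption="Single characters and short results with substr() (sketch)">
BEGIN {
    mystring="How are you doing today?"
    print substr(mystring,3,1)      # the third character: w
    print substr(mystring,19,10)    # only 6 characters remain: today?
}
</pre>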
628
629 <p>
630 Now, we move on to some meatier functions, the first of which is called match().
631 match() is a lot like index(), except instead of searching for a substring like
632
633
634
635 --
636 gentoo-doc-cvs@g.o mailing list