Gentoo Archives: gentoo-doc-cvs

From: Jan Kundrat <jkt@×××××××××××.org>
To: gentoo-doc-cvs@l.g.o
Subject: [gentoo-doc-cvs] cvs commit: l-sed1.xml
Date: Tue, 26 Jul 2005 10:47:06
Message-Id: 200507261046.j6QAkpVW027955@robin.gentoo.org
1 jkt 05/07/26 10:46:47
2
3 Added: xml/htdocs/doc/en/articles l-sed1.xml l-sed2.xml l-sed3.xml
4 Log:
5 #99049, "Common threads: Sed by example", converted by rane
6
7 Revision Changes Path
8 1.1 xml/htdocs/doc/en/articles/l-sed1.xml
9
10 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-sed1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
11 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-sed1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
12
13 Index: l-sed1.xml
14 ===================================================================
15 <?xml version='1.0' encoding="UTF-8"?>
16 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-sed1.xml,v 1.1 2005/07/26 10:46:47 jkt Exp $ -->
17 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
18
19 <guide link="/doc/en/articles/l-sed1.xml">
20 <title>Sed by example, Part 1</title>
21
22 <author title="Author">
23 <mail link="drobbins@g.o">Daniel Robbins</mail>
24 </author>
25 <author title="Editor">
26 <mail link="rane@××××××.pl">Łukasz Damentko</mail>
27 </author>
28
29 <abstract>
30 In this series of articles, Daniel Robbins will show you how to use the very
31 powerful (but often forgotten) UNIX stream editor, sed. Sed is an ideal tool for
32 batch-editing files or for creating shell scripts to modify existing files in
33 powerful ways.
34 </abstract>
35
36 <!-- The original version of this article was published on IBM developerWorks,
37 and is property of Westtech Information Services. This document is an updated
38 version of the original article, and contains various improvements made by the
39 Gentoo Linux Documentation team -->
40
41 <version>1.0</version>
42 <date>2005-07-15</date>
43
44 <chapter>
45 <title>Get to know the powerful UNIX editor</title>
46 <section>
47 <title>Pick an editor</title>
48 <body>
49
50 <note>
51 The original version of this article was published on IBM developerWorks, and is
52 property of Westtech Information Services. This document is an updated version
53 of the original article, and contains various improvements made by the Gentoo
54 Linux Documentation team.
55 </note>
56
57 <p>
58 In the UNIX world, we have a lot of options when it comes to editing files.
59 Think of it -- vi, emacs, and jed come to mind, as well as many others. We all
60 have our favorite editor (along with our favorite keybindings) that we have come
61 to know and love. With our trusty editor, we are ready to tackle any number of
62 UNIX-related administration or programming tasks with ease.</p>
63
64 <p>
65 While interactive editors are great, they do have limitations. Though their
66 interactive nature can be a strength, it can also be a weakness. Consider a
67 situation where you need to perform similar types of changes on a group of
68 files. You could instinctively fire up your favorite editor and perform a bunch
69 of mundane, repetitive, and time-consuming edits by hand. But there's a better
70 way.
71 </p>
72
73 </body>
74 </section>
75 <section>
76 <title>Enter sed</title>
77 <body>
78
79 <p>
80 It would be nice if we could automate the process of making edits to files, so
81 that we could "batch" edit files, or even write scripts with the ability to
82 perform sophisticated changes to existing files. Fortunately for us, for these
83 types of situations, there is a better way -- and the better way is called sed.
84 </p>
85
86 <p>
87 sed is a lightweight stream editor that's included with nearly all UNIX flavors,
88 including Linux. sed has a lot of nice features. First of all, it's very
89 lightweight, typically many times smaller than your favorite scripting language.
90 Secondly, because sed is a stream editor, it can perform edits to data it
91 receives from stdin, such as from a pipeline. So, you don't need to have the
92 data to be edited stored in a file on disk. Because data can just as easily be
93 piped to sed, it's very easy to use sed as part of a long, complex pipeline in a
94 powerful shell script. Try doing that with your favorite editor.
95 </p>
96
97 </body>
98 </section>
99 <section>
100 <title>GNU sed</title>
101 <body>
102
103 <p>
104 Fortunately for us Linux users, one of the nicest versions of sed out there
105 happens to be GNU sed, which is currently at version 3.02. Every Linux
106 distribution has GNU sed, or at least should. GNU sed is popular not only
107 because its sources are freely distributable, but because it happens to have a
108 lot of handy, time-saving extensions to the POSIX sed standard. GNU sed also
109 doesn't suffer from many of the limitations that earlier and proprietary
110 versions of sed had, such as a limited line length -- GNU sed handles lines of
111 any length with ease.
112 </p>
113
114 </body>
115 </section>
116 <section>
117 <title>The newest GNU sed</title>
118 <body>
119
120 <p>
121 While researching this article, I noticed that several online sed aficionados
122 made reference to a GNU sed 3.02a. Strangely, I couldn't find sed 3.02a on
123 <uri>ftp://ftp.gnu.org</uri> (see <uri link="#resources">Resources</uri> for
124 these links), so I had to go look for it elsewhere. I found it at
125 <uri>ftp://alpha.gnu.org</uri>, in <path>/pub/sed</path>. I happily downloaded
126 it, compiled it, and installed it, only to find minutes later that the most
127 recent version of sed is 3.02.80 -- and you can find its sources right next to
128 those for 3.02a, at <uri>ftp://alpha.gnu.org</uri>. After getting GNU sed
129 3.02.80 installed, I was finally ready to go.
130 </p>
131
132 </body>
133 </section>
134 <section>
135 <title>The right sed</title>
136 <body>
137
138 <p>
139 In this series, we will be using GNU sed 3.02.80. Some (but very few) of the
140 most advanced examples you'll find in my upcoming, follow-on articles in this
141 series will not work with GNU sed 3.02 or 3.02a. If you're using a non-GNU sed,
142 your results may vary. Why not take some time to install GNU sed 3.02.80 now?
143 Then, not only will you be ready for the rest of the series, but you'll also be
144 able to use arguably the best sed in existence!
145 </p>
146
147 </body>
148 </section>
149 <section>
150 <title>Sed examples</title>
151 <body>
152
153 <p>
154 Sed works by performing any number of user-specified editing operations
155 ("commands") on the input data. Sed is line-based, so the commands are performed
156 on each line in order. And, sed writes its results to standard output (stdout);
157 it doesn't modify any input files.
158 </p>
159
160 <p>
161 Let's look at some examples. The first several are going to be a bit weird
162 because I'm using them to illustrate how sed works rather than to perform any
163 useful task. However, if you're new to sed, it's very important that you
164 understand them. Here's our first example:
165 </p>
166
167 <pre caption="Example of sed usage">
168 $ <i>sed -e 'd' /etc/services</i>
169 </pre>
170
171 <p>
172 If you type this command, you'll get absolutely no output. Now, what happened?
173 In this example, we called sed with one editing command, <c>d</c>. Sed opened
174 the <path>/etc/services</path> file, read a line into its pattern buffer,
175 performed our editing command ("delete line"), and then printed the pattern
176 buffer (which was empty). It then repeated these steps for each successive line.
177 This produced no output, because the <c>d</c> command zapped every single line
178 in the pattern buffer!
179 </p>
180
181 <p>
182 There are a couple of things to notice in this example. First,
183 <path>/etc/services</path> was not modified at all. This is because, again, sed
184 only reads from the file you specify on the command line, using it as input --
185 it doesn't try to modify the file. The second thing to notice is that sed is
186 line-oriented. The <c>d</c> command didn't simply tell sed to delete all incoming
187 data in one fell swoop. Instead, sed read each line of <path>/etc/services</path> one by one
188 into its internal buffer, called the pattern buffer. Once a line was read into
189 the pattern buffer, it performed the <c>d</c> command and printed the contents
190 of the pattern buffer (nothing in this example). Later, I'll show you how to use
191 address ranges to control which lines a command is applied to -- but in the
192 absence of addresses, a command is applied to all lines.
193 </p>
194
195 <p>
196 The third thing to notice is the use of single quotes to surround the <c>d</c>
197 command. It's a good idea to get into the habit of using single quotes to
198 surround your sed commands, so that shell expansion is disabled.
199 </p>
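
<p>
The three observations above can be checked with a small sketch. It assumes a
scratch file created with <c>mktemp</c> as a stand-in for
<path>/etc/services</path>; the file contents and variable names are made up
for illustration.
</p>

```shell
# Scratch input; a hypothetical stand-in for /etc/services.
input=$(mktemp)
printf 'ftp 21/tcp\nssh 22/tcp\nsmtp 25/tcp\n' > "$input"
before=$(cat "$input")

# 'd' is applied to every line in turn, so nothing reaches stdout.
output=$(sed -e 'd' "$input")
printf 'sed printed %d characters\n' "${#output}"

# sed only read the file; its contents are the same as before.
after=$(cat "$input")
if [ "$before" = "$after" ]; then
    echo "input file unchanged"
fi

rm -f "$input"
```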
200
201 </body>
202 </section>
203 <section>
204 <title>Another sed example</title>
205 <body>
206
207 <p>
208 Here's an example of how to use sed to remove the first line of the
209 <path>/etc/services</path> file from our output stream:
210 </p>
211
212 <pre caption="Another sed example">
213 $ <i>sed -e '1d' /etc/services | more</i>
214
215
216
217 1.1 xml/htdocs/doc/en/articles/l-sed2.xml
218
219 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-sed2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
220 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-sed2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
221
222 Index: l-sed2.xml
223 ===================================================================
224 <?xml version='1.0' encoding="UTF-8"?>
225 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-sed2.xml,v 1.1 2005/07/26 10:46:47 jkt Exp $ -->
226 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
227
228 <guide link="/doc/en/articles/l-sed2.xml">
229 <title>Sed by example, Part 2</title>
230
231 <author title="Author">
232 <mail link="drobbins@g.o">Daniel Robbins</mail>
233 </author>
234 <author title="Editor">
235 <mail link="rane@××××××.pl">Łukasz Damentko</mail>
236 </author>
237
238 <abstract>
239 Sed is a very powerful and compact text stream editor. In this article, the
240 second in the series, Daniel shows you how to use sed to perform string
241 substitution; create larger sed scripts; and use sed's append, insert, and
242 change line commands.
243 </abstract>
244
245 <!-- The original version of this article was published on IBM developerWorks,
246 and is property of Westtech Information Services. This document is an updated
247 version of the original article, and contains various improvements made by the
248 Gentoo Linux Documentation team -->
249
250 <version>1.0</version>
251 <date>2005-07-15</date>
252
253 <chapter>
254 <title>How to further take advantage of the UNIX text editor</title>
255 <section>
256 <title>Substitution!</title>
257 <body>
258
259 <note>
260 The original version of this article was published on IBM developerWorks, and is
261 property of Westtech Information Services. This document is an updated version
262 of the original article, and contains various improvements made by the Gentoo
263 Linux Documentation team.
264 </note>
265
266
267 <p>
268 Let's look at one of sed's most useful commands, the substitution command.
269 Using it, we can replace a particular string or matched regular expression with
270 another string. Here's an example of the most basic use of this command:
271 </p>
272
273 <pre caption="Most basic use of substitution command">
274 $ <i>sed -e 's/foo/bar/' myfile.txt</i>
275 </pre>
276
277 <p>
278 The above command will output the contents of myfile.txt to stdout, with the
279 first occurrence of 'foo' (if any) on each line replaced with the string 'bar'.
280 Please note that I said first occurrence on each line, though this is normally
281 not what you want. Normally, when I do a string replacement, I want to perform
282 it globally. That is, I want to replace all occurrences on every line, as
283 follows:
284 </p>
285
286 <pre caption="Replacing all the occurrences on every line">
287 $ <i>sed -e 's/foo/bar/g' myfile.txt</i>
288 </pre>
289
290 <p>
291 The additional 'g' option after the last slash tells sed to perform a global
292 replace.
293 </p>
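
<p>
To see the difference between the two forms side by side, here is a minimal
sketch that feeds one made-up line (not taken from any real file) through both
commands.
</p>

```shell
# Hypothetical sample line containing 'foo' twice.
line='foo and more foo'

# Without 'g': only the first 'foo' on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/foo/bar/')

# With 'g': every 'foo' on the line is replaced.
all=$(printf '%s\n' "$line" | sed -e 's/foo/bar/g')

printf '%s\n%s\n' "$first_only" "$all"
```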
294
295 <p>
296 Here are a few other things you should know about the <c>s///</c> substitution
297 command. First, it is a command, and a command only; there are no addresses
298 specified in any of the above examples. This means that the <c>s///</c> command
299 can also be used with addresses to control what lines it will be applied to, as
300 follows:
301 </p>
302
303 <pre caption="Specifying lines command will be applied to">
304 $ <i>sed -e '1,10s/enchantment/entrapment/g' myfile2.txt</i>
305 </pre>
306
307 <p>
308 The above example will cause all occurrences of the phrase 'enchantment' to be
309 replaced with the phrase 'entrapment', but only on lines one through ten,
310 inclusive.
311 </p>
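
<p>
The same idea on a smaller scale: the sketch below uses three made-up input
lines and the range <c>1,2</c> instead of <c>1,10</c>, so the effect of the
address is easy to see.
</p>

```shell
# Three identical hypothetical lines; only lines 1 and 2 are in the range.
ranged=$(printf 'enchantment\nenchantment\nenchantment\n' |
    sed -e '1,2s/enchantment/entrapment/g')

# The third line is printed unchanged.
printf '%s\n' "$ranged"
```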
312
313 <pre caption="Specifying more options">
314 $ <i>sed -e '/^$/,/^END/s/hills/mountains/g' myfile3.txt</i>
315 </pre>
316
317 <p>
318 This example will swap 'hills' for 'mountains', but only on blocks of text
319 beginning with a blank line, and ending with a line beginning with the three
320 characters 'END', inclusive.
321 </p>
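
<p>
A regular expression range can be tried out the same way. The five input lines
below are invented for this sketch: the range opens at the blank second line
and closes at the fourth line, which begins with 'END'.
</p>

```shell
# Line 1 is before the range, lines 2-4 are inside it, line 5 is after it.
blocks=$(printf 'hills\n\nhills\nEND hills\nhills\n' |
    sed -e '/^$/,/^END/s/hills/mountains/g')

# Only the 'hills' on lines 3 and 4 are replaced.
printf '%s\n' "$blocks"
```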
322
323 <p>
324 Another nice thing about the <c>s///</c> command is that we have a lot of
325 options when it comes to those <c>/</c> separators. If we're performing string
326 substitution and the regular expression or replacement string has a lot of
327 slashes in it, we can change the separator by specifying a different character
328 after the 's'. For example, this will replace all occurrences of
329 <path>/usr/local</path> with <path>/usr</path>:
330 </p>
331
332 <pre caption="Replacing all the occurrences of one string with another one">
333 $ <i>sed -e 's:/usr/local:/usr:g' mylist.txt</i>
334 </pre>
335
336 <note>
337 In this example, we're using the colon as a separator. If you ever need to
338 specify the separator character in the regular expression, put a backslash
339 before it.
340 </note>
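
<p>
Both points can be sketched with input piped in from <c>printf</c>. The sample
strings are assumptions chosen for illustration; the second command shows the
backslash-escaped separator described in the note.
</p>

```shell
# Colon as separator: the slashes in the paths need no escaping.
paths=$(printf '/usr/local/bin\n' | sed -e 's:/usr/local:/usr:g')

# The separator itself appears in the regexp, so it is escaped with a backslash.
times=$(printf '12:30\n' | sed -e 's:12\:30:noon:')

printf '%s\n%s\n' "$paths" "$times"
```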
341
342 </body>
343 </section>
344 <section>
345 <title>Regexp snafus</title>
346 <body>
347
348 <p>
349 Up until now, we've only performed simple string substitution. While this is
350 handy, we can also match a regular expression. For example, the following sed
351 command will match a phrase beginning with '&lt;' and ending with '&gt;', and
352 containing any number of characters in between. This phrase will be deleted
353 (replaced with an empty string):
354 </p>
355
356 <pre caption="Deleting specified phrase">
357 $ <i>sed -e 's/&lt;.*&gt;//g' myfile.html</i>
358 </pre>
359
360 <p>
361 This is a good first attempt at a sed script that will remove HTML tags from a
362 file, but it won't work well, due to a regular expression quirk. The reason?
363 When sed tries to match the regular expression on a line, it finds the longest
364 match on the line. This wasn't an issue in my previous sed article, because we
365 were using the <c>d</c> and <c>p</c> commands, which would delete or print the
366 entire line anyway. But when we use the <c>s///</c> command, it definitely makes
367 a big difference, because the entire portion that the regular expression matches
368 will be replaced with the target string, or in this case, deleted. This means
369 that the above example will turn the following line:
370 </p>
371
372 <pre caption="Sample HTML code">
373 &lt;b&gt;This&lt;/b&gt; is what &lt;b&gt;I&lt;/b&gt; meant.
374 </pre>
375
376 <p>
377 Into this:
378 </p>
379
380 <pre caption="Not desired effect">
381 meant.
382 </pre>
383
384 <p>
385 Rather than this, which is what we wanted to do:
386 </p>
387
388 <pre caption="Desired effect">
389 This is what I meant.
390 </pre>
391
392 <p>
393 Fortunately, there is an easy way to fix this. Instead of typing in a regular
394 expression that says "a '&lt;' character followed by any number of characters, and
395 ending with a '&gt;' character", we just need to type in a regexp that says "a
396 '&lt;' character followed by any number of non-'&gt;' characters, and ending
397 with a '&gt;' character". This will have the effect of matching the shortest
398 possible match, rather than the longest possible one. The new command looks like
399 this:
400 </p>
401
402 <pre caption="Improved tag-removal command">
403 $ <i>sed -e 's/&lt;[^&gt;]*&gt;//g' myfile.html</i>
404 </pre>
405
406 <p>
407 In the above example, the '[^&gt;]' specifies a "non-'&gt;'" character, and the '*'
408 after it completes this expression to mean "zero or more non-'&gt;' characters".
409 Test this command on a few sample HTML files, pipe the output to more, and
410 review the results.
411 </p>
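
<p>
The longest-match behavior is not specific to HTML. As a rough sketch, the
example below repeats the experiment with parentheses standing in for the angle
brackets; that swap is an assumption made only so the sample needs no character
escaping, and the regular expressions are otherwise the same shape as the ones
above.
</p>

```shell
# Parentheses play the role of the angle brackets in the HTML example.
line='(b)This(/b) is what (b)I(/b) meant.'

# Greedy: '.*' runs on to the LAST ')' on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's/(.*)//g')

# Bounded: '[^)]*' cannot cross a ')', so each pair is removed separately.
bounded=$(printf '%s\n' "$line" | sed -e 's/([^)]*)//g')

printf '[%s]\n[%s]\n' "$greedy" "$bounded"
```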
412
413 </body>
414 </section>
415 <section>
416 <title>More character matching</title>
417 <body>
418
419 <p>
420 The '[ ]' regular expression syntax has some additional options. To specify
421 a range of characters, you can use a '-' as long as it isn't in the first or
422 last position, as follows:
423
424
425
426 1.1 xml/htdocs/doc/en/articles/l-sed3.xml
427
428 file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-sed3.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
429 plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-sed3.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo
430
431 Index: l-sed3.xml
432 ===================================================================
433 <?xml version='1.0' encoding="UTF-8"?>
434 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-sed3.xml,v 1.1 2005/07/26 10:46:47 jkt Exp $ -->
435 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
436
437 <guide link="/doc/en/articles/l-sed3.xml">
438 <title>Sed by example, Part 3</title>
439
440 <author title="Author">
441 <mail link="drobbins@g.o">Daniel Robbins</mail>
442 </author>
443 <author title="Editor">
444 <mail link="rane@××××××.pl">Łukasz Damentko</mail>
445 </author>
446
447 <abstract>
448 Sed is a very powerful and compact text stream editor. In this article, the
449 third and final in the series, Daniel shows you several practical sed scripts:
450 converting text between UNIX and DOS/Windows formats, reversing the lines of a
451 file, and turning a Quicken .QIF file into a nicely formatted text file.
452 </abstract>
453
454 <!-- The original version of this article was published on IBM developerWorks,
455 and is property of Westtech Information Services. This document is an updated
456 version of the original article, and contains various improvements made by the
457 Gentoo Linux Documentation team -->
458
459 <version>1.0</version>
460 <date>2005-07-16</date>
461
462 <chapter>
463 <title>Taking it to the next level: Data crunching, sed style</title>
464 <section>
465 <title>Muscular sed</title>
466 <body>
467
468 <note>
469 The original version of this article was published on IBM developerWorks, and is
470 property of Westtech Information Services. This document is an updated version
471 of the original article, and contains various improvements made by the Gentoo
472 Linux Documentation team.
473 </note>
474
475 <p>
476 In <uri link="l-sed2.xml">my second sed article</uri>, I
477 offered examples that demonstrated how sed works, but very few of these examples
478 actually did anything particularly useful. In this final sed article, it's time
479 to change that pattern and put sed to good use. I'll show you several excellent
480 examples that not only demonstrate the power of sed, but also do some really
481 neat (and handy) things. For example, in the second half of the article, I'll
482 show you how I designed a sed script that converts a .QIF file from Intuit's
483 Quicken financial program into a nicely formatted text file. Before doing that,
484 we'll take a look at some less complicated yet useful sed scripts.
485 </p>
486
487 </body>
488 </section>
489 <section>
490 <title>Text translation</title>
491 <body>
492
493 <p>
494 Our first practical script converts UNIX-style text to DOS/Windows format. As
495 you probably know, DOS/Windows-based text files have a CR (carriage return) and
496 LF (line feed) at the end of each line, while UNIX text has only a line feed.
497 There may be times when you need to move some UNIX text to a Windows system, and
498 this script will perform the necessary format conversion for you.
499 </p>
500
501 <pre caption="Format conversion between UNIX and Windows">
502 $ <i>sed -e 's/$/\r/' myunix.txt > mydos.txt</i>
503 </pre>
504
505 <p>
506 In this script, the '$' regular expression will match the end of the line, and
507 the '\r' tells sed to insert a carriage return right before it. Insert a
508 carriage return before a line feed, and presto, a CR/LF ends each line. Please
509 note that the '\r' will be replaced with a CR only when using GNU sed 3.02.80 or
510 later. If you haven't installed GNU sed 3.02.80 yet, see <uri
511 link="l-sed1.xml">my first sed article</uri> for instructions on
512 how to do this.
513 </p>
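
<p>
One way to confirm that a carriage return really was added is to count bytes.
This sketch assumes a GNU sed that understands '\r', as noted above, and uses
two invented one-character lines as input.
</p>

```shell
# Two UNIX-style lines: 2 characters plus 2 line feeds.
unix_bytes=$(printf 'a\nb\n' | wc -c)

# After the conversion each line should carry one extra byte, the CR.
dos_bytes=$(printf 'a\nb\n' | sed -e 's/$/\r/' | wc -c)

printf 'UNIX: %d bytes, DOS: %d bytes\n' "$unix_bytes" "$dos_bytes"
```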
514
515 <p>
516 I can't tell you how many times I've downloaded some example script or C code,
517 only to find that it's in DOS/Windows format. While many programs don't mind
518 DOS/Windows format CR/LF text files, several programs definitely do -- the most
519 notable being bash, which chokes as soon as it encounters a carriage return. The
520 following sed invocation will convert DOS/Windows format text to trusty UNIX
521 format:
522 </p>
523
524 <pre caption="Converting C code from Windows to UNIX format">
525 $ <i>sed -e 's/.$//' mydos.txt > myunix.txt</i>
526 </pre>
527
528 <p>
529 The way this script works is simple: our substitution regular expression matches
530 the last character on the line, which happens to be a carriage return. We
531 replace it with nothing, causing it to be deleted from the output entirely. If
532 you use this script and notice that the last character of every line of the
533 output has been deleted, you've specified a text file that's already in UNIX
534 format. No need for that!
535 </p>
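
<p>
If a file might already be partly in UNIX format, a more cautious variant is to
match the carriage return explicitly. This is not the script from the article
but a sketch of an alternative, and it again assumes a GNU sed that understands
'\r'; the helper name <c>mixed</c> and its contents are made up.
</p>

```shell
# Mixed input: the first line ends in CR/LF, the second is already UNIX-style.
mixed() { printf 'one\r\ntwo\n'; }

# The article's script drops the last character of every line, CR or not.
blunt=$(mixed | sed -e 's/.$//')

# Matching '\r' explicitly removes only a trailing carriage return.
careful=$(mixed | sed -e 's/\r$//')

printf '%s\n---\n%s\n' "$blunt" "$careful"
```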
536
537 </body>
538 </section>
539 <section>
540 <title>Reversing lines</title>
541 <body>
542
543 <p>
544 Here's another handy little script. This one will reverse lines in a file,
545 similar to the "tac" command that's included with most Linux distributions. The
546 name "tac" may be a bit misleading, because "tac" doesn't reverse the position
547 of characters on the line (left and right), but rather the position of lines in
548 the file (up and down). Tacing the following file:
549 </p>
550
551 <pre caption="Sample file">
552 foo
553 bar
554 oni
555 </pre>
556
557 <p>
558 ...produces the following output:
559 </p>
560
561 <pre caption="Output file">
562 oni
563 bar
564 foo
565 </pre>
566
567 <p>
568 We can do the same thing with the following sed script:
569 </p>
570
571 <pre caption="Doing same with script">
572 $ <i>sed -e '1!G;h;$!d' forward.txt > backward.txt</i>
573 </pre>
574
575 <p>
576 You'll find this sed script useful if you're logged in to a FreeBSD system,
577 which doesn't happen to have a "tac" command. While handy, it's also a good idea
578 to know why this script does what it does. Let's dissect it.
579 </p>
580
581 </body>
582 </section>
583 <section>
584 <title>Reversal explained</title>
585 <body>
586
587 <p>
588 First, this script contains three separate sed commands, separated by
589 semicolons: '1!G', 'h' and '$!d'. Now, it's time to get a good understanding of
590 the addresses used for the first and third commands. If the first command were
591 '1G', the 'G' command would be applied only to the first line. However, there is
592 an additional '!' character -- this '!' character negates the address, meaning
593 that the 'G' command will apply to all but the first line. For the '$!d'
594 command, we have a similar situation. If the command were '$d', it would apply
595 the 'd' command to only the last line in the file (the '$' address is a simple
596 way of specifying the last line). However, with the '!', '$!d' will apply the
597 'd' command to all but the last line. Now, all we need to do is understand what
598 the commands themselves do.
599 </p>
600
601 <p>
602 When we execute our line reversal script on the text file above, '1!G' is skipped
603 on line one, so the first command that gets executed is 'h'. This command tells sed to copy the contents
604 of the pattern space (the buffer that holds the current line being worked on) to
605 the hold space (a temporary buffer). Then, the 'd' command is executed, which
606 deletes "foo" from the pattern space, so it doesn't get printed after all the
607 commands are executed for this line.
608 </p>
609
610 <p>
611 Now, line two. After "bar" is read into the pattern space, the 'G' command is
612 executed, which appends the contents of the hold space ("foo\n") to the pattern
613 space ("bar\n"), resulting in "bar\nfoo\n" in our pattern space. The 'h'
614 command puts this back in the hold space for safekeeping, and 'd' deletes the
615 line from the pattern space so that it isn't printed.
616 </p>
617
618 <p>
619 For the last "oni" line, the same steps are repeated, except that the contents
620 of the pattern space aren't deleted (due to the '$!' before the 'd'), and the
621 contents of the pattern space (three lines) are printed to stdout.
622 </p>
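
<p>
The walkthrough above can be replayed on the three-line sample without creating
any files. In this sketch the input comes from <c>printf</c> rather than from
<path>forward.txt</path>, and the variable name is chosen for illustration.
</p>

```shell
# 1!G : on every line but the first, append the hold space to the pattern space
# h   : copy the pattern space to the hold space
# $!d : on every line but the last, delete the pattern space without printing
reversed=$(printf 'foo\nbar\noni\n' | sed -e '1!G;h;$!d')

printf '%s\n' "$reversed"
```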
623
624 <p>
625 Now, it's time to do some powerful data conversion with sed.
626 </p>
627
628 </body>
629 </section>
630 <section>
631 <title>sed QIF magic</title>
632
633
634
635 --
636 gentoo-doc-cvs@g.o mailing list