Revision 1.8
Fri Jun 29 16:03:34 2012 UTC (3 years, 1 month ago) by swift
Branch: MAIN
CVS Tags: HEAD
Changes since 1.7: +7 -7 lines
File MIME type: application/xml
Fix bug #397687 - Spelling corrections in article, thanks to Christophe Lefebvre for the patch

<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-sed2.xml,v 1.7 2011/09/04 17:53:41 swift Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide disclaimer="articles">
<title>Sed by example, Part 2</title>

<author title="Author">
<mail link="drobbins@gentoo.org">Daniel Robbins</mail>
</author>

<abstract>
Sed is a very powerful and compact text stream editor. In this article, the
second in the series, Daniel shows you how to use sed to perform string
substitution; create larger sed scripts; and use sed's append, insert, and
change line commands.
</abstract>

<!-- The original version of this article was published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team -->

<version>2</version>
<date>2005-10-09</date>

27     <chapter>
28     <title>How to further take advantage of the UNIX text editor</title>
29     <section>
30     <title>Substitution!</title>
31     <body>
32    
33     <p>
34     Let's look at one of sed's most useful commands, the substitution command.
35     Using it, we can replace a particular string or matched regular expression with
36     another string. Here's an example of the most basic use of this command:
37     </p>
38    
39     <pre caption="Most basic use of substitution command">
40     $ <i>sed -e 's/foo/bar/' myfile.txt</i>
41     </pre>
42    
43     <p>
44     The above command will output the contents of myfile.txt to stdout, with the
45     first occurrence of 'foo' (if any) on each line replaced with the string 'bar'.
46     Please note that I said first occurrence on each line, though this is normally
47     not what you want. Normally, when I do a string replacement, I want to perform
48     it globally. That is, I want to replace all occurrences on every line, as
49     follows:
50     </p>
51    
<pre caption="Replacing all the occurrences on every line">
$ <i>sed -e 's/foo/bar/g' myfile.txt</i>
</pre>
55    
56     <p>
57     The additional 'g' option after the last slash tells sed to perform a global
58     replace.
59     </p>
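
<p>
As a quick sanity check, the difference between the two forms can be seen on a
single line of input. This is only a minimal sketch: the sample text and the
variable names are made up for illustration and do not come from the article.
</p>

```shell
# Hypothetical one-line input containing three occurrences of "foo".
line='foo foo foo'

# Without the g flag, only the first occurrence on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/foo/bar/')

# With the g flag, every occurrence on the line is replaced.
global=$(printf '%s\n' "$line" | sed -e 's/foo/bar/g')

echo "$first_only"   # bar foo foo
echo "$global"       # bar bar bar
```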
60    
61     <p>
62     Here are a few other things you should know about the <c>s///</c> substitution
63     command. First, it is a command, and a command only; there are no addresses
64     specified in any of the above examples. This means that the <c>s///</c> command
65     can also be used with addresses to control what lines it will be applied to, as
66     follows:
67     </p>
68    
69     <pre caption="Specifying lines command will be applied to">
70     $ <i>sed -e '1,10s/enchantment/entrapment/g' myfile2.txt</i>
71     </pre>
72    
73     <p>
74     The above example will cause all occurrences of the phrase 'enchantment' to be
75     replaced with the phrase 'entrapment', but only on lines one through ten,
76     inclusive.
77     </p>
78    
79     <pre caption="Specifying more options">
80     $ <i>sed -e '/^$/,/^END/s/hills/mountains/g' myfile3.txt</i>
81     </pre>
82    
83     <p>
84     This example will swap 'hills' for 'mountains', but only on blocks of text
85     beginning with a blank line, and ending with a line beginning with the three
86     characters 'END', inclusive.
87     </p>
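
<p>
The two kinds of address ranges can be compared side by side on a small input.
The sketch below assumes a made-up five-line sample (not one of the article's
files) and uses a shorter numeric range so the effect is easy to see:
</p>

```shell
# Hypothetical input: a plain line, a blank line, a line inside the block,
# a line beginning with END, and one trailing line.
input=$(printf 'hills\n\nhills\nEND of hills\nhills\n')

# Numeric range: only lines 1 and 2 are eligible for substitution.
by_number=$(printf '%s\n' "$input" | sed -e '1,2s/hills/mountains/g')

# Regexp range: from the first blank line through the first following
# line that begins with END, inclusive.
by_regexp=$(printf '%s\n' "$input" | sed -e '/^$/,/^END/s/hills/mountains/g')

echo "$by_number"
echo "$by_regexp"
```

<p>
With the numeric range only the first line changes (line 2 is blank); with the
regexp range only lines 3 and 4 change.
</p>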
88    
89     <p>
90     Another nice thing about the <c>s///</c> command is that we have a lot of
91     options when it comes to those <c>/</c> separators. If we're performing string
92     substitution and the regular expression or replacement string has a lot of
93     slashes in it, we can change the separator by specifying a different character
94     after the 's'. For example, this will replace all occurrences of
95     <path>/usr/local</path> with <path>/usr</path>:
96     </p>
97    
<pre caption="Replacing all the occurrences of one string with another one">
$ <i>sed -e 's:/usr/local:/usr:g' mylist.txt</i>
</pre>
101    
102     <note>
103     In this example, we're using the colon as a separator. If you ever need to
104     specify the separator character in the regular expression, put a backslash
105     before it.
106     </note>
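
<p>
Both points can be tried directly from the shell. In this sketch the input
strings and variable names are invented for illustration:
</p>

```shell
# Colon as the separator: the slashes in the paths need no escaping.
path=$(printf '/usr/local/bin\n' | sed -e 's:/usr/local:/usr:g')

# A literal colon inside the regular expression is written as \: so that
# sed does not mistake it for the separator.
clock=$(printf '10:30\n' | sed -e 's:10\:30:half past ten:')

echo "$path"    # /usr/bin
echo "$clock"   # half past ten
```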
107    
108     </body>
109     </section>
110     <section>
111     <title>Regexp snafus</title>
112     <body>
113    
<p>
Up until now, we've only performed simple string substitution. While this is
handy, we can also match a regular expression. For example, the following sed
command will match a phrase beginning with '&lt;' and ending with '&gt;', and
containing any number of characters in between. This phrase will be deleted
(replaced with an empty string):
</p>
121    
122     <pre caption="Deleting specified phrase">
123     $ <i>sed -e 's/&lt;.*&gt;//g' myfile.html</i>
124     </pre>
125    
126     <p>
127     This is a good first attempt at a sed script that will remove HTML tags from a
128     file, but it won't work well, due to a regular expression quirk. The reason?
129     When sed tries to match the regular expression on a line, it finds the longest
130     match on the line. This wasn't an issue in my previous sed article, because we
131     were using the <c>d</c> and <c>p</c> commands, which would delete or print the
132     entire line anyway. But when we use the <c>s///</c> command, it definitely makes
133     a big difference, because the entire portion that the regular expression matches
134     will be replaced with the target string, or in this case, deleted. This means
135     that the above example will turn the following line:
136     </p>
137    
138     <pre caption="Sample HTML code">
139     &lt;b&gt;This&lt;/b&gt; is what &lt;b&gt;I&lt;/b&gt; meant.
140     </pre>
141    
142     <p>
143     Into this:
144     </p>
145    
146     <pre caption="Not desired effect">
147     meant.
148     </pre>
149    
150     <p>
151     Rather than this, which is what we wanted to do:
152     </p>
153    
154     <pre caption="Desired effect">
155     This is what I meant.
156     </pre>
157    
158     <p>
159     Fortunately, there is an easy way to fix this. Instead of typing in a regular
160     expression that says "a '&lt;' character followed by any number of characters, and
161     ending with a '&gt;' character", we just need to type in a regexp that says "a
162     '&lt;' character followed by any number of non-'&gt;' characters, and ending
163     with a '&gt;' character". This will have the effect of matching the shortest
164     possible match, rather than the longest possible one. The new command looks like
165     this:
166     </p>
167    
<pre caption="Deleting HTML tags with a shortest-match expression">
169     $ <i>sed -e 's/&lt;[^&gt;]*&gt;//g' myfile.html</i>
170     </pre>
171    
<p>
In the above example, the '[^&gt;]' specifies a "non-'&gt;'" character, and the '*'
after it completes this expression to mean "zero or more non-'&gt;' characters".
Test this command on a few sample HTML files, pipe the output to more, and
review the results.
</p>
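
<p>
The same longest-match behavior applies to any pair of delimiters, not only
angle brackets. The sketch below uses double quotes instead, with a made-up
sample line, to contrast the greedy expression with the "anything except the
closing delimiter" form:
</p>

```shell
# Hypothetical input with two separately quoted words.
line='say "one" and "two" now'

# Greedy: .* runs from the first quote to the LAST quote on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's/".*"//g')

# Shortest match: [^"]* cannot cross a quote, so each quoted word is
# matched (and removed) on its own.
shortest=$(printf '%s\n' "$line" | sed -e 's/"[^"]*"//g')

echo "$greedy"     # say  now
echo "$shortest"   # say  and  now
```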
178    
179     </body>
180     </section>
181     <section>
182     <title>More character matching</title>
183     <body>
184    
<p>
The '[ ]' regular expression syntax has some additional options. To specify
a range of characters, you can use a '-' as long as it isn't in the first or
last position, as follows:
</p>
190    
<pre caption="Specifying a range of characters">
'[a-x]*'
</pre>
194    
<p>
This will match zero or more characters, as long as all of them are
'a','b','c'...'v','w','x'. In addition, the '[:space:]' character class is
available for matching whitespace. Note that a character class is only
recognized inside a bracket expression, so in a real regular expression it is
written as '[[:space:]]'. Here's a fairly complete list of available character
classes:
</p>
201    
202    
203     <table>
204     <tr>
205     <th>Character class</th>
206     <th>Description</th>
207     </tr>
208     <tr>
209     <ti>[:alnum:]</ti>
210     <ti>Alphanumeric [a-z A-Z 0-9]</ti>
211     </tr>
212     <tr>
213     <ti>[:alpha:]</ti>
214     <ti>Alphabetic [a-z A-Z]</ti>
215     </tr>
216     <tr>
217     <ti>[:blank:]</ti>
218     <ti>Spaces or tabs</ti>
219     </tr>
220     <tr>
221     <ti>[:cntrl:]</ti>
222     <ti>Any control characters</ti>
223     </tr>
224     <tr>
225     <ti>[:digit:]</ti>
226     <ti>Numeric digits [0-9]</ti>
227     </tr>
228     <tr>
229     <ti>[:graph:]</ti>
230     <ti>Any visible characters (no whitespace)</ti>
231     </tr>
232     <tr>
233     <ti>[:lower:]</ti>
234     <ti>Lower-case [a-z]</ti>
235     </tr>
236     <tr>
237     <ti>[:print:]</ti>
238     <ti>Non-control characters</ti>
239     </tr>
240     <tr>
241     <ti>[:punct:]</ti>
242     <ti>Punctuation characters</ti>
243     </tr>
244     <tr>
245     <ti>[:space:]</ti>
246     <ti>Whitespace</ti>
247     </tr>
248     <tr>
249     <ti>[:upper:]</ti>
250     <ti>Upper-case [A-Z]</ti>
251     </tr>
252     <tr>
253     <ti>[:xdigit:]</ti>
<ti>Hex digits [0-9 a-f A-F]</ti>
255     </tr>
256     </table>
257    
<p>
It's advantageous to use character classes whenever possible, because they adapt
better to non-English locales (including accented characters when necessary,
etc.).
</p>
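
<p>
As a brief illustration, character classes are placed inside a bracket
expression when used in a command. The inputs and variable names in this
sketch are invented for the example:
</p>

```shell
# Collapse every run of whitespace (spaces or tabs) into a single space.
squeezed=$(printf 'one \t two\t\tthree\n' | sed -e 's/[[:space:]][[:space:]]*/ /g')

# Replace every run of digits with the placeholder letter N.
masked=$(printf 'room 101, floor 7\n' | sed -e 's/[[:digit:]][[:digit:]]*/N/g')

echo "$squeezed"   # one two three
echo "$masked"     # room N, floor N
```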
263    
264     </body>
265     </section>
266     <section>
267     <title>Advanced substitution stuff</title>
268     <body>
269    
270     <p>
271     We've looked at how to perform simple and even reasonably complex straight
272     substitutions, but sed can do even more. We can actually refer to either parts
273     of or the entire matched regular expression, and use these parts to construct
274     the replacement string. As an example, let's say you were replying to a message.
275     The following example would prefix each line with the phrase "ralph said: ":
276     </p>
277    
278     <pre caption="Prefixing each line with certain string">
279     $ <i>sed -e 's/.*/ralph said: &amp;/' origmsg.txt</i>
280     </pre>
281    
282     <p>
283     The output will look like this:
284     </p>
285    
286     <pre caption="Output of the above command">
287     ralph said: Hiya Jim,
288     ralph said:
289     ralph said: I sure like this sed stuff!
290     ralph said:
291     </pre>
292    
293     <p>
294     In this example, we use the '&amp;' character in the replacement string,
295     which tells sed to insert the entire matched regular expression. So, whatever
296     was matched by '.*' (the largest group of zero or more characters on the line,
297     or the entire line) can be inserted anywhere in the replacement string, even
298     multiple times. This is great, but sed is even more powerful.
299     </p>
300    
301     </body>
302     </section>
303     <section>
304     <title>Those wonderful backslashed parentheses</title>
305     <body>
306    
307     <p>
308     Even better than '&amp;', the <c>s///</c> command allows us to define regions in
309     our regular expression, and we can refer to these specific regions in our
310     replacement string. As an example, let's say we have a file that contains the
311     following text:
312     </p>
313    
314     <pre caption="Sample text">
315     foo bar oni
316     eeny meeny miny
317     larry curly moe
318     jimmy the weasel
319     </pre>
320    
321     <p>
322     Now, let's say we wanted to write a sed script that would replace "eeny meeny
323     miny" with "Victor eeny-meeny Von miny", etc. To do this, first we would write a
324     regular expression that would match the three strings, separated by spaces:
325     </p>
326    
327     <pre caption="Matching regular expression">
328     '.* .* .*'
329     </pre>
330    
331     <p>
332     There. Now, we will define regions by inserting backslashed parentheses around
333     each region of interest:
334     </p>
335    
336     <pre caption="Defining regions">
337     '\(.*\) \(.*\) \(.*\)'
338     </pre>
339    
340     <p>
341     This regular expression will work the same as our first one, except that it will
342     define three logical regions that we can refer to in our replacement string.
343     Here's the final script:
344     </p>
345    
346     <pre caption="Final script">
347     $ <i>sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/' myfile.txt</i>
348     </pre>
349    
350     <p>
351     As you can see, we refer to each parentheses-delimited region by typing '\x',
352     where x is the number of the region, starting at one. Output is as follows:
353     </p>
354    
355     <pre caption="Output of the above command">
356     Victor foo-bar Von oni
357     Victor eeny-meeny Von miny
358     Victor larry-curly Von moe
359     Victor jimmy-the Von weasel
360     </pre>
361    
362     <p>
363     As you become more familiar with sed, you will be able to perform fairly
364     powerful text processing with a minimum of effort. You may want to think about
365     how you'd have approached this problem using your favorite scripting language --
366     could you have easily fit the solution in one line?
367     </p>
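
<p>
Regions can also be reordered or reused in the replacement string. The
following sketch is not from the original article; it uses made-up input and
'[^ ]*' (any run of non-space characters) so that each region holds exactly
one word:
</p>

```shell
# Swap two space-separated words by emitting region 2 before region 1.
swapped=$(printf 'world hello\n' | sed -e 's/\([^ ]*\) \([^ ]*\)/\2 \1/')

# A region may be referenced more than once in the replacement.
repeated=$(printf 'eeny meeny\n' | sed -e 's/\([^ ]*\) \([^ ]*\)/\1 \2 \2/')

echo "$swapped"    # hello world
echo "$repeated"   # eeny meeny meeny
```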
368    
369     </body>
370     </section>
371     <section>
372     <title>Mixing things up</title>
373     <body>
374    
375     <p>
376     As we begin creating more complex sed scripts, we need the ability to enter more
377     than one command. There are several ways to do this. First, we can use
378     semicolons between the commands. For example, this series of commands uses the
379     '=' command, which tells sed to print the line number, as well as the <c>p</c>
380     command, which explicitly tells sed to print the line (since we're in '-n'
381     mode):
382     </p>
383    
384     <pre caption="First method, semicolons">
385     $ <i>sed -n -e '=;p' myfile.txt</i>
386     </pre>
387    
388     <p>
389     Whenever two or more commands are specified, each command is applied (in order)
390     to every line in the file. In the above example, first the '=' command is
391     applied to line 1, and then the <c>p</c> command is applied. Then, sed proceeds
392     to line 2, and repeats the process. While the semicolon is handy, there are
393     instances where it won't work. Another alternative is to use two -e options to
394     specify two separate commands:
395     </p>
396    
397     <pre caption="Second method, multiple -e">
398     $ <i>sed -n -e '=' -e 'p' myfile.txt</i>
399     </pre>
400    
<p>
However, when we get to the more complex append and insert commands, even
multiple '-e' options won't help us. For complex multiline scripts, the best way
is to put your commands in a separate file. Then, reference this script file
with the '-f' option:
</p>
407    
408     <pre caption="Third method, external file with commands">
409     $ <i>sed -n -f mycommands.sed myfile.txt</i>
410     </pre>
411    
412     <p>
413     This method, although arguably less convenient, will always work.
414     </p>
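
<p>
For a simple pair of commands, all three methods should produce the same
output. The sketch below checks this on a made-up two-line input; the script
file is a temporary file created with mktemp rather than the mycommands.sed
file named above:
</p>

```shell
# Hypothetical two-line input.
input=$(printf 'alpha\nbeta\n')

# Method 1: semicolon between the commands.
with_semicolon=$(printf '%s\n' "$input" | sed -n -e '=;p')

# Method 2: one -e option per command.
with_two_e=$(printf '%s\n' "$input" | sed -n -e '=' -e 'p')

# Method 3: the commands live in a script file, one per line.
script=$(mktemp)
printf '=\np\n' > "$script"
with_file=$(printf '%s\n' "$input" | sed -n -f "$script")
rm -f "$script"

# Each variant prints the line number followed by the line itself.
echo "$with_file"
```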
415    
416     </body>
417     </section>
418     <section>
419     <title>Multiple commands for one address</title>
420     <body>
421    
422     <p>
423     Sometimes, you may want to specify multiple commands that will apply to a single
424     address. This comes in especially handy when you are performing lots of
425     <c>s///</c> to transform words or syntax in the source file. To perform multiple
426     commands per address, enter your sed commands in a file, and use the '{ }'
427     characters to group commands, as follows:
428     </p>
429    
<pre caption="Entering multiple commands per address">
1,20{
s/[Ll]inux/GNU\/Linux/g
s/samba/Samba/g
s/posix/POSIX/g
}
</pre>
437    
438     <p>
439     The above example will apply three substitution commands to lines 1 through 20,
440     inclusive. You can also use regular expression addresses, or a combination of
441     the two:
442     </p>
443    
<pre caption="Combination of both methods">
1,/^END/{
s/[Ll]inux/GNU\/Linux/g
s/samba/Samba/g
s/posix/POSIX/g
p
}
</pre>
452    
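<p>
This example will apply all the commands between '{ }' to the lines starting at
1 and up to a line beginning with the letters "END", or the end of file if
"END" is not found in the source file. Note that the block ends with a
<c>p</c> command: run the script with '-n' to print only the processed lines,
since without '-n' each line in the range would appear twice.
</p>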
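
<p>
A grouped script can be tried out without an editor by generating the script
file from the shell. This is a minimal sketch under a few assumptions: the
range is shortened to two lines, the input is made up, and the script is
written to a temporary file created with mktemp:
</p>

```shell
# Write a grouped script: two substitutions applied to lines 1-2 only.
script=$(mktemp)
printf '%s\n' '1,2{' 's/[Ll]inux/GNU\/Linux/g' 's/samba/Samba/g' '}' > "$script"

# Hypothetical three-line input; line 3 lies outside the address range.
grouped=$(printf 'linux and samba\nLinux\nlinux and samba\n' | sed -f "$script")
rm -f "$script"

echo "$grouped"
```

<p>
The first two lines come out rewritten, while the third is passed through
unchanged.
</p>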
458    
459     </body>
460     </section>
461     <section>
462     <title>Append, insert, and change line</title>
463     <body>
464    
465     <p>
466     Now that we're writing sed scripts in separate files, we can take advantage of
467     the append, insert, and change line commands. These commands will insert a line
468     after the current line, insert a line before the current line, or replace the
469     current line in the pattern space. They can also be used to insert multiple
470     lines into the output. The insert line command is used as follows:
471     </p>
472    
473     <pre caption="Using the insert line command">
474     i\
475     This line will be inserted before each line
476     </pre>
477    
478     <p>
479     If you don't specify an address for this command, it will be applied to each
480     line and produce output that looks like this:
481     </p>
482    
483     <pre caption="Output of the above command">
484     This line will be inserted before each line
485     line 1 here
486     This line will be inserted before each line
487     line 2 here
488     This line will be inserted before each line
489     line 3 here
490     This line will be inserted before each line
491     line 4 here
492     </pre>
493    
494     <p>
495     If you'd like to insert multiple lines before the current line, you can add
496     additional lines by appending a backslash to the previous line, like so:
497     </p>
498    
499     <pre caption="Inserting multiple lines before the current one">
500     i\
501     insert this line\
502     and this one\
503     and this one\
504     and, uh, this one too.
505     </pre>
506    
507     <p>
508     The append command works similarly, but will insert a line or lines after the
509     current line in the pattern space. It's used as follows:
510     </p>
511    
512     <pre caption="Appending lines after the current one">
513     a\
514     insert this line after each line. Thanks! :)
515     </pre>
516    
<p>
On the other hand, the "change line" command will actually replace the current
line in the pattern space. It takes the same form as the insert and append
commands: 'c\' on one line, followed by the replacement text.
</p>
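
<p>
A minimal sketch of the change line command in use, assuming a made-up
three-line input, an address of 2 so that only one line is replaced, and a
temporary script file created with mktemp:
</p>

```shell
# Script file contents:
#   2c\
#   this line replaces line 2
script=$(mktemp)
printf '%s\n' '2c\' 'this line replaces line 2' > "$script"

changed=$(printf 'line 1\nline 2\nline 3\n' | sed -f "$script")
rm -f "$script"

echo "$changed"
```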
521    
<p>
Because the append, insert, and change line commands need to be entered on
multiple lines, you'll want to put them in sed script files and tell sed to
source them by using the '-f' option. Passing them to sed by the other methods
is error-prone, because the shell command would have to contain the line breaks
itself.
</p>
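
<p>
The insert and append commands can be combined with addresses in the same
script file. The sketch below is illustrative only: the addresses ('1' for the
first line, '$' for the last), the sample input, and the temporary file are
assumptions, not part of the original examples:
</p>

```shell
# Script file contents:
#   1i\
#   inserted before line 1
#   $a\
#   appended after the last line
script=$(mktemp)
printf '%s\n' '1i\' 'inserted before line 1' \
              '$a\' 'appended after the last line' > "$script"

result=$(printf 'line 1\nline 2\n' | sed -f "$script")
rm -f "$script"

echo "$result"
```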
528    
529     </body>
530     </section>
531     <section>
532     <title>Next time</title>
533     <body>
534    
535     <p>
536     Next time, in the final article of this series on sed, I'll show you lots of
537     excellent real-world examples of using sed for many different kinds of tasks.
538     Not only will I show you what the scripts do, but why they do what they do.
539     After you're done, you'll have additional excellent ideas of how to use sed in
540     your various projects. I'll see you then!
541     </p>
542    
543     </body>
544     </section>
545     </chapter>
546    
547     <chapter>
548     <title>Resources</title>
549     <section>
550     <title>Useful links</title>
551     <body>
552    
553     <ul>
554     <li>
555     Read Daniel's other sed articles from developerWorks: Common threads: Sed by
556     example, <uri link="l-sed1.xml">Part 1</uri> and <uri
557     link="l-sed3.xml">Part 3</uri>.
558     </li>
<li>
Check out Eric Pement's excellent <uri
link="http://sed.sourceforge.net/sedfaq.html">sed FAQ</uri>.
</li>
<li>
You can find the sources to sed at
<uri>ftp://ftp.gnu.org/pub/gnu/sed</uri>.
</li>
<li>
Eric Pement also has a handy list of <uri
link="http://sed.sourceforge.net/sed1line.txt">sed one-liners</uri> that any
aspiring sed guru should definitely look at.
</li>
<li>
If you'd like a good old-fashioned book, <uri
link="http://www.oreilly.com/catalog/sed2/">O'Reilly's sed &amp; awk, 2nd
Edition</uri> would be a wonderful choice.
</li>
578     <!-- FIXME BOTH DEAD and no other locations, sorry
579     <li>
580     Maybe you'd like to read <uri
581     link="http://www.softlab.ntua.gr/unix/docs/sed.txt">7th edition UNIX's sed
582     man page</uri> (circa 1978!).
583     </li>
584     <li>
585     Take Felix von Leitner's short <uri
586     link="http://www.math.fu-berlin.de/~leitner/sed/tutorial.html">sed
587     tutorial</uri>.
588     </li>
589     -->
<!-- Dead link
<li>
Brush up on <uri link="http://vision.eng.shu.ac.uk/C++/misc/regexp/">using
regular expressions</uri> to find and modify patterns in text in this free,
dW-exclusive tutorial.
</li>
-->
597 jkt 1.1 </ul>
598    
599     </body>
600     </section>
601     </chapter>
602     </guide>
