Gentoo Archives: gentoo-user

From: Willie Wong <wwong@×××××××××.EDU>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] [OT] - command line read *.csv & create new file
Date: Sun, 22 Feb 2009 20:56:40
Message-Id: 20090222205938.GA455@princeton.edu
In Reply to: [gentoo-user] [OT] - command line read *.csv & create new file by Mark Knecht

On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
> I've got a really big data file in essentially a *.csv format.
> (comma delimited) I need to scan this file and create a new output
> file. I'm wondering if there is a reasonably easy command line way of
> doing this using something like sed or awk which I know nothing about.
> Thanks in advance.

Definitely more than doable in sed or awk. If you want a reference
book, try http://oreilly.com/catalog/9781565922259/

Unfortunately I haven't used awk in the longest time and can't
remember how it would go. The following sed recipe may work, modulo
some small modifications.

> The basic idea goes something like this:
>
> 1) The input file might look like the following, where some of it is
> attributes (shown as letters) and other parts are results (shown as
> numbers).
>
> A,B,C,D,1
> E,F,G,H,2
> I,J,K,L,3
> M,N,O,P,4
> Q,R,S,T,5
> U,V,W,X,6
>
> 2) From the above data input file I want to take the attributes from a
> few preceding lines (say 3 in this example) and write them to the
> output file along with the result on the last of the 3 lines. The
> output file might look like this:
>
> A,B,C,D,E,F,G,H,I,J,K,L,3
> E,F,G,H,I,J,K,L,M,N,O,P,4
> I,J,K,L,M,N,O,P,Q,R,S,T,5
> M,N,O,P,Q,R,S,T,U,V,W,X,6
>
> 3) This must be done as a read/process/write operation of some sort
> because the input file may be far larger than system memory.
> (Currently it isn't, but it likely will eventually be.)
>
> 4) In my example above I suggested that there is a single result but
> there may be more than one. (Don't know yet.) I showed 3 lines but
> might be doing 10. I don't know. It's important to me to pick a
> moderately flexible way of dealing with this as the order of columns
> and number of results will likely change over time and I'll certainly
> need to adjust.

First create the sedscript

sedscript1:
--------------------------
1 {
N
N
}
{
p
$d
N
D
}
--------------------------

The first block only hits when the first line of input is read. It
forces sed to read the next two lines as well.

The second block hits for every pattern space: it prints the three-line
block, quits if that was the last line of input, appends the next line,
and then deletes the oldest line. The D has to come after the N, because
D restarts the cycle and nothing placed after it ever runs. The $d keeps
GNU sed from printing the last block a second time when N runs out of
input.
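
To see what this first stage emits, here is a minimal sketch of the same
sliding window written inline. It assumes a sed that accepts several -e
fragments (GNU and BSD sed both do), and the function name window3 is
only for illustration. It appends the next line before dropping the
oldest one, and uses $d on the last line so the final window is not
printed twice.

```shell
# Sliding three-line window: print the current three lines, then
# append the next input line and drop the oldest one.
window3() {
    sed -e '1{N;N;}' -e 'p' -e '$d' -e 'N' -e 'D'
}

# Four input lines give two overlapping three-line blocks (six lines).
printf '%s\n' A,B,C,D,1 E,F,G,H,2 I,J,K,L,3 M,N,O,P,4 | window3
```

The output is lines 1-3 followed by lines 2-4, so every input line
except the first and last few shows up in several blocks.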

Now create the second sedscript

sedscript2:
--------------------------
{
N
N
s/,[^,]*\n/,/gp
d
}
--------------------------

This reads a three-line block at a time, removes the last field (and
the newline character) from all but the last line, replacing them with
a comma. Then it prints. And then it clears the pattern space. The
[^,]* lets the removed field be any length, so results with more than
one digit are handled too.
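
As a quick check of the joining step on its own, here is a small sketch
that feeds it one three-line block. The name join3 is made up for the
example, and the pattern uses [^,]* on the assumption that a result
field may be longer than one character.

```shell
# Join each group of three lines into one, dropping the last field
# of the first two lines and keeping the last field of the third.
join3() {
    sed -e 'N;N' -e 's/,[^,]*\n/,/g'
}

printf '%s\n' A,B,10 C,D,20 E,F,30 | join3
# prints: A,B,C,D,E,F,30
```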

So running

sed -f sedscript1 INPUT | sed -f sedscript2

should give you what you want. Like I said, the whole thing can
probably be done a lot more elegantly in awk. But my awk-fu is not
what it used to be.
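
For what it's worth, here is a rough awk sketch of the whole job in one
pass. Treat it as a starting point rather than a tested solution: it
assumes plain comma-separated fields with no quoted commas, and the
function name window_join and its two arguments (lines per window,
number of trailing result columns) are invented for illustration. Only
the attribute strings of the last n lines are kept, so the input can be
larger than memory.

```shell
# window_join N R: from the Nth input line onward, print the attribute
# fields of the last N lines followed by the last R fields (the
# results) of the newest line.
window_join() {
    awk -F, -v n="$1" -v r="$2" '
    {
        # Attributes are every field except the last r.
        attrs = ""
        for (i = 1; i <= NF - r; i++)
            attrs = attrs (i > 1 ? "," : "") $i
        buf[NR % n] = attrs          # ring buffer of the last n lines

        if (NR >= n) {
            line = ""
            for (k = NR - n + 1; k <= NR; k++)
                line = line (k > NR - n + 1 ? "," : "") buf[k % n]
            for (i = NF - r + 1; i <= NF; i++)
                line = line "," $i
            print line
        }
    }'
}

# Three lines per window, one result column, as in the example above.
window_join 3 1 < INPUT
```

Changing the window to 10 lines or the results to two columns is then
just a matter of calling window_join 10 2.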

For a quick reference for sed, try
http://www.grymoire.com/Unix/Sed.html

W
--
Ever stop to think, and forget to start again?
Sortir en Pantoufles: up 807 days, 18:51
