On Sun, Feb 22, 2009 at 11:06:31AM -0800, Penguin Lover Mark Knecht squawked:
> I've got a really big data file in essentially a *.csv format.
> (comma delimited) I need to scan this file and create a new output
> file. I'm wondering if there is a reasonably easy command line way of
> doing this using something like sed or awk which I know nothing about.
> Thanks in advance.

Definitely more than doable in sed or awk. If you want a reference
book, try http://oreilly.com/catalog/9781565922259/

Unfortunately I haven't used awk in the longest time and can't
remember how it would go. The following sed recipe may work, modulo
some small modifications.

> The basic idea goes something like this:
>
> 1) The input file might look like the following where some of it is
> attributes (shown as letters) and other parts are results. (shown as
> numbers)
>
> A,B,C,D,1
> E,F,G,H,2
> I,J,K,L,3
> M,N,O,P,4
> Q,R,S,T,5
> U,V,W,X,6
>
> 2) From the above data input file I want to take the attributes from a
> few preceding lines (say 3 in this example) and write them to the
> output file along with the result on the last of the 3 lines. The
> output file might look like this:
>
> A,B,C,D,E,F,G,H,I,J,K,L,3
> E,F,G,H,I,J,K,L,M,N,O,P,4
> I,J,K,L,M,N,O,P,Q,R,S,T,5
> M,N,O,P,Q,R,S,T,U,V,W,X,6
>
> 3) This must be done as a read/process/write operation of some sort
> because the input file may be far larger than system memory.
> (Currently it isn't, but it likely will eventually be.)
>
> 4) In my example above I suggested that there is a single result but
> there may be more than one. (Don't know yet.) I showed 3 lines but
> might be doing 10. I don't know. It's important to me to pick a
> moderately flexible way of dealing with this as the order of columns
> and number of results will likely change over time and I'll certainly
> need to adjust.

First create the sedscript

sedscript1:
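--------------------------
# Prime the pattern space with the first three lines.
1 {
N
N
}
# Print the window; stop after the window that ends on the last line.
p
$d
# Append the next line, then drop the oldest and restart the cycle.
N
D
--------------------------

The first block only hits when the first line of input is read. It
forces sed to read the next two lines as well.

The remaining commands run for every pattern space. p prints the
three-line block, and $d stops once the block ending on the last input
line has been printed. Otherwise N reads the next line, and D deletes
the first line and restarts the script without reading more input. N
has to come before D, since D ends the current cycle and nothing
placed after it ever runs.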
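If you want to check the windowing step by itself, here is a minimal sketch that runs it inline on the sample data from the question. The variable names (sample, windows) are made up for illustration, and it assumes a sed that accepts several -e expressions, which GNU and BSD sed both do. Note the command order: print, then append the next line, then delete the oldest.

```shell
# Sample input from the question.
sample='A,B,C,D,1
E,F,G,H,2
I,J,K,L,3
M,N,O,P,4
Q,R,S,T,5
U,V,W,X,6'

# 1{N;N;}  prime the pattern space with the first three lines
# p        print the current three-line window
# $d       stop after the window that ends on the last input line
# N        append the next input line
# D        delete the oldest line and restart without reading input
windows=$(printf '%s\n' "$sample" |
    sed -e '1{N;N;}' -e 'p' -e '$d' -e 'N' -e 'D')

printf '%s\n' "$windows"
```

For the six sample lines this should print four three-line blocks, twelve lines in all: lines 1 to 3, then 2 to 4, then 3 to 5, then 4 to 6.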

Now create the second sedscript

sedscript2:
--------------------------
{
# Read a three-line block into the pattern space.
N
N
# Drop the last field and the newline of all but the last line.
s/,[^,]*\n/,/gp
# Clear the pattern space so the block is not printed twice.
d
}
--------------------------

This reads a three-line block at a time, removes the last field (and
the newline character) from all but the last line, replacing it with
a comma. Then it prints. And then it clears the pattern space. The
[^,]* lets the removed field be longer than one character.
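The joining step can be tried on a single three-line block too. This is only a sketch: the variable names (block, joined) are invented, the results are made two digits long purely for illustration, and the pattern is written as [^,]* on the assumption that real results can be longer than one character.

```shell
# One three-line block, as the windowing step would emit it.
block='A,B,C,D,10
E,F,G,H,20
I,J,K,L,30'

# N;N pulls all three lines into the pattern space. The substitution
# then replaces ",<last field><newline>" with "," wherever a newline
# follows, which leaves the last field of the final line alone.
joined=$(printf '%s\n' "$block" |
    sed -n -e 'N;N' -e 's/,[^,]*\n/,/gp')

printf '%s\n' "$joined"
```

This should print the single line A,B,C,D,E,F,G,H,I,J,K,L,30.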

So running

sed -f sedscript1 INPUT | sed -f sedscript2

should give you what you want. To combine more than three lines, add
more N commands: for k lines you need k - 1 of them in the first block
of sedscript1 and in sedscript2. Like I said, the whole thing can
probably be done a lot more elegantly in awk. But my awk-fu is not
what it used to be.
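For what it's worth, here is a rough awk sketch of the same idea, not a tested production script. It assumes the results are always the trailing columns of each line. The function name slide and its two arguments (how many lines to combine, how many trailing result columns) are made up for this example. Only the newest few attribute lists are held in memory, so the input can be larger than RAM.

```shell
# slide WINDOW NRES: read CSV on stdin, write the combined lines to stdout.
slide() {
    awk -F, -v window="$1" -v nres="$2" '
    {
        # Attributes: every field except the last nres fields.
        attrs = ""
        for (i = 1; i <= NF - nres; i++)
            attrs = (i > 1 ? attrs "," : "") $i

        # Results: the last nres fields of the current line,
        # each one prefixed with a comma.
        res = ""
        for (i = NF - nres + 1; i <= NF; i++)
            res = res "," $i

        # Circular buffer: only the newest window attribute lists are kept.
        ring[NR % window] = attrs

        # Once enough lines have been seen, emit oldest to newest.
        if (NR >= window) {
            out = ring[(NR - window + 1) % window]
            for (k = NR - window + 2; k <= NR; k++)
                out = out "," ring[k % window]
            print out res
        }
    }'
}

# Usage on the first four sample lines: combine 3 lines, 1 result column.
printf '%s\n' A,B,C,D,1 E,F,G,H,2 I,J,K,L,3 M,N,O,P,4 | slide 3 1
```

The usage line should print A,B,C,D,E,F,G,H,I,J,K,L,3 and then E,F,G,H,I,J,K,L,M,N,O,P,4. For a real file it would be something like slide 10 2 < INPUT > OUTPUT, which is where the two arguments earn their keep when the window or the number of results changes.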

For a quick reference for sed, try
http://www.grymoire.com/Unix/Sed.html

W
--
Ever stop to think, and forget to start again?
Sortir en Pantoufles: up 807 days, 18:51