Gentoo Archives: gentoo-user

From: Alexander Kapshuk <alexander.kapshuk@×××××.com>
To: Gentoo mailing list <gentoo-user@l.g.o>
Subject: Re: [gentoo-user] OT: Extracting year from data, but honour empty lines
Date: Sat, 12 May 2018 06:17:18
Message-Id: CAJ1xhMUxxbaz53+BufRvSaRoSbHAfy=EX9YcRnf3dEj_g=ht4Q@mail.gmail.com
In Reply to: [gentoo-user] OT: Extracting year from data, but honour empty lines by Daniel Frey
On Sat, May 12, 2018 at 2:16 AM, Daniel Frey <djqfrey@×××××.com> wrote:
> Hi all,
>
> I am trying to do something relatively simple and I've had something
> working in the past, but my brain just doesn't want to work today.
>
> I have a text file with the following (this is just a subset of about
> 2500 dates, and I don't want to edit these all by hand if I can avoid it):
>
> --- START ---
> December 2, 1994
> March 27, 1992
> June 4, 1994
> 1993
> January 11, 1992
> January 3, 1995
>
>
> March 12, 1993
> July 12, 1991
> May 17, 1991
> August 7, 1992
> December 23, 1994
> March 27, 1992
> March 1995
> --- END ---
>
> As you can see, there's no standard in the way the date is formatted.
> Some of them are also formatted YYYY-MM-DD and MM-DD-YYYY.
>
> I have a basic grep that I tossed together:
>
> grep -o '\([0-9]\{4\}\)'
>
> This does extract the year but yields the following:
>
> 1994
> 1992
> 1994
> 1993
> 1992
> 1995
> 1993
> 1991
> 1991
> 1992
> 1994
> 1992
> 1995
>
> As you can see, the two empty lines are removed but this will cause
> problems with data not lining up later on.
>
> Does anyone have a quick tip for my tired brain to make this work and
> just output a blank line if there's no match? I swear I did this months
> ago and had something working but I apparently didn't bother saving the
> script I made. Argh!
>
> Dan
>

Here are awk and sed scripts for you to try:
cat dates
December 2, 1994
March 27, 1992
June 4, 1994
1993
January 11, 1992
January 3, 1995


March 12, 1993
July 12, 1991
May 17, 1991
August 7, 1992
December 23, 1994
March 27, 1992
March 1995

2018-05-12
05-12-2018

awk 'match($0,/[0-9][0-9][0-9][0-9]/){
    print substr($0, RSTART, RLENGTH)
}
/^$/
' dates

1994
1992
1994
1993
1992
1995


1993
1991
1991
1992
1994
1992
1995

2018
2018

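The awk script above prints the year for lines that contain one and echoes empty lines, but a non-empty line with no four-digit run would be dropped, which would shift the output again. A minimal sketch of a variant that prints a blank line for every non-matching line is below. The sample file and its "no date recorded" line are made up for illustration and are not part of the original data.

```shell
# Hypothetical sample input; the "no date recorded" line is invented
# to show what happens to a non-empty line without a year.
dates_file=$(mktemp)
printf '%s\n' 'December 2, 1994' '' 'no date recorded' \
    '2018-05-12' '05-12-2018' > "$dates_file"

# Print the first four-digit run on each line, or an empty line when
# there is none, so the output has one line per input line.
awk_years=$(awk '{
    if (match($0, /[0-9][0-9][0-9][0-9]/))
        print substr($0, RSTART, RLENGTH)
    else
        print ""
}' "$dates_file")

printf '%s\n' "$awk_years"
rm -f "$dates_file"
```

Because every input line produces exactly one output line, the result can be pasted next to the original data (for example with paste) without drifting out of step.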
sed 's/.*\([0-9][0-9][0-9][0-9]\).*/\1/p
/^$/p
d' dates

1994
1992
1994
1993
1992
1995


1993
1991
1991
1992
1994
1992
1995

2018
2018
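
As with the awk version, the sed script above only keeps lines that either contain a year or are empty. A possible variant that blanks out any other line instead of deleting it is sketched below; it relies on the t command, which branches to the end of the script when the preceding substitution succeeded. The sample input is again hypothetical. Note that because the leading .* is greedy, sed picks the last four-digit run on a line, whereas awk's match() picks the first; for the date formats shown here the two agree.

```shell
# Hypothetical sample input; "no date recorded" is an invented line.
dates_file=$(mktemp)
printf '%s\n' 'March 1995' '' 'no date recorded' '05-12-2018' > "$dates_file"

# 1st -e: reduce a line containing a four-digit run to that run.
# 2nd -e: if that substitution succeeded, skip the rest of the script.
# 3rd -e: otherwise empty the line, so it is still printed as a blank.
sed_years=$(sed -e 's/.*\([0-9][0-9][0-9][0-9]\).*/\1/' \
                -e t \
                -e 's/.*//' "$dates_file")

printf '%s\n' "$sed_years"
rm -f "$dates_file"
```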