Gentoo Archives: gentoo-amd64

From: Michael Rock <rockmikerock@×××××.com>
To: gentoo-amd64@l.g.o
Subject: Re: [gentoo-amd64] Whats using all my disc space?
Date: Fri, 17 Oct 2008 13:36:09
Message-Id: 591240b10810170636y718e68f3s65976aefaa8b5289@mail.gmail.com
In Reply to: Re: [gentoo-amd64] Whats using all my disc space? by Paul Stear
Did anyone suggest fslint? Apart from finding unnecessary files, duplicates
and broken links, there's an education to be had in the clever scripting
behind it all:
*********************************************************************************************************
#!/bin/bash

# findup - find duplicate files
# Copyright (c) 2000-2006 by Pádraig Brady <P@××××××××××.com>.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the GNU General Public License for more details,
# which is available at www.gnu.org


# Description
#
# will show duplicate files in the specified directories
# (and their subdirectories), in the format:
#
# 2 * 2048 file1 file2
# 3 * 1024 file3 file4 file5
# 2 * 1024 file6 file7
#
# Where the first number is the count of duplicate files on that line,
# the second is the disk usage in bytes of each of those files, and all
# duplicate files are shown on the same line.
# Output is ordered by largest disk usage first and
# then by the number of duplicate files.
#
# Caveats/Notes:
# I compared this to the equivalent utils I could find (as of Nov 2000)
# and it's (by far) the fastest, has the most functionality (thanks to
# find) and has no (known) bugs. In my opinion fdupes is the next best but
# is slower (even though written in C), and has a bug where hard links
# in different directories are sometimes reported as duplicates.
#
# This script requires uniq > V2.0.21 (part of GNU textutils|coreutils)
# undefined operation if any dir/file names contain \n or \\
# sparse files are not treated differently.
# Don't specify params to find that affect output etc. (e.g. -printf etc.)
# zero length files are ignored.
# symbolic links are ignored.
# path1 & path2 can be files &/or directories

script_dir=`dirname "$0"`              #directory of this script
script_dir=`readlink -f "$script_dir"` #Make sure absolute path

. "$script_dir"/supprt/fslver

Usage() {
  ProgName=`basename "$0"`
  echo "find dUPlicate files.
Usage: $ProgName [[-t [-m|-d]] [-r] [-f] path(s) ...]

If no path(s) specified then the current directory is assumed.

When -m is specified any found duplicates will be merged (using hardlinks).
When -d is specified any found duplicates will be deleted (only 1 left).
When -t is specified, only report what -m or -d would do.

You can also pipe output to $script_dir/fstool/dupwaste to
get a total of the wastage due to duplicates.

Examples:

search for duplicates in current directory and below
    findup or findup .
search for duplicates in all linux source directories and merge using hardlinks
    findup -m /usr/src/linux*
same as above but don't look in subdirectories
    findup -r .
search for duplicates in /usr/bin
    findup /usr/bin
search in multiple directories but not their subdirectories
    findup -r /usr/bin /bin /usr/sbin /sbin
search for duplicates in \$PATH
    findup \`$script_dir/supprt/getffp\`
search system for duplicate files over 100K in size
    findup / -size +100k
search only my files (that I own and are in my home dir)
    findup ~ -user \`id -u\`
search system for duplicate files belonging to roger
    findup / -user \`id -u roger\`"
  exit
}

for arg
do
  case "$arg" in
  -h|--help|-help)
    Usage ;;
  -v|--version)
    Version ;;
  --gui)
    mode="gui" ;;
  -m)
    mode="merge" ;;
  -d)
    mode="del" ;;
  -t)
    t="t" ;;
  *)
    argsToPassOn="$argsToPassOn '$arg'"
  esac
done
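# hard links can't cross filesystems, so when merging restrict find to
# one device with -xdev (candidates elsewhere couldn't be linked anyway)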
116 [ "$mode" = "merge" ] && argsToPassOn="$argsToPassOn -xdev"
117
118 if [ ! -z "$mode" ]; then
119 forceFullPath="-f"
120 sep_mode="prepend"
121 else
122 sep_mode="none"
123 fi
124
125 if [ "$mode" = "gui" ] || [ "$mode" = "merge" ] || [ "$mode" = "del" ]; then
126 merge_early="" #process hardlinks
127 else
128 merge_early="-u" #ignore hardlinks
129 fi

. "$script_dir"/supprt/getfpf $forceFullPath "$argsToPassOn"

check_uniq

if [ "`find . -maxdepth 0 -printf '%D' 2> /dev/null`" = "D" ]
then
  devFmt="\060" #0 (this find can't print the device ID)
else
  devFmt=%D #including the device ID helps find more duplicate files
fi
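
# Pipeline overview: list every non-empty regular file as
# name/dev/inode/size, keep only sizes that occur more than once, then
# confirm the surviving candidates with md5 (and again with sha1) so
# only byte-identical files are reported.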
#print name, inode & size.
find "$@" -size +0c -type f -printf "$FPF\0$devFmt\0%i\0%s\n" |
tr ' \t\0' '\0\1 ' | #remove spaces, tabs in file names
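# (the tr above also swaps find's \0 field separators to spaces, so the
# sort/uniq stages below can treat each line as space-delimited
# name/dev/inode/size fields; a later tr reverses the mapping)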
sort -k2,2n -k4,4nr -k3,3 $merge_early | #group [and merge] dev,size & inodes
if [ -z "$merge_early" ]; then
  $script_dir/supprt/rmlint/merge_hardlinks
else
  uniq -3 -D #pick just duplicate filesizes
fi |
sort -k3,3n |        #NB sort inodes so md5sum does less seeking all over disk
cut -f1 -d' ' -s |   #get filenames to work on
tr '\0\1\n' ' \t\0' |#reset any space & tabs etc and delimit names with \0
xargs -r0 md5sum -- |#calculate md5sums for possible duplicates
sort |               #group duplicate files together
tr ' \t' '\1\2' |    #remove spaces & tabs again (sed can't match \0)
sed -e 's/\(^.\{32\}\)..\(.*\)/\2 \1/' | #switch sums and filenames

# The following optional block checks duplicates again using sha1.
# Note for data sets that don't totally fit in cache this will
# probably read duplicate files off the disk again.
uniq --all-repeated -1 | #pick just duplicates
cut -d' ' -f1 |          #get filenames
sort |                   #sort by paths to try to minimise disk seeks
tr '\1\2\n' ' \t\0' |    #reset any space & tabs etc and delimit names with \0
xargs -r0 sha1sum -- |   #to be sure to be sure
sort |                   #group duplicate files together
tr ' \t' '\1\2' |        #remove spaces & tabs again (sed can't match \0)
sed -e 's/\(^.\{40\}\)..\(.*\)/\2 \1/' | #switch sums and filenames

uniq --all-repeated=$sep_mode -1 | #pick just duplicates
sed -e 's/\(^.*\) \(.*\)/\2 \1/' | #switch sums and filenames back
tr '\1\2' ' \t' |                  #put spaces & tabs back

if [ ! -z "$mode" ]; then
  cut -d' ' -f2- |
  if [ "$mode" != "gui" ]; then # external call to python as this is faster
    if [ -f "$script_dir"/supprt/rmlint/fixdup.py ]; then
      "$script_dir"/supprt/rmlint/fixdup.py $t$mode
    elif [ -f "$script_dir"/supprt/rmlint/fixdup.sh ]; then
      "$script_dir"/supprt/rmlint/fixdup.sh $t$mode
    else
      echo "Error, couldn't find merge util" >&2
      exit 1
    fi
  else
    cat
  fi
else
  (
  psum='no match'
  line=''
  declare -i counter
  while read sum file; do #sum is delimited by first space
    if [ "$sum" != "$psum" ]; then
      if [ ! -z "$line" ]; then
        echo "$counter * $line"
      fi
      counter=1
      line="`du -b "$file"`"
      psum="$sum"
    else
      counter=counter+1 #Use bash arithmetic, not expr (for speed)
      line="$line $file"
    fi
  done

  if [ ! -z "$line" ]; then
    echo "$counter * $line"
  fi
  ) |
  sort -k3,3 -k1,1 -brn
fi
*************************************************************************************************************

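The core idea above is worth stealing for one-off jobs: compare cheap
file sizes first, and only checksum files whose sizes collide. Below is a
minimal standalone sketch of that technique (my own illustration, not
part of fslint), assuming GNU find/coreutils and file names containing no
newlines or tabs:

#!/bin/bash
# dupsketch - crude duplicate finder: size pass first, then md5 pass
dir=${1:-.}

# list "size<TAB>path" for non-empty regular files, grouped by size
find "$dir" -type f -size +0c -printf '%s\t%p\n' |
sort -n |
# keep only paths whose size occurs more than once (duplicate candidates)
awk -F'\t' '{ if ($1 == prev) { if (prevline != "") print prevline;
              print $2; prevline = "" } else prevline = $2; prev = $1 }' |
# checksum candidates; matching first 32 chars (the md5) marks duplicates
xargs -r -d '\n' md5sum -- |
sort |
uniq -w32 --all-repeated=separate

Run it as, say, ./dupsketch ~/tmp and each group of identical files is
printed separated by a blank line. It has none of findup's hardlink
handling, \0-safe quoting or sha1 double-check, which is exactly the
education the real script provides.
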
On Fri, Oct 17, 2008 at 7:12 AM, Paul Stear <gentoo@××××××××××××.com> wrote:

> On Thursday 16 October 2008 15:52:59 Richard Freeman wrote:
>
> > To add to the chorus of suggestions, may I offer "kdirstat"? It is in
> > portage and does a great job of mapping file use, as well as some
> > administrative tools for cleanup. Just be careful when deleting files
> > that you don't just move them to the trash.
>
> Well thanks again for all responses, kdirstat is now emerged and looks
> good at identifying all my rubbish.
> Paul
>
> --
> This message has been sent using kmail with gentoo linux
>
>