Gentoo Archives: gentoo-dev

From: Brian Harring <ferringb@×××××.com>
To: gentoo-dev@l.g.o
Subject: Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
Date: Sun, 25 Apr 2010 11:47:26
Message-Id: 20100425114519.GD16877@hrair
In Reply to: [gentoo-dev] [RFC][NEW] Utility to find orphaned files by Angelo Arrifano
On Sun, Apr 25, 2010 at 01:18:25PM +0200, Angelo Arrifano wrote:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

You're going to want to realpath the paths here; you'll also need to
handle symlinks, and keep in mind that spaces are allowed in paths. I'd
personally suggest using one of the PM (package manager) APIs for this.
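
For anyone who does want to parse by hand, here is a rough sketch of the path handling described above. It assumes the usual Portage vdb CONTENTS layout (`dir <path>`, `obj <path> <md5> <mtime>`, `sym <path> -> <target> <mtime>`); the helper names `parse_contents_line` and `canonical` are made up for illustration and are not part of any PM API.

```python
import os


def parse_contents_line(line):
    """Return (kind, path) for one CONTENTS-style line.

    Paths may contain spaces, so the fixed trailing fields are peeled
    off from the right instead of splitting on every space.
    """
    kind, rest = line.rstrip("\n").split(" ", 1)
    if kind == "dir":
        # dir <path>
        return kind, rest
    if kind == "obj":
        # obj <path> <md5> <mtime>
        path, _md5, _mtime = rest.rsplit(" ", 2)
        return kind, path
    if kind == "sym":
        # sym <path> -> <target> <mtime>
        # Limitation: a path that itself contains " -> " is ambiguous here.
        body, _mtime = rest.rsplit(" ", 1)
        path, _target = body.split(" -> ", 1)
        return kind, path
    raise ValueError("unrecognised entry type: %r" % kind)


def canonical(path):
    """Resolve symlinks in the parent directories, keep the last component.

    This lets a path recorded through a symlinked directory compare equal
    to the same file found on disk, while a symlink entry itself is still
    compared as a symlink rather than as its target.
    """
    parent, name = os.path.split(path)
    return os.path.join(os.path.realpath(parent), name)
```

Both the recorded paths and the scanned paths would need to go through the same `canonical` step before comparing them, otherwise files reached via a symlinked directory show up as false orphans.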

Part of the reason I advise poking at the PM APIs is that they cover up
some of the nastier details of contents handling, and others of parsing;
simple example:

python -c "
import sys
from pkgcore.config import load_config
from pkgcore.fs import contents, livefs
# collect everything the installed packages (vdb) claim to own
owned = contents.contentsSet()
for pkg in load_config().get_default('domain').named_repos['vdb']:
    owned.update(pkg.contents)
# walk the live filesystem and keep entries no package owns
stream = (x for x in livefs.iter_scan(sys.argv[1]) if x not in owned)
print('\n'.join(map(str, sorted(stream))))
" desired-path

Note also that this was written *very* quickly. I'd personally look at
serializing the sorted lists to disk for both streams (what contents
says is on disk vs. what is actually on disk), and then walking the two
lists in lockstep; that way you can keep the memory usage down.
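
The lockstep walk could look roughly like the sketch below. It assumes both inputs are already sorted with the same ordering and yield comparable items (here plain path strings); `orphans` is a hypothetical helper name, not something pkgcore provides.

```python
def orphans(on_disk, owned):
    """Yield entries of the sorted iterable on_disk missing from sorted owned.

    Only one item from each stream is held at a time, so memory use
    stays flat no matter how long the lists are.
    """
    owned = iter(owned)
    current = next(owned, None)
    for path in on_disk:
        # Advance the owned stream until it catches up with this path.
        while current is not None and current < path:
            current = next(owned, None)
        if current != path:
            yield path
```

With the two lists serialized one path per line, both file objects can be passed in directly, e.g. `orphans(open('disk.lst'), open('owned.lst'))` (file names made up). If the files are sorted outside Python, the sort has to use the same byte/code-point order as Python's string comparison, otherwise the walk gets out of step.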

~harring