Gentoo Archives: gentoo-portage-dev

From: Brian Harring <ferringb@×××××.com>
To: gentoo-portage-dev@l.g.o
Subject: [gentoo-portage-dev] intrusive spring cleaning/sanitizing
Date: Sun, 09 Apr 2006 13:27:41
Message-Id: 20060409132733.GB3172@nightcrawler.had1.or.comcast.net
1 Alright, patch #2.
2
3 This one is not 2.1 material.
4
5 There are bugs in it (sheer size of changes gurantees I typo'd
6 somewhere and haven't yet caught in re-reading the diff).
7
8 It _does_ introduce a good potential for non typo'd bugs due to
9 migrating config over to being a saner dict equiv. They've been
10 stomped where found, but it's not a typo style bug; it introduces the
11 possibility of bugs in consumer code that uses config objects and
12 weren't doing containment tests on config instances prior to
13 accessing (or using get which provides it)
14
15
16 So... changes. Aside from the earlier list from "spring cleaning",
17 following general changes-
18
19 1) list comprehensions are your friend if you're doing filtering of a
20 list.
21 python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf":del l[i]\n\telse:i+=1'
22 100 loops, best of 3: 4.12 msec per loop
23 python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf":del l[i]'
24 100 loops, best of 3: 3 msec per loop
25 python -m timeit '[x for x in range(1000) if x == "asdf"]'
26 1000 loops, best of 3: 887 usec per loop
27
28 Lesson there is that inventing your own for loop is fricking slow
29 (portage does this in several key spots).
30 Trying to dodge the cost of removal by popping from the right also is
31 costly.
32 List comp's, on the other hand are cheap in worst case (as
33 demonstrated above), but still win out in best case (1.9ms vs 3.75 for
34 "invent your own loop" and 2.28ms for reversed walking). There still
35 exists code that uses the slower versions, mainly because they're a
36 bit harder to remove.
37
38 Granted the bit above is kind of a "well duh", but portage code is
39 kinda old/crufty :)
40
41 2) config.__getitem__ previously would function as the equiv of
42 get(key, ""); this sucks because to see if a key exists, and grab
43 the value, you have to do a containment test _then_ pull the value.
44
45 There's a reason dicts throw KeyErrors, and the profileration of
46 has_key's usage in portage is proof of it.
47
48 3) flesh out config to support usual iter*/len/__nonzero__ tests.
49 4) Use itertools were appropriate.
50 5) cacheddir isn't always your friend in listdir calls. If it's not
51 cached, or it's stale and you're not doing files/dir/recursion, a
52 simple os.listdir suffices (and is far faster since it doesn't
53 trigger stat calls on each entry).
54 6) elog_process doesn't need to reinvent the "dynamically grab an
55 attribute from a module", load_mod exists already.
56 7) don't pop from a list if you're just accessing it (stupid and
57 slow).
58 8) Know the return type. If None is returned when an fs accessing
59 function is called for a nonexistant file, calling said func and
60 testing for the file yourself is dumb.
61 9) if blah in dar:
62 for x in dar[blah]:
63 # is dumb. use.
64 for x in dar.get(blah, []):
65 10) if os.path.exists(blah):
66 return open(blah).read()
67 return ""
68
69 is dumb, try to open the file and catch the IOError instead; one
70 less syscall involved...
71 11) when building a unique list, do _not_ search the list and append
72 if it doesn't exist in it. It's amazing how much portage does
73 this :/
74 12) collision-protect cleanup. Existing code is pretty ugly, reworked
75 it to load contents sets once (and don't hold them in memory),
76 filtering per contents set instead of iterating over the image
77 file set. Up shot, faster, and it's able to dump all colliding
78 files instead of the first hit.
79 13) added iter funcs for cp_all/cpv_all. Eliminates building a list
80 in memory, and consumer code reacts immediately instead of pausing
81 while the list is built.
82 14) initial dead method removal.
83
84 So... yeah. Speeds things up (algorithmic improvements), less code,
85 and uses less intermediate memory; downside is the __getitem__ change
86 introduces potential for complaints from consumer code, and the size
87 of the changes make auditing the patch a bit fun.
88
89 Throwing it y'alls way for anyone who is interested/wants to poke at
90 it- fixes/comments welcome (especially considering this is what I'm
91 running now).
92
93 ~harring

Attachments

File name MIME type
intrusive-cleanup.patch.bz2 application/octet-stream

Replies

Subject Author
Re: [gentoo-portage-dev] intrusive spring cleaning/sanitizing Brian Harring <ferringb@×××××.com>