On 10/22/19 2:51 AM, Jaco Kroon wrote:
> Hi All,
>
>
> On 2019/10/21 18:42, Richard Yao wrote:
>>
>> If we consider the access frequency, it might actually not be that
>> bad. Consider a simple example with 500 files and two directory
>> buckets. If we have 250 in each, then the size of the directory is
>> always 250. However, if 50 files are accessed 90% of the time, then
>> putting 450 into one directory and that 50 into another directory, we
>> end up with the performance of the O(n) directory lookup being
>> consistent with there being only 90 files in each directory.
>>
>> I am not sure if we should be discarding all other considerations to
>> make changes to benefit O(n) directory lookup filesystems, but if we
>> are, then the hashing approach is not necessarily the best one. It is
>> only the best when all files are accessed with equal frequency, which
>> would be an incorrect assumption. A more human friendly approach might
>> still be better. I doubt that we have the data to determine that though.
>>
>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>> just have the mirrors do the hashing behind the scenes. Then we would
>> have the best of both worlds.
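As a quick sanity check on the expected scan length above (a back-of-the-envelope sketch, not from the original thread), the 90-entry figure is just the access-frequency-weighted average of the two directory sizes:

```python
# Expected entries scanned per lookup for an O(n) directory scan,
# using the hypothetical 50-hot/450-cold split quoted above.
hot_dir, cold_dir = 50, 450      # directory sizes
p_hot, p_cold = 0.9, 0.1         # fraction of lookups hitting each

expected = p_hot * hot_dir + p_cold * cold_dir
print(expected)  # 90.0 -- matches the "only 90 files" claim
```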
>
>
> Experience:
>
> ext4 sucks at targeted name lookups without the dir_index feature (O(n)
> lookups - it scans all entries in the folder). With dir_index, readdir
> performance is crap. Pick your poison, I guess. On most of our larger
> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
> disabling dir_index, as the benefit is outweighed by the crappy readdir()
> and glob() performance.
My read of the ext4 disk layout documentation is that the read operation
should work mostly the same way, except with a penalty from reading
larger directories caused by the addition of the tree's metadata and
from having more partially filled blocks:

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries

The code itself is the same traversal code:

https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106

However, a couple of things stand out to me here at a glance:

1. `cond_resched()` adds scheduler delay for no apparent reason.
`cond_resched()` is meant to be used in places where we could block
excessively on non-PREEMPT kernels, but this doesn't strike me as one of
those places. The fact that we block on disk on uncached reads naturally
serves the same purpose, so an explicit rescheduling point here is
redundant. PREEMPT kernels should perform better in readdir() on ext4 by
virtue of making `cond_resched()` a no-op.

2. read-ahead is implemented in a way that appears to be over-reading
the directory whenever the needed information is not cached. This is
technically read-ahead, although it is not a great way of doing it. A
much better way to do this would be to pipeline `readdir()` by
initiating asynchronous read operations in anticipation of future reads.
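For illustration, the pipelining idea can be sketched in userspace as a reader thread filling a bounded queue while the consumer parses blocks. The names and the fake block list here are invented for the sketch; in the kernel this would be asynchronous block I/O, not threads:

```python
import queue
import threading

# Toy model of a pipelined readdir(): while the consumer "parses"
# block N, the reader has already started fetching block N+1.
DISK = [f"block-{i}" for i in range(8)]  # stand-in for directory blocks

def reader(q):
    for blk in DISK:   # in-kernel: issue async reads ahead of the consumer
        q.put(blk)     # hand each completed read to the consumer
    q.put(None)        # end-of-directory marker

def readdir_pipelined():
    q = queue.Queue(maxsize=2)  # bounded queue = bounded read-ahead window
    t = threading.Thread(target=reader, args=(q,))
    t.start()
    entries = []
    while (blk := q.get()) is not None:
        entries.append(blk)     # "parse" the directory entries in the block
    t.join()
    return entries

print(len(readdir_pipelined()))  # 8
```

The bounded queue is the important part: it caps how far the read-ahead can run past the consumer, which avoids the over-reading problem described above.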

Both of these should affect both variants of ext4's directories, but the
penalties I mentioned earlier mean that the dir_index variant would be
affected more.

If you have a way to benchmark things, a simple idea to evaluate would
be deleting the `cond_resched()` line. If we had data showing an
improvement, I would be happy to send a small one-line patch deleting
the line to Ted to get the change into mainline.

> There doesn't seem to be a real specific tip-over point, and it seems to
> depend a lot on RAM availability and harddrive speed (obviously). So if
> dentries get cached, disk speed becomes less of an issue. However, on
> large folders (where I typically use 10k as a value for large, based on
> "gut feeling" and "unquantifiable experience" and "nothing scientific at
> all") I find that even with lots of RAM, two consecutive ls commands
> remain terribly slow. Switch off dir_index and that becomes an order of
> magnitude faster.
>
> I don't have a great deal of experience with XFS, but on those systems
> where we do it's generally on a VM, and perceivably (again, not
> scientific) our experience has been that it feels slower. Again, not
> scientific, just perception.
>
> I'm in support of the change. This will bucket to 256 folders and
> should have a reasonably even split between folders. If required, a
> second layer could be introduced by using the 3rd and 4th digits of the
> hash. Any hash should be fine; it really doesn't
> need to be cryptographically strong, it just needs to provide a good
> spread and be really fast. Generally a hash table should have a prime
> number of buckets to mitigate hash bias, but frankly, that's over-
> complicating the situation here.
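A sketch of that bucketing scheme (the choice of BLAKE2 and the helper name here are illustrative assumptions, not something the thread settled on):

```python
import hashlib

# Illustrative only: place a distfile into one of 256 directories named
# after the first two hex digits of a fast hash of its file name.
def bucket_path(distfile: str) -> str:
    digest = hashlib.blake2b(distfile.encode()).hexdigest()
    return f"{digest[:2]}/{distfile}"

print(bucket_path("gentoo-sources-5.3.tar.xz"))
```

A second layer, as suggested above, would just extend the path with `digest[2:4]`. Since the hash only needs to spread names evenly, any fast non-cryptographic hash would serve equally well here.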
>
> I also agree with others that it used to be easy to get distfiles as and
> when needed, so an alternative structure could mirror that of the
> portage tree itself, in other words "cat/pkg/distfile". This perhaps
> just shifts the issue:
>
> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name
> "*-*" | wc -l
> 167
> jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
> 19412
> jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i
> -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
> 347 net-misc
> 373 media-sound
> 395 media-libs
> 399 dev-util
> 505 dev-libs
> 528 dev-java
> 684 dev-haskell
> 690 dev-ruby
> 1601 dev-perl
> 1889 dev-python
>
> So that's an average of 116 sub folders under the top layer (only two over
> 1000), and then presumably less than 100 distfiles maximum per package?
> Probably overkill, but it would (should) solve both the too many files
> per folder issue as well as the easy lookup by hand issue.
>
> I don't have a preference on either solution though, but I do agree that
> "easy finding of distfiles" is handy. The INDEX mechanism is fine for me.
>
> Kind Regards,
>
> Jaco
>