On Tuesday, August 09, 2016 05:22:22 PM james wrote:
> On 08/09/2016 01:41 PM, Michael Mol wrote:
> > On Tuesday, August 09, 2016 01:23:57 PM james wrote:
|
> > The exception is my storage cluster, which has dirty_bytes much higher, as
> > it's very solidly battery backed, so I can use its oodles of memory as a
> > write cache, giving its kernel time to reorder writes and flush data to
> > disk efficiently, and letting clients very rapidly return from write
> > requests.
> Are these TSdB (time series data) by chance?
|
No; my TS data is stored in a MySQL VM whose storage is host-local.
|
>
> OK, so have you systematically experimented with these parameter
> settings, collected and correlated the data, domain (needs) specific?
|
Not with these particular settings; what they *do* is fairly straightforward,
so establishing configuration constraints is a function of knowing the capacity
and behavior of the underlying hardware; there's little need to guess.
|
For a hypothetical example, let's say you're using a single spinning rust disk
with an enabled write cache of 64MiB. (Common enough, although you should
ensure the write cache is disabled if you find yourself at risk of sudden
power loss. You should be able to script that with nut, or even acpid,
though.) That means the disk can queue up 64MiB of data to be written, and
efficiently reorder writes to flush them to disk faster. So, in that
circumstance, perhaps you'd set dirty_background_bytes to 64MiB, so that the
kernel will try to feed the drive a full cache's worth of data at once,
giving it a chance to optimize its write ordering.
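
A minimal sketch of what that looks like in practice (the sysctl.d path and
the device name are assumptions; adjust for your system):

    # Match dirty_background_bytes to the drive's 64MiB write cache.
    echo 'vm.dirty_background_bytes = 67108864' >> /etc/sysctl.d/99-writeback.conf
    sysctl -p /etc/sysctl.d/99-writeback.conf

    # And the piece you'd hang off nut or acpid: disable the on-disk
    # write cache when running on battery or otherwise at risk.
    hdparm -W 0 /dev/sda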
|
For another hypothetical example, let's say you're using a parity RAID array
with three data disks and two parity disks, with a strip (per-disk chunk)
size of 1MiB. Now, with parity RAID, if you modify a small bit of data, when
that data gets committed to disk, the parity bits need to get updated as
well. That means a small write requires first reading the relevant portions
of all three data disks, holding them in memory, adjusting the portion you
wrote to, calculating the parity, and writing the result out to all five
disks. But if you make a *large* write that replaces all of the data in the
stripe (so, a well-placed 3MiB write, in this case), you don't have to read
the disks to find out what data was already there, and can simply write out
your data and parity. In this case, perhaps you want to set
dirty_background_bytes to 3MiB (or some multiple thereof), so that the kernel
doesn't try flushing data to disk until it has a full stripe's worth of
material, and can forgo the time-consuming initial read.
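
Again as a sketch, with the arithmetic spelled out (array device assumed):

    # Full stripe = 3 data disks x 1MiB strip = 3MiB = 3145728 bytes.
    # On Linux md RAID, confirm the strip ("chunk") size with:
    #   mdadm --detail /dev/md0 | grep 'Chunk Size'
    echo 'vm.dirty_background_bytes = 3145728' >> /etc/sysctl.d/99-writeback.conf
    sysctl -p /etc/sysctl.d/99-writeback.conf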
|
For a final hypothetical example, consider SSDs. SSDs share one interesting
trait with parity RAID arrays: they have an optimum write size that's a lot
larger than 4KiB. When you write a small amount of data to an SSD, it has to
read an entire erase block of NAND flash, modify it in its own RAM, and write
that entire block back out to NAND flash. (All of this happens internally to
the SSD.) So, for efficiency, you want to give the SSD an entire block's
worth of data to write at a time, if you can. So you might set
dirty_background_bytes to the size of the SSD's erase block, because the
fewer the write cycles, the longer the drive will last. (Different model SSDs
have different block sizes, ranging anywhere from 512KiB to 8MiB, currently.)
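
One more sketch, this time assuming a 2MiB erase block; the real figure has
to come from the drive's datasheet, since it varies by model:

    # Hypothetical 2MiB erase block = 2097152 bytes.
    echo 'vm.dirty_background_bytes = 2097152' >> /etc/sysctl.d/99-writeback.conf
    sysctl -p /etc/sysctl.d/99-writeback.conf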
|
>
> As unikernels collide with my work on building up minimized and
> optimized linux clusters, my pathway forward is to use several small
> clusters, where the codes/frameworks can be changed, even the
> tweaked-tuned kernels and DFS, and note the performance differences for
> very specific domain solutions. My examples are quite similar to that
> aforementioned flight sim above, but the ordinary and uncommon
> workloads of regular admin (dev/ops) work are only a different domain.
>
> Ideas on automating the exploration of these settings
> (scripts/traces/keystores) are keenly of interest to me, just so you know.
|
I think I missed some context, despite rereading what was already discussed.
|
>
> >> I use OpenRC, just so you know. I also have a motherboard with an IOMMU
> >> that currently has questionable settings in the kernel config file. I
> >> cannot find consensus on if/how the IOMMU affects IO with the SATA HD
> >> devices versus memory-mapped peripherals... in the context of 4.x kernel
> >> options. I'm trying very hard here to avoid a deep dive on these issues,
> >> so trendy strategies are most welcome, as workstation and cluster node
> >> optimizations are all I'm really working on atm.
> >
> > Honestly, I'd suggest you deep dive. An image once, with clarity, will
> > last you a lot longer than ongoing fuzzy and trendy images from people
> > whose hardware and workflow are likely to be different from yours.
> >
> > The settings I provided should be absolutely fine for most use cases. The
> > only exception would be mobile devices with spinning rust, but those are
> > getting rarer and rarer...
>
> I did a quick test with games-arcade/xgalaga. It's an old, quirky game
> with sporadic lag variations. On a workstation with 32G RAM and (8) 4GHz
> 64-bit cores, very lightly loaded, there is no reason for in-game lag.
> Your previous settings made it much better and quicker the vast majority
> of the time, but not optimal (always responsive). Experience tells me if
> I can tweak a system so that the game stays responsive whilst the
> application(s) mix is concurrently running, then the quick test+parameter
> settings are reasonably well behaved. So that becomes a baseline for
> further automated tests and fine tuning for a system under study.
|
What kind of storage are you running on? What filesystem? If you're still
hitting swap, are you using a swap file or a swap partition?
|
>
> Perhaps Zabbix + TSdB can get me further down the pathway. Time-sequenced
> and analyzed data is overkill for this (xgalaga) test, but those coalesced
> test-vectors will be most useful for me as I seek a Gentoo-centric pathway
> for low-latency clusters (on bare metal).
|
If you're looking to avoid Zabbix interfering with your performance, you'll
want the Zabbix server and web interface on a machine separate from the
machines you're trying to optimize.
|
--
:wq