On 08/10/2016 07:45 AM, Michael Mol wrote:
> On Tuesday, August 09, 2016 05:22:22 PM james wrote:
>> On 08/09/2016 01:41 PM, Michael Mol wrote:
>>> On Tuesday, August 09, 2016 01:23:57 PM james wrote:
>
>>> The exception is my storage cluster, which has dirty_bytes much higher, as
>>> it's very solidly battery backed, so I can use its oodles of memory as a
>>> write cache, giving its kernel time to reorder writes and flush data to
>>> disk efficiently, and letting clients very rapidly return from write
>>> requests.
>> Are these TSdB (time series data) by chance?
>
> No; my TS data is stored in a MySQL VM whose storage is host-local.
>
>>
16 |
>> OK, so have you systematically experimented with these parameter
>> settings, collected and correlated the data, domain (needs) specific?
>
> Not with these particular settings; what they *do* is fairly straightforward,
> so establishing configuration constraints is a function of knowing the capacity
> and behavior of the underlying hardware; there's little need to guess.
22 |
>
> For a hypothetical example, let's say you're using a single spinning rust disk
> with an enabled write cache of 64MiB. (Common enough, although you should
> ensure the write cache is disabled if you find yourself at risk of poweroff. You
> should be able to script that with nut, or even acpid, though.) That means the
> disk could queue up 64MiB of data to be written, and efficiently reorder
> writes to flush them to disk faster. So, in that circumstance, perhaps you'd
> set dirty_background_bytes to 64MiB, so that the kernel will try to feed it a
> full cache's worth of data at once, giving the drive a chance to optimize its
> write ordering.
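
The arithmetic above can be sketched in a few lines; note the 4x margin
for vm.dirty_bytes is an illustrative assumption of mine, not a figure
from this thread:

```python
# Sketch: derive vm.dirty_* values from a drive's write-cache size.
# The 64 MiB cache is the hypothetical figure from the example above;
# a real drive reports its cache size via `hdparm -I`.

MIB = 1024 * 1024
write_cache = 64 * MIB  # the drive's on-board write cache

# Start background flushing once a full cache's worth of dirty data has
# accumulated, so the drive can reorder an entire batch of writes at once.
dirty_background_bytes = write_cache

# Hard-throttle writers at some higher multiple; 4x is arbitrary here.
dirty_bytes = 4 * write_cache

print(f"vm.dirty_background_bytes = {dirty_background_bytes}")
print(f"vm.dirty_bytes = {dirty_bytes}")
```

The printed lines are in the form /etc/sysctl.conf expects.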
32 |
>
> For another hypothetical example, let's say you're using a parity RAID array
> with three data disks and two parity disks, with a strip length of 1MiB. Now,
> with parity RAID, if you modify a small bit of data, when that data gets
> committed to disk, the parity bits need to get updated as well. That means
> that small write requires first reading the relevant portions of all three data
> disks, holding them in memory, adjusting the portion you wrote to, calculating
> the parity, and writing the result out to all five disks. But if you make a
> *large* write that replaces all of the data in the stripe (so, a well-placed
> 3MiB write, in this case), you don't have to read the disks to find out what
> data was already there, and can simply write out your data and parity. In this
> case, perhaps you want to set dirty_background_bytes to 3MiB (or some multiple
> thereof), so that the kernel doesn't try flushing data to disk until it has a
> full stripe's worth of material, and can forgo a time-consuming initial read.
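
The full-stripe arithmetic can be made concrete (the layout figures are
the hypothetical ones from the example above):

```python
# Sketch of the full-stripe arithmetic from the RAID example:
# three data disks, two parity disks, 1 MiB strip length.

MIB = 1024 * 1024
data_disks = 3
strip_length = 1 * MIB

# A write covering the whole stripe's data can skip the read-modify-write
# cycle: no initial read of the old data or parity is needed.
full_stripe = data_disks * strip_length  # 3 MiB, as stated above

# So dirty_background_bytes might be set to the full stripe (or a
# multiple of it), so the kernel flushes whole stripes at a time.
dirty_background_bytes = full_stripe

print(f"full stripe = {full_stripe // MIB} MiB")
print(f"vm.dirty_background_bytes = {dirty_background_bytes}")
```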
46 |
>
> For a final hypothetical example, consider SSDs. SSDs share one interesting
> thing in common with parity RAID arrays... they have an optimum write size
> that's a lot larger than 4KiB. When you write a small amount of data to an
> SSD, it has to read an entire block of NAND flash, modify it in its own RAM,
> and write that entire block back out to NAND flash. (All of this happens
> internally to the SSD.) So, for efficiency, you want to give the SSD an entire
> block's worth of data to write at a time, if you can. So you might set
> dirty_background_bytes to the size of the SSD's block, because the fewer the
> write cycles, the longer it will last. (Different model SSDs will have different
> block sizes, ranging anywhere from 512KiB to 8MiB, currently.)
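
The same idea for an SSD, as a sketch; the 4 MiB block size below is
purely illustrative, since the thread only says blocks range from
512 KiB to 8 MiB by model:

```python
# Sketch: align vm.dirty_background_bytes to the SSD's NAND block size,
# so the drive receives whole blocks and performs fewer internal
# read-modify-write (and hence erase/program) cycles.

KIB = 1024
MIB = 1024 * KIB

# Hypothetical block size; real values vary by model.
nand_block = 4 * MIB

dirty_background_bytes = nand_block

# Sanity-check the figure against the range quoted in the thread.
assert 512 * KIB <= nand_block <= 8 * MIB

print(f"vm.dirty_background_bytes = {dirty_background_bytes}")
```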


Ok, after reading some of the docs and postings several times, I see how
to focus in on the exact hardware on a specific system. The nice thing
about clusters is that they are largely identical systems, or groups of
identical systems, in quantity, so that helps with scaling issues.
Testing specific hardware, individually, should lead to near-optimal
default settings so they can be deployed as cluster nodes later.

>> As unikernels collide with my work on building up minimized and
>> optimized linux clusters, my pathway forward is to use several small
>> clusters, where the codes/frameworks can be changed, even the
>> tweaked-tuned kernels and DFS, and note the performance differences for
>> very specific domain solutions. My examples are quite similar to that
>> aforementioned flight sim above, but the ordinary and uncommon
>> workloads of regular admin (dev/ops) work are only a different domain.
>>
>> Ideas on automating the exploration of these settings
>> (scripts/traces/keystores) are keenly of interest to me, just so you know.
>
> I think I missed some context, despite rereading what was already discussed.

Yeah, I was thinking out loud here. Just ignore this...

>>>> I use OpenRC, just so you know. I also have a motherboard with IOMMU
>>>> that currently has questionable settings in the kernel config file. I
>>>> cannot find consensus on if/how IOMMU affects IO with the SATA HD
>>>> devices versus memory-mapped peripherals... in the context of 4.x kernel
>>>> options. I'm trying very hard here to avoid a deep dive on these issues,
>>>> so trendy strategies are most welcome, as workstation and cluster node
>>>> optimizations are all I'm really working on atm.
>>>
>>> Honestly, I'd suggest you deep dive. An image once, with clarity, will
>>> last you a lot longer than ongoing fuzzy and trendy images from people
>>> whose hardware and workflow are likely to be different from yours.
>>>
>>> The settings I provided should be absolutely fine for most use cases. Only
>>> exception would be mobile devices with spinning rust, but those are
>>> getting rarer and rarer...
>>
>> I did a quick test with games-arcade/xgalaga. It's an old, quirky game
>> with sporadic lag variations. On a workstation with 32G RAM and (8) 4GHz
>> 64-bit cores, very lightly loaded, there is no reason for in-game lag.
>> Your previous settings made it much better and quicker the vast majority
>> of the time, but not optimal (always responsive). Experience tells me
>> that if I can tweak a system so that the game stays responsive whilst the
>> application mix is concurrently running, then the quick test plus
>> parameter settings are reasonably well behaved. So that becomes a
>> baseline for further automated tests and fine tuning for a system under
>> study.
>
> What kind of storage are you running on? What filesystem? If you're still
> hitting swap, are you using a swap file or a swap partition?

The system I mostly referenced rarely hits swap in days of uptime. It's
the keyboard latency, while playing the game, that I try to tune away
while other codes are running. I try very hard to keep codes from
swapping out, because ultimately I'm most interested in clusters that
keep everything running in memory, a.k.a. ultimate utilization of
Apache Spark and other "in-memory" techniques.
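
One knob relevant to keeping everything in memory is vm.swappiness; as a
minimal sketch, a value of 1 is a common "swap only as a last resort"
setting, but that choice is my assumption, not advice from this thread:

```python
# Sketch: emit a sysctl.conf fragment that biases the kernel against
# swapping, for workloads meant to stay entirely in memory. The value 1
# (swap only under severe memory pressure) is an assumption.

settings = {
    "vm.swappiness": 1,
}

lines = [f"{key} = {value}" for key, value in settings.items()]
print("\n".join(lines))
```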

Combined codes running simultaneously never hit the HD (no swapping),
but still there is keyboard lag. Not that it is actually affecting the
running codes to any appreciable degree, but it is a test I run so that
the cluster nodes will benefit from still being (low latency) quickly
attentive to interactions with the cluster master processes, regardless
of workloads on the nodes. Sure, it's not totally accurate, but so far
this semantic approach is pretty darn close. It's not part of this
conversation (on VM etc.) but ultimately getting this right solves one of
the biggest problems for building any cluster; that is, workload
invocation, shedding, and management to optimize resource utilization,
regardless of the orchestration(s) used to manage the nodes. Swapping to
disk is verboten in my (ultimate) goals and target scenarios.

No worries, you have given me enough info and ideas (thanks to you) to
move forward with testing and tuning. I'm going to evolve these into
more precisely controlled and monitored experiments, noting exact
hardware differences; that should complete the tuning of the memory
management tasks, within acceptable confines. Then I'll automate it for
later checking on cluster test runs with various hardware setups.
Eventually these tests will be extended to a variety of memory and
storage hardware, once the techniques are automated.

>> Perhaps Zabbix + TSdB can get me further down the pathway. Time
>> sequenced and analyzed data is overkill for this (xgalaga) test, but
>> those coalesced test-vectors will be most useful for me as I seek a
>> Gentoo-centric pathway for low latency clusters (on bare metal).
>
> If you're looking to avoid Zabbix interfering with your performance, you'll
> want the Zabbix server and web interface on a machine separate from the
> machines you're trying to optimize.

Agreed.

Thanks Mike,
James