Gentoo Archives: gentoo-user

From: james <garftd@×××××××.net>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] kde-apps/kde-l10n-16.04.3:5/5::gentoo conflicting with kde-apps/kdepim-l10n-15.12.3:5/5::gentoo
Date: Wed, 10 Aug 2016 14:05:25
Message-Id: 82210337-0ef5-f004-05b9-f4d234aa1e2a@verizon.net
In Reply to: Re: [gentoo-user] kde-apps/kde-l10n-16.04.3:5/5::gentoo conflicting with kde-apps/kdepim-l10n-15.12.3:5/5::gentoo by Michael Mol
On 08/10/2016 07:45 AM, Michael Mol wrote:
> On Tuesday, August 09, 2016 05:22:22 PM james wrote:
>> On 08/09/2016 01:41 PM, Michael Mol wrote:
>>> On Tuesday, August 09, 2016 01:23:57 PM james wrote:
>
>>> The exception is my storage cluster, which has dirty_bytes much higher, as
>>> it's very solidly battery backed, so I can use its oodles of memory as a
>>> write cache, giving its kernel time to reorder writes and flush data to
>>> disk efficiently, and letting clients very rapidly return from write
>>> requests.
>> Are these TSdB (time series data) by chance?
>
> No; my TS data is stored in a MySQL VM whose storage is host-local.
>
>>
>> OK, so have you systematically experimented with these parameter
>> settings, collected and correlated the data, specific to the domain's needs?
>
> Not with these particular settings; what they *do* is fairly straightforward,
> so establishing configuration constraints is a function of knowing the capacity
> and behavior of the underlying hardware; there's little need to guess.
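
For my notes, a minimal sketch of pulling those hardware numbers straight from
sysfs (assuming a standard /sys/block layout and defaulting to sda; the
write_cache file only exists on newer 4.x kernels, so it is read defensively):

#!/usr/bin/env python3
# Sketch: report the block-layer hints the kernel exposes for one disk, so
# the dirty_* limits can be chosen from real hardware numbers, not guesses.
import sys

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"

dev = sys.argv[1] if len(sys.argv) > 1 else "sda"
base = "/sys/block/%s/queue/" % dev
for name in ("rotational", "write_cache", "physical_block_size",
             "optimal_io_size", "max_sectors_kb"):
    print("%-20s %s" % (name, read(base + name)))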
>
> For a hypothetical example, let's say you're using a single spinning rust disk
> with an enabled write cache of 64MiB. (Common enough, although you should
> ensure the write cache is disabled if you find yourself at risk of poweroff. You
> should be able to script that with nut, or even acpid, though.) That means the
> disk could queue up 64MiB of data to be written, and efficiently reorder
> writes to flush them to disk faster. So, in that circumstance, perhaps you'd
> set dirty_background_bytes to 64MiB, so that the kernel will try to feed it a
> full cache's worth of data at once, giving the drive a chance to optimize its
> write ordering.
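
For my notes, a minimal sketch of applying that (run as root; the 64 MiB
figure is just the assumed cache size from your example):

#!/usr/bin/env python3
# Sketch: match vm.dirty_background_bytes to a 64 MiB on-disk write cache.
CACHE_BYTES = 64 * 1024 * 1024

with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
    f.write(str(CACHE_BYTES))
# Persistent equivalent in /etc/sysctl.conf:
#   vm.dirty_background_bytes = 67108864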
>
> For another hypothetical example, let's say you're using a parity RAID array
> with three data disks and two parity disks, with a strip length of 1MiB. Now,
> with parity RAID, if you modify a small bit of data, when that data gets
> committed to disk, the parity bits need to get updated as well. That means
> the small write requires first reading the relevant portions of all three data
> disks, holding them in memory, adjusting the portion you wrote to, calculating
> the parity, and writing the result out to all five disks. But if you make a
> *large* write that replaces all of the data in the stripe (so, a well-placed
> 3MiB write, in this case), you don't have to read the disks to find out what
> data was already there, and can simply write out your data and parity. In this
> case, perhaps you want to set dirty_background_bytes to 3MiB (or some multiple
> thereof), so that the kernel doesn't try flushing data to disk until it has a
> full stripe's worth of material, and can forgo a time-consuming initial read.
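
For my notes, a minimal sketch of computing a full-stripe-aligned value for
that array (the disk count and strip size are assumptions taken from your
example, so the real geometry would be substituted):

#!/usr/bin/env python3
# Sketch: set vm.dirty_background_bytes to whole stripes so the RAID layer
# can do full-stripe writes and skip the read-modify-write cycle.
DATA_DISKS = 3                        # assumed from the example
STRIP_BYTES = 1 * 1024 * 1024         # assumed 1 MiB strip
STRIPES = 4                           # flush in multiples of a full stripe

full_stripe = DATA_DISKS * STRIP_BYTES          # 3 MiB here
target = STRIPES * full_stripe

with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
    f.write(str(target))
print("dirty_background_bytes = %d (%d full stripes)" % (target, STRIPES))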
>
> For a final hypothetical example, consider SSDs. SSDs share one interesting
> thing in common with parity RAID arrays... they have an optimum write size
> that's a lot larger than 4KiB. When you write a small amount of data to an
> SSD, it has to read an entire block of NAND flash, modify it in its own RAM,
> and write that entire block back out to NAND flash. (All of this happens
> internally to the SSD.) So, for efficiency, you want to give the SSD an entire
> block's worth of data to write at a time, if you can. So you might set
> dirty_background_bytes to the size of the SSD's block, because the fewer the
> write cycles, the longer it will last. (Different model SSDs will have different
> block sizes, ranging anywhere from 512KiB to 8MiB, currently.)
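
For my notes, a minimal sketch for the SSD case (the kernel does not report
the NAND erase-block size directly, so ERASE_BLOCK_BYTES is an assumed value
that has to come from the drive's datasheet):

#!/usr/bin/env python3
# Sketch: size vm.dirty_background_bytes to the SSD's erase block so writes
# arrive in whole blocks and cost fewer program/erase cycles.
ERASE_BLOCK_BYTES = 2 * 1024 * 1024   # assumed 2 MiB; real drives vary 512KiB-8MiB

with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
    f.write(str(ERASE_BLOCK_BYTES))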


OK, after reading some of the docs and postings several times, I see how
to focus in on the exact hardware of a specific system. The nice thing
about clusters is that they are largely identical systems, or groups of
identical systems, in quantity, which helps with scaling issues.
Testing specific hardware individually should lead to near-optimal
default settings that can then be deployed to the cluster nodes.
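
A rough sketch of the kind of per-machine sweep I have in mind (run_workload()
is just a placeholder for whatever benchmark matches the node's real job, and
the candidate values are only examples):

#!/usr/bin/env python3
# Sketch: try a few dirty_background_bytes candidates, run a representative
# workload for each, and record the wall-clock time.
import subprocess, time

CANDIDATES_MIB = [16, 32, 64, 128, 256]   # values to try; adjust per hardware

def set_dirty_background(mib):
    with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
        f.write(str(mib * 1024 * 1024))

def run_workload():
    subprocess.run(["sync"], check=True)  # placeholder workload

results = {}
for mib in CANDIDATES_MIB:
    set_dirty_background(mib)
    start = time.monotonic()
    run_workload()
    results[mib] = time.monotonic() - start

for mib, seconds in sorted(results.items()):
    print("%4d MiB -> %.2f s" % (mib, seconds))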


>> As unikernels collide with my work on building up minimized and
>> optimized linux clusters, my pathway forward is to use several small
>> clusters, where the codes/frameworks can be changed, even the
>> tweaked-tuned kernels and DFS, and note the performance differences for
>> very specific domain solutions. My examples are quite similar to the
>> flight sim mentioned above, but the ordinary and uncommon workloads of
>> regular admin (dev/ops) work are just a different domain.
>>
>> Ideas on automating the exploration of these settings
>> (scripts/traces/keystores) are keenly of interest to me, just so you know.
>
> I think I missed some context, despite rereading what was already discussed.

Yea, I was thinking out loud here; just ignore this...

>>>> I use OpenRC, just so you know. I also have a motherboard with IOMMU
>>>> that currently has questionable settings in the kernel config file. I
>>>> cannot find consensus on if/how IOMMU affects IO with the SATA HD
>>>> devices versus memory-mapped peripherals, in the context of 4.x kernel
>>>> options. I'm trying very hard here to avoid a deep dive on these issues,
>>>> so trendy strategies are most welcome, as workstation and cluster node
>>>> optimizations are all I'm really working on atm.
>>>
>>> Honestly, I'd suggest you deep dive. An image once, with clarity, will
>>> last you a lot longer than ongoing fuzzy and trendy images from people
>>> whose hardware and workflow are likely to be different from yours.
>>>
>>> The settings I provided should be absolutely fine for most use cases. The
>>> only exception would be mobile devices with spinning rust, but those are
>>> getting rarer and rarer...
>>
>> I did a quick test with games-arcade/xgalaga. It's an old, quirky game
>> with sporadic lag variations. On a workstation with 32G ram and (8) 4GHz
>> 64bit cores, very lightly loaded, there is no reason for in-game lag.
>> Your previous settings made it much better and quicker the vast majority
>> of the time, but not optimal (always responsive). Experience tells me that
>> if I can tweak a system so that the game stays responsive while the
>> application mix is running concurrently, then that quick test plus
>> parameter settings is behaving reasonably well. So that becomes a
>> baseline for further automated tests and fine tuning of a system under
>> study.
>
> What kind of storage are you running on? What filesystem? If you're still
> hitting swap, are you using a swap file or a swap partition?

The system I mostly referenced rarely hits swap in days of uptime. It's
the keyboard latency while playing the game, with other codes running,
that I try to tune away. I try very hard to keep codes from swapping
out, because ultimately I'm most interested in clusters that keep
everything running in memory, i.e. making the most of Apache-Spark and
other "in-memory" techniques.
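
A quick way I can confirm nothing has actually been pushed to swap is to read
VmSwap out of /proc/<pid>/status; a rough sketch:

#!/usr/bin/env python3
# Sketch: report any process with a non-zero VmSwap, i.e. anything that
# missed the "keep it all in memory" goal.
import glob

for status_path in glob.glob("/proc/[0-9]*/status"):
    try:
        with open(status_path) as f:
            fields = dict(line.split(":", 1) for line in f if ":" in line)
    except OSError:
        continue                                   # process exited mid-read
    swap_kb = int(fields.get("VmSwap", "0 kB").split()[0])
    if swap_kb > 0:
        pid = status_path.split("/")[2]
        name = fields.get("Name", "?").strip()
        print("pid %s (%s): %d kB swapped" % (pid, name, swap_kb))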


The combined codes running simultaneously never hit the HD (no swapping),
but still there is keyboard lag. Not that it actually affects the
running codes to any appreciable degree, but it is a test I run so that
the cluster nodes will benefit from remaining quickly (low latency)
attentive to interactions with the cluster master processes, regardless
of the workloads on the nodes. Sure, it's not totally accurate, but so
far this heuristic approach is pretty darn close. It's not part of this
conversation (on VM settings etc.), but ultimately getting this right
solves one of the biggest problems in building any cluster: workload
invocation, shedding, and management to optimize resource utilization,
regardless of the orchestration(s) used to manage the nodes. Swapping to
disk is verboten in my (ultimate) goals and target scenarios.

No worries, you have given me enough info and ideas to move forward with
testing and tuning. I'm going to evolve these into more precisely
controlled and monitored experiments, noting exact hardware differences;
that should complete the tuning of the memory management knobs within
acceptable confines. Then I'll automate it for later checking on cluster
test runs with various hardware setups. Eventually these tests will be
extended to a variety of memory and storage hardware, once the
techniques are automated.
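
For that later checking step, a rough sketch of comparing a node's live vm.*
settings against whatever profile the tuning runs settle on (the EXPECTED
values below are only placeholders, not recommendations):

#!/usr/bin/env python3
# Sketch: flag any vm.* sysctl that drifted from the chosen per-hardware profile.
EXPECTED = {
    "dirty_background_bytes": 67108864,   # placeholder values
    "dirty_bytes": 268435456,
    "swappiness": 1,
}

for name, want in EXPECTED.items():
    with open("/proc/sys/vm/" + name) as f:
        have = int(f.read().strip())
    status = "ok" if have == want else "MISMATCH"
    print("vm.%-24s have=%-12d want=%-12d %s" % (name, have, want, status))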


>> Perhaps Zabbix + TSdB can get me further down the pathway. Time-sequenced
>> and analyzed data is overkill for this (xgalaga) test, but those coalesced
>> test vectors will be most useful for me as I seek a Gentoo-centric pathway
>> to low-latency clusters (on bare metal).
>
> If you're looking to avoid Zabbix interfering with your performance, you'll
> want the Zabbix server and web interface on a machine separate from the
> machines you're trying to optimize.

Agreed.

Thanks Mike,
James
