Andrew Savchenko <bircoph <at> gentoo.org> writes:

> > I can hardly imagine that otherwise the compiler converts integer
> > or pointer arithmetic into floating point arithmetic, or is
> > this really the case for certain flags? If yes, why should these
> > flags *ever* be useful?
> > I mean: The context switching happens for non-kernel code as well,
> > doesn't it?

First off, reading this thread, I cannot really tell what the intended use
of the "highly tuned" kernels is supposed to be. For almost all workstation
and server purposes, what has been previously stated is mostly correct. If
you really want to test these waters, do it on a system that is not in your
critical path: if you tune and experiment, you are going to bork your box.
Water coolers on the CPUs are a good idea when taxing the FPU and other SIMD
hardware on the CPU, imho. sys-power/powertop is your friend.

> Yes, context switching happens for all code and has its costs. But
> for userspace code context switching happens for many other
> reasons, e.g. on each syscall (userspace <-> kernelspace switching).
> Also some user applications may need high precision, or context
> switching pays off due to mass parallel data processing, e.g. SIMD
> instructions in scientific or multimedia applications.
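
To make that kernel-side cost concrete: the kernel deliberately does not
save and restore FPU/SIMD state on every entry, so any in-kernel floating
point or SSE/AVX use has to be bracketed explicitly. A minimal sketch,
assuming a recent kernel where kernel_fpu_begin()/kernel_fpu_end() live in
<asm/fpu/api.h> (older kernels used <asm/i387.h>):

    #include <linux/module.h>
    #include <linux/init.h>
    #include <asm/fpu/api.h>    /* kernel_fpu_begin()/kernel_fpu_end() */

    static int __init fpu_demo_init(void)
    {
            /* The state save/restore done by this pair is exactly the
             * per-use cost under discussion; it is why kernel code
             * avoids the FPU unless the payoff is large. */
            kernel_fpu_begin();
            /* ... SSE/AVX work would go here ... */
            kernel_fpu_end();
            return 0;
    }

    static void __exit fpu_demo_exit(void) { }

    module_init(fpu_demo_init);
    module_exit(fpu_demo_exit);
    MODULE_LICENSE("GPL");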

Hear, hear! I knew we had an LU expert in the crowd. Most scientific
or highly parallelized number crunching does benefit from experimenting
with settings and *profiling* the results. trace-cmd and kernelshark
are in portage and are very useful for analyzing hardware timings,
context switching, and a myriad of other issues. Be careful: you can
sink a lifetime into such efforts with little to show for them.
The best thing is to read up on specific optimizations for specific
codes as vetted on the specific hardware in your processors. Tuning for
one need will most likely degrade other types of performance; that is
why, before you delve into these waters, you really need to learn about
profiling both target (application) and kernel codes *before* randomly
tuning the advanced numerical intricacies of your hardware resources.
Start with memory and cgroups before worrying about the hardware inside
your processors (CPU and GPU).
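
Before reaching for the tracing tools, getrusage(2) will at least tell you
how often your application is being context switched, voluntarily (blocking
in syscalls) versus involuntarily (preemption). A minimal sketch:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        /* ... run the workload you want to characterize here ... */

        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
            printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
        }
        return 0;
    }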

> But unless the special conditions mentioned above apply, fixed point is
> still faster in userspace; some ffmpeg codecs have both fixed and floating
> point implementations, so you may compare them. Programming in fixed point
> is much harder, so most people avoid it unless they have a very
> good reason to use it. And don't forget that the kernel is
> performance critical, unlike most userspace applications.
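
For anyone who has not written fixed point before, the whole trick is
keeping fractions as scaled integers. A minimal Q16.16 sketch (the helper
names are just for illustration) shows why it is fast -- integer ops only
-- and why it is error prone -- the widening and shifting are all manual:

    #include <stdint.h>
    #include <stdio.h>

    /* Q16.16: 16 integer bits, 16 fraction bits; value = raw / 65536.0 */
    typedef int32_t q16_16;

    #define Q16_ONE (1 << 16)

    static q16_16 q16_from_double(double d) { return (q16_16)(d * Q16_ONE); }
    static double q16_to_double(q16_16 q)   { return (double)q / Q16_ONE; }

    /* Multiply: widen to 64 bits first, then drop the extra 16 fraction
     * bits -- forgetting either step is the classic fixed-point bug. */
    static q16_16 q16_mul(q16_16 a, q16_16 b)
    {
        return (q16_16)(((int64_t)a * b) >> 16);
    }

    int main(void)
    {
        q16_16 x = q16_from_double(1.5);
        q16_16 y = q16_from_double(2.25);
        printf("1.5 * 2.25 = %f\n", q16_to_double(q16_mul(x, y)));
        return 0;
    }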

Video (mpeg, h.264 and such) massively benefits from the enhanced matrix
abilities of the SIMD hardware in your video card's GPU. Support for
offloading to these bare-metal resources is being integrated into gcc-5.1+
for experimentation, but it is likely going to take a year or so before
ordinary users of linux resources see these performance gains. I would
encourage you to experiment, but *never on your main workstation*. I'm
purchasing a new nvidia video card just to benchmark and tune some
numerically intensive codes that use sci-libs/magma. Although this will be
my currently fastest video card, it will sit in a box that is not used
for visual eye candy (gaming, anime, ray traces, etc.).
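
The GPU offload story is still settling, but the same data-parallel idea
is available today in the CPU's own SIMD units. A minimal SSE sketch
(compile with gcc -msse) that does four float additions in one instruction
instead of four:

    #include <stdio.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    int main(void)
    {
        /* _mm_set_ps takes its arguments in reverse lane order. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        float out[4];

        /* One vector add does the work of four scalar adds. */
        _mm_storeu_ps(out, _mm_add_ps(a, b));
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }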

The mesos clustering codes (shark, storm, tachyon, etc.) and MPI codes are
going to fundamentally change the numerical processing landscape for even
small linux clusters. An excellent bit of code to get your feet wet with is
sys-apps/hwloc. More than the FPU, MPI {sys-cluster/openmpi} and other
clustering codes are going to let you use the GDDR5 memory found on many
video cards (GPUs) via *RDMA*. The world is rapidly changing, and many old
"fixed point integer" folks do not see the tsunami that is just offshore.
Many computationally expensive codes have development projects to move to
an "in-memory" [1] environment where hard disk resources are avoided as
much as possible in the cluster. Clustered resources "tuned" for such
things as a video rendering farm will have very different optimized kernels
than your KDE(G*) workstation or web server. media-gfx/blender is another
excellent collection of codes that benefits from all sorts of tuning on a
special-purpose system.
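
If you do install sys-apps/hwloc, its C API makes a nice first experiment.
A minimal sketch that loads the machine topology and counts cores and
hardware threads (link with -lhwloc):

    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;

        /* Discover the machine: packages, caches, cores, PUs. */
        hwloc_topology_init(&topo);
        hwloc_topology_load(&topo);

        printf("cores: %d, hardware threads: %d\n",
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE),
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

        hwloc_topology_destroy(&topo);
        return 0;
    }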

So do you really have a valid need to tune FPU performance for a
numerically demanding application? YMMV.

> Best regards,
> Andrew Savchenko

hth,
James

[1] https://amplab.cs.berkeley.edu/