On Fri, Apr 20, 2018 at 7:21 AM, Mick <michaelkintzios@×××××.com> wrote:
> On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
>> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
>> has numerous heat failures.
>>
>> Due to poor cooling ... surprised?
>>
>> The cooling is not working right. Something is still wrong.
>>
>> On 04/19/2018 09:33 PM, R0b0t1 wrote:
>> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
>> > cards and a Tesla card.
>> >
>> > The system is a few years old at this point. Old enough that the
>> > thermal compound could have hardened, which is why I replaced it.
>
> If the problem started suddenly, rather than getting progressively worse over
> time, it may have something to do with kernel drivers, or some change in
> firmware.
>
|
As far as I know it has always been like this. It may be why it was
hardly used before it came into my care. Looking at the server I could
blame poor design; the inside is rather cramped, despite the care
taken with the internal baffles. They may not have run a good flow
simulation.
|
Mr. Bird's observation seems to support this.
|
> If the cause is mechanical, I'd also suggest checking the heat sink contact
> surface. Some heat sinks are poorly manufactured and require flattening with
> wet 'n dry sandpaper to get a flat enough surface and improve their contact
> with the CPU. I've seen 15°C improvement in a Zalman CPU cooler after excess
> metal was removed from copper pipes, which were manufactured proud. Hardcore
> O/C's flatten the CPU too, but I'd avoid anything as radical because it can go
> badly wrong if you remove more than the surface varnish from the chip.
>
> In the interim, opening the side panel may also help in hot weather.
>
|
The internals are custom made to fit the motherboard, cards, and drive
slots. It may work better if I move it to another tower but it will be
a while before I can find one. I will look at the interface between
the heatsink and processor again, but it looked fine.
|
How concerned should I be about overheating machine check errors? I
used to think that it was best to avoid them, as the threshold was
high enough that very small parts of the die could overshoot and fail,
but I was informed that is not the case. Besides the throttling (which
is fairly bad) I am not sure if there are any drawbacks to the
overheating.
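
To put a number on how bad the throttling is, here is a minimal sketch
in Python. It assumes a Linux kernel that exposes the x86 thermal
throttle counters under /sys/devices/system/cpu/cpuN/thermal_throttle/;
not every kernel or CPU provides them. The function name and the
sysfs_root parameter are only illustrative and exist so the code can be
pointed at a different directory.

```python
from pathlib import Path


def read_throttle_counts(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {"core": int|None, "package": int|None}}.

    Only CPUs that have a thermal_throttle directory are included.
    The sysfs layout is an assumption about the running kernel.
    """
    counts = {}
    # cpu[0-9]* matches cpu0, cpu17, ... but not cpufreq or cpuidle.
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        throttle_dir = cpu_dir / "thermal_throttle"
        if not throttle_dir.is_dir():
            continue
        entry = {}
        for key in ("core", "package"):
            counter = throttle_dir / f"{key}_throttle_count"
            try:
                entry[key] = int(counter.read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this system.
                entry[key] = None
        counts[cpu_dir.name] = entry
    return counts


if __name__ == "__main__":
    for cpu, entry in read_throttle_counts().items():
        print(cpu, entry)
```

Running it before and after a load test and comparing the counters
should show whether throttle events accumulate only under full load or
also when the machine is mostly idle.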
|
I am wondering what the point of 32 threads is if you can't use them at 100%.
|
Cheers,
R0b0t1