On Friday, 20 April 2018 15:11:43 BST R0b0t1 wrote:
> On Fri, Apr 20, 2018 at 7:21 AM, Mick <michaelkintzios@×××××.com> wrote:
> > On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
> >> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
> >> has numerous heat failures.
> >>
> >> Due to poor cooling ... surprised?
> >>
> >> The cooling is not working right. Something is still wrong.
> >>
> >> On 04/19/2018 09:33 PM, R0b0t1 wrote:
> >> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
> >> > cards and a Tesla card.
> >> >
> >> > The system is a few years old at this point. Old enough that the
> >> > thermal compound could have hardened, which is why I replaced it.
> >
> > If the problem started suddenly, rather than getting progressively worse
> > over time, it may have something to do with kernel drivers, or some
> > change in firmware.
>
> As far as I know it has always been like this. It may be why it was
> hardly used before it came into my care. Looking at the server I could
> blame poor design; the inside is rather cramped, despite the care
> taken with the internal baffles. They may not have run a good flow
> simulation.
>
> Mr. Bird's observation seems to support this.
>
> > If the cause is mechanical, I'd also suggest checking the heat sink
> > contact surface. Some heat sinks are poorly manufactured and require
> > flattening with wet 'n dry sandpaper to get a flat enough surface and
> > improve their contact with the CPU. I've seen 15°C improvement in a
> > Zalman CPU cooler after excess metal was removed from copper pipes,
> > which were manufactured proud. Hardcore O/C's flatten the CPU too, but
> > I'd avoid anything as radical because it can go badly wrong if you
> > remove more than the surface varnish from the chip.
> >
> > In the interim, opening the side panel may also help in hot weather.
>
> The internals are custom made to fit the motherboard, cards, and drive
> slots. It may work better if I move it to another tower but it will be
> a while before I can find one. I will look at the interface between
> the heatsink and processor again, but it looked fine.
>
>
> How concerned should I be about overheating machine check errors? I
> used to think that it was best to avoid them, as the threshold was
> high enough that very small parts of the die could overshoot and fail,
> but I was informed that is not the case. Besides the throttling (which
> is fairly bad) I am not sure if there are any drawbacks to the
> overheating.

Semiconductors eventually fail when overheated, so it is not a good idea to
keep trying to fry your CPU.

You can confirm the cause of these exceptions by installing and running 'app-
admin/mcelog'. If the tower design is poor and air circulation within the
case is creating pockets of recirculating hot air, your choices would
typically be:

1. Install more effective aftermarket CPU coolers. This means you have to
spend money, which may be better spent on a new tower/PC. There may also not
be enough space in the case to fit them, although low-profile/compact CPU
coolers exist and you may have better luck with those.

2. Install bigger or additional case fans, to help get the heated air out
of the case and minimise hot spots and hot air recirculation. You could try
forcing some more air through the case with a small desktop fan to see if this
option has any legs.

3. Modify the case, by drilling/cutting holes to improve air flow, e.g. at the
top of the case.

4. Migrate the components to a different case/MoBo, which you have already
considered.
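To judge whether any of these options is helping, you can compare the
kernel's per-core thermal throttle counters before and after a heavy
compile. A minimal sketch, assuming your kernel exposes the x86
'thermal_throttle' sysfs nodes (availability depends on kernel config and
CPU):

```shell
# Sum per-core thermal throttle events from sysfs (x86 thermal driver;
# the nodes may be absent on some kernels/CPUs, handled below).
total=0
found=0
for f in /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count; do
    [ -r "$f" ] || continue
    found=1
    total=$((total + $(cat "$f")))
done
if [ "$found" -eq 1 ]; then
    echo "core throttle events so far: $total"
else
    echo "thermal_throttle counters not exposed on this kernel"
fi
```

If the count stops climbing during an emerge after a cooling change, the
change is doing its job.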

> I am wondering what the point of 32 threads is if you can't use them at
> 100%.
>
> Cheers,
> R0b0t1

Quite, but the box may not have been designed for the sustained load of
compiling software under Gentoo on a regular basis. I've found many
cheaper laptops in particular are so poorly designed from a cooling
perspective that they struggle to run a lengthy Gentoo emerge. I've also had
desktops which struggled, although nothing as critical as yours. The
permanent solutions I came up with involved aftermarket cooling fans. With
boxen I was not keen to spend money on for cooling improvements I would just
open the side panel during an emerge, which allowed the CPU temperature to
drop sufficiently to avoid further thermal throttling.

--
Regards,
Mick
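Watching the temperatures while an emerge runs is an easy way to confirm
whether the side panel trick (or any other change) is working. A rough
sketch, assuming the kernel exposes generic thermal zones under sysfs
(exact zone names and layout vary by motherboard and driver):

```shell
# Print each thermal zone reading in degrees C (sysfs reports values
# in millidegrees; some platforms expose no zones at all, handled below).
zones=0
for t in /sys/class/thermal/thermal_zone*/temp; do
    [ -r "$t" ] || continue
    milli=$(cat "$t")
    echo "$(dirname "$t"): $((milli / 1000)) C"
    zones=$((zones + 1))
done
[ "$zones" -gt 0 ] || echo "no thermal zones exposed"
```

Running this in a loop with 'watch' during a compile shows whether the
temperature plateaus below the throttle point.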