Gentoo Archives: gentoo-user

From: Mick <michaelkintzios@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] Dell Precision Workstation Overheating
Date: Fri, 20 Apr 2018 14:40:44
Message-Id: 17524638.sDcRh40y47@dell_xps
In Reply to: Re: [gentoo-user] Dell Precision Workstation Overheating by R0b0t1
1 On Friday, 20 April 2018 15:11:43 BST R0b0t1 wrote:
2 > On Fri, Apr 20, 2018 at 7:21 AM, Mick <michaelkintzios@×××××.com> wrote:
3 > > On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
4 > >> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
5 > >> has numerous heat failures.
6 > >>
7 > >> Due to poor cooling ... surprised?
8 > >>
9 > >> The cooling is not working right. Something is still wrong.
10 > >>
11 > >> On 04/19/2018 09:33 PM, R0b0t1 wrote:
12 > >> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
13 > >> > cards and a Tesla card.
14 > >> >
15 > >> > The system is a few years old at this point. Old enough that the
16 > >> > thermal compound could have hardened, which is why I replaced it.
17 > >
18 > > If the problem started suddenly, rather than getting progressively worse
19 > > over time, it may have something to do with kernel drivers, or some
20 > > change in firmware.
21 >
22 > As far as I know it has always been like this. It may be why it was
23 > hardly used before it came into my care. Looking at the server I could
24 > blame poor design; the inside is rather cramped, despite the care
25 > taken with the internal baffles. They may not have run a good flow
26 > simulation.
27 >
28 > Mr. Bird's observation seems to support this.
29 >
30 > > If the cause is mechanical, I'd also suggest checking the heat sink
31 > > contact
32 > > surface. Some heat sinks are poorly manufactured and require flattening
33 > > with wet 'n dry sandpaper to get a flat enough surface and improve their
34 > > contact with the CPU. I've seen 15°C improvement in a Zalman CPU cooler
35 > > after excess metal was removed from copper pipes, which were manufactured
36 > > proud. Hardcore O/C's flatten the CPU too, but I'd avoid anything as
37 > > radical because it can go badly wrong if you remove more than the surface
38 > > varnish from the chip.
39 > >
40 > > In the interim, opening the side panel may also help in hot weather.
41 >
42 > The internals are custom made to fit the motherboard, cards, and drive
43 > slots. It may work better if I move it to another tower but it will be
44 > a while before I can find one. I will look at the interface between
45 > the heatsink and processor again, but it looked fine.
46 >
47 >
48 > How concerned should I be about overheating machine check errors? I
49 > used to think that it was best to avoid them, as the threshold was
50 > high enough that very small parts of the die could overshoot and fail,
51 > but I was informed that is not the case. Besides the throttling (which
52 > is fairly bad) I am not sure if there are any drawbacks to the
53 > overheating.
54
55 Semiconductors eventually fail when overheated. So it is not a good idea to
56 continue trying to fry your CPU.
57
58 You can confirm the cause of these exceptions by installing and running
59 'app-admin/mcelog'. If the tower design is poor and hot air is recirculating
60 within the case rather than being exhausted, your choices would
61 typically be:
62
63 1. Install more effective after market CPU coolers. This means you have to
64 spend money, which may be better spent on a new tower/PC. It may also be
65 there isn't enough space in the case to fit them, although low profile/compact
66 CPU coolers exist and you may have better luck with them.
67
68 2. Install bigger or additional case fans, to help get the heated air out
69 of the case and minimising hot spots and hot air recirculation. You could try
70 forcing some more air through the case with a small desktop fan to see if this
71 option has any legs.
72
73 3. Modify the case, by drilling/cutting holes to improve air flow, e.g. at the
74 top of the case.
75
76 4. Migrate components to a different case/MoBo, which you have already
77 considered.
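
To judge whether any of the options above makes a difference, it may help to watch the kernel's thermal throttle counters before and after a change. The following is only a rough sketch: it assumes an Intel CPU on Linux whose kernel exposes 'thermal_throttle/core_throttle_count' and 'package_throttle_count' under /sys/devices/system/cpu/cpuN/, which is not the case on every system. The function name read_throttle_counts is made up for illustration.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {"core": int|None, "package": int|None}}.

    Assumes the Intel thermal_throttle sysfs counters are present;
    CPUs without that directory are skipped.
    """
    counts = {}
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    for cpu_dir in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        throttle_dir = cpu_dir / "thermal_throttle"
        if not throttle_dir.is_dir():
            continue
        entry = {}
        for key in ("core", "package"):
            counter = throttle_dir / f"{key}_throttle_count"
            try:
                entry[key] = int(counter.read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[cpu_dir.name] = entry
    return counts


if __name__ == "__main__":
    for cpu, entry in read_throttle_counts().items():
        print(cpu, entry)
```

Run it once before a long compile and once after; if the counts climb less with an extra fan in place, that option probably has legs.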
78
79
80 > I am wondering what the point of 32 threads is if you can't use them at
81 > 100%.
82 >
83 > Cheers,
84 > R0b0t1
85
86 Quite, but the box may not have been intended to cope with the pressures of
87 running gentoo to compile software on a regular basis. I've found many
88 cheaper laptops in particular are so poorly designed from a cooling
89 perspective, they struggle to run a lengthy gentoo emerge. I've also had
90 desktops which struggled, although nothing as critical as yours. The
91 permanent solutions I came up with involved after market cooling fans. With
92 boxen I was not keen to spend money on for cooling improvements, I would just
93 open the side panel during an emerge, which allowed the CPU temperature to
94 drop sufficiently to avoid further thermal throttling.
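
If you want numbers for the side panel experiment, something like the sketch below could be run during an emerge with the panel closed and then open. It assumes the usual Linux hwmon layout, where each /sys/class/hwmon/hwmonN directory has a 'name' file and 'tempM_input' files holding millidegrees Celsius; which chips show up depends on the drivers loaded (e.g. coretemp). The function name read_hwmon_temps is hypothetical.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return a list of (chip, sensor_file, degrees_celsius) tuples.

    Assumes the standard hwmon sysfs layout; unreadable sensors are skipped.
    """
    readings = []
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip_dir / "name"
        # Fall back to the directory name if the chip has no 'name' file.
        chip = name_file.read_text().strip() if name_file.is_file() else chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            try:
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue
            readings.append((chip, temp_file.name, millideg / 1000.0))
    return readings


if __name__ == "__main__":
    for chip, sensor, celsius in read_hwmon_temps():
        print(f"{chip} {sensor}: {celsius:.1f} C")
```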
95
96 --
97 Regards,
98 Mick
