Summary
-------
- dipper.gentoo.org suffered a major motherboard failure on Friday, May 13th.
- The outage started around 2016/05/13 08h56 UTC, and was mostly resolved
  at 2016/05/14 20h53 UTC, approximately 36 hours in duration.
- During this time, no rsync updates were issued, nor were distfiles,
  releases or snapshots updated.
- New hardware purchasing is planned to recover capacity & mitigate
  hardware old-age.

Timeline (UTC):
---------------
2016/05/13:
08h56: iDRAC/OOB notification to Infra [1]
09h07: Icinga notifications to Infra
13h14: Infra human (jmbsvicetto) notices the problem [2].
15h14: Hosting sponsor requested to investigate hardware
       (sponsor localtime 08h14, nobody onsite yet)
15h24: Initial Infra discussions about where enough disk space is
       available, in case the data has to be moved.
15h42: Sponsor initial investigation suggests dead hardware [3].
16h00: (approx) Data consolidation/backup to other hosts begins.
19h46: Sponsor pulls host, tests, seems dead [4]
21h36: Email to -core/-project & status page update
23h05: Migration/Recovery plan outlined on IRC

2016/05/14:
09h00: (approx) Data consolidation/backup completed.
17h54: Sponsor contact onsite (10h54 localtime)
20h15: "New" host booted
20h53: All-clear notification

2016/05/16:
Lurking bug with snapshots resolved
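
As a quick cross-check of the timeline arithmetic, the following Python
sketch recomputes the main intervals from the timestamps listed above.
The helper name utc() and the variable names are chosen purely for
illustration; they are not part of any Infra tooling.

```python
from datetime import datetime, timedelta

FMT = "%Y/%m/%d %Hh%M"  # matches the "2016/05/13 08h56" style used in the timeline


def utc(stamp: str) -> datetime:
    """Parse a timeline stamp; all timeline times are UTC."""
    return datetime.strptime(stamp, FMT)


first_alert = utc("2016/05/13 08h56")    # iDRAC/OOB notification
human_notice = utc("2016/05/13 13h14")   # Infra human notices the problem
sponsor_asked = utc("2016/05/13 15h14")  # sponsor requested to investigate
all_clear = utc("2016/05/14 20h53")      # all-clear notification

outage = all_clear - first_alert
time_to_notice = human_notice - first_alert
# Sponsor localtime (08h14) minus the same moment in UTC (15h14).
sponsor_utc_offset = utc("2016/05/13 08h14") - sponsor_asked

print("outage:", outage)
print("time to notice:", time_to_notice)
print("sponsor offset from UTC (hours):", sponsor_utc_offset / timedelta(hours=1))
```

The outage works out to 35h57m (the "approximately 36 hours" in the
summary), the gap between the first alert and a human noticing is 4h18m,
and the sponsor location is 7 hours behind UTC.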

Root cause and timeline notes:
------------------------------
This was hardware failure. The hardware was about a year and a half
outside of its 3-year warranty. Timing meant we didn't notice it
immediately, had to move lots of data, and were then limited by sponsor
staff availability to be hands-on with hardware for workarounds.

[1] The initial iDRAC reports said:
    Event: CPU 1 has a thermal trip (over-temperature) event.
[2] IPMI serial-over-lan gets no useful output, IPMI logs and sensors
    report no additional info, power cycle does nothing.
[3] Front panel says CPU1 overheating; power button, unplugging &
    replugging have no effect.
[4] LEDs light up, but fans don't spin or anything else when power
    button is pushed; reseating CPUs has no effect either.

Corrective & Preventative Measures:
-----------------------------------
A similar system had all data evacuated (archived or simply moved),
including multiple VMs, then the disks from the dead system were
transferred and booted with minimal tweaks (udev, networking). The VMs
are still offline, pending more VM capacity (they have large disk
needs).

The failed hardware was some of the newest hardware in this sponsor
location. It was new as of Nov 2011 w/ 3 year warranty. Other Infra
servers present at the same location:
- 2x Dell systems, new as of Dec 2011 as VM hosts
  [one of these is the new home of dipper, running natively]
- 2x Dell systems, new as of May 2007
- 4x Supermicro Atom systems, new as of May 2010 [6x originally, 2x failed]
- (various $arch development systems)

Based on these ages, Infra is preparing hardware specifications for a
new VM hosting environment to be purchased by the trustees and hosted at
the same location. This would host the temporarily offline VMs, as well
as absorb at least the Atom & 2007 Dell systems.
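
For reference, the ages in question can be worked out from the in-service
dates listed above. This is only a rough sketch: the report gives months,
not days, so the 1st of each month is assumed, and the labels and the
age_in_months() helper are illustrative names.

```python
from datetime import date

INCIDENT = date(2016, 5, 13)   # day of the failure
WARRANTY_MONTHS = 36           # 3 year warranty on the failed host

# In-service months as stated in the report; day of month is an assumption.
hardware = [
    ("failed host (dipper)", date(2011, 11, 1)),
    ("2x Dell VM hosts", date(2011, 12, 1)),
    ("2x Dell systems", date(2007, 5, 1)),
    ("4x Supermicro Atom systems", date(2010, 5, 1)),
]


def age_in_months(start: date, end: date) -> int:
    """Whole months elapsed between two dates."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    return months - 1 if end.day < start.day else months


for name, in_service in hardware:
    months = age_in_months(in_service, INCIDENT)
    print(f"{name}: {months // 12}y {months % 12}m")

failed_age = age_in_months(hardware[0][1], INCIDENT)
print("failed host months past warranty:", failed_age - WARRANTY_MONTHS)
```

Under those assumptions the failed host was about 4.5 years old (18
months past warranty), the 2007 Dell systems were 9 years old, and the
Atom systems were 6 years old at the time of the incident.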

Future actions to improve outcome:
- Move rsync & snapshot generation to a dedicated redundant VM
- Improve distfiles/release tarball process to have more redundancy,
  perhaps push-based.
- Encourage cleanups of roverlay/tinderbox/devbox VMs to shrink size.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robbat2@g.o
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136