Gentoo Archives: gentoo-project

From: "Robin H. Johnson" <robbat2@g.o>
To: gentoo-core@l.g.o, gentoo-project@l.g.o
Subject: [gentoo-project] dipper.gentoo.org outage post-mortem
Date: Thu, 19 May 2016 18:44:41
Message-Id: 20160519184439.GA19438@orbis-terrarum.net
Summary
-------
- dipper.gentoo.org suffered a major motherboard failure on Friday, May 13th.
- The outage started around 2016/05/13 08h56 UTC, and was mostly resolved
  at 2016/05/14 20h53 UTC, approximately 36 hours in duration.
- During this time, no rsync updates were issued, nor were distfiles,
  releases or snapshots updated.
- New hardware purchasing is planned to recover capacity & mitigate
  hardware old-age.

Timeline (UTC):
---------------
2016/05/13:
08h56: iDRAC/OOB notification to Infra [1]
09h07: Icinga notifications to Infra
13h14: Infra human (jmbsvicetto) notices the problem [2].
15h14: Hosting sponsor requested to investigate hardware
       (sponsor localtime 08h14, nobody onsite yet)
15h24: Initial Infra discussions about which hosts have enough disk
       space if the data has to be moved.
15h42: Sponsor's initial investigation suggests dead hardware [3].
16h00: (approx) Data consolidation/backup to other hosts begins.
19h46: Sponsor pulls host, tests, seems dead [4]
21h36: Email to -core/-project & status page update
23h05: Migration/Recovery plan outlined on IRC

2016/05/14:
09h00: (approx) Data consolidation/backup completed.
17h54: Sponsor contact onsite (10h54 localtime)
20h15: "New" host booted
20h53: All-clear notification

2016/05/16:
Lurking bug with snapshots resolved

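The key intervals in the timeline above can be checked with a short calculation. This is only a sketch: the timestamps are copied from the timeline, the helper names are made up for illustration, and the "(approx)" entries are treated as exact times, so the waiting figure is a rough estimate.

```python
from datetime import datetime

# Timestamp format used in the timeline, e.g. "2016/05/13 08h56" (UTC).
FMT = "%Y/%m/%d %Hh%M"


def hours_between(start, end):
    """Return the elapsed time between two timeline entries, in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


first_alert = "2016/05/13 08h56"    # iDRAC/OOB notification
human_notice = "2016/05/13 13h14"   # Infra human notices the problem
backup_done = "2016/05/14 09h00"    # approximate in the timeline
sponsor_onsite = "2016/05/14 17h54"
all_clear = "2016/05/14 20h53"

outage_hours = hours_between(first_alert, all_clear)
detection_hours = hours_between(first_alert, human_notice)
waiting_hours = hours_between(backup_done, sponsor_onsite)

print(f"total outage:        {outage_hours:.2f} h")
print(f"alert to human:      {detection_hours:.2f} h")
print(f"backup done to onsite: {waiting_hours:.2f} h (approximate)")
```

The total comes out just under 36 hours, matching the summary. A little over 4 hours passed before a human noticed the alerts, and roughly 9 hours were spent waiting for hands-on access once the data had been moved.
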
Root cause and timeline notes:
------------------------------
This was a hardware failure. The hardware was roughly 18 months outside
of its 3-year warranty. Because of the timing, we did not notice it
immediately; we then had to move lots of data, and were limited by
sponsor staff availability for hands-on hardware workarounds.

[1] The initial iDRAC reports said:
    Event: CPU 1 has a thermal trip (over-temperature) event.
[2] IPMI serial-over-lan gets no useful output, IPMI logs and sensors
    report no additional info, power cycle does nothing.
[3] Front panel says CPU1 overheating; power button and
    unplugging/replugging have no effect.
[4] LEDs light up, but the fans don't spin and nothing else happens when
    the power button is pushed; reseating CPUs has no effect either.

Corrective & Preventative Measures:
-----------------------------------
A similar system had all data evacuated (archived or simply moved),
including multiple VMs; the disks from the dead system were then
transferred to it and booted with minimal tweaks (udev, networking). The
VMs are still offline, pending more VM capacity (they have large disk
needs).

The failed hardware was some of the newest hardware at this sponsor
location. It was new as of Nov 2011 with a 3-year warranty. Other Infra
servers present at the same location:
- 2x Dell systems, new as of Dec 2011 as VM hosts
  [one of these is the new home of dipper, running natively]
- 2x Dell systems, new as of May 2007
- 4x Supermicro Atom systems, new as of May 2010 [6x originally, 2x failed]
- (various $arch development systems).

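The ages of the systems listed above at the time of the outage can be worked out as follows. This is a rough sketch at month granularity: the "new as of" dates come from the list, while the labels and the helper name are illustrative only.

```python
from datetime import date

OUTAGE = date(2016, 5, 13)

# (label, year, month) of the "new as of" dates given in the list above.
systems = [
    ("failed host (dipper)", 2011, 11),
    ("2x Dell VM hosts", 2011, 12),
    ("2x Dell systems", 2007, 5),
    ("4x Supermicro Atom systems", 2010, 5),
]


def age_in_months(year, month, on=OUTAGE):
    """Whole months between the 'new as of' month and the outage month."""
    return (on.year - year) * 12 + (on.month - month)


ages = {label: age_in_months(y, m) for label, y, m in systems}
for label, months in ages.items():
    print(f"{label}: {months // 12}y {months % 12}m")

# The failed host had a 3-year warranty.
WARRANTY_MONTHS = 36
months_past_warranty = ages["failed host (dipper)"] - WARRANTY_MONTHS
print(f"failed host past warranty by about {months_past_warranty} months")
```
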
Based on these ages, Infra is preparing hardware specifications for a
new VM hosting environment to be purchased by the trustees and hosted at
the same location. This would host the temporarily offline VMs, as well
as absorb at least the Atom & 2007 Dell systems.

Future actions to improve outcome:
- Move rsync & snapshot generation to a dedicated redundant VM
- Improve distfiles/release tarball process to have more redundancy,
  perhaps push-based.
- Encourage cleanups of roverlay/tinderbox/devbox VMs to shrink size.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robbat2@g.o
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
