Gentoo Archives: gentoo-dev-announce

From: "Robin H. Johnson" <robbat2@g.o>
To: gentoo-project@l.g.o
Cc: gentoo-dev-announce@l.g.o
Subject: [gentoo-dev-announce] Bugzilla 2010/09/17 outage: post-mortem
Date: Fri, 17 Sep 2010 23:34:31
Message-Id: robbat2-20100917T230253-417170974Z@orbis-terrarum.net
Hi,

This is an outage report for today's massive Bugzilla outage.
Bugzilla was offline today for nearly 12 hours :-(. This was a huge
outage in relation to the previous performance. Prior to this, we were
running at approximately 99.99% availability over the last 324 days
(the cumulative times when Bugzilla was not available due to backend
issues were approximately 47 minutes).
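
As a rough check of the availability figures above, here is a minimal
sketch of the arithmetic. It assumes the 324-day window and 47 minutes
of downtime quoted above, and treats today's outage as a full 12 hours
added to that same window (an upper bound, since the outage was "nearly"
12 hours). The function name is illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def availability(window_days, downtime_minutes):
    """Return availability as a percentage of the whole window."""
    total_minutes = window_days * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


# Figures from the report: 324 days, about 47 minutes of downtime.
before = availability(324, 47)

# Assumption: count today's outage as a full 12 hours in the same window.
after = availability(324, 47 + 12 * 60)

print(f"before: {before:.3f}%  after: {after:.3f}%")
```

Under those assumptions, a single outage of this length moves the figure
from roughly "four nines" to a little over 99.8%.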

What happened:
--------------
As part of ongoing work, the Bugzilla team (idl0r and myself) wanted to
load a new snapshot of the production database into the bugstest
instance.
1. I made the snapshot, and then went to bed (it was already 3am
local time), leaving idl0r to apply it.
2. About an hour after I had gone to bed, the restore of the snapshot
led to a hard reboot (cause unknown) of one of the database servers.
3. Old table and binlog data was present on disk post-reboot: some
changes that had been applied more than 12 hours previously were no
longer present. Multiple tables were irreparably corrupted.
4. Partial replay from the bad binlog caused full replication failure
on the other database server.
5. idl0r examined the problem, and shut down the web service access to
prevent any further problems.
6. About 7 hours later, I woke up and started fixing it.

Available courses of action at first analysis:
----------------------------------------------
A) Restore from the last backup, losing ~45-90 minutes of data.
B) Validate and keep data from one DB side.

I chose option B, using the server that did not reboot.
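
The report does not say how the validation in option B was carried out.
One common approach, sketched here purely as an assumption, is to
compare per-table checksums (for example, values collected with MySQL's
CHECKSUM TABLE statement) between the trusted server and a reference
such as the last backup, and then inspect only the tables that differ.
The table names and checksum values below are hypothetical.

```python
def diff_table_checksums(trusted, reference):
    """Return sorted table names whose checksums differ between the two
    sides, or that exist on only one side."""
    tables = set(trusted) | set(reference)
    return sorted(t for t in tables if trusted.get(t) != reference.get(t))


# Hypothetical checksums, keyed by table name.
trusted_side = {"bugs": 1111, "longdescs": 2222, "profiles": 3333}
reference_side = {"bugs": 1111, "longdescs": 9999, "attachments": 4444}

suspect_tables = diff_table_checksums(trusted_side, reference_side)
print(suspect_tables)
```

A difference only flags a table for closer inspection. Tables that
legitimately changed after the reference was taken will differ too, so
this narrows the manual work rather than replacing it.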

What we could have done better:
-------------------------------
- Immediate reporting to the -dev mailing list, not just IRC.
- Escalation practices. The problem was beyond the ability of the
available infra members to fix, and had to be escalated to me (and it
took me 4h20m to fix).
- Even if I had been alerted by phone, I'm not sure I would have been
able to fix it, since I'd been awake that long already.
- Increase the second-tier DBA skillset available in the infra team.

Pending questions:
------------------
- Why did the first database server hard-reboot?
- Why was the XFS journal so out of date?

--
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail : robbat2@g.o
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85