Hi,

So this is an outage report for today's massive Bugzilla outage.
Bugzilla was offline today for nearly 12 hours :-(. This was a huge
outage relative to its previous performance. Prior to this, we were
running at approximately 99.99% availability over a span of the
last 324 days (the cumulative time when Bugzilla was not available due
to backend issues was approximately 47 minutes).
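
For anyone who wants to check the availability figure, here is a rough
sketch of the arithmetic. The function name is just illustrative, and
the second calculation assumes today's outage is counted as a full 12
hours (it was "nearly" 12) over the same 324-day window, so treat that
number as an estimate only.

```python
MINUTES_PER_DAY = 24 * 60


def availability_percent(days, downtime_minutes):
    """Percentage of the window during which the service was up."""
    total_minutes = days * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


# Figures from the report: 47 minutes of downtime over 324 days.
before = availability_percent(324, 47)

# Assumption: count today's outage as a full 12 hours (720 minutes),
# still measured over the same 324-day window.
after = availability_percent(324, 47 + 12 * 60)

print(f"before today: {before:.3f}%")
print(f"after today:  {after:.3f}%")
```

Under those assumptions the figure drops from roughly 99.99% to roughly
99.84%.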

What happened:
--------------
As part of ongoing work, the Bugzilla team (idl0r and myself) wanted to
load a new snapshot of the production database into the bugstest
instance.
1. I made the snapshot, and then went to bed (it was already 3am
local time), leaving idl0r to apply it.
2. About an hour after I had gone to bed, the restore of the snapshot
led to a (cause unknown) hard reboot of one of the database servers.
3. Old table and binlog data was present on disk post-reboot: some
changes that had been applied more than 12 hours previously were no
longer present. Multiple tables were irreparably corrupted.
4. Partial replay from the bad binlog caused full replication failure
on the other database server.
5. idl0r examined the problem, and shut down the web service access to
prevent any further problems.
6. +7 hours later, I woke up, and started fixing it.

Available courses of action at first analysis:
----------------------------------------------
A) Restore from the last backup, lose ~45-90 minutes of data.
B) Validate and keep data from one DB side.

I chose option B, using the server that did not reboot.

What we could have done better:
-------------------------------
- Immediate reporting to the -dev mailing list, not just IRC.
- Escalation practices. The problem was beyond the ability of the
available infra members to fix, and had to be escalated to me (and it
took me 4h20m to fix).
- Even if I had been alerted by phone, I'm not sure I would have been
able to fix it since I'd been awake that long already.
- Increase the second-tier DBA skillset available in the infra team.

Pending questions:
------------------
- Why did the first database server hard-reboot?
- Why was the XFS journal so out of date?

--
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail : robbat2@g.o
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85