Gentoo Archives: gentoo-dev

From: "Robin H. Johnson" <robbat2@g.o>
To: gentoo-dev@l.g.o
Cc: gentoo-dev-announce@l.g.o
Subject: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
Date: Sat, 18 Jan 2014 05:03:11
Message-Id: 20140118050256.GF3378@orbis-terrarum.net

The overlays.gentoo.org service has been restored on a new system.
Some statistics and a post-mortem follow.

Special thanks to antarus and a3li for all their interactions with our sponsor,
and managing most of the details. I just did the final data recovery and this
writeup.

Please resume using the service, and if you see something weird that you
think is different from before, please file a bug for Infrastructure.

In the process, the service moved to a new machine. The SSH keys have changed
as follows:
DSA: d6:71:99:1f:46:c9:42:95:e1:9d:be:8e:f7:76:51:b5
RSA: 92:b5:40:16:63:a3:61:9f:d7:63:64:ba:d5:51:41:b9
ECDSA: 96:f0:29:e6:d4:85:58:46:31:ba:0e:17:0b:8c:fa:d8
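
For anyone who wants to check a host key by hand: the fingerprints above
are in the classic colon-separated MD5 form, which is the MD5 digest of
the base64-decoded key blob from a public key line. Below is a minimal
Python sketch of that calculation. The function name and the sample key
line are made up for illustration, and the sample blob is not a real
host key.

```python
import base64
import hashlib


def md5_fingerprint(pubkey_line: str) -> str:
    """Return the colon-separated MD5 fingerprint of an OpenSSH public key line.

    Expects the usual "<type> <base64-blob> [comment]" layout.
    """
    fields = pubkey_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1])
    digest = hashlib.md5(blob).hexdigest()
    # Split the 32 hex digits into 16 colon-separated byte pairs.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


if __name__ == "__main__":
    # Placeholder blob for illustration only; NOT a real host key.
    sample = "ssh-rsa " + base64.b64encode(b"not a real key").decode() + " example"
    print(md5_fingerprint(sample))
```

Newer OpenSSH releases print SHA256 fingerprints by default;
`ssh-keygen -l -E md5 -f <file>` prints the MD5 form for comparison.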

At this time, we will NOT be restoring Trac due to low demand. If you
still require a web-based SVN browser for old SVN repos, please contact
us at infra@g.o.

If you have a dev/ repo under the list 'IMPORTANT' below, you MUST push
to the server again.

IMPORTANT: The following repos were damaged beyond repair, and were not
available in backups. You'll need to push again; I have reset the repos to
empty:
dev/anarchy.git
dev/dberkholz.git
dev/dev-zero.git
dev/dilfridge.git
dev/fordfrog.git
dev/graaff.git
dev/maekke.git
dev/mschiff.git
dev/quantumsummers.git
dev/zorry.git

FYI: The following repos appeared to be empty:
dev/b33fc0d3.git
dev/moult.git
dev/tomwij.git
user/blueicefield.git
user/disinbox.git
user/palatis.git
user/paragon.git
user/vmalov.git
user/xray.git

FYI: The following repos contained dangling commits/tags/blobs, and this
should not be considered new breakage; if you have a newer copy, you are
encouraged to push again:
dev/blueness.git
dev/maksbotan.git
dev/mgorny.git
dev/qiaomuf.git
dev/xmw.git
proj/betagarden.git
proj/catalyst.git (+tags)
proj/devmanual.git
proj/dotnet.git
proj/elfix.git (+tags)
proj/emacs-tools.git
proj/gamerlay.git
proj/hardened-dev.git
proj/hardened-patchset.git
proj/kde.git
proj/lisp.git
proj/openrc.git (+tags)
proj/portage.git
proj/ruby-overlay.git
proj/sci.git
proj/sunrise.git
proj/webapp-config.git
proj/x11.git
user/gmt.git
user/mv.git (+blobs)
user/palmer.git

Statistics:
-----------
  354 repos total
-  10 repos unrecoverable (all in dev/)
= 344 repos recovered/available

    9 repos that seem to be empty
   26 repos with dangling commits/tags/blobs
    2 repos recovered from external sources.

Breakdown by path:
------------------
193 proj/ repos
 69 dev/  repos
 91 user/ repos
  1 other repo

Post-mortem
-----------
Hornbill went offline around: 2014-01-10 13:13 UTC
Hornbill last started a backup of VCS: 2014-01-10 07:59:04 UTC
Hornbill last completed a backup of VCS: 2014-01-10 08:20:54 UTC
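
The gaps between these timestamps can be worked out directly. The short
Python sketch below does the arithmetic; since the offline time is only
known to the minute, its seconds are assumed to be :00, and the variable
names are just for illustration.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Timestamps from the post-mortem. The offline time is approximate,
# so its seconds are assumed to be :00.
backup_started = datetime(2014, 1, 10, 7, 59, 4, tzinfo=UTC)
backup_completed = datetime(2014, 1, 10, 8, 20, 54, tzinfo=UTC)
went_offline = datetime(2014, 1, 10, 13, 13, 0, tzinfo=UTC)

# How long the backup run itself took.
backup_duration = backup_completed - backup_started
# Writes after completion are certainly missing from the backup.
exposure_since_completion = went_offline - backup_completed
# Writes made while the backup was running may also be missing.
exposure_since_start = went_offline - backup_started

if __name__ == "__main__":
    print("backup duration:           ", backup_duration)
    print("offline after completion:  ", exposure_since_completion)
    print("offline after backup start:", exposure_since_start)
```

Under that assumption the backup ran for about 22 minutes, and the server
went offline a little under 5 hours after it completed (a little over 5
hours after it started).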

Between the backup starting, and the server going offline, we were able
to confirm writes to the following Git repos:
dev/fordfrog.git
proj/kde.git
gitolite-admin.git

We believe that there were no writes to user/ repos, but are not 100%
certain, as the logging was insufficient for this purpose.

Hornbill went offline just over a week ago: mid-afternoon on a Friday
for the timezone where it's located. Due to staff turnover and business
changes at the previous sponsor, we were not able to contact anybody
until regular office hours on Monday, January 13th.

The server in question, while previously functioning, was not
recoverable after a remote hands reboot on Monday afternoon (UTC).
On Tuesday, the sponsor was able to examine it in more depth, and
it was still not recoverable. More concerningly, it turned out to be one
of the few remaining Gentoo infrastructure systems with IDE drives. The
data was recovered; however, it seemed to have a lot of corruption.

It was noted that our backups were missing all of the dev/ repos, due to
a system-wide rule to exclude /dev/ from backups (the rule should only
match the real /dev, not any directory simply named "dev"). For this
reason, we decided to try to get the data from the old server.
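
To illustrate the difference between the two kinds of rule, here is a
small Python sketch. It does not reproduce the actual backup
configuration: both functions, and the /srv/git/... paths in the
example, are hypothetical and only show how an unanchored match on the
name "dev" also catches a repo directory called dev/.

```python
from pathlib import PurePosixPath


def excluded_unanchored(path: str, name: str = "dev") -> bool:
    """Hypothetical over-broad rule: skip any path with a component called `name`."""
    return name in PurePosixPath(path).parts


def excluded_anchored(path: str, root: str = "/dev") -> bool:
    """Hypothetical intended rule: skip only `root` itself and what lives under it."""
    p = PurePosixPath(path)
    r = PurePosixPath(root)
    return p == r or r in p.parents


if __name__ == "__main__":
    # The repo paths below are made up for illustration.
    for path in ("/dev/sda", "/srv/git/dev/fordfrog.git", "/srv/git/proj/kde.git"):
        print(path, excluded_unanchored(path), excluded_anchored(path))
```

With the unanchored rule, the made-up dev/ repo path is skipped along
with the device nodes; with the anchored rule, only the real /dev is.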

Verification/recovery of the remaining data was also hampered by the
discovery that some of the Git repos in the backup were not entirely
clean: they contained either legacy errors from their CVS/SVN
conversions, which turned out to be false positives, or dangling
commits/blobs/tags.
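
For repo owners who want to check their own copies: `git fsck` reports
unreachable objects with lines of the form "dangling <type> <object-id>".
The Python sketch below tallies such lines by type. The helper name is
made up, and the object IDs in the sample are placeholders, not real
objects from any of the repos listed above.

```python
from collections import Counter


def count_dangling(fsck_output: str) -> Counter:
    """Count 'dangling <type> <object-id>' lines by object type."""
    counts = Counter()
    for line in fsck_output.splitlines():
        fields = line.split()
        # Ignore anything that is not a three-field "dangling ..." line.
        if len(fields) == 3 and fields[0] == "dangling":
            counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    # Placeholder object IDs for illustration only.
    sample = "\n".join([
        "dangling commit 1111111111111111111111111111111111111111",
        "dangling blob 2222222222222222222222222222222222222222",
        "dangling commit 3333333333333333333333333333333333333333",
        "dangling tag 4444444444444444444444444444444444444444",
    ])
    for obj_type, n in sorted(count_dangling(sample).items()):
        print(obj_type, n)
```

The input text could come from running `git --git-dir=<repo> fsck` and
capturing its output. Dangling objects on their own are usually harmless
leftovers, which is why they are listed above as FYI rather than as
breakage.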

What could we do better next time:
----------------------------------
- Have backups of all repos!
- Compare the age of the backup immediately, and consider going live
  with the backup. Only 5 hours of work would have been lost, and even
  then possibly only temporarily, due to the distributed nature of Git.
- More people need to use the infra-status page to learn about the state
  of Gentoo services.

Actions for Infra
-----------------
- Include the dev/ repos in the backup (they were missing)
- Set up Gitolite mirroring
- Review gitolite logging (needs to be easier to confirm when writes
  took place)

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail     : robbat2@g.o
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
