The anatomy of a crash, part I

I just thought I should give you all a little more detail about what happened this week. As the story became quite long, I will break it up into parts. Here’s Monday:

On Monday, we were doing upgrades and patches to our server park, as we tend to do at least once every season. This is when we upgrade disks, install new web servers, patch software, and so on. We manage most of our hardware ourselves, such as the game servers and web servers, but they run atop a disk storage solution which is managed directly by the supplier. This system is generally very stable, with high performance and capacity, and we’ve used the same supplier since 2004.

This Monday, our supplier was on site to upgrade the firmware of their system. The upgraded firmware turned out to be buggy: it caused the disks to crash, and with them, Hattrick. Why this happened we can’t really say, but it is likely that our specific conditions somehow triggered a bug that had not shown up in their previous tests or at previous installations.

This happened just after 15 HT-time on Monday, triggering error messages for those users who were online at the time. At 15.23 we closed the web and started working on figuring out what exactly had happened, what the consequences were, and what we should do next. At this point we also closed all applications that “run” the game in the background, as they obviously no longer had a database to work against. The same went for transfers.

Ten minutes or so later the supplier confirms that his patch has caused the crash. He tries to get the disks back online, but he also gives us a first indication that a restore from backup may be necessary – something we at this point estimate would take the rest of the day to perform. At 15.48 we pass this outlook on to the users through the down web.

Over the next hour the disks are brought back online, but they can’t be accessed as they should be by the rest of the system. The supplier gets to work on this while we are left to consider our options. What we are hoping for at this point is that the data on the disks is indeed salvageable, and that we can resume the game within hours. We’re skeptical about this though, and unfortunately our skepticism proves to be well founded. The supplier can’t make it work, and we decide a restore from our backup has to be done. A small consolation in all this is that the window of data loss is relatively small, only 30 minutes. Around 18 HT-time we start preparing for this work, which will take most of the night.

This is mainly due to the enormous amount of data – in excess of 1 terabyte, and a lot more if we count archives and forums – that needs to be moved and “unpacked”, and then checked for inconsistencies, before we can go live with the site. On top of this, the game world itself will stand still during the downtime and will need to catch up before we can reopen for users.

We publish a new message on the down web stating that we will probably not be up until Tuesday afternoon, although we hope to be up earlier than that. What we really hope for is to be up and running a few hours before the Cup games start at 18 HT-time on Tuesday evening. All available devs are called in for a telephone conference in which the tasks are divided. Daniel is already in place in the server hall to work on the restore of the backup, while Chris babysits the restore process throughout the night.

And Monday ends…
