Anatomy of a crash, part II

Here’s part two of the story, starting very early Tuesday morning and going on until 15.00:

Very early on Tuesday morning the restore is completed, and all devs are online to start looking for potential problems, corrupt data or failed transactions in and around the game. This has to be done before we can switch the game applications back on, which is what makes time tick again in the Hattrick universe.

A list of issues is identified. Some are sorted out; others are out of our hands and are simply relayed to users (such as the fact that our last backup of HT-mail was from 5 AM Monday morning, or that 42 youth academies that had been started on Monday would have to be restarted by the users).

But there is good news as well. By late morning it seems that everything is well ahead of schedule, and the engine is catching up. We decide to open the site just after 10 HT-time, really quite pleased with ourselves. And at this point we get a very cold shower.

While the restored backup is now in fine working condition, the disks are simply too slow for the site to be usable. The redundant disks are stuck in a process of checking the data for errors, and again, because of the amount of data we need to have “live”, this takes a long time – much longer than expected. The site goes down again, and we don’t even need to change the message to users about our estimated reopening, as we still target late afternoon, but now with a new set of problems. What we all fear is that we will not be back before the cup games start, as up until then we have mainly inflicted boredom on our users. If the site is down as the cup starts, many users will face consequences that affect their team and season plans in a more tangible way.

Those of us who can’t code discuss various ways of preventing problems for users – such as extending transfer deadlines – and put together information for when we are back online. We also answer questions on our Facebook page and through our Twitter channel @hattrick.

After lunch, around 14.30, we get some further bad news. Not only is the disk check not yet done, it will be done at an “unknown” time. And in the meantime, the performance of the storage solution will be a quarter of what it should be, which makes it impossible for our site to run on. However, we still have the old disk system from the same supplier available, which was our main storage until a year ago, and a decision is taken to put it back into use instead. It’s smaller, so we can’t fit all of the main Hattrick site, Youth and the forums onto it. But it is the only way to get the game up again this same day. The alternative would be several more days of downtime for the entire system. Obviously this option is the best one we have, so we get started on it. It’s 15 HT-time when the decision is made, and the move should take about 90 minutes. There is still hope that we can be online well before the cup games get started.