Anatomy of a crash, part IV

We briefly discussed the possibility of actually delaying the entire week and postponing everything, but it was a very brief discussion indeed. Even very small manual tweaks of the schedule can cause huge follow-on problems, such as the problems we had with the NT schedules in the past season. The system uses real dates for scheduling, and once the season starts almost every page on the site will use these schedules in order to work properly and show the right information. To postpone the schedule for the entire game, in a situation where proper tests can’t really be done, could really create issues for months afterwards. It’s one of those beautiful thoughts that are not very practical when you think about it.

At 18.08, the main database is back online and we can also open our internal developer version of the Hattrick website to start checking if everything looks alright. Meanwhile, we bring the applications back one after the other, so that the game can start to catch up for the 24 hours lost. This involves everything that has to do with the game, from moving transferred players to their new clubs, to replenishing scout pools, to making the daily updates for all leagues – and running the first Cup games. We finalize the MyHTs and Sysinfos that need to be up when the site finally reopens, weighing our words quite carefully in order to not promise too much or underplay any issues users may have.

At 18.10 we notice that the forums are running extremely slowly. This is investigated, and after a while it is confirmed that despite the forums now being “alone” on the new disks (as the main database was moved and youth was stopped), performance will suffer. Even so, it’s OK for some, so we decide to give it a try. At 18.48 the final transfer extension is made and at 19.11 we think the game engine has caught up well enough for us to be able to reopen the site for everyone again.

Once open, we can only confirm that the forums really are way too slow. It is decided that they should be closed for the time being, while Daniel goes off for dinner and to have a think about possible solutions. At 20.00 the Stage server is back online, and we already have 63 000 users online across the Hattrick system.

Around 21.00, we recap the day and agree that everyone should go get some sleep as things are now running very smoothly and the engine has caught up. The forums remain closed until early the next morning, when Daniel will try out an idea he has for improving the performance. Klas babysits the site for a few more hours but everything is now stable.