I just thought I should give you all a little more detail about what happened this week. As the story became quite long, I will break it up into parts. Here’s Monday:
On Monday, we were doing upgrades and patches to our server park as we tend to do at least once every season. This is when we upgrade disks, install new web servers, patch software, and so on. Most of our hardware is managed by ourselves, such as the game servers and web servers, but they do run atop a disk storage solution which is managed by the supplier directly. This system is generally very stable and has high performance and capacity, and we’ve used the same supplier since 2004.
This Monday, our supplier was on site to make an upgrade to the firmware of their system. This upgraded firmware turned out to be buggy, as it caused the disks to crash, and with them, Hattrick. Why this happened we can’t really say, but it is likely that our specific conditions somehow triggered a bug that had not been revealed in their previous tests or at previous installations.
This happened just after 15 HT-time on Monday, triggering error messages for those users that were online at the time. At 15.23 we closed the web and started working on figuring out what exactly had happened, what the consequences were, and what we should do next. At this point we also close all applications that “run” the game in the background, as they obviously no longer have a database to work against. This is also true for transfers.
Ten minutes or so later the supplier confirms that his patch has caused the crash, and he tries to get the disks back online, but he also gives us a first indication that a restore from backup may be necessary – something we at this point estimate would take the rest of the day to perform. At 15.48 we pass this outlook on to the users through the down web.
Over the next hour the disks are brought back online, but they can’t be accessed as they should by the rest of the system. The supplier gets to work on this while we are left to consider the options we have. What we are hoping for at this point is that the data on the disks are indeed salvageable, and that we can resume the game within hours. We’re skeptical about this though, and unfortunately our skepticism proves to be well founded. The supplier can’t make it work and we decide a restore of our backup has to be done. A small consolation in all this is that the window of data loss is relatively small, only 30 minutes. Around 18 HT-time we start preparing for this work, which will take most of the night.
This is mainly due to the enormous amount of data – in excess of 1 terabyte, and a lot more if we count archives and forums – that needs to be moved and “unpacked”, and then checked for inconsistencies by us, before we can go live with the site. On top of this the game world itself will stand still during the downtime, and will need to catch up before we can reopen for users.
While we have hopes to be up earlier than that, we publish a new message on the down web stating that we will probably not be up until Tuesday afternoon. What we really hope for is to be up and running a few hours before the Cup games start at 18 HT-time on Tuesday evening. All available devs are called in for a telephone conference in which the tasks are divided. Daniel is already in place in the server hall to work on the restore of the backup, while Chris babysits the restore process throughout the night.
And Monday ends…
Could you please not write this rubbish and do your work? You request a rather big pile of money for the supporter package and all we get for this is crappy servers and a fucking useless blog?
Please… Not every one of the HTs is a developer. Some work in other areas too, for instance communication. And this blog, and the first posts, have not taken any time at all from the developers.
So everyone is actually doing their work. And working hard, especially during a tough week like this one. Many of us have been working day and night to fix everything. Promise 🙂
“big pile of money” for the supporter package? It is not something you HAVE to pay… It is NOT a big pile of money either. It is a rather small amount of cash to add extra features to a free game.
Stop whining.
As soon as you think your comments can’t be traced back to your hattrick account you start using the “big words” :))
Oh yes, sure, nice story, interesting… get a move on and do something useful instead of opening blogs like this one, solve the real problems!
Get to work, you slackers!!!!
In January there’s no way I’m giving you 34 €… not even if you handed it to me for free!!!!
this damn blog…
Good job but …
– you should suggest that your disk supplier perform some tests BEFORE the release of a new firmware
– test it yourselves on a test server (“stage” is on a different server, no?)
Hang in there 🙂
There is a difference between a test site on a virtual disk and a complete array of disks… I expect the stage server to be a virtual server on the same array of disks. But I might be wrong.
And if you really read the article, it said the supplier did not run into any problems anywhere before the HT setup.
But yes, as a general rule, testing before performing this kind of action is important. Let’s just hope the FW upgrade will end up being worthwhile… and that it is not just a bunch of minor upgrades we all – and HT – could have lived without. 🙂
Thx for your reply 🙂
very interesting to read, keep the battlefield diary coming!
Problems happen. Right now I’m worried about the state of the academies and whether mine was closed or not. I hope not; I would be very annoyed to throw away many months of work. However, if that does happen, I hope HT will consider compensation for the users who lose their academy. I also hope the academies are put back into operation quickly.
If you already had an academy, there is no danger of it being closed.
I hope not, and I also hope that the matches I had scheduled for this week were played, and the training sessions too.
Why do you keep the “supplier” secret? Tell us the name of the company, this is ridiculous and unprofessional to upgrade firmware on live servers without testing.
How much does your supplier need to pay for this enormous economic and image damage?
I don’t know, and to be honest I personally don’t care. The damage is done, as Neil Young once said; money will not make this situation any better.
excellent, my ass
Well done, management!!!
Nice to have a blog; it seems Hattrick have finally understood that communication is good. The last season has been a huge step forward on this account!!!
As part of the fallout I hope you will discuss with your supplier what they can put in place to prevent such situations happening again. Specifically, how they will improve their testing process to encompass the specific requirements of the Hattrick installation.
I really appreciate all the recent changes you have made, like the removal of the BOT every season and so on, but opening this blog at this moment, when Hattrick is fighting against lots of problems caused by the crash, is I think not a good idea, or at least not the best timing.
On that note, I’d ask why a mirroring system, or something similar, was not set up earlier to prevent this kind of situation, which unfortunately could happen again in the future.
Thanks guys,
it’s always interesting to read about internal workings of such a system and people around it.
As for the crash, I only have “thanks” and a “good work” for you. Having been in your shoes more than once, I just know that people who write nasty comments about you not doing the work really don’t know anything about the systems needed to keep HT up and running. I do appreciate your work and totally understand that you’ve spent day and night eyeballing database restore or move operations, hoping that they’d go as fast as possible.
Good luck in future!
I’m a bit upset myself, like most of the users, about this downtime. However, I’m also aware that problems happen and not ALL of them can be avoided. All companies test their devices before selling them, and sometimes the buyer also tests them before installing them. Unfortunately, we are unable to test ALL the possible situations a device could face. Hattrick faces a lot of upgrades every season. The fact that most of them go through without problems means that the developers are not noobs…
And, as someone else said, the people who write this blog are not the developers. The recovery of the site wouldn’t have been any faster if this blog had never been opened.
You are free to not renew the supporter, no one is forcing you to do it. I don’t think I will renew it myself, but at least I don’t yell at the ht staff like a brainless person.