We briefly discussed the possibility of actually delaying the entire week and postponing everything, but it was a very brief discussion indeed. Even very small manual tweaks to the schedule can cause huge follow-on problems, such as those we had with the NT schedules in the past season. The system uses real dates for scheduling, and once the season starts almost every page on the site uses these schedules in order to work properly and show the right information. Postponing the schedule for the entire game, in a situation where proper tests can’t really be done, could create issues for months afterwards. It’s one of those beautiful thoughts that are not very practical when you think about it.
At 18.08, the main database is back online and we can also open our internal developer version of the Hattrick website to start checking if everything looks alright. Meanwhile, we bring the applications back one after the other, so that the game can start to catch up for the 24 hours lost. This involves everything that has to do with the game, from moving transferred players to their new clubs, to replenishing scout pools to making the daily updates for all leagues – and running the first Cup games. We finalize the MyHTs and Sysinfos that need to be up when the site finally reopens, weighing our words quite carefully in order to not promise too much or underplay any issues users may have.
At 18.10 we notice that the forums are running extremely slowly. It’s investigated, and after a while confirmed, that despite now being “alone” on the new disks (as the main database was moved and youth was stopped), performance will suffer. Even so, it’s OK for some, so we decide to give it a try. At 18.48 the final transfer extension is made, and at 19.11 we think the game engine has catched up well enough for us to be able to reopen the site for everyone again.
Once open, we can only confirm that the forums really are way too slow. It is decided that they should be closed for the time being, while Daniel goes off for dinner and to have a think about possible solutions. At 20.00 the Stage server is back online, and we already have 63 000 users online across the Hattrick system.
Around 21.00, we recap the day and agree that everyone should go get some sleep, as things are now running very smoothly and the engine has catched up. The forums remain closed until early next morning, when Daniel will try out an idea he has for improving the performance. Klas babysits the site for a few more hours, but everything is now stable.
Many thanks for sharing these stories with us. I think, and hope, it can clear up many things in many heads. 🙂
I like Hattrick.
I think this blog is a very good idea, congrats.
Maybe I’m a freak, but is it possible to know which servers and disk cabinet you have, or other info about your system? I don’t know how many resources you have, but it’s so strange that you don’t have a redundant system, or at least an HA system, or maybe a well-balanced system that allows you to stay up with at least 50% of servers online.
Hope you can start YA soon 🙂
It has been a hard way to learn and gain experience… when the crash happens! In my opinion, a backup restore in large systems must only be planned for corrupted systems or a total crash… the last resort.
The way to avoid a backup restore when the storage crashes is to replicate the 1 TB of data to a cheaper storage system; the whole system can stay online, with lower performance, while a set of contingency plans is carried out to restore the production storage system. Your system engineers should know this, including restore times and the time needed to get a disk cabinet ready (RAID and performance). High availability and recovery times are very important in this ‘live’ game.
I’m sure you do your best, and thanks for your great work on this ‘free’ game!! Maintenance of the large systems needed for this live Hattrick is not easy…
change your tactics and win the game!
Do you plan to modify the so-called “engine”, so that the next time the situation demands a change to the precalculated schedule – I hope not, but who knows? – the engine will allow it?
(Without the need to calculate with chaos-theory algorithms, of course. 🙂 )
There is one last thing. The Youth.
But I think (and hope) you’ll get it back by tomorrow noon at the latest.
PS: The German translation at myHT is not the best, to put it mildly.
I’m an IT pro and I understand you 100%. Most of all, I understand how you feel… Full fault tolerance will never exist, and neither will a DR program.
I understand you, but I don’t understand your disk provider. I hope that you’ll be able to make that disk provider pay big money for this ugly demonstration of unprofessionalism: a bug that arises only under your conditions? A bug in firmware/configuration that makes data unusable and cannot be rolled back?? A disk provider that can’t give you new, clean and working storage to restore data onto when you’re in trouble???
Use that money to double the arrays and allow you to apply modifications to one array at a time, in future.
And please, avoid making firmware updates only because they say to do it, when there’s no real need to do it!!!
Thanks and congrats anyway.
Sorry, my English is very bad. Spain: We’re with you, guys. I work in IT and I can’t imagine the number of problems that have sprung up overnight. Keep your spirits up and stay calm; we’ll wait as long as it takes to enjoy the great work you’re doing, which we all call Hattrick.
You guys rule!!!!!!
When are the ACDJ (youth academies) coming back?? We want our academies
Forums down and so is Sysinfo? What the hell…
An unhandled error in the application has occurred, please try again later. The error has been logged.
Now I get this kind of error on every page in HT except My Hattrick.
my club, youth, forum, help are down.
and the shop is working fine 🙂
‘Anatomy of a crash, part V’ in the pipeline already?
Sure you do your best. There’ll always be a coffee for you at my place 😉
crash is really terrible…
but what is happening right now? almost none of the pages are accessible.
Oh wow! Is it done? Seems you merited a big fat
Your site is crashing again… lots of Runtime Errors atm.
I read that everything is all right. But when I tried to open one of the messages, I saw this:
“404 – File or directory not found.
The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable.”
So, what is it? Oh, and thank you very much for the academies – they work! Good work:)
I am glad to see that my suggestion has been discussed…
Years ago I used to be a programmer myself. And you’ll surely agree with me on this: in Hattrick there should be ONLY ONE object giving you the right Hattrick time, season, week, day, time, etc….
Changing that single object would then affect the whole site without causing any problems.
But this is theory, theory about how it should work. Like I said, years ago I was a programmer, and I know that even in small applications you make mistakes with this kind of object programming, ending up with several objects doing mostly the same thing….
So I don’t blame you for this! 😉
But maybe it is worth making a change towards this principle of only one single object giving you the right Hattrick time, and letting all other modules and applications depend on that.
With kind regards!
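As an illustration of the “only one object” idea in the comment above, here is a minimal Python sketch. It is not Hattrick’s actual code: the `GameClock` class, its method names and the 16-week season length are all assumptions made for this example. The point is only that if every module asks one object for the game time, a postponement becomes a single offset change instead of edits scattered across many schedules.

```python
from datetime import datetime, timedelta, timezone


class GameClock:
    """Hypothetical single source of game time (illustrative only)."""

    def __init__(self, season_start, weeks_per_season=16):
        self.season_start = season_start          # real date the first season began
        self.weeks_per_season = weeks_per_season  # assumed season length
        self.offset = timedelta(0)                # total postponement so far

    def postpone(self, delay):
        """Shift the whole game schedule later by `delay`."""
        self.offset += delay

    def game_time(self, real_now):
        """Game time is real time minus all postponements."""
        return real_now - self.offset

    def season_week_day(self, real_now):
        """Return (season, week, day), each counted from 1."""
        days = (self.game_time(real_now) - self.season_start).days
        season = days // (7 * self.weeks_per_season) + 1
        week = (days // 7) % self.weeks_per_season + 1
        day = days % 7 + 1
        return season, week, day


# Usage: every page would ask the same clock instead of reading raw dates.
start = datetime(2010, 1, 4, tzinfo=timezone.utc)
clock = GameClock(start)
now = start + timedelta(days=8, hours=3)
print(clock.season_week_day(now))   # (1, 2, 2)
clock.postpone(timedelta(days=1))   # a 24-hour outage shifts everything
print(clock.season_week_day(now))   # (1, 2, 1)
```

As the post itself notes, a real system that stores real dates in many places would still need careful testing before anything like this could be relied on.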
Have some perspective! Yes, there’s a lot of best practice about IT systems, storage etc. but they are all complicated and usually very expensive too.
I’ve been a Hattrick manager since 2002 and this is the first major crash (or have I forgotten something..?). I am sure that the development team do their best, since it will all be in their CV afterwards. But hey – we can play the game for free. Nobody’s dying here.
Let’s hope they don’t use the same storage supplier in your local hospital 😉
Thanks for the background to the crash. You recovered well.
One little quibble about the English: it’s “caught up” not “catched up”. All those irregular English verbs.
With kind regards.
I hope that we will be able to see the list of developers’ names and their responsibilities as well, as you keep referring to their names and I don’t know who is who…
Thanks for the story behind the scene. It is really interesting.
It’s amazing to see the number of people complaining so hard for a single day of downtime and some loss in the play.
It’s only a game! You don’t even have to pay for it unless you want to. And I’ve seen companies accept longer downtime even though it involved losing a lot of money.
I really do think it’s a good job to have it up in so little time after a major storage issue.
the truth hurts all Hattrick developers and GM …
Can you look at this match? I think the match engine still has problems. Please fix it.
Perfect for helping us understand HT administration. I work in IT and I think you’ve handled this very professionally and very quickly. I know exactly how you felt after the crash. I’m always worried about a hardware crash at our company site. It’s a nightmare. Nobody cares about IT (why would they?) when everything is working 🙂 But when something happens, everyone is nervous and asking why and when :). This work sucks but I like it :))
This blog is a very good idea. I hope the developers will read the users’ suggestions
I don’t think I will be a Supporter anymore if you increase the price. Sorry.
A piece of advice from a colleague: this will happen again. Not exactly ‘this’, of course, but something similar.
Build up your capabilities to have a more flexible schedule system. Some users might not be forgiving, and eventually, our business is keeping them happy.
11 september 2010.
we’re waiting for a new post: Anatomy of a crash, part V…