The story goes on. This part covers only about 30 minutes, but those minutes were intense and a real rollercoaster ride:
16:19. Mattias emerges from one of the day’s numerous telephone conferences with devs and suppliers to tell the whole team some really heart-breaking news on our Skype chat:
[2010-08-24 16:19:07] Mattias: OK, bad news:
We will not be up for another 24 hours. We are copying the data to our old disk system, but at this rate (since the disk check is going on at the same time), we won’t be finished for another 20 hours. And then we need to let ht.exe catch up.
There is no chance that we can delay the matches that long, so they will be played on schedule.
[2010-08-24 16:19:40] Peter: 🙁
[2010-08-24 16:20:19] Mattias: There is a chance that we can abort the parity check. Our supplier is trying that out on a test system, to see if it’s possible. If so, we will be online later this evening.
[2010-08-24 16:23:42] Mattias: I suggest we inform the users about the possibility that we will be offline for another 24 hours and explain why.
[2010-08-24 16:24:09] Mattias: Also, we need to tell them that the matches will be played as scheduled.
[2010-08-24 16:24:31] Niclas: this is indeed bad news. To say the least
[2010-08-24 16:25:59] Anne: So something along the lines of: “We might have to keep the site offline for another 24h. We are working really hard to find a solution to be up and running before then. Unfortunately, matches scheduled during those 24h will be played.”
What had been a crash we could, in a sense, live with has now become a real disaster, one that threatens to keep the site closed through, in effect, the entire Cup and friendlies round, affecting every single user in the game. It’s just really heavy to digest. I think at this moment all of the more community-facing members of the team silently pencilled a rough few weeks into their mental calendars. I know I did.
[2010-08-24 16:35:27] Mattias: stop
[2010-08-24 16:36:28] Mattias: We have just aborted parity check on the youth disks and we increased performance by 8 times.
[2010-08-24 16:36:41] Mattias: So we will probably be up later tonight.
This still meant we would miss the start of the Cup games, but it was a big relief that we might be up again that evening. Still, we couldn’t be sure what we would get out of the system once we reopened it.
Nice to see what exactly has been going on, thanks for that 🙂
Don’t give up the fight 🙂
P.S.: Is there a follow-up entry coming about what exactly the problem with the youth academy is / was?
There are more chapters to come, unfortunately. Part IV, which I just posted, ends on Tuesday, but the forum wasn’t up then and the youth academy was stopped. So there is more to share, in particular about youth.
…a terrible history…very terrible…
First of all, ty very much for working so hard trying to fix this major crash. For the future my advice is to set up redundant back-up procedures. At the moment the most frustrating thing is to have the major team online 24/24 and the youth team offline 24/24. This is quite a waste of resources. It would be a lot better to have a 12/24 scenario for both teams (maybe in chunks of 4 hours) so that everybody would be allowed to set match orders easily for both teams.
I prepared this season to beat my old records, but I lost all chances because of the downtime. I saw there is no one who wants to repair this. Because of my serious illness this was my last season on Hattrick. Bad thing.. but it’s just a game. I hope, in future, to see the developers trying to repair these things.
Good job for bringing online the main part of the site.
However, I can’t stop asking myself why you don’t keep a mirror site with the same functionality as the original, just waiting to take over in case of a crash (of course it has to be synced with the main db). You would have 0 downtime.
Windows servers never worked well for me, and aspx is not the most flexible language too.
We want youth academies!:) Please:)
All this is pathetic: we do not want explanations, but only damages!!
If you were good people, after what has happened, you should at least lower the price of supporter, but now we know that you will not give a shit about us!
What happened to the youth academy?????????? It has been off for the last 3 days!
Hello, i want to send my support to us and your work every day. I am a developer myself and know very well the troubles we have to face every day. Keep this way, a big hug from Argentina. (that’s why my english is kind of poor 😛 )
I really don’t quite understand the issue with the parity check. To be honest, I was actually 95% sure that this was the main issue after the sort of storage unit crash that you described.
But the parity check is done by the controller of the storage unit and has nothing to do with the integrity of YOUR data as far as I know. It only checks if the virtual redundant drive – I’m guessing a spin-off of a normal RAID-5 configuration – is ok. I have cancelled many of these checks myself in order to be able to give the users where I work the performance they needed, and manually started them overnight or whenever usage was low.
Additional details about why you considered it so dangerous to abort the parity check would be very interesting at this point.
Turn to Linux/Unix – 1000000 times faster than damn Windows, and a lot fewer problems when it comes to support.
P.S. HT.exe could be really easily rewritten 😀
Don’t try to find an excuse to all this mess,
be quiet and work on it.
I’m getting p…..d off with yet another bug, crash or slow site every time… HT, go back to your roots and give people FUN again!
great job with the blog, guys! much appreciated !
Very very nice to see this blog started! It’s just what you and us need.
Hi, I know that many users here are complaining in a really bad way, but I work with server systems every day and I know that it is not as easy as a 14-year-old might think. So keep up the good work; this blog and the hiring of HT-Anne are a nice improvement over the last year regarding PR.
can’t wait for the next crash 😀
Hmm. I can do without them. 🙂
Just another way of thinking….. another way this problem COULD and maybe SHOULD be solved…
In case of a serious crash like this one, it’s maybe better to PAUSE the system for a week. Just get back online with the whole system on August 30, 15:00 HT time. While real-world time keeps running, Hattrick time would be paused for exactly one whole week.
Players will not get older, no updates on team, health, etc. No transfers…
On Monday August 30, 15:00 HT, it is week 3 instead of week 4….
The only “problem” is that we cannot play Hattrick for a whole week, but when we start again on August 30, NOBODY loses out: normal transfers, everyone is able to set line-ups for the Cup games, etc., etc.
I hope that you understand what I am trying to say, my English is not that good… 😉
And last but not least…. this solution can still be done… although we would lose our actions of the last hours/days when we get back online, we would start “clean” on August 30…
Hope to get a reaction on this.
Please don’t think that I am trying to be clever at your expense, but I work with these things professionally.
I’m wondering what kind of RAID you are using for storage. It must be a huge amount of data, and using RAID 5 will extend the recovery time considerably. Using an EVA SAN with RAID 0+1 could prevent this kind of problem in the future, or at least minimize the recovery time.
Some here ask “Well, why don’t they have a live mirror?”
Ahem, sorry, do you have a rough idea how much money such a thing eats up? It doubles (at least) the whole hardware + connectivity infrastructure.
Thank you for your hard work. I’m really sure that this will have consequences in terms of reviewing their setups and processes. At least the backup is now proven to be rock-solid 🙂
Good job, guys!
nice job on the blog and definitely a well handled crisis (at least communication-wise). I am just wondering how much of this situation could have been avoided by a) taking the site down prior to the FW updates and patches and b) serious testing before relaunching HT. (There would at least be no data loss and probably a lot less work in checking for inconsistencies and errors…) Is this something that will be taken into consideration next time something needs updating? (plan maintenance with the server down?)
nice thriller! 🙂
It wouldn’t hurt to hire some testers! 🙂
Why you moved to Malmo is hard to comprehend! It’s scary going near that area. I’ve been there.
Is that Skype log made up? That sounds so made up.
Nope. It’s authentic.
Peter’s contribution to the Skype chat was the best 😀
Instead of opening blogs, what you have to do is solve the problem of the academy. There are people who pay for Supporter for it.
More speed in solving problems and fewer blogs……..
Thanks for this inside information. I’m a developer myself and I know how stressful a situation like that can be. It seems you’ve handled things well. Don’t let rude comments from ignorant users bring you down. Good luck.
it’s a very cool game!!