THIS POST IS WRITTEN BY MATTEO DIMAI; AKA LA-ARTOD
What does it really mean, in Hattrick, to have a “good” or a “bad” team? To be “strong” or “weak”?
Intuitively, a good manager wins more often than a bad manager, and a strong team wins more often than a weak team. But how can you determine how strong a team is when the result is also affected by luck? Strong teams win titles, but the strongest team isn’t always the winner in a given match. And is there any way to compare two teams that have never met before?
The easiest way is to express the strength of a team as a number and then compare teams by comparing those numbers. These numbers are ratings of team strength, and by comparing ratings you can create a ranking – an ordering of teams from the strongest to the weakest. Not all rankings are based on ratings – the Cup ranking currently used for Cup seedings isn’t – but any single rating can easily be transformed into a ranking.
OK, so now we want to have a single number to express team strength. But if you think about it, there are three factors in play here:
- How strong your players are in general
- What choices your team build allows you
- How good you are as a manager to pick the choice that is best against your opponent
The first point is the hardest one. Assessing the strength of a player from his skills is problematic. Even if we pick two players that have been trained for the same number of weeks and that had the same starting skills (so that they have the same HTMS points, if you’re familiar with those), their contribution on the field will depend on which skills have been trained and where the players play. A central defender playing normal has a smaller overall contribution to ratings than a winger – if nothing else, because he uses only two skills instead of four.
So, when I was asked by the HT team to help develop a rating system that could summarize team strength, I kept going back to the only thing that matters in Hattrick, ratings, and to the minimal unit that creates them: a lineup.
I wanted to start by rating lineups. How players influenced the ratings was going to be irrelevant, at least at the start. Then I wrestled with the concept of strength: I wrote earlier that a stronger team wins more often, but there are draws as well, so how does one count them?
In statistics, it is much easier to work with a result that can only be a win or a loss. I studied quite a few ratings, including the Elo chess rating and HTEV, and I decided that I would work with “success probabilities” defined as the probability of winning + one half of the probability of drawing.
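The definition above is simple enough to write down directly. The following is a minimal sketch; the function name and the example probabilities are made up for illustration and do not come from the actual rating code.

```python
def success_probability(p_win: float, p_draw: float) -> float:
    """Return P(win) + 0.5 * P(draw), the definition given in the post."""
    if p_win < 0 or p_draw < 0 or p_win + p_draw > 1 + 1e-12:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    return p_win + 0.5 * p_draw


# Hypothetical match: 50% win, 20% draw, 30% loss for the home side.
home = success_probability(p_win=0.5, p_draw=0.2)
away = success_probability(p_win=0.3, p_draw=0.2)
print(home, away)  # the two values sum to 1, up to rounding
```

Counting a draw as half a success is what makes the two sides' success probabilities add up to one, so the three-way result can be treated like a two-way one.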
The core of the Hattrick Power Rating is therefore a rating, called “Base Power Rating”, that summarizes the strength, in terms of success probabilities, of a specific lineup against a generic opponent.
You can safely skip this paragraph if you’re not into mathematics. I was aiming for the rating to have these properties:
- Monotonicity: if any of your ratings improves and the rest stay equal, your Power Rating improves;
- Differences in Base Power Ratings should be interpretable as success probabilities;
- Location invariance: the same difference in Base Power Ratings should mean the same success probability no matter what the absolute values of the Base Power Ratings are.
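To make the last two properties concrete, here is a small sketch of a rating scale on which the success probability depends only on the rating difference. The logistic curve and the `scale` parameter are assumptions chosen for illustration (they resemble the Elo approach mentioned above); the post does not say which link function the Power Rating actually uses.

```python
import math


def implied_success_probability(rating_a: float, rating_b: float,
                                scale: float = 1.0) -> float:
    """Hypothetical logistic link: depends only on rating_a - rating_b."""
    return 1.0 / (1.0 + math.exp(-(rating_a - rating_b) / scale))


# The same one-point gap at two very different absolute levels.
low = implied_success_probability(5.0, 4.0)
high = implied_success_probability(15.0, 14.0)
print(low == high)  # location invariance: only the difference matters
```

Any function of the difference alone would satisfy location invariance; the logistic curve is just a convenient one that also keeps the result between 0 and 1.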
A key property the Hattrick Power Rating does not have is transitivity. Suppose lineup B has a higher Base Power Rating than lineup A, and lineup C has a higher Base Power Rating than lineup B. That does not mean lineup C will have the higher success probability when playing against lineup A, even if lineup B has the higher success probability against lineup A and lineup C has the higher success probability against lineup B. It means that, against a wide range of different opponents, lineup C will on average be stronger – win more often – than lineup B, and lineup B will on average be stronger than lineup A. This is because Hattrick has some “rock-paper-scissors” components, as we will see later – and this is addressed by other parts of the Power Ratings project.
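A toy example may help here. All numbers below are invented, and the average success probability against a small field of opponents is used only as a stand-in for a rating against a generic opponent; it is not the actual Base Power Rating formula.

```python
from statistics import mean

# Made-up success probabilities of each lineup against three generic opponents.
vs_field = {
    "A": [0.40, 0.45, 0.50],
    "B": [0.45, 0.50, 0.55],
    "C": [0.50, 0.55, 0.60],
}

# Made-up head-to-head success probabilities: (x, y) is x playing against y.
head_to_head = {("B", "A"): 0.55, ("C", "B"): 0.55, ("C", "A"): 0.45}

average = {name: mean(probs) for name, probs in vs_field.items()}
ranking = sorted(average, key=average.get, reverse=True)
c_is_favourite_vs_a = head_to_head[("C", "A")] > 0.5

print(ranking)              # C ranks above B, which ranks above A
print(c_is_favourite_vs_a)  # yet C is not the favourite head-to-head against A
```

In this sketch C is the strongest lineup on average, B beats A and C beats B, and still A has the edge over C in a direct match: the rock-paper-scissors situation described above.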
The very first part of the project was trying to understand if I could create a rating such that teams that are strictly better in terms of ratings also had strictly better Power Ratings. Team A is strictly better in terms of ratings than team B when all of team A’s sector ratings are higher than or equal to team B’s and at least one is strictly higher.
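The "strictly better" relation can be sketched as a short check. The function name, the ordering of the seven sector ratings and the numeric values below are illustrative assumptions, not taken from the simulator.

```python
from typing import Sequence


def strictly_better(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if every sector rating of a is >= b's and at least one is higher."""
    if len(a) != len(b):
        raise ValueError("both teams need the same number of sector ratings")
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


# Assumed order: midfield, right/central/left defence, right/central/left attack.
team_a = [10.0, 9.0, 9.5, 9.0, 8.0, 8.5, 8.0]
team_b = [10.0, 8.5, 9.5, 9.0, 8.0, 8.0, 8.0]
team_c = [12.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0]

print(strictly_better(team_a, team_b))  # A matches or beats B in every sector
print(strictly_better(team_a, team_c))  # neither A nor C dominates the other
print(strictly_better(team_c, team_a))
```

Note that this is only a partial order: teams like A and C above, where one has the better midfield and the other the better defence, cannot be compared this way at all, which is why a single rating is needed.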
I thought of a couple of methods, but in the end my original idea worked: I repeatedly simulated games between pairs of teams in which one was strictly better than the other, and through an iterated procedure a rating emerged such that a given difference in Power Ratings meant that the stronger team had a certain success probability.
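The post does not spell out the iterated procedure, so the following is only one way such a fit could look, not the method actually used. It assumes a hypothetical logistic link between rating difference and success probability, takes a table of simulated success probabilities as given (the numbers are invented), and nudges the ratings until the implied probabilities match the simulated ones as closely as possible.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_ratings(success, iterations: int = 2000, step: float = 0.1):
    """Fit one rating per lineup so that sigmoid(r_i - r_j) tracks success[i][j].

    success[i][j] is the simulated success probability of lineup i against
    lineup j; the diagonal is ignored. Ratings are centred on zero because
    only differences carry meaning.
    """
    n = len(success)
    ratings = [0.0] * n
    for _ in range(iterations):
        # Move each rating up if the lineup did better than currently implied.
        errors = [
            sum(success[i][j] - sigmoid(ratings[i] - ratings[j])
                for j in range(n) if j != i)
            for i in range(n)
        ]
        ratings = [r + step * e for r, e in zip(ratings, errors)]
        centre = sum(ratings) / n
        ratings = [r - centre for r in ratings]
    return ratings


# Invented simulation results for three lineups A, B and C.
simulated = [
    [0.50, 0.40, 0.25],
    [0.60, 0.50, 0.35],
    [0.75, 0.65, 0.50],
]
print(fit_ratings(simulated))
```

The update is a plain gradient step on a Bradley-Terry style likelihood; the step size and iteration count are arbitrary choices that happen to converge for small examples like this one.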
Once you create an algorithm, you try to break it. So I experimented with a variety of wildly different teams to try to find out whether the algorithm held only for strictly ordered teams or if it worked in general. I was most interested in extreme teams, as the original teams were pretty balanced, so I experimented with AOA teams, with very asymmetrical teams etc., playing both against the original teams and against each other. The results were surprising… up to a point. In general, the algorithm worked and the rating I devised was fairly robust. There were exceptions, though, and exactly where I expected them: very asymmetrical teams, facing opponents that hit their weak points, fared more poorly than their sector ratings would suggest. This was, in fact, more a relief than a shock, because it meant that there is a tactical aspect in Hattrick that matters, and quite a lot. It also meant, though, that there was a difference between the overall strength of a lineup and the strength of a lineup against a specific opponent’s lineup. This could then become a measure of tactical ability.
At this stage, the games I was simulating didn’t include tactics or special events. So I started exploring the effects of tactics and adding them to the simulator. Counter attacks (CA) and Long shots (LS) change the game mechanics, so I concluded that I needed to simulate matches against CA and LS teams as well. The other tactics could be simulated with smaller adjustments to the formulas. As for special events, I simulated their effect as well with an equivalent, but simplified, model, so that I would be able to add their effect to the rating if I could calculate an expected number of goals from special events.
Calculating expected goals from special events is far from easy, though. The probabilities shift as special events take place, and this makes calculations very, very difficult. It was the hardest part of the whole project, although in principle it should have been the easiest.
Since Base Power Ratings measure team strength against a generic opponent, some teams fare better than others – LS teams specifically, because they are strong against most other lineups, and Base Power Ratings don’t take into consideration that most teams don’t play their usual lineup when playing against LS teams. Base Power Ratings show you how strong your team would be if you played a certain lineup regardless of the opponent, and without the opponent adjusting his lineup to yours. This is not how one plays against long shots. And when tactics come into play, it’s not about team strength anymore – it’s about skill.
When you’re not considering your team in isolation, but compare it to a specific opponent, then the perspective changes. You don’t necessarily play your absolutely strongest lineup – you play the lineup that gives you the best chance to win. The difference in Base Power Ratings shows what the success probability would be – on average – if both teams played their default lineup regardless of the opponent. But when you have two lineups to compare, then you can calculate – preferably with Super Replays – the specific success probability for that match. And if a team which, based on Base Power Ratings alone, would have had a 50% success probability (“implied probability”), instead has a 70% success probability (“actual probability”), then its manager played well. If it has a 30% actual probability when it would have had a 50% implied probability, then it played badly. This difference between actual and implied success probability is the so-called Skill Power Rating.
So in this way you can isolate the three components of a result: the Base Power Rating measures raw lineup strength, the Skill Power Rating measures user skill, and if you add a sprinkle of randomness, you get the actual result.
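The two cases described above work out as follows. This is a minimal sketch that treats the Skill Power Rating as the raw difference between the two probabilities; the function name is made up, and any scaling or averaging over several matches that the real system may apply is not shown.

```python
def skill_power_rating(actual: float, implied: float) -> float:
    """Actual minus implied success probability for one match."""
    for p in (actual, implied):
        if not 0.0 <= p <= 1.0:
            raise ValueError("success probabilities must lie between 0 and 1")
    return actual - implied


# The two cases from the text: the same 50% implied probability, played well and badly.
played_well = skill_power_rating(actual=0.70, implied=0.50)   # about +0.20
played_badly = skill_power_rating(actual=0.30, implied=0.50)  # about -0.20
print(played_well, played_badly)
```

A positive value means the manager got more out of the match-up than the raw lineup strengths suggested, and a negative value means less.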
And what about team strength proper, not just lineup strength? Well, of course team strength is the strength of its strongest lineup in terms of Base Power Ratings, while team width and flexibility will be reflected in Skill Power Ratings. How do you find the strongest lineup of a team out of all the possible lineups, you ask? Well, this is a little mystery for another time. 🙂
What did I learn about Hattrick from this project?
Specialties matter, but not to the incredible extent that market prices for players with specialties suggest. They matter more in close matches, of course, but you have to consider whether the match would still be close with a better player without a specialty.
There is a tactical aspect in Hattrick that matters. Whether it matters enough is up for discussion. But when you give up some defence or attack to boost midfield, or the opposite – it matters. I have seen my share of matches where the success probability implied by Base Power Ratings was 0% and the actual success probability was 100%. Sure, it usually meant that the opponent made some catastrophic mistake – but it happens.
Tactic types matter… up to a point. Counter attacks are strong, but – as experienced users already know – the variability of results is higher when playing them. So, if you are the favourite, randomness can only hurt you, and you want to minimize variability. Therefore, counter attacks are, in general, a tactic for the underdogs. Surprisingly, or maybe not, long shots are a tactic type with high variability as well. So maybe this explains why long shots are strong enough to force any team to adapt its style of play against them, but not powerful enough to dominate the game. The other tactic types are significantly less strong and, I’d say, choices suited for occasional use.
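The argument that variability favours the underdog can be illustrated with a deliberately crude model. Assume, purely for illustration, that the favourite's goal margin is normally distributed around a positive expected margin; this is not how the match engine works, and draws are ignored. The favourite then wins whenever the margin is above zero.

```python
import math


def favourite_win_probability(mean_margin: float, spread: float) -> float:
    """P(margin > 0) when margin ~ Normal(mean_margin, spread), a toy model."""
    if spread <= 0:
        raise ValueError("spread must be positive")
    return 0.5 * (1.0 + math.erf(mean_margin / (spread * math.sqrt(2.0))))


# Same expected margin of one goal, with low and high variability of results.
low_variance = favourite_win_probability(mean_margin=1.0, spread=1.0)   # about 0.84
high_variance = favourite_win_probability(mean_margin=1.0, spread=2.0)  # about 0.69
print(low_variance, high_variance)
```

With the expected margin unchanged, doubling the spread takes a large bite out of the favourite's chances and hands it to the underdog, which is why a high-variance tactic suits the weaker side.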