How can we use data in soccer

In their book published last year, “The Numbers Game: Why Everything You Know About Soccer Is Wrong”, authors Chris Anderson and David Sally make every effort to do one thing: call for a revolution in soccer that would help it adapt to the modern world of “Big Data”, as other popular sports have done. They claim that soccer is probably the most old-fashioned and stubborn sport in the world. “That’s the way it’s always been done” are the seven words that dominate soccer.

I generally agree with them. Well, at least it’s the impression given by FIFA and UEFA… But why is soccer lagging behind?

Is it because soccer is a harder game? It is technically more difficult for the human body, and its rules give the game much more flexibility than, say, basketball or American football, in which every second of game time and every inch of the field is counted accurately. Too much fluidity and too little control make it complicated to collect, manipulate and apply the data.

Or it might just be a cultural issue. You may imagine that if the world of soccer were dominated by the US instead of Europe, it might already be enjoying the prosperity of data analytics, and even the Moneyball story could have been born from it rather than from baseball.

Whatever the reason, things now seem to be changing. As technologies for collecting and manipulating data develop faster than ever, and successful examples of using data in other sports appear one after another, I believe that more and more soccer professionals are ready to embrace the era of Big Data.

In this post, I want to talk about some ideas, inspired by the book, about how data analysis can be used in soccer. It’s not a book review. Besides the topics from the book, I’ll also write about my own ideas and things I saw elsewhere.

·            The Nature of the Game

Everybody wants to know more about the things they love, and that’s what data can be used for. Here are some facts that data can tell you about soccer, and not all of them are obvious.

1.     A game of randomness

One of the most fundamental properties of soccer is its randomness. Of course, every sport can be affected by random events, such as the direction of the wind, a shot hitting the post, or a player’s silly mistake. Usually we call them good or bad luck. In general, if you lose a soccer game, how much of it can you blame on bad luck? The book tells you: 50%. According to it, half of what affects the outcome of a game is chance. That is, if your opponent is at exactly the same level as you and you do everything right, you can still lose because of bad luck; on the other hand, even if you do everything wrong (but of course you are still playing soccer), you may still win by pure luck.

This is a qualitatively interesting observation, but I am not quite convinced by the 50% figure or by the way the authors calculated it. By comparing betting odds with actual results, the authors found that chance plays a larger role in soccer than in other team sports such as American football, basketball and hockey. But how reliable the betting odds are is an open question. Statistics has a method for estimating how much each factor contributes to a result: regression. You list all the factors you think may affect the result and put them into the regression function, even something like the color of some players’ boots or how many players pray before the game. Then you do the computation, and it tells you the “correlation” of each factor with the result. I am sure that the two factors I just mentioned have correlations close to zero; otherwise there must be something strange going on. The correlation (more precisely, its square) then tells you what fraction of the variation in the result can be explained by that factor. If you add up all the fractions and they come to less than 100%, something is left unexplained by the known factors. When you believe you have listed all the factors that can affect the result, the unexplained fraction can be attributed to chance.

Here, saying that 50% of the result is down to chance means that the contributions of all the known factors add up to only 50%. This is where my doubt lies, because it is very common to miss some important factors. The betting companies and analysts outside the club cannot see everything that is going on within the team. For example, the players’ physical condition in recent training should be an important factor in the game, but only the people inside the team know exactly how they are. In fact, one possible application of this method is to discover more of the factors that affect the game and so gain more control over it.
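
Here is a minimal sketch of that doubt, using only the Python standard library. Everything in it is invented for illustration: a match result is built from two independent factors (squad quality and training fitness) plus pure luck, and luck is set to exactly half of the variance by construction. Because the two factors are independent, their explained fractions can simply be added, which is the situation described above. It is a toy model, not the calculation the book performed.

```python
import random
from statistics import mean

def explained_fraction(xs, ys):
    """Squared correlation: share of the variance of ys explained by a
    straight-line fit on the single factor xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

rng = random.Random(42)
n_matches = 20000

# Hypothetical, independent ingredients of a result (say, goal difference).
quality = [rng.gauss(0, 1) for _ in range(n_matches)]      # visible to everyone
fitness = [rng.gauss(0, 1) for _ in range(n_matches)]      # visible only inside the club
luck = [rng.gauss(0, 2 ** 0.5) for _ in range(n_matches)]  # true chance, variance 2
result = [q + f + l for q, f, l in zip(quality, fitness, luck)]

by_quality = explained_fraction(quality, result)
by_fitness = explained_fraction(fitness, result)

# An insider knows both factors; an outsider only sees squad quality.
insider_chance = 1 - by_quality - by_fitness
outsider_chance = 1 - by_quality

print(f"chance as seen by an insider:  {insider_chance:.2f}")
print(f"chance as seen by an outsider: {outsider_chance:.2f}")
```

In this toy world the true share of luck is 50%, yet the outsider, who cannot see fitness, attributes roughly 75% of the result to chance. A missing factor always inflates the estimate of luck.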

2.     Goals are rare

Another observation is that goals in soccer are much rarer than in other team sports, with a team scoring once every 69 minutes on average. This hardly needs saying, because every fan knows how precious a goal is in soccer. And by the way, this property is related to randomness: the fewer events you have, the more random they appear.
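
The link between rarity and randomness can be made concrete with a small calculation. The sketch below assumes that goals arrive as a Poisson process, which is my assumption and a common simplification rather than something stated here; the "high-scoring sport" with about 100 scores per team is purely hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def relative_noise(expected_count):
    """Standard deviation divided by mean for a Poisson count."""
    return 1 / math.sqrt(expected_count)

# One goal every 69 minutes means this many goals per team in 90 minutes.
goals_per_match = 90 / 69
p_no_goal = poisson_pmf(0, goals_per_match)

soccer_noise = relative_noise(goals_per_match)
high_scoring_noise = relative_noise(100)  # hypothetical sport, ~100 scores per team

print(f"expected goals per team per match: {goals_per_match:.2f}")
print(f"chance a team fails to score:      {p_no_goal:.2f}")
print(f"relative noise, soccer:            {soccer_noise:.2f}")
print(f"relative noise, high-scoring:      {high_scoring_noise:.2f}")
```

Under this assumption the random fluctuation in a team's goal count is almost as large as the count itself, while in the hypothetical high-scoring sport it is only a tenth of it.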

The interesting thing is that, while different leagues may have dramatically different playing styles, the numbers of goals they score are not much different. The following is a graph from the book showing the average goals per game in Europe’s top 4 leagues. As one might expect, the Bundesliga has the most goals on average and Serie A has the fewest. But the difference is almost negligible.

[Figure: average goals per game in Europe’s top 4 leagues, from the book]

It looks as though the average number of goals has stayed almost constant from year to year over the past decade. It would also be interesting to see whether it has changed over a longer span of history. The following two graphs show goals per match in the English first tier from 1950 to 2010, taken from the book, and goals per match in the Spanish first division from 1928 to 2012, which I made myself.

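[Figures: goals per match in the English first tier, 1950 to 2010 (from the book); goals per match in the Spanish first division, 1928 to 2012 (made by the author)]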

They show a similar trend: in the 10 to 20 years before 1970, the number dropped dramatically from 3.5 or even 4 to a little more than 2.5, and after 1970 it stayed almost constant. The steep drop may not be directly related to the success of Catenaccio in Italy in the same period, but it definitely reflects the trend of teams beginning to pay more attention to tactics, especially defensive ones. Since then, despite many efforts to encourage attacking play, such as introducing the 3-1-0 points system and modifying the offside rule, the average number of goals has stayed this low, as if defense had infinite potential to neutralize whatever is tried. As for why the number did not become even smaller, maybe that is the limit that keeps soccer a game worth watching.

3.     Illusion and reality

A “streak” of good or bad luck does not exist. In fact, it is just an illusion formed by selective memory. The opposite is closer to the truth: when a player or a team performs at a much higher or lower level than usual, it is more likely that they will “bounce back” to their average level the next time. This is called “regression to the mean” in statistics.

So don’t blame a player or a team when they play badly in one or two games, and then attribute their better performance the next time to your criticism. Even if no one did anything, they would still play better, because that is just natural fluctuation.

Furthermore, we often see coaches sacked in the middle of the season, after which the team bounces back immediately from its lowest point, as if sacking the coach were always the correct choice. The truth is that coaches are usually sacked when their team hits bottom, and bouncing back is the next step no matter who the coach is. To verify this, we can compare teams with similarly bad runs of results that did and did not sack their coaches. They behave almost the same. (The graph is from a Dutch study cited by the book.) Thus, sacking coaches merely on the basis of several bad results is at best a placebo for the team’s illness.

[Figure: team performance after a bad run, with and without the coach being sacked, from a Dutch study cited by the book]
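
Regression to the mean is easy to reproduce in a toy simulation. In the sketch below, every team has exactly the same, never-changing level; the win, draw and loss probabilities are hypothetical, and nothing (no new coach, no criticism) happens between the two spells of four games. This is only an illustration of the statistical effect, not a reconstruction of the Dutch study.

```python
import random

rng = random.Random(0)

def points_per_game(n_games):
    """Average points over n_games for a team whose true level never changes.
    The win/draw/loss probabilities are hypothetical."""
    results = rng.choices([3, 1, 0], weights=[0.35, 0.30, 0.35], k=n_games)
    return sum(results) / n_games

# Each entry: (average over 4 games, average over the following 4 games).
spells = [(points_per_game(4), points_per_game(4)) for _ in range(20000)]

# Keep only the spells that begin with a slump: at most 2 points from 4 games.
slumps = [(before, after) for before, after in spells if before <= 0.5]

slump_avg = sum(before for before, _ in slumps) / len(slumps)
rebound_avg = sum(after for _, after in slumps) / len(slumps)

print(f"points per game during the slump: {slump_avg:.2f}")
print(f"points per game right after it:   {rebound_avg:.2f}")
```

The teams picked out for their slump return to the long-run average of 1.35 points per game in the next four matches, although nothing about them changed. Anyone who intervened in between would be tempted to take the credit.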

By the way, the book devotes a whole chapter to this issue, using Andre Villas-Boas’ time as Chelsea manager as a case study, and discusses how impossible it is to assess Villas-Boas’ ability because Abramovich had muddled all the factors that could be used to test him. Just as I was reading this chapter, Villas-Boas was sacked again, this time by Tottenham. Poor Andre… I don’t know since when England has become so impatient with managers.

·            Playing Styles

Now let’s turn our attention to what data can tell us about the most important thing in soccer: to win.

According to the book, the first pioneer of soccer analytics was an Englishman, Charles Reep, who, using pencil and paper, recorded every game he attended from 1953 to 1967, with detailed annotations of each pass, shot, goal and so on. From the data he collected, he tried to find the winning formula. He found that passing accuracy was merely 50% and that scoring efficiency was only 1 goal in 9 shots. He also studied scoring efficiency in different regions of the pitch. Then he drew his conclusion: passes and touches are useless, and teams should move the ball as quickly as possible to the opponent’s box in order to maximize efficiency. In other words, more long balls, less possession.

His data were correct, but his conclusion was too hasty. Data can only provide information; reaching a conclusion or making a decision requires much more than that. With the same data, different interpretations may even lead to opposite conclusions. In the case Reep faced, the number of goals equals the number of chances created (shots) multiplied by the efficiency, so instead of trying to increase efficiency, we can also increase the number of shots in order to score more goals, especially if increasing efficiency turns out to be much harder than creating more chances. It’s just simple math, but which way you go leads to a totally different kind of soccer.
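
The arithmetic can be spelled out in a few lines. The efficiency of 1 goal in 9 shots is Reep's figure quoted above; the 12 shots per game and the 10% improvements are hypothetical numbers chosen only to make the comparison visible.

```python
# goals = shots * efficiency
shots_per_game = 12.0   # hypothetical
efficiency = 1 / 9      # Reep's figure: 1 goal in 9 shots

baseline = shots_per_game * efficiency

# Two different routes to more goals, each a 10% improvement.
more_efficient = shots_per_game * (efficiency * 1.10)
more_shots = (shots_per_game * 1.10) * efficiency

print(f"baseline goals per game:   {baseline:.3f}")
print(f"with 10% more efficiency:  {more_efficient:.3f}")
print(f"with 10% more shots:       {more_shots:.3f}")
```

Because the relation is a plain product, a 10% gain on either side buys the same number of goals. The data alone cannot say which route to take; that depends on which improvement is cheaper to achieve on the pitch.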

It turns out that the best way to create chances is to keep possession, and the best way to keep possession is through short passes. Not to mention that keeping possession is also the best way of defending. The following two graphs from the book show that possession has a strong positive correlation with points earned in a season, and that more short passes are associated with a higher rank in the league.

[Figures: possession versus points earned in a season; short passes versus league rank (both from the book)]

That is why FC Barcelona has been so successful in recent years that it has become a model many other teams are learning from. Passing has become more and more important, and passing skill has improved a lot over the years. The following two graphs compare how often passing sequences of a given length are completed in Reep’s data and in today’s Premier League. A sequence of more than 7 consecutive passes is much easier to achieve nowadays than in the 1950s.

[Figures: successful passing sequences in Reep’s data and in today’s Premier League]

But be careful: we don’t want to make the same mistake Reep made, that of being too hasty. What is said above does not mean that Reep was totally wrong or that there is only one legitimate style of soccer. First, correlation is not causation, so this is no proof that fighting for possession is the cause of winning. The correlation between possession and winning may arise because teams win more when they have better players, and better players are also better at keeping possession. In fact, possession cannot be controlled directly; you can only try to keep it. How hard you try is what you can plan. If we really want to study how this choice of style is related to winning, we have to find a way to measure “trying to keep possession” that is decoupled from the ability of the players. Second, you also choose according to your opponent. If you are not the better team, wisely going against the tide may sometimes pay off even more. Stoke City is an example. They are famous for long balls and have the lowest possession percentage in the Premier League, yet they always manage to stay in the middle of the table on a limited budget. That is why there are always some coaches who share Reep’s ideas.

Anyway, we see that data can help us explore tactical ideas in soccer. But we need to be very cautious about this part. And also don’t forget that soccer is not only about winning, but also about esthetics, culture and spirit, about which the data may have little to say.

·            Assessing Players

This is what Moneyball is about, and it may be the area in which data are most widely used across all sports. Looking at the scoring table or some technical figures to compare players is not only fun for many fans but also an important part of the work of coaches and scouts. In fact, I would have expected this kind of thing to have been happening for a long time in soccer, so I was astonished when I read the story of the Brazilian player Carlos Henrique Kaiser, who, as a striker, did not score a single goal in his 20-year career as a footballer but managed to transfer among many Brazilian and European clubs, and was even treated by some media as a star. So those clubs in the 1980s never looked at players’ data?

Of course, that could never happen again today. Instead, the story would be that Abramovich looked at his iPad, with some fancy data analysis software installed, and ruled out Falcao as an unqualified transfer candidate. (Unverified rumor)

I don’t know how clubs use data to assess players, but no doubt their methods are much more complicated than the tables we usually see in the media. Oversimplified data can be misleading. Here is an example from the book of how adding one simple extra consideration can make a difference.

We assume that the number of goals scored is the most important figure for a striker. Is that always true? Even if we put aside all the other things strikers do, such as assists and key passes, there is still a caveat: the goals they score in different games are not equal in value. It’s not hard to agree that a goal in a 1-0 win is more valuable than a goal in a 6-0 win. It’s also reasonable to assume that it is more difficult to score in the 1-0 win than in the 6-0 win, so the former demonstrates higher quality in the player. Thus we want to include this effect and calculate a weighted scoring table for the players.

The book calculated the average increase in points brought by each additional goal scored.

[Figure: average increase in points with each goal scored, from the book]

With this data, they adjusted the Premier League scoring tables for the 09-10 and 10-11 seasons and found that Darren Bent, who transferred from Sunderland to Aston Villa, was very consistent (ranked #2 in both seasons), while Fernando Torres, who transferred from Liverpool to Chelsea, dropped dramatically from #5 to #19. Had Abramovich known this ranking method, maybe he would have considered buying Bent at half the price of Torres instead. Hopefully this misfortune will not happen to him again, since he has now got his nice software.
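
A weighted scoring table of this kind could be sketched as follows. The marginal point values below are placeholders that I made up, not the figures from the book, and the two strikers are invented; the book's exact procedure is not reproduced here. The idea is only that each goal is worth the average number of league points that a team's n-th goal in a match adds.

```python
# Hypothetical marginal value, in league points, of a team's n-th goal in a match.
# These numbers are placeholders, NOT the figures from the book.
MARGINAL_POINTS = {1: 0.8, 2: 1.0, 3: 0.6, 4: 0.3}
LATE_GOAL_POINTS = 0.1  # assumed value for a 5th or later goal

def weighted_goals(goal_orders):
    """Sum the marginal points of each goal. goal_orders lists, for every goal
    a striker scored, whether it was his team's 1st, 2nd, ... goal of the match."""
    return sum(MARGINAL_POINTS.get(n, LATE_GOAL_POINTS) for n in goal_orders)

# Two invented strikers.
strikers = {
    "Striker X": [1, 1, 2, 1],     # four goals, all in tight games
    "Striker Y": [3, 4, 5, 6, 5],  # five goals, mostly in routs
}

table = sorted(strikers, key=lambda name: weighted_goals(strikers[name]), reverse=True)
for name in table:
    goals = strikers[name]
    print(f"{name}: {len(goals)} goals, weighted value {weighted_goals(goals):.1f}")
```

Striker Y leads the raw scoring table with five goals to four, but Striker X comes out on top once each goal is weighted by what it was worth to the team.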

Another good example I’ve seen of assessing players is the website GoalImpact. They use a comprehensive algorithm to calculate a “goal impact” index for each player, which broadly shows both how well his team plays and how much his team depends on him. One interesting result is their comparison of the goal impact ranking with the FIFA Ballon d’Or best player list.

Let’s look at the Oct 2012 – Oct 2013 ranking (the second table). Many Bayern players rank very high, which is reasonable since the overall score of their team is high; Ronaldo is ranked almost at the top, which is interesting considering that Real Madrid did not win any important title last year; and Messi’s rank is understandable considering his injury. The most surprising is the rank of Ribery, which is much lower than I expected. Even within Bayern he ranks only #5, which means Bayern really did not depend on him that much. Seeing this, I realized that this year’s Ballon d’Or may not be that outrageous. Or at least it shows that Bayern is the team that works most collectively, and it’s really hard to choose one representative from them for an individual award.

While working out which players are better than others is a hot topic for fans and media, the professionals usually need to know more. It’s not only a simple comparison of the quality of players, but also whether a player has the style they need. A linear table cannot tell them this; instead, they need multi-dimensional results. I’ll return to this soon.

·            The Truly Big Data

As statisticians and economists, the authors of the book propose many ways to retrieve meaningful information from soccer data. But most of them are traditional statistical methods, and the data they use are in fact not that “big” in the sense of “big data”.

With 22 players and a ball moving on a pitch of 68 × 105 meters over a span of 90 minutes or more, soccer itself is a generator of big data. The coordinates of the players and the ball, and all the events happening on the pitch, can be recorded in real time. Since last year, the NBA has begun to apply this technique on all its basketball courts using a motion-tracking system. In soccer, companies like Opta are already trying to collect as much data as possible by watching recorded videos, but it is more complicated because the soccer pitch is much larger, making it harder for cameras to catch every corner. I’ve heard that wearable GPS is already used in training at some clubs. If it could be used in games to collect data, that would be perfect.

Why do we need so much data? One application is to study a dynamical model of the game. There is an interesting example in basketball that studies the “state of transition”, which can be used to estimate players’ contributions to the game in a more comprehensive way. Of course, these are much more complicated projects than just making a few analysis graphs.

What I think is most useful, and also feasible with today’s data and techniques, is to study players’ detailed profiles using machine learning. As I mentioned above, a linear comparison of players is not of much use; we need more features to characterize a player. Dividing players into groups such as forwards, midfielders and defenders is a stereotype, because each group can have different subtypes, and players from different groups may be more similar in some respects. It’s impossible to make a perfect division and comparison of all players in one’s head, but we can assign this work to computers. As long as we provide enough features to characterize each player’s behavior on the pitch, computers can cluster the players into data-based groups. These groups can be more detailed than we can imagine, and they can even reveal new types that we have never thought of before. Similar work has been done in the NBA (again!), and I would not be surprised if Opta or other data companies were already doing this. It’s especially useful to build such a database for young players, so that when scouts want a certain kind of player they know which groups to look into, and it’s much easier for them to find a bargain. This is Moneyball in the big data era.
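
To give a feel for what such clustering looks like, here is a minimal sketch using plain k-means written with the Python standard library. k-means is simply one common clustering technique that I picked for illustration; the six players, their three features and the choice of three groups are all invented, and a real system would use many more features and a proper library.

```python
import math
from statistics import mean, pstdev

# Hypothetical per-90-minute profiles: (passes, shots, tackles).
players = {
    "Player A": (70, 0.5, 3.0),
    "Player B": (65, 0.7, 2.5),
    "Player C": (25, 4.0, 0.5),
    "Player D": (30, 3.5, 0.8),
    "Player E": (40, 0.3, 5.5),
    "Player F": (45, 0.4, 5.0),
}

def standardize(rows):
    """Rescale every feature to mean 0 and standard deviation 1,
    so that no feature dominates just because of its units."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, n_iter=20):
    """Plain k-means: assign points to the nearest center, then move each
    center to the mean of its members, and repeat."""
    for _ in range(n_iter):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            updated.append(tuple(mean(col) for col in zip(*members)) if members else old)
        centers = updated
    return [nearest(p, centers) for p in points]

names = list(players)
points = standardize([players[n] for n in names])

# Start from three of the players to keep the sketch deterministic;
# in practice one would use several random restarts.
labels = kmeans(points, centers=[points[0], points[2], points[4]])

groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)
print(groups)
```

With these made-up profiles the algorithm separates the passers, the shooters and the tacklers without being told any positions. The interesting cases in real data would be the groups that do not match the usual labels.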

With such detailed data on player behavior, game simulation can be made more realistic. Well, I am not really interested in improving video games, but what if someday video games could help the games on the pitch? It’s true that soccer is full of randomness, but that does not mean we have no control. The result cannot be determined in advance, but if you imagine playing the same game under the same conditions millions of times, the results would form a distribution, and then we could see which result is the most likely. Of course we can never do that in reality, but computers can simulate it easily in a short time (you don’t need to watch the simulations as real-time images, so they can run much faster). So if a coach is not sure whether he should play three or four defenders, or whether he should use player A or player B in some position, or he just wants to try out some crazy experiment, don’t worry, do some simulations! And if the computer tells you that plan A is 1.5 times as likely to win as plan B at a 95% confidence level, then you know which one to choose, right? Of course, all of this relies on the accuracy of the simulation, and there is still a long way to go there. But I believe that as we understand more about soccer data, none of it is unimaginable in the near future.
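
A very crude version of this idea fits in a few lines. The sketch below is a Monte Carlo simulation in which each team's goals in a match are drawn from a Poisson distribution; both the Poisson model and the expected-goal numbers for the two plans are my assumptions, chosen only for illustration. A realistic simulator would have to model the players themselves, which is exactly the hard part.

```python
import math
import random

rng = random.Random(7)

def poisson_sample(lam):
    """Draw a Poisson-distributed goal count (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def win_rate(goals_for, goals_against, n_sims):
    """Fraction of simulated matches in which we outscore the opponent."""
    wins = sum(poisson_sample(goals_for) > poisson_sample(goals_against)
               for _ in range(n_sims))
    return wins / n_sims

n_sims = 20000
# Hypothetical expected goals (scored, conceded) under each plan.
p_a = win_rate(1.5, 1.0, n_sims)  # plan A
p_b = win_rate(1.2, 1.2, n_sims)  # plan B

# 95% confidence interval for the difference (normal approximation).
diff = p_a - p_b
std_err = math.sqrt(p_a * (1 - p_a) / n_sims + p_b * (1 - p_b) / n_sims)
ci_low, ci_high = diff - 1.96 * std_err, diff + 1.96 * std_err

print(f"win probability, plan A: {p_a:.3f}")
print(f"win probability, plan B: {p_b:.3f}")
print(f"difference: {diff:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

With these invented inputs plan A comes out ahead, and the confidence interval for the difference stays clear of zero, so the gap is not just simulation noise. Whether the inputs themselves can be trusted is a separate question that no number of simulated matches can answer.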

·            So will there be a revolution?

I don’t know how dramatic a change has to be to be called a revolution, but it certainly will not happen overnight. Unlike in the age of Moneyball, both the data and the methods of analysis have now become much more complicated and diverse, and the data “industry” has produced a huge market in which you can shop for your own needs. It’s possible that the use of data will bloom simultaneously in various aspects of soccer, and different clubs may benefit in different areas depending on what they can afford in money and time. While the bigger clubs have more money to invest in refining tactics or long-term training plans, the smaller clubs may focus on looking for bargain young talent, because that is the investment that can pay off immediately.

It’s also to be expected that there will be many failed experiments and useless inventions, as in any innovative field. Confidence and patience are what we need. I heard that Liverpool is the Premier League club that has invested most in data analytics in recent years. It has not seemed to be successful so far, but maybe it is beginning to pay off this season. Anyway, we still need to wait and see.

“Everything you know about soccer is wrong”, as the title of the book suggests? I don’t think so. At least for me, many things I’ve been told still stand firm in the face of the data. Data analysis is a continuation of, rather than a rebellion against, the traditional discipline of soccer. Most of the time you’ll see that it just tells you what you already know, but with more detail about the hows and whys.

There is a concern that too much data analysis could water down the fun of soccer. That may be true for some people, as different people have different definitions of fun. But for me, it just adds a new way of enjoying soccer. To show you what I mean, I want to invite every reader to enjoy a talent show. Before you watch the video, let me tell you that what lies behind this woman’s performance is very simple physics: the center of gravity. The length and weight of each piece are accurately measured, and the overlapping positions are calculated and marked. What she does is simply put the pieces together in a strict order. Every motion is a well-programmed procedure. However, these technical details can never take away from the great elegance of the show. The music, her concentrated eyes and the message of the feather all draw my mind into an almost sacred atmosphere. I have watched it many times. And in addition to feeling amazed, I understand why it’s so amazing, and why it’s a brilliant show.

It’s the same for soccer. Data gives you more insight into the game, and you’ll be more sure of what you mean when you say: What a brilliant game!
