In their book published last year, “The Numbers Game: Why Everything You Know About Soccer Is Wrong”, authors Chris Anderson and David Sally set out to do one thing: call for a revolution in soccer that would help it adapt to the modern “Big Data” world, as other popular sports already have. They claim that soccer is probably the most old-fashioned and stubborn sport in the world. “That’s the way it’s always been done” are the seven words that dominate soccer.
I generally agree with them. Well, at least that’s the impression given by FIFA and UEFA… But why is soccer lagging behind?
Is it because soccer is a harder game? It’s technically more difficult for the human body, and the rules give the game much more flexibility than, say, basketball or American football, in which every second of game time and every inch of the field is counted accurately. Too much fluidity and too little control make it complicated to collect, manipulate and apply the data.
Or it might just be a cultural issue. You can imagine that if the world of soccer were dominated by the US instead of Europe, it might already be enjoying the benefits of data analytics, and the Moneyball story might even have been born in soccer rather than in baseball.
Whatever the reason, things now seem to be changing. As technologies for collecting and manipulating data develop faster than ever, and successful examples of using data in other sports appear one after another, I believe that more and more soccer professionals are ready to embrace the era of Big Data.
In this post, I want to talk about some ideas, inspired by the book, on how data analysis can be used in soccer. It’s not a book review. Besides the topics from the book, I’ll also write about my own ideas and things I’ve seen elsewhere.
Do you still remember the legendary World Cup predictor Paul the Octopus? In his short life (may he rest in peace) he did not miss a single prediction of the World Cup games. Now we have the ape Eli, who has correctly predicted the results of the last 6 Super Bowls.
These are supernatural creatures sent by God, right? Otherwise how could they be so accurate?
Well, don’t be blown away. In fact, with just a bit of common sense about probability, you can create such a super predictor yourself, no matter whether it’s a cat, a dog or any other animal.
Let’s start with 64 animals. In the first year, you let them predict the Super Bowl, and the most likely outcome is that 32 animals pick team A as the winner and the other 32 pick team B. Report the 32 or so animals that are correct to the media, keep them, and discard the 32 or so that are wrong. In the second year, repeat with the remaining 32 animals; the most likely result is that 16 of them are correct, and those are kept. Repeat this procedure year after year (8, 4, 2), and after the 6th year it’s most likely that exactly one animal is left. Now you can report to the media: here we have an ape (or whatever animal) that has correctly predicted all of the last 6 Super Bowls!
Not that amazing any more, right? Because you know that, according to the laws of probability, even if the animals know nothing about what they are doing and just pick sides randomly, you can still get this result. That is how this whole thing works! Have you seen any reports on an animal predictor that was right before but is wrong this time? Maybe a few, but the majority will not get reported because they are not interesting anymore. Only those that keep getting it right get reported in the media. With this selection, we can always generate “super predictors” from a bunch of animals out there. But can you trust these super predictors more than others? No, because they always make the choice randomly. There is nothing special about them at all.
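To make the argument concrete, here is a minimal Python sketch of the selection procedure described above. It assumes every animal picks the winner with probability 1/2, independently of everything else; the function names, the seed and the number of trials are my own choices for illustration, not anything from the book.

```python
import random


def count_perfect_predictors(n_animals=64, n_years=6, rng=None):
    """Return how many animals guess every game right by pure chance."""
    rng = rng or random.Random()
    survivors = n_animals
    for _ in range(n_years):
        # Each surviving animal picks a side at random (assumption:
        # it is right with probability 1/2); wrong ones are discarded.
        survivors = sum(1 for _ in range(survivors) if rng.random() < 0.5)
    return survivors


def chance_of_at_least_one(n_animals=64, n_years=6):
    """Exact probability that at least one animal is right every year."""
    p_single = 0.5 ** n_years  # one animal being right n_years in a row
    return 1 - (1 - p_single) ** n_animals


rng = random.Random(2014)  # arbitrary seed, only for reproducibility
trials = 10_000
hits = sum(count_perfect_predictors(rng=rng) > 0 for _ in range(trials))
print(f"simulated share of runs with a 'super predictor': {hits / trials:.3f}")
print(f"exact probability: {chance_of_at_least_one():.3f}")
```

Under these assumptions, a single animal has a 1 in 64 chance of a perfect six-year record, but a group of 64 random guessers produces at least one such record in roughly 63% of runs. The "miracle" comes from the selection, not from the animal.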
So the next time you want to bet on a game (although I generally advise you to avoid this practice, in which you always lose in a statistical sense), you’d better trust your gut rather than those super predictors.
This analysis was done BEFORE I began to learn any systematic techniques in statistics or data mining. Many aspects of this article are rough and need much improvement. But it was the starting point of my passion for data analytics, especially football analytics. So I’ll start my blog with it.
The motive of this analysis is to settle, with data, the debates over some issues from Barça’s last season. Data analysis can compensate for the vague impressions, short memories and biased opinions that usually come with qualitative analysis. I believe the team itself has a much more advanced and comprehensive data analytics system; compared to that, this analysis is very simple and crude. But even with only basic tools and the limited data available online, we can at least get a general idea.
Here I analyze the playing minutes of Barça’s first team players over the last five years, in order to understand the issues of rotation, age structure and the situation of the homegrown players.
All data are taken from the Spanish football database at http://www.bdfutbol.com
Before going into the analysis, let’s define one concept for convenience: the first team player. We follow a practical definition rather than the official status: those, and only those, who play at least 90 minutes in total across all the official games of the first team in a season are counted as first team players.
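As a rough sketch, the definition above could be applied in Python like this. The input format (a dict of total official minutes per player) and the player names are hypothetical placeholders, not real data from the database; only the 90-minute inclusive threshold comes from the definition.

```python
# Inclusive threshold taken from the definition above.
FIRST_TEAM_MIN_MINUTES = 90


def first_team_players(minutes_by_player, threshold=FIRST_TEAM_MIN_MINUTES):
    """Keep only players with at least `threshold` total official minutes."""
    return {
        name: minutes
        for name, minutes in minutes_by_player.items()
        if minutes >= threshold  # ">=" because exactly 90 minutes counts
    }


# Hypothetical season totals, for illustration only.
season = {"Player A": 3120, "Player B": 90, "Player C": 45}
print(sorted(first_team_players(season)))  # ['Player A', 'Player B']
```

A player with exactly 90 minutes is kept, and anyone below that is treated as not part of the first team for that season.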