In “The Numbers Game – Why Everything You Know About Soccer Is Wrong”, published last year, the authors Chris Anderson and David Sally make every effort to do one thing: call for a revolution in soccer that would help it adapt to the modern “Big Data” world, as other popular sports have. They claim that soccer is probably the most old-fashioned and stubborn sport in the world. “That’s the way it’s always been done” are the seven words that dominate soccer.

I generally agree with them. Well, at least that’s the impression given by FIFA and UEFA… But why is soccer lagging behind?

Is it because soccer is a harder game? It is technically more demanding for the human body, and its rules leave the game much more fluid than, say, basketball or American football, in which every second of game time and every inch of the field is accounted for. Too much fluidity and too little control make it complicated to collect, manipulate and apply the data.

Or it might just be a cultural issue. If the world of soccer were dominated by the US instead of Europe, it might already be enjoying the prosperity of data analytics, and the Moneyball story might even have been born from soccer rather than from baseball.

Whatever the reason, things now seem to be changing. As technologies for collecting and manipulating data develop faster than ever, and successful examples of using data in other sports appear one after another, I believe that more and more soccer professionals are ready to embrace the era of Big Data.

In this post, I want to talk about some ideas, inspired by the book, about how data analysis can be used in soccer. It’s not a book review. Besides topics from the book, I’ll also write about my own ideas and things I have seen elsewhere.

Everybody wants to know more about the things they love, and that’s what data can be used for. Here are some facts that data can tell you about soccer, and not all of them are obvious.

One of the most fundamental properties of soccer is its randomness. Of course, every sport can be affected by random factors, such as the direction of the wind, a shot hitting the post, or a player’s silly mistake. Usually we call them good or bad luck. If you lose a soccer game, how much of it can you blame on bad luck? The book tells you: 50%. According to it, half of what affects the outcome of a game is chance. That is, if your opponent is at exactly the same level as you, and you do everything right, you can still lose because of bad luck; on the other hand, even if you do everything wrong (but of course you are still playing soccer), you may still win by pure luck.

This is a qualitatively interesting observation, but I am not quite convinced by the 50% figure or by the way the authors calculated it. By comparing the betting odds to the actual results, the authors found that chance plays a larger role in soccer than in other team sports such as American football, basketball and hockey. But how reliable the betting odds are is an open question. Statistics has a method for estimating how much different factors contribute to a result: regression. You list all the factors you think may affect the result and put them into the regression, even something like the color of some players’ boots or how many players pray before the game; then you do the computation, and it tells you the “correlation” of each factor with the result. I am sure that the two factors I just mentioned have correlations close to zero; otherwise there must be something strange going on. The correlation (more precisely, its square) then tells you what fraction of the variation in the result can be explained by that factor. If you add up all the fractions and they come to less than 100%, something is left unexplained by the known factors. When you believe you have listed all the factors that can affect the result, the unexplained fraction can be attributed to chance.
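To make the idea of “explained” and “unexplained” fractions concrete, here is a minimal sketch for a single factor. It is not the book’s calculation: the function name and the toy numbers are made up for illustration, and with several correlated factors you would need a proper multiple regression rather than simply adding up squared correlations.

```python
def explained_fraction(factor, result):
    """Squared Pearson correlation: the share of the variance in `result`
    that a straight-line fit on `factor` accounts for."""
    n = len(factor)
    mean_f = sum(factor) / n
    mean_r = sum(result) / n
    cov = sum((f - mean_f) * (r - mean_r) for f, r in zip(factor, result))
    var_f = sum((f - mean_f) ** 2 for f in factor)
    var_r = sum((r - mean_r) ** 2 for r in result)
    r = cov / (var_f * var_r) ** 0.5
    return r * r


# Hypothetical toy data, not real match data.
goal_difference = [2, 4, 6, 8]
perfect_factor = [1, 2, 3, 4]        # moves exactly in step with the result
irrelevant_factor = [1, -1, -1, 1]   # e.g. boot color coded as +1 / -1

explained = explained_fraction(perfect_factor, goal_difference)
unexplained = 1 - explained_fraction(irrelevant_factor, goal_difference)
```

The first factor explains everything and the second explains nothing. Real factors sit somewhere in between, and whatever is left over is what gets labelled “chance”.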

Here, saying that 50% of the result is down to chance means that the contributions of all the known factors add up to only 50%. This is where my doubt lies, because it is very common to miss some important factors. The betting companies and analysts outside the club cannot see everything that is going on within the team. For example, the players’ physical state in recent training should be an important factor in the game, but only the people inside the team know exactly how they are. In fact, one possible application of this method is to discover more of the factors that affect the game and so gain more control over it.

Another observation is that goals in soccer are much rarer than scores in other team sports, with a team scoring once every 69 minutes on average. This hardly needs saying, because every fan knows how precious a goal is in soccer. And by the way, this property is related to randomness, in that the fewer events you have, the more random they appear.

The interesting thing is that, while different leagues may have dramatically different playing styles, the numbers of goals they score are not much different. The following is a graph from the book showing the average goals per game in Europe’s top 4 leagues. As one would expect, the Bundesliga has the most goals on average and Serie A the fewest. But the difference is almost negligible.

It looks as if, over the past decade, the average number of goals has remained almost constant from year to year. It would also be interesting to see whether it has changed over a longer stretch of history. The following two graphs show goals per match in the English first tier from 1950 to 2010, taken from the book, and goals per match in the Spanish first division from 1928 to 2012, which I made myself.

They show a similar trend: in the 10-20 years before 1970, the number dropped dramatically from 3.5 or even 4 to a little more than 2.5, and after 1970 it stayed almost constant. The steep drop may not be directly related to the success of Catenaccio in Italy in the same period, but it definitely reflects the trend of teams paying more attention to tactics, especially defensive ones. Since then, despite many efforts to encourage attacking, such as introducing the 3-1-0 point system and modifying the offside rule, the average number of goals has stayed this low, as if there were infinite potential in defense, so that whatever you try, it can always neutralize the effort. As for why the number did not become even smaller, maybe this is the limit that keeps soccer a game worth watching.

“A streak” of good or bad luck does not exist. It is just an illusion formed by selective memory. In fact the opposite is closer to the truth: when a player or a team performs at a much higher or lower level than usual, they are more likely to “bounce back” towards their average level the next time. This is called “regression to the mean” in statistics.

So don’t blame a player or a team when they play badly in one or two games and then attribute their better performance the next time to your criticism. Even if no one had done anything, they would still have played better, because that is just natural fluctuation.

Furthermore, we often see coaches being sacked in the middle of the season and the team then bouncing back immediately from its lowest point, as if sacking the coach were always the right choice. The truth is that coaches are usually sacked when their team hits bottom, and bouncing back is the natural next step no matter who the coach is. To verify this, we can compare teams with similarly bad results, with and without the coach being sacked. They behave almost the same. (The graph is from a Dutch study cited by the book.) Thus, sacking a coach merely on the basis of several bad results is at best a placebo for the team’s illness.
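The bounce-back effect is easy to reproduce with made-up numbers. The sketch below is my own toy model, not the Dutch study: each team’s form is a stable quality plus fresh luck, the worst 10% are treated as “in crisis”, and half of them are labelled as having sacked their coach even though the label changes nothing. All names and parameters are assumptions for illustration.

```python
import random


def average(values):
    return sum(values) / len(values)


def simulate_bounce_back(n_teams=10000, seed=0):
    rng = random.Random(seed)
    teams = []
    for _ in range(n_teams):
        quality = rng.gauss(0, 1)            # stable team quality
        before = quality + rng.gauss(0, 1)   # recent form = quality + luck
        after = quality + rng.gauss(0, 1)    # next spell, with fresh luck
        teams.append((before, after))

    # Teams "in crisis": the worst 10% by recent form.
    teams.sort(key=lambda team: team[0])
    crisis = teams[: n_teams // 10]

    # Label every other crisis team as "sacked"; the label has no effect.
    sacked = crisis[0::2]
    kept = crisis[1::2]
    return {
        "crisis_form": average([before for before, _ in crisis]),
        "sacked_next": average([after for _, after in sacked]),
        "kept_next": average([after for _, after in kept]),
    }


result = simulate_bounce_back()
```

In this model both groups recover by about the same amount, purely because extreme bad luck rarely repeats. A real comparison would of course need real results.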

By the way, the book devotes a whole chapter to this issue, taking Andre Villas-Boas’ spell as Chelsea coach as its case study, and discusses how impossible it is to assess Villas-Boas’ ability because Abramovich had scrambled all the factors that could be used to test him. Exactly when I was reading this chapter, Villas-Boas was sacked again, this time by Tottenham. Poor Andre… I don’t know when England became so impatient with managers.

Now let’s turn our attention to what data can tell us about the most important thing in soccer: to win.

According to the book, the first pioneer of soccer analytics was an Englishman, Charles Reep, who, using pencil and paper, recorded every game he went to from 1953 to 1967, with detailed annotation of every pass, shot, goal and so on. From the data he collected, he tried to find the winning formula. He found that passing accuracy was merely 50% and that scoring efficiency was only 1 goal in 9 shots. He also studied the scoring efficiency in different regions of the pitch. Then he drew his conclusion: passes and touches are useless, and teams should move the ball as quickly as possible into the opponent’s box in order to maximize efficiency. In other words, more long balls, less possession.

His data were correct, but his conclusion was too hasty. Data can only provide information; reaching a conclusion or making a decision requires much more than that. With the same data, different interpretations may even lead to opposite conclusions. In Reep’s case, the number of goals equals the number of chances created (shots) multiplied by the efficiency, so instead of trying to increase efficiency, we can also increase the number of shots in order to score more goals, especially if increasing efficiency turns out to be much harder than creating more chances. It’s just simple math, but which way you go leads to totally different soccer.
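The simple math can be written out in a few lines. The 1-in-9 conversion rate is Reep’s figure from above; the shot counts and the improved 1-in-6 rate are hypothetical numbers chosen only to make the arithmetic clean.

```python
from fractions import Fraction


def goals(shots, conversion):
    """Goals = chances created (shots) x efficiency (goals per shot)."""
    return shots * conversion


reep_conversion = Fraction(1, 9)    # from the text: 1 goal in 9 shots

baseline = goals(18, reep_conversion)           # hypothetical 18 shots
more_shots = goals(27, reep_conversion)         # create 50% more chances
better_finishing = goals(18, Fraction(1, 6))    # or finish 50% better
```

Both routes take you from 2 goals to 3. The data alone do not say which route is easier, and that is exactly where Reep’s conclusion went further than his numbers.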

It turns out that the best way to create chances is to keep possession, and the best way to keep possession is through short passes. Not to mention that keeping possession is also the best way of defending. The following two graphs from the book show that possession has a strong positive correlation with points earned in a season, and that more short passes are associated with a higher rank in the league.

That’s why FC Barcelona has been so successful in recent years that it has become a paradigm many other teams are learning from. Passing has become more and more important, and passing skill has improved a lot over the years. The following two graphs compare the rates of successful passing in Reep’s data and in today’s Premier League. Stringing together more than 7 consecutive passes is much easier nowadays than in the 1950s.

But be careful: we don’t want to make the same mistake Reep made and be too hasty. What is said above does not mean that Reep was totally wrong or that there is only one legitimate style of soccer. First, correlation is not causation, so this is no proof that fighting for possession is the cause of winning. The correlation between possession and winning may arise because teams win more when they have better players, and better players are also better at keeping possession. In fact, possession cannot be directly controlled; you can only try to keep possession. How hard you try is what you can plan. If we really want to study how this choice of style is related to winning, we have to find a way to measure “trying to keep possession” that is decoupled from the ability of the players. Second, you also make the choice according to your opponent. If you are not the better team, wisely going against the tide may sometimes pay off even more. Stoke City is an example. They are famous for long balls and have the lowest possession percentage in the Premier League, yet they consistently manage to stay in the middle of the table on their limited budget. That’s why there are always some coaches who share ideas with Reep.

Anyway, we see that data can help us explore tactical ideas in soccer. But we need to be very cautious about this part. And also don’t forget that soccer is not only about winning, but also about esthetics, culture and spirit, about which the data may have little to say.

This is what Moneyball is about, and it may be the area in which data are most widely used across all sports. Looking at the scoring table or some technical figures to compare players is not only fun for many fans, but also an important part of the work of coaches and scouts. I had assumed this kind of thing had been going on for a long time in soccer, so I was astonished when I read the story of the Brazilian player Carlos Henrique Kaiser who, as a striker, did not score a single goal in his 20-year career as a footballer but still managed to move among many Brazilian and European clubs, and was even treated by some media as a star. So did those clubs in the 1980s never look at the players’ data?

Of course, today that could never happen again. Instead, the story would be that Abramovich looked at his iPad, with some fancy data analysis software installed, and ruled out Falcao as an unqualified transfer candidate. (Unverified rumor)

I don’t know how the clubs use data to assess players, but no doubt their methods are much more complicated than the tables we usually see in the media. Oversimplified data can be misleading. Here is an example from the book of how adding one simple extra consideration can make a difference.

We assume that the number of goals scored is the most important figure for a striker. Is that always true? Even if we put aside all the other duties of a striker, such as assists and key passes, there is still a caveat: the goals they score in different games are not equal in value. It’s not hard to agree that a goal in a 1-0 win is more valuable than a goal in a 6-0 win. And it’s also reasonable to assume that it is more difficult to score in the 1-0 win than in the 6-0 win, so that the former demonstrates higher quality in the player. Thus we want to include this effect and calculate a weighted scoring table for the players.

The book calculated the average increase in points that comes with each goal scored.
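I don’t have the book’s table of averages at hand, so here is a deliberately simpler stand-in that captures the same intuition: value a team’s final goal by the league points that would be lost if it were taken away, under the 3-1-0 point system. This is not the book’s method, and the function names are my own.

```python
def points(goals_for, goals_against):
    """League points under the 3-1-0 system."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0


def last_goal_value(goals_for, goals_against):
    """Points the team would lose if its final goal were removed."""
    if goals_for == 0:
        return 0
    return points(goals_for, goals_against) - points(goals_for - 1, goals_against)


winner_in_1_0 = last_goal_value(1, 0)   # turns a draw into a win
sixth_in_6_0 = last_goal_value(6, 0)    # the game was already won
```

The only goal in a 1-0 win is worth 2 points, while the sixth goal in a 6-0 win is worth nothing, which is the difference the raw scoring table hides.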

With these data, they reworked the Premier League scoring table for seasons 09-10 and 10-11, and found that Darren Bent, who transferred from Sunderland to Aston Villa, was very consistent (ranked #2 in both seasons), whereas Fernando Torres, who transferred from Liverpool to Chelsea, dropped dramatically from #5 to #19. Had Abramovich known this ranking method, maybe he would have considered buying Bent at half the price of Torres instead. Hopefully this misfortune will not happen to him again, since he’s now got his nice software.

Another good example of player assessment that I’ve seen is the website GoalImpact. They use a comprehensive algorithm to calculate a “goal impact” index for each player, which broadly reflects both how well his team plays and how much his team depends on him. One interesting thing they do is compare the goal impact ranking to the FIFA Ballon d’Or best player list.

Let’s look at the Oct 2012 – Oct 2013 ranking (the second table). Many Bayern players rank very high, which is reasonable since their team’s overall points are high; Ronaldo is ranked almost top, which is interesting considering that Real Madrid did not win any important title in the last year; and Messi’s rank is understandable considering his injury. What is most surprising is Ribery’s rank, which is much lower than I expected. Even within Bayern he only ranks #5, which means Bayern really did not depend on him that much. Seeing this, I realized that this year’s Ballon d’Or may not be that outrageous. Or at least it shows that Bayern is the team that works most collectively, and that it’s really hard to choose one representative from it for an individual award.

While which players are better than others is a hot topic among fans and media, the professionals usually need to know more. It’s not only a simple comparison of player quality, but also whether a player has the style they need. A linear table is not enough to tell them this; they need multi-dimensional results. I’ll return to this soon.

As statisticians and economists, the authors of the book propose many ways to extract meaningful information from soccer data. But most of them are fairly traditional statistical methods, and the data they use are in fact not so “big” in the sense of “big data”.

With 22 players and a ball moving over a 68 m × 105 m pitch for 90 minutes or more, soccer is itself a generator of big data. The coordinates of players and ball and all the events happening on the pitch can be recorded in real time. Since last year, the NBA has begun rolling out this technique on all its basketball courts using a motion-tracking system. In soccer, companies like Opta are already trying to collect as much data as possible by watching recorded video, but it’s more complicated because the soccer pitch is much larger, which makes it hard for cameras to catch every corner. I’ve heard that wearable GPS is already used in training at some clubs. If it could be used in matches to collect data, that would be perfect.

Why do we need so much data? One application is to study a dynamical model of the game. There is an interesting example in basketball that studies the “state of transition”, which can be used to estimate players’ contributions to the game in a more comprehensive way. Of course, these are much more complicated projects than just making a few analysis graphs.

What I think is most useful, and also feasible with today’s data and techniques, is to study players’ detailed profiles using machine learning. As I mentioned above, a linear comparison of players is of limited use; we need more features to characterize a player. Dividing players into groups like forwards, midfielders and defenders is a stereotype, because each group can have different subtypes, and some players from different groups may be more similar in some respects. It’s impossible to make a perfect division and comparison of all the players in one’s head, but we can assign this work to computers. As long as we provide enough features to characterize each player’s behavior on the pitch, computers can cluster the players into data-based groups. These groups can be more detailed than we can imagine, and can even reveal new types that we’ve never thought of before. Similar work has been done in the NBA (again!) and I would not be surprised if Opta or other data companies are already doing it. It’s especially useful to build such a database for young players, so that when scouts want a certain kind of player they know which groups to look into, and it’s much easier for them to find a bargain. This is Moneyball in the big data era.
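As a rough idea of what “letting the computer group the players” means, here is a bare-bones k-means clustering sketch. I am not claiming this is what the NBA work or any data company uses; the two features and all the numbers are hypothetical, and real features would need to be put on comparable scales first.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]   # simple deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every player to the nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Move every centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical per-90-minute features: (passes / 10, shots)
players = [(8.0, 0.5), (0.5, 4.0), (7.8, 0.7), (0.6, 3.5), (8.2, 0.4), (0.4, 4.2)]
labels = kmeans(players, k=2)
```

With only two features the result is just “passers” and “shooters”, which anyone could see by eye. The point is that the same procedure keeps working with dozens of features, where nobody can.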

With such detailed data on player behavior, game simulation can be made more realistic. Well, I am not really interested in improving video games, but what if someday video games could help the games on the pitch? It’s true that soccer is full of randomness, but that does not mean we have no control. The result cannot be determined in advance, but if you imagine playing the same game under the same conditions millions of times, the results should follow a distribution, and then we can see which result is most likely. Of course in reality we can never do this, but computers can simulate it easily in a short time (you don’t need to render the simulations in real time, so they can run much faster). So if a coach is not sure whether he should play three or four defenders, or whether he should use player A or player B in some position, or he just wants to try out some crazy experiment, don’t worry, do some simulations! And if the computer tells you that plan A is 1.5 times as likely to win as plan B at a 95% confidence level, then you know which one to choose, right? Of course, all of this relies on the accuracy of the simulation, and there is still a long way to go on that front. But I believe that as we understand more about soccer data, it is not unimaginable in the near future.
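The crudest possible version of such a simulation looks like this. It assumes that each team’s goals in a match follow a Poisson distribution with a fixed average, which is a common simplification and nowhere near the player-level simulation imagined above. The goal rates for “plan A” and “plan B” are invented numbers.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson-distributed goal count (Knuth's method)."""
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def simulate(our_goal_rate, their_goal_rate, n_games=20000, seed=1):
    """Replay the same fixture many times; return (win, draw, loss) shares."""
    rng = random.Random(seed)
    wins = draws = losses = 0
    for _ in range(n_games):
        ours = poisson(rng, our_goal_rate)
        theirs = poisson(rng, their_goal_rate)
        if ours > theirs:
            wins += 1
        elif ours == theirs:
            draws += 1
        else:
            losses += 1
    return wins / n_games, draws / n_games, losses / n_games


# Hypothetical: plan A is expected to yield 1.8 goals per game, plan B 1.2,
# against an opponent averaging 1.0.
plan_a = simulate(1.8, 1.0)
plan_b = simulate(1.2, 1.0)
```

Comparing `plan_a[0]` with `plan_b[0]` gives the kind of “which plan wins more often” answer described above. All the hard work is hidden in estimating those goal rates from real data.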

I don’t know how dramatic a change has to be to be called a revolution, but it certainly will not happen overnight. Unlike in the age of Moneyball, both the data and the methods of analysis have now become much more complicated and diverse, and the data “industry” has produced a huge market in which you can shop for your own needs. It’s possible that the use of data will bloom simultaneously in various aspects of soccer, and different clubs may benefit in different areas depending on what they can afford in money and time. While the bigger clubs have more money to invest in refining tactics or long-term training plans, the smaller clubs may focus on looking for bargain young talent, because that’s the investment that can pay off immediately.

It’s also to be expected that there will be many failed experiments and useless inventions, as in any innovative field. Confidence and patience are what we need. I have heard that Liverpool is the Premier League club that has invested most in data analytics in recent years. It has not seemed successful so far, but maybe it is beginning to pay off this season. Anyway, we still need to wait and see.

“Everything you know about soccer is wrong”, as the title of the book suggests? I don’t think so. At least for me, many things I’ve been told still stand firm in the face of the data. Data analysis is a continuation of, rather than a rebellion against, the traditional discipline of soccer. Most of the time you’ll see that it just tells you what you already know, but with more detail about the hows and whys.

There is a concern that too much data analysis could water down the fun of soccer. That may be true for some people, as different people have different definitions of fun. But for me, it just adds a new way of enjoying soccer. To show you what I mean, I want to invite every reader to enjoy a talent show. Before you watch the video, let me tell you that what’s behind this woman’s performance is very simple physics: center of gravity. The length and weight of each piece are accurately measured, and the overlapping positions are calculated and marked. What she does is simply put the pieces together in a strict order. Every motion is a well-programmed procedure. However, these technical details can never take away from the great elegance of the show. The music, her concentrated eyes and the message of the feather all draw my mind into an almost sacred atmosphere. I have watched it many times. And in addition to feeling amazed, I understand why it’s so amazing, and why it’s a brilliant show.

It’s the same for soccer. Data give you more insight into the game, and you’ll be surer of what you mean when you say: What a brilliant game!

Still remember the legendary World Cup predictor Paul the Octopus? In his short life (may he rest in peace) he did not miss a single prediction of the World Cup games. Now we have the ape Eli, who has correctly predicted all of the last 6 Super Bowl results.

These are supernatural creatures sent by God, right? Otherwise how could they be so accurate?

Well, don’t be blown away. In fact, with just a bit of common sense about probability, you can create such a super predictor yourself, no matter whether it’s a cat, a dog or any other animal.

Let’s start with 64 animals. In the first year, you let them predict the Super Bowl, and the most likely outcome is that 32 animals pick team A as the winner and the other 32 pick team B. Report the 32 or so animals that were correct to the media and keep them, and discard the other 32 that were wrong. In the second year, repeat with the 32 remaining animals; the most likely result is that 16 of them are correct, and those are kept. Repeat this procedure year after year, and after the 6th year, most likely exactly one animal is left. Then you can report to the media: here we have an ape (or whatever animal) who has correctly predicted all of the last 6 Super Bowls!

Not that amazing any more, right? Because you know that, according to the laws of probability, even if the animals know nothing about what they are doing and just pick sides at random, you can still get this result. That is how this whole thing works! Have you seen any reports about an animal predictor that was right before but is wrong this time? Maybe a few, but the majority do not get reported because they are not interesting any more. Only those that keep getting it right are reported in the media. With this selection, we can always generate “super predictors” from the bunch of animals out there. But can you trust these super predictors more than others? No, because they always choose at random. There is nothing special about them at all.
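The arithmetic behind the trick fits in a few lines. The sketch assumes every animal picks each side with probability 1/2, independently of the others; the function names are mine.

```python
from fractions import Fraction


def survivors_by_year(animals, years):
    """Expected number of still-perfect animals, halving every year."""
    remaining = [animals]
    for _ in range(years):
        remaining.append(remaining[-1] // 2)
    return remaining


def chance_of_a_perfect_record(animals, years):
    """Probability that at least one random guesser gets every year right."""
    single_animal = Fraction(1, 2) ** years
    return 1 - (1 - single_animal) ** animals


expected = survivors_by_year(64, 6)
at_least_one = float(chance_of_a_perfect_record(64, 6))
```

A single animal has only a 1-in-64 chance of going six for six, but with 64 animals guessing, the chance that at least one of them does it is well above one half. The “miracle” is in the number of animals, not in any one of them.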

So the next time you want to bet on some game (although in general I advise you to avoid this practice, in which you always lose in a statistical sense), you’d better trust your gut rather than those super predictors.

The motive of this analysis is to settle the debates over some issues from Barça’s last season using data analysis. Data analysis can compensate for the vague impressions, short memory and biased opinions that we usually bring to a qualitative analysis. I believe that the team has a much more advanced and comprehensive system of data analytics; compared to that, this analysis is very simple and crude. But with only basic tools and the limited data available online, we can at least get a general idea.

Here I analyze the playing minutes of Barça’s first team players over the last five seasons, in order to understand the issues of rotation, age structure and the situation of the homegrown players.

All data are taken from the database of Spanish football http://www.bdfutbol.com

Before going into the analysis, let’s define one concept for convenience: the first team player. We follow a practical definition rather than official status: those, and only those, who play at least 90 minutes in total across all the official games of the first team in a season are counted as first team players.

Here we use the standard deviation (SDV) of the playing minutes of all the first team players to measure the degree of rotation. The standard deviation shows how much a set of numbers differ from one another. With a larger standard deviation the data are more scattered and more different; with a smaller standard deviation the data are more concentrated or, as we usually say, more “equal”. So more rotation corresponds to a smaller standard deviation, and vice versa.

In addition, I also calculate the standard deviation for the major players. “Major players” are defined as those who play more than the team’s average minutes. There are about 14 (±1) major players in each season, and they usually make up the starting XI. The “little rotation” among the major players is also an important aspect of the team.
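For readers who want to repeat the calculation, the two measures can be sketched as follows. The squad minutes are hypothetical, and the sketch assumes the population standard deviation (`pstdev`); using the sample version would shift the numbers slightly without changing the comparison between seasons.

```python
from statistics import mean, pstdev


def rotation_measures(season_minutes, threshold=90):
    """Return (whole-team SD, major-player SD) of playing minutes."""
    # First team players: at least `threshold` minutes in the season.
    first_team = [m for m in season_minutes if m >= threshold]
    # Major players: above the first team's average minutes.
    average = mean(first_team)
    major = [m for m in first_team if m > average]
    whole_sd = pstdev(first_team)
    major_sd = pstdev(major) if major else 0.0
    return whole_sd, major_sd


# Hypothetical squad, not real data: the 60-minute player is excluded.
whole_sd, major_sd = rotation_measures([4000, 3000, 2000, 1000, 60])
```

A smaller value of either number means the minutes are shared more evenly, that is, more rotation.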

The results are shown in the following table and graph.

We can see that, for both the whole team and the major players, the standard deviations in season 12-13 are smaller than in any of the other 4 seasons. Season 10-11 has the largest whole-team standard deviation, and season 11-12 has the largest major-player standard deviation. In particular, Messi played 5221 minutes in 11-12, making him the only player in the 5 years to play more than 5000 minutes in a season. That season also had the most games (3 more than average), which, along with the lack of rotation in the previous season, may explain some of its difficulties, such as the many injuries and the poor physical condition.

We can look in more detail at the fraction of players falling in each range of minutes.

It’s worth mentioning that in season 12-13 no player played more than 4500 minutes; even Messi, who played the most, had only 4095 minutes. This is further evidence that this season had the most rotation.

We can also use pie charts to compare season 12-13 to the 5-year average:

There is an interesting feature in season 12-13. While all the other seasons have only one peak in the distribution (apart from the 0-500 minute range), season 12-13 has two: 1500-2500 and 3500-4500. The fact that many players fall into the 1500-2500 range indicates that the substitutes played more and contributed more to the games.

There is one subtle issue in considering rotation: injury. Usually, by rotation we mean the choice of players made by the coach when he is free to choose; let’s call this “active rotation”. But there is another case: when a player is injured there is no choice, so the resulting change in playing time is less relevant; let’s call this “passive rotation”. When considering the strategy of the team, we usually care more about active rotation.

Ideally, we would take all the injury information into account and calculate the active rotation. But I don’t have that much information. So let’s deal with one typical player: Messi. The following table shows the total minutes played by Messi in each of the 5 seasons.

You may suspect that the reduced playing time in season 12-13 was mainly caused by his injury rather than by active rotation (as in 08-09). So let’s take a look. Messi was injured twice that season: the injury in December 2012 did not have much effect, but the one he suffered in the first leg of the UCL tie against PSG disrupted all his remaining games of the season. So let’s trim away the games after that match and calculate the average playing minutes per game in the healthy period only. That can be taken as a measure of pure active rotation. In principle, the average minutes in the healthy period are higher than the overall average, because the player plays much less during the injury period.

I do this trimming only for season 12-13 and compare its “healthy” average to the overall average in the other 4 seasons. Even in this extreme case, we can see that the average minutes in season 12-13 are still relatively low, especially compared to 11-12.

In all, from several perspectives, we reach the same conclusion: season 12-13 had the most rotation of the past 5 years. And Messi’s injury was NOT caused by overplaying, nor do the playing minutes indicate any greater dependence on Messi than in the other seasons. This may seem counterintuitive to some people, because what they remember is only those 2 months when Tito was in New York, during which Barça did almost no rotation at all. But the data show us the whole picture, which is very different.

I analyzed both the distribution of the number of players by age and the distribution of playing minutes by age range. Besides the static age structure of the team, this can show us how much the team actually depends on players of different ages.

Let’s start with the average age of the whole team and of the major players in each season.

We see that the average age does not change too much each year and the team keeps a good balance of the age structure. The large difference between the whole team and major players in season 10-11 indicates a relatively fixed starting XI with more experienced players, and this is also related to the low degree of rotation.

The following is the distribution of playing minutes by age. Starting from the age of 20, each age range spans 3 years. This width was chosen to make the distribution as smooth as possible.
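The binning itself can be sketched like this. The 3-year ranges starting at 20 follow the description above; the squad list is hypothetical and the function names are mine.

```python
from collections import defaultdict


def age_range(age):
    """Map an age to its 3-year range: under 20, 20-22, 23-25, ..."""
    if age < 20:
        return "under 20"
    low = 20 + 3 * ((age - 20) // 3)
    return f"{low}-{low + 2}"


def minutes_by_age(players):
    """Sum playing minutes per age range from (age, minutes) pairs."""
    totals = defaultdict(int)
    for age, minutes in players:
        totals[age_range(age)] += minutes
    return dict(totals)


# Hypothetical squad, not real data.
squad = [(18, 300), (21, 1200), (24, 3000), (25, 2500), (30, 3500)]
distribution = minutes_by_age(squad)
```

Dividing each total by the sum of all minutes gives the shares shown in a pie chart.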

And similarly, we can compare season 12-13 to the 5 year average with pie charts.

We can see that the playing minutes of season 12-13 are concentrated in the age range 23-25. Compared to the average distribution, players under 22 have far fewer playing minutes, and players under 20 have none. In detail, only one player under 20 played for the first team, Deulofeu (18), but he played only 68 minutes, so he is not included. The first 3 seasons have so many minutes for under-20 players because of Bojan, whose playing time was above 1000 minutes in each of them. We can compare the playing minutes of all the players under 22 across the 5 seasons:

We see that it decreases almost steadily over the years, and season 12-13 has the minimum. This indicates that young players did not get as many chances as in the earlier seasons.

In season 12-13, the players older than 25 also have less total playing time than in the other seasons. This is caused by more injuries among the older players. It’s a signal that, although the age structure of the team has not changed much compared to the former seasons, more and more of the older players may not be able to sustain the continuous intensity of the games. It may also be caused by other factors that can lead to wear-and-tear injuries, such as changes in training, diet, etc. Since there is not much change in the team structure in the new season 13-14, this situation is expected to continue or even worsen. This means that, in the long-term plan for the season, the group of players older than 26 should not be relied upon too heavily because of their physical condition.

Season 12-13 shows that the team depends strongly on players aged 23-25 (now 24-26), which is a good thing in the long run. If everything goes normally, their physical state is expected to hold up during the next 3-4 years.

Let’s look at season 10-11 again. The peak of the minutes in this season falls in the age range 29-31. If we consider overall performance, few people would doubt that this season was the peak of the past 5 years. Recalling the conclusion we reached before for this season, the whole picture is now clear: it had the least rotation and the most experienced players available to build a stable and strong starting XI. However, the price was paid in the next two seasons, in the form of more injuries in general and the wearing out of the older players.

Now let’s enter a topic that concerns many people: the use of the players from La Masia. Some people argue that the season 12-13 almost “abandoned” the homegrown players from La Masia. So let’s find out the truth by comparing the data of this season to the other 4 seasons.

I analyzed all the homegrown players that entered and left the first team after season 08-09. The homegrowns that entered the first team include those who were promoted from Barça B (who played more than 90 minutes per season in the first team) and those who were bought back from other teams. The homegrowns that left the first team include those who were sold or loaned out and those who went back to Barça B after playing more than 90 minutes in the first team. There is one special case, Bojan, who entered the first team in 2007, but because he is a typical young homegrown player of these years, I also count him as promoted in 2008. The following is the number of players who entered and left in each season.

Note: In the summer of 2008 there were also homegrown players who left the first team but they can be considered irrelevant to our analysis.

In the past 5 years, 21 homegrown players entered the first team and 8 left, so on average 4.2 entered and 1.6 left per season. In season 12-13, 2 new homegrowns entered, which is fewer than average, and 2 left, which is more than average.

“Entering” only means that the playing time over the whole season is more than 90 minutes. For further analysis, we need to know how much these players actually played in the first team. If we only count the total playing minutes of homegrowns, we have the following result:

Of course, because some homegrowns that entered earlier gradually became major players during the years, it’s expected that the total playing time increased each year.

We care more about the promotion of young homegrowns, so let’s have a look at the number and playing minutes of the homegrowns under 22. We can see that both the number and the playing minutes in season 12-13 are below average.

A more detailed question would be, among these young homegrowns who entered the first team, how many can stay, and how many can become major players? After all, only those who play as major players or at least as frequent substitutes can be counted as “successfully promoted”. So I classify the homegrowns into 3 groups according to their playing minutes:

“Unstable”: playing minutes < 1000

“Half Stable”: playing minutes between 1000 and 2000

“Stable”: playing minutes > 2000

2000 minutes is approximately the average playing time of the whole team and a typical divider of the major and non-major players. The following table shows the number of homegrowns in each category. “Alive” is the total number.

We see that season 12-13 has the highest number in the Alive, Stable and Half Stable categories, and a lower-than-average number in the Unstable category.
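The three categories defined above can be written down as a small rule. This is only a sketch: the function names and the sample minutes are made up for illustration, and treating exactly 1000 and exactly 2000 minutes as “Half Stable” is my assumption, since the definition does not say which side the boundaries fall on. The input is assumed to contain only homegrowns who are already counted in the first team.

```python
from collections import Counter

CATEGORIES = ("Unstable", "Half Stable", "Stable")

def classify(minutes):
    """Map a season's playing minutes to one of the three categories."""
    if minutes > 2000:
        return "Stable"
    if minutes >= 1000:  # boundaries 1000 and 2000 count as Half Stable (assumption)
        return "Half Stable"
    return "Unstable"

def category_counts(minutes_by_player):
    """Count players per category; 'Alive' is the total across categories."""
    counts = Counter(classify(m) for m in minutes_by_player.values())
    result = {name: counts.get(name, 0) for name in CATEGORIES}
    result["Alive"] = sum(result.values())
    return result

# Hypothetical players and minutes, for illustration only.
sample = {"player_a": 2600, "player_b": 1500, "player_c": 400, "player_d": 950}
counts = category_counts(sample)
print(counts)
```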

You may argue that the number of homegrowns in each season includes those who were already promoted or stabilized in previous seasons, which makes the increase in numbers expected, and therefore less meaningful. To account for this, we can define, for each category, the ratio of its number to the accumulated number of homegrowns that have entered the first team from 2008 up to the corresponding season. For example, the accumulated number of homegrowns that have entered the first team (only entering, no leaving counted) is 21 up to season 12-13, so the “Stable Rate” of season 12-13 is 5/21=0.24. Of course, these rates are biased in the other direction: the accumulated number always increases with time, but the total number of players who can play in the first team is limited, so the rates should be expected to decrease in the long run. However, over 5 years this does not have a big effect, and at least we can see the picture qualitatively. The following table and graph show these rates for each season. The last row in the table is the sum of the Stable Rate and the Half Stable Rate.

All the indices in season 12-13 are below the average. However, we notice that the first two seasons contribute most to the average. This is expected, because the accumulated number of homegrowns is much smaller in the first two seasons than in the following ones. If we only compare the last three seasons, we see that the Stable Rate does not change much, and season 12-13 has the highest Half Stable plus Stable Rate of the 3 most recent years. More noticeably, the Unstable Rate of this season is the lowest of all these years. It indicates that although the homegrowns had fewer chances to play in total, the “quality” was quite high, which means that there were relatively more half stable and stable homegrowns who made a significant contribution to the team.

There is an interesting feature of season 12-13 in the graph: the difference among the three rates is very small. As indicated before, the numbers of unstable, half stable and stable players are 4, 3, 5, respectively. It looks like there are smooth “steps” along the slope of playing minutes for the homegrowns. The half stable players are important in the sense that they fill the gap between the experienced and the immature players, and they are expected to grow into major players within a short period. It’s important for the team to have enough half stable players in reserve in order to have a structure that can evolve stably in the long run, unless the major strategy of the team is buying players instead of using homegrowns. Season 12-13 had a good structure with the most half stable players: 3. Let me give their names: Thiago, Tello, Montoya. (What a loss we had with Thiago…)
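As a quick check of the rates for season 12-13, here is a short sketch that uses only the numbers quoted in this post: 4 unstable, 3 half stable and 5 stable players, divided by the 21 homegrowns that had entered the first team up to that season. The function name is my own.

```python
def category_rates(counts, accumulated_entries):
    """Divide each category count by the accumulated number of entries."""
    rates = {name: n / accumulated_entries for name, n in counts.items()}
    # Last row of the table: sum of the Half Stable and Stable rates.
    rates["Half Stable + Stable"] = rates["Half Stable"] + rates["Stable"]
    return rates

counts_12_13 = {"Unstable": 4, "Half Stable": 3, "Stable": 5}
rates_12_13 = category_rates(counts_12_13, accumulated_entries=21)

for name, rate in rates_12_13.items():
    print(f"{name}: {rate:.2f}")
# Unstable: 0.19
# Half Stable: 0.14
# Stable: 0.24
# Half Stable + Stable: 0.38
```

The three individual rates all fall between 0.14 and 0.24, which is the small spread mentioned above.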

In all, the general situation of playing homegrown players in season 12-13 was not as good as in the previous seasons, but it was still within the acceptable range of fluctuation, and it had quite a healthy structure of playing minutes. After all, this is a very complicated issue. Unlike other technical issues, it’s not an independent index for each season; instead there are strong correlations between consecutive years. It’s impossible to make an objective assessment with only one season of data, or even with all 5 together. Only with a longer time span, covering the years before and after this period, can we get a clearer picture of this issue.

We draw these conclusions about season 12-13 from this analysis:

- Of the past 5 years, this season clearly had the most rotation. The playing minutes were the most evenly distributed.
- This season had an age structure centered on young players in the age range of 23-25, thus the general state of the team is expected to be stable during the next 3-5 years.
- Young homegrown players were in general used less in this season than in the previous seasons, yet still within the acceptable range of fluctuation. With the most half stable players, it might have the healthiest structure of playing minutes for the long-term evolution of the team.

In all, this analysis gives us a preliminary look into the issues at Barça related to players’ playing minutes. I feel that this is just the tip of the iceberg of what can be drawn from the data. With more data and more detailed analysis, we may find more interesting patterns that can answer our questions or help us spot problems that are not easy to notice otherwise. Starting from playing minutes, we can also relate them to other data, such as injuries and the physical state of the players. This can be important supplementary information for the coach when making decisions about the use of players from a long-term point of view.

]]>