Concept work “Data-driven Scouting in Football”

Summer break – for the usual fan, that’s the time to rest and recover from a nerve-wracking season. For me, it means that I can dedicate myself to one of my favourite topics: data analysis in football (and so I can deliver a long read that really deserves the name). For many, this might be a very boring topic. For me, it’s nothing less than an affair of the heart. That is probably because data analysis lets me combine my profession (natural scientist) and my passion (football) under one roof. I am something of an expert in both. So it’s either enough for a profound concept work or just a lot of superficial knowledge (you can judge for yourself once you have finished reading). Either way, I am convinced that almost every professional football club will have a position such as “Head of Data Analysis” within the next five years.

But why is that? What benefit can data analysis bring to football? And, just as important, what can it not deliver? I often have the impression that some people in professional football do not appreciate data analysis because there are “soft” factors that cannot be captured properly by data alone, and because the game is “too complex” to be “analysed completely”. That’s true. In US sports (basketball, American football, baseball) it is much easier, because play consists of one set piece after another or because the playing area and the number of players on it are smaller.
However, data analysis will certainly never completely replace the visual analysis of teams and players. But that’s not the point of this discussion. Sometimes I get the impression that the people in charge somehow “fear the new” when it comes to this topic. Of course, not every step of a data analysis can be understood by everyone within the club. And, of course, many might fear that data analysis could make their jobs redundant. But in-depth data analysis is by no means a substitute for the visual judgement of players, matches and moves. It is meant to be an addition. It improves the overall judgement, completes the picture, makes it more fact-based and, I am totally convinced, will eventually enable clubs to make much better decisions.

I have given some thought to data analysis in scouting, as the summer transfer window always provides plenty to discuss. Scouting is surely the area in which the intensive use of data can have the biggest impact (alongside match and set-piece analysis and the medical department). This is first and foremost because signing and employing a player is one of the budget items in a club for which it is hardest to predict whether the money actually buys what you pay for. To better predict whether a player really has the skills your team needs, sound data analysis is indispensable. Here are three reasons why:

  • Visual scouting is full of errors (biases). Here are some examples (and here is the text which I have used, amongst others):

    Confirmation bias. Cristiano Ronaldo is for sure a great footballer (from a human perspective he’s rather …err…). In his first Champions League appearance for Real Madrid, he scored two goals from free kicks. Of course, he is outstanding when it comes to set pieces. We can all agree on this, as almost every one of us will have seen him score from a set piece. But is he really outstanding? Of course, he’s quite good at striking the ball, but he’s not outstanding. His conversion rate is no more than average (especially compared to Lionel Messi – but we’ll get back to that later). Yet many of us think “Aww, dangerous, he’s pretty good at that” when the schmaltzy Portuguese lines up a free kick, because we once saw him score from one (which was only one of his many attempts). And many feel their impression confirmed when he scores another one, even though he has failed many times before.
    Confirmation bias makes us believe our own expectations (“Ronaldo is a god of free kicks”). We blank out all the information that does not match our picture (a high number of missed free kicks) so as not to upset our expectations.

    Anchor bias. Another problem in visual scouting is that we need some sort of anchor to judge a player (a reference value against which the player can be compared). This is mostly the first match or the first action in which one saw the player (in the case of Ronaldo, for example, his two free-kick goals in his first CL appearance for Madrid). Combined with the confirmation bias, this leads to massive misjudgements, as we blank out information and set faulty reference values. We do that because that’s how the human brain works.

    Outcome bias. “FC St. Pauli performed horribly this season because they only scored so-and-so many points.” The outcome bias is pretty easy to explain but also one of the biggest errors, because it focuses on results only. But as the better team does not always leave the pitch as the winner, the outcome bias gets its chance. Many simply wonder “We lost, so what was wrong?” or “We won, so what was good?” and thus judge a performance as positive or negative by the result alone, even though the result is of only minor importance for a comprehensive assessment.

    Similarity bias. We humans are rather considerable egoists. Hence, for example, the similarity bias: we judge people who are more similar to ourselves more positively than those who are more different. You can surely imagine that the effect of such a bias on scouting can be enormous.

    But all these misjudgements can be prevented by adding data analysis to visual scouting. And the list of misjudgements can be extended almost arbitrarily (just have a look at this “Cognitive Bias Codex” – and if you only want to click on one link in this text, choose this one). This is by no means a criticism of visual scouts. These are simply human misjudgements, which can accordingly only be made by humans and not by data analyses (however, even data analyses are human-made and thus not completely free of misjudgements).
  • With the help of data analyses, the visual work of scouts can be made more effective.
    Many podcast listeners might have wondered how on earth one is supposed to listen to everything one wants to listen to. Playing podcasts at 1.5x speed works for most of them, although a lot of the listening pleasure is lost that way. This solution is far harder to apply to scouting, as watching matches at 1.5x speed is even more unbearable. But to fully follow the actions of players, visual scouting is needed. For this there is now WyScout, which offers video snippets of almost every possible action. But all these actions have to be watched first and, even more importantly, be judged by someone. Sound data analysis before such scouting reduces the number of players who have to be visually scouted and so raises efficiency.
  • Data analyses do not add to the costs of a club; they are a cheaper way of doing things.
    Let’s imagine a player earns 250,000 € per season. I would guess (very conservatively, so probably an underestimate) that this is roughly the average player salary in the second division. So if such a player joins a club and signs a two- or three-year contract, it’s a decision that can’t be refinanced by selling 20 extra beers in the stadium or ten T-shirts in the fan shop. And we’re only talking about one player here. Per season, including contract renewals, this adds up to sums that make me personally feel sick. An equally cautious estimate on my part assumes 100,000 € for a possible “data analysis” department (one “head of”, two computer science students doing the “dirty work”, and the cost of the data itself, which you have to buy from different providers).

    I’m not a business economist, but if deeper data analysis increases the success rate of transfers by as little as 5-10% (certainly not an unrealistic scenario), then it is a worthwhile investment that saves a lot of money elsewhere. And right now, when the corona crisis means that income is no longer growing steadily and all clubs have to save money, the clubs that work very effectively in scouting will probably prevail in the near future.
Talent or not? In this case, rather easy to decide. 
(c) Peter Boehmer
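
The cost comparison from the third reason above can be put into numbers with a small sketch. The salary of 250,000 € and the department cost of 100,000 € come from the text. The number of signings per season, the contract length, the reading of “5-10%” as percentage points, and the simplification that a failed transfer wastes the whole salary are assumptions made purely for illustration.

```python
# Back-of-the-envelope comparison: data department cost vs. avoided mis-spent salary.
AVG_SALARY_EUR = 250_000       # from the text: rough average salary per season
DEPARTMENT_COST_EUR = 100_000  # from the text: estimated cost per season

# Assumptions for illustration only (not from the text):
SIGNINGS_PER_SEASON = 8
CONTRACT_YEARS = 2


def committed_salary() -> int:
    """Total salary a club commits to with one season's signings."""
    return SIGNINGS_PER_SEASON * CONTRACT_YEARS * AVG_SALARY_EUR


def avoided_misspend(improvement: float) -> float:
    """Salary no longer spent on failed transfers if the success rate rises
    by `improvement` (a fraction, e.g. 0.05 for five percentage points)."""
    return committed_salary() * improvement


def break_even_improvement() -> float:
    """Improvement in success rate at which the department pays for itself."""
    return DEPARTMENT_COST_EUR / committed_salary()


if __name__ == "__main__":
    for improvement in (0.05, 0.10):
        saved = avoided_misspend(improvement)
        print(f"{improvement:.0%} better hit rate: {saved:,.0f} EUR avoided")
    print(f"Break-even at {break_even_improvement():.1%}")
```

With other squad sizes or contract lengths the numbers shift, but the structure of the argument stays the same: the department cost is fixed, while the avoided misspend scales with the salary volume of the signings.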

What would data-driven scouting have to look like for such an investment to be worthwhile?

After all, everyone is fishing for players in the same pond. Ever since platforms like InStat and above all WyScout began offering information and data on almost every player in every professional league in the world, it is no longer the case that scouts deep in South America, after 1,910 hours of flight time, overnight stays in foul-smelling hotels and countless contacts, get to see a player that no one has ever scouted before.

So it’s much more about what you do with the data you’re offered. If you just rely on the data and statistics available on the market, you will continue to fish in the same pond as all the other clubs. Your hit rate won’t differ much from theirs and will depend completely on visual scouting, which is full of mistakes. But if you take the existing data and create your own model, a profile tailored to the needs of a club or team, then you will still be fishing for players in the same pond, but with a different (better?) rod. Only then will you be able to use the data to discover players who may have swum under the radar before. And only then will you be able to act differently on the transfer market (also in terms of price) than all the other anglers at the pond.

Which data are offered?

There are now an incredible number of providers in football who prepare and offer data and promise to make scouting easier. The platforms InStat and above all WyScout simplify scouting mainly by providing video snippets of players, whose skills can then be compared with those of others. But there are also many providers who offer their own data analysis. Let me introduce you to some of them:

Statsbomb

The current “poster boy” of data preparation and visualisation is certainly Statsbomb, especially with its so-called radar charts. These show the skills of individual players, according to the demands of each position, compared to all other players in the same position.

Radar charts from Statsbomb.
Basically, the more colour in the chart, the stronger the player is in the respective skill.

The comparison of N’Golo Kanté and Kevin De Bruyne, for example, shows that both players are extremely strong in midfield, but with very different skills. While De Bruyne sets the standard in the Premier League for assists and key passes, Kanté does the same for defensive skills like tackling, winning the ball and pressing. In this way, it is possible to identify relatively quickly which players consistently show certain skills. It’s also exciting that the skills are set in relation to all leagues and not just the league in which the player currently plays. For example, the radar chart of Mats Møller Dæhli, which I once asked for via Twitter, showed that he is an international leader in “Winning the ball through pressing”.

Radar chart of Mats Møller Dæhli from Statsbomb.
The graph shows his values compared to all data (including international). The table shows the values compared to all players in his position in the second division. Pretty impressive: both Mats’s performance and Statsbomb’s presentation.
No question, the data of Statsbomb is quite fancy. And many clubs are already working with Statsbomb in scouting.

Goalimpact

Another interesting provider of in-depth data in football is Goalimpact, although “in-depth” is rather the wrong term here; it is more a matter of a large amount of data. Goalimpact is built on a relatively simple principle that can be applied to almost every popular league and every player: based on the match results and the players on the field, a kind of plus/minus statistic is calculated. In addition, the expectations for a match are taken into account (simplified: if Bayern Munich beat 1 FC Rumpelballhausen 1-0, the win is weighted differently than a 1-0 win against Real Madrid).

The interesting thing about this statistic is that it also tries to capture the players’ off-ball actions. Most statistics are based on what players do on the ball: duels, shots, passes. But how do you weigh the value of a player who blocks spaces cleverly and so does not have to win the ball in a duel, or of a player who opens a gap in the opponent’s line with a clever run and thus enables a teammate’s successful pass? That is quite difficult to evaluate.

Goalimpact captures the influence that players have on their own team’s game simply by “crediting” a team’s success to all the players in the team. But as simple as this statistic is, using the bare result as a measure of a team’s performance of course has a lot of drawbacks, because not every team is rewarded for its performance with wins. And even if the result is positive, the performance of each individual player is not necessarily good. Still, Goalimpact is certainly of interest for scouting, especially for younger players, because the model always gives a forecast of a player’s development. It is just not meaningful enough as a stand-alone statistic.
(If you want to know more about Goalimpact, you can read the interview I once did with Thorsten Wittmütz from Goalimpact.)
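
The plus/minus principle described above can be sketched in a few lines. To be clear, this is not Goalimpact’s actual algorithm, which is not described here in detail; it is only a minimal illustration of “credit the result, relative to the expectation, to everyone who played”. The class name `Appearance`, the function `plus_minus` and all numbers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Appearance:
    """One team's view of one match (all values are illustrative)."""
    players: list[str]         # players of this team who were on the field
    goal_diff: int             # actual goal difference from this team's view
    expected_goal_diff: float  # pre-match expectation, e.g. from team strength


def plus_minus(appearances: list[Appearance]) -> dict[str, float]:
    """Average over-/under-performance per match, credited to each player."""
    totals: dict[str, float] = defaultdict(float)
    matches: dict[str, int] = defaultdict(int)
    for app in appearances:
        # A narrow win against a much weaker side counts as under-performance.
        surprise = app.goal_diff - app.expected_goal_diff
        for player in app.players:
            totals[player] += surprise
            matches[player] += 1
    return {player: totals[player] / matches[player] for player in totals}


if __name__ == "__main__":
    season = [
        # 1-0 against a far weaker opponent, where +4 was expected
        Appearance(["Keeper", "Striker"], goal_diff=1, expected_goal_diff=4.0),
        # 1-0 against an equally strong opponent
        Appearance(["Striker", "Winger"], goal_diff=1, expected_goal_diff=0.0),
    ]
    print(plus_minus(season))
```

The sketch also shows the drawback mentioned above: every player on the field receives the same credit for a match, regardless of his individual performance.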

Impect – Packing
Another kind of “poster boy”, even though it was portrayed very badly by Mehmet Scholl during the coverage of the 2016 European Championship, is “Packing” by the company Impect. The company, founded by the two former players Jens Hegeler and Stefan Reinartz, offers a different way of evaluating passes: it counts how many opponents are bypassed by a pass, i.e. how many opposing players were between the ball and their own goal before a successful pass and are further away from their own goal than the ball after it. This is a rather special statistic, but a very important one, because it puts the pass completion rate as such into a completely different perspective.

An example makes this particularly clear: if, as a central defender, I constantly play square passes and back passes, I’m sure to have a good pass completion rate. But am I more valuable to my team than a player who plays vertically more often, thus increasing the probability of scoring goals, but also producing more misplaced passes? Since Impect counts the number of opponents bypassed, the quality of the vertical passes carries more weight (the example of Hummels and Subotic is very insightful in this regard).
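
The counting idea behind packing can be illustrated with a strongly simplified sketch. It only looks at positions along the length of the pitch and assumes the attacking team plays towards larger x values; Impect’s real definition is certainly more refined. The function name `bypassed_opponents` and all coordinates are hypothetical.

```python
def bypassed_opponents(
    start_x: float,
    end_x: float,
    opponents_x: list[float],
    completed: bool = True,
) -> int:
    """Count opponents taken out of the game by a pass (1-D simplification).

    An opponent counts as bypassed if he was closer to his own goal than the
    ball before the pass (x > start_x) and is further away from his own goal
    than the ball after it (x < end_x). Misplaced passes score nothing.
    """
    if not completed:
        return 0
    return sum(1 for x in opponents_x if start_x < x < end_x)


if __name__ == "__main__":
    # Opponent positions in metres from the attacking team's own goal line.
    opponents = [40, 50, 55, 62, 70, 75, 80, 85, 90, 95, 104]
    print(bypassed_opponents(35, 65, opponents))         # vertical pass
    print(bypassed_opponents(35, 20, opponents))         # back pass
    print(bypassed_opponents(35, 65, opponents, False))  # misplaced vertical pass
```

In this simplified model, back passes and square passes never bypass anyone, which is exactly why a defender who only plays safe passes gains nothing here despite a high completion rate.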

Graph of the Impect-values of players of the 1st Bundesliga and the Premier League. You can be sure that the name Thiago should be next to the red dot in the top right corner.

American Soccer Analysis – g+
As many of you know, the use of data is much more advanced and accepted in US sports than in European football. It is therefore not surprising that the blog American Soccer Analysis is something of a pioneer in data analysis in soccer. And just in time for the corona break they presented what I think is a very interesting and meaningful metric: goals added, or g+ for short. This metric builds on the expected goals that I love. The difference is that it assigns a probability of scoring and of conceding to every action on the pitch, not just to shots. This makes it possible to calculate how much the scoring probability of the player’s own team and that of the opponent change as a result of an action (e.g. a duel won). If your own scoring probability increases and/or your opponent’s decreases, the player who performed the action is credited with the difference. As a result, every shot, pass, duel or ball win/loss is weighted and evaluated, which ultimately allows all actions of each individual player to be evaluated. This metric is quite fancy and nicely elaborated, but it is not all that new. In the end it is very similar to the non-shot xG metric, which also tries to evaluate all actions away from shots and assists. Other providers have similar metrics: Statsbomb, for example, has xGBuildup, which evaluates the passes before a goal is scored; there is Expected Threat (xT) by Karun Singh (where it is mathematically unravelled); and, fairly new, there is possession value, PV for short, which evaluates passes and can also yield negative values. My goodness, there is even an xG model that assigns a goal probability not only to the position of the shot but also to the shot itself; it is called shot placement xG. And there is a similar approach from the field of sports science that has been given the beautiful coinage Dangerousity. You see, a lot has happened around the expected goals metric in recent years.
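
The crediting rule described above (own scoring probability up and/or opponent’s scoring probability down) can be sketched as follows. This is not the American Soccer Analysis model: there, the probabilities are estimated from data, while here they are simply made-up inputs. The names `Action`, `action_value` and `goals_added` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Action(NamedTuple):
    """Scoring/conceding probabilities before and after one action (made up)."""
    player: str
    p_score_before: float
    p_concede_before: float
    p_score_after: float
    p_concede_after: float


def action_value(action: Action) -> float:
    """Change in own scoring probability minus change in conceding probability."""
    own_change = action.p_score_after - action.p_score_before
    opponent_change = action.p_concede_after - action.p_concede_before
    return own_change - opponent_change


def goals_added(actions: list[Action]) -> dict[str, float]:
    """Sum the value of all actions per player."""
    totals: dict[str, float] = defaultdict(float)
    for action in actions:
        totals[action.player] += action_value(action)
    return dict(totals)


if __name__ == "__main__":
    match_actions = [
        Action("Midfielder", 0.02, 0.01, 0.05, 0.01),  # good forward pass
        Action("Midfielder", 0.05, 0.01, 0.03, 0.02),  # ball lost afterwards
        Action("Defender", 0.01, 0.04, 0.01, 0.02),    # duel won near own box
    ]
    for player, value in goals_added(match_actions).items():
        print(f"{player}: {value:+.3f}")
```

The defender’s duel never raises his own team’s scoring probability, yet it is still rewarded, which is the point of evaluating actions away from shots.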

The dawn of a new guard of expected goals as best metric ever?
And just as I announced more than two years ago that expected goals would replace the “normal” shot statistics in media coverage in the near future, I do the same here, but for goals added. I am impressed by how many further metrics have been derived from xG: from xG to expected assists (xA), to deeper pass chains (xT, xGBuildup and PV) and, with Dangerousity, to an evaluation of every action on the field, which may find some kind of completion in g+ and non-shot xG. I think the significance of g+ in scouting is limited, because it is a general evaluation rather than an evaluation of individual skills (which doesn’t mean that you shouldn’t use it in scouting). But the metric is quite easy to understand, and for a rough estimate of player performance it is umpteen times better than the running distances, duel rates and pass completion rates presented so far.

21st Club
21st Club presents itself in the style of a consulting agency. And the advice they offer is often much broader than just looking at the skills of individual players. It’s more about questions like: What is the best age to buy or sell players from an economic point of view? What are the differences in performance level between the individual leagues? Can a player from league X deliver the same performance in league Y? What is the age structure in the squads of successful teams? Which coach fits a squad? (I hope the last question was asked out loud recently at FCSP.)

It is therefore not necessarily a provider of data but a company that uses data to provide recommendations for action on longer-term management structures and the like. In the broadest sense, 21st Club is still a provider that can be very interesting for data-driven scouting. And I can recommend reading these articles to everyone in charge at football clubs.

Global Soccer Network
In my opinion, what Global Soccer Network offers comes pretty close to what clubs need in scouting. Really close. I don’t know what underlying data is used to build the GSN index, but that’s exactly how every club should approach scouting (although it need not be as complex; less would be enough to set a club significantly apart from others). To better understand the data on offer and the ideas behind it, I highly recommend the interview by the VfL Bochum fan blog einsachtvieracht with one of the creators of Global Soccer Network.
I personally find it exciting that not only the skills of individual players are evaluated here, but that you can also see whether the playing style of your own team matches that of other teams, which might make it easier for players to adapt to their new team. And, as Statsbomb already does with its data, players are rated according to the demands of the individual positions. A step beyond Statsbomb is probably that you can also check whether a player’s skill set fits other positions. There is also a forecast for each player. I can’t tell you how accurate it is, but judging by the interview linked above, there is a lot of thought and data analysis behind it. As I said, I don’t have deeper insight into the data and processes of Global Soccer Network, but at first glance it seems like a scouting tool that can really help clubs.

This is only a small selection of providers of data for improving the assessment of footballers, and they are probably the best-known players in this market. There are many more, and FCSP will certainly receive offers for such products week after week, in which the provider promises the ultimate improvement of your own scouting. For people who are less data-savvy, it is surely difficult to keep track and to distinguish valuable from worthless data (this question gets its own section in this text). By the way, it is often Opta Sports that supplies the raw data behind all the metrics, indexes and forecasts these providers offer (but there are also some providers that collect the data themselves).

And in general, when using such external data providers, everyone is again fishing in the same pond, just with the same, much better fishing rod. To really stand out in scouting and to use the data in a way that perfectly fits the requirements of a club and its playing system, you need in-house data analysis, ideally in conjunction with external data providers.

A separate data analysis department and the use of several data sources are, by the way, also useful for critically questioning the definitions of the individual evaluation criteria. Florian Zenger of Clubfans United, for example, once looked into why FCN player Hanno Behrens gets three quite different duel success rates from three different data providers. These definitions should therefore be questioned and, ideally, developed in-house, so that no false assessment is made.

Valuable or worthless data?

If you decide to rely on data when searching for new players, then it should of course be the right, meaningful data. The example of Cristiano Ronaldo and free kicks is perfect for this section: wow, he scored a whopping 30 goals from free kicks in Real Madrid’s kit. Looking at this figure alone, it would be clear that we should urgently try to sign him if we are looking for a set-piece specialist. But the truth is that Ronaldo took 410 direct free kicks at goal in the same period. W.O.W! That leaves a success rate of just over 7% (30 of 410), which suddenly doesn’t really speak for an absolute set-piece specialist, because the average conversion rate of free kicks is just under 6%. That of the 150 best in Europe is just under 13% (you can read all about it here). Cristiano Ronaldo is therefore only slightly better than the average free-kick taker. And I already described the example of Hummels and Subotic, comparing their pass completion rates with the corresponding packing values, above.
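
The free-kick arithmetic can be checked quickly. The 30 goals and 410 attempts come from the text; the two benchmark rates are given there only as “just under 6%” and “just under 13%”, so the sketch uses 0.06 and 0.13 as approximate upper bounds. The function names are hypothetical.

```python
GOALS = 30       # from the text
ATTEMPTS = 410   # from the text

# Approximations: the text says "just under" 6% and 13% respectively.
AVERAGE_RATE = 0.06
TOP_150_RATE = 0.13


def conversion_rate(goals: int, attempts: int) -> float:
    """Share of attempts that ended in a goal."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return goals / attempts


def goals_at_rate(rate: float, attempts: int) -> float:
    """Goals a taker with the given conversion rate would score on average."""
    return rate * attempts


if __name__ == "__main__":
    rate = conversion_rate(GOALS, ATTEMPTS)
    print(f"Ronaldo: {rate:.1%}")
    print(f"Average taker, same attempts: {goals_at_rate(AVERAGE_RATE, ATTEMPTS):.1f} goals")
    print(f"Top-150 taker, same attempts: {goals_at_rate(TOP_150_RATE, ATTEMPTS):.1f} goals")
```

Put differently: from the same 410 attempts, an average taker would be expected to score only a handful fewer goals than Ronaldo, while a taker at the level of the top 150 would score far more than 30.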

So if you want to use data for the evaluation of players, it is incredibly important to distinguish between valuable and worthless data and to evaluate the validity of all data.

But which data is valuable or worthless?

Of course, I am not the first person to ask this question, and many clubs already work with data analysts or use data from providers. Basically, it makes absolute sense to scout players according to different criteria depending on their position (because a striker needs different skills than a defender). Accordingly, every club would first have to ask itself what the core skills of players in certain positions are and which skills are needed at all for the football the team wants to play.

I will try to get to the bottom of this question by listing which data are relevant for player scouting and should be included in a data analysis:

General factors

  • Age
    In general, a player must of course make a team better. In that respect, the age of a player should only play a minor role. But from an economic point of view, and from the perspective of squad development over the years, age plays an enormously important role. It matters for resale value alone. Then there is the question of whether a player can still develop and thus possibly leave the club again at a higher price. And of course clubs should always try to sign players before their market value increases. It comes down to a very general question: do we as a club want to develop a player and sell him at a higher value, or do we need a player who can help us immediately in the vacant position? The ideal solution is of course both at once: a player who helps us, but who also develops further and later recommends himself for higher tasks.
  • Height
    Yes, height is important: a central defender who is only 1.82 m tall would seem to be of limited use, as he will be inferior in aerial duels. But height is not the deciding factor. What should be taken into account when searching for suitable players is rather an indication of jumping height (this can be calculated on the basis of physiological characteristics). Why did I just mention 1.82 m? Take a look at this jumping height:
Leo “Air” Østigård – we will miss him!
  • Speed
    A very central skill. In many media, top speeds are cited again and again. That is as right as it is wrong. After all, what do top speeds tell us when there are much more important values in the area of speed? I personally would pay much more attention to what a player can do over the first 5, 10 and 25 metres, i.e. how well he accelerates. After all, how often will players (have to) reach their top speed in a game? Rather rarely. But short accelerations are required often and are therefore important (by the way, the number of sprints per 90 minutes is also an important parameter).
    Anyone who plays “FIFA” at home knows this, because under the generic term “speed”, “acceleration” is one of the most important sub-attributes for the players.
  • Injury history
    I certainly don’t need to explain to anyone why the long-suffering Marc Hornschuh is currently falling through the scouting grid at many clubs. Injuries are probably the biggest source of lost value at every club.
    Accordingly, scouting should look very closely at a player’s injury history. How often was a player injured? What kind of injuries were they? Did the player simply have bad luck? Or is there more to it, such as bad training or unprofessional behaviour? I have elaborated on that in this text.
  • Market value
    Anyway, it’s clear that as an ambitious second division team, you shouldn’t waste time on having players in your scouting model who would respond to an enquiry from the second division with a mild smile at most. Accordingly, a scouting model needs a kind of upper limit, so that you do not always have to sort out the first 20 players straight away because they do not fit into the club’s price and salary structure anyway. In a scouting model, the market value of players could be used as a filter to exclude the category of players who are out of reach because of their price.
  • Goalimpact / GSN – Forecasts
    The two companies already mentioned, Global Soccer Network and Goalimpact, are also (or especially) interesting because they provide a forecast of the development of individual players. The hit rate of these forecasts is certainly nowhere near what both the creators and the users hope for: the individual development of a player is simply far too closely linked to his situation at the club (playing time, system, coaches, competition), and this cannot be predicted for individual players over several years. The forecasts are nevertheless interesting and should be taken into account in scouting. After all, they are not based on crystal-ball gazing but have been validated against historical data, i.e. the already completed development of players. So even if the hit rate is not optimal, they can provide another important evaluation in addition to the assessment of visual scouts.
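
The idea of using market value as a filter, as described under “Market value” above, could look roughly like this. It is only a sketch: the `Candidate` fields, the `shortlist` function, the model scores and the price ceiling are all hypothetical and not taken from any real scouting model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A player as ranked by some scouting model (values are made up)."""
    name: str
    model_score: float      # higher means a better fit for the club's profile
    market_value_eur: int


def shortlist(
    candidates: list[Candidate],
    max_market_value_eur: int,
    top_n: int = 20,
) -> list[Candidate]:
    """Drop unaffordable players first, then rank the rest by model score."""
    affordable = [c for c in candidates if c.market_value_eur <= max_market_value_eur]
    return sorted(affordable, key=lambda c: c.model_score, reverse=True)[:top_n]


if __name__ == "__main__":
    pool = [
        Candidate("Player A", 0.93, 25_000_000),  # best fit, but out of reach
        Candidate("Player B", 0.81, 900_000),
        Candidate("Player C", 0.77, 1_200_000),
        Candidate("Player D", 0.64, 400_000),
    ]
    for candidate in shortlist(pool, max_market_value_eur=1_500_000, top_n=2):
        print(candidate.name, candidate.model_score)
```

Filtering before ranking means the model’s top positions are not permanently occupied by players who do not fit the club’s price and salary structure anyway.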

Skill-Set

  • Set pieces
    As a general rule, performance at set pieces should be considered separately when assessing players. Philipp Hofmann from KSC serves as an example. He scored a remarkable 17 goals this season, but three of them were penalties and (at least, according to my brief research) five came after a corner. If you subtract the goals scored from set pieces, there are still nine respectable goals left, but for a striker who played the whole season uninjured and was usually the lone striker, what came from open play is not really impressive.
  • Successful actions
    Regardless of the position you are looking for, you should consider all the skills required for a position as a whole. What is the success rate of all of a player’s actions on the field, be it passes, shots or duels? InStat, for example, provides such a metric.
    Additionally, the metrics g+, Dangerousity and/or non-shot xG should also be considered, as they also help to indicate the successful actions, but offer another added value, because these metrics also contain statements on how the player’s actions change his own goal probability (and that of the opponent). Does a player help us to increase our goal probability and/or decrease the opponent’s? This is absolutely the most central question to ask when scouting. And that’s where the use of these metrics is certainly helpful.
  • Finishes & the way to them
    For offensive players in particular, this is certainly one of the most important skill sets. In addition to the actual goals and assists, xG and xA as well as other metrics should be taken into account. How often did passes into the centre find their target? What about secondary assists? Opta, for example, also defines the metric chances created, which covers the assists but also the passes before them.
    But before we get lost in secondary assists, I’d like to talk about the importance of xG/xA metrics in assessing player performance: for offensive players it is simply elementary to get something out of their actions. If FCSP signs an outfield player who has scored three goals and provided two assists in, say, 50 games, then that player should urgently show in the xG and xA metrics that much more is possible (I don’t have anyone specific in mind). Or at least such a player should be absolutely exceptional in the secondary assists/chances created section.
    Or there are clever minds who recognise that such a player has much more potential (but that’s another matter, a completely separate topic; more about that later). By the way: expected goals models can and should of course also be used to evaluate defensive players. The principle “the lower the opponent’s xG value, the better the defensive work” still leaves much room for misinterpretation, but as a start, the metric can be used for this.
xG-Graph of Marvin Ducksch from the season 17/18.
I created it with data from Stratabet (which doesn’t exist anymore), which at that time provided data to bloggers for free. Here you can see that Marvin Ducksch was Holstein Kiel’s top scorer with 18 goals that season, but with 125 shots and a total xG value of over 20, he could have scored a few more.
  • Sort of passes 
    Certainly, one of the most important skills in the evaluation of footballers, since a successful and good pass combines many components (good technique, reaction speed, feeling for space). We already had the example of the pass quota with Subotic and Hummels. I’ll make it very clear again: The pass quota of all passes played is not at all meaningful. If a player is only using back and cross passes and therefore has little risk in the passing game, then Lothar Matthäus on Sky can be happy about the great pass quota of a player and talk something about “world-class”. But as long as the passes are not further divided, they are completely uninteresting for scouting.
    And there are incredibly many ways to subdivide passes further, and which ones make sense depends on the position. For advanced players, the already mentioned packing is of course meaningful. For central defenders, Statsbomb always lists “unpressed long balls” as a metric. Borrowed from American football is the metric progressive pass yards, which indicates how many yards of space a player has gained with his passes (and there are certainly players who have caused a loss of space).
    Completely different pass metrics should be considered when scouting offensive outfield players. For example, packing does not only credit the player whose pass takes opponents out of the game. The players who receive such passes are also counted, along with the number of opponents bypassed by the passes they receive. Packing therefore also allows, to a certain extent, an evaluation of the off-the-ball movement of offensive players. And while progressive pass yards play a minor role for offensive players (since they are usually already high up the pitch and play their passes there), the metric passes in the offensive third takes on special meaning. In addition, crosses have to be considered separately from other passes, especially for players on the wings.
    And, of course, if the analysis allows it, all passes should be evaluated by the pressure under which each kind of pass is played. Here an analysis of position data can help to find out how much pressure a player was under at the moment of the pass.
    A scouting model should then include some kind of weighted rate across the many specially defined subtypes of passes (e.g. low-risk passes are not as valuable as risky ones, and successful passes in the attacking third are more valuable than those played without opponent pressure in the build-up, etc.).
  • Duels
    Of course, duels must also be subdivided. Is it a defensive or an offensive duel, is it a ground or an aerial duel, and where on the pitch was it fought? Again, the overall duel success rate is not really meaningful (although in my opinion much more so than the overall pass completion rate), and the metric gets better the more finely it is divided. Nevertheless, the rate should be used here. Players who only win 30% of their offensive duels might not be much help in attack.
    The metric “duels” also includes dribbles, which are very important for offensive players. Especially in the wide positions, there will be a focus on 1-on-1 situations in the future. And only a player who completes a certain number of dribbles successfully can really be helpful and improve his own game.
  • Ball wins / losses
    Another key metric is ball wins and losses. It is particularly important that they are considered in relation to a team’s possession of the ball. In Statsbomb, for example, possession-adjusted values are given for duels and ball wins. After all, a team with a lot of possession will obviously have fewer ball wins overall, so the absolute number of ball wins of its players will differ from that of players in teams with less possession.
    When assessing ball wins and losses, the position on the pitch (and, of course, the pressure from the opponents) is also decisive. If a striker loses a lot of balls in the opposing half because he often has to play alone against three opponents, this should be assessed differently than if a player out wide repeatedly lets the ball bounce away from him without any pressure.
  • Index
    Many data portals on players, free and paid, provide indexes of player strength. Among the free ones, I will certainly not mention the kicker average score, but there are e.g. whoscored and sofascore, which produce an index for players based on data. However, these are usually quite simple and give disproportionate weight to goals scored.
    And since I was able to have a look into the data of InStat a few weeks ago, I also got to know the InStat index there. But of course, WyScout and GSN also have such an index.
    Important when using it: if it is not clear exactly which criteria go into such an index and how, the value itself is interesting but should not be the main basis for evaluation. Especially if a club wants to go its own way in scouting, it is advisable not to rely on the providers’ indexes, which are available to all other clubs as well.
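
To make the idea of a weighted pass rate from the section on passes a bit more tangible, here is a minimal sketch in Python. The pass subtypes, the weights and the season totals are all invented for illustration; a club would have to define its own subtypes and weights.

```python
from dataclasses import dataclass

# Hypothetical weights per pass subtype (assumption for illustration):
# the riskier and more valuable the pass, the more it counts.
PASS_WEIGHTS = {
    "back_or_square": 0.2,
    "build_up_unpressed": 0.5,
    "build_up_pressed": 1.0,
    "final_third": 1.5,
}

@dataclass
class PassCount:
    attempted: int
    completed: int

def raw_rate(passes):
    """Plain completion rate over all passes, regardless of subtype."""
    attempted = sum(p.attempted for p in passes.values())
    completed = sum(p.completed for p in passes.values())
    return completed / attempted if attempted else 0.0

def weighted_rate(passes):
    """Completion rate in which each subtype counts according to its weight."""
    attempted = sum(PASS_WEIGHTS[k] * p.attempted for k, p in passes.items())
    completed = sum(PASS_WEIGHTS[k] * p.completed for k, p in passes.items())
    return completed / attempted if attempted else 0.0

# Invented season totals for two players.
safe_passer = {
    "back_or_square": PassCount(70, 68),
    "build_up_unpressed": PassCount(15, 13),
    "build_up_pressed": PassCount(10, 4),
    "final_third": PassCount(5, 1),
}
risk_taker = {
    "back_or_square": PassCount(30, 29),
    "build_up_unpressed": PassCount(20, 17),
    "build_up_pressed": PassCount(25, 19),
    "final_third": PassCount(25, 17),
}

for name, passes in [("safe passer", safe_passer), ("risk taker", risk_taker)]:
    print(f"{name}: raw {raw_rate(passes):.3f}, weighted {weighted_rate(passes):.3f}")
```

With these made-up numbers the safe passer has the better raw rate (0.86 against 0.82), but falls clearly behind once the passes are weighted (about 0.66 against 0.75).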

Develop a scouting model

Now we have a whole range of values and factors which, in my opinion, must be taken into account in scouting. But if you look at all these values individually, you will
1. get completely tangled up, because the multitude of factors makes you lose the overview very quickly, and
2. not really get anywhere, because the data cannot simply be added up.

So how do the data need to be structured so that the player who can really help a club can be identified? And what structure does a club need in order to stand out in scouting?
Let’s put it briefly: a club must develop its own scouting model. A model that produces a separate index for every position sought in the squad, fed by specially defined requirements. This is the only way for a club to really set itself apart in the long term and to go its own way in data-driven scouting, because only with its own model can it differ from other clubs in its evaluation of players. How do you turn the valuable data that has been worked out into a model that can be used in a club? Of course, there are already external providers who have developed such models. But here, too, if you want to be different, you have to create the model yourself, otherwise you will be fishing in the same pond under the same conditions as everyone else. Developing a club’s own scouting model requires a procedure in several steps:

Step 1: Define position profiles
I have already highlighted a number of data relevant for scouting. And it quickly became clear that for a central defender, for example, it doesn’t matter how high his xG value from open play is. Much more important, of course, is his defensive behaviour (ground and aerial duels, number of clearances/ball wins), his passing in the build-up and so on.

A club should, therefore, define a profile for each individual position in the squad, which makes it clear what kind of data is relevant for that position and should, therefore, be included in a scouting model. This is work that needs to be re-evaluated over and over again, depending on how much the demands on individual positions or on the squad change over time.
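
As a rough sketch of what such a position profile could look like as a data structure in Python: the positions and metric names below are assumptions for illustration, not field names of any data provider.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class PositionProfile:
    position: str
    metrics: List[str]  # the metrics that matter for this position

# Illustrative profiles (hypothetical metric names).
PROFILES = {
    "centre_back": PositionProfile(
        "centre_back",
        ["ground_duel_rate", "aerial_duel_rate", "clearances_p90", "build_up_pass_rate"],
    ),
    "winger": PositionProfile(
        "winger",
        ["xa_p90", "cross_accuracy", "dribble_rate", "final_third_passes_p90"],
    ),
}

def relevant_metrics(player_stats: Dict[str, float],
                     profile: PositionProfile) -> Dict[str, Optional[float]]:
    """Keep only the metrics the profile asks for; missing ones come back as None."""
    return {metric: player_stats.get(metric) for metric in profile.metrics}

# Invented values for one central defender.
stats = {"xg_p90": 0.05, "ground_duel_rate": 0.64, "aerial_duel_rate": 0.70, "clearances_p90": 4.1}
print(relevant_metrics(stats, PROFILES["centre_back"]))
```

The xG value is dropped because the profile does not ask for it, and the missing build-up pass rate shows up as a gap that still has to be filled in the data mining step.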

Step 2: Data Mining
Once the position profiles have been developed, the next step is data mining. Here clubs should first ask themselves whether they want to collect the data themselves (from the position data of the individual games) or obtain them from data providers.
My opinion: a mixture of both approaches is a good idea, because some of the metrics that are interesting for a club are not available “off the shelf”. If a club buys data, it should not limit itself to a single provider. “A lot helps a lot” must be the credo here. Because, as the example with the duel values has already shown, not every xG model comes to the same result: in one model the number of opponent players between ball and goal is included in the calculation, while in another this value is missing but the opponent pressure (e.g. defined as the number of players within a certain radius of the player taking the shot) is included instead. Therefore, it makes sense to tap as many different data sources as possible in order to get the most accurate picture possible.

Data mining is then certainly the point in the development of such scouting models where many “traditional” scouts and sports officials in football drop out. There is simply too much data. In a single football match, more than 3 million position data points are recorded. And even the 2000-4000 event data points per match (depending on which events are recorded) simply cannot be comfortably entered into an Excel spreadsheet. This is where the help of data analysts with appropriate programming skills is needed. And I can say from my own experience that the effort is worth it. Because the nice thing about doing such a data analysis in e.g. Python (my language of choice) is that once a script is written, the processes run much faster. This step may eat up most of the time at the beginning of the work on a scouting model (we are not talking about weeks but months), but once the corresponding scripts have been written, data mining takes much less time, because they can be used again and again, and even if the position profiles change, they usually only need fine-tuning.
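
To give an impression of what such a reusable script can look like, here is a very small sketch that turns raw event data into per-90 values per player. The event format (dictionaries with `player`, `type` and `success`) is an assumption made for this example; real provider feeds contain far more fields and look different.

```python
from collections import defaultdict

def per_90(events, minutes_played):
    """Count attempted and successful events per player and type, scaled to 90 minutes."""
    counts = defaultdict(lambda: defaultdict(int))
    for event in events:
        player_counts = counts[event["player"]]
        player_counts[event["type"] + "_attempted"] += 1
        if event["success"]:
            player_counts[event["type"] + "_successful"] += 1
    return {
        player: {key: value * 90 / minutes_played[player] for key, value in stats.items()}
        for player, stats in counts.items()
    }

# Invented mini data set: player A played 90 minutes, player B only 45.
events = [
    {"player": "A", "type": "pass", "success": True},
    {"player": "A", "type": "pass", "success": False},
    {"player": "A", "type": "duel", "success": True},
    {"player": "B", "type": "duel", "success": True},
    {"player": "B", "type": "duel", "success": False},
    {"player": "B", "type": "pass", "success": True},
]
minutes_played = {"A": 90, "B": 45}

for player, stats in per_90(events, minutes_played).items():
    print(player, stats)
```

Once written, the same function runs over every match and every player, which is exactly why the initial effort pays off.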

Step 3: Weighting of the data
Once a club has all the data together, a kind of priority list must be created. Which skills are most important? What must a player definitely be good at? Are there exclusion criteria? The weighting of the skills required for a position will determine whether such a scouting model can lead to a general improvement in scouting in clubs or whether it is more a question of interesting but not very meaningful additional information on players.

Correspondingly thorough work should be done at this point. The advantage of a specially developed scouting model, e.g. in comparison to the tools in InStat, is above all that players do not get kicked off the list after falling below certain limits. 

An example: as a club, I am looking for a winger who above all provides a lot of assists. His values should be correspondingly good in expected assists and crossing accuracy (to work with only two variables in this example; in reality there are of course a few more). With InStat you can define a lower limit for each of these. But this means that players who fall below a limit are removed from the ranking completely. That would be pretty stupid if there are players who more than meet all other requirements but just don’t have the desired crossing accuracy. If you work with a weighting of the individual skills instead, a player who lacks the desired crossing accuracy will still be ranked lower, but he won’t be removed from the ranking completely. This is a really important difference: if only one skill falls short of the desired level, the shortcoming can be compensated if the player excels in other skills. No one is thrown out of the ranking, and therefore no one who fits the profile can disappear from the radar, even if some skills are not (yet) as desired.
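
The difference between hard limits and weighting can be sketched in a few lines of Python. The players, the limits and the weights are invented, and the skill values are assumed to be already scaled between 0 and 1.

```python
# Invented, already normalised skill values between 0 and 1.
players = {
    "Player A": {"xa": 0.95, "cross_acc": 0.45},
    "Player B": {"xa": 0.60, "cross_acc": 0.65},
    "Player C": {"xa": 0.55, "cross_acc": 0.70},
}
LIMITS = {"xa": 0.5, "cross_acc": 0.5}    # hypothetical lower limits
WEIGHTS = {"xa": 0.7, "cross_acc": 0.3}   # hypothetical priorities

def filter_by_limits(players, limits):
    """Hard filter: a player is dropped as soon as one value falls below its limit."""
    return [
        name for name, skills in players.items()
        if all(skills[metric] >= limit for metric, limit in limits.items())
    ]

def rank_by_weighted_score(players, weights):
    """Weighted ranking: every player stays on the list, weak skills only cost points."""
    scores = {
        name: sum(weights[metric] * skills[metric] for metric in weights)
        for name, skills in players.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(filter_by_limits(players, LIMITS))
print(rank_by_weighted_score(players, WEIGHTS))
```

With the hard limits, Player A disappears because of his crossing accuracy. In the weighted ranking he is not only still there, he is at the top, because his expected assists more than make up for it.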

In order to create a model, the values have to be brought onto a common scale. It is probably clear to everyone that a value of 20.3 in the xG metric cannot be lumped together with a duel success rate. Accordingly, the individual metrics have to be standardised with the help of limit values, so that in the end uniform figures come out which can then be weighted.
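
One simple way to do this, and only one of several possible ones, is to map every metric onto a scale from 0 to 1 between a lower and an upper limit value. The limit values and weights in this sketch are assumptions for illustration.

```python
def normalise(value, lower, upper):
    """Map a raw metric onto 0..1 between two limit values, clipping anything outside."""
    if upper <= lower:
        raise ValueError("upper limit must be greater than lower limit")
    return min(1.0, max(0.0, (value - lower) / (upper - lower)))

# Hypothetical limit values per metric: (worst plausible, best plausible).
LIMITS = {"xg": (0.0, 25.0), "duel_rate": (0.30, 0.70)}
WEIGHTS = {"xg": 0.6, "duel_rate": 0.4}  # hypothetical weights, summing to 1

def index(raw):
    """Weighted sum of the normalised metrics."""
    return sum(WEIGHTS[m] * normalise(raw[m], *LIMITS[m]) for m in WEIGHTS)

striker = {"xg": 20.3, "duel_rate": 0.45}
print({m: round(normalise(striker[m], *LIMITS[m]), 3) for m in WEIGHTS})
print(round(index(striker), 3))
```

The xG value of 20.3 becomes 0.812 and the duel rate of 45% becomes 0.375, so both can be combined into one index (here about 0.637).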

Such a model does not, of course, develop overnight. It will take weeks to months to create a weighted model for each individual position from the raw data collected and collated. But again, once you have developed the basic structure of a ranking for players, it is “only” a matter of fine-tuning in the next step: the model validation.

Experts in talent recognition among themselves: FCSP U19 coach Timo Schultz and Thomas Reis, 18/19 still coach of the Wolfsburg U19, today successful coach in Bochum.
(c) Peter Boehmer

Step 4: Model validation
So the first draft of a scouting model will produce a ranking of players. Are the best players in this ranking really the players who are to be counted among the best in their position? This will certainly not be the case in the first version. The model needs a deep validation to be really helpful. For such validation, experts are needed to provide the appropriate input. By experts, I mean scouts, qualified coaches and people with proven knowledge of the game and the players. At this point, the club should scrape together all the football expertise it has in its own ranks to make the data-driven scouting model an effective tool in player scouting.

To do this, the experts must provide an assessment of the ranking. Do the best players in the ranking match the experts’ assessment? If not (and this will be the case in the first versions), why does player XYZ appear so far up in this ranking? Was a metric weighted wrongly? Is the importance of a skill generally overrated?
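
Such a feedback round can be supported with a few numbers. The sketch below compares a model ranking with an expert ranking using Spearman's rank correlation (a swapped-in standard measure, not something prescribed above) and lists the players on whom both disagree most. The rankings are invented, and both lists are assumed to contain the same players without ties.

```python
def spearman(model_rank, expert_rank):
    """Spearman rank correlation of two rankings of the same players (no ties, n >= 2)."""
    n = len(model_rank)
    expert_pos = {player: i for i, player in enumerate(expert_rank)}
    d_squared = sum((i - expert_pos[player]) ** 2 for i, player in enumerate(model_rank))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def biggest_disagreements(model_rank, expert_rank, top=2):
    """Players with the largest gap in places.

    A positive gap means the model ranks the player higher than the experts do,
    a negative gap means the model ranks him lower.
    """
    expert_pos = {player: i for i, player in enumerate(expert_rank)}
    gaps = [(player, expert_pos[player] - i) for i, player in enumerate(model_rank)]
    return sorted(gaps, key=lambda gap: abs(gap[1]), reverse=True)[:top]

# Invented rankings, best player first.
model = ["P1", "P2", "P3", "P4", "P5"]
experts = ["P2", "P1", "P5", "P3", "P4"]

print(spearman(model, experts))
print(biggest_disagreements(model, experts))
```

A correlation close to 1 means model and experts largely agree. The players with the biggest gaps are the ones worth discussing in the feedback round: either a weighting is off, or the model has spotted something the eye has missed.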

After this kind of feedback round, the position profiles and the weighting of the data will probably need to be revised. Then comes the next round. And the next one. And ideally, at some point everyone will ask themselves what’s wrong with the model, since two players who repeatedly make it into the top 10 have not yet been on the club’s own scouting radar. And then you might find out that these players have been wrongly rated so far, either because of a lack of playing time or because the skills the club is looking for simply don’t show up on visual inspection the way they do in the data. And if these players have not yet appeared on the radar of other clubs, well then… Jackpot!

Granted: this is the ideal case, which will not occur with every scouting model for every position. More often, the data analysis will fine-tune the assessment from visual scouting and thereby improve the evaluation of a player (which would already create enormous added value).

No two scouting models are the same

A specially developed scouting model therefore takes time. But once it is developed, I think it can become a really helpful tool in the scouting department of clubs. I am writing here quite deliberately that it can be a tool. Such a model would not replace the traditional scouting but can significantly improve it.
It should also be taken into account that the performance of players has to be weighted according to the league. I think it will make sense to everyone that a player from the Regionalliga will not immediately deliver the same performance in the 2nd division. Accordingly, the skills still have to be weighted depending on the league in which they were shown.
During the development, but above all during the interpretation of such a model, it is important to consider what kind of player should be signed. Does a club need immediate help, a player who can improve a position right away with his existing skill set and other parameters? Or is a club looking for players who can develop in the shadow of established players, players who are not yet able to deliver the necessary performance (consistently) but have high potential? The ideal case would, of course, be both: a player who is already a real reinforcement but is also developing into an even better player.
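
The league weighting mentioned above can be sketched very simply as a factor per league. The factors here are pure assumptions for illustration; in practice a club would have to estimate them itself, for example from how the numbers of players changed after they moved between leagues.

```python
# Hypothetical league factors relative to the target league (2nd division = 1.0).
LEAGUE_FACTOR = {
    "2. Bundesliga": 1.0,
    "3. Liga": 0.85,
    "Regionalliga": 0.7,
}

def league_adjusted(score, league):
    """Scale a model score by the strength of the league in which it was achieved."""
    if league not in LEAGUE_FACTOR:
        raise KeyError(f"no league factor defined for {league!r}")
    return score * LEAGUE_FACTOR[league]

# Invented model scores for two candidates.
candidates = [("Player X", 0.80, "Regionalliga"), ("Player Y", 0.65, "2. Bundesliga")]
for name, score, league in candidates:
    print(name, round(league_adjusted(score, league), 3))
```

With these made-up factors, the Regionalliga player with the higher raw score (0.80) ends up behind the second-division player (0.56 against 0.65).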

Depending on the type of player being sought, the scouting model must be aligned accordingly (so of course there are then several models). Especially with young players it is completely normal that they do not consistently deliver their best performance and that their performance on the pitch fluctuates enormously. Accordingly, scouting must pay more attention to peak performance than to the average or median as a parameter in a model. A scouting model can represent this peak performance, but the assessment of whether a player can improve and at some point reproduce these peaks consistently is then naturally left to the experts. The question of whether the peaks can become constant performance is nothing less than the core around which everything revolves in the scouting of young players. And it is incredibly difficult to answer. I recommend a YouTube video by Rasmus Ankersen on this topic. Ankersen is one of the heads of the rather data-driven clubs Brentford FC and FC Midtjylland.
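
How peak performance can be separated from the median is shown in this small sketch. Defining the peak as the mean of the best 20% of matches is my own assumption here, other definitions are just as conceivable, and the match ratings are invented.

```python
from statistics import median

def peak(ratings, top_share=0.2):
    """Mean of the best matches (top 20% by default) as a simple measure of peak performance."""
    k = max(1, round(len(ratings) * top_share))
    best = sorted(ratings, reverse=True)[:k]
    return sum(best) / k

# Invented match ratings (0..1) over ten matches each.
young_player = [0.30, 0.90, 0.40, 0.85, 0.35, 0.50, 0.95, 0.30, 0.45, 0.40]
established_player = [0.60, 0.62, 0.58, 0.61, 0.60, 0.59, 0.63, 0.60, 0.61, 0.60]

for name, ratings in [("young", young_player), ("established", established_player)]:
    print(f"{name}: median {median(ratings):.3f}, peak {peak(ratings):.3f}")
```

The young player is clearly worse in the median (0.425 against 0.600), but his peaks are far higher (0.925 against 0.625). Whether he can reach them consistently one day is the question the model cannot answer.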

A graphic which hopefully all scouts of this world already know: the performance/potential matrix. And every decision-maker would like to sign players from the category “Potential Gem”, who are just not yet delivering what they are capable of at their best. The reasons for the low performance are manifold and range from the training conditions (= little training, bad training, wrong training) to the character of the player.
(source for the graph)

How would the scouting ideally go?

Back to the topic: It doesn’t have to be a whole battalion of data analysts and a separate scouting department in clubs. For many clubs, scouting with the help of data analysis could certainly be significantly improved if they had someone who could filter out only those data from the multitude of available or “purchasable” data that are really relevant for scouting in the club and develop scouting models for individual positions. This requires a basic understanding of football and, of course, at least a certain affinity for data analysis, so that large volumes of data can be handled accordingly and prepared in such a way that decision-makers can work with them.

I have no idea if and how intensively this kind of data analysis is already used in the 2nd division (the battalions of analysts are already in place at all Premier League clubs, and in the 1st division, too, many clubs work with their own data analysts). From my point of view, working with data analyses is absolutely necessary, not least given the sums that clubs invest in players, since even a financially relatively small position can make quite a large contribution to successful scouting. At any rate, an attempt to develop such a model would leave only a rather small dent in the annual balance sheet compared with the costs of the squad. Accordingly, my simple assessment would be: trying makes you wiser!

Want to read even further?
If you are interested in the areas where data analysis is useful in football and is already used by clubs, take a look here:

  • For example a super exciting report about Ian Graham and his work in Liverpool
  • The texts published by optasports are always highly recommended
  • A general description of the use of data for transfers can be found here
  • If you want to know how data is collected at opta and what might be possible in the future, take a look here
  • Data analysis is already being used, e.g. by neighbours – in the medical field
  • And in general, I would like to present the book “Matchplan: Die neue Fußballmatrix” by Christoph Biermann on the topic

// Tim (Translated by Arne)
