System for notational analysis in small-sided soccer games

The objective of this study was to compose an objective and detailed notational analysis system for 3 vs. 2 + GK small-sided soccer games, in which three roles are examined: attacker with ball, attacker without ball and defender. The actions and the outcome of the actions were registered for each player and in each role. Players earn points for each action and outcome according to an a priori determined scheme. Performance scores for each role are calculated as the average number of points a participant earns per trial. This notation system was tested on 19 highly talented female soccer players and validity and reliability of the system were determined. In addition, practical applications were discussed and the most important items of the notation system were determined and using only these items, a simplified notation system was proposed. The notation system has high ecological validity and can discriminate the high and low categorized players, but further development is necessary to increase the reliability of the system.


Introduction
Assessing tactical skills of team sport players is challenging but interesting for both sport practice and science. In sport practice, trainers, coaches and scouts want an easy tool to determine the quality of performance, identify strengths and weaknesses and follow the development of players. Scientifically, an objective method to assess tactical skills of team sport players on the field would be valuable for research on expertise and decision making. Bard and Fleury 1 were the first to attempt to objectively examine decision-making skills by presenting slides of offensive basketball game situations to experienced basketball players and novices, after which they had to verbalize their response. However, the validity and reliability of this test were not reported. Better ecological validity would be acquired using film clips, as argued by Helsen and Pauwels. 2 They were among the first to develop a film-based decision-making test, which has been used frequently ever since. [3][4][5][6] However, the most ecologically valid way of measuring decision making or tactical skills is by using game play. [7][8][9] By coding behaviours exhibited during game play, actual performance can be assessed. This is more authentic and represents one's ability more accurately. 10 In sports and physical education, there is an increasing interest in developing performance assessment instruments that can be used on game play performances. In a review, Arias and Castejón 11 showed that the two most often cited assessment instruments are the Team Sport Assessment Procedure (TSAP) and the Game Performance Assessment Instrument (GPAI).
The TSAP of Gréhaigne et al. 12 was designed for invasion games and examines how players gain ball possession and how they play the ball. Ball possession can be gained by conquering or receiving the ball, and then, the player can play a neutral ball, lose the ball, play an offensive ball or execute a successful shot. Based on the frequencies of occurrence, the volume of play and efficiency index can be calculated and those two combined yield a performance score. Although this is an easy-to-use assessment instrument, its major limitation is that it only examines the player in possession of the ball. Since a player carries the ball for less than 2% of the game, [13][14][15] it is essential that a performance assessment instrument for team sports also includes the performances of players off-the-ball.
The GPAI designed by Oslin et al. 16 is the most frequently used assessment instrument 11 and includes both on-the-ball and off-the-ball movements. Oslin et al. 16 aimed for a performance assessment instrument that can be used for any kind of game and identified general game components for which the observer has to assess the appropriateness of the player's behaviour. For example, each time a player is in ball possession, the observer assesses the decisions made; these are coded as appropriate if the player chooses to shoot or pass to an open teammate when the opportunity is available, and as inappropriate if the player does not pass at an appropriate time or passes to a marked teammate. Thus, the observer has to decide whether players are open or marked, whether a pass is given at the appropriate time or not, etc., which leads to a high level of subjectivity in the assessment process.
Other, more recent, performance assessment instruments used general tactical principles of the game (e.g. 'penetration' or 'offensive coverage' as in FUT-SAT 17,18) or did not assess the performances of all the players (i.e. attackers and defenders) involved in the game (e.g. Game Performance Evaluation Tool 19; for an overview of performance assessment instruments, see 11 or 20). This inspired us to develop a detailed and, in our view, more objective notation system in which the performances of all players are assessed, that is, attacker with ball, attacker without ball and defender. For each role, the actions of the participants are registered as well as the outcome of the actions. Depending on the outcome of the action, the participant earns points for each action corresponding to the a priori determined point distribution, so that the user of the system is not required to judge the quality or appropriateness of the actions performed by the players. Performance scores for each role are calculated as the average number of points a participant earns per trial in that role.
The aim of the current study was to examine the validity and reliability of the notation system among highly talented soccer players. Validity was determined with regard to ecological, content, concurrent and construct validity. To determine the reliability of the notation system, inter- and intra-observer reliability were assessed. Subsequently, the most important items of the notation system were determined and, using only these items, a simplified notation system was proposed. Finally, practical applications were discussed.

Participants
A total of 19 highly talented female soccer players participated in this study, with a mean age of 16.3 years (SD = 1.1) and a mean soccer experience of 9.9 years (SD = 2.3). They all played in the national soccer talent team, in which they train about 15 to 20 h a week and play in a high-level competition for males under 14 years of age. The experiment was approved by the local ethics committee of the research institute and all participants gave their written informed consent prior to the experiment; parental consent was provided for players younger than 18 years.

Procedure
To assess the performances of the players (i.e. attackers and defenders), we chose to use 3 vs. 2 + GK small-sided games (i.e. 3 attackers vs. 2 defenders and a goalkeeper) since these are less complex than 11 vs. 11 matches, facilitate more ball touches per player and are the basics of soccer according to the Royal Netherlands Football Association. 21 The small-sided game was played on a 40-m long and 25-m wide field (dimensions were advised by the head coach of the national soccer talent team) with official sized goals, and official soccer rules, including offside, were applied.
The six players were instructed to start at specific locations (Figure 1). The attackers' task was to try to score as quickly as possible, whereas the defenders had to prevent that. If the defenders obtained ball possession, they had to try to score at the opposite goal. However, the turnover was included only for motivational reasons; the notational analysis was carried out only on the performance prior to the change of ball possession (the participants were unaware of this). The trial ended if a goal was scored, a foul was made or the ball went out of play. The variables that were measured are explained in the section 'Notation system'. After five trials, the participants switched roles (except for the goalkeeper), so that all participants played in each position. Thus, in one test, a participant played 15 attacking trials and 10 defending trials. In total, eight tests were conducted, spread out over 4.5 months. Participants who attended fewer than five tests were excluded from analysis. A total of 733 trials were analysed; on average, a participant played 34 trials (SD = 5) per position.
van Maarseveen et al.
The tests took place on the regular training pitch of the national soccer talent team and were video recorded with a GoPro Hero 3 camera (Black Edition, resolution 1920 × 1080, 30 Hz; GoPro, USA) that was fixed on a 6.5-m high platform (Showtec LTB-200/6 Lifting Tower, The Netherlands), and analysed afterwards using the notation system.

Notation system
Our notation system distinguishes three roles for a player: attacker with ball, attacker without ball and defender. For each role, possible actions and outcomes have been identified and defined (Table 1). The first step of the notation system was to analyse the video footage frame by frame, registering all the actions a participant made, and their outcomes, for each role. For positioning, not the frequency but the duration of being open or marked was registered. This could easily be done using video coding software such as Dartfish (TeamPro 7), which we used.
Depending on the outcome, the participants earned points for the actions they performed. The allocation of points was a priori determined by soccer experts, and is shown in Table 1. For example, when a player passes the ball towards a teammate, this teammate receives the ball and the pass was directed forward, then the passing player earns two points. Only for positioning was a slightly different approach used: the registered durations in each of the categories of positioning were used to calculate the percentage of time a player spent in each category, and these percentages were then multiplied by the points allocated to each category, as can be found in Table 1. For example, when a player was open, on his own half, in the centre of the field, for 25% of the total time, then this player got 0.25 × 2 = 0.5 points for this category. By adding up the points per trial for each role, and calculating the average number of points a player received per trial, a performance score for each role was computed. There were no minimum or maximum scores, as the performance scores depend on the actions that a player made and on the outcome of these actions.
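The scoring procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionaries `ACTION_POINTS` and `POSITION_POINTS` and the category names are hypothetical stand-ins for Table 1, and only the two values marked "from the text" are taken from the examples given here.

```python
from statistics import mean

# Hypothetical excerpt of the point scheme. Only the entries marked
# "from the text" come from the paper; the others are placeholders.
ACTION_POINTS = {
    ("pass", "received_forward"): 2,   # from the text
    ("pass", "intercepted"): -1,       # placeholder value
}
POSITION_POINTS = {
    "open_own_half_centre": 2,         # from the text
    "marked_own_half_centre": 0,       # placeholder value
}


def trial_points(actions, position_durations):
    """Points earned in one trial in one role.

    actions: list of (action, outcome) tuples registered in the trial.
    position_durations: seconds spent in each positioning category
        (empty for roles in which positioning is not scored).
    """
    points = sum(ACTION_POINTS[a] for a in actions)
    total = sum(position_durations.values())
    if total > 0:
        # Positioning: share of time in each category times its points.
        points += sum(
            (duration / total) * POSITION_POINTS[category]
            for category, duration in position_durations.items()
        )
    return points


def performance_score(trials):
    """Average number of points per trial for one role."""
    return mean(trial_points(a, p) for a, p in trials)


# Attacker with ball: two trials, one successful forward pass, one lost pass.
with_ball = [
    ([("pass", "received_forward")], {}),
    ([("pass", "intercepted")], {}),
]
# Attacker without ball: open for 2 s out of 8 s (25% of the time).
positioning_only = trial_points(
    [], {"open_own_half_centre": 2.0, "marked_own_half_centre": 6.0}
)
print(performance_score(with_ball), positioning_only)
```

Because the score is an average over trials, players who played different numbers of trials in a role remain comparable.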

Data analysis
Validity. In addition to descriptions of the ecological and content validity of the notation system, the concurrent validity and construct validity were calculated.
Ecological validity. Ecological validity reflects the congruency between the constraints during assessment and real-life situations. Using a representative design, in which the task constraints are similar to the natural performance setting, a high ecological validity is achieved. 22 Our notation system was applied to 3 vs. 2 + GK small-sided games, which enabled the participants to behave naturally; thus, with regard to the task constraints of the assessment method, the ecological validity of our notation system is high. With regard to the actual soccer game, however, the ecological validity can be improved by assessing the performances of the players while playing 11 vs. 11 on a regular-sized pitch instead of 3 vs. 2 + GK small-sided games. Nevertheless, in comparison with previous research, the assessment method used in the current study is a proper representation of the actual performance environment.
Content validity. Content validity was determined by two experts with over 25 years of experience in coaching soccer at national and international level. They provided feedback on the terms and definitions of the notation system and discussed the allocation of points until consensus was reached.
[Table 1 excerpt: The attacker moves the ball, after receiving and prior to passing/shooting (without a near defender) and . . .]
Concurrent validity. Concurrent validity can be determined by correlating the results of a new measurement technique with a reference criterion that is administered at about the same time. 23 In this study, the head coach judged the performances of the players and categorized them as high, medium or low. Categorizations were made for their general performance in the 3 vs. 2 + GK tests and for their specific performances as attacker with ball, attacker without ball and defender. As an indication of concurrent validity, Kendall's tau correlations 24 were determined between the categorizations of the coach and the performance scores attained with the notation system.
Construct validity. Construct validity of the notation system was determined by its success in differentiating between the high and low categorized players. Performance scores for the three roles of the high and low categorized players were compared separately using independent t-tests to determine whether the notation system could differentiate between skill level.
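The two validity statistics named above can be illustrated with a short, dependency-free Python sketch. Several things here are assumptions rather than details from the study: the text does not state which variant of Kendall's tau or of the independent t-test was used, so tau-b (which handles the ties that a three-level categorization produces) and the pooled-variance Student's t are chosen for illustration; the coding low = 0, medium = 1, high = 2 and all data values are made up. P-values are omitted; in practice a statistics package such as SciPy (`scipy.stats.kendalltau`, `scipy.stats.ttest_ind`) would supply them.

```python
from math import sqrt
from statistics import mean, variance


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: not counted
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


def independent_t(a, b):
    """Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Made-up data: coach categories (0 = low, 1 = medium, 2 = high)
# and performance scores for six hypothetical players.
coach = [0, 0, 1, 1, 2, 2]
scores = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0]

tau = kendall_tau_b(coach, scores)                # concurrent validity
high = [s for c, s in zip(coach, scores) if c == 2]
low = [s for c, s in zip(coach, scores) if c == 0]
t, df = independent_t(high, low)                  # construct validity
print(round(tau, 3), round(t, 3), df)
```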
Reliability. The reliability of the notation system was determined using intra-observer and inter-observer reliability.
Intra-observer reliability. A total of 75 trials (10% of the complete dataset) were coded twice by the main researcher to determine intra-observer reliability. Hughes et al. 25 recommend using percentage error as an indicator of reliability for categorical data, and values less than 5% are seen as acceptable. With the exception of positioning, percentage error was calculated for each action and outcome separately, to give insight into the reliability of the separate items. For positioning, the duration of being open or marked was registered, and thus the Pearson correlation between the two data sets was determined as the reliability score.
Inter-observer reliability. Although the main researcher coded all data, an assistant was also trained for 5 h to use the notation system. After training, a total of 118 trials (16% of the complete dataset) were coded by the assistant to assess inter-observer reliability. The percentage error 25 was calculated for all actions and outcomes separately, except for positioning, for which the Pearson correlation between the two coders was assessed.
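The two reliability measures can be sketched as follows. The percentage error is assumed here to take the form (sum of absolute differences between the two codings, divided by the mean of the two totals) × 100, applied to the frequency counts of a single item; the exact computation the authors used is not given in this section, and the counts in the example are made up. The Pearson correlation is written out by hand so that the sketch has no dependencies.

```python
from math import sqrt


def percentage_error(count_1, count_2):
    """Percentage error between two codings of one item.

    count_1, count_2: how often the item was registered in the first and
    second coding (same observer twice, or two different observers).
    """
    mean_count = (count_1 + count_2) / 2
    if mean_count == 0:
        return 0.0  # item never registered in either coding
    return abs(count_1 - count_2) / mean_count * 100


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of durations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up counts: an item coded 100 times in one pass and 98 in the other.
error = percentage_error(100, 98)
print(f"{error:.2f}% error, acceptable: {error < 5}")

# Made-up positioning durations (seconds) from two codings of three trials.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With the 5% criterion, small absolute disagreements on rarely coded items produce large percentage errors, which is one reason to report only items coded more than a few times.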
Simplification of the notation system. As it is labour-intensive to register all actions and outcomes for each role, we also examined whether it is possible to simplify the notation system. For each role, we calculated the average occurrence of each action per player per trial and the percentage of points the players earned with each action in relation to the total number of points they earned for that particular role. We also examined the ability to discriminate the high and low categorized players of each action separately by using independent t-tests. Based on these results, we stepwise excluded actions from the notation system to find a simplified notation system that included as few as possible actions but was still able to differentiate between the high and low categorized players.
Practical applications. For coaches, it is valuable to have an easy method to compare the players to each other and to get an overview of the strengths and weaknesses of each individual player. To fulfil this request, we created two easy-to-read graphs based on the results of the notation system. To compare the performances of the players within a team or group, a graphical representation was created of the performance scores for offence (i.e. the sum of the performance scores for the role of attacker with ball and without ball) and defence of each player. Also, the average group scores were displayed. The individual strengths and weaknesses were explored by calculating the points each participant earned for each action separately. We expressed them as z-scores to facilitate the comparisons between actions and displayed them in a radar graph.
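The quantities behind the two graphs can be sketched in a few lines of Python. The function names and example values are hypothetical, and the use of the sample standard deviation for the z-scores is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev


def offence_score(with_ball, without_ball):
    """Offence score: sum of the two attacking performance scores."""
    return with_ball + without_ball


def z_scores(points_by_player):
    """Standardise the points for one action across all players.

    points_by_player: {player id: points earned with this action}.
    Uses the sample standard deviation (an assumption).
    """
    values = list(points_by_player.values())
    m, s = mean(values), stdev(values)
    return {player: (v - m) / s for player, v in points_by_player.items()}


# Made-up points earned with one action by three hypothetical players.
passing = {"P1": 1.0, "P2": 2.0, "P3": 3.0}
print(z_scores(passing))
print(offence_score(1.5, 0.75))
```

Standardising per action puts frequent, high-scoring actions and rare, low-scoring actions on a common scale, so that one radar graph can show all of them for a single player.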

Validity
Concurrent validity. Significant correlations between the categorizations by the coach and the performance scores were found for general performance, τ = .486, p < .05, attacker with ball, τ = .397, p < .05, attacker without ball, τ = .523, p < .05, and defender, τ = .461, p < .05, indicating that the performance scores as obtained with the notation system were significantly related to the categorizations by the head coach.
Construct validity. Table 2 shows the means and standard deviations of the performance scores for the three roles for the high and low categorized players. The high categorized players obtained significantly higher performance scores with the notation system than the low categorized players in all three roles, all ps < .05, meaning that the notation system can differentiate between the high and low categorized players.

Reliability
Intra-observer reliability. Table 3 shows the intra-observer reliability for each action and outcome that was coded frequently in this sample (i.e. more than 5 times). All actions and outcomes were coded with a percentage error within the acceptable 5%, except for running actions, being offside, defensive pressure and intercepting the ball, of which the last two were only slightly above the 5% norm. For positioning, the Pearson correlation between the two data sets was found to be significant, all ps < .001, and ranging from 0.865 to 0.995. Thus, overall the intra-observer reliability was sufficient to good.
Inter-observer reliability. The inter-observer reliability for each action and outcome that was coded more than 5 times in this sample is displayed in Table 3. The percentage error varied from 0.0% to 45.9%, indicating that some items had high inter-observer reliability and others low. For positioning, a significant correlation was found between the two coders.
Table 3. Intra- and inter-observer reliability, expressed as percentage error, except for positioning, for which Pearson correlation was calculated.

Simplification of the notation system
Table 4 shows for each role the average occurrence of each action per player per trial, the percentage of points the players earned with that action, and the independent t-test statistics on the performance scores acquired with each action separately. When used separately, only the action defensive pressure yielded a significant difference between the high and low categorized players; the other separate actions could not differentiate the high from the low categorized players. As shown in Table 2 (and in Table 5, the first line for each role), the construct validity of the complete notation system is good, meaning that it differentiates the high and low categorized players. To reduce the workload of the system, we examined whether it is possible to simplify the system without losing its discriminating ability. Stepwise elimination of actions from the notation system revealed that by including the three actions shooting, dribbling and offensive 1:1 duel, the notation system can discriminate the high and low categorized players in the role of attacker with ball (Table 5). For the role of attacker without ball, running action, being in promising position and positioning are necessary and sufficient to significantly differentiate the high from the low categorized players. For defenders, the high and low categorized players can be discriminated by including only the single action defensive pressure.

Practical applications
The performance scores on offence and defence are displayed in Figure 2 for each participant. Using this graphical representation, it is easy for coaches to see how the players score in comparison to each other. The best players appear in the top right corner and the weakest in the bottom left corner. The defence specialists (i.e. good in defence, weak in offence) are located in the bottom right corner and the offence specialists (i.e. good in offence, weak in defence) in the top left corner. Several soccer coaches have approved the practical relevance of this graph.
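The reading of this graph can be expressed as a small Python sketch. Two points are assumptions: the corner descriptions imply that defence is plotted on the horizontal axis and offence on the vertical axis, and the group averages are used here as the dividing lines between "good" and "weak". The function name, labels and scores are hypothetical.

```python
from statistics import mean


def quadrant_labels(scores):
    """Label each player by quadrant relative to the group averages.

    scores: {player id: (offence score, defence score)}.
    """
    off_mean = mean(off for off, _ in scores.values())
    def_mean = mean(dfn for _, dfn in scores.values())
    labels = {}
    for player, (off, dfn) in scores.items():
        if off >= off_mean and dfn >= def_mean:
            labels[player] = "strong in both (top right)"
        elif off < off_mean and dfn < def_mean:
            labels[player] = "weak in both (bottom left)"
        elif dfn >= def_mean:
            labels[player] = "defence specialist (bottom right)"
        else:
            labels[player] = "offence specialist (top left)"
    return labels


# Made-up (offence, defence) scores for four hypothetical players.
example = {
    "A": (4.0, 3.0),
    "B": (1.0, 1.0),
    "C": (1.5, 3.5),
    "D": (3.5, 0.5),
}
for player, label in quadrant_labels(example).items():
    print(player, label)
```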
Examples of the individual strengths and weaknesses of two participants are shown in Figure 3. Participant 12 had high performance scores for all three roles, whereas Participant 15 scored low on the roles attacker with ball and defender and above average for the role of attacker without ball. The strengths and weaknesses graphs (Figure 3) reveal that Participant 12 especially excels in passing but may benefit from improving her intercepting skills, and that although Participant 15 scored on average low on defending, her intercepting skills were above average.
Table 4. For each action in each role, the mean occurrence per player per trial, the mean percentage of points earned with that action in relation to the total number of points for that role, the mean and standard deviation of the high and low categorized players and the test of the difference between them.

Discussion
The aim of this study was to take a first step in developing an objective notation system for small-sided soccer games that examines player performances both on and off the ball. The notation system was tested on highly talented female soccer players from the national talent program. Validity and reliability of the notation system were determined, practical applications were shown and a simplified system was proposed to reduce the workload of the complete notation system. The notation system has high ecological validity, as a representative design is used in which the task constraints are similar to the natural performance setting and which consequently enables natural behaviour. Assessing the performances of the players while playing regular 11 vs. 11 matches will further improve the ecological validity and is interesting for future research. Nevertheless, in comparison with previous research, the method we used to assess performance is a proper representation of the actual performance setting. Furthermore, as two experts with over 25 years of experience in coaching soccer at national and international level contributed to the development of the notation system, the content validity of the notation system was warranted.
The concurrent validity of the notation system was found to be significant for each role and for the overall performance score. However, the correlations between the performance scores and the categorizations by the head coach showed only medium to large effects. This could possibly be due to correlating the performance scores with the opinion of one expert instead of a panel of experts. The small sample of 19 players could also have affected the results; furthermore, these players were all enrolled in the national talent program, meaning that they were all highly skilled players and consequently large differences were not to be expected. Applying the notation system to a larger and more heterogeneously skilled group of players will probably yield higher concurrent validity.
Construct validity was determined by comparing the performance scores of the high and low categorized players. In each role, the high categorized players scored significantly higher than the low categorized players, demonstrating the good ability of the notation system to discriminate the high- and low-skilled players.
The intra-observer reliability was good except for running actions and offside. The inter-observer reliability, however, was good for some actions but low for dribbling, 1:1 duel both offensively and defensively, running action, offside, defensive pressure and intercepting. For most of these, the recognition of the action was found to be more difficult than the determination of the outcome of that action, as the reliability scores of the outcome were more often at an acceptable level than the reliability scores of the actions. The actions that scored low on reliability were all actions that are less objectively identifiable than actions like passes or shots on goal, indicating that improvement in reliability can be expected after clarifying the definitions of those actions. The low reliability of offside is probably due to the fact that it is an item that can be easily forgotten to register and, in addition, the camera's viewpoint (behind the goal) made it difficult to identify offside. The notation system showed reasonably good intra-observer reliability, but the inter-observer reliability requires more attention. The reliability can be improved by defining the actions and outcomes more clearly and by administering more guided training with the notation system than the current 5 h of practice before starting to assess performances.
Another reason for the low reliability scores may be the complexity of the system, as many actions and outcomes need to be registered. Reducing the workload by eliminating actions from the system may also improve the inter-observer reliability. We found that when only the actions shooting, dribbling and offensive 1:1 duel were included for the attacker with ball, only running actions, being in promising position and positioning for the attacker without ball, and only defensive pressure for the defender, the complexity and workload of the notation system were reduced considerably, but its ability to differentiate the high- from the low-skilled players remained.
On the other hand, using specialised cameras and software that can track the positions of the players and ball 26 in combination with specially designed algorithms, the registration of all actions of all players on the field can be automated. An advantage of registering all actions is that it reveals a great deal of specific information about the players, which can be used to create player profiles indicating strengths and weaknesses of each player, as we showed in the practical applications. These player profiles can be used to evaluate training, to follow the development of the individual players and to set goals for an individualised training program. 27 Also, the comparison of the performances of the players within a team is of practical relevance to coaches and scouts. For example, coaches can easily compare players and choose a more offensively or defensively playing midfielder according to their preferred game strategy. For both practical applications that we showed, a benchmark would be of great value. Then players can be compared to age- and gender-matched top-level players. To achieve this, the performances of many players of different ages and genders should be assessed with the notation system. Until now, the notation system has been used to assess the performances of just 19 players. As these players were all enrolled in the national talent program, and thus preselected on their high skills, large differences in performance among the players were not to be expected. The fact that the notation system was able to discriminate the high from the low categorized players shows the potential of the notation system to assist in talent identification.

Conclusion
The notation system we composed for assessing performances of soccer players in 3 vs. 2 + GK small-sided games seems a good first step towards an objective assessment tool that examines performances both on and off the ball. The notation system differentiates the high- and low-skilled players and has high ecological validity, which may be improved by examining 11 vs. 11 matches. Further development is necessary to increase the reliability of the system, and a longitudinal study on the use of the system to assist in player evaluation and selection would be valuable.