Musculoskeletal (MSK) profiling tools, particularly of the lower limb, are widely used to highlight injury risk and to influence the composition of rehabilitation and conditioning programming for return to training (RTT).1 Suboptimal movement quality, or movement that is impaired, inefficient, asymmetrical, functionally compensated or diminished,2 is believed to have an impact on injury risk;2–4 as such, the assessment of movement quality by practitioners is popular in clinical practice. Movement quality is considered to be a modifiable factor that can influence injury risk, as research has continued to show associations between movement variability and MSK injury;5 therefore, tools that capture and monitor changes in movement quality are of growing interest to practitioners.
Quantitative and qualitative human movement analysis is widely utilized in sport and clinical practice. Laboratory-based three-dimensional (3D) analysis is regarded as the “gold standard”;6,7 however, in the non-research environment it is expensive, time consuming, and often unfeasible to set up. The real-world need to capture data frequently on large numbers of participants has led to several qualitative visual rating criteria emerging as a cheaper, more accessible means of human movement analysis.7
Several authors3,8–11 developed and explored the use of lower limb qualitative scales during functional movements, to provide clinicians with a simply applied means of identifying movement quality issues within the MSK system. The development of the Landing Error Scoring System (LESS) provided practitioners with a reliable and valid tool9,12 with minimal set-up time and efficient post-test evaluation through the assessment of a jump landing technique. Unfortunately, its analysis of trunk position is limited, it includes no evaluation of the upper limb, and it evaluates bilateral jumping movements only. Torso and upper limb positioning has been shown to influence the lower limb during landing,13–15 potentially impacting lower limb loading, patterning, movement quality, and subsequent injury risk. While these studies have acknowledged the contributions and impact the torso and upper limb may have on the biomechanics of the lower limb, protocols measuring and capturing torso and upper limb movement within whole movement patterning are lacking. Additional investigation into developing a methodology via qualitative means is therefore warranted.
The Qualitative Analysis of Single Leg Loading (QASLS) is a relatively new clinical assessment tool that incorporates biomechanical analysis of movement patterns of the lower limb, upper limb, and torso during single-leg loading tasks in addition to providing a compound score.16 This allows for comparison between limbs and is also arguably more representative of the unilateral hopping, landing, and change of direction patterns observed in sport. Unilateral limb evaluation is important because unilateral loading remains the most common mechanism of the majority of lower limb overuse and traumatic injuries.11 Furthermore, the effective evaluation of unilateral movement quality provides valuable markers for identifying both sporting and non-sporting individuals at risk of injury.
Research into QASLS use is limited, with only one study to date reporting on intra-rater reliability.17 However, that study was limited by the sample size of participants and raters, and no insight into absolute measurement error was presented. Measurement error values are an integral element of understanding the value of a tool, task, or intervention, as they inform a clinician if any notable changes have occurred and whether they are representative of a truly observed change and not attributed to systematic error, chance or an intervention. While ICCs indicate reliability, they remain insensitive to sample variety.18 It is therefore recommended that a standard error of measurement (SEM) and the smallest detectable difference (SDD) also be presented to accurately identify and establish parameters to classify changes in performance.19 The SEM informs clinicians of the measurement error of a test and is presented in the same units as the measurements, and therefore allows comparison with other SEM values presented within the literature. The SDD provides a base value which should be surpassed to distinguish real change from random error.
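For orientation, the two statistics are commonly written as below. This is a sketch of widely used forms rather than a statement of the exact formulas applied in this study; in particular, taking SD as the standard deviation of the observed scores is an assumption.

```latex
\mathrm{SEM} = \mathrm{SD}\,\sqrt{1 - \mathrm{ICC}},
\qquad
\mathrm{SDD} = 1.96 \times \sqrt{2} \times \mathrm{SEM} \approx 2.77 \times \mathrm{SEM}
```

Here 1.96 is the z-value for a 95% confidence level, and the factor of \sqrt{2} reflects that measurement error is present in both of the two measurements being compared.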
Currently, no investigation has documented measurement error values or within- and between-session reliability values of the QASLS system. If the measurement error, reliability and validity of the qualitative method can be established, practitioners will be able to use the QASLS system with greater certainty. This will assist with informing observation around individual and group performances, movement variability and associated injury risk, to support the development of better profiling practices. The primary purpose of this study was to determine the intra- and inter-rater, within- and between-session reliability of the QASLS tool during two unilateral movement tasks. A secondary purpose was to report on the associated measurement error and SDD. It was hypothesised that QASLS scores would demonstrate good to excellent reliability for all tasks; however, it was expected that inter-rater reliability would demonstrate greater variability depending on rater experience.
MATERIALS AND METHODS
Fifteen vocationally trained elite pre-professional20 female dancers21 volunteered for this study. The college's medical team approved participation by all participants, who were uninjured and had no history of surgical intervention within the prior six months. Informed written consent was provided by each participant. This study was approved by the University Research and Ethics Committee.
Participants attended testing within their performance facility on three separate occasions during a three-week testing period. Within-session data collection occurred on the same day, with Session 2 occurring one hour after Session 1, and between-session data collection occurred one week later. All testing sessions were conducted at the same time of day to account for circadian rhythm changes that may affect performance tasks. Participants performed a single-leg squat (SLS) and a single-leg landing (SLL) on both the right and left legs. The order of the movement tasks was randomly selected in Session 1 (by participants selecting face-down cards with the tasks written on them) and repeated in Session 2.
MOVEMENT ASSESSMENT TASKS
Each movement trial was recorded from the frontal and sagittal planes with video cameras (Panasonic Lumix DMC-FZ200) at a sampling rate of 100 Hz, positioned three meters from the testing/landing zone, with each camera mounted on a tripod set to a height of 0.7 m (Figure 1).
Single-Leg Squat (SLS) (Figure 2a)
Participants were asked to stand on one limb (self-selected) facing the frontal plane. They were verbally instructed to squat as low as possible, as if sitting back and down on a chair, and return to the start position. Participants were then asked to repeat the task on the opposite limb. No further instructions were provided so as not to influence the individual’s movement strategy.
Single-Leg Land (SLL) (Figure 2b)
Participants stood on a 30 cm high box. They were asked to step forward and land onto the contralateral limb, holding the landing for at least two seconds. No further instructions were provided so as not to influence the individual’s movement strategy.
QASLS is a visual rating tool that provides segmental scoring of an observed unilateral loaded movement pattern on a 10-point scale. Adopting a dichotomous scoring strategy across six body segments (Arm, Trunk, Pelvis, Hip, Knee and Ankle), the tool uses region-specific criteria where an appropriate strategy scores a zero and a suboptimal strategy scores a one (Figure 3). A higher QASLS score indicates a greater number of suboptimal strategies used to complete a task, and a lower QASLS score indicates fewer component strategies required to complete the task. Within the QASLS framework, operational definitions are provided in conjunction with the movement strategies observed at each segmental level, along with instruction relating to compound dichotomous scoring. The QASLS system is advocated to be used so that the compound score, regardless of whether it comes from a single or multiple efforts, comprises the total number of strategies required by an individual to complete the task, regardless of frequency. Namely, if three or five repetitions of a unilateral task are completed, the practitioner awards one mark whether a sub-optimal strategy is observed once or five times, resulting in the accumulation of a “sub-optimal” trial. Five repetitions were evaluated, based on previously reported procedures within the only article to evaluate reliability of the tool,17 to designate a compound overall QASLS score.
Videos were analyzed using the QASLS scoring sheet (Figure 3). Scores were derived for each participant from both the frontal and sagittal plane views, with each video viewed, then marked and scored. Three raters (LH, AM and BO) independently scored participants across the five trials via the QASLS scoring sheet, having viewed both the frontal and sagittal plane videos for each participant. The three raters were provided with written instructions on how to assess the movement tasks via QASLS, could review the videos as many times as required to obtain a score, and were blinded to the other raters' scores to avoid potential bias.
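The compound scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the 10-item list layout and the example scores are hypothetical and are not taken from the QASLS scoring sheet or from study data.

```python
from typing import Sequence

N_ITEMS = 10  # assumed layout: 10 dichotomous items (0 = appropriate, 1 = suboptimal)


def compound_qasls(repetitions: Sequence[Sequence[int]]) -> int:
    """Return a compound score across repetitions of one task on one limb.

    An item contributes 1 if the suboptimal strategy is seen in ANY
    repetition, so how often it appears does not change the score.
    """
    if not repetitions:
        raise ValueError("at least one repetition is required")
    for rep in repetitions:
        if len(rep) != N_ITEMS or any(v not in (0, 1) for v in rep):
            raise ValueError("each repetition needs 10 items scored 0 or 1")
    # zip(*repetitions) groups the scores item by item across repetitions
    return sum(1 for item_scores in zip(*repetitions) if any(item_scores))


# Hypothetical scores for five repetitions (not study data)
reps = [
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
]
score = compound_qasls(reps)
print(score)
```

In this example item 7 is flagged in four of the five repetitions but still contributes a single mark, alongside items 2 and 10, which are each flagged once.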
SPSS for Windows (version 25) (SPSS Inc, Chicago, IL.) was used to determine within- and between-session reliability agreement. Within- and between-session reliability agreement22 of the QASLS rating criteria was determined using intra-class correlations (ICCk,3) for each limb and movement assessment task, with 95% confidence intervals (95% CI). A custom-made spreadsheet (Microsoft Excel Version 16.16.22) calculated standard error of measurement (SEM) and smallest detectable difference (SDD) values. Within- and between-session reliability of composite scores was calculated using a mean rating (ICCk,3) 2-way mixed-effects absolute agreement model. ICCk,3 values were interpreted as > 0.90 excellent, 0.75-0.90 good, 0.50-0.75 moderate, and < 0.50 poor.18 Statistical significance was set at p < 0.05.
Due to the dichotomous nature of the QASLS system,22,23 intra- and inter-rater agreement for compound scores and individual components of the QASLS tool was determined via the percentage of exact agreement (PEA%) and the kappa coefficient. Cohen's scales24 were selected to interpret kappa values, where 0.81-1.00 is almost perfect agreement, 0.61-0.80 substantial, 0.41-0.60 moderate, 0.21-0.40 fair and 0.01-0.20 none to slight. Acceptable PEA% has been described25 as between 75-90%; however, this remains specific to each study. As there are no current universally accepted interpretations, and in the absence of literature supporting clear interpretation of PEA, ≥ 66% has been chosen as a reflection of majority agreement (55-75% as defined in other papers).24,26 SEM and SDD were calculated to represent and establish the smallest worthwhile change and identify random error scores between test sessions, with formulas taken from previously reported methods.16,19,27–29
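The spreadsheet calculation and the ICC interpretation bands can be sketched as follows. This is a minimal sketch, not the authors' spreadsheet: it assumes the commonly used forms SEM = SD × √(1 − ICC) and SDD = 1.96 × √2 × SEM, and the function names, boundary handling and example inputs are hypothetical.

```python
import math


def interpret_icc(icc: float) -> str:
    """Map an ICC to the bands used in the text.

    How exact boundary values (0.50, 0.75, 0.90) are assigned is an assumption.
    """
    if icc > 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    if sd < 0 or not 0.0 <= icc <= 1.0:
        raise ValueError("sd must be non-negative and icc must lie in [0, 1]")
    return sd * math.sqrt(1.0 - icc)


def sdd_from_sem(sem: float) -> float:
    """Smallest detectable difference, assuming SDD = 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical inputs chosen for illustration only (not study data)
example_sd = 2.0
example_icc = 0.84
example_sem = sem_from_icc(example_sd, example_icc)
example_sdd = sdd_from_sem(example_sem)
print(interpret_icc(example_icc), round(example_sem, 2), round(example_sdd, 2))
```

Because the SDD is a fixed multiple of the SEM under this assumption, any rounding applied to the SEM carries directly into the reported SDD.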
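As an illustration of the two agreement statistics, the sketch below computes PEA% and an unweighted Cohen's kappa for two raters scoring the same dichotomous items. The function names and the example ratings are hypothetical, and assigning values that fall between the published bands (for example 0.605) to the lower band is an assumption.

```python
from collections import Counter
from typing import Optional, Sequence


def percent_exact_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Percentage of observations on which the two raters gave the same score."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> Optional[float]:
    """Unweighted Cohen's kappa; None when chance agreement equals 1."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal proportions
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return None  # both raters used one identical category throughout
    return (p_o - p_e) / (1.0 - p_e)


def interpret_kappa(kappa: float) -> str:
    """Map kappa to the bands listed in the text."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "none to slight"


# Hypothetical ratings of one QASLS item across ten participants (not study data)
rater_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rater_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
pea = percent_exact_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(pea, round(kappa, 2), interpret_kappa(kappa))
```

In this example the raters agree on 8 of 10 observations, yet kappa is only moderate because much of that agreement is expected by chance given how often both raters score zero.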
Fifteen participants originally volunteered for this study. Due to corrupted video data, one participant was excluded, resulting in an analysis of 14 participants (age 19±2 years, height 167±6 cm, body mass 56±6 kg).
Within and Between Session Reliability of QASLS (Table 1)
No significant differences were noted between limbs (p = 0.20) or testing sessions for both tasks. Within- and between-session reliability for both tasks was moderate to excellent (ICC = 0.67-0.93). Within-session reliability of the QASLS composite score (0-10) for SLS was good for both limbs (Right ICC = 0.82, 95%CI = .36-.96; Left ICC = 0.86, 95%CI = .49-.97). The SEM for within-day reliability was 0.82 and 0.72 points, and the SDD was 2.28 and 2.00 points on a ten-point scale, for the right and left limbs respectively. Similar results were observed in the right SLL task (ICC = 0.87, 95%CI = .42-.97, SEM 0.45, SDD 1.26); however, left limb reliability was moderate (ICC = 0.67, 95%CI = .25-.92, SEM 0.89, SDD 2.45).
Between-session reliability of the composite QASLS score for the SLS was slightly lower than the within-session scores and was graded as moderate (Right ICC = 0.72, 95%CI = .15-.93; Left ICC = 0.69, 95%CI = .07-.92). The SEM for SLS between-session reliability was 0.96 and 0.99, and the SDD was 2.65 and 2.75, for the right and left limbs respectively. The SLL task demonstrated greater between-session reliability than the SLS task (ICC = 0.92-0.93), with an SEM of 0.41 and SDD of 1.14 for the right limb and 0.47 and 1.52 for the left limb. SEMs for both within-session and between-session were less than 1, with the SDD ranging from 1.0-2.5 points. This suggests a measurement error of 1 point across testing time frames and that a change of 1-3 points would be necessary to demonstrate a minimal detectable change.
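As a worked check of the within-session values, and assuming the commonly used relation SDD = 1.96 × √2 × SEM (the formula itself is not stated in the text), the reported SEMs give:

```latex
\begin{aligned}
\text{SLS right:}\quad & 1.96 \times \sqrt{2} \times 0.82 \approx 2.27 \quad (\text{reported } 2.28)\\
\text{SLS left:}\quad  & 1.96 \times \sqrt{2} \times 0.72 \approx 2.00 \quad (\text{reported } 2.00)\\
\text{SLL right:}\quad & 1.96 \times \sqrt{2} \times 0.45 \approx 1.25 \quad (\text{reported } 1.26)\\
\text{SLL left:}\quad  & 1.96 \times \sqrt{2} \times 0.89 \approx 2.47 \quad (\text{reported } 2.45)
\end{aligned}
```

The small differences are of the size expected if the SDD was computed from unrounded SEM values.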
Intra-Rater Reliability of QASLS
Intra-rater agreement was excellent to perfect (k = 0.85-1.0) for both movement tasks, with perfect agreement in all conditions except the right SLL (k = 0.85, PEA = 90%), where items 7 and 8 of the QASLS criteria were scored differently for participants 1 and 5, respectively. Therefore, individual components of the QASLS tool were further analyzed, with details found in Tables 2 and 3.
Inter-Rater Reliability of QASLS
Table 4 presents the inter-rater reliability for compound QASLS scores. Inter-rater reliability ranged from none to substantial for the SLS (k = 0.13-0.74) and from none to slight for the SLL (k = 0.03-0.17). The single-leg squat demonstrated the biggest discrepancy in PEA%. Rater 2 (R2) demonstrated the greatest difference in agreement with Rater 1 (R1) and Rater 3 (R3) (43% and 90%, respectively). R2 and R3 demonstrated the highest levels of PEA% (53.3%-90%) with each other, whereas agreement with R1 was lower for both R2 (43-47%) and R3 (53-60%).
Inter-rater reliability for individual and categorical components ranged from none to substantial (k = 0.00-0.80) (Table 5). Kappa values could not be established for all rater and participant scores, due to the lack of variance in one or both raters' scores. Despite high values of PEA% (such as 100%), low kappa scores were still noted. During the SLS, raters demonstrated the best agreement for pelvic, knee and touchdown components (items 3, 4, 7, 8 and 9 on the criteria); however, this differed for the SLL, where raters demonstrated the best agreement for upper limb, trunk and ankle components (items 1, 2 and 10 on the criteria).
The purpose of this study was to determine the intra- and inter-rater and within- and between-session reliability of the QASLS tool during two unilateral movement tasks, the SLS and SLL. A secondary purpose was to report on the associated measurement error. Overall compound QASLS scores suggest moderate to excellent reliability (ICC = 0.82-0.87 and ICC = 0.69-0.93 for within and between session, respectively), indicating the QASLS tool is sufficiently reliable for movement analysis of the unilateral movement tasks of squatting and landing. Results highlighted that there was a measurement error of 1 point between testing timeframes and that a change of 1-3 points is required to determine a change in performance. This is believed to be the first study to provide within- and between-session reliability specifically for the QASLS tool; therefore, there is no prior research with which to compare results. Other qualitative movement screens that use dichotomous scales similar to QASLS, such as the Functional Movement Screen (FMS) and LESS, have reported similar test-retest reliability values.27 Shultz et al. established that compound FMS scoring was relatively good (ICC = 0.6) for elite female athletes when tested seven days apart. The reliability values within this study are consistent with those reported in the above literature; however, the 95% CIs remain wide, potentially due to the variability within human movement. Despite ICCs being commonly reported in reliability studies, within qualitative research many interpretations of the ICC exist. Therefore, clarification of excellent or good reliability is elusive, with studies classifying broad values (from 0.40 to >0.80) as excellent or fair to good.30
Intra-rater reliability was found to be excellent (PEA% 90-100%, k = 0.85-1.0) and is in agreement with other work17 (although limited to SLS) that has analyzed rater reliability. There are believed to be no comparable papers currently available concerning the reliability of the QASLS tool and an SLL task.
Overall inter-rater compound QASLS scoring was non-substantial for SLS (k = 0.03-0.17), which is lower than previously reported reliability;17 however, PEA% ranged from 43-90%. Results were comparable to other qualitative measures that have analysed SLS.2 Chmielewski et al. showed PEA of 32-48% during SLS via a segmental approach and weighted kappa values of 0.00-0.53.31 Shultz et al. (2013) described inter-rater agreement via Krippendorff's α (Kα) as poor (Kα = .38) when using the FMS on female athletes.
Inter-rater reliability of each QASLS component was fair to almost perfect (k = 0.40-1.0). Regarding individual component analysis, the best agreement appeared to be between R2 and R3 during the SLS, with 100% agreement in 8/10 categories. The three raters demonstrated differences of agreement in components 6, 7 and 8 (non-weight-bearing (NWB) thigh movement and knee valgus). Previous findings17 have also reported rater disagreement on the valgus components during the scoring of SLS in university participants. The raters in that research, as with the raters in the current study, received no formal training and were reliant on the operational definitions presented within the tool.
The operational definitions presented in components 7 and 8 of the QASLS tool are very similar in their description and might not be precise enough for raters to deduce the difference between the terms “noticeable” and “significant”. It is unclear if the reliability results observed in this study are attributable to the level of rater training or to vagueness of the operational definition of knee valgus. This might also provide an explanation for why these differences were not reflected in SLL results, where the greater complexity of the task suggests valgus is easier to spot within the movement pattern.
Inter-rater reliability could not be calculated for some categorical components due to the lack of variance between raters and observations of movement errors, which is described as the kappa paradox.24 When conceiving this study, due to minimal research regarding the QASLS tool, important decisions regarding the interpretation of the variable generated by the QASLS tool were considered, as this would dictate the statistical approach. Unlike quantitative variables seen in 2D or 3D movement data, which follow interval or ratio principles and can be parametrically analyzed, a case could be made for QASLS being classified as ordinal (due to the dichotomous element of the segmental evaluation, where the outcome falls into two categories of yes or no) and interval (compound scores that run on a scale of 0-10 where the gaps are proportional); thus, how best to establish tool performance relating to reliability and agreement was open to debate. Previous visual rating methods that also use dichotomous scoring treat data as interval.9–11 The QASLS tool has been designed as a clinical instrument to provide a score that guides practitioners in evaluation of single-leg loading patterns of the whole system; the decision was therefore made to evaluate data as an interval variable.
Study Applications and Limitations
A strength of this study was the presence of both the kappa and PEA% analysis methods, yet neither method is without fault. PEA is a precise, interpretable and easily determined statistic but does not account for chance rater guesses.24 The kappa value eliminates any chance rater choices, but is limited in sensitivity where data prevalence clusters very high or very low, or in homogeneous populations where estimated agreement appears exclusively lowered.22
Described as the “base rate problem”32 and usually seen as a moderate to high PEA with a low kappa score, the paradox has been shown to occur in very simple cases with only two evaluators and two outcomes (similar to this paper's design), at equal points of the sensitivity and specificity of the raters, or if one of the raters assigns one specific outcome more frequently,33 as observed between R1 and R2, and R1 and R3.
Data indicated that at the individual participant level, movement variability was high, with different movement patterns deployed within the same movement task, but as an overall cohort movement patterns were consistent and therefore variability was low. It is unsurprising that this data set has high levels of homogeneity, which is likely unavoidable in the analysis of a sub-elite population. Analysis of movement quality remains a key aspect of profiling and programming within the sporting environment; it is therefore likely that future research will continue to be focused within this population. It is prudent to acknowledge the limitations of this non-heterogeneous sample and the likely impact that would have on a kappa score, and establishing a truly heterogeneous elite sporting population would be difficult to achieve. The argument is therefore made that the limitation is within the statistic rather than the direct relevance of the population. Future research into other sporting populations, such as injured or adolescent athletes, where a cohort could be relatively heterogeneous in its construct, would be warranted.
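The paradox can be illustrated numerically with a 2 × 2 agreement table in which one outcome dominates. The sketch below is illustrative only: the function name and the counts are hypothetical and are not drawn from the study data.

```python
import math
from typing import Optional, Tuple


def agreement_stats(
    both_yes: int, r1_only: int, r2_only: int, both_no: int
) -> Tuple[float, Optional[float]]:
    """Return (PEA%, kappa) from the four cells of a two-rater 2 x 2 table.

    Kappa is returned as None when chance agreement equals 1, which is
    what happens when neither rater shows any variance.
    """
    n = both_yes + r1_only + r2_only + both_no
    if n == 0:
        raise ValueError("the table must contain at least one observation")
    p_o = (both_yes + both_no) / n
    r1_yes = (both_yes + r1_only) / n  # proportion scored 'yes' by rater 1
    r2_yes = (both_yes + r2_only) / n  # proportion scored 'yes' by rater 2
    p_e = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    if math.isclose(p_e, 1.0):
        return 100.0 * p_o, None
    return 100.0 * p_o, (p_o - p_e) / (1 - p_e)


# Hypothetical item that is rarely scored as suboptimal: 18 of 20 ratings agree,
# but each rater flags a different single participant.
pea_skewed, kappa_skewed = agreement_stats(both_yes=0, r1_only=1, r2_only=1, both_no=18)

# Hypothetical item that neither rater ever scores as suboptimal.
pea_flat, kappa_flat = agreement_stats(both_yes=0, r1_only=0, r2_only=0, both_no=20)

print(round(pea_skewed, 1), round(kappa_skewed, 3))
print(round(pea_flat, 1), kappa_flat)
```

In the first case the raters agree on 90% of observations, yet kappa is slightly negative because chance agreement is already about 90%. In the second case agreement is complete, but kappa cannot be computed at all.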
A final limitation of the study is the level of rater training provided in using the QASLS tool. The findings of the kappa results are potentially suggestive of a redesign of the test instrument or retraining of the raters.24 Given the robustness of the intra-rater and between and within-session results, the requirement for full instrument redesign appears unlikely. Raters were provided with the same standardized instructions on how to administer the tool along with the basic component operational definitions embedded within the tool. It is possible that each rater interpreted each section in a specific way which ultimately impacted agreement.
While training around the use and interpretations of other movement visual rating criteria is standardized by other authors, currently there are no training programs available for the QASLS tool. It was therefore decided that understanding the current interpretations, limitations, and strengths of the QASLS tool as it is currently used without training within clinical practice, was more pertinent for this study, to better guide any future recommendations around QASLS training content.
Rater training is an important component of qualitative analysis34 but rarely appears to be delivered in a standardized way.3,9,10,31,32 Providing raters with greater instruction around operational definitions and providing potential examples of each observable segmental strategy (e.g. trunk dominant, hip avoidant, knee dominant) may assist raters clinically in standardizing their scoring methods. This is particularly evident around components 7 and 8 of the QASLS tool, where identifying minor deviations in knee movement appeared more difficult. This is also supported by the better levels of reliability and agreement observed during SLL, where the larger deviations seen within that movement pattern are more discernible. The QASLS tool has demonstrated satisfactory within- and between-session and intra-rater reliability for its use by practitioners. Results demonstrate that the current operational definitions within the tool are adequate for intra-rater use; further work on rater education, including standardized examples, may support more consistent and objective analysis and improve agreement ratings before more widespread use.
The QASLS tool demonstrated moderate to excellent within- and between-session reliability and excellent intra-rater reliability, and could be used as a movement quality tool to evaluate unilateral squatting and landing tasks by a single rater. PEA% was acceptable for inter-rater agreement, but results should be interpreted with caution. It would be beneficial to explore the operational definitions used within the tool, so that inter-rater agreement could be elevated to more acceptable levels. A potentially homogeneous population was selected, and while not unrepresentative of a healthy, elite sporting population, it is unclear how the QASLS tool may be influenced by more heterogeneous samples such as injured populations or adolescent age groups. Future investigation within additional groups of athletes will provide greater understanding of the application and continuing development of the QASLS and other visual observation tools of movement quality.
Conflicts of interest
The authors report no conflicts of interest.