BACKGROUND
The toll of musculoskeletal injuries is difficult to quantify, but it is likely substantial among nations across the economic spectrum. The ramifications of musculoskeletal injury are far-reaching and include healthcare costs as well as impacts on quality of life, future health, and workplace productivity, among others.1 Physical activity, despite its readily apparent benefits to physical health, increases one’s exposure to potentially injurious events and is often implicated in initiating the cycle of injury-related personal and societal costs. Recent epidemiological studies of sport-related injury in the U.S. estimate that 8.6 million Americans report an activity-related injury each year.2 Preserving the benefits of physical activity while avoiding adverse outcomes requires balancing participation with, where possible, minimizing exposure to injury risk.3
One potential method for reducing such exposure involves screening for, or modifying, high-risk movement behaviors. The developers of the Functional Movement Screen™ (FMS™) proposed that the practice of sports medicine was lacking with respect to injury risk screening.4,5 They described a gap between 1) the pre-participation medical clearance exam and 2) performance testing designed to guide sport-related training or tactical decisions. Their solution, which has since gained considerable traction, involves screening fundamental movement behaviors as an indicator of potential activity-related injury risk and as an initial means of identifying possible avenues of remediation.
Initial research on the FMS™ indicated that it may help prospectively discriminate individuals at high vs. low risk for activity-related injury on the basis of a standardized movement assessment battery.6 This observation has led to an increased focus on the application of movement screens, both as a predictor of risk and to support the design of training programs. Additional movement assessment instruments developed to date have sought to address a range of populations and specific activity-based needs.7–11 These developments, and the accelerating pace of research on the topic of movement quality, attest to the continued interest in applying such instruments clinically.
Notwithstanding, the proliferation of movement screens as a pre-participation tool has led to a concomitant increase in the demand for raters, many of whom lack demonstrated competence in the visual evaluation of movement. As the scale of application increases for the FMS™ and similar clinical instruments, there is a potential for their reliability to suffer within and across studies. This may stem from variability in rater expertise, individual raters adopting personal preferences in rating style, or the mutual influence of different screening systems featuring similar component tests. Any such source of error has the potential to affect the clinical and scientific interpretation of the associated rating systems; conversely, confidence in their meaning increases to the extent that such sources of error can be addressed. A feasible method of calibrating clinical movement assessments (or the raters who administer them) may help ensure data quality and insulate these instruments from reliability concerns associated with scale of application.
There is therefore value in assessing practical methods by which raters with varying levels of experience as movement professionals, and varying levels of exposure to specific movement assessment instruments, can achieve greater reliability in applying movement quality assessments. This may be particularly useful in high-volume settings, in which effects related to rater variation have a greater likelihood of obscuring meaningful trends.
The subject of FMS™ reliability among raters of varying experience has been partially addressed by previous work. While specific findings vary by study, authors appear to conclude more often than not that the instrument is reliable for the purposes investigated.12,13 Even so, valid concerns have been raised about the conclusiveness of the research,14 the analytical approaches involved,15 and the psychometric properties of the FMS™ as a rating instrument.16 Establishing the reliability of the FMS™ and similar movement quality assessment scales should be considered an ongoing effort. The body of literature addressing FMS™ interrater reliability has thus far given little attention to expediently calibrating or “synchronizing” item and composite scores across novice raters, which is a priority in high-volume applications or any time multiple raters are involved. The purpose of this research was therefore to examine the effect of a brief (two-hour) FMS™ training seminar, administered by a licensed physical therapist who is FMS™-certified, on measures of interrater reliability among individuals with no prior exposure to the instrument or its scoring criteria. Such a seminar could feasibly be administered prior to large-scale testing endeavors to reduce measurement noise. Data were analyzed at the level of the component scores and the composite score, in each case using models that account for the type of data and the number of raters. It was hypothesized that a brief, standardized training seminar would be sufficient to achieve good to strong interrater reliability for all FMS™ components.
METHODS
Experimental Approach to the Problem
Component (i.e., item) and composite FMS™ scores were acquired on two occasions from a group of five raters. The raters consisted of four novice second-year physical therapy students with no prior FMS™ training or experience, and one expert who was FMS™-certified, had three years’ experience using the FMS™, and had been a licensed physical therapist for 20 years. The novice raters participated in a two-hour training seminar provided by the expert rater eight days prior to the initiation of data collection. The training session began with viewing the FMS™ scoring video (Functional Movement Systems), which covers each of the seven screening tests and totals approximately 75 minutes. Additionally, the seven movement patterns, three clearing tests, examiner verbal instructions, and scoring criteria were explained in detail by the expert rater. Summary sheets for each FMS™ movement were provided to the raters, including written and visual descriptions of the scoring criteria, from zero to three, for each movement pattern. Novice raters then performed, practiced, and scored each of the seven movement patterns and three clearing tests.
A sample of 16 subjects was scored twice by each rater, with four days between sessions. On both occasions, a researcher read the scripted instructions and used the same materials as in the training session to have the subjects perform each test. The tests were scored in real-time by all raters simultaneously, and the scores were subsequently analyzed to establish reliability.
Subjects
A total of 16 subjects (12 females [23.33 ± 1.61 years, 164.68 ± 5.94 cm, 61.97 ± 9.33 kg] and 4 males [23.75 ± 1.71 years, 181.61 ± 10.47 cm, 88.22 ± 20.18 kg]) participated in this study. Participation was open to healthy adults without restrictions on physical activity. Prior to participation, subjects signed an informed consent form approved by the university Institutional Review Board.
Procedures
Participants reported to the testing site on Day 1 of testing and returned to repeat the test four days later (Day 2) at the same location. Upon arrival, participants were instructed in the performance of each movement pattern in the order specified by Cook et al.4,5 The standardized order of movement patterns and tests was as follows: 1) Deep Squat (DS), 2) Hurdle Step (HS), 3) Inline Lunge (ILL), 4) Shoulder Mobility (SM), 5) Shoulder Clearing Tests, 6) Active Straight Leg Raise (ASLR), 7) Trunk Stability Push Up (TSPU), 8) Spinal Extension Clearing Test, 9) Rotary Stability (RS; as performed prior to the 2020 revision), and 10) Spinal Flexion Clearing Test. Test order and verbal instructions were scripted according to the criteria for scores of 3 and 2, and each subject completed each test position regardless of the raters’ scores. All raters observed and scored the same subject at the same time. Raters were permitted to move about the testing room and to request that participants perform additional repetitions of any test, but were not permitted to discuss scores. These same procedures were repeated four days later, and participants were instructed not to practice the test behaviors between the first and second testing occasions. Prior to data collection, novice interrater reliability for the DS, HS, and ILL movement patterns was assessed by having the raters view and score video clips of these three movement patterns, and was found to be excellent. These three movement patterns were selected by the researchers because their grading criteria are more complex than those of the other movement patterns.
Each item was rated by all raters in real-time based on the originally published scoring criteria, as instructed during the training seminar. Raters were additionally instructed to record the lower of the two scores as the component score for any test in which a bilateral asymmetry was noted, and to assign a component score of 0 for any test in which pain was reported or in which an associated clearing test was positive (i.e., evoked pain).
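For illustration, this scoring rule can be expressed as a short function. The following is a minimal sketch in R (the environment used for the analyses below); the function and its arguments are hypothetical conveniences, not part of the FMS™ materials:

# Component score for a single bilateral FMS test: take the lower of the
# two side scores, and assign 0 if pain was reported or the associated
# clearing test was positive.
component_score <- function(left, right, pain = FALSE, clearing_positive = FALSE) {
  if (pain || clearing_positive) {
    return(0)
  }
  min(left, right)
}

# Example: left side scores 2, right side scores 3; the component score is 2.
component_score(left = 2, right = 3)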
Statistical Analyses
Interrater reliability was analyzed separately for each Day 1 component score and for the Day 1 composite score, the latter of which is simply the sum of the component scores. To account for the number of raters (n > 2) and the structure of the component data, Krippendorff’s α and Fleiss’ kappa were computed. Note that Krippendorff’s α was computed at the ordinal level of measurement, whereas Fleiss’ kappa treats the scores as nominal categories. To facilitate comparison with previously published data, intraclass correlation coefficients (ICC) were also computed for each component score, although it should be noted that ICC may not be appropriate for ordinal data. For the composite score, interrater reliability was assessed using ICC. All ICC coefficients were calculated using two-way models for agreement. Interrater reliability for Day 2 scores was calculated separately using the same methods described for Day 1. All statistical analyses were conducted using R version 3.6.1 (The R Foundation; Vienna, Austria) at an a priori significance level of 0.05. Coefficients were interpreted in accordance with published guidelines.17,18 Specifically, ICC was interpreted as poor (< 0.40), fair/good (0.40 – 0.75), or excellent (> 0.75). Krippendorff’s α was interpreted as unacceptable (< 0.65), tentatively acceptable (0.65 – 0.80), or acceptable (> 0.80). Finally, Fleiss’ kappa was interpreted as poor (< 0.00), slight (0.00 – 0.20), fair (0.21 – 0.40), moderate (0.41 – 0.60), substantial (0.61 – 0.80), or almost perfect (0.81 – 1.00).
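To illustrate the analysis, the following minimal sketch shows how the three coefficients could be computed in R for a single component score. The irr package and the simulated data are assumptions for illustration only; the specific packages used in the original analysis are not specified here.

library(irr)

# Hypothetical data for one FMS component: 16 subjects (rows) x 5 raters
# (columns), with scores on the 0-3 ordinal scale.
set.seed(1)
scores <- matrix(sample(0:3, 16 * 5, replace = TRUE), nrow = 16, ncol = 5)

# Krippendorff's alpha at the ordinal level of measurement
# (kripp.alpha expects raters in rows, hence the transpose).
kripp.alpha(t(scores), method = "ordinal")

# Fleiss' kappa, which treats the scores as nominal categories.
kappam.fleiss(scores)

# Two-way ICC for agreement (single rater), included for comparison with
# previously published data.
icc(scores, model = "twoway", type = "agreement", unit = "single")

# The composite score is simply the row-wise sum of the seven component
# scores; its interrater reliability would be assessed with the same
# two-way agreement ICC.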
RESULTS
Score counts for each combination of Rater × Day × Test Item are shown in Table 1. Interrater reliability results for Day 1 and Day 2 are summarized in Tables 2 and 3, respectively. The results vary considerably depending on the statistic used. Interpreting Krippendorff’s α, Day 1 interrater reliability was unacceptable for Hurdle Step, Inline Lunge, Active Straight Leg Raise, and Rotary Stability; tentatively acceptable for Deep Squat and Trunk Stability Push Up; and acceptable for Shoulder Mobility. Based on Fleiss’ kappa, Day 1 interrater reliability was poor for Hurdle Step and Rotary Stability (p > 0.05); fair for Inline Lunge and Trunk Stability Push Up; moderate for Active Straight Leg Raise; substantial for Deep Squat; and almost perfect for Shoulder Mobility. Day 1 ICCs indicated poor interrater reliability for Hurdle Step (p > 0.05), Rotary Stability (p > 0.05), and Inline Lunge; fair/good interrater reliability for Active Straight Leg Raise and Trunk Stability Push Up; and excellent reliability for Deep Squat and Shoulder Mobility.
Interpreting Krippendorff’s α for Day 2, interrater reliability was acceptable for Deep Squat, Hurdle Step, Shoulder Mobility, and Active Straight Leg Raise; tentatively acceptable for Trunk Stability Push Up; and unacceptable for Inline Lunge and Rotary Stability. Fleiss’ kappa indicated poor agreement for Rotary Stability (p > 0.05); fair agreement for Inline Lunge; moderate agreement for Trunk Stability Push Up; substantial agreement for Deep Squat and Active Straight Leg Raise; and almost perfect agreement for Shoulder Mobility. Day 2 ICCs indicated poor interrater reliability for Rotary Stability (p > 0.05); fair/good interrater reliability for Inline Lunge and Trunk Stability Push Up; and excellent interrater reliability for Deep Squat, Shoulder Mobility, and Active Straight Leg Raise. The Day 2 interrater ICC for Hurdle Step could not be calculated.
Finally, interrater ICC for the composite score was excellent on both days (Day 1 ICC = 0.79; Day 2 ICC = 0.84; Table 4).
DISCUSSION
The results of this study indicate that interrater FMS™ item score reliability was variable following a standardized two-hour training seminar among raters previously unfamiliar with the FMS™; specific FMS™ components are elaborated upon in the following paragraphs. Additionally, interrater reliability of the composite score was excellent. One caveat that bears mentioning before further discussion is the lack of variability within certain component ratings. Specifically, nearly all raters assigned a score of “2” to every participant, on both days, in the Hurdle Step and Rotary Stability tests. Depending on the statistical model, this can result in a finding that agreement between raters is either essentially perfect or cannot be calculated. In either case, the corresponding coefficients should be interpreted with caution.
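This degeneracy is straightforward to reproduce. In the following minimal R sketch (again assuming the irr package), all raters assign the same score to every subject, so the chance-corrected agreement statistics become undefined:

library(irr)

# All five raters assign a score of 2 to all 16 subjects, approximating
# what occurred for Hurdle Step and Rotary Stability.
constant <- matrix(2, nrow = 16, ncol = 5)

# Observed agreement is perfect, but so is chance-expected agreement;
# the kappa numerator and denominator are both zero, yielding NaN.
kappam.fleiss(constant)

# With no between-subject variance, the ICC is likewise undefined.
icc(constant, model = "twoway", type = "agreement", unit = "single")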
Results concerning the composite score are fairly consistent with previous findings.13 For example, Onate et al.19 observed an interrater ICC of 0.98 for the FMS™ composite score, and Smith et al.20 observed interrater ICCs of 0.87 and 0.89 on two separate days of testing. These authors concluded that the composite score can be rated reliably by judges of varying levels of experience. While this observation does strengthen the case for composite scoring of the FMS™, and perhaps movement quality screens in general, recent publications have highlighted serious limitations concerning this metric. Multiple factor analyses21,22 have identified a lack of unidimensionality and/or unacceptably low internal consistency. These observations call into question the psychometric validity of the composite score independently of whether a reliable score can be obtained.
In contrast, FMS™ item/component scores present a more granular perspective on movement quality and may be less vulnerable to criticism concerning their psychometric qualities. The present findings for Rotary Stability were again consistent with Onate et al.,19 who observed that a kappa statistic could not be calculated due to lack of variability. The remaining results show a pattern of interrater agreement broadly similar to that of Onate et al. for the item scores, albeit with lower coefficients in all cases except Shoulder Mobility. This may be due in part to the use of Fleiss’ kappa where Onate et al. used Cohen’s kappa. (The latter was not an option in this study design because of the number of raters involved.) Minick et al.23 also used a two-rater kappa and reported generally higher agreement than was found here, including considerably higher agreement for Hurdle Step and Rotary Stability. Shultz et al.18 evaluated interrater reliability of FMS™ item scores using Krippendorff’s α and found unacceptable agreement in all cases except Hurdle Step, for which agreement was in the “acceptable” range. This may be partially attributable to their study population (Division I varsity athletes), but it stands in contrast to the present findings.
The clinical interpretation of agreement depends on the choice of reliability statistic. This study endeavored to make the case that ICC should not be used to assess the reliability of ordinally scaled items such as the FMS™ component scores; kappa (Fleiss or Cohen) and Krippendorff’s α are better-suited models in those cases. In this dataset, Active Straight Leg Raise and Trunk Stability Push Up, along with the Deep Squat to a lesser extent, are perhaps the best examples of how ICC results may give the impression of an unrealistically high level of reliability. However, ambiguity of interpretation remains even when comparing results from kappa and α models. For instance, where Day 1 agreement for Active Straight Leg Raise and Inline Lunge is considered “unacceptable” by α standards, the kappa models would judge them as having moderate and fair agreement, respectively.
Based on the combined results of this study, the best candidates for inclusion in a high-volume screening effort following a brief, introductory training seminar would be Shoulder Mobility, Active Straight Leg Raise, Deep Squat, and Trunk Stability Push Up. With one exception, each of these FMS™ components achieved a level of reliability that could be considered at least “moderate” (kappa) or “tentatively acceptable” (α) on both days; Active Straight Leg Raise, the exception, missed the α cutoff for “tentatively acceptable” on Day 1 by a slim margin. These findings could be useful for those planning large-scale screens. Further, they may suggest refining the scoring criteria of the less reliable items or, at least, providing more focused training prior to their use.
Before concluding, one potentially telling observation deserves mention. The interrater reliability models feature five raters, one of whom was designated an “expert” and the rest “novices”. The rater designations are not accounted for in the models, but are specified in the Table 1 caption. In several cases, the cluster of novice raters appears to disagree systematically with the expert (e.g., DS, ILL). For example, the expert rater assigned a Deep Squat score of 1 to six subjects on both Day 1 and Day 2, whereas only two or three subjects were assigned a Deep Squat score of 1 by the novice raters. The expert rater also stood alone in assigning more 2’s and fewer 3’s on the Inline Lunge (both days) when compared with the novices, who agreed more closely with each other than with the expert. These systematic biases existed despite the novice reliability check on DS, HS, and ILL conducted during the training session. This may represent an opportunity to firm up reliability by modifying the training method, such as by using live subjects rather than video, and by devoting additional training time so that consensus with the criterion rater is achieved prior to data collection.
Limitations
There are several limitations to the current study. First, scoring by all raters was performed in real-time. While this better simulates the conditions under which the FMS™ would be administered, simultaneous assessment by five raters may have affected scores by requiring raters to view test subjects from different vantage points. This may be especially true for multidimensional tests such as the Inline Lunge, for which scores are likely to be more sensitive to viewing angle. The second limitation concerns the test subjects themselves, who comprised a small (n = 16) convenience sample of graduate students. Third, subjects may have performed differently from Day 1 to Day 2; however, the test subjects were blinded to their scores. Although raters may have recalled scores from Day 1, biasing their Day 2 scores, this is unlikely given the number of scripted movement patterns tested and the four-day interval between sessions. As such, these findings should be considered preliminary pending further work involving diverse samples and a greater number of observations.
CONCLUSIONS
A two-hour training seminar on the scoring and administration of the Functional Movement Screen™ produced acceptable interrater reliability among previously untrained raters for the Shoulder Mobility, Active Straight Leg Raise, Deep Squat, and Trunk Stability Push Up tests. Based on the results of the current study, the authors are not able to conclude that the remaining tests (Hurdle Step, Rotary Stability, and Inline Lunge) are comparably reliable after similar training. A brief training seminar could therefore be used prior to high-volume movement screens to improve the reliability of measurements involving multiple raters, particularly where rater experience is limited.
Conflicts of Interest
The authors report no conflicts of interest.