INTRODUCTION
Measuring and restoring strength to aid in rehabilitating a patient’s function is considered a key aspect of physical therapist practice.1–3 The demand for objective quantitative measurements has facilitated the increased utilization of dynamometry in clinical practice.4 Isokinetic dynamometry provides highly reproducible results but is limited in clinical practice.5 Portable or hand-held dynamometers (HHDs) are increasing in popularity due to their affordability and their operational and clinical efficiency. Hand-held dynamometry (HHD) is easy to use in different clinical settings with different populations.5,6 HHDs are commonly used with rater stabilization or with external fixation stabilization.
Psychometric properties of HHD measurements have been reported to be good to excellent.3,5,7 A recent systematic review with meta-analysis of hip muscle strength measures with portable dynamometers found moderate to high-quality evidence for sufficient intra-rater and inter-rater reliability for some positions regardless of fixation method.7 Florencio et al. reported that, for hip and knee measurements, studies show ICCs between 0.70 and 0.98 for rater-stabilized HHD measurements and between 0.49 and 0.99 for measurements with external fixation stabilization.2 Maximal isometric force values obtained by HHD are comparable to values obtained with isokinetic dynamometry, but inconsistent validity has been found with both rater-stabilized and external fixation methods.3,7
HHD is susceptible to many sources of error and lacks standardization. Sources of error may be grouped into subject attributes, testing procedures, instrument characteristics, and rater attributes. Studies have found HHD measurements to be reliable in various patient populations.8 For example, Vaz et al. found excellent (≥0.90) intra-rater and inter-rater intraclass correlation coefficients (ICC) for hip measurements in individuals with symptomatic hip osteoarthritis.9 Koblauer et al. found similarly excellent intra-rater and inter-rater ICC for knee extensor strength in patients awaiting total knee replacement.8 However, they did find a high error rate, expressed as the smallest detectable difference (SDD), ranging from 19.0% to 57.5%, and concluded that the use of HHD is not advised for clinical practice. The lack of standardized testing protocols and testing positions for different muscle groups is another source of error.3,5,7 Instrument characteristics might be another source. Du et al. examined the variability between different HHD instruments for measuring muscle strength and found differences between 0.2% and 16% between dynamometers.10 Errors associated with the rater include experience with HHD and rater strength.3,5,11,12 The rater’s inability to maintain a stable base against higher torque outputs has led to the recommendation to use external fixation methods.7
Operational considerations might make it infeasible to use external fixation methods in some settings, such as inpatient acute care. The lack of consensus on using external fixation stabilization versus rater-stabilized methods warrants further investigation. In addition, there is no consensus on the most appropriate rater testing positions to enable proper stabilization against an external force.13 A wide variety of methods have been described in research and clinical practice.7
The primary aim of this study was to assess the inter-rater and intra-rater reliability of rater-stabilized HHD measurements from a mechanically produced force in three different standardized rater test positions. Reliability is defined as the degree of consistency with which a rater measures a variable.14 A second aim of the study was to assess the agreement between external fixation stabilization and rater-stabilized measurements.
METHODS
Design
This is an intra-rater and inter-rater reliability study approved by the Manchester University Institutional Review Board.
Instrumentation and Rater Test Positions
A novel device was designed to provide a mechanical force. A mannequin leg was securely attached to the base of a height-adjustable plinth. Through a rope, the mannequin leg was attached to a pulley with a 60 kg weight stack (Speed Pulley 702600, STEENS, Norway). The external lever arm, the distance from the point of rotation to the location of the instrument placement, was constant at 43.5 cm. The set-up was configured to allow for three standardized rater test positions that are commonly used in clinical practice to measure hip flexion, hip abduction, and knee extension. The positions were selected to allow for optimal rater stabilization. The selected positions are described by Aerts and Alwood15 and in Figure 1.
A newly obtained Lafayette Hand-Held Dynamometer (Lafayette Instrument Company, Lafayette, IN, USA, Model 01165A) was used to obtain and record the measurements. The hand-held dynamometer was not altered in any way. The device was accompanied by a certificate of calibration with absolute errors ranging from -3.1 newtons (N) to +0.6 N, and relative errors ranging from -0.25% to +0.65% between the HHD measurement and actual value. An external fixation stabilization device (Hand-Held Dynamometer Support Stand Model 01166, Lafayette Instrument Company, Lafayette, IN, USA) was used to stabilize the HHD measurements obtained by external fixation (Figure 2).
Raters
A convenience sample of raters was recruited by the Manchester University faculty from the local geographical area (Fort Wayne, Indiana, USA). After signing a consent form to participate, raters were asked to complete a survey inquiring about their practice experience and familiarity with using HHD. Age, sex, anthropometric information, and hand grip strength were collected. Each rater was assigned a random number from 1 through 10. The raters underwent a thirty-minute training session and performed several practice measurements until they reported being comfortable with the measurement techniques and rater test positions.
Testing Procedures
The mechanical set-up was calibrated to produce predetermined external forces. The predetermined external forces were based on forces that a rater may encounter during clinical practice when measuring hip abduction, hip flexion, and knee extension using the selected test positions. The force values were based on a data set obtained in clinical practice (n=800) by the main investigator (unpublished clinical data). The force values were categorized into three groups: 1. low (-1SD), 2. medium (mean), and 3. high (+1SD). The testing was performed over three days. Before each testing session, the investigators performed three measurements against each external force magnitude, by using the external fixation method to ensure that the mechanical setup produced the expected predetermined forces. The force values are presented in Table 2.
Each rater was then asked to complete nine measurements in each of the three different test positions, i.e., hip abduction, hip flexion, and knee extension. Three different force magnitudes (i.e., low, medium, high) were randomized across the nine measurements so that each rater performed three measurements against each force magnitude in random order. To mimic the clinical situation where therapists do not know exactly how much force the patient or client will produce, the raters were blinded to the preset external forces. The investigators recording the rater-stabilized HHD measurements were also blinded to the preset external forces. The peak force reading (N) of each measurement was recorded. The raters used a “make” technique, in which they matched the external mechanical force for a duration of five seconds. With a “make” technique, the rater holds the hand-held dynamometer stationary, matching the external torque. This mimics the measurements taken when the patient / client produces force through an isometric muscle contraction. In contrast, with a “break” technique the rater must create enough force to break the isometric muscle contraction. This mimics measurements taken when the patient / client produces force through an eccentric muscle contraction. Both techniques have been used in clinical practice and research. A “make” technique might be favored during early rehabilitation, as a “break” technique may increase the patient’s / client’s risk for injury.6 Additionally, a “make” technique may have better reliability and provide more accurate measurements.3,8
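The randomization described above can be illustrated with a short sketch. It is not the investigators' actual procedure, which the text does not specify; the function name `randomized_trial_order` and the optional `seed` parameter are assumptions introduced for illustration only.

```python
import random


def randomized_trial_order(magnitudes=("low", "medium", "high"), repeats=3, seed=None):
    """Return a shuffled trial list in which each force magnitude appears `repeats` times.

    Hypothetical helper: the source only states that three magnitudes were
    randomized across nine measurements per test position.
    """
    rng = random.Random(seed)  # a seed makes the order reproducible for auditing
    order = [m for m in magnitudes for _ in range(repeats)]
    rng.shuffle(order)
    return order


# One independent random order per test position, for a single rater.
for position in ("hip abduction", "hip flexion", "knee extension"):
    print(position, randomized_trial_order())
```

Keeping the generated order with the person who sets the weight stack, and away from both the rater and the recording investigator, would preserve the blinding described in the text.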
Statistical Analysis
Statistical analysis was performed using the Statistical Package for the Social Sciences version 28 (SPSS, IBM Corp., Armonk, NY), and the significance level was set at α = 0.05.
Raters’ demographics were analyzed and reported using descriptive statistics. Quantitative variables are expressed as mean (N) and standard deviation (SD). A sample size calculation for reliability data was conducted with the calculator presented by Bonett. With 10 raters, an estimated ICC of 0.80, a confidence interval amplitude of 0.3 (that is, 0.5 < ICC < 1.0), and α = 0.05, the calculation resulted in 11 different tests (subjects).2,16 The intraclass correlation coefficient (ICC) is the most common statistic used to assess intra-rater and inter-rater reliability.8 Intra-rater reliability for each rater was assessed by using the three measurements of each force magnitude obtained in each test position. The two-way mixed intraclass correlation coefficient model 3 (ICC3,k)14 and the 95% confidence intervals were calculated. Inter-rater reliability was assessed using the average of the three trials for each measurement. The random effects absolute agreement intraclass correlation coefficient model 2 (ICC2,k) and the 95% confidence intervals were calculated.14 The following guidelines were used to interpret the ICC: below 0.75 as ‘poor to moderate’, above 0.75 as ‘good’, and above 0.90 as ‘excellent’, which ensures reasonable reliability.14
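As an illustration of the two ICC forms named above, the following sketch computes ICC(2,k) and ICC(3,k) from the two-way ANOVA mean squares, using the widely cited Shrout and Fleiss expressions. It is not the SPSS routine used in the study and it omits the confidence intervals. The function name `icc_two_way` and the symbols BMS (between-target mean square), JMS (between-rater mean square), and EMS (residual mean square) are assumptions introduced for this example.

```python
def icc_two_way(scores):
    """Return (ICC(2,k), ICC(3,k)) for a complete targets-by-raters table.

    scores: list of rows, one row per target (test), one column per rater
    (or per trial, for intra-rater use). Illustrative sketch only.
    """
    n = len(scores)      # number of targets
    k = len(scores[0])   # number of raters or trials
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))

    # ICC(2,k): two-way random effects, absolute agreement, average of k raters.
    icc_2k = (bms - ems) / (bms + (jms - ems) / n)
    # ICC(3,k): two-way mixed effects, consistency, average of k raters.
    icc_3k = (bms - ems) / bms
    return icc_2k, icc_3k
```

For the inter-rater analysis, each row would hold one test (test position by force magnitude) and each column one rater's three-trial average. For the intra-rater analysis, the columns would instead hold one rater's three trials.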
The clinical utility of the ICC is minimal as it provides a measure of reliability within a study group and not individual measures. Therefore, the Standard Error of Measurement (SEM) and Minimal Detectable Change (MDC) are used as additional measures of reliability.
The SEM and MDC were calculated as an additional assessment of inter-rater reliability. Lower SEM and MDC values are indicative of lower measurement error and better reliability.8 SEM was calculated using the following formula:

$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - \mathrm{ICC}}$$

MDC was calculated as follows:

$$\mathrm{MDC} = \mathrm{SEM} \times 1.96 \times \sqrt{2}$$

All SEM and MDC values are presented in absolute values and as a percentage of the mean maximal strength. For each rater-stabilized test, the mean and SD were calculated using the average of the three measurements from each rater. For the SEM and MDC, the inter-rater ICC for each test position was used.
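A minimal sketch of these calculations follows. It assumes the commonly used definitions SEM = SD × √(1 − ICC) and MDC at the 95% level = SEM × 1.96 × √2. The function names and the example SD of 30 N and mean of 350 N are hypothetical values chosen for illustration, not study data.

```python
import math


def sem(sd, icc):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem_value):
    """Minimal detectable change at the 95% level, assuming MDC = SEM * 1.96 * sqrt(2)."""
    return sem_value * 1.96 * math.sqrt(2.0)


def percent_of_mean(value, mean_force):
    """Express an absolute SEM or MDC (N) as a percentage of the mean force (N)."""
    return 100.0 * value / mean_force


# Hypothetical inputs: between-rater SD of 30 N, inter-rater ICC of 0.99, mean of 350 N.
example_sem = sem(30.0, 0.99)
example_mdc = mdc95(example_sem)
print(round(example_sem, 1), round(example_mdc, 1))
print(round(percent_of_mean(example_sem, 350.0), 2), round(percent_of_mean(example_mdc, 350.0), 2))
```

Under these definitions the MDC is about 2.77 times the SEM, which is consistent with the reported ranges (SEM 0.5 to 3.0 N, MDC 1.4 to 8.3 N).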
Agreement refers to the ability of a measurement tool to produce the exact same values.17 Agreement was assessed by calculating the error rate between external fixation stabilization and rater-stabilized measurements. Both external fixation stabilization and rater-stabilized HHD measurements were obtained from a mechanical force, thereby controlling for patient / client variability. By controlling errors associated with subjects, testing procedures, and instrument variability, the researchers aimed to obtain an estimate of error associated solely with the rater.
The agreement was estimated by calculating the error rate as follows18:
$$\text{Error Rate (\%)} = \frac{\text{non-fixed dynamometer reading} - \text{fixed dynamometer reading}}{\text{fixed dynamometer reading}} \times 100$$
The error rates of the average measurements of each rater’s tests were used to estimate overall accuracy. A one-way ANOVA was conducted to determine whether the error rate (%) between the rater-stabilized and external fixation stabilization measurements differed by rater, test position, or force magnitude.
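The error rate calculation can be sketched as follows, taking the rater-stabilized (non-fixed) reading relative to the external fixation (fixed) reading. The function names and the force values in the example are hypothetical and are not study data.

```python
def error_rate_percent(rater_reading, fixed_reading):
    """Error rate (%) of a rater-stabilized reading relative to the fixed reading."""
    if fixed_reading == 0:
        raise ValueError("fixed dynamometer reading must be non-zero")
    return (rater_reading - fixed_reading) / fixed_reading * 100.0


def mean_trial_error_rate(rater_trials, fixed_reading):
    """Average a rater's repeated trials first, then compute the error rate."""
    average = sum(rater_trials) / len(rater_trials)
    return error_rate_percent(average, fixed_reading)


# Hypothetical example: three rater-stabilized trials against a 100 N fixed reading.
print(mean_trial_error_rate([108.0, 110.0, 112.0], 100.0))
```

A positive value indicates that the rater-stabilized reading exceeded the external fixation reading, and a negative value indicates the opposite.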
RESULTS
Raters
Ten participants consented to be a part of the study. All raters were licensed physical therapists and worked in various clinical settings. Raters’ characteristics are presented in Table 1. Six raters reported using HHD with 50 to 90% of their patients / clients, two raters with 10 to 50% of their patients / clients, and two raters stated they never used HHD. The overall strength of the raters was assessed through grip strength. The mean grip strength for all raters was 456 ± 140 N. All 10 raters performed all measurements in Test Position 1 - Hip Abduction and Test Position 2 - Hip Flexion. Two raters did not complete measurements in Test Position 3 - Knee Extension against the highest external force magnitude. The measurement attempts were halted as the raters were unsure if they could maintain a stable base. The two raters who were unable to maintain a stable base were both female, never or rarely used HHD, and had the lowest mean dominant grip strength measurements (275 N and 314 N).
Reliability
Intra-rater reliability for each rater was assessed using the three trials for each measurement. Two raters did not perform measurements in Test Position 3 - Knee Extension against the highest resistance. The intra-rater reliability (ICC3,k) for the different raters across the measurements ranged from 0.97 to 1.00. Inter-rater reliability was assessed using the means of the three trials from each rater for each measurement. After examining box plots, one extreme outlier (more than three box lengths away from the edge of the box) was identified (Rater 3, Test Position 2 - Hip Flexion, low force magnitude) but was not removed before further analysis. The inter-rater reliability (ICC2,k) between the different raters in the three test positions was 0.99 (95% CI 0.93 to 1.00). The absolute SEM ranged from 0.5 to 3.0 N and the relative SEM from 0.2% to 0.9%. The absolute MDC ranged from 1.4 to 8.3 N and the relative MDC from 0.7% to 2.8%. Table 2 provides a summary of the measurements by test position and force magnitude (nine tests) for the external fixation and rater-stabilized measurements.
Agreement
After examining box plots, two outliers were identified. The authors did not remove these data from the analysis as they did not impact the analysis outcome. The means obtained by rater-stabilized measurements were higher than the means obtained by external fixation measurements in all nine tests. The error rate between the external fixation and rater-stabilized measurements ranged from 6.9% in Test Position 3 - Knee Extension, low force magnitude, to 31.2% in Test Position 3 - Knee Extension, high force magnitude. Data were normally distributed for all groups, as assessed by the Shapiro-Wilk test (p > 0.05), except for one group, i.e., “Test Position 2 - Hip Flexion”. After reviewing the normal Q-Q plot and considering the robustness of one-way ANOVA against Type I error with equal group sizes, the analysis proceeded.
For “Rater”, variances were homogeneous, as assessed by Levene’s test for equality of variances (p = 0.443). There was no statistically significant difference in error rate (%) between the different raters (F(9, 77) = 1.358, p = 0.222). Error rates by rater are presented in Table 3.
For “Test Position”, the assumption of homogeneity of variances was violated, as assessed by Levene’s test for equality of variances (p < 0.001). There was no statistically significant difference in error rate (%) between the different test positions (Welch’s F(2, 47.691) = 1.583, p = 0.216). Error rates by test position are presented in Table 4.
For “Force Magnitude”, the assumption of homogeneity of variances was violated, as assessed by Levene’s test for equality of variances (p < 0.001). There was a statistically significant difference in error rate (%) between the different force magnitudes (Welch’s F(2, 50.798) = 42.938, p < 0.001). Error rates by force magnitude are presented in Table 5.
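For readers unfamiliar with the Welch statistic reported where variances were unequal, the following sketch shows how Welch's F and its degrees of freedom are commonly computed from group means, sample variances, and group sizes. It is an illustration of the standard Welch formula, not the SPSS output used in the study. The function name `welch_anova` and the example error rates are hypothetical, and the p-value is omitted because it would additionally require the F distribution.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    groups: list of lists of observations (e.g., error rates in %), one list
    per group. Each group needs at least two values and non-zero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    variances = [variance(g) for g in groups]  # sample variance (n - 1 denominator)

    # Precision weights: groups with small variance and large n count more.
    weights = [n / v for n, v in zip(sizes, variances)]
    weight_sum = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / weight_sum

    between = sum(w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / weight_sum) ** 2 / (n - 1) for w, n in zip(weights, sizes))

    f_stat = between / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Hypothetical error rates (%) for three force magnitudes.
low = [22.0, 30.5, 18.2, 27.9, 35.1]
medium = [12.4, 15.0, 10.8, 14.1, 13.3]
high = [8.0, 9.1, 7.5, 8.8, 9.4]
print(welch_anova([low, medium, high]))
```

With three groups the numerator degrees of freedom equal 2, and the denominator degrees of freedom are generally non-integer, as in the reported values.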
DISCUSSION
This study investigated the variability associated with the rater by assessing the reliability and agreement of HHD measurements. External fixation was compared to rater-stabilized measurements using the same instrument, taken in three standardized rater test positions against three different force magnitudes. The external force was created by a mechanical device eliminating the variability associated with patients or clients.
Both intra-rater and inter-rater reliability were excellent in this study, with intraclass correlation coefficients of 0.97 and above. These results are consistent with previous studies reporting reliability for HHD hip measurements.5,7,19 The inter-rater relative SEM values obtained in this study ranged from 0.2% to 1.0% and relative MDC values ranged from 0.7% to 2.8%. These values are lower than those reported by Morin et al., who reported relative inter-rater SEM values between 1.1% and 3.0% and relative MDC values between 3.1% and 8.3% for hip and knee measurements obtained by using a semi-fixed and pull dynamometer.5 The values found in this study for relative SEM were also lower than the values reported by other studies.2,20 Florencio et al. reported intra-session relative SEM values between 6% and 15% for HHD hip and knee measurements, using examiner-stabilized and belt-stabilized measurements with a similar instrument.2
Error rates were calculated to assess agreement. The error rates between the external fixation stabilization and rater-stabilized measurements ranged from 6.9% to 31.2%, comparable to the values of –4.9% to 27.1% calculated from the data provided by Florencio et al.2
Regardless of the force magnitude and test position, the raters consistently measured higher values than the external fixation stabilization measurement. This finding is consistent with Florencio et al., who reported that measurements obtained with examiner stabilization were generally greater than those observed for belt stabilization in 13 of 16 measurement positions.2 This systematic error might be related to the raters trying to adjust to the direction and magnitude of the force. Raters were instructed to ensure that the pressure pad stayed perpendicular to the movement arm (artificial limb), which was not assured by the mechanical stabilization. This is in line with Florencio et al., who suggested that better control of the dynamometer position is guaranteed when the stabilization is provided by the rater.2 Another explanation may be the increased impulse force created when raters try to adjust to the force magnitude. The raters were instructed to maintain the starting position and not to push back. The external force was released in a controlled manner over two seconds. By adjusting to the unknown force, the rater may create a higher impulse force, increasing the force output registered by the hand-held dynamometer. In addition, the results of this study suggest that this systematic relative error is more pronounced with lower force magnitudes.
It has been recognized that assessor strength impacts the ability to provide proper stabilization when confronted with higher forces exceeding the assessor’s strength.6 In this study, two raters were unable to provide a stable base when confronted with a higher force magnitude in Test Position 3 - Knee Extension.
The major strength of this study is the exclusion of variability related to patients or clients. The investigators could identify only one other study investigating HHD accuracy using a mechanically produced force, in that case from a spring-loaded device.18 Overall accuracy in that study was estimated at 3%. The main difference was that its raters used a “break” technique, whereas a “make” technique was used in this study. Another strength of the study was the use of the same instrument for both stabilization methods, eliminating the error associated with the use of different instruments. A third strength of this study was the randomization of the forces applied by the mechanical device and the blinding of recording investigators and raters to those forces. This mimics the clinical situation where raters do not know how much resistance to provide to counteract the forces produced by the patient / client.
This study has several limitations. The sample size (n = 9 tests) was smaller than the calculated sample size required (n = 11), which may have resulted in an underpowered study. The mechanical setup did not accommodate an additional test position. In retrospect, an additional force magnitude could have been included. The limited sample size precluded the investigation of interaction effects between rater, test position, and force magnitude. Further research should explore these possible interaction effects. A second limitation was the minimal training provided to the participating raters, approximately 30 minutes of instruction and practice. In the study of Morin et al., trainers received three days of training followed by 20 hours of practice.5 This limited training may have had an impact on the raters’ ability to provide proper stabilization. A third limitation might be related to the external fixation stabilization device. The relative SD and relative SEM for our external fixation stabilization device measurements ranged from 2.7% to 16.8% and from 0.1% to 0.9%, respectively. Florencio et al. reported data for knee and hip strength measurements from healthy young individuals using a similar Lafayette HHD instrument.2 Based on the data from Florencio et al., relative SD values ranging from 31.6% to 37.3% were calculated.2 Florencio et al. reported relative SEM from 7% to 15%.2 The relative SEM includes the error associated with the instrument itself (-0.25% to +0.65%).2 Both studies used a fixed stabilization technique, but the mechanically produced external torque in our study resulted in lower relative SD and relative SEM compared to the subject-produced torque in the study from Florencio et al.2
CONCLUSION
This study provides evidence that rater-stabilized HHD measurements have excellent intra-rater and inter-rater reliability, with low SEM and MDC. The maximal external force should not exceed the rater’s capability, and raters should use standardized test positions. In clinical practice, providers should be aware that HHD may have a degree of inaccuracy. The values obtained with rater-stabilized HHD were consistently higher than the values obtained by external fixation HHD. Further research should investigate whether this overestimation is influenced by the force magnitude.
Acknowledgments
The authors would like to thank all raters for their contributions. We would like to thank Adam Fischer for constructing the mechanical setup and Ed Ball for soliciting the participating raters.
Funding
This work was supported in part by the Manchester University Health Sciences and Pharmacy Programs Internal Research Grant. The funding source had no involvement in study design, data collection, analysis, interpretation, manuscript writing, or decision to submit the article for publication.
Conflict of Interest
The authors report no conflicts of interest.