Direct behavior rating (DBR) represents a feasible method for monitoring student behavior in the classroom; however, limited work to date has focused on the use of multi-item scales. The purposes of the study were to examine the (a) dependability of data obtained from a multi-item DBR designed to assess peer conflict and (b) treatment sensitivity of Direct Behavior Rating Multi-Item Scales (DBR-MIS) constructed using factor-derived and individualized methods. Analyses were performed using teacher ratings of 65 students (53 boys, 12 girls) between 6 and 12 years old. Results of decision studies indicated that an acceptable criterion of dependability (ϕ > .70) for low-stakes, intraindividual decision making could be achieved using a three-item scale across eight occasions, a four- or five-item scale across four occasions, or a six-item scale across three occasions. Subsequent analyses verified that a six-item DBR demonstrated acceptable treatment sensitivity when ratings were conducted on 3 days during baseline and 3 days during treatment with methylphenidate. Implications for practice and future research are discussed.
Over the past decade, schools have increasingly implemented multitiered systems of support (MTSS), such as school-wide positive behavior supports, to increase prosocial behavior and reduce interpersonal aggression and disruptive behavior (Spaulding, Horner, May, & Vincent, 2008). A central tenet of MTSS is data-based evaluation of student response to the provided supports, especially for students receiving targeted or intensive intervention (Kratochwill, Albers, & Shernoff, 2004; Leff, Power, Manz, Costigan, & Nabors, 2001). As with all assessment tools, it is important that progress-monitoring measures demonstrate adequate levels of reliability and validity for consumers to trust in the data the measures generate. However, there are two requirements that are unique to a progress-monitoring context. First, users must be able to administer progress-monitoring measures repeatedly to produce streams of data. Given that progress monitoring typically occurs on a weekly or biweekly basis, this means that measures must be fairly brief and easy to complete to ensure feasibility in applied settings. Second, it is critical that progress-monitoring measures demonstrate sufficient sensitivity to change. Treatment sensitivity is the degree to which an assessment tool detects the magnitude of changes in behavior in the expected direction following the application of a treatment (Chafouleas, Sanetti, Kilgus, & Maggin, 2012). The ability to detect small changes in the behavior of interest is clearly of great importance when evaluating student response to intervention (Fuchs, Fuchs, Hamlett, Walz, & Germann, 1993).
Over the past decade, direct behavior rating (DBR) has been advocated as an efficient, effective, and acceptable method for monitoring student behavior in response to intervention. As outlined by Christ, Riley-Tillman, and Chafouleas (2009), there are three defining characteristics of DBR. First, DBR is direct, in that ratings are produced in close temporal proximity to the behavior of interest. Second, DBR is used to assess observable behaviors, which are predetermined and require minimal inference to assess. Third, DBR involves rating of those behaviors by someone who has directly observed the student, rather than recording a precise count or temporal measurement (e.g., duration). As a result, the numbers generated through DBR are inevitably influenced to some degree by the rater’s own perceptions. Beyond these core requirements, however, “DBR is not defined by a single scale, form, or number of rating items” (Chafouleas, Riley-Tillman, & Christ, 2009, p. 196). Rather, as an assessment method, DBR is broadly defined and expected to encompass multiple options with regard to both target behaviors and scaling approaches.
To date, the majority of research on DBR has focused on the use of either single-item (DBR-SIS) or multi-item scales (DBR-MIS). Whereas DBR-SIS involves rating a single global target as a measure of a construct (e.g., disruptive behavior), DBR-MIS involves rating several items or discrete behaviors (e.g., raises hand, stays on task) to generate a composite score as a measure of the particular construct (e.g., academic engagement; Chafouleas, 2011). DBR-SIS data are summarized at the level of the individual item, whereas DBR-MIS data can either be interpreted at the level of the individual item (i.e., each discrete behavior) or summarized at the level of the overall scale (Volpe & Briesch, 2015). Evidence in support of both methods within the context of progress monitoring has accumulated in recent years, with research suggesting that a consistent estimate of student behavior may be obtained in less than 2 weeks when conducting daily DBR-SIS (e.g., Chafouleas et al., 2010; Chafouleas, Christ, Riley-Tillman, Briesch, & Chanese, 2007) or MIS (e.g., Volpe & Briesch, 2012, 2016) ratings. Some initial evidence of treatment sensitivity has also been demonstrated for both DBR-SIS (e.g., Chafouleas et al., 2012; Riley-Tillman, Methe, & Weegar, 2009) and MIS (e.g., Volpe & Gadow, 2010; Volpe, Gadow, Blom-Hoffman, & Feinberg, 2009) approaches. Although the psychometric research conducted to date suggests great promise for use of DBR within a progress-monitoring context, there exists an apparent need for additional research studies to guide recommendations for applied use.
First, much of what we know about DBR is based on a limited number of behavioral targets. In large part, both DBR-SIS and MIS studies have focused on two indicators of behavioral functioning: academic engagement and disruptive behavior. Results of two studies conducted by Volpe and Briesch (2012, 2016), for example, suggested that a consistent estimate of academic engagement could be obtained using a five-item DBR-MIS after four occasions, whereas disruptive behavior required between eight (e.g., Volpe & Briesch, 2016) and 12 assessment occasions (Volpe & Briesch, 2012). Although academic engagement and disruptive behavior are considered to be important contributors to students’ academic success in the classroom (Briesch, Chafouleas, & Riley-Tillman, 2016), there are certainly many other indicators of student performance that may be of interest when implementing both secondary and tertiary behavioral supports. It is therefore important that researchers begin to investigate a broader range of behavioral targets to provide assessment tools that align with the full range of social, emotional, and behavioral concerns likely to be encountered in school settings.
Second, much of the DBR research to date has relied on analog—rather than in vivo—rating of student behavior. Most typically, graduate students have been employed as raters in place of teachers and asked to conduct ratings based on videotaped footage (e.g., Volpe & Briesch, 2012, 2015). Although such an approach has been deemed necessary to allow for repeated viewings of the same sample of behavior, there are obvious limitations to the use of analog conditions. Most importantly, the dependability, or consistency, of behavioral estimates may be higher when there are not as many competing demands on the rater’s attention. In a study by Chafouleas et al. (2010), for example, far fewer rating occasions were needed when DBR-SIS was completed by graduate student observers than by the classroom teachers. This may have been attributable to the fact that the teachers were forced to divide their attention across the delivery of instruction and management of student behavior, whereas the external raters were able to focus exclusively on the behavior of individual students. The degree to which those results obtained under analog conditions may generalize to typical classroom-based settings is therefore unclear.
Third, much more is known to date about how many DBRs are needed to obtain a dependable estimate of behavior than is known about the treatment sensitivity of these measures. With regard to DBR-SIS, primary evidence of sensitivity to change comes from two studies examining the effectiveness of both individual and class-wide behavioral interventions. Chafouleas et al. (2012) found that DBR-SIS detected reductions in disruptive behavior, and increases in academic engagement and compliance following implementation of a daily behavior report card intervention. Similarly, Riley-Tillman et al. (2009) found that DBR-SIS measured changes in academic engagement at the class-wide level in response to teacher modeling and verbal prompting. In regard to the use of multi-item scales, initial evidence exists to suggest that brief, individualized scales may be used to evaluate response to behavioral and pharmacological interventions. Pelham et al. (Pelham, Gnagy, et al., 2001; Pelham, Hoza, et al., 2002), for example, found that daily report cards containing target behaviors that were unique to each child (e.g., starts work with fewer than two reminders, asks for help when frustrated) were able to detect improvements in classroom behavior in response to methylphenidate. More recently, Volpe and Gadow (2010) created three-item scales assessing peer conflict, which demonstrated adequate sensitivity to methylphenidate treatment, comparable with the sensitivity of the longer, full-length scale.
Taken together, the research to date provides initial psychometric support for the use of DBR within a progress-monitoring context; however, additional research is clearly needed to understand the psychometric adequacy (i.e., dependability, sensitivity to change) of teacher-completed DBR scales targeting a broader range of intervention targets. The purpose of the current study was therefore twofold. The first goal of the study was to examine the dependability of data obtained from a multi-item DBR designed to assess peer conflict. Although there are many different areas of student functioning school-based practitioners may wish to target for intervention, interpersonal aggression is one of particular importance. Studies examining patterns of school-based referral concerns have consistently identified difficulties with both peer relationships and interpersonal aggression as among the most common reasons for referral to a school psychologist or problem-solving team (e.g., Bramlett, Murphy, Johnson, Wallingsford, & Hall, 2002; Briesch, Ferguson, Volpe, & Briesch, 2013). Furthermore, the outcomes for students engaging in conflict with peers are troublesome. Research has shown that those students who exhibit high rates of both verbal and physical aggression at the beginning of the year are more likely to experience rejection from peers within the school context (Crick, 1996). In addition, numerous studies have demonstrated a link between social problems (e.g., interpersonal aggression) exhibited in childhood and negative outcomes in adolescence and adulthood, including academic failure, continued social problems, and delinquency (Bradley, Doolittle, & Bartolotta, 2008; Cullinan & Sabornie, 2004; Underwood, Beron, & Rosen, 2011). 
Given both the prevalence of, and negative outcomes associated with, interpersonal aggression, the need for both school-based intervention efforts to improve student functioning and defensible tools to monitor response to such intervention efforts appears warranted.
The second goal of the study was to examine the treatment sensitivity of abbreviated DBR-MIS scales constructed using two different methods: factor derived and individualized. Factor-derived DBR-MIS are constructed by including items with the highest factor loadings in descending order until the desired number of items is reached. The factor-derived method has the benefit of reducing the number of items a teacher needs to rate, while retaining items that are purportedly most representative of the construct (i.e., peer conflict). Individualized DBR-MIS are constructed by selecting the items with the highest ratings in the pretreatment condition. These scales are individualized because the items included in the scale are different for each student. Results of two studies have shown that it is possible to create shorter, more efficient versions of extant rating scales that demonstrate psychometric characteristics similar to those of the original, full-length scales (Volpe & Gadow, 2010; Volpe et al., 2009). However, in both studies, the individualized scales demonstrated greater sensitivity to the effects of stimulant medication.
Our ultimate goal was to improve the efficiency of an existing rating scale while retaining desirable technical characteristics. Based on the results of previous studies, we hypothesized that an acceptable level of dependability (ϕ > .70) could be reached with a three- to five-item scale across 10 occasions (i.e., 2 weeks) or less. Furthermore, we hypothesized that the brief multi-item scales would demonstrate adequate treatment sensitivity. That is, we hypothesized that differences in scores between the placebo condition and treatment with methylphenidate would be statistically significant (p < .05). Finally, we hypothesized that the brief multi-item scales would detect the same magnitude of change as the original, full-length scale. To be more specific, we hypothesized effect sizes, or changes in teacher ratings from placebo to treatment with methylphenidate, for the brief multi-item scales would be equal to the longer, full-length rating scale.
Participants and Setting
Participants were originally recruited for a randomized, placebo-controlled, crossover study to evaluate the effectiveness of immediate-release methylphenidate for treating symptoms associated with oppositional defiant disorder (ODD), attention-deficit/hyperactivity disorder (ADHD), tic disorder, or Tourette’s disorder. The sample of 65 children (53 boys, 12 girls) in the present study was recruited through clinics, schools, media advertisements, and parent support groups for participation in the larger study. Children ranged in age between 6 and 12 years (M = 8.9; SD = 1.8). The majority of children (63.1%) received at least some special education services, and 29.2% of the children in the sample were enrolled in full-time special education programs. White children comprised 90.8% of the sample, and African American and Latino children each comprised 4.6% of the sample.
Each child met Diagnostic and Statistical Manual of Mental Disorders (3rd ed., rev.; DSM-III-R; American Psychiatric Association [APA], 1987) or Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV; APA, 1994) criteria for ADHD, and impairment was evident across both home and school settings. Children also met criteria for either chronic motor tic disorder or Tourette’s disorder. Approximately half of the children in the sample met criteria for ODD, and one third met criteria for an anxiety disorder as indicated via a structured diagnostic interview with the parent. Children were excluded from the study if they exhibited one or more of the following: (a) tics as the primary clinical concern, (b) danger to self or others, (c) psychosis or intellectual impairment (IQ < 70), or (d) seizures, major organic brain dysfunction, major medical illness, or pervasive developmental disorder.
Measures
Peer Conflict Scale (PCS)
The PCS (Gadow, 1986) contains 10 items intended to assess aggression toward other children. Items assess physical (e.g., engages in physical fights with other children) and nonphysical (e.g., curses or teases other children to provoke conflict) aggression, and overlap with symptoms of conduct disorder (Gadow & Sprafkin, 2008; Gadow, Sprafkin, & Nolan, 1996). Items are rated on a 4-point scale: 0 = never, 1 = sometimes, 2 = often, 3 = very often.
Coefficient alpha reported by Gadow and Nolan (2002) reached .95 for teacher ratings. Evidence of test–retest reliability was provided by Pearson correlation coefficients ranging from .62 to .65 for teacher ratings across days in the same week (Nolan & Gadow, 1994) and .47 for teacher ratings across 8 months (Gadow & Sprafkin, 1997; Gadow, Sprafkin, & Nolan, 2001). Evidence of convergent validity was indicated by high correlations between teacher ratings on the PCS and direct observations of noncompliance (r = .63), nonphysical aggression in the classroom (r = .74), and nonphysical aggression in the lunchroom (r = .62; Nolan & Gadow, 1994). Finally, treatment sensitivity of the full PCS was demonstrated by lower teacher ratings in low (0.3 mg/kg) and moderate (0.5/0.6 mg/kg) methylphenidate dose conditions when compared with ratings in placebo/no-treatment conditions (Gadow & Sprafkin, 1997). Items from the original PCS served as the basis for brief multi-item scales (factor-derived and individualized DBR-MIS), which were developed according to the procedures described in subsequent sections.
Procedures
Each of the 65 participants received placebo and three doses of methylphenidate (0.1 mg/kg, 0.3 mg/kg, and 0.5 mg/kg) under double-blind conditions. Each dose condition lasted 2 weeks, and dose schedules were randomly assigned and counterbalanced. Medication was dispensed to parents at 2-week intervals in dated, sealed envelopes. Medication was administered twice daily; the majority of children received the first dose before arriving at school and the second dose approximately 3.5 hr later. Across conditions, teachers rated child behavior in classrooms using the PCS (a) on prespecified days (Tuesdays and Thursdays) and (b) immediately following a designated 30-min academic activity, which occurred approximately midway between the start of the school day and the students’ lunch period. Although four ratings per dose condition were planned (i.e., 2 days × 2 weeks), high percentages of missing data on the fourth occasion precluded its use in subsequent analyses conducted as part of this study. Specifically, 40.0% of data were missing in the placebo condition, and 52.3% of data were missing in the very low dose (0.1 mg/kg) methylphenidate treatment condition. Fidelity to rating procedures was not explicitly assessed. Each teacher conducted ratings for one child; general education classroom teachers conducted ratings for 46 children (70.8%), and special education teachers conducted ratings for 19 children (29.2%) enrolled in full-time special education programs.
As opposed to traditional rating scales that ask teachers to assess a child’s behavior over an extended period of time (e.g., 2 weeks, 1 month), teachers in the current study were instructed to rate student behavior at prespecified times during maximum drug efficacy, immediately following a designated academic activity in the morning. The rating procedure was therefore consistent with the definition of DBR proposed by Christ et al. (2009), in that direct ratings of children’s behavior were made at specified times and in the context in which behavior occurred. Although rating student behavior twice per week immediately following a prespecified academic activity may have been novel for teachers, additional training was not deemed necessary, given that teachers had prior experience responding to Likert-type scales similar to the 4-point scale of the PCS.
Data Analysis
Within the current study, generalizability and decision studies were first conducted to determine the number of items and assessment occasions needed to obtain an adequate level of dependability. Second, teacher ratings obtained during the placebo condition were compared with ratings obtained during the very low dose methylphenidate condition to evaluate the treatment sensitivity of DBR-MIS derived from the PCS. Methylphenidate was judged to be an appropriate intervention for exploring the treatment sensitivity of a DBR-MIS focused on peer conflict, given meta-analytic research highlighting the fact that the effect of methylphenidate on aggressive behaviors may be similar in magnitude to its effect on symptoms of inattention and hyperactivity in students with ADHD (Connor, Glatt, Lopez, Jackson, & Melloni, 2002). Data analysis procedures are described in detail in the following sections.
Generalizability and dependability studies
Much of the DBR-MIS research to date has focused on scale development through the application of generalizability theory (GT). GT allows one to estimate the proportions of variance attributable to the different sources of error one is likely to encounter when assessing behavior in applied contexts, including the rater, item or alternate form, and rating occasion. Furthermore, GT allows one to evaluate the dependability of measurements, or the accuracy of an observed sample of behavior compared with actual behavior within the range of all possible measurement conditions (Shavelson & Webb, 1991).
Two types of studies are conducted when applying GT in behavioral assessment: generalizability (G) and decision (D) studies (Briesch, Swaminathan, Welsh, & Chafouleas, 2014). The goal of a G study is to isolate and estimate sources of error variance, or facets (e.g., item, occasion, rater). After identifying the variance associated with each facet, one conducts D studies to examine the consistency, or dependability, of measurement, with the goal of designing assessment procedures that minimize error for a particular purpose (Shavelson & Webb, 1991). Designing the D study requires one to define the universe of generalization, including which facets are specified in the measurement model. Similar to reliability in classical test theory, dependability in GT is assessed by computing generalizability (ρ2) and dependability (ϕ) coefficients for relative (interindividual) and absolute (intraindividual) decisions, respectively.
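For a fully crossed person × item × occasion (p × i × o) design with n_i items and n_o occasions, these two coefficients take a standard form (following Shavelson & Webb, 1991); the notation below is generic GT notation rather than formulas reproduced from the present study:

```latex
% Dependability (absolute decisions): all non-person variance counts as error.
\phi = \frac{\sigma^2_p}
  {\sigma^2_p + \frac{\sigma^2_i}{n_i} + \frac{\sigma^2_o}{n_o}
   + \frac{\sigma^2_{pi}}{n_i} + \frac{\sigma^2_{po}}{n_o}
   + \frac{\sigma^2_{io}}{n_i n_o} + \frac{\sigma^2_{pio,e}}{n_i n_o}}

% Generalizability (relative decisions): only variance that interacts with
% persons affects rank ordering, so main effects of item and occasion drop out.
\rho^2 = \frac{\sigma^2_p}
  {\sigma^2_p + \frac{\sigma^2_{pi}}{n_i} + \frac{\sigma^2_{po}}{n_o}
   + \frac{\sigma^2_{pio,e}}{n_i n_o}}
```

Because ϕ treats item and occasion main effects as error whereas ρ2 does not, ϕ is always less than or equal to ρ2 for a given design.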
G studies were conducted using teacher ratings in the 2-week placebo condition of the study, given that children’s behavior in the placebo condition varied naturally in the absence of treatment. G studies were restricted to the placebo condition because a fundamental assumption of GT is that each sample of behavior is theoretically exchangeable with every other sample of behavior. This is expected to occur in the absence of intervention (i.e., under natural circumstances); however, it is not as likely to occur once treatment begins because behavior tends to gradually improve over time. As a result, 1 day of intervention data is not inherently exchangeable with another, particularly in the context of medication evaluation, wherein the effects of medication may gradually increase before stabilization occurs. Variance component analyses (ANOVA with Type III sum of squares) were conducted using SPSS 23 for the fully crossed, two-facet design involving occasions and items. These G study results were then used to inform a series of decision studies to compare the dependability of measurement across items and occasions.
Results of the D studies were used to inform the number of items for inclusion in two versions of DBR-MIS constructed from PCS items. The starting model examined the number of items necessary to reach an adequate level of dependability across 3 days because actual data were not collected with fidelity across more than 3 days in each dose condition. As the second goal of the study was to evaluate the treatment sensitivity of DBR-MIS, we sought to identify a combination of items and occasions that would both demonstrate adequate dependability and match existing data to allow for defensible in vivo evaluations of DBR-MIS treatment sensitivity to pharmacological intervention.
Two methods were used to develop abbreviated DBR-MIS: (a) factor derived and (b) individualized. Items for the factor-derived DBR-MIS were selected based on item component loadings obtained by Volpe and Gadow (2010) via principal components analysis (see Table 1). That is, the item with the highest component loading served as the first item on the factor-derived DBR-MIS, and items were subsequently added in descending order of component loading until the number of items on the DBR-MIS matched the number identified via D studies. Items for the individualized DBR-MIS were uniquely selected for each student based on the highest ratings on the first assessment occasion during the placebo condition. Items were selected in descending order of rating until the number of items on the DBR-MIS matched the number identified via D studies. In the event of a tie between multiple items with the same rating, a random number generator was used to select an item for inclusion to reduce possible investigator biases and ensure objective DBR-MIS construction.
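The two construction procedures can be sketched as follows. The item labels and ratings are hypothetical, and a seeded shuffle stands in for the random number generator described above; this is a sketch of the selection logic, not the study's actual scripts:

```python
import random


def factor_derived_items(loadings, n_items):
    """Select the n items with the highest component loadings, in
    descending order (loadings as in Volpe & Gadow, 2010, Table 1)."""
    return sorted(loadings, key=loadings.get, reverse=True)[:n_items]


def individualized_items(day1_ratings, n_items, seed=0):
    """Select the n items rated highest for one child on the first placebo
    occasion; ties are broken randomly via a pre-sort shuffle."""
    rng = random.Random(seed)
    items = list(day1_ratings)
    rng.shuffle(items)                               # randomize tie order
    items.sort(key=day1_ratings.get, reverse=True)   # stable sort keeps it
    return items[:n_items]


# Hypothetical 0-3 ratings for four of the ten PCS items
ratings = {"fights": 3, "teases": 2, "curses": 1, "throws": 0}
print(individualized_items(ratings, 2))  # ['fights', 'teases']
```

Because Python's sort is stable, shuffling before sorting leaves tied items in random relative order, which mirrors the tie-breaking procedure without biasing selection toward any particular item.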
Table 1. Peer Conflict Scale Component Loadings From Volpe and Gadow (2010).

Treatment sensitivity
Analyses of treatment sensitivity occurred in three stages: First, one dependent-samples t test was conducted to compare factor-derived DBR-MIS data obtained during placebo and very low dose (0.1 mg/kg) conditions. Similarly, one dependent-samples t test was conducted to compare individualized DBR-MIS data obtained during placebo and very low dose conditions. Statistically significant (p < .05) differences between DBR-MIS ratings in placebo and very low dose conditions would provide initial evidence to indicate that the scales can detect changes in behavior following treatment. The very low dose condition was selected for comparison because it yields the subtlest changes in behavior. Hypothetically, if a scale is able to detect these small changes, it should also be able to detect changes between placebo and higher doses, or changes between one dose and another. Furthermore, children often exhibit diminished improvement in social functioning as dose increases over time, and floor effects may be observed.
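The paired comparison reduces to a dependent-samples t statistic on per-child mean ratings. A minimal sketch using only the standard library (the data are hypothetical; the original analyses were not conducted with this code):

```python
import math
import statistics


def paired_t(placebo, treatment):
    """Dependent-samples t statistic comparing each child's mean DBR-MIS
    rating in the placebo vs. very low dose condition."""
    diffs = [t - p for p, t in zip(placebo, treatment)]
    n = len(diffs)
    t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1  # t statistic and degrees of freedom


# Hypothetical per-child mean ratings for four children
t_stat, df = paired_t([2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 5.0])
print(round(t_stat, 2), df)  # -3.0 3
```

A negative t here reflects lower (improved) peer-conflict ratings under treatment than under placebo, the expected direction of change.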
Second, following the precedent set by Chafouleas et al. (2012) and Cheney, Flower, and Templeton (2008), four of the five change metrics identified by Gresham (2005) were applied to further evaluate the treatment sensitivity of factor-derived and individualized DBR-MIS: absolute change, effect size, percentage change from baseline, and reliable change index (RCI). The fifth metric outlined by Gresham (2005), percentage of nonoverlapping data (PND), was not used because it is vulnerable to floor and ceiling effects. Given that 35 of 65 children (54%) received a zero rating on at least 1 day in the placebo condition, PND would severely underestimate the magnitude of behavior change. Each of the four change metrics was calculated for the factor-derived DBR-MIS and compared with the corresponding metric calculated for the individualized DBR-MIS and full PCS. Similar metrics across the three scales would indicate that the scales detected similar magnitudes of change. Conversely, a larger metric for a particular scale when compared with the other two would indicate that the scale was more sensitive to change. Absolute change is defined as the magnitude of change an individual demonstrates across treatment conditions. We computed absolute change by subtracting each child's mean placebo rating from his or her mean treatment rating. We computed standardized mean difference effect size by subtracting each child's mean placebo rating from his or her mean treatment rating and dividing by the standard deviation of ratings in the placebo condition (Busk & Serlin, 1992). We calculated percentage change by subtracting each child's mean treatment rating from his or her mean placebo rating and dividing by the mean placebo rating.
Finally, we calculated RCI by subtracting a child's mean placebo rating from his or her mean intervention rating and then dividing by the standard error of the difference between mean placebo and treatment ratings: RCI = (x̄2 − x̄1) / √[2(SE)²], where SE = s1√(1 − rxx), s1 is the placebo standard deviation, and rxx is the reliability (Jacobson & Truax, 1991). Dependability, at the level demonstrated in D studies using a six-item scale across three occasions (.71), was substituted for reliability in the formula for factor-derived and individualized DBR-MIS; dependability, at the level demonstrated in D studies using all 10 items of the PCS across three occasions (.76), was substituted for reliability in the formula for the full PCS.
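Following the definitions above, the four metrics can be computed per child as in the sketch below. The daily ratings are hypothetical, and `reliability` stands in for the dependability coefficient substituted for rxx:

```python
import math
import statistics


def change_metrics(placebo, treatment, reliability):
    """Absolute change, Busk-Serlin effect size, percentage change from
    baseline, and reliable change index (Jacobson & Truax, 1991) for one
    child's daily ratings in the placebo vs. treatment conditions."""
    m_p, m_t = statistics.mean(placebo), statistics.mean(treatment)
    s_p = statistics.stdev(placebo)              # placebo-condition SD
    absolute = m_t - m_p                         # absolute change
    effect_size = (m_t - m_p) / s_p              # standardized mean difference
    pct_change = (m_p - m_t) / m_p * 100         # % change from baseline
    se = s_p * math.sqrt(1 - reliability)        # standard error of measurement
    rci = (m_t - m_p) / math.sqrt(2 * se ** 2)   # reliable change index
    return absolute, effect_size, pct_change, rci


# Hypothetical ratings on three placebo days and three treatment days
abs_c, es, pct, rci = change_metrics([2, 3, 4], [1, 2, 3], reliability=0.71)
```

Note the sign conventions: for a peer-conflict scale, improvement yields negative absolute change, effect size, and RCI, but positive percentage change from baseline.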
Third, Spearman’s ρ coefficients were calculated to examine associations between corresponding change metrics obtained for the (a) full PCS and factor-derived DBR-MIS, (b) full PCS and individualized DBR-MIS, and (c) factor-derived and individualized DBR-MIS. Analyses were restricted to examining associations within each of the four change metrics (e.g., full PCS absolute change compared with factor-derived DBR-MIS absolute change). High coefficients would indicate that the scales provide similar information in regard to rank ordering of treatment response.
Table 2 indicates the percentages of missing data for each day and item across all dose conditions. Analysis of missing data indicated that an average of 9.2% of all data were missing; however, the amount of missing data varied across days and items. Multiple imputation was used to replace missing data because the procedure allows imputed values to vary across data sets to account for uncertainty (Baraldi & Enders, 2010). According to Rubin (1987), the efficiency of an estimate based on m imputations, with γ as the rate of missing information, can be computed as (1 + γ/m)⁻¹. Five imputations were deemed sufficient, given that 98.2% efficiency would be obtained for five imputations with 9.2% missing data, and diminishing returns were observed as the number of imputations increased above 5. ANOVA with Type III sum of squares was subsequently computed on each of the five data sets, and variance components were averaged across the five imputed data sets in the placebo condition to generate values for generalizability analyses.
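Rubin's efficiency figure is straightforward to verify with the study's values:

```python
def imputation_efficiency(gamma, m):
    """Relative efficiency of an estimate based on m imputations, with
    gamma the rate of missing information (Rubin, 1987)."""
    return (1 + gamma / m) ** -1


# 9.2% missing information, five imputations
print(round(imputation_efficiency(0.092, 5), 3))  # 0.982
```

Increasing m to 10 raises efficiency only marginally at this rate of missingness, which is the diminishing-returns pattern noted above.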
Table 2. Percentages of Missing Data.

Descriptive information for ratings in the placebo condition, including item means, standard deviations, minimum values, and maximum values, is provided in Table 3. A total of 325 ratings were included in the five imputed data sets (65 subjects × five imputations). Item means, standard deviations, and ranges were generally consistent across the 3 days; however, data indicated that some items varied more across days than other items in the placebo condition. For example, items, such as gives dirty looks or makes threatening gestures to other children, curses or teases other children to provoke conflict, and annoys other children to provoke them, exhibited the largest ranges of means across days, as well as the largest standard deviations. In contrast, items that reflected overt, discrete behaviors, including throws things at other children, engages in physical fights with other children, and threatens to hurt other children, had the smallest variations across days and the smallest standard deviations.
Table 3. Descriptive Statistics for Item Ratings in the Placebo Condition.

Generalizability Study
A G study was first conducted to examine the proportion of variance attributable to the facets of person, item, and occasion, as well as the interactions between facets and residual error. The largest proportions of rating variance were attributable to person (27%) and error (34%). These results indicate a substantial proportion of variability was attributable to the differences in PCS scores between students; however, a notable percentage of the variance in ratings was left unexplained by facets in the model. The next largest source of variance was the interaction between person and item (18%), which indicates that item ratings varied across students. Percentages of variance attributable to the person by occasion interaction (14%; indicating the rank order of students changed across occasions) and item (6%; indicating differences in individual item scores averaged across students) were both notable. Finally, negligible percentages of variance were attributable to occasion (1%) and the item by occasion interaction (0%). These results indicate that item ratings, on average, remained relatively stable over time within the placebo condition.
Dependability Studies
Variance component results were subsequently used in D studies to determine the number of items and assessment occasions necessary to reach an acceptable level of dependability (ϕ > .70) for making low-stakes, intraindividual (i.e., absolute) decisions. Results of the D studies are presented in Figure 1. Although adequate dependability could not be achieved using either a one- or two-item scale, the criterion for absolute decision making was achieved with three items after eight occasions, four or five items after four occasions, or six items after three occasions.
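The trade-off between items and occasions that drives these D-study results can be illustrated with the index of dependability (ϕ) for absolute decisions, computed from the variance components. The sketch below is a minimal illustration using the rounded variance proportions reported in the G study above; because those proportions are rounded, the resulting coefficients approximate rather than exactly reproduce the published D-study values.

```python
# Index of dependability (phi) for a person x item x occasion D study,
# using the rounded variance proportions reported in the G study.
VAR = {
    "p": 0.27,   # person
    "i": 0.06,   # item
    "o": 0.01,   # occasion
    "pi": 0.18,  # person x item
    "po": 0.14,  # person x occasion
    "io": 0.00,  # item x occasion
    "e": 0.34,   # residual (person x item x occasion, error)
}

def phi(n_items: int, n_occasions: int) -> float:
    """Phi for absolute decisions: person variance divided by person
    variance plus all error terms, each error term divided by the
    number of conditions averaged over."""
    error = (
        VAR["i"] / n_items
        + VAR["o"] / n_occasions
        + VAR["pi"] / n_items
        + VAR["po"] / n_occasions
        + VAR["io"] / (n_items * n_occasions)
        + VAR["e"] / (n_items * n_occasions)
    )
    return VAR["p"] / (VAR["p"] + error)

# A six-item scale over three occasions clears the .70 criterion,
# whereas a two-item scale falls short even over many occasions.
print(round(phi(6, 3), 2))   # 0.71
print(round(phi(2, 20), 2))  # 0.67
```

Averaging over more items or occasions shrinks the corresponding error terms, which is why fewer items can be offset by additional rating occasions.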
Treatment Sensitivity
Within the final set of analyses, the goal was to assess the treatment sensitivity of two abbreviated DBR-MIS scales developed using factor-derived and individualized methods. Although multiple combinations of items and occasions were identified through D studies (i.e., three items after eight occasions, four or five items after four occasions, six items after three occasions), treatment sensitivity analyses were conducted using the six-item DBR-MIS because this combination of items and occasions matched the structure of the available data (ratings collected on 3 days within each medication condition).
First, paired-samples t tests were performed to evaluate the sensitivity of each scale in detecting differences in behavior between placebo and medication (very low dose methylphenidate) conditions. Significant differences (p < .05) were found between mean teacher ratings during placebo and very low dose treatment for the factor-derived DBR-MIS, t(64) = 2.75, p = .010, d = 0.36, 95% confidence interval (CI) = [0.03, 0.19], and individualized DBR-MIS, t(64) = 2.73, p = .009, d = 0.35, 95% CI = [0.03, 0.21], indicating that both DBR-MIS detected changes in classroom behavior in response to treatment with methylphenidate.
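The paired-samples comparison can be sketched in a few lines. The data below are invented for illustration, and the computation assumes the standard formulas for a paired t statistic and a Cohen's d for paired differences (mean difference divided by the standard deviation of the differences), which may differ from the exact d variant reported in the study.

```python
import math
from statistics import mean, stdev

def paired_t_and_d(placebo, treatment):
    """Paired t statistic and Cohen's d computed on the within-subject
    differences (placebo minus treatment)."""
    diffs = [p - t for p, t in zip(placebo, treatment)]
    n = len(diffs)
    m, s = mean(diffs), stdev(diffs)
    t_stat = m / (s / math.sqrt(n))
    d = m / s  # d for paired differences; the study may use another variant
    return t_stat, d

# Hypothetical mean item ratings for five students in each condition.
placebo_means = [2.0, 3.0, 2.0, 4.0, 3.0]
treatment_means = [1.0, 2.0, 2.0, 3.0, 2.0]
t_stat, d = paired_t_and_d(placebo_means, treatment_means)
print(round(t_stat, 2), round(d, 2))  # 4.0 1.79
```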
Second, four change metrics were calculated within individuals across placebo and very low dose conditions; data were subsequently averaged across all subjects for each metric within each scale (factor-derived DBR-MIS, individualized DBR-MIS, and full PCS). Means and standard deviations are reported in Table 4. Most importantly, all metrics indicated that the three scales detected changes in interpersonal aggression in the expected direction following treatment. Results indicated that the full PCS demonstrated the largest absolute change (−1.57) from placebo to treatment when compared with both DBR-MIS. Interestingly, the individualized six-item DBR-MIS demonstrated the largest change from placebo to treatment on the remaining change metrics (effect size, percentage change, RCI) when compared with factor-derived DBR-MIS and the full PCS. Conversely, the factor-derived DBR-MIS demonstrated the smallest magnitudes of change across all four metrics when compared with individualized DBR-MIS and the full PCS.
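The four change metrics can be sketched as follows. The formulas assume common definitions from the cited sources: a single-case effect size of the Busk and Serlin (1992) type (mean change divided by the baseline standard deviation) and the Jacobson and Truax (1991) reliable change index. The function names and the illustrative input values are assumptions for this sketch, not the study's exact parameters or data.

```python
import math

def absolute_change(baseline_mean, treatment_mean):
    # Raw difference between condition means.
    return treatment_mean - baseline_mean

def effect_size(baseline_mean, treatment_mean, baseline_sd):
    # Busk & Serlin (1992)-style: mean change over baseline SD.
    return (treatment_mean - baseline_mean) / baseline_sd

def percentage_change(baseline_mean, treatment_mean):
    # Change expressed as a percentage of the baseline mean.
    return 100.0 * (treatment_mean - baseline_mean) / baseline_mean

def rci(baseline_score, treatment_score, sd, reliability):
    # Jacobson & Truax (1991): difference over the SE of the difference,
    # where s_diff = sd * sqrt(2 * (1 - reliability)).
    s_diff = sd * math.sqrt(2.0 * (1.0 - reliability))
    return (treatment_score - baseline_score) / s_diff

# Illustrative values only (not data from the study).
print(absolute_change(4.0, 2.5))          # -1.5
print(round(effect_size(4.0, 2.5, 1.2), 2))
print(percentage_change(4.0, 2.5))        # -37.5
print(round(rci(4.0, 2.5, 1.0, 0.5), 2))  # -1.5
```

Because absolute change is expressed in raw scale units, a longer scale such as the full PCS can show the largest raw change while the standardized metrics favor a shorter scale.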
Table 4. Descriptive Statistics for Change Metrics.

Third, Spearman’s ρ coefficients were calculated between corresponding change metrics obtained for the (a) full PCS and factor-derived DBR-MIS, (b) full PCS and individualized DBR-MIS, and (c) factor-derived DBR-MIS and individualized DBR-MIS. Results indicated statistically significant (p < .01) associations between all pairs of corresponding change metrics (see Table 5). Similar magnitudes of association between DBR-MIS and full PCS change metrics were obtained regardless of the DBR-MIS method examined. Overall, the results indicate that factor-derived and individualized DBR-MIS provide similar information regarding the rank ordering of methylphenidate treatment response.
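Spearman's ρ is a Pearson correlation computed on ranks, which is why it captures agreement in the rank ordering of treatment response rather than agreement in magnitude. The minimal sketch below omits tie handling (which the study's software would include), and the change scores shown are invented for illustration.

```python
import math

def spearman_rho(x, y):
    """Spearman's rho for tie-free data: Pearson correlation on ranks.
    (Average ranks for tied values are omitted in this sketch.)"""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for position, index in enumerate(order, start=1):
            r[index] = position
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Two change metrics that rank students identically correlate perfectly,
# even though their magnitudes differ.
full_scale_change = [-1.6, -0.4, -2.3, -0.9]
short_scale_change = [-1.1, -0.2, -1.9, -0.6]
print(round(spearman_rho(full_scale_change, short_scale_change), 2))  # 1.0
```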
Table 5. Spearman’s Rho Correlation Matrix Between Change Metrics.

Discussion
The dual purposes of the study were to examine the (a) dependability of data obtained from a multi-item DBR designed to assess peer conflict and (b) treatment sensitivity of abbreviated DBR-MIS scales constructed using factor-derived and individualized methods. We sought to maximize the efficiency of teacher ratings of interpersonal conflict by identifying the minimum number of items and occasions necessary to achieve an acceptable level of dependability. Results of G and D studies informed the construction of brief DBR-MIS, which we hypothesized may be used for progress monitoring. The treatment sensitivity of DBR-MIS was also evaluated due to its importance in measuring response to intervention. First, we hypothesized that brief, multi-item scales would detect changes in classroom peer conflict in response to treatment with methylphenidate. Second, we hypothesized that the brief, multi-item scales would demonstrate equal or superior treatment sensitivity when compared with the full-length PCS.
G study results indicated that more than half of the variance was attributable to either differences between students (27%) or residual error (34%). However, notable percentages of variance were also identified for the interactions between persons and occasions (14%) and persons and items (18%). Respectively, these results suggest that the relative standing of students changed somewhat over time, and the relative standing of students differed from one item to the next during the placebo condition. The negligible percentages of variance attributable to the interaction between item and occasion (0%) and to occasion (1%) indicate that item ratings remained stable over time during the placebo condition.
Results of D studies indicated that efficient combinations of items and assessment occasions reached an acceptable level of dependability. Specifically, a three-item scale across eight occasions, four- or five-item scales after four occasions, or a six-item scale after three occasions demonstrated acceptable dependability for intraindividual progress monitoring. Theoretically, the number of items included in DBR-MIS could be further reduced if more occasions were included in the progress-monitoring period. Item–occasion combinations identified by D studies are similar to the findings of previous DBR-MIS studies. For example, recommendations have varied depending on the target behavior of interest with as few as four occasions needed to assess oppositional defiant behavior (Volpe, Briesch, & Gadow, 2011) or academic engagement (Volpe & Briesch, 2012), whereas as many as eight (Volpe & Briesch, 2016) to 12 (Volpe & Briesch, 2012) occasions may be needed to dependably assess disruptive behavior. It is interesting to note that four occasions were needed to dependably assess behavior that was more interpersonal in nature (i.e., interpersonal peer conflict in the current study; oppositional defiant behavior in Volpe et al., 2011) using a five-item scale.
With the exception of the 2011 study by Volpe, Briesch, and Gadow, the current study differs from previous DBR-MIS studies in two important ways. First, classroom teachers served as raters of child behavior for the G, D, and treatment sensitivity studies, which increases the likelihood that the results will generalize to other classroom settings. Second, DBR-MIS were evaluated in the context of pharmacological intervention; the majority of studies to date have examined the treatment sensitivity of DBR-MIS only with regard to behavioral interventions.
Limitations
First, generalizability theory (GT) frames assessment within a universe of admissible observations. The universe of possible measurement in schools includes facets modeled in the present study (e.g., person, item, occasion), as well as facets not included in the model (e.g., rater). Although the facet of rater was not included in the current study, it is possible that it would account for a substantial amount of variance, as prior research has shown that ratings vary across raters (Chafouleas et al., 2007; Chafouleas, Riley-Tillman, Sassu, LaFrance, & Patwa, 2007). Therefore, additional work is needed to examine whether similar levels of dependability would be achieved using different raters in the same classroom, such as a classroom teacher and a special education teacher.
Second, whereas the number of persons provided a substantial sample, the number of measurement occasions was relatively small. Although this did not represent an issue for the generalizability and decision studies given the overall number of data points (65 students × 10 items × three occasions = 1,950 data points), a greater number of assessment occasions may have provided a more robust assessment of treatment sensitivity.
Third, on average, 9.2% of the data were missing, although the amount of missing data varied substantially across days. Multiple imputation was selected as the method to address missing data, given that replacement values are allowed to vary across imputations; however, it is possible that imputed values over- or underestimated actual behavior exhibited by children in the study. Percentages of missing data increased over time within each dose condition, although the reasons for the missing data are not entirely clear. It is possible that teacher fatigue, due to high informant load, resulted in fewer teachers continuing to complete ratings over time. However, because the PCS was completed in conjunction with other measures used as part of a larger study, it is impossible to separate the impact of the PCS on teacher fatigue from the impact of all measures. Furthermore, procedures to ensure rating fidelity likely would have reduced the amount of missing data, and future investigations of the technical characteristics of DBR-MIS should also take into account the feasibility and acceptability of the rating procedures.
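The core property of multiple imputation noted above, that replacement values are allowed to vary across imputed data sets, can be illustrated minimally. Drawing at random from the observed values is a deliberate oversimplification of the model-based imputation a real analysis would use (e.g., Rubin, 1987); the sketch is shown only to make concrete why completed data sets differ from one another.

```python
import random

def impute(values, n_imputations=5, seed=42):
    """Create several completed copies of a rating series, filling each
    missing value (None) with a random draw from the observed values.
    A real analysis would draw from a fitted imputation model instead."""
    observed = [v for v in values if v is not None]
    rng = random.Random(seed)
    completed = []
    for _ in range(n_imputations):
        completed.append([v if v is not None else rng.choice(observed)
                          for v in values])
    return completed

ratings = [2, None, 3, 1, None, 2]  # hypothetical series with missing days
data_sets = impute(ratings)
# Each completed set has no missing values; fills can differ across sets,
# so downstream estimates reflect uncertainty about the missing ratings.
print(len(data_sets), all(v is not None for ds in data_sets for v in ds))
```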
Finally, there may be limitations to the generalization of findings, given distinctive features of the study sample and treatment utilized. For one, all of the children in the study met diagnostic criteria for psychiatric disorders (e.g., ADHD, ODD, anxiety); therefore, the results may not generalize to children who do not meet diagnostic criteria for the specific aforementioned disorders. Many children receiving targeted school-based intervention within MTSS would not meet criteria for a child psychiatric disorder. Further research is needed to replicate these results with students in schools who exhibit less severe forms of disruptive and aggressive behavior. In addition, the treatment sensitivity of the DBR-MIS scale was assessed in response to pharmacological treatment (i.e., methylphenidate), which arguably represents a different manner of intervention than that which is typically provided within the context of a school-based MTSS. Although school personnel are often called upon to monitor the effects of medication to inform the decision making of prescribing physicians, such effects would likely not be the focus of eligibility-focused decision making within a problem-solving model. Thus, additional research is also needed to replicate these results in response to more common school-based intervention strategies designed to target interpersonal aggression.
Implications for Practice
Results of the present study add to the literature supporting DBR-MIS for assessing social behavior in schools. Although DBR-MIS may reach an acceptable level of dependability for low-stakes decisions (i.e., ϕ > .70), a higher coefficient (ϕ > .90) has been recommended for high-stakes or diagnostic decisions. It is important to note that although the six-item DBR-MIS in the present study demonstrates adequate dependability for progress monitoring (i.e., low-stakes decisions), an adequate level of dependability for high-stakes decisions could not be reached after 20 rating occasions, regardless of the number of items used from the original 10-item scale. Stated differently, DBR-MIS derived from the PCS may be useful for monitoring students’ social behavior in response to intervention; however, DBR-MIS should be used in combination with other established assessment tools when making diagnostic decisions or decisions regarding eligibility for special education services. This study contributes to the growing body of literature suggesting that DBR may be best used for low-stakes, intraindividual comparisons.
The present study also contributes to the literature investigating an individualized approach to constructing progress-monitoring scales. Although continued research is warranted, preliminary evidence indicates that individualized DBR-MIS demonstrate similar sensitivity to a full-length scale composed of more items. Given the need for feasibility in school-based assessment, individualized DBR-MIS have the potential to reduce the amount of time needed to monitor student progress by eliminating irrelevant items and reducing the number of items a teacher needs to rate.
Finally, some researchers have argued for the need to assess student behavior in relation to both short-term intervention targets and global objectives (Volpe, Briesch, & Chafouleas, 2010). An individualized DBR-MIS may offer the benefit of assessing progress toward both specific intervention targets, reflected by individual items (e.g., hits, pushes, or trips other children), and broad domains of behavior, reflected by a DBR-MIS composite score. Whereas specific intervention targets are highly sensitive indicators of response to treatment, composite scores reflect general social functioning. This is likely to be seen as an advantage to teachers and school-based support personnel, given that they are often concerned with measuring progress in both the short and long term.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
References
American Psychiatric Association. (1987). Diagnostic and statistical manual of mental disorders (3rd ed., Rev.). Washington, DC: Author.
American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Baraldi, A. N., Enders, C. K. (2010). An introduction to modern missing data analyses. Journal of School Psychology, 48, 5–37. doi:10.1016/j.jsp.2009.10.001
Bradley, R., Doolittle, J., Bartolotta, R. (2008). Building on the data and adding to the discussion: The experiences and outcomes of students with emotional disturbance. Journal of Behavioral Education, 17, 4–23. doi:10.1007/s10864-007-9058-6
Bramlett, R., Murphy, J., Johnson, J., Wallingsford, L., Hall, J. (2002). Contemporary practices in school psychology: A national survey of roles and referral problems. Psychology in the Schools, 39, 327–335. doi:10.1002/pits.10022
Briesch, A. M., Chafouleas, S. M., Riley-Tillman, T. C. (2016). Direct Behavior Rating (DBR): Linking assessment, communication, and intervention. New York, NY: Guilford Press.
Briesch, A. M., Ferguson, T. D., Volpe, R. J., Briesch, J. M. (2013). Examining teachers’ perceptions of social-emotional and behavioral referral concerns. Remedial and Special Education, 34, 249–256. doi:10.1177/0741932512464579
Briesch, A. M., Swaminathan, H., Welsh, M., Chafouleas, S. M. (2014). Generalizability theory: A practical guide to study design, implementation, and interpretation. Journal of School Psychology, 52, 13–35. doi:10.1016/j.jsp.2013.11.008
Busk, P. L., Serlin, R. C. (1992). Meta-analysis for single-case research. In Kratochwill, T., Levin, J. (Eds.), Single-case design and analysis (pp. 187–212). Hillsdale, NJ: Lawrence Erlbaum.
Chafouleas, S. M. (2011). Direct Behavior Rating: A review of the issues and research in its development. Education & Treatment of Children, 34, 575–591.
Chafouleas, S. M., Briesch, A. M., Riley-Tillman, T. C., Christ, T. J., Black, A., Kilgus, S. P. (2010). An investigation of the generalizability and dependability of Direct Behavior Rating Single Item Scales (DBR-SIS) to measure academic engagement and disruptive behavior of middle school students. Journal of School Psychology, 48, 219–246. doi:10.1016/j.jsp.2010.02.001
Chafouleas, S. M., Christ, T. J., Riley-Tillman, T. C., Briesch, A. M., Chanese, J. A. (2007). Generalizability and dependability of direct behavior ratings to assess social behavior of preschoolers. School Psychology Review, 36, 63–79.
Chafouleas, S. M., Riley-Tillman, T. C., Christ, T. J. (2009). Direct Behavior Rating (DBR): An emerging method for assessing social behavior within a tiered intervention system. Assessment for Effective Intervention, 34, 195–200. doi:10.1177/1534508409340391
Chafouleas, S. M., Riley-Tillman, T. C., Sassu, K. A., LaFrance, M. J., Patwa, S. S. (2007). Daily behavior report cards: An investigation of the consistency of on-task data across raters and methods. Journal of Positive Behavior Interventions, 9, 30–37. doi:10.1177/10983007070090010401
Chafouleas, S. M., Sanetti, L. M., Kilgus, S. P., Maggin, D. M. (2012). Evaluating sensitivity to behavioral change using direct behavior rating single-item scales. Exceptional Children, 78, 491–505.
Cheney, D., Flower, A., Templeton, T. (2008). Applying response to intervention metrics in the social domain for students at risk of developing emotional or behavioral disorders. Journal of Special Education, 42, 108–126. doi:10.1177/0022466907313349
Christ, T. J., Riley-Tillman, T. C., Chafouleas, S. M. (2009). Foundation for the development and use of Direct Behavior Rating (DBR) to assess and evaluate student behavior. Assessment for Effective Intervention, 34, 201–213. doi:10.1177/1534508409340390
Connor, D. F., Glatt, S. J., Lopez, I. D., Jackson, D., Melloni, R. H. (2002). Psychopharmacology and aggression: A meta-analysis of stimulant effects on overt/covert aggression-related behaviors in ADHD. Journal of the American Academy of Child & Adolescent Psychiatry, 41, 253–261. doi:10.1097/00004583-200203000-00004
Crick, N. R. (1996). The role of overt aggression, relational aggression, and prosocial behavior in the prediction of children’s future social adjustment. Child Development, 67, 2317–2327. doi:10.1111/j.1467-8624.1996.tb01859.x
Cullinan, D., Sabornie, E. J. (2004). Characteristics of emotional disturbance in middle and high school students. Journal of Emotional and Behavioral Disorders, 12, 157–167. doi:10.1177/10634266040120030301
Fuchs, L., Fuchs, D., Hamlett, C. L., Walz, L., Germann, G. (1993). Formative evaluation of academic progress: How much growth can we expect? School Psychology Review, 22, 27–48.
Gadow, K. D. (1986). Peer Conflict Scale. Stony Brook: Department of Psychiatry, State University of New York.
Gadow, K. D., Nolan, E. E. (2002). Differences between preschool children with ODD, ADHD, and ODD+ADHD symptoms. Journal of Child Psychology and Psychiatry, 43, 191–201. doi:10.1111/1469-7610.00012
Gadow, K. D., Sprafkin, J. (1997). ADHD Symptom Checklist-4 manual. Stony Brook, NY: Checkmate Plus.
Gadow, K. D., Sprafkin, J. (2008). ADHD Symptom Checklist-4 2008 manual. Stony Brook, NY: Checkmate Plus.
Gadow, K. D., Sprafkin, J., Nolan, E. E. (1996). ADHD school observation code. Stony Brook, NY: Checkmate Plus.
Gadow, K. D., Sprafkin, J., Nolan, E. E. (2001). DSM-IV symptoms in community and clinic preschool children. Journal of the American Academy of Child & Adolescent Psychiatry, 40, 1383–1392. doi:10.1097/00004583-200112000-00008
Gresham, F. M. (2005). Response to intervention: An alternative means of identifying students as emotionally disturbed. Education & Treatment of Children, 28, 328–344.
Jacobson, N. S., Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19. doi:10.1037/0022-006X.59.1.12
Kratochwill, T. R., Albers, C. A., Shernoff, E. S. (2004). School-based interventions. Child and Adolescent Psychiatric Clinics of North America, 13, 885–903. doi:10.1016/j.chc.2004.05.003
Leff, S. S., Power, T. J., Manz, P. H., Costigan, T. E., Nabors, L. A. (2001). School-based aggression prevention program for young children: Current status and implications for violence prevention. School Psychology Review, 30, 344–362.
Nolan, E. E., Gadow, K. D. (1994). Relation between ratings and observations of stimulant drug response in hyperactive children. Journal of Clinical Child Psychology, 23, 78–90. doi:10.1177/1534508409333547
Pelham, W. E., Gnagy, E. M., Burrows-Maclean, L., Williams, A., Fabiano, G. A., Morrisey, S. M., . . . Lock, T. M. (2001). Once-a-day Concerta methylphenidate versus three-times-daily methylphenidate in laboratory and natural settings. Pediatrics, 107, e105.
Pelham, W. E., Hoza, B., Pillow, D. R., Gnagy, E. M., Kipp, H. L., Greiner, A. R., . . . Fitzpatrick, E. (2002). Effects of methylphenidate and expectancy on children with ADHD: Behavior, academic performance, and attributions in a summer treatment program and regular classroom settings. Journal of Consulting and Clinical Psychology, 70, 320–335. doi:10.1037//0022-006X.70.2.320
Riley-Tillman, T. C., Methe, S. A., Weegar, K. (2009). Examining the use of Direct Behavior Rating on formative assessment of class-wide engagement: A case study. Assessment for Effective Intervention, 34, 224–230. doi:10.1177/1534508409333879
Rubin, D. B. (1987). The calculation of posterior distributions by data augmentation: Comment: A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: The SIR algorithm. Journal of the American Statistical Association, 82, 543–546.
Shavelson, R. J., Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE.
Spaulding, S. A., Horner, R. H., May, S. L., Vincent, C. G. (2008). Evaluation brief: Implementation of school-wide PBS across the United States. Washington, DC: Office of Special Education Programs Technical Assistance Center on Positive Behavioral Interventions and Supports. Retrieved from https://www.pbis.org/blueprint/evaluation-briefs/implementation-across-us
Underwood, M. K., Beron, K. J., Rosen, L. H. (2011). Joint trajectories for social and physical aggression as predictors of adolescent maladjustment: Internalizing symptoms, rule-breaking behaviors, and borderline and narcissistic personality features. Development and Psychopathology, 23, 659–678. doi:10.1017/S095457941100023X
Volpe, R. J., Briesch, A. M. (2012). Generalizability and dependability of single-item and multiple-item direct behavior rating scales for engagement and disruptive behavior. School Psychology Review, 41, 246–261.
Volpe, R. J., Briesch, A. M. (2015). Multi-item Direct Behavior Ratings: Dependability of two levels of assessment specificity. School Psychology Quarterly, 30, 431–442. doi:10.1037/spq0000115
Volpe, R. J., Briesch, A. M. (2016). Dependability of two scaling approaches to Direct Behavior Rating Multi-Item Scales assessing disruptive classroom behavior. School Psychology Review, 45, 39–52. doi:10.17105/SPR45-1.39-52
Volpe, R. J., Briesch, A. M., Chafouleas, S. M. (2010). Linking screening for emotional and behavioral problems to problem-solving efforts: An adaptive model of behavioral assessment. Assessment for Effective Intervention, 35, 240–244. doi:10.1177/1534508410377194
Volpe, R. J., Briesch, A. M., Gadow, K. D. (2011). The efficiency of behavior rating scales to assess disruptive classroom behavior: Applying generalizability theory to streamline assessment. Journal of School Psychology, 49, 131–155. doi:10.1016/j.jsp.2010.09.005
Volpe, R. J., Gadow, K. D. (2010). Creating abbreviated rating scales to monitor classroom inattention-overactivity, aggression, and peer conflict: Reliability, validity, and treatment sensitivity. School Psychology Review, 39, 350–363.
Volpe, R. J., Gadow, K. D., Blom-Hoffman, J., Feinberg, A. B. (2009). Factor analytic and individualized approaches to constructing brief measures of ADHD behaviors. Journal of Emotional and Behavioral Disorders, 17, 118–128. doi:10.1177/1063426608323370


