The case against methodological absolutism: when and why experimental evaluation is and is not appropriate
Impact evaluations are studies that seek to establish whether changes observed following a program intervention were caused by the intervention rather than by other factors (Khandker et al., 2010). They can be undertaken in a variety of contexts and for a variety of purposes, but generally involve an attempt to compare the outcomes that were observed with the results that would have prevailed if the program had not been undertaken. A European Commission report describes the aim of impact evaluations as being "to allow policy makers to rule out alternative explanations for the changes in circumstances or achievements that might be observed" (2012, p. 1), and to provide evidence that "allows policy makers to assess the effectiveness of interventions and, furthermore, to make comparisons between interventions and assess their relative performance" (ibid., p. 1). In other words, impact evaluations are a way for policy makers to answer the question "what works?" when it comes to interventions, so they can make decisions about which interventions to invest in and which are counterproductive or simply not cost-effective.

Researchers also emphasize the importance of impact evaluations in informing decisions about program and policy development. Campbell wrote that reforms act like experiments and that the purpose of evaluation is to test the effectiveness of reforms as one would measure the results of an experiment (Campbell, 1979). Tilley (2000) writes that impact evaluation gives policy makers the time and space to learn about the effects of a program before making large-scale decisions about its future or its scale.

To achieve these goals, all impact evaluation studies attempt to study cause and effect: looking for changes or outcomes directly attributable to an intervention or treatment. However, the evaluation discipline continues to debate how researchers might investigate and attribute cause. Experimental methods, which rely on the use of a "counterfactual" to attribute cause, have become popular (Scriven, 2008). Randomized controlled trials (RCTs), a subset of the experimental evaluation tradition, have come to be idolized as "real experiments" or as the "gold standard" for attributing cause (Tilley, 2000). Other experimental designs, including a wide variety of quasi-experimental designs, are now considered less rigorous and less acceptable in some circles of the evaluation and policy communities (Scriven, 2008). This trend has given rise to what Scriven calls "the politics of exclusion" (2008), in which programs and policies that are not evidenced by RCT experimental designs often fail to receive approval or funding. Some leading evaluators have condemned this politics and encourage the discipline to reject methodological absolutism in favor of adapting methods to the task at hand (Morén and Blom, 2003). The perception that rigor resides only in experimental designs and methods is giving way, for some, to the understanding that there is a plurality of viable and rigorous methodologies and that it is up to the evaluator to determine which is the most appropriate and feasible for the task at hand given the circumstances.
Scriven argues that, rather than there being a single optimal research method, "the optimal procedure is simply to require very high standards for matching the wide range of designs that can produce high levels of confidence to the problem and available resources" (Scriven, 2008, p. 23). Consistent with this movement, this article does not seek to engage in a simplistic argument for or against experimental methods as a whole, but to consider circumstances in which experimental methods may not be appropriate when conducting an impact evaluation. It first addresses the feasibility of RCT experimental designs in different scenarios and then discusses the generalizability of their results. It considers the position of "realist evaluation" as an alternative to the supposed superiority of experimental methods. It concludes that the "gold standard" approach of RCT experimental designs is an effective method for assessing impact in ideal circumstances, but that alternative approaches such as quasi-experimental designs are more effective in informing decision-making in many of the situations that prevail in reality.

Critics of RCTs highlight how difficult it is to conduct a randomized controlled trial that adheres to best practice in certain circumstances and policy contexts (Scriven, 2008). One of the most common circumstances in which experimental designs are difficult is when it is impractical, prohibitively expensive, or unethical to establish an appropriately matched control group (Chen & Rossi, 1987). Control groups are a means of establishing a counterfactual, which is the central mechanism by which experimental methods infer cause. Counterfactuals involve a comparison between what happened after an intervention and what would have happened in the absence of the intervention. Through a process of constructing equivalent experimental and control groups and applying an intervention to the experimental group only, experimental evaluators can compare the effect in each group. If it can be shown that the two groups were sufficiently comparable before the intervention (ideally through random assignment), any differences between the two groups after the intervention can be attributed to that intervention. However, establishing a matched control group by random allocation is not always possible.

One solution to this feasibility challenge is to use one of several quasi-experimental designs that do not rely on randomization to create control groups. These designs can improve the feasibility of an experimental approach by finding creative but effective ways to establish meaningful comparisons and benchmarks, control threats to internal validity, and produce rigorous causal claims. For example, in an interrupted time series design, the effects of an intervention are evaluated on the basis of changes in measures before and after implementation of the intervention (Penfold and Zhang, 2013). This controls for differences between participants that may influence the observed effects, because individual participants act as their own controls. Likewise, this approach avoids certain forms of bias that undermine control-group designs, such as environmental factors that affect the control group differently from the treatment group because of their different locations or contexts (Tilley, 2000). Quasi-experimental designs are very appropriate in many circumstances; a brief illustrative sketch of these two forms of comparison follows below.
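To make the logic of these two designs concrete, the sketch below simulates each comparison in Python. It is a minimal illustration with invented data and variable names, not drawn from any of the studies cited here: part (a) estimates a treatment effect as the difference in mean outcomes between randomly assigned groups, the counterfactual comparison at the heart of an RCT; part (b) fits a simple segmented regression of the kind used in interrupted time series analysis, in which the pre-intervention trend serves as the unit's own benchmark.

```python
# Illustrative only: simulated data, invented variable names, assumed effect sizes.
import numpy as np
import statsmodels.api as sm  # assumes statsmodels is available for OLS

rng = np.random.default_rng(0)

# (a) RCT-style counterfactual: compare randomly assigned groups.
n = 500
treated = rng.integers(0, 2, size=n)              # random assignment
outcome = 10 + 2 * treated + rng.normal(0, 3, n)  # true effect set to +2
diff_in_means = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(f"Estimated treatment effect (difference in means): {diff_in_means:.2f}")

# (b) Interrupted time series: the series acts as its own control.
# Segmented regression: baseline trend, level shift at the intervention,
# and change in trend afterwards.
t = np.arange(48)                        # e.g. 48 monthly observations
post = (t >= 24).astype(float)           # intervention introduced at month 24
y = 50 + 0.3 * t + 5 * post + 0.4 * post * (t - 24) + rng.normal(0, 2, 48)

X = sm.add_constant(np.column_stack([t, post, post * (t - 24)]))
its_model = sm.OLS(y, X).fit()
print(its_model.params)  # intercept, pre-trend, level change, trend change
```

In practice an evaluator would add confidence intervals and checks for autocorrelation in the time series, but the sketch shows the two contrasting sources of comparison: a randomized control group in (a), and the unit's own pre-intervention trajectory in (b).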
It is often less a question of whether experimentation is an appropriate approach than of which form of experimental design is best suited to the purpose at hand, a judgement that should include an assessment of feasibility.

A second criticism of experimental designs concerns generalizability. Researchers have expressed doubts about whether the effects of certain types of interventions implemented in one context can be generalized to other contexts under different circumstances. Morén and Blom argue that "randomized controlled trials do not really take context into account" (2003, p. 38), which for them means that what works in one initial context may be useless in a different context. Accordingly, Stern et al. argue that experiments can answer the question "did it work here?" but not "will it work for us elsewhere?" (Stern et al., 2012). Situations in which generalizability is a particular problem for impact evaluations are those in which context is a strong mediating factor. This is true of many social programs in the areas of justice, social work, education and development. Arrière (2012) writes that perhaps as few as 5% of development programs are suited to RCTs. Morén and Blom (2003) point out that experimental approaches often overlook the fact that social work and development practices are carried out in what they call "open conditions". They argue that social work is highly contextual and involves complex and dynamic client-worker relationships, and that these are not confounding variables to be controlled and homogenized through an experimental approach, but rather important factors to exploit in an intervention. Whereas in the pharmaceutical world the impact of a drug can be measured relatively easily, social programs are not simple "treatments" or "dosages" like drugs administered to passive recipients (Chen and Rossi, 1983). As Pawson argues, "programs do not 'work', rather it is the action of stakeholders that makes them work, and the causal potential of an initiative takes the form of providing reasons and resources to enable participants to change" (Pawson, 2002, p. 215).

Tilley (2000) reflects on a case study that illustrates the pitfalls of inappropriately applying an experimental design to a complex social situation. Tilley reviews Sherman's work on a mandatory arrest policy aimed at reducing domestic violence (Sherman, 1992). In the study discussed, a randomized control group was established and the evaluation found that repeat assaults were fewer in the intervention group than in the control group. As a result of the study, many U.S. cities were encouraged to adopt a mandatory domestic violence arrest policy in order to reduce repeat assaults. However, after its implementation in other cities, it became clear that the intervention did not reduce recidivism everywhere. Sherman suggested that the mixed results could be explained by different causal mechanisms operating under different economic and community conditions. He hypothesized that where employment rates are high, arrest may produce shame in the perpetrator, who is then less likely to reoffend; where there were fewer jobs and less stability in a community, arrest was likely to trigger the offender's anger, which contributed to higher rates of recidivism. Tilley (2000) uses this case as a clear example of a treatment whose effect varies with context.