The parachute problem

John Locker, Principal Research Adviser, New Zealand Police

Bronwyn Morrison, Principal Research Adviser, Department of Corrections

Author biographies:

John Locker has a PhD in Criminology. He worked as a lecturer in Criminology at Keele University in the UK before moving to New Zealand in 2005. Since that time he has worked for New Zealand Police in a variety of research and evaluation roles. His current research focus is police use of force, and the broader environment in which these fundamental police powers are managed.

Bronwyn Morrison has a PhD in Criminology. She has worked in New Zealand government research roles for the past 13 years. Since joining Corrections as a Principal Researcher in 2015 she has conducted research on prisoners’ post-release experiences, family violence perpetrators, remand prisoners, and, most recently, trainee corrections officers.


Introduction

In 2003 the British Medical Journal published an article entitled “Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials” (Smith and Pell, 2003). The article attempted to summarise randomised controlled trial (RCT) evidence on the effectiveness of jumping out of aircraft with and without a parachute, in order to determine the effect of parachute use on bodily trauma and likelihood of death. The authors noted that the perception that parachutes are a successful intervention is based solely on observational studies; while these studies have shown that non-use of a parachute can be associated with morbidity and mortality, they fail to meet the “gold standard” of research evidence. As such, a systematic review of randomised controlled trials on this subject was conducted, as the first step to establishing a more robust evidence base.

Not surprisingly, Smith and Pell’s (2003) review failed to unearth any RCT studies, with the evidence “limited” to observational data. Strictly applying the tenets of evidence-based medicine, the authors concede that this knowledge is of insufficient validity to make sound causative claims about the link between parachute use and mortality. To resolve this problem, the article concludes with the suggestion that “radical protagonists” of RCT methodologies should perhaps consider participating in a “double blind, randomised, placebo controlled, crossover trial of the parachute” to deliver a more definitive evidence base (Smith and Pell, 2003: 1459).

While undoubtedly tongue in cheek, the article makes several important points that researchers, policymakers, and practitioners need to consider when assessing evidence needs. In particular, it highlights that sound causal knowledge does not always depend on the existence of RCT evidence. Rather, alternative methods can and do provide sufficient (or better) evidence about “what works”, including in which contexts, and why. This observation is particularly pertinent to criminal justice research, where various obstacles often impede the successful application of RCT methodologies.

The rise of experimental research in criminal justice evaluation in New Zealand (and beyond)

Internationally, technological improvements in data capture, quality, and connectivity (such as New Zealand’s Integrated Data Infrastructure) alongside the rise of public managerialist approaches to governance, have driven increased interest in “evidence-based policy” and the means by which we establish “what works” (Gelsthorpe and Sharpe, 2005; Hughes, 1998). Within New Zealand, the social investment approach exemplified this shift. Officially launched in 2015, the social investment approach involved using “information and technology to better understand the people who need public services and what works, and then adjusting services accordingly” (Treasury, 2017). As Treasury currently states on its website, “to make sure services actually deliver in practice, proposals being considered as part of the social investment approach will need to deliver measurable results. Systematic evaluation of services will be a key part of this”.

It is entirely reasonable to expect publicly funded services and interventions to be rigorously evaluated; however, while not always made explicit, the requirement for “systematic evaluation” has often been narrowly interpreted as the need to incorporate RCTs within the evaluation design accompanying budget bids for new services (a trend also seen in other jurisdictions; see Sampson, 2010). This interpretation is based on an underlying premise about the hierarchy of social science methods, which assumes RCTs represent the apex of scientific inquiry and a “gold standard” in evaluative outputs (Davidson, 2014). As such, it is widely believed that RCTs can and will provide the level of certainty necessary to direct policy decisions about which services to continue and/or expand, and which ones to cease.

What is an RCT, and why should we consider using RCTs in criminal justice settings?

RCTs seek to measure the effect of an intervention on an outcome by employing a random allocation process to treatment and control groups, thereby ensuring that any biases are equally distributed between groups. This, in turn, provides confidence that treatment was the cause of any observed difference in outcome, rather than some hidden confounding factors (Hollin, 2012).
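
As a minimal sketch of what random allocation involves in practice, the Python fragment below (with invented participant identifiers and a purely illustrative seed) shuffles an eligible pool and splits it evenly between treatment and control, so that measured and unmeasured characteristics are distributed between the groups by chance alone.

```python
import random

# Invented participant identifiers, purely for illustration; in a real trial these
# would be the people (or cases) eligible for the intervention being evaluated.
participants = [f"P{i:03d}" for i in range(1, 21)]

random.seed(42)                 # fixed seed so this illustrative allocation is reproducible
random.shuffle(participants)    # chance alone decides who ends up in which group

midpoint = len(participants) // 2
treatment_group = participants[:midpoint]   # receive the intervention
control_group = participants[midpoint:]     # receive business-as-usual (no intervention)

print("Treatment:", treatment_group)
print("Control:  ", control_group)
```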

RCTs have a long history in the sciences and have played an important role in the development of medical knowledge. Although used to a far lesser extent in social science, where appropriate, well designed, and properly executed, RCTs can undeniably add value. As a variety of scholars have noted, RCTs represent a reliable method for making causative claims (Farrington and Welsh, 2005; Sampson, 2010; Weisburd, Lum and Petrosino, 2001); and as Sampson (2010: 490) suggests, experiments are “an essential part of the toolkit of criminologists” with “more, not fewer, field experiments” needed. We take no issue with this position: RCTs undoubtedly have value. Where we do take issue is with the application and appropriate use of RCTs, their presentation as a universal gold standard, and the over-extended claims that are made about their potential to lead evidence-based knowledge of “what works” (see Hough, 2010; Stenson and Silverstone, 2014).

Our argument is threefold: first, and most fundamentally, if the term “gold standard” is appropriate at all, it refers to the approach(es) most suitable to answering the question(s) at hand. In this sense, there is no singular “gold standard” (see Berk, 2005; Sampson, 2010; Grossman and MacKenzie, 2005; Gelsthorpe and Sharpe, 2005). To demarcate RCTs as a universal “gold standard”, therefore, makes little sense, and may well be detrimental to the breadth and depth of the “evidence-based” environment. As Grossman and MacKenzie (2005: 520) argue, context is critical to methodological design and quality:

“To claim the RCT is a gold standard, is like arguing that since being tall makes for a good high jumper, it follows that a 6’ elderly drunkard with a spinal injury is bound to be a better high jumper than a 5’11 Olympic athlete. All things are never equal, and one has to consider many factors other than, in this example, the person’s height. Just as being tall is often a good property for a jumper to have, the property of being an RCT is often a good property for a study to have, but it does not follow that anything that is an RCT is better than anything that isn’t.”

Second, while the term “gold standard” is often taken as a generic indicator of research quality, in fact this is a misinterpretation of a term that has a very specific meaning. Rating scales (such as the Scientific Methods Scale) from which “gold standard” language is derived, measure only one aspect of study validity: ‘internal validity’ (our confidence in assuming causation from findings). Even if RCTs do achieve the “gold standard” of internal validity, this is only one component of study validity; and RCTs’ high internal validity is often obtained at the expense of other important aspects of validity (such as “external validity”: our ability to generalise findings).

Third, even where we seek to understand causation and can agree that an RCT is (at least in principle) the best approach, in reality knowledge produced through RCTs is not necessarily more reliable, credible, or valuable than that produced through other methods. This is just as true in the medical arena as it is within social science (for instance, Bothwell et al, 2016). There is typically distance between textbook accounts of research methods and their application in the field. This is particularly true of RCTs, given the rigidity of their method and the stringency with which they need to be applied (to ensure internal validity), coupled with implementation difficulties associated with complex criminal justice settings. As Grossman and MacKenzie (2005: 523) state, we need to avoid scales whereby we assume that “even the most well-designed, carefully implemented, appropriate observational study will fall short of even the most badly designed, badly implemented, ill-suited RCT”.

When should we use RCTs, and how?

Given increased interest in “evidence-based policy” and “what works”, and given the central place of RCTs within this agenda (Davidson, 2006), it is critical that researchers, policymakers, and practitioners understand what RCTs involve and what type of evidence they are capable of delivering. The following sections discuss these issues.

“It depends on the context”: understanding outcomes

RCTs seek to understand outcomes and, as such, are an evaluative tool for answering “what works” questions. It should go without saying that they are therefore appropriate within this environment, but not necessarily others. However, even where our focus is “what works”, we must take care not simply to accept the natural primacy of RCTs in establishing causation, for a number of reasons.

While a perfectly designed and administered RCT may allow us to observe, with confidence, that a treatment affected an outcome, RCTs do not identify or elaborate the mechanisms within programmes that caused such an outcome to occur. Yet, in the event a programme is deemed to “work”, an understanding of these “how” and “why” questions is critical to replication. Given the significance of “social structures” and “cultural processes” in shaping causation (and ultimately programme outcomes), this understanding is essential if we wish to transport experimentally-successful interventions to new settings or populations (Sampson, 2010; Hough, 2010). As Latessa (2018: 1) notes in respect of correctional programmes, “the challenge for those administering programmes is not ‘what to do’ but rather ‘how to do it’ and ‘how to do it well’” (see also Pawson, 2013; Pawson and Tilley, 1997; Hope, 2009; Sampson, 2010; Sampson, Winship, and Knight, 2013).

In recognising this issue, some scholars have observed that the stringency of RCT design, and its necessary prioritisation of internal validity (that is, the ability to manage confounding variables), comes at the expense of external validity (that is, generalisability to other places, contexts, or populations). As Berwick (2008) states, one of the great ironies of the RCT approach is that the stringency of its methodological approach ultimately strips this method of the context required to generalise and replicate successful programmes. Ultimately, this makes “experimental evidence an inflexible vehicle for predicting outcomes in environments different from those used to conduct the experiment” (Heckman, 1992: 227). For this reason Berwick (2008: 1183) labels RCTs “an impoverished way to learn”.

Perhaps a more positive framing of this limitation is Hough’s (2010: 14) observation that RCTs are best conceptualised as a “starting point”, rather than the end point, of evaluative understanding:

“It is of great value for evaluative research to establish that something can work in reducing reoffending, but this is only the beginning of any serious evaluation. If a programme has been shown to be effective in one setting, the important next step is to identify the mechanisms by which this impact was achieved. The sort of evidence that one needs to search for this enterprise may be distinctly different from that which one needs to establish whether a programme can work.”

Consequently, understanding “what works” ultimately requires the use of a variety of methods (Clear, 2010). Elaborating this point, Latessa (2018) argues that outcome evaluation should be preceded by detailed intervention logic work (e.g. formative evaluation) and/or assessment of a programme’s implementation (e.g. process evaluation) to ensure that application and fidelity issues are resolved. Such work is crucial as it ensures that a programme is at a suitable stage to justify and support time-consuming and expensive outcome study (thereby reducing mid-RCT implementation changes that affect RCT methodology and the quality of findings). Indeed, Latessa (2018) suggests that where programmes fail to meet set quality standards at this earlier stage, the considerable resources earmarked for outcome measurement (such as through RCTs) should be diverted into improving the service or intervention to better deliver desired outcomes.

The presumed supremacy of RCTs in articulating “what works” also overstates the universality of this method in responding to all research questions that seek to understand causation. As Sampson (2010) notes, not all research that is interested in causation lends itself to the RCT method. For instance, criminologists are, and ought to be, concerned with macro-level causation and the development of causal mechanisms over long periods of time (sometimes even decades). In such situations, other research approaches, such as observation-based longitudinal studies (the Dunedin Longitudinal Study being one such example), offer better opportunities to assess causation. Indeed, much robust causal theory has emerged from the careful accumulation of observational studies, as opposed to laboratory-style experiments.

Ethics matter

Historically, one of the main areas of criticism levelled at RCTs has been the ethics of randomising treatment interventions (Braga et al, 2013). In response, a variety of authors have suggested that the increasing number of “ethically implemented” RCTs in criminal justice settings somewhat negates these concerns, and demonstrates that they are, for the most part, “based in folklore rather than facts” (for example, see Weisburd, Lum and Petrosino, 2001; Farrington and Welsh, 2005; Sampson, 2010).

The presentation of RCT ethics as “fact” or “fiction” is a superficial and unhelpful dichotomy. While recognising and accepting that it is possible to carry out “ethically implemented” RCTs in criminal justice settings, this does not reduce the relevance of ethical considerations. Rather, such considerations remain very much alive in debates about experimental social science methods. As Hollin (2012) states, RCTs should adhere to “the principle of equipoise”: a term commonly used in medicine to denote the need for “genuine doubt and an absence of evidence” of effectiveness as a basis for research: “ethically, RCTs can only be planned and carried out where there is reasonable uncertainty about the effectiveness of an intervention” (Hollin, 2012: 238). To return to our initial example, given what we know from observational data about the relationship between jumping out of a plane without a parachute and the likelihood of physical injury, even if an RCT study produced more reliable causative knowledge, requiring some participants to jump out of a plane without a parachute to attain this knowledge is untenable for ethical reasons. Applying this ‘harm minimisation’ principle to the criminal justice environment, practitioners should carefully consider the strength of existing evidence in assessing the need for, and appropriateness of, RCT study before committing resources to this enterprise.

Impracticalities of blinding

In addition to ethical and theoretical issues, there are a broad range of practical obstacles that can either prevent the application of RCTs altogether, or undermine the validity of results. One of these is ‘blinding’ – a design feature associated with more rigorous approaches. There are different blinding options open to RCT designers: single blinding ensures that participants are unaware of whether they are in the treatment or control group, while double blinding removes this knowledge from both participants and administrators. Finally, triple blinding makes participants, administrators and researchers unaware of the group allocation of individuals. The purpose of blinding is to limit participant, administrator and/or researcher biases that can result from knowledge of how and where individuals are placed within RCT groups, and which can, in turn, undermine the study design and therefore its results (Hollin, 2012; Goldacre, 2008).

Blinding may be a relatively easy methodological task in simple medical settings (for example, where the treatment is a pill, and group participants are administered either the medication or an identical-looking placebo). However, its application is often considerably more difficult in criminal justice environments – such as target hardening initiatives to reduce burglary, specialist court services for victims, or treatment programmes for offenders (Hollin, 2008). In such circumstances, both those providing treatment and those receiving it are likely to be acutely aware of their groupings (Gelsthorpe and Sharpe, 2005).

The inability to blind a programme can impact on its delivery in ways that affect outcomes; for example, if randomisation procedures are properly adhered to and participants are asked to opt into the service or programme prior to treatment allocation, it is plausible that being denied treatment may affect the attitudes and behaviours of those in the control group. Similarly, it may be difficult for those administering treatment to deny help to more promising cases, leading to flexible reinterpretations of randomisation procedures as trials progress (see Farrington and Welsh, 2005). Should this occur, considerable bias may be introduced into randomised experiments (Goldacre, 2008; Hollin, 2008).

Intervention volumes and the time lag for results

In addition to blinding, one of the most significant and insurmountable practical barriers to the implementation of RCTs in New Zealand criminal justice research is participant volumes (and, relatedly, timeframes). RCTs generally require large numbers, in both the treatment and control groups, to ensure that the resulting analyses have sufficient statistical power to reliably infer causation. For example, in a situation where we might predict that an intervention will reduce re-offending by 5%, achieving 80% power would require sample sizes of around 1,500 participants in each of the treatment and control groups.[1] This is one of the reasons why RCTs work best in simple, high-volume intervention environments. In reality, few New Zealand criminal justice innovations are piloted at such high volumes. The simpler the initiative, the better, as any variation or bifurcation of the treatment group along the intervention pathway will effectively split the sample size of the resultant group, thereby reducing volumes and increasing the time taken to achieve the numbers necessary for reliable analysis. As such, RCTs are less feasible in more complex social environments, such as those often found in criminal justice settings (Hough, 2010).
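
As a rough illustration of where a figure of this order comes from, the sketch below runs a standard two-proportion power calculation using Python’s statsmodels library. The assumed baseline re-offending rate of 50% (reduced to 45% by the intervention), the two-sided 5% significance level, and the equal group sizes are our own illustrative assumptions rather than figures from the footnoted source; changing the assumed baseline rate changes the required numbers considerably.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative assumptions only: a 50% baseline re-offending rate,
# reduced to 45% by the intervention (a five percentage point drop).
effect_size = proportion_effectsize(0.50, 0.45)   # Cohen's h for the two proportions

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,               # two-sided 5% significance level
    power=0.80,               # 80% chance of detecting the effect if it is real
    ratio=1.0,                # equal-sized treatment and control groups
    alternative="two-sided",
)
print(round(n_per_group))     # roughly 1,500-1,600 participants per group under these assumptions
```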

Without sufficient volumes, an RCT is likely to be lengthy, as the requisite numbers are gradually accumulated. These timeframes will be further extended to take account of post-intervention outcomes. For example, should a trial intervention, with a focus on re-offending, take three years to accumulate sufficient volumes for analysis, results will then be delayed for a further 15 months to enable a standard re-offending follow-up period to be completed. Consequently, it may take up to four and a half years to obtain results. Such timelines are routinely seen in RCT research, but are often incompatible with the decision deadlines of policymakers. This further limits the scope and applicability of RCTs as an evidence leader; and as Clear (2010) argues, such delays also have implications for the ability of RCTs to lead an innovative, future-focused vision of “what works”.

Matching design with outcome need (ITT vs TR)

Information needs should be matched with the specifics of RCT design to ensure the required outcome information is obtained. Broadly speaking, there are two frameworks for RCT analysis: Intention to Treat (ITT) and Treatment Received (TR). These measure different outcomes. The purest approach to RCT analysis is ITT (Hollin, 2012). ITT incorporates all those allocated to the treatment group, regardless of their individual progress, while TR includes only those who receive treatment. Thus, within ITT analysis those who complete a programme, those who drop out, and, perhaps, those who do not even start a programme are all included in the “treatment” group. ITT analysis, therefore, measures the effectiveness of an entire intervention (broadly defined), and does so in a more “real world setting”, where not all those who have access to treatment take it up (Grossman and MacKenzie, 2005).

Practically speaking, an ITT approach ensures that the maximum number of people is included in the treatment group, enabling accumulation of the necessary volumes for analysis in the shortest timeframe possible. Methodologically speaking, ITT ensures that the principle of randomisation (which is crucial to RCT claims to superior causative knowledge) is upheld. In contrast, TR analysis is restricted to those who received treatment: a sub-sample of the original treatment group. This approach is often favoured by practitioners and policymakers, since their focus is on understanding the impact of a programme on those who actually receive it.

Importantly, neither ITT nor TR is without limitation. ITT analyses may tell us only a limited amount about the effectiveness of the treatment in question (Hollin, 2012; Grossman and MacKenzie, 2005). In situations where there is a reasonable amount of treatment attrition (and the longer the duration of the programme or intervention being assessed the more likely this will be), combining the results of completers and non-completers may effectively cancel out the visibility of any positive treatment effect, or may even give the impression that a successful treatment intervention makes people worse. Moreover, ITT analysis does not explain why people dropped out of treatment, nor distinguish the degree to which this was a function of the intervention itself or merely a factor associated with the broader context surrounding it (Grossman and MacKenzie, 2005).
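
To make the contrast concrete, the toy figures below are invented purely for illustration: with heavy attrition, the ITT estimate for the same trial can sit much closer to the control group rate than the TR estimate, even though completers appear to do well.

```python
# Invented figures, purely for illustration.
allocated = 1000                # everyone randomised to the treatment arm
completers = 400                # those who actually finished the programme

reoffended_completers = 120     # 30% re-offending among completers
reoffended_dropouts = 330       # 55% re-offending among those who dropped out
control_rate = 0.50             # assumed re-offending rate in the control group

itt_rate = (reoffended_completers + reoffended_dropouts) / allocated
tr_rate = reoffended_completers / completers

print(f"ITT re-offending rate: {itt_rate:.0%}")   # 45%: only modestly below the control rate
print(f"TR re-offending rate:  {tr_rate:.0%}")    # 30%: looks far better, but completers are
                                                  # a self-selected (likely more motivated) group
print(f"Control re-offending:  {control_rate:.0%}")
```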

A TR model is also not without issue. In particular, by focusing only on those receiving treatment it is methodologically weaker than ITT owing to the biases inherent in its self-selecting sample. For example, if we were to take a TR approach to examining the effectiveness of a programme to reduce re-offending amongst recidivist family violence perpetrators, how likely is it that those who completed and those who did not complete the programme could be considered equally motivated to change their future behaviour? In presenting findings, researchers will often provide both ITT and TR results. This is useful; however, it is important to note that doing so does not overcome all limitations, since the issues with each approach cannot simply be resolved through recourse to the other.

Slow knowledge

Owing to the problems with volumes and follow-up times, RCTs represent an extremely slow method of accumulating knowledge. While this is partly about the pragmatics of RCT methodology, such as case volume requirements, more fundamentally it is about the ability of the method to deliver knowledge that meets the needs of the “what works” agenda. In policy terms the “what works” agenda is about a desire to understand and select (from amongst the endless array of options available to policymakers) those initiatives or programmes which best achieve particular outcomes. However, RCTs do not answer this question; rather, RCTs test whether a programme or initiative can be seen to influence a desired outcome. Thus, while RCTs are often heralded as the mechanism for understanding “what works”, this is a rather inflated claim. Even ignoring other limitations of RCTs, to the degree they inform policymakers about “what works”, they deliver this knowledge extremely slowly. As Goldacre observes (see McManus, 2009: 52-3):

“Each RCT provides only one bit of information: ‘yes’ or ‘no’ to a single question. Just as one could climb a mountain by asking at each step which way to go, so RCT-based medicine is progress, but it’s so very, very slow.”

One of the approaches used to overcome this sluggishness is the systematic review of similar studies. However, as outlined elsewhere in this article, the inability of RCTs to examine or understand the mechanisms that cause a result raises questions about whether, in doing so, we are “comparing apples with apples” (Hope, 2009: 130; Garg et al, 2008: 255). Beyond this issue, RCTs remain a relatively scarce commodity in criminal justice settings, which reduces the impact of summative analysis. For example, in an audit of RCTs published in one of the premier criminology journals – the British Journal of Criminology – between 1960 and 2004, Petrosino et al (2006) found evidence of only nine RCT studies, eight of which were published pre-1983 (see also Hough, 2010: 13). The commonality of null RCT results (those finding no significant difference between the treatment and control groups), combined with a publication bias in favour of significant results, is also an issue in this context (see Pawson, 2006; Stevens, 2011; Goldacre, 2008).

For some commentators, the delays associated with RCT knowledge production means they are “measures of desperate last resort when no better way exists to answer important questions” (Goldacre, 2008). Whether or not this is so, it is at least the case that RCT knowledge is unlikely to be the sole or primary evidence base for properly informed policymaking.

The policy limitations of narrow results

In addition to delays in knowledge production, the knowledge produced through RCTs is often of limited scope, which has implications for evidence-based policymaking. As Carr (2010: 8) points out, due to their narrow focus, experimentally-validated policies can ignore the wider context of interventions in ways that encourage adverse unintended consequences (see also Sampson, Winship, and Knight, 2013). Using the example of exhaustive “stop and frisk” by the Philadelphia Police Department (based on experimental evidence from a Kansas City initiative), Carr (2010) observes that what was not taken into account in the experimental research was the broader impact of this policy on racial relations between police and ethnic minorities (and therefore the ability of this experimentally-validated intervention to sustain crime reductions).

The implication of this issue for policymaking is clear: an intervention that has been experimentally shown to “work” in respect of one specific outcome, in one place, and at one time, does not necessarily represent good policy or offer sustainable benefits in the long term. Of course, this criticism can equally be levelled at other methods; the difference, though, is that other methods are not claiming “gold” status in respect of driving “what works” knowledge and improved social investment, and so are typically less “exclusive” in their approach (see Carr 2010; Sampson, 2010).

Conclusion

In his launch of the social investment approach in 2015, former Minister of Finance Hon. Bill English noted: “solutions to complex problems cannot be reduced to simple equations”. As this article has demonstrated, the same maxim holds in relation to evaluations of interventions that target complex problems. In reality, establishing “what works” is difficult and often uncertain, with definitive results few and far between. While RCTs can undoubtedly make a useful contribution to our knowledge on intervention effectiveness, they are by no means problem-free and, even when appropriate and well-implemented, do not always deliver results that can be generalised. Nor do they provide definitive answers about why an intervention works or for which types of people or settings interventions work best.

RCTs may be an important part of the “toolkit”, but they are simply that: “a part of”, not a superior replacement for, knowledge generated through other methods. While RCT results may appear deceptively simple, knowledge generated through this method (as in the case of all methods) has limitations and requires careful interpretation (and, critically, this interpretation is not theory free). Even where RCTs are useful in providing some knowledge, evidence produced via other methods is both crucial to the successful implementation of RCTs and provides the broader context within which RCT results can be more meaningfully assessed. Most importantly, when it comes to questions of how something works and why (needed to successfully transport an intervention to other settings or populations), RCTs must defer to other methods.

Returning to the subject matter with which we began this article, when someone chooses to exit a plane with a parachute strapped to their back, between air and ground they will need to survey the fast approaching terrain and use their skills and experience to steer themselves towards a safe landing point. Like parachuting, conducting evaluation work well requires an ability to scan and negotiate the theoretical and methodological topography, sight and manage obstacles, and follow the right trajectory towards an appropriate “landing”. This journey is often far from straightforward, and requires knowledge, skills, and experience. If in doubt, ask a researcher; don’t jump out of the plane and hope for the best.

References

Berk, R.A. (2005) Randomised experiments as the bronze standard, Journal of Experimental Criminology, 1, 417-433

Berwick, D.M. (2008) The Science of Improvement, Journal of the American Medical Association, 299, 10, 1182-1184

Bothwell, L.E. Greene, J.A. Podolsky, S.H. Jones, D.S. (2016) Assessing the Gold Standard – Lessons from the History of RCTs, The New England Journal of Medicine, 374, June, 2175-2181

Braga, A.A. Welsh, B.C. Bruinsma, G.J.N. (2013) Integrating Experimental and Observational Methods to Improve Criminology and Criminal Justice Policy, in Welsh, B.C. Braga, A.A. Bruinsma, G.J.N. (eds) Experimental Criminology: Prospects for Advancing Science and Public Policy, New York: Cambridge University Press

Carr, P.J. (2010) The problem with experimental criminology: a response to Sherman’s ‘Evidence and Liberty’, Criminology and Criminal Justice, 10, 1, 3-10

Clear, T. (2010) Policy and Evidence: The Challenge to the American Society of Criminology (2009 Presidential Address to the American Society of Criminology), Criminology, 48, 1, 1-25

Davidson, E.J. (2006) The RCTs-Only Doctrine: Brakes on the Acquisition of Knowledge? Journal of Multi-Disciplinary Evaluation, 6, November, ii-v

Davidson, E.J. (2014) Evaluative Reasoning, UNICEF Methodological Briefs: Impact Evaluation, no.4, Florence: UNICEF

Farrington, D.P. and Welsh, B.C. (2005) Randomized experiments in criminology: What have we learned in the last two decades? Journal of Experimental Criminology, 1, 9-38

Garg, A.X. Hackman, D. Tonelli, M. (2008) Systematic Review and Meta-analysis: When One Study Is Just Not Enough, Clinical Journal of the American Society of Nephrology, 3, 253-260

Gelsthorpe, L. Sharpe, G. (2005) Criminological research: typologies versus hierarchies, Criminal Justice Matters, 62, 1, 8-9, and 43

Goldacre, B. (2008) Bad Science. London: Fourth Estate

Grossman, J. MacKenzie, F.J. (2005) The randomized controlled trial: gold standard, or merely standard? Perspectives in Biology and Medicine, 48, 4, 516-534

Heckman, J.J. (1992) Randomization and social policy evaluation, in Manski, C.F. Garfinkel, I. (eds) Evaluating Welfare and Training Programmes. Cambridge: Harvard University Press

Hollin, C.R. (2008) Evaluating offending behaviour programmes: does only randomisation glister? Criminology and Criminal Justice, 8, 1, 89-106

Hollin, C. (2012) Strengths and weaknesses of Randomised Controlled Trials, in Sheldon, K. Davies, J. Howells, K. (eds) Research in Practice for Forensic Professionals, Abingdon: Routledge

Hope, T. (2009) The illusion of control: A response to Professor Sherman, Criminology and Criminal Justice, 9, 125-134

Hough, M. (2010) Gold standard or fool’s gold? The pursuit of certainty in experimental criminology, Criminology and Criminal Justice, 10, 1, 11-22

Hughes, G. (1998) Understanding Crime Prevention: Social Control, Risk and Late Modernity. Milton Keynes: Open University Press

Latessa, E. (2018) Does Treatment Quality Matter? Of course it does, and there is growing evidence to support it, Criminology and Public Policy, 17, 1, 1-8

McManus, I.C. (2009) Review of Bad Science by Ben Goldacre, Times Higher Education, 22 January, 52-3

Petrosino, A. Kiff, P. Lavenberg, J. (2006) Randomised field experiments published in the British Journal of Criminology, 1960-2004. Journal of Experimental Criminology, 2, 99-111

Pawson, R. (2006) Evidence-based policy: A Realist Perspective. London: Sage

Pawson, R. (2013) The Science of Evaluation: A Realist Manifesto. London: Sage

Pawson, R. Tilley, N. (1997) Realistic Evaluation. London: Sage

Sampson, R.J. (2010) Gold standard Myths: Observations on the experimental turn in Quantitative Criminology, Journal of Quantitative Criminology, 26, 489-500

Sampson, R.J. Winship, C. Knight, C. (2013) Translating Causal Claims: Principles and Strategies for Policy-Relevant Criminology, Criminology & Public Policy, 12, 4, 587-616

Smith, G.C.S. Pell, J.P. (2003) Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials, British Medical Journal, 327, 1459-1460

Stenson, K. Silverstone, D. (2014) Making Police Accountable, in Brown, J. (ed) The Future of Policing, London: Routledge

Stevens, A. (2011) Telling Policy Stories: An Ethnographic Study of the Use of Evidence in Policy-making in the UK, Journal of Social Policy, 40, 2, 237-255

Weisburd, D. Lum, C.M. Petrosino, A. (2001) Does research design affect study outcomes in criminal justice? Annals of the American Academy of Political and Social Science, 578, 50-70


[1] Lenth, R.V. (2006-9). Java Applets for Power and Sample Size [Computer software]. Retrieved 19 January 2018, from http://www.stat.uiowa.edu/~rlenth/Power