Preventing machines from lying: why interdisciplinary collaboration is essential for understanding artefactual or artefactually dependent expert evidence

This article demonstrates a significantly different approach to managing the probative risks arising from the complex and fast-changing relationship between law and computer science. Law's historical problem in adapting to scientific and technologically dependent evidence production is seen less as a socio-technical issue than as an ethical failure within criminal justice and state institutions. This often arises from an acceptance of epistemological incomprehension between lawyers and scientists, compounded by the political economy of criminal justice and safeguard evasion within state institutions. What is required is an exceptionally broad interdisciplinary collaboration to enable criminal justice decision-makers to understand and manage the risk of further ethical failure. If academic studies of law and technology are to address practitioner concerns, however, it may often be necessary to step down the doctrinal analysis to a specific jurisdictional level.


Introduction
Of course, machines cannot lie any more than, as Alan Turing observed, anything could be gained by asking whether they could think. Hence the term 'Artificial Intelligence' (AI). Turing reformulated the latter question as whether human interrogators 'could be taken in by cunningly designed but quite unintelligent programs?'. 1 Likewise, expert evidence that is unreliable because of error in or misunderstanding about data processing by computer programmes trained by Machine Learning (ML), hereafter 'artefactual/artefactually dependent evidence', clearly cannot be termed lies. Allowing verdicts or sentencing decisions to turn on unsound, potentially unsound or misunderstood artefactual/artefactually dependent evidence, or being complicit in the evasion of probative safeguards, is, however, normatively equivalent to negligently or knowingly colluding with perjury. This analogy, 2 reflecting the 'legal, moral and social foundations of criminal law', 3 highlights the importance of ensuring that the trial of fact is not invalidated by avoidable errors. Guidance published in 2022 by the Information Commissioner's Office (ICO) and the Alan Turing Institute (Turing) about explaining the use of AI/ML in decision making (hereafter the 'explainer approach') enables us to approach this probative issue also from public law and professional good practice perspectives, with analogous examples from medical AI/ML-assisted decision-making.
This contemporary perspective, together with earlier literature about electronic and artefactually dependent evidence, especially fingerprint comparison and forensic DNA, means that we question a 1990s turn towards an uncritical view of computers as a source of evidence. During the 1980s the potential fallibility of electronic evidence was acknowledged. For example, it was noted that 'computers must be regarded as imperfect devices' 14 and, where relevant, the court might expect to hear testimony relating to a computer system from its initial development to its use in the instant case. 15 Such caution 16 appears to have been put aside - possibly under pressure to follow what were seen as more efficient American and English civil admissibility rules developed from the 1960s 17 - and a PACE safeguard in respect of the admission of such evidence was abolished in 1997. 18 This consolidated English criminal evidence admissibility doctrine around what by then had come to be seen as a common law rule that computers are 'reliable'. 19 This rule has been enforced by judicial notice and, as will be considered in the second section, sometimes by statute, in all major common law jurisdictions, to expedite criminal proceedings by avoiding the need to prove what thereby doctrinally became 'obvious' facts. 20 This approach, however, was subsequently qualified in 2003 in criminal proceedings in England and Wales, so that where a representation made by a machine (including a computer programme) relies upon information supplied directly or indirectly by a person, it must be proven that the information supplied was accurate in order for the evidence to be admissible. 21 This article builds on the recognition in the Act of 2003 of the importance of the cognitive link between the human mind and computer processing by highlighting the importance of computer science's interface with the natural and/or social sciences in the production of artefactual or artefactually dependent expert evidence.
The 1997 PACE amendment - to implement a Law Commission recommendation - was undertaken with rare alacrity (almost on publication) and justified solely or primarily by Tapper's almost passing remark that 'most computer error is either immediately detectable or results from error in the data entered into the machine'. 22 Mason noted that the Commission ignored 'a great deal of technical material in the 1970s and 1980s [demonstrating] that software errors might not be obvious'. 23 He also endorsed Ormerod's argument at the time, based on Dillon, that where digital evidence is fundamental, the prosecution is not entitled to rely on a presumption to establish facts central to an offence. 24 In addition to looking more closely at Tapper's caveat about the scope 'for error in the data entered into the machine', we also draw on a 2020 article solicited by the Law Commission to review the basis for the 1997 amendment. 25 This authoritative deployment of scientific scepticism against the justification for the safeguard's abolition also provides valuable insights about the scope for malfunction within the technological aspects of artefactual or artefactually dependent expert evidence production.
juries from being 'unduly impressed' by proof derived from any apparently 'scientific mechanism, instrument, or procedure': People v. McDonald (1984) 37 Cal.3d 351.
17 The civil developments were summarised in C. Tapper, 'Discovery in Modern Times: A Voyage around the Common Law World' (1991) 67 Chicago-Kent Law Review 217.
18 The Youth Justice and Criminal Evidence Act 1999, ss.60, 67(3), Sch.6 repealed the Police and Criminal Evidence Act 1984, s.69(1) rule that computer evidence was only admissible if certified to have been working correctly.
19 'In the absence of evidence to the contrary, the courts will presume that mechanical instruments were in order at the material time': S. Mason, 'The presumption that computers are "reliable"' in Mason and Seng (eds.), above n.19, at 101.
20 Ibid. at 104-107.
21 Criminal Justice Act 2003, c.44, s.129.
22 Law Commission, Evidence in Criminal Proceedings: Hearsay and Related Topics, LC245 (1997), para.13.7; above n.18 at 248. The Commission failed to acknowledge that the article's subject was efficiency in English and American civil litigation and that the paragraph concerned criticised a 'quite extraordinarily lax' English statutory admissibility rule.
23 See Mason, above n.19, at 171-173.
24 Ibid. at 182-183; Dillon v R [1982] AC 484; see Law Commission, n.20, paras.13.15-13.22; D. Ormerod, 'Proposals for the admissibility of computer evidence' (1995).

A change of approach in legal proceedings is not required, however, to reverse undue deference to evidence production involving computers. A rigorously scientific approach to the use of Part 7 of the Criminal Practice Direction (Crim PD) would be sufficient. To be admissible in criminal proceedings in England and Wales, any expert opinion evidence must be (a) 'sufficiently reliable' (Crim PD 7.1.1(d)): the court is empowered to make a pre-trial determination of the reliability of such evidence, which includes consideration of the validity of any methodology employed by the expert (Crim PD 7.1.2(b)) and whether the expert's methods followed established practice in the field (Crim PD 7.1.2(i)); and, similarly, (b) reliance 'on an examination, technique, method or process which was not properly carried out or applied, or was not appropriate for use in the particular case' will be indicative of a lack of reliability (Crim PD 7.1.3(d)).
What is easily stated as doctrine, however, as we show, is not necessarily readily achievable in practice.
There is limited evidence to suggest that such challenges are being brought in practice and, as is noted below, despite various admissibility reforms, unscientific or problematic evidence generally faces weak scrutiny. Hence our argument that the identification of problems with the technical systems underlying artefactual or artefactually dependent expert evidence is reliant both on the expert's knowledge that such problems exist and on the expert's candour in revealing them. We explain why we are sceptical about the ability of defence experts or the court to independently identify and resolve potential problems. Hence, the thrust of this article is the need for clear and comprehensive expert candour.
We begin with an explanation of the interdisciplinary knowledge gap, and how this overlaps and interconnects with an institutional tendency to safeguards evasion and the impoverished political economy of criminal justice. The second section considers how the knowledge gap has been exemplified in formal procedural rules, caselaw and scholarship dealing with admissibility and the ultimate issue rule. This leads to the proposal (after noting highly relevant analogies with medical best practice) that for an expert witness to be effectively peritus they should become - in ICO and Turing terminology - 'explainers of risk in AI-assisted decisions'. The next section begins by analysing cultural inhibitors to narrowing the interdisciplinary knowledge gap, and then illustrates the critical importance of interdisciplinary knowledge in computer science CJS-focused research, development and use. The fourth section brings together the two themes - evidence in court and end-user relevant research - that emerge from the analysis. It suggests that being able to explain AI-assisted decisions or evidence production results in the type of professional insight needed to inform the development, operationalisation and upgrade life-cycle stages of AI/ML applications. Such coproduction 26 is essential for significantly and systemically reducing reliability risks in artefactual or artefactually dependent expert evidence.

The interdisciplinary knowledge gap, a tendency towards safeguards evasion and economic (organisational and ideological as well as quantum) influences
Unlike the speculative questions in Alan Turing's paper, problems and risks arising from poor AI-assisted decision-making in criminal justice are not theoretical. There is a long and shameful history of courts globally being 'taken in' by expert evidence based on what is sometimes, not inappropriately, referred to as 'junk science'. 27 Misrepresentation of or misunderstandings about genuine science have also resulted in major miscarriages of justice. 28 The 'infallibility' myth or zero error rate claims of expert fingerprint comparison evidence 29 long survived the first serious attempts to address admissibility systematically and scientifically, from the US Federal Rules of Evidence in 1975 to Daubert 30 in 1993, when US courts effectively grandfathered an inductive fallacy - 'the uniqueness of all human fingerprints' - rather than question the accuracy of the identification process. 31 These failures often stem from interdisciplinary knowledge gaps. This risk could be systemic in AI-assisted decisions. For example, intoximeter reliability depends on the accurate application of computer programming, chemistry, and biology. A natural science error can invalidate the results, even if the programming itself is flawless. 32 rr et al. have drawn attention to how, with Streamlined Forensic Reporting (SFR), 'assumptions that traditional legal safeguards will identify any weaknesses and strengths in expert evidence misses the valuable opportunity to properly consider the evidence before an admission of guilt may have to be made.' Irreversible plea or scope-of-defence decisions - usually made without expert advice - about the validity of scientific evidence predate the court hearing when admissibility safeguards apply. 33 Similarly with electronic/digital evidence generally, a failure to identify a potential defect in such evidence early enough may restrict the ability to challenge the reliability of the evidence.
Both are examples of a tendency towards safeguards evasion. The institutional evaluation and justification for the introduction of SFR was devoid of any 'idea about its impact on the quality of the evidence presented or the rectitude of outcomes'. 35 SFR effectively reintroduced 'zero error' fingerprint testimony 36 by a back door that a judge cannot close, having been 'validated solely against institutional efficiencies and related savings'. 37 Thus, the knowledge gap permits a further drift in safeguards evasion, propelled by economic objectives overriding fair trial principles. The risks are likely to be greater with electronic/digital evidence. An SFR report may be submitted by a police officer managing the investigation, who may simply report what he/she believes is 'useful' for their case, 38 and, presumably because of extensive backlogs of work in digital units, without digital specialist endorsement or necessary caveats being drawn to decision-makers' attention.
These three problems have multiple causes that cannot be analysed in a single article, but the economic problems are both endogenous and exogenous to the political economy of criminal justice systems.
Digital investigations 39 and AI/ML-dependent evidence production applications - initially, automated fingerprint identification systems (AFIS) - emerged after neoliberalism had become the dominant politico-economic ideology in pluralist democracies. 40 Interdisciplinary knowledge gap risks appear to have increased as criminal justice became more reliant on commercially developed equipment. It will be seen in the fourth section how black box/source code challenges are rarely possible in US courts because IPR outweighs fair trial principles. Economic considerations have threatened US professional ethics and probative reliability, with expert testimony vulnerable to contingency fee bias manipulation 41 and 'commercial pressure to make proficiency tests easier'. 42 Contemporary politico-corporate 43 literature places AI/ML devices in the sphere of 'disruptive technology' 44 or 'disruptive innovation', even if modestly stated as improving productivity and efficiency by 'minimizing administrative and operational overheads within policing'. 45 An example of this widespread trend, from health care, conceptualises technologically driven innovation as a rapid transition to new models of service provision, with more than a hint of workforce deskilling and of changes in legal frameworks to profit from the opportunities created by technology to manage 'rising demand, increasing cost and insufficient funding.' 46 Similar expectations can be found in unexpected contexts, for example, in 'reimagining the human role and contribution within the evolving human-machine cognitive system' for military decision-making, where the role of military commander needs to evolve from controller to teammate. 47 In this brave new world it could prove difficult to show where command responsibility for war crimes might lie.
A more immediate question for criminal justice professionals and policy makers, however, arises from whether research into the impact of AI/ML, by focusing on manufacturing and services industries, has generally failed to examine its impact on knowledge-intensive activities. Ribeiro et al. suggest - at least in the biosciences - that 'routine tasks do not necessarily disappear … and challenge the assumption that automation and digitalisation contribute to productivity in exclusively positive ways.' 48 While we cannot comment on whether this 'digitalisation paradox' also applies to criminal justice, the arguments in this article support the case for similar research into the automation of criminal justice knowledge-intensive work. This is not to deny the scale of the challenge created for criminal justice professionals and governmental budgets by the volume and complexity of crime arising from the digitalisation of everyday life. For example, the industrialised scale and organisation 49 of cybercrime can be seen from initial reports of a single international dark web police operation. The iSpoof takedown identified 59,000 potential suspects (with an estimated 200,000 victims in the UK alone) who purchased access to its cyber-fraud enabling services. 50 The use of the term 'Dark Web' for anonymous communication networks and services adds to the complexity of criminal justice responses. The TOR protocol has many legitimate uses, such as protecting journalists' sources, whistleblowers and access to uncensored information.
Dealing with rising demand in such sensitive and complex contexts requires the exercise of human discretion and judgement, even if such decision making can be usefully supported by automation. The approach adopted by the applicable Crim PD in England and Wales provides an alternative to rigid presumptions about the reliability or accuracy of digital processing. The Directions provide scope for all types of expert evidence to be tested, pre-trial, against a series of factors 52 to determine evidentiary reliability. Whilst doctrinally this should avoid rigid presumptions about any form of expert evidence, as discussed briefly above, there is limited evidence that, in practice, such challenges are routinely being brought. This article, by analysing the wide and varied scope for error in artefactual or artefactually dependent evidence, indicates that with such expert evidence there may be substantial risks of erroneous presumptions about reliability.
Guidance about making processes, services and decisions delivered by AI/ML intelligible by default has emerged from within the computer science community: hence the ICO and Turing 'explainer' approach that is applied in the fourth section to expert artefactual or artefactually dependent evidence. First, however, the next two sections analyse why this approach is potentially so valuable in criminal justice: respectively, because of the problems encountered in successfully adapting criminal justice systems to scientific developments (including computer science) to improve the quality of criminal justice; and the critical importance of interdisciplinary collaboration for ensuring successful adaptation or avoiding serious epistemological error.

Expert witness artefactual or artefactually dependent evidence: admissibility and the ultimate issue rule
The first question - is a digital specialist's testimony expert witness evidence or not? - might surprise some readers. It partly reflects a cultural legacy from the beginning of digital investigations/digital forensics in the 1980s. Basic evidential considerations were soon adhered to, as can be seen with the emphasis on chain of custody requirements in most descriptions of 'forensic soundness' in investigative work that relied on computers. Insufficient consideration, however, was given to the evaluation of the results or alternative interpretations of such results, 53 or even to whether digital artefacts recovered and analysed during a digital forensic investigation might have been tampered with before seizure. 54 The idea of digital work as a technical operation rather than a scientific practice may have been reinforced by the marketing of 'effectively idiot-proofed' applications.
Where such mindsets still prevail, 'digital forensic practitioners incorrectly assume that they are simply reporting what they observe and are unconscious of the interpretations and decisions inherent in digital investigations.' 55 In England and Wales such naïve confidence is likely to have been reinforced by the turn in legal and governmental thinking with consolidation around the common law presumption of computer reliability. This would have been amplified by a statutory requirement in many common law jurisdictions that results from government-approved AI/ML devices (e.g., intoxilisers and genotyping software) must be treated as accepted fact. In England and Wales intoxilisers approved under statutory powers are erroneously (see the fourth section) presumed to be reliable even following untested modifications. 56 The reliability of these systems cannot be readily challenged. 57 The admissibility of expert opinion evidence in most major common law jurisdictions 58 hinges on relevance and reliability. 59 The former results in the doctrinal requirement that expert opinion evidence is only admissible where it provides factfinders with 'scientific information which is likely to be outside the experience and knowledge of a judge or jury.' 60 For this the witness, as discussed above, must be competent or 'peritus'. 61 A general lack of statutory definitions/specifications for expert qualification and training provides opportunities for a range of errors and omissions, particularly the admission of opinion evidence by individuals who should not be treated as experts.
English and American courts appear to have glossed over the epistemological distinction between lay testimony of fact about reliance on or the use of a computer and expert scientific evidence about its reliability. Though, as Mason observes, there may often be a fine dividing line between lay evidence about the day-to-day operation of a system and expertise in its operation, and expert opinion about the operation of computer systems. 63 Within many common law jurisdictions, admissibility safeguards are applied with what US judges have described as a generally 'liberal' or 'permissive' approach; hence, admissible evidence might be 'shaky'. 64 Comparative analysis suggests that 'admissibility standards have not contributed to the exclusion (or informed systematic evaluation) of unreliable and speculative forms of incriminating opinion evidence in courts.' 65 English practitioner experience is that 'the working principle of assumed reliability appears to be the default position'. 66 Admissibility decisions about AI/ML-dependent evidence may turn on an exceptionally fine line. For example, the Court of Appeal in Dlugosz 67 endorsed the admissibility of expert statements initially based on AI/ML application results that were subsequently criticised by fellow scientists for substantial professional reasons: the testimony exceeded what reliable methodology in the scientific literature, training or standards would allow. 68 An alternative view on Dlugosz, however, suggested by Ward and based on an analogy with hearsay evidence (which must be 'potentially safely reliable' in the context of the evidence as a whole), is to recognise the potential value of 'expert evidence of weak or unknown probative value … adduced as one part of a body of evidence which taken together is arguably compelling'. 69 He sees indications in the judgment that the Dlugosz DNA evidence was seen 'as quite close to the borderline'.
Non-scientific evidence discovered because of the AI/ML DNA outputs probably convinced the Court of Appeal that justice had been done for the victim. Forensic Science Regulator (FSR) guidance subsequently issued in response to this case did not take an exclusionary stance. It confirmed that the evidence had been presented in a scientifically erroneous manner and advised how the results obtained from the AI/ML search application should have been presented more neutrally and with frankness about their weak probative value. 71 What we term the interdisciplinary knowledge gap prompted Ward to ask whether it is simply unrealistic to expect prosecution and defence lawyers, judges or juries to detect, unaided, 'the ways in which an expert's necessarily simplified account of the science unduly favours one party'. 72 One option is for judges and lawyers to keep abreast of scientific developments through the work of 'key epistemic "monitors"', such as the FSR and the US National Academies of Science (NAS), and 'if possible, to be informed of any cogent criticisms of those bodies' work.' 73 This begs the question, however, of whether busy lawyers and, we suggest, also investigators can be expected to keep abreast of, in the case of the FSR, voluminous guidance that is highly technical, subject to regular revision and written essentially for the relevant expert scientific communities. For historical reasons, the FSR guidance is generally remedial and - even in its 2023 statutory incarnation - is still an incomplete response to known problems, and progress in extending its coverage is necessarily slow. Evidence that is not Code-compliant remains admissible, and the weight to be attached to it remains a matter for case-by-case decisions.
The judicial gatekeeping role becomes even more difficult when admissibility turns on the significant minutiae of computer science. While one American judge learned to code in Java in preparation for a copyright dispute, 'do most judges even possess the technical knowledge to understand coding languages?' 75 In a later paper Ward canvasses another option for all forensic science testimony, identical to the explainer approach: the professional ethics and legal duties of expert witnesses should require the revelation of uncertainties - 'where there are possibilities of error, bias, disagreement or alternative explanation' - to assist CJS decision making. 76 The explainer approach is consistent with the ultimate issue rule: 77 even when a decision turns on a matter which the tribunal would be unable to understand 'without the assistance of experts', 'the power of decision is retained by the tribunal of fact'. 78 Expert witnesses should be careful to recognise 'the need to avoid supplanting the court's role as the ultimate decision-maker on matters that are central to the outcome of the case'. 79 Commentators have noted how the significance of this rule has been diminished, 80 or dismissed it as 'a matter of form rather than of substance'. 81 Strict compliance with the rule may certainly be undesirable in certain circumstances, such as diminished responsibility cases, where the clinical symptoms diagnosed by the expert are used to explain the events. 82 Hence, the jury were discouraged in Golds - a case involving expert evidence unchallenged by the prosecution - from making themselves 'amateur psychiatrists'.
Whatever the current status and detailed application or definition of the rule itself, there remains considerable authority 84 for the view that the evaluation of the reliability of expert evidence remains the role of the tribunal of fact - consistent with doctrinal analysis that separates the expert's and decision-maker's roles 85 - with a warning that experts must not trespass upon jurisprudential territory and should confine themselves 'to purely scientific questions, leaving open any issue as to the surrounding facts'. 86 Otherwise - as Biedermann and Kotsoglou have commented - with the court's complicity, an expert witness would usurp the factfinders' normative role, for example in making legally significant judgements, with the risk of incorrect identification and, hence, false incrimination of a defendant. 87 In England and Wales, moreover, the rule's function has been authoritatively preserved in guidance about how judges should deal, under Part 7 of the Crim PD, with any issues relating to the reliability of expert evidence raised pre-trial. This will form part of the judge's determination regarding the admissibility of the evidence. Where the evidence is sufficiently reliable to be admitted, any dispute as to the reliability of the evidence will be addressed in open court to assist the factfinder in judging the weight to be attached to the evidence. The Crown Court Compendium 88 offers guidance to judges on the direction to be given to juries, including in the following terms: "…as with any other witness, it is the jury's task to weigh up the evidence of the expert(s), which includes any evidence of opinion, and to decide what they accept and which they do not...
Any factors capable of undermining the reliability of the expert opinion or detracting from his/her credibility or impartiality should be summarised.The reliability factors listed in CrimPD Ch 7 reflect the common law, and should be used to assist the jury in evaluating and assessing the weight of the expert evidence.It may be that not all these factors will be under consideration during the evidence and therefore the direction and the factors should be tailored to the issues in the case." 89 As a result, the experts themselves (both prosecution and defence) have a key role in identifying issues of evidentiary reliability and in assisting the court to understand them consistent with the expert's overriding duty as an "objective and unbiased" 90 assistant to the court.
Expert witnesses - when diligently seeking to fulfil this assistive role - still need to overcome the problem noted by Bollé et al. that '[m]any existing ML approaches lack sufficient transparency and reproducibility for forensic purposes, and are not designed in a way that helps forensic practitioners evaluate and explain the outputs of automated systems effectively'. 91 The next section analyses the cultural and institutional inhibitors to overcoming the interdisciplinary knowledge gap and, more generally, how the lack of interdisciplinary collaboration may compromise the reliability of evidence or intelligence reliant on computer science CJS-focused research, development and operationalisation.

Approaching AI/ML artefactual evidence with interdisciplinary insight
AI/ML applications are typically developed in numerous stages (over two decades for automated facial recognition (AFR)) and at multiple sites. All of this (including false starts and problems in achieving accurate results) is recorded in the vast body of general computer science and technological literature. In practice, however, that literature is unlikely to be accessed by many criminal justice professionals.
The technological and scientific papers that record and present such developments are structured differently from legal literature and often contain detailed statistical data to substantiate the results. Such cultural inhibitors to interdisciplinary understanding predate AI/ML. When forensic science practice and statistics began to converge over the reform of fingerprint comparisons, a distinguished statistician referred to 'two communities divided by an apparently common language'. 92 Similar cultural inhibitions have been noted more recently in cybercrime studies. Techno-epistemic networks of experts (such as computer and data scientists, both in academia and in cybersecurity companies) have great digital capital in research but may lose sight of its 'socio-technical nature'. 93 Within medicine, early during the Covid-19 pandemic, concerns were expressed about the relationship between quantitative research scientists engaged in COVID-19 clinical trials and the AI/ML community. 94 The practical consequence of the interdisciplinary knowledge gap is that 'legal personnel have typically struggled to incorporate the advice and insights of mainstream scientific and technical organisations into their consciousness and practice.' 95 Conversely, the format and subdisciplinary structure of legal literature must be a barrier to many technologists understanding the jurisdictionally specific legal requirements that their programming must be tailored to achieve. Meanwhile, within digital forensics - an obvious interface for computer science and the law - relevant articles are not necessarily helpfully sign-posted, peer reviewed, or Open Access, and tend to deal with 'isolated forensic challenges'.
Where AI/ML issues are directly addressed in the socio-legal literature, including within 'the recent burgeoning of American techno-legal studies', 97 the focus is on AI/ML-reliant predictive policing (deployment, bail and sentencing decisions) 98 spheres, 100 and, like the relevant caselaw, it is overwhelmingly common law 101 and, for that matter, American. The nuanced manner in which different jurisdictions take note of and, up to a point, borrow from each other may not be readily apparent to computer scientists looking for clear and universally standardised rules with which their research must comply.
The importance of interdisciplinary insight can be illustrated in the rest of this section by examples of unreliable or unlawful AI/ML research and operationalisation.
We might like to think that some 'forms of evidence have unfortunately come and thankfully gone, including, phrenology'.102 However, two research studies about predicting criminality from facial appearance appeared in 2017 and 2020. Both reported a high level of accuracy at the proof-of-concept stage. Wu and Zhang recorded cross-validation accuracy of 97% with a data set of 1,856 facial images.103 Hashemi and Hall reported the same score with one of the classifiers used, but against a data set of 44,713 facial images, and claimed that their results were 'not biased to put people of a specific gender or race in a specific category while ignoring their criminal tendency.'104 The latter paper was quickly retracted, but solely because the research involving human biometric data had not received institutional ethics clearance.105 Presumably as a result of this, the authors did not respond to criticism of their research.
The research concept and the reported high accuracies were criticised as illusory.106 These responses originated in the computer science community, but drew on a combination of interdisciplinary knowledge including pertinent sociological and ethnographic insights (modified and slightly expanded in the summary of some of the issues here):

• Technical robustness: the exceptionally high accuracy of the 'proof of concept' claims in the two articles could reflect research design errors, such as the programme's ability to spot differences in metadata (e.g., comparator images may have been standardised as grey-scale photographs) rather than any inherent differences between the images themselves.

• Socio-legal error: AI/ML tools need to be jurisdiction specific because of variations in the social construction and temporal definition of 'crime' and 'criminal'. Consider, for example, how the treatment of possession and use of marijuana varies under US state laws, and that the decriminalisation of such behaviour is gaining traction. As Wu and Zhang acknowledged, a court conviction was not a reliable method for distinguishing between 'criminal' and non-criminal data sets.107

• Ethnographic error: Wu and Zhang argued that the high accuracy of their results was possible because all the images were of individuals of the 'same race'.108 This is consistent with how differences in the accuracy of different AFR systems reflect skin tone and gender bias in training data sets.109 They failed to acknowledge the equally high accuracy reported by Hashemi and Hall, who had used highly diverse US data sets. Their response, however, revealed a failure to distinguish between observable variations in facial appearance and what is now recognised to be a social construct, race.110 Racial or ethnic categories are socially fluid labels, often based on a less-than-fully transparent combination of self-identification or official ascription,111 and, while the risk of appearance-related bias has to be managed to avoid discrimination in many areas of research, race is not a source of empirically consistent, reliable information for AI/ML data training.

• Incompatibility of the original concept with a critical area of scientific consensus: Wu and Zhang did not accept that physiological and anthropometric theories of criminal appearance had long been discredited;112 the research concept also confused psychological research into the social perception of faces with the accuracy of such perceptions.113

There was also a public law issue in all EU and UK jurisdictions, not necessarily those where the research took place. Wu and Zhang's 'non-criminal' subset consisted of 1,126 images acquired without consent from the Internet.114 Similar activity took place on an industrial scale when more than 600 law enforcement agencies globally115 searched investigative facial images against images of known individuals harvested without consent in vast numbers from global social media by a commercial AFR developer, Clearview AI Inc. This resulted, inter alia, in data protection proceedings in Canada, Australia and other jurisdictions, with a £7,552,800 fine in the UK.116 As noted in the Canadian and Australian determinations, 100% accuracy claims were included in the marketing. Law enforcement agencies, presumably attracted by such claims and seeing the application as highly economically efficient and investigatively effective, either paid for access or tested the application (including in live investigations) in free trials.117 Yet most law enforcement officials did not understand how the technology actually worked. Nor, … did anyone know much about the company behind the technology.118 This article is focused on criminal proceedings and the need to ensure that CrimPD are used effectively, but this scandal is a reminder of the wider risks within the criminal justice system generally, where potential judicial safeguards are non-existent. The number of individuals wrongly associated with serious offending because of a lack of technological understanding within law enforcement, and thereby socially stigmatised or enrolled in suspect or safeguarding records, is unlikely ever to be known.
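The 'technical robustness' criticism above, that near-perfect accuracy can come from dataset artefacts rather than from faces, can be made concrete with a toy sketch. Everything here is hypothetical: synthetic 'images' are lists of (R, G, B) pixel tuples, and one class happens to have been stored as greyscale, mimicking a preprocessing difference between the comparator data sets.

```python
# Toy illustration (hypothetical data): a "classifier" that ignores image
# content entirely and keys on a storage artefact -- whether an image is
# greyscale (R == G == B for every pixel) -- can still score near-100%
# "accuracy" if one class was standardised as greyscale photographs.

def is_greyscale(image):
    """True if every pixel has equal R, G and B values."""
    return all(r == g == b for (r, g, b) in image)

def artefact_classifier(image):
    # Labels "criminal" purely from the storage artefact, not the face.
    return "criminal" if is_greyscale(image) else "non-criminal"

# Two tiny synthetic "datasets": one class standardised to greyscale.
criminal_set = [[(90, 90, 90), (120, 120, 120)] for _ in range(50)]
non_criminal_set = [[(90, 80, 70), (120, 110, 100)] for _ in range(50)]

correct = sum(artefact_classifier(img) == "criminal" for img in criminal_set)
correct += sum(artefact_classifier(img) == "non-criminal" for img in non_criminal_set)
accuracy = correct / 100
print(accuracy)  # 1.0 -- perfect "accuracy" with zero facial information
```

The point is not that either study used this shortcut, only that a proof-of-concept accuracy figure cannot by itself exclude such research design errors.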

Expert witnesses as 'explainers' of their AI-dependent findings and participants in AI/ML research and development
Good medical practice provides helpful guidance about the objectives that expert witnesses should be trained to achieve (consistent with the ultimate issue rule) when explaining the significance of the artefactual or artefactually dependent nature of their evidence in the instant case. Guidance about explaining AI-assisted decisions published by the ICO and Turing Institute in 2022119 illustrates how explanations should be given to patients in a high impact (life/death) situation. It is essential that they should understand how the diagnosis was made, including reliance on an AI/ML system. The explanation needs to be intelligible to patients who may not know how to query an AI/ML system output, by discussing, for example:

• The quality of data processing: how the data used by the application was collected, cleaned and used, and why it was chosen to train the model; also, information about safeguards to ensure it was accurate, consistent, up to date, balanced and complete.

• What is known about the application's performance metrics in terms of the available training data, and the healthcare organisation or third-party vendor that decided how accuracy should be assessed.

• Safeguards to ensure the system's robustness and reliability if used outside laboratory-controlled conditions.

When providing this information, doctors should 'indicate how much confidence they have in the AI system's result based on its performance and uncertainty metrics as well as their weighing of other clinical evidence against these measures.'
120 Such an approach does, however, contradict thinking in some influential circles about the transformational nature of AI/ML within criminal justice. For instance, the US President's Council of Advisors on Science and Technology (PCAST) suggested in 2016 that forensic analyses could be performed by an automated system or by human examiners exercising little or no judgment.121 Such a view is unlikely to comply with UK and EU data protection law, which variously provide for controls and remedies against a 'significant decision based solely on automated processing'.122 Though theoretically and doctrinally strong, the adoption of this approach also needs to overcome major practical limitations. Considerable investment is taking place in medical AI/ML applications under the guidance or scrutiny of multi-skilled, globally interconnected research and development teams that, inter alia, improve the medical profession's ability to explain the reliability of AI/ML generated reports. Physicians can view AI/ML generated data critically, for example, by seeing the risk score for a given source of information that contributes to a multiple-source prediction to identify potential errors.
123 Expert witnesses may often use applications that are comparatively rudimentary. Criminal justice is a much smaller and, in terms of economic theory, imperfect market. Until institutional investment in criminal justice AI/ML applications produces sufficiently transparent, detailed and comprehensive information about potential risks, significant qualifications may have to be given about the reliability of artefactual or artefactually dependent evidence, including, for example, where hardware changes and the black box issue (both considered below) make it impossible to judge how reliable the system was in the instant case.

Evaluation, reliability, accuracy and error
'Accuracy … is partly a question of objective facts and partly a function of striking an appropriate balance for the purposes at hand between tractable generalisations and exhaustive technical detail.'124 The above comment by an interdisciplinary group of authors (statistician, legal academic and forensic scientists), in a publication about uncertainties, statistics and probability, applies equally to computer science. It is germane to explaining the reliability of artefactual or artefactually dependent expert evidence. AI/ML offers the prospect of standardised and transparent statistical measurements and probability estimates for elements of expert evidence that are at present entirely subjective, especially feature comparisons (e.g., the measurement of latent fingerprint image quality and the probability of its corresponding to other fingerprint data, whether other latent prints or reference prints). Without the expected new AI/ML applications, the best that can be achieved for many feature comparison disciplines is to compare variations between different practitioners. Such error rate measurements are helpful in exposing methodological/conceptual flaws and reasons for biased results,125 but cannot guarantee the avoidance of significant error. Proficiency testing, at least if not undertaken blind and replicating casework-level difficulties, may have limited value,126 and all the experts tested could have made the same erroneous decision.127 Explanations of the accuracy of such new AI/ML tools, and of how this is determined by the quality and use of the training data, however, are never short and straightforward. Contrary to the impression created by marketing, numerous metrics are used for evaluating the performance of AI/ML applications, but no single measure is generally superior. Usually, combined metrics are required to gain an understanding of the credibility, validity, reliability and generalisability of a tool's performance.
The starting point for such enhanced competency lies in understanding how the classification model or 'classifier' sets parameters for AI/ML coding evaluation. In data science, an AI/ML model depends on the performance of the algorithm selected for its development. Common classification tasks, such as image recognition, can use algorithms such as support vector machines (SVM) or convolutional neural networks (CNN) for ML and AI models, respectively. There are dozens of algorithms available for classification and other tasks involving big data. Despite their different composition, they are all designed to deal with factors such as time complexity, scalability, update capability, capacity for generalisation, accuracy, degree of reliability, resilience, and potential impact on validity and verifiability. This is not the place to attempt a comprehensive summary of such issues, which would soon become out of date. The nature of the issues that we have in mind for an explainer's tool kit can be illustrated,128 however, as follows:

• How the classifier is trained: This requires a pre-existing dataset containing a correctly labelled set of samples, for instance, external images of firearms. Ideally, a predictive model should classify 100% of the samples in the dataset with the correct classification label. However, this does not guarantee that all previously unseen samples will be correctly classified. A training dataset is likely to be incomplete since there will always be new samples that have not yet been trained. Nevertheless, based on the generalisability achieved by classifying previous samples, it might be capable of classifying an unseen sample correctly.

• The limits of a simple accuracy score: accuracy is sensitive to outlying values in unbalanced datasets and for that reason can sometimes be misleading. For example, if the dataset consists of one malign value and 99 benign values, and the model predicts all 100 values as benign, the accuracy will be reported as 99/100 (0.99), missing the outlying malign value, which
might be a critical one. Accuracy scores can, however, be made more reliable by calculating a balanced accuracy score. The balanced accuracy score calculates the true positive (TP) and true negative (TN) rates, namely TP / (TP + false negatives (FN)) and TN / (TN + false positives (FP)) respectively, and divides their sum by two. A balanced accuracy score for the example of 100 benign predictions in a dataset of 99 benign samples and one malign sample would be 0.50. Such reporting is consistent with long-recognised criminal justice expert good practice of avoiding the presentation of accuracy as a singular number.129

• A clearer understanding of accuracy by calculating precision and recall values: In data science, as with clinical research and forensic genetics,130 these measures are commonly used and are equally important to accuracy scores, though with discipline-specific differences in terminology (what computer science calls 'recall' is known clinically as 'sensitivity'). Precision, or positive predictive value, can be described as the classifier's ability to correctly label the positive predictions (i.e., the true positives divided by the sum of the true positives and false positives). Conversely, recall measures the classifier's ability to find the actual true values (i.e., true positives divided by the sum of true positives and false negatives).
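The worked example in the text (99 benign samples, one malign, and a model that labels everything benign) can be checked with a short sketch, treating 'malign' as the positive class:

```python
# Confusion counts for the text's example: 99 benign + 1 malign sample,
# model predicts "benign" for all 100 ("malign" is the positive class).
TP = 0    # malign samples correctly flagged
FN = 1    # the single malign sample, missed
TN = 99   # benign samples correctly passed
FP = 0    # benign samples wrongly flagged

accuracy = (TP + TN) / (TP + TN + FP + FN)       # 0.99 -- looks excellent
recall = TP / (TP + FN)                          # 0.0  -- every malign sample missed
specificity = TN / (TN + FP)                     # 1.0
balanced_accuracy = (recall + specificity) / 2   # 0.50 -- chance level

print(accuracy, balanced_accuracy)  # 0.99 0.5
```

The headline 0.99 accuracy and the chance-level 0.50 balanced accuracy describe exactly the same predictions, which is why a singular accuracy figure can mislead.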
Such a toolkit could be of particular use to experts giving evidence in criminal proceedings, both as a means of establishing their competence to testify and as a basis for assessing the reliability of the evidence. In England and Wales, an expert's report should contain "details of the expert's qualifications, relevant experience and accreditation"131 as well as "such information as the court may need to decide whether the expert's opinion is sufficiently reliable to be admissible as evidence".132
It is important in the criminal justice context to stress that precision and recall are likely to be essential metrics. They are complementary to accuracy measurements because they are less sensitive to skewed datasets. Precision and recall should be viewed as a pair to give a clear view of the evaluation of the classification model's performance on the dataset. In cases where both precision and recall values need to be considered, the F-score (sometimes 'F1-score') is applicable. The F-score is the harmonic mean of precision and recall, presenting a type of average of the two metrics.
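A minimal sketch of the F-score as the harmonic mean of precision and recall; the confusion counts below are hypothetical, chosen only to show how the harmonic mean is pulled towards the weaker of the two metrics:

```python
def precision(tp, fp):
    # Of everything flagged positive, how much really was positive?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of everything really positive, how much was flagged?
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_score(p, r):
    # Harmonic mean: dragged towards the weaker of the two metrics.
    return 2 * p * r / (p + r) if (p + r) else 0.0

p = precision(tp=8, fp=2)       # 0.8
r = recall(tp=8, fn=8)          # 0.5
print(round(f_score(p, r), 3))  # 0.615 -- below the arithmetic mean of 0.65
```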
Although such evaluation metrics can compensate to some extent for skewed datasets, balanced datasets are preferable, and it is possible to create both a more reliable dataset and a more reliable model. The dataset can be shuffled like a deck of cards to spread the samples more evenly over the set. It can also be balanced by adding samples of the minority class, for example, by adding 98 malign samples to a dataset consisting of one malign and 99 benign samples. In addition to balancing the dataset used for training an ML/AI model, the same fraction of the dataset is usually not reused over and over again as the testing dataset, since that could bias the model if the training dataset contains only one type of value and the testing dataset only another.
In a small dataset of 100 samples, suppose 90 are pictures of hardware tools (hammers, wrenches and such) and ten depict firearms. If the same ten firearm samples are always used as the testing dataset, the model is trained only to classify hardware tools, resulting in a poor accuracy score for identifying firearms. To achieve a more precise performance evaluation score on an imbalanced dataset, the dataset can be divided into 'n-fold' partitions to be used in 'cross-validation': the dataset is split into n (usually five or ten) partitions and every partition serves as the testing partition once. N-fold cross-validation means the testing dataset is not constantly the same fraction of the dataset (i.e., not only the same ten samples of firearms as in the previous example), and hence the training of the model becomes more nuanced and ultimately the results have greater accuracy.
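The n-fold procedure described above can be sketched in a few lines. The dataset (90 'tool' labels, 10 'firearm' labels) mirrors the example in the text; the shuffling seed and fold count are illustrative assumptions:

```python
import random

def n_fold_partitions(samples, n=5, seed=0):
    """Shuffle, then split into n roughly equal folds; each fold serves
    once as the test partition while the rest train the model."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n] for i in range(n)]
    for i, test in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train, test

# Hypothetical imbalanced dataset: 90 "tool" labels, 10 "firearm" labels.
data = ["tool"] * 90 + ["firearm"] * 10
for train, test in n_fold_partitions(data, n=5):
    # After shuffling, firearms are spread across folds rather than
    # concentrated in a single, always-identical test partition.
    print(len(train), len(test), test.count("firearm"))
```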
Expert witnesses have been expected to pay particular attention to the relationship between sample size, in which we include training datasets, and potential inaccuracy.133 'The "power" of machine learning in recognising patterns is proportional to the size of the dataset; the smaller the dataset, the less powerful and less accurate are the machine learning algorithms.'134 Kokol et al have suggested that the solution to this problem, which affects many activities outside criminal justice, might be for learning to be generalised on datasets from various fields so that many different small datasets might become a big dataset. While this may be feasible for many types of now-automated economic activity, such as contract reviews and financial audits, it may not be a practical or, recalling the risks revealed by the Clearview example, lawful way forward for sensitive personal data within the criminal justice context.135 There is also the issue of legal variation between jurisdictions, to which we shall return in the final subsection.

Technological and technologist anticipative issues
The critical importance of training data has been indicated above. Beyond the proof-of-concept stage, a large and diverse dataset must be used for training the programme, and predictions need to be tested using data that was not used in any way during model training. There is considerable knowledge about the problems caused by data (often termed 'algorithmic') bias during AFR development. White males were over-represented in the initial datasets used during the training stage, and the images used had been created on film whose chemical formulae were designed to produce sharper images of light skin tones. Unaware of this, programmers did not anticipate how accuracy would be skewed for non-light-toned people.136 Today, computer scientists have a better understanding of the causes of inaccurate or biased performance. Some problems may nevertheless be missed for years, for example, learning reinforcement. This can occur when the programme is trained to invent ways to accomplish tasks, in effect, by penalising or rewarding it for achieving specific objectives. It may respond by 'wireheading', inventing 'short cuts'. Background cues or scene biases in the dataset may create shortcut opportunities to recognise the primary objects, or shortcuts may arise from the source, acquisition or preparation method of the data samples. Programmes may learn from the presence or absence of ancillary tokens in images, including originator logos or the position of such logos in video frames of pornographic images, and classify images on that basis rather than on the content of the image itself.137 This was only recognised as a general problem in 2016, but publications discussing examples of the phenomenon can be traced back to 1983.138 Other causes of technological risk include:

• Inadequate scalable oversight: the programme's continued adherence to the intended objectives needs to be frequently evaluated during programme training.
139 • 'Robustness' of the programme in operational conditions: 'harsh real-world conditions' need to be modelled and tested during the training process.140 Classification thresholds (the measurable amount of correlation required for a classification to be recorded) set by the programmers may not allow for typical environmental variations (e.g., lighting and camera quality) that affect critical inputs.141

• Pre-operationalisation validation testing parameters: the reported accuracy of proof-of-concept or 'developmental validation' studies only holds good for objectives set within the parameters of laboratory testing.142

• Programme upgrades: upgrades that provide more functions or remedy identified defects may result in new modes of operation that have not been previously tested.143

While this paper focuses on risks intrinsic to AI/ML programming, hardware changes may be equally significant and possibly more likely to go unremarked. Changes such as memory size or available disk space may modify the programme's operation or cause it to behave unpredictably. 'There is not even a theoretical technical solution to this drawback that will lead to reliable practical countermeasures.'144 It is difficult to see how a court can be satisfied about the methodological soundness145 with which artefactual or artefactually dependent evidence is produced without the expert witness being able to produce and explain (in the terminology used by the ICO and Turing Institute) evidence-assurance documentation146 covering any relevant issues considered immediately above. The problem here is whether defence counsel are aware of the risks described in this article and of how they may apply to the instant case. The rarity of reported CrimPD Part 7 challenges suggests they are not.
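The classification-threshold point above can be illustrated with a toy sketch. The scores, subjects and threshold are all hypothetical; the point is only that a threshold tuned under laboratory conditions can silently suppress matches when operational inputs degrade (e.g., poorer lighting):

```python
# Hypothetical match scores for the same five subjects under two
# capture conditions. The threshold was set under laboratory conditions.

scores_good_lighting = [0.91, 0.88, 0.86, 0.40, 0.35]  # genuine matches score high
scores_poor_lighting = [0.72, 0.69, 0.66, 0.40, 0.35]  # same subjects, degraded input

def matches(scores, threshold):
    # Count how many scores clear the classification threshold.
    return sum(s >= threshold for s in scores)

threshold = 0.8  # chosen by the developer under laboratory conditions
print(matches(scores_good_lighting, threshold))  # 3 -- all genuine matches found
print(matches(scores_poor_lighting, threshold))  # 0 -- same subjects, all missed
```

Nothing in the system's output signals the difference: it simply reports no matches, which is why operational-condition testing and evidence-assurance documentation matter.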

Anticipating how end-users understand and operate AI/ML systems
Tschider noted from the techno-clinical literature that 'two of the most crucial choices an AI designer makes are the mechanisms for immediate feedback and correction'. If 'a system trains on data from hospitals with a high degree of resources - such as the newest technologies and the most highly trained practitioners - the model the AI system creates will be oriented towards high-resource use and may not be as effective as one trained on low-resource environments.' To avoid this disparity, training data should be representative of the population or community where the AI might be used.147 This suggests, at a minimum, that end-user involvement at the inception of an AI/ML project, ideally at the proof-of-concept stage and no later than the development stage, is critical for any evidence-assured system. In criminal justice, the kinds of experience, behaviour and risk that expert collaboration would enable software developers to anticipate include:

• Expert knowledge and manipulation of existing systems: By the early 1980s, at least in parts of the USA, fingerprint examiners specifically tailored their latent print annotations when encoding data, in line with observations about the responsiveness of different proprietary black box AFIS programmes to metadata variations in input data.148 Such expertise, and the professional culture that gave rise to such behaviour, needs to be understood by programmers as early as possible during system development.
• Expert competence: Increasingly, professional and organisational competency in the production of expert evidence is quality assured,149 but, as noted earlier, for well-understood reasons in England and Wales the FSR is having to concentrate on remedial responses to known problems and progress is necessarily slow. Whilst quality assurance of professional and organisational competence will not establish an expert's competence to give evidence in criminal proceedings per se, it may form part of the basis for making such a determination. A US sentencing case, involving a predictive incarceration issues report, illustrates the extreme end of the risk continuum. The report was skewed by arithmetical error, double counting and conclusions not supported (even after allowing for input errors) by the ML-encoded tool outcomes.150 More deeply seated problems within the institution where evidence is produced or commissioned, however, are more difficult to identify and assess. NIST has begun to trial a practitioner competency assessment methodology to measure this both individually and with reference to demographic characteristics (workplace environment, education and work experience).151 Their methodology, however, has yet to be proved successful, and initially it covers only mobile and hard-drive forensic investigation. It is certainly unlikely to equal the obligations placed on individuals under Part 7 CrimPD, if counsel are sufficiently knowledgeable and resourced to activate these safeguards in relevant cases.
How would the 'explainer' approach resolve the black box/access to source code issue?
The degree of accuracy and predictability embedded in the operation of source code is critical for the reliability of artefactual or artefactually dependent evidence: the code dictates which tasks a computer program performs, how the program performs the tasks, and the order in which the program performs the tasks.152 The opacity deepens when the programme also uses neural networks or deep learning systems, because specific weightings are added to relationships between data elements. Even where an explanation is possible, she suggests that it may not provide 'the kind of information needed to actually evaluate risks of unfairness, discrimination, safety, or other social impacts.'160 This view is shared by many computer scientists.161 Imwinkelried has suggested a judicially managed two or three stage process for resolving source access disputes. Firstly, the defence must convince the court that the validation information available for the tool (i) does 'not adequately address the effect of a specified, material variable or condition present in the instant case' and (ii) that this could plausibly affect the verdict. Secondly, if that bar is passed, up to two more steps should follow: (a) a new validation study, focused on the instant case issues, by a defence expert and, if that does not resolve significant expert disagreement, (b) an opportunity for the defence team to examine the source code to assess the accuracy of the results cited in the prosecution case.162 England and Wales, however, lacks a judicially managed expert evidence dispute resolution procedure. Such a statutory framework was suggested by the Law Commission in 2011163 but ultimately rejected on cost grounds.164 The alternative approach adopted, as discussed above, was the introduction of amendments to the CrimPR and the associated CrimPD, including the introduction of criteria to assist the court with the pre-trial assessment of reliability165 and provision for, inter alia, pre-hearing discussions between experts.
166 The jury is therefore left to determine the potential effect of the AI/ML system's operation on the evidence, but this will only be possible in practice if sufficiently comprehensive explanations are provided by the expert witness(es).

The need for expert witnesses/end-users to participate in AI/ML research and development: jurisdictional specificity and an illicit trading example
The relationship between jurisdictional specificity and the reliability of AI/ML applications has already been noted in this article. What constitutes unlawful behaviour (a) may change over time (e.g., marijuana decriminalisation), (b) may differ significantly between jurisdictions (e.g., the scope to allow marijuana sales in the Netherlands but not the UK), and (c) may vary in its substantive elements within jurisdictions (in the Netherlands marijuana can be lawfully sold only to Dutch citizens).
Similarly, some cryptomarket vendors 'offer firearms that have different legal statuses based on the parties' location and jurisdiction'.167 Bergman and Popov's PDTOR research has demonstrated how this problem could be resolved by giving end-users access to an annotation tool that would enable them to ensure that automated search criteria for illegal internet transactions are jurisdiction specific and easily updated should the law change. Tool users would not depend on new releases of the model to maintain the empirical accuracy of data classification. Users could themselves maintain empirical accuracy, for example, when monitoring illicit firearms trading, by adding images of new types or novel modifications of firearms observed in illicit marketplaces.
This has been published as the results of proof-of-concept validation research to create an annotation tool to improve the reliability of Tor cryptomarket surveillance.168 The operation of this annotation tool can be summarised as a four-step process:

1. The forensically sound (i.e., chain of custody) record of the manual capture, by an investigator, and preservation of a Web page and its metadata as an annotated/annotatable dataset, and the storage of this artefact within an archive.

2. Each artefact is automatically indexed and accessed via a server so that multiple investigators can record in the archive their judgement about the artefact's classification, using a Web browser that allows full data visualisation (i.e., the page as seen on the web, plus its metadata as originally captured and, separately, annotations by colleagues).

3. Within the chain of custody record, the archiving of an annotator consensus agreement or the statistical calculation of the degree of variation between annotator judgements about the quality and accuracy of each dataset.

4. Relevant annotated artefacts could then be used to create a training dataset for the unsupervised programming of a web crawler to search the Dark Web and to capture and archive ('scrape') additional artefacts, selected by an AI/ML-based classification model, that conform to the quality and accuracy parameters created and recorded during the annotation process.
From a jurisdictional perspective, stage 2 is critical. It would enable criminal justice experts to ensure that any artefact is only confirmed as evidence of unlawful activity against the substantive criminal law at the time of annotation. These parameters are then embedded for stage 4 when, theoretically, they cannot be changed, irrespective of how the black box/source code of the crawler operates.
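Step 3 of the process, recording annotator consensus or the degree of variation between annotators, can be sketched minimally. The function and labels below are hypothetical illustrations, not the PDTOR implementation:

```python
from collections import Counter

def consensus(annotations):
    """Return the majority label and the agreement rate among annotators
    for one archived artefact. A low agreement rate would flag the
    artefact for review rather than inclusion in the crawler's
    training dataset."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

# Hypothetical annotator judgements for one captured web page.
labels = ["unlawful-sale", "unlawful-sale", "unlawful-sale", "lawful"]
label, agreement = consensus(labels)
print(label, agreement)  # unlawful-sale 0.75
```

Archiving both the majority label and the agreement rate, rather than the label alone, preserves the degree of annotator variation within the chain of custody record.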
At this proof-of-concept stage, the results achieved for AI/ML-encoded crawler data capture using four classification algorithms were balanced accuracy rates of between 85% and 95% against a small set of 150 HTML web pages, mostly from dark web marketplaces. The next stage of the research will involve a bigger dataset. It will also test the tool against better protected Tor web pages, include further examination of the resilience of the chain of custody for archived data, and widen functionality to include the annotation of graphic content located on the Dark Web. The tool could be incorporated within a criminal justice data system (subject to rigorous connectivity validation) rather than the free-standing database used for this proof-of-concept research. The tool is highly adaptable: as indicated in a later study,169 it has been proved suitable for use (including capturing images) with any dark or clear (surface) web crawler and is easily reconfigurable for Clear (Surface) Web surveillance by using a different browser. The article explains why dark web scraping, irrespective of the technology used (e.g., the multi-threaded distributed crawling engines used for clear web commercial services), is significantly slower and more labour intensive. This reflects how ANC network speeds are significantly slower by design, and the need for pseudo-random delays to evade (not always successfully) security features. Slower scraping, however, allows investigators to 'invigilate' and steer the process for probative purposes, in a cyberspace location with deliberately intensified volatility, as servers reportedly disappear regularly from such networks. At the time of writing, what the project lacks is end-user participation, largely because of the pressure of work on criminal justice digital experts.

Conclusions
In this article we set out to explain an approach significantly different from the mainstream techno-legal literature for examining the complex and fast-changing relationship between law and computer science. An historical inability to adapt to scientific and technologically dependent evidence production is seen primarily as an ethical failure within criminal justice and state institutions. This often arises because of the acceptance of epistemological incomprehension between lawyers and scientists. It is compounded, however, by the political economy of criminal justice and safeguard evasion within state institutions.
In England and Wales doctrine distinguishes between expert witness competence to give evidence in any circumstances and whether the evidence in the instant case is admissible and, if so, what probative value it carries. The practice of giving expert evidence has been reformed significantly with the CrimPD. Also, since 2023 scientific evidence producers have been required increasingly - but not comprehensively - to confirm institutional and specific testimonial conformity with statutory standards that are subject to ongoing revision. This is matched by cultural change within a senior judiciary now committed to supporting the 'enormous strides in getting forensic science set on a course of absolute science, rather than old wives' tales or police lore'.170 However, such advances in themselves may be insufficient. This caveat is highly relevant to expert opinion evidence that relies on AI/ML applications so that it is either artefactually dependent or wholly artefactual. In such circumstances, for expert witnesses to be effectively peritus and to assist the court in determining reliability and decision-makers in assessing the weight of evidence, it is not enough for them to be competent to give admissible evidence because of their knowledge of a field of forensic activity, for example, forensic genetics or digital forensics. They must also be able to describe potential risks or weaknesses in their evidence that arise because of the interrelationship(s) developed through computer science between their own disciplinary expertise and other sciences (not necessarily just STEM disciplines).
Looking beyond England and Wales and at the wider implications of this article, it is simply unrealistic to expect legal professionals - without the proactive assistance of expert witnesses - to have sufficient scientific expertise to ensure in such circumstances that unreliable evidence is deemed inadmissible, and that weak scientific evidence is presented accurately and fairly - that is, with all necessary caveats - to the factfinders. Science today is too diverse, in both its theoretical and applied aspects, for professionals in other fields necessarily to identify problems as they arise in individual cases. The interdisciplinary knowledge gap will be amplified as criminal justice decisions increasingly become AI/ML-assisted decisions, where, in addition to computer science, relevant evidence may require knowledge of other sciences.
Four key principles emerge from our analysis of the risks that arise from expert opinion evidence production that is either artefactually dependent or wholly artefactual:

1. Interdisciplinary insight is essential, with opinion evidence co-produced at the interface of law, computer science and, variously, other STEM disciplines and social sciences.

2. Lawyers and investigators cannot be relied upon to identify significant risks that may affect the credibility that decision-makers, especially factfinders, might accord to opinion evidence that might be highly material to the verdict.

3. The ICO/Turing explainer approach to AI/ML-assisted decision-making is highly relevant for framing professional standards for both the producers and users of such evidence, in a way that is broadly adaptable to jurisdictionally specific doctrinal and organisational requirements.

4. There is an urgent need to develop law, public policy and practice on these matters to overcome institutional and cultural tendencies towards safeguard evasion, for example, weaknesses arising from SFR in the UK and the likely primacy of commercial confidentiality over fair trial protections in the USA. The explainer approach appears to be potentially valuable even in the different and difficult circumstances created by the US constitution.

The article has also demonstrated how an understanding of medical good practice that has evolved for managing the use of AI/ML applications is an important source of insight for both researching and developing/implementing AI/ML safeguards for criminal justice. It has also linked the criteria for expert witness competency and training for fair trial purposes with the value of such experts engaging in critical and transparent collaboration with computer science researchers and developers throughout the life cycle of such AI/ML applications, including the development and validation of later versions of applications introduced into use.


• How the classifier's accuracy/reliability is demonstrated: One or more of the different evaluation algorithms generally available for computer science research is used to assess how many samples in the existing dataset it was able to classify correctly. By dividing the dataset into a training dataset and a testing dataset, usually in a split of 80% and 20% respectively, where the model has never seen the testing dataset, it should be able to generalise its classification model to classify those test samples. Since 100% of the dataset is available, the number of correctly/incorrectly classified samples from the testing dataset can be examined, and the different metrics from each evaluator can provide a range of accuracy measures.

• How the classifier's accuracy is reported: If the classifications are all correct, the data science metric 'accuracy' is 1.0 (or 100%). That is simply the fraction of the correctly classified samples out of the total number of classifications. Such measurements, however, are only part of the evaluation: precision and recall offer complementary measures.126

126 See Gardner and Neuman, above, n. 42.
127 I. E. Dror et al., 'Biasability and reliability of expert forensic document examiners' (2021) 318 Forensic Science International 110610.
128 For a more detailed discussion of accuracy, precision and recall see I. H. Witten et al., Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn (Morgan Kaufmann: Amsterdam, 2011), 163-175.
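The 80%/20% hold-out evaluation and the plain 'accuracy' metric described in these bullets can be sketched as follows. This is a generic illustration, not the tool's actual evaluation code, and the function names are our own:

```python
import random

def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle and hold out a test set that the model never sees
    during training, so the accuracy figure reflects generalisation
    rather than memorisation of the training data."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train, test = indices[:cut], indices[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])

def accuracy(y_true, y_pred):
    """The fraction of correctly classified samples out of the total
    number of classifications, as described in the bullet above."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative use: 100 samples split 80/20.
samples = list(range(100))
labels = [s % 2 for s in samples]
X_train, y_train, X_test, y_test = train_test_split(samples, labels)
```

Note that the fixed seed makes the split reproducible, which matters if an evaluation is to be re-run and checked by another expert.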