Usage Benefits, Risks and Cost
For any AI tool to be successfully integrated into clinical practice, stakeholders should first clearly identify areas that need improvement and define relevant key performance indicators.52,53 The integration of an AI tool may then be part of a larger strategy devised to attain the goal set for the institution. Alternatively, a particular AI tool proposed by a vendor may offer the potential to improve the quality of the institution's services in an area not previously considered. In either case, as outlined in Section 4A, it is essential to determine whether or not the tool solves a real, specific problem that the institution has; tools are solutions, and a solution to a non-existent problem has no value. Note also that different institutions have different problems; a tool that is valuable for one group may not have value for another.
For the positive impact of an AI tool to be measurable, objective and quantifiable goals should be set. It may be useful to consider both what proportion of cases or patients an AI tool is expected to impact and what the magnitude of impact on each case or patient is expected to be. Purchasers should be aware that the radiologist or the radiology department need not be the sole beneficiary of an AI tool's potential for improvement. Ideally, all stakeholders involved, from the patient requiring a service to the respective institution and even wider society, could benefit from AI being successfully implemented in a clinical workflow. An example of a strong use-case could be AI as a supporting tool in high-volume radiological screening settings (e.g. mammography). In this case the benefits for patients could include earlier and better detection of breast cancer, leading to better overall outcomes, while benefits for radiologists could include increased productivity, the availability of an additional "safety net", or more time available for interaction with the patient.54 Apart from improvements in productivity and service quality reflecting positively on the institution, such gains could also help reduce costs, while for wider society positive effects on overall healthcare costs and population health could be envisioned. Similar effects could be expected for other commonly suggested use-cases, such as the detection of large vessel occlusions or other time-sensitive situations. However, for other applications, such as organizational AI support tools or applications that remain largely research-driven (such as AI-powered opportunistic screening), the benefits might not be as easily definable.55,56
Depending on the local circumstances and the healthcare system in place, such potential benefits need to be carefully weighed against their immediate and mid- or long-term economic impact. Return on investment (RoI) and cost–benefit analyses should be planned and carried out to ensure the viability of the planned AI integration. Depending on the healthcare system, establishing a viable payment mechanism for AI use may be critical. AI models that primarily benefit a fee-for-service hospital or outpatient imaging center demonstrate RoI by decreasing length of stay,57 improving throughput in the emergency department,58 increasing the volume of findings that require follow-up and/or treatment, decreasing the time it takes to perform an imaging exam, and improving operations in the radiology department. Other potential benefits to the radiology practice include decreased mental fatigue, improved radiologist recruitment and retention, and decreased medical malpractice liability, although these benefits tend to be additive, as they do not generally cover the cost of the AI on their own.
Lastly, potential costs (both capital and recurrent) and risks associated with the implementation and usage of an AI system are essential components of any purchase analysis and decision. In part, risk assessment can be facilitated by consulting the risk matrix and the risk–benefit analysis provided in the regulatory filings by vendors. However, some risks may not be addressed in such regulatory filings or may only become apparent during use. The most obvious component of cost is the licensing fee paid to the vendor, but this is typically only a small part of the total cost of ownership. Other sources of cost include contracting and legal agreements, IT effort and professional services for integration with existing systems, training for users and administrators, infrastructure for running the AI, and ongoing maintenance and monitoring.
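To make such a cost–benefit analysis concrete, the sketch below tallies a total cost of ownership from the components listed above and compares it with an estimated annual benefit. It is illustrative only; the cost categories, case volumes, and monetary figures are hypothetical placeholders to be replaced with local numbers, and harder-to-quantify benefits (reduced fatigue, recruitment, liability) would be noted qualitatively alongside it.

```python
# Illustrative RoI / total-cost-of-ownership sketch.
# All figures are hypothetical placeholders; substitute local numbers.

annual_costs = {
    "licensing": 60_000,            # vendor licence fee
    "it_integration": 15_000,       # amortized IT effort / professional services
    "training": 5_000,              # user and administrator training
    "infrastructure": 10_000,       # compute or cloud hosting
    "maintenance_monitoring": 8_000,
}

# Expected benefit: proportion of cases impacted x assumed value per impacted case.
annual_exam_volume = 40_000
proportion_impacted = 0.03          # e.g. 3% of exams where the AI changes management
value_per_impacted_case = 150       # assumed net financial value (throughput, follow-up, etc.)

total_annual_cost = sum(annual_costs.values())
annual_benefit = annual_exam_volume * proportion_impacted * value_per_impacted_case

roi = (annual_benefit - total_annual_cost) / total_annual_cost
print(f"Total annual cost: {total_annual_cost:,}")
print(f"Estimated annual benefit: {annual_benefit:,.0f}")
print(f"Return on investment: {roi:.1%}")
```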
Other essential factors in making an informed decision include evaluating whether the vendor is a compatible and reliable partner, the vendor's staying power in a competitive environment with limited payor reimbursement (even more important in this era of AI vendor consolidation), optimized model pricing, and opportunities for collaboration beyond product purchase, such as co-development and product resale.
A key component of risk is understanding how the algorithm is likely to perform in the environment in which it will be used. The error rate in use may differ substantially from what was reported in testing, particularly when the characteristics or distributions of the input data (e.g. scanner manufacturers, scan protocols, patient demographics, disease prevalence, comorbidities) differ from the test data. Ideally, each site considering implementation would perform a statistically rigorous evaluation of performance on its own local data (a method for this evaluation is presented in the Clinical Evaluation Section below). In practice, this may not be feasible. At a minimum, the characteristics of the local data should be compared with those of the test data (a typical example might be a model tested only on one manufacturer's MRI scanner that will be used on a scanner made by a different manufacturer). Where these are similar, the reported performance metrics may be relied upon with some confidence; where they are not (e.g., an algorithm tested only on adults being considered for off-label use in a pediatric hospital), one should proceed with great caution, if at all. Error frequency, conceptually the inverse of performance, is not the final word on risk, because different errors pose different risks. One should consider the detectability of the errors that are anticipated: for each error, what is the probability that people in the workflow will notice that the AI has produced an erroneous output? For each detected error, what is the probability that the error will be corrected? Finally, if an error is not detected or not corrected, what is the expected impact on patients or other stakeholders? Considering error frequency, detectability, correctability and impact together provides a framework for assessing the direct risk of algorithmic errors. Ongoing monitoring of these risks is considered in Section 7.
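As one way to apply this framework, the sketch below scores each anticipated error type by its frequency, detectability, correctability and impact. The specific error types, numbers, and the multiplicative combination are assumptions chosen for illustration, not a validated risk model.

```python
# Illustrative risk scoring using the frequency / detectability /
# correctability / impact framework. Error types, numbers and the
# multiplicative weighting are assumptions, not a validated model.

anticipated_errors = [
    # (name, expected errors per 1,000 cases, P(detected), P(corrected | detected), impact 1-10)
    ("missed small pneumothorax", 5.0, 0.70, 0.95, 8),
    ("false positive nodule flag", 30.0, 0.90, 0.99, 2),
    ("wrong-laterality annotation", 1.0, 0.50, 0.90, 6),
]

def residual_risk(freq_per_1000, p_detect, p_correct, impact):
    """Expected impact per 1,000 cases from errors that slip through the workflow."""
    p_uncorrected = 1.0 - (p_detect * p_correct)
    return freq_per_1000 * p_uncorrected * impact

for name, freq, p_det, p_cor, impact in anticipated_errors:
    print(f"{name:32s} residual risk = {residual_risk(freq, p_det, p_cor, impact):6.2f}")
```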
Another key component of risk is the impact of an AI tool on radiologist performance. Relying on an automated tool to perform a task may lead to de-skilling of radiologists for the task the tool has taken on. This risk is particularly problematic if the radiologist is expected to perform the task manually when the tool fails but may no longer be skilled enough to do so adequately. User over-reliance and under-reliance can also decrease the accuracy of the combined radiologist-AI output; these effects are discussed further in Section 8.
A final aspect of risk that must be considered is the potential for AI to create or exacerbate healthcare disparities. AI is particularly prone to this because it is generally trained on retrospective data drawn from clinical archives, and these data represent the current and historical healthcare disparities and inequities of our society. Training an AI is a mathematical process of minimizing a cost function that proceeds without ethics or morals. Therefore AI may learn from the inequities and disparities embedded in the training data, and can perpetuate these in implementation. There is no easy or straightforward process for comprehensively identifying these biases, but it is incumbent upon us as physicians and data scientists to think about, search for and mitigate these biases; if these questions are unasked, they will most certainly remain unanswered.
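One practical way to search for such biases is to stratify validation performance by patient subgroup. The minimal sketch below, with hypothetical subgroup labels and data, compares sensitivity across subgroups on a local validation set; a large gap between subgroups would prompt further investigation before or during deployment.

```python
# Hypothetical subgroup audit: compare AI sensitivity across patient subgroups
# on a local validation set. Subgroup labels and records are illustrative only.
from collections import defaultdict

# Each record: (subgroup label, ground truth positive?, AI flagged positive?)
validation_records = [
    ("site_A_female", True, True),
    ("site_A_male", True, False),
    ("site_B_female", True, True),
    # ... local validation data would go here
]

tp = defaultdict(int)   # true positives per subgroup
fn = defaultdict(int)   # false negatives per subgroup
for subgroup, truth_pos, ai_pos in validation_records:
    if truth_pos:
        if ai_pos:
            tp[subgroup] += 1
        else:
            fn[subgroup] += 1

for subgroup in sorted(set(tp) | set(fn)):
    positives = tp[subgroup] + fn[subgroup]
    sensitivity = tp[subgroup] / positives if positives else float("nan")
    print(f"{subgroup}: sensitivity = {sensitivity:.2f} (n positives = {positives})")
```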
Integration, Verification and Monitoring
Once expected benefits and goals have been decided upon, cost–benefit analysis has been carried out, and potential risks have been assessed, integration of the selected AI tools can be planned. Depending on the local IT infrastructure and policies, purchasers can consider different technical integrations—either as local installations with dedicated computational resources on site or as a cloud-based software as a service (SaaS) model. In both types of installation, data orchestration of DICOM and HL7 plays a vital role in ensuring that the right slices from the correct series of the relevant study for the right patient in the right setting are sent to the appropriate AI within an acceptable time. To achieve robust orchestration, understanding and structuring the content of your data is essential. Unfortunately, relying on DICOM metadata is often insufficient due to the high variability and labile nature of study and series names, and the fact that DICOM headers may be incomplete. A more robust option is to use imaging AI to determine the data contents at the study and series level and use that output for orchestration. Using computer vision AI to determine which body parts are on each image and whether intravenous contrast has been administered are two of the most useful additions. Downstream data orchestration from the AI system requires an intelligent system able to facilitate different workflows depending on an understanding of the AI results. Most current implementations only send the AI results to the Picture Archiving and Communication System (PACS). This limited integration not only allows visualization of AI results by referring physicians, which may not be optimal if these physicians have not been educated about the details and accuracy of the AI model, but has also been shown to increase automation bias among radiologists.59 Furthermore, PACS currently offers limited modes for AI results integration, and in most instances the radiologist cannot modify the AI results in PACS. To optimize AI results management and integration, a PACS should enable the radiologist to interact with and modify the AI results and, if results are changed, trigger the AI to immediately reprocess and produce a new output. In addition, the updated AI result should be provided to the AI vendor so it can be used for future model improvement. This type of interaction is facilitated in a cloud-native environment where both the PACS and AI models can share radiology data and AI results. Additionally, the ability to accept and store AI results along with radiologist feedback, optimize data security, and continuously monitor AI accuracy are crucial technical aspects that are facilitated in cloud-native systems.
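As a minimal illustration of metadata-driven orchestration, and of why it often needs image-based support, the sketch below uses pydicom to route a series based on DICOM header fields and falls back to a hypothetical computer-vision classifier when the headers are inconclusive. The routing rules, the classify_body_part function, and the endpoint name are assumptions for this example, not part of any vendor's API.

```python
# Minimal orchestration sketch: decide whether a DICOM series should be sent
# to a (hypothetical) chest CT AI endpoint. Header-based rules are tried first;
# a placeholder image-based classifier is the fallback, because
# BodyPartExamined and SeriesDescription are frequently missing or inconsistent.
import pydicom

def classify_body_part(dataset):
    """Placeholder for an image-based body-part classifier (assumed, not a real API)."""
    raise NotImplementedError("Plug in a computer-vision model here.")

def route_series(dicom_path):
    ds = pydicom.dcmread(dicom_path)
    modality = ds.get("Modality", "")
    body_part = str(ds.get("BodyPartExamined", "")).upper()
    series_desc = str(ds.get("SeriesDescription", "")).upper()

    if modality != "CT":
        return None                      # not relevant to this endpoint
    if "CHEST" in body_part or "THORAX" in series_desc:
        return "chest-ct-ai-endpoint"    # hypothetical destination name
    # Headers inconclusive: fall back to image content.
    if classify_body_part(ds) == "chest":
        return "chest-ct-ai-endpoint"
    return None
```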
Whatever the integration, AI tools should ideally be well integrated into the usual clinical workflow and information systems in order to avoid the additional workload of requiring users to switch between applications. A recently published survey found that concern about additional workload was one of the main reasons respondents did not intend to acquire AI tools for their clinical practice.60 The same survey found that another major concern was that the AI system would not perform as well as advertised. This concern is important and should not be overlooked. Of course, vendors will have performed testing and quality assurance of the respective AI tools during regulatory approval, but purchasers should consider validating the AI's performance on a local dataset and adjusting parameters if needed prior to implementation in clinical practice. This process should be repeated whenever relevant changes are made to the AI software or to the equipment used in combination with the AI. In one example, an update to a commercially available breast screening AI model resulted in a substantially different recall rate, requiring recalibration of the decision threshold to maintain clinically acceptable diagnostic accuracy.61 These findings highlight that it cannot be taken for granted that the diagnostic performance claimed in premarket publications translates to comparable and stable performance during clinical usage, emphasising the need for continuous post-market surveillance of the AI systems used. The exact approaches to how this should be done are currently being discussed by the respective regulatory bodies. For example, the UK's Medicines and Healthcare products Regulatory Agency (MHRA) Guidance for manufacturers on reporting adverse incidents involving Software as a Medical Device under the vigilance system details various circumstances in which an adverse event should be reported—including "[failure] to identify clinically relevant brain image findings related to acute stroke" and "[degradation of MRI image] appearance of anatomical and pathological structures".62 Similarly, the FDA's Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device would expect manufacturers "to commit to the principles of transparency and real world performance monitoring" when making updates to their products.63 Stakeholders in the implementation of AI tools in clinical practice should therefore familiarize themselves with the relevant methods and metrics for clinical evaluation, devise strategies to verify performance claims prior to tool introduction, and continuously monitor performance during routine usage.64 This is especially important as the previously mentioned survey found that a large majority of respondents did not assess the AI's diagnostic accuracy on a regular basis.60 Post-market monitoring is discussed in greater detail in Section 7 (below).
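A lightweight sketch of such post-market surveillance is shown below: it tracks the AI's positive-call rate over a rolling window and flags drift outside an agreed tolerance band, the kind of signal that prompted threshold recalibration in the breast screening example above. The window size, tolerance band, and choice of metric are assumptions to be set locally, not prescribed values.

```python
# Illustrative post-market monitoring sketch: track the AI positive-call rate
# over a rolling window and flag drift outside a locally agreed tolerance band.
# Window size, tolerance and the metric itself are assumptions to be set locally.
from collections import deque

class PositiveRateMonitor:
    def __init__(self, expected_rate, tolerance, window_size=500):
        self.expected_rate = expected_rate    # e.g. positive-call rate at acceptance testing
        self.tolerance = tolerance            # acceptable absolute deviation
        self.window = deque(maxlen=window_size)

    def add_result(self, ai_flagged_positive: bool):
        self.window.append(1 if ai_flagged_positive else 0)

    def check(self):
        if len(self.window) < self.window.maxlen:
            return "collecting baseline data"
        observed = sum(self.window) / len(self.window)
        if abs(observed - self.expected_rate) > self.tolerance:
            return f"ALERT: observed rate {observed:.3f} vs expected {self.expected_rate:.3f}"
        return f"OK: observed rate {observed:.3f}"

# Example: expected 4% positive-call rate, alert if it drifts by more than 1.5 points.
monitor = PositiveRateMonitor(expected_rate=0.04, tolerance=0.015)
```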
Human-AI Interaction
Besides technical performance details and the practical workflow integration of AI tools in radiology, the importance of difficult-to-measure human factors should not be underestimated. AI has undeniably made impressive progress and for many use-cases can reach diagnostic performance comparable to that of human readers, as has been shown particularly in the context of breast cancer screening.65-69 However, as discussed above, many factors can influence the technical diagnostic performance of AI tools in clinical practice. While it has been suggested that the combination of human reader and AI tool could help increase overall diagnostic accuracy, with the human detecting errors made by the AI or vice versa, recent studies question this premise and highlight the need to further study the psychological phenomena that can bias decision making in a setting of human-AI interaction. It is well known that automation bias—the tendency to over-rely on automated systems such as AI-powered decision support tools—can influence human readers and negatively impact their ability to exercise oversight.70 Recently, a study focused on mammography found that even the most experienced readers exhibited this bias in an experimental setting and performed significantly worse when a purported AI system suggested a wrong BI-RADS category.71 Conversely, the opposite effect, described as algorithmic aversion—where information is rejected in a decision-making process solely because it is AI-generated—can also be observed.72 A recent study showed that radiologists and other physicians rated the same information about a chest X-ray as less reliable when it appeared to come from an AI system than when it appeared to come from a human expert.73 These issues are further complicated by the fact that human-AI interaction may be influenced by details of the user interface (UI) design. For example, while many radiologists preferred image overlays for detecting pulmonary nodules, this UI configuration did not improve reader performance, whereas a minimalistic setup with text-only UI output did.74 Similarly, a study evaluating eye gaze in endoscopy found that use of a computer-aided polyp detection system led to significantly reduced eye movements during evaluation of endoscopic videos and an increase in misinterpretation of normal mucosa.75 These findings highlight the need for further education on these topics to increase awareness amongst users and stakeholders and to allow safe and successful implementation of AI into clinical routine.76 Opportunities to help mitigate human-AI bias are discussed in Section 8. More focused research in this area is needed to provide reliable evidence on how best to design human-AI interaction.
Clinical Evaluation
While FDA or other relevant authority approval/clearance data provides some insights, testing the AI model on local data, with the local systems and workflows used in practice, is essential to ensure accuracy and relevance when the model is deployed. While local evaluation will need to be tailored to the specific AI model and local resources, Table 3 outlines tactics that may help practices decide whether a given model is relevant to local practice and performs with suitable accuracy on local data.
A clinical accuracy evaluation process can be performed efficiently and does not require implementation of the model into the clinical workflow. The first step involves comparing the AI model's performance on local data against the regulatory authority documentation, specifically evaluating accuracy through the lens of radiologist acceptance of and engagement with the AI tool. Hence, parameters that are radiologist-facing, such as positive and negative predictive values at the local disease prevalence, are more relevant than overall accuracy, Area Under the Curve (AUC), or sensitivity/specificity. Secondly, calculate an "Enhanced Detection Rate," the improved detection that could be obtained by combining radiologist and AI true positive results. Thirdly, impressive or "WOW" cases should be identified to demonstrate the AI model's value to users and stakeholders. Fourthly, categorizing AI false positive and, when possible, false negative cases can set radiologist expectations and improve acceptance of an imperfect AI model (all AI models are imperfect). Finally, all the findings should be reviewed to determine whether the AI model is worthy of clinical deployment.
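The first two steps can be computed directly from a local validation sample, as in the sketch below. The sensitivity, specificity, prevalence, and case counts shown are hypothetical placeholders, not results from any actual model.

```python
# Step 1: radiologist-facing predictive values at the local disease prevalence.
# Step 2: "Enhanced Detection Rate" from combined radiologist + AI true positives.
# All numbers are hypothetical placeholders; substitute local validation results.

def ppv(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

def npv(sensitivity, specificity, prevalence):
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tn / (tn + fn)

local_prevalence = 0.05           # measured on the local validation sample
print(f"PPV: {ppv(0.90, 0.92, local_prevalence):.2f}")
print(f"NPV: {npv(0.90, 0.92, local_prevalence):.2f}")

# Enhanced Detection Rate: fraction of true positive cases found by the
# radiologist alone, the AI alone, or both, relative to all true positives.
found_by_radiologist_only = 85
found_by_ai_only = 7
found_by_both = 100
missed_by_both = 8
all_positives = found_by_radiologist_only + found_by_ai_only + found_by_both + missed_by_both
enhanced_detection_rate = (found_by_radiologist_only + found_by_ai_only + found_by_both) / all_positives
print(f"Enhanced Detection Rate: {enhanced_detection_rate:.2%}")
```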
Ultimately, the decision lies in the balance between positive predictive value (which is highly dependent on disease prevalence) and the value and number of "WOW" cases. Radiologists are more willing to accept false positives if the model also identifies pathology that impresses the radiologist or adds value for the patient or other stakeholders. Disease prevalence also has a strong impact on downstream model acceptance: AI models applied in low-prevalence settings produce numerous false positives, limiting user acceptance. The disease prevalence in the patient group presented to an AI model can be modified by appropriately selecting patient imaging locations, such as emergency department, inpatient, or outpatient settings. Hence, some AI models may be deployed on a subset of exams because disease prevalence in that exam subset is increased from baseline. For example, pneumothorax (PTX) on chest X-ray (CXR) has a higher prevalence in the inpatient population than in the average population. Limiting a PTX AI model to inpatient CXRs will therefore produce fewer false positive results and make the model more likely to be accepted by radiologists from an accuracy standpoint.
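To illustrate the prevalence effect for the PTX example, the short sketch below computes PPV for the same model at two assumed prevalence levels; the sensitivity, specificity, and prevalence values are illustrative assumptions, not published performance figures.

```python
# A hypothetical PTX model (sensitivity 0.92, specificity 0.95) applied to
# populations with different assumed pneumothorax prevalence. Values are illustrative.

def ppv(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

for setting, prevalence in [("all CXRs", 0.01), ("inpatient CXRs", 0.06)]:
    print(f"{setting}: prevalence {prevalence:.0%} -> PPV {ppv(0.92, 0.95, prevalence):.2f}")
```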
Utilizing information from the above 5-step clinical evaluation for radiologist education, coupled with change management, is vital to setting user expectations before AI model implementation. A local AI champion plays a significant role in promoting AI adoption among radiologists. Finally, continuous user education throughout the lifecycle of AI utilization, together with monitoring of radiologist AI usage and of the combined accuracy of radiologist plus AI, is instrumental in ensuring optimal patient care.
Purchasing considerations are summarised in Table 4.