By Stacey Kusterbeck
Recently, Medical Ethics Advisor (MEA) spoke with Shaun Grannis, MD, MS, vice president for data and analytics at Regenstrief Institute and a professor of family medicine at the Indiana University School of Medicine, about ethical issues involved with evaluating artificial intelligence (AI) tools.
MEA: What are the central ethical concerns with validation of AI tools that you see currently?
Grannis: We know that AI algorithms can inadvertently reinforce biases if the data they are trained on are not representative of diverse populations. Researchers need to ask whether the tool’s performance is equitable across demographic groups to avoid exacerbating health disparities. There needs to be some consideration of that and a plan to address those issues, or at least an effort to detect and mitigate bias where possible.
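As a rough illustration of what that kind of subgroup check can look like in practice, the Python sketch below compares a model’s sensitivity and false-positive rate across demographic groups. The records, group labels, and numbers are hypothetical, not drawn from any project discussed in this interview.

```python
# Hypothetical sketch: compare a model's error rates across demographic groups.
# The (group, true_label, predicted_label) rows below are illustrative only.

from collections import defaultdict

predictions = [
    ("group_a", 1, 1), ("group_a", 1, 0), ("group_a", 0, 0), ("group_a", 0, 1),
    ("group_b", 1, 1), ("group_b", 1, 1), ("group_b", 0, 0), ("group_b", 0, 0),
]

def rates_by_group(rows):
    """Return sensitivity and false-positive rate for each demographic group."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for group, truth, pred in rows:
        if truth == 1:
            counts[group]["tp" if pred == 1 else "fn"] += 1
        else:
            counts[group]["fp" if pred == 1 else "tn"] += 1
    report = {}
    for group, c in counts.items():
        sensitivity = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else None
        fpr = c["fp"] / (c["fp"] + c["tn"]) if (c["fp"] + c["tn"]) else None
        report[group] = {"sensitivity": sensitivity, "false_positive_rate": fpr}
    return report

for group, metrics in rates_by_group(predictions).items():
    print(group, metrics)
```

A large gap in sensitivity or false-positive rate between groups is the kind of disparity that would need to be investigated and mitigated before deployment.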
Transparency and interpretability are also important, and this remains an unsolved problem. These AI models are so large that they are often viewed as black boxes. When you are thinking about clinical trials, that black-box quality raises questions about whether participants can give truly informed consent and whether clinicians can safely rely on the AI’s recommendations.
There also is the issue of what we call data drift. AI systems learn and adapt over time, and that can complicate trial protocols. If the AI learns particular behaviors from the data that are inconsistent with the research protocol, that is an issue. The continuous learning that AI does requires close monitoring to prevent those unforeseen risks and departures, and make sure that the tool maintains safety and efficacy.
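One simple way to operationalize that kind of ongoing monitoring (a sketch under assumed inputs, not a description of any specific pipeline) is to compare a model’s recent output rates against a frozen baseline from validation and flag large shifts for human review.

```python
# Hypothetical drift check: compare recent positive-prediction rates to a frozen baseline.
# The baseline rate and alert threshold are illustrative assumptions.

baseline_positive_rate = 0.12   # rate observed during validation
alert_threshold = 0.05          # absolute shift that triggers human review

def check_for_drift(recent_predictions):
    """Flag the model for review if its positive rate drifts far from the baseline."""
    if not recent_predictions:
        return False
    recent_rate = sum(recent_predictions) / len(recent_predictions)
    drifted = abs(recent_rate - baseline_positive_rate) > alert_threshold
    if drifted:
        print(f"Drift alert: recent rate {recent_rate:.2f} vs baseline {baseline_positive_rate:.2f}")
    return drifted

# Example: a week of binary model outputs (1 = positive prediction)
check_for_drift([1, 0, 0, 1, 1, 0, 1, 0, 1, 1])
```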
Another issue that has been with us forever is data privacy and security. AI systems rely on large datasets, often requiring sensitive patient data. We need to be mindful of how those large datasets with protected health information are managed, because we are increasingly pulling together highly sensitive, very large collections of data.
MEA: Where do things stand currently with the ability to ethically validate AI tools?
Grannis: Probing a tool’s outputs in similar environments, or probing them with data that are representative of the task at hand, will be important. For example, here at Regenstrief Institute, I have several large language model [LLM] projects underway. All of them are designed to evaluate the consistency and accuracy of the model’s performance. Before I even think about studying a tool in the context of a workflow, I need to make sure that it is behaving in understandable and consistent ways.
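A minimal version of that kind of consistency probe might repeat the same prompt many times and measure how often the answer changes. In the sketch below, call_model() is a hypothetical stand-in for whatever LLM interface is actually under evaluation.

```python
# Hypothetical consistency probe: send the same input repeatedly and measure agreement.
# call_model() is a placeholder, not a real vendor API.

from collections import Counter

def call_model(prompt: str) -> str:
    """Placeholder for the real model call; returns a canned answer here."""
    return "not reportable"

def consistency_rate(prompt: str, trials: int = 20) -> float:
    """Fraction of trials that return the single most common answer."""
    answers = Counter(call_model(prompt) for _ in range(trials))
    most_common_count = answers.most_common(1)[0][1]
    return most_common_count / trials

print(consistency_rate("Is a positive hepatitis C antibody result reportable?"))
```

A rate well below 1.0 on inputs where a single correct answer exists would be a warning sign before moving the tool into a clinical workflow.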
My feeling is, when I submit proposals for these healthcare interventions or clinical trial-like activities, I want a tool that has known behavioral characteristics. And those characteristics need to be monitored on a regular basis to ensure that the output continues to be what you would expect.
We are still in the very early days of these new types of AI models. Over time, we will better understand how to measure and manage their behavior. The analogy I use is, back when the model of the atom was emerging, we discovered that there was an electron, but we didn’t know how much the electron weighed. Eventually we were able to measure the mass of the electron and understand that. We are measuring the mass of the electron, so to speak, with LLMs right now. We are trying to get a sense of exactly how they behave and under what circumstances they are useful.
There is some good news here. Many of these models, particularly generative models, are designed to produce a different result each time. In many circumstances, that variation is something we may not want. The good news is that there are parameters within the model that you can set to dramatically reduce, and nearly eliminate, that variability. So, for a given input to a trained model, you’ll get the same output.
We need many, many more evidence-based best practices for how to manage the variability. But that’s an example of how we are developing an understanding of how to manage the tools.
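The specific parameters vary by vendor and model, but the usual knobs are a sampling temperature near zero and, where supported, a fixed random seed. The sketch below is a hedged illustration of that configuration; query_model() and its parameter names are assumptions, not a particular product’s API.

```python
# Illustrative only: parameter names vary by vendor, and not every model supports a seed.

def query_model(prompt: str, temperature: float = 0.0, seed: int = 42) -> str:
    """Stand-in for a vendor SDK call. With temperature near 0 and a fixed seed,
    repeated calls on the same trained model should return (nearly) identical output."""
    # A real integration would pass temperature and seed through to the provider's API.
    return "deterministic answer for: " + prompt

first = query_model("Summarize the adverse events in this note.")
second = query_model("Summarize the adverse events in this note.")
assert first == second  # the reproducibility property desired for trial-like workflows
```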
MEA: What can we learn from previous approaches used to validate machine learning models? What is unique about AI tools that raises additional ethical challenges?
Grannis: I’ve been working on machine learning models that blur into the field of AI for about 20 years. We’ve been dealing with very similar circumstances: the black box problem, data drift, and potential changes in output. When we train a model and evaluate its performance on the training set, we always have to evaluate it in the real-world context as well, to ensure that it continues to behave that way. And that practice needs to continue.
What’s different here is that these models are just so much more complex. We need to perform due diligence to ensure that the behavior is what one would expect. And it requires more effort, more data, and more consideration of potential unintended consequences.
MEA: Do Institutional Review Boards (IRBs) need to obtain outside expertise in some cases, to be able to evaluate study protocols involving AI tools?
Grannis: I have had conversations with several folks on IRBs, and what I hear people saying now is: absolutely, we need AI expertise as part of the IRB review process going forward. Whether that expertise comes from a participating member of the IRB or from a consultation, having access to it is very important.
There are health systems that are beginning to form AI advisory panels. As new proposed uses of AI emerge, the health system needs some way of evaluating their safety, efficacy, and ethical implications. I think we are going to see AI expertise become a requisite on IRBs for studies in which AI is an essential component.
MEA: What role can ethicists potentially play in helping to address these issues?
Grannis: I believe there is going to be growth in ethicists focused specifically on AI and data science technology. Those tools are going to become far more prevalent, and we may see the discipline of ethics develop a subspecialty in that area. I’m fairly certain that’s going to happen.
There are professional organizations, such as IEEE, with dedicated workgroups and members specializing in AI ethics in healthcare. So AI ethics considerations apply not just to IRB protocols, but to healthcare writ large and to decisions that universities are making. I could see institutions establishing dedicated AI review boards that participate in or contribute to various AI decision-making processes.
MEA: What developments are we likely to see in the near future?
Grannis: Clearly, there needs to be thoughtful consideration given to this. What we need is a framework for thinking about how to evaluate and adjudicate proposed uses of AI. There are a number of different ways to do that. But over time, we will see emerging evidence-based patterns of practice that will inform how these issues are assessed.
We’ve seen similar patterns with the type of complex analytical work that has been part of my research. We’ve had to think through, and provide assurances for, ethical concerns. We’re going to be using very large, identified datasets; here’s how we’re going to protect them. Or we’re going to be using complex algorithms to predict a particular outcome; here’s how we’re going to ensure they continue to perform within the bounds of expectations.
So, in some ways, we’ve been through this with previous innovations. But AI is an innovation whose potential behavior and output are not fully understood at this point. That’s what makes it a bit different.
I think we’re going to start with very tightly scoped problems with relatively low risk. One of the problems we’re working on is notifiable disease reporting to public health. We’ve done work in this area now for a couple of decades, using different types of technology. Now we are looking at moving that to LLM. We know that physicians and health systems do not do a good job of reporting these conditions completely or accurately, so automated systems can be helpful. We’re designing very focused models, whose job it is to review a particular result and determine whether it’s reportable to public health. That is a low-risk decision, and it is very tightly scoped.
So initially, we will see very tightly scoped uses of the AI model before we see them expanded to, let’s say, diagnosing and treating patients. That is years away.