Clinicians take a larger role in evaluating AI tools for healthcare

At HIMSS26, leaders from Emory Healthcare and Mass General Brigham discussed how initiatives like the Healthcare AI Challenge aim to help health systems make safer deployment decisions.
By Nathan Eddy
Nabile Safdar, chief AI officer at Emory Healthcare, and Bernardo Bizzo, senior director of artificial intelligence at Mass General Brigham 

Photo courtesy of Nathan Eddy

LAS VEGAS – During a talk at the 2026 HIMSS Global Health Conference & Exposition here on Wednesday, Nabile Safdar, chief AI officer at Emory Healthcare, and Bernardo Bizzo, senior director of artificial intelligence at Mass General Brigham, discussed how healthcare leaders can make smarter, data-driven AI deployment decisions by pooling data and building benchmarks accessible to everyone.

AI tools are advancing rapidly in healthcare, but many clinicians lack clear frameworks for evaluating their safety, reliability and clinical usefulness.

Initiatives such as the Healthcare AI Challenge aim to address that gap by giving providers structured methods to test emerging AI tools and provide feedback that can guide safer adoption across health systems.

"The AI opportunity is to enhance healthcare workflow efficiency," Bizzo said. "But health systems lack tools to assess foundational models for safety and effectiveness."

Safdar said organizations often face a difficult evaluation process before committing to a new AI system. Deploying these tools requires investment in implementation, staff training and workflow integration.

"Clinical leaders are not sure if it's bringing them value," Safdar said.

While technologies such as ambient documentation and imaging AI have shown measurable benefits, the broader landscape is becoming more complicated.

Bizzo explained that the return on investment for many AI tools remains uncertain. Although the technology is powerful, most foundation models were developed as general-purpose systems rather than tools designed for healthcare environments.

"The biggest challenge we have is we lack benchmarks to assess how these tools are performing and how well they can help us," Bizzo said.

To address that problem, Bizzo and colleagues created an initiative designed to systematically evaluate AI systems across multiple clinical scenarios.

The Healthcare AI Challenge allows clinicians and researchers from participating institutions to test different AI models using shared datasets and standardized evaluation methods.

The initiative includes an "AI Arena" platform where clinical experts review outputs from different AI models and compare their performance across tasks such as radiology reporting or medical record summarization.

So far, the Healthcare AI Challenge has conducted five challenges involving more than 4,500 evaluations from roughly 200 participants across 40 institutions. Researchers have tested 18 foundation models, including both general-purpose systems and healthcare-specific models.

The platform allows clinicians to evaluate AI tools across several dimensions – not just technical accuracy.

"When it comes to AI, a lot of us get stuck on accuracy," Safdar said. "But often your family practice clinician is thinking, 'Does it make me faster?'"

In the AI Arena, evaluators can compare human performance with AI outputs or analyze how different models perform against each other. They can also assess factors such as speed, clinical usefulness and whether results meet an acceptable threshold for clinical workflows.

Bizzo said the goal is to create a repeatable process for evaluating AI systems before health systems invest heavily in them.

"As healthcare professionals, this is the information you want to know before investing in a model," he said.

Looking ahead, Bizzo said there are plans to expand the platform to integrate directly with electronic health record systems and evaluate emerging agentic AI workflows. Those efforts aim to measure not only technical performance but also real productivity gains for clinicians using the technology in practice.

"We want to measure how much more efficient users are and have that information available so you know how much ROI you can expect," he added.