The Underrated Challenges of Building a Learning Model

Ryan Black
JANUARY 19, 2018

Nothing about machine learning is simple, but some aspects may be more difficult than many in healthcare think.

Mark Michalski, executive director of the MGH & BWH Center for Clinical Data Science (CCDS), a joint project between Massachusetts General Hospital and Brigham & Women’s Hospital, walked the crowd at the AI in Healthcare Summit today in Boston through the more grueling parts of the artificial intelligence (AI) model design process.

As applied statistics gives way to machine learning, and then to deep learning and neural networks, the data demands become greater, Michalski said. Neural networks require extensive annotated data, with the operative word being “annotated.” Many outside data scientists might see the sheer volume of data that the healthcare industry possesses and think, “If only I could get my hands on that, I could…” the speaker said, but what they don’t realize is that the majority is unstructured and poorly annotated.

“You need a mandate from the institutional leadership down that we’re going to make this data accessible and usable to make progress,” according to Michalski, if healthcare is to fully leverage new innovations in computing.

CCDS has an inordinate amount of data, including over 5 billion images and records from millions of examinations. Its executive director laid out the process of crunching all of that down to a usable training set for a learning system.

Once the theoretical underpinnings of the model are laid out, which can be 10-20% of the total work and “a pain that no one wants to do,” software has to be built. From there, a cohort must be assembled and whittled down into a usable form.

First, a team must find all the patients in its repository who are representative of the cohort, using data like clinical information and radiology reports. Then all the reports must be curated, using data like exam and billing codes, before some machine learning can be applied to find patterns and identify the usable portion. The process can start with millions of records and produce a training set of 1,000.
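For readers who think in code, the funnel described above might look roughly like the sketch below. It is only an illustration, not CCDS's actual pipeline: the `ExamRecord` fields, the keyword match on report text, and the code whitelists are all hypothetical stand-ins, and the machine learning step that identifies the usable portion is left out.

```python
from dataclasses import dataclass


@dataclass
class ExamRecord:
    # Hypothetical fields; a real repository would hold far more.
    patient_id: str
    exam_code: str
    billing_code: str
    report_text: str


def find_candidates(records, keywords):
    """Step 1: keep records whose report text mentions any cohort keyword."""
    lowered = [k.lower() for k in keywords]
    return [r for r in records
            if any(k in r.report_text.lower() for k in lowered)]


def curate(records, exam_codes, billing_codes):
    """Step 2: keep records whose exam and billing codes are both allowed."""
    return [r for r in records
            if r.exam_code in exam_codes and r.billing_code in billing_codes]


def build_training_set(records, keywords, exam_codes, billing_codes, limit):
    """Run both filters in order and cap the result at `limit` records."""
    candidates = find_candidates(records, keywords)
    curated = curate(candidates, exam_codes, billing_codes)
    return curated[:limit]
```

Each stage only narrows the previous one, which is how a repository of millions of records can shrink to a training set in the low thousands.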

And once it exists, it needs to be put into a form that the end user, say a radiologist, can easily use. That means a testing environment highly representative of their typical workflow, so they can deliver numerous free screenings to adjust and validate the model. That, Michalski said, is an “underappreciated and often ignored part of the process.”

The bright side is, once an organization becomes adept at curation and validation design, the algorithmic architecture can be a much easier process. The actual model can be swapped out if the infrastructure behind it is well-built.

In a large stroke mapping study, the model was one of the easier parts of the process. “The architecture was actually 80% commodity,” Michalski said. “We were able to just pull it from the internet and tweak it.”

Another underappreciated aspect of AI for healthcare, he said, is how the definitions have blurred as hype has built.

“AI means a lot of things to a lot of people. Data science might be more specific: it’s applied statistics at scale. Machine learning is not the same as deep learning, and there’s a lot of tools within it,” he said. “The term has reached this level of fervor that you really have to understand what they’re actually talking about.”

That’s important for CCDS, because they have to have the conversation often. Boston is teeming with AI startups and other leading health systems glad to collaborate: Early in his speech, he joked that a new AI company would have been started before he finished talking.
