Advancing Genomics With High-Performance Computing

John Kirkley
MAY 08, 2018

The Broad Institute of MIT and Harvard is playing a major role in accelerating genomic analysis and is collaborating with Intel to do so.

With genomics data growing massively, the collaboration aims to enable genomics analytics at scale. The latest result is a suite of optimized software, along with reference architectures for turnkey configuration, setup, and deployment, for running genomics analysis that includes Broad’s Genome Analysis Toolkit (GATK).

Geraldine Van der Auwera, PhD, Broad’s associate director of outreach and communications, Data Sciences Platform Group, commented, “Our goal is to reduce the challenges that researchers face to generate ever-more meaningful insights from ever-larger sets of genomics data. We’re working with Intel to make the GATK Best Practices pipelines run even faster, at even greater scale, and with easier deployment for genomic research worldwide.”

Broad is an academic, non-commercial entity interested in furthering science and curing disease. The institute, one of the world’s largest producers of human genome data, creates about 24 TB of new data per day and manages more than 50 PB of data. And this is only the beginning.

Genomics received a major boost in 2003 with the first sequencing of the human genome by the Human Genome Project. This was an ambitious undertaking that spanned 13 years and cost $3 billion. Now, Broad and Intel are helping to usher in a new era by developing advanced, affordable tools to research the torrents of data that are being created by the genomics community.

The 2 organizations have collaborated on computing infrastructure and software optimization for years. Last year, they launched the Intel-Broad Center for Genomic Data Engineering to simplify and accelerate genomics workflow execution using a variety of advanced tools and techniques—including the popular GATK, a set of more than 100 tools and a framework for genomic analysis. 

This program is a direct offshoot of Broad’s focus on standardizing the methods used to analyze genomics data. For example, Intel has worked with the institute on accelerating compute-intensive genomics workloads by developing the Genomics Kernel Library (GKL) for Intel architecture, as well as optimizing and benchmarking Best Practices workflows on the latest Intel reference hardware platform.

Several years ago, processing a whole genome sequence (WGS) took 36.12 hours; most recently, the GATK-based analysis, running on an updated Intel platform, finished the job in 10.8 hours, about 3.3 times faster.
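The speedup figure can be checked with simple arithmetic. The sketch below is illustrative only: the function and variable names are made up for this example, and the only inputs are the two run times quoted above.

```python
def speedup(old_hours: float, new_hours: float) -> float:
    """Return how many times faster the new run is than the old one."""
    if old_hours <= 0 or new_hours <= 0:
        raise ValueError("run times must be positive")
    return old_hours / new_hours


# Run times quoted in the article, in hours.
old_run = 36.12
new_run = 10.8

ratio = speedup(old_run, new_run)
hours_saved = old_run - new_run

print(f"speedup: {ratio:.1f}x, hours saved per genome: {hours_saved:.2f}")
```

The ratio works out to roughly 3.3, which matches the reported figure, and corresponds to a little over 25 hours saved per genome.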

According to Eric Banks, senior director of the Broad Institute’s Data Sciences Platform, “We [the Broad team] are computational biologists with degrees in computer sciences and mathematics, but we know very little about the hardware we’re using to run our workloads. Several years ago, Intel offered to help us with the hardware platform. We welcomed them with open arms.

“The collaboration has helped us scale the software significantly as well as run it faster and better. For example, one of the problems scientists face when analyzing genomes is the need to work with multiple, very large data sources. Intel showed the Broad team how to optimize the genome analytics codes to ensure that the genome processing runs as fast as possible and can scale to unprecedented levels.”

Banks said that GenomicsDB, a variant store, was developed and used heavily for joint genotyping, an approach in which large cohorts of samples are analyzed together in a single run. “It became possible for us to compare large numbers of genomes with each other,” Banks said. “We are reaching levels of analysis that were not possible before.”

Hardware optimization is among the tasks the 2 organizations have worked on together over the past few years. A major factor in their success is Intel Select Solutions for Genomics Analytics: verified hardware and software stacks optimized for specific workloads across compute, storage, and network. Each solution packages hardware and software components targeted at the genomics workload into a high-performance compute cluster.

“The human genome is very large,” Banks said. “What we are doing goes far beyond just processing this data; it requires complex analysis to determine which indications point to an underlying disease. We now have massive amounts of data that include many samples from the same disease type that we can study together using a process known as joint analysis.

“The first time we did this was several years ago with a cohort of about 3000 samples,” he added. “It took all the compute power we could get from our [information technology] organization. We had to beg, borrow, and steal to get our job running on every core in the data center, and it still took about 6 weeks to analyze the 3K genomes. Last year, we analyzed 15,000 genomes, an order of magnitude larger, running the job on cloud infrastructure, using analysis software based on Intel Select Solutions for Genomics Analytics. We used only a small percentage of the system’s capacity and had our results in less than 2 weeks. We are now preparing to analyze 72,000 genomes, which should take a week and a half at this point. There is no way that we could have achieved these results with the previous platform.”
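To put the cohort figures Banks cites on a common scale, the rough sketch below converts each run into genomes processed per week. It rests on stated assumptions: "about 6 weeks" is taken as 6.0, "less than 2 weeks" as an upper bound of 2.0, and the 72,000-genome run as the projected 1.5 weeks. The labels and function name are illustrative only.

```python
# (label, genomes, weeks) taken from the figures quoted above.
runs = [
    ("first joint analysis", 3_000, 6.0),
    ("cloud run last year", 15_000, 2.0),   # upper bound: "less than 2 weeks"
    ("planned run", 72_000, 1.5),           # projected, not yet completed
]


def genomes_per_week(genomes: int, weeks: float) -> float:
    """Average throughput of a run in genomes per week."""
    if weeks <= 0:
        raise ValueError("duration must be positive")
    return genomes / weeks


rates = [genomes_per_week(genomes, weeks) for _, genomes, weeks in runs]
baseline = rates[0]

for (label, genomes, weeks), rate in zip(runs, rates):
    print(f"{label}: {genomes} genomes in {weeks} weeks "
          f"= {rate:.0f} per week ({rate / baseline:.0f}x the first run)")
```

Under these assumptions, the cohort grew fivefold between the first and second runs while weekly throughput rose at least 15-fold, and the planned run would be roughly 96 times the throughput of the first.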

He says there are 2 classes of users whose workflows are particularly suited to the architecture. Users who have to analyze huge amounts of genomic data are now able to run their workloads far faster at a significantly lower cost and, as a result, advance the science.

Building upon the existing collaboration, the new effort will combine Intel’s data analytics and artificial intelligence capabilities with Broad’s expertise in genomic data generation, health research, and analysis tools. The goal is to build new resources that will promote biomedical discoveries, including those that advance precision medicine.

John Kirkley has a long history of creating high-impact, creative marketing communications copy for many of the leading companies in the computer industry. This story is not a sponsored post, though it was produced as part of Intel’s editorial program, which aims to highlight cutting-edge research in the space.
