Most Influential Projects 2020

04 COVID-19 Data Lake


For collecting—and organizing—a deluge of virus insights

As the coronavirus crisis deepened, scientists needed information. There was plenty available, but it was disparate, disconnected and constantly growing. Looking to bring order to the chaos, C3.ai created the COVID-19 Data Lake, which lets researchers plunge into multiple data sources integrated into a unified model and ready for immediate analysis. Users could explore the data based on features such as diagnosis, age, location and preexisting conditions to find patterns in the disease, evaluate efforts to combat it, and develop medicines and vaccines to prevent future outbreaks.

“Experts estimate that data scientists spend up to 90 percent of their time and effort ‘wrangling’ data so that the data are in a form that is accessible for analysis,” C3.ai CEO Thomas M. Siebel wrote in a company blog post. “In the COVID-19 Data Lake, we have done all of that work.” C3.ai estimates a researcher would normally spend about three hours finding, importing, cleaning, standardizing and analyzing data that compares the impacts of the pandemic in just two locations. Of those three hours, only seven minutes would be spent on the final, crucial step: data analysis. And every time the researcher adds another location, the entire process would have to start all over again.

With the Data Lake, the pre-analysis work happens in seconds, which makes searching across data sources a whole lot easier. Siebel compared it to the World Wide Web, “where we have used the big data equivalent of HTML to enable researchers and analysts to view and navigate all of the associations within and across the datasets from the union of the collections of all the university libraries.”

After just three weeks of development, the Data Lake made its debut in April, with a boost from Amazon Web Services, which donated cloud computing to the initiative. As a free product, it’s designed to help researchers predict the virus’s trajectory, forecast the demand for hospital beds, and assess contact tracing initiatives and the effects of COVID-19 shutdowns and guidelines.

The team behind the Data Lake isn’t content to tread water: Already staking its claim as the world’s largest source of pandemic-related data, C3.ai encourages users to recommend new data sources for future research. A U.S. physician, for instance, asked to add vaccination data to help track the effect of previous vaccinations on hospitalization and infection rates.

Related Sponsors and Organizations

  • Amazon Web Services
  • C3.ai