UNC Releases Huge Database of Molecules to Support Artificial Intelligence Methods in Chemistry

Olexandr Isayev, Ph.D., said he hopes the release of this data set will jump-start the development of new methods in chemistry that leverage the power of artificial intelligence.

Researchers at the University of North Carolina at Chapel Hill and the University of Florida have publicly released the massive data set they used to create an artificial intelligence tool that drastically cuts the time needed to design new molecules in drug discovery and other areas of chemistry. The high-quality database, made available to help other researchers create similar tools, contains over 20 million conformations, or structural variations, of more than 57,000 small, drug-like molecules.

In 2017, UNC-Chapel Hill and UF scientists created the ANAKIN-ME method, a neural network that can calculate the structures and energies of molecules up to a million times faster than traditional quantum mechanical methods. Called ANI-1 for short, the tool takes statistical shortcuts to increase speed while remaining nearly as accurate as the laborious, computationally intensive process traditionally needed to calculate the energy levels and structure of new molecules.

“One of the grand challenges in modern theoretical chemistry is designing and implementing approximations that speed up our calculations without sacrificing accuracy,” said Olexandr Isayev, Ph.D., a research assistant professor at the UNC Eshelman School of Pharmacy. “Machine learning methods are emerging as powerful options, but these models are heavily dependent on the quality and quantity of data used in their creation. With the release of this data set, we are supplying our colleagues around the world with a vast amount of quality data to power the development of a new generation of tools powered by artificial intelligence.”

The heart of ANI-1’s machine intelligence is the ability to recognize and predict molecular patterns. ANI-1 developed this ability after researchers calculated and plugged in millions of molecules and their various conformations. That set of high-quality data took about a year of computing time to develop, Isayev said, but ANI-1 can now perform the same calculations in five to six hours. What used to require a supercomputer can now be done on an average laptop.

“Developing analytical tools powered by machine learning requires large amounts of accurate data,” Isayev said. “Right now, ANI-1 is state of the art, and we’re releasing the data used to create it in the hopes that other researchers can exceed what we have done.”

The data set of more than 20 million off-equilibrium conformations for 57,462 small organic molecules is freely available at https://github.com/isayev/ANI1_dataset. The description of the data set was published in Scientific Data, a Nature research journal.

Authors and Citation

Justin S. Smith, a graduate student in the Department of Chemistry at the University of Florida
Olexandr Isayev, Ph.D., a research assistant professor in the UNC Eshelman School of Pharmacy at the University of North Carolina at Chapel Hill
Adrian E. Roitberg, Ph.D., a professor in the Department of Chemistry at the University of Florida

Smith, J. S., Isayev, O., & Roitberg, A. E. (2017). ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules. Scientific Data, 4, 170193. doi:10.1038/sdata.2017.193