Solvate Prediction for Pharmaceutical Organic Molecules with Machine Learning


Advances in technology have recently reshaped pharmaceutical development, most notably small-molecule drug development. Solvates form when solvent molecules are incorporated into a solute crystal during crystallization. Given the potential effects of solvate formation on pharmaceutical development, crystallization process development methods that either prevent solvate formation or generate new solvates with enhanced physicochemical properties are highly desirable. Current approaches for predicting solvate forms rely mainly on small-scale experimental screening, which may not reflect material behavior in large-scale production. There is therefore a great need for more efficient solvate formation prediction to support identification of new solid forms, risk assessment, and crystallization process development.

Among the available solvate prediction methods, thermodynamic and other energy-based approaches have advanced through better prediction algorithms. Unfortunately, limitations such as neglecting the effects of molecular interactions in the solid state have led to inaccurate results. Solvate prediction based on statistical models, by contrast, has provided a better platform for systematic analysis of both solvate and non-solvate crystal structures. Recently, researchers have identified machine learning as a promising technique for solid-state property prediction. Because the applicability of these models is limited, understanding the chemical diversity of active pharmaceutical ingredients is highly desirable.

To this end, scientists at Boehringer Ingelheim Pharmaceuticals, Dr. Dongyue Xin, Dr. Nina Gonnella, Dr. Xiaorong He and Dr. Keith Horspool, explored two machine learning models based on the random forest and support vector machine algorithms and validated their potential for predicting solvate formation in pharmaceutical organic molecules. The data used to train and test the models were derived from the Cambridge Structural Database. The work is published in the journal Crystal Growth & Design.

In brief, the research team began by examining different solvate prediction methods for pharmaceutical molecules. Next, the data obtained from the Cambridge Structural Database were filtered to retain only structures resembling pharmaceutical molecules. Nine organic solvents commonly used in the crystallization of pharmaceuticals, each with a large number of solvate and non-solvate structures, were then investigated. Finally, the performance of the best models was tested on selected pharmaceutically relevant molecules.
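The filtering and labeling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the record fields (`refcode`, `drug_like`, `solvents`), the example entries, and the helper `label_for_solvent` are all hypothetical. The choice to treat only solvent-free structures as non-solvates is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrystalRecord:
    """Hypothetical, simplified stand-in for one crystal structure entry."""
    refcode: str                       # placeholder identifier, not a real entry
    drug_like: bool                    # result of some pharmaceutical-likeness filter
    solvents: frozenset = frozenset()  # solvents found in the crystal, if any


def label_for_solvent(records, solvent):
    """Label drug-like entries for one solvent.

    1 = solvate of `solvent`, 0 = structure with no solvent at all.
    Solvates of other solvents are skipped (an assumption of this sketch).
    """
    labels = {}
    for rec in records:
        if not rec.drug_like:
            continue  # drop structures that do not resemble pharmaceuticals
        if solvent in rec.solvents:
            labels[rec.refcode] = 1
        elif not rec.solvents:
            labels[rec.refcode] = 0
    return labels


# Made-up example entries for illustration only.
records = [
    CrystalRecord("AAAAAA", True, frozenset({"methanol"})),
    CrystalRecord("BBBBBB", True),
    CrystalRecord("CCCCCC", False, frozenset({"methanol"})),
    CrystalRecord("DDDDDD", True, frozenset({"ethanol"})),
]

methanol_labels = label_for_solvent(records, "methanol")
print(methanol_labels)
```

Running the same helper once per solvent would give one labeled data set for each of the nine solvents, which is one plausible way to set up a separate classification problem per solvent.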

The machine learning models required only a two-dimensional structure as input. Both the random forest and support vector machine algorithms successfully predicted the solvate formation propensity of organic molecules, with a high success rate of 86% as demonstrated on the twenty selected pharmaceutical molecules. The random forest model, however, performed slightly better than the support vector machine. It is also worth noting that the machine learning models pointed to driving forces that vary with the type of solvate.
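As a rough illustration of how two such classifiers can be compared, the sketch below trains a random forest and a support vector machine on the same data using scikit-learn. Everything specific here is an assumption: the features are synthetic stand-ins for descriptors computed from two-dimensional structures, the labels are synthetic, and the hyperparameters are arbitrary defaults rather than the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 2D-structure descriptors (X) and
# solvate / non-solvate labels (y); not real crystallographic data.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Tree ensembles do not need feature scaling.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # SVMs are scale-sensitive, so standardize features first.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: held-out accuracy = {results[name]:.2f}")
```

The accuracies printed here describe the synthetic data only and say nothing about the 86% success rate reported for the pharmaceutical molecules.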

In summary, the Boehringer Ingelheim Pharmaceuticals researchers presented two useful machine learning models, based on random forests and support vector machines, for predicting solvate formation in pharmaceutical molecules. A collection of 20 pharmaceutical molecules was selected from the literature to validate the performance of the models. Overall, the machine learning models proved to be a promising practical tool for accurate and fast prediction of solvate formation in pharmaceutical molecules. The study therefore provides insights that will enable expansion of the experimental screening data sheets.

About the author

Dongyue Xin received his BS in chemistry from Shandong University (2010) and his PhD in chemistry from Texas A&M University (2016). He joined Boehringer Ingelheim Pharmaceuticals in 2016 and is currently a principal scientist in the Material and Analytical Sciences department, supporting API characterization, predictive modeling and drug product development.

He has published 17 peer-reviewed journal articles in the areas of organic chemistry, NMR spectroscopy and computational chemistry. He is interested in developing new interdisciplinary technologies to address challenging problems in drug development.

About the author

Keith Horspool, Ph.D., is Vice President of External Alternative CMC Development (EACD) at Boehringer Ingelheim Pharmaceuticals, Ridgefield, CT. EACD will be operating as a “biotech within pharma” using a highly flexible integrated network of virtual resources to deliver all CMC requirements for projects, applying alternative development approaches for NCEs and NCE-like new modalities. In particular, the department seeks to explore, exploit and export “new development” from biotech communities and to access innovative technologies enabling future development.

Keith was previously VP of Pharmaceutical Development US and VP of Material and Analytical Sciences, which he established as a new global capability for BI.

Prior to joining BI, Keith worked at Pfizer and AstraZeneca. He has worked in pharma for more than 30 years, with experience managing preformulation, product development, drug delivery and materials science. Keith has diverse experience with a broad range of delivery systems and various routes of delivery, particularly oral, inhalation and parenteral delivery (including depots).

Over the years, he has worked with these various drug delivery technologies from early feasibility through to submission and launch. From a business perspective, he has been involved in simple feasibility contracts, commercial partnerships and technology acquisition. He has a B.Sc. in Pharmacy, and a Ph.D. in Pharmaceutical Chemistry.

About the author

Nina C. Gonnella, Ph.D. is a Senior Associate Director at Boehringer Ingelheim Inc. where she heads a molecular structure and solid form informatics group. She has extensive experience in Pharmaceutical Research and Development, and has led initiatives ranging from structural characterization and development of associated predictive tools to NMR ligand-based screening and in-vitro/in-vivo biological studies.

Nina initiated and led the development of powerful commercial-ready in-silico structure elucidation programs (using quantum chemistry, density functional theory and probability theory) to solve challenging chemical structures, as well as the development and automation of orthogonal-based predictive methods for co-crystal design. She co-founded, organized and chaired a new Gordon Research Conference on “Molecular Structure Elucidation,” taught courses on NMR spectroscopy theory and application, presented national and international lectures, served on scientific review boards, and published over 95 journal articles, two book chapters and two books on LC-SPE-NMR hyphenated technology.

About the author

Xiaorong He has 20 years of experience encompassing drug product development, solid-state sciences, material science, API engineering, and predictive modeling. Her leadership experience ranges from building a new division overseas and creating a network of business clients to leading a large group of multidisciplinary scientists delivering projects and innovation.

In her current role as vice president and department head of Material and Analytical Sciences at Boehringer Ingelheim Pharmaceuticals, she provides leadership and sets strategy for her department to support CMC development of global portfolios, spearheads innovation, and ensures compliance with GMP/GLP regulations.

Xiaorong received her BS in Pharmacy from Beijing Medical University, her MS from the University of Minnesota, her Ph.D. in Pharmaceutical Sciences from Purdue University and her MBA from Western Michigan University. She has published many peer-reviewed papers and book chapters, and holds several patents. She has been volunteering for the United States Pharmacopeia since 2005. She is the chair of the USP General Chapter – Physical Analysis Expert Committee and a member of the Council of Experts for the 2015-2020 cycle.

Xiaorong has been a member of the scientific advisory board of the Journal of Pharmaceutical Sciences since 2013, and is a member of the International Society of Business Leaders.


Xin, D., Gonnella, N., He, X., & Horspool, K. (2019). Solvate Prediction for Pharmaceutical Organic Molecules with Machine Learning. Crystal Growth & Design, 19(3), 1903-1911.
