A data science roadmap for open science organizations engaged in early-stage drug discovery

Kristina Edfeldt; Aled M Edwards; Ola Engkvist; Judith Günther; Matthew Hartley; David G Hulcoop; Andrew R Leach; Brian D Marsden; Amelie Menge; Leonie Misquitta; Susanne Müller; Dafydd R Owen; Kristof T Schütt; Nicholas Skelton; Andreas Steffen; Alexander Tropsha; Erik Vernet; Yanli Wang; James Wellnitz; Timothy M Willson; Djork-Arné Clevert; Benjamin Haibe-Kains; Lovisa Holmberg Schiavone; Matthieu Schapira

doi:10.1038/s41467-024-49777-x

Nat Commun. 2024 Jul 5;15(1):5640. doi: 10.1038/s41467-024-49777-x.

Authors

Kristina Edfeldt¹, Aled M Edwards², Ola Engkvist³, Judith Günther⁴, Matthew Hartley⁵, David G Hulcoop⁶,⁷, Andrew R Leach⁵, Brian D Marsden⁸, Amelie Menge⁹, Leonie Misquitta¹⁰, Susanne Müller⁹, Dafydd R Owen¹¹, Kristof T Schütt¹², Nicholas Skelton¹³, Andreas Steffen¹², Alexander Tropsha¹⁴, Erik Vernet¹⁵, Yanli Wang¹⁰, James Wellnitz¹⁴, Timothy M Willson¹⁶, Djork-Arné Clevert¹⁷, Benjamin Haibe-Kains¹⁸,¹⁹,²⁰,²¹, Lovisa Holmberg Schiavone²², Matthieu Schapira²³,²⁴

Affiliations

¹ Structural Genomics Consortium, Department of Medicine, Karolinska University Hospital and Karolinska Institutet, Stockholm, Sweden.
² Structural Genomics Consortium, University of Toronto, Toronto, ON, Canada.
³ Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden & Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden.
⁴ Bayer AG Research and Development, Computational Molecular Design, Berlin, Germany.
⁵ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK.
⁶ Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK.
⁷ European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK.
⁸ Centre for Medicines Discovery, NDM, University of Oxford, Oxford, UK.
⁹ Institute of Pharmaceutical Chemistry, Johann Wolfgang Goethe University, Frankfurt am Main, 60438, Germany & Structural Genomics Consortium (SGC), Buchmann Institute for Life Sciences, Johann Wolfgang Goethe University, Frankfurt am Main, Germany.
¹⁰ National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
¹¹ Pfizer Worldwide Research, Development & Medical, Cambridge, MA, USA.
¹² Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences, Berlin, Germany.
¹³ Department of Discovery Chemistry, Genentech, Inc., South San Francisco, CA, USA.
¹⁴ Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, North Carolina, USA.
¹⁵ Digital Science & Innovation, Novo Nordisk A/S, Maaloev, Denmark.
¹⁶ Structural Genomics Consortium, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
¹⁷ Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences, Berlin, Germany. Djork-Arne.Clevert@pfizer.com.
¹⁸ Structural Genomics Consortium, University of Toronto, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.
¹⁹ Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.
²⁰ Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.
²¹ Vector Institute for Artificial Intelligence, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.
²² Discovery Biology, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden. Lovisa.Holmberg.Schiavone@astrazeneca.com.
²³ Structural Genomics Consortium, University of Toronto, Toronto, ON, Canada. matthieu.schapira@utoronto.ca.
²⁴ Department of Pharmacology & Toxicology, University of Toronto, Toronto, ON, Canada. matthieu.schapira@utoronto.ca.

Abstract

The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, like many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how best to benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.

Publication types

Review

MeSH terms

Artificial Intelligence
Cloud Computing
Data Mining / methods
Data Science* / methods
Databases, Factual
Drug Discovery* / methods
Humans
Information Dissemination / methods
Machine Learning*
