Assessment of fine-tuned large language models for real-world chemistry and material science applications

Joren Van Herck; María Victoria Gil; Kevin Maik Jablonka; Alex Abrudan; Andy S Anker; Mehrdad Asgari; Ben Blaiszik; Antonio Buffo; Leander Choudhury; Clemence Corminboeuf; Hilal Daglar; Amir Mohammad Elahi; Ian T Foster; Susana Garcia; Matthew Garvin; Guillaume Godin; Lydia L Good; Jianan Gu; Noémie Xiao Hu; Xin Jin; Tanja Junkers; Seda Keskin; Tuomas P J Knowles; Ruben Laplaza; Michele Lessona; Sauradeep Majumdar; Hossein Mashhadimoslem; Ruaraidh D McIntosh; Seyed Mohamad Moosavi; Beatriz Mouriño; Francesca Nerli; Covadonga Pevida; Neda Poudineh; Mahyar Rajabi-Kochi; Kadi L Saar; Fahimeh Hooriabad Saboor; Morteza Sagharichiha; K J Schmidt; Jiale Shi; Elena Simone; Dennis Svatunek; Marco Taddei; Igor Tetko; Domonkos Tolnai; Sahar Vahdatifar; Jonathan Whitmer; D C Florian Wieland; Regine Willumeit-Römer; Andreas Züttel; Berend Smit

doi:10.1039/d4sc04401k

Chem Sci. 2024 Nov 22. doi: 10.1039/d4sc04401k. Online ahead of print.

Authors

Joren Van Herck¹, María Victoria Gil¹,², Kevin Maik Jablonka¹,³,⁴, Alex Abrudan⁵, Andy S Anker⁶,⁷, Mehrdad Asgari⁸, Ben Blaiszik⁹,¹⁰, Antonio Buffo¹¹, Leander Choudhury¹², Clemence Corminboeuf¹³, Hilal Daglar¹⁴, Amir Mohammad Elahi¹, Ian T Foster⁹,¹⁰, Susana Garcia¹⁵, Matthew Garvin¹⁵, Guillaume Godin¹⁶, Lydia L Good⁵,¹⁷, Jianan Gu¹⁸, Noémie Xiao Hu¹, Xin Jin¹, Tanja Junkers¹⁹, Seda Keskin¹⁴, Tuomas P J Knowles⁵,²⁰, Ruben Laplaza¹³, Michele Lessona¹¹, Sauradeep Majumdar¹, Hossein Mashhadimoslem²¹, Ruaraidh D McIntosh²², Seyed Mohamad Moosavi²³, Beatriz Mouriño¹, Francesca Nerli²⁴, Covadonga Pevida², Neda Poudineh¹⁵, Mahyar Rajabi-Kochi²³, Kadi L Saar⁵, Fahimeh Hooriabad Saboor²⁵, Morteza Sagharichiha²⁶, K J Schmidt⁹, Jiale Shi²⁷,²⁸, Elena Simone¹¹, Dennis Svatunek²⁹, Marco Taddei²⁴, Igor Tetko¹⁶,³⁰, Domonkos Tolnai¹⁸, Sahar Vahdatifar²⁶, Jonathan Whitmer²⁸,³¹, D C Florian Wieland¹⁸, Regine Willumeit-Römer¹⁸, Andreas Züttel³², Berend Smit¹

Affiliations

¹ Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland Berend.Smit@epfl.ch.
² Instituto de Ciencia y Tecnología del Carbono (INCAR), CSIC Francisco Pintado Fe 26 33011 Oviedo Spain.
³ Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena Humboldtstrasse 10 07743 Jena Germany.
⁴ Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena) Lessingstrasse 12-14 07743 Jena Germany.
⁵ Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK.
⁶ Department of Energy Conversion and Storage, Technical University of Denmark DK-2800 Kgs. Lyngby Denmark.
⁷ Department of Chemistry, University of Oxford Oxford OX1 3TA UK.
⁸ Department of Chemical Engineering & Biotechnology, University of Cambridge Philippa Fawcett Drive Cambridge CB3 0AS UK.
⁹ Department of Computer Science, University of Chicago Chicago IL 60637 USA.
¹⁰ Data Science and Learning Division, Argonne National Laboratory Lemont IL 60439 USA.
¹¹ Department of Applied Science and Technology (DISAT), Politecnico di Torino 10129 Torino Italy.
¹² Laboratory of Catalysis and Organic Synthesis (LCSO), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland.
¹³ Laboratory for Computational Molecular Design (LCMD), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland.
¹⁴ Department of Chemical and Biological Engineering, Koç University Rumelifeneri Yolu, Sariyer 34450 Istanbul Turkey.
¹⁵ The Research Centre for Carbon Solutions (RCCS), School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK.
¹⁶ BIGCHEM GmbH Valerystraße 49 85716 Unterschleißheim Germany.
¹⁷ Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health Bethesda Maryland 20892 USA.
¹⁸ Institute of Metallic Biomaterials, Helmholtz Zentrum Hereon Geesthacht Germany.
¹⁹ Polymer Reaction Design Group, School of Chemistry, Monash University Clayton VIC 3800 Australia.
²⁰ Cavendish Laboratory, Department of Physics, University of Cambridge Cambridge CB3 0HE UK.
²¹ Department of Chemical Engineering, University of Waterloo Waterloo N2L3G1 Canada.
²² Institute of Chemical Sciences, School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK.
²³ Chemical Engineering & Applied Chemistry, University of Toronto Toronto Ontario M5S 3E5 Canada.
²⁴ Dipartimento di Chimica e Chimica Industriale, Unità di Ricerca INSTM, Università di Pisa Via Giuseppe Moruzzi 13 56124 Pisa Italy.
²⁵ Chemical Engineering Department, University of Mohaghegh Ardabili P. O. Box 179 Ardabil Iran.
²⁶ Department of Chemical Engineering, College of Engineering, University of Tehran Tehran Iran.
²⁷ Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA 02139 USA.
²⁸ Department of Chemical and Biomolecular Engineering, University of Notre Dame Notre Dame Indiana 46556 USA.
²⁹ Institute of Applied Synthetic Chemistry, TU Wien Getreidemarkt 9 1060 Vienna Austria.
³⁰ Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH) Ingolstädter Landstraße 1 85764 Neuherberg Germany.
³¹ Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame Indiana 46556 USA.
³² Laboratory of Materials for Renewable Energy (LMER), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland.

Abstract

The current generation of large language models (LLMs) has limited chemical knowledge. Recently, it has been shown that these LLMs can learn and predict chemical properties through fine-tuning. Using natural language to train machine learning models opens doors to a wider chemical audience, as field-specific featurization techniques can be omitted. In this work, we explore the potential and limitations of this approach. We study the performance of three fine-tuned open-source LLMs (GPT-J-6B, Llama-3.1-8B, and Mistral-7B) on a range of chemical questions. We benchmark their performance against "traditional" machine learning models and find that, in most cases, the fine-tuning approach is superior for a simple classification problem. Depending on the size of the dataset and the type of questions, we also successfully address more sophisticated problems. The most important conclusions of this work are that, for all datasets considered, their conversion into an LLM fine-tuning training set is straightforward and that fine-tuning with even relatively small datasets leads to predictive models. These results suggest that the systematic use of LLMs to guide experiments and simulations will be a powerful technique in any research study, significantly reducing unnecessary experiments or computations.
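
As an illustration of what converting a dataset into a fine-tuning training set for a simple classification problem might look like, the following is a minimal sketch and not the authors' pipeline. It assumes a tabular dataset of (text description, numeric property) rows and binarizes the property at a threshold. The function name `to_prompt_completion`, the question template, the threshold, and the toy records are all hypothetical and chosen only for this example.

```python
import json


def to_prompt_completion(records, question_template, threshold):
    """Turn (name, value) rows into prompt/completion pairs.

    The numeric value is binarized: "1" if value >= threshold, else "0".
    This is an assumed, simplified scheme for illustration only.
    """
    examples = []
    for name, value in records:
        label = 1 if value >= threshold else 0
        examples.append(
            {
                "prompt": question_template.format(name=name),
                "completion": str(label),
            }
        )
    return examples


# Hypothetical toy rows: (text representation of a molecule, made-up property value).
records = [("CCO", 0.8), ("c1ccccc1", 0.2), ("CC(=O)O", 0.5)]

examples = to_prompt_completion(
    records,
    question_template="Is the property of {name} high?",
    threshold=0.5,
)

# One JSON object per line (JSON Lines), a common input layout for fine-tuning tools.
jsonl = "\n".join(json.dumps(example) for example in examples)
print(jsonl)
```

The point of the sketch is that no field-specific featurization is involved: each row is expressed as a natural-language question and a short answer token.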