
Machine Learning Prediction of Nine Molecular Properties Based on the SMILES Representation of the QM9 Quantum-Chemistry Dataset

Bibliographic Details
Published in: The journal of physical chemistry. A, Molecules, spectroscopy, kinetics, environment, & general theory, 2020-11, Vol. 124 (47), p. 9854-9866
Main Authors: Pinheiro, Gabriel A; Mucelini, Johnatan; Soares, Marinalva D; Prati, Ronaldo C; Da Silva, Juarez L. F; Quiles, Marcos G
Format: Article
Language: English
Description
Summary: Machine learning (ML) models can potentially accelerate the discovery of tailored materials by learning a function that maps chemical compounds to their respective target properties. In this realm, a key step is encoding the molecular systems for the ML model, where the choice of molecular representation plays a crucial role. Most representations are based on atomic coordinates (structure); however, this can increase the computational cost of ML training and prediction. Herein, we investigate the impact of choosing coordinate-free descriptors based on the Simplified Molecular Input Line Entry System (SMILES) representation, which can substantially reduce the computational cost of ML predictions. We evaluate the prediction performance of a feed-forward neural network (FNN) model across five feature selection methods and nine ground-state properties (including energetic, electronic, and thermodynamic properties) from a public data set composed of ∼130k organic molecules. Our best results reached a mean absolute error of ∼0.05 eV, close to chemical accuracy, for the atomization energies (internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, and free energy at 298.15 K). Moreover, for the atomization energies, the out-of-sample error was nine times lower than that of the same FNN model trained with the Coulomb matrix, a traditional coordinate-based descriptor. Furthermore, our results show how the model’s accuracy is limited by employing such a low-computational-cost representation, which carries less information about the molecular structure than most state-of-the-art methods.
ISSN: 1089-5639, 1520-5215
DOI: 10.1021/acs.jpca.0c05969