BERT-based drug structure presentation: A comparison of tokenizers

Bibliographic Details
Main Authors: Davronov, Rifkat, Adilova, Fatima
Format: Conference Proceeding
Language: English
Description
Summary: The BERT framework for molecular graph representation has shown that pre-training a language model, a technique that has proven effective in natural language processing, is also useful in chemistry. Prediction of molecular properties has recently improved with the success of graph neural networks (GNNs), and BERT provides further gains. In this paper, we investigate the efficient representation of molecules with BERT and compare different tokenization methods (BPE-ChemBERT and Smiles-Tokenizer ChemBERT) on molecular property prediction across 11 datasets and targets, comprising 7 regression and 5 classification tasks. However, computational experiments show that ChemBERT performs below the current state of the art on these tasks. The application of transformers to molecular data raises questions that need to be seriously investigated.
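The comparison in the summary hinges on how a SMILES string is split into tokens before BERT sees it: a SMILES-level tokenizer splits on chemically meaningful units (atoms, bonds, ring-closure digits), while a BPE tokenizer learns subword splits from corpus statistics. As a minimal sketch (not the paper's exact tokenizer), the widely used regex pattern for atom-level SMILES tokenization can be illustrated as follows; the function name `smiles_tokenize` is our own:

```python
import re

# Regex-based SMILES tokenizer (an illustrative sketch of the atom-level
# pattern commonly used in chemistry language models). Multi-character
# atoms like "Cl" and "Br", bracketed atoms like "[NH3+]", and ring-closure
# digits each become a single token. A BPE tokenizer, by contrast, would
# derive its splits from the training corpus and may cut across atoms.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def smiles_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom- and bond-level tokens."""
    return SMILES_PATTERN.findall(smiles)

# Aspirin: each atom, bond symbol, and ring digit is one token.
print(smiles_tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))
```

Because the pattern captures every character of a valid SMILES string, concatenating the tokens reconstructs the input, which is a convenient sanity check when comparing tokenizers.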
ISSN: 0094-243X, 1551-7616
DOI: 10.1063/5.0144799