MatSciBERT: A materials domain language model for text mining and information extraction

Bibliographic Details
Published in:	npj computational materials 2022-05, Vol.8 (1), p.1-11, Article 102
Main Authors:	Gupta, Tanishq; Zaki, Mohd; Krishnan, N. M. Anoop; Mausam
Format:	Article
Language:	English
Subjects:	639/301; 639/301/119; Characterization and Evaluation of Materials; Chemistry and Materials Science; Classification; Coders; Computational Intelligence; Data mining; Domain specific languages; Information retrieval; Language; Materials Science; Mathematical and Computational Engineering; Mathematical and Computational Physics; Mathematical Modeling and Industrial Mathematics; Natural language processing; Reviews; Theoretical

Description
Summary:	A large amount of materials science knowledge is generated and stored as text published in peer-reviewed scientific literature. While recent developments in natural language processing, such as Bidirectional Encoder Representations from Transformers (BERT) models, provide promising information extraction tools, these models may yield suboptimal results when applied to the materials domain, since they are not trained on materials-science-specific notation and jargon. Here, we present a materials-aware language model, namely MatSciBERT, trained on a large corpus of peer-reviewed materials science publications. We show that MatSciBERT outperforms SciBERT, a language model trained on a scientific corpus, and establishes state-of-the-art results on three downstream tasks: named entity recognition, relation classification, and abstract classification. We make the pre-trained weights of MatSciBERT publicly accessible for accelerated materials discovery and information extraction from materials science texts.
ISSN:	2057-3960
DOI:	10.1038/s41524-022-00784-w