Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets

Bibliographic Details
Published in: Expert Systems with Applications, 2021-11, Vol. 181, p. 115119, Article 115119
Main Authors: Pota, Marco; Ventura, Mirko; Fujita, Hamido; Esposito, Massimo
Format: Article
Language: English
Description
Summary:
• Sentiment analysis of tweets by a state-of-the-art classification model (BERT).
• Evaluation of tweet pre-processing, to avoid noise and exploit hidden information.
• Available data in two languages are considered, i.e., English and Italian.
• The most convenient strategy for pre-processing tweets is identified.
• The state of the art is improved in both languages for tweet sentiment analysis.

Social media offer a large amount of information that can be exploited in many fields of research. However, while Natural Language Processing methods achieve good results on well-formed datasets of written text with clear syntax, social media text is written in informal language, with unstructured syntax and peculiar symbols; it therefore requires dedicated processing approaches. This paper addresses the task of sentiment analysis of tweets. In particular, it analyzes tweet pre-processing, which aims to avoid the noise introduced by web constructs such as URLs and mentions and by other text fragments, and to exploit the information hidden in symbols such as emoticons, emojis and hashtags. A number of experiments, performed with a state-of-the-art classification model (BERT), are designed to evaluate many currently available pre-processing operations in terms of the statistical significance of their influence on sentiment analysis performance. Data in two languages, English and Italian, are considered in order to also evaluate the dependence on language. The results identify the most convenient strategy for pre-processing tweets, and thereby improve the state of the art in both languages for the considered sentiment analysis task.
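The kinds of pre-processing operations named in the summary (handling URLs, mentions, hashtags, emoticons and emojis) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the placeholder tokens `<url>` and `<user>`, the tiny `SYMBOL_WORDS` table, and the function names are all hypothetical choices made for illustration, and the record does not say which operations the paper found most effective.

```python
import re

# Assumed patterns for web constructs that typically act as noise.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

# Hypothetical, deliberately tiny mapping from symbols to words.
# A real system would use a much larger emoticon/emoji lexicon.
SYMBOL_WORDS = {
    ":-)": "happy",
    ":)": "happy",
    ":-(": "sad",
    ":(": "sad",
    "\U0001F600": "grinning face",  # emoji U+1F600
}


def split_hashtag(match):
    """Drop the '#' and split a CamelCase hashtag body into words."""
    body = match.group(1)
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", body)


def preprocess_tweet(text):
    """Apply the illustrative operations in a fixed order."""
    text = URL_RE.sub("<url>", text)            # replace URLs with a placeholder
    text = MENTION_RE.sub("<user>", text)       # replace mentions with a placeholder
    text = HASHTAG_RE.sub(split_hashtag, text)  # expose the words inside hashtags
    for symbol, word in SYMBOL_WORDS.items():   # translate emoticons and emojis
        text = text.replace(symbol, f" {word} ")
    return " ".join(text.split())               # normalize whitespace


if __name__ == "__main__":
    print(preprocess_tweet("@anna loved it :) #BestDayEver https://t.co/abc"))
```

The cleaned string would then be passed to a tokenizer and classifier such as BERT. URLs are replaced before emoticons so that the `:/` inside `https://` is never mistaken for a symbol.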
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2021.115119