SSA: A More Humanized Automatic Evaluation Method for Open Dialogue Generation

Bibliographic Details
Main Authors:	Zhan, Zhiqiang; Hou, Zifeng; Yang, Qichuan; Zhao, Jianyu; Zhang, Yang; Hu, Changjian
Format:	Conference Proceeding
Language:	English
Subjects:	Coherence; Computational modeling; evaluation; generative dialogue; Measurement; Neural networks; semantic coherence; Semantics; syntactic validity; Syntactics; Task analysis

Description
Summary:	Dialogue generation has been gaining ever-increasing attention, and various models have been proposed and adopted in many fields in recent years. How to evaluate their performance is critical. However, current evaluation metrics tend to be insufficient because of their simplicity and crudeness, resulting in weak correlation with human judgements. To solve this issue, we propose an automatic and comprehensive evaluation metric, which consists of three assessment criteria: Semantic Coherence, Syntactic Validity and Ability of Expression (SSA). The first two criteria evaluate the generations from semantic and syntactic aspects, respectively, at the sentence level, and the last one evaluates overall performance at the model level. With two generative models, we conduct experiments on three datasets: Twitter, Subtitle and Lenovo. Compared with previous metrics such as BLEU, METEOR and ROUGE, the correlation coefficient between SSA and human judgements is increased by 0.23-0.35, i.e. relative improvements of 324%-864%. The experimental results demonstrate that SSA correlates more strongly with human judgements in the evaluation of open dialogue generation. Additionally, SSA is able to evaluate the semantic coherence and syntactic validity of generations accurately. More importantly, the evaluation models can be trained without human annotations. Thus, SSA is flexible and extensible to different datasets.
ISSN:	2161-4407
DOI:	10.1109/IJCNN.2019.8851960