
Cost Effective Annotation Framework Using Zero-Shot Text Classification


Bibliographic Details
Main Authors: Kasthuriarachchy, Buddhika, Chetty, Madhu, Shatte, Adrian, Walls, Darren
Format: Conference Proceeding
Language: English
Description
Summary: Manual, high-quality annotation of social media data has enabled companies and researchers to develop improved implementations using natural language processing. However, human text annotation is expensive and time-consuming. Crowd-sourcing platforms such as Amazon's Mechanical Turk (MTurk) can be leveraged to create large training corpora for text classification tasks using social media data. Nevertheless, the quality of annotations can vary significantly, depending on the interpretations and motivations of the annotators completing the tasks. Further, the cost of labelling data through MTurk increases if target messages are small and contain a significant amount of noise (e.g. promotional messages on Twitter). In this work, we propose a new annotation framework to create high-quality human-annotated datasets for text classification from social media data. We present a pre-annotation technique based on zero-shot text classification that reduces the adverse effects of a highly skewed distribution of data across target classes. The proposed framework significantly reduces cost and time while maintaining the quality of the annotations. Being generic, it can be applied to annotating text data from any discipline. Our experiment annotating Twitter data with the proposed framework shows a cost reduction of 80% with no compromise in quality.
ISSN: 2161-4407
DOI: 10.1109/IJCNN52387.2021.9534335
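
The summary does not spell out how the pre-annotation step works, so the following is only a minimal sketch of one plausible reading: score each message against the target labels and forward only messages that clear a confidence threshold to human annotators. Everything here is an assumption for illustration: the function names, the threshold value, the example messages and labels, and above all `toy_zero_shot_scores`, a keyword-matching stand-in for a real zero-shot classifier. The resulting ratio comes from the made-up data and is not the paper's reported result.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, List[str]], Dict[str, float]]


def toy_zero_shot_scores(text: str, labels: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a zero-shot classifier.

    Scores a label 1.0 if it appears as a word in the text, else 0.0.
    A real setup would replace this with an actual zero-shot model.
    """
    words = set(text.lower().split())
    return {label: (1.0 if label.lower() in words else 0.0) for label in labels}


def pre_annotate(
    messages: List[str],
    labels: List[str],
    scorer: Scorer,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split messages into (message, suggested label) pairs to send to
    human annotators and messages that are skipped as likely noise."""
    to_annotate: List[Tuple[str, str]] = []
    skipped: List[str] = []
    for msg in messages:
        scores = scorer(msg, labels)
        best_label = max(scores, key=scores.get)
        if scores[best_label] >= threshold:
            to_annotate.append((msg, best_label))
        else:
            skipped.append(msg)
    return to_annotate, skipped


def cost_reduction(total: int, sent: int) -> float:
    """Fraction of per-message labelling cost avoided, assuming a flat
    price per message sent to annotators."""
    return 1.0 - sent / total


# Made-up example data: one on-topic message among promotional noise.
messages = [
    "Feeling a lot of anxiety before my exam",
    "Buy one get one free today only",
    "Huge discount on sneakers this weekend",
    "Follow us for daily deals",
    "Click the link to win a prize",
]
labels = ["anxiety", "depression"]

to_annotate, skipped = pre_annotate(messages, labels, toy_zero_shot_scores)
reduction = cost_reduction(len(messages), len(to_annotate))
print(to_annotate)
print(f"skipped {len(skipped)} of {len(messages)} messages")
```

Passing the scorer as a parameter keeps the filtering logic independent of any particular model, so the toy function can be swapped for a real zero-shot classifier without changing `pre_annotate`.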