
Zebra: A novel method for optimizing text classification query in overload scenario


Bibliographic Details
Published in: World Wide Web (Bussum), 2023-05, Vol. 26 (3), p. 905-931
Main Authors: Yu, Tianhuan; He, Zhenying; Yang, Zhihui; Ye, Fei; Fan, Yuankai; Jing, Yinan; Zhang, Kai; Wang, X. Sean
Format: Article
Language: English
Description
Summary: Text classification is a crucial task in text mining, and it can be embedded in queries through user-defined functions (UDFs). In many web applications, such as Twitter mining or real-time Weibo processing, the volume of text to be processed is enormous and the system is frequently overloaded. In a streaming scenario, the query delays caused by overload degrade the user experience. This paper focuses on queries that apply text classification to streaming data. We propose Zebra, a novel method that uses progressive pipelines to optimize queries under overload. The core module of Zebra is a probabilistic filter, which discards a large share of the text data based on the semantic information in the query predicate. We train weak classifiers as filters, using data labeled by brute-force pipelines. We then use a parameter search to choose a suitable filter with the best settings and apply it to the progressive pipelines. Experiments with several text workloads on real-world datasets show that Zebra stably achieves higher accuracy while answering queries in time.
ISSN: 1386-145X, 1573-1413
DOI: 10.1007/s11280-022-01061-y
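
Illustration: The summary describes a cheap probabilistic filter placed in front of an expensive classification UDF, with a parameter search choosing the filter settings from labeled data. The sketch below shows that general idea only and is not the authors' implementation. Every name in it (weak_score, choose_threshold, filtered_query, expensive_udf), the keyword-overlap scorer standing in for a trained weak classifier, and the recall target are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Set, Tuple


def weak_score(text: str, keywords: Set[str]) -> float:
    """Hypothetical weak classifier: fraction of predicate keywords found in the text."""
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords)


def choose_threshold(scores: Sequence[float], labels: Sequence[bool],
                     min_recall: float) -> float:
    """Parameter search: highest threshold whose recall on labeled data
    stays at or above min_recall (a higher threshold filters out more text)."""
    positives = sum(labels)
    best = min(scores)  # fallback: keep everything
    for t in sorted(set(scores)):
        kept = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        recall = kept / positives if positives else 1.0
        if recall >= min_recall:
            best = t
    return best


def filtered_query(texts: Sequence[str], threshold: float, keywords: Set[str],
                   udf: Callable[[str], bool]) -> Tuple[List[str], int]:
    """Run the cheap filter first, then the expensive UDF only on survivors."""
    survivors = [t for t in texts if weak_score(t, keywords) >= threshold]
    return [t for t in survivors if udf(t)], len(survivors)


def expensive_udf(text: str) -> bool:
    """Stand-in for a costly text classifier invoked as a query UDF."""
    return "goal" in text.lower().split()


# Labels are assumed to come from an unfiltered (brute-force) run of the UDF.
labeled = [
    ("great goal in the match", True),
    ("the match ended with a late goal", True),
    ("stock prices fell today", False),
    ("rain expected tomorrow", False),
    ("match point in tennis", False),
]
keywords = {"goal", "match"}
scores = [weak_score(text, keywords) for text, _ in labeled]
labels = [label for _, label in labeled]
threshold = choose_threshold(scores, labels, min_recall=1.0)

stream = ["what a goal that match was", "weather is nice", "tennis match today"]
results, n_survivors = filtered_query(stream, threshold, keywords, expensive_udf)
print(threshold, n_survivors, results)
```

In this toy run the expensive UDF is called on one of the three streamed texts. Lowering min_recall lets the search pick a stricter threshold, trading some accuracy for less work under overload.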