
A Comprehensive Data Preprocessing Framework towards Improving Internet Chinese Medical Data Quality

Bibliographic Details
Main Authors: Zhang, Chong, Zhan, Yibing, Zhong, Yunzhou, Ni, Jun, Zhu, Jianqing, Zan, Changtong, Tao, Dapeng
Format: Conference Proceeding
Language: English
Description
Summary: Medical large language models (MLLMs) have attracted increasing attention recently. Data is the key to building MLLMs, and the most common approach is to obtain it from online healthcare platforms. Raw Internet data contains various types of noise; however, current data preprocessing methods are costly, incomplete, and ineffective. In this paper, we propose a comprehensive data preprocessing framework to reduce the noise in Internet data as much as possible. Specifically, our framework divides noise into four categories: chaotic data formats, low data quality, data duplication, and personal privacy, and designs one module to reduce each type of noise. First, all data pass through the data unification module so that subsequent processing operates on a stable data form. Then, keyword matching, text statistics, and metric features are used in a quality filtering module to detect and eliminate low-quality elements. Subsequently, a data deduplication module removes redundancy at the text level and the line level, alleviating potential interference with model training. Lastly, personally identifiable information is removed to protect user privacy. To validate the usefulness of our data preprocessing framework, we select the MedDialog-CN dataset, a typical Internet Chinese medical dataset, as a testbed with three typical language models: BERT-GPT, DialoGPT, and Transformer. According to automatic and manual experiments, our data preprocessing framework filters out 26.84% of the data in MedDialog-CN as noise, and the performance of all three models improves when our framework is applied.
ISSN: 2159-1288
DOI: 10.1109/ICCEA62105.2024.10603802
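The full text is available only on request, but the abstract already describes how the four modules fit together: data unification, quality filtering via keyword matching and text statistics, line- and text-level deduplication, and removal of personal identity information. The following is a minimal, hypothetical Python sketch of such a pipeline; every function name, keyword list, threshold, and regular expression is an assumption made for illustration, not the authors' published implementation.

```python
# Hypothetical sketch of the four-stage preprocessing pipeline described in
# the abstract: (1) data unification, (2) quality filtering, (3) deduplication,
# (4) privacy removal. Thresholds and patterns are illustrative assumptions.
import hashlib
import re
import unicodedata


# --- 1. Data unification: normalize encoding and whitespace ---
def unify(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # e.g. full-width -> half-width forms
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    return text.strip()


# --- 2. Quality filtering: keyword matching and simple text statistics ---
AD_KEYWORDS = ("点击链接", "广告", "挂号链接")     # assumed spam/advertisement keywords
def is_low_quality(text: str, min_len: int = 10) -> bool:
    if len(text) < min_len:                       # too short to be a real dialogue turn
        return True
    if any(k in text for k in AD_KEYWORDS):       # advertisement-like content
        return True
    han = sum(1 for c in text if "\u4e00" <= c <= "\u9fff")
    return han / max(len(text), 1) < 0.3          # mostly non-Chinese characters


# --- 3. Deduplication: exact matching at the line level and the text level ---
def dedupe(records: list[str]) -> list[str]:
    seen, out = set(), []
    for rec in records:
        lines = [ln for ln in rec.splitlines() if ln.strip()]
        uniq_lines = list(dict.fromkeys(lines))   # drop repeated lines, keep order
        key = hashlib.md5("\n".join(uniq_lines).encode("utf-8")).hexdigest()
        if key not in seen:                       # drop repeated whole records
            seen.add(key)
            out.append("\n".join(uniq_lines))
    return out


# --- 4. Privacy removal: mask phone numbers and ID-like strings ---
def scrub_pii(text: str) -> str:
    text = re.sub(r"(?<!\d)1[3-9]\d{9}(?!\d)", "[PHONE]", text)   # mobile numbers
    text = re.sub(r"(?<!\d)\d{17}[\dXx](?!\d)", "[ID]", text)     # 18-digit ID numbers
    return text


def preprocess(records: list[str]) -> list[str]:
    unified = [unify(r) for r in records]
    filtered = [r for r in unified if not is_low_quality(r)]
    return [scrub_pii(r) for r in dedupe(filtered)]


if __name__ == "__main__":
    raw = [
        "医生：你好，请问有什么症状？\n医生：你好，请问有什么症状？",
        "点击链接挂号",
        "患者：咳嗽三天，电话13812345678",
    ]
    print(preprocess(raw))
```

Run as a script, this drops the advertisement-like record, collapses the repeated doctor line, and masks the phone number. A faithful reproduction of the paper would additionally implement the metric-feature filters and whatever text-level similarity matching the authors use; those details are not recoverable from the abstract alone.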