A Comprehensive Data Preprocessing Framework towards Improving Internet Chinese Medical Data Quality
Format: Conference Proceeding
Language: English
Summary: Medical large language models (MLLMs) have attracted increasing attention recently. Data is the key to building MLLMs, and the most common approach is to obtain it from online healthcare platforms. Raw Internet data contains various types of noise, yet current data preprocessing methods are costly, incomplete, and ineffective. In this paper, we propose a comprehensive data preprocessing framework to reduce the noise in Internet data as much as possible. Specifically, our framework divides noise into four categories: chaotic data formats, low data quality, data duplication, and personal privacy, and designs one module to reduce each type of noise. First, all data passes through a data unification module so that subsequent processing operates on a stable data form. Then, a quality filtering module uses keyword matching, text statistics, and metric features to detect and eliminate low-quality elements. Next, a data deduplication module removes redundancy at both the text level and the line level, alleviating potential interference with model training. Lastly, personal identity information is removed to protect user privacy. To validate the usefulness of our framework, we select the MedDialog-CN dataset, a typical Internet Chinese medical dataset, as a testbed with three typical language models: BERT-GPT, DialoGPT, and Transformer. According to automatic and manual experiments, our framework filters out 26.84% of the noisy data in MedDialog-CN, and the performance of all three models improves when the framework is applied.
ISSN: 2159-1288
DOI: 10.1109/ICCEA62105.2024.10603802
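The four-stage pipeline described in the summary (data unification, quality filtering, deduplication, privacy removal) could be sketched roughly as follows. This is a minimal illustration only: the function names, the fixed {question, answer} record form, the spam keywords, the length threshold, and the phone-number regex are all assumptions made for the sketch, not details taken from the paper.

```python
import re

# Illustrative spam keywords and PII pattern -- assumptions, not the paper's lists.
SPAM_KEYWORDS = {"广告", "advertisement"}
PHONE_RE = re.compile(r"\b\d{11}\b")  # crude 11-digit mobile-number pattern

def unify(record: dict) -> dict:
    """Stage 1 (data unification): normalize every raw record to a stable form."""
    return {"question": record.get("question", "").strip(),
            "answer": record.get("answer", "").strip()}

def quality_ok(record: dict, min_len: int = 5) -> bool:
    """Stage 2 (quality filtering): keyword matching plus simple text statistics."""
    text = record["question"] + record["answer"]
    if any(kw in text for kw in SPAM_KEYWORDS):
        return False
    return len(record["question"]) >= min_len and len(record["answer"]) >= min_len

def deduplicate(records):
    """Stage 3 (deduplication): exact text-level dedup; line-level dedup omitted."""
    seen, out = set(), []
    for r in records:
        key = (r["question"], r["answer"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def scrub_pii(record: dict) -> dict:
    """Stage 4 (privacy): mask phone-number-like digit runs in every field."""
    return {k: PHONE_RE.sub("[PHONE]", v) for k, v in record.items()}

def preprocess(raw_records):
    """Run the four stages in the order the summary describes."""
    unified = [unify(r) for r in raw_records]
    filtered = [r for r in unified if quality_ok(r)]
    return [scrub_pii(r) for r in deduplicate(filtered)]
```

In practice the paper's quality filter also uses metric features, and its deduplication works at both text and line level; a real implementation would extend the corresponding stages rather than the simple exact-match and keyword checks shown here.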