Evaluating data mining procedures: techniques for generating artificial data sets
Published in: | Information and software technology 1999-06, Vol.41 (9), p.579-587 |
---|---|
Format: | Article |
Language: | English |
Summary: | In this article, we discuss the need to evaluate the performance of data mining procedures and argue that tests done with real data sets cannot provide all the information needed for a thorough assessment of their performance characteristics. We argue that artificial data sets are therefore essential. After a discussion of the desirable characteristics of such artificial data, we describe two pseudo-random generators. The first is based on the multi-variate normal distribution and gives the investigator full control of the degree of correlation between the variables in the artificial data sets. The second is inspired by fractal techniques for synthesizing artificial landscapes and can produce data whose classification complexity can be controlled by a single parameter. We conclude with a discussion of the additional work necessary to achieve the ultimate goal of a method of matching data sets to the most appropriate data mining technique. |
---|---|
ISSN: | 0950-5849 1873-6025 |
DOI: | 10.1016/S0950-5849(99)00021-X |
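
The first generator in the summary is described only as being based on the multivariate normal distribution with investigator-controlled correlation. As a minimal sketch of that general idea, and not the authors' actual procedure, the Python below draws samples whose variables all share one pairwise correlation. The function name `make_correlated_data`, its parameters, and the equal-correlation covariance structure are assumptions made for illustration.

```python
import numpy as np


def make_correlated_data(n_samples, correlation, n_features=3, seed=0):
    """Draw zero-mean, unit-variance normal samples with equal pairwise correlation.

    Hypothetical helper for illustration; not taken from the article.
    """
    # An equicorrelation matrix is a valid covariance only in this range.
    if not (-1.0 / (n_features - 1) <= correlation <= 1.0):
        raise ValueError("correlation is outside the valid range for this n_features")

    # Covariance matrix: 1 on the diagonal, `correlation` everywhere else.
    cov = np.full((n_features, n_features), correlation, dtype=float)
    np.fill_diagonal(cov, 1.0)

    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(n_features), cov, size=n_samples)


if __name__ == "__main__":
    data = make_correlated_data(n_samples=5000, correlation=0.8)
    # The sample correlation matrix should have off-diagonal entries near 0.8.
    print(np.corrcoef(data, rowvar=False).round(2))
```

Varying `correlation` while holding the other settings fixed gives a family of data sets that differ in one known property, which is the kind of control the summary argues real data sets cannot offer.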