Loading…

A Framework to Generate Synthetic Multi-label Datasets

A controlled environment based on known properties of the dataset used by a learning algorithm is useful to empirically evaluate machine learning algorithms. Synthetic (artificial) datasets are used for this purpose. Although there are publicly available frameworks to generate synthetic single-label...

Full description

Saved in:
Bibliographic Details
Published in:Electronic notes in theoretical computer science 2014-02, Vol.302, p.155-176
Main Authors: Tomás, Jimena Torres, Spolaôr, Newton, Cherman, Everton Alvares, Monard, Maria Carolina
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:A controlled environment based on known properties of the dataset used by a learning algorithm is useful to empirically evaluate machine learning algorithms. Synthetic (artificial) datasets are used for this purpose. Although there are publicly available frameworks to generate synthetic single-label datasets, this is not the case for multi-label datasets, in which each instance is associated with a set of labels usually correlated. This work presents Mldatagen, a multi-label dataset generator framework we have implemented, which is publicly available to the community. Currently, two strategies have been implemented in Mldatagen: hypersphere and hypercube. For each label in the multi-label dataset, these strategies randomly generate a geometric shape (hypersphere or hypercube), which is populated with points (instances) randomly generated. Afterwards, each instance is labeled according to the shapes it belongs to, which defines its multi-label. Experiments with a multi-label classification algorithm in six synthetic datasets illustrate the use of Mldatagen.
ISSN:1571-0661
1571-0661
DOI:10.1016/j.entcs.2014.01.025