How to design the fair experimental classifier evaluation

Bibliographic Details
Published in: Applied Soft Computing, 2021-06, Vol. 104, p. 107219, Article 107219
Main Authors: Stapor, Katarzyna, Ksieniewicz, Paweł, García, Salvador, Woźniak, Michał
Format: Article
Language:English
Description
Summary: Many researchers working on classification problems evaluate the quality of developed algorithms on the basis of computer experiments. The conclusions drawn from them are usually supported by statistical analysis and a chosen experimental protocol. Statistical tests are widely used to confirm whether the considered methods significantly outperform reference classifiers. Usually, the tests are applied to stratified datasets, which raises the question of whether the data folds used for classification are really drawn at random and how well the statistical analysis supports robust conclusions. Unfortunately, some scientists do not grasp the real meaning of the obtained results and overinterpret them, failing to see that inappropriate use of such analytical tools may lead them into a trap. This paper aims to expose the weaknesses of commonly used experimental protocols and to discuss whether we can really trust such an evaluation methodology, whether all presented evaluations are fair, and whether it is possible to manipulate experimental results using well-known statistical evaluation methods. We show that it is possible to select only those results that confirm the experimenter’s expectations, and we suggest what can be done to avoid such likely unethical behavior. The paper closes with recommendations on improving an experimental protocol so as to design a fair experimental classifier evaluation.

Highlights:
• Presents the weaknesses of commonly used experimental protocols.
• Discusses whether all reported evaluations are always fair.
• Demonstrates how experimental results can be manipulated using well-known statistical evaluation methods.
• Shows that it is possible to select only those results that confirm the experimenter’s expectations.
• Recommends how to design a fair experimental classifier evaluation and avoid likely unethical behavior.
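The following is a minimal, hypothetical sketch, not code from the paper: it illustrates how the random seed used to generate stratified folds can sway the verdict of a paired significance test, which is one form of the result-picking the summary warns about. The synthetic dataset, the two classifiers, and the Wilcoxon signed-rank test are all illustrative assumptions.

import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and two off-the-shelf classifiers (illustrative choices).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf_a, clf_b = GaussianNB(), DecisionTreeClassifier(random_state=0)

# Re-split the same data with 30 different seeds and test each split.
p_values = []
for seed in range(30):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores_a = cross_val_score(clf_a, X, y, cv=cv)
    scores_b = cross_val_score(clf_b, X, y, cv=cv)
    # Paired test on per-fold accuracies (a common, if debatable, practice).
    _, p = wilcoxon(scores_a, scores_b)
    p_values.append(p)

p_values = np.array(p_values)
print(f"significant at alpha=0.05 for {(p_values < 0.05).sum()}/30 seeds")
print(f"p-value range: {p_values.min():.4f} .. {p_values.max():.4f}")

If the significance verdict flips across seeds, reporting only the favorable split is exactly the manipulation the paper describes; a fairer protocol fixes and discloses the seed in advance, or reports the outcome over all splits.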
ISSN: 1568-4946
1872-9681
DOI: 10.1016/j.asoc.2021.107219