Loading…
Proposing the Use of Hazard Analysis for Machine Learning Data Sets
There is no debating the importance of data for artificial intelligence. The behavior of data-driven machine learning models is determined by the data set, or as the old adage states: “garbage in, garbage out (GIGO).” While the machine learning community is still debating which techniques are necess...
Saved in:
Published in: | Journal of System Safety 2023-06, Vol.58 (2), p.30-39 |
---|---|
Main Authors: | , , , |
Format: | Article |
Language: | English |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | There is no debating the importance of data for artificial intelligence. The behavior of data-driven machine learning models is determined by the data set, or as the old adage states: “garbage in, garbage out (GIGO).” While the machine learning community is still debating which techniques are necessary and sufficient to assess the adequacy of data sets, they agree some techniques are necessary. In general, most of the techniques being considered focus on evaluating the volumes of attributes. Those attributes are evaluated with respect to anticipated counts of attributes without considering the safety concerns associated with those attributes. This paper explores those techniques to identify instances of too little data and incorrect attributes. Those techniques are important; however, for safety critical applications, the assurance analyst also needs to understand the safety impact of not having specific attributes present in the machine learning data sets. To provide that information, this paper proposes a new technique the authors call data hazard analysis. The data hazard analysis provides an approach to qualitatively analyze the training data set to reduce the risk associated with the GIGO. |
---|---|
ISSN: | 2832-3041 2832-305X |
DOI: | 10.56094/jss.v58i2.253 |