An integrated approach based on Gaussian noises-based data augmentation method and AdaBoost model to predict faecal coliforms in rivers with small dataset

Bibliographic Details
Published in:Journal of hydrology (Amsterdam) 2021-08, Vol.599, p.126510, Article 126510
Main Authors: EL Bilali, Ali; Taleb, Abdeslam; Bahlaoui, Moulay Abdellah; Brouziyne, Youssef
Format: Article
Language:English
Description
Summary:
• Small datasets are a major problem in the application of ML models.
• The AdaBoost model was integrated into a Gaussian noise-based data augmentation method.
• Optimal virtual datasets were evaluated.

Machine learning (ML) techniques can be valuable for modelling faecal contamination in rivers and can overcome the limitations of process-based models. However, this approach requires sufficiently large datasets for training and validation to avoid over-fitting. This study attempts to overcome the small-dataset limitation by relying on data augmentation techniques. To that end, adaptive boosting (AdaBoost) models were trained and integrated into the data augmentation method to generate 600 virtual samples based on 40 original datasets. The results revealed that the proposed method significantly improved the accuracy (RMSE = 0.716 ln(colony-forming units (CFU)/100 ml)) and generalization ability of the AdaBoost model for predicting faecal coliforms in rivers, compared to the baseline model developed only with the small dataset (RMSE = 2.348 ln(CFU/100 ml)). However, the study also showed that generating and using too many virtual data could degrade the generalization ability of the ML model, and that the optimal virtual dataset size is about 337–415 virtual samples. Overall, the results provide new insights for improving the prediction accuracy of the health risk related to faecal coliforms in raw water used for drinking purposes when only a small dataset is available. The developed method can broaden the application of ML in water resources and environmental sciences when it is impossible to obtain the large dataset that ML models require.
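The augmentation idea described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the noise scale, the input variables, or how the AdaBoost model is coupled to the generation step. The function names, the `noise_fraction` parameter, the three placeholder columns, and the synthetic data below are all assumptions made for illustration. The sketch only shows how noisy "virtual" samples might be derived from a small original set and how an RMSE in ln(CFU/100 ml) could be computed; training the regressor itself (for example with scikit-learn's `AdaBoostRegressor`) is left out.

```python
import math
import random
import statistics


def augment_with_gaussian_noise(samples, n_virtual, noise_fraction=0.05, seed=0):
    """Return n_virtual noisy copies of rows drawn at random from samples.

    samples: list of equal-length lists of floats (inputs plus target).
    noise_fraction: hypothetical knob; the noise standard deviation of each
    column is this fraction of that column's standard deviation.
    """
    rng = random.Random(seed)
    columns = list(zip(*samples))
    # Per-column noise scale, proportional to the spread of the original data.
    sigmas = [noise_fraction * statistics.pstdev(col) for col in columns]
    virtual = []
    for _ in range(n_virtual):
        base = rng.choice(samples)  # original sample to perturb
        virtual.append([x + rng.gauss(0.0, s) for x, s in zip(base, sigmas)])
    return virtual


def rmse(y_true, y_pred):
    """Root mean square error; here both inputs would be in ln(CFU/100 ml)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic stand-in for a small original dataset (NOT data from the study):
# 40 rows with three placeholder columns.
data_rng = random.Random(42)
original = [
    [data_rng.uniform(5, 30), data_rng.uniform(6.5, 8.5), data_rng.uniform(2, 12)]
    for _ in range(40)
]

virtual = augment_with_gaussian_noise(original, n_virtual=600)
print(len(original), len(virtual))
```

In a workflow along these lines, one would presumably train on subsets of the virtual samples of increasing size and compare the RMSE on held-out original data, since the summary reports that too many virtual samples can hurt generalization.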
ISSN:0022-1694, 1879-2707
DOI:10.1016/j.jhydrol.2021.126510