
Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree

Bibliographic Details
Published in:	Smart agricultural technology 2023-02, Vol.3, p.100106, Article 100106
Main Authors:	Kebonye, Ndiye M.; Agyeman, Prince C.; Biney, James K.M.
Format:	Article
Language:	English
Subjects:	Czech Republic; Digital soil mapping (DSM); Generalization; Intelligible models; Model parsimony

Description
Summary:	
• A Classification and Regression Tree (CART) model was used to predict surface SOC levels across the Czech Republic.
• Data splitting strategies and feature selection methods greatly improved CART model results.
• The Conditional Latin Hypercube Sampling approach proves robust for prediction with the CART model.
• Distinct feature selection methods affect model performance differently, yet results using fewer covariates are promising.

Relatively few studies in digital soil mapping (DSM) explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods. Because more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are applied less often, despite being more intelligible. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoil samples (0–20 cm) for the Czech Republic were retrieved from the LUCAS soil database. Across the distinct CART models, three data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling, cLHS) and seven feature selection methods were varied. Overall model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results, based on SPlit, has an RMSE of 17.30 g/kg and a coefficient of determination (R²) of 0.52. The cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Generally, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART).
ISSN:	2772-3755
DOI:	10.1016/j.atech.2022.100106
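
Note on the accuracy metrics: the summary reports validation results as an RMSE (in g/kg) and an R². A minimal sketch of how these two metrics are commonly computed is given below. It assumes R² is defined as 1 − SS_res/SS_tot; the record does not say which definition the authors used (the squared Pearson correlation is another common choice). The function names and the sample values are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the same units as the observations (e.g. g/kg)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical SOC values in g/kg, for illustration only (not data from the study).
observed_soc = [10.0, 20.0, 30.0, 40.0]
predicted_soc = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(observed_soc, predicted_soc):.2f} g/kg")
print(f"R2   = {r_squared(observed_soc, predicted_soc):.3f}")
```

RMSE is expressed in the units of the target variable, so the reported 17.30 g/kg is directly comparable to the measured SOC values, whereas R² is unitless.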