
Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree

• Classification and Regression Tree (CART) model was used to predict surface SOC levels across the Czech Republic.
• Data splitting strategies and feature selection methods greatly improved CART model results.
• Conditional Latin Hypercube Sampling approach proves robust for prediction with the CART mod...

Bibliographic Details
Published in: Smart agricultural technology 2023-02, Vol.3, p.100106, Article 100106
Main Authors: Kebonye, Ndiye M., Agyeman, Prince C., Biney, James K.M.
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c344t-e4e29952378d6266351d446e7ab1c3d58186f171177862a16a7a0684b98a27043
cites cdi_FETCH-LOGICAL-c344t-e4e29952378d6266351d446e7ab1c3d58186f171177862a16a7a0684b98a27043
container_end_page
container_issue
container_start_page 100106
container_title Smart agricultural technology
container_volume 3
creator Kebonye, Ndiye M.
Agyeman, Prince C.
Biney, James K.M.
description
• Classification and Regression Tree (CART) model was used to predict surface SOC levels across the Czech Republic.
• Data splitting strategies and feature selection methods greatly improved CART model results.
• Conditional Latin Hypercube Sampling approach proves robust for prediction with the CART model.
• Distinct feature selection methods affect model performance differently yet results using fewer covariates are promising.
There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are least applied despite being more intelligible than the former. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoils (0–20 cm) for the Czech Republic were retrieved from the LUCAS soil database. Regarding the distinct CART models, data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling: cLHS) and 7 feature selection methods were varied. Meanwhile, overall model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results based on SPlit has a root mean square error (RMSE) of 17.30 g/kg and a coefficient of determination (R²) of 0.52. The cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Generally, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART).
doi_str_mv 10.1016/j.atech.2022.100106
format article
fullrecord <record><control><sourceid>elsevier_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_5b616c36f9ed48a6b0877e625aa1290b</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S2772375522000715</els_id><doaj_id>oai_doaj_org_article_5b616c36f9ed48a6b0877e625aa1290b</doaj_id><sourcerecordid>S2772375522000715</sourcerecordid><originalsourceid>FETCH-LOGICAL-c344t-e4e29952378d6266351d446e7ab1c3d58186f171177862a16a7a0684b98a27043</originalsourceid><addsrcrecordid>eNp9kE1LAzEQhhdRsKi_wEv-QGuS3U2yBw9S_CgUvOg5ziazOmW7KUms6K9324p48jTDy7wPw1MUl4LPBBfqajWDjO5tJrmUY8IFV0fFRGotp6Wu6-M_-2lxkdKKcy5NrUxjJsXL4ybTmr7Qs3Xw2Pc0vLLQMRfehxw_P8gjS4F6FuIrDOSYg9iGgfW4xT6xLQGDgdGQMW4iZmh7ZB4dJRqPckQ8L0466BNe_Myz4vnu9mn-MF0-3i_mN8upK6sqT7FC2TS1LLXxSipV1sJXlUINrXClr40wqhNaCK2NkiAUaODKVG1jQGpelWfF4sD1AVZ2E2kN8dMGILsPxvctxEyuR1u3SihXqq5BXxlQLTdao5I1gJANb0dWeWC5GFKK2P3yBLc753Zl987tzrk9OB9b14fWKAa3hNEmRzg49BTR5fEP-rf_DWlJinE</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree</title><source>Elsevier ScienceDirect Journals</source><creator>Kebonye, Ndiye M. ; Agyeman, Prince C. ; Biney, James K.M.</creator><creatorcontrib>Kebonye, Ndiye M. ; Agyeman, Prince C. ; Biney, James K.M.</creatorcontrib><description>•Classification and Regression Tree (CART) model was used to predict surface SOC levels across the Czech Republic.•Data splitting strategies and feature selection methods greatly improved CART model results.•Conditional Latin Hypercube Sampling approach proves robust for prediction with the CART model.•Distinct feature selection methods affect model performance differently yet results using fewer covariates are promising. 
There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are least applied despite being more intelligible than the former. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoils (0–20 cm) for the Czech Republic were retrieved from the LUCAS soil database. Regarding the distinct CART models, data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling: cLHS) and 7 feature selection methods were varied. Meanwhile, overall model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results based on SPlit has a root mean square error (RMSE) of 17.30 g/kg and a coefficient of determination (R2) of 0.52. The cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Generally, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART). 
[Display omitted]</description><identifier>ISSN: 2772-3755</identifier><identifier>EISSN: 2772-3755</identifier><identifier>DOI: 10.1016/j.atech.2022.100106</identifier><language>eng</language><publisher>Elsevier B.V</publisher><subject>Czech Republic ; Digital soil mapping (DSM) ; Generalization ; Intelligible models ; Model parsimony</subject><ispartof>Smart agricultural technology, 2023-02, Vol.3, p.100106, Article 100106</ispartof><rights>2022 The Authors</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c344t-e4e29952378d6266351d446e7ab1c3d58186f171177862a16a7a0684b98a27043</citedby><cites>FETCH-LOGICAL-c344t-e4e29952378d6266351d446e7ab1c3d58186f171177862a16a7a0684b98a27043</cites><orcidid>0000-0001-9246-1987</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.sciencedirect.com/science/article/pii/S2772375522000715$$EHTML$$P50$$Gelsevier$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,3549,27924,27925,45780</link.rule.ids></links><search><creatorcontrib>Kebonye, Ndiye M.</creatorcontrib><creatorcontrib>Agyeman, Prince C.</creatorcontrib><creatorcontrib>Biney, James K.M.</creatorcontrib><title>Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree</title><title>Smart agricultural technology</title><description>•Classification and Regression Tree (CART) model was used to predict surface SOC levels across the Czech Republic.•Data splitting strategies and feature selection methods greatly improved CART model results.•Conditional Latin Hypercube Sampling approach proves robust for prediction with the CART model.•Distinct feature selection methods affect model performance differently yet results using fewer covariates are promising. 
There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are least applied despite being more intelligible than the former. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoils (0–20 cm) for the Czech Republic were retrieved from the LUCAS soil database. Regarding the distinct CART models, data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling: cLHS) and 7 feature selection methods were varied. Meanwhile, overall model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results based on SPlit has a root mean square error (RMSE) of 17.30 g/kg and a coefficient of determination (R2) of 0.52. The cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Generally, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART). 
[Display omitted]</description><subject>Czech Republic</subject><subject>Digital soil mapping (DSM)</subject><subject>Generalization</subject><subject>Intelligible models</subject><subject>Model parsimony</subject><issn>2772-3755</issn><issn>2772-3755</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>DOA</sourceid><recordid>eNp9kE1LAzEQhhdRsKi_wEv-QGuS3U2yBw9S_CgUvOg5ziazOmW7KUms6K9324p48jTDy7wPw1MUl4LPBBfqajWDjO5tJrmUY8IFV0fFRGotp6Wu6-M_-2lxkdKKcy5NrUxjJsXL4ybTmr7Qs3Xw2Pc0vLLQMRfehxw_P8gjS4F6FuIrDOSYg9iGgfW4xT6xLQGDgdGQMW4iZmh7ZB4dJRqPckQ8L0466BNe_Myz4vnu9mn-MF0-3i_mN8upK6sqT7FC2TS1LLXxSipV1sJXlUINrXClr40wqhNaCK2NkiAUaODKVG1jQGpelWfF4sD1AVZ2E2kN8dMGILsPxvctxEyuR1u3SihXqq5BXxlQLTdao5I1gJANb0dWeWC5GFKK2P3yBLc753Zl987tzrk9OB9b14fWKAa3hNEmRzg49BTR5fEP-rf_DWlJinE</recordid><startdate>202302</startdate><enddate>202302</enddate><creator>Kebonye, Ndiye M.</creator><creator>Agyeman, Prince C.</creator><creator>Biney, James K.M.</creator><general>Elsevier B.V</general><general>Elsevier</general><scope>6I.</scope><scope>AAFTH</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0001-9246-1987</orcidid></search><sort><creationdate>202302</creationdate><title>Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree</title><author>Kebonye, Ndiye M. ; Agyeman, Prince C. 
; Biney, James K.M.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c344t-e4e29952378d6266351d446e7ab1c3d58186f171177862a16a7a0684b98a27043</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Czech Republic</topic><topic>Digital soil mapping (DSM)</topic><topic>Generalization</topic><topic>Intelligible models</topic><topic>Model parsimony</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kebonye, Ndiye M.</creatorcontrib><creatorcontrib>Agyeman, Prince C.</creatorcontrib><creatorcontrib>Biney, James K.M.</creatorcontrib><collection>ScienceDirect Open Access Titles</collection><collection>Elsevier:ScienceDirect:Open Access</collection><collection>CrossRef</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>Smart agricultural technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kebonye, Ndiye M.</au><au>Agyeman, Prince C.</au><au>Biney, James K.M.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree</atitle><jtitle>Smart agricultural technology</jtitle><date>2023-02</date><risdate>2023</risdate><volume>3</volume><spage>100106</spage><pages>100106-</pages><artnum>100106</artnum><issn>2772-3755</issn><eissn>2772-3755</eissn><abstract>•Classification and Regression Tree (CART) model was used to predict surface SOC levels across the Czech Republic.•Data splitting strategies and feature selection methods greatly improved CART model results.•Conditional Latin Hypercube Sampling approach proves robust for prediction with the CART model.•Distinct feature selection methods affect model performance differently yet results using fewer covariates are promising. 
There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are least applied despite being more intelligible than the former. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoils (0–20 cm) for the Czech Republic were retrieved from the LUCAS soil database. Regarding the distinct CART models, data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling: cLHS) and 7 feature selection methods were varied. Meanwhile, overall model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results based on SPlit has a root mean square error (RMSE) of 17.30 g/kg and a coefficient of determination (R2) of 0.52. The cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Generally, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART). [Display omitted]</abstract><pub>Elsevier B.V</pub><doi>10.1016/j.atech.2022.100106</doi><orcidid>https://orcid.org/0000-0001-9246-1987</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2772-3755
ispartof Smart agricultural technology, 2023-02, Vol.3, p.100106, Article 100106
issn 2772-3755
2772-3755
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_5b616c36f9ed48a6b0877e625aa1290b
source Elsevier ScienceDirect Journals
subjects Czech Republic
Digital soil mapping (DSM)
Generalization
Intelligible models
Model parsimony
title Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T13%3A04%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Optimized%20modelling%20of%20countrywide%20soil%20organic%20carbon%20levels%20via%20an%20interpretable%20decision%20tree&rft.jtitle=Smart%20agricultural%20technology&rft.au=Kebonye,%20Ndiye%20M.&rft.date=2023-02&rft.volume=3&rft.spage=100106&rft.pages=100106-&rft.artnum=100106&rft.issn=2772-3755&rft.eissn=2772-3755&rft_id=info:doi/10.1016/j.atech.2022.100106&rft_dat=%3Celsevier_doaj_%3ES2772375522000715%3C/elsevier_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c344t-e4e29952378d6266351d446e7ab1c3d58186f171177862a16a7a0684b98a27043%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true