
Vision transformers for cotton boll segmentation: Hyperparameters optimization and comparison with convolutional neural networks

For the automation of cotton harvesting operations, precise segmentation of cotton bolls is important. In the past, various handcrafted image-processing algorithms and convolutional neural networks (CNNs) have been developed for this purpose. Handcrafted algorithms often extract only low-dimensional features, while CNNs are limited in capturing global features due to their small receptive fields. Vision Transformers (ViTs), by contrast, have proven able to capture long-range dependencies through the self-attention mechanism, resulting in superior segmentation accuracy. In this study, ViTs were used to segment cotton bolls, and the impact of various hyperparameters on their efficacy was investigated. Different ViT variants were developed from varying combinations of hyperparameters. Among all developed variants, the model with a patch size of 16, a hidden dimension of 8, 6 multi-head self-attention (MHSA) heads, 12 transformer layers, and a multilayer perceptron (MLP) dimension of 128 outperformed the others. This optimal configuration achieved precision, recall, mean Intersection over Union (m-IoU), and cotton-IoU values of 0.94, 0.94, 0.93, and 0.89, respectively. The findings show that increasing the hidden dimension and the number of attention heads increased model complexity but did not necessarily improve performance. The best-performing ViT model achieved a higher cotton-IoU (0.89) than the CNN model (0.84). These results indicate that the ViT model outperforms a CNN with a comparable number of trainable parameters for the segmentation of cotton bolls. Hence, ViTs can be used effectively for semantic segmentation tasks in agriculture, delivering higher segmentation performance while requiring less computational power. This makes ViTs a suitable technique for automating the cotton harvesting process on resource-constrained devices without compromising performance. Future work should explore pure transformer architectures and incorporate advanced techniques to further optimize performance and efficiency in various agricultural tasks.

Highlights:
• Developed ViTs for cotton boll segmentation.
• Analyzed the effects of hyperparameters on ViT performance.
• Achieved 0.94 precision, 0.93 m-IoU, and 0.89 cotton-IoU with the optimal ViT.
• ViTs showed better performance, with a cotton-IoU of 0.89 vs. the CNN's 0.84.
• Increasing the hidden dimension and the number of attention heads did not necessarily improve performance.
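
As a concrete reading of the best configuration reported in the abstract, the sketch below assembles the corresponding encoder in PyTorch: patch size 16, hidden dimension 8, 6 MHSA heads, 12 transformer layers, and an MLP dimension of 128. This is not the authors' code. In particular, because 8 is not divisible by 6, the sketch assumes the common convention in which each head gets its own subspace (taken here to be 8-dimensional) and the concatenated heads are projected back to the model dimension; positional embeddings and the segmentation decoder are omitted.

```python
# Minimal sketch (not the authors' implementation) of the best-performing
# ViT encoder described in the abstract. The per-head dimension is an
# assumption: hidden dim 8 is not divisible by 6 heads, so each head is
# given its own 8-dim subspace and the concatenation is projected back.
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention with an independent subspace per head."""
    def __init__(self, dim, heads, head_dim):
        super().__init__()
        inner = heads * head_dim
        self.heads, self.head_dim = heads, head_dim
        self.to_qkv = nn.Linear(dim, 3 * inner, bias=False)
        self.proj = nn.Linear(inner, dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        b, n, _ = x.shape
        qkv = self.to_qkv(x).view(b, n, 3, self.heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v         # (b, heads, n, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

class Block(nn.Module):
    """Pre-norm transformer encoder layer."""
    def __init__(self, dim=8, heads=6, head_dim=8, mlp_dim=128):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = MHSA(dim, heads, head_dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))       # residual attention
        return x + self.mlp(self.norm2(x))     # residual MLP

# 16x16 patch embedding of an RGB image, then 12 encoder layers.
patch_embed = nn.Conv2d(3, 8, kernel_size=16, stride=16)
encoder = nn.Sequential(*[Block() for _ in range(12)])
img = torch.randn(1, 3, 256, 256)
tokens = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 256, 8)
print(encoder(tokens).shape)                          # torch.Size([1, 256, 8])
```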

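The reported metrics can likewise be illustrated with a small sketch, assuming binary masks (1 = cotton boll, 0 = background) and reading "cotton-IoU" as the IoU of the cotton class alone and m-IoU as the mean IoU over both classes:

```python
# Hedged sketch of the IoU metrics named in the abstract, for integer masks.
import numpy as np

def iou(pred, target, cls):
    """Intersection over union for one class of an integer label mask."""
    p, t = (pred == cls), (target == cls)
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else float("nan")

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
cotton_iou = iou(pred, target, 1)                  # 2 overlap / 4 union = 0.5
m_iou = np.mean([iou(pred, target, c) for c in (0, 1)])
print(cotton_iou, m_iou)
```
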
Bibliographic Details
Published in: Industrial Crops and Products, 2025-01, Vol. 223, p. 120241, Article 120241
Main Authors: Singh, Naseeb; Tewari, V.K.; Biswas, P.K.
Format: Article
Language: English
Publisher: Elsevier B.V.
ISSN: 0926-6690
DOI: 10.1016/j.indcrop.2024.120241
Rights: © 2024 The Authors
Subjects: Automated harvesting; Automation; Cotton; Deep learning; Neural networks; Semantic segmentation; Vision; Vision transformers