
Vision transformers for cotton boll segmentation: Hyperparameters optimization and comparison with convolutional neural networks

Bibliographic Details
Published in: Industrial Crops and Products, 2025-01, Vol. 223, p. 120241, Article 120241
Main Authors: Singh, Naseeb, Tewari, V.K., Biswas, P.K.
Format: Article
Language:English
Description
Summary: For the automation of cotton harvesting operations, precise segmentation of cotton bolls is important. Various handcrafted image-processing algorithms and convolutional neural networks (CNNs) have previously been developed for this purpose. Handcrafted algorithms often extract only low-dimensional features, while CNNs are limited in capturing global features because of their small receptive fields. In recent years, however, Vision Transformers (ViTs) have demonstrated the ability to capture long-range dependencies through the self-attention mechanism, resulting in superior segmentation accuracy. In this study, ViTs were used to segment cotton bolls, and the impact of various hyperparameters on their efficacy was investigated. Different ViT variants were developed using varying combinations of hyperparameters. Among all developed variants, the model with a patch size of 16, hidden dimensions of 8, six Multi-head Self-Attention (MHSA) heads, 12 transformer layers, and a multilayer perceptron (MLP) dimension of 128 outperformed the others. This optimal configuration achieved precision, recall, mean Intersection over Union (m-IoU), and cotton-IoU values of 0.94, 0.94, 0.93, and 0.89, respectively. The findings show that increasing the hidden dimensions and the number of attention heads increased model complexity but did not necessarily improve performance. The best-performing ViT model achieved a higher cotton-IoU (0.89) than the CNN model (0.84). These results indicate that the ViT model outperforms a CNN model with a comparable number of trainable parameters for the segmentation of cotton bolls. Hence, ViTs can be effectively utilized for semantic segmentation tasks in agriculture, achieving higher segmentation performance while requiring less computational power.
This makes ViTs a suitable technique for automating the cotton harvesting process on resource-constrained devices without compromising performance. Future work should include the use of pure transformer architectures, incorporating advanced techniques to further optimize performance and efficiency in various agricultural tasks.
•Developed ViTs for cotton boll segmentation.
•Effects of hyperparameters on ViT performance were analyzed.
•Achieved 0.94 precision, 0.93 m-IoU, and 0.89 cotton-IoU with the optimal ViT.
•ViTs showed better performance with 0.89 cotton-IoU vs. the CNN's 0.84.
•Increasing hidden dimensions and attention heads raised model complexity without necessarily improving performance.
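The core mechanisms the abstract names, patch tokenization and multi-head self-attention, can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the 224×224 input size is assumed, and the abstract's "hidden dimensions of 8" is interpreted here as a per-head dimension (giving an embedding width of 6 × 8 = 48); the paper may define these quantities differently.

```python
import numpy as np

# Hyperparameters reported in the abstract; embedding width is an assumption.
PATCH = 16       # patch size
HEADS = 6        # MHSA heads
HEAD_DIM = 8     # interpreted as hidden dims per head (assumption)
EMBED = HEADS * HEAD_DIM  # 48-dim token embedding (assumption)

def num_patches(h, w, p=PATCH):
    """Number of non-overlapping p x p patches a ViT extracts from an h x w image."""
    return (h // p) * (w // p)

def mhsa(x, wq, wk, wv, wo, heads=HEADS):
    """Minimal multi-head self-attention forward pass (no biases, no masking)."""
    n, d = x.shape
    hd = d // heads
    # Project tokens to queries/keys/values and split into heads: (heads, n, hd).
    q = (x @ wq).reshape(n, heads, hd).transpose(1, 0, 2)
    k = (x @ wk).reshape(n, heads, hd).transpose(1, 0, 2)
    v = (x @ wv).reshape(n, heads, hd).transpose(1, 0, 2)
    # Scaled dot-product attention, softmax over the key axis.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    # Merge heads back and apply the output projection.
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ wo

rng = np.random.default_rng(0)
n = num_patches(224, 224)                 # 196 patch tokens for a 224x224 input
x = rng.standard_normal((n, EMBED))       # stand-in for embedded patch tokens
w = [rng.standard_normal((EMBED, EMBED)) * 0.02 for _ in range(4)]
y = mhsa(x, *w)
print(n, y.shape)  # 196 (196, 48)
```

Because every token attends to every other token, each output row mixes information from all 196 patches in one layer; this is the long-range dependency property the abstract credits for the ViT's advantage over small-receptive-field CNNs.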
ISSN:0926-6690
DOI:10.1016/j.indcrop.2024.120241