End-to-end deep learning for directly estimating grape yield from ground-based imagery

Bibliographic Details
Published in: Computers and Electronics in Agriculture, 2022-07, Vol. 198, p. 107081, Article 107081
Main Authors: Olenskyj, Alexander G., Sams, Brent S., Fei, Zhenghao, Singh, Vishal, Raja, Pranav V., Bornhorst, Gail M., Earles, J. Mason
Format: Article
Language: English
Description
Summary:
•Grape yield measured using a yield monitor was estimated using proximal imagery.
•Yield estimation performance increased with increasing spatial aggregation.
•Deep regression models trained on image data eliminated the need for pixel labeling.
•Box area and deep regression models showed similar yield estimation performance.
•Vision transformer with metadata predicted yield with the lowest percent error.

Yield estimation prior to harvest is a powerful tool in vineyard management, as it allows growers to fine-tune management practices to optimize yield and quality. However, yield estimation is currently performed using manual sampling, which is time-consuming and imprecise. This study demonstrates the applicability of nondestructive proximal imaging combined with deep learning for yield estimation in vineyards. Continuous image data collection using a vehicle-mounted sensing kit, combined with collection of ground truth yield data at harvest using a commercial yield monitor, allowed for the generation of a large dataset of 23,581 yield points and 107,933 images. Moreover, this study was conducted in a mechanically managed commercial vineyard, representing a challenging environment for image analysis but a common set of conditions in the California Central Valley. Three model architectures were tested: object detection, CNN regression, and transformer models. The object detection model was trained on hand-labeled images to localize grape bunches, and detections were either counted or their pixel counts were summed to obtain a metric that was correlated with grape yield. In contrast, regression models were trained end-to-end to directly predict grape yield from image data without the need for hand labeling.

Results demonstrated that a transformer model and the object detection model with pixel area processing performed comparably, with mean absolute percent errors of 18% and 18.5%, respectively, on a representative holdout dataset. Saliency mapping was used to demonstrate that the attention of the CNN regression model was localized near the predicted location of grape bunches, as well as on the top of the grapevine canopy. Overall, the study demonstrated the applicability of proximal imaging and deep learning for prediction of grapevine yield on a large scale. Additionally, the end-to-end modeling approach performed comparably to the object detection approach while eliminating the need for hand labeling.
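
The two detection-derived metrics and the reported error measure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the box format `(x_min, y_min, x_max, y_max)`, the function names, and the sample values are all assumptions made for illustration, and bounding-box area is used here as a simple stand-in for the summed pixel count of the detections.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_count(boxes: Sequence[Box]) -> int:
    """Number of detected grape bunches in one image."""
    return len(boxes)


def box_area_sum(boxes: Sequence[Box]) -> float:
    """Total bounding-box area (pixels^2) of all detections in one image."""
    return sum(
        max(0.0, x_max - x_min) * max(0.0, y_max - y_min)
        for x_min, y_min, x_max, y_max in boxes
    )


def mean_absolute_percent_error(
    actual: Sequence[float], predicted: Sequence[float]
) -> float:
    """Mean absolute percent error, in percent, between two equal-length series."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Made-up detections for a single image.
    detections = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 25.0)]
    print("bunch count:", box_count(detections))
    print("box area sum:", box_area_sum(detections))

    # Made-up measured and predicted yields at two yield points.
    measured = [10.0, 20.0]
    estimated = [8.0, 25.0]
    print("MAPE (%):", mean_absolute_percent_error(measured, estimated))
```

In a pipeline of this kind, either metric would then be related to the yield monitor values (for example, with a regression fit) before the percent error is computed on held-out data. That calibration step is omitted here because the record does not describe it.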
ISSN: 0168-1699
DOI: 10.1016/j.compag.2022.107081