ALICE Software: Machine learning & computer vision for automatic label extraction

Bibliographic Details
Published in: Biodiversity Information Science and Standards 2022-08, Vol. 6
Main Authors: Salili-James, Arianna; Scott, Ben; Smith, Vincent
Format: Article
Language: English
Description
Summary: Insects make up over 70% of the world's known species (Resh and Carde 2009). This is well represented in collections across the world, with the Natural History Museum's pinned insect collection alone making up nearly 37% of the museum's remarkable 80 million specimen collection. Thus, this extraordinary dataset is a major focus of digitisation efforts here at the Museum. While hardware developments have seen digitisation processes significantly improve and speed up (Blagoderov et al. 2017), we now concentrate on the latest software and explore whether machine learning can lend a bigger hand in accelerating our digitisation of pinned insects. Traditionally, the digitisation of pinned specimens involves the removal of labels (as well as any supplementary specimen miscellanies) prior to photographing the specimen. In order to document labels, this process is typically followed by additional photographs of labels as the label documentation is often obstructed by their stacking on a pin, the specimen and additional specimen material, or the pin itself. However, these steps not only slow down the process of digitisation but also increase the risk of specimen damage. This encouraged the team at the Natural History Museum to develop a novel setup that would bypass the need for removing labels during digitisation. This led to the development of ALICE (Angled Label Image Capture and Extraction) (Dupont and Price 2019). ALICE is a multi-camera setup designed to capture images of angled specimens, which allows users to get a full picture of a specimen in a collection, including that of the label and the text within. Specifically, ALICE involves four cameras angled at different viewpoints in order to capture label information, as well as two additional cameras providing a lateral and dorsal view of the specimen. By viewing all the images taken from one specimen simultaneously, we can obtain a full account of the labels and of the specimen, despite any obstructions.
This setup notably accelerates parts of the digitisation process, sometimes by up to 7 times (Price et al. 2019). Furthermore, ALICE presents the opportunity to incorporate machine learning and computer vision techniques to create software that automates the process of transcribing the information contained on labels. Automatically transcribing text (whether typed or handwritten) from label images leads to the topic of Optical Character Recognition (OCR). Regardless of any obstructions to the labels, st
ISSN: 2535-0897
DOI: 10.3897/biss.6.91443