
A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens

iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. Transferring large num...

Bibliographic Details
Published in: Biodiversity Information Science and Standards 2017-08, Vol.1, p.e20326
Main Authors: Yeole, Gaurav; Sahdev, Saniya; Collins, Matthew; Thompson, Alex; Dikow, Rebecca; Frandsen, Paul; Orli, Sylvia; Figueiredo, Renato
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c1256-162a7b1e1c14851236fc8d40039501c175af108f488f22c0760b5948d4a12ab33
container_end_page
container_issue
container_start_page e20326
container_title Biodiversity Information Science and Standards
container_volume 1
creator Yeole, Gaurav
Sahdev, Saniya
Collins, Matthew
Thompson, Alex
Dikow, Rebecca
Frandsen, Paul
Orli, Sylvia
Figueiredo, Renato
description iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphical processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we are building a model pipeline for applying user-defined processing to all or any subset of images stored in iDigBio on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. This pipeline utilizes Apache Spark, the Hadoop File System (HDFS), and Mesos (Collins et al. 2017). We have placed a Jupyter notebook server in front of this architecture, which provides an easy environment for end users to write their own Python or R software programs. Users can access the stored data and images and manipulate them per their requirements and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we are applying a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz, in preparation). The model was trained on Smithsonian servers using their herbarium images and it is being transferred to the GUODA infrastructure hosted at the ACIS lab. All herbarium images in iDigBio are being classified using this model to illustrate the application of these techniques to larger sets of images using a deep convolutional neural network that detects visible mercury crystallization present on digitized herbarium sheets. Such an automated detection process can potentially be used, for instance, to notify other data publishers of any contamination. We are presenting the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
doi_str_mv 10.3897/tdwgproceedings.1.20326
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2169991140</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2169991140</sourcerecordid><originalsourceid>FETCH-LOGICAL-c1256-162a7b1e1c14851236fc8d40039501c175af108f488f22c0760b5948d4a12ab33</originalsourceid><addsrcrecordid>eNpdkd1Kw0AUhBdRsNQ-gwtep-5Pfi9rrW2hYkF7vWw2J2FLsom7DVpfwld2Y0XEqzMMHzNwBqFrSqY8zZLbQ_FWdbZVAIU2lZvSKSOcxWdoxCIeBcQz53_0JZo4tyeEsIyxNE5H6HOGt7qDWhvAZWvxdghzzofh5w6UbsDgdSMrcFgbrO91dadbHOBZ19XHgZKmwEswYGWtP04GXrzLRht50K3BbYkfwareHvHOwRCytdBJO6ArsLlXffPb5a7QRSlrB5OfO0a7h8XLfBVsnpbr-WwTKMqiOKAxk0lOgSoaphFlPC5VWoSE8Cwi3kwiWVKSlmGalowpksQkj7LQI5IymXM-RjenXP-91x7cQezb3hpfKRiNsyyjNCSeSk6Usq1zFkrRWd1IexSUiGEA8W8AQcX3APwLxzt-ZA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2169991140</pqid></control><display><type>article</type><title>A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens</title><source>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</source><creator>Yeole, Gaurav ; Sahdev, Saniya ; Collins, Matthew ; Thompson, Alex ; Dikow, Rebecca ; Frandsen, Paul ; Orli, Sylvia ; Figueiredo, Renato</creator><creatorcontrib>Yeole, Gaurav ; Sahdev, Saniya ; Collins, Matthew ; Thompson, Alex ; Dikow, Rebecca ; Frandsen, Paul ; Orli, Sylvia ; Figueiredo, Renato</creatorcontrib><description>iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. 
While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphical processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we are building a model pipeline for applying user-defined processing to all or any subset of images stored in iDigBio on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. This pipeline utilizes Apache Spark, the Hadoop File System (HDFS), and Mesos (Collins et al. 2017). We have placed a Jupyter notebook server in front of this architecture, which provides an easy environment for end users to write their own Python or R software programs. Users can access the stored data and images and manipulate them per their requirements and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we are applying a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz, in preparation ). The model was trained on Smithsonian servers using their herbarium images and it is being transferred to the GUODA infrastructure hosted at the ACIS lab. All herbarium images in iDigBio are being classified using this model to illustrate the application of these techniques to larger sets of images using a deep convolutional neural network that detects visible mercury crystallization present on digitized herbarium sheets. Such an automated detection process can potentially be used, for instance, to notify other data publishers of any contamination. 
We are presenting the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.</description><identifier>ISSN: 2535-0897</identifier><identifier>EISSN: 2535-0897</identifier><identifier>DOI: 10.3897/tdwgproceedings.1.20326</identifier><language>eng</language><publisher>Sofia: Pensoft Publishers</publisher><subject>Contamination ; Crystallization ; Data processing ; Information systems ; Infrastructure ; Mercury ; Neural networks</subject><ispartof>Biodiversity Information Science and Standards, 2017-08, Vol.1, p.e20326</ispartof><rights>2017. This work is published under http://creativecommons.org/licenses/by/4.0 (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c1256-162a7b1e1c14851236fc8d40039501c175af108f488f22c0760b5948d4a12ab33</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2169991140?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,25753,27924,27925,37012,44590</link.rule.ids></links><search><creatorcontrib>Yeole, Gaurav</creatorcontrib><creatorcontrib>Sahdev, Saniya</creatorcontrib><creatorcontrib>Collins, Matthew</creatorcontrib><creatorcontrib>Thompson, Alex</creatorcontrib><creatorcontrib>Dikow, Rebecca</creatorcontrib><creatorcontrib>Frandsen, Paul</creatorcontrib><creatorcontrib>Orli, Sylvia</creatorcontrib><creatorcontrib>Figueiredo, Renato</creatorcontrib><title>A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens</title><title>Biodiversity 
Information Science and Standards</title><description>iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphical processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we are building a model pipeline for applying user-defined processing to all or any subset of images stored in iDigBio on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. This pipeline utilizes Apache Spark, the Hadoop File System (HDFS), and Mesos (Collins et al. 2017). We have placed a Jupyter notebook server in front of this architecture, which provides an easy environment for end users to write their own Python or R software programs. Users can access the stored data and images and manipulate them per their requirements and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we are applying a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz, in preparation ). The model was trained on Smithsonian servers using their herbarium images and it is being transferred to the GUODA infrastructure hosted at the ACIS lab. 
All herbarium images in iDigBio are being classified using this model to illustrate the application of these techniques to larger sets of images using a deep convolutional neural network that detects visible mercury crystallization present on digitized herbarium sheets. Such an automated detection process can potentially be used, for instance, to notify other data publishers of any contamination. We are presenting the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.</description><subject>Contamination</subject><subject>Crystallization</subject><subject>Data processing</subject><subject>Information systems</subject><subject>Infrastructure</subject><subject>Mercury</subject><subject>Neural networks</subject><issn>2535-0897</issn><issn>2535-0897</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNpdkd1Kw0AUhBdRsNQ-gwtep-5Pfi9rrW2hYkF7vWw2J2FLsom7DVpfwld2Y0XEqzMMHzNwBqFrSqY8zZLbQ_FWdbZVAIU2lZvSKSOcxWdoxCIeBcQz53_0JZo4tyeEsIyxNE5H6HOGt7qDWhvAZWvxdghzzofh5w6UbsDgdSMrcFgbrO91dadbHOBZ19XHgZKmwEswYGWtP04GXrzLRht50K3BbYkfwareHvHOwRCytdBJO6ArsLlXffPb5a7QRSlrB5OfO0a7h8XLfBVsnpbr-WwTKMqiOKAxk0lOgSoaphFlPC5VWoSE8Cwi3kwiWVKSlmGalowpksQkj7LQI5IymXM-RjenXP-91x7cQezb3hpfKRiNsyyjNCSeSk6Usq1zFkrRWd1IexSUiGEA8W8AQcX3APwLxzt-ZA</recordid><startdate>20170815</startdate><enddate>20170815</enddate><creator>Yeole, Gaurav</creator><creator>Sahdev, Saniya</creator><creator>Collins, Matthew</creator><creator>Thompson, Alex</creator><creator>Dikow, Rebecca</creator><creator>Frandsen, Paul</creator><creator>Orli, Sylvia</creator><creator>Figueiredo, Renato</creator><general>Pensoft 
Publishers</general><scope>AAYXX</scope><scope>CITATION</scope><scope>8FE</scope><scope>8FH</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>LK8</scope><scope>M7P</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope></search><sort><creationdate>20170815</creationdate><title>A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens</title><author>Yeole, Gaurav ; Sahdev, Saniya ; Collins, Matthew ; Thompson, Alex ; Dikow, Rebecca ; Frandsen, Paul ; Orli, Sylvia ; Figueiredo, Renato</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c1256-162a7b1e1c14851236fc8d40039501c175af108f488f22c0760b5948d4a12ab33</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>Contamination</topic><topic>Crystallization</topic><topic>Data processing</topic><topic>Information systems</topic><topic>Infrastructure</topic><topic>Mercury</topic><topic>Neural networks</topic><toplevel>online_resources</toplevel><creatorcontrib>Yeole, Gaurav</creatorcontrib><creatorcontrib>Sahdev, Saniya</creatorcontrib><creatorcontrib>Collins, Matthew</creatorcontrib><creatorcontrib>Thompson, Alex</creatorcontrib><creatorcontrib>Dikow, Rebecca</creatorcontrib><creatorcontrib>Frandsen, Paul</creatorcontrib><creatorcontrib>Orli, Sylvia</creatorcontrib><creatorcontrib>Figueiredo, Renato</creatorcontrib><collection>CrossRef</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>Biological 
Science Collection</collection><collection>AUTh Library subscriptions: ProQuest Central</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest Biological Science Collection</collection><collection>Biological Science Database</collection><collection>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><jtitle>Biodiversity Information Science and Standards</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Yeole, Gaurav</au><au>Sahdev, Saniya</au><au>Collins, Matthew</au><au>Thompson, Alex</au><au>Dikow, Rebecca</au><au>Frandsen, Paul</au><au>Orli, Sylvia</au><au>Figueiredo, Renato</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens</atitle><jtitle>Biodiversity Information Science and Standards</jtitle><date>2017-08-15</date><risdate>2017</risdate><volume>1</volume><spage>e20326</spage><pages>e20326-</pages><issn>2535-0897</issn><eissn>2535-0897</eissn><abstract>iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. 
Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphical processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we are building a model pipeline for applying user-defined processing to all or any subset of images stored in iDigBio on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. This pipeline utilizes Apache Spark, the Hadoop File System (HDFS), and Mesos (Collins et al. 2017). We have placed a Jupyter notebook server in front of this architecture, which provides an easy environment for end users to write their own Python or R software programs. Users can access the stored data and images and manipulate them per their requirements and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we are applying a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz, in preparation ). The model was trained on Smithsonian servers using their herbarium images and it is being transferred to the GUODA infrastructure hosted at the ACIS lab. All herbarium images in iDigBio are being classified using this model to illustrate the application of these techniques to larger sets of images using a deep convolutional neural network that detects visible mercury crystallization present on digitized herbarium sheets. Such an automated detection process can potentially be used, for instance, to notify other data publishers of any contamination. 
We are presenting the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.</abstract><cop>Sofia</cop><pub>Pensoft Publishers</pub><doi>10.3897/tdwgproceedings.1.20326</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2535-0897
ispartof Biodiversity Information Science and Standards, 2017-08, Vol.1, p.e20326
issn 2535-0897
2535-0897
language eng
recordid cdi_proquest_journals_2169991140
source Publicly Available Content Database (Proquest) (PQ_SDU_P3)
subjects Contamination
Crystallization
Data processing
Information systems
Infrastructure
Mercury
Neural networks
title A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T23%3A35%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Pipeline%20for%20Processing%20Specimen%20Images%20in%20iDigBio%20-%20Applying%20and%20Generalizing%20an%20Examination%20of%20Mercury%20Use%20in%20Preparing%20Herbarium%20Specimens&rft.jtitle=Biodiversity%20Information%20Science%20and%20Standards&rft.au=Yeole,%20Gaurav&rft.date=2017-08-15&rft.volume=1&rft.spage=e20326&rft.pages=e20326-&rft.issn=2535-0897&rft.eissn=2535-0897&rft_id=info:doi/10.3897/tdwgproceedings.1.20326&rft_dat=%3Cproquest_cross%3E2169991140%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c1256-162a7b1e1c14851236fc8d40039501c175af108f488f22c0760b5948d4a12ab33%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2169991140&rft_id=info:pmid/&rfr_iscdi=true