A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens
iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. Transferring large num...
Published in: | Biodiversity Information Science and Standards 2017-08, Vol.1, p.e20326 |
---|---|
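Main Authors: | Yeole, Gaurav; Sahdev, Saniya; Collins, Matthew; Thompson, Alex; Dikow, Rebecca; Frandsen, Paul; Orli, Sylvia; Figueiredo, Renato |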
Format: | Article |
Language: | English |
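Subjects: | Contamination; Crystallization; Data processing; Information systems; Infrastructure; Mercury; Neural networks |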
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c1256-162a7b1e1c14851236fc8d40039501c175af108f488f22c0760b5948d4a12ab33 |
container_end_page | |
container_issue | |
container_start_page | e20326 |
container_title | Biodiversity Information Science and Standards |
container_volume | 1 |
creator | Yeole, Gaurav; Sahdev, Saniya; Collins, Matthew; Thompson, Alex; Dikow, Rebecca; Frandsen, Paul; Orli, Sylvia; Figueiredo, Renato |
description | iDigBio currently references over 22 million media files, and stores approximately 120 terabytes worth of those media files co-located with our computing infrastructure (Matsunaga et al. 2013). Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphical processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we are building a model pipeline for applying user-defined processing to all or any subset of images stored in iDigBio on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. This pipeline utilizes Apache Spark, the Hadoop File System (HDFS), and Mesos (Collins et al. 2017). We have placed a Jupyter notebook server in front of this architecture, which provides an easy environment for end users to write their own Python or R software programs. Users can access the stored data and images and manipulate them per their requirements and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we are applying a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz, in preparation). The model was trained on Smithsonian servers using their herbarium images and it is being transferred to the GUODA infrastructure hosted at the ACIS lab. All herbarium images in iDigBio are being classified using this model to illustrate the application of these techniques to larger sets of images using a deep convolutional neural network that detects visible mercury crystallization present on digitized herbarium sheets. Such an automated detection process can potentially be used, for instance, to notify other data publishers of any contamination. We are presenting the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable. |
doi_str_mv | 10.3897/tdwgproceedings.1.20326 |
format | article |
publisher | Sofia: Pensoft Publishers |
date | 2017-08-15 |
eissn | 2535-0897 |
rights | 2017. This work is published under http://creativecommons.org/licenses/by/4.0 (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
oa | free_for_read |
linktohtml | https://www.proquest.com/docview/2169991140?pq-origsite=primo |
fulltext | fulltext |
identifier | ISSN: 2535-0897 |
ispartof | Biodiversity Information Science and Standards, 2017-08, Vol.1, p.e20326 |
issn | 2535-0897 |
language | eng |
recordid | cdi_proquest_journals_2169991140 |
source | Publicly Available Content Database (Proquest) (PQ_SDU_P3) |
subjects | Contamination; Crystallization; Data processing; Information systems; Infrastructure; Mercury; Neural networks |
title | A Pipeline for Processing Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T23%3A35%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Pipeline%20for%20Processing%20Specimen%20Images%20in%20iDigBio%20-%20Applying%20and%20Generalizing%20an%20Examination%20of%20Mercury%20Use%20in%20Preparing%20Herbarium%20Specimens&rft.jtitle=Biodiversity%20Information%20Science%20and%20Standards&rft.au=Yeole,%20Gaurav&rft.date=2017-08-15&rft.volume=1&rft.spage=e20326&rft.pages=e20326-&rft.issn=2535-0897&rft.eissn=2535-0897&rft_id=info:doi/10.3897/tdwgproceedings.1.20326&rft_dat=%3Cproquest_cross%3E2169991140%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c1256-162a7b1e1c14851236fc8d40039501c175af108f488f22c0760b5948d4a12ab33%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2169991140&rft_id=info:pmid/&rfr_iscdi=true |
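
The abstract describes applying user-defined processing, such as generating histograms, to all or any subset of the stored images. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The record shape `(record_id, kingdom, pixels)`, the function names `histogram` and `process_subset`, and the `sample` data are all hypothetical. In the pipeline described above this kind of function would presumably run as a distributed map in Apache Spark, whereas here an ordinary dictionary comprehension stands in for it.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record shape: (record_id, kingdom, pixels), where pixels is a
# flat list of 8-bit grayscale values. Real iDigBio media records differ.
Record = Tuple[str, str, List[int]]


def histogram(pixels: Iterable[int], bins: int = 4) -> List[int]:
    """Count 8-bit pixel values (0-255) into `bins` equal-width buckets."""
    width = 256 // bins
    # Clamp to the last bucket so values near 255 never overflow when
    # 256 is not evenly divisible by `bins`.
    counts = Counter(min(p // width, bins - 1) for p in pixels)
    return [counts.get(b, 0) for b in range(bins)]


def process_subset(
    records: Iterable[Record],
    keep: Callable[[Record], bool],
    transform: Callable[[List[int]], List[int]],
) -> Dict[str, List[int]]:
    """Apply `transform` to the pixels of every record that `keep` selects."""
    return {rec[0]: transform(rec[2]) for rec in records if keep(rec)}


# Made-up sample data for illustration only.
sample: List[Record] = [
    ("a", "plantae", [0, 63, 64, 255]),
    ("b", "animalia", [10, 20]),
    ("c", "plantae", [128, 129, 200, 255]),
]

# Select only the plant records and compute a 4-bin histogram for each.
result = process_subset(sample, lambda rec: rec[1] == "plantae", histogram)
print(result)  # {'a': [2, 1, 0, 1], 'c': [0, 0, 2, 2]}
```

Keeping the selection predicate and the transformation as separate arguments mirrors the "any subset" and "user-defined processing" parts of the description: either one can be swapped without touching the other.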