Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction
Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational...
Published in: | Genome Biology 2021-09, Vol.22 (1), p.264-264, Article 264 |
---|---|
Main Authors: | Ma, Wenjing ; Su, Kenong ; Wu, Hao |
Format: | Article |
Language: | English |
Subjects: | |
cited_by | cdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343 |
---|---|
cites | cdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343 |
container_end_page | 264 |
container_issue | 1 |
container_start_page | 264 |
container_title | Genome Biology |
container_volume | 22 |
creator | Ma, Wenjing ; Su, Kenong ; Wu, Hao |
description | Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset.
In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data.
Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ). |
doi_str_mv | 10.1186/s13059-021-02480-2 |
format | article |
fullrecord | <record><control><sourceid>proquest_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_1cefabdcd5c84096bd25d56141822cf4</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><doaj_id>oai_doaj_org_article_1cefabdcd5c84096bd25d56141822cf4</doaj_id><sourcerecordid>2574441807</sourcerecordid><originalsourceid>FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</originalsourceid><addsrcrecordid>eNpdkt9qFDEYxQdRbK2-gBcS8MaLjuZ_ZrwolFK1UBREwbuQSb6sWWaTbTKz0Pfwgc3u1NJ6ERKSc358OZymeU3we0I6-aEQhkXfYkrq4h1u6ZPmmHDFWyXxr6cPzkfNi1LWGJOeU_m8OWJcYCYkP27-XO7MOJsppIiSRyVtAJmyBTsVFCIq8xbyLhRwyMI4oul2Cyg4iFPwwS42nzIqIa5GaA-a71_P2wI3H5EdTSlVB_kUeTDTnAEVGCu72k6RiQ5l8JAhWkA2xTLl-fD2snnmzVjg1d1-0vz8dPnj4kt7_e3z1cX5dWslI1OrhKeDpZSS3jGlDFO9F2QYPOed8aSXnewcxZZYCd6SAVtLe9F7KqWjinF20lwtXJfMWm9z2Jh8q5MJ-nCR8kqbPAU7giYWvBmcdcJ2HPdycFQ4IQknHaXW71lnC2s7DxtwtkaUzfgI-vglht96lXa641T1klTAuztATjczlElvQtkHaiKkuWgqFOmJwphV6dv_pOs051ij2qs4r0NhVVV0UdmcSqlJ3w9DsN4XSC8F0rVA-lAgTavpzcNv3Fv-NYb9BXwvxEs</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2574441807</pqid></control><display><type>article</type><title>Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction</title><source>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</source><source>PubMed Central Free</source><creator>Ma, Wenjing ; Su, Kenong ; Wu, Hao</creator><creatorcontrib>Ma, Wenjing ; Su, Kenong ; Wu, Hao</creatorcontrib><description>Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. 
Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset.
In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data.
Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).</description><identifier>ISSN: 1474-760X</identifier><identifier>ISSN: 1474-7596</identifier><identifier>EISSN: 1474-760X</identifier><identifier>DOI: 10.1186/s13059-021-02480-2</identifier><identifier>PMID: 34503564</identifier><language>eng</language><publisher>England: BioMed Central</publisher><subject>Accuracy ; Algorithms ; Animals ; Brain - metabolism ; Cell size ; Computer applications ; Data processing ; Databases, Genetic ; Datasets ; Deep learning ; Feature selection ; Gene expression ; Genomics ; Humans ; Leukocytes, Mononuclear - metabolism ; Mars ; Mice ; Molecular Sequence Annotation ; Predictions ; Principal components analysis ; Reference dataset construction ; RNA-Seq ; scRNA-seq ; Single-Cell Analysis ; Supervised cell typing</subject><ispartof>Genome Biology, 2021-09, Vol.22 (1), p.264-264, Article 264</ispartof><rights>2021. The Author(s).</rights><rights>2021. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>The Author(s) 2021</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</citedby><cites>FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</cites><orcidid>0000-0003-1269-7354</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8427961/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2574441807?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,25753,27924,27925,37012,37013,44590,53791,53793</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/34503564$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ma, Wenjing</creatorcontrib><creatorcontrib>Su, Kenong</creatorcontrib><creatorcontrib>Wu, Hao</creatorcontrib><title>Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction</title><title>Genome Biology</title><addtitle>Genome Biol</addtitle><description>Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. 
Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset.
In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data.
Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).</description><subject>Accuracy</subject><subject>Algorithms</subject><subject>Animals</subject><subject>Brain - metabolism</subject><subject>Cell size</subject><subject>Computer applications</subject><subject>Data processing</subject><subject>Databases, Genetic</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Feature selection</subject><subject>Gene expression</subject><subject>Genomics</subject><subject>Humans</subject><subject>Leukocytes, Mononuclear - metabolism</subject><subject>Mars</subject><subject>Mice</subject><subject>Molecular Sequence Annotation</subject><subject>Predictions</subject><subject>Principal components analysis</subject><subject>Reference dataset construction</subject><subject>RNA-Seq</subject><subject>scRNA-seq</subject><subject>Single-Cell Analysis</subject><subject>Supervised cell 
typing</subject><issn>1474-760X</issn><issn>1474-7596</issn><issn>1474-760X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><sourceid>DOA</sourceid><recordid>eNpdkt9qFDEYxQdRbK2-gBcS8MaLjuZ_ZrwolFK1UBREwbuQSb6sWWaTbTKz0Pfwgc3u1NJ6ERKSc358OZymeU3we0I6-aEQhkXfYkrq4h1u6ZPmmHDFWyXxr6cPzkfNi1LWGJOeU_m8OWJcYCYkP27-XO7MOJsppIiSRyVtAJmyBTsVFCIq8xbyLhRwyMI4oul2Cyg4iFPwwS42nzIqIa5GaA-a71_P2wI3H5EdTSlVB_kUeTDTnAEVGCu72k6RiQ5l8JAhWkA2xTLl-fD2snnmzVjg1d1-0vz8dPnj4kt7_e3z1cX5dWslI1OrhKeDpZSS3jGlDFO9F2QYPOed8aSXnewcxZZYCd6SAVtLe9F7KqWjinF20lwtXJfMWm9z2Jh8q5MJ-nCR8kqbPAU7giYWvBmcdcJ2HPdycFQ4IQknHaXW71lnC2s7DxtwtkaUzfgI-vglht96lXa641T1klTAuztATjczlElvQtkHaiKkuWgqFOmJwphV6dv_pOs051ij2qs4r0NhVVV0UdmcSqlJ3w9DsN4XSC8F0rVA-lAgTavpzcNv3Fv-NYb9BXwvxEs</recordid><startdate>20210909</startdate><enddate>20210909</enddate><creator>Ma, Wenjing</creator><creator>Su, Kenong</creator><creator>Wu, Hao</creator><general>BioMed Central</general><general>BMC</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8FE</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-1269-7354</orcidid></search><sort><creationdate>20210909</creationdate><title>Evaluation of some aspects in 
supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction</title><author>Ma, Wenjing ; Su, Kenong ; Wu, Hao</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Accuracy</topic><topic>Algorithms</topic><topic>Animals</topic><topic>Brain - metabolism</topic><topic>Cell size</topic><topic>Computer applications</topic><topic>Data processing</topic><topic>Databases, Genetic</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Feature selection</topic><topic>Gene expression</topic><topic>Genomics</topic><topic>Humans</topic><topic>Leukocytes, Mononuclear - metabolism</topic><topic>Mars</topic><topic>Mice</topic><topic>Molecular Sequence Annotation</topic><topic>Predictions</topic><topic>Principal components analysis</topic><topic>Reference dataset construction</topic><topic>RNA-Seq</topic><topic>scRNA-seq</topic><topic>Single-Cell Analysis</topic><topic>Supervised cell typing</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ma, Wenjing</creatorcontrib><creatorcontrib>Su, Kenong</creatorcontrib><creatorcontrib>Wu, Hao</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Health & Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium 
Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological Science Database</collection><collection>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>Directory of Open Access Journals</collection><jtitle>Genome Biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ma, Wenjing</au><au>Su, Kenong</au><au>Wu, Hao</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference 
construction</atitle><jtitle>Genome Biology</jtitle><addtitle>Genome Biol</addtitle><date>2021-09-09</date><risdate>2021</risdate><volume>22</volume><issue>1</issue><spage>264</spage><epage>264</epage><pages>264-264</pages><artnum>264</artnum><issn>1474-760X</issn><issn>1474-7596</issn><eissn>1474-760X</eissn><abstract>Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset.
In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data.
Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).</abstract><cop>England</cop><pub>BioMed Central</pub><pmid>34503564</pmid><doi>10.1186/s13059-021-02480-2</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0003-1269-7354</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1474-760X |
ispartof | Genome Biology, 2021-09, Vol.22 (1), p.264-264, Article 264 |
issn | 1474-760X 1474-7596 1474-760X |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_1cefabdcd5c84096bd25d56141822cf4 |
source | Publicly Available Content Database (Proquest) (PQ_SDU_P3); PubMed Central Free |
subjects | Accuracy ; Algorithms ; Animals ; Brain - metabolism ; Cell size ; Computer applications ; Data processing ; Databases, Genetic ; Datasets ; Deep learning ; Feature selection ; Gene expression ; Genomics ; Humans ; Leukocytes, Mononuclear - metabolism ; Mars ; Mice ; Molecular Sequence Annotation ; Predictions ; Principal components analysis ; Reference dataset construction ; RNA-Seq ; scRNA-seq ; Single-Cell Analysis ; Supervised cell typing |
title | Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T17%3A16%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Evaluation%20of%20some%20aspects%20in%20supervised%20cell%20type%20identification%20for%20single-cell%20RNA-seq:%20classifier,%20feature%20selection,%20and%20reference%20construction&rft.jtitle=Genome%20Biology&rft.au=Ma,%20Wenjing&rft.date=2021-09-09&rft.volume=22&rft.issue=1&rft.spage=264&rft.epage=264&rft.pages=264-264&rft.artnum=264&rft.issn=1474-760X&rft.eissn=1474-760X&rft_id=info:doi/10.1186/s13059-021-02480-2&rft_dat=%3Cproquest_doaj_%3E2574441807%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2574441807&rft_id=info:pmid/34503564&rfr_iscdi=true |
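
The description above recommends an F-test for feature selection ahead of a multi-layer perceptron classifier. As a rough illustration of the feature selection step only, the sketch below ranks genes by a one-way ANOVA F statistic across cell-type labels using the Python standard library. It is not the authors' implementation (their code is in the GitHub repository cited in the description); the function names `f_statistic` and `select_genes`, the toy expression values, and the choice of `top_k` are all assumptions made for illustration.

```python
from collections import defaultdict


def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across cell-type groups."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0  # F is undefined without at least 2 groups and spare cells
    grand_mean = sum(values) / n
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of individual cells around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_genes(expr, labels, top_k):
    """Return the top_k gene names ranked by F statistic.

    expr maps a gene name to a list of expression values, one per cell.
    """
    ranked = sorted(
        expr, key=lambda gene: f_statistic(expr[gene], labels), reverse=True
    )
    return ranked[:top_k]


# Hypothetical toy reference: 6 cells, 2 cell types, 3 genes.
labels = ["T", "T", "T", "B", "B", "B"]
expr = {
    "geneA": [5.0, 6.0, 5.5, 0.5, 1.0, 0.0],  # differs strongly between types
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # nearly identical across types
    "geneC": [0.0, 1.0, 0.5, 3.0, 2.5, 3.5],  # differs between types
}
selected = select_genes(expr, labels, top_k=2)
print(selected)
```

In a full pipeline the selected genes would be used to subset both the reference and the target expression matrices before training the classifier; that step, and the MLP itself, are left out of this sketch.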