
Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction

Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational...


Bibliographic Details
Published in: Genome Biology 2021-09, Vol.22 (1), p.264-264, Article 264
Main Authors: Ma, Wenjing; Su, Kenong; Wu, Hao
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343
cites cdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343
container_end_page 264
container_issue 1
container_start_page 264
container_title Genome Biology
container_volume 22
creator Ma, Wenjing
Su, Kenong
Wu, Hao
description Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite these advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell type identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types on cell type prediction. Next, we examine how discrepancies between reference and target datasets, and data preprocessing steps such as imputation and batch effect correction, affect prediction performance. We also investigate the strategies of pooling and purifying reference data. Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and using a multi-layer perceptron (MLP) as the classifier, along with the F-test as the feature selection method. All the code used for our analysis is available on GitHub (https://github.com/marvinquiet/RefConstruction_supervisedCelltyping).
doi_str_mv 10.1186/s13059-021-02480-2
format article
fullrecord <record><control><sourceid>proquest_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_1cefabdcd5c84096bd25d56141822cf4</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><doaj_id>oai_doaj_org_article_1cefabdcd5c84096bd25d56141822cf4</doaj_id><sourcerecordid>2574441807</sourcerecordid><originalsourceid>FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</originalsourceid><addsrcrecordid>eNpdkt9qFDEYxQdRbK2-gBcS8MaLjuZ_ZrwolFK1UBREwbuQSb6sWWaTbTKz0Pfwgc3u1NJ6ERKSc358OZymeU3we0I6-aEQhkXfYkrq4h1u6ZPmmHDFWyXxr6cPzkfNi1LWGJOeU_m8OWJcYCYkP27-XO7MOJsppIiSRyVtAJmyBTsVFCIq8xbyLhRwyMI4oul2Cyg4iFPwwS42nzIqIa5GaA-a71_P2wI3H5EdTSlVB_kUeTDTnAEVGCu72k6RiQ5l8JAhWkA2xTLl-fD2snnmzVjg1d1-0vz8dPnj4kt7_e3z1cX5dWslI1OrhKeDpZSS3jGlDFO9F2QYPOed8aSXnewcxZZYCd6SAVtLe9F7KqWjinF20lwtXJfMWm9z2Jh8q5MJ-nCR8kqbPAU7giYWvBmcdcJ2HPdycFQ4IQknHaXW71lnC2s7DxtwtkaUzfgI-vglht96lXa641T1klTAuztATjczlElvQtkHaiKkuWgqFOmJwphV6dv_pOs051ij2qs4r0NhVVV0UdmcSqlJ3w9DsN4XSC8F0rVA-lAgTavpzcNv3Fv-NYb9BXwvxEs</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2574441807</pqid></control><display><type>article</type><title>Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction</title><source>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</source><source>PubMed Central Free</source><creator>Ma, Wenjing ; Su, Kenong ; Wu, Hao</creator><creatorcontrib>Ma, Wenjing ; Su, Kenong ; Wu, Hao</creatorcontrib><description>Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. 
Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data. Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. 
All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).</description><identifier>ISSN: 1474-760X</identifier><identifier>ISSN: 1474-7596</identifier><identifier>EISSN: 1474-760X</identifier><identifier>DOI: 10.1186/s13059-021-02480-2</identifier><identifier>PMID: 34503564</identifier><language>eng</language><publisher>England: BioMed Central</publisher><subject>Accuracy ; Algorithms ; Animals ; Brain - metabolism ; Cell size ; Computer applications ; Data processing ; Databases, Genetic ; Datasets ; Deep learning ; Feature selection ; Gene expression ; Genomics ; Humans ; Leukocytes, Mononuclear - metabolism ; Mars ; Mice ; Molecular Sequence Annotation ; Predictions ; Principal components analysis ; Reference dataset construction ; RNA-Seq ; scRNA-seq ; Single-Cell Analysis ; Supervised cell typing</subject><ispartof>Genome Biology, 2021-09, Vol.22 (1), p.264-264, Article 264</ispartof><rights>2021. The Author(s).</rights><rights>2021. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>The Author(s) 2021</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</citedby><cites>FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</cites><orcidid>0000-0003-1269-7354</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8427961/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2574441807?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,885,25753,27924,27925,37012,37013,44590,53791,53793</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/34503564$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ma, Wenjing</creatorcontrib><creatorcontrib>Su, Kenong</creatorcontrib><creatorcontrib>Wu, Hao</creatorcontrib><title>Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction</title><title>Genome Biology</title><addtitle>Genome Biol</addtitle><description>Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. 
Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data. Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. 
All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).</description><subject>Accuracy</subject><subject>Algorithms</subject><subject>Animals</subject><subject>Brain - metabolism</subject><subject>Cell size</subject><subject>Computer applications</subject><subject>Data processing</subject><subject>Databases, Genetic</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Feature selection</subject><subject>Gene expression</subject><subject>Genomics</subject><subject>Humans</subject><subject>Leukocytes, Mononuclear - metabolism</subject><subject>Mars</subject><subject>Mice</subject><subject>Molecular Sequence Annotation</subject><subject>Predictions</subject><subject>Principal components analysis</subject><subject>Reference dataset construction</subject><subject>RNA-Seq</subject><subject>scRNA-seq</subject><subject>Single-Cell Analysis</subject><subject>Supervised cell typing</subject><issn>1474-760X</issn><issn>1474-7596</issn><issn>1474-760X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><sourceid>DOA</sourceid><recordid>eNpdkt9qFDEYxQdRbK2-gBcS8MaLjuZ_ZrwolFK1UBREwbuQSb6sWWaTbTKz0Pfwgc3u1NJ6ERKSc358OZymeU3we0I6-aEQhkXfYkrq4h1u6ZPmmHDFWyXxr6cPzkfNi1LWGJOeU_m8OWJcYCYkP27-XO7MOJsppIiSRyVtAJmyBTsVFCIq8xbyLhRwyMI4oul2Cyg4iFPwwS42nzIqIa5GaA-a71_P2wI3H5EdTSlVB_kUeTDTnAEVGCu72k6RiQ5l8JAhWkA2xTLl-fD2snnmzVjg1d1-0vz8dPnj4kt7_e3z1cX5dWslI1OrhKeDpZSS3jGlDFO9F2QYPOed8aSXnewcxZZYCd6SAVtLe9F7KqWjinF20lwtXJfMWm9z2Jh8q5MJ-nCR8kqbPAU7giYWvBmcdcJ2HPdycFQ4IQknHaXW71lnC2s7DxtwtkaUzfgI-vglht96lXa641T1klTAuztATjczlElvQtkHaiKkuWgqFOmJwphV6dv_pOs051ij2qs4r0NhVVV0UdmcSqlJ3w9DsN4XSC8F0rVA-lAgTavpzcNv3Fv-NYb9BXwvxEs</recordid><startdate>20210909</startdate><enddate>20210909</enddate><creator>Ma, Wenjing</creator><creator>Su, Kenong</creator><creator>Wu, Hao</creator><general>BioMed 
Central</general><general>BMC</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8FE</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-1269-7354</orcidid></search><sort><creationdate>20210909</creationdate><title>Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction</title><author>Ma, Wenjing ; Su, Kenong ; Wu, Hao</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Accuracy</topic><topic>Algorithms</topic><topic>Animals</topic><topic>Brain - metabolism</topic><topic>Cell size</topic><topic>Computer applications</topic><topic>Data processing</topic><topic>Databases, Genetic</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Feature selection</topic><topic>Gene expression</topic><topic>Genomics</topic><topic>Humans</topic><topic>Leukocytes, Mononuclear - metabolism</topic><topic>Mars</topic><topic>Mice</topic><topic>Molecular Sequence Annotation</topic><topic>Predictions</topic><topic>Principal components 
analysis</topic><topic>Reference dataset construction</topic><topic>RNA-Seq</topic><topic>scRNA-seq</topic><topic>Single-Cell Analysis</topic><topic>Supervised cell typing</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ma, Wenjing</creatorcontrib><creatorcontrib>Su, Kenong</creatorcontrib><creatorcontrib>Wu, Hao</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical 
Database</collection><collection>Biological Science Database</collection><collection>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>Directory of Open Access Journals</collection><jtitle>Genome Biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ma, Wenjing</au><au>Su, Kenong</au><au>Wu, Hao</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction</atitle><jtitle>Genome Biology</jtitle><addtitle>Genome Biol</addtitle><date>2021-09-09</date><risdate>2021</risdate><volume>22</volume><issue>1</issue><spage>264</spage><epage>264</epage><pages>264-264</pages><artnum>264</artnum><issn>1474-760X</issn><issn>1474-7596</issn><eissn>1474-760X</eissn><abstract>Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. 
We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data. Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and use multi-layer perceptron (MLP) as the classifier, along with F-test as the feature selection method. All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).</abstract><cop>England</cop><pub>BioMed Central</pub><pmid>34503564</pmid><doi>10.1186/s13059-021-02480-2</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0003-1269-7354</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1474-760X
ispartof Genome Biology, 2021-09, Vol.22 (1), p.264-264, Article 264
issn 1474-760X
1474-7596
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_1cefabdcd5c84096bd25d56141822cf4
source Publicly Available Content Database (Proquest) (PQ_SDU_P3); PubMed Central Free
subjects Accuracy
Algorithms
Animals
Brain - metabolism
Cell size
Computer applications
Data processing
Databases, Genetic
Datasets
Deep learning
Feature selection
Gene expression
Genomics
Humans
Leukocytes, Mononuclear - metabolism
Mars
Mice
Molecular Sequence Annotation
Predictions
Principal components analysis
Reference dataset construction
RNA-Seq
scRNA-seq
Single-Cell Analysis
Supervised cell typing
title Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T17%3A16%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Evaluation%20of%20some%20aspects%20in%20supervised%20cell%20type%20identification%20for%20single-cell%20RNA-seq:%20classifier,%20feature%20selection,%20and%20reference%20construction&rft.jtitle=Genome%20Biology&rft.au=Ma,%20Wenjing&rft.date=2021-09-09&rft.volume=22&rft.issue=1&rft.spage=264&rft.epage=264&rft.pages=264-264&rft.artnum=264&rft.issn=1474-760X&rft.eissn=1474-760X&rft_id=info:doi/10.1186/s13059-021-02480-2&rft_dat=%3Cproquest_doaj_%3E2574441807%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c631t-75f2bc22219d377a379f51bbf448af196868d20c1c6efc1b0cc2959f266d27343%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2574441807&rft_id=info:pmid/34503564&rfr_iscdi=true
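note The description above recommends the F-test for feature selection. As a rough illustration only, the idea of ranking genes by a one-way ANOVA F statistic across cell-type labels might be sketched as follows. This is not the authors' code (that is in the GitHub repository named in the description); the function names `f_statistic` and `select_top_genes`, the dict-of-lists input layout, and the toy values are all assumptions made for this sketch.

```python
from collections import defaultdict


def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across cell-type groups.

    values: expression of a single gene, one number per cell.
    labels: cell-type label of each cell, same order as values.
    (Hypothetical helper for illustration.)
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n_cells, n_groups = len(values), len(groups)
    if n_groups < 2 or n_cells <= n_groups:
        raise ValueError("need at least two groups and more cells than groups")

    grand_mean = sum(values) / n_cells
    # Variation of the group means around the overall mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of cells around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_groups - 1)) / (ss_within / (n_cells - n_groups))


def select_top_genes(expr, labels, top_k):
    """Return the top_k gene names ranked by descending F statistic.

    expr: dict mapping gene name -> list of expression values (one per cell).
    (Hypothetical helper for illustration.)
    """
    scores = {gene: f_statistic(vals, labels) for gene, vals in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy data, invented for this sketch: six cells, two cell types.
    toy_labels = ["T", "T", "T", "B", "B", "B"]
    toy_expr = {
        "geneA": [5.0, 5.1, 4.9, 0.1, 0.0, 0.2],
        "geneB": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "geneC": [2.0, 2.5, 3.0, 1.0, 1.5, 2.0],
    }
    print(select_top_genes(toy_expr, toy_labels, top_k=2))
```

In practice the selected genes would then be passed to a classifier such as an MLP trained on the reference dataset; a library implementation of the F-test would normally be preferred over this hand-written version.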