
Large-scale benchmark study of survival prediction methods using multi-omics data

Bibliographic Details
Published in: Briefings in bioinformatics 2021-05, Vol.22 (3)
Main Authors: Herrmann, Moritz; Probst, Philipp; Hornung, Roman; Jurinovic, Vindi; Boulesteix, Anne-Laure
Format: Article
Language: English
Subjects: Method Review
cited_by cdi_FETCH-LOGICAL-c358t-a76599f873cec0cd1475013e530cdd34756956b56e65ddc60b0d6ee7ff14d8553
cites cdi_FETCH-LOGICAL-c358t-a76599f873cec0cd1475013e530cdd34756956b56e65ddc60b0d6ee7ff14d8553
container_end_page
container_issue 3
container_start_page
container_title Briefings in bioinformatics
container_volume 22
creator Herrmann, Moritz
Probst, Philipp
Hornung, Roman
Jurinovic, Vindi
Boulesteix, Anne-Laure
description Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly generated to investigate various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate for deriving such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied to 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can keep the predictive information in low-dimensional groups, especially clinical variables, from going unexploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on GitHub.
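The description names the Kaplan–Meier estimate as one of the two reference methods. As a rough illustration of what that baseline computes, here is a minimal Python sketch of the product-limit estimator for right-censored survival times. It is not the authors' R code; the function name `kaplan_meier` and the toy data are assumptions made purely for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times  -- observed follow-up times (event or censoring time)
    events -- 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) pairs, one per
    distinct time at which at least one event occurred.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)   # subjects still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0        # events observed exactly at time t
        n_leaving = 0       # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


if __name__ == "__main__":
    # Hypothetical toy data: five subjects, two of them censored.
    toy_times = [1, 2, 2, 3, 4]
    toy_events = [1, 1, 0, 1, 0]
    for t, s in kaplan_meier(toy_times, toy_events):
        print(f"t={t}  S(t)={s:.2f}")
```

On the toy data the estimated survival probability drops to 0.80, 0.60 and 0.30 at times 1, 2 and 3. Censored subjects leave the risk set without producing a step, which is why the estimate serves as a covariate-free baseline for censored outcomes.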
doi_str_mv 10.1093/bib/bbaa167
format article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8138887</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2436393986</sourcerecordid><originalsourceid>FETCH-LOGICAL-c358t-a76599f873cec0cd1475013e530cdd34756956b56e65ddc60b0d6ee7ff14d8553</originalsourceid><addsrcrecordid>eNpVUU1LAzEQDaLYWj35B3IUZG2y2WSzF0GKX1AQQc8hm8y20d1NTbKF_nu3tAgehpnHPN6b4SF0TckdJRWb166e17XWVJQnaEqLsswKwovT_SzKjBeCTdBFjF-E5KSU9BxNWC7zsdgUvS91WEEWjW4B19CbdafDN45psDvsGxyHsHVb3eJNAOtMcr7HHaS1txEP0fUr3A1tcpnvnInY6qQv0Vmj2whXxz5Dn0-PH4uXbPn2_Lp4WGaGcZkyXQpeVY0smQFDjB3v5oQy4GwElo1IVFzUXIDg1hpBamIFQNk0tLCSczZD9wfdzVB3YA30KehWbYIbP9gpr536v-ndWq38VknKpBx9Z-jmKBD8zwAxqc5FA22re_BDVHnBBKtYJcVIvT1QTfAxBmj-bChR-xDUGII6hsB-AU-ze_A</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2436393986</pqid></control><display><type>article</type><title>Large-scale benchmark study of survival prediction methods using multi-omics data</title><source>Business Source Ultimate</source><source>Oxford University Press Open Access</source><source>PubMed Central</source><creator>Herrmann, Moritz ; Probst, Philipp ; Hornung, Roman ; Jurinovic, Vindi ; Boulesteix, Anne-Laure</creator><creatorcontrib>Herrmann, Moritz ; Probst, Philipp ; Hornung, Roman ; Jurinovic, Vindi ; Boulesteix, Anne-Laure</creatorcontrib><description>Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. 
Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups—especially clinical variables—from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact:  moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on Github.</description><identifier>ISSN: 1467-5463</identifier><identifier>EISSN: 1477-4054</identifier><identifier>DOI: 10.1093/bib/bbaa167</identifier><identifier>PMID: 32823283</identifier><language>eng</language><publisher>Oxford University Press</publisher><subject>Method Review</subject><ispartof>Briefings in bioinformatics, 2021-05, Vol.22 (3)</ispartof><rights>The Author(s) 2020. Published by Oxford University Press. All rights reserved. 
For Permissions, please email: journals.permissions@oup.com 2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c358t-a76599f873cec0cd1475013e530cdd34756956b56e65ddc60b0d6ee7ff14d8553</citedby><cites>FETCH-LOGICAL-c358t-a76599f873cec0cd1475013e530cdd34756956b56e65ddc60b0d6ee7ff14d8553</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8138887/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8138887/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,881,27901,27902,53766,53768</link.rule.ids></links><search><creatorcontrib>Herrmann, Moritz</creatorcontrib><creatorcontrib>Probst, Philipp</creatorcontrib><creatorcontrib>Hornung, Roman</creatorcontrib><creatorcontrib>Jurinovic, Vindi</creatorcontrib><creatorcontrib>Boulesteix, Anne-Laure</creatorcontrib><title>Large-scale benchmark study of survival prediction methods using multi-omics data</title><title>Briefings in bioinformatics</title><description>Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. 
Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups—especially clinical variables—from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact:  moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. 
All analyses are reproducible using R code freely available on Github.</description><subject>Method Review</subject><issn>1467-5463</issn><issn>1477-4054</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNpVUU1LAzEQDaLYWj35B3IUZG2y2WSzF0GKX1AQQc8hm8y20d1NTbKF_nu3tAgehpnHPN6b4SF0TckdJRWb166e17XWVJQnaEqLsswKwovT_SzKjBeCTdBFjF-E5KSU9BxNWC7zsdgUvS91WEEWjW4B19CbdafDN45psDvsGxyHsHVb3eJNAOtMcr7HHaS1txEP0fUr3A1tcpnvnInY6qQv0Vmj2whXxz5Dn0-PH4uXbPn2_Lp4WGaGcZkyXQpeVY0smQFDjB3v5oQy4GwElo1IVFzUXIDg1hpBamIFQNk0tLCSczZD9wfdzVB3YA30KehWbYIbP9gpr536v-ndWq38VknKpBx9Z-jmKBD8zwAxqc5FA22re_BDVHnBBKtYJcVIvT1QTfAxBmj-bChR-xDUGII6hsB-AU-ze_A</recordid><startdate>20210520</startdate><enddate>20210520</enddate><creator>Herrmann, Moritz</creator><creator>Probst, Philipp</creator><creator>Hornung, Roman</creator><creator>Jurinovic, Vindi</creator><creator>Boulesteix, Anne-Laure</creator><general>Oxford University Press</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>20210520</creationdate><title>Large-scale benchmark study of survival prediction methods using multi-omics data</title><author>Herrmann, Moritz ; Probst, Philipp ; Hornung, Roman ; Jurinovic, Vindi ; Boulesteix, Anne-Laure</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c358t-a76599f873cec0cd1475013e530cdd34756956b56e65ddc60b0d6ee7ff14d8553</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Method Review</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Herrmann, Moritz</creatorcontrib><creatorcontrib>Probst, Philipp</creatorcontrib><creatorcontrib>Hornung, Roman</creatorcontrib><creatorcontrib>Jurinovic, Vindi</creatorcontrib><creatorcontrib>Boulesteix, 
Anne-Laure</creatorcontrib><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Briefings in bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Herrmann, Moritz</au><au>Probst, Philipp</au><au>Hornung, Roman</au><au>Jurinovic, Vindi</au><au>Boulesteix, Anne-Laure</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Large-scale benchmark study of survival prediction methods using multi-omics data</atitle><jtitle>Briefings in bioinformatics</jtitle><date>2021-05-20</date><risdate>2021</risdate><volume>22</volume><issue>3</issue><issn>1467-5463</issn><eissn>1477-4054</eissn><abstract>Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. 
Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups—especially clinical variables—from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact:  moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on Github.</abstract><pub>Oxford University Press</pub><pmid>32823283</pmid><doi>10.1093/bib/bbaa167</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1467-5463
ispartof Briefings in bioinformatics, 2021-05, Vol.22 (3)
issn 1467-5463
1477-4054
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8138887
source Business Source Ultimate; Oxford University Press Open Access; PubMed Central
subjects Method Review
title Large-scale benchmark study of survival prediction methods using multi-omics data
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-12T19%3A59%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Large-scale%20benchmark%20study%20of%20survival%20prediction%20methods%20using%20multi-omics%20data&rft.jtitle=Briefings%20in%20bioinformatics&rft.au=Herrmann,%20Moritz&rft.date=2021-05-20&rft.volume=22&rft.issue=3&rft.issn=1467-5463&rft.eissn=1477-4054&rft_id=info:doi/10.1093/bib/bbaa167&rft_dat=%3Cproquest_pubme%3E2436393986%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c358t-a76599f873cec0cd1475013e530cdd34756956b56e65ddc60b0d6ee7ff14d8553%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2436393986&rft_id=info:pmid/32823283&rfr_iscdi=true