Model Selection Strategies for Determining the Optimal Number of Overlapping Clusters in Additive Overlapping Partitional Clustering

In various scientific fields, researchers make use of partitioning methods (e.g., K-means) to disclose the structural mechanisms underlying object by variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clust...

Bibliographic Details
Published in:	Journal of classification 2022-07, Vol.39 (2), p.264-301
Main Authors:	Rossbroich, Julian; Durieux, Jeffrey; Wilderjans, Tom F.
Format:	Article
Language:	English
Subjects:	Bioinformatics; Clustering; Information theory; Marketing; Mathematics and Statistics; Pattern Recognition; Psychometrics; Signal, Image and Speech Processing; Statistical Theory and Methods; Statistics

cited_by	cdi_FETCH-LOGICAL-c363t-2268a9b7b166b7e62ac7cfc3c46aa51e020f015b46a73c3b8af28bb6c8b8fe3b3
cites	cdi_FETCH-LOGICAL-c363t-2268a9b7b166b7e62ac7cfc3c46aa51e020f015b46a73c3b8af28bb6c8b8fe3b3
container_end_page	301
container_issue	2
container_start_page	264
container_title	Journal of classification
container_volume	39
creator	Rossbroich, Julian; Durieux, Jeffrey; Wilderjans, Tom F.
description	In various scientific fields, researchers use partitioning methods (e.g., K-means) to disclose the structural mechanisms underlying object-by-variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clusters) might lead to a better representation of the underlying clustering structure. To obtain an overlapping object clustering from object-by-variable data, Mirkin’s ADditive PROfile CLUStering (ADPROCLUS) model may be used. A major challenge when performing ADPROCLUS is to determine the optimal number of overlapping clusters underlying the data, which is a model selection problem. Up to now, however, this problem has not been systematically investigated, and almost no guidelines can be found in the literature regarding appropriate model selection strategies for ADPROCLUS. Therefore, in this paper, several existing model selection strategies for K-means (among others, CHull; the Caliński-Harabasz, Krzanowski-Lai, Average Silhouette Width, and Dunn indices; and information-theoretic measures such as AIC and BIC) and two cross-validation-based strategies are tailored to an ADPROCLUS context and compared with each other in an extensive simulation study. The results demonstrate that CHull outperforms all other model selection strategies, especially when the negative log-likelihood, which is associated with a minimal stochastic extension of ADPROCLUS, is used as the (mis)fit measure. The analysis of a post hoc AIC-based model selection strategy revealed that better performance may be obtained when a different, more appropriate definition of model complexity for ADPROCLUS is used.
doi_str_mv	10.1007/s00357-021-09409-1
format	article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2694457168</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2694457168</sourcerecordid><originalsourceid>FETCH-LOGICAL-c363t-2268a9b7b166b7e62ac7cfc3c46aa51e020f015b46a73c3b8af28bb6c8b8fe3b3</originalsourceid><addsrcrecordid>eNp9kE1LAzEQhoMoWKt_wFPAczQfu8n2WOonVCtUzyFJZ2vKtrsmacG7P9ysFcSLp2GG531n5kXonNFLRqm6ipSKUhHKGaGjgo4IO0ADVghOmCjEIRpQpiQpuKyO0UmMK5pFUqoB-nxsF9DgOTTgkm83eJ6CSbD0EHHdBnwNCcLab_xmidMb4FmX_No0-Gm7thBwW-PZDkJjuq4nJs02Zj5iv8HjxcInv4M_wLMJyfd7ssUPnMen6Kg2TYSznzpEr7c3L5N7Mp3dPUzGU-KEFInwfL4ZWWWZlFaB5MYpVzvhCmlMyYByWlNW2twq4YStTM0ra6WrbFWDsGKILva-XWjftxCTXrXbkG-JmstRUZSKySpTfE-50MYYoNZdyD-HD82o7tPW-7R1Tlt_p61ZFom9KHb9RxB-rf9RfQFvaYV8</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2694457168</pqid></control><display><type>article</type><title>Model Selection Strategies for Determining the Optimal Number of Overlapping Clusters in Additive Overlapping Partitional Clustering</title><source>Library & Information Science Abstracts (LISA)</source><source>Springer Link</source><creator>Rossbroich, Julian ; Durieux, Jeffrey ; Wilderjans, Tom F.</creator><creatorcontrib>Rossbroich, Julian ; Durieux, Jeffrey ; Wilderjans, Tom F.</creatorcontrib><description>In various scientific fields, researchers make use of partitioning methods (e.g., K -means) to disclose the structural mechanisms underlying object by variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clusters) might lead to a better representation of the underlying clustering structure. To obtain an overlapping object clustering from object by variable data, Mirkin’s ADditive PROfile CLUStering (ADPROCLUS) model may be used. 
A major challenge when performing ADPROCLUS is to determine the optimal number of overlapping clusters underlying the data, which pertains to a model selection problem. Up to now, however, this problem has not been systematically investigated and almost no guidelines can be found in the literature regarding appropriate model selection strategies for ADPROCLUS. Therefore, in this paper, several existing model selection strategies for K -means (a.o., CHull, the Caliński-Harabasz, Krzanowski-Lai, Average Silhouette Width and Dunn Index and information-theoretic measures like AIC and BIC) and two cross-validation based strategies are tailored towards an ADPROCLUS context and are compared to each other in an extensive simulation study. The results demonstrate that CHull outperforms all other model selection strategies and this especially when the negative log-likelihood, which is associated with a minimal stochastic extension of ADPROCLUS, is used as (mis)fit measure. The analysis of a post hoc AIC-based model selection strategy revealed that better performance may be obtained when a different—more appropriate—definition of model complexity for ADPROCLUS is used.</description><identifier>ISSN: 0176-4268</identifier><identifier>EISSN: 1432-1343</identifier><identifier>DOI: 10.1007/s00357-021-09409-1</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Bioinformatics ; Clustering ; Information theory ; Marketing ; Mathematics and Statistics ; Pattern Recognition ; Psychometrics ; Signal,Image and Speech Processing ; Statistical Theory and Methods ; Statistics</subject><ispartof>Journal of classification, 2022-07, Vol.39 (2), p.264-301</ispartof><rights>The Author(s) 2022</rights><rights>The Author(s) 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c363t-2268a9b7b166b7e62ac7cfc3c46aa51e020f015b46a73c3b8af28bb6c8b8fe3b3</citedby><cites>FETCH-LOGICAL-c363t-2268a9b7b166b7e62ac7cfc3c46aa51e020f015b46a73c3b8af28bb6c8b8fe3b3</cites><orcidid>0000-0002-1677-4938</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925,34135</link.rule.ids></links><search><creatorcontrib>Rossbroich, Julian</creatorcontrib><creatorcontrib>Durieux, Jeffrey</creatorcontrib><creatorcontrib>Wilderjans, Tom F.</creatorcontrib><title>Model Selection Strategies for Determining the Optimal Number of Overlapping Clusters in Additive Overlapping Partitional Clustering</title><title>Journal of classification</title><addtitle>J Classif</addtitle><description>In various scientific fields, researchers make use of partitioning methods (e.g., K -means) to disclose the structural mechanisms underlying object by variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clusters) might lead to a better representation of the underlying clustering structure. To obtain an overlapping object clustering from object by variable data, Mirkin’s ADditive PROfile CLUStering (ADPROCLUS) model may be used. A major challenge when performing ADPROCLUS is to determine the optimal number of overlapping clusters underlying the data, which pertains to a model selection problem. Up to now, however, this problem has not been systematically investigated and almost no guidelines can be found in the literature regarding appropriate model selection strategies for ADPROCLUS. 
Therefore, in this paper, several existing model selection strategies for K -means (a.o., CHull, the Caliński-Harabasz, Krzanowski-Lai, Average Silhouette Width and Dunn Index and information-theoretic measures like AIC and BIC) and two cross-validation based strategies are tailored towards an ADPROCLUS context and are compared to each other in an extensive simulation study. The results demonstrate that CHull outperforms all other model selection strategies and this especially when the negative log-likelihood, which is associated with a minimal stochastic extension of ADPROCLUS, is used as (mis)fit measure. The analysis of a post hoc AIC-based model selection strategy revealed that better performance may be obtained when a different—more appropriate—definition of model complexity for ADPROCLUS is used.</description><subject>Bioinformatics</subject><subject>Clustering</subject><subject>Information theory</subject><subject>Marketing</subject><subject>Mathematics and Statistics</subject><subject>Pattern Recognition</subject><subject>Psychometrics</subject><subject>Signal,Image and Speech Processing</subject><subject>Statistical Theory and Methods</subject><subject>Statistics</subject><issn>0176-4268</issn><issn>1432-1343</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>F2A</sourceid><recordid>eNp9kE1LAzEQhoMoWKt_wFPAczQfu8n2WOonVCtUzyFJZ2vKtrsmacG7P9ysFcSLp2GG531n5kXonNFLRqm6ipSKUhHKGaGjgo4IO0ADVghOmCjEIRpQpiQpuKyO0UmMK5pFUqoB-nxsF9DgOTTgkm83eJ6CSbD0EHHdBnwNCcLab_xmidMb4FmX_No0-Gm7thBwW-PZDkJjuq4nJs02Zj5iv8HjxcInv4M_wLMJyfd7ssUPnMen6Kg2TYSznzpEr7c3L5N7Mp3dPUzGU-KEFInwfL4ZWWWZlFaB5MYpVzvhCmlMyYByWlNW2twq4YStTM0ra6WrbFWDsGKILva-XWjftxCTXrXbkG-JmstRUZSKySpTfE-50MYYoNZdyD-HD82o7tPW-7R1Tlt_p61ZFom9KHb9RxB-rf9RfQFvaYV8</recordid><startdate>20220701</startdate><enddate>20220701</enddate><creator>Rossbroich, Julian</creator><creator>Durieux, Jeffrey</creator><creator>Wilderjans, Tom 
F.</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>C6C</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>E3H</scope><scope>F2A</scope><scope>JQ2</scope><orcidid>https://orcid.org/0000-0002-1677-4938</orcidid></search><sort><creationdate>20220701</creationdate><title>Model Selection Strategies for Determining the Optimal Number of Overlapping Clusters in Additive Overlapping Partitional Clustering</title><author>Rossbroich, Julian ; Durieux, Jeffrey ; Wilderjans, Tom F.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c363t-2268a9b7b166b7e62ac7cfc3c46aa51e020f015b46a73c3b8af28bb6c8b8fe3b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Bioinformatics</topic><topic>Clustering</topic><topic>Information theory</topic><topic>Marketing</topic><topic>Mathematics and Statistics</topic><topic>Pattern Recognition</topic><topic>Psychometrics</topic><topic>Signal,Image and Speech Processing</topic><topic>Statistical Theory and Methods</topic><topic>Statistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Rossbroich, Julian</creatorcontrib><creatorcontrib>Durieux, Jeffrey</creatorcontrib><creatorcontrib>Wilderjans, Tom F.</creatorcontrib><collection>Springer_OA刊</collection><collection>CrossRef</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><collection>ProQuest Computer Science Collection</collection><jtitle>Journal of classification</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Rossbroich, Julian</au><au>Durieux, Jeffrey</au><au>Wilderjans, Tom F.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Model Selection Strategies for Determining the Optimal Number 
of Overlapping Clusters in Additive Overlapping Partitional Clustering</atitle><jtitle>Journal of classification</jtitle><stitle>J Classif</stitle><date>2022-07-01</date><risdate>2022</risdate><volume>39</volume><issue>2</issue><spage>264</spage><epage>301</epage><pages>264-301</pages><issn>0176-4268</issn><eissn>1432-1343</eissn><abstract>In various scientific fields, researchers make use of partitioning methods (e.g., K -means) to disclose the structural mechanisms underlying object by variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clusters) might lead to a better representation of the underlying clustering structure. To obtain an overlapping object clustering from object by variable data, Mirkin’s ADditive PROfile CLUStering (ADPROCLUS) model may be used. A major challenge when performing ADPROCLUS is to determine the optimal number of overlapping clusters underlying the data, which pertains to a model selection problem. Up to now, however, this problem has not been systematically investigated and almost no guidelines can be found in the literature regarding appropriate model selection strategies for ADPROCLUS. Therefore, in this paper, several existing model selection strategies for K -means (a.o., CHull, the Caliński-Harabasz, Krzanowski-Lai, Average Silhouette Width and Dunn Index and information-theoretic measures like AIC and BIC) and two cross-validation based strategies are tailored towards an ADPROCLUS context and are compared to each other in an extensive simulation study. The results demonstrate that CHull outperforms all other model selection strategies and this especially when the negative log-likelihood, which is associated with a minimal stochastic extension of ADPROCLUS, is used as (mis)fit measure. 
The analysis of a post hoc AIC-based model selection strategy revealed that better performance may be obtained when a different—more appropriate—definition of model complexity for ADPROCLUS is used.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s00357-021-09409-1</doi><tpages>38</tpages><orcidid>https://orcid.org/0000-0002-1677-4938</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0176-4268
ispartof	Journal of classification, 2022-07, Vol.39 (2), p.264-301
issn	0176-4268 1432-1343
language	eng
recordid	cdi_proquest_journals_2694457168
source	Library & Information Science Abstracts (LISA); Springer Link
subjects	Bioinformatics; Clustering; Information theory; Marketing; Mathematics and Statistics; Pattern Recognition; Psychometrics; Signal, Image and Speech Processing; Statistical Theory and Methods; Statistics
title	Model Selection Strategies for Determining the Optimal Number of Overlapping Clusters in Additive Overlapping Partitional Clustering
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T09%3A59%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Model%20Selection%20Strategies%20for%20Determining%20the%20Optimal%20Number%20of%20Overlapping%20Clusters%20in%20Additive%20Overlapping%20Partitional%20Clustering&rft.jtitle=Journal%20of%20classification&rft.au=Rossbroich,%20Julian&rft.date=2022-07-01&rft.volume=39&rft.issue=2&rft.spage=264&rft.epage=301&rft.pages=264-301&rft.issn=0176-4268&rft.eissn=1432-1343&rft_id=info:doi/10.1007/s00357-021-09409-1&rft_dat=%3Cproquest_cross%3E2694457168%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c363t-2268a9b7b166b7e62ac7cfc3c46aa51e020f015b46a73c3b8af28bb6c8b8fe3b3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2694457168&rft_id=info:pmid/&rfr_iscdi=true