K-means Clustering and Principal Components Analysis of Microarray Data of L1000 Landmark Genes

Bibliographic Details
Published in: Procedia Computer Science, 2020, Vol. 168, p. 97-104
Main Authors: Clayman, Carly L., Srinivasan, Satish M., Sangwan, Raghvinder S.
Format: Article
Language: English
Description
Summary: Dimensionality reduction methods such as principal component analysis (PCA) are used to select relevant features, and k-means clustering performs well when applied to data with low effective dimensionality. This study integrated PCA and k-means clustering using the L1000 dataset, which contains gene microarray data from 978 landmark genes that have previously been shown to predict expression of ~81% of the remaining 21,290 target genes with low error. Groups within the L1000 dataset were characterized using both microarray and clinical metadata to assess whether the 978 landmark genes would improve clustering results compared to a random set of 978 genes. The role of clinical variables, including morphological diagnosis, was assessed across k-means clustering groups within homogeneous tissue samples in the L1000 dataset. Results show that the 978 landmark genes better differentiated k-means clusters, relative to 978 randomly selected non-landmark genes. K-means clusters generated from the landmark genes showed more separation of cluster groups when plotted against the first two principal components, which capture a greater proportion of variation for the 978 landmark genes. These results suggest that the 978 landmark genes better represent the overall genetic profile of these heterogeneous samples. Future studies will implement predictive analytics techniques to further investigate the interaction of microarray data and clinical variables such as cancer stage.
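The comparison described in the summary (project samples onto the first two principal components, cluster with k-means, and compare an informative gene set against an uninformative one) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' pipeline. The data are synthetic stand-ins, not L1000 measurements: `structured` plays the role of an informative gene set and `noise` the role of a random one, and the sample counts, gene counts, effect size, and all function names are assumptions made for the example. PCA is approximated by power iteration, and k-means uses a simple farthest-point seeding chosen here for determinism.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def center(X):
    """Subtract each gene's (column's) mean from a samples-by-genes matrix."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def top_components(X, k=2, iters=200, seed=0):
    """Approximate the top-k principal axes of centered X by power iteration.

    Returns (axes, ratios); ratios[i] is the share of total variance along axes[i].
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    cols = list(zip(*X))
    total_var = sum(x * x for row in X for x in row) / (n - 1)
    axes, ratios = [], []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _ in range(iters):
            for a in axes:  # stay orthogonal to the axes already found
                d = dot(v, a)
                v = [vi - d * ai for vi, ai in zip(v, a)]
            scores = [dot(row, v) for row in X]     # X v
            w = [dot(scores, col) for col in cols]  # X^T (X v)
            norm = dot(w, w) ** 0.5
            v = [wi / norm for wi in w]
        axes.append(v)
        ratios.append(sum(dot(row, v) ** 2 for row in X) / (n - 1) / total_var)
    return axes, ratios

def kmeans2(points, iters=50):
    """Lloyd's algorithm with k=2, seeded by the first point and the point farthest from it."""
    centers = [points[0], max(points, key=lambda q: sqdist(q, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sqdist(q, centers[0]) <= sqdist(q, centers[1]) else 1
                  for q in points]
        for c in (0, 1):
            members = [q for q, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def analyze(X, groups):
    """Return (variance share of PC1+PC2, cluster purity against the known groups)."""
    Xc = center(X)
    axes, ratios = top_components(Xc)
    projected = [[dot(row, a) for a in axes] for row in Xc]
    labels = kmeans2(projected)
    agree = sum(lab == g for lab, g in zip(labels, groups)) / len(groups)
    return sum(ratios), max(agree, 1 - agree)  # cluster labels are arbitrary

# Synthetic stand-in data (assumption): 40 samples in two groups, 30 genes per set.
rng = random.Random(42)
groups = [0] * 20 + [1] * 20
structured = [[(3.0 if g else 0.0) + rng.gauss(0, 1) for _ in range(30)] for g in groups]
noise = [[rng.gauss(0, 1) for _ in range(30)] for _ in groups]

ratio_structured, purity_structured = analyze(structured, groups)
ratio_noise, purity_noise = analyze(noise, groups)
print("structured genes: PC1+PC2 share", round(ratio_structured, 3), "purity", purity_structured)
print("noise genes:      PC1+PC2 share", round(ratio_noise, 3), "purity", purity_noise)
```

In this toy setting the gene set that carries group structure concentrates more variance in the first two components and yields clusters that match the known groups, which mirrors the kind of comparison the summary reports. With real expression data one would more typically use an established library implementation of PCA and k-means.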
ISSN: 1877-0509
DOI: 10.1016/j.procs.2020.02.265