Variable Selection Methods for Model-Based Clustering: Procedures for Functional Data and Bayesian Inference
This item is protected by copyright, with all rights reserved.
Illustrations: some color
English
967277448
From OAIster®, provided by the OCLC Cooperative.
Further Information
Thesis (Ph.D.)--University of Rochester. School of Medicine & Dentistry. Dept. of Biostatistics and Computational Biology, 2016.
As technology advances, data are becoming more readily available and are collected in larger amounts and more frequently. Discrete, continuous, and time-course data are all easily obtained from business, medical, and biological applications. With the growth and accessibility of these diverse data, there is a greater demand for extracting important information to draw meaningful conclusions. Model-based clustering is a useful unsupervised learning technique that aims to identify subpopulations within the data using a parametric framework. However, some variables in a dataset may not contribute to the clustering model and can mask the true subgroup structure. We develop two novel variable selection procedures for model-based clustering, with motivating examples for this statistical problem. The first method addresses the lack of a simultaneous parametric clustering and variable selection technique for functional time-course data. The procedure we develop uses a greedy search algorithm to integrate variable selection into the clustering procedure by comparing two nested subsets to find a locally optimal solution for functional data. Our new method successfully identifies the most important variables for clustering in a simulation study. The procedure is also applied to a dataset of respiratory function measurements for irradiated and non-irradiated mice, where it is found that only a small subset of variables is necessary to classify the functional data reasonably well. The second method recognizes the disadvantages of the greedy search method and proposes a new simultaneous variable selection and model-based clustering method under a fully Bayesian framework for non-functional data. This procedure enables more complete inference and the possibility of finding a set of globally optimal solutions by successfully modeling the posterior distributions of the cluster-specific means, variances, and proportions of cluster membership. Ou