Bayesian nonparametric processes for clustering high-dimensional count data
High-dimensional count data arise in many applied domains, including microbiome and RNA-Seq studies. This talk is motivated by an application to microbiome data, where the abundance of various microbial taxa is recorded for each patient. In this context, a crucial aim may be to cluster patients based on their expression profiles while accounting for library size normalization, in the conventional vocabulary of microbiome studies. Trait allocation models are well suited to this type of data, since they assume that each subject may exhibit multiple traits (taxa), each with a corresponding count (abundance). Building upon finite Completely Random Vectors, we propose a novel Bayesian nonparametric prior that allows the expression profiles to be clustered. As an improvement over existing methods, our construction clusters together subjects that exhibit the same distributional patterns across the entire spectrum of taxa. We provide a fully Bayesian analysis of our model; in particular, we discuss the marginal distribution of the data through a new probabilistic object that extends the Exchangeable Feature Probability Function within the trait allocation framework. Posterior and predictive inference are also addressed in closed form. We have designed an algorithm for posterior inference that has proven efficient even in high-dimensional settings. We validate our proposal on several simulated scenarios. Finally, we apply our model to analyze the microbiome composition of subjects under different diet regimes. A comparison with existing methods is also discussed.
Area: IS2 - Dependence structures in Bayesian nonparametrics (Federico Camerlenghi)
Keywords: Trait allocation clustering, finite Completely Random Vectors, microbiome data, expression profiles