October 2005
For large data sets, standard algorithms for fitting models with random effects can fail because of convergence or memory problems. At the same time, the abundant information in such data makes it possible to relax assumptions, such as normality of the random effects. Motivated by maternal smoking and childhood growth data from the Collaborative Perinatal Project (CPP), we propose a two-stage clustering procedure for large longitudinal data sets. In the first stage, we use a multivariate clustering method to identify G << N groups of subjects whose data have no scientifically important differences, as defined by subject matter experts. In the second stage, group-specific random effects are assumed to come from an unknown distribution, which is assigned a Dirichlet process prior (DPP); this prior further clusters the groups from the first stage. The methods are illustrated using simulated data and applied to the CPP data.
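The two stages described above can be illustrated with a minimal sketch. It is not the procedure from the manuscript: the greedy "leader" clustering, the `tolerance` threshold standing in for an expert-defined important difference, the function names, and the toy trajectories are all assumptions made for illustration. The second function only simulates draws from a Dirichlet process prior through the Polya urn scheme, to show how such a prior induces ties among group-level effects; it performs no posterior inference.

```python
import random


def stage1_leader_clustering(trajectories, tolerance):
    """Hypothetical stage 1: greedy 'leader' clustering.

    A subject joins the first group whose leader trajectory differs from
    its own by at most `tolerance` at every time point; otherwise the
    subject starts a new group and becomes its leader.
    """
    leaders = []  # one representative trajectory per group
    labels = []   # group index assigned to each subject
    for traj in trajectories:
        for g, leader in enumerate(leaders):
            if max(abs(a - b) for a, b in zip(traj, leader)) <= tolerance:
                labels.append(g)
                break
        else:
            leaders.append(traj)
            labels.append(len(leaders) - 1)
    return labels, leaders


def polya_urn_draws(n_groups, alpha, base_draw, rng):
    """Hypothetical stage 2 illustration: prior draws from a Dirichlet process.

    Draw i (counting from 0) is a fresh value from the base distribution
    with probability alpha / (alpha + i); otherwise it copies one of the
    earlier draws chosen uniformly at random, which creates ties.
    """
    effects = []
    for i in range(n_groups):
        if rng.random() < alpha / (alpha + i):
            effects.append(base_draw(rng))
        else:
            effects.append(rng.choice(effects))
    return effects


# Toy growth trajectories (made-up numbers, three time points per subject).
trajectories = [
    [50.0, 75.0, 95.0],
    [50.5, 75.2, 94.8],
    [48.0, 70.0, 88.0],
    [48.3, 70.4, 88.1],
]
labels, leaders = stage1_leader_clustering(trajectories, tolerance=1.0)

# Simulate random effects for 200 hypothetical stage-1 groups.
rng = random.Random(0)
effects = polya_urn_draws(
    200, alpha=1.0, base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng
)
n_distinct = len(set(effects))  # far fewer distinct values than groups
```

In this sketch, a smaller `tolerance` yields more stage-1 groups, and a smaller `alpha` yields fewer distinct random-effect values among those groups.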
Keywords: Cluster analysis; Dirichlet process; Empirical Bayes; Longitudinal data; Mixed effects model; Prior elicitation.
The manuscript is available in PDF format.