This paper was selected as a 2010 Bayesian Analysis Invited Discussion Paper after being accepted for publication in Bayesian Analysis, 2010.
Original Manuscript: December 2009
One of the challenges in using Markov chain Monte Carlo for model analysis in studies with very large datasets is the need to scan through the entire dataset at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically by drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provide little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full dataset is used to guide selection sampling of a further set of observations {\em targeted} at a scientifically interesting, low-probability region. We define a sequential Monte Carlo strategy in which the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. An example from flow cytometry illustrates the ability of the approach to increase the resolution of inferences for rare cell subtypes.
Keywords: Flow cytometry, large data sets, mixture models, rare events, resampling, selection sampling, sequential Monte Carlo
Research was partially supported by grants to Duke University from the NSF (DMS-0342172) and the National Institutes of Health (grants P50-GM081883, P30-AI064518-0 and contract HHSN268200500019C). Aspects of the research were also partially supported by the NSF grant DMS-0635449 to the Statistical and Applied Mathematical Sciences Institute. CC and MW were also partially supported on this research by NIH grant RC1AI086032. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.
Computer code (an example Matlab script and supporting functions) implementing the analyses reported here is available for free download.
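As a rough illustration of the targeted selection step described in the abstract, the following Python sketch draws a subsample with selection probability proportional to a weight function concentrated on a region of interest. It is not the authors' Matlab implementation: the function name `targeted_subsample`, the Gaussian form of the weight, the synthetic two-component data, and the hand-set `center` and `scale` are all assumptions made for illustration. In the approach described above, the region of interest would instead be guided by the analysis of an initial subsample. The sketch covers only the selection of observations, not the mixture-model inference, sequential augmentation, or stopping rule.

```python
import numpy as np


def targeted_subsample(data, center, scale, size, rng):
    """Select `size` rows of `data` without replacement, with probability
    proportional to a Gaussian kernel centred at `center` (hypothetical
    weight function, assumed here for illustration only)."""
    sq_dist = np.sum(((data - center) / scale) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dist)
    probs = weights / weights.sum()
    idx = rng.choice(len(data), size=size, replace=False, p=probs)
    return idx, probs


rng = np.random.default_rng(0)

# Synthetic stand-in for a large dataset: a dominant bulk component
# and a rare component making up 0.5% of the observations.
bulk = rng.normal(0.0, 1.0, size=(19900, 2))
rare = rng.normal(4.0, 0.3, size=(100, 2))
data = np.vstack([bulk, rare])
is_rare = np.arange(len(data)) >= len(bulk)

# Assumed region of interest; set by hand in this sketch.
center = np.array([4.0, 4.0])
scale = 1.0

idx, probs = targeted_subsample(data, center, scale, size=80, rng=rng)

# Compare how often rare observations appear under each scheme.
rare_fraction_full = is_rare.mean()
rare_fraction_targeted = is_rare[idx].mean()
print(f"rare fraction in full data:          {rare_fraction_full:.4f}")
print(f"rare fraction in targeted subsample: {rare_fraction_targeted:.4f}")
```

Because the targeted draw is deliberately non-uniform, the selected observations are not a simple random sample of the full dataset, so any downstream analysis has to take the selection weights into account.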