Understanding the use of unlabelled data in predictive modelling

Feng Liang, Sayan Mukherjee and Mike West

May, 2005

The incorporation of unlabelled data in statistical machine learning methods for prediction, including regression and classification, has shown the potential for improved predictive accuracy in a number of recent examples. The statistical basis for this semi-supervised analysis does not, however, appear to have been well delineated in the literature to date. Nor, perhaps, are statisticians as fully engaged in the vigorous research in this area of machine learning as might be desired. Much of the theoretical work in the literature has focused, for example, on geometric and structural properties of the unlabelled data in the context of particular algorithms, rather than on probabilistic and statistical questions. This paper overviews the fundamental statistical foundations for predictive modelling and the general questions associated with unlabelled data, highlighting the relevance of venerable concepts of sampling design and prior specification. This theory, illustrated with a series of simple but central examples, shows precisely when, why and how unlabelled data matter.

Keywords: Bayesian analysis, Bayesian kernel regression, Predictive distribution, Semi-supervised learning, Unlabelled data.


The original manuscript is available as a PDF document. The revised and substantially updated paper appears in Statistical Science, 2007 (May issue).