A Review of Forecasting Techniques for Large Data Sets

This paper aims to provide a brief and relatively non-technical overview of state-of-the-art forecasting with large data sets. We classify existing methods into four groups depending on whether data sets are used wholly or partly, whether a single model or multiple models are used and whether a small subset or the whole data set is being forecast. In particular, we provide brief descriptions of the methods and short recommendations where appropriate, without going into detailed discussions of their merits or demerits.

Keywords: Large datasets; factors; forecast combinations.

JEL Classifications: C01; C53

I. Introduction

In recent years there has been increasing interest in forecasting methods that utilise large data sets. There is an awareness that a huge quantity of information is available in the economic arena which might be useful for forecasting, but standard econometric techniques are not well suited to extracting it in a useful form. This is not an issue of mere academic interest. Svensson (2005) described what central bankers do in practice: "Large amounts of data about the state of the economy and the rest of the world ... are collected, processed, and analyzed before each major decision." In an effort to assist in this task, econometricians began assembling large macroeconomic data sets and devising ways of forecasting with them.

In the past few years a large number of methods, either new or new to econometrics, have been proposed for forecasting with large data sets. This review aims to provide a brief discussion of the available methods. Given the recent and evolving nature of the literature, this review is bound to be incomplete. The need for new methods arises because, given time series observations on a large data set, denoted at time t by the N-dimensional vector $x_t$, it is either inefficient or downright impossible to incorporate $x_t$ in a single forecasting model and estimate it using standard econometric techniques.
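The "inefficient or downright impossible" point can be made concrete with a short worked argument. The linear one-step-ahead regression below is an assumption chosen for illustration, since the text does not commit to a particular model; $\beta$, $\varepsilon_{t+1}$, $X$ and $T$ (the number of time series observations) are notation introduced here.

```latex
% Illustrative assumption: a linear forecasting regression on all N predictors
y_{t+1} = \beta' x_t + \varepsilon_{t+1}, \qquad t = 1, \dots, T-1 .

% Stack the observations: X is the (T-1) \times N matrix with rows x_t',
% and y is the (T-1) \times 1 vector of targets. Ordinary least squares requires
\hat{\beta} = (X'X)^{-1} X'y ,
\qquad \text{but} \qquad
\operatorname{rank}(X'X) = \operatorname{rank}(X) \le \min(T-1,\, N) .

% Hence, whenever N > T-1, the N \times N matrix X'X is singular and
% \hat{\beta} is not uniquely defined.
```

When N is smaller than T - 1 but still large, the estimator exists, yet N coefficients must be estimated from only T - 1 observations, so the estimates are typically imprecise. This is the sense in which the single-model approach is inefficient.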

We assume that primary interest focuses on forecasting a single variable $y_t$, which may or may not be included in $x_t$. Broadly speaking, the available methodologies for forecasting with large data sets fall into four groups. The first group consists of estimation strategies that allow estimation of a single equation model that utilises the whole of $x_t$. This is perhaps the most diverse group, ranging from factor-based methods to Bayesian regression. The methods of the second group inherently involve two steps. In the first step some form of variable selection is undertaken. The variables that are chosen are then most likely to be used in a standard forecasting model. Of course, if the resulting data set is still too large, it may be analysed using methods designed for large data sets. These first two groups of methods inevitably overlap. However, we feel that the step of variable selection is, and involves methods that are, sufficiently distinct to merit separate mention and treatment. The third group of methods involves the use of subsets of $x_t$ in distinct forecasting models and the production of multiple forecasts for $y_t$, which are then averaged to produce a final forecast. The distinctive feature of this group is the explicit use of model and forecast averaging. Finally, the fourth and perhaps most innovative group of methods departs from the convention of forecasting a single variable. For this group the aim is to forecast the whole of $x_t$ (which is now assumed to contain $y_t$). Thus, use of multivariate models is inevitable. Here too, specially designed estimation methods need to be employed, as the size of the data set $x_t$ does not allow use of standard econometric techniques.
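To illustrate the third group, the following is a minimal sketch of forecast averaging, not a method taken from the paper. It assumes each small model is a regression of the next-period target on a single predictor, and that the forecasts are combined with equal weights. The function names `ols_fit` and `combined_forecast`, and the simulated data, are hypothetical.

```python
import random


def ols_fit(x, y):
    """Fit y = a + b * x by ordinary least squares and return (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def combined_forecast(predictors, y):
    """Equal-weight average of one-step-ahead forecasts from N small models.

    predictors: list of N series, each of length T
    y:          target series of length T
    Model i regresses y[t + 1] on predictors[i][t], then forecasts the
    next, unobserved value of y from the last observation of predictor i.
    """
    forecasts = []
    for x in predictors:
        a, b = ols_fit(x[:-1], y[1:])    # pair x at time t with y at t + 1
        forecasts.append(a + b * x[-1])  # forecast from the latest x
    return sum(forecasts) / len(forecasts), forecasts


# Hypothetical simulated data with more predictors than observations
# (N = 60, T = 40), where one regression on all predictors is infeasible.
random.seed(0)
T, N = 40, 60
factor = [random.gauss(0, 1) for _ in range(T)]
predictors = [[f + random.gauss(0, 1) for f in factor] for _ in range(N)]
y = [0.0] + [0.8 * factor[t - 1] + random.gauss(0, 0.5) for t in range(1, T)]

average, individual = combined_forecast(predictors, y)
print(f"{len(individual)} individual forecasts, equal-weight average = {average:.3f}")
```

Each small model has only two parameters, so it can be estimated even though N exceeds T. Equal weights are the simplest choice of combination scheme and are used here purely for illustration.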

As the above makes clear, our review will focus on statistical/econometric methods for dealing with large data sets. …