A Tutorial on MM Algorithms
David R. Hunter and Kenneth Lange, The American Statistician
Maximum likelihood and least squares are the dominant forms of estimation in frequentist statistics. Toy optimization problems designed for classroom presentation can be solved analytically, but most practical maximum likelihood and least squares estimation problems must be solved numerically. This article discusses an optimization method that typically relies on convexity arguments and is a generalization of the well-known EM algorithm (Dempster, Laird, and Rubin 1977; McLachlan and Krishnan 1997). We call any algorithm based on this iterative method an MM algorithm.
To our knowledge, the general principle behind MM algorithms was first enunciated by the numerical analysts Ortega and Rheinboldt (1970) in the context of line search methods. De Leeuw and Heiser (1977) presented an MM algorithm for multidimensional scaling contemporaneously with the classic Dempster et al. (1977) article on EM algorithms. Although the work of de Leeuw and Heiser did not spark the same explosion of interest from the statistical community set off by the Dempster et al. (1977) article, steady development of MM algorithms has continued. The MM principle reappears, among other places, in robust regression (Huber 1981), in correspondence analysis (Heiser 1987), in the quadratic lower bound principle of Böhning and Lindsay (1988), in the psychometrics literature on least squares (Bijleveld and de Leeuw 1991; Kiers and Ten Berge 1992), and in medical imaging (De Pierro 1995; Lange and Fessler 1995). The recent survey articles of de Leeuw (1994), Heiser (1995), Becker, Yang, and Lange (1997), and Lange, Hunter, and Yang (2000) deal with the general principle, but it is not until the rejoinder of Hunter and Lange (2000a) that the acronym MM first appears. This acronym pays homage to the earlier names "majorization" and "iterative majorization" of the MM principle, emphasizes its crucial link to the better-known EM principle, and diminishes the possibility of confusion with the distinct subject in mathematics known as majorization (Marshall and Olkin 1979). Recent work has demonstrated the utility of MM algorithms in a broad range of statistical contexts, including quantile regression (Hunter and Lange 2000b), survival analysis (Hunter and Lange 2002), paired and multiple comparisons (Hunter 2004), variable selection (Hunter and Li 2002), and DNA sequence analysis (Sabatti and Lange 2002).
One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M of MM stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize. (We define the terms "majorize" and "minorize" in Section 2.) A successful MM algorithm substitutes a simple optimization problem for a difficult optimization problem. Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a nondifferentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.
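To make the majorize-minimize idea concrete, here is a minimal sketch, not taken from the article, of an MM algorithm for a nondifferentiable problem: minimizing f(θ) = Σᵢ |θ − yᵢ|, whose minimizer is a sample median. Majorizing each |θ − yᵢ| at the current iterate by the quadratic (θ − yᵢ)²/(2|θ⁽ᵏ⁾ − yᵢ|) + |θ⁽ᵏ⁾ − yᵢ|/2 yields a smooth surrogate whose minimizer is a weighted average, illustrating simplification tactic (e): turning a nondifferentiable problem into a smooth one. The function name `mm_median` and the safeguard `eps` are our own choices for illustration.

```python
def mm_median(y, theta0, iters=50, eps=1e-12):
    """Minimize f(theta) = sum_i |theta - y_i| by majorize-minimize.

    At the current iterate theta, each term |theta - y_i| is majorized by
    the quadratic (theta - y_i)^2 / (2*|theta_k - y_i|) + |theta_k - y_i|/2,
    which equals |theta - y_i| at theta = theta_k and lies above it elsewhere.
    Minimizing the resulting smooth surrogate gives a weighted average.
    """
    theta = theta0
    for _ in range(iters):
        # Weights 1/|theta - y_i|, guarded against division by zero.
        w = [1.0 / max(abs(theta - yi), eps) for yi in y]
        # Closed-form minimizer of the quadratic surrogate.
        theta = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return theta

y = [1.0, 2.0, 7.0, 8.0, 9.0]
print(mm_median(y, theta0=0.0))  # converges toward the median of y
```

Because each surrogate touches f at the current iterate and dominates it elsewhere, every iteration drives f downhill, which is the descent property guaranteeing the stability of MM algorithms.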
In our view, MM algorithms are easier to understand and sometimes easier to apply than EM algorithms. Although we have no intention of detracting from EM algorithms, their dominance over MM algorithms is a historical accident. An EM algorithm operates by identifying a theoretical complete data space. In the E-step of the algorithm, the conditional expectation of the complete data log-likelihood is calculated conditional on the observed data. The surrogate function created by the E-step is, up to a constant, a minorizing function. In the M-step, this minorizing function is maximized with respect to the parameters of the underlying model; thus, every EM algorithm is an example of an MM algorithm. Construction of an EM algorithm sometimes demands creativity in identifying the complete data and technical skill in calculating an often complicated conditional expectation and then maximizing it analytically. …
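As a concrete instance of an E-step surrogate acting as a minorizer, here is a sketch, not part of this article's text, of the classic multinomial example from Dempster et al. (1977). Observed counts (125, 18, 20, 34) have cell probabilities (1/2 + t/4, (1−t)/4, (1−t)/4, t/4). Splitting the first cell into latent sub-counts with probabilities 1/2 and t/4 makes the complete-data log-likelihood maximizable in closed form; the E-step's expected complete-data log-likelihood minorizes the observed-data log-likelihood up to a constant. The function name `em_multinomial` is ours.

```python
def em_multinomial(t, iters=100):
    """EM for the Dempster et al. (1977) linkage multinomial, viewed as MM.

    Observed counts y1..y4 follow a multinomial with cell probabilities
    (1/2 + t/4, (1-t)/4, (1-t)/4, t/4). The latent variable is the part
    of y1 attributable to the t/4 sub-cell.
    """
    y1, y2, y3, y4 = 125, 18, 20, 34
    for _ in range(iters):
        # E-step: expected latent count of y1 in the t/4 sub-cell,
        # conditional on the observed data and the current t.
        x = y1 * (t / 4) / (0.5 + t / 4)
        # M-step: maximize the surrogate; a binomial-type closed form.
        t = (x + y4) / (x + y2 + y3 + y4)
    return t

print(em_multinomial(0.5))  # fixed point is the MLE, roughly 0.6268
```

Each M-step maximizes a minorizer of the observed-data log-likelihood, so the iteration exhibits the same monotone ascent property as any MM algorithm.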