Choice Models for Predicting Divisional Winners in Major League Baseball
Barry, Daniel, Hartigan, J. A., Journal of the American Statistical Association
Major league baseball is played by 26 teams divided into two leagues, the American League and the National League. The American League (AL) has two divisions, the AL East and the AL West, each consisting of 7 teams. The National League (NL) has two divisions, the NL East and the NL West, each consisting of 6 teams. Each of the 26 teams plays 162 games in a season against teams in its own league. At the end of the regular season, the two teams with the best record in their respective divisions of each league play one another in a best-of-7-game series to decide the league champion. The two league champions then meet in a best-of-7-game series called the World Series.
The data for a particular season are available from the weekly magazine The Sporting News, which publishes a schedule of all games in early March and, during the season, publishes each week the results of all games from the previous week. Summary data for the 1991 National League season appear in Table 1. The table presents the results of home and away games for each pair of teams in four 6-week periods.
We will estimate, for each team, the probability of winning its division, given the results of all games played up to a certain date and the list of remaining games for each team. We wish to allow for differing strengths for each team, for differing home advantages, and for changing strengths over time. Table 1 shows, for example, that Atlanta had a poor record before the All-Star break (39 wins in 79 games) and a good record after the break (55 wins in 83 games).
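The size of the Atlanta change can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the model described here: the function names are hypothetical, and the pooled two-proportion z statistic is a naive check that ignores opponents, home advantage, and the schedule. The winning fractions come out at about 0.494 before the break and 0.663 after it.

```python
from math import sqrt

def win_fraction(wins, games):
    """Fraction of games won."""
    return wins / games

def two_proportion_z(w1, n1, w2, n2):
    """Pooled two-proportion z statistic for the change from period 1 to period 2.

    Naive assumption: games are independent with a common win probability
    within each period (no opponent or home-field effects).
    """
    p1, p2 = w1 / n1, w2 / n2
    pooled = (w1 + w2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Atlanta, 1991: 39 wins in 79 games before the All-Star break,
# 55 wins in 83 games after it (figures quoted in the text).
before = win_fraction(39, 79)
after = win_fraction(55, 83)
z = two_proportion_z(39, 79, 55, 83)
print(f"before: {before:.3f}  after: {after:.3f}  z: {z:.2f}")
```

A z statistic of this kind only suggests that a single fixed strength fits Atlanta's season poorly; it does not replace a model that accounts for who was played and where.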
We use a choice model for predicting the outcome of each game. The parameters of the model depend on the teams involved, on which team is playing at home, and on time. Markov chain sampling is used to simulate the outcomes of future games and so predict the eventual division winners. We argue that it is necessary to allow for changing team strengths with time, because some teams appear to change noticeably over a season. We present a variety of predictions of outcomes for the 1991 National League season. One prediction, at a certain point in the season when Atlanta was 2 games behind the Dodgers in number of wins, was that Atlanta had a better chance of winning the division than did the Dodgers. This occurred because Atlanta appeared stronger in the second half of the season and that strength was projected to the remaining games. In general, allowing for changing strengths encourages more conservative probability estimates; that is, the team with the best record has a lower probability of winning under the changing strength model than under a fixed strength model, because it is likely to have lower future strength than its record indicates.
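The general idea of simulating the remaining schedule can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: it assumes a logistic choice probability with one fixed strength per team and a single common home advantage, and it uses plain Monte Carlo simulation rather than Markov chain sampling, so it corresponds to a fixed strength model. All function names, team labels, and numerical values are hypothetical.

```python
import math
import random

def home_win_probability(home_strength, away_strength, home_advantage):
    """Assumed logistic choice probability that the home team wins."""
    x = home_strength - away_strength + home_advantage
    return 1.0 / (1.0 + math.exp(-x))

def simulate_division(current_wins, remaining_games, strengths,
                      home_advantage, n_sims=10000, seed=0):
    """Estimate each division team's chance of finishing with the most wins.

    current_wins:    {team: wins so far} for the teams in one division
    remaining_games: list of (home, away) pairs; opponents may be outside
                     the division but must appear in `strengths`
    Ties for first place are split equally (an assumption of this sketch).
    """
    rng = random.Random(seed)
    teams = list(current_wins)
    titles = {t: 0.0 for t in teams}
    for _ in range(n_sims):
        wins = dict(current_wins)
        for home, away in remaining_games:
            p = home_win_probability(strengths[home], strengths[away],
                                     home_advantage)
            winner = home if rng.random() < p else away
            if winner in wins:  # only division teams are tracked
                wins[winner] += 1
        best = max(wins.values())
        leaders = [t for t in teams if wins[t] == best]
        for t in leaders:
            titles[t] += 1.0 / len(leaders)
    return {t: titles[t] / n_sims for t in teams}

# Toy example with made-up numbers (not taken from Table 1):
# A trails B by 2 wins but is assumed to be the stronger team.
current_wins = {"A": 80, "B": 82}
strengths = {"A": 0.3, "B": 0.0, "C": -0.2}  # C is outside the division
remaining = ([("A", "C")] * 4 + [("C", "B")] * 4
             + [("A", "B")] * 2 + [("B", "A")] * 2)
probs = simulate_division(current_wins, remaining, strengths,
                          home_advantage=0.1, n_sims=5000, seed=1)
print(probs)
```

Allowing strengths to change over time would mean replacing the fixed `strengths` dictionary with values drawn afresh for each simulated season or period, which is where a fixed strength and a changing strength model begin to give different probabilities.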
Of course, other variables influence the probability of winning. In baseball, the starting pitcher has a substantial effect on the final outcome of each game. We judged that this effect would average out in a relatively short period. Starting pitchers work in a regular rotation, so that the total number of wins in a couple of weeks could be predicted from an estimate of the average ability of the team's starting pitchers. Collecting and analyzing the data for the rather large number of starting pitchers would be a formidable task. We suspect that including this variable would change the estimates for particular games quite a bit, and so the predictions as the season draws to a close (with 4 or 5 games remaining for each team) would be substantially affected. Toward the end of the season, you need to know who is pitching.
Point spreads (the differences in scores between the contending teams) were examined for the National Football League by Harville (1980). Harville predicted future point spreads using past point spreads observed over several seasons. The expected point spread in any game is the difference between a parameter for the home team and one for the away team; these parameters are constant within each year but vary from year to year according to an autoregressive process with lag 1. …