Clustering Methods for Real Estate Portfolios
Goetzmann, William N., Wachter, Susan M., Real Estate Economics
William N. Goetzmann [*]
Susan M. Wachter [**]
A clustering algorithm is applied to effective rents for twenty-one metropolitan U.S. office markets, and to twenty-two metropolitan markets using vacancy data. It provides support for the conjecture that there exist a few major "families" of cities, including an oil and gas group and an industrial Northeast group. Unlike other clustering studies, we find strong evidence of bicoastal city associations among cities such as Boston and Los Angeles. We present a bootstrapping methodology for investigating the robustness of the clustering algorithm, and develop a means for testing the significance of city associations. While the analysis is limited to aggregate rent and vacancy data, the results provide a guideline for the further application of cluster analysis to other types of real estate and economic information.
The benefit of geographical diversification across real estate markets is well documented, and has been a guiding feature of portfolio management for some time. See Miles and McCue (1982); Hartzell, Heckman and Miles (1986); and Firstenberg, Ross and Zisler (1987). The reasons for this are not a mystery. As several researchers have pointed out, since fundamentally different economic forces influence the various regions in the United States, real estate values in different regions tend to vary greatly in their behavior, manifesting possibly unique risk factors (see Grissom, Hartzell and Liu 1987). For example, as Hartzell, Schulman and Wurtzebach (1987) hypothesize, and as we demonstrate in this paper, the fortunes of much of the Southwest (from Denver to New Orleans) are related to natural resource extraction, such as oil, gas and mining. Diversification across regions can help to reduce the overall risk of the real estate portfolio.
In this paper, a clustering algorithm is applied to effective rents for twenty-one metropolitan U.S. office markets, and to twenty-two metropolitan markets using vacancy data. It provides support for the conjecture that there exist a few major families of cities, including an oil and gas group and an industrial Northeast group. Unlike other clustering studies, there is strong evidence of bicoastal city associations among cities such as Boston and Los Angeles. We present a new bootstrapping methodology for investigating the robustness of the clustering algorithm, and develop a means for testing the significance of city associations. While the analysis is limited to aggregate rent and vacancy data, the results provide a guideline for the further application of cluster analysis to other types of real estate and economic information. The major benefit of this approach is in reducing the potentially large estimation errors in diversification studies for real estate portfolios.
The Diversification Problem
Mean-variance analysis, developed by Markowitz (1952), has become a widely used method for optimally diversifying real estate portfolios. As originally conceived, mean-variance analysis calculates a set of portfolio weights across assets that results in the highest expected return for each given level of investor risk. While useful, the procedure is not without drawbacks. A number of authors have pointed out potentially serious problems created by estimation error. Michaud (1989) calls the optimization procedure an error-maximizer because it has the unfortunate feature of overweighting the influence of outlying observations. Jorion (1985, 1986) simulates the effect of estimation error on the composition of portfolios on the efficient frontier by optimizing randomly generated finite-sample means, standard deviations and correlations, rather than population values. His analysis, and similar studies by Broadie (1991), show that seemingly efficient portfolios, determined from historical estimates of means and covariances, can lie far from the actual efficient frontier. While the standard error of most statistics decreases in the number of observations, the standard error of the efficient frontier does not. As the number of assets increases, the reliability of the composition of the frontier portfolios decreases. The problem, as Best and Grauer (1991) point out, is that the optimizer selects assets with the highest sample means. The results of using mean-variance analysis with even slightly misspecified means appear contradictory to the principal intention of the optimizer. It tends to undiversify investor portfolios.
The problem of estimation error in mean-variance optimization is particularly acute in real estate portfolio management because mean returns are difficult to estimate with accuracy. Disaggregated return series are largely unavailable for any length of time. In addition, mean returns for commercial properties are typically based upon appraisals rather than transactions. These data limitations imply that mean-variance analysis, if improperly applied, can be detrimental to the goals of real estate portfolio diversification.
One way to reduce estimation error is to gather more data. Unfortunately, for real estate analysts, this may not be possible, except in an approximate sense. Our approach to reducing the estimation error of individual series is to group series together. As we show, grouping has effects similar to gathering more data when estimating mean returns. As such, it reduces the primary source of misallocation in mean-variance optimization studies.
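The analogy between grouping and gathering more data can be illustrated with a stylized calculation. It is only a sketch under assumptions that are ours and not part of the analysis above: the n series in a group share a common true mean, each period's return has variance \sigma^2, returns are serially uncorrelated over T periods, and any two series in the group have the same contemporaneous correlation \rho.

```latex
\[
\operatorname{Var}(\bar r_i) = \frac{\sigma^2}{T},
\qquad
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n}\bar r_i\right)
  = \frac{\sigma^2}{T}\cdot\frac{1+(n-1)\rho}{n}.
\]
```

Here \bar r_i is the sample mean of series i. With \rho = 0 the group mean is as precise as a single series observed for nT periods. As \rho approaches 1 the gain disappears, so under these assumptions the benefit of grouping depends on how much independent information the members of a cluster contribute.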
Relation to Previous Research
The clustering procedures explored in this paper address the problem of estimation error in mean-variance analysis by aggregating cross-sectional asset series according to endogenous time-series characteristics. This is not the only basis for grouping cities, however. Indeed, many aggregation studies in real estate can be seen as approaches to the development of meaningful clusters. These studies, in turn, aid in the problem of geographical diversification. For instance, Hartzell, Schulman and Wurtzebach (1987); Hartzell, Heckman and Miles (1986); Schulman and Hopkins (1988); and Corgel and Gay (1987) use a variety of econometric methods to identify fundamental groupings of regions. Cole, Guilkey, Miles and Webb (1989) propose an ingenious grouping scheme based upon economic characteristics. They show that properties tend to aggregate into groups characterized by such terms as Tomorrow-land, which refers to Los Angeles-based properties and the associated lifestyle of the region's inhabitants. They suggest that economic sector-based clusters may provide a useful means to diversify within real estate. Research firms such as the Frank Russell Company have long divided commercial property returns into four groups: East, South, Midwest and West. Recently, Malizia and Simons (1991) have studied the efficacy of the Salomon Brothers classification scheme, by which cities are grouped into eight economically meaningful clusters: Mineral Extraction, Farm Belt, Industrial Mid-West, New England, Mid-Atlantic Corridor, South, Southern California and Northern California. They find that demand-side variables, such as employment, personal income, and population growth, support the usefulness of the Salomon Brothers classification. All of these studies are based upon the need to reduce the uncertainty of real estate portfolio returns by allocation across relatively uncorrelated groups.
The clustering procedure proposed in this paper can be thought of as another method for discriminating among different groups of cities. As such, it may be applied to fundamental variables as well. For instance, while our data in this paper are confined to rents and vacancies, clustering procedures may profitably be applied to complex multi-variate data-sets including employment sector data.
Relationship to Econometric Models
One major tool for forecasting returns to investment in commercial real estate in a given metropolitan area is a formal econometric model of the inter-related time-series dynamics of rents, vacancies, employment and other factors. This approach may also be understood in the context of reducing estimation error in the mean-variance problem. Because mean-variance relies upon forecasts of future means, econometric models may provide better inputs than simple arithmetic averages of past returns. Because of the relative brevity of real estate data-sets, forecasting models are typically based upon panel data-sets, which require the model coefficients to be equal across cities. This results in a parsimonious function, mapping current and past economic conditions into future rents or other variables of interest. While time-series models reduce the dimensionality of lagged relationships among city-specific economic variables, they generate forecasts for each city. They reduce input estimation error by taking advantage of forecastability, rather than by aggregation. In a sense, the clustering analysis we employ does exactly the opposite. It imposes no reduced-form model on the panel, but projects the forecasts to a lower dimension, to reduce error. In principle, there is no reason both techniques cannot be used. Clustering can be performed in the space of estimated model parameters to further refine forecasts. Alternatively, one could use the econometric model as a test of relevant variables to include in the clustering algorithm. For instance, in principle, one could use nearly any city-specific variable as the basis for clustering; however, an econometric model helps determine which of these variables is useful in forecasting returns. The guiding principle in such an application is to identify a few variables that are significant indicators of future returns.
Work by Fisher and Webb (1994) confirms that rents are very strongly associated with returns; however, there are likely to be other variables that are useful for clustering analysis.
A Clustering and Bootstrapping Approach
This research differs from previous work in that we explore the methodological issues confronting cluster analysis of real estate data. In particular, we address a fundamental question, "How can we test the robustness of the classification scheme?" Groupings of cities may be produced by any one of a number of statistical methods, but a real estate portfolio manager needs to know whether such groupings will continue in the future. To answer this question, we employ the iterative relocation algorithm, K-means, to group cities on the basis of percent changes in commercial office rents. As a crosscheck to determine whether alternative data sets may provide similar results, we also apply K-means to changes in vacancies.
Our approach is to extend techniques developed by researchers in the evolutionary sciences to test the repeatability of these clusters. This method relies upon the bootstrap proposed by Efron (1979) to generate a distribution of the sample clusters. In other words, we use simulations to evaluate how wrong the results of the clustering algorithm might be. After all, since classification schemes are simply point estimates, they may only reflect chance historical relationships. We would like to know if two cities that grouped together in the past did so because of a fundamental relationship, or because of a random association. The bootstrap allows us to answer these questions in probabilistic terms. Suppose we observe Boston and New York clustering together in the past. What are the chances that they will do so in the future? How sensitive are the results to the characteristics upon which we perform the analysis? For instance, do clusters based upon vacancies resemble clusters based upon rents? How sensitive are the results to the number of observations used and the number of clusters specified? The bootstrap can be used to address these questions, to provide confidence bands around the associations among cities, and to evaluate the likelihood that past clustering will predict future clustering.
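The idea of bootstrapping a cluster assignment can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the procedure used to produce the results in this paper. We assume that the yearly observations (the variables) are resampled with replacement, by analogy with resampling characters in evolutionary studies. We also assume a plain batch K-means with farthest-first seeding, and the five cities and their rent changes are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Plain batch K-means; seeds are chosen farthest-first (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda g: sq_dist(p, centers[g]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for g in range(k):
            members = [p for p, lab in zip(points, labels) if lab == g]
            if members:  # an empty group keeps its previous center
                centers[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cocluster_frequency(data, k, n_boot=200, seed=0):
    """Share of bootstrap replicates in which each pair of cities co-clusters."""
    rng = random.Random(seed)
    cities = list(data)
    n_years = len(data[cities[0]])
    pairs = [(a, b) for i, a in enumerate(cities) for b in cities[i + 1:]]
    counts = dict.fromkeys(pairs, 0)
    for _ in range(n_boot):
        # Assumption: resample the years (columns) with replacement.
        years = [rng.randrange(n_years) for _ in range(n_years)]
        sample = [[data[c][j] for j in years] for c in cities]
        labels = dict(zip(cities, kmeans(sample, k)))
        for a, b in pairs:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: n / n_boot for pair, n in counts.items()}

# Hypothetical annual percentage rent changes for five cities.
data = {
    "A": [5, 6, 4, 5, 7, 6],
    "B": [6, 5, 5, 4, 6, 7],
    "C": [-5, -6, -4, -7, -5, -6],
    "D": [-6, -4, -5, -6, -7, -5],
    "E": [5, -6, 4, -7, 7, -6],  # resembles A in some years, C in others
}
freq = cocluster_frequency(data, k=2)
print(freq[("A", "B")], freq[("A", "E")], freq[("C", "E")])
```

In this toy sample, cities A and B group together in every replicate, while the group that city E joins depends on which years happen to be drawn. The co-clustering frequency is one simple way to express the strength of an association in probabilistic terms.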
Like most academic researchers, we are limited by the availability of reliable data on real estate rents, as well as supply and demand side variables. Ideally, we would like return data for commercial properties for each of the metropolitan statistical areas we might consider for inclusion in our portfolio. In fact, this is the type of data most stock and bond portfolio managers can access for use in their asset allocation decisions. To date, nothing comparable is available for real estate. Fortunately, cluster analysis can be applied to alternative types of data. To the extent that real estate returns can be explained by fundamental factors such as rents, vacancies and unemployment, clustering on any or all of these factors provides a guide for series aggregation, and consequently for reduction of mean-variance input uncertainty.
In this paper, we report results for two types of data sets, rents and vacancies. Not only do these fundamental factors show distinct groups of cities, but using the bootstrap, we can explicitly report confidence ranges around these results. Future research, with better data, can refine our conclusions. In addition, the bootstrapping method can be applied to property-specific data to verify the existence of meaningful clusters.
For rental data, we examine annual percentage changes in asking and effective rents from twenty-one U.S. cities over the period 1982-1989, made available by the REIS company. The REIS company surveys a broad range of commercial office space management companies in each of the cities and uses this data to estimate a mean asking rent per square foot, as well as other data pertinent to the market: vacancies, new construction, maximum rents and absorption. REIS also estimates effective rents, net of improvements and free rent, for eight of the cities in the sample. We form our own estimate of net rents for the remaining thirteen cities using a linear model of net rents as a function of asking rents and vacancies.
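One way such a linear model could be fitted is sketched below. The specification is an assumption on our part: ordinary least squares with an intercept, estimated on the cities for which effective rents are available and then applied to the remaining cities. All figures and the function names fit_ols and predict are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b_asking, b_vacancy]."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)  # normal equations (X'X) b = X'y

def predict(coefs, asking, vacancy):
    """Estimated effective rent for a city without a reported figure."""
    return coefs[0] + coefs[1] * asking + coefs[2] * vacancy

# Hypothetical observations: (asking rent per sq. ft., vacancy rate in percent).
observed = [(20, 10), (25, 15), (18, 8), (30, 20), (22, 18), (27, 9)]
# Hypothetical effective rents for the same observations.
effective = [15.0, 18.0, 13.8, 21.0, 15.0, 20.8]

coefs = fit_ols(observed, effective)
estimate = predict(coefs, asking=24, vacancy=12)
print(coefs, estimate)
```

The hypothetical effective rents were constructed to lie exactly on a plane, so the fit recovers that plane. With real survey data the fit would be inexact, and the residual error would carry over into the estimated rents.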
The effective rent series are pictured in figure 1. Note that, while rent levels differ markedly across cities, there appear to be groups of cities that follow the same dynamic. The oil-related cities clearly plunged beginning in 1984, while Boston and a few other cities trended upward over that period. This short time series does not allow meaningful time series analysis; however, it does provide some information about the cross-sectional behavior of commercial rents. Each city represents a single observation. For every city, we have eight variables: the percentage change in rent each year. We used percentage changes in rents, because the rent levels themselves simply separate the major cities from the minor cities. By focusing upon innovations in rents, we capture the trends in the data.
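The effect of working with percentage changes rather than levels can be seen in a small sketch. The two cities and their rent levels are hypothetical, and the helper name pct_changes is ours.

```python
def pct_changes(levels):
    """Year-over-year percentage changes for one city's rent series."""
    return [100.0 * (b - a) / a for a, b in zip(levels, levels[1:])]

# Hypothetical rent levels in dollars per square foot.
rents = {
    "Major city": [20.0, 22.0, 24.2],
    "Minor city": [10.0, 11.0, 12.1],
}
changes = {city: pct_changes(series) for city, series in rents.items()}
print(changes)
```

The two cities differ by a factor of two in rent levels, yet both rise by about ten percent a year. Clustering on levels would separate them, while clustering on changes would treat them as following the same dynamic.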
We obtained a data-set for vacancies from the Wharton Econometrics database (WEFA). It only partially overlaps the REIS rents; however, it is based upon quarterly observations. These provide an independent sample to compare to the results obtained from the REIS data. We used the data for twenty-two cities over the period 1981-1990 as the basis for analysis. Several of the cities represented in the REIS data-set also appear in the WEFA data-set, and this overlap is useful, since we would like to understand the effect of differing samples on the cluster analysis. As we demonstrate, vacancy data present their own challenge to analysis.
Clusters of Commercial Rents
The Appendix provides details regarding the clustering procedure and its relationship to mean-variance estimation problems. We use the K-means algorithm (see Hartigan 1975), a general iterative relocation algorithm. It searches for a partition of a given set of multivariate observations into K groups. For the REIS data, each observation is a city, and each variable is the change in rent over one year. The algorithm requires that we prespecify the number of groups, and give initial values for the center of each group. Customarily, one chooses K existing observations as initial centers. The algorithm proceeds by seeking switches of observations between groups that reduce the within-group sum of squared Euclidean distances from each group's geometric center. For instance, if we only had two years of data for each city, the clustering algorithm would group cities on a plane whose axes were rent changes in the first year and rent changes in the second year. Cities with two successive years of positive rent changes would group together, cities with a positive change the first year and a negative change the second year would group together, and so on.
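The two-year example can be made concrete with a naive relocation sketch. It is not Hartigan's implementation. It simply moves one observation at a time to whichever group most reduces the total within-group sum of squares, and stops when no single move helps. The six cities, their rent changes and the function names are hypothetical.

```python
def centroid(points):
    """Geometric center of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def within_ss(points, labels, k):
    """Total within-group sum of squared Euclidean distances to group centers."""
    total = 0.0
    for g in range(k):
        members = [p for p, lab in zip(points, labels) if lab == g]
        if members:
            c = centroid(members)
            total += sum((x - m) ** 2 for p in members for x, m in zip(p, c))
    return total

def kmeans_relocate(points, seeds):
    """Relocate single observations while the within-group SS keeps falling."""
    k = len(seeds)
    # Initial partition: each observation joins its nearest seed.
    labels = [min(range(k),
                  key=lambda g: sum((x - s) ** 2 for x, s in zip(p, seeds[g])))
              for p in points]
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            current = labels[i]
            if labels.count(current) == 1:
                continue  # do not empty a group
            best_g, best_ss = current, within_ss(points, labels, k)
            for g in range(k):
                if g == current:
                    continue
                labels[i] = g  # trial move
                ss = within_ss(points, labels, k)
                if ss < best_ss - 1e-12:
                    best_g, best_ss = g, ss
            labels[i] = best_g  # keep the best move, or restore the original
            if best_g != current:
                improved = True
    return labels

# Hypothetical (year 1, year 2) percentage rent changes for six cities.
points = [[5, 4], [6, 5], [-8, -6], [-7, -9], [4, -5], [5, -6]]
seeds = [points[0], points[2], points[4]]  # K existing observations as centers
labels = kmeans_relocate(points, seeds)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

The first pair of cities rose in both years, the second pair fell in both, and the third pair rose and then fell, so each pair forms its own group. Recomputing the full sum of squares for every trial move is wasteful, but it keeps the relocation criterion explicit.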
Among the drawbacks of K-means is that, like all clustering algorithms, it is not guaranteed to find a global optimum, nor is there a consistent method for identifying the best starting values or the optimal number of groups. Although the choice of the number of groups is not endogenous to the model, we suspect that there are more than two major clusters. Consequently, we initiate the algorithm with at least three groups in the following analysis. For comparison to the Frank Russell classification into East, Midwest, South and West, we also perform the analysis with four groups. For comparison to the Salomon Brothers classification (Malizia and Simons 1991), we perform the analysis with seven groups.
What results from the K-means analysis is a list of the cities, each with a code identifying it with one of the resulting groups. Table 1 reports these clusters for asking and effective rents for each of the three K-means runs. There are some differences between the asking and effective rent groupings. Dallas, for instance, does not group with the oil-related cities when asking rents are used. Austin groups with Dallas when asking rents are used, but not when effective rents are used. Besides these two switches by Texas cities, however, the first two columns of Table 1 are consistent with each other. Perhaps the ambiguity of the Austin and Dallas associations is driven by the fact that the economies of both cities are diversified. The second column of Table 1 contains the codes for effective rents. Notice that Chicago, Washington, Detroit, New York, Philadelphia, Baltimore, Kansas City, Miami, Austin, Memphis and Orlando are coded with the number 1, and form the largest single group.
This first group contains many of the industrial Northeast cities, but also a few sun-belt cities. Houston, Dallas, Denver, New Orleans and Oklahoma City are coded as 2, and form one oil-related group. Los Angeles, Phoenix, Atlanta and Boston form the third group. The interpretation of this third group is difficult. If anything, it appears to include several cities with a concentration in the high technology industry. To check whether the association could be an artifact of a particular time period, we divided the sample into two sub-periods: 1981-1984 and 1985-