An autoregressive approach to house price modeling - arXiv


The Annals of Applied Statistics 2011, Vol. 5, No. 1, 124–149 DOI: 10.1214/10-AOAS380 In the Public Domain arXiv:1104.2719v1 [stat.AP] 14 Apr 2011 A...


The Annals of Applied Statistics 2011, Vol. 5, No. 1, 124–149 DOI: 10.1214/10-AOAS380 In the Public Domain

arXiv:1104.2719v1 [stat.AP] 14 Apr 2011

AN AUTOREGRESSIVE APPROACH TO HOUSE PRICE MODELING [1]

By Chaitra H. Nagaraja, Lawrence D. Brown [2] and Linda H. Zhao

US Census Bureau, University of Pennsylvania and University of Pennsylvania

A statistical model for predicting individual house prices and constructing a house price index is proposed utilizing information regarding sale price, time of sale and location (ZIP code). This model is composed of a fixed time effect and a random ZIP (postal) code effect combined with an autoregressive component. The former two components are applied to all home sales, while the latter is applied only to homes sold repeatedly. The time effect can be converted into a house price index. To evaluate the proposed model and the resulting index, single-family home sales for twenty US metropolitan areas from July 1985 through September 2004 are analyzed. The model is shown to have better predictive abilities than the benchmark S&P/Case–Shiller model, which is a repeat sales model, and a conventional mixed effects model. Finally, Los Angeles, CA, is used to illustrate a historical housing market downturn.

Received April 2009; revised June 2010.
[1] Disclaimer: This report is released to inform interested parties of research and to encourage discussion. The views expressed on statistical issues are those of the authors and not necessarily those of the US Census Bureau.
[2] Supported in part by NSF Grant DMS-07-07033.
Key words and phrases. Housing index, time series, repeat sales.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2011, Vol. 5, No. 1, 124–149. This reprint differs from the original in pagination and typographic detail.

1. Introduction. Modeling house prices presents a unique set of challenges. Houses are distinctive: each has its own set of hedonic characteristics, such as number of bedrooms, square footage, location and amenities. Moreover, the price of a house, or the value of the bundle of characteristics, is observed only when it is sold. Sales, however, occur infrequently. As a result, during any period of time, only a small percentage of the entire population of homes is actually sold. From this information, our objective is to develop a practical model to predict prices from which we can construct a price index. Such an index would summarize the housing market and would be
used to monitor changes over time. Including both objectives allows one to look at both micro and macro features of a market, from individual houses to entire markets. In the following discussion, we propose an autoregressive model which is a simple, but effective and interpretable, way to model house prices and construct an index. We show that our model outperforms, in a predictive sense, the benchmark S&P/Case–Shiller Home Price Index method when applied to housing data for twenty US metropolitan areas. We use these results to evaluate the proposed autoregressive model as well as the resulting house price index.

A common approach for modeling house prices, called repeat sales, utilizes homes that sell multiple times to track market trends. Bailey, Muth and Nourse (1963) first proposed this method and Case and Shiller (1987, 1989) extended it to incorporate heteroscedastic errors. In both models, the log price difference between two successive sales of a home is used to construct an index using linear regression. The previous sale price acts as a surrogate for hedonic information, provided the home does not change substantially between sales.

There is a large body of work focused on improving the index estimates produced by the Bailey et al. approach. For instance, a modified form of the repeat sales model is used for the Home Price Index produced by the Office of Federal Housing Enterprise Oversight (OFHEO). Gatzlaff and Haurin (1997) suggest a repeat sales model that corrects for the correlation between economic conditions and the chance of a sale occurring. Shiller (1991) and Goetzmann and Peng (2002) propose arithmetic average versions of the repeat sales estimator as an alternative to the original geometric average estimator. The former work is used commercially by Standard & Poor's to produce the S&P/Case–Shiller Home Price Index. We use this index in our analysis because it is the best known.
Several criticisms have been made about repeat sales methods. Theoretically, for a house to be included in a repeat sales analysis, no changes must have been made to it; in practice, however, that is almost never the case. Furthermore, Englund, Quigley and Redfearn (1999) and Goetzmann and Spiegel (1995) have commented on the difficulty of detecting such changes without additional information about the home. Goetzmann and Spiegel do, however, propose an alternative model which corrects for the effect of changes to homes around the time the house is sold. Even if homes which have changed are removed from the data set, an index constructed from the remaining homes may still not reflect the true index value. Case and Quigley (1991) argue that houses age, which depreciates their price. Therefore, as Case, Pollakowski and Wachter (1991) write, repeat sales indices produce estimates of time effects confounded with age effects. Palmquist (1982) has suggested applying an independently computed depreciation factor to account for the impact of age.



In a sample period, only a small fraction of the entire population of homes is actually sold. A fraction of these sales are repeat sales homes with no significant changes. Recall that the remaining sales, those of the single sales homes, are omitted from the analysis. If repeat sales indices are used to describe the housing market as a whole, one would like the sample of repeat sales homes to have similar characteristics to all homes. If not, Case, Pollakowski and Wachter remark that the indices would be affected by sample selection bias. Englund, Quigley and Redfearn, in a study of Swedish home sales, and Meese and Wallace (1997), in a study of Oakland and Fremont home sales, both found that repeat sales homes are indeed different from single sale homes. Both studies also observed that, in addition to being older, repeat sales homes were smaller and more “modest” [Englund, Quigley and Redfearn (1999)]. Therefore, repeat sales indices seem to provide information only about a very specific type of home and may not apply to the entire housing market. However, published indices do not seem to be interpreted in that manner.

Case and Quigley (1991) propose an alternative hybrid model that combines repeat sales methodology with hedonic information and makes use of all sales. While the index constructed with this method represents all home sales, it requires housing characteristics which may be difficult to collect on a broad scale.

We feel the repeat sales concept is valuable although the current models of this type have the issues described above. The proposed model applies the repeat sales idea in a new way to address some of the criticisms while still maintaining the simplicity and reduced data requirements of the original Bailey et al. method. While our primary goal is prediction, we believe the resulting index could be a better general description of housing sales than traditional repeat sales methodology.
In our method, log prices are modeled as the sum of a time effect (index), a location effect modeled as a random effect for ZIP (postal) code, and an underlying first-order autoregressive time series [AR(1)]. This structure offers four advantages. First, the price index is estimated with all sales: single and repeat. Essentially, the index can be thought of as a weighted sum of price information from single and repeat sales. The latter component receives a much higher weight because more useful information is available for those homes. Second, the previous sale price becomes less useful the longer it has been since the last sale. The AR(1) series builds this feature into the model more directly than the Case–Shiller method. Third, metropolitan areas are diverse and neighborhoods may have disparate trends. We include ZIP code effects to model these differences in location [3]. Finally, the proposed model is straightforward to interpret even while including the features described above. We believe the model captures trends in the overall housing market better than existing repeat sales methods and is a practical alternative.

[3] ZIP code was readily available in our data; other geographic variables at roughly this scale might have been even more useful had they been available.

We apply this model to data on single family home sales from July 1985 through September 2004 for twenty US metropolitan areas. These data are described in Section 2. The autoregressive model is outlined and estimation using maximum likelihood is described in Section 3; results are discussed in Section 4. For comparison, two alternative models are fit: a conventional mixed effects model and the method used in the S&P/Case–Shiller Home Price Index. As a quantitative way to compare the indices, the predictive capacity of the three methods is assessed in Section 5. In Section 6 we examine the case of Los Angeles, CA, where the proposed model does not perform as well. We end with a general discussion in Section 7.

2. House price data. The data comprise single family home sales qualifying for conventional mortgages from the twenty US metropolitan areas listed in Table 1. These sales occurred between July 1985 and September 2004. Not included in these data are homes with prices too high to be considered for a conventional mortgage or those sold at subprime rates. Note, however, that subprime loans were not prevalent during the time period covered by our data. Similar data are used by Fannie Mae, Freddie Mac, and to construct the OFHEO Home Price Index.

For each sale, the following information is available: address with ZIP code, month and year of sale, and price. To ensure adequate data per time period, we divide the sample period into three-month intervals for a total of 77 periods, or quarters. We attempt to remove sales which are not arm’s length by omitting homes sold more than once in a single quarter. Given the lack of hedonic information, we have no way of determining whether a house has changed substantially between sales.
Therefore, we do not filter our data to remove such houses. Table 2 displays the number of sales and unique houses sold in the sample period for a selection of cities. Complete tables for all summaries in this section are provided in Appendix A. Observe that the total number of sales is always greater than the number of houses because houses can sell multiple times (repeat sales). Perhaps more illuminating is Table 3, where we count the number of times each house is sold. We see that as the number of sales per house increases, the number of houses falls rapidly. Nevertheless, a significant number of houses sell more than twice. With a sample period of nearly twenty years, this is not unusual; however, single sales are the most common despite the long sample period. The first column of Table 3 shows this clearly. Moreover, this pattern holds for all cities in our data. Finally, in Figure 1, we plot the median price across time for the subset of cities. This graph illustrates that both the cost of homes and the trends over time vary considerably across cities.

Table 1
Metropolitan areas in the data

Ann Arbor, MI     Kansas City, MO    Minneapolis, MN     Raleigh, NC
Atlanta, GA       Lexington, KY      Orlando, FL         San Francisco, CA
Chicago, IL       Los Angeles, CA    Philadelphia, PA    Seattle, WA
Columbia, SC      Madison, WI        Phoenix, AZ         Sioux Falls, SD
Columbus, OH      Memphis, TN        Pittsburgh, PA      Stamford, CT

Table 2
Summary counts for a selection of cities

Metropolitan area    No. of sales    No. of houses
Stamford, CT               14,602           11,128
Ann Arbor, MI              68,684           48,522
Pittsburgh, PA            104,544           73,871
Los Angeles, CA           543,071          395,061
Chicago, IL               688,468          483,581

Table 3
Sale frequencies for a selection of cities

Metropolitan area     1 sale    2 sales    3 sales    4+ sales
Stamford, CT           8,200      2,502        357          62
Ann Arbor, MI         32,458     12,662      2,781         621
Pittsburgh, PA        48,618     20,768      3,749         718
Los Angeles, CA      272,258    100,918     18,965       2,903
Chicago, IL          319,340    130,234     28,369       5,603

Fig. 1. Median prices for a selection of cities.

For all metropolitan areas in our data, the time of a sale is fuzzy, as there is often a lag (around 20–60 days) between the day when the price is agreed upon and the day the sale is recorded. Theoretically, the true value of the house would have changed between these two points. Therefore, in the strictest sense, the sale price of the house does not reflect the price at the time when the sale is recorded. Dividing the year into quarters reduces the importance of this lag effect.

3. Model. The log house price series is modeled as the sum of an index component, an effect for ZIP code (as an indicator for location), and an AR(1) time series. The sale prices of a particular house are treated as a series of sales: $y_{i,1,z}, y_{i,2,z}, \ldots, y_{i,j,z}, \ldots$, where $y_{i,j,z}$ is the log sale price of
the $j$th sale of the $i$th house in ZIP code $z$. Note that $y_{i,1,z}$ is defined as the first sale price in the sample period; as a result, both new homes and old homes sold for the first time in the sample period are indicated with the same notation. Let there be $1, \ldots, T$ discrete time periods where house sales occur. Let $t(i,j,z)$ denote the time period when the $j$th sale of the $i$th house in ZIP code $z$ occurs and let $\gamma(i,j,z) = t(i,j,z) - t(i,j-1,z)$ be the gap time between sales. Finally, there are a total of $N = \sum_{z=1}^{Z} \sum_{i=1}^{I_z} J_i$ observations in the data, where there are $Z$ ZIP codes, $I_z$ houses in each ZIP code and $J_i$ sales for a given house. The log sale price $y_{i,j,z}$ can now be described as follows:

\[
\begin{aligned}
y_{i,1,z} &= \mu + \beta_{t(i,1,z)} + \tau_z + \varepsilon_{i,1,z}, & j &= 1,\\
y_{i,j,z} &= \mu + \beta_{t(i,j,z)} + \tau_z + \phi^{\gamma(i,j,z)}\bigl(y_{i,j-1,z} - \mu - \beta_{t(i,j-1,z)} - \tau_z\bigr) + \varepsilon_{i,j,z}, & j &> 1,
\end{aligned}
\tag{1}
\]

where:

1. The parameter $\beta_{t(i,j,z)}$ is the log price index at time $t(i,j,z)$. Let $\beta_1, \ldots, \beta_T$ denote the log price indices, assumed to be fixed effects.
2. $\phi$ is the autoregressive coefficient and $|\phi| < 1$.
3. $\tau_z$ is the random effect for ZIP code $z$: $\tau_z \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\tau^2)$, where $\tau_1, \ldots, \tau_Z$ are the ZIP code random effects, which are distributed normally with mean 0 and variance $\sigma_\tau^2$, and where i.i.d. denotes independent and identically distributed.
4. We impose the restriction that $\sum_{t=1}^{T} n_t \beta_t = 0$ where $n_t$ is the number of sales at time $t$. This allows us to interpret $\mu$ as an overall mean.
5. Finally, let

\[
\varepsilon_{i,1,z} \sim N\!\left(0, \frac{\sigma_\varepsilon^2}{1-\phi^2}\right), \qquad
\varepsilon_{i,j,z} \sim N\!\left(0, \frac{\sigma_\varepsilon^2\bigl(1-\phi^{2\gamma(i,j,z)}\bigr)}{1-\phi^2}\right),
\]

and assume that all $\varepsilon_{i,j,z}$ are independent.

Note that there is only one process for the series $y_{i,1,z}, y_{i,2,z}, \ldots$. The error variance for the first sale, $\sigma_\varepsilon^2/(1-\phi^2)$, is a marginal variance. For subsequent sales, because we have information about previous sales, it is appropriate to use the conditional variance (conditional on the previous sale), $\sigma_\varepsilon^2(1-\phi^{2\gamma(i,j,z)})/(1-\phi^2)$, instead. For more details refer to the supplemental article [Nagaraja, Brown and Zhao (2010)].

The underlying series for each house is given by $u_{i,j,z} = y_{i,j,z} - \mu - \beta_{t(i,j,z)} - \tau_z$. We can rewrite this series as $u_{i,j,z} = \phi^{\gamma(i,j,z)} u_{i,j-1,z} + \varepsilon_{i,j,z}$ where $\varepsilon_{i,j,z}$ is as given above. This autoregressive series is stationary, given a starting observation $u_{i,1,z}$, because $E[u_{i,j,z}] = 0$, a constant, where $E[\cdot]$ is the expectation function, and the covariance between two points depends only on the gap time and not on the actual sale times. Specifically, $\mathrm{Cov}(u_{i,j,z}, u_{i,j',z}) = \sigma_\varepsilon^2 \phi^{t(i,j',z)-t(i,j,z)}/(1-\phi^2)$ if $j < j'$. Consequently, the time of sale is uninformative for the underlying series; only the gap time is required. As a result, the autoregressive series $u_{i,j,z}$, where $i$ and $z$ are fixed and $j \ge 1$, is a Markov process.

The autoregressive component adds two important features to the model. Intuitively, the longer the gap time between sales, the less useful the previous price should become when predicting the next sale price. For the model described in (1), as the gap time increases, the autoregressive coefficient $\phi^{\gamma(i,j,z)}$ decreases by construction, meaning that sale prices of a home with long gap times are less correlated with each other. (See Remark 3.1 at the end of this section for additional discussion on the form of $\phi$.) Moreover, as the gap time increases, the variance of the error term increases.
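The two effects of gap time described above can be tabulated directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical and the values of `phi` and `sigma2_eps` are made up for illustration.

```python
from typing import Optional


def pair_correlation(phi: float, gap: int) -> float:
    """Model-implied correlation between two sales `gap` periods apart: phi**gap."""
    return phi ** gap


def error_variance(phi: float, sigma2_eps: float, gap: Optional[int] = None) -> float:
    """Error variance in model (1).

    gap=None  -> first sale (j = 1): marginal variance sigma2_eps / (1 - phi^2)
    gap=g     -> repeat sale (j > 1): conditional variance, scaled by (1 - phi^(2g))
    """
    marginal = sigma2_eps / (1.0 - phi ** 2)
    if gap is None:
        return marginal
    return marginal * (1.0 - phi ** (2 * gap))


# Hypothetical parameter values, chosen only to illustrate the pattern.
phi, sigma2_eps = 0.99, 0.0015
for gap in (1, 4, 20, 40):
    print(gap,
          round(pair_correlation(phi, gap), 4),
          round(error_variance(phi, sigma2_eps, gap), 5))
```

As the gap grows, the printed correlation falls and the conditional variance rises toward the marginal variance, which is the behavior the paragraph above describes.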
This indicates that the information contained in the previous sale price is less useful as the time between sales grows.

To fit the model, we formulate the autoregressive model in (1) in matrix form:

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\tau} + \boldsymbol{\varepsilon}^{*},
\tag{2}
\]

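where $\mathbf{y}$ is the vector of log prices and $\mathbf{X}$ and $\mathbf{Z}$ are the design matrices for the fixed effects $\boldsymbol{\beta} = [\mu\ \beta_1 \cdots \beta_{T-1}]'$ and random effects $\boldsymbol{\tau}$, respectively. Then, the log price can be modeled as a mixed effects model with autocorrelated errors, $\boldsymbol{\varepsilon}^{*}$, and with covariance matrix $\mathbf{V}$.

We apply a transformation matrix $\mathbf{T}$ to the model in (2) to simplify the computations; essentially, this matrix applies the autoregressive component of the model to both sides of (2). It is an $N \times N$ matrix and is defined as follows. Let $t_{(i,j,z),(i',j',z')}$ be the cell corresponding to the $(i,j,z)$th row and $(i',j',z')$th column. Then,

\[
t_{(i,j,z),(i',j',z')} =
\begin{cases}
1, & \text{if } i = i',\ j = j',\ z = z',\\
-\phi^{\gamma(i,j,z)}, & \text{if } i = i',\ j = j' + 1,\ z = z',\\
0, & \text{otherwise.}
\end{cases}
\tag{3}
\]

As a result, $\mathbf{T}\boldsymbol{\varepsilon}^{*} \sim N\bigl(\mathbf{0}, \frac{\sigma_\varepsilon^2}{1-\phi^2}\,\mathrm{diag}(\mathbf{r})\bigr)$ where $\mathrm{diag}(\mathbf{r})$ is a diagonal matrix of dimension $N$ with the diagonal elements $\mathbf{r}$ being given by

\[
r_{i,j,z} =
\begin{cases}
1, & \text{when } j = 1,\\
1 - \phi^{2\gamma(i,j,z)}, & \text{when } j > 1.
\end{cases}
\tag{4}
\]

Using the notation from (1), let $\boldsymbol{\varepsilon} = \mathbf{T}\boldsymbol{\varepsilon}^{*}$. Finally, we restrict $\sum_{t=1}^{T} n_t \beta_t = 0$ where $n_t$ is the number of sales at time $t$. Therefore, $\beta_T = -\frac{1}{n_T}\sum_{t=1}^{T-1} n_t \beta_t$. The likelihood function for the transformed model is

\[
L(\boldsymbol{\theta}; \mathbf{y}) = (2\pi)^{-N/2}\,|\mathbf{V}|^{-1/2}
\exp\Bigl\{-\tfrac{1}{2}\bigl(\mathbf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\bigr)'\,\mathbf{V}^{-1}\,\bigl(\mathbf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\bigr)\Bigr\},
\tag{5}
\]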

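where $\boldsymbol{\theta} = \{\boldsymbol{\beta}, \sigma_\varepsilon^2, \sigma_\tau^2, \phi\}$ is the vector of parameters, $N$ is the total number of observations, $\mathbf{V}$ is the covariance matrix, and $\mathbf{T}$ is the transformation matrix. We can split $\mathbf{V}$ into a sum of the variance contributions from the time series and the random effects. Specifically,

\[
\mathbf{V} = \frac{\sigma_\varepsilon^2}{1-\phi^2}\,\mathrm{diag}(\mathbf{r}) + (\mathbf{T}\mathbf{Z})\,\mathbf{D}\,(\mathbf{T}\mathbf{Z})',
\tag{6}
\]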

where $\mathbf{D} = \sigma_\tau^2 \mathbf{I}_Z$ and $\mathbf{I}_Z$ is an identity matrix with dimension $Z \times Z$.

We use the coordinate ascent algorithm to compute the maximum likelihood estimates (MLE) of $\boldsymbol{\theta}$ for the model in (1). This iterative procedure maximizes the likelihood function with respect to each group of parameters while holding all other parameters constant. The algorithm terminates when the parameter estimates have converged according to the specified stopping rule. Bickel and Doksum (2001) include a proof showing that, for models in the exponential family, the estimates computed using the coordinate ascent algorithm converge to the MLE. The proposed model, however, is a member of the differentiable exponential family; therefore, as Brown (1986) states, the proof does not directly apply. Nonetheless, we find empirically that the likelihood function is well behaved, so the MLE appears to be reached for this case as well. Empirical evidence of convergence can be found in the supplemental article [Nagaraja, Brown and Zhao (2010)]. We outline Algorithm 1 below. The equations for updating the parameters and random effects estimates are given in Appendix B.

To predict a log price, we substitute the estimated parameters and random effects into (1):

\[
\hat{y}_{i,j,z} = \hat{\mu} + \hat{\beta}_{t(i,j,z)} + \hat{\tau}_z + \hat{\phi}^{\gamma(i,j,z)}\bigl(y_{i,j-1,z} - \hat{\mu} - \hat{\beta}_{t(i,j-1,z)} - \hat{\tau}_z\bigr).
\tag{7}
\]
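Equation (7) amounts to a baseline (overall mean, index and ZIP effect) plus a discounted share of the previous sale's residual. A minimal sketch follows; the function name, the 0-based period indexing and all numbers are assumptions made for illustration. The branch for a house with no previous sale uses the mean of the first line of (1), which (7) itself does not cover.

```python
def predict_log_price(mu, beta, tau_z, phi, t_now, t_prev=None, y_prev=None):
    """Predicted log price for a sale in period t_now.

    beta is a sequence of estimated log index values, indexed by period
    (0-based here, which is an assumption of this sketch).
    """
    baseline = mu + beta[t_now] + tau_z
    if y_prev is None:
        # No earlier sale in the sample: fall back to the mean of the first line of (1).
        return baseline
    gap = t_now - t_prev
    prev_residual = y_prev - (mu + beta[t_prev] + tau_z)
    return baseline + phi ** gap * prev_residual


# Made-up estimates: the house sold for log price 11.75 in period 0
# and is predicted for period 2.
beta_hat = [0.0, 0.1, 0.2]
pred = predict_log_price(mu=11.5, beta=beta_hat, tau_z=0.05, phi=0.99,
                         t_now=2, t_prev=0, y_prev=11.75)
print(pred)
```

In this example the previous sale was 0.2 above its baseline, so the prediction carries forward 0.99² of that premium.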



We then convert $\hat{y}_{i,j,z}$ to the price scale (denoted as $\hat{Y}_{i,j,z}$) using

\[
\hat{Y}_{i,j,z}(\sigma^2) = \exp\Bigl\{\hat{y}_{i,j,z} + \frac{\sigma^2}{2}\Bigr\},
\tag{8}
\]

where $\sigma^2$ denotes the variance of $y_{i,j,z}$. The additional term $\sigma^2/2$ approximates the difference between $E[\exp\{X\}]$ and $\exp\{E[X]\}$ where $E[\cdot]$ is the expectation function. We must adjust the latter expression to approximate the conditional mean of the response, $y$. We improve the efficiency of our estimates by using the adjustment stated in Shen, Brown and Zhi (2006). In (8), $\sigma^2$ can be estimated from the mean squared residuals (MSR), where $\mathrm{MSR} = \frac{1}{N}\sum_{i=1}^{N}(y_{i,j,z} - \hat{y}_{i,j,z})^2$ and $N$ is the total number of observations used to fit the model. Therefore, the log price estimates, $\hat{y}_{i,j,z}$, are converted to the price scale by

\[
\hat{Y}_{i,j,z} = \exp\Bigl\{\hat{y}_{i,j,z} + \frac{\mathrm{MSR}}{2}\Bigr\}.
\tag{9}
\]
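The back-transformation in (9) can be sketched in a few lines. The helper names and the four log prices below are invented for illustration; only the formula comes from the text.

```python
import math


def mean_squared_residual(y, y_hat):
    """MSR: average squared residual on the log scale over the fitted observations."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def to_price_scale(y_hat_new, msr):
    """Equation (9): exponentiate the log prediction after adding MSR / 2."""
    return math.exp(y_hat_new + msr / 2.0)


# Made-up training log prices and fitted values.
y = [11.90, 12.10, 11.70, 12.40]
y_hat = [11.95, 12.00, 11.80, 12.30]
msr = mean_squared_residual(y, y_hat)

naive = math.exp(12.0)                # plain exponentiation of a log prediction
adjusted = to_price_scale(12.0, msr)  # with the MSR / 2 adjustment
print(msr, naive, adjusted)
```

The adjusted price is always larger than the naive one, by the factor exp(MSR/2).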

Goetzmann (1992) proposes a similar transformation for the index values computed using a traditional repeat sales method. Calhoun (1996) suggests applying Goetzmann’s adjustment when using an index value to predict a particular house price. For the autoregressive model, the standard error of the index is sufficiently small that the efficiency adjustment has a negligible impact on the estimated index. Therefore, we simply use $\exp\{\hat{\beta}_t\}$ to convert the index to the price scale. Finally, we rescale the vector of indices so that the first quarter has an index value of 1.
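As a numerical aside on the transformation matrix in (3) and the factors in (4), the claim that the transformed errors have a diagonal covariance can be checked for a single house. This is a sketch under stated assumptions: it covers only the time series part (no ZIP code effect), the function names are hypothetical, and the sale times and parameter values are made up.

```python
import numpy as np


def ar_covariance(times, phi, sigma2_eps):
    """Covariance of the AR(1) series for one house sold at the given periods."""
    t = np.asarray(times, dtype=float)
    gaps = np.abs(t[:, None] - t[None, :])
    return sigma2_eps / (1.0 - phi ** 2) * phi ** gaps


def transformation_matrix(times, phi):
    """Equation (3) for one house: 1 on the diagonal, -phi**gap just below it."""
    n = len(times)
    T = np.eye(n)
    for j in range(1, n):
        T[j, j - 1] = -phi ** (times[j] - times[j - 1])
    return T


def r_vector(times, phi):
    """Equation (4): 1 for the first sale, 1 - phi**(2 * gap) afterwards."""
    r = np.ones(len(times))
    for j in range(1, len(times)):
        r[j] = 1.0 - phi ** (2 * (times[j] - times[j - 1]))
    return r


times, phi, sigma2_eps = [3, 10, 31], 0.99, 0.0015  # hypothetical house
V_star = ar_covariance(times, phi, sigma2_eps)
T = transformation_matrix(times, phi)
whitened = T @ V_star @ T.T
expected = sigma2_eps / (1.0 - phi ** 2) * np.diag(r_vector(times, phi))
print(np.allclose(whitened, expected))
```

The off-diagonal entries of `whitened` vanish, which is why working with the transformed model simplifies the likelihood computations.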

Remark 3.1. The autoregressive coefficient form, $\phi^{\gamma(i,j,z)}$, deserves further explanation. For each house indexed by $(i,z)$, let $t_1(i,z) = t(i,1,z)$ denote the time of the initial sale. Conditioning on the (unobserved) values of the parameters $\{\mu, \beta_t, \sigma_\varepsilon^2, \sigma_\tau^2\}$ and on the values of the random ZIP code effects, $\{\tau_z\}$, let $\{u_{i,z;t} : t = t_1(i,z), t_1(i,z)+1, \ldots\}$ be an underlying AR(1) process. To be more precise, $u_{i,z;t}$ is a conventional, stationary AR(1) process defined by

\[
u_{i,z;t} =
\begin{cases}
\varepsilon_{i,1,z}, & \text{if } t = t_1(i,z),\\
\phi\,u_{i,z;t-1} + \varepsilon_{i,z;t}, & \text{if } t > t_1(i,z),
\end{cases}
\tag{10}
\]

where if $t = t(i,j,z)$, then $\varepsilon_{i,z;t(i,j,z)} = \varepsilon_{i,j,z}$ and otherwise $\varepsilon_{i,z;t} \overset{\text{i.i.d.}}{\sim} N\bigl(0, \frac{\sigma_\varepsilon^2}{1-\phi^2}\bigr)$. Then the observed log sale prices are given by $\{y_{i,j,z}\}$ where $u_{i,z;t(i,j,z)} = y_{i,j,z} - (\mu + \beta_{t(i,j,z)} + \tau_z)$. The values of $u_{i,z;t}$ are to be interpreted as the potential sale price adjusted by $\{\mu, \beta_t, \sigma_\varepsilon^2, \sigma_\tau^2\}$ of the house indexed by $(i,z)$ if the house were to be sold at time $t$.

For housing data like ours, the value of the autoregressive parameter $\phi$ for this latent process will be near the largest possible value, $\phi = 1$. Consequently, if the underlying process were actually an observed process from which one wanted to estimate $\phi$, then estimation of $\phi$ could be a delicate matter. However, sales generally occur with fairly large gap times and so the values of $\phi^{\gamma(i,j,z)}$ occurring in the data will generally not be close to 1. For that reason, conventional estimation procedures perform satisfactorily when estimating $\phi$. We provide empirical evidence for this in Section 4 and in the supplemental article [Nagaraja, Brown and Zhao (2010)].

Algorithm 1 Autoregressive (AR) model fitting algorithm.
1. Set a tolerance level $\epsilon$ (possibly different for each parameter).
2. Initialize the parameters: $\boldsymbol{\theta}^0 = \{\boldsymbol{\beta}^0, \sigma_\varepsilon^{2,0}, \sigma_\tau^{2,0}, \phi^0\}$.
3. For iteration $k$ ($k = 0$ when the parameters are initialized),
   (a) Calculate $\boldsymbol{\beta}^k$ using (19) in Appendix B with $\{\sigma_\varepsilon^{2,k-1}, \sigma_\tau^{2,k-1}, \phi^{k-1}\}$.
   (b) Compute $\sigma_\varepsilon^{2,k}$ by computing the zero of (20) using $\{\boldsymbol{\beta}^k, \sigma_\tau^{2,k-1}, \phi^{k-1}\}$.
   (c) Compute $\sigma_\tau^{2,k}$ by calculating the zero of (21) using $\{\boldsymbol{\beta}^k, \sigma_\varepsilon^{2,k}, \phi^{k-1}\}$.
   (d) Find the zero of (22) to compute $\phi^k$ using $\{\boldsymbol{\beta}^k, \sigma_\varepsilon^{2,k}, \sigma_\tau^{2,k}\}$.
   (e) If $|\theta_i^{k-1} - \theta_i^k| > \epsilon$ for any $\theta_i \in \boldsymbol{\theta}$, repeat step 3 after replacing $\boldsymbol{\theta}^{k-1}$ with $\boldsymbol{\theta}^k$. Otherwise, stop (call this iteration $K$).
4. Solve for $\beta_T$ by computing $\hat{\beta}_T = -\frac{1}{n_T}\sum_{t=1}^{T-1} n_t \hat{\beta}_t^K$.
5. Plug in $\{\boldsymbol{\beta}^K, \sigma_\varepsilon^{2,K}, \sigma_\tau^{2,K}, \phi^K\}$ to compute the estimated values for $\boldsymbol{\tau}$ using (23).
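The control flow of Algorithm 1 (update one block of parameters at a time, then apply the stopping rule in step 3(e)) can be illustrated on a toy problem. The sketch below does not use the update equations (19)–(22) from Appendix B, which are not reproduced here; it instead maximizes a simple concave function of two variables, chosen for illustration, whose coordinate-wise maximizers have closed forms. All names are hypothetical.

```python
def coordinate_ascent(update_fns, theta0, tol=1e-10, max_iter=1000):
    """Generic coordinate ascent with the stopping rule of step 3(e).

    update_fns[i](theta) returns the maximizer over coordinate i with the
    other coordinates held at their current values.
    """
    theta = list(theta0)
    for k in range(1, max_iter + 1):
        previous = list(theta)
        for i, update in enumerate(update_fns):
            theta[i] = update(theta)  # uses the freshest values of the others
        if all(abs(p - c) <= tol for p, c in zip(previous, theta)):
            return theta, k
    raise RuntimeError("coordinate ascent did not converge")


# Toy objective (not the paper's likelihood):
#   f(a, b) = -(a^2 + b^2 + a*b - 3a - 3b), maximized at a = b = 1.
# Setting df/da = 0 gives a = (3 - b) / 2, and symmetrically for b.
def update_a(theta):
    return (3.0 - theta[1]) / 2.0


def update_b(theta):
    return (3.0 - theta[0]) / 2.0


theta_hat, n_iter = coordinate_ascent([update_a, update_b], [0.0, 0.0])
print(theta_hat, n_iter)
```

In the paper's setting each update would instead solve the corresponding equation from Appendix B for one group of parameters.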

4. Estimation results. To fit and validate the autoregressive (AR) model, we divide the observations for each city into training and test sets. The test set contains all final sales for homes that sell three or more times. Among homes that sell twice, the second sale is added to the test set with probability 1/2. As a result, the test set for each city contains roughly 15% of the sales. The remaining sales (including single sales) comprise the training set. Table 8 in Appendix A lists the training and test set sizes for each city. We fit the model on the training set and examine the estimated parameters. The test set will be used in Section 5 to validate the AR model against two alternatives.

In Table 4, the estimates for the overall mean $\mu$ (on the log scale), the autoregressive parameter $\phi$, the variance of the error term $\sigma_\varepsilon^2$, and the variance of the random effects $\sigma_\tau^2$ are provided for each metropolitan area. As expected, the most expensive cities have the highest values of $\hat{\mu}$: Los Angeles, CA, San Francisco, CA, and Stamford, CT. In Figure 2, the indices for a sample of the twenty cities are provided. There are clearly different trends across cities.

Table 4
Parameter estimates for the AR model

Metropolitan area     $\hat{\mu}$    $\hat{\phi}$    $\hat{\sigma}_\varepsilon^2$    $\hat{\sigma}_\tau^2$
Ann Arbor, MI         11.6643        0.993247        0.001567        0.110454
Atlanta, GA           11.6882        0.992874        0.001651        0.070104
Chicago, IL           11.8226        0.992000        0.001502        0.110683
Columbia, SC          11.3843        0.997526        0.000883        0.028062
Columbus, OH          11.5159        0.994807        0.001264        0.090329
Kansas City, MO       11.4884        0.993734        0.001462        0.121954
Lexington, KY         11.6224        0.996236        0.000968        0.048227
Los Angeles, CA       12.1367        0.981888        0.002174        0.111708
Madison, WI           11.7001        0.994318        0.001120        0.023295
Memphis, TN           11.6572        0.994594        0.001120        0.101298
Minneapolis, MN       11.8327        0.992008        0.001515        0.050961
Orlando, FL           11.6055        0.993561        0.001676        0.046727
Philadelphia, PA      11.7106        0.991767        0.001679        0.183495
Phoenix, AZ           11.7022        0.992349        0.001543        0.106971
Pittsburgh, PA        11.3408        0.992059        0.002546        0.103488
Raleigh, NC           11.7447        0.993828        0.001413        0.047029
San Francisco, CA     12.4236        0.985644        0.001788        0.056201
Seattle, WA           11.9998        0.989923        0.001658        0.039459
Sioux Falls, SD       11.6025        0.995262        0.001120        0.032719
Stamford, CT          12.5345        0.987938        0.002294        0.093230

Fig. 2. The AR index for a selection of cities.

The estimates for the AR model parameter $\phi$ are close to one. This is not surprising as the adjusted log sale prices, $u_{i,j,z}$, for sale pairs with short gap times are expected to be closer in value than those with longer gap times. It may be tempting to assume that since $\phi$ is so close to 1, the prices form a random walk instead of an AR(1) time series (see Remark 3.1). However, this is clearly not the case. Recall that $\phi$ enters the model not by itself but as $\phi^{\gamma(i,j,z)}$ where $\gamma(i,j,z)$ is the gap time. These gap times are high enough that the correlation coefficient $\phi^{\gamma(i,j,z)}$ is considerably lower than 1. The mean gap time across cities is around 22 quarters. As an example, for Ann Arbor, MI, $\hat{\phi}^{22} = 0.993247^{22} \approx 0.8615$, which is clearly less than 1. Therefore, the types of sensitivity often produced as a consequence of near unit roots do not apply to our autoregressive model.

We have modeled the adjusted log prices, $u_{i,j,z} = y_{i,j,z} - \beta_{t(i,j,z)} - \tau_z$, as a latent AR(1) time series. Accordingly, for each gap time, $\gamma(i,j,z) = h$, there is an expected correlation between the sale pairs: $\phi^h$. To check that the data support the theory, we compare the correlation between pairs of quarter-adjusted log prices at each gap length to the correlation predicted by the model. First, we compute the estimated adjusted log prices $\hat{u}_{i,j,z} = y_{i,j,z} - \hat{\beta}_{t(i,j,z)} - \hat{\tau}_z$ for the training data. Next, for each gap time $h$, we find all the sale pairs $(\hat{u}_{i,j-1,z}, \hat{u}_{i,j,z})$ with that particular gap length. The sample correlation between those sale pairs produces an estimate of $\phi$ for gap length $h$. If we repeat this procedure for each possible gap length, we should obtain a steady decrease in the correlation as gap time increases. In particular, the points should follow the curve $\phi^h$ if the model is specified correctly.

In Figure 3, we plot the correlation of the adjusted log prices by gap time for Columbus, OH. Note that the correlations for each gap time were computed from varying numbers of sale pairs. Those computed with fewer than twenty sale pairs are plotted as blue triangles. We also overlay the predicted relationship between $\phi$ and gap time. The inverse relationship between gap time and correlation seems to hold well and we obtain similar results for most cities. One notable exception is Los Angeles, CA, which we discuss in Section 6.

Fig. 3. Checking the AR(1) assumption for Columbus, OH.

5. Model validation. To show that the proposed AR model produces good predictions, we fit the model separately to each of the twenty cities and
apply the fitted models to each test set. For comparison purposes, a mixed effects model along with the benchmark S&P/Case–Shiller model is applied to the data. The former model is a simple, but reasonable, alternative to the AR model. Both models are described below. In addition to the predictions, we compare the price indices and training set residuals. The root mean squared error (RMSE)4 is used to evaluate predictive performance for each city in Section 5.3. We will see that the AR model provides the best predictions. In addition, we will show the results from Columbus, OH as a typical example. 5.1. Mixed effects model. A mixed effects model provides a very simple, but plausible, approach for modeling these data. This model treats the time effect (βt ) as a fixed effect, and the effects of house (αi ) and ZIP code (τz ) are modeled as random effects. There is no time series component to this model. We describe the model as follows: (11)

yi,j,z = µ + αi + τz + βt(i,j,z) + εi,j,z , i.i.d.



where αi ∼ N (0, σα2 ), τz ∼ N (0, στ2 ), and εi,j,z ∼ N (0, σε2 ) for houses i from 1, . . . , Iz , sales j from 1, . . . , Ji , and ZIP codes z from 1, . . . , Z. As before, µ is a fixed parameter and βi,j,z is the fixed effect for time. The estimates for the parameters θ = {µ, β, σε2 , στ2 } are computed using maximum likelihood estimation. Finally, estimates for the random effects α and τ are calculated by iteratively calculating the following:  2 −1 σε ′ ˆ − Zˆ ˆ= α (12) W′ (y − Xβ τ ), II + W W σα2  2 −1 σε ′ ˆ − Wα), ˆ τˆ = (13) Z′ (y − Xβ IZ + Z Z στ2 where X and W are the design matrices for the fixed and random effects respectively and y is the response vector. These expressions are derived using the method of computing BLUP estimators outlined by Henderson (1975). To predict the log price, yˆi,j,z , we substitute the estimated values: yˆi,j,z = µ ˆ + βˆt(i,j,z) + α ˆ i + τˆz .


We use transformation (9) to convert these predictions back to the price scale. Finally, we construct a price index similar to the autoregressive case. Therefore, as in Figure 2, the values of $\exp\{\hat{\beta}_t\}$ are rescaled so that the price index in the first quarter is 1.

⁴ $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (Y_k - \hat{Y}_k)^2}$, where $Y$ is the sale price and $n$ is the test set size.
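For concreteness, the RMSE in the footnote can be computed on the price scale as below. The function name `rmse` and the example prices are illustrative assumptions, not part of the original analysis.

```python
import math

def rmse(actual_prices, predicted_prices):
    """Root mean squared error between sale prices and predictions (dollars)."""
    if len(actual_prices) != len(predicted_prices) or not actual_prices:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2
                      for y, y_hat in zip(actual_prices, predicted_prices)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two hypothetical test-set sales, each mispredicted by $10,000.
example = rmse([100_000, 200_000], [110_000, 190_000])
```

Because the errors are squared before averaging, a few large misses on expensive homes can dominate the value.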



5.2. S&P/Case–Shiller model. The original Case and Shiller (1987, 1989) model is a repeat-sales model which expands upon the Bailey, Muth and Nourse (1963) setting by accounting for heteroscedasticity in the data due to the gap time between sales. Borrowing some of their notation, the framework for their model is

$$y_{i,t} = \beta_t + H_{i,t} + u_{i,t}, \tag{15}$$

where $y_{i,t}$ is the log price of the sale of the $i$th house at time $t$, $\beta_t$ is the log index at time $t$, and $u_{i,t} \overset{\text{i.i.d.}}{\sim} N(0, \sigma_u^2)$. The middle term, $H_{i,t}$, is a Gaussian random walk which incorporates the previous log sale price of the house. Location information, such as ZIP codes, is not included in this model. Like the Bailey, Muth and Nourse setup, the Case and Shiller setting is a model for differences in prices. Thus, the following model is fit:

$$y_{i,t'} - y_{i,t} = \beta_{t'} - \beta_t + \sum_{k=t+1}^{t'} v_{i,k} + u_{i,t'} - u_{i,t}, \tag{16}$$

where $t' > t$. The random walk steps are normally distributed, with $v_{i,k} \overset{\text{i.i.d.}}{\sim} N(0, \sigma_v^2)$. Weighted least squares is used to fit the model to account for both sources of variation.

The S&P/Case–Shiller procedure follows in a similar vein but is fit on the price scale instead of the log price scale. The procedure is similar to the arithmetic index proposed by Shiller (1991), which we will describe next; however, full details are available in the S&P/Case–Shiller® Home Price Indices: Index Methodology (2009) report.

Let there be $S$ sale pairs, consisting of two consecutive sales of the same house, and $T$ time periods. An $S \times (T-1)$ design matrix $\mathbf{X}$, an $S \times (T-1)$ instrumental variables (IV) matrix $\mathbf{Z}$, and an $S \times 1$ response vector $\mathbf{w}$ are defined next. Let the subscripts $s$ and $t$ denote the row and column index respectively. Finally, let $Y_{s,t}$ be the sale price (not log price) of the house in sale pair $s$ at time $t$. Therefore, in each sale pair, there will be two prices $Y_{s,t}$ and $Y_{s,t'}$ where $t \neq t'$. The matrices $\mathbf{X}$, $\mathbf{Z}$ and vector $\mathbf{w}$ are now defined as follows:

$$X_{s,t} = \begin{cases} -Y_{s,t}, & \text{if first sale of pair } s \text{ is at time } t,\ t > 1, \\ Y_{s,t}, & \text{if second sale of pair } s \text{ is at time } t, \\ 0, & \text{otherwise,} \end{cases}$$

$$Z_{s,t} = \begin{cases} -1, & \text{if first sale of pair } s \text{ is at time } t,\ t > 1, \\ 1, & \text{if second sale of pair } s \text{ is at time } t, \\ 0, & \text{otherwise,} \end{cases}$$

$$w_s = \begin{cases} Y_{s,t}, & \text{first sale of pair } s \text{ at time } 1, \\ 0, & \text{otherwise.} \end{cases}$$
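The definitions of $\mathbf{X}$, $\mathbf{Z}$ and $\mathbf{w}$ above can be sketched as follows. The function name `build_repeat_sales_matrices`, the tuple layout of `pairs`, and the three toy sale pairs are assumptions made for illustration; periods are numbered from 1, and column $t - 2$ of each matrix corresponds to period $t \geq 2$.

```python
import numpy as np

def build_repeat_sales_matrices(pairs, n_periods):
    """Build X, Z, w from sale pairs given as
    (t_first, price_first, t_second, price_second), with t_first < t_second."""
    n_pairs = len(pairs)
    X = np.zeros((n_pairs, n_periods - 1))
    Z = np.zeros((n_pairs, n_periods - 1))
    w = np.zeros(n_pairs)
    for s, (t1, p1, t2, p2) in enumerate(pairs):
        if t1 == 1:
            w[s] = p1                 # base-period sale enters the response
        else:
            X[s, t1 - 2] = -p1        # first sale, t > 1
            Z[s, t1 - 2] = -1.0
        X[s, t2 - 2] = p2             # second sale
        Z[s, t2 - 2] = 1.0
    return X, Z, w

# Hypothetical noiseless pairs generated from an index of (1, 1.1, 1.21).
pairs = [(1, 100.0, 2, 110.0), (2, 220.0, 3, 242.0), (1, 300.0, 3, 363.0)]
X, Z, w = build_repeat_sales_matrices(pairs, n_periods=3)

# Unweighted instrumental-variables estimate of the reciprocal indices,
# i.e. the first step of the fitting procedure in the text.
b_hat = np.linalg.solve(Z.T @ X, Z.T @ w)
index_hat = 1.0 / b_hat               # price index for periods 2, ..., T
```

With noiseless toy data the reciprocal of `b_hat` reproduces the generating index, which is a convenient sanity check on the sign conventions.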



The goal is to fit the model $\mathbf{w} = \mathbf{X}\mathbf{b} + \varepsilon$ where $\mathbf{b} = (b_2 \cdots b_T)'$ is the vector of the reciprocal price indices. That is, $B_t = 1/b_t$ is the price index at time $t$. A three-step process is implemented to fit this model. First, $\mathbf{b}$ is estimated using regression with instrumental variables. Second, the residuals from this regression are used to compute weights for each observation. Finally, $\mathbf{b}$ is estimated once more while applying the weights. This process, outlined in full in the S&P/Case–Shiller® Home Price Indices: Index Methodology report, is described below:

1. Estimate $\mathbf{b}$ by running a regression using instrumental variables: $\hat{\mathbf{b}} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{w}$.

2. Calculate the weights for each observation using the squared residuals from the first step. These weights are dependent on the gap time between sales. We denote the residual as $\hat{\varepsilon}_i$, which is an estimate of $u_{i,t'} - u_{i,t} + \sum_{k=1}^{t'-t} v_{i,k}$. The expectation of $\varepsilon_i$ is $E[u_{i,t'} - u_{i,t} + \sum_{k=1}^{t'-t} v_{i,k}] = 0$ and the variance is $\mathrm{Var}[u_{i,t'} - u_{i,t} + \sum_{k=1}^{t'-t} v_{i,k}] = 2\sigma_u^2 + (t' - t)\sigma_v^2$. To compute the weights for each observation, the squared residuals from the first step are regressed against the gap time. That is,

$$\hat{\varepsilon}_i^2 = \underbrace{\alpha_0}_{2\sigma_u^2} + \underbrace{\alpha_1}_{\sigma_v^2} (t' - t) + \eta_i, \tag{17}$$

where $E[\eta_i] = 0$. The reciprocals of the square roots of the fitted values from the above regression are the weights. Using their notation, we denote this weight matrix by $\Omega^{-1}$.

3. The final step is to estimate $\mathbf{b}$ again while incorporating the weights, $\Omega^{-1}$: $\hat{\mathbf{b}} = (\mathbf{Z}'\Omega^{-1}\mathbf{X})^{-1}\mathbf{Z}'\Omega^{-1}\mathbf{w}$.

The indices are simply the reciprocals of each element in $\hat{\mathbf{b}}$ for $t > 1$ and, by construction, $B_1 = 1$. Finally, to estimate the prices in the test set, we simply calculate

$$\hat{Y}_{i,j} = \frac{\hat{B}_{t(i,j)}}{\hat{B}_{t(i,j-1)}} Y_{i,j-1}, \tag{18}$$

where $Y_{i,j}$ is the price of the $j$th sale of the $i$th house and $B_t$ is the price index at time $t$. We do not apply the correction proposed by Goetzmann when estimating prices because it is appropriate only for predictions on the log price scale. The S&P/Case–Shiller method is fit on the price scale so no transformation is required.

5.3. Comparing predictions. We fit all three models on the training sets for each city and predict prices for those homes in the corresponding test set. The RMSE for the test set observations is calculated in dollars for each model in order to compare performance across models. These results are listed in



Table 5. The model with the lowest RMSE value for each city is shown in italicized font. Note that while the S&P/Case–Shiller method produces predictions directly on the price scale, the autoregressive and mixed effects models must be converted back to the price scale using (9). It is clear that the AR model performs better than the S&P/Case–Shiller model for all of the cities, reducing the RMSE by up to 21% in some cases; the AR model produces lower RMSE values when compared to the mixed effects model as well for nearly all cities, San Francisco, CA, being the only exception. Moreover, the AR model performs better under alternate loss functions as well, which we show in the supplemental article [Nagaraja, Brown and Zhao (2010)].

Note that the RMSE value is missing for Kansas City, MO for the S&P/Case–Shiller model. Some of the observation weights calculated in the second step of the procedure were negative, halting the estimation process. This is another drawback to some of the existing repeat sales procedures. Calhoun (1996) suggests replacing the sale specific error $u_{i,t}$ [as given in (16)] with a house specific error $u_i$; however, this fundamentally changes the structure of the error term and, as a result, the fitting process. Furthermore, it is not implemented in the S&P/Case–Shiller methodology. Therefore, we do not apply it to our data.

Table 5
Test set RMSE for three models (in dollars)

Metropolitan area    AR (local)   Mixed effects (local)   S&P/C–S
Ann Arbor, MI            41,401                  46,519    52,718
Atlanta, GA              30,914                  34,912    35,482
Chicago, IL              36,004                       –    42,865
Columbia, SC             35,881                  38,375    42,301
Columbus, OH             27,353                  30,163    30,208
Kansas City, MO          24,179                  25,851         –
Lexington, KY            21,132                  21,555    21,731
Los Angeles, CA          37,438                       –    41,951
Madison, WI              28,035                  30,297    30,640
Memphis, TN              24,588                  25,502    25,267
Minneapolis, MN          31,900                  34,065    34,787
Orlando, FL              28,449                  30,438    30,158
Philadelphia, PA         33,246                       –    35,350
Phoenix, AZ              28,247                  29,286    29,350
Pittsburgh, PA           26,406                  28,630    30,135
Raleigh, NC              25,839                  27,493    26,775
San Francisco, CA        49,927                  48,217    50,249
Seattle, WA              38,469                  41,950    43,486
Sioux Falls, SD          20,160                  21,171    21,577
Stamford, CT             57,722                  58,616    68,132
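As a quick arithmetic check on the "up to 21%" statement, the relative RMSE reduction of the AR model over the S&P/Case–Shiller model can be computed from the Table 5 entries. Only three rows are transcribed here, and the names `rmse_table5` and `pct_reduction` are illustrative.

```python
# (AR RMSE, S&P/Case-Shiller RMSE) in dollars, copied from Table 5.
rmse_table5 = {
    "Ann Arbor, MI": (41_401, 52_718),
    "Columbus, OH": (27_353, 30_208),
    "Stamford, CT": (57_722, 68_132),
}

def pct_reduction(ar_rmse, cs_rmse):
    """Percentage by which the AR RMSE falls below the S&P/Case-Shiller RMSE."""
    return 100.0 * (cs_rmse - ar_rmse) / cs_rmse

reductions = {city: pct_reduction(ar, cs)
              for city, (ar, cs) in rmse_table5.items()}
largest = max(reductions, key=reductions.get)   # city with the biggest gain
```

Among these rows the largest reduction is for Ann Arbor, MI, at roughly 21.5%.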


Fig. 4. Comparing the variance of the residuals for Columbus, OH.

Three values are also missing in Table 5 for the mixed effects model results. For these three cities, the iterative fitting procedure failed to converge. We can attribute this to the size of these data and, more importantly, to the fact that the data do not conform well to the mixed effects model structure.

Next, we will examine several diagnostic plots to assess whether the model assumptions are satisfied for each method. We begin by investigating the variance of the residuals. As the gap time increases, we expect a higher error variance, indicating that the previous price becomes less useful over time. The proposed autoregressive model and the S&P/Case–Shiller model each incorporate this feature differently, using an underlying AR(1) time series and a random walk respectively. The mixed effects model, however, assumes a constant variance regardless of gap time. In Figure 4, for each model, we plot the variance of the predictions by gap time for the training set residuals.⁵ The expected variance at each gap time, computed using the estimated parameters, is then overlaid. The autoregressive and mixed effects models are fit on the log price scale, whereas the S&P/Case–Shiller model is fit on the price scale. Therefore, the residual plots are graphed on very different scales.

⁵ Note that for these three plots, the term “residual” indicates the usual statistical residual values produced by applying the model and comparing the predictions with the response vector. For the AR and mixed effects models, these residuals are identical to the predictions on the log price scale discussed in previous sections; however, for the S&P/C–S model, this is not the case.

There are two features to note here. The first is that heteroscedasticity is clearly present: the variance of the residuals does in fact increase with gap time. The second feature is that while none of the methods perfectly models the heteroscedastic error, the mixed effects model is undoubtedly the worst. This pattern holds across all of the cities in the data set. Both the autoregressive and S&P/Case–Shiller models seem to have lower than expected variances in Figure 4.

For both the AR and mixed effects models, the random effects for ZIP codes are assumed to be normally distributed. As a diagnostic procedure, we construct the normal quantile plots of the ZIP code effects. The results are shown in Figure 5. Columbus, OH has a total of 103 ZIP codes, or random effects. We find the normality assumption appears to be reasonably satisfied for the mixed effects model but less so for the autoregressive model. Note, however, that each random effect is estimated using a different number of sales. This interferes with the routine interpretation of these plots. In particular, the outliers in both plots correspond to ZIP codes containing ten or fewer sales. Across all metropolitan areas, the normality assumption seems to be well satisfied in some cases and not so well in others, but with no clear pattern we could discern as to the type of analysis, size of the data or geographic region. The supplemental article contains results of the Shapiro–Wilk test for normality [Nagaraja, Brown and Zhao (2010)].

Fig. 5. Normality of ZIP code effects for Columbus, OH.

In Figure 6, we plot four indices for Columbus, OH: the AR index, the mixed effects index, the S&P/Case–Shiller index, and the mean price index. The mean index is simply the average sale price at each quarter rescaled so that the first index value is 1. From the plot, we see that the autoregressive index is generally between the S&P/Case–Shiller index and the mean index at each point in time. The mean index treats all sales as single sales. That is, information about repeat sales is not included; in fact, no information about house prices is shared across quarters. The S&P/Case–Shiller index, on the other hand, only includes repeat sales houses. The autoregressive model, because it includes both single sales and repeat sales, is a mixture of the two perspectives. Essentially, the index constructed from the proposed model is a measure of the average house price that places more weight on those homes which have sold more than once.

Fig. 6. House price indices for Columbus, OH.

6. The case of Los Angeles, CA. Even though the autoregressive model has a lower RMSE than the S&P/Case–Shiller model for Los Angeles, CA, it does not seem to fit the data well. If we examine Figure 7, a plot of the correlation against gap time, we immediately see two significant issues when what is expected (line) is compared with what the data indicate (dots). First, the value of $\phi$ is not as close to 1 as expected. Second, the pattern of decay, $\phi^{\gamma(i,j,z)}$, also does not follow the presumed pattern. We will focus on Los Angeles, CA, and discuss these two issues for the remainder of this section.

Fig. 7. Problems with the assumptions.

We expect $\phi$ to be close to 1; however, for Los Angeles, CA, this does not seem to be the case. In fact, according to the data, for short gap times, the correlation between sale pairs seems to be far lower than one. To investigate this feature, we examine sale pairs with gap times between 1 and 5 quarters more closely. In Figure 8, we construct a histogram of the quarters where the second sale occurred for this subset of sale pairs. We pair this histogram with a plot of the price index for Los Angeles, CA. Most of these sales occurred during the late 1980s and early 1990s. This corresponds to the same period when Sing and Furlong (1989) found that lenders were offering people mortgages where the monthly payment was greater than 33% of their monthly income. The threshold of 33% is set to help ensure that people will be able to afford their mortgage. Those persons with mortgages that exceed this percentage tend to have a higher probability of defaulting on their payments. Bates (1989) found that a number of banks including the Bank of California and Wells Fargo were highly exposed to these risky investments, especially in the wake of the housing downturn during the early 1990s. If a short gap time is an indication that a foreclosure took place, this would explain why these sale pair prices are not highly correlated. We did observe, however, that other cities also experienced periods of decline, such as Stamford, CT (see Figure 2), but did not have anomalous autoregressive patterns like those in Figure 7 for Los Angeles, CA.

Fig. 8. Examining the housing downturn.

Even if this were not the case, the autoregressive model may not be performing well simply because there was a downturn in the housing market. Most of the cities in our data cover periods where the indices are increasing; the model may be performing well only because of this feature. In the case of Los Angeles, CA, if we examine the period between January 1990 and December 1996 on Figure 8, the housing index was decreasing. However, if we calculate the RMSE of test set sales for this period only, we find that the autoregressive model still performs better than the S&P/Case–Shiller method. The RMSE values are $32,039 and $41,841, respectively. Therefore, the autoregressive model seems to perform better in a period of decline as well as in times of increase.

The second irregularity evident in Figure 7 is that the AR(1) process does not decay at the same rate as the model predicts. In 1978 California voters, as a protest against rising property taxes, passed Proposition 13, which limited how fast property tax assessments could increase per year. Galles and Sexton (1998) argue that Proposition 13 encouraged people to retain homes, especially if they have owned their home for a long time. It is possible that this feature of Figure 7 is a long term effect of Proposition 13. On the other hand, it could be that California home owners tend to renovate their homes more frequently than others, reducing the decay in prices over time. However, we have no way of verifying either of these explanations given our data.

7. Discussion. Two key tasks when analyzing house prices are predicting sale prices of individual homes and constructing price indices which measure general housing trends. Using extensive data from twenty metropolitan



areas, we have compared our predictive method to two other methods, including the S&P/Case–Shiller Home Price Index. We find that on average the predictions using our method are more accurate in all but one of the twenty metropolitan areas examined. Data such as ours often do not contain reliable hedonic information on individual homes, if at all. Therefore, harnessing the information contained in a previous sale is critical. Repeat sales indices attempt to do exactly that. Some methods have also incorporated ad hoc adjustments to take account of the gap time between the repeat sales of a home. In contrast, our model involves an underlying AR(1) time series which automatically adjusts for the time gap between sales. It also uses the home’s ZIP code as an additional indicator of its hedonic value. This indicator has some predictive value, although its value is quite weak by comparison with the price in a previous sale if one has been recorded. The index constructed from our statistical model can be viewed as a weighted average of estimates from single and repeat sales homes, with the repeat sales prices having a substantially higher weight. As noted, the time series feature of the model guarantees that this weight for repeat sales prices slowly decreases in a natural fashion as the gap time between sales increases. Our results do not provide definitive evidence as to the value of our index when comparing with other currently available indices as a general economic indicator. Indeed, such a determination should involve a study of the economic uses of such indicators as well as an examination of their formulaic construction and their use for prediction of individual sale prices. We have not undertaken such a study, and so can offer only a few comments about the possible comparative values of our index. 
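The gradual loss of weight on a previous sale as the gap time grows can be illustrated with the AR(1) correlation $\phi^h$ used in the model. The short sketch below uses the Ann Arbor, MI estimate quoted in Section 4; the function name `implied_correlation` and the chosen gap lengths are illustrative assumptions.

```python
def implied_correlation(phi, gap_quarters):
    """Correlation between adjusted log prices h quarters apart under AR(1)."""
    return phi ** gap_quarters

phi_ann_arbor = 0.993247          # estimate quoted for Ann Arbor, MI
gaps = [1, 4, 22, 40, 80]         # gap times in quarters (illustrative)
decay = {h: implied_correlation(phi_ann_arbor, h) for h in gaps}
```

At the mean gap time of about 22 quarters the implied correlation is close to 0.86, so the previous sale remains informative while still being discounted relative to a very recent sale.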
As we have discussed, we feel it may be an advantage that our index involves all home sales in the data (subject to the naturally occurring weighting described above), rather than only repeat sales. Repeat sales homes are only a small, selected fraction of all home sales. Studies have shown that repeat sales homes may have different characteristics than single sale homes. In particular, they are evidently older on average, and this could be expected to have an effect on their sale price. Since our measure brings all home sales into consideration, albeit in a gently weighted manner, and since it provides improved prediction on average, it may produce a preferable index. Another advantage of our model is that it remains easy to interpret at both the micro and macro levels, in spite of including several features inherent in the data. Future work seems desirable to understand anomalous features such as those we have discussed in the Los Angeles, CA, area. Such research may allow us to construct a more flexible model to accommodate such cases. For example, it could involve the inclusion of economic indicators which may affect house prices such as interest rates and tax rates and measures of general economic status such as the unemployment rate.



APPENDIX A: DATA SUMMARY

Table 6
Summary counts

                                              No. houses per sale count
City                No. sales  No. houses         1         2         3        4+
Ann Arbor, MI          68,684      48,522    32,458    12,662     2,781       621
Atlanta, GA           376,082     260,703   166,646    76,046    15,163     2,836
Chicago, IL           688,468     483,581   319,340   130,234    28,369     5,603
Columbia, SC            7,034       4,321     2,303     1,470       431       117
Columbus, OH          162,716     109,388    67,926    31,739     7,892     1,831
Kansas City, MO       123,441      90,504    62,489    23,706     3,773       534
Lexington, KY          38,534      26,630    16,891     7,901     1,555       282
Los Angeles, CA       543,071     395,061   272,258   100,918    18,965     2,903
Madison, WI            50,589      35,635    23,685     9,439     2,086       425
Memphis, TN            55,370      37,352    23,033    11,319     2,412       587
Minneapolis, MN       330,162     240,270   166,811    59,468    11,856     2,127
Orlando, FL           104,853      72,976    45,966    22,759     3,706       543
Philadelphia, PA      402,935     280,272   179,107    82,681    15,878     2,606
Phoenix, AZ           180,745     129,993    87,249    35,910     5,855       968
Pittsburgh, PA        104,544      73,871    48,618    20,768     3,749       718
Raleigh, NC           100,180      68,306    42,545    20,632     4,306       818
San Francisco, CA      73,598      59,416    46,959    10,895     1,413       149
Seattle, WA           253,227     182,770   124,672    47,406     9,198     1,494
Sioux Falls, SD        12,439       8,974     6,117     2,353       419        85
Stamford, CT           14,602      11,128     8,200     2,502       357        62

Table 7
Number of ZIP codes by city

City                No. ZIP codes
Ann Arbor, MI                  57
Atlanta, GA                   184
Chicago, IL                   317
Columbia, SC                   12
Columbus, OH                  103
Kansas City, MO               179
Lexington, KY                  31
Los Angeles, CA               280
Madison, WI                    40
Memphis, TN                    64
Minneapolis, MN               214
Orlando, FL                    96
Philadelphia, PA              330
Phoenix, AZ                   130
Pittsburgh, PA                257
Raleigh, NC                    82
San Francisco, CA              70
Seattle, WA                   110
Sioux Falls, SD                30
Stamford, CT                   23

Table 8
Training and test set sizes

                          Autoregressive model           S&P/Case–Shiller model
City                 Training      Test  No. houses   Training pairs  No. houses
Ann Arbor, MI          58,953     9,731      48,522           10,431       9,735
Atlanta, GA           319,925    56,127     260,703           59,222      55,911
Chicago, IL           589,289    99,179     483,581          105,708      99,069
Columbia, SC            5,747     1,287       4,321            1,426       1,279
Columbus, OH          136,989    25,727     109,388           27,601      25,458
Kansas City, MO       107,209    16,232      90,504           16,705      16,092
Lexington, KY          32,705     5,829      26,630            6,075       5,748
Los Angeles, CA       470,721    72,350     395,061           75,660      72,338
Madison, WI            43,349     7,240      35,635            7,714       7,221
Memphis, TN            46,724     8,646      37,352            9,372       8,673
Minneapolis, MN       286,476    43,686     240,270           46,206      43,764
Orlando, FL            89,123    15,730      72,976           16,147      15,531
Philadelphia, PA      343,354    59,581     280,272           63,082      60,068
Phoenix, AZ           155,823    24,922     129,993           25,830      24,656
Pittsburgh, PA         89,762    14,782      73,871           15,891      14,956
Raleigh, NC            84,678    15,502      68,306           16,372      15,388
San Francisco, CA      66,527     7,071      59,416            7,111       6,948
Seattle, WA           218,741    34,486     182,770           35,971      34,304
Sioux Falls, SD        10,755     1,684       8,974            1,781       1,677
Stamford, CT           12,902     1,700      11,128            1,774       1,654

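APPENDIX B: UPDATING EQUATIONS

In this section we provide the updating equations for estimating the parameters $\theta = \{\beta, \sigma_\varepsilon^2, \sigma_\tau^2, \phi\}$ in the autoregressive model (see Section 3). Observe that the covariance matrix $\mathbf{V}$ is an $N \times N$ matrix where $N$ is the sample size. Given the size of our data, it is simpler computationally to exploit the block diagonal structure of $\mathbf{V}$. Each block, denoted by $\mathbf{V}_{z,z}$, corresponds to observations in ZIP code $z$. Computations are carried out on the ZIP code level and the updating equations provided below reflect this. For instance, $\mathbf{y}_z$ and $\mathbf{T}_z$ are the elements of the log price vector and transformation matrix respectively for observations in ZIP code $z$. To start, an explicit expression for $\beta$ can be formulated:

$$\hat{\beta} = \left( \sum_{z=1}^{Z} (\mathbf{T}_z \mathbf{X}_z)' \mathbf{V}_{z,z}^{-1} \mathbf{T}_z \mathbf{X}_z \right)^{-1} \sum_{z=1}^{Z} (\mathbf{T}_z \mathbf{X}_z)' \mathbf{V}_{z,z}^{-1} \mathbf{T}_z \mathbf{y}_z. \tag{19}$$

Estimates must be computed numerically for the remaining parameters. As all of these are one-dimensional parameters, methods such as the Newton–Raphson algorithm are highly suitable. We first define $\mathbf{w}_z = \mathbf{y}_z - \mathbf{X}_z \beta$ for clarity. To update $\sigma_\varepsilon^2$, compute the zero of

$$0 = -\sum_{z=1}^{Z} \operatorname{tr}\bigl(\mathbf{V}_{z,z}^{-1} \operatorname{diag}(\mathbf{r}_z)\bigr) + \sum_{z=1}^{Z} (\mathbf{T}_z \mathbf{w}_z)' \mathbf{V}_{z,z}^{-1} \operatorname{diag}(\mathbf{r}_z) \mathbf{V}_{z,z}^{-1} (\mathbf{T}_z \mathbf{w}_z), \tag{20}$$

where $\operatorname{tr}(\cdot)$ is the trace of a matrix and $\operatorname{diag}(\mathbf{r})$ is as defined in (4). Similarly, to update $\sigma_\tau^2$, find the zero of

$$0 = -\sum_{z=1}^{Z} \operatorname{tr}\bigl(\mathbf{V}_{z,z}^{-1} (\mathbf{T}_z \mathbf{1}_{n_z})(\mathbf{T}_z \mathbf{1}_{n_z})'\bigr) + \sum_{z=1}^{Z} (\mathbf{T}_z \mathbf{w}_z)' \mathbf{V}_{z,z}^{-1} (\mathbf{T}_z \mathbf{1}_{n_z})(\mathbf{T}_z \mathbf{1}_{n_z})' \mathbf{V}_{z,z}^{-1} (\mathbf{T}_z \mathbf{w}_z), \tag{21}$$

where $n_z$ denotes the number of observations in ZIP code $z$ and $\mathbf{1}_k$ is a $(k \times 1)$ vector of ones. Finally, to update the autoregressive parameter $\phi$, we must calculate the zero of the function below:

$$
\begin{aligned}
0 = {} & -\sum_{z=1}^{Z} \operatorname{tr}\Biggl( \mathbf{V}_{z,z}^{-1} \Biggl\{ \sigma_\tau^2 \frac{\partial (\mathbf{T}_z \mathbf{1}_{n_z})}{\partial \phi} (\mathbf{T}_z \mathbf{1}_{n_z})' + \sigma_\tau^2 (\mathbf{T}_z \mathbf{1}_{n_z}) \biggl( \frac{\partial (\mathbf{T}_z \mathbf{1}_{n_z})}{\partial \phi} \biggr)' \\
& \qquad\qquad + \frac{2\phi\sigma_\varepsilon^2}{(1-\phi^2)^2} \operatorname{diag}(\mathbf{r}_z) + \frac{\sigma_\varepsilon^2}{1-\phi^2} \frac{\partial \operatorname{diag}(\mathbf{r}_z)}{\partial \phi} \Biggr\} \Biggr) \\
& - \sum_{z=1}^{Z} \biggl( \frac{\partial \mathbf{T}_z}{\partial \phi} \mathbf{w}_z \biggr)' \mathbf{V}_{z,z}^{-1} (\mathbf{T}_z \mathbf{w}_z) - \sum_{z=1}^{Z} (\mathbf{T}_z \mathbf{w}_z)' \mathbf{V}_{z,z}^{-1} \biggl( \frac{\partial \mathbf{T}_z}{\partial \phi} \mathbf{w}_z \biggr) \\
& + \sum_{z=1}^{Z} (\mathbf{T}_z \mathbf{w}_z)' \mathbf{V}_{z,z}^{-1} \Biggl\{ \sigma_\tau^2 \frac{\partial (\mathbf{T}_z \mathbf{1}_{n_z})}{\partial \phi} (\mathbf{T}_z \mathbf{1}_{n_z})' + \sigma_\tau^2 (\mathbf{T}_z \mathbf{1}_{n_z}) \biggl( \frac{\partial (\mathbf{T}_z \mathbf{1}_{n_z})}{\partial \phi} \biggr)' \\
& \qquad\qquad + \frac{2\phi\sigma_\varepsilon^2}{(1-\phi^2)^2} \operatorname{diag}(\mathbf{r}_z) + \frac{\sigma_\varepsilon^2}{1-\phi^2} \frac{\partial \operatorname{diag}(\mathbf{r}_z)}{\partial \phi} \Biggr\} \mathbf{V}_{z,z}^{-1} (\mathbf{T}_z \mathbf{w}_z).
\end{aligned} \tag{22}
$$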
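Of the updates above, only (19) has a closed form. One way it might be accumulated ZIP code by ZIP code, without ever forming the full $N \times N$ matrix $\mathbf{V}$, is sketched below. The function name `update_beta` and the layout of `blocks` are assumptions for illustration; each block supplies $(\mathbf{T}_z, \mathbf{X}_z, \mathbf{y}_z, \mathbf{V}_{z,z})$, with $\mathbf{V}_{z,z}$ symmetric positive definite.

```python
import numpy as np

def update_beta(blocks):
    """Evaluate (19) by summing per-ZIP-code contributions.

    blocks: iterable of (T_z, X_z, y_z, V_zz) tuples, one per ZIP code.
    """
    blocks = list(blocks)
    n_params = blocks[0][1].shape[1]
    lhs = np.zeros((n_params, n_params))
    rhs = np.zeros(n_params)
    for T_z, X_z, y_z, V_zz in blocks:
        TX = T_z @ X_z
        Vinv_TX = np.linalg.solve(V_zz, TX)   # V^{-1} (T X), no explicit inverse
        lhs += TX.T @ Vinv_TX                 # (T X)' V^{-1} (T X)
        rhs += Vinv_TX.T @ (T_z @ y_z)        # (T X)' V^{-1} T y, V symmetric
    return np.linalg.solve(lhs, rhs)
```

Solving against each small block keeps the cost tied to the largest ZIP code rather than to the full sample size, which is the computational point made in the text.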

After the estimates converge, we must estimate the random effects. We use Henderson’s procedure to derive the Best Linear Unbiased Predictors (BLUP) for each ZIP code. His method assumes that the parameters in the covariance matrix, $\mathbf{V}$, are known; however, we use the estimated values. The formula is

$$\hat{\tau}_z = \left( \frac{\hat{\sigma}_\varepsilon^2}{\hat{\sigma}_\tau^2} + (1 - \hat{\phi}^2)(\hat{\mathbf{T}}_z \mathbf{1}_{n_z})' \operatorname{diag}^{-1}(\hat{\mathbf{r}}_z)(\hat{\mathbf{T}}_z \mathbf{1}_{n_z}) \right)^{-1} \times \left( (1 - \hat{\phi}^2)(\hat{\mathbf{T}}_z \mathbf{1}_{n_z})' \operatorname{diag}^{-1}(\hat{\mathbf{r}}_z)(\hat{\mathbf{T}}_z \hat{\mathbf{w}}_z) \right), \tag{23}$$

where $\operatorname{diag}^{-1}(\hat{\mathbf{r}})$ is the inverse of the estimated diagonal matrix $\operatorname{diag}(\mathbf{r})$.

Acknowledgments. The authors would like to thank the referees for their thorough and helpful comments.

SUPPLEMENTARY MATERIAL

Supplement to “An autoregressive approach to house price modeling” (DOI: 10.1214/10-AOAS380SUPP; .pdf). This supplement contains extra analysis on a variety of topics related to the paper, from examining the convergence of the coordinate ascent algorithm and applying alternate loss functions to studying the impact of each feature included in the autoregressive (AR) model.

REFERENCES

Bailey, M. J., Muth, R. F. and Nourse, H. O. (1963). A regression method for real estate price index construction. J. Amer. Statist. Assoc. 58 933–942.
Bates, J. (1989). Survey cites four California banks with possibly risky realty loans. Los Angeles Times 1, December 30.
Bickel, P. J. and Doksum, K. A. (2001). Mathematical Statistics: Basic Ideas and Selected Topics, Vol. I, 2nd ed. Prentice Hall, Englewood Cliffs, NJ.
Brown, L. D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. IMS Lecture Notes Monogr. Ser. 9. IMS, Hayward, CA. MR0882001
Calhoun, C. (1996). OFHEO house price indices: HPI technical description. Available at:
Case, B., Pollakowski, H. O. and Wachter, S. M. (1991). On choosing among house price index methodologies. AREUEA J. 19 286–307.
Case, B. and Quigley, J. M. (1991). The dynamics of real estate prices. Rev. Econ. Statist. 73 50–58.
Case, K. E. and Shiller, R. J. (1987). Prices of single-family homes since 1970: New indexes for four cities. N. Engl. Econ. Rev. Sept./Oct. 45–56.
Case, K. E. and Shiller, R. J. (1989). The efficiency of the market for single family homes. Amer. Econ. Rev. 79 125–137.
Englund, P., Quigley, J. M. and Redfearn, C. L. (1999). The choice of methodology for computing housing price indexes: Comparisons of temporal aggregation and sample definition. J. Real Estate Fin. Econ. 19 91–112.
Galles, G. M. and Sexton, R. L. (1998). A tale of two tax jurisdictions: The surprising effects of California’s Proposition 13 and Massachusetts’ Proposition 2 1/2. Amer. J. Econ. Soc. 57 123–133.
Gatzlaff, D. H. and Haurin, D. R. (1997). Sample selection bias and repeat-sales index estimates. J. Real Estate Fin. Econ. 14 33–50.
Goetzmann, W. N. (1992). The accuracy of real estate indices: Repeat sales estimators. J. Real Estate Fin. Econ. 5 5–53.
Goetzmann, W. N. and Peng, L. (2002). The bias of the RSR estimator and the accuracy of some alternatives. Real Estate Econ. 30 13–39.
Goetzmann, W. N. and Spiegel, M. (1995). Non-temporal components of residential real estate appreciation. Rev. Econ. Statist. 77 199–206.
Henderson, C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics 31 423–447.
Meese, R. A. and Wallace, N. E. (1997). The construction of residential housing price indices: A comparison of repeat-sales, hedonic-regression, and hybrid approaches. J. Real Estate Fin. Econ. 14 51–73.
Nagaraja, C. H., Brown, L. D. and Zhao, L. H. (2010). Supplement to “An autoregressive approach to house price modeling.” DOI: 10.1214/10-AOAS380SUPP.
Palmquist, R. B. (1982). Measuring environmental effects on property values without hedonic regression. J. Urban Econ. 11 333–347.
Shen, H., Brown, L. D. and Zhi, H. (2006). Efficient estimation of log-normal means with application to pharmacokinetic data. Statist. Med. 25 3023–3038. MR2247080
Shiller, R. (1991). Arithmetic repeat sales price estimators. J. Housing Econ. 1 110–126.
Sing, B. and Furlong, T. (1989). Defaults feared if payments keep ballooning fast-rising interest rates causing concern for adjustable mortgages. Los Angeles Times 5, March 29.
S&P/Case–Shiller® Home Price Indices: Index Methodology (November 2009). Available at

C. H. Nagaraja
US Census Bureau
Center for Statistical Research and Methodology
4600 Silver Hill Rd.
Washington, DC 20233
USA
E-mail: [email protected]

L. D. Brown
L. H. Zhao
Statistics Department
The Wharton School
University of Pennsylvania
400 Jon M. Huntsman Hall
3730 Walnut St.
Philadelphia, Pennsylvania 19104-6302
USA
E-mail: [email protected]
[email protected]