Best Practices for Statistical Trading

Article excerpt

When applied properly, statistical analysis can offer powerful insights into market moves. When applied carelessly, it can waste a lot of your time. Here is how you can analyze readily available fundamental reports to assess the pricing prospects of the soybean market.

Valid statistical analysis requires considerable care at every stage of the process. You can't take shortcuts and you can't make assumptions. Thankfully, you can reasonably achieve this.

Our goal is to develop a standard model to forecast the average fall price of new-crop soybeans. To do so, we will use standard multiple regression analysis to assign weights, or regression coefficients, to the values of independent variables that we suspect will correlate with the average fall price of soybeans.
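To make the idea of assigning weights concrete, here is a minimal sketch of an ordinary least-squares fit using only the Python standard library. The function names `fit_ols` and `predict` are hypothetical, and the numbers are made up purely to exercise the code; they are not market data and this is not necessarily the procedure used in the article.

```python
def fit_ols(rows, targets):
    """Return least-squares weights [intercept, b1, b2, ...] via the normal equations."""
    # Design matrix with a leading 1.0 so the first weight is the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Apply fitted weights to one observation of the independent variables."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Made-up observations of two independent variables (illustrative only).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
# Targets built from known weights so the fit can be checked by eye.
targets = [1.0 + 2.0 * a + 3.0 * b for a, b in rows]
coefs = fit_ols(rows, targets)
```

Because the targets here are built from known weights, the fit should recover an intercept near 1 and slopes near 2 and 3. With real data the weights would of course not be known in advance.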

For the average fall price, which we refer to as the dependent variable, we are using the average price from mid-August through expiration of the November soybean contract. Using this time frame allows us to glean our fundamental data from the mid-August World Agricultural Supply and Demand Estimates (Wasde) reports.
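As a rough illustration of how the dependent variable might be computed, the sketch below averages daily settlement prices inside a date window. The function name `average_fall_price`, the prices and the window dates are all placeholders; in particular, the end date is not an actual contract expiration date.

```python
from datetime import date


def average_fall_price(settlements, start, end):
    """Mean of the settlement prices whose date falls in [start, end]."""
    window = [price for day, price in settlements if start <= day <= end]
    if not window:
        raise ValueError("no settlements in the requested window")
    return sum(window) / len(window)


# Hypothetical November-contract settlements in cents per bushel, not real quotes.
settlements = [
    (date(2001, 8, 10), 500.0),   # before the window, ignored
    (date(2001, 8, 15), 510.0),
    (date(2001, 9, 14), 490.0),
    (date(2001, 11, 14), 470.0),
    (date(2001, 11, 20), 460.0),  # after the window, ignored
]
fall_avg = average_fall_price(settlements, date(2001, 8, 15), date(2001, 11, 14))
```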

The reason we have to do that brings us to the first best practice we'll adhere to in our analysis.


Some statistical models are designed to explain, but we're interested in models that forecast. So when we analyze the past, we want to compare the average price of November soybeans from mid-August on with data that was available before mid-August. In other words, we will model price vs. expectations of what the fundamentals will be, not what the fundamentals ultimately were.

Past forecasts of fundamental data are not as readily available as the final revised numbers, but they are out there.

Next, and even more important than modeling expectations, is the need to hold back part of our data as "out-of-sample," to avoid curve fitting.

In our case, we are beginning our analysis with the 1976 crop year. We will end the in-sample data set with 2000. Our out-of-sample validation set will be 2001 through 2005.
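The split described above can be sketched in a few lines. The year ranges come from the text; the function name `split_by_crop_year` and the placeholder records are assumptions made for illustration.

```python
IN_SAMPLE = range(1976, 2001)      # crop years 1976 through 2000
OUT_OF_SAMPLE = range(2001, 2006)  # crop years 2001 through 2005


def split_by_crop_year(records):
    """Split a {crop_year: observation} dict into fitting and holdout sets."""
    fit = {y: v for y, v in records.items() if y in IN_SAMPLE}
    holdout = {y: v for y, v in records.items() if y in OUT_OF_SAMPLE}
    return fit, holdout


# Placeholder observations; a real record would hold price and Wasde forecasts.
records = {year: None for year in range(1976, 2006)}
fit, holdout = split_by_crop_year(records)
```

The point of keeping the holdout years in a separate object is that the model's weights are estimated from `fit` only, and `holdout` is touched once, at the end, to check the forecasts.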

Third, we will account for inflation by adjusting past prices according to the producer price index (PPI) before we examine the effect, if any, of the selected independent variables. This also means that as we apply our model going forward, the results will have to be adjusted by the most recent value of the PPI. The PPI is a gauge of inflation calculated by the Bureau of Labor Statistics.
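One common way to make this adjustment is to scale each price by the ratio of a base-period PPI to the PPI of the year in question, then reverse the scaling when a forecast is needed in current dollars. The sketch below assumes that approach; the function names and all index values are hypothetical, not Bureau of Labor Statistics figures.

```python
def to_base_dollars(nominal_price, ppi_then, ppi_base):
    """Restate a past price in base-period dollars."""
    return nominal_price * ppi_base / ppi_then


def to_current_dollars(base_price, ppi_base, ppi_now):
    """Restate a model output (in base-period dollars) in today's dollars."""
    return base_price * ppi_now / ppi_base


# Hypothetical index levels and price, for illustration only.
real = to_base_dollars(600.0, ppi_then=80.0, ppi_base=100.0)
nominal_now = to_current_dollars(real, ppi_base=100.0, ppi_now=120.0)
```

With these made-up inputs, a 600-cent price recorded when the index stood at 80 becomes 750 cents in base-period terms, and a 750-cent model output becomes 900 cents when the latest index is 120.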

Fourth, we will look for independent variables that have a linear relationship with the dependent variable. We want the fundamental relationship to be stable through time. If a 10% change in yield per acre affected prices by 50¢ in 1980, we want to see the same relationship in 1993. We do not want to see a relationship whose significance changes. The reason is simple: unless the variables themselves are transformed, standard multiple regression analysis does not produce valid models when the relationships are not linear.
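A simple, informal way to check stability is to estimate the slope of price against one variable separately in an early and a late subperiod and compare the two. The sketch below assumes that approach; the function names, the split year and the data are invented for illustration, and this is not a formal statistical test.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs for a single independent variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def subperiod_slopes(years, xs, ys, split_year):
    """Slopes fitted before split_year and from split_year onward."""
    early = [(x, y) for yr, x, y in zip(years, xs, ys) if yr < split_year]
    late = [(x, y) for yr, x, y in zip(years, xs, ys) if yr >= split_year]
    return slope(*zip(*early)), slope(*zip(*late))


# Made-up data in which the relationship is identical in both subperiods.
years = list(range(1980, 1988))
xs = [1.0, 2.0, 3.0, 4.0, 2.0, 4.0, 6.0, 8.0]
ys = [10.0 - 2.0 * x for x in xs]
early_slope, late_slope = subperiod_slopes(years, xs, ys, split_year=1984)
```

If the two slopes differ sharply on real data, that is a hint the variable's effect on price has shifted over time and the variable may be a poor candidate for the model.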

Fifth, our model must not exhibit the three problems that often plague multiple regression analysis: multicollinearity, heteroscedasticity and autocorrelation. We'll explain these terms later.
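For readers who want something concrete in the meantime, here is a minimal sketch of two widely used diagnostics: the Durbin-Watson statistic for autocorrelation in the residuals, and the variance inflation factor (VIF) for multicollinearity, shown in its two-predictor form. These are standard textbook measures and not necessarily the checks the article goes on to use; heteroscedasticity is not covered by this sketch. The function names and inputs are illustrative.

```python
import math


def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squares.

    Values near 2 suggest little first-order autocorrelation.
    """
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, zs):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(xs, zs)
    return 1.0 / (1.0 - r * r)


# Made-up inputs: alternating residuals and two uncorrelated predictors.
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
vif = vif_two_predictors([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0])
```

A VIF of 1 means the two predictors are uncorrelated; larger values mean they overlap and their individual weights become harder to trust.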


The fundamental drivers of the soybean market don't have to be complicated. We will look for our independent variables in past Wasde reports. This monthly report provides the most current U.S. Department of Agriculture forecasts of U.S. and world supply-use balances of major grains, soybeans and cotton, as well as the U.S. supply and use of sugar and livestock.

You can find the actual numbers from past Wasde reports (not final revised figures) at : (prior to 1995) and reports/waobr/wasde-bb (after 1995).

Current Wasde reports can be downloaded from the USDA's Web site.

The variables we're interested in are annual forecasted soybean production, the forecasted soybean usage/ending stocks ratio, forecasted soybean crushings, forecasted soybean yield and the forecasted corn usage/ending stocks ratio. …