NICPRE QUARTERLY
A newsletter from the National Institute for Commodity Promotion Research and Evaluation on program evaluation and related issues
Vol. 4 No. 4
Fourth Quarter 1998

CONTENTS

Evaluating the Beef Promotion Checkoff

Replication: An Essential Step Toward Improved Program Evaluation

Editor's Notes

Next Meeting

 

Replication:
An Essential Step Toward Improved Program Evaluation

by Henry W. Kinnucan

In a previous NICPRE Quarterly article, I discussed three principles of sound evaluation: use of the scientific method, peer review, and reproducibility. Here I elaborate on reproducibility. In particular, I argue that replication, i.e., estimating the same model with new or updated data, is an essential step toward improved advertising benefit-cost analysis. Moreover, I argue that replication should not be confined to assessing the robustness of previous findings. Rather, it should be incorporated into current evaluations as part of the research design. Before describing how this can be done in a relatively unintrusive manner, I discuss three recent studies that highlight the need for replication and its role in assuring that results are trustworthy.

Meat Study: For the meat study, I refer to the results published in the February 1997 issue of the American Journal of Agricultural Economics. In that study, the same model (a Rotterdam specification for beef, pork, poultry, and fish) was estimated for two sample periods: an "initial" sample consisting of quarterly data for the period 1976-91, and an "updated" sample for the period 1976-93. The updated sample was identical to the initial sample except that it contained seven additional observations.

Estimates for both sample periods were similar for the economic variables (prices and income), health information, and trend. In particular, the economic, trend, and health variables that were significant in the initial sample tended also to be significant in the updated sample. And in most cases the estimated magnitudes of the coefficients were similar, meaning that one could have confidence in the estimates.

This was not the case for the advertising variables. In particular, whereas the initial sample indicated a significant relationship between beef advertising expenditures (lagged one period) and beef consumption, the updated sample indicated an insignificant relationship. A similar result occurred for the remaining commodities, i.e., the advertising coefficients that were significant in the initial sample tended not to be significant in the updated sample, and vice versa. Consequently, the authors concluded that the effects of generic advertising of meats in the U.S. are uncertain.

Note that this conclusion is the direct result of the replication. If the authors had not updated their sample and provided estimates for both sample periods, they might well have concluded that beef advertising increased beef demand. Given the sensitivity of this conclusion to sample period, it would have been difficult to confirm.

Citrus Study: In a 1961 article that has since become a classic, Nerlove and Waugh found that with production held constant, U.S. orange growers would receive $20 in gross revenue from an additional dollar invested in advertising. This marginal return estimate was based in part on a statistical demand equation that was estimated with data published in the study's appendix.

Using the appendix data, Tomek was able to duplicate exactly Nerlove and Waugh's empirical estimates of the demand parameters. However, he found that the parameters were sensitive to sample period. In particular, the model estimated with the post-war data produced an insignificant advertising effect. In addition, Tomek found that the demand specification had serial correlation, which suggests that the model may have been misspecified. Although Nerlove and Waugh's major contribution was theoretical, not empirical, the sensitivity of their empirical estimates to sample period, coupled with the serial correlation problem, calls into question the validity of their estimated rate of return. The lesson from this historical analysis is that estimated advertising effects are fragile, which reinforces the need for replication.

Catfish Study: The catfish example is more optimistic in the sense that results hold up better under replication. Four studies were done: an initial study based on monthly data for the period 1980-89 (study A), and three updates based on data for 1986-94 (study B), 1986-96 (study C), and 1986-97 (study D). The model used in the updates is essentially the same as that used in the original study, with the following exceptions. First, in the original study the dependent variable was defined as whole catfish. In the updates the definition was expanded to include value-added catfish products, mainly fillets. Since data for the value-added products were not available prior to 1986, it was necessary to truncate the samples in the updates to begin in 1986 rather than 1980. This truncation, however, had no material effect on the replications: the advertising campaign commenced in 1987, so none of the advertising observations were lost.

The second important difference between the original study and the updates is that the updates contain more advertising variables. This is because the industry expanded its media mix to include newspapers and radio in 1992, and television in 1994. In studies B and C, the print and electronic media were combined; in study D they were kept separate.

The final important difference is that studies A, B, and C imposed a geometrically declining (Nerlove) lag on the estimated advertising responses. In study D the advertising lag structure was estimated using the polynomial inverse lag (PIL) procedure, which subsumes the Nerlove lag as a special case. (Further details, including citations for the earlier studies, are provided in Kinnucan and Miao.)
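To make the geometrically declining lag concrete, the following is a minimal sketch, assuming the standard form in which the response to advertising j periods back is the current-period response multiplied by a decay factor raised to the power j. The function names and the numerical values are hypothetical and are not estimates from any of the catfish studies.

```python
def geometric_lag_weights(short_run, decay, n_lags):
    """Return the responses short_run * decay**j for lags j = 0..n_lags-1."""
    if not 0 <= decay < 1:
        raise ValueError("decay must lie in [0, 1)")
    return [short_run * decay ** j for j in range(n_lags)]


def long_run_effect(short_run, decay):
    """Sum of the infinite geometric series of lag weights."""
    if not 0 <= decay < 1:
        raise ValueError("decay must lie in [0, 1)")
    return short_run / (1 - decay)


# Hypothetical values chosen only for illustration.
weights = geometric_lag_weights(short_run=0.002, decay=0.75, n_lags=6)
print([round(w, 6) for w in weights])
print(long_run_effect(short_run=0.002, decay=0.75))
```

Under this restriction a single decay parameter fixes the whole shape of the response: the weights can only fall from the first period onward. A less restrictive procedure lets the data determine that shape.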

The main results are summarized in Table 1. These elasticities are long-run estimates. As expected given the change in the dependent variable to include value-added products, the updates show a larger income elasticity. (The insignificant income elasticity in study A suggests that by the late 1980s catfish had overcome its status as an inferior good, a result from earlier econometric studies.) The price elasticities, although smaller in absolute value in the more recent periods, are not significantly different from -1.0.

The most interesting results are for the advertising variables. Focusing first on print media, and bearing in mind that newspaper advertising was restricted to three months, the estimated advertising elasticities are remarkably robust. Regardless of sample period and distributed-lag restrictions, elasticity estimates are significant. Moreover, for the estimates based on the Nerlove lag, the elasticity from study A (0.0075) is almost identical to the elasticity from study C (0.0078). Thus, unlike for the meat and citrus studies, one can conclude with some confidence that generic advertising did indeed affect demand, at least with respect to the magazine portion of the campaign.

For the electronic media, the elasticity estimates are not nearly as stable. For example, study B produces a significant effect, whereas study C produces an insignificant effect. In study D, which separates the media, the elasticity for radio is positive and significant, but the elasticity for television is negative and unreliable. Thus, one cannot make any definitive statements about the effectiveness of electronic media. There appears to be some evidence that the radio advertising is effective, but given the instability of the estimates, one would want to replicate the study to be sure.

Returning to print media, notice that the advertising elasticity for magazines in study D is about three times larger than the estimates obtained in the earlier studies. This is due to the use of a less restrictive procedure for estimating the distributed lag in study D. In other words, the earlier studies, by employing the Nerlove lag, produced estimates of the advertising elasticities that were severely downward biased. The discovery of this bias highlights the side benefits of replication noted by Tomek, namely learning and research innovation. With the improved model specification, we now know that the returns to the print media campaign estimated in the earlier studies were understated. This finding should be reassuring to producers, since print media accounted for the bulk of their advertising investment.

As the foregoing examples illustrate, replication is essential to establish whether estimated advertising effects are robust, and therefore trustworthy. There are three additional reasons for replication:

1. Data used in economic analysis are frequently revised, which means that published studies based on the unrevised data may be compromised by measurement error. It is only with replication that we can know whether the results in published studies are spurious or real.

2. Most published studies showing a 'significant' advertising effect are from ad hoc models, i.e., models that are based more on heuristics than theory. (The Rotterdam specification used in the meat study, by contrast, is an example of a theory-based model.) As such, it is safe to assume that the models are the product of a specification search. That is, the researcher has tried alternative combinations of variables, lag structures, and estimation procedures to find the most 'plausible' result. There are two problems with this. First, the estimated advertising effect may be more a reflection of the peculiarities of the data set than the actual market response. Second, the reported t-values used to determine significance are difficult to interpret because one does not know the true probability of a Type I error. Although replication cannot solve the latter problem, it can shed light on the former. In particular, if the estimated advertising effects are indeed data specific, this will show up in the replication.

3. Replication of the type that was done in the meat study, hereafter referred to as 'internal replication,' permits more extensive model testing. For example, a number of authors have provided compelling arguments that model testing should go beyond the R² and the Durbin-Watson statistic to include tests for model misspecification and structural change. A number of the regression diagnostics provided for this purpose in econometric software such as EViews require post-sample predictions. By withholding a portion of the sample to be used for the internal replication, these additional tests can be performed to provide a more thorough assessment of model adequacy.

Recommendation: All advertising studies based on time series data should provide two sets of estimates, one for the first 80 to 90 percent of the observations in the series, and another for the entire sample period. In addition, a complete set of regression diagnostics based on the initial sample should be provided. This will help ensure that the results are robust, and it should mitigate some of the more serious problems associated with specification search. Internal replication and the associated diagnostic testing will add only moderately to the cost of the evaluation, while significantly enhancing the study's credibility from a scientific perspective.
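The recommendation can be illustrated with a short sketch. It assumes a deliberately simple double-log demand equation with advertising as the only regressor, synthetic data, and an 85 percent initial sample. The helper names (`ols`, `internal_replication`) are hypothetical. The Chow predictive-failure statistic is used here as one possible post-sample diagnostic; it is not necessarily the test applied in any of the studies discussed.

```python
import math
import random


def ols(x, y):
    """Fit y = a + b*x by least squares; return slope, t-value, RSS, and n."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return {"slope": slope, "t": slope / se_slope, "rss": rss, "n": n}


def internal_replication(x, y, share=0.85):
    """Estimate on the first `share` of the sample and on the full sample."""
    n1 = round(len(x) * share)   # size of the initial sample
    n2 = len(x) - n1             # number of withheld observations
    initial = ols(x[:n1], y[:n1])
    full = ols(x, y)
    # Chow predictive-failure statistic: large values signal that the
    # withheld observations are poorly predicted by the initial model.
    # Under the usual assumptions it is F(n2, n1 - 2) under the null.
    f_stat = ((full["rss"] - initial["rss"]) / n2) / (initial["rss"] / (n1 - 2))
    return initial, full, f_stat


# Synthetic data only: 100 periods of log advertising and log quantity.
random.seed(0)
log_adv = [random.uniform(0.0, 5.0) for _ in range(100)]
log_qty = [1.0 + 0.05 * a + random.gauss(0.0, 0.02) for a in log_adv]

initial, full, f_stat = internal_replication(log_adv, log_qty)
for label, fit in (("initial", initial), ("full", full)):
    print(f"{label:8s} n={fit['n']:3d} elasticity={fit['slope']:.4f} t={fit['t']:.2f}")
print(f"predictive-failure F = {f_stat:.3f}")
```

An advertising elasticity that keeps its sign, rough magnitude, and significance across the two rows is the kind of result the catfish print-media estimates displayed. A coefficient that is significant in one row and not in the other is the pattern found in the meat study.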

Table 1: Demand Elasticity Estimates for Catfish from Initial (Study A) and Replicated Studies (Studies B, C, and D)

Elasticity          Study A     Study B     Study C     Study D
                    (1980-89)   (1986-94)   (1986-96)   (1986-97)
Income              0.38*       2.19        1.79        1.58
Own-Price           -1.00       -0.83       -0.86       -0.71
Advertising:
  Magazines         0.0075      --          --          0.0244
  Newspapers        --          --          --          0.0614*
  Radio             --          --          --          0.0204
  TV                --          --          --          -0.1074**
  Mags. & News.***  --          0.0095      0.0078      --
  Radio & TV        --          0.0002      -0.0005*    --

* Not significant.
** The PIL coefficients are significant, but the elasticity is unreliable. See Kinnucan and Miao for details.
*** Newspaper advertising is restricted to 3 months.