Guest Editors’ Introduction: Information in Economic Forecasting
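This article is the introduction to a special issue of the Oxford Bulletin. It examines the role of information in economic forecasting, including the quantity of information, model selection, data transformations and evaluation information, and analyses how information affects forecast accuracy through four themes and four theorems.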
This Special Issue of the Oxford Bulletin is on the theme Information in Economic Forecasting. It comprises 10 papers which investigate a number of different aspects of the role of information, preceded by our introduction. Information is manifestly central to all forms of forecasting, yet many questions remain concerning precisely what information should be used, what form it should take, how it should be selected, how models based on it should be formulated, and what role that information should play in those models. Debate persists over whether more information is always better: if so, one must explain why there are repeated claims that large models perform poorly; if not, then why does additional information reduce forecast accuracy? Possible answers include estimation uncertainty, collinearity between explanatory variables, data measurement errors and structural breaks. Alternatively, one might enquire as to what types of information might improve forecasting: theory, causal, leading indicators, or something else? Are 'better' models of past data also better for forecasting? What information should be collected that is not currently available? The papers reported here, after rigorous reviewing and revision, set out to address the main issues lying behind these, and related, questions. Unfortunately, 'information' is not an unambiguous concept in standard usage. Universally, it denotes the contents of the available information set on which forecasts are conditioned, denoted below by ℐt−1. However, it may also relate to knowledge as to how ℐt−1 enters the conditional distribution of the variables to be forecast, or even which elements of ℐt−1 'really matter' (as some may be irrelevant if ℐt−1 is the σ-field generated by the history of all variables under consideration).
This switch in meaning can be confusing, and helps explain the conflict between themes (A) and (B) below, as the former refers to using the largest set of relevant ℐt−1, whereas the latter is more about knowledge as to the relevant components of ℐt−1, and how they affect the forecast distribution. We perceive four themes, which comprise analysing:
(A) the role of more information in forecasting, including the theory and practice of factor forecasting, the advantages and disadvantages of disaggregation (across variables and data frequencies), the role of forecast combinations, and information from leading indicators;
(B) the role of less information via imposing restrictions, both theory-based and data-based information, including selecting models for forecasting, and any possible benefits of parsimony in reducing estimation uncertainty and susceptibility to breaks;
(C) transformed information, including model transformations (such as differencing and intercept corrections [IC]), transformations of variables (including nonlinearities), and collinearity reductions;
(D) the role of evaluation information in forecasting, including historical forecast comparisons, interval and density forecasts, and forecast encompassing.
After presenting the background to these themes in section II, we will review the properties of unpredictability in section III and establish its main implications (based on Hendry, 1997), extending earlier results to non-stationary processes. Two important theorems about the role of information are established therein, and 10 steps linking predictability to forecastability are delineated in a taxonomy. Then section IV considers the implications of that taxonomy for the formulation of forecasting devices, noting two more theorems that act to limit the benefits of additional information in the context of practical forecasting.
Section V investigates the specific setting of a cointegrated data generating process (DGP) subject to breaks, and examines some adaptations which might improve robustness in forecasting. Cointegrated vector autoregressions (VARs) are a natural DGP to study from a forecasting perspective, given that VARs in levels, differences, and with unit root and cointegration restrictions, are commonly used in macroeconomic forecasting. Section VI provides a summary and overview of the papers in this issue in relation to our discussion of the role of information in economic forecasting.
The historical track record of econometric forecasting is littered with both forecast failures and the empirical out-performance of econometric models by 'naive devices': see, for example, many of the papers reprinted in Mills (1999). This adverse outcome for econometric systems is surprising since they incorporate inter-temporal economic theory-based causal information representing inertial dynamics in the economy: such models should have smaller prediction errors than purely extrapolative devices – but often do not. Discussions of the problems confronting economic forecasting date from the early history of econometrics: see, inter alia, Persons (1924), Morgenstern (1928) and Marget (1929). To explain such outcomes, Clements and Hendry (1996a, b, 1998, 1999) developed a theory of forecasting for non-stationary processes subject to structural breaks, where the forecasting model differed from the data generating mechanism (extended from a theory implicitly based on the assumptions that the model coincided with a constant-parameter mechanism). They thereby accounted for the successes and failures of various alternative forecasting approaches, and helped explain the outcomes of forecasting competitions (see, e.g. Makridakis and Hibon, 2000; Clements and Hendry, 2001; Fildes and Ord, 2002).
Nevertheless, there remained a conflict between the intuitive notion that more relevant information should help in forecasting, and the hard reality that attempts to make it do so have not been uniformly successful. 'The probability law of the T + H variables (x1,…, xT+H) is of such a type that the specification of implies the complete specification of and, therefore, of .' (Haavelmo, 1944, p. 107: our notation). Haavelmo's formulation highlights the major problems that need to be confronted for successful forecasting. The form of and the value of θ in-sample must be learned from the observed data, involving problems of specification of the set of relevant variables {xt}, measurement of the xs, formulation of the joint density , modelling of the relationships, selection of the relevant connections, and estimation of θ, all of which introduce uncertainties, the baseline level of which is set by the properties of . When forecasting, the actual future form of determines the 'intrinsic' uncertainty, growing as H increases (especially for non-stationary data from stochastic trends, etc.), further increased by any changes in the distribution function , or parameters thereof, between T and T + H (lack of time invariance). These 10 italicized issues structure the analysis of economic forecasting, but Clements and Hendry (1998, 1999) emphasized the importance of the last of these in determining forecast failure. Here, we develop a complementary explanation based on the many steps between the ability to predict a random variable at a point in time, and a forecast of the realizations of that variable over a future horizon from a model based on an historical sample. This overview spells out those various steps, and also demonstrates that many of the results on forecasting in Clements and Hendry (1998, 1999) have a foundation in the properties of unpredictability. 
Having established such foundations, we draw some implications for forecasting non-stationary processes using incomplete (i.e. mis-specified) models, via a 'forecasting strategy' which uses a combination of 'causal' information with 'robustification' of the forecasting device. Such a combination could be achieved either by rendering the econometric system robust, or by modifying a robust device using an estimate of any likely causal changes: for the latter, in the policy context, see Hendry and Mizon (2000, 2005). Themes (A) and (B) above deliberately conflict, matching the contrasting views in the literature. Both cannot be correct, yet both have staunch supporters. Theme (A) includes studies using factor forecasting by Forni et al. (2000), Stock and Watson (2002) and Amstad and Fischer (2004) among many others, which provide empirical support for the value of extensive information sets. Combining forecasts also has a long pedigree (since Francis Galton, according to Surowiecki (2004): see, e.g. Bates and Granger, 1969; Diebold and Pauly, 1987; Clemen, 1989; Diebold and Lopez, 1996; Stock and Watson, 1999; Newbold and Harvey, 2002; Clements and Galvão, 2005a), as well as theories for its success (see Granger, 1989; Hendry and Clements, 2004), and again suggests that more information helps. Theme (A) also includes disaggregation across variables, with recent examples including Espasa et al. (2002) and Hubrich (2005). Recent research by Ghysels et al. (2004a, b) suggests a way of disaggregating in terms of data frequency: their MIDAS (MIxed Data Sampling) approach allows the regressand and regressors to be sampled at different frequencies, so that higher frequency data can be used in forecasting. The MIDAS approach can be viewed as an alternative way of exploiting information in intraday data compared to the 'realized volatility' approach (see Andersen et al., 2003, inter alia).
Clements and Galvão (2005b) show that the MIDAS approach can also be useful in macroeconomic forecasting contexts. Theme (B) is a well-known claim in the folklore of economic forecasting, as well as an inference from 'forecasting competitions', such as in Makridakis and Hibon (2000) and Fildes and Ord (2002). The forecasting literature relies heavily on historical track records of which devices did well or badly on some (usually many) data sets using many measures, from which inductive inferences are made as to why the given outcomes eventuated. While one must applaud the empirical emphasis, a sound theoretical framework for interpreting the findings is often lacking. As an example, Clements and Hendry (2001) suggest that mistaken inferences concerning the role of parsimony in forecasting can arise when simplicity and robustness are confounded. A general framework must allow for an economy lacking time invariance from various forms of non-stationarity, with both slowly evolving and sudden shifts in distributions (so the future does not simply replicate the past), using forecasting devices that are mis-specified for the underlying DGP, and estimated from (possibly mismeasured) historical data. Hendry and Clements (2003) argued that such an approach explained a large fraction of the available evidence about forecast performance, and suggested some solutions that might be effective. Four provable theorems underpin our explanation here. To describe these, we must distinguish predictability from forecastability. The former is the relationship between a random variable νt and an information set ℐt−1, such that νt is predictable if its distribution conditional on ℐt−1 differs from its unconditional distribution. Forecastability concerns our ability to make 'useful' statements about a future outcome, where 'useful' can vary according to context.
Then the first theorem from the theoretical analysis in section III is that: more information unambiguously does not worsen predictability, even in intrinsically non-stationary processes. Unfortunately, more information cannot, in general, be shown to improve forecasting. Even in a stationary environment, with no measurement errors and no model mis-specification, when the parameters of the forecasting model have to be estimated from sample data, the estimated DGP need not produce the best forecasts. Our second theorem applies to stationary processes: forecasting using a model which retains all relevant variables with non-centralities τ² greater than unity in their t²-distributions for testing the null of irrelevance will dominate in one-step forecast accuracy, at least as measured by mean-square forecast error. As explained in section IV, τ² > 1 translates into an expected t² > 2. That result, related to model selection using the AIC (the Akaike information criterion: see Akaike, 1985), stems from results in Chong and Hendry (1986), extended by Clements and Hendry (1998, 1999). Since 'conventional' critical values, such as 5% or 1%, usually entail t² > 4, much larger models would be retained than result from standard practice. Section IV briefly explores model selection for forecasting. Together, these two theorems strongly support theme (A) and clearly run counter to the conventional wisdom that parsimony is necessary in forecasting, as they suggest that more information is usually good. The third theorem in section III returns to the context of predictability, and shows that: less information should not induce predictive failure. Such a result implies that the costs of reducing information are primarily inefficiency, not failure, thereby allowing for the possibility that parsimony need not be too expensive.
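The threshold in the second theorem can be illustrated with a minimal sketch. It assumes a stationary regression with a single zero-mean candidate regressor and large-sample approximations: estimating the coefficient adds roughly σ²/T to the one-step MSFE, whereas omitting the variable adds β²·Var(x), so retention wins when τ² = Tβ²Var(x)/σ² exceeds one. The function names, the parameter values and the χ²(1) approximation to the null distribution of t² are all illustrative assumptions, not the derivation given in section IV.

```python
import math


def noncentrality(beta, var_x, sigma2, T):
    """Approximate non-centrality tau^2 = T * beta^2 * Var(x) / sigma^2."""
    return T * beta ** 2 * var_x / sigma2


def expected_t2(tau2):
    """Approximate expected value of the squared t-statistic: 1 + tau^2."""
    return 1.0 + tau2


def msfe_retain(sigma2, T):
    """Approximate one-step MSFE when the coefficient is estimated from T observations."""
    return sigma2 * (1.0 + 1.0 / T)


def msfe_omit(beta, var_x, sigma2):
    """One-step MSFE when the variable is dropped, so its signal joins the error."""
    return sigma2 + beta ** 2 * var_x


def null_rejection_rate(t2_cutoff):
    """P(t^2 > cutoff) for an irrelevant variable, using the chi-squared(1) limit."""
    return math.erfc(math.sqrt(t2_cutoff / 2.0))


sigma2, var_x, T = 1.0, 1.0, 100  # hypothetical values chosen for illustration
for beta in (0.05, 0.20):
    tau2 = noncentrality(beta, var_x, sigma2, T)
    print(f"tau^2={tau2:.2f}  E[t^2]~{expected_t2(tau2):.2f}  "
          f"retain={msfe_retain(sigma2, T):.4f}  omit={msfe_omit(beta, var_x, sigma2):.4f}")
for cutoff in (2.0, 4.0):
    print(f"P(t^2 > {cutoff:.0f} | irrelevant) ~ {null_rejection_rate(cutoff):.3f}")
```

Under these assumptions, the variable with τ² = 0.25 is better omitted and the variable with τ² = 4 is better retained. The cut-off t² > 2 corresponds to a rejection frequency of roughly 16% for an irrelevant variable, against roughly 4.6% for t² > 4.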
However, the fourth theorem, demonstrated in Clements and Hendry (1998) and noted in section IV, acts as a powerful negative counter to theme (A): when the process to be forecast lacks time invariance, 'causal' models, no matter how significant all their estimated parameters, need not outperform naive 'robust' devices in forecasting, even though the 'causal' models have dominated in-sample. Thus, for forecasting variables from processes that lack time invariance, theoretical analysis cannot establish the pre-eminence of more information over less – theme (B). Moreover, re-interpreting the estimation result in the second theorem suggests that estimation and selection uncertainty remain important unless variables play a substantive role. Since τ² is unknown, it must be inferred from the observed t². Under the null that a variable is irrelevant, the mean value of t² is unity, and the probability that t² > 2 is about 16%. Thus, using the loose criterion of t² > 2 would entail retaining many irrelevant variables, where their parameter estimates just added noise to a forecast, leading to larger errors and pointing back to more parsimonious models doing better. The practical trade-off between false retention of irrelevant variables and false exclusion of relevant variables needs careful consideration, addressed in section IV. An implication of these four theorems (explored in Hendry, 2005) is that to use congruent encompassing causal models in forecasting mode, it may be essential to transform them to achieve some robustness to ex post breaks, which is theme (C). Section V discusses transforming models for use in forecasting mode. Now the role of theme (D) becomes clearer: in part, evaluation helps to check in-sample dominance, provide selection rules, and discern the onset of forecast failure ex post. But the value of evaluation information for forecasting also depends partly on how a given model is used in forecasting mode.
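The fourth theorem can be illustrated with a deliberately simple sketch. It assumes a process y_t = μ + ε_t, with independent errors of mean zero and variance σ², whose mean shifts by δ before the forecast origin. The 'causal' model keeps forecasting with the pre-shift mean (taken as known, so estimation uncertainty is ignored), while the 'robust' device is the random-walk forecast that uses the previous observation. The process, the parameter values and the function names are illustrative assumptions, not the setting analysed in Clements and Hendry (1998).

```python
import random


def analytic_msfe(bias, variance):
    """Squared-error loss: squared bias plus forecast-error variance."""
    return bias ** 2 + variance


def simulate_msfe(mu, delta, sigma, n_forecasts, seed=0):
    """Simulate one-step forecasts made after the mean has shifted from mu to mu + delta.

    Returns (MSFE of the pre-shift-mean model, MSFE of the random-walk device).
    """
    rng = random.Random(seed)
    post_mean = mu + delta
    previous = post_mean + rng.gauss(0.0, sigma)  # last observation, already post-shift
    sq_err_causal = 0.0
    sq_err_robust = 0.0
    for _ in range(n_forecasts):
        outcome = post_mean + rng.gauss(0.0, sigma)
        sq_err_causal += (outcome - mu) ** 2        # model still uses the pre-shift mean
        sq_err_robust += (outcome - previous) ** 2  # random walk: forecast = last observation
        previous = outcome
    return sq_err_causal / n_forecasts, sq_err_robust / n_forecasts


# Hypothetical values: a shift of two error standard deviations, then no shift.
for delta in (2.0, 0.0):
    causal, robust = simulate_msfe(mu=0.0, delta=delta, sigma=1.0, n_forecasts=50000, seed=1)
    print(f"delta={delta}: causal MSFE={causal:.2f} (analytic {analytic_msfe(delta, 1.0):.2f}), "
          f"robust MSFE={robust:.2f} (analytic {analytic_msfe(0.0, 2.0):.2f})")
```

With δ = 2σ the analytic MSFEs are 5σ² for the mean model and 2σ² for the random walk, whereas without a shift they are σ² and 2σ². In this sketch the model that dominates in-sample is therefore beaten after the break whenever δ² > σ².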
In-sample congruence may be of no help in either improving the accuracy of forecasting, or dominating rival methods, when location shifts occur. Conversely, in-sample congruence might be invaluable in building an undominated model of the local DGP when the resulting model is robustified before forecasting. The value of the loss of information from using 𝒥t−1 relative to ℐt−1 in terms of prediction will depend on the loss function, just as forecast loss depends on the loss function of the user. The sum of the squared bias and the forecast-error variance gives rise to 'squared-error loss', with empirical counterpart mean squared forecast error (MSFE). MSFE is a general-purpose loss function (see the exchanges in Clements and Hendry, 1993a, b). The importance of decision or cost-based assessment of the quality of forecasts has long been recognized (see, e.g. Pesaran and Skouras, 2002), along with the recognition that decision-based forecast evaluation criteria and MSFE need not necessarily agree on which of two rival sets of forecasts is superior. Conversely, disaggregating components of ℐT−h into their elements cannot lower predictability of a given aggregate yT, where such disaggregation may be across space (e.g. regions of an economy), variables (such as sub-indices of a price measure), or both. Further, since a lower frequency is a subset of a higher, and unpredictability is not, in general, invariant to the data frequency, then equation (11) ensures that temporal disaggregation cannot lower the predictability of the same entity yT. In all these cases, DyT+h(yT+h|·) remains the target of interest, and ℐT−h is 'decomposed', in that additional content is added to the information set. A different, but related, form of disaggregation is of the target variable yT into components yi,T. It may be thought that predictability could be improved by this form of disaggregation.
However, nothing is gained unless ℐT−1 increases, which does not necessitate forecasting the disaggregates, but does entail including the disaggregates in ℐT−1, as shown in Hendry and Hubrich (2004). These attributes sustain general models, and provide a formal basis for including as much information as possible for predictability, being potentially consistent with multi-variable 'factor forecasting' and the benefits claimed in the 'pooling of forecasts' literature, as noted in section II. Nevertheless, there remains a large gap between predictability and forecasting, an issue we address in section IV. Before that, we discuss the impact of non-stationarity in section 3.2. A major source of non-stationarity in economics derives from the presence of unit roots. However, these can be 'removed' for the purposes of the theoretical analysis by considering suitably differenced or cointegrated combinations of variables, and that is assumed below: section V considers the relevant transformations in detail for a VAR. Of course, predictability is thereby changed – a random walk is highly predictable in levels but has unpredictable changes – but it is convenient to consider such I(0) transformations. It is clear that one cannot forecast the unpredictable beyond its unconditional mean, but there may be hope of forecasting predictable events. To summarize, predictability of a random variable like yt in equation (6) from ℐt−1 has six distinct aspects: the composition of ℐt−1; how ℐt−1 influences Dyt(·∣ℐt−1) (or specifically, ft(ℐt−1)); how Dyt(·∣ℐt−1) (or specifically ft(ℐt−1)) changes over time; the use of the limited information set 𝒥t−1 ⊂ ℐt−1 ∀t; the mapping of Dyt(·∣ℐt−1) into Dyt(·∣𝒥t−1) (or specifically, ft(ℐt−1) into gt(𝒥t−1)); how (or of from a forecast at T are made using a model based on the limited information set 𝒥t−1 with conditional The parameters (or of the assumed θ must be estimated as using a sample T of observed information, denoted by . 
so four more the of by a function measurement errors between 𝒥t−1 and the observed estimation of θ in from in-sample data forecasting from . We consider these 10 aspects in matching the 10 issues in section II. knowledge of the composition of ℐt−1 will be available for such a entity as an any hope of success in forecasting with models that they do inertial ℐt−1 needs to have value for the future of the variables to be forecast, either from a causal or on this has been based on using but in two there is a well-known of unpredictable variables, including changes in and if any of these could be forecast for a future a could be which in would the While these are all in decision and future have to help the of changes: there is as yet evidence the of those in forecasting the processes time, so the on the of the is whether the in aggregate is for this on theoretical the with the empirical evidence is not It to the that predictability does not to be if ℐt−1 is precisely how ℐt−1 is relevant via has been the main of thereby major in that in recent as various forms of non-stationarity have been Even so, a lack of empirical equation past changes in data that remain – and – data (especially at higher than and the of model selection to models entail that much remains to be at the in ft(ℐt−1) over time have been and earlier research has the on forecasting of shifts in its mean to economic theory is the main for the specification of the information set partly by empirical model of not but section that models with mean errors could be Thus, incomplete information about the 'causal' is not by is Unfortunately, mapping ft(ℐt−1) into its conditional is not under the beyond the of changes in ft(ℐt−1) over time will have on and make interpreting and modelling these shifts Nevertheless, the additional that arise from this mapping act like However, even if aspects could be in (6) highlights that can in the between and shifts in the being location shifts in the forecasting context by the 
of forecast errors in Clements and Hendry, 1998, and by a in Hendry, where the is on the mean, which is the over the DGP distribution at time T + 1 conditional on a information set and is at Then across alternative of the contents of could provide improved forecasts relative to any (i.e. better the when the distribution changes from time and those different of Of course, that after forecasts have been cannot be the form of cannot be time T + has been However, after time T + becomes an in-sample so could be to be the central is not than from changes in model and empirical by a function where the formulation of θ is to incorporate the of past and models are of this as are models with may be possible for historical a location or the forecast may not be to the and even if may have that are to and to model with the limited information given the always as available are these may bias estimated and the modelling by measurement errors do not forecasts relative to the measured However, in models, measurement errors induce negative so a differencing to or a location will a negative In section we discuss in detail the of measurement errors and location This result to at the of practical forecasting and may explain the many where differencing and intercept have the of historical data to estimate θ by additional in the model relative to the data, as well as increased there are estimation from not all past breaks, which would affect and the like are attempts to such concerning forecasts have the added of these are no more than would arise in the context of to location shifts usually resulting in forecast failure. To forecasting models need to that as there are many devices that track a and are robust to such after they have Section V considers such Before that, section the possible forecast errors in a taxonomy both to their relation to of information and for of their adverse discuss a number of issues which are in the taxonomy. 
the terms in equation in the first four cannot be the of a being on the future relative to the information sets future information to the to the limited information set which is the and changes in the process not by the all therefore, unpredictable ex affect the forecast-error and may its The second and third terms have expected of for information sets and so will not affect . we that a lack of knowledge of the complete information set is not an explanation for forecast failure, using more information will reduce the variance to The second is when > but then the of the errors for However, the fourth is a source of forecast failure when as from an location than just structural in general, where the may be on Conversely, the fourth would be under even complete information will one of the first four The terms depend on the of the model for the local DGP and on data accuracy, both in-sample and at the forecast as well as the of the is a function of the of the the of the data accuracy at and the last on the properties of the for θ when the observed data are Thus, the would be for a the for data, but the in an the in many of forecast-error on the of parameter estimation and error (e.g. 2002). knowledge about the relevant in-sample about the data, about how the information and how should be would all but there no additional role for more information of in the To the we consider the one-step from the forecasting model which can be into the six of Since is an the DGP information set nothing will reduce its The properties of matter specifically its as well as any unpredictable changes in its distribution. The baseline accuracy of a forecast cannot that from the DGP error. the problems by in the section below their it is often possible to that a has and at develop variables to but the from or the forecast Section considers possible we that if has a