A Nonparametric Method for Dealing With Mismeasured Covariate Data
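Proposes a nonparametric method that uses true and mismeasured covariate data from a validation subsample to estimate the likelihood function, and through it the parameters relating the outcome to the covariates; the method applies to missing data, errors-in-variables, and surrogate-variable problems.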
Abstract
Mismeasurement of covariate data is a frequent problem in statistical data analysis. However, when true and mismeasured data are obtained for a subsample of the observations, it is possible to estimate the parameters relating the outcome to the covariate of interest. Maximum likelihood methods that rely on parametric models for the mismeasurement have not met with much success. Realistic models for the mismeasurement process are difficult to construct; the form of the likelihood is often intractable and, more important, such methods are not robust to model misspecification. We propose an easily implemented method that is nonparametric with respect to the mismeasurement process and that is applicable when mismeasurement is due to the problem of missing data, errors in variables, or use of imperfect surrogate covariates. Specifically, denote the outcome variable by $Y$, the covariate data subject to mismeasurement by $X$, and the remaining covariates, including perhaps surrogates or mismeasured values of $X$, by $Z$. We consider a general regression model of the form $P_\beta(Y \mid X, Z)$. Suppose data regarding $Y$, $X$, and $Z$ are available for a validation sample $V$, a random subsample of the total sample, whereas data regarding only $Y$ and $Z$ are available for the remainder, the nonvalidation sample $\bar{V}$. We propose to base inference on the estimated likelihood for $\beta$, $\hat{L}(\beta) = \prod_{i \in V} P_\beta(Y_i \mid X_i, Z_i) \prod_{j \in \bar{V}} \hat{P}_\beta(Y_j \mid Z_j)$, where $\hat{P}_\beta(Y_j \mid Z_j)$ is estimated empirically using the validation sample covariate data. Asymptotic results are derived for the case in which the surrogate or mismeasured covariates are categorical. The asymptotic variance of the estimated score involves not only the second derivative of the log estimated likelihood but also a term that captures the variability induced by estimating the nonvalidation sample likelihood.
An example and a small simulation study demonstrate that this method may be of value for the missing covariate data and covariate measurement error problems.
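To make the estimated likelihood concrete, the following is a minimal sketch and not the authors' implementation. It assumes a logistic outcome model and a categorical Z, and it takes one natural reading of "estimated empirically": for a nonvalidation unit with Z = z, the estimate of P_beta(y | z) is the average of P_beta(y | x_i, z) over the validation units i with Z_i = z. The function names, the logistic form, and the toy data are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def p_outcome(y, x, z, beta):
    """Assumed example model: logistic P_beta(Y = y | X = x, Z = z)."""
    b0, b1, b2 = beta
    p1 = 1.0 / (1.0 + math.exp(-(b0 + b1 * x + b2 * z)))
    return p1 if y == 1 else 1.0 - p1


def estimated_loglik(beta, validation, nonvalidation):
    """Log of the estimated likelihood for beta.

    validation:    list of (y, x, z) triples, where X is observed
    nonvalidation: list of (y, z) pairs, where X is not observed
    """
    # Collect the validation-sample X values within each category of Z.
    x_by_z = defaultdict(list)
    for _, x, z in validation:
        x_by_z[z].append(x)

    # Validation units contribute the full-data term log P_beta(Y | X, Z).
    loglik = sum(math.log(p_outcome(y, x, z, beta)) for y, x, z in validation)

    # Nonvalidation units contribute log of the empirical average of
    # P_beta(Y | x, Z) over validation X values sharing the same Z.
    for y, z in nonvalidation:
        xs = x_by_z.get(z)
        if not xs:
            raise ValueError(f"no validation data for Z = {z!r}")
        p_hat = sum(p_outcome(y, x, z, beta) for x in xs) / len(xs)
        loglik += math.log(p_hat)
    return loglik


# Hypothetical toy data: Z is a binary surrogate for a continuous X.
validation = [(1, 1.2, 1), (0, -0.4, 0), (1, 0.7, 1), (0, 0.1, 0)]
nonvalidation = [(1, 1), (0, 0), (0, 1)]
print(estimated_loglik((0.0, 1.0, 0.5), validation, nonvalidation))
```

An estimate of beta would then come from maximizing `estimated_loglik` with any numerical optimizer. As the abstract notes, a standard error based only on the second derivative of this function would omit the variability introduced by estimating the nonvalidation term, so the sketch covers point estimation only.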