Predictive Subdata Selection for Computer Models
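For large-scale computer models, we propose a subdata selection method based on expected improvement optimization that uses the geometry of the input feature region together with output-value information to select a subdataset that improves prediction accuracy while reducing computational cost.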
An explosion in the availability of rich data driven by technological advances is hindering statistical analysis because of constraints on computing time and memory storage, regardless of whether researchers employ simple methods (e.g., linear regression) or complex models (e.g., Gaussian processes). Recent approaches to overcoming these limits include information-based optimal subdata selection and Latin hypercube subagging. In the current study, we develop a novel subdata selection method for large-scale computer models based on expected improvement optimization. Numerical and empirical analyses using real-world data are used to select subdata from which accurate predictions can be derived. During the optimization procedure, the proposed scheme employs the geometry of the input feature region as well as information related to output values. The data points associated with the largest improvement in prediction accuracy are combined to construct a subdataset that can be used to formulate predictions within affordable computing time. Supplementary materials for this article, including proofs of theorems and additional numerical results, are available online.
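For readers unfamiliar with expected improvement, the following minimal sketch illustrates the general idea only. It assumes the textbook closed-form criterion for minimization under a Gaussian predictive distribution, which may differ from the criterion developed in this article (the article targets improvement in prediction accuracy). The function names `expected_improvement` and `select_top_k`, and the representation of candidates as `(index, mu, sigma)` tuples, are hypothetical and introduced purely for illustration.

```python
import math


def expected_improvement(mu, sigma, best):
    """Textbook EI for minimization, assuming a N(mu, sigma^2) prediction.

    `best` is the smallest value observed so far. This is the generic
    criterion, not necessarily the one used in the article.
    """
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (best - mu) * cdf + sigma * pdf


def select_top_k(candidates, best, k):
    """Return the indices of the k candidates with the largest EI.

    `candidates` is a list of (index, mu, sigma) tuples (hypothetical format).
    """
    ranked = sorted(
        candidates,
        key=lambda c: expected_improvement(c[1], c[2], best),
        reverse=True,
    )
    return [c[0] for c in ranked[:k]]
```

In this sketch, candidates with either a promising predicted mean or a large predictive uncertainty receive high scores, which is the trade-off that makes expected improvement attractive as a selection criterion.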