Yudong Chen and Yining Chen’s contribution to the Discussion of ‘the Discussion Meeting on Probabilistic and statistical aspects of machine learning’
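This contribution comments on a method that uses deep neural networks to detect change-points. It discusses the standardization step, the distributional assumptions on the training and testing samples, and a more efficient way of using the training samples, and it offers suggestions for improvement.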
We congratulate the authors for providing this stimulating work, which uses deep neural networks for detecting change-points. We shall comment on three aspects of the paper: (i) the standardization procedure, (ii) distributional assumptions of the training and testing samples, and (iii) potentially more efficient use of the training samples.

First, we emphasize that classification procedures based on neural networks will not automatically be invariant to shifting or scaling. To illustrate this point, we consider scenario S1 with ρt = 0.5 and with no standardization performed on the training or testing sets. Figure 1a shows the results where an independent mean-shift U ∼ U[−2,2] is added to each series from the testing set only. Without standardization, the proposed method no longer performs satisfactorily, even when we increase the size of the training set, N, to 1,600. Scaling to [0,1], as proposed in the paper, will alleviate this issue. Standardization using trimmed mean and trimmed standard deviation estimates would be a more robust option. From our experiments, other approaches, such as adding a random baseline to all training samples, would also work.

Figure 1. Plot of test set misclassification error rate, computed on a test set of size 150,000, against training sample size N for detecting the existence of a change-point on data series of length n = 100 under scenario S1 described in Section 5 of the paper, with the exception that ρt = 0.5. We compare the performance of the CUSUM test and neural networks from four function classes as specified in Section 5. Here we use a batch size of 32 for training with no regularization. For standardization, we use the approach suggested in the paper. For the testing set used to produce (a), an independent mean-shift of U[−2,2] is added to each series.
(a) µL = 0 for the training samples, µL ∼ U[−2,2] for the testing samples; data not standardized.
(b) µL = 0 for both training and testing; data not standardized.
(c) µL = 0 for both training and testing; data standardized.
(d) µL = 0 for both training and testing; data standardized; no reversed samples added.
(e) µL = 0 for both training and testing; data standardized; reversed samples added.

Nevertheless, if the distributions of the training and test samples, denoted by Dtrain and D, are indeed the same, then performing standardization might lead to worse performance, as is illustrated in Figures 1b and 1c. We suspect this is because, for certain distributions, a change-point can be characterized by statistics that are not invariant to shifting or scaling. These statistics are likely simpler to learn than the Cumulative Sum (CUSUM) statistic. However, as demonstrated above, without standardization the resulting classifier might not transfer to even slightly different settings.

Besides, the presented theory requires Dtrain = D. We anticipate similar results to hold if Dtrain is a finite mixture with one component being D. More broadly speaking, consistency should hold if the support of Dtrain contains that of D. In addition, one could use different labels for different types of change-points (e.g. change in mean, change in variance, etc.) in the training set, and apply the existing approach to learn a multi-class classifier.

Finally, we believe that the training samples can be used in a more efficient manner with a minor twist. For a sequence X1, X2, …, Xn with label Y, the reversed sequence Xn, Xn−1, …, X1 should have the same label. Consequently, we could double the training sample size N by adding all the reversed sequences to the training set. A sizeable improvement from this extra step can be seen, especially when N is small, by comparing Figure 1d with Figure 1e.
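
As an illustration only, the reversal step described above, together with the trimmed standardization suggested earlier, might be sketched in Python with NumPy as follows. The function names `trimmed_standardize` and `add_reversed`, the trimming fraction of 0.1, and the layout of one series per row are assumptions made for this sketch; they are not taken from the paper or from our experiments.

```python
import numpy as np


def trimmed_standardize(x, trim=0.1):
    """Centre and scale one series by its trimmed mean and trimmed standard deviation.

    `trim` is the fraction of observations dropped from each tail (an assumed
    default). No consistency correction is applied to the trimmed standard
    deviation, so the scale is only defined up to a constant factor.
    """
    x = np.asarray(x, dtype=float)
    k = int(np.floor(trim * x.size))      # number of points dropped per tail
    core = np.sort(x)[k:x.size - k]       # central part of the ordered sample
    scale = core.std()
    if scale == 0:                        # guard against constant series
        scale = 1.0
    return (x - core.mean()) / scale


def add_reversed(X, y):
    """Append the time-reversed copy of every series, keeping its label.

    X has shape (N, n) with one series per row; y has shape (N,).
    Returns arrays of shape (2N, n) and (2N,).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    X_aug = np.concatenate([X, X[:, ::-1]], axis=0)
    y_aug = np.concatenate([y, y])
    return X_aug, y_aug
```

Under these assumptions, one would apply `trimmed_standardize` to each training and testing series, and `add_reversed` to the training set only, before fitting the classifier. The standardized series is unchanged when the input is shifted by a constant or multiplied by a positive constant, which is the invariance at issue in Figure 1a.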