A control-theoretic perspective on optimal high-order optimization
从控制理论角度研究凸函数最小化的最优张量算法,通过闭环控制系统分析收敛速度,并给出两种离散化算法框架,其中一种推广了A-HPE框架,另一种得到新的最优p阶张量算法。
Abstract We provide a control-theoretic perspective on optimal tensor algorithms for minimizing a convex function in a finite-dimensional Euclidean space. Given a function $$\varPhi : {\mathbb {R}}^d \rightarrow {\mathbb {R}}$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>Φ</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msup><mml:mo>→</mml:mo><mml:mi>R</mml:mi></mml:mrow></mml:math> that is convex and twice continuously differentiable, we study a closed-loop control system that is governed by the operators $$\nabla \varPhi $$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>∇</mml:mi><mml:mi>Φ</mml:mi></mml:mrow></mml:math> and $$\nabla ^2 \varPhi $$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:msup><mml:mi>∇</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>Φ</mml:mi></mml:mrow></mml:math> together with a feedback control law $$\lambda (\cdot )$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>λ</mml:mi><mml:mo>(</mml:mo><mml:mo>·</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math> satisfying the algebraic equation $$(\lambda (t))^p\Vert \nabla \varPhi (x(t))\Vert ^{p-1} = \theta $$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>λ</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>p</mml:mi></mml:msup><mml:msup><mml:mrow><mml:mo>‖</mml:mo><mml:mi>∇</mml:mi><mml:mi>Φ</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>‖</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>θ</mml:mi></mml:mrow></mml:math> for some $$\theta \in (0, 1)$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>θ</mml:mi><mml:mo>∈</mml:mo><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math> . Our first contribution is to prove the existence and uniqueness of a local solution to this system via the Banach fixed-point theorem. We present a simple yet nontrivial Lyapunov function that allows us to establish the existence and uniqueness of a global solution under certain regularity conditions and analyze the convergence properties of trajectories. The rate of convergence is $$O(1/t^{(3p+1)/2})$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>O</mml:mi><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msup><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mn>3</mml:mn><mml:mi>p</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math> in terms of objective function gap and $$O(1/t^{3p})$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>O</mml:mi><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msup><mml:mi>t</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>p</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math> in terms of squared gradient norm. Our second contribution is to provide two algorithmic frameworks obtained from discretization of our continuous-time system, one of which generalizes the large-step A-HPE framework of Monteiro and Svaiter (SIAM J Optim 23(2):1092–1125, 2013) and the other of which leads to a new optimal p -th order tensor algorithm. While our discrete-time analysis can be seen as a simplification and generalization of Monteiro and Svaiter (2013), it is largely motivated by the aforementioned continuous-time analysis, demonstrating the fundamental role that the feedback control plays in optimal acceleration and the clear advantage that the continuous-time perspective brings to algorithmic design. A highlight of our analysis is that we show that all of the p -th order optimal tensor algorithms that we discuss minimize the squared gradient norm at a rate of $$O(k^{-3p})$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>O</mml:mi><mml:mo>(</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mrow><mml:mo>-</mml:mo><mml:mn>3</mml:mn><mml:mi>p</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math> , which complements the recent analysis in Gasnikov et al. (in: COLT, PMLR, pp 1374–1391, 2019), Jiang et al. (in: COLT, PMLR, pp 1799–1801, 2019) and Bubeck et al. (in: COLT, PMLR, pp 492–507, 2019).