Abstract
The heavy-ball method augments gradient descent with a momentum term but lacks provably accelerated convergence for general smooth strongly convex problems. This work introduces the Accelerated Over-Relaxation Heavy-Ball (AOR-HB) method, the first variant with provable global and accelerated convergence for such problems. AOR-HB closes a long-standing theoretical gap, extends to composite convex optimization and min-max problems, and achieves optimal complexity bounds. It offers three key advantages: (1) broad generalization ability, (2) potential to reshape acceleration techniques, and (3) conceptual clarity and elegance compared to existing methods.
1 Introduction
We first consider the convex optimization problem:
$$\min_{x \in \mathbb{R}^{d}} f(x), \tag{1}$$
where the objective function $f$ is $\mu$-strongly convex and $L$-smooth. Later on, we consider extensions to composite convex optimization with an additional non-smooth term, and to a class of min-max problems with bilinear coupling.
Notation.
$\mathbb{R}^{d}$ is the $d$-dimensional Euclidean space with the standard $\ell_2$-inner product $\langle \cdot, \cdot\rangle$ and the induced norm $\|\cdot\|$. Let $f: \mathbb{R}^{d} \to \mathbb{R}$ be a differentiable function. We say $f$ is $\mu$-strongly convex when there exists $\mu > 0$ such that
$$f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\,\|y - x\|^{2}, \qquad \forall\, x, y \in \mathbb{R}^{d}.$$
We say $f$ is $L$-smooth with $L > 0$ if its gradient is Lipschitz continuous:
$$\|\nabla f(x) - \nabla f(y)\| \leq L\,\|x - y\|, \qquad \forall\, x, y \in \mathbb{R}^{d}.$$
The condition number is defined as $\kappa = L/\mu$.
When $f$ is $\mu$-strongly convex and $L$-smooth, the optimization problem (1) has a unique global minimizer $x^{\star}$. We focus on iterative methods to find $x^{\star}$. A useful measure of convergence is the Bregman divergence of $f$, defined as
$$D_f(x, y) := f(x) - f(y) - \langle \nabla f(y), x - y\rangle.$$
In particular, $D_f(x, x^{\star}) = f(x) - f(x^{\star})$, since $\nabla f(x^{\star}) = 0$. Various bounds and identities on the Bregman divergence can be found in Appendix A.
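To make the definition concrete, here is a minimal Python sketch (not from the paper; the quadratic test function and all numerical values are illustrative) that evaluates the Bregman divergence of a differentiable function:

```python
import numpy as np

def bregman_divergence(f, grad_f, x, y):
    """Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# Illustrative quadratic f(x) = 0.5 * x^T A x, for which
# D_f(x, y) = 0.5 * (x - y)^T A (x - y).
A = np.diag([1.0, 10.0])
f = lambda z: 0.5 * z @ A @ z
grad_f = lambda z: A @ z

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(bregman_divergence(f, grad_f, x, y))   # equals 0.5 * (x - y) @ A @ (x - y)
```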
Heavy-ball methods and flow.
Over the past two decades, first-order methods, which rely solely on gradient information rather than the Hessian as required by Newton's method, have gained significant interest due to their efficiency and adaptability to large-scale data-driven applications and machine learning tasks (Bottou et al., 2018). Among these methods, gradient descent is the most straightforward and well-established algorithm. However, for ill-conditioned problems, where the condition number $\kappa \gg 1$, gradient descent suffers from slow convergence.
To accelerate gradient descent, a momentum term was introduced, encouraging the method to move along search directions that utilize not only the current but also previously seen information. The heavy-ball (HB) method, also known as the momentum method (Polyak, 1964), takes the form:
$$x_{k+1} = x_k - \alpha\,\nabla f(x_k) + \beta\,(x_k - x_{k-1}), \tag{2}$$
where $\alpha$ and $\beta$ are constant parameters. Polyak motivated the method by an analogy to a “heavy ball” moving in a potential well defined by the objective function $f$. The corresponding ordinary differential equation (ODE) model is commonly referred to as the heavy-ball flow (Polyak, 1964):
$$x'' + \gamma\,x' + \eta\,\nabla f(x) = 0, \tag{3}$$
where $x = x(t)$, $'$ denotes differentiation with respect to time $t$, and $\gamma$ and $\eta$ are positive constant parameters.
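As an illustration (not from the paper), a minimal Python sketch of the heavy-ball iteration (2) on a strongly convex quadratic follows; the step size and momentum values are heuristic placeholders, not Polyak's tuned choice.

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, num_iters=1000):
    """Heavy-ball iteration (2): x_{k+1} = x_k - alpha*grad_f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = x - alpha * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Strongly convex quadratic with mu = 1 and L = 100 (condition number kappa = 100).
A = np.diag([1.0, 100.0])
grad_f = lambda z: A @ z
x_final = heavy_ball(grad_f, np.array([1.0, 1.0]), alpha=0.01, beta=0.9)
print(np.linalg.norm(x_final))   # distance to the minimizer x* = 0
```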
Non-convergence and non-acceleration of the HB method.
Polyak (1964) showed that for (2) with $\alpha = 4/(\sqrt{L} + \sqrt{\mu})^{2}$ and $\beta = \bigl((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\bigr)^{2}$, and when $x_0$ is sufficiently close to the optimal solution $x^{\star}$, $x_k$ converges at rate $\mathcal{O}(q^{k})$, where $q = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$. Polyak's choice relies on the spectral analysis of a linear system, so the accelerated rate is limited to convex quadratic objectives and is local, i.e., valid only for iterates near $x^{\star}$. Indeed, Lessard et al. (2016) designed a non-convergent example showing that Polyak's choice of parameters does not guarantee global convergence for general strongly convex optimization. With different parameter choices, global linear convergence of the HB method has been established in Ghadimi et al. (2015); Sun et al. (2019); Saab Jr et al. (2022); Shi et al. (2022). However, the best known rate, given by Shi et al. (2022), coincides with the $1 - \mathcal{O}(1/\kappa)$ rate of gradient descent and is not the accelerated rate $1 - \mathcal{O}(1/\sqrt{\kappa})$.
The absence of acceleration is not due to a technical difficulty in the convergence analysis. Recently, Goujaud et al. (2023) demonstrated that the HB method provably fails to achieve an accelerated convergence rate for smooth and strongly convex problems. Specifically, for any positive parameters $\alpha$ and $\beta$ in (2), either there exists an $L$-smooth, $\mu$-strongly convex function $f$ and an initialization such that HB fails to converge, or, even in the class of smooth and strongly convex quadratic functions, the convergence rate is not accelerated: it is no better than $1 - \mathcal{O}(1/\kappa)$. For more related works on HB methods, see Appendix B.1.
Accelerated first-order methods.
To accelerate the HB method, one can introduce either an additional gradient step with adequate decay or an extrapolation step into the algorithm; see Wilson et al. (2021); Siegel (2019); Chen & Luo (2021).
The Nesterov accelerated gradient (NAG) method (Nesterov, 1983, page 81) can be viewed as an alternative enhancement of the HB method. Nesterov's approach calculates the gradient at points that are extrapolated based on the inertial force:
$$\begin{aligned}
x_{k+1} &= y_k - \frac{1}{L}\,\nabla f(y_k),\\
y_{k+1} &= x_{k+1} + \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,(x_{k+1} - x_k).
\end{aligned} \tag{4}$$
Nesterov devised the method of estimate sequences (Nesterov, 2013) to prove that (4) achieves the accelerated linear convergence rate $1 - 1/\sqrt{\kappa}$.
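For concreteness, a minimal Python sketch of iteration (4) (not taken from the paper; the strong convexity constant mu and smoothness constant L are assumed to be known) is:

```python
import numpy as np

def nag_strongly_convex(grad_f, x0, mu, L, num_iters=1000):
    """NAG (4): gradient evaluated at the extrapolated point y_k."""
    kappa = L / mu
    momentum = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    x, y = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = y - (1.0 / L) * grad_f(y)        # gradient step at the extrapolated point
        y = x_next + momentum * (x_next - x)      # extrapolation (inertial force)
        x = x_next
    return x
```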
Subsequently, numerous accelerated gradient methods have been developed for smooth strongly convex optimization problems (Lin et al., 2015; Drusvyatskiy et al., 2018; Bubeck et al., 2015; Aujol et al., 2022; Van Scoy et al., 2017; Cyrus et al., 2018), to name just a few. However, little is known beyond convex optimization. One reason is that the techniques developed are often tailored to the convexity of the objective function, making them difficult to extend to non-convex cases. In machine learning terms, these approaches lack generalization ability.
Main contributions for smooth strongly convex optimization.
We propose a variant of the HB method in the form of
$$x_{k+1} = x_k - \alpha\,\bigl(2\nabla f(x_k) - \nabla f(x_{k-1})\bigr) + \beta\,(x_k - x_{k-1}). \tag{5}$$
The most notable yet simple change is using the extrapolated gradient $2\nabla f(x_k) - \nabla f(x_{k-1})$, not $\nabla f(x_k)$, to approximate $\nabla f(x_{k+1})$. Namely, an over-relaxation technique (Hadjidimos, 1978) is applied to the gradient term. We therefore name (5) the accelerated over-relaxation heavy-ball (AOR-HB) method.
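A minimal Python sketch of the update (5) follows; it is only an illustration, and the constants alpha and beta passed in are placeholders rather than the parameter choice analyzed in the paper (cf. scheme (14) and Theorem 1.1).

```python
import numpy as np

def aor_hb(grad_f, x0, alpha, beta, num_iters=1000):
    """AOR-HB (5): momentum step plus an over-relaxed gradient 2*g_k - g_{k-1}."""
    x_prev, x = x0.copy(), x0.copy()
    g_prev = grad_f(x_prev)
    for _ in range(num_iters):
        g = grad_f(x)
        x_next = x - alpha * (2.0 * g - g_prev) + beta * (x - x_prev)
        x_prev, x, g_prev = x, x_next, g
    return x
```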
Rather than second-order ODEs, we consider a first-order ODE system proposed in Chen & Luo (2021):
$$x' = y - x, \qquad y' = x - y - \frac{1}{\mu}\,\nabla f(x), \tag{6}$$
which is a special case of the HB flow (3) with $\gamma = 2$ and $\eta = 1/\mu$ when $y$ is eliminated. Although (6) is mathematically equivalent to an HB flow, the structure of this first-order ODE system is crucial for the convergence analysis and algorithmic design. The same second-order ODE can correspond to different first-order systems. For example, another system equivalent to the HB flow (3) is
$$x' = v, \qquad v' = -\gamma\,v - \eta\,\nabla f(x), \tag{7}$$
where $v$ represents velocity, offering a physical interpretation. However, the convergence analysis is less transparent in the form (7) or in the original second-order ODE (3).
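For completeness, the elimination step behind (6) can be verified directly: differentiating $x' = y - x$ and substituting the equation for $y'$ gives
$$x'' = y' - x' = \bigl(x - y - \tfrac{1}{\mu}\nabla f(x)\bigr) - (y - x) = -2x' - \tfrac{1}{\mu}\nabla f(x),$$
that is, $x'' + 2x' + \frac{1}{\mu}\nabla f(x) = 0$, which is the HB flow (3) with $\gamma = 2$ and $\eta = 1/\mu$.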
We obtain AOR-HB by discretizing (6) with two iterates, cf. (14). The over-relaxation is used to symmetrize the error equation (19), which is a novel contribution compared to Luo & Chen (2022) and Chen & Luo (2021). The parameters $\alpha$ and $\beta$ in (5) are derived from the time step size of the discretization. We rigorously prove that AOR-HB enjoys global linear convergence with an accelerated rate, which closes a long-standing theoretical gap in optimization theory.
Theorem 1.1 (Convergence of AOR-HB method).
Suppose $f$ is $\mu$-strongly convex and $L$-smooth. Let $(x_k, y_k)$ be generated by scheme (14) with arbitrary initial value $(x_0, y_0)$ and step size $\alpha = \sqrt{\mu/L}$. Then there exists a constant $C > 0$ so that we have the accelerated linear convergence
$$D_f(x_k, x^{\star}) + \frac{\mu}{2}\,\|y_k - x^{\star}\|^{2} \leq C\left(1 + \frac{1}{2}\sqrt{\frac{\mu}{L}}\,\right)^{-k}. \tag{8}$$
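As a purely illustrative numerical check (not part of the paper's results), one can compare gradient descent with the AOR-HB form (5) on an ill-conditioned quadratic; the parameter choices below are heuristic stand-ins, not the ones prescribed by Theorem 1.1.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x with mu = 1, L = 1000 (kappa = 1000).
mu, L = 1.0, 1000.0
A = np.diag(np.linspace(mu, L, 50))
grad_f = lambda z: A @ z
x0 = np.ones(50)

def run(update, x0, num_iters=2000):
    """Iterate x_{k+1} = update(x_k, x_{k-1}) and return the final distance to x* = 0."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_prev, x = x, update(x, x_prev)
    return np.linalg.norm(x)

kappa = L / mu
beta = (1.0 - 1.0 / np.sqrt(kappa)) ** 2        # heuristic momentum parameter
gd  = lambda x, x_prev: x - (1.0 / L) * grad_f(x)
aor = lambda x, x_prev: x - (1.0 / L) * (2.0 * grad_f(x) - grad_f(x_prev)) + beta * (x - x_prev)

print("gradient descent:", run(gd, x0))
print("AOR-HB form (5):", run(aor, x0))
```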
Remark on non-strongly convex optimization.
When $\mu = 0$, we propose a variant of the AOR-HB method that incorporates the dynamic time rescaling introduced in (Luo & Chen, 2022):
$$x_{k+1} = x_k - \alpha_k\,\bigl(2\nabla f(x_k) - \nabla f(x_{k-1})\bigr) + \beta_k\,(x_k - x_{k-1}), \tag{9}$$
where the iteration-dependent parameters $\alpha_k$ and $\beta_k$ are determined by the dynamic time rescaling. This variant achieves an accelerated rate of $\mathcal{O}(1/k^{2})$, comparable to NAG. We refer to Appendix D for the details and proofs. Another approach to extending the acceleration uses a perturbed (regularized) objective, as discussed in (Lessard et al., 2016, Section 5.4).
Relation to other acceleration methods.
The AOR term can be treated as adding a gradient correction in the high-resolution ODE model (Shi et al., 2022). AOR-HB (5) can also be seen as a special case of the ‘SIE’ iteration described in Zhang et al. (2021). However, the parameter choice that recovers AOR-HB does not satisfy the conditions of their convergence analysis (Zhang et al., 2021, Theorem 3).
For strongly convex optimization, acceleration can be viewed from various perspectives, with the resulting three-term formulas differing only by a higher-order perturbation. With a proper change of variables, NAG (4) can be seen as a parameter perturbation of AOR-HB in the form (5). The performance of AOR-HB is thus comparable to NAG but less sharp than that of the triple momentum (TM) method (Van Scoy et al., 2017) for strongly convex optimization.
However, extending these methods (NAG, TM, SIE, and high-resolution ODEs) beyond convex optimization is rare and challenging. In contrast, AOR-HB has superior generalization capability, as the true driving force of acceleration is better captured by the first-order ODE model (6).
Extension to composite convex minimization.