Accelerated Over-Relaxation Heavy-Ball Method: Achieving Global Accelerated Convergence with Broad Generalization (2025)

Jingrong Wei
Department of Mathematics
University of California, Irvine
Irvine, CA 92697
jingronw@uci.edu
Long Chen
Department of Mathematics
University of California, Irvine
Irvine, CA 92697
chenlong@math.uci.edu
Corresponding author.

Abstract

The heavy-ball momentum method accelerates gradient descent with a momentum term but lacks accelerated convergence for general smooth strongly convex problems. This work introduces the Accelerated Over-Relaxation Heavy-Ball (AOR-HB) method, the first variant with provable global and accelerated convergence for such problems. AOR-HB closes a long-standing theoretical gap, extends to composite convex optimization and min-max problems, and achieves optimal complexity bounds. It offers three key advantages: (1) broad generalization ability, (2) potential to reshape acceleration techniques, and (3) conceptual clarity and elegance compared to existing methods.

1 Introduction

We first consider the convex optimization problem:

$$\min_{x\in\mathbb{R}^d} f(x), \qquad (1)$$

where the objective function $f$ is $\mu$-strongly convex and $L$-smooth. Later on we consider extensions to composite convex optimization $\min_{x} f(x) + g(x)$ with a non-smooth function $g$, and to a class of min-max problems $\min_{u\in\mathbb{R}^m}\max_{p\in\mathbb{R}^n} f(u) - g(p) + \langle Bu, p\rangle$ with bilinear coupling.

Notation.

$\mathbb{R}^d$ denotes $d$-dimensional Euclidean space with the standard $\ell_2$-inner product $\langle\cdot,\cdot\rangle$ and the induced norm $\|\cdot\|$. $f:\mathbb{R}^d\to\mathbb{R}$ is a differentiable function. We say $f$ is $\mu$-strongly convex when there exists $\mu>0$ such that

$$f(y) - f(x) - \langle \nabla f(x), y - x\rangle \geq \frac{\mu}{2}\|x - y\|^2, \qquad \forall\, x, y \in \mathbb{R}^d.$$

We say $f$ is $L$-smooth with $L>0$ if its gradient is Lipschitz continuous:

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \qquad \forall\, x, y \in \mathbb{R}^d.$$

The condition number is defined as $\kappa = L/\mu$.

When $f$ is $\mu$-strongly convex and $L$-smooth, the optimization problem (1) has a unique global minimizer $x^*$. We focus on iterative methods to find $x^*$. A useful measure of convergence is the Bregman divergence of $f$, defined as $D_f(y,x) := f(y) - f(x) - \langle\nabla f(x), y - x\rangle$. In particular, $D_f(x, x^*) = f(x) - f(x^*)$, since $\nabla f(x^*) = 0$. Various bounds and identities on the Bregman divergence can be found in Appendix A.
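As a quick numerical illustration (ours, not part of the paper), the Bregman divergence is straightforward to evaluate; for a quadratic $f(x) = \frac{1}{2}x^\top A x$ it reduces to $\frac{1}{2}\|y - x\|_A^2$. The function names and the test matrix below are assumptions chosen only for this sketch.

```python
import numpy as np

def bregman_divergence(f, grad_f, y, x):
    """D_f(y, x) = f(y) - f(x) - <grad f(x), y - x>."""
    return f(y) - f(x) - grad_f(x) @ (y - x)

# Illustrative strongly convex quadratic: mu = 1, L = 10
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x, y = np.array([1.0, 1.0]), np.array([0.5, -2.0])
# For quadratics this equals 0.5 * (y - x)^T A (y - x)
print(bregman_divergence(f, grad_f, y, x))
```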

Heavy-ball methods and flow.

Over the past two decades, first-order methods, which rely solely on gradient information rather than the Hessian required by Newton's method, have gained significant interest due to their efficiency and adaptability to large-scale data-driven applications and machine learning tasks (Bottou et al., 2018). Among these methods, gradient descent is the most straightforward and well-established. However, for ill-conditioned problems, where the condition number $\kappa \gg 1$, gradient descent suffers from slow convergence.

To accelerate gradient descent, a momentum term was introduced, encouraging the method to move along search directions that use not only current but also previously seen information. The heavy-ball (HB) method, also known as the momentum method (Polyak, 1964), takes the form:

$$x_{k+1} = x_k - \gamma \nabla f(x_k) + \beta (x_k - x_{k-1}), \qquad (2)$$

where $\beta$ and $\gamma$ are constant parameters. Polyak motivated the method by an analogy to a "heavy ball" moving in a potential well defined by the objective function $f$. The corresponding ordinary differential equation (ODE) model is commonly referred to as the heavy-ball flow (Polyak, 1964):

$$x'' + \theta x' + \eta \nabla f(x) = 0, \qquad (3)$$

where $x = x(t)$, $x'$ denotes the derivative with respect to $t$, and $\theta$ and $\eta$ are positive constant parameters.
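For concreteness, here is a minimal sketch of the HB iteration (2); the quadratic test problem and the particular values of $\gamma$ and $\beta$ are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def heavy_ball(grad_f, x0, gamma, beta, num_iters=1000):
    """Heavy-ball iteration (2): x_{k+1} = x_k - gamma * grad_f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()   # common convention: x_{-1} = x_0
    for _ in range(num_iters):
        x_next = x - gamma * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Illustrative ill-conditioned quadratic: f(x) = 0.5 * x^T A x with mu = 1, L = 100
A = np.diag([1.0, 100.0])
x_min = heavy_ball(lambda x: A @ x, np.array([1.0, 1.0]), gamma=1.0 / 100.0, beta=0.8)
```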

Non-convergence and non-acceleration of the HB method.

Polyak (1964) showed that for (2) with $\beta = \left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2$ and $\gamma = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$, when $x_k$ is sufficiently close to the optimal solution $x^*$, $\|x_k - x^*\|$ converges at rate $\frac{1-\sqrt{\rho}}{1+\sqrt{\rho}}$, where $\rho = 1/\kappa$. Polyak's choice relies on the spectral analysis of the linear system, so the accelerated rate is limited to convex quadratic objectives and is local, holding only for iterates near $x^*$. Indeed, Lessard et al. (2016) constructed a non-convergent example showing that Polyak's choice of parameters does not guarantee global convergence for general strongly convex optimization. With different parameters, global linear convergence of the HB method has been established in Ghadimi et al. (2015); Sun et al. (2019); Saab Jr et al. (2022); Shi et al. (2022). However, the best rate, $1 - \mathcal{O}(\rho)$, given by Shi et al. (2022), coincides with that of gradient descent and is not the accelerated rate $1 - \mathcal{O}(\sqrt{\rho})$.

The absence of acceleration is not due to a technical difficulty in the convergence analysis. Recently, Goujaud et al. (2023) demonstrated that the HB method provably fails to achieve an accelerated convergence rate for smooth and strongly convex problems. Specifically, for any positive parameters $\beta$ and $\gamma$ in (2), either there exists an $L$-smooth, $\mu$-strongly convex function and an initialization such that HB fails to converge, or, even within the class of smooth and strongly convex quadratic functions, the convergence rate is not accelerated: $1 - \mathcal{O}(\rho)$. For more related works on HB methods, see Appendix B.1.

Accelerated first-order methods.

To accelerate the HB method, one can introduce either an additional gradient step with adequate decay or an extrapolation step into the algorithm; see Wilson et al. (2021); Siegel (2019); Chen & Luo (2021).

The Nesterov accelerated gradient (NAG) method (Nesterov, 1983, page 81) can be viewed as an alternative enhancement of the HB method. Nesterov's approach calculates the gradient at points that are extrapolated based on the inertial force:

$$x_{k+1} = x_k + \beta(x_k - x_{k-1}) - \gamma \nabla f\big(x_k + \beta(x_k - x_{k-1})\big), \quad\text{with } \beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}},\ \gamma = \frac{1}{L}. \qquad (4)$$

Nesterov devised the method of estimate sequences (Nesterov, 2013) to prove that (4) achieves the accelerated linear convergence rate $1 - \sqrt{\rho}$.
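A minimal sketch of the NAG iteration (4) follows; the function name and the quadratic test problem are assumptions made only for illustration.

```python
import numpy as np

def nag(grad_f, x0, mu, L, num_iters=1000):
    """NAG iteration (4): the gradient is evaluated at the extrapolated point."""
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    gamma = 1.0 / L
    x_prev, x = x0.copy(), x0.copy()   # convention: x_{-1} = x_0
    for _ in range(num_iters):
        z = x + beta * (x - x_prev)          # extrapolation (inertial) point
        x_next = z - gamma * grad_f(z)       # gradient step at the extrapolated point
        x_prev, x = x, x_next
    return x

# Illustrative strongly convex quadratic with mu = 1, L = 100
A = np.diag([1.0, 100.0])
x_min = nag(lambda x: A @ x, np.array([1.0, 1.0]), mu=1.0, L=100.0)
```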

Numerous accelerated gradient methods have since been developed for smooth strongly convex optimization problems (Lin et al., 2015; Drusvyatskiy et al., 2018; Bubeck et al., 2015; Aujol et al., 2022; Van Scoy et al., 2017; Cyrus et al., 2018), to name just a few. However, little is known beyond convex optimization. One reason is that the techniques developed are often specialized to the convexity of the objective function, making them difficult to extend to non-convex cases. In machine learning terms, these approaches lack generalization ability.

Main contributions for smooth strongly convex optimization.

We propose a variant of the HB method in the form of

$$x_{k+1} = x_k - \gamma\big(2\nabla f(x_k) - \nabla f(x_{k-1})\big) + \beta(x_k - x_{k-1}), \qquad (5)$$
$$\gamma = \frac{1}{(\sqrt{L}+\sqrt{\mu})^2}, \qquad \beta = \frac{L}{(\sqrt{L}+\sqrt{\mu})^2}.$$

The most notable yet simple change is using $2\nabla f(x_k) - \nabla f(x_{k-1})$ rather than $\nabla f(x_k)$ to approximate $\nabla f(x)$. Namely, an over-relaxation technique (Hadjidimos, 1978) is applied to the gradient term. We therefore name (5) the accelerated over-relaxation heavy-ball (AOR-HB) method.
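A minimal Python sketch of iteration (5) with the parameters given above; the quadratic test problem is an illustrative assumption.

```python
import numpy as np

def aor_hb(grad_f, x0, mu, L, num_iters=1000):
    """AOR-HB iteration (5): over-relaxed gradient 2*grad_f(x_k) - grad_f(x_{k-1})."""
    gamma = 1.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = L / (np.sqrt(L) + np.sqrt(mu)) ** 2
    x_prev, x = x0.copy(), x0.copy()   # convention: x_{-1} = x_0
    g_prev = grad_f(x_prev)
    for _ in range(num_iters):
        g = grad_f(x)
        x_next = x - gamma * (2.0 * g - g_prev) + beta * (x - x_prev)
        x_prev, x, g_prev = x, x_next, g
    return x

# Illustrative strongly convex quadratic with mu = 1, L = 100
A = np.diag([1.0, 100.0])
x_min = aor_hb(lambda x: A @ x, np.array([1.0, 1.0]), mu=1.0, L=100.0)
```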

Rather than second-order ODEs, we consider a first-order ODE system proposed in Chen & Luo (2021):

$$x' = y - x, \qquad y' = x - y - \frac{1}{\mu}\nabla f(x), \qquad (6)$$

which is a special case of the HB flow (3) with $\theta = 2$ and $\eta = \frac{1}{\mu}$ when $y$ is eliminated. Although (6) is mathematically equivalent to an HB flow, the structure of this $2\times 2$ first-order ODE system is crucial for convergence analysis and algorithmic design. The same second-order ODE can correspond to different first-order systems. For example, another one equivalent to the HB flow (3) is

$$x' = v, \qquad v' = -\theta v - \eta\nabla f(x), \qquad (7)$$

where $v$ represents velocity, offering a physical interpretation. However, the convergence analysis is less transparent in the form (7) or in the original second-order ODE (3).
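To make the role of the $(x, y)$ system concrete, here is a simple explicit Euler integration of (6). This is only an illustrative discretization under assumed step size and test problem, not the scheme (14) used in the paper to derive AOR-HB.

```python
import numpy as np

def hb_flow_euler(grad_f, x0, y0, mu, step=1e-3, num_steps=100_000):
    """Explicit Euler discretization of the first-order system (6):
       x' = y - x,   y' = x - y - (1/mu) * grad_f(x)."""
    x, y = x0.copy(), y0.copy()
    for _ in range(num_steps):
        x_new = x + step * (y - x)
        y_new = y + step * (x - y - grad_f(x) / mu)
        x, y = x_new, y_new
    return x, y

# Illustrative quadratic with mu = 1, L = 10
A = np.diag([1.0, 10.0])
x_T, y_T = hb_flow_euler(lambda x: A @ x, np.array([1.0, -1.0]), np.zeros(2), mu=1.0)
```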

We obtain AOR-HB by discretization of (6) with two iterates $(x_k, y_k)$; cf. (14). The AOR is used to symmetrize the error equation (19), which represents a novel contribution compared to Luo & Chen (2022) and Chen & Luo (2021). The choice of parameters $\beta$ and $\gamma$ in (5) is derived from the time step size $\alpha = \sqrt{\mu/L}$. We rigorously prove that AOR-HB enjoys global linear convergence with an accelerated rate, which closes a long-standing theoretical gap in optimization theory.

Theorem 1.1 (Convergence of AOR-HB method).

Suppose $f$ is $\mu$-strongly convex and $L$-smooth. Let $(x_k, y_k)$ be generated by scheme (14) with initial value $(x_0, y_0)$ and step size $\alpha = \sqrt{\mu/L}$. Then there exists a constant $C_0 = C_0(x_0, y_0, \mu, L)$ such that we have the accelerated linear convergence

$$f(x_{k+1}) - f(x^*) + \frac{\mu}{2}\|y_{k+1} - x^*\|^2 \leq C_0\left(\frac{1}{1 + \frac{1}{2}\sqrt{\mu/L}}\right)^k, \qquad k \geq 1. \qquad (8)$$
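For intuition on what the rate in (8) means in iteration counts (a rough calculation of ours, not from the paper): since $\log(1 + \tfrac{1}{2}\sqrt{\mu/L}) \approx \tfrac{1}{2}\sqrt{\mu/L}$, reaching accuracy $\epsilon$ requires roughly $k \gtrsim 2\sqrt{L/\mu}\,\log(C_0/\epsilon)$ iterations, i.e. $\mathcal{O}(\sqrt{\kappa}\log(1/\epsilon))$ rather than the $\mathcal{O}(\kappa\log(1/\epsilon))$ of gradient descent. A small helper illustrating this:

```python
import math

def iterations_to_tol(mu, L, C0, eps):
    """Smallest k with C0 * (1 + 0.5*sqrt(mu/L))**(-k) <= eps, from the bound (8)."""
    rate = 1.0 / (1.0 + 0.5 * math.sqrt(mu / L))
    return math.ceil(math.log(eps / C0) / math.log(rate))

# Example: kappa = 1e4, C0 = 1, eps = 1e-8  ->  about 2*sqrt(kappa)*log(1/eps) ~ 3.7e3 iterations
print(iterations_to_tol(mu=1.0, L=1e4, C0=1.0, eps=1e-8))
```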

Remark on non-strongly convex optimization.

When $\mu = 0$, we propose a variant of the AOR-HB method that incorporates the dynamic time rescaling introduced in Luo & Chen (2022):

$$x_{k+1} = x_k - \frac{k}{k+3}\,\frac{1}{L}\big(2\nabla f(x_k) - \nabla f(x_{k-1})\big) + \frac{k}{k+3}(x_k - x_{k-1}), \qquad (9)$$

which achieves an accelerated rate of $\mathcal{O}(1/k^2)$, comparable to NAG. We refer to Appendix D for the details and proofs. Another approach to extend the acceleration involves using the perturbed objective $f(x) + \frac{\epsilon}{2}\|x\|^2$, as discussed in (Lessard et al., 2016, Section 5.4).
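A minimal sketch of the variant (9); the rank-deficient least-squares test problem and the initialization convention are illustrative assumptions.

```python
import numpy as np

def aor_hb_convex(grad_f, x0, L, num_iters=1000):
    """AOR-HB variant (9) for mu = 0 with dynamically rescaled coefficient k/(k+3)."""
    x_prev, x = x0.copy(), x0.copy()   # convention: x_{-1} = x_0
    g_prev = grad_f(x_prev)
    for k in range(num_iters):
        g = grad_f(x)
        w = k / (k + 3)
        x_next = x - w * (2.0 * g - g_prev) / L + w * (x - x_prev)
        x_prev, x, g_prev = x, x_next, g
    return x

# Illustrative smooth but not strongly convex problem: least squares with a rank-deficient matrix
B = np.array([[1.0, 1.0], [1.0, 1.0]])
grad_f = lambda x: B.T @ (B @ x - np.array([1.0, 1.0]))
L = np.linalg.norm(B.T @ B, 2)      # Lipschitz constant of the gradient
x_sol = aor_hb_convex(grad_f, np.array([2.0, -1.0]), L)
```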

Relation to other acceleration methods.

The AOR term $2\nabla f(x_k) - \nabla f(x_{k-1})$ can be treated as adding a gradient correction in the high-resolution ODE model (Shi et al., 2022). AOR-HB (5) can also be rendered as a special case of the 'SIE' iteration described in Zhang et al. (2021). However, the parameter choice ($\sqrt{s} = \frac{1}{L}$ and $m = 1$) recovering AOR-HB does not satisfy the condition in their convergence analysis (Zhang et al., 2021, Theorem 3).

For strongly convex optimization, acceleration can be viewed from various perspectives, with the resulting three-term formulas differing only by a higher-order $\mathcal{O}(\rho)$ perturbation. With a proper change of variables, NAG (4) can be seen as a parameter perturbation of AOR-HB in the form (5). The performance of AOR-HB is thus comparable to NAG but less precise than the triple momentum (TM) method (Van Scoy et al., 2017) for strongly convex optimization.

However, extending these methods (NAG, TM, SIE, and high-resolution ODEs) beyond convex optimization is rare and challenging. In contrast, AOR-HB has superior generalization capability, as the true driving force of acceleration is better captured by the $2\times 2$ first-order ODE model (6).

Extension to composite convex minimization.
