#### Acceleration of the PDHGM on Partially Strongly Convex Functions

Acceleration of the PDHGM on Partially Strongly Convex Functions
Tuomo Valkonen 1 2 3
Thomas Pock 0 1 2 3
0 Institute for Computer Graphics and Vision, Graz University of Technology , 8010 Graz , Austria
1 Department of Mathematical Sciences, University of Liverpool , Liverpool , UK
2 Department of Applied Mathematics and Theoretical Physics, University of Cambridge , Cambridge , UK
3 Digital Safety and Security Department, AIT Austrian Institute of Technology GmbH , 1220 Vienna , Austria
We propose several variants of the primal-dual method due to Chambolle and Pock. Without requiring full strong convexity of the objective functions, our methods are accelerated on subspaces with strong convexity. This yields mixed rates, O(1/N 2) with respect to initialisation and O(1/N ) with respect to the dual sequence, and the residual part of the primal sequence. We demonstrate the efficacy of the proposed methods on image processing problems lacking strong convexity, such as total generalised variation denoising and total variation deblurring.
Primal-dual; Accelerated; Subspace; Total generalised variation
1 Introduction
Let G : X → R and F : Y → R be convex, proper, and
lower semicontinuous functionals on Hilbert spaces X and
Y , possibly infinite dimensional. Also let K ∈ L(X ; Y ) be a
bounded linear operator. We then wish to solve the problem
This can under mild conditions on F (see, for example, [1,2])
also be written with the help of the convex conjugate F ∗ in
the minimax form
One possibility for the numerical solution of the latter form
is the primal–dual algorithm of Chambolle and Pock [3], a
type of proximal point or extragradient method, also
classified as the ‘modified primal–dual hybrid gradient method’
or PDHGM by Esser et al. [4]. If either G or F ∗ is strongly
convex, the method can be accelerated to O(1/N 2)
convergence rates of the iterates and an ergodic duality gap [3]. But
what if we have only partial strong convexity? For example,
what if
G(x ) = G0( P x )
for a projection operator P to a subspace X0 ⊂ X , and
strongly convex G0 : X0 → R? This kind of structure is
common in many applications in image processing and data
science, as we will more closely review in Sect. 5. Under
such partial strong convexity, can we obtain a method that
would give an accelerated rate of convergence at least for
P x ?
We provide a partially positive answer: we can obtain
mixed rates, O(1/N 2) with respect to initialisation, and
O(1/N ) with respect to bounds on the ‘residual variables’ y
and (I − P)x . In this respect, our results are similar to the
‘optimal’ algorithm of Chen et al. [5]. Instead of strong
convexity, they assume smoothness of G to derive a primal–dual
algorithm based on backward–forward steps, instead of the
backward–backward steps of [3].
The derivation of our algorithms is based, firstly, on
replacing simple step length parameters by a variety of abstract
step length operators and, secondly, a type of abstract partial
strong monotonicity property
∂ G(x ) − ∂ G(x ), T −1(x − x )
≥ x − x 2T −1,∗
− penalty_term,
the full details of which we provide in Sect. 2. Here T is an
auxiliary step length operator. Our factor of strong convexity
is a positive semidefinite operator ≥ 0; however, to make
our algorithms work, we need to introduce additional
artificial strong convexity through another operator , which may
not satisfy 0 ≤ ≤ . This introduces the penalty term in
(1). The exact procedure can be seen as a type of
smoothing, famously studied by Nesterov [6], and more recently,
for instance, by Beck and Teboulle [7]. In these approaches,
one computes a priori a level of smoothing—comparable to
—needed to achieve prescribed solution quality. One then
solves a smoothed problem, which can be done at O(1/N 2)
rate. However, to obtain a solution with higher quality than
the a priori prescribed one, one needs to solve a new
problem from scratch, as the smoothing alters the problem being
solved. One can also employ restarting strategies, to take
some advantage of the previous solution, see, for
example, [8]. Our approach does not depend on restarting and
a priori chosen solution qualities: the method will converge
to an optimal solution to the original non-smooth problem.
Indeed, the introduced additional strong convexity is
controlled automatically.
The ‘fast dual proximal gradient method’, or FDPG
[9], also possesses different type of mixed rates, O(1/N )
for the primal, and O(1/N 2) for the dual. This is,
however, under standard strong convexity assumptions. Other
than that, our work is related to various further
developments from the PDHGM, such as variants for nonlinear K
[10,11] and non-convex G [12]. The PDHGM has been the
basis for inertial methods for monotone inclusions [13] and
primal–dual stochastic coordinate descent methods without
separability requirements [14]. Finally, the FISTA [15,16]
can be seen as a primal-only relative of the PDHGM. Not
attempting to do full justice here to the large family of
closely related methods, we point to [4,17,18] for further
references.
The contributions of our paper are twofold: firstly, to paint
a bigger picture of what is possible, we derive a very
general version of the PDHGM. This algorithm, useful as a basis
for deriving other new algorithms besides ours, is the
content of Sect. 2. In this section, we provide an abstract bound
on the iterates of the algorithm, later used to derive
convergence rates. In Sect. 3, we extend the bound to include an
ergodic duality gap under stricter conditions on the
acceleration scheme and the step length operators. A by-product
of this work is the shortest convergence rate proof for the
accelerated PDHGM known to us. Afterwards, in Sect. 4, we
derive from the general algorithm two efficient mixed-rate
algorithms for problems exhibiting strong convexity only on
subspaces. The first one employs the penalty or smoothing ψ
on both the primal and the dual. The second one only employs
the penalty on the dual. We finish the study with numerical
experiments in Sect. 5. The main results of interest for
readers wishing to apply our work are Algorithms 3 and 4 along
with the respective convergence results, Theorems 4.1 and
4.2.
2 A General Primal–Dual Method
2.1 Notation
To make the notation definite, we denote by L(X ; Y ) the
space of bounded linear operators between Hilbert spaces X
and Y . For T , S ∈ L(X ; X ), the notation T ≥ S means that
T − S is positive semidefinite; in particular, T ≥ 0 means
that T is positive semidefinite. In this case, we also denote
The identity operator is denoted by I , as is standard.
For 0 ≤ M ∈ L(X ; X ), which can possibly not be
selfadjoint, we employ the notation
a, b M :=
Ma, b , and
a M :=
We also use the notation T −1,∗ := (T −1)∗.
2.2 Background
As in the introduction, let us be given convex, proper, lower
semicontinuous functionals G : X → R and F ∗ : Y → R on
Hilbert spaces X and Y , as well as a bounded linear operator
K ∈ L(X ; Y ). We then wish to solve the minimax problem
assuming the existence of a solution u = (x , y) satisfying
the optimality conditions
− K ∗ y ∈ ∂ G(x ), and K x ∈ ∂ F ∗(y).
Such a point always exists if lim x →∞ G(x )/ x = ∞ and
lim y →∞ F ∗(y)/ y = ∞, as follows from [2, Proposition
VI.1.2 & Proposition VI.2.2]. More generally the existence
has to be proved explicitly. In finite dimensions, see, for
example, [19] for several sufficient conditions.
The primal–dual method of Chambolle and Pock [3] for
the solving (P) consists of iterating
xi+1 := (I + τi ∂ G)−1(xi − τi K ∗ yi ),
x¯i+1 := ωi (xi+1 − xi ) + xi+1,
yi+1 := (I + σi+1∂ F ∗)−1(yi + σi+1 K x¯i+1).
In the basic version of the algorithm, ωi = 1, τi ≡ τ0, and
σi ≡ σ0, assuming that the step length parameters satisfy
τ0σ0 K 2 < 1. The method has O(1/N ) rate for the ergodic
duality gap [3]. If G is strongly convex with factor γ , we may
use the acceleration scheme [3]
ωi := 1/ 1 + 2γ τi , τi+1 := τi ωi , and σi+1 := σi /ωi ,
(4)
to achieve O(1/N 2) convergence rates of the iterates and an
ergodic duality gap, defined in [3]. To motivate our choices
later on, observe that σ0 is never used expect to calculate σ1.
We may therefore equivalently parametrise the algorithm by
δ = 1 − K 2τ0σ0 > 0.
We note that the order of the steps in (3) is different from
the original ordering in [3]. This is because with the present
order, the method (3) may also be written in the proximal
point form. This formulation, first observed in [20] and later
utilised in [10,11,21], is also what we will use to
streamline our analysis. Introducing the general variable splitting
notation,
u = (x , y),
the system (3) then reduces into
0 ∈ H (ui+1) + Mbasic,i (ui+1 − ui ),
for the monotone operator
H (u) :=
∂∂GF ∗(x(y))+−KK∗xy ,
and the preconditioning or step length operator
Mbasic,i :=
We note that the optimality conditions (OC) can also be
encoded as 0 ∈ H (u).
2.3 Abstract Partial Monotonicity
Our plan now is to formulate a general version of (3),
replacing τi and σi by operators Ti ∈ L(X ; X ) and i ∈ L(Y ; Y ).
In fact, we will need two additional operators Ti ∈ L(X ; X )
and Tˆi ∈ L(Y ; Y ) to help communicate change in Ti to i .
They replace ωi in (3b) and (7), operating as Tˆi+1 K Ti−1 ≈
ωi K from both sides of K . The role of Ti is to split the
original primal step length τi in the space X into the two parts
Ti and Ti with potentially different rates. The role of Tˆi is
to transfer Ti into the space Y , to eventually control the dual
step length i . In the basic algorithm (3), we would simply
have Ti = Ti = τi I ∈ L(X ; X ), and Tˆi = τi I ∈ L(Y ; Y )
for the scalar τi .
To start the algorithm derivation, we now formulate
abstract forms of partially strong monotonicity. As a first
step, we take subsets of invertible operators
T ⊂ L(X ; X ), and Tˆ ⊂ L(Y ; Y ),
as well as subsets of positive semidefinite operators
0 ≤ K ⊂ L(X ; X ), and 0 ≤ Kˆ ⊂ L(Y ; Y ).
We assume T and Tˆ closed with respect to composition:
T1T2 ∈ T for T1, T2 ∈ T .
We use the sets K and Kˆ as follows. We suppose that ∂ G
is partially strongly (ψ, T , K)-monotone, meaning that for
all x , x ∈ X, T ∈ T , ∈ [0, ] + K holds
∂ G(x ) − ∂ G(x ), T −1(x − x )
x − x 2T −1,∗
− ψT −1,∗( − )(x − x ),
for some family of functionals {ψT : X → R}, and a
linear operator 0 ≤ ∈ L(X ; X ) which models partial strong
monotonicity. The inequality in (G-PM), and all such set
inequalities in the remainder of this paper, is understood to
hold for all elements of the sets ∂ G(x ) and ∂ G(x ). The
operator T ∈ T acts as a testing operator, and the operator
∈ K as introduced strong monotonicity. The functional
ψT −1,∗( − ) is a penalty corresponding to the test and the
introduced strong monotonicity. The role of testing will
become more apparent in Sect. 2.4.
Similarly to (G-PM), we assume that ∂ F ∗ is (φ, Tˆ , Kˆ
)monotone with respect to Tˆ in the sense that for all y, y ∈ Y ,
Tˆ ∈ Tˆ , R ∈ Kˆ holds
∂ F ∗(y ) − ∂ F ∗(y), Tˆ −1,∗(y − y)
y − y 2Tˆ −1 R − φTˆ −1 R (y − y),
for some family of functionals {φT : Y → R}. Again, the
inequality in (F∗-PM) is understood to hold for all elements
of the sets ∂ F ∗(y ) and ∂ F ∗(y).
In our general analysis, we do not set any conditions on
ψ and φ, as their role is simply symbolic transfer of
dissatisfaction of strong monotonicity into a penalty in our abstract
convergence results.
Let us next look at a few examples on how (G-PM) or
(F∗-PM) might be satisfied. First we have the very
wellbehaved case of quadratic functions.
Example 2.1 G(x ) = f − Ax 2/2 satisfies (G-PM) with
= A∗ A, K = {0}, and ψ ≡ 0 for any invertible T . Indeed,
G is differentiable with ∇ G(x ) − ∇ G(x ), T −1(x − x ) =
A∗ A(x − x ), T −1(x − x ) = x − x 2T −1,∗ .
The next lemma demonstrates what can be done when all
the parameters are scalar. It naturally extends to functions of
the form G(x1, x2) = G(x1) + G(x2) with corresponding
product form parameters.
Lemma 2.1 Let G : X → R be convex and proper with
dom G bounded. Then,
γ
G(x ) − G(x ) ≥ ∂ G(x ), x − x + 2
Proof We denote A := dom G. If x ∈/ A, we have G(x ) =
∞, so (8) holds irrespective of γ and C . If x ∈/ A, we have
∂ G(x ) = ∅, so (8) again holds. We may therefore compute
Algorithm 1 Primal–dual algorithm with partial acceleration
Require: F∗ and G satisfying (G-PM) and (F∗-PM) for some sets and
spaces K, Kˆ , T , Tˆ , and 0 ≤ ∈ L(X ; X ). Initial invertible T0 ∈
L(X ; X ), T0 ∈ T , Tˆ1 ∈ Tˆ , and 1 ∈ L(Y ; Y ), as well as δ ∈ (0, 1),
satisfying for j = 0 the condition
xi+1 := (I + Ti ∂G)−1(xi − Ti K ∗ yi ),
w¯ i+1 := Tˆi+1 K Ti−1(xi+1 − xi ) + K xi+1,
yi+1 := (I + i+1∂ F∗)−1(yi + i+1w¯ i+1).
4: Find invertible Ti+1 ∈ L(X ; X ), Ti+1 ∈ T , Tˆi+2 ∈ Tˆ , and i+2 ∈
L(Y ; Y ) satisfying (9) with j = i + 1, as well as the condition
Si (Mi + ¯ i ) ≥ Si+1 Mi+1
for some 0 ≤ Ri+1 ∈ Kˆ and i ∈ [0, ] + K.
5: until a stopping criterion is fulfilled.
the constants based on x , x ∈ A. Now, there is a constant M
such that supx∈A x ≤ M . Then, x − x ≤ 2M . Thus, if
we pick C = 4M 2, then (γ /2)( x − x 2 − C ) ≤ 0 for every
γ ≥ 0 and x , x ∈ A. By the convexity of G, (8) holds.
Example 2.2 An indicator function ιA of a convex bounded
set A satisfies the conditions of Lemma 2.1. This is generally
what we will use and need.
2.4 A General Algorithm and the Idea of Testing
The only change we make to the proximal point formulation
(5) of the method (3) is to replace the basic step length or
preconditioning operator Mbasic,i by the operator
Mi :=
As we have remarked, the operators Tˆi+1 and Ti play the role
of ωi , acting from both sides of K . Our proposed algorithm
can thus be characterised as solving on each iteration i ∈ N
for the next iterate ui+1 the preconditioned proximal point
problem
To study the convergence properties of (PP), we define
the testing operator
Si :=
It will turn out that multiplying or ‘testing’ (PP) by this
operator will allow us to derive convergence rates. The testing of
(PP) by Si is why we introduced testing into the monotonicity
conditions (G-PM) and (F∗-PM). If we only tested (PP) with
Si = I , we could at most obtain ergodic convergence of the
duality gap for the unaccelerated method. But by testing with
something appropriate and faster increasing, such as (11), we
are able to extract better convergence rates from (PP).
We also set
¯ i =
Ti∗(K Ti−1 − Tˆi−+11 K )∗
2 Ri+1
for some i ∈ [0, ] + K and Ri+1 ∈ Kˆ . We will see in
Sect. 2.6 that ¯ i is a factor of partial strong monotonicity
for H with respect to testing by Si . With this, taking a fixed
δ > 0, the properties
Si (Mi + ¯ i ) ≥ Si+1 Mi+1, and
≥ 0,
will turn out to be the crucial defining properties for the
convergence rates of the iteration (PP). The method resulting
from the combination of (PP), (C1), and (C2) can also be
expressed as Algorithm 1. The main steps in developing
practical algorithms based on Algorithm 1 will be in the choice
of the various step length operators. This will be the content
of Sects. 3 and 4. Before this, we expand the conditions (C1)
and (C2) to see how they might be satisfied and study abstract
convergence results.
2.5 A Simplified Condition We expand
Si Mi =
as well as
Si ¯ i =
2Ti−1,∗ i
K Ti−1 − Tˆi−+11 K
2Tˆi−+11 Ri+1
−Tˆi−+11 K
−K ∗Tˆi−+11,∗
We observe that if S, T ∈ L(X ; Y ), then for arbitrary
invertible Z ∈ L(Y ; Y ) a type of Cauchy (or Young) inequality
holds, namely
The inequality here can be verified using the basic Cauchy
inequality 2 x , y ≤ x 2 + y 2. Applying (14) in (12), we
see that (C2) is satisfied when
Tˆi−+11 i−+11 ≥ K Zi−1 Zi−1,∗ K ∗, and
(1 − δ)Ti−1,∗Ti−1 ≥ Ti−1,∗ Zi∗ Zi T −1,
for some invertible Zi ∈ L(X ; X ). The second condition in
(15) is satisfied as an equality if
2δ x N − x 2TN−1,∗TN−1 ≤ C0 +
N −1
i=0
Di+1 := ψTi−1,∗( i − )(xi+1 − x )
By the spectral theorem for self-adjoint operators on Hilbert
spaces (e.g. [22, Chapter 12]), we can make the choice (16)
if
Equivalently, by the same spectral theorem, Ti−1Ti ∈ Q.
Therefore, we see from (15) that (C2) holds when
and Tˆi−+11 i−+11 ≥ 1 −1 δ K Ti−1Ti K ∗.
Also, (C1) can be rewritten as
Ti−1,∗(Ti−1 +2 i )−Ti−+11,∗Ti−+11
K Ti−+11 −Tˆi−+11 K
≥ 0.
2.6 Basic Convergence Result
Our main result on Algorithm 1 is the following theorem,
providing some general convergence estimates. It is, however,
important to note that the theorem does not yet directly prove
convergence, as its estimates depend on the rate of decrease
in TN TN∗ , as well as the rate of increase in the penalty sum
N −1 Di+1 coming from the dissatisfaction of strong
coni=0
vexity. Deriving these rates in special cases will be the topic
of Sect. 4.
Theorem 2.1 Let us be given K ∈ L(X ; Y ), and convex,
proper, lower semicontinuous functionals G : X → R and
F ∗ : Y → R on Hilbert spaces X and Y , satisfying (G-PM)
and (F∗-PM). Pick δ ∈ (0, 1), and suppose (C1) and (C2) are
satisfied for each i ∈ N for some invertible Ti ∈ L(X ; X ),
Ti ∈ T , Tˆi+1 ∈ Tˆ , and i+1 ∈ L(Y ; Y ), as well as
i ∈ [0, ] + K and Ri+1 ∈ Kˆ . Suppose that Ti−1,∗Ti−1
and Tˆi−+11 i−+11 are self-adjoint. Let u = (x , y) satisfy (OC).
Then, the iterates of Algorithm 1 satisfy
Di+1, (N ≥ 1),
Remark 2.1 The term Di+1, coming from the dissatisfaction
of strong convexity, penalises the basic convergence, which
is on the right-hand side of (17) presented by the constant
C0. If TN TN is of the order O(1/N 2), at least on a subspace,
and we can bound the penalty Di+1 ≤ C for some constant
C , then we clearly obtain mixed O(1/N 2) + O(1/N )
convergence rates on the subspace. If we can assume that Di+1
actually converges to zero at some rate, then it will even be
possible to obtain improved convergence rates. Since
typically Ti , Tˆi+1 0 reduce to scalar factors within Di+1,
this would require prior knowledge of the rates of
convergence xi → x and yi → y. Boundedness of the iterates
{(xi , yi )}i∞=0, we can, however, usually ensure.
Proof Since 0 ∈ H (u), we have
H (ui+1), Si∗(ui+1 − u) ⊂
−H (u), Si∗(ui+1 − u) .
Recalling the definition of Si from (11), and of H from (6),
it follows
− ∂ G(x ), Ti−1(xi+1 − x )
H (ui+1), Si∗(ui+1 − u) ⊂ ∂ G(xi+1)
+ ∂ F ∗(yi+1) − ∂ F ∗(y), Tˆi−+11,∗(yi+1 − y)
+ K ∗(yi+1 − y), Ti−1(xi+1 − x )
− K (xi+1 − x ), Tˆi−+11,∗(yi+1 − y) .
H (ui+1), Si∗(ui+1 − u)
− y 2Tˆi−+1i Ri+1
(yi+1 − y) − ψTi−1,∗( i − )(xi+1 − x )
+ K Ti−1(xi+1 − x ), yi+1 − y
− Tˆi−+11 K (xi+1 − x ), yi+1 − y .
Using the expression (13) for Si ¯ i , and (18) for Di+1, we
thus deduce
For arbitrary self-adjoint M ∈ L(X × Y ; X × Y ), we
calculate
ui+1 − ui , ui+1 − u M = 21 ui+1 − ui 2M
− 21 ui − u 2M + 21 ui+1 − u 2M .
We observe that Si Mi in (12) is self-adjoint as we have
assumed that Ti−1,∗T −1 and Tˆ −1
i i+1 i−+11 are self-adjoint. In
consequence, using (20) we obtain
Using (C1) to estimate 21 ui+1−u 2Si Mi and (C2) to eliminate
21 ui+1 − ui 2Si Mi yields
Mi (ui − ui+1), Si∗(ui+1 − u) ≤ 21 ui − u 2Si Mi
− 21 ui+1 − u 2Si+1 Mi+1 + 21 ui+1 − u 2Si ¯ i .
Combining (19) and (21) through (PP), we thus obtain
Summing (22) over i = 1, . . . , N − 1, and applying (C2) to
estimate SN MN from below, we obtain (17).
3 Scalar Off-diagonal Updates and the Ergodic Duality Gap
One relatively easy way to satisfy (G-PM), (F∗-PM), (C1)
and (C2) is to take the ‘off-diagonal’ step length operators Tˆi
and Ti as equal scalars. Another good starting point would
be to choose Ti = Ti . We, however, do not explore this route
in the present work. Instead, we now specialise Theorem 2.1
to the scalar case. We then explore ways to add estimates
of the ergodic duality gap into (17). While this would be
possible in the general framework through convexity notions
analogous to (G-PM) and (F∗-PM), the resulting gap would
not be particularly meaningful. We therefore concentrate on
the scalar off-diagonal updates to derive estimates on the
ergodic duality gap.
3.1 Scalar Specialisation of Algorithm 1
the condition (C2 ) then becomes
Ti ∈ Q, and
The off-diagonal terms cancelling out (C1 ) on the other hand
become
Algorithm 2 Primal–dual
acceleration—partially scalar
Require: F∗ and G satisfying (G-pm) and (F*-pm) for some sets K,
Kˆ , and 0 ≤ ∈ L(X ; X ). A choice of δ ∈ (0, 1). Initial invertible
step length operators T0 ∈ Q and 0 ∈ L(Y ; Y ), as well as step
length parameter τ0 > 0.
1: Choose initial iterates x0 ∈ X and y0 ∈ Y .
2: repeat
3: Find ωi > 0, i ∈ L(X ; X ), and i ∈ [0, ] + K satisfying
ωi (I + 2 i Ti ) i ≥ I, and Ti i ∈ Q.
4: Set
Ti+1 := Ti i , and τi+1 := τi ωi .
5: Find i+1 ∈ L(Y ; Y ) and Ri ∈ Kˆ satisfying
6: Pei−rf1o+rm2tRhie ≥upωdai−t1es i−+11 ≥ (1 − δ)−1 K Ti K ∗.
xi+1 := (I + Ti ∂G)−1(xi − Ti K ∗ yi ),
x¯i+1 := ωi (xi+1 − xi ) + xi+1,
yi+1 := (I + i+1∂ F∗)−1(yi + i+1 K x¯i+1).
7: until a stopping criterion is fulfilled.
τi−1(I + 2 i Ti )Ti−1 ≥ τi−+11Ti−+11, and
τi−+11( i+1 + 2 Ri+1) ≥ τi−+12 i−+12.
−1
Observe also that Mi is under this setup self-adjoint if Ti
and i+1 are.
For simplicity, we now assume φ and ψ to satisfy the
identities
ψαT (x ) = αψT (x ), (x ∈ X ; 0 < α ∈ R).
The monotonicity conditions (G-PM) and (F∗-PM) then
simplify into
∂ G(x ) − ∂ G(x ), x − x ≥
holding for all x , x ∈ X , and
∈ [0, ] + K; and
∂ F ∗(y ) − ∂ F ∗(y), y − y ≥
holding for all y, y ∈ Y , and R ∈ Kˆ .
We have thus converted the main conditions (C2), (C1),
(G-PM), and (F∗-PM) of Theorem 2.1 into the respective
conditions (C2 ), (C1 ), (G-pm), and (F*-pm). Rewriting
(C1 ) in terms of 0 < i ∈ L(X ; X ) and ωi > 0 satisfying
Ti+1 = Ti i and τi+1 = τi ωi ,
we reorganise (C1 ) and (C2 ) into the parameter update
rules (23) of Algorithm 2. For ease of expression, we
introduce there 0 and R0 as dummy variables that are not used
anywhere else. Equating w¯ i+1 = K x¯i+1, we observe that
Algorithm 2 is an instance of Algorithm 1.
Example 3.1 (The method of Chambolle and Pock) Let G
be strongly convex with factor γ ≥ 0. We take Ti = τi I ,
Ti = τi I , Tˆi = τi I , and i+1 = σi+1 I for some scalars
τi , σi+1 > 0. The conditions (G-pm) and (F*-pm) then hold
with ψ ≡ 0 and φ ≡ 0, while (C2 ) and (C1 ) reduce with
Ri+1 = 0, i = γ I , i = ωi I , and ωi = ωi into
Updating σi+1 such that the last inequality holds as an
equality, we recover the accelerated PDHGM (3) + (4). If γ = 0,
we recover the unaccelerated PDHGM.
3.2 The Ergodic Duality Gap and Convergence To study the convergence of an ergodic duality gap, we now introduce convexity notions analogous to (G-pm) and (F*-pm). Namely, we assume
to hold for all x , x ∈ X and
∈ [0, ] + K and
F ∗(y ) − F ∗(y) ≥ ∂ F ∗(y), y − y
1 1
+ 2 y − y 2R − 2 ψR (y − y),
to hold for all y, y ∈ Y and R ∈ Kˆ . Clearly these imply
(G-pm) and (F*-pm).
To define an ergodic duality gap, we set
N −1
i=0
qN :=
and define the weighted averages
N −1
i=0
xN := qN−1
τi−1xi+1, and yN := qˆN−1
N −1
i=0
With these, the ergodic duality gap at iteration N is defined
as the duality gap for (xN , YN ), namely
N −1
i=0
G(xN ) + y, K xN − F ∗(y)
− G(x ) + yN , K x − F ∗(yN ) ,
and we have the following convergence result.
Theorem 3.1 Let us be given K ∈ L(X ; Y ), and convex,
proper, lower semicontinuous functionals G : X → R and
F ∗ : Y → R on Hilbert spaces X and Y , satisfying (G-pc)
and (F*-pc) for some sets K, Kˆ , and 0 ≤ ∈ L(X ; X ).
Pick δ ∈ (0, 1), and suppose (C2 ) and (C1 ) are satisfied
for each i ∈ N for some invertible self-adjoint Ti ∈ Q,
i ∈ L(Y ; Y ),
as well as i ∈ λ([0, ] + K) and Ri ∈ λKˆ for λ = 1/2.
Let u = (x , y) satisfy (OC). Then, the iterates of Algorithm
2 satisfy
2δ x N − x τ2N−1TN−1 + qN G
N ≤ C0 +
Here C0 is as in (18), and
N −1
i=0
Di+1 := τi−1ψ i −λ (xi+1 − x ) + τi−+11φRi+1 (yi+1 − y).
(27)
If only (G-pm) and (F*-pm) hold instead of (G-pc) and
(F*-pc), and we take λ = 1, then (26) holds with the
modification G N := 0.
Remark 3.1 For convergence of the gap, we must accelerate
less (factor 1/2 on i ).
Example 3.2 (No acceleration) Consider Example 3.1,
where ψ ≡ 0 and φ ≡ 0. If γ = 0, we get ergodic
convergence of the duality gap at rate O(1/N ). Indeed, we are
in the scalar step setting, with τ j = τ j = τ0. Thus, presently
qN = N τ0.
Example 3.3 (Full acceleration) With γ > 0 in Example
3.1, we know from [3, Corollary 1] that
Thus, qN is of the order (N 2), while τN TN = τN2 I is of
the order O(1/N 2). Therefore, (26) shows O(1/N 2)
convergence of the squared distance to solution. For O(1/N 2)
convergence of the ergodic duality gap, we need to slow down
(4) to ωi = 1/√1 + γ τi .
Remark 3.2 The result (28) can be improved to estimate
τN ≤ Cτ /N without a qualifier N ≥ N0. Indeed, from
[3, Lemma 2] we know the following for the rule ωi =
1/√1 + 2γ τi : given λ > 0 and N ≥ 0 with γ τN ≤ λ,
for any ≥ 0 holds
If we pick N = 0 and λ = γ τ0, this says
The first inequality gives τ
(γ −1 + τ0)/ .
Therefore, τN ≤ Cτ /N for Cτ := γ −1 + τ0. Moreover,
the second inequality gives τN−1 ≤ τ0−1 + γ N .
≤ (1 + γ τ0)/(τ0−1 + γ ) ≤
Proof (Theorem 3.1) The non-gap estimate in the last
paragraph of the theorem statement, where λ = 1, we modify
GN := 0, is a direct consequence of Theorem 2.1. We
therefore concentrate on the estimate that includes the gap, and
fix λ = 1/2. We begin by expanding
H (ui+1), Si∗(ui+1 − u)
= τi−1 ∂ G(xi+1), xi+1 − x + τi−+11 ∂ F ∗(yi+1), yi+1 − y
+ τi−1 K ∗ yi+1, xi+1 − x − τi−+11 K xi+1, yi+1 − y
Since then i ∈ ([0, ] + K)/2, and Ri+1 ∈ Kˆ /2, we may
take = 2 i and R = 2 Ri+1 in (G-pc) and (F*-pc). It
follows
+ τi−+11 F ∗(yi+1) − F ∗(y) + 21 yi+1 − y 22Ri+1
+ (τi−1 − τi−+11) yi+1, K xi+1 .
Using (2) and (24), we can make all of the factors ‘2’ and
‘1/2’ in this expression annihilate each other. With Di+1 as
in (27) and λ = 1/2, we therefore have
H (ui+1), Si∗(ui+1 − u)
≥ τi−1 G(xi+1) − G(x ) + y, K xi+1
F ∗(yi+1) − F ∗(y) − yi+1, K x
+ (τi−1 − τi−+11) yi+1 − y, K (xi+1 − x)
− y, K x ) − Di+1.
A little bit of reorganisation and referral to (13) for the
expansion of Si ¯ i thus gives
H (ui+1), Si∗(ui+1 − u)
≥ τi−1 G(xi+1) − G(x) + y, K xi+1
+ τi−+11 F∗(yi+1) − F∗(y) − yi+1, K x
1
− (τi−1 − τi−+11) y, K x + 2 ui+1 − u 2Si ¯ i − Di+1.
Let us write
Observing here the switches between the indices i + 1 and
i of the step length parameters in comparison with the last
step of (29), we thus obtain
We note that Si Mi in (12) is self-adjoint as we have assumed
Ti and i+1 to be, and taken Ti and Tˆi+1 to be scalars times
the identity. We therefore deduce from the proof of Theorem
2.1 that (21) holds. Using (PP) to combine (21) and (30), we
thus deduce
Summing this for i = 0, . . . , N − 1 gives with C0 from (27)
the estimate
− u 2SN MN +
≤ C0 +
N−1
i=0
N−1
i=0
G+i(ui+1, u) − G+i(u, u)
N−1
i=0
N−1
i=0
N−1
i=0
N−1
We want to estimate the sum of the gaps G+i in (31). Using
the convexity of G and F∗, we observe
N−1
i=0
τi−1G(xi+1) ≥ qN G(xN ), and
τi−+11 F∗(yi+1) ≥ qˆN F∗(yN ).
Also, by (25) and simple reorganisation
τi−+11G(x) = qN G(x) + τN−1G(x) − τ0−1G(x), and
τi−1 F∗(y) = qˆN F∗(yN ) − τN−1 F∗(y) + τ0−1 F∗(y).
All of (32)–(34) together give
− qN G(x) + qˆN yN , K x − qˆN F∗(yN )
+ τN−1G(x) − τ0−1G(x) + τN−1 FT∗ˆ−1,∗ (x) − τ0−1 F∗(y) .
N
Another use of (25) gives
N−1
N−1
i=0
where the remainder
+ τN−1G(x) − τ0−1G(x) + τN−1 F∗(x) − τ0−1 F∗(y) .
G+i(ui+1, u) − G+i(u, u) ≥ qN G N + rN ,
At a solution u = (x, y) to (OC), we have K xˆ ∈ ∂ F∗(yˆ),
so rN ≥ 0 provided qN ≤ qˆN . But qN − qˆN = τ0−1 − τN−1,
so this is guaranteed by our assumption (C3 ). Using (35) in
(31) therefore gives
A referral to (C2) to estimate SN MN from below shows (26),
concluding the proof.
4 Convergence Rates in Special Cases
To derive a practical algorithm, we need to satisfy the update
rules (C1) and (C2), as well as the partial monotonicity
conditions (G-PM) and (F∗-PM). As we have already discussed
in Sect. 3, this can be done when for some τi > 0 we set
Ti = τi I, and Tˆi = τi I.
The result of these choices is Algorithm 2, whose
convergence we studied in Theorem 3.1. Our task now is to verify
its conditions, in particular (G-pc) and (F*-pc) [alternatively
(F*-pm) and (G-pm)], as well as (C1 ), (C2 ), and (C3 ) for
of the projection form γ P.
4.1 An Approach to Updating
We have not yet defined an explicit update rule for i+1,
merely requiring that it has to satisfy (C2 ) and (C1 ). The
former in particular requires
Hiring the help of some linear operator F ∈ L(L(Y ; Y );
L(Y ; Y )) satisfying
To apply Theorem 3.1, all that remains is to verify in special
cases the conditions (40) together with (C3 ) and the partial
strong convexity conditions (G-pc) and (F*-pc).
F (K Ti K ∗) ≥ K Ti K ∗,
our approach is to define
Then, (C2 ) is satisfied provided Ti−1 ∈ Q. Since
τi−+11 i−+11 = τi−1(1 − δ)−1F (K Ti K ∗), the condition (C1 )
reduces into the satisfaction for each i ∈ N of
τi−1(I + 2 Ti )Ti−1 − τi−+11Ti−+11 ≥ −2τi−1( i −
K Ti+1 K ∗
4.2 When is a Multiple of a Projection
We now take = γ¯ P for some γ¯ > 0, and a projection
operator P ∈ L(X ; X ): idempotent, P2 = P, and self-adjoint,
P∗ = P. We let P⊥ := I − P. Then, P⊥ P = P P⊥ = 0.
With this, we assume that K is such that for some γ¯ ⊥ > 0
holds
To unify our analysis for gap and non-gap estimates of
Theorem 3.1, we now pick λ = 1/2 in the former case,
and λ = 1 in the latter. We then pick 0 ≤ γ ≤ λγ¯ , and
0 ≤ γi⊥ ≤ λγ¯ ⊥, and set
With this, τi , τi⊥ > 0 guarantee Ti ∈ Q. Moreover, Ti is
selfadjoint. Moreover, i ∈ λ([0, ] + K), exactly as required
in both the gap and the non-gap cases of Theorem 3.1.
Since
K Ti K ∗ = τi K P K ∗ + τi⊥ K P⊥ K ∗
= (τi − τi⊥)K P K ∗ + τi⊥ K K ∗,
we are encouraged to take
F (K Ti K ∗) := max{0, τi − τi⊥} K P 2 I + τi⊥ K 2 I.
Observe that (43) satisfies (38). Inserting (43) into (39),
we obtain
Since i+1 is now equivalent to a scalar, (40b), we also take
Ri+1 = ρi+1 I , assuming for some ρ¯ > 0 that
max{0, τi − τi⊥} K P 2 + τi⊥ K 2 .
ηi := τi−1 max{0, τi − τi⊥} − τi−+11 max{0, τi+1 − τi⊥+1}
we thus expand (40) as
τi−1(1 + 2γ τi )τi−1 − τi+1τi−+11 ≥ 0,
τi−1τi⊥,−1 − τi−+11τi⊥+,1−1 ≥ −2τi−1γi⊥,
= (γ − λγ¯ ) P + γi⊥ P⊥ ≤ γi⊥ P⊥
we suppose for simplicity that
ψ i −λ (x ) = γi⊥ψ ⊥( P⊥x ) and φRi+1 (y) = ρi+1φ (y)
for some ψ ⊥ : P⊥ X → R and φ : Y → R. The conditions
(G-pc) and (F*-pc) reduce in this case to the satisfaction for
some γ¯ , γ¯ ⊥, ρ¯ > 0 of
γ
G(x ) − G(x) ≥ ∂ G(x), x − x + 2¯ P(x − x) 2
P⊥(x − x) 2 − ψ (P⊥(x − x)) ,
for all x , x ∈ X and 0 ≤ γ ⊥ ≤ γ¯ ⊥, as well as of
y − y 2 − φ (y − y) ,
for all y, y ∈ Y and 0 ≤ ρ ≤ ρ¯. Analogues of (G-pm) and
(F*-pm) can be formed.
To summarise the findings of this section, we state the
following proposition.
Proposition 4.1 Suppose (G-pcr) and (F∗-pcr) hold for
some projection operator P ∈ L(X ; X ) and scalars
γ¯ , γ¯ ⊥, ρ¯ > 0. With λ = 1/2, pick γ ∈ [0, λγ¯ ]. For each
i ∈ N, suppose (45) is satisfied with
0 ≤ γi⊥ ≤ λγ¯ ⊥, 0 ≤ ρi ≤ λρ¯, and τ0 ≥ τi > 0. (47)
If we solve (45a) exactly, define Ti , i , and i+1 through
(42) and (44), and set Ri+1 = ρi+1 I , then the iterates of
Algorithm 2 satisfy with C0 and Di+1 as in (27) the estimate
2δ P(x N − x ) 2 + τ0−1 1+ 2γ G
N −1
i=0
We are almost ready to state a general convergence result
for projective . However, we want to make one more thing
more explicit. Since the choices (42) satisfy
Di+1 = τi−1γi⊥ψ ⊥( P⊥(xi+1 − x ))
+ τi−+11ρi+1φ (yi+1 − y).
With this, (50) yields (48).
ηi K P 2 + (τi−1τi⊥ − τi−+11τi⊥+1) K 2
If we take λ = 1, then (48) holds with G N = 0.
Observe that presently
Proof As we have assumed through (47), or otherwise
already verified its conditions, we may apply Theorem 3.1.
Multiplying (26) by τN τN , we obtain
2δ x N − x 2P + qN τN τN G N ≤ τN τN C0 +
N −1
i=0
Now, observe that solving (45a) exactly gives
τN−1τN−1 = τN−−11τN−−11 + 2γ τN−−11
N −1
j=0
2γ τ −j1 = τ0−1τ0−1 + 2γ qN . (51)
Therefore, we have the estimate
4.3 Primal and Dual Penalties with Projective
We now study conditions that guarantee the convergence of
the sum τN τN iN=−01 Di+1 in (48). Indeed, the right-hand
sides of (45b) and (45c) relate to Di+1. In most practical
cases, which we study below, φ and ψ transfer these
righthand side penalties into simple linear factors within Di+1.
Optimal rates are therefore obtained by solving (45b) and
(45c) as equalities, with the right-hand sides proportional to
each other. Since ηi ≥ 0, and it will be the case that ηi = 0 for
large i , we, however, replace (45c) by the simpler condition
Then, we try to make the left-hand sides of (45b) and (53)
proportional with only τi⊥+1 as a free variable. That is, for
some proportionality constant ζ > 0, we solve
τi−1τi⊥,−1 − τi−+11τi⊥+,1−1 = ζ (τi−1τi⊥ − τi−+11τi⊥+1).
Multiplying both sides of (54) by ζ −1τi+1τi⊥+1, gives on τi⊥+1
the quadratic condition
τi⊥+,12 + ωi (ζ −1τi⊥,−1 − τi⊥)τi⊥+1 − ζ −1 = 0.
ωi2(τi⊥ − ζ −1τi⊥,−1)2 + 4ζ −1 .
Solving (45b) and (53) as equalities, (54) and (55) give
τi−+11ρi+1 = ζ (τi−+11τi⊥+1 − τi−1τi⊥).
Note that this quantity is non-negative exactly when ωi⊥ ≥
ωi . We have
+ (1 − ζ −1τi⊥,−2)2 + 4ζ −1ωi−2τi⊥,−2 .
This quickly yields ωi⊥ ≥ ωi if ωi ≤ 1. In particular, (56) is
non-negative when ωi ≤ 1.
The next lemma summarises these results for the standard
choice of ωi .
Lemma 4.1 Let τi⊥+1 by given by (55), and set
Then, ω⊥ ≥ ωi , τi ≤ τ0, and (45) is satisfied with the
i
right-hand sides given by the non-negative quantity in (56).
Moreover,
Proof The choice (57) satisfies (45a), so that (45) in its
entirety will be satisfied with the right-hand sides of (45b)–
(45c) given by (56). The bound τi ≤ τ0 follows from ωi ≤ 1.
Finally, the implication (58) is a simple estimation of (55).
Specialisation of Algorithm 2 to the choices in Lemma
4.1 yields the steps of Algorithm 3. Observe that τi entirely
disappears from the algorithm. To obtain convergence rates,
and to justify the initial conditions, we will shortly seek to
Algorithm 3 Partial acceleration for projective
and dual penalties
Require: F∗ and G satisfying (G-pcr) and (F∗-pcr) for some γ¯ , γ¯ ⊥, ρ¯ ≥ 0,
and a projection operator P ∈ L(X; X). A choice of γ ∈ [0, γ¯ ]. Initial
step length parameters τ0, τ0⊥ > 0, a choice of δ ∈ (0, 1), and ζ ≤ τ0⊥,−2,
all satisfying (61).
1: Choose initial iterates x0 ∈ X and y0 ∈ Y .
2: repeat
3: Set
ωi = 1/ 1 + 2γ τi , and
1
ωi⊥ = 2 (1 − ζ −1τi⊥,−2)ωi + (1 − ζ −1τi⊥,−2)2ωi2 + 4ζ −1τi⊥,−2 .
4: Update
τi+1 = τi ωi , τi⊥+1 = τi⊥ωi⊥, and
σi+1 = ωi−1(1 − δ)/(max{0, τi − τi⊥} K P 2 + τi⊥ K 2),
5: With Ti = τi P + τi⊥ P⊥, perform the updates
yi+1 := (I + σi+1∂ F∗)−1(yi + σi+1 K x¯i+1).
6: until a stopping criterion is fulfilled.
exploit with specific φ and ψ the telescoping property
stemming from the non-negativity of the last term of (56).
There is still, however, one matter to take care of. We
need ρi ≤ λρ¯ and γi⊥ ≤ λγ¯ ⊥, although in many cases of
practical interest, the upper bounds are infinite and hence
inconsequential. We calculate from (55) and (57) that
ζ 1
γi⊥ = 2 (ωi−1τi⊥+1 − τi⊥) = 2
+ (ζ τi⊥ − τi⊥,−1)2 + 4ζ ωi−2
Therefore, we need to choose ζ and τ0 to satisfy 2ζ γ τ0 ≤
(λγ¯ ⊥)2. Likewise, we calculate from (56), (57), and (60) that
K 2ωi K 2ωi
ρi+1 = ωci γi⊥ = (1 − δ)ζ γi⊥ ≤ (1 − δ)ζ
This tells us to choose τ0 and ζ to satisfy 2 K 4/(1 −
δ)2ζ −1γ τ0 ≤ (λρ¯)2. Overall, we obtain for τ0 and ζ the
condition
This can always be satisfied through suitable choices of τ0
and ζ .
If now φ ≡ Cφ and ψ ≡ Cψ⊥, using the non-negativity of
(56), we calculate
N −1
i=0
N −1
i=0
K 2Cφ N −1
τi−+11ρi+1φ (yi+1 − y) = 2(1 − δ)
i=0
Using these expression to expand (49), we obtain the
following convergence result.
Theorem 4.1 Suppose (G-pcr) and (F∗-pcr) hold for some
projection operator P ∈ L(X ; X ), scalars γ¯ , γ¯ ⊥, ρ¯ > 0
with φ ≡ Cφ , and ψ ≡ Cψ⊥, for some constants Cφ , Cψ⊥ > 0.
With λ = 1/2, fix γ ∈ (0, λγ¯ ]. Select initial τ0, τ0⊥ > 0, as
well as δ ∈ (0, 1) and ζ ≤ (τ0⊥)−2 satisfying (61). Then,
Algorithm 3 satisfies for some C0, Cτ > 0 the estimate
2δ P(x N − x ) 2 + τ0−1 1+ 2γ G N ≤
ζ −1/2 K 2
If we take λ = 1, then (48) holds with G N = 0.
Proof During the course of the derivation of Algorithm 3,
we have verified (45), solving (45a) as an equality. Moreover,
Lemma 4.1 and (61) guarantee (47). We may therefore apply
Proposition 4.1. Inserting (62) and (63) into (48) and (49)
gives
2δ P(x N − x ) 2 + τ0−1 1+ 2γ G
τN−1τN⊥ + 2(K1 −2Cδφ) τN−1τN⊥ .
Remark 4.1 As a special case of Algorithm 3, if we choose
ζ = τ0⊥,−2, then we can show from (55) that τi⊥ = τ0⊥ =
ζ −1/2 for all i ∈ N.
In consequence ωi⊥ = ωi−1, and (45c) becomes
Remark 4.2 The convergence rate provided by Theorem 4.1
is a mixed O(1/N 2) + O(1/N ) rate, similarly to that derived
in [5] for a type of forward–backward splitting algorithm for
smooth G. Ours is of course backward–backward type
algorithm. It is interesting to note that using the differentiability
properties of infimal convolutions [23, Proposition 18.7], and
the presentation of a smooth G as an infimal convolution, it
is formally possible to derive a forward–backward algorithm
from Algorithm 3. The difficulties lie in combining this
conversion trick with conditions on the step lengths.
4.4 Dual Penalty Only with Projective
Continuing with the projective setup of Sect. 4.2, we now
study the case K = {0}, that is, when only the dual penalty
φ is available with ψ ≡ 0. To use Proposition 4.1, we need
to satisfy (47) and (45), with (45a) holding as an equality.
Since γi⊥ = 0, (45b) becomes
τi−1τi⊥,−1 − τi−+11τi⊥+,1−1 ≥ 0.
With respect to τi⊥+1, the left-hand side of (45c) is maximised
(and the penalty on the right-hand side minimised) when (66)
is minimised. Thus, we solve (66) exactly, which gives
(1 − ωi−2) K 2 ≥ −2τi−+11ρi+1.
In order to simultaneously satisfy (45a), this suggests for
some, yet undetermined, ai > 0, to choose
To use Proposition 4.1, we need to satisfy ρi+1 ≤ λρ¯. Since
(68) implies that {τi }i∞=0 is non-increasing, we can satisfy this
for large enough i if ai 0. To ensure satisfaction for all
i ∈ N, it suffices to take {ai }i∞=0 non-increasing, and satisfy
the initial condition
The rule τi+1 = ωi τi and (68) give τi−+21 = τi−2 + ai . We
therefore see that
ij−=10 a j
N −1
i=0
N −1
i=0
Di+1 =
N −1
i=0
i=0
(yi+1 − y)
Thus, the rate (48) in Proposition 4.1 states
2δ P(x N − x ) 2 + τ0−1 1+ 2γ G
≤ μ0N C0 + 2(1K−2δ) μ1N
N −1
i=0
The convergence rate is thus completely determined by μN
and μN . 0
1
Remark 4.3 If φ ≡ 0, that is, if F ∗ is strongly convex, we
may simply pick ωi = ωi = 1/√1 + 2γ τi , that is ai = 2γ ,
and obtain from (70) a O(1/N 2) convergence rate.
For a more generally applicable algorithm, suppose
φ (yi+1 − y) ≡ Cφ as in Theorem 4.1. We need to choose
ai . One possibility is to pick some q ∈ (0, 1] and
ai := τ0−2 (i + 1)q − i q .
The concavity of i → qi for q ∈ (0, 1] easily shows that
{ai }i∞=0 is non-increasing. With the choice (71), we then
compute
N −1
i=0
Algorithm 4 Partial acceleration for projective
penalty only
Require: G satisfying (G-pcr) (with ψ ≡ 0) for some γ¯ > 0 and a
projection operator P ∈ L(X ; X ). F∗ satisfying (F∗-pcr) for some
ρ¯ > 0. A choice of γ ∈ [0, γ¯ ] and a non-increasing sequence
{ai }i∞=0, for example as in (71). Initial step parameters τ0, τ0⊥, τ0 >
0, as well as δ ∈ (0, 1), satisfying (69).
1: Choose initial iterates x0 ∈ X and y0 ∈ Y .
2: repeat
3: Set
ωi := ωi−1/(1 + 2γ τi ), τi+1 := τi ωi ,
τi+1 := τi ωi , τi⊥+1 := τi⊥/ωi ,
σi+1 = ωi−1(1 − δ)/(max{0, τi − τi⊥} K P 2 + τi⊥ K 2).
xi+1 := (I + Ti ∂G)−1(xi − Ti K ∗ yi ),
x¯i+1 := ωi (xi+1 − xi ) + xi+1,
yi+1 := (I + σi+1∂ F∗)−1(yi + σi+1 K x¯i+1).
5: until a stopping criterion is fulfilled.
N −1
i=0
If N ≥ 2, we find with Ca = (1 + q/2)/(21+q/2λγ ) that
μ0N ≤ Nτ10+Cqa/2 , and μ1N ≤ τ0CNa1C−φq/2 .
The choice q = 0 gives uniform O(1/N ) over both the
initialisation and the dual sequence. By choosing q = 1, we get
O(1/N 3/2) convergence with respect to the initialisation, and
O(1/N 1/2) with respect to the residual sequence.
With these choices, Algorithm 2 yields Algorithm 4,
whose convergence properties are stated in the next theorem.
Theorem 4.2 Suppose (G-pcr) and (F∗-pcr) hold for some
projection operator P ∈ L(X ; X ) and γ¯ , γ¯ ⊥, ρ¯ ≥ 0 with
ψ ≡ 0 and φ ≡ Cφ for some constant Cφ ≥ 0. With λ = 1/2,
choose γ ∈ (0, λγ¯ ], and pick the sequence {ai }i∞=0 by (71) for
some q ∈ (0, 1]. Select initial τ0, τ0⊥, τ0 > 0 and δ ∈ (0, 1)
verifying (69). Then, Algorithm 4 satisfies
2δ P(x N − x ) 2 + τ0−11+ γ G
, (N ≥ 2).
If we take λ = 1, then (74) holds with G N = 0.
Proof We apply Proposition 4.1 whose assumptions we have
verified during the course of the present section. In particular,
τi ≤ τ0 through the choice (68) that forces ωi ≤ 1. Also,
have already derived the rate (70) from (48). Inserting (72)
into (70), noting that the former is only valid for N ≥ 2,
immediately gives (74).
5 Examples from Image Processing and the Data Sciences
We now consider several applications of our algorithms.
We generally have to consider discretisations, since many
interesting infinite-dimensional problems necessitate Banach
spaces. Using Bregman distances, it would be possible to
generalise our work form Hilbert spaces to Banach spaces,
as was done in [24] for the original method of [3]. This is,
however, outside the scope of the present work.
A large range of interesting application problems can be
written in the Tikhonov regularisation or empirical loss
minimisation form
Here α > 0 is a regularisation parameter, G0 : Z → R
typically convex and smooth fidelity term with data f ∈ Z .
The forward operator A ∈ L(X ; Z )—which can often also be
data—maps our unknown to the space of data. The operator
K ∈ L(X ; Y ) and the typically non-smooth and convex F :
Y → R act as a regulariser.
We are particularly interested in strongly convex G0 and
A with a non-trivial null-space. Examples include, for
example, Lasso—a type of regularised regression—with G0 =
x 22/2, K = I , and F (x ) = x 1, on finite-dimensional
spaces. If the data of the Lasso is ‘sparse’, in the sense that
A has a non-trivial null-space, then, based on accelerating
the strongly convex part of the variable, our algorithm can
provide improved convergence rates compared to standard
non-accelerated methods.
In image processing examples abound, we refer to [25]
for an overview. In total variation (TV) regularisation, we
still take F (x ) = x 1, but is K = ∇ the gradient operator.
Strictly speaking, this has to be formulated in the Banach
space BV( ), but we will consider the discretised setting to
avoid this problem. For denoising of Gaussian noise with TV
regularisation, we take A = I , and again G0 = x 22/2. This
problem is not so interesting to us, as it is fully strongly
convex. In a simple form of TV inpainting—filling in missing
regions of an image—we take A as a subsampling operator
S mapping an image x ∈ L2( ) to one in L2( \ d ), for
d ⊂ the defect region that we want to recreate. Observe
that in this case, = S∗ S is directly a projection
operator. This is therefore a problem for our algorithms! Related
problems include reconstruction from subsampled magnetic
resonance imaging (MRI) data (see, for example, [11,26]),
where we take A = SF for F the Fourier transform. Still,
A∗ A is a projection operator, so the problem perfectly suits
our algorithms.
Another related problem is total variation deblurring,
where A is a convolution kernel. This problem is slightly
more complicated to handle, as A∗ A is not a projection
operator. Assuming periodic boundary conditions on a box
m
= i=1[ci , di ], we can write A = F∗aˆ F, multiplying
the Fourier transform by some aˆ ∈ L2( ). If |aˆ | ≥ γ on a
subdomain, we obtain a projection form (it would also be
possible to extend our theory to non-constant γ , but we have
decided not to extend the length of the paper by doing so.
Dualisation likewise provides a further alternative).
Satisfaction of convexity conditions In all of the above
examples, when written in the saddle point form (P), F ∗ is a
simple pointwise ball constraint. Lemma 2.1 thus guarantees
(F∗-pcr). If F (x ) = x 1 and K = I , then clearly P⊥xˆ
can be bounded in Z = L1 for xˆ the optimal solution to
(75). Thus, for some M > 0, we can add to (75) the artificial
constraint
G (x ) := ι · Z ≤M ( P⊥x ).
In finite dimensions, this gives a bound in L2. Lemma 2.1
gives (G-pcr) with γ¯ ⊥ = ∞.
In case of our total variation examples, F (x ) = x 1
and K = ∇. Provided mean-zero functions are not in the
kernel of A, one can through Poincar’s inequality [27] on
BV( ) and a two-dimensional connected domain ⊂ R2
show that even the original infinite-dimensional problems
have bounded solutions in L2( ). We may therefore again
add the artificial constraint (76) with Z = L2 to (75).
Dynamic bounds and pseudo-duality gaps We seldom know
the exact bound M , but can derive conservative estimates.
Nevertheless, adding such a bound to Algorithm 4 is a
simple, easily implemented projection of P⊥(xi − Ti K ∗ yi ) into
the constraint set. In practise, we do not use or need the
projection, and update the bound M dynamically so as to
ensure that the constraint (76) is never active. Indeed, A
having a non-trivial nullspace also causes duality gaps for (P) to
be numerically infinite. In [28], a ‘pseudo-duality gap’ was
therefore introduced, based on dynamically updating M . We
will also use this type of dynamic duality gaps in our
reporting.
Fig. 1 We use sample image (b) for denoising, and (c) for deblurring experiments. Free Kodak image suite photo, at the time of writing online at
http://r0k.us/graphics/kodak/. a True image. b Noise image. c Blurry image
5.2 TGV2 Regularised Problems
So far, we have considered very simple regularisation terms.
Total generalised variation, TGV, was introduced in [29] as
a higher-order generalisation of TV. It avoids the
unfortunate stair-casing effect of TV—large flat areas with sharp
transitions—while preserving the critical edge preservation
property that smooth regularisers lack. We concentrate on the
second-order TGV2. In all of our image processing examples,
we can replace TV by TGV2.
As with total variation, we have to consider discretised
models due the original problem being set in the Banach
space BV( ). For two parameters α, β > 0, the
regularisation functional is written in the differentiation cascade form
of [30] as
TGV(2β,α)(u) := mwin α ∇u − w 1 + β E u 1.
Here E = (∇ T + ∇)/2 is the symmetrised gradient. With
x = (v, w) and y = (y1, y2), we may write the problem
in the saddle point form (P) with
G(x ) := G0( f − Av),
F ∗(y) = ι · L∞ ≤α(y1) + ι · L∞ ≤β (y2), and
K :=
If A = I , as is the case for denoising, we have
perfectly uncoupling in both Algorithm 3 and Algorithm 4
the prox updates for G into ones for G1 and G2. The condition
(F∗-pcr) with ρ¯ = ∞ is then immediate from Lemma 2.1.
Moreover, the Sobolev–Korn inequality [31] allows us to
bound on a connected domain ⊂ R2 an optimal wˆ to (77)
as
E wˆ 1 ≤ C G0( f )
for some constant C > 0. We may assume that w¯ = 0, as
the affine part of w is not used in (77). Therefore we may
again replace G2 = 0 by the artificial constraint G2(w) =
ι · L2 ≤M (w). By Lemma 2.1, G will then satisfy (G-pcr)
5.3 Numerical Results
We demonstrate our algorithms on TGV2 denoising and TV
deblurring. Our tests are done on the photographs in Fig. 1,
both at the original resolution of 768 × 512, and scaled down
by a factor of 0.25 to 192 × 128 pixels. It is image #23 from
the free Kodak image suite. Other images from the collection
that we have experimented on give analogous computational
results. For both of our example problems, we calculate a
target solution by taking one million iterations of the basic
PDHGM (3). We also tried interior point methods for this,
but they are only practical for the smaller denoising problem.
We evaluate Algorithms 3 and 4 against the standard
unaccelerated PDHGM of [3], as well as (a) the
mixedrate method of [5], denoted here C-L-O, (b) the relaxed
PDHGM of [20,32], denoted here ‘Relax’, and (c) the
adaptive PDHGM of [33], denoted here ‘Adapt’. All of these
methods are very closely linked and have comparable low
costs for each step. This makes them straightforward to
compare.
As we have discussed, for comparison and stopping
purposes, we need to calculate a pseudo-duality gap as in [28],
because the real duality gap is in practise infinite when A
has a non-trivial nullspace. We do this dynamically;
upgrading, the M in (76) every time, we compute the duality gap.
For both of our example problems, we use for simplicity
Fig. 2 Step length parameter
evolution, both axes
logarithmic. ‘Alg.3’ and ‘Alg.4
q=1’ have the same parameters
as our numerical experiments
for the respective algorithms, in
particular ζ = τ0⊥,−2 for
Algorithm 3, which yields
constant τ ⊥. ‘Alg.3 ζ /100’ uses
the value ζ = τ0⊥,−2/100, which
causes τ ⊥ to increase for some
iterations. ‘Alg.4 q=0.1’ uses the
value q = 0.1 for Algorithm 4,
everything else being kept equal
Z = L2 in (76). In the calculation of the final duality gaps
comparing each algorithm, we then take as M the maximum
over all evaluations of all the algorithms. This makes the
results fully comparable. We always report the duality gap
in decibels 10 log10(gap2/gap20) relative to the initial iterate.
Similarly, we report the distance to the target solution uˆ in
decibels 10 log10( ui − uˆ 2/ uˆ 2), and the primal
objective value val(x ) := G(x ) + F (K x ) relative to the target
as 10 log10(val(x )2/val(xˆ)2). Our computations were
performed in MATLAB+C-MEX on a MacBook Pro with 16GB
RAM and a 2.8 GHz Intel Core i5 CPU.
TGV2 denoising The noise in our high-resolution test image,
with values in the range [0, 255], has standard deviation
29.6 or 12 dB. In the downscaled image, these become,
respectively, 6.15 or 25.7 dB. As parameters (β, α) of the
TGV2 regularisation functional, we choose (4.4, 4) for the
downscale image, and translate this to the original image
by multiplying by the scaling vector (0.25−2, 0.25−1)
corresponding to the 0.25 downscaling factor. See [34] for a
discussion about rescaling and regularisation factors, as well
as for a justification of the β/α ratio.
For the PDHGM and our algorithms, we take γ = 0.5,
corresponding to the gap convergence results. We choose
δ = 0.01, and parametrise the PDHGM with σ0 = 1.9/ K
and τ0∗ = τ0 ≈ 0.52/ K solved from τ0σ0 = (1 − δ) K 2.
These are values that typically work well. For
forwarddifferences discretisation of TGV2 with cell width h = 1,
we have K 2 ≤ 11.4 [28]. We use the same value of δ for
Algorithm 3 and Algorithm 4, but choose τ0⊥ = 3τ0∗, and
τ0 = τ0 = 80τ0∗. We also take ζ = τ0⊥,−2 for Algorithm
3. These values have been found to work well by trial and
error, while keeping δ comparable to the PDHGM. A similar
choice of τ0 with a corresponding modification of σ0 would
significantly reduce the performance of the PDHGM. For
Algorithm 4, we take exponent q = 0.1 for the sequence {ai }.
This gives in principle a mixed O(1/N 1.5)+ O(1/N 0.5) rate,
possibly improved by the convergence of the dual sequence.
We plot the evolution of the step length for these and some
other choices in Fig. 2. For the C-L-O, we use the detailed
parametrisation from [35, Corollary 2.4], taking as Y the
true L2-norm Bregman divergence of B(0, α) × B(0, β), and
X = 10 · f 2/2 as a conservative estimate of a ball
containing the true solution. For ‘Adapt’, we use the exact choices
of α0, η, and c from [33]. For ‘Relax’, we use the value 1.5 for
the inertial ρ parameter of [32]. For both of these algorithms,
we use the same choices of σ0 and τ0 as for the PDHGM.
We take fixed 20,000 iterations and initialise each
algorithm with y0 = 0 and x 0 = 0. To reduce computational
overheads, we compute the duality gap and distance to
target only every 10 iterations instead of at each iteration.
The results are in Fig. 3 and Table 1. As we can see,
Algorithm 3 performs extremely well for the low-resolution
image, especially in its initial iterations. After about 700
or 200 iterations, depending on the criterion, the standard
and relaxed PDHGM start to overtake. This is a general
effect that we have seen in our tests: the standard PDHGM
performs in practise very well asymptotically, although in
principle all that exists is a O(1/N ) rate on the ergodic
Gap ≤ −40 dB
Tgt ≤ −30 dB
Val ≤ 1 dB
The CPU time and number of iterations (at a resolution of 10) needed
to reach given solution quality in terms of the duality gap, distance to
target, or primal objective value
Fig. 3 TGV2 denoising performance, 20,000 iterations, high- and
lowresolution images. The plot is logarithmic, with the decibels calculated
as in Sect. 5.3. The poor high-resolution results for ‘Adapt’ [33] have
been omitted to avoid poor scaling of the plots. a Gap, low resolution,
b target, low resolution, c value, low resolution, d gap, high resolution,
e target, high resolution, f value, high resolution
Table 1 TGV2 denoising performance, maximum 20,000 iterations
Gap ≤ −50 dB
Tgt ≤ −40 dB
Val ≤ 1 dB
duality gap. Algorithm 4, by contrast, does not perform
asymptotically so well. It can be extremely fast on its initial
iterations, but then quickly flattens out. The C-L-O
surprisingly performs better on the high-resolution image than on
the low-resolution image, where it does somewhat poorly in
comparison with the other algorithms. The adaptive PDHGM
performs very poorly for TGV2 denoising, and we have
indeed excluded the high-resolution results from our reports
to keep the scaling of the plots informative. Overall,
Algorithm 3 gives good results fast, although the basic and relaxed
PDHGM seems to perform, in practise, better
asymptotically.
TV deblurring Our test image has now been distorted by
Gaussian blur of standard deviation 4, which we intent to
remove. We denote by aˆ the Fourier presentation of the blur
operator as discussed in Sect. 5.1. For numerical stability of
the pseudo-duality gap, we zero out small entries, replacing
this aˆ by aˆ χ|aˆ( · )|≥ aˆ ∞/1000(ξ ). Note that this is only needed
for the stable computation of G∗ for the pseudo-duality gap,
to compare the algorithms; the algorithms themselves are
stable without this modification. To construct the projection
operator P , we then set pˆ(ξ ) = χ|aˆ( · )|≥0.3 aˆ ∞ (ξ ), and P =
F∗ pˆF.
We use TV parameter 2.55 for the high-resolution image
and the scaled parameter 2.55 ∗ 0.15 for the low-resolution
image. We parametrise all the algorithms almost exactly as
Fig. 4 TV deblurring performance, 10,000 iterations, high- and low-resolution images. The plot is logarithmic, with the decibels calculated as in
Sect. 5.3. a Gap, low resolution. b Target, low resolution. c Value, low resolution. d Gap, high resolution. e Target, high resolution. f Value, high
resolution
Table 2 TV deblurring performance, maximum 10,000 iterations
Gap ≤ −60 dB
Tgt ≤ −40 dB
Val ≤ 1 dB
Gap ≤ −60 dB
Tgt ≤ −30 dB
Val ≤ 1 dB
The CPU time and number of iterations (at a resolution of 10) needed to reach given solution quality in terms of the duality gap, distance to target,
or primal objective value
TGV2 denoising above, of course with appropriate U and
K 2 ≤ 8 corresponding to K = ∇ [36]. The only difference
in parameterisation is that we take q = 1 instead of q = 0.1
for Algorithm 4.
The results are in Fig. 4 and Table 2. It does not appear
numerically feasible to go significantly below −100 or
−80 dB gap. Our guess is that this is due to the numerical
inaccuracies of the fast Fourier transform implementation
in MATLAB. The C-L-O performs very well judged by
the duality gap, although the images themselves and the
primal objective value appear to take a little bit longer to
converge. The relaxed PDHGM is again slightly improved
from the standard PDHGM. The adaptive PDHGM performs
very well, slightly outperforming Algorithm 3, although not
Algorithm 4. This time Algorithm 4 performs remarkably
well.
6 Conclusion
To conclude, overall, our algorithms are very competitive
within the class of proposed variants of the PDHGM. Within
our analysis, we have, moreover, proposed very
streamlined derivations of convergence rates for even the standard
PDHGM, based on the proximal point formulation and the
idea of testing. Interesting continuations of this study include
whether the condition Tˆi K = K Ti can reasonably be relaxed
such that Tˆi and Ti would not have to be scalars, as well as
the relation to block coordinate descent methods, in
particular [14, 37].
Acknowledgements This research was started while T. Valkonen
was at the Center for Mathematical Modeling at Escuela Politécnica
Nacional in Quito, supported by a Prometeo scholarship of the Senescyt
(Ecuadorian Ministry of Science, Technology, Education, and
Innovation). In Cambridge, T. Valkonen has been supported by the EPSRC
Grant EP/M00483X/1 “Efficient computational tools for inverse
imaging problems”. Thomas Pock is supported by the European Research
Council under the Horizon 2020 programme, ERC starting Grant
Agreement 640156.
Compliance with ethical standards
A Data Statement for the EPSRC This is primarily a theory paper,
with some demonstrations on a photograph freely available from the
Internet. As this article was written, the used photograph from the
Kodak image suite was, in particular, available at http://r0k.us/graphics/
kodak/. It has also been archived with our implementations of the
algorithms at https://www.repository.cam.ac.uk/handle/1810/253697.
Open Access This article is distributed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecomm
ons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided you give appropriate credit
to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made.
Tuomo Valkonen received his
Ph.D. from the University of
Jyväskylä in 2008. Since then he
has worked as researcher in Graz,
Cambridge, and Quito. In
February 2016 he started as a lecturer
at the University of Liverpool.
Thomas Pock born 1978 in
Graz, received his M.Sc. (1998–
2004) and his Ph.D. (2005–
2008) in Computer Engineering
(Telematik) from Graz
University of Technology. After a
Postdoc position at the University of
Bonn, he moved back to Graz
University of Technology where
he has been an Assistant
Professor at the Institute for Computer
Graphics and Vision. In 2013 he
received the START price of the
Austrian Science Fund (FWF)
and the German Pattern
recognition award of the German association for pattern recognition (DAGM)
and in 2014, he received an starting grant from the European Research
Council (ERC). Since June 2014, he is a Professor of Computer
Science at Graz University of Technology (AIT Stiftungsprofessur “Mobile
Computer Vision”) and a principal scientist at the Digital Department
of Safety and Security at the Austrian Institute of Technology (AIT).
The focus of his research is the development of mathematical models
for computer vision and image processing as well as the development
of efficient algorithms to solve these models.
1. Rockafellar , R.T.: Convex Analysis . Princeton University Press, Princeton ( 1972 )
2. Ekeland , I., Temam , R.: Convex Analysis and Variational Problems . SIAM ( 1999 )
3. Chambolle , A. , Pock , T.: A first-order primal-dual algorithm for convex problems with applications to imaging . J. Math. Imaging Vis . 40 , 120 - 145 ( 2011 ). doi:10.1007/s10851- 010 - 0251 -1
4. Esser , E. , Zhang , X., Chan, T.F. : A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science . SIAM J. Imaging Sci . 3 ( 4 ), 1015 - 1046 ( 2010 ). doi:10.1137/09076934X
5. Chen , Y. , Lan , G. , Ouyang , Y. : Optimal primal-dual methods for a class of saddle point problems . SIAM J. Optim . 24 ( 4 ), 1779 - 1814 ( 2014 ). doi:10.1137/130919362
6. Nesterov , Y. : Smooth minimization of non-smooth functions . Math. Program. 103 ( 1 ), 127 - 152 ( 2005 ). doi:10.1007/ s10107- 004 - 0552 -5
7. Beck , A. , Teboulle , M. : Smoothing and first order methods: a unified framework . SIAM J. Optim . 22 ( 2 ), 557 - 580 ( 2012 ). doi:10. 1137/100818327
8. O'Donoghue , B. , Candès , E. : Adaptive restart for accelerated gradient schemes . Found. Comput. Math. 15 ( 3 ), 715 - 732 ( 2015 ). doi:10. 1007/s10208- 013 - 9150 -3
9. Beck , A. , Teboulle , M. : A fast dual proximal gradient algorithm for convex minimization and applications . Oper. Res. Lett . 42 ( 1 ), 1 - 6 ( 2014 ). doi:10.1016/j.orl. 2013 .10.007
10. Valkonen , T.: A primal-dual hybrid gradient method for non-linear operators with applications to MRI . Inverse Probl. 30 ( 5 ), 055 , 012 ( 2014 ). doi:10.1088/ 0266 - 5611 /30/5/055012
11. Benning , M. , Knoll , F. , Schönlieb , C.B., Valkonen, T.: Preconditioned ADMM with nonlinear operator constraint (2015) . arXiv:1511.00425
12. Möllenhoff , T. , Strekalovskiy , E. , Moeller , M. , Cremers , D. : The primal-dual hybrid gradient method for semiconvex splittings . SIAM J. Imaging Sci . 8 ( 2 ), 827 - 857 ( 2015 ). doi:10.1137/ 140976601
13. Lorenz , D. , Pock , T.: An inertial forward-backward algorithm for monotone inclusions . J. Math. Imaging Vis . 51 ( 2 ), 311 - 325 ( 2015 ). doi:10.1007/s10851- 014 - 0523 -2
14. Fercoq , O. , Bianchi , P. : A coordinate descent primal-dual algorithm with large step size and possibly non separable functions ( 2015 ). arXiv:1508.04625
15. Beck , A. , Teboulle , M. : A fast iterative shrinkage-thresholding algorithm for linear inverse problems . SIAM J. Imaging Sci . 2 ( 1 ), 183 - 202 ( 2009 ). doi:10.1137/080716542
16. Beck , A. , Teboulle , M. : Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems . IEEE Trans. Image Process . 18 (11), 2419 - 2434 ( 2009 ). doi:10. 1109/TIP. 2009 .2028250
17. Setzer , S. : Operator splittings, Bregman methods and frame shrinkage in image processing . Int. J. Comput. Vis . 92 ( 3 ), 265 - 280 ( 2011 ). doi:10.1007/s11263- 010 - 0357 -3
18. Valkonen , T.: Optimising big images . In: A. Emrouznejad (ed.) Big Data Optimization: Recent Developments and Challenges, Studies in Big Data , pp. 97 - 131 . Springer, Berlin ( 2016 ). doi:10.1007/ 978- 3 - 319 - 30265 -2_ 5
19. Rockafellar , R.T., Wets , R.J.B.: Variational Analysis . Springer, Berlin ( 1998 ). doi:10.1007/978- 3 - 642 - 02431 -3
20. He , B. , Yuan , X.: Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective . SIAM J. Imaging Sci . 5 ( 1 ), 119 - 149 ( 2012 ). doi:10.1137/100814494
21. Pock , T. , Chambolle , A. : Diagonal preconditioning for first order primal-dual algorithms in convex optimization . In: Computer Vision (ICCV) , 2011 IEEE International Conference on, pp. 1762 - 1769 ( 2011 ). doi:10.1109/ICCV. 2011 .6126441
22. Rudin , W.: Functional Analysis . International series in Pure and Applied Mathematics. McGraw-Hill , New York ( 2006 )
23. Bauschke , H. , Combettes , P. : Convex Analysis and Monotone Operator Theory in Hilbert Spaces . CMS Books in Mathematics. Springer, Berlin ( 2011 )
24. Hohage , T. , Homann , C.: A generalization of the Chambolle-Pock algorithm to Banach spaces with applications to inverse problems ( 2014 ). arXiv:1412.0126
25. Chan, T., Shen , J. : Image Processing and Analysis: Variational , PDE, Wavelet, and Stochastic Methods . Society for Industrial and Applied Mathematics (SIAM) ( 2005 )
26. Benning , M. , Gladden , L. , Holland, D. , Schönlieb , C.B., Valkonen, T.: Phase reconstruction from velocity-encoded MRI measurements-a survey of sparsity-promoting variational approaches . J. Magn. Reson . 238 , 26 - 43 ( 2014 ). doi:10.1016/j. jmr. 2013 .10.003
27. Ambrosio , L. , Fusco , N. , Pallara , D. : Functions of Bounded Variation and Free Discontinuity Problems . Oxford University Press, Oxford ( 2000 )
28. Valkonen , T. , Bredies , K. , Knoll , F. : Total generalised variation in diffusion tensor imaging . SIAM J. Imaging Sci . 6 ( 1 ), 487 - 525 ( 2013 ). doi:10.1137/120867172
29. Bredies , K. , Kunisch , K. , Pock , T.: Total generalized variation . SIAM J. Imaging Sci . 3 , 492 - 526 ( 2011 ). doi:10.1137/090769521
30. Bredies , K. , Valkonen , T.: Inverse problems with second-order total generalized variation constraints . In: Proceedings of the 9th International Conference on Sampling Theory and Applications (SampTA) 2011 , Singapore ( 2011 )
31. Temam , R.: Mathematical Problems in Plasticity. Gauthier-Villars ( 1985 )
32. Chambolle , A. , Pock , T.: On the ergodic convergence rates of a first-order primal-dual algorithm . Math. Program. ( 2015 ). doi:10. 1007/s10107- 015 - 0957 -3
33. Goldstein , T. , Li , M. , Yuan , X.: Adaptive primal-dual splitting methods for statistical learning and image processing . Adv. Neural Inf. Process. Syst . 28 , 2080 - 2088 ( 2015 )
34. de Los Reyes, J.C., Schönlieb , C.B., Valkonen, T.: Bilevel parameter learning for higher-order total variation regularisation models . J. Math. Imaging Vis . ( 2016 ). doi:10.1007/s10851- 016 - 0662 -8. Published online
35. Chen , K. , Lorenz , D.A. : Image sequence interpolation using optimal control . J. Math. Imaging Vis . 41 , 222 - 238 ( 2011 ). doi:10. 1007/s10851- 011 - 0274 -2
36. Chambolle , A. : An algorithm for mean curvature motion . Interfaces Free Bound . 6 ( 2 ), 195 ( 2004 )
37. Suzuki , T.: Stochastic dual coordinate ascent with alternating direction multiplier method ( 2013 ). arXiv: 1311 .0622v1