Sequential linear regression with online standardized data
Kévin Duarte 0 1 2
Jean-Marie Monnez 0 1 2
Eliane Albuisson 1 2
0 CHRU de Nancy, Vandoeuvre-lès-Nancy, France; 6 Faculté de Médecine, InSciDenS, Vandoeuvre-lès-Nancy, France
1 Université de Lorraine, Institut Elie Cartan de Lorraine, UMR 7502, Vandoeuvre-lès-Nancy, F-54506, France; 2 Project-Team BIGS, INRIA, Villers-lès-Nancy, F-54600, France; 3 INSERM U1116, Centre d'Investigations Cliniques-Plurithématique 1433, Université de Lorraine, Nancy, France; 4 Université de Lorraine, Institut Universitaire de Technologie Nancy-Charlemagne, Nancy, F-54052, France; 5 BIOBASE, Pôle S
Editor: Chenping Hou, National University of Defense Technology, CHINA
The present study addresses the problem of sequential least squares multidimensional linear regression, particularly in the case of a data stream, using a stochastic approximation process. To avoid the phenomenon of numerical explosion which can be encountered, and to reduce the computing time so as to take into account as much of the arriving data as possible, we propose using a process with online standardized data instead of raw data, together with the use of several observations per step or of all observations until the current step. Herein, we define and study the almost sure convergence of three processes with online standardized data: a classical process with a variable stepsize and use of a varying number of observations per step, an averaged process with a constant stepsize and use of a varying number of observations per step, and a process with a variable or constant stepsize and use of all observations until the current step. Their convergence is obtained under more general assumptions than classical ones. These processes are compared to classical processes on 11 datasets, first for a fixed total number of observations used and thereafter for a fixed processing time. Analyses indicate that the third-defined process typically yields the best results.

Data Availability Statement: All datasets used in our experiments, except those derived from the EPHESUS study, are available online and links to download these data appear in Table 2 of our article. Due to legal restrictions, data from the EPHESUS study are only available upon request. Interested researchers may request access to data upon approval from the EPHESUS Executive Steering Committee of the study. This committee can be reached through Pr Faiez Zannad (f. ), who is a member of this board.
1 Introduction
Consider the least squares multidimensional linear regression of S with respect to R: the (p, q) matrix θ and the (q, 1) matrix η are estimated such that E[‖S − θ′R − η‖²] is minimal.
Funding: This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the second "Investissements d'Avenir" programme (reference: ANR-15-RHU-0004). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared
that no competing interests exist.
Denote the covariance matrices

B = Cov(R) = E[(R − E[R])(R − E[R])′],
F = Cov(R, S) = E[(R − E[R])(S − E[S])′].

If we assume B is positive definite, i.e. there is no affine relation between the components of R, then

θ = B⁻¹F,  η = E[S] − θ′E[R].

Note that, R₁ denoting the random vector in R^{p+1} such that R₁′ = (R′ 1), θ₁ the (p + 1, q) matrix such that θ₁′ = (θ′ η), B₁ = E[R₁R₁′] and F₁ = E[R₁S′], we obtain θ₁ = B₁⁻¹F₁.

In order to estimate θ (or θ₁), a stochastic approximation process (Xₙ) in R^{p×q} (or R^{(p+1)×q}) is recursively defined such that

X_{n+1} = Xₙ − aₙ(BₙXₙ − Fₙ),

where (aₙ) is a sequence of positive real numbers, possibly constant, called stepsizes (or gains). Matrices Bₙ and Fₙ have the same dimensions as B and F, respectively. The convergence of (Xₙ) towards θ is studied under appropriate definitions and assumptions on Bₙ and Fₙ.
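As a concrete illustration, the recursion above can be simulated in a few lines. The sketch below is ours (not the authors' code): it takes the classical one-observation estimates Bₙ = RₙRₙ′ and Fₙ = RₙSₙ′ with a variable stepsize of the form aₙ = b/(c_g + n^{2/3}), where the constants b = 1 and c_g = 10 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 1
theta_true = rng.normal(size=(p, q))

X = np.zeros((p, q))                        # current estimate of theta
for n in range(1, 20001):
    R = rng.normal(size=(p, 1))             # one observation of R (E[R] = 0 here)
    S = theta_true.T @ R + 0.1 * rng.normal(size=(q, 1))
    Bn = R @ R.T                            # one-observation estimate of B
    Fn = R @ S.T                            # one-observation estimate of F
    a_n = 1.0 / (10 + n ** (2 / 3))         # variable stepsize a_n = b/(c_g + n^(2/3))
    X -= a_n * (Bn @ X - Fn)                # X_{n+1} = X_n - a_n (B_n X_n - F_n)

print(np.abs(X - theta_true).max())         # small: X approaches B^{-1} F = theta
```

Since B = E[RR′] is positive definite here, the iterates converge almost surely to θ = B⁻¹F.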
Suppose that ((R₁ₙ, Sₙ), n ≥ 1) is an i.i.d. sample of (R₁, S). In the case where q = 1, Bₙ = R₁ₙR₁ₙ′ and Fₙ = R₁ₙSₙ′, several studies have been devoted to this stochastic gradient process (see for example Monnez [1], Ljung [2] and references hereafter). In order to accelerate general stochastic approximation procedures, Polyak [3] and Polyak and Juditsky [4] introduced the averaging technique. In the case of linear regression, Györfi and Walk [5] studied an averaged stochastic approximation process with a constant stepsize. With the same type of process, Bach and Moulines [6] proved that the optimal convergence rate is achieved without a strong convexity assumption on the loss function.

However, this type of process may be subject to the risk of numerical explosion when components of R or S exhibit great variances and may have very high values. For datasets used as test sets by Bach and Moulines [6], all sample points whose norm of R is fivefold greater than the average norm are removed. Moreover, generally only one observation of (R, S) is introduced at each step of the process. This may not be convenient for a large amount of data generated by a data stream, for example.
Two modifications of this type of process are thus proposed in this article.
The first change, in order to avoid numerical explosion, is the use of standardized (i.e., zero mean and unit variance) components of R and S. In practice, the expectations and variances of the components are usually unknown and will be estimated online.
The parameter θ can be computed from the standardized components as follows. Let σⱼ be the standard deviation of Rʲ for j = 1,...,p and σ¹ₖ the standard deviation of Sᵏ for k = 1,...,q. Define the diagonal matrices Γ = diag(1/σ₁, ..., 1/σ_p) and Γ¹ = diag(1/σ¹₁, ..., 1/σ¹_q). Let Sᶜ = Γ¹(S − E[S]) and Rᶜ = Γ(R − E[R]). The least squares linear regression of Sᶜ with respect to Rᶜ is achieved by estimating the (p, q) matrix θᶜ such that E[‖Sᶜ − (θᶜ)′Rᶜ‖²] is minimal. Then θᶜ = Γ⁻¹(B⁻¹F)Γ¹, θ = B⁻¹F = Γθᶜ(Γ¹)⁻¹.
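The back-transformation θ = Γθᶜ(Γ¹)⁻¹ can be checked numerically on simulated data. The snippet below is our illustration (empirical moments play the role of the exact ones): the regression coefficients obtained on standardized variables, mapped back through Γ and Γ¹, coincide with those obtained on the raw data.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 100_000, 3, 2
R = rng.normal(size=(N, p)) * np.array([1.0, 5.0, 0.2])   # unequal variances
S = R @ rng.normal(size=(p, q)) + rng.normal(size=(N, q))

Gamma  = np.diag(1.0 / R.std(axis=0))     # Γ  = diag(1/σ_j)
Gamma1 = np.diag(1.0 / S.std(axis=0))     # Γ¹ = diag(1/σ¹_k)
Rm, Sm = R - R.mean(axis=0), S - S.mean(axis=0)
Rc, Sc = Rm @ Gamma, Sm @ Gamma1          # standardized data

B = Rm.T @ Rm / N                         # empirical B = Cov(R)
F = Rm.T @ Sm / N                         # empirical F = Cov(R, S)
theta = np.linalg.solve(B, F)             # θ = B⁻¹F (raw data)

theta_c = np.linalg.solve(Rc.T @ Rc / N, Rc.T @ Sc / N)   # θᶜ (standardized data)
theta_back = Gamma @ theta_c @ np.linalg.inv(Gamma1)      # θ = Γ θᶜ (Γ¹)⁻¹
print(np.abs(theta - theta_back).max())   # zero up to floating-point error
```

The identity is exact (not only asymptotic) as long as the same empirical means and standard deviations are used on both sides.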
The second change is to use, at each step of the process, several observations of (R, S) or an
estimation of B and F computed recursively from all observations until the current step
without storing them.
More precisely, the convergence of three processes with online standardized data is studied
in sections 2, 3, 4 respectively.
First, in section 2, a process with a variable stepsize an and use of several online
standardized observations at each step is studied; note that the number of observations at each step may
vary with n.
Secondly, in section 3, an averaged process with a constant stepsize and use of a varying
number of online standardized observations at each step is studied.
Thirdly, in section 4, a process with a constant or variable stepsize and use of all online
standardized observations until the current step to estimate B and F is studied.
These three processes are tested on several datasets when q = 1, S being a continuous or binary variable, and compared to existing processes in section 5. Note that when S is a binary variable, linear regression is equivalent to a linear discriminant analysis. It appears that the third-defined process most often yields the best results for the same number of observations used or for the same duration of computing time.
These processes belong to the family of stochastic gradient processes and are adapted to data streams. Batch gradient and stochastic gradient methods are presented and compared in [7] and reviewed in [8], including noise reduction methods, like dynamic sample size methods, stochastic variance reduced gradient (also studied in [9]), second-order methods, ADAGRAD [10] and other methods. This work makes the following contributions to the variance reduction methods:
· In [9], the authors proposed a modification of the classical stochastic gradient algorithm that directly reduces the variance of the gradient of the function to be optimized, in order to obtain a faster convergence. It is proposed in this article to reduce this gradient by an online standardization of the data.

· Gradient clipping [11] is another method to avoid a numerical explosion. The idea is to limit the norm of the gradient to a maximum number called the threshold. This number must be chosen, and a bad choice of threshold can affect the computing speed. Moreover, it is then necessary to compare the norm of the gradient to this threshold at each step. In our approach, the limitation of the gradient is implicitly obtained by online standardization of the data.
· If the expectations and the variances of the components of R and S were known, standardization of these variables could be made directly and convergence of the processes obtained using existing theorems. But these moments are unknown in the case of a data stream and are estimated online in this study. Thus the assumptions of the theorems of almost sure (a.s.) convergence of the processes studied in sections 2 and 3, and the corresponding proofs, are more general than the classical ones in the linear regression case [1–5].
· The process defined in section 4 is not a classical batch method. Indeed in this type of
method (gradient descent), the whole set of data is known a priori and is used at each step of
the process. In the present study, new data are supposed to arrive at each step, as in a data
stream, and are added to the preceding set of data, thus reducing by averaging the variance.
This process can be considered as a dynamic batch method.
· A suitable choice of stepsize is often crucial for obtaining good performance of a stochastic gradient process. If the stepsize is too small, the convergence will be slower. Conversely, if the stepsize is too large, a numerical explosion may occur during the first iterations. Following [6], a very simple choice of the stepsize is proposed for the methods with a constant stepsize.
· Another objective is to reduce computing time in order to take into account a maximum of data in the case of a data stream. It appears in the experiments that the use of all observations until the current step without storing them, several observations being introduced at each step, generally yields the largest increase in the convergence speed of the process. Moreover, this can reduce the influence of outliers.

As a whole, the major contributions of this work are to reduce gradient variance by online standardization of the data or use of a "dynamic" batch process, to avoid numerical explosions, to reduce computing time, and consequently to better adapt the stochastic approximation processes used to the case of a data stream.
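The online standardization on which these contributions rely can be maintained without storing past observations. The sketch below is our illustration (the class name is hypothetical, not the authors' code); it uses Welford's recursive update of the mean and of the sum of squared deviations.

```python
import numpy as np

class OnlineStandardizer:
    """Running estimates of E[R] and of the standard deviations of R,
    updated one observation at a time (no storage of past data)."""
    def __init__(self, p):
        self.n = 0
        self.mean = np.zeros(p)
        self.m2 = np.zeros(p)        # sum of squared deviations (Welford)

    def update(self, r):
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)

    def standardize(self, r):
        std = np.sqrt(self.m2 / max(self.n - 1, 1))
        return (r - self.mean) / np.where(std > 0, std, 1.0)

rng = np.random.default_rng(2)
data = rng.normal(loc=10.0, scale=4.0, size=(5000, 2))
stdz = OnlineStandardizer(2)
for r in data:
    stdz.update(r)
print(stdz.mean, np.sqrt(stdz.m2 / (stdz.n - 1)))   # close to 10 and 4
```

Welford's update is numerically stable and produces exactly the same means and sample variances as a two-pass computation.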
2 Convergence of a process with a variable stepsize
Let (Bₙ, n ≥ 1) and (Fₙ, n ≥ 1) be two sequences of random matrices in R^{p×p} and R^{p×q} respectively. In this section, the convergence of the process (Xₙ, n ≥ 1) in R^{p×q} recursively defined by

X_{n+1} = Xₙ − aₙ(BₙXₙ − Fₙ)

and its application to sequential linear regression are studied.

2.1 Theorem

Let X₁ be a random variable in R^{p×q} independent from the sequence of random variables ((Bₙ, Fₙ), n ≥ 1) in R^{p×p} × R^{p×q}. Denote Tₙ the σ-field generated by X₁ and (B₁, F₁),...,(B_{n−1}, F_{n−1}); X₁, X₂,...,Xₙ are Tₙ-measurable. Let (aₙ) be a sequence of positive numbers. Make the following assumptions:

(H1a) There exists a positive definite symmetric matrix B such that a.s.
1) Σ_{n≥1} aₙ ‖E[Bₙ|Tₙ] − B‖ < ∞;
2) Σ_{n≥1} aₙ² E[‖Bₙ − B‖²|Tₙ] < ∞.

(H2a) There exists a matrix F such that a.s.
1) Σ_{n≥1} aₙ ‖E[Fₙ|Tₙ] − F‖ < ∞;
2) Σ_{n≥1} aₙ² E[‖Fₙ − F‖²|Tₙ] < ∞.

(H3a) aₙ > 0, aₙ → 0, Σ_{n≥1} aₙ = ∞.

Theorem 1 Suppose H1a, H2a and H3a hold. Then Xₙ converges to θ = B⁻¹F a.s.
The Robbins-Siegmund lemma [12], used in the proof, is now stated.

Lemma 2 Let (Ω, A, P) be a probability space and (Tₙ) a non-decreasing sequence of sub-σ-fields of A. Suppose for all n that zₙ, αₙ, βₙ and γₙ are four integrable non-negative Tₙ-measurable random variables defined on (Ω, A, P) such that

E[z_{n+1} | Tₙ] ≤ zₙ(1 + αₙ) + βₙ − γₙ.

Then, on the set {Σₙ αₙ < ∞, Σₙ βₙ < ∞}, (zₙ) converges a.s. to a finite random variable and Σₙ γₙ < ∞ a.s.
Proof of Theorem 1. The Frobenius norm ‖A‖ for a matrix A is used. Recall that, if ‖A‖₂ denotes the spectral norm of A, ‖AB‖ ≤ ‖A‖₂‖B‖.

X_{n+1} − θ = Xₙ − θ − aₙ(BₙXₙ − Fₙ) = (I − aₙB)(Xₙ − θ) − aₙ((Bₙ − B)Xₙ − (Fₙ − F)).

Denote Zₙ = (Bₙ − B)Xₙ − (Fₙ − F) = (Bₙ − B)(Xₙ − θ) + (Bₙ − B)θ − (Fₙ − F) and Xₙ¹ = Xₙ − θ. Then:

X¹_{n+1} = (I − aₙB)Xₙ¹ − aₙZₙ,
‖X¹_{n+1}‖² = ‖(I − aₙB)Xₙ¹‖² − 2aₙ⟨(I − aₙB)Xₙ¹, Zₙ⟩ + aₙ²‖Zₙ‖².

Denote λ the smallest eigenvalue of B. As aₙ → 0, we have for n sufficiently large

‖I − aₙB‖₂ ≤ 1 − aₙλ < 1.

Denote βₙ = ‖E[Bₙ|Tₙ] − B‖, δₙ = ‖E[Fₙ|Tₙ] − F‖, bₙ = E[‖Bₙ − B‖²|Tₙ] and dₙ = E[‖Fₙ − F‖²|Tₙ]. Taking the conditional expectation with respect to Tₙ yields almost surely, as ‖Xₙ¹‖ ≤ 1 + ‖Xₙ¹‖²:

|⟨(I − aₙB)Xₙ¹, E[Zₙ|Tₙ]⟩| ≤ ‖Xₙ¹‖ ‖E[Zₙ|Tₙ]‖ ≤ βₙ‖Xₙ¹‖² + (βₙ‖θ‖ + δₙ)(1 + ‖Xₙ¹‖²),
E[‖Zₙ‖²|Tₙ] ≤ 3bₙ‖Xₙ¹‖² + 3bₙ‖θ‖² + 3dₙ,

hence

E[‖X¹_{n+1}‖²|Tₙ] ≤ (1 + aₙ²λ² + 2(1 + ‖θ‖)aₙβₙ + 2aₙδₙ + 3aₙ²bₙ)‖Xₙ¹‖² + 2‖θ‖aₙβₙ + 2aₙδₙ + 3‖θ‖²aₙ²bₙ + 3aₙ²dₙ − 2aₙλ‖Xₙ¹‖².

Applying the Robbins-Siegmund lemma under assumptions H1a, H2a and H3a implies that there exists a non-negative random variable T such that a.s.

‖Xₙ¹‖² → T,  Σₙ aₙ‖Xₙ¹‖² < ∞.

As Σₙ aₙ = ∞, T = 0 a.s., i.e. Xₙ converges to θ a.s. ∎
A particular case with the following assumptions is now studied.

(H1a′) There exist a positive definite symmetric matrix B and a positive real number b such that a.s.
1) for all n, E[Bₙ|Tₙ] = B;
2) supₙ E[‖Bₙ − B‖²|Tₙ] < b.

(H2a′) There exist a matrix F and a positive real number d such that a.s.
1) for all n, E[Fₙ|Tₙ] = F;
2) supₙ E[‖Fₙ − F‖²|Tₙ] < d.

(H3a′) λ denoting the smallest eigenvalue of B,
aₙ = a/nᵅ with a > 0 and 1/2 < α < 1, or aₙ = a/n with a > 1/(2λ).

Theorem 3 Suppose H1a′, H2a′ and H3a′ hold. Then Xₙ converges to θ almost surely and in quadratic mean. Moreover lim sup (1/aₙ) E[‖Xₙ − θ‖²] < ∞.
Proof of Theorem 3. In the proof of theorem 1, take βₙ = 0, δₙ = 0, bₙ < b, dₙ < d; then a.s.:

E[‖X¹_{n+1}‖²|Tₙ] ≤ (1 + (λ² + 3b)aₙ²)‖Xₙ¹‖² + 3(b‖θ‖² + d)aₙ² − 2aₙλ‖Xₙ¹‖².

Taking the mathematical expectation yields:

E[‖X¹_{n+1}‖²] ≤ (1 + (λ² + 3b)aₙ²)E[‖Xₙ¹‖²] + 3(b‖θ‖² + d)aₙ² − 2aₙλE[‖Xₙ¹‖²].

By the Robbins-Siegmund lemma, there exists t ≥ 0 such that E[‖Xₙ¹‖²] → t and Σₙ aₙE[‖Xₙ¹‖²] < ∞. As Σₙ aₙ = ∞, t = 0. Therefore, there exist N ∈ ℕ and f > 0 such that for n > N:

E[‖X¹_{n+1}‖²] ≤ (1 − 2aₙλ)E[‖Xₙ¹‖²] + f aₙ².

Applying a lemma of Schmetterer [13] for aₙ = a/nᵅ with 1/2 < α < 1 yields lim sup nᵅ E[‖Xₙ¹‖²] < ∞. Applying a lemma of Venter [14] for aₙ = a/n with a > 1/(2λ) yields lim sup n E[‖Xₙ¹‖²] < ∞. ∎
2.2 Application to linear regression with online standardized data
Let (R₁, S₁),...,(Rₙ, Sₙ),... be an i.i.d. sample of a random vector (R, S) in R^p × R^q. Let Γ (respectively Γ¹) be the diagonal matrix of order p (respectively q) of the inverses of the standard deviations of the components of R (respectively S).

Define the correlation matrices

B = ΓE[(R − E[R])(R − E[R])′]Γ,
F = ΓE[(R − E[R])(S − E[S])′]Γ¹.

Suppose that B⁻¹ exists. Let θ = B⁻¹F.

Denote R̄ₙ (respectively S̄ₙ) the mean of the n-sample (R₁, R₂,...,Rₙ) of R (respectively (S₁, S₂,...,Sₙ) of S). Denote (Vₙʲ)² the variance of the n-sample (R₁ʲ, R₂ʲ,...,Rₙʲ) of the jth component Rʲ of R, and (Vₙ¹ᵏ)² the variance of the n-sample (S₁ᵏ, S₂ᵏ,...,Sₙᵏ) of the kth component Sᵏ of S.

Denote Γₙ (respectively Γₙ¹) the diagonal matrix of order p (respectively q) whose element (j, j) (respectively (k, k)) is the inverse of √(n/(n−1)) Vₙʲ (respectively √(n/(n−1)) Vₙ¹ᵏ).
At step n, mₙ new observations, indexed by Iₙ with Mₙ = m₁ + ... + mₙ, are used, and Bₙ and Fₙ are defined from the online standardized observations:

Bₙ = Γ_{Mₙ₋₁} [ (1/mₙ) Σ_{i∈Iₙ} (Rᵢ − R̄_{Mₙ₋₁})(Rᵢ − R̄_{Mₙ₋₁})′ ] Γ_{Mₙ₋₁},
Fₙ = Γ_{Mₙ₋₁} [ (1/mₙ) Σ_{i∈Iₙ} (Rᵢ − R̄_{Mₙ₋₁})(Sᵢ − S̄_{Mₙ₋₁})′ ] Γ¹_{Mₙ₋₁}.

Corollary 4 Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose moreover that assumption H3a″ holds:

(H3a″) aₙ > 0, Σ_{n≥1} aₙ/√n < ∞, Σ_{n≥1} aₙ² < ∞.

Then Xₙ converges to θ a.s.
This process was tested on several datasets and some results are given in section 5 (process
S11 for mn = 1 and S12 for mn = 10).
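One plausible implementation of this type of process is sketched below. It is our illustration, not the authors' code: the helper name, the warm-up phase of init_batches batches used to initialize the means and standard deviations (mirroring the initialization with 1000 observations described in section 5), and the exact form of Bₙ and Fₙ are assumptions.

```python
import numpy as np

def standardized_sgd(stream, p, q, alpha=2/3, init_batches=100):
    """Sketch of a process of type S11/S12: at step n, the m_n new observations
    are standardized with the running estimates of the means and standard
    deviations built from all PREVIOUS observations, then used to update X
    with the variable stepsize a_n = 1/n**alpha."""
    mean_r, m2_r = np.zeros(p), np.zeros(p)
    mean_s, m2_s = np.zeros(q), np.zeros(q)
    M = 0                                    # observations seen so far
    X = np.zeros((p, q))
    for n, (Rb, Sb) in enumerate(stream, start=1):
        if n > init_batches:                 # warm-up phase: moments only
            sd_r = np.sqrt(m2_r / (M - 1))
            sd_s = np.sqrt(m2_s / (M - 1))
            Rc = (Rb - mean_r) / sd_r        # online standardized observations
            Sc = (Sb - mean_s) / sd_s
            Bn = Rc.T @ Rc / len(Rb)
            Fn = Rc.T @ Sc / len(Rb)
            X -= (Bn @ X - Fn) / n ** alpha  # X_{n+1} = X_n - a_n (B_n X_n - F_n)
        for r, s in zip(Rb, Sb):             # Welford updates of means/variances
            M += 1
            d = r - mean_r; mean_r += d / M; m2_r += d * (r - mean_r)
            d = s - mean_s; mean_s += d / M; m2_s += d * (s - mean_s)
    return X
```

With i.i.d. batches and S = θ′R + ε, X then approaches the standardized coefficient matrix θᶜ = Γ⁻¹θΓ¹, from which θ can be recovered as Γθᶜ(Γ¹)⁻¹.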
The following lemma is first proved.
Lemma 5 Suppose the moments of order 4 of R exist and aₙ > 0, Σ_{n≥1} aₙ/√n < ∞. Then Σ_{n≥1} aₙ‖R̄_{Mₙ₋₁} − E[R]‖ < ∞ and Σ_{n≥1} aₙ‖Γ_{Mₙ₋₁} − Γ‖ < ∞ a.s.

Proof of Lemma 5. The usual Euclidean norm for vectors and the spectral norm for matrices are used in the proof. Denote Var(R) = E[‖R − E[R]‖²] = Σ_{j=1}^p Var(Rʲ).

Step 1:

E[‖R̄_{Mₙ₋₁} − E[R]‖²] = Σ_{j=1}^p Var(R̄ʲ_{Mₙ₋₁}) = Σ_{j=1}^p Var(Rʲ)/Mₙ₋₁ ≤ Var(R)/(n − 1),

since Mₙ₋₁ ≥ n − 1. Then

Σ_{n>1} aₙ E[‖R̄_{Mₙ₋₁} − E[R]‖] ≤ Σ_{n>1} aₙ √(Var(R)/(n − 1)) < ∞ by H3a″,

which implies Σₙ aₙ‖R̄_{Mₙ₋₁} − E[R]‖ < ∞ a.s. Likewise Σₙ aₙ‖S̄_{Mₙ₋₁} − E[S]‖ < ∞ a.s.

Step 2:

‖Γ_{Mₙ₋₁} − Γ‖ ≤ max_{j=1,...,p} | 1/(√(Mₙ₋₁/(Mₙ₋₁ − 1)) Vʲ_{Mₙ₋₁}) − 1/σⱼ |.

Bounding each term, using the existence of the moments of order 4 of R, yields E[‖Γ_{Mₙ₋₁} − Γ‖] ≤ C/√(Mₙ₋₁) for a constant C. Then, by H3a″, as Mₙ₋₁ ≥ n − 1,

Σ_{n>1} aₙ E[‖Γ_{Mₙ₋₁} − Γ‖] ≤ C Σ_{n>1} aₙ/√(n − 1) < ∞,

hence Σₙ aₙ‖Γ_{Mₙ₋₁} − Γ‖ < ∞ a.s. ∎
Proof of Corollary 4.

Step 1: prove that assumption H1a1 of theorem 1 is verified. Denote Rᶜ = R − E[R] and Rʲᶜ = Rʲ − E[Rʲ]. Expressing E[Bₙ|Tₙ] − B in terms of Γ_{Mₙ₋₁} − Γ and R̄_{Mₙ₋₁} − E[R] and applying lemma 5 under H3a″ yields Σₙ aₙ‖E[Bₙ|Tₙ] − B‖ < ∞ a.s.

Step 2: prove that assumption H1a2 of theorem 1 is verified. ‖Bₙ − B‖² ≤ 2‖Bₙ‖² + 2‖B‖², so that a.s.

E[‖Bₙ − B‖²|Tₙ] ≤ 2⁴‖Γ_{Mₙ₋₁}‖⁴ (E[‖Rᶜ‖⁴] + ‖R̄ᶜ_{Mₙ₋₁}‖⁴) + 2‖ΓE[RᶜRᶜ′]Γ‖².

As Γ_{Mₙ₋₁} and R̄ᶜ_{Mₙ₋₁} converge respectively to Γ and 0 a.s., supₙ E[‖Bₙ − B‖²|Tₙ] < ∞ a.s., and since Σₙ aₙ² < ∞ by H3a″, it follows that Σₙ aₙ²E[‖Bₙ − B‖²|Tₙ] < ∞ a.s. The same arguments applied to Fₙ prove H2a, and H3a″ implies H3a. Theorem 1 then yields the result. ∎
3 Convergence of an averaged process with a constant stepsize

In this section, the process (Xₙ, n ≥ 1) with a constant stepsize a and the averaged process (Yₙ, n ≥ 1) in R^{p×q} are recursively defined by

X_{n+1} = Xₙ − a(BₙXₙ − Fₙ),
Y_{n+1} = Yₙ + (1/(n+1))(X_{n+1} − Yₙ), i.e. Y_{n+1} = (1/(n+1)) Σ_{j=1}^{n+1} Xⱼ,

(n ≥ 1), and their application to sequential linear regression is studied.
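In simulation, the pair of recursions can be written directly. The sketch below is ours (not the authors' code), taking the one-observation estimates Bₙ = RₙRₙ′, Fₙ = RₙSₙ′ and an illustrative constant stepsize:

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 3, 1
theta_true = rng.normal(size=(p, q))

a = 0.05                       # constant stepsize (must satisfy H3b below)
X = np.zeros((p, q))
Y = np.zeros((p, q))           # averaged iterate
for n in range(1, 50001):
    R = rng.normal(size=(p, 1))
    S = theta_true.T @ R + 0.1 * rng.normal(size=(q, 1))
    X = X - a * ((R @ R.T) @ X - R @ S.T)   # X_{n+1} = X_n - a (B_n X_n - F_n)
    Y = Y + (X - Y) / (n + 1)               # Y_{n+1} = Y_n + (X_{n+1} - Y_n)/(n+1)
print(np.abs(Y - theta_true).max())
```

The non-averaged iterate X keeps fluctuating at a scale governed by the constant stepsize, while the averaged iterate Y smooths these fluctuations out and converges to θ.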
3.1 Lemma

Lemma 6 Let (uₙ), (vₙ) and (aₙ) be three real sequences, with uₙ ≥ 0 and aₙ > 0 for all n, and let λ be a positive real number such that, for n ≥ 1,

u_{n+1} ≤ (1 − aₙλ)uₙ + aₙvₙ.

Suppose:
1) vₙ → 0;
2) Σ_{n≥1} aₙ = ∞ and, for all n, 0 < 1 − aₙλ < 1 (as aₙ → 0 or aₙ is a small enough constant, we can suppose this without loss of generality).
Then uₙ → 0.
3.2 Theorem

Make the following assumptions:

(H1b) There exist a positive definite symmetric matrix B in R^{p×p} and a positive real number b such that a.s.
1) lim_{n→∞} (E[Bₙ|Tₙ] − B) = 0;
2) supₙ E[‖Bₙ − B‖²|Tₙ] ≤ b.

(H2b) There exists a matrix F in R^{p×q} such that a.s.
1) lim_{n→∞} (E[Fₙ|Tₙ] − F) = 0;
2) supₙ E[‖Fₙ − F‖²|Tₙ] < ∞.

(H3b) λ and λ_max being respectively the smallest and the largest eigenvalue of B,
0 < a < min(1/λ_max, 2λ/(λ² + b)).

Theorem 7 Suppose H1b, H2b and H3b hold. Then Yₙ converges to θ = B⁻¹F a.s.

Remark 1 Györfi and Walk [5] proved that Yₙ converges to θ a.s. and in quadratic mean under the assumptions E[Bₙ|Tₙ] = B, E[Fₙ|Tₙ] = F, H1b2 and H2b2. Theorem 7 is an extension of their a.s. convergence result when E[Bₙ|Tₙ] → B and E[Fₙ|Tₙ] → F a.s.

Remark 2 Define R₁ such that R₁′ = (R′ 1), B = E[R₁R₁′], F = E[R₁S′]. If ((R₁ₙ, Sₙ), n ≥ 1) is an i.i.d. sample of (R₁, S) whose moments of order 4 exist, assumptions H1b and H2b are verified for Bₙ = R₁ₙR₁ₙ′ and Fₙ = R₁ₙSₙ′, as E[R₁ₙR₁ₙ′|Tₙ] = E[R₁R₁′] = B and E[R₁ₙSₙ′|Tₙ] = F.
Proof of Theorem 7. Denote Xₙ¹ = Xₙ − θ, Yₙ¹ = Yₙ − θ = (1/n) Σ_{j=1}^n Xⱼ¹ and

Zₙ = (Bₙ − B)Xₙ − (Fₙ − F) = (Bₙ − B)Xₙ¹ + (Bₙ − B)θ − (Fₙ − F).

Then X¹_{n+1} = (I − aB)Xₙ¹ − aZₙ, and taking the Frobenius norm of Y¹_{n+1} yields

‖Y¹_{n+1}‖ ≤ ‖(I − aB)Yₙ¹‖ + a ‖(1/(n+1)) Σ_{j=1}^n Zⱼ‖ + o(1).

Under H3b, all the eigenvalues of I − aB are positive and the spectral norm of I − aB is equal to 1 − aλ. By lemma 6, it then suffices to prove that (1/n) Σ_{j=1}^n Zⱼ → 0 a.s. to conclude that Yₙ¹ → 0 a.s.

This convergence is obtained in two steps. First, introduce the events Gₙ on which ‖E[Bₙ|Tₙ] − B‖ and ‖E[Fₙ|Tₙ] − F‖ are smaller than a suitably chosen η > 0; by H1b1 and H2b1 these events occur for n large a.s. Taking the conditional expectation with respect to Tₙ of ‖X¹_{n+1}‖² I_{G_{n+1}} and using ‖I − aB‖ = 1 − aλ together with H1b2 and H2b2 yields

E[‖X¹_{n+1}‖² I_{G_{n+1}}] ≤ ρ E[‖Xₙ¹‖² I_{Gₙ}] + e, with ρ < 1 by the choice of a in H3b and of η,

hence supₙ E[‖Xₙ¹‖² I_{Gₙ}] < ∞.

Secondly, this bound and H1b2 imply

Σ_{n≥1} (1/n²) E[‖(Bₙ − B)Xₙ¹ I_{Gₙ}‖²] < ∞,

so that the martingale-difference parts of (1/n) Σ_{j=1}^n Zⱼ converge to 0 a.s., while the remaining terms converge to 0 by H1b1 and H2b1. Therefore (1/n) Σ_{j=1}^n Zⱼ → 0 a.s. ∎
3.3 Application to linear regression with online standardized data

Define Bₙ and Fₙ as in section 2:

Bₙ = Γ_{Mₙ₋₁} [ (1/mₙ) Σ_{i∈Iₙ} (Rᵢ − R̄_{Mₙ₋₁})(Rᵢ − R̄_{Mₙ₋₁})′ ] Γ_{Mₙ₋₁},
Fₙ = Γ_{Mₙ₋₁} [ (1/mₙ) Σ_{i∈Iₙ} (Rᵢ − R̄_{Mₙ₋₁})(Sᵢ − S̄_{Mₙ₋₁})′ ] Γ¹_{Mₙ₋₁}.

Denote U = (R − E[R])(R − E[R])′, B = ΓE[U]Γ the correlation matrix of R, λ and λ_max respectively the smallest and the largest eigenvalue of B, b₁ = E[‖ΓUΓ − B‖²], and F = ΓE[(R − E[R])(S − E[S])′]Γ¹.

Corollary 8 Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose H3b1 holds:

(H3b1) 0 < a < min(1/λ_max, 2λ/(λ² + b₁)).

Then Yₙ converges to θ = B⁻¹F a.s.

The proof consists in verifying the assumptions of theorem 7, with Rᶜ = R − E[R] and R̄ᶜ_{Mₙ₋₁} = R̄_{Mₙ₋₁} − E[R]. In particular, there exists d > 0 such that

E[‖Γ_{Mₙ₋₁} − Γ‖ I_G] ≤ d/√(Mₙ₋₁) ≤ d/√(n − 1),

which, together with the bounds of lemma 5, is used to prove that assumptions H1b and H2b are verified. ∎
4 Convergence of a process with a variable or constant stepsize and use of all observations until the current step

In this section, the convergence of the process (Xₙ, n ≥ 1) in R^{p×q} recursively defined by

X_{n+1} = Xₙ − aₙ(BₙXₙ − Fₙ),

Bₙ and Fₙ being now computed recursively from all the observations until the current step without storing them, and its application to sequential linear regression are studied.

4.1 Theorem

Theorem 9 Suppose that Bₙ → B a.s. and Fₙ → F a.s., and that (aₙ) is a sequence of positive numbers, either converging to 0 with Σₙ aₙ = ∞, or constant, aₙ = a with aλ_max < 1, λ_max denoting the largest eigenvalue of B. Then Xₙ converges to θ = B⁻¹F a.s.

Proof. Let ω be fixed, belonging to the intersection of the convergence sets {Bₙ → B} and {Fₙ → F}. The writing of ω is omitted in the following. Denote ‖A‖ the spectral norm of a matrix A and λ the smallest eigenvalue of B.

In the case of aₙ depending on n, as aₙ → 0, we can suppose without loss of generality aₙλ_max < 1 for all n. Then all the eigenvalues of I − aₙB are positive and ‖I − aₙB‖ = 1 − aₙλ. Let ε be such that 0 < ε < λ. As Bₙ → B, we obtain for n sufficiently large:

‖I − aₙBₙ‖ ≤ ‖I − aₙB‖ + aₙ‖Bₙ − B‖ ≤ 1 − aₙλ + aₙε = 1 − aₙ(λ − ε), with aₙ(λ − ε) < 1.

Denoting Xₙ¹ = Xₙ − θ and Zₙ = Bₙθ − Fₙ, it follows that

‖X¹_{n+1}‖ ≤ (1 − aₙ(λ − ε))‖Xₙ¹‖ + aₙ‖Zₙ‖.

As Zₙ = Bₙθ − Fₙ → Bθ − F = 0, applying lemma 6 yields ‖Xₙ¹‖ → 0. Therefore Xₙ → B⁻¹F a.s. ∎
Remark 4 In the definition of Bₙ and Fₙ, the Rⱼ and the Sⱼ are not directly pseudo-centered with respect to R̄_{Mₙ} and S̄_{Mₙ} respectively. Another equivalent definition of Bₙ and Fₙ can be used. It consists of replacing Rⱼ by Rⱼ − m, R̄_{Mₙ} by R̄_{Mₙ} − m, Sⱼ by Sⱼ − m¹, S̄_{Mₙ} by S̄_{Mₙ} − m¹, m and m¹ being respectively an estimation of E[R] and E[S] computed in a preliminary phase with a small number of observations. For example, at step n, Σ_{j∈Iₙ} Γ_{Mₙ}(Rⱼ − m)(Γ_{Mₙ}(Rⱼ − m))′ is computed instead of Σ_{j∈Iₙ} (Γ_{Mₙ}Rⱼ)(Γ_{Mₙ}Rⱼ)′. This limits the risk of numerical explosion.

This process was tested on several datasets and some results are given in section 5 (with a variable stepsize: process S13 for mₙ = 1 and S14 for mₙ = 10; with a constant stepsize: process S31 for mₙ = 1 and S32 for mₙ = 10).
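A sketch of this "dynamic batch" process is given below. It is our illustration, not the authors' implementation: the helper name, the preliminary phase, and the bookkeeping through running sums are assumptions; only the principle (re-estimating B and F at each step from all observations seen so far, without storing them, with the constant stepsize a = 1/p) comes from the text.

```python
import numpy as np

def dynamic_batch_regression(stream, p, q, a=None, init_batches=10):
    """Sketch of a process of type S31/S32: B_n and F_n are re-estimated at
    each step from ALL observations seen so far, via running sums (no data
    are stored), and a constant stepsize a = 1/p is used by default."""
    if a is None:
        a = 1.0 / p                  # trace of the correlation matrix of R is p
    M = 0
    sum_r, sum_s = np.zeros(p), np.zeros(q)
    sum_s2 = np.zeros(q)
    sum_rr = np.zeros((p, p))        # running sum of R R'
    sum_rs = np.zeros((p, q))        # running sum of R S'
    X = np.zeros((p, q))
    for n, (Rb, Sb) in enumerate(stream, start=1):
        M += len(Rb)
        sum_r += Rb.sum(axis=0); sum_s += Sb.sum(axis=0)
        sum_s2 += (Sb ** 2).sum(axis=0)
        sum_rr += Rb.T @ Rb
        sum_rs += Rb.T @ Sb
        if n <= init_batches:        # preliminary phase: moments only
            continue
        mean_r, mean_s = sum_r / M, sum_s / M
        cov_rr = sum_rr / M - np.outer(mean_r, mean_r)
        cov_rs = sum_rs / M - np.outer(mean_r, mean_s)
        g  = 1.0 / np.sqrt(np.diag(cov_rr))            # estimates of 1/σ_j
        g1 = 1.0 / np.sqrt(sum_s2 / M - mean_s ** 2)   # estimates of 1/σ¹_k
        Bn = cov_rr * np.outer(g, g)                   # online correlation matrix
        Fn = (g[:, None] * cov_rs) * g1[None, :]
        X = X - a * (Bn @ X - Fn)    # X_{n+1} = X_n - a (B_n X_n - F_n)
    return X
```

Because Bₙ and Fₙ average over all Mₙ observations, their variance shrinks as the stream progresses, which is what allows a constant stepsize as large as 1/p to be used safely; the iterate approaches the standardized coefficient matrix θᶜ.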
5 Experiments
The three previously-defined processes of stochastic approximation with online standardized data were compared with the classical stochastic approximation and averaged stochastic approximation (or averaged stochastic gradient descent) processes with constant stepsize (denoted ASGD) studied in [5] and [6]. A description of the methods along with the abbreviations and parameters used is given in Table 1.

With the variable S set at dimension 1, 11 datasets were considered, some of which are available in free access on the Internet, while others were derived from the EPHESUS study [15]: 6 in regression (continuous dependent variable) and 5 in linear discriminant analysis (binary dependent variable). All datasets used in our experiments are presented in detail in Table 2, along with their download links. An a priori selection of variables was performed on each dataset using a stepwise procedure based on Fisher's test with p-to-enter and p-to-remove fixed at 5 percent.
Let D = {(rᵢ, sᵢ), i = 1, 2,...,N} be the set of data in R^p × R. Assuming that D represents the set of realizations of a random vector (R, S) uniformly distributed in D, minimizing E[(S − θ′R − η)²] is equivalent to minimizing (1/N) Σ_{i=1}^N (sᵢ − θ′rᵢ − η)². One element of D (or several) is used at each step of the process.

In Table 2, N denotes the size of the global sample, pa the number of parameters available, p the number of parameters selected and T² the trace of E[RR′]; an outlier is defined as an observation whose L2 norm is greater than five times the average norm.
To compare the methods, two different studies were performed: one by setting the total
number of observations used, the other by setting the computing time.
The choice of stepsize, the initialization for each method and the convergence criterion
used are respectively presented and commented below.
Choice of stepsize

In all methods of stochastic approximation, a suitable choice of stepsize is often crucial for obtaining good performance of the process. If the stepsize is too small, the convergence rate will be slower. Conversely, if the stepsize is too large, a numerical explosion phenomenon may occur during the first iterations.

For the processes with a variable stepsize (processes C1 to C4 and S11 to S14), we chose to use aₙ of the following type:

aₙ = b / (c_g + nᵅ),

with α = 2/3, a classical choice of exponent for stochastic approximation in linear regression, and b = 1. The results obtained for this choice of c_g are presented, although the latter does not correspond to the best choice for a classical method.
For the ASGD method (A1, A2), two different constant stepsizes a as used in [6] were tested: a = 1/T² and a = 1/(2T²), T² denoting the trace of E[RR′]. Note that this choice of constant stepsize assumes knowing the dataset a priori and is not suitable for a data stream. For the processes with online standardized data and a constant stepsize, a = 1/p was chosen since the matrix E[RR′] is then the correlation matrix of R, whose trace is equal to p, such that this choice corresponds to that of [6].
Initialization of processes
All processes (Xₙ) were initialized by X₁ = 0, the null vector. For the processes with standardization, a small number of observations (n = 1000) were taken into account in order to calculate an initial estimate of the means and standard deviations.
Convergence criterion

The "theoretical vector" θ₁ is assigned as that obtained by the least squares method in D, such that θ₁′ = (θ′ η). Let θ̂₁ⁿ⁺¹ be the estimator of θ₁ obtained by stochastic approximation after n iterations.

In the case of a process (Xₙ) with standardized data, which yields an estimation of the matrix denoted θᶜ in section 1, θ̂₁ⁿ⁺¹ is obtained using θ = Γθᶜ(Γ¹)⁻¹ and η = E[S] − θ′E[R], the moments being replaced by their current estimates. The criterion used is the cosine of the angle between θ̂₁ⁿ⁺¹ and θ₁. Other criteria, such as ‖θ₁ − θ̂₁ⁿ⁺¹‖²/‖θ₁‖² or (f(θ̂₁ⁿ⁺¹) − f(θ₁))/f(θ₁), f being the loss function, were also tested, although the results are not presented in this article.
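For example, the cosine criterion can be computed as follows (our helper, for illustration):

```python
import numpy as np

def cosine_criterion(theta1_hat, theta1):
    """Cosine of the angle between the estimate and the least squares
    'theoretical vector' theta1; values near 1 indicate convergence."""
    return float(np.sum(theta1_hat * theta1)) / (
        np.linalg.norm(theta1_hat) * np.linalg.norm(theta1))

theta1 = np.array([1.0, -2.0, 0.5])
print(cosine_criterion(2 * theta1, theta1))       # same direction: ~1
print(cosine_criterion(theta1 + 0.1, theta1))     # slightly off: just below 1
```

Being scale-invariant, this criterion measures only the direction of the estimated vector, which is what matters for ranking methods across datasets with very different coefficient magnitudes.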
5.1 Study for a fixed total number of observations used
For all N observations used by the algorithm (N being the size of D) up to a maximum of 100N
observations, the criterion value associated with each method and for each dataset was
recorded. The results obtained after using 10N observations are provided in Table 3.
As can be seen in Table 3, a numerical explosion occurred in most datasets using the classical methods with raw data and a variable stepsize (C1 to C4). As noted in Table 2, these datasets had a high T² = Tr(E[RR′]).

[Table 3, mean rank by method: 11.6, 12.2, 9.9, 12.3, 9.2, 8.8, 5.2, 6.1, 3.7, 6.9, 10.5, 4.1, 2.3, 2.2]

Fig 1. Results obtained for dataset POLY using 10N and 100N observations: A/ process C1 with variable stepsize aₙ = 1/n^{2/3}, B/ process C1 with variable stepsize aₙ = 1/(b·n^{2/3}) by varying b, C/ process S21 by varying the constant stepsize a.

The corresponding methods S11 to S14, using the same variable stepsize but with online standardized data, quickly converged in most cases. However, classical methods with raw data can yield good results for a suitable choice of stepsize, as demonstrated by the results obtained for the POLY dataset in Fig 1. The numerical explosion can arise from a too high stepsize when n is small. This phenomenon can be avoided if the stepsize is reduced, although if the latter is too small, the convergence rate will be slowed. Hence, the right balance must be found between stepsize and convergence rate. Furthermore, the choice of this stepsize generally depends on the dataset, which is not known a priori in the case of a data stream. In conclusion, methods with standardized data appear to be more robust to the choice of stepsize.
1 1
The ASGD method (A1 with constant stepsize a T2 and A2 with a 2T2) did not yield
good results except for the RINGNORM and TWONORM datasets which were obtained by
simulation (note that all methods functioned very well for these two datasets). Of note, A1
exploded for the QUANTUM dataset containing 1068 observations (2.1%) whose L2 norm
was fivefold greater than the average norm (Table 2). The corresponding method S21 with
1
online standardized data yielded several numerical explosions with the a stepsize,
howp
ever these explosions disappeared when using a smaller stepsize (see Fig 1). Of note, it is
1 2l 1 1
assumed in corollary 8 that 0 < a < min ; in the case of a , only a <
lmax ; l2 b1 p lmax
is certain.
Finally, for methods S31 and S32 with standardized data, the use of all observations until
1
the current step and the very simple choice of the constant stepsize a
p
uniformly yielded
good results.
Thereafter, for each fixed number of observations used and for each dataset, the 14 methods were ranked from the best (the highest cosine) to the worst (the lowest cosine), assigning each a rank from 1 to 14 respectively, after which the mean rank over all 11 datasets was calculated for each method. A total of 100 mean rank values were calculated for a number of observations used varying from N to 100N. The graph depicting the change in mean rank based on the number of observations used and the boxplot of the mean rank are shown in Fig 2.

Fig 2. Results for a fixed total number of observations used: A/ change in the mean rank based on the number of observations used, B/ boxplot of the mean rank by method.
Overall, for these 11 datasets, a method with standardized data, a constant stepsize and use
of all observations until the current step (S31, S32) represented the best method when the total
number of observations used was fixed.
5.2 Study for a fixed processing time
[Fig 3, mean rank by method: 12.2, 9.9, 10.6, 10.1, 9.0, 8.6, 5.8, 5.8, 6.1, 6.1, 11.5, 3.1, 4.5, 1.5]

Fig 3. Results for a fixed processing time: A/ change in the mean rank based on the processing time, B/ boxplot of the mean rank by method.
Fig 3. Results for a fixed processing time: A/ change in the mean rank based on the processing time, B/ boxplot of the mean rank by method.
6 Conclusion
In the present study, three processes with online standardized data were defined and their a.s. convergence was proven.

A stochastic approximation method with standardized data appears to be advantageous compared to a method with raw data. First, it is easier to choose the stepsize: for processes S31 and S32 for example, the definition of a constant stepsize only requires knowing the number of parameters p. Secondly, the standardization usually allows avoiding the phenomenon of numerical explosion often observed in the examples given with a classical method.
The use of all observations until the current step can reduce the influence of outliers and
increase the convergence rate of a process. Moreover, this approach is particularly adapted to
the case of a data stream.
Finally, among all processes tested on 11 different datasets (linear regression or linear discriminant analysis), the best was a method using standardization, a constant stepsize equal to 1/p, all observations until the current step, and the use of several new observations at each step.

Author Contributions

Conceptualization: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.
Formal analysis: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.
Writing - original draft: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.

References
1. Monnez JM. Le processus d'approximation stochastique de Robbins-Monro: résultats théoriques; estimation séquentielle d'une espérance conditionnelle. Statistique et Analyse des Données. 1979;4(2):11-29.
2. Ljung L. Analysis of stochastic gradient algorithms for linear regression problems. IEEE Transactions on Information Theory. 1984;30(2):151-160. https://doi.org/10.1109/TIT.1984.1056895
3. Polyak BT. New method of stochastic approximation type. Automation and Remote Control. 1990;51(7):937-946.
4. Polyak BT, Juditsky AB. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization. 1992;30(4):838-855. https://doi.org/10.1137/0330046
5. Györfi L, Walk H. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization. 1996;34(1):31-61. https://doi.org/10.1137/S0363012992226661
6. Bach F, Moulines E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems. 2013;773-781.
7. Bottou L, Le Cun Y. Online learning for very large data sets. Applied Stochastic Models in Business and Industry. 2005;21(2):137-151. https://doi.org/10.1002/asmb.538
8. Bottou L, Curtis FE, Nocedal J. Optimization methods for large-scale machine learning. arXiv:1606.04838v2. 2017.
9. Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems. 2013;315-323.
10. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 2011;12:2121-2159.
11. Pascanu R, Mikolov T, Bengio Y. Understanding the exploding gradient problem. arXiv:1211.5063v1. 2012.
12. Robbins H, Siegmund D. A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi JS, editor. Optimizing Methods in Statistics. New York: Academic Press; 1971. p. 233-257.
13. Schmetterer L. Multidimensional stochastic approximation. In: Multivariate Analysis II, Proc. 2nd Int. Symp., Dayton, Ohio. Academic Press; 1969. p. 443-460.
14. Venter JH. On Dvoretzky stochastic approximation theorems. The Annals of Mathematical Statistics. 1966;37:1534-1544. https://doi.org/10.1214/aoms/1177699145
15. Pitt B, Remme W, Zannad F, et al. Eplerenone, a selective aldosterone blocker, in patients with left ventricular dysfunction after myocardial infarction. New England Journal of Medicine. 2003;348(14):1309-1321. https://doi.org/10.1056/NEJMoa030207 PMID: 12668699
16. Xu W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv:1107.2490v2. 2011.