Sequential linear regression with online standardized data

PLOS ONE, Jan 2018

The present study addresses the problem of sequential least square multidimensional linear regression, particularly in the case of a data stream, using a stochastic approximation process. To avoid the phenomenon of numerical explosion which can be encountered and to reduce the computing time in order to take into account a maximum of arriving data, we propose using a process with online standardized data instead of raw data and the use of several observations per step or all observations until the current step. Herein, we define and study the almost sure convergence of three processes with online standardized data: a classical process with a variable step-size and use of a varying number of observations per step, an averaged process with a constant step-size and use of a varying number of observations per step, and a process with a variable or constant step-size and use of all observations until the current step. Their convergence is obtained under more general assumptions than classical ones. These processes are compared to classical processes on 11 datasets for a fixed total number of observations used and thereafter for a fixed processing time. Analyses indicate that the third-defined process typically yields the best results.

Full-text PDF: http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0191186&type=printable


Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson

Affiliations: Université de Lorraine, Institut Elie Cartan de Lorraine, UMR 7502, Vandoeuvre-lès-Nancy, F-54506, France; Project-Team BIGS, INRIA, Villers-lès-Nancy, F-54600, France; INSERM U1116, Centre d'Investigations Cliniques Plurithématique 1433, Université de Lorraine, Nancy, France; Université de Lorraine, Institut Universitaire de Technologie Nancy-Charlemagne, Nancy, F-54052, France; BIOBASE, Pôle S2R, CHRU de Nancy, Vandoeuvre-lès-Nancy, France; Faculté de Médecine, InSciDenS, Vandoeuvre-lès-Nancy, France.

Editor: Chenping Hou, National University of Defense Technology, China.

Data Availability Statement: All datasets used in our experiments except those derived from the EPHESUS study are available online, and links to download these data appear in Table 2 of our article. Due to legal restrictions, data from the EPHESUS study are only available upon request. Interested researchers may request access to the data upon approval from the EPHESUS Executive Steering Committee of the study. This committee can be reached through Pr Faiez Zannad, who is a member of this board.

Funding: This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the second "Investissements d'Avenir" programme (reference: ANR-15-RHU0004). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Let R be a random vector in R^p and S a random vector in R^q. Considering the least squares multidimensional linear regression of S with respect to R, the (p, q) matrix θ and the (q, 1) matrix η are estimated such that E[||S − θ'R − η||²] is minimal. Denote the covariance matrices

B = Covar[R] = E[(R − E[R])(R − E[R])'],
F = Covar[R, S] = E[(R − E[R])(S − E[S])'].

If we assume B is positive definite, i.e. there is no affine relation between the components of R, then

θ = B^{-1}F, η = E[S] − θ'E[R].

Note that, R1 denoting the random vector in R^{p+1} such that R1' = (R' 1), θ1 the (p + 1, q) matrix such that θ1' = (θ' η), B1 = E[R1R1'] and F1 = E[R1S'], we obtain θ1 = B1^{-1}F1.
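For illustration, the following Python sketch (not taken from the article; the simulated data, constants and variable names are purely illustrative) estimates θ and η from a sample and checks the augmented formulation θ1 = B1^{-1}F1.

```python
# Minimal numerical illustration of the population formulas above:
# theta = B^{-1} F, eta = E[S] - theta' E[R], and the augmented form theta1 = B1^{-1} F1.
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 4, 2, 100_000
R = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))     # correlated predictors
theta_true = rng.normal(size=(p, q))
S = R @ theta_true + 3.0 + 0.1 * rng.normal(size=(n, q))  # linear model plus noise

B = np.cov(R, rowvar=False)                               # Covar[R], a (p, p) matrix
F = np.cov(np.hstack([R, S]), rowvar=False)[:p, p:]       # Covar[R, S], a (p, q) matrix
theta = np.linalg.solve(B, F)                             # theta = B^{-1} F
eta = S.mean(axis=0) - theta.T @ R.mean(axis=0)           # eta = E[S] - theta' E[R]

# Augmented form: R1' = (R', 1), theta1' = (theta', eta), B1 = E[R1 R1'], F1 = E[R1 S']
R1 = np.hstack([R, np.ones((n, 1))])
theta1 = np.linalg.solve(R1.T @ R1 / n, R1.T @ S / n)     # theta1 = B1^{-1} F1
print(np.abs(theta1[:p] - theta).max(), np.abs(theta1[p] - eta).max())  # both near 0
```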
In order to estimate θ (or θ1), a stochastic approximation process (Xn) in R^{p×q} (or R^{(p+1)×q}) is recursively defined such that

X_{n+1} = X_n − a_n(B_nX_n − F_n),

where (a_n) is a sequence of positive real numbers, eventually constant, called step-sizes (or gains). Matrices B_n and F_n have the same dimensions as B and F, respectively. The convergence of (X_n) towards θ is studied under appropriate definitions and assumptions on B_n and F_n. Suppose that ((R1_n, S_n), n ≥ 1) is an i.i.d. sample of (R1, S). In the case where q = 1, B_n = R1_nR1_n' and F_n = R1_nS_n', several studies have been devoted to this stochastic gradient process (see for example Monnez [ 1 ], Ljung [ 2 ] and references hereafter). In order to accelerate general stochastic approximation procedures, Polyak [ 3 ] and Polyak and Juditsky [ 4 ] introduced the averaging technique. In the case of linear regression, Györfi and Walk [ 5 ] studied an averaged stochastic approximation process with a constant step-size. With the same type of process, Bach and Moulines [ 6 ] proved that the optimal convergence rate is achieved without a strong convexity assumption on the loss function. However, this type of process may be subject to the risk of numerical explosion when components of R or S exhibit great variances and may take very high values. For the datasets used as test sets by Bach and Moulines [ 6 ], all sample points whose norm of R is fivefold greater than the average norm were removed. Moreover, generally only one observation of (R, S) is introduced at each step of the process. This may not be convenient for a large amount of data generated by a data stream, for example. Two modifications of this type of process are thus proposed in this article.

The first change, made in order to avoid numerical explosion, is the use of standardized, i.e. of zero mean and unit variance, components of R and S. In fact, the expectation and the variance of the components are usually unknown and will be estimated online. The parameter θ can be computed from the standardized components as follows. Let σ_j be the standard deviation of R^j for j = 1,...,p and σ1_k the standard deviation of S^k for k = 1,...,q. Define Γ (respectively Γ1) the diagonal matrix of order p (respectively q) whose jth (respectively kth) diagonal element is 1/σ_j (respectively 1/σ1_k). Let Sc = Γ1(S − E[S]) and Rc = Γ(R − E[R]). The least squares linear regression of Sc with respect to Rc is achieved by estimating the (p, q) matrix θc such that E[||Sc − θc'Rc||²] is minimal. Then θc = Γ^{-1}(B^{-1}F)Γ1 and θ = B^{-1}F = Γθc(Γ1)^{-1}.

The second change is to use, at each step of the process, several observations of (R, S) or an estimation of B and F computed recursively from all observations until the current step without storing them. More precisely, the convergence of three processes with online standardized data is studied in sections 2, 3 and 4 respectively. First, in section 2, a process with a variable step-size a_n and use of several online standardized observations at each step is studied; note that the number of observations at each step may vary with n. Secondly, in section 3, an averaged process with a constant step-size and use of a varying number of online standardized observations at each step is studied. Thirdly, in section 4, a process with a constant or variable step-size and use of all online standardized observations until the current step to estimate B and F is studied. These three processes are tested on several datasets when q = 1, S being a continuous or binary variable, and compared to existing processes in section 5. Note that when S is a binary variable, linear regression is equivalent to a linear discriminant analysis.
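The following Python sketch gives a minimal, simplified version of such a process: the recursion X_{n+1} = X_n − a_n(B_nX_n − F_n) is run on mini-batches of observations standardized with running estimates of the means and standard deviations. It only illustrates the idea; the batch size m, the step-size a_n = 1/(1 + n)^{2/3} and the biased variance estimates are illustrative choices, not the exact estimator studied in the following sections.

```python
# Simplified sketch of the first process: X_{n+1} = X_n - a_n (B_n X_n - F_n),
# with B_n and F_n built from m new observations standardized online.
import numpy as np

def sequential_regression(stream, p, q, m=10, alpha=2/3, n_steps=2000, n_init=100):
    """stream yields (r, s) pairs with r in R^p and s in R^q; returns an estimate of theta_c."""
    n_obs = 0
    sum_r, sumsq_r = np.zeros(p), np.zeros(p)   # running sums for the means / standard deviations
    sum_s, sumsq_s = np.zeros(q), np.zeros(q)

    def absorb(Rb, Sb):
        nonlocal n_obs, sum_r, sumsq_r, sum_s, sumsq_s
        n_obs += len(Rb)
        sum_r += Rb.sum(axis=0); sumsq_r += (Rb ** 2).sum(axis=0)
        sum_s += Sb.sum(axis=0); sumsq_s += (Sb ** 2).sum(axis=0)

    def standardize(Rb, Sb):
        mean_r, mean_s = sum_r / n_obs, sum_s / n_obs
        std_r = np.sqrt(sumsq_r / n_obs - mean_r ** 2)
        std_s = np.sqrt(sumsq_s / n_obs - mean_s ** 2)
        return (Rb - mean_r) / std_r, (Sb - mean_s) / std_s

    # preliminary phase: initial estimates of the means and standard deviations
    init = [next(stream) for _ in range(n_init)]
    absorb(np.array([r for r, _ in init]), np.array([s for _, s in init]))

    X = np.zeros((p, q))                        # X_1 = 0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / (1 + n) ** alpha            # satisfies sum a_n/sqrt(n) < inf and sum a_n^2 < inf
        batch = [next(stream) for _ in range(m)]
        Rb = np.array([r for r, _ in batch]); Sb = np.array([s for _, s in batch])
        Rc, Sc = standardize(Rb, Sb)            # standardized with the past observations only
        B_n = Rc.T @ Rc / m                     # estimate of the correlation matrix B
        F_n = Rc.T @ Sc / m                     # estimate of F
        X = X - a_n * (B_n @ X - F_n)
        absorb(Rb, Sb)                          # the new observations enrich the running moments
    return X                                    # theta is then recovered as Gamma theta_c Gamma1^{-1}
```

Fed with a generator yielding (r, s) pairs, it returns an estimate of θc, from which θ and η can be recovered as indicated above.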
It appears that the third-defined process most often yields the best results for the same number of observations used or for the same computing time. These processes belong to the family of stochastic gradient processes and are adapted to data streams. Batch gradient and stochastic gradient methods are presented and compared in [ 7 ] and reviewed in [ 8 ], including noise reduction methods, like dynamic sample size methods, stochastic variance reduced gradient (also studied in [ 9 ]), second-order methods, ADAGRAD [ 10 ] and other methods. This work makes the following contributions to the variance reduction methods:

· In [ 9 ], the authors proposed a modification of the classical stochastic gradient algorithm to reduce directly the gradient of the function to be optimized in order to obtain a faster convergence. It is proposed in this article to reduce this gradient by an online standardization of the data.

· Gradient clipping [ 11 ] is another method to avoid a numerical explosion (a short illustration is given at the end of this introduction). The idea is to limit the norm of the gradient to a maximum number called the threshold. This number must be chosen, and a bad choice of threshold can affect the computing speed. Moreover, it is then necessary to compare the norm of the gradient to this threshold at each step. In our approach, the limitation of the gradient is implicitly obtained by online standardization of the data.

· If the expectation and the variance of the components of R and S were known, standardization of these variables could be made directly and convergence of the processes obtained using existing theorems. But these moments are unknown in the case of a data stream and are estimated online in this study. Thus the assumptions of the theorems of almost sure (a.s.) convergence of the processes studied in sections 2 and 3 and the corresponding proofs are more general than the classical ones in the linear regression case [1–5].

· The process defined in section 4 is not a classical batch method. Indeed, in this type of method (gradient descent), the whole set of data is known a priori and is used at each step of the process. In the present study, new data are supposed to arrive at each step, as in a data stream, and are added to the preceding set of data, thus reducing the variance by averaging. This process can be considered as a dynamic batch method.

· A suitable choice of step-size is often crucial for obtaining good performance of a stochastic gradient process. If the step-size is too small, the convergence will be slower. Conversely, if the step-size is too large, a numerical explosion may occur during the first iterations. Following [ 6 ], a very simple choice of the step-size is proposed for the methods with a constant step-size.

· Another objective is to reduce computing time in order to take into account a maximum of data in the case of a data stream. The experiments indicate that using all observations until the current step without storing them, with several observations introduced at each step, generally improves the convergence speed of the process the most. Moreover, this can reduce the influence of outliers.

As a whole, the major contributions of this work are to reduce the gradient variance by online standardization of the data or use of a "dynamic" batch process, to avoid numerical explosions, to reduce computing time and consequently to better adapt the stochastic approximation processes used to the case of a data stream.
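For contrast with online standardization, here is a minimal sketch of the gradient clipping technique recalled in the list above; the threshold value is arbitrary and not taken from the article.

```python
# Gradient clipping: limit the norm of the gradient to a chosen threshold.
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so that its Frobenius norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)
```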
2 Convergence of a process with a variable step-size

Let (B_n, n ≥ 1) and (F_n, n ≥ 1) be two sequences of random matrices in R^{p×p} and R^{p×q} respectively. In this section, the convergence of the process (X_n, n ≥ 1) in R^{p×q} recursively defined by

X_{n+1} = X_n − a_n(B_nX_n − F_n)

and its application to sequential linear regression are studied.

2.1 Theorem

Let X_1 be a random variable in R^{p×q} independent from the sequence of random variables ((B_n, F_n), n ≥ 1) in R^{p×p} × R^{p×q}. Denote T_n the σ-field generated by X_1 and (B_1, F_1),...,(B_{n−1}, F_{n−1}); X_1, X_2,...,X_n are T_n-measurable. Let (a_n) be a sequence of positive numbers. Make the following assumptions:

(H1a) There exist a positive definite symmetrical matrix B and non-negative T_n-measurable random variables β_n and b_n such that a.s.:
1) ||E[B_n|T_n] − B|| ≤ β_n;
2) E[||B_n − B||²|T_n] ≤ b_n.

(H2a) There exist a matrix F and non-negative T_n-measurable random variables δ_n and d_n such that a.s.:
1) ||E[F_n|T_n] − F|| ≤ δ_n;
2) E[||F_n − F||²|T_n] ≤ d_n.

(H3a) a_n → 0, Σ_{n≥1} a_n = ∞, Σ_{n≥1} a_n² < ∞, and a.s. Σ_{n≥1} a_n(β_n + δ_n) < ∞, Σ_{n≥1} a_n²(b_n + d_n) < ∞.

Theorem 1 Suppose H1a, H2a and H3a hold. Then X_n converges to θ = B^{-1}F a.s.

State the Robbins-Siegmund lemma [ 12 ] used in the proof.

Lemma 2 Let (Ω, A, P) be a probability space and (T_n) a non-decreasing sequence of sub-σ-fields of A. Suppose for all n, z_n, α_n, β_n and γ_n are four integrable non-negative T_n-measurable random variables defined on (Ω, A, P) such that:

E[z_{n+1}|T_n] ≤ z_n(1 + α_n) + β_n − γ_n.

Then, on the set {Σ α_n < ∞, Σ β_n < ∞}, (z_n) converges a.s. to a finite random variable and Σ γ_n < ∞ a.s.

Proof of Theorem 1. The Frobenius norm ||A|| for a matrix A is used. Recall that, if ||A||_2 denotes the spectral norm of A, ||AB|| ≤ ||A||_2||B||.

X_{n+1} − θ = X_n − θ − a_n(B_nX_n − F_n) = (I − a_nB)(X_n − θ) − a_n((B_n − B)X_n − (F_n − F)).

Denote Z_n = (B_n − B)X_n − (F_n − F) = (B_n − B)(X_n − θ) + (B_n − B)θ − (F_n − F) and X¹_n = X_n − θ. Then:

X¹_{n+1} = (I − a_nB)X¹_n − a_nZ_n,
||X¹_{n+1}||² = ||(I − a_nB)X¹_n||² − 2a_n⟨(I − a_nB)X¹_n, Z_n⟩ + a_n²||Z_n||².

Denote λ the smallest eigenvalue of B. As a_n → 0, we have for n sufficiently large ||I − a_nB||_2 = 1 − a_nλ < 1. Then, taking the conditional expectation with respect to T_n and using ||X¹_n|| ≤ 1 + ||X¹_n||², we obtain almost surely:

|⟨(I − a_nB)X¹_n, E[Z_n|T_n]⟩| ≤ ||X¹_n|| ||E[Z_n|T_n]|| ≤ (β_n(1 + ||θ||) + δ_n)||X¹_n||² + β_n||θ|| + δ_n,
E[||Z_n||²|T_n] ≤ 3b_n||X¹_n||² + 3b_n||θ||² + 3d_n,
E[||X¹_{n+1}||²|T_n] ≤ (1 + a_n²λ² + 2(1 + ||θ||)a_nβ_n + 2a_nδ_n + 3a_n²b_n)||X¹_n||² + 2||θ||a_nβ_n + 2a_nδ_n + 3||θ||²a_n²b_n + 3a_n²d_n − 2a_nλ||X¹_n||².

Applying the Robbins-Siegmund lemma under assumptions H1a, H2a and H3a implies that there exists a non-negative random variable T such that a.s.:

||X¹_n|| → T, Σ_{n≥1} a_n||X¹_n||² < ∞.

As Σ a_n = ∞, T = 0 a.s., hence X_n converges to θ a.s. ∎

A particular case with the following assumptions is now studied.

(H1a') There exist a positive definite symmetrical matrix B and a positive real number b such that a.s.:
1) for all n, E[B_n|T_n] = B;
2) sup_n E[||B_n − B||²|T_n] < b.

(H2a') There exist a matrix F and a positive real number d such that a.s.:
1) for all n, E[F_n|T_n] = F;
2) sup_n E[||F_n − F||²|T_n] < d.

(H3a') Denoting λ the smallest eigenvalue of B, a_n = a/n^α with a > 0 and 1/2 < α < 1, or a_n = a/n with a > 1/(2λ).

Theorem 3 Suppose H1a', H2a' and H3a' hold. Then X_n converges to θ almost surely and in quadratic mean. Moreover lim (1/a_n) E[||X_n − θ||²] < ∞.

Proof of Theorem 3. In the proof of Theorem 1, take β_n = 0, δ_n = 0, b_n < b, d_n < d; then a.s.:

E[||X¹_{n+1}||²|T_n] ≤ (1 + (λ² + 3b)a_n²)||X¹_n||² + 3(b||θ||² + d)a_n² − 2a_nλ||X¹_n||².

Taking the mathematical expectation yields:

E[||X¹_{n+1}||²] ≤ (1 + (λ² + 3b)a_n²)E[||X¹_n||²] + 3(b||θ||² + d)a_n² − 2a_nλE[||X¹_n||²].

By the Robbins-Siegmund lemma:

∃ t ≥ 0: E[||X¹_n||²] → t, Σ_{n≥1} a_nE[||X¹_n||²] < ∞.

As Σ a_n = ∞, t = 0. Therefore, E[||X¹_n||²] being bounded, there exist N ∈ N and f > 0 such that for n > N:

E[||X¹_{n+1}||²] ≤ (1 − 2a_nλ)E[||X¹_n||²] + fa_n².

Applying a lemma of Schmetterer [ 13 ] for a_n = a/n^α with 1/2 < α < 1 yields lim n^α E[||X¹_n||²] < ∞. Applying a lemma of Venter [ 14 ] for a_n = a/n with a > 1/(2λ) yields lim n E[||X¹_n||²] < ∞. ∎
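A toy numerical check of the setting covered by Theorem 3 (an illustration only, with q = 1, B_n = R_nR_n' and F_n = R_nS_n' built from an i.i.d. sample so that E[B_n|T_n] = B and E[F_n|T_n] = F, and a step-size a_n = a/n^{2/3}; the constants are arbitrary):

```python
# Toy check of Theorem 3: the classical stochastic gradient case converges to theta.
import numpy as np

rng = np.random.default_rng(1)
p = 3
theta = np.array([[1.0], [-2.0], [0.5]])            # target parameter (q = 1)
X = np.zeros((p, 1))                                 # X_1 = 0
for n in range(1, 50_001):
    r = rng.normal(size=(p, 1))                      # here B = E[r r'] = I, positive definite
    s = theta.T @ r + 0.1 * rng.normal()
    B_n, F_n = r @ r.T, r @ s.T
    a_n = 0.5 / n ** (2 / 3)                         # H3a': a = 0.5, alpha = 2/3
    X = X - a_n * (B_n @ X - F_n)
print(np.linalg.norm(X - theta))                     # small: X_n converges to theta = B^{-1} F
```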
2.2 Application to linear regression with online standardized data

Let (R_1, S_1),...,(R_n, S_n),... be an i.i.d. sample of a random vector (R, S) in R^p × R^q. Let Γ (respectively Γ1) be the diagonal matrix of order p (respectively q) of the inverses of the standard deviations of the components of R (respectively S). Define the correlation matrices

B = ΓE[(R − E[R])(R − E[R])']Γ,
F = ΓE[(R − E[R])(S − E[S])']Γ1.

Suppose that B^{-1} exists. Let θ = B^{-1}F. Denote R̄_n (respectively S̄_n) the mean of the n-sample (R_1, R_2,...,R_n) of R (respectively (S_1, S_2,...,S_n) of S). Denote (V_n^j)² the variance of the n-sample (R_1^j, R_2^j,...,R_n^j) of the jth component R^j of R, and (V_n^{1k})² the variance of the n-sample (S_1^k, S_2^k,...,S_n^k) of the kth component S^k of S. Denote Γ_n (respectively Γ1_n) the diagonal matrix of order p (respectively q) whose element (j, j) (respectively (k, k)) is the inverse of √(n/(n−1)) V_n^j (respectively √(n/(n−1)) V_n^{1k}). At step n, m_n new observations, indexed by the set I_n, are taken into account; M_n = m_1 + ... + m_n denotes the total number of observations used up to step n, and B_n and F_n are the empirical correlation-type matrices computed from these observations standardized online with the current estimates R̄_{M_{n−1}}, S̄_{M_{n−1}}, Γ_{M_{n−1}} and Γ1_{M_{n−1}} of E[R], E[S], Γ and Γ1.

Corollary 4 Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose moreover that assumption H3a'' holds:

(H3a'') a_n > 0, Σ_{n≥1} a_n/√n < ∞, Σ_{n≥1} a_n² < ∞.

Then X_n converges to θ a.s.

This process was tested on several datasets and some results are given in section 5 (process S11 for m_n = 1 and S12 for m_n = 10).

The following lemma is first proved. The usual Euclidean norm for vectors and the spectral norm for matrices are used in the proof.

Lemma 5 Suppose the moments of order 4 of R exist, a_n > 0 and Σ_{n≥1} a_n/√n < ∞. Then a.s.

Σ_{n≥1} a_n||R̄_{M_{n−1}} − E[R]|| < ∞ and Σ_{n≥1} a_n||Γ_{M_{n−1}} − Γ|| < ∞.

Proof of Lemma 5.
Step 1: Denote Var[R] = E[||R − E[R]||²] = Σ_{j=1}^p Var[R^j]. Then

E[||R̄_{M_{n−1}} − E[R]||²] = Σ_{j=1}^p Var[R̄^j_{M_{n−1}}] = Var[R]/M_{n−1} ≤ Var[R]/(n−1),

hence

Σ_n a_n E[||R̄_{M_{n−1}} − E[R]||] ≤ √Var[R] Σ_n a_n/√(n−1) < ∞ by H3a''.

It follows that Σ_n a_n||R̄_{M_{n−1}} − E[R]|| < ∞ a.s.; likewise Σ_n a_n||S̄_{M_{n−1}} − E[S]|| < ∞ a.s.

Step 2: ||Γ_{M_{n−1}} − Γ|| = max_{j=1,...,p} |1/(√(M_{n−1}/(M_{n−1}−1)) V^j_{M_{n−1}}) − 1/√Var[R^j]|. Bounding each term, its expectation is of order 1/√(M_{n−1}) ≤ 1/√(n−1), so that, by H3a'' and M_{n−1} ≥ n−1, Σ_n a_n||Γ_{M_{n−1}} − Γ|| < ∞ a.s. ∎

Proof of Corollary 4.
Step 1: prove that assumption H1a-1) of Theorem 1 is verified. Denote R_c = R − E[R] and R̄c_{M_{n−1}} = R̄_{M_{n−1}} − E[R]. Writing E[B_n|T_n] − B in terms of Γ_{M_{n−1}} − Γ and R̄c_{M_{n−1}}, assumption H1a-1) follows from Lemma 5 and H3a''.
Step 2: prove that assumption H1a-2) of Theorem 1 is verified. Almost surely,

E[||B_n − B||²|T_n] ≤ 24||Γ_{M_{n−1}}||⁴ (E[||R_c||⁴] + ||R̄c_{M_{n−1}}||⁴) + 2||ΓE[R_cR_c']Γ||².

As Γ_{M_{n−1}} and R̄c_{M_{n−1}} converge respectively to Γ and 0 a.s., and Σ a_n² < ∞, it follows that Σ_{n≥1} a_n² E[||B_n − B||²|T_n] < ∞ a.s. The assumptions H2a are verified in the same way, and Theorem 1 applies. ∎

3 Convergence of an averaged process with a constant step-size

In this section, the process (X_n, n ≥ 1) with a constant step-size a and the averaged process (Y_n, n ≥ 1) in R^{p×q} are recursively defined by

X_{n+1} = X_n − a(B_nX_n − F_n),
Y_{n+1} = Y_n + (1/(n+1))(X_{n+1} − Y_n), i.e. Y_{n+1} = (1/(n+1)) Σ_{j=1}^{n+1} X_j,

and their application to sequential linear regression is studied.

3.1 Lemma

Lemma 6 Let three real sequences (u_n), (v_n) and (a_n), with u_n ≥ 0 and a_n > 0 for all n, and a real positive number λ such that, for n ≥ 1,

u_{n+1} ≤ (1 − a_nλ)u_n + a_nv_n.

Suppose: 1) v_n → 0; 2) Σ_{n≥1} a_n = ∞ and a_nλ < 1 for all n. Then u_n → 0.

3.2 Theorem

Make the following assumptions:

(H1b) There exist a positive definite symmetrical matrix B in R^{p×p} and a positive real number b such that a.s.:
1) lim_{n→∞}(E[B_n|T_n] − B) = 0;
2) sup_n E[||B_n − B||²|T_n] ≤ b.

(H2b) There exists a matrix F such that a.s.:
1) lim_{n→∞}(E[F_n|T_n] − F) = 0;
2) sup_n E[||F_n − F||²|T_n] < ∞.

(H3b) λ and λ_max being respectively the smallest and the largest eigenvalue of B,

0 < a < min(1/λ_max, 2λ/(λ² + b)).

Theorem 7 Suppose H1b, H2b and H3b hold. Then Y_n converges to θ = B^{-1}F a.s.
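A minimal sketch of the pair of recursions defined at the beginning of this section, assuming the matrices B_n and F_n are supplied by the caller and the constant step-size a satisfies H3b (for online standardized data, a = 1/p is the simple choice used later in the experiments):

```python
# Averaged process with a constant step-size a and Polyak averaging of the iterates.
import numpy as np

def averaged_process(BF_stream, p, q, a):
    """BF_stream yields (B_n, F_n) pairs; returns the averaged iterate Y."""
    X = np.zeros((p, q))                 # X_1 = 0
    Y = np.zeros((p, q))                 # Y_1 = 0
    for n, (B_n, F_n) in enumerate(BF_stream, start=1):
        X = X - a * (B_n @ X - F_n)      # X_{n+1}
        Y = Y + (X - Y) / (n + 1)        # running average of X_1, ..., X_{n+1}
    return Y
```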
Remark 1 Györfi and Walk [ 5 ] proved that Y_n converges to θ a.s. and in quadratic mean under the assumptions E[B_n|T_n] = B, E[F_n|T_n] = F, H1b-2) and H2b-2). Theorem 7 is an extension of their a.s. convergence result to the case where E[B_n|T_n] → B and E[F_n|T_n] → F a.s.

Remark 2 Define R1 such that R1' = (R' 1), B = E[R1R1'], F = E[R1S']. If ((R1_n, S_n), n ≥ 1) is an i.i.d. sample of (R1, S) whose moments of order 4 exist, assumptions H1b and H2b are verified for B_n = R1_nR1_n' and F_n = R1_nS_n', as E[R1_nR1_n'|T_n] = E[R1R1'] = B and E[R1_nS_n'|T_n] = F.

Proof of Theorem 7. Denote Z_n = (B_n − B)(X_n − θ) + (B_n − B)θ − (F_n − F), X¹_n = X_n − θ and Y¹_n = Y_n − θ = (1/n) Σ_{j=1}^n X¹_j, so that X¹_{n+1} = (I − aB)X¹_n − aZ_n. Under H3b, all the eigenvalues of I − aB are positive and the spectral norm ||I − aB|| is equal to 1 − aλ. Taking the Frobenius norm of Y¹_{n+1} then yields a recursion of the type covered by Lemma 6, so that it suffices to prove that (1/n) Σ_{j=1}^n Z_j converges to 0 a.s. to conclude that Y¹_n → 0 a.s. First, assumptions H1b and H2b imply respectively that (1/n) Σ_{j=1}^n (E[B_j|T_j] − B) and (1/n) Σ_{j=1}^n (E[F_j|T_j] − F) converge to 0 a.s. Secondly, working on events G_n on which ||X¹_n|| remains controlled, the choice of a in H3b gives a contraction E[||X¹_{n+1}||² I_{G_{n+1}}] ≤ ρ E[||X¹_n I_{G_n}||²] + e with ρ < 1, hence Σ_{n≥1} E[||(B_n − B)X¹_n I_{G_n}||²]/n² < ∞, and the martingale part of (1/n) Σ_{j=1}^n Z_j converges to 0 a.s. ∎

3.3 Application to linear regression with online standardized data

Define B_n and F_n as in section 2.2, from the observations standardized online. Denote U = (R − E[R])(R − E[R])', B = ΓE[U]Γ the correlation matrix of R, λ and λ_max respectively the smallest and the largest eigenvalue of B, b1 = E[||ΓUΓ − B||²] and F = ΓE[(R − E[R])(S − E[S])']Γ1.

Corollary 8 Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose H3b1 holds:

(H3b1) 0 < a < min(1/λ_max, 2λ/(λ² + b1)).

Then Y_n converges to θ = B^{-1}F a.s.

The proof consists in verifying assumptions H1b and H2b of Theorem 7 on a set G of probability arbitrarily close to 1: there exists d > 0 such that E[||Γ_{M_{n−1}} − Γ|| I_G] ≤ d/√(M_{n−1}) ≤ d/√(n−1), from which the required conditions follow on G; letting the probability of G tend to 1 gives the a.s. convergence.

4 Convergence of a process with a variable or constant step-size and use of all observations until the current step

In this section, the convergence of the process (X_n, n ≥ 1) in R^{p×q} recursively defined by

X_{n+1} = X_n − a_n(B_nX_n − F_n),

where B_n and F_n are now computed from all observations until the current step, and its application to sequential linear regression are studied.

4.1 Theorem

Let ω be fixed, belonging to the intersection of the convergence sets {B_n → B} and {F_n → F}; the writing of ω is omitted in the following. Denote ||A|| the spectral norm of a matrix A and λ the smallest eigenvalue of B. Write X¹_n = X_n − θ and Z_n = B_nθ − F_n, so that X¹_{n+1} = (I − a_nB_n)X¹_n − a_nZ_n and Z_n → Bθ − F = 0. In the case of a_n depending on n, as a_n → 0, we can suppose without loss of generality that a_nλ < 1 for all n; in the case of a constant step-size a, suppose a < 1/λ. Then all the eigenvalues of I − a_nB are positive and ||I − a_nB|| = 1 − a_nλ. Let 0 < ε < λ. As B_n → B, we obtain for n sufficiently large:

||I − a_nB_n|| ≤ ||I − a_nB|| + a_n||B_n − B|| ≤ 1 − a_nλ + a_nε = 1 − a_n(λ − ε),
||X¹_{n+1}|| ≤ (1 − a_n(λ − ε))||X¹_n|| + a_n||Z_n||.

As Z_n → 0, applying Lemma 6 yields ||X¹_n|| → 0. Therefore X_n → B^{-1}F a.s. ∎

Remark 4 In the definition of B_n and F_n, the R_j and the S_j are not directly pseudo-centered with respect to R̄_{M_n} and S̄_{M_n} respectively. Another equivalent definition of B_n and F_n can be used. It consists of replacing R_j by R_j − m, R̄_{M_n} by R̄_{M_n} − m, S_j by S_j − m1, S̄_{M_n} by S̄_{M_n} − m1, m and m1 being respectively an estimation of E[R] and E[S] computed in a preliminary phase with a small number of observations. For example, at step n, Σ_{j∈I_n} Γ_{M_n}(R_j − m) (Γ_{M_n}(R_j − m))' is computed instead of Σ_{j∈I_n} Γ_{M_n}R_j (Γ_{M_n}R_j)'. This limits the risk of numerical explosion.

This process was tested on several datasets and some results are given in section 5 (with a variable step-size: process S13 for m_n = 1 and S14 for m_n = 10; with a constant step-size: process S31 for m_n = 1 and S32 for m_n = 10).
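A simplified sketch of this third process, assuming a constant number m of new observations per step and the constant step-size a = 1/p used later in the experiments; the matrices B_n and F_n are rebuilt at each step from running sums over all observations received so far, without storing the data, and the pre-centering of Remark 4 is omitted here.

```python
# "Dynamic batch" process: B_n, F_n estimated from all observations seen so far.
import numpy as np

def all_observations_process(stream, p, q, m=10, n_steps=2000):
    a = 1.0 / p                                   # constant step-size for standardized data
    n_obs = 0
    sum_r, sum_s = np.zeros(p), np.zeros(q)
    sum_r2, sum_s2 = np.zeros(p), np.zeros(q)     # for the standard deviations
    sum_rr, sum_rs = np.zeros((p, p)), np.zeros((p, q))
    X = np.zeros((p, q))                          # X_1 = 0
    for _ in range(n_steps):
        for _ in range(m):                        # m new observations arrive at this step
            r, s = next(stream)
            n_obs += 1
            sum_r += r; sum_s += s
            sum_r2 += r ** 2; sum_s2 += s ** 2
            sum_rr += np.outer(r, r); sum_rs += np.outer(r, s)
        mean_r, mean_s = sum_r / n_obs, sum_s / n_obs
        std_r = np.sqrt(sum_r2 / n_obs - mean_r ** 2)
        std_s = np.sqrt(sum_s2 / n_obs - mean_s ** 2)
        cov_rr = sum_rr / n_obs - np.outer(mean_r, mean_r)
        cov_rs = sum_rs / n_obs - np.outer(mean_r, mean_s)
        B_n = cov_rr / np.outer(std_r, std_r)     # correlation matrix estimated from all data so far
        F_n = cov_rs / np.outer(std_r, std_s)
        X = X - a * (B_n @ X - F_n)
    return X                                      # estimates theta_c; theta = Gamma theta_c Gamma1^{-1}
```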
5 Experiments

The three previously-defined processes of stochastic approximation with online standardized data were compared with the classical stochastic approximation and averaged stochastic approximation (or averaged stochastic gradient descent, denoted ASGD) processes with constant step-size studied in [ 5 ] and [ 6 ]. A description of the methods along with the abbreviations and parameters used is given in Table 1. With the variable S set at dimension 1, 11 datasets were considered, some of which are available in free access on the Internet, while others were derived from the EPHESUS study [ 15 ]: 6 in regression (continuous dependent variable) and 5 in linear discriminant analysis (binary dependent variable). All datasets used in our experiments are presented in detail in Table 2, along with their download links. An a priori selection of variables was performed on each dataset using a stepwise procedure based on Fisher's test with p-to-enter and p-to-remove fixed at 5 percent.

Let D = {(r_i, s_i), i = 1, 2,...,N} be the set of data in R^p × R. Assuming that it represents the set of realizations of a random vector (R, S) uniformly distributed in D, minimizing E[(S − θ'R − η)²] is equivalent to minimizing (1/N) Σ_{i=1}^N (s_i − θ'r_i − η)². One element of D (or several) is drawn at each step of the processes.

In Table 2, N denotes the size of the global sample, pa the number of parameters available, p the number of parameters selected and T2 the trace of E[RR']. An outlier is defined as an observation whose L2 norm is greater than five times the average norm.

To compare the methods, two different studies were performed: one by setting the total number of observations used, the other by setting the computing time. The choice of step-size, the initialization of each method and the convergence criterion used are respectively presented and commented on below.

Choice of step-size

In all methods of stochastic approximation, a suitable choice of step-size is often crucial for obtaining good performance of the process. If the step-size is too small, the convergence rate will be slower. Conversely, if the step-size is too large, a numerical explosion phenomenon may occur during the first iterations. For the processes with a variable step-size (processes C1 to C4 and S11 to S14), we chose to use a_n of the following type:

a_n = cγ/(b + n)^α,

a classical option in linear regression, with b = 1. The results obtained for the choice cγ = 1/√p are presented, although the latter does not correspond to the best choice for a classical method.

For the ASGD method (A1, A2), two different constant step-sizes a as used in [ 6 ] were tested: a = 1/T2 and a = 1/(2T2), T2 denoting the trace of E[RR']. Note that this choice of constant step-size assumes knowing the dataset a priori and is not suitable for a data stream. For the methods with online standardized data and a constant step-size, a = 1/p was chosen since the matrix E[RR'] is then the correlation matrix of R, whose trace is equal to p, such that this choice corresponds to that of [ 6 ].
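To make this setting concrete, here is a small sketch of how a fixed dataset D can be turned into the kind of stream consumed by the processes above, together with the two families of step-sizes just described; the exponent α and the other constants are illustrative, not prescribed by the article.

```python
# Drawing observations uniformly from D and the step-size helpers used in the comparisons.
import numpy as np

def uniform_stream(R_data, S_data, seed=0):
    """Yield (r_i, s_i) pairs drawn uniformly from the dataset D."""
    rng = np.random.default_rng(seed)
    n = len(R_data)
    while True:
        i = rng.integers(n)
        yield R_data[i], S_data[i]

def variable_step(n, p, b=1.0, alpha=2/3):
    """a_n = c_gamma / (b + n)^alpha with c_gamma = 1/sqrt(p); alpha is an assumed value."""
    return (1.0 / np.sqrt(p)) / (b + n) ** alpha

def constant_step(p):
    """a = 1/p, the simple constant step-size used with online standardized data."""
    return 1.0 / p
```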
Initialization of processes

All processes (X_n) were initialized by X_1 = 0, the null vector. For the processes with standardization, a small number of observations (n = 1000) was first taken into account in order to calculate an initial estimate of the means and standard deviations.

Convergence criterion

The "theoretical vector" θ1 is defined as that obtained by the least squares method on D, such that θ1' = (θ' η). Let θ1_{n+1} be the estimator of θ1 obtained by stochastic approximation after n iterations. In the case of a process (X_n) with standardized data, which yields an estimation of the vector denoted θc in section 1, θ1_{n+1} is obtained from X_{n+1} using θ = Γθc(Γ1)^{-1} and η = E[S] − θ'E[R], the unknown moments being replaced by their current estimates. The criterion retained in the following is the cosine of the angle between θ1 and θ1_{n+1}. Other criteria, such as ||θ1 − θ1_{n+1}||²/||θ1||² or (f(θ1_{n+1}) − f(θ1))/f(θ1), f being the loss function, were also tested, although the results are not presented in this article.

5.1 Study for a fixed total number of observations used

For each multiple of N observations used by the algorithm (N being the size of D), up to a maximum of 100N observations, the criterion value associated with each method and each dataset was recorded. The results obtained after using 10N observations are provided in Table 3.

Table 3 (mean rank by method over the 11 datasets): 11.6, 12.2, 9.9, 12.3, 9.2, 8.8, 5.2, 6.1, 3.7, 6.9, 10.5, 4.1, 2.3, 2.2.

Fig 1. Results obtained for dataset POLY using 10N and 100N observations: A/, B/ process C1 with a variable step-size of the above type, varying b; C/ process S21, varying the constant step-size a.

As can be seen in Table 3, a numerical explosion occurred for most datasets using the classical methods with raw data and a variable step-size (C1 to C4). As noted in Table 2, these datasets had a high T2 = Tr(E[RR']). The corresponding methods S11 to S14, using the same variable step-size but with online standardized data, quickly converged in most cases. However, classical methods with raw data can yield good results for a suitable choice of step-size, as demonstrated by the results obtained for the POLY dataset in Fig 1. The numerical explosion can arise from a too high step-size when n is small. This phenomenon can be avoided if the step-size is reduced, although if the latter is too small, the convergence rate will be slowed. Hence, the right balance must be found between step-size and convergence rate. Furthermore, the choice of this step-size generally depends on the dataset, which is not known a priori in the case of a data stream. In conclusion, methods with standardized data appear to be more robust to the choice of step-size.

The ASGD method (A1 with constant step-size a = 1/T2 and A2 with a = 1/(2T2)) did not yield good results except for the RINGNORM and TWONORM datasets, which were obtained by simulation (note that all methods functioned very well for these two datasets). Of note, A1 exploded for the QUANTUM dataset containing 1068 observations (2.1%) whose L2 norm was fivefold greater than the average norm (Table 2). The corresponding method S21 with online standardized data yielded several numerical explosions with the a = 1/p step-size; however, these explosions disappeared when using a smaller step-size (see Fig 1). Of note, it is assumed in Corollary 8 that 0 < a < min(1/λ_max, 2λ/(λ² + b1)); in the case of a = 1/p, only a < 1/λ_max is certain. Finally, for methods S31 and S32 with standardized data, the use of all observations until the current step and the very simple choice of the constant step-size a = 1/p uniformly yielded good results.
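A minimal sketch of the two criteria just mentioned, as used for the rankings below; the function names are illustrative.

```python
# Convergence criteria: cosine between the estimate and the least squares solution on D,
# and the relative squared error also tested in the study.
import numpy as np

def cosine_criterion(theta1_hat, theta1):
    """Cosine of the angle between the estimate and the 'theoretical' vector (1 is best)."""
    return float(theta1_hat.ravel() @ theta1.ravel()
                 / (np.linalg.norm(theta1_hat) * np.linalg.norm(theta1)))

def relative_error(theta1_hat, theta1):
    return float(np.linalg.norm(theta1 - theta1_hat) ** 2 / np.linalg.norm(theta1) ** 2)
```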
Thereafter, for each fixed number of observations used and for each dataset, the 14 methods were ranked from the best (the highest cosine) to the worst (the lowest cosine) by assigning them ranks 1 to 14 respectively, after which the mean rank over the 11 datasets was calculated for each method. A total of 100 mean rank values were calculated for a number of observations used varying from N to 100N. The graph depicting the change in mean rank as a function of the number of observations used and the boxplot of the mean rank are shown in Fig 2.

Fig 2. Results for a fixed total number of observations used: A/ change in the mean rank based on the number of observations used, B/ boxplot of the mean rank by method.

Overall, for these 11 datasets, a method with standardized data, a constant step-size and use of all observations until the current step (S31, S32) represented the best method when the total number of observations used was fixed.

5.2 Study for a fixed processing time

The same comparison was performed for a fixed processing time. Mean rank by method over the 11 datasets: 12.2, 9.9, 10.6, 10.1, 9.0, 8.6, 5.8, 5.8, 6.1, 6.1, 11.5, 3.1, 4.5, 1.5.

Fig 3. Results for a fixed processing time: A/ change in the mean rank based on the processing time, B/ boxplot of the mean rank by method.

6 Conclusion

In the present study, three processes with online standardized data were defined and their a.s. convergence was proven. A stochastic approximation method with standardized data appears to be advantageous compared to a method with raw data. First, it is easier to choose the step-size: for processes S31 and S32 for example, the definition of a constant step-size only requires knowing the number of parameters p. Secondly, the standardization usually allows avoiding the phenomenon of numerical explosion often obtained in the examples given with a classical method. The use of all observations until the current step can reduce the influence of outliers and increase the convergence rate of a process. Moreover, this approach is particularly adapted to the case of a data stream. Finally, among all processes tested on 11 different datasets (linear regression or linear discriminant analysis), the best was a method using standardization, a constant step-size equal to 1/p, all observations until the current step, and the use of several new observations at each step.

Author Contributions

Conceptualization: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.
Formal analysis: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.
Writing – original draft: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.

References

1. Monnez JM. Le processus d'approximation stochastique de Robbins-Monro: résultats théoriques; estimation séquentielle d'une espérance conditionnelle. Statistique et Analyse des Données. 1979;4(2):11–29.
2. Ljung L. Analysis of stochastic gradient algorithms for linear regression problems. IEEE Transactions on Information Theory. 1984;30(2):151–160. https://doi.org/10.1109/TIT.1984.1056895
3. Polyak BT. New method of stochastic approximation type. Automation and Remote Control. 1990;51(7):937–946.
4. Polyak BT, Juditsky AB. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization. 1992;30(4):838–855. https://doi.org/10.1137/0330046
5. Györfi L, Walk H. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization. 1996;34(1):31–61. https://doi.org/10.1137/S0363012992226661
6. Bach F, Moulines E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems. 2013;773–781.
7. Bottou L, Le Cun Y. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry. 2005;21(2):137–151. https://doi.org/10.1002/asmb.538
8. Bottou L, Curtis FE, Nocedal J. Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838v2. 2017.
9. Johnson R, Zhang T. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. Advances in Neural Information Processing Systems. 2013;315–323.
10. Duchi J, Hazan E, Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research. 2011;12:2121–2159.
11. Pascanu R, Mikolov T, Bengio Y. Understanding the exploding gradient problem. arXiv:1211.5063v1. 2012.
12. Robbins H, Siegmund D. A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi JS, editor. Optimizing Methods in Statistics. New York: Academic Press; 1971. p. 233–257.
13. Schmetterer L. Multidimensional stochastic approximation. In: Multivariate Analysis II, Proc. 2nd Int. Symp., Dayton, Ohio. Academic Press; 1969. p. 443–460.
14. Venter JH. On Dvoretzky stochastic approximation theorems. The Annals of Mathematical Statistics. 1966;37:1534–1544. https://doi.org/10.1214/aoms/1177699145
15. Pitt B, Remme W, Zannad F, et al. Eplerenone, a selective aldosterone blocker, in patients with left ventricular dysfunction after myocardial infarction. New England Journal of Medicine. 2003;348(14):1309–1321. https://doi.org/10.1056/NEJMoa030207 PMID: 12668699
16. Xu W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv:1107.2490v2. 2011.



Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson. Sequential linear regression with online standardized data, PLOS ONE, 2018, DOI: 10.1371/journal.pone.0191186