Robust \(\ell _1\) Approaches to Computing the Geometric Median and Principal and Independent Components

Journal of Mathematical Imaging and Vision, Feb 2016

Robust measures are introduced for methods to determine statistically uncorrelated or also statistically independent components spanning data measured in a way that does not permit direct separation of these underlying components. Because of the nonlinear nature of the proposed methods, iterative methods are presented for the optimization of merit functions, and local convergence of these methods is proved. Illustrative examples are presented to demonstrate the benefits of the robust approaches, including an application to the processing of dynamic medical imaging.

Full text PDF: https://link.springer.com/content/pdf/10.1007%2Fs10851-016-0637-9.pdf


Stephen L. Keeling · Karl Kunisch
Institut für Mathematik und Wissenschaftliches Rechnen, Karl-Franzens-Universität Graz, Heinrichstraße 36, 8010 Graz, Austria

Keywords: Geometric median; Principal component analysis; Independent component analysis; Robustness; Local convergence of iterative methods

1 Introduction

The topics of this work focus on the low-dimensional representation of complex measured data. The lowest dimensional representation is a type of average. More accurate representations add dimensions beyond the average based upon subspaces in which the data vary the most. Choosing a basis for such subspaces is driven by the priority that data coordinates with respect to this basis be statistically uncorrelated or even statistically independent. The particular interest here is to present methods for performing these tasks which are robust against outliers in the measured data.

The most common type of average is the mean, which may be formulated variationally as the point minimizing the sum of squared distances to data points. As discussed in [14], a more robust method involves minimizing a merit function which does not grow as rapidly with respect to the data and would therefore apply less weight to erroneous data points far from a natural average. Various notions of an average based upon \(\ell_1\) measures are discussed in [11]. Based upon examples presented in Sect. 3, the type of average selected for this work is the geometric median, which may be formulated variationally as the point minimizing the sum of distances (not squared) to data points.

The problem of determining the geometric median has a long history. In the 1937 paper by Weiszfeld [21], three proofs concerning the uniqueness of the geometric median are given, and one of these supplies an algorithm for its computation. We also refer to the recent annotated translation of that paper [19]; see also [3]. A shorter proof of uniqueness is given in Sect. 6 which is based upon a strict convexity argument. Moreover, a possibly novel characterization of a solution is provided in case data points are collinear. A regularized Weiszfeld iteration for computing the geometric median is proposed in Sect. 3, and the local convergence of this scheme is proved in Sect. 6. Alternative approaches, based upon the Chambolle–Pock algorithm [8] and projected dual ascent [4], are compared to our approach.

Given a natural average or center of the measured data, one may then wish to determine the direction in which data points vary the most from the center. This direction is the most significant principal component of the data. Principal components of lesser significance are sought similarly but within the orthogonal complement of the more significant ones. Determining and analyzing such components is the subject of principal component analysis (PCA) [15].
The most common way of determining these components is to select them as the eigenvectors of the covariance matrix of the data. The more significant components correspond to the larger eigenvalues of the covariance matrix, since each eigenvalue gives the variance of the data projected onto the corresponding eigenvector. As discussed in Sect. 4, determining each eigenvector can be formulated variationally in terms of finding a best fit line through the data, where the line minimizes the sum of squared distances to data points. See [5] and [10] for \(\ell_1\)-based alternatives to this criterion. Based upon examples presented in Sect. 4, this line is determined here more robustly by minimizing the sum of distances (not squared) to data points. In other words, analogous to defining an average as a geometric median point, a principal component is defined here as a geometric median line.

In another context, the term robust PCA has been associated with the separation of sparse and low-rank components in the given data. This separation was first proposed in [7], where a low-rank component is measured with the spectral norm and a sparse component with the \(\ell_1\) norm. See also the related approaches of [23] and [18], where a low-rank component is separated from a column sparse component using similar norms. As in the works of [10] and [17], we do not assume here that the given data may be decomposed into sparse and low-rank components. Our task is instead to decompose (maximal rank) data into components which may all have significant energy in the data. The sum of distances (not squared) between data points and best fit lines is considered in [10] to obtain robust principal components all at once, and a convex relaxation of this approach is analyzed in [17]. In our approach, geometric median lines are obtained sequentially in decreasing orthogonal subspaces. In Sect. 4, an iterative scheme is proposed for computing these lines, and the scheme is based upon that used for computing the geometric median point. However, since the merit function is not convex, uniqueness of minimizers cannot be expected. Nevertheless, local convergence of the scheme to a minimizer is proved in Sect. 7. For approaches to PCA involving maximization of an \(\ell_1\) norm we refer to [16] and [18].

Suppose that the data are rotated to an axis system aligned with principal components and that they are then scaled along each new axis to normalize the respective variances to unity. When this rotation and scaling is carried out by standard methods using \(\ell_2\) measures, the transformed data have a covariance matrix equal to the identity. The data are then said to have been sphered. In particular, the new data coordinates are statistically uncorrelated. However, they are not necessarily statistically independent [14]. (See, e.g., the example of Fig. 5 with m = 1 in (5.1), so that the data are sphered but the coordinates do not satisfy the independence criterion (5.4).) It might then be postulated that the data can be represented in a rotated axis system with respect to which coordinates are statistically independent. Determining and analyzing such a system is the subject of independent component analysis (ICA). In case the postulate holds, coordinates of the sphered data represent weighted sums of statistically independent variables, and by the Central Limit Theorem [1] histograms of such coordinates tend to be bell shaped.
In order to identify the postulated rotation, it is standard to minimize the Gaussianity of histograms of coordinates in the desired rotated system. The approach proposed by [14] is to determine this rotation by maximizing a merit function which is known to be minimized by data with a Gaussian distribution. It is also argued in [14] that one such merit function is more robust to data outliers than another when it does not grow as rapidly with respect to the data. Such candidate merit functions are considered in Sect. 5. The optimization method of [24] is robust against local extrema. Here we propose an approach for determining the desired rotation by first targeting independence directly instead of using the indirect measure of Gaussianity. The merit function proposed in Sect. 5 is motivated by the observation that while sphered axes tend to be aligned with data clusters, independent axes tend to separate clusters. See the examples presented in Sect. 5 for details. A fixed point iteration scheme based on the optimality condition is proposed in Sect. 5 for computing robust independent components, and the local convergence of this scheme is proved in Sect. 8.

The paper is outlined as follows. In Sect. 2, standard \(\ell_2\) approaches to PCA and ICA are summarized, particularly to establish the background used later for the presentation of more robust methods. In Sect. 3, a robust method of data centering is proposed using the geometric median. In Sect. 4, a robust method for determining principal components is proposed using lines which are best fit in the sense that the sum of distances (not squared) to the data points is minimized within the subspace orthogonal to other components. In Sect. 5, a robust method for determining independent components is proposed which maximizes separations among sphered data clusters. Due to the nonlinearity of the respective optimality conditions, iterative schemes are proposed in Sects. 3–5 to solve the respective optimization problems. Local convergence of these schemes is proved in Sects. 6–8. In Sect. 9, the proposed methods are applied to a magnetic resonance image sequence to separate intensity changes due to physiological motion from those due to contrast agent, and benefits of the robust methods are demonstrated with respect to this realistic example. See also [20] and [22]. The paper ends with a summary in Sect. 10.

2 Summary of \(\ell_2\) Approaches to PCA and ICA

Let an unknown random vector \(z \in \mathbb{R}^m\) be given with components \(\{z_i\}_{i=1}^m\), which will be called sources. For example, the sources could be random variables associated with sounds produced independently at a cocktail party. The sources are assumed to satisfy the following:

1. For \(1 \le i \ne j \le m\), \(z_i\) and \(z_j\) are statistically independent.
2. No \(z_i\) is normally distributed.
3. For \(1 \le i \le m\), the variance \(\sigma_i^2 = E[(z_i - E[z_i])^2]\) of \(z_i\) is positive.

Here, \(E\) denotes the expectation. Since the sources are statistically independent, they are uncorrelated [14]. Let their positive definite diagonal covariance matrix be denoted by
\[ C(z) = \{E[(z_i - E[z_i])(z_j - E[z_j])]\}_{i,j=1}^m = \mathrm{diag}\{\sigma_i^2\}_{i=1}^m, \]
which is unknown. Let a random vector \(y \in \mathbb{R}^m\) be defined through a measurement process
\[ y = Az \quad (2.1) \]
modeled in terms of the mixing matrix \(A \in \mathbb{R}^{m\times m}\). The components \(\{y_i\}_{i=1}^m\) of \(y\) will be called measurements. For example, the measurements could be random variables associated with sounds recorded by separate microphones at the cocktail party mentioned above. In this case, it may be assumed naturally that each microphone records a weighted sum of sources, where a weight is stronger when the source is nearer, and the set of vectors of such weights for the respective microphones is linearly independent. Under the assumption that the mixing matrix is invertible, the goal is to determine a matrix \(W \in \mathbb{R}^{m\times m}\) such that the components \(\{x_i\}_{i=1}^m\) of the random vector
\[ x = Wy \quad (2.2) \]
estimate the sources in the following sense. First, normalizing \(z = A^{-1}y\) according to \(C(z)^{-1/2}z\) removes the ambiguity of unknown variances by setting the covariance matrix to the identity. Secondly, since the order and sign of components in \(C(z)^{-1/2}z\) are unknown, the alternative \(PC(z)^{-1/2}z\) also satisfies the source assumptions when \(P \in \mathbb{R}^{m\times m}\) is any matrix satisfying \((P_{q_i,j})^2 = \delta_{i,j}\) with \(\{q_i\}_{i=1}^m\) being a permutation of \(\{i\}_{i=1}^m\). Thus, \(W\) estimates a product \(PC(z)^{-1/2}A^{-1}\), and the covariance matrix of \(x\) in (2.2) is the identity.

Suppose that each random measurement variable \(y_i\) is sampled directly to obtain \(n\) samples \(\{y_{ij}\}_{j=1}^n\). Implicitly underlying these are samples \(\{z_{ij}\}_{j=1}^n\) of each random source variable \(z_i\). Define the sample vectors \(y_i = \{y_{ij}\}_{j=1}^n\), \(z_i = \{z_{ij}\}_{j=1}^n\), \(i = 1,\dots,m\). According to the linear model in (2.1), the matrices \(Y = \{y_1,\dots,y_m\}^{\mathsf T} \in \mathbb{R}^{m\times n}\) and \(Z = \{z_1,\dots,z_m\}^{\mathsf T} \in \mathbb{R}^{m\times n}\) are related by
\[ Y = AZ. \quad (2.3) \]
By (2.2), the estimation \(X = \{x_1,\dots,x_m\}^{\mathsf T} \in \mathbb{R}^{m\times n}\) of the sources satisfies
\[ X = WY. \quad (2.4) \]
The matrix \(W\) is determined stepwise in terms of its singular value decomposition
\[ W = U\Lambda^{-\frac12}V^{\mathsf T}, \quad (2.5) \]
where \(U, V \in \mathbb{R}^{m\times m}\) are orthogonal and \(\Lambda \in \mathbb{R}^{m\times m}\) is positive definite and diagonal. Specifically, after the data are centered,
\[ Y_c = Y - \bar Y \quad (2.6) \qquad \text{with} \qquad \bar Y = \{\bar y_1,\dots,\bar y_m\}^{\mathsf T}, \quad \bar y_i = \tfrac1n\textstyle\sum_{j=1}^n y_{ij}, \quad (2.7) \]
the product \(V^{\mathsf T}Y_c\) should rotate the data so that the new coordinate axes are aligned with the visually natural axes of the cluster of data points \(\{Y\hat e_j\}_{j=1}^n\), \(\hat e_j \in \mathbb{R}^n\), \((\hat e_j)_i = \delta_{i,j}\). After this rotation, the product
\[ Y_s = \Lambda^{-\frac12}V^{\mathsf T}Y_c \quad (2.8) \]
should scale the data so that the variance along each new coordinate axis is unity. For this reason, the data \(Y_s\) are said to be sphered. The final orthogonal matrix \(U\) in (2.5) is chosen so that the components of the random variable \(x\) in (2.2) are maximally independent in a sense made precise below.
To determine the transformations \(V\) and \(\Lambda\), the covariance matrix of the sphered data is required to be the identity,
\[ I = \tfrac1n Y_sY_s^{\mathsf T} = \Lambda^{-\frac12}V^{\mathsf T}\bigl[\tfrac1n Y_cY_c^{\mathsf T}\bigr]V\Lambda^{-\frac12}, \quad (2.9) \]
which is accomplished by determining the matrices \(V\) and \(\Lambda\) from the eigenspace decomposition of the centered data,
\[ \tfrac1n Y_cY_c^{\mathsf T} = V\Lambda V^{\mathsf T}, \quad V^{\mathsf T}V = I. \quad (2.10) \]
The columns of \(V\) are the so-called principal components of the data \(Y\). Analyzing this decomposition is the subject of principal component analysis (PCA). For instance, the sampled data may be filtered by projecting these data onto subspaces spanned by principal components. For this, assume that the entries of \(\Lambda = \mathrm{diag}\{\lambda_i\}_{i=1}^m\) and \(V = \{v_1,\dots,v_m\}\) are ordered according to \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m\). This means that the variance \(\lambda_i = \tfrac1n\|Y_c^{\mathsf T}v_i\|_2^2\) of the data \(Y_c\) along the axis \(v_i\) is larger than the variance \(\lambda_j\) along the axis \(v_j\) for \(i < j\). To select only the \(r < m\) components with respect to which the data have the most variation, define the projected data \(Y_P \approx Y\) by
\[ Y_P = \bar Y + V\Lambda^{\frac12}P^{\mathsf T}P\Lambda^{-\frac12}V^{\mathsf T}(Y - \bar Y), \quad (2.11) \]
where the projector \(P \in \mathbb{R}^{r\times m}\) is defined with entries \(P_{i,j} = \delta_{i,j}\). Note that with (2.6), (2.8), and (2.10), this result can be rewritten as \(Y_P = \bar Y + \tfrac1n Y_c(PY_s)^{\mathsf T}(PY_s)\).
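The \(\ell_2\) centering, sphering, and projection steps (2.6)–(2.11) translate directly into a few lines of linear algebra. The following Python/NumPy sketch is not taken from the paper; the function name and the small demonstration at the end are illustrative, and it simply assumes data \(Y \in \mathbb{R}^{m\times n}\) with \(n \ge m\) and maximal rank.

```python
import numpy as np

def sphere_and_project(Y, r):
    """Sketch of the l2 steps (2.6)-(2.11): center, sphere, and
    project onto the r most significant principal components."""
    m, n = Y.shape
    Ybar = Y.mean(axis=1, keepdims=True)          # (2.7): row means
    Yc = Y - Ybar                                 # (2.6): centered data
    lam, V = np.linalg.eigh(Yc @ Yc.T / n)        # (2.10): (1/n) Yc Yc^T = V Lam V^T
    idx = np.argsort(lam)[::-1]                   # order lambda_1 >= ... >= lambda_m
    lam, V = lam[idx], V[:, idx]
    Ys = np.diag(lam ** -0.5) @ V.T @ Yc          # (2.8): sphered data
    P = np.eye(r, m)                              # projector with P_ij = delta_ij
    YP = Ybar + Yc @ (PYs := P @ Ys).T @ PYs / n  # (2.11), in its rewritten form
    return Ybar, V, lam, Ys, YP

# Example: 3 mixed channels, 1000 samples
rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 3)) @ rng.standard_normal((3, 1000))
Ybar, V, lam, Ys, YP = sphere_and_project(Y, r=2)
print(np.allclose(Ys @ Ys.T / 1000, np.eye(3)))   # sphered: covariance ~ identity
```

The final check simply confirms (2.9), namely that the sphered data have an identity covariance matrix.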
Next, the transformation \(U\) in (2.5) is determined so that the components of the random variable \(x\) in (2.2) are independent. While the rows of the sphered data \(Y_s\) are statistically uncorrelated, they are not necessarily statistically independent [14]. A criterion is now sought for a final rotation of axes which gives the desired independence. Since, as seen in (2.1), measurements are sums of independent random variables, the Central Limit Theorem suggests why the measurements tend to be normally distributed [1]. The matrix \(U\) is often chosen to reverse this effect, i.e., to make the components of \(x\) depart from being normally distributed as much as possible. Here, the significance of the assumption that no component \(z_i\) be normally distributed can be seen, as otherwise the proposed measure of independence would not bring a separation of sources in the following.

For the required statistical constructions, let \(E[x]\) denote the expectation of a random variable \(x\). Since a normally distributed random variable \(n\) with mean 0 and variance \(\sigma^2\) has moments \(E[|n|^m] = \kappa_m\sigma^m\),
\[ \kappa_m = \begin{cases} (m-1)\cdot(m-3)\cdot(m-5)\cdots 3\cdot 1, & m \text{ even}, \\ \sqrt{2/\pi}\,(m-1)\cdot(m-3)\cdot(m-5)\cdots 2, & m \text{ odd}, \end{cases} \]
it follows that the Kurtosis \(K = K_4\),
\[ K_m(x) = E[|x - E[x]|^m] - \kappa_m E[|x - E[x]|^2]^{m/2}, \quad (2.13) \]
of \(n\) satisfies \(K(n) = E[|n|^4] - 3E[|n|^2]^2 = 0\). Hence, a parameter-dependent random variable may be made to depart maximally from being normally distributed by maximizing the square of its Kurtosis with respect to parameters. Applying this criterion to the rows of
\[ X_c = UY_s, \quad (2.14) \]
the rows of \(U = \{u_i^{\mathsf T}\}_{i=1}^m\) are determined as follows. Define
\[ U_l = \{u_1,\dots,u_l\}^{\mathsf T}, \quad U_l \in \mathbb{R}^{l\times m}, \quad l = 1,\dots,m, \quad U_0 = \{\}, \quad (2.15) \]
and the projected data
\[ Y_l = (I - U_{l-1}^{\mathsf T}U_{l-1})Y_s, \quad l = 2,\dots,m, \quad Y_1 = Y_s, \quad (2.16) \]
whose columns lie in \(T_l \subset \mathbb{R}^m\) defined as the range of \(Y_l\). Note that
\[ u^{\mathsf T}Y_l = u^{\mathsf T}Y_s - u^{\mathsf T}U_{l-1}^{\mathsf T}U_{l-1}Y_s = u^{\mathsf T}Y_s, \quad u \in T_l. \quad (2.17) \]
The \(l\)th row \(u_l\) is then chosen in (2.18) to maximize the squared Kurtosis of \(u^{\mathsf T}Y_l/\|u\|_2\) over \(u \in T_l\). By (2.9) and (2.17), the second moment of \(u^{\mathsf T}Y_l/\|u\|_2\) is
\[ u^{\mathsf T}[Y_lY_l^{\mathsf T}/n]u/\|u\|_2^2 = u^{\mathsf T}[Y_sY_s^{\mathsf T}/n]u/\|u\|_2^2 = u^{\mathsf T}u/\|u\|_2^2 = 1, \quad u \in T_l. \]
Thus, \(F\) in (2.18) is computed according to
\[ F(x) = \tfrac1n\|x\|_4^4 - 3 \quad (2.20) \]
for \(\tfrac1n\|x\|_2^2 = 1\). In this way, the rows of \(U\) are determined sequentially so that the earlier components of \(x\) depart from being normally distributed more than later components. Alternatively, all components of \(x\) may be estimated with roughly equal quality by determining all rows of \(U\) simultaneously through maximizing \(\sum_{l=1}^m F(u_l^{\mathsf T}Y_s)\) under the conditions \(u_i^{\mathsf T}u_j = \delta_{i,j}\), \(1 \le i, j \le m\).

Once matrices \(U\), \(\Lambda\), and \(V\) are determined, the source samples are estimated according to (2.4) and (2.5). The columns of \(V\Lambda^{\frac12}U^{\mathsf T}\) are the so-called independent components of the data \(Y\). Analyzing this decomposition is the subject of independent component analysis (ICA). For instance, the sampled data may be filtered by projecting these data onto subspaces spanned by independent components. Specifically, to select the \(r < m\) desired independent components \(\{q_1,\dots,q_r\} \subset \{1,\dots,m\}\), define the projected data \(Y_Q \approx Y\) by
\[ Y_Q = \bar Y + V\Lambda^{\frac12}U^{\mathsf T}Q^{\mathsf T}QU\Lambda^{-\frac12}V^{\mathsf T}(Y - \bar Y), \quad (2.21) \]
where the projector \(Q \in \mathbb{R}^{r\times m}\) is defined with entries \(Q_{ij} = \delta_{q_i,j}\). Note that with (2.6), (2.8), (2.10), and (2.14), this result can be rewritten as \(Y_Q = \bar Y + \tfrac1n Y_c(QX_c)^{\mathsf T}(QX_c)\).

In the calculations above, it is implicitly assumed that the number of samples \(n\) is at least as large as the number of sources \(m\). Otherwise, the rank \(n\) of the covariance matrix \(\tfrac1n Y_sY_s^{\mathsf T}\) would be less than its dimension \(m\), and the diagonal matrix \(\Lambda\) in (2.10) would not be positive definite.
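To make the kurtosis-based construction of \(U\) concrete, here is a small NumPy sketch, not taken from the paper, of the sequential procedure (2.15)–(2.20). Each row is sought by driving the squared sample kurtosis \(F\) of (2.20) toward an extremum within \(T_l\); the inner update used here is the classical kurtosis-based fixed-point step known from FastICA [13] rather than any scheme specified in the paper, and the iteration counts and the demonstration data are assumptions.

```python
import numpy as np

def kurtosis_F(x):
    """Sample kurtosis F(x) = (1/n)||x||_4^4 - 3 for (1/n)||x||_2^2 = 1, cf. (2.20)."""
    return np.mean(x ** 4) - 3.0

def ica_rotation(Ys, n_iter=200, seed=0):
    """Sketch: build the rotation U row by row, restricting each search
    to T_l via the deflation (2.16) and extremizing the kurtosis with a
    FastICA-style fixed-point step (an assumption, not the paper's method)."""
    rng = np.random.default_rng(seed)
    m, n = Ys.shape
    U = np.zeros((m, m))
    for l in range(m):
        Yl = Ys - U[:l].T @ (U[:l] @ Ys)          # (2.16): project out earlier rows
        u = rng.standard_normal(m)
        u -= U[:l].T @ (U[:l] @ u)                # start in T_l
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            x = u @ Yl                            # candidate coordinate u^T Y_l
            u = Yl @ (x ** 3) / n - 3.0 * u       # kurtosis fixed-point step
            u -= U[:l].T @ (U[:l] @ u)            # stay in T_l
            u /= np.linalg.norm(u)
        U[l] = u
    return U

# Example: two subgaussian (uniform) sources, mixed and then sphered as in (2.8)
rng = np.random.default_rng(1)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))    # unit-variance sources
Y = np.array([[2.0, 1.0], [1.0, 3.0]]) @ S
Yc = Y - Y.mean(axis=1, keepdims=True)
lam, V = np.linalg.eigh(Yc @ Yc.T / Y.shape[1])
Ys = np.diag(lam ** -0.5) @ V.T @ Yc
U = ica_rotation(Ys)
print([round(kurtosis_F(row), 2) for row in U @ Ys])        # about -1.2 per recovered source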
In case \(n < m\) does in fact hold, because so few samples have been collected, one might be inclined simply to replace \(Y\) with \(Y^{\mathsf T}\) and therefore reverse the roles of time and space in the data. However, the data must possess an ergodicity property for the results with transposed data to be roughly equivalent to those without transposed data. Since such a property may not generally hold, the matrices above are determined here as follows; see also [6]. With \(\bar Y\) and \(Y_c\) given by (2.6), define the singular value decomposition
\[ Y_c/\sqrt n = V\hat\Sigma\,\hat Y_s/\sqrt n \]
in terms of rotation matrices \(V \in \mathbb{R}^{m\times m}\) and \(\hat Y_s/\sqrt n \in \mathbb{R}^{n\times n}\) and a rectangular matrix \(\hat\Sigma \in \mathbb{R}^{m\times n}\) for which \(\hat\Lambda = \hat\Sigma^{\mathsf T}\hat\Sigma \in \mathbb{R}^{n\times n}\) is diagonal and positive definite. The matrices \(\hat\Lambda\) and \(\hat Y_s\) are determined from the eigenspace decomposition \(Y_c^{\mathsf T}Y_c = \hat Y_s^{\mathsf T}\hat\Lambda\hat Y_s\), \(\hat Y_s\hat Y_s^{\mathsf T} = nI\). Since the last \(m-n\) rows of \(\hat\Sigma\) are zero, the last \(m-n\) columns \(\tilde V \in \mathbb{R}^{m\times(m-n)}\) of \(V = [\hat V, \tilde V]\) may be neglected to obtain \(Y_c = \hat V\hat\Lambda^{\frac12}\hat Y_s\). The matrix \(\hat V \in \mathbb{R}^{m\times n}\) is determined from \(\hat V = Y_c\hat Y_s^{\mathsf T}\hat\Lambda^{-\frac12}/n\). The sphered data \(\hat Y_s\) are transformed by the rotation matrix \(\hat U \in \mathbb{R}^{n\times n}\) maximizing independence of the rows of \(\hat X_c = \hat U\hat Y_s \in \mathbb{R}^{n\times n}\). Note that with \(\Lambda = \hat\Sigma\hat\Sigma^{\mathsf T}\) it also follows that \(Y_cY_c^{\mathsf T} = V\hat\Sigma[\hat Y_s\hat Y_s^{\mathsf T}]\hat\Sigma^{\mathsf T}V^{\mathsf T} = nV\Lambda V^{\mathsf T}\) holds, giving (2.10). Let \(\hat Y_s\) be padded with \(m-n\) zero rows to obtain \(Y_s = [\hat Y_s; 0] \in \mathbb{R}^{m\times n}\). It then follows from the singular value decomposition of \(Y_c\) that \(Y_c = V\Lambda^{\frac12}Y_s\) holds, and hence the counterpart \(Y_s = (\Lambda^{\dagger})^{\frac12}V^{\mathsf T}Y_c\) to (2.8) holds, where \(\Lambda^{\dagger}\) denotes the pseudo-inverse of \(\Lambda\). The rotation matrix \(U = [[\hat U, 0]; [0, \tilde U]] \in \mathbb{R}^{m\times m}\) can be defined by supplementing \(\hat U\) with the rotation matrix \(\tilde U \in \mathbb{R}^{(m-n)\times(m-n)}\) and otherwise padding with zeros, and \(X_c = UY_s \in \mathbb{R}^{m\times n}\) can be defined to give (2.14). However, according to \(X_c = UY_s = [[\hat U, 0]; [0, \tilde U]][\hat Y_s; 0] = [\hat X_c; 0]\), the last \(m-n\) rows of \(X_c\) are zero, contrary to the objective that the rows be independent. Thus, \(\hat X_c\) marks the end of the calculation, and \(\hat X = \hat U\hat\Lambda^{-\frac12}\hat V^{\mathsf T}Y\) gives the maximum number of independent components which can be determined from the undersampled data. Finally, for projectors \(P, Q \in \mathbb{R}^{n\times n}\), (2.11) and (2.21) become \(Y_P = \bar Y + \tfrac1n Y_c(P\hat Y_s)^{\mathsf T}(P\hat Y_s)\) and \(Y_Q = \bar Y + \tfrac1n Y_c(Q\hat X_c)^{\mathsf T}(Q\hat X_c)\), respectively.

3 \(\ell_1\) Approach to Centering

That \(\ell_1\) measures lead to statistically robust results may be highlighted by the following simple example. Suppose samples \(y = \{y_j\}_{j=1}^n = \{0, 1, \dots, 1\} \in \mathbb{R}^n\), \(n > 2\), have been collected, where the first measurement is clearly an outlier. The \(\ell_2\) mean of these data is given by
\[ \operatorname*{argmin}_{\mu\in\mathbb{R}}\sum_{j=1}^n|y_j - \mu|^2 = \mathrm{mean}(y) = (n-1)/n, \quad (3.1) \]
which is clearly influenced by the outlier. On the other hand, the \(\ell_1\) mean is given by
\[ \operatorname*{argmin}_{\mu\in\mathbb{R}}\sum_{j=1}^n|y_j - \mu| = \mathrm{median}(y) = 1, \quad (3.2) \]
which is insensitive to the outlier. Here, the median is defined as the middle entry in the sorted list of values, if \(n\) is odd, or else the average of the two middle values, if \(n\) is even.

For a generalization of this robust scalar mean to its counterpart for vectors, let the data \(Y = \{y_1,\dots,y_m\}^{\mathsf T} \in \mathbb{R}^{m\times n}\), \(y_i = \{y_{ij}\}_{j=1}^n\), with columns \(Y\hat e_j\), \(\hat e_j \in \mathbb{R}^n\), \((\hat e_j)_i = \delta_{i,j}\), be given as in Sect. 2. Note that if the \(\ell_1\) mean were defined according to
\[ \operatorname*{argmin}_{\mu\in\mathbb{R}^m}\sum_{j=1}^n\|Y\hat e_j - \mu\|_1 = \Bigl\{\operatorname*{argmin}_{\mu_i\in\mathbb{R}}\sum_{j=1}^n|y_{ij} - \mu_i|\Bigr\}_{i=1}^m = \bigl\{\mathrm{median}\{y_{ij}\}_{j=1}^n\bigr\}_{i=1}^m, \quad (3.3) \]
then the solution would be unnaturally determined componentwise through decoupled minimizations.
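As a quick numerical illustration of (3.1)–(3.3) — a sketch, not code from the paper — the sample \(\{0, 1, \dots, 1\}\) shows how the \(\ell_2\) mean is dragged toward the outlier while the \(\ell_1\) mean (the median) is not, and how the componentwise median of (3.3) treats each coordinate separately; the two-row array below is a hypothetical example.

```python
import numpy as np

n = 10
y = np.array([0.0] + [1.0] * (n - 1))      # one outlier at 0, cf. (3.1)-(3.2)
print(y.mean())                            # l2 mean: (n-1)/n = 0.9
print(np.median(y))                        # l1 mean: 1.0, insensitive to the outlier

# Componentwise median (3.3) decouples the coordinates of vector data:
Y = np.array([[0.0, 0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0, 1.0]])       # hypothetical 2-D data with an outlier column
print(np.median(Y, axis=1))                # one median per row, ignoring coupling
```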
By contrast, the following \(\ell_1\) mean for vectors [11], i.e., the geometric median [19],
\[ \bar Y = \operatorname*{argmin}_{\mu\in\mathbb{R}^m} M(\mu), \quad (3.4) \qquad M(\mu) = \sum_{j=1}^n\|\mu - Y\hat e_j\|_2, \quad (3.5) \]
minimizes, in a natural way, the \(\ell_1\) norm of Euclidean distances between the data points and the selected mean. The robustness of this measure in relation to the mean or median can be highlighted by the following simple example, which is illustrated in Fig. 1a. Here the data, marked with · in Fig. 1a, consist of points lying on the line between (0, 0) and (1, 1) together with the point (0, 1), which may be regarded as an outlier. The componentwise mean of the data gives (0.375, 0.625), marked with × in Fig. 1a. The componentwise median gives (0.25, 0.75), marked with + in Fig. 1a. Finally, the measure defined in (3.4) gives the geometric median \(\bar Y = (0.4996, 0.5004)\), marked in Fig. 1a, where the smooth landscape for the merit function \(M\) is shown in Fig. 1b. Because of the natural result obtained by the geometric median in Fig. 1a, (3.4) will be used here for the \(\ell_1\) vector mean.

To compute the \(\ell_1\) mean of (3.4), the following iteration is used. For \(\tau > 0\), compute iteratively \(D_l \in \mathbb{R}^{m\times n}\) and \(\mu_l \in \mathbb{R}^m\) by
\[ D_{l+1}\hat e_j = \frac{D_l\hat e_j + \tau(\mu_l - Y\hat e_j)}{1 + \tau\|\mu_l - Y\hat e_j\|_2}, \quad j = 1,\dots,n, \quad (3.6) \]
\[ \mu_{l+1} = \Biggl[\sum_{j=1}^n\frac{(D_l - \tau Y)\hat e_j}{1 + \tau\|\mu_l - Y\hat e_j\|_2}\Biggr]\Biggm/\Biggl[-\tau\sum_{j=1}^n\frac{1}{1 + \tau\|\mu_l - Y\hat e_j\|_2}\Biggr]. \quad (3.7) \]
The motivation for this iteration is based upon the optimality conditions (6.9) derived later in Sect. 6. A continuum level steepest descent approach to solving (6.9) is given formally by
\[ D'(t)\hat e_j = (\mu(t) - Y\hat e_j) - \|\mu(t) - Y\hat e_j\|_2 D(t)\hat e_j, \quad D : [0,\infty) \to \mathbb{R}^{m\times n}, \quad (3.8) \]
\[ \sum_{j=1}^n\bigl[(\mu(t) - Y\hat e_j) - \|\mu(t) - Y\hat e_j\|_2 D(t)\hat e_j\bigr] = 0, \quad \mu : [0,\infty) \to \mathbb{R}^m, \quad (3.9) \]
where (3.9) implicitly defines \(\mu(t)\) so that the sum over \(j\) in (3.8), and hence \(\sum_{j=1}^n D'(t)\hat e_j\), remains 0. A semi-implicit time-stepping approach to solving (3.8)–(3.9) is given by
\[ \tau^{-1}(D_{l+1} - D_l)\hat e_j = (\mu_l - Y\hat e_j) - \|\mu_l - Y\hat e_j\|_2 D_{l+1}\hat e_j, \quad (3.10) \]
\[ \sum_{j=1}^n\frac{D_l\hat e_j + \tau(\mu_{l+1} - Y\hat e_j)}{1 + \tau\|\mu_l - Y\hat e_j\|_2} = 0, \quad (3.11) \]
where (3.11) defines \(\mu_{l+1}\) to correct the departure of \(\sum_{j=1}^n D_{l+1}\hat e_j\) in (3.10) from 0, as required by the last equation in (6.9). After rearranging terms, the scheme (3.6)–(3.7) is seen to be equivalent to (3.10)–(3.11). This derivation of the scheme (3.6)–(3.7) offers a heuristic explanation for the advantage of choosing \(\tau\) large so that the desired steady state of (3.8)–(3.9) is achieved rapidly. Further details of this iteration and its convergence analysis are given in Sect. 6. The \(\ell_1\) mean is given by taking the limit,
\[ \bar Y = \lim_{l\to\infty}\mu_l. \quad (3.12) \]
After these calculations have been completed, the centered data are given by \(Y_c = Y - \bar Y\), the counterpart to (2.6) with (3.4) replacing (2.7). Note that the iteration (3.6)–(3.7) agrees with the Weiszfeld Algorithm for \(\tau \to \infty\) [21]. Since this limit may lead to undefined terms, the proposed scheme may be regarded as a regularized Weiszfeld iteration. This scheme is compared in Figs. 2 and 3 with two alternatives which are described next.

The first alternative approach is given by the Chambolle–Pock Algorithm [8], which can be written for \(\mu_l, \nu_l \in \mathbb{R}^m\), \(D_l \in \mathbb{R}^{m\times n}\) as
\[ \begin{cases} D_{l+1}\hat e_j = P_B(D_l\hat e_j + \varsigma(\nu_l - Y\hat e_j)), & 1 \le j \le n, \\ \mu_{l+1} = \mu_l - (\tau/n)\sum_{j=1}^n D_{l+1}\hat e_j, & \\ \nu_{l+1} = \mu_{l+1} + \theta(\mu_{l+1} - \mu_l). & \end{cases} \quad (3.13) \]
In (3.13), \(P_B\) is a projection onto the set \(B = \{v \in \mathbb{R}^m : \|v\|_2 \le 1\}\), and the theoretical requirements are that \(\theta = 1\) and \(0 < \varsigma, \tau < 1\) hold. Note that, as (3.6)–(3.7) can be seen as a semi-implicit time-stepping scheme for solving (3.8)–(3.9), (3.13) can be seen as an alternative explicit time-stepping scheme.
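Before turning to the second alternative, here is a NumPy sketch of the regularized Weiszfeld scheme (3.6)–(3.7) itself, following the update formulas as written above. It is illustrative rather than the authors' code: the stopping rule and parameter values are assumptions, \(D\) is initialized to zero so that \(\|D_0\hat e_j\|_2 \le 1\), and the five example points are reconstructed from the componentwise mean and median reported for Fig. 1, not copied from the paper.

```python
import numpy as np

def geometric_median(Y, tau=1e3, tol=1e-9, max_iter=500):
    """Regularized Weiszfeld iteration (3.6)-(3.7) for the geometric
    median of the columns of Y (sketch; parameters are illustrative)."""
    m, n = Y.shape
    D = np.zeros((m, n))                                    # ||D_0 e_j||_2 <= 1
    mu = Y.mean(axis=1)                                     # any starting point
    for _ in range(max_iter):
        dist = np.linalg.norm(mu[:, None] - Y, axis=0)      # ||mu_l - Y e_j||_2
        w = 1.0 / (1.0 + tau * dist)                        # shared denominators
        mu_new = ((Y - D / tau) * w).sum(axis=1) / w.sum()  # (3.7), rearranged
        D = (D + tau * (mu[:, None] - Y)) * w               # (3.6)
        if np.linalg.norm(mu_new - mu) <= tol * (1 + np.linalg.norm(mu)):
            mu = mu_new
            break
        mu = mu_new
    return mu

# Outlier example in the spirit of Fig. 1: three points on the line through
# (0,0) and (1,1) plus the outlier (0,1); coordinates reconstructed, hence assumed.
pts = np.array([[0.0, 0.5, 1.0, 0.0],
                [0.0, 0.5, 1.0, 1.0]])
print(geometric_median(pts))   # lands near (0.5, 0.5), cf. the value reported above
```

Note that (3.7) is evaluated with the previous \(D_l\), so both updates per sweep use only quantities from the preceding iterate, exactly as in (3.6)–(3.7).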
The second alternative approach is given by solving the dual problem \(\sup_{D\in A\cap C}\sum_{j=1}^n(D\hat e_j)^{\mathsf T}Y\hat e_j\), where \(A = \{D \in \mathbb{R}^{m\times n} : \|D\hat e_j\|_2 \le 1,\ 1 \le j \le n\}\) and \(C = \{D \in \mathbb{R}^{m\times n} : \sum_{j=1}^n D\hat e_j = 0\}\). The solution \(D^*\) is computed by a dual ascent iteration
\[ D_{l+1} = D_l + \tau P_{A\cap C}Y, \quad (3.14) \]
where \(\tau > 0\) is a stepsize and \(P_{A\cap C}\) is a projection onto \(A\cap C\), computed according to [4]. This ascent method can also be viewed as an explicit time-stepping scheme for solving the dual problem. After \(D^*\) is obtained, the fixed point iteration \(\mu_{l+1} = (1/n)\sum_{j=1}^n[Y\hat e_j - \|\mu_l - Y\hat e_j\|_2 D^*\hat e_j]\) is used to compute a solution \(\mu^*\) to the optimality condition in (6.9).

Fig. 2 Convergence history for the regularized Weiszfeld scheme (3.6)–(3.7), the Chambolle–Pock scheme (3.13), and the dual ascent scheme (3.14) to compute the geometric median for data which are a subgaussian (uniformly) distributed and b supergaussian (Laplacian) distributed. (The authors wish to thank the referee who provided the code used for this comparison.)

The convergence of the three schemes, regularized Weiszfeld (3.6)–(3.7), Chambolle–Pock (3.13), and dual ascent (3.14), is compared in Fig. 2. All parameters have been chosen manually to minimize the number of iterations required for convergence. For these comparisons, the data \(Y \in \mathbb{R}^{m\times n}\), \(m = 2\), \(n = 1000\), are chosen so that the points \(\{Y\hat e_j\}_{j=1}^n\) are subgaussian (uniformly, Y = rand(m,n)) distributed in Fig. 2a and supergaussian (Laplacian, Y = log(rand(m,n)./rand(m,n))) distributed in Fig. 2b. Clearly, the regularized Weiszfeld iteration converges much more rapidly. Since additional work is required to determine \(\mu^*\) from the result of the dual ascent method, only the regularized Weiszfeld and the Chambolle–Pock iterations are compared in Fig. 3 for the realistic computation of \(\bar Y\) for the example of Sect. 9. Again, the regularized Weiszfeld iteration converges much more rapidly. This convergence performance combined with the popularity of the Weiszfeld Algorithm has motivated the focus on the scheme (3.6)–(3.7) and its analysis for this work.

4 \(\ell_1\) Approach to PCA

To present our approach, we start with some preliminaries. Unless otherwise specified, it is assumed that \(m \le n\). With \(V\) and \(\Lambda\) in (2.10) given in terms of components as \(V = \{\hat v_i\}_{i=1}^m\) and \(\Lambda = \mathrm{diag}\{\lambda_i\}_{i=1}^m\), respectively, define
\[ V_k = \{\hat v_1,\dots,\hat v_k\} \in \mathbb{R}^{m\times k} \quad (4.1) \]
and the projected data
\[ Y_k = (I - V_{k-1}V_{k-1}^{\mathsf T})Y_c, \quad k = 2,\dots,m, \quad Y_1 = Y_c. \quad (4.2) \]
Let \(S_k = R(Y_k)\), where \(R\) denotes the range. For convenience, it is assumed here that the data \(Y_c\) have maximal rank so that \(S_1 = \mathbb{R}^m\). Then \(0 = v^{\mathsf T}(I - V_{k-1}V_{k-1}^{\mathsf T})Y_c = v^{\mathsf T}Y_k\) is equivalent to \(v = V_{k-1}V_{k-1}^{\mathsf T}v\), which is equivalent to \(v = 0\) exactly when \(V_{k-1}^{\mathsf T}v = 0\). Hence,
\[ S_k = R(V_{k-1})^\perp, \quad k = 2,\dots,m, \quad S_1 = \mathbb{R}^m. \quad (4.3) \]
Before presenting the proposed robust measure for determining visually natural data axes, a motivation is given by reformulating the \(\ell_2\) eigenspace decomposition in (2.10) in terms of a least squares fit of an axis system to the cloud of data points. Given \(S_k\), let the \(k\)th column of \(V\) and the \(k\)th diagonal entry of \(\Lambda\) be determined inductively by the regression
\[ \hat v_k = \operatorname*{argmin}_{\hat v\in S_k,\ \|\hat v\|_2=1}\tilde H_k(\hat v), \qquad \lambda_k = \|\hat v_k^{\mathsf T}Y_k/\sqrt n\|_2^2, \quad (4.4) \]
where
\[ \tilde H_k(\hat v) = \sum_{j=1}^n\|(\hat v\hat v^{\mathsf T} - I)Y_k\hat e_j\|_2^2 = \sum_{j=1}^n\hat e_j^{\mathsf T}Y_k^{\mathsf T}(\hat v\hat v^{\mathsf T} - I)^{\mathsf T}(\hat v\hat v^{\mathsf T} - I)Y_k\hat e_j = \sum_{j=1}^n\hat e_j^{\mathsf T}Y_k^{\mathsf T}(I - \hat v\hat v^{\mathsf T})Y_k\hat e_j = \|Y_k\|_F^2 - \|Y_k^{\mathsf T}\hat v\|_2^2. \quad (4.5) \]
Since minimizing \(\tilde H_k\) over \(\hat v \in S_k\) with \(\hat v^{\mathsf T}\hat v = 1\) is equivalent to maximizing the Rayleigh quotient \(\|Y_k^{\mathsf T}v\|_2^2/\|v\|_2^2\) over the same set, \(\hat v_k\) in (4.4) is the eigenvector of \(\tfrac1n Y_cY_c^{\mathsf T}\) with the \(k\)th largest eigenvalue \(\lambda_k = \|\hat v_k^{\mathsf T}Y_c/\sqrt n\|_2^2 = \|\hat v_k^{\mathsf T}Y_k/\sqrt n\|_2^2\), as shown in (4.4).
We now aim for an appropriate \(\ell_1\)-variant of (4.4)–(4.5). The proposed approach to determine the orthogonal matrix \(V\) and the diagonal matrix \(\Lambda\) in (2.5) is to replace the sum of squared norms in (4.5) with a sum of norms in (4.8) below. The approach is reminiscent of (3.4) in the sense that while a geometric median point is selected by (3.4), a geometric median line is determined by (4.7). As for (4.5), let \(V_k\) be given by (4.1) and \(Y_k\) by (4.2). Then given \(S_k\) according to (4.3), let \(v_k\) and \(\lambda_k\) be determined inductively by
\[ v_k = \operatorname*{argmin}_{\hat v\in S_k}H_k(\hat v), \quad (4.7) \qquad \text{where} \qquad H_k(\hat v) = \sum_{j=1}^n\Bigl\|\Bigl(\frac{\hat v\hat v^{\mathsf T}}{\|\hat v\|_2^2} - I\Bigr)Y_k\hat e_j\Bigr\|_2. \quad (4.8) \]
The behavior of this measure can be highlighted by the following simple example, which is illustrated in Fig. 4. Here the data, marked with · in Fig. 4a, consist of points lying on the line between (0, 0) and (1, 1) together with the points (0, ½) and (1, ½), which may be regarded as outliers. The \(\ell_2\) data axes are given by (4.4) and are shown in Fig. 4a as dotted line segments. The \(\ell_1\) data axes are given by (4.7) and are shown in Fig. 4a as solid line segments. The landscape for the merit function \(H_1(\theta) = H_1(\hat v(\theta))\) of (4.8) with \(\hat v(\theta) = (\cos\theta, \sin\theta)\) is shown in Fig. 4b, and the minimum is marked; see Remark 2 concerning the regularity of the merit function.

The data axes defined by (4.7) are computed by the following scheme. For \(\tau > 0\) and \(\rho > 0\), compute iteratively \(D_l\hat e_j \in S_k\), \(j = 1,\dots,n\), and \(\hat v_l \in S_k\), with \(\|D_0\hat e_j\|_2 \le 1\), \(j = 1,\dots,n\), and \(\|\hat v_0\|_2 = 1\), by (4.10) and
\[ \hat v_{l+1} = \frac{v_{l+1}}{\|v_{l+1}\|_2} \quad \text{with} \quad v_{l+1} = \hat v_l - \rho\sum_{j=1}^n\frac{(\hat v_l^{\mathsf T}Y_k\hat e_j)(\tau Y_k\hat e_j - D_l\hat e_j)}{1 + \tau\|(\hat v_l\hat v_l^{\mathsf T} - I)Y_k\hat e_j\|_2}. \quad (4.11) \]
The motivation for this iteration and its convergence analysis are given in Sect. 7. Then the \(k\)th components of \(V\) and \(\Lambda\) are given, respectively, by taking the limit
\[ v_k = \lim_{l\to\infty}\hat v_l \quad (4.12) \]
and setting \(\lambda_k = R_k(v_k)\) (4.13). After these calculations have been completed for \(k = 1,\dots,m\), the \(\ell_1\) sphered data are given by \(Y_s = \Lambda^{-\frac12}V^{\mathsf T}Y_c\), the counterpart to (2.8). Since the minimization problems (4.7) are performed over progressively smaller subspaces \(S_k\), it follows that \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m\). Thus, as in Sect. 2, the \(\ell_1\)-variation \(\lambda_i\) of the data \(Y_c\) along the axis \(v_i\) is larger than the \(\ell_1\)-variation \(\lambda_j\) along the axis \(v_j\) for \(i < j\). To select only the \(r < m\) components with respect to which the data have the most \(\ell_1\)-variation, define the projected data \(Y_P \approx Y\) by the counterpart to (2.11), where the matrices \(\bar Y\), \(V\), and \(\Lambda\) are now determined by \(\ell_1\) measures while the projector \(P \in \mathbb{R}^{r\times m}\) is defined as before with entries \(P_{i,j} = \delta_{i,j}\). As the columns of \(V\) are determined sequentially, the earlier components of \(y\) are more strongly separated from other later components. Alternatively, all components of \(y\) may be estimated with roughly equal quality by determining all columns of \(V\) simultaneously through minimizing a sum of functionals of the form (4.8) for each column under the constraint that \(V\) be orthogonal [10,17].

In case the data are undersampled and \(n < m\), the steps outlined in this section must be modified as follows. Let \(\bar Y\) and \(Y_c\) be determined as described following (3.4). Then \(\hat V \in \mathbb{R}^{m\times n}\) is determined by solving (4.7) but for \(k = 1,\dots,n\), where \(S_1\) is the range of \(Y_c\). With \(\lambda_k = R_k(v_k)\), set \(\hat\Lambda = \mathrm{diag}\{\lambda_k\}_{k=1}^n\). The sphered data are then given by \(\hat Y_s = \hat\Lambda^{-\frac12}\hat V^{\mathsf T}Y_c \in \mathbb{R}^{n\times n}\). Finally, for a projector \(P \in \mathbb{R}^{n\times n}\), (2.11) becomes \(Y_P = \bar Y + \hat V\hat\Lambda^{\frac12}P^{\mathsf T}P\hat\Lambda^{-\frac12}\hat V^{\mathsf T}(Y - \bar Y)\).
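A compact sketch of the Sect. 4 scheme is given below. It is not the authors' code: the update (4.10) for \(D\) is reconstructed here from the fixed-point map used in the convergence proof of Sect. 7, the deflation follows (4.1)–(4.3), and the parameter values, iteration count, and function names are assumptions.

```python
import numpy as np

def l1_principal_axis(Yk, tau=100.0, rho=1e-3, max_iter=2000, seed=0):
    """One geometric-median line: iterate (4.10)-(4.11) for the data Yk,
    whose columns are assumed to lie in S_k already. Sketch only; tau
    large and rho small as suggested by the analysis of Sect. 7."""
    rng = np.random.default_rng(seed)
    m, n = Yk.shape
    v = Yk @ rng.standard_normal(n)            # start in S_k = range(Y_k)
    v /= np.linalg.norm(v)
    D = np.zeros((m, n))                       # ||D_0 e_j||_2 <= 1
    for _ in range(max_iter):
        R = np.outer(v, v) @ Yk - Yk           # (v v^T - I) Y_k
        denom = 1.0 + tau * np.linalg.norm(R, axis=0)
        coef = v @ Yk                          # v^T Y_k e_j
        G = (tau * Yk - D) / denom             # shared factor in (4.10)-(4.11)
        D = (np.outer(v, v) - np.eye(m)) @ G   # (4.10), as reconstructed from Sect. 7
        v = v - rho * (G * coef).sum(axis=1)
        v /= np.linalg.norm(v)                 # (4.11)
    return v

def l1_pca(Yc, k_max):
    """Sequential l1 principal components via the deflation (4.1)-(4.3). Sketch."""
    V = []
    Yk = Yc.copy()
    for _ in range(k_max):
        v = l1_principal_axis(Yk)
        V.append(v)
        Yk = Yk - np.outer(v, v) @ Yk          # project onto the orthogonal complement
    return np.column_stack(V)
```

Since the update for \(v\) in (4.11) uses the previous \(D_l\), the shared factor `G` is formed before `D` is overwritten, mirroring the ordering of the scheme.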
5 \(\ell_1\) Approach to ICA

While the Kurtosis has been used as a measure of Gaussianity in (2.18) to determine independent components, an alternative measure of independence is proposed here which is more robust in the presence of outliers. This approach targets independence directly in a manner which can be illustrated in terms of the example shown below in Fig. 5. Here the data are given by (5.1), where \(\sigma(t) = \mathrm{sign}(t)\). Let these data represent a realization of a random vector \(y \in \mathbb{R}^2\) satisfying
\[ P(y = Y\hat e_i) = P(y = Y\hat e_j) = \tfrac1{2m}, \quad (\hat e_i)_j = \delta_{ij}, \quad (5.2) \]
where \(P\) denotes the probability. In order that the rotation-dependent random vector \(x(\theta) = \{x_i(\theta)\}_{i=1}^2\),
\[ x(\theta) = U(\theta)y, \qquad U(\theta) = \begin{pmatrix}\cos\theta & \sin\theta\\ -\sin\theta & \cos\theta\end{pmatrix}, \quad (5.3) \]
satisfy the independence condition
\[ P(x_1(\theta) = \alpha \text{ and } x_2(\theta) = \beta) = P(x_1(\theta) = \alpha)\cdot P(x_2(\theta) = \beta), \quad \forall\alpha,\beta\in\mathbb{R}, \quad (5.4) \]
a rotation angle \(\theta = \pi/4 + k\pi/2\), \(k \in \mathbb{Z}\), must be chosen.

For the determination of the proper rotation, four different measures of independence, listed in (5.5), are compared in Fig. 5. The data axes obtained by maximizing the last three measures in (5.5) are identical and are shown in Fig. 5a as dotted line segments. The data axes obtained by maximizing the first measure in (5.5) are shown in Fig. 5a as solid line segments. The landscape for the merit function \(\|U(\theta)Y\|_1\) of (5.5) is shown in Fig. 5b, and the maximum is marked. Only the first measure in (5.5) is maximized at a desired angle, as shown in the landscape of Fig. 5b. All other measures are maximized at a multiple of \(\pi\). The correct rotation is also obtained using (5.8) for the data presented in Sect. 9. While a more detailed statistical analysis is intended for a future work, it is presently on the basis of these examples that the measure shown below in (5.8) is proposed to determine the rotation matrix \(U\) of (2.14).

Fig. 5 The \(\ell_1\) (solid) and \(K_2\), \(K_4\), and \(K_e\) (dotted, i.e., identical) data axes obtained by maximizing the measures in (5.5). The results are compared in (a), where the data are shown with ·. Shown in (b) is the landscape of the merit function \(\|U(\theta)Y\|_1\).

To achieve maximally independent rows of \(X_c = UY_s\), the rows of the orthogonal matrix \(U = \{\hat u_i^{\mathsf T}\}_{i=1}^m\) are determined as follows. Define \(U_l = \{\hat u_1,\dots,\hat u_l\}^{\mathsf T}\) as in (2.15) and the projected data \(Y_l = (I - U_{l-1}^{\mathsf T}U_{l-1})Y_s\), \(l = 2,\dots,m\), \(Y_1 = Y_s\), as in (2.16). Let \(T_l = R(Y_l)\), where \(R\) denotes the range. For convenience, it is assumed here that the data \(Y_s\) have maximal rank so that \(T_1 = \mathbb{R}^m\). Then \(0 = u^{\mathsf T}(I - U_{l-1}^{\mathsf T}U_{l-1})Y_s = u^{\mathsf T}Y_l\) is equivalent to \(u = U_{l-1}^{\mathsf T}U_{l-1}u\), which is equivalent to \(u = 0\) exactly when \(U_{l-1}u = 0\). Hence,
\[ T_l = R(U_{l-1}^{\mathsf T})^\perp, \quad l = 2,\dots,m, \quad T_1 = \mathbb{R}^m. \quad (5.7) \]
The \(l\)th row of \(U\) is then determined by the maximization
\[ \hat u_l = \operatorname*{argmax}_{u\in T_l}G_l(u), \quad (5.8) \qquad \text{where} \qquad G_l(u) = \frac{\|Y_l^{\mathsf T}u\|_1}{\|u\|_2}. \quad (5.9) \]
In this way, the rows of \(U\) are determined sequentially so that the earlier components of \(x\) are more strongly separated from other later components. Alternatively, all components of \(x\) may be estimated with roughly equal quality by determining all rows of \(U\) simultaneously through maximizing a sum of functionals of the form (5.9) for each row under the constraint that \(U\) be orthogonal [14]. Once matrices \(U\), \(\Lambda\), and \(V\) are determined, the source samples are estimated according to (2.4) and (2.5).

The robustness of the measure \(G_l\) in (5.9) in relation to the measure \(F\) in (2.20) is highlighted by the following simple example, which is illustrated in Fig. 6. Here the data \(Y\), marked with · in Fig. 6a, consist of points lying at the diamond vertices \(\{(0,\pm1), (\pm1,0)\}\) together with the points \((\pm3, 0)\), which may be regarded as outliers. The \(\ell_2\) data axes are given by (2.18) and are shown in Fig. 6a as dotted line segments. The \(\ell_1\) data axes are given by (5.8) and are shown in Fig. 6a as solid line segments, where the landscape for the merit function \(G_1\) of (5.9) is shown in Fig. 6b and the maximum is marked.
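For concreteness, the following NumPy sketch (not the authors' implementation) evaluates the merit function \(G_l\) of (5.9) and runs the fixed-point iteration proposed next in (5.11) for its maximization, with deflation as in (2.16); the step size, iteration count, and function names are assumptions.

```python
import numpy as np

def G(u, Yl):
    """Merit function (5.9): G_l(u) = ||Y_l^T u||_1 / ||u||_2."""
    return np.abs(Yl.T @ u).sum() / np.linalg.norm(u)

def l1_independent_rows(Ys, tau=1e-3, max_iter=2000, seed=0):
    """Sketch: determine the rows of U sequentially by maximizing (5.9)
    with the normalized fixed-point update of (5.11). The analysis of
    Sect. 8 asks for tau * ||Y_l^T u||_1 < 1, so tau is kept small."""
    rng = np.random.default_rng(seed)
    m, n = Ys.shape
    U = np.zeros((m, m))
    for l in range(m):
        Yl = Ys - U[:l].T @ (U[:l] @ Ys)        # deflation as in (2.16)
        u = Yl @ rng.standard_normal(n)         # start in T_l = range(Y_l)
        u /= np.linalg.norm(u)
        for _ in range(max_iter):
            x = Yl.T @ u
            u = u + tau * (Yl @ np.sign(x) - u * np.abs(x).sum())   # (5.11)
            u /= np.linalg.norm(u)
        U[l] = u
    return U
```

The helper `G` can be used to monitor the ascent of the merit value along the iterations.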
The vectors defined by (5.8) are computed by the following scheme. For \(\tau > 0\), compute iteratively \(\hat u_k \in T_l\) with \(\|\hat u_0\|_2 = 1\) by
\[ u_{k+1} = \hat u_k + \tau\bigl[Y_l\sigma(Y_l^{\mathsf T}\hat u_k) - \hat u_k\|\hat u_k^{\mathsf T}Y_l\|_1\bigr], \qquad \hat u_{k+1} = u_{k+1}/\|u_{k+1}\|_2, \quad (5.11) \]
where
\[ \sigma(t) = \mathrm{sign}(t) \text{ for } t \in \mathbb{R}, \qquad \sigma(v) = \{\sigma(v_j)\}_{j=1}^n \text{ for } v = \{v_j\}_{j=1}^n \in \mathbb{R}^n. \quad (5.12) \]
The motivation for this iteration and its convergence analysis are given in Sect. 8. The \(l\)th column of \(U^{\mathsf T}\) is given by taking the limit,
\[ \hat u_l = \lim_{k\to\infty}\hat u_k. \quad (5.13) \]
After these calculations have been completed for \(l = 1,\dots,m\), the \(\ell_1\) maximally independent data are given by \(X_c = UY_s\), the counterpart to (2.14). To select the \(r < m\) desired independent components \(\{q_1,\dots,q_r\} \subset \{1,\dots,m\}\), define the projected data \(Y_Q \approx Y\) by the counterpart to (2.21), where the matrices \(\bar Y\), \(V\), \(\Lambda\), and \(U\) are now determined by \(\ell_1\) measures, while the projector \(Q \in \mathbb{R}^{r\times m}\) is defined as before with entries \(Q_{ij} = \delta_{q_i,j}\).

In case the data are undersampled and \(n < m\), let \(\hat V \in \mathbb{R}^{m\times n}\) and \(\hat Y_s, \hat\Lambda \in \mathbb{R}^{n\times n}\) be given as described at the end of Sect. 3. Then, the sphered data \(\hat Y_s\) are transformed by the rotation matrix \(\hat U \in \mathbb{R}^{n\times n}\) maximizing independence of the rows of \(\hat X_c = \hat U\hat Y_s \in \mathbb{R}^{n\times n}\). Thus, \(\hat X = \hat U\hat\Lambda^{-\frac12}\hat V^{\mathsf T}Y\) gives the maximum number of independent components which can be determined from the undersampled data. Finally, for a projector \(Q \in \mathbb{R}^{n\times n}\), (2.21) becomes
\[ Y_Q = \bar Y + \hat V\hat\Lambda^{\frac12}\hat U^{\mathsf T}Q^{\mathsf T}Q\hat U\hat\Lambda^{-\frac12}\hat V^{\mathsf T}(Y - \bar Y). \]

6 Convergence of the Iterative Scheme for the \(\ell_1\) Mean

The analysis of the scheme (3.6)–(3.7) begins by establishing basic properties for the minimization problem (3.4), i.e., the determination of the geometric median. As indicated in Sect. 1, a proof of uniqueness of the geometric median is provided here which is shorter than that found in [21] or [19] and is based on strict convexity of the functional \(M\). Also, a possibly novel characterization of a solution is provided in case the columns of \(Y\) are collinear, which means that \(Y\) can be expressed in the form
\[ Y = ae^{\mathsf T} + by^{\mathsf T}, \quad \text{where } e = (1,\dots,1)^{\mathsf T},\ a, b \in \mathbb{R}^m,\ e, y \in \mathbb{R}^n. \quad (6.1) \]

Lemma 1 If the columns of \(Y \in \mathbb{R}^{m\times n}\) are not collinear, then \(M\) is strictly convex.

Proof The mapping \(M\) is the sum of convex mappings and hence convex itself. If \(M\) is not strictly convex, then there are vectors \(\mu_1, \mu_2 \in \mathbb{R}^m\) and \(\mu = \alpha\mu_1 + (1-\alpha)\mu_2\), with \(\alpha \in (0,1)\), such that \(M(\mu) = \alpha M(\mu_1) + (1-\alpha)M(\mu_2)\), and thus the mapping \(\alpha \mapsto M(\alpha\mu_1 + (1-\alpha)\mu_2)\), \(\alpha \in (0,1)\), is affine. Let \(\hat\alpha \in (0,1)\) be such that \(\hat\mu = \hat\alpha\mu_1 + (1-\hat\alpha)\mu_2\) does not coincide with any of the columns \(Y\hat e_i\). Then we find the Hessian
\[ \nabla^2 M(\hat\mu) = \sum_{i=1}^n\frac{1}{\|\hat\mu - Y\hat e_i\|_2^3}\Bigl[\|\hat\mu - Y\hat e_i\|_2^2 I - (\hat\mu - Y\hat e_i)(\hat\mu - Y\hat e_i)^{\mathsf T}\Bigr], \quad (6.2) \]
and hence for any \(x \in \mathbb{R}^m\)
\[ x^{\mathsf T}\nabla^2 M(\hat\mu)x = \sum_{i=1}^n\frac{1}{\|\hat\mu - Y\hat e_i\|_2^3}\Bigl[\|\hat\mu - Y\hat e_i\|_2^2\|x\|_2^2 - |(\hat\mu - Y\hat e_i)^{\mathsf T}x|^2\Bigr]. \quad (6.3) \]
Note that \(|(\hat\mu - Y\hat e_i)^{\mathsf T}x| \le \|\hat\mu - Y\hat e_i\|_2\|x\|_2\). If \(|(\hat\mu - Y\hat e_i)^{\mathsf T}x| = \|\hat\mu - Y\hat e_i\|_2\|x\|_2\) for all \(i = 1,\dots,n\), then there exist \(b_i \in \mathbb{R}\) such that \(\hat\mu - Y\hat e_i = b_ix\), for \(i = 1,\dots,n\). Thus there exists \(b \in \mathbb{R}^n\) such that \(Y = \hat\mu e^{\mathsf T} - xb^{\mathsf T}\), which contradicts the assumption. Hence there exists at least one index \(i\) such that \(|(\hat\mu - Y\hat e_i)^{\mathsf T}x| < \|\hat\mu - Y\hat e_i\|_2\|x\|_2\), and thus \(x^{\mathsf T}\nabla^2 M(\hat\mu)x > 0\). This contradicts that \(\alpha \mapsto M(\alpha\mu_1 + (1-\alpha)\mu_2)\) is affine at \(\hat\alpha\).

Lemma 2 If the columns of \(Y \in \mathbb{R}^{m\times n}\) are collinear, then \(M\) is minimized (not necessarily uniquely) by \(\mu = a + b\cdot\mathrm{median}(y)\). If the columns of \(Y\) are not collinear, there exists a unique \(\mu \in \mathbb{R}^m\) minimizing \(M\).

Proof Existence of a solution \(\mu^* \in \mathbb{R}^m\) follows by standard subsequential limit arguments.
If the columns of Y are not collinear, uniqueness of the solution μ follows from strict convexity of μ → M (μ). Suppose next that the columns of Y are collinear so that Y = aeT + b yT. Set ν = b 2 and w = b + ν eˆ1 where (eˆ1)i = δi1. Then the Householder transformation U = I − 2 wwT w 2 2 (6.4) is orthogonal and satisfies U b = −ν eˆ1 as well as U x 2 = x 2 , ∀x ∈ Rm . Let an arbitrary μ ∈ Rm be represented as μ = a + x b + b˜, x = (μ − a)T b/ b 2 , bT b˜ = 0 and note that (U b)T(U b˜) = bTU TU b˜ = bT b˜ = 0. With (6.5)–(6.7), the merit function can be written as Once the merit function has been reduced to this onedimensional form, it is readily seen that M (μ) ≥ M (a + γ b) for γ = median( y), and hence M is minimized at a + γ b. That the minimizer is not necessarily unique can be seen from the case that n is even and the components of y are distinctly ascending, so M has the same value M (a + γ b) at all points a + b[t yn/2 + (1 − t )yn/2+1], t ∈ [ 0, 1 ]. Lemma 3 The first-order necessary optimality condition for a minimizer μ of M over Rm is that there is D ∈ Rm×n satisfying, μ − Y eˆ j = μ − Y eˆ j 2 D eˆ j , D eˆ j 2 ≤ 1, j = 1, . . . , n, D eˆ j = 0. (6.9) n j=1 Proof The necessary optimality condition for a minimizer μ is that 0 ∈ ∂ M (μ ). By the chain rule (see, e.g., [ 2 ], (6.7) (6.8) n j=1 and j=1 n j=1 n j=1 n j=1 n j=1 (6.10) (6.11) (6.12) (6.15) p. 233), the subdifferential of M is given by the sum of the respective subdifferentials, n j=1 ∂ M (μ) = ∂ μ − Y eˆ j 2 . Thus, there exist d j ∈ ∂ μ − Y eˆ j 2 , j = 1, . . . , n, satisfying ⎧ ∂ μ − Y eˆ j 2 = ⎨ ⎩ μ − Y eˆ j μ − Y eˆ j 2 B(0, 1), , μ = Y eˆ j μ = Y eˆ j , where B(0, 1) is the unit ball. Combining these facts, we have (μ − Y eˆ j ) = μ − Y eˆ j 2 j d , j = 1, . . . , n. (6.13) The claim (6.9) follows with D = {d1, . . . , dn}. Turning to the iteration (3.6)–(3.7), we observe that if convergence to a fixed point {D , μ } can be guaranteed, then from (3.6) we have that D eˆ j μ − Y j 2 = μ − Y eˆ j , j = 1, . . . , n. (6.14) From (3.7) we have n D eˆ j + τ (μ − Y eˆ j ) j=1 1 + τ μ − Y eˆ j 2 Moreover, if D0 eˆ j 2 ≤ 1, j = 1, . . . , n, then the iterates {Dl } also satisfy this bound, and hence D eˆ j 2 ≤ 1, j = 1, . . . , n. Thus {D , μ } must satisfy the necessary optimality condition (6.9). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed. Theorem 1 Suppose that the columns of Y ∈ Rm×n are not collinear so that (6.1) does not hold. Let {D , μ } satisfy (6.9) with μ ∈/ {Y eˆ j }nj=1. Then {D , μ } is a fixed point for the iteration (3.6)–(3.7), and for τ sufficiently large, iterates {Dl , μl } converge to this fixed point when {D0, μ0} starts the iteration close enough to {D , μ }. Proof It will first be shown that {D , μ } is a fixed point for the iteration (3.6)–(3.7), which is locally asymptotically stable for τ sufficiently large. Using (6.9) and substituting Dl = D on the right side of (3.6), gives μl+1 = μ . Thus, {D , μ } is a fixed point of the iteration (3.6)–(3.7). To establish the stability of the fixed point, define d j + τ (μ − Y eˆ j ) , j = 1, . . . , n F j (d1, . . . , dn, μ) = 1 + τ μ − Y eˆ j 2 gives Dl+1 = D . Using this result together with (6.9) and setting μl = μ on the right side of (3.7), Thus, the Jacobian satisfies The claimed stability will follow once it is shown that the Jacobian of this mapping evaluated at {D , μ } = {d1, . . . , dn, μ } has spectral radius less than 1 when τ is sufficiently large. 
For (6.19), ∂ F j (d1, . . . , dn, μ ) ∂ d I = 1 + τ μ − Y eˆ j 2 and and ∂ G ∂μ (6.24) The matrix B ∈ Rm×m is clearly symmetric positive semi definite, and it will now be shown that its spectrum lies in [0, 1). Suppose there is an x ∈ Rm satisfying xT(μ − Y eˆi ) = x 2 μ − Y eˆi 2 , i = 1, . . . , n. (6.27) Then there exists α = {αi }in=1 such that μ − Y eˆi = αi x ⇒ Y = μ eT − xαT (6.28) violating the assumption that the columns of Y are not collinear. Thus, there can be no x satisfying (6.27). This result implies strict inequality in the following estimate: xT B x = < n i=1 n i=1 [xT(μ − Y eˆi )]2 Thus, the spectral radius of B is less than 1. Let a rotation matrix P be chosen so that PT B P = = diag{λi }im=1 where λi ∈ [0, 1). Then the matrix J in (6.26) satisfies ⎡ I · · · 0 ⎤ T ⎢⎢ ... . . . ... ⎥⎥ ⎢⎣ 0 I 0 ⎥⎦ 0 · · · 0 P ⎡ I · · · 0 ⎤ ⎡ 0 · · · 0 A1 P ⎤ J ⎢⎢ ... . . . ... ⎥⎥ = ⎢⎢ ... . . . ... ⎥⎥ ⎢⎣ 0 I 0 ⎥⎦ ⎢⎣ 0 0 An P ⎥⎦ 0 · · · 0 P 0 · · · 0 7 Convergence of the Iterative Scheme for 1 PCA The analysis of the scheme (4.10)–(4.11) begins with establishing the existence of a minimizer for Hk in (4.8). Recall the assumption in Sect. 4 that S1 = R(Yc) = Rm so that Sk = R(Yk ) = R(Vk−1)⊥, k = 2, . . . , m, as seen in (4.3). Also, define Sk = {vˆ ∈ Sk : vˆ 2 = 1}. (7.1) Lemma 4 For Hk in (4.8), there exists a minimizer vˆ over Sk which satisfies vˆ ∈ Sk . Moreover, nj=1 Yk eˆ j 2 gives an upper bound for the minimum. Proof Because of the properties of projections, it follows that (vˆ vˆ T − I )Yk eˆ j 2 ≤ Yk eˆ j 2 holds ∀vˆ ∈ Sk and ∀ j . By (4.8), Hk (v) ≤ Hk (0) holds ∀v ∈ Sk . Thus, Hk is not strictly minimized at v = 0. According to (4.8), Hk is constant along rays outside the origin. Therefore, the minimization can as well be carried out over Sk . The claim follows since Sk is compact and Hk is continuous on Sk . Lemma 5 The first-order necessary optimality condition for a minimizer vˆ of Hk over Sk satisfying vˆ ∈ Sk as given by Lemma 4 is that there exists D ∈ Rm×n satisfying (vˆ vˆ T − I )Yk eˆ j = (vˆ vˆ T − I )Yk eˆ j 2 D eˆ j , D eˆ j 2 ≤ 1, n j=1 v T D eˆ j = 0, ˆ proving that the spectrum of J lies in [0, 1). Since the Jacobian in (6.26) is arbitrarily well approximated by J when τ is sufficiently large, the spectral radius of the Jacobian must be less than 1 for τ sufficiently large. Because of the assumption μ ∈/ {Y eˆ j }nj=1, the mapping (6.19)–(6.20) is smooth in a neighborhood of the fixed point {D , μ }, and hence iterates {Dl , μl } converge to the fixed point when started sufficiently close to it. Remark 1 Computations demonstrate that the iteration (3.6) – (3.7) converges to a minimizer of M for all τ > 0 and even when the condition μ ∈/ {Y eˆ j }nj=1 is violated. Furthermore, while the uniqueness of the minimizer is not guaranteed when the non-collinearity condition is violated, the iteration is found to converge to the median shown in Lemma 1 when τ is sufficiently small. (7.2) (7.3) (7.4) 0 ∈ ∂ Hk (v) ⊂ n j=1 ∂ ∂v = ∂v w − Yk eˆ j w − Yk eˆ j 2 ⎩ B(0, 1), , w = Yk eˆ j w = Yk eˆ j with the unit ball B(0, 1). Let c j be chosen so that (vˆ vˆ T − I )Yk eˆ j (vˆ vˆ T − I )Yk eˆ j 2 ⎩ B(0, 1), , Yk eˆ j = vˆ vˆ TYk eˆ j Yk eˆ j = vˆ vˆ TYk eˆ j b j = [vˆ TYk eˆ j I + vˆ eˆ TjYkT − 2vˆ vˆ Tvˆ TYk eˆ j ]c j . According to (7.8), (vˆ vˆ T − I )Yk eˆ j 2 c j = (vˆ vˆ T − I )Yk eˆ j , j = 1, . . . , n. 
With (7.9), it follows for Yk eˆ j = vˆ vˆ TYk eˆ j , b j = vˆ TYk eˆ j I + Yk eˆ j vˆ T − 2vˆ TYk eˆ j vˆ vˆ T (vˆ vˆ T − I )Yk eˆ j (vˆ vˆ T − I )Yk eˆ j 2 = (vˆ TYk eˆ j ) = (vˆ TYk eˆ j )(I − vˆ vˆ T) = (vˆ TYk eˆ j )(I − vˆ vˆ T)c j (vˆ vˆ T − I )Yk eˆ j (vˆ vˆ T − I )Yk eˆ j 2 (vˆ vˆ T − I )Yk eˆ j (vˆ vˆ T − I )Yk eˆ j 2 and for Yk eˆ j = vˆ vˆ TYk eˆ j , b j = [vˆ TYk eˆ j I + vˆ eˆ TjYkT − 2vˆ vˆ Tvˆ TYk eˆ j ]c j = (vˆ TYk eˆ j )(I − vˆ vˆ T)c j . where ∂v and ⎧ c j ∈ ⎨ and (7.11) (7.12) Moreover, if D0 eˆ j 2 ≤ 1, j = 1, . . . , n, then the iterates {Dl } also satisfy this bound, and hence D eˆ j 2 ≤ 1, j = 1, . . . , n. Thus, {D , vˆ } must satisfy the necessary optimality condition (7.2). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed. Theorem 2 Let {D , vˆ } satisfy (7.2) with vˆ ∈ Sk and suppose Define d j = (I − vˆ vˆ T)c j . Combining (7.4), (7.11), (7.12), and (7.13) gives n j=1 n j=1 0 = b j = (vˆ TYk eˆ j )d j . The claim (7.2) follows with D = {d1, . . . , dn}. Turning to the iteration (4.10)–(4.11), we observe that if convergence to a fixed point {D , vˆ } with vˆ 2 = 1 can be guaranteed, then from (4.10) we have that vˆ T D j eˆ j = 0 and (vˆ vˆ T − I )Yk eˆ j 2 D eˆ j = (vˆ vˆ T − I )Yk eˆ j , j = 1, . . . , n. According to (4.11), the fixed point vˆ satisfies vˆ v 2 = v = vˆ − ρ n (vˆ TYk eˆ j )(τ Yk eˆ j − Dl eˆ j ) j=1 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 , 0 = = j=1 n n j=1 n j=1 where v is defined by the right side of (7.16). Applying (vˆ vˆ T − I ) to both sides of (7.16) gives 0 = (vˆ vˆ T − I ) n (vˆ TYk eˆ j )(τ Yk eˆ j − Dl eˆ j ) j=1 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 Dl eˆ j + τ (vˆ vˆ T − I )Yk eˆ j . (vˆ TYk eˆ j ) 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 Combining (7.15) and (7.17) gives (vˆ TYk eˆ j ) D eˆ j + τ (vˆ vˆ T − I )Yk eˆ j 2 D eˆ j 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 (vˆ TYk eˆ j )D eˆ j . (7.5) (7.6) (7.7) (7.8) (7.9) (7.10) = (7.13) (7.14) (7.15) (7.16) (7.17) (7.18) and (7.24) T v=vˆ := A j . (7.27) (7.28) and = ∂ F j (d1, . . . , dn, vˆ ) ∂ di I − vˆ vˆ T = 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 ∂ F j (d1, . . . , dn, v) ∂v * vT(τ Yk eˆ j − d j )I + v(τ Yk eˆ j − d j )T 1 + τ (vvT − I )Yk eˆ j 2 (vvT − I )(τ Yk eˆ j − d j ) − [1 + τ (vvT − I )Yk eˆ j 2 ]2 +vTYk eˆ j I + v eˆ TjYkT,T τ (vvT − I )Yk eˆ j (vvT − I )Yk eˆ j 2 τ−→→∞ vˆ TYk eˆ j I + vˆ eˆ TjYkT (vˆ vˆ T − I )Yk eˆ j 2 −(vˆ TYk eˆ j ) (vˆ vˆ T − I )Yk eˆ j eˆ TjYkT(vˆ vˆ T − I ) (vˆ vˆ T − I )Yk eˆ j 32 (7.21) (7.22) (7.23) For (7.24), ∂ g ∂ di = ρ (d1, . . . , dn, vˆ ) (vˆ TYk eˆi )I / vˆ 2 1 + τ (vˆ vˆ T − I )Yk eˆi 2 Then {D , vˆ } is a fixed point of the iteration (4.10)–(4.11), and for τ sufficiently large and for ρ sufficiently small, the iterates {Dl , vˆl } converge to this fixed point when {D0, vˆ 0} starts the iteration close enough to {D , vˆ }. Proof It will first be shown that {D , vˆ } is a fixed point for the iteration (4.10)–(4.11), which is locally asymptotically stable for τ sufficiently large and for ρ sufficiently small. 
Using (7.2) and substituting Dl = D and vˆl = vˆ on the right side of (4.10), (vˆ vˆ T − I )(τ Yk eˆ j − D eˆ j ) Dl+1 eˆ j = 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 j=1 n j=1 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 (vˆ TYk eˆ j )D eˆ j 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 D eˆ j + τ (vˆ vˆ T − I )Yk eˆ j (vˆ TYk eˆ j ) 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 = (vˆ vˆ T − I ) n j=1 τ Yk eˆ j − D eˆ j (vˆ TYk eˆ j ) 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 and hence v = vˆ − ρ j=1 n (v TYl eˆ j )(τ Yk eˆ j − D eˆ j ) ˆ 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 ⎡ = vˆ ⎣ 1 − ρvˆ T j=1 n (v TYl eˆ j )(τ Yk eˆ j − D eˆ j ) ⎤ ˆ 1 + τ (vˆ vˆ T − I )Yk eˆ j 2 ⎦ satisfies v = vˆ v 2 . Thus, setting Dl = D and vˆl = vˆ on the right side of (4.11) gives vˆl+1 = vˆ . Therefore, {D , vˆ } is a fixed point of the iteration (4.10)–(4.11). To establish the stability of the fixed point, define (vvT − I )(τ Yk eˆ j − d j ) , F j (d1, . . . , dn, v) = 1 + τ (vvT − I )Yk eˆ j 2 G(d1, . . . , dn, v) = g(d1, . . . , dn, v) = v −ρ j=1 n (vTYk eˆ j )(τ Yk eˆ j − d j )/ v 2 1 + τ (vvT − I )Yk eˆ j 2 g(d1, . . . , dn, v) g(d1, . . . , dn, v) 2 so that (4.10)–(4.11) is given by = G(Dl+1 eˆ1, . . . , Dl+1 eˆn, vˆl ). Dl+1 eˆ j = F j (Dl eˆ1, . . . , Dl eˆn, vˆl ), j = 1, . . . , n, vˆl+1 (7.25) The claimed stability will follow once it is shown that the Jacobian of this mapping from the (n + 1)-fold Cartesian product (Sk)n+1 into (Sk )n+1 evaluated at {D , vˆ } = {d1, . . . , dn, vˆ } has only eigenvalues with magnitude less than 1 when τ is sufficiently large and ρ is sufficiently small. For (7.23), and with gˆ = g(d1,..., dn,vˆ ) = vˆ , ∂∂dGi (d1,..., dn,vˆ ) = (I − vˆ vˆ T)∂∂dgi (d1,..., dn,vˆ ) τ−→→∞ 0. Also ∂g(d1,..., dn,v) ∂v ⎧ vvT 1 = ⎨ I −ρ n (τYkeˆj − d j)eˆTjYkT I − v 22 v 2 ⎩ j=1 1+τ (vvT − I) 2 j=1 [1 + τ (vvT − I)Ykeˆj 2]2 +vTYkeˆj I + veˆTjYkT,T τ(vvT − I)Ykeˆj T ⎫⎬ (vvT − I)Ykeˆj 2 ⎭ v=vˆ = I − ρ n (τYkeˆj − d j)eˆTjYkT j=1 1 + τ (vˆˆ vˆ T − I)Ykeˆj 2 τ(vˆTYkeˆj)2 1 + [1 + τ (vˆvˆT − I)Ykeˆj 2] (vˆvˆT − I)Ykeˆj 2 and ×(I − vˆ vˆ T) ⎡ n τ−→→∞ I − ρ ⎣ (vˆ TYkeˆj)2 j=1 Ykeˆj eˆTjYkT (vˆ vˆ T − I)Ykeˆj 32 (I − vˆ vˆ T) ∂∂Gv(d1,..., dn,vˆ ) 1 g g T ∂g(d1,..., dn,vˆ ) = g 2 I − g 2 ∂v = (I − vˆ vˆ T)∂g(d1,..., dn,vˆ ) ∂v ⎡ n τ−→→∞ (I − vˆ vˆ T)⎣ I − ρ (vˆ TYkeˆj)2 j=1 (7.32) (7.33) (7.29) 0 ··· 0 B Thus, thespectrum of B : Sk → Sk lies in (−1,1). Recalling that the dimension of Sk is m − k + 1, let P ∈ Rm×(m−k+1) be chosen with orthonormal columns such that PTBP = Ykeˆj eˆTjYkT m−k+1 where λi ∈ (−1,1). Also set Z ∈ (vˆ vˆ T − I)Ykeˆj 32 (I − vˆ vˆ T) =: B. (7.31) Rm×=(md−ika+g1{)λwi}iit=h1all zero entries. Further let I ∈ Rm×m and By the definition Sk = R(Yk) prior to (4.3), it follows from (7.27) that Aj : Sk → Sk, j = 1,...,n, and from (7.31) that B : Sk → Sk. In the same way, it follows from (7.32) that J : (Sk)n+1 → (Sk)n+1. It will be shown that B : Sk → Sk has spectrum in (−1,1). Let B be expressed as B = (I − vˆ vˆ T) − ρCˆ where Cˆ = (I − vˆ vˆ T)C(I − vˆ vˆ T), n C = (vˆ TYkeˆj)2 j=1 Ykeˆj eˆTjYkT (vˆ vˆ T − I)Ykeˆj 32 . Suppose there were v ∈ Sk for which vTCv = 0. Then by (4.2) and (4.3), vTYk = vT(I − Vk−1VkT−1)Yc = vTYc and with (7.19), 0 = j=n1(vˆ TYkeˆj)2 (vˆ vˆ|vTTY−ceˆIj)|Y2keˆj (v TYkeˆj)2 min ˆ ≥ vTYc 22 1≤j≤n (vˆ vˆ T − I)Ykeˆj >0 , (7.34) and vTYc = 0 would contradict the assumption that S1 = R(Yc) = Rm. Thus, there exist 0 < λmin ≤ λmax such that C satisfies Let Sk = Uk ⊕ Wk where Wk = span{vˆ }. Clearly, λ = 0 is an eigenvalue of B associated with the eigenvector vˆ ∈ Wk. 
EigenvaluesforeigenvectorsinUk willnowbeestimated.Let ρ be small enough that −1 < 1 − ρλmax < 1 − ρλmin < 1. Then for x ∈ Uk it follows with the definition of Cˆ and C that xTBx = x 22 − ρxTCˆ x = x 22 − ρxTCx and − x 22 < (1 − ρλmax) x 22 ≤ x 22 − ρxTCx ≤ (1 − ρλmin) x 22 < x 22. (7.36) (7.30) λmin x 22 ≤ xTCx ≤ λmax x 22, ∀x ∈ Sk. (7.35) 0 ∈ Rm×m in (7.37) below denote the identity and the zero matrix, respectively. Then the matrix J satisfies ⎡ I · · · Z ⎤ T ⎢⎢ ... . . . ... ⎥⎥ ⎣⎢ 0 I Z ⎥⎦ 0 · · · 0 P ⎡ I · · · Z ⎤ ⎡ J ⎢⎢ ... . . . ... ⎥ ⎢ ⎢⎣ 0 I Z ⎦⎥⎥ = ⎣⎢⎢ 0 · · · 0 P 0 · · · 0 ... . . . 0 0 Z T Z · · · Z T Z A1 P ⎤ . .. ⎥⎥ An P ⎥⎦ (7.37) proving that the spectrum of J lies in (−1, 1). Since the Jacobian in (6.26) is arbitrarily well approximated by J when τ is sufficiently large, its spectrum lies strictly within the ball of radius 1 for τ sufficiently large and for ρ sufficiently small. Because of the assumption (7.19), the mapping (7.23)–(7.24) is smooth in a neighborhood of the fixed point {D , vˆ }, and hence the iterates {Dl , vˆl } converge to the fixed point when started sufficiently close to it. Remark 2 Computations demonstrate that the iteration (4.10)–(4.11) can converge to a minimizer for Hk even when the condition (7.19) is violated. Such a case is illustrated in Fig. 4. Variations of the example illustrated in Fig. 4, in which data points lie on the boundary of a rectangle instead of a square, indicate an advantage to having sphered the data by 2 means according to (2.8) and (2.10) before proceeding with the methods of Sect. 4. 8 Convergence of the Iterative Scheme for 1 ICA The analysis of the scheme (5.11) begins with establishing existence of a maximizer for Gl in (5.9). Recall the assumption in Sect. 5 that T1 = R(Ys) = Rm so that Tl = R(Yl ) = R(UlT−1)⊥, l = 2, . . . , m, as seen in (5.7). Also, define Tl = { uˆ ∈ Tl : uˆ 2 = 1}. (8.1) Lemma 6 For Gl in (5.9), there exists a maximizer uˆ over Tl satisfying uˆ ∈ Tl . Proof Since uˆTYl = uˆT(I − UlT−1Uk−1)Ys = uˆTYs holds for each l, the assumption T1 = R(Ys) = Rm implies that there is a uˆ ∈ Tl satisfying Gl ( uˆ) = 0. Thus, Gl is not maximized at u = 0. According to (5.9), Gl is constant along rays outside the origin. Thus, the maximization can as well be carried out on Tl . The claim follows since Tl is compact and Gl is continuous on Tl . Lemma 7 For any u ∈ Tl \{0}, the directional derivative of Gl in the direction of w ∈ Tl is given by ∂ Gl (u; w) = δ j (u, w)Yl eˆ j 1 u u 2 , ⎦ where δ j (u, w) = * σ (eˆ TjYlTu), eˆ TjYlTu = 0 σ (eˆ TjYlTw), eˆ TjYlTu = 0 σ (t ) = sign(t ). (8.2) (8.3) (8.4) 2 (8.5) (8.6) (8.7) Proof For u ∈ Tl \{0}, w ∈ Tl , and t > 0 sufficiently small that u + t w = 0, Gl (u + t w) − Gl (u) = − n j=1 |eˆ TjYlTu| u + t w u + t w 2 − 2 u 2 For eˆ TjYlTu = 0, the terms of the first sum in (8.4) satisfy |eˆ TjYlT(u + t w)| − |eˆ TjYlTu| lim t→0+ t u + t w 2 (2t eˆ TjYlTu + t 2 eˆ TjYlTu)(eˆ TjYlTu)/t = lim T t→0+ (|eˆ j YlT(u + t w)| + |eˆ TjYlTu|) u + t w = σ (eˆ TjYlTu) eˆ TjYlTw u 2 and for eˆ TjYlTu = 0, lim t→0+ = |eˆ TjYlT(u + t w)| − |eˆ TjYlTu| t u + t w 2 |eˆ TjYlTw| = σ (eˆ TjYlTw) eˆ TjYlTw . u 2 u 2 The terms of the second sum in (8.4) satisfy u + t w lim t→0+ t u + t w = lim t→0+ u + t w uTw = u 3 . 2 2 − 2 u 2 (2t uTw + t 2 w 22 )/t 2 u 2 ( u + t w 2 + u 2 ) Combining these calculations gives (8.2). Lemma 7 is now used to prove Lemma 8. For the following, let σ (v) = {σ (vi )} where v = {vi } and σ (t ) = sign(t ). 
Lemma 8 The first-order necessary optimality condition for a maximizer uˆ of Gl over Tl satisfying uˆ ∈ Tl as given by Lemma 6 is Yl σ (YlT uˆ ) = YlT uˆ u 1 ˆ and the sets Y = { j : Yl eˆ j = 0} and S = { j : u TYl eˆ j = 0} agree. As a consequence, ∃ > 0 such that ∇ Gl (u) = 1 Yl σ (YlTu) YlT u 1 u , ∀u ∈ B( uˆ , ). (8.9) Proof Let uˆ ∈ Tl be a maximizer for Gl guaranteed by Lemma 6. By (8.2), ∂ Gl ( uˆ ; w) = lim t→0+ ≤ 0, ∀w ∈ Tl . Gl ( uˆ + t w) − Gl ( uˆ ) t By decomposing sums into indices in S and Sc, σ (eˆ TjYlTw)Yl eˆ j + σ (eˆ TjYlT uˆ )Yl eˆ j ∂ Gl ( uˆ ; w) ⎡ = ⎣ − = = j∈S n j=1 j∈S j∈S n j=1 where v = |eˆ TjYlT uˆ | uˆ ⎦ ⎡ |eˆ TjYlTw| + ⎣ ⎤ T w j∈Sc σ (eˆ TjYlT uˆ ) Yl eˆ j − =!"0 |eˆ TjYlTw| + v Tw, σ (eˆ TjYlT uˆ )Yl eˆ j − |eˆ TjYlT uˆ | uˆ = Yl σ (YlT uˆ ) − YlT uˆ (8.10) (8.11) (8.12) j∈Sc n j=1 n j=1 1 uˆ . σ (eˆ TjYlT uˆ )Yl eˆ j |eˆ TjYlT uˆ | uˆ ⎥ ⎦ w (8.13) (8.14) (8.15) Combining (8.10) and (8.11), it follows ∀i ∈ S |eˆiTYlTw| ≤ |eˆ TjYlTw| ≤ 0, ∀w ∈ v ⊥. j∈S (8.8) This implies the existence of {γ j } j∈S (possibly zero) such that Yl eˆ j = γ j v , ∀ j ∈ S. 0 ≥ |γ j | v 2 2 + v Now with w = v , combining (8.10), (8.11), and (8.14) gives n 1 u 2 j=1 or v = 0. With (8.12), the optimality condition (8.8) follows. According to (8.14), S ⊂ Y holds. Since Y ⊂ S always holds, it follows that Y = S. Note that Turning to the iteration (5.11), we observe that if convergence to a fixed point uˆ can be guaranteed, then multiplying uˆ T by u uˆ = u = uˆ + τ [Yl σ (YlT uˆ ) − uˆ uˆ TYl 1 ] gives u 2 = uˆ Tu = 1 + τ [ uˆ TYl σ (YlT uˆ ) − uˆ TYl 1 ] = 1 where the last equality follows with (YlT uˆ )Tσ (YlT uˆ ) = uˆ TYl 1 . Since u 2 = uˆ 2 = 1 holds, it follows that u = uˆ and hence uˆ satisfies (8.8). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed. Theorem 3 Let uˆ ∈ Tl satisfy (8.8) where the set S = { j : eˆ TjYlT uˆ = 0} is empty. Then uˆ is a fixed point of the iteration (5.11), and for τ sufficiently small, the iterates { uˆk } converge to this fixed point when { uˆ0} starts the iteration close enough to { uˆ }. Proof It will first be shown that { uˆ } is a fixed point for the iteration (5.11) which is locally asymptotically stable for τ sufficiently small. Setting uˆ k+1 = uˆ on the right side of (5.11) and using (8.8) shows that uk+1 = uˆ . Hence with uk+1 2 = uˆ 2 = 1 it follows that uˆ k+1 = uk+1 = uˆ , (8.17) and thus uˆ is a fixed point. To establish the stability of the fixed point, define ∂ g ∂ u (8.18) (8.19) (8.20) (8.21) (8.22) It follows with (8.8) that g( uˆ ) = uˆ and thus g( uˆ ) 2 = uˆ 2 = 1. The Jacobian of G at uˆ is symmetric and is given by ∂ G ∂ u ( uˆ ) = (1 − τ YlT uˆ )(I − uˆ uˆ T). For v ∈ Rm , it follows with 0 ≤ (I − uˆ uˆ T)v 22 = vT(I − uˆ uˆ T)v ≤ v 22 that the Jacobian satisfies G(u) = g(u) g(u) 2 , g(u) = u + τ [Yl σ (YlTu) − u YlTu 1 ] so that (5.11) is given by uˆ k+1 = G(uˆ k ). The claimed stability will follow once it is shown that the Jacobian of this mapping evaluated at uˆ has spectral radius less than 1 when τ is sufficiently small. The Jacobian is given by ∂ G ∂ u = 1 g(u) 2 I − g(u)gT(u) g(u) 2 2 ∂ g ∂ u where with { j : eˆ TjYlTu = 0} = ∅, n j=1 (u) = (1 − τ YlT u 1 )I − τ u σ (eˆ TjYlTu)eˆ TjYlT. 
9 Application to DCE-MRI Sequences

In this section, the proposed methods are applied to the dynamic contrast-enhanced magnetic resonance image (DCE-MRI) sequence http://math.uni-graz.at/keeling/manuskripten/dcemri.mpg to separate intensity changes due to physiological motion from those due to contrast agent. With such a separation, unavoidable physiological motion may be removed in order to investigate tissues in a stationary state. See also [20] and [22]. To focus on the period in which contrast agent arrives in the imaged tissues, only the first 40 of 134 frames are used for the following decompositions. Each frame consists of an image with 400 × 400 pixels. Thus, in the notation of Sect. 2, the data are \(Y \in \mathbb R^{m\times n}\) with \(m = 400^2\) and \(n = 40\). For a static display of the DCE-MRI sequence, representative stages are shown in Fig. 7: exhale and inhale, with and without contrast agent. Specifically, with \(\bar Y\), \(V\), \(\Lambda\) given by (2.7) and (2.10), the images of Fig. 7 are given by

\[ \bar Y + V\Lambda^{1/2}e_{ij}, \quad i, j = \pm 1, \quad e_{ij} = (i, j, 0, \dots, 0)^{\mathsf T}, \tag{9.1} \]

and the images \(\bar Y\) and

\[ \hat v_i = V\hat e_i, \quad i = 1, 2, \quad (\hat e_i)_j = \delta_{ij}, \tag{9.2} \]

are shown in Fig. 8.

Fig. 7 Representative stages of the DCE-MRI sequence: exhale and inhale with and without contrast agent, where these are defined by (9.1).

Brightness changes in relation to the background are seen throughout organs in Fig. 8b, and this suggests that \(\hat v_1\) represents intensity changes in the DCE-MRI sequence due to contrast agent. On the other hand, brightness changes are seen mainly on the edges of organs in Fig. 8c, and this suggests that \(\hat v_2\) represents intensity changes in the DCE-MRI sequence due to physiological motion. The image sequence is shown more dynamically in Fig. 9. The graphs in the left column are the time courses \(\hat v_i^{\mathsf T}Y\), \(\hat v_i^{\mathsf T}Y_s\), \(\hat v_i^{\mathsf T}X_c\), \(i = 1, 2\), for the raw, sphered, and independent data, respectively, where \(Y_s\) and \(X_c\) are given by (2.8) and (2.14), respectively. The graphs in the right column are corresponding plots in a phase plane. To determine the most significant independent components, all but the top two principal components were discarded. Then \(V\) was replaced by its first two columns, \(Y_s\) by its first two rows, and \(\Lambda\) by a diagonal matrix containing the largest two eigenvalues. Also, the independent images

\[ V\Lambda^{1/2}U^{\mathsf T}Q_i X_c, \quad i = 1, 2, \quad Q_i = \operatorname{diag}\{\hat e_i\}, \quad (\hat e_i)_j = \delta_{ij}, \tag{9.3} \]

are shown in Fig. 9g, h. As explained in connection with Fig. 8, these can be associated respectively with intensity changes in the DCE-MRI sequence due to contrast agent and to physiological motion. Note that there are only small differences between Figs. 9c and e, between Figs. 9d and f, between Figs. 8b and 9g, and between Figs. 8c and 9h. Thus, the separation of intensity changes due to physiological motion from those due to contrast agent is achieved here already with the sphered data. Hence the transformation to independent data had little effect for this particular example. Recall from Sect. 2 that the order and the sign of ICA components are not uniquely determined.
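The following sketch mimics, on synthetic data, the bookkeeping of this section: centering, extraction of the top two principal directions, and formation of the representative images (9.1) and the raw time courses plotted in Fig. 9. Since the definitions (2.7)–(2.10) appear earlier in the paper, the \(\ell_2\) mean and the eigendecomposition used below are assumed stand-ins for \(\bar Y\), \(V\), and \(\Lambda\), and the array sizes are reduced stand-ins for the 400 × 400 pixels and 40 frames.

```python
# Hedged sketch of the decomposition behind Figs. 7-9 on synthetic data.  The
# centering and eigendecomposition are assumed stand-ins for (2.7)-(2.10);
# frame size and count are reduced from 400x400 and 40 for readability.
import numpy as np

m, n = 16 * 16, 40                          # pixels per frame, number of frames
rng = np.random.default_rng(2)
Y = rng.standard_normal((m, n))             # columns are frames, as in Sect. 2

ybar = Y.mean(axis=1, keepdims=True)        # mean frame, standing in for \bar Y
Yc = Y - ybar
lam, V = np.linalg.eigh(Yc @ Yc.T / n)      # sample covariance (feasible only for small m)
idx = np.argsort(lam)[::-1][:2]
lam, V = lam[idx], V[:, idx]                # keep the top two principal components

# Representative stages as in (9.1): ybar + V Lambda^{1/2} e_ij, e_ij = (i, j)^T here
stages = {(i, j): ybar[:, 0] + V @ (np.sqrt(lam) * np.array([i, j]))
          for i in (1, -1) for j in (1, -1)}

# Principal images v_hat_i = V e_i as in (9.2) and raw time courses v_hat_i^T Y
v_hat = [V[:, 0], V[:, 1]]
courses = V.T @ Y                           # shape (2, n), cf. the left column of Fig. 9
```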
This separation will now be considered in the presence of outliers. As seen in the full DCE-MRI sequence, an excessively bright frame may appear suddenly. To simulate this effect, the intensities of the final frame of the sequence are increased by a constant factor. Then the same methods used for Fig. 9 are applied to the corrupted data, and the results are shown in Fig. 10 in the same format as that used in Fig. 9. The corrupted data may be seen at the final time shown in the graphs of the first column of Fig. 10. Also, the outlier is conspicuous in the phase plane graphs in the right column of Fig. 10. Finally, the images defined by (9.3) with the corrupted data are shown in Figs. 10g and h. Since these clearly differ from their counterparts in Figs. 9g and h, the presence of the single outlier has corrupted the separation of intensity changes due to physiological motion from those due to contrast agent. We would like to report that we obtained results comparable to those seen in Fig. 10 by applying the FastICA code [13] using its optional robust merit functions.

For comparison, the \(\ell_1\)-based methods of Sects. 3–5 are now applied to the corrupted data, and the results are shown in Fig. 11 in the same format as that used in Figs. 9 and 10. Now the matrices \(Y_c\), \(V\), \(\Lambda\), \(Y_s\), and \(U\) are understood as explained in Sects. 3–5 as well as in Remark 2. As in the previous cases, all but the top two principal components are discarded, and the respective matrices are reduced correspondingly to have only two rows or columns or both. Then (9.2) and (9.3) apply with these \(\ell_1\)-based matrices. As before, the corrupted data may be seen at the final time shown in the graphs of the first column of Fig. 11. Also, the outlier is conspicuous in the phase plane graphs in the right column of Fig. 11. Finally, the images defined by (9.3) with the corrupted data are shown in Figs. 11g and h. Note the similarities between Figs. 11e–h and their counterparts in Figs. 9e–h. On this basis, the \(\ell_1\) methods can be seen to have successfully separated intensity changes due to physiological motion from those due to contrast agent in spite of the outlier. With this separation, the data are projected onto the single independent component of Fig. 11g using (2.21) to produce the following transformed DCE-MRI sequence manifesting contrast changes free of physiological motion: http://math.uni-graz.at/keeling/manuskripten/dcemri_ica.mpg.

10 Conclusion

In this work, robust measures have been introduced for centering complex data in the presence of outliers and for determining principal and independent components of such data. The approach to centering is to use the geometric median. The approach for determining principal components is to find best-fit lines through the data, where each line minimizes the sum of distances (not squared) to data points in the subspace orthogonal to other components. The approach for determining independent components is first to sphere the data so that the corresponding axes are aligned with clusters, and then to determine independent axes as those which separate sphered clusters as much as possible. This separation is accomplished by maximizing an \(\ell_1\) counterpart of Rayleigh quotients. To optimize the respective merit functions, iterative methods were proposed and their local convergence was proved. Illustrative examples were presented to demonstrate the benefits of the robust approaches.
Finally, the proposed methods were applied to a DCE-MRI sequence to separate intensity changes due to physiological motion from those due to contrast agent, and the benefits of the robust methods have been demonstrated with respect to this realistic example.

Acknowledgments Open access funding provided by University of Graz. The authors wish to express their appreciation for many fruitful discussions with Gernot Reishofer of the Radiology Clinic of the Graz Medical University, who also provided data for this work. Also, the authors gratefully acknowledge the support of the Austrian Science Fund, Fonds zur Förderung der Wissenschaftlichen Forschung (FWF), under Grant SFB F032, and are linked on the SFB webpage http://math.uni-graz.at/mobis.

Stephen L. Keeling was born in 1956 in Louisville, Kentucky, USA. He received the B.S. in Biology and Chemistry from Eastern Kentucky University in 1978, the M.S. in Biomedical Engineering (bioelectric phenomena) from Case Western Reserve University in 1981, and the Ph.D. in Mathematics (numerical analysis for PDEs) from the University of Tennessee in 1986. His postdoctoral research at ICASE and at Vanderbilt University focused on active noise control until 1989. From 1989 until 1998, he worked as Senior and then Principal Scientist in the CFD Group at the Arnold Engineering Development Center, specializing in flow control and imaging. From 1998 until 2001, he worked with the mathematics and radiology faculties of the University of Graz as a Research Associate. From 2001 until 2006, he worked as an Assistant Professor at the Institute of Mathematics and Scientific Computing of the University of Graz. In 2006, he received Habilitation at this institute, where he has since been an Associate Professor. His research interests include biomedical modeling and mathematical image processing with emphasis on MRI applications.

Karl Kunisch is a professor and head of the Department of Mathematics at the University of Graz, and Scientific Director of the Radon Institute of the Austrian Academy of Sciences in Linz. He received his Ph.D. and Habilitation at the Technical University of Graz in 1978 and 1980, respectively. His research interests include optimization and optimal control, inverse problems and mathematical imaging, and numerical analysis and applications, currently focusing on topics in the life sciences. He spent three years at the Lefschetz Center for Dynamical Systems at Brown University, USA, held visiting positions at INRIA Rocquencourt and the Université Paris Dauphine, and was a consultant at ICASE, NASA Langley, USA. Before joining the faculty at the University of Graz, he was a professor of numerical mathematics at the Technical University of Berlin. He is the author of two monographs and about 270 papers. He is the editor of numerous journals, including SIAM Numerical Analysis and SIAM Optimization and Optimal Control, and the Journal of the European Mathematical Society.

References

1. Ash, R.B.: Real Analysis and Probability. Academic Press, New York (1972)
2. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2010)
3. Bose, P., Maheshwari, A., Morin, P.: Fast approximations for sums of distances, clustering and the Fermat–Weber problem. Computational Geometry: Theory and Applications 24(3), 135–146 (2003)
4. Boyle, J.P., Dykstra, R.L.: A method for finding projections onto the intersection of convex sets in Hilbert spaces. Lecture Notes in Statistics 37, 28–47 (1986)
5. Brooks, J.P., Dulá, J.H., Boone, E.L.: A pure L1-norm principal component analysis. Computational Statistics and Data Analysis 61, 83–98 (2013)
6. Cai, T.T., Ma, Z., Wu, Y.: Sparse PCA: optimal rates and adaptive estimation. Annals of Statistics 41(6), 3074–3110 (2013)
7. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM 58(3), Art. 11 (2011)
8. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40(1), 120–145 (2011)
9. Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley Interscience (1983)
10. Ding, C., Zhou, D., He, X., Zha, H.: R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 281–288 (2006)
11. Dodge, Y., Rousson, V.: Multivariate L1 mean. Metrika 49(2), 127–134 (1999)
12. Ekeland, I., Témam, R.: Convex Analysis and Variational Problems. North-Holland, Amsterdam-Oxford (1976)
13. Gävert, H., Hurri, J., Särelä, J., Hyvärinen, A.: The FastICA package, http://research.ics.aalto.fi/ica/fastica/, Copyright (C) 1996–2005
14. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
15. Jolliffe, I.T.: Principal Component Analysis. Springer, New York (1986)
16. Kwak, N.: Principal component analysis based on L1-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 1672–1680 (2008)
17. Lerman, G., McCoy, M., Tropp, J.A., Zhang, T.: Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics 15(2), 363–410 (2015)
18. McCoy, M., Tropp, J.A.: Two proposals for robust PCA via semidefinite programming. Electronic Journal of Statistics 5, 1123–1160 (2011)
19. Weiszfeld, E., Plastria, F.: On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research 167, 7–41 (2009)
20. Reishofer, G., Fazekas, F., Keeling, S.L., Enzinger, C., Payer, F., Simbrunner, J., Stollberger, R.: Minimizing macrovessel signal in cerebral perfusion imaging using independent component analysis. Magnetic Resonance in Medicine 57(2), 278–288 (2007)
21. Weiszfeld, E.: Sur le point pour lequel la somme des distances de n points donnés est minimum. Tôhoku Mathematical Journal 43, 355–386 (1937)
22. Wollny, G., Kellman, P., Santos, A., Ledesma-Carbayo, M.J.: Automatic motion compensation of free breathing acquired myocardial perfusion data by using independent component analysis. Medical Image Analysis 16, 1015–1028 (2012)
23. Xu, H., Caramanis, C., Sanghavi, S.: Robust PCA via outlier pursuit. IEEE Transactions on Information Theory 58(5), 3047–3064 (2012)
24. Zarzoso, V.: Robust independent component analysis by iterative maximization of the kurtosis contrast with algebraic optimal step size. IEEE Transactions on Neural Networks 21(2), 248–261 (2010)



Stephen L. Keeling, Karl Kunisch. Robust \(\ell _1\) Approaches to Computing the Geometric Median and Principal and Independent Components, Journal of Mathematical Imaging and Vision, 2016, 99-124, DOI: 10.1007/s10851-016-0637-9