{"title": "Alternating minimization for dictionary learning with random initialization", "book": "Advances in Neural Information Processing Systems", "page_first": 1997, "page_last": 2006, "abstract": "We present theoretical guarantees for an alternating minimization algorithm for the dictionary learning/sparse coding problem. The dictionary learning problem is to factorize vector samples $y^{1},y^{2},\\ldots, y^{n}$ into an appropriate basis (dictionary) $A^*$ and sparse vectors $x^{1*},\\ldots,x^{n*}$. Our algorithm is a simple alternating minimization procedure that switches between $\\ell_1$ minimization and gradient descent in alternate steps. Dictionary learning and specifically alternating minimization algorithms for dictionary learning are well studied both theoretically and empirically. However, in contrast to previous theoretical analyses for this problem, we replace a condition on the operator norm (that is, the largest magnitude singular value) of the true underlying dictionary $A^*$ with a condition on the matrix infinity norm (that is, the largest magnitude term). This not only allows us to get convergence rates for the error of the estimated dictionary measured in the matrix infinity norm, but also ensures that a random initialization will provably converge to the global optimum. Our guarantees are under a reasonable generative model that allows for dictionaries with growing operator norms, and can handle an arbitrary level of overcompleteness, while having sparsity that is information theoretically optimal. We also establish upper bounds on the sample complexity of our algorithm.", "full_text": "Alternating minimization for dictionary learning with\n\nrandom initialization\n\nNiladri S. Chatterji\n\nUC Berkeley\n\nniladri.chatterji@berkeley.edu\n\nPeter L. Bartlett\n\nUC Berkeley\n\npeter@berkeley.edu\n\nAbstract\n\nWe present theoretical guarantees for an alternating minimization algorithm for\nthe dictionary learning/sparse coding problem. The dictionary learning problem\nis to factorize vector samples y1, y2, . . . , yn into an appropriate basis (dictionary)\nA\u21e4 and sparse vectors x1\u21e4, . . . , xn\u21e4. Our algorithm is a simple alternating mini-\nmization procedure that switches between `1 minimization and gradient descent in\nalternate steps. Dictionary learning and speci\ufb01cally alternating minimization algo-\nrithms for dictionary learning are well studied both theoretically and empirically.\nHowever, in contrast to previous theoretical analyses for this problem, we replace\na condition on the operator norm (that is, the largest magnitude singular value)\nof the true underlying dictionary A\u21e4 with a condition on the matrix in\ufb01nity norm\n(that is, the largest magnitude term). This not only allows us to get convergence\nrates for the error of the estimated dictionary measured in the matrix in\ufb01nity norm,\nbut also ensures that a random initialization will provably converge to the global\noptimum. Our guarantees are under a reasonable generative model that allows\nfor dictionaries with growing operator norms, and can handle an arbitrary level of\novercompleteness, while having sparsity that is information theoretically optimal.\nWe also establish upper bounds on the sample complexity of our algorithm.\n\nyi = A\u21e4xi\u21e4\n\nIntroduction\n\n1\nIn the problem of sparse coding/dictionary learning, given i.i.d. samples y1, y2, . . . , yn 2 Rd\nproduced from the generative model\n(1)\nfor i 2{ 1, 2, . . . , n}, the goal is to recover a \ufb01xed dictionary A\u21e4 2 Rd\u21e5r and s-sparse vectors\nxi\u21e4 2 Rr. (An s-sparse vector has no more than s non-zero entries.) In many problems of interest,\nthe dictionary is often overcomplete, that is, r d. This is believed to add \ufb02exibility in modeling\nand robustness. This model was \ufb01rst proposed in neuroscience as an energy minimization heuristic\nthat reproduces features of the V1 region of the visual cortex [28; 22]. It has also been an extremely\nsuccessful approach to identifying low dimensional structure in high dimensional data; it is used\nextensively to \ufb01nd features in images, speech and video (see, for example, references in [13]).\nMost formulations of dictionary learning tend to yield non-convex optimization problems. For\nexample, note that if either xi\u21e4 or A\u21e4 were known, given yi, this would just be a (matrix/sparse)\nregression problem. However, estimating both xi\u21e4 and A\u21e4 simultaneously leads to both computational\nas well as statistical complications. The heuristic of alternating minimization works well empirically\nfor dictionary learning. At each step, \ufb01rst an estimate of the dictionary is held \ufb01xed while the sparse\ncoef\ufb01cients are estimated; next, using these sparse coef\ufb01cients the dictionary is updated. Note that in\neach step the sub-problem has a convex formulation, and there is a range of ef\ufb01cient algorithms that\ncan be used. This heuristic has been very successful empirically, and there has also been signi\ufb01cant\nrecent theoretical progress in understanding its performance, which we discuss next.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f1.1 Related Work\nA recent line of work theoretically analyzes local linear convergence rates for alternating minimization\nprocedures applied to dictionary learning [1; 4]. Arora et al. [4] present a neurally plausible algorithm\n\nthat recovers the dictionary exactly for sparsity up to s = O(pd/(\u00b5 log(d))), where \u00b5/pd is the\n\nlevel of incoherence in the dictionary (which is a measure of the similarity of the columns; see\nAssumption A1 below). Agarwal et al. [1] analyze a least squares/`1 minimization scheme and show\nthat it can tolerate sparsity up to s = O(d1/6). Both of these establish local linear convergence\nguarantees for the maximum column-wise distance. Exact recovery guarantees require a singular-\nvalue decomposition (SVD) or clustering based procedure to initialize their dictionary estimates (see\nalso the previous work [5; 2]).\nFor the undercomplete case (when r \uf8ff d), Sun et al. [33] provide a Riemannian trust region method\nthat can tolerate sparsity s = O(d), while earlier work by Spielman et al. [32] provides an algorithm\nthat works in this setting for sparsity O(pd).\n\nLocal and global optima of non-convex formulations for the problem have also been extensively\nstudied in [36; 17; 18], among others. Apart from alternating minimization, other approaches (without\ntheoretical convergence guarantees) for dictionary learning include K-SVD [3] and MOD [14]. There\nis also a nice formulation by Barak et al. [7], based on the sum-of-squares hierarchy. Recently,\nHazan and Ma [20] provide guarantees for improper dictionary learning, where instead of learning\na dictionary, they learn a comparable encoding via convex relaxations. Our work also adds to the\nrecent literature on analyzing alternating minimization algorithms [21; 26; 27; 19; 6].\n\n1.2 Contributions\nOur main contribution is to present new conditions under which alternating minimization for dictio-\nnary learning converges at a linear rate to the global optimum. We impose a condition on the matrix\nin\ufb01nity norm (largest magnitude entry) of the underlying dictionary. This allows dictionaries with\noperator norm growing with dimension (d, r). The error rates are measured in the matrix in\ufb01nity\nnorm, which is sharper than the previous error rates in maximum column-wise error.\nWe also identify conditions under which a trivial random initialization of the dictionary works, as\nopposed to the more complex SVD and clustering procedures required in previous work. This is\npossible as our radius of convergence, again measured in the matrix in\ufb01nity norm, is larger than that\nof previous results, which required the initial estimate to be close column-wise. Our results hold\nfor a rather arbitrary level of overcompleteness, r = O(poly(d)). We establish convergence results\nfor sparsity level s = O(pd), which is information theoretically optimal for incoherent dictionaries\nand improves the previously best known results in the overcomplete setting by a logarithmic factor.\nOur algorithm is simple, involving an `1-minimization step followed by a gradient update for the\ndictionary.\nA key step in our proofs is an analysis of a robust sparse estimator\u2014{`1,` 2,` 1}-MU Selector\u2014\nunder \ufb01xed (worst case) corruption in the dictionary. We prove that this estimator is minimax optimal\nin this setting, which might be of independent interest.\n\n1.3 Organization\nIn Section 2, we present our alternating minimization algorithm and discuss the sparse regression\nestimator. In Section 3, we list the assumptions under which our algorithm converges and state the\nmain convergence result. Finally, in Section 4, we prove convergence of our algorithm. We defer\ntechnical lemmas, analysis of the sparse regression estimator, and minimax analysis to the appendix.\n\nNotation\nFor a vector v 2 Rd, vi denotes the ith component of the vector, kvkp denotes the `p norm, supp(v)\ndenotes the support of a vector v, that is, the set of non-zero entries of the vector, sgn(v) denotes\nthe sign of the vector v, that is, a vector with sgn(v)i = 1 if vi > 0, sgn(v)i = 1 if vi < 0 and\nsgn(v)i = 0 if vi = 0. For a matrix W , Wi denotes the ith column, Wij is the element in the ith\nrow and jth column, kWkop denotes the operator norm, and kWk1 denotes the maximum of the\nmagnitudes of the elements of W . For a set J, we denote its cardinality by |J|. Throughout the paper,\n\n2\n\n\fAlgorithm 1: Alternating Minimization for Dictionary Learning\n\nInput\n\n:Step size \u2318, samples {yk}n\n{\u2327 (t)}T\n\nt=1, initial radius R(0) and parameters {(t)}T\n\nk=1, initial estimate A(0), number of steps T , thresholds\nt=1.\n\nt=1,{(t)}T\n\nt=1 and {\u232b(t)}T\n\nfor k = 1, 2, . . . , n do\n\n1 for t = 1, 2, . . . , T do\n2\n3\n4\n\nwk,(t) = M U S(t),(t),\u232b(t)(yk, A(t1), R(t1))\nfor l = 1, 2, 3 . . . , r do\n\n(xk,(t) is the sparse estimate)\n\nI\u21e3|wk,(t)\n\nl\n\n| >\u2327 (t)\u2318 ,\n\nxk,(t)\nl\n\n= wk,(t)\n\nl\n\nend\n\nend\nfor i = 1, 2, . . . , d do\n\n5\n\n6\n7\n8\n9\n\n10\n\nfor j = 1, 2, . . . , r do\nk=1hPr\nnPn\n \u2318\n\nij = A(t1)\nA(t)\n\nend\n\nij\n\np=1\u21e3A(t1)\n\nip\n\nxk,(t)\np\n\nxk,(t)\nj yk\n\ni xk,(t)\n\nj\n\n\u2318i\n\n11\n12\n13\n14 end\n\nend\nR(t) = 7\n\n8 R(t1).\n\nwe use C multiple times to denote global constants that are independent of the problem parameters\nand dimension. We denote the indicator function by I(\u00b7).\n2 Algorithm\n\nGiven an initial estimate of the dictionary A(0) we alternate between an `1 minimization procedure\n(speci\ufb01cally the {`1,` 2,` 1}-MU Selector\u2014M U S,,\u232b in the algorithm\u2014followed by a thresholding\nstep) and a gradient step, under sample `2 loss, to update the dictionary. We analyze this algorithm\nand demand linear convergence at a rate of 7/8; convergence analysis for other rates follows in the\nsame vein with altered constants. In subsequent sections, we also establish conditions under which\nthe initial estimate for the dictionary A(0) can be chosen randomly. Below we state the permitted\nrange for the various parameters in the algorithm above.\n\n1. Step size: We need to set the step size in the range 3r/4s < \u2318 < r/s.\n2. Threshold: At each step set the threshold at \u2327 (t) = 16R(t1)M (R(t1)(s + 1) + s/pd).\n3. Tuning parameters: We need to pick (t) and \u232b(t) such that the assumption (D5) is satis\ufb01ed.\n\nA choice that is suitable that satis\ufb01es this assumption is\n\n32\u2713s3/2\u21e3R(t1)\u23182\n\n+\n\ns3/2R(t1)\n\nd1/2\n\nWe need to set (t) as speci\ufb01ed by Theorem 16,\n\n(t) = ps\u21e3R(t1)\u23182\n\n\uf8ff \u232b(t) \uf8ff 3,\nps\u25c6 \uf8ff (t) \uf8ff 3.\n\n6\n\n128s\u21e3R(t1)\u23182\n\u25c6\u27134 +\n+r s\n\nR(t1).\n\nd\n\n2.1 Sparse Regression Estimator\n\nOur proof of convergence for Algorithm 1 also goes through with a different choices of robust sparse\nregression estimators, however, we can establish the tightest guarantees when the {`1,` 2,` 1}-MU\nSelector is used in the sparse regression step. The {`1,` 2,` 1}-MU Selector [8] was established as\na modi\ufb01cation of the Dantzig selector to handle uncertainty in the dictionary. There is a beautiful\nline of work that precedes this that includes [30; 31; 9]. There are also modi\ufb01ed non-convex LASSO\n\n3\n\n\fprograms that have been studied [23] and Orthogonal Matching Pursuit algorithms under in-variable\nerrors [11]. However these estimators require the error in the dictionary to be stochastic and zero mean\nwhich makes them less suitable in this setting. Also note that standard `1 minimization estimators\nlike the LASSO and Dantzig selector are highly unstable under errors in the dictionary and would\nlead to much worse guarantees in terms of radius of convergence (as studied in [1]). We establish the\nerror guarantees for a robust sparse estimator {`1,` 2,` 1}-MU Selector under \ufb01xed corruption in\nthe dictionary. We also establish that this estimator is minimax optimal when the error in the sparse\nestimate is measured in in\ufb01nity norm k\u02c6\u2713 \u2713\u21e4k1 and the dictionary is corrupted.\nThe {`1,` 2,` 1}-MU Selector\nDe\ufb01ne the estimator \u02c6\u2713 such that (\u02c6\u2713, \u02c6t, \u02c6u) 2 Rr \u21e5 R+ \u21e5 R+ is the solution to the convex minimization\nproblem\n\nmin\n\n\u2713,t,u(k\u2713k1 + t + \u232bu\n\n1\nd\n\n\u2713 2 Rr,\n\nA>y A\u27131 \uf8ff t + R2u,k\u2713k2 \uf8ff t,k\u2713k1 \uf8ff u) (2)\n\nwhere, , and \u232b are tuning parameters that are chosen appropriately. R is an upper bound on the\nerror in our dictionary measured in matrix in\ufb01nity norm. Henceforth the \ufb01rst coordinate (\u02c6\u2713) of this\nestimator is called M U S,,\u232b (y, A, R), where the \ufb01rst argument is the sample, the second is the\nmatrix, and the third is the value of the upper bound on the error of the dictionary measured in in\ufb01nity\nnorm. We will see that under our assumptions we will be able to establish an upper bound on the\n\nerror on the estimator, k\u02c6\u2713 \u2713\u21e4k1 \uf8ff 16RM\u21e3R(s + 1) + s/pd\u2318, where |\u2713\u21e4j|\uf8ff M 8j. We de\ufb01ne a\nthreshold at each step \u2327 = 16RM (R(s + 1) + s/pd). The thresholded estimate \u02dc\u2713 is de\ufb01ned as\n\n\u02dc\u2713i = \u02c6\u2713iI[|\u02c6\u2713i| >\u2327 ]\n\n8i 2{ 1, 2, . . . , r}.\n\n(3)\nOur assumptions will ensure that we have the guarantee sgn(\u02dc\u2713) = sgn(\u2713\u21e4). This will be crucial in\nour proof of convergence. The analysis of this estimator is presented in Appendix B.\nTo identify the signs of the sparse covariates correctly using this class of thresholded estimators, we\nwould like in the \ufb01rst step to use an estimator \u02c6\u2713 that is optimal, as this would lead to tighter control\nover the radius of convergence. This makes the choice of {`1,` 2,` 1}-MU Selector natural, as we\nwill show it is minimax optimal under certain settings.\n\nTheorem 1 (informal). De\ufb01ne the sets of matrices A = {B 2 Rd\u21e5rkBik2 \uf8ff 1, 8i 2{ 1, . . . , r}}\nand W = {P 2 Rd\u21e5rkPk1 \uf8ff R} with R = O(1/ps). Then there exists an A\u21e4 2A and W 2W\n\nwith A , A\u21e4 + W such that\n\n\u2713\u21e4 k \u02c6T \u2713\u21e4k1 CRL s1 \n\nsup\n\nlog(s)\n\nlog(r)! ,\n\ninf\n\u02c6T\n\n(4)\n\nwhere the inf \u02c6T is over all measurable estimators \u02c6T with input (A\u21e4\u2713\u21e4, A, R), and the sup is over\ns-sparse vectors \u2713\u21e4 with 2-norm L > 0.\nRemark 2. Note that when R = O(1/ps) and s = O(pd), this lower bound matches the upper\nbound we have for Theorem 16 (up to logarithmic factors) and hence the {`1,` 2,` 1}-MU Selector\nis minimax optimal.\n\nThe proof of this theorem follows by Fano\u2019s method and is relegated to Appendix C.\n\n2.2 Gradient Update for the dictionary\nWe note that the update to the dictionary at each step in Algorithm 1 is as follows\n\nij = A(t1)\nA(t)\n\nij\n\nn\n\n \u2318 1\n|\n\nnXk=1\" rXp=1\u21e3A(t1)\n\nip\n\nxk,(t)\nj yk\n\ni xk,(t)\n\nj\n\nxk,(t)\np\n\n,\u02c6g(t)\n\nij\n\n{z\n\n,\n\n\u2318#!\n}\n\n4\n\n\ffor i 2{ 1, . . . , d}, j 2{ 1, . . . , r} and t 2{ 1, . . . , T}. If we consider the loss function at time step t\nbuilt using the vector samples y1, . . . , yn and sparse estimates x1,(t), . . . , xn,(t),\n\nLn(A) =\n\n1\n2n\n\nnXk=1yk Axk,(t)\n\n2\n\n2\n\n,\n\nA 2 Rd\u21e5r,\n\nwe can identify the update to the dictionary \u02c6g(t) as the gradient of this loss function evaluated at\nA(t1),\n\n\u02c6g(t) =\n\n3 Main Results and Assumptions\n\n@Ln(A)\n\n@A A(t1)\n\n.\n\nIn this section we state our convergence result and state the assumptions under which our results are\nvalid.\n\n3.1 Assumptions\nAssumptions on A\u21e4\n(A1) Incoherence: We assume the the true underlying dictionary is \u00b5/pd-incoherent\n\n|hA\u21e4i , A\u21e4ji| \uf8ff\n\n\u00b5\npd\n\n8 i, j 2{ 1, . . . , r} such that, i 6= j.\n\nThis is a standard assumption in the sparse regression literature when support recovery is of\ninterest. It was introduced in [15; 34] in signal processing and independently in [38; 25] in\nstatistics. We can also establish guarantees under the strictly weaker `1-sensitivity condition\n(cf. [16]) used in analyzing sparse estimators under in-variable uncertainty in [9; 31]. The\n{`1,` 2,` 1}-MU selector that we use for our sparse recovery step also works with this more\ngeneral assumption, however for ease of exposition we assume A\u21e4 to be \u00b5/pd-incoherent.\n\n(A2) Normalized Columns: We assume that all the columns of A\u21e4 are normalized to 1,\n\nkA\u21e4ik2 = 1 8 i 2{ 1, . . . , r}.\n\ni=1 are invariant when we scale the columns of A\u21e4 or under\nNote that the samples {yi}n\npermutations of its columns. Thus we restrict ourselves to dictionaries with normalized\ncolumns and label the entire equivalence class of dictionaries with permuted columns and\nvarying signs as A\u21e4.\n\n(A3) Bounded max-norm: We assume that A\u21e4 is bounded in matrix in\ufb01nity norm\n\nkA\u21e4k1 \uf8ff\n\nCb\ns\n\n.\n\nThis is in contrast with previous work that imposes conditions on the operator norm of A\u21e4\n[4; 1; 5]. Our assumptions help provide guarantees under alternate assumptions and it also\nallows the operator norm to grow with dimension, whereas earlier work requires A\u21e4 to be such\n\nthat kA\u21e4kop \uf8ff C\u21e3pr/d\u2318. In general the in\ufb01nity norm and operator norm balls are hard to\n\ncompare. However, one situation where a comparison is possible is if we assume the entries\nof the dictionary to be drawn iid from a Gaussian distribution N (0, 2). Then by standard\nconcentration theorems, for the operator norm condition to be satis\ufb01ed we would need the\nvariance (2) of the distribution to scale as O(1/d) while, for the in\ufb01nity norm condition to be\nsatis\ufb01ed we need the variance to be \u02dcO(1/s2). This means that modulo constants the variance\ncan be much larger for the in\ufb01nity norm condition to be satis\ufb01ed than for the operator norm\ncondition.\n\nAssumption on the initial estimate and initialization\n(B1) We require an initial estimate for the dictionary A(0) such that,\n\nkA(0) A\u21e4k1 \uf8ff\n\nCR\ns\n\n.\n\n5\n\n\fwith 2Cb = CR; where CR = 1/2000M 2. Assuming 2Cb = CR allows a fast random\ninitialization, where we draw each entry of the initial estimate from the uniform distribution\n(on the interval (Cb/2s, Cb/2s)). This allows us to circumvent the computationally heavy\nSVD/clustering step required in previous work to get an initial dictionary estimate [4; 1; 5].\nNote that we start with a random initialization and not with A(0) = 0, as this causes our sparse\nestimator to fail (columns of A need to be non-zero).\n\nAssumptions on x\u21e4\nNext we assume a generative model on the s-sparse covariates x\u21e4. Here are the assumptions we make\nabout the (unknown) distribution of x\u21e4.\n\n(C1) Conditional Independence: We assume that distribution of non-zero entries of x\u21e4 is condi-\n\ntionally independent and identically distributed. That is, x\u21e4i ?? x\u21e4j|x\u21e4i , x\u21e4j 6= 0.\n\n(C2) Sparsity Level:We assume that the level of sparsity s is bounded\n2 \uf8ff s \uf8ff min(2pd, Cbpd, Cpd/\u00b5),\n\nwhere C is an appropriate global constant such that A\u21e4 satis\ufb01es assumption (D3), see Re-\nmark 15. For incoherent dictionaries, this upper bound is tight up to constant factors for sparse\nrecovery to be feasible [12; 18].\n\n(C3) Boundedness: Conditioned on the event that i is in the subset of non-zero entries, we have\n\nm \uf8ff| x\u21e4i|\uf8ff M,\n\nwith m 32R(0)M (R(0)(s + 1) + s/pd) and M > 1. This is needed for the thresholded\nsparse estimator to correctly predict the sign of the true covariate (sgn(x) = sgn(x\u21e4)). We can\nalso relax the boundedness assumption: it suf\ufb01ces for the x\u21e4i to have sub-Gaussian distributions.\n(C4) Probability of support: The probability of i being in the support of x\u21e4 is uniform over all\n\ni 2{ 1, 2, . . . , r}. This translates to\nPr(x\u21e4i 6= 0) =\nPr(x\u21e4i , x\u21e4j 6= 0) =\n\ns\nr\ns(s 1)\nr(r 1)\n\n8 i 2{ 1, . . . , r},\n\n8 i 6= j 2{ 1, . . . , r}.\n\n(C5) Mean and variance of variables in the support: We assume that the non-zero random\n\nvariables in the support of x\u21e4 are centered and are normalized\n\nE(x\u21e4i|x\u21e4i 6= 0) = 0,\n\nE(x\u21e42\ni\n\n|x\u21e4i 6= 0) = 1.\n\nWe note that these assumptions (A1), (A2) and (C1) - (C5) are similar to those made in [4; 1].\nAgarwal et al. [1] require the matrices to satisfy the restricted isometry property, which is strictly\nweaker than \u00b5/pd-incoherence, however they can tolerate a much lower level of sparsity (d1/6).\n\n3.2 Main Result\n\nTheorem 3. Suppose that\nthe s-sparse sam-\nples x\u21e4 satisfy the assumptions stated in Section 3.1 and we are given an estimate A(0)\nsuch that kA(0) A\u21e4k1 \uf8ff R(0) \uf8ff CR/s.\nsamples\ns(R(t1))2 log(dr/)) then Algorithm 1 with parameters\nin every iteration with n(t) =\u2326(\nt=1,\u2318 ) chosen as speci\ufb01ed in Section 3.1 after T iterations\n({\u2327 (t)}T\nreturns a dictionary A(T ) such that,\n\ntrue dictionary A\u21e4 and the distribution of\nIf we are given {n(t)}T\n\nt=1,{(t)}T\n\nt=1,{(t)}T\n\nt=1,{\u232b(t)}T\n\nt=1 i.i.d.\n\nr\n\nkA(T ) A\u21e4k1 \uf8ff\u2713 7\n8\u25c6T\n\nR(0),\n\nwith probability 1 .\n\n6\n\n\f\u02c6gij =\n\n=\n\n1\nn\n\n1\nn\n\nk xm\n\ni xm\n\nnXm=1\" rXk=1aikxm\nj #\nj ym\nnXm=1\" rXk=1\u21e3aikxm\nj #\nk a\u21e4ikxm\u21e4k \u2318xm\n= E\" rXk=1\u21e3aikxk a\u21e4ikx\u21e4k\u2318xj#\nnXm=1\" rXk=1\u21e3aikxm\n+\" 1\n| {z }\n\n= gij + \u02c6gij gij\n\nk a\u21e4ikxm\u21e4k \u2318xm\n\n,\u270fn\n\nn\n\n,\n\nj # E\" rXk=1\u21e3aikxk a\u21e4ikx\u21e4k\u2318xj##\n\n4 Proof of Convergence\n\nIn this section we will prove the main convergence results stated as Theorem 3. To prove this result\nwe will analyze the gradient update to the dictionary at each step. We will decompose this gradient\nupdate (which is a random variable) into a \ufb01rst term which is its expected value and a second term\nwhich is its deviation from expectation. We will prove a deterministic convergence result by working\nwith the expected value of the gradient and then appeal to standard concentration arguments to control\nthe deviation of the gradient from its expected value with high probability.\nBy Lemma 8, Algorithm 1 is guaranteed to estimate the sign pattern correctly at every round of the\nalgorithm, sgn(x) = sgn(x\u21e4) (see proof in Appendix A.1).\nTo un-clutter notation let, A\u21e4ij = a\u21e4ij, A(t)\nij = aij, A(t+1)\nij. The kth coordinate of the mth\ncovariate is written as xm\u21e4k . Similarly let xm\nk be the kth coordinate of the estimate of the mth\ncovariate at step t. Finally let R(t) = R, n(t) = n and \u02c6gij be the (i, j)th element of the gradient with\nn (n(t)) samples at step t. Unwrapping the expression for \u02c6gij we get,\n\n= a0\n\nij\n\nwhere gij denotes (i, j)th element of the expected value (in\ufb01nite samples) of the gradient. The second\nterm \u270fn is the deviation of the gradient from its expected value. By Theorem 10 we can control the\ndeviation of the sample gradient from its mean via an application of McDiarmid\u2019s inequality. With\nthis notation in place we are now ready to prove Theorem 3.\nProof [Proof of Theorem 3] First we analyze the structure of the expected value of the gradient.\nStep 1: Unwrapping the expected value of the gradient we \ufb01nd it decomposes into three terms\n\ns\n\naikxkxj a\u21e4ikx\u21e4kxj35\nr E\u21e5(xj x\u21e4j )xj|x\u21e4j 6= 0\u21e4\n}\n\nj a\u21e4ijx\u21e4j xj + E24Xk6=j\ngij = Eaijx2\nj|x\u21e4j 6= 0\u21e4\nr E\u21e5x2\n= (aij a\u21e4ij)\n|\n{z\n}\n|\nThe \ufb01rst term gc\nij points in the correct direction and will be useful in converging to the right answer.\nThe other terms could be in a bad direction and we will control their magnitude with Lemma 5 such\n3r R. The proof of Lemma 5 is the main technical challenge in the convergence\nthat |\u23051| + |\u23052|\uf8ff s\nanalysis to control the error in the gradient. Its proof is deferred to the appendix.\nStep 2: Given this bound, we analyze the gradient update,\n\naikxkxj a\u21e4ikx\u21e4kxj35\n{z\n}\n\n+ E24Xk6=j\n|\n\n{z\n\n+ a\u21e4ij\n\n,\u23051\n\n,\u23052\n\n,gc\n\ns\n\nij\n\n.\n\na0\nij = aij \u2318\u02c6gij\n\n= aij \u2318(gij + \u270fn)\n\n= aij \u2318\u21e5gc\n\nij + (\u2305 1 +\u2305 2) + \u270fn\u21e4 .\n\n7\n\n\fSo if we look at the distance to the optimum a\u21e4ij we have the relation,\n\n(i)\n\n(ii)\n\ns\n\ns\n\ns\n\n|a0\nij a\u21e4ij|\n\ns\n\nr E\u21e5x2\n\nTaking absolute values, we get\n\na0\nij a\u21e4ij = aij a\u21e4ij \u2318(aij a\u21e4ij)\n\nr E\u21e5x2\nr E\u21e5x2\nr\u21e2E\u21e5x2\n\n\uf8ff\u21e31 \u2318\n\uf8ff \u21e31 \u2318\n\uf8ff\u27131 \u2318\n\nj|x\u21e4j 6= 0\u21e4 \u2318 {(\u23051 +\u2305 2) + \u270fn} .\nj|x\u21e4j 6= 0\u21e4\u2318|aij a\u21e4ij| + \u2318 {|\u23051| + |\u23052| + |\u270fn|}\nj|x\u21e4j 6= 0\u21e4\u2318|aij a\u21e4ij| + \u2318\u21e3 s\nR\u2318 + \u2318|\u270fn|\n3\u25c6 R + \u2318|\u270fn|,\nj|x\u21e4j 6= 0\u21e4 \nj|x\u21e4j 6= 0\u21e4. We would expect that as R\nj |x\u21e4j 6= 0\u21e4 = 1. By invoking Lemma 6 we can bound\nj|x\u21e4j 6= 0\u21e4 \nCoupled with Lemma 6, Inequality (i) follows from \u2318 \uf8ff r\n4s . We also\nhave by Theorem 10 that \u2318|\u270fn|\uf8ff R/8 with probability 1 . So if we unroll the bound for t steps\nwe have,\n\nLemma 5. Next we give an upper and lower bound on E\u21e5x2\ngets smaller this variance term approaches E\u21e5x\u21e42\nr\u21e2E\u21e5x2\n\nprovided the \ufb01rst term is at non-negative. Here, (i) follows by triangle inequality and (ii) is by\n\nj|x\u21e4j 6= 0\u21e4 \uf8ff 4\n\uf8ff\u27131 \u2318\n\n3. We want the \ufb01rst term to contract at a rate 3/4; it suf\ufb01ces to\n\n3 \uf8ff E\u21e5x2\n\nthis term to be 2\nhave\n\n(i)\n\n0\n\ns\n\n3r\n\n1\n\n.\n\n1\n\n3\n4\n\n\uf8ff\n\n3\u25c6 (ii)\ns while (ii) follows from \u2318 3r\nR(t1) \uf8ff\u2713 7\n8\u25c6t\n\nR(t1) =\n\n7\n8\n\nR(0).\n\n|a(t)\nij a\u21e4ij|\uf8ff\n\n3\n4\n\nR(t1) + \u2318|\u270fn|\uf8ff\n\n3\n4\n\nR(t1) +\n\n1\n8\n\nWe also have \u2318|\u270fn|\uf8ff R/8 \uf8ff R(0)/8 with probability at least 1 for all t 2{ 1, . . . , T}; thus we\nare guaranteed to remain in our initial ball of radius R(0) with high probability, completing the proof.\n\n5 Conclusion\n\nAn interesting question would be to further explore and analyze the range of algorithms for which\nalternating minimization works and identifying the conditions under which they provably converge.\nThere also seem to be many open questions for improper dictionary learning and developing provably\nfaster algorithms there. Going beyond sparsity pd still remains challenging, and as noted in previous\nwork alternating minimization also appears to break down experimentally and new algorithms are\nrequired in this regime. Also all theoretical work on analyzing alternating minimization for dictionary\nlearning seems to rely on identifying the signs of the samples (x\u21e4) correctly at every step. It would\nbe an interesting theoretical question to analyze if this is a necessary condition or if an alternate\nproof strategy and consequently a bigger radius of convergence are possible. Lastly, it is not known\nwhat the optimal sample complexity for this problem is and lower bounds there could be useful in\ndesigning more sample ef\ufb01cient algorithms.\n\nAcknowledgments\n\nWe gratefully acknowledge the support of the NSF through grant IIS-1619362, and of the Australian\nResearch Council through an Australian Laureate Fellowship (FL110100281) and through the ARC\nCentre of Excellence for Mathematical and Statistical Frontiers. Thanks also to the Simons Institute\nfor the Theory of Computing Spring 2017 Program on Foundations of Machine Learning. The authors\nwould like to thank Sahand Negahban for pointing out an error in the \u00b5-incoherence assumption in an\nearlier version.\n\n8\n\n\fReferences\n[1] Agarwal, A., A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon (2014). Learning sparsely\n\nused overcomplete dictionaries. In COLT, pp. 123\u2013137.\n\n[2] Agarwal, A., A. Anandkumar, and P. Netrapalli (2013). A clustering approach to learn sparsely-\n\nused overcomplete dictionaries. arXiv preprint arXiv:1309.1952.\n\n[3] Aharon, M., M. Elad, and A. Bruckstein (2006). K-SVD: An algorithm for designing over-\ncomplete dictionaries for sparse representation. IEEE Transactions on signal processing 54(11),\n4311\u20134322.\n\n[4] Arora, S., R. Ge, T. Ma, and A. Moitra (2015). Simple, ef\ufb01cient, and neural algorithms for sparse\n\ncoding. In COLT, pp. 113\u2013149.\n\n[5] Arora, S., R. Ge, and A. Moitra (2013). New algorithms for learning incoherent and overcomplete\n\ndictionaries.\n\n[6] Balakrishnan, S., M. J. Wainwright, B. Yu, et al. (2017). Statistical guarantees for the EM\n\nalgorithm: From population to sample-based analysis. The Annals of Statistics 45(1), 77\u2013120.\n\n[7] Barak, B., J. A. Kelner, and D. Steurer (2015). Dictionary learning and tensor decomposition via\nthe sum-of-squares method. In Proceedings of the Forty-Seventh Annual ACM on Symposium on\nTheory of Computing, pp. 143\u2013151. ACM.\n\n[8] Belloni, A., M. Rosenbaum, and A. B. Tsybakov (2014). An {`1,` 2,` 1}-Regularization\n\nApproach to High-Dimensional Errors-in-variables Models. arXiv preprint arXiv:1412.7216.\n\n[9] Belloni, A., M. Rosenbaum, and A. B. Tsybakov (2016). Linear and conic programming\nestimators in high dimensional errors-in-variables models. Journal of the Royal Statistical Society:\nSeries B (Statistical Methodology).\n\n[10] Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration inequalities: A nonasymptotic\n\ntheory of independence. Oxford university press.\n\n[11] Chen, Y. and C. Caramanis (2013). Noisy and Missing Data Regression: Distribution-Oblivious\n\nSupport Recovery.\n\n[12] Donoho, D. L. and X. Huo (2001). Uncertainty principles and ideal atomic decomposition.\n\nIEEE Transactions on Information Theory 47(7), 2845\u20132862.\n\n[13] Elad, M. and M. Aharon (2006). Image denoising via sparse and redundant representations over\n\nlearned dictionaries. IEEE Transactions on Image processing 15(12), 3736\u20133745.\n\n[14] Engan, K., S. O. Aase, and J. H. Husoy (1999). Method of optimal directions for frame\ndesign. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International\nConference on, Volume 5, pp. 2443\u20132446. IEEE.\n\n[15] Fuchs, J.-J. (2004). Recovery of exact sparse representations in the presence of noise. In\nAcoustics, Speech, and Signal Processing, 2004. Proceedings.(ICASSP\u201904). IEEE International\nConference on, Volume 2, pp. ii\u2013533. IEEE.\n\n[16] Gautier, E. and A. Tsybakov (2011). High-dimensional instrumental variables regression and\n\ncon\ufb01dence sets. arXiv preprint arXiv:1105.2454.\n\n[17] Gribonval, R., R. Jenatton, and F. Bach (2015). Sparse and spurious: dictionary learning with\n\nnoise and outliers. IEEE Transactions on Information Theory 61(11), 6298\u20136319.\n\n[18] Gribonval, R. and M. Nielsen (2003). Sparse representations in unions of bases.\n\ntransactions on Information theory 49(12), 3320\u20133325.\n\nIEEE\n\n[19] Hardt, M. (2014). Understanding alternating minimization for matrix completion. In Foun-\ndations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 651\u2013660.\nIEEE.\n\n9\n\n\f[20] Hazan, E. and T. Ma (2016). A Non-generative Framework and Convex Relaxations for\nUnsupervised Learning. In Advances in Neural Information Processing Systems, pp. 3306\u20133314.\n[21] Jain, P., P. Netrapalli, and S. Sanghavi (2013). Low-rank matrix completion using alternating\nminimization. In Proceedings of the forty-\ufb01fth annual ACM symposium on Theory of computing,\npp. 665\u2013674. ACM.\n\n[22] Lewicki, M. S. and T. J. Sejnowski (2000). Learning overcomplete representations. Neural\n\ncomputation 12(2), 337\u2013365.\n\n[23] Loh, P.-L. and M. J. Wainwright (2011). High-dimensional regression with noisy and missing\ndata: Provable guarantees with non-convexity. In Advances in Neural Information Processing\nSystems, pp. 2726\u20132734.\n\n[24] McDiarmid, C. (1989). On the method of bounded differences. Surveys in combinatorics 141(1),\n\n148\u2013188.\n\n[25] Meinshausen, N. and P. B\u00fchlmann (2006). High-dimensional graphs and variable selection with\n\nthe Lasso. The annals of statistics, 1436\u20131462.\n\n[26] Netrapalli, P., P. Jain, and S. Sanghavi (2013). Phase retrieval using alternating minimization.\n\nIn Advances in Neural Information Processing Systems, pp. 2796\u20132804.\n\n[27] Netrapalli, P., U. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain (2014). Non-convex robust\n\nPCA. In Advances in Neural Information Processing Systems, pp. 1107\u20131115.\n\n[28] Olshausen, B. A. and D. J. Field (1997). Sparse coding with an overcomplete basis set: A\n\nstrategy employed by V1? Vision research 37(23), 3311\u20133325.\n\n[29] Rigollet, P. and A. Tsybakov (2011). Exponential screening and optimal rates of sparse\n\nestimation. The Annals of Statistics, 731\u2013771.\n\n[30] Rosenbaum, M., A. B. Tsybakov, et al. (2010). Sparse recovery under matrix uncertainty. The\n\nAnnals of Statistics 38(5), 2620\u20132651.\n\n[31] Rosenbaum, M., A. B. Tsybakov, et al. (2013). Improved matrix uncertainty selector. In From\nProbability to Statistics and Back: High-Dimensional Models and Processes\u2013A Festschrift in\nHonor of Jon A. Wellner, pp. 276\u2013290. Institute of Mathematical Statistics.\n\n[32] Spielman, D. A., H. Wang, and J. Wright (2012). Exact Recovery of Sparsely-Used Dictionaries.\n\nIn COLT, pp. 37\u20131.\n\n[33] Sun, J., Q. Qu, and J. Wright (2017). Complete dictionary recovery over the sphere I: Overview\n\nand the geometric picture. IEEE Transactions on Information Theory 63(2), 853\u2013884.\n\n[34] Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in\n\nnoise. IEEE transactions on information theory 52(3), 1030\u20131051.\n\n[35] Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Revised and extended from\n\nthe 2004 French original. Translated by Vladimir Zaiats.\n\n[36] Wu, S. and B. Yu (2015). Local identi\ufb01ability of `1-minimization dictionary learning: a\n\nsuf\ufb01cient and almost necessary condition. arXiv preprint arXiv:1505.04363.\n\n[37] Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423\u2013435.\n\nSpringer.\n\n[38] Zhao, P. and B. Yu (2006). On model selection consistency of Lasso. Journal of Machine\n\nlearning research 7(Nov), 2541\u20132563.\n\n10\n\n\f", "award": [], "sourceid": 1226, "authors": [{"given_name": "Niladri", "family_name": "Chatterji", "institution": "UC Berkeley"}, {"given_name": "Peter", "family_name": "Bartlett", "institution": "UC Berkeley"}]}