Algorithms and lower bounds in the streaming and sparse recovery models
[Abstract] In the data stream computation model, input data is given to us sequentially (the data stream), and our goal is to compute or approximate some function or statistic of that data using an amount of space sublinear in both the length of the stream and the size of the universe of items that can appear in it; in particular, we can store neither the entire stream nor a counter for each possible item we might see. In the sparse recovery model (also known as compressed sensing), the input is a large but sparse vector x ∈ ℝ^n, and our goal is to design an m × n matrix Φ, with m ≪ n, such that for any sufficiently sparse x we can efficiently recover a good approximation of x from Φx.

Although at first glance these two models may seem quite different, they are in fact intimately related. In the streaming model, most statistics of interest are order-invariant: they depend only on the frequency of each item in the stream, not on the positions at which items appear. For such problems, the stream can be viewed as defining an n-dimensional vector x, where x_i is the number of occurrences of item i. Under this representation, one of the most popular high-level tools has been the linear sketch: for some m × n matrix Φ, we maintain Φx (the sketch) for the frequency vector x accumulated so far as we progress along the stream. The linearity of the mapping Φ allows us to update the sketch efficiently with each incoming item, and, as in its use in sparse recovery, the linear sketch turns out to be surprisingly powerful. In this thesis, we answer some questions of interest in each model, illustrating both the power and the limitations of the linear sketch.

In Chapter 2, we provide an efficient sketch for estimating the (planar) Earth Mover's Distance (EMD) between two multisets of points. The EMD between point sets A, B ⊂ ℝ² of the same size is defined as the minimum cost of a perfect matching between them, where each edge contributes a cost equal to its (Euclidean) length. As immediate consequences, we give an improved algorithm for estimating EMD between point sets given over a stream, and an improved algorithm for the approximate nearest neighbor problem under EMD.

In Chapter 3, we prove tight lower bounds on the number of rows of Φ (i.e., the number of measurements) needed to achieve any of the three most studied sparse recovery guarantees. Specifically, consider a matrix Φ and an algorithm R such that for any signal x, R can recover from Φx an approximation x̂ satisfying ‖x̂ − x‖_p ≤ C · min_{k-sparse x'} ‖x − x'‖_q, where (1) p = q = 1 and C = O(1), (2) p = q = 2 and C = O(1), or (3) p = 2, q = 1, and C = O(k^{-1/2}). We show that any such Φ must have at least Ω(k log(n/k)) rows. This bound is known to be optimal in cases (1) and (2), and near-optimal in case (3).

In Chapter 4, we propose a variant of sparse recovery that incorporates additional knowledge about the signal, allowing the above lower bound to be broken. In particular, we consider the scenario where, after the measurements are taken, we are given a set S of size s < n (s is known beforehand) that is supposed to contain most of the "large" coefficients of x. The goal is then to recover x̂ satisfying ‖x̂ − x‖_p ≤ C · min_{k-sparse x', supp(x') ⊆ S} ‖x − x'‖_q. We refer to this formulation as the sparse recovery with partial support knowledge problem (SRPSK). We focus on the guarantees where p = q = 1 or 2 and C = 1 + ε, for which we provide lower bounds as well as a method of converting algorithms for "standard" sparse recovery into ones for SRPSK. We also use one of these reductions to give an optimal algorithm for SRPSK under the p = q = 2 guarantee.
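To make the linear-sketch idea above concrete, here is a minimal Python sketch of maintaining Φx incrementally over a stream. The dense Gaussian Φ is an assumption chosen for readability only; practical sketches use structured or hash-based matrices so that Φ itself need not be stored explicitly.

```python
import numpy as np

# Minimal illustration of a linear sketch: maintain y = Phi @ x as stream
# updates arrive, without ever materializing the full frequency vector x.
# The dense Gaussian Phi is illustrative, not what a real sketch would store.

rng = np.random.default_rng(0)
n, m = 1000, 50                      # universe size, sketch size (m << n)
Phi = rng.standard_normal((m, n))    # measurement matrix
y = np.zeros(m)                      # the sketch y = Phi @ x, initially x = 0

def update(i, delta=1):
    """Process a stream update: item i occurs delta more times (x_i += delta).
    By linearity of Phi, the sketch changes by delta * Phi[:, i]."""
    global y
    y += delta * Phi[:, i]

# Feed a small stream; the sketch stays equal to Phi @ x throughout.
stream = [3, 7, 3, 42, 7, 7]
x = np.zeros(n)
for item in stream:
    update(item)
    x[item] += 1
assert np.allclose(y, Phi @ x)
```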
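For reference, the EMD of Chapter 2 can be computed exactly for small equal-size point sets as a min-cost perfect matching. The sketch below uses SciPy's assignment solver; it is a baseline for the quantity being estimated, not the streaming sketch itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Exact planar EMD between equal-size point sets A, B in R^2: the minimum
# cost of a perfect matching, with each edge costing its Euclidean length.

def emd(A, B):
    cost = cdist(A, B)                        # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # optimal perfect matching
    return cost[rows, cols].sum()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [1.0, 1.0]])
print(emd(A, B))  # 2.0: each point matches the one directly above it
```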
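Finally, to make the SRPSK guarantee concrete for p = q = 2: the benchmark error on its right-hand side is attained by keeping the k largest-magnitude coordinates of x that lie inside S. The toy computation below evaluates that benchmark directly from x, whereas the algorithms in Chapter 4 must approach it to within (1 + ε) using only Φx; the function name and values are illustrative assumptions.

```python
import numpy as np

# Benchmark for SRPSK with p = q = 2: the best k-sparse approximation of x
# whose support lies inside the given set S keeps the k largest-magnitude
# coordinates of x within S. Computed here directly from x for illustration.

def best_k_in_S(x, S, k):
    S = np.fromiter(S, dtype=int)
    keep = S[np.argsort(-np.abs(x[S]))[:k]]  # k largest |x_i| with i in S
    xhat = np.zeros_like(x)
    xhat[keep] = x[keep]
    return xhat

x = np.array([0.0, 5.0, -3.0, 0.1, 4.0, 0.0])
S = {1, 2, 3}                    # candidate support of size s, given after measurement
xhat = best_k_in_S(x, S, k=2)
print(xhat)                      # keeps x_1 = 5 and x_2 = -3
print(np.linalg.norm(x - xhat))  # the optimal error the guarantee compares against
```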
[Institution] Massachusetts Institute of Technology