1. Abstract
- study stochastic optimization problems when the data is sparse.
- we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data.
- we show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms.
- async AdaGrad, async Dual Averaging
2. Intro
- In this paper, we take a two-pronged approach.
- First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) under a sparsity condition (the assumed form of (1) is sketched after this list).
- As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity.
- We develop two new algorithms, asynchronous dual averaging (ASYNCDA) and asynchronous ADAGRAD (ASYNCADAGRAD), which allow asynchronous parallel solution of the problem (1) for general convex f and X.
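- Since problem (1) is referenced throughout these notes without being restated, here is a sketch of the objective it presumably denotes, assuming the standard stochastic convex optimization setup (the notation for the sample space and distribution is an assumption, not quoted from the paper):

```latex
% Assumed form of problem (1): minimize an expected loss over a closed convex
% set X, where the samples xi ~ P are the (sparse) data points.
\min_{x \in \mathcal{X}} \; f(x)
  := \mathbb{E}_{P}\bigl[F(x;\xi)\bigr]
   = \int_{\Xi} F(x;\xi)\, dP(\xi)
```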
3. Minimax rates for sparse optimization
- gives bounds on the minimax convergence rate of any algorithm for this class of problems.
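- For concreteness, the quantity whose rate is being bounded is presumably the standard minimax excess risk; a sketch of that definition, with notation assumed here rather than quoted from the paper:

```latex
% Minimax excess risk after n samples: the best guarantee any method \hat{x}_n
% can give against the worst distribution P in the sparse class \mathcal{P}.
\mathfrak{M}_n(\mathcal{X}, \mathcal{P})
  := \inf_{\hat{x}_n} \sup_{P \in \mathcal{P}}
     \mathbb{E}_{P}\Bigl[f(\hat{x}_n) - \inf_{x \in \mathcal{X}} f(x)\Bigr]
```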
4. Parallel and asynchronous optimization with sparsity
- we first revisit Niu et al.'s HOGWILD! [12]. HOGWILD! is an asynchronous (parallelized) stochastic gradient algorithm for optimization over product-space domains, meaning that X in problem (1) decomposes as X = X1 × · · · × Xd, where Xj ⊆ R.
- The key of HOGWILD! is that in step 2, the parameter x is allowed to be inconsistent (it may have received partial gradient updates from many processors), and for appropriate problems this inconsistency is negligible (a sketch of the update loop follows after this list).
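- A minimal sketch of a HOGWILD!-style lock-free update loop, assuming an unconstrained product domain (each Xj = R), a sparse per-example gradient returned as a {coordinate: value} dict, and illustrative names (data, grad, eta) that are not from the paper; Python threads stand in for the processors:

```python
import random
import threading

def hogwild_worker(x, data, grad, eta, num_iters):
    # Each worker repeatedly samples an example and applies its sparse
    # gradient to the shared parameter vector x without any locking, so x may
    # be "inconsistent" (partially updated by other workers) when it is read.
    for _ in range(num_iters):
        example = random.choice(data)
        g = grad(x, example)          # sparse gradient: {j: g_j}
        for j, g_j in g.items():      # touch only the nonzero coordinates
            x[j] -= eta * g_j

def run_hogwild(x, data, grad, eta=0.1, num_iters=1000, num_threads=4):
    threads = [threading.Thread(target=hogwild_worker,
                                args=(x, data, grad, eta, num_iters))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```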
4.1. Asynchronous dual averaging
- A drawback of HOGWILD!: it appears to require the domain X to be a product space (a Cartesian product of per-coordinate sets).
- ASYNCDA maintains and updates a centralized dual vector z instead of a parameter x, and a pool of processors perform asynchronous updates to z, where each processor independently iterates the update steps (a sketch follows after this list).
- The only communication point between any of the processors is the addition operation in step 3. Since addition is commutative and associative, forcing all asynchrony to this point of the algorithm is a natural strategy for avoiding synchronization problems.
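- A minimal sketch of the per-processor ASYNCDA loop under the simplest setting: unconstrained X and proximal function psi(x) = ||x||^2 / 2, so the primal point is just x = -alpha * z. The step size alpha and the helper names (data, grad) are illustrative assumptions, not the paper's exact update:

```python
import random

def asyncda_worker(z, data, grad, alpha, num_iters):
    for _ in range(num_iters):
        # Step 1: read the (possibly inconsistent) shared dual vector z and map
        # it to a primal point; with psi = squared l2 norm this is x = -alpha*z.
        x = [-alpha * z_j for z_j in z]
        # Step 2: sample an example and compute a sparse stochastic gradient.
        example = random.choice(data)
        g = grad(x, example)          # sparse gradient: {j: g_j}
        # Step 3: the only communication point -- add the gradient into z.
        # Addition is commutative and associative, so the order in which the
        # processors' updates land does not matter.
        for j, g_j in g.items():
            z[j] += g_j
```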
4.2. Asynchronous AdaGrad
- extends AdaGrad to the asynchronous setting.
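- A minimal sketch of an ASYNCADAGRAD-style per-processor loop, assuming the dual-averaging form of AdaGrad with unconstrained domain and per-coordinate step sizes alpha / (delta + sqrt(S_j)), where S_j accumulates squared gradients; the paper's actual update includes a projection omitted here, and all names are illustrative assumptions:

```python
import math
import random

def asyncadagrad_worker(z, S, data, grad, alpha, delta, num_iters):
    for _ in range(num_iters):
        # Read the (possibly inconsistent) dual vector z and squared-gradient
        # sums S, and map them to a primal point coordinate by coordinate.
        x = [-alpha * z_j / (delta + math.sqrt(S_j)) for z_j, S_j in zip(z, S)]
        example = random.choice(data)
        g = grad(x, example)          # sparse gradient: {j: g_j}
        # Accumulate both the gradient and its square; as in ASYNCDA, the
        # shared state is modified only by additions on touched coordinates.
        for j, g_j in g.items():
            S[j] += g_j * g_j
            z[j] += g_j
```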
5. Experiments
- URL dataset. The dataset in this case consists of an anonymized collection of URLs labeled as malicious (e.g., spam, phishing, etc.) or benign over a span of 120 days.
- We also experiment on a proprietary dataset consisting of search ad impressions. Each example corresponds to showing a search-engine user a particular text ad in response to a query string. From this, we construct a very sparse feature vector based on the text of the ad displayed and the query string (no user-specific data is used). The target label is 1 if the user clicked the ad and -1 otherwise.
- the algorithm used is logistic regression (LR).
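- For reference, the per-example logistic regression loss on a sparse feature vector a with label b in {-1, +1}, in its standard form (assumed here rather than quoted from the paper), is

```latex
% Logistic loss; its gradient  -b a / (1 + e^{b <a, x>})  is supported only on
% the nonzero coordinates of a, which is exactly the sparsity the asynchronous
% algorithms exploit.
F\bigl(x; (a, b)\bigr) = \log\bigl(1 + \exp(-b\,\langle a, x\rangle)\bigr)
```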