1. Abstract
Variance-reduction (VR) based SGD algorithms outperform plain SGD both theoretically and empirically, but their asynchronous variants have not been studied. This paper provides a unifying framework for many VR algorithms, proposes an asynchronous algorithm within this framework, and proves a fast convergence rate. For common sparse machine learning problems, the algorithm achieves a linear speedup.
2. Introduction
Under the strong-convexity assumption, VR stochastic methods have a faster expected convergence rate than SGD. VR methods exploit problem structure and make a space-time trade-off to reduce the variance introduced by stochastic gradients. The goal is to extend synchronous VR algorithms to asynchronous parallel and distributed settings.
3. Related work:
3.1. Primal VR methods
SAG (Minimizing Finite Sums with the Stochastic Average Gradient)
SAGA (SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives)
SVRG(Accelerating stochastic gradient descent using predictive variance reduction)
S2GD(Semi-Stochastic Gradient Descent Methods)
3.2. Dual VR methods
SDCA(Stochastic dual coordinate ascent methods for regularized loss)
Finito (Finito: A faster, permutable incremental gradient method for big data problems)
3.3. Relationship between dual methods and VR stochastic methods
- New Optimization Methods for Machine Learning
3.4. Algorithmic structure of VR methods
The algorithmic structure of VR methods can be traced back to classical non-stochastic incremental gradient algorithms (Incremental gradient, subgradient, and proximal methods for convex optimization: A survey), but it is now widely accepted that randomization helps achieve faster convergence.
3.5. Proximal methods
- A proximal stochastic gradient method with progressive variance reduction
3.6. Accelerated VR methods
Stochastic Proximal Gradient Descent with Acceleration Techniques
Accelerated mini-batch stochastic dual coordinate ascent
3.7. Lower bounds for finite-sum problems
- A lower bound for the optimization of finite sums
3.8. Asynchronous SGD algorithms
Parallel variants
- Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Distributed variants
Distributed delayed stochastic optimization
Distributed asynchronous incremental subgradient methods
3.9. Parallel and distributed variants of coordinate descent methods
Asynchronous stochastic coordinate descent: Parallelism and convergence properties
An asynchronous parallel stochastic coordinate descent algorithm
Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function
3.10. Mini-batch methods
- Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting: extends S2GD to mini-batches and therefore allows parallel execution, but it requires more synchronization and only permits small batch sizes.
4. A general framework for VR stochastic algorithms
4.1. Assumptions
- L-Lipschitz gradient condition
- λ-strong convexity (the analysis can also be extended to smooth convex functions)
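For reference, the standard finite-sum setting behind these assumptions (the notation below is the usual one and is my own rendering, not copied from the paper):

```latex
% Finite-sum objective with the two assumptions listed above (notation is illustrative)
\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\qquad
\underbrace{\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|}_{L\text{-Lipschitz gradients}},
\qquad
\underbrace{f(x) \ge f(y) + \langle \nabla f(y),\, x - y \rangle + \tfrac{\lambda}{2}\|x - y\|^2}_{\lambda\text{-strong convexity}}
```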
4.2. Existing VR algorithms
x is the parameter vector, the α_i are auxiliary variables (one per component function), and A is the set of all α_i; the methods differ in how they refresh A (see the sketch after this list):
- SVRG updates the entire A once every m iterations
- SAGA updates a single α in A at every iteration
- SAG also updates a single α at every iteration
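A minimal Python sketch of the shared update structure (illustrative pseudocode under my own naming; vr_step, grad_f, etc. are not from the paper). The refresh schedule for A is what separates the methods:

```python
import numpy as np

def vr_step(x, alpha, grads_at_alpha, grad_f, i, eta):
    """One generic variance-reduced step shared by SVRG/SAGA-style methods.

    x              -- current iterate (numpy array)
    alpha          -- list of stored points alpha_i, one per component function
    grads_at_alpha -- list of precomputed gradients grad_f[j](alpha[j])
    grad_f         -- list of per-component gradient functions
    i              -- sampled component index
    eta            -- step size
    """
    avg = np.mean(grads_at_alpha, axis=0)          # average gradient over the stored alphas
    v = grad_f[i](x) - grad_f[i](alpha[i]) + avg   # variance-reduced gradient estimate
    return x - eta * v

# What distinguishes the methods is the schedule for refreshing A = {alpha_i}:
#   SVRG : every m iterations, set alpha[j] = x for all j and recompute grads_at_alpha
#   SAGA : after each step, refresh alpha[i] and grads_at_alpha[i] for the sampled i only
#   SAG  : also refreshes one alpha[i] per step, but uses a (biased) averaged estimate
```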
4.3. Space-time cost analysis
- When m is large, SVRG has low computational overhead (the full gradient at the snapshot is recomputed rarely), but it converges more slowly
- SAG and SAGA update A more frequently and therefore converge faster, but storing one gradient per data point makes their memory cost high
4.4. General algorithm
HSAG: a scheduling strategy that combines the update rules of different VR algorithms (a sketch follows below)
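A rough sketch of the HSAG idea as I read it (reusing vr_step from the sketch in 4.2; the index set S and the epoch structure below are my simplification): indices in a user-chosen set S get SAGA-style per-iteration refreshes, while the remaining indices get SVRG-style refreshes at epoch boundaries.

```python
def hsag_epoch(x, alpha, grads_at_alpha, grad_f, S, m, eta, rng):
    """One HSAG epoch (simplified): hybrid schedule between SAGA and SVRG."""
    n = len(grad_f)
    for _ in range(m):
        i = int(rng.integers(n))
        x = vr_step(x, alpha, grads_at_alpha, grad_f, i, eta)
        if i in S:                          # SAGA-style refresh for indices in S
            alpha[i] = x.copy()
            grads_at_alpha[i] = grad_f[i](x)
    for j in range(n):                      # SVRG-style refresh for the remaining indices
        if j not in S:
            alpha[j] = x.copy()
            grads_at_alpha[j] = grad_f[j](x)
    return x

# Choosing S = {all indices} recovers SAGA-style updates; S = {} recovers SVRG.
```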
5. Asynchronous VR algorithms
Similar to Hogwild!: a lock-free, single-machine multi-core setting, targeting sparse machine learning problems (a toy sketch follows).
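A toy lock-free sketch in the Hogwild! spirit (illustrative only: real implementations exploit sparsity by touching only the nonzero coordinates of each gradient and would not rely on Python threads; the function and argument names are mine):

```python
import threading
import numpy as np

def async_svrg_epoch(x, snapshot, full_grad, grad_f, eta, steps_per_thread, n_threads):
    """Workers read the shared iterate x without locks, form the SVRG-corrected
    gradient, and write the update back in place without locks (races are tolerated)."""
    n = len(grad_f)

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_thread):
            i = int(rng.integers(n))
            x_read = x.copy()                                   # inconsistent read, no lock
            v = grad_f[i](x_read) - grad_f[i](snapshot) + full_grad
            x[:] -= eta * v                                     # lock-free in-place write

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```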
6. Experiments
- l2-regularized logistic regression (l2-LR; objective recalled below)
- The lock-free SVRG variant is more robust to changes in the step size (learning rate) and in the number of threads
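For reference, the l2-LR objective is the usual regularized logistic loss (the symbols a_i, b_i, λ are my notation, not taken from the notes):

```latex
% l2-regularized logistic regression over examples (a_i, b_i) with labels b_i in {-1, +1}
\min_{x} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-b_i\, a_i^{\top} x)\bigr)
          + \frac{\lambda}{2} \|x\|^{2}
```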
7. Summary
- a common framework that unifies existing VR methods
- an asynchronous algorithm with a provably fast convergence rate