1. Abstract
- The gradients can be quantized aggressively, to just one bit per value, at no or nearly no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback): the error from quantizing one minibatch's gradient is added to the gradient of the next minibatch.
- This size reduction makes it feasible to parallelize SGD through data parallelism with fast processors such as recent GPUs.
- The finding is combined with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization interacts beneficially with AdaGrad, giving a small accuracy gain.
2. Intro
- In data parallelism, each minibatch is split across nodes, and each node computes a sub-gradient on its sub-minibatch. These sub-gradients, of the same dimension as the full model, must be summed over all nodes and redistributed.
- Applied directly to typical training configurations, this is infeasible because of the high bandwidth required to exchange sub-minibatch gradients across nodes.
- Two avenues for improving data-parallel efficiency: increase the minibatch size, and reduce the amount of data exchanged.
- The paper proposes reducing bandwidth by aggressively quantizing the sub-gradients, to just one bit per value. This does not, or almost does not, reduce word accuracy, but only if the quantization error is carried forward across minibatches, i.e. the error from quantizing the gradient of one minibatch is added (fed back) to the gradient of the next minibatch.
- Data parallelism does not change the convergence behavior (clock vs. objective). It also differs from Hogwild/ASGD in that this paper focuses on deterministic convergence behavior.
- Within this category, an alternative to data parallelism is model parallelism, where the model is distributed over nodes. One can also parallelize over layers [19]: each GPU processes one or more consecutive layers, with data flowing up and down through the layers between GPUs.
- That work showed, however, that delayed updates can work, which motivated the double-buffering technique applied in this paper.
3. Data-parallel Deterministically Distributed SGD
- CD-DNN-HMM model
3.1. Data-parallel Distributed SGD
- The optimal number of nodes is reached when computation and communication fully overlap, i.e. when computation time equals communication time.
- The per-minibatch time splits into four parts: processing each sample (three matrix products per layer); post-processing of the computed gradient (momentum + AdaGrad), which is component-wise (e.g. additions); exchanging the float-valued sub-gradients; and applying the gradient update to the model parameters, also component-wise and a fixed cost (see the cost-model sketch below).
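A minimal sketch of this cost model; the parameter names (t_frame, t_post, grad_bytes, bandwidth, t_fixed) are illustrative assumptions, not the paper's notation:

```python
# Hypothetical per-minibatch cost model for N samples split over K nodes.
def time_per_minibatch(N, K, t_frame, t_post, grad_bytes, bandwidth, t_fixed):
    t_compute = (N / K) * t_frame         # forward/backward over N/K samples per node
    t_exchange = grad_bytes / bandwidth   # exchanging sub-gradients; independent of N
    # post-processing (momentum, AdaGrad) and the model update are component-wise
    # over all parameters, so they do not shrink with N or K
    return t_compute + t_post + t_exchange + t_fixed
```

Under the full-overlap condition above, the best node count K is roughly the one where t_compute matches t_exchange.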
3.2. Double Buffering with Half Batches
- To overlap computation with communication, each minibatch is broken in half: the sub-gradients of one half-minibatch are exchanged while the sub-gradients of the next half-minibatch are computed.
- The sub-gradients are therefore computed with a model that is outdated by N/2 samples (delayed update [19, 8]); this fixed delay does not fundamentally affect convergence (see the pipelining sketch below).
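A minimal pipelining sketch of this idea, assuming hypothetical callables compute_subgradient, exchange, and apply_update; it only illustrates how the exchange of one half-minibatch overlaps with the computation of the next:

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_double_buffering(half_batches, model, compute_subgradient,
                                exchange, apply_update):
    pool = ThreadPoolExecutor(max_workers=1)
    in_flight = None                                   # previous half's exchange, still running
    for half in half_batches:
        # computed while the previous half's update is still in flight,
        # i.e. with a model outdated by N/2 samples
        grad = compute_subgradient(model, half)
        if in_flight is not None:
            apply_update(model, in_flight.result())    # previous half's update lands now
        in_flight = pool.submit(exchange, grad)        # overlaps with the next computation
    if in_flight is not None:
        apply_update(model, in_flight.result())
```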
3.3. Potential Faster-Than-Fixed-Cost Communication
- When the communication cost drops below the fixed time cost, the network is no longer saturated and speed is limited by the fixed cost; this is what happens with 1-bit SGD.
- In this case, double buffering with half-minibatches no longer makes sense: it masks communication cost at the expense of an additional fixed cost, which is now the higher of the two.
3.4. Relation to Hogwild/ASGD
- Hogwild differs in that it uses an unsynchronized gradient exchange (via a parameter server). It is another form of delayed update, in which the delay varies non-deterministically across model parameters.
4. 1-bit SGD with Error Feedback
- Quantization error is unavoidable and, if ignored, can lead to divergence.
- Following Sigma-Delta modulation, when a parameter's gradient is quantized, the quantization error is stored and added to the gradient of the next minibatch before the next quantization.
- As long as error feedback is used, the gradients can be quantized all the way down to 1 bit at no or nearly no loss of accuracy.
- For the 1-bit implementation, a constant quantization threshold of 0 is a good (and cheap) choice, while the reconstruction values used by the unquantizer Q^{-1}(·) are tied within each weight-matrix column (j, l). The two values per column are recomputed to minimize the squared quantization error and are transmitted in each data exchange (see the quantizer sketch below).
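A minimal sketch of this quantizer (illustrative names); the two per-column reconstruction values are the means of the positive and negative groups, which is what minimizes the squared quantization error:

```python
import numpy as np

def one_bit_quantize(grad, error):
    """1-bit quantization with error feedback for one weight-matrix gradient.

    grad  : gradient of one weight matrix, shape (rows, cols)
    error : quantization residual carried over from the previous minibatch
    Returns the bit matrix, the two reconstruction values per column, and the
    new residual to feed back into the next minibatch.
    """
    g = grad + error                     # error feedback
    bits = g >= 0                        # constant threshold 0 -> one bit per value
    recon = np.empty_like(g)
    pos_val = np.zeros(g.shape[1])       # reconstruction value for bit == 1, per column
    neg_val = np.zeros(g.shape[1])       # reconstruction value for bit == 0, per column
    for j in range(g.shape[1]):
        pos, neg = g[bits[:, j], j], g[~bits[:, j], j]
        pos_val[j] = pos.mean() if pos.size else 0.0
        neg_val[j] = neg.mean() if neg.size else 0.0
        recon[:, j] = np.where(bits[:, j], pos_val[j], neg_val[j])
    new_error = g - recon                # residual carried into the next minibatch
    return bits, (pos_val, neg_val), new_error
```

Only the bit matrix and the two floats per column need to be exchanged; error starts as zeros for the first minibatch.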
4.1. Aggregating the Gradients
- Each compute node is responsible for aggregating a 1/K-th subset of the model parameters, which it receives in quantized form from all peer nodes.
- These quantized values are dequantized and summed, post-processed (AdaGrad, momentum), quantized again, and redistributed to the compute nodes; each minibatch gradient is thus quantized twice.
- The first quantization is applied to the sub-gradients, which are then summed, so averaging reduces the quantization error. The second quantization happens after AdaGrad, where the gradient values lie in a more homogeneous numeric range (see the aggregation sketch below).
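A conceptual sketch of this stripe aggregation, with hypothetical helpers dequantize, post_process, and quantize; node k would call it on the stripes it receives from all K peers:

```python
def aggregate_stripe(received_stripes, dequantize, post_process, quantize):
    # the first quantization happened on the senders' sub-gradient stripes;
    # summing the K dequantized stripes averages out part of that error
    total = sum(dequantize(q) for q in received_stripes)
    # momentum + AdaGrad on the aggregate, then re-quantize (second quantization)
    # for redistribution to all nodes
    return quantize(post_process(total))
```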
5. System Description
- The first 45 minutes of data are used to select a suitable minibatch size.
- The learning rate is decayed based on accuracy on a cross-validation set.
- AdaGrad is used to normalize the gradients per dimension according to their variation over time (minimal sketch below).
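For reference, a minimal textbook AdaGrad step with illustrative names; the paper's exact variant may differ:

```python
import numpy as np

def adagrad_update(param, grad, accum, lr=0.01, eps=1e-8):
    accum += grad ** 2                            # per-dimension history of squared gradients
    param -= lr * grad / (np.sqrt(accum) + eps)   # normalize each dimension by its RMS history
    return param, accum
```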
6. Experimental Results
6.1. Cost Measurements
- Costs that do not depend on the batch size, namely gradient post-processing (AdaGrad, momentum) and the fixed cost of the model update, are essentially constant.
6.2. Effect of 1-bit Quantization
- 1-bit quantization works well across all setups, with a minor but consistent impact on training-set frame accuracy.
- Double buffering has only a minor impact on accuracy.
7. Conclusion
- 1-bit quantization significantly reduces the data-exchange bandwidth of data-parallel SGD at no or nearly no loss of accuracy, making data-parallel distribution of SGD feasible even with modern fast hardware (GPUs).
- For this to work, quantization-error feedback is essential.
- Quantization and AdaGrad interact with and influence each other.