Efficient DNN Training
Efficient DNN Training Summary
Model compression has been extensively studied for lightweight inference; popular means include network pruning, weight factorization, network quantization, and neural architecture search, among many others. The literature on efficient training, on the other hand, is much sparser: DNN training still requires fully training the over-parameterized network. Here we focus on reducing total training time and training energy cost, aiming at deployment on resource-constrained platforms, e.g., FPGAs, ASICs, mobile and IoT devices. Below are four recent works pushing toward this efficient training goal.
E^2 Train (NeurIPS 2019): This work aims to make the training of CNNs more energy-efficient, so as to enable on-device training. It strives to reduce the energy cost of training by dropping unnecessary computations at three complementary levels: stochastic mini-batch dropping on the data level; selective layer update on the model level; and sign prediction for low-cost, low-precision back-propagation on the algorithm level.
Fig.1 - An illustration of the E^2 Train framework. On the data level, we adopt stochastic mini-batch dropping to aggressively reduce the training cost by letting the model see fewer mini-batches; on the model level, we dynamically skip a subset of layers during both feed-forward and back-propagation; on the algorithm level, we propose a novel predictive sign gradient descent (PSG) algorithm, which predicts the sign of gradients using low-cost bit-level predictors, thereby completely bypassing the costly full-gradient computation.
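To make the data-level idea concrete, here is a minimal PyTorch-style sketch of stochastic mini-batch dropping only (the selective layer update and PSG components are not shown). The function name, the `drop_prob` value, and the training-loop details are illustrative assumptions, not the authors' released implementation.

```python
import random
import torch

def train_one_epoch_with_smd(model, loader, optimizer, criterion, drop_prob=0.5):
    """Stochastic mini-batch dropping (SMD): independently skip each incoming
    mini-batch with probability `drop_prob`, so the model sees fewer batches
    per epoch and the per-epoch computation (and hence energy) shrinks.
    `drop_prob=0.5` is an illustrative choice, not a value taken from the paper."""
    model.train()
    for inputs, targets in loader:
        if random.random() < drop_prob:
            continue  # drop this mini-batch entirely: no forward or backward pass
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```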
Benefiting from the aforementioned three techniques, we are able to achieve over 80% savings in total training energy cost as measured on FPGA. For example, when training ResNet-74 on CIFAR-10, we achieve aggressive energy savings of >90% and >60%, while incurring a top-1 accuracy loss of only about 2% and 1.2%, respectively. When training ResNet-110 on CIFAR-100, an over 84% training energy saving is achieved without degrading inference accuracy. Code available at: https://github.com/RICE-EIC/E2Train
EB Train (ICLR 2020 Spotlight): This paper bridges the gap between the Lottery Ticket Hypothesis and the efficient training goal. Since identifying winning tickets still requires the costly train-prune-retrain process, limiting their practical benefits, this paper discovers for the first time that winning tickets can be identified at a very early training stage (Early-Bird (EB) tickets) via low-cost training schemes, and proposes a mask distance metric that identifies EB tickets with low computational overhead, without needing to know the true winning tickets that emerge after full training. We can therefore leverage the existence of EB tickets and the proposed mask distance to develop efficient training methods: first identify the EB tickets via low-cost schemes, then continue to train merely the EB tickets toward the target accuracy.
Fig.2 - A high-level overview of the difference between EB Train and existing progressive pruning and training. In particular, the progressive pruning-and-training scheme adopts a three-step routine of 1) training a large and dense model, 2) pruning it, and 3) retraining the pruned model to restore performance; these three steps can be iterated. The first step often dominates the training energy and time costs (e.g., accounting for about 75% of training FLOPs). EB Train replaces the aforementioned steps 1 and 2 with a lower-cost step of detecting the EB tickets.
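To illustrate how EB tickets can be detected, below is a minimal sketch of the mask distance, assuming channel pruning based on BatchNorm scaling factors; the helper names, the pruning ratio, and the thresholding details are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

def channel_pruning_mask(model, prune_ratio=0.5):
    """Binary mask over all BatchNorm scaling factors (gammas): keep the
    channels with the largest |gamma| and prune the rest.
    `prune_ratio` is an illustrative choice."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    num_keep = max(1, int(gammas.numel() * (1.0 - prune_ratio)))
    threshold = torch.topk(gammas, num_keep).values.min()
    return (gammas >= threshold).float()

def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two pruning masks. When the
    distance between masks drawn at consecutive epochs stays small and
    stable, an Early-Bird ticket has emerged and the remaining training
    can proceed on the pruned (ticket) network only."""
    return (mask_a != mask_b).float().mean().item()
```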
Experiments on various deep networks and datasets validate: 1) the existence of EB tickets and the effectiveness of the mask distance in efficiently identifying them; and 2) that the proposed efficient training via EB tickets achieves up to 4.7× energy savings while maintaining comparable or even better accuracy, demonstrating a promising and easily adopted method for tackling the often cost-prohibitive training of deep networks. Code available at: https://github.com/RICE-EIC/Early-Bird-Tickets
FracTrain (NeurIPS 2020): This paper proposes FracTrain, which integrates: 1) progressive fractional quantization (PFQ), which gradually increases the bitwidths of activations, weights, and gradients, reaching the precision of SOTA static quantized DNN training only in the final training stage; and 2) dynamic fractional quantization (DFQ), which assigns bitwidths to both the activations and gradients of each layer in an input-adaptive manner, for only “fractionally” updating layer parameters.
Fig.3 - (a) A high-level view of the proposed PFQ vs. SOTA low-precision training, where PFQ adopts a four-stage precision schedule to gradually increase the precision of weights, activations, gradients, and errors up to that of the static baseline, which here employs 8-bit precision for both the forward and backward paths (denoted as FW-8/BW-8); and (b) the corresponding training loss trajectory.
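As a rough illustration of PFQ, the sketch below pairs a simple uniform fake-quantizer with a staged bitwidth schedule whose final stage matches the FW-8/BW-8 baseline in Fig.3; the earlier-stage bitwidths, the stage-switching mechanism, and all names are assumptions for illustration, not the paper's exact quantizers or settings.

```python
import torch

def fake_quantize(x, num_bits):
    """Uniform symmetric fake quantization to `num_bits` bits (illustrative;
    FracTrain's actual quantizers for weights/activations/gradients differ)."""
    if num_bits >= 32:
        return x
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: quantized values forward, identity gradient backward.
    return x + (q - x).detach()

# Illustrative four-stage PFQ schedule: (forward bits, backward bits) per stage.
# Only the final stage (8, 8) is taken from the FW-8/BW-8 baseline in Fig.3;
# the earlier entries are made-up placeholders.
PFQ_SCHEDULE = [(3, 6), (4, 6), (6, 8), (8, 8)]

def current_bitwidths(stage):
    """Return the (forward, backward) bitwidths for the given training stage.
    In FracTrain the stage advances as training progresses (e.g., when the
    loss plateaus); here the caller is assumed to track that externally."""
    fw_bits, bw_bits = PFQ_SCHEDULE[min(stage, len(PFQ_SCHEDULE) - 1)]
    return fw_bits, bw_bits
```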
Extensive simulations and ablation studies (six models, four datasets, and three training settings including standard, adaptation, and fine-tuning) validate the effectiveness of FracTrain in reducing the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% ~ +1.87%) accuracy. For example, when training ResNet-74 on CIFAR-10, FracTrain achieves 77.6% and 53.5% savings in computational cost and training latency, respectively, compared with the SOTA baseline, while achieving comparable (-0.07%) accuracy. Code will be available soon.
ShiftAddNet (NeurIPS 2020): This paper presents ShiftAddNet, whose main inspiration is drawn from a common practice in energy-efficient hardware implementation: multiplication can instead be performed with additions and logical bit-shifts. We leverage this idea to explicitly parameterize deep networks in this way, yielding a new type of deep network that involves only bit-shift and additive weight layers.
Fig.4 - Overview structure of ShiftAddNet.
This hardware-inspired ShiftAddNet immediately leads to both energy-efficient inference and training, without compromising expressive capacity compared to standard DNNs. The two complementary operation types (bit-shift and add) additionally enable finer-grained control of the model’s learning capacity, leading to a more flexible trade-off between accuracy and (training) efficiency. Compared to existing DNNs or other multiplication-less models, ShiftAddNet reduces the hardware-quantified energy cost of DNN training and inference by over 80%, while offering comparable or better accuracies. Code will be available soon.
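For intuition, here is a minimal PyTorch sketch of the two building blocks: a shift layer whose weights are constrained to signed powers of two (so each multiply maps to a bit-shift plus a sign flip in hardware) and an additive layer that replaces inner products with negative L1 distances, in the spirit of AdderNet. The class names, initialization choices, and the simple straight-through rounding are illustrative assumptions, not the released ShiftAddNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftLinear(nn.Module):
    """Fully connected layer whose weights are signed powers of two, so each
    multiplication can be realized in hardware as a logical bit-shift."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # Random +/-1 signs (kept fixed here for simplicity) and learnable exponents.
        self.sign = nn.Parameter(
            torch.randint(0, 2, (out_features, in_features)).float() * 2 - 1,
            requires_grad=False)
        self.exponent = nn.Parameter(torch.randint(-4, 1, (out_features, in_features)).float())

    def forward(self, x):
        # weight = sign * 2^round(exponent); rounding is non-differentiable,
        # so a straight-through estimator lets gradients reach `exponent`.
        exp = self.exponent + (torch.round(self.exponent) - self.exponent).detach()
        weight = self.sign * torch.pow(2.0, exp)
        return F.linear(x, weight)

class AddLinear(nn.Module):
    """Additive layer in the spirit of AdderNet: the inner product is replaced
    by the negative L1 distance between the input and each weight vector,
    which uses only additions and subtractions."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        # output[b, o] = -sum_i |x[b, i] - weight[o, i]|
        return -(x.unsqueeze(1) - self.weight.unsqueeze(0)).abs().sum(dim=-1)
```

In ShiftAddNet these two operation types are combined within the network (see Fig.4); the sketch above only isolates each building block.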