I. Introduction
Recently, deep neural networks (DNNs) have achieved great success in many real-world applications, such as image classification [6], speech recognition [1], and natural language processing [13]. With their increasing size, DNN models show state-of-the-art performance. However, the high memory requirements and computational complexity have become a serious obstacle to efficient implementations, especially on mobile devices. To alleviate the extremely high demand for computational resources, many compression methods have been proposed, which aim to generate compact DNN models. At present, the reduced-precision representation of numbers, also known as quantization, is one of the most attractive topics [9]. However, these methods mainly focus on the inference phase of DNNs. Research on training with limited-precision numbers still remains to be explored.
Because it involves more information flows, including gradient backpropagation and parameter updating, the training of DNNs requires a higher representation ability for data. In other words, a suitable number format for DNN training should provide enough dynamic range for large numbers and high precision for numbers near the center of the data distribution.
Posit, a Type III universal number (unum), was introduced by Gustafson et al. [4]. An n-bit posit number is defined as posit(n, es), where es (the number of exponent bits) is used to control the dynamic range. Compared with standard floating-point (FP) numbers, posit offers a better trade-off between dynamic range and precision, just meeting the needs of low-bit numbers for DNN training. Some researchers have pointed out the prospects of posit in DNNs, but practical implementations and verifications are absent [4][15]. In this paper, we first propose an effective strategy for DNN training using the posit number system. Once posit is shown to be useful in DNN training, a processing element supporting posit arithmetic is required to make full use of its efficiency in DNN accelerators. Our contributions are summarized as follows:

We define an operation that transforms a real number into posit format, and we illustrate how to apply posit in the DNN training process.

We analyze the advantages and disadvantages of applying posit in DNN training, and propose corresponding solutions to the resulting problems. Firstly, to deal with the high sensitivity of models in the early training stage and to ensure convergence, a warm-up training with FP32 is carried out. Secondly, to take advantage of posit, we design a layer-wise scaling factor based on the center of the data distribution in the log domain, so that the data distribution of the model matches the varying precision of posit numbers. Thirdly, to meet the different data ranges of different layers, we propose a qualitative criterion to select a proper es and achieve a better trade-off between the dynamic range and precision of posit numbers.

To verify the effectiveness of our methods, ResNet-18 models are trained on the ImageNet and CIFAR-10 datasets, where 8-bit or 16-bit posit numbers are applied in the forward and backward computations. The experiments show no accuracy loss compared with the FP32 baseline model.

We propose a hardware architecture for a posit multiply-and-accumulate (MAC) unit, which is coded in Verilog HDL and synthesized with Design Compiler under TSMC 28nm technology. Compared with a standard floating-point MAC unit, the posit MAC reduces power by 83% and area by 76%. This demonstrates that our design will benefit future low-power DNN training accelerators.
II. Background
II-A. Reduced-Precision for DNN Training
Training DNNs with reduced precision is an appealing topic. Gupta et al. trained DNNs with fixed-point numbers and introduced a stochastic rounding procedure to prevent accuracy degradation [3]. In [11], a binary logarithmic data representation for both inference and training is explored, so that multiplications can be replaced by simpler shift operations. However, the above works usually cannot provide the expected model accuracy on complex tasks, because the aggressive approximation causes too much information loss.
To deal with this problem, some recent works use reduced-precision floating point, such as FP8 or FP16, in training. Micikevicius et al. [10] used FP16 for the forward and backward computations and kept FP32 for weight update and accumulation. They also proposed a loss-scaling method to keep gradient propagation effective. Furthermore, with a chunk-based accumulation technique, Wang et al. [12] reduced the precision of the computation to FP8, and the precision of the weight update and accumulation to FP16.
II-B. Posit Number System
Table I: Construction of a posit(5,1) number
Binary Code  Regime  Exponent  Mantissa  Real Value
00000  x  x  x  0
00001  -3  0  0  1/64
00010  -2  0  0  1/16
00011  -2  1  0  1/8
00100  -1  0  0  1/4
00101  -1  0  1/2  3/8
00110  -1  1  0  1/2
00111  -1  1  1/2  3/4
01000  0  0  0  1
01001  0  0  1/2  3/2
01010  0  1  0  2
01011  0  1  1/2  3
01100  1  0  0  4
01101  1  1  0  8
01110  2  0  0  16
01111  3  0  0  64
An n-bit posit number, whose detailed structure is shown in Fig. 1, consists of four parts: a sign bit, regime bits, exponent bits, and a mantissa part. The boundaries between the last three parts are not fixed, because the regime part is encoded by a run-length method. As for the numerical meaning of the regime bits, m consecutive 0s terminated by a 1 denote a regime value of -m, while m consecutive 1s terminated by a 0 denote a regime value of m-1. As an example, the construction of posit(5,1) is described in Table I. The value of a posit number (from its binary code) is given by Eq. (1).
$x = (-1)^{s} \cdot useed^{\,k} \cdot 2^{e} \cdot (1+f), \quad useed = 2^{2^{es}}$  (1)
where k is the regime value, e is the exponent value, f is the fraction represented by the mantissa bits, and useed determines the dynamic range.
The maximum and minimum positive values that posit(n, es) can represent are useed^(n-2) and useed^(2-n), respectively.
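To make Eq. (1) and Table I concrete, the short Python sketch below decodes a posit bit pattern into its real value. It is our own illustration of the encoding (the function and variable names are not from the paper); it handles negative posits via two's complement and ignores the special not-a-real pattern.

```python
def posit_to_real(code, n=5, es=1):
    """Decode an n-bit posit (given as an unsigned integer) into a float, per Eq. (1)."""
    if code == 0:
        return 0.0
    sign = -1.0 if (code >> (n - 1)) & 1 else 1.0
    if sign < 0:
        code = (-code) & ((1 << n) - 1)           # negative posits are stored in two's complement
    body = code & ((1 << (n - 1)) - 1)            # bits after the sign bit
    # Run-length decode the regime: count the leading identical bits.
    first = (body >> (n - 2)) & 1
    run = 1
    for i in range(n - 3, -1, -1):
        if ((body >> i) & 1) != first:
            break
        run += 1
    k = run - 1 if first else -run                # regime value
    rem = n - 2 - run                             # bits left after the regime terminator
    tail = body & ((1 << rem) - 1) if rem > 0 else 0
    e_bits = min(es, max(rem, 0))                 # exponent bits that actually fit
    e = (tail >> (rem - e_bits)) << (es - e_bits) if e_bits > 0 else 0
    m_bits = max(rem - es, 0)
    frac = (tail & ((1 << m_bits) - 1)) / (1 << m_bits) if m_bits > 0 else 0.0
    useed = 2 ** (2 ** es)
    return sign * useed ** k * 2 ** e * (1.0 + frac)

# Reproduces Table I, e.g. 00101 -> 3/8 and 01111 -> 64 for posit(5, 1):
assert posit_to_real(0b00101) == 0.375
assert posit_to_real(0b01111) == 64.0
```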
Some groups have worked on hardware architecture generators for posit arithmetic. Jaiswal et al. [7] proposed a parameterized posit arithmetic architecture generator supporting basic operations such as FP-posit conversion, addition/subtraction, and multiplication. Recently, an efficient posit MAC unit generator that can be combined with a reasonable pipeline strategy was put forward by Zhang et al. [15]. Besides, the application of low-bit posit in deep learning has also attracted attention. Deep Positron [2], a DNN architecture that employs exact multiply-and-accumulate (EMAC) units for 8-bit posit, shows better accuracy than 8-bit fixed-point and FP formats on some small datasets. Johnson [8] proposed a log-float format inspired by posit and used it for DNN inference, with only a small accuracy loss on the ImageNet dataset with a ResNet-50 model.
III. Posit Training Strategy and Experiment Results
Table II: Notation used in Algorithm 1
Name  Description
n  posit word size
es  posit exponent field size
s  sign of the number
–  effective exponent value of the number being converted
–  regime value
–  exponent value before rounding
–  mantissa value before rounding
–  regime width
–  exponent width
–  mantissa width
–  exponent value after rounding
–  mantissa value after rounding
III-A. Posit Transformation
In this work, all data and computations in the training process are represented in posit format. Therefore, we have to transform a real number, which is represented in FP32 format on current computers, into posit format. Here we define a transformation operator to accomplish this task. The detailed process is shown in Algorithm 1, and the notation involved is listed in Table II.
Given the total word size n and the exponent field size es, we can determine the dynamic range of a posit number. To convert a nonzero number to its corresponding posit number, we first limit its magnitude according to the dynamic range and then extract the sign, regime, exponent, and mantissa parts.
Next, because of the restriction of the word size, the width of each part is adjusted, and rounding operations are applied to the value of each part so that it fits the adjusted width. Here we choose the rounding-toward-zero method, i.e., the truncation operator in Algorithm 1, Lines 16 and 17. Compared with rounding-to-nearest and stochastic rounding, rounding toward zero is more hardware-friendly. Finally, the posit result is obtained by combining these parts according to Eq. (1).
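As a software-level companion to Algorithm 1 (whose listing is not reproduced here), the following sketch quantizes an FP32 value onto the posit(n, es) grid with round-toward-zero; the structure and names are our own assumptions rather than the paper's code.

```python
import math

def round_to_posit(x, n=8, es=1):
    """Round a real number toward zero onto the posit(n, es) grid (illustrative sketch)."""
    if x == 0.0:
        return 0.0
    useed = 2.0 ** (2 ** es)
    maxpos = useed ** (n - 2)
    minpos = useed ** (2 - n)
    sign = -1.0 if x < 0 else 1.0
    mag = min(max(abs(x), minpos), maxpos)          # clip to the dynamic range

    # Decompose the magnitude as 2^E * (1 + f), with f in [0, 1).
    E = math.floor(math.log2(mag))
    f = mag / (2.0 ** E) - 1.0

    # Split the effective exponent E into regime k and exponent e (E = k * 2^es + e).
    k = E >> es
    e = E - (k << es)

    # Fraction bits remaining after sign, regime (run + terminator) and exponent.
    regime_width = k + 2 if k >= 0 else -k + 1
    f_bits = n - 1 - regime_width - es
    if f_bits >= 0:
        f = math.floor(f * (1 << f_bits)) / (1 << f_bits)   # truncate: round toward zero
    else:
        kept = es + f_bits                          # exponent bits that still fit
        drop = es - max(kept, 0)
        e = (e >> drop) << drop                     # truncate low exponent bits
        f = 0.0
    return sign * (useed ** k) * (2.0 ** e) * (1.0 + f)
```

For example, under posit(5,1) this maps 0.3 to 0.25 and leaves 0.375 unchanged, consistent with Table I.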
With the transformation algorithm in place, we insert it into the DNN training computation flow as depicted in Fig. 3, which includes the forward pass, the backward pass, and the weight update.
III-B. Training a DNN Model with Posit
Although posit has many benefits for DNN training, it cannot deliver the expected performance if we simply replace FP32 with reduced-precision posit. There are several key reasons:

In the early training stage, the model is more sensitive to data precision, and the distributions of some layers are unstable, so reduced-precision representations cause a bad initialization and make the model hard to converge.

In fact, the precision of the posit number system is essentially symmetric about 1, whereas the data distributions in DNN models are concentrated in a limited range. To some extent, this results in a mismatch between the data distributions and the number representation format, thereby leading to larger approximation errors.

Different layers have different data ranges, which means some data distributions are more concentrated while others are relatively spread out. Therefore, it is suboptimal to use the same data precision (e.g., the same es for posit) to represent all of them.
In this section, we propose corresponding methods for dealing with the above problems.
Warm-up Training: By observing the distributions of data during training, we find that most of them are approximately normal. As shown in Fig. 2, the distributions of the weights in convolution (CONV) layers are basically stable during training. However, because of the initialization method, the distributions of the weights in batch normalization (BN) layers change steeply in the first few epochs, which may be an important reason for the high model sensitivity in the early training stage. Therefore, a higher numerical precision is required in this phase. To account for this, a warm-up training using FP32 for several epochs (1 to 5 epochs) is carried out. It helps determine the data distribution reliably and ensures the convergence of the network.
Distribution-based Shifting: When transforming a real number into a reduced-precision format, the most common approach is to approximate it by the nearest reduced-precision value and clip it to the dynamic range of the format. As a result, numerical errors are inevitable. To overcome the second issue, a scaling factor is usually introduced to shift the data distribution to a more appropriate range, whose upper bound is typically the maximum value that the reduced-precision format can represent [14]. The dynamic range of the posit number system is large enough to meet this demand; however, to make full use of the posit code space, and inspired by the shift-based mapping method [14], we also propose a layer-wise scaling factor. The calculation of the scaling factor is shown in Eq. (2).
$sf = 2^{\,c + \alpha}$  (2)
where c is the approximate distribution center, in the log domain, of the tensor X to be converted, which indicates that the majority of values are close to this order of magnitude, and α is a predefined positive integer constant, set to 2 in our experiments. As mentioned in previous works [5], large values are more important than small values, so we add α to c to shift values toward smaller magnitudes a little more. Based on the warm-up trained model, the scaling factor of each layer can be calculated. Finally, by applying the scaling factor before and after the transformation operation as in Eq. (3), the more important values are shifted to the order of magnitude that has higher precision.
$\hat{X} = sf \cdot \mathrm{Posit}(X / sf)$  (3)
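A minimal numeric sketch of the distribution-based shifting is given below, assuming the scaling factor has the form of Eq. (2) and reusing the round_to_posit helper from the Section III-A sketch; the helper names and the use of the mean as the log-domain center are our own assumptions.

```python
import numpy as np

def layerwise_scale(tensor, alpha=2):
    """Scaling factor from the log-domain distribution center, per our reading of Eq. (2)."""
    logs = np.log2(np.abs(tensor[tensor != 0]))
    center = int(round(float(logs.mean())))       # approximate center of the distribution
    return 2.0 ** (center + alpha)                # alpha nudges values toward smaller magnitudes

def quantize_with_shift(tensor, n=8, es=1, alpha=2):
    """Eq. (3): scale down, transform to posit, scale back (element-wise, illustrative only)."""
    sf = layerwise_scale(tensor, alpha)
    flat = [sf * round_to_posit(v / sf, n, es) for v in tensor.ravel()]
    return np.asarray(flat).reshape(tensor.shape)
```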
Adjusting the Dynamic Range: During DNN training, different layers have different distribution ranges, which we measure approximately by the difference between the maximum and minimum values in the log domain. For example, in the first few layers, the ranges of the gradients are relatively larger than those of other tensors. In this case, the posit number should have a larger dynamic range, which means a larger es. In this work, for simplicity, we simply set es to 1 for all weights and activations, and to 2 for all gradients and errors.
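The qualitative criterion above could be automated roughly as follows: measure the log-domain range of a tensor and pick a larger es when the range is wide. The threshold and the mapping to es values below are hypothetical illustrations; the paper itself simply fixes es = 1 for weights/activations and es = 2 for gradients/errors.

```python
import numpy as np

def choose_es(tensor, threshold=16.0):
    """Pick es from the log-domain range of a tensor (hypothetical rule of thumb)."""
    logs = np.log2(np.abs(tensor[tensor != 0]))
    log_range = float(logs.max() - logs.min())    # dynamic range, in powers of two
    return 2 if log_range > threshold else 1      # wider range -> larger es
```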
III-C. Experiment Results
To validate our posit training strategy, we perform experiments with ResNet-18 [6] on the ImageNet and CIFAR-10 datasets, using the PyTorch framework on NVIDIA P100 GPUs. The validation top-1 accuracy and the related configurations are summarized in Table III, which demonstrates that training with reduced-precision posit numbers achieves the FP32 baseline accuracy without tuning hyperparameters. The training details are as follows:
CIFAR-10: The model uses stochastic gradient descent with momentum 0.9 as the optimizer. The initial learning rate is set to 0.1 and divided by 10 at epochs 60, 150, and 250. The network is trained for 300 epochs with a mini-batch size of 512. The warm-up training runs for 1 epoch.
ImageNet: The model uses stochastic gradient descent with momentum 0.9 as the optimizer. The initial learning rate is set to 0.1 and divided by 10 every 30 epochs. The model is trained for 90 epochs with a mini-batch size of 512. The warm-up training runs for 5 epochs.
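For reference, a minimal PyTorch sketch of the CIFAR-10 configuration described above (SGD with momentum 0.9, learning-rate drops at epochs 60/150/250, batch size 512, one FP32 warm-up epoch). It uses the standard torchvision ResNet-18 as a stand-in for the CIFAR-ResNet-18 variant and omits the posit quantization hooks, so it is a skeleton under our assumptions rather than the authors' training script.

```python
import torch
from torchvision import datasets, transforms
from torchvision.models import resnet18

train_loader = torch.utils.data.DataLoader(
    datasets.CIFAR10("./data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=512, shuffle=True)

model = resnet18(num_classes=10).cuda()           # stand-in for the CIFAR-ResNet-18 variant
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 150, 250], gamma=0.1)

WARMUP_EPOCHS = 1                                 # FP32-only warm-up (Section III-B)
for epoch in range(300):
    posit_enabled = epoch >= WARMUP_EPOCHS        # would gate the posit transform; unused here
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```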
Table III: Validation top-1 accuracy (%) and training configurations
Dataset  CIFAR-10  ImageNet
Model  CIFAR-ResNet-18  ResNet-18
Batch size  512  512
Epochs  300  120
Optimizer  SGD with momentum  SGD with momentum
FP32 baseline  93.40  71.02
Posit  92.87  71.09

posit(8,1) for the CONV layers' forward pass and weight update; posit(8,2) for the CONV layers' backward pass; posit(16,1) for the BN layers' forward pass and weight update; posit(16,2) for the BN layers' backward pass.

posit(16,1) for the forward pass and weight update; posit(16,2) for the backward pass.
IV. Energy-Efficient Posit MAC Architecture
By using 8-bit or 16-bit posit numbers for training, the model size can be reduced to 25% or 50% of its FP32 size, so the energy consumption can be saved significantly because the memory space requirements and the communication bandwidth are reduced. As for the computational process, the energy consumption mainly comes from a large number of MAC operations. Since posit arithmetic operations differ from traditional floating-point operations, a dedicated MAC unit is required to take full advantage of reduced-precision posit.
As shown in Fig. 4, the posit MAC unit proposed in [15] is mainly composed of three units: a decoder converting posit to FP, an FP MAC unit, and an encoder converting FP back to posit. In this design, the sum of the encoder and decoder delays accounts for about 40% of the total posit MAC delay.
Based on this observation, improved encoder and decoder architectures with lower latency are proposed, which are shown in Fig. 6 and Fig. 5, respectively.
IV-A. The Optimized Decoder and Encoder Architectures
The decoder extracts the different parts of a posit number and then exports the effective exponent value and the mantissa value. Firstly, the absolute regime value of the input posit number is obtained by a leading-one detector (LOD) if the regime value is negative, or a leading-zero detector (LZD) if the regime value is positive. Secondly, the input is left-shifted by the width of the regime bits, which equals the absolute regime value plus one or plus two, depending on the regime sign. The output of the shifter consists of the posit exponent value and the mantissa value. Finally, the regime value and the posit exponent value are packed into the effective exponent value. The critical path of the original decoder is determined by the add-one operation. As shown in Fig. 5, we remove this adder and split the left-shift path by duplicating the shifter; to preserve the function of the adder, a left-shift-by-one operation is inserted after the duplicated shifter.
The encoder converts FP back to posit format. Firstly, a 2n-bit variable is constructed from the mantissa and the least-significant bits (LSBs) of the exponent, and the remaining bits are filled with the regime sequence. This variable is then right-shifted by the width of the regime bits, which again equals the absolute regime value plus one or plus two, depending on the regime sign. An optimization method similar to that used in the optimized decoder is therefore applied to the encoder architecture.
IV-B. Hardware Implementation Results
The architectures are coded in Verilog HDL and synthesized with Design Compiler under TSMC 28nm technology. To demonstrate the efficiency of the proposed encoder and decoder, the same parameterized architectures as in [15] are evaluated.
Table IV: Encoder and decoder comparison
  posit(8,0)  posit(16,1)  posit(32,3)
[15]  delay (ns)  encoder  0.2  0.29  0.35
  decoder  0.2  0.28  0.34
Ours  delay (ns)  encoder  0.13  0.18  0.23
  decoder  0.14  0.21  0.29
  power (mW)  encoder  0.21  0.44  0.59
  decoder  0.27  0.45  0.66
  area (μm²)  encoder  137  295  540
  decoder  201  504  960
The comparison results in Table IV show that our encoder is 25%–35% faster and our decoder is 15%–30% faster, thereby reducing the impact of these two units on the total delay.
After combining the proposed encoder and decoder with the FP MAC unit, an energy-efficient posit MAC architecture is obtained. To meet the requirements of DNN training with posit, different posit MAC units supporting all the posit formats involved in Table III are implemented. The implementation results are summarized in Table V. For a fair comparison of the energy consumption of the posit MACs and the FP32 MAC, all these units are synthesized with a timing constraint of 750 MHz. Compared with the FP32 MAC, the posit MACs reduce power by 22%–83% and area by 6%–76%.
Table V: MAC unit implementation results
  Power (mW)  Area (μm²)
FP32  2.52  4322
posit(8,1)  0.45  1208
posit(8,2)  0.35  1032
posit(16,1)  1.77  4079
posit(16,2)  1.60  3897
V. Conclusion and Future Work
In this paper, with several useful methods proposed, the posit number system is successfully applied to DNN training. The experimental results show that reduced-precision posit achieves accuracy similar to FP32 on different datasets. When posit is applied in DNN accelerators, the overhead caused by data communication can be reduced by 2 to 4 times. To take full advantage of posit, an energy-efficient posit MAC unit is designed. Compared with the FP32 MAC, the posit MAC reduces power by 22%–83% and area by 6%–76%.
In future work, we will implement a hardware accelerator for DNN training with posit. On the other hand, the posit arithmetic architecture based on an encoder and a decoder may not be optimal; we will carefully design a new architecture for the posit MAC to further improve its performance.
References
[1] (2016) Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182.
[2] (2018) Deep Positron: A deep neural network using the posit number system. arXiv preprint arXiv:1812.01762.
[3] (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746.
[4] (2017) Beating floating point at its own game: Posit arithmetic. Supercomputing Frontiers and Innovations 4 (2), pp. 71–86.
[5] (2015) Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[6] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[7] (2018) Universal number posit arithmetic generator on FPGA. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1159–1162.
[8] (2018) Rethinking floating point for deep learning. arXiv preprint arXiv:1811.01721.
[9] (2018) Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342.
[10] (2017) Mixed precision training. arXiv preprint arXiv:1710.03740.
[11] (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.
[12] (2018) Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, pp. 7675–7684.
[13] (2016) Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606–615.
[14] (2018) Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.
[15] (2019) Efficient posit multiply-accumulate unit generator for deep learning applications. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5.