I Introduction
As the training of deep neural networks (DNNs) is a very computationally expensive procedure for state-of-the-art DNNs (containing hundreds of layers and tens of millions of parameters), it is usually performed on clusters of graphics processing units (GPUs) or even on supercomputers. If a more energy-efficient implementation is required, a specialized hardware DNN accelerator is developed; see, for example, Google's TPU [7] or various academic neural chips [2, 9]. Such accelerators typically implement only the inference procedure for fully trained DNNs. In addition to simplifying the DNN architecture (by so-called pruning), a significant reduction in the power consumption of a DNN hardware accelerator can be obtained by introducing approximate memory and approximate arithmetic circuits [10]. As applications of DNNs are highly error-resilient [13, 12], the error introduced by approximate (i.e., simplified) implementations of circuits and programs often remains acceptable and, in most cases, invisible to the user. It is, however, important to quantify the error introduced by employing approximate circuits and to find the best trade-off between the error and power requirements before a real hardware design is started. This is usually done with the help of software platforms developed for DNN design and training (such as TensorFlow [1]). These platforms are, however, optimized for the standard floating-point arithmetic operations available in processors and GPUs and use standard libraries of mathematical functions. If these operations are replaced with user-specific (fixed-point) operations (e.g., approximate multipliers), the DNN execution is slowed down by several orders of magnitude on processors as well as GPUs. The reason is that there is no hardware support for approximate arithmetic operations on common processors and GPUs, so these operations have to be expensively emulated.
In order to analyze the DNN error after introducing approximate arithmetic operations into the computation datapath of a hardware DNN accelerator, we propose a new GPU-based emulation platform for DNN accelerators containing approximate circuits. The reason for choosing a GPU is that (i) complex DNNs are usually trained using software tools such as TensorFlow that are highly optimized for GPUs, and (ii) GPUs provide higher performance than common processors for DNN applications. Note that determining a suitable approximate implementation of an arithmetic operation for a given DNN and a given application requires evaluating many candidate approximate operations and, in most cases, performing additional parameter fine-tuning (i.e., retraining). Existing DNN development platforms do not support the aforementioned approach. For example, Ristretto (operating over Caffe) is capable of evaluating various number representations that can be employed in DNNs and finding an optimum bit width for these operations [3]. However, it does not support approximate arithmetic operations, i.e., only common operations with reduced bit widths can be executed.
In the proposed emulation platform, all relevant approximate circuits are implemented as lookup tables and accessed through the texture memory mechanism of CUDA-capable GPUs. We exploit the fact that the texture memory is optimized for irregular read-only access and, in some GPU architectures, is even backed by a dedicated cache. This technique allowed us to reduce the inference time of the emulated DNN accelerator approximately 200 times with respect to an optimized CPU version on complex DNNs such as ResNet. The proposed approach has been embedded into the TensorFlow library.
II CNNs with Approximate Multipliers Emulated on GPU
Various quantization schemes have been applied within DNNs [8, 5, 6]. The affine transformation represents the most preferred technique and is employed, for example, in TensorFlow (TF). It allows an efficient implementation of arithmetic operations using only integer arithmetic on the quantized values. This quantization scheme maps a real number r to an integer q so as to ensure the following equality:
r = S(q - Z)    (1)
where S and Z are two constants corresponding to the so-called scale and zero-point [6]. While S is a positive real number, Z should be a number of the same type as q. The constants are chosen in such a way that the real value r = 0 is exactly representable by a quantized value. This requirement is important because many computations result in 0 and it is highly undesirable to propagate a nonzero quantization error to the next layers. In addition, zero padding is often applied.
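The choice of the constants can be sketched in a few lines of Python (a minimal illustration; the helper names and the signed 8-bit target range are our assumptions, not part of the platform's API). The scale and zero-point are derived from the observed minimum and maximum of the real values so that r = 0 maps exactly onto an integer:

```python
def choose_quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Pick scale S and zero-point Z so that r = S * (q - Z) covers
    [r_min, r_max] and the real value 0 is exactly representable."""
    r_min = min(r_min, 0.0)   # make sure 0 lies inside the real range
    r_max = max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))  # clamp to integer range
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)
```

Because the zero-point is rounded to an integer, quantizing and dequantizing 0.0 returns exactly 0.0, while any other value is reproduced within one quantization step.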
To improve the efficiency, the convolution of two functions is typically implemented using matrix multiplication. Given two input matrices denoted as A and B, both having real elements, the output value corresponding to an element of the matrix product at position (i, j) is calculated as follows:
c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj}    (2)
When we apply the affine transform (Eq. 1), the output is:
c_{ij} = \sum_{k=1}^{N} S_A (q^{(A)}_{ik} - Z_A) \cdot S_B (q^{(B)}_{kj} - Z_B)    (3)
where q^{(A)}_{ik} represents the quantized value of a_{ik}. This expression can be rewritten as
c_{ij} = S_A S_B \Big( \sum_{k=1}^{N} q^{(A)}_{ik} q^{(B)}_{kj} - Z_B \sum_{k=1}^{N} q^{(A)}_{ik} - Z_A \sum_{k=1}^{N} q^{(B)}_{kj} + N Z_A Z_B \Big)    (4)
The first sum in Eq. 4 represents the summation over the quantized values, which can be calculated using integer operations only. In hardware, this can be efficiently implemented using an integer MAC circuit. The remaining two sums can also be computed on the quantized values, but it is beneficial for our purposes to express them in terms of real numbers. The obtained equation describes the convolution of two independently quantized inputs followed by dequantization.
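The decomposition can be checked numerically with a short sketch (illustrative Python; the symbols follow Eqs. 3 and 4). The first term is the pure integer MAC part, while the remaining terms only need the per-vector sums:

```python
def dot_quantized(qa, qb, Sa, Za, Sb, Zb):
    """Dot product of two quantized vectors via the Eq. (4) decomposition."""
    N = len(qa)
    int_mac = sum(a * b for a, b in zip(qa, qb))  # first sum (integer MAC)
    sum_a = sum(qa)                               # second sum (input-only)
    sum_b = sum(qb)                               # third sum (filter-only)
    return Sa * Sb * (int_mac - Zb * sum_a - Za * sum_b + N * Za * Zb)
```

For any quantized inputs, this evaluates to exactly the same value as summing S_A(q^(A) - Z_A) * S_B(q^(B) - Z_B) term by term, since the rewriting is an algebraic identity.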
As analyzed in [7], 8-bit operations are sufficient for DNN accelerators based on integer arithmetic. It means that we need a MAC unit consisting of an 8-bit multiplier and a 32-bit accumulator to calculate a single element of the matrix product [6]. In our approximate DNN hardware accelerator, the product of the quantized values is determined by an 8-bit approximate multiplier; it is the behavior of this multiplier that our platform emulates.
Several types of 2D convolutional layers that implement a variant of Eq. 2 are available in TF. To support the approximate multiplications in the training as well as the inference process seamlessly, without the necessity to rewrite the training algorithms already implemented in TF, we propose to introduce, for each type of 2D convolution, an alternative approximate 2D convolutional layer implementing a variant of Eq. 4. The approximate layer reads two floating-point inputs and produces a single floating-point output which has the same range as if the original convolutional layer were used.
Compared to the common convolutional layers, some additional information has to be provided: four scalars specifying the quantization coefficients, a model of the approximate multiplier, the expected range of the quantized values ([-128, 127] for signed, [0, 255] for unsigned multipliers), and the requested rounding mode applied during the quantization. In fact, the coefficients can be calculated independently for each input using the knowledge of the range of the inputs (i.e., the minimum and maximum values). The approximate multiplication is specified by means of its truth table. This approach offers the highest throughput and does not cause any limitations, as the truth table for an 8-bit multiplier occupies only 128 kB.
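The truth-table representation can be sketched as follows. The multiplier below is a hypothetical stand-in (an exact unsigned 8-bit product with the three least significant bits truncated); the platform itself is agnostic to the table's content:

```python
# Hypothetical approximate 8x8-bit unsigned multiplier used only as an
# example: it truncates the three least significant bits of the product.
def approx_mul8(a, b):
    return (a * b) & ~0x7

# Full truth table: 256 x 256 entries of 16-bit products = 128 kB total.
TABLE = [[approx_mul8(a, b) for b in range(256)] for a in range(256)]

def lut_mul(a, b):
    # Emulated multiplication is a plain table lookup.
    return TABLE[a & 0xFF][b & 0xFF]
```

With 256 x 256 entries of 2 bytes each, the table occupies exactly 256 * 256 * 2 B = 128 kB, matching the figure quoted above; a signed multiplier is handled analogously with an offset into the same-sized table.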
The design flow is as follows. Firstly, a DNN model is created or loaded in TF. Then, all convolutional layers are identified and replaced by the corresponding approximate variants. During this process, the minimum and maximum operators are inserted into the computational path and connected to the approximate layers. At the end, we obtain a transformed graph which is suitable for the inference as well as the training process because the minimum and maximum values of the input tensors are determined once per batch. A part of the original and transformed graph is shown in Fig. 1.
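The replacement step can be illustrated schematically (the node dictionaries below are a hypothetical stand-in for the actual TF graph structures, not the real graph API):

```python
# Schematic sketch of the graph transformation: every Conv2D node is
# replaced by an AxConv2D node, and min/max nodes are inserted on both
# of its inputs so the quantization coefficients can be derived per batch.
def transform_graph(nodes):
    out = []
    for node in nodes:
        if node["op"] == "Conv2D":
            out.append({"op": "MinMax", "input": node["inputs"][0]})
            out.append({"op": "MinMax", "input": node["inputs"][1]})
            out.append({"op": "AxConv2D", "inputs": node["inputs"]})
        else:
            out.append(node)  # all other layers are kept untouched
    return out
```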
III GPU Implementation
The 2D convolution operation in TF typically expects two 4D input tensors and produces another 4D tensor, provided that the stride and dilation parameters are specified. The first input tensor represents a batch of 3D input images given in the NHWC format (Batch x Height x Width x Channels), where the number of channels corresponds with the fastest changing index. The second tensor is a set of 3D filters (i.e., the kernels of the convolution) stored in the Height x Width x Channels x Count format, where Count specifies the number of filters applied to the same input. The output of the convolution shares the same layout as the input data; however, the height and width are determined according to the shape of the kernel, and the depth of each output image depends on the number of applied filters. The approximate version of the 2D convolution is extended by four scalar inputs that provide the minimum and maximum values computed independently for each input.
In [11], the authors applied a direct approach to implement the TF-compatible approximate 2D convolution. Unfortunately, only a CPU platform was supported. The method directly stems from the definition of the convolution operation and the format of the inputs. This leads to a system of nested loops (over each input image in the batch, each output pixel, each output channel, etc.) which is difficult to parallelize efficiently on GPUs. To avoid this issue, we selected the General Matrix-Matrix Multiplication (GEMM) approach for our CUDA-based GPU implementation. Similarly to the CPU-based implementation, we adopted the idea of implementing the inner multiplication between each input and filter value using a lookup table.
The GEMM-based approach splits the convolution into two separate operations: (i) the patch matrix, in which each row corresponds to a single position of the kernel, is constructed (the image-to-columns phase); (ii) the patch matrix is multiplied with the filter matrix, in which each column corresponds to a single filter (the GEMM phase). The GPU implementation of the approximate 2D convolution mostly follows this structure, but extends it with a few auxiliary computations to precompute the constant terms of Eq. 4.
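The two phases can be sketched in plain NumPy (a simplified model assuming 'valid' padding, unit dilation, and NHWC layout; the function names follow the text, not any actual API):

```python
import numpy as np

def im2cols(x, kh, kw, stride=1):
    """Build the patch matrix: one row per kernel position, columns are
    the flattened kh*kw*C receptive field (NHWC, 'valid' padding)."""
    n, h, w, c = x.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    rows = []
    for b in range(n):
        for i in range(oh):
            for j in range(ow):
                patch = x[b, i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                rows.append(patch.reshape(-1))
    return np.stack(rows), (n, oh, ow)

def conv2d_gemm(x, filt, stride=1):
    """GEMM-based convolution: patch matrix times filter matrix."""
    kh, kw, c, count = filt.shape          # HWC-Count filter layout
    patches, (n, oh, ow) = im2cols(x, kh, kw, stride)
    f = filt.reshape(kh * kw * c, count)   # one column per filter
    return (patches @ f).reshape(n, oh, ow, count)
```

The single large matrix product maps naturally onto the GPU, which is exactly what the direct nested-loop formulation lacked.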
Algorithm 1 describes the high-level structure of our implementation. At the beginning, the quantization parameters (scale and zero-point) are computed for each input using the input range information (the minimum and maximum values provided separately for each input). Then, the filter-only sum corresponding with the third sum in Eq. 4 is computed. The input batch is then split into chunks of a constant size to decouple the memory usage from the convolution parameters. Next, each chunk is converted to a matrix of 8-bit integer values, in which each row (patch) corresponds to a single position of the convolution kernel. At the same time, the dequantization sum (the second sum in Eq. 4) for each patch is also computed and stored in a vector. Finally, the patch matrix is multiplied by the matrix of filters (which are quantized at the same time) and the results are dequantized using the precomputed correction terms. The dequantized result of the matrix multiplication is appended to the output of the operation.
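A minimal NumPy model of the inner AxConv2D computation on one patch matrix might look as follows (illustrative names only; an exact integer multiplier stands in for the approximate one, which the real platform replaces by a table lookup):

```python
import numpy as np

def quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Scale and zero-point so that r = S*(q - Z) and 0 maps exactly."""
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    S = (r_max - r_min) / (q_max - q_min) or 1.0
    Z = int(round(q_min - r_min / S))
    return S, Z

def approx_gemm(P_real, F_real, mul=lambda a, b: int(a) * int(b)):
    """Quantize both operands, multiply on integers via `mul`, then
    dequantize using the Eq. (4) correction terms."""
    Sa, Za = quant_params(P_real.min(), P_real.max())
    Sb, Zb = quant_params(F_real.min(), F_real.max())
    Qa = np.clip(np.round(P_real / Sa) + Za, -128, 127).astype(int)
    Qb = np.clip(np.round(F_real / Sb) + Zb, -128, 127).astype(int)
    N = Qa.shape[1]
    s = Qa.sum(axis=1)            # per-patch sums (second sum in Eq. 4)
    f = Qb.sum(axis=0)            # per-filter sums (third sum in Eq. 4)
    mac = np.array([[sum(mul(a, b) for a, b in zip(row, col))
                     for col in Qb.T] for row in Qa])
    # Dequantization with the precomputed correction terms of Eq. (4).
    return Sa * Sb * (mac - Zb * s[:, None] - Za * f[None, :] + N * Za * Zb)
```

With the exact multiplier plugged in, the result matches the floating-point matrix product up to the quantization error; swapping in an approximate multiplier model changes only the `mul` argument.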
(i) Image-to-Columns phase
The image-to-columns phase (the Im2Cols function in Algorithm 1) can easily be parallelized in CUDA by running a single thread for each output value of the patch matrix. However, in order to compute the per-patch sums in a single pass over the data, one would either have to limit the number of parallel threads to a single thread per patch, or the thread block size would have to be tied to the patch length (so that the reduction can be performed in shared memory). As these approaches can limit the level of parallelism or the flexibility of the solution, we opted for a slightly different way to compute these sums.
The thread block size in our solution is fixed and independent of the patch length. This means that any given thread block can process one or even several patches at a time. Multiple reductions over the values processed by consecutive threads have to be performed, and each result has to be added to the appropriate element of the sum vector. In CUDA, this can be done (with reasonable efficiency) by loading the input values into shared memory and performing a prefix scan, which allows extracting the partial sums at the end of each patch. These results are then added atomically (using atomicAdd) to the appropriate element of the sum vector, as the rest of the patch may be processed by other thread blocks.
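The idea can be emulated on the CPU (a sketch; `block` corresponds to the fixed thread block size, and the final additions model the atomicAdd step performed when a patch spans several blocks):

```python
def patch_sums(values, patch_len, block=8):
    """Per-patch sums computed block-wise, as in the Im2Cols phase:
    each fixed-size 'thread block' scans its chunk once, and the partial
    sum accumulated at each patch boundary (or at the chunk end) is
    added -- atomically on the GPU -- to that patch's slot."""
    s = [0] * ((len(values) + patch_len - 1) // patch_len)
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        running, prefix = [], 0
        for v in chunk:                      # inclusive prefix scan
            prefix += v
            running.append(prefix)
        last_cut = 0
        for i in range(len(chunk)):          # flush at each patch boundary
            gpos = start + i
            if (gpos + 1) % patch_len == 0:
                s[gpos // patch_len] += running[i] - last_cut
                last_cut = running[i]
        if chunk and last_cut != running[-1]:  # leftover partial sum
            s[(start + len(chunk) - 1) // patch_len] += running[-1] - last_cut
    return s
```

The block size is completely decoupled from the patch length, which is the flexibility the text argues for.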
Tab. I: DNN parameters and execution times.

| DNN | # MACs | Accurate Conv2D (CPU) | Accurate Conv2D (GPU) | Approximate AxConv2D (CPU) | Approximate AxConv2D (GPU) | Approx. overhead (CPU) | Approx. overhead (GPU) | Speedup GPU vs CPU (accurate) | Speedup GPU vs CPU (approximate) |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-8 | 7 | 0.2 + 4.4 s | 1.8 + 0.2 s | 0.2 + 341 s | 1.7 + 1.5 s | 337 s | 1.2 s | 2.3 | 106.8 |
| ResNet-14 | 13 | 0.2 + 7.4 s | 1.9 + 0.3 s | 0.2 + 724 s | 1.8 + 3.1 s | 718 s | 2.7 s | 3.5 | 148.8 |
| ResNet-20 | 19 | 0.2 + 10.4 s | 1.8 + 0.5 s | 0.2 + 1105 s | 1.8 + 4.7 s | 1096 s | 4.3 s | 4.7 | 170.2 |
| ResNet-26 | 25 | 0.2 + 13.4 s | 1.9 + 0.6 s | 0.2 + 1489 s | 1.8 + 6.2 s | 1477 s | 5.6 s | 5.5 | 185.0 |
| ResNet-32 | 31 | 0.3 + 16.3 s | 1.9 + 0.7 s | 0.3 + 1876 s | 1.9 + 7.9 s | 1861 s | 7.3 s | 6.5 | 191.0 |
| ResNet-38 | 37 | 0.3 + 19.3 s | 1.9 + 0.8 s | 0.3 + 2259 s | 1.9 + 9.4 s | 2241 s | 8.6 s | 7.3 | 200.1 |
| ResNet-44 | 43 | 0.3 + 22.3 s | 1.9 + 0.9 s | 0.3 + 2640 s | 2.0 + 10.9 s | 2620 s | 10.0 s | 8.0 | 205.6 |
| ResNet-50 | 49 | 0.3 + 25.2 s | 1.9 + 1.1 s | 0.3 + 3025 s | 2.0 + 12.6 s | 3003 s | 11.7 s | 8.6 | 207.2 |
| ResNet-56 | 55 | 0.3 + 28.1 s | 1.9 + 1.2 s | 0.3 + 3409 s | 2.0 + 13.9 s | 3384 s | 12.8 s | 9.2 | 214.4 |
| ResNet-62 | 61 | 0.3 + 31.1 s | 1.9 + 1.3 s | 0.3 + 3796 s | 2.3 + 15.5 s | 3767 s | 14.7 s | 10.0 | 213.2 |
(ii) Matrix multiplication phase
The matrix multiplication phase (the ApproxGEMM function in Algorithm 1) is implemented as a typical tiled GEMM, in which the threads of a block load a 2D tile from each matrix into shared memory and each thread computes a single output value. The tiles in the shared memory are quantized and stored as uint to avoid possible shared memory access conflicts. The multiplication of quantized 8-bit values is implemented by a lookup table containing 16-bit values stored in the GPU memory and cached in the L1 or L1 texture cache. To manage this in CUDA, a cudaTextureObject_t is used to store the table and tex1Dfetch<ushort> to perform the lookup based on the index created by stitching the two multiplied 8-bit values into a single 16-bit value. The results of the multiplication (lookup) operations are accumulated in a 32-bit floating-point accumulator. The last step is to perform the dequantization and a correction according to Eq. 4 using the scale and zero-point terms and the precomputed per-patch and per-filter sums.
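The index stitching can be sketched as follows (illustrative Python; the real kernel performs the equivalent tex1Dfetch<ushort> on the GPU):

```python
# The two 8-bit operands are stitched into one 16-bit index into a flat
# 64K-entry table -- the layout the CUDA kernel fetches through the
# texture path. `mul` is any 8x8-bit multiplier model.
def make_flat_table(mul):
    return [mul(i >> 8, i & 0xFF) for i in range(1 << 16)]

def lut_lookup(table, a, b):
    # Stitch: high byte = first operand, low byte = second operand.
    return table[((a & 0xFF) << 8) | (b & 0xFF)]
```

Because the index is formed by simple bit operations on values already sitting in registers, the lookup itself is the only memory access per multiplication.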
IV Results
The proposed emulation platform was implemented in C++ using NVIDIA CUDA Toolkit 10.1 and integrated into the TensorFlow library. Its performance was evaluated on residual ResNet [4] DNNs because they enabled us to easily configure the number of building blocks and thus the number of 2D convolutional layers and MAC operations (see Tab. I). We focused on the performance evaluation of the approximate layers. Note that the accuracy is the same as if the quantization followed by the dequantization available in TensorFlow were used. The experiments were conducted on an Intel Xeon E5-2620 CPU and an NVIDIA GTX 1080 GPU. We used the CIFAR-10 dataset containing input images having 32x32 pixels each. Ten pretrained ResNet models were used whose parameters are provided in Tab. I. Only the inference process is considered to avoid any bias. The content of the LUT implementing an approximate multiplier does not have any impact on the execution time. The evaluation of the data set is divided into 10 batches consisting of 1000 images each.
Our implementation is compared with the native and highly optimized implementations for CPUs and GPUs already available in TF (columns 'Accurate Conv2D') as well as with the CPU-based approach of [11] (the first column in 'Approximate AxConv2D'). For each DNN, Tab. I reports the time as the sum of two components, where the first is the time needed to initialize the computation (including the memory allocation and data transfers, which are critical especially in the case of GPUs) and the second is the time required to process the whole data set by the DNN. While the initialization time is nearly constant (the same data set is used in all cases), the computation time increases linearly with the number of MACs. The last two columns contain the speedup achieved on the GPU, which also grows with the network size. The proposed accelerator achieves significantly better performance compared to the CPU-based one (see the last column). The overhead introduced by the necessity to perform the quantization, the LUT lookups, and the dequantization is also much smaller on the GPU. For ResNet-62, the computation time was reduced from 3796 s to 15.5 s. The overhead due to the emulation of the approximate computations is still quite high (14.7 s), but the total time is now acceptable for practical usage. Fig. 2 shows a more detailed analysis of the total computation time for some configurations. For ResNet-62, 26% of the total time is caused by the LUT lookups, 20% is due to the quantization, dequantization, and computation of min/max, 10% is spent in the initialization phase, and the rest is the remaining computation (Im2Cols, GEMM, etc.).
V Conclusions
We proposed an efficient emulation method for DNN accelerators containing approximate multipliers. This method allowed us to reduce the inference time of the emulated DNN accelerator approximately 200 times with respect to an optimized CPU version on complex DNNs such as ResNet. This opens new ways to the automated design of approximate DNN accelerators, in which many candidate designs have to be quickly evaluated.
Acknowledgements
This work was supported by Czech Science Foundation project 19-10137S.
References

[1] TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] DaDianNao: A machine-learning supercomputer. In IEEE/ACM Int. Symp. on Microarchitecture, pp. 609–622, 2014.
[3] Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst., 29(11):5784–5789, 2018.
[4] Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[5] Binarized neural networks. In Advances in Neural Information Processing Systems 29, pp. 4107–4115, 2016.
[6] Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR, abs/1712.05877, 2017.
[7] In-datacenter performance analysis of a tensor processing unit. In Proc. of the 44th Annual Int. Symposium on Computer Architecture, pp. 1–12, 2017.
[8] Ternary weight networks. CoRR, abs/1605.04711, 2016.
[9] FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In HPCA'17, 2017.
[10] A survey of techniques for approximate computing. ACM Computing Surveys, 48(4):62:1–62:33, 2016.
[11] ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining. In ICCAD'19, 2019.
[12] Energy-efficient neural computing with approximate multipliers. J. Emerg. Technol. Comput. Syst., 14(2):16:1–16:23, 2018.
[13] AxNN: Energy-efficient neuromorphic systems using approximate computing. In ISLPED'14, pp. 27–32, 2014.