
A Method For Implementing Convolutional Neural Networks Using Winograd Minimal Filtering And GEMM

Abstract: The present disclosure relates to a method for accelerating convolutional neural networks on FPGA based on a unified Winograd minimal filtering and GEMM architecture. Convolutional Neural Networks (CNN) have emerged as the most efficient technique for solving a host of machine learning tasks, especially in image and video processing domains. However, deploying CNNs on computing systems with smaller form factors has proven to be extremely challenging due to the complex nature of CNNs. Hardware acceleration of CNNs using FPGAs has emerged as a promising approach due to the high performance, energy efficiency and reconfigurability of FPGAs. Winograd filtering based convolution is the most efficient algorithm for calculating convolution for smaller filter sizes. Various CNN implementations with dedicated processing elements for the Winograd algorithm have been proposed. The present embodiment is a unified architecture in which both Winograd based convolution and general matrix multiplication (GEMM) can be accelerated using the same set of processing elements. This enables efficient utilization of FPGA hardware resources for accelerating all the layers in a CNN. The current embodiment has been used to accelerate the AlexNet CNN on FPGA. We have also analyzed the performance with varying Winograd tile sizes and identified the most appropriate tile sizes for maximizing performance while reducing on-chip memory resources. [To be published with Figure 1]


Patent Information

Application #
201841038043
Filing Date
08 October 2018
Publication Number
42/2018
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
suresh@unimarkslegal.com
Parent Application

Applicants

Kala S
Division of Electronics Engineering, School of Engineering, Cochin University of Science And Technology, Kochi-682022
Babita Roslind Jose
Division of Electronics Engineering, School of Engineering, Cochin University of Science And Technology, Kochi-682022
Jimson Mathew
Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna-801103
Nalesh S
Department of Electronics, Cochin University of Science And Technology, Kochi-682022

Inventors

1. Kala S
Division of Electronics Engineering, School of Engineering, Cochin University of Science And Technology, Kochi-682022
2. Babita Roslind Jose
Division of Electronics Engineering, School of Engineering, Cochin University of Science And Technology, Kochi-682022
3. Jimson Mathew
Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna-801103
4. Nalesh S
Department of Electronics, Cochin University of Science And Technology, Kochi-682022

Specification

CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application does not claim priority from any patent application.
TECHNICAL FIELD
The present disclosure in general relates to the field of convolutional neural networks. More particularly, the present invention relates to a unified Winograd-GEMM architecture for accelerating convolutional neural networks on hardware.
BACKGROUND
Convolutional Neural Networks (CNN) have emerged as the most efficient technique for solving a host of machine learning tasks, especially in image and video processing domains. However, deploying CNNs on computing systems with smaller form factors has proven to be extremely challenging due to the complex nature of CNNs. Hardware acceleration of CNNs using Field Programmable Gate Arrays (FPGAs) has emerged as a promising approach due to the high performance, energy efficiency and reconfigurability of FPGAs. Although CNNs provide considerable improvement in efficiency, a considerable amount of processing is still required during training, and the process can last multiple days. Graphics Processing Units (GPUs) are widely used to reduce compute time, but they are unsuitable for embedded applications due to their high power consumption [1].
SUMMARY
Before the present implementation of a convolutional neural network architecture using the Winograd and general matrix multiply (GEMM) algorithms is described, it is to be understood that this use case is not limited to the particular system and methodologies described, as there can be multiple possible embodiments which are not expressly illustrated in the present disclosure. It is also to be understood that the terminology used in the description is for the purpose of describing the particular versions or embodiments only, and is not intended to limit the scope of the present application. This summary is provided to introduce concepts related to the convolutional neural network architecture using the Winograd-GEMM based technique. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining or limiting the scope of the claimed subject matter.

[005] A CNN consists of convolutional layers (CONV), pooling layers and fully connected (FC) layers. Convolutional layers account for 90% of the total computation in a CNN [2]. The direct convolution method, the Fast Fourier Transform (FFT) method and the Winograd minimal filtering method are the different methods for performing convolution [3], [4]. Direct convolution is performed using the GEMM algorithm and is the least efficient of the three, but it has the advantage of using the same hardware resources for CONV and FC layers. FFT based convolutions are useful when the kernel (filter) size is large, or when the input and kernel are of the same size. The Winograd method is useful for small kernel sizes. The efficiency and precision of CONV layers depend on the Winograd tile size.
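To make the mapping of direct convolution onto GEMM concrete, the following is a minimal NumPy sketch of the standard im2col lowering; the function name, shapes and parameters are illustrative and are not taken from the specification. Each receptive field of the input is unrolled into one column, so a single matrix multiplication produces all output feature maps.

```python
import numpy as np

def direct_conv_as_gemm(x, w, stride=1):
    """Direct convolution of a C x H x W input with F kernels (F x C x K x K),
    lowered to a single GEMM via im2col (illustrative sketch)."""
    C, H, W = x.shape
    F, _, K, _ = w.shape
    Ho = (H - K) // stride + 1
    Wo = (W - K) // stride + 1
    # im2col: unroll each C x K x K receptive field into one column
    cols = np.empty((C * K * K, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i * stride:i * stride + K, j * stride:j * stride + K]
            cols[:, i * Wo + j] = patch.ravel()
    # GEMM: (F x C*K*K) @ (C*K*K x Ho*Wo), reshaped back to F output feature maps
    return (w.reshape(F, -1) @ cols).reshape(F, Ho, Wo)
```

Because FC layers are themselves plain matrix products, the same GEMM datapath can serve both layer types, which is the resource-sharing advantage noted above.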
[006] AlexNet was the first CNN model to win the ImageNet challenge, in 2012 [5]. AlexNet has five convolutional layers and three fully connected layers. The first convolutional layer uses an 11x11 kernel and is denoted CONV1. The second layer uses a 5x5 kernel and is denoted CONV2. The third convolutional layer uses a 3x3 kernel and is denoted CONV3. The fourth and fifth convolutional layers also use 3x3 kernels and are denoted CONV4 and CONV5 respectively. Here the CONV2, CONV3, CONV4 and CONV5 layers can be efficiently accelerated using the Winograd algorithm, while CONV1 needs direct convolution. If both the Winograd layers and the FC layers can be computed using the same compute resources, significant savings in hardware can be achieved. The present embodiment accelerates CNNs on FPGAs by incorporating the Winograd minimal filtering algorithm on a GEMM accelerator.
BRIEF DESCRIPTION OF DRAWINGS
[007] The detailed description is provided with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer to like features and components.
[008] Figure 1 illustrates the blocked Winograd minimal filtering algorithm, in
accordance with an embodiment of the present subject matter.
[009] Figure 2 illustrates the proposed unified Winograd-GEMM architecture for
accelerating CNN on FPGA, in accordance with an embodiment of the present subject matter.
[0010] Figure 3 illustrates the architecture of Data Transform Unit, in accordance
with an embodiment of the present subject matter.

[0011] Figure 4 illustrates the architecture of Output Transform Unit, in accordance
with an embodiment of the present subject matter.
[0012] Figure 5 illustrates the computation time for convolutional layers using
various tile sizes, in accordance with an embodiment of the present subject matter.
[0013] Figure 6 illustrates the complete use case of the system and method in
accordance with an embodiment of the present subject matter.
DETAILED DESCRIPTION
[0014] Some embodiments of the present disclosure, illustrating all its features, will now be discussed in detail. The words "receiving", "determining", "controlling", and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise. Although any systems and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the exemplary architecture for CNN is now described. The disclosed embodiments of the CNN architecture are merely exemplary of the disclosure, which may be embodied in various forms.
[0015] Various modifications to the embodiment will be readily apparent to those
skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure for a CNN architecture is not intended to be limited to the embodiments illustrated, but is to be accorded the widest scope consistent with the principles and features described herein.
[0016] The present subject matter relates to a CNN architecture on FPGA. The proposed method accelerates CNNs on FPGA by implementing the Winograd minimal filtering algorithm on a GEMM accelerator.
[0017] Various FPGA implementations of CNNs from various groups are presented in [2], [6]-[11]. Various implementations of Winograd minimal filtering based CNNs have been proposed in the last three years [1]-[2], [6]-[9]. All of them use specially designed dedicated Processing Elements (PEs) for performing Winograd based convolution. A major drawback of this approach is the need for separate PEs for CONV layers and FC layers, which leads to severe under-utilization of resources. In the case of AlexNet, only CONV2, CONV3, CONV4 and CONV5 can be efficiently accelerated using the Winograd algorithm; CONV1 needs direct convolution. If direct convolution, Winograd convolution and the FC computation can be performed using the same compute resources, significant savings in hardware can be achieved. We now describe each of the components of the current method.
[0018] A 2D Winograd filter is represented by F(m x m, r x r), where m x m denotes the output tile size and r x r denotes the kernel size. The kernel is applied on an input tile of size (m+r-1) x (m+r-1). For a kernel tile g and input tile d, the output Y can be written as shown in Equation I,

Y = A^T [ (G g G^T) ⊙ (B^T d B) ] A        Equation I

where G, B and A are the transform matrices for the kernel, input and output respectively, and ⊙ denotes the element-wise product. These matrices can be precomputed once the values of m and r are known. For an F(2x2, 3x3) tile, conventional convolution involves 36 multiplications while Winograd takes only 16 multiplications, so the multiplication complexity is reduced by a factor of 2.25. As the Winograd tile size increases (m increases for a particular r), the number of additions and constant multiplications also increases. The size of the transform matrices also grows with tile size, and their coefficients need approximation, which reduces the precision of the convolution, while the multiplication complexity decreases with tile size. Therefore, there is a trade-off between precision and performance based on the Winograd tile size.
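As an illustration of Equation I, the following NumPy sketch evaluates F(2x2, 3x3) using the transform matrices B^T, G and A^T given in [3], and checks the result against direct convolution of a 4x4 input tile. The helper names are illustrative, not part of the present embodiment.

```python
import numpy as np

# F(2x2, 3x3) transform matrices from [3]
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """Y = A^T [(G g G^T) o (B^T d B)] A for a 4x4 input tile d and 3x3 kernel g."""
    U = G @ g @ G.T           # transformed kernel, 4x4
    V = BT @ d @ BT.T         # transformed input tile, 4x4
    return AT @ (U * V) @ AT.T  # 2x2 output tile

def direct_conv_valid(d, g):
    """Reference 'valid' convolution (CNN-style correlation) of a 4x4 tile with a 3x3 kernel."""
    out = np.zeros((2, 2))
    for x in range(2):
        for y in range(2):
            out[x, y] = np.sum(d[x:x + 3, y:y + 3] * g)
    return out

d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
assert np.allclose(winograd_f2x2_3x3(d, g), direct_conv_valid(d, g))
```

The 16 element-wise products in U * V replace the 36 multiplications of the direct method, which is the 2.25x reduction quoted above.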
[0019] Consider C channels of K x K kernels and a total of F such kernels. Also consider C channels of H x W input feature maps. Each kernel performs convolution on the input feature map with a stride of S and generates an output feature map Y. Element (u, v) of the f-th kernel in channel c is represented as G_{f,c,u,v}, and input element (x, y) in channel c of the i-th image is denoted as D_{i,c,x,y}. Each output element is given by Equation II,

Y_{i,f,x,y} = Σ_u Σ_v D_{i,c,Sx+u,Sy+v} G_{f,c,u,v}        Equation II

This equation calculates the convolution of the image with one kernel for a single channel.
[0020] For multiple channels, the convolution outputs from all the channels have to be added together to get the final output. Applying the Winograd transform of Equation I to each input tile d_{i,c,(x̃,ỹ)} and each kernel g_{f,c}, this is expressed as in Equation IV,

Y_{i,f,(x̃,ỹ)} = A^T [ Σ_{c=1..C} (G g_{f,c} G^T) ⊙ (B^T d_{i,c,(x̃,ỹ)} B) ] A        Equation IV

where (x̃, ỹ) indexes the input tiles. Writing U_{f,c} = G g_{f,c} G^T and V_{c,(x̃,ỹ)} = B^T d_{i,c,(x̃,ỹ)} B, the inner summation is given by Equation V,

M_{f,(x̃,ỹ)} = Σ_{c=1..C} U_{f,c} ⊙ V_{c,(x̃,ỹ)}        Equation V

By reducing the image tile coordinates to a single dimension and separating each element-wise multiplication in a tile, the inner add and multiply in Equation V can be transformed to GEMM. Here the GEMM is between the transformed input tiles and the transformed kernel tiles. Each element of the input tile and the kernel, after transformation, is scattered into a different matrix. The transformed image matrix holds the corresponding elements of consecutive image tiles of a single channel along a column, with one column for each channel. The transformed kernel matrix is arranged with channels along the rows and kernels along the columns.
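The scattering described above can be sketched as follows for F(2x2, 3x3); the array layouts, sizes and variable names are illustrative assumptions rather than the specific data layout of the present embodiment. Each of the q x q tile-element positions yields one GEMM between a T x C image matrix and a C x F kernel matrix.

```python
import numpy as np

# F(2x2, 3x3) transforms as in the previous sketch (q = 4, m = 2)
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

C, F, T, q, m = 3, 2, 4, 4, 2
d = np.random.rand(T, C, q, q)        # extracted input tiles (tile stride m)
g = np.random.rand(F, C, 3, 3)        # kernels

# Transform every tile and kernel; V[xi, nu] and U[xi, nu] collect the (xi, nu)-th
# element of each transformed tile/kernel, scattered into their own matrices
V = np.einsum('ij,tcjk,lk->iltc', BT, d, BT)   # shape (q, q, T, C)
U = np.einsum('ij,fcjk,lk->ilcf', G, g, G)     # shape (q, q, C, F)

# q*q independent GEMMs between (T x C) and (C x F) matrices
M = np.empty((q, q, T, F))
for xi in range(q):
    for nu in range(q):
        M[xi, nu] = V[xi, nu] @ U[xi, nu]

# Inverse transform gathers each output tile back: Y[t, f] = A^T M_{t,f} A
Y = np.einsum('ix,xytf,ly->tfil', AT, M, AT)   # shape (T, F, m, m)
```

A quick correctness check is to accumulate the per-channel direct convolution of d[t, c] with g[f, c] over c for every (t, f) pair and compare it with Y.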
[0021] Implementing GEMM on FPGA requires a trade-off between performance, external memory bandwidth and Block RAM (BRAM) utilization. Typically, blocked variants of the GEMM algorithm are implemented, which optimally use the BRAM resources and DDR bandwidth while maximizing performance. Similar blocking techniques have to be applied to the GEMM based Winograd algorithm.
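For reference, a minimal blocked GEMM in NumPy; the block sizes bm, bn and bk are illustrative stand-ins for the amount of data staged in BRAM, and the function is a sketch of the generic technique rather than the PE-array implementation of [13].

```python
import numpy as np

def blocked_gemm(A, B, bm=32, bn=32, bk=32):
    """C = A @ B computed block by block. Each (bm x bk) and (bk x bn) sub-matrix
    stands for the data held on-chip, so block sizes trade BRAM against DDR traffic."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i0 in range(0, M, bm):
        for j0 in range(0, N, bn):
            acc = np.zeros((min(bm, M - i0), min(bn, N - j0)))
            for k0 in range(0, K, bk):   # accumulate partial products along K
                acc += A[i0:i0 + bm, k0:k0 + bk] @ B[k0:k0 + bk, j0:j0 + bn]
            C[i0:i0 + bm, j0:j0 + bn] = acc
    return C

A, B = np.random.rand(96, 64), np.random.rand(64, 128)
assert np.allclose(blocked_gemm(A, B), A @ B)
```

Larger blocks mean fewer passes over DDR at the cost of more BRAM, which is exactly the trade-off the blocked Winograd algorithm below must respect.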
[0022] The basic GEMM based Winograd filtering algorithm is presented in [3]. For an efficient FPGA implementation, we have modified this algorithm to create a blocked Winograd filtering algorithm, which is listed in Algorithm 1. The basic algorithm in [3] transforms the Winograd operation into q x q independent matrix multiplications, where q = m + r - 1 is the input tile size. Each multiplication is between matrices of size T x C and C x F, where T, C and F are the numbers of input tiles, channels and kernels respectively.
[0023] In Algorithm 1, the first matrix is divided into sub-matrices of size ST x C and the second matrix into sub-matrices of size C x SF. The input and kernel transforms are performed at sub-matrix level, GEMM is performed on the transformed sub-matrices, and the resulting product sub-matrix is transformed to get ST output tiles (each of size m x m) for SF kernels. This process is repeated for all tiles and filters. Hereafter we denote F(m x m, r x r) as F(m, r).

Algorithm 1: Blocked Winograd Filtering Algorithm for F(m x m, r x r)
T = ⌈H/m⌉ ⌈W/m⌉ is the number of image tiles.
q = m + r - 1 is the input tile size.
Each input tile d_{c,t} ∈ R^(q x q), and the tile stride is m.
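A NumPy sketch of the blocking in Algorithm 1 is given below; the ST and SF defaults and the data layout are illustrative assumptions. Kernels are transformed once per C x SF sub-matrix, input tiles once per ST x C sub-matrix, the q x q GEMMs run on the transformed sub-matrices, and the product sub-matrix is inverse-transformed into ST x SF output tiles.

```python
import numpy as np

# F(2x2, 3x3) transforms as in the earlier sketches (m = 2, q = 4)
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def blocked_winograd(d_tiles, kernels, ST=32, SF=32, m=2):
    """Blocked Winograd filtering sketch: d_tiles is (T, C, q, q), kernels is (F, C, 3, 3)."""
    T, C, q, _ = d_tiles.shape
    F = kernels.shape[0]
    Y = np.zeros((T, F, m, m))
    for f0 in range((F + SF - 1) // SF * 0, F, SF):
        # kernel transform for one C x SF sub-matrix of kernels
        U = np.einsum('ij,fcjk,lk->ilcf', G, kernels[f0:f0 + SF], G)
        for t0 in range(0, T, ST):
            # input transform for one ST x C sub-matrix of tiles
            V = np.einsum('ij,tcjk,lk->iltc', BT, d_tiles[t0:t0 + ST], BT)
            # q*q GEMMs between (ST x C) and (C x SF) transformed sub-matrices
            M = np.einsum('xytc,xycf->xytf', V, U)
            # output transform: ST x SF output tiles of size m x m
            Y[t0:t0 + ST, f0:f0 + SF] = np.einsum('ix,xytf,ly->tfil', AT, M, AT)
    return Y
```

Transforming at sub-matrix level keeps only one ST x C and one C x SF block of transformed data on-chip at a time, which is how the algorithm bounds the BRAM requirement.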
The total execution time is bounded below by max(T_trans, T_comp), where T_trans is the data transfer time and T_comp is the compute time. If data transfer and compute overlap completely, this lower bound of the execution time can be achieved.

[0032] The present embodiment has been implemented on a Xilinx XC7VX690T FPGA. The convolution layers of the AlexNet CNN are accelerated using the implemented architecture. Table I shows the parameters of the CONV layers in AlexNet. The CONV1 layer uses a kernel of size 11x11 and is computed using direct convolution. The rest of the layers are computed using the Winograd algorithm. For CONV2, we have used the technique proposed in [12] to convert the 5x5 filters into four 3x3 filters. Hence all four layers have a filter size of 3x3.


[0033] The implementation in [13] has four arrays of 64 PEs each. Multiple arrays can be dynamically tied together in cooperative mode to form larger arrays. NP is the number of independent arrays. Reducing NP (having larger arrays) reduces the required DDR bandwidth, since all PEs in one independent array share the same memory interface. However, for maximum PE utilization, the block size has to be equal to the number of PEs in each independent array.

[0034] In the blocked Winograd algorithm, the two matrices to be multiplied are of size T x C and C x F. From Table I, for the last three convolution layers, the maximum tile count is only 49, which means that in a 64-PE array some of the PEs will always remain idle. Hence, we have changed the PE array configuration to eight arrays of 32 PEs each and use only the independent mode in the case of the blocked Winograd algorithm. Note that the total number of PEs remains the same as in [13], and the entire configuration used in [13] is possible by interconnecting multiple arrays. A larger number of independent PE arrays leads to a higher DDR bandwidth requirement. However, the inputs to the PE arrays are provided by the Data Transform Unit, and efficient input feature reuse strategies have been employed to mitigate this issue. Also, each Winograd operation is converted to q x q GEMM operations, all of which can be performed in parallel. Hence, for all the CONV layer computations using the Winograd algorithm, we have fixed NP as 8. Sub-block size is another parameter that affects performance. The sub-block sizes ST and SF have to be closely matched for maximum performance; unbalanced blocking can result in more time spent loading the larger block into the PEs. However, since we are operating all arrays in independent mode, we fix SF to 32, the number of PEs in each array, and ST to a value close to 32, which reduces the number of sub-blocks T/ST.
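A quick check of the 49-tile figure, assuming the standard 13x13 AlexNet feature maps for CONV3 to CONV5 and an output tile size m = 2 (both assumptions on our part, since Table I is not shown above):

```python
import math

H = W = 13   # assumed CONV3-CONV5 feature map size in AlexNet
m = 2        # output tile size for F(2x2, 3x3)
T = math.ceil(H / m) * math.ceil(W / m)
print(T)     # 49 -> fewer tiles than one 64-PE array, motivating 32-PE independent arrays
```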
[0035] In [13], NP and the sub-block size are the two parameters that are leveraged to extract the optimum performance for each layer. We have already fixed these two parameters for Winograd based convolutions. However, in the case of the Winograd algorithm, the Winograd tile size is an additional parameter which affects the performance. Increasing the tile size reduces the computations and thus increases the performance, but larger tile sizes demand more BRAM for the three storages, and larger tile sizes also imply lower precision. If lower precision levels are acceptable, we can use the tile size to trade off BRAM requirement against performance. We have used four tile sizes, namely F(2,3), F(3,3), F(4,3) and F(6,3).
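The multiplication-count reductions behind this trade-off follow directly from the tile sizes: an m x m output tile needs m^2 * r^2 multiplications with direct convolution but only (m + r - 1)^2 after the Winograd transform. The short calculation below reproduces the 2.25x figure for F(2,3) and the 5.06x figure for F(6,3) quoted elsewhere in this disclosure.

```python
r = 3
for m in (2, 3, 4, 6):
    direct = (m * r) ** 2            # multiplications per m x m output tile, direct method
    winograd = (m + r - 1) ** 2      # multiplications per tile after the Winograd transform
    print(f"F({m},{r}): {direct}/{winograd} = {direct / winograd:.2f}x")
# F(2,3): 2.25x, F(3,3): 3.24x, F(4,3): 4.00x, F(6,3): 5.06x
```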

[0036] The predicted additional BRAM requirements are given in Table II. Fig. 5 shows the expected compute time for one convolution with varying tile sizes for the CONV2 to CONV5 layers. We have used lower bounds for the expected compute time.

[0037] Fig. 5 shows that the compute time is minimum using F(3,3) and increases for larger tile sizes. These layers have smaller image feature sizes, so large tile sizes result in increased zero padding. With this tile size, the maximum BRAM requirement is 197, for CONV3. For CONV2, maximum performance is obtained with tile F(6,3), and its BRAM requirement is within 197. Thus we select F(6,3) for CONV2 and F(3,3) for all other layers, with the number of BRAMs fixed at 197.
[0038] With this configuration, we have performed Winograd based convolution for the four AlexNet layers on the current embodiment. Table III shows the performance comparison of these layers with [13].


[0039] In [13], performance is reported in terms of GFLOPs, which denotes giga floating point operations. However, since the Winograd algorithm leads to a reduction in computation, a performance comparison with direct convolution techniques using GFLOPs is not fair. In such cases, the number of convolutions per second is a better metric for comparison.

[0040] Table III shows that the current embodiment gives a considerable improvement in performance in terms of convolutions/second for all four layers. The CONV2 layer shows the maximum improvement of 4.02x, against an expected improvement of 5.06x for F(6,3) Winograd tiles. For the other layers, the improvement ranges from 1.4x to 2.72x.

[0041] Table IV gives the FPGA resource utilization of the current embodiment in comparison with [13]. Table IV shows that very few additional FPGA resources are needed to introduce the Winograd algorithm: only 13.8% of the total Look Up Table (LUT) resources and 13.4% of the total BRAMs are required. We are able to achieve a significant performance improvement with very few additional hardware resources, which shows the effectiveness of our approach. Note that the current embodiment can compute direct convolutions and GEMM operations with the same performance as reported in [13]. In terms of GFLOPs, the performance is higher for all layers except CONV2. The reduced performance for CONV2 indicates that the blocking strategy employed for this layer is not efficient enough to fully exploit the reduction in computations from using the F(6,3) tile.

[0042] Table V compares the performance of the current embodiment with other state-of-the-art FPGA based CNN accelerators. In Table V, GFLOPs represents giga floating point operations and GOPs represents giga operations. Among the floating point implementations, only [9] shows higher performance, and the difference is only 5%. As mentioned earlier, GFLOPs is not the correct metric for comparing architectures that use different convolution algorithms. However, these results show that the current embodiment gives the same level of performance as the state-of-the-art floating point based implementations.

[0043] A use case of the complete system is shown in Fig. 6. Input features are fed to the CNN, where feature extraction is performed in the convolutional layers. Sub-sampling is performed by the pooling layers, and finally feature classification is done by the fully connected layers. For example, the CNN can be trained on thousands of images of animals, after which a prediction can be made as to which animal a given image contains. The CNN model can be trained on any class of interest.

We claim:
1. A method for accelerating convolutional neural networks on FPGA, comprising:
a novel unified Winograd minimal filtering and GEMM architecture for CNNs;
an architecture where both Winograd minimal filtering and GEMM can be performed using the same array of processing elements; and
a study of the trade-off between on-chip memory and performance for the AlexNet model on this architecture.
2. The unified architecture of claim 1, wherein the architecture gives the most efficient implementation for convolutional layers irrespective of the kernel sizes, and also for fully connected layers.
3. The blocked Winograd filtering algorithm of claim 1, wherein the input and kernel transformations are performed at sub-matrix level and GEMM is performed on the transformed sub-matrices.
4. The transform blocks of claim 1, wherein the data transformation (Data Transform Unit) and the output transformation (Output Transform Unit) are performed.

Documents

Application Documents

# Name Date
1 201841038043-REQUEST FOR EXAMINATION (FORM-18) [08-10-2018(online)].pdf 2018-10-08
2 201841038043-REQUEST FOR EARLY PUBLICATION(FORM-9) [08-10-2018(online)].pdf 2018-10-08
3 201841038043-FORM-9 [08-10-2018(online)].pdf 2018-10-08
4 201841038043-FORM 18 [08-10-2018(online)].pdf 2018-10-08
5 201841038043-FORM 1 [08-10-2018(online)].pdf 2018-10-08
6 201841038043-DRAWINGS [08-10-2018(online)].pdf 2018-10-08
7 201841038043-COMPLETE SPECIFICATION [08-10-2018(online)].pdf 2018-10-08
8 abstract 201841038043.jpg 2018-10-11
9 201841038043-POA [22-06-2021(online)].pdf 2021-06-22
10 201841038043-OTHERS [22-06-2021(online)].pdf 2021-06-22
11 201841038043-FORM-26 [22-06-2021(online)].pdf 2021-06-22
12 201841038043-FORM 3 [22-06-2021(online)].pdf 2021-06-22
13 201841038043-FORM 13 [22-06-2021(online)].pdf 2021-06-22
14 201841038043-FER_SER_REPLY [22-06-2021(online)].pdf 2021-06-22
15 201841038043-COMPLETE SPECIFICATION [22-06-2021(online)].pdf 2021-06-22
16 201841038043-FER.pdf 2021-10-17
17 201841038043-US(14)-HearingNotice-(HearingDate-13-02-2024).pdf 2024-01-11
18 201841038043-Correspondence to notify the Controller [10-02-2024(online)].pdf 2024-02-10
19 201841038043-Annexure [10-02-2024(online)].pdf 2024-02-10
20 201841038043-Written submissions and relevant documents [19-02-2024(online)].pdf 2024-02-19
21 201841038043-Annexure [19-02-2024(online)].pdf 2024-02-19

Search Strategy

1 2020-12-1716-56-13E_22-12-2020.pdf