Specification
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR ITERATIVE KNOWLEDGE DISTILLATION FOR NEURAL NETWORK COMPRESSION
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 202021055409, filed on December 19, 2020. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The disclosure herein generally relates to the field of neural network compression, and, more particularly, to a method and system for iterative knowledge distillation for neural network compression.
BACKGROUND
Deep Neural Networks have achieved remarkable results in various fields such as medical diagnosis, finance, drug discovery, speech recognition, and space exploration. However, they often have hundreds of millions of parameters, leading to a large model size. This is a burden while training and eventually deploying the model on real-time systems, where predictions are required to be almost instantaneous and accurate in a resource-constrained environment. As larger neural networks are used to model data, it becomes imperative to consider their storage, computational and power requirements.
Model compression refers to techniques for simplifying a large neural network into one that requires fewer resources, usually storage and computation, with no significant loss in performance. A simple form of model compression is obtained, for example, by progressively dropping edges (synaptic connections) with low weights, as long as there is no significant change in estimates of model performance. However, weight-based dropout of edges (and of nodes which, as a result, have no incoming edges) is not necessarily the best way to compress a model, because there is no way to determine, in an absolute sense, which nodes should be removed from a neural network (NN), as the internals of the NN are not explainable.
A well-known model compression method is knowledge distillation (KD). In KD, a complex model transfers knowledge to a simpler model. The complex model is often referred to as the teacher and the simpler model as the student. In the conventional KD method, the teacher and the student are trained on a training dataset to generate a soft output for a given input. Further, the error between the outputs of the teacher and the student is computed and, based on the error, the parameters of the student are updated using techniques such as backpropagation to minimize the error. In the conventional KD method, the teacher model needs to be provided explicitly. Further, applying KD only once may not result in optimal compression.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
For example, in one embodiment, a method for iterative knowledge distillation for neural network compression is provided. The method comprises receiving training data, validation data, and a threshold error as input; training a neural network using the training data; refining the neural network by block reduction to get a refined neural network. Further, the method comprises iteratively performing knowledge distillation on the refined neural network by: identifying a teacher for the refined neural network based on a training type opted for training, wherein in the first iteration the trained neural network is the teacher, and the refined neural network is a student; training the refined neural network using a knowledge distillation process, wherein the knowledge distillation process transfers knowledge from the teacher to the student; determining a validation error of the student using the validation data; comparing the validation error with the threshold error; and performing based on the comparison one of: (a) terminating the knowledge distillation if the validation error is greater than the threshold error and (b) refining the student by block reduction.
In another aspect, a system for iterative knowledge distillation for neural network compression is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to receive training data, validation data, and a threshold error as input; train a neural network using the training data; refine the neural network by block reduction to get a refined neural network. Further, the one or more hardware processors are configured to iteratively perform knowledge distillation on the refined neural network by: identifying a teacher for the refined neural network based on a training type opted for training, wherein in the first iteration the trained neural network is the teacher, and the refined neural network is a student; training the refined neural network using a knowledge distillation process, wherein the knowledge distillation process transfers knowledge from the teacher to the student; determining a validation error of the student using the validation data; comparing the validation error with the threshold error; and performing based on the comparison one of: (a) terminating the knowledge distillation if the validation error is greater than the threshold error and (b) refining the student by block reduction.
In yet another aspect, there are provided one or more non-transitory machine readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors causes a method for iterative knowledge distillation for neural network compression. The method comprises receiving training data, validation data, and a threshold error as input; training a neural network using the training data; refining the neural network by block reduction to get a refined neural network. Further, the method comprises iteratively performing knowledge distillation on the refined neural network by: identifying a teacher for the refined neural network based on a training type opted for training, wherein in the first iteration the trained neural network is the teacher, and the refined neural network is a student; training the refined neural network using a knowledge distillation process, wherein the knowledge distillation process transfers knowledge from the teacher to the student; determining a validation error of the student using the validation data; comparing the validation error with the threshold error; and performing based on the comparison one of: (a) terminating the knowledge distillation if the validation error is greater than the threshold error and (b) refining the student by block reduction.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary block diagram of a system implementing iterative knowledge distillation for neural network compression, according to some embodiments of the present disclosure.
FIG. 2 is a flow diagram illustrating a method for iterative knowledge distillation for neural network compression, according to some embodiments of the present disclosure.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
The embodiments herein provide a method and system for iterative knowledge distillation for neural network compression. A neural network is trained using training data. Further, the trained neural network is refined by a block reduction technique to get a refined neural network. Further, knowledge distillation is iteratively performed on the refined neural network to achieve model compression. At each iteration, a teacher is selected based on a training type and the refined neural network is trained as the student using a knowledge distillation process. Further, the validation error of the student is determined using a validation dataset and the validation error is compared with a threshold error. If the validation error is greater than the threshold error, then the iterative knowledge distillation is terminated. Else, the student is refined using block reduction and the next iteration of the knowledge distillation is performed to further compress the refined neural network (student). Thus, the method and system disclosed herein provide a systematic way of iteratively applying KD by selecting a suitable teacher based on the training type to achieve increased model compression without significant loss in the predictive accuracy of the student.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 2, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary block diagram of a system implementing iterative knowledge distillation for neural network compression, according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or Input/Output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The memory 102 comprises a database 108. The one or more processors 104 that are hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
The database 108 may store information including, but not limited to, information associated with: (i) a teacher model, and (ii) a student model generated at each iteration. Further, the database 108 stores information pertaining to inputs fed to the system 100 and/or outputs generated by the system (e.g., at each stage), specific to the methodology described herein. Functions of the components of the system 100 are explained in conjunction with the flow diagram depicted in FIG. 2 for iterative knowledge distillation for neural network compression.
FIG. 2 is a flow diagram illustrating the method (200) for iterative knowledge distillation for neural network compression, according to some embodiments of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of the flow diagram as depicted in FIG. 2. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
Referring to the steps of the method 200 in FIG. 2, at step 202 of the method 200, the one or more hardware processors 104 are configured to receive training data, validation data and a threshold error as input. Further, at step 204 of the method 200, the one or more hardware processors 104 are configured to train a neural network using the training data. Further, at step 206 of the method 200, the one or more hardware processors 104 are configured to refine the neural network by block reduction to get a refined neural network. The refined neural network has a simpler structure than the neural network, i.e., it has fewer parameters than the neural network, thereby resulting in a smaller size. In an embodiment, block reduction is achieved by performing one or more of: (i) pruning half of the convolutional filters of each layer in the neural network, (ii) pruning based on statistics of the weights, and (iii) pruning structural entities of the neural network comprising channels, nodes, filters, and layers.
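As an illustration of option (i) above, the following is a minimal sketch, assuming a PyTorch model (the disclosure itself is framework-agnostic), of block reduction that prunes half of the convolutional filters of each layer using structured L2-norm pruning; physically rebuilding a smaller student architecture from the surviving filters is a further step not shown here.

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    def block_reduce(model: nn.Module, amount: float = 0.5) -> nn.Module:
        # prune half of the convolutional filters of each layer (option (i))
        for module in model.modules():
            if isinstance(module, nn.Conv2d):
                # remove whole filters (dim=0, i.e., output channels) with the smallest L2 norm
                prune.ln_structured(module, name="weight", amount=amount, n=2, dim=0)
                prune.remove(module, "weight")  # make the pruning permanent
        return model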
Returning to method 200, at step 208, the one or more hardware processors 104 are configured to iteratively perform knowledge distillation on the refined neural network by:
Identifying (208A) a teacher for the refined neural network. In the first iteration, the trained neural network is the teacher and the refined neural network is a student. In subsequent iterations, the teacher is chosen based on a training type opted for training. In an embodiment, the training type is one of: (i) an ancestral training type and (ii) a parental training type. If the ancestral training type is opted for in an iteration, then the refined neural network trained in any of the previous iterations is selected as the teacher. If the parental training type is opted for in an iteration, then the refined neural network trained in the immediately previous iteration is selected as the teacher. Once the teacher is selected, the refined neural network is trained using a knowledge distillation process (208B). As understood by a person skilled in the art, the knowledge distillation process transfers knowledge from the teacher to the student.
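The teacher selection at step 208A can be sketched as follows. This is an illustrative helper only: it assumes the models trained in earlier iterations are retained in a list, and that the ancestral choice is made at random (the disclosure only requires that any previously trained network may serve as the teacher).

    import random

    def select_teacher(models: list, training_type: str):
        # models[0] is the initially trained network M0, models[-1] the latest student
        if training_type == "parental":
            return models[-1]          # immediately previous iteration
        elif training_type == "ancestral":
            return random.choice(models)  # any previous iteration
        raise ValueError(f"unknown training type: {training_type}")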
After training the refined neural network (student), the validation error (alternately referred to as model error) of the student is determined using the validation data (208C). In an embodiment, the validation error is computed as the percentage of misclassifications by the student on the validation data. Further, at step 208D, the validation error is compared with the threshold error. If the validation error is greater than the threshold error, the student cannot be compressed further without significant loss in accuracy, and therefore the knowledge distillation is terminated. Otherwise, the student is refined by block reduction (208E) and the next iteration of the knowledge distillation is performed.
EXPERIMENTAL RESULTS:
The method 200 was implemented on three neural networks: (i) DenseNet-121, (ii) VGGNet-19 and (iii) ResNet-152, using two image datasets: CIFAR-10 and MNIST. The neural networks were trained on the datasets with the Adam optimizer. The training hyper-parameters for the method 200 and the models are as follows: the batch size is set to 64, the learning rate is 10^-4, and the momentum factors are β = (0.9, 0.999). All the experiments were conducted in a Python environment on a machine with 64 GB main memory, a 16-core Intel processor and an 8 GB NVIDIA P4000 graphics processor.
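For reference, the stated training configuration corresponds to a setup of the following form. PyTorch and torchvision are assumed here purely for illustration, since the experiments only specify a Python environment, the Adam optimizer, batch size 64, learning rate 10^-4 and β = (0.9, 0.999).

    import torch
    import torchvision

    # e.g., DenseNet-121 on CIFAR-10 (10 classes); VGGNet-19 and ResNet-152 are set up analogously
    model = torchvision.models.densenet121(num_classes=10)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    # training then iterates over a DataLoader constructed with batch_size=64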
A student model M is a pair (s, p) comprising a structure s and a set of parameters p. The size of the student model M is |p|. The loss for the student model M given training data D is defined according to equation (1).
L(M \mid D) = \frac{1}{m}\sum_{j=1}^{m}\Big\{ 2T^{2}\alpha\, D_{KL}\big(P^{(j)}, Q^{(j)}\big) - (1-\alpha)\sum_{i=1}^{c} y_{i}^{(j)} \log \hat{y}_{i}^{(j)} \Big\}    ...(1)
In equation (1), D_KL denotes the KL-divergence, m denotes the batch size, c denotes the number of classes in the training data D, T is the temperature used to soften the probability distributions, α is the relative importance of the teacher with respect to the hard targets from the training data D, P^(j) and Q^(j) are the T-softened class probability distributions from the student and the teacher model respectively for the j-th instance of the training data D, Y = {y_1, ..., y_c} is the one-hot encoded ground truth, and Ŷ = {ŷ_1, ..., ŷ_c} is the prediction from the student model. In an embodiment, T = 20 and α = 0.7. In alternate embodiments, the values of the hyperparameters T and α are computed using standard procedures such as grid search.
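A minimal sketch of the loss in equation (1), assuming PyTorch and a batch of student and teacher logits of shape [m, c] with integer class labels, is given below. The function name distillation_loss and the use of F.kl_div and F.cross_entropy are illustrative choices, not mandated by the disclosure.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=20.0, alpha=0.7):
        # teacher_logits are assumed to have been computed under torch.no_grad()
        # T-softened distributions: P from the student, Q from the teacher (as in equation (1))
        p = F.softmax(student_logits / T, dim=1)
        log_q = F.log_softmax(teacher_logits / T, dim=1)
        # F.kl_div(input=log Q, target=P, reduction="batchmean") = (1/m) * sum_j D_KL(P^(j), Q^(j))
        kd_term = 2.0 * (T ** 2) * alpha * F.kl_div(log_q, p, reduction="batchmean")
        # hard-target term: (1 - alpha) * (1/m) * sum_j ( -sum_i y_i^(j) log y_hat_i^(j) )
        ce_term = (1.0 - alpha) * F.cross_entropy(student_logits, labels)
        return kd_term + ce_term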
Results of training DenseNet-121, VGGNet-19 and ResNet-152 using the method 200 are given in tables 1A, 1B and 1C respectively, wherein S refers to the student and T refers to the chosen teacher. The baseline model for comparisons is the initial neural network M0. An entry of S = Mj and T = Mi in tables 1A, 1B and 1C denotes that the method 200 resulted in the neural network Mj by selecting Mi as the teacher among the neural networks M0, ..., Mi-1 obtained in previous iterations. E denotes the error of the student model on test data. C is the model compression, that is, the ratio of the size of M0 to the size of the student model.
Table 1A: DenseNet-121
Model        S     CIFAR-10             MNIST
                   T     E     C        T     E     C
Base         M0    -     0.16  1x       -     0.02  1x
Method 200   M1    M0    0.18  2x       M0    0.02  2x
             M2    M1    0.17  2x       M1    0.01  2x
             M3    M2    0.19  5x       M2    0.02  5x
             M4    M3    0.19  9x       M3    0.02  9x
Table 1B: VGGNet-19
Model        S     CIFAR-10             MNIST
                   T     E     C        T     E     C
Base         M0    -     0.12  1x       -     0.01  1x
Method 200   M1    M0    0.17  4x       M0    0.01  4x
             M2    M1    0.20  16x      M1    0.01  16x
             M3    M2    0.28  64x      M2    0.01  64x
             M4    M3    0.44  254x     M3    0.01  254x
Table 1C: ResNet-152
Model        S     CIFAR-10             MNIST
                   T     E     C        T     E     C
Base         M0    -     0.24  1x       -     0.02  1x
Method 200   M1    M0    0.20  2x       M0    0.02  2x
             M2    M1    0.20  3x       M1    0.02  3x
             M3    M2    0.20  4x       M2    0.02  4x
             M4    M3    0.19  7x       M3    0.01  7x
Another well-known method for model compression is pruning. Experiments were conducted to further reduce the size of the neural network by performing pruning after compressing the neural network using the method 200. The results were compared with results obtained by compressing the models using pruning alone, without the method 200. The comparison shows that pruning after applying the method 200 gives a 2- to 180-fold increase in compression over pruning without the method 200, depending on the architecture and the training dataset. This gain is equivalent to |Prune(Method 200(M))| / |Prune(M)|, wherein M is the neural network considered.
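The sizes and compression factors reported in the tables can be computed along the following lines. This is a minimal sketch assuming PyTorch models in which pruned parameters have actually been removed, so that parameter counts reflect the compressed size; |M| below is simply the number of parameters of a model M.

    import torch.nn as nn

    def model_size(model: nn.Module) -> int:
        # |M|: total number of parameters in the model
        return sum(p.numel() for p in model.parameters())

    def compression(base: nn.Module, compressed: nn.Module) -> float:
        # "C" in the tables: size of the original model divided by the size of the compressed one
        return model_size(base) / model_size(compressed)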
In an embodiment, a prune function Prune is considered such that the final compressed model Mi’ = Prune(Mi). The experimental results of applying pruning after training the three deep models DenseNet-121, VGGNet-19 and ResNet-152 using the method 200 are given in tables 2A, 2B and 2C respectively.
Table 2A: DenseNet-121
Model        S     CIFAR-10             MNIST
                   T     E     C        T     E     C
Base         M0'   -     0.19  1x       -     0.02  1x
Method 200   M1'   M0    0.17  1x       M0    0.04  2x
             M2'   M1    0.17  1x       M1    0.05  2x
             M3'   M2    0.19  2x       M2    0.04  5x
             M4'   M3    0.20  2x       M3    0.05  9x
Table 2B: VGGNet-19
Model        S     CIFAR-10             MNIST
                   T     E     C        T     E     C
Base         M0'   -     0.16  1x       -     0.02  1x
Method 200   M1'   M0    0.18  2x       M0    0.09  2x
             M2'   M1    0.29  6x       M1    0.02  7x
             M3'   M2    0.30  30x      M2    0.02  32x
             M4'   M3    0.43  60x      M3    0.09  180x
Table 2C: ResNet-152
Model        S     CIFAR-10             MNIST
                   T     E     C        T     E     C
Base         M0'   -     0.23  1x       -     0.03  1x
Method 200   M1'   M0    0.23  2x       M0    0.03  2x
             M2'   M1    0.25  3x       M1    0.05  3x
             M3'   M2    0.26  6x       M2    0.05  6x
             M4'   M3    0.22  10x      M3    0.03  11x
The algorithm for training a neural network using the method 200, opting for the parental training type at every iteration and further applying pruning to the final model, is given below:
Given: Model M0, an iteration bound k, validation data Dv, a performance measure A, a tolerance ε, and a prune function Prune;
i := 0;
M* := M0;
done := false;
while i < k and not done:
    i := i + 1;
    Mi := knowledge distillation with student BlockReduce(Mi-1) and teacher Mi-1;   [parental training]
    if A(Mi, Dv) > ε then done := true;
    else M* := Mi;
return Prune(M*);
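Putting the pieces together, a minimal Python sketch of the above algorithm with the parental training type is given below. It assumes the block_reduce and distillation_loss helpers sketched earlier; the kd_train, evaluate and iterative_kd names, the number of distillation epochs, and the decision to leave the final Prune step out of the function are illustrative assumptions, not specified by the disclosure.

    import copy
    import torch

    def evaluate(model, val_loader):
        # validation error: fraction of misclassified samples on the validation data Dv
        model.eval()
        wrong, total = 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                wrong += (model(x).argmax(dim=1) != y).sum().item()
                total += y.numel()
        return wrong / total

    def kd_train(student, teacher, train_loader, epochs=10, lr=1e-4):
        # train the student against the (frozen) teacher using the loss of equation (1)
        optimizer = torch.optim.Adam(student.parameters(), lr=lr, betas=(0.9, 0.999))
        teacher.eval()
        student.train()
        for _ in range(epochs):
            for x, y in train_loader:
                with torch.no_grad():
                    teacher_logits = teacher(x)
                loss = distillation_loss(student(x), teacher_logits, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return student

    def iterative_kd(m0, k, train_loader, val_loader, tolerance):
        # parental training: the teacher is always the model from the previous iteration
        best, teacher = m0, m0
        for _ in range(k):
            student = block_reduce(copy.deepcopy(teacher))       # block reduction (206 / 208E)
            student = kd_train(student, teacher, train_loader)   # steps 208A-208B
            if evaluate(student, val_loader) > tolerance:        # steps 208C-208D
                break                                            # terminate the distillation
            best, teacher = student, student
        return best   # the final prune step, Prune(M*), may then be applied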