Abstract: This disclosure relates generally to deep neural network (DNN) models. Typically, devices working at the edge of a network face a number of limitations while offering solutions based on deep learning methods. The disclosed method and system facilitate compressing the DNN, thereby bringing DNNs to the edge for IoT applications. The system applies black-box function approximation to directly model the input-output relationship of the DNN model. The output is obtained at a penultimate layer of the DNN model, and an approximation function approximating the input-output relation of the DNN model is determined. The approximation function is applied to the output layer of the DNN model to obtain a compressed DNN model. The disclosed system is able to achieve high compression rates with minimal loss in accuracy. The system dynamically determines a value of the compression factor that gives an optimal accuracy by trading off the expected accuracy and the expected compressed size of the DNN model.
DESC:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR DYNAMIC COMPRESSION OF DEEP NEURAL NETWORK (DNN)
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional application no. 202021053450, filed on December 08, 2020. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The disclosure herein generally relates to deep neural network compression, and, more particularly, to a method and system for dynamic compression of a deep neural network (DNN) model.
BACKGROUND
In the last decade, deep learning has revolutionized all aspects of data analysis. It has been applied to computer vision, speech analysis, natural language processing, medical imaging, bioinformatics, and so on. Neural networks have been known for their function approximation capacity since the earlier decades. Recent studies have shown that deep neural networks can approximate more complex functions than their shallow counterparts.
A huge number of Internet of Things (IoT) devices are available, and the number is increasing rapidly. A substantial fraction of those devices is available to perform different intelligent tasks. Such devices, mostly working at the edge of the network, often face a number of limitations while offering solutions using state-of-the-art learning algorithms like deep learning methods. Fundamentally, deep learning ensures a higher degree of learnability from the training dataset, but the constructed models are often too expensive to run on sensor nodes, hand-held devices, smart appliances, smartphones and other edge devices for delivering reliable and expected performance when deployed to solve real-world problems. The computing and energy costs of running massive deep neural network (DNN) models are the biggest hindrance. Edge devices have limited computing capabilities. The power needed to run such deep neural networks may simply not be available from the battery of an edge device. Additionally, the deep neural networks leave a carbon footprint.
Various industries from different verticals require inferencing solutions to be deployed at the edge and such deployment infrastructure is restricted by the typical limitations imposed by the edge devices. Hence, DNN compression is of utmost importance to bring the capability of the DNNs into the edge devices for performing different intelligent tasks like time series classification.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for dynamic compression of DNN model is provided. The method includes receiving, via one or more hardware processors, a user input defining a plurality of constraints associated with the accuracy of the inference expected from a DNN model and an expected compressed size of the DNN model for the inference. Further, the method includes receiving, via the one or more hardware processors, a time-series input data at an input layer of the DNN model, the DNN model comprising at least the input layer, an output layer and a penultimate layer. Furthermore, the method includes obtaining a prefinal output data from the penultimate layer of the DNN model, via the one or more hardware processors. Also, the method includes determining, via the one or more hardware processors, an approximation function indicative of an approximate mapping of the input time-series data with the prefinal output data using a multi-variate multi-output regression technique, such that the approximation function satisfies the plurality of constraints, wherein the approximation function is determined using one of linear regression technique, kernelization technique, and a piecewise linear approximation technique. Moreover, the method includes applying the approximation function to the prefinal layer of the DNN model to compress the DNN model and obtain a compressed DNN model, wherein an amount of compression of the DNN model is determined based on a compression factor, wherein determining the approximation function satisfying the plurality of constraints comprises dynamically determining a value of the compression factor that gives an optimal accuracy by trading-off the accuracy expected and the expected compressed size of the DNN for inference.
In another aspect, a system for dynamic compression of DNN model is provided. The system includes a memory storing instructions, one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to receive a user input defining a plurality of constraints associated with the accuracy of the inference expected from a DNN model and an expected compressed size of the DNN model for the inference. The one or more hardware processors are further configured by the instructions to receive a time-series input data at an input layer of the DNN model, the DNN model comprising at least the input layer, an output layer and a penultimate layer. Furthermore, the one or more hardware processors are configured by the instructions to obtain a prefinal output data from the penultimate layer of the DNN model. Also, the one or more hardware processors are configured by the instructions to determine an approximation function indicative of an approximate mapping of the input time-series data with the prefinal output data using a multi-variate multi-output regression technique, such that the approximation function satisfies the plurality of constraints, wherein the approximation function is determined using one of linear regression technique, kernelization technique, and a piecewise linear approximation technique. Moreover, the one or more hardware processors are configured by the instructions to apply the approximation function to the prefinal layer of the DNN model to compress the DNN model and obtain a compressed DNN model, wherein an amount of compression of the DNN model is determined based on a compression factor, wherein determining the approximation function satisfying the plurality of constraints comprises dynamically determining a value of the compression factor that gives an optimal accuracy by trading-off the accuracy expected and the expected compressed size of the DNN for inference.
In yet another aspect, a non-transitory computer readable medium for a method for dynamic compression of DNN model is provided. The method includes receiving, via one or more hardware processors, a user input defining a plurality of constraints associated with the accuracy of the inference expected from a DNN model and an expected compressed size of the DNN model for the inference. Further, the method includes receiving, via the one or more hardware processors, a time-series input data at an input layer of the DNN model, the DNN model comprising at least the input layer, an output layer and a penultimate layer. Furthermore, the method includes obtaining a prefinal output data from the penultimate layer of the DNN model, via the one or more hardware processors. Also, the method includes determining, via the one or more hardware processors, an approximation function indicative of an approximate mapping of the input time-series data with the prefinal output data using a multi-variate multi-output regression technique, such that the approximation function satisfies the plurality of constraints, wherein the approximation function is determined using one of linear regression technique, kernelization technique, and a piecewise linear approximation technique. Moreover, the method includes applying the approximation function to the prefinal layer of the DNN model to compress the DNN model and obtain a compressed DNN model, wherein an amount of compression of the DNN model is determined based on a compression factor, wherein determining the approximation function satisfying the plurality of constraints comprises dynamically determining a value of the compression factor that gives an optimal accuracy by trading-off the accuracy expected and the expected compressed size of the DNN for inference.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates a generalized workflow for DNN model deployment on edge devices.
FIG. 2 illustrates a network implementation of a system for dynamic compression of a DNN model for inference according to some embodiments of the present disclosure.
FIG. 3 is a flow diagram of a method for dynamic compression of a DNN model for inference according to an embodiment of the present disclosure.
FIG. 4 illustrates an example representation of the DNN model and a function approximation representation of the DNN model according to some embodiments of the present disclosure.
FIG. 5 illustrates an example representation of piecewise linear approximation of a function for DNN model compression according to an embodiment of the present disclosure.
FIG. 6 is a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
FIG. 7 illustrates plots representing the accuracy of compressed models obtained by the disclosed method and system according to an embodiment of the present disclosure.
FIGS. 8A-8D illustrate plots representing an observed difference in accuracy (compressed model accuracy - target accuracy) and compression factors in log10 scale in case of each of the compression methods.
DETAILED DESCRIPTION OF EMBODIMENTS
For solutions depending on frequent communication with cloud-deployed models and services, the availability of quality network infrastructure and its upkeep assume great importance. In applications which demand real-time response with high reliability and low latency, any disruption in network connectivity may lead to disastrous consequences. The security and privacy issues in today's big data analytics environment (Microsoft Azure™, Amazon Web Services™, Google Cloud™, etc.) are considerable. For a significant number of applications, specifically for healthcare applications, data privacy policies and guidelines must be complied with, which primarily restrict data movement from edge devices to the cloud infrastructure. It is understood that industries from different verticals require inferencing solutions to be deployed at the edge and such deployment infrastructure is restricted by the typical limitations imposed by the edge devices. For example, GE Healthcare's pneumothorax detection system uses in situ inferencing of pneumothorax condition from x-ray images. In fact, the inferencing engine is part of the X-ray device. Likewise, AI-enabled ECG analysis for cardiovascular disease (CVD) diagnosis requires inferencing at the edge devices. Similarly, automated structural assessment of underground pipelines and sewage systems (where human assessment is difficult, risky and time consuming) demands an inferencing solution at the sensor nodes of the water or sewage pipelines. All these examples follow a similar workflow for the deployment of DNN models on constrained edge devices, as shown in FIG. 1.
Referring to FIG. 1, a generalized workflow for DNN model deployment on edge devices is illustrated. As illustrated, trained DNN models such as a trained DNN model 102 (obtained by training a DNN model, for example a DNN model 104, using training data, for example training data 106) are required to be compressed and optimized so that they can meet the constraints of edge devices.
The major hindrances to running massive deep neural networks at edge devices include computing requirements, energy budget, and privacy and security. Such aforementioned issues may be addressed simultaneously if the DNNs could somehow be made smaller and more efficient. DNN compression is of utmost importance to bring the capability of the DNNs into the edge devices for performing different intelligent tasks like time series classification.
Typically, DNN compression can be performed using one of four techniques: pruning → quantization → coding, low-rank (matrix / tensor) factorization, parameterizing convolutional filters, and knowledge distillation.
Amongst the known techniques, the three-step approach (pruning → quantization → coding) is motivated by the compression paradigm arising in signal processing and communication. In the first step, low-valued filter weights are pruned; this is based on the assumption that small weights are less relevant. The filter weights may be pruned based on saliency, using advanced ℓ1-regularization based sparsity constraints or sensitivity-driven pruning. The second step is quantization. By reducing the number of bits (say from 32 to 4/8), a considerable reduction in memory footprint can be achieved. Studies proposed hierarchical schemes or learned schemes for more efficient quantization. After pruning and quantization, the network needs to be retrained with the original training data. The newly trained network is further compressed via Huffman coding by storing the binary representation of the weights in a more efficient manner.
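For illustration, a minimal numpy sketch of the first two steps described above (magnitude-based pruning followed by uniform k-bit quantization of a weight matrix) is given below; the pruning ratio, bit-width and uniform quantizer are illustrative assumptions rather than the specific schemes cited above, and the coding and retraining steps are omitted.

```python
import numpy as np

def prune_and_quantize(w, prune_ratio=0.5, n_bits=4):
    # Step 1 (pruning): zero out the smallest-magnitude fraction of the weights,
    # on the assumption that small weights are less relevant.
    threshold = np.quantile(np.abs(w), prune_ratio)
    w_pruned = np.where(np.abs(w) < threshold, 0.0, w)

    # Step 2 (quantization): map the surviving weights onto 2**n_bits uniformly
    # spaced levels, so each weight is stored with n_bits instead of 32 bits.
    levels = 2 ** n_bits
    w_min, w_max = w_pruned.min(), w_pruned.max()
    step = (w_max - w_min) / (levels - 1)
    codes = np.round((w_pruned - w_min) / step).astype(np.uint8)  # compact codes
    w_dequant = codes * step + w_min                              # reconstruction
    return codes, w_dequant

# Example on a hypothetical 256 x 128 weight matrix
codes, w_hat = prune_and_quantize(np.random.randn(256, 128))
```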
The technique of low-rank factorization has its roots in image compression, where the high singular values of the pixel matrix and the corresponding singular vectors are stored. The idea in DNN compression is similar. For the convolutional layers, the 2D filters are stacked one after the other to form a 3D tensor. Then, tensor factorization tools are used to decompose it into components that require less memory than the original. The fully connected layer is treated as a matrix, and standard factorization tools such as singular value decomposition are used for compressing it.
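A hedged sketch of this factorization idea for a fully connected layer is shown below: the weight matrix is truncated to rank r via singular value decomposition, so storage drops from m·n values to r·(m + n); the rank and matrix size are illustrative assumptions.

```python
import numpy as np

def low_rank_factorize(w, rank):
    # Decompose W (m x n) into U_r (m x r) and V_r (r x n) so that W is
    # approximated by U_r @ V_r.
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]   # absorb the singular values into the left factor
    v_r = vt[:rank, :]
    return u_r, v_r

w = np.random.randn(512, 1024)            # a hypothetical fully connected layer
u_r, v_r = low_rank_factorize(w, rank=32)
relative_error = np.linalg.norm(w - u_r @ v_r) / np.linalg.norm(w)
```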
The learnt filters (in a deep convolutional neural network) are not always independent of each other. This is the reason they can be compacted via low-rank factorization. Another approach to deal with redundancy is to consider some of the filters as basis filters and represent the rest of the filters as a linear or affine transformation of this basis. In deep neural networks, specifically in convolutional neural networks the convolutional filters act as a basis for analyzing the input signal / image. These are akin to the atoms in dictionary learning and basis in transform learning. The same problem (as faced by DNNs today) arose in dictionary and transform learning – how to efficiently store the atoms and basis. The parameterized versions of dictionaries and transforms were proposed to address this issue. Instead of trying to learn a dense dictionary / transform from the data, they parameterized these from known signal processing transforms like wavelet or DCT. This way, instead of storing the entire basis, one could only store a sparse set of wavelet / DCT coefficients.
The same approach was proposed in a study for deep CNN compression. Instead of storing all the convolutional filters, a set of basis filters were used which would be translated / rotated to produce all the filters. A few similar approaches based on different kinds of transformation (not translation/rotation) are known. Said approach is only applicable to deep CNNs and cannot generalize to other networks like deep belief network or stacked autoencoder.
Knowledge distillation is yet another approach used for compressing DNNs. Knowledge distillation approximates a massive DNN by using shallower architectures. The idea is to train a smaller network that can approximate the input-output relationship of the deeper network. The output in this case is not the binarized target vector (in classification) but the real values just before the soft-max operation. Consequently, this compression paradigm is only applicable for DNNs with a soft-max layer. For regression problems, where the outputs are real, this limitation does not arise. In knowledge distillation, the trained DNN is called the teacher and the shallower model (approximating the deeper one) is the student. Given this basic model, the only difference among the published studies arises out of the different types of cost functions used to penalize the discrepancy between the student's and teacher's outputs.
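As a hedged illustration of this paradigm, the sketch below uses a plain squared-error penalty between the teacher's and the student's pre-softmax outputs; published variants differ mainly in the choice of this cost function, and the model and tensor names used here are assumptions.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits):
    # Penalize the discrepancy between the real-valued outputs obtained just
    # before the soft-max operation of the teacher and the student.
    return tf.reduce_mean(tf.square(teacher_logits - student_logits))

# Hypothetical training step: `student` is the small network being fitted and
# `teacher_logits` are precomputed from the large trained DNN for the batch.
# with tf.GradientTape() as tape:
#     loss = distillation_loss(teacher_logits, student(x_batch, training=True))
```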
The typical techniques available for DNN compression have concentrated on large models. Compressing large models is relatively easy; however, such approaches for compressing large models do not scale down well. For relatively smaller DNN models, they can achieve only a modest compression rate (about 3 times).
Various embodiments disclosed herein provide a method and system for DNN compression that achieve a good compression rate (up to 300 times) for small DNNs in addition to big DNNs. In an embodiment, the disclosed system applies a black-box function approximation approach to model the function learnt by the DNN. The disclosed system approximates the input-output relationship of the DNN using a regression-based function approximation approach. In an embodiment, the disclosed system dynamically provides model compression to adjust to an inference performance (accuracy) requirement of the DNN model by trading off a compression factor, associated with the percentage of compression achieved, against the model inference performance. These and other embodiments of the disclosed method and system are explained further in detail in the following description.
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
Referring now to the drawings, and more particularly to FIG. 2 through 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 2 illustrates a network implementation 200 of a system 202 for dynamic compression of a DNN model for inference according to some embodiments of the present disclosure. Herein, it will be understood that the DNN model is a trained DNN model; however, for brevity of description, the trained DNN model may be referred to as the DNN model throughout the description. Herein, the DNN is modeled as a black box and an input-output relationship of the DNN is approximated, which leads to a multi-variate multi-output regression problem. An important contribution of the disclosed system is that the system considers the output from a penultimate layer of the DNN (and not the final output layer). The system then determines an approximation function mapping the input time-series data with the prefinal output data using the multi-variate multi-output regression technique, such that the approximation function satisfies certain constraints. Herein, the time-series input data may be in the form of one of real vectors, matrices, and tensors.
Although the present disclosure is explained considering that the system 202 is implemented on a server, it may be understood that the system 202 may also be implemented in a variety of computing systems 204, such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. It will be understood that the system 202 may be accessed through one or more devices 206-1, 206-2... 206-N, collectively referred to as devices 206 hereinafter, or applications residing on the devices 206. Examples of the devices 206 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, a smartphone, a tablet computer, a workstation and the like. The devices 206 are communicatively coupled to the system 202 through a network 208.
In an embodiment, the network 208 may be a wireless or a wired network, or a combination thereof. In an example, the network 208 can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 208 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network 208 may include a variety of network devices, including routers, bridges, servers, computing devices, and storage devices. The network devices within the network 208 may interact with the system 202 through communication links.
As discussed above, the system 202 may be implemented in a computing device 204, such as a hand-held device, a laptop or other portable computer, a tablet computer, a mobile phone, a PDA, a smartphone, and a desktop computer. The system 202 may also be implemented in a workstation, a mainframe computer, a server, and a network server. In an embodiment, the system 202 may be coupled to a data repository, for example, a repository 212. The repository 212 may store data processed, received, and generated by the system 202. In an alternate embodiment, the system 202 may include the data repository 212.
The network environment 200 supports various connectivity options such as BLUETOOTH®, USB, ZigBee and other cellular services. The network environment enables connection of devices 206, such as a smartphone, with the server 204, and accordingly with the database 212, using any communication link including the Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 202 is implemented to operate as a stand-alone device. In another embodiment, the system 202 may be implemented to work as a loosely coupled device to a smart computing environment. The components and functionalities of the system 202 are described further in detail with reference to FIGS. 3-8D.
Referring collectively to FIGS. 3-5, components and functionalities of the system 202 for dynamic compression of a DNN model for inference are described in accordance with an example embodiment. For example, FIG. 3 illustrates a flow diagram of a method for dynamic compression of a DNN model for inference, in accordance with an example embodiment of the present disclosure. FIG. 4 illustrates an example representation of the DNN model and a function approximation representation of the DNN model, in accordance with an example embodiment of the present disclosure. FIG. 5 illustrates an example representation of piecewise linear approximation of a function for DNN model compression, according to an embodiment of the present disclosure.
In certain applications, for example a banking application, there may be a requirement for more accurate inference results while the compressed model size may be allowed to remain large. On the contrary, there may be certain applications where the size is of utmost importance while the inference accuracy may be negotiated. In an embodiment, such requirements may be determined by receiving a user input defining a plurality of constraints associated with an accuracy of the inference expected from the DNN model and an expected compressed size of the DNN model for the inference, at 302.
At 304, the method 300 includes receiving a time-series input data at an input layer of the DNN model. The DNN model includes at least the input layer, an output layer and a penultimate layer. An exemplary DNN model 400 is shown in FIG. 4. As illustrated in FIG. 4, the DNN model 400 is shown to include an input layer 402, a plurality of hidden layers collectively marked as 404, a penultimate layer 406 and an output layer 408. The input layer receives training data. In an embodiment, the training data includes time-series input data. The output from the DNN model is a vector obtained just before the soft-max operation from the DNN model. At 306, the method 300 includes obtaining a prefinal output data from the penultimate layer of the DNN model.
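For illustration, a minimal sketch of obtaining the prefinal output data is given below, assuming the trained DNN model is available as a Keras model whose final layer is the softmax output layer (so index -2 addresses the penultimate layer); these assumptions are illustrative only.

```python
import tensorflow as tf

def prefinal_outputs(trained_dnn, x_train):
    # Build a truncated model that stops at the penultimate layer and returns
    # the real-valued vectors obtained just before the soft-max operation.
    prefinal_model = tf.keras.Model(inputs=trained_dnn.input,
                                    outputs=trained_dnn.layers[-2].output)
    return prefinal_model.predict(x_train)  # one prefinal vector per input sample
```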
In an embodiment, the disclosed system approximates the input-output relationship of the DNN model. The disclosed system 202 utilizes a regression based function approximation approach for approximating the input-output relationship of the DNN model. In an embodiment, the regression-based function approximation facilitates in determining an approximation function by solving multi-variate multi-output regression problem.
At 308, the method 300 includes determining an approximation function indicative of an approximate mapping of the input time-series data with the prefinal output data using a multi-variate multi-output regression technique, such that the approximation function satisfies the plurality of constraints. The approximation function is determined using one of a linear regression technique, a kernelization technique, and a piecewise linear approximation technique. The determination of the approximation function using each of the aforementioned techniques is described further below.
In an embodiment, the approximation function is determined using the linear regression technique. Assuming that there are N training samples, the expression for the ith sample reads as in (1).
y_i = A_(m×n) x_i + ε,   i = 1, …, N   (1)
In this formulation, the size of the linear function (A) approximating the DNN model is independent of the number of samples of the training data and simply depends on the length of the input (n) received at the input layer and the dimensionality of the output (m) obtained from the penultimate layer of the DNN model. In the simplest scenario, the modeling error ε is assumed to be normally distributed. Thus, the estimation of A can be expressed as given in (2):
argmin_A Σ_(i=1)^N ‖y_i − A x_i‖_2^2   (2)
Alternately, (2) can be expressed in a more compact fashion as in (3).
argmin_A ‖Y − AX‖_F^2   (3)
where Y = [y_1 | … | y_N] and X = [x_1 | … | x_N].
The formulation (3) will be ill-conditioned. This is because the columns in X are similar to each other (many samples belong to one class and are expected to be similar). To ameliorate this issue, one needs to regularize the objective function. In an embodiment, Tikhonov regularization may be applied to equation (3), which is summarized in (4).
argmin_A ‖Y − AX‖_F^2 + λ‖A‖_F^2   (4)
The function of equation (4) is an objective function which is to be optimized to estimate the linear function approximating the DNN model. Herein, it will be noted that in alternate embodiments, regularization methods other than Tikhonov regularization can be applied to equation (3).
There is a closed form solution to (4) given by (5).
A = Y X^T (X X^T + λI)^(−1)   (5)
Herein, it will be noted that even for the simple linear regression model, there can be several variants. For example, instead of employing the Euclidean distance for the data fidelity term, a Taxi-cab (ℓ1) distance or the ℓ∞ norm can be used. Moreover, other divergence measures like the Kullback-Leibler divergence may be employed. Also, there may be many choices in the selection of the regularization term (for example, LASSO / ℓ1 norm). Hence, it will be noted that the solution to the linear regression problem may be formulated in a manner distinct from the one disclosed here.
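A minimal numpy sketch of the Tikhonov-regularized estimate in (4)-(5) is given below, assuming the inputs and prefinal outputs are arranged as columns of X (n × N) and Y (m × N); the regularization value is an illustrative assumption.

```python
import numpy as np

def fit_linear_approximation(X, Y, lam=1.0):
    # Closed-form solution (5): A = Y X^T (X X^T + lam I)^{-1}
    n = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))

# The compressed model is simply the m x n matrix A; at inference time the
# prefinal output of a test sample is a single matrix-vector product:
# y_test = A @ x_test
```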
In certain scenarios involving massive DNN models, the input-output relationship may be non-linear. There are two approaches to generalize into non-linear functions. One is via kernelization, and the other via changing the basis from linear to polynomial. The kernelization approach may be taken owing to its simplicity. In an embodiment, determination of the approximation function using the kernelization technique includes a first level of kernelization followed by a second level approximation, as described below:
In the first level of kernelization,
y_i = A φ(x_i),   ∀ i   (6)
Here, φ(·) indicates a non-linear transformation. Using the matrix inversion lemma, the final form can be derived as given in (7):
y^T = k (K + λI)^(−1) Y_train   (7)
Here, k = ⟨φ(x_test), φ(X_train)⟩, that is, a vector with kernel values between the test sample x_test to be predicted and each training sample. K is the kernel matrix with entries K_ij = ⟨φ(x_train,i), φ(x_train,j)⟩, where x_train,i and x_train,j are the ith and jth training samples, respectively.
Compared to the linear solution, (7) has higher memory complexity. The matrix K is dependent on the number of training samples; when the number of training samples is high, the size of K is large. This is likely to lead to a situation where, instead of compressing, the size of the approximation model (compared to the original DNN) may increase. Hence, the first-level kernelization is followed by a second-level approximation to reduce the size of the kernel matrix. In an embodiment, the size of the kernel matrix is reduced by using the Nyström approximation in the second-level approximation. Herein, it will be understood that in other embodiments, the size of the kernel matrix may be reduced by using other techniques which include, but are not limited to, ensemble Nyström, randomized Nyström, Fast-Nys, and so on. Other methods may include finding a feature mapping using the Random Kitchen Sinks (RKS) algorithm, the Quasi-Monte Carlo approach, etc., and block kernel approximation methods. Herein, the basic idea is to exploit the fact that the kernel matrix is highly rank deficient. Therefore, instead of using the entire matrix, it is possible to decompose it into its low-rank components (via singular value decomposition or Cholesky decomposition). The low-rank component is more memory efficient.
The Nyström approximation gives a trade-off between performance (for example, accuracy) and compressibility that can be attained with the approximation model. With a higher number of components, better accuracy may be achieved; however, this may come at the cost of lower compressibility, and vice versa.
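The two-level scheme can be sketched, for example, with scikit-learn: an RBF kernel map approximated with a fixed number of Nyström components, followed by a ridge-regularized linear map to the prefinal outputs. The kernel choice, parameter values and samples-as-rows data layout are illustrative assumptions; n_components is the knob that trades accuracy against compressibility.

```python
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge

def fit_nystrom_approximation(X_train, Y_prefinal, n_components=200,
                              gamma=0.01, lam=1.0):
    # First level: low-rank (Nystroem) approximation of the kernel map, so only
    # n_components landmark points are stored instead of the full K matrix.
    feature_map = Nystroem(kernel='rbf', gamma=gamma, n_components=n_components)
    Z = feature_map.fit_transform(X_train)        # shape: (N, n_components)
    # Second level: multi-output ridge regression to the prefinal outputs.
    reg = Ridge(alpha=lam).fit(Z, Y_prefinal)
    return feature_map, reg

# Inference with the compressed model (x_test as a single row vector):
# y_test = reg.predict(feature_map.transform(x_test.reshape(1, -1)))
```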
In yet another embodiment, the approximation function is determined using the piecewise linear approximation technique. The concept of piecewise linear approximation is shown in FIG. 5.
Any curve can be approximated by a set of straight lines; the greater the number of lines, the better the approximation. When the function / surface is known, an optimal number of pieces required for approximating the surface / function may be determined. However, a DNN is mathematically a complete black box and no functional relationship is known.
End-to-end, a DNN takes in a real vector / matrix / tensor as an input and outputs a binary class label as output. In system identification / function approximation, the input and output are both assumed to be real. Herein, since approximating a mapping from the space of real vectors to that of binary vectors by system identification methods may be challenging, the output from the pre-final layer, i.e. the output before the softmax operation, is taken (instead of the final output of the DNN). At this stage, the vector has the same dimension as the final binary vector but consists of real values.
For an input (x), the operations up to the softmax layer may be represented by g(·); applying the softmax operation to this output yields the final binary label vector (b). Given the one-to-one mapping, no information is lost by considering the real vector instead of the final binary output.
y = g(x); b = softmax(y); b = softmax(g(x)) (8)
It is not known how the surface of g(.) may look. Herein, it may be assumed that the function approximation / system identification can be performed by a set of hyperplanes. The system 202 determines basis for these hyperplanes as described further in the description below.
In an embodiment, the disclosed system 202 determines an optimal number of pieces (n) for the approximation function by clustering the input training data into a plurality of clusters such that the training samples in each cluster form a basis for the hyperplane corresponding to that cluster. To identify the basis for the hyperplanes, the disclosed system 202 clusters the training data and assumes that the samples in each cluster form the basis for the corresponding hyperplane, i.e., if X_c consists of all the training samples belonging to cluster c, the corresponding outputs Y_c (which can be computed by y = g(x)) satisfy
Y_c = A_c X_c   (9)
An approximation function is learned for each of the plurality of clusters using the linear regression technique. Learning the approximation function for a cluster from amongst the plurality of clusters includes computing a weight matrix for the cluster based on a least square technique.
The weight matrix A_c can be computed from the data by solving a least squares problem as described in equation (10) below.
min_(A_c) ‖Y_c − A_c X_c‖_F   (10)
Herein, it is assumed that the input-output relationship of each cluster is linear (regression). This constitutes the training phase. During testing, for a sample xtest, the system 202 determines which cluster it belongs to by comparing its distances from the cluster centers. Once the cluster is identified, the system 202 applies corresponding learnt weight Ac on xtest to obtain ytest.
ytest = Acxtest (11)
This ytest represents the value just before the output layer (for example, the softmax layer). The final output, i.e. the binary label is obtained by
btest = softmax(ytest) (12)
The disclosed system 202 determines the approximation function satisfying the plurality of constraints by dynamically determining a value of the compression factor that gives an optimal accuracy by trading off the accuracy expected and the expected compressed size of the DNN for inference. For example, in the present embodiment, the system 202 handles the trade-off between accuracy and compressibility by selecting the number of clusters. If the number of clusters is increased, the surface of the DNN model may be fit better, and hence provide a better approximation, which in turn may lead to better results. But an increase in the number of clusters may result in a decrease in compressibility since more A_c's may have to be stored.
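For illustration, a hedged sketch of the piecewise linear scheme of equations (9)-(12) follows, using k-means clustering and a ridge-regularized least squares fit per cluster; the samples-as-rows layout, number of clusters and regularization value are assumptions made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_piecewise_linear(X, Y, n_clusters=3, lam=1.0):
    # X: N x n training inputs (rows), Y: N x m prefinal outputs (rows).
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    maps = {}
    for c in range(n_clusters):
        Xc, Yc = X[km.labels_ == c], Y[km.labels_ == c]
        n = Xc.shape[1]
        # Regularized least squares per cluster: A_c = Yc^T Xc (Xc^T Xc + lam I)^{-1}
        maps[c] = Yc.T @ Xc @ np.linalg.inv(Xc.T @ Xc + lam * np.eye(n))
    return km, maps

def predict_piecewise(km, maps, x_test):
    # Pick the cluster whose centre is nearest to x_test, apply its A_c (eq. 11),
    # and take the arg-max as a stand-in for the softmax output layer (eq. 12).
    c = int(km.predict(x_test.reshape(1, -1))[0])
    y_test = maps[c] @ x_test
    return np.argmax(y_test)
```

Increasing n_clusters refines the fit but adds one stored A_c per cluster, which is exactly the accuracy/compressibility trade-off described above.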
FIG. 6 is a block diagram of an exemplary computer system 601 for implementing embodiments consistent with the present disclosure. The computer system 601 may be implemented alone or in combination with components of the system 202 (FIG. 2). Variations of computer system 601 may be used for implementing the devices included in this disclosure. Computer system 601 may comprise a central processing unit ("CPU" or "hardware processor") 602. The hardware processor 602 may comprise at least one data processor for executing program components for executing user- or system-generated requests. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD Athlon™, Duron™ or Opteron™, ARM's application, embedded or secure processors, IBM PowerPC™, Intel's Core, Itanium™, Xeon™, Celeron™ or other line of processors, etc. The processor 602 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
Processor 602 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 603. The I/O interface 603 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11 a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
Using the I/O interface 603, the computer system 601 may communicate with one or more I/O devices. For example, the input device 604 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.
Output device 605 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc. In some embodiments, a transceiver 606 may be disposed in connection with the processor 602. The transceiver may facilitate various types of wireless transmission or reception. For example, the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
In some embodiments, the processor 602 may be disposed in communication with a communication network 608 via a network interface 607. The network interface 607 may communicate with the communication network 608. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 608 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc. Using the network interface 607 and the communication network 608, the computer system 601 may communicate with devices 609 and 610. These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system 601 may itself embody one or more of these devices.
In some embodiments, the processor 602 may be disposed in communication with one or more memory devices (e.g., RAM 613, ROM 614, etc.) via a storage interface 612. The storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
The memory devices may store a collection of program or database components, including, without limitation, an operating system 616, user interface application 617, user/application data 618 (e.g., any data variables or data records discussed in this disclosure), etc. The operating system 616 may facilitate resource management and operation of the computer system 601. Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like. User interface 617 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 601, such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems’ Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
In some embodiments, computer system 601 may store user/application data 618, such as the data, variables, records, etc. as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among various computer systems discussed above. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
Additionally, in some embodiments, the server, messaging and instructions transmitted or received may emanate from hardware, including operating system, and program code (i.e., application code) residing in a cloud implementation. Further, it should be noted that one or more of the systems and methods provided herein may be suitable for cloud-based implementation. For example, in some embodiments, some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
An example scenario depicting the results of DNN compression performed by the disclosed system 202/601 is described below.
Example scenario
For the purpose of experimentation, two of the best performing techniques, namely ResNet and InceptionTime, were considered for applications involving compressing DNNs for 1D data analysis. The goal of the experiments was to obtain compressed models that can approximate the performance of these base models. These networks were selected because they have been reported to be proven over multiple 1D time series datasets. The trained models were obtained using the available implementations of these networks for the experiments. InceptionTime is an ensemble of 5 networks with each given an equal weight in the final prediction. For experimentation, only one network was considered, as its classification accuracies were similar to those of the 5-network model. An ensemble of 5 networks would have a larger model size, and the compression method would achieve even better compression performance over it. The compression results were compared against the Model Optimizer tool available as part of the OpenVINO toolkit, which is an industry standard DNN deployment toolkit for edge devices.
The experiment was conducted by considering the standard UCR Time Series Classification Archive dataset. It is a collection of 128 datasets from different time series domains. For the study, 36 datasets from various application domains were considered. Such domains included, for example, Industry 4.0, Healthcare, Utilities, and Mobile computing. This helped in studying the applicability of the disclosed approach in different use cases. Even though these 36 datasets are a subset of the complete repository, they include time series of various types and lengths, and hence seem appropriate for a proof-of-concept study.
For the disclosed techniques, there are a few parameters which need to be tuned. For example, in the case of linear models, it is necessary to specify the regularization parameter λ. The kernels considered in the experiments were radial basis function (RBF), Laplacian and polynomial, defined as in (13), (14), and (15) respectively.
k(x, y) = exp(−γ ‖x − y‖_2^2)   (13)
k(x, y) = exp(−γ ‖x − y‖_1)   (14)
k(x, y) = (γ x^T y + c)^d   (15)
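A direct transcription of the three kernels (13)-(15) is shown below for illustration; the squared Euclidean distance in the RBF kernel follows the common convention and is an assumption about the exact form used.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))       # equation (13)

def laplacian_kernel(x, y, gamma):
    return np.exp(-gamma * np.sum(np.abs(x - y)))      # equation (14)

def polynomial_kernel(x, y, gamma, c, d):
    return (gamma * np.dot(x, y) + c) ** d             # equation (15)
```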
This parameter tuning was done on the training part of the datasets using 5-fold cross validation. For linear ridge regression, λ was tuned; for both the RBF and Laplacian kernels, λ and γ were tuned; and for the polynomial kernel, λ, γ and the degree d were tuned. The parameter search was performed over the following:
• λ: 0 and values ranging from 0.001 to 1000.
• γ: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
• degree, d: polynomials from degree 2 to 7 were considered.
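A hedged scikit-learn sketch of such a grid search is given below for the RBF kernel case; here a kernel ridge model and a regression scoring criterion stand in for the experiment's pipeline, whereas the reported experiments selected parameters by cross-validated classification accuracy (and also included λ = 0 in the grid). The data names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.kernel_ridge import KernelRidge

param_grid = {
    'alpha': np.logspace(-3, 3, 7),   # lambda: 0.001 ... 1000
    'gamma': np.logspace(-4, 4, 9),   # gamma: 0.0001 ... 10000
}
search = GridSearchCV(KernelRidge(kernel='rbf'), param_grid,
                      cv=KFold(n_splits=5),
                      scoring='neg_mean_squared_error')
# search.fit(X_train, Y_prefinal)          # X_train, Y_prefinal are hypothetical
# best_model = search.best_estimator_      # refit on the full training data
```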
The set of parameters which resulted in the maximum cross validation accuracy over the training data was used to train the final model for the given method using the complete training data. The final model was then tested over the test data, and the test accuracies are reported in Tables 1A-1D and Tables 2A-2D. To measure the reduction in space requirements, the compression factor (CF) was computed as given in (16). This gives the factor by which the model file size is reduced.
CF = (Original DNN model file size)/(DNN model size after compression) (16)
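Equation (16) can be evaluated directly from the serialized model files, as in the minimal sketch below (the file paths are hypothetical).

```python
import os

def compression_factor(original_model_path, compressed_model_path):
    # CF = original DNN model file size / compressed model file size (equation 16)
    return (os.path.getsize(original_model_path)
            / os.path.getsize(compressed_model_path))

# cf = compression_factor('resnet_base.h5', 'resnet_compressed.npz')
```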
Table 1 and Table 2 list the performance of different functional approximation methods across datasets in terms of the accuracy and compression factor achieved by each of the approximation models for the ResNet and InceptionTime networks, respectively. In the third column, the accuracies of the base DNN models are given in bold-italic. This is the classification performance that is to be achieved with a much smaller memory footprint. Accuracies and compression factors of optimized models, using the OpenVINO Model Optimizer, are presented in the last two columns. OpenVINO model optimization is done with the default parameters of the toolkit. It is observed that such optimized models exhibit the same accuracy as the original base model with a compression factor of about 3 for ResNet and about 2.9 for InceptionTime.
Table 1-A
Table 1-B
Table 1-C
Table 1-D
Table 2-A
Table 2-B
Table 2-C
Table 2-D
For each dataset, the regression method which resulted in test accuracy nearest to that of the actual base model with a compression factor CF > 2 is given in bold. For some datasets, regression based models have shown classification performance even better than what was achieved by the original base model. In such cases, all better performing methods are given in bold. The size of the K matrix in kernel based methods, as mentioned earlier, is dependent on the training samples. There were cases where, due to the substantial size of the K matrix, the approximation model had a very poor compression factor, as presented in the tables.
Further experiments with such datasets were conducted by applying the Nyström kernel map approximation to reduce the memory footprint. Table 3 and Table 4 present the Nyström approximation results for ResNet and InceptionTime, respectively. By specifying the number of components (n) based on which the low-rank approximation to the kernel matrix is computed, the size of the resultant compressed model can be controlled.
Table 3: ResNet Compression: Accuracy and CF of regression models obtained using Nyström approximation for different kernel maps
Table 4: InceptionTime Compression: Accuracy and CF of regression models obtained using Nyström approximation for different kernel maps
From Table 3 and Table 4, it is observed that as n is increased from 100 to 500, there is an improvement in classification accuracy due to the refinement in the kernel map approximation. But simultaneously, there is a drop in compression factor as the approximation model increases in size with an increase in the number of components.
For each dataset, the regression model was considered which could attain the best performance in terms of classification accuracy with a compression factor of at least 2. FIG. 7 plots the observed difference in accuracy (compressed model accuracy - target accuracy) and compression factor of such models for all 36 datasets. OpenVINO optimized models are also included in the plot for comparison. As seen from FIG. 7, the compressed models shown are those that are best in terms of accuracy with CF > 2. Each point represents a compressed model. Along the x-axis, the compression factor is plotted in log10 scale. Along the y-axis, the plot shows the difference between the accuracy of the compressed model and that of the corresponding target base model.
As these models provide nearly consistent performance and compression factor, they produce overlapping points in the plot. It is evident that for 1D time series models, which are comparatively smaller in size, OpenVINO could achieve only a modest compression factor of about 3. In comparison, the disclosed method could compress models with impressive compression factors of up to 300.
The datasets and their corresponding approximation models can be grouped in different bands of loss in accuracy compared to the target base models. Table 5 presents the number of datasets and the range of compression factors observed within each of these bands. It can be observed that for a majority of cases, the loss in accuracy is not more than 10%. Moreover, there are a few cases (such as ChlorineConcentration, ItalyPowerDemand, ECG5000, etc.) where the compressed model outperforms the base model (ResNet / InceptionTime) and reports higher test accuracy.
Table 5: Results summary of the proposed techniques for all 36 datasets. Nearness of compressed model accuracies to respective target base model accuracies is split into bands. The number of datasets and the range of compression factors observed within each band are reported.
As is seen above, the disclosed method and system provide for DNN compression. The DNN is modeled as a black box and its input-output relationship is approximated. This leads to a multi-variate multi-output regression problem. The system was used to compress two of the best performing models for 1D time series analysis – ResNet and InceptionTime. The experiments have been carried out on the UCR repository. The experimental results show a comparison with the industry standard OpenVINO toolkit. Theoretical results on asymptotic compression rates have been known in the literature for the past three decades, i.e., the larger the model / data to compress, the better the compression rates. The disclosed system can compress even small DNN models. The state-of-practice OpenVINO can only compress these DNN architectures at an almost uniform rate of about three-fold. The disclosed system, on the other hand, can compress them by up to 300 times.
As described previously, the approximation function can be determined using the piecewise linear approximation technique. The results of compression of the DNN using the piecewise linear approximation technique are shown by taking ResNet and InceptionTime as base models.
The disclosed method is compared against spatial singular value decomposition (spatial SVD) and channel pruning. Also, the method is compared with the Model Optimizer of the OpenVINO toolkit, which is an industry standard DNN deployment toolkit. For the benchmarking techniques, the default parameters and settings were used. The experiment was conducted on the UCR Time Series Classification Archive. 30 datasets encompassing time series of various types, lengths, and domains were considered to study the applicability of the disclosed approaches in different use cases.
In the piecewise linear approximation, the optimal number of pieces (n) required for approximating a function has to be determined. Then, for each of the linear models, its corresponding regularization parameter λ was tuned. Both n (varying between 1 and 5) and λ are obtained by 5-fold cross validation on the training set. The compression factor (CF) is as described with reference to equation (16).
The comparative results are shown in Tables 6A-6C and 7A-7C. The accuracies of the target base models represent the classification performance that is to be achieved with a much smaller memory footprint. For each dataset, the compression method (excluding OpenVINO) which resulted in test accuracy nearest to that of the target is given in bold. For some datasets, the piecewise linear approach has shown classification performance even better than what was achieved by the original base model; this is indeed an interesting observation. It was observed that the other two methods, in general, have poor compression performance; this may be because these compression techniques were originally developed for images.
Table 6-A
Table 6-B
Table 6-C
Table 7-A
Table 7-B
Table 7-C
In FIGS. 8A-8D, the observed difference in accuracy (compressed model accuracy - target accuracy) and compression factors in log10 scale are plotted for each of the compression methods. It was observed that, overall, the disclosed method performs better considering both the classification performance and the compression factor. On average, the disclosed method achieved 20 times more compression than spatial SVD, and 16 times more than channel pruning. The mean loss in accuracy for the disclosed method was around 4%, much better than 17% for spatial SVD and 47% for channel pruning. Though OpenVINO provided consistent accuracy, the disclosed method produced 27 times more compression on average.
The datasets and their corresponding approximation models can be grouped in different bands of loss in accuracy compared to the target base models. Table 8 presents the number of datasets and the range of compression factors observed within each of these bands for the disclosed approach. It can be observed that for a majority of the cases, the loss in accuracy is not more than 10%. Moreover, there are a few cases (such as ChlorineConcentration, ItalyPowerDemand, ECG5000, etc.) where the disclosed compressed model outperforms the base model (ResNet / InceptionTime) and reports higher test accuracy.
Table 8: Results Summary for all 30 datasets
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
Various embodiments disclosed herein provide a method and system for DNN compression using a black-box system identification methodology. Typically, devices working at the edge of the network often face a number of limitations while offering solutions using state-of-the-art learning algorithms like deep learning methods. The disclosed method and system facilitate compressing the DNN, thereby bringing DNNs to the edge for IoT applications. The disclosed method and system apply black-box function approximation to directly model the input-output relationship of the DNN model. Such approximation facilitates compressing the size of the DNN model. The disclosed system is able to achieve high compression rates with minimal loss in accuracy.
In an embodiment, the disclosed system learns compressed models corresponding to the input-output behavior of the DNNs. An important contribution of the disclosed embodiments is the ability of the system and method to dynamically determine a value of the compression factor that gives an optimal accuracy by trading-off the accuracy expected and the expected compressed size of the DNN for inference.
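A minimal end-to-end sketch of this idea, under stated assumptions, is given below: the names X, Z, W_out, b_out and base_param_count are placeholders (Z being the penultimate-layer outputs of the base DNN and (W_out, b_out) the frozen output-layer parameters), and a single ridge-regularized linear map stands in for whichever approximation function (linear, kernelized, or piecewise linear) is actually selected.

```python
# Minimal sketch (assumptions noted above): replace all layers up to the
# penultimate one with a single linear map, reuse the original output layer,
# and report the resulting compression factor.
import numpy as np
from sklearn.linear_model import Ridge

def compress(X, Z, W_out, b_out, base_param_count, lam=1.0):
    f = Ridge(alpha=lam).fit(X, Z)                      # approximate input -> prefinal mapping
    def predict(x_batch):
        logits = f.predict(x_batch) @ W_out.T + b_out   # reuse the frozen output layer
        return logits.argmax(axis=1)
    # Compression factor: size of the DNN model / size of the compressed model (eq. 16).
    compressed_params = f.coef_.size + f.intercept_.size + W_out.size + b_out.size
    cf = base_param_count / compressed_params
    return predict, cf
```

Given a user-specified accuracy floor and size budget, one could then sweep candidate approximations (linear, kernelized, or piecewise linear with different numbers of pieces) and retain the one with the highest compression factor whose validation accuracy still satisfies the constraints, which is one plausible reading of the dynamic compression-factor selection described above.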
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS:
1. A processor implemented method for dynamic compression of a deep neural network (DNN) model for inference, the method comprising:
receiving (302), via one or more hardware processors, a user input defining a plurality of constraints associated with the accuracy of the inference expected from a DNN model and an expected compressed size of the DNN model for the inference;
receiving (304), via the one or more hardware processors, a time-series input data at an input layer of the DNN model, the DNN model comprising at least the input layer, an output layer and a penultimate layer;
obtaining (306) a prefinal output data from the penultimate layer of the DNN model, via the one or more hardware processors;
determining (308), via the one or more hardware processors, an approximation function indicative of an approximate mapping of the input time-series data with the prefinal output data using a multi-variate multi-output regression technique, such that the approximation function satisfies the plurality of constraints, wherein the approximation function is determined using one of linear regression technique, kernelization technique, and a piecewise linear approximation technique; and
applying (310) the approximation function to the prefinal layer of the DNN model to compress the DNN model and obtain a compressed DNN model, wherein an amount of compression of the DNN model is determined based on a compression factor, and
wherein determining the approximation function satisfying the plurality of constraints comprises dynamically determining a value of the compression factor that gives an optimal accuracy by trading-off the accuracy expected and the expected compressed size of the DNN model for inference.
2. The processor implemented method of claim 1, wherein determining the approximation function using the linear regression technique comprises optimizing an objective function to estimate a linear function approximating the DNN model, wherein size of the linear function depends on length of the time-series input data and dimensionality of the prefinal output at the penultimate layer of the DNN model.
3. The processor implemented method of claim 1, wherein determining the approximation function using the kernelization technique comprises: approximating the DNN model using a kernel-based method to obtain a kernel matrix; and
approximating the kernel matrix using a second level approximation.
4. The processor implemented method of claim 1, wherein determining the approximation function using the piecewise linear approximation technique comprises:
determining an optimal number of pieces for the approximation function by clustering the input training data into a plurality of clusters such that training samples in each cluster of the plurality of clusters forms a basis for a hyperplane corresponding to the each cluster, wherein input-output relationship of the each cluster is linear; and
learning an approximation function for each of the plurality of clusters using the linear regression technique, wherein learning the approximation function for a cluster from amongst the plurality of clusters comprises computing a weight matrix for the cluster based on a least square technique.
5. The processor implemented method of claim 4, wherein determining a value of the compression factor that gives the optimal accuracy comprises tuning a set of clusters in the plurality of clusters to optimize the accuracy expected and the expected compressed size of the DNN model based on the plurality of constraints.
6. The processor implemented method of claim 1, further comprising computing the compression factor associated with the compression of the DNN model, wherein the compression factor is computed as:
Compression Factor (CF) = (Size of the DNN model)/(Size of the Compressed DNN model)
7. The method of claim 1, wherein the time-series input data is in the form of one of real vectors, matrices and tensors.
8. A system (600) for dynamic compression of a deep neural network (DNN) model for inference, comprising:
a memory (615) storing instructions;
one or more communication interfaces (607); and
one or more hardware processors (602) coupled to the memory (615) via the one or more communication interfaces (607), wherein the one or more hardware processors are configured by the instructions to:
receive a user input defining a plurality of constraints associated with the accuracy of the inference expected from a DNN model and an expected compressed size of the DNN model for the inference;
receive a time-series input data at an input layer of the DNN model, the DNN model comprising at least the input layer, an output layer and a penultimate layer;
obtain a prefinal output data from the penultimate layer of the DNN model;
determine an approximation function indicative of an approximate mapping of the input time-series data with the prefinal output data using a multi-variate multi-output regression technique, such that the approximation function satisfies the plurality of constraints, wherein the approximation function is determined using one of linear regression technique, kernelization technique, and a piecewise linear approximation technique; and
apply the approximation function to the prefinal layer of the DNN model to compress the DNN model and obtain a compressed DNN model, wherein an amount of compression of the DNN model is determined based on a compression factor,
wherein determining the approximation function satisfying the plurality of constraints comprises dynamically determining a value of the compression factor that gives an optimal accuracy by trading-off the accuracy expected and the expected compressed size of the DNN model for inference.
9. The system of claim 8, wherein to determine the approximation function using the linear regression technique, the one or more hardware processors are further configured by the instructions to optimize an objective function to estimate a linear function approximating the DNN model, wherein size of the linear function depends on length of the time-series input data and dimensionality of the prefinal output at the penultimate layer of the DNN model.
10. The system of claim 8, wherein to determine the approximation function using the kernelization technique, the one or more hardware processors are further configured by the instructions to:
approximate the DNN model using a kernel-based method to obtain a kernel matrix; and
approximate the kernel matrix using a second level approximation.
11. The system of claim 8, wherein to determine the approximation function using the piecewise linear approximation technique, the one or more hardware processors are further configured by the instructions to:
determine an optimal number of pieces for the approximation function by clustering the input training data into a plurality of clusters such that training samples in each cluster forms a basis for a hyperplane corresponding to the each cluster, wherein input-output relationship of each cluster is linear; and
learn an approximation function for each of the plurality of clusters using the linear regression technique, wherein learning the approximation function for a cluster from amongst the plurality of clusters comprises computing a weight matrix for the cluster based on a least square technique.
12. The system of claim 11, wherein to determine a value of the compression factor that gives the optimal accuracy, the one or more hardware processors are further configured by the instructions to tune a set of clusters in the plurality of clusters to optimize the accuracy expected and the expected compressed size of the DNN model based on the plurality of constraints.
13. The system of claim 8, wherein the one or more hardware processors are further configured by the instructions to compute the compression factor associated with the compression of the DNN model, wherein the compression factor is computed as:
Compression Factor (CF) = (Size of the DNN model)/(Size of the Compressed DNN model)
14. The system of claim 8, wherein the time-series input data is in the form of one of real vectors, matrices and tensors.