
Learning Student DNN via Output Distribution

Abstract: Systems and methods are provided for generating a DNN classifier by "learning" a "student" DNN model from a larger, more accurate "teacher" DNN model. The student DNN may be trained from unlabeled training data by passing the unlabeled training data through the teacher DNN, which may be trained from labeled data. In one embodiment, an iterative process is applied to train the student DNN by minimizing the divergence of the output distributions from the teacher and student DNN models. For each iteration until convergence, the difference in the outputs of these two DNNs is used to update the student DNN model, and outputs are determined again using the unlabeled training data. The resulting trained student DNN model may be suitable for providing accurate signal processing applications on devices having limited computational or storage resources, such as mobile or wearable devices. In an embodiment, the teacher DNN model comprises an ensemble of DNN models.


Patent Information

Application #
Filing Date
09 March 2017
Publication Number
21/2017
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

MICROSOFT CORPORATION
One Microsoft Way Redmond, Washington 98052-6399

Inventors

1. ZHAO, Rui
c/o Microsoft Asia Pacific R&D Headquarters 14F, Building 2, No. 5, Dan Ling Street Haidian District, Beijing 100080
2. HUANG, Jui-Ting
c/o Microsoft Corporation, One Microsoft Way Redmond, Washington 98052
3. LI, Jinyu
c/o Microsoft Corporation, One Microsoft Way Redmond, Washington 98052
4. GONG, Yifan
c/o Microsoft Corporation, One Microsoft Way Redmond, Washington 98052

Specification

LEARNING STUDENT DNN VIA OUTPUT DISTRIBUTION

BACKGROUND

[0001] Deep neural networks (DNNs) promise significant accuracy improvements for complex signal processing applications, including speech recognition and image processing. The power of a DNN comes from its deep and wide network structure having a very large number of parameters. For example, the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) has been shown to outperform the conventional Gaussian mixture model (CD-GMM-HMM) on many automatic speech recognition (ASR) tasks. However, the outstanding performance of CD-DNN-HMMs comes with much higher run-time costs because DNNs use many more parameters than traditional systems. Thus, while CD-DNN-HMMs have been deployed with high accuracy on servers or other computer systems having ample computational and storage resources, it becomes challenging to deploy DNNs on devices that have limited computational and storage resources, such as smartphones, wearable devices, or entertainment systems.

[0002] Yet, given the prevalence of such devices and the potential benefits DNNs present to applications such as ASR and image processing, the industry has a strong interest in having DNNs on these devices. A common approach to this problem is to reduce the dimensions of the DNN, for example, by reducing the number of nodes in hidden layers and the number of senone targets in the output layer. Although this approach reduces the DNN model size, accuracy loss (e.g., word error rate) increases significantly and performance quality suffers.

SUMMARY

[0003] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[0004] Embodiments of the invention are directed to systems and methods for providing a more accurate DNN model of reduced size for deployment on devices by "learning" the deployed DNN from a DNN with larger capacity (number of hidden nodes). To learn a DNN with a smaller number of hidden nodes, a larger-sized (more accurate) "teacher" DNN is used to train the smaller "student" DNN. In particular, as will be further described, an embodiment of the invention utilizes the property of DNN output distribution by minimizing the divergence between the output distributions of a small-sized student DNN and a larger-sized teacher DNN, using unlabeled data, such as un-transcribed data. The student DNN may be trained from unlabeled (or un-transcribed) data by passing unlabeled training data through the teacher DNN to generate the training target. Without the need for labeled (or transcribed) training data, much more data becomes available for training, thereby further improving the accuracy of the student DNN to provide a better approximation of complex functions from the larger-sized teacher DNN. The student DNN may be iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the teacher outputs, the student will approximate, even where the teacher may be wrong. An embodiment of the invention is thus particularly suitable for providing accurate signal processing applications (e.g., ASR or image processing) on smartphones, entertainment systems, or similar consumer electronics devices.

[0005] Some embodiments of the invention include providing a more accurate DNN model (e.g., small or standard size) by learning the DNN model from an even larger "giant" teacher DNN.
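The divergence-minimization idea of paragraph [0004] can be illustrated with a short sketch. This is not the patent's implementation; it is a minimal numpy illustration under the common assumption that both DNNs end in a softmax layer and that the divergence is measured as KL(teacher || student) over a batch of unlabeled frames:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p_teacher, p_student, eps=1e-12):
    # KL(teacher || student), averaged over a batch of frames.
    p, q = p_teacher + eps, p_student + eps
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

# Unlabeled frames pass through both models; the teacher's posteriors
# serve as the training target for the student.
rng = np.random.default_rng(0)
p_teacher = softmax(rng.normal(size=(4, 10)))
p_student = softmax(rng.normal(size=(4, 10)))

loss = kl_divergence(p_teacher, p_student)
```

Training the student then amounts to driving `loss` toward zero, at which point the two output distributions have converged.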
For example, a standard-sized DNN model for deployment on a server can be generated using the teacher-student learning procedures described herein, wherein the student DNN is the standard-sized DNN model and the teacher DNN is a giant-sized DNN, which might be implemented as a trained ensemble of multiple DNNs with different error patterns. In an embodiment, the ensemble is trained by combining the ensemble member outputs with automatically learned combination coefficients using, for example, a cross-entropy criterion, a sequential criterion, a least square error criterion, a least square error criterion with a non-negative constraint, or similar criteria.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The present invention is illustrated by way of example and not limitation in the accompanying figures, in which like reference numerals indicate similar elements and in which:

[0007] FIG. 1 is a block diagram of an example system architecture in which an embodiment of the invention may be employed;

[0008] FIG. 2 depicts aspects of an illustrative representation of a DNN model, in accordance with an embodiment of the invention;

[0009] FIG. 3 depicts aspects of an illustrative representation of learning a smaller-footprint student DNN from a larger-footprint teacher DNN using unlabeled data, in accordance with an embodiment of the invention;

[0010] FIG. 4 depicts aspects of an illustrative representation of an ensemble teacher DNN model, in accordance with an embodiment of the invention;

[0011] FIG. 5 depicts a flow diagram of a method for generating a DNN classifier of a reduced size by learning from a larger DNN model, in accordance with embodiments of the invention;

[0012] FIG. 6 depicts a flow diagram of a method for generating a trained DNN model from an ensemble teacher DNN model, in accordance with embodiments of the invention; and

[0013] FIG. 7 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention.
DETAILED DESCRIPTION

[0014] The subject matter of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms "step" and/or "block" may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.

[0015] Various aspects of the technology described herein are generally directed to, among other things, systems, methods, and computer-readable media for providing a first DNN model of reduced size for deployment on devices by "learning" the first DNN from a second DNN with larger capacity (number of hidden nodes). To learn a DNN with a smaller number of hidden nodes, a larger-sized (more accurate) "teacher" DNN is used to train the smaller "student" DNN. In particular, an embodiment of the invention utilizes the property of DNN output distribution by minimizing the divergence between the output distributions of a small-sized student DNN and a standard (or larger-sized) teacher DNN, using unlabeled data, such as un-transcribed data. The student DNN can be trained from unlabeled (or un-transcribed) data because its training target is obtained by passing the unlabeled training data through the teacher DNN. Without the need for labeled (or transcribed) training data, much more data becomes available for training, thereby further improving the accuracy of the student DNN to provide a better approximation of complex functions from the larger-sized teacher DNN.
[0016] As will be further described, in one embodiment, the student DNN is iteratively optimized until its output converges with the output of the teacher DNN. In this way, the student DNN approaches the behavior of the teacher, so that whatever the teacher outputs, the student will approximate, even where the teacher may be wrong. Some embodiments of the invention are thus particularly suitable for providing accurate signal processing applications (e.g., ASR or image processing) on smartphones, entertainment systems, or similar consumer electronics devices. Further, some of these embodiments may be combined with other technologies to further improve the run-time performance of CD-DNN-HMMs, such as low-rank matrices used at the output layers or all layers to further reduce the number of parameters and CPU cost, 8-bit quantization for SSE (Streaming SIMD Extensions) evaluation, and/or frame-skipping or prediction technologies.

[0017] In some embodiments of the invention, a deployable DNN model (e.g., a small or standard-sized model) is determined by learning the deployable DNN model from an even larger "giant" teacher DNN. For example, a standard-sized DNN model for deployment on a server (or a smaller-sized DNN for deployment on a mobile device) can be generated using the teacher-student learning procedures described herein, wherein the student DNN is the standard-sized DNN model (or smaller-sized DNN model) and the teacher DNN is a giant-sized DNN. The giant-sized DNN may be implemented as a trained ensemble of multiple DNNs with different error patterns, in an embodiment. The ensemble may be trained by combining the ensemble member outputs with automatically learned combination coefficients using, for example, a cross-entropy criterion, a sequential criterion, a least square error criterion, a least square error criterion with a non-negative constraint, or similar criteria.
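The least-square-error combination of ensemble member outputs mentioned in paragraph [0017] can be sketched as follows. This is a hypothetical numpy illustration, not the patent's procedure: the member posteriors and the target posteriors are synthetic, and the combination coefficients are solved in closed form rather than learned jointly with the networks.

```python
import numpy as np

# Posteriors from three hypothetical ensemble member DNNs on the same frames.
rng = np.random.default_rng(1)
n_frames, n_classes = 50, 5
members = [np.abs(rng.normal(size=(n_frames, n_classes))) for _ in range(3)]
members = [m / m.sum(axis=1, keepdims=True) for m in members]

# Synthetic target posteriors the combined ensemble output should reproduce.
target = np.abs(rng.normal(size=(n_frames, n_classes)))
target = target / target.sum(axis=1, keepdims=True)

# Least-square-error criterion: stack the member outputs and solve for the
# combination coefficients that best reconstruct the target.
A = np.stack([m.ravel() for m in members], axis=1)   # (frames*classes, members)
coeffs, *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)

# The ensemble output is the learned weighted combination of member outputs.
ensemble_out = sum(c * m for c, m in zip(coeffs, members))
```

By construction, the learned combination fits the target at least as well as a raw (unweighted) average of the members, which is the fallback combination mentioned in paragraph [0031].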
[0018] As described above, an advantage of some embodiments described herein is that the student DNN model may be trained using unlabeled (or un-transcribed) data because its training target (P_L(s|x), as will be further described) is obtained by passing the unlabeled training data through the teacher DNN model. Because labeling (or transcribing) data for training costs time and money, a much smaller amount of labeled (or transcribed) data is available as compared to unlabeled data. (Labeled (or transcribed) data may be used to train the teacher DNN.) Without the need for transcribed (or labeled) training data, much more data becomes available for training the student DNN to approximate the behavior of the teacher DNN. With more training data available to cover a particular feature space, the accuracy of a deployed (student) DNN model is even further improved. This advantage is especially useful for industry scenarios with large amounts of unlabeled data available due to the deployment feedback loop (wherein deployed models provide their usage data to application developers, who use the data to further tailor future versions of the application). For example, many search engines use such a deployment feedback loop.

[0019] Turning now to FIG. 1, a block diagram is provided showing aspects of one example of a system architecture suitable for implementing an embodiment of the invention, designated generally as system 100. It should be understood that this and other arrangements described herein are set forth only as examples. Thus, system 100 represents only one example of a suitable computing system architecture. Other arrangements and elements (e.g., user devices, data stores, etc.) can be used in addition to or instead of those shown, and some elements may be omitted altogether for the sake of clarity.
Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions or services may be carried out by a processor executing instructions stored in memory.

[0020] Among other components not shown, system 100 includes network 110 communicatively coupled to one or more data source(s) 108, storage 106, client devices 102 and 104, and DNN model generator 120. The components shown in FIG. 1 may be implemented on or using one or more computing devices, such as computing device 700 described in connection to FIG. 7. Network 110 may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. It should be understood that any number of data sources, storage components or data stores, client devices, and DNN model generators may be employed within the system 100 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For instance, the DNN model generator 120 may be provided via multiple computing devices or components arranged in a distributed environment that collectively provide the functionality described herein. Additionally, other components not shown may also be included within the network environment.

[0021] Example system 100 includes one or more data source(s) 108. Data source(s) 108 comprise data resources for training the DNN models described herein. The data provided by data source(s) 108 may include labeled and unlabeled data, such as transcribed and un-transcribed data.
For example, in an embodiment, the data includes one or more phone sets (sounds) and may also include corresponding transcription information or senone labels that may be used for initializing the teacher DNN model. In an embodiment, the unlabeled data in data source(s) 108 is provided by one or more deployment feedback loops, as described above. For example, usage data from spoken search queries performed on search engines may be provided as un-transcribed data. Other examples of data sources may include, by way of example and not limitation, various spoken-language audio or image sources, including streaming sounds or video; web queries; mobile device camera or audio information; web cam feeds; smart-glasses and smart-watch feeds; customer care systems; security camera feeds; web documents; catalogs; user feeds; SMS logs; instant messaging logs; spoken-word transcripts; gaming system user interactions such as voice commands or captured images (e.g., depth camera images); tweets; chat or video-call records; or social-networking media. The specific data source(s) 108 used may be determined based on the application, including whether the data is domain-specific (e.g., data related only to entertainment systems) or general (non-domain-specific) in nature.

[0022] Example system 100 includes client devices 102 and 104, which may comprise any type of computing device where it is desirable to have a DNN system on the device and, in particular, where the device has limited computational and/or storage resources as compared to a more powerful server or computing system. For example, in one embodiment, client devices 102 and 104 may be one type of computing device described in relation to FIG. 7 herein.
By way of example and not limitation, a user device may be embodied as a personal data assistant (PDA), a mobile device, a smartphone, a smart-watch, smart-glasses (or other wearable smart device), a laptop, a tablet, a remote control, an entertainment system, a vehicle computer system, an embedded system controller, an appliance, a home computer system, a security system, a consumer electronics device, or another similar electronics device. In one embodiment, the client device is capable of receiving input data, such as audio and image information, usable by a DNN system described herein that is operating on the device. For example, the client device may have a microphone or line-in for receiving audio information, a camera for receiving video or image information, or a communication component (e.g., Wi-Fi functionality) for receiving such information from another source, such as the Internet or a data source 108.

[0023] Using an embodiment of the student DNN model described herein, the client device 102 or 104 and student DNN model process the inputted data to determine computer-usable information. For example, using one embodiment of a student DNN operating on a client device, a query spoken by a user may be processed to determine the user's intent (i.e., what the user is asking for). Similarly, camera-derived information may be processed to determine shapes, features, objects, or other elements in the image or video.

[0024] Example client devices 102 and 104 are included in system 100 to provide an example environment on which student (or smaller-sized) DNN models created by embodiments of the invention may be deployed. Although it is contemplated that aspects of the DNN models described herein may operate on one or more client devices 102 and 104, it is also contemplated that some embodiments of the invention do not include client devices. For example, a standard-sized or larger-sized student DNN may be embodied on a server or in the cloud. Further, although FIG.
1 shows two example client devices 102 and 104, more or fewer devices may be used.

[0025] Storage 106 generally stores information including data, computer instructions (e.g., software program instructions, routines, or services), and/or models used in embodiments of the invention described herein. In an embodiment, storage 106 stores data from one or more data source(s) 108, one or more DNN models (or DNN classifiers), information for generating and training DNN models, and the computer-usable information outputted by one or more DNN models. As shown in FIG. 1, storage 106 includes DNN models 107 and 109. DNN model 107 represents a teacher DNN model, and DNN model 109 represents a student DNN model having a smaller size than teacher DNN model 107. Additional details and examples of DNN models are described in connection to FIGS. 2-4. Although depicted as a single data store component for the sake of clarity, storage 106 may be embodied as one or more information stores, including memory on client device 102 or 104, on DNN model generator 120, or in the cloud.

[0026] DNN model generator 120 comprises an accessing component 122, an initialization component 124, a training component 126, and an evaluating component 128. The DNN model generator 120, in general, is responsible for generating DNN models, such as the CD-DNN-HMM classifiers described herein, including creating new DNN models (or adapting existing DNN models) by initializing and training "student" DNN models from trained teacher DNN models, based on data from data source(s) 108. The DNN models generated by DNN model generator 120 may be deployed on a client device such as device 104 or 102, a server, or another computer system.
In one embodiment, DNN model generator 120 creates a reduced-sized CD-DNN-HMM classifier for deployment on a client device, which may have limited computational or storage resources, by training an initialized "student" DNN model to approximate a trained teacher DNN model having a larger model size (e.g., number of parameters) than the student. In another embodiment, DNN model generator 120 creates a DNN classifier for deployment on a client device, server, or other computer system by training an initialized "student" DNN model to approximate a trained giant-sized teacher DNN model having a larger model size (e.g., number of parameters) than the student, wherein the giant-sized teacher DNN model comprises an ensemble of other DNN models.

[0027] DNN model generator 120 and its components 122, 124, 126, and 128 may be embodied as a set of compiled computer instructions or functions, program modules, computer software services, or an arrangement of processes carried out on one or more computer systems, such as computing device 700 described in connection to FIG. 7, for example. DNN model generator 120; its components 122, 124, 126, and 128; the functions performed by these components; or the services carried out by these components may be implemented at appropriate abstraction layer(s), such as the operating system layer, application layer, or hardware layer, of the computing system(s). Alternatively, or in addition, the functionality of these components, DNN model generator 120, and/or the embodiments of the invention described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), and Complex Programmable Logic Devices (CPLDs).

[0028] Continuing with FIG.
1, accessing component 122 is generally responsible for accessing and providing training data from one or more data sources 108, as well as DNN models such as DNN models 107 and 109, to DNN model generator 120. In some embodiments, accessing component 122 may access information about a particular client device 102 or 104, such as information regarding the computational and/or storage resources available on the client device. In some embodiments, this information may be used to determine the optimal size of a DNN model generated by DNN model generator 120 for deployment on the particular client device.

[0029] Initialization component 124 is generally responsible for initializing an untrained "student" DNN model and, in some embodiments, initializing a teacher DNN model for training the student. In some embodiments, initialization component 124 initializes a student DNN model of a particular size (or a model no larger than a particular size) based on the limitations of the client device on which the trained student DNN model will be deployed, and may initialize the student DNN based on a teacher DNN model (a larger DNN model). For example, in an embodiment, initialization component 124 receives from accessing component 122 a fully trained teacher DNN of size NT, which is already trained according to techniques known by one skilled in the art, and information about the limitations of the client device on which the trained student DNN is to be deployed. The teacher DNN may be initialized and/or trained for a domain-specific application (such as facial recognition or spoken queries for an entertainment system) or for a general purpose. Based on the received information, initialization component 124 creates an initial, untrained student DNN model of a suitable model size (based on the limitations of the client device). In one embodiment, the student DNN model may be created by copying and dividing the teacher DNN model into a smaller model (with a smaller number of nodes).
Like the teacher DNN model, the untrained student DNN model includes a number of hidden layers, which may be equal to the number of layers of the teacher, or the student DNN may contain a different number of hidden layers than the teacher DNN model. In one embodiment, the student DNN model size, including the number of nodes or parameters for each layer, is less than NT, the size of the teacher. An example DNN model suitable for use as a student DNN is described in connection to FIG. 2. In that example, a CD-DNN-HMM model inherits its model structure, including a phone set, HMM topology, and tying of context-dependent states, directly from a conventional CD-GMM-HMM system, which may be pre-existing.

[0030] In one embodiment, initialization component 124 creates and initializes the untrained student DNN model by assigning random numbers to the weights of the nodes in the model (i.e., the weights of matrix W). In another embodiment, initialization component 124 receives from accessing component 122 data for pre-training the student DNN model, such as un-transcribed data that is used to establish initial node weights for the student DNN model.

[0031] In some embodiments, initialization component 124 also initializes or creates the teacher DNN model. In particular, using labeled or transcribed data from data source(s) 108 provided by accessing component 122, initialization component 124 may create a teacher DNN model (which may be pre-trained) and provide the initialized but untrained teacher DNN model to training component 126 for training. Similarly, initialization component 124 may create an ensemble teacher DNN model by determining a plurality of sub-DNN models to be included as members of the ensemble (e.g., by creating models and handing them off to training component 126 for training, or by identifying already existing DNN models).
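The random initialization of paragraph [0030] can be sketched as follows. The layer sizes are purely hypothetical (a teacher kept at full width, a student with the same depth but fewer hidden nodes); the patent does not specify these dimensions or the initialization scale.

```python
import numpy as np

def init_dnn(layer_sizes, rng):
    """Randomly initialize DNN parameters: each weight matrix W^l gets
    small random values, and each bias vector starts at zero."""
    weights = [rng.normal(scale=0.05, size=(m, n))
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(n) for n in layer_sizes[1:]]
    return weights, biases

rng = np.random.default_rng(5)
# Hypothetical sizes: input features, hidden layers, senone output targets.
teacher_sizes = [440, 2048, 2048, 2048, 6000]
student_sizes = [440, 512, 512, 512, 6000]   # same depth, fewer hidden nodes

student_w, student_b = init_dnn(student_sizes, rng)
teacher_w, teacher_b = init_dnn(teacher_sizes, rng)
```

Counting parameters shows why the student fits on a resource-limited device: with these example sizes the student has far fewer weights than the teacher while keeping the same input and output dimensions.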
In these embodiments, initialization component 124 may also determine the relationships between the output layer of the ensemble and the output layers of the member sub-DNN models (e.g., by taking a raw average of the member model outputs), or may provide the initialized but untrained ensemble teacher DNN to training component 126 for training.

[0032] Training component 126 is generally responsible for training the student DNN based on the teacher. In particular, training component 126 receives from initialization component 124 and/or accessing component 122 an untrained (or pre-trained) DNN model, which will be the student, and a trained DNN model, which will serve as the teacher. (It is also contemplated that the student DNN model may already be trained but may be further trained according to the embodiments described herein.) Training component 126 also receives unlabeled data from accessing component 122 for training the student DNN.

[0033] Training component 126 facilitates the learning of the student DNN through an iterative process with evaluating component 128 that provides the same unlabeled data to the teacher and student DNN models, evaluates the output distributions of the DNN models to determine the error of the student DNN's output distribution from the teacher's, performs back-propagation on the student DNN model based on the error to update the student DNN model, and repeats this cycle until the output distributions converge (or are otherwise sufficiently close). In some embodiments, training component 126 trains the student DNN according to methods 500 and 600, described in connection to FIGS. 5 and 6, respectively.

[0034] In some embodiments, training component 126 also trains the teacher DNN model. For example, in one embodiment, a teacher DNN is trained using labeled (or transcribed) data according to techniques known to one skilled in the art. In some embodiments using an ensemble teacher DNN, training component 126 trains the ensemble teacher DNN.
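The iterative cycle of paragraph [0033] — pass the same unlabeled data through both models, measure the error between output distributions, back-propagate into the student, and repeat — can be sketched with a toy one-layer softmax "student" (a stand-in, not the patent's CD-DNN-HMM; all sizes and the learning rate are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(p, q, eps=1e-12):
    # Average KL(teacher || student) over the batch of frames.
    return float(np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))            # unlabeled input frames
W_teacher = rng.normal(size=(8, 4))      # stands in for the trained teacher
p_teacher = softmax(X @ W_teacher)       # teacher posteriors = training target

W_student = np.zeros((8, 4))             # randomly/zero-initialized student
initial_kl = mean_kl(p_teacher, softmax(X @ W_student))

lr = 0.5
for step in range(500):
    p_student = softmax(X @ W_student)
    # For a softmax output, the gradient of KL(teacher || student) with
    # respect to the student logits is (p_student - p_teacher); this error
    # signal drives the back-propagation update of the student.
    grad = X.T @ (p_student - p_teacher) / len(X)
    W_student -= lr * grad

final_kl = mean_kl(p_teacher, softmax(X @ W_student))
```

Each pass of the loop corresponds to one iteration in paragraph [0033]; the divergence shrinks until the student's output distribution approaches the teacher's.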
By way of example and not limitation, training component 126 may train the ensemble by combining the ensemble member outputs with automatically learned combination coefficients using, for example, a cross-entropy criterion, a sequential criterion, a least square error criterion, a least square error criterion with a non-negative constraint, or similar criteria.

[0035] Evaluating component 128 is generally responsible for evaluating the student DNN model to determine if it is sufficiently trained to approximate the teacher. In particular, in an embodiment, evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs, and also determines whether the student is continuing to improve or whether the student is no longer improving (i.e., the student output distribution shows no further trend towards convergence with the teacher output). In one embodiment, evaluating component 128 computes the Kullback-Leibler (KL) divergence between the output distributions and, in conjunction with training component 126, seeks to minimize the divergence through the iterative process described in connection to training component 126. Some embodiments of evaluating component 128 may use regression, mean square error (MSE), or other similar approaches to minimizing the divergence between the outputs of the teacher and student DNNs.

[0036] In addition to determining the error signal, some embodiments of evaluating component 128 determine whether to complete another iteration (for example, another iteration comprising: updating the student DNN based on the error, passing unlabeled data through the student and teacher DNNs, and evaluating their output distributions). In particular, some embodiments of evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions.
Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, convergence is determined (indicating the student output distribution is sufficiently close to the teacher DNN's output distribution), and the student DNN may be considered trained and may further be deployed on a client device or computer system. Alternatively, in some embodiments, evaluating component 128 determines whether to continue iterating based on whether the student is continuing to show improvement (i.e., whether, over multiple successive iterations, the output distribution of the student is moving towards convergence with the output distribution of the teacher, indicating that the student DNN is continuing to improve with subsequent iterations). In such embodiments, so long as the student is improving, the iterative training continues. But in one embodiment, where the student's learning stalls (i.e., the student DNN's output distributions are not getting any closer to the teacher DNN's output distributions for several iterations), then "class is over," and the student DNN model may be considered trained. In one embodiment, convergence may be determined where the student DNN's output distributions are not getting any closer to the teacher DNN's output distributions over several iterations. In some embodiments, evaluating component 128 evaluates the student DNN according to the methods 500 and 600 described in connection to FIGS. 5 and 6, respectively.

[0037] Turning now to FIG. 2, aspects of an illustrative representation of an example DNN classifier are provided and referred to generally as DNN classifier 200. This example DNN classifier 200 includes a DNN model 201. (FIG. 2 also shows data 202, which is included for purposes of understanding but is not considered a part of DNN classifier 200.)
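The two stopping rules described in paragraphs [0035]-[0036] — a convergence threshold on the divergence, and a "class is over" stall check over several iterations — can be sketched as a single decision function. The function name and the particular threshold/patience values are hypothetical, chosen only for illustration:

```python
def should_stop(kl_history, threshold=1e-3, patience=3, min_delta=1e-4):
    """Decide whether teacher-student training should stop.

    Stops when the most recent divergence falls below `threshold`
    (convergence), or when the best divergence has not improved by at
    least `min_delta` over the last `patience` iterations (stalled
    learning: "class is over").
    """
    if not kl_history:
        return False
    if kl_history[-1] < threshold:
        return True                      # converged: student close enough
    if len(kl_history) > patience:
        recent_best = min(kl_history[-patience:])
        earlier_best = min(kl_history[:-patience])
        if earlier_best - recent_best < min_delta:
            return True                  # stalled: no further improvement
    return False
```

For example, a steadily improving divergence history keeps training going, while a history that flatlines for several iterations triggers a stop even though the threshold was never reached.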
In one embodiment, DNN model 201 comprises a CD-DNN-HMM model and may be embodied as a specific structure of mapped probabilistic relationships of an input onto a set of appropriate outputs, such as illustratively depicted in FIG. 2. The probabilistic relationships (shown as connected lines between the nodes 205 of each layer) may be determined through training. Thus, in some embodiments of the invention, the DNN model 201 is defined according to its training. (An untrained DNN model, therefore, may be considered to have a different internal structure than the same DNN model after it has been trained.) A deep neural network (DNN) can be considered as a conventional multi-layer perceptron (MLP) with many hidden layers (thus, "deep"). In some embodiments of the invention, three aspects contributing to the excellent performance of CD-DNN-HMMs include: modeling senones directly, even though there might be thousands of senones; using DNNs instead of shallow MLPs; and using a long context window of frames as the input.

[0038] With reference to FIG. 2, the input and output of DNN model 201 are denoted as x and o (210 and 250 of FIG. 2), respectively. Denote the input vector at layer l (220 of FIG. 2) as v^l (with v^0 = x), the weight matrix as W^l, and the bias vector as a^l. Then, for a DNN with L hidden layers (240 of FIG. 2), the output of the l-th hidden layer is: v^(l+1) = σ(z^l(v^l)), 0 ≤ l < L, where z^l(v^l) = W^l v^l + a^l and σ(·) is the sigmoid activation function applied element-wise.
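The layer recursion of paragraph [0038] corresponds to the following forward pass: sigmoid activations for the L hidden layers, followed by a softmax output over the senone classes. The layer sizes below are arbitrary examples, and this sketch omits the HMM part of the CD-DNN-HMM:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(x, weights, biases):
    """Forward pass: v^(l+1) = sigmoid(W^l v^l + a^l) for each hidden
    layer, then a softmax output layer producing class posteriors."""
    v = x
    for W, a in zip(weights[:-1], biases[:-1]):
        v = sigmoid(v @ W + a)
    return softmax(v @ weights[-1] + biases[-1])

rng = np.random.default_rng(4)
sizes = [16, 32, 32, 8]                  # input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]
out = dnn_forward(rng.normal(size=(5, 16)), weights, biases)
```

Each row of `out` is a posterior distribution over the output classes, which is exactly the quantity compared between teacher and student in the divergence criterion above.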
