Abstract: The present disclosure is related to techniques for optimizing artificial intelligence (AI) and/or machine learning (ML) models to reduce resource consumption while maintaining or improving AI/ML model performance. A sparse distillation framework (SDF) is provided for producing a class of parameter- and compute-efficient AI/ML models suitable for resource-constrained applications. The SDF simultaneously distills knowledge from a compute-heavy teacher model while also pruning a student model in a single pass of training, thereby reducing training and tuning times considerably. A self-attention mechanism may also replace CNNs, or convolutional layers of a CNN, to provide better translational equivariance. Other embodiments may be described and/or claimed.
Description:
RELATED APPLICATION
[0001] The present application claims priority to U.S. Non-Provisional Patent Application No. 17/504,282, filed on 18 October 2021 and titled “DEEP NEURAL NETWORK OPTIMIZATION SYSTEM FOR MACHINE LEARNING MODEL SCALING,” the entire disclosure of which is hereby incorporated by reference.
TECHNICAL FIELD
[0002] Embodiments described herein generally relate to artificial intelligence (AI), machine learning (ML), and Neural Architecture Search (NAS) technologies, and in particular, to techniques for Deep Neural Network (DNN) model engineering and optimization.
BACKGROUND
[0003] Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data. Performing machine learning involves creating a statistical model (or simply a “model”), which is configured to process data to make predictions and/or inferences. ML algorithms build models using sample data (referred to as “training data”) and/or based on past experience in order to make predictions or decisions without being explicitly programmed to do so.
[0004] ML model design is a lengthy process that involves a highly iterative cycle of training and validation to tune the structure, parameters, and/or hyperparameters of a given ML model. The training and validation can be especially time consuming and resource intensive for larger ML architectures such as deep neural networks (DNNs) and the like. Conventional ML design techniques may also require relatively large amounts of computational resources beyond the reach of many users.
[0005] The efficiency of an ML model, in terms of resource consumption, speed, accuracy, and other performance metrics, is based in part on the number and type of parameters used for the ML model. The parameters used for the ML model include “model parameters” (also referred to simply as “parameters”) and “hyperparameters.” Model parameters are parameters derived via training, whereas hyperparameters are parameters whose values are used to control aspects of the learning process and usually have to be set before running an ML model. Changes to model parameters and/or hyperparameters can greatly impact the performance of a given ML model. In particular, reducing the number of parameters may decrease the performance of a model, but may allow the model to run faster and use less memory than it would with a larger number of parameters.
[0006] For example, existing computer vision models rely heavily on convolution-based architectures (e.g., convolutional neural networks (CNNs)), which scale poorly with receptive field sizes, apply the same set of weights to all parts of the input, and have a significant increase in parameters and floating point operations (FLOPs) as the model size grows. This can lead to increased training, optimization and inference times, particularly in the context of applications such as Neural Architecture Search (NAS), federated learning, and the like.
[0007] Current approaches to improve ML model efficiency include using knowledge distillation or pruning in isolation to reduce the computation and/or storage budget required for the model for inference deployment. These approaches are discussed in Gou et al., "Knowledge distillation: A survey", Int'l J. of Comp. Vision, vol. 129, no. 6, pp. 1789–1819 (2021) (“[Gou]”) and Cheng et al., “A Survey of Model Compression and Acceleration for Deep Neural Networks”, IEEE Signal Processing Mag., Special Issue on Deep Learning for Image Understanding, arXiv:1710.09282v9 (14 Jun 2020) (“[Cheng]”). However, these current approaches involve highly iterative training processes, which increase training time and resource-usage overhead. These drawbacks are exacerbated by the need for significant parameter tuning, and therefore, these approaches are not easily scalable. Even after such lengthy fine-tuning processes, the current approaches do not guarantee a reasonable compromise between ML model accuracy, model size, speed, and power.
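By way of illustration only, the following is a minimal PyTorch-style sketch of a conventional, isolated knowledge-distillation loss of the kind surveyed in [Gou]; the function name, the temperature T, and the weighting factor alpha are illustrative assumptions and are not taken from the cited works or from the claimed embodiments.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Conventional (isolated) knowledge distillation: blend a soft-target term
    # computed against the teacher's temperature-softened outputs with the
    # usual hard-label cross-entropy. Names and defaults are illustrative.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2 as is customary.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```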
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. Some embodiments are illustrated by way of example, and not limitation, in the figures of the accompanying drawings in which:
[0009] Figure 1 shows an overview of a joint optimization framework according to various embodiments. Figures 2, 3, 4, and 5 depict an overview of a sparse distillation system according to various embodiments. Figure 6 depicts a sparse distillation process according to various embodiments.
[0010] Figure 7 depicts an example NAS architecture including the sparse distillation system of Figures 2, 3, 4, and 5 according to various embodiments. Figure 8 depicts an example NAS procedure according to various embodiments.
[0011] Figure 9 depicts an example artificial neural network (ANN). Figure 10a illustrates an example accelerator architecture. Figure 10b illustrates example components of a computing system. Figure 11 depicts an example procedure that may be used to practice the various embodiments discussed herein.
DETAILED DESCRIPTION
[0012] The present disclosure is related to techniques for optimizing artificial intelligence (AI) and/or machine learning (ML) models to reduce resource consumption while improving AI/ML model performance. In particular, the present disclosure provides a framework for producing a class of parameter and compute efficient ML models suitable for resource constrained applications. This framework is referred to herein as “sparse distillation”.
[0013] The sparse distillation framework discussed herein sparsely distills a relatively large reference ML model (referred to herein as a “supernetwork” or “supernet”) into a smaller ML model (referred to herein as a “subnetwork” or “subnet”). As an example, a supernet may be a relatively large and/or dense ML model that an end-user has developed, but is expensive to operate in terms of computation, storage, and/or power consumption. This supernet may include parameters and/or weights that do not significantly contribute to the prediction and/or inference determination, and these parameters and/or weights contribute to the supernet’s overall computational complexity and density. Therefore, the supernet contains a smaller subnet that, when trained in isolation, can match the accuracy (or other performance metrics) of the original ML model (supernet). Additionally, it may be possible for the subnet to outperform the supernet in certain scenarios.
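As a purely illustrative, non-limiting sketch of how knowledge distillation and pruning might be combined in a single training pass, the following PyTorch-style example distills a teacher (supernet) into a student (subnet) while magnitude-pruning the student in the same epoch. The helper name sparse_distill_epoch, the hyperparameters, and the once-per-epoch pruning schedule are assumptions and do not represent the claimed implementation.

```python
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def sparse_distill_epoch(teacher, student, loader, optimizer,
                         T=4.0, alpha=0.9, sparsity=0.5):
    # One training epoch in which the student (subnet) learns from the frozen
    # teacher (supernet) through a distillation loss, after which the student's
    # smallest-magnitude weights are pruned toward the target sparsity.
    teacher.eval()
    student.train()
    for inputs, labels in loader:
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        # Temperature-softened KD term plus hard-label cross-entropy.
        soft_t = F.softmax(teacher_logits / T, dim=-1)
        soft_s = F.log_softmax(student_logits / T, dim=-1)
        kd = F.kl_div(soft_s, soft_t, reduction="batchmean") * (T * T)
        loss = alpha * kd + (1.0 - alpha) * F.cross_entropy(student_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Magnitude (L1) pruning of each linear/convolutional layer in the same pass.
    for module in student.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            prune.remove(module, "weight")  # bake the pruning mask into the weights
    return student
```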
Claims:
1. An apparatus for sparse distillation of a machine learning (ML) model, the apparatus comprising:
a knowledge distillation (KD) mechanism to distill knowledge of a supernet into a subnet during a single ML training epoch; and
a pruning mechanism to prune one or more parameters from the subnet during the single ML training epoch to produce a sparse distilled subnet.
| # | Name | Date |
|---|---|---|
| 1 | 202244052662-FORM 1 [15-09-2022(online)].pdf | 2022-09-15 |
| 2 | 202244052662-DRAWINGS [15-09-2022(online)].pdf | 2022-09-15 |
| 3 | 202244052662-DECLARATION OF INVENTORSHIP (FORM 5) [15-09-2022(online)].pdf | 2022-09-15 |
| 4 | 202244052662-COMPLETE SPECIFICATION [15-09-2022(online)].pdf | 2022-09-15 |
| 5 | 202244052662-FORM-26 [28-12-2022(online)].pdf | 2022-12-28 |
| 6 | 202244052662-FORM 3 [14-03-2023(online)].pdf | 2023-03-14 |
| 7 | 202244052662-FORM 3 [08-09-2023(online)].pdf | 2023-09-08 |
| 8 | 202244052662-Proof of Right [12-09-2023(online)].pdf | 2023-09-12 |
| 9 | 202244052662-FORM 3 [08-03-2024(online)].pdf | 2024-03-08 |
| 10 | 202244052662-FORM 18 [13-10-2025(online)].pdf | 2025-10-13 |