
Using Sparsity Metadata To Reduce Systolic Array Power Consumption

Abstract: A processing apparatus can include a general-purpose parallel processing engine comprising a matrix accelerator including a multi-stage systolic array, where each stage includes multiple processing elements associated with multiple processing channels. The multiple processing elements are configured to receive output sparsity metadata that is independent of input sparsity of input matrix elements and perform processing operations on the input matrix elements based on the output sparsity metadata.


Patent Information

Application #
202244014140
Filing Date
16 March 2022
Publication Number
52/2022
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

INTEL CORPORATION
2200 Mission College Boulevard, Santa Clara, California 95054, USA

Inventors

1. JORGE PARRA
6200 Edgehill Dr. El Dorado Hills, CA 95762 USA
2. SUPRATIM PAL
890 Bullion Lane Folsom, CA 95630 USA
3. JIASHENG CHEN
4328 Suffolk Way El Dorado Hills, CA 95762 USA
4. CHANDRA GURRAM
1352 Walden Dr Folsom, CA 95630 USA

Specification

Claims

1. A processing apparatus having reduced systolic array power consumption, the processing apparatus including:
a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements associated with multiple processing channels, wherein the multiple processing elements are configured to:
receive output sparsity metadata at a first pipeline stage, the output sparsity metadata associated with the multiple processing channels, wherein the output sparsity metadata is independent of input sparsity of input matrix elements;
perform processing operations on the input matrix elements based on the output sparsity metadata, wherein to perform the processing operations includes to:
bypass multiplication at a first processing element associated with a first processing channel and power gate a portion of the first processing element; and
multiply input elements at a second processing element associated with a second processing channel.
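As an illustrative sketch only (the function names, data layout, and structure below are assumptions for exposition, not part of the claimed hardware), the claimed behavior can be modeled in software: each processing element consults a per-channel output sparsity metadata bit, bypasses its multiplier when the channel's output is pruned, and performs a normal multiply-accumulate otherwise.

```python
# Hypothetical software model of one pipeline stage of a systolic array
# with structured output sparsity. In hardware, a bypassed channel's
# multiplier could additionally be power gated; here the bypass simply
# passes the accumulator through unchanged.

def pe_step(meta_bit: int, a: float, b: float, acc: float) -> float:
    """One multiply-accumulate step for a single processing channel."""
    if meta_bit == 0:
        return acc            # output pruned: bypass multiplication
    return acc + a * b        # normal multiply-add

def stage_step(metadata, a_row, b_col, accumulators):
    """Apply one stage step across all processing channels."""
    return [
        pe_step(m, a, b, acc)
        for m, a, b, acc in zip(metadata, a_row, b_col, accumulators)
    ]

# Channels 0 and 2 are disabled by the metadata; only channel 1 computes.
out = stage_step([0, 1, 0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0], [0.0, 0.0, 0.0])
print(out)
```

Note that the metadata drives the bypass regardless of whether the inputs themselves are zero, matching the claim's requirement that output sparsity metadata is independent of input sparsity.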
Description

RELATED APPLICATION
[0001] The present application claims priority to U.S. Non-Provisional Patent Application No. 17/358,542, filed June 25, 2021 and titled “USING SPARSITY METADATA TO REDUCE SYSTOLIC ARRAY POWER CONSUMPTION,” the entire disclosure of which is hereby incorporated by reference.

FIELD
[0002] This disclosure relates generally to data processing and more particularly to data processing via a matrix accelerator of a parallel or graphics processing unit.

BACKGROUND OF THE DISCLOSURE
[0003] Parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data. More recently, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data. Programmable graphics processors have also been adapted to perform general purpose numerical computing applications, such as high-performance computing (HPC), deep learning (e.g., study of artificial neural networks and related machine learning algorithms), and digital signal processing (DSP). These general-purpose numerical computing applications make extensive use of matrix multiplication computations. Accordingly, programmable portions of parallel and graphics data processing units have been adapted to include processing resources and/or functional units that are configured to perform high-throughput matrix operations, including matrix multiply and add operations or dot product operations.
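To make the workload concrete, a minimal sketch (an assumed example, not taken from the specification) of the matrix multiply-and-add operation, D = A × B + C, that such functional units are built to perform at high throughput:

```python
# Reference matrix multiply-add, D = A @ B + C, computed as a set of
# dot-product accumulations. Hardware matrix accelerators perform the
# same arithmetic across many elements in parallel.

def matmul_add(A, B, C):
    rows, inner, cols = len(A), len(B), len(B[0])
    # Start each output element from the corresponding addend in C.
    D = [[C[i][j] for j in range(cols)] for i in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                D[i][j] += A[i][k] * B[k][j]   # dot-product accumulation
    return D

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(matmul_add(A, B, C))
```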

BRIEF DESCRIPTION OF THE DRAWINGS
[0004] The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements, and in which:
[0005] FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein;
[0006] FIG. 2A-2D illustrate parallel processor components;
[0007] FIG. 3A-3C are block diagrams of graphics multiprocessors and multiprocessor-based GPUs;
[0008] FIG. 4A-4F illustrate an exemplary architecture in which a plurality of GPUs is communicatively coupled to a plurality of multi-core processors;
[0009] FIG. 5 illustrates a graphics processing pipeline;
[0010] FIG. 6 illustrates a machine learning software stack;
[0011] FIG. 7 illustrates a general-purpose graphics processing unit;
[0012] FIG. 8 illustrates a multi-GPU computing system;
[0013] FIG. 9A-9B illustrate layers of exemplary deep neural networks;
[0014] FIG. 10 illustrates an exemplary recurrent neural network;
[0015] FIG. 11 illustrates training and deployment of a deep neural network;
[0016] FIG. 12A is a block diagram illustrating distributed learning;
[0017] FIG. 12B is a block diagram illustrating a programmable network interface and data processing unit;
[0018] FIG. 13 illustrates an exemplary inferencing system on a chip (SOC) suitable for performing inferencing using a trained model;
[0019] FIG. 14 is a block diagram of a processing system;
[0020] FIG. 15A-15C illustrate computing systems and graphics processors;
[0021] FIG. 16A-16C illustrate block diagrams of additional graphics processor and compute accelerator architectures;
[0022] FIG. 17 is a block diagram of a graphics processing engine of a graphics processor;
[0023] FIG. 18A-18B illustrate thread execution logic including an array of processing elements employed in a graphics processor core;
[0024] FIG. 19 illustrates an additional execution unit;
[0025] FIG. 20 is a block diagram illustrating graphics processor instruction formats;
[0026] FIG. 21 is a block diagram of an additional graphics processor architecture;
[0027] FIG. 22A-22B illustrate a graphics processor command format and command sequence;
[0028] FIG. 23 illustrates exemplary graphics software architecture for a data processing system;
[0029] FIG. 24A is a block diagram illustrating an IP core development system;
[0030] FIG. 24B illustrates a cross-section side view of an integrated circuit package assembly;
[0031] FIG. 24C illustrates a package assembly that includes multiple units of hardware logic chiplets connected to a substrate (e.g., base die);
[0032] FIG. 24D illustrates a package assembly including interchangeable chiplets;
[0001] FIG. 25 is a block diagram illustrating an exemplary system on a chip integrated circuit;
[0033] FIG. 26A-26B are block diagrams illustrating exemplary graphics processors for use within an SoC;
[0034] FIG. 27 is a block diagram of a data processing system, according to an embodiment;
[0035] FIG. 28A-28B illustrate a matrix operation performed by an instruction pipeline, according to an embodiment;
[0036] FIG. 29 illustrates a systolic array including multiplier and adder circuits organized in a pipelined fashion;
[0037] FIG. 30A-30B illustrate the use of a systolic array that can be configured to execute operations at an arbitrary systolic depth;
[0038] FIG. 31 illustrates a two-path matrix multiply accelerator in which each path has a depth of four stages;
[0039] FIG. 32 illustrates a four-path matrix multiply accelerator in which each path has a depth of two stages;
[0040] FIG. 33 illustrates a scalable sparse matrix multiply accelerator using systolic arrays with feedback inputs;
[0041] FIG. 34 shows a scalable sparse matrix multiply accelerator using systolic arrays with feedback inputs and outputs on each stage;
[0042] FIG. 35A-35B illustrate the use of output sparsity metadata to disable processing channels of a systolic array;
[0043] FIG. 36 illustrates metadata for matrix multiplication operations that include half precision matrix elements;
[0044] FIG. 37 illustrates metadata as depicted in matrix form and as stored within a metadata register;
[0045] FIG. 38 illustrates a processing element having structured output sparsity support;
[0046] FIG. 39A-39B illustrate snapshots of processing elements at cycle zero and cycle one of instruction execution when output sparsity is enabled;
[0047] FIG. 40 is a flow chart of a method performed by a systolic array to reduce power consumption using output sparsity metadata;
[0048] FIG. 41 illustrates a method of performing processing operations for a machine learning model using output sparsity;
[0049] FIG. 42 is a flow chart of a method of generating output sparsity metadata based on a sparsity percentage; and
[0050] FIG. 43 is a block diagram of a computing device including a graphics processor, according to an embodiment.

DETAILED DESCRIPTION
[0051] A graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate, for example, graphics operations, machine-learning operations, pattern analysis operations, and/or various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or another interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). Alternatively, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal processor bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.

Documents

Application Documents

# Name Date
1 202244014140-FORM 1 [16-03-2022(online)].pdf 2022-03-16
2 202244014140-DRAWINGS [16-03-2022(online)].pdf 2022-03-16
3 202244014140-DECLARATION OF INVENTORSHIP (FORM 5) [16-03-2022(online)].pdf 2022-03-16
4 202244014140-COMPLETE SPECIFICATION [16-03-2022(online)].pdf 2022-03-16
5 202244014140-FORM-26 [11-07-2022(online)].pdf 2022-07-11
6 202244014140-FORM 3 [14-03-2023(online)].pdf 2023-03-14
7 202244014140-FORM 3 [14-09-2023(online)].pdf 2023-09-14
8 202244014140-FORM 3 [14-03-2024(online)].pdf 2024-03-14
9 202244014140-FORM 18 [18-06-2025(online)].pdf 2025-06-18