Abstract: A matrix processing engine is provided for efficient matrix computation performed by a dense matrix compute circuit (performing SIMD operations) and a scalar computing core (performing SISD operations). These two processing components operate together to produce output data tiles by feeding results of the dense SIMD operations to the scalar computing core using thread packing and an in-line buffer for accumulating and packing the dense result data. This permits the scalar computing core to spawn threads to operate on the dense results as they become available and without requiring partial or intermediate data reads/writes between the dense and scalar computations.
Description
Cross-Reference to Related Applications
[0001] The present application claims priority to Indian Provisional Patent Application No. 202141049577, filed on 29 October 2021 and titled “METHODS FOR TIGHT COUPLING OF VECTOR AND PROGRAMMABLE SCALAR ENGINES FOR EFFICIENT MATRIX ALGEBRA AND DNN PROCESSING,” the entire disclosure of which is hereby incorporated by reference.
[0002] The present application claims priority to U.S. Non-Provisional Patent Application No. 17/560,100, filed on 22 December 2021 and titled “MATRIX PROCESSING ENGINE WITH COUPLED DENSE AND SCALAR COMPUTE,” the entire disclosure of which is hereby incorporated by reference.
Technical Field
[0003] This disclosure relates generally to matrix processing, and particularly to efficient matrix algebra using dense matrix computation in combination with configurable scalar computation.
Background
[0004] Many algorithms in Computer Vision (CV) applications and typical Artificial Intelligence (AI) workloads combine matrix multiplication with various scalar operations. Matrix-multiplication stages and scalar-operation stages are often interleaved, with the output of one stage fed as input to another. Cholesky decomposition and triangular matrix solve are examples of such matrix processing algorithms, where square-root and division operations are used as scalar operations to compute final values of diagonal and off-diagonal elements, respectively. These equations combine matrix multiplication with per-element scalar operations to calculate results. Similarly, in neural network processing, while certain neural network layer operations, such as a convolutional filter, may be mapped to a matrix-multiply (multiply-and-accumulate) function, many other operations in neural networks, such as pooling, normalization, or activation functions, typically need to be performed on a scalar computing core. The output of these operations may then be used as input to matrix-multiplication operations for a next layer’s compute.
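To make the interleaving of stages concrete, the following illustrative sketch (not part of the claimed engine) implements Cholesky decomposition in plain Python. Each output element first requires a multiply-and-accumulate over previously computed entries, followed by a scalar square root (for a diagonal element) or a scalar division (for an off-diagonal element), exactly the combination of matrix-multiply and per-element scalar stages described above.

```python
import math

def cholesky(a):
    """Return the lower-triangular factor L such that A = L * L^T.

    Illustrates the two interleaved stages described above:
    a multiply-and-accumulate (dot product) over prior results,
    then a scalar square-root or division per element.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Matrix-multiply-style stage: accumulate products of
            # previously computed entries of L.
            acc = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                # Scalar stage for a diagonal element: square root.
                l[i][j] = math.sqrt(a[i][i] - acc)
            else:
                # Scalar stage for an off-diagonal element: division.
                l[i][j] = (a[i][j] - acc) / l[j][j]
    return l
```

Because the scalar result for each column feeds the accumulation for later columns, the two stages cannot simply be run one after the other over the whole matrix, which is why fine-grained coupling between the dense and scalar compute is valuable here.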
[0005] Vector operations such as matrix multiplication are often offloaded to a dedicated engine for performance and energy-efficiency reasons. However, a unified architecture that maps various matrix operations along with different flavors of scalar operations (e.g., activation functions), and that provides fine-grained data coupling between vector and scalar operations, typically poses significant mapping challenges due to frequent data movements, operand latency, and synchronization issues.
[0006] Custom accelerator designs may be used with fixed operations and dedicated internal data paths. However, many real-world use cases require multiple types of matrix and/or DNN functions, which complicates accelerator solutions and makes fixed devices inefficient for more general purposes. In addition, more general solutions are often inefficient in terms of chip-area cost, resource utilization, and energy.
[0007] What is needed is an architecture that maximizes compute resource utilization and energy efficiency while allowing flexible mapping of the diverse matrix operations pervasive in modern AI/CV applications, thereby achieving high performance per watt at low cost.
[0008] None of the prior solutions provide a matrix processing engine that comprehensively addresses the requirements spanning performance per watt, performance per unit area, flexibility to map diverse matrix processing equations, and architectural effectiveness for scaled-up configurations. Most often, existing solutions perform multiply-and-accumulate (MAC) operations separately, and the results of the MAC operations are moved off of the matrix processor, with the remaining logic of the equations performed by another device (e.g., a host processor or similar compute elements), which compromises efficiency and the programming model.
Brief Description of the Drawings
[0009] Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
[0010] FIG. 1 shows an example matrix processing engine, according to one embodiment.
[0011] FIGS. 2-3 show an example tiling of operands and preparation of thread packets for processing by the scalar computing core to generate output data, according to one embodiment.
[0012] FIG. 4 shows an example configuration of the dense matrix compute circuit, according to one embodiment.
[0013] FIGS. 5-6 show an example scalar processing core and a supported instruction format, according to one embodiment.
[0014] FIG. 7 shows an example execution workflow for generating an output tile, according to one embodiment.
[0015] FIG. 8 shows an implementation of a tiling algorithm for matrix-matrix multiplication (SGEMM), according to one embodiment.
[0016] FIG. 9 shows an example tiling of a convolution algorithm, according to one embodiment.
[0017] FIG. 10 shows an example tiling algorithm of Cholesky decomposition, according to one embodiment.
[0018] FIGS. 11-12 show comparative performance of one embodiment of the matrix processor relative to other computing circuits for performing matrix equations.
[0019] FIG. 13 shows example physical synthesis of the MxCore embodiment.
[0020] FIG. 14 is a block diagram of an example computing device that may include one or more components in accordance with any of the embodiments disclosed herein.
Detailed Description
Overview
[0021] The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.
[0022] This disclosure includes an architecture for a matrix processing engine that effectively combines efficient matrix computation of a dense matrix compute circuit with a scalar computing core. The dense matrix compute circuit may be a single-instruction-multiple-data (SIMD) computing device, which performs a single instruction on multiple data sets. Similarly, the scalar computing core may be a single-instruction-single-data (SISD) computing device, which performs individual operations on individual data sequentially according to its instructions. The SISD device may also be capable of parallel execution of multiple different instructions on different data (e.g., it may be multithreaded and permit out-of-order execution based on dependencies). The dense matrix compute circuit may thus also be referred to herein as a SIMD core or SIMD circuit, and similarly the scalar computing core may be referred to as a SISD core or SISD circuit.
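As a functional illustration only, the following Python sketch models the dense-to-scalar data flow described above. All names (dense_compute, pack_threads, scalar_thread, the packet width) are hypothetical and chosen for clarity; they do not correspond to the engine's actual interfaces.

```python
# Hypothetical model of the dense-to-scalar pipeline: a SIMD-style
# dense stage, in-line packing of results into thread packets, and
# SISD-style per-thread scalar processing of each packet.

def dense_compute(a, b):
    """SIMD-style stage: one operation (multiply-accumulate) applied
    across many data elements to produce a dense result tile."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def pack_threads(result, width):
    """In-line packing: group dense results into fixed-width thread
    packets as they become available, instead of writing an
    intermediate tile back to memory."""
    flat = [x for row in result for x in row]
    return [flat[i:i + width] for i in range(0, len(flat), width)]

def scalar_thread(packet, op):
    """SISD-style stage: a spawned thread applies a configurable
    per-element scalar operation to its packet."""
    return [op(x) for x in packet]

# Example: matrix multiply followed by a ReLU activation per element.
a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
packets = pack_threads(dense_compute(a, b), width=2)
outputs = [scalar_thread(p, lambda x: max(x, 0.0)) for p in packets]
```

In the actual engine, the packing happens in hardware via an in-line buffer, and the scalar core may execute the per-packet threads concurrently rather than in a sequential loop as modeled here.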
Claims
1. A computing device comprising:
a dense matrix compute circuit configured to receive a first dense operand and a second dense operand and perform an operation on the first dense operand and the second dense operand to generate a dense compute result;
an operand packing circuit configured to receive the dense compute result and generate a set of thread packets based on the dense compute result; and
a scalar computing core configured to receive the set of thread packets and execute a corresponding set of processing threads, the scalar computing core executing a processing thread by loading the associated thread packet to a set of registers and executing a set of configurable instructions with respect to the set of registers to generate one or more outputs.