Abstract: Systems, methods, and apparatuses relating to a matrix operations accelerator are described. In one embodiment, a processor includes a matrix operations accelerator circuit that includes a two-dimensional grid of fused multiply accumulate circuits and that is switchable to a scheduling mode for execution of a decoded single instruction in which the matrix operations accelerator circuit: loads a first buffer of the two-dimensional grid of fused multiply accumulate circuits from a first plurality of registers that represents a first input two-dimensional matrix; checks whether a second buffer of the two-dimensional grid of fused multiply accumulate circuits stores an immediately prior input two-dimensional matrix that is the same as a second input two-dimensional matrix from a second plurality of registers that represents the second input two-dimensional matrix; when the second buffer of the two-dimensional grid of fused multiply accumulate circuits stores the immediately prior input two-dimensional matrix, from execution of a previous instruction, that is the same as the second input two-dimensional matrix: prevents reclamation of the second buffer between execution of the previous instruction and the decoded single instruction, performs an operation on the first input two-dimensional matrix from the first buffer and the immediately prior input two-dimensional matrix from the second buffer to produce a resultant, and stores the resultant in resultant storage; and when the second buffer of the two-dimensional grid of fused multiply accumulate circuits does not store the immediately prior input two-dimensional matrix, from execution of the previous instruction, that is the same as the second input two-dimensional matrix: loads the second input two-dimensional matrix into the second buffer of the two-dimensional grid of fused multiply accumulate circuits, performs the operation on the first input two-dimensional matrix from the first buffer and the second input two-dimensional matrix from the second buffer to produce the resultant, and stores the resultant in the resultant storage.
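The buffer-reuse behavior described in the abstract amounts to a simple cache check on the accelerator's second input buffer: reload it only when the incoming second source matrix differs from the one the previous instruction used. A minimal sketch of that logic follows; all names here (`Accelerator`, `matmul`, `b_id`) are illustrative assumptions, not identifiers from the disclosure, and plain Python lists stand in for the FMA grid and its buffers.

```python
# Hypothetical sketch of the second-buffer reuse check; not the actual
# accelerator microarchitecture. Matrices are lists of lists of numbers.

def matmul(a, b):
    # Plain matrix multiply standing in for the FMA-grid operation.
    inner, cols = len(b), len(b[0])
    return [[sum(a_row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for a_row in a]

class Accelerator:
    def __init__(self):
        self.buffer_a = None     # first buffer of the FMA grid
        self.buffer_b = None     # second buffer (candidate for reuse)
        self.buffer_b_id = None  # identity of the matrix last loaded into buffer_b

    def execute(self, matrix_a, matrix_b, b_id):
        # The first input matrix is always loaded into the first buffer.
        self.buffer_a = matrix_a
        # Reuse the second buffer only if it still holds the same matrix the
        # previous instruction used (i.e., it was not reclaimed in between);
        # otherwise load the second input matrix into it.
        if self.buffer_b_id != b_id:
            self.buffer_b = matrix_b
            self.buffer_b_id = b_id
        # Perform the operation and return the resultant.
        return matmul(self.buffer_a, self.buffer_b)
```

On the reuse path the `matrix_b` argument is never read, which mirrors the point of the optimization: the register-to-buffer load is skipped entirely when the prior contents can be kept.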
Description:
RELATED APPLICATION
[0001] This patent application is related to India Patent Application No. 202044039434, filed on 11 September 2020, entitled “APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS OF A MATRIX OPERATIONS ACCELERATOR”.
[0002] The present application claims priority to U.S. Non-Provisional Patent Application No. 16/729,361, filed December 28, 2019 and titled “APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS OF A MATRIX OPERATIONS ACCELERATOR”, the entire disclosure of which is hereby incorporated by reference.
TECHNICAL FIELD
[0003] The disclosure relates generally to computer processor architecture, and, more specifically, to apparatuses, systems, and methods for executing instructions to perform a matrix operation using a matrix operations accelerator circuit.
BACKGROUND
[0004] A processor, or set of processors, executes instructions from an instruction set, e.g., the instruction set architecture (ISA). The instruction set is the part of the computer architecture related to programming, and generally includes the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). It should be noted that the term instruction herein may refer to a macro-instruction, e.g., an instruction that is provided to the processor for execution, or to a micro-instruction, e.g., an instruction that results from a processor’s decoder decoding macro-instructions.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The present disclosure is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
[0006] Figure 1A illustrates an embodiment of configured tiles according to embodiments of the disclosure.
[0007] Figure 1B illustrates an embodiment of configured tiles according to embodiments of the disclosure.
[0008] Figure 2 illustrates several examples of matrix storage according to embodiments of the disclosure.
[0009] Figure 3 illustrates an embodiment of a system utilizing a matrix (tile) operations accelerator according to embodiments of the disclosure.
[00010] Figures 4 and 5 show different embodiments of how memory is shared using a matrix operations accelerator.
[00011] Figure 6 illustrates an embodiment of matrix multiply accumulate operation using tiles (“TMMA”).
[00012] Figure 7 illustrates an embodiment of a subset of the execution of an iteration of a chained fused multiply accumulate instruction.
[00013] Figure 8 illustrates an embodiment of a subset of the execution of an iteration of a chained fused multiply accumulate instruction.
[00014] Figure 9 illustrates an embodiment of a subset of the execution of an iteration of a chained fused multiply accumulate instruction.
[00015] Figure 10 illustrates an embodiment of a subset of the execution of an iteration of chained fused multiply accumulate instruction.
[00016] Figure 11 illustrates power-of-two sized SIMD implementations wherein the accumulators use input sizes that are larger than the inputs to the multipliers according to an embodiment.
[00017] Figure 12 illustrates an embodiment of a system utilizing matrix operations circuitry.
[00018] Figure 13 illustrates an embodiment of a processor core pipeline supporting matrix operations using tiles.
[00019] Figure 14 illustrates an embodiment of a processor core pipeline supporting matrix operations using tiles.
[00020] Figure 15 illustrates an example of a matrix expressed in row major format and column major format.
[00021] Figure 16 illustrates an example of usage of matrices (tiles).
[00022] Figure 17 illustrates an embodiment of a method of usage of matrices (tiles).
[00023] Figure 18 illustrates support for configuration of the usage of tiles according to an embodiment.
[00024] Figure 19 illustrates an embodiment of a description of the matrices (tiles) to be supported.
[00025] Figures 20(A)-(D) illustrate examples of register(s).
[00026] Figure 21 illustrates an embodiment of a system utilizing a matrix (tile) operations accelerator according to embodiments of the disclosure.
[00027] Figure 22 illustrates a matrix operations accelerator circuit comprising a two-dimensional grid of processing element circuits according to embodiments of the disclosure.
[00028] Figure 23 illustrates dispatch circuitry of a matrix operations accelerator circuit according to embodiments of the disclosure.
[00029] Figure 24 illustrates scheduling circuitry of dispatch circuitry of a matrix operations accelerator circuit according to embodiments of the disclosure.
[00030] Figure 25 illustrates scheduling circuitry, of dispatch circuitry of a matrix operations accelerator circuit, that is switchable from a baseline scheduling mode to a scheduling mode that reuses an input matrix according to embodiments of the disclosure.
[00031] Figure 26 illustrates dispatch circuitry of a matrix operations accelerator circuit for multiple passes according to embodiments of the disclosure.
[00032] Figure 27 illustrates scheduling circuitry of dispatch circuitry of a matrix operations accelerator circuit for multiple passes according to embodiments of the disclosure.
[00033] Figure 28 illustrates pseudocode for matrix operations circuitry according to embodiments of the disclosure.
[00034] Figure 29 illustrates a method of processing a matrix operation instruction according to embodiments of the disclosure.
[00035] Figure 30A is a block diagram illustrating a generic vector friendly instruction format and class A instruction templates thereof according to embodiments of the disclosure.
[00036] Figure 30B is a block diagram illustrating the generic vector friendly instruction format and class B instruction templates thereof according to embodiments of the disclosure.
[00037] Figure 31A is a block diagram illustrating fields for the generic vector friendly instruction formats in Figures 30A and 30B according to embodiments of the disclosure.
[00038] Figure 31B is a block diagram illustrating the fields of the specific vector friendly instruction format in Figure 31A that make up a full opcode field according to one embodiment of the disclosure.
[00039] Figure 31C is a block diagram illustrating the fields of the specific vector friendly instruction format in Figure 31A that make up a register index field according to one embodiment of the disclosure.
[00040] Figure 31D is a block diagram illustrating the fields of the specific vector friendly instruction format in Figure 31A that make up the augmentation operation field 3050 according to one embodiment of the disclosure.
[00041] Figure 32 is a block diagram of a register architecture according to one embodiment of the disclosure.
[00042] Figure 33A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the disclosure.
[00043] Figure 33B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the disclosure.
[00044] Figure 34A is a block diagram of a single processor core, along with its connection to the on-die interconnect network and with its local subset of the Level 2 (L2) cache, according to embodiments of the disclosure.
[00045] Figure 34B is an expanded view of part of the processor core in Figure 34A according to embodiments of the disclosure.
[00046] Figure 35 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the disclosure.
[00047] Figure 36 is a block diagram of a system in accordance with one embodiment of the present disclosure.
[00048] Figure 37 is a block diagram of a more specific exemplary system in accordance with an embodiment of the present disclosure.
[00049] Figure 38 is a block diagram of a second more specific exemplary system in accordance with an embodiment of the present disclosure.
[00050] Figure 39 is a block diagram of a system on a chip (SoC) in accordance with an embodiment of the present disclosure.
[00051] Figure 40 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the disclosure.
DETAILED DESCRIPTION
[00052] In the following description, numerous specific details are set forth. However, it is understood that embodiments may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.
[00053] References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Claims:
1. An apparatus, comprising:
a first one or more vector registers to store a first plurality of source matrix data elements;
a second one or more vector registers to store a second plurality of source matrix data elements, each source matrix data element of the first and second plurality of source matrix data elements having a first data element width;
a third one or more vector registers to store a plurality of accumulation matrix data elements, each accumulation matrix data element of the plurality of accumulation matrix data elements having a second data element width which is at least twice the first data element width;
matrix processing circuitry operable in a plurality of processing lanes, the matrix processing circuitry to execute a single matrix instruction to perform a corresponding plurality of multiplications in the plurality of processing lanes; and
operand routing circuitry to broadcast a first source matrix data element of the first plurality of source matrix data elements to multiple processing lanes of the plurality of processing lanes in accordance with the single matrix instruction, wherein in each processing lane of the multiple processing lanes, the matrix processing circuitry is to perform a corresponding multiplication of the first source matrix data element and a different data element of the second plurality of source matrix data elements to produce a corresponding product, the corresponding product to be added to a corresponding accumulation matrix data element and one or more other products of multiplications of respective data elements of the first and second plurality of source matrix data elements to generate a corresponding result matrix data element having the second data element width.
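The dataflow recited in claim 1 is a broadcast multiply-accumulate: one element of the first source matrix is fanned out to every processing lane, each lane multiplies it with a different element of the second source, and the products are summed into wider accumulators. The sketch below illustrates that dataflow only; the function names and the outer-product loop ordering are illustrative assumptions, and Python integers stand in for the narrow sources and wide accumulators.

```python
# Hypothetical illustration of the claimed broadcast multiply-accumulate;
# lanes are modeled as positions in a Python list.

def broadcast_fma(a_elem, b_row, acc_row):
    # a_elem is broadcast to every lane; each lane multiplies it with a
    # different element of b_row and adds the product to its accumulator.
    return [acc + a_elem * b for acc, b in zip(acc_row, b_row)]

def matmul_by_broadcast(A, B, C):
    # Computes C[i][j] += sum_k A[i][k] * B[k][j] as a series of broadcasts:
    # each scalar A[i][k] is broadcast across the lanes holding row k of B,
    # accumulating into the lanes holding row i of C.
    for i in range(len(A)):
        for k in range(len(B)):
            C[i] = broadcast_fma(A[i][k], B[k], C[i])
    return C
```

Each `broadcast_fma` call corresponds to one broadcast of a single source element to all lanes; a full matrix product is built up from one such broadcast per element of the first source matrix.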
| # | Name | Date |
|---|---|---|
| 1 | 202345085209-POWER OF AUTHORITY [13-12-2023(online)].pdf | 2023-12-13 |
| 2 | 202345085209-FORM 1 [13-12-2023(online)].pdf | 2023-12-13 |
| 3 | 202345085209-DRAWINGS [13-12-2023(online)].pdf | 2023-12-13 |
| 4 | 202345085209-DECLARATION OF INVENTORSHIP (FORM 5) [13-12-2023(online)].pdf | 2023-12-13 |
| 5 | 202345085209-COMPLETE SPECIFICATION [13-12-2023(online)].pdf | 2023-12-13 |
| 6 | 202345085209-FORM 18 [23-02-2024(online)].pdf | 2024-02-23 |
| 7 | 202345085209-FORM 3 [11-06-2024(online)].pdf | 2024-06-11 |
| 8 | 202345085209-FER.pdf | 2025-07-28 |
| 9 | 202345085209-FORM 3 [01-09-2025(online)].pdf | 2025-09-01 |
| 1 | 202345085209_SearchStrategyNew_E_Search085209E_15-07-2025.pdf | |