Abstract: An apparatus to facilitate supporting 8-bit floating point format operands in a computing architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that operates on 8-bit floating point operands to cause the processor to perform a parallel dot product operation; a controller to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating point data format indicated by the decoded instruction; and systolic dot product circuitry to execute the decoded instruction using systolic layers, each systolic layer comprising one or more sets of interconnected multipliers, shifters, and adders, each set of multipliers, shifters, and adders to generate a dot product of the 8-bit floating point operands.
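The abstract describes systolic layers of multipliers and adders that each produce a partial dot product which is accumulated across layers. The following is a minimal functional sketch of that dataflow, not the hardware datapath; the `elems_per_layer` parameter and the slicing of the inputs across layers are illustrative assumptions, not details taken from the disclosure.

```python
def systolic_dot_accumulate(src1, src2, acc, elems_per_layer=4):
    """Reference model of a systolic dot-product-accumulate.

    Each 'layer' takes a slice of the input vectors, multiplies paired
    elements (one multiplier per pair), sums the products (the adder
    stage), and passes the running partial sum on to the next layer.
    """
    partial = acc
    for i in range(0, len(src1), elems_per_layer):
        layer_a = src1[i:i + elems_per_layer]
        layer_b = src2[i:i + elems_per_layer]
        # multiply stage followed by the layer's adder stage
        partial += sum(a * b for a, b in zip(layer_a, layer_b))
    return partial
```

In hardware the layers operate in a pipelined fashion on successive operand waves; this scalar model only captures the arithmetic each wave performs.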
Description:
FIELD
[0002] This document relates generally to data processing and more particularly to supporting 8-bit floating point format operands in a computing architecture.
BACKGROUND
[0003] Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data; however, more recently, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.
[0004] To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple data (SIMD) or single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMD architecture, computers with multiple processing elements attempt to perform the same operation on multiple data points simultaneously. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency.
[0005] Graphics processors are often utilized for applications in the fields of artificial intelligence (AI) and machine learning (ML). Advances in these fields have enabled ML models to take advantage of low-precision arithmetic for training neural networks. Conventional training platforms support floating point 16 (FP16) and brain floating point 16 (bfloat16 or BF16) data formats in high-performance systolic array implementations. Recent advances have been made to support training of deep neural networks using lower precision data formats, such as 8-bit data formats. However, conventional systems provide no hardware support for performing operations using 8-bit floating point format operands.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] So that the manner in which the above recited features of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments and are therefore not to be considered limiting of their scope.
[0007] FIG. 1 is a block diagram of a processing system.
[0008] FIGS. 2A-2D illustrate computing systems and graphics processors.
[0009] FIGS. 3A-3C illustrate block diagrams of additional graphics processor and compute accelerator architectures.
[0010] FIG. 4 is a block diagram of a graphics processing engine of a graphics processor.
[0011] FIGS. 5A-5B illustrate thread execution logic including an array of processing elements employed in a graphics processor core.
[0012] FIG. 6 illustrates an additional execution unit.
[0013] FIG. 7 is a block diagram illustrating graphics processor instruction formats.
[0014] FIG. 8 is a block diagram of an additional graphics processor architecture.
[0015] FIGS. 9A-9B illustrate a graphics processor command format and command sequence.
[0016] FIG. 10 illustrates example graphics software architecture for a data processing system.
[0017] FIG. 11A is a block diagram illustrating an IP core development system.
[0018] FIG. 11B illustrates a cross-section side view of an integrated circuit package assembly.
[0019] FIG. 11C illustrates a package assembly that includes multiple units of hardware logic chiplets connected to a substrate (e.g., base die).
[0020] FIG. 11D illustrates a package assembly including interchangeable chiplets.
[0021] FIG. 12 is a block diagram illustrating an example system on a chip integrated circuit.
[0022] FIGS. 13A-13B are block diagrams illustrating example graphics processors for use within an SoC.
[0023] FIG. 14 is a block diagram of a data processing system, according to an embodiment.
[0024] FIG. 15 is a block diagram illustrating a brain-float 8 (BFLOAT8 or BF8) binary format, in accordance with embodiments.
[0025] FIG. 16 is a block diagram illustrating a systolic DP 8-bit FP format operation performed by an instruction pipeline, according to embodiments.
[0026] FIGS. 17A-17B are block diagrams illustrating a systolic array circuit to perform systolic dot product accumulate on 8-bit floating point format input operands, in accordance with embodiments.
[0027] FIG. 18A illustrates an instruction executable by a systolic array circuit, according to embodiments described herein.
[0028] FIG. 18B illustrates a program code compilation process, according to an embodiment.
[0029] FIG. 19 is a flow diagram illustrating an embodiment of a method for executing an instruction for systolic dot product accumulate on 8-bit floating point format input operands.
[0030] FIG. 20 is a flow diagram illustrating an embodiment of a method for systolic dot product accumulate on 8-bit floating point format input operands.
[0031] FIG. 21 is a block diagram illustrating an 8-bit FP format conversion operation performed by an instruction pipeline, according to embodiments.
[0032] FIG. 22A illustrates an instruction executable by a processing unit, according to embodiments described herein.
[0033] FIG. 22B illustrates a program code compilation process, according to an embodiment.
[0034] FIG. 23 is a flow diagram illustrating an embodiment of a method for executing an instruction for converting floating point data to 8-bit floating point format data.
[0035] FIG. 24 is a flow diagram illustrating an embodiment of a method for converting floating point data to 8-bit floating point format data.
[0036] FIG. 25 is a block diagram illustrating an 8-bit FP format conversion with stochastic rounding operation performed by an instruction pipeline, according to embodiments.
[0037] FIG. 26 is a block diagram illustrating fixed-point addition of sign-magnitude representation of the mantissa and the random number, in accordance with embodiments.
[0038] FIG. 27A illustrates an instruction executable by a processing unit, according to embodiments described herein.
[0039] FIG. 27B illustrates a program code compilation process, according to an embodiment.
[0040] FIG. 28 is a flow diagram illustrating an embodiment of a method for executing an instruction for performing efficient stochastic rounding on floating point values.
[0041] FIG. 29 is a flow diagram illustrating an embodiment of a method for performing efficient stochastic rounding on floating point values.
[0042] FIG. 30 is a block diagram illustrating two 8-bit floating point formats that use a different binary encoding and exponent bias, in accordance with embodiments.
[0043] FIG. 31 is a block diagram illustrating a hybrid 8-bit FP format systolic operation performed by an instruction pipeline, according to embodiments.
[0044] FIG. 32 is a block diagram illustrating a hybrid FMA unit of a systolic array circuit to perform hybrid floating point systolic operations, in accordance with embodiments.
[0045] FIG. 33A illustrates an instruction executable by a systolic array circuit, according to embodiments described herein.
[0046] FIG. 33B illustrates a program code compilation process, according to an embodiment.
[0047] FIG. 34 is a flow diagram illustrating an embodiment of a method for executing an instruction for hybrid floating point systolic operations.
[0048] FIG. 35 is a flow diagram illustrating an embodiment of a method for hybrid floating point systolic operations.
[0049] FIG. 36 is a block diagram illustrating a mixed mode 8-bit FP format operation performed by an instruction pipeline, according to embodiments.
[0050] FIG. 37 shows an example schematic representation of a hardware circuit to perform a mixed mode MAC operation using at least one 8-bit FP format operand, in accordance with embodiments.
[0051] FIG. 38A illustrates a set of instructions executable by a processing unit, according to embodiments described herein.
[0052] FIG. 38B illustrates a program code compilation process, according to an embodiment.
[0053] FIG. 39 is a flow diagram illustrating an embodiment of a method for executing an instruction to perform mixed mode operations with 8-bit floating point format operands.
[0054] FIG. 40 is a flow diagram illustrating an embodiment of a method for performing mixed mode operations with 8-bit floating point format operands.
DETAILED DESCRIPTION
[0055] A graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate, for example, graphics operations, machine-learning operations, pattern analysis operations, and/or various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or another interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). Alternatively, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal processor bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.
Claims:
1. An apparatus, comprising:
an integrated circuit chip, comprising:
a plurality of registers to store a plurality of data elements, including 8-bit floating point data elements and 32-bit floating point data elements;
decode circuitry to decode a single matrix instruction having fields to indicate an opcode and locations of a first source matrix including a first plurality of 8-bit floating point data elements encoded in a first 8-bit floating point format, a second source matrix including a second plurality of 8-bit floating point data elements encoded in a second 8-bit floating point format, and a third source matrix including a plurality of 32-bit floating point data elements,
wherein the first 8-bit floating point format comprises a sign bit, a 5-bit exponent value and a 2-bit mantissa value, and the second 8-bit floating point format comprises a sign bit, a 4-bit exponent value and a 3-bit mantissa value; and
execution circuitry including a matrix accelerator to accelerate matrix operations, wherein, responsive to the single matrix instruction, the execution circuitry is to generate a plurality of products based on the first plurality of 8-bit floating point data elements of the first source matrix and the second plurality of 8-bit floating point data elements of the second source matrix, and accumulate each product of the plurality of products with a corresponding 32-bit floating point data element of the third source matrix to generate a corresponding 32-bit floating point result data element of a result matrix.
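The claim pairs two 8-bit encodings: one with a 5-bit exponent and 2-bit mantissa (often called E5M2) and one with a 4-bit exponent and 3-bit mantissa (E4M3), with products accumulated into FP32. The sketch below decodes both formats and performs the claimed multiply-accumulate in software. The exponent biases (15 for E5M2, 7 for E4M3) are assumptions based on common FP8 conventions, not values stated in the claim, and special values (Inf/NaN) and saturation are omitted for brevity.

```python
def decode_fp8(byte, exp_bits, man_bits, bias):
    """Decode an 8-bit float with the given field widths and exponent bias."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decode_e5m2(b):
    # sign, 5-bit exponent, 2-bit mantissa; bias 15 assumed
    return decode_fp8(b, 5, 2, 15)

def decode_e4m3(b):
    # sign, 4-bit exponent, 3-bit mantissa; bias 7 assumed
    return decode_fp8(b, 4, 3, 7)

def dot_product_accumulate(a_bytes, b_bytes, acc):
    """Products of paired 8-bit elements accumulated into an FP32 value,
    mirroring one output element of the claimed matrix operation."""
    for a, b in zip(a_bytes, b_bytes):
        acc += decode_e5m2(a) * decode_e4m3(b)
    return acc
```

For example, `0x3C` in E5M2 and `0x38` in E4M3 both decode to 1.0 under the assumed biases, so accumulating their product into 0.5 yields 1.5.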
| # | Name | Date |
|---|---|---|
| 1 | 202445073670-POWER OF AUTHORITY [30-09-2024(online)].pdf | 2024-09-30 |
| 2 | 202445073670-FORM 1 [30-09-2024(online)].pdf | 2024-09-30 |
| 3 | 202445073670-DRAWINGS [30-09-2024(online)].pdf | 2024-09-30 |
| 4 | 202445073670-DECLARATION OF INVENTORSHIP (FORM 5) [30-09-2024(online)].pdf | 2024-09-30 |
| 5 | 202445073670-COMPLETE SPECIFICATION [30-09-2024(online)].pdf | 2024-09-30 |
| 6 | 202445073670-FORM 3 [21-03-2025(online)].pdf | 2025-03-21 |
| 7 | 202445073670-FORM 18 [24-03-2025(online)].pdf | 2025-03-24 |