
Bfloat16 Comparison Instructions

Abstract: Techniques for comparing BF16 data elements are described. An exemplary BF16 comparison instruction includes fields for an opcode, an identification of a location of a first packed data source operand, and an identification of a location of a second packed data source operand, wherein the opcode is to indicate that execution circuitry is to perform, for a particular data element position of the packed data source operands, a comparison of a data element at that position, and update a flags register based on the comparison.


Patent Information

Application #: 202244042724
Filing Date: 26 July 2022
Publication Number: 09/2023
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

INTEL CORPORATION
2200 Mission College Boulevard, Santa Clara, California 95054, USA

Inventors

1. ALEXANDER HEINECKE
55 River Oaks Place #701, San Jose, California, 95134, USA
2. MENACHEM ADELMAN
Hatichon 31A Apt.5, Haifa 3229624, Israel
3. ROBERT VALENTINE
Ya'ara 40, Kiryat Tivon 36054, Israel
4. ZEEV SPERBER
32nd Igal Alon St., Zikhron Yaakov 3092832, Israel
5. AMIT GRADSTEIN
16th Hadas St., Binyamina 3052316, Israel
6. MARK CHARNEY
610 Waltham Street, Lexington, Massachusetts 02421, USA
7. EVANGELOS GEORGANAS
1927 Bridgepoint Pkwy, Unit H346, San Mateo, California 94404, USA
8. DHIRAJ KALAMKAR
Intel tech India Pvt Ltd, Devarabeesanahalli Village, Bangalore 560103, India
9. CHRISTOPHER HUGHES
3543 Druffel Place, Santa Clara, California 95051, USA
10. CRISTINA ANDERSON
890 NW Brookhill Street, Hillsboro, Oregon, 97124, USA

Specification

Description

RELATED APPLICATION
[0001] The present application claims priority to U.S. Non-Provisional Patent Application No. 17/463,410, filed on 31 August 2021 and titled “BFLOAT16 COMPARISON INSTRUCTIONS,” the entire disclosure of which is hereby incorporated by reference.

BACKGROUND
[0002] In recent years, fused multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to their extreme computational intensity. Compared to classical IEEE-754 32-bit (FP32) and 64-bit (FP64) arithmetic, this reduced-precision arithmetic can naturally be sped up disproportionately to its shortened width.
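The mixed-precision FMA pattern described above can be sketched as follows. This is a minimal model, not the claimed hardware: it assumes BF16 values are produced by truncating FP32 to its top 16 bits (actual rounding behavior is implementation-specific), and uses a Python float as a stand-in for the wider accumulator.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to its top 16 bits (BF16, round-toward-zero)."""
    (fp32_bits,) = struct.unpack("<I", struct.pack("<f", x))
    return fp32_bits >> 16

def from_bf16_bits(b: int) -> float:
    """Widen a BF16 bit pattern back to FP32 by appending 16 zero fraction bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

def bf16_fma_dot(a, b):
    """Dot product with BF16 inputs and higher-precision accumulation,
    the pattern used by mixed-precision FMA units in deep learning training."""
    acc = 0.0  # stand-in for the higher-precision (e.g., FP32) accumulator
    for x, y in zip(a, b):
        acc += from_bf16_bits(to_bf16_bits(x)) * from_bf16_bits(to_bf16_bits(y))
    return acc
```

The speedup mentioned in the text comes from the multiplier operating on 8-bit mantissas rather than 24-bit ones, while the wide accumulator limits the loss of accuracy across long reductions.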

BRIEF DESCRIPTION OF DRAWINGS
[0003] Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
[0004] FIG. 1 illustrates different floating point representation formats.
[0005] FIG. 2 illustrates an exemplary execution of an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0006] FIG. 3 illustrates an embodiment of a method performed by a processor to process an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0007] FIG. 4 illustrates more detailed embodiments of an execution of an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0008] FIG. 5 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0009] FIG. 6 illustrates an exemplary execution of an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0010] FIG. 7 illustrates an embodiment of a method performed by a processor to process an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0011] FIG. 8 illustrates more detailed embodiments of an execution of an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0012] FIG. 9 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0013] FIG. 10 illustrates an exemplary execution of an instruction to compare values between BF16 data elements of corresponding data element positions of two sources according to a comparison operator.
[0014] FIG. 11 provides examples of comparison operators according to some embodiments.
[0015] FIG. 12 illustrates an embodiment of a method performed by a processor to process an instruction to compare values between BF16 data elements of corresponding data element positions of two sources according to a comparison operator.
[0016] FIG. 13 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to compare values between BF16 data elements of corresponding data element positions of two sources according to a comparison operator.
[0017] FIG. 14 illustrates an exemplary execution of an instruction to compare BF16 values in a particular data element position of a first source operand and a second source operand and set a zero flag, a parity flag, and a carry flag according to a result of the comparison.
[0018] FIG. 15 illustrates an embodiment of a method performed by a processor to process an instruction to compare BF16 values in a particular data element position of a first source operand and a second source operand and set a zero flag, a parity flag, and a carry flag according to a result of the comparison.
[0019] FIG. 16 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to compare BF16 values in a particular data element position of a first source operand and a second source operand and set a zero flag, a parity flag, and a carry flag according to a result of the comparison.
[0020] FIG. 17 illustrates embodiments of hardware to process an instruction such as any of the BF16 compare instructions detailed above. As illustrated, storage 1703 stores at least one BF16 compare instruction 1701 to be executed.
[0021] FIG. 18 illustrates embodiments of an exemplary system.
[0022] FIG. 19 illustrates a block diagram of embodiments of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics.
[0023] FIG. 20(A) is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention.
[0024] FIG. 20(B) is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention.
[0025] FIG. 21 illustrates embodiments of execution unit(s) circuitry, such as execution unit(s) circuitry of FIG. 20(B).
[0026] FIG. 22 is a block diagram of a register architecture according to some embodiments.
[0027] FIG. 23 illustrates embodiments of an instruction format.
[0028] FIG. 24 illustrates embodiments of an addressing field.
[0029] FIG. 25 illustrates embodiments of a first prefix.
[0030] FIGS. 26(A)-(D) illustrate embodiments of how the R, X, and B fields of the first prefix 2301(A) are used.
[0031] FIGS. 27(A)-(B) illustrate embodiments of a second prefix.
[0032] FIG. 28 illustrates embodiments of a third prefix.
[0033] FIG. 29 illustrates a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention.
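The per-element maximum and minimum operations summarized above (FIGS. 2-9) can be sketched as follows. This is a behavioral model only: the BF16 truncation helper and the use of ordinary numeric comparison (with no special NaN or signed-zero handling) are assumptions, since the source defers those details to the figures.

```python
import struct

def bf16(x: float) -> float:
    """Round an FP32 value to BF16 by keeping only its top 16 bits
    (truncation is assumed; hardware rounding may differ)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

def packed_bf16_max(src1, src2):
    """For each data element position, keep the larger of the two BF16 elements."""
    return [max(bf16(a), bf16(b)) for a, b in zip(src1, src2)]

def packed_bf16_min(src1, src2):
    """For each data element position, keep the smaller of the two BF16 elements."""
    return [min(bf16(a), bf16(b)) for a, b in zip(src1, src2)]
```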

DETAILED DESCRIPTION
[0034] The present disclosure relates to methods, apparatus, systems, and non-transitory computer-readable storage media for supporting instructions that perform comparison operations on BF16 data elements.
[0035] BF16 is gaining traction due to its ability to work well in machine learning algorithms, in particular deep learning training. FIG. 1 illustrates different floating point representation formats. In this illustration, the formats are in little endian format, however, in some embodiments, a big endian format is used. The FP32 format 101 has a sign bit (S), an 8-bit exponent, and a 23-bit fraction (a 24-bit mantissa that uses an implicit bit). The FP16 format 103 has a sign bit (S), a 5-bit exponent, and a 10-bit fraction. The BF16 format 105 has a sign bit (S), an 8-bit exponent, and a 7-bit fraction.
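The field widths described above for the FP32 and BF16 formats of FIG. 1 can be checked with a short bit-level sketch. The helper names are illustrative, not from the source; the key point it demonstrates is that BF16 reuses the FP32 sign and 8-bit exponent while keeping only the top 7 of the 23 fraction bits.

```python
import struct

def fp32_fields(x: float):
    """Split an FP32 value into sign (1 bit), exponent (8 bits), fraction (23 bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def bf16_fields(x: float):
    """BF16 keeps the FP32 sign and 8-bit exponent but only the top 7 fraction bits."""
    sign, exp, frac = fp32_fields(x)
    return sign, exp, frac >> 16  # drop the low 16 of the 23 fraction bits
```

Because the exponent field is unchanged, converting FP32 to BF16 preserves dynamic range and only costs fraction precision, which is why the format works well for deep learning training.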
Claims

1. An apparatus comprising:
decode circuitry to decode an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, and an identification of a location of a second packed data source operand, wherein the opcode is to indicate that execution circuitry is to perform, for a particular data element position of the packed data source operands, a comparison of a BF16 data element at that position, and update a flags register based on the comparison; and
the execution circuitry to execute the decoded instruction according to the opcode.
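The flag update recited in claim 1 (and the zero, parity, and carry flags named for FIGS. 14-16) can be sketched as follows. The specific flag encoding is an assumption modeled on x86 COMISS-style conventions (unordered sets all three flags, equality sets only the zero flag, less-than sets only the carry flag); the source names the flags but does not state the encoding here.

```python
import math
import struct

def bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 (top 16 bits; rounding mode assumed)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

def bf16_compare_flags(a: float, b: float):
    """Compare two BF16 elements and return (ZF, PF, CF),
    assuming COMISS-style flag semantics."""
    a, b = bf16(a), bf16(b)
    if math.isnan(a) or math.isnan(b):
        return (1, 1, 1)  # unordered comparison
    if a == b:
        return (1, 0, 0)  # equal
    if a < b:
        return (0, 0, 1)  # first operand less than second
    return (0, 0, 0)      # first operand greater than second
```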

Documents

Application Documents

# Name Date
1 202244042724-FORM 1 [26-07-2022(online)].pdf 2022-07-26
2 202244042724-DRAWINGS [26-07-2022(online)].pdf 2022-07-26
3 202244042724-DECLARATION OF INVENTORSHIP (FORM 5) [26-07-2022(online)].pdf 2022-07-26
4 202244042724-COMPLETE SPECIFICATION [26-07-2022(online)].pdf 2022-07-26
5 202244042724-FORM 3 [21-12-2022(online)].pdf 2022-12-21
6 202244042724-FORM-26 [28-12-2022(online)].pdf 2022-12-28
7 202244042724-FORM 3 [31-01-2023(online)].pdf 2023-01-31
8 202244042724-FORM 18 [25-08-2025(online)].pdf 2025-08-25