Bfloat16 Comparison Instructions

Abstract: Techniques for comparing BF16 data elements are described. An exemplary BF16 comparison instruction includes fields for an opcode, an identification of a location of a first packed data source operand, and an identification of a location of a second packed data source operand, wherein the opcode is to indicate that execution circuitry is to perform, for a particular data element position of the packed data source operands, a comparison of a data element at that position, and update a flags register based on the comparison.

Patent Information

Application #

Filing Date

15 May 2024

Publication Number

22/2024

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

INTEL CORPORATION

2200 Mission College Boulevard, Santa Clara, California 95054, USA

Inventors

1. ALEXANDER HEINECKE

#701 55 River Oaks Place San Jose California USA 95134

2. MENACHEM ADELMAN

Hatichon 31A Apt.5 Haifa Israel 3229624

3. ROBERT VALENTINE

Ya'ara 40 Kiryat Tivon Israel 36054

4. ZEEV SPERBER

32nd Igal Alon St., Zikhron Yaakov Israel 3092832

5. AMIT GRADSTEIN

16th Hadas St., Binyamina Israel 3052316

6. MARK CHARNEY

610 Waltham Street Lexington Massachusetts USA 02421

7. EVANGELOS GEORGANAS

1927 Bridgepoint Pkwy, Unit H346, San Mateo California USA 94404

8. DHIRAJ KALAMKAR

Intel tech India Pvt Ltd Devarabeesanahalli Village Bangalore Karnataka India 560103

9. CHRISTOPHER HUGHES

3543 Druffel Place Santa Clara California USA 95051

10. CRISTINA ANDERSON

890 NW Brookhill Street Hillsboro Oregon USA 97124

Specification

Description:
RELATED APPLICATION
[0001] This application is a divisional of India Patent Application No. 202244042724, filed on 26 July 2022, entitled “BFLOAT16 COMPARISON INSTRUCTIONS”.
[0002] The present application claims priority to U.S. Non-Provisional Patent Application No. 17/463,410, filed on 31 August 2021 and titled “BFLOAT16 COMPARISON INSTRUCTIONS”, the entire disclosure of which is hereby incorporated by reference.

BACKGROUND
[0003] In recent years, fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks, due to their extreme computational intensity. Compared to classical IEEE-754 32-bit (FP32) and 64-bit (FP64) arithmetic, this reduced-precision arithmetic can naturally be sped up disproportionately to its shortened width.

BRIEF DESCRIPTION OF DRAWINGS
[0004] Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
[0005] FIG. 1 illustrates different floating point representation formats.
[0006] FIG. 2 illustrates an exemplary execution of an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0007] FIG. 3 illustrates an embodiment of a method performed by a processor to process an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0008] FIG. 4 illustrates more detailed embodiments of an execution of an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0009] FIG. 5 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to determine a maximum value between BF16 data elements of corresponding data element positions of two sources.
[0010] FIG. 6 illustrates an exemplary execution of an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0011] FIG. 7 illustrates an embodiment of a method performed by a processor to process an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0012] FIG. 8 illustrates more detailed embodiments of an execution of an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0013] FIG. 9 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to determine a minimum value between BF16 data elements of corresponding data element positions of two sources.
[0014] FIG. 10 illustrates an exemplary execution of an instruction to compare values between BF16 data elements of corresponding data element positions of two sources according to a comparison operator.
[0015] FIG. 11 provides examples of comparison operators according to some embodiments.
[0016] FIG. 12 illustrates an embodiment of a method performed by a processor to process an instruction to compare values between BF16 data elements of corresponding data element positions of two sources according to a comparison operator.
[0017] FIG. 13 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to compare values between BF16 data elements of corresponding data element positions of two sources according to a comparison operator.
[0018] FIG. 14 illustrates an exemplary execution of an instruction to compare BF16 values in a particular data element position of a first source operand and a second source operand and to set a zero flag, a parity flag, and a carry flag according to a result of the comparison.
[0019] FIG. 15 illustrates an embodiment of a method performed by a processor to process an instruction to compare BF16 values in a particular data element position of a first source operand and a second source operand and to set a zero flag, a parity flag, and a carry flag according to a result of the comparison.
[0020] FIG. 16 illustrates exemplary embodiments of pseudo code representing the execution and format of an instruction to compare BF16 values in a particular data element position of a first source operand and a second source operand and to set a zero flag, a parity flag, and a carry flag according to a result of the comparison.
[0021] FIG. 17 illustrates embodiments of hardware to process an instruction such as any of the BF16 compare instructions detailed above. As illustrated, storage 1703 stores at least one BF16 compare instruction 1701 to be executed.
[0022] FIG. 18 illustrates embodiments of an exemplary system.
[0023] FIG. 19 illustrates a block diagram of embodiments of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics.
[0024] FIG. 20(A) is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention.
[0025] FIG. 20(B) is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention.
[0026] FIG. 21 illustrates embodiments of execution unit(s) circuitry, such as execution unit(s) circuitry of FIG. 20(B).
[0027] FIG. 22 is a block diagram of a register architecture according to some embodiments.
[0028] FIG. 23 illustrates embodiments of an instruction format.
[0029] FIG. 24 illustrates embodiments of an addressing field.
[0030] FIG. 25 illustrates embodiments of a first prefix.
[0031] FIGS. 26(A)-(D) illustrate embodiments of how the R, X, and B fields of the first prefix 2301(A) are used.
[0032] FIGS. 27(A)-(B) illustrate embodiments of a second prefix.
[0033] FIG. 28 illustrates embodiments of a third prefix.
[0034] FIG. 29 illustrates a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention.

DETAILED DESCRIPTION
[0035] The present disclosure relates to methods, apparatus, systems, and non-transitory computer-readable storage media for supporting instructions that perform comparison operations on BF16 data elements.
[0036] BF16 is gaining traction due to its ability to work well in machine learning algorithms, in particular deep learning training. FIG. 1 illustrates different floating point representation formats. In this illustration, the formats are in little endian format; however, in some embodiments, a big endian format is used. The FP32 format 101 has a sign bit (S), an 8-bit exponent, and a 23-bit fraction (a 24-bit mantissa that uses an implicit bit). The FP16 format 103 has a sign bit (S), a 5-bit exponent, and a 10-bit fraction. The BF16 format 105 has a sign bit (S), an 8-bit exponent, and a 7-bit fraction.
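The field widths given above can be illustrated with a short sketch. Because BF16 has the same sign and exponent widths as FP32 and a 7-bit fraction, a BF16 pattern lines up with the upper 16 bits of an FP32 pattern. The function names below are hypothetical, and the conversion shown simply truncates the low 16 bits, which is an assumption made for brevity and is not a conversion method stated in the specification (hardware conversions commonly round instead).

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE-754 pattern of x after rounding to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_bits(x: float) -> int:
    """Keep the upper 16 bits of the FP32 pattern (truncation, an assumption)."""
    return fp32_bits(x) >> 16

def bf16_fields(bits: int):
    """Split a 16-bit BF16 pattern into (sign, exponent, fraction)."""
    sign = (bits >> 15) & 0x1      # 1 sign bit
    exponent = (bits >> 7) & 0xFF  # 8-bit exponent, same width as FP32
    fraction = bits & 0x7F         # 7-bit fraction
    return sign, exponent, fraction

def bf16_bits_to_float(bits: int) -> float:
    """Widen a BF16 pattern back to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

if __name__ == "__main__":
    b = fp32_to_bf16_bits(-2.5)
    print(hex(b), bf16_fields(b), bf16_bits_to_float(b))
```

Widening back to FP32 is exact, since every BF16 value is also an FP32 value; only the narrowing direction loses fraction bits.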
Claims:
1. A processor comprising:
decode circuitry to decode an instruction, the instruction having fields for identification of a first packed data register, identification of a second packed data register, and identification of a predicate register, the first packed data register to store a first packed data source operand having at least eight BF16 data elements, the second packed data register to store a second packed data source operand having at least eight BF16 data elements, the predicate register to store at least eight predicate values, each of the at least eight BF16 data elements of the first packed data source operand corresponding to one of the at least eight BF16 data elements of the second packed data source operand at a corresponding data element position, each of the predicate values corresponding to a pair of corresponding BF16 data elements of the first and second packed data source operands at a corresponding data element position; and
circuitry coupled with the decode circuitry to perform operations corresponding to the instruction, including to:
provide, for each data element position, a data element result, wherein:
for each predicate value that is a first value, the data element result is to include a corresponding data element that is a result of either a maximum comparison or a minimum comparison of the pair of corresponding BF16 data elements, wherein, when the BF16 data elements of the pair of corresponding BF16 data elements are both zero, of either sign, the data element result is to include the corresponding BF16 data element of the second packed data source operand; and
for each predicate value that is a second value, the data element result is to include a corresponding data element that is either zero or remains unchanged; and
store the data element results in a packed data destination operand.
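
The per-element behavior recited in claim 1 can be sketched in software as follows. This is an illustrative model only, not the claimed circuitry: the function name `masked_minmax` and its parameters are hypothetical, Python floats stand in for BF16 data elements, a predicate value of 1 is assumed to be the "first value", and the handling of NaN inputs (returning the second source element) is an assumption that the claim does not state.

```python
import math

def masked_minmax(src1, src2, predicate, dest, op="max", zeroing=False):
    """Model a predicated packed max/min over corresponding element pairs."""
    result = []
    for a, b, p, old in zip(src1, src2, predicate, dest):
        if p:  # assumed "first value": perform the comparison
            if a == 0.0 and b == 0.0:
                r = b  # both zero, of either sign: take the second source
            elif math.isnan(a) or math.isnan(b):
                r = b  # assumption, not stated in the claim
            elif op == "max":
                r = a if a > b else b
            else:
                r = a if a < b else b
        else:  # "second value": zero the element or leave it unchanged
            r = 0.0 if zeroing else old
        result.append(r)
    return result

if __name__ == "__main__":
    s1 = [1.0, -0.0, 3.0, 0.0, 5.0, 6.0, 7.0, 8.0]
    s2 = [2.0, 0.0, -1.0, -0.0, 4.0, 9.0, 7.0, 1.0]
    k = [1, 1, 0, 1, 1, 1, 1, 0]
    print(masked_minmax(s1, s2, k, [9.0] * 8, op="max"))
```

The explicit zero check matters because +0.0 and -0.0 compare equal, so an ordinary greater-than or less-than test cannot decide which signed zero to return; the claim resolves this by always selecting the element from the second source.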

Documents

Application Documents

#	Name	Date
1	202445038080-POWER OF AUTHORITY [15-05-2024(online)].pdf	2024-05-15
2	202445038080-FORM 1 [15-05-2024(online)].pdf	2024-05-15
3	202445038080-DRAWINGS [15-05-2024(online)].pdf	2024-05-15
4	202445038080-DECLARATION OF INVENTORSHIP (FORM 5) [15-05-2024(online)].pdf	2024-05-15
5	202445038080-COMPLETE SPECIFICATION [15-05-2024(online)].pdf	2024-05-15
6	202445038080-FORM 18 [24-10-2024(online)].pdf	2024-10-24