
Modular GPU Architecture for Clients and Servers

Abstract: One embodiment provides a graphics processor including an active base die including a fabric interconnect and a chiplet including a switched fabric, wherein the chiplet couples with the active base die via an array of interconnect structures, the array of interconnect structures couple the fabric interconnect with the switched fabric, and the chiplet includes a first modular interconnect configured to couple a block of graphics processing resources to the switched fabric and a second modular interconnect configured to couple a memory subsystem with the switched fabric and the block of graphics processing resources, the memory interconnect including a set of memory controllers and a set of physical interfaces.


Patent Information

Application #:
Filing Date: 07 September 2022
Publication Number: 15/2023
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

INTEL CORPORATION
2200 Mission College Boulevard, Santa Clara, California 95054, USA

Inventors

1. Lakshminarayana Pappu
1480 Leonard Ct., Folsom, CA, USA, 95630
2. Altug Koker
8241 Trevi Way, El Dorado Hills, CA, USA, 95762
3. Aditya Navale
1129 Widgeon Ct, Folsom, CA, USA, 95630
4. Prasoonkumar Surti
1408 Kilrenny Ct, Folsom, CA, USA, 95630
5. Ankur Shah
1900 Prairie City Road, Folsom, CA, USA, 95630
6. Joydeep Ray
727 Misty Ridge Cir, Folsom, CA, USA, 95630
7. Naveen Matam
4086 Kalamata Way, Rancho Cordova, CA, USA, 95742

Specification

Description

RELATED APPLICATION
[0001] The present application claims priority to U.S. Non-Provisional Patent Application No. 17/496,467, filed October 7, 2021, and titled “MODULAR GPU ARCHITECTURE FOR CLIENTS AND SERVERS,” the entire disclosure of which is hereby incorporated by reference.

FIELD
[0002] This disclosure relates generally to data processing and more particularly to data processing via a general-purpose graphics processing unit.

BACKGROUND OF THE DISCLOSURE
[0003] Product segmentation within product classes can be performed for graphics product devices by designing a client or server architecture that includes a superset of all components and then binning partially failed or lower-performing products into differing product segments. Accordingly, within a given architectural generation, products of different tiers or classifications may be architecturally similar, with different amounts of memory or processing resources. However, graphics products that are designed for use in server devices will generally have a different core architecture than products designed for use in client devices. Additionally, different software structures may be used for the client and server segments, even for components that are expected to perform similar functions.

BRIEF DESCRIPTION OF THE DRAWINGS
[0004] The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements, and in which:
[0005] FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein;
[0006] FIG. 2A-2D illustrate parallel processor components;
[0007] FIG. 3A-3C are block diagrams of graphics multiprocessors and multiprocessor-based GPUs;
[0008] FIG. 4A-4F illustrate an exemplary architecture in which a plurality of GPUs is communicatively coupled to a plurality of multi-core processors;
[0009] FIG. 5 illustrates a graphics processing pipeline;
[0010] FIG. 6 illustrates a machine learning software stack;
[0011] FIG. 7 illustrates a general-purpose graphics processing unit;
[0012] FIG. 8 illustrates a multi-GPU computing system;
[0013] FIG. 9A-9B illustrate layers of exemplary deep neural networks;
[0014] FIG. 10 illustrates an exemplary recurrent neural network;
[0015] FIG. 11 illustrates training and deployment of a deep neural network;
[0016] FIG. 12A is a block diagram illustrating distributed learning;
[0017] FIG. 12B is a block diagram illustrating a programmable network interface and data processing unit;
[0018] FIG. 13 illustrates an exemplary inferencing system on a chip (SOC) suitable for performing inferencing using a trained model;
[0019] FIG. 14 is a block diagram of a processing system;
[0020] FIG. 15A-15C illustrate computing systems and graphics processors;
[0021] FIG. 16A-16C illustrate block diagrams of additional graphics processor and compute accelerator architectures;
[0022] FIG. 17 is a block diagram of a graphics processing engine of a graphics processor;
[0023] FIG. 18A-18B illustrate thread execution logic including an array of processing elements employed in a graphics processor core;
[0024] FIG. 19 illustrates an additional execution unit;
[0025] FIG. 20 is a block diagram illustrating graphics processor instruction formats;
[0026] FIG. 21 is a block diagram of an additional graphics processor architecture;
[0027] FIG. 22A-22B illustrate a graphics processor command format and command sequence;
[0028] FIG. 23 illustrates exemplary graphics software architecture for a data processing system;
[0029] FIG. 24A is a block diagram illustrating an IP core development system;
[0030] FIG. 24B illustrates a cross-section side view of an integrated circuit package assembly;
[0031] FIG. 24C illustrates a package assembly that includes multiple units of hardware logic chiplets connected to a substrate (e.g., base die);
[0032] FIG. 24D illustrates a package assembly including interchangeable chiplets;
[0033] FIG. 25 is a block diagram illustrating an exemplary system on a chip integrated circuit;
[0034] FIG. 26A-26B are block diagrams illustrating exemplary graphics processors for use within an SoC;
[0035] FIG. 27 illustrates a modular GPU architecture for clients and servers, according to embodiments described herein;
[0036] FIG. 28 illustrates a modular graphics core cluster, according to an embodiment;
[0037] FIG. 29 is a block diagram of a tile of a modular graphics processor, according to an embodiment;
[0038] FIG. 30 illustrates a single tile client graphics SoC, according to an embodiment;
[0039] FIG. 31 illustrates a multi-tile client graphics SoC, according to an embodiment;
[0040] FIG. 32 illustrates a scalable multi-board server graphics processor, according to an embodiment;
[0041] FIG. 33 illustrates a scalable multi-board server graphics processor having a multi-tile graphics SoC, according to an embodiment;
[0042] FIG. 34 illustrates an interchangeable chiplet system for a modular graphics SoC, according to an embodiment;
[0043] FIG. 35 illustrates a method for configuring a component set for a modular graphics SoC, according to an embodiment; and
[0044] FIG. 36 is a block diagram of a computing device including a graphics processor, according to an embodiment.

DETAILED DESCRIPTION
[0045] A graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate, for example, graphics operations, machine-learning operations, pattern analysis operations, and/or various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or another interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). Alternatively, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal processor bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.
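The host-to-GPU command-submission flow described in paragraph [0045] can be sketched as a toy model. This is purely illustrative: the class and method names (`WorkDescriptor`, `HostCore.submit`, `GPU.process`) are invented for the sketch and do not correspond to any real driver API.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class WorkDescriptor:
    """A batch of commands/instructions the host packages for the GPU."""
    commands: list

class GPU:
    def __init__(self):
        self.queue = deque()  # descriptors allocated by the host cores

    def process(self):
        # Stand-in for the dedicated circuitry/logic that executes commands.
        executed = []
        while self.queue:
            executed.extend(self.queue.popleft().commands)
        return executed

class HostCore:
    def __init__(self, gpu):
        self.gpu = gpu

    def submit(self, commands):
        # The host allocates work to the GPU as a sequence of commands
        # contained in a work descriptor.
        self.gpu.queue.append(WorkDescriptor(commands))

gpu = GPU()
host = HostCore(gpu)
host.submit(["DRAW", "DISPATCH"])
host.submit(["COPY"])
print(gpu.process())  # ['DRAW', 'DISPATCH', 'COPY']
```

The same descriptor-based pattern applies whether the GPU is attached over PCIe/NVLink or integrated on-package; only the transport beneath the queue differs.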
[0046] Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data. However, more recently, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.
[0047] To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for SIMT architectures can be found in Shane Cook, CUDA Programming Chapter 3, pages 37-51 (2013).
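The SIMT execution pattern referenced above can be illustrated with a small software model: one shared instruction stream applied in lockstep to many threads, each holding private data. This is a conceptual sketch, not how any GPU scheduler is actually implemented.

```python
def simt_execute(program, thread_data):
    """Toy SIMT model: every thread executes the same instruction stream
    synchronously, each on its own private lane value."""
    lanes = list(thread_data)
    for instr in program:                  # one shared instruction stream
        lanes = [instr(x) for x in lanes]  # all threads run it together
    return lanes

# Each "instruction" transforms a single thread's private value.
program = [lambda x: x * 2, lambda x: x + 1]
print(simt_execute(program, [0, 1, 2, 3]))  # [1, 3, 5, 7]
```

Real hardware groups threads into warps or subgroups and must also handle branch divergence, which this lockstep sketch deliberately omits.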
[0048] Described herein is a discrete graphics processing architecture that is scalable across clients (entry, mainstream, enthusiast) and servers (high-performance compute, machine learning, media) via the use of modular and scalable system on a chip integrated circuit (SoC). The various modules of the graphics SoC can be selected based on the target product classes and intended market segments. Different types of modules can be selected, for example, based on the memory technology selected for a product and the intended use case for the product. The selected modules can then be packaged and assembled into a variety of end products across various product classes and market segments.
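The module-selection idea in paragraph [0048] can be sketched as a lookup from target segment to module set. The segment names mirror those in the text, but the catalog contents (memory technologies, tile counts) are hypothetical examples, not configurations disclosed by the application.

```python
# Hypothetical module catalog for a modular graphics SoC, keyed by the
# product segments named in the text (client: entry/mainstream/enthusiast;
# server: HPC/machine learning/media).
MODULE_CATALOG = {
    "client-entry":      {"memory": "LPDDR5", "tiles": 1, "media": True},
    "client-enthusiast": {"memory": "GDDR6",  "tiles": 2, "media": True},
    "server-ml":         {"memory": "HBM",    "tiles": 4, "media": False},
}

def configure_soc(segment):
    """Select the module set to package for a target product segment."""
    try:
        return MODULE_CATALOG[segment]
    except KeyError:
        raise ValueError(f"unknown segment: {segment}")

print(configure_soc("server-ml")["memory"])  # HBM
```

The point of the pattern is that one architecture spans all rows: a product is assembled by choosing modules, not by designing a new core per segment.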
[0049] In the following description, numerous specific details are set forth to provide a more thorough understanding. However, it will be apparent to one of skill in the art that the embodiments described herein may be practiced without one or more of these specific details. In other instances, well-known features have not been described to avoid obscuring the details of the present embodiments.
Claims:

1. A general-purpose graphics processor comprising:
an active base die including a fabric interconnect; and
a first chiplet including a switched fabric, wherein the first chiplet is configured to couple with the active base die via an array of interconnect structures, the array of interconnect structures additionally to couple the fabric interconnect with the switched fabric, the first chiplet including:
a first modular interconnect configured to couple a block of graphics processing resources to the switched fabric; and
a second modular interconnect configured to couple a memory subsystem with the switched fabric and the block of graphics processing resources, the memory interconnect including a set of memory controllers and a set of physical interfaces.
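Read as a data model, claim 1 describes a containment hierarchy: a base die carrying a fabric interconnect, and a chiplet carrying a switched fabric plus two modular interconnects. The sketch below mirrors that hierarchy; all class and field names are paraphrases of the claim language for illustration, not an implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryInterconnect:
    """Second modular interconnect: memory controllers plus physical interfaces."""
    memory_controllers: List[str]
    physical_interfaces: List[str]

@dataclass
class Chiplet:
    switched_fabric: str
    graphics_interconnect: str  # first modular interconnect, to the graphics resources
    memory_interconnect: MemoryInterconnect

@dataclass
class GraphicsProcessor:
    fabric_interconnect: str    # resides on the active base die
    chiplets: List[Chiplet] = field(default_factory=list)

    def attach(self, chiplet):
        # Stands in for the array of interconnect structures that couples
        # the base die's fabric interconnect to the chiplet's switched fabric.
        self.chiplets.append(chiplet)

gpu = GraphicsProcessor(fabric_interconnect="base-fabric")
gpu.attach(Chiplet("switched-fabric", "gfx-link",
                   MemoryInterconnect(["mc0", "mc1"], ["phy0", "phy1"])))
print(len(gpu.chiplets))  # 1
```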

Documents

Application Documents

# Name Date
1 202244051173-US 17496467-DASCODE-2410 [07-09-2022].pdf 2022-09-07
2 202244051173-FORM 1 [07-09-2022(online)].pdf 2022-09-07
3 202244051173-DRAWINGS [07-09-2022(online)].pdf 2022-09-07
4 202244051173-DECLARATION OF INVENTORSHIP (FORM 5) [07-09-2022(online)].pdf 2022-09-07
5 202244051173-COMPLETE SPECIFICATION [07-09-2022(online)].pdf 2022-09-07
6 202244051173-FORM-26 [28-12-2022(online)].pdf 2022-12-28
7 202244051173-FORM 3 [07-03-2023(online)].pdf 2023-03-07
8 202244051173-FORM 3 [07-09-2023(online)].pdf 2023-09-07
9 202244051173-Proof of Right [12-09-2023(online)].pdf 2023-09-12
10 202244051173-FORM 3 [08-03-2024(online)].pdf 2024-03-08
11 202244051173-FORM 18 [30-09-2025(online)].pdf 2025-09-30