
In-Network Collective Operations

Abstract: Examples described herein relate to a switch comprising circuitry configured to, for packet communications associated with a collective operation to train machine learning (ML) models: utilize a reliable transport protocol for communications from at least one worker node of the collective operation to the switch, including storing packet receipt state for per-packet communications from the at least one worker node to the switch, and utilize a non-reliable transport protocol from the switch to a device that is to perform aggregation of results, wherein the reliable transport protocol is a different protocol than the non-reliable transport protocol.


Patent Information

Application #: 202344079422
Filing Date: 22 November 2023
Publication Number: 04/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

INTEL CORPORATION
2200 Mission College Boulevard, Santa Clara, California 95054, USA

Inventors

1. Vivek KASHYAP
1592 NW Jolie Place, Portland, Oregon 97229, USA
2. Amedeo SAPIO
1089 Big Sur Drive, San Jose, California 95120, USA

Specification

Description:

RELATED APPLICATION
[001] The present application claims priority to U.S. Non-Provisional Patent Application No. 18/222,946, filed on 17 July 2023 and titled “IN-NETWORK COLLECTIVE OPERATIONS”, the entire disclosure of which is hereby incorporated by reference.

BACKGROUND
[002] Machine learning (ML) or high performance computing (HPC) clusters utilize multitudes of servers and graphics processing units (GPUs) or Tensor Processing Units (TPUs). Collective operations can be performed on data transmitted through a network at different switches. These systems are used to train ML models using iterative algorithms such as stochastic gradient descent. Input data is partitioned across workers and multiple iterations are performed over the training data. At each iteration, workers compute an update to the ML model parameters based on a subset of local data and an intermediate current model. The workers then communicate their results to be aggregated into a model update, and the aggregated update is applied to the model parameters at the nodes for the next iteration. These iterations are performed multiple times (epochs) over an entire dataset.
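By way of a rough illustration, the iterative pattern above can be sketched as follows; the linear model, the squared-error loss, and the aggregate() helper are assumptions made for this example and are not taken from the application.

import numpy as np

def local_gradient(weights, x, y):
    # Gradient of a squared-error loss for a linear model on one worker's data partition.
    pred = x @ weights
    return x.T @ (pred - y) / len(y)

def aggregate(updates):
    # Stand-in for the collective step: average the per-worker updates into one model update.
    return np.mean(updates, axis=0)

def train(shards, dim, epochs=3, lr=0.1):
    weights = np.zeros(dim)
    for _ in range(epochs):                       # multiple passes (epochs) over the dataset
        updates = [local_gradient(weights, x, y)  # each worker computes an update
                   for (x, y) in shards]          # on its local data partition
        weights -= lr * aggregate(updates)        # aggregated update applied for the next iteration
    return weights

# Example: three workers, each holding one shard of (x, y) training data.
rng = np.random.default_rng(0)
shards = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(3)]
print(train(shards, dim=4))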
[003] A parameter server (PS) can be utilized for collective operations whereby worker nodes compute updates and send the updates to the PS. The PS pushes the aggregated data to the workers, or the workers pull the data from the PS servers. FIG. 1 shows an end-to-end solution for machine learning (ML) training using a PS architecture. The PS architecture includes workers 100 and parameter servers (PS) 120 that are communicatively coupled using switches 110. An end-to-end solution for the PS architecture includes reduce-scatter and Allgather operators. FIG. 1 shows that Worker1 has three queue pairs (QPs), and each QP connects to a PS. Worker2 and Worker3 also utilize three QPs, and each QP connects to a PS.
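The exchange pattern of FIG. 1 can be approximated with the short sketch below; the queue pair naming and the push, aggregate_all, and pull helpers are hypothetical and serve only to show that each worker holds one QP per parameter server and that each PS aggregates what it receives.

# Hypothetical sketch of the PS exchange: workers push updates to each PS over
# per-PS queue pairs, each PS aggregates what it received, and workers then pull
# the aggregate (or the PS pushes it back).
workers = ["Worker1", "Worker2", "Worker3"]
servers = ["PS1", "PS2", "PS3"]

# One QP per (worker, PS) pair, as in FIG. 1 where each worker uses three QPs.
queue_pairs = {(w, ps): f"qp-{w}-{ps}" for w in workers for ps in servers}

inboxes = {ps: [] for ps in servers}       # updates received by each PS
aggregates = {}                            # aggregated result held by each PS

def push(worker, ps, value):
    # Worker -> PS over queue_pairs[(worker, ps)].
    inboxes[ps].append(value)

def aggregate_all():
    # Each PS aggregates the updates it received from the workers.
    for ps in servers:
        aggregates[ps] = sum(inboxes[ps])

def pull(ps):
    # Worker <- PS: fetch the aggregated result (a push from the PS would be symmetric).
    return aggregates[ps]

for w in workers:
    for ps in servers:
        push(w, ps, 1.0)
aggregate_all()
print([pull(ps) for ps in servers])        # [3.0, 3.0, 3.0]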
[004] In the reduce-scatter operator, a worker sends a partition of the data to a corresponding parameter server. For example, partition a1 from Worker1, a2 from Worker2, and a3 from Worker3 are sent to PS1, whereas partition b1 from Worker1, b2 from Worker2, and b3 from Worker3 are sent to PS2. A similar pattern applies to PS3. As a result, the data are scattered across multiple parameter servers to leverage the parallel computation of graphics processing units (GPUs) located at each parameter server. After receiving the data, the PS first performs aggregation over the data from the workers.
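A minimal sketch of this reduce-scatter pattern is given below, assuming NumPy arrays split evenly across three parameter servers; the reduce_scatter function name and the example data are illustrative, not part of the application.

import numpy as np

def reduce_scatter(worker_data, num_servers):
    # Each worker splits its data into num_servers partitions; partition i from
    # every worker is sent to parameter server i, which sums (reduces) them.
    partitions = [np.array_split(data, num_servers) for data in worker_data]
    return [sum(partitions[w][i] for w in range(len(worker_data)))
            for i in range(num_servers)]

# Three workers, each with six values; PS1 receives the first slice from every
# worker (a1, a2, a3), PS2 the second (b1, b2, b3), and PS3 the third.
worker_data = [np.arange(6, dtype=float) * (w + 1) for w in range(3)]
per_ps = reduce_scatter(worker_data, num_servers=3)
print(per_ps)   # each entry is the sum of one partition across all workers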
[005] In the Allgather operator, the data that are processed by a GPU are multicast to the workers. A parameter server sends the same copy of the data to the workers. In this process, the bandwidth from one PS is distributed to all the workers, and the network could be the bottleneck.
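A corresponding sketch of the Allgather step follows, again with illustrative names and data: each PS sends the same copy of its aggregated partition to every worker, and every worker ends up with the full concatenated result.

import numpy as np

def allgather(per_ps_results, num_workers):
    # Each PS multicasts the same copy of its aggregated partition to every worker,
    # so all workers end up with the full concatenated result.
    full = np.concatenate(per_ps_results)
    return [full.copy() for _ in range(num_workers)]

# Illustrative input: per-partition aggregates held by PS1, PS2, and PS3.
per_ps = [np.array([0.0, 6.0]), np.array([12.0, 18.0]), np.array([24.0, 30.0])]
print(allgather(per_ps, num_workers=3))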

BRIEF DESCRIPTION OF THE DRAWINGS
[006] FIG. 1 shows an end-to-end solution for machine learning (ML) training using a parameter server (PS) architecture.
[007] FIG. 2 depicts an example topology.
[008] FIG. 3 depicts an example system.
[009] FIG. 4 depicts an example of buffer management.
[0010] FIG. 5 depicts an example process.
[0011] FIG. 6 depicts an example network interface device or packet processing device.
[0012] FIGs. 7A-7C depict example switches.
[0013] FIG. 8 depicts an example system.

DETAILED DESCRIPTION
[0014] In collective operations that use in-network aggregation (INA) to offload collective operations for ML workloads to network interface devices, such as switches, the network interface devices can perform the operations of a parameter server (PS) and be connected in a tree, mesh, or other arrangement. An endpoint switch can communicate with worker nodes using a reliable transport protocol, and the endpoint switch can maintain connection state information for packet communications to or from the worker nodes. The endpoint switch can aggregate data from the worker nodes by performing one or more of SUM, SUBTRACT, MIN, MAX, MULTIPLY, and so forth, and can perform floating point operations as part of data processing. The endpoint switch can be connected with at least one other switch using a non-reliable protocol and may not store connection state information for packet communications to or from the at least one other switch. The endpoint switch can communicate with the at least one other switch on a best-effort basis and may not re-transmit packets that were not received by the at least one other switch. In some examples, the endpoint switch can utilize a first protocol to communicate with the worker nodes and a second protocol to communicate with the at least one other switch, and the first and second protocols can be different. The at least one other switch can aggregate data from at least the endpoint switch by performing one or more of SUM, SUBTRACT, MIN, MAX, MULTIPLY, and so forth.
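A greatly simplified sketch of the endpoint switch behavior described above follows; the EndpointSwitch class, its per-packet receipt table, and the send_best_effort primitive are assumptions made for illustration and do not describe the actual switch implementation.

# Hypothetical sketch: an endpoint switch that (a) tracks per-packet receipt state
# and acknowledges worker packets (reliable side), (b) aggregates worker data with
# SUM, and (c) forwards the result upstream best-effort, keeping no connection
# state and performing no retransmission (non-reliable side).
class EndpointSwitch:
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.received = {}        # (worker_id, seq) -> value: per-packet receipt state
        self.acks = []            # acknowledgements returned to workers

    def on_worker_packet(self, worker_id, seq, value):
        # Reliable transport toward the workers: remember what was received
        # (duplicates are detected via the stored receipt state) and acknowledge it.
        if (worker_id, seq) not in self.received:
            self.received[(worker_id, seq)] = value
        self.acks.append((worker_id, seq))

    def aggregate(self, seq):
        # In-network aggregation of one chunk: SUM across workers (could be MIN, MAX, ...).
        values = [v for (w, s), v in self.received.items() if s == seq]
        if len(values) == self.num_workers:
            return sum(values)
        return None               # still waiting on some workers

    def forward_upstream(self, seq, send_best_effort):
        # Non-reliable transport toward the aggregating device: fire and forget,
        # no connection state kept and no retransmission if the packet is lost.
        result = self.aggregate(seq)
        if result is not None:
            send_best_effort(seq, result)

# Usage: three workers each deliver sequence number 0, then the switch forwards the sum.
sw = EndpointSwitch(num_workers=3)
for w in range(3):
    sw.on_worker_packet(worker_id=w, seq=0, value=1.5)
sw.forward_upstream(seq=0, send_best_effort=lambda s, v: print("upstream", s, v))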
Claims:

1. An apparatus comprising:
an interface and
circuitry coupled to the interface, the circuitry configured to:
for packet communications associated with a collective operation to train machine learning (ML) models:
utilize a reliable transport protocol for communications from at least one worker node of the collective operation to a switch, wherein to utilize the reliable transport protocol for communications from the at least one worker node of the collective operation to the switch, the circuitry is to store packet receipt state for per-packet communications from the at least one worker node of the collective operation to the switch; and
utilize a non-reliable transport protocol for communications from the switch to a device that is to perform aggregation of results, wherein the reliable transport protocol is a different protocol than the non-reliable transport protocol.

Documents

Application Documents

# Name Date
1 202344079422-POWER OF AUTHORITY [22-11-2023(online)].pdf 2023-11-22
2 202344079422-FORM 1 [22-11-2023(online)].pdf 2023-11-22
3 202344079422-DRAWINGS [22-11-2023(online)].pdf 2023-11-22
4 202344079422-DECLARATION OF INVENTORSHIP (FORM 5) [22-11-2023(online)].pdf 2023-11-22
5 202344079422-COMPLETE SPECIFICATION [22-11-2023(online)].pdf 2023-11-22
6 202344079422-FORM 3 [22-05-2024(online)].pdf 2024-05-22
7 202344079422-Correspondence-Letter [19-07-2024(online)].pdf 2024-07-19
8 202344079422-Proof of Right [30-10-2024(online)].pdf 2024-10-30