Abstract: In one embodiment, a system includes a device and a host. The device includes a device stream buffer. The host includes a processor to execute at least a first application and a second application, a host stream buffer, and a host scheduler. The first application is associated with a first transmit streaming channel to stream first data from the first application to the device stream buffer. The first transmit streaming channel has a first allocated amount of buffer space in the device stream buffer. The host scheduler schedules enqueue of the first data from the first application to the first transmit streaming channel based at least in part on availability of space in the first allocated amount of buffer space in the device stream buffer. Other embodiments are described and claimed.
Claims:
1. An apparatus comprising:
a processor to execute at least a first application and a second application, the first application associated with a first transmit streaming channel to stream first data from the first application to a device stream buffer of a device, wherein the first transmit streaming channel has a first allocated amount of buffer space in the device stream buffer; and
a scheduler coupled to the processor, the scheduler to schedule the first application to enqueue the first data to the first transmit streaming channel based at least in part on availability of space in the first allocated amount of buffer space in the device stream buffer.
2. The apparatus of claim 1, wherein the scheduler is further to:
receive flow control data associated with the availability of space in the first allocated amount of buffer space in the device stream buffer from the device; and
schedule the first application to enqueue the first data to the first transmit streaming channel based at least in part on the flow control data.
3. The apparatus of claim 1, wherein the second application is associated with a second transmit streaming channel to stream second data from the second application to the device stream buffer, the second transmit streaming channel having a second allocated amount of buffer space in the device stream buffer; and
the apparatus further comprises a host stream buffer to:
receive the first and second data via the first and second transmit streaming channels, respectively; and
stream the first and second data from the host stream buffer to the device stream buffer via a buffer transmit streaming channel.
4. The apparatus of claim 3, wherein the scheduler is further to prioritize streaming of the first data and the second data from the host stream buffer to the device stream buffer via the buffer transmit streaming channel based at least in part on a first priority associated with the first application and a second priority associated with the second application.
5. The apparatus of claim 3, wherein the scheduler is further to:
receive a first data drain rate from the device, the first data drain rate associated with draining of the first data from the device stream buffer;
receive a second data drain rate from the device, the second data drain rate associated with draining of the second data from the device stream buffer; and
reprioritize streaming of the first data and the second data from the host stream buffer to the device stream buffer based at least in part on the first and second data drain rates.
6. The apparatus of claim 1, wherein the scheduler is further to:
receive a first data drain rate from the device, the first data drain rate associated with draining of the first data from the device stream buffer; and
schedule the enqueue of the first data from the first application to the first transmit streaming channel based at least in part on the first data drain rate.
7. The apparatus of claim 1, wherein the second application is associated with a second transmit streaming channel to stream second data from the second application to the device stream buffer, the second transmit streaming channel having a second allocated amount of buffer space in the device stream buffer; and
the scheduler is further to:
determine whether the first data provided by the first application falls below a channel data threshold; and
dynamically adjust the first allocated amount of buffer space in the device stream buffer to the first transmit streaming channel and the second allocated amount of buffer space in the device stream buffer to the second transmit streaming channel based at least in part on the determination.
8. The apparatus of claim 1, wherein the first application is associated with a first priority and the second application is associated with a second priority, the second application associated with a second transmit streaming channel to stream second data from the second application to the device stream buffer and having a second allocated amount of buffer space in the device stream buffer; and
the scheduler is further to:
allocate the first allocated amount of the buffer space in the device stream buffer to the first transmit streaming channel based at least in part on the first priority; and
allocate the second allocated amount of the buffer space in the device stream buffer to the second transmit streaming channel based at least in part on the second priority.
9. The apparatus of claim 1, further comprising a host stream buffer to receive third data associated with the first application from the device via a first receive streaming channel, the first receive streaming channel being associated with the first application and having a first allocated amount of buffer space in the host stream buffer.
10. The apparatus of claim 9, wherein the host stream buffer is to transmit flow control data associated with availability of space in the first allocated amount of buffer space in the host stream buffer to the device.
11. The apparatus of claim 9, wherein the scheduler is further to schedule dequeue of the third data by the first application from the first receive streaming channel.
12. A machine-readable medium comprising instructions stored thereon, which if performed by a machine, cause the machine to:
receive first data associated with a first application at a device stream buffer of a device, the first application associated with a first receive streaming channel to stream the first data from the device stream buffer to the first application at a host device via a host stream buffer, the first receive streaming channel having a first allocated amount of buffer space in the host stream buffer; and
schedule streaming of the first data from the device stream buffer to the first application via the first receive streaming channel based at least in part on availability of space in the first allocated amount of buffer space in the host stream buffer.
13. A system comprising:
a device comprising:
a device stream buffer; and
a host coupled to the device, the host comprising:
a processor to execute at least a first application and a second application, the first application associated with a first transmit streaming channel to stream first data from the first application to the device stream buffer, the first transmit streaming channel having a first allocated amount of buffer space in the device stream buffer;
a host stream buffer; and
a host scheduler coupled to the processor, the host scheduler to schedule enqueue of the first data from the first application to the first transmit streaming channel based at least in part on availability of space in the first allocated amount of buffer space in the device stream buffer.
14. The system of claim 13, wherein the host scheduler is further to:
receive flow control data associated with availability of space in the first allocated amount of buffer space in the device stream buffer; and
schedule the enqueue of the first data from the first application to the first transmit streaming channel based at least in part on the flow control data.
15. The system of claim 13, further comprising a device scheduler, wherein the device stream buffer is to receive third data associated with the first application, the first application associated with a first receive streaming channel to stream the third data from the device stream buffer to the first application, the first receive streaming channel having a first allocated amount of buffer space in the host stream buffer; and
the device scheduler is to schedule streaming of the third data from the device stream buffer to the host stream buffer based at least in part on availability of space in the first allocated amount of buffer space in the host stream buffer.
16. The system of claim 15, wherein the device scheduler is further to:
receive flow control data associated with the availability of space in the first allocated amount of buffer space in the host stream buffer; and
schedule streaming of the third data from the device stream buffer to the host stream buffer based at least in part on the flow control data.
Description:
[0001] This application claims priority to United States Provisional Patent Application No. 17/313,353, filed on May 06, 2021, entitled SYSTEM, APPARATUS, AND METHOD FOR STREAMING INPUT/OUTPUT DATA, the disclosure of which is hereby incorporated by reference.
Technical Field
[0002] Embodiments relate to data communications in a computer system.
Background
[0003] Data communications between a Central Processing Unit (CPU) and other devices such as disks, network interface cards (NICs), field programmable gate arrays (FPGAs), accelerators, and the like currently operate using multiple back-and-forth coordination actions. Each of these actions lasts at least a few microseconds and may contribute toward the overall latency of the data transfer between the CPU and the other devices.
Brief Description of the Drawings
[0004] FIG. 1 is a block diagram representation of a system including a streaming subsystem in accordance with various embodiments.
[0005] FIG. 2 is a diagram illustrating a streaming interface with memory semantics in accordance with various embodiments.
[0006] FIG. 3 is a block diagram representation of a system including a streaming subsystem in accordance with various embodiments.
[0007] FIG. 4 is a block diagram representation of a system including a streaming subsystem in accordance with various embodiments.
[0008] FIG. 5 is a block diagram representation of a peer-to-peer streaming flow via a streaming subsystem in accordance with various embodiments.
[0009] FIG. 6 is a block diagram representation of a device to host streaming flow via a streaming subsystem in accordance with various embodiments.
[0010] FIGs. 7A-7C are diagrams describing zero copy memory, in accordance with various embodiments.
[0011] FIG. 8 is a diagram that shows transfers within a streaming channel flow via a streaming subsystem in accordance with various embodiments.
[0012] FIG. 9 is a diagram that shows application software to I/O interaction for transmit streaming I/O semantics, in accordance with various embodiments.
[0013] FIG. 10 is a diagram that shows application software to I/O interaction for receive streaming I/O semantics, in accordance with various embodiments.
[0014] FIG. 11A and FIG. 11B are examples of commands, input parameters, and output parameters to support a streaming channel architecture, in accordance with various embodiments.
[0015] FIG. 12 shows an example process to implement streaming channel architecture, in accordance with various embodiments.
[0016] FIG. 13 illustrates an example computing device suitable for use to implement components and techniques described herein, in particular with respect to FIGs.1-12, in accordance with various embodiments.
[0017] FIG. 14 depicts a computer-readable storage medium that may be used in conjunction with the computing device, in accordance with various embodiments.
[0018] FIG. 15 schematically illustrates a computing device which may include various embodiments as described herein.
[0019] FIG. 16 is a block diagram representation of a system including an embodiment of the streaming subsystem.
[0020] FIG. 17 is a flowchart representation of an embodiment of a method of implementing flow management.
[0021] FIG. 18 is a flowchart representation of an embodiment of a method of streaming data from a host device to an I/O device.
[0022] FIG. 19 is a block diagram representation of a system including an embodiment of the streaming subsystem.
[0023] FIG. 20 is a flowchart representation of an embodiment of a method of streaming data from an I/O device to a host device.
[0024] FIG. 21 is a block diagram representation of an interface circuit in accordance with an embodiment.
[0025] FIG. 22 is a block diagram representation of a system in accordance with an embodiment.
Detailed Description
[0026] Embodiments described herein may be directed to a streaming I/O architecture. Examples of this streaming I/O architecture may include data passing back and forth between a CPU and a device using a stream buffer that includes a streaming protocol, flow control, a stream buffer format, a scheduler, a buffer manager, and exposure of an interface of the stream buffer to software, including application software. These embodiments may reduce data-transfer latencies by reducing the number of interactions between interfaces found in legacy implementations, including PCIe legacy implementations. These embodiments may be a portion of a multi-terabit streaming data delivery technology. Embodiments may also be referred to as a low-latency streaming channel architecture for data flows between processors and devices, or a streaming data delivery system (SDDS).
[0027] Embodiments described herein may serve as foundations for scale-out computing, in particular as the foundation for how software interacts with I/O devices such as network devices. These I/O devices may include network interface cards, storage devices, accelerators, and other similar devices. The legacy architecture for the software-I/O interface was defined decades ago, and has recently created a data movement and processing problem for CPU devices regarding the transportation of data from assorted sources to assorted storage mediums. This data movement and processing problem will likely increase as Ethernet technology reaches 400 Gb per second (Gbps) and likely 800 Gbps to 1.6 terabits per second (Tb/s) in the near future, speeds that will be used to address emerging data center workloads such as artificial intelligence (AI), machine learning (ML), streaming services, and the like.
[0028] Legacy network latencies are starting to impact scale-out computing. At the microsecond timescale, software scheduling approaches may be too coarse-grained and impractical, while hardware-based approaches may require too many resources toward masking latencies of data movements and various associated coordination operations. Recently there is a microservices trend, particularly in cloud computing, data center computing, and edge computing, to refactor large applications into many smaller parts, called microservices, that can be independently developed, improved, load-balanced, scaled, etc., and to perform the work of the large applications by communicating messages among the microservices or performing inter-microservices procedure calls. The different parts of a large application so refactored into microservices may then be placed, replicated, etc. independently of each other, on multiple machines connected by networks, or interact with one another through data they update on distributed storage devices. This increases inter-machine and inter-device data movement and processing, which exacerbates the data movement and processing delays by hundreds of times, and further burdens computational devices like CPUs, GPUs, etc. with additional code for performing the inter-device and inter-machine data movements and waiting for such data movements to be completed. In other emerging areas such as Internet-of-Things (IoT) computing, which exhibit a high degree of device-to-device communications, a similar data movement and processing bottleneck limits the speed and scale of solving problems by using custom off-the-shelf components, and necessitates the use of costly, hard-to-scale, specially designed components and software.
[0029] Large-scale micro-services trends, exascale computing and data storage, edge computing, 5G, network function virtualization, disaggregation within the data center, heterogeneous programming, with the advent of accelerators, smart storage, data center fabrics and the like, are all creating pressure on the "last inch" of data transfers to and from CPUs in legacy implementations. The "last inch" refers to the coordination within a machine between any device that needs to send or receive data to or from a software task that respectively consumes or produces the data piece by piece, or in various units of packets, messages, pages, blocks, etc.
[0030] These legacy implementations are across devices from high-throughput disaggregated storage elements, to 12+ terabit switches, and high-speed NICs. Simultaneously, AI-enabled real-time computations continue to drive per-capita data creation and consumption. In addition, data operations around billions of Internet of Things (IoT) devices are experiencing a similar problem with legacy implementations that may affect machine-to-machine (M2M) communications, storage, data filtering, time-series analytics, and so on. Data is being produced on the scale of petabytes per second and has to be moved, processed, filtered, cleaned, and organized for computation through large-scale parallel and distributed computations. As these, and other, computations become ubiquitous, high-speed and efficient data transfer has become increasingly important.
[0031] These legacy I/O architecture implementations often have difficulty achieving very low latencies and very high throughputs because of their need to cross back and forth between load-store semantics of processor-based computations and I/O semantics of data-transfer operations between CPUs and such devices as disks, NICs, or even specialized computing devices like FPGAs, or accelerators operating from their own private high-speed memories.
[0032] In the following description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. However, it will be apparent to those skilled in the art that embodiments of the present disclosure may be practiced with only some of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. It will be apparent to one skilled in the art that embodiments of the present disclosure may be practiced without the specific details. In other instances, well-known features are omitted or simplified in order not to obscure the illustrative implementations.
[0033] In the following detailed description, reference is made to the accompanying drawings that form a part hereof, wherein like numerals designate like parts throughout, and in which is shown by way of illustration embodiments in which the subject matter of the present disclosure may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of embodiments is defined by the appended claims and their equivalents.
[0034] For the purposes of the present disclosure, the phrase "A and/or B" means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase "A, B, and/or C" means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C).
[0035] The description may use perspective-based descriptions such as top/bottom, in/out, over/under, and the like. Such descriptions are merely used to facilitate the discussion and are not intended to restrict the application of embodiments described herein to any particular orientation.
[0036] The description may use the phrases "in an embodiment," or "in embodiments," which may each refer to one or more of the same or different embodiments. Furthermore, the terms "comprising," "including," "having," and the like, as used with respect to embodiments of the present disclosure, are synonymous.
[0037] The term "coupled with," along with its derivatives, may be used herein. "Coupled" may mean one or more of the following. "Coupled" may mean that two or moreelements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements indirectly contact each other, but yet still cooperate or interact with each other, and may mean that one or more other elements are coupled or connected between the elements that are said to be coupled with each other. The term "directly coupled" may mean that two or more elements are in direct contact.
[0038] As used herein, the term "module" may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group), and/or memory (shared, dedicated, or group) that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
[0039] Referring to FIG. 1, a block diagram representation of a system 300 including a streaming subsystem 340 in accordance with various embodiments is shown. The streaming subsystem 340 is configured to handle exchange of data between processors and devices, in accordance with various embodiments. A CPU 320 is coupled to an I/O device 330 via a streaming subsystem 340. In embodiments, the CPU 320 may be multiple CPUs. In embodiments, the CPU 320 may have multiple cores running multiple software threads 322, 324, 326. In embodiments, the I/O device 330 may be a NIC, a solid-state drive (SSD), or the like. In embodiments, the I/O device 330 may be an XPU device that may be a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), or some other device.
[0040] Embodiments, unlike legacy implementations that may use packet processing and converting from I/O to memory I/O, operate directly from/to data streams using the streaming subsystem 340. The streaming subsystem 340 is an interface that presents a streaming channel to the CPU 320. The streaming subsystem 340 includes hardware assists and acceleration mechanisms to produce an abstraction made seamless to an application 322, 324, 326 running on, for example, the CPU 320. The I/O device 330 is configured to use the streaming subsystem 340 as a bidirectional communication peer to the CPU 320.
[0041] In embodiments, streaming data via the streaming subsystem 340 may be a first-class operation limited by capacity exhaustion in the abstracted channels (described further below) of the streaming subsystem 340 that link producers of data, such as the CPU 320 and the I/O device 330, to consumers of data such as the CPU 320 and the I/O device 330.
[0042] The architecture described by conceptual diagram 300 may target delivering multi-terabit throughput with low latency at a high multithreading and multiprogramming scale under an abstracted synchronous programming model that does not involve queueing and does not cause loss in the routine cases. Examples of exceptional or infrequent cases include, but are not limited to, a congested buffer that impacts communications between the CPU 320 and the I/O device 330, a congested processor or device that cannot keep up with the device or processor, and a condition in which software is handling a rare event such as a page fault and is unable to perform timely actions to keep its I/O actions streamlined. In embodiments, the application 322, 324, 326 does not know or care that it is talking to an I/O device 330, and instead uses memory-like semantics to access data when it is needed.
[0043] Referring to FIG. 2, a diagram 400 illustrating a streaming interface with memory semantics, in accordance with various embodiments, is shown. The diagram 400 shows coordination between the CPU 407 and the I/O device 408. The CPU 407 may be similar to the CPU 320 and the I/O device 408 may be similar to the I/O device 330. In embodiments, the CPU 407 is configured to provide a stream payload 416 to the I/O device 408 and the I/O device 408 is configured to process the payload 418.
[0044] Legacy systems often include hundreds of different interactions for a complex payload. The streaming architecture of the diagram 400 may reduce the hundreds of interactions typically performed by legacy systems to the one-action interaction example of the diagram 400. This one-action interaction may be considered a logical one-action interaction. In embodiments, the one-action interaction may be broken into smaller pieces for efficiency, but unless there is a scarcity of CPU cycles, latency dilation of the interaction typically does not occur.
[0045] Referring to FIG. 3, a block diagram representation of a system including a streaming subsystem 340 in accordance with various embodiments is shown. The CPU 520 may be similar to the CPU 320, 407 and the I/O device 530 may be similar to the I/O device 330, 408. In embodiments, the streaming channel architecture, also referred to as the streaming subsystem 340, includes hardware components and a software host interface. In embodiments, the architecture as shown in FIG. 3 includes three functional areas. The first functional area is a streaming protocol functional area 540 that may be implemented as an interface using the uncore (not shown), which is a part of the CPU 520, and the I/O device 530. The streaming protocol functional area 540 includes a stream buffer 542 that may be part of the I/O device 530 and a stream buffer 544 that may be part of the CPU 520. In embodiments, the uncore includes functions of a microprocessor that are not in the core, but are closely connected to the core to achieve high performance. The uncore may also be referred to as a "system agent." The streaming protocol functional area 540 is configured to employ the stream buffers 542, 544 to implement flow control between the two stream buffers 542, 544 and support a stream buffer format.
[0046] The second functional area is an exposed stream buffer to software functional area 550, which may be implemented in the uncore. In an embodiment, the exposed stream buffer to software functional area 550 includes the streaming channels 554, the scheduler, the QoS, and assists 558. In embodiments, the exposed stream buffer to software functional area 550 will expose the stream buffers 542, 544 as streaming channels 554 that are available to the applications 562. The exposed stream buffer to software functional area 550 may also perform as a scheduler for compute and I/O (as discussed further with respect to FIG. 4 below) and provide buffer management functionality and other assists 558.
[0047] The third functional area is an enabled synchronous memory semantics functional area 560 that may be implemented at the instruction set architecture (ISA) level. The enabled synchronous memory semantics functional area 560 may include data streaming. In an embodiment, the data streaming includes enqueuing and dequeuing data, and implementing optional zero-copy ownership. In embodiments, zero copy may be achieved for the enqueue and dequeue operations through special instructions added to the CPUs, as described in FIGs. 7A-7C.
[0048] Embodiments may include various hardware assists 558 for managing buffers and for performing various canonical operations. These canonical operations may include unpacking data from headers or payloads and placing it into a desired, e.g., deserialized, format, or translating, e.g., from an unserialized format into a serialized format; selecting data elements and composing new data elements from them (e.g., performing a gather or a scatter operation); and/or presenting data to applications as logically contiguous streams while recording packet and message boundaries in auxiliary data structures. Such canonical operations obviate data transformations, may remove or hide management of various I/O descriptors, and may automate operations like compression, encryption, decompression, and decryption in hardware.
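By way of illustration only, one of the canonical operations above, presenting data to an application as a logically contiguous stream while recording message boundaries in an auxiliary data structure, can be sketched in software. The class and method names here are hypothetical and do not correspond to any claimed embodiment; in embodiments this function would be performed by the hardware assists 558.

```python
# Illustrative sketch (not part of any embodiment): packets are appended into
# one contiguous buffer, while (offset, length) records preserve boundaries.

class ContiguousStreamView:
    """Accumulates packet payloads into a contiguous byte stream and keeps
    auxiliary (offset, length) records so message boundaries are recoverable."""

    def __init__(self):
        self.data = bytearray()   # logically contiguous stream seen by the app
        self.boundaries = []      # auxiliary structure: (offset, length) pairs

    def append_packet(self, payload: bytes) -> None:
        self.boundaries.append((len(self.data), len(payload)))
        self.data += payload

    def message(self, index: int) -> bytes:
        offset, length = self.boundaries[index]
        return bytes(self.data[offset:offset + length])

view = ContiguousStreamView()
view.append_packet(b"hello ")
view.append_packet(b"world")
assert bytes(view.data) == b"hello world"   # contiguous to the application
assert view.message(1) == b"world"          # boundary still recoverable
```

The application reads `view.data` as one stream, yet the auxiliary records let it recover individual messages without reformatting the data, which is the effect the hardware assist is described as providing.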
[0049] In embodiments, the streaming I/O architecture, also referred to as the streaming subsystem 340, contains streaming channels 554 which are composed from three additions to legacy implementations: stream buffers 542, 544, streaming Instruction Set Architecture (ISA) instructions, and a stream buffer I/O protocol.
[0050] Stream Buffer
[0051] The stream buffer may be implemented using capacities available in processor caches, memory-side caches (e.g., L4), and memory associated with the CPU 520. In embodiments, the memory may be backed up with DRAM. In embodiments, the stream buffers may be partitioned into virtual queues, similar to an I/O device.
[0052] Stream buffers are implemented at the CPU 520 via the stream buffer 544, and at the I/O device 530 via the stream buffer 542. By this implementation, each stream buffer 542, 544 may funnel in or funnel out the data that it sends to/receives from the other. Data may flow smoothly in either direction, from the CPU 520 to the I/O device 530 or vice versa, with flow control implemented by various policy mechanisms described below with respect to the streaming I/O buffer protocol.
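The flow control between paired stream buffers can be illustrated, purely as a software sketch, with a credit scheme in which a sender tracks the space its peer has advertised. The class, its names, and the credit granularity are assumptions for illustration; the claims describe only flow control based on availability of allocated buffer space, not this specific mechanism.

```python
# Illustrative credit-based flow control sketch (not part of any embodiment).
# The sender may transmit only while the peer stream buffer has advertised space.

class StreamBufferLink:
    def __init__(self, remote_capacity: int):
        self.credits = remote_capacity   # bytes of space advertised by the peer

    def try_send(self, payload: bytes) -> bool:
        if len(payload) > self.credits:
            return False                 # back-pressure: peer buffer would overflow
        self.credits -= len(payload)
        return True

    def on_credit_return(self, nbytes: int) -> None:
        self.credits += nbytes           # peer drained data; space is available again

link = StreamBufferLink(remote_capacity=8)
assert link.try_send(b"abcdef")      # 6 bytes fit within 8 credits
assert not link.try_send(b"xyz")     # only 2 credits remain, send is deferred
link.on_credit_return(6)             # peer reports it drained 6 bytes
assert link.try_send(b"xyz")         # send now succeeds
```

A scheduler built over such a link would use the failed `try_send` as its signal to defer enqueue for that channel, matching the availability-based scheduling described in the claims.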
[0053] Streaming Instruction Set Architecture (ISA) instructions
[0054] The ISA instructions enable enqueue and dequeue of data by the application 562 to and from a streaming channel 554. In embodiments, zero copy may be achieved for the enqueue and dequeue operations through special instructions added to the CPUs, as described in FIGs. 7A-7C. In embodiments, these instructions advance beyond legacy application integration architecture (AIA) queuing instructions because they are stream oriented and include the logic to cause automatic actions such as credit management, buffer space and descriptor management, hooking in various data transformations, and control steering operations. These instructions may be implemented in hardware or redirected to a combination of hardware and emulating software in different generations of implementation.
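The application-visible semantics of the enqueue and dequeue instructions can be approximated, again only as an illustrative software model, by a bounded channel whose enqueue succeeds only while capacity remains. The model is an assumption for exposition; the actual instructions are described as hardware (or hardware-plus-emulation) operations with automatic credit and descriptor management.

```python
from collections import deque

# Illustrative model (not part of any embodiment) of the stream-oriented
# enqueue/dequeue semantics: the operation succeeds only while the channel
# has space, so capacity management is implicit rather than application code.

class StreamingChannel:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.fifo = deque()

    def enqueue(self, item) -> bool:     # models the transmit-side instruction
        if len(self.fifo) >= self.capacity:
            return False                 # no space: caller retries or yields
        self.fifo.append(item)
        return True

    def dequeue(self):                   # models the receive-side instruction
        return self.fifo.popleft() if self.fifo else None

ch = StreamingChannel(capacity=2)
assert ch.enqueue("a") and ch.enqueue("b")
assert not ch.enqueue("c")               # channel full, enqueue is deferred
assert ch.dequeue() == "a"               # draining frees space
assert ch.enqueue("c")                   # deferred enqueue now succeeds
```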
[0055] Streaming I/O Buffer Protocol
[0056] Existing PCIe electricals can be used running extensions of PCIe, Compute Express Link (CXL), or combined with a streaming channel protocol. In embodiments, a streaming message protocol allows source data to be pushed to the destination including the PCIe bus, device, and function (BDF), Process Address Space ID (PASID), stream payload, and metadata. The streaming I/O buffer operates in two directions, from the I/O device 530 to the CPU 520 and from the CPU 520 to the I/O device 530.
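The fields of such a pushed streaming message can be sketched as follows. The field names mirror the text above (BDF, PASID, payload, metadata), but the framing shown is a toy layout invented for illustration; the actual wire format of the protocol is not specified here.

```python
from dataclasses import dataclass, field

# Illustrative sketch (not part of any embodiment) of a pushed streaming
# message carrying BDF, PASID, payload, and metadata. The framing is a toy
# layout assumed for illustration, not the protocol's actual wire format.

@dataclass
class StreamingMessage:
    bdf: int                 # PCIe bus/device/function of the source
    pasid: int               # Process Address Space ID
    payload: bytes           # stream payload
    metadata: dict = field(default_factory=dict)

    def to_wire(self) -> bytes:
        # toy framing: 4-byte BDF, 4-byte PASID, 4-byte length, then payload
        header = self.bdf.to_bytes(4, "big") + self.pasid.to_bytes(4, "big")
        return header + len(self.payload).to_bytes(4, "big") + self.payload

msg = StreamingMessage(bdf=0x0100, pasid=7, payload=b"data")
wire = msg.to_wire()
assert len(wire) == 12 + len(msg.payload)
```

Because the BDF and PASID travel with the payload, the receiving end can steer the data to the correct device function and process address space without a separate descriptor exchange, which is the push semantics the paragraph describes.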
[0057] Streaming Channels
[0058] In embodiments, the stream buffer 544 is organized into multiple streaming channels 554. In other words, a multi-channel abstraction is overlaid on the stream buffer 544 so the streaming subsystem 340 presents a plurality of streaming channels 554 between a producing end, such as the CPU 520, and a consuming end, such as the I/O device 530, bi-directionally. The streaming channels 554 may have different rates of flow and different amounts of buffering, so that different classes of service can be constructed for shared but flexibly partitioned use of the stream buffer 544 among the different usages.
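The flexible partitioning of one stream buffer among channels of different classes of service can be sketched with a weighted allocation. The function name and the proportional-by-weight policy are assumptions chosen for illustration; the claims state only that amounts of buffer space are allocated per channel based at least in part on per-application priority.

```python
# Illustrative sketch (not part of any embodiment): split a stream buffer's
# capacity among streaming channels in proportion to per-channel priority
# weights, yielding different classes of service over one shared buffer.

def allocate_channel_space(total_bytes: int, weights: dict) -> dict:
    """Split total_bytes among channels proportionally to their weights."""
    total_weight = sum(weights.values())
    alloc = {ch: (w * total_bytes) // total_weight for ch, w in weights.items()}
    # hand any integer-division remainder to the highest-weight channel
    remainder = total_bytes - sum(alloc.values())
    alloc[max(weights, key=weights.get)] += remainder
    return alloc

alloc = allocate_channel_space(1024, {"ch0": 3, "ch1": 1})
assert sum(alloc.values()) == 1024        # the whole buffer stays accounted for
assert alloc["ch0"] == 768 and alloc["ch1"] == 256
```

Re-running the allocation with updated weights would model the dynamic reallocation described in claim 7, where channel space is adjusted when one channel's offered data falls below a threshold.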
| # | Name | Date |
|---|---|---|
| 1 | 202144045920-FORM 1 [08-10-2021(online)].pdf | 2021-10-08 |
| 2 | 202144045920-DRAWINGS [08-10-2021(online)].pdf | 2021-10-08 |
| 3 | 202144045920-DECLARATION OF INVENTORSHIP (FORM 5) [08-10-2021(online)].pdf | 2021-10-08 |
| 4 | 202144045920-COMPLETE SPECIFICATION [08-10-2021(online)].pdf | 2021-10-08 |
| 5 | 202144045920-FORM-26 [03-01-2022(online)].pdf | 2022-01-03 |
| 6 | 202144045920-POA [28-03-2022(online)].pdf | 2022-03-28 |
| 7 | 202144045920-FORM 13 [28-03-2022(online)].pdf | 2022-03-28 |
| 8 | 202144045920-AMMENDED DOCUMENTS [28-03-2022(online)].pdf | 2022-03-28 |
| 9 | 202144045920-FORM 3 [04-04-2022(online)].pdf | 2022-04-04 |
| 10 | 202144045920-FORM 3 [03-10-2022(online)].pdf | 2022-10-03 |
| 11 | 202144045920-FORM 18 [24-10-2024(online)].pdf | 2024-10-24 |