Abstract: There is disclosed a router for routing data on a computing chip comprising a plurality of processing elements, the router comprising: a packet processing pipeline; a dropped packet buffer; and one or more circuits configured to: determine that a data packet in the packet processing pipeline is to be dropped; move the data packet that is to be dropped from the packet processing pipeline to the dropped packet buffer; and re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing.
Technical Field
This specification relates to a router for routing data on a computing chip comprising a plurality of processing elements and corresponding methods for processing data packets by the router.
Background
Many of today’s more powerful computers, including high-performance computers used for demanding scientific applications such as weather forecasting, drug discovery, neural simulations and self-driving vehicles amongst others, use a large number of processors connected together through some kind of inter-processor network. It is vital for these machines that their communications networks do not deadlock as a result of circular dependencies between packets of information travelling through the networks. A deadlock is a situation where packet A cannot proceed because its path is blocked by packet B, and packet B cannot proceed because its path is blocked by packet A. More than two packets may be involved in such a circular dependency, but the principle remains the same.
To avoid any risk of circular dependencies resulting in deadlocks, it is common practice to design such networks such that circular dependencies can never arise. This can be achieved through the use of multiple network layers, or the use of virtual channels, or other known techniques. Such techniques are effective in eliminating any possibility of a circular dependency, but they incur significant additional cost in terms of the hardware required to support the additional communication features. As such, it is desirable to provide improved means of preventing deadlock in inter-processor networks.
Summary
According to a first aspect, there is provided a router for routing data on a computing chip comprising a plurality of processing elements, the router comprising: a packet processing pipeline; a dropped packet buffer; and one or more circuits configured to: determine that a data packet in the packet processing pipeline is to be dropped; move the data packet that is to be dropped from the packet processing pipeline to the dropped packet buffer; and re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing.
As discussed above, prior methods attempted to design inter-processor networks such that circular dependencies that can result in deadlock can never arise. However, such techniques incur significant additional cost in terms of the hardware required to support the additional communication features required. The present invention allows a much simpler network to be deadlock free by allowing packets that may be blocked to be temporarily removed (“dropped”) from the network fabric and held in a buffer, thereby removing any circular dependencies. The packets can subsequently be re-inserted into the network and delivered as normal. The use of hardware mechanisms for packet dropping and re-insertion allows the network to maintain good throughput even when congested, thereby ensuring rapid and reliable packet delivery from a low-cost network fabric.
In some other prior art methods, the process for dropping data packets is performed in software. For example, in one prior art method, a processing element on the chip is designated as a monitor processor. When it is determined that a packet is to be dropped, an interrupt signal is sent to the monitor processor. The monitor processor may then copy the dropped packet and re-transmit the copied packet. This process however is slow given the time required for the monitor processor to service the interrupt. The latency incurred to copy and re-transmit the packet over the network further increases the time required to handle dropped packets. The present invention provides a hardware-based packet dropping and reinsertion mechanism as part of the router. Thus, the time required to process dropped packets is significantly reduced and the throughput of the router under congestion can be increased. As packet dropping and re-insertion happens fully within the router instead of requiring transmission of the dropped packet to a monitor processor and re-transmission to the router, the process is also more power efficient. In addition, as a monitor processor is not required, the processing elements of the chip can be fully dedicated to performing their respective tasks.
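The drop-and-re-insert cycle described above can be summarised in a short behavioural sketch. This is purely illustrative pseudocode-style Python, not the claimed circuit design; the class and method names, and the modelling of the pipeline and dropped packet buffer as FIFO queues, are assumptions made for exposition only.

```python
from collections import deque

class Router:
    """Illustrative model of the hardware drop/re-insert cycle.

    A blocked packet is moved out of the pipeline into the dropped
    packet buffer instead of stalling the pipeline, and is later
    re-inserted for re-processing.
    """

    def __init__(self):
        self.pipeline = deque()               # packets being processed
        self.dropped_packet_buffer = deque()  # temporarily dropped packets

    def step(self, destination_ready):
        """Attempt to deliver the packet at the head of the pipeline.

        destination_ready: predicate reporting whether the packet's
        destination can currently receive it (an assumed interface).
        """
        if not self.pipeline:
            return None
        packet = self.pipeline[0]
        if destination_ready(packet):
            return self.pipeline.popleft()    # delivered normally
        # Destination blocked: drop to the buffer instead of stalling.
        self.dropped_packet_buffer.append(self.pipeline.popleft())
        return None

    def reinsert(self):
        """Move one dropped packet back into the pipeline for re-processing."""
        if self.dropped_packet_buffer:
            self.pipeline.append(self.dropped_packet_buffer.popleft())
```

Because both queues live inside the router model, the sketch mirrors the key property claimed above: no monitor processor or off-router re-transmission is involved in the drop/re-insert path.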
It will be appreciated that the router may route data packets between processing elements of the same chip and that chips may be connected such that data packets may be routed between processing elements of different chips. It will also be appreciated that the router comprises a hardware-based packet dropping and re-insertion mechanism comprising the dropped packet buffer and the one or more circuits.
The one or more circuits may comprise a packet re-insertion module that includes the dropped packet buffer and is configured to re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing.
The one or more circuits may be configured to determine that a data packet in the packet processing pipeline is to be dropped based upon a signal indicating that a destination of the data packet is unable to receive the data packet. It will be appreciated that a destination may not necessarily be a final destination of the data packet, rather a destination may include an intermediate node of the on-chip data network or may be an output port of the router. In order to avoid blocking the packet processing pipeline, the data packet may be dropped.
The packet processing pipeline may further comprise an out-of-order packet buffer and the one or more circuits may be configured to move the data packet that is to be dropped from the out-of-order packet buffer to the dropped packet buffer. The one or more circuits may be further configured to select a data packet for processing by the packet processing pipeline; determine one or more destinations of the data packet; determine that a destination of the one or more destinations of the data packet is unable to receive the data packet; and move the data packet to the out-of-order packet buffer.
The out-of-order packet buffer may enable stalled data packets, that is, data packets that cannot yet be transmitted, to be queued such that the packet processing pipeline can continue to process further packets without needing to wait for a stalled data packet to be dropped before continuing. Where a data packet cannot be transmitted, it may be stored in the out-of-order packet buffer until such time as the data packet may be transmitted out to its destinations or should be removed from the out-of-order buffer to be dropped.
The out-of-order packet buffer may be a pipeline queue. When new data packets become stalled, existing stalled data packets in the out-of-order packet buffer may be moved further along in the queue to accommodate the new stalled data packets. The data packet that is to be dropped may be the oldest data packet in the out-of-order packet buffer.
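The queueing behaviour described above can be sketched as a fixed-depth shift queue in which newly stalled packets enter at the front and the packet occupying the end slot is the oldest, and hence the drop candidate. This is an illustrative sketch under assumptions: the buffer depth, class name, and the convention that a full end slot identifies the drop candidate are choices made here for exposition.

```python
from collections import deque

class OutOfOrderBuffer:
    """Sketch of the out-of-order packet buffer as a pipeline queue.

    Left end = newest stalled packet, right end = oldest (the end slot).
    """

    def __init__(self, depth=4):
        self.depth = depth          # assumed fixed number of slots
        self.slots = deque()

    def stall(self, packet):
        """Queue a newly stalled packet, shifting existing entries along.

        Returns the packet now occupying the end slot if the buffer has
        filled (i.e. the drop candidate), otherwise None.
        """
        self.slots.appendleft(packet)
        return self.slots[-1] if len(self.slots) >= self.depth else None

    def drop_oldest(self):
        """Remove the oldest stalled packet for transfer to the dropped packet buffer."""
        return self.slots.pop()
```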
The one or more circuits may be configured to determine that a data packet in the packet processing pipeline is to be dropped based upon the expiration of a timer. For example, where an out-of-order packet buffer is used, the timer may be started when a data packet reaches the end slot of the buffer. This may indicate that the buffer is full and that data packets will need to be dropped to ensure that the packet processing pipeline is not unduly delayed. Where an out-of-order packet buffer is not used, a timer may be started when it is determined that the data packet cannot be transmitted. The timer provides a minimum waiting time such that packets are not unnecessarily dropped.
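A minimal sketch of the drop-decision timer follows. It is an illustrative model only: the tick-based counting and the method names are assumptions, and real hardware would implement this as a counter in the router circuitry rather than software.

```python
class DropTimer:
    """Sketch of the timer whose expiration triggers a packet drop.

    The timer is started when a packet reaches the end slot of the
    out-of-order buffer (or when a packet is found untransmittable),
    giving a minimum waiting time before the drop decision.
    """

    def __init__(self, timeout_ticks):
        self.timeout = timeout_ticks
        self.remaining = None       # None = timer not running

    def start(self):
        self.remaining = self.timeout

    def tick(self):
        """Advance one cycle; returns True when the timer expires,
        i.e. when the waiting packet is to be dropped."""
        if self.remaining is None:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.remaining = None
            return True
        return False
```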
The one or more circuits may be configured to re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing based upon a timer. For example, the timer may start when a packet is moved into the dropped packet buffer and the dropped packet buffer is no longer empty. The timer may be reset when a dropped packet is removed from the dropped packet buffer and re-inserted into the packet processing pipeline. The timer may be restarted if the dropped packet buffer is not empty. The timer thereby enforces a minimum time interval between dropped packet re-insertions. The minimum time interval may be configured dynamically based upon a current load of the dropped packet buffer and/or the current status of the router and/or the current status of the on-chip network.
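The minimum-interval behaviour of the re-insertion timer might be sketched as follows. The load-based shortening of the interval is an assumed policy chosen to illustrate the dynamic configuration mentioned above; the specification does not prescribe a particular formula.

```python
class ReinsertionTimer:
    """Sketch of the re-insertion timer.

    Enforces a minimum interval between dropped-packet re-insertions;
    the caller restarts it whenever the dropped packet buffer remains
    non-empty after a re-insertion.
    """

    def __init__(self, base_interval):
        self.base_interval = base_interval
        self.remaining = None       # None = timer not running

    def start(self, buffer_load=0.0):
        # Assumed policy: shorten the interval as the dropped packet
        # buffer fills, so re-insertion is attempted more eagerly
        # under load (0.0 = empty, 1.0 = full).
        self.remaining = max(1, int(self.base_interval * (1.0 - buffer_load)))

    def tick(self):
        """Advance one cycle; True means the minimum interval has
        elapsed and a re-insertion may now be requested."""
        if self.remaining is None:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.remaining = None
            return True
        return False
```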
The one or more circuits may be configured to re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing based upon a load of the packet processing pipeline. For example, a re-insertion request may be sent to the packet processing pipeline and a re-insertion is performed when the request is granted. Granting of the request may be based upon the load of the packet processing pipeline and/or when there is a free slot at the start of the pipeline and/or when there are no new data packets arriving for processing and/or on a periodic or interleaved basis with newly arriving data packets and/or other conditions deemed
appropriate by a person skilled in the art. The one or more circuits may comprise a re-insertion requester as part of a packet re-insertion module for handling the re-insertion request and the re-insertion requester may be connected to a re-insertion timer that signals when to generate a re-insertion request.
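The grant conditions listed above can be combined in many ways; the sketch below shows one assumed priority order (free slot required, idle cycles always granted, otherwise interleaved with new arrivals on a fixed period). The function name, parameters, and the interleave period are all illustrative choices, not taken from the specification.

```python
def grant_reinsertion(pipeline_free_slot, new_packet_arriving,
                      cycle, interleave_period=4):
    """Sketch of the grant logic for a re-insertion request.

    pipeline_free_slot:  whether the start of the pipeline has a free slot
    new_packet_arriving: whether a new packet is arriving this cycle
    cycle:               current cycle count, for periodic interleaving
    """
    if not pipeline_free_slot:
        return False            # no free slot at the start of the pipeline
    if not new_packet_arriving:
        return True             # pipeline idle: grant immediately
    # Otherwise interleave re-insertions with newly arriving packets.
    return cycle % interleave_period == 0
```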
The router may be configured to provide a software-based packet offload operation to remove a data packet stored in the dropped packet buffer for re-processing of the data packet outside of the router. For example, if the dropped packet buffer is full or approaching full, a software-based mechanism may be used to remove packets from the dropped packet buffer for storage elsewhere on chip such as local memory associated with a processing element. The removed packet may then be re-transmitted on the on-chip network. The router may comprise a register in a general register file that when read from, initiates the software-based packet offload operation. The general register file may be configured as an intermediate storage location for storing the dropped packet prior to copying the packet out of the router to elsewhere on-chip.
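The software-visible side of the offload operation described above might look like the following sketch, in which a read of a designated register pops the next dropped packet into intermediate storage for software to copy elsewhere on-chip. The register and class names are hypothetical, introduced only to illustrate the read-triggered mechanism.

```python
from collections import deque

class OffloadRegisterFile:
    """Sketch of the software-based packet offload path.

    Reading the (hypothetical) offload register initiates removal of a
    dropped packet from the dropped packet buffer, staging it in the
    general register file for software to copy to local memory and
    later re-transmit on the on-chip network.
    """

    def __init__(self, dropped_packet_buffer):
        self.buffer = dropped_packet_buffer   # shared with the router
        self.staging = None                   # intermediate storage slot

    def read_offload_register(self):
        """Read triggers the offload: returns the next dropped packet,
        or None if the dropped packet buffer is empty."""
        if not self.buffer:
            return None
        self.staging = self.buffer.popleft()
        return self.staging
```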
The packet processing pipeline may be configured to process multi-cast data packets. The router may comprise a plurality of packet processing pipelines. Each packet processing pipeline may be configured to process a particular type of data packet, for example, a multi-cast packet or system type packet. The dropped packet buffer may be configured to serve all of the packet processing pipelines. The one or more circuits may comprise an arbiter to select between dropped packet requests of different packet processing pipelines. Alternatively, there may be a plurality of dropped packet buffers each serving one or more of the plurality of packet processing pipelines.
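Where a single dropped packet buffer serves several pipelines, the arbiter mentioned above must pick one requesting pipeline per cycle. Round-robin selection is shown below as one fair policy; the specification does not fix a particular arbitration scheme, so this is an illustrative assumption.

```python
class DropArbiter:
    """Sketch of a round-robin arbiter selecting among dropped-packet
    requests from multiple packet processing pipelines that share one
    dropped packet buffer."""

    def __init__(self, num_pipelines):
        self.num_pipelines = num_pipelines
        self.last = num_pipelines - 1   # so pipeline 0 is checked first

    def select(self, requests):
        """requests: list of bools, one per pipeline.

        Returns the index of the granted pipeline, starting the search
        just after the last grant, or None if no pipeline is requesting.
        """
        for offset in range(1, self.num_pipelines + 1):
            i = (self.last + offset) % self.num_pipelines
            if requests[i]:
                self.last = i
                return i
        return None
```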
The dropped packet buffer may be configured to store a plurality of dropped data packets. The dropped packet buffer may comprise one or more Static RAM (SRAM) modules. The dropped packet buffer may further comprise a packet disassembly buffer for copying a dropped packet into the dropped packet buffer and a packet re-assembly buffer for removing a dropped packet from the dropped packet buffer.
The computing chip may be a neuromorphic chip. Neuromorphic chips are specialised chips for performing biological neural simulations. In a biological neural network, processing is achieved via the firing of electro-chemical impulses between neurons of the biological neural network. Information is contained in the firing rate and timing of these impulses. This may be simulated using one or more neuromorphic chips whereby the processing elements of the neuromorphic chips are configured to simulate the activities of individual neurons. Data packets representing the impulses are sent between neurons (processing elements) at the appropriate time and rate. Thus, given the potential large-scale nature of the simulation and the desire for the simulation to occur in biological real-time, effective management of data traffic on and between chips is critical.
In addition, data packets representing neural communication may be sent as multi-cast packets given the one to many nature of neural connectivity. In order to simplify delivery of multi-cast packets, an all-or-nothing policy may be used such that a packet must be deliverable to all destinations or it is not transmitted at all. Such a policy may result in additional waiting time before a data packet can be transmitted. In addition, multi-casting packets may require creating multiple copies of the same packet which may also create a delay within the router and may also impact on router throughput.
Furthermore, the components on the neuromorphic chip may operate at different speeds and there may need to be restrictions in place on the transmission rate from the router to various components in order to avoid overwhelming those components. For example, a processing element may not be able to process packets as quickly as the router may be able to transmit the packets. As such, there may be restrictions on how quickly data packets can be transmitted out of the router to the same processing element. In another example, neuromorphic chips may be connected together to increase the processing power available. Packets may be transmitted to processing elements of different chips through an inter-chip link. The inter-chip link may run slower than the router and as such, there may be restrictions on the transmission of packets out of the router via the inter-chip link further affecting router throughput.
Network congestion and the inability to maintain sufficient router throughput can also result in significant performance decrease on the on-chip network. For example, congestion at a particular node may prevent packets from being transmitted out of the router and may cause the packet processing pipeline to halt. This may in turn cause further congestion as packets are queued up at nodes in the network to the point where no further packets can be transmitted given that blocked nodes may be dependent on other blocked nodes for routing and a deadlock situation may arise. Congestion at a
node may arise if the node is unable to process packets as quickly as they are being received or if the node has crashed. In another example, network congestion may delay the timely transmittal of system/configuration type packets and may cause the system to slow or even crash.
The present invention is particularly suitable for use with neuromorphic chips given the provision of hardware mechanisms for packet dropping and re-insertion, which as noted above, is capable of maintaining good throughput even in the presence of network congestion. It will be appreciated that the above discussion may also be applicable more generally to computing chips such as those used in high performance computing applications (HPC).
According to another aspect, there is provided a method of routing data on a computing chip comprising a plurality of processing elements, the router comprising a packet processing pipeline, a dropped packet buffer, and one or more circuits, the method comprising: determining, by the one or more circuits, that a data packet in the packet processing pipeline is to be dropped; moving, by the one or more circuits, the data packet that is to be dropped from the packet processing pipeline to the dropped packet buffer; and re-inserting, by the one or more circuits, the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing.
The method may correspond to the operations of the one or more circuits of the first aspect. For example, determining that a data packet in the packet processing pipeline is to be dropped may be based upon a signal indicating that a destination of the data packet is unable to receive the data packet.
The method may further comprise moving the data packet that is to be dropped from an out-of-order packet buffer to the dropped packet buffer.
The method may further comprise: selecting a data packet for processing by the packet processing pipeline; determining one or more destinations of the data packet; determining that a destination of the one or more destinations of the data packet is unable to receive the data packet; and moving the data packet to the out-of-order packet buffer.
The data packet that is to be dropped may be the oldest data packet in the out-of-order packet buffer.
Determining that a data packet in the packet processing pipeline is to be dropped may be based upon the expiration of a timer.
Re-inserting the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing may be based upon a load of the packet processing pipeline.
The method may further comprise removing a data packet stored in the dropped packet buffer in response to a software-based packet offload operation.
The packet processing pipeline may be configured to process multi-cast data packets.
The dropped packet buffer may be configured to store a plurality of dropped data packets.
The dropped packet buffer may comprise one or more SRAM modules.
The computing chip may be a neuromorphic chip.
It will be appreciated that aspects can be combined and it will be readily appreciated that features described in the context of one aspect can be combined with other aspects.
Brief Description of the Figures
Embodiments will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a schematic illustration of a neuromorphic chip.
Figure 2 is a schematic illustration of a network topology on a neuromorphic chip.
Figure 3 is a schematic illustration of a packet processing pipeline of a router on a neuromorphic chip.
Figure 3A is a schematic illustration of a packet re-insertion module.
Figure 3B is a schematic illustration of a dropped packet buffer.
Figure 4 is a flowchart showing processing carried out by the router.
Figure 5 is a flowchart showing processing carried out by the router in more detail.
Figure 6 is a flowchart showing exemplary processing for determining whether to drop a data packet.
Figure 7 is a flowchart showing alternative exemplary processing for determining whether to drop a data packet.
Figure 8 is a flowchart showing processing for handling dropped data packets.
Detailed Description
Referring now to Figure 1, a schematic illustration of a neuromorphic chip 100 is shown. The neuromorphic chip 100 comprises a plurality of processing elements 101. Whilst Figure 1 shows a chip having eight processing elements, it will be appreciated that there may be a greater (or fewer) number of processing elements on the chip 100. For example, one exemplary neuromorphic chip has 152 processing elements.
Each processing element may comprise a processor core, local memory and modules for transmitting and receiving data. Data may be communicated between processing elements 101 of the chip 100 and/or between processing elements of different chips. The chip 100 may comprise a plurality of inter-chip links 102 in this regard. External devices may also be connected to the chip 100 via the external I/O interface 103.
The chip 100 additionally comprises a primary router 105. The primary router 105 is configured to direct specified types of on-chip data traffic, such as multi-cast data packets, and all inter-chip data traffic. Data is communicated via communication links on the chip 100 that connect the different components of the chip 100 and forms a Network-on-Chip (NoC). Where data is to be communicated to a component on another chip, the primary router 105 may be in communication with the primary router of the other chip via the inter-chip links 102.
The primary router 105 comprises a packet processing pipeline. The packet processing pipeline is configured to receive data packets transmitted to the primary router 105 for processing, determine a destination of the data packet, determine an appropriate
routing path for the data packet, and to transmit the data packet to the determined destination via the determined routing path.
When a data packet is ready to be transmitted out of the pipeline to its destination, it is possible that the data packet cannot be transmitted at that time due to factors such as network congestion. For example, the neuromorphic chip 100 may be used to run a neural simulation. Each processing element of the chip 100 may be used to simulate a large number of neurons, for example, one processing element may simulate 1000 neurons. Thus, there may be congestion at the router 105 if communications are directed to a large number of neurons on the same processing element or if large amounts of communications are destined for a small number of neurons on the processing element. In addition, one neuron is typically connected to many other neurons. This is reflected in the use of multi-casting for communications between neurons, that is, one data packet transmitted from a processing element to the router 105 may result in the generation of many data packets being transmitted out of the router 105, thereby amplifying network traffic. Furthermore, in order to simplify the delivery mechanism of multi-cast packets, the router may operate an all-or-nothing policy, whereby a multi-cast packet is only transmitted out of the router 105 if all of the destinations are available to receive the multi-cast packet. This policy avoids the need for mechanisms to record delivery of multi-cast packets and to formulate potentially complex re-transmission policies, thereby reducing storage and power requirements of the chip. However, an all-or-nothing policy increases the likelihood that a packet cannot be transmitted and therefore stalls at the end of the packet processing pipeline. This may cause the packet processing pipeline to become blocked until the stalled packet may be transmitted.
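The all-or-nothing delivery check described above reduces to a single condition: a multi-cast packet leaves the router only if every destination can accept it. The sketch below is illustrative; the `ready` predicate stands in for whatever availability signals the router circuitry actually receives from destinations.

```python
def can_transmit_multicast(destinations, ready):
    """All-or-nothing policy sketch: transmit the multi-cast packet only
    if every destination is currently able to receive it.

    destinations: the set of destinations of the multi-cast packet
    ready:        assumed predicate mapping a destination to availability
    """
    return all(ready(d) for d in destinations)
```

If this check fails, the packet stalls (and may ultimately be dropped to the dropped packet buffer) rather than being partially delivered, which is why no per-destination delivery record is needed.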
The router 105 provides a hardware-based mechanism for handling stalled packets and to maintain throughput of the router 105. In this regard, the router 105 further comprises a dropped packet buffer and one or more circuits within the router 105 configured to determine that a data packet in the packet processing pipeline is to be dropped. The one or more circuits are further configured to move the data packet that is to be dropped from the packet processing pipeline to the dropped packet buffer. The one or more circuits are additionally configured to re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing. Each of these operations are described in more detail below with reference to Figure 3 which shows an exemplary packet processing pipeline that includes a dropped packet buffer 306.
In some prior art methods, the process for handling stalled packets is performed in software. For example, in one prior art method, a processing element on the chip 100 is designated as a monitor processor. When a packet is to be dropped, an interrupt signal is sent to the monitor processor. The monitor processor may then copy the dropped packet and re-transmit the copied packet. This process however is slow given the time required for the monitor processor to service the interrupt. The latency incurred to copy and re-transmit the packet over the network further increases the time required to handle dropped packets. By providing a hardware-based packet dropping and reinsertion mechanism as part of the router 105, the time required to process dropped packets is significantly reduced and the throughput of the router 105 under congestion can be increased. As packet dropping and re-insertion happens fully within the router 105 instead of requiring transmission of the dropped packet to a monitor processor and re-transmission to the router, the process is also more power efficient. In addition, as a monitor processor is not required, the processing elements of the chip can be fully dedicated to performing their respective tasks. An analysis of the throughput of the router 105 is provided in more detail below.
Referring now to Figure 2, a schematic illustration of an exemplary network topology for the NoC is shown. In Figure 2, the NoC is arranged as a 2D mesh network. This type of topology provides a more scalable solution as the number of processing elements on the chip increases compared to other network topologies such as a star topology.
WE CLAIM:
1. A router for routing data on a computing chip comprising a plurality of processing elements, the router comprising:
a packet processing pipeline; and
a hardware-based packet dropping and re-insertion mechanism comprising:
a dropped packet buffer; and
one or more circuits configured to:
determine that a data packet in the packet processing pipeline is to be dropped;
move the data packet that is to be dropped from the packet processing pipeline to the dropped packet buffer; and
re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing.
2. The router of claim 1, wherein the one or more circuits are configured to determine that a data packet in the packet processing pipeline is to be dropped based upon a signal indicating that a destination of the data packet is unable to receive the data packet.
3. The router of claim 1 or 2, wherein the packet processing pipeline further comprises an out-of-order packet buffer and wherein the one or more circuits are configured to move the data packet that is to be dropped from the out-of-order packet buffer to the dropped packet buffer.
4. The router of claim 3, wherein the one or more circuits are further configured to:
select a data packet for processing by the packet processing pipeline;
determine one or more destinations of the data packet;
determine that a destination of the one or more destinations of the data packet is unable to receive the data packet; and
move the data packet to the out-of-order packet buffer.
5. The router of claim 3 or 4, wherein the data packet that is to be dropped is the oldest data packet in the out-of-order packet buffer.
6. The router of any preceding claim, wherein the one or more circuits are configured to determine that a data packet in the packet processing pipeline is to be dropped based upon the expiration of a timer.
7. The router of any preceding claim, wherein the one or more circuits are configured to re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing based upon a timer.
8. The router of any preceding claim, wherein the one or more circuits are configured to re-insert the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing based upon a load of the packet processing pipeline.
9. The router of any preceding claim, wherein the router is configured to provide a software-based packet offload operation to remove a data packet stored in the dropped packet buffer for re-processing of the data packet outside of the router.
10. The router of any preceding claim, wherein the packet processing pipeline is configured to process multi-cast data packets.
11. The router of any preceding claim, wherein the dropped packet buffer is configured to store a plurality of dropped data packets.
12. The router of any preceding claim, wherein the dropped packet buffer comprises one or more SRAM modules.
13. The router of any preceding claim, wherein the computing chip is a neuromorphic chip.
14. A method of routing data on a computing chip comprising a plurality of processing elements, the router comprising a packet processing pipeline, a dropped packet buffer, and one or more circuits, the method comprising:
determining, by the one or more circuits, that a data packet in the packet processing pipeline is to be dropped;
moving, by the one or more circuits, the data packet that is to be dropped from the packet processing pipeline to the dropped packet buffer; and
re-inserting, by the one or more circuits, the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing.
15. The method of claim 14, wherein determining that a data packet in the packet processing pipeline is to be dropped is based upon a signal indicating that a destination of the data packet is unable to receive the data packet.
16. The method of claim 14 or 15, further comprising moving the data packet that is to be dropped from an out-of-order packet buffer to the dropped packet buffer.
17. The method of claim 16, further comprising:
selecting a data packet for processing by the packet processing pipeline;
determining one or more destinations of the data packet;
determining that a destination of the one or more destinations of the data packet is unable to receive the data packet; and
moving the data packet to the out-of-order packet buffer.
18. The method of claim 16 or 17, wherein the data packet that is to be dropped is the oldest data packet in the out-of-order packet buffer.
19. The method of any one of claims 14 to 18, wherein determining that a data packet in the packet processing pipeline is to be dropped is based upon the expiration of a timer.
20. The method of any one of claims 14 to 19, wherein re-inserting the dropped data packet from the dropped packet buffer into the packet processing pipeline for re-processing is based upon a load of the packet processing pipeline.
21. The method of any one of claims 14 to 20, further comprising removing a data packet stored in the dropped packet buffer in response to a software-based packet offload operation.
22. The method of any one of claims 14 to 21, wherein the packet processing pipeline is configured to process multi-cast data packets.
23. The method of any one of claims 14 to 22, wherein the dropped packet buffer is configured to store a plurality of dropped data packets.
24. The method of any one of claims 14 to 23, wherein the dropped packet buffer comprises one or more SRAM modules.
25. The method of any one of claims 14 to 24, wherein the computing chip is a neuromorphic chip.