Abstract: Provided is a polymorphic computing fabric (100) for static dataflow execution of computing kernels represented as dataflow graphs (DFGs), wherein the DFGs are realized directly in hardware. The computing fabric (100) includes a plurality of components arranged in a matrix configuration. The plurality of components includes processing elements (PEs 102-1 to 102-n) and switching elements (SEs 104-1 to 104-n). Computational operations represented as DFGs are realized on the polymorphic computing fabric (100) by mapping the nodes onto the PEs (102-1 to 102-n) and routing the edges via the circuit-switched network formed by the SEs (104-1 to 104-n) to form a virtual circuit. The polymorphic computing fabric (100) is configured to run in a configure-and-execute scheme allowing for pre-configuration of the plurality of components prior to execution.
DESC:FIELD OF THE INVENTION
The invention generally relates to hardware accelerators, essential for executing computational tasks within computing systems. Specifically, the invention relates to a polymorphic computing fabric for static dataflow execution of a variety of computation operations represented as dataflow graphs (DFGs), wherein the DFGs are directly realized in hardware.
BACKGROUND OF THE INVENTION
The ever-growing demands for accelerated computational performance have led to the emergence of specialized hardware accelerators in modern computing systems.
Hardware accelerators play a pivotal role in augmenting computational capabilities within computing systems. These specialized units are designed to optimize the execution of specific tasks, offering substantial performance enhancements over conventional processors. Their emergence has been driven by the exponential growth in data-centric applications, such as, but not limited to, artificial intelligence, machine learning, signal processing, scientific simulations, and more.
Central Processing Units (CPUs) are celebrated for their versatility, boasting general-purpose processor pipelines capable of handling a wide range of applications. However, non-compute activities of a general-purpose processor’s pipeline, such as instruction fetch and decode, branching and speculation, and scoreboarding and interlocking can incur significant power, while their inefficiencies can degrade performance for specialized computational tasks.
Application-Specific Integrated Circuits (ASICs) exhibit commendable efficiency within their tailored applications due to their custom designs. However, their lack of adaptability significantly restricts their efficacy and performance across different computational tasks, resulting in reduced performance beyond their intended scope.
Field-Programmable Gate Arrays (FPGAs) offer reconfigurability, enabling the synthesis of specialized accelerators as needed. However, the finer granularity of look-up tables (LUTs) contributes to prolonged configuration times and reduced operational frequencies, impacting overall computational performance.
Graphics Processing Units (GPUs) excel in accelerating parallel applications but suffer from excessive energy consumption due to complex control hardware, complex memory access patterns, and wide register files. The intricate warp scheduling hardware and extensive wide register banks contribute to substantial energy consumption, limiting overall efficiency.
As explained earlier, the existing accelerator hardware domains are faced with shortcomings that impede greater adaptability, power-efficiency and overall computational performance. Therefore, there is a need for a hardware accelerator architecture that overcomes these shortcomings by providing broader application coverage, enhanced power-efficiency, and improved performance characteristics.
SUMMARY OF THE INVENTION
An architecture of a hardware accelerator (a polymorphic computing fabric) is disclosed for enabling static dataflow execution of computation operations represented as dataflow graphs (DFGs), wherein the DFGs are directly realized in hardware as shown in and/or described in connection with, at least one of the figures, as set forth more completely in the claims. The polymorphic computing fabric includes a plurality of components arranged in a matrix configuration. The plurality of components includes processing elements (PEs) and switching elements (SEs). The polymorphic computing fabric is fed data via a fabric interface unit (FIF) connected to the periphery of the matrix configuration.
The computing fabric is polymorphic in its execution and programming abstraction allowing the programmers to explore the best fit between the computation operations and the fabric. Accordingly, the plurality of components (PEs and SEs) can be reconfigured as a vector-SIMD datapath that is capable of pipelined multi-cycle operations (vector processing) and concurrent independent computation on multiple data (SIMD parallelism). Similarly, the plurality of components (PEs and SEs) can be reconfigured as MIMO-dataflow, a datapath that is capable of Multiple Input Multiple Output (MIMO) sequence of operations chained as per their data dependencies. Additionally, the plurality of components (PEs and SEs) can be reconfigured as a subword-SIMD datapath capable of mixed precision computations on packed data.
Each PE includes multiplexers, configuration registers (CRs), and FIFOs designed to accumulate valid input-operands before executing operations and output-results for subsequent forwarding to their respective destinations. It incorporates functional units like Arithmetic and Logic Units (ALUs), and Floating-Point Units (FPUs) to perform the actual computation operation. The CRs are programmed to bind input-operands and output-results to specific input-sources and output-sources in the different directions (West, North, East, or South). Additionally, they enable the selection of various operations within the functional unit.
The functional units in the PEs are configured to execute at least one of a plurality of operations. The plurality of operations includes, but is not limited to, 32-bit pipelined multi-cycle arithmetic and logical operations, single-precision floating-point operations, sub-word SIMD arithmetic and logical operations for 8-bit and 16-bit datawidths, reduction trees with arithmetic and logical operations following sub-word SIMD operations, compare-and-exchange operations, and predicated accumulation. The PEs can perform operations based on input-operands received from the SEs and/or FIF and produce output-results routed through the SEs and/or FIF.
Each SE comprises a central control logic, input and output lanes, a configuration register (CR) and multiplexer at each output lane to control data routing.
The polymorphic computing fabric is configured to operate in a configuration mode and an execution mode. The CRs of the PEs and SEs are programmed in configuration mode and the computation happens in execution mode. The configuration and execution happen dynamically.
The computation function is first converted to a dataflow graph (DFG). A DFG is a graph where nodes represent the operations and edges represent dependencies between the operations. The DFG is assumed to have operations that can be directly mapped on the PEs of the fabric. Mapping the DFG on the fabric involves the placement of the nodes on the PEs and the configuration of the SEs to set up appropriate communication routes among PEs.
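By way of illustration only (Python pseudocode, not part of the claimed hardware, with hypothetical node names), a DFG and the level-by-level grouping used when placing nodes onto PEs may be modeled as:

```python
# Minimal illustrative DFG: nodes are operations, edges are data
# dependencies carrying intermediate results between operations.
dfg = {
    "nodes": {"n0": "add", "n1": "add", "n2": "mul"},
    "edges": [("n0", "n2"), ("n1", "n2")],
}

def topological_levels(graph):
    """Group nodes by dependency depth; placement assigns each level
    of the DFG to PEs so that edges can be routed through SEs."""
    indegree = {n: 0 for n in graph["nodes"]}
    for _, dst in graph["edges"]:
        indegree[dst] += 1
    levels = []
    ready = sorted(n for n, d in indegree.items() if d == 0)
    while ready:
        levels.append(ready)
        nxt = []
        for src, dst in graph["edges"]:
            if src in ready:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    nxt.append(dst)
        ready = sorted(nxt)
    return levels
```

Here the two addition nodes form the first level and the multiplication node, which depends on both, forms the second.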
The computations to be performed by PEs and the path for the intermediate data is defined by the contents of the CRs. The execution mode of the polymorphic computing fabric facilitates the PEs to perform the configured operations on operands and SEs to form a virtual circuit between the PEs. The flow control in the polymorphic computing fabric is established with a ready-valid handshake mechanism.
By natively realizing the DFG, the polymorphic computing fabric employs local FIFOs and switching elements to communicate intermediate results, instead of communicating through large and wide register banks as done in CPUs and GPUs respectively.
These and other features and advantages of the present invention may be appreciated from a review of the following detailed description of the present invention, along with the accompanying figures in which like reference numerals refer to like parts throughout.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram that illustrates an architecture of the polymorphic computing fabric in accordance with an exemplary embodiment of the invention.
FIG. 2 is a diagram that illustrates a Switching Element (SE) interface in accordance with an exemplary embodiment of the invention.
FIG. 3 is a diagram that illustrates a visual of an internal-Processing Element (PE) and internal-Switching Element (SE) in accordance with an exemplary embodiment of the invention.
FIG. 4 is a diagram that illustrates a visual of a corner-Processing Element (PE) and peripheral-Switching Element (SE) in accordance with an exemplary embodiment of the invention.
FIG. 5 is a diagram that illustrates a realization of a DFG on the polymorphic computing fabric on a subset of PEs and SEs in accordance with an exemplary embodiment of the invention.
FIG. 6 is a diagram that illustrates the architecture of the polymorphic computing fabric configured as vector-SIMD datapath in accordance with an exemplary embodiment of the invention.
FIG. 7 is a diagram that illustrates the architecture of the polymorphic computing fabric configured as MIMO-dataflow datapath in accordance with an exemplary embodiment of the invention.
FIG. 8 is a diagram that illustrates the architecture of the computing fabric configured as subword-SIMD datapath in accordance with an exemplary embodiment of the invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The following described implementations may be found in the disclosed architecture of a hardware accelerator (a polymorphic computing fabric) for facilitating static dataflow execution of computation operations represented as dataflow graphs (DFGs), wherein the DFGs are realized directly in hardware.
FIG. 1 is a diagram that illustrates an architecture of the polymorphic computing fabric in accordance with an exemplary embodiment of the invention. Referring to FIG. 1, there is a polymorphic computing fabric 100, which includes a plurality of components such as processing elements (PE) 102-1 to 102-n, switching elements (SE) 104-1 to 104-n, lanes 108-1 to 108-n, buffered lanes 110-1 to 110-n and qPE lanes 112-1 to 112-n. As shown in FIG. 1, the polymorphic computing fabric 100 is connected via a fabric interface unit (FIF) 106.
The plurality of components is arranged in a matrix configuration. In an embodiment, the PEs 102-1 to 102-n and the SEs 104-1 to 104-n are arranged in a two-dimensional array in an alternating fashion.
In accordance with an exemplary embodiment, FIG. 1 shows a 6-by-7 architecture of the polymorphic computing fabric 100. This implementation of the polymorphic computing fabric 100 includes 21 PEs and 21 SEs. In accordance with an embodiment, for the polymorphic computing fabric 100 with dimensions of M-by-N and MN elements, there are ⌈MN/2⌉ PEs and ⌊MN/2⌋ SEs.
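The PE/SE split for the alternating arrangement can be sketched as follows (illustrative Python, not part of the disclosure):

```python
import math

def element_counts(m: int, n: int):
    """PE/SE split for an M-by-N fabric of M*N elements arranged in an
    alternating (checkerboard) pattern: ceil(MN/2) PEs, floor(MN/2) SEs."""
    total = m * n
    return math.ceil(total / 2), total // 2

# The 6-by-7 fabric of FIG. 1 has 42 elements: 21 PEs and 21 SEs.
pes, ses = element_counts(6, 7)
```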
As illustrated in FIG. 1, the PEs 102-1 to 102-n and the SEs 104-1 to 104-n are categorized as internal, quasi-peripheral, peripheral or corner elements based on their location in the two-dimensional array with respect to FIF 106. The SEs differ in the number of input and output lanes, while the PEs differ in terms of the sources from which their input-operands are multiplexed. This is further discussed in conjunction with FIG. 3 that provides visuals of an internal-PE, internal-SE, and the lanes between them, and FIG. 4 that provides visuals of a corner-PE, peripheral-SE, and the connections between them and FIF.
In an embodiment, each PE of the PEs 102-1 to 102-n is configured to consume a maximum of three input-operands, execute one of several operations for which the polymorphic computing fabric 100 is configured, and produce a maximum of two output-results.
In accordance with various embodiments of the invention, PEs 102-1 to 102-n are further configured to execute at least one of a plurality of operations, such as, but not limited to 32-bit pipelined multi-cycle arithmetic and logical operations, single-precision floating point operations, subword-SIMD arithmetic and logical operations for 8-bit and 16-bit data width, reduction trees with arithmetic and logical operations following subword-SIMD operations, compare and exchange operations, and predicated accumulation.
The PEs 102-1 to 102-n are configured to perform operations on input-operands received from the FIF 106/SEs 104-1 to 104-n and produce output-results routed through the SEs to other PEs/FIF 106.
In an embodiment, constants as input-operands can be supplied to the PEs 102-1 to 102-n at configuration time, instead of at execution time.
In another embodiment, the polymorphic computing fabric 100 allows configuration to support a third input-operand and predication support. The third input-operand is used for multiple purposes. For delay-balancing edges of the DFG, which is done to maximize throughput, the third input-operand can be used for a passthrough via a PE (that is not contributing to an operation), adjusted to match the pipeline depth of another node at the same level of the graph. This use-case is explained in conjunction with FIG. 8.
In an embodiment, the third input operand data can be augmented with predicate bits. Predication support at each PE allows for local data-dependent control flow. For example, a PE configured to perform accumulation can be reset and set with predicate controls passed with data. The output result can be configured to be reused as recurrent input using the third input-operand.
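By way of illustration only (Python pseudocode with hypothetical field names, not part of the claimed hardware), predicated accumulation at a PE may be modeled as:

```python
def predicated_accumulate(stream):
    """Illustrative model of a PE configured for predicated accumulation.
    Each item is (value, set_pred, reset_pred); the predicate bits travel
    with the data, and the running result is fed back as the recurrent
    third input-operand."""
    acc = 0
    outputs = []
    for value, set_pred, reset_pred in stream:
        if reset_pred:
            acc = 0          # predicate-controlled reset of the accumulator
        if set_pred:
            acc += value     # accumulate only while the predicate is set
        outputs.append(acc)
    return outputs
```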
Each PE of the PEs 102-1 to 102-n includes a configuration register that stores the required configuration metadata for: operation, input-operand sources, and output-result destinations. Depending on the position of a PE in the polymorphic computing fabric 100 with regards to the FIF 106, the input-operands for the PE are multiplexed from different sources.
In accordance with an embodiment, for each PE, input-operands are buffered before the operation and, similarly, output-results are buffered before forwarding them to any of the neighboring SEs/FIF 106. Further, input-operands are dequeued only if all of them for a given operation are present/available and there is space in all the configured output-result buffers.
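The firing condition described above may be modeled as follows (illustrative Python sketch, not part of the disclosure):

```python
from collections import deque

def pe_can_fire(input_fifos, output_fifos, capacity):
    """A PE dequeues its input-operands only when (i) every operand
    required by the configured operation is present in its input FIFOs
    and (ii) every configured output-result FIFO has free space."""
    operands_ready = all(len(f) > 0 for f in input_fifos)
    outputs_free = all(len(f) < capacity for f in output_fifos)
    return operands_ready and outputs_free
```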
In an embodiment, the FIF 106 functions as a staging area where data is arranged in a specific pattern before entering the polymorphic computing fabric 100. Similarly, the FIF 106 handles output data from the polymorphic computing fabric 100 before writing it back to the source.
In another embodiment, the FIF 106 is an interface that enables the polymorphic computing fabric 100 to be accessible from a Compute Element (CE) or processor core, wherein the polymorphic computing fabric 100 serves as a co-processor and respects the execution model in accordance with the methodologies outlined in Indian Patent 400171.
FIG. 2 is a diagram that illustrates a Switching Element (SE) interface in accordance with an exemplary embodiment of the invention. Referring to FIG. 2, a SE 202 includes a central control logic 204, input lanes 206, output lanes 208, a configuration register (CR) 210 and multiplexers 212 at all output lanes.
The central control logic 204 is responsible for implementing the handshake mechanism of flow-control at all output lanes of the SE. The CR 210 at each output lane is programmed with the enable and select signal for the multiplexer 212 to route the correct incoming data on the output lane 208.
In an embodiment, the polymorphic computing fabric 100 is operated in a configure-and-execute scheme similar to FPGAs. Accordingly, the components of the computing fabric 100 are configured before execution begins and do not change during the execution.
In a circuit-switched network, dedicated paths are set up prior to communication. In polymorphic computing fabric 100, the configuration of SEs 104-1 to 104-n establishes the circuit-switched network, which routes data to internal PEs for computation.
In accordance with an embodiment of the invention, the corner SEs, peripheral SEs at the computing fabric 100 periphery (boundaries) and the internal SEs only differ in terms of the number of input and output lanes. For instance, an internal SE has eight input and output lanes connected to PEs and SEs in all the directions around it.
As referred to in FIG. 2, the connections through an SE to one or more neighboring SEs are buffered with FIFOs. On the other hand, the connection to the one or more neighboring PEs is entirely combinational as their inputs and outputs are already buffered with FIFOs.
The central control logic 204 ensures that data is only forwarded from input lanes 206 when all destination lanes are ready to receive. In an embodiment, the output lanes 208 can be disabled, preventing them from forwarding data from the input lanes 206.
The CR 210 determines routing between the input lanes 206 and the output lanes 208, and the SE 202 prevents data from being sent back in the same direction from where it arrived.
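By way of illustration only (Python pseudocode, not part of the claimed hardware), the CR contents for one SE output lane and the no-U-turn rule may be modeled as:

```python
def configure_output_lane(output_dir, select_dir, enabled=True):
    """Illustrative CR contents for one SE output lane: an enable bit
    plus the multiplexer select naming the input lane that feeds the
    output. Data is never sent back in the direction it arrived from,
    so selecting the input lane of the output's own direction is
    rejected."""
    if enabled and select_dir == output_dir:
        raise ValueError("U-turn routing is not permitted")
    return {"enable": enabled, "select": select_dir}
```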
FIG. 3 is a diagram that illustrates a visual of an internal Processing Element (PE) and internal Switching Element (SE) in accordance with an exemplary embodiment of the invention. Referring to FIG. 3, an internal SE 302 has four neighboring SEs, and four neighboring PEs. The connections to other SEs are via buffered lanes 306-1 to 306-n, while connections to PEs are via lanes 308-1 to 308-n. An internal PE 304 has four neighboring SEs and connections to them are via the lanes 308-1 to 308-n.
FIG. 4 is a diagram that illustrates a visual of a corner-Processing Element and peripheral-Switching Element (SE) in accordance with an exemplary embodiment of the invention. Referring to FIG. 4, the peripheral SE 402 has two neighboring SEs connected via buffered lanes 406-1 to 406-n. Furthermore, the peripheral SE 402 is left with three input and two output FIF IO 408-1 to 408-n connections to be connected to FIF 106. The connections to corner-PE 404 and peripheral-PE (not shown in FIG.4) are via lanes 410-1 to 410-n.
In accordance with an embodiment, the FIF 106 inputs and outputs can be directly accessed, without additional buffers in the path, only from the peripheral PEs and quasi-peripheral PEs (qPEs). Each qPE is configured with two additional input and output lanes (qPE lanes), which are connected to the additional output and input lanes of its neighboring peripheral SE, as shown by OS8 and IS8, and OS9 and IS9 in FIG. 4. These lanes are in turn directly connected to the FIF IO, as shown by IS1 and OS1, IS2 and OS2, and IS3.
In an embodiment, the corner SEs are not connected with the FIF 106 as the corner SEs are restricted to only handling connections between their three neighboring elements.
The polymorphic computing fabric 100 as disclosed in the invention is configured to be operated in either a configuration mode or an execution mode.
In accordance with an embodiment of the invention, to operate the computing fabric 100 in the configuration mode, the data in respective CRs of PEs 102-1 to 102-n and SEs 104-1 to 104-n is changed.
On the other hand, in the execution mode, the PEs 102-1 to 102-n perform the configured operations on the operands, while the SEs 104-1 to 104-n form the virtual circuit between the PEs 102-1 to 102-n. The computations to be performed by the PEs 102-1 to 102-n and the path for the intermediate data are defined by the contents of the CRs.
The flow control in the polymorphic computing fabric 100 is established with a ready-valid handshake mechanism. In an embodiment, the handshake mechanism is implemented using FIFOs that ensures that the data is not overwritten until it is consumed.
The handshake mechanism provides flexibility to the polymorphic computing fabric 100 to realize DFGs with unbalanced internal edges at the cost of reduced throughput. This further supports the dataflow execution model of the computing fabric 100.
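The ready-valid handshake with FIFO buffering may be modeled as follows (illustrative Python sketch of a one-entry lane, not part of the disclosure):

```python
class ReadyValidChannel:
    """Illustrative one-entry buffered lane with ready-valid flow
    control: the producer may push only while the slot is empty
    (ready), so data is never overwritten before it is consumed."""

    def __init__(self):
        self.slot = None

    @property
    def ready(self):
        return self.slot is None

    def push(self, data):
        assert self.ready, "backpressure: consumer has not dequeued yet"
        self.slot = data

    def pop(self):
        data, self.slot = self.slot, None
        return data
```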
In accordance with an embodiment of the invention, the lanes of the polymorphic computing fabric 100 are configured to carry configuration metadata during the configuration phase and computation data during the execution phase. There are no separate configuration lanes; therefore, tag bits are used to differentiate the configuration metadata from the execution data. Each element holds its own ID (a coordinate-based address) fixed at design time, and the configuration metadata includes matching ID bits for identifying individual SEs and PEs. When an SE/PE receives a configuration word, the SE/PE either writes the configuration word to its own CR or forwards it to its neighbor on the South port. In the configuration phase, the configuration metadata augmented with ID bits is supplied from the FIF to the topmost PEs and SEs. On every cycle, each element either registers the received data in its configuration register, if it is addressed to it, or forwards it to the element below. Eventually, after a few cycles, all elements are configured.
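The latch-or-forward behavior along one column during the configuration phase may be modeled as follows (illustrative Python with hypothetical element IDs, not part of the claimed hardware):

```python
def configure_column(element_ids, config_words):
    """Illustrative model of the configuration phase along one column:
    each word enters at the top (from the FIF) and travels South until
    the element whose ID matches latches it into its CR."""
    crs = {eid: None for eid in element_ids}
    for target_id, word in config_words:
        for eid in element_ids:      # top-to-bottom traversal of the column
            if eid == target_id:
                crs[eid] = word      # addressed to this element: latch it
                break                # otherwise the word is forwarded below
    return crs
```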
Further, the lanes of the polymorphic computing fabric 100 are configured with a width equal to the wordlength plus a few additional bits that are used as tag bits and other configuration metadata and/or as predication bits by the PEs 102-1 to 102-n.
FIG. 5 is a diagram that illustrates a realization of a DFG on a polymorphic computing fabric on a subset of PEs and SEs in accordance with an exemplary embodiment of the invention.
In accordance with an exemplary embodiment of the invention, the computation function is first converted to a dataflow graph (DFG). A DFG is a graph where nodes represent the operations and edges represent dependencies between the operations. The DFG is assumed to have operations which can be directly mapped on all or subset of the PEs 102-1 to 102-n of the polymorphic computing fabric 100.
Mapping the DFG on the polymorphic computing fabric 100 involves placement of the nodes on all or subset of the PEs 102-1 to 102-n and configuration of all or the subset of the SEs 104-1 to 104-n to setup appropriate communication routes among PEs. The computations to be performed by PEs 102-1 to 102-n and the path for the intermediate data is defined by the contents of the CRs. The execution mode of the polymorphic computing fabric facilitates the PEs to perform the configured operations on operands and SEs to form a virtual circuit between the PEs. The flow control in the polymorphic computing fabric is established with a ready-valid handshake mechanism.
In the configuration phase, the configuration metadata augmented with ID bits (or coordinate-based addresses) is supplied from the FIF 106 to the topmost PEs and SEs. On every cycle, each element either registers the received data in its CR if it’s addressed to it or forwards it to the element below. Eventually, after a few cycles all elements are configured.
Each PE's CR holds the following information: which input-source to receive each input-operand from, what operation to perform, and which output-source to forward each output-result to.
Referring to FIG. 5, a sample DFG 502 presented on the right is mapped onto the polymorphic computing fabric 100. A PE 506-2 and a PE 506-3 are programmed to receive input-operands from the FIF 106 on their North FIF IO 510- (1,3) and West FIF IO 510- (2,4), perform addition operations, and send their output-results on the South output lanes 512- (1,2). PE 506-6 is programmed to receive input-operands from its neighboring SEs 508- (5, 6) on West input lane 512-3 and East input lane 512-4, perform a multiplication operation, and send its output-result on the North output lane 512-5, which finally exits as the out signal via an SE 508-2 to the FIF via FIF IO 510-5.
The SE's configuration register holds the following information: which input-source to forward at each output lane. The East output lane on SE 508-5 and the West output lane on SE 508-6 are connected to the input-source from their North input lanes. Therefore, the intermediate output-results from PE 506-2 and PE 506-3 are routed to the West and East input lanes of PE 506-6.
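The end-to-end computation realized by this mapping may be stated as follows (illustrative Python with hypothetical operand names, not part of the claimed hardware):

```python
def sample_dfg(a0, b0, a1, b1):
    """Reference computation of the sample DFG 502 of FIG. 5: two
    addition nodes feed a multiplication node, so the fabric produces
    (a0 + b0) * (a1 + b1)."""
    s0 = a0 + b0    # addition node on PE 506-2
    s1 = a1 + b1    # addition node on PE 506-3
    return s0 * s1  # multiplication node on PE 506-6
```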
All unused PEs and unused paths in SEs, remain disabled. When the PE is unused, it is in a disabled state, i.e., it doesn’t perform any operation, and all its buffers remain disabled. When an output lane is disabled in an SE, there's no valid data exiting the lane.
In the execution phase, the data is driven through the PEs and SEs using the ready-valid handshake mechanism. The enabled PEs queue valid input-operands from their respective configured sources and perform the configured operation only when all required input-operands are present in their buffers and there is space in the output-result queue. The enabled output lanes in SEs forward valid data received from their respective configured input-sources. Data is dequeued from the internal buffers only when the receiving element is ready, i.e., has space in its internal buffer.
FIG. 6 is a diagram that illustrates the architecture of a polymorphic computing fabric 600 configured as vector-SIMD datapath in accordance with an exemplary embodiment of the invention. Referring to FIG. 6, a commonly found DFG 602 with pipelined operations in neural networks is suitable for mapping on the polymorphic computing fabric 600 configured in the vector-SIMD configuration. As illustrated, seven instances of the DFG 602 on the right are mapped on the 6 by 7 polymorphic computing fabric 600. The inputs (a, b) and output (c) of the DFG 602 are indexed with their instance IDs.
FIG. 7 is a diagram that illustrates the architecture of a polymorphic computing fabric 700 configured as MIMO-dataflow datapath in accordance with an exemplary embodiment of the invention. Referring to FIG. 7, a radix-2 FFT butterfly DFG 702 with levels L1, L2, L3 on the right is mapped on the polymorphic computing fabric 700 which is configured as a MIMO-dataflow datapath capable of MIMO sequence of operations chained as per their data dependencies. The polymorphic computing fabric 700 as illustrated in FIG. 7 is a 6 by 7 fabric. x0, y0, x1, y1 are the inputs, u0, w0, u1, w1 are the twiddle factors and a0, b0, and a1, b1 are the outputs of the radix-2 FFT butterfly DFG 702.
The multiplication nodes at L1 are mapped on to PE 704 - (2, 3, 8, 11, 15, 18, 19 and 20); the addition nodes at L2 are mapped on to PE 704- (14 and 16); the subtraction nodes at L2 are mapped on to PE 704- (6 and 12); the addition nodes at L3 are mapped on to PE 704- (9 and 17); the subtraction nodes at L3 are mapped on to PE 704- (13 and 5).
The inputs and twiddle factors are supplied along the FIF IO to PEs at L1. The outputs from PEs at L1 are routed via SE 706- (16 and 19) to PE 704-16; SE 706- (10 and 7) to PE 704-14; SE 706- (8 and 15) to PE 704-12; SE 706- (2 and 6) to PE 704-6. The outputs from PEs at L2 are routed via SE 706- (12 and 13) to PE 704-13; SE 706- (16 and 17) to PE 704-17; SE 706- (5 and 8) to PE 704-5; SE 706- (9 and 12) to PE 704-9. The final outputs are routed to FIF IO: a0 from PE 704-5 via SE 706- (8 followed by 4); b0 from PE 704-17 via SE 706- (17 followed by 20); a1 from PE 704-9 via SE 706- (5 followed by 1); b1 from PE 704-13 via SE 706- (16 followed by 19).
FIG. 8 is a diagram that illustrates the architecture of a polymorphic computing fabric 800 configured as a subword-SIMD in accordance with an exemplary embodiment of the invention. The polymorphic computing fabric 800 as disclosed herein is capable of performing SIMD execution on subwords of different datatypes, such as, but not limited to, int8, and int16.
Referring to FIG. 8, three instances of 3x3 depthwise convolution of MobileNet DFG 802 are mapped on the 6 by 7 polymorphic computing fabric 800. L1, L2, L3 and L4 are the levels of the DFG 802. At L1, inputs are 8-bit packed-SIMD, and outputs are 16-bits. At L2 and L3, the inputs and outputs to each node are kept at 16-bits. At L4, the output is quantized to 8-bits.
In accordance with an embodiment of the invention, the subword datatype operations are further augmented with arithmetic/logical reduction stages. In this example, the nodes of L1 of the three instances of DFG 802, mapped to PE 804- (19, 20 and 15), PE 804- (1, 2 and 8), and PE 804- (18, 21 and 4), respectively, are configured to perform four int8 multiplications followed by a 4-to-1 reduction. The output datatype of the reduction stage is quantized to 16 bits. The addition nodes of L2 of the three instances of DFG 802 are mapped to PE 804-16, PE 804-5, and PE 804-14, respectively. The addition nodes of L3 of the three instances of DFG 802 are mapped to PE 804-13, PE 804-6, and PE 804-10, respectively. Finally, the ReLU nodes of L4 of the three instances of DFG 802 are mapped to PE 804-17, PE 804-3, and PE 804-11, respectively.
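The L1 multiply-and-reduce node may be modeled as follows (illustrative Python, not part of the claimed hardware; saturation is one possible quantization to 16 bits):

```python
def subword_mac(inputs, weights):
    """Illustrative model of one L1 node: four packed int8
    multiplications followed by a 4-to-1 additive reduction, with the
    result saturated to the signed 16-bit range."""
    assert len(inputs) == len(weights) == 4
    acc = sum(x * w for x, w in zip(inputs, weights))
    return max(-32768, min(32767, acc))  # saturate to int16
```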
The 8-bit packed SIMD inputs of the three instances of the DFG 802 are supplied along FIF IO to PEs at L1. The outputs from nodes at L1 of the three instances of the DFG 802 are routed via SE 806- (15 and 19) to PE 804-16, SE 806- (4 and 5) to PE 804-5, and SE 806- (14 and 17) to PE 804-14 respectively. The output from addition node at L2 of the three instances of the DFG 802 are routed via SE 806-16 to PE 804-13, SE 806-5 to PE 804-6, and SE 806-13 to PE 804-10 respectively. The output from addition node at L3 of the three instances of the DFG 802 are routed via SE 806-13 to PE 804-17, SE 806-6 to PE 804-3, and SE 806-10 to PE 804-11 respectively.
For delay-balancing the DFG edges, they can be routed via (i) additional SEs, or (ii) as passthrough operands through PEs. In this example, PE 804- (12, 9 and 7) of the three instances of the DFG 802 respectively, are configured as Passthroughs and used for delay balancing an edge from node L1 to L3. The edge to Passthrough node in L2 from L1 in the three instances of the DFG 802 are routed via SE 806-15 to PE 804-12, SE 806-8 to PE 804-9, and SE 806-7 to PE 804-7 respectively. The edge from Passthrough node from L2 to L3 in the three instances of the DFG 802 are routed via SE 806-12 to PE 804-13, SE 806-9 to PE 804-6, and SE 806-6 to PE 804-10 respectively.
The present invention is advantageous in that it provides a polymorphic computing fabric that is capable of static dataflow execution of computation operations represented as dataflow graphs (DFGs).
The polymorphic computing fabric provided by the invention is designed to be capable of coarse-grained reconfiguration, which enables the computing fabric to avoid the inefficiencies and power overheads incurred from non-compute activities of a general-purpose processor’s pipeline, such as instruction fetch and decode, branching and speculation, and scoreboarding and interlocking.
The computing fabric disclosed by the invention enables the data path to be reconfigured to match the requirements of the computation kernels. Additionally, the computing fabric’s generic ALU functional unit within PEs can be constrained to repurpose as a Domain Specific Accelerator (DSA). This provides an enormous energy savings opportunity to the computing fabric of the invention.
The computing fabric as disclosed herein is designed as coarse-grained reconfigurable, polymorphic accelerator, that overcomes the reconfigurable overheads of existing accelerators and operates at a high frequency thereby providing performance close to specialized accelerators.
The computing fabric as disclosed herein is energy efficient as compared to the existing hardware accelerators. The computing fabric’s approach to maximizing hardware utilization is via exploiting the polymorphic capabilities, unlike the complex warp-scheduling hardware in GPUs. Further, the computing fabric optimizes memory usage by using local FIFOs and switching elements to communicate intermediate results, instead of the large/wide register banks used in GPUs.
Those skilled in the art will realize that the above recognized advantages and other advantages described herein are merely exemplary and are not meant to be a complete rendering of all of the advantages of the various embodiments of the present invention.
The present invention may be realized in hardware, or in a combination of hardware and software. The present invention may be realized in a centralized fashion, in at least one computer system, or in a distributed fashion, where different elements may be spread across several interconnected computer systems. A computer system or other apparatus/device adapted to carry out the methods described herein may be suitable. A combination of hardware and software may be a general-purpose computer system with a computer program that, when loaded and executed on the computer system, controls the computer system such that it carries out the methods described herein. The present invention may be realized in hardware that comprises a portion of an integrated circuit that also performs other functions. The present invention may also be realized as firmware that forms part of the media rendering device.
While the present invention is described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted, without departing from the scope of the present invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the present invention include all embodiments that fall within the scope of the appended claims.
CLAIMS:
1. A polymorphic computing fabric (100) as a hardware accelerator, comprising:
a plurality of components arranged in a matrix configuration, the plurality of components including processing elements (PEs) (102-1 to 102-n) and switching elements (SEs) (104-1 to 104-n) interconnected by bi-directional lanes,
wherein, the polymorphic computing fabric (100) facilitates static dataflow execution of a variety of computation operations represented as Data Flow Graphs (DFGs) directly realized on all or a subset of PEs (102-1 to 102-n) and SEs (104-1 to 104-n),
wherein, the polymorphic computing fabric (100) is dynamically reconfigurable through a configure-and-execute process.
2. The computing fabric of claim 1, wherein the PEs (102-1 to 102-n) and the SEs (104-1 to 104-n) are arranged in a two-dimensional array in an alternating fashion.
3. The polymorphic computing fabric (100) of claim 1, wherein the polymorphic computing fabric (100) is reconfigurable as at least one of:
a vector-SIMD datapath capable of pipelined multi-cycle operations (vector processing) and concurrent independent computation on multiple data (SIMD parallelism);
a MIMO datapath capable of executing a MIMO sequence of operations chained as per their data dependencies; and
a subword-SIMD datapath capable of mixed precision computations on packed data.
4. The polymorphic computing fabric (100) of claim 1, wherein the PEs (102-1 to 102-n) are further configured to execute at least one of a plurality of operations, the plurality of operations comprising 32-bit pipelined multi-cycle arithmetic and logical operations, single-precision floating-point operations, subword-SIMD arithmetic and logical operations for 8-bit and 16-bit data widths, reduction trees with arithmetic and logical operations following subword-SIMD operations, compare-and-exchange operations, and predicated accumulation.
5. The polymorphic computing fabric (100) of claim 1, wherein the PEs (102-1 to 102-n) support a third input operand for multiple purposes including delay balancing and passing predicate controls with data.
6. The polymorphic computing fabric (100) of claim 1, wherein each SE comprises:
a central control logic (204);
input (206) and output lanes (208);
a configuration register (CR) (210); and
multiplexers (212) at the output lanes to control data routing.
7. The polymorphic computing fabric (100) of claim 6, wherein the polymorphic computing fabric (100) is further configured to operate in one of a configuration mode and an execution mode, wherein the CRs (210) are programmed in the configuration mode and computation happens in the execution mode, and wherein the programming of the CRs (210) happens dynamically.
8. The polymorphic computing fabric (100) of claim 7, wherein computation operations represented as DFGs are directly realized on all or a subset of PEs (102-1 to 102-n) and SEs (104-1 to 104-n) by mapping a plurality of nodes of the DFGs onto the PEs (102-1 to 102-n) and routing their edges (inputs and outputs) via the SEs (104-1 to 104-n) to form a virtual circuit.
9. The polymorphic computing fabric (100) of claim 8, wherein the execution mode of the polymorphic computing fabric (100) facilitates the PEs (102-1 to 102-n) to perform the configured operations on operands and SEs (104-1 to 104-n) to form a virtual circuit between the PEs, wherein, the computations to be performed by the PEs (102-1 to 102-n) and the path for the intermediate data is defined by the contents of the CRs (210), wherein the flow control in the polymorphic computing fabric (100) is established with a ready-valid handshake mechanism.
10. The polymorphic computing fabric (100) of claim 1, wherein the bi-directional lanes carry one of configuration data and computational data at a given instance.
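The ready-valid flow control recited in claim 9 can be illustrated behaviourally. The following is a minimal software sketch under assumed names (`Fifo`, `pe_fire`) and an assumed FIFO depth, not the hardware design: a PE fires only when every input FIFO holds a valid operand and the downstream FIFO has space, and otherwise stalls without losing data.

```python
# Behavioural sketch of ready-valid handshaking between PEs.
# Assumption: FIFO depth and the names Fifo/pe_fire are illustrative.
from collections import deque

class Fifo:
    def __init__(self, depth=2):
        self.q, self.depth = deque(), depth
    def valid(self):
        # Producer side has data available for the consumer.
        return len(self.q) > 0
    def ready(self):
        # Consumer side can accept more data.
        return len(self.q) < self.depth

def pe_fire(op, ins, out):
    """One execution-mode cycle of a PE: fire iff all inputs are valid
    and the output is ready; otherwise stall (return False)."""
    if all(f.valid() for f in ins) and out.ready():
        out.q.append(op(*(f.q.popleft() for f in ins)))
        return True
    return False
```

For example, with operands 2 and 3 enqueued on two input FIFOs, an add PE fires once and produces 5 on the output FIFO; a second firing attempt stalls because the inputs have been drained, which is the stall behaviour the handshake provides.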
| # | Name | Date |
|---|---|---|
| 1 | 202241060131-PROVISIONAL SPECIFICATION [20-10-2022(online)].pdf | 2022-10-20 |
| 2 | 202241060131-POWER OF AUTHORITY [20-10-2022(online)].pdf | 2022-10-20 |
| 3 | 202241060131-FORM FOR SMALL ENTITY(FORM-28) [20-10-2022(online)].pdf | 2022-10-20 |
| 4 | 202241060131-FORM FOR SMALL ENTITY [20-10-2022(online)].pdf | 2022-10-20 |
| 5 | 202241060131-FORM 1 [20-10-2022(online)].pdf | 2022-10-20 |
| 6 | 202241060131-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [20-10-2022(online)].pdf | 2022-10-20 |
| 7 | 202241060131-EVIDENCE FOR REGISTRATION UNDER SSI [20-10-2022(online)].pdf | 2022-10-20 |
| 8 | 202241060131-DRAWINGS [20-10-2022(online)].pdf | 2022-10-20 |
| 9 | 202241060131-DECLARATION OF INVENTORSHIP (FORM 5) [20-10-2022(online)].pdf | 2022-10-20 |
| 10 | 202241060131-PostDating-(16-10-2023)-(E-6-356-2023-CHE).pdf | 2023-10-16 |
| 11 | 202241060131-APPLICATIONFORPOSTDATING [16-10-2023(online)].pdf | 2023-10-16 |
| 12 | 202241060131-Further evidence [26-10-2023(online)].pdf | 2023-10-26 |
| 13 | 202241060131-Annexure [26-10-2023(online)].pdf | 2023-10-26 |
| 14 | 202241060131-DRAWING [19-12-2023(online)].pdf | 2023-12-19 |
| 15 | 202241060131-CORRESPONDENCE-OTHERS [19-12-2023(online)].pdf | 2023-12-19 |
| 16 | 202241060131-COMPLETE SPECIFICATION [19-12-2023(online)].pdf | 2023-12-19 |
| 17 | 202241060131-FORM28 [04-01-2024(online)].pdf | 2024-01-04 |
| 18 | 202241060131-Covering Letter [04-01-2024(online)].pdf | 2024-01-04 |
| 19 | 202241060131-FORM 18 [22-01-2024(online)].pdf | 2024-01-22 |
| 20 | 202241060131-FORM 3 [19-03-2024(online)].pdf | 2024-03-19 |
| 21 | 202241060131-FER.pdf | 2025-07-31 |
| 1 | 202241060131_SearchStrategyNew_E_202241060131E_02-03-2025.pdf | |