
Optimized Allocation Of Tasks In Heterogeneous Computing Systems

Abstract: A method and system for automatically optimizing task allocation in heterogeneous computing systems is disclosed. The system (100) comprises a plurality of target processing elements (102) and a host processor (104). The host processor (104) is configured to receive one or more requests from one or more applications for task allocation. During compilation, a virtualizer (106) extracts parameters of kernels of the one or more applications and receives the architectures of the plurality of target processing elements (102). The virtualizer (106) comprises a device conformability module (108) and a mapping module (110). The device conformability module (108) provides a prediction on execution time of the kernels for each of the architectures based on the parameters. The mapping module (110) compares the predictions to derive a mapping prediction, which indicates a ranking of the plurality of target processing elements (102) based on least execution time for each of the kernels, and determines a combination of the plurality of target processing elements (102) based on the mapping prediction to optimize the task allocation.


Patent Information

Application #: 201741029762
Filing Date: 22 August 2017
Publication Number: 44/2018
Publication Type: INA
Invention Field: PHYSICS
Email: indiafiling@deeptech-ip.com
Grant Date: 2023-11-30

Applicants

AMRITA VISHWA VIDYAPEETHAM, Kasavanahalli, Carmelaram P.O., Bangalore, Karnataka, India

Inventors

1. Madhura PURNAPRAJNA
A601, Sreeda Pride Kasavanahalli Bengaluru, Karnataka 560 035
2. Vanishree KATTISHETTI
#469, 6th Cross, K.P.C. Layout Kasavanahalli Bengaluru, Karnataka 560 035

Specification

OPTIMIZED ALLOCATION OF TASKS IN HETEROGENEOUS COMPUTING
SYSTEMS
RELATED APPLICATION
[0001] None.
BACKGROUND
[0002] The present disclosure generally relates to heterogeneous computing systems.
Heterogeneous computing systems (HCS) include several processing devices or accelerators, which have diverse micro-architectures and different programming platforms. These devices work collaboratively on different tasks of an application, especially highly compute-intensive and data parallel tasks, in order to achieve higher system performance overall.
[0003] However, this diversity is accompanied by a serious challenge: selecting
suitable micro-architectures for the different tasks of any given application. The challenge is aggravated by the diverse programming platforms with which each device is associated. For instance, GPU applications are written in languages such as CUDA and OpenCL, while applications for FPGAs are written in Hardware Description Languages such as Verilog, VHDL, and SystemVerilog.
[0004] One approach known in the art to overcome these challenges is the use
of a unified programming framework, such as OpenCL. However, even with such a framework, the programmer bears the additional overhead of specifying the workload assignment. For instance, determining the appropriate device for different portions of the application code is left to the programmer, who may be unaware of the architecture-level details of the devices. There exists no automated tool that provides an optimized work/task assignment by dynamically distributing tasks among the processing devices.

SUMMARY
[0005] The above-mentioned problems are solved and overcome by automatically
optimizing task allocation in heterogeneous computing systems (HCS).
[0006] The present subject matter relates to systems and methods to automatically
optimize allocation of tasks in heterogeneous computing systems. The system includes a plurality of target processing elements, a virtualizer, and a host processor. Each of the plurality of target processing elements may have a distinct architecture and may include processors with a known instruction set architecture or with a compiler available. The virtualizer is configured to receive one or more requests from one or more applications for the allocation of tasks. The virtualizer is further configured to extract parameters of kernels of the one or more applications. Further, the virtualizer receives the architectures of the plurality of target processing elements.
[0007] The virtualizer may include a device conformability module to provide
predictions on execution time of the kernels of the one or more applications for each of the architectures based on the parameters. The virtualizer further comprises a mapping module to compare the predictions to derive a mapping prediction. The mapping prediction indicates a ranking of the plurality of target processing elements based on least execution time for each of the kernels, and a combination of the plurality of target processing elements is determined based on the mapping prediction. The tasks are allocated to the combination of the plurality of target processing elements and, therefore, an optimized and automated task allocation is provided.
[0008] In one embodiment, the system further comprises a compiler to transform a
source code of a portion of the one or more applications to machine code based on the mapping prediction.

[0009] In one embodiment, the mapping prediction indicates a list of target processing
elements based on an increasing order of the execution time for each kernel.
[0010] In one embodiment, the plurality of target processing elements execute the
portion of the one or more applications.
[0011] In one embodiment, the plurality of target processing elements comprises at least
one of Central Processing Unit (CPU), Graphics Processing Unit (GPU), Field-Programmable Gate Array (FPGA).
[0012] In one embodiment, the host processor is configured to execute the virtualizer.
[0013] In one embodiment, the plurality of target processing elements have distinct
architectures.
[0014] In another embodiment, the present subject matter relates to a method to
automatically optimize task allocation in Heterogeneous Computing Systems (HCS). The method includes receiving, by a virtualizer, one or more requests from one or more applications for task allocation to a plurality of target processing elements. The virtualizer extracts parameters of kernels of the one or more applications and receives architectures of the plurality of target processing elements. Predictions on execution time of the kernels for each of the plurality of target processing elements are determined based on the parameters and the architectures. The predictions are compared to derive a mapping prediction, which indicates a ranking of the plurality of target processing elements based on least execution time for each of the kernels. Further, a combination of the plurality of target processing elements is determined based on the mapping prediction. The virtualizer allocates the tasks to the combination of the plurality of target processing elements to provide optimized task allocation.

[0015] In one embodiment, the task comprises executing at least a portion of the one or
more applications.
[0016] In one embodiment, the method includes determining a workload distribution
ratio among the plurality of target processing elements.
[0017] In one embodiment, the mapping prediction ranks the plurality of target
processing elements based on the workload distribution ratio.
[0018] In one embodiment, the mapping prediction indicates ranking of the plurality of
target processing elements based on least execution time for each of the kernels and the workload distribution ratio among the plurality of target processing elements.
[0019] In one embodiment, the method includes compiling a source code of the portion
of the one or more applications to machine code based on the mapping prediction.
[0020] In one embodiment, the mapping prediction is modified based on the
performance of the portion of the one or more applications during runtime.

BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The invention has other advantages and features which will be more readily
apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:
[0022] FIG. 1 illustrates a schematic diagram of the system for optimizing allocation of
tasks, according to one embodiment of the present subject matter.
[0023] FIG. 2 illustrates a model of the virtualizer for task allocation, according to one
embodiment of the present subject matter.
[0024] FIG. 3 illustrates a method to automatically optimize task allocation in
heterogeneous computing systems, according to one embodiment of the present subject matter.
[0025] FIG. 4 illustrates measured and estimated data distribution ratio of CPU:GPU =
50:50 for BlackScholes benchmark using embodiments of the invention, showing a measured performance improvement of 45.17% w.r.t. the fastest device.

DETAILED DESCRIPTION
[0026] While the invention has been disclosed with reference to certain embodiments, it
will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. In addition, many modifications may be made to adapt to a particular situation or material to the teachings of the invention without departing from its scope.
[0027] Throughout the specification and claims, the following terms take the meanings
explicitly associated herein unless the context clearly dictates otherwise. The meaning of “a”, “an”, and “the” include plural references. The meaning of “in” includes “in” and “on.” Referring to the drawings, like numbers indicate like parts throughout the views. Additionally, a reference to the singular includes a reference to the plural unless otherwise stated or inconsistent with the disclosure herein.
[0028] The present subject matter relates to a system and method to automatically
optimize task allocation in heterogeneous computing systems. The system includes a plurality of target processing elements, a virtualizer, and a host processor. The virtualizer is configured to extract parameters of kernels of one or more applications. It also receives information on the architectures of the plurality of target processing elements. The virtualizer comprises a device conformability module and a mapping module. The device conformability module provides predictions on performances of the kernels for each of the architectures based on the parameters. The mapping module then compares the predictions and selects one or more of the plurality of target processing elements for the task allocation.

[0029] A schematic representation of a system 100 for optimizing task allocation in
Heterogeneous Computing Systems (HCS) is illustrated in FIG. 1. The system 100 includes a plurality of target processing elements 102-1, 102-2 ..., 102-N, a host processor 104, and a virtualizer 106. The plurality of target processing elements 102-1, 102-2 ..., 102-N may include several distinct processors in combination with a central processor for providing specialized processing capabilities. For instance, the plurality of target processing elements 102-1, 102-2,..., 102-N may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field-Programmable Gate Array (FPGA), etc.
[0030] Further, the host processor 104 is configured to receive one or more requests
from one or more applications for task allocation. In one embodiment, the task may include executing a portion of the one or more applications. For example, the host processor 104 may receive requests for executing a program from a gaming application, which may be installed on the system, to render images related to a game. In one embodiment, the host processor 104 may be configured to execute the virtualizer 106.
[0031] The virtualizer 106 may be configured to extract parameters of kernels of the one
or more applications and receive the architectures of the plurality of target processing elements 102-1, 102-2,..., 102-N. The virtualizer 106 includes a device conformability module 108 and a mapping module 110. The device conformability module 108 and the mapping module 110, according to various embodiments, may be implemented as one or more software modules, hardware modules, firmware modules, or some combination of these. Generally, the functionality of these modules may be combined or distributed as desired in the various embodiments. The program modules may include routines, programs, objects, components, data structures, and the like, that perform particular tasks or implement particular abstract data types.

[0032] The device conformability module 108 predicts performances of the kernels for
each of the architectures based on the parameters of the kernels of the one or more applications. For example, the performance may be characterized by execution time, i.e., the time taken by a processor to execute a program. The parameters may include at least one of compute-to-communication ratio, synchronization, parallelism, etc. In one embodiment, the plurality of target processing elements 102-1, 102-2..., 102-N may have distinct architectures and may be configured on different programming platforms, such as Verilog, VHSIC Hardware Description Language (VHDL), and SystemVerilog.
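The specification does not define a concrete prediction model, so the following is only a minimal sketch of what the device conformability module 108 might compute, assuming a simple Amdahl-style analytical model; the names (ArchSpec, KernelParams, predict_time) and the communication penalty are illustrative assumptions, not taken from the disclosure.

```python
# Minimal sketch of an execution-time prediction, assuming a simple
# Amdahl-style analytical model. All names and formulas here are
# illustrative; the specification does not define a concrete model.
from dataclasses import dataclass

@dataclass
class ArchSpec:
    cores: int        # number of cores/threads
    ipc: float        # instructions executed per cycle
    freq_ghz: float   # frequency of operation

@dataclass
class KernelParams:
    instructions: int          # instruction count of the kernel
    parallel_fraction: float   # fraction of work that is data parallel
    compute_comm_ratio: float  # compute-to-communication ratio

def predict_time(k: KernelParams, a: ArchSpec) -> float:
    """Predicted kernel execution time in seconds on one architecture."""
    serial = k.instructions * (1.0 - k.parallel_fraction)
    parallel = k.instructions * k.parallel_fraction / a.cores
    cycles = (serial + parallel) / a.ipc
    # A low compute-to-communication ratio inflates the estimate to
    # reflect data-transfer overhead (an assumed penalty model).
    penalty = 1.0 + 1.0 / max(k.compute_comm_ratio, 1e-9)
    return cycles * penalty / (a.freq_ghz * 1e9)
```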
[0033] The mapping module 110 compares the predictions made by the device
conformability module 108 and derives a mapping prediction. The mapping prediction indicates a ranking of the plurality of target processing elements 102-1, 102-2,..., 102-N based on least execution time for each of the kernels. In one embodiment, the mapping prediction may indicate a ranking of the plurality of target processing elements based on the workload distribution ratio among the plurality of target processing elements. Based on the mapping prediction, a combination of the plurality of target processing elements 102-1, 102-2,..., 102-N is determined. The virtualizer 106 allocates the tasks to the combination of the plurality of target processing elements 102-1, 102-2,..., 102-N to provide optimized task allocation.
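As an illustration of this mapping step, the sketch below ranks devices per kernel by ascending predicted execution time and allocates each kernel to its top-ranked device. The function names are hypothetical, and any predictor such as the one sketched above can be passed in.

```python
# Illustrative mapping-module sketch: rank devices fastest-first for
# each kernel, then allocate each kernel to its top-ranked device.
def mapping_prediction(kernels: dict, devices: dict, predict_time) -> dict:
    """Return {kernel_name: [device names ranked by ascending predicted time]}."""
    return {
        name: sorted(devices, key=lambda d: predict_time(params, devices[d]))
        for name, params in kernels.items()
    }

def select_combination(ranking: dict) -> dict:
    """Allocate each kernel to its least-execution-time device."""
    return {kernel: ranked[0] for kernel, ranked in ranking.items()}
```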
[0034] In one example, the mapping prediction may specify one of the plurality of target
processing elements 102-1, 102-2..., 102-N to be selected for the task allocation. For instance, the mapping module 110 may indicate an appropriate target processor based on least execution time, architecture compatibility, etc. For example, the mapping module 110 may select the GPU for executing the program of the gaming application to render images related to the game, based on architectural compatibility.

[0035] Further, the system 100 includes a target-specific compiler 112 for compiling a
source code of the portion of the one or more applications to machine code based on the mapping prediction. In one embodiment, the compiler 112 may construct one source file for each of the target processing elements and carry out device dependent optimization for the target processing elements according to the mapping prediction. These per-device compiled versions are executed on the corresponding target processing elements. In one embodiment, the performance of the executions may be measured during runtime for feedback. Based on this feedback, the mapping may be altered dynamically for improved performance of the system 100.
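One plausible way to realize the runtime feedback described in this paragraph is a per-device correction factor that nudges future predictions toward measured behaviour; the exponential-moving-average scheme below is an assumption for illustration, not the disclosed mechanism.

```python
# Hedged sketch of runtime feedback: measured execution times refine a
# per-device correction factor applied to future predictions. The EMA
# weighting (0.7/0.3) is an illustrative choice, not from the disclosure.
class FeedbackMapper:
    def __init__(self):
        self.scale = {}  # device -> measured/predicted correction factor

    def record(self, device: str, predicted: float, measured: float):
        prev = self.scale.get(device, 1.0)
        self.scale[device] = 0.7 * prev + 0.3 * (measured / predicted)

    def corrected_time(self, device: str, predicted: float) -> float:
        # Used when re-deriving the mapping prediction at runtime.
        return predicted * self.scale.get(device, 1.0)
```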
[0036] An example of the model of the virtualizer 106 is illustrated in FIG. 2, according
to an embodiment. The virtualizer 106 may be a combination of software, firmware, or hardware. In one embodiment, the virtualizer 106 may communicate between the application and a target-specific compiler to facilitate task allocation dynamically.
[0037] The virtualizer 106 receives the architecture information 202-1, 202-2…, 202-N
of the plurality of target processing elements 102-1, 102-2…, 102-N. Further, the virtualizer receives a set of parameters 204 of kernels of the one or more applications. It extracts kernel or task parameters of the application, such as compute-to-communication ratio, parallelism, and synchronization. Based on the architectures 202-1, 202-2…, 202-N and the parameters, the virtualizer may predict the execution time for each target processor. Additionally, the virtualizer may determine the workload distribution ratio based on the relative predicted execution time on each of the target processing elements. The execution time may further be used for deriving the mapping prediction as described earlier and in subsequent sections.
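A workload distribution ratio of the kind described here can be derived by giving each device work in inverse proportion to its predicted execution time; the function below is a minimal sketch under that assumption.

```python
# Minimal sketch: split a data-parallel workload in inverse proportion
# to predicted execution times, so faster devices receive more work.
def distribution_ratio(predicted_times: dict) -> dict:
    """Map device -> fraction of the data-parallel workload."""
    speeds = {d: 1.0 / t for d, t in predicted_times.items()}
    total = sum(speeds.values())
    return {d: s / total for d, s in speeds.items()}

# Equal predicted times give the 50:50 CPU:GPU split seen in FIG. 4.
print(distribution_ratio({"CPU": 2.0, "GPU": 2.0}))  # {'CPU': 0.5, 'GPU': 0.5}
```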
[0038] A method 300 for automatically optimizing task allocation in the HCS 100 is
illustrated in FIG. 3, according to an embodiment of the present subject matter. The method

300 may be performed by the virtualizer 106. The method 300 may include receiving one or more requests from one or more applications for task allocation to one of a plurality of target processing elements 102-1, 102-2..., 102-N, at block 302. In one embodiment, the task may include executing at least a portion of the one or more applications.
[0039] The parameters of kernels of the one or more applications and architectures of
the plurality of target processing elements 102-1, 102-2..., 102-N are extracted, at block 304 and block 306. Based on the parameters of the kernels of the one or more applications and the architectures, predictions on execution time of the kernels for each of the architectures are determined, at block 308. In one embodiment, the predictions are determined by calculating the execution time of the kernels for each of the architectures.
[0040] These predictions are compared to derive the mapping prediction, at block 310.
The mapping prediction may be used for indicating one of the plurality of target processing elements 102-1, 102-2..., 102-N for the task allocation. In one embodiment, the mapping prediction may indicate a list of target processing elements 102-1, 102-2..., 102-N ranked based on their execution time. For example, the target processing element associated with least execution time may be ranked highest, i.e., as highest priority.
[0041] In one embodiment, the mapping prediction ranks the plurality of target
processing elements, based on which the workload distribution ratio among the plurality of target processing elements is determined. Based on the mapping prediction, a combination of the plurality of target processing elements 102-1, 102-2,..., 102-N is determined, at block 312. The virtualizer 106 allocates the tasks to the combination of the plurality of target processing elements 102-1, 102-2,..., 102-N to provide optimized task allocation, at block 314.

[0042] Further, a source code of the portion of the one or more applications is compiled
to machine code. The one or more target processing elements 102-1, 102-2..., 102-N execute the portion of the application using the machine code. In one embodiment, the mapping prediction may be modified based on the performance of the portion of the one or more applications during runtime.
[0043] In one implementation, the virtualizer 106 may function in two phases, the initial
phase with CPU and GPU, and the next phase with CPU, GPU, and FPGA. For instance, in the initial phase, the virtualizer may determine relative performance of CPU and GPU instead of the absolute performance. With the relative performance estimation, the virtualizer 106 may determine an automatic optimal task/data distribution ratio between CPU and GPU for data parallel applications.
[0044] The virtualizer 106 may determine the relative performance through static code
analysis. The analysis may be performed based on the target elements' architecture specifications, such as the number of cores/threads, instructions executed per cycle, and frequency of operation, which are available at compile time. The slowest path in the CPU and GPU code may be identified based on the analysis. Further, the execution time may be estimated in each case, based on the device information and the slowest path of each code, to find the suitability of an application to each device. Once the relative execution time on each element has been estimated, the virtualizer 106 automatically determines an optimal data distribution ratio based on the relative performance, as sketched below.
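A compile-time estimate consistent with this description can be formed from the slowest candidate path and the device specification; the sketch below is illustrative only, and all instruction counts and device figures are made-up values.

```python
# Illustrative static-analysis step: the slowest candidate path and the
# compile-time device specification yield a per-device time estimate.
# Instruction counts and device figures below are made-up examples.
def estimate_device_time(path_instruction_counts, ipc, freq_ghz, cores=1):
    slowest = max(path_instruction_counts)     # slowest path dominates
    return slowest / (ipc * freq_ghz * 1e9 * cores)

t_cpu = estimate_device_time([8e5, 1.2e6], ipc=4.0, freq_ghz=3.0, cores=8)
t_gpu = estimate_device_time([8e5, 1.2e6], ipc=2.0, freq_ghz=1.2, cores=2048)
cpu_share = t_gpu / (t_cpu + t_gpu)  # CPU's share is inverse to its time
print(f"CPU:GPU = {cpu_share:.2f}:{1 - cpu_share:.2f}")
```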

EXAMPLE
Example 1: Computational Efficiency on Benchmarks
[0045] The virtualizer 106 as disclosed in the embodiments was used to generate
performance data on benchmarks for evaluation. The performance improvement of the BlackScholes benchmark as a function of the statically estimated workload/data distribution ratio between the CPU and GPU is illustrated in FIG. 4. As the results indicate, the optimal CPU:GPU data distribution ratio of 50:50 shows a measured performance improvement of 45.17% over the best-performing device or target processing element.
[0046] The virtualizer 106 was further evaluated against benchmarks from the Nvidia
SDK and Polybench suites, and the results are shown in Table 1. It is seen from Table 1 that there is an average measured performance improvement of around 38.44% across all benchmarks. Therefore, the virtualizer 106 eliminates the time-consuming manual data-distribution effort by the programmer among the CPU and GPU, yielding a performance improvement for the application.


[0047] In other implementations, the execution time may be estimated by computing the
best-case and worst-case execution times on both the target processing elements. Since memory latencies have a greater impact on the performance of an application, the best-case and worst-case estimations may be based on the memory operations in the identified slowest path. The best-case scenario is when all the data is available in cache/shared memory; the worst-case scenario is when no data is available in cache and hence must be accessed from global/device memory. The relative estimated execution time gives the optimal data distribution ratio between the two devices for a given application.
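To make the bounds concrete, the sketch below computes best- and worst-case times in which only the memory operations on the slowest path differ between the two cases; the latency constants are placeholders, not figures from the disclosure.

```python
# Sketch of best-/worst-case bounds: only memory operations on the
# slowest path differ (cache/shared-memory hit vs. global-memory access).
# The latency constants are illustrative placeholders.
CACHE_LATENCY_CYCLES = 4
GLOBAL_LATENCY_CYCLES = 400

def case_bounds(compute_cycles: int, mem_ops: int, freq_ghz: float):
    """Return (best_case_s, worst_case_s) for the identified slowest path."""
    best = compute_cycles + mem_ops * CACHE_LATENCY_CYCLES
    worst = compute_cycles + mem_ops * GLOBAL_LATENCY_CYCLES
    return best / (freq_ghz * 1e9), worst / (freq_ghz * 1e9)
```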
[0048] The advantages of the present subject matter and its embodiments
include providing an automatic and appropriate mapping of each kernel to the right target processing element. Therefore, the efficiency and performance of the system are increased. Moreover, the reduced execution time decreases the overall power consumption.
[0049] Although the detailed description contains many specifics, these should not be
construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. While the invention has been disclosed with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. In addition, many modifications may be made to adapt to a particular situation or material to the teachings of the invention without departing from its scope.

WE CLAIM:
1. A method for automatically optimizing task allocation in Heterogeneous Computing
Systems (HCS), the method comprising:
receiving, by a virtualizer (106), one or more requests from one or more applications for task allocation to a plurality of target processing elements (102-1, 102-2...., 102-N);
extracting, by the virtualizer (106), parameters of kernels of the one or more applications;
receiving, by the virtualizer (106), information related to architectures of the plurality of target processing elements (102-1, 102-2...., 102-N);
determining, by the virtualizer (106), predictions on execution time of the kernels in each of the plurality of target processing elements (102-1, 102-2...., 102-N) based on the parameters and the architectures;
comparing, by the virtualizer (106), the predictions to derive a mapping prediction, wherein the mapping prediction indicates a ranking of the plurality of target processing elements (102-1, 102-2...., 102-N) based on least execution time for each of the kernels;
determining, by the virtualizer (106), a combination of the plurality of target processing elements (102-1, 102-2...., 102-N) based on the mapping prediction; and
allocating, by the virtualizer (106), the tasks to the combination of the plurality of target processing elements (102-1, 102-2...., 102-N) to provide optimized task allocation.
2. The method as claimed in claim 1, wherein the task comprises executing at least a
portion of the one or more applications.

3. The method as claimed in claim 1 further comprising, determining a workload distribution ratio among the plurality of target processing elements (102-1, 102-2...., 102-N).
4. The method as claimed in claim 1, wherein the mapping prediction indicates ranking of the plurality of target processing elements (102-1, 102-2...., 102-N) and determines the workload distribution ratio.
5. The method as claimed in claim 1 further comprising compiling a source code of the portion of the one or more applications to machine code based on the mapping prediction.
6. The method as claimed in claim 1 further comprising modifying the mapping prediction based on the executions of the portion of the one or more applications during runtime.
7. A system for automatically optimizing task allocation in Heterogeneous Computing Systems (HCS), the system comprising:
a plurality of target processing elements (102-1, 102-2...., 102-N); a host processor (104) configured to:
receive one or more requests from one or more applications for the task allocation; and
execute a virtualizer (106) for extracting parameters of kernels of the one or more applications and to receive information on the architectures of the plurality of

target processing elements (102-1, 102-2...., 102-N), wherein the virtualizer (106) comprises:
a device conformability module (108) to determine predictions on execution time of the kernels of the one or more applications for each of the architectures based on the parameters; and
a mapping module (110) configured to:
compare the predictions to derive a mapping prediction, wherein the mapping prediction indicates a ranking of the plurality of target processing elements (102-1, 102-2...., 102-N) based on least execution time for each of the kernels;
determine a combination of the plurality of target processing elements (102-1, 102-2...., 102-N) based on the mapping prediction; and
allocate the tasks to the combination of the plurality of target processing elements (102-1, 102-2...., 102-N) to provide optimized task allocation.
8. The system as claimed in claim 7, wherein the mapping prediction indicates ranking of the plurality of target processing elements (102-1, 102-2...., 102-N) based on which a workload distribution ratio can be determined among the plurality of target processing elements (102-1, 102-2...., 102-N).

9. The system as claimed in claim 7, wherein the plurality of target processing elements (102-1, 102-2...., 102-N) execute the portion of the one or more applications.
10. The system as claimed in claim 7 further comprising a compiler (112), wherein the compiler (112) transforms a source code of a portion of the one or more applications to machine code based on the mapping prediction for one or more of the target processing elements (102-1, 102-2...., 102-N).
11. The system as claimed in claim 7, wherein the information on architectures include at least number of cores or threads, instructions executed per cycle, and frequency of operation.
12. The system as claimed in claim 7, wherein the plurality of target processing elements (102-1, 102-2...., 102-N) comprises at least one of Central Processing Unit (CPU), Graphics Processing Unit (GPU), Field-Programmable Gate Array (FPGA).
13. The system as claimed in claim 7, wherein the plurality of target processing elements (102-1, 102-2...., 102-N) have distinct architectures.

Documents

Application Documents

# Name Date
1 201741029762-STATEMENT OF UNDERTAKING (FORM 3) [22-08-2017(online)].pdf 2017-08-22
2 201741029762-DRAWINGS [22-08-2017(online)].pdf 2017-08-22
3 201741029762-COMPLETE SPECIFICATION [22-08-2017(online)].pdf 2017-08-22
4 201741029762-Proof of Right (MANDATORY) [22-02-2018(online)].pdf 2018-02-22
5 201741029762-FORM 3 [09-10-2018(online)].pdf 2018-10-09
6 201741029762-FORM-9 [26-10-2018(online)].pdf 2018-10-26
7 201741029762-FORM 18 [14-10-2019(online)].pdf 2019-10-14
8 201741029762-REQUEST FOR CERTIFIED COPY [17-10-2019(online)].pdf 2019-10-17
9 201741029762-Response to office action (Mandatory) [16-12-2019(online)].pdf 2019-12-16
10 201741029762-FER.pdf 2021-10-26
11 201741029762-RELEVANT DOCUMENTS [19-04-2022(online)].pdf 2022-04-19
12 201741029762-FORM 13 [19-04-2022(online)].pdf 2022-04-19
13 201741029762-FER_SER_REPLY [19-04-2022(online)].pdf 2022-04-19
14 201741029762-EVIDENCE FOR REGISTRATION UNDER SSI [19-04-2022(online)].pdf 2022-04-19
15 201741029762-EDUCATIONAL INSTITUTION(S) [19-04-2022(online)].pdf 2022-04-19
16 201741029762-CORRESPONDENCE [19-04-2022(online)].pdf 2022-04-19
17 201741029762-CLAIMS [19-04-2022(online)].pdf 2022-04-19
18 201741029762-FORM-26 [30-05-2022(online)].pdf 2022-05-30
19 201741029762-US(14)-HearingNotice-(HearingDate-03-11-2023).pdf 2023-10-10
20 201741029762-Correspondence to notify the Controller [31-10-2023(online)].pdf 2023-10-31
21 201741029762-Written submissions and relevant documents [14-11-2023(online)].pdf 2023-11-14
22 201741029762-PatentCertificate30-11-2023.pdf 2023-11-30
23 201741029762-IntimationOfGrant30-11-2023.pdf 2023-11-30
24 201741029762-PROOF OF ALTERATION [08-04-2025(online)].pdf 2025-04-08
25 201741029762-OTHERS [12-05-2025(online)].pdf 2025-05-12
26 201741029762-EDUCATIONAL INSTITUTION(S) [12-05-2025(online)].pdf 2025-05-12

Search Strategy

1 SearchStretegy-201741029762E_25-10-2021.pdf

ERegister / Renewals

3rd: 28 Feb 2024 (from 22/08/2019 to 22/08/2020)
4th: 28 Feb 2024 (from 22/08/2020 to 22/08/2021)
5th: 28 Feb 2024 (from 22/08/2021 to 22/08/2022)
6th: 28 Feb 2024 (from 22/08/2022 to 22/08/2023)
7th: 28 Feb 2024 (from 22/08/2023 to 22/08/2024)
8th: 28 Feb 2024 (from 22/08/2024 to 22/08/2025)