Abstract: METHOD AND SYSTEM FOR HETEROGENEOUS LOAD PARTITIONING This disclosure relates generally to a mechanism for optimal workload partitioning in a heterogeneous system. Due to the differences in architectures, and in turn the capabilities, of the different processors used by computing systems, splitting the workload and scheduling tasks by considering the abilities of both the CPU and the GPU, while ensuring maximum resource utilization, is a challenge. The system and method in the embodiments disclosed herein provide a mechanism in which the system determines the complexity of a program, particularly in terms of the number of loops in the program, and then performs the workload partitioning based on the determined complexity. The method further includes partitioning the program into different portions/segments, based on the determined optimal partitioning of the program. [To be published with FIG. 2]
Description: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR HETEROGENEOUS LOAD PARTITIONING
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The disclosure herein generally relates to computer systems, and, more particularly, to workload partitioning for data processing in the computer systems.
BACKGROUND
[002] Computer systems or data processing systems in general may include processors of different types, for example, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and so on. CPUs and GPUs are silicon-based processors, each with its own unique functionalities and properties. A CPU has cores that can execute a wide range of tasks with low latency, but these cores are not as parallelizable as those of a GPU. A GPU has thousands of cores that are individually less powerful than CPU cores, but these cores can perform more specific tasks in parallel with high throughput. Certain tasks in data processing may be performed by the CPU, others by the GPU, and some may be performed by either. Efficient processing speeds may be achieved by partitioning the workload associated with the data processing being handled by the system between the available processors, i.e., between the CPUs and the GPUs.
[003] However, due to the differences in architectures, and in turn the capabilities, splitting the workload and scheduling tasks by considering the abilities of both the CPU and the GPU, while ensuring maximum resource utilization, remains a challenge.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method is provided. The method includes receiving, via one or more hardware processors, an input data comprising a program to be executed. Further, the program is determined as one of a kernel-based program and a non-kernel-based program, via the one or more hardware processors. Further, the program is pre-processed, via the one or more hardware processors, to extract a set of features comprising a) number of functions, b) number of basic blocks, c) number of binary operators, d) number of bit binary operators, e) number of vector operators, f) number of store operators, g) number of load operations, and h) number of private memory accesses. For pre-processing the program, a first hardware component executed by the one or more hardware processors is used if the program is the kernel-based program, and a second hardware component executed by the one or more hardware processors is used if the program is the non-kernel-based program. Further, a loop having the highest number of nested loops, from among a plurality of loops in the program, is identified, via the one or more hardware processors, as a complicated loop. Further, a complexity factor and a number of loops associated with the program are determined, via the one or more hardware processors, from the generated set of static features, wherein the complexity factor represents the number of nested loops in the complicated loop. Further, the generated set of static features, the determined complexity factor, and the number of loops associated with the program are processed using a trained Multilayer Perceptron (MLP) model, via the one or more hardware processors, to obtain an optimal workload partition of the program.
[005] In another embodiment of the method, the first hardware component is a low-level virtual machine (llvm) feature extractor.
[006] In another embodiment of the method, the second hardware component is a static feature extractor.
[007] In another embodiment of the method, the MLP model is trained to determine the optimal workload partition, comprising: collecting as training data, via the one or more hardware processors, a plurality of training programs and information on a plurality of CPUs and GPUs; determining, via the one or more hardware processors, an optimal workload partition corresponding to each training program among the plurality of training programs using an OpenCL execution model, comprising: initializing a set of platforms for an OpenCL implementation; and creating (i) an OpenCL context, and (ii) a command queue, for each of the plurality of CPUs and GPUs for a kernel execution, wherein the kernel execution comprises: setting up a kernel based on a set of host variables and a set of device variables; performing an automatic partitioning using the OpenCL execution model by iterating CPU load from 0 to 100 and GPU load from 100 to CPU load, in a plurality of iterations, wherein in each of the plurality of iterations, a) a plurality of buffers are created for each of the plurality of CPUs and GPUs, b) an execution time is calculated for each of a plurality of partitions of each of the plurality of CPUs and GPUs, and c) a partition having the least value of execution time from among the plurality of partitions is selected; extracting the set of static features from the plurality of training programs, for the performed automatic partitioning; and computing, for each of the plurality of training programs, (i) the complexity factor and (ii) the number of loops, based on the extracted set of static features; and training the MLP model using a) the set of static features, b) the optimal workload partition, c) the complexity factor, and d) the number of loops for each training program, as training data, wherein during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
[008] In another embodiment of the method, during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
[009] In another embodiment of the method, the program is partitioned into a plurality of serial and parallel portions based on the determined optimal workload partition of the program.
[010] In another embodiment of the method, determining the complexity factor and the number of loops comprises: identifying a plurality of parallel blocks, wherein each of the plurality of parallel blocks comprises a part of the program with loops; identifying a parallel block with a maximum number of nested loops from among the plurality of parallel blocks, as a complex block in the program; and determining the complexity factor and the number of loops based on the complex block.
[011] In yet another embodiment, a system is provided. The system includes one or more hardware processors, a communication interface, and a memory storing a plurality of instructions. The plurality of instructions cause the one or more hardware processors to receive an input data comprising a program to be executed. Further, the program is determined as one of a kernel-based program and a non-kernel-based program, via the one or more hardware processors. Further, the program is pre-processed, via the one or more hardware processors, to extract a set of features comprising a) number of functions, b) number of basic blocks, c) number of binary operators, d) number of bit binary operators, e) number of vector operators, f) number of store operators, g) number of load operations, and h) number of private memory accesses. For pre-processing the program, a first hardware component executed by the one or more hardware processors is used if the program is the kernel-based program, and a second hardware component executed by the one or more hardware processors is used if the program is the non-kernel-based program. Further, a complexity factor and a number of loops associated with the program are determined from the generated set of static features, wherein the complexity factor represents the number of nested loops in the program. Further, the generated set of static features, the determined complexity factor, and the number of loops associated with the program are processed, via the one or more hardware processors, using a trained MLP model, to obtain an optimal workload partition of the program.
[012] In yet another embodiment of the system, the first hardware component is an llvm feature extractor.
[013] In yet another embodiment of the system, the second hardware component is a static feature extractor.
[014] In yet another embodiment of the system, the one or more hardware processors are configured to train the MLP model to determine the optimal workload partition, by: collecting as training data, via the one or more hardware processors, a plurality of training programs and information on a plurality of CPUs and GPUs; determining, via the one or more hardware processors, an optimal workload partition corresponding to each training program among the plurality of training programs using an OpenCL execution model, comprising: initializing a set of platforms for an OpenCL implementation; and creating (i) an OpenCL context, and (ii) a command queue, for each of the plurality of CPUs and GPUs for a kernel execution, wherein the kernel execution comprises: setting up a kernel based on a set of host variables and a set of device variables; performing an automatic partitioning using the OpenCL execution model by iterating CPU load from 0 to 100 and GPU load from 100 to CPU load, in a plurality of iterations, wherein in each of the plurality of iterations, a) a plurality of buffers are created for each of the plurality of CPUs and GPUs, b) an execution time is calculated for each of a plurality of partitions of each of the plurality of CPUs and GPUs, and c) a partition having the least value of execution time from among the plurality of partitions is selected; extracting the set of static features from the plurality of training programs, for the performed automatic partitioning; and computing, for each of the plurality of training programs, (i) the complexity factor and (ii) the number of loops, based on the extracted set of static features; and training the MLP model using a) the set of static features, b) the optimal workload partition, c) the complexity factor, and d) the number of loops for each training program, as training data, wherein during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
[015] In yet another embodiment of the system, during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
[016] In yet another embodiment of the system, the one or more hardware processors are configured to partition the program into a plurality of serial and parallel portions based on the determined optimal workload partition of the program.
[017] In yet another embodiment of the system, the one or more hardware processors are configured to determine the complexity factor and the number of loops by: identifying a plurality of parallel blocks, wherein each of the plurality of parallel blocks comprises a part of the program with loops; identifying a parallel block with a maximum number of nested loops from among the plurality of parallel blocks, as a complex block in the program; and determining the complexity factor and the number of loops based on the complex block.
[018] In yet another aspect, a non-transitory computer readable medium is provided. The non-transitory computer readable medium includes a plurality of instructions, which when executed, cause one or more hardware processors to receive an input data comprising a program to be executed. Further, the program is determined as one of a kernel-based program and a non-kernel-based program, via the one or more hardware processors. Further, the program is pre-processed, via the one or more hardware processors, to extract a set of features comprising a) number of functions, b) number of basic blocks, c) number of binary operators, d) number of bit binary operators, e) number of vector operators, f) number of store operators, g) number of load operations, and h) number of private memory accesses. For pre-processing the program, a first hardware component executed by the one or more hardware processors is used if the program is the kernel-based program, and a second hardware component executed by the one or more hardware processors is used if the program is the non-kernel-based program. Further, a complexity factor and a number of loops associated with the program are determined from the generated set of static features, wherein the complexity factor represents the number of nested loops in the program. Further, the generated set of static features, the determined complexity factor, and the number of loops associated with the program are processed, via the one or more hardware processors, using a trained MLP model, to obtain an optimal workload partition of the program.
[019] In another embodiment of the non-transitory computer readable medium, the first hardware component is an llvm feature extractor.
[020] In another embodiment of the non-transitory computer readable medium, the second hardware component is a static feature extractor.
[021] In another embodiment of the non-transitory computer readable medium, the MLP model is trained to determine the optimal workload partition, comprising: collecting as training data, via the one or more hardware processors, a plurality of training programs and information on a plurality of CPUs and GPUs; determining, via the one or more hardware processors, an optimal workload partition corresponding to each training program among the plurality of training programs using an OpenCL execution model, comprising: initializing a set of platforms for an OpenCL implementation; and creating (i) an OpenCL context, and (ii) a command queue, for each of the plurality of CPUs and GPUs for a kernel execution, wherein the kernel execution comprises: setting up a kernel based on a set of host variables and a set of device variables; performing an automatic partitioning using the OpenCL execution model by iterating CPU load from 0 to 100 and GPU load from 100 to CPU load, in a plurality of iterations, wherein in each of the plurality of iterations, a) a plurality of buffers are created for each of the plurality of CPUs and GPUs, b) an execution time is calculated for each of a plurality of partitions of each of the plurality of CPUs and GPUs, and c) a partition having the least value of execution time from among the plurality of partitions is selected; extracting the set of static features from the plurality of training programs, for the performed automatic partitioning; and computing, for each of the plurality of training programs, (i) the complexity factor and (ii) the number of loops, based on the extracted set of static features; and training the MLP model using a) the set of static features, b) the optimal workload partition, c) the complexity factor, and d) the number of loops for each training program, as training data, wherein during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
[022] In yet another embodiment of the non-transitory computer readable medium, during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
[023] In yet another embodiment of the non-transitory computer readable medium, the program is partitioned into a plurality of serial and parallel portions based on the determined optimal workload partition of the program.
[024] In yet another embodiment of the non-transitory computer readable medium, determining the complexity factor and the number of loops comprises: identifying a plurality of parallel blocks, wherein each of the plurality of parallel blocks comprises a part of the program with loops; identifying a parallel block with a maximum number of nested loops from among the plurality of parallel blocks, as a complex block in the program; and determining the complexity factor and the number of loops based on the complex block.
[025] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[026] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[027] FIG. 1 illustrates an exemplary block diagram of a system for workload partitioning, according to some embodiments of the present disclosure.
[028] FIG. 2 is a flow diagram depicting steps involved in the process of the workload partitioning by the system of FIG. 1, according to some embodiments of the present disclosure.
[029] FIG. 3 is a flow diagram depicting steps involved in the process of training an MLP model to determine the optimal workload partition, by the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[030] FIG. 4 is a flow diagram depicting steps involved in the process of kernel execution performed by the system of FIG. 1 during the optimal workload partition, in accordance with some embodiments of the present disclosure.
[031] FIG. 5 is a flow diagram depicting steps involved in the process of determining a complexity factor and a number of loops by the system of FIG. 1, during the optimal workload partition, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[032] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[033] Computer systems or data processing systems in general may include processors of different types, for example, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and so on. CPUs and GPUs are silicon-based processors, each with its own unique functionalities and properties. A CPU has cores that can execute a wide range of tasks with low latency, but these cores are not as parallelizable as those of a GPU. A GPU has thousands of cores that are individually less powerful than CPU cores, but these cores can perform more specific tasks in parallel with high throughput. Certain tasks in data processing may be performed by the CPU, others by the GPU, and some may be performed by either. Efficient processing speeds may be achieved by partitioning the workload associated with the data processing being handled by the system between the available processors, i.e., between the CPUs and the GPUs. However, due to the differences in architectures, and in turn the capabilities, splitting the workload and scheduling tasks by considering the abilities of both the CPU and the GPU, while ensuring maximum resource utilization, remains a challenge.
[034] In order to overcome the aforementioned challenges, the method and system disclosed herein provide an approach in which a complexity factor and a number of loops associated with a program being processed are determined, and then, based on the complexity factor and the number of loops, an optimal workload partitioning is obtained for the program using a trained MLP model.
[035] Referring now to the drawings, and more particularly to FIG. 1 through FIG. 5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[036] FIG. 1 illustrates an exemplary block diagram of a system for workload partitioning, according to some embodiments of the present disclosure.
[037] The system 100 includes or is otherwise in communication with hardware processors 102, at least one memory such as a memory 104, and an Input/Output (I/O) interface 112. The hardware processors 102, the memory 104, and the I/O interface 112 may be coupled by a system bus such as a system bus 108 or a similar mechanism. In an embodiment, the hardware processors 102 can be one or more hardware processors.
[038] The I/O interface 112 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and interfaces for peripheral device(s) such as a keyboard, a mouse, an external memory, a printer, and the like. Further, the I/O interface 112 may enable the system 100 to communicate with other devices, such as web servers and external databases.
[039] The I/O interface 112 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For this purpose, the I/O interface 112 may include one or more ports for connecting several computing systems or devices with one another or to another server.
[040] The one or more hardware processors 102 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, node machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 102 are configured to fetch and execute computer-readable instructions stored in the memory 104.
[041] The memory 104 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 104 includes a plurality of modules 106.
[042] The plurality of modules 106 include programs or coded instructions that supplement applications or functions performed by the system 100 for executing the different steps involved in the process of the workload partitioning. The plurality of modules 106, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The plurality of modules 106 may also be implemented as signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 106 can be implemented in hardware, as computer-readable instructions executed by the one or more hardware processors 102, or by a combination thereof. The plurality of modules 106 can include various sub-modules (not shown).
[043] The data repository (or repository) 110 may include a plurality of abstracted pieces of code for refinement and data that is processed, received, or generated as a result of the execution of the plurality of modules in the module(s) 106.
[044] Although the data repository 110 is shown as internal to the system 100, it will be noted that, in alternate embodiments, the data repository 110 can also be implemented external to the system 100, where the data repository 110 may be stored within a database (repository 110) communicatively coupled to the system 100. The data contained within such an external database may be periodically updated. For example, new data may be added into the database (not shown in FIG. 1) and/or existing data may be modified and/or non-useful data may be deleted from the database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). Functions of the components of the system 100 are now explained with reference to the steps in flow diagrams in FIG. 2 through FIG. 5.
[045] FIG. 2 is a flow diagram depicting steps involved in the process of the workload partitioning by the system of FIG. 1, according to some embodiments of the present disclosure.
[046] In an embodiment, the system 100 comprises one or more data storage devices or the memory 104 operatively coupled to the processor(s) 102 and is configured to store instructions for execution of steps of a method 200 by the processor(s) or one or more hardware processors 102. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1 and the steps of the flow diagram as depicted in FIG. 2. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[047] At step 202 of method 200 in FIG. 2, the system 100 receives, via the one or more hardware processors 102, an input data comprising a program to be executed. The program may be fed as input to the system 100 via one or more suitable interfaces provided by the I/O interface 112. In an embodiment, the program is fed as input manually by an authorized user. In another embodiment, the program is automatically fetched by the system 100 from one or more associated external systems that are configured to communicate with the system 100.
[048] Further, at step 204 of the method 200, the system 100 determines the program as one of a kernel-based program and a non-kernel-based program, via the one or more hardware processors 102. At this stage, the system 100 scans through the program to determine the presence of the keyword 'saxpy_kernel'. If this keyword is present in the program, the system 100 determines that the program is the kernel-based program, and if this keyword is not present, then the system 100 determines that the program is the non-kernel-based program.
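As an illustration of this check, the following Python sketch classifies a program by scanning its source text for the kernel keyword described above; the function name, return labels, and the example file name are illustrative assumptions, not part of the disclosed system.

```python
# Illustrative sketch of step 204: classify a program as kernel-based or
# non-kernel-based by scanning its source text for the kernel keyword
# described above. All names here are assumptions for illustration.
KERNEL_KEYWORD = "saxpy_kernel"

def classify_program(source_code: str) -> str:
    """Return 'kernel-based' if the kernel keyword occurs in the program
    text, and 'non-kernel-based' otherwise."""
    return "kernel-based" if KERNEL_KEYWORD in source_code else "non-kernel-based"

# Example usage (hypothetical file name):
# kind = classify_program(open("program.cl").read())
```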
[049] Further, at step 206 of the method 200, the system 100 pre-processes the program, via the one or more hardware processors 102, to extract a set of features including a) number of functions, b) number of basic blocks, c) number of binary operators, d) number of bit binary operators, e) number of vector operators, f) number of store operators, g) number of load operations, and h) number of private memory accesses, in the program. In an embodiment, the system 100 may use any suitable extractor module/component known in the art for the purpose of extracting the features. For example, the system 100 uses the low-level virtual machine (llvm) API / llvm feature extractor. In an embodiment, the system 100 uses different hardware components for the pre-processing of the program. For example, a first hardware component executed by the one or more hardware processors 102 is used if the program is the kernel-based program, and a second hardware component executed by the one or more hardware processors 102 is used if the program is the non-kernel-based program. For example, the first hardware component is an llvm feature extractor, and the second hardware component is a static feature extractor. It is to be noted that the llvm feature extractor and the static feature extractor are cited only as examples, and any other suitable components may be used.
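A rough, hedged sketch of such a feature extractor is given below. A production implementation would use the llvm API as noted above; here, simplified regular-expression patterns over textual LLVM IR stand in for it, and the pattern for private memory accesses in particular is an assumption that depends on the target's address-space convention.

```python
import re

# Simplified stand-in for the feature extraction of step 206: count the
# eight static features over textual LLVM IR. The patterns are heuristic
# assumptions; a real extractor would walk the IR via the llvm API.
FEATURE_PATTERNS = {
    "num_functions":          r"\bdefine\b",                  # function definitions
    "num_basic_blocks":       r"^[\w.]+:",                    # labels opening basic blocks
    "num_binary_ops":         r"\b(add|sub|mul|sdiv|udiv)\b",
    "num_bit_binary_ops":     r"\b(and|or|xor|shl|lshr|ashr)\b",
    "num_vector_ops":         r"<\s*\d+\s*x\s",               # vector types such as <4 x i32>
    "num_store_ops":          r"\bstore\b",
    "num_load_ops":           r"\bload\b",
    "num_private_mem_access": r"addrspace\(5\)",              # private address space (assumed)
}

def extract_static_features(ir_text: str) -> dict:
    """Count each static feature in the given LLVM IR text."""
    return {name: len(re.findall(pattern, ir_text, flags=re.MULTILINE))
            for name, pattern in FEATURE_PATTERNS.items()}
```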
[050] Further, at step 208 of the method 200, the system 100 identifies, via the one or more hardware processors, a loop having the highest number of nested loops, from among a plurality of loops in the program, as a complicated loop. At this stage, the system 100 scans through the program and identifies the number of nested loops inside each of the plurality of loops having nested loops, and then compares these numbers to determine the loop having the highest number of nested loops.
[051] Further, at step 210 of the method 200, a complexity factor and a number of loops associated with the program are determined from the generated set of static features, by the system 100, using the one or more hardware processors 102, wherein the complexity factor represents the number of nested loops in the complicated loop. Various steps involved in the process of determining the complexity factor and the number of loops are depicted in method 500 in FIG. 5, and are explained hereafter. The system 100 traverses through each line of the program and calculates the number of loops. During the traversal, the system 100 may identify a plurality of parallel blocks, wherein each of the plurality of parallel blocks comprises a part of the program with loops, as in step 502 of the method 500. The system 100 adds the number of loops from all the lines to compute the total number of loops in the program. Further, for each loop, the system 100 calculates the number of nested loops and stores the result in a vector. The number of loops in the loop with the maximum number of nested loops (i.e., identifying a parallel block with a maximum number of nested loops from among the plurality of parallel blocks, as in step 504 of the method 500), i.e., of the complicated loop, is the value of the complexity factor of that particular program. Based on the identified complicated loop (alternately referred to as the complex block), the system 100 determines the complexity factor and the number of loops at step 506 of the method 500.
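The loop-counting traversal of method 500 can be sketched as below for brace-delimited, C-like source. This single-pass heuristic, tracking loop keywords and brace nesting, is an assumption rather than the disclosed extractor: it returns the total number of loops and the complexity factor, i.e., the nesting depth of the complicated loop.

```python
import re

LOOP_RE = re.compile(r"\b(for|while)\b")  # loop headers in C-like source

def complexity_metrics(source_code: str) -> tuple:
    """Return (number_of_loops, complexity_factor), where the complexity
    factor is the nesting depth of the most deeply nested (complicated)
    loop. Assumes every loop body is delimited by braces."""
    num_loops = 0
    complexity_factor = 0
    loop_depth = 0
    brace_stack = []  # True for a brace opened by a loop header, else False
    for line in source_code.splitlines():
        opens_loop = bool(LOOP_RE.search(line))
        if opens_loop:
            num_loops += 1
            loop_depth += 1
            complexity_factor = max(complexity_factor, loop_depth)
        for ch in line:
            if ch == "{":
                brace_stack.append(opens_loop)
                opens_loop = False  # only the first '{' belongs to the loop
            elif ch == "}" and brace_stack:
                if brace_stack.pop():
                    loop_depth -= 1  # the loop's body has closed
    return num_loops, complexity_factor
```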
[052] Further, at step 212 of the method 200, the system 100 determines an optimal workload partition of the program by processing the generated set of static features, the determined complexity factor, and the number of loops associated with the program, using the trained MLP model, via the one or more hardware processors 102. Further, the program is partitioned into a plurality of serial and parallel portions based on the determined optimal workload partition of the program, and then each of the portions is processed by the respective processor, as per the determined workload partitioning.
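A minimal inference sketch for step 212 is shown below, assuming `trained_mlp` is an already-fitted regressor (for example, the model produced by the training sketch later in this description) whose single output is the CPU share of the workload; the helper name and the percent-based output encoding are assumptions.

```python
import numpy as np

def predict_partition(trained_mlp, features: dict,
                      num_loops: int, complexity_factor: int) -> float:
    """Feed the static features plus the two derived metrics to the trained
    MLP and return the predicted CPU share of the workload (in percent)."""
    x = np.array([[*features.values(), num_loops, complexity_factor]])
    cpu_share = float(trained_mlp.predict(x)[0])
    return min(max(cpu_share, 0.0), 100.0)  # clamp to a valid CPU/GPU split
```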
[053] The MLP model is trained to determine the optimal workload partition. Various steps involved in the training of the MLP model are depicted in method 300 in FIG. 3, and are explained hereafter. At step 302 of the method 300, the system 100 collects as training data, via the one or more hardware processors 102, a plurality of training programs and information on a plurality of CPUs and GPUs. Further, at step 304 of the method 300, the system 100 determines, via the one or more hardware processors 102, an optimal workload partition corresponding to each training program among the plurality of training programs using an OpenCL execution model. The steps in determining the optimal workload partition are steps 304a and 304b as depicted in FIG. 3, and are explained hereafter. At step 304a, the system 100 initializes a set of platforms for an OpenCL implementation. Further, at step 304b, the system 100 creates (i) an OpenCL context, and (ii) a command queue, for each of the plurality of CPUs and GPUs for a kernel execution. Various steps in the kernel execution are depicted in method 400 in FIG. 4, and are explained hereafter.
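Steps 304a and 304b can be sketched with PyOpenCL as below: platforms are enumerated, and a context and command queue are created per device so that work can be enqueued to each CPU and GPU independently. Enabling profiling on the queues is an assumption made here so that execution times can be measured in the partition search that follows.

```python
import pyopencl as cl

# Sketch of steps 304a-304b: initialize OpenCL platforms and create one
# context and one profiling-enabled command queue per CPU/GPU device.
contexts_and_queues = {}
for platform in cl.get_platforms():                 # step 304a
    for device in platform.get_devices():
        ctx = cl.Context(devices=[device])          # step 304b (i)
        queue = cl.CommandQueue(                    # step 304b (ii)
            ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
        contexts_and_queues[device.name] = (ctx, queue)
```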
[054] At step 402 of the method 400, the system 100 sets up a kernel based on a set of host variables and a set of device variables. The device variables are given as input to a kernel of the system 100, whereas the host variables are used to store user inputs. Further, at step 404 of the method 400, the system 100 performs an automatic partitioning using the OpenCL execution model by iterating the CPU load over a pre-defined range (for example, 0 to 100) and the GPU load over another pre-defined range (for example, 100 to CPU load), in a plurality of iterations. The total number of iterations may vary depending on a determined and configured step size. In each of the plurality of iterations, a) a plurality of buffers may be created for each of the plurality of CPUs and GPUs, b) an execution time is calculated for each of a plurality of partitions of each of the plurality of CPUs and GPUs, and c) a partition having the least value of execution time from among the plurality of partitions is determined. Further, at step 406 of the method 400, the system 100 extracts the set of static features from the plurality of training programs, for the performed automatic partitioning. The system 100 may use any suitable static feature extractor, for example, an llvm extractor, to extract the static features. Further, at step 408 of the method 400, the system 100 computes, for each of the plurality of training programs, (i) the complexity factor, and (ii) the number of loops, based on the extracted set of static features, and then the MLP model is trained using a) the set of static features, b) the optimal workload partition, c) the complexity factor, and d) the number of loops for each training program, as training data. In an embodiment, during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor. During this training, based on the determined parameters, particularly based on the calculated execution time and the information on the partition having the least value of execution time from among the plurality of partitions, the MLP model learns the optimal partitioning and associated parameters.
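The partition search of step 404 and the training of steps 306/408 can be sketched as below. Here `run_partition(cpu_load)` is a hypothetical helper that creates the buffers, launches the kernel with `cpu_load` percent of the work on the CPU and the remainder on the GPU, and returns the measured execution time; scaling the hidden-layer count with the complexity factor is one possible reading, stated as an assumption, of activating hidden layers based on the complexity factor.

```python
from sklearn.neural_network import MLPRegressor

def best_partition(run_partition, step: int = 10) -> int:
    """Step 404: try CPU loads from 0 to 100 (the GPU takes the remainder)
    and return the CPU share with the least measured execution time."""
    timings = {cpu_load: run_partition(cpu_load)
               for cpu_load in range(0, 101, step)}
    return min(timings, key=timings.get)

def train_mlp(X, y, max_complexity_factor: int) -> MLPRegressor:
    """Steps 306/408: fit an MLP on one row per training program (static
    features + number of loops + complexity factor) against the optimal
    partition found by best_partition. Scaling the hidden-layer count
    with the complexity factor is an assumption."""
    hidden_layers = tuple(64 for _ in range(max(1, max_complexity_factor)))
    model = MLPRegressor(hidden_layer_sizes=hidden_layers, max_iter=2000)
    return model.fit(X, y)
```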
[055] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[056] The embodiments of the present disclosure herein address the unresolved problem of workload partitioning in heterogeneous systems. The embodiments thus provide a mechanism for determining the complexity of a program being processed. Moreover, the embodiments herein further provide a mechanism for performing the optimal workload partitioning based on the determined complexity of the program.
[057] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including, e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be, e.g., hardware means such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[058] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[059] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[060] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[061] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims: We Claim:
1. A processor implemented method (200), comprising:
receiving (202), via one or more hardware processors, an input data comprising a program to be executed;
determining (204), via the one or more hardware processors, the program as one of a kernel-based program and a non-kernel-based program;
pre-processing the program (206), via the one or more hardware processors, to extract a set of features comprising a) number of functions, b) number of basic blocks, c) number of binary operators, d) number of bit binary operators, e) number of vector operators, f) number of store operators, g) number of load operations, and h) number of private memory accesses, wherein a first hardware component executed by the one or more hardware processors is used for the pre-processing of the program if the program is the kernel-based program, and a second hardware component executed by the one or more hardware processors is used for the pre-processing of the program if the program is the non-kernel-based program;
identifying (208), via the one or more hardware processors, a loop having the highest number of nested loops, from among a plurality of loops in the program, as a complicated loop;
determining (210), via the one or more hardware processors, a complexity factor and a number of loops associated with the program, from the generated set of static features, wherein the complexity factor represents the number of nested loops in the complicated loop; and
processing (212), via the one or more hardware processors, the generated set of static features, the determined complexity factor and the number of loops associated with the program, using a trained Multilayer Perceptron (MLP) model, to obtain an optimal workload partition of the program.
2. The processor implemented method as claimed in claim 1, wherein the first hardware component is an llvm feature extractor.
3. The processor implemented method as claimed in claim 1, wherein the second hardware component is a static feature extractor.
4. The processor implemented method as claimed in claim 1, wherein the MLP model is trained to determine the optimal workload partition, comprising:
collecting as training data (302), via the one or more hardware processors, a plurality of training programs and information on a plurality of Central Processing Units (CPUs) and Graphics Processing Units (GPUs);
determining (304), via the one or more hardware processors, an optimal workload partition corresponding to each training program among the plurality of training programs using an OpenCL execution model, comprising:
initializing (304a) a set of platforms for an OpenCL implementation; and
creating (304b) (i) an OpenCL context, and (ii) a command queue, for each of the plurality of CPUs and GPUs for a kernel execution, wherein the kernel execution comprises:
setting (402) up a kernel based on a set of host variables and a set of device variables;
performing (404) an automatic partitioning using the OpenCL execution model by iterating CPU load from 0 to 100 and GPU load from 100 to CPU load, in a plurality of iterations, wherein in each of the plurality of iterations, a) a plurality of buffers are created for each of the plurality of CPUs and GPUs, b) an execution time is calculated for each of a plurality of partitions of each of the plurality of CPUs and GPUs, and c) a partition having the least value of execution time from among the plurality of partitions is selected;
extracting (406) the set of static features from the plurality of training programs, for the performed automatic partitioning; and
computing (408), for each of the plurality of training programs, (i) the complexity factor and (ii) the number of loops, based on the extracted set of static features; and
training (306) the MLP model using a) the set of static features, b) the optimal workload partition, c) the complexity factor, and d) the number of loops for each training program, as training data, wherein during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
5. The processor implemented method as claimed in claim 4, wherein during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
6. The processor implemented method as claimed in claim 1, comprising partitioning the program into a plurality of serial and parallel portions based on the determined optimal workload partition of the program.
7. The processor implemented method as claimed in claim 1, wherein determining the complexity factor and the number of loops comprises:
identifying (502) a plurality of parallel blocks, wherein each of the plurality of parallel blocks comprises a part of the program with loops;
identifying (504) a parallel block with a maximum number of nested loops from among the plurality of parallel blocks, as a complex block in the program; and
determining (506) the complexity factor and the number of loops based on the complex block.
8. A system (100), comprising:
one or more hardware processors (102);
a communication interface (112); and
a memory (104) storing a plurality of instructions, wherein the plurality of instructions cause the one or more hardware processors to:
receive an input data comprising a program to be executed;
determine the program as one of a kernel-based program and a non-kernel-based program;
pre-process the program to extract a set of features comprising a) number of functions, b) number of basic blocks, c) number of binary operators, d) number of bit binary operators, e) number of vector operators, f) number of store operators, g) number of load operations, and h) number of private memory accesses, wherein a first hardware component executed by the one or more hardware processors is used for the pre-processing of the program if the program is the kernel-based program, and a second hardware component executed by the one or more hardware processors is used for the pre-processing of the program if the program is the non-kernel-based program;
identify a loop having the highest number of nested loops, from among a plurality of loops in the program, as a complicated loop;
determine a complexity factor and a number of loops associated with the program, from the generated set of static features, wherein the complexity factor represents the number of nested loops in the complicated loop; and
process the generated set of static features, the determined complexity factor and the number of loops associated with the program, using a trained Multilayer Perceptron (MLP) model, to obtain an optimal workload partition of the program.
9. The system as claimed in claim 8, wherein the first hardware component is an llvm feature extractor.
10. The system as claimed in claim 8, wherein the second hardware component is a static feature extractor.
11. The system as claimed in claim 8, wherein the one or more hardware processors are configured to train the MLP model to determine the optimal workload partition, by:
collecting as training data, via the one or more hardware processors, a plurality of training programs and information on a plurality of Central Processing Units (CPUs) and Graphics Processing Units (GPUs);
determining, via the one or more hardware processors, an optimal workload partition corresponding to each training program among the plurality of training programs using an OpenCL execution model, comprising:
initializing a set of platforms for an OpenCL implementation; and
creating (i) an OpenCL context, and (ii) a command queue, for each of the plurality of CPUs and GPUs for a kernel execution, wherein the kernel execution comprises:
setting up a kernel based on a set of host variables and a set of device variables;
performing an automatic partitioning using the OpenCL execution model by iterating CPU load from 0 to 100 and GPU load from 100 to CPU load, in a plurality of iterations, wherein in each of the plurality of iterations, a) a plurality of buffers are created for each of the plurality of CPUs and GPUs, b) an execution time is calculated for each of a plurality of partitions of each of the plurality of CPUs and GPUs, and c) a partition having the least value of execution time from among the plurality of partitions is selected;
extracting the set of static features from the plurality of training programs, for the performed automatic partitioning; and
computing, for each of the plurality of training programs, (i) the complexity factor and (ii) the number of loops, based on the extracted set of static features; and
training the MLP model using a) the set of static features, b) the optimal workload partition, c) the complexity factor, and d) the number of loops for each training program, as training data, wherein during the training of the MLP model, a number of hidden layers in the MLP model is activated based on the complexity factor.
12. The system as claimed in claim 11, wherein the one or more hardware processors are configured to activate a number of hidden layers in the MLP model during the training of the MLP model, based on the complexity factor.
13. The system as claimed in claim 8, wherein the one or more hardware processors are configured to partition the program into a plurality of serial and parallel portions based on the determined optimal workload partition of the program.
14. The system as claimed in claim 8, wherein the one or more hardware processors are configured to determine the complexity factor and the number of loops, by:
identifying a plurality of parallel blocks, wherein each of the plurality of parallel blocks comprises a part of the program with loops;
identifying a parallel block with a maximum number of nested loops from among the plurality of parallel blocks, as a complex block in the program; and
determining the complexity factor and the number of loops based on the complex block.
Dated this 19th Day of October 2023
Tata Consultancy Services Limited
By their Agent & Attorney
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086