
Systems And Methods For Service Level Agreement (SLA) Aware Workload Scheduling Using Hybrid Cloud Services

Abstract: This disclosure relates to systems and methods for service level agreement (SLA) aware workload scheduling using hybrid cloud services. Cloud services have an auto-scaling feature for load balancing to meet the performance requirements of an application. Existing autoscaling techniques are based on upscaling and downscaling cloud resources to distribute the dynamically varying workloads. However, bursty workloads pose many challenges for auto-scaling and sometimes result in Service Level Agreement (SLA) violations. Furthermore, over-provisioning or under-provisioning cloud resources to address dynamically evolving workloads results in performance degradation and cost escalation. The disclosed method and system provide a workload characterization-based approach for scheduling the bursty workload on a highly scalable serverless architecture in conjunction with a machine learning (ML) platform. [To be published with FIG. 2]


Patent Information

Application #: 202121027383
Filing Date: 18 June 2021
Publication Number: 51/2022
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: kcopatents@khaitanco.com
Parent Application:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai Maharashtra India 400021

Inventors

1. CHAHAL, Dheeraj
Tata Consultancy Services Limited Quadra II, Survey No. 239, Sadesataranali, Opposite Magarpatta City, Hadapsar, Pune Maharashtra India 411028
2. MISHRA, Mayank
Tata Consultancy Services Limited Olympus - A, Opp Rodas Enclave, Hiranandani Estate, Ghodbunder Road, Patlipada, Thane West Maharashtra India 400607
3. SINGHAL, Rekha
Tata Consultancy Services Limited Olympus - A, Opp Rodas Enclave, Hiranandani Estate, Ghodbunder Road, Patlipada, Thane West Maharashtra India 400607

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
SYSTEMS AND METHODS FOR SERVICE LEVEL AGREEMENT (SLA) AWARE WORKLOAD SCHEDULING USING HYBRID CLOUD SERVICES
Applicant
Tata Consultancy Services Limited, a company incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to load balancing on cloud, and, more particularly, to a system and method for service level agreement (SLA) aware workload scheduling on hybrid cloud services.
BACKGROUND
[002] Many organizations are migrating their Artificial Intelligence (AI) application workloads to cloud due to the availability of cost-effective large infrastructure and other benefits. Additionally, popular cloud vendors provide machine learning (ML) platforms for migrating these workloads to cloud such as AWS SageMakerTM, Microsoft Azure MLTM, Google Cloud Platform (GCP)TM, and so on.
[003] These ML platforms provide application programming interfaces (APIs) for data preprocessing, model training, deployment, and inference. The inference service is implemented on these platforms under a strict service level agreement (SLA). The auto-scaling feature of these platforms provides elasticity to dynamically scale up and scale down resources and deliver the desired latency and throughput as defined in the SLA. The auto-scaling feature solves the problem of under-provisioning and performance degradation by spawning new resources. Also, it makes deployment cost-effective by downscaling the resources when they are underutilized.
[004] Typically, enterprise applications involve a large number of concurrent and dependent requests/jobs, which can be executed in parallel or in sequence. Simultaneously launching jobs from different applications during a short time-period can immediately cause a ‘burst’, leading to complexities in resource management. Such jobs, which can cause a burst, may hereinafter be referred to as ‘bursty workloads’. A bursty workload stresses instance resources, resulting in SLA violations.
[005] Conventionally, to mitigate the effect of bursty workloads on SLAs, new virtual machines (VMs) are spawned using auto-scaling features. However, it takes several minutes to instantiate a new ML instance, resulting in an SLA violation during this period.
SUMMARY
[006] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for service level agreement (SLA) aware workload scheduling on cloud is provided. The method includes characterizing, via one or more hardware processors, an inference workload apriori for workload scheduling on a cloud in an on-premise controlled environment, wherein characterizing the inference workload comprises varying configuration of a plurality of hardware resources capable of serving the inference workload, and measuring response time for each hardware resource of the plurality of hardware resources for a plurality of sets of concurrent inference requests. Further, the method includes dynamically servicing a plurality of concurrent inference requests received in real-time based on the characterization of the inference workload, via the one or more hardware processors. Dynamically servicing the plurality of concurrent inference requests includes determining an SLA associated with servicing of one or more concurrent inference requests from amongst the plurality of inference requests, the SLA comprising a response time predefined for servicing the one or more concurrent inference requests, determining whether it is possible to service the one or more concurrent inference requests within the response time specified in the SLA by a virtual machine (VM) instance based on an availability of hardware resources and a threshold number of requests serviceable by the VM instance; and optimally distributing the workload associated with the one or more concurrent requests between the VM instance and a serverless instance on determination of the number of the one or more concurrent requests greater than a threshold number of requests to maintain the response time within the SLA.
[007] In another aspect, a system for service level agreement (SLA) aware workload scheduling on cloud is provided. The system includes a memory storing
instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to characterize an inference workload a priori for workload scheduling on a cloud in an on-premise controlled environment, wherein to characterize the inference workload, the one or more hardware processors are configured by the instructions to vary configuration of a plurality of hardware resources capable of serving the inference workload, and measure response time for each hardware resource of the plurality of hardware resources for a plurality of sets of concurrent inference requests. The one or more hardware processors are configured by the instructions to dynamically service a plurality of concurrent inference requests received in real-time based on the characterization of the inference workload. To dynamically service the plurality of concurrent inference requests, the one or more hardware processors are configured by the instructions to determine an SLA associated with servicing of one or more concurrent inference requests from amongst the plurality of inference requests, the SLA comprising a response time predefined for servicing the one or more concurrent inference requests; determine whether it is possible to service the one or more concurrent inference requests within the response time specified in the SLA by a VM instance based on an availability of hardware resources and a threshold number of requests serviceable by the VM instance; and optimally distribute the workload associated with the one or more concurrent requests between the VM instance and a serverless instance on determination of the number of the one or more concurrent requests greater than a threshold number of requests to maintain the response time within the SLA.
[008] In yet another aspect, a non-transitory computer readable medium for a method for service level agreement (SLA) aware workload scheduling on cloud is provided. The method includes characterizing, via one or more hardware processors, an inference workload apriori for workload scheduling on a cloud in an on-premise controlled environment, wherein characterizing the inference workload comprises varying configuration of a plurality of hardware resources capable of serving the inference workload, and measuring response time for each hardware
resource of the plurality of hardware resources for a plurality of sets of concurrent inference requests. Further, the method includes dynamically servicing a plurality of concurrent inference requests received in real-time based on the characterization of the inference workload, via the one or more hardware processors. Dynamically servicing the plurality of concurrent inference requests includes determining an SLA associated with servicing of one or more concurrent inference requests from amongst the plurality of inference requests, the SLA comprising a response time predefined for servicing the one or more concurrent inference requests, determining whether it is possible to service the one or more concurrent inference requests within the response time specified in the SLA by a VM instance based on an availability of hardware resources and a threshold number of requests serviceable by the VM instance; and optimally distributing the workload associated with the one or more concurrent requests between the VM instance and a serverless instance on determination of the number of the one or more concurrent requests greater than a threshold number of requests to maintain the response time within the SLA.
[009] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[010] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[011] FIG. 1 illustrates a network implementation for service level agreement (SLA) aware workload scheduling using hybrid cloud services according to some embodiments of the present disclosure.
[012] FIG. 2 is a flow diagram illustrating a method for SLA aware workload scheduling using hybrid cloud services in accordance with some embodiments of the present disclosure.

[013] FIG. 3 illustrates an example representation of an on-premise characterization of the model inference workload, in accordance with an example embodiment.
[014] FIG. 4 illustrates an example architecture for load balancing by a load balancer for SLA aware workload scheduling using hybrid cloud services, in accordance with an example embodiment.
[015] FIG. 5 is a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
[016] FIG. 6 is an example representation of a bursty workload used for evaluation of disclosed methodology in an example scenario.
[017] FIG. 7 illustrates a graphical representation of response time of the inference requests using 2 core SageMakerTM ML instance without load balancing on LambdaTM in an example scenario.
[018] FIG. 8 illustrates a graphical representation of response time of the inference requests using 2 core SageMakerTM ML instance with LambdaTM instances in an example scenario.
[019] FIG. 9 illustrates a graphical representation of response time of the inference requests served by LambdaTM instances in an example scenario.
[020] FIG. 10 illustrates a graphical representation of response time of the inference requests using 8 core SageMakerTM ML instance without load balancing on LambdaTM in an example scenario.
[021] FIG. 11 illustrates a graphical representation of response time of the inference requests using 8 core SageMakerTM ML instance with LambdaTM instances in an example scenario.
[022] FIG. 12 illustrates a graphical representation of response time of the inference requests served by LambdaTM instances in an example scenario.
DETAILED DESCRIPTION OF EMBODIMENTS
[023] Cloud services have been explored extensively for SLA-aware and cost-effective Machine Learning (ML) inference serving. A scalable and cost-effective SLA-aware ML inference serving system in AWSTM has been developed.
Another highly scalable and cost-effective framework was developed for serverless deep learning inference. Deep learning model inference has also been studied in the art. Other known systems include frameworks called FEATTM and SpockTM, which exploit cloud functions for auto-scaling and scheduling, respectively.
[024] The aforementioned techniques are either based on feedback control scaling or predictive analytics. Some of the existing techniques also require access to performance metrics, such as resource utilization, in real time when the workload is in execution. These approaches deliver better performance for unknown workloads. However, such methods may lead to performance degradation and cost escalation, especially for well-known workloads.
[025] Various embodiments disclosed herein provide a method and system for service level agreement (SLA) aware workload scheduling on cloud based on a priori offline workload characterization. The disclosed method is suited to well-known workloads and does not rely on access to performance metrics in real time. The disclosed system utilizes hybrid cloud services for workload scheduling, by employing a serverless platform for scheduling the bursty workload in conjunction with machine learning (ML) platforms. A serverless architecture is used for deploying deep learning workloads. Although serverless platforms also suffer from the cold start problem, the resulting latency is only a few seconds. Furthermore, a serverless platform can be a viable solution for serving bursty workloads due to its high scalability and cost-effective pay-per-use cost model. The disclosed system uses an ML platform together with a serverless architecture, both available as cloud services, to balance the bursty workload.
[026] The disclosed system and method employ an on-premise workload characterization for SLA-aware scheduling in a controlled environment by varying the configuration of the hardware resources. The workload characterization facilitates in distributing the workload optimally between the ML platform and serverless instances. The on-premise profiling serves multiple purposes such as
identifying the most appropriate configuration of the ML platform instance, such as the number of cores and memory required, identifying the maximum number of requests that can be served by model servers without violating the SLAs, finding an optimal configuration for the serverless instance, and so on. These and other advantages and applications of the disclosed system are described further in the description below.
[027] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[028] Referring now to the drawings, and more particularly to FIG. 1 through 12, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[029] FIG. 1 illustrates an example network implementation 100 of a system 102 for service level agreement (SLA) aware workload scheduling using hybrid cloud services, in accordance with an example embodiment of the present disclosure. The disclosed system uses a serverless platform for scheduling the bursty workload in conjunction with ML platforms. The system 102 uses an ML platform, which employs dedicated virtual machines (VMs) as endpoints for serving deep models, in conjunction with a serverless platform for serving the bursty traffic. In an embodiment, the system 102 facilitates in characterization of the workload to distribute the load optimally between the ML platform and serverless instances.
[030] Although the present disclosure is explained considering that the system 102 is implemented on a server, it may be understood that the system 102
may also be implemented in a variety of computing systems 104, such as a laptop computer, a desktop computer, a notebook, a workstation, a cloud-based computing environment and the like. It will be understood that the system 102 may be accessed through one or more devices 106-1, 106-2... 106-N, collectively referred to as devices 106 hereinafter, or applications residing on the devices 106. Examples of the devices 106 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, a smartphone, a tablet computer, a workstation and the like. The devices 106 are communicatively coupled to the system 102 through a network 108.
[031] In an embodiment, the network 108 may be a wireless or a wired network, or a combination thereof. In an example, the network 108 can be implemented as a computer network, as one of the different types of networks, such as virtual private network (VPN), intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 108 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each other. Further, the network 108 may include a variety of network devices, including routers, bridges, servers, computing devices, and storage devices. The network devices within the network 108 may interact with the system 102 through communication links.
[032] As discussed above, the system 102 may be implemented in a computing device 104, such as a hand-held device, a laptop or other portable computer, a tablet computer, a mobile phone, a PDA, a smartphone, and a desktop computer. The system 102 may also be implemented in a workstation, a mainframe computer, a server, and a network server. In an embodiment, the system 102 may be coupled to a data repository, for example, a repository 112. The repository 112 may store data processed, received, and generated by the system 102. In an alternate embodiment, the system 102 may include the data repository 112.
[033] The network implementation 100 supports various connectivity options such as BLUETOOTH®, USB, ZigBee and other cellular services. The
network environment enables connection of devices 106, such as a smartphone, with the server 104, and accordingly with the database 112, using any communication link including the Internet, WAN, MAN, and so on. In an exemplary embodiment, the system 102 is implemented to operate as a stand-alone device. In another embodiment, the system 102 may be implemented to work as a loosely coupled device to a smart computing environment.
[034] FIG. 2 illustrates a flow diagram of a method 200 for service level agreement (SLA) aware workload scheduling using hybrid cloud services, in accordance with an example embodiment of the present disclosure. Herein, the term ‘cloud services’ refers to the services and resources delivered by cloud vendors, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Function as a Service (FaaS), and Software as a Service (SaaS). The disclosed method enables workload scheduling using cloud services such as an ML platform and a serverless platform.
[035] Operations of the flowchart, and combinations of operations in the flowchart, may be implemented by various means, such as hardware, firmware, processor, circuitry and/or other device associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described in various embodiments may be embodied by computer program instructions. In an example embodiment, the computer program instructions, which embody the procedures described in various embodiments, may be stored by at least one memory device of a system and executed by at least one processor in the system. Any such computer program instructions may be loaded onto a computer or other programmable system (for example, hardware) to produce a machine, such that the resulting computer or other programmable system embodies means for implementing the operations specified in the flowchart. It will be noted herein that the operations of the method 200 are described with the help of the system 102. However, the operations of the method 200 can be described and/or practiced by using any other system.
[036] The method 200 for implementing SLA-aware job scheduling initiates by characterizing the inference workload on-premise in a controlled
environment. Herein, the controlled environment may refer to, for example, a laboratory environment where the working conditions associated with the workload as well as hardware resources can be controlled. At 202, the method 200 includes characterizing the inference workload apriori for workload scheduling on cloud in the on-premise controlled environment. Herein, the term ‘cloud’ refers to the cluster of computing system resources available over the internet. In an embodiment, characterizing the inference workload includes varying configuration of a plurality of hardware resources capable of serving the inference workload. In an embodiment, the hardware resources may include a number of cores, memory bandwidth, and network bandwidth required for serving the inference workload. For each hardware resource of the plurality of hardware resources, response time is measured for a plurality of sets of concurrent inference requests. An example of characterizing the inference workload is described further with reference to FIG. 3.
[037] FIG. 3 illustrates an example representation of an on-premise characterization of the model inference workload, in accordance with an example embodiment. As illustrated in FIG. 3, characterizing the inference workload apriori for the workload scheduling on cloud in the on-premise controlled environment includes varying configuration of hardware resources and determining the response time for serving a plurality of sets of concurrent requests. For example, for a set of concurrent requests and various hardware resources (such as 2 cores, 4 cores, 6 cores, 8 cores and 16 cores respectively), the response time for catering to the inference workload may be determined. Such variation of concurrent requests and hardware resources with response time may be recorded and stored in a repository, and may be further utilized for identifying the most appropriate configuration of the ML platform VM instance such as the number of cores and memory required. Moreover, the number of concurrent requests for a particular configuration may facilitate in identifying the maximum number of requests that can be served by model servers without violating the SLAs (indicated by the response time).
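By way of illustration, a minimal sketch of such a characterization sweep is given below, assuming a hypothetical send_concurrent_requests() helper that issues a given number of concurrent inference requests against the locally deployed model server pinned to a given number of cores and returns the observed per-request response times; the core counts, concurrency levels, and use of a 95th-percentile summary are illustrative assumptions rather than a prescribed implementation.

import statistics

# Hardware configurations and concurrency levels to sweep; illustrative values.
CORE_CONFIGS = [2, 4, 6, 8, 16]
CONCURRENCY_LEVELS = [1, 5, 10, 15, 20]

def characterize_workload(send_concurrent_requests):
    """Return a profile {(cores, concurrency): response time in ms}.

    send_concurrent_requests(cores, n) is a hypothetical helper that issues n
    concurrent inference requests against a model server pinned to `cores`
    CPU cores and returns the observed per-request response times in ms.
    """
    profile = {}
    for cores in CORE_CONFIGS:
        for n in CONCURRENCY_LEVELS:
            latencies = send_concurrent_requests(cores=cores, n=n)
            # Record a conservative (95th percentile) response time so that any
            # threshold derived later keeps a margin against SLA violations.
            profile[(cores, n)] = statistics.quantiles(latencies, n=20)[18]
    return profile

def max_requests_within_sla(profile, cores, sla_ms):
    """Largest concurrency that a `cores`-core instance serves within the SLA."""
    feasible = [n for (c, n), rt in profile.items() if c == cores and rt <= sla_ms]
    return max(feasible, default=0)

Given such a profile, a call such as max_requests_within_sla(profile, cores=8, sla_ms=150) would return the largest concurrency an 8-core instance can serve under a 150 ms SLA, which is the threshold used by the scheduling step described next.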
[038] Referring back to FIG. 2, at 204, the method 200 includes dynamically servicing a plurality of concurrent inference requests received in real-time based on the characterization of the inference workload. Herein,
dynamically servicing the plurality of concurrent inference requests includes determining an SLA associated with servicing of one or more concurrent inference requests from amongst the plurality of inference requests. As previously described, the SLA includes a response time for servicing the one or more concurrent inference requests. For example, for an enterprise application, the response time may be 200 milliseconds. It may be determined at 208 whether it is possible to service one or more concurrent inference requests from amongst the plurality of concurrent inference requests within the response time specified in the SLA by a VM instance based on an availability of hardware resources and a threshold number of requests serviceable by the VM instance of the ML platform without violating the SLA. If it is determined at 208 that the number of the one or more concurrent requests is greater than a threshold number of requests to maintain the response time within the SLA, then the method 200 includes optimally distributing the workload associated with the one or more concurrent requests between the VM instance and a serverless instance at 210. For example, if the number of concurrent requests exceeds the threshold number of requests that might result in SLA violations, then the additional requests may be redirected to the serverless platform.
[039] In an embodiment, one or more serverless instances may be created prior to optimally distributing the workload associated with the one or more concurrent requests. Further, subsequent one or more inference requests may be redirected to the one or more serverless instances to prevent SLA violation. In an embodiment, the disclosed system may include a load balancer (illustrated in FIG. 4) to optimally distribute the concurrent requests.
[040] FIG. 4 illustrates an example architecture 400 for load balancing by a load balancer 402 for service level agreement (SLA) aware workload scheduling using hybrid cloud services, in accordance with an example embodiment. The load balancer is equipped with information associated with the on-premise characterization 404 of the inference workload. On receipt of a new inference request 406, based on the on-premise characterization of the inference workload, the load balancer determines whether a number of inference requests to be serviced by the VM instance is less than a threshold number of requests. If the load balancer
determines that the number of inference requests to be serviced by the VM instance is less than the threshold number of requests, then the load balancer 402 directs the inference request 406 to the VM instance 408. If, however, the load balancer determines that the number of inference requests to be serviced by the VM instance is greater than or equal to the threshold number of requests, then the load balancer redirects the inference request to a serverless instance 410, thereby optimally distributing the inference request using the hybrid cloud services (including the ML platform VM instance and the serverless instance).
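By way of illustration, a minimal sketch of the routing rule applied by such a load balancer is given below, assuming the threshold has already been derived from the on-premise characterization for the chosen VM instance and SLA; the class and function names are hypothetical, and concurrency tracking is simplified to a single counter without thread safety.

class SlaAwareLoadBalancer:
    """Routes each inference request to the ML-platform VM endpoint while the
    VM's SLA-safe concurrency is not exceeded, and to a serverless instance
    otherwise (illustrative sketch only)."""

    def __init__(self, threshold, invoke_vm_endpoint, invoke_serverless):
        self.threshold = threshold              # max concurrent requests the VM can serve within the SLA
        self.invoke_vm_endpoint = invoke_vm_endpoint
        self.invoke_serverless = invoke_serverless
        self.in_flight_on_vm = 0                # requests currently being served by the VM instance

    def handle(self, request):
        if self.in_flight_on_vm < self.threshold:
            self.in_flight_on_vm += 1
            try:
                return self.invoke_vm_endpoint(request)    # served by the ML-platform VM within the SLA
            finally:
                self.in_flight_on_vm -= 1
        # Burst beyond the threshold: redirect to a serverless instance so the
        # VM is never pushed past its SLA-safe concurrency.
        return self.invoke_serverless(request)

In a practical deployment, the in-flight counter would typically be guarded by a lock or maintained by the serving front end, and the threshold would be looked up per instance type from the stored characterization data.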
[041] FIG. 5 is a block diagram of an exemplary computer system 501 for implementing embodiments consistent with the present disclosure. The computer system 501 may be implemented alone or in combination with components of the system 102 (FIG. 1). Variations of computer system 501 may be used for implementing the devices included in this disclosure. Computer system 501 may comprise a central processing unit (“CPU” or “hardware processor”) 502. The hardware processor 502 may comprise at least one data processor for executing program components for executing user- or system-generated requests. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD AthlonTM, DuronTM or OpteronTM, ARM’s application, embedded or secure processors, IBM PowerPCTM, Intel’s Core, ItaniumTM, XeonTM, CeleronTM or other line of processors, etc. The processor 502 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc. The processor 502 may be a multi-core multi-threaded processor.
[042] Processor 502 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 503. The I/O interface 503 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB),
infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11 a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
[043] Using the I/O interface 503, the computer system 501 may communicate with one or more I/O devices. For example, the input device 504 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.
[044] Output device 505 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc. In some embodiments, a transceiver 506 may be disposed in connection with the processor 502. The transceiver may facilitate various types of wireless transmission or reception. For example, the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
[045] In some embodiments, the processor 502 may be disposed in communication with a communication network 508 via a network interface 507. The network interface 507 may communicate with the communication network 508. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 508 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet,
etc. Using the network interface 507 and the communication network 508, the computer system 501 may communicate with devices 509 and 510. These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system 501 may itself embody one or more of these devices.
[046] In some embodiments, the processor 502 may be disposed in communication with one or more memory devices (e.g., RAM 1113, ROM 1114, etc.) via a storage interface 512. The storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
[047] The memory devices may store a collection of programs or database components, including, without limitation, an operating system 516, user interface application 517, user/application data 518 (e.g., any data variables or data records discussed in this disclosure), etc. The operating system 516 may facilitate resource management and operation of the computer system 501. Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like. User interface 517 may facilitate display, execution, interaction, manipulation, or operation of program components through
textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 501, such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems’ Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
[048] In some embodiments, computer system 501 may store user/application data 518, such as the data, variables, records, etc. as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among various computer systems discussed above. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
[049] Additionally, in some embodiments, the server, messaging and instructions transmitted or received may emanate from hardware, including the operating system, and program code (i.e., application code) residing in a cloud implementation. Further, it should be noted that one or more of the systems and methods provided herein may be suitable for cloud-based implementation. For example, in some embodiments, some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
Example scenario:
[050] To evaluate the disclosed system and method, an in-house recommender system called NISER was utilized. NISER uses a Graph Neural Network (GNN) based model and makes recommendations for the next product based on the user’s actions, like product or item clicks in the past. The training data
set used by NISER consists of past sessions of the item or product clicks by the user. During inference, the sequence of products clicked in the current session is used to recommend top n products to the user.
[051] ML platform SageMakerTM APIs were used for deploying the NISER model for inference. The training data was saved on AWS storage S3 and downloaded by the train API of SageMaker for training. On completion of training, the model was saved back on S3. During deployment, the model was downloaded from AWS S3 and deployed using the deploy API on the chosen ML instance. The NISER model was hosted using the PyTorch model server that runs inside the SageMaker endpoint.
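By way of illustration, a simplified sketch of this flow using the publicly documented SageMaker Python SDK is given below; the script name, framework version, instance types, S3 paths, and IAM role are placeholders and may differ from the exact configuration used in the example scenario.

from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"   # placeholder IAM role

# Train on data stored in S3; the trained model artifact is written back to S3.
estimator = PyTorch(
    entry_point="train_niser.py",        # hypothetical training script
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    framework_version="1.8.1",           # illustrative PyTorch version
    py_version="py3",
)
estimator.fit({"training": "s3://<bucket>/niser/train/"})

# Deploy the trained model behind a SageMaker endpoint (PyTorch model server).
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.c5.large")

# Inference: the payload format depends on the model-serving script and the
# configured serializer; a plain click sequence is shown only for illustration.
recommendations = predictor.predict([101, 57, 998])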
[052] The predict API was used for model inference. AWS Lambda was used as a serverless platform for deploying the deep learning NISER model. The pre-trained model was saved on AWS storage service S3. When the first request was received by the Lambda function, the model was loaded from S3 into its memory. Concurrent requests were served by spawning an equal number of Lambda instances. The configuration of each instance was based on the characterization of the workload. The CPU version of PyTorch was used, which requires only 200 MB of space and hence satisfies the storage constraint of Lambda.
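By way of illustration, a minimal sketch of such a Lambda handler is given below, assuming the pre-trained model is stored as a TorchScript artifact on S3 and is loaded lazily on the first (cold) invocation of an instance; the bucket name, object key, and request format are illustrative assumptions.

import json

import boto3
import torch

s3 = boto3.client("s3")
MODEL = None  # cached across warm invocations of the same Lambda instance

def _load_model():
    """Download and load the model on the first (cold) invocation only."""
    global MODEL
    if MODEL is None:
        s3.download_file("<model-bucket>", "niser/model.pt", "/tmp/model.pt")   # placeholder bucket/key
        MODEL = torch.jit.load("/tmp/model.pt", map_location="cpu")             # CPU-only PyTorch build
        MODEL.eval()
    return MODEL

def lambda_handler(event, context):
    model = _load_model()
    # Assumes an API Gateway style event whose body carries the click sequence
    # of the current session; the request format is an illustrative assumption.
    clicks = torch.tensor(json.loads(event["body"])["session_clicks"])
    with torch.no_grad():
        scores = model(clicks.unsqueeze(0))          # shape: (1, number_of_items)
    top_n = torch.topk(scores, k=10).indices.squeeze(0).tolist()
    return {"statusCode": 200, "body": json.dumps({"recommendations": top_n})}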
[053] Experiments were conducted to evaluate the efficacy of the disclosed system and method. The on-premise model was deployed on an IntelTM Xeon 1359v2@2.45 GHz machine with 28 physical cores configured with 256 GB of memory. For the experiments with SageMakerTM and LambdaTM, the SageMaker Jupyter notebook instance was used as a load generator. The SageMaker, Lambda, and load generator instances were allocated from only ap-south-1 geography.
[054] FIG. 3 shows the on-premise characterization of the inference workload. This Figure shows the response time of the inference requests as we increase the concurrency and the capacity of the on-premise instance. One important conclusion that can be drawn from this data is that for one request, i.e., concurrency level one, there is a significant improvement in the response time as we increase the number of cores from 2 to 4. However, increasing the number of
cores further does not provide any improvement in the response time. Hence, each serverless instance that serves one inference request at a time shall be configured with 4 cores. Although the maximum number of cores available per instance on AWS Lambda with 10 GB of memory is 6, based on the on-premise observations, the number of cores was restricted to 4 on the Lambda instance. This results in an approximately 35% reduction in the cost without compromising the performance. This is because the cost of using Lambda is directly proportional to the amount of memory, and hence the cores, reserved in each instance.
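By way of a back-of-the-envelope check, assuming Lambda cost is proportional to the configured memory and that CPU scales with memory at roughly one vCPU per about 1769 MB, restricting an instance to 4 cores instead of the 6 cores available at the 10 GB maximum yields a saving of roughly 30-35%, consistent with the figure reported above:

MB_PER_VCPU = 1769                      # approximate memory at which Lambda grants one full vCPU (assumption)
mem_for_6_cores = 10240                 # 10 GB, the maximum Lambda memory (about 6 vCPUs)
mem_for_4_cores = 4 * MB_PER_VCPU       # about 7076 MB is sufficient for roughly 4 vCPUs

saving = 1 - mem_for_4_cores / mem_for_6_cores
print(f"approximate cost reduction: {saving:.0%}")   # roughly 31%, in the range of the ~35% reported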
[055] Also, using the characterization data in FIG. 3, the maximum number of concurrent requests that can be served by an ML instance without violating the response time SLA can be determined. For example, if the model is deployed on an ml.c5.2xlarge machine (8 cores) and the SLA-defined response time is a maximum of 150 ms, then not more than 5 concurrent requests can be served by the instance. Any additional inference requests must be diverted to Lambda.
[056] In this experiment, a sample bursty workload was used where the number of concurrent requests varies between 5 and 20 (FIG. 6), and the maximum allowed response time for an inference request is 200 ms. As a first step, using the characterization data in FIG. 3, the maximum number of requests that can be served by any ML instance without violating the response time constraint can be determined. Then the disclosed recommender system model was deployed on an ml.c5.large (2 core) ML instance of SageMaker, which can serve a maximum of 5 concurrent requests with a response time of less than 200 milliseconds. As shown in FIG. 7, when the highly variable workload is run on this ML instance, the response time varies widely, violating the SLA. However, when the disclosed load balancer redirected the requests resulting in concurrency of more than 5 to Lambda, SLA violations were reduced significantly, as shown in FIG. 8. Also, the requests that were sent to Lambda were served under the SLA response time limit of 200 milliseconds, as shown in FIG. 9.
[057] In another experiment with the same sample inference workload, the response time requirement was changed to 150 ms. Again, based on the workload characterization, it was observed that an ml.c5.2xlarge (8 cores) SageMaker ML
instance could serve 5 concurrent requests with a response time of less than 150 milliseconds. FIG. 10 shows that the response time increases significantly for any short burst of more than 5 concurrent requests. However, when Lambda instances were used for serving all requests above concurrency 5, a response time of less than 150 milliseconds was observed for the requests served by the ML instance (FIG. 11) and Lambda (FIG. 12).
[058] This study shows that on-premise workload characterization can be a useful technique for cost-efficient and SLA-aware inference workload scheduling on an ML platform such as SageMaker and a serverless platform such as Lambda.
[059] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[060] Various embodiments disclosed herein provide a method and system for service level agreement (SLA) aware workload scheduling using hybrid cloud services for load balancing to meet the performance requirements of an application. Existing autoscaling techniques are based on upscaling and downscaling cloud resources to distribute the dynamically varying workloads. However, bursty workloads pose many challenges for auto-scaling and sometimes result in Service Level Agreement (SLA) violations. Furthermore, over-provisioning or under-provisioning cloud resources to address dynamically evolving workloads results in performance degradation and cost escalation. The disclosed method and system present a workload characterization-based approach for scheduling the bursty workload on a highly scalable serverless architecture in conjunction with a machine learning (ML) platform. The disclosed system includes a load balancer that facilitates in load balancing the inference workload to avoid SLA violations by using an on-premise characterization of the inference workload apriori in a controlled environment. The characterization of the workload facilitates in distributing the load optimally between the ML platform and serverless instances. The ML platform uses dedicated virtual machines (VMs) as endpoints for serving deep models, in conjunction with a serverless platform for serving the bursty traffic. Further, the workload characterization technique facilitates in cost-effective scheduling of the inference workload on these cloud services.
[061] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[062] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[063] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[064] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[065] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A processor implemented method (200) for service level agreement (SLA) aware workload scheduling on cloud, comprising:
characterizing (202), via one or more hardware processors, an inference workload apriori for workload scheduling on the cloud in an on-premise controlled environment, wherein characterizing the inference workload comprises varying configuration of a plurality of hardware resources capable of serving the inference workload, and measuring response time for each hardware resource of the plurality of hardware resources for a plurality of sets of concurrent inference requests; and
dynamically servicing (204) a plurality of concurrent inference requests received in real-time based on the characterization of the inference workload, via the one or more hardware processors, wherein dynamically servicing the plurality of concurrent inference requests comprises:
determining (206) an SLA associated with servicing of one or more concurrent inference requests from amongst the plurality of inference requests, wherein the SLA comprising a response time predefined for servicing the one or more concurrent inference requests;
determining (208) whether it is possible to service the one or more concurrent inference requests within the response time specified in the SLA by a virtual machine (VM) instance based on an availability of hardware resources and a threshold number of requests serviceable by the VM instance; and
optimally distributing (210) the workload associated with the one or more concurrent requests between the VM instance and a serverless instance on determination of a number of the one or more concurrent requests more than a threshold number of requests to maintain the response time within the SLA.
2. The processor implemented method of claim 1, further comprises creating one or more serverless instances prior to optimally distributing the workload associated with the one or more concurrent requests.

3. The processor implemented method of claim 2, further comprises redirecting subsequent one or more inference requests to the one or more serverless instances to prevent service level agreement (SLA) violation.
4. The processor implemented method of claim 1, wherein the configuration of the plurality of hardware resources comprises number of cores and memory required for serving the inference workload.
5. A system (501) for service level agreement (SLA) aware workload scheduling on cloud, comprising:
a memory (515) storing instructions;
one or more communication interfaces (503); and
one or more hardware processors (502) coupled to the memory (515) via the one or more communication interfaces (503), wherein the one or more hardware processors (502) are configured by the instructions to:
characterize an inference workload a priori for workload scheduling on the cloud in an on-premise controlled environment, wherein to characterize the inference workload, the one or more hardware processors are configured by the instructions to vary configuration of a plurality of hardware resources capable of serving the inference workload, and measure response time for each hardware resource of the plurality of hardware resources for a plurality of sets of concurrent inference requests; and
dynamically service a plurality of concurrent inference requests received in real-time based on the characterization of the inference workload, wherein to dynamically service the plurality of concurrent inference requests, the one or more hardware processors are configured by the instructions to:
determine an SLA associated with servicing of one or more concurrent inference requests from amongst the plurality of inference requests, the SLA comprising a response time predefined for servicing the one or more concurrent inference requests;
determine whether it is possible to service the one or more concurrent inference requests within the response time specified in the SLA by a virtual machine (VM) instance based on an availability of hardware resources and a threshold number of requests serviceable by the VM instance; and
optimally distribute the workload associated with the one or more concurrent requests between the VM instance and a serverless instance on determination of the number of the one or more concurrent requests greater than a threshold number of requests to maintain the response time within the SLA.
6. The system of claim 5, wherein the one or more hardware processors are configured by the instructions to create one or more serverless instances prior to optimally distributing the workload associated with the one or more concurrent requests.
7. The system of claim 6, wherein the one or more hardware processors are further configured by the instructions to redirect subsequent one or more inference requests to the one or more serverless instances to prevent service level agreement (SLA) violation.
8. The system of claim 5, wherein the configuration of the plurality of hardware resources comprises number of cores and memory required for serving the inference workload.

Documents

Application Documents

# Name Date
1 202121027383-STATEMENT OF UNDERTAKING (FORM 3) [18-06-2021(online)].pdf 2021-06-18
2 202121027383-REQUEST FOR EXAMINATION (FORM-18) [18-06-2021(online)].pdf 2021-06-18
3 202121027383-PROOF OF RIGHT [18-06-2021(online)].pdf 2021-06-18
4 202121027383-FORM 18 [18-06-2021(online)].pdf 2021-06-18
5 202121027383-FORM 1 [18-06-2021(online)].pdf 2021-06-18
6 202121027383-FIGURE OF ABSTRACT [18-06-2021(online)].jpg 2021-06-18
7 202121027383-DRAWINGS [18-06-2021(online)].pdf 2021-06-18
8 202121027383-DECLARATION OF INVENTORSHIP (FORM 5) [18-06-2021(online)].pdf 2021-06-18
9 202121027383-COMPLETE SPECIFICATION [18-06-2021(online)].pdf 2021-06-18
10 202121027383-FORM-26 [22-10-2021(online)].pdf 2021-10-22
11 Abstract1..jpg 2021-12-02
12 202121027383-FER.pdf 2023-02-16
13 202121027383-OTHERS [30-06-2023(online)].pdf 2023-06-30
14 202121027383-FER_SER_REPLY [30-06-2023(online)].pdf 2023-06-30
15 202121027383-CLAIMS [30-06-2023(online)].pdf 2023-06-30

Search Strategy

1 SearchHistory(27)AE_23-01-2024.pdf
2 202121027383ME_15-02-2023.pdf