Abstract: The present disclosure relates to a system (100) for determining a health status of a plurality of computing nodes, the system comprising a plurality of computing nodes, each comprising a monitoring agent (110) configured to monitor, at a time interval, a plurality of threads of an application executed in the plurality of computing nodes and to monitor resource attributes of memory and processor. The monitoring agent can compute a health status of the application by detecting an unresponsive state of any or a combination of the plurality of threads and the resource attributes of memory and processor monitored by the monitoring agent, wherein, upon detection of the unresponsive state, the monitoring agent is configured to recover the unresponsive state of the plurality of threads to sustain the uninterrupted operations of the system.
Claims:
1. A system (100) for determining a health status of a plurality of computing nodes, the system comprising:
a display (102) coupled to the plurality of computing nodes over a network switch (104), each of the plurality of computing nodes (106) comprising a monitoring agent (110) configured to:
monitor, at a time interval, a plurality of threads of an application executed in the plurality of computing nodes;
monitor resource attributes of memory and processor for each of the plurality of computing nodes by determining the service level agreement (SLA) limit of the resource attributes; and
compute a health status of the application by detection of an unresponsive state of any or a combination of the plurality of threads and the resource attributes of memory and processor, wherein, upon detection of the unresponsive state, the monitoring agent (110) is configured to recover the unresponsive state of the plurality of threads to sustain the uninterrupted operations of the system.
2. The system as claimed in claim 1, wherein the unresponsive state of the plurality of threads is caused by any or a combination of abrupt failure, resource deadlock, and ambiguous operational behaviour.
3. The system as claimed in claim 1, wherein the monitoring agent (110) declares the health status of the plurality of threads as faulty when the resource attributes of the memory and processor breach a pre-set value of the SLA limit.
4. The system as claimed in claim 1, wherein a monitor thread (118) in the plurality of computing nodes is created to monitor the execution of the plurality of threads by using a monitor thread map (122), wherein each of the plurality of threads comprises a thread identifier (ID) that is updated in the monitor thread map based on the execution time interval.
5. The system as claimed in claim 1, wherein the monitor thread (118) accesses the monitor thread map (122) to detect the presence of a thread ID in the monitor thread map, wherein, upon detection of the thread ID, the monitor thread declares the plurality of threads as healthy in the time interval.
6. The system as claimed in claim 1, wherein, upon detection of a blocked thread ID, the monitor thread reports the blocked thread ID to a recovery thread.
7. The system as claimed in claim 1, wherein the recovery thread in the plurality of computing nodes is created to recover a failed thread of the plurality of threads reported by the monitor thread, wherein the recovery thread recovers by cancelling the failed thread of the plurality of threads and restoring the previous state to start a new thread within a predefined time interval.
8. The system as claimed in claim 1, wherein the monitoring agent (110) is operatively coupled to data distribution services (DDS) (116) to store internal state on a DDS memory device for reporting the health status of the plurality of threads.
9. The system as claimed in claim 1, wherein the monitoring agent (110) receives a health status of any or a combination of periodic and non-periodic threads of the plurality of threads through the DDS.
10. A method (400) for determining a health status of a plurality of computing nodes, the method comprising:
monitoring (402), by a monitoring agent, at a time interval, a plurality of threads of an application executed in a plurality of computing nodes, each of the plurality of computing nodes comprising the monitoring agent, the plurality of computing nodes coupled to a display over a network switch;
monitoring (404), by the monitoring agent, resource attributes of memory and processor for each of the plurality of computing nodes by determining the service level agreement (SLA) limit of the resource attributes; and
computing (406), by the monitoring agent, a health status of the application by detection of an unresponsive state of any or a combination of the plurality of threads and the resource attributes of memory and processor, wherein, upon detection of the unresponsive state, the monitoring agent is configured to recover (408) the unresponsive state of the plurality of threads to sustain the uninterrupted operations of the system.
Description:
TECHNICAL FIELD
[0001] The present disclosure relates, in general, to a distributed system, and more specifically, relates to a system and method to detect behaviour of system applications and recovery for distributed systems.
BACKGROUND
[0002] A distributed system is a combination of hardware and software, which runs cohesively to perform system operations. The machines that are part of the distributed system may be computers, physical servers, virtual machines, or any other nodes connected within the network. Moreover, these machines are equipped with system and application software. The software includes monitoring agents, services, multi-threaded applications (front-end and back-end), drivers, and embedded applications. The responsive behaviour of these hardware and software subsystems plays an important role in achieving the common objective of the system. The system applications, in particular, are the key components whose unresponsive behaviour significantly affects the overall system performance and sometimes leads to failure in achieving the system's goal.
[0003] One existing approach addresses soft hangs by combining manual testing of program execution with static analysis of source code to detect and analyse blocking function calls. Its runtime recovery involves manual intervention (pressing a hotkey to invoke a recovery module), which is not an ideal solution for back-end server applications. Another existing system describes automated hang detection for Java threads, in which a deadlock is analysed by taking a snapshot of the threads in the JVM for offline analysis. However, this solution is suitable only for post-analysis; no runtime recovery mechanisms are suggested. Yet another system provides fast data race detection for multi-core systems by tracing all shared memory accesses; every data access operation is instrumented, which is an overhead. This method creates one detection thread for every shared memory block accessed by application threads and one detection analysis thread, which may slow down the system, and again no runtime recovery mechanisms are proposed.
[0004] Traditionally, legacy subsystems do have watchdogs and monitoring agents; however, they fall short in monitoring and detecting unresponsive behaviour of system applications. The reasons for the failure of an application are numerous, such as abrupt failure, resource deadlock, ambiguous operational behaviour, and the like. High memory and CPU resource usage can also make the application unresponsive and lead to system degradation.
[0005] Therefore, there is a need in the art for an automated method that detects unresponsive behaviour of system applications and recovers the failed ones at runtime to sustain uninterrupted system operations.
OBJECTS OF THE PRESENT DISCLOSURE
[0006] An object of the present disclosure relates, in general, to a distributed system, and more specifically, relates to a system and method to detect behaviour of system applications and recovery for distributed systems.
[0007] Another object of the present disclosure is to provide a system that detects unresponsive behaviour of system applications.
[0008] Another object of the present disclosure is to provide a system that recovers a failed task at runtime to sustain uninterrupted system operations.
[0009] Another object of the present disclosure is to automate detection of unresponsive system applications by monitoring the health of worker tasks through the monitoring agent.
[0010] Another object of the present disclosure is to provide run-time recovery by invoking a recovery method to ensure high system availability.
[0011] Yet another object of the present disclosure is to detect application unresponsive behaviour caused by continuously high memory and CPU usage.
SUMMARY
[0012] The present disclosure relates, in general, to a distributed system, and more specifically, relates to a system and method to detect behaviour of system applications and recovery for distributed systems. The distributed system uses a combination of various computing nodes on which multiple applications run to perform a common objective. These applications can become unresponsive because of abrupt failure, resource deadlock, ambiguous operational behaviour, and the like. High memory or CPU usage may also lead to complete application/computing system failure or partial task blocking/hanging. The present disclosure calculates application health based on working thread status and overall application CPU and memory usage monitored by the monitoring agent, and recovers the system by application restart or thread recovery.
[0013] In an aspect, the present disclosure provides a system for determining a health status of a plurality of computing nodes, the system including a display coupled to the plurality of computing nodes over a network switch, each of the plurality of computing nodes comprising a monitoring agent configured to: monitor, at a time interval, a plurality of threads of an application executed in the plurality of computing nodes; monitor resource attributes of memory and processor for each of the plurality of computing nodes by determining the service level agreement (SLA) limit of the resource attributes; and compute a health status of the application by detection of an unresponsive state of any or a combination of the plurality of threads and the resource attributes of memory and processor, wherein, upon detection of the unresponsive state, the monitoring agent is configured to recover the unresponsive state of the plurality of threads to sustain the uninterrupted operations of the system.
[0014] In an embodiment, the unresponsive state of the plurality of threads is caused by any or a combination of abrupt failure, resource deadlock, and ambiguous operational behaviour.
[0015] In another embodiment, the monitoring agent declares the health status of the plurality of threads as faulty when the resource attributes of the memory and processor breach a pre-set value of the SLA limit.
[0016] In another embodiment, a monitor thread in the plurality of computing nodes is created to monitor the execution of the plurality of threads by using a monitor thread map, wherein each of the plurality of threads comprises a thread identifier (ID) that is updated in the monitor thread map based on the execution time interval.
[0017] In another embodiment, the monitor thread accesses the monitor thread map to detect the presence of a thread ID in the monitor thread map, wherein, upon detection of the thread ID, the monitor thread declares the plurality of threads as healthy in the time interval.
[0018] In another embodiment, upon detection of a blocked thread ID, the monitor thread reports the blocked thread ID to a recovery thread.
[0019] In another embodiment, the recovery thread in the plurality of computing nodes is created to recover a failed thread of the plurality of threads reported by the monitor thread, wherein the recovery thread recovers by cancelling the failed thread of the plurality of threads and restoring the previous state to start a new thread within a predefined time interval.
[0020] In another embodiment, the monitoring agent is operatively coupled to data distribution services (DDS) to store internal state on a DDS memory device for reporting the health status of the plurality of threads.
[0021] In another embodiment, the monitoring agent receives a health status of any or a combination of periodic and non-periodic threads of the plurality of threads through the DDS.
[0022] In an aspect, the present disclosure provides a method for determining a health status of a plurality of computing nodes, the method comprising: monitoring, by a monitoring agent, at a time interval, a plurality of threads of an application executed in a plurality of computing nodes, each of the plurality of computing nodes comprising the monitoring agent, the plurality of computing nodes coupled to a display over a network switch; monitoring, by the monitoring agent, resource attributes of memory and processor for each of the plurality of computing nodes by determining the service level agreement (SLA) limit of the resource attributes; and computing, by the monitoring agent, a health status of the application by detection of an unresponsive state of any or a combination of the plurality of threads and the resource attributes of memory and processor, wherein, upon detection of the unresponsive state, the monitoring agent is configured to recover the unresponsive state of the plurality of threads to sustain the uninterrupted operations of the system.
[0023] Various objects, features, aspects, and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] The following drawings form part of the present specification and are included to further illustrate aspects of the present disclosure. The disclosure may be better understood by reference to the drawings in combination with the detailed description of the specific embodiments presented herein.
[0025] FIG. 1A illustrates an exemplary representation of distributed system with different sub-system and computing nodes, in accordance with an embodiment of the present disclosure.
[0026] FIG. 1B illustrates an exemplary functional component of a computing node, in accordance with an embodiment of the present disclosure.
[0027] FIG. 1C illustrates an exemplary block diagram of multi-thread application with internal and external interface, in accordance with an embodiment of the present disclosure.
[0028] FIG. 2A is an exemplary flow chart illustrating a method of monitoring performed by monitoring agent, in accordance with an embodiment of the present disclosure.
[0029] FIG. 2B illustrates a schematic view of periodic thread, monitor thread and recovery thread execution time, in accordance with an embodiment of the present disclosure.
[0030] FIG. 2C is an exemplary flow chart depicting a method of periodic thread monitoring, recovery and action on thread health status, in accordance with an embodiment of the present disclosure.
[0031] FIG. 3A illustrates an exemplary block diagram representing event-based thread, monitor thread and recovery thread execution time, in accordance with an embodiment of the present disclosure.
[0032] FIG. 3B is an exemplary flow chart depicting a method of non-periodic thread monitoring, recovery and action on thread health status, in accordance with an embodiment of the present disclosure.
[0033] FIG. 4 illustrates an exemplary flow chart of a method for determining a health status of plurality of computing nodes, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION
[0034] The following is a detailed description of embodiments of the disclosure depicted in the accompanying drawings. The embodiments are in such detail as to clearly communicate the disclosure. If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.
[0035] As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
[0036] The present disclosure relates, in general, to a distributed system, and more specifically, relates to a system and method to detect behaviour of system applications and recovery for distributed systems. The distributed system uses a combination of various computing nodes on which multiple applications run to perform a common objective. These applications can become unresponsive because of abrupt failure, resource deadlock, ambiguous operational behaviour, and the like. High memory or CPU usage may also lead to complete application/computing system failure or partial task blocking/hanging. The present disclosure calculates application health based on working thread status and overall application CPU and memory usage monitored by the monitoring agent, and recovers the system by application restart or thread recovery. The present disclosure is described in enabling detail in the following examples, which may represent more than one embodiment of the present disclosure.
[0037] FIG. 1A illustrates an exemplary representation of distributed system with different sub-system and computing nodes, in accordance with an embodiment of the present disclosure.
[0038] Referring to FIG. 1A, the distributed system 100 (also referred to as the system 100 herein) is monitored by using agents distributed in the one or more computing nodes 106-1 to 106-n (collectively referred to as computing nodes 106 hereinafter). The system 100 can include a tactical display 102 connected over a network switch 104 to receive data from the one or more computing nodes 106. Operations can be executed on each of the one or more computing nodes 106 to complete requested processes and operations. The present disclosure can detect the unresponsive behaviour of the system applications and can recover the failed state at runtime to sustain the uninterrupted operations of the system.
[0039] The present disclosure periodically scans each individual process running in the multi-process application under observation via its process monitoring feature. The process may be a thread, task, subroutine, or any other critical segment of a process or routine. The terms process and thread are used interchangeably throughout the present disclosure. The process monitoring aspect of the method monitors all the process dependencies and checks their running status to declare the process as healthy, and does the same for all such processes in the application. The cumulative response of the monitoring task finally decides the responsive behaviour of the application in the subsystem. If any process or task is found to be unhealthy, the overall responsiveness of the application remains suspect with respect to system performance.
[0040] In another embodiment, the present disclosure describes a recovery method that recovers the failed task or application at runtime by cancelling the failed or stopped task and starting a replica of the same within a predefined time span. If the application is critical to the overall system behaviour and cannot afford the delay incurred in task recovery, the method declares the task and its parent application as dead and marks them as faulty.
[0041] The declaration of a faulty application depends upon the heartbeat mechanism of process monitoring. The recovery process checks for the application heartbeat, and if it is not received within a stipulated time, the recovery process closes the faulty application and starts a fresh one. The process monitoring component of the proposed method also tracks the memory and CPU consumed by the application under observation with reference to the service level agreement (SLA) pre-set for all applications. For an application that breaches the set SLA, the proposed method generates an alert, logs the event, shuts down the application, and starts a fresh instance of the same.
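By way of illustration only, the SLA check and heartbeat timeout described above might be sketched as follows in Python, assuming the agent reads per-process statistics through the psutil library; the limit values, timeout, and function names are illustrative assumptions and are not fixed by the present disclosure.

```python
import time

import psutil  # assumed here for reading per-process CPU/memory statistics

# Illustrative SLA limits; the actual pre-set values are left open by the disclosure.
SLA_CPU_PERCENT = 80.0
SLA_MEMORY_MB = 512.0

def breaches_sla(pid: int) -> bool:
    """Return True if the monitored application exceeds its pre-set SLA limits."""
    proc = psutil.Process(pid)
    cpu = proc.cpu_percent(interval=1.0)             # % of one CPU measured over 1 s
    mem_mb = proc.memory_info().rss / (1024 * 1024)  # resident memory in MB
    return cpu > SLA_CPU_PERCENT or mem_mb > SLA_MEMORY_MB

def heartbeat_missing(last_heartbeat_s: float, timeout_s: float = 10.0) -> bool:
    """The heartbeat is treated as lost if no update arrived within the timeout."""
    return (time.time() - last_heartbeat_s) > timeout_s
```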
[0042] FIG. 1B illustrates an exemplary functional component of a computing node, in accordance with an embodiment of the present disclosure. As illustrated in FIG. 1B, each of the one or more computing nodes 106 can include different components such as services 108, a monitoring agent 110, a multi-threaded application 112, and system drivers 114. Those skilled in the art will recognize that the computing nodes 106 may comprise various combinations of hardware and software components; in some embodiments, the various nodes of the distributed system may be implemented entirely in software, be partially or wholly virtualized, or run directly on hardware.
[0043] In an embodiment, the monitoring agent 110 can be deployed to each of the one or more computing nodes 106 in a cluster. The monitoring agent 110 collects statistics such as overall application state, memory usage, and CPU usage for health calculations and status. In another embodiment, the monitoring agent 110 can monitor, at a time interval, the one or more threads of the application executed in the one or more computing nodes 106. The monitoring agent 110 can monitor resource attributes of memory and processor/CPU for each of the one or more computing nodes 106 by determining the service level agreement (SLA) limit of the resource attributes.
[0044] The monitoring agent 110 can compute the health status of the application by detecting an unresponsive state of any or a combination of the plurality of threads and the resource attributes of memory and processor, where, upon detection of the unresponsive state, the monitoring agent 110 is configured to recover the unresponsive state of the one or more threads to sustain the uninterrupted operations of the system. The monitoring agent 110 can declare the health status of the one or more threads as faulty when the resource attributes of the memory and processor breach a pre-set value of the SLA limit. The unresponsive state of the one or more threads is caused by any or a combination of abrupt failure, resource deadlock, and ambiguous operational behaviour.
[0045] FIG. 1C illustrates an exemplary block diagram of a multi-thread application with internal and external interfaces, in accordance with an embodiment of the present disclosure. As depicted in FIG. 1C, the monitoring agent 110 of each of the one or more computing nodes 106 is configured to monitor the multi-thread application 112 by receiving a periodic application health status through a data distribution service (DDS) communication layer 116, where the DDS communication layer 116 can be operatively coupled to the monitoring agent 110.
[0046] In an embodiment, the monitoring agent 110 is the node-level manager, responsible for spawning applications and for monitoring application health status and statistics such as CPU, memory, hard disk, temperature, and the like. The monitoring agent 110 can shut down the application and re-spawn a new instance when the determined application health status becomes faulty or the application's CPU or memory usage breaches the SLA limit. The DDS 116 is communication middleware used to store internal state on the DDS 116 cloud (also interchangeably referred to as a memory device) and to report the application health status.
[0047] In another embodiment, the present disclosure provides a monitoring method in which a monitor thread 118 is created to monitor periodic and non-periodic worker threads (also interchangeably referred to as the one or more threads of the application) and to identify a hung thread exhibiting any or a combination of unresponsive behaviour and resource deadlock. The monitor thread 118 can be created by the multi-thread application 112 to monitor its internal worker threads T1, T2, T3, and Tn by using a monitor thread map 122. The application also creates a recovery thread 120 to recover its unresponsive threads, whose state is reported by the monitor thread 118.
[0048] The monitor thread 118 in the one or more computing nodes 106 is created to monitor the execution of the one or more threads by using the monitor thread map 122, where each of the one or more threads comprises a thread identifier (ID) that is updated in the monitor thread map 122 based on the execution time interval. The monitor thread 118 can access the monitor thread map 122 to detect the presence of a thread ID in the monitor thread map 122, where, upon detection of the thread ID, the monitor thread 118 can declare the one or more threads as healthy in the time interval. Similarly, upon detection of a blocked thread ID, the monitor thread 118 reports the blocked thread ID to a recovery thread 120.
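The monitor thread map 122 can be pictured with a minimal sketch along the following lines, assuming a timestamp is stored per thread ID; the class and method names are hypothetical and used only for illustration.

```python
import threading
import time

class MonitorThreadMap:
    """Shared map of thread ID -> last update time, guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_seen: dict[int, float] = {}

    def update(self, thread_id: int) -> None:
        """Called by a worker thread at each execution interval."""
        with self._lock:
            self._last_seen[thread_id] = time.monotonic()

    def blocked_threads(self, interval_s: float) -> list[int]:
        """Thread IDs whose last update is older than their execution interval."""
        now = time.monotonic()
        with self._lock:
            return [tid for tid, seen in self._last_seen.items()
                    if now - seen > interval_s]
```

A worker thread would call, for example, monitor_map.update(threading.get_ident()) once per cycle, while the monitor thread 118 periodically calls blocked_threads() and hands any returned IDs to the recovery thread 120.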
[0049] In another embodiment, the recovery method creates the recovery thread 120, which is responsible for recovering a hung thread reported by the monitor thread 118, where the recovery thread 120 can cancel the hung thread, restore its previous state, start a fresh thread, and notify the monitor thread 118 of the current thread status.
[0050] For example, if all the worker threads are operational, the monitor thread 118 publishes/declares a heartbeat message marking the application as healthy, and if any thread is unrecoverable, the monitor thread 118 can publish/declare the application health status as faulty. The application is considered healthy if it is functioning in a normal operating state. If a faulty or “unhealthy” component is identified, the system monitor can take action to correct the fault in the application and thereby return the system to correct operation.
[0051] The embodiments of the present disclosure described above provide several advantages. One or more of the embodiments provide a system 100 that detects unresponsive behaviour of system applications. The system 100 can recover a failed task at runtime to sustain the uninterrupted operations of the system. The present disclosure automates detection of system application unresponsiveness by monitoring the health of worker tasks through the monitoring agent. The present disclosure provides run-time recovery by invoking a recovery method to ensure high system availability and detects application unresponsive behaviour caused by continuously high memory and CPU usage.
[0052] FIG. 2A is an exemplary flow chart illustrating a method of monitoring performed by monitoring agent, in accordance with an embodiment of the present disclosure.
[0053] The method 200 includes the monitoring performed by the monitoring agent 110 to detect an application hang state. At block 202, the monitoring agent 110 can spawn the applications and start monitoring any or a combination of application health status, CPU resources, and memory resources. At block 204, once the application is spawned, the monitoring agent 110 checks whether the application SLA is within the set limit. If the SLA is within the set limit, the flow proceeds from block 204 to block 206, where the monitoring agent 110 starts monitoring the health status of the application. At block 208, if the health status of the application received is good, the flow proceeds to block 210, where the health status of the application is updated and the monitoring agent continues to monitor the health of the application. If the SLA is breached at block 204, or the health status received at block 208 is faulty, the flow leads to block 212, where the monitoring agent 110 shuts down the application and starts a fresh instance. In an embodiment, the multi-thread application 112 developer generally uses two kinds of threads, periodic and non-periodic, for identification of a hung thread, as described below.
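Before turning to the periodic and non-periodic cases, the supervision loop of FIG. 2A might look roughly like the following sketch, assuming the application is spawned as a child process; breaches_sla and get_health_status are placeholders standing in for the SLA check sketched earlier and for the health report received over DDS, respectively.

```python
import subprocess
import time

def breaches_sla(pid: int) -> bool:
    """Placeholder for the SLA check sketched earlier (CPU/memory vs. pre-set limits)."""
    return False

def get_health_status(pid: int) -> str:
    """Hypothetical stand-in for the health status the application reports over DDS."""
    return "healthy"

def supervise(app_cmd: list[str], poll_interval_s: float = 5.0) -> None:
    """Spawn the application and restart it when it turns faulty (blocks 202-212, sketch)."""
    proc = subprocess.Popen(app_cmd)                  # block 202: spawn the application
    while True:
        time.sleep(poll_interval_s)
        if breaches_sla(proc.pid) or get_health_status(proc.pid) == "faulty":
            proc.terminate()                          # block 212: shut down the faulty instance
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()                           # force-stop if graceful shutdown stalls
            proc = subprocess.Popen(app_cmd)          # ...and start a fresh instance
        # block 210: otherwise the status is updated and monitoring simply continues
```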
[0054] In the periodic thread case, the monitor method is used to monitor the execution of the periodic working threads of the application. FIG. 2B illustrates a schematic view of periodic thread, monitor thread and recovery thread execution time, in accordance with an embodiment of the present disclosure. As illustrated in FIG. 2B, six working threads, Thread1 214, Thread2 216, Thread3 218, Thread4 220, Thread5 222, and Thread6 224, are shown with their execution times; a monitor thread 226 is created to monitor the execution of the working threads, and one more event-based thread, termed the recovery thread 228, is created for recovery of a faulty thread.
[0055] FIG. 2C is an exemplary flow chart depicting a method of periodic thread monitoring, recovery and action on thread health status, in accordance with an embodiment of the present disclosure.
[0056] At block 230, the multi-threaded application creates the working threads, the monitor thread 226, and the recovery thread 228. At block 232, after successful creation, the working threads update their thread IDs and execution times in the monitor thread map 122 as per their execution time intervals. At block 234, the monitor thread accesses the monitor thread map 122 and obtains the number of threads running in the system. At block 236, if the thread ID is present in the monitor thread map 122, the flow moves to block 238, where the monitor thread publishes the application as healthy at a pre-defined interval, e.g., 10 seconds.
[0057] At block 236, if the thread ID is not present, the flow moves to block 240, where the monitor thread reports the blocked/hung thread ID to the recovery thread. At block 242, the recovery thread can cancel the hung thread and start a fresh instance. At block 244, after recovery of the hung thread, the newly started working thread starts updating its thread ID in the monitor thread map 122 as per its execution time. At block 246, in the second iteration, the monitor thread checks the monitor thread map 122; if any previously reported thread has not recovered, then, as per block 248, the health status of the application is declared faulty.
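The periodic monitoring and recovery flow of FIG. 2C could be sketched along these lines, reusing a map such as the one sketched earlier; the publish and restart_worker callbacks are hypothetical hooks for health reporting and worker re-creation, since true thread cancellation is platform-specific and is left abstract here.

```python
import time

def monitor_loop(monitor_map, interval_s, report_q, publish):
    """Monitor thread: blocks 234-240 and 246-248 of FIG. 2C (sketch)."""
    previously_reported = set()
    while True:
        time.sleep(interval_s)
        blocked = set(monitor_map.blocked_threads(interval_s))
        if not blocked:
            publish("healthy")                 # block 238: every worker updated its ID in time
            previously_reported.clear()
            continue
        if blocked & previously_reported:
            publish("faulty")                  # block 248: a reported thread was not recovered
        for tid in blocked - previously_reported:
            report_q.put(tid)                  # block 240: hand newly blocked IDs to recovery
        previously_reported = blocked

def recovery_loop(report_q, restart_worker):
    """Recovery thread: blocks 242-244; restart_worker cancels and re-creates one worker."""
    while True:
        tid = report_q.get()
        restart_worker(tid)                    # cancellation/restart details are application-specific
```

These loops would typically be wired together with a queue.Queue and threading.Thread instances, mirroring the thread creation at block 230.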
[0058] FIG. 3A illustrates an exemplary block diagram representing event-based thread, monitor thread and recovery thread execution time, in accordance with an embodiment of the present disclosure.
[0059] In the non-periodic thread case, the monitor thread is used to monitor the execution of the non-periodic working threads of an application. As shown in FIG. 3A, three event-based working threads, Thread1 302, Thread2 304, and Thread3 306, are shown, each of which wakes up on an event. The monitor thread 308 is created to monitor the execution of these threads, and one more event-based thread, termed the recovery thread 310, is created for recovery of a faulty thread.
[0060] FIG. 3B is an exemplary flow chart depicting a method of non-periodic thread monitoring, recovery and action on thread health status, in accordance with an embodiment of the present disclosure.
[0061] As shown in FIG. 3B, at block 312, the multi-thread application starts and creates the event-based working threads, the monitor thread 308, and the recovery thread 310. At block 314, an event-based thread, on receipt of an event, updates the monitor thread map 122 with its flag set to true and, after processing the event, resets the flag to false. At block 316, the monitor thread starts tracking the flags raised in the monitor thread map 122 by the working threads, up to a predefined time. At block 318, if the flag is still true, the 'yes' branch leads to block 320, and the 'no' branch leads to block 322. At block 320, the monitor thread sends the blocked thread ID to the recovery thread, and at block 324 the recovery thread cancels the hung thread and starts a fresh thread. After the new thread starts successfully, the flow returns to block 314. At block 322, the monitor thread declares the application as healthy and publishes the application health status.
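For the event-based case, the flag mechanism of blocks 314-318 might be sketched as follows, with a per-thread "busy since" timestamp standing in for the true/false flag; the names below are illustrative only.

```python
import threading
import time

class EventFlagMap:
    """Per-thread busy flags for event-based workers (blocks 314-318, sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._busy_since: dict[int, float] = {}   # thread ID -> time the flag was raised

    def set_busy(self, thread_id: int) -> None:
        with self._lock:
            self._busy_since[thread_id] = time.monotonic()   # block 314: flag set to true

    def clear_busy(self, thread_id: int) -> None:
        with self._lock:
            self._busy_since.pop(thread_id, None)            # flag reset after processing

    def stuck_threads(self, max_busy_s: float) -> list[int]:
        now = time.monotonic()
        with self._lock:
            return [tid for tid, since in self._busy_since.items()
                    if now - since > max_busy_s]             # block 318: flag still true
```

An event handler would call set_busy(threading.get_ident()) on receipt of an event and clear_busy(...) after processing it, while the monitor thread 308 periodically calls stuck_threads() with the predefined time limit and forwards any returned IDs to the recovery thread 310.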
[0062] FIG. 4 illustrates an exemplary flow chart of a method for determining a health status of plurality of computing nodes, in accordance with an embodiment of the present disclosure.
[0063] Referring to FIG. 4, at block 402 of the method 400, the monitoring agent can monitor, at a time interval, a plurality of threads of an application executed in a plurality of computing nodes, each of the plurality of computing nodes comprising the monitoring agent, the plurality of computing nodes being coupled to a display over a network switch.
[0064] At block 404, the monitoring agent can monitor resource attributes of memory and processor for each of the plurality of computing nodes by determining the service level agreement (SLA) limit of the resource attributes.
[0065] At block 406, the monitoring agent can compute a health status of the application by detection of an unresponsive state of any or a combination of the plurality of threads and the resource attributes of memory and processor, wherein, upon detection of the unresponsive state, the monitoring agent is configured to recover (408) the unresponsive state of the plurality of threads to sustain the uninterrupted operations of the system.
[0066] It will be apparent to those skilled in the art that the system 100 of the disclosure may be provided using some or all of the mentioned features and components without departing from the scope of the present disclosure. While various embodiments of the present disclosure have been illustrated and described herein, it will be clear that the disclosure is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the scope of the disclosure, as described in the claims.
ADVANTAGES OF THE PRESENT DISCLOSURE
[0067] The present disclosure provides a system that detects unresponsive behaviour of system applications.
[0068] The present disclosure provides a system that recovers a failed task at runtime to sustain the uninterrupted operations of the system.
[0069] The present disclosure automates detection of system application unresponsiveness by monitoring the health of worker tasks by the monitoring agent.
[0070] The present disclosure provides run-time recovery by invoking a recovery method to ensure high system availability.
[0071] The present disclosure detects application unresponsive behaviour caused by continuously high memory and CPU usage.