Abstract: METHOD AND SYSTEM FOR DETECTING INTERACTION BETWEEN HUMAN ARM AND OBJECT USING COMPUTER VISION. Shrinkage in retail stores and productivity monitoring in manufacturing are significant problems. The present disclosure provides a method and a system for detecting interaction between a human arm and an object using computer vision in target scenarios. The system first receives images of an area of interest from overhead cameras. The system then identifies a set of zones in the area of interest based on the images using an object detection model. Once the zones are identified, the system detects motion of each user entity present in each identified zone and an object present in each hand of each user entity using a Deep Neural Network (DNN) inference model. Thereafter, the system checks whether the motion follows a valid or an invalid sequence. Upon determining an invalid sequence, the system generates an alert for a user which notifies the user about the irregularity observed in the real-time video. [To be published with FIG. 6]
Description: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR DETECTING INTERACTION BETWEEN HUMAN ARM AND OBJECT USING COMPUTER VISION
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The disclosure herein generally relates to movement tracking, and, more particularly, to a method and a system for detecting interaction between human arm and object using computer vision.
BACKGROUND
[002] In manufacturing, maintaining productivity and quality at the assembly line is a common problem. In retail, checkout shrinkage is a major problem. Companies lose billions of dollars every year due to shrinkage and productivity/quality issues. The most common reasons for shrinkage include, but are not limited to, mis-scans, partial scans, abandoned scans, label switching, customer theft, employee theft, and human errors. Among the mentioned reasons, most of the causes of shrinkage occur at self-checkout zones.
[003] Retailers have tried to address the shrinkage problem by placing radio frequency identification (RFID) tags on some objects. However, in places like warehouses and supermarkets where the volume of products being handled is large, RFID tags cannot be maintained. Some retailers have tried using cash recyclers, currency sorters, smart safes and the like for monitoring the cash being handled. However, inventory loss is something they were not able to stop even after using such smart solutions. Hence, a solution that can perform real-time tracking of the objects being handled by humans and of the humans involved in quality checks may help in preventing shrinkage in retail as well as in maintaining quality and productivity in manufacturing.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented method for detecting interaction between human arm and object using computer vision. The method comprises receiving, by a system via one or more hardware processors, a real-time video from one or more overhead cameras installed at an area of interest, the real-time video comprising a plurality of video frames corresponding to a plurality of time frames, wherein the plurality of video frames comprises one or more user entities; extracting, by the system via the one or more hardware processors, the plurality of video frames from the real-time video using a video frame extraction technique; detecting, in each extracted video frame of the plurality of video frames, by the system via the one or more hardware processors, one or more hands of each user entity of the one or more user entities, and one or more objects present in the one or more hands of each user entity, using a trained Region-based Convolutional Neural Network (R-CNN) model, wherein the trained R-CNN model creates a bounding box around each detected hand of the one or more hands of each user entity and around each object of the one or more objects present in the one or more hands of each user entity; performing, for each video frame of the plurality of video frames: comparing a current video frame with a previous video frame to obtain a similarity score, wherein the current video frame refers to a video frame currently being handled by the system, and wherein the previous video frame refers to the video frame present before the current video frame; comparing the obtained similarity score with a predefined threshold value, wherein the predefined threshold value is accessed from a database; upon determining that the obtained similarity score is higher than a predefined threshold value, confirming motion of at least one user entity of the one or more user entities present in the real-time video; computing an inter-frame intersection over union (IoU) score based on the one or more bounding boxes created for each hand of the one or more hands of each user entity present in the current video frame and the previous video frame, wherein the inter-frame IoU score ensures tagging of each bounding box with a unique identification; for each tagged bounding box, computing an intra-frame IoU score based on a predefined set of zones; detecting a state of the current video frame from a predefined set of states based on the intra-frame IoU score, wherein the predefined set of states is defined in correspondence with the predefined set of zones; and adding the detected state of the current video frame in a list of states maintained for the real-time video, wherein the list of states comprises a state information of each video frame of the plurality of video frames present in the real-time video, and wherein the state information of each video frame is associated with the detected state of a respective video frame; and determining whether a sequence associated with the list of states corresponds to a valid sequence or an invalid sequence, wherein the sequence corresponds to the valid sequence if the state of each video frame in the list of states follows a valid order amongst a plurality of predefined valid orders, and wherein the sequence corresponds to the invalid sequence if the 
state of each video frame in the list of states follows an invalid order amongst one or more predefined invalid orders.
[005] In an embodiment, the method comprises upon determining the invalid sequence, generating an alert for a user, wherein the alert notifies the user about the irregularity observed in the real-time video.
[006] In an embodiment, the inter-frame IoU score calculation comprises: determining an area of intersection and an area of union of the one or more boundary boxes of the detected one or more hands of each user entity present in the current video frame and the previous video frame; and computing the inter-frame IoU score based on the area of intersection and the area of union.
[007] In an embodiment, the method comprising: receiving, by the system via the one or more hardware processors, one or more images of the area of interest from the one or more overhead cameras; and identifying, by the system via the one or more hardware processors, a set of zones present in the area of interest based on the one or more images using a pre-trained object detection model, wherein the identified set of zones are stored in the database and are defined as the predefined set of zones.
[008] In an embodiment, the method comprising: detecting a barcode of each object of the one or more objects present in each hand of the one or more hands of each user entity, wherein the barcode is detected using an under scanner camera installed in the area of interest; performing, by the system via the one or more hardware processors, decoding of the barcode of each object to obtain object information of a respective object; checking, by the system via the one or more hardware processors, whether the object information of each barcode matches with the respective object using an image verification algorithm; and upon determining that the object information of at least one barcode does not match with the respective object, identifying a label switch.
[009] In an embodiment, the method comprising: performing labelling of each detected hand of the one or more hands of each user entity, wherein the labelling ensures a reference label is provided for each detected hand; tracking movement of each hand of the one or more hands of each user entity based on the reference label of a respective hand using a tracking algorithm; identifying a type of activity performed by each hand of the one or more hands of each user entity based on the tracked movement of the respective hand, wherein the type of activity comprises one of: a local activity, and a global activity, wherein the local activity refers to movement of a hand in a predefined local region present in the area of interest, and wherein the global activity refers to the movement of at least one user entity in a predefined global region present in the area of interest; and sending a notification to the user based on the type of activity, wherein the notification notifies the user about the irregularity observed in hand movement if the type of activity identified is the global activity.
[010] In another aspect, there is provided a system for detecting interaction between human arm and object using computer vision. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive a real-time video from one or more overhead cameras installed at an area of interest, the real-time video comprising a plurality of video frames corresponding to a plurality of time frames, wherein the plurality of video frames comprises one or more user entities; extract the plurality of video frames from the real-time video using a video frame extraction technique; detect, in each extracted video frame of the plurality of video frames, one or more hands of each user entity of the one or more user entities, and one or more objects present in the one or more hands of each user entity, using a trained Region-based Convolutional Neural Network (R-CNN) model, wherein the trained R-CNN model creates a bounding box around each detected hand of the one or more hands of each user entity and around each object of the one or more objects present in the one or more hands of each user entity; perform, for each video frame of the plurality of video frames: compare a current video frame with a previous video frame to obtain a similarity score, wherein the current video frame refers to a video frame currently being handled by the system, and wherein the previous video frame refers to the video frame present before the current video frame; compare the obtained similarity score with a predefined threshold value, wherein the predefined threshold value is accessed from a database; upon determining that the obtained similarity score is higher than a predefined threshold value, confirm motion of at least one user entity of the one or more user entities present in the real-time video; compute an inter-frame intersection over union (IoU) score based on the one or more bounding boxes created for each hand of the one or more hands of each user entity present in the current video frame and the previous video frame, wherein the inter-frame IoU score ensures tagging of each bounding box with a unique identification; for each tagged bounding box, compute an intra-frame IoU score based on a predefined set of zones; detect a state of the current video frame from a predefined set of states based on the intra-frame IoU score, wherein the predefined set of states is defined in correspondence with the predefined set of zones; add the detected state of the current video frame in a list of states maintained for the real-time video, wherein the list of states comprises a state information of each video frame of the plurality of video frames present in the real-time video, and wherein the state information of each video frame is associated with the detected state of a respective video frame; and determine whether a sequence associated with the list of states corresponds to a valid sequence or an invalid sequence, wherein the sequence corresponds to the valid sequence if the state of each video frame in the list of states follows a valid order amongst a plurality of predefined valid orders, and wherein the sequence corresponds to the invalid sequence if the state of each video frame in the list of states follows an invalid order amongst one or more predefined invalid orders.
[011] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause a method for detecting interaction between human arm and object using computer vision. The method comprises receiving, by a system, a real-time video from one or more overhead cameras installed at an area of interest, the real-time video comprising a plurality of video frames corresponding to a plurality of time frames, wherein the plurality of video frames comprises one or more user entities; extracting, by the system, the plurality of video frames from the real-time video using a video frame extraction technique; detecting, in each extracted video frame of the plurality of video frames, by the system, one or more hands of each user entity of the one or more user entities, and one or more objects present in the one or more hands of each user entity, using a trained Region-based Convolutional Neural Network (R-CNN) model, wherein the trained R-CNN model creates a bounding box around each detected hand of the one or more hands of each user entity and around each object of the one or more objects present in the one or more hands of each user entity; performing, for each video frame of the plurality of video frames: comparing a current video frame with a previous video frame to obtain a similarity score, wherein the current video frame refers to a video frame currently being handled by the system, and wherein the previous video frame refers to the video frame present before the current video frame; comparing the obtained similarity score with a predefined threshold value, wherein the predefined threshold value is accessed from a database; upon determining that the obtained similarity score is higher than a predefined threshold value, confirming motion of at least one user entity of the one or more user entities present in the real-time video; computing an inter-frame intersection over union (IoU) score based on the one or more bounding boxes created for each hand of the one or more hands of each user entity present in the current video frame and the previous video frame, wherein the inter-frame IoU score ensures tagging of each bounding box with a unique identification; for each tagged bounding box, computing an intra-frame IoU score based on a predefined set of zones; detecting a state of the current video frame from a predefined set of states based on the intra-frame IoU score, wherein the predefined set of states is defined in correspondence with the predefined set of zones; and adding the detected state of the current video frame in a list of states maintained for the real-time video, wherein the list of states comprises a state information of each video frame of the plurality of video frames present in the real-time video, and wherein the state information of each video frame is associated with the detected state of a respective video frame; and determining whether a sequence associated with the list of states corresponds to a valid sequence or an invalid sequence, wherein the sequence corresponds to the valid sequence if the state of each video frame in the list of states follows a valid order amongst a plurality of predefined valid orders, and wherein the sequence corresponds to the invalid sequence if the state of each video frame in the list of states follows an invalid order amongst one or more predefined invalid orders.
[012] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[013] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[014] FIG. 1 is an example representation of an environment, related to at least some example embodiments of the present disclosure.
[015] FIG. 2 illustrates an exemplary block diagram of a system for detecting interaction between human arm and object using computer vision, in accordance with an embodiment of the present disclosure.
[016] FIG. 3 illustrates an example representation of a camera configuration, in accordance with an embodiment of the present disclosure.
[017] FIG. 4 illustrates an example representation of a set of zones identified in a self-checkout area of a retail store, in accordance with an embodiment of the present disclosure.
[018] FIGS. 5A, 5B and 5C, collectively, illustrate an exemplary flow diagram of a method for detecting interaction between human arm and object using computer vision, in accordance with an embodiment of the present disclosure.
[019] FIG. 6 illustrates a schematic block diagram representation of a shrinkage detection process performed by the system of FIGS. 1 and 2 for detecting shrinkage activities happening in the retail store, in accordance with an embodiment of the present disclosure.
[020] FIGS. 7A, 7B and 7C illustrate example representations of boundary boxes created around one or more hands and one or more objects detected in a video frame, in accordance with an embodiment of the present disclosure.
[021] FIG. 8 illustrates an example representation of an intersection over union score calculation, in accordance with an embodiment of the present disclosure.
[022] FIG. 9A illustrates an example representation of a predefined valid order maintained for the retail store, in accordance with an embodiment of the present disclosure.
[023] FIG. 9B illustrates an example representation of a predefined invalid order maintained for the retail store, in accordance with an embodiment of the present disclosure.
[024] FIG. 10 illustrates an example representation of an item switch scenario, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[025] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[026] As discussed earlier, shrinkage in retail stores and quality/productivity issues at assembly lines are a big problem. Some manufacturers and retailers have tried to address these problems by installing radio frequency identification (RFID) tags in stores and assembly lines. However, in grocery shops, supermarkets and warehouses, maintaining an RFID tag for each product is not possible due to various reasons. Additionally, shrinkage losses are substantial in all these places. Hence, it becomes important to come up with techniques that can help in reducing shrinkage theft and maintaining quality at assembly lines.
[027] Embodiments of the present disclosure overcome the above-mentioned disadvantages by providing a method and a system for detecting interaction between human arm and object using computer vision which may further help in reducing shrinkage and maintaining quality at production lines. The system of the present disclosure first receives one or more images of an area of interest from one or more overhead cameras installed at the area of interest. The system then identifies a set of zones present in the area of interest based on the one or more images using a pre-trained object detection model. Once the zones are identified, the system detects motion of each user entity present in each identified zone and object present in each hand of each user entity. Thereafter, the system checks whether the motion follows a valid sequence or an invalid sequence. Upon determining the invalid sequence, the system generates an alert for a user which notifies the user about the irregularity observed in the real-time video.
[028] Further, the system checks whether object information of each barcode present on an object matches with a respective object using an image verification algorithm. The system identifies a label switch in case the object information does not match with the object.
[029] Additionally, the system also tracks movement of each detected hand of each user entity to identify any irregularity happening at the area of interest. In one case, the irregularity can be partial scanning of objects, mis-scanning, or non-payment.
[030] In the present disclosure, the system detects irregularities in user movement without performing object detection, thereby reducing the computing time required to perform object detection, which further reduces the time spent on training the system. The system only focuses on hands/entities occupied with an object instead of focusing on the complete frame, thereby ensuring improved performance of the system. Further, only a single-time configuration of cameras is required before using the system, thereby ensuring easy usability as minimal training is required for deploying the system to a new location. Additionally, the system works with overhead cameras from which face identification is not possible, thereby ensuring compliance with various privacy regulations such as the General Data Protection Regulation (GDPR).
[031] Referring now to the drawings, and more particularly to FIGS. 1 through 10, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[032] FIG. 1 illustrates an exemplary representation of an environment 100 related to at least some example embodiments of the present disclosure. Although the environment 100 is presented in one arrangement, other embodiments may include the parts of the environment 100 (or other parts) arranged otherwise depending on, for example, detecting the state of each video frame, identifying zones, etc. The environment 100 generally includes a camera 102 and a system 106, each coupled to, and in communication with (and/or with access to), a network 104. It should be noted that one camera is shown for the sake of explanation; there can be more cameras.
[033] The network 104 may include, without limitation, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the Internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, a virtual network, and/or another suitable public and/or private network capable of supporting communication among two or more of the parts illustrated in FIG. 1, or any combination thereof.
[034] Various entities in the environment 100 may connect to the network 104 in accordance with various wired and wireless communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), 2nd Generation (2G), 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G) communication protocols, Long Term Evolution (LTE) communication protocols, or any combination thereof.
[035] In an embodiment, without limiting the scope of the invention, the camera 102 is a top ceiling camera facing downwards towards an area of interest present in an assembly line or in a retail store. It should be noted that the area of interest can be any area present in any location which can be a source of shrinkage theft and hence needs monitoring. An example representation of a camera configuration is shown with respect to FIG. 3. The area of interest may include a predefined set of zones that are being monitored using the top ceiling camera.
[036] The system 106 includes one or more hardware processors and a memory. The system 106 is configured to perform one or more of the operations described herein. The system 106 is first configured to receive one or more images of the area of interest from the one or more overhead cameras, such as the camera 102 via the network 104. The system 106 then identifies a set of zones present in the area of interest based on the one or more images using a pre-trained object detection model. In an embodiment, without limiting the scope of the invention, the pre-trained object detection model is trained using you only look once (YOLO) for determining the set of zones. In at least one example embodiment, once the system identifies zone boundaries and zone signatures of the set of zones present in the area of interest using the pre-trained object detection model, an administrator/user of the system 106 may verify the zone boundaries and the zone signatures. And based on the verification, the set of zones are confirmed. An example representation of a set of zones identified for a retail store is shown with respect to FIG. 4. The identified set of zones are then stored in a database and are defined as the predefined set of zones.
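By way of a non-limiting illustration only, the zone identification step may be sketched in Python as follows, assuming a YOLO-style detector fine-tuned on zone signatures; the weight file, class names, and returned zone format are illustrative assumptions and do not form part of the disclosure.

# Illustrative sketch only: identifying checkout zones from an overhead image
# with a YOLO-style detector fine-tuned on zone signatures. The weight file,
# class names, and zone format are assumptions, not part of the disclosure.
import cv2
from ultralytics import YOLO

ZONE_CLASSES = {"basket_zone", "scan_zone", "weight_scale_zone",
                "pay_zone", "customer_standing_zone"}   # assumed labels

def identify_zones(image_path, weights="zone_detector.pt"):
    model = YOLO(weights)                      # pre-trained zone detector (assumed checkpoint)
    image = cv2.imread(image_path)
    result = model(image)[0]                   # single image -> first Results object
    zones = {}
    for box, cls_id in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        name = model.names[int(cls_id)]
        if name in ZONE_CLASSES:
            zones[name] = [int(v) for v in box]   # [x1, y1, x2, y2] zone boundary
    return zones                                  # stored as the predefined set of zones

# Example usage (illustrative): zones = identify_zones("area_of_interest.jpg")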
[037] In an embodiment, the user can be one or more of: a store manager, a production manager and/or a stakeholder and the like.
[038] Once the predefined set of zones is available, the system 106 receives a real-time video from the one or more overhead cameras, such as the camera 102, installed at the area of interest. The real-time video includes a plurality of video frames corresponding to a plurality of time frames and one or more user entities that are present in the area of interest. The system 106 then detects one or more hands of each user entity of the one or more user entities, and one or more objects present in the one or more hands of each user entity, in each video frame using a trained Region-based Convolutional Neural Network (R-CNN). Thereafter, the system 106 compares the video frames to detect motion of the one or more user entities present in the real-time video using a Deep Neural Network (DNN) inference model. Once the motion of the one or more user entities is confirmed, the system 106 detects a state of each video frame from a predefined set of states. It should be noted that the predefined set of states is defined in correspondence with the predefined set of zones.
[039] Further, the system 106 checks sequence of the states of each video frame. The sequence may correspond to a valid sequence or an invalid sequence. In case of the invalid sequence, the system 106 generates an alert for a user. The alert notifies the user about the irregularity observed in the real-time video i.e., in the area of interest. In an embodiment, the irregularity can be an action that may cause shrinkage. In case of the retail store, the irregularity can be non-scanning happening at a self-checkout zone.
[040] In an embodiment, the system 106 is configured to identify one or more other irregularities, such as label switching, partial scanning, non-payment, partial payment, item abandoning, missed scan, item switch, etc., and also performs tracking of user entities present in the area of interest based on the real-time video.
[041] The number and arrangement of systems, devices, and/or networks shown in FIG. 1 are provided as an example. There may be additional systems, devices, and/or networks; fewer systems, devices, and/or networks; different systems, devices, and/or networks; and/or differently arranged systems, devices, and/or networks than those shown in FIG. 1. Furthermore, two or more systems or devices shown in FIG. 1 may be implemented within a single system or device, or a single system or device shown in FIG. 1 may be implemented as multiple, distributed systems or devices. Additionally, or alternatively, a set of systems (e.g., one or more systems) or a set of devices (e.g., one or more devices) of the environment 100 may perform one or more functions described as being performed by another set of systems or another set of devices of the environment 100 (e.g., refer scenarios described above).
[042] FIG. 2 illustrates an exemplary block diagram of a system 200 for detecting interaction between human arm and object using computer vision, in accordance with an embodiment of the present disclosure. In some embodiments, the system 200 is embodied as a cloud-based and/or SaaS-based (software as a service) architecture. In some embodiments, the system 200 may be implemented in a server system. In some embodiments, the system 200 may be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, and the like.
[043] In an embodiment, the system 200 includes one or more processors 204, communication interface device(s) or input/output (I/O) interface(s) 206, and one or more data storage devices or memory 202 operatively coupled to the one or more processors 204. The one or more processors 204 may be one or more software processing modules and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 200 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[044] The I/O interface device(s) 206 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
[045] The memory 202 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment a database 208 can be stored in the memory 202, wherein the database 208 may comprise, but are not limited to, predefined set of states, a trained R-CNN model, a pre-trained object detection model and the like. The memory 202 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 202 and can be utilized in further processing and analysis.
[046] It is noted that the system 200 as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the present disclosure and, therefore, should not be taken to limit the scope of the present disclosure. It is noted that the system 200 may include fewer or more components than those depicted in FIG. 2.
[047] FIG. 3 illustrates an example representation of a camera configuration, in accordance with an embodiment of the present disclosure.
[048] As seen in FIG. 3, the camera is a top ceiling camera that is mounted on a pole or a ceiling in a self-checkout area (an example of the area of interest in the retail store as most of shrinkage theft happens in the self-checkout area) present in the retail store. The camera is facing downwards over the predefined set of zones that are present in the area of interest i.e., the self-checkout area. The position of the camera is fixed such that it covers the whole area of interest i.e., all the predefined set of zones are being captured through the camera.
[049] FIG. 4 illustrates an example representation of a set of zones identified in the self-checkout area of a retail store, in accordance with an embodiment of the present disclosure.
[050] As seen in FIG. 4, the area of interest includes ‘5’ zones, viz. a basket zone, a scan zone, a weight scale zone, a pay zone and a customer standing area zone. In an embodiment, the basket zone is considered as a landing area where each customer of the retail store is expected to place all the items that are to be scanned for payment. The scan zone is considered as a region where an under scanner is placed for scanning the barcode of each item. The weight scale zone is considered as a region where items are expected to be placed post scanning. The pay zone is considered as a region where the customer is expected to pay for the purchased items. In the pay zone, the customer can select any payment mode, such as credit and debit cards, quick response (QR) codes, and the like, for making payment for the items purchased. The customer standing area zone is considered as a region where the customer might be standing in the self-checkout area of the retail store. It should be noted that the set of zones shown with reference to FIG. 4 is used merely for the sake of explanation, without limiting the scope of the invention. The set of zones is only an example of the zones that can be identified in the area of interest, such as the self-checkout area of the retail store; other sets of zones can be identified depending on the usage of the system.
[051] FIGS. 5A, 5B and 5C, with reference to FIGS. 1 through 4, collectively, represent an exemplary flow diagram of a method 500 for detecting interaction between human arm and object using computer vision, in accordance with an embodiment of the present disclosure. The method 500 may use the system 106 of FIGS. 1 and 2 for execution. In an embodiment, the system 106 comprises one or more data storage devices or the memory 202 operatively coupled to the one or more hardware processors 204 and is configured to store instructions for execution of steps of the method 500 by the one or more hardware processors 204. The sequence of steps of the flow diagram may not necessarily be executed in the same order as they are presented. Further, one or more steps may be grouped together and performed in the form of a single step, or one step may have several sub-steps that may be performed in parallel or in a sequential manner. The steps of the method of the present disclosure will now be explained with reference to the components of the system 106 as depicted in FIG. 2 and FIG. 1.
[052] At step 502 of the method of the present disclosure, the one or more hardware processors 204 of the system 106 receive a real-time video from one or more overhead cameras installed at an area of interest, the real-time video comprising a plurality of video frames corresponding to a plurality of time frames, wherein the plurality of video frames comprises one or more user entities that may be present in the area of interest.
[053] At step 504 of the method of the present disclosure, the one or more hardware processors 204 of the system 106 extract the plurality of video frames from the real-time video using a video frame extraction technique. It should be noted that the video frame extraction technique used here can be any frame extraction technique available in the art. In an embodiment, Open Source Computer Vision Library (OpenCV) is used for extracting the plurality of video frames from the real-time video and for processing them frame by frame in real time.
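By way of a non-limiting illustration only, a minimal Python sketch of the frame extraction step using OpenCV is given below; the stream source and sampling rate are illustrative assumptions and do not form part of the disclosure.

# A minimal OpenCV sketch of the frame extraction step (step 504); the stream
# URL and the sampling interval are illustrative assumptions.
import cv2

def extract_frames(source="rtsp://overhead-camera/stream", sample_every=1):
    cap = cv2.VideoCapture(source)
    frame_index = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break                          # end of stream or read failure
        if frame_index % sample_every == 0:
            yield frame_index, frame       # hand each sampled frame to the pipeline
        frame_index += 1
    cap.release()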
[054] At step 506 of the method of the present disclosure, the one or more hardware processors 204 of the system 106 detect, in each extracted video frame of the plurality of video frames, one or more hands of each user entity of the one or more user entities, and one or more objects present in the one or more hands of each user entity using a trained Region-based Convolutional Neural Network (R-CNN) model. The above step 506 is better understood by way of following description.
[055] The R-CNN model is first trained with a dataset containing hand and object interaction scenarios such that it can detect two classes viz 1) object in a human hand and 2) one or more hands of a human. The trained R-CNN model, once used by the system 106 for each video frame, detects the one or more user entities, and the one or more objects present in the one or more hands of each user entity present in the video frame by creating a boundary box around the one or more hands of each user entity present in a respective video frame and the one or more objects present in the one or more hands of each user entity present in the respective video frame. Some example representations showing boundary boxes created around the one or more hands and the one or more objects detected in the video frame are shown with reference to FIGS. 7A-7C.
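For illustration only, the following Python sketch shows how such a two-class detection could be realized with torchvision's Faster R-CNN as a stand-in for the trained R-CNN model; the checkpoint path, the class mapping, and the score threshold are assumptions and not a definitive implementation of the disclosure.

# Hedged sketch of step 506 using torchvision's Faster R-CNN as a stand-in for
# the trained R-CNN model; the checkpoint file and label mapping are assumed.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

CLASS_NAMES = {1: "hand", 2: "object_in_hand"}   # assumed two-class mapping

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=3)
model.load_state_dict(torch.load("hand_object_rcnn.pth", map_location="cpu"))  # assumed checkpoint
model.eval()

def detect_hands_and_objects(frame_bgr, score_threshold=0.6):
    # OpenCV frames are BGR; convert to an RGB tensor in [0, 1]
    rgb = frame_bgr[:, :, ::-1].copy()
    with torch.no_grad():
        output = model([to_tensor(rgb)])[0]
    detections = []
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if score >= score_threshold and int(label) in CLASS_NAMES:
            detections.append((CLASS_NAMES[int(label)], [float(v) for v in box], float(score)))
    return detections   # e.g. [("hand", [x1, y1, x2, y2], 0.91), ...]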
[056] At step 508 of the method of the present disclosure, the one or more hardware processors 204 of the system 200 detect state information of each video frame of the plurality of video frames by performing a plurality of steps 508a through 508g for each video frame.
[057] More specifically, at step 508a of the present disclosure, the one or more hardware processors 204 of the system 106 compare a current video frame with a previous video frame to obtain a similarity score. The current video frame refers to a video frame currently being handled by the system, and the previous video frame refers to the video frame present before the current video frame, i.e., the video frame that was handled before the current video frame.
[058] In an embodiment, the similarity score is computed based on the comparison of the current video frame and the previous video frame. In particular, the current video frame is compared with the previous video frame to find out the difference in the location of a user entity in both video frames. Then, based on the difference in the location, the system generates the similarity score. In an embodiment, the similarity score is equivalent to {current video frame – previous video frame}.
[059] At step 508b of the method of the present disclosure, the one or more hardware processors 204 of the system 106 compare the obtained similarity score with a predefined threshold value. The predefined threshold value is accessed from a database, such as the database 208. In an embodiment, the predefined threshold value is defined by an administrator of the system 106 based on domain knowledge.
[060] At step 508c of the method of the present disclosure, the one or more hardware processors 204 of the system 106 confirm motion of at least one user entity of the one or more user entities present in the real-time video, upon determining that the obtained similarity score is higher than a predefined threshold value. In particular, if the similarity score is found to be greater than the predefined threshold value, the system 106 assumes that one or more user entities present in the current video frame are moving hence identifies the motion.
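A minimal sketch of steps 508a through 508c is given below for illustration only, assuming the similarity score is realized as the mean absolute pixel difference between consecutive frames, consistent with {current video frame – previous video frame}; the threshold value is illustrative.

# Illustrative sketch of steps 508a-508c: frame differencing followed by a
# threshold check to confirm motion; the threshold of 8.0 is an assumption.
import cv2
import numpy as np

def motion_detected(current_frame, previous_frame, threshold=8.0):
    curr = cv2.cvtColor(current_frame, cv2.COLOR_BGR2GRAY)
    prev = cv2.cvtColor(previous_frame, cv2.COLOR_BGR2GRAY)
    similarity_score = float(np.mean(cv2.absdiff(curr, prev)))   # frame difference
    return similarity_score > threshold    # True -> motion of a user entity confirmed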
[061] At step 508d of the present disclosure, the one or more hardware processors 204 of the system 106 compute an inter-frame intersection over union (IoU) score based on the one or more bounding boxes created for each hand of the one or more hands of each user entity present in the current video frame and the previous video frame.
[062] In particular, once the motion of the one or more user entities present in the current video frame is confirmed, the system 106 determines an area of intersection and an area of union of the boundary boxes of the one or more hands of each user entity. Thereafter, the system 106 computes the inter-frame IoU score for the current video frame based on the area of intersection and the area of union. An example representation of the IoU score calculation is shown with reference to FIG. 8. The inter-frame IoU score ensures tagging of each bounding box with a unique identification (ID) which further helps in tracking movement of each hand of each user entity.
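For illustration only, the inter-frame IoU and the tagging of each bounding box with a unique identification may be sketched as follows; the greedy matching strategy and the minimum-overlap value are assumptions, not a definitive implementation.

# Sketch of the inter-frame IoU used for tagging (step 508d): boxes are
# [x1, y1, x2, y2]; the greedy matching and the ID counter are illustrative.
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)             # area of intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                            # area of union
    return inter / union if union > 0 else 0.0

def tag_hands(prev_tagged, current_boxes, next_id, min_iou=0.3):
    # Carry a unique ID from the previous frame to the best-overlapping box;
    # a box with no sufficient overlap receives a new unique ID.
    tagged = {}
    for box in current_boxes:
        best_id, best_score = None, min_iou
        for hand_id, prev_box in prev_tagged.items():
            score = iou(box, prev_box)
            if score > best_score:
                best_id, best_score = hand_id, score
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        tagged[best_id] = box
    return tagged, next_id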
[063] At step 508e of the present disclosure, the one or more hardware processors 204 of the system 106 compute an intra-frame IoU score based on a predefined set of zones for each tagged bounding box. In particular, once the tagging of the hands of the user entity present in the current video frame is available, the system computes intra-frame IoU score based on the tagging i.e., the unique ID and the predefined set of zones to track movement of at least one user entity in different zones of the predefined set of zones.
[064] At step 508f of the present disclosure, the one or more hardware processors 204 of the system 106 detect a state of the current video frame from a predefined set of states based on the intra-frame IoU score, wherein the predefined set of states is defined in correspondence with the predefined set of zones. The above step 508f is better understood by way of the following description.
[065] As discussed earlier, the set of zones are defined based on the usage of the system 106. In case it is being used for the retail store, the exemplary set of zones that can be defined includes, but are not limited to, the customer standing area zone, the basket zone, the scan zone, the weight scale zone and the pay zone. So, the set of states can be defined in correspondence to the predefined set of zones. For example, in case the system 106 is being used for the retail store, the State 1 (Z1) may correspond to object in hand detected in the customer standing area zone i.e., the object is detected in the customer standing area zone. Similarly, State 2 (Z2) may correspond to object in hand detected in the basket zone. State 3 (Z3) may correspond to object in hand detected in the scan zone. State 4 (Z4) may correspond to object in hand detected in the weight scale zone. State 5 (Z5) may correspond to object in hand detected in the pay zone.
[066] Further, a state is defined for each intra-frame IoU score range. So, if the intra-frame IoU score falls in a particular range, the state corresponding to that range is selected for the current video frame. For example, if the intra-frame IoU score is found to be 50 and the state ‘Z3’ is defined for intra-frame IoU scores in the range of 45-55, then Z3 is considered as the state of the current video frame.
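For illustration only, the following sketch of steps 508e and 508f computes the intra-frame IoU of a tagged hand box against each predefined zone and selects the state of the zone with the largest overlap, a simplification of the per-range lookup described above; the zone names and the zone-to-state correspondence are assumptions.

# Illustrative sketch of steps 508e-508f: intra-frame IoU against each
# predefined zone, state taken from the zone with the largest overlap.
ZONE_STATES = {                 # assumed correspondence between zones and states
    "customer_standing_zone": "Z1",
    "basket_zone": "Z2",
    "scan_zone": "Z3",
    "weight_scale_zone": "Z4",
    "pay_zone": "Z5",
}

def box_iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detect_state(hand_box, zones):
    best_state, best_score = None, 0.0
    for zone_name, zone_box in zones.items():
        score = box_iou(hand_box, zone_box)    # intra-frame IoU against the zone boundary
        if score > best_score:
            best_state, best_score = ZONE_STATES[zone_name], score
    return best_state                          # e.g. "Z3" when the hand overlaps the scan zone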
[067] At step 508g of the present disclosure, the one or more hardware processors 204 of the system 106 add the detected state of the current video frame in a list of states maintained for the real-time video. The list of states comprises state information of each video frame of the plurality of video frames present in the real-time video. The state information of each video frame corresponds to the detected state of a respective video frame. In particular, the list of states is maintained which comprises the state of each video frame present in the real-time video.
[068] The steps 508a to 508g are performed for each video frame until the state of each video frame of the plurality of video frames present in the real-time video is detected. It should be noted that a state of a first video frame may not be determined as there is no previous video frame for the first video frame. In an embodiment, the state determination process starts from a second video frame of the real-time video.
[069] In an embodiment, at step 510 of the method of the present disclosure, once the state information of each video frame of the plurality of video frames is available in the list of states, the one or more hardware processors 204 of the system 106 determine whether a sequence associated with the list of states corresponds to a valid sequence or an invalid sequence. The sequence corresponds to the valid sequence in case the state of each video frame in the list of states follows a valid order amongst a plurality of predefined valid orders. The sequence corresponds to the invalid sequence in case the state of each video frame in the list of states follows an invalid order amongst one or more predefined invalid orders. The above step 510 is better understood by way of following description.
[070] Once the state information of each video frame is available, the system 106 may consider the state of the current video frame as ‘T’. Similarly, the state of the previous video frame is considered as ‘T-1’ and the state of the video frame before that is considered as ‘T-2’. Using this notation, the system 106 determines the sequence based on the list of states. If the sequence present in the list of states corresponds to the valid sequence, i.e., it follows a valid order amongst the plurality of predefined valid orders, then the system 106 assumes that there is no irregularity present and that the actions performed by the one or more user entities present in the real-time video are normal. An example representation of the predefined valid order maintained for the retail store is shown with reference to FIG. 9A.
[071] In an embodiment, the one or more hardware processors 204 of the system 106 generate an alert for a user based on the determined sequence. The alert notifies the user about the irregularity observed in the real-time video. In particular, if the sequence present in the list of states corresponds to the invalid sequence, i.e., it follows an invalid order amongst the one or more predefined invalid orders, then the system 106 assumes that there is some irregularity, such as non-scanning of items, and that the actions performed by the one or more user entities present in the real-time video are not normal, i.e., they are trying to steal or misplace items. Hence, the alert is generated for the user/administrator of the system 106 so that required actions can be taken by the user/administrator, which may further help in minimizing shrinkage theft. An example representation of the predefined invalid order maintained for the retail store is shown with reference to FIG. 9B.
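A minimal sketch of the sequence check of step 510 and the resulting alert is given below for illustration only; the transition table is an assumption loosely modelled on the valid and invalid orders of FIGS. 9A and 9B and does not limit the disclosure.

# Illustrative sketch of step 510: consecutive states are checked against an
# assumed table of valid transitions; any other transition is treated as invalid.
VALID_TRANSITIONS = {               # assumed valid orders (FIG. 9A style)
    "Z1": {"Z1", "Z2"},             # standing area -> basket
    "Z2": {"Z2", "Z3"},             # basket -> scan
    "Z3": {"Z3", "Z4"},             # scan -> weight scale
    "Z4": {"Z4", "Z5"},             # weight scale -> pay
    "Z5": {"Z5"},
}

def check_sequence(list_of_states):
    # Returns (True, None) for a valid sequence, (False, transition) otherwise.
    filtered = [s for s in list_of_states if s is not None]
    for previous, current in zip(filtered, filtered[1:]):
        allowed = VALID_TRANSITIONS.get(previous, {previous})
        if current not in allowed:              # e.g. Z2 -> Z4 skips the scan zone
            return False, (previous, current)
    return True, None

# Example usage (illustrative):
# valid, bad_transition = check_sequence(["Z2", "Z2", "Z4"])
# if not valid:
#     print("ALERT: irregular transition", bad_transition, "observed in real-time video")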
[072] It should be noted that the system 106 detects anomalous transactions by determining state transitions which are not present in the pre-defined state transition table or configuration, i.e., not present in the predefined valid order. However, in some scenarios, an unseen or unpredicted or unanticipated state transition may occur which may also be a valid transaction/scenario. In that case, the user (e.g., store manager/plant supervisor and the like) may override the system. Thereafter, the unseen or unpredicted or unanticipated state transition is automatically adopted by the system 106 as a valid transition without any user intervention. In at least one example embodiment, all the steps being performed by/on the system 106 are monitored using artificial intelligence to improve the accuracy of the complete solution.
[073] In an embodiment, the system 106 may also identify a label switch, i.e., a label of an object/product/item with a higher price value is switched with a label of an object/product/item with a lower price value. For identifying the label switch, the system 106 detects a barcode of each object of the one or more objects present in the one or more hands of each user entity. It should be noted that the barcode is detected using an under scanner camera installed in the area of interest. Thereafter, the system 106 performs decoding of the barcode of each object to obtain object information of a respective object. Once the object information is available, the system 106 checks whether the object information of each barcode matches with the respective object using an image verification algorithm. It should be noted that the image verification algorithm, without limiting the scope of the invention, can be any image verification algorithm known in the art. In an embodiment, the image verification algorithm is pretrained using a barcode dataset. In case the object information of each barcode matches with the respective object, the object scanning is considered as normal. Further, in case the object information of a barcode does not match with the respective object, the system identifies the label switch. In one embodiment, the system 106 may also generate the alert for the label switch.
[074] In an exemplary scenario, assume that the object is a ‘ramen packet’ of a specific brand. So, the barcode contains the information that it is the ramen packet. Once the system 106 knows that the barcode is of the ramen packet, it compares the ramen packet image with the image of the object that the user entity is holding. In case the image matches, the object scanned is considered the same as the object billed. Otherwise, the label switch is detected.
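For illustration only, the label-switch check may be sketched as follows, assuming pyzbar as one possible barcode decoder and a simple histogram comparison as a stand-in for the image verification algorithm; the product catalog lookup and the similarity threshold are assumptions.

# Hedged sketch of the label-switch check: pyzbar decodes the barcode under the
# scanner, and an HSV-histogram correlation stands in for the image verification
# algorithm; catalog is an assumed mapping from barcode string to reference image.
import cv2
from pyzbar.pyzbar import decode

def decode_barcode(scanner_frame_bgr):
    gray = cv2.cvtColor(scanner_frame_bgr, cv2.COLOR_BGR2GRAY)
    results = decode(gray)
    return results[0].data.decode("utf-8") if results else None   # e.g. EAN string

def matches_reference(object_crop, reference_image, threshold=0.7):
    # Coarse similarity via HSV histogram correlation (illustrative stand-in).
    def hist(img):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        return cv2.normalize(h, h).flatten()
    score = cv2.compareHist(hist(object_crop), hist(reference_image), cv2.HISTCMP_CORREL)
    return score >= threshold

def label_switch_detected(scanner_frame, object_crop, catalog):
    barcode = decode_barcode(scanner_frame)
    if barcode is None or barcode not in catalog:
        return False                       # nothing decodable to compare against
    reference_image = catalog[barcode]     # image registered for this barcode (assumed)
    return not matches_reference(object_crop, reference_image)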
[075] In an embodiment, the system 106 also performs labelling of each detected hand of the one or more hands of each user entity present in the real-time video after detecting objects and hands of each user entity present in the real-time video. The labelling is done by the system 106 to ensure that a reference label is provided for each detected hand of each user entity. Thereafter, the system 106 tracks movement of each hand of the one or more hands of each user entity based on the reference label of a respective hand using a tracking algorithm.
[076] As the movement of each hand of each user entity is getting tracked, the system may identify a type of activity performed by each hand of the one or more hands of each user entity based on the tracked movement of the respective hand. The type of activity comprises one of a local activity, and a global activity. The local activity refers to movement of a hand in a predefined local region present in the area of interest, and the global activity refers to the movement of at least one user entity in a predefined global region present in the area of interest. In an embodiment, the local regions can be predefined zones, such as scanning zone, payment zone and the like. The local activity can be scanning of items, dropping of items, payment and the like.
[077] Similarly, the global regions can be a combination of two or more predefined zones. In at least one example embodiment, the global activity can be a complete transaction happening at the system 106, such as a user entity entering the area of interest, scanning all the items, making the payment, taking all the items and exiting the self-checkout (SCO) machine. The anomalies in the global activity can be item switching, partial scanning, mis-scanning, non-payment, label switching and the like, i.e., actions that may lead to shrinkage. An example representation showing item switching is shown with reference to FIG. 10.
[078] It should be noted that the system generates an alert for anomalies in both the global activity and the local activity.
[079] In an embodiment, as the system 106 is tracking the movement of hands of each user, the system may also compute a speed of hand movement of each hand of a user entity using a speed calculation technique.
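A minimal sketch of such a speed calculation is given below for illustration only, assuming the speed is estimated from the displacement of the bounding-box centroid of a labelled hand between consecutive frames; the frame rate and the pixel units are assumptions, and no calibration to real-world units is implied.

# Illustrative sketch of the speed calculation for a tracked hand: centroid
# displacement per frame interval; fps and pixel units are assumed.
import math

def centroid(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def hand_speed(track, fps=25.0):
    # track: list of bounding boxes of one labelled hand, one per frame.
    if len(track) < 2:
        return 0.0
    (px, py) = centroid(track[-2])
    (cx, cy) = centroid(track[-1])
    displacement = math.hypot(cx - px, cy - py)   # pixels moved in one frame interval
    return displacement * fps                     # approximate speed in pixels per second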
[080] In an embodiment, the system 106 may also detect non-scanning of the items/objects in case the system 106 is being used for the retail store. In particular, the system 106 detects the non-scanning of items if a hand of the customer moves from the basket zone to the weighing zone without passing through the scanning zone. Based on the detection of the non-scanning of the items, the system 106 may generate the alert.
[081] In an embodiment, the system 106 may also detect non-payment of the items/objects in case the system 106 is being used for the retail store. In particular, the system 106 detects the non-payment of the items/objects if the hand of the customer is not tracked in the pay zone.
[082] In an embodiment, the system 106 may also perform abandoned item detection in case the system 106 is being used for the retail store. In particular, the system 106 detects leftover items when the user entity exits the self-checkout area. For doing so, the system 106 uses a timer in case the items are left in the basket or in a weighing zone. If the user entity comes back before a predefined time limit, the transaction is resumed. If the user entity does not come back, then an alert is sent to the user for operational action.
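For illustration only, the abandoned-item timer may be sketched as follows; the time limit and the alert text are assumptions.

# Illustrative timer sketch for abandoned-item detection; the predefined time
# limit and the alert string are assumed values.
import time

class AbandonedItemWatcher:
    def __init__(self, time_limit_s=120):
        self.time_limit_s = time_limit_s
        self.left_at = None                      # when items were first left unattended

    def update(self, items_left, user_present):
        if items_left and not user_present:
            if self.left_at is None:
                self.left_at = time.monotonic()  # start the timer
            elif time.monotonic() - self.left_at > self.time_limit_s:
                return "ALERT: abandoned items detected"   # operational action needed
        else:
            self.left_at = None                  # user returned -> resume the transaction
        return None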
[083] In an embodiment, the system 106 may also perform empty basket/trolley detection in case the system 106 is being used for the retail store. The system 106 uses a pre-trained model that detects an empty trolley and basket at the time of payment. If objects are detected in the trolley or basket at the time of payment, the system 106 considers this as a partial scan and thus generates an alert for the user.
[084] FIG. 6, with reference to FIGS. 1-5, illustrates a schematic block diagram representation of a shrinkage detection process performed by the system 106 of FIGS. 1 and 2 for detecting shrinkage activities happening in the retail store, in accordance with an embodiment of the present disclosure.
[085] As seen in FIG. 6, a camera captures a real-time video from which a plurality of video frames is extracted. Thereafter, the DNN inference model uses those frames to detect motion of the user entities present in the video frames. Once the motion is detected, the system determines a state of each frame. Further, based on the states, one or more flags, such as a non-scan detected flag, a partial scan flag, an item switch detection flag, a ticket/product switch detection flag, and an abandoned item detection flag, are updated. The system 200 then generates an alert based on the flags.
[086] In case motion is not detected in the video frames, the system resets all flags and closes/completes the transaction.
[087] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[088] The system detects irregularities in user movement without performing object detection, thereby reducing the computing time required to perform object detection, which further reduces the time spent on training the system. The system only focuses on hands/entities occupied with an object instead of focusing on the complete frame, thereby ensuring improved performance of the system. Further, only a single-time configuration of cameras is required before using the system, thereby ensuring easy usability as minimal training is required for deploying the system to a new location. Additionally, the system works with overhead cameras from which face identification is not possible, thereby ensuring General Data Protection Regulation (GDPR) compliance.
[089] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[090] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[091] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising”, “having”, “containing”, and “including”, and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.
[092] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[093] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Claims:
We Claim:
1. A processor implemented method (500), comprising:
receiving (502), by a system via one or more hardware processors, a real-time video from one or more overhead cameras installed at an area of interest, the real-time video comprising a plurality of video frames corresponding to a plurality of time frames, wherein the plurality of video frames comprises one or more user entities;
extracting (504), by the system via the one or more hardware processors, the plurality of video frames from the real-time video using a video frame extraction technique;
detecting (506), in each extracted video frame of the plurality of video frames, by the system via the one or more hardware processors, one or more hands of each user entity of the one or more user entities, and one or more objects present in the one or more hands of each user entity, using a trained Region-based Convolutional Neural Network (R-CNN) model, wherein the trained R-CNN model creates a bounding box around each detected hand of the one or more hands of each user entity and around each object of the one or more objects present in the one or more hands of each user entity;
performing (508), for each video frame of the plurality of video frames:
comparing (508a) a current video frame with a previous video frame to obtain a similarity score, wherein the current video frame refers to a video frame currently being handled by the system, and wherein the previous video frame refers to the video frame present before the current video frame;
comparing (508b) the obtained similarity score with a predefined threshold value, wherein the predefined threshold value is accessed from a database;
upon determining that the obtained similarity score is higher than the predefined threshold value, confirming (508c) motion of at least one user entity of the one or more user entities present in the real-time video;
computing (508d) an inter-frame intersection over union (IoU) score based on the one or more bounding boxes created for each hand of the one or more hands of each user entity present in the current video frame and the previous video frame, wherein the inter-frame IoU score ensures tagging of each bounding box with a unique identification (ID);
for each tagged bounding box, computing (508e) an intra-frame IoU score based on a predefined set of zones;
detecting (508f) a state of the current video frame from a predefined set of states based on the intra-frame IoU score, wherein the predefined set of states is defined in correspondence with the predefined set of zones; and
adding (508g) the detected state of the current video frame in a list of states maintained for the real-time video, wherein the list of states comprises a state information of each video frame of the plurality of video frames present in the real-time video, and wherein the state information of each video frame is associated with the detected state of a respective video frame; and
determining (510) whether a sequence associated with the list of states corresponds to a valid sequence or an invalid sequence, wherein the sequence corresponds to the valid sequence if the state of each video frame in the list of states follows a valid order amongst a plurality of predefined valid orders, and wherein the sequence corresponds to the invalid sequence if the state of each video frame in the list of states follows an invalid order amongst one or more predefined invalid orders.
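For illustration only, and without limiting claim 1, the sketch below shows one possible realization of the intra-frame state detection and the valid/invalid sequence check; the helpers (`iou`, the zone dictionary) and the example state labels are assumptions made for the sketch, not part of the claimed method, and the hand boxes are assumed to be already tagged via the inter-frame IoU step.

```python
# Minimal sketch (not the claimed implementation) of per-frame state detection and
# sequence validation; `iou` is any IoU function over (x1, y1, x2, y2) boxes.

def detect_frame_state(hand_boxes, zones, iou, idle_state="IDLE"):
    """Map the tagged hand boxes to the zone they overlap most and return a frame state."""
    best_zone, best_score = idle_state, 0.0
    for box in hand_boxes:
        for zone_name, zone_box in zones.items():
            score = iou(box, zone_box)          # intra-frame IoU against a predefined zone
            if score > best_score:
                best_zone, best_score = zone_name, score
    return best_zone


def sequence_is_valid(states, valid_orders, invalid_orders):
    """Check the accumulated list of per-frame states against predefined orders."""
    # Collapse consecutive repeats so e.g. [A, A, B, B, C] becomes [A, B, C].
    collapsed = [s for i, s in enumerate(states) if i == 0 or s != states[i - 1]]
    if collapsed in invalid_orders:
        return False
    return collapsed in valid_orders
```

For instance, under these assumptions a valid order might be ["PICK", "SCAN", "BAG"] while ["PICK", "BAG"] would be an invalid (non-scan) order; these labels are hypothetical examples only.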
2. The processor implemented method (500) as claimed in claim 1, comprising:
upon determining the invalid sequence, generating an alert for a user, wherein the alert notifies the user about the irregularity observed in the real-time video.
3. The processor implemented method (500) as claimed in claim 1, wherein the inter-frame IoU score calculation comprises:
determining an area of intersection and an area of union of the one or more bounding boxes of the detected one or more hands of each user entity present in the current video frame and the previous video frame; and
computing the inter-frame IoU score based on the area of intersection and the area of union.
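The IoU recited above is the standard intersection-over-union ratio; purely to make the intersection/union wording concrete, a minimal Python version for axis-aligned boxes given as (x1, y1, x2, y2) is shown below.

```python
# Standard intersection-over-union for two axis-aligned boxes (x1, y1, x2, y2);
# shown only as an illustration of the area-of-intersection / area-of-union wording.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```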
4. The processor implemented method (500) as claimed in claim 1, comprising:
receiving one or more images of the area of interest from the one or more overhead cameras; and
identifying a set of zones present in the area of interest based on the one or more images using a pre-trained object detection model, wherein the identified set of zones are stored in the database and are defined as the predefined set of zones.
5. The processor implemented method (500) as claimed in claim 1, comprising:
detecting a barcode of each object of the one or more objects present in each hand of the one or more hands of each user entity, wherein the barcode is detected using an under scanner camera installed in the area of interest;
performing decoding of the barcode of each object to obtain object information of a respective object;
checking whether the object information of each barcode matches with the respective object using an image verification algorithm; and
upon determining that the object information of at least one barcode does not match with the respective object, identifying a label switch.
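As a non-limiting illustration of the label-switch check, the sketch below decodes a barcode (pyzbar is shown as one possible decoder) and compares the catalogued product against what an image classifier sees; `catalog`, `product_classifier`, and `notify_user` are hypothetical stand-ins for the image verification algorithm and alerting described above.

```python
# Illustrative label-switch check: decode the barcode from the under-scanner image
# and compare the registered product against the product seen in the image.

from pyzbar.pyzbar import decode  # pip install pyzbar


def check_label_switch(scanner_image, catalog, product_classifier, notify_user):
    for barcode in decode(scanner_image):
        code = barcode.data.decode("utf-8")
        expected = catalog.get(code)                   # product registered for this barcode
        observed = product_classifier(scanner_image)   # product class seen in the image
        if expected is not None and observed != expected:
            notify_user(
                f"Label switch suspected: barcode says {expected}, image shows {observed}"
            )
```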
6. The processor implemented method (500) as claimed in claim 1, comprising:
performing labelling of each detected hand of the one or more hands of each user entity, wherein the labelling ensures a reference label is provided for each detected hand;
tracking movement of each hand of the one or more hands of each user entity based on the reference label of a respective hand using a tracking algorithm;
identifying a type of activity performed by each hand of the one or more hands of each user entity based on the tracked movement of the respective hand, wherein the type of activity comprises one of: a local activity, and a global activity, wherein the local activity refers to movement of a hand in a predefined local region present in the area of interest, and wherein the global activity refers to the movement of at least one user entity in a predefined global region present in the area of interest; and
sending a notification to the user based on the type of activity, wherein the notification notifies the user about the irregularity observed in hand movement if the type of activity identified is the global activity.
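As a non-limiting illustration of the labelling and tracking described above, the sketch below carries a reference label forward to the best-overlapping hand box between frames (a simple IoU-based stand-in for the claimed tracking algorithm) and classifies the activity as local or global from the region the box falls in; the region representation and the `contains` helper are assumptions made for the sketch.

```python
# Sketch of hand labelling/tracking and local vs. global activity classification.

def assign_hand_labels(prev_tracks, current_boxes, iou, next_id, iou_min=0.3):
    """Carry a reference label forward to the best-overlapping box, else start a new label."""
    tracks = {}
    for box in current_boxes:
        best_label, best_score = None, iou_min
        for label, prev_box in prev_tracks.items():
            score = iou(box, prev_box)          # greedy match; conflicts ignored for brevity
            if score > best_score:
                best_label, best_score = label, score
        if best_label is None:
            best_label, next_id = f"hand_{next_id}", next_id + 1
        tracks[best_label] = box
    return tracks, next_id


def classify_activity(box, local_region, global_region, contains):
    # Local activity: hand movement within the predefined local region;
    # global activity: movement within the wider global region, which triggers a notification.
    if contains(local_region, box):
        return "local"
    if contains(global_region, box):
        return "global"
    return "unknown"
```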
7. A system (106), comprising:
a memory (202) storing instructions;
one or more communication interfaces (206); and
one or more hardware processors (204) coupled to the memory (202) via the one or more communication interfaces (206), wherein the one or more hardware processors (204) are configured by the instructions to:
receive a real-time video from one or more overhead cameras installed at an area of interest, the real-time video comprising a plurality of video frames corresponding to a plurality of time frames, wherein the plurality of video frames comprises one or more user entities;
extract the plurality of video frames from the real-time video using a video frame extraction technique;
detect, in each extracted video frame of the plurality of video frames, one or more hands of each user entity of the one or more user entities, and one or more objects present in the one or more hands of each user entity, using a trained Region-based Convolutional Neural Network (R-CNN) model, wherein the trained R-CNN model creates a bounding box around each detected hand of the one or more hands of each user entity and around each object of the one or more objects present in the one or more hands of each user entity;
perform, for each video frame of the plurality of video frames:
compare a current video frame with a previous video frame to obtain a similarity score, wherein the current video frame refers to a video frame currently being handled by the system, and wherein the previous video frame refers to the video frame present before the current video frame;
compare the obtained similarity score with a predefined threshold value, wherein the predefined threshold value is accessed from a database;
upon determining that the obtained similarity score is higher than the predefined threshold value, confirm motion of at least one user entity of the one or more user entities present in the real-time video;
compute an inter-frame intersection over union (IoU) score based on the one or more bounding boxes created for each hand of the one or more hands of each user entity present in the current video frame and the previous video frame, wherein the inter-frame IoU score ensures tagging of each bounding box with a unique identification;
for each tagged bounding box, compute an intra-frame IoU score based on a predefined set of zones;
detect a state of the current video frame from a predefined set of states based on the intra-frame IoU score, wherein the predefined set of states is defined in correspondence with the predefined set of zones;
add the detected state of the current video frame in a list of states maintained for the real-time video, wherein the list of states comprises a state information of each video frame of the plurality of video frames present in the real-time video, and wherein the state information of each video frame is associated with the detected state of a respective video frame; and
determine whether a sequence associated with the list of states corresponds to a valid sequence or an invalid sequence, wherein the sequence corresponds to the valid sequence if the state of each video frame in the list of states follows a valid order amongst a plurality of predefined valid orders, and wherein the sequence corresponds to the invalid sequence if the state of each video frame in the list of states follows an invalid order amongst one or more predefined invalid orders.
8. The system (106) as claimed in claim 7, wherein the one or more hardware processors (204) are configured by the instructions to:
upon determining the invalid sequence, generate an alert for a user, wherein the alert notifies the user about the irregularity observed in the real-time video.
9. The system (106) as claimed in claim 7, wherein the inter-frame IoU score calculation comprises:
determining an area of intersection and an area of union of the one or more bounding boxes of the detected one or more hands of each user entity present in the current video frame and the previous video frame; and
computing the inter-frame IoU score based on the area of intersection and the area of union.
10. The system (106) as claimed in claim 7, wherein the one or more hardware processors are configured by the instructions to:
receive one or more images of the area of interest from the one or more overhead cameras; and
identify a set of zones present in the area of interest based on the one or more images using a pre-trained object detection model, wherein the identified set of zones are stored in the database and are defined as the predefined set of zones.
11. The system (106) as claimed in claim 7, wherein the one or more hardware processors are configured by the instructions to:
detect a barcode of each object of the one or more objects present in each hand of the one or more hands of each user entity, wherein the barcode is detected using an under scanner camera installed in the area of interest;
perform decoding of the barcode of each object to obtain object information of a respective object;
check whether the object information of each barcode matches with the respective object using an image verification algorithm; and
upon determining that the object information of at least one barcode does not match with the respective object, identify a label switch.
12. The system (106) as claimed in claim 7, wherein the one or more hardware processors are configured by the instructions to:
perform labelling of each detected hand of the one or more hands of each user entity, wherein the labelling ensures a reference label is provided for each detected hand;
track movement of each hand of the one or more hands of each user entity based on the reference label of a respective hand using a tracking algorithm;
identify a type of activity performed by each hand of the one or more hands of each user entity based on the tracked movement of the respective hand, wherein the type of activity comprises one of: a local activity, and a global activity, wherein the local activity refers to movement of a hand in a predefined local region present in the area of interest, and wherein the global activity refers to the movement of at least one user entity in a predefined global region present in the area of interest; and
send a notification to the user based on the type of activity, wherein the notification notifies the user about the irregularity observed in hand movement if the type of activity identified is the global activity.
Dated this 22nd Day of September 2023
Tata Consultancy Services Limited
By their Agent & Attorney
(Adheesh Nargolkar)
of Khaitan & Co
Reg No IN-PA-1086
| # | Name | Date |
|---|---|---|
| 1 | 202321063885-STATEMENT OF UNDERTAKING (FORM 3) [22-09-2023(online)].pdf | 2023-09-22 |
| 2 | 202321063885-REQUEST FOR EXAMINATION (FORM-18) [22-09-2023(online)].pdf | 2023-09-22 |
| 3 | 202321063885-PROOF OF RIGHT [22-09-2023(online)].pdf | 2023-09-22 |
| 4 | 202321063885-FORM 18 [22-09-2023(online)].pdf | 2023-09-22 |
| 5 | 202321063885-FORM 1 [22-09-2023(online)].pdf | 2023-09-22 |
| 6 | 202321063885-FIGURE OF ABSTRACT [22-09-2023(online)].pdf | 2023-09-22 |
| 7 | 202321063885-DRAWINGS [22-09-2023(online)].pdf | 2023-09-22 |
| 8 | 202321063885-DECLARATION OF INVENTORSHIP (FORM 5) [22-09-2023(online)].pdf | 2023-09-22 |
| 9 | 202321063885-COMPLETE SPECIFICATION [22-09-2023(online)].pdf | 2023-09-22 |
| 10 | 202321063885-FORM-26 [22-12-2023(online)].pdf | 2023-12-22 |
| 11 | Abstract.jpg | 2024-01-12 |
| 12 | 202321063885-FORM-26 [11-11-2025(online)].pdf | 2025-11-11 |