Abstract: This disclosure relates generally to a method and system for determining activity from video frames. Determining the activity performed by a subject from video frames that include background is challenging due to cluttered backgrounds, occlusions, and viewpoint variations, which affect the accuracy of activity detection. The proposed method creates a bounding box and resizes the bounding box so that the video frames match the training image frames. Further, the method computes a total average distance value and a predefined percentage of the total average distance value for the video frames. The computed percentage of confidence interval is compared with a predefined threshold value for determining the activity performed by the subject in the video frames.
Claims:
1. A processor implemented method for activity determination from video frames wherein the method comprises:
receiving, by a processor, a video stream comprising a plurality of input image frames, which captures an activity performed by a subject;
fetching, by the processor, a plurality of training image frames for determining the activity performed by the subject by analyzing each input image frame among the plurality of input image frames,
wherein, the plurality of training image frames are generated from a set of training data, wherein, each training image frame from the set of training data is labelled with an activity, among a plurality of activities, performed by a reference subject present in each training data;
creating, by the processor, using a single shot detector (SSD) technique, a bounding box around the subject present in each input image frame and the reference subject present in each training image frame from the plurality of training images;
resizing, by the processor, the bounding box, of the subject present in each input image frame and the reference subject present in each training image frame in accordance with a preset resizing criteria for the input image frame and for the training image frame;
detecting, by the processor, keypoint coordinates for the resized bounding box around the subject of each input image frame and for each reference subject present in each training image frame;
extracting, by the processor, a plurality of features from each input image frame and each training image to obtain feature descriptors using the keypoint coordinates, wherein the feature descriptor assigns a numerical description as vectors for each subject of each input image frame and for each reference subject present in each training image frame;
matching, by the processor, the feature descriptor of each input image frame with the feature descriptor of each training image to create a list of matching distances for each input image frame, wherein the least distance in the list of matching distances provides a closest reference subject present in the training image frame corresponding to the subject in an input image frame;
computing by the processor:
a total average distance value for the list of matching distances created for each input image frame;
a predefined percentage of total average distance value for the list of matching distances created for each input image frame;
computing, by the processor, a percentage of confidence interval for each input image frame based on the total average distance value and the predefined percentage of total average distance value; and
comparing, by the processor, the computed confidence interval of each input image frame with a predefined threshold value, wherein the activity performed by the subject in the input image frame is mapped to the activity performed by the reference subject of a corresponding training image frame, if the confidence interval is equal to or greater than the predefined threshold.
2. The method as claimed in claim 1, wherein the total average distance value is computed using the sum of all the distances present in the list of matching distances and the total count of all the elements in the list of matching distances.
3. The method as claimed in claim 1, wherein the predefined percentage of total average distance value is computed using the matching count from the list of matching distances and the predefined percentage count of all the elements in the list of matching distances.
4. The method as claimed in claim 1, wherein the percentage of confidence interval is computed using the total average distance value and the predefined percentage of total average distance value.
5. A system (100), comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive, by a processor (104), a video stream comprising a plurality of input image frames, which captures an activity performed by a subject;
fetch, by the processor (104), a plurality of training image frames for determining the activity performed by the subject by analyzing each input image frame among the plurality of input image frames,
wherein, the plurality of training image frames are generated from a set of training data, wherein, each training image frame from the set of training data is labelled with an activity, among a plurality of activities, performed by a reference subject present in each training data;
create, by the processor (104), using a single shot detector (SSD) technique, a bounding box around the subject present in each input image frame and the reference subject present in each training image frame from the plurality of training images;
resize, by the processor (104), the bounding box, of the subject present in each input image frame and the reference subject present in each training image frame in accordance with a preset resizing criteria for the input image frame and for the training image frame;
detect, by the processor (104), keypoint coordinates for the resized bounding box around the subject of each input image frame and for each reference subject present in each training image frame;
extract, by the processor (104), a plurality of features from each input image frame and each training image to obtain feature descriptors using the keypoint coordinates, wherein the feature descriptor assigns a numerical description as vectors for each subject of each input image frame and for each reference subject present in each training image frame;
match, by the processor (104), the feature descriptor of each input image frame with the feature descriptor of each training image to create a list of matching distances for each input image frame, wherein the least distance in the list of matching distances provides a closest reference subject present in the training image frame corresponding to the subject in an input image frame;
compute, by the processor (104):
a total average distance value for the list of matching distances created for each input image frame;
a predefined percentage of total average distance value for the list of matching distances created for each input image frame;
compute, by the processor (104), a percentage of confidence interval for each input image frame based on the total average distance value and the predefined percentage of total average distance value; and
compare, by the processor (104), the computed confidence interval of each input image frame with a predefined threshold value, wherein the activity performed by the subject in the input image frame is mapped to the activity performed by the reference subject of a corresponding training image frame, if the confidence interval is equal to or greater than the predefined threshold.
6. The system as claimed in claim 5, wherein the total average distance value is computed using the sum of all the distances present in the list of matching distances and the total count of all the elements in the list of matching distances.
7. The system as claimed in claim 5, wherein the predefined percentage of total average distance value is computed using the matching count from the list of matching distances and the predefined percentage count of all the elements in the list of matching distances.
8. The system as claimed in claim 5, wherein the percentage of confidence interval is computed using the total average distance value and the predefined percentage of total average distance value.
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR ACTIVITY DETERMINATION FROM VIDEO FRAMES
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The disclosure herein generally relates to activity determination, and, more particularly, to a method and system for activity determination from video frames.
BACKGROUND
[002] Human activity determination is an area of research interest because of its wide applicability in various applications such as automatic video surveillance, sign language interpretation and human computer interfaces. The main goal of video surveillance is not only to monitor, but also to automate the entire surveillance task for determining certain actions performed by the subject. However, accurate determination of activities performed by the subject is a highly challenging task due to cluttered backgrounds, occlusions, viewpoint variations, and the like. State-of-the-art techniques for human activity determination are not yet reliable and scalable. In such scenarios, a system for determining activity is an emergent solution that provides reliability and broad applicability.
[003] Most of the conventional methods for detecting human action from video data compare training image gestures with the video data based on a matching criteria. However, the training and test images used include the background, which makes detecting the subject in the image difficult due to the challenges mentioned earlier, such as cluttered backgrounds, occlusions, viewpoint variations and the like, thereby affecting the accuracy of activity detection. Recent methods apply a bounding box criteria on the subject present in a training image and a test image to eliminate the background; however, these methods have limitations in resizing the bounding box of the image, affecting the accuracy of activity detection.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for determining activity from video frames is provided. The system includes
a processor, an Input/Output (I/O) interface and a memory coupled to the processor, wherein the memory stores programmed instructions which, when executed by the processor, cause the system to receive a video stream comprising a plurality of input image frames, which captures an activity performed by a subject. Further, a plurality of training image frames are fetched for determining the activity performed by the subject by analyzing each input image frame among the plurality of input image frames, wherein the plurality of training image frames are generated from a set of training data, and each training image frame from the set of training data is labelled with an activity, among a plurality of activities, performed by a reference subject present in each training data. Further, a bounding box is created, using a single shot detector (SSD) technique, around the subject present in each input image frame and around the reference subject present in each training image frame from the plurality of training images. The bounding box of the subject present in each input image frame and of the reference subject present in each training image frame is resized in accordance with a preset resizing criteria for the input image frame and for the training image frame. Furthermore, keypoint coordinates are detected for the resized bounding box around the subject of each input image frame and for each reference subject present in each training image frame. A plurality of features are extracted from each input image frame and each training image to obtain feature descriptors using the keypoint coordinates, wherein the feature descriptor assigns a numerical description as vectors for each subject of each input image frame and for each reference subject present in each training image frame. The feature descriptor of each input image frame is matched with the feature descriptor of each training image to create a list of matching distances for each input image frame, wherein the least distance in the list of matching distances provides a closest reference subject present in the training image frame corresponding to the subject in an input image frame. Furthermore, a total average distance value and a predefined percentage of total average distance value are computed for the list of matching distances created for each input image frame. A percentage of confidence interval is computed for each input image frame based on the total average distance value and the predefined percentage of total average distance value. The confidence interval of each input image frame is compared with a predefined threshold value, wherein the activity performed by the subject in the input image frame is mapped to the activity performed by the reference subject of a corresponding training image frame, if the confidence interval is equal to or greater than the predefined threshold.
[005] In another aspect, a method for determining activity from video frames is provided. The method includes receiving, via one or more hardware processors, a video stream comprising a plurality of input image frames, which captures an activity performed by a subject. Further, a plurality of training image frames are fetched for determining the activity performed by the subject by analyzing each input image frame among the plurality of input image frames, wherein the plurality of training image frames are generated from a set of training data, and each training image frame from the set of training data is labelled with an activity, among a plurality of activities, performed by a reference subject present in each training data. Further, a bounding box is created, using a single shot detector (SSD) technique, around the subject present in each input image frame and around the reference subject present in each training image frame from the plurality of training images. The bounding box of the subject present in each input image frame and of the reference subject present in each training image frame is resized in accordance with a preset resizing criteria for the input image frame and for the training image frame. Furthermore, keypoint coordinates are detected for the resized bounding box around the subject of each input image frame and for each reference subject present in each training image frame. A plurality of features are extracted from each input image frame and each training image to obtain feature descriptors using the keypoint coordinates, wherein the feature descriptor assigns a numerical description as vectors for each subject of each input image frame and for each reference subject present in each training image frame. The feature descriptor of each input image frame is matched with the feature descriptor of each training image to create a list of matching distances for each input image frame, wherein the least distance in the list of matching distances provides a closest reference subject present in the training image frame corresponding to the subject in an input image frame. Furthermore, the method computes a total average distance value and a predefined percentage of total average distance value for the list of matching distances created for each input image frame. A percentage of confidence interval is computed for each input image frame based on the total average distance value and the predefined percentage of total average distance value. The confidence interval of each input image frame is compared with a predefined threshold value, wherein the activity performed by the subject in the input image frame is mapped to the activity performed by the reference subject of a corresponding training image frame, if the confidence interval is equal to or greater than the predefined threshold.
[006] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[007] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[008] FIG. 1 illustrates an exemplary block of a system for determining activity from video frames, in accordance with some embodiments of the present disclosure.
[009] FIGS. 2a and 2b illustrate a flow diagram for determining activity from video frames using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[010] FIGS. 3a and 3b illustrate a high level architecture for determining activity from video frames, in accordance with some embodiments of the present disclosure.
[011] FIG. 4 illustrates experimental results of example activities determined from the video frame using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[012] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[013] The embodiments herein provide a method and system for activity determination from video frames. The method and the system provide a mechanism for determining the activity performed by a subject from video frames. The system herein may be interchangeably referred to as the activity determination system. The method disclosed provides a bounding box based approach, enabling flexibility in the resizing of the bounding box around the subjects in an input or test image frame and a training image frame. The bounding box approach disclosed provides increased scalability and higher accuracy when computing the percentage of confidence interval for mapping the activity in a video stream with the training images.
[014] Referring now to the drawings, and more particularly to FIG. 1 through 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[015] FIG. 1 illustrates an exemplary block of a system for determining activity from video frames, in accordance with some embodiments of the present disclosure. In an embodiment, the activity determination system 100 includes processor(s) 104, communication interface device(s), alternatively referred to as input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the processor(s) 104. The processor(s) may be alternatively referred to as one or more hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 104 is configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the activity determination system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[016] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for receiving the video stream.
[017] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 102 may include a repository 108. The memory 102 may further comprise information pertaining to input(s)/output(s) of each step performed by the system 100 and methods of the present disclosure.
[018] The repository 108 may store the video frames received from an external source and the training images. The repository 108 may be external to the activity determination system 100 or internal to the activity determination system 100 (as shown in FIG. 1). The repository 108, coupled to the activity determination system 100, may store the video frame to be processed in the system for determining the activity performed by the subject. The video streams are received offline via an external source such as a video surveillance system.
[019] FIGS. 2a and 2b illustrate a flow diagram for determining activity from video frames using the system of FIG. 1, in accordance with some embodiments of the present disclosure. The steps of the method 200 of the flow diagram will now be explained with reference to the components or blocks of the system 100 in conjunction with the example architecture of the system as depicted in FIGS. 3a and 3b. Here, FIGS. 3a and 3b illustrate a high level architecture for determining activity from video frames, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more processors 104 and is configured to store instructions for execution of steps of the method 200 by the one or more processors 104. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[020] At step 202 of the method 200, the processor 104 is configured to receive a video stream comprising a plurality of input image frames, which captures an activity performed by a subject. Initially, the system receives video streams that are captured using a motion capturing device, and these video streams are stored in the repository for processing. Further, each input image frame of the video streams is analyzed to determine the activity performed by the subject. The system comprises two components, a training model and a processing model, each of which is discussed further below.
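As an illustrative, non-limiting sketch of this step, the stored video stream may be read frame by frame with the OpenCV library; the helper name read_input_frames is an assumption for illustration only, and the file name is taken from the pseudo logic given later:
import cv2

def read_input_frames(video_path):
    # Open the stored video stream and collect each input image frame
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        grabbed, frame = capture.read()
        if not grabbed:
            break
        frames.append(frame)
    capture.release()
    return frames

input_frames = read_input_frames("Example.mp4")  # illustrative file name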
[021] At step 204 of the method 200, the processor 104 is configured to fetch a plurality of training image frames for determining the activity performed by the subject by analyzing each input image frame among the plurality of input image frames. The plurality of training image frames are generated from a set of training data. Here, each training image frame from the set of training data is labelled with an activity, among a plurality of activities, performed by a reference subject present in each training data. Reference is now made to FIGS. 3a and 3b, which depict generation of training data from video streams using the training model. The training model of the system utilizes example training videos captured through an external source, where each video spans 2 to 3 minutes. In a training video, the reference subject performs one activity among the plurality of activities, which is further utilized for labeling the training images. Further, the captured training video is processed to extract the plurality of training image frames for labelling the set of activities among the plurality of activities, which includes walking, sleeping, running, standing, talking on the phone and taking medication, as required during the daily living routine of a human. For example, if there is a training image frame where the reference subject is performing the activity "Sitting", "Standing", "Walking" or "talking on telephone", the training image frame is labelled with "Sitting", "Standing", "Walking" or "Talking on Phone", which is further utilized for processing any input video stream for activity detection. Further, each activity, such as "sitting", may have various classes of postures based on the subject. For example, the subject performing the activity sitting may have a bent posture, a straight posture or a slanting posture. It is to be understood that these varying classes of postures are applicable to the activity performed by the subject. These labelled training image frames are stored in the repository as training data. Further, each training image frame of the training data is created with a bounding box around the reference subject for obtaining coordinates. The obtained bounding box coordinates are the four corners of each image frame, namely startY, endY, startX and endX.
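A minimal sketch of how the labelled training image frames may be generated, assuming each training video is labelled with a single activity; the mapping of file names to activities and the helper name build_training_frames are illustrative assumptions:
import cv2

def build_training_frames(labelled_videos):
    # labelled_videos: illustrative mapping of training video file -> activity label
    training_set = []
    for video_path, activity_label in labelled_videos.items():
        capture = cv2.VideoCapture(video_path)
        while True:
            grabbed, frame = capture.read()
            if not grabbed:
                break
            # every frame extracted from a training video inherits that video's activity label
            training_set.append((frame, activity_label))
        capture.release()
    return training_set

training_set = build_training_frames({"walking.mp4": "Walking", "sitting.mp4": "Sitting"})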
[022] At step 206 of the method 200, the processor 104 is configured to create a bounding box around the subject present in each input image frame and the reference subject present in each training image frame from the plurality of training images. A single shot detector (SSD) technique is used for detecting the subject present in each input image frame in real-time. The SSD technique takes one single shot to detect the multiple subjects present within the image. Here, in each input image the SSD detects the subject, thereby creating a bounding box with four corners around the subject present in each input image frame. Referring to the above example, the bounding box created around the subject in each input image frame and the reference subject present in each training image frame reduces noise by leaving out the background. Here, the bounding box will be created on each input image frame if there exists a "subject", and if there is no "subject" present in the referred input image frame then the input image frame is discarded from further processing. Similarly, the bounding box will be created on each training image frame from the plurality of training image frames. The bounding box on each image frame improves the accuracy of detecting the subject present in each image frame, irrespective of whether it is an input image frame or a training image frame. MobileNet-SSD Caffe library components, which are well known in the art, are utilized for creating the bounding box through a python program as mentioned below:
createboundedbox(InputImageFrame):
    # crop the frame to the bounding box region returned by the SSD detector
    boundedbox = InputImageFrame[startY:endY, startX:endX]
    return boundedbox
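The following is a hedged sketch of how the SSD-based bounding box may be obtained with OpenCV's dnn module and the publicly available MobileNet-SSD Caffe model; the model file names, the person class index and the confidence cut-off are assumptions taken from the common public release, not values fixed by this disclosure:
import cv2
import numpy as np

# Assumed file names of the public MobileNet-SSD Caffe release
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt", "MobileNetSSD_deploy.caffemodel")
PERSON_CLASS_ID = 15  # index of "person" in the MobileNet-SSD label list (assumption)

def detect_and_crop_subject(frame, min_confidence=0.5):
    # Detect the subject in one shot and return the cropped bounding box region
    (h, w) = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        class_id = int(detections[0, 0, i, 1])
        if class_id == PERSON_CLASS_ID and confidence > min_confidence:
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")
            return frame[startY:endY, startX:endX]
    return None  # the frame is discarded when no subject is detected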
[023] At step 208 of the method 200, the processor 104 is configured to resize the bounding box of the subject present in each input image frame and of the reference subject present in each training image frame in accordance with a preset resizing criteria for the input image frame and for the training image frame. The bounding box created on the subject present in each input image frame and on the reference subject present in the training image frame is utilized for resizing each image frame. The bounding box will ensure each training image frame is resized to twice the size of the input image frame. The resizing of each input image frame and each training image frame varies based on the frame width of each input image frame. Further, resizing each input image frame and each training image frame increases the accuracy of determining the activity. Here, the preset criteria is derived by a trial and error method, performing various iterations with subject matter experts. For example, if the input image frame has a width of 320, then the training image frame will be resized to 4 times its original size and the input image frame will be resized to twice its original size. The pseudo logic is depicted below:
For each TrainingImage in TrainingSetImages
    TrainingImage = boundedbox(TrainingImage)
    TrainingImage = imgresize(TrainingImage, "trainvideo")
    TrainingSet[i, j] = (TrainingImage[i], TrainingActivity[j])
The above pseudo logic depicts that the method fetches each training image frame for each input image frame, such that each training image frame is loaded for processing along with each input image frame. The training data is stored in an array structure for loading into the memory for processing. Further, the bounding box is created for each training image frame for extracting the activity from each training image frame. Further, the bounding box of each training image frame is resized, and the resized bounding box and the activity performed by the reference subject are stored in the array structure. The proposed approach uses OpenCV library components: the video stream is first converted into bounded box input image frames, and each bounded box frame is then resized and converted into an input image file. The pseudo logic is represented below:
VideoData = LoadVideoFile("Example.mp4")
For each VideoFrame in VideoData
    VideoFrame = boundedbox(VideoFrame)
    VideoFrame = imgresize(VideoFrame, "realvideo")
    VideoImageFile = Cnvt(VideoFrame, "file")
    PerformAnalysis(VideoImageFile, TrainingSet)
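A hedged sketch of the imgresize helper referenced above, assuming the preset criteria stated for an input frame width of 320 (training image scaled four times, input image scaled twice); scale factors for other widths are not specified here and would need to be derived in the same trial and error manner:
import cv2

def imgresize(frame, frame_type, input_frame_width=320):
    # Preset resizing criteria (assumption): for an input frame width of 320,
    # the training image is scaled 4x and the input (real video) image is scaled 2x,
    # keeping the training image at twice the size of the input image.
    if frame_type == "trainvideo":
        scale = 4
    else:  # "realvideo"
        scale = 2
    return cv2.resize(frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)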
[024] At step 210 of the method 200, the processor 104 is configured to detect keypoint coordinates for the resized bounding box around the subject of each input image frame and for each reference subject present in each training image frame. The ORB library technique, which is a fusion of the FAST keypoint detector and the BRIEF descriptor, is utilized to detect the keypoint coordinates. The keypoint coordinates are obtained for the subject present in each input image frame and for the reference subject present in each training image frame by marking the corners, which gives a clear view of the subject. Further, the extracted keypoint coordinates are utilized to track and map the subject present in each input image frame and each training image frame. The key advantage of the ORB library technique is in finding variations between each input image frame and the training image frame under specific rotations. To handle rotational invariance for the subject present in each input image frame and the reference subject present in each training image frame, the intensity weighted centroid of the patch is computed with the located corner at the center. The direction of the vector from this corner point to the centroid gives the orientation. To improve the rotation invariance, moments are computed with x and y within a circular region of radius r, where r is the size of the patch.
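A minimal sketch of the keypoint detection step using OpenCV's ORB implementation; the number of features is an illustrative parameter and not prescribed by this description:
import cv2

orb = cv2.ORB_create(nfeatures=500)  # fusion of the FAST keypoint detector and the BRIEF descriptor

def detect_keypoints(bounded_frame):
    # Returns the ORB keypoint coordinates and binary descriptors for a resized bounding box image
    gray = cv2.cvtColor(bounded_frame, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors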
[025] At step 212 of the method 200, the processor 104 is configured to extract a plurality of features from each input image frame and each training image to obtain feature descriptors using the keypoint coordinates. The feature descriptor assigns a numerical description as vectors for each subject of each input image frame and for each reference subject present in each training image frame. ORB is used to find the keypoint coordinates and the feature descriptor for the training data, and the same approach is used again to find the keypoint coordinates and the feature descriptor for the video stream. The keypoint coordinates and the feature descriptors of each input image frame and each training image frame are matched against each other to find the distance between keypoints using a brute force matching technique. The brute force matcher takes the descriptor of one feature of each training image frame and matches it with all the features in each input image frame using a distance calculation, wherein the distance is computed using the Hamming distance and the closest one is returned.
PerformAnalysis(VideoImageFile, TrainingSet)
    DefinedThreshold = 10
    For each TrainingImage in TrainingSet
        TrainingImgtoAnalyse = TrainingImage[i]
        GetActivity = TrainingSet[j]
        VideoImagetoAnalyse = VideoImageFile
        OrbVideo = load(ORB)
        KeyPointTraining, DescriptorTraining = OrbVideo.detectandcompute(TrainingImgtoAnalyse)
        KeyPointVideoImage, DescriptorVideoImage = OrbVideo.detectandcompute(VideoImagetoAnalyse)
        Getmatches = BruteForceMatch(DescriptorTraining, DescriptorVideoImage)
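Continuing the sketch above, the brute force matching step may be realised with OpenCV's BFMatcher using the Hamming distance; the crossCheck setting is an assumption for one-to-one matches and is not fixed by this description:
import cv2

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def match_descriptors(descriptor_training, descriptor_video):
    # Each returned match carries a Hamming distance; sorting ascending puts the closest matches first
    matches = matcher.match(descriptor_training, descriptor_video)
    return sorted(matches, key=lambda m: m.distance)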
[026] At step 214 of the method 200, the processor 104 is configured to match the feature descriptor of each input image frame with the feature descriptor of each training image to create a list of matching distances for each input image frame, wherein the least distance in the list of matching distances provides a closest reference subject present in the training image frame corresponding to the subject in an input image frame. After running the brute force matcher, the results obtained from matching the feature descriptor of each input image frame with the feature descriptor of each training image form the list of matching distances for each input image frame, and the least distance in this list identifies the closest reference subject present in the training image frame corresponding to the subject in the input image frame.
[027] At step 216 of the method 200, the processor 104 is configured to compute the total average distance value for the list of matching distances created for each input image frame, and a predefined percentage of total average distance value for the list of matching distances created for each input image frame. The predefined threshold can be defined as the minimum distance at which the accuracy of activity detection is high. To obtain a relevant threshold, the sum of all the distances is obtained, which is normally a float value. The distances from the brute force matcher come as a List object of Python. First, all the distances present in the list of matching distances are summed. Second, all the elements in the list of matching distances are counted. Further, the total average distance value is computed using the sum of all the distances present in the list of matching distances and the total count of all the elements in the list of matching distances.
TotalAverageDistance = Sum (List of distances)/Count (List of distances)
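Expressed as a minimal sketch, assuming matches is the list returned by the brute force matcher sketched above:
def total_average_distance(matches):
    # Sum of all distances in the list of matching distances divided by the count of its elements
    distances = [m.distance for m in matches]
    return sum(distances) / len(distances)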
[028] At step 218 of the method 200, the processor 104 is configured to compute a percentage of confidence interval for each input image frame based on the total average distance value and the predefined percentage of total average distance value. As a first step, the predefined percentage count of the list of matching distances is calculated. For example, if the distances list has 100 total elements, the 25% matches count will be 25, considering 25% as one of the predefined percentages of the total average distance value.
25percentDistanceListCount = Count(List of Distances) * 0.25
Secondly, the 25% total distance, that is, the sum of the distances over the first 25% of the list of matching distances, is computed through the following pseudo code:
mycount = 0
25percentTotalDistance = 0
for g in matches:
    if mycount == 25percentDistanceListCount:
        break
    25percentTotalDistance += g.distance
    mycount = mycount + 1
Further, the predefined percentage of total average distance value is computed from the 25% total distance and the predefined percentage (25%) count of all the elements in the list of matching distances:
25percentAverageDistance = 25percentTotalDistance /25percentDistanceListCount
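A hedged sketch of the same computation, assuming matches is sorted by ascending distance as in the matching sketch above and 25% is the predefined percentage; the guard against an empty list is an added assumption:
def percent_average_distance(matches, percent=0.25):
    # Count corresponding to the predefined percentage of the list of matching distances
    percent_count = max(1, int(len(matches) * percent))
    # Sum of the distances of the first percent_count matches
    percent_total = sum(m.distance for m in matches[:percent_count])
    return percent_total / percent_count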
[029] At step 220 of the method 200, the processor 104 is configured to compare, the computed confidence interval of each input image frame with a predefined threshold value, wherein the activity performed by the subject in the input image frame is mapped to the activity performed by the reference subject of a corresponding training image frame, if the confidence interval is equal or greater than the predefined threshold. Further, the percentage of confidence interval is computed using the total average distance value and the predefined percentage of total average distance value.
POC = (TotalAverageDistance – 25percentAverageDistance)/TotalAverageDistance
If the percentage of confidence interval exceeds the predefined threshold, which is 30% and is derived by a trial and error method by the domain experts, then the activity determined from the training image frame and the activity in the input image frame are closely identical.
If POC >= 0.30:
getthresholdvalue = 25percentAverageDistance
This percentage of confidence interval ensures the match between the activity performed by the reference subject present in the training image frame and the activity performed by the subject present in the input image frame. The percentage of confidence interval is obtained after going through several iterations. For an accuracy of 95% in activity detection, the percentage of confidence interval is 0.30. The input image frames whose matches meet the threshold, based on the percentage of confidence interval, are ideal image matches for which the subject present in the input image is tracked. Once the match for the subject present in the image is determined, the activity is recorded. This process is iteratively repeated for all video frame images, so that the training data is scanned against each input image frame of the video stream.
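Bringing the values together, a hedged sketch of the comparison step, assuming the 0.30 threshold stated above and the helper functions sketched earlier; the function name determine_activity is illustrative:
POC_THRESHOLD = 0.30  # predefined threshold derived by trial and error, per the description

def determine_activity(matches, training_activity_label):
    # Percentage of confidence interval from the total average and 25% average distances
    total_avg = total_average_distance(matches)
    percent_avg = percent_average_distance(matches)
    poc = (total_avg - percent_avg) / total_avg
    if poc >= POC_THRESHOLD:
        # The input frame's activity is mapped to the training frame's labelled activity
        return training_activity_label
    return None  # no reliable match against this training image frame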
[030] FIG. 4 illustrates experimental results of example activities determined from the video frame using the system of FIG. 1, in accordance with some embodiments of the present disclosure. This graphical representation exemplifies the activities performed by the subject, which include sitting, standing and walking, detected from each input image frame from the plurality of input image frames. This experimental data is captured as a .csv file along with three attributes that include an activity, an activity instance and a frame duration from each video frame.
[031] The embodiments of the present disclosure herein address the problem of determining activity from video frames. The embodiment thus provides a method for determining the activity performed by the subject in each input image frame by matching against each training image frame. This matching criteria thereby increases the accuracy of determining the activity performed by the subject in each input image frame of the video stream using the list of matching distances. Also, the SSD technique utilized for creating the bounding box around the subject present in each input image frame and each training image frame helps in removing the background noise and occlusions. Further, each training image frame is processed by resizing it appropriately with each input image frame based on the preset criteria. This resizing mechanism increases the accuracy of detecting the activity performed by the subject, where the preset criteria is derived by conducting several iterations of experimental results. The distance measurement calculated using the brute force matcher provides distances between the input image frame and the training image frame based on the percentage of confidence interval, which increases the accuracy to approximately 85%. Further, the method also confirms the matching criteria for the list of matching distances by computing the percentage of confidence interval for each input image frame using the total average distance value and the predefined percentage of total average distance value. These computed values help in determining the activity performed by the subject in an efficient way. Further, the computed confidence interval of each input image frame is compared with the predefined threshold value derived by conducting strategic experimental analysis. Therefore, each sequential step in the proposed method helps in determining the activity performed by the subject in each image frame in a scalable and efficient way.
[032] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[033] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[034] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[035] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[036] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[037] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.