Abstract: Selecting the right features for the training dataset is a key factor in improving the accuracy of human activity detection from a live video stream. Embodiments of the present disclosure provide a method and system for generating an image feature model for human activity recognition. The method generates the image feature model comprising a set of customized features, which enables a fusion of multiple features derived from a training video having a plurality of human activities of interest. The derived set of customized features for each activity comprises a pre-defined number of pose points, corner points, a histogram, and edges. Further, each feature is compressed to generate a four feature point compressed representation for each activity. Thus, the image feature model comprising the compressed feature representation of all activities of interest trains an ML model for human activity recognition. Once trained, the ML model can accurately detect human activity in a video for real time applications. [To be published with FIG. 4]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR GENERATING IMAGE FEATURE
MODEL FOR HUMAN ACTIVITY RECOGNITION
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.
TECHNICAL FIELD
[001] The embodiments herein generally relate to image analysis and, more particularly, to a method and system for generating an image feature model for human activity recognition.
BACKGROUND
[002] Computer-vision based methods using video cameras to detect the liveliness of a human by detecting specific labelled activities such as eating, walking, sleeping, etc., are gaining importance to enable remote unobtrusive monitoring, specifically of elderly people.
[003] Research in human activity recognition attempts various approaches aiming to improve the accuracy of human activity recognition systems. Analyzing live videos to identify the human activity is one of the popular traditional approaches. However, direct video monitoring is invasive and affects the privacy of the subject being monitored. Neural network based approaches, as in all other fields, are gaining importance in automatic human activity recognition by extracting information from the video frames rather than relying on direct video monitoring. Conventional neural network based methods for human activity detection rely on standard feature extraction or standard descriptors. However, selecting the right features for the training dataset is one key factor in improving the accuracy of human activity detection from a live video stream. Moreover, such methods rely on standard feature extractors that work on fixed-size input images, which poses a limitation.
SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
[005] For example, in one embodiment, a method for generating image feature model for human activity recognition is provided.
[006] The method includes receiving a plurality of images identified as training images for the image feature model generation, wherein each of the plurality of images is a training image comprising a unique human activity among a plurality of human activities of interest.
[007] Further, the method includes localizing each of the plurality of human activities in each of the plurality of images by marking a bounding box around each of the plurality of human activities.
[008] Further, the method includes resizing an image within the bounding box of each of the plurality of images to generate a plurality of resized images based on a preset scaling factor, wherein each of the plurality of resized images has a non-uniform size.
[009] Further, the method includes generating, a set of customized features for each of the plurality of resized images, comprising: extracting a pose point array corresponding to the human activity in a resized image among the plurality of resized images, wherein the pose point array comprises a set of pose point coordinate pairs corresponding to a plurality of predefined parts of a human present in the human activity; obtaining a histogram providing pixel distribution of the resized image; extracting a corner point array corresponding to the human activity in the resized image, wherein the corner point array comprises a set of corner coordinate pairs corresponding to a plurality of corner points identified in the resized image; obtaining a set of edges in the resized image.
[0010] Furthermore, the method includes compressing each feature from the set of customized features extracted for each of the plurality of resized images to generate a plurality of summarized feature point representations, wherein each summarized feature point representation among the plurality of summarized feature point representations corresponds to the unique human activity.
[0011] Further, the method includes generating an image feature model based on the plurality of summarized feature point representations representing the plurality of human activities of interest and training a Machine Learning (ML) model for recognizing the plurality of human activities using the image feature model.
[0012] Furthermore, the method includes recognizing a human activity in image frames of a live video using the trained ML model by extracting a
summarized feature point representation for the human activity present in each of the image frames.
[0013] In another aspect, a system for generating an image feature model for human activity recognition is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to receive a plurality of images identified as training images for the image feature model generation, wherein each of the plurality of images is a training image comprising a unique human activity among a plurality of human activities of interest.
[0014] Further, localize each of the plurality of human activities in each of the plurality of images by marking a bounding box around each of the plurality of human activities.
[0015] Further, resize an image within the bounding box of each of the plurality of images to generate a plurality of resized images based on a preset scaling factor, wherein each of the plurality of resized images has a non-uniform size.
[0016] Further, generate, a set of customized features for each of the plurality of resized images, comprising: extracting a pose point array corresponding to the human activity in a resized image among the plurality of resized images, wherein the pose point array comprises a set of pose point coordinate pairs corresponding to a plurality of predefined parts of a human present in the human activity; obtaining a histogram providing pixel distribution of the resized image; extracting a corner point array corresponding to the human activity in the resized image, wherein the corner point array comprises a set of corner coordinate pairs corresponding to a plurality of corner points identified in the resized image; obtaining a set of edges in the resized image.
[0017] Furthermore, compress each feature from the set of customized features extracted for each of the plurality of resized images to generate a plurality of summarized feature point representations, wherein each summarized feature
point representation among the plurality of summarized feature point
representations corresponds to the unique human activity.
[0018] Further, generate an image feature model based on the plurality of summarized feature point representations representing the plurality of human activities of interest and train a Machine Learning (ML) model for recognizing the plurality of human activities using the image feature model.
[0019] Furthermore, recognize a human activity in image frames of a live video using the trained ML model by extracting a summarized feature point representation for the human activity present in each of the image frames.
[0020] In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors cause a method for generating an image feature model for human activity recognition to be performed.
[0021] The method includes receiving a plurality of images identified as training images for the image feature model generation, wherein each of the plurality of images is a training image comprising a unique human activity among a plurality of human activities of interest.
[0022] Further, the method includes localizing each of the plurality of human activities in each of the plurality of images by marking a bounding box around each of the plurality of human activities.
[0023] Further, the method includes resizing an image within the bounding box of each of the plurality of images to generate a plurality of resized images based on a preset scaling factor, wherein each of the plurality of resized images has a non-uniform size.
[0024] Further, the method includes generating, a set of customized features for each of the plurality of resized images, comprising: extracting a pose point array corresponding to the human activity in a resized image among the plurality of resized images, wherein the pose point array comprises a set of pose point coordinate pairs corresponding to a plurality of predefined parts of a human present in the human activity; obtaining a histogram providing pixel distribution of the resized image; extracting a corner point array corresponding to the human activity
in the resized image, wherein the corner point array comprises a set of corner coordinate pairs corresponding to a plurality of corner points identified in the resized image; obtaining a set of edges in the resized image.
[0025] Furthermore, the method includes compressing each feature from the set of customized features extracted for each of the plurality of resized images to generate a plurality of summarized feature point representations, wherein each summarized feature point representation among the plurality of summarized feature point representations corresponds to the unique human activity.
[0026] Further, the method includes generating an image feature model based on the plurality of summarized feature point representations representing the plurality of human activities of interest and training a Machine Learning (ML) model for recognizing the plurality of human activities using the image feature model.
[0027] Furthermore, the method includes recognizing a human activity in image frames of a live video using the trained ML model by extracting a summarized feature point representation for the human activity present in each of the image frames.
[0028] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] The accompanying drawings, which are incorporated in and
constitute a part of this disclosure, illustrate exemplary embodiments and, together
with the description, serve to explain the disclosed principles:
[0030] FIG. 1 is a functional block diagram of a system for generating image
feature model for human activity recognition, in accordance with some
embodiments of the present disclosure.
[0031] FIG. 2A and FIG. 2B (collectively referred to as FIG. 2) are a flow
diagram illustrating a method for generating image feature model for human activity
recognition, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0032] FIGS. 3A through 3G (collectively referred to as FIG. 3) are example images depicting output of the system 100 at various stages of image feature model generation for human activity recognition, in accordance with some embodiments of the present disclosure.
[0033] FIG. 4 is a functional block diagram depicting process flow for human activity recognition in a live video stream, in accordance with some embodiments of the present disclosure.
[0034] FIG. 5 depicts recognizing of a human activity in the live video using the trained ML model, in accordance with some embodiments of the present disclosure.
[0035] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION OF EMBODIMENTS
[0036] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[0037] Embodiments of the present disclosure provide a method and system for generating an image feature model for human activity recognition. The method generates the image feature model comprising a set of customized features, which enables a fusion of multiple features derived from a training video having a plurality of human activities of interest. The derived set of customized features for each activity comprises a pre-defined number of pose points, corner points, a histogram, and edges. Further, each feature is compressed to generate a four feature point compressed representation for each activity. Thus, the image feature model comprising the compressed feature representation of all activities of interest trains a Machine Learning (ML) model for human activity recognition. Once trained, the ML model can accurately detect a human activity in a live video for real time applications based on the extracted customized feature set for the human activity present in the frames of the video.
[0038] Referring now to the drawings, and more particularly to FIGS. 1 through 5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[0039] FIG. 1 is a functional block diagram of a system 100 for generating image feature model for human activity recognition, in accordance with some embodiments of the present disclosure.
[0040] In an embodiment, the system 100 includes a processor(s) 104, communication interface device(s), alternatively referred to as input/output (I/O) interface(s) 106, and one or more data storage devices or a memory 102 operatively coupled to the processor(s) 104. The system 100 with one or more hardware processors is configured to execute functions of one or more functional blocks of the system 100.
[0041] Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-
readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, and the like.
[0042] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface to display the generated target images and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular and the like. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting to a number of external devices or to another server or devices.
[0043] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. Further, the memory 102 includes a database 108 that stores the training video frames, the extracted customized features, and the like. Further, the memory 102 includes modules such as the image feature model (not shown) and a Machine Learning (ML) model (not shown). The database 108 may also store various ML models such as Random Forest and the like. Further, the memory 102 may comprise information pertaining to input(s)/output(s) of each step performed by the processor(s) 104 of the system 100 and methods of the present disclosure. In an embodiment, the database 108 may be external (not shown) to the system 100 and coupled to the system via the I/O interface 106. Functions of the components of the system 100 are explained in conjunction with the flow diagram of FIG. 2, examples in FIG. 3, the live video analysis process of FIG. 4, and the example output in FIG. 5.
[0044] FIG. 2A and FIG. 2B (collectively referred to as FIG. 2) are a flow diagram illustrating a method 200 for generating an image feature model for human
activity recognition, using the system of FIG. 1, in accordance with some embodiments of the present disclosure.
[0045] In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1, the steps of flow diagram as depicted in FIG. 2, examples in FIG. 3, live video analysis process of FIG. 4 and example output in FIG. 5.
[0046] Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[0047] In one example set up of the system 100, the system 100 may be installed in an old age home for remote, non-obtrusive recognition of activities performed by the elderly people living inside the old age home. Initially, the ML model of the system 100 is trained using a training video to learn and detect the human activities of interest, and then the trained ML model is capable of analyzing live video captured by cameras deployed at appropriate locations in the old age home.
[0048] Referring to the steps of the method 200, at step 202 of the method 200, the one or more hardware processors 104 receive a plurality of images identified as training images for the image feature model generation, captured via one or more cameras. Each of the plurality of images is a training image comprising a unique human activity among a plurality of human activities of interest.
[0049] A learning set of images (also referred to as training images, training video, or training dataset hereinafter) for specific human activities of interest, interchangeably referred to as Activities of Interest (AOIs), that may include walking, sitting, eating, reading, etc., is created. One of the ways to extract this learning set of images is to create a small duration video, ideally spanning 2 to 3 minutes, which effectively records the Activities of Interest. Once the training video is made, the training images are extracted. To enable this, for example, a web/standalone application is developed that captures the scenes from the training video. The application holds the set of labels/tags for the Activities of Interest. The three sub-steps followed for creating the training images include:
1. Run the Training video and capture the Video frame which specifies the Activity of Interest.
2. Tag/Label the video frame (image) to specific Activity of Interest. For example, if there is a frame which pictures the person “Walking”, it is tagged or labelled to “Walking” label. Similarly, if there is a frame which pictures person taking medication, then the image is tagged as “Taking Medication”
3. Save the tagged video frames as image files in .PNG/.JPG format. An example training image depicting the "Walking" activity of a person is shown in FIG. 3A.
[0050] The key point to note is that the training images from the training set video can be different from an actual video to be tracked. Once the training images are finalized in terms of the Activities of Interest (walking, sitting, etc.), the real time video or the live video is analyzed for specific activities as per the Activities of Interest specified in the training dataset.
[0051] Once such training images are received, at step 204 the one or more hardware processors 104 localize each of the plurality of human activities in each of the plurality of images by marking a bounding box around each of the plurality of human activities. To effectively identify the human activity of interest captured in the training images, it is critical to understand the activity of the person by rightly eliminating noise (non-human images) and retaining the images that have the object of interest (human/person), which is further processed to identify the activity performed by the human. The bounding box approach enables identifying only those images that have the presence of a human and then creating the bounding box around the human/person in the image. If there are no person(s) in the image or in a video frame, then that image is discarded from usage.
[0052] The bounding box creation step 204 applies to both training and live or real video images. By doing this, the noise of "non-human" images is eliminated. Additionally, the bounding box step improves the accuracy of detecting human activities irrespective of differences between training and live or real frames (images) in the video stream. In an embodiment, YOLO Caffe library components implemented through a Python program are used for bounding box generation. The pseudo code 1 below refers to bounding box creation:
Pseudo code 1:
def createboundedbox(image, startX, startY, endX, endY):
    # Crop the detected person region from the image using the bounding box coordinates
    boundedbox = image[startY:endY, startX:endX]
    return boundedbox
[0053] Thus, the bounding box enables the further processing steps of the method to explicitly focus on the human in the image, since it is important to extract the human activity only as a bounded box. For example, as seen in FIG. 3A, the human or the subject in the image is walking, with considerable space in the image taken by the background scene, herein a room, which is not of interest. The bounding box enables specifically carving out the subject in the image, as depicted in FIG. 3B.
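By way of illustration only, a minimal Python sketch of the person localization underlying pseudo code 1 is given below. It uses OpenCV's dnn module with a generic pretrained Caffe detector as an assumed stand-in for the YOLO Caffe components mentioned above; the model file names, the "person" class index, and the confidence threshold are illustrative assumptions, not the disclosed implementation.
Illustrative example (Python):
import cv2
import numpy as np

PROTOTXT = "MobileNetSSD_deploy.prototxt"    # assumed detector definition file
WEIGHTS = "MobileNetSSD_deploy.caffemodel"   # assumed pretrained weights
PERSON_CLASS_ID = 15                         # "person" index in the assumed VOC label map

net = cv2.dnn.readNetFromCaffe(PROTOTXT, WEIGHTS)

def createboundedbox(image, conf_threshold=0.5):
    """Return the image cropped to the most confident 'person' detection, or None."""
    (h, w) = image.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()
    best = None
    for i in range(detections.shape[2]):
        class_id = int(detections[0, 0, i, 1])
        confidence = float(detections[0, 0, i, 2])
        if class_id == PERSON_CLASS_ID and confidence > conf_threshold:
            if best is None or confidence > best[0]:
                box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
                best = (confidence, box.astype(int))
    if best is None:
        return None                           # no person present: frame is discarded
    startX, startY, endX, endY = best[1]
    return image[startY:endY, startX:endX]    # bounded box region, as in pseudo code 1
A YOLO-based detector, as named in the embodiment above, would differ only in how the raw network output is decoded into boxes.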
[0054] At step 206, the one or more hardware processors 104 resize the image within the bounding box of each of the plurality of images to generate a plurality of resized images based on a preset scaling factor. Each of the plurality of resized images has a non-uniform size, as the size is based on the image space occupied by the human while performing each of the plurality of AOIs. Thus, every activity occupies a varying number of pixels in the image and hence has a different bounding box size. The resizing of the bounding box improves accuracy of human activity detection. For example, each training image is resized to twice the size of the real video images. The resizing of the training images and live/real video images varies based on the frame width of the real video. FIG. 3C depicts the resized image.
[0055] For example, if the video frame image comes with a frame width of 320, the training set image is resized to 4.0 times (4X) (predefined scaling factor) of the original size and the real video frame image is resized to 2.0 times (2X) (predefined scaling factor) of the original image. These values are threshold values and vary based on real world scenarios, which means they are based on the clarity of the bounding box size of the training image. The 4X for the training set and 2X for the live video have been arrived at post experimentation, based on the training video and live video output after applying the bounding box. Thus, the scaling factor is set as a "parameter" value, which is read by the program automatically based on the bounding box output, which might vary. If the bounding box size varies, the system 100 can change the scaling factor appropriately through the parameter value, wherein this variation brings in flexibility. Further, resizing enables obtaining proper feature points from both training images and live video frames.
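For illustration, a minimal Python sketch of the resizing step is provided below, assuming the scaling factors described above (4X for training images, 2X for live video frames) are supplied as configurable parameters; the imgresize name matches the later pseudo codes, but the parameter handling shown here is an assumption.
Illustrative example (Python):
import cv2

SCALE_FACTORS = {"trainvideo": 4.0, "Realvideo": 2.0}  # assumed parameter values per source type

def imgresize(image, source):
    # Resize a bounded-box image by the preset scaling factor for its source type;
    # the output size still depends on the bounding box, so resized images stay non-uniform.
    factor = SCALE_FACTORS[source]
    h, w = image.shape[:2]
    return cv2.resize(image, (int(w * factor), int(h * factor)), interpolation=cv2.INTER_LINEAR)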
[0056] At step 208, the one or more hardware processors 104 generate a set of customized features for each of the plurality of resized images, wherein generating comprises:
a) Extracting a pose point array corresponding to the human activity in a resized image among the plurality of resized images. The pose point array comprises a set of pose point coordinate pairs corresponding to a plurality of predefined parts of a human present in the human activity. A known human pose estimation approach is used to capture 14 points of the human body. The pose estimation detects key point locations of the human body. Thus, major parts/joints of the body such as the head, shoulders, neck, knee joints, elbow joints, etc. are detected and localized. The pose estimation, for example, utilizes pretrained Multi Person Dataset (MPII™) models trained on the Caffe Deep Learning™ Framework in OpenCV. The extracted pose points are depicted in FIG. 3D and the pseudo code 2 for pose estimation is given below:
Pseudo code 2:
def getPoseEstimation(image):
    # Return the pose point array of 14 (x, y) body key points detected by the pretrained pose model
    PoseBodyPoints = runPoseModel(image)  # placeholder for the pose network call
    return PoseBodyPoints
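For illustration, a minimal Python sketch of such a pose-point extraction with OpenCV's dnn module is given below; the MPII model file names, the network input size, and the confidence threshold are assumptions for illustration and not necessarily those of the disclosed implementation.
Illustrative example (Python):
import cv2

POSE_PROTO = "pose_deploy_linevec_faster_4_stages.prototxt"  # assumed MPII model files
POSE_WEIGHTS = "pose_iter_160000.caffemodel"
N_POSE_POINTS = 14   # predefined body parts (head, neck, shoulders, elbows, wrists, hips, knees, ankles)

pose_net = cv2.dnn.readNetFromCaffe(POSE_PROTO, POSE_WEIGHTS)

def getPoseEstimation(image, conf_threshold=0.1):
    """Return a pose point array of up to 14 (x, y) coordinate pairs for the person."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1.0 / 255, (368, 368), (0, 0, 0), swapRB=False, crop=False)
    pose_net.setInput(blob)
    heatmaps = pose_net.forward()            # shape: (1, parts, H, W)
    points = []
    for i in range(N_POSE_POINTS):
        # The most confident location in each part's heatmap is that part's key point
        _, conf, _, point = cv2.minMaxLoc(heatmaps[0, i, :, :])
        if conf > conf_threshold:
            points.append((int(w * point[0] / heatmaps.shape[3]),
                           int(h * point[1] / heatmaps.shape[2])))
    return points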
b) Obtaining a histogram providing pixel distribution of the resized image. The pixel distribution is understood through a histogram, which is a plot of the intensity distribution of an image. For example, a histogram can be a plot with pixel intensity values (ranging from 0 to 255, not always) on the X-axis and the corresponding number of pixels in the image on the Y-axis. The histogram provides another way of understanding the features of an image, and the customized feature set captures the histogram based features; thus, features such as contrast, brightness, and intensity distribution of that image are captured. FIG. 3E depicts the histogram of the walking person/human depicted in FIG. 3C. For example, when FIG. 3C is a colored image, Red Green Blue (RGB) channels are obtained in the histogram, which is calculated based on the CalcHist function of OpenCV, which takes 4 inputs:
1.) Image
2.) No. of channels. If it is a color image, there are 3 channels denoting RGB.
3.) No. of bins – 256. This is a configurable value; herein, a bin size of 32 is used, and the pixels from the 0-255 representation are put across 32 bins.
4.) Range of values (typically between 0-255, denoting that all pixels are used)
Further, since the pixel values range from 0 to 255, they are normalized in order to limit the scale between 0 and 1. For example, the OpenCV normalization function can be used. The normalized histogram is then flattened into a single set of items, i.e., a vector, through the flatten feature of OpenCV. Provided below is pseudo code 3 for histogram computation and flattening:
Pseudo code 3:
def getHistogramFeatures(image):
    # Compute the histogram, then normalize and flatten it into a feature vector
    HistogramFeatures = CalcHist(image)
    HistogramFeaturesVector = NormalizeandFlatten(HistogramFeatures)
    return HistogramFeaturesVector
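For illustration, a minimal Python sketch of the histogram feature is given below, assuming a 3-channel colour image and the 32-bin configuration described above; whether the disclosed implementation uses one joint 3-D histogram or separate per-channel histograms is not specified, so a joint histogram is assumed here.
Illustrative example (Python):
import cv2

def getHistogramFeatures(image, bins=32):
    """Return a normalized, flattened RGB histogram vector for the resized image."""
    # One joint histogram over the three colour channels, 32 bins per channel
    hist = cv2.calcHist([image], [0, 1, 2], None, [bins, bins, bins],
                        [0, 256, 0, 256, 0, 256])
    # Scale the bin counts into the 0-1 range, then flatten to a single feature vector
    hist = cv2.normalize(hist, hist, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
    return hist.flatten()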
c) Extracting a corner point array corresponding to the human activity in the
resized image. The corner point array comprises a set of corner coordinate
pairs corresponding to a plurality of corner points identified in the resized
image.
The corner points, as depicted in FIG. 3F, are key points to understand the position of the human activity. In an example, the OpenCV Shi-Tomasi Corner Detector™ is used to detect the corner points in the resized image, and typically the 25 best corner points are selected. The pseudo code 4 below refers to corner point extraction.
Pseudo code 4:
def getCorners(image):
    # Detect the strongest Shi-Tomasi corner points (typically the 25 best) in the resized image
    CornerPoints = detectCorners(image, maxCorners=25)  # placeholder for the corner detector call
    return CornerPoints
d) Obtaining a set of edges in the resized image. The edges provide additional key points for understanding, especially, the position of the human activity, as depicted in FIG. 3G. In an example implementation, the OpenCV Canny Edge detector™ is used to detect edge points of the image. The pseudo code 5 below provides edge detection.
Pseudo code 5:
def getEdges(image):
    # Detect the Canny edge map of the resized image
    EdgePoints = detectEdges(image)  # placeholder for the edge detector call
    return EdgePoints
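For illustration, a minimal Python sketch of the corner and edge features is given below, assuming the Shi-Tomasi detector with 25 corners and the Canny detector; the quality level, minimum distance, and Canny thresholds are illustrative assumptions.
Illustrative example (Python):
import cv2

def getCorners(image, max_corners=25):
    """Return up to 25 (x, y) corner coordinate pairs detected by the Shi-Tomasi detector."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=10)
    return [] if corners is None else corners.reshape(-1, 2).astype(int)

def getEdges(image, low=100, high=200):
    """Return the Canny edge map of the resized image (non-zero pixels are edge points)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, low, high)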
[0057] As is well known in the art, for any machine learning based system, feature generation is a critical step. Appropriate generation and selection of features is one of the major factors in the success or accuracy of any ML model, which is trained based on the features selected. The method 200 disclosed herein enables customized feature generation, unlike existing ML based systems that rely on standard RESNet™ or Inception™ or VGG™ models. These existing feature extraction techniques or transfer learning models are specific to an input image size. However, the custom feature set makes the method 200 independent of input image size (both for training images and images from the live video). The flexibility introduced by the customized feature approach, which allows input images of varying sizes, enables the method 200 to process bounding boxes of non-uniform sizes. As explained above, the bounding box size is based on the activity captured in the input image. If bounding boxes are force-fitted on the human activity identified, the accuracy of activity recognition suffers. The method 200 disclosed herein overcomes this disadvantage by providing the capability to process non-uniform bounding boxes by deriving customized features for further processing.
[0058] At step 210, the one or more hardware processors 104 compress each feature from the set of customized features extracted for each of the plurality of resized images to generate a plurality of summarized feature point representations. Each summarized feature point representation among the plurality of summarized feature point representations corresponds to the unique human activity.
[0059] Compressing of each feature from the set of customized features comprises the following (an illustrative sketch is provided after the list):
a) Summing values of the set of pose point coordinate pairs associated with the resized image into a summarized pose point coordinate pair to identify a first feature point in the summarized feature point representation.
b) Summing values of pixel distributions in the histogram (normalized histogram) of the resized image into a summarized histogram value to identify a second feature point in the summarized feature point representation.
c) Summing values of the set of corner coordinate pairs associated with the resized image into a summarized corner coordinate pair to identify a third feature point in the summarized feature point representation.
d) Summing pixel values of a plurality of edge points in the set of edges obtained for the resized image into an edge pixel value to identify a fourth feature point in the summarized feature point representation.
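The sketch below illustrates this compression under the assumption that each of the four features is collapsed to a single scalar, matching the SumPosPoints, SumHistPoints, SumImgCorners, and SumImgEdges columns of Table 1; the helper name compressFeatures is introduced here for illustration only.
Illustrative example (Python):
import numpy as np

def compressFeatures(pose_points, hist_vector, corner_points, edge_map):
    """Return the four-point summarized feature representation for one activity image."""
    sum_pose = float(np.sum(pose_points)) if len(pose_points) else 0.0          # first feature point
    sum_hist = float(np.sum(hist_vector))                                       # second feature point
    sum_corners = float(np.sum(corner_points)) if len(corner_points) else 0.0   # third feature point
    sum_edges = float(np.sum(edge_map))                                         # fourth feature point
    return [sum_pose, sum_hist, sum_corners, sum_edges]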
[0060] Once the compressed customized feature set (the summarized feature point representation) is generated, at step 212 of the method 200, the one or more hardware processors 104 generate the image feature model, as depicted in Table 1, which is generated based on the plurality of summarized feature point representations representing the plurality of human activities of interest and is stored in the database 108. As understood by a person ordinarily skilled in the art, this process is repeated for all the training images to capture the compressed feature set for all of the plurality of activities of interest. Sample compressed features for training images are provided in Table 1 below, and each feature is mapped to an activity label. The example activity labels herein are: 0 - Sitting, 1 - Standing, and 2 - Walking.
TABLE 1:
SumPosPoints SumHistPoints SumImgCorners SumImgEdges ActivityLabel
1898 1.732050776 8210 455940 0
2743 1.732050776 8072 414885 0
2130 1.732050776 7404 341955 1
2813 1.732050776 10259 429420 1
1618 1.732050776 3429 227205 2
2706 1.732050776 11122 411060 2
[0061] The pseudo code 6 below provides table generation for the image feature model:
Pseudo code 6:
For each TrainingImage in TrainingSetImages:
    TrainingImage = createboundedbox(TrainingImage)
    TrainingImage = imgresize(TrainingImage, "trainvideo")
    PosePoints = getPoseEstimation(TrainingImage)
    Histograms = getHistogramFeatures(TrainingImage)
    Corners = getCorners(TrainingImage)
    Edges = getEdges(TrainingImage)
    ImgFeatures.append(PosePoints, Histograms, Corners, Edges)
ImgFeatures.savecsv(filename)
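For illustration, a minimal pandas-based sketch of assembling and saving the feature table of Table 1 is given below; it assumes the helper functions sketched in the earlier sections (getPoseEstimation, getHistogramFeatures, getCorners, getEdges, compressFeatures) and a labelled list of bounded, resized training images.
Illustrative example (Python):
import pandas as pd

def buildImageFeatureModel(training_images, labels, csv_path="ImgFeatures.csv"):
    # training_images: bounded, resized training images; labels: one activity label per image
    rows = []
    for image, label in zip(training_images, labels):
        summarized = compressFeatures(getPoseEstimation(image),
                                      getHistogramFeatures(image),
                                      getCorners(image),
                                      getEdges(image))
        rows.append(summarized + [label])
    df = pd.DataFrame(rows, columns=["SumPosPoints", "SumHistPoints",
                                     "SumImgCorners", "SumImgEdges", "ActivityLabel"])
    df.to_csv(csv_path, index=False)   # persisted image feature model, as in Table 1
    return df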
[0062] Once the compressed feature set is created for each activity, providing the image feature model, at step 214 of the method 200, the one or more hardware processors 104 train the Machine Learning (ML) model using the image feature model for recognizing the plurality of human activities. In an example implementation, the Random Forest technique known in the art is used to build the ML model; however, this is not a limitation, and any supervised machine learning algorithm supporting multi-class classification can be used. A classifier (ML model) is trained to classify the activity in the 'ActivityLabel' column of Table 1, and the data in the remaining columns of Table 1 (representing the image feature model) are used as feature inputs. The pseudo code 7 for training is given below:
Pseudo code 7:
RandForestAlg = RandomForestClassifier(n_estimators=100)
# Train the model using the training set features and activity labels
RandForestAlg.fit(ImgFeatureDF, ImgLabels)
# Save the trained model to disk
filename = 'ImgFeatureModel.sav'
saveModel(RandForestAlg, open(filename, 'wb'))
[0063] The trained ML model, interchangeably referred to as the Image Feature Training Model, is then used on real-world videos to predict the activities of the human in the video frames. This activity completes the training process.
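For illustration, a minimal end-to-end sketch of this training step with scikit-learn is given below; the saveModel helper in pseudo code 7 is not specified, so Python's pickle module is used here as an assumed stand-in, and the CSV file name follows the earlier sketch.
Illustrative example (Python):
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("ImgFeatures.csv")                 # feature table saved earlier
ImgFeatureDF = df[["SumPosPoints", "SumHistPoints", "SumImgCorners", "SumImgEdges"]]
ImgLabels = df["ActivityLabel"]

RandForestAlg = RandomForestClassifier(n_estimators=100)
RandForestAlg.fit(ImgFeatureDF, ImgLabels)          # train on the image feature model

with open("ImgFeatureModel.sav", "wb") as f:        # persist the trained classifier
    pickle.dump(RandForestAlg, f)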
[0064] Once the ML model is trained, then at step 216 of the method 200, the one or more hardware processors 104 recognize a human activity in image frames of a live video using the trained ML model by extracting a summarized feature point representation for the human activity present in each of the image frames. The summarized feature point representation for the live video follows steps similar to the training process.
[0065] FIG. 4 is a functional block diagram depicting process flow for human activity recognition in a live video stream, in accordance with some embodiments of the present disclosure.
[0066] As depicted in FIG. 4, once the trained ML model, also referred to as the image feature model, is created, the live/real time video analysis includes loading video data. Loading video data, in this context, involves loading, through the OpenCV library™, the real/actual video file on which the activity is to be detected against the training set. Once the live video file is loaded into a video data structure, each frame of the video is processed. During each frame processing, the steps mentioned in the training process to generate the summarized feature point representation of each of the live video frames are performed, and the summarized feature point representation is stored in the database 108. The processing of live video frames for human activity detection is provided in pseudo code 8 below:
Pseudo code 8:
VideoData = LoadVideoFile("Example.mp4")
For each VideoFrame in VideoData:
    VideoFrame = createboundedbox(VideoFrame)
    VideoFrame = imgresize(VideoFrame, "Realvideo")
    PosePoints = getPoseEstimation(VideoFrame)
    Histograms = getHistogramFeatures(VideoFrame)
    Corners = getCorners(VideoFrame)
    Edges = getEdges(VideoFrame)
    ImgLiveFeatures.append(PosePoints, Histograms, Corners, Edges)
[0067] Prediction of live video frames: This last step, prediction of live video frames, forms the process of determining the various AOIs, such as Sitting/Walking etc., in the live video. This prediction is enabled by the Human Activity Feature Model which is created from the training set feature data. The pseudo code 9 below provides the prediction logic.
Pseudocode 9:
# Load the saved model from disk
filename = 'ImgFeatureModel.sav'
RandForestAlg = LoadModel(open(filename, 'rb'))
# Predict the activity using the live video frame features
ActivityLabel = RandForestAlg.predict(ImgLiveFeatures)
[0068] The ActivityLabel provides the individual activity predicted for each video frame. For example, if the activity label predicted is "2", the predicted activity is "Walking"; if the activity label predicted is "0", then the predicted activity is "Sitting", and so on.
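A small illustrative mapping from the predicted label back to an activity name, assuming the example label scheme used above (0 - Sitting, 1 - Standing, 2 - Walking), could look as follows; the dictionary and helper name are introduced here for illustration only.
Illustrative example (Python):
ACTIVITY_NAMES = {0: "Sitting", 1: "Standing", 2: "Walking"}  # example label scheme of Table 1

def labelToActivity(activity_label):
    """Return the human-readable activity name for a predicted ActivityLabel."""
    return ACTIVITY_NAMES.get(int(activity_label), "Unknown")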
[0069] Storing the predicted activity in the live output folder: Once the video analysis program predicts the activity on a real-world basis, the predicted activity gets stored in the "LiveVideoOutput" table. As depicted in FIG. 5, the predicted human activity 'walking' in the live video is indicated with a bounding box marking.
[0070] Thus, the method disclosed herein, which trains the ML model based on an image feature model generated using a summarized feature point representation for each AOI, provides enhanced accuracy in human activity recognition or prediction, and the accuracy range goes up to 90-95%.
[0071] This invention enables extraction of very specific features that are involved in a human activity image. The customized features selected herein are very critical in improving the accuracy of human activity recognition or detection, and include detection of poses like hand movement, leg movement, and head movement. The customized features are derived from the localized portion of the image which contains the human, thereby removing any background distortions, and the customized features mentioned above can be extracted from images of any shape, i.e., width and height, which is not possible in existing transfer learning models as they require a specific input size. Moreover, although the localized image has such size variations, the custom feature generation eliminates the size variation factor, thereby improving the accuracy.
[0072] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0073] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[0074] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0075] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are
appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0076] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0077] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor implemented method (200) for human activity recognition, the method comprising:
receiving (202), via one or more hardware processors, a plurality of images identified as training images for an image feature model generation, wherein each of the plurality of images is a training image comprising a unique human activity among a plurality of human activities of interest;
localizing (204), via the one or more hardware processors, each of the plurality of human activities in each of the plurality of images by marking a bounding box around each of the plurality of human activities;
resizing (206), via the one or more hardware processors, image within the bounding box of each of the plurality of images to generate a plurality of resized images based on a preset scaling factor, wherein each of the plurality of resized images have non-uniform size;
generating, (208), via the one or more hardware processors, a set of customized features for each of the plurality of resized images, comprising:
a) extracting a pose point array corresponding to the human
activity in a resized image among the plurality of resized
images, wherein the pose point array comprises a set of pose
point coordinate pairs corresponding to a plurality of
predefined parts of a human present in the human activity ;
b) obtaining a histogram providing pixel distribution of the
resized image;
c) extracting a corner point array corresponding to the human activity in the resized image, wherein the corner point array comprises a set of corner coordinate pairs corresponding to a plurality of corner points identified in the resized image; and
d) obtaining a set of edges in the resized image;
compressing, (210), via the one or more hardware processors, each feature from the set of customized features extracted for each of the plurality of resized images to generate a plurality of summarized feature point representations, wherein each summarized feature point representation among the plurality of summarized feature point representations correspond to the unique human activity;
generating, (212), via the one or more hardware processors, an image feature model based on the plurality of summarized feature point representations representing the plurality of human activities of interest; and
training, (214), via the one or more hardware processors, a Machine Learning (ML) model for, recognizing the plurality of human activities, using the image feature model.
2. The method as claimed in claim 1, wherein the method further comprises recognizing (216), via the one or more hardware processors, a human activity in image frames of a live video using the trained ML model by extracting a summarized feature point representation for the human activity present in each of the image frames.
3. The method as claimed in claim 1, wherein compressing of each feature from the set of customized feature comprises:
summing values of the set of pose point coordinate pairs associated with the resized image into a summarized pose point coordinate pair to identify a first feature point in the summarized feature point representation;
summing values of pixel distributions in the histogram of the resized image into a summarized histogram value to identify a second feature point in the summarized feature point representation;
summing values of the set of corner coordinate pairs associated with the resized image into a summarized corner coordinate pair to identify a third feature point in the summarized feature point representation; and
summing pixel values of a plurality of edge points in the set of edges obtained for the resized image into an edge pixel value to identify a fourth feature point in the summarized feature point representation.
4. A system (100) for human activity recognition, the system (100) comprising:
a memory (102) storing instructions;
one or more Input/Output (I/O) interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the
one or more I/O interfaces (106), wherein the one or more hardware
processors (104) are configured by the instructions to:
receive a plurality of images identified as training images for an image feature model generation, wherein each of the plurality of images is a training image comprising a unique human activity among a plurality of human activities of interest;
localize each of the plurality of human activities in each of the plurality of images by marking a bounding box around each of the plurality of human activities;
resize image within the bounding box of each of the plurality of images to generate a plurality of resized images based on a preset scaling factor, wherein each of the plurality of resized images have non-uniform size;
generate, a set of customized features for each of the plurality of resized images, comprising:
a) extracting a pose point array corresponding to the human activity in a resized image among the plurality of resized images, wherein the pose point array comprises a set of pose point coordinate pairs corresponding to a plurality of predefined parts of a human present in the human activity ;
b) obtaining a histogram providing pixel distribution of the
resized image;
c) extracting a corner point array corresponding to the human activity in the resized image, wherein the corner point array comprises a set of corner coordinate pairs corresponding to a plurality of corner points identified in the resized image; and
d) obtaining a set of edges in the resized image;
compress each feature from the set of customized features extracted for each of the plurality of resized images to generate a plurality of summarized feature point representations, wherein each summarized feature point representation among the plurality of summarized feature point representations correspond to the unique human activity;
generate an image feature model based on the plurality of summarized feature point representations representing the plurality of human activities of interest; and
train a Machine Learning (ML) model for, recognizing the plurality of human activities, using the image feature model.
5. The system (100) as claimed in claim 4, wherein the one or more hardware processors (104) are configured to recognize a human activity in image frames of a live video using the trained ML model by extracting a summarized feature point representation for the human activity present in each of the image frames.
6. The system (100) as claimed in claim 4, wherein the one or more hardware processors (104) are configured to compress each feature from the set of customized features by:
summing values of the set of pose point coordinate pairs associated with the resized image into a summarized pose point coordinate pair to identify a first feature point in the summarized feature point representation;
summing values of pixel distributions in the histogram of the resized image into a summarized histogram value to identify a second feature point in the summarized feature point representation;
summing values of the set of corner coordinate pairs associated with the resized image into a summarized corner coordinate pair to identify a third feature point in the summarized feature point representation; and summing pixel values of a plurality of edge points in the set of edges obtained for the resized image into an edge pixel value to identify a fourth feature point in the summarized feature point representation.