
A System And Method For Detecting A Human In An Image

Abstract: Disclosed is a system for detecting a human in an image. An image capturing module captures the image using a motion sensing device, wherein the image comprises a plurality of pixels having gray scale information and a depth information. The image capturing module further segments the image into a plurality of segments based upon the depth information. An analysis module performs a connected component analysis on a segment in order to segregate the one or more objects into noisy objects and candidate objects. The analysis module further eliminates the noisy objects from the segment using a vertical pixel projection technique. A feature extraction module extracts a plurality of features from the candidate objects. An object determination module evaluates the plurality of features using a Hidden Markov Model (HMM) model in order to determine the candidate objects as one of the human or non-human.


Patent Information

Application #:
Filing Date: 07 February 2014
Publication Number: 39/2015
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Email: ip@legasis.in
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2021-11-26
Renewal Date:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai 400021, Maharashtra, India

Inventors

1. ROY, Sangheeta
Tata Consultancy Services Limited, Building 1B, Ecospace Plot - IIF/12, New Town, Rajarhat, Kolkata - 700156, West Bengal, India
2. CHATTOPADHYAY, Tanushyam
Tata Consultancy Services Limited, Building 1B, Ecospace Plot - IIF/12, New Town, Rajarhat, Kolkata - 700156, West Bengal, India
3. MUKHERJEE, Dipti Prasad
Indian Statistical Institute, 203, Barrackpore Trunk Rd, Kolkata, 700108, Bengal, India

Specification

CLAIMS

WE CLAIM:

1. A method for detecting a human in an image, the method comprising:
capturing the image using a motion sensing device, wherein the image comprises a plurality of pixels having gray scale information and a depth information, and wherein the gray scale information comprises intensity of each pixel corresponding to a plurality of objects in the image, and wherein the depth information comprises a distance of each object of the plurality of objects from the motion sensing device;
segmenting, by a processor, the image into a plurality of segments based upon the depth information of the plurality of objects, wherein each segment of the plurality of segments comprises a subset of the plurality of pixels, and each segment corresponds to one or more objects in the image;
performing connected component analysis on a segment, of the plurality of segments, in order to segregate the one or more objects, present in the segment, into one or more noisy objects and one or more candidate objects;
eliminating the one or more noisy objects from the segment using a vertical pixel projection technique;
extracting a plurality of features from the one or more candidate objects present in the segment, wherein the plurality of features are extracted by,
applying a windowing technique on the segment in order to divide the segment into one or more blocks, wherein each block of the one or more blocks comprises one or more sub-blocks,
calculating a local gradient histogram (LGH) corresponding to each sub-block of the one or more sub-blocks,
concatenating the LGH of each sub-block, and
generating a vector comprising the plurality of features based on the concatenation; and
evaluating the plurality of features using a Hidden Markov Model (HMM) model in order to determine the one or more candidate objects as one of the human or non-human.

2. The method of claim 1, wherein the segmenting further comprises:
maintaining the gray scale information of the subset of the plurality of pixels; and
transforming the gray scale information of remaining pixels other than the subset into a black color.

3. The method of claim 1, wherein the distance of the one or more objects in each segment is within a pre-defined threshold limit from the motion sensing device.

4. The method of claim 1, wherein the one or more noisy objects is selected from a group comprising ceiling, wall, floor, and combinations thereof.

5. The method of claim 1, wherein the vertical pixel projection eliminates the one or more noisy objects by using a K-Means algorithm.

6. The method of claim 1, wherein the one or more candidate objects are determined as one of the human or the non-human based on the evaluation of state transition sequence of the plurality of features with a sequence of features pre-stored in a database.

7. The method of claim 6, wherein the one or more candidate objects are determined as one of the human or the non-human by using a Viterbi algorithm.

8. The method of claim 1, wherein the image is segmented by using a depth connected operator, wherein the depth connected operator enables to segment the image into the plurality of segments.

9. The method of claim 1, wherein the plurality of features, corresponding to one or more activities associated to the human, are evaluated by using the HMM model in order to determine the one or more activities, wherein the one or more activities comprises standing, sitting and walking.

10. A system for detecting a human in an image, the system comprising:
a processor; and
a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules stored in the memory, and wherein the plurality of modules comprise:
an image capturing module configured to
capture the image using a motion sensing device, wherein the image comprises a plurality of pixels having gray scale information and a depth information, and wherein the gray scale information comprises intensity of each pixel corresponding to a plurality of objects in the image, and wherein the depth information comprises a distance of each object of the plurality of objects from the motion sensing device;
segment the image into a plurality of segments based upon the depth information of the plurality of objects, wherein each segment of the plurality of segments comprises a subset of the plurality of pixels, and each segment corresponds to one or more objects in the image;
an analysis module configured to
perform connected component analysis on a segment, of the plurality of segments, in order to segregate the one or more objects, present in the segment, into one or more noisy objects and one or more candidate objects;
eliminate the one or more noisy objects from the segment using a vertical pixel projection technique;
a feature extraction module configured to extract a plurality of features from the one or more candidate objects present in the segment, wherein the plurality of features are extracted by,
applying a windowing technique on the segment in order to divide the segment into one or more blocks, wherein each block of the one or more blocks comprises one or more sub-blocks,
calculating a local gradient histogram (LGH) corresponding to each sub-block of the one or more sub-blocks,
concatenating the LGH of each sub-block, and
generating a vector comprising the plurality of features based on the concatenation; and
an object determination module configured to evaluate the plurality of features using a Hidden Markov Model (HMM) model in order to determine the one or more candidate objects as one of the human or non-human.

11. A non transitory computer program product having embodied thereon a computer program for detecting a human in an image, the computer program product storing instructions, the instructions comprising instructions for:
capturing the image using a motion sensing device, wherein the image comprises a plurality of pixels having gray scale information and a depth information, and wherein the gray scale information comprises intensity of each pixel corresponding to a plurality of objects in the image, and wherein the depth information comprises a distance of each object of the plurality of objects from the motion sensing device;
segmenting the image into a plurality of segments based upon the depth information of the plurality of objects, wherein each segment of the plurality of segments comprises a subset of the plurality of pixels, and each segment corresponds to one or more objects in the image;
performing connected component analysis on a segment, of the plurality of segments, in order to segregate the one or more objects, present in the segment, into one or more noisy objects and one or more candidate objects;
eliminating the one or more noisy objects from the segment using a vertical pixel projection technique;
extracting a plurality of features from the one or more candidate objects present in the segment, wherein the plurality of features are extracted by,
applying a windowing technique on the segment in order to divide the segment into one or more blocks, wherein each block of the one or more blocks comprises one or more sub-blocks,
calculating a local gradient histogram (LGH) corresponding to each sub-block of the one or more sub-blocks,
concatenating the LGH of each sub-block, and
generating a vector comprising the plurality of features based on the concatenation; and
evaluating the plurality of features using a Hidden Markov Model (HMM) model in order to determine the one or more candidate objects as one of the human or non-human.

FORM 2

THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION

(See Section 10 and Rule 13)

Title of invention:

A SYSTEM AND METHOD FOR DETECTING A HUMAN IN AN IMAGE

APPLICANT:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India

The following specification describes the invention and the manner in which it is to be performed.

PRIORITY INFORMATION
[001] This patent application does not take priority from any application.

TECHNICAL FIELD
[002] The present subject matter described herein, in general, relates to a system and a method for image processing and more particularly to the system and the method for detecting a human present in an image through the image processing.

BACKGROUND
[003] Detection of human and activities of the human for indoor and outdoor surveillance has become a major domain of research. It has been observed that, the detection of the human and the activities is very effective in applications like video indexing and retrieval, intelligent human machine interaction, video surveillance, health care, driver assistance, automatic activity detection, and predicting person behavior. Some of such applications may be utilized in offices, or retail stores, or shopping malls in order to monitor/detect people present in the offices, or the retail stores, or the shopping malls. It has been further observed that, the detection of the human and their corresponding activities through still images or video frames may also be possible in the indoor surveillance.
[004] In order to detect the human in the still images or the video frames, traditional background modeling based methods have been implemented. However, such methods are not capable of handling the RGB-D/grayscale data pertaining to each pixel in the still images or the video frames. This is because the camera capturing the still images is not static, or there is constant variation of the lighting/environmental conditions around the camera. Further, since the video frames may contain a human leaning over a wall, or one person occluding another person, it may be challenging to distinguish the person from the wall or from the other person, thereby leading to incorrect/inaccurate detection of the person.
[005] In addition, there have been other techniques implemented for detecting the human in the still images or video frames. Examples of such techniques include a face detection algorithm integrated with a cascade-of-rejectors concept along with histograms of oriented gradients (HoG), a window scanning technique, human detection based on body parts, a hierarchical classification architecture using SVM, and a graphical model based approach for estimating poses of upper-body parts. However, these techniques focus on detecting one or more body parts (e.g. head, leg, arm) of the human. Additionally, these techniques require the human in the image to be localized in a predefined orientation/view, and hence are not capable of view-invariant detection of the human.

SUMMARY
[006] Before the present systems and methods are described, it is to be understood that this application is not limited to the particular systems, and methodologies described, as there can be multiple possible embodiments which are not expressly illustrated in the present disclosures. It is also to be understood that the terminology used in the description is for the purpose of describing the particular versions or embodiments only, and is not intended to limit the scope of the present application. This summary is provided to introduce concepts related to systems and methods for detecting a human in an image, and the concepts are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[007] In one implementation, a system for detecting a human in an image is disclosed. In one aspect, the system may comprise a processor and a memory coupled to the processor for executing a plurality of modules present in the memory. The plurality of modules may further comprise an image capturing module, an analysis module, a feature extraction module, and an object determination module. The image capturing module may be configured to capture the image using a motion sensing device. The image may comprise a plurality of pixels having gray scale information and a depth information. The gray scale information indicates intensity of each pixel corresponding to a plurality of objects in the image. The depth information indicates distance of each object of the plurality of objects from the motion sensing device. The image capturing module may further be configured to segment the image into a plurality of segments based upon the depth information of the plurality of objects. It may be understood that, each segment of the plurality of segments may comprise a subset of the plurality of pixels, and each segment corresponds to one or more objects in the image. The analysis module may be configured to perform connected component analysis on a segment, of the plurality of segments, in order to segregate the one or more objects present in the segment. The one or more objects may be segregated into one or more noisy objects and one or more candidate objects. The analysis module may further be configured to eliminate the one or more noisy objects from the segment using a vertical pixel projection technique. The feature extraction module may be configured to extract a plurality of features from the one or more candidate objects present in the segment. It may be understood that, the plurality of features may be extracted by applying a windowing technique on the segment in order to divide the segment into one or more blocks. It may be understood that, each block of the one or more blocks comprises one or more sub-blocks. After dividing, a local gradient histogram (LGH) may be calculated corresponding to each sub-block of the one or more sub-blocks. After the calculation of the LGH, the LGH of each sub-block may then be concatenated to generate a vector comprising the plurality of features. The object determination module may be configured to evaluate the plurality of features using a Hidden Markov Model (HMM) model in order to determine the one or more candidate objects as one of the human or non-human.
[008] In another implementation, a method for detecting a human in an image is disclosed. The image may be captured by using a motion sensing device. The image may comprise a plurality of pixels having gray scale information and a depth information. In one aspect, the gray scale information may indicate intensity of each pixel corresponding to a plurality of objects in the image and the depth information may indicate distance of each object of the plurality of objects from the motion sensing device. After capturing, the image may be segmented into a plurality of segments based upon the depth information of the plurality of objects. It may be understood that, each segment of the plurality of segments may comprise a subset of the plurality of pixels, and each segment corresponds to one or more objects in the image. Subsequent to the segmentation, a connected component analysis may be performed on a segment, of the plurality of segments, in order to segregate the one or more objects present in the segment. The one or more objects may be segregated into one or more noisy objects and one or more candidate objects. Based on the connected component analysis, the one or more noisy objects may be eliminated from the segment by using a vertical pixel projection technique. After eliminating the one or more noisy objects, a plurality of features may be extracted from the one or more candidate objects present in the segment. It may be understood that, the plurality of features may be extracted by applying a windowing technique on the segment in order to divide the segment into one or more blocks. It may be understood that, each block of the one or more blocks comprises one or more sub-blocks. After dividing, a local gradient histogram (LGH) may be calculated corresponding to each sub-block. Based on the calculation of the LGH, the LGH of each sub-block may then be concatenated to generate a vector comprising the plurality of features. In one aspect, the plurality of features may then be evaluated by using a Hidden Markov Model (HMM) model in order to determine the one or more candidate objects as one of the human or non-human.
[009] In yet another implementation, a non transitory computer program product having embodied thereon a computer program for detecting a human in an image is disclosed. The computer program product may comprise instructions for capturing the image using a motion sensing device. The image may comprise a plurality of pixels having gray scale information and a depth information. The gray scale information comprises intensity of each pixel corresponding to a plurality of objects in the image, and the depth information may comprise a distance of each object of the plurality of objects from the motion sensing device. The computer program product may comprise instructions for segmenting the image into a plurality of segments based upon the depth information of the plurality of objects. It may be understood that, each segment of the plurality of segments may comprise a subset of the plurality of pixels, and each segment corresponds to one or more objects in the image. The computer program product may comprise instructions for performing connected component analysis on a segment, of the plurality of segments, in order to segregate the one or more objects, present in the segment, into one or more noisy objects and one or more candidate objects. The computer program product may comprise instructions for eliminating the one or more noisy objects from the segment using a vertical pixel projection technique. The computer program product may comprise instructions for extracting a plurality of features from the one or more candidate objects present in the segment. The plurality of features may be extracted by applying a windowing technique on the segment in order to divide the segment into one or more blocks, wherein each block of the one or more blocks comprises one or more sub-blocks, calculating a local gradient histogram (LGH) corresponding to each sub-block of the one or more sub-blocks, concatenating the LGH of each sub-block, and generating a vector comprising the plurality of features based on the concatenation. The computer program product may comprise instructions for evaluating the plurality of features using a Hidden Markov Model (HMM) model in order to determine the one or more candidate objects as one of the human or non-human.

BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The foregoing detailed description of embodiments is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the disclosure, there is shown in the present document example constructions of the disclosure; however, the disclosure is not limited to the specific methods and apparatus disclosed in the document and the drawings.
[0011] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer to like features and components.
[0012] Figure 1 illustrates a network implementation of a system for detecting a human in an image, in accordance with an embodiment of the present subject matter.
[0013] Figure 2 illustrates the system, in accordance with an embodiment of the present subject matter.
[0014] Figures 3 and 4 illustrate the working of the system, in accordance with an embodiment of the present subject matter.
[0015] Figure 5 illustrates a method for detecting the human in the image, in accordance with an embodiment of the present subject matter.

[0016] Figure 6 illustrates a method for extracting a plurality of features, in accordance with an embodiment of the present subject matter.

DETAILED DESCRIPTION
[0017] Some embodiments of this disclosure, illustrating all its features, will now be discussed in detail. The words "comprising," "having," "containing," and "including," and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any systems and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the exemplary systems and methods are now described. The disclosed embodiments are merely exemplary of the disclosure, which may be embodied in various forms.
[0018] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure is not intended to be limited to the embodiments illustrated, but is to be accorded the widest scope consistent with the principles and features described herein.
[0019] The present subject matter provides a system and a method for detecting a human from a plurality of objects present in an image. In one aspect, the image may be captured by a motion sensing device. In one example, the motion sensing device may be a Kinect™ device. It may be understood that, the image captured by the Kinect™ device may comprise a plurality of pixels having gray scale information and depth information. The gray scale information indicates intensity of each pixel corresponding to the plurality of objects, and the depth information indicates distance of each object from the Kinect™ device.
[0020] After capturing the image, the image may be segmented into a plurality of segments based upon the depth information associated to the plurality of objects in the image. It may be understood that, each segment comprises a subset of the plurality of pixels. In one aspect, each segment corresponds to one or more objects of the plurality of objects in the image. In one embodiment, the image may be segmented by maintaining the gray scale information of the subset, and transforming the gray scale information of the remaining pixels (not belonging to the subset) of the image into a black color.
[0021] After segmenting the image into the plurality of segments, a connected component analysis may be performed on a segment of the plurality of segments. The connected component analysis may facilitate to segregate the one or more objects present in the segment. The one or more objects may be segregated into one or more noisy objects and one or more candidate objects. Examples of the one or more noisy objects may include, but are not limited to, ceiling, wall, and floor. Examples of the one or more candidate objects may include, but are not limited to, a human, a chair, and a refrigerator. After the segregation of the one or more objects, the one or more noisy objects may be eliminated from the segment using a vertical pixel projection technique. In one aspect, the vertical pixel projection may eliminate the one or more noisy objects by using a K-Means algorithm.
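By way of illustration only, the following is a minimal Python sketch of the segregation step described above, assuming a binary foreground mask for a segment is available as a NumPy array. The use of scipy.ndimage.label for connected component labelling, the helper name, and the width and minimum-area thresholds are assumptions of the sketch rather than requirements of the present subject matter (the 75% width criterion echoes the vertical pixel projection step described later).

```python
import numpy as np
from scipy import ndimage

def segregate_objects(segment_mask, image_width, width_ratio=0.75, min_area=200):
    """Label connected components of a binary segment mask and split them into
    candidate objects and noisy objects (thresholds are illustrative only)."""
    labels, num = ndimage.label(segment_mask)
    candidates, noisy = [], []
    for lbl in range(1, num + 1):
        ys, xs = np.nonzero(labels == lbl)
        width = xs.max() - xs.min() + 1
        area = len(xs)
        # Very wide components (e.g. floor/ceiling spans) or tiny blobs are treated as noise.
        if width > width_ratio * image_width or area < min_area:
            noisy.append(lbl)
        else:
            candidates.append(lbl)
    return labels, candidates, noisy
```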
[0022] Subsequent to the elimination of the one or more noisy objects, a plurality of features may be extracted from the one or more candidate objects present in the segment. In one embodiment, the plurality of features may be extracted by implementing a sliding windowing technique on the segment. In one aspect, the sliding windowing technique may provide a rectangular sliding window while traversing across the segment in order to obtain one or more window frames. The one or more window frames are indicative of one or more blocks of the segment. Thus, the implementation of the sliding windowing technique may enable division of the segment into one or more blocks. It may be understood that, each block of the one or more blocks comprises one or more sub-blocks. After dividing the segment, a local gradient histogram (LGH) may be calculated for a pre-defined set of orientations corresponding to each sub-block. Based on the calculation, the LGH of each sub-block may be concatenated in order to generate a vector comprising the plurality of features.
[0023] After extracting the plurality of features from the segment, the plurality of features may be evaluated by using a Hidden Markov Model (HMM) model in order to determine the one or more candidate objects as one of the human or non-human. In one aspect, the one or more candidate objects may be determined as one of the human or the non-human based on the evaluation of state transition sequence of the plurality of features with a sequence of features pre-stored in a database. It may be understood that, the one or more candidate objects may be determined as one of the human or the non-human by using a Viterbi algorithm.
[0024] While aspects of the described system and method for detecting the human in the image may be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system.
[0025] Referring now to Figure 1, a network implementation 100 of a system 102 for detecting a human in an image is illustrated, in accordance with an embodiment of the present subject matter. In one embodiment, the system 102 may be configured to capture the image. In one aspect, the image may comprise a plurality of pixels having gray scale information and a depth information. The gray scale information indicates intensity of each pixel corresponding to a plurality of objects in the image, and the depth information indicates distance of each object of the plurality of objects from the motion sensing device. After capturing the image, the system 102 may further be configured to segment the image into a plurality of segments. Based on the segmentation, the system 102 may further be configured to perform a connected component analysis on a segment, of the plurality of segments, in order to segregate the one or more objects into one or more noisy objects and one or more candidate objects. Subsequent to the performance of the connected component analysis on the segment, the system 102 may further be configured to eliminate the one or more noisy objects from the segment. After eliminating the one or more noisy objects, the system 102 may further be configured to extract a plurality of features from the one or more candidate objects present in the segment. Subsequent to the extraction of the plurality of features, the system 102 may further be configured to evaluate the plurality of features in order to determine the one or more candidate objects as one of the human or non-human.
[0026] Although the present subject matter is explained considering that the system 102 is implemented on a server, it may be understood that the system 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, a cloud-based computing environment and the like. In one implementation, the system 102 may be implemented in a cloud-based environment. It will be understood that the system 102 may be accessed by multiple users through one or more user devices 104-1, 104-2…104-N, collectively referred to as user devices 104 hereinafter, or applications residing on the user devices 104. Examples of the user devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, and a workstation. The user devices 104 are communicatively coupled to the system 102 through a network 106.
[0027] In one implementation, the network 106 may be a wireless network, a wired network or a combination thereof. The network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0028] Referring now to Figure 2, the system 102 is illustrated in accordance with an embodiment of the present subject matter. In one embodiment, the system 102 may include at least one processor 202, an input/output (I/O) interface 204, and a memory 206. The at least one processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 202 is configured to fetch and execute computer-readable instructions stored in the memory 206.
[0029] The I/O interface 204 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 204 may allow the system 102 to interact with the user directly or through the client devices 104. Further, the I/O interface 204 may enable the system 102 to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O interface 204 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 204 may include one or more ports for connecting a number of devices to one another or to another server.
[0030] The memory 206 may include any computer-readable medium or computer program product known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 206 may include modules 208 and data 210.
[0031] The modules 208 include routines, programs, objects, components, data structures, etc., which perform particular tasks, functions or implement particular abstract data types. In one implementation, the modules 208 may include an image capturing module 212, an analysis module 214, a feature extraction module 216, an object determination module 218, and other modules 220. The other modules 220 may include programs or coded instructions that supplement applications and functions of the system 102. The modules 208 described herein may be implemented as software modules that may be executed in the cloud-based computing environment of the system 102.
[0032] The data 210, amongst other things, serves as a repository for storing data processed, received, and generated by one or more of the modules 208. The data 210 may also include a database 222 and other data 224. The other data 224 may include data generated as a result of the execution of one or more modules in the other modules 220.
[0033] In one implementation, at first, a user may use the client device 104 to access the system 102 via the I/O interface 204. The user may register themselves using the I/O interface 204 in order to use the system 102. In one aspect, the user may access the I/O interface 204 of the system 102. In order to detect a human in an image, the system 102 may employ the image capturing module 212, the analysis module 214, the feature extraction module 216, and the object determination module 218. The detailed working of the plurality of modules is described below.
[0034] Further referring to Figure 2, at first, the image capturing module 212 captures the image by using a motion sensing device. An example of the motion sensing device may include a Kinect™ device. It may be understood that, the Kinect™ device is capable of capturing the image along with metadata associated to the image. The metadata may include gray scale information and depth information pertaining to a plurality of objects in the image. In one aspect, the gray scale information may indicate intensity of each pixel corresponding to the plurality of objects in the image, whereas the depth information may indicate distance of each object of the plurality of objects from the Kinect™ device. In one example, the image captured by the image capturing module 212 may comprise objects such as a human, a refrigerator, and a chair. It may be understood that, the objects (i.e. the human, the refrigerator, or the chair) may be located at distinct locations in an indoor environment. Since the objects are located at distinct locations, the image capturing module 212 determines the depth information along with the gray scale information of each object in the image.
[0035] Subsequent to the capturing of the image, the image capturing module 212 may further segment the image into a plurality of segments. The image may be segmented based upon the depth information of the plurality of objects present in the image. In one embodiment, the image may be segmented by using a depth connected operator (ψ). The operator 'ψ' may be referred to as the depth connected operator when the symmetrical difference D Δ ψ(D) is exclusively composed of connected components of 'D' or of the complement of the connected components 'Dᶜ'. It may be understood that, the depth information represented by the depth connected operator 'ψ', obtained from the Kinect™, may be applied over each pixel in order to segment the image into the plurality of segments. In one aspect, each segment of the plurality of segments may comprise a subset of the plurality of pixels. In one aspect, each pixel has a distance within a pre-defined threshold value. Since each segment comprises the subset having the distance within the pre-defined threshold value, it may be understood that a segment of the plurality of segments may comprise one or more objects, of the plurality of objects, corresponding to the subset.
[0036] In one embodiment, the image may be segmented by maintaining the gray scale information of the subset, and transforming the gray scale information of the remaining pixels (not belonging to the subset) of the image into a black color. It may be understood that, the gray scale information of the subset is maintained since each pixel in the subset may have the distance within the pre-defined threshold value. On the other hand, the gray scale information of the remaining pixels is transformed since each pixel outside the subset may not have the distance within the pre-defined threshold value. Thus, subsequent to the transformation, the image capturing module 212 may segment the image into the plurality of segments.
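The following is a minimal Python sketch of the depth-based segmentation described above, assuming the gray scale image and the per-pixel depth map are available as NumPy arrays. The fixed depth band width used to sweep the depth range is an illustrative assumption standing in for the pre-defined threshold value; function names are hypothetical.

```python
import numpy as np

def segment_by_depth(gray, depth, depth_min, depth_max):
    """Keep the gray scale information of pixels whose depth lies within
    [depth_min, depth_max] and turn every other pixel black."""
    mask = (depth >= depth_min) & (depth <= depth_max)
    segment = np.where(mask, gray, 0).astype(gray.dtype)
    return segment, mask

def segment_image(gray, depth, band=500):
    """Split the image into segments by sweeping depth bands of `band` units
    (the band width is an assumption of this sketch, not of the disclosure)."""
    segments = []
    for d0 in range(int(depth[depth > 0].min()), int(depth.max()), band):
        seg, mask = segment_by_depth(gray, depth, d0, d0 + band)
        if mask.any():
            segments.append(seg)
    return segments
```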
[0037] In order to understand the segmentation of the image into the plurality of segments, consider an example (1) in which a video frame/image is segmented into 3 segments. As illustrated in figure 3(a), a person X is walking towards another two persons (person Y and person Z) that are standing away from the person X. In addition to the presence of the person X, the person Y and the person Z, an object (i.e. a door) is also present in the video frame/image. Based on the functionality of the image capturing module 212, as aforementioned, the image capturing module 212 determines the depth information along with the gray scale information of each object (i.e. person X, person Y, person Z and the door) present in the video frame/image. Based on the depth information, the video frame/image is segmented into 3 segments as illustrated in figures 3(b), 3(c), and 3(d). Since it is determined from the depth information that the distance of the person X is distinct from that of the person Y, the person Z and the door, the person X is segmented into a first segment, as shown in figure 3(b). On the other hand, the distances of the person Y and the person Z are within the threshold value of each other but distinct from those of the person X and the door; therefore both the person Y and the person Z are classified into a second segment, as shown in figure 3(c). Similarly, the object (i.e. the door) is located at a distinct location from the person X, the person Y, and the person Z; therefore the door is segmented into a third segment, as shown in figure 3(d). As shown in the figures 3(b), 3(c) and 3(d), the gray scale information of the one or more objects in the segment is maintained whereas the background color in the segments is marked as black in order to highlight the one or more objects in the segment.
[0038] Subsequent to the segmentation of the image, the analysis module 214 may perform a connected component analysis on the segment of the plurality of segments. In one aspect, the connected component analysis may facilitate to segregate the one or more objects, present in the segment, into one or more noisy objects and one or more candidate objects. Examples of the one or more noisy objects may include, but are not limited to, ceiling, wall, and floor. Examples of the one or more candidate objects may include, but are not limited to, a human, a chair, and a refrigerator. Based on the segregation of the one or more objects, the one or more noisy objects may be eliminated from the segment using a vertical pixel projection technique. In one embodiment, the vertical pixel projection technique may include the following steps:
[0039] If the width of an object of the one or more objects in the segment is greater than 75% of the width of the image, then execute the following steps for eliminating the one or more noisy objects (a sketch of this procedure follows the list below):
• Count the number of pixels (cnt_i) in a column 'i' for which the background color in the segment is to be transformed to 'black', or assign a value of 'FALSE' to a flag, wherein the flag indicates that the background color is to be transformed to 'black';
• Execute K-Means clustering with K = 2 on cnt_i ∀ i ∈ [0, H], where H indicates the height of the image;
• The output of the K-Means algorithm represents columns with a higher number of foreground pixels (C1) and columns with a lower number of foreground pixels (C2); and
• Assign the flag value 'TRUE' for the columns whose pixel counts (cnt_i) reside in C2.
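The following is a minimal Python sketch of one reading of the vertical pixel projection procedure listed above, using scikit-learn's KMeans for the K = 2 clustering. Treating non-black pixels as foreground, blacking out the columns that fall in the lower-count cluster, and the helper name are illustrative assumptions of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def vertical_pixel_projection(segment, image_width):
    """Suppress columns dominated by wide noisy structures (e.g. floor/ceiling)
    using a K-Means (K = 2) split of per-column foreground counts."""
    fg = segment > 0                          # non-black pixels are foreground
    obj_width = fg.any(axis=0).sum()
    if obj_width <= 0.75 * image_width:       # only applied to very wide objects
        return segment
    cnt = fg.sum(axis=0).reshape(-1, 1)       # cnt_i: foreground pixels per column i
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cnt)
    # The cluster with the lower mean count corresponds to C2; those columns are blacked out.
    low_cluster = int(np.argmin([cnt[labels == k].mean() for k in (0, 1)]))
    cleaned = segment.copy()
    cleaned[:, labels == low_cluster] = 0
    return cleaned
```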
[0040] In order to understand the working of the analysis module 214, consider an example (2) in which a video frame/image, as illustrated in figure 4(a), comprises a person X and a door. Since the video frame/image is captured in the indoor environment, the video frame/image may also consist of the one or more noisy objects such as floor and ceiling. In order to eliminate the floor and the ceiling, the aforesaid vertical pixel projection may be applied on the video frame/image. In this manner, based on the vertical pixel projection, the floor and the ceiling are eliminated and the one or more candidate objects (i.e. the person X and the door) are retained in the video frame/image, as illustrated in figure 4(b).
[0041] Subsequent to the elimination of the one or more noisy objects, the feature extraction module 216 may extract a plurality of features from the one or more candidate objects present in the segment. It may be understood that, the plurality of features may be extracted by calculating a local gradient histogram (LGH). In order to calculate the LGH, a sliding windowing technique may be implemented on the segment. In one aspect, the sliding windowing technique provides a rectangular sliding window while traversing across the segment in order to obtain one or more window frames. The one or more window frames are indicative of one or more blocks of the segment. In an exemplary embodiment, the feature extraction module 216 may further divide each block of the one or more blocks into a matrix of (4 X 4) sub-blocks. After dividing the one or more blocks, the LGH of each sub-block may be calculated for a predefined set of orientations. Subsequent to the calculation of the LGH for each sub-block, the feature extraction module 216 may further concatenate the LGH calculated for each sub-block of the one or more blocks associated to the segment. Based on the concatenation, the feature extraction module 216 generates a vector comprising the plurality of features of the segment.
[0042] In one example (3), consider a segment, of the image, comprising the one or more candidate objects. Based on the functionality of the feature extraction module 216, as aforementioned, the sliding windowing technique may be implemented on the segment, which facilitates dividing the segment into 8 blocks. Further, each block of the 8 blocks is divided into a matrix of (4 X 4) sub-blocks. In other words, it may be understood that, each block comprises 16 sub-blocks. After dividing each block, the LGH of each sub-block may be calculated for a predefined set of orientations (for example '8' orientations), and therefore, it is to be understood that, 128 features may be calculated for each block of the segment. After calculating the LGH, the feature extraction module 216 concatenates the LGH for each sub-block in order to generate the vector comprising the plurality of features corresponding to the one or more candidate objects present in the segment.
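A minimal Python sketch of the LGH feature extraction described above follows, assuming the segment is a gray scale NumPy array. The (4 x 4) sub-block grid and the 8 orientations follow example (3); the magnitude-weighted orientation histogram, the equal-width block split, and the function names are assumptions of the sketch.

```python
import numpy as np

def local_gradient_histogram(patch, n_bins=8):
    """Histogram of gradient orientations inside one sub-block,
    weighted by gradient magnitude and L2-normalised."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def extract_lgh_features(segment, n_blocks=8, grid=(4, 4), n_bins=8):
    """Slide a window over the segment to form `n_blocks` blocks, split each block
    into a grid of sub-blocks, compute the LGH per sub-block and concatenate."""
    h, w = segment.shape
    block_w = w // n_blocks
    features = []
    for b in range(n_blocks):
        block = segment[:, b * block_w:(b + 1) * block_w]
        sh, sw = block.shape[0] // grid[0], block.shape[1] // grid[1]
        for r in range(grid[0]):
            for c in range(grid[1]):
                sub = block[r * sh:(r + 1) * sh, c * sw:(c + 1) * sw]
                features.append(local_gradient_histogram(sub, n_bins))
    # 8 blocks x 16 sub-blocks x 8 orientation bins = 128 features per block.
    return np.concatenate(features)
```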
[0043] After extracting the features, the object determination module 218 evaluates the plurality of features using a Hidden Markov Model (HMM) model of a plurality of HMM models. In one embodiment, the plurality of features may be evaluated to determine the one or more candidate objects as one of the human or non-human. It may be understood that, the one or more candidate objects may be determined as one of the human or the non-human based on the evaluation of the state transition sequence of the plurality of features with a sequence of features pre-stored in a database 222. It may be understood that, the sequence of features may be extracted from a large sample of images that may be used in training the HMM model. It may be understood that, the training facilitates in determining the one or more candidate objects as one of the human or the non-human. Once the HMM model has been trained, a Viterbi algorithm may be used to determine the one or more candidate objects as the human or the non-human. It may be understood that, the Viterbi algorithm has been implemented in the Hidden Markov Model speech recognition toolkit (HTK) library.
[0044] In another embodiment, the object determination module 218 may further evaluate the plurality of features by using one or more HMM models of the plurality of HMM models in order to determine one or more activities corresponding to an object, of the one or more candidate objects, determined as the human. Examples of the one or more activities may include, but are not limited to, standing, sitting and walking. It may be understood that, in order to determine each activity, distinct HMM models may be trained based on a plurality of features corresponding to each activity. It may be understood that, the plurality of features may be trained based on the large sample of images corresponding to each activity. Thus, the training of the distinct HMM models may facilitate in determining the one or more activities as one of the standing, sitting and walking. In one aspect, the object determination module 218 may determine the one or more activities of the human in the image in a view-invariant manner.
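By way of illustration, a minimal sketch of per-activity HMM classification is given below using the hmmlearn Python library as a stand-in for the HTK toolkit mentioned above. The number of hidden states and the single diagonal-covariance Gaussian emission per state (rather than the GMM emissions described below) are assumptions of the sketch.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_activity_models(training_data, n_states=5):
    """Train one HMM per activity label ('standing', 'sitting', 'walking', ...).
    training_data maps label -> list of feature-vector sequences (T_i x D arrays)."""
    models = {}
    for label, sequences in training_data.items():
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        models[label] = GaussianHMM(n_components=n_states,
                                    covariance_type="diag").fit(X, lengths)
    return models

def recognise_activity(models, features):
    """Score a feature-vector sequence under every activity HMM and return the best label."""
    return max(models, key=lambda label: models[label].score(features))
```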
[0045] In one embodiment, the functionality of the object determination module 218 for determining the one or more candidate objects as the human or the non-human is explained as below. It may be understood that, the HMM follows a first-order Markov assumption where each state S_t at time 't' depends only on the state S_{t-1} at time 't-1'. It may be understood that, the image observation feature vectors constitute the sequence of states. Each class is modeled by a left-to-right HMM where each state has a transition to itself and to the next state. In one aspect, the HMM may contain a fixed number of hidden states. The HMM is characterized by 3 matrices, i.e. the state transition probability matrix A, the symbol output probability matrix B, and the initial state probability matrix π. The state transition probability matrix A, the symbol output probability matrix B, and the initial state probability matrix π may be determined during a learning process. It may be understood that, the image may be represented as a sequence of feature vectors X = X_1, X_2, ..., X_T, also known as a sequence of frames. For a model λ, let 'O' be an observation sequence O = (O_1, O_2, ..., O_T). It may be understood that, the observation sequence O generates a state sequence Q = (Q_1, Q_2, ..., Q_T) of length T. The probability of the observation sequence O may be calculated by:

P(O, Q | λ) = π_{q_1} b_{q_1}(O_1) ∏_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t)

[0046] where π_{q_1} indicates the initial probability of state q_1, a_{ij} indicates the transition probability from state 'i' to state 'j', and b_i is the output probability of state 'i'. In one aspect, the observation probability may be computed by using a Gaussian Mixture Model (GMM). The GMM is given by

b_j(x) = Σ_{k=1}^{M_j} c_{jk} N(x; μ_{jk}, Σ_{jk})

[0047] where M_j indicates the number of Gaussians assigned to state 'j', N(x; μ, Σ) denotes a Gaussian with mean 'μ' and covariance matrix 'Σ', and c_{jk} is the weight coefficient of the Gaussian component 'k' of state 'j'. Further, the Viterbi algorithm is used to decode and search the subsequence of the observation that matches best to the HMM.
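The following is a minimal Python sketch of the Viterbi evaluation expressed by the above equations, simplified to a single diagonal-covariance Gaussian per state rather than a full GMM. The model parameters are assumed to come from a prior training step, and the function and parameter names are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian N(x; mean, var)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi_log_likelihood(X, log_pi, log_A, means, variances):
    """Viterbi score of a feature-vector sequence X (T x D) under an HMM with
    initial log probabilities log_pi, log transition matrix log_A, and
    per-state Gaussian emission parameters (means, variances)."""
    T, n_states = len(X), len(log_pi)
    log_b = np.array([[log_gaussian(x, means[j], variances[j])
                       for j in range(n_states)] for x in X])
    delta = log_pi + log_b[0]                            # initial step
    for t in range(1, T):
        # delta_j(t) = max_i [delta_i(t-1) + log a_ij] + log b_j(O_t)
        delta = np.max(delta[:, None] + log_A, axis=0) + log_b[t]
    return delta.max()

def classify(X, human_hmm, non_human_hmm):
    """Pick the class whose HMM explains the feature sequence best."""
    scores = {name: viterbi_log_likelihood(X, *params)
              for name, params in (("human", human_hmm), ("non-human", non_human_hmm))}
    return max(scores, key=scores.get)
```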
[0048] Exemplary embodiments discussed above may provide certain advantages. Though not required to practice aspects of the disclosure, these advantages may include those provided by the following features.
[0049] Some embodiments enable a system and a method for detecting upper body part as well as other body parts of a human present in the image.
[0050] Some embodiments enable the system and the method for view-invariant human detection by using a Hidden Markov model (HMM) approach.
[0051] Some embodiments enable the system and the method to detect the human during improper segmentation of a segment of the image, and further when the human is partially occluded by another object or another human.
[0052] Some embodiments enable the system and the method to segment the image into a plurality of segments in order to recognize the presence of human in each segment based on depth information pertaining to an object in each segment.
[0053] Referring now to Figure 5, a method 500 for detecting a human in an image is shown, in accordance with an embodiment of the present subject matter. The method 500 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 500 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
[0054] The order in which the method 500 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 500 or alternate methods. Additionally, individual blocks may be deleted from the method 500 without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method 500 may be considered to be implemented as described in the system 102.
[0055] At block 502, the image may be captured by using a motion sensing device. In one aspect, the image comprises a plurality of pixels having gray scale information and depth information. In one implementation, the image may be captured by the image capturing module 212.
[0056] At block 504, the image may be segmented into a plurality of segments based upon the depth information. In one aspect, each segment of the plurality of segments may comprise a subset of the plurality of pixels, and each segment corresponds to one or more objects in the image. In one implementation, the image may be segmented into a plurality of segments by the image capturing module 212.
[0057] At block 506, a connected component analysis may be performed on each segment in order to segregate the one or more objects, present in the segment, into one or more noisy objects and one or more candidate objects. In one implementation, the connected component analysis may be performed by the analysis module 214.
[0058] At block 508, the one or more noisy objects may be eliminated from the segment by using a vertical pixel projection technique. In one implementation, the one or more noisy objects may be eliminated by the analysis module 214.
[0059] At block 510, a plurality of features may be extracted from the one or more candidate objects present in the segment. In one implementation, the plurality of features may be extracted by the feature extraction module 216. Further, the block 510 may be explained in greater detail in Figure 6.
[0060] At block 512, the plurality of features may be evaluated to determine the one or more candidate objects as one of the human or non-human. In one aspect, the plurality of features may be evaluated by using a Hidden Markov Model (HMM) model. In one implementation, the plurality of features may be evaluated by the object determination module 218.
[0061] Referring now to Figure 6, a method 510 for extracting the plurality of features from the one or more candidate objects is shown, in accordance with an embodiment of the present subject matter.
[0062] At block 602, the segment may be divided into one or more blocks, wherein each block of the one or more blocks comprises one or more sub-blocks. In one embodiment, the segment may be divided by applying a windowing technique on the segment. In one implementation, the segment may be divided by the feature extraction module 216.
[0063] At block 604, a local gradient histogram (LGH) corresponding to each sub-block of the one or more sub-blocks may be calculated. In one implementation, the LGH may be calculated by the feature extraction module 216.
[0064] At block 606, the LGH of each sub-block may be concatenated. In one implementation, the LGH of each sub-block may be concatenated by the feature extraction module 216.
[0065] At block 608, a vector comprising the plurality of features may be generated. In one aspect, the vector may be generated based on the concatenation. In one implementation, the vector may be generated by the feature extraction module 216.
[0066] Although implementations for methods and systems for detecting a human in an image have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations for detecting the human in the image.

Documents

Application Documents

# Name Date
1 456-MUM-2014-Request For Certified Copy-Online(23-02-2015).pdf 2015-02-23
2 Form 3 [09-12-2016(online)].pdf 2016-12-09
3 Form-18(Online).pdf 2018-08-11
4 Form 5.pdf 2018-08-11
5 Form 3.pdf 2018-08-11
6 Form 2.pdf 2018-08-11
7 Figure of Abstract.jpg 2018-08-11
8 Drawings.pdf 2018-08-11
9 certified copy 456-mum-2014.pdf 2018-08-11
10 456-MUM-2014-FORM 26(19-3-2014).pdf 2018-08-11
11 456-MUM-2014-FORM 1(6-3-2014).pdf 2018-08-11
12 456-MUM-2014-CORRESPONDENCE(6-3-2014).pdf 2018-08-11
13 456-MUM-2014-CORRESPONDENCE(19-3-2014).pdf 2018-08-11
14 456-MUM-2014-FER.pdf 2019-08-21
15 456-MUM-2014-OTHERS [21-02-2020(online)].pdf 2020-02-21
16 456-MUM-2014-FER_SER_REPLY [21-02-2020(online)].pdf 2020-02-21
17 456-MUM-2014-COMPLETE SPECIFICATION [21-02-2020(online)].pdf 2020-02-21
18 456-MUM-2014-CLAIMS [21-02-2020(online)].pdf 2020-02-21
19 456-MUM-2014-PatentCertificate26-11-2021.pdf 2021-11-26
20 456-MUM-2014-IntimationOfGrant26-11-2021.pdf 2021-11-26
21 456-MUM-2014-RELEVANT DOCUMENTS [30-09-2023(online)].pdf 2023-09-30

Search Strategy

1 2019-07-3100-08-45_31-07-2019.pdf

ERegister / Renewals

3rd: 05 Jan 2022 (from 07/02/2016 to 07/02/2017)
4th: 05 Jan 2022 (from 07/02/2017 to 07/02/2018)
5th: 05 Jan 2022 (from 07/02/2018 to 07/02/2019)
6th: 05 Jan 2022 (from 07/02/2019 to 07/02/2020)
7th: 05 Jan 2022 (from 07/02/2020 to 07/02/2021)
8th: 05 Jan 2022 (from 07/02/2021 to 07/02/2022)
9th: 05 Jan 2022 (from 07/02/2022 to 07/02/2023)
10th: 02 Feb 2023 (from 07/02/2023 to 07/02/2024)
11th: 07 Feb 2024 (from 07/02/2024 to 07/02/2025)
12th: 07 Feb 2025 (from 07/02/2025 to 07/02/2026)