Abstract: The present invention relates to a system and method for discovering faces to form a video index. According to one exemplary embodiment, a multiple face detector-tracker combination, bound by a reasoning scheme and operational in both forward and backward directions, is used to extract face tracks from individual shots of a shot-segmented video. According to an embodiment, a face Track-Cluster-Correspondence-Matrix (TCCM) is further formed to identify the equivalent face tracks.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention: A SYSTEM AND METHOD FOR FACE BASED VIDEO INDEXING IN A
VIDEO
Applicant:
Tata Consultancy Services Limited, a company incorporated in India under The Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
RELATED INVENTION
The present invention is related to the invention disclosed in Indian patent application No. 1959/MUM/2011, titled "A System and method for tracking the multiple faces with appearance modes and reasoning process", filed on July 7, 2011, which is hereby incorporated by reference.
FIELD OF THE INVENTION
The present application relates to a fully automatic system and method for generating a face based index of a video without any human intervention.
BACKGROUND OF THE INVENTION
Generally, security surveillance cameras are installed in housing societies, office spaces, and streets for close monitoring of the activities and behavior of people, often in a furtive manner. A surveillance camera is adapted to identify the frontal face pose of the object or person to be tracked. The face tracks obtained make it difficult to identify the tracked subjects due to the presence of outliers, and surveillance cameras sometimes produce false detections as well. In general videos, by contrast, people may appear in all possible facial poses.
Existing approaches have mostly considered corpora of videos from security surveillance cameras or news videos, where occlusions occur only rarely and people mostly appear with frontal faces. In such cases, multiple face tracking with occlusion reasoning is not a strict necessity, and face tracks obtained by non-overlapping face region tracking are not filtered to remove outliers formed by false detections. Also, the correspondence of face tracks for indexing purposes becomes easier, as the existing approaches have only considered frontal faces. These approaches are thus not suitable for handling general videos (e.g. movies, TV shows, home videos etc.) which contain a lot of occlusions and wide variability of facial poses. Thus, a system for multiple face tracking is needed to localize a person's face throughout the entire video.
A video indexing system should allow the user to quickly jump to the start of each segment of interest. The prior art teaches the difficulty of video indexing, as the existing approaches consider only frontal faces. These approaches are therefore not applicable to general videos, e.g. movies, TV shows and home videos, which contain a lot of occlusions and wide variability of facial poses.
Consider, for example, recordings of travels, weddings and birthday parties that are to be organized automatically according to the human participants, or the management of a large video database. In both cases, face based video indexing is difficult because the existing approaches consider only frontal faces, and the occlusions present in the video make it difficult to differentiate between facial poses.
Thus, in the light of the above mentioned background of the art, it is evident that there is a need for a system and method for enabling face based video indexing and updating accordingly.
OBJECTS OF THE INVENTION
The principal object is to provide a system and method to generate a face based index of a video without any human intervention.
Another significant object is to provide a system and method for indexing video segments through faces, enabling face based space-time queries (who/where/when) for person search in area under surveillance.
Still another object is to provide a system and method for management of large video databases, where individual videos are indexed first with respect to faces and different videos are cross-linked next through the faces.
Still another object is to provide a system and method for an automatic organization of personal videos based on human participants and indexing video segments through faces.
Yet another object is to provide a system whose performance, both in terms of accuracy and computation, will improve given a more generalized face detector that is independent of facial poses.
SUMMARY OF THE INVENTION
Before the present systems and methods, enablement are described, it is to be understood that this application is not limited to the particular systems, and methodologies described, as there can be multiple possible embodiments which are not expressly illustrated in the present disclosures. It is also to be understood that the terminology used in the description is for the purpose of describing the particular versions or embodiments only, and is not intended to limit the scope of the present application.
The present application provides a system for face based video indexing in a video, the system comprising: (a) at least one detector-tracker reasoning scheme operational in both forward and backward directions to extract at least one face track for varying facial poses using at least one face detector and face tracker; (b) a means for filtering outliers from individual tracks containing face regions and non-face regions via hue-saturation color distribution; (c) a Gaussian Mixture Model (GMM) means for generating a set of clusters to detect modes of facial appearance; and (d) a means for generating a face based video index by analyzing the track and cluster correspondences for indexing the video in terms of faces.
In one aspect, a method for face based video indexing in a video comprises the machine implemented steps of: (a) tracking in forward and backward directions to extract face tracks for varying facial poses via a detector-tracker reasoning scheme operational in both forward and backward directions; (b) filtering outliers from individual tracks containing face regions and non-face regions via hue-saturation color distribution; (c) clustering the extracted face tracks to generate at least one cluster to detect the modes of facial appearance via a Gaussian Mixture Model (GMM) variant; (d) computing a face Track-Cluster-Correspondence-Matrix (TCCM) to identify equivalent tracks; and (e) automatically generating a face based video index by analyzing the track and cluster correspondences.
BRIEF DESCRIPTION OF THE DRAWINGS
Exemplary embodiments of the invention are discussed hereinafter with reference to the following drawings, in which photographs of pertinent humans have been used as the only practicable medium for illustrating details of the claimed invention, and in which:
Figure 1 illustrates a flow chart for the face based video index generated via analyzing the track and cluster correspondences according to various embodiments of the invention.
Figure 2(a) and 2(b) illustrate functional block diagrams for face based video index generated via analyzing the track and cluster correspondences according to various embodiments of the invention.
Figure 3 illustrates face representation scheme according to one exemplary embodiment of the invention.
Figure 4 illustrates combining of both backward and forward tracking with varying poses according to one exemplary embodiment of the invention.
Figure 5(a) and 5(b) illustrate removal of outliers from crude face log according to one exemplary embodiment of the invention.
Figure 6 illustrates clustering performance analysis according to one exemplary embodiment of the invention.
Figure 7 illustrates Track-Cluster-Correspondence-Matrix (TCCM) according to one exemplary embodiment of the invention.
Figure 8 illustrates rows for indicating the face of the human participants according to one exemplary embodiment of the invention.
Figure 9(a)-9(t) illustrate results of multiple faces tracking under occlusions according to one exemplary embodiment of the invention.
Figure 10(a) and 10(b) illustrate multiple faces tracking performance analysis according to one exemplary embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
Some embodiments, illustrating their features, will now be discussed in detail. The words "comprising," "having," "containing," and "including," and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any methods and systems similar or equivalent to those described herein can be used in the practice or testing of embodiments, the preferred methods and systems are now described. The disclosed embodiments are merely exemplary.
The present invention provides a system and method to generate a face based index of a video without any human intervention.
Figure 1 illustrates a flow chart for the face based video index generated via analyzing the track and cluster correspondences according to various embodiments of the invention. The process starts at step 102, where video frames are tracked in the forward and backward directions to extract face tracks for varying facial poses via a detector-tracker reasoning scheme operational in both directions. At step 104, outliers are filtered from the individual tracks containing face regions and non-face regions via a hue-saturation color distribution. At step 106, the extracted face tracks are clustered to generate a set of clusters to detect the modes of facial appearance using a statistical method. At step 108, a face Track-Cluster-Correspondence-Matrix (TCCM) is computed to identify equivalent tracks. The process ends at step 110, where the face based video index is generated automatically by analyzing the track and cluster correspondences.
Figure 2(a) and 2(b) illustrate functional block diagrams for the face based video index generated via analyzing the track and cluster correspondences according to various embodiments of the invention. Initially, the input video is segmented into at least two shots. The video consists of T time-ordered frames I(t), t = 1, ..., T. The process of shot detection 201 helps in partitioning this video into intervals consisting of frames which are almost consistent in color distribution. The video is decomposed into shots S_i, each spanning a time interval [ts(i), te(i)]. H(t) denotes the hue-saturation color distribution computed from I(t), and BC(H1, H2) signifies the Bhattacharyya coefficient computed from the color distributions H1 and H2. The shot boundaries are decided based on the criterion of image color distribution consistency: consecutive frames belong to the same shot as long as BC(H(t), H(t+1)) ≥ η_sd, the color match threshold. The color match threshold η_sd is chosen empirically; in a preferred embodiment, η_sd = 0.6. The component frames of each shot interval are subjected to frontal/profile face detection, and shots without any detection success are rejected as they are irrelevant to the purpose of face extraction, which is dependent on such detection results.
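A minimal sketch of this shot boundary criterion, assuming OpenCV for frame decoding and histogram computation; the bin counts and function names are illustrative, with η_sd = 0.6 as stated above:

```python
import cv2
import numpy as np

def hs_histogram(frame_bgr, bins=(30, 32)):
    """Normalized hue-saturation histogram of a BGR frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])
    return hist / (hist.sum() + 1e-12)

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two normalized distributions."""
    return float(np.sum(np.sqrt(h1 * h2)))

def detect_shots(video_path, eta_sd=0.6):
    """Split a video into shot intervals wherever the color match of
    consecutive frames drops below eta_sd."""
    cap = cv2.VideoCapture(video_path)
    shots, start, t = [], 0, 0
    ok, prev = cap.read()
    prev_hist = hs_histogram(prev) if ok else None
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        t += 1
        hist = hs_histogram(frame)
        if bhattacharyya(prev_hist, hist) < eta_sd:  # shot boundary found
            shots.append((start, t - 1))
            start = t
        prev_hist = hist
    shots.append((start, t))
    cap.release()
    return shots
```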
According to one exemplary embodiment, the face tracks are extracted from individual shots by using multiple face tracking 202 via a detector-tracker reasoning scheme. In an exemplary embodiment, a Haar feature based face detector (available with OpenCV software) is used to segment the regions of left/right profile or frontal faces in the image sequence. However, these detectors are extremely sensitive to the facial pose. Thus, although they are very accurate in detecting left/right profile or frontal faces, they fail when the facial pose changes. It is also not practical to use a large number of detectors, each tuned to a different face orientation, as that would lead to both high memory and processor usage. Thus, a detection reduced to a local neighborhood search guided by face features is advantageous to satisfy real-time constraints. Such a necessity is addressed by the procedure of tracking. The trackers, initialized on a face detection success, continue tracking where detection fails (due to facial pose variations) and update the target face features whenever the detectors succeed during the frame presence of the face.
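A single-face sketch of this detect-then-track handoff, assuming the stock OpenCV Haar cascades the text refers to and a hue-histogram mean-shift fallback; the detector parameters and histogram sizes are illustrative choices, not from the specification:

```python
import cv2

# Stock OpenCV Haar cascades for frontal and profile faces (paths assume
# a standard opencv-python install; adjust to your environment).
frontal = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
profile = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

def detect_faces(gray):
    """Union of frontal and profile detections on a grayscale frame."""
    faces = list(frontal.detectMultiScale(gray, 1.1, 4))
    faces += list(profile.detectMultiScale(gray, 1.1, 4))
    return faces  # list of (x, y, w, h)

def track_shot(frames):
    """Detect when possible; fall back to mean-shift tracking otherwise."""
    window, roi_hist = None, None
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        dets = detect_faces(gray)
        if dets:
            # Detection success: (re-)initialize the target features.
            x, y, w, h = map(int, dets[0])
            window = (x, y, w, h)
            hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
            roi_hist = cv2.calcHist([hsv], [0], None, [30], [0, 180])
            cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
        elif window is not None:
            # Detection failed (pose change): continue by tracking only.
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            back = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
            _, window = cv2.meanShift(back, window, term)
        yield window
```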
According to an embodiment, the face tracks obtained from all the shots are subjected to two stages of filtering using hue-saturation color distribution. The first stage removes outliers from the individual tracks containing face and non-face regions 203a, and the second stage removes the non-face outlier tracks 203b.
According to an embodiment, the resulting face log is clustered using a Gaussian Mixture Model (GMM) variant to discover the modes of facial appearances of different people in varying facial poses 204. The face based video index is generated by analyzing the track and cluster correspondences 205.
In the first step 102 of the proposed method, video frames are tracked in the forward and backward directions to extract face tracks for varying facial poses via a detector-tracker reasoning scheme operational in both forward and backward directions, wherein the face representation and localization is illustrated in figure 3. A set of facial features is initially computed from the face region detected by one of the profile/frontal face detectors and is updated whenever the face is detected next, isolated from other detected/tracked face regions. The face bounding rectangle, motion history, color distribution and a mixture model learned on normalized face appearances are used as the features representing the face. The location of the face F in the image is identified by the face bounding rectangle BR(F) with sides parallel to the image axes. A second order motion model (constant jerk) is used, continuously updated from the 3 consecutive centroid positions of BR(F). Using this model, the centroidal position C_t(F) at the t-th instant is predicted as C_t(F) = 2.5·C_{t−1}(F) − 2·C_{t−2}(F) + 0.5·C_{t−3}(F). The color distribution H(F) of the face F is computed as a normalized color histogram, position weighted by the Epanechnikov kernel supported over the maximal elliptical region BE(F) (centered at C(F)) inscribed in BR(F). Next, mean-shift iterations, initialized from the motion model predicted position, converge to localize the target face region in the current image. The mean-shift tracking algorithm maximizes the Bhattacharyya coefficient between the target color distribution H(F) and the color distribution computed from the localized region at each step of the iterations. The maximum Bhattacharyya coefficient obtained after the mean-shift tracker convergence is used as the tracking confidence tc(F) of the face F. This color based representation is combined with an appearance model to encode the structural information of the face. The RGB image region within BR(F) is first resized and then converted to a q × q monochrome image which is further normalized by its brightest pixel intensity to form the normalized face image nF of the face F. The normalization is performed to make the face image independent of illumination variations, as shown in figure 3.
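The motion prediction and face normalization steps can be sketched directly from these definitions; the patch size q = 32 is an illustrative choice, as the specification leaves q unspecified:

```python
import cv2
import numpy as np

def predict_centroid(c1, c2, c3):
    """Second-order (constant-jerk) prediction from the last three
    centroids: C_t = 2.5*C_{t-1} - 2*C_{t-2} + 0.5*C_{t-3}."""
    return 2.5 * np.asarray(c1) - 2.0 * np.asarray(c2) + 0.5 * np.asarray(c3)

def normalized_face(frame_bgr, rect, q=32):
    """Resize the face patch to a q x q monochrome image and divide by its
    brightest pixel to suppress illumination variation."""
    x, y, w, h = rect
    patch = cv2.cvtColor(frame_bgr[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    patch = cv2.resize(patch, (q, q)).astype(np.float32)
    return patch / max(float(patch.max()), 1.0)
```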
In the next step 106, the extracted face tracks are clustered to generate a set of clusters to detect the modes of facial appearance using a statistical method. According to an embodiment, during the course of tracking, a person appears with various facial poses. The present invention proposes to cluster the normalized faces obtained from the different facial poses to learn the modes of his/her appearances, thereby forming a Normalized Face Cluster Set (NFCS(F), henceforth). The normalized face image nF is re-arranged in a row-major format to generate the d = q × q dimensional feature vector X(nF). To achieve computational gain, the individual dimensions of the feature vector are assumed to be uncorrelated, and hence a diagonal co-variance matrix is sufficient to approximate the spread of the component Gaussians. A distribution over these feature vectors is approximated by learning a variant of the Gaussian mixture model, whereby a set of normalized face clusters is constructed.
The NFCS with K clusters is given by the set NFCS = {(π_r, μ_r, σ_r); r = 1, ..., K}, where μ_r and σ_r are the respective mean and standard deviation vectors of the r-th cluster and the weighing parameter π_r is the fraction of the total number of normalized face vectors belonging to the r-th cluster. The NFCS is initialized with the first feature vector X(nF_1) as the mean of its first cluster, along with an initial standard deviation vector σ_init.

Let there be K_{t−1} clusters in the NFCS until the processing of the vector X(nF_t). The belongingness function B_r(u) is defined for the u-th dimension of the r-th cluster as

B_r(u) = 1 if |X_u(nF_t) − μ_r(u)| ≤ λ·σ_r(u), and B_r(u) = 0 otherwise   (Equation 2)

where λ is the cluster membership threshold and is generally chosen between 1.0 and 5.0 (Chebyshev's inequality). The vector X(nF_t) is considered to belong to the r-th cluster if

Σ_{u=1..d} (1 − B_r(u)) ≤ η_mv × d

where η_mv ∈ (0, 1) is the cluster membership violation tolerance threshold such that η_mv × d denotes the upper limit of tolerance on the number of membership violations in the normalized face vector. If X(nF_t) belongs to the r-th cluster, then its parameters are updated as

π_r ← (1 − α_t)·π_r + α_t
μ_r(u) ← (1 − α_t)·μ_r(u) + α_t·X_u(nF_t), for each dimension u with B_r(u) = 1
σ_r(u)² ← (1 − α_t)·σ_r(u)² + α_t·(X_u(nF_t) − μ_r(u))², for each dimension u with B_r(u) = 1   (Equation 5)

where α_t is the learning rate. For the other clusters r′ ≠ r, the mean and standard deviation vectors remain unchanged while the cluster weight π_{r′} is penalized as π_{r′} ← (1 − α_t)·π_{r′}. However, if X(nF_t) is not found to belong to any existing cluster, a new cluster is formed (K_t = K_{t−1} + 1) with its mean vector as X(nF_t), standard deviation vector as σ_init and weight α_t, and the weights of the existing clusters are penalized as mentioned before.
It is worth noting that the parameter updates in equation 5 match traditional Gaussian Mixture Model (GMM) learning. In GMMs, all the dimensions of the mean vector are updated with the incoming data vector. Here, however, the mean and standard deviation vector dimensions are updated selectively with membership checking to resist the fading out of the mean images. Hence, the NFCS is termed a variant of the mixture of Gaussians. Figure 3 shows a few mean images of the normalized face clusters learned from the tracked face sequences of the subject.
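A minimal online sketch of this NFCS update, assuming a fixed learning rate in place of the schedule α_t (which is not fully legible in the source); the defaults echo the thresholds reported later (λ = 1.8, η_mv = 0.215), while alpha and sigma_init are illustrative:

```python
import numpy as np

class NFCS:
    """Online variant of a diagonal-covariance Gaussian mixture: each
    incoming normalized-face vector either updates a matching cluster
    (dimension-wise membership test) or spawns a new one."""

    def __init__(self, lam=1.8, eta_mv=0.215, alpha=0.05, sigma_init=0.1):
        self.lam, self.eta_mv = lam, eta_mv
        self.alpha, self.sigma_init = alpha, sigma_init
        self.mu, self.sigma, self.pi = [], [], []

    def update(self, x):
        """x: d-dimensional normalized face vector; returns cluster index."""
        d = x.size
        for r in range(len(self.mu)):
            inside = np.abs(x - self.mu[r]) <= self.lam * self.sigma[r]
            if (~inside).sum() <= self.eta_mv * d:   # x belongs to cluster r
                self.pi = [(1 - self.alpha) * p for p in self.pi]
                self.pi[r] += self.alpha
                # Selective update: only non-violating dimensions move,
                # which resists the fading-out of the mean images.
                self.mu[r][inside] += self.alpha * (x[inside] - self.mu[r][inside])
                var = self.sigma[r][inside] ** 2
                var += self.alpha * ((x[inside] - self.mu[r][inside]) ** 2 - var)
                self.sigma[r][inside] = np.sqrt(var)
                return r
        # No match: penalize existing weights and open a new cluster.
        self.pi = [(1 - self.alpha) * p for p in self.pi] + [self.alpha]
        self.mu.append(x.astype(float).copy())
        self.sigma.append(np.full(d, self.sigma_init))
        return len(self.mu) - 1
```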
According to an embodiment, tracking multiple faces is not merely the implementation of multiple trackers but a reasoning scheme that binds the individual face trackers to act according to problem case based decisions. For example, consider the case of tracking a face which gets occluded by another object. A straight through tracking approach will try to establish correspondences even when the target face
disappears in the image due to complete occlusion by some scene object, leading to tracking failure. A reasoning scheme, on the other hand, will identify the problem situation of the disappearance due to the occlusion of the face and will accordingly wait for the face to reappear by freezing the concerned tracker. The proposed approach to multiple face tracking proposes a reasoning scheme to identify the cases of face grouping/isolation along with the scene entry/exit of new/existing faces.
The process of reasoning is performed over three sets, viz. the sets of active, passive and detected faces. The active set F_a(t) consists of the faces that are well tracked until the t-th instant. On the other hand, the passive set F_p(t) contains the faces for which either the system has lost track or which are not visible in the scene. The set of detected faces F_d(t) contains the faces detected in the t-th frame. The system initializes itself with empty active/passive/detected face sets and the objects are added or removed accordingly as they enter or leave the field of view. During the process of reasoning, the objects are often switched between the active and passive sets as the track is lost or restored. The process of reasoning at the t-th frame starts from the active/passive face sets available from the (t−1)-th instant. The faces in the active set are first localized with motion prediction initialized mean-shift trackers. The extent of overlap between the tracked face regions from the active set and the detected face regions is computed to identify the isolation/grouping state of the faces.
The reasoning scheme based on the tracked-detected region overlaps is described next.
Consider the case where m faces are detected (F_d = {dF_j; j = 1, ..., m}) while n faces were actively tracked till the last frame (F_a = {aF_i; i = 1, ..., n}). To analyze the correspondence between the tracked face regions and the detected face regions, the fractional overlap between the faces F1 and F2 is defined as

FO(F1, F2) = Area(BR(F1) ∩ BR(F2)) / Area(BR(F1))

which signifies the fraction of the bounding rectangle of F1 overlapped with that of F2. The actively tracked face aF_i and the detected face dF_j are considered to have a significant overlap if either of their fractional overlaps with the other crosses a certain threshold η_ad (equation 7).
S_df(i) denotes the set of detected faces which have a significant overlap with the face aF_i in the active set, and S_af(j) represents the set of faces in the active set which have a significant overlap with the detected face dF_j (equation 9). Based on the cardinalities of these sets associated with either of aF_i or dF_j and the tracking confidence tc(aF_i), the following situations are identified during the process of tracking.
• Isolation and Feature Update — The face aF_i is considered to be isolated if its bounding rectangle does not overlap with that of any other face in the active set. Under this condition of isolation of the tracked face, the features of the face are updated if there exists a pair (aF_i, dF_k) which significantly overlap only with each other and none else, i.e. S_df(i) = {dF_k} and S_af(k) = {aF_i}. In such a case, the face boundaries are reliable on account of the face detection success, and thus the color distribution and the motion features of aF_i are updated from the associated (detected) face dF_k.
• Face Grouping — The face is considered to be in a group (e.g. multiple persons with overlapping face regions) if the bounding rectangles of the tracked faces overlap. In such a case, even if a single detected face dF_k is associated with aF_i, the correspondence cannot be ascertained on account of the multiple overlaps. Thus, in this case only the motion model of aF_i is updated, based on its currently tracked position.
• Detection and/or Tracking Failure — This is the case where, due to facial pose variations, the presence of the face is not detected in the image. However, if the face aF_i is tracked well (tc(aF_i) > η_tc), only the motion model of aF_i is updated. In the absence of a confident face detection, the exact face boundaries cannot be ascertained and hence the color distribution is not updated. However, in the case of both detection and tracking failure, aF_i is not associated with any detected face and the tracking confidence also drops below the threshold (η_tc). In this case, aF_i is considered to disappear from the scene (equation 10) and is transferred from F_a to F_p.
• New Face Identification — A new face in the scene does not overlap with any of the bounding rectangles of the existing (tracked) faces. Thus, dF_j is considered a new face if S_af(j) is a null set.
However, the system might lose track of an existing face whose re-appearance is also detected as the occurrence of a new one. Hence, the newly detected face region is normalized first and checked against the NFCS of the faces in F_p using the belongingness criterion outlined in Equation 2. If a match is found, the track of the corresponding face is restored by moving it from F_p to F_a, and its color and motion features are re-initialized from the newly detected face region. However, if no matches are found, a new face is added to F_a whose color and motion features are learned from the newly detected face region.
During the course of multiple face tracking, the faces in the active set are identified in one of the above situations and the feature update or active to passive set transfer decisions are taken accordingly. By reasoning with these conditions, new trackers are initialized as new faces enter the scene and destroyed as the faces disappear.
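A minimal sketch of the overlap bookkeeping that drives this reasoning, assuming axis-aligned (x, y, w, h) bounding rectangles; the function names are illustrative, and the default threshold echoes the fractional overlap value chosen later in the performance analysis:

```python
def fractional_overlap(r1, r2):
    """FO(F1, F2): fraction of F1's bounding rectangle covered by F2's."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    ix = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    iy = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    return (ix * iy) / float(w1 * h1)

def overlap_sets(active, detected, eta_ad=0.4):
    """S_df(i): detections significantly overlapping active face i;
    S_af(j): active faces significantly overlapping detection j."""
    S_df = {i: set() for i in range(len(active))}
    S_af = {j: set() for j in range(len(detected))}
    for i, a in enumerate(active):
        for j, d in enumerate(detected):
            if max(fractional_overlap(a, d), fractional_overlap(d, a)) >= eta_ad:
                S_df[i].add(j)
                S_af[j].add(i)
    return S_df, S_af

def can_update_features(i, S_df, S_af):
    """Isolation case: aF_i and one detection overlap only each other."""
    return len(S_df[i]) == 1 and all(S_af[j] == {i} for j in S_df[i])
```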
Figure 4 illustrates the combined scheme for tracking in both backward and forward directions for acquiring the face instances in varying poses, including the ones prior to the first detection. According to an embodiment, the proposed method assumes that a certain person will be detected in either a frontal or profile face pose at some time in a shot (of duration [ts, te], say). However, it may well happen that the person gets detected only at the t-th instant (ts < t < te), although he/she was present from the very beginning (ts) with a facial pose different from either the frontal or left/right profile. In such cases, tracking in only the forward direction will not provide all the face instances of the person. To avoid this, a backward tracker initialized with the first detection is also run to provide all the facial pose variations of the tracked person. The tracker is terminated when the tracking confidence dips below the threshold η_tc.
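The combined pass can be sketched as follows, assuming the shot's frames are buffered in memory and treating the tracker as a pair of caller-supplied callables (init_tracker, track_step); these abstractions are illustrative, not part of the specification:

```python
def bidirectional_track(frames, t_det, init_tracker, track_step, eta_tc=0.6):
    """Track from the first detection at frame index t_det in both
    directions; each direction stops when the tracking confidence
    (Bhattacharyya coefficient) drops below eta_tc."""
    def run(sequence):
        state = init_tracker(frames[t_det])
        out = []
        for frame in sequence:
            state, conf = track_step(state, frame)
            if conf < eta_tc:          # confidence dip: terminate tracker
                break
            out.append(state)
        return out

    forward = run(frames[t_det + 1:])
    backward = run(frames[t_det - 1::-1]) if t_det > 0 else []
    # Stitch: reversed backward half, the detection frame, forward half.
    return backward[::-1] + [init_tracker(frames[t_det])] + forward
```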
Figure 5(a) and 5(b) illustrate the removal of outliers from the crude face log according to one exemplary embodiment of the invention. According to an embodiment, the cropped face regions acquired by tracking are stored in a face log. However, the face log also contains non-face regions (outliers) on account of detection/tracking failure. The said outliers are of two types — first, trackers initialized on proper face regions which occasionally drift to non-face regions due to motion-model failure or premature mean-shift convergence; and second, trackers initialized from non-face regions (false detections) which continuously track these outlier regions during the entire shot. The invention proposes a two-stage filtering scheme to remove such outliers, based on three assumptions — first, hue-saturation histograms computed from face regions will have similar distributions for the skin pixels while non-face regions will have completely different distribution profiles; second, in each track the face regions are in the majority and hence the average color distribution will be considerably different from the color distributions of non-face regions; and third, in face tracks initialized on false detections there will be hardly any face regions, and thus the average hue-saturation distribution of such a track will be significantly different from an average distribution computed from only face regions.
Consider the case where N face tracks are extracted, where the i-th track T_i contains n_i faces {F_ij; j = 1, ..., n_i}. H(F_ij) denotes the hue-saturation distribution computed from F_ij, and the average distribution H_avg(i) is computed from all the faces in T_i. Based on the assumptions, the q-th face of the i-th track is declared an outlier if BC(H(F_iq), H_avg(i)) < η_cm, where η_cm is a color distribution match threshold. The outliers, if present, are removed from each track, leaving the filtered tracks. This process only removes outliers from each track but cannot filter the tracks where the trackers were initialized on non-face regions due to erroneous face detections, as shown in Figure 5(a).
The process of individual track filtering leaves two kinds of tracks — first, the "pure" ones with only face regions; and second, the ones containing mostly outliers, where the tracker was initialized on non-face regions. The average hue-saturation distribution H_avg(i) is computed from each track, and their overall average H_avg is obtained from all the tracks. Proceeding on the same assumptions outlined earlier, the i-th track is declared an outlier if BC(H_avg(i), H_avg) < η_cm, as shown in Figure 5(b). The same threshold is used in both the stages of filtering, since both employ the same criterion of matching through the Bhattacharyya coefficient. The color distribution match threshold is empirically chosen as η_cm = 0.6.
The filtering of individual tracks followed by outlier track removal leaves the tracks T_i (i = 1, ..., N′, where N′ ≤ N if outlier tracks are removed). The faces belonging to these tracks are clustered further to group the similar faces.
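A sketch of the two-stage filtering under the stated assumptions, representing each cropped face by its normalized hue-saturation histogram, with η_cm = 0.6 as chosen above; the function names are illustrative:

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two normalized histograms."""
    return float(np.sum(np.sqrt(h1 * h2)))

def filter_face_log(tracks, eta_cm=0.6):
    """tracks: list of tracks, each a list of normalized hue-saturation
    histograms (one per cropped face region).
    Stage 1 drops outlier faces inside each track; stage 2 drops whole
    tracks whose average distribution disagrees with the global average."""
    # Stage 1: per-track outlier removal against the track average.
    stage1 = []
    for track in tracks:
        avg = np.mean(track, axis=0)
        stage1.append([h for h in track if bhattacharyya(h, avg) >= eta_cm])
    # Stage 2: whole-track removal against the average of track averages.
    track_avgs = [np.mean(t, axis=0) for t in stage1 if t]
    global_avg = np.mean(track_avgs, axis=0)
    return [t for t in stage1
            if t and bhattacharyya(np.mean(t, axis=0), global_avg) >= eta_cm]
```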
Figure 6 illustrates the clustering performance analysis according to one exemplary embodiment of the invention. The face regions obtained from all tracks of the filtered face log are clustered. Ideally, each cluster should contain faces of the same person. However, such cluster purity varies with different values of the cluster membership threshold (λ) and the cluster membership violation tolerance threshold (η_mv). Consider the case where K clusters are formed, where the k-th cluster contains nC_k faces, of which mC_k faces belong to the same person and satisfy the plurality criterion. Then, the average cluster purity for a certain set of chosen thresholds is defined as

cP(λ, η_mv) = (1/K) Σ_{k=1..K} mC_k / nC_k

The clustering performance is analyzed by varying λ from 0.5 to 4.5 in steps of 0.1 and η_mv in the interval [0.05, 0.25] in steps of 0.005. The results of the clustering performance analysis are shown in Figure 6. The performance analysis is performed on 4 test data sets, and λ = 1.8 and η_mv = 0.215 are chosen, for which the maximum cluster purity of 0.804 is achieved.
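For reference, the purity computation over manually labeled clusters can be sketched as follows; the label representation is an assumption for illustration:

```python
from collections import Counter

def average_cluster_purity(cluster_labels):
    """cluster_labels[k]: ground-truth person identities of the faces in
    cluster k. A cluster's purity is the fraction of its faces carrying
    the plurality identity; cP is the mean over all (non-empty) clusters."""
    purities = [Counter(labels).most_common(1)[0][1] / len(labels)
                for labels in cluster_labels if labels]
    return sum(purities) / len(purities)

# e.g. average_cluster_purity([["A", "A", "B"], ["B", "B"]]) -> (2/3 + 1)/2
```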
Figure 7 illustrates the Track-Cluster-Correspondence-Matrix (TCCM) according to one exemplary embodiment of the invention. In the next step 108, consider the case where the filtered face log contains N′ face tracks and M clusters are obtained by face clustering. Then, an N′ × M Track-Cluster-Correspondence-Matrix (TCCM) is formed to analyze the equivalences of the different tracks present in the face log. Let cL(i, j) denote the cluster index of the j-th face in the i-th track, i.e. cL(i, j) ∈ [1, M]. The TCCM is thus formed as

TCCM[i][k] = Σ_{j=1..n′_i} δ(cL(i, j) − k)

where the i-th track contains n′_i faces and δ(·) is the Kronecker Delta function.
Tracking provides the various facial poses of the same person, while clustering helps to discover the modes of facial appearance. The similar facial appearances are grouped through clustering, while the different facial appearances of the same person are linked through tracking. Each row of the TCCM signifies the number of occurrences of the different facial appearance modes in a certain track, and each column of the TCCM denotes the frequencies with which different tracks assume the same facial appearance mode. A track i is linked to the cluster k if more than 25% of the faces of the i-th track assume the k-th facial appearance mode, i.e. if TCCM[i][k] > 0.25·n′_i. Consider the case where the i-th track is linked to the clusters k and p while the j-th track is linked to the clusters p and q. A linkage transitivity analysis identifies that the tracks i and j have a common link to the p-th cluster, and this is used to declare the tracks i and j as equivalent. A similar analysis is performed on the entire TCCM to identify the equivalent tracks.
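A sketch of the TCCM construction and the linkage transitivity analysis, using a union-find merge as one way to realize the transitive closure described above; the 25% linkage fraction follows the text, while the function names are illustrative:

```python
import numpy as np

def build_tccm(track_cluster_ids, n_clusters):
    """track_cluster_ids[i][j]: cluster index of the j-th face of track i."""
    tccm = np.zeros((len(track_cluster_ids), n_clusters), dtype=int)
    for i, ids in enumerate(track_cluster_ids):
        for k in ids:
            tccm[i][k] += 1
    return tccm

def equivalent_tracks(tccm, frac=0.25):
    """Link track i to cluster k when more than `frac` of its faces fall
    in k, then merge tracks transitively through shared clusters."""
    n = tccm.shape[0]
    parent = list(range(n))

    def find(a):                       # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    links = [set(np.nonzero(row > frac * row.sum())[0]) for row in tccm]
    for i in range(n):
        for j in range(i + 1, n):
            if links[i] & links[j]:    # common facial appearance mode
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())       # groups of equivalent track indices
```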
Figure 8 illustrates rows indicating the faces of the human participants according to one exemplary embodiment of the invention. In the next step 110, since the face tracks are obtained from indexed shots, analyzing the equivalent tracks reveals the shot presences of the same person. This is illustrated in Figure 8, where a part of the Video Face Book formed by analyzing the first episode of the first season of the TV series "Friends" is shown.
BEST METHOD
The present application is described in the example given below which is provided only to illustrate the application and therefore should not be construed to limit the scope of the application.
Figure 9(a)-9(t) illustrate results of multiple face tracking under occlusions according to one exemplary embodiment of the invention. The proposed approach for multiple face tracking is implemented on a single core 1.6 GHz Intel Pentium-4 PC with semi-optimized code and operates at 13.33 FPS (face detection stage included). The results from 4 shots are presented: from the movies "300" (624 images) and "Sherlock Holmes" (840 images); from an episode of Season 7 of the "House" TV series (495 images); and from an episode of Season 1 of the "Friends" TV series (143 images).
Performance Analysis:
Figure 10(a) and 10(b) illustrate the multiple face tracking performance analysis according to one exemplary embodiment of the invention. An object centric performance analysis, computed by manually inspecting the surveillance log for the average rates of tracking precision and track switches, is presented below. A tracker initialized over a certain face may eventually lose it on account of occlusions and may switch track to some other face(s) until an exit event. The tracking is considered to be successful if the localized region has a non-zero overlap with the actual face region in the image. Consider the case of a tracker with a life span of T frames, of which for the first T_trk frames the tracker successfully tracks the same face over which it was initialized, and then successively switches track to N_switch number of (different) face(s) during the remaining frames. The tracking precision of an individual object is then defined as T_trk / T, and the average tracking precision computed over the entire set of extracted faces is called the Tracking Success Rate for the entire video. In the same line, the Tracker Switch Rate is evaluated as the average number of track switches over the entire set of extracted objects. After a track switch from the (T_trk + 1)-th frame onwards, a different tracker may pick up the trail of this object, through a track switch from some other face or through the initialization of a new tracker — let there be N_reinit number of tracker re-initializations on some face region. The Tracker Re-initialization Rate is defined as the average number of tracker re-initializations per face computed over the entire set of extracted faces. The tracking performance measures are plotted by varying the fractional overlap threshold (η_fo) and the tracking confidence threshold (η_tc) in the interval [0.1, 0.9] in steps of 0.1. The fractional overlap threshold is chosen as η_fo = 0.4 and the tracking confidence threshold as η_tc = 0.6 for the best performance. The performance of the tracking method on the 4 test shots with respect to these measures and the chosen parameters is presented in Table 1.
TABLE 1: PERFORMANCE ANALYSIS OF MULTIPLE FACE TRACKING
Test shot                               Tracking Success Rate   Tracker Switch Rate   Tracker Re-initialization Rate
"300", Figure 9(a)-(e)                  85.10%                  1.00                  0.40
"House", Season 8, Figure 9(f)-(j)      91.52%                  0.40                  0.40
"House", Season 7, Figure 9(k)-(o)      87.80%                  0.33                  0.00
"Friends", Season 1, Figure 9(p)-(t)    81.20%                  0.50                  0.00
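The three reported measures can be computed from per-tracker counts gathered by manual inspection, as in this sketch; the record fields are illustrative names:

```python
def tracking_metrics(trackers):
    """trackers: one record per tracker from manual inspection -- lifespan
    T, frames on the initial face T_trk, track switches N_switch, and
    tracker re-initializations N_reinit observed for that face."""
    n = len(trackers)
    success_rate = sum(t["T_trk"] / t["T"] for t in trackers) / n  # precision
    switch_rate = sum(t["N_switch"] for t in trackers) / n         # per object
    reinit_rate = sum(t["N_reinit"] for t in trackers) / n         # per face
    return success_rate, switch_rate, reinit_rate
```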
The methodology and techniques described with respect to the exemplary embodiments can be performed using a machine or other computing device within
which a set of instructions, when executed, may cause the machine to perform any one or more of the methodologies discussed above. In some embodiments, the machine operates as a standalone device. In some embodiments, the machine may be connected (e.g., using a network) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client user machine in a server-client user network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may comprise a server computer, a client user computer, a personal computer (PC), a tablet PC, a laptop computer, a desktop computer, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The machine may include a processor (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory and a static memory, which communicate with each other via a bus. The machine may further include a video display unit (e.g., a liquid crystal display (LCD), a flat panel, a solid state display, or a cathode ray tube (CRT)). The machine may include an input device (e.g., a keyboard) or touch-sensitive screen, a cursor control device (e.g., a mouse), a disk drive unit, a signal generation device (e.g., a speaker or remote control) and a network interface device.
The disk drive unit may include a machine-readable medium on which is stored one or more sets of instructions (e.g., software) embodying any one or more of the methodologies or functions described herein, including those methods illustrated above. The instructions may also reside, completely or at least partially, within the
main memory, the static memory, and/or within the processor during execution thereof by the machine. The main memory and the processor also may constitute machine-readable media.
Dedicated hardware implementations including, but not limited to, application specific integrated circuits, programmable logic arrays and other hardware devices can likewise be constructed to implement the methods described herein. Applications that may include the apparatus and systems of various embodiments broadly include a variety of electronic and computer systems. Some embodiments implement functions in two or more specific interconnected hardware modules or devices with related control and data signals communicated between and through the modules, or as portions of an application-specific integrated circuit. Thus, the example system is applicable to software, firmware, and hardware implementations.
In accordance with various embodiments of the present disclosure, the methods described herein are intended for operation as software programs running on a
computer processor. Furthermore, software implementations, including but not limited to distributed processing or component/object distributed processing, parallel processing, or virtual machine processing, can also be constructed to implement the methods described herein.
The present disclosure contemplates a machine readable medium containing instructions, or that which receives and executes instructions from a propagated signal so that a device connected to a network environment can send or receive voice, video or data, and to communicate over the network using the instructions. The instructions may further be transmitted or received over a network via the network interface device.
While the machine-readable medium can be a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methodologies of the present disclosure.
The term "machine-readable medium" shall accordingly be taken to include, but not be limited to: tangible media; solid-state memories such as a memory card or other package that houses one or more read-only (non-volatile) memories, random access memories, or other re-writable (volatile) memories; magneto-optical or optical medium such as a disk or tape; non-transitory mediums or other self-contained information archive or set of archives is considered a distribution medium equivalent to a tangible storage medium. Accordingly, the disclosure is considered to include any one or more of a machine-readable medium or a distribution medium, as listed herein and including art-recognized equivalents and successor media, in which the software implementations herein are stored.
The illustrations of arrangements described herein are intended to provide a general understanding of the structure of various embodiments, and they are not intended to serve as a complete description of all the elements and features of apparatus and systems that might make use of the structures described herein. Many other arrangements will be apparent to those of skill in the art upon reviewing the above description. Other arrangements may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Figures are also merely representational and may not be drawn to scale. Certain proportions thereof may be exaggerated, while others may be minimized. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
The preceding description has been presented with reference to various embodiments. Persons skilled in the art and technology to which this application pertains will appreciate that alterations and changes in the described structures and methods of operation can be practiced without meaningfully departing from the principle and scope.
ADVANTAGES:
The above proposed system and method can be used in the applications related to:
(a) Security surveillance - Indexing video segments through faces, enabling face based space-time queries (who/where/when) for person search in areas under surveillance.
(b) Home entertainment - Automatic organization of personal videos (e.g. recordings from travel, wedding, birthday parties etc.) based on human participants.
(c) Video databases — Management of large video databases, where individual videos are indexed first with respect to faces and different videos are cross-linked next through the faces.
(d) Hyperlinked videos - Replacing the face detector in our system with a general object detector will lead to a similar system where the scene presence of similar objects can be linked. The tracked and linked object regions may now contain hyperlinks which can point to advertisements, news or general information about that object. For example, in case of faces, these links might point to celebrity homepages, recent movie releases, news headlines etc.
WE CLAIM:
1. A method for face based video indexing in a video, the method comprising machine implemented steps of: (a) tracking in forward and backward directions to extract face tracks for varying facial poses via a detector-tracker reasoning scheme operational in both forward and backward directions; (b) filtering outliers from individual tracks containing face regions and non-face regions via hue-saturation color distribution; (c) clustering the extracted face tracks to generate a set of clusters to detect the modes of facial appearance via a Gaussian Mixture Model (GMM) variant; (d) computing a face Track-Cluster-Correspondence-Matrix (TCCM) to identify equivalent tracks; and (e) automatically generating a face based video index by analyzing the track and cluster correspondences.
2. The method of claim 1, wherein the combination of backward and forward tracking is adapted to acquire all the instances of subject face with varying poses via the detector-tracker reasoning scheme.
3. The method of claim 1, wherein the cluster comprises normalized faces obtained from the different facial poses to learn the modes of appearances.
4. The method of claim 1, wherein the target facial features are updated via a face detector, whenever the face is detected next, isolated from other detected/tracked face regions.
5. A system for face based video indexing in a video, the system comprising: (a) at least one detector-tracker reasoning scheme operational in both forward and backward directions to extract at least one face track for varying facial poses using at least one face detector and face tracker; (b) a means for filtering outliers from individual tracks containing face regions and non-face regions via hue-saturation color distribution; (c) a Gaussian Mixture Model (GMM) means for generating a set of clusters to detect modes of facial appearance; and (d) a means for generating a face based video index by analyzing the track and cluster correspondences for indexing the video in terms of faces.
6. The system of claim 5 wherein the face detector is adapted to update the target facial features, whenever the face is detected next, isolated from other detected/tracked face regions.
7. The system of claim 5 wherein the detector-tracker reasoning scheme is adapted to acquire all the instances of subject face with varying poses via the combination of backward and forward tracking.