A Method Of Face Detection In Video Images And The Like.



Abstract: The present invention is directed to an efficient method of finding regions in a video to capture faces of people in motion, limiting the search space using a motion detection technique and controlling the computational requirement based on the desired accuracy of capturing faces. The invention can be used to capture faces from real-time video, where the accuracy of the operation can be further controlled depending on the computational bandwidth available in the system.


Patent Information

Application #

Filing Date

12 March 2012

Publication Number

37/2013

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Patent Number

Legal Status

Grant Date

2020-03-17

Renewal Date

Applicants

VIDEONETICS TECHNOLOGY PRIVATE LIMITED

PLOT-5, BLOCK-BP, SALT LAKE, KOLKATA-700091

Inventors

1. ACHARYA, TINKU

E-375 BAISHNABGHATA - PATULI TOWNSHIP, KOLKATA - 700091, WEST BENGAL, INDIA

2. BHATTACHARYYA, DIPAK

16/839 KISHORI BAGAN MEARBER PIRTALA, P.O.: CHINSURAH, DIST.:HOOGHLY, PIN: 712101 WEST BENGAL, INDIA.

3. BOSE, TUHIN

BE-1/14/1, PEYARA BAGAN DESHBANDHU NAGAR, CITY:KOLKATA, PIN: 700 059, WEST BENGAL, INDIA

4. DALAL, TUTAI KUMAR

KHIRPAI - HATTALA (WD - 4), DIST: PASCHIM MEDINIPUR, PIN: 721232, WEST BENGAL, INDIA.

5. DAS, SAWAN

16 GREEN VIEW, GARIA, CITY: KOLKATA, PIN: 700084, WEST BENGAL, INDIA.

6. DHAR, SOUMYADEEP

PURBAYAN APPARTMENT, TARASANKAAR ROAD BY LANE, DESBANDHU PARA, P.O.: SILIGURI, DIST: DARJEELING, PIN: 734404, WEST BENGAL, INDIA.

7. MAITY, SOUMYADIP

VILL & PO: DUMARDARI, PIN: 721425, PURBA MEDINIPUR, WEST BENGAL, INDIA

Specification

Field of the Invention
The present invention is directed to an efficient method of finding regions in a video to
capture faces of people in motion, limiting the search space using a motion detection
technique and controlling the computational requirement based on the desired accuracy of
capturing faces. The invention can be used to capture faces from real-time video,
where the accuracy of the operation can be further controlled depending on the
computational bandwidth available in the system.
Background of the Invention
Video Management Systems are used for video data acquisition and search processes
using single or multiple servers. They are often loosely coupled with one or more
separate systems for performing operations on the acquired video data such as
analyzing the video content, etc. Servers can record different types of data in storage
media, and the storage media can be directly attached to the servers or accessed
over IP network. This demands a significant amount of network bandwidth to receive
data from the sensors (e.g., cameras) and to concurrently transfer or upload the data
to the storage media. Because of the high bandwidth demand of such tasks,
especially for video data, separate high-speed networks are often dedicated to transferring
data to storage media. A dedicated high-speed network is costly and often requires
costly storage devices as well. This is often overkill for low or moderately priced
installations.
It is also known that, to back up against server failures, one or more dedicated
fail-over (sometimes called mirror) servers are often deployed in the prior art. Dedicated
fail-over servers remain unused during normal operations, resulting in wastage
of such costly resources. Also, a central server process, installed either in the failover
server or in a central server, is required to initiate the back-up service in case a
server stops operating. This strategy does not avoid a single point of failure.
Moreover, when the servers and clients reside over different ends in an internet and
the connectivity suffers from low or widely varying bandwidth, transmission of multi-
channel data from one point to another becomes a challenge. Data aggregation
techniques are often applied in such cases which are computationally intensive or
suffer from inter-channel interference, particularly for video, audio or other types of
multimedia data.

As regards analytic servers presently in use, it is well known that there are many
video analytics systems in the prior art. Video content analysis is often done on a per-frame
basis that is mostly predefined, which makes such systems not only lacking in the desired
efficiency of analytics but also unnecessarily costly, with unwanted loss of
valuable computing resources.
Added to the above, in case of presently available techniques of video analysis, cases
of unacceptable number of false alarms are reported when the content analysis
systems are deployed in a noisy environment for generating alerts in real time. This
is because the traditional methods are not automatically adaptive to demography
specific environmental conditions, varying illumination levels, varying behavioural
and movement patterns of the moving objects in a scene, changes of appearance of
colour in varying lighting conditions, changes of appearance of colours in global or
regional illumination intensity and type of illumination, and similar other factors.
It has therefore been a challenge to identify the appearance of a non-moving foreign
object (static object) in a scene in presence of other moving objects, where the
moving objects occasionally occlude the static object. Detection accuracy suffers in
various degrees under different demographic conditions.
Extraction of particular types of objects (e.g. face of a person, but not limited to) in
images based on fiduciary points is a known technique. However, computational
requirement is often too high for traditional classifier used for this purpose in the
prior art, e.g., Haar classifier.
Also, in a distributed system where multiple sites with independent administrative
controls are present, unification of those systems through a central monitoring
station may be required at any later point of time. This necessitates hardware and OS
independence in addition to the backward compatibility of the underlying
computational infrastructure components, and the software architecture should
accommodate such amalgamation as well.
It would be thus clearly apparent from the above state of the art that there is need
for advancement in the art of sensory input/data such as video acquisition cum
recording and /or analytics of such sensory inputs/data such as video feed adapted to
facilitate fail-safe integration and /or optimized utilization of various sensory inputs
for various utility applications including event/alert generation, recording and related
aspects.

Traditionally, the faces are located in a still image using Haar feature based classifier.
Inherently, some non-face regions are also wrongly classified as faces. Computational
requirement is also very high due to excessive number of convolution operations.
This is unacceptable in a real time surveillance scenario. Viola et al. [1] have
introduced a rapid object detection scheme based on a boosted cascade of simple
features to achieve high frame rates working only with the information present in a
single grey scale image using Integral Matrices. Operating on 384 by 288 pixel
images, it is able to detect faces at 15 fps on a conventional 700 MHz Intel Pentium
processor. R. Lienhart [2] introduced a set of rotated Haar-like features, which significantly
enriches this basic set of simple Haar-like features and gives on average a 10% lower
false alarm rate. This extended feature set, however, increases the overall computational
requirement. In some other face detection systems, auxiliary information, such as
image differences in video sequences or pixel color in color images, has been used to
decrease computation time. But even after applying all these techniques together, the
system cannot process more than 10-15 frames per second for a 384 by 288 pixel
video on a system based on a 2.0 GHz Intel Core 2 Duo processor.
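The integral-image idea referred to above can be illustrated with a minimal sketch. It is not taken from the specification or from the cited works; the function names and the tiny test image are hypothetical. It shows why a rectangular Haar-like feature costs only a few table look-ups once the summed-area table has been built.

```python
def integral_image(img):
    """Return a (H+1) x (W+1) summed-area table for a 2-D list of grey values."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Left half minus right half: a horizontal two-rectangle Haar-like feature."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Hypothetical 4x3 image with a bright left half and a dark right half.
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
ii = integral_image(image)
total = rect_sum(ii, 0, 0, 4, 3)
edge = two_rect_feature(ii, 0, 0, 4, 3)
```

Each feature evaluation uses a constant number of look-ups regardless of the rectangle size, but the number of windows to evaluate still grows with the image area, which is the cost the present method aims to limit.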
Increasing the video size decreases the fps exponentially. A 384 by 288 pixel image
size is not effective for a real-time surveillance system for proper detection and
subsequent processing using these faces, e.g. recognition, and other biometric
applications. With the advent of Megapixel cameras, it is possible to cover wide areas
with prominent, higher-resolution face capture so that the faces can be used effectively for
subsequent applications as explained above. However, the computational
requirement with traditional technology increases so significantly as to be prohibitive
for on-line applications such as surveillance and on-line criminal detection systems in
a smart city.
Objects of the Invention
It is thus the basic object of the present invention to provide for advancements to
find regions in a video to capture faces of people in motion, limiting the search space
using motion detection technique, controlling the computational requirement based
on desired accuracy of capturing faces.
Another object of the present invention is directed to advancement relating to
capturing of faces from real time video where the accuracy of the operation can be
controlled depending on the computational bandwidth available in the system.
A further object of the present invention is directed to enhance the efficiency of
extracting face regions from a sequence of video frames and also, depending on the
availability of computational bandwidth, the number of iterations and pixel shifts as
required in the proposed advancement is controlled whereby it is possible to strike a
balance between the computational requirement and the accuracy of face detection.
Yet another object of the present invention is directed to a multi-channel, multiple
analysis process system, wherein the face detection technique can be used as a
cooperative process coexisting with other compute intensive processes.
Another object of the present invention is directed to advancements in face detection
wherein the search space could be reduced by considering the motion vector and
sliding the window only in the blob regions where motion is detected which favours
achieving reduced computation enabling to process larger resolution video imagery
to advance the face detection systems in today's era of increasingly growing demand
of higher resolution surveillance cameras.
Yet further object of the present invention is directed to advancement in face
detection techniques whereby several parameters can also be dynamically adjusted
so that detection and capture of face of people in motion can be done with varying
accuracy depending upon the computational bandwidth available at any point.
Another object of the present invention is directed to advancements in method
discussed above by interconnecting a number of intelligent components consisting of
hardware and software, and involving implementation techniques adapted to make
the system efficient, scalable, cost effective, fail-safe, adaptive to various
demographic conditions, adaptive to various computing and communication
infrastructural facilities.

Summary of the Invention
Thus according to the basic aspect of the present invention there is provided a
method of face detection in video images and the like comprising the step of
limiting the search space involving motion detection technique and controlled
computational requirements based on desired accuracy by carrying out prediction of
the number of iterations and temporal parameter "t".
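As a hedged illustration of how motion detection can limit the search space, the sketch below finds a motion rectangle by simple frame differencing and compares the number of sliding-window positions in the full frame with the number inside that rectangle. Frame differencing, the threshold value, the 16x16 frame and the 4x4 window are assumptions chosen for the example; the specification does not prescribe a particular motion detector.

```python
def motion_rectangle(prev_frame, curr_frame, threshold=20):
    """Bounding box (x, y, w, h) of pixels whose grey value changed by more
    than `threshold` between two frames, or None if nothing moved."""
    xs, ys = [], []
    for y, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for x, (p, c) in enumerate(zip(prev_row, curr_row)):
            if abs(c - p) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


def window_count(width, height, win_w, win_h, pixel_shift):
    """Number of sliding-window positions in a width x height area."""
    if width < win_w or height < win_h:
        return 0
    return ((width - win_w) // pixel_shift + 1) * ((height - win_h) // pixel_shift + 1)


# Hypothetical frames: a 6x6 block changes between the two frames.
WIDTH = HEIGHT = 16
previous = [[0] * WIDTH for _ in range(HEIGHT)]
current = [row[:] for row in previous]
for y in range(4, 10):
    for x in range(6, 12):
        current[y][x] = 200

roi = motion_rectangle(previous, current)
full_frame_windows = window_count(WIDTH, HEIGHT, 4, 4, 2)
roi_windows = window_count(roi[2], roi[3], 4, 4, 2)
```

In this toy case the classifier would be evaluated on 4 windows instead of 49; the saving on real footage depends on how much of the frame is in motion.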
A method of face detection in video images as above comprising the steps of:
i) involving the grey image of cropped motion rectangular area from current
frame to calculate said temporal parameter "t" and updating "t" with history
and calculating possible number of iterations "nIterations";
ii) calculating scale factor, number of iterations and other parameters from a look up
table;
iii) using convolution on different scaled images to get probable face
rectangles;
iv) grouping the probable faces with spatial information; and obtaining
therefrom the confirmed faces.
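The grouping step (iv) can be sketched as follows. This is only one plausible reading: the specification says the probable faces are grouped "with spatial information" without fixing a rule, so the intersection-over-union measure, the 0.5 overlap threshold and the minimum group size of 2 are all assumptions, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two rectangles given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def group_faces(rects, min_overlap=0.5, min_neighbours=2):
    """Greedily cluster overlapping candidates and keep the averaged rectangle
    of every cluster with at least `min_neighbours` members."""
    groups = []
    for rect in rects:
        for group in groups:
            # Compare against the first member of each existing group.
            if iou(rect, group[0]) >= min_overlap:
                group.append(rect)
                break
        else:
            groups.append([rect])
    confirmed = []
    for group in groups:
        if len(group) >= min_neighbours:
            n = len(group)
            confirmed.append(tuple(sum(r[i] for r in group) // n for i in range(4)))
    return confirmed


# Three overlapping candidates around one face and one isolated false hit.
candidates = [(10, 10, 20, 20), (12, 11, 20, 20), (11, 12, 20, 20), (60, 60, 20, 20)]
confirmed_faces = group_faces(candidates)
```

The isolated candidate is discarded because no other detection supports it, which mirrors the intent of confirming faces only where several probable rectangles agree.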
A method of face detection in video images as above comprising using the
convolution on probable face regions with Haar feature set to confirm faces and
publishing the confirmed faces based thereon.
A method of face detection in video images as above comprising the step of carrying out
said temporal estimation "t" and prediction of the possible number of iterations "nIterations"
as follows:
i. Generating the time taken to detect faces for an image of size MxN based on
T_MN = t * [(M - m) * (N - n)] / [pixelShift * pixelShift]
where pixelShift is the window shift size and t is the time taken to process a
single window area (fixed window size mxn) with the standard feature set.
ii. For multi-scale processing, ScaleFactor = f(M, N, m, n, nIterations).
iii. Total time taken to detect faces,
where M' = M / (ScaleFactor')
N' = N / (ScaleFactor')
iv. T = f(M, N, t, pixelShift, nIterations), for a fixed size window.
v. Calculating the average t on the host machine and tuning the parameters pixelShift and
nIterations accordingly, using the generated lookup table, to suit the bandwidth;
and
vi. Optionally, to increase the accuracy, enabling a second pass upon the probable
face regions detected by the first pass.
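The time model above can be turned into a small budgeting sketch. The single-scale expression follows the T_MN formula as stated. Everything else is an assumption made for illustration: the total time is taken as the sum of the single-scale times over the iterations, the image is shrunk by ScaleFactor raised to the iteration index, "t" is smoothed with an exponential moving average, and the most expensive setting that fits the budget is treated as the most accurate one. The candidate values in the lookup table are hypothetical.

```python
def single_scale_time(M, N, m, n, t, pixel_shift):
    """T_MN = t * (M - m) * (N - n) / pixelShift^2, as stated in the text."""
    if M < m or N < n:
        return 0.0
    return t * ((M - m) * (N - n)) / (pixel_shift * pixel_shift)


def total_time(M, N, m, n, t, pixel_shift, n_iterations, scale_factor):
    """ASSUMPTION: sum of single-scale times, shrinking by scale_factor**i."""
    total = 0.0
    for i in range(n_iterations):
        s = scale_factor ** i
        total += single_scale_time(M / s, N / s, m, n, t, pixel_shift)
    return total


def update_t(history_t, measured_t, alpha=0.2):
    """ASSUMPTION: 'updating t with history' as an exponential moving average."""
    if history_t is None:
        return measured_t
    return (1 - alpha) * history_t + alpha * measured_t


def build_lookup_table(M, N, m, n, t, scale_factor=1.25):
    """Rows of (estimated_time, pixel_shift, n_iterations) for candidate settings."""
    table = []
    for pixel_shift in (1, 2, 4, 8):
        for n_iterations in (1, 2, 4, 8):
            cost = total_time(M, N, m, n, t, pixel_shift, n_iterations, scale_factor)
            table.append((cost, pixel_shift, n_iterations))
    return table


def tune(table, budget):
    """Pick the costliest setting within the budget, else the cheapest overall."""
    feasible = [row for row in table if row[0] <= budget]
    return max(feasible) if feasible else min(table)


# Hypothetical numbers: 384x288 region, 24x24 window, 1 microsecond per window.
t = update_t(None, 1e-6)
table = build_lookup_table(384, 288, 24, 24, t)
choice = tune(table, budget=0.04)  # 40 ms per frame
```

A host with less spare capacity would pass a smaller budget and receive a larger pixel shift or fewer iterations, which is the trade-off between computation and accuracy that the method describes.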
The framework disclosed herein can be used for such situations, and also for
integrating multiple heterogeneous systems in a distributed environment. The
proposed architecture is versatile enough to interface and scale it to many other
management systems.
The details of the invention and its objects and advantages are explained hereunder
in greater detail in relation to the following non-limiting exemplary illustrations as per
the following accompanying figures:
Brief Description of the Drawings
Fig 1: is a schematic layout of an illustrative embodiment showing an integrated
intelligent server based system of the invention having sensory input/data
acquisition cum recording server group and /or analytics server group adapted to
facilitate fail-safe integration and /or optimized utilization of various sensory inputs
for various utility applications;
Fig 2: is an illustrative top level view of intelligent video management system with
framework for multiple autonomous system integration;
Fig 3: is an illustration of fail-safe bandwidth optimized recording without any
supporting failover support server in accordance with the present invention;
Fig 4: is an illustration of the dataflow diagram from a single video source through
the recording server;
Figs. 4A to 4J: illustrate an exemplary "Intelligent Home Security" box involving the
system of the invention;

Fig.5: is an illustration of the single channel data flow in video analytical engine in
accordance with the present invention;
Fig.6: is an illustration of intelligent video analytics server in accordance with the
present invention;
Fig.7: is an illustration of video management interface functionalities in accordance
with the present invention;
Fig.8: is an illustration of a conventional process of identification of faces with spatial
information;
Fig.9:is an illustration of the process for enhanced and confirmatory identification of
faces in accordance with the present invention;
Detailed Description of the Invention:
Reference is first invited to accompanying figure 1 which shows the broad overview of
an illustrative embodiment showing an integrated intelligent server based system
having sensory input/data acquisition cum recording server group and /or analytics
server group adapted to facilitate fail-safe integration and /or optimized utilization of
various sensory inputs for various utility applications. More specifically, the system
involves the method for bandwidth adaptive data transfer to central storage cluster in
accordance with the present invention. The following description in relation to figures
1 to 7 deals with the utilities of the advancement in an integrated intelligent server
based system and further in relation to figures 8 and 9 further illustrates the method
of face detection in video images and the like in accordance with the present
invention.
As would be apparent from the figure 1 the system basically involves the self-reliant
group of recording servers (101), the group of analytical servers (102) and an
intelligent interface (103). Importantly, said recording servers, apart from being
mutually cooperative and self-reliant to continuously monitor and distribute the
operative load based on the number of active servers in the group, are also adapted
for bandwidth optimized fail-safe recording (104) and join-split mechanism for multi
channel video streaming (105).

The analytical servers (102) are also adapted to cater to at least one or more of
background estimation (106), identifying moving, static, quasi static objects (107),
enhanced object tracking (108), content aware resource scheduling (109), join-split
mechanism for sensory data streaming (110) and resource dependent accuracy
control (111).
The various components of the above system adapted to carry out the above
advanced functionalities in accordance with the present invention are further outlined
and schematically described in Fig 2:
1. Intelligent Video Management System (204)
1.1. Video Recording Server (201)
1.2. Video Management Interface (203)

1.2.1. User management and Client access controller
1.2.2. Event concentrator and Handler (206)
1.2.3. Event distributor

2. Intelligent Video Analytics Server (202)
3. Surveillance Client (207)
4. Web client (207)
5. Mobile device Client (207)
6. Remote Event Receiver ( 206 )
As is clearly apparent from Figure 2, the present system would enable seamless and
intelligent interconnection of multiple Autonomous Systems (210-01, 210-02, ..., 210-0n).
Thus at the same time, multiple such Autonomous Systems can be used as
building blocks for a distributed system spanning across wide geographical regions
under different local administrative control, with a centralized view of the whole
system from a single point. An Autonomous System (210-01) is considered as a
system capable of implementing the functionalities and services involving sensory data
and/or its analysis.
Also, the system is capable of handling any sensory data/input and it is only by way
of an illustration but not by way of any limitations of the present system that the
various exemplary illustrations hereunder are discussed with reference to video
sensory data. The underlying system architecture/methodology is applicable in other
sensory data types for a true Intelligent Sensor Management System .
A number of machine vision products spanning the domain of Security and
surveillance, Law enforcement, Data acquisition and Analysis, Transmission of
multimedia contents, etc. can be adapted to one or more or the whole of the system
components of the present invention.
Reference is now invited to accompanying figure 3 which shows by way of an
embodiment a fail-safe bandwidth optimized recording without any failover support
server. As apparent from said figure, for this purpose the input from the pool of
sensors (305) is fed not to any single server but to a group of servers
(301). Importantly, a communication channel (303) is provided to carry inter-VRS
communication, forming a team towards failover support without any central
management and failover server, while the communication channel (302) is provided
to carry data to central storage involving the intelligent bandwidth sharing technique of
the invention.
The implementation of the Recording System :
The Recording system essentially implements the functionalities and services as
hereunder:
1. Collecting Data real time: Collect data from various images, video and
other sensory sources, both on-line and off-line, archiving and indexing
them to seamlessly map in any relational or networked database in a
fail-safe way making optimal usage of computing, communication and
storage resources, facilitate efficient search, transcoding,
retransmission, authentication of data, rendering and viewing of
archived data at any point of time.
2. Streaming data real time or on Demand: Streaming video and other
sensory content in multiple formats to multiple devices for purposes
like live view in different matrix layout, relay of the content, local
archiving, rendering of the sensory data in multiple forms and formats,
etc. by a fail-safe mechanism without affecting speed and performance
of on-going operations and services.
The Video Recording system is implemented using hardware and software, where the
hardware can be any standard computing platform operated under control of various
operating systems like Windows, Linux, MacOS, Unix, etc. Dependence on hardware
computing platform and operating system has been avoided and no dedicated
hardware and communication protocol has been used to implement the system.

Recording server implements an open interface both for input and output, (including
standard initiatives by various industry consortium such as ONVIF, PSIA, etc.), and
can input video feed from multiple and different types of video sources in parallel,
with varying formats including MPEG4, H.264, MJPEG, etc. OEM specific SDKs to
receive video can also be used. The internal operating principle of the Recording server is
outlined below:
Recording Server operating principle is adapted for the following:
1. Auto register itself to the IVMS system so that other components like VMS,
Surveillance Clients, other VRSes can automatically find and connect it even
when its IP-address changes automatically or manually.
2. Form a group with other VRS in the system to implement a failover support
without any central control and without support from any dedicated failover
server.
3. Accept request from VMI to add and delete data sources including video
sources like cameras, receive data from those input sources over IP-network
or USB or other connectivity, wired or wireless, using open protocols or SDKs
as applicable for a particular data source.
4. Record the video and other sensory data in local storage either continuously or
on trigger from external devices including the data source itself or on trigger
from other components of the Video management system or on user request
or on a combination of some of the above cases.
5. Intelligently upload the video or other sensory data to a cluster of storage
devices, where a cluster consists of one or more network accessible storages,
in an efficient way giving fair share to individual data sources, utilizing
optimal bandwidth and in a cooperative way.
6. Insert information in database so that the data including video data can be
searched easily by any component in the system.
7. Stream the video or other sensory data in their original format or in some
other transcoded format to other devices including the Surveillance clients
when the surveillance client connects it using defined protocol.

Auto registration of servers:
All the servers in the system, including the Recording servers, auto register
themselves by requesting and then getting a unique Identification number (ID) from
the VMI. All the configuration data related to the server including the identification of
data sources including the video sources it caters to, the storage devices it uses, etc
are stored in the database against this ID. This scheme has the advantage that with
only one Static IP address (that of the VMI), one can access any component of the
Autonomous System (AS), and the IP addresses of the individual hardware
components may be kept varying.
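A minimal sketch of the auto-registration scheme described above is given here. The class and method names are hypothetical stand-ins and the in-memory dictionary replaces the database; the point shown is only that configuration is keyed by the issued ID, so a server's IP address can change without losing its configuration.

```python
import itertools


class ManagementInterface:
    """Hypothetical stand-in for the VMI: issues unique IDs and keeps the
    configuration of every registered server against its ID."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._servers = {}

    def register(self, ip_address, config):
        """Issue a new ID and store the server's current address and config."""
        server_id = next(self._next_id)
        self._servers[server_id] = {"ip": ip_address, "config": dict(config)}
        return server_id

    def update_address(self, server_id, ip_address):
        """Record a changed IP address; the configuration is left untouched."""
        self._servers[server_id]["ip"] = ip_address

    def lookup(self, server_id):
        """Return the stored record so other components can find the server."""
        return self._servers[server_id]


vmi = ManagementInterface()
recorder_id = vmi.register("10.0.0.5", {"channels": ["cam-1", "cam-2"]})
vmi.update_address(recorder_id, "10.0.0.9")  # e.g. after a DHCP lease change
record = vmi.lookup(recorder_id)
```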
Recording Video or other sensory data in local storage and streaming the data to
Client machine:
The cameras, other video sources or sources generating streaming data (henceforth
called Channels) can be auto detected or manually added to the VRS. The details of
the channels are stored in the Central Database. Once done, one or more channels
can be added to the Recording System. The Recording system thus comprises one
or more Recording servers (VRS) and the Central Database Management System.
The VRSes consult the database to learn the details of the system, and record the
channel streaming data either continuously, or on trigger from any external or
internal services, as configured by the user.
The data stream is first segmented into small granular clips or segments of
programmable and variable length sizes (usually of 2 to 10 minutes duration) and the
clips are stored in the Local storage of the server, the clip metadata being stored in
local database.
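The segmentation of a stream into clips with accompanying metadata can be sketched as below. The frame representation as (timestamp, payload) tuples, the metadata fields and the function name are assumptions for illustration; the 120-second clip length corresponds to the lower end of the 2 to 10 minute range mentioned above.

```python
def segment_stream(frames, clip_seconds):
    """Split an iterable of (timestamp, payload) tuples into clips of roughly
    `clip_seconds` duration and return (clips, metadata)."""
    clips, metadata = [], []
    current = []

    def flush():
        # Store a copy of the finished clip together with its metadata record.
        clips.append(list(current))
        metadata.append(
            {"start": current[0][0], "end": current[-1][0], "frames": len(current)}
        )
        current.clear()

    for timestamp, payload in frames:
        if current and timestamp - current[0][0] >= clip_seconds:
            flush()
        current.append((timestamp, payload))
    if current:
        flush()
    return clips, metadata


# Hypothetical stream: one frame record per minute for ten minutes.
stream = [(ts, f"frame-{ts}") for ts in range(0, 600, 60)]
clips, clip_metadata = segment_stream(stream, clip_seconds=120)
```

Because the clip length is a parameter, it can be varied per channel or over time, which is what makes the later upload step adjustable.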
Reference is invited to accompanying figure 4 which shows the dataflow mechanism
in accordance with the invention from a single video source through the recording
server. As apparent from Figure 4, the sensory data stream, viz. video (405), is fed
to a data segment generator (401), whose output is next stored in segments in local storage
(402/403) and thereafter uploaded through the data upload module (404) to a central
storage (406/407).
Any external component of the system can query the VRS to know about the details
of the channels it is using and get the data streams for purposes like live view,
relaying to other devices, etc. using a networked mutual client-server communication
protocol.
Bandwidth adaptive data uploading to central storage system

In the system of the invention, an efficient technique has been designed to transfer
video or other sensory data received from the channels to the central storage system
via the local storage. Instead of allocating a particular data source (e.g., a camera) to
a particular server (dedicated point to point) for recording of data (e.g, video), it is
allocated to a 'Server group' with multiple servers in the group [Fig 3]. The members
of the group exchange their capacity information amongst themselves and share the
load according to their capacity. In case of breakdown of one or more servers, the
team members share the load of the failed server(s), without any central control or
without support from any dedicated fail-over server. For data uploading, each server
not only monitors the available bandwidth but also the data inflow rate for each
channel into the server, and accordingly adjusts the upload rate for an individual
channel. For this purpose the data stream is segmented into variable sized clips and
the rate of uploading the clips to the central storage is adjusted depending on the
available network bandwidth and data inflow rate for that particular channel [Fig
4]. As shown in the figure, the sensor data stream (405) is segmented in the data
segment generator (401), next stored in local storage (402, 403) and
thereafter sent by the data upload module (404) to the central
storages (406/407).
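One simple way to adjust the per-channel upload rate from the monitored bandwidth and inflow rates is a proportional share, sketched below. The proportional policy is an assumption chosen for illustration; the specification states only that the rate depends on available bandwidth and on each channel's inflow rate. The function name and channel names are hypothetical.

```python
def upload_rates(available_bandwidth, inflow_rates):
    """Split the available upload bandwidth across channels in proportion to
    each channel's data inflow rate (same units for all values)."""
    total_inflow = sum(inflow_rates.values())
    if total_inflow <= 0:
        return {channel: 0.0 for channel in inflow_rates}
    return {
        channel: available_bandwidth * inflow / total_inflow
        for channel, inflow in inflow_rates.items()
    }


# Hypothetical measurement: 10 Mbit/s free, two channels producing 4 and 1 Mbit/s.
rates = upload_rates(10.0, {"cam-1": 4.0, "cam-2": 1.0})
```

Recomputing the shares whenever the measured bandwidth or inflow changes gives every data source a fair portion of the link without fixed per-channel reservations.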
Implementing fail-over support without any dedicated failover server and mirror
central control
The system of the invention is further adapted for back up support in case of server
failure without the involvement of any special independent stand by support server.
Traditionally (prior art), dedicated fail-over servers are used which sense the
heartbeat signals broadcast by the regular servers. Once the heartbeat is found
missing, the failover server takes up the task of the failed server. This technique is
inefficient as it not only blocks the resources as dedicated failover servers, but cannot
utilize the remaining capacity of the existing servers for back up support. Also, failure
of the failover server itself jeopardizes the overall failover support system.
In the proposed system the recording servers exchange information amongst
themselves so that each server knows the leftover capacity and the channel
information of every other server. In case of server failure, the remaining active
servers distribute the load amongst themselves.
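The redistribution of a failed server's channels among the surviving group members can be sketched as follows. The data layout, the "largest leftover capacity first" rule and the names are assumptions for illustration; in the described system every server holds this information and no central controller is involved, so each survivor could run the same deterministic computation locally.

```python
def redistribute(servers, failed):
    """Reassign the channels of `failed` to the surviving servers, always
    choosing the survivor with the largest leftover capacity.

    servers: {name: {"capacity": int, "channels": [channel, ...]}}
    Returns (survivors, unassigned_channels)."""
    orphans = list(servers[failed]["channels"])
    survivors = {
        name: {"capacity": info["capacity"], "channels": list(info["channels"])}
        for name, info in servers.items()
        if name != failed
    }
    if not survivors:
        return {}, orphans

    def leftover(name):
        return survivors[name]["capacity"] - len(survivors[name]["channels"])

    unassigned = []
    for channel in orphans:
        best = max(survivors, key=leftover)
        if leftover(best) <= 0:
            unassigned.append(channel)  # the whole group is saturated
        else:
            survivors[best]["channels"].append(channel)
    return survivors, unassigned


# Hypothetical group of three recording servers; server C fails.
group = {
    "A": {"capacity": 4, "channels": ["a1", "a2"]},
    "B": {"capacity": 4, "channels": ["b1"]},
    "C": {"capacity": 4, "channels": ["c1", "c2", "c3"]},
}
after_failure, dropped = redistribute(group, failed="C")
```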
The Implementation of the Video Analytics System
The Video Analytics System essentially implements the functionalities as hereunder:

1. Data Content Analysis: Intelligently analysing the data, on-line or off-line, to
extract the meaningful content of the data, identifying the activities of
foreground human and other inanimate objects in the scene from the sensor
generated data, establishing correlation among various objects (living or
non-living) in the scene, establishing correlation amongst multiple types of
sensory data, identifying events of interests based on the detected activities,
all either automatically or in a user interactive way under various
demographic and natural real life situations. Several novelties have been described
in the relevant sections describing the details of the data content analysis
techniques.
2. Automatic Alert Generation: Generating Alerts, signals, video clips, other
sensory data segments, covering the events automatically as and when
detected.
The Video Analytics system comprises hardware and software, where the hardware
can be any standard computing platform operated under control of various operating
systems like Microsoft Windows, Linux, MacOS, Unix, RTOS for embedded hardware,
etc.
Dependence on hardware computing platforms and operating systems has been
avoided and no dedicated closed hardware needs to be used to implement the
system. At the same time, part or whole of the system can be embedded into other
products with some existing services, without affecting those services.
An example is provided in the form of "Intelligent Home Security" box shown in
Figures 4A to 4J where a specially built hardware is used to provide several services
viz, Digital Photo-frame, Perimeter security, Mobile camera FOV recording & relay,
Live view of cameras, etc.
Referring to FIG. 4A, a schematic diagram of a Networked Intelligent
Villa/Home/Property Monitoring System is shown. All of the intelligent video
management server and intelligent monitoring applications that are described in
previous sections have been embedded into the Videonetics Box. The Box has an
easy to use GUI using touch-screen so that any home/villa/property owner can easily
operate it with minimum button pressing using visual display based instructions only.
The top level systems architecture for the embedded hardware and details of the
components in the hardware system is shown in FIG. 4B.

The following is a micro-architectural components summary for an example of a
multi-channel IP-camera solution. Video from IP-Cameras is directly fed to the
computer without the requirement of any encoder. There are three options: One, no
network switch is required. The Motherboard should have multiple Ethernet ports;
two, the Motherboard has only one Ethernet port assuming all the cameras are
wireless IP-Cameras. The Motherboard should have 1 x Ethernet port and 1 x Wifi
interface; and three, the Motherboard has only one Ethernet port, the cameras are
wired, but a Network switch is required as an external hardware.
On detection of events the following tasks are performed:
a siren blows;
an SMS/MMS is sent;
event clip is archived; and
the event clip is also streamed to any designated device over the Internet.
The following Interfaces are required to handle the above tasks: at least one RELAY
O/P for siren drive or DIO for Transmitter interface; and a 3G interface for SMS/MMS
or sending event clip to Cell Phone. Other usual hardware includes:
a. USB;
b. Touch Screen Interface;
c. external storage;
d. 3G dongle, if 3G is not embedded into motherboard;
e. keyboard, if touch screen is not attached; and
f. DVI port for display.
The following is a micro-architectural components summary for an example of a
multi-channel analog camera solution. Video from analog camera is received by an
encoder hardware. The encoded RAW image is fed to the computer for processing.
System Hardware should be capable to handle the following activities:
1. multi channel encoding, each at 15 - 30 fps for D1 size, but not limited to this: higher
frame rates and higher resolutions are possible as long as the computing bandwidth supports
them
a. Input to encoder: Analog video in NTSC or PAL
b. Output from encoder: YUV or RGB

There are two options:
a. The encoder could be a separate module connected to motherboard
through PCIE
b. The encoder circuitry may be embedded in the mother board
2. On detection of events following tasks are performed:
a. A siren blows
b. An SMS/MMS is sent
c. Event clip is archived
d. Event clip is also streamed to any designated device over Internet
The following hardware Interfaces are required to handle the above tasks:
a. At least one RELAY O/P for siren drive or External Transmitter interface
(DIO)
b. 3G interface for SMS/MMS or sending event clip to Cell Phone.
c. Ethernet for remote access to the system
3. Other usual hardware:
a. USB
b. Touch Screen Interface
c. External Storage
d. 3G dongle, if 3G is not embedded into motherboard
e. keyboard, if touch screen is not attached
f. DVI port for display
Referring to FIG. 4C, a top level heterogeneous system architecture (both IP and
analog cameras) is illustrated. Referring additionally to FIGS. 4D-4J an operational
flow by a user and representative GUI using a touch panel display of the intelligent
monitoring system is detailed in a step-by-step flow.
Thus, a new and improved intelligent video surveillance system is illustrated and
described. The improved intelligent video surveillance system is highly adaptable,
can be used in a large variety of applications, and can be conveniently adapted to a
variety of customer-specific requirements. Also, the intelligent video surveillance
system is automated, intelligent, and requires minimal or no human intervention.

Various changes and modifications to the embodiment herein chosen for purposes of
illustration will readily occur to those skilled in the art. To the extent that such
modifications and variations do not depart from the spirit of the invention, they are
intended to be included within the scope thereof.
The Analytics Engine
Various rule sets for inferencing the dynamics of the data (interpretation of Events)
are defined inherently in the system or they can be defined by the users. An Analytics
engine detects various activities in the video or other sensory data stream and on
detection of said activities conforming to one or more Events, sends notification
messages with relevant details to the recipients. The recipients can be the VMI, the
central VMS or Surveillance Clients or any other registered devices. To perform the
above tasks, the scene is analyzed and the type of analysis depends on the type of
events to be detected.
The data flow within the Analytics Engine for a single channel, taking a video stream
as the channel data, is shown schematically in Fig. 5. The functionalities of the various
internal modules of the Analytics Engine and other components are described below,
taking a video channel as an example of a sensory data source.
(A) Scene Analyzer (501): The Scene Analyzer is the primary module of the
Analytics engine and of the IVAS as well. Depending on the Events to be
detected, various techniques have been developed to analyze the video and sensory
data content and extract the objects of interest in the scene or in the multi-sensory
acquired data. Importantly, the Scene Analyzer is adapted to analyze the content of
the media (e.g., video) based on an intelligent scene-adaptive colour-coherent object
analysis framework and method. It has been implemented so that it adapts to the
available computational bandwidth and memory, and the processing steps are
dynamically reconfigured. For example, as described in further detail hereunder, the
Analytics engine automatically makes a trade-off to strike a balance between the
accuracy of face capture and the CPU clock cycles available for processing.
The Scene Analyzer generates metadata for each frame supplied to it for
analysis. It also computes the complexity of the scene using a novel technique and
dynamically reconfigures the processing steps in order to achieve the best analysis
result possible given the computational and other resources available for on-line,
real-time detection of events and follow-up actions. It feeds the metadata along with
the scene complexity measure to the Controller, so that the Controller can decide the
optimal rate at which the frames of that particular video channel should be sent to
the Analytics engine for processing. This technique is unique and saves computational
and memory bandwidth for decoding and analysis of the video frames.
(B) Rule Engine (502): The Rule Engine keeps a history of the metadata and correlates
the data across multiple frames to decide behavioural patterns of the objects in the
scene. Based on the rules, various applications can be defined. For example, it is
possible to detect whether a person is jumping a fence, whether a crowd is
forming, or whether a vehicle is exceeding the speed limit.
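The correlation of metadata across frames can be illustrated with a minimal Python sketch of the speed-limit example. It assumes a hypothetical per-frame metadata record (object id, timestamp in seconds, ground position in metres); the names `RuleEngine`, `add`, `speed` and `speeding_objects` are illustrative and are not taken from the system described here.

```python
from collections import defaultdict, deque
from math import hypot


class RuleEngine:
    """Keeps a short metadata history per object and applies a speed rule."""

    def __init__(self, history=30):
        # object id -> most recent (time_s, x_m, y_m) observations
        self.tracks = defaultdict(lambda: deque(maxlen=history))

    def add(self, obj_id, time_s, x_m, y_m):
        """Record one frame's metadata for an object."""
        self.tracks[obj_id].append((time_s, x_m, y_m))

    def speed(self, obj_id):
        """Average speed in m/s between the oldest and newest observation."""
        track = self.tracks.get(obj_id)
        if not track or len(track) < 2:
            return 0.0
        (t0, x0, y0), (t1, x1, y1) = track[0], track[-1]
        if t1 <= t0:
            return 0.0
        return hypot(x1 - x0, y1 - y0) / (t1 - t0)

    def speeding_objects(self, limit_mps):
        """Ids of all objects whose average speed exceeds the limit."""
        return sorted(o for o in self.tracks if self.speed(o) > limit_mps)
```

A fence-jumping or crowd-formation rule would follow the same pattern: keep a bounded history per object or per zone and evaluate a predicate over it.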
(C) Event Decider (503): The behavioural patterns detected by the Rule Engine are
analyzed by this module to detect various events in parallel. The Events can be
inherently defined or configured by the user. For example, crowd formation in one
specific zone while other areas are not crowded may be defined to be an Event. Once
an Event is detected, a message is generated describing the type of Event, the time of
occurrence of the Event, the location of occurrence of the Event, the video clip URL, etc.
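As a rough illustration of the zone-specific crowd example, the following Python sketch builds an event message carrying the four fields named above. The `EventMessage` type, the `decide_crowd_event` function and the per-zone head counts are assumptions made for this sketch; the source does not specify the message format.

```python
from dataclasses import dataclass


@dataclass
class EventMessage:
    event_type: str   # type of Event
    time_s: float     # time of occurrence
    location: str     # zone where the Event occurred
    clip_url: str     # URL of the recorded Event clip


def decide_crowd_event(zone_counts, zone, threshold, time_s, clip_url):
    """Return an EventMessage if `zone` is the only crowded zone, else None.

    zone_counts maps zone name -> number of people currently detected there.
    """
    crowded = {z for z, n in zone_counts.items() if n >= threshold}
    if crowded == {zone}:
        return EventMessage("crowd_formation", time_s, zone, clip_url)
    return None
```

For instance, `decide_crowd_event({"gate": 12, "lobby": 2}, "gate", 10, 5.0, url)` yields a message for the gate, whereas the same call with a crowded lobby yields `None`.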
The Event decider can also control any external device including a PTZ camera
controller which can focus a region where the event has taken place for better
viewing of the activities around that region or recording the scene in a close up view.
One such advanced framework is detailed hereunder as enhanced object tracking
where the utility of an Object tracking system is enhanced using a novel technique
using a PTZ camera along with the Object tracking system.
The Analytics Engine Controller
A Controller module (602), as shown in Figure 6, has been designed to receive
multiple video channels, possibly in some compressed form (e.g., MJPEG, Motion
JPEG2000, MPEG, H.264, etc. for video, and a relevant format for other sensory data,
such as, but not limited to, MP4 for audio), and to feed the decoded video
frames to the Analytics engine. The Controller uses an advanced technique to decide
the rate of decoding of the frames and feeds the decoded video frames of multiple
channels to the Analytics engine in an optimal way, so that the number of frames
sent per second for each video channel is individually and automatically controlled
depending on the requirement of the Analytics engine and also on the computational
bandwidth available in the system at any point of time. The technique has been
described in detail in relation to video content driven resource allocation for analytical
processing.
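One simple way to picture per-channel rate control is to split a total frame budget across channels in proportion to the scene complexity reported for each. The sketch below is an assumption made for illustration, not the allocation technique of the invention; `allocate_frame_rates`, `budget_fps`, `min_fps` and `max_fps` are hypothetical names.

```python
def allocate_frame_rates(complexity, budget_fps, min_fps=1.0, max_fps=25.0):
    """Split a total frame budget across channels in proportion to complexity.

    complexity maps channel name -> non-negative scene complexity measure.
    budget_fps is the total number of frames per second the analytics
    engine can currently process. Each rate is clamped to [min_fps, max_fps],
    so the clamped rates may not sum exactly to the budget.
    """
    if not complexity:
        return {}
    total = sum(complexity.values())
    rates = {}
    for channel, c in complexity.items():
        # Fall back to an equal split when every channel reports zero complexity.
        share = c / total if total > 0 else 1.0 / len(complexity)
        rates[channel] = min(max_fps, max(min_fps, share * budget_fps))
    return rates
```

With a budget of 20 fps, a channel three times as complex as another would receive 15 fps against 5 fps under this scheme.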
The Controller also streams the video along with all the Video Analytics data (existing
configuration for Events, Event Information, video clip URL etc), either as individual
streams for each channel, or as a joined single stream of video data for all or user
requested channels. A novel technique for joining the video channels and
transmitting the resulting combined single channel over IP network has been
deployed to adapt to varying and low bandwidth network connectivity. The technique
is described in detail in relation to video channel join-split mechanism for low
bandwidth communications.
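The join-split mechanism itself is described elsewhere and is not reproduced in this section. Purely as an illustration of the idea of carrying several channels in one stream, the Python sketch below tiles equally sized grey-scale frames into a single mosaic and splits the mosaic back. Spatial tiling, and the names `join_frames` and `split_frames`, are assumptions of this sketch.

```python
def join_frames(frames, cols):
    """Tile equally sized frames (lists of pixel rows) into one mosaic.

    Frames are placed row-major, `cols` tiles per mosaic row; unused tile
    positions are filled with zeros.
    """
    h, w = len(frames[0]), len(frames[0][0])
    rows = -(-len(frames) // cols)  # ceiling division
    blank = [[0] * w for _ in range(h)]
    padded = list(frames) + [blank] * (rows * cols - len(frames))
    mosaic = []
    for r in range(rows):
        tiles = padded[r * cols:(r + 1) * cols]
        for y in range(h):
            line = []
            for tile in tiles:
                line.extend(tile[y])
            mosaic.append(line)
    return mosaic


def split_frames(mosaic, count, cols, h, w):
    """Recover the first `count` frames of size h x w from a mosaic."""
    frames = []
    for i in range(count):
        r, c = divmod(i, cols)
        frames.append([row[c * w:(c + 1) * w]
                       for row in mosaic[r * h:(r + 1) * h]])
    return frames
```

In a real deployment the mosaic would be encoded once and the receiver would split the decoded picture, so a single encoder and a single network stream serve all requested channels.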
The Controller can generate Events on its own in cases where Events can be
generated without the help of the Video Analytics engine (e.g., loss of video, camera
tampering as triggered by the camera itself, motion detection as reported by the
camera itself, and so on).
The implementation of Video Management Interface (VMI)
The Video Management Interface (702), shown in Figure 7, interfaces between
an individual Autonomous System and the rest of the world. It also acts as the
coordinator among various other components within a single Autonomous System,
viz., Video Recording System (703), Intelligent Video Analytical Server (704),
Surveillance Clients (701), Remote Event Receiver (705), etc. It essentially
implements the functionalities including:
1. Filtering and need based transmission of data: Distribution of whole or part
of the collected sensory data, including the video and other sensory data
segments generated as a result of detection of an Event by the Analytical
engine, at the right recipient at the right point of time automatically or on
user interaction.
2. Directed distribution of Alerts: Distributing Event information in various
digital forms (SMS, MMS, emails, Audio alerts, animation video, Text,
illustrations, etc. but not limited to) with or without received data segments
(viz, video clips) to the right recipient at the right point of time
automatically or on user interaction.

3. Providing a common gateway for heterogeneous entities: Providing a unified
gateway for users to access the rest of the system for configuration,
management and monitoring of system components.
The Interface operating principle involved in the system is discussed hereunder:
1. Auto register itself to the IVMS system so that other components like
Surveillance Clients (including Web Clients and Mobile Clients), Remote Event
Receivers, can find and connect it even when its IP-address changes;
2. Accept request from Surveillance clients to add and delete data sources like
cameras to the VRSes and IVASes and relay the same to the corresponding
VRSes and IVASes.
3. Receive configuration data from the Surveillance clients and feed them to the
intended components (viz, VRS, IVAS, DBMS, Camera etc) of the system. For
VRS, the configuration data includes Recording parameters, Database paths,
Retention period of recording, etc. For IVAS, it is the Event and Application
settings, Event clip prologue-, after event- and lifetime-duration, etc.
4. Receive Event information from IVAS on-line and transmit it to various
recipients including Remote Event Receivers. Fetch outstanding Event clips, if
any, from IVAS. Clips may remain outstanding inside IVAS if
there was a temporary network connectivity failure to IVAS.
5. Periodically receive heartbeat signals along with status information from all
the active devices, and relay that to other devices in the same or in other
networks.
6. Serve the Web clients and Mobile embedded clients by streaming Live video,
Recorded Video or Event Alerts at the right time.
7. Join multiple channel video into a single combined stream to adapt to variable
and low bandwidth network. A novel technique for joining the video channels
and transmitting the resulting combined single channel over IP network has
been deployed to adapt to varying and low bandwidth network connectivity.
The technique is described in relation to video channel join -split mechanism
for low bandwidth communication.

8. Enable the user to search for the recorded video and the Event clips based on
various criteria, including Date, Time, Event types, Video Channels.
9. Enable the user to perform a user-interactive Smart search to filter out
the desired segment of video from the video database.
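Item 5 of the list above (periodic heartbeat signals with status information) can be sketched in a few lines of Python. The `HeartbeatMonitor` class, its method names and the timeout policy are assumptions of this sketch; the source does not describe how heartbeats are tracked.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat and status reported by each device."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # device id -> (time_s, status)

    def beat(self, device_id, now_s, status="ok"):
        """Record a heartbeat received from a device."""
        self.last_seen[device_id] = (now_s, status)

    def inactive(self, now_s):
        """Devices whose last heartbeat is older than the timeout."""
        return sorted(d for d, (t, _) in self.last_seen.items()
                      if now_s - t > self.timeout_s)

    def snapshot(self):
        """Status table that could be relayed to other devices."""
        return {d: status for d, (_, status) in self.last_seen.items()}
```

A VMI built along these lines would call `beat` whenever a VRS, IVAS or camera reports in, and relay `snapshot()` to the other devices at regular intervals.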
In essence, once the Interface (702) is installed, the VRS (703), IVAS (704) and
other components of the system can be configured, and the user can connect to the
System. However, at run time all the VRS and IVAS can operate on their own, and do
not require any service from the VMI, unless and otherwise some System
configuration data has been changed.
The independence of the servers from any central controller for their routine
operation gives unprecedented scalability with respect to an increase in the number
of servers. This is because adding a server does not add any extra load to any
component other than the server itself. This is a unique advancement where the Video
Management Server Interface acts only as a unified gateway to the services being
executed in other hardware devices, only for configuration and status updating tasks.
This opens up the possibility of keeping the user interface software unchanged while
integrating new types of devices. The devices themselves can supply their configuration
pages when the VMI connects to them for configuration. Similarly, the messages
generated by the servers can also be shown in the VMI panel seamlessly.
The Video Management Client(701), Web client(707), Mobile device embedded
client(708)
All the above client modules in essence implement the functionalities including:
Providing Live view or recorded view of the data stream: Enabling user to view
camera captured video in different matrix layouts, view other sensory data in a
presentable form, recorded video and other data search and replay, Event clips
search and replay, providing easy navigation across camera views with help of
sitemaps, PTZ control, and configuring the system as per intended use.
The VMS system can be accessed through the standalone surveillance client or any
standard Internet browser can be used to access the system. Handheld devices like
Android enabled cell phone or tablet PCs can also be used as a Client to the system
for the purposes (wholly or partially) as mentioned above.
The Remote Event receiver (705)

The RER (705), shown in Figure 7, is a software module that can be integrated into any
other module of the IVMS. The Remote Event Receiver is meant to receive and
display messages and ALERTs from other components, which are multicast or
broadcast. Those messages include Event ALERTs, ERROR status from VRS or
IVAS, operator-generated messages, etc. The messages can be in video as well as
audio form, or any other form transmitted by the Video management system
components, and the resulting response from the RER depends on the capability
and configuration of the hardware where the RER is installed. When integrated with
the Surveillance clients (IVMC), the IVMC can switch to RER mode and will then
respond to ALERTs and messages only.
The Central VMS system
The Central VMS System (204 in Figure 2) is adapted to serve as a gateway to the
components of any Autonomous System (210-01...210-0n). It also stores the
configuration data for all ASes in its centralized database. VMS systems that otherwise
run independently can be integrated into a single unified system by including the
Central VMS in a server and configuring it accordingly.
The Sitemap Server
A Sitemap server is included within each Autonomous System (210-01...210-0n) and
also within the Centralized VMS(204 in Figure 2). The Sitemap server listens to
requests from any authorized components of the System and responds with positional
data corresponding to any component (Camera, server, user etc.) which is linked to
the Site map. The Site map is multilayered and components can be linked to any
spatial position of the map in any layer.
The above describes the framework, architecture and system level components of the
intelligent system of the invention. The technology involved in the development of
the system can be used to integrate various other types of components not shown or
discussed above. For example, an Access Control System or a Fire Detection
System can be integrated in the same way as a VRS or IVAS, configured using IVMC and
VMI, and their responses or messages can be received, displayed and responded
to by IVMC or RER, stored as is done for Event clips or Video segments, and searched
on various criteria.
The system of the invention detailed above is also versatile enough to interface with
and scale to many other management systems, such as the intelligent
automated traffic enforcement system discussed in later sections.

Reference is now invited to accompanying Figures 8 and 9 to discuss the modified,
computationally efficient technique for Haar feature based face capture
according to the present invention.
More specifically, Figure 8 shows a traditional method of face detection as a
flowchart, by way of components/features/stages 1701 to 1706, while
Figure 9 illustrates face detection in accordance with the present invention by way
of components/features/stages 1801 to 1809.
What is disclosed is an efficient technique to find regions in a video in which to
capture faces of people in motion, limiting the search space using a motion detection
technique and controlling the computational requirement based on the desired
accuracy of capturing faces. This technique can be used to capture faces from
real-time video, where the accuracy of the operation can be controlled depending on
the computational bandwidth available in the system.
Extraction of particular types of objects (e.g., but not limited to, the face of a person) in
images based on fiducial points is a known technique. However, the computational
requirement is often too high for the traditional classifiers used for this purpose in the
prior art, e.g., the Haar classifier. A novel method is proposed to enhance the efficiency
of extracting face regions from a sequence of video frames. Also, depending on the
availability of computational bandwidth, the number of iterations and the pixel shifts
required in the proposed technique are controlled with the help of a lookup table. This
helps in striking a balance between the computational requirement and the accuracy
of face detection. In a multi-channel, multiple analysis process system, this novel
technique can be used as a cooperative process coexisting with other
compute-intensive processes. In the proposed technique, the search space is reduced by
considering the motion vector and sliding the window only in the blob regions where
motion is detected. First, the average time t to analyze an image on the host machine is
calculated, and for subsequent frames the pixel shifts and number of iterations are
calculated based on two lookup tables, to suit the computational bandwidth.
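The effect of restricting the sliding window to regions with motion can be seen in a small Python sketch. It assumes simple frame differencing on grey-scale frames and, for brevity, a single bounding box around all changed pixels rather than separate blobs; `motion_box` and `window_positions` are hypothetical names, not part of the disclosed method.

```python
def motion_box(prev, curr, threshold):
    """Bounding box (x0, y0, x1, y1) of pixels that changed between frames.

    Frames are lists of rows of grey values; x1 and y1 are exclusive.
    Returns None when no pixel differs by more than `threshold`.
    """
    xs, ys = [], []
    for y, (row_p, row_c) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_p, row_c)):
            if abs(c - p) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def window_positions(box, win_w, win_h, pixel_shift):
    """Top-left corners of all windows that fit inside `box`."""
    x0, y0, x1, y1 = box
    return [(x, y)
            for y in range(y0, y1 - win_h + 1, pixel_shift)
            for x in range(x0, x1 - win_w + 1, pixel_shift)]
```

Calling `window_positions` with the motion box instead of the full frame rectangle directly reduces the number of windows the classifier has to evaluate, and a larger `pixel_shift` reduces it further.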
To increase the accuracy, a second pass upon the probable face regions detected by
first pass is performed. This concept of increasing the accuracy of data analysis
automatically depending on available computational bandwidth is novel and unique.
Traditionally, faces are located in a still image using a Haar feature based classifier.
Inherently, some non-face regions are also wrongly classified as faces. The computational
requirement is also very high due to the excessive number of convolution operations.
This is unacceptable in a real-time surveillance scenario. Viola et al. [1] have
introduced a rapid object detection scheme based on a boosted cascade of simple
features to achieve high frame rates working only with the information present in a
single grey scale image using Integral Matrices. Operating on 384 by 288 pixel
images, it is able to detect faces at 15 fps on a conventional 700 MHz Intel Pentium
III.
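The Integral Matrices mentioned above are commonly called integral images or summed-area tables: once the table is built, the sum of any rectangle costs four lookups, which is what makes Haar-like features cheap to evaluate. The following Python sketch shows the idea on a grey-scale image stored as a list of rows; the function names are illustrative.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left corner (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

Because `rect_sum` takes constant time regardless of rectangle size, the same table serves windows at every scale.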
R. Lienhart [2] introduced a novel set of rotated Haar-like features, which
significantly enriches the basic set of simple Haar-like features and gives on average a
10% lower false alarm rate. This extended feature set, however, increases the overall
computational requirement. In some other face detection systems, auxiliary
information, such as image differences in video sequences or pixel color in color
images, has been used to decrease computation time. But even after applying all these
techniques together, the system cannot process more than 10-15 frames per second
for a 384 by 288 pixel video on a 2.0 GHz Intel Core 2 Duo processor based system.
Increasing the video size decreases the fps sharply. A 384 by 288 pixel image
size is not effective in a real-time surveillance system for proper detection and
subsequent processing using these faces, e.g., recognition and other biometric
applications. With the advent of megapixel cameras, wide areas can be covered with
prominent, higher-resolution face capture, so that the faces can be used effectively for
the subsequent applications explained above. However, the computational requirement
with traditional technology increases so much that it becomes prohibitive for on-line
applications such as surveillance and on-line criminal detection systems in a smart
city.
An advanced technique is proposed in this disclosure so that the search space is
significantly reduced by considering the motion vectors of the moving objects only and
applying the proposed novel algorithm only in the regions represented by these motion
vectors. This reduced computation makes it possible to process higher-resolution video
imagery, advancing face detection systems in today's era of growing
demand for higher-resolution surveillance cameras. Also, several parameters can
be dynamically adjusted so that detection and capture of faces of people in motion can
be done with varying accuracy depending upon the computational bandwidth
available at any point.
Before discussing the advanced technique of the invention in detail, the
traditional method of face detection is reviewed using the flowchart in Figure 8.
Limitations of the traditional approach:

1. As the above algorithm is a multi-scale convolution-based face detection
algorithm, it takes a very long time to process a single frame. In a real-time situation
it is very difficult to fit within the machine's computational bandwidth.
2. Even at the cost of very high computation, it reports many non-face
regions as face regions, as it processes a rectangular image bounding the
presumed face region (where some background portions are present along with
motion areas).
3. Because of the inefficient nature of today's algorithm, these
bounding rectangular regions are often too large, with a very small percentage of
pixels showing actual motion. The execution time increases steeply with the input
image size.
The proposed advanced technique of the present invention:
The present invention advances and enhances the technology by
incorporating advanced features as follows, in order to accomplish an effective face
capture and detection system with higher resolution imagery and reduced
computation requirement. The proposed technique of the invention is explained in
Flowchart F-2 shown in accompanying Figure 9.
Importantly, the proposed concept is not limited to Haar features; however, for
illustration herein the Haar feature has been used to explain the advancement. The
estimation of several parameters in the above flowchart, such as the temporal
estimation "t" and the prediction of the possible number of iterations "nIteration", is
novel and described below.
Let the time taken to process a single window area (fixed window size m x n) with
the Haar feature set be t.
Then, the time taken to detect faces for an image of size M x N is
$$T_{MN} \approx t \cdot \frac{(M - m)(N - n)}{\mathrm{pixelShift}^2}$$
where pixelShift is the window shift size.
For multi-scale processing, ScaleFactor = f(M, N, m, n, nIteration).
The total time taken to detect faces is
$$T = \sum_{i} T_{M'N'}$$
summed over the nIteration scales i, where
$$M' = M / \mathrm{ScaleFactor}^{i}, \qquad N' = N / \mathrm{ScaleFactor}^{i}$$
So, T = f(M, N, t, pixelShift, nIteration), for a fixed size window.
Calculate the average t on the host machine and tune the parameters pixelShift and
nIteration accordingly, using the lookup tables T-1 and T-2, to suit the bandwidth.
To increase the accuracy, enable a second pass upon the probable face regions
detected by first pass.
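The timing model above can be turned into a small parameter-selection sketch in Python. Several parts are assumptions made for illustration: the lookup tables are not reproduced in this text, so `table` is a hypothetical list of (pixelShift, nIteration) pairs ordered from most to least accurate; `scale_factor` is one plausible form of f(M, N, m, n, nIteration), shrinking the image to the window size in equal steps; and `budget_s` is a hypothetical per-frame time budget.

```python
def window_time(t, M, N, m, n, pixel_shift):
    """Single-scale estimate: t * (M - m) * (N - n) / pixelShift^2."""
    if M < m or N < n:
        return 0.0  # the window no longer fits in the image
    return t * (M - m) * (N - n) / (pixel_shift * pixel_shift)


def scale_factor(M, N, m, n, n_iteration):
    """Assumed f(M, N, m, n, nIteration): reach window size in equal steps."""
    return min(M / m, N / n) ** (1.0 / n_iteration)


def total_time(t, M, N, m, n, pixel_shift, n_iteration):
    """Sum of the single-scale estimates over n_iteration scaled images."""
    s = scale_factor(M, N, m, n, n_iteration)
    return sum(window_time(t, M / s ** i, N / s ** i, m, n, pixel_shift)
               for i in range(n_iteration))


def choose_parameters(t, M, N, m, n, budget_s, table):
    """First (pixel_shift, n_iteration) in `table` that fits the budget.

    Falls back to the last (cheapest) entry when nothing fits.
    """
    for pixel_shift, n_iteration in table:
        if total_time(t, M, N, m, n, pixel_shift, n_iteration) <= budget_s:
            return pixel_shift, n_iteration
    return table[-1]
```

Here `t` would be the running average measured on the host machine, so the chosen parameters become coarser when the machine is loaded and finer when spare bandwidth is available.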
Lookup Table T-1:

It is thus possible by way of the method according to the present invention to achieve
advancement and enhanced technology by incorporating advanced features as
detailed hereinbefore in order to accomplish effective face capture and detection
system with higher resolution imagery with reduced computation requirement.

We Claim:
1. A method of face detection in video images and the like comprising the step
of limiting the search space involving motion detection technique and
controlled computational requirements based on desired accuracy by carrying
out prediction of number of iterations and temporal parameter "t".
2. A method of face detection in video images as claimed in claim 1 comprising the
steps of:
i) involving the grey image of cropped motion rectangular area from current
frame to calculate said temporal parameter "t" and updating "t" with
history and calculating possible number of iterations "nIterations";
ii) calculating scale factor, no. of iterations and other parameters from look
up table;
iii) using convolution on different scaled images to get probable face
rectangles;
iv) grouping the probable faces with spatial information; and
v) obtaining therefrom the confirmed faces.
3. A method of face detection in video images as claimed in any one of the preceding
claims comprising using the convolution on probable face regions with Haar
feature set to confirm faces and publishing the confirmed faces based thereon.
4. A method of face detection in video images as claimed in any one of the preceding
claims comprising the step of carrying out said temporal estimation "t" and prediction
of possible number of iterations "nIterations" as follows:
i. Generating time taken to detect face for Image with size MxN based
on
$$T_{MN} = t \cdot \frac{(M - m)(N - n)}{\mathrm{pixelShift}^2}$$
where pixelShift is the window shift size and the time taken to
process a single window area (fixed window size mxn) with standard
feature set = t.
ii. For multi-scale processing ScaleFactor = f(M, N, m, n, nIteration)
iii. Total time taken to detect faces,
$$T = \sum_{i} T_{M'N'}$$
where
$$M' = M / \mathrm{ScaleFactor}^{i}, \qquad N' = N / \mathrm{ScaleFactor}^{i}$$
iv. T = f(M, N, t, pixelShift, nIteration), for a fixed size window.
v. Calculating average t in host machine and tuning the parameters
pixelShift, nIteration accordingly using generated lookup table to
suit the bandwidth; and
vi. Optionally, to increase the accuracy, enabling a second pass upon the
probable face regions detected by first pass.

ABSTRACT
The present invention is directed to an efficient method to find regions in a video to
capture faces of people in motion, limiting the search space using motion detection
technique, control the computational requirement based on desired accuracy of
capturing faces. The invention can be used to capture faces from real time video
where the accuracy of the operation can be further controlled depending on the
computational bandwidth available in the system.

Documents

Application Documents

#	Name	Date
1	263-Kol-2012-(12-03-2012)SPECIFICATION.pdf	2012-03-12
2	263-Kol-2012-(12-03-2012)FORM-3.pdf	2012-03-12
3	263-Kol-2012-(12-03-2012)FORM-2.pdf	2012-03-12
4	263-Kol-2012-(12-03-2012)FORM-1.pdf	2012-03-12
5	263-Kol-2012-(12-03-2012)DRAWINGS.pdf	2012-03-12
6	263-Kol-2012-(12-03-2012)DESCRIPTION (COMPLETE).pdf	2012-03-12
7	263-Kol-2012-(12-03-2012)CORRESPONDENCE.pdf	2012-03-12
8	263-Kol-2012-(12-03-2012)CLAIMS.pdf	2012-03-12
9	263-Kol-2012-(12-03-2012)ABSTRACT.pdf	2012-03-12
10	263-KOL-2012-(30-03-2012)-FORM-1.pdf	2012-03-30
11	263-KOL-2012-(30-03-2012)-CORRESPONDENCE.pdf	2012-03-30
12	263-KOL-2012-(09-04-2012)-PA.pdf	2012-04-09
13	263-KOL-2012-(09-04-2012)-CORRESPONDENCE.pdf	2012-04-09
14	263-KOL-2012-FORM-18.pdf	2012-09-04
15	263-KOL-2012-FER.pdf	2018-09-11
16	263-KOL-2012-OTHERS [01-03-2019(online)].pdf	2019-03-01
17	263-KOL-2012-FER_SER_REPLY [01-03-2019(online)].pdf	2019-03-01
18	263-KOL-2012-COMPLETE SPECIFICATION [01-03-2019(online)].pdf	2019-03-01
19	263-KOL-2012-CLAIMS [01-03-2019(online)].pdf	2019-03-01
20	263-KOL-2012-ABSTRACT [01-03-2019(online)].pdf	2019-03-01
21	263-KOL-2012-HearingNoticeLetter-(DateOfHearing-02-03-2020).pdf	2020-02-20
22	263-KOL-2012-Correspondence to notify the Controller [27-02-2020(online)].pdf	2020-02-27
23	263-KOL-2012-FORM-26 [28-02-2020(online)].pdf	2020-02-28
24	263-KOL-2012-Written submissions and relevant documents [16-03-2020(online)].pdf	2020-03-16
25	263-KOL-2012-PETITION UNDER RULE 137 [16-03-2020(online)].pdf	2020-03-16
26	263-KOL-2012-PatentCertificate17-03-2020.pdf	2020-03-17
27	263-KOL-2012-IntimationOfGrant17-03-2020.pdf	2020-03-17
28	263-KOL-2012-RELEVANT DOCUMENTS [25-09-2021(online)].pdf	2021-09-25
29	263-KOL-2012-RELEVANT DOCUMENTS [30-09-2022(online)].pdf	2022-09-30
30	263-KOL-2012-RELEVANT DOCUMENTS [24-07-2023(online)].pdf	2023-07-24

Search Strategy

1	search(74)_11-09-2018.pdf