Abstract: This disclosure relates generally to a method and system for real-time gaze tracking in real-world scenarios using an RGB camera. Existing eye trackers are expensive, and their usage is limited to lab settings. Off-the-shelf RGB cameras used for eye tracking face challenges related to varying illumination, apart from head pose and occlusions, while IR-based eye trackers, which are agnostic to illumination conditions, are expensive and cannot be used for mass deployments. The disclosed gaze tracking system can effectively track a person’s gaze in real-world scenarios using a low-resolution RGB camera by fusion of a plurality of glint-based features and a plurality of appearance-based features.
DESC:FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR REAL-TIME GAZE TRACKING IN REAL-WORLD SCENARIOS USING RED-GREEN-BLUE (RGB) CAMERA
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
The present application claims priority from Indian provisional patent application no. 202321000651, filed on January 04, 2023. The entire contents of the aforementioned application are incorporated herein by reference.
TECHNICAL FIELD
The disclosure herein generally relates to the field of eye gaze tracking, and, more particularly, to a method and system for real-time gaze tracking in real-world scenarios using a Red-Green-Blue (RGB) camera.
BACKGROUND
Eye trackers or gaze trackers are very important in studies related to human behavior. Good illumination conditions and handling of head pose and occlusions are of paramount importance for gaze trackers to work well in real-time scenarios. The gaze trackers that are agnostic to illumination conditions are primarily Infrared (IR)-based cameras. Such IR-based eye trackers are not easily available, and either separate head-mounted devices or chin rests are required during use, making them non-conducive for enterprise use cases. Further, the IR-based eye trackers available today are very expensive, cannot be used for mass deployments, and their usage is limited to lab settings. A Red-Green-Blue (RGB) camera can be used as an alternative. However, RGB cameras have a technical limitation in that their lower resolution leads to poor image quality; thus, building a robust gaze tracker with RGB cameras is challenging. Off-the-shelf RGB cameras used for eye tracking also face challenges related to varying illumination, apart from head pose and occlusions.
SUMMARY
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a method for real-time gaze tracking in real-world scenarios using an RGB camera is provided. The method includes receiving, via a low-resolution RGB camera, a plurality of RGB image frames comprising a subject sequentially gazing at a plurality of grid cells on a display screen in accordance with a stimulus generated using a stimulus design technique. Facial landmarks comprising eye key point coordinates and eyeball key point coordinates are identified from the plurality of RGB image frames. Further, the method identifies an eye region, representing an RGB eye image, from the eye key point coordinates. Then the RGB eye image corresponding to the plurality of RGB image frames is normalized. Further, a plurality of glint-based features and a plurality of appearance-based features are extracted from the normalized RGB eye image. A gaze estimation regression model is trained for estimating a gaze point on the display screen by fusion of the plurality of glint-based features and the plurality of appearance-based features. The trained gaze estimation regression model, during an inferencing stage, estimates a gaze of a subject of interest using fusion of the plurality of glint-based features and the plurality of appearance-based features.
In another aspect, a system for real-time gaze tracking in real-world scenarios using an RGB camera is provided. The system includes: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: receive, via an RGB camera, the plurality of RGB image frames comprising a subject sequentially gazing at a plurality of grid cells on a display screen in accordance with a stimulus generated using a stimulus design technique. The one or more hardware processors are configured to identify facial landmarks comprising eye key point coordinates and eyeball key point coordinates from the plurality of RGB image frames. Further, the one or more hardware processors are configured to identify an eye region, representing an RGB eye image, from the eye key point coordinates. Then the one or more hardware processors are configured to normalize the RGB eye image corresponding to the plurality of RGB image frames. Further, the one or more hardware processors are configured to extract a plurality of glint-based features and a plurality of appearance-based features from the normalized RGB eye image. Finally, the one or more hardware processors are configured to train a gaze estimation regression model, for estimating a gaze point on the display screen, by fusion of the plurality of glint-based features and the plurality of appearance-based features.
In yet another aspect, there are provided one or more non-transitory machine-readable information storage mediums comprising one or more instructions which, when executed by one or more hardware processors, cause a method for real-time gaze tracking in real-world scenarios using an RGB camera to be performed. The method includes receiving, via a low-resolution RGB camera, a plurality of RGB image frames comprising a subject sequentially gazing at a plurality of grid cells on a display screen in accordance with a stimulus generated using a stimulus design technique. Facial landmarks comprising eye key point coordinates and eyeball key point coordinates are identified from the plurality of RGB image frames. Further, the method identifies an eye region, representing an RGB eye image, from the eye key point coordinates. Then the RGB eye image corresponding to the plurality of RGB image frames is normalized. Further, a plurality of glint-based features and a plurality of appearance-based features are extracted from the normalized RGB eye image. A gaze estimation regression model is trained for estimating a gaze point on the display screen by fusion of the plurality of glint-based features and the plurality of appearance-based features. The trained gaze estimation regression model, during an inferencing stage, estimates a gaze of a subject of interest using fusion of the plurality of glint-based features and the plurality of appearance-based features.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
FIG. 1 illustrates an exemplary system for real-time gaze tracking in real-world scenarios using RGB camera, in accordance with some embodiments of the present disclosure.
FIG. 2 depicts overview of architecture diagram of the system for the real-time gaze tracking in the real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure.
FIG. 3 with reference to FIG. 1 and FIG. 2 depicts flow diagrams illustrating a method for the real-time gaze tracking in the real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure.
FIG. 4 depicts a stimulus for calibration as implemented by systems of FIG. 1 and FIG. 2, in accordance with some embodiments of the present disclosure.
FIGS. 5A, 5B, 5C, and 5D depict an illustration of simulated transformed RGB eye images corresponding to glint enhanced RGB eye image and appearance enhanced RGB eye image, in accordance with some embodiments of the present disclosure.
FIG. 6 depicts a data collection setup for the real-time gaze tracking in the real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure.
FIGS. 7A, 7B, and 7C depict distribution of head poses for datasets, for testing the real-time gaze tracking in the real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure.
FIG. 8 depicts accuracy of 20 gazed points across participants and multiple calibration points for the datasets, in accordance with some embodiments of the present disclosure.
FIGS. 9A, 9B, 9C, 9D, and 9E depict a user interface (UI) for patients for hospital bed assisted living use-case, in accordance with some embodiments of the present disclosure.
FIGS. 10 and 11 depict performance results of the hospital bed assisted living use-case, in accordance with some embodiments of the present disclosure.
FIG. 12 depicts a tele-robotics use-case demonstration of a user controlling a robot, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following embodiments described herein.
In healthcare applications such as powered wheelchairs and powered hospital beds, hand-driven controls such as joysticks or keypads are used for option selection and navigation. Many a time, patients find it difficult to use these owing to fatigue or weak motor control. However, the patients might have good control over their eye gaze. Hence, controlling and navigating using the eye gaze would be an excellent alternative. Similarly, in robotics and human-computer interaction (HCI) applications, if the navigation and control are gaze-based, then an operator can free up his/her hands for other tasks. The ability to estimate a person’s gaze has become a vital technology in healthcare, human behavior studies, rehabilitation, HCI, augmented reality, virtual reality, and the like.
Gaze tracking estimates where a person is looking at a given instant. The position of a target object is determined by observing eye movements. There are numerous gaze tracking solutions available in the literature. However, they face one or more challenges such as custom or invasive hardware, high cost and cumbersome hardware, and inaccuracy under real-world constraints such as poor illumination, occlusions like spectacles, and head pose and movements. All these challenges make them difficult to adopt at large scale in existing practical applications.
Popular devices that capture the eye movement are head-mounted devices, Red-Green-Blue (RGB) cameras, IR cameras, and depth cameras. Even with the progress of gaze tracking, gaze estimation remains challenging due to real-world challenges such as the variance in eye appearance across subjects, head movement, illumination conditions, occlusion due to spectacles, viewing angle, and image quality.
The gaze trackers available today are mainly IR-based gaze trackers that are very expensive and cannot be used for mass deployments. Hence, the RGB camera can be used as an alternative. However, RGB cameras have lower resolution, effectively generating poor-quality images; thus, building a robust gaze tracker is challenging. Varying illumination conditions or ambient light play a very important role in determining the quality of image data. When the light source is placed above or to the side of the head, the shadow cast on the eyes by the forehead, hair, etc., makes it challenging to identify distinguishing features from the image space. The gaze trackers available today mainly have a calibration phase after which they expect no head movements, as adapting a model to new head positions is challenging to ascertain. Hence, usage of a chin rest is often encouraged in gaze tracking experiments. This makes it challenging to use existing gaze trackers in out-of-the-lab scenarios. Building a gaze tracking system devoid of the chin rest is thus of paramount importance. Detecting the eye region in the image space is a main prerequisite for extracting features for gaze tracking. Hence, the eye region should be clearly visible in the image space. However, the presence of spectacles and other occlusions poses a serious issue when reflections appear in them, and gaze tracking becomes a challenging task. To perform gaze tracking in real-time scenarios, a trained model should be lightweight so that point-of-gaze locations on a display screen can be predicted quickly, in a time-efficient manner.
Varying illumination conditions or ambient light play a major role in the performance of gaze trackers. Most of the widely used gaze trackers that use the RGB camera primarily work in good ambient light conditions only. A good and uniform illumination condition is a key requirement for RGB-based gaze trackers to perform well in real-world scenarios. The eye trackers agnostic to varying illumination conditions are primarily IR-based, which are not readily available for practical applications. This creates a need for a gaze tracking system that works solely on the RGB camera under varying illumination conditions, occlusions, and head pose or movements, so that it can make its way into out-of-lab scenarios.
Due to the importance of gaze tracking in human cognition and behavior, vast research has been conducted in this area. Gaze trackers are highly correlated to eye movement, and hence eye image features are very important. Eye tracking approaches can be categorized into model-based approaches, appearance-based approaches, and feature-based approaches. The model-based approaches and the feature-based approaches require dedicated devices, which include the IR camera, the Red-Green-Blue-Depth (RGBD) camera, and stereo cameras, whereas the appearance-based approaches can work using off-the-shelf RGB cameras which can be integrated into personal devices or ambient displays.
Earlier the gaze trackers were based on the feature-based approaches. The feature-based approaches are designed on low-level local eye features such as pupil center, glint, and corneal reflection. These features are dependent on pixel values of images which makes them susceptible to even a slight variation. The model-based approaches use geometrical model of an eye, and the appearance-based approaches directly map high dimensional eye images to low dimensional gaze direction or gaze location using learning approaches.
Since the advent of deep learning and the accessibility of large training datasets, the approach to gaze tracking has shifted to the appearance-based approaches. There are two types of appearance-based approaches: the conventional appearance-based approach and the deep learning appearance-based approach. The conventional appearance-based approach uses a regression function to learn a mapping function and estimates the human gaze from high-dimensional eye image features. These features are dependent on the pixel values of the eye images, which makes them susceptible to even a slight environmental change. The deep learning appearance-based approach, in contrast, directly maps the eye images to gaze using a non-linear complex mapping function to estimate the human gaze. The deep learning appearance-based approach tries to overcome the challenges associated with the gaze trackers.
However, models associated with the deep learning appearance-based approach are highly dependent on large-scale datasets, which are non-trivial to procure and not free from subjective bias. Further, obtaining a large-scale dataset for a specific demography is also a challenge. The existing datasets have subjective bias: none of them, at large, contain human subjects with dark eyeballs in which there is no significant difference in color between the pupil and the eyeball. This is a major difference between images obtained with the IR camera and those from the RGB camera. The IR image shows a clear demarcation between the pupil and the eyeball, whereas the RGB camera does not. Hence, gaze tracking with RGB cameras is difficult. Procuring large-scale annotated datasets without this subjective bias is a non-trivial task.
Also, deep learning models pose the issue of non-interpretability, which makes them susceptible to failure on unseen datasets with huge variations. While the model-based approaches fit a geometric eye model to the eye images, the appearance-based approaches directly regress from the eye images to gaze directions using machine learning approaches. The appearance-based approach requires large data for training, and cumulating large-scale annotated datasets is a challenging task. Further, the regression-based approaches use geometric features like the pupil, the glint, and the corneal reflection to regress to the gaze directions. However, all these approaches require high-end IR cameras or depth cameras; therefore, such gaze trackers cannot be directly used in day-to-day life scenarios.
The key requirement for RGB-based gaze trackers is good and consistent illumination. Unlike IR-based eye trackers, the performance of RGB-based eye trackers suffers under varying illumination conditions. To solve the challenges posed by low-light conditions during gaze tracking, the literature (“Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018.”) proposes Retinex-Net, an encoder-decoder network trained on pairs of low- and normal-light images captured to enhance the brightness of the low-light images. In the literature (“Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. IEEE transactions on pattern analysis and machine intelligence, 41(1):162–175, 2017.”), a classification model has been trained using eye images synthesized under varying illumination conditions. However, these studies have limitations: collecting data across varying illumination conditions, in both low- and good-light settings, is a challenge, and due to the high complexity of the models trained on such datasets, they might not be suitable for real-time applications.
Embodiments herein provide a method and system for real-time gaze tracking in real-world scenarios using the RGB camera. The method uses the eye image features in the RGB image space captured by the low-resolution RGB camera and converts them into a plurality of appearance-based features and a plurality of glint-based features. These features are then regressed to screen coordinates. Using fusion of the plurality of glint-based features along with the plurality of appearance-based features makes the gaze tracking robust enough to be deployed in real-world scenarios comprising varying illumination conditions, occlusions, and head pose or movements.
In the present disclosure, the method uses an off-the-shelf, low-resolution, laptop-integrated RGB camera, enabling mass deployment and easy scaling of the solution. The disclosed gaze tracking system is robust to real-world scenarios by incorporating fusion of the plurality of glint-based features and the plurality of appearance-based features. Further, the disclosed gaze tracking system works well in an unconstrained environment, that is, even in the absence of chin rest support.
Referring now to the drawings, and more particularly to FIG. 1 through FIG. 12, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
FIG. 1 illustrates an exemplary system 100 for the real-time gaze tracking in the real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 may also be referred to as the gaze tracking system or CamTratak. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), and one or more data storage devices or memory 102 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors.
Referring to the components of the system 100, in an embodiment, the processor (s) 104 can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 104 is/are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices (e.g., smartphones, tablet phones, mobile communication devices, and the like), workstations, mainframe computers, servers, a network cloud, and the like.
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 comprises information on a plurality of RGB image frames, an RGB eye image, eye key point coordinates, eyeball key point coordinates, the plurality of glint-based features, the plurality of appearance-based features, and a gaze point. The memory 102 further includes a plurality of modules such as a calibration block, a training block, a feature extraction block, a testing block, and a gaze estimation regression model, as well as modules for various technique(s) such as deep learning-based techniques, calibration, normalization, gray scaling, histogram equalization, and the stimulus design technique.
The above-mentioned technique(s) are implemented as at least one of a logically self-contained part of a software program, a self-contained hardware component, and/or a self-contained hardware component with a logically self-contained part of a software program embedded into each of the hardware components (e.g., hardware processor 104 or memory 102) that when executed perform the method described herein. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
Functions of the components of system 100 are explained in conjunction with the diagrams depicted in FIG. 2 and FIG. 3 for the real-time gaze tracking in the real-world scenarios using the RGB camera. In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method depicted in FIG. 3 by the processor(s) or one or more hardware processors 104. The steps of the method of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1, the architecture diagram of the system for the real-time gaze tracking in the real-world scenarios using the RGB camera in FIG. 2, and the steps of the flow diagram as depicted in FIG. 3. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
FIG. 2 depicts an overview of the architecture diagram of the system for the real-time gaze tracking in the real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure. More specifically, the system 100 in FIG. 2 includes components such as the calibration block that is configured to design the stimulus for the calibration, the training data block that is configured to capture the plurality of RGB image frames by a low-resolution RGB camera, the feature extraction block that is configured to extract the plurality of glint-based features and the plurality of appearance-based features from the plurality of RGB image frames, the testing data block that is configured to capture the plurality of RGB image frames by the low-resolution RGB camera during an inferencing stage, and the gaze estimation regression model that is configured to estimate the gaze point on the display screen by fusion of the plurality of glint-based features and the plurality of appearance-based features. The modules, models, and blocks mentioned are some of the modules among the plurality of modules in the memory 102. The above description of each block/component depicted in FIG. 2 is better understood by way of examples and in conjunction with FIG. 3.
FIG. 3 with reference to FIG. 1 and FIG. 2 depicts flow diagram illustrating the method 300 for the real-time gaze tracking in the real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure. In an embodiment, the system(s) 100 comprises one or more data storage devices or the memory 102 operatively coupled to the one or more hardware processors 104 and is configured to store instructions for execution of steps of the method by the one or more processors 104. The steps of the method of the present disclosure will now be explained with reference to components of the system 100 of FIG. 1, FIG. 2 and the flow diagram illustrated in FIG. 3.
Referring to steps of FIG. 3, at step 302, the one or more hardware processors 104 receive, via the low-resolution RGB camera, the plurality of RGB image frames comprising the subject sequentially gazing at the plurality of grid cells on the display screen in accordance with the stimulus generated using the stimulus design technique.
The stimulus for calibration is generated during the calibration phase using the stimulus design technique, while the subject gazes sequentially at the plurality of grid cells, to obtain the plurality of RGB image frames by:
Dividing the display screen into the plurality of grid cells on a black background.
Sequentially highlighting to the subject each of the plurality of grid cells in gray color, with a white circle having a black dot at the center, for every t seconds.
Obtaining the plurality of RGB image frames of the subject while gazing at the grid cells.
The stimulus is designed by dividing the display screen into the plurality of grid cells, wherein the plurality of grid cells includes M×N grids, where M represents the number of rows and N represents the number of columns of the display screen. FIG. 4 depicts the stimulus for calibration as implemented by the systems of FIG. 1 and FIG. 2, in accordance with some embodiments of the present disclosure.
In FIG. 4 the stimulus for calibration comprises dividing the display screen into 4×5 grid cells, in accordance with some embodiments of the present disclosure. A particular grid cell turns gray in sequence, and the white circle with the black dot at the center is presented over the display screen. The subject is asked to gaze at the black dot. The white circle changes position every 3 seconds and thus spans all the 20 grid cells one after another, following the arrow direction as shown in FIG. 4, in accordance with some embodiments of the present disclosure. The position of the white circle is taken as the ground truth gaze location, and the grid cell at large defines the margin for error. The plurality of RGB image frames captured against the stimulus is subjected to further feature extraction.
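By way of a non-limiting illustration, a minimal Python sketch of such a calibration stimulus is given below, assuming OpenCV for rendering; the screen size, dwell time, colors, and row-by-row traversal order are illustrative assumptions and not prescribed by the disclosure.

import cv2
import numpy as np

# Illustrative screen and grid parameters (assumed values, not mandated by the disclosure)
SCREEN_W, SCREEN_H = 1920, 1080
ROWS, COLS = 4, 5              # M x N grid cells
DWELL_MS = 3000                # each cell is highlighted for roughly 3 seconds

CELL_W, CELL_H = SCREEN_W // COLS, SCREEN_H // ROWS

def show_stimulus():
    """Sequentially highlight each grid cell on a black background."""
    for r in range(ROWS):
        for c in range(COLS):
            frame = np.zeros((SCREEN_H, SCREEN_W, 3), dtype=np.uint8)   # black background
            x0, y0 = c * CELL_W, r * CELL_H
            cv2.rectangle(frame, (x0, y0), (x0 + CELL_W, y0 + CELL_H), (128, 128, 128), -1)  # gray cell
            cx, cy = x0 + CELL_W // 2, y0 + CELL_H // 2                 # ground-truth gaze location
            cv2.circle(frame, (cx, cy), 20, (255, 255, 255), -1)        # white circle
            cv2.circle(frame, (cx, cy), 4, (0, 0, 0), -1)               # black center dot
            cv2.imshow("calibration_stimulus", frame)
            if cv2.waitKey(DWELL_MS) & 0xFF == 27:                      # Esc aborts the stimulus
                return
    cv2.destroyAllWindows()

if __name__ == "__main__":
    show_stimulus()

The RGB image frames captured while each cell is highlighted would then be paired with the corresponding cell center as the ground truth gaze location.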
At step 304 of the present disclosure, the one or more hardware processors 104 identify facial landmarks, comprising eye key point coordinates and eyeball key point coordinates, from the plurality of RGB image frames. The facial landmarks are identified from the plurality of RGB image frames by using deep learning-based techniques, in accordance with some embodiments of the present disclosure.
Upon identifying the facial landmarks, at step 306 of the present disclosure, the one or more hardware processors 104 identify the eye region, representing an RGB eye image, from the eye key point coordinates.
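A minimal sketch of steps 304 and 306 is given below, assuming the MediaPipe Face Mesh model as the deep learning-based technique (the disclosure does not name a specific library; the landmark indices and the padding value are illustrative assumptions).

import cv2
import mediapipe as mp

# Illustrative eye-corner/eyelid landmark indices of the MediaPipe Face Mesh
# (assumed choice, not prescribed by the disclosure).
RIGHT_EYE = [33, 133, 159, 145]
LEFT_EYE = [362, 263, 386, 374]

mp_face_mesh = mp.solutions.face_mesh

def eye_regions(bgr_frame):
    """Return (left_eye_image, right_eye_image) cropped from the eye key point coordinates."""
    h, w = bgr_frame.shape[:2]
    with mp_face_mesh.FaceMesh(static_image_mode=True, refine_landmarks=True,
                               max_num_faces=1) as face_mesh:
        result = face_mesh.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None, None
    lms = result.multi_face_landmarks[0].landmark

    def crop(indices, pad=8):
        xs = [int(lms[i].x * w) for i in indices]
        ys = [int(lms[i].y * h) for i in indices]
        return bgr_frame[max(min(ys) - pad, 0):min(max(ys) + pad, h),
                         max(min(xs) - pad, 0):min(max(xs) + pad, w)]

    return crop(LEFT_EYE), crop(RIGHT_EYE)

With refine_landmarks=True, the mesh also provides iris landmarks from which the eyeball center key point coordinates used in the subsequent steps may be taken.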
At step 308 of the present disclosure, the one or more hardware processors 104 normalize the RGB eye image corresponding to the plurality of RGB image frames. The RGB eye image is normalized with respect to a Euler roll angle. Coordinates corresponding to the left eyeball center and the right eyeball center are considered as the eyeball center key point coordinates. The Euler roll angle between the eyeball center key point coordinates corresponding to the left eyeball center (x_cL, y_cL) and the right eyeball center (x_cR, y_cR) is computed as θ = deg(tan⁻¹(Δ_y/Δ_x)), where Δ_y = y_cL − y_cR, Δ_x = x_cL − x_cR, and θ is the Euler roll angle. A rotation matrix F is computed using the Euler roll angle, and the RGB eye image is rotated as I_E = F·I_E. This rotated RGB eye image is referred to as the normalized RGB eye image, which is subjected to further feature extraction. This process is applied to each of the plurality of RGB image frames before extracting the plurality of glint-based features and the plurality of appearance-based features.
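The roll normalization above may be sketched as follows in Python with OpenCV; the function name is illustrative, and the sign of the rotation may need adjusting for a particular image coordinate convention.

import cv2
import numpy as np

def normalize_roll(eye_img, left_center, right_center):
    """Rotate the RGB eye image so that the line joining the eyeball centers is horizontal."""
    (x_cl, y_cl), (x_cr, y_cr) = left_center, right_center
    d_y, d_x = y_cl - y_cr, x_cl - x_cr
    theta = np.degrees(np.arctan2(d_y, d_x))                      # Euler roll angle in degrees
    h, w = eye_img.shape[:2]
    F = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), theta, 1.0)   # rotation matrix F
    return cv2.warpAffine(eye_img, F, (w, h))                     # normalized RGB eye image I_E = F * I_E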
At step 310 of the present disclosure, the one or more hardware processors 104 extract a plurality of glint-based features from the normalized RGB eye image. A glint is a kind of corneal reflection visible in the cornea of the eyes of the subject. Each glint, if well detected and modelled correctly, can provide a good estimation of the position of the display screen relative to the eyes of the subject, which aids in gaze tracking. The glints are especially predominant when the luminance of the RGB camera capturing the subject is greater than that of the surrounding ambient light (for example, using a computer in a dimly lit room). Contours are formed for the left eye and the right eye of the subject using the eyeball key point coordinates corresponding to the normalized RGB eye image. The eyeball key point coordinates are defined as E = {(x_0, y_0), (x_1, y_1), (x_2, y_2), (x_3, y_3)}. These eyeball key point coordinates create the contour P by joining the consecutive eyeball key point coordinates. An eyeball center C = (x_c, y_c) of the subject with radius R is defined corresponding to the normalized RGB eye image for the left eye and the right eye of the subject. For each of the eyeball key point coordinates, if (x_i, y_i) is within the contour P, the intensities of the RGB camera corresponding to the normalized RGB eye image I_E are registered as:
I_E(x_i, y_i) = (0, 0, 0) if D > R; I_E(x_i, y_i) otherwise      (1)
where D = √((x_i − x_c)² + (y_i − y_c)²)
The I_E is 8-bit image data, so the pixel intensities of I_E vary between 0-255, with the minimum pixel intensities for the eyeball compared to the sclera and the glint, which comparatively have higher pixel intensities. Extraction of the glints cannot be done based on the pixel intensities alone, so the glint separation is done spatially. The disclosed gaze tracking system gives importance to the glints within the eyeball region and reduces the importance of the sclera and sclera reflection regions by changing the pixel intensities corresponding to the sclera of the left eye and the right eye of the subject in the normalized RGB eye image to pixel intensities significantly different from the pixel intensities corresponding to the glints, generating a glint enhanced RGB image {G_E}. Hence, the values of the pixel intensities at (x_i, y_i) whose Euclidean distance from (x_c, y_c) is greater than the radius R of the eyeball are changed to 0, as mentioned in equation (1), in accordance with some embodiments of the present disclosure. This processing step makes sure that the gaze estimation regression model will learn the mapping of the plurality of glint-based features only to the point of gaze on the display screen. This is achieved using equation (1), and thus the portions of the glints become significantly different. The reason for considering the entire contour instead of only the eyeball region is that the location of the eyeball in the contour is a significant feature pertaining to the gaze direction. The glint enhanced RGB image is gray scaled and further enhanced using histogram equalization for contrast enhancement.
To have uniform sized glint enhanced RGB images across the plurality of RGB image frames, whose sizes can vary depending on the distance of the subject from the RGB camera, the glint enhanced RGB images are transformed into corresponding p×q sized patches. The transformed RGB eye image G_E′ corresponding to the glint enhanced RGB image is given by:
G_E′(x′, y′) = T[G_E(x, y)]      (2)
where T is the transformation function, G_E′ is the output representing the transformed RGB eye image, and (x′, y′) are the transformed equivalents of (x, y).
G_E′(x_i′, y_j′) = ( Σ_{m = i×a}^{(i+1)×a} Σ_{n = j×b}^{(j+1)×b} G_E(x_m, y_n) ) / (a×b)      (3)
where a = x/x′, b = y/y′, i = 0, …, p−1 and j = 0, …, q−1; T is the transformation function to convert the glint enhanced RGB eye image G_E of varying resolution to the transformed RGB eye image G_E′ of resolution p×q; the p×q pixel intensities come from each transformed RGB eye image; and there are a total of 2×p×q pixel intensities for the left eye and the right eye of the subject.
The transformed RGB eye image is windowed into p×q RGB image patches with p = 6 and q = 10, in accordance with some embodiments of the present disclosure. Each value across the p×q RGB image is represented as µ_k, where k ∈ [1, 2, ..., p×q], and µ_k represents the values in the glint enhanced RGB eye image G_E. The plurality of glint-based features F{G_E′} extracted from the transformed RGB eye image is represented as:
F{G_E′} = [L(µ_k)_G, R(µ_k)_G]      (4)
where L(.)_G corresponds to the left eye and R(.)_G corresponds to the right eye, comprising 2×p×q glint-based features for each of the plurality of RGB image frames.
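A minimal sketch of the glint-based feature extraction of equations (1)-(4) is given below, assuming OpenCV and NumPy; using area-interpolated resizing as the averaging transform T, and interpreting p as rows and q as columns, are illustrative assumptions.

import cv2
import numpy as np

P, Q = 6, 10   # p x q patch grid used in the disclosure

def glint_features(eye_img, eyeball_center, radius):
    """Suppress the sclera (equation (1)), enhance contrast, and pool to P x Q (equations (2)-(3))."""
    x_c, y_c = eyeball_center
    h, w = eye_img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    D = np.sqrt((xx - x_c) ** 2 + (yy - y_c) ** 2)
    g = eye_img.copy()
    g[D > radius] = 0                                        # equation (1): zero pixels outside the eyeball radius
    g = cv2.cvtColor(g, cv2.COLOR_BGR2GRAY)                  # gray scaling
    g = cv2.equalizeHist(g)                                  # histogram equalization for contrast enhancement
    g = cv2.resize(g, (Q, P), interpolation=cv2.INTER_AREA)  # area averaging approximates equations (2)-(3)
    return g.flatten().astype(np.float32)                    # P*Q values of mu_k for one eye

def glint_feature_vector(left_eye, left_c, left_r, right_eye, right_c, right_r):
    """Equation (4): concatenate left- and right-eye glint features into 2 x P x Q values."""
    return np.concatenate([glint_features(left_eye, left_c, left_r),
                           glint_features(right_eye, right_c, right_r)])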
At step 312 of the present disclosure, the one or more hardware processors 104 extract the plurality of appearance-based features from the plurality of RGB image frames. The appearance-based approaches focus on the richness of eye appearance to map the plurality of RGB image frames directly to the eye gaze. These approaches perform better than the feature-based or model-based approaches as they are less constrained by image resolution or distance. Typically, the appearance-based approaches extract high dimensional features of the plurality of RGB image frames, which basically involves a raster scan of all the pixels. The high dimensional features come with the added advantage of retaining all the information in the plurality of RGB image frames. However, in gaze tracking applications, wherein the highest expected variance is with respect to the eyeballs only, the rest of the features lead to redundancy of information. The positions of the eyeballs in the plurality of RGB image frames help to estimate the eye gaze on the display screen. However, pixel-wise feature extraction from the normalized RGB eye image increases the feature dimensionality, which is excessive for estimating the gaze point, leading to overfitting issues and loss of generalizability. This overfitting is prevalent in cases of changes in illumination, head pose, occlusions, and the like.
The normalized RGB eye image is enhanced using gray scaling and histogram equalization, generating an appearance enhanced RGB eye image (A_E) corresponding to the left eye and the right eye of the subject.
To have uniform sized appearance enhanced RGB images across the plurality of RGB image frames, whose sizes can vary depending on the distance of the subject from the RGB camera, the appearance enhanced RGB eye image is transformed into corresponding p×q sized patches. The transformed RGB eye image A_E′ corresponding to the appearance enhanced RGB image is given by:
A_E′(x′, y′) = T[A_E(x, y)]      (5)
where T is the transformation function, A_E′ is the output representing the transformed RGB eye image, and (x′, y′) are the transformed equivalents of (x, y).
A_E′(x_i′, y_j′) = ( Σ_{m = i×a}^{(i+1)×a} Σ_{n = j×b}^{(j+1)×b} A_E(x_m, y_n) ) / (a×b)      (6)
where a = x/x′, b = y/y′, i = 0, …, p−1 and j = 0, …, q−1; T is the transformation function to convert the appearance enhanced RGB eye image A_E of varying resolution to the transformed RGB eye image A_E′ of resolution p×q; the p×q pixel intensities come from each RGB eye image; and there are a total of 2×p×q pixel intensities for the left eye and the right eye of the subject.
The transformed RGB eye image is windowed into p×q RGB image patches with p = 6 and q = 10, in accordance with some embodiments of the present disclosure. Each value across the p×q RGB image is represented as µ_k, where k ∈ [1, 2, ..., p×q], and µ_k represents the values in the appearance enhanced RGB eye image A_E. The plurality of appearance-based features F{A_E′} extracted from the transformed RGB eye image is represented as:
F{A_E′} = [L(µ_k)_A, R(µ_k)_A]      (7)
where L(.)_A corresponds to the left eye and R(.)_A corresponds to the right eye, comprising 2×p×q appearance-based features for each of the plurality of RGB image frames.
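The appearance path reuses the same enhancement and pooling transform as the glint path but without the sclera suppression of equation (1); a minimal sketch under the same assumptions as above is:

import cv2
import numpy as np

P, Q = 6, 10   # p x q patch grid used in the disclosure

def appearance_features(eye_img):
    """Equations (5)-(7): gray scale, equalize, and pool the normalized RGB eye image to P x Q."""
    a = cv2.cvtColor(eye_img, cv2.COLOR_BGR2GRAY)
    a = cv2.equalizeHist(a)                                  # appearance enhanced eye image A_E
    a = cv2.resize(a, (Q, P), interpolation=cv2.INTER_AREA)  # transform T of equations (5)-(6)
    return a.flatten().astype(np.float32)                    # P*Q values of mu_k for one eye

Concatenating the left-eye and right-eye outputs gives the 2×p×q appearance-based features of equation (7).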
FIGS. 5A, 5B, 5C, and 5D depict an illustration of simulated transformed RGB eye images corresponding to the glint enhanced RGB eye image and the appearance enhanced RGB eye image, in accordance with some embodiments of the present disclosure. The p×q is taken as 3×3 in accordance with some embodiments of the present disclosure. A two-dimensional (2D) matrix represents the G_E, where a white patch corresponds to the glint and a black patch corresponds to the sclera, as shown in FIGS. 5A and 5B, in accordance with some embodiments of the present disclosure. FIGS. 5C and 5D show a 2D matrix that represents A_E, with the black patch corresponding to the eyeball and the rest pertaining to the sclera around the contour. The horizontal bar shows the row-wise concatenation of the transformed RGB eye images corresponding to the glint enhanced RGB images and the transformed RGB eye images corresponding to the appearance enhanced RGB images, with the color coding being the value of µ_k. From FIGS. 5A, 5B, 5C, and 5D, the most significant feature value that contributes to a change in the gaze for the glint enhanced RGB eye image and the appearance enhanced RGB eye image is given by:
d_{G_E} = argmax_k {µ_k}
d_{A_E} = argmin_k {µ_k}      (8)
wherein ∀ k ∈ [1, 2, …, p×q] for G_E′ and A_E′.
At step 314 of the present disclosure, the one or more hardware processors 104 train the gaze estimation regression model for estimating the gaze point on the display screen, by fusion of the plurality of glint-based features and the plurality of appearance-based features. The fusion of the plurality of glint-based features obtained from step 310 and the plurality of appearance-based features obtained from step 312 is represented as:
F = [F{G_E′}, F{A_E′}]      (9)
wherein F corresponds to a feature vector of dimension d×1.
The gaze estimation regression model is trained to estimate the gaze point on the display screen using F. The trained gaze estimation regression model, during the inferencing stage, estimates the gaze of a subject of interest using fusion of the plurality of glint-based features and the plurality of appearance-based features.
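The disclosure does not fix a particular regressor for the gaze estimation regression model; the sketch below uses ridge regression from scikit-learn as one illustrative choice, with the feature fusion of equation (9).

import numpy as np
from sklearn.linear_model import Ridge

def fuse(glint_vec, appearance_vec):
    """Equation (9): fused feature vector F = [F{G_E'}, F{A_E'}] of dimension d x 1."""
    return np.concatenate([glint_vec, appearance_vec])

def train_gaze_regressor(fused_features, screen_points):
    """Train the gaze estimation regression model on calibration data.

    fused_features: list of fused feature vectors, one per calibration frame.
    screen_points: list of ground-truth (x, y) gaze locations on the display screen.
    """
    model = Ridge(alpha=1.0)                     # illustrative regressor choice, not mandated
    model.fit(np.asarray(fused_features), np.asarray(screen_points))
    return model

def estimate_gaze(model, glint_vec, appearance_vec):
    """Inferencing stage: map the fused features to a gaze point on the display screen."""
    return model.predict(fuse(glint_vec, appearance_vec).reshape(1, -1))[0]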
EXPERIMENTAL RESULTS
The disclosed gaze tracking system is compared with the closest comparable approach from the literature, Webgazer (“Alexandra Papoutsaki. Scalable webcam eye tracking by learning from user interactions. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, pages 219–222, 2015.”). Three types of datasets are considered for evaluating the disclosed gaze tracking system. The first is an inhouse collected dataset in which the data is collected in an ideal, controlled manner with minimal head movements, albeit the usage of a chin rest is avoided. Next, the gaze tracking system is evaluated against benchmark datasets, the MPIIFaceGaze and EVE datasets. In both datasets, extraction of the plurality of glint-based features and the plurality of appearance-based features for the training phase is done analogously to that of the inhouse collected dataset, against the 20 calibration points shown in FIG. 4, in accordance with some embodiments of the present disclosure.
FIG. 6 depicts a data collection setup for the real-time gaze tracking in real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure. The data collection for the inhouse collected dataset is carried out in a closed room with proper ventilation, as shown in FIG. 6, in accordance with some embodiments of the present disclosure. A volunteer accompanies a participant and guides him/her through the overall data collection procedure. A data collection kiosk is developed which takes care of end-to-end data capture. A lux meter is used to measure the lighting conditions, and two variants of light conditions are used for the study: poor light in the range of 5-6 lux and good light in the range of 205-240 lux. The data is collected in both good and poor ambient lighting settings. The volunteer briefs the data collection procedure to the participant, and a Short Form Health Survey 12 (SF-12) assessment is carried out to screen the participant for overall wellness. Only the participants who clear the SF-12 assessment are considered for the experiment. The data is collected from 27 participants (10 females; Mean age±SD: 32.2±8.2) while they gaze at the designed stimulus. All the participants hailed from similar cultural and educational backgrounds. They had normal or corrected-to-normal vision with glasses. 11 out of the 27 participants used glasses during the experiment. All the data files are anonymized, and written consent is obtained from all the participants.
To ensure proper data collection in terms of head pose, the participants are asked to sit comfortably with the head position aligned to the center of the RGB camera. An application is developed which shows the participant’s face landmarks in real-time on the display screen over a white background whose dimensions are the same as the camera resolution. A face box, which is initially black in color, is shown at the center of that white background, and the participants are asked to align the face to the center of the black box. The face box turns green in color if the participant’s face is within the face box. Once the participant’s nose is close to the center, within a 50-pixel distance from the center of the face box, the application captures that position and shows a message to the participant to maintain that position throughout the trial. This process is repeated before each trial to maintain the same position for all the participants and across all the trials.
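As a small illustration of the alignment check described above, the following sketch tests whether the nose-tip landmark lies within the 50-pixel tolerance of the face-box center; the function name and the way the nose tip is obtained are illustrative assumptions.

import numpy as np

def face_aligned(nose_xy, box_center_xy, tol_px=50):
    """Return True when the nose tip is within tol_px pixels of the face-box center."""
    dist = np.hypot(nose_xy[0] - box_center_xy[0], nose_xy[1] - box_center_xy[1])
    return dist <= tol_px   # the face box may be drawn green when this holds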
MPIIFaceGaze (“Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. It’s written all over your face: Full-face appearance-based gaze estimation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2299–2308, 2017”) is a publicly available benchmark dataset that contains images from 15 subjects, with an average of 2500 images per person. The MPIIFaceGaze is collected under various head poses and varying illumination conditions. The MPIIFaceGaze was collected under real-world scenarios while the participants were doing their day-to-day work on laptops under varying illumination conditions and eye appearances. A software tool was used to collect the data from the participants automatically every 10 minutes. The participants were asked to look at on-screen shrinking grey circles with a white dot at the center, which were distributed randomly over the display screen. The participants were requested to fixate on the grey circle and press the spacebar before the grey circle disappeared from the display screen. A high resolution laptop camera was used to capture images along with the circle positions on the display screen as ground truths.
The EVE dataset (“Seonwook Park, Emre Aksan, Xucong Zhang, and Otmar Hilliges. Towards end-to-end video-based eye-tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 747–763. Springer, 2020”) is a video dataset captured at 1920×1080 pixels resolution along with eye gaze and pupil size data. It includes 54 participants, and the data was collected in indoor settings. Only the data of participants for whom the ground truth corresponding to the point-of-gaze was present is considered. A large variety of stimuli in the form of images, videos, and Wikipedia pages were presented to the participants for capturing the gaze data. The EVE dataset is collected using multi-camera views so that it covers large variations in the gaze directions and the head poses. The ground truth pertaining to gazed locations is obtained using a commercial eye tracker, the Tobii Pro Spectrum eye tracker.
FIGS. 7A, 7B, and 7C depict the distribution of head poses for the datasets used for testing the real-time gaze tracking in real-world scenarios using the RGB camera, in accordance with some embodiments of the present disclosure. The head poses are calculated using 6 facial landmarks, detected using the Mediapipe library, and camera parameters. The 6 facial landmarks used in the head pose calculation are the tip of the nose, the chin, the left corner of the left eye, the right corner of the right eye, and the left and right corners of the mouth. The inhouse dataset, the MPIIFaceGaze, and the EVE dataset cover the range of gaze directions that can occur during real-world laptop interactions. It is observed in the heatmaps that the inhouse dataset has fewer variations in terms of head pose in comparison to the MPIIFaceGaze and the EVE dataset.
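A minimal sketch of such a head pose computation from the 6 landmarks is given below using OpenCV's PnP solver; the generic 3D face-model coordinates and the simple pinhole camera approximation are common illustrative values, not the disclosure's exact parameters.

import cv2
import numpy as np

# Generic 3D reference points (in mm) for nose tip, chin, eye corners, and mouth corners;
# widely used illustrative values, not taken from the disclosure.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # tip of the nose
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left corner of the left eye
    (225.0, 170.0, -135.0),    # right corner of the right eye
    (-150.0, -150.0, -125.0),  # left corner of the mouth
    (150.0, -150.0, -125.0),   # right corner of the mouth
], dtype=np.float64)

def head_pose(image_points, frame_w, frame_h):
    """Estimate the head rotation and translation from the 6 detected 2D landmarks."""
    focal = frame_w                                     # simple pinhole approximation
    camera_matrix = np.array([[focal, 0, frame_w / 2.0],
                              [0, focal, frame_h / 2.0],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))                      # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec                                   # Rodrigues rotation vector and translation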
The performance of the disclosed gaze tracker system is evaluated on the MPIIFaceGaze and the EVE dataset and compared with that of the state-of-the-art gaze tracker Webgazer. The results are further presented quantitatively, considering all the participants, for the inhouse dataset, the MPIIFaceGaze, and the EVE dataset in Tables 1A and 1B. All the 20 calibration points are used for the calibration and are tested against the 20 grid cells.
Dataset | Approach | Accuracy Mean (%) | Accuracy SD (%)
EVE | Webgazer | 91.6 | 7.5
EVE | CamTratak | 96.6 | 3.7
MPIIFaceGaze | Webgazer | 78.7 | 15.2
MPIIFaceGaze | CamTratak | 86.7 | 13.2
Inhouse - Good | Webgazer | 72.8 | 23.6
Inhouse - Good | CamTratak | 73.9 | 23.2
Inhouse - Poor | Webgazer | 51.9 | 30.7
Inhouse - Poor | CamTratak | 52.0 | 27.3
Table 1A
Best performance (%)
Dataset | Approach | =100% | >=95% | >=90% | >=85% | >=80% | >=75%
EVE | Webgazer | 22.7 | 54.5 | 72.7 | 86.4 | 95.5 | 97.7
EVE | CamTratak | 45.5 | 86.4 | 97.7 | 100 | 100 | 100
MPIIFaceGaze | Webgazer | 13.3 | 20 | 26.7 | 53.3 | 60 | 60
MPIIFaceGaze | CamTratak | 20 | 46.7 | 60 | 73.3 | 73.3 | 80
Inhouse - Good | Webgazer | 22.2 | 33.3 | 33.3 | 44.4 | 51.9 | 55.6
Inhouse - Good | CamTratak | 22.2 | 25.9 | 44.4 | 44.4 | 48.1 | 59.3
Inhouse - Poor | Webgazer | 0 | 7.4 | 22.2 | 22.2 | 25.9 | 33.3
Inhouse - Poor | CamTratak | 0 | 3.7 | 14.8 | 18.5 | 22.2 | 29.6
Table 1B
The mean ± standard deviation (SD) accuracy results, along with the percentage of accuracy values in the bins (>=75% through 100% in steps of 5%), in Tables 1A and 1B show that in most of the cases the CamTratak (system 100) is a better performer over the Webgazer when tested against different datasets. Further, the efficacy of the Webgazer and the proposed gaze tracker system CamTratak is explored for different numbers of calibration points, starting with 4 till 20, in steps of 4. This is for identifying the impact of the minimum and maximum calibration points on the prediction accuracy. FIG. 8 depicts the accuracy of 20 gazed points across participants and multiple calibration points for the different datasets, in accordance with some embodiments of the present disclosure. It is observed that the CamTratak performs better than the Webgazer in all these cases. This shows the robustness of the CamTratak for calibration with the minimal as well as the maximum calibration points.
A hospital bed assisted living use-case is demonstrated as a control interface for patients using the proposed gaze tracking system. In hospitals and assisted living centers, powered patient beds use hand-driven controls such as joysticks or keypads for option selection and navigation. Many patients find it difficult to use these controls due to different issues such as fatigue, movement restrictions, shivering of hands, and so on. FIGS. 9A, 9B, 9C, 9D, and 9E depict the user interface (UI) for patients for the hospital bed assisted living use-case, in accordance with some embodiments of the present disclosure. The control interface is presented in front of the patients so that they can gaze at a particular entity.
The grid-based screen resolution of the gaze tracking system is higher than that of the hospital bed assisted living UI; hence, care must be taken to avoid incorrect selections due to overlapping of the grid cells of the gaze tracking system with the entities in the UI. The UI is designed by ensuring enough separation between entities so that the chance of a wrong selection due to gaze overlap is avoided. This is achieved by introducing a no-action land, shown in black color in FIGS. 9A, 9B, 9C, and 9D. The main screen UI shows 4 entities, which include emergency, bed, nurse, and food, as large icons. The entities are placed in a 2×2 grid format by purposefully introducing the no-action land in between. This ensures that the grid cells of the gaze tracking system which are on the boundary of a given entity span the no-action land, thereby avoiding the chances of selecting incorrect, non-gazed entities. All the other UI pages are designed in a similar way.
If the patient wants to raise the leg portion of the hospital bed, then he/she just needs to gaze at the block on the main navigation menu of the UI corresponding to 'Bed', and the gaze tracking system navigates to the next screen, where button icons are provided to raise the leg portion or head portion up or down. A side bar menu is provided to navigate back to the main menu. Similarly, the patient can access other essential things like food, calling a nurse, help in case of emergency, and the like. The adjustment of the bed from the current state to the target state is shown in FIG. 9E.
The user experience is evaluated using the hospital bed assisted living use-case by subjecting 6 participants (Mean±SD age: 29.5±4.37) to two input modalities, mouse click-based and gaze-based using the gaze tracking system. A few initial practice sessions were provided to these participants to make them accustomed to these two modalities. The aim is to show how close, after initial training, the participants can come when using the gaze tracking system in comparison to using mouse clicks as the input modality. Each participant is subjected to 4 trials per input modality. Each trial consists of completing a given task: calling for help during an emergency, calling a nurse, ordering food, or adjusting the bed. These 4 trials correspond to the 4 entities shown in the main page of the use-case in FIG. 9A. A hold time of 3 seconds is provided to make a choice on every page. To quantitatively assess the performance, the following scores are computed: (i) Scan path score (SPS): the ratio of the actual path (number of pages) required to accomplish the trial to the traversed path (number of pages traversed), and (ii) Task completion time (TCT): the total time taken to complete the given trial. The SPS is indicative of the system's effectiveness, whereas the TCT assesses the efficiency of the gaze tracking system.
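For illustration only, the two scores defined above can be computed as in the following sketch; the function names are illustrative.

def scan_path_score(actual_pages, traversed_pages):
    """SPS: ratio of the minimal number of pages for the task to the number actually traversed."""
    return actual_pages / traversed_pages

def task_completion_time(start_time_s, end_time_s):
    """TCT: total time (in seconds) taken to complete the given trial."""
    return end_time_s - start_time_s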
FIGS. 10 and 11 depict the performance results of the hospital bed assisted living use-case, in accordance with some embodiments of the present disclosure. FIGS. 10 and 11 show the results of the SPS and the TCT, respectively, averaged across the participants. It is to be noted that the participants could control the UI using the gaze tracking system in a manner similar to the mouse click input modality. This shows the effectiveness and efficiency of using the gaze tracking system in controlling such real-time applications.
Further, a tele-robotics use-case is described to provide gaze-directed locomotion to robots. FIG. 12 depicts a tele-robotics use-case demonstration of a user controlling a robot, in accordance with some embodiments of the present disclosure. The user can control the robot's movements using the disclosed gaze tracking system. The field of view of the robot is displayed to the user on the display screen. The user can gaze at the locations where he/she wants the robot to move and guide the robot accordingly. The gaze tracking system is tested across users in the lab and is found to be effective in terms of easily controlling the robot in any given direction. The gaze tracking system finds application in the hospital scenario, wherein the doctor, if not physically present near the patient, can attend using the robot, assess the patient's health, and interact with them, while his/her hands are free to make notes or perform other actions.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The present disclosure herein addresses the real-time gaze tracking in real-world scenarios using the RGB camera. Existing eye trackers are expensive, and their usage is limited to lab settings. Off-the-shelf RGB cameras used for eye tracking face challenges related to varying illumination, apart from head pose and occlusions. The eye trackers which are agnostic to illumination conditions are primarily IR-based cameras. Such IR eye trackers are not easily available, are very costly, and cannot be used for mass deployments. The disclosed gaze tracking system can effectively track a person's gaze in real-time and in real-world scenarios using the low-resolution RGB camera. The gaze tracking system detects and extracts the plurality of glint-based features from the eye images in poor illumination conditions. The plurality of glint-based features and the plurality of appearance-based features are fused together to provide a robust gaze tracking solution for real-time scenarios.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
CLAIMS:
1. A processor implemented method (300), the method comprising:
receiving, via a low-resolution Red-Green-Blue (RGB) camera controlled by one or more hardware processors, a plurality of RGB image frames comprising a subject sequentially gazing at a plurality of grid cells on a display screen in accordance with a stimulus generated using a stimulus design technique (302);
identifying, by the one or more hardware processors, facial landmarks comprising, eye key point coordinates, and eyeball key point coordinates, from the plurality of RGB image frames (304);
identifying, by the one or more hardware processors, an eye region representing an RGB eye image, from the eye key point coordinates (306);
normalizing, by the one or more hardware processors, the RGB eye image corresponding to the plurality of RGB image frames (308);
extracting, by the one or more hardware processors, a plurality of glint-based features from the normalized RGB eye image (310);
extracting, by the one or more hardware processors, a plurality of appearance-based features from the normalized RGB eye image (312); and
training, by the one or more hardware processors, a gaze estimation regression model, for estimating a gaze point on the display screen, by fusion of the plurality of glint-based features and the plurality of appearance-based features (314).
2. The processor implemented method as claimed in claim 1, wherein the trained gaze estimation regression model, during inferencing stage, estimates a gaze of a subject of interest using fusion of the plurality of glint-based features and the plurality of appearance-based features.
3. The processor implemented method as claimed in claim 1, wherein generating the stimulus using the stimulus design technique comprises:
dividing the display screen into the plurality of grid cells on a black background;
sequentially highlighting to the subject, for every t seconds, each of the plurality of grid cells with gray color having a white circle with a black dot in center; and
obtaining, the plurality of RGB image frames of the subject, while gazing at the plurality of grid cells.
4. The processor implemented method as claimed in claim 1, wherein the plurality of glint-based features are extracted from the normalized RGB eye image by:
forming, contours for the left eye and right eye of the subject, using the eyeball key point coordinates corresponding to the normalized RGB eye image;
highlighting, the contours, by deemphasizing the region outside the contour region;
performing, gray scaling and histogram equalization on the normalized RGB eye image for contrast enhancement, generating a gray scaled enhanced image corresponding to the left eye and the right eye of the subject;
transforming, the gray scaled enhanced image to a predefined image size, to maintain uniformity across the plurality of RGB image frames; and
extracting, a plurality of glint-based features, from the transformed gray scaled enhanced image.
5. The processor implemented method as claimed in claim 1, wherein the plurality of appearance-based features is extracted from the plurality of RGB image frames by:
performing, gray scaling and histogram equalization on the normalized RGB eye image for contrast enhancement, generating a gray scaled enhanced image corresponding to the left eye and the right eye of the subject;
transforming, the gray scaled enhanced image to the predefined image size, to maintain uniformity across the plurality of RGB image frames; and
extracting, a plurality of appearance-based features, from the transformed gray scaled enhanced image.
6. A system (100) comprising:
a memory (102) storing instructions;
one or more communication interfaces (106); and
one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
receive via a Red-Green-Blue (RGB) camera, a plurality of RGB image frames comprising a subject sequentially gazing at a plurality of grid cells on a display screen in accordance with a stimulus generated using a stimulus design technique;
identify facial landmarks comprising, eye key point coordinates, and eyeball key point coordinates, from the plurality of RGB image frames;
identify an eye region representing an RGB eye image, from the eye key point coordinates;
normalize the RGB eye image corresponding to the plurality of RGB image frames;
extract a plurality of glint-based features from the normalized RGB eye image;
extract a plurality of appearance-based features from the normalized RGB eye image; and
train a gaze estimation regression model, for estimating a gaze point on the display screen, by fusion of the plurality of glint-based features and the plurality of appearance-based features.
7. The system as claimed in claim 6, wherein the trained gaze estimation regression model, during inferencing stage, estimates a gaze of a subject of interest using fusion of the plurality of glint-based features and the plurality of appearance-based features.
8. The system as claimed in claim 6, wherein generating the stimulus using the stimulus design technique comprises:
dividing the display screen into the plurality of grid cells on a black background;
sequentially highlighting to the subject, for every t seconds, each of the plurality of grid cells with gray color having a white circle with a black dot in center; and
obtaining, the plurality of RGB image frames of the subject, while gazing at the plurality of grid cells.
9. The system as claimed in claim 6, wherein the plurality of glint-based features are extracted from the normalized RGB eye image by:
forming, contours for the left eye and the right eye of the subject using the eyeball key point coordinates corresponding to the normalized RGB eye image;
highlighting, the contours, by deemphasizing the region outside the contour region;
performing, gray scaling and histogram equalization on the normalized RGB eye image for contrast enhancement, generating a gray scaled enhanced image corresponding to the left eye and the right eye of the subject;
transforming, the gray scaled enhanced image to a predefined image size, to maintain uniformity across the plurality of RGB image frames; and
extracting, a plurality of glint-based features, from the transformed gray scaled enhanced image.
10. The system as claimed in claim 6, wherein the plurality of appearance-based features is extracted from the plurality of RGB image frames by:
performing, gray scaling and histogram equalization on the normalized RGB eye image for contrast enhancement, generating a gray scaled enhanced image corresponding to the left eye and the right eye of the subject;
transforming, the gray scaled enhanced image to the predefined image size, to maintain uniformity across the plurality of RGB image frames; and
extracting, a plurality of appearance-based features, from the transformed gray scaled enhanced image.