Method And System For Gaze Estimation To Construct Human Vision To Augment Reality

Abstract: State-of-the-art neural network based gaze estimation has accuracy limitations, as some critical parameters that affect gaze are not included during the training process. A method and system for gaze estimation to construct human vision to augment reality are disclosed, which train a neural network for accurate gaze estimation. During training, a grid with a moving object is presented to a user; images of the user are captured when the user fixates on the moving object; the bounding box of the face of the user and the position of the user’s iris are identified; and the bounding box of the face of the user, the position of the user’s iris, the height of the user, the distance of the user and the pose of the user are marked on blank images to generate a set of training images for training the neural network for gaze estimation to construct human vision to augment reality. To be published with FIG. 2

Patent Information

Application #: 202121010969
Filing Date: 15 March 2021
Publication Number: 37/2022
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-02-21
Renewal Date:

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point, Mumbai, Maharashtra, India 400021

Inventors

1. DAS, Apurba
Tata Consultancy Services Limited, ITPL Anchor Building, International Tech Park Bangalore, Whitefield, Bangalore, Karnataka, India 560066
2. ROY, Shormi
Tata Consultancy Services Limited, ITPL Anchor Building, International Tech Park Bangalore, Whitefield, Bangalore, Karnataka, India 560066
3. SAHA, Pallavi
Tata Consultancy Services Limited, ITPL Anchor Building, International Tech Park Bangalore, Whitefield, Bangalore, Karnataka, India 560066

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHOD AND SYSTEM FOR GAZE ESTIMATION TO CONSTRUCT HUMAN VISION TO AUGMENT REALITY
Applicant
Tata Consultancy Services Limited, a company incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD
[001] The disclosure herein generally relates to computer vision, and, more particularly, to a method and system for gaze estimation to construct human vision to augment reality.
BACKGROUND
[002] Virtual reality (VR) and augmented reality (AR) are related fields that provide an artificial sensory experience to end-users. As VR and AR have become more accessible, interest has grown amongst a subset of users in combining VR/AR with gaze tracking. Gaze tracking is tracking or monitoring the direction of a user’s gaze (i.e., tracking or monitoring where a person is looking). It can provide information about a user's attention, perception, cognition, eye-hand coordination and other neurological functions. Most of the existing gaze detection/gaze tracking techniques make use of devices that need to be worn by the user or attached to the user, for example, head mounted displays. These techniques are intrusive and may cause discomfort to the user.
[003] Non-intrusive gaze detection/tracking methods make use of neural network models trained on a set of images of a person’s face to predict the direction of the person’s gaze. However, in existing methods, the training images capture only facial features and do not include any information on the individual’s height, pose etc. These additional features are critical since the gaze direction may change based on changes in the individual’s height, pose and distance of the individual from the camera which captures the image of the person. Thus, improving the accuracy of gaze direction estimation in non-intrusive gaze detection methods, for more effective AR applications, remains an area of research.
SUMMARY
[004] Embodiments of the present disclosure present technological
improvements as solutions to one or more of the above-mentioned technical
problems. For example, in one embodiment, a method for gaze estimation to
construct human vision to augment reality is provided. The method includes
displaying to a user a grid comprising a plurality of cells and an object, wherein the object is in motion across the plurality of cells. Further, the method includes capturing a set of images of the user fixating on the object in motion across the plurality of cells; determining a bounding box around face of the user in each of the set of images, and identifying feature points corresponding to eyes and nose endpoint on the face in each of the set of images; calculating a distance of the user from the camera, wherein the distance is computed from the nose endpoint of the user to the camera; embedding, into a set of blank images, the bounding box around the face of the user, the nose endpoint and the position of iris from each of the set of images, height and pose of the user, and the distance of the user from the camera wherein embedding comprises: (i) marking the bounding box around the face of the user and the nose endpoint of the user from each of the set of images, into a corresponding image among the set of blank images, (ii) drawing a horizontal axis, a vertical axis and a diagonal in each of the corresponding image among the set of blank images, wherein the horizontal axis, the vertical axis and the diagonal intersect at the nose endpoint, (iii) marking position of iris, identified from feature points corresponding to eyes in each of the set of images, into the corresponding image among the set of blank images, (iv) marking height of the user along the vertical axis in the corresponding image among the set of blank images, (v) marking pose of the user along the horizontal axis of the corresponding image among the set of blank images, and (vi) marking distance of the user from the camera along the diagonal of corresponding image among the set of blank images, to get a set of training images. Further, the method includes training a neural network with the set of training images for gaze estimation to construct human vision to augment reality. [005] In another aspect, a system for gaze estimation to construct human vision to augment reality is provided. The system comprises a memory storing instructions; one or more Input/Output (I/O) interfaces; a camera for capturing images of a user; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to display to a user, via the one or more I/O interfaces (106), a grid comprising a plurality of cells and an object, wherein the object is in
motion across the plurality of cells. Further, the one or more hardware processors are configured to capture a set of images of the user fixating on the object in motion across the plurality of cells; determine a bounding box around face of the user in each of the set of images, and identify feature points corresponding to eyes and nose endpoint on the face in each of the set of images; calculate a distance of the user from the camera, wherein the distance is computed from the nose endpoint of the user to the camera; embed, into a set of blank images, the bounding box around the face of the user, the nose endpoint and the position of iris from each of the set of images, height and pose of the user, and the distance of the user from the camera wherein embedding comprises: (i) marking the bounding box around the face of the user and the nose endpoint of the user from each of the set of images, into a corresponding image among the set of blank images, (ii) drawing a horizontal axis, a vertical axis and a diagonal in each of the corresponding image among the set of blank images, wherein the horizontal axis, the vertical axis and the diagonal intersect at the nose endpoint, (iii) marking position of iris, identified from feature points corresponding to eyes in each of the set of images, into the corresponding image among the set of blank images, (iv) marking height of the user along the vertical axis in the corresponding image among the set of blank images, (v) marking pose of the user along the horizontal axis of the corresponding image among the set of blank images, and (vi) marking distance of the user from the camera along the diagonal of corresponding image among the set of blank images, to get a set of training images. Further, the one or more hardware processors are configured to train a neural network with the set of training images for gaze estimation to construct human vision to augment reality.
[006] In yet another aspect, there are provided one or more non-transitory machine readable information storage mediums comprising one or more instructions, which when executed by one or more hardware processors causes a method for gaze estimation to construct human vision to augment reality. The method includes displaying to a user a grid comprising a plurality of cells and an object, wherein the object is in motion across the plurality of cells. Further, the method includes capturing a set of images of the user fixating on the object in
motion across the plurality of cells; determining a bounding box around face of the user in each of the set of images, and identifying feature points corresponding to eyes and nose endpoint on the face in each of the set of images; calculating a distance of the user from the camera, wherein the distance is computed from the nose endpoint of the user to the camera; embedding, into a set of blank images, the bounding box around the face of the user, the nose endpoint and the position of iris from each of the set of images, height and pose of the user, and the distance of the user from the camera wherein embedding comprises: (i) marking the bounding box around the face of the user and the nose endpoint of the user from each of the set of images, into a corresponding image among the set of blank images, (ii) drawing a horizontal axis, a vertical axis and a diagonal in each of the corresponding image among the set of blank images, wherein the horizontal axis, the vertical axis and the diagonal intersect at the nose endpoint, (iii) marking position of iris, identified from feature points corresponding to eyes in each of the set of images, into the corresponding image among the set of blank images, (iv) marking height of the user along the vertical axis in the corresponding image among the set of blank images, (v) marking pose of the user along the horizontal axis of the corresponding image among the set of blank images, and (vi) marking distance of the user from the camera along the diagonal of corresponding image among the set of blank images, to get a set of training images. Further, the method includes training a neural network with the set of training images for gaze estimation to construct human vision to augment reality.
[007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[008] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:

[009] FIG. 1 illustrates a functional block diagram of a system for gaze estimation to construct human vision to augment reality, according to some embodiments of the present disclosure.
[010] FIG. 2 is a flow diagram illustrating a method for gaze estimation to construct human vision to augment reality, according to some embodiments of the present disclosure.
[011] FIG. 3 is a flow diagram illustrating a process for embedding information from a set of user images into a set of blank images, according to some embodiments of the present disclosure.
[012] FIG. 4 illustrates a grid comprising a plurality of cells and an object, according to some embodiments of the present disclosure.
[013] FIG. 5 illustrates measurements of height of user, distance of the user from a camera and pose of the user in terms of head rotation, according to some embodiments of the present disclosure.
[014] FIG. 6 illustrates an example training image, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[015] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[016] Augmented reality (AR) is a technology that overlays information and virtual objects on real-world scenes in real-time. It uses the existing environment and adds information to it to make a new artificial environment. It can be used in various fields such as education, medicine, airports, gaming etc. Consider an example scenario where AR can be used in a supermarket to enhance customer experience: when a customer looks at items displayed in racks of a supermarket, information about the items such as price, ratings of other customers etc. can be provided to the customer on his/her handheld device. This helps the customer in deciding which items to buy.
[017] In such applications, it is important to identify where the person is looking, i.e., the gaze direction of the person, so that appropriate information can be provided to him/her. Some other applications of gaze detection are:
• Understand where a person is gazing and capture a photo of the same through a camera; it can be of any real-world object.
• Relay information about an object the user is looking at on any nearby screen, by capturing, identifying and fetching data about the object from the cloud.
• Zoom into the part of the screen where the user fixates and show advertisements accordingly.
[018] Existing gaze detection algorithms implemented in AR applications require external devices (for example, head mounted displays) to be worn by a person to identify the gaze direction of the person and to accurately identify the object on which the person fixates. This increases the cost of the application, and wearing the external device causes discomfort to the user. Existing non-intrusive methods generally make use of neural network models trained on a set of images of a person gazing in different directions. However, these existing techniques fail to capture data points related to the person such as his/her height, distance from the camera and head pose, which are critical in improving accuracy of gaze estimation.
[019] Embodiments described herein provide a method and system for gaze estimation to construct human vision to augment reality. The method embeds the person’s height, distance from the camera and head pose into the training images used to train a neural network model, thereby improving accuracy while estimating the gaze direction of the person.
[020] Referring now to the drawings, and more particularly to FIG. 1 through 6, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.

[021] FIG. 1 illustrates a functional block diagram of a system for gaze estimation to construct human vision to augment reality, according to some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more hardware processors 104, communication interface device(s) or input/output (I/O) interface(s) 106 (also referred as interface(s)), one or more data storage devices or memory 102, and one or more cameras 110 operatively coupled to the one or more hardware processors 104. The one or more processors 104 may be one or more software processing components and/or hardware processors. In an embodiment, the hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) is/are configured to fetch and execute computer-readable instructions stored in the memory. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[022] The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server. During the training process, the system 100, via the graphical user interface, displays a grid comprising a plurality of cells. An object that moves gradually across the plurality of cells, with a short pause at pre-defined intervals, is displayed to the user (person). The user is requested to fixate on the object as it moves across the cells, and the one or more camera(s) 110 capture a set of images of the user fixating on the object in motion. Further, the set of images are stored in the memory 102.
[023] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is comprised in the memory 102, wherein the database 108 may store the set of images of a user fixating on the object in motion. The database 108 further stores a set of training images after the one or more hardware processor(s) 104 process the set of images. The memory 102 may also comprise a neural network model for gaze estimation, wherein the neural network model is trained using the set of training images by the one or more hardware processor(s) 104. The database 108 may comprise a repository of neural network models, from which the neural network model may be selected for gaze estimation. Information related to the user, such as height of the user and distance of the user from the camera, may also be stored in the database 108.
[024] The one or more cameras 110 may include one or more stereo cameras such as ZED, Intel® RealSense™ or the like.
[025] FIG. 2 is a flow diagram illustrating a method 200 for gaze estimation to construct human vision to augment reality, according to some embodiments of the present disclosure.
[026] In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in FIG. 1.
[027] Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps to be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.

[028] The method 200 disclosed herein provides gaze estimation to construct human vision to augment reality. Referring to the steps of the method 200, at step 202, one or more hardware processors 104 are configured to display to a user, via the I/O interface 106, a grid comprising a plurality of cells and an object, as illustrated in FIG. 4. The object is in motion across the plurality of cells with a short pause at each of the plurality of cells to allow the user to fixate on the object at predefined intervals. In an embodiment, the object is a circular ball. Any other object of a different shape can be used in alternate embodiments.
[029] Further, at step 204 of the method 200, one or more hardware processors 104 are configured to capture a set of images of the user fixating on the object in motion across the plurality of cells, via the one or more cameras (110).
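For illustration, a minimal sketch of the stimulus-and-capture loop of steps 202 and 204 is given below, assuming an OpenCV window for the grid and a standard webcam; the grid size, ball radius, pause duration and window name are illustrative choices and not values prescribed by the disclosure.

```python
import cv2
import numpy as np

def capture_fixation_images(rows=3, cols=3, cell=200, pause_ms=1500):
    """Display a grid with a moving ball (steps 202/204) and capture one frame
    of the user fixating on the ball at each cell. Grid size and pause
    duration are illustrative assumptions."""
    cam = cv2.VideoCapture(0)                      # default webcam
    captured = []
    for r in range(rows):
        for c in range(cols):
            canvas = np.full((rows * cell, cols * cell, 3), 255, np.uint8)
            # draw the grid lines
            for i in range(rows + 1):
                cv2.line(canvas, (0, i * cell), (cols * cell, i * cell), (0, 0, 0), 1)
            for j in range(cols + 1):
                cv2.line(canvas, (j * cell, 0), (j * cell, rows * cell), (0, 0, 0), 1)
            # draw the object (a circular ball) at the current cell
            center = (c * cell + cell // 2, r * cell + cell // 2)
            cv2.circle(canvas, center, 20, (0, 0, 255), -1)
            cv2.imshow("grid", canvas)
            cv2.waitKey(pause_ms)                  # short pause so the user can fixate
            ok, frame = cam.read()                 # image of the user fixating on the ball
            if ok:
                captured.append((center, frame))
    cam.release()
    cv2.destroyAllWindows()
    return captured
```

The returned (ball position, frame) pairs give, for each captured image, the on-screen point the user was fixating on, which can later serve as the training target.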
[030] Further, at step 206 of the method 200, one or more hardware processors 104 are configured to process the set of images to determine a bounding box around the face of the user in each of the set of images, and identify feature points corresponding to eyes and nose endpoint on the face in each of the set of images. In an embodiment, the bounding box around the face of the user is determined using the Haar cascade classifier available in the OpenCV library. Further, the irises of the eyes are determined using the Haar cascade classifier. Further, the nose endpoint of the user is determined using Multi-task Cascaded Convolutional Networks (MTCNN), available as a Python library. Any other face detection algorithm can be used in alternate embodiments to implement the step 206.
[031] In an embodiment, if the Haar cascade classifier identifies multiple faces in the set of images, the face with the largest bounding box is selected for further processing.
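The following is a hedged sketch of step 206 and the largest-face selection of paragraph [031], assuming the Haar cascade files shipped with opencv-python and the `mtcnn` Python package (one available MTCNN implementation); the cascade parameters are illustrative.

```python
import cv2
from mtcnn import MTCNN   # pip install mtcnn (one MTCNN implementation; an assumption)

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
nose_detector = MTCNN()

def detect_face_features(frame):
    """Return the largest face box, eye boxes and nose endpoint for one captured image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # paragraph [031]: keep the face with the largest bounding box
    x, y, w, h = max(faces, key=lambda b: b[2] * b[3])
    # eye boxes are relative to the face crop
    eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w])
    # MTCNN keypoints include the nose; the detector expects RGB input
    dets = nose_detector.detect_faces(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    nose = dets[0]["keypoints"]["nose"] if dets else None
    return {"face_box": (x, y, w, h), "eye_boxes": eyes, "nose_endpoint": nose}
```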
[032] The Haar cascade classifier also provides a bounding box around each eye, along with the bounding box around the face, in each of the set of images. Further, the iris of each eye is localized by processing the set of images as follows:
i. Calculate the distance between the bounding boxes around the eyes.
ii. Calculate the height and width of the bounding box around each eye, find the midpoint of the bounding box around each eye, and roughly localize the iris at the midpoint of the bounding box around each eye. Estimate the radius of each eye using equation 1 to determine the eye regions.

iii. Detect edges on the eye regions using the Sobel filter (available in the OpenCV library) in both x and y directions with kernel size 3x3. The output of this step is two matrices, each corresponding to the edges detected (in x and y directions) on each of the eye regions.
iv. Calculate the squared magnitude, which represents the gradient of the eye regions, by adding the squares of the two matrices.
v. Calculate the threshold of magnitude according to equation 2.

vi. If the squared magnitude is greater than the threshold of magnitude and the gradient direction satisfies characteristics such as being more horizontal than vertical, then the squared magnitude is accepted as true in the list of gradients to be used in subsequent steps.
vii. A parameter length is calculated by performing a square root of the positions given by the list of gradients to be used in the magnitude squared image.
viii. Calculate the gradient along the X and Y axes according to equation 3.

ix. The eye regions are considered dark if they satisfy equation 4; this helps in identifying the pupil within the eye, which is dark compared to other parts of the eye region:
image < (mean value of image) x 0.8 .... (4)
x. The dilation operator (available in the OpenCV library) is applied to the eye regions so that reflection is removed.
xi. A histogram of the gradients of the pixels in the eye regions is plotted. The region corresponding to the highest gradient is selected as the iris and a circle is drawn around the selected region.
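Because equations (1) to (3) are not reproduced in this text, the following is only an illustrative sketch of the gradient-based localization that steps i to xi describe (Sobel edges, a magnitude threshold, the darkness test of equation (4), dilation against reflections, and selection of the strongest-gradient point); the threshold expression and the way the final point is picked are assumptions, not the filed formulas.

```python
import cv2
import numpy as np

def localize_iris(eye_region_gray):
    """Illustrative gradient-based iris localization for one eye region
    (grayscale crop). The specific thresholds are assumptions."""
    gx = cv2.Sobel(eye_region_gray, cv2.CV_64F, 1, 0, ksize=3)   # step iii: Sobel in x
    gy = cv2.Sobel(eye_region_gray, cv2.CV_64F, 0, 1, ksize=3)   # step iii: Sobel in y
    mag_sq = gx ** 2 + gy ** 2                                   # step iv: squared magnitude
    thresh = mag_sq.mean() + 0.3 * mag_sq.std()                  # stand-in for equation (2)
    strong = mag_sq > thresh                                     # step vi: keep strong gradients
    # step ix (equation 4): pixels darker than 0.8 x mean are pupil candidates
    dark = eye_region_gray < 0.8 * eye_region_gray.mean()
    # step x: dilation so small specular reflections do not break the dark region
    dark = cv2.dilate(dark.astype(np.uint8), np.ones((3, 3), np.uint8))
    candidates = strong & (dark > 0)
    if not candidates.any():
        return None
    # step xi: take the location with the highest gradient among the candidates
    ys, xs = np.where(candidates)
    best = np.argmax(mag_sq[ys, xs])
    return int(xs[best]), int(ys[best])    # iris centre (x, y) within the eye crop
```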

[033] Further, the nose endpoint of the user is determined using Multi-task Cascaded Convolutional Networks (MTCNN), available as a Python library.
[034] Further, at step 208 of the method 200, one or more hardware processors 104 are configured to calculate the distance of the user from the camera 110 as the distance between the nose endpoint and the camera 110. The nose endpoint is considered as the center of the user’s face, and the distance between the nose endpoint and the camera 110 is considered as the distance between the user and the camera 110. Further, the roll, pitch and yaw of the face are calculated. Further, a y-vector is obtained from the ZED point cloud. If the nose endpoint is above the camera, the y-vector is considered positive, and negative otherwise. Further, the height of the user is calculated as the sum of the original height of the user and the y-vector. FIG. 5 illustrates measurements of the height of the user and the distance of the user from the camera, according to some embodiments of the present disclosure.
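A sketch of the distance and height computation of step 208 is given below, assuming the stereo camera exposes a point cloud aligned with the image as an H x W x 3 array of camera-frame XYZ values in metres with Y pointing upward; the exact ZED SDK calls are omitted and the array layout is an assumption.

```python
import numpy as np

def distance_and_height(point_cloud_xyz, nose_px, original_height_m):
    """point_cloud_xyz: HxWx3 array of camera-frame XYZ (metres) aligned with the
    image; nose_px: (x, y) pixel of the nose endpoint; original_height_m: the
    user's measured height. Returns (distance to camera, adjusted height)."""
    px, py = nose_px
    X, Y, Z = point_cloud_xyz[py, px]                      # 3D point at the nose endpoint
    distance = float(np.sqrt(X ** 2 + Y ** 2 + Z ** 2))    # nose endpoint to camera
    # y-vector: positive if the nose endpoint is above the camera, negative otherwise
    y_vector = float(Y)
    height = original_height_m + y_vector                  # height = original height + y-vector
    return distance, height
```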
[035] Further, at step 210 of the method 200, the one or more hardware processors 104 are configured to embed, into a set of blank images, the bounding box around the face of the user, the nose endpoint and the position of iris from each of the set of images, height and pose of the user, and the distance of the user from the camera. The step 210 is explained in detail with the help of steps 302 to 312 of a process 300 illustrated in FIG. 3.
[036] At step 302 of the process 300, the one or more hardware processors 104 are configured to mark the bounding box around the face of the user and the nose endpoint of the user from each of the set of images, into a corresponding image among the set of blank images.
[037] Further, at step 304 of the process 300, the one or more hardware processors 104 are configured to draw a horizontal axis, a vertical axis and a diagonal in each of the corresponding image among the set of blank images, wherein the horizontal axis, the vertical axis and the diagonal intersect at the nose endpoint.
[038] Further, at step 306 of the process 300, the one or more hardware processors 104 are configured to mark position of iris, identified from feature points corresponding to eyes in each of the set of images, into the corresponding image among the set of blank images. The position of iris in each of the set of images, identified in the step 206 of the method 200, is marked onto the corresponding image in the set of blank images.
[039] Further, at step 308 of the process 300, the one or more hardware processors 104 are configured to mark height of the user along the vertical axis in the corresponding image among the set of blank images. In an embodiment, a first end point and a second end point of the vertical axis represent 3 feet and 7 feet respectively.
If H is the height of the user calculated at the step 208, hL is the first endpoint and hU is the second endpoint, then the point on the vertical axis to be marked is calculated according to equation 5.
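Equation 5 is not reproduced in this text extraction. Given the stated endpoints, one consistent reading, offered here only as an assumption and not as the filed formula, is a linear interpolation between the endpoints:

```latex
% Assumed reading of equation (5): linear interpolation of the user's height H (in feet)
% between the first endpoint h_L (representing 3 feet) and the second endpoint h_U (7 feet).
p \;=\; h_L + \frac{H - 3}{7 - 3}\,\bigl(h_U - h_L\bigr)
```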

[040] Further, at step 310 of the process 300, the one or more hardware processors 104 are configured to mark pose of the user along the horizontal axis of the corresponding image among the set of blank images. In an embodiment, a first end point and a second end point of the horizontal axis represent pose corresponding to face turned completely left (0 degrees) and face turned completely right (180 degrees) respectively. The pose of the user is marked in between the first endpoint and the second endpoint according to yaw of the face calculated at step 208.
[041] Further, at step 312 of the process 300, the one or more hardware processors 104 are configured to mark distance of the user from the camera, calculated at the step 208 of the method 200, along the diagonal of the corresponding image among the set of blank images, to get a set of training images. FIG. 6 illustrates an example training image, according to some embodiments of the present disclosure. It shows the bounding box of the face, the position of the iris, and points marked on the horizontal axis, the vertical axis and the diagonal corresponding to a 30-degree pose of the user, a 5 ft height of the user and a distance of 7 ft between the user and the camera, respectively.
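A sketch of the embedding of steps 302 to 312 is shown below, drawing the extracted geometry onto a blank OpenCV canvas; the canvas size, the distance range along the diagonal, and the linear placement of the height mark (3 to 7 feet on the vertical axis) and pose mark (0 to 180 degrees of yaw on the horizontal axis) are assumptions chosen to match the description, not the filed implementation.

```python
import cv2
import numpy as np

def make_training_image(face_box, nose, iris_points, height_ft, yaw_deg, dist_ft,
                        size=480, max_dist_ft=10.0):
    """Embed the per-image features into a blank canvas (steps 302 to 312).
    Canvas size and the distance range are illustrative assumptions."""
    img = np.full((size, size, 3), 255, np.uint8)                    # blank image
    x, y, w, h = face_box
    nx, ny = int(nose[0]), int(nose[1])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 0), 1)         # step 302: face bounding box
    cv2.circle(img, (nx, ny), 3, (0, 0, 0), -1)                      # step 302: nose endpoint
    cv2.line(img, (0, ny), (size, ny), (150, 150, 150), 1)           # step 304: horizontal axis
    cv2.line(img, (nx, 0), (nx, size), (150, 150, 150), 1)           # step 304: vertical axis
    cv2.line(img, (nx - size, ny - size), (nx + size, ny + size),
             (150, 150, 150), 1)                                     # step 304: diagonal through nose
    for ix, iy in iris_points:                                       # step 306: iris positions
        cv2.circle(img, (int(ix), int(iy)), 2, (255, 0, 0), -1)
    # step 308: height mark; vertical axis assumed to span 3 ft (top) to 7 ft (bottom)
    hy = int(np.clip((height_ft - 3.0) / 4.0, 0, 1) * (size - 1))
    cv2.circle(img, (nx, hy), 4, (0, 255, 0), -1)
    # step 310: pose mark; horizontal axis spans 0 to 180 degrees of yaw
    px = int(np.clip(yaw_deg / 180.0, 0, 1) * (size - 1))
    cv2.circle(img, (px, ny), 4, (0, 0, 255), -1)
    # step 312: distance mark along the diagonal, scaled by an assumed maximum range
    t = int(np.clip(dist_ft / max_dist_ft, 0, 1) * (size // 2))
    cv2.circle(img, (nx + t, ny + t), 4, (0, 165, 255), -1)
    return img
```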
[042] At step 212 of the method 200, the one or more hardware processors 104 are configured to train a neural network with the set of training images for gaze estimation to construct human vision to augment reality. In an embodiment, a VGG network available in the TensorFlow framework is trained using the set of training images. Any other neural network capable of learning from images can be trained using the set of training images for gaze estimation.
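A minimal sketch of step 212 is given below, assuming the training images are regressed onto the grid-cell coordinates fixated during capture using a Keras VGG16 backbone; the regression head, loss and hyperparameters are illustrative, since the disclosure only names the VGG network and the TensorFlow framework.

```python
import tensorflow as tf

def build_gaze_model(input_shape=(224, 224, 3)):
    """VGG16 backbone with a small regression head predicting the (x, y)
    gaze target on the grid. Head, loss and optimizer are assumptions."""
    backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                           input_shape=input_shape, pooling="avg")
    out = tf.keras.layers.Dense(2)(backbone.output)   # gaze point (x, y)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer="adam", loss="mse")
    return model

# training_images: N x 224 x 224 x 3 array of embedded images (FIG. 6 style)
# gaze_targets:    N x 2 array of the ball positions shown during capture
# model = build_gaze_model()
# model.fit(training_images, gaze_targets, epochs=20, batch_size=16)
```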
[043] The neural network model trained using the method 200 can be utilized for gaze estimation in different augmented reality applications as explained earlier. It is important to accurately estimate gaze direction in such applications since a small error in gaze estimation can totally degrade user experience by presenting irrelevant or undesired information. For example, in the supermarket scenario explained earlier, if gaze of a customer is not estimated accurately, information of a product different from the one which the customer gazes at (and which the customer is expecting) may be overlaid on the customer’s handheld device. This leads to customer dissatisfaction. Thus, the disclosed method enhances the accuracy, thereby enhancing user experience in augmented reality applications.
[044] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[045] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC
and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[046] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[047] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions,
variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.

[048] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[049] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A method (200) for gaze estimation to construct human vision to augment reality, the method comprising:
displaying (202) to a user, by one or more hardware processors via an I/O interface, a grid comprising a plurality of cells and an object, wherein the object is in motion across the plurality of cells;
capturing (204), by the one or more hardware processors via a camera, a set of images of the user fixating on the object in motion across the plurality of cells;
determining (206), by the one or more hardware processors, a bounding box around face of the user in each of the set of images, and identifying feature points corresponding to eyes and nose endpoint on the face in each of the set of images;
calculating (208), by one or more hardware processors, a distance of the user from the camera, wherein the distance is computed from the nose endpoint of the user to the camera;
embedding (210) into a set of blank images, by one or more hardware processors, the bounding box around the face of the user, the nose endpoint of the user and the position of iris from each of the set of images, height and pose of the user, and the distance of the user from the camera, wherein process of embedding comprises:
marking (302) the bounding box around the face of the user
and the nose endpoint of the user from each of the set of images, into
a corresponding image among the set of blank images;
drawing (304) a horizontal axis, a vertical axis and a diagonal
in each of the corresponding image among the set of blank images,
wherein the horizontal axis, the vertical axis and the diagonal
intersect at the nose endpoint;
marking (306) position of iris, identified from feature points
corresponding to eyes in each of the set of images, into the
corresponding image among the set of blank images;

marking (308) height of the user along the vertical axis in the corresponding image among the set of blank images;
marking (310) pose of the user along the horizontal axis of the corresponding image among the set of blank images; and
marking (312) distance of the user from the camera along the diagonal of corresponding image among the set of blank images, to get a set of training images; and
training (212), by one or more hardware processors, a neural network with the set of training images for gaze estimation to construct human vision to augment reality.
2. The method as claimed in claim 1, wherein the object is in motion across the plurality of cells with a short pause at each of the plurality of cells to allow the user to fixate on the object at predefined intervals.
3. The method as claimed in claim 1, wherein a first end point (hL) and a second end point (hU) of the vertical axis represent 3 feet and 7 feet respectively and wherein height of the user (H) is marked in between the first end point and the second end point at a point determined by following equation:

4. The method as claimed in claim 1, wherein a first end point and a second end point of the horizontal axis represent pose corresponding to face turned completely left and face turned completely right respectively, and wherein the pose of the user is marked in between the first endpoint and the second endpoint according to yaw of the user’s face.
5. A system (100) for gaze estimation to construct human vision to augment reality, the system (100) comprising:

a memory (102) storing instructions; one or more Input/Output (I/O) interfaces (106); a camera (110) for capturing images of a user; and
one or more hardware processors (104) coupled to the memory (102) via the one or more I/O interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to:
display to the user, via the one or more I/O interfaces (106), a grid comprising a plurality of cells and an object, wherein the object is in motion across the plurality of cells;
capture, via the camera (110), a set of images of the user fixating on the object in motion across the plurality of cells;
determine a bounding box around face of the user in each of the set of images, and identify feature points corresponding to eyes and nose endpoint on the face in each of the set of images;
calculate distance of the user from the camera as distance between the nose endpoint and the camera;
embed information related to each of the set of images into a set of blank images wherein embedding comprises:
marking the bounding box around the face of the user
and the nose endpoint of the user from each of the set of
images, into a corresponding image among the set of blank
images;
drawing a horizontal axis, a vertical axis and a
diagonal in each of the corresponding image among the set
of blank images, wherein the horizontal axis, the vertical axis
and the diagonal intersect at the nose endpoint;
marking position of pupils, identified from feature
points corresponding to eyes in each of the set of images,
into the corresponding image among the set of blank images; marking height of the user along the vertical axis in
the corresponding image among the set of blank images;

marking pose of the user along the horizontal axis of the corresponding image among the set of blank images; and
marking distance of the user from the camera along the diagonal of corresponding image among the set of blank images, to get a set of training images; and
train a neural network with the set of training images for gaze estimation to construct human vision to augment reality.
6. The system as claimed in claim 5, wherein the object is in motion across the plurality of cells with a short pause at each of the plurality of cells to allow the user to fixate on the object at predefined intervals.
7. The system as claimed in claim 5, wherein a first end point (hL) and a second end point (hU) of the vertical axis represent 3 feet and 7 feet respectively and wherein height of the user (H) is marked in between the first end point and the second end point at a point determined by following equation:

8. The system as claimed in claim 5, wherein a first end point and a second
end point of the horizontal axis represent pose corresponding to face turned
completely left and face turned completely right respectively, and wherein
the pose of the user is marked in between the first endpoint and the second
endpoint according to yaw of the user’s face.

Documents

Application Documents

# Name Date
1 202121010969-STATEMENT OF UNDERTAKING (FORM 3) [15-03-2021(online)].pdf 2021-03-15
2 202121010969-REQUEST FOR EXAMINATION (FORM-18) [15-03-2021(online)].pdf 2021-03-15
3 202121010969-PROOF OF RIGHT [15-03-2021(online)].pdf 2021-03-15
4 202121010969-FORM 18 [15-03-2021(online)].pdf 2021-03-15
5 202121010969-FORM 1 [15-03-2021(online)].pdf 2021-03-15
6 202121010969-FIGURE OF ABSTRACT [15-03-2021(online)].jpg 2021-03-15
7 202121010969-DRAWINGS [15-03-2021(online)].pdf 2021-03-15
8 202121010969-DECLARATION OF INVENTORSHIP (FORM 5) [15-03-2021(online)].pdf 2021-03-15
9 202121010969-COMPLETE SPECIFICATION [15-03-2021(online)].pdf 2021-03-15
10 202121010969-FORM-26 [22-10-2021(online)].pdf 2021-10-22
11 Abstract1.jpg 2022-02-17
12 202121010969-FER.pdf 2022-09-27
13 202121010969-OTHERS [09-12-2022(online)].pdf 2022-12-09
14 202121010969-FER_SER_REPLY [09-12-2022(online)].pdf 2022-12-09
15 202121010969-CLAIMS [09-12-2022(online)].pdf 2022-12-09
16 202121010969-PatentCertificate21-02-2024.pdf 2024-02-21
17 202121010969-IntimationOfGrant21-02-2024.pdf 2024-02-21

Search Strategy

1 202121010969E_26-09-2022.pdf

ERegister / Renewals

3rd: 27 Feb 2024

From 15/03/2023 - To 15/03/2024

4th: 27 Feb 2024

From 15/03/2024 - To 15/03/2025

5th: 12 Feb 2025

From 15/03/2025 - To 15/03/2026