
System and Method for Detecting and Tracking Hand-Object Interaction

Abstract: The system and method of the present disclosure provide fast, accurate and simultaneous detection and tracking of a hand, an object and the interaction between them. The system makes use of vision based data and hand glove based data to aptly detect the hand-object pose using a combination of detectors, followed by seamless tracking of the hand and object that makes the overall process computationally more efficient by exploiting temporal frames. The tracker is monitored for accuracy and corrected for any reported discrepancy. The exact coordinates of the hand and object are obtained from joint estimation of the 6DoF pose, which serves as valuable feedback for enhancing skill in any given trade.


Patent Information

Application #
Filing Date
17 March 2023
Publication Number
16/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Email
Parent Application

Applicants

Dimension NXG Pvt. Ltd.
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604, India

Inventors

1. Abhishek Tomar
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604, India
2. Pankaj Raut
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604, India
3. Abhijit Patil
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604, India
4. Yukti Suri
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604, India
5. Divyam Vipul Sheth
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604, India
6. Purwa Rathi
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604, India

Specification

DESC:FIELD OF THE INVENTION
Embodiments of the present invention relate to a system and method for detecting and tracking hand-object interaction, and more particularly to a system and method for detecting and tracking the interaction between a hand and an object to be manipulated in a virtual reality environment.
BACKGROUND OF THE INVENTION
Human hands are one of the most remarkable feats of evolution, having evolved to take on and perform widely varying, complex and arduous tasks. Whether reflecting instructions received from the cognitive mind, responding to external events or simulating an act, human hand articulation is an interesting phenomenon that has long intrigued researchers. In today's world, the use of hand gestures forms an integral part of human-computer interaction and is widely studied to understand its innate structural complexity.
Hand-object interactions are ubiquitous in our daily lives and form a major part of the way humans interact with complex real-world scenes. With advances in simulation techniques, reconstructing human-object interaction in 3D has become crucial, and the significance of comprehending hand motion has gained much attention. Robust and detailed 3D tracking of hands, of objects being manipulated by hands and of kinematic structures in 3D space has been a focus of research in recent years.
Despite extensive efforts, limited success has been achieved in capturing hand articulation due to many reasons – multiple degrees of freedom, dexterous nature, partial occlusion, inter- and self-occlusion issues, low spatial resolution or low temporal resolution owing to fast movements that are not adequately captured by existing devices, variation in hand appearance under different viewing angles, and the like.
Gaining a detailed understanding of hand movements and manoeuvrability is highly desirable in simulation environments (e.g. training scenarios or the gaming world) where the user manipulates real-world objects to operate and interact within the virtual world. Existing virtual reality/augmented reality systems limit the user experience and do not allow full immersion in the real world. Where interaction is enabled, it is coarse, inaccurate and cumbersome, and it interferes with the natural movement of the user. Such cost, complexity and convenience considerations limit the deployment and use of VR/AR technology, which otherwise has a strong use case in almost every segment of human experience, development and overall well-being.
Beyond the complexities involved in seamlessly capturing hand articulation, a daunting challenge lies in measuring the handling of real-world objects by hands in a virtual world. For example, in a typical training scenario where the user is provided with a virtual reality device for enhancing his skills in a particular trade, he may have to interact with multiple tools of the trade, each having variable constraints and operable parameters associated with it.
To accomplish a robust, precise and immersive user training experience, it is imperative that a user is made aware of his positional relationship with the tool, along with other factors such as the appropriate position and orientation in which the tool should be held, the position on the tool where the user's fingers and thumb should be placed, or the firmness with which the tool should be handled. Various other important aspects must also be accounted for to ensure lifelike interaction of the user with the objects.
Currently, very limited to no success has been achieved in the ability to capture the interaction between the user's hand and the interacting object such that the exact position of the hand with respect to the tool, the firmness with which the tool is held, the grip status or other such features can be adequately mapped. These features constitute vital parameters of hand-object interaction that are integral to providing improved, quality and measurable feedback to the user in his endeavour to upskill himself.
Recent known methods for estimating the 3D location and orientation of objects or tools held by the user rely on template matching techniques or CNN architectures. However, these approaches suffer from limitations with respect to background clutter, lack of depth information and the availability of only small-scale datasets with limited variation, which leads to inaccurate generalization. Further, single-frame estimations are computationally intensive, which paved the way for tracking of the hand-object pose over 3D hand pose estimation from single RGB images.
However, joint detection and tracking of human-object interaction is an under-explored area. Existing solutions that employ simultaneous detection and tracking take the aid of externally installed sensors for capturing object motion, which makes these systems identify user-object interaction from an outside-world perspective and not from the user's perspective. Other global distractions such as occlusions, truncations and background clutter, along with the limitations of achieving seamless tracking from an egocentric perspective (primarily relevant in virtual reality training and simulation environments), have further decelerated wider adoption of AR/VR technology in real-world scenarios.
In this vein, the present disclosure sets forth a system and method for jointly detecting and tracking human-object interaction, including 6DoF pose estimation of the hand jointly with the object interacting therewith, even under occlusions and truncations; the system and method embody advantageous alternatives and improvements to existing detection and tracking approaches, may address one or more of the challenges or needs mentioned herein, and provide other benefits and advantages.
OBJECT OF THE INVENTION
An object of the present invention is to provide a system and method for detecting and tracking hand-object interaction.
Another object of the present invention is to provide an efficient, fast and accurate system and method for simultaneous detection and tracking of the hand and object.
Yet another object of the present invention is to provide a computationally efficient system and method for simultaneous detection and tracking of hand-object interaction that eliminates the need for any higher-end hardware.
Yet another object of the present invention is to provide a less data-hungry yet more precise 6DoF pose estimation system and method that works optimally for all kinds of objects, including those with smooth, textureless, rounded or uniform surfaces.
Yet another object of the present invention is to provide more robust and seamless simultaneous tracking of the hand, the object held in the hand and the interaction thereof, such that the exact coordinates of the position at which the object is held may be determined for real-world applications.
Yet another object of the present invention is to provide a system and method that enables determination of the firmness with which the object is held by the hand, by use of IMU based hand gloves, to provide valuable feedback to the user.
SUMMARY OF THE INVENTION
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
In accordance with one aspect of the disclosure, a method for detecting and tracking hand-object interaction within a virtual environment is proposed. The method comprises the steps of: capturing image data of the hand and of the object held within the hand, along with pose data of the hand. This is followed by detecting, using a first detector, a first object pose from the image data based on localization of object keypoints. Next, a second object pose is detected from the pose data of the hand using a second detector. A final pose is then computed from the first detector (155) and the second detector (156) based on comparison with a predetermined threshold value. These steps are followed by tracking the hand-object interaction based on the final pose. The tracking process is monitored for accuracy, and in an event inaccurate tracking is reported, the tracking process is re-seeded with a corrected final pose.
In accordance with another aspect of the disclosure, a system for detecting and tracking hand-object interaction within a virtual environment is provided. The system comprises a head mounted display configured to capture image data of the hand and of the object held within the hand; a hand glove configured to capture pose data of the hand; and a computing module, implemented upon a processor, configured with (a) a first detector to detect a first object pose from the image data based on localization of object keypoints and (b) a second detector to detect a second object pose from the pose data of the hand. The computing module is configured to compute a final pose from the first detector and the second detector based on comparison with a predetermined threshold value. Finally, a tracker is provided for tracking the hand-object interaction based on the final pose and is continually monitored for accuracy; in an event of inaccurate tracking, the tracker is re-seeded with a corrected final pose.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope; the invention may admit to other equally effective embodiments.
These and other features, benefits and advantages of the present invention will become apparent by reference to the following text and figures, with like reference numbers referring to like structures across the views, wherein:
Fig. 1 illustrates a block diagram of system comprising head mounted display along with hand gloves worn by user for estimating 6DoF pose of hand-object interaction, in accordance with an embodiment of the present invention.
Fig. 2 illustrates flow diagram for achieving 6DoF pose estimation of hand-object interaction, in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
While the present invention is described herein by way of example using embodiments and illustrative drawings, those skilled in the art will recognize that the invention is not limited to the embodiments of drawing or drawings described and are not intended to represent the scale of the various components. Further, some components that may form a part of the invention may not be illustrated in certain figures, for ease of illustration, and such omissions do not limit the embodiments outlined in any way. It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the scope of the present invention as defined by the appended claims.
As used throughout this description, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Further, the words "a" or "an" mean "at least one" and the word "plurality" means "one or more" unless otherwise mentioned. Furthermore, the terminology and phraseology used herein is solely used for descriptive purposes and should not be construed as limiting in scope. Language such as "including," "comprising," "having," "containing," or "involving," and variations thereof, is intended to be broad and encompass the subject matter listed thereafter, equivalents, and additional subject matter not recited, and is not intended to exclude other additives, components, integers or steps.
Likewise, for applicable legal purposes, the term "comprising" is considered synonymous with the terms "including" or "containing", "hand gloves" is replaceable with "haptic gloves", and "vision sensors" refers to the sensors positioned in the head mounted display. Any discussion of documents, acts, materials, devices, articles, and the like is included in the specification solely for the purpose of providing a context for the present invention. It is not suggested or represented that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention.
In this disclosure, whenever a composition or an element or a group of elements is preceded with the transitional phrase "comprising", it is understood that we also contemplate the same composition, element or group of elements with the transitional phrases "consisting of", "consisting", "selected from the group consisting of", "including", or "is" preceding the recitation of the composition, element or group of elements, and vice versa.
The present invention is described hereinafter by various embodiments with reference to the accompanying drawings, wherein reference numerals used in the accompanying drawing correspond to the like elements throughout the description. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiment set forth herein. Rather, the embodiment is provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art.
In the following detailed description, numeric values and ranges are provided for various aspects of the implementations described. These values and ranges are to be treated as examples only and are not intended to limit the scope of the claims. In addition, a number of materials are identified as suitable for various facets of the implementations. These materials are to be treated as exemplary and are not intended to limit the scope of the invention.
In accordance with one general embodiment of present disclosure, the present system and method are directed to simultaneous detection and tracking of 6DoF (3 for translation and 3 for rotation) pose of hand, object held by the hand and interaction therebetween. The accurate detection and seamless tracking of both hand and object is important in precisely determining the position at which the object is held by the hand. Along with the position of hand on the object, the force/grip with which the object is held is a vital piece of information for providing valuable feedback to the user involved in upskilling himself for a particular trade.
While many skilled in the art have explored possibilities of detection and tracking, little success has been achieved in 6DoF pose estimation of an object held in the user's hand, let alone in assessing the manner of manipulation. The known approaches have proven to be computationally expensive, besides requiring high power, cost and a large amount of training data to learn detection and tracking of objects accurately. Further, achieving detection and tracking of the user's hand and the object held is a major milestone, especially when performed by user wearables alone (egocentric perspective), without the aid of any external cameras or sensors.
Vision based systems often fail to accurately locate objects when used alone, unaided by any externally positioned sensors or other added body-tracking wearables. Such tracking failures are observed in situations where, for example, the object gets occluded by the user's hand interacting with it, the object moves rapidly and slips out of the camera's field of view, or multiple background objects interfere with the target object, among many others. The present disclosure compensates for this detection and tracking error by providing a combination of both vision based and hand glove based detection and tracking for accurate and precise pose estimation.
Against the background of the above limitations, the present disclosure attempts to detect and track the user's hand and the object held within the hand simultaneously, particularly from an egocentric perspective, to enable the user to single-handedly manage effective handling of the object/tool manipulated by him. In situations of occlusions and truncations, and even for textureless objects (where detection and tracking are ordinarily difficult to achieve), the present system has been able to achieve remarkable progress in terms of seamless detection and tracking in real time for real-world applications. While the application has unlimited use cases, the present disclosure will occasionally cite references from training and learning sessions. However, the disclosure is in no way limited to a particular use case, and has wide applicability.
In accordance with the first embodiment, the system and method of the present disclosure are configured to detect and track the 6DoF pose of any kind of object or tool held by the user to train himself in a particular trade, along with hand tracking, from a combination of approaches. In one exemplary embodiment, according to Fig. 1, the system 1000 comprises a head mounted display (HMD) 100 that further comprises a plurality of sensors 50 including, though not limited to, RGB sensors 50a, RGBD sensors 50b, IR trackers 50c, IMU sensors 50d and the like, along with a computing module 150 to process the feed obtained from the plurality of sensors 50 optimally positioned on the HMD 100. The vision based sensors 50 such as RGB 50a or RGBD 50b are effective in achieving robust vision based tracking of the hand 10 as well as of objects 20 being held by the hand 10 for reconstructing hand-object interaction even under severe occlusion or truncation.
However, as the object 20 is being held by the user's hand 10, it needs to be accurately localized, classified and detected for the correct 6DoF pose (orientation and translation estimated in 3D space) and thereafter corrected for appropriate handling and manipulation. The object pose has to be computed in the face of manifold challenges and global distractions such as occlusions, truncations, background clutter, lighting and appearance, texture and the like. Primarily, the aim is to understand and determine the position from where the object 20 is held/grabbed. Unlike contemporary approaches where the object 20 may be detected and tracked from external imaging devices, the real challenge lies when the object has to be viewed, detected and tracked from the user's egocentric perspective, without any latency issues and with utmost precision.
In one exemplary scenario, an image of the object 20 along with the user's hand 10 is captured by RGB/RGBD cameras disposed on the head mounted display 100 worn by the user. In addition, the system 1000 comprises hand-worn haptic gloves 200 that communicate with the HMD 100 to provide the initial hand pose to the computing module 150. This primary image data captured by the RGB/RGBD cameras of the HMD 100, as well as the hand pose data from the hand gloves 200 with respect to the user's hand 10 and the object 20 held by the user, is transmitted to the computing module 150, which is configured to compute the 6DoF pose of the hand, the object held, and the precise position at which the object is held. In one example embodiment, the computing module 150 may be housed inside the head mounted display 100, or may alternately reside externally within computing hardware.
As indicated above, since there may be situations where the egocentric viewpoint poses issues of heavy inter- and self-hand occlusion, or of object parts being hidden behind the user's hand 10, the present disclosure proposes acquiring, in real time, object image data using the vision based sensors 50 positioned on the head mounted display 100 and the user hand pose using the haptic gloves 200, which are likewise studded with a plurality of sensors 250. This hand pose includes the orientation and position of the user's hand 10 relative to the camera of the head mounted display 100.
For estimating the 6DoF pose, the process begins with object detection, wherein the computing module 150, comprising the first detector 155 and the second detector 156, performs an initial estimation of the 3D location and orientation of the object 20 using a combination of the two detectors with their respective approaches. In the first approach, the first detector 155 derives the object pose (orientation and translation in 3D space) from the image data obtained from the vision sensors 50 and makes use of a vector field representation for detecting and localizing object keypoints, even those lying outside the input image.
Thereafter, the hand position and the object are detected by making pixel-wise predictions for 2D keypoints, whereby a vector field representation is generated for image keypoint localization. This solution offers much more flexibility in handling truncated objects, cluttered backgrounds, and lighting and appearance limitations, besides making dense predictions for keypoint localization of both visible and invisible parts of the object, in contrast to prior art solutions in which object keypoint coordinates are directly regressed.
Elaborating upon the first approach, local 2D keypoints for the captured images are first efficiently detected by any existing approach. For example, Convolutional Neural Networks (CNNs) may be used for 2D keypoint detection. To achieve keypoint localization, object semantic labels and vectors for each pixel of the captured images are determined, the vectors being representative of the directions from each image pixel towards the detected 2D keypoints.
Thus, for a pixel p, its semantic label associates it with a specific object, and the vector v(p) represents the direction from the pixel p to a 2D keypoint x of the object. The vector based representation enables inferring even hidden aspects, as it focuses on local features of the image (hand and object) along with the spatial relations between them, and capably estimates keypoints of unseen aspects of the image from directions derived from other visible parts of the image. This vector may be an offset between the pixel p and the keypoint x. By using semantic labels and offsets, target object pixels may be obtained.
Accordingly, hypotheses of 2D locations for keypoints are generated based on the vector directions from pixels (p) to keypoints (x), in order to determine the uncertainties of the predicted keypoint locations and improve pose estimation. To generate a set of hypotheses, any two pixels may be selected and the intersection of their corresponding vectors taken as a hypothesis for keypoint x; this may be repeated N times to obtain a set of hypotheses that represents possible keypoint locations and their probability distribution in the image.
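The short sketch below illustrates this pairwise-voting idea in NumPy; it is a minimal sketch under the assumption of a per-pixel unit-vector field and a semantic mask as inputs, and the function name and simple ray-intersection scheme are illustrative rather than a prescribed implementation.

```python
import numpy as np

def vote_keypoint_hypotheses(vectors, mask, num_hypotheses=128, rng=None):
    """Illustrative RANSAC-style voting for one keypoint.

    vectors: (H, W, 2) unit vectors predicted per pixel, pointing from each
             pixel towards the keypoint.
    mask:    (H, W) boolean semantic mask of the target object.
    Returns an (N, 2) array of 2D keypoint hypotheses obtained by intersecting
    the voting directions of randomly chosen pixel pairs.
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)   # (M, 2) pixel coords
    dirs = vectors[ys, xs]                                    # (M, 2) voting directions

    hypotheses = []
    while len(hypotheses) < num_hypotheses:
        i, j = rng.choice(len(pixels), size=2, replace=False)
        p1, d1 = pixels[i], dirs[i]
        p2, d2 = pixels[j], dirs[j]
        # Solve p1 + t1*d1 = p2 + t2*d2 for the intersection of the two rays.
        A = np.stack([d1, -d2], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:     # skip near-parallel directions
            continue
        t = np.linalg.solve(A, p2 - p1)
        hypotheses.append(p1 + t[0] * d1)
    return np.asarray(hypotheses)
```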
To address the confidence levels and uncertainty patterns of the localized keypoints, the mean and covariance of the spatial probability distribution for each keypoint in the image are computed based on the generated hypotheses. With the estimated mean and covariance matrix, the 3D coordinates of the keypoints are obtained with minimized projection errors as correspondence is established between the 2D keypoints and the 3D coordinates. The keypoints are preferably selected on the object surface in the image to limit localization errors, rather than following the known approach of choosing bounding box corners of the corresponding 3D model as keypoints.
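A simplified follow-on sketch, assuming OpenCV is available: each keypoint's hypothesis cloud is collapsed to its mean (its covariance is also computed as an uncertainty estimate), and a standard PnP solver recovers the 6DoF pose from the 2D-3D correspondences. An uncertainty-weighted PnP, as implied above, would additionally feed the covariances into the solver; this sketch keeps the plain EPnP variant for brevity.

```python
import numpy as np
import cv2

def estimate_pose_from_hypotheses(hypotheses_per_kp, model_points, camera_matrix):
    """hypotheses_per_kp: list of (N, 2) arrays, one per object keypoint.
    model_points:         (K, 3) corresponding 3D keypoints on the object model.
    camera_matrix:        (3, 3) camera intrinsic matrix.
    Returns the PnP success flag, rotation/translation vectors and the
    per-keypoint 2x2 covariance matrices (the uncertainty estimates)."""
    means = np.array([h.mean(axis=0) for h in hypotheses_per_kp])   # keypoint means
    covs = [np.cov(h.T) for h in hypotheses_per_kp]                 # keypoint uncertainties
    ok, rvec, tvec = cv2.solvePnP(
        model_points.astype(np.float64),
        means.astype(np.float64),
        camera_matrix.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    return ok, rvec, tvec, covs
```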
This enables efficient, fast and accurate object detection in 3D space, optimally factoring in occlusions and truncations for real-world applications. Since direct localization of objects in 3D is difficult due to missing depth information, the objects are first localized in the 2D image and their depth is then predicted to obtain 3D coordinates, as 2D keypoint detection is relatively easier than 3D localization and rotation estimation. Such pixel-wise prediction of 2D keypoints using a vector field representation makes the handling of truncated objects much more flexible and easier.
Typically, even with optimal detection of the object pose from the first approach, certain unique challenges remain for 6DoF pose estimation: although the detection works precisely even for smooth, textureless objects, it comes with other challenges of high computational cost and a steep learning curve for training with a wide range of objects, along with the existing limitations of truncations and occlusions. Furthermore, in many practical applications, especially in crowded scenes, the detection result of the detector is usually not accurate enough due to interaction between objects, appearance similarity and frequent occlusion of the objects, thereby seriously affecting the accuracy and performance of 6DoF pose estimation.
To overcome the limitations of the first detector using the first approach, the present disclosure complements the first detector 155, configured to detect the object pose from image data (first approach), with a second detector 156 configured to detect the object pose using hand pose data obtained from the haptic gloves 200 (second approach). The combined approach of the first detector 155 and the second detector 156 helps to predict the 6DoF pose (rotation and translation) and precisely determine the hand position over the object 20 along with the exact coordinates from where the object 20 is held. With the detection of the hand 10 and the object 20 held by the hand 10 using the first detector 155, complemented by the detection from the hand pose using the second detector 156, the system 1000 is able to achieve high-level detection of objects 20 under all levels of occlusion, truncation, changed illumination, and other such factors.
The combination of the two detectors 155, 156 for object pose detection is not only effective, it is also applicable to a wide range of objects including textureless surfaces, works under all light conditions including improper illumination, and even provides natural estimates of how the object 20 is held by the user's hand 10. For the second detector 156, the system 1000 utilizes hand pose data from the hand gloves 200. The system 1000 thus caters to situations where the object 20 gets occluded or moves beyond the field of view of the vision sensors 50 of the HMD 100, where vision based tracking becomes prone to failure as the hands obscure the object from the cameras during manipulation.
Accordingly, in accordance with one preferred embodiment, the haptic gloves 200, capable of tracking movements of the hands, objects held in the hand and their complex interplay in real time and of providing the direct user hand pose with utmost precision, are provided. Primarily, the hand gloves 200 are provided with specialized sensors 250 in the form of an ultrasonic array disposed throughout the hand on the inner surface of the glove 200. These sensors 250 produce ultrasonic waves that generate a focused pressure point at a particular distance from the ultrasonic array.
Most importantly, hand gloves 200 studded with an array of ultrasonic sensors are selected mainly for reasons of effective performance, as these sensors remain more robust to external radiation, illumination parameters, insufficient lighting, excessive movement or any other external noise factor that usually leads to compromised performance in many other existing glove sensors such as IMU based gloves, stretch sensor based gloves and other sensor data gloves. Further, they can effectively overcome self-occlusion based problems while providing high accuracy.
The focused pressure point generated by the distributed ultrasonic sensor array replicates the counter-force that would be applied by an object 20 on the tip of the finger, thereby simulating the feeling of touching the object 20 with the fingertip. The frequency and/or intensity of the ultrasonic array can be changed to produce different pressures that can simulate different forces and/or textures or materials. For example, a higher intensity can be used to create a higher pressure, which may simulate a harder, more rigid surface, whereas a lower intensity can be used to create a lower pressure, which may simulate a softer surface.
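As a purely illustrative sketch of this intensity-to-stiffness idea, the material names, numeric values and function below are hypothetical and not taken from the disclosure; they merely show how simulated surface hardness could be mapped to drive parameters of the ultrasonic array.

```python
# Hypothetical mapping from a simulated material to ultrasonic drive settings.
MATERIAL_PROFILES = {
    "rigid_metal":  {"intensity": 1.00, "frequency_hz": 200},
    "hard_plastic": {"intensity": 0.70, "frequency_hz": 170},
    "soft_rubber":  {"intensity": 0.35, "frequency_hz": 120},
}

def haptic_command(material: str, contact_distance_mm: float) -> dict:
    """Return drive parameters for the ultrasonic array: higher intensity
    simulates a harder surface, lower intensity a softer one."""
    profile = MATERIAL_PROFILES.get(material, MATERIAL_PROFILES["hard_plastic"])
    return {"focal_distance_mm": contact_distance_mm, **profile}
```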

Further, the hand gloves 200 are configured with linear haptic sensors that utilize vibrations to imitate the sensations of holding the object 20, gripping it tightly or loosely, squeezing a trigger, and other subtleties and complexities of hand-object interaction. With such additional sensitivity, the hand glove 200 allows a user to touch, feel and hold an object in real time, conveying information on exactly where the object 20 is being held by the hand 10. In some examples, one or more functions and features of the hand glove 200 may allow a user to interact with the simulated virtual environment without a camera and with less obstruction.
In some examples, the hand glove may be configured with flex sensors to detect bending or flexing of the hand and/or fingers, and/or an accelerometer to detect motion of the hand. The combination of these sensors 250 and the like provides better feedback to the user when manipulating the object 20 in the hand 10. The hand glove 200 can be a fabric glove, wherein one or more (e.g., all) of the electronic components can be knitted or woven into the glove. In some examples, the fabric may be stitched together using conductive thread.
Most importantly, the plurality of sensors disposed on the hand gloves enable determination of the force with which the object is held by the hand. This is instrumental in any kind of training, as the firmness with which the object is held is highly characteristic of the manner in which the training is performed. For example, feedback on the firmness with which a spray gun is held, and the position where it is held by the user during a spray painting training session, is vital information for a trainee to perform his task adequately. This critical information is provided by the hand glove 200 based sensors 250 and serves as feedback to the user for correctively adjusting his hand-object interplay.
With the input of the hand pose from the ultrasonic and other sensor-studded hand gloves 200, the second detector 156 provides a closer approximation of the object's actual position and orientation in 3D space. In one working embodiment, the second detector 156 makes use of existing datasets of complex human-object interaction that contain full-body motion, 3D body shape and detailed body-object contact for generating plausible grasps for unseen 3D objects. Based on an understanding of the dynamics of whole-body grasps, various other aspects such as full-body user motion, object motion and in-hand manipulation can be inferred, based on which an accurate hand pose of the user along with relative hand-object configurations may be estimated at all times without any interruption. To initialize the object pose, this accurate hand pose is chosen as a heuristic.
Accordingly, in one working embodiment, the second detector 156 makes use of a conditional variational autoencoder for determining an initial grasp and then estimates the final grasp by refinement based on a trained neural network. The present system 1000 has utilized the GRAB dataset containing full 3D shapes and pose sequences to compute contact between the hand 10 and the object 20. The second detector 156 is thus configured to infer the pose and orientation of objects based on how they are grasped by the human hand 10.
Capturing hand-object contact is hard, because the hand and the object heavily occlude each other. In order to capture the hand manipulating the object in real time along with the associated interactions, a GRAB dataset based generative conditional model of object grasping is used that combines coarse prediction and refinement for initial hand pose estimation. The second detector 156 computes an initial estimate of the object's 6DoF pose from a given hand grasp, and then refines this estimate to produce a precise object pose. This architecture leverages the rich interactions captured in the GRAB dataset, which contains detailed 3D meshes of humans interacting with various objects, capturing full-body motion and hand poses.
The accurate hand pose acquisition is followed by an initial pose estimation whereby detailed hand pose parameters (position, orientation, and full finger articulations) are recorded alongside the grasp configuration. This comprehensive input captures the intricacies of how the hand 10 interacts with the object 20, providing a rich context for predicting the object's pose. To accommodate the enriched input, the second detector 156 utilizes a more complex conditional Variational Autoencoder (cVAE) architecture. This cVAE is designed to encode both hand pose and grasp configuration into a unified latent space. The encoder part of the network processes the hand pose and grasp parameters to generate a latent representation.
This representation captures the essential features of the interaction between the hand and the object. The decoder then uses this latent code, conditioned on the hand pose and grasp information, to predict the object's initial 6DOF pose. The architecture ensures that both the static aspects of the grasp and the dynamic aspects of the hand pose are factored into the pose estimation process. With the enriched context provided by the detailed input, the second detector further refines this initial pose estimate using the trained neural network. It adjusts the pose to align more closely with the actual object orientation and position as dictated by the hand's manipulation, ensuring a higher fidelity to real-world interactions.
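A minimal PyTorch sketch of such a grasp-conditioned cVAE is given below; the class name, layer sizes and input dimensions are assumptions made for illustration, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class GraspConditionedPoseCVAE(nn.Module):
    """Sketch: encode hand pose + grasp configuration with the object pose into
    a latent space, and decode a 6DoF object pose conditioned on the grasp."""

    def __init__(self, hand_dim=63, grasp_dim=16, latent_dim=32, pose_dim=7):
        super().__init__()
        cond_dim = hand_dim + grasp_dim
        self.encoder = nn.Sequential(
            nn.Linear(cond_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),      # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),            # translation (3) + quaternion (4)
        )
        self.latent_dim = latent_dim

    def forward(self, hand_pose, grasp, object_pose):
        cond = torch.cat([hand_pose, grasp], dim=-1)
        stats = self.encoder(torch.cat([cond, object_pose], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        return self.decoder(torch.cat([z, cond], dim=-1)), mu, logvar

    @torch.no_grad()
    def predict(self, hand_pose, grasp):
        """Inference: sample a latent code and decode an initial 6DoF object pose."""
        cond = torch.cat([hand_pose, grasp], dim=-1)
        z = torch.randn(cond.shape[0], self.latent_dim, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))
```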
Now, once the first estimate of the object pose from the first detector 155 and the second estimate of the object pose from the second detector 156 are obtained, a weighted approach is adopted to determine which object pose estimation is more accurate, more precise and of higher confidence against the ground truth. It is well understood that while the first detector 155 may fail to be pinpoint correct under occlusions, truncations, field-of-view limitations and poor lighting conditions, the second detector 156 may suffer from the limitation of overly natural estimations. This means that when the second detector 156 predicts the object pose from the hand pose input, it makes very natural estimates of how the object 20 would be held in an ideal scenario, since the model is trained for specific objects and the way they are handled.
However, it may be that the user is not holding the object correctly, or holds it unnaturally. Say, for example, the user is holding the welding gun from its front rather than its rear; the second detector 156 will fail to provide estimations, as it is trained only for making estimations of objects that are correctly handled. Hence, the training of the second detector 156, though accurate, is of a limited nature and may pose a challenge when solving the real-world problem of object handling.
Next, the computing module 150 determines the weighted average of the object pose estimations from the first detector 155 and the second detector 156 based on their confidence scores and compares it with a predetermined threshold to find the most accurate initial object pose for the first frame. This step is important in order to have an initially correct pose as the right feed for the subsequent tracking process performed by a tracker 160. The tracker 160 is operable by the computing module 150 and is configured to take the weighted object pose as the starting point for performing the object tracking. This involves setting the initial conditions of the object tracking right for seamless tracking of the following frames.
In a noteworthy embodiment, the object pose estimate having the higher confidence score in the overall computation is selected. However, if both the first detector 155 and the second detector 156 exhibit high confidence scores in their pose estimations under static external conditions, but the poses obtained from them are distant from each other, the computing module 150 gives precedence to the first detector 155, since the second detector 156 is only capable of providing estimates for objects that are most naturally and correctly held by the hand. However, in the event of varying or non-static outside conditions, in the presence of multiple objects overlapping with each other, or where possibilities of truncation are evident, the computing module 150 gives precedence to the second detector 156 for being more robust and resilient to external environmental factors.
In an event of both the first detector 155 and the second detector 156 returning low confidence scores, the computing module 150 re-initializes the first detector 155 and the second detector 156 for obtaining a correct feed. Using the above computed object pose, a much smoother tracking process can be initiated, as a better starting point with known initial conditions is provisioned, which enables improved tracking accuracy and stability, especially for the initial frames, a feature and capability not enabled by existing hand-object tracking solutions.
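The selection rules just described could look roughly like the following sketch; the threshold values and the scalar pose-distance test are illustrative assumptions (the disclosure only specifies comparison against a predetermined threshold), and the function name is hypothetical.

```python
import numpy as np

CONF_THRESHOLD = 0.6        # illustrative confidence threshold
AGREEMENT_THRESHOLD = 0.05  # illustrative limit on distance between the two estimates

def fuse_poses(pose_a, conf_a, pose_b, conf_b, static_scene: bool):
    """Fuse the vision-based estimate (a) and the glove-based estimate (b):
    weighted average when they agree, precedence rules when they disagree,
    and a re-initialisation signal when neither detector is confident."""
    if conf_a < CONF_THRESHOLD and conf_b < CONF_THRESHOLD:
        return None, "reinitialise_detectors"

    disagreement = np.linalg.norm(pose_a - pose_b)
    if (disagreement > AGREEMENT_THRESHOLD
            and conf_a >= CONF_THRESHOLD and conf_b >= CONF_THRESHOLD):
        # Confident but distant estimates: vision detector wins under static
        # conditions, glove-based detector wins in cluttered/dynamic scenes.
        return (pose_a, "first_detector") if static_scene else (pose_b, "second_detector")

    weights = np.array([conf_a, conf_b]) / (conf_a + conf_b)
    fused = weights[0] * pose_a + weights[1] * pose_b
    return fused, "weighted_average"
```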
However, there may still be situations where the tracker 160 fails abruptly. The solution aptly compensates for such incidental situations by reversing the impact of failed tracking before it is too late and before the solution continues feeding on wrong input from the failing tracker. Thus, in an event the tracking is found unsuccessful at a certain instance, the tracker 160 is re-initialized with the ground-truth pose from the computing module 150, which uses the weighted object pose estimation from the first detector 155 and the second detector 156.
This makes up for lost tracking and works well for the initial few frames. Thereafter, there may be challenges such as inherent variation in the occupancy of the object in the image, limited availability of large datasets, low resolution visuals, handling objects of varied sizes, or handling overlapping objects. Falling back to the detector too often to resolve tracking failures will again invite added computational cost, as the detector requires higher-end hardware and takes every input frame as a new frame, making it inefficient where temporal frames can be generated and used for subsequent pose estimation.
Thus, it is critical to obtain cues for probable failure of the tracking process in order to avoid unnecessary invocation of the detector function. For this, the system 1000 proposes making use of the known hand pose and derived object pose to determine tracking quality. As will be appreciated, since the hand pose and the object pose are now known for the first frame (being used as the initial feed for tracking), the relative transform between the hand pose and the object pose shall remain relatively constant for a given hand-object interaction. This means that the relative position and orientation of the hand with respect to the specific object held in the hand shall remain fixed for the entire session unless either the object 20 or the manner in which the object 20 is held by the hand 10 changes.
Now, with the tracking process initiated, the computing module 150 is always provided with the hand pose and the object pose, and thus the relative transform can be computed for each frame. This relative transform can be measured against a threshold value to understand whether the tracking is going off-limits. If the relative transform values go beyond the stipulated threshold, a strong clue of a failed tracking process can be inferred. In such events (only), the tracker 160 can be re-seeded with a corrected object pose computed using the first detector 155 and the second detector 156, as explained above.
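A minimal sketch of this health check follows, assuming 4x4 homogeneous transforms and a simple Frobenius-norm deviation measure; both are assumptions, as the disclosure only specifies a threshold on the relative transform.

```python
import numpy as np

DRIFT_THRESHOLD = 0.02  # illustrative tolerance on the relative transform

def tracking_healthy(T_hand, T_object, T_rel_reference):
    """Compare the current hand-to-object relative transform with the reference
    captured at (re-)initialisation; a large deviation hints at failed tracking
    and triggers re-seeding from the two detectors.

    All arguments are 4x4 homogeneous transformation matrices."""
    T_rel = np.linalg.inv(T_hand) @ T_object
    deviation = np.linalg.norm(T_rel - T_rel_reference)   # Frobenius-norm drift
    return deviation <= DRIFT_THRESHOLD
```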
Next, in accordance with one working embodiment, a sparse probabilistic tracker 160 is deployed for tracking the hand, hand-held objects and their interaction in cluttered scenes, based on the initial pose derived from the detection performed on the target scene (as discussed above). The sparse probabilistic tracking is based on a sparse viewpoint probabilistic model, where, instead of considering the joint probability over the entire image, the probability is only calculated along a small set of entities called correspondence lines, which consider fewer pixels, making the process computationally efficient; this approach overcomes problems of commonly deployed region-based tracking.
The correspondence lines are based on projected 3D model points and facilitate the estimation of where the object’s edges (contours) appear on the image. Their role is crucial in optimizing the object's pose by providing specific locations to assess alignment between the projected model and image, thereby guiding in adjusting the pose for better tracking accuracy. Once a correspondence line is established and corresponding pixels on the line are defined, it remains fixed.
A correspondence line here is described by a center and an associated vector normal to a 3D contour point, both being further defined by projecting the 3D contour point and its associated normal vector into the image. This effectively creates lines along which the tracking approach evaluates the match between the projected model and the observed image data. When there is a 6DoF pose variation, the difference between the unmoved center and the varied/moved contour point is computed.
In one explanatory embodiment, for sparse viewpoint probabilistic modelling, the 3D model of the object is rendered from different viewpoints, and 3D model points and associated normal vectors are reconstructed. This is followed by computing continuous distances for the foreground and background of the target scene, where the corresponding regions are not interrupted by each other. Each tracking step starts either from the previous pose or from an initial pose provided by a 3D object detection pipeline, as explained above. Model points and normal vectors associated with the contour are then projected into the image plane to define the correspondence lines (as discussed above). Using color histograms, the object is differentiated from its background based on the probability that a pixel belongs to the object or to the background, and the results are optimized using a Gauss-Newton scheme.
Based on this differentiation, the probability of the contour location is calculated along the correspondence lines. Starting from the contour coordinate, the distances are measured along the normal vector. The probability of the contour location on the image plane is determined for each correspondence line to estimate the 6DoF pose of the hand-object. Thus, using the sparse probabilistic formulation, high computational efficiency is obtained, as fewer pixels are considered compared to other state-of-the-art methods.
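The sketch below illustrates the per-line evaluation under simplifying assumptions: a precomputed per-pixel foreground probability map (derived from the color histograms) and independence across sampled pixels. The function and its scoring scheme are illustrative, not the disclosed formulation.

```python
import numpy as np

def contour_probability(image_fg_prob, center_px, normal_px, half_length=8):
    """Evaluate one correspondence line: sample the foreground probability along
    the projected normal and score each candidate contour location by how well
    it splits the line into foreground on one side and background on the other.

    image_fg_prob: (H, W) probability that a pixel belongs to the object.
    center_px:     projected 3D contour point as (x, y).
    normal_px:     projected, normalised 2D normal direction as (x, y).
    Returns a normalised probability over candidate contour offsets."""
    offsets = np.arange(-half_length, half_length + 1)
    samples = center_px[None, :] + offsets[:, None] * normal_px[None, :]
    xs = np.clip(samples[:, 0].round().astype(int), 0, image_fg_prob.shape[1] - 1)
    ys = np.clip(samples[:, 1].round().astype(int), 0, image_fg_prob.shape[0] - 1)
    p_fg = image_fg_prob[ys, xs]

    scores = []
    for split in range(1, len(offsets) - 1):
        # Foreground expected before the contour, background after it.
        scores.append(np.prod(p_fg[:split]) * np.prod(1.0 - p_fg[split:]))
    scores = np.asarray(scores)
    return scores / (scores.sum() + 1e-12)
```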
Correspondence lines, defined by projected 3D model points and their normals, are used to determine where the object's edges appear in the image. These lines guide the algorithm in assessing the fit between the projected model and the image. The final optimization step adjusts the pose based on this assessment of fit between the projected model and the image data, using a regularized Newton optimization method to iteratively refine the pose for improved tracking accuracy.
Accordingly, the optimization step utilizes a regularized Newton optimization method. This approach iteratively refines the object's pose (its position and orientation) to align the projected 3D model with the observed image data. The optimization algorithm adjusts the pose by considering both the gradient and the Hessian of a joint posterior probability function, which represents the likelihood of a pose given the observed data. Regularization is applied to ensure stability and avoid overfitting to noise in the image data, thereby achieving a balance between fitting the data and maintaining a plausible pose.
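In compact form, one such update step might be sketched as follows; this assumes a vectorized pose and a damping-style regularizer, and a full implementation would apply the increment on SE(3) rather than additively.

```python
import numpy as np

def regularized_newton_step(pose, gradient, hessian, damping=1e-2):
    """One regularised Newton update: the pose increment is obtained from the
    gradient and Hessian of the joint posterior, with a damping term added to
    the Hessian for numerical stability."""
    H_reg = hessian + damping * np.eye(hessian.shape[0])
    delta = np.linalg.solve(H_reg, -gradient)
    return pose + delta   # in practice the update is composed on SE(3)
```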
With above tracking process, starting from an initialization with the ground-truth pose, the tracker runs automatically until either the recorded sequence ends or tracking is found unsuccessful. The tracking solution thus calculates probabilities sparsely along the correspondence lines rather than across the entire image, optimizing computational efficiency and focus. This sparse modeling combined with a probabilistic approach that treats these probabilities according to a Gaussian distribution, underpins the proposed solution’s efficient and robust tracking mechanism.
Thus, temporally consecutive image frames can be effectively put to use by the tracking process instead of treating every frame as a new frame (as is the case when using detection alone), thereby reducing latency and minimising the associated computational cost. Consequently, the detection process need not be trained on a heavy training data set for the detection to work well; rather, in conjunction with tracking, the 6DoF pose estimation is found to be more robust and accurate.
The combined approach of detection followed by tracking of the target scene, inclusive of the hand, the object held within the hand and the interaction between the hand and the object, is highly efficient, as it runs on a single CPU core for real-time pose estimation, can effectively make use of temporal frames, requires less data for training, and works well for all kinds of objects including those with smooth, textureless surfaces.
In the next working embodiment, a method 500 for detecting and tracking hand-object interaction within a virtual environment is proposed, as shown in Fig. 2. The method begins with the capturing of image data of the hand 10 and the object 20 held within the hand 10, along with pose data of the hand, in step 501. This is followed by detection of a first object pose from the image data based on localization of object keypoints by the first detector 155 in step 502. Next, with the second detector 156, a second object pose is computed from the pose data of the hand in step 503. In step 504, a final pose is computed from the first detector 155 and the second detector 156 based on comparison with a predetermined threshold value. Thereafter, in step 505, the tracking process of the hand-object interaction is initiated based on the final pose. The tracking process is continuously monitored for its accuracy, as shown in step 506. Finally, in step 507, if any error or failure is detected in the tracking process, it is re-seeded with a corrected final pose obtained from the first detector and the second detector.
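Tying the steps together, a high-level skeleton of method 500 might look like the generator below; all callables passed in are hypothetical stand-ins for the detectors, tracker and helpers sketched earlier, so this is an illustrative sketch of the control flow rather than the disclosed implementation.

```python
def detect_and_track(frames, glove_stream, first_detector, second_detector,
                     tracker_update, fuse, relative_transform, is_healthy):
    """Yield one fused/tracked object pose per frame (steps 501-507)."""
    final_pose, reference_rel = None, None
    for image, hand_pose in zip(frames, glove_stream):              # step 501
        if final_pose is None:
            pose_a, conf_a = first_detector(image)                  # step 502
            pose_b, conf_b = second_detector(hand_pose)             # step 503
            final_pose, _ = fuse(pose_a, conf_a, pose_b, conf_b)    # step 504
            reference_rel = relative_transform(hand_pose, final_pose)
        else:
            final_pose = tracker_update(image, final_pose)          # step 505
            if not is_healthy(hand_pose, final_pose, reference_rel):  # step 506
                final_pose = None                                   # step 507: re-seed next frame
        yield final_pose
```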
The joint detection and tracking process helps to know the exact coordinates at which the object is held, which is extremely crucial information for skilling and upgrading in trades where precision is of immense value.
In accordance with an embodiment, the head mounted display comprises a memory unit configured to store machine-readable instructions. The machine-readable instructions may be loaded into the memory unit from a non-transitory machine-readable medium, such as, but not limited to, CD-ROMs, DVD-ROMs and Flash Drives. Alternately, the machine-readable instructions may be loaded in a form of a computer software program into the memory unit. The memory unit in that manner may be selected from a group comprising EPROM, EEPROM and Flash memory. Further, the computing module is implemented by a processor operably connected with the memory unit. In various embodiments, the processor is one of, but not limited to, a general-purpose processor, an application specific integrated circuit (ASIC) and a field-programmable gate array (FPGA).
In general, the word "module," as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions written in a programming language such as, for example, Java, C, or assembly. One or more software instructions in the modules may be embedded in firmware, such as an EPROM. It will be appreciated that modules may comprise connected logic units, such as gates and flip-flops, and may comprise programmable units, such as programmable gate arrays or processors. The modules described herein may be implemented as either software and/or hardware modules and may be stored in any type of computer-readable medium or other computer storage device.
Further, while one or more operations have been described as being performed by or otherwise related to certain modules, devices or entities, the operations may be performed by or otherwise related to any module, device or entity. As such, any function or operation that has been described as being performed by a module could alternatively be performed by a different server, by the cloud computing platform, or a combination thereof. It should be understood that the techniques of the present disclosure might be implemented using a variety of technologies. For example, the methods described herein may be implemented by a series of computer executable instructions residing on a suitable computer readable medium. Suitable computer readable media may include volatile (e.g., RAM) and/or non-volatile (e.g., ROM, disk) memory, carrier waves and transmission media. Exemplary carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet.
It should also be understood that, unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "controlling" or "obtaining" or "computing" or "storing" or "receiving" or "determining" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that processes and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Various modifications to these embodiments are apparent to those skilled in the art from the description and the accompanying drawings. The principles associated with the various embodiments described herein may be applied to other embodiments. Therefore, the description is not intended to be limited to the embodiments shown in the accompanying drawings but is to be accorded the broadest scope consistent with the principles and the novel and inventive features disclosed or suggested herein. Accordingly, the invention is intended to embrace all such alternatives, modifications, and variations that fall within the scope of the present invention.
CLAIMS:
We Claim:

1) A method (500) for detecting and tracking hand-object interaction within a virtual environment, the method (500) comprising:
capturing image data of hand (10) and object (20) held within the hand (10) along with pose data of the hand (10);
detecting, using a first detector (155), a first object pose from the image data based on localization of object keypoints;
detecting, using a second detector (156), a second object pose from the pose data of the hand;
computing final pose from the first detector (155) and the second detector (156) based on comparison with a predetermined threshold value;
tracking the hand-object interaction based on the final pose; and
monitoring the tracking for accuracy, wherein in an event of inaccurate tracking, re-seeding the tracking with a corrected final pose.

2) The method (500), as claimed in claim 1, wherein the detection and tracking of hand-object interaction is a 6DoF estimation of the hand (10) and the object (20) held by the hand (10).

3) The method (500), as claimed in claim 1, wherein the image data is obtained from one or more vision sensors (50) positioned on a head mounted display (100).

4) The method (500), as claimed in claim 1, wherein the pose data of the hand (10) is obtained from hand gloves (200) studded with a plurality of sensors (250).

5) The method (500), as claimed in claim 4, wherein the plurality of sensors (250) comprise, though not limited to, an array of ultrasonic sensors distributed throughout the hand (10), linear haptic sensors and flex sensors.

6) The method (500), as claimed in claim 1, wherein the first object pose is detected in steps of:
detecting 2D keypoints for the object (20) from the image data using convolutional neural network;
predicting object semantic label and vector representative of direction from every pixel of the image data to the detected 2D keypoint; and
generating hypotheses of 2D location for the object keypoint in the image data from the vector representation.

7) The method (500), as claimed in claim 6, wherein a mean and covariance of spatial probability distribution is computed for each object keypoint based on the generated hypotheses to obtain 3D coordinates of the object from the 2D keypoints.

8) The method (500), as claimed in claim 1, wherein the second object pose is detected in steps of:
providing an initial estimate of the pose of the hand (10) based on a hand grasp of the object (20) using conditional variational autoencoder and generating a latent representation thereof; and
refining the initial estimate of the pose of the hand (10) based on a trained neural network such that the pose aligns closely with object orientation in real world.

9) The method (500), as claimed in claim 1, wherein the final pose is computed based on weighted average of the first pose and the second pose along with respective confidence scores.

10) The method (500), as claimed in claim 9, wherein in an event the first detector (155) and the second detector (156) exhibit high confidence scores with distant pose estimates under static external conditions, the first pose estimate is selected as the final pose for seeding the tracking.

11) The method (500), as claimed in claim 9, wherein in an event the first detector (155) and the second detector (156) exhibit high confidence scores with distant pose estimates under non-static external conditions, the second pose estimate is selected as the final pose for seeding the tracking.

12) The method (500), as claimed in claim 9, wherein in an event the first detector (155) and the second detector (156) exhibit low confidence scores against the predetermined threshold value, the tracking is re-seeded with the corrected final pose.

13) The method (500), as claimed in claim 1, wherein the corrected final pose is obtained by re-initiating the first detector (155) and the second detector (156).

14) The method (500), as claimed in claim 1, wherein the tracking is based on sparse viewpoint probabilistic model, wherein probability of object appearing in the image is computed based on correspondence lines.

15) The method (500), as claimed in claim 14, wherein the sparse probabilistic model is operable in steps of:
constructing 3D model points and associated normal vectors from a 3D model of the object (20); and
projecting the 3D model points and the normal vectors into an image plane for defining the correspondence lines and assessing the fit between the 3D model points and the image for smooth tracking.
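
A minimal sketch of the projection step of claim 15, assuming a rotation R, a translation t and camera intrinsics K are available for the current frame; reducing the rotated normal to a 2D direction by dropping its depth component is a simplification made only for this sketch.

```python
import numpy as np

def correspondence_lines(model_points, model_normals, R, t, K):
    """Project sparse 3D model points and their normals into the image and
    return, per point, a centre pixel and a 2D line direction along the
    projected normal (the line along which contour evidence is gathered)."""
    P_cam = (R @ model_points.T).T + t              # model points in the camera frame
    p_img = (K @ P_cam.T).T
    centres = p_img[:, :2] / p_img[:, 2:3]          # perspective division to pixels
    n_cam = (R @ model_normals.T).T                 # normals rotated into the camera frame
    n_img = n_cam[:, :2]                            # approximate 2D direction of each normal
    n_img = n_img / (np.linalg.norm(n_img, axis=1, keepdims=True) + 1e-9)
    return centres, n_img
```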

16) The method (500), as claimed in claim 15, wherein the fit between the 3D model points and the image is refined by a Newton optimization method based on a combination of the gradient and the Hessian of a joint posterior probability function.
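
In its generic form, a Newton-type refinement of this kind iterates an update of the pose parameters using the gradient and the Hessian of the joint (log-)posterior accumulated over the correspondence lines; the notation below is a generic sketch of such an update, not a verbatim statement of the claimed function:

$$\theta_{k+1} = \theta_k - H^{-1}(\theta_k)\,g(\theta_k), \qquad g(\theta) = \nabla_\theta \log\bigl[p(\mathcal{D}\mid\theta)\,p(\theta)\bigr], \qquad H(\theta) = \nabla_\theta^2 \log\bigl[p(\mathcal{D}\mid\theta)\,p(\theta)\bigr],$$

where \(\theta\) collects the 6DoF pose parameters and \(\mathcal{D}\) denotes the contour evidence gathered along the correspondence lines of claim 14.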

17) A system (1000) for detecting and tracking hand-object interaction within a virtual environment, the system (1000) comprising:
a head mounted display (100) configured to capture image data of the hand (10) and the object (20) held within the hand (10);
a hand glove (200) configured to capture pose data of the hand (10);
a computing module (150), implemented upon a processor, and configured with:
a first detector (155) to detect a first object pose from the image data based on localization of object keypoints;
a second detector (156) to detect a second object pose from the pose data of the hand (10),
wherein the computing module (150) is configured to compute a final pose from the first detector (155) and the second detector (156) based on a comparison with a predetermined threshold value;
a tracker (160) for tracking the hand-object interaction based on the final pose; and
wherein the computing module (150) is configured to monitor the tracker (160) for accuracy and, in an event of inaccurate tracking, to re-seed the tracker (160) with a corrected final pose.
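
For orientation only, the cooperation of the elements recited in claim 17 can be pictured as the per-frame loop below; every class, attribute and function name is a hypothetical stand-in for the corresponding recited element, not an actual interface of the system.

```python
def process_frame(image, glove_data, state):
    """Hypothetical per-frame pipeline mirroring the claimed system:
    two detectors seed a tracker, and a monitor triggers re-seeding."""
    if state.tracker_pose is None or state.needs_reseed:
        pose1, conf1 = state.detector_vision(image)        # first detector (image keypoints)
        pose2, conf2 = state.detector_glove(glove_data)    # second detector (glove pose data)
        final_pose, state.needs_reseed = state.fuse(pose1, conf1, pose2, conf2)
        state.tracker.seed(final_pose)                      # seed or re-seed the tracker
    state.tracker_pose = state.tracker.track(image)         # frame-to-frame 6DoF tracking
    if not state.monitor.is_accurate(state.tracker_pose, image):
        state.needs_reseed = True                           # request a corrected final pose
    return state.tracker_pose
```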

18) The system (1000), as claimed in claim 17, wherein the detection and tracking of hand-object interaction is a 6DoF estimation of the hand (10) and the object (20) held by the hand (10).

19) The system (1000), as claimed in claim 17, wherein the head mounted display (100) comprises a plurality of vision sensors (50).

20) The system (1000), as claimed in claim 17, wherein the hand glove (200) is studded with a plurality of sensors (250), comprising, though not limited to, an array of ultrasonic sensors distributed throughout the hand (10), linear haptic sensors and flex sensors.

21) The system (1000), as claimed in claim 17, wherein the first detector (155) is configured to detect the first object pose in steps of:
detecting 2D keypoints for the object (20) from the image data using a convolutional neural network;
predicting an object semantic label and a vector representing the direction from every pixel of the image data to the detected 2D keypoint; and
generating hypotheses of the 2D location of the object keypoint in the image data from the vector representation.

22) The system (1000), as claimed in claim 21, wherein a mean and a covariance of a spatial probability distribution are computed for each object keypoint based on the generated hypotheses to obtain 3D coordinates of the object (20) from the 2D keypoints.

23) The system (1000), as claimed in claim 17, wherein the second object pose is detected in steps of:
providing an initial estimate of the pose of the hand (10) based on a hand grasp of the object (20) using a conditional variational autoencoder and generating a latent representation thereof; and
refining the initial estimate of the pose of the hand (10) using a trained neural network such that the pose aligns closely with the object orientation in the real world.

24) The system (1000), as claimed in claim 17, wherein the computing module (150) is configured to compute the final pose as a weighted average of the first pose and the second pose based on their respective confidence scores.

25) The system (1000), as claimed in claim 24, wherein, in an event the first detector (155) and the second detector (156) exhibit high confidence scores with distant pose estimates under static external conditions, the computing module (150) is configured to select the first pose estimate as the final pose for seeding the tracker (160).

26) The system (1000), as claimed in claim 24, wherein, in an event the first detector (155) and the second detector (156) exhibit high confidence scores with distant pose estimates under non-static external conditions, the computing module (150) is configured to select the second pose estimate as the final pose for seeding the tracker (160).

27) The system (1000), as claimed in claim 24, wherein, in an event the first detector (155) and the second detector (156) exhibit low confidence scores against the predetermined threshold value, the tracker (160) is re-seeded with the corrected final pose.

28) The system (1000), as claimed in claim 17, wherein the corrected final pose is obtained by re-initiating the first detector (155) and the second detector (156).

29) The system (1000), as claimed in claim 17, wherein the tracker (160) is based on a sparse viewpoint probabilistic model, wherein a probability of the object (20) appearing in the image is computed based on correspondence lines.

30) The system (1000), as claimed in claim 29, wherein the tracker (160) is configured to perform tracking based on the sparse probabilistic model in steps of:
constructing 3D model points and associated normal vectors from a 3D model of the object (20);
projecting the 3D model points and the normal vectors into an image plane for defining the correspondence lines and assessing the fit between the 3D model points and the image for smooth tracking; and
refining the fit between the 3D model points and the image by a Newton optimization method based on a combination of the gradient and the Hessian of a joint posterior probability function.

Documents

Application Documents

# Name Date
1 202321018879-PROVISIONAL SPECIFICATION [17-03-2023(online)].pdf 2023-03-17
2 202321018879-FORM FOR STARTUP [17-03-2023(online)].pdf 2023-03-17
3 202321018879-FORM FOR SMALL ENTITY(FORM-28) [17-03-2023(online)].pdf 2023-03-17
4 202321018879-FORM 1 [17-03-2023(online)].pdf 2023-03-17
5 202321018879-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [17-03-2023(online)].pdf 2023-03-17
6 202321018879-DRAWINGS [17-03-2023(online)].pdf 2023-03-17
7 202321018879-DRAWING [15-03-2024(online)].pdf 2024-03-15
8 202321018879-COMPLETE SPECIFICATION [15-03-2024(online)].pdf 2024-03-15
9 202321018879-FORM-9 [19-03-2024(online)].pdf 2024-03-19
10 202321018879-ENDORSEMENT BY INVENTORS [19-03-2024(online)].pdf 2024-03-19
11 Abstract.jpg 2024-04-13
12 202321018879-STARTUP [16-07-2024(online)].pdf 2024-07-16
13 202321018879-FORM28 [16-07-2024(online)].pdf 2024-07-16
14 202321018879-FORM 18A [16-07-2024(online)].pdf 2024-07-16
15 202321018879-FER.pdf 2024-08-08
16 202321018879-FER_SER_REPLY [27-09-2024(online)].pdf 2024-09-27
17 202321018879-DRAWING [27-09-2024(online)].pdf 2024-09-27
18 202321018879-COMPLETE SPECIFICATION [27-09-2024(online)].pdf 2024-09-27
19 202321018879-US(14)-HearingNotice-(HearingDate-21-11-2024).pdf 2024-10-22
20 202321018879-FORM 3 [19-11-2024(online)].pdf 2024-11-19
21 202321018879-FORM-26 [21-11-2024(online)].pdf 2024-11-21
22 202321018879-PETITION u-r 6(6) [22-11-2024(online)].pdf 2024-11-22
23 202321018879-Covering Letter [22-11-2024(online)].pdf 2024-11-22
24 202321018879-Written submissions and relevant documents [25-11-2024(online)].pdf 2024-11-25
25 202321018879-Power of Authority [25-11-2024(online)].pdf 2024-11-25
26 202321018879-PETITION u-r 6(6) [25-11-2024(online)].pdf 2024-11-25
27 202321018879-FORM-26 [25-11-2024(online)].pdf 2024-11-25
28 202321018879-Covering Letter [25-11-2024(online)].pdf 2024-11-25
29 202321018879-Annexure [25-11-2024(online)].pdf 2024-11-25

Search Strategy

1 exped_searchE_07-08-2024.pdf