Abstract: The present invention is directed to a self-tracking head mounted device for enhancing user skilling and training. In one aspect of the disclosure, one or more images or videos of the user and the user surroundings, along with an egocentric view of the user, are captured using image capturing units and wide-angle fisheye lenses. The captured images and videos of the user and user surroundings are used to estimate an egocentric 3D pose in a local coordinate space. The egocentric 3D pose is transformed from the local coordinate space to a global space and rendered for comparative analysis of pose deviation from a standardized pose. In an event the pose deviation is beyond a predetermined threshold, the user is instructed to correct the pose and thereby improve his skilling and training.
DESC:FIELD OF THE INVENTION
Embodiments of the present invention relate to a head mounted device for detecting and tracking the wearer's own body and associated motion, and more particularly to a self-contained, independent head mounted device capable of self-tracking for use in enhancing user skilling and training in a computer simulated environment.
BACKGROUND OF THE INVENTION
A head mounted display ("HMD") device is generally worn by a user for experiencing virtual reality (VR), augmented reality (AR) or mixed reality (MR) environments in real time. The user enjoys viewing updated digitally produced images and perceiving them as real. These images should be updated with minimal delay to avoid issues of latency, noise, dropped frames, slow or inaccurate tracking, etc., for the user to have a fully immersive experience. Delay in updating the images in response to head movements may lead to motion artifacts, such as juddering, latency in overlaying images, color breakup, and/or general sluggishness, which may cause a bad user experience and lead to headaches and nausea.
To enhance the realism of the virtual world, tracking of various variables is an essential prerequisite: positional head tracking for estimating position and orientation, face tracking, body tracking, etc. Though a plethora of research work is now available for enabling positional head tracking and, to some extent, face tracking, detection of body motion and performance has only been possible by using a cumbersome and still impractical set up external to the HMD (the outside-in approach).
For example, multiple external cameras may be installed in a closed room to track user position, movements and hand gestures, and to determine overall user motion and performance characteristics. However, the use of a multi-camera set up in an outside-in arrangement for expanding the recording scene suffers from various drawbacks: the user cannot roam around in a larger space beyond a fixed recording volume, the set up is not scalable, and it is quite complex and unviable. Further, its applicability is limited in situations of dense social interaction where the scene is cluttered with dynamic scene elements, furniture, other objects and the like.
Other methods may be based on flooding and sweeping the room with infrared light generated by one or two base stations, synchronized with multiple IR photosensors precisely positioned on the HMD. Other known solutions have attempted computation of user motion and proprioception by using controller devices to track hand position in real time, and then inferring the 3D pose of the rest of the body from inverse kinematics of the head and hand poses. However, this too results in inaccurate estimates of the body configuration as signal losses are high, which eventually leads to an uncomfortable experience for the user.
It is pertinent to note that all these approaches requiring an external set up limit the movement of the user within the closed space to ensure continuous estimation of his position without occlusions and obstructions. To achieve this, multiple cables run across the user's body to assure seamless continuity in tracking, which is both obstructive and requires tedious calibration. Alternately, another tracking device may be worn by the user on his body to track position and orientation; however, the use of external components imposes a limit on the freedom of the user and often adds calibration steps before the HMD can be used.
Hence, the industry is converging on self-contained headsets where cameras mounted on the headset ("inside-out") can be used for both headset and hand-held controller tracking. For example, the structure-from-motion technique offers speed, flexibility and accuracy and allows free roaming, but it requires a static background, optimal conditions for data acquisition, good control points for reconstructing 3D poses, and intensive computation to infer the pose for even a minute of video.
Thus, there exists a need in the art to incorporate all tracking components within a single, independent, self-contained head mounted device that can perform all tracking functions for the user while assuring his freedom of movement in a given space. However, the challenges that must be accounted for while devising such self-contained inside-out tracking devices include, though are not restricted to, occlusions, out-of-view motions, markers disappearing with frequent occlusions, limited field of view, lack of visual evidence, the need for high quality motion data, the need for strong priors to generate plausible motions from sparse datasets, and many more.
The present disclosure sets forth a head mounted device for extracting accurate motion and performance characteristics of a user while wearing the head mounted device, and that may address one or more of the challenges or needs mentioned herein, as well as provide other benefits and advantages. The information disclosed in this Background section is only for enhancement of understanding of the background of the invention and, therefore, may contain information that does not form prior art.
OBJECT OF THE INVENTION
An object of the present invention is to provide a head mounted device (HMD) capable of tracking the wearer's or user's head position and orientation, facial expressions, and full body motion and performance.
Another object of the present invention is to provide an independent self-tracking head mounted device (HMD) that assists in wearer’s ease of movement and freedom while wearing the HMD.
Yet another object of the present invention is to provide a lightweight, compact self-tracking head mounted device (HMD) that enables full body tracking of the user in an open space where support of external cameras or sensors is practically unviable for making such estimations.
Yet another object of the present invention is to provide a user friendly self-tracking head mounted device that can perform full body tracking of HMD wearer with sufficient accuracy without requiring aid of external sensors for attaining such accuracy.
Yet another object of the present invention is to provide a self-tracking head mounted device that is equipped with intelligent processing for egocentric estimation of body pose.
In yet another object of the present invention, the portable head mounted device is provided with IR stereo cameras that enable capturing enough information around dynamic motions of whole body including hands and feet.
In still another object of the present invention, the self-tracking head mounted device enables the wearer to improve his skilling and training in a simulated environment.
In one other object of the present invention, the self-tracking head mounted device enables capturing of egocentric 3D pose estimates that may be compared with standardized poses to determine pose deviations.
In another object of the present invention, the user is alerted to incorrect poses and instructed via visual display or audio notifications for pose correction using the self-tracking head mounted display.
SUMMARY OF THE INVENTION
This section is for the purpose of summarizing some aspects of the present invention and to briefly introduce some preferred embodiments. Simplifications or omissions may be made to avoid obscuring the purpose of the section. Such simplifications or omissions are not intended to limit the scope of the present invention.
In general, the present invention relates to a self-tracking head mounted device used for enhancing user skilling and training. According to one aspect of the present invention, a head mounted device (HMD) for enhancing user skill and training comprises a plurality of image capturing units configured to capture one or more images or videos of the user surroundings. Further, one or more wide-view optical elements are provided on the HMD to capture an egocentric view of the user and the user surroundings for generating a user-contextual scene. The computing device of the HMD, in communication with the image capturing units and the wide-view optical element, is configured to acquire the one or more images or videos along with the egocentric view of the user-contextual scene.
According to another aspect, the computing device estimates the 3D egocentric user pose under constraints of the user-contextual scene in a local coordinate space of the head mounted device. This is followed by transformation of the 3D egocentric pose estimates from the local coordinate space to a global coordinate space. The transformed 3D egocentric pose estimates of the user are then rendered for comparative analysis with standardized poses corresponding to the skill and training the user is performing. Finally, the user is instructed on the head mounted device with a corrected pose, in an event the pose deviation from the standardized pose is determined to be beyond a predetermined threshold.
In still another aspect, a method for enhancing user skill and training using a head mounted device is disclosed. Accordingly, the method comprises capturing one or more images or videos of the user surroundings and capturing an egocentric view of the user and the user surroundings for generating a user-contextual scene. The one or more images or videos are acquired along with the egocentric view of the user-contextual scene for estimating the 3D egocentric user pose under constraints of the user-contextual scene in a local coordinate space of the head mounted device. The method further comprises transforming the 3D egocentric pose estimates from the local coordinate space to a global coordinate space. The transformed 3D egocentric pose estimates of the user are rendered for comparative analysis with standardized poses corresponding to the skill and training the user is performing. The method finally includes instructing on the head mounted device a corrected pose, in an event the pose deviation from the standardized pose is determined to be beyond a predetermined threshold.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
These and other features, benefits and advantages of the present invention will become apparent by reference to the following text and figures, with like reference numbers referring to like structures across the views, wherein:
Fig. 1 illustrates the self-tracking head mounted display used for enhancing user skilling and training, in accordance with an embodiment of the present invention.
Fig. 2 illustrates a flow diagram depicting the method of enhancing user skilling and training using the head mounted device, in accordance with an embodiment of present invention.
DETAILED DESCRIPTION
While the present invention is described herein by way of example using embodiments and illustrative drawings, those skilled in the art will recognize that the invention is not limited to the embodiments or drawings described, which are not intended to represent the scale of the various components. Further, some components that may form a part of the invention may not be illustrated in certain figures, for ease of illustration, and such omissions do not limit the embodiments outlined in any way. It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed; on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the scope of the present invention as defined by the appended claims. As used throughout this description, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Further, the words "a" or "an" mean "at least one" and the word "plurality" means "one or more" unless otherwise mentioned. Furthermore, the terminology and phraseology used herein is solely for descriptive purposes and should not be construed as limiting in scope. Language such as "including," "comprising," "having," "containing," or "involving," and variations thereof, is intended to be broad and encompass the subject matter listed thereafter, equivalents, and additional subject matter not recited, and is not intended to exclude other additives, components, integers or steps. Likewise, the term "comprising" is considered synonymous with the terms "including" or "containing" for applicable legal purposes. Any discussion of documents, acts, materials, devices, articles, and the like is included in the specification solely for the purpose of providing a context for the present invention. It is not suggested or represented that any or all of these matters formed part of the prior art base or were common general knowledge in the field relevant to the present invention.
In this disclosure, whenever a composition or an element or a group of elements is preceded by the transitional phrase "comprising", it is understood that we also contemplate the same composition, element or group of elements with the transitional phrases "consisting of", "consisting", "selected from the group consisting of", "including", or "is" preceding the recitation of the composition, element or group of elements, and vice versa.
The present invention is described hereinafter by various embodiments with reference to the accompanying drawings, wherein reference numerals used in the accompanying drawings correspond to like elements throughout the description. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, the embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art. In the following detailed description, numeric values and ranges are provided for various aspects of the implementations described. These values and ranges are to be treated as examples only and are not intended to limit the scope of the claims. In addition, a number of materials are identified as suitable for various facets of the implementations. These materials are to be treated as exemplary and are not intended to limit the scope of the invention.
In accordance with one general embodiment of the present disclosure, the present system and method are directed to egocentric tracking of user pose estimation for enabling the user to improve in skilling and training activities. This is accomplished by making use of a compact, sophisticated head mounted device (HMD) provided with advanced image capturing modules equipped with accurate 3D pose estimation capabilities. In order to make enhancements to particular skilling sessions or any other practical activity, it is important that the correct pose, form, posture, orientation of hands or legs, and other related facts are precisely known.
This requires that the user of the HMD is tracked closely and monitored precisely during his practice sessions. Equally important is that the user is not constrained by space, set up, wires/cables or other necessities which make the entire session highly constricted and uneasy for outside use. Attempts have been made at self-tracking using harnesses and body wearables; however, these have not proven to be promising solutions for reasons of poor portability, overly bulky wearables and heavy computation requirements.
In conventional scenarios, capturing the egocentric view for pose estimation of the HMD wearer has often received unfavourable views for reasons of non-conforming physical plausibility, including implausible body-environment interaction or the body failing to make proper contact with the ground. This is mostly ascribed to self-occlusion and the strong perspective distortions obtained with the egocentric view.
To address this issue, the present disclosure provides for a sophisticated, lightweight, portable head mounted, near-eye device having a small form factor (weighing less than 390 g). The HMD 100 of Fig. 1 is one non-limiting example of such a device. An HMD 100 may take any other suitable form in which a transparent, semi-transparent, and/or non-transparent display is supported in front of a user's eye or eyes. Further, implementations described herein may be used with any other suitable computing device, including but not limited to mobile computing devices, laptop computers, desktop computers, tablet computers, other wearable computers, etc.
The HMD 100 includes a display 50 and a computing device 70 configured to perform various operations related to visual presentation of artificial reality to the user or wearer. In one non-limiting embodiment, the HMD 100 comprises one or more cameras, interchangeably referred to as tracking units or image capturing units 20, to capture images and video of the user surroundings. For example, the HMD comprises an upper frame 10(a) and a lower frame 10(b) and accommodates two or more image capturing units 20(a), 20(b), 20(c) and 20(d) on the upper and lower frames 10(a) and 10(b) for capturing the user environment from different viewpoints.
In one specific embodiment, the HMD 100 is provided with at least four tracking units 20(a), 20(b), 20(c) and 20(d) (collectively referred to by numeral 20) mounted at the four edges of the HMD 100 to enable viewing and capturing the user surroundings. In one preferable embodiment, the four tracking units are wide-angle fisheye lens cameras. Additionally, at least two front facing RGB pass-through cameras 30(a) and 30(b) are provisioned for enabling a mixed reality environment, which allow the user of the HMD 100 to approximate a world view as if one were able to look directly through the front of the HMD 100 into the real world.
The HMD 100 may further include an eye tracking unit 40 comprising an eye tracking camera that provides another vital input to determine eye gaze (e.g., direction or orientation of one or both eyes). In accordance with one exemplary embodiment, two or more inward facing eye tracking (ET) cameras may be deployed over the head mounted device (HMD) 100 for capturing images (still and video) of otherwise unobservable areas of the face, e.g. the peri-eye region, top of head, forehead, eyebrows, eye sockets, temples, upper and lower cheeks, nose, philtrum, lips, mouth, chin boss and jaw line, along with observing muscle contractions, changes in eye shape, or trait changes that produce movement.
However, even when taken together, these image capturing units 20(a), 20(b), 20(c), 20(d), the RGB pass-through cameras 30(a), 30(b) and the eye tracking camera 40 provide an incomplete view of the user's own body, motion and full body posture, which prevents correct and accurate pose estimation for the user. In fact, in many captured frames of daily activity routines, arms and legs are not visible, and hence only a very limited understanding can be inferred of the user's body posture, the correctness of that posture, and the use of hands and lower body for correct functioning of a particular trade or activity.
Next, in accordance with one preferred embodiment, at least two wide-angle, downward looking IR stereo cameras 60(a), 60(b) with wide-view optical elements (such as fisheye lenses) that enable capture of images over a wide range of angles, such as 180 degrees or more, are positioned on the lower frame/rim 10(b) at a predefined angle and distance, preferably 31 mm on each side from the nose bridge of the HMD 100. In other words, the two fisheye lenses 60(a), 60(b) are maintained at approximately 62 mm distance from each other. Specifically, the IR stereo cameras 60(a), 60(b) are positioned such that almost the entire body of the wearer, including complex hand-body dynamic interactions, can be captured from the first-person perspective. In one alternate embodiment, a convolutional neural network may be deployed to compensate for the regions that are far from the view of the fisheye cameras 60(a), 60(b).
Further, strong image distortion caused by the fisheye lenses 60(a), 60(b) can be mitigated with automatic camera calibration. As it is difficult to observe the whole body from close proximity in the egocentric setting while maintaining a small form factor for HMDs, using a lightweight pair of IR stereo cameras provides a simpler way of estimating full-body user pose and depth with improved accuracy and reduced complexity. The stereo cameras capture the complete image of the user including the face, neck, hands, torso, legs and other body part information, including the skeletal estimation. Such a pose estimation method, wherein images are taken from the first-person perspective, is called "egocentric-view pose estimation".
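By way of a non-limiting illustration, the following Python sketch shows one conventional way such fisheye distortion may be mitigated using OpenCV's fisheye calibration model. The intrinsic matrix K and distortion coefficients D shown are hypothetical placeholder values obtained from an assumed offline calibration (e.g. with cv2.fisheye.calibrate on checkerboard captures), not specifications of the HMD 100:

```python
import cv2
import numpy as np

# Hypothetical intrinsics and fisheye distortion coefficients from a
# prior offline calibration; values are illustrative only.
K = np.array([[280.0,   0.0, 320.0],
              [  0.0, 280.0, 240.0],
              [  0.0,   0.0,   1.0]])
D = np.array([0.05, -0.01, 0.002, -0.0004])

def undistort_fisheye(frame: np.ndarray) -> np.ndarray:
    """Remap a raw fisheye frame to a rectilinear image."""
    h, w = frame.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(frame, map1, map2, interpolation=cv2.INTER_LINEAR)
```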
In another exemplary embodiment, a computing device 70 is in electrical communication with the plurality of image capturing units 20(a), 20(b), 20(c), 20(d) and the wide-angle optical elements 60(a), 60(b) positioned over the HMD 100, and is configured to process acquired egocentric image information (broadly comprising the user pose and user surroundings) from the plurality of image capturing units and optical elements to sense dynamic motions of the user's body, including facial expressions, hand gestures, body posture, torso and limb movements and body pose.
To elaborate, the HMD 100 may be coupled with a computing device 70 for processing input received from the aforementioned image capturing units 20(a), 20(b), 20(c), 20(d) and optical elements 60(a), 60(b) or other sensors, e.g. FOV cameras for capturing the outer world view and tracking gestures of the user; an inward facing imaging system, such as a digital camera, for observing user movements (eye and face movements, face stretches, freckles, wrinkles, eye posture, eye orientation, pupil diameter, peri-eye region); inertial measurement units (IMUs) including accelerometers and gyroscopes to determine head posture and head orientation; compasses; and Global Positioning units. The information fed as input to the computing device 70 may be used for extracting user-contextual scene information.
In some embodiments, multiple image-based user detection and/or tracking processes can be executed simultaneously, other motion determination techniques can be performed, and/or other sensor data analyzed for detecting and tracking full body motion and performance characteristics of a user. The data obtained by these independent processes can be aggregated for more robust detection and tracking of the user. In various embodiments, sensor fusion techniques can be used to combine supplementary data from multiple sensors to be processed by the computing device 70. Sensor fusion can be used to aggregate data captured by multiple sensors or input devices, such as multiple cameras, inertial sensors, infrared transceivers, GPS, microphones, etc., to obtain information about the wearer's 3D pose and motion that may be more accurate and/or complete than would be possible from a single sensor alone.
For example, a plurality of images of the user and user surroundings may be captured by multiple cameras, image capturing units and wide-angle optical elements with different fields of view to deeply analyse the user and his associated surroundings in three dimensions. By implementing sensor fusion, the captured sensor data can be used to derive motion according to six degrees of freedom (6DOF). As yet another example, sensor fusion can be applied to aggregate the motion and/or position of an object of interest evaluated using image analysis with the motion and/or position derived from inertial sensor data. Sensor fusion techniques and probabilistic approaches can include Kalman filtering, extended Kalman filtering, unscented Kalman filtering and particle filtering, among others.
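By way of a non-limiting illustration, the following sketch shows a minimal constant-velocity Kalman filter of the kind such sensor fusion may employ to smooth noisy position measurements along one axis; the state model, time step and noise magnitudes are illustrative assumptions:

```python
import numpy as np

class ConstantVelocityKF:
    """Fuses noisy per-frame position measurements (e.g. camera-derived)
    into a smoothed position/velocity estimate along one axis."""
    def __init__(self, dt=1/60, q=1e-3, r=1e-2):
        self.x = np.zeros(2)                      # state: [position, velocity]
        self.P = np.eye(2)                        # state covariance
        self.F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity model
        self.H = np.array([[1.0, 0.0]])           # we observe position only
        self.Q = q * np.eye(2)                    # process noise
        self.R = np.array([[r]])                  # measurement noise

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with measurement z
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]                          # fused position estimate
```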
The input to the computing device 70 is a set of greyscale or colored images (distorted and disconnected) along with depth data for estimating the 3D user pose. The images are distorted because the wide-angle stereo camera is placed in close proximity to the user's body, which makes the 3D pose estimation challenging. Another major limitation is that of occlusion combined with a limited field of view. For example, when standing, the upper body may occlude the lower body; when seated, the knees may occlude the feet.
As humans are articulated objects, many joints or key points, such as those at the wrist, elbow, and foot, can be invisible due to occlusion, which may persist across multiple frames and serve as a major source of error in accurate 3D pose estimation. There are also strong perspective distortions that result in a drastic difference in resolution between the lower and upper body. Moreover, the HMD 100, in the process of tracking the lower body, observes the user's own body from a viewpoint from which some body segments are not visible.
Thus, capturing full body motion from an egocentric camera perspective has multiple challenges. At the same time, the ability to estimate correct pose and user position has a promising end effect, as it can be successfully adopted in numerous AR/VR applications, especially skilling and training.
Since a right working posture is crucial for users immersed in prolonged hours of training and skilling sessions, it is imperative that the user is repeatedly guided to avoid awkward body positions which cause fatigue, reduce concentration and lead to poor performance, which may further require the task to be repeated. For example, learning to weld well, playing a sport stroke, practicing a dance move, meditating for mindfulness and awareness, and many other such daily rituals require correct dynamic as well as static posture.
Aiming to make use of virtual reality enabling HMDs to help the user improve his body posture and position, egocentric view capture is important while taking into account user interaction with surrounding objects and the environment. However, ambiguity caused by distortions and self-occlusions in the egocentric view has been a cause of concern for some while. In order to overcome this challenge, the present disclosure proposes a more user-contextual pose estimation that takes into account the constraints of the user's surrounding environment, also referred to here as "scene-constraints".
Accordingly, a depth map of the user surroundings is generated; to address the concerns of occlusion, a depth map including the user is first predicted, followed by deploying a network to recover the depth of the scene behind the user, as discussed in detail below.
At first, a user-surroundings-sensitive synthetic training dataset is generated which contains various daily motions and ground truth scene geometry. In one example embodiment, the ground truth scene geometry may be obtained with the Structure-from-Motion method from a multi-view capture system with high end image capturing units and IR stereo cameras, collectively referred to as the "egocentric camera". Further, the ground truth pose of the egocentric camera may be obtained by localizing a calibration board rigidly attached to the egocentric camera.
Another, in-the-wild, training dataset may be generated that reconstructs the dense background scene geometry around the user from the egocentric image sequences of the previous synthetic dataset using the Structure-from-Motion approach. Based on the reconstructed geometry, depth maps of the scene are rendered with and without the presence of the user. Inferring the scene depth information behind the user is important for generating plausible poses, for which a two-step method is adopted.
First, the depth map with high spatial resolution (e.g. precise boundaries) including the user is generated. This requires training a network that takes the egocentric image as input and outputs the depth map including the user. In one working embodiment, the network comprises an encoder for multi-scale feature extraction, a decoder for feature decoding, a multi-scale feature fusion module to fuse features of different resolutions extracted by the encoder, and a module for refining and integrating the features obtained from the decoder and multi-scale feature extraction for the final depth map prediction.
Next, a segmentation network is built and trained to estimate the segmentation mask of the user from the images captured by the camera system of the HMD. A masked depth map is created by combining the background segmentation and the depth map containing the user. This masked depth map and the segmentation mask (previously obtained) are fed as input to an inpainting network. Finally, this depth inpainting network is applied to recover the depth behind the user and produce the final depth map of the scene without the user.
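By way of a non-limiting illustration, the data flow of this two-step depth recovery may be sketched as follows. TinyUNet is a hypothetical placeholder standing in for the trained depth, segmentation and inpainting networks described above, and the 0.5 mask threshold is an illustrative assumption; only the wiring follows the description:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Placeholder encoder-decoder; a real network would be far deeper."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1))
    def forward(self, x):
        return self.body(x)

depth_net = TinyUNet(3, 1)    # step 1: depth map including the user
seg_net = TinyUNet(3, 1)      # user/background segmentation
inpaint_net = TinyUNet(2, 1)  # step 2: recover depth behind the user

def scene_depth_without_user(egocentric_rgb: torch.Tensor) -> torch.Tensor:
    """egocentric_rgb: (B, 3, H, W) image batch -> (B, 1, H, W) depth."""
    depth_with_user = depth_net(egocentric_rgb)
    user_mask = torch.sigmoid(seg_net(egocentric_rgb)) > 0.5
    masked_depth = depth_with_user * (~user_mask)       # zero out user pixels
    # the inpainting network sees the masked depth plus the mask itself
    return inpaint_net(torch.cat([masked_depth, user_mask.float()], dim=1))
```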
Following from the above, a plausible 3D pose is estimated from the estimated scene geometry and the features extracted from the input image captured by the egocentric camera of the HMD 100. To achieve this, body joint heatmaps are estimated to extract 2D body pose features. The depth map of the scene without the user (as obtained above), along with the body pose features, is projected into a 3D volumetric space.
After obtaining the volumetric representation of the user body features and scene depth, a plausible 3D body pose is predicted from the volumetric representation with a V2V network, i.e. a volumetric convolutional network. The specialized V2V network is chosen to make a distortion-invariant estimation, addressing the most commonly found lacuna when regressing 3D coordinates from a 2D image, which is a highly non-linear mapping. Hence, a voxel-to-voxel prediction is performed using a 3D voxelized grid to estimate the per-voxel likelihood of keypoints.
The 3D voxel representation, which projects the extracted 2D poses along with the depth information, provides a direct geometric connection between the 2D image features and the 3D scene geometry, which in turn facilitates the V2V network in learning the relative position and potential interactions between the user joints and the surrounding environment, enabling the prediction of plausible poses under the scene constraints.
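By way of a non-limiting illustration, the lifting of 2D joint heatmaps into a voxel volume using the user-free scene depth may be sketched as follows; the pinhole intrinsics, grid size and metric extent are illustrative assumptions:

```python
import numpy as np

def heatmaps_to_voxels(heatmaps, depth, fx=280.0, fy=280.0, cx=320.0, cy=240.0,
                       grid=(64, 64, 64), extent=4.0):
    """heatmaps: (J, H, W) per-joint probabilities; depth: (H, W) in metres.
    Returns a (J, D, D, D) heatmap-weighted voxel volume for a V2V network."""
    J, H, W = heatmaps.shape
    vol = np.zeros((J, *grid), dtype=np.float32)
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    # back-project every pixel into camera space using the depth map
    z = depth
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    # map metric coordinates into voxel indices within [-extent/2, extent/2]
    idx = lambda c, n: np.clip(((c / extent + 0.5) * n).astype(int), 0, n - 1)
    xi, yi, zi = idx(x, grid[0]), idx(y, grid[1]), idx(z - extent / 2, grid[2])
    for j in range(J):
        # accumulate each joint's heatmap probability into its voxel cell
        np.add.at(vol[j], (xi.ravel(), yi.ravel(), zi.ravel()),
                  heatmaps[j].ravel())
    return vol
```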
In one alternate embodiment, where the accuracy of the voxel-based pose estimation network is constrained by the accuracy of the estimated depth, for example in case of occlusions, temporal information can be leveraged to get a full view of the environment surrounding the user.
In another alternate embodiment, user poses that are invisible while capturing images from the egocentric point of view can be estimated using clues such as dynamic motion signatures and static scene structure. While the dynamic motion signatures for pose changes are resistant to scene changes (e.g. standing in whichever way possible), the static scene sets the context and offers a prior on likely poses.
The plausible pose estimation from the egocentric camera view of the HMD 100 is performed in a local coordinate space of the HMD. Obtaining the user pose with global position and orientation in the world coordinate system is necessary for many applications where local pose capture alone is not sufficient. For example, if the user is engaged in a skilling or training session, say performing a welding operation or trying to excel at a perfect yoga pose, and wishes the same to be verified by his teacher or expert, the captured local body poses of the user will not be enough to animate the locomotion of the user's virtual avatar in his expert's environment, which requires global poses.
In order to enable the pose estimates in the local coordinate system to be shared, viewed and analyzed by an instructor present at any other distant location, it is imperative to transform the pose estimates from the local to the global coordinate space. In one example embodiment, the virtual avatar of the HMD wearer, created from pose estimates in the local coordinate system, needs to be teleported to other environments, which requires global poses.
In accordance with one example embodiment, the local pose estimated via Simultaneous Localization and Mapping (SLAM) may be projected into the world coordinate system. This method, however, suffers from notable temporal jitters, occlusion, and depiction of unrealistic motions. In one preferable embodiment, accurate and temporally stable egocentric global 3D pose estimation may be achieved based on a spatio-temporal optimization framework that captures a motion prior with a convolution-based sequential variational autoencoder (VAE). Accordingly, a 1D convolutional neural network (CNN) is deployed to extract 2D and 3D keypoints, along with variational autoencoders (VAEs) for deriving motion priors that help produce realistic, smooth and physically plausible user motions.
In the given methodology, egocentric video is input and processed in segments comprising a fixed number of consecutive frames. Initial 3D poses and 2D heatmaps are then obtained using an egocentric local body pose estimation method. It is important to note that the possibility of errors in projecting 3D joints from estimated 2D joints (where 2D joint detection is prone to error for reasons of self-occlusion) is overcome by using 2D heatmaps. Thus, the uncertainty captured in the 2D heatmaps is leveraged to determine the probability of each pixel being a 2D joint.
Now, to compensate for accuracy, temporal instability, and perspective distortions, the local pose estimates are fed into the optimization framework to perform spatio-temporal optimization. Here, the optimization framework is devised based on learning a local pose prior as a latent space with a sequential VAE consisting of an encoder and a decoder, each designed as a 5-layer 1D convolutional network. While the encoder maps the input sequence of local poses to a latent vector, the decoder is used to reconstruct a pose sequence from the latent vector. In one alternate embodiment, RNN based VAEs can also be used; they however suffer from vanishing and exploding gradients, which make the optimization less stable.
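By way of a non-limiting illustration, such a sequential VAE with 5-layer 1D convolutional encoder and decoder may be sketched as follows; the channel widths, latent dimension and pose dimensionality are illustrative assumptions:

```python
import torch
import torch.nn as nn

def conv_stack(cin, cout, layers=5, hidden=128):
    """Build a stack of `layers` 1D convolutions over the time axis."""
    mods, c = [], cin
    for _ in range(layers - 1):
        mods += [nn.Conv1d(c, hidden, 3, padding=1), nn.ReLU()]
        c = hidden
    mods.append(nn.Conv1d(c, cout, 3, padding=1))
    return nn.Sequential(*mods)

class SeqVAE(nn.Module):
    """Sequential VAE over pose sequences of shape (batch, pose_dim, frames)."""
    def __init__(self, pose_dim=45, latent_dim=32):
        super().__init__()
        self.enc = conv_stack(pose_dim, 2 * latent_dim)  # outputs mu, logvar
        self.dec = conv_stack(latent_dim, pose_dim)

    def forward(self, poses):
        mu, logvar = self.enc(poses).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar
```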
This is followed by searching for a latent vector in the learned latent space that minimizes an objective function comprising a heatmap-based reprojection term, a pose regularization term, a motion smoothness regularization term and several other regularization terms, in order to obtain optimized local poses, as sketched below. These optimized local poses are combined with camera poses estimated from SLAM and transformed from the local egocentric camera space to the world coordinate space in the sequence of steps detailed below.
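By way of a non-limiting illustration, the latent-space search may be sketched as gradient descent over the latent vector, reusing the SeqVAE sketch above; the simplified energy terms, weights and optimizer settings are illustrative assumptions and omit several of the regularization terms mentioned above:

```python
import torch

def optimize_latent(vae, init_poses, energy_reproj, steps=200, lr=1e-2,
                    w_reg=1.0, w_smooth=0.1):
    """vae: trained SeqVAE; init_poses: (1, pose_dim, T) initial local poses;
    energy_reproj: callable mapping decoded poses to a scalar energy."""
    with torch.no_grad():
        mu, _ = vae.enc(init_poses).chunk(2, dim=1)   # start from encoded prior
    z = mu.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        poses = vae.dec(z)
        loss = (energy_reproj(poses)                                   # heatmap term
                + w_reg * ((poses - init_poses) ** 2).mean()           # pose regularization
                + w_smooth * ((poses[..., 1:] - poses[..., :-1]) ** 2).mean())  # smoothness
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vae.dec(z).detach()   # optimized local pose sequence
```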
First, a camera pose sequence is obtained from SLAM and the local pose sequence is then projected to an initial global body pose in global space. The initial global pose sequence is then optimized with a global pose optimizer, as simply combining local poses with camera poses will not achieve high-quality global poses since the previously obtained optimized local poses are not constrained to be consistent with the corresponding camera poses. The global pose optimizer obviates this inconsistency error using a sequential VAE, where a latent space is learned from global pose sequences and a latent vector is searched with the objective of minimizing the reprojection term along with the other regularization terms (as discussed above).
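By way of a non-limiting illustration, the projection of optimized local poses into world space using per-frame SLAM camera poses may be sketched as follows; the array shapes and the homogeneous-transform convention are illustrative assumptions:

```python
import numpy as np

def local_to_global(local_poses: np.ndarray, T_world_cam: np.ndarray):
    """local_poses: (T, J, 3) joints in egocentric camera space;
    T_world_cam: (T, 4, 4) camera-to-world transforms from SLAM.
    Returns (T, J, 3) joints in world space."""
    T, J, _ = local_poses.shape
    homo = np.concatenate([local_poses, np.ones((T, J, 1))], axis=-1)  # (T, J, 4)
    world = np.einsum('tij,tkj->tki', T_world_cam, homo)  # per-frame transform
    return world[..., :3]
```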
Furthermore, to reduce the error due to strong occlusion, an uncertainty-aware reprojection energy term is computed by summing up the probability values at the pixels on the heatmap occupied by the projection of the estimated 3D joints, rather than comparing the projection of the estimated 3D joints against the predicted 2D joint positions.
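By way of a non-limiting illustration, such an uncertainty-aware reprojection energy may be sketched as follows; the pinhole intrinsics are illustrative assumptions:

```python
import numpy as np

def reprojection_energy(joints_3d, heatmaps, fx=280.0, fy=280.0,
                        cx=320.0, cy=240.0):
    """joints_3d: (J, 3) estimated joints in camera space; heatmaps: (J, H, W).
    Sums the heatmap probability at each joint's projected pixel; its
    negative is minimized, so confident projections lower the energy."""
    J, H, W = heatmaps.shape
    u = np.clip((fx * joints_3d[:, 0] / joints_3d[:, 2] + cx).astype(int), 0, W - 1)
    v = np.clip((fy * joints_3d[:, 1] / joints_3d[:, 2] + cy).astype(int), 0, H - 1)
    prob = heatmaps[np.arange(J), v, u]   # probability at each projection
    return -np.sum(prob)
```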
In the next sequence of steps, authoritative coaching and training can be provided by an expert instructor who may be located far away from the user. This is critical for a user engaged in any complex skilling or training operation, as real time guidance and feedback will help the user improve his technique, learn proper poses and posture, and reduce the risk of injury. The instructor can then guide the user, verbally telling or showing which body part to move, in which direction and how far. Such instructions can also be displayed on the display of the HMD worn by the user.
Alternately, in another embodiment, the user can compare his pose and motion with a corresponding motion library (database), as he can place his 3D pose estimates adjacent to the poses or motions stored in the database. Most importantly, the egocentric view of his own body enables the user to practice his sessions anywhere, as the entire system is modular and portable, does not require any external camera set up (as in outside-in tracking), and is not tethered, which leaves movement unrestricted and enables the user to practice his natural postures uninhibitedly.
Accordingly, the transformed 3D pose estimate data, including the user motion data, is captured to be used in remote collaboration with a distant instructor, or it may be used by the user for self-analysis and comparison with a global motion database. In one specific embodiment, a vast library of probable and accurate motions for correct skill performance is maintained in global space, which serves as a ground truth for motion and pose analysis by the user.
Precisely, a degree of pose similarity between the user pose estimates and a standardized set of poses contained within the database is determined. Thereafter, pose inconsistency detection is performed to understand the deviation from the standard pose for the particular trade the user is practicing. Accordingly, instructions about the adjustments to be made to the user's pose are overlaid on the display of the HMD 100 worn by the user.
For example, the pose deviation may be calculated based on the difference in height or tilt of the user's 3D joints compared with those of the standard pose. The system can automatically compute that the shoulders are not aligned, the head is tilted, or the face is turned sideways to the left or right beyond a predetermined threshold, and accordingly trigger a pose correction warning with a recommendation to align the shoulders, bring the head back to a normal pose, or turn the head so the face returns to a regular pose. The system may issue a voice correction statement such as "please turn your head to the left" or may display the same message on the display of the HMD 100.
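By way of a non-limiting illustration, a threshold-based shoulder-alignment check of this kind may be sketched as follows; the joint indices and the 15-degree threshold are illustrative assumptions:

```python
import numpy as np

L_SHOULDER, R_SHOULDER = 5, 6   # hypothetical joint indices in the skeleton

def check_shoulder_alignment(user_pose, ref_pose, max_tilt_deg=15.0):
    """user_pose, ref_pose: (J, 3) joint positions in global space.
    Returns a correction message, or None if within tolerance."""
    def tilt(pose):
        # angle of the shoulder line above/below the horizontal plane
        d = pose[R_SHOULDER] - pose[L_SHOULDER]
        return np.degrees(np.arctan2(d[1], np.linalg.norm(d[[0, 2]])))
    deviation = abs(tilt(user_pose) - tilt(ref_pose))
    if deviation > max_tilt_deg:
        return f"Please level your shoulders (off by {deviation:.0f} degrees)."
    return None
```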
In another aspect of the present disclosure, a method 500 for enabling the user to enhance his skilling and training sessions using the head mounted device 100 is discussed, as shown in Fig. 2. Accordingly, a plurality of images or videos of the user surroundings are captured in step 501 by making use of the image capturing units 20(a), 20(b), 20(c), 20(d) provided on the four edges of the head mounted device 100. In step 502, an egocentric view of the user and the user surroundings is captured for generating the user-contextual scene. This is achieved by making use of a pair of wide-angle fisheye lenses optimally positioned on the lower frame 10(b) of the HMD 100 and having at least a 180 degree field of view.
In step 503, the one or more images or videos are acquired along with the egocentric view of the user-contextual scene. In step 504, the 3D egocentric user pose is estimated under constraints of the user-contextual scene in a local coordinate space of the head mounted device 100. Here, the user-contextual scene is computed by predicting a first depth map of high spatial resolution of the user along with the user surroundings and a second depth map of the user surroundings behind the user. Further, 2D pose features are extracted and combined with the second depth map for projection into a 3D volumetric space with a volumetric convolutional network to obtain egocentric 3D pose estimates in local space.
Next, in step 505, the 3D egocentric pose estimates are transformed from the local coordinate space to a global coordinate space using a spatio-temporal optimization framework. The spatio-temporal optimization framework captures a motion prior with a convolution-based sequential variational autoencoder (VAE).
In step 506, the transformed 3D egocentric pose estimates of the user are rendered for comparative analysis with standardized poses corresponding to the skill and training the user is performing. Finally, in step 507, the user is instructed on the head mounted device about the corrected pose, in an event the pose deviation from the standardized pose is determined to be beyond a predetermined threshold, for the user to follow and master.
In one exemplary, though non-limiting, example of the present disclosure, a spray painting skilling session is illustrated to facilitate understanding of the present application. Here the user is attempting to learn a spray painting skill, for example in a virtual reality (VR) environment. The user is configured with a head mounted display over which he receives instructions and guidelines for performing spray painting on a given virtual object in a VR world. The user is provided with real world tools, such as a spraying gun, in order to have a real, immersive experience of spray painting.
To begin with, the user is tracked completely via the self-tracking HMD he is wearing to understand parameters such as the angle at which the spray gun is held, the distance from which the paint is being sprayed, the proximity to the trigger with which the spray gun is held, the amount of paint the user is applying at a given time, his posture while performing the task, and the position and orientation of the head and shoulders. These parameters are critical not only for honing the skill of the user, but also for assuring that the user is not fatigued and does not develop muscle pain, frozen shoulder, cervical strain, sprains and other issues when practicing for prolonged hours.
These parameters are tracked by the self-tracking HMD, and computation is performed by the computing device, which derives the egocentric 3D pose estimates from the given parameters in local space. Now, if the user intends to have the process verified by a distantly located professional, or even to self-analyze his progress, it is pertinent that the egocentric 3D estimates be transformed into global space. Thus, with the spatio-temporal transformation from local to global space, the user has the opportunity to be personally instructed in real time by a professional of the field.
Alternately, a comprehensive database compiling the standardized ground truth pose and motion of a user performing the spray painting activity is maintained for benchmarking. The user's 3D pose estimates are rendered and compared against the ground truth pose for any deviations beyond a predetermined threshold. In an event the user is found tilting his head too much, squeezing his shoulders beyond what is required, pressing the gun too hard, or operating in any other inaccurate pose, instructions will be overlaid on the display of the HMD for pose correction. These instructions can either be by way of visuals showing the user how to spray in the correct pose, or alert notifications for pose correction by way of audio signals.
Thus, the user can practice for long hours and get aptly trained for the skill he is working upon. This comes without the hassles of a bulky HMD, a tethered headset, or the requirement of an external set up for perfect tracking. This all-in-one solution is beneficial for full body pose tracking that may be used in a wide range of AR/VR applications for manifold purposes.
In accordance with an embodiment, the head mounted device comprises a memory unit configured to store machine-readable instructions. The machine-readable instructions may be loaded into the memory unit from a non-transitory machine-readable medium, such as, but not limited to, CD-ROMs, DVD-ROMs and Flash Drives. Alternately, the machine-readable instructions may be loaded in the form of a computer software program into the memory unit. The memory unit in that manner may be selected from a group comprising EPROM, EEPROM and Flash memory. Further, a processor is operably connected with the memory unit. In various embodiments, the processor is one of, but not limited to, a general-purpose processor, an application specific integrated circuit (ASIC) and a field-programmable gate array (FPGA).
In general, the word "module," as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions written in a programming language, such as, for example, Java, C, or assembly. One or more software instructions in the modules may be embedded in firmware, such as an EPROM. It will be appreciated that modules may comprise connected logic units, such as gates and flip-flops, and may comprise programmable units, such as programmable gate arrays or processors. The modules described herein may be implemented as either software and/or hardware modules and may be stored in any type of computer-readable medium or other computer storage device.
Further, while one or more operations have been described as being performed by or otherwise related to certain modules, devices or entities, the operations may be performed by or otherwise related to any module, device or entity. As such, any function or operation that has been described as being performed by a module could alternatively be performed by a different server, by the cloud computing platform, or a combination thereof. It should be understood that the techniques of the present disclosure might be implemented using a variety of technologies. For example, the methods described herein may be implemented by a series of computer executable instructions residing on a suitable computer readable medium. Suitable computer readable media may include volatile (e.g., RAM) and/or non-volatile (e.g., ROM, disk) memory, carrier waves and transmission media. Exemplary carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet.
It should also be understood that, unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "controlling" or "obtaining" or "computing" or "storing" or "receiving" or "determining" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that processes and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Various modifications to these embodiments will be apparent to those skilled in the art from the description and the accompanying drawings. The principles associated with the various embodiments described herein may be applied to other embodiments. Therefore, the description is not intended to be limited to the embodiments shown with the accompanying drawings, but is to be accorded the broadest scope consistent with the principles and the novel and inventive features disclosed or suggested herein. Accordingly, the invention is anticipated to embrace all such alternatives, modifications, and variations that fall within the scope of the present invention.
CLAIMS
We Claim:
1) A head mounted device (100) for enhancing user skill and training, comprising:
a plurality of image capturing units (20) configured to capture one or more images or videos of user surroundings;
one or more wide-view optical element (60) configured to capture egocentric view of user and the user surroundings for generating user-contextual scene;
a computing device (70), in communication with the image capturing units (20) and wide-view optical element (60), configured to acquire the one or more images or the videos along with the egocentric view of the user-contextual scene, and further configured to:
estimate 3D egocentric user pose under constraints of the user-contextual scene in a local coordinate space of the head mounted device (100);
transform the 3D egocentric pose estimates from the local coordinate space to a global coordinate space;
render the transformed 3D egocentric pose estimates of the user for comparative analysis with standardized poses corresponding to the skill and training the user is performing; and
instruct on the head mounted device (100) a corrected pose, in an event of pose deviation from the standardized pose beyond a predetermined threshold, for the user to follow and master.
2) The head mounted device (100), as claimed in claim 1, wherein the plurality of image capturing units (20) are positioned on four corner edges of the head mounted device (100).
3) The head mounted device (100), as claimed in claim 1, wherein the wide-view optical element (60) comprises a pair of fisheye lenses having at least 180 degrees field of view and positioned at around 30 mm from the nose bridge of the head mounted device (100).
4) The head mounted device (100), as claimed in claim 1, wherein the computing device (70) is configured to acquire supplementary data comprising images and videos captured from at least one sensor selected from the group consisting of a microphone, an accelerometer, a gyroscope, an eye tracking sensor, a head-tracking sensor or inertial sensors for the 3D egocentric pose estimation.
5) The head mounted device (100), as claimed in claim 1, wherein the user-contextual scene is computed from predicting a first depth map of high spatial resolution of the user along with the user surrounding and a second depth map of the user surrounding behind the user.
6) The head mounted device (100), as claimed in claim 5, wherein 2D pose features are extracted and combined with the second depth map for projection into a 3D volumetric space with a volumetric convolutional network.
7) The head mounted device (100), as claimed in claim 1, wherein the transformation from the local coordinate space to the global coordinate space is achieved using a spatio-temporal optimization framework.
8) The head mounted device (100), as claimed in claim 7, wherein the spatio-temporal optimization framework captures a motion prior with a convolution-based sequential variational autoencoder (VAE).
9) The head mounted device (100), as claimed in claim 1, wherein a virtual avatar is generated from the 3D egocentric pose estimates that is rendered and streamed to a distant location for verification of poses by an instructor.
10) The head mounted device (100), as claimed in claim 1, wherein the instruction is displayed or the user is audibly notified via the head mounted device (100) to correct the pose.
11) A method for enhancing user skill and training using a head mounted device (100), wherein the method comprises:
capturing one or more images or videos of user surroundings;
capturing egocentric view of user and the user surroundings for generating user-contextual scene;
acquiring the one or more images or the videos along with the egocentric view of the user-contextual scene;
estimating 3D egocentric user pose under constraints of the user-contextual scene in a local coordinate space of the head mounted device (100);
transforming the 3D egocentric pose estimates from the local coordinate space to a global coordinate space;
rendering the transformed 3D egocentric pose estimates of the user for comparative analysis with standardized poses corresponding to the skill and training the user is performing; and
instructing on the head mounted device (100) a corrected pose, in an event of pose deviation from the standardized pose beyond a predetermined threshold, for the user to follow and master.
12) The method for enhancing user skill and training, as claimed in claim 11, wherein the one or more images or videos of the user surroundings are captured by one or more image capturing units (20) provided on the HMD (100).
13) The method for enhancing user skill and training, as claimed in claim 11, wherein the egocentric view of user and the user surroundings for generating user-contextual scene is achieved using a wide-view optical element comprising of a pair of fisheye lens (60) having at least 180 degrees field of view and positioned at around 30 mm from nose bridge of the head mounted device (100).
14) The method for enhancing user skill and training, as claimed in claim 11, further comprising acquiring supplementary data from images and videos captured from at least one sensor selected from the group consisting of a microphone, an accelerometer, a gyroscope, an eye tracking sensor, a head-tracking sensor or inertial sensors for the 3D egocentric pose estimation.
15) The method for enhancing user skill and training, as claimed in claim 11, wherein the user-contextual scene is computed from predicting a first depth map of high spatial resolution of the user along with the user surrounding and a second depth map of the user surrounding behind the user.
16) The method for enhancing user skill and training, as claimed in claim 15, wherein 2D pose features are extracted and combined with the second depth map for projection into a 3D volumetric space with a volumetric convolutional network.
17) The method for enhancing user skill and training, as claimed in claim 11, wherein the transformation from the local coordinate space to the global coordinate space is achieved using a spatio-temporal optimization framework.
18) The method for enhancing user skill and training, as claimed in claim 17, wherein the spatio-temporal optimization framework captures a motion prior with a convolution-based sequential variational autoencoder (VAE).
19) The method for enhancing user skill and training, as claimed in claim 18, wherein a virtual avatar is generated from the 3D egocentric pose estimates that is rendered and streamed to a distant location for verification of poses by an instructor.
20) The method for enhancing user skill and training, as claimed in claim 11, wherein the instruction is displayed or the user is audibly notified via the head mounted device to correct the pose.