
Multimodal Input For Extended Reality Systems

Abstract: The present invention is directed towards a smart, untethered and lightweight wearable head mounted device capable of taking input from different modalities to enable interaction with a virtual keypad within an extended reality environment. The head mounted device comprises a unique combination of eye tracking elements and a voice recognition module, placed in a configuration that is processed to provide eye-swipe based input to the text field of the virtual keypad. At various instances, the eye gaze based input is complemented by voice input derived from the user's speech utterance, which helps the user interact with the virtual world in the most natural and intuitive way without compromising typing speed or accuracy.


Patent Information

Application #
Filing Date
07 March 2024
Publication Number
12/2025
Publication Type
INA
Invention Field
COMPUTER SCIENCE
Status
Parent Application

Applicants

Dimension NXG Pvt. Ltd.
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604

Inventors

1. Abhishek Tomar
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604
2. Abhijit Patil
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604
3. Pankaj Raut
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604
4. Purwa Rathi
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604
5. Yukti Suri
527 & 528, 5th floor, Lodha Supremus 2, Road no.22, near new passport office, Wagle Estate, Thane West, Maharashtra, India- 400604

Specification

FIELD OF THE INVENTION
Embodiments of the present invention relate to multimodal input for interacting in a virtual environment, and more particularly to hands-free text typing in a virtual environment based on a combination of gaze modality and speech utterance.
BACKGROUND OF THE INVENTION
Effective and usable text typing poses a formidable challenge while using interactive devices, especially head mounted displays for experiencing augmented/virtual reality. Quite evidently, typing in such interactive devices is challenging, as the whole experience of virtual typing depends on a combination of physical factors: mapping of fingers to keys, the ability to prepare hands for the next keystroke, the amount of hand motion required for a particular finger-key mapping, and the like. Primarily, arm and physical fatigue caused by tapping on a free-hand, mid-air keyboard, together with visual occlusion and the absence of haptic feedback, results in a poor typing experience when one tries annotating the real-world environment.
Many existing devices have focused on the use of controllers as input devices along with head tracking for text entry. While intuitive, this approach is prone to errors because of inaccurate hand tracking and user fatigue, which also led to the exploration of imaginary keyboards. However, the difficulty of obtaining feedback when pressing virtual keys and of finding an appropriate distance between the hand and the virtual keyboard was observed as a major impediment to the quick adoption of imaginary keyboards as a viable alternative.
One proposed solution has attempted eye gaze based tracking for free-hand text entry. However, the solution is not truly eye gaze based, as the input requires the additional modality of a manual pointer/trigger aligned with the gaze at each key. The need for manual intervention as feedback to trigger selection remains an obstruction: despite the user's focused gaze, the user is required to engage his fingers to invoke a selection command. Besides, there is an increased error rate, as the eyes can glance rapidly over the virtual keyboard, which a manual cursor finds difficult to keep up with. This eventually leads to accidental selection of letters and incorrect message text.
Further, when the user is holding one or more tools, equipment or interactive objects in his hand, it is difficult to perform the hand gesture that usually serves to confirm the gaze selection as an explicit delimiter and thereby avoid the Midas touch issue, i.e. the accidental activation of everything one sees while texting between the keyboard space and the text field, or otherwise. Hence, reliance on a hand gesture as the delimiter needs to be restricted, since the user is otherwise unable to perform any task with his hands other than assisting gaze typing.
Furthermore, the user, while looking at a key for a predetermined threshold, also needs to point towards it in order to enter the character in the text field. This is both cumbersome and time-consuming for a user who, over the years, has become adept at a fast pace of texting - almost 50 words per minute (WPM). Additionally, with solutions based on hand tracking, it is pertinent to note that people wearing the same headset may still wear it differently, which makes the position and angle of the camera with respect to the keyboard vary continuously, and the overall texting experience error prone.
In another known approach, a gaze-based swipe-and-switch mechanism has been adopted. The solution eliminated the need for any hand gesture and managed to differentiate between gaze control and gaze perception with distinct and decoupled text, gesture and action regions. However, each time the user chooses to switch between words or context, he has to traverse his eyes through an action field, which makes the overall task straining and less fluid.
Notably, besides presenting characteristically unique problems, each of these modes shares a common problem of low typing speed. On average, 40 words per minute (WPM) is considered the typical typing speed for English speakers, which is reduced to about 30 WPM while typing in AR. Though essential for quick mass adoption of head mounted devices, the technique of interaction with such devices is still underdeveloped.
Using hand gestures alone for displaying and correcting a large amount of text is further challenging, given the current HMD limitations of a narrow field of view and imprecise hand tracking. Likewise, using eye gaze and dwell time alone will put immense strain on the user's eyes, besides causing dizziness from excessive eye focus on typing in the right set of words. This establishes the case for a multi-modal input interface that can address the above concerns and achieve an immersive state with respect to usability, user experience, load, fatigue and text-entry performance.
With none of the above discussed configurations and combinations capable of providing a text typing solution that is natural, intuitive and ergonomically well designed, the present disclosure attempts to provide an interaction interface that offers reasonable typing speed, ease of typing and better tactile feedback for minimising typing errors. The disclosure further addresses the demand for a viable alternative to mid-air tapping, or to the use of hand gestures or eye gaze alone, for interacting with virtual objects, which can substantially reduce physical discomfort and fatigue without any trade-off between error rates and the number of words typed per minute.
The proposed disclosure may address one or more of the challenges or needs mentioned herein, as well as provide other benefits and advantages. The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention, and therefore, it may contain information that does not form prior art.
OBJECT OF THE INVENTION
An object of the present invention is to provide a high-performance, portable and light weight device and an advanced multimodal input method for enabling typing and texting in virtual/augmented environment.
Another object of the present invention is to provide a robust and accurate device and an advanced method that provides a good balance between high text entry performance and reduction of physical movement while engaging with virtual keypad in virtual/augmented environment.
Yet another object of the present invention is to provide a natural, convenient, and immersive interactive experience to enable ease of typing and providing input in extended reality with combination of eye gaze and voice modalities.
Yet another object of the present invention is to provide a reusable, reliable and high-quality text input solution that allows user to interact with virtual keyboard in a most naturally intuitive way.
Yet another object of the present invention is to provide safe, untethered and privacy sensitive device that enables private communication with virtual/augmented world using combined input modalities of eye gaze and speech utterance.
Yet another object of the present invention is to provide an easy to use and comfortable to adopt user wearable device configured with a silent speech interface and accurate eye trackers that enables seamless typing of text on a virtual keyboard.
SUMMARY OF THE INVENTION
This section is for the purpose of summarizing some aspects of the present invention and to briefly introduce some preferred embodiments. Simplifications or omissions may be made to avoid obscuring the purpose of the section. Such simplifications or omissions are not intended to limit the scope of the present invention.
In a first aspect of the disclosure, a system comprising a multi-modal input interface for enabling user interaction within an extended reality environment is disclosed. Here, the system comprises a head mounted device comprising one or more eye tracking elements configured to track eye gaze for selecting or engaging one or more virtual elements within the extended reality environment. The system further comprises a voice recognition module configured to capture non-audible murmur data uttered by the user; and a processing module configured to receive the eye gaze data to analyse user behaviour, predict user intent and provide contextual prompts based on the user behaviour and the user intent, and to receive the non-audible murmur data to complement the eye gaze data and enable interaction with the one or more virtual elements.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope; the invention may admit to other equally effective embodiments.
These and other features, benefits and advantages of the present invention will become apparent by reference to the following text and figures, with like reference numbers referring to like structures across the views, wherein:
Fig. 1 illustrates a head mounted device configured with eye trackers and a voice recognition module to provide input for a virtual keyboard, in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
While the present invention is described herein by way of example using embodiments and illustrative drawings, those skilled in the art will recognize that the invention is not limited to the embodiments or drawings described, which are not intended to represent the scale of the various components. Further, some components that may form a part of the invention may not be illustrated in certain figures, for ease of illustration, and such omissions do not limit the embodiments outlined in any way. It should be understood that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the scope of the present invention as defined by the appended claims. As used throughout this description, the words "may", "at least", "greater than", "greater than or equal to", "about", and "approximately" are used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). For example, "about" may mean within 1 or more than 1 standard deviation, or may mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Where particular values are described in the application and claims, unless otherwise stated, the term "about" may be assumed to mean within an acceptable error range for the particular value.
Further, the words "a" or "an" mean "at least one” and the word “plurality” means “one or more” unless otherwise mentioned. Furthermore, the terminology and phraseology used herein is solely used for descriptive purposes and should not be construed as limiting in scope. Language such as "including," "comprising," "having," "containing," or "involving," and variations thereof, is intended to be broad and encompass the subject matter listed thereafter, equivalents, and additional subject matter not recited, and is not intended to exclude other additives, components, integers or steps. Likewise, the term "comprising" is considered synonymous with the terms "including" or "containing" for applicable legal purposes. Any discussion of documents, acts, materials, devices, articles, and the like are included in the specification solely for the purpose of providing a context for the present invention. It is not suggested or represented that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention.
Reference will now be made to embodiments, examples of which are illustrated in the accompanying drawings. In the following description, numerous specific details are set forth in order to provide an understanding of the various described embodiments. However, it will be apparent to one of ordinary skill in the art that the various described embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. In this disclosure, whenever a composition or an element or a group of elements is preceded with the transitional phrase "comprising", it is understood that we also contemplate the same composition, element or group of elements with the transitional phrases "consisting of", "consisting", "selected from the group consisting of", "including", or "is" preceding the recitation of the composition, element or group of elements, and vice versa.
The present invention is described hereinafter by various embodiments with reference to the accompanying drawings, wherein reference numerals used in the accompanying drawing correspond to the like elements throughout the description.
This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiment set forth herein. Rather, the embodiment is provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art. In the following detailed description, numeric values and ranges are provided for various aspects of the implementations described. These values and ranges are to be treated as examples only and are not intended to limit the scope of the claims. In addition, a number of materials are identified as suitable for various facets of the implementations. These materials are to be treated as exemplary and are not intended to limit the scope of the invention.
For the purposes of the present invention, terms such as "non-audible murmur", "silent speech", "subvocal speech", "murmur", "utterances" and the like may be used interchangeably, as they all denote feeble speech produced without much vocal vibration, and shall not be construed in their literal interpretation only.
Text input is part of everyday life, whether taking notes, chatting, browsing the web, commenting on a social media platform, or providing user login credentials. Almost any interaction with a smart device involves typing in text instructions from the user. Typing is the preferred input mode because it provides clarity of instructions while preserving user privacy, which is difficult to maintain with spoken voice instructions, especially in noisy or crowded environments.
However, typing on a virtual keyboard for interacting with augmented/virtual reality devices is painful, as the most common challenges of text entry performance include gorilla-arm fatigue, visual occlusion, resolution, registration errors, plane interference, eye fatigue, hand restrictions, inaccurate controller tracking, hand and/or eye tracking, and the like.
To overcome said limitations, the present disclosure, as shown in Fig. 1, has devised a system 1000 comprising a multi-modal input interface 500 that can engage with such head mounted devices 100 with ease and in a naturally intuitive way, with minimal error rate and maximized typing speed in real time. Various features are supported for the text entry interface, such as text correction, word suggestion, capitalization, editing etc., which greatly expands the way we work, thereby extending and enhancing our physical workspace.
Previously, a dwell time of approximately 300 ms of user gaze fixated on the target key over the virtual keyboard was used to trigger the selection and obtain error-free typing. But the dwell time of 300 ms also decreased the text entry rate. The eyes are quickly on target, but entry in the text field happens only after the lapse of the fixed 300 ms dwell time. This time between the gaze fixating on a key and the selection of the corresponding key needs to be reduced. Here, eye fixations are received as a continuous stream of input data and are interpreted on a real time basis.
The present disclosure devises a head mounted device comprising a unique combination of selective sensors that provide input for typing, texting and interacting with a virtual keyboard overlaid in a virtual/augmented/extended reality environment. The main sub-task of text entry is the selection of each button on the keyboard, which the present solution proposes to perform using the combined input modalities of user gaze and user voice, found to be the most natural and intuitive manner of communication.
Further, it has been empirically established that it is unnatural to overload the visual perceptual system, i.e. the eyes, with motor activities such as interaction with computer systems by way of menu selection, scrolling, giving prompts or direct manipulation, which is customarily achieved using an electromechanical mouse. At the same time, eye gaze is preferably used for direct manipulation, leveraging the fact that a mental image is formed first in the user's mind, followed by the consequential act of pointing at or highlighting the corresponding text by motor means, such as the hands, using an electromechanical device like a mouse.
Thus, eye gaze is particularly chosen as one of the inputs to speed up the process of manipulating (interacting with, selecting and referencing) objects in extended reality for the purposes of the present invention. But, owing to the nature of the eyes and the strain they may experience when trying to accomplish a large number of tasks over long periods of time, the eye gaze input is complemented with another suitable input modality to create the parallelism required in a natural and intuitive way of human communication. Furthermore, the eyes, with their saccadic motion, find it difficult to remain fixated on an object of interest, which makes the entire effort counterproductive. As the eye continues to move about the object of interest, it becomes frustrating for the user to give a command when the intent is to pause the gaze, e.g. for tapping, double clicking on the same object in continuity, or drag and drop functionality.
Apropos, the present solution introduces a combination of gaze swipe reinforced with a speech recognition based, error-tolerant input mechanism, in which eye fixations and user speech utterances are selected as input modalities for user interaction with the virtual keypad in the virtual environment. This combination is important as speech is the most natural way for humans to express their ideas. It thus enables the user to perform all kinds of object manipulation operations in the virtual world and makes the system 1000 more accessible in ways that are otherwise not effectively and non-strenuously possible using eye gaze as the sole input modality.
The selection of gaze swipe along with voice recognition has been consciously chosen amid a plurality of input modalities and is not a mere extension of something existing in the art. Since the information content of speech is conveyed to the user exactly, without much distortion, it is mindfully selected as the other input modality in addition to eye gaze. Thus, as gaze and voice are understood to be the most natural user interfaces for achieving high speed input with minimal scope for error, the present solution leverages the advances made in the fields of speech recognition and eye gaze using deep neural network (DNN) engines. However, as there are concerns with using the voice modality for reasons of privacy, security and even annoyance in public settings, the present solution explores silent speech input such as murmurs, ultra-small utterances, soft whispers or any speech spoken below a threshold decibel value, which can be referred to as non-audible murmur (NAM). Most available voice recognition systems work well for normal speech but not for other kinds of voice output. Additionally, whispering for a long time can adversely impact the vocal cords. In order to overcome these limitations, the combination of two input modalities - eye gaze and speech (non-audible murmur) recognition - is proposed, which serves well the purpose of speed typing with enhanced accuracy.
Apropos, in one example embodiment, the system 1000 is configured with a head mounted device 100 studded with eye tracking elements 150, such as RGB cameras, for tracking user gaze, eye movements, eye blinks, dwell time or eye pose; a voice recognition module 200 to capture the user's non-audible murmur; and a processing module 400 to process the received eye gaze data and non-audible murmur data. The head mounted device 100 uses the eye tracking elements 150 to track the user's eye trace word paths, eye blinks, saccadic movement, dwell time and eyeball movements, and the voice recognition module 200 captures the non-audible murmur, which is fed as input to the processing module 400 configured with a language model to express the user's language regularities and constraints, and a recognition feature to recognize and input the word into the text field.
For example, the head mounted device 100 is configured with eye tracking elements 150 that may include one or more light sensor(s) and/or one or more image sensor(s), configured to capture images of the user's eyes, for example, a particular portion of the user's eyes, such as, for example, the pupil, eye ball movements, pattern and duration of user eye blinks. The captured images may be processed to detect and track direction and movement of the user's eye gaze, and the detected and tracked eye gaze as eye swipe may be processed as a gaze gesture corresponding to a user input to be translated into a corresponding interaction in the immersive virtual experience.
The head mounted device 100 allows a user to input text to the system 1000 using a first input mode (e.g., eye gaze, eyeball movement, pattern and duration of eye blinks, dwell time, pupil dilation as input) for selecting the keys on the virtual keyboard and a second, different input mode (voice input) for editing, correction, or whenever the user triggers an action for gathering voice input via the voice recognition module 200. In order to start and stop the selection of words by user eye gaze, a prompt is provided by an entry tab on the virtual keyboard. As soon as the user's gaze is fixated on the entry tab for a minimal threshold time, a connection is established that activates the text selection command. The fixation of user gaze corresponds to the foveal region of the eyes on which light falls. Further, the user interface operation may be controlled by combined eye gaze and voice commands to perform operations such as select, zoom, pause, scroll, play, tap, double click, drag and drop, etc.
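By way of a non-limiting illustration, the following Python sketch shows how the entry-tab dwell activation and the primary/secondary mode arrangement described above might be handled in software. The 250 ms threshold, the state names and the class itself are assumptions made only for illustration and do not form part of the claimed subject matter.

```python
from dataclasses import dataclass

ENTRY_TAB_DWELL_MS = 250  # assumed "minimal threshold time" for fixating on the entry tab


@dataclass
class GazeSample:
    target: str      # identifier of the virtual element under the gaze, e.g. "entry_tab"
    dwell_ms: float  # accumulated fixation time on that element


class InputModeController:
    """Toggles text selection when the gaze dwells on the entry tab (illustrative only)."""

    def __init__(self):
        self.text_selection_active = False
        self.mode = "eye_gaze"  # primary mode; "voice" is the secondary mode

    def on_gaze(self, sample: GazeSample) -> None:
        # Fixation on the entry tab for the minimal threshold time toggles selection.
        if sample.target == "entry_tab" and sample.dwell_ms >= ENTRY_TAB_DWELL_MS:
            self.text_selection_active = not self.text_selection_active

    def on_voice_trigger(self) -> None:
        # The user (or the system) may switch to the secondary voice mode
        # for editing and correction.
        self.mode = "voice"
```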
This prevents the Midas touch problem associated with using eye gaze as the only input modality, which refers to the selection of multiple objects over which the eye rolls as eye movement happens constantly, even though this was not necessarily intended as system input. Even when gaze enables selection, it is not a natural trigger for beginning or ending a selection command. Thus, a voice command by way of non-audible murmur is leveraged to enrich this interaction and trigger a variety of actions associated with engagement with virtual object(s).
Accordingly, the voice utterance can be used as a cursor positioning tool or for scrolling the interface, while the eye glance or swipe can be used for word selection, tapping, double click and zoom. In addition, the combination of eye gaze data and non-audible murmur can be used for the drag and drop feature and for confirming/validating eye gaze related gestures to improve system accuracy. The combination of eye gaze data and NAM data for engaging within the extended reality environment enables an intuitive, non-cumbersome engagement with the virtual elements. For example, in order to position the cursor or scroll over the interface, using eye gaze alone for initial cursor position control creates a positive feedback error, in which the location of the cursor is always slightly offset from the target word, as it is difficult to keep the head steady. This leads to selecting the murmur as the initial command for controlling the cursor. Thus, for different levels of engagement, the multimodal inputs are utilized in a unique manner, as explained in greater detail below.
For example, a voice utterance or murmur captured by the voice recognition module 200 can prompt a start/enter command to position the cursor and begin typing in the first instance. Next, once the cursor is aptly positioned, word selection is initiated with an eye glance or swipe over the virtual keyboard, wherein a predetermined dwell time threshold of 200 ms-300 ms may be chosen to select characters on the screen. For example, all the characters that the eyes have traversed for 200-300 ms are selected and entered in the text field. In case the word is not yet completed, a list of probable words likely to match the typed-in characters is prompted in the select field, which, when selected by eye gaze, is typed into the text field. The character fixation or dwell time, in one preferred embodiment, is chosen as 250 milliseconds because, when the eye is fixated, it usually rests for almost 250 milliseconds on a content word and takes in a span of 7-9 letters to the right of the fixation before moving to the next word.
Hence, in order to increase the speed of typing or engagement with virtual elements within the extended reality environment, the list of probable words matching the first few characters eye-swiped by the user is prompted in the text field by the processing module 400 after 250 milliseconds from the beginning of the dwell time, to help the user make a quick selection. The term fixation describes the act of fixating the fovea on a given spot to visually encode a word. For fast readers/typers, the eye remains relatively still for at least 60-80 ms; for an average user, fixation is approximately 250 ms. Hence, the threshold of 250 ms is selected as the character fixation time for the purposes of the present disclosure.
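A minimal, non-limiting Python sketch of the dwell-based character selection and word prompting described above is given below. The 250 ms threshold is taken from the disclosure; the toy vocabulary, function names and gaze-trace format are assumptions for illustration only.

```python
CHAR_FIXATION_MS = 250  # character fixation threshold described above

# Small illustrative vocabulary; a real system would use a full language model.
VOCABULARY = ["greetings", "great", "green", "hello", "help"]


def select_characters(gaze_trace):
    """Return characters whose accumulated dwell time meets the fixation threshold.

    gaze_trace: list of (character, dwell_ms) tuples produced by the eye tracker.
    """
    return [ch for ch, dwell_ms in gaze_trace if dwell_ms >= CHAR_FIXATION_MS]


def prompt_words(prefix_chars):
    """Prompt probable words matching the characters eye-swiped so far."""
    prefix = "".join(prefix_chars).lower()
    return [w for w in VOCABULARY if w.startswith(prefix)]


# Example: the eyes traversed 'g', 'r', 'e' with sufficient dwell on each key.
chars = select_characters([("g", 260), ("r", 255), ("x", 40), ("e", 270)])
print(prompt_words(chars))  # ['greetings', 'great', 'green']
```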
In accordance with one working embodiment, for slow readers who have longer fixation durations and shorter saccades (rapid eye movements that occur in between fixations) than average readers, the device is capable of dynamically adjusting the threshold time for user fixation in real time. To achieve this, a user profile is first created based on the user's speech recognition, eyeball movement, dwell time, gaze characteristics and other added parameters.
For profile creation, the user may then be presented with a small paragraph containing 3-4 lines of text to determine the user's reading speed, mean eye gaze fixation duration, forward saccade amplitude and average proportion of regressions (perceptual scans in the reverse direction) during standard text passage reading. The recorded speed is chosen as a parameter to determine whether a threshold of more than or less than 250 ms should be chosen for character fixation by the user's gaze. Alternatively, the user may also select the threshold time based on his requirement at a particular instance. This feature is characteristically important for achieving a typing speed based on user preference and profile, and accommodates user needs in varying situations.
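The following is a minimal sketch of how the per-user threshold adjustment described above might be computed from the calibration passage. The specific cut-off values and the 200/300 ms alternative thresholds are assumptions chosen purely for illustration; the disclosure only fixes the 250 ms default.

```python
DEFAULT_FIXATION_MS = 250       # average-reader fixation described above
FAST_READER_FIXATION_MS = 200   # assumed lower bound for fast readers
SLOW_READER_FIXATION_MS = 300   # assumed upper bound for slow readers


def calibrate_fixation_threshold(mean_fixation_ms, saccade_amplitude_deg, regression_ratio):
    """Pick a per-user character fixation threshold from the calibration passage.

    Longer fixations, shorter saccades and more regressions indicate a slower reader,
    so the threshold is relaxed upwards; the cut-offs below are illustrative only.
    """
    if mean_fixation_ms > 280 or saccade_amplitude_deg < 2.0 or regression_ratio > 0.15:
        return SLOW_READER_FIXATION_MS
    if mean_fixation_ms < 200 and regression_ratio < 0.05:
        return FAST_READER_FIXATION_MS
    return DEFAULT_FIXATION_MS


print(calibrate_fixation_threshold(310, 1.8, 0.20))  # 300 (slow reader)
print(calibrate_fixation_threshold(190, 4.5, 0.03))  # 200 (fast reader)
```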
In order to repeat a character in a text string, such as, for example, the adjacent occurrence of the letter "e" in the word "Greetings", the user may maintain the gaze input on the virtual key at that position for a set amount of time, until the first occurrence of the letter "e" appears in the text field, followed by an instant blink. To enter the second occurrence of the same letter "e" (in the word "Greetings" in this example), the user may respond with a gaze over the same letter followed by a prolonged eye blink of more than 200 ms for the registration of the same character twice. Thus, the detected eye blink event for a predetermined time period is recorded as input of the previous character in the text field.
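A short illustrative sketch of the blink-based character repetition described above follows. The 200 ms prolonged-blink threshold comes from the disclosure; the function name and text-field representation are assumptions for illustration.

```python
PROLONGED_BLINK_MS = 200  # a blink longer than this registers the previous character again


def handle_blink(text_field, last_char, blink_duration_ms):
    """Append the previous character again on a prolonged blink, as described above."""
    if last_char and blink_duration_ms > PROLONGED_BLINK_MS:
        text_field.append(last_char)  # second occurrence, e.g. the second 'e' in "Greetings"
    return text_field


# Example: after an instant blink entered the first 'e', a prolonged blink repeats it.
print(handle_blink(list("Gre"), "e", blink_duration_ms=230))  # ['G', 'r', 'e', 'e']
```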
Next, in order to complete the text entry and indicate completion of a word, phrase and the like, the user may then switch the gaze to the space bar provided on the virtual keypad. After the user has eye-typed for more than 10 minutes, or in the event of eye strain, which can be identified from teary eyes, pupil dilation or strained eye muscle movement, the user may be instructed by the processing module 400 to switch to voice input to prevent the eyes from over-straining. The user may voluntarily choose to switch the input modality to voice, or may even continue with eye gaze to maintain privacy.
Thus, in this example implementation, a part of the entry in the text field is generated in response to eye gaze input, and another part of the entry is generated in response to a voice input, giving a more accurate, fast and natural way of interacting with the virtual keyboard. The user may whisper a faint "hmm" or perform a subtle NAM action to confirm the selection, instead of being forced to stare at an option for selection.
Further, in the event of false recognition or an incorrect entry, the solution provides an auto-text correction feature for correcting erroneous text and introducing an alternative group of likely words that have the highest probability of being placed in the given context. Thus, by using an auto-correction algorithm, the faulty input sequence can be improved and the error rate significantly reduced to 1%. Alternatively, the user can speak or spell the word to speed-type, which is error tolerant and can still assist in maintaining privacy, as the volume can be kept much lower than audible thresholds, as in a murmur, utterance or whisper.
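The disclosure does not specify a particular auto-correction algorithm, so the sketch below shows just one simple way such a step could be realised: ranking vocabulary candidates by Levenshtein edit distance. The vocabulary, function names and ranking criterion are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, used here to rank correction candidates."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]


def autocorrect(typed: str, vocabulary, max_suggestions=3):
    """Return the likeliest replacement words for an erroneous gaze-typed entry."""
    return sorted(vocabulary, key=lambda w: edit_distance(typed.lower(), w))[:max_suggestions]


# 'greetings' has the smallest edit distance from the faulty entry and ranks first.
print(autocorrect("grretings", ["greetings", "ratings", "settings", "meetings"]))
```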
In one working embodiment, the opted virtual keyboard has a layout similar to physical keyboard layouts, the most prominently used layout being QWERTY, which takes advantage of prior knowledge and minimizes the effort of learning. Accordingly, the present disclosure endeavours to augment typing capability in an AR/VR device using a combination of gaze and voice input modalities. In some implementations, the head mounted device 100 may anticipate or predict the user's intended entry based on, for example, characters already entered, word(s) and/or phrase(s) preceding the current text entry, the user's usage history, and other such factors, and may display a list of probable words including one or more recommended or suggested entries.
In this situation, instead of completing all of the steps discussed above to complete entry of the word, the user may re-direct the eye gaze and drag, or swipe, along the virtual keypad to the desired entry in the probable list to complete entry of the desired word in the text field.
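To illustrate the prediction of the intended entry from preceding words and usage history mentioned above, here is a toy bigram-frequency predictor. It is a sketch under assumed simplifications (a bigram model and an in-memory history), not the actual prediction model of the head mounted device.

```python
from collections import Counter, defaultdict


class WordPredictor:
    """Toy next-word predictor built from usage history (illustrative only)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)

    def train(self, sentences):
        for sentence in sentences:
            words = sentence.lower().split()
            for prev, nxt in zip(words, words[1:]):
                self.bigrams[prev][nxt] += 1

    def suggest(self, previous_word, prefix="", k=3):
        # Rank candidates that follow the previous word and match the typed prefix.
        candidates = self.bigrams[previous_word.lower()]
        matches = [w for w, _ in candidates.most_common() if w.startswith(prefix)]
        return matches[:k]


predictor = WordPredictor()
predictor.train(["good morning team", "good morning everyone", "good evening team"])
print(predictor.suggest("good", prefix="m"))  # ['morning']
```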
In accordance with one preferred embodiment, to optimize object manipulation or interaction with virtual elements in the extended reality space, the system 1000 uses as input the gaze data from the eye tracking elements 150, such as eye movement metrics including saccadic movement, dwell time, eye blinks and pupil dilation, along with the interpretation of non-audible murmur from the voice recognition module 200, to analyse human behaviour, predict intent and provide contextual prompt words based on the analysed human behaviour and user intent, in order to speed up interaction.
As the dwell time describes the duration of gaze fixation on any object, it can be used to understand user interest levels and selection intent. Likewise, saccadic movements are quick shifts in gaze that can be indicative of the user's scanning behaviour and used to predict the next focus area. Next, the blink rate and pupil dilation are measures of cognitive load and engagement level, which define and adjust user interface responsiveness and word suggestion sensitivity. In one exemplary embodiment, the saccadic movement, blink rate and pupil dilation are analysed by the processing module to estimate user behaviour.
Thus, once the raw gaze data from the eye tracking elements 150 is obtained, the processing module 400 extracts features such as fixation duration and frequency, saccade speed and amplitude, and pupil dilation under different conditions, along with reading flow, as explained in an earlier part of the description and omitted here for brevity. These features are labelled based on user interaction success rates, e.g. whether the prompt was successful. In accordance with one working embodiment, using a supervised learning approach, input features such as eye movement metrics and previous prompt interactions may be mapped to predict optimal prompt words based on gaze behaviour. Apropos, machine learning models such as Random Forests/Gradient Boosting may be utilized for interpretable user profiling, a Long Short Term Memory (LSTM) based Recurrent Neural Network (RNN) may be used for time-series eye movement analysis, and Reinforcement Learning may be used to personalize interactions over time.
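By way of a non-limiting illustration, the sketch below maps extracted gaze features to prompt-success labels with a Random Forest, one of the interpretable user-profiling models named above. The feature ordering, the synthetic training data and the use of scikit-learn are assumptions made solely for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row is one interaction: [fixation_ms, fixation_freq_per_s, saccade_speed_deg_s,
# saccade_amplitude_deg, pupil_dilation_mm]; labels mark whether the offered prompt
# was accepted (1) or not (0). Values are synthetic placeholders for illustration.
X = np.array([
    [260, 3.1, 300, 4.2, 0.42],
    [180, 5.0, 450, 6.1, 0.31],
    [310, 2.4, 250, 3.0, 0.55],
    [150, 6.2, 500, 7.3, 0.28],
])
y = np.array([1, 0, 1, 0])

# Interpretable user-profiling model (Random Forest), as suggested above.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict([[270, 3.0, 320, 4.0, 0.45]]))  # predicted class: prompt useful (1)
```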
In parallel, non-audible murmur (NAM) based commands use soft murmurs for initiating or closing a selection, scrolling, and privacy activation, and are also activated as the eye muscles begin to tire from overuse. Such eye strain can be inferred from watery eyes, a high blink rate, pupil dilation and the like. Thus, NAM intensity and duration are captured to trigger commands silently based on NAM cues. For example, a soft "hmm" may confirm a suggested word selection, a gentle "uhh" may enable scrolling through options without moving the eyes, a light exhale may be read as closing a window or dismissing a notification, and a long "mmm" may activate a privacy mode to blur sensitive content.
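A minimal sketch of the NAM cue-to-command mapping described above is shown below. The cue-to-command pairs follow the examples in the text; the duration cut-off and function names are assumptions for illustration.

```python
# Illustrative mapping of recognised NAM cues to interface commands.
NAM_COMMANDS = {
    "hmm": "confirm_selection",
    "uhh": "scroll_options",
    "exhale": "dismiss_notification",
    "mmm": "privacy_mode",
}


def interpret_nam(cue: str, duration_ms: float) -> str:
    """Map a recognised NAM cue (and its duration) to a command."""
    if cue == "mmm" and duration_ms < 500:  # privacy mode requires a long "mmm" (assumed cut-off)
        return "no_op"
    return NAM_COMMANDS.get(cue, "no_op")


print(interpret_nam("hmm", 180))  # confirm_selection
print(interpret_nam("mmm", 900))  # privacy_mode
```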
In the next working embodiment, the processing module 400 receives eye gaze data (fixation duration, saccade speed, pupil dilation, dwell time on UI elements, blink rate) and NAM data, wherein the wavelet transformed NAM signal is used to extract frequency features, Mel-Frequency Cepstral Coefficients are used to extract voice features for classification, signal energy is used to differentiate a murmur from ambient noise, and duration is used to determine the type of NAM command.
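The following sketch illustrates the NAM-side feature extraction just described. The choice of the PyWavelets and librosa libraries, the "db4" wavelet, the 13 MFCCs and the 16 kHz sample rate are assumptions made only for this example; the disclosure does not prescribe specific tools or parameters.

```python
import numpy as np
import librosa   # MFCC extraction (assumed library choice)
import pywt      # wavelet decomposition (assumed library choice)


def extract_nam_features(signal: np.ndarray, sr: int = 16000) -> dict:
    """Extract the NAM features mentioned above: wavelet-band energies (frequency
    features), MFCCs (voice features), signal energy (murmur vs. ambient noise)
    and duration (used to determine the type of NAM command)."""
    signal = signal.astype(np.float32)

    coeffs = pywt.wavedec(signal, "db4", level=4)
    wavelet_energies = [float(np.sum(c ** 2)) for c in coeffs]

    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    return {
        "wavelet_energies": wavelet_energies,          # frequency features
        "mfcc_mean": mfcc.mean(axis=1),                # voice features for classification
        "signal_energy": float(np.mean(signal ** 2)),  # murmur vs. ambient noise
        "duration_s": len(signal) / sr,                # determines NAM command type
    }
```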
For example, a selection is confirmed when a long fixation and a soft "hmm" murmur are received; scrolling up/down occurs when a saccadic movement and an "uhh" murmur are obtained; privacy mode is turned on when a sudden gaze shift and a long "mmm" murmur are detected; and an action is dismissed when a blink and an exhale NAM signal are received. In one example embodiment, an LSTM model is selected to take sequential eye gaze data and NAM data as input and convert them into time-series features. Here, the first LSTM layer captures short-term dependencies in user behaviour, a dropout layer prevents overfitting by randomly disabling neurons, a second LSTM layer processes long-term dependencies, a further dropout layer improves generalization, a dense layer (ReLU activation) reduces the feature space to obtain a better representation, and a final dense layer (Softmax activation) outputs intent class probabilities.
The easy to wear head mounted device 100, with a light form factor, is equipped with smart sensors and the processing module 400, enabling concealed and seamless communication in any routine setting. In one preferred embodiment, subtle vibrations are picked up by the sensors positioned on the head mounted device 100 whenever the user slightly murmurs or utters any arbitrary word or text in natural language with very low vocal effort. A private and discreet communication channel is established between the user and his worn device that can provide direct commands to assist in fast and error-free typing.
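For illustration only, the LSTM layer stack outlined above could be assembled as follows. The layer ordering follows the description; the layer sizes, dropout rates, sequence length, feature count, number of intent classes and the use of TensorFlow/Keras are assumptions for this sketch.

```python
import tensorflow as tf

TIMESTEPS, FEATURES, NUM_INTENTS = 50, 20, 4  # assumed input shape and intent classes

# LSTM -> Dropout -> LSTM -> Dropout -> Dense(ReLU) -> Dense(Softmax), as described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(TIMESTEPS, FEATURES)),
    tf.keras.layers.LSTM(64, return_sequences=True),           # short-term dependencies
    tf.keras.layers.Dropout(0.2),                               # reduce overfitting
    tf.keras.layers.LSTM(32),                                   # long-term dependencies
    tf.keras.layers.Dropout(0.2),                               # improve generalisation
    tf.keras.layers.Dense(16, activation="relu"),               # compact feature representation
    tf.keras.layers.Dense(NUM_INTENTS, activation="softmax"),   # intent class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```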
User privacy is maintained even in high-noise environments or densely populated public places such as an airport, coffee shop or crowded restaurant. The slightly discernible whispers, lip movements and murmurs that make up subtle acoustic speech are effectively picked up by the smart sensors whenever the user gives voice instructions, which can be selectively activated by the user along with a gaze command. The slight utterance is interpreted by an advanced machine learning based acoustic processing method for subsequent typing instructions in the text field. The slightly vocalized phrase, word or sound is recognized, contextually processed and, after user validation, transmitted to the other device towards which the communication is directed or intended.
In accordance with one primary embodiment, the user worn head mounted device 100 is studded with sensors configured for sufficiently precise sensing, which is followed by optimal feature selection and signal processing by the processing module 400 to derive meaningful contextual information from the weak utterances captured by the voice recognition module 200, without requiring repeated vocalization to make the word comprehensible. The voice, in the form of an audio stream output by the user, is processed by the processing module 400 to recognize the user's voice (from other voices or background audio) and to extract commands, subjects, parameters, etc. from the audio stream by way of natural language processing in real time.
In one example embodiment, a speaker recognition technique is deployed by the recognition module to determine who is speaking, as well as speech recognition technology to determine what is being said. Voice recognition techniques can include hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization, decision trees, and dynamic time warping (DTW) techniques, alone or in combination. Voice recognition techniques can also include anti-speaker techniques, such as cohort models and world models. Spectral features may be used to represent speaker characteristics.
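As a non-limiting illustration of one of the listed techniques, the sketch below scores an utterance against Gaussian mixture speaker models built over spectral feature vectors. The synthetic data, the number of mixture components and the use of scikit-learn are assumptions; the disclosure does not commit to any particular speaker-recognition implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy Gaussian-mixture speaker models over spectral feature vectors (e.g. MFCCs).
# The data below is synthetic and used only to make the sketch self-contained.
rng = np.random.default_rng(0)
features_speaker_a = rng.normal(0.0, 1.0, size=(200, 13))
features_speaker_b = rng.normal(2.0, 1.0, size=(200, 13))

gmm_a = GaussianMixture(n_components=4, random_state=0).fit(features_speaker_a)
gmm_b = GaussianMixture(n_components=4, random_state=0).fit(features_speaker_b)


def identify_speaker(utterance_features: np.ndarray) -> str:
    """Pick the speaker model with the higher average log-likelihood."""
    if gmm_a.score(utterance_features) > gmm_b.score(utterance_features):
        return "speaker_a"
    return "speaker_b"


print(identify_speaker(rng.normal(0.0, 1.0, size=(50, 13))))  # speaker_a
```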
In some embodiments, one input mode may be the primary input mode, such as eye gaze, while another input mode may be the secondary input mode, such as voice recognition. The inputs from the secondary input mode may supplement the input from the primary input mode to ascertain the words to be inputted into the text field of the head mounted device. In accordance with an embodiment, the head mounted device comprises a memory unit configured to store machine-readable instructions. The machine-readable instructions may be loaded into the memory unit from a non-transitory machine-readable medium, such as, but not limited to, CD-ROMs, DVD-ROMs and Flash Drives. Alternatively, the machine-readable instructions may be loaded in the form of a computer software program into the memory unit. The memory unit in that manner may be selected from a group comprising EPROM, EEPROM and Flash memory. Further, a processor is operably connected with the memory unit. In various embodiments, the processor is one of, but not limited to, a general-purpose processor, an application specific integrated circuit (ASIC) and a field-programmable gate array (FPGA).
In general, the word "module", as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions written in a programming language such as, for example, Java, C, or assembly. One or more software instructions in the modules may be embedded in firmware, such as an EPROM. It will be appreciated that modules may comprise connected logic units, such as gates and flip-flops, and may comprise programmable units, such as programmable gate arrays or processors. The modules described herein may be implemented as either software and/or hardware modules and may be stored in any type of computer-readable medium or other computer storage device.
Further, while one or more operations have been described as being performed by or otherwise related to certain modules, devices or entities, the operations may be performed by or otherwise related to any module, device or entity. As such, any function or operation that has been described as being performed by a module could alternatively be performed by a different server, by the cloud computing platform, or a combination thereof. It should be understood that the techniques of the present disclosure might be implemented using a variety of technologies. For example, the methods described herein may be implemented by a series of computer executable instructions residing on a suitable computer readable medium. Suitable computer readable media may include volatile (e.g., RAM) and/or non-volatile (e.g., ROM, disk) memory, carrier waves and transmission media. Exemplary carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet. It should also be understood that, unless specifically stated otherwise as apparent from the following discussion, terms such as "controlling" or "obtaining" or "computing" or "storing" or "receiving" or "determining" or the like, as used throughout the description, refer to the actions and processes of a computer system, or similar electronic computing device, that processes and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. Various modifications to these embodiments are apparent to those skilled in the art from the description and the accompanying drawings. The principles associated with the various embodiments described herein may be applied to other embodiments. Therefore, the description is not intended to be limited to the embodiments shown along with the accompanying drawings but is to be accorded the broadest scope consistent with the principles and the novel and inventive features disclosed or suggested herein. Accordingly, the invention is intended to embrace all other such alternatives, modifications, and variations that fall within the scope of the present invention.
CLAIMS:
We Claim:
1) A system (1000) comprising of a multi-modal input interface (500) for enabling user interaction within an extended reality environment, wherein the system (1000) comprises of:
a head mounted device (100) comprising of one or more eye tracking elements configured to track eye gaze for selecting or engaging one or more virtual elements within the extended reality environment;
a voice recognition module (200) configured to capture non-audible murmur data uttered from user; and
a processing module (400) configured to receive the eye gaze data to analyse user behaviour, predict user intent and provide contextual prompts based on the user behaviour and the user intent, and receive the non-audible murmur data to complement the eye gaze data and enable interaction with the one or more virtual elements.

2) The system (1000), as claimed in claim 1, wherein the eye gaze data comprises of dwell time, saccadic movement, eye ball motion, eye blinks, and pupil dilation.

3) The system (1000), as claimed in claim 2, wherein the processing module (400) is configured to enable selection of the one or more virtual elements within the extended reality environment for the dwell time determined in range of 200-300ms.

4) The system (1000), as claimed in claim 3, wherein the processing module (400) is configured to prompt one or more words matching the selected virtual element after lapse of 250 milliseconds from beginning of the dwell time.

5) The system (1000), as claimed in claim 4, wherein the processing module (400) is configured to enable repetition of a character in word formation by recording an instant eye blink after first instance of occurrence of the character, and a prolonged eye blink of more than 200ms for second instance of occurrence of the character.

6) The system (1000), wherein in an event the processing module (400) determines an eye strain from watery eyes, high blink rate, pupil dilation or strained eye muscle movement, the user is prompted to switch to voice as only input for engaging with the one or more virtual elements.

7) The system (1000), as claimed in claim 2, wherein the processing module (400) is configured to predict the user intent based on the dwell time that is associated with user interest levels and selection intent.

8) The system (1000), as claimed in claim 2, wherein the processing module (400) is configured to estimate the user behaviour based on the saccadic movement, the pupil dilation and blink rate of the user.

9) The system (1000), as claimed in claim 1, wherein the processing module (400) is configured to receive the non-audible murmur intensity and duration to trigger commands that enable interaction with the one or more virtual elements.

10) The system (1000), as claimed in claim 2, wherein the processing module (400) is configured to extract features from the eye gaze data such as fixation duration and frequency, saccade speed and amplitude, and pupil dilation under different conditions.

11) The system (1000), as claimed in claim 1, wherein the processing module (400) is configured to utilize a Long Short-Term Memory (LSTM) based Recurrent Neural Network (RNN) to provide the contextual prompts based on the user behaviour and the user intent.

12) The system (1000), as claimed in claim 1, wherein the non-audible murmur (NAM) data comprises of wavelet transformed NAM signal and Mel-Frequency Cepstral Coefficients.

13) The system (1000), as claimed in claim 12, wherein the processing module (400) is configured to extract frequency features from the wavelet transformed NAM signal and extract voice features from the Mel-Frequency Cepstral Coefficients.

14) The system (1000), as claimed in claim 13, wherein the processing module (400) is configured to enable selection, scrolling, closing or dismissing of the one or more virtual elements and activation of privacy mode within the extended reality environment based on type of command determined from the extracted frequency and voice features from the non-audible murmurs.

Documents

Application Documents

# Name Date
1 202421016661-PROVISIONAL SPECIFICATION [07-03-2024(online)].pdf 2024-03-07
2 202421016661-FORM FOR SMALL ENTITY(FORM-28) [07-03-2024(online)].pdf 2024-03-07
3 202421016661-FORM 1 [07-03-2024(online)].pdf 2024-03-07
4 202421016661-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [07-03-2024(online)].pdf 2024-03-07
5 202421016661-EVIDENCE FOR REGISTRATION UNDER SSI [07-03-2024(online)].pdf 2024-03-07
6 202421016661-DRAWINGS [07-03-2024(online)].pdf 2024-03-07
7 202421016661-DRAWING [06-03-2025(online)].pdf 2025-03-06
8 202421016661-CORRESPONDENCE-OTHERS [06-03-2025(online)].pdf 2025-03-06
9 202421016661-COMPLETE SPECIFICATION [06-03-2025(online)].pdf 2025-03-06
10 202421016661-FORM-9 [10-03-2025(online)].pdf 2025-03-10
11 202421016661-FORM-9 [10-03-2025(online)]-1.pdf 2025-03-10
12 202421016661-STARTUP [12-03-2025(online)].pdf 2025-03-12
13 202421016661-FORM28 [12-03-2025(online)].pdf 2025-03-12
14 202421016661-FORM 18A [12-03-2025(online)].pdf 2025-03-12
15 Abstract.jpg 2025-03-18
16 202421016661-FER.pdf 2025-05-08
17 202421016661-OTHERS [16-05-2025(online)].pdf 2025-05-16
18 202421016661-FER_SER_REPLY [16-05-2025(online)].pdf 2025-05-16

Search Strategy

1 202421016661_SearchStrategyNew_E_202421016661E_04-04-2025.pdf