
Augmented Reality Assisting System And Method For Assisting User In Real Time For Performing A Task

Abstract: Systems and methods for assisting a user in real-time in performing a task are described. The system receives a request from the user for performing the task. Based on the intent of the request, the system extracts multimodal data and corresponding timestamp information. The multimodal data is extracted from real-time interactions between the user and an expert stored in an interaction database. A portion of the multimodal data comprises one or more annotations. Further, the system determines one or more entities associated with the request based on the one or more annotations. The system further generates guidance information for assisting the user in performing the task based on the determined one or more entities and by synchronizing each multimodal data in sequence based on the timestamp information of each multimodal data. FIG. 1


Patent Information

Application #
Filing Date
01 July 2019
Publication Number
02/2021
Publication Type
INA
Invention Field
ELECTRONICS
Status
Email
bangalore@knspartners.com
Parent Application

Applicants

WIPRO LIMITED
Doddakannelli, Sarjapur Road, Bangalore 560035, Karnataka, India.

Inventors

1. Ghulam Mohiuddin Khan
F-901, Concorde Manhattans, Electronic City Phase-1, Bangalore-560100.
2. Sethuraman Ulaganathan
#76/3, South Vaikolkara Street, Woraiyur (PO), Ramalinga Nagar, Tiruchirappalli (DT), Tamil Nadu 620003.
3. Pandurang Dulba Naik
“Sunit Isha”, #302, Near Parkhe Mala, Baner, Pune 411045.
4. Deepanker Singh
9/1 Siddhartha Nagar, Shastri Nagar, Meerut, UP-25004.

Specification

Claims:
1. A method of assisting a user (104) in real-time for performing a task, the method comprising:
receiving, by an Augmented Reality (AR) assisting system (102), a request from the user (104) for performing the task;
extracting, based on intent of the request, by the AR assisting system (102), multimodal data (110) along with timestamp information (112), from real-time interactions between the user (104) and an expert (106) stored in an interaction database (108) associated with the AR assisting system (102), in response to the request, wherein a portion of the multimodal data (110) comprises one or more annotations;
determining, by the AR assisting system (102), one or more entities associated with the request based on the one or more annotations associated with the portion of the multimodal data (110); and
generating, by the AR assisting system (102), guidance information (114) for assisting the user (104) in performing the task based on the determined one or more entities and by synchronizing each multimodal data (110) in sequence based on the timestamp information (112) of each multimodal data (110).

2. The method as claimed in claim 1, wherein the multimodal data (110) comprises at least one of audio, text, images, and videos.

3. The method as claimed in claim 1, wherein the intent of the request is identified using a Natural Language Processing (NLP) technique.

4. The method as claimed in claim 1, wherein the one or more annotations in the portion of the multimodal data (110) comprises:
explanations indicating instructions for performing the task, wherein the explanations are provided at Region of Interest (ROI) identified in the portion of the multimodal data (110).

5. The method as claimed in claim 1, wherein each multimodal data (110) is synchronized based on the associated timestamp information (112) using a trained neural network model.

6. The method as claimed in claim 1, wherein the AR assisting system (102) is trained using training data (212) based on a plurality of interactions between the user (104) and the expert (106) on one or more domains.

7. The method as claimed in claim 1, further comprising creating at least one of image data set and video snippets using the guidance information (114), the synchronized multimodal data (110) and the corresponding one or more annotations.

8. An Augmented Reality (AR) assisting system (102) for assisting a user (104) in real-time for performing a task, the system (102) comprising:
a processor (204); and
a memory (206) communicatively coupled to the processor (204), wherein the memory (206) stores processor-executable instructions, which, on execution, cause the processor (204) to:
receive a request from the user (104) for performing the task;
extract, based on intent of the request, multimodal data (110) along with timestamp information (112), from real-time interactions between the user (104) and an expert (106) stored in an interaction database (108) associated with the AR assisting system (102), in response to the request, wherein a portion of the multimodal data (110) comprises one or more annotations;
determine one or more entities associated with the request based on the one or more annotations associated with the portion of the multimodal data (110); and
generate guidance information (114) for assisting the user (104) in performing the task based on the determined one or more entities and by synchronizing each multimodal data (110) in sequence based on the timestamp information (112) of each multimodal data (110).

9. The AR assisting system (102) as claimed in claim 8, wherein the one or more annotations in the portion of the multimodal data (110) comprises:
explanations indicating instructions for performing the task, wherein the explanations are provided at Region of Interest (ROI) identified in the portion of the multimodal data (110).

10. The AR assisting system (102) as claimed in claim 8, wherein the processor (204) trains the AR assisting system using training data (212) based on a plurality of interactions between the user (104) and the expert (106) on one or more domains.
Description:

TECHNICAL FIELD

The present disclosure relates in general to an Augmented Reality (AR) assisting system. More particularly, but not exclusively, the present disclosure discloses a method and a system to assist a user in real-time and provide guidance information for performing a task.
BACKGROUND

Virtual assistance has been known for quite a long time. The main objective of such assistance is to assist users in performing any task. In such technology, a chatbot or a human-like virtual assistant may be provided for assisting the users. One type of virtual assistance is Augmented Reality (AR) virtual assistance. This technology navigates the users through a stepwise or sequential process required for solving their queries or for performing the task. It may be in the form of an animation, a series of images, or a video. AR virtual assistance provides more convenience to the users when compared with conventional virtual assistance technology. That is, this technology provides a look-and-feel experience to the users in understanding the steps required for solving their queries, which may not be feasible using a chatbot or even a human-like virtual assistant.

In order to implement such AR virtual assistance technology, content is required. Based on such content, animations, series of images, or videos are generated for providing stepwise guidance to the users. Creating the content requires a lot of human effort as well as time. The effort and the time multiply with an increase in domains and different use cases or scenarios. For instance, one user may want to know how to install new software on his/her laptop, whereas another user may want to know how to operate his/her newly bought washing machine. The domains and use cases are huge in number, and hence, creating the content for all such domains is a challenge. Moreover, linking the content in a proper sequence for generating the stepwise information is another technical challenge.

SUMMARY

Accordingly, the present disclosure relates to a method of assisting a user in real-time in performing a task. The method comprises receiving a request from the user for performing the task. The method further comprises the step of extracting, based on the intent of the request, multimodal data along with timestamp information. The multimodal data is extracted from real-time interactions between the user and an expert stored in an interaction database in response to the request. Further, a portion of the multimodal data comprises one or more annotations. The method further comprises the step of determining one or more entities associated with the request based on the one or more annotations associated with the portion of the multimodal data. Further, the method comprises the step of generating guidance information for assisting the user in performing the task based on the determined one or more entities and by synchronizing each multimodal data in sequence based on the timestamp information of each multimodal data. In one aspect, the aforementioned method for assisting the user in real-time in performing the task may be performed by a processor using programmed instructions stored in a memory.

Further, the present disclosure relates to an Augmented Reality (AR) assisting system for assisting a user in real-time in performing a task. The AR assisting system comprises a processor and a memory communicatively coupled to the processor. The memory stores processor-executable instructions, which, on execution, cause the processor to perform one or more operations comprising receiving a request from the user for performing the task. The AR assisting system is configured to extract, based on the intent of the request, multimodal data along with timestamp information. The multimodal data is extracted from real-time interactions between the user and an expert stored in an interaction database in response to the request. Further, a portion of the multimodal data comprises one or more annotations. Further, the AR assisting system determines one or more entities associated with the request based on the one or more annotations associated with the portion of the multimodal data. The AR assisting system further generates guidance information for assisting the user in performing the task based on the determined one or more entities and by synchronizing each multimodal data in sequence based on the timestamp information of each multimodal data.

The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of system and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figures, in which:

FIG. 1 shows an exemplary environment illustrating an Augmented Reality (AR) assisting system for assisting a user in real-time for performing a task in accordance with some embodiments of the present disclosure;

FIG. 2 shows a detailed block diagram illustrating the AR assisting system in accordance with some embodiments of the present disclosure;

FIG. 3 shows an example of an interaction between user and expert in accordance with some embodiments of the present disclosure;

FIG. 4 shows an example illustrating generation of guidance information for assisting the user for performing a task in accordance with some embodiments of the present disclosure;

FIG. 5 shows a flowchart illustrating a method for assisting a user in real-time for performing a task in accordance with some embodiments of the present disclosure; and

FIG. 6 illustrates a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.

It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.

DETAILED DESCRIPTION

In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and the scope of the disclosure.
The terms “comprises”, “comprising”, “includes”, or any other variations thereof are intended to cover a non-exclusive inclusion, such that a setup, device, or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup, device, or method. In other words, one or more elements in a system or apparatus preceded by “comprises… a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or method.
The present disclosure relates to a method and an Augmented Reality (AR) assisting system (alternatively also referred to as “system”) for assisting a user in real-time in performing a task. Although the method for assisting the user is described in conjunction with a server, the said method can also be implemented in various computing systems/devices other than the server.
Technology related to assisting users in performing tasks has existed for a long time. Earlier, when a new television was bought, people had to wait for a manual demonstration and had to follow manuals to understand the operations of the newly bought television. However, with time and technology, user assistance has evolved from manual assistance to virtual assistance. Augmented Reality (AR) virtual assistance is one type of virtual assistance technology which provides an interactive experience of the real world to the users.
Today, if someone buys a smart television and is not able to operate some of its functionalities, the user does not have to wait for any manual assistance to understand the functionalities. The user can simply use AR virtual assistance on his/her device. However, for making such AR virtual assistance work, a lot of backend effort and time is required for collecting information, validating the collected information, and then binding it together in a meaningful format. For example, a dedicated television expert has to be employed for identifying different use cases related to the smart television, and thereafter creating guidance information for each of the use cases. This effort is only for one domain (smart television) and its related use cases (different functionalities of the smart television). Imagine how much effort and time would be required for creating the content for the n number of domains and their corresponding use cases available in the world.
The objective of the present disclosure is to overcome this issue by generating the guidance information without employing any dedicated person or team. That is, the present disclosure creates the guidance information or content required for AR virtual assistance by simply capturing and analyzing the interaction between the user and the expert. Here, the expert may be a domain expert or subject-matter expert, for example, but not limited to, an electronic appliance expert who understands the features of electronic appliances (such as a smart television) and how the features are operated, an automobile expert who understands how to operate different components of a vehicle (such as the dashboard), a software expert who understands how to install a software program on a computer/laptop, or a banking expert who understands how to redeem credit card points.
Consider that the user wants to redeem his/her credit card points and asks the banking expert for help during an interaction. In response, the banking expert may explain to the user the steps or procedures to be followed for performing the task. During the interaction, multimodal data may be captured by the system in the form of text, images, audio, video, or a combination thereof. For instance, the user may use a chat mode, an audio mode, a visual mode, or a combination thereof for his/her request. Along with the request, the user might also share images or videos of objects upon which the user wants to perform the task.
As the interaction progresses, the system captures, in parallel, data pertaining to the interaction at each and every stage along with timestamp information. Considering that the user wants to redeem his/her credit card points, the user might tell the expert his/her bank's name and may also share an image of his/her bank's internet banking home page with the expert for assistance. On receiving the request and the image, the expert may first identify one or more regions of interest (ROIs) and annotate them with proper instructions. For instance, by using the annotations in the image, the expert may instruct the user to first select “credit card” under the tab “cards”, then click on the link “transaction”, and then select the check-box “redeem points”. This sequence provided in the annotations is used by the system for generating the guidance information for the user, which is explained in detail in subsequent paragraphs of the specification.
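
For illustration only, the following sketch shows one way such timestamped multimodal events could be recorded; the `InteractionEvent` structure, its field names, and the append-only list standing in for the interaction database are assumptions, not details given by the disclosure:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class InteractionEvent:
    """One timestamped multimodal datum captured during a user-expert interaction."""
    speaker: str                        # "user" or "expert"
    modality: str                       # "text", "audio", "image", or "video"
    content: str                        # utterance text, or a path/URI to the media
    annotations: Optional[list] = None  # expert-provided ROI annotations, if any
    timestamp: datetime = field(default_factory=datetime.now)

# The interaction database is modeled here as a simple append-only list.
interaction_db: list = []

def capture(speaker: str, modality: str, content: str, annotations=None) -> None:
    """Record an event, stamping it with the current timepoint."""
    interaction_db.append(InteractionEvent(speaker, modality, content, annotations))

capture("user", "text", "I want to redeem my credit card points")
capture("user", "image", "netbanking_home.png")
```
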
In the following detailed description of the embodiments of the disclosure, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the present disclosure. The following description is, therefore, not to be taken in a limiting sense.

FIG. 1 shows an exemplary environment illustrating an Augmented Reality (AR) assisting system for assisting a user in real-time for performing a task in accordance with some embodiments of the present disclosure.

The environment 100 includes the AR assisting system 102, a user 104, an expert 106, an interaction database 108 associated with the AR assisting system 102, multimodal data 110, timestamp information 112, and guidance information 114. According to an embodiment of the present disclosure, the user 104 may raise a request to the expert 106 for performing a task. For raising the request and for communicating with the expert 106, the user 104 may use his/her device, such as a mobile device, laptop, desktop, and the like. Here, the expert 106 may be a subject-matter expert or a domain expert as discussed in the above paragraphs of the specification. Further to the above examples, a few more examples of the expert 106 may be a chef who may help in preparing a dish, a computer hardware expert who may help in installing computer parts, or any person who may help/assist the user 104 in completing the task during the interaction between the user 104 and the expert 106.

At first, the interaction between the user 104 and the expert 106 may be recorded and stored in the interaction database 108. According to an embodiment of the present disclosure, the interaction database 108 may be external to the system 102 or may be a part of the system 102. In both cases, the interaction database 108 serves as a repository for the interactions that happened between the user 104 and the expert 106. As discussed above, the interaction may happen using different communication modalities. For example, the user 104 and the expert 106 may communicate with each other using text, images, audio, video, or a combination thereof. Hence, the multimodal data 110 generated during the interaction is stored in the interaction database 108. The multimodal data 110 along with the timestamp information 112 is further extracted by the system 102 for analysis and for generating the guidance information 114, which is explained in detail in later paragraphs of the specification.

It may be understood that the entire multimodal data 110 captured during the interaction may not be useful for generating the guidance information 114. The reason is that there may be a formal introduction between the user 104 and the expert 106 during the interaction. For example, the user 104 may initiate the interaction with “Hi, I am John! I would like to seek your help for addressing my query.” In response, the expert 106 may reply “Hello, John! I am Mary. How can I help you?”. Hence, the system 102 may not require this formal introduction for generating the guidance information 114.

However, the system 102 may still use such information, for example the formal introduction or any other interaction, even if it is not going to be considered for generating the guidance information 114. This is because such interaction may be useful to understand context. Such interaction may also help the system 102 to understand whether the query has been successfully resolved by the expert 106. For instance, once the expert 106 successfully addresses the user's 104 query/request, the user 104 may thank the expert 106, for example, “Thank you, Mary, for your help. Now I can do my stuff”. The expert 106 may respond with “It's my pleasure, John. Have a great day”. Based on this interaction (i.e., feedback or acknowledgement), the system 102, using a natural language processing technique, also understands that the information provided by the expert 106 was correct and helped the user 104 to successfully perform the task.
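
As a rough illustration of this success check, the snippet below flags a closing utterance as an acknowledgement; plain keyword matching is used here as a stand-in, since the disclosure names the NLP technique only generically:

```python
# Hypothetical gratitude cues; a real system would use a trained NLP classifier.
GRATITUDE_CUES = ("thank", "thanks", "pleasure", "now i can", "it worked")

def task_resolved(closing_utterance: str) -> bool:
    """Heuristically decide whether a closing utterance signals that the
    expert's guidance worked and the task was completed successfully."""
    text = closing_utterance.lower()
    return any(cue in text for cue in GRATITUDE_CUES)

print(task_resolved("Thank you, Mary, for your help. Now I can do my stuff"))  # True
print(task_resolved("It still does not work"))                                 # False
```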

Apart from the above introduction- and feedback-related interactions, there may be some useful interaction which is actually required for generating the guidance information 114. Consider a case in which the user 104 may be a first-time smartphone user and may not know how to insert a subscriber identity module (SIM) card into the smartphone. In this case, the user 104 may raise a request like “Hi Mary, I would like to insert the SIM card into my smartphone Oppo™ F7” in a textual or audio format. Along with this textual or audio request, the user 104 may also share a front image and a backside image of his smartphone. On receiving the user's 104 request and the images (front and backside) of the smartphone, the expert 106 understands the intention of the user's 104 request. The expert 106 also understands that the images provided by the user 104 are not sufficient for explaining how to insert the SIM. Hence, the expert 106 may reply to the user 104 with “Thank you, John. Can you share the image of the right-hand side of the smartphone when the smartphone is facing you?”. In response, the user 104 shares the image of the right-hand side of his smartphone during the interaction.

The expert 106 may then annotate the right-hand side image provided by the user 104. Here, annotation means identifying the Regions of Interest (ROIs) in the image and providing corresponding explanations. In this case, the ROIs may be a “small hole” in which a pin is inserted, a “tray” which comes out after the pin is pressed, and a “slot” designated for receiving the SIM card. During the annotation, the expert 106 may provide the corresponding explanation for each ROI. An explanation is an explanatory note on how to operate upon the ROI. For example, for the ROI “small hole”, the explanatory note may be “press the pin into the small hole to pull out the tray”. Similarly, the expert 106 may provide the explanations for the remaining ROIs and share the annotated image back with the user 104.
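
A minimal sketch of how such an annotation could be represented follows; the field names, the bounding-box convention, and all pixel coordinates are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ROIAnnotation:
    """An expert-marked Region of Interest together with its explanatory note."""
    label: str        # e.g. "small hole", "tray", "slot"
    bbox: tuple       # (x, y, width, height) in image pixels
    explanation: str  # how to operate upon this ROI
    order: int        # position in the step sequence

# Annotations the expert might attach to the right-hand side image of the phone.
sim_annotations = [
    ROIAnnotation("small hole", (412, 180, 12, 12),
                  "Press the pin into the small hole to pull out the tray.", 1),
    ROIAnnotation("slot", (395, 170, 60, 30),
                  "Place the SIM card into the slot provided on the tray.", 2),
    ROIAnnotation("tray", (380, 160, 90, 40),
                  "Push the tray back into the smartphone.", 3),
]
```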

Here, it may be noted that each instance of the above multimodal data 110 (introduction, feedback, the user's 104 request, the images of the smartphone shared by the user 104, and the expert's annotations) during the interaction is recorded with corresponding timestamp information 112. This may aid/facilitate the system 102 in understanding the flow of the interaction. Thereafter, the system 102 may identify the useful multimodal data 110 amongst the entire multimodal data 110, i.e., the data actually required for generating the guidance information 114. From the useful multimodal data 110 which contains the annotations, the system 102 further determines entities. The entities are the objects upon which the ROIs are identified for the annotation. Based on the entities, the system 102 generates the guidance information 114 by synchronizing each multimodal data 110 based on the timestamp information 112. In this scenario, the ROIs and corresponding explanations associated with the annotated right-hand side image of the smartphone may be arranged in the following sequence for generating the guidance information 114: press the pin into the small hole → the tray comes out → place the SIM into the slot provided on the tray → push the tray back into the smartphone. In this manner, the system 102 not only generates the guidance information 114 used for AR virtual assistance, but also learns from the interaction between the user 104 and the expert 106. Based on the learning, the system 102 scales up its knowledge base. It may be understood by a skilled person that, apart from the multimodal data (as discussed above), other data may also be used, according to embodiments of the present disclosure, for example, annotation bounding-box coordinates, annotation type (box, polygon, rectangle), user location, user context, and problem-context-related numerical data. For instance, the explanation provided for the ROIs may be presented in different shapes like a box, polygon, rectangle, or any other shape.

FIG. 2 shows a detailed block diagram illustrating the AR assisting system in accordance with some embodiments of the present disclosure.

The AR assisting system 102 (alternatively also referred to as “system”) comprises an I/O interface 202, a processor 204, and a memory 206. The I/O interface 202 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 202 may allow the system 102 to interact with the user 104 directly or through user devices. The memory 206 is communicatively coupled to the processor 204. The processor 204 is configured to perform one or more functions of the system 102 for assisting the user 104 in real-time in performing a task. In one implementation, the system 102 comprises data 208 and modules 210 for performing various operations in accordance with the embodiments of the present disclosure. In an embodiment, the data 208 may include, without limitation, multimodal data 110, timestamp information 112, guidance information 114, training data 212, and other data 214.

In one embodiment, the data 208 may be stored within the memory 206 in the form of various data structures. Additionally, the aforementioned data 208 can be organized using data models, such as relational or hierarchical data models. The other data 214 may store data, including temporary data and temporary files, generated by the modules 210 for performing the various functions of the system 102. For example, the other data 214 may include an intent of the interaction which is required for understanding the context of the interaction. The intent may be determined using a Natural Language Processing (NLP) technique.

In an embodiment, the multimodal data 110 may comprise data pertaining to the interaction between the user 104 and the expert 106. The multimodal data 110 may include, without limitation, user-related data and expert-related data. The user-related data may include the user's textual utterances, the user's audio utterances, the user's visual data, and images or videos shared by the user 104. The expert-related data, provided in response to the user-related data, may include the expert's textual utterances, the expert's audio utterances, the expert's visual data, and annotated images or annotated videos.

The multimodal data 110 may be extracted by the system 102 along with the corresponding timestamp information 112. In an embodiment, the timestamp information 112 may include, without limitation, the timepoints at which the multimodal data 110 is extracted by the system 102. For instance, when the user 104 initiates the interaction with “Hi, I am John! I would like to seek your help for addressing my query”, the system 102 captures the corresponding timepoint in parallel, for example 10:15 AM. In response, when the expert 106 replies with “Hello, John! I am Mary. How can I help you?”, the system 102 again captures the next timepoint, for example 10:16 AM. As the interaction progresses, the system 102 extracts the multimodal data 110 and the corresponding timestamp information 112 in a similar manner.

In the next step, the system 102 analyses the multimodal data 110 and the corresponding timestamp information 112 in order to generate the guidance information 114. As discussed earlier, the guidance information 114 is stepwise information required for completing the task requested by the user 104. The generated guidance information 114 is used for AR virtual assistance, according to an embodiment of the present disclosure.

While the system 102 performs the above analysis for generating the guidance information 114, the system 102 also learns from the analysis in parallel. In other words, the system 102 learns domain knowledge based on the interactions between the user 104 and the expert 106. In an embodiment, the system 102 generates training data 212 based on this learning while generating the guidance information 114. Once the system 102 is trained, the system 102 may use a trained neural network model for generating the guidance information 114 for future use.
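
Purely as an assumption-laden sketch, training records could be accumulated by pairing each resolved request with the guidance it produced; the record layout below is hypothetical, and no specific neural network architecture is implied by the disclosure:

```python
# Hypothetical training-record store; the disclosure references a neural
# network model only generically, so no particular architecture is implied.
training_data = []

def learn_from_interaction(intent: dict, guidance_steps: list) -> None:
    """Pair a resolved intent with the guidance it produced, for later training."""
    training_data.append({"intent": intent, "guidance": guidance_steps})

learn_from_interaction(
    {"action": "replace", "objects": ["laptop", "battery"]},
    ["Locate the battery.", "Pull out the latch.", "Slide the battery out."],
)
```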

In an embodiment, the above discussed data 208 (multimodal data 110, timestamp information 112, guidance information 114, training data 212) may be processed by one or more modules 210. In one implementation, the one or more modules 210 may also be stored as a part of the processor 204. In an example, the one or more modules 210 may be communicatively coupled to the processor 204 for performing one or more functions of the system 102.

In one implementation, the one or more modules 210 may include, without limitation, a receiving module 216, an extracting module 218, a determining module 220, a generating module 222, and other modules 224. As used herein, the term module refers to an application specific integrated circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality. The other modules 224 may include programs or coded instructions that supplement applications and functions of the system 102 for assisting the user 104 in real-time for performing the task.

Now, how the system 102 is implemented for assisting the user 104 in real-time in performing the task is discussed herein in detail with reference to an example of the interaction shown in FIG. 3. The objective of the present disclosure is to create the guidance information 114 for AR virtual assistance without employing any dedicated person/team. In order to achieve this, the system 102 analyzes the real-time interaction between the user 104 and the expert 106.

From FIG. 3, it can be observed how the real-time interaction has taken place between the user 104 and the expert 106. It can also be observed how the user 104 and the expert 106 have verbally interacted with each other. It must be understood by a skilled person that the verbal utterances of the user 104 and the expert 106 may be converted into textual form using a speech-to-text converter. Further, it must also be understood by the skilled person that the user 104 and the expert 106 may also use other types of communication modalities, for example, textual utterances, visual utterances, or a combination of textual, verbal, and visual utterances. During the interaction, the user 104 and the expert 106 may also share multimedia data like images, videos, and the like with each other. Since the interaction can happen using different types of communication modalities, all the interaction-related data is termed “multimodal data”, as can be seen from FIG. 3.

The data related to the interaction shown in FIG. 3 may be stored in the interaction database 108. As discussed while referring to FIG. 1, the interaction database 108 may be external to the system 102 or may be a part of the system 102. In both scenarios, the interaction database 108 serves as a repository of the interaction-related data.

From FIG. 3, it can be observed that the interaction starts with a formal introduction and a request made by the user 104 for removing the battery from the laptop. In an embodiment, the receiving module 216 may receive the request from the user 104. Once the request is made by the user 104, the expert 106 (Terresa) assists the user 104 in completing his task. However, for providing assistance, it is important to understand the intent of the user's 104 request. Thus, in the next step, the system 102 determines the intent of the request made by the user 104 by using the NLP technique. That is, the system 102 analyzes the user's utterance “I want to replace my laptop battery” provided at the timepoint 9:02. During the analysis, the user's utterance “I want to replace my laptop battery” may be split into a plurality of words. The meaning of each of the plurality of words is determined using the NLP technique. Based on the meanings determined, the system 102 may understand that the word “want” reflects the request of the user 104. Further, by using the meanings of the words “replace”, “laptop”, and “battery”, the system 102 also understands that the user 104 wishes to remove or replace the laptop's battery.
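
The following sketch mirrors that word-level analysis with simple keyword lookups; the marker and vocabulary sets are assumptions standing in for the trained NLP technique the disclosure refers to:

```python
# Illustrative stand-in for the NLP-based intent analysis.
REQUEST_MARKERS = {"want", "need", "like"}                 # words signaling a request
ACTION_WORDS = {"replace", "remove", "insert", "redeem"}   # operations on an object
OBJECT_WORDS = {"laptop", "battery", "sim", "smartphone"}  # known entities of interest

def parse_intent(utterance: str) -> dict:
    """Split the utterance into words and pick out the request, action, and objects."""
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    return {
        "is_request": any(w in REQUEST_MARKERS for w in words),
        "action": next((w for w in words if w in ACTION_WORDS), None),
        "objects": [w for w in words if w in OBJECT_WORDS],
    }

print(parse_intent("I want to replace my laptop battery"))
# {'is_request': True, 'action': 'replace', 'objects': ['laptop', 'battery']}
```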

Once the intent is determined, the next task is to extract the multimodal data 110. However, the system understands that the entire multimodal data 110 captured during the interaction may not be required for generating the guidance information 114. The reason, as discussed earlier, is that there may be some formal introduction between the user 104 and the expert 106 (shown at timepoints 9:00 and 9:01 in FIG. 3), which may not be required. Apart from the formal introduction, there may be some feedback-, acknowledgement-, or confirmation-related interaction (as shown at timepoints 9:15, 9:19, 9:21, 9:26, 9:27, and 9:30), which may also not be required.

Although the above-discussed interactions are not required for generating the guidance information 114, the system 102 analyzes them to understand the context and outcome of the interaction. For instance, by analyzing the feedback- and acknowledgement-related interactions, the system 102 understands whether the expert 106 was able to assist the user 104 in successfully completing the task. In the above example, the interaction at the timepoints 9:26, 9:27, and 9:30 clearly indicates that the user 104 was able to successfully remove the battery from the laptop based on the expert's 106 assistance.

However, the objective of the present disclosure is to generate the guidance information 114 for use in AR virtual assistance. The paragraphs discussed above with reference to FIG. 3 do not yet explain how exactly the guidance information 114 is generated. Rather, they discuss how the system 102 prepares itself for generating the guidance information 114, i.e., how the system 102 first identifies the intent of the user's 104 request, and then how the system 102 identifies the multimodal data 110 (introduction, feedback, acknowledgement, confirmation) which is not required for generating the guidance information 114.

Once the above analysis is done, in the next step, the system 102 focusses on the multimodal data 110 which is required for generating the guidance information 114. For this, the extracting module 218 extracts the multimodal data 110 at the timestamp information 112, i.e., the timepoints 9:10, 9:12, 9:13, 9:16, 9:17, 9:22, and 9:24 (emphasized with bold and italics in FIG. 3), which is actually required for generating the guidance information 114. The extracting module 218 also extracts the abovementioned timestamp information 112 along with the multimodal data 110. The reason is that the multimodal data 110 associated with the aforesaid timepoints indicates the sequence of the steps followed for removing the battery from the laptop. That is, when the backside image of the laptop is shared by the user 104 at timepoint 9:10, the expert 106 performs the annotation and shares the first annotated image at the timepoints 9:12 and 9:13 respectively. The first annotated image helps the user 104 to locate the position of the latch and also to understand how to pull out the latch.
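
The timepoint-based filtering can be pictured as below; representing events as (timepoint, payload) pairs and selecting by an explicit timepoint set are simplifications of the system's analysis, assumed here for brevity:

```python
from datetime import time

# Timepoints from the FIG. 3 example whose events carry the actual steps.
RELEVANT_TIMEPOINTS = {time(9, 10), time(9, 12), time(9, 13),
                       time(9, 16), time(9, 17), time(9, 22), time(9, 24)}

def extract_relevant(events):
    """Drop introduction/feedback events; keep the step-carrying ones."""
    return [(t, payload) for t, payload in events if t in RELEVANT_TIMEPOINTS]

events = [(time(9, 0), "Hi, I am John!"),             # introduction -> dropped
          (time(9, 10), "backside_image.png"),        # kept
          (time(9, 12), "annotated_image_1.png"),     # kept
          (time(9, 15), "Thanks, I found the latch")] # feedback -> dropped
print(extract_relevant(events))
# [(datetime.time(9, 10), 'backside_image.png'), (datetime.time(9, 12), 'annotated_image_1.png')]
```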

Moving further, the expert 106 again performs the annotation and shares the second annotated image at the timepoints 9:16 and 9:17 respectively. Here, it can be noted that the system 102 has ignored the multimodal data 110 at timepoint 9:15 as it is a feedback-related interaction and not required for generating the guidance information 114. Similar to the first annotated image, the second annotated image also helps the user 104 to locate the battery (after removing the latch) and to understand how to pull out the battery for removing it from the laptop. Then, at timepoint 9:24, the system 102 gets the backside image of the laptop after the battery has been removed.

The first annotated image and the second annotated image indicate the portions of the multimodal data 110 which get annotated by the expert 106. Here, the annotation is performed by identifying Regions of Interest (ROIs) and providing corresponding explanations for the ROIs. In the above example, the ROIs indicate the area at the backside of the laptop where the battery is located, the latch, and the battery itself.

In the next step, the determining module 220 determines one or more entities based on the annotations performed on the multimodal data 110. In this case, the entities may include the latch and the battery which are annotated by the expert 106. Once the entities are determined, the generating module 222 generates the guidance information 114 by synchronizing each multimodal data 110 in sequence based on the timestamp information 112 of each multimodal data 110. As discussed in the above paragraphs, each multimodal data 110 at the timepoints 9:10, 9:12, 9:13, 9:16, 9:17, 9:22, and 9:24 of FIG. 3 is synchronized for generating the guidance information 114.
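
A compact sketch of this synchronization step follows, with invented entity names and explanations standing in for the actual annotated content of FIG. 3:

```python
from datetime import time

# (timepoint, entity, explanation) triples distilled from the annotated images.
annotated = [
    (time(9, 16), "battery", "Slide the battery out of its bay."),
    (time(9, 12), "latch", "Pull the latch in the direction shown."),
    (time(9, 10), "battery area", "Locate the battery at the backside of the laptop."),
]

def generate_guidance(annotated_items):
    """Order the annotated entities by timestamp to produce stepwise guidance."""
    ordered = sorted(annotated_items, key=lambda item: item[0])
    return [f"Step {i}: {explanation}"
            for i, (_, _, explanation) in enumerate(ordered, start=1)]

for step in generate_guidance(annotated):
    print(step)
# Step 1: Locate the battery at the backside of the laptop.
# Step 2: Pull the latch in the direction shown.
# Step 3: Slide the battery out of its bay.
```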

According to an embodiment of the present disclosure, the generated guidance information 114 may be seen in FIG. 4. In the first stage 402, the guidance information 114 helps the user 104 to locate the actual position of the battery. In the first stage 402, it can be observed that the region of interest (ROI) 402-B is shown with a dotted circle where the battery is located on the laptop's backside. Along with the ROI 402-B, the corresponding explanation 402-A is also provided for the user 104. In the second stage 404, the guidance information 114 helps the user 104 to locate the latch position and also to understand in which direction the latch has to be pulled. In the second stage 404, it can be observed that the region of interest (ROI) 404-B is shown with a dotted circle where the latch is located. Along with the ROI 404-B, the corresponding explanation 404-A of how to pull out the latch is also provided for the user 104. Finally, in the third stage 406, the guidance information 114 helps the user 104 to locate the battery (after removing the latch) and also to understand how the battery has to be pulled out. In the third stage 406, it can be observed that the region of interest (ROI) 406-B is shown with a dotted circle where the battery which has to be removed is actually located. Along with the ROI 406-B, the corresponding explanation 406-A of how to pull out the battery is also provided for the user 104. Thus, the system 102 provides the stepwise guidance to the user 104 for performing the task.

It must be understood by the skilled person that the aforesaid guidance information 114, shown in the form of stages 402-406, may be presented in the form of an animation, a series of images (image data set), video snippets, or any other form required for the AR virtual assistance. Based on the above-discussed interaction between the user 104 and the expert 106 for generating the guidance information 114, the system 102 learns the domain knowledge.

FIG. 5 shows a flowchart illustrating a method for assisting a user in real-time for performing a task in accordance with some embodiments of the present disclosure.

As illustrated in FIG. 5, the method 500 comprises one or more blocks for assisting a user 104 in real-time for performing a task using an Augmented Reality (AR) assisting system 102. The method 500 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform particular functions or implement particular abstract data types.

The order in which the method 500 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method. Additionally, individual blocks may be deleted from the methods without departing from the scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.

At block 502, the AR assisting system 102 receives a request from the user 104 for performing the task.

At block 504, the AR assisting system 102 extracts, based on the intent of the request, multimodal data 110 along with timestamp information 112, from real-time interactions between the user 104 and an expert 106. According to an embodiment, the multimodal data 110 along with the timestamp information 112 may be stored in an interaction database 108 associated with the AR assisting system 102. Further, a portion of the multimodal data 110 may comprise one or more annotations.

At block 506, the AR assisting system 102 determines one or more entities associated with the request based on the one or more annotations associated with the portion of the multimodal data 110.

At block 508, the AR assisting system 102 generates guidance information 114 for assisting the user 104 in performing the task based on the determined one or more entities and by synchronizing each multimodal data 110 in sequence based on the timestamp information 112 of each multimodal data 110.
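
The end-to-end flow of blocks 502-508 can be pictured as a single function; the event-log layout and helper structure below are assumptions carried over from the earlier sketches, and the NLP intent analysis is elided for brevity:

```python
from datetime import time

def assist(request: str, interactions) -> list:
    """Mirror of method 500: receive the request (502), extract annotated
    multimodal data (504), determine entities (506), and generate
    timestamp-ordered guidance (508)."""
    # Block 502: the request is received; its intent analysis (NLP) is elided here.
    # Block 504: keep only events that carry expert annotations.
    annotated = [(t, ann) for t, _, ann in interactions if ann is not None]
    # Block 506: the annotated objects are the entities tied to the request.
    entities = sorted({ann["entity"] for _, ann in annotated})
    # Block 508: synchronize by timestamp and emit stepwise guidance.
    annotated.sort(key=lambda item: item[0])
    steps = [f"Step {i}: {ann['explanation']}"
             for i, (_, ann) in enumerate(annotated, start=1)]
    return [f"Entities: {entities}"] + steps

log = [
    (time(9, 0), "Hi, I am John!", None),
    (time(9, 12), "annotated_image_1.png",
     {"entity": "latch", "explanation": "Pull out the latch."}),
    (time(9, 16), "annotated_image_2.png",
     {"entity": "battery", "explanation": "Slide the battery out."}),
]
print(assist("I want to replace my laptop battery", log))
# ["Entities: ['battery', 'latch']", 'Step 1: Pull out the latch.',
#  'Step 2: Slide the battery out.']
```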

Computer System
FIG. 6 illustrates a block diagram of an exemplary computer system 600 for implementing embodiments consistent with the present invention. In an embodiment, the computer system 600 can be the Augmented Reality (AR) assisting system 102 which is used for assisting a user 104 in real-time in performing a task. According to an embodiment, the computer system 600 may extract interaction data 610 between the user 104 and the expert 106, which may include, for example, the multimodal data 110 and the timestamp information 112, from the interaction database 108. The computer system 600 may comprise a central processing unit (“CPU” or “processor”) 602. The processor 602 may comprise at least one data processor for executing program components for executing user- or system-generated business processes. The processor 602 may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
The processor 602 may be disposed in communication with one or more input/output (I/O) devices (611 and 612) via I/O interface 601. The I/O interface 601 may employ communication protocols/methods such as, without limitation, audio, analog, digital, stereo, IEEE-1394, serial bus, Universal Serial Bus (USB), infrared, PS/2, BNC, coaxial, component, composite, Digital Visual Interface (DVI), high-definition multimedia interface (HDMI), Radio Frequency (RF) antennas, S-Video, Video Graphics Array (VGA), IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., Code-Division Multiple Access (CDMA), High-Speed Packet Access (HSPA+), Global System For Mobile Communications (GSM), Long-Term Evolution (LTE) or the like), etc.
Using the I/O interface 601, the computer system 600 may communicate with one or more I/O devices (611 and 612).
In some embodiments, the processor 602 may be disposed in communication with a communication network 609 via a network interface 603. The network interface 603 may communicate with the communication network 609. The network interface 603 may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), Transmission Control Protocol/Internet Protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 609 can be implemented as one of the different types of networks, such as intranet or Local Area Network (LAN) and such within the organization. The communication network 609 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other. Further, the communication network 609 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc.
In some embodiments, the processor 602 may be disposed in communication with a memory 605 (e.g., RAM 613, ROM 614, etc. as shown in FIG. 6) via a storage interface 604. The storage interface 604 may connect to memory 605 including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as Serial Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), fiber channel, Small Computer Systems Interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, Redundant Array of Independent Discs (RAID), solid-state memory devices, solid-state drives, etc.
The memory 605 may store a collection of program or database components, including, without limitation, user/application data 606, an operating system 607, web browser 608 etc. In some embodiments, the computer system 600 may store user/application data 606, such as the data, variables, records, etc. as described in this invention. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase.
The operating system 607 may facilitate resource management and operation of the computer system 600. Examples of operating systems include, without limitation, APPLE® MACINTOSH® OS X®, UNIX®, UNIX-like system distributions (e.g., BERKELEY SOFTWARE DISTRIBUTION® (BSD), FREEBSD®, NETBSD®, OPENBSD, etc.), LINUX® distributions (e.g., RED HAT®, UBUNTU®, KUBUNTU®, etc.), IBM® OS/2®, MICROSOFT® WINDOWS® (XP®, VISTA®/7/8, 10 etc.), APPLE® IOS®, GOOGLE™ ANDROID™, BLACKBERRY® OS, or the like. The user interface (I/O interface 601) may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 600, such as cursors, icons, checkboxes, menus, scrollers, windows, widgets, etc. Graphical User Interfaces (GUIs) may be employed, including, without limitation, Apple® Macintosh® operating systems' Aqua®, IBM® OS/2®, Microsoft® Windows® (e.g., Aero, Metro, etc.), web interface libraries (e.g., ActiveX®, Java®, Javascript®, AJAX, HTML, Adobe® Flash®, etc.), or the like.
In some embodiments, the computer system 600 may implement the web browser 608 stored program components. The web browser 608 may be a hypertext viewing application, such as MICROSOFT® INTERNET EXPLORER®, GOOGLE™ CHROME™, MOZILLA® FIREFOX®, APPLE® SAFARI®, etc. Secure web browsing may be provided using Secure Hypertext Transport Protocol (HTTPS), Secure Sockets Layer (SSL), Transport Layer Security (TLS), etc. Web browsers 608 may utilize facilities such as AJAX, DHTML, ADOBE® FLASH®, JAVASCRIPT®, JAVA®, Application Programming Interfaces (APIs), etc. In some embodiments, the computer system 600 may implement a mail server stored program component. The mail server 616 may be an Internet mail server such as Microsoft Exchange, or the like. The mail server 616 may utilize facilities such as Active Server Pages (ASP), ACTIVEX®, ANSI® C++/C#, MICROSOFT® .NET, CGI SCRIPTS, JAVA®, JAVASCRIPT®, PERL®, PHP, PYTHON®, WEBOBJECTS®, etc. The mail server 616 may utilize communication protocols such as Internet Message Access Protocol (IMAP), Messaging Application Programming Interface (MAPI), MICROSOFT® Exchange, Post Office Protocol (POP), Simple Mail Transfer Protocol (SMTP), or the like. In some embodiments, the computer system 600 may implement a mail client 615 stored program component. The mail client 615 may be a mail viewing application, such as APPLE® MAIL, MICROSOFT® ENTOURAGE®, MICROSOFT® OUTLOOK®, MOZILLA® THUNDERBIRD®, etc.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present invention. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., it is non-transitory. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, non-volatile memory, hard drives, Compact Disc (CD) ROMs, Digital Video Discs (DVDs), flash drives, disks, and any other known physical storage media.


Advantages of the embodiments of the present disclosure are illustrated herein.
In an embodiment, the present disclosure provides a method of generating the guidance information used for Augmented Reality (AR) virtual assistance without employing any dedicated expert or team.

In an embodiment, the method of the present disclosure learns from the interaction between the user and the expert for generating the guidance information.

In an embodiment, the present disclosure provides scaling up of the knowledge base of the system by observing a plurality of interactions between the user and the expert on different domains.

The terms "an embodiment", "embodiment", "embodiments", "the embodiment", "the embodiments", "one or more embodiments", "some embodiments", and "one embodiment" mean "one or more (but not all) embodiments of the invention(s)" unless expressly specified otherwise.

The terms "including", "comprising", “having” and variations thereof mean "including but not limited to", unless expressly specified otherwise.

The enumerated listing of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise.

The terms "a", "an" and "the" mean "one or more", unless expressly specified otherwise.

A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention.

When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be readily apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.

Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

Reference Numerals:

Reference Number Description
100 ENVIRONMENT
102 AUGMENTED REALITY (AR) ASSISTING SYSTEM
104 USER
106 EXPERT
108 INTERACTION DATABASE
110 MULTIMODAL DATA
112 TIMESTAMP INFORMATION
114 GUIDANCE INFORMATION
202 I/O INTERFACE
204 PROCESSOR
206 MEMORY
208 DATA
210 MODULES
212 TRAINING DATA
214 OTHER DATA
216 RECEIVING MODULE
218 EXTRACTING MODULE
220 DETERMINING MODULE
222 GENERATING MODULE
224 OTHER MODULES
402 FIRST STAGE OF GUIDANCE INFORMATION
402-A EXPLANATION OF THE FIRST STAGE
402-B REGION OF INTEREST OF THE FIRST STAGE
404 SECOND STAGE OF GUIDANCE INFORMATION
404-A EXPLANATION OF THE SECOND STAGE
404-B REGION OF INTEREST OF THE SECOND STAGE
406 THIRD STAGE OF GUIDANCE INFORMATION
406-A EXPLANATION OF THE THIRD STAGE
406-B REGION OF INTEREST OF THE THIRD STAGE
600 EXEMPLARY COMPUTER SYSTEM
601 I/O INTERFACE OF THE EXEMPLARY COMPUTER SYSTEM
602 PROCESSOR OF THE EXEMPLARY COMPUTER SYSTEM
603 NETWORK INTERFACE
604 STORAGE INTERFACE
605 MEMORY OF THE EXEMPLARY COMPUTER SYSTEM
606 USER/APPLICATION DATA
607 OPERATING SYSTEM
608 WEB BROWSER
609 COMMUNICATION NETWORK
610 INTERACTION DATA
611 INPUT DEVICES
612 OUTPUT DEVICES
613 RAM
614 ROM
615 MAIL CLIENT
616 MAIL SERVER
