
A Method And System For Performing Tasks On A User Device

Abstract: The present invention relates to a method and system for performing one or more tasks on a user device. The system (100) comprises a processor (202) and a memory (204) configured to store programmed instructions. The system (100) receives one or more voice inputs. Further, the system (100) processes the one or more voice inputs to identify one or more voice activities from the one or more voice inputs. Further, the system (100) transcribes the one or more voice activities into one or more textual input. Further, the system (100) identifies one or more intent from the one or more textual input. Further, the system (100) generates one or more virtual gestures based on the identified one or more intent. Further, the system (100) performs the one or more tasks on the user device by utilizing the one or more virtual gestures. [To be published with Fig. 1]


Patent Information

Application #:
Filing Date: 19 February 2025
Publication Number: 11/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

Visioapps Technology Private Limited
227, Tower-B, Spaze Edge, Sector-47, Gurgaon, Haryana - 122001, India

Inventors

1. Pramit Bhargava
Villa MD-35, Eldeco Mansionz, Sector-48, Gurgaon - 122018, India

Specification

Description:

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of Invention:
A METHOD AND SYSTEM FOR PERFORMING TASKS ON A USER DEVICE
APPLICANT:
VISIOAPPS TECHNOLOGY PRIVATE LIMITED
An Indian entity having address as:
227, Tower-B, Spaze Edge, Sector-47, Gurgaon, Haryana – 122001, India

The following specification particularly describes the invention and the manner in which it is to be performed.

CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[0001] The present application does not claim priority from any other application.
TECHNICAL FIELD
[0002] The presently disclosed embodiments relate, in general, to the field of processing voice inputs to identify user interaction. More particularly, the presently disclosed embodiments relate to performing one or more tasks on a user device.
BACKGROUND
[0003] This section is intended to introduce the reader to various aspects of art (the relevant technical field or area of knowledge to which the invention pertains), which may be related to various aspects of the present disclosure that are described or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements in this background section are to be read in this light, and not as admissions of prior art. Similarly, a problem mentioned in the background section should not be assumed to have been previously recognized in the prior art.
[0004] In today’s rapidly evolving digital landscape, mobile applications and websites have become integral to everyday life. Users now rely on these platforms for a wide range of activities, including making payments, managing finances, booking rides, shopping, and performing various other tasks. As the complexity and number of digital interactions increase, users increasingly demand more intuitive and convenient ways to interact with their devices. One of the most significant technological advancements in this space has been the rise of voice interfaces, such as Siri, Google Assistant, and Amazon Alexa. These voice-driven systems allow users to control their devices and perform various tasks through spoken commands rather than through traditional input methods like typing or tapping on a screen. This shift towards voice-enabled interactions reflects a broader trend toward more hands-free and convenient user experiences, particularly in mobile environments where users are often on the move or engaged in tasks that make traditional forms of interaction cumbersome. As voice assistants continue to improve in terms of accuracy, speed, and functionality, users are becoming increasingly accustomed to performing more complex actions through speech alone, fuelling demand for deeper integration of voice interfaces across apps and services.
[0005] The field of voice-enabled assistance is fundamentally focused on empowering users to interact with their devices through natural language. At its core, this technology leverages a combination of voice recognition and natural language processing (NLP) to interpret spoken commands and understand user intents. Voice recognition technology enables the system to accurately capture and convert speech into text, while NLP allows the system to interpret the meaning behind those words, accounting for nuances like context, tone, and intent. Over time, voice interfaces have become more sophisticated, enabling devices to understand increasingly complex commands and perform a broader array of actions. In earlier versions, voice assistants could handle only basic tasks, such as setting reminders or playing music. Despite advancements in machine learning and AI, many voice interfaces still struggle with accuracy and context understanding, often requiring manual intervention and limiting their ability to handle complex tasks.
[0006] Users today have increasingly high expectations when it comes to the functionality of voice assistants. Rather than simply issuing basic commands, they now expect voice interfaces to be able to manage complete tasks without the need for manual input. This includes tasks that require multiple steps and integrations with various third-party apps and services. For instance, a user might want to transfer funds from one bank account to another or book a ride through a popular ride-hailing app. In these scenarios, the voice assistant must not only understand the initial command but also navigate through the entire workflow: authenticating the user, processing the request, and confirming the outcome, without requiring the user to touch the screen or type any information. The desire for such end-to-end voice-controlled experiences is driving innovation in the voice assistant industry. Users increasingly want to be able to handle complex transactions seamlessly, from start to finish, using nothing more than their voice. As a result, the demand for voice-enabled systems that can integrate deeply with mobile apps and websites, allowing users to manage entire processes via voice, is growing rapidly. This evolution in user expectations underscores the need for a more robust and versatile voice interface technology capable of supporting multi-step workflows and providing secure, efficient, and reliable interactions across a wide range of services.
[0007] One of the most powerful voice assistants available excels in a range of basic tasks, such as making phone calls, setting alarms, checking account balances, sending messages, and retrieving information from apps. However, its reliance on API-based integrations severely limits its ability to handle more complex app workflows. For example, while it can initiate a task, it often cannot carry out a series of interrelated actions within an app. Additionally, security and privacy concerns further restrict its functionality. Many apps are reluctant to expose sensitive data or authentication steps to external voice assistants, which prevents full control over apps in certain scenarios. As a result, users often need to switch to manual input when completing more intricate processes, interrupting the otherwise seamless experience. This voice assistant is limited to performing basic tasks and cannot complete complex workflows within third-party apps.
[0008] Further, another traditional voice assistant is deeply integrated into its device ecosystem, offering smooth interactions with native apps like messaging, phone calls, and media services. However, its integration with third-party apps remains more limited. While it is effective for simple tasks like sending messages or making calls, it struggles with more complex workflows and is limited to performing basic actions like setting alarms and reminders. This assistant can open apps but cannot interact deeply with them or manage advanced features. The emphasis on data protection and security has led to stringent limitations on app-to-app interactions, further curbing its effectiveness in handling workflows beyond basic functions. As a result, this voice assistant is primarily useful for simple, direct tasks but lacks the ability to manage more complex app interactions.
[0009] In view of the above, addressing the aforementioned technical challenges requires a system for performing one or more tasks on a user device.
[0010] Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of described systems with some aspects of the present disclosure, as set forth in the remainder of the present application and with reference to the drawings.
SUMMARY
[0011] This summary is provided to introduce concepts related to a method and a system for performing one or more tasks on a user device and the concepts are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
[0012] According to embodiments illustrated herein, a method for performing one or more tasks on a user device is disclosed. The method may be implemented by an electronic device including one or more processors and a memory communicatively coupled to the processor, and the memory is configured to store processor-executable programmed instructions. Further, the method may comprise a step of receiving one or more voice inputs via a microphone coupled with the user device. Further, the method may comprise a step of processing the one or more voice inputs to identify one or more voice activities from the one or more voice inputs. Further, the method may comprise a step of transcribing the one or more voice activities into one or more textual input. Further, the method may comprise a step of identifying one or more intent from the one or more textual input. Further, the method may comprise a step of generating one or more virtual gestures based on the identified one or more intent. Further, the method may comprise a step of performing the one or more tasks on the user device by utilizing the one or more virtual gestures.
[0013] According to embodiments illustrated herein, a system for performing one or more tasks on a user device is disclosed. Further, the system may comprise a processor and a memory. Further, the memory may be configured to store programmed instructions that cause the processor to perform the following operations. Further, the processor may be configured for receiving one or more voice inputs via a microphone coupled with the user device. Further, the processor may be configured for processing the one or more voice inputs to identify one or more voice activities from the one or more voice inputs. Further, the processor may be configured for transcribing the one or more voice activities into one or more textual input. Further, the processor may be configured for identifying one or more intent from the one or more textual input. Further, the processor may be configured for generating one or more virtual gestures based on the identified one or more intent. Further, the processor may be configured for performing the one or more tasks on the user device by utilizing the one or more virtual gestures.
[0014] According to embodiments illustrated herein, a non-transitory computer-readable storage medium for performing one or more tasks on a user device is disclosed. The non-transitory computer-readable storage medium has stored thereon a set of computer-executable instructions causing a computer comprising a processor to perform steps. The steps may comprise receiving one or more voice inputs. Further, the steps may comprise processing the one or more voice inputs to identify one or more voice activities from the one or more voice inputs. Further, the steps may comprise transcribing the one or more voice activities into one or more textual input. Further, the steps may comprise identifying one or more intent from the one or more textual input. Further, the steps may comprise generating one or more virtual gestures based on the identified one or more intent. Furthermore, the steps may comprise performing the one or more tasks on the user device by utilizing the one or more virtual gestures.
[0015] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF DRAWINGS
[0016] The accompanying drawings illustrate the various embodiments of systems, methods, and other aspects of the disclosure. Any person with ordinary skill in the art will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. In some examples, one element may be designed as multiple elements, or multiple elements may be designed as one element. In some examples, an element shown as an internal component of one element may be implemented as an external component in another, and vice versa. Further, the elements may not be drawn to scale.
[0017] Various embodiments will hereinafter be described in accordance with the appended drawings, which are provided to illustrate and not to limit the scope in any manner, wherein similar designations denote similar elements, and in which:
[0018] FIG. 1 is a block diagram that illustrates a system (100) for performing one or more tasks on a user device, in accordance with an embodiment of the present subject matter.
[0019] FIG. 2 is a block diagram that illustrates various components of an application server (104) configured for performing the one or more tasks on the user device, in accordance with an embodiment of the present subject matter.
[0020] FIG. 3 is a flowchart that illustrates a method (300) for performing the one or more tasks on the user device, in accordance with an embodiment of the present subject matter; and
[0021] FIG. 4 illustrates a block diagram (400) of an exemplary computer system for implementing embodiments consistent with the present subject matter.
DETAILED DESCRIPTION
[0022] The present disclosure may be best understood with reference to the detailed figures and description set forth herein. Various embodiments are discussed below with reference to the figures. However, those skilled in the art will readily appreciate that the detailed descriptions given herein with respect to the figures are simply for explanatory purposes as the methods and systems may extend beyond the described embodiments. For example, the teachings presented, and the needs of a particular application may yield multiple alternative and suitable approaches to implement the functionality of any detail described herein. Therefore, any approach may extend beyond the particular implementation choices in the following embodiments described and shown.
[0023] References to “one embodiment,” “at least one embodiment,” “an embodiment,” “one example,” “an example,” “for example,” and so on indicate that the embodiment(s) or example(s) may include a particular feature, structure, characteristic, property, element, or limitation but that not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element, or limitation. Further, repeated use of the phrase “in an embodiment” does not necessarily refer to the same embodiment. The terms “comprise”, “comprising”, “include(s)”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, system or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or system or method. In other words, one or more elements in a system or apparatus preceded by “comprises… a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or apparatus.
[0024] The objective of the present disclosure is to provide a method for performing one or more tasks on a user device.
[0025] Another objective of the present disclosure is to provide a system where voice inputs are processed to identify voice activities, enabling seamless task execution.
[0026] Yet another objective of the present disclosure is to convert voice activities into textual input for further analysis and task execution on the user device.
[0027] Yet another objective of the present disclosure is to enable the identification of the user intents from textual inputs.
[0028] Yet another objective of the present disclosure is to generate virtual gestures based on identified intents, which can be used to control and execute tasks on the user device.
[0029] Yet another objective of the present disclosure is to perform one or more tasks on a user device by utilizing virtual gestures that correspond to specific actions or commands.
[0030] Yet another objective of the present disclosure is to process voice inputs by removing blank portions to improve the accuracy and efficiency of subsequent task execution.
[0031] Yet another objective of the present disclosure is to provide real-time feedback to users after identifying their intents, which assists in further task execution on the user device.
[0032] Yet another objective of the present disclosure is to possess the capability to comprehend user intentions and execute gestures accordingly to facilitate a seamless end-to-end process.
[0033] Yet another objective of the present disclosure is to prompt users for additional voice inputs during task execution, ensuring that the system remains responsive to user needs.
[0034] Yet another objective of the present disclosure is to perform the generation of virtual gestures after receiving additional voice inputs, allowing for more accurate and context-specific task execution.
[0035] Yet another objective of the present disclosure is to capture voice inputs using a microphone coupled with the user device, ensuring the system can work in diverse environments.
[0036] Yet another objective of the present disclosure is to perform time alignment of voice input data to synchronize the corresponding textual input with accurate timestamps, ensuring proper task execution.
[0037] Yet another objective of the present disclosure is to map textual input to a time-aligned transcript by deriving timestamps from the voice activities, improving the efficiency of task execution.
[0038] Yet another objective of the present disclosure is to enable the sequential execution of virtual gestures on the user interface, enhancing the accuracy and timing of task performance.
[0039] Yet another objective of the present disclosure is to create a robust system architecture that allows voice command interpretation, intent recognition, and gesture generation without requiring app-specific API access.
[0040] Yet another objective of the present disclosure is to provide humanoid gesture control for executing one or more functions on a user device through virtual gestures eliminating the need for manual human interaction by utilizing voice input.
[0041] FIG. 1 is a block diagram that illustrates a system (100) for performing one or more tasks on a user device, in accordance with an embodiment of the present subject matter. The system (100) typically includes a database server (102), an application server (104), a communication network (106), and one or more portable devices (108). The database server (102), the application server (104), and the one or more portable devices (108) are typically communicatively coupled with each other via the communication network (106). In an embodiment, the application server (104) may communicate with the database server (102) and the one or more portable devices (108) using one or more protocols such as, but not limited to, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), RF mesh, Bluetooth Low Energy (BLE), and the like.
[0042] In one embodiment, the database server (102) may refer to a computing device that may be configured to store the one or more voice inputs, the one or more voice activities, the one or more tasks, the one or more intent from the one or more textual input, the one or more virtual gestures, and a process for performing the one or more tasks on the user device.
[0043] In an embodiment, the database server (102) may include a special purpose operating system specifically configured to perform one or more database operations on the stored content. Examples of database operations may include, but are not limited to, storing, retrieving, comparing, and updating data. In an embodiment, the database server (102) may include hardware that may be configured to perform one or more predetermined operations. In an embodiment, the database server (102) may be realized through various technologies such as, but not limited to, Microsoft® SQL Server, Oracle®, IBM DB2®, Microsoft Access®, PostgreSQL®, MySQL®, SQLite®, distributed database technology and the like. In an embodiment, the database server (102) may be configured to utilize the application server (104) for implementing the method for performing the one or more tasks on the user device.
[0044] A person with ordinary skill in the art will understand that the scope of the disclosure is not limited to the database server (102) as a separate entity. In an embodiment, the functionalities of the database server (102) can be integrated into the application server (104) or into the one or more portable devices (108).
[0045] In an embodiment, the application server (104) may refer to a computing device or a software framework hosting an application or a software service. In an embodiment, the application server (104) may be implemented to execute procedures such as, but not limited to, programs, routines, or scripts stored in one or more memories for supporting the hosted application or the software service. In an embodiment, the hosted application or the software service may be configured to perform one or more predetermined operations. The application server (104) may be realized through various types of application servers such as, but not limited to, a Java application server, a .NET framework application server, a Base4 application server, a PHP framework application server, or any other application server framework.
[0046] In an embodiment, the application server (104) may be configured to utilize the database server (102) and the one or more portable devices (108), in conjunction, for implementing the method for performing the one or more tasks on the user device. In an implementation, the application server (104) corresponds to an infrastructure for implementing the method for performing the one or more tasks on the user device. In an exemplary embodiment, the one or more tasks may comprise booking a cab, transferring money, ordering food, investing in stocks or gold, remotely controlling smart home devices, setting alarms or timers, sending messages or making calls, accessing calendar events, or a combination thereof.
[0047] In an embodiment, the communication network (106) may correspond to a communication medium through which the application server (104), the database server (102), and the one or more portable devices (108) may communicate with each other. Such a communication may be performed in accordance with various wired and wireless communication protocols. Examples of such wired and wireless communication protocols include, but are not limited to, Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), Wireless Application Protocol (WAP), File Transfer Protocol (FTP), ZigBee, EDGE, infrared (IR), IEEE 802.11, 802.16, 2G, 3G, 4G, 5G, 6G, 7G cellular communication protocols, and/or Bluetooth (BT) communication protocols. The communication network (106) may either be a dedicated network or a shared network. Further, the communication network (106) may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like. The communication network (106) may include, but is not limited to, the Internet, an intranet, a cloud network, a Wireless Fidelity (Wi-Fi) network, a Wireless Local Area Network (WLAN), a Local Area Network (LAN), a cable network, a wireless network, a telephone network (e.g., Analog, Digital, POTS, PSTN, ISDN, xDSL), a telephone line (POTS), a Metropolitan Area Network (MAN), an electronic positioning network, an X.25 network, an optical network (e.g., PON), a satellite network (e.g., VSAT), a packet-switched network, a circuit-switched network, a public network, a private network, and/or other wired or wireless communications network configured to carry data.
[0048] In an embodiment, the one or more portable devices (108) may refer to a computing device used by a user. The one or more portable devices (108) may comprise one or more processors and one or more memories. The one or more memories may include computer-readable code that may be executable by the one or more processors to perform predetermined operations. In an embodiment, the one or more portable devices (108) may present a web interface or an application interface for performing the one or more tasks on the user device using the application server (104). Example application interfaces presented on the one or more portable devices (108) may display the one or more virtual gestures and real-time feedback. Examples of the one or more portable devices (108) may include, but are not limited to, a personal computer, a laptop, a personal digital assistant (PDA), a mobile device, a tablet, a microphone, or any other computing device.
[0049] The system (100) can be implemented using hardware, software, or a combination of both, which includes using, where suitable, one or more computer programs, mobile applications, or “apps” deployed either on-premises on the corresponding computing terminals or virtually over cloud infrastructure. The system (100) may include various micro-services or groups of independent computer programs which can act independently in collaboration with other micro-services. The system (100) may also interact with a third-party or external computer system. Internally, the system (100) may be the central processor of all requests for transactions by the various actors or users of the system. A critical attribute of the system (100) is that it can concurrently and instantly perform the one or more tasks of the user from the environment in collaboration with other systems.
[0050] In one embodiment, the system (100) is configured to execute the one or more tasks on the user device based on the one or more voice inputs and the one or more virtual gestures. Further, the system (100) integrates various components to process user commands effectively.
[0051] FIG. 2 is a block diagram (200) that illustrates various components of the application server (104) configured for performing one or more tasks on a user device in a stepwise manner, in accordance with an embodiment of the present subject matter. Further, FIG. 2 is explained in conjunction with elements from FIG. 1. Here, the application server (104) preferably includes a processor (202), a memory (204), a transceiver (206), an Input/Output unit (208), a user interface (210), a voice processing unit (212), an intent generation unit (214), a gesture generation unit (216), and a task performance unit (218). The processor (202) is further preferably communicatively coupled to the memory (204), the transceiver (206), the Input/Output unit (208), the user interface (210), the voice processing unit (212), the intent generation unit (214), the gesture generation unit (216), and the task performance unit (218), while the transceiver (206) is preferably communicatively coupled to the communication network (106).
[0052] The processor (202) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to execute a set of instructions stored in the memory (204), and may be implemented based on several processor technologies known in the art. The processor (202) works in coordination with the memory (204), the transceiver (206), the Input/Output unit (208), the user interface (210), the voice processing unit (212), the intent generation unit (214), the gesture generation unit (216), and the task performance unit (218) for performing the one or more tasks on the user device. Examples of the processor (202) include, but are not limited to, a standard microprocessor, a microcontroller, a central processing unit (CPU), an X86-based processor, a Reduced Instruction Set Computing (RISC) processor, an Application-Specific Integrated Circuit (ASIC) processor, a Complex Instruction Set Computing (CISC) processor, a distributed or cloud processing unit, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions and/or other processing logic that accommodates the requirements of the present invention.
[0053] The memory (204) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to store the set of instructions, which are executed by the processor (202). Preferably, the memory (204) is configured to store one or more programs, routines, or scripts that are executed in coordination with the processor (202). Additionally, the memory (204) may include any computer-readable medium or computer program product known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, a Hard Disk Drive (HDD), flash memories, Secure Digital (SD) card, Solid State Disks (SSD), optical disks, magnetic tapes, memory cards, virtual memory and distributed cloud storage. The memory (204) may be removable, non-removable, or a combination thereof. Further, the memory (204) may include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types. The memory (204) may include programs or coded instructions that supplement applications and functions of the system (100). In one embodiment, the memory (204), amongst other things, serves as a repository for storing data processed, received, and generated by one or more of the programs or the coded instructions. In yet another embodiment, the memory (204) may be managed under a federated structure that enables adaptability and responsiveness of the application server (104).
[0054] The transceiver (206) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to receive, process or transmit information, data or signals, which are stored by the memory (204) and executed by the processor (202). The transceiver (206) is preferably configured to receive, process or transmit, one or more programs, routines, or scripts that are executed in coordination with the processor (202). The transceiver (206) is preferably communicatively coupled to the communication network (106) of the system (100) for communicating all the information, data, signal, programs, routines or scripts through the network.
[0055] The transceiver (206) may implement one or more known technologies to support wired or wireless communication with the communication network (106). In an embodiment, the transceiver (206) may include, but is not limited to, an antenna, a radio frequency (RF) transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a Universal Serial Bus (USB) device, a coder-decoder (CODEC) chipset, a subscriber identity module (SIM) card, and/or a local buffer. Also, the transceiver (206) may communicate via wireless communication with networks, such as the Internet, an Intranet and/or a wireless network, such as a cellular telephone network, a wireless local area network (LAN) and/or a metropolitan area network (MAN). Accordingly, the wireless communication may use any of a plurality of communication standards, protocols and technologies, such as: Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), wideband code division multiple access (W-CDMA), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wireless Fidelity (Wi-Fi) (e.g., IEEE 802.11a, IEEE 802.11b, IEEE 802.11g and/or IEEE 802.11n), voice over Internet Protocol (VoIP), Wi-MAX, a protocol for email, instant messaging, and/or Short Message Service (SMS).
[0056] The input/output (I/O) unit (208) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to receive or present information. The input/output unit (208) comprises various input and output devices that are configured to communicate with the processor (202). Examples of the input devices include, but are not limited to, a keyboard, a mouse, a joystick, a touch screen, a microphone, a camera, and/or a docking station. Examples of the output devices include, but are not limited to, a display screen and/or a speaker. The I/O unit (208) may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O unit (208) may allow the system (100) to interact with the user directly or through the portable devices (108). Further, the I/O unit (208) may enable the system (100) to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O unit (208) can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O unit (208) may include one or more ports for connecting a number of devices to one another or to another server. In one embodiment, the I/O unit (208) allows the application server (104) to be logically coupled to other portable devices (108), some of which may be built in. Illustrative components include tablets, mobile phones, desktop computers, wireless devices, etc. Further, the input/output (I/O) unit (208) may comprise an input device, namely a microphone, to receive the one or more voice inputs. In an embodiment, the I/O unit (208) may require one or more special permissions for receiving one or more user inputs. The one or more special permissions comprise at least one of, but are not limited to, a microphone permission, camera permission, location permission, contact permission, storage permission, SMS permission, calendar permission, Bluetooth permission, phone permission, notification permission, or a combination thereof.
[0057] In yet another embodiment, the user interface (210) of the application server (104) is disclosed. Further, the user interface unit (210) may be configured to present one or more user interface (UI) components on a display for performing the one or more tasks on the user device. Further, the user interface (UI) components may include, but are not limited to, the one or more voice inputs, the one or more user intent, the one or more virtual gestures based on the identified one or more intent, the one or more textual input, and performing the one or more tasks. In one embodiment, the user interface unit (210) may be configured to display the one or more virtual gestures and the real-time feedback based on the one or more tasks. In an embodiment, the user interface unit (210) may include, but is not limited to, an app interface, a web interface, a graphical user interface, a touch user interface, or a voice interface. Further, the user interface unit (210) may be configured to generate the one or more virtual gestures based on the identified one or more intent. In an embodiment, the user interface (UI) unit (210) may be communicatively coupled with the I/O unit (208) for receiving the one or more user inputs. Further, the user interface (UI) unit (210) may perform the one or more tasks based on the received one or more user inputs. In an exemplary embodiment, the user interface (UI) unit (210) is configured to activate the application of the user device based on the voice input received from the one or more users. The application may correspond to one of an Android application, an iOS application, or a website.
[0058] In an embodiment, the voice processing unit (212) of the application server (104) is disclosed. The voice processing unit (212) may be configured for processing the one or more voice inputs to identify the one or more voice activities from the one or more voice inputs. Further, the processing of the one or more voice inputs may correspond to removing blank portions from the one or more voice inputs. Further, the one or more voice inputs may be captured using a microphone coupled with the user device. Further, the processing of the one or more voice inputs may comprise a time alignment, after transcribing the one or more voice activities into the one or more textual input. In one exemplary scenario, the user device may receive the one or more voice inputs simultaneously. Further, the voice processing unit (212) may be configured to segregate the one or more voice inputs received from the one or more users based on the time alignment of the received inputs. Further, the time alignment may correspond to mapping the one or more textual input to a corresponding timestamp derived from time intervals of the one or more voice activities to generate a time-aligned transcript.
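The following is a minimal Python sketch of one way such a time-aligned transcript could be assembled; the data structures, the helper name build_time_aligned_transcript, and the injected transcribe callable are illustrative assumptions and not part of the specification.

```python
# Illustrative sketch only: pairing each transcribed voice activity with the
# timestamps of the interval in which it was detected.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VoiceActivity:
    start_s: float   # start of the detected speech interval, in seconds
    end_s: float     # end of the detected speech interval, in seconds
    audio: bytes     # raw PCM samples for this interval

@dataclass
class AlignedSegment:
    start_s: float
    end_s: float
    text: str

def build_time_aligned_transcript(activities: List[VoiceActivity],
                                  transcribe: Callable[[bytes], str]) -> List[AlignedSegment]:
    """Transcribe each voice activity and attach the timestamps of the time
    interval it came from, yielding a time-aligned transcript."""
    transcript = []
    for activity in activities:
        text = transcribe(activity.audio).strip()   # any speech-to-text backend
        if text:                                    # skip empty transcriptions
            transcript.append(AlignedSegment(activity.start_s, activity.end_s, text))
    return transcript
```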
[0059] In one embodiment, the voice processing unit (212) may be configured for transcribing the one or more voice activities into the one or more textual input. Further, the textual input may correspond to the one or more voice inputs received from the user device. Further, the transcribing may be performed by at least one of DeepSpeech, Wav2vec, Kaldi, SpeechBrain, or any combination of speech-to-text algorithms.
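As a hedged illustration of this transcription step, the sketch below uses a publicly available Wav2Vec 2.0 checkpoint through the Hugging Face transformers pipeline; the specific model name is an assumption chosen for the example, since the specification only names families of speech-to-text algorithms.

```python
# Minimal transcription sketch using an off-the-shelf Wav2Vec 2.0 model.
# Assumes the `transformers` package (and an audio decoding backend) is installed.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")   # example checkpoint

def transcribe_file(audio_path: str) -> str:
    """Convert one recorded voice activity (a WAV file here) into textual input."""
    result = asr(audio_path)          # returns a dict such as {"text": "..."}
    return result["text"].lower()
```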
[0060] In a non-limiting embodiment, the voice processing unit (212) may comprise a voice activity detection (VAD) module. Further, the voice activity detection (VAD) module may be configured to divide the one or more voice inputs into one or more portions. Further, the one or more portions may correspond to the one or more voice activities, one or more blank portions, one or more junk portions, or a combination thereof. Further, the one or more blank portions and junk portions may correspond to background noise, silence, other non-speech elements, unclear voice commands, or a combination thereof. Further, the voice processing unit (212) may be configured to extract only the relevant part of each voice input from the one or more voice inputs received from the user.
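A sketch of how such voice activity detection might be realized with the open-source webrtcvad package is shown below; the frame size, sample rate, and aggressiveness setting are illustrative assumptions.

```python
# Splits 16-bit mono PCM audio into speech (voice activity) and non-speech
# (blank/junk) frames so that only the relevant part is transcribed.
import webrtcvad
from typing import Tuple

SAMPLE_RATE = 16000                        # 16 kHz, 16-bit mono PCM
FRAME_MS = 30                              # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2

def split_voice_activities(pcm: bytes) -> Tuple[bytes, bytes]:
    """Return (speech, junk): speech holds the detected voice activities,
    junk holds silence, noise and other non-speech portions."""
    vad = webrtcvad.Vad(3)                 # 0 = least aggressive, 3 = most aggressive
    speech, junk = [], []
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        (speech if vad.is_speech(frame, SAMPLE_RATE) else junk).append(frame)
    return b"".join(speech), b"".join(junk)
```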
[0061] In another embodiment, the intent generation unit (214) of the application server (104) is disclosed. The intent generation unit (214) may be configured for identifying one or more intent from the one or more textual input. Further, the one or more intent may correspond to initiating, modifying, completing, navigating, scheduling, providing feedback, managing settings, or executing contextual commands, or a combination thereof. Further, the intent generation unit (214) may be configured for providing real-time feedback to one or more users. Further, the real-time feedback may be provided to the one or more users after identifying the one or more intent. Further, the real-time feedback may be provided to the one or more users while performing the one or more tasks on the user device. Further, the real-time feedback may correspond to a prompt for receiving additional one or more user inputs. Further, performing the one or more tasks may correspond to a sequential execution of the one or more virtual gestures on the UI associated with the user device.
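A deliberately simple, rule-based sketch of the intent identification step is given below; a production system would more likely use an NLP or machine-learning model, and the keyword table and function name identify_intent are purely illustrative assumptions.

```python
# Maps transcribed text to a coarse intent; when nothing matches, the caller
# can issue real-time feedback prompting for an additional voice input.
import re
from typing import Optional, Tuple

INTENT_PATTERNS = {
    "invest":   re.compile(r"\binvest\b|\bdigital gold\b", re.I),
    "book_cab": re.compile(r"\bbook (a )?cab\b|\bride to\b", re.I),
    "transfer": re.compile(r"\btransfer\b|\bsend money\b", re.I),
}

def identify_intent(textual_input: str) -> Tuple[Optional[str], bool]:
    """Return (intent, needs_more_input)."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(textual_input):
            return intent, False
    return None, True   # no intent recognized: prompt the user for clarification
```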
[0062] In one embodiment, the gesture generation unit (216) of the application server (104) is disclosed. The gesture generation unit (216) may be configured to generate the one or more virtual gestures based on the identified one or more intent. Further, the human-like actions may be performed by a non-human entity through the one or more virtual gestures. In an exemplary embodiment, performing the one or more virtual gestures may correspond to a humanoid gesture control. Further, the humanoid gesture control may refer to the act of simulating human-like interactions on a user device's screen via the one or more virtual gestures by allowing the non-human entity to control, navigate, and manipulate the interface. Further, the one or more virtual gestures may be configured to simulate the human-like actions on the user device. Further, the one or more virtual gestures may include, but are not limited to, clicking on a dialer, fetching the relevant data, searching for a button, highlighting a section, selecting a section, entering an amount, calculating the amount, entering a location, selecting a cab, asking for a confirmation, describing every step of the one or more tasks, or a combination thereof. Further, the gesture generation unit (216) may be configured to generate the one or more virtual gestures to be performed after receiving the additional one or more voice inputs. Further, the one or more virtual gestures may correspond to one or more control events to be performed on the user interface (UI) of the user device. In one exemplary embodiment, the gesture generation unit (216) may interact with app or website interfaces without requiring dedicated APIs or special permissions, facilitating secure, end-to-end task completion directly through the virtual gestures.
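The sketch below illustrates, under assumed names and hard-coded UI labels, how an identified intent might be expanded into an ordered list of virtual gestures (control events); a real implementation would resolve targets from the live UI tree, for instance via an accessibility service, rather than hard-coding them.

```python
# Turns an identified intent plus extracted slots into a sequence of virtual
# gestures to be replayed on the user interface.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class VirtualGesture:
    action: str          # "tap", "type", "scroll", ...
    target: str          # label of the UI element the gesture is aimed at
    payload: str = ""    # text to type, amount to enter, etc.

def generate_gestures(intent: str, slots: Dict[str, str]) -> List[VirtualGesture]:
    if intent == "invest":
        return [
            VirtualGesture("tap", "Investment tab"),
            VirtualGesture("tap", "Digital Gold section"),
            VirtualGesture("type", "Amount field", slots.get("amount", "")),
            VirtualGesture("tap", "Start Saving button"),
            VirtualGesture("tap", "Proceed to Pay button"),
        ]
    if intent == "book_cab":
        return [
            VirtualGesture("tap", "Book a Cab button"),
            VirtualGesture("type", "Destination field", slots.get("destination", "")),
            VirtualGesture("tap", "Confirm Booking button"),
        ]
    return []
```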
[0063] In another embodiment, the task performance unit (218) of the application server (104) is disclosed. The task performance unit (218) is configured for performing the one or more tasks on the user device by utilizing the one or more virtual gestures. Further, the one or more tasks may include, but are not limited to, financial transactions, booking a cab service, sending messages, making calls, controlling smart home devices, or a combination thereof. Further, the task performance unit (218) may be configured to provide the feedback to the one or more users after the completion of the one or more tasks. Further, the task performance unit (218) may be configured to end the voice transaction after the completion of the one or more tasks.
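A short sketch of the sequential execution performed by the task performance unit is shown below; the dispatch and speak callables stand in for whatever mechanism actually injects gestures into the UI and produces audible feedback, and both are assumptions made for illustration.

```python
# Replays the generated virtual gestures in order, narrating each step and
# announcing completion, after which the voice transaction can be ended.
from typing import Callable, Iterable

def perform_task(gestures: Iterable, dispatch: Callable, speak: Callable) -> None:
    for step, gesture in enumerate(gestures, start=1):
        dispatch(gesture)                                         # inject one control event
        speak(f"Step {step}: {gesture.action} on {gesture.target}")
    speak("Task completed.")                                      # feedback on completion
```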
[0064] A person skilled in the art will understand that the scope of the disclosure should not be limited to the voice assistance domain or to the aforementioned techniques. Further, the examples provided supra are for illustrative purposes and should not be construed to limit the scope of the disclosure.
[0065] Referring to FIG. 3, a flowchart illustrates a method (300) for performing the one or more tasks on the user device, in accordance with at least one embodiment of the present subject matter. The method (300) may be implemented by an electronic device (108) including one or more processors (202) and a memory (204) communicatively coupled to the processor (202), and the memory (204) is configured to store processor-executable programmed instructions that cause the processor to perform the following steps.
[0066] At step (302), the processor (202) is configured to receive one or more voice inputs.
[0067] At step (304), the processor (202) is configured to process the one or more voice inputs to identify one or more voice activities from the one or more voice inputs.
[0068] At step (306), the processor (202) is configured to transcribe the one or more voice activities into one or more textual input.
[0069] At step (308), the processor (202) is configured to identify one or more intent from the one or more textual input.
[0070] At step (310), the processor (202) is configured to generate one or more virtual gestures based on the identified one or more intent.
[0071] At step (312), the processor (202) is configured to perform the one or more tasks on the user device by utilizing the one or more virtual gestures.
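For orientation only, the fragment below composes the illustrative helpers sketched earlier in this description into a single pass over steps (302) to (312); all function names are assumptions made for the example and do not define the claimed method.

```python
# End-to-end sketch: voice input -> voice activities -> text -> intent ->
# virtual gestures -> task execution, with real-time feedback on failure.
def perform_tasks_from_voice(pcm: bytes, transcribe, dispatch, speak) -> None:
    speech, _junk = split_voice_activities(pcm)      # steps (302) and (304)
    text = transcribe(speech)                        # step (306)
    intent, needs_more = identify_intent(text)       # step (308)
    if needs_more:
        speak("Sorry, could you repeat that?")       # prompt for additional voice input
        return
    gestures = generate_gestures(intent, slots={})   # step (310)
    perform_task(gestures, dispatch, speak)          # step (312)
```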
[0072] Let us delve into a detailed working example of the present disclosure.
[0073] Example 01: X, the uninformed investor, eagerly logs into the disclosed device, stepping into a world brimming with potential opportunities, fluctuating markets, and real-time financial data, hoping to navigate the complexities and make decisions that could yield profitable returns.
[0074] The interaction begins as X speaks their intent into the device: “I want to invest 50 rupees in digital gold.” This voice command is captured by the microphone of X’s device and transmitted to the voice processing unit (212). Further, the system immediately converts the spoken words of X into a text format and initiates the next phase of the process. Further, the system accurately processes the voice input of X to ensure no critical information is lost or misinterpreted. This is accomplished by removing unnecessary blank spaces, filler words, or irrelevant sounds that may occur during the spoken input, ensuring that only the relevant content is retained for further analysis. Further, the processed text is now ready for the next step: intent identification.
[0075] Once the voice input of X is transcribed into text, the system is configured to determine X’s intent. This is done by analyzing the transcribed text of X using sophisticated algorithms designed to detect specific commands or actions. In this case, the system recognizes that X’s intent is to invest a fixed amount of 50 rupees in digital gold. Further, the intent generation unit (214), which is an integral part of the processor, plays a crucial role. It understands that the spoken words imply an action related to digital gold investment, which is a task that the system is pre-programmed to handle. Further, the system is configured to generate virtual gestures. Further, the gestures simulate the interactions that X would normally perform on their device manually.
[0076] Further, the gesture generation unit (216) is responsible for creating the virtual gestures based on the identified intent. Once the system has understood that X wishes to make an investment in digital gold, it begins simulating the necessary on-screen interactions. First, the gesture generation unit (216) simulates the action of selecting the "Investment" tab from the device's home screen, opening up the investment options available. After this, it navigates the interface to find the "Digital Gold" section, which is specifically designed for making gold-related purchases. Further, upon selecting "Digital Gold," the system must now calculate how much gold 50 rupees will buy at the current market rate. Further, the task performance unit (218) is configured to perform real-time calculations, which are displayed to X. Further, the system instantly updates the user interface to show the weight of gold that may be purchased for 50 rupees, ensuring that X has an accurate understanding of the transaction before proceeding. These calculations are performed dynamically, taking into account the fluctuating price of gold at the moment of the transaction.
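The real-time calculation mentioned above reduces to a simple division; the sketch below uses a made-up gold rate purely for illustration, since the live market price would be fetched at transaction time.

```python
# Converts the spoken rupee amount into a gold weight at a given rate.
def grams_of_gold(amount_inr: float, price_per_gram_inr: float) -> float:
    return round(amount_inr / price_per_gram_inr, 4)

# e.g. at an assumed rate of 6,500 INR per gram, 50 rupees buys about 0.0077 g
print(grams_of_gold(50, 6500))   # 0.0077
```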
[0077] In order to ensure that X is fully informed and confident in their decision, the system provides clear and timely feedback. After displaying the amount of gold that may be purchased, the system prompts X with a voice confirmation: “Do you confirm investing 50 rupees in digital gold?” This voice prompt allows X to review the details of the transaction in real time, ensuring that X understands exactly what the system is about to do. If X is unsure or wishes to modify the investment amount or type of investment, X may issue another voice command, which would update the task execution accordingly. Once X confirms their intention to invest, the system is configured to proceed to finalize the transaction. Further, the task performance unit (218) simulates a tap on the "Start Saving" button to initiate the gold investment process. Following this, the system automatically simulates a "Proceed to Pay" gesture to complete the financial transaction. The entire process, from the initial voice input to the final confirmation, is conducted without requiring X to physically interact with the device, making the experience smooth, efficient, and accessible.
The ability to perform a sequence of tasks in this way significantly enhances X’s experience. Further, the system ensures that each gesture is time-aligned with the specific input, allowing for a smooth flow of actions. Whether it is purchasing digital gold or completing a money transfer, the tasks are performed in a logical, sequential order, ensuring that X never feels lost or confused about the process. The voice input system also offers real-time feedback, asking for additional voice inputs as needed, further enhancing the experience and allowing for corrections or clarifications as X interacts with the device.
[0078] Moreover, the system is designed to be adaptable to various voice inputs. If at any point X needs additional information or wants to adjust their request (e.g., increasing the investment amount), the system can prompt for additional voice inputs, re-calculating or re-navigating the interface in real time. This adaptability is key to creating an efficient, personalized experience, especially for someone like X, who may not be well-versed in digital investment platforms. In this way, the voice-assisted system provides X with a seamless and accessible interface to perform tasks they might not have otherwise considered, such as investing in digital gold. By transforming voice commands into actionable virtual gestures and providing real-time feedback, the system empowers users to make informed decisions, regardless of their prior knowledge or technical expertise.
[0079] Example 02: Y, a busy professional, wants to head to City Center Mall, Delhi, and decides to book a cab using his smartphone.
[0080] Y logs into his smartphone and, without navigating through complex menus or apps, simply speaks his request: “Book a cab to City Center Mall.” Further, this voice command is captured by the microphone of Y’s device, which transmits the spoken words to the voice processing unit (212). Further, the system (100) is configured to convert Y’s voice into text and to create a clean transcription of his request. Further, the converted text is sent to the intent generation unit (214). Further, the intent generation unit (214) is configured to process the input and determines Y’s clear intent: Y wants to book a cab to the specified destination, i.e., City Center Mall, Delhi.
[0081] Once the system (100) has accurately identified the intent of Y, it triggers the gesture generation unit (216) to simulate the necessary actions on the screen. Further, these virtual gestures may replicate the steps Y would take manually, ensuring a seamless experience without requiring Y’s physical interaction with his device. Further, the gesture generation unit (216) first activates the ride-hailing application, simulating the action of selecting the “Book a Cab” button on the home screen. Further, this action opens the cab booking interface, preparing the system (100) for the further steps. With the app now open, the task performance unit (218) takes over. Further, the task performance unit (218) may automatically fill in Y’s pickup location, using the GPS functionality on his device to detect his current position. Further, this information is then displayed in the "Pickup Location" field, ensuring that Y does not need to enter his details manually. After confirming or automatically populating the pickup address, the system (100) moves to the "Destination" field and enters “City Center Mall, Delhi” as the destination. At this stage, the essential travel information is already set, and the system (100) is ready to present Y with one or more ride options.
[0082] Further, the system (100) proceeds by searching for available ride options. Based on Y’s location and preferences, the task performance unit (218) identifies and presents a variety of options such as "Standard," "Premium," or "Shared" rides. Further, the choices are clearly highlighted on Y’s screen, giving him the ability to choose the ride that best fits his requirements. Further, to enhance Y's experience, the system (100) provides real-time feedback by audibly listing the available options and their estimated fares. Further, the voice feedback may say, “Available ride options: Standard, Premium, and Shared. The Standard ride costs 150 rupees, Premium costs 250 rupees, and Shared costs 100 rupees. Would you like to book a Standard, Premium, or Shared ride to City Center Mall?”. After hearing this prompt, Y responds, “Standard,” indicating his preference for the most economical option. Upon hearing Y’s voice confirmation, the gesture generation unit (216) springs into action again. Further, the system (100) simulates a virtual tap on the "Standard" ride option, followed by a tap on the “Confirm Booking” button. Further, this interaction confirms Y's selection and starts the booking process.
[0083] As the system (100) is configured to finalize the ride request, it processes the necessary information, securing the cab booking with the ride-hailing service. If payment authorization is required, the system (100) prompts Y to confirm the payment method. For example, the system (100) may ask, “How would you like to pay? Credit card, debit card, wallet, UPI, or cash?” After Y selects his preferred method, the transaction is completed seamlessly. Finally, the task performance unit (218) is configured to provide a confirmation notification to Y. The system (100) audibly states, “Your Standard cab to City Center Mall, Delhi has been successfully booked and will arrive in 5 minutes.” At the same time, the app updates Y’s screen with real-time details about the ride, such as the driver’s location, estimated arrival time, and car details, including the vehicle’s make and model, and provides Y with audible feedback of the same along with the received OTP. This ensures Y is well-informed and can track his ride in real time, providing peace of mind and convenience. The entire process, from initial voice command to booking confirmation, is completed quickly and efficiently, demonstrating how voice-assisted technology can enhance the user experience and simplify everyday tasks.
[0084] FIG. 4 illustrates a block diagram of an exemplary computer system (401) for performing the one or more tasks on the user device, in accordance with embodiments consistent with the present disclosure.
[0085] Variations of computer system (401) may be used for performing the one or more tasks on the user device. The computer system (401) may comprise a central processing unit (“CPU” or “processor”) (402). The processor (402) may comprise at least one data processor for executing program components for executing user or system generated requests. A user may include a person, a person using a device such as those included in this disclosure, or such a device itself. Additionally, the processor (402) may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, or the like. In various implementations the processor (402) may include a microprocessor, such as AMD Athlon, Duron or Opteron, ARM’s application, embedded or secure processors, IBM PowerPC, Intel’s Core, Itanium, Xeon, Celeron or other line of processors, for example. Accordingly, the processor (402) may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application-specific integrated circuits (ASICs), digital signal processors (DSPs), or Field Programmable Gate Arrays (FPGAs), for example.
[0086] Processor (402) may be disposed in communication with one or more input/output (I/O) devices via I/O interface (403). Accordingly, the I/O interface (403) may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMAX), or the like, for example.
[0087] Using the I/O interface (403), the computer system (401) may communicate with one or more I/O devices. For example, the input device (404) may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, or visors, for example. Likewise, an output device (405) may be a user’s smartphone, tablet, cell phone, laptop, printer, desktop computer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), or audio speaker, for example. In some embodiments, a transceiver (406) may be disposed in connection with the processor (402). The transceiver (406) may facilitate various types of wireless transmission or reception. For example, the transceiver (406) may include an antenna operatively connected to a transceiver chip (example devices include the Texas Instruments® WiLink WL1283, Broadcom® BCM4750IUB8, Infineon Technologies® X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), and/or 2G/3G/5G/6G HSDPA/HSUPA communications, for example.
[0088] In some embodiments, the processor (402) may be disposed in communication with a communication network (408) via a network interface (407). The network interface (407) is adapted to communicate with the communication network (408). The network interface (407), coupled to the processor (402), may be configured to facilitate communication between the system and one or more external devices or networks. The network interface (407) may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, or IEEE 802.11a/b/g/n/x, for example. The communication network (408) may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), or the Internet, for example. Using the network interface (407) and the communication network (408), the computer system (401) may communicate with devices such as a laptop (409) or a mobile/cellular phone (410). Other exemplary devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system (401) may itself embody one or more of these devices.
[0089] In some embodiments, the processor (402) may be disposed in communication with one or more memory devices (e.g., RAM 413, ROM 414, etc.) via a storage interface (412). The storage interface (412) may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, or solid-state drives, for example.
[0090] The memory devices may store a collection of program or database components, including, without limitation, an operating system (416), user interface application (417), web browser (418), mail client/server (419), and user/application data (421) (e.g., any data variables or data records discussed in this disclosure), for example. The operating system (416) may facilitate resource management and operation of the computer system (401). Examples of operating systems include, without limitation, Apple Macintosh OS X, UNIX, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like.
[0091] The user interface application (417) facilitates the display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, the user interface application (417) may provide computer interaction interface elements on a display system operatively connected to the computer system (401), such as cursors, icons, check boxes, menus, scrollers, windows, or widgets, for example. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems’ Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, or web interface libraries (e.g., ActiveX, Java, JavaScript, AJAX, HTML, Adobe Flash, etc.), for example.
[0092] In some embodiments, the computer system (401) may implement a web browser (418) stored program component. The web browser (418) may be a hypertext viewing application, such as Microsoft Internet Explorer, Google Chrome, Mozilla Firefox, Apple Safari, or Microsoft Edge, for example. Secure web browsing may be provided using HTTPS (secure hypertext transport protocol), secure sockets layer (SSL), Transport Layer Security (TLS), or the like. Web browsers may utilize facilities such as AJAX, DHTML, Adobe Flash, JavaScript, Java, or application programming interfaces (APIs), for example. In some embodiments the computer system (401) may implement a mail client/server (419) stored program component. The mail server (419) may be an Internet mail server such as Microsoft Exchange, or the like. The mail server may utilize facilities such as ASP, ActiveX, ANSI C++/C#, Microsoft .NET, CGI scripts, Java, JavaScript, PERL, PHP, Python, or WebObjects, for example. The mail server (419) may utilize communication protocols such as internet message access protocol (IMAP), messaging application programming interface (MAPI), Microsoft Exchange, post office protocol (POP), simple mail transfer protocol (SMTP), or the like. In some embodiments, the computer system (401) may implement a mail client (420) stored program component. The mail client (420) may be a mail viewing application, such as Apple Mail, Microsoft Entourage, Microsoft Outlook, or Mozilla Thunderbird.
[0093] In some embodiments, the computer system (401) may store user/application data (421), such as the data, variables, records, or the like as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase, for example. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, struct, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among the various computer systems discussed above in this disclosure. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
[0094] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present invention. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., to be non-transitory. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, non-volatile memory, hard drives, Compact Disc (CD) ROMs, Digital Video Discs (DVDs), flash drives, disks, and any other known physical storage media.
[0095] Various embodiments of the disclosure encompass numerous advantages, including methods and systems for performing the one or more tasks on the user device. The disclosed method and system have several technical advantages, including but not limited to the following:
• Hands-Free Operation and Accessibility: The method allows users to perform tasks without needing to physically interact with the device, promoting hands-free operation. This is especially beneficial for users with physical disabilities, those multitasking, or in situations where using the hands is impractical (e.g., while driving or cooking). It ensures accessibility to a wider range of users.

• Increased Task Efficiency: Voice-controlled tasks can be performed more quickly compared to traditional methods of manual interaction. Users can accomplish multiple tasks (like sending messages, booking a cab, or controlling smart devices) through simple voice commands, significantly improving overall task efficiency and saving time.

• Real-Time Feedback and Interaction: The method provides real-time feedback to the user, helping them refine their inputs or clarify their requests. This immediate response ensures users are guided in real-time, improving the interaction quality and reducing errors. It helps users stay in control, especially when dealing with complex tasks or multi-step processes.

• Context-Aware Task Execution: By identifying user intent from voice input and mapping it to specific tasks, the system is context-aware and can intelligently perform tasks based on the user's needs. This adaptability means the system can handle a diverse set of tasks ranging from sending messages to controlling smart devices based on the recognized intent.

• Accurate Voice-to-Text Transcription: The method ensures accurate transcription of voice input into text, improving the precision of task execution. The transcription is enhanced by eliminating blanks and unnecessary pauses, which reduces errors in interpreting user commands. This leads to higher accuracy in executing tasks based on voice input (a minimal illustrative sketch of such silence trimming follows this list).

• Virtual Gesture Integration for Seamless UI Navigation: The system generates virtual gestures based on identified user intent, allowing users to control the device interface without direct touch interaction. This integration makes navigating the UI smoother and more intuitive, especially when performing tasks like setting reminders, adjusting settings, or remotely controlling smart home devices.

• Multitasking and Sequential Task Management: The method supports sequential execution of tasks by performing multiple virtual gestures in a defined order. Users can complete complex, multi-step actions with just a few voice commands, reducing the need for manual, time-consuming navigation. This feature improves multitasking capabilities, allowing users to manage different tasks more effectively.

• Improved User Engagement and Experience: The combination of voice recognition, real-time feedback, and virtual gestures ensures an engaging and interactive user experience. The system's responsiveness to voice inputs and its ability to adapt to user needs leads to a smoother, more satisfying experience. This is particularly important for enhancing user satisfaction and encouraging prolonged device usage.

• Enhanced Accuracy Through Intent Recognition: By identifying specific intents from voice input, the system is able to execute more accurate and contextually relevant tasks. This minimizes the likelihood of errors or misinterpretations, ensuring that tasks like transferring money, booking a cab, or setting alarms are performed correctly according to the user's intent.

• Reduction in Manual Input Errors: The method leverages speech-to-text algorithms to transcribe voice inputs into textual data. This reduces the likelihood of errors that often occur with manual typing or tapping on a touchscreen, leading to more reliable task execution, particularly when performing complex tasks or entering detailed information.

• Increased Device Automation: The use of virtual gestures to control the device interface enables a high degree of automation in task execution. Virtual gestures can trigger a variety of functions on the device, making the device more autonomous in responding to voice commands and reducing the need for manual intervention.

• Personalized and Adaptive Task Management: The system's ability to process and recognize a wide range of voice inputs, along with its capacity for real-time feedback and adjustment, allows for more personalized and adaptive interactions. The device can learn the user's preferences over time, offering smarter suggestions or actions based on previous interactions, leading to an increasingly personalized experience.

• Improved Accessibility for Non-Technical Users: Users with limited technical knowledge or unfamiliarity with specific device interfaces can benefit from the intuitive nature of voice commands and virtual gestures. The method reduces the learning curve for complex features, making it easier for non-technical users to interact with the device, perform tasks, and access functionalities with minimal effort.

• Enhanced Privacy and Security: Because voice input can be used to perform tasks like transferring money, making calls, or controlling devices, the method can be designed to incorporate secure authentication protocols (e.g., voice recognition or PIN confirmation) to ensure that sensitive actions are performed only after confirming the user's identity, adding an extra layer of security to the system.

• Faster Response Time in Critical Tasks: The integration of voice commands and real-time feedback allows the system to quickly respond to time-sensitive requests (such as booking a taxi, setting an alarm, or sending urgent messages). This can improve overall user satisfaction, especially when the user is in a rush or needs to perform tasks quickly.

• Support for Multilingual and Multi-Tasking Environments: The method can be designed to support multiple languages, allowing users to interact with the device in their preferred language. Additionally, it can handle multiple tasks concurrently (such as setting reminders while ordering food or controlling smart devices), which is particularly valuable in environments where multitasking is essential.

• No External API Dependency: The method is configured to operate independently of traditional API integration, allowing for seamless interaction without relying on external systems.

• Humanoid gesture control: It can be understood as the process of mimicking human-like interactions on a user device's display through one or more virtual gestures. This enhances the ability to efficiently navigate and interact with the interface, ultimately leading to the successful completion of tasks.
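As a purely illustrative aid to the transcription-accuracy point above, and not the claimed implementation, the sketch below shows one common way to drop silent (blank) portions from captured audio using a simple energy threshold before transcription, while retaining the timestamps of the voiced segments so the resulting text can be time-aligned. The frame length, threshold value, and function names are assumptions for this sketch.

```python
# Illustrative sketch: energy-based removal of blank (silent) portions from a
# voice input prior to transcription, keeping segment timestamps so the
# transcript can later be time-aligned. Threshold and frame length are assumptions.
import numpy as np

def find_voice_segments(samples: np.ndarray, sample_rate: int,
                        frame_ms: int = 30, energy_threshold: float = 0.01):
    """Return (start_sec, end_sec) pairs of regions whose RMS energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    segments, start = [], None
    for i in range(0, len(samples) - frame_len, frame_len):
        frame = samples[i:i + frame_len]
        is_voiced = np.sqrt(np.mean(frame ** 2)) > energy_threshold
        t = i / sample_rate
        if is_voiced and start is None:
            start = t
        elif not is_voiced and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(samples) / sample_rate))
    return segments

def trim_silence(samples: np.ndarray, sample_rate: int):
    """Concatenate only the voiced segments; also return their original timestamps."""
    segments = find_voice_segments(samples, sample_rate)
    voiced = [samples[int(s * sample_rate):int(e * sample_rate)] for s, e in segments]
    return (np.concatenate(voiced) if voiced else samples[:0]), segments

# Example: 1 second of silence followed by 1 second of a 440 Hz tone.
if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    audio = np.concatenate([np.zeros(sr), 0.3 * np.sin(2 * np.pi * 440 * t)])
    trimmed, segs = trim_silence(audio, sr)
    print(f"kept {len(trimmed)/sr:.2f} s of voiced audio; segments: {segs}")
```

The returned segment timestamps can then be paired with the transcribed text to produce a time-aligned transcript of the kind referred to in the claims.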
[0096] The system empowers users to navigate using voice commands, without the need for direct screen interaction. It helps bridge the limitations of traditional voice interfaces, which often rely on restricted APIs that prevent end-to-end control and completion of tasks within apps. By leveraging a unique, human-like control mechanism to perform virtual gestures on-screen, the system enables truly hands-free, end-to-end functionality, allowing users to perform tasks like investing, booking cabs, or transferring money efficiently and intuitively, enhancing both accessibility and user experience.
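For illustration only, one way on-screen virtual gestures of this general kind can be exercised during development is through the Android Debug Bridge's input command. The sketch below assumes a device reachable via adb and uses arbitrary example coordinates; it is not the human-like control mechanism claimed in this disclosure.

```python
# Illustrative sketch: dispatching simple virtual gestures (tap, swipe) to an
# Android device over adb during development. Coordinates are arbitrary
# examples; this is not the claimed control mechanism itself.
import subprocess

def adb(*args: str) -> None:
    """Run an adb shell command; assumes adb is installed and a device is connected."""
    subprocess.run(["adb", "shell", *args], check=True)

def virtual_tap(x: int, y: int) -> None:
    adb("input", "tap", str(x), str(y))

def virtual_swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> None:
    adb("input", "swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

if __name__ == "__main__":
    virtual_tap(540, 1200)               # e.g., tap a "Standard" ride option
    virtual_swipe(540, 1600, 540, 800)   # e.g., scroll a results list upward
```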
[0097] In summary, these technical advantages solve the technical problem of enabling fully end-to-end, voice-controlled interactions within mobile apps or websites without requiring API access or special permissions, thereby addressing the challenges associated with traditional voice interface methods such as limited integration capabilities and restricted task completion. Additionally, these advantages contribute to a more accessible, user-friendly experience, allowing seamless transaction processing, enhanced navigation, and efficient task completion across various applications with greater ease and autonomy.
[0098] The claimed invention of a method and system for performing the one or more tasks on the user device involves tangible components, processes, and functionalities that interact to achieve specific technical outcomes. The system integrates various elements such as processors, memory, databases, transcription, virtual gestures, the one or more tasks, user intent identification, and the provision of real-time feedback to users through the user device.
[0099] Furthermore, the invention involves a non-trivial combination of technologies and methodologies that provide a technical solution to a technical problem. While individual components like processors, databases, transcription, and virtual gestures are well known in the field of computer science, their integration into a comprehensive system for performing the one or more tasks on the user device brings about an improvement and technical advancement in the field of voice assistance and other related environments.
[00100] In light of the above-mentioned advantages and the technical advancements provided by the disclosed method and system for performing the one or more tasks on the user device, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps provide solutions to the existing problems in conventional technologies. Further, the claimed steps clearly bring an improvement in the functioning of the device itself, as the claimed steps provide a technical solution to a technical problem.
[00101] The present disclosure may be realized in hardware, or a combination of hardware and software. The present disclosure may be realized in a centralized fashion, in at least one computer system, or in a distributed fashion, where different elements may be spread across several interconnected computer systems. A computer system or other apparatus adapted for carrying out the methods described herein may be suited. A combination of hardware and software may be a general-purpose computer system with a computer program that, when loaded and executed, may control the computer system such that it carries out the methods described herein. The present disclosure may be realized in hardware that comprises a portion of an integrated circuit that also performs other functions.
[00102] A person with ordinary skills in the art will appreciate that the systems, modules, and sub-modules have been illustrated and explained to serve as examples and should not be considered limiting in any manner. It will be further appreciated that the variants of the above disclosed system elements, modules, and other features and functions, or alternatives thereof, may be combined to create other different systems or applications.
[00103] Those skilled in the art will appreciate that any of the aforementioned steps and/or system modules may be suitably replaced, reordered, or removed, and additional steps and/or system modules may be inserted, depending on the needs of a particular application. In addition, the systems of the aforementioned embodiments may be implemented using a wide variety of suitable processes and system modules, and are not limited to any particular computer hardware, software, middleware, firmware, microcode, and the like. The claims can encompass embodiments for hardware and software, or a combination thereof.
[00104] While the present disclosure has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted without departing from the scope of the present disclosure. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present disclosure without departing from its scope. Therefore, it is intended that the present disclosure is not limited to the particular embodiment disclosed, but that the present disclosure will include all embodiments falling within the scope of the appended claims.
Claims:WE CLAIM:
1. A method (300) for performing one or more tasks on a user device, the method (300) comprising:
receiving (302), via a processor (202) coupled with the user device, one or more voice inputs;
processing (304), via the processor (202), the one or more voice inputs to identify one or more voice activities from the one or more voice inputs;
transcribing (306), via the processor (202), the one or more voice activities into one or more textual input;
identifying (308), via the processor (202), one or more intent from the one or more textual input;
generating (310), via the processor (202), one or more virtual gestures based on the identified one or more intent; and
performing (312), via the processor (202), the one or more tasks on the user device by utilizing the one or more virtual gestures.

2. The method (300) as claimed in claim 1, wherein the processing (304) of the one or more voice inputs corresponds to removing blank portions from the one or more voice inputs.

3. The method (300) as claimed in claim 1, further comprising providing real-time feedback to one or more users, wherein the real-time feedback is provided to the one or more users after identifying the one or more intent, wherein the real-time feedback is provided to the one or more users while performing the one or more tasks on the user device, and wherein the real-time feedback corresponds to a prompt for receiving additional one or more voice inputs.

4. The method (300) as claimed in claim 3, wherein the generation of the one or more virtual gestures is performed after receiving the additional one or more voice inputs, and wherein the one or more virtual gestures correspond to one or more control events to be performed on a user interface (UI) of the user device.

5. The method (300) as claimed in claim 1, wherein the one or more voice inputs are captured using a microphone coupled with the user device.

6. The method (300) as claimed in claim 1, wherein the one or more tasks correspond to booking a cab, transferring money, ordering food, investing in stocks or gold, remotely controlling smart home devices, setting alarms or timers, sending messages or making calls, accessing calendar events, or a combination thereof.

7. The method (300) as claimed in claim 1, wherein the processing (304) of the one or more voice inputs comprises a time alignment, after transcribing the one or more voice activities into the one or more textual input.

8. The method (300) as claimed in claim 7, wherein the time alignment corresponds to mapping the one or more textual input to a corresponding timestamp derived from time intervals of the one or more voice activities to generate a time-aligned transcript.

9. The method (300) as claimed in claim 1, wherein performing (312) the one or more tasks corresponds to a sequential execution of the one or more virtual gestures on a user interface (UI) associated with the user device.

10. The method (300) as claimed in claim 1, wherein the transcribing (306) is performed by utilizing at least one of DeepSpeech, Wav2vec, Kaldi, or SpeechBrain speech-to-text algorithms.

11. A system (100) for performing one or more tasks on a user device, the system (100) comprises:
a processor (202), and
a memory (204) communicatively coupled with the processor (202), wherein the memory (204) is configured to store one or more executable instructions, which cause the processor (202) to:
receive one or more voice inputs from the user device;
process the one or more voice inputs to identify one or more voice activities from the one or more voice inputs;
transcribe the one or more voice activities into one or more textual input;
identify one or more intent from the one or more textual input;
generate one or more virtual gestures based on the identified one or more intent; and
perform the one or more tasks on the user device by utilizing the one or more virtual gestures.

12. A non-transitory computer-readable storage medium having stored thereon, a set of computer-executable instructions causing a computer comprising one or more processors to perform steps comprising:
receiving, from the user device, one or more voice inputs;
processing the one or more voice inputs to identify one or more voice activities from the one or more voice inputs;
transcribing the one or more voice activities into one or more textual input;
identifying one or more intent from the one or more textual input;
generating one or more virtual gestures based on the identified one or more intent; and
performing the one or more tasks on the user device by utilizing the one or more virtual gestures.
Dated this 19th Day of February 2025

PRIYANK GUPTA
IN/PA- 1454
AGENT FOR THE APPLICANT

Documents

Application Documents

# Name Date
1 202511014271-STATEMENT OF UNDERTAKING (FORM 3) [19-02-2025(online)].pdf 2025-02-19
2 202511014271-STARTUP [19-02-2025(online)].pdf 2025-02-19
3 202511014271-REQUEST FOR EARLY PUBLICATION(FORM-9) [19-02-2025(online)].pdf 2025-02-19
4 202511014271-FORM28 [19-02-2025(online)].pdf 2025-02-19
5 202511014271-FORM-9 [19-02-2025(online)].pdf 2025-02-19
6 202511014271-FORM FOR STARTUP [19-02-2025(online)].pdf 2025-02-19
7 202511014271-FORM FOR SMALL ENTITY(FORM-28) [19-02-2025(online)].pdf 2025-02-19
8 202511014271-FORM 18A [19-02-2025(online)].pdf 2025-02-19
9 202511014271-FORM 1 [19-02-2025(online)].pdf 2025-02-19
10 202511014271-FIGURE OF ABSTRACT [19-02-2025(online)].pdf 2025-02-19
11 202511014271-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [19-02-2025(online)].pdf 2025-02-19
12 202511014271-EVIDENCE FOR REGISTRATION UNDER SSI [19-02-2025(online)].pdf 2025-02-19
13 202511014271-DRAWINGS [19-02-2025(online)].pdf 2025-02-19
14 202511014271-DECLARATION OF INVENTORSHIP (FORM 5) [19-02-2025(online)].pdf 2025-02-19
15 202511014271-COMPLETE SPECIFICATION [19-02-2025(online)].pdf 2025-02-19
16 202511014271-Proof of Right [06-03-2025(online)].pdf 2025-03-06
17 202511014271-FORM-26 [06-03-2025(online)].pdf 2025-03-06
18 202511014271-FORM 3 [05-06-2025(online)].pdf 2025-06-05