
System And Method For Voice Assisted Digital Payment Application Through Deep Learning

Abstract: Disclosed herein is a system and method for voice assisted digital payment application using hybrid deep learning technique, specially developed for uneducated, less tech-savvy, and visually impaired users who face challenges with existing digital payment services. The system employs a microphone (100), a signal conditioner (200), an analogue-to-digital converter (ADC) (300), a server (400), an application programming interface (API) (500), a speaker (600), and a display (700). The voice commands of a user are received by the microphone and converted from analogue signals to digital signals. The API (500) is adapted to wirelessly transmit the digital signals of the ADC (300) to the server (400) and therefrom receive signal processing outcomes to be delivered through the speaker (600) or the display (700) in real time. The server (400) hosts an ECAPA-TDNN (emphasized channel attention, propagation, and aggregation time delay neural network) model trained for verifying whether the user is a real enrolled user or a fraudster, a Wav2Vec2 model trained for verifying whether the voice is biological or electronic (spoofed), and an NLP (natural language processing) tool trained for interpreting the intent/name/entity present in the voice commands and giving command execution signals to banking servers (800) via the API (500). Then, the necessary banking transaction is performed via the corresponding banking servers (800) with voice alerts. Fig. 1


Patent Information

Application #: 202521079132
Filing Date: 20 August 2025
Publication Number: 39/2025
Publication Type: INA
Invention Field: COMPUTER SCIENCE
Status:
Parent Application:

Applicants

ANURAG SINGH
IIIT-Naya Raipur, Chhattisgarh-493661, India
SOHEL PRATAP SINGH
IIIT-Naya Raipur, Chhattisgarh-493661, India
ABHIJEET AGRAWAL
IIIT-Naya Raipur, Chhattisgarh-493661, India
ABHISHEK SHARMA
IIIT-Naya Raipur, Chhattisgarh-493661, India
SRINIVASA K G
IIIT-Naya Raipur, Chhattisgarh-493661, India
SHRIVISHAL TRIPATHI
IIIT-Naya Raipur, Chhattisgarh-493661, India
SANTOSH KUMAR
IIIT-Naya Raipur, Chhattisgarh-493661, India
AMIT AGRAWAL
IIIT-Naya Raipur, Chhattisgarh-493661, India
SRESHA YADAV NEE GHOSH
IIIT-Naya Raipur, Chhattisgarh-493661, India

Inventors

1. ANURAG SINGH
IIIT-Naya Raipur, Chhattisgarh-493661, India
2. SOHEL PRATAP SINGH
IIIT-Naya Raipur, Chhattisgarh-493661, India
3. ABHIJEET AGRAWAL
IIIT-Naya Raipur, Chhattisgarh-493661, India
4. ABHISHEK SHARMA
IIIT-Naya Raipur, Chhattisgarh-493661, India
5. SRINIVASA K G
IIIT-Naya Raipur, Chhattisgarh-493661, India
6. SHRIVISHAL TRIPATHI
IIIT-Naya Raipur, Chhattisgarh-493661, India
7. SANTOSH KUMAR
IIIT-Naya Raipur, Chhattisgarh-493661, India
8. AMIT AGRAWAL
IIIT-Naya Raipur, Chhattisgarh-493661, India
9. SRESHA YADAV NEE GHOSH
IIIT-Naya Raipur, Chhattisgarh-493661, India

Specification

Description:

FIELD OF THE INVENTION
The present invention broadly relates to digital payment applications. More particularly, the present invention relates to a system and method for voice assisted digital payment application using hybrid deep learning technique, specially developed for uneducated, less tech-savvy, and visually impaired users who face challenges with existing digital payment services.

BACKGROUND OF THE INVENTION
A digital payment is a method of transferring money electronically using digital devices over the internet or mobile networks, eliminating the need for physical cash or cheques. Digital payments are processed through websites, mobile apps, or even at physical stores using POS terminals. There are various modes of digital payment services, including mobile wallets (Paytm, Google Pay, PhonePe, Amazon Pay, Apple Pay, etc.), UPI (Unified Payments Interface), net banking, debit/credit card swiping/tapping, and payment gateways (Razorpay, PayU, etc.). All these payment modes require the users to visually check, read, type text/numbers, select from an option list, or scan a QR code to perform various banking transactions or related operations.

While digital payments are becoming more accessible, in some regions of India people are still reluctant to use digital payment services (mobile apps) due to concerns about security, lack of trust in digital transactions, and lack of awareness and understanding about how these payment applications work, especially in rural areas. Additionally, Indian language variability poses significant challenges in the adoption of such payment apps, since English is usually used on digital platforms, which creates a barrier for users who are not proficient in that language. Further, the existing digital payment applications are quite difficult for uneducated and disabled (visually impaired) users, as they can neither read nor type/select correctly on such platforms. Therefore, there is a need to develop an alternative solution to improve user convenience for those who are not comfortable with reading/typing/selection-based payment platforms.

Mobile payment apps offer convenience but come with inherent security risks. These include phishing scams, malware infections, and the potential for unauthorized access if a device is lost or stolen. Sometimes the user credentials are decoded/decrypted by cyber fraudsters to steal money from innocent users through online transactions. Therefore, it is required to incorporate into digital payment platforms unique biometric security features that are impossible to tamper with.

A reference may be made to Indian patent application number 202211065006 that discloses a digital payment system having a point-of-sale device integrated with a camera to verify the face of payer followed by OTP authentication while doing any transaction.

Preethika Kambampati et al. reported a platform using voice recognition to enable payment transactions, wherein NLP, speech recognition, face recognition, and fingerprint recognition techniques are used to provide secure user authentication across various payment devices. Its primary objective is to leverage AI to streamline payment processes, minimize human intervention, and enhance digital payment accessibility for individuals with disabilities.

Although the application of neural networks (AI/ML) in digital payments has recently increased, the existing AI/ML models very often struggle with accuracy and generalization across diverse datasets and signal processing conditions. Issues associated with data collection, inherent biases in datasets, and hardware-software compatibility pose significant challenges. The baseline models usually face risks of overfitting or underfitting during the training phase due to lack of optimization. To address these challenges, it becomes necessary to develop an advanced hybrid deep learning approach that can accurately classify speaker identity based on voice signals in real time, making it a reliable tool for digital payment applications. Moreover, it is desired to devise a mobile or web-based system and method for voice assisted digital payment application, which includes all the advantages of the conventional/existing techniques/methodologies and overcomes their deficiencies.

OBJECT OF THE INVENTION
It is an object of the present invention to design a reliable, cost-effective, and user-friendly voice-operated digital payment application platform in the form of a mobile/web App, especially for uneducated, less tech-savvy, and visually impaired users who face challenges with existing digital payment services.

It is another object of the present invention to develop and train advanced (hybrid) deep learning architectures on both English and Indian languages to accurately identify speaker voices (spoken in different languages) and distinguish speaker voices from spoofed/fake voices and electronic voices, thus verifying the user authenticity based on voice biometric before executing each payment related transaction.

It is one more object of the present invention to trigger voice alerts to the users for every debit transaction.

It is a further object of the present invention to devise a system and method for voice assisted digital payment application using hybrid deep learning technique.

SUMMARY OF THE INVENTION
In one aspect, the present invention provides a system for voice assisted digital payment application using hybrid deep learning technique, specially developed for uneducated, less tech-savvy, and visually impaired users who face challenges with existing digital payment services. The system employs a microphone, a signal conditioner, an analogue-to-digital converter (ADC), a server, an application programming interface (API), a speaker, and a display. The voice commands of a user are received by the microphone and converted from analogue signals to digital signals. The API is adapted to wirelessly transmit the digital signals of the ADC to the server and therefrom receive signal processing outcomes to be delivered through the speaker or the display in real time. The server hosts an ECAPA-TDNN (emphasized channel attention, propagation, and aggregation time delay neural network) model, a Wav2Vec2 model, an NLP (natural language processing) tool, and a database. The ECAPA-TDNN is trained to extract mel-filter bank energies associated with user unique vocal trait embeddings therefrom; classify the extracted vocal trait embeddings into real user class or fake user class; compare the extracted vocal trait embeddings with reference vocal embeddings in an enrolled user list as stored in a database through cosine similarity; and verify user identity based on the classification and comparison results. The Wav2Vec2 model is trained to extract raw waveform associated with voice liveness features therefrom, classify the extracted voice liveness features into biological voice class or electronic voice class, and verify user voice liveness based on the classification results. The NLP tool is trained to convert the voice commands into text format upon successful verification of the user identity and voice liveness; process the text through normalization, tokenization, and lemmatization; extract therefrom features associated with intent, name, and entity present in the voice commands; and give command execution signals to banking servers via the API based on the extracted intent, name, and entity features. Then, the necessary banking transaction is performed via the corresponding banking servers upon receipt of the command execution signals. The API is configured to generate a voice alert before and after every debit transaction in real time.

Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description, which delineates the present invention in different embodiments.

BRIEF DESCRIPTION OF DRAWINGS
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying figures.

Fig. 1 is a schematic diagram illustrating key hardware components of the system for voice assisted digital payment application, in accordance with an embodiment of the present invention.

Fig. 2 is a block diagram illustrating various modules embedded in the system hardware components, in accordance with an embodiment of the present invention.

Fig. 3 illustrates method steps for voice assisted digital payment application, in accordance with an embodiment of the present invention.

Fig. 4 illustrates first deep learning model architecture (ECAPA-TDNN) as employed in the system and method, in accordance with an embodiment of the present invention.

Fig. 5 illustrates second deep learning model architecture (Wav2Vec2) as employed in the system and method, in accordance with an embodiment of the present invention.

Fig. 6 illustrates NLP tool architecture as employed in the system and method, in accordance with an embodiment of the present invention.

Fig. 7 illustrates enrolment steps as configured in API employed in the system and method, in accordance with an embodiment of the present invention.

Fig. 8 illustrates payment transaction steps as configured in API employed in the system and method, in accordance with an embodiment of the present invention.

Fig. 9 illustrates an exemplary mobile deployment of the proposed payment application (Voice-Pay), in accordance with an embodiment of the present invention.

Fig. 10 illustrates an exemplary web deployment of the proposed payment application (Voice-Pay), in accordance with an embodiment of the present invention.

List of reference numerals
100 microphone
200 signal conditioner
300 analogue-to-digital converter (ADC)
400 cloud server
402 identity verification module (ECAPA-TDNN)
404 voice verification module (Wav2Vec2)
406 voice command interpretation module (NLP tool)
408 database
500 application programming interface (API)
600 speaker
700 display
800 banking server

DETAILED DESCRIPTION OF THE INVENTION
Various embodiments described herein are intended only for illustrative purposes and subject to many variations. It is understood that various omissions and substitutions of equivalents are contemplated as circumstances may suggest or render expedient, but are intended to cover the application or implementation without departing from the scope of the present invention. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.

The use of the terms “including,” “comprising,” or “having” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. The terms “an” and “a” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items.

In accordance with an embodiment of the present invention, as shown in Figs. 1-2, the system for voice assisted digital payment application is depicted. The system comprises a microphone (100), a signal conditioner (200), an analogue-to-digital converter (ADC) (300), a server (400), an application programming interface (API) (500), a speaker (600), and a display (700). The API (500) is a mobile or web App that communicates with the ADC (300), the server (400), and banking/payment servers (800). The server (400) has hosted therein an identity verification module (402), a voice verification module (404), a voice command interpretation module (406), and a database (408) linked with the banking servers (800). The identity verification module (402) deploys a first deep learning (ECAPA-TDNN) model trained for checking the identity of the user (speaker). The voice verification module (404) deploys a second deep learning (Wav2Vec2) model trained for checking the liveness of the user (speaker). The voice command interpretation module (406) deploys a natural language processing (NLP) tool for understanding the intention of the user (what he/she wants to do). The database (408) stores an enrolled speaker list with credential data and voice samples, mapped with the corresponding banking server (800) database.

In accordance with an embodiment of the present invention, the microphone (100) acts as a transducer, converting acoustic pressure waves from voice commands of the users (speakers) into corresponding analogue electrical signals. It serves as the initial interface between the speaker and the electronic system, enabling the acquisition of spoken commands.

In accordance with an embodiment of the present invention, the raw analogue signals require preprocessing/conditioning. The signal conditioner (200) boosts/amplifies the microphone's low-voltage signals to a voltage range appropriate for the ADC (300) without introducing significant noise or distortion. In particular, it filters out unwanted frequencies such as DC offset or high-frequency noise and allows only the speech-relevant frequency band (typically 300 Hz to 3400 Hz) to pass, improving clarity and system efficiency. It also helps in matching the output impedance of the microphone (100) with the input impedance of the ADC (300) to avoid signal loss and maximize energy transfer.
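
By way of illustration only, the band-limiting behaviour of the signal conditioner (200) can be sketched in Python as below; the SciPy library, the fourth-order Butterworth design, and the 16 kHz sampling rate are assumptions made for the example and are not mandated by the invention.

import numpy as np
from scipy.signal import butter, sosfilt

def speech_bandpass(samples, fs=16000, low_hz=300.0, high_hz=3400.0, order=4):
    """Pass only the speech-relevant band (300-3400 Hz), suppressing DC offset and high-frequency noise."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, samples)

# Example: condition one second of simulated microphone output.
raw = np.random.randn(16000)        # stand-in for the raw microphone samples
conditioned = speech_bandpass(raw)  # band-limited signal handed on to the ADC stage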

In accordance with an embodiment of the present invention, the ADC (300) transforms the conditioned/pre-processed analogue signal into a digital signal by sampling it at regular intervals and quantizing its amplitude values. This digital representation is required for further processing within digital systems. The resolution (e.g., 12-bit, 16-bit) and sampling rate (e.g., 16 kHz, 44.1 kHz) of the ADC (300) significantly influence the quality and precision of the captured voice signal. The digital signals are stored in a local memory and transferred to a processor for further processing.
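
For illustration, the sampling and quantization performed by the ADC (300) may be mimicked as in the following Python sketch (using NumPy); the 16-bit resolution and 16 kHz rate are merely the example values named above.

import numpy as np

def quantize(signal, n_bits=16):
    """Quantize a normalized signal (values in [-1, 1]) to signed integer codes, as an n-bit ADC would."""
    max_code = 2 ** (n_bits - 1) - 1
    return np.round(np.clip(signal, -1.0, 1.0) * max_code).astype(np.int16)

fs = 16000                                    # example sampling rate in Hz
t = np.arange(0, 0.01, 1.0 / fs)              # 10 ms worth of sampling instants
analogue = 0.5 * np.sin(2 * np.pi * 440 * t)  # stand-in for the conditioned voice signal
digital = quantize(analogue)                  # 16-bit digital representation sent onward for processing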

In accordance with an embodiment of the present invention, the API (500) serves as an intermediary between the front end and the back end of the system. Through it, interactions with cloud services are established for real-time processing, authentication, or data storage. The API (500) is adapted to wirelessly transmit the digital signals of the ADC (300) to the server (400) and therefrom receive signal processing outcomes to be delivered through the speaker (600) or the display (700) in real time.
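
As a non-limiting illustration of this transmit/receive role, the API's interaction with the server could be sketched in Python as below; the endpoint URL, the JSON field names, and the use of the requests library are hypothetical choices made only for the example.

import base64
import requests

SERVER_URL = "https://voicepay.example.com/api/v1/voice-command"  # hypothetical server endpoint

def send_voice_command(pcm_bytes: bytes, user_id: str) -> dict:
    """Transmit the digitized voice command to the server and return the signal processing outcome."""
    payload = {
        "user_id": user_id,
        "audio_b64": base64.b64encode(pcm_bytes).decode("ascii"),
        "sample_rate": 16000,
    }
    response = requests.post(SERVER_URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()  # e.g. verification result, interpreted intent, and alert text for the speaker/display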

In accordance with an embodiment of the present invention, the microphone (100), the signal conditioner (200), the ADC (300), the API (500), the speaker (600), and the display (700) are integrated in a single IoT-enabled computing device such as a smartphone, tablet, laptop, etc.

In accordance with an embodiment of the present invention as shown in Fig. 7, the speakers need to enrol/register their details, including voice samples, for the first time and create login credentials via the API (500). The speaker details include one or more pieces of information selected from speaker name, mobile number, email id, AADHAR number, PAN number, bank account details, etc. The login credentials may include user id/mobile number/email id with password or OTP (one time password). The login credentials may be in the form of a spoken word/phrase. The users can log in to the API either by typing the login credentials or by uttering the spoken word/phrase.

In accordance with an embodiment of the present invention, the speaker's uttered word/phrase and voice command need to be checked to determine whether the voice belongs to the original/real enrolled speaker or to a fraudster, based on which the speaker identity is authenticated. Further, it is essentially required to check whether the voice originates from a biological source (i.e., the original/real enrolled human speaker) or from any electronic means (recorded voice, AI-generated voice, or spoofed voice). The user's speech serves as a command to trigger various actions such as scanning a QR code, paying a contact, viewing transaction history, or checking account balance. By simply speaking, the users can perform these tasks hands-free, making payments, reviewing past transactions, or checking their current balance, offering a fast and convenient banking experience through voice interaction.

As shown in Fig. 8, the API (500) transmits the digital signals of the speaker voice commands from the ADC (300) to the server (400), where the identity of the speaker and the liveness of the voice are verified using dedicated deep learning models. If verification fails, voice alerts are triggered through the speaker (600) via the API (500). Only after successful verification of the speaker voice identity and liveness are the NLP tools deployed on the voice/speech commands to understand the speaker intention embedded behind the voice commands. Accordingly, the API (500) gets feedback from the server (400) and gives voice command execution signals to the banking server (800) to perform the desired payment/banking related transaction by retrieving the user mapped data from the database (408). Simultaneously, the user gets a voice alert with OTP before the debit happens. Once the user gives an 'approve' voice prompt and/or types the OTP, the debit (payment) happens and another voice alert is triggered regarding the payment confirmation status (i.e., paid or failed with reason).

In accordance with an embodiment of the present invention as shown in Fig. 3, the method for voice assisted digital payment application is depicted. The method comprises steps of: receiving (S1) analogue electrical signals of voice commands of a speaker through a microphone (100);
converting (S2) the analogue signals into digital signals by an analogue-to-digital converter (ADC) (300);
configuring (S3) an application programming interface (API) (500) to wirelessly transmit the digital signals of the ADC (300) to a server (400), receive therefrom signal processing outcomes, and deliver the outcomes through a speaker (600) or a display (700) in real time;
feeding (S4) the digital signals into a first deep learning (ECAPA-TDNN) model (402) hosted in the server (400) to extract mel-filter bank energies associated with speaker unique vocal trait embeddings therefrom, classify the extracted vocal trait embeddings into real user class or fake user class; compare the extracted vocal trait embeddings with reference vocal embeddings in an enrolled user list as stored in a database (408) through cosine similarity, and verify user identity based on the classification and comparison results;
passing (S5) the digital signals into a second deep learning (Wav2Vec2) model (404) hosted in the server (400) to extract raw waveform associated with voice liveness features therefrom, classify the extracted voice liveness features into biological voice class or electronic voice class, and verify user voice liveness based on the classification results;
deploying (S6) a natural language processing (NLP) tool (406) upon successful verification of the user identity and voice liveness to convert the voice commands into text format; process the text through normalization, tokenization, and lemmatization; extract therefrom features associated with intent, name, and entity present in the voice commands; and give command execution signals to banking servers (800) via the API (500) based on the extracted intent, name, and entity features;
performing (S7) payment transaction via the corresponding banking servers (800) upon receipt of the command execution signals; and
generating (S8) a voice alert before and after every debit transaction by the API in real time.

In accordance with an embodiment of the present invention as shown in Fig. 4, the first deep learning model is an emphasized channel attention, propagation, and aggregation time delay neural network (ECAPA-TDNN) architecture comprising a 1D CNN (one dimensional convolutional) layer with 512 filters, 5 kernel size and 1 dilation rate followed by rectified linear unit (ReLU) activation and batch normalization (BN); three SE-Res2Net (Squeeze-and-Excitation Res2Net with residual connections) layers with kernel size 3 and dilation rates 2, 3, and 4 respectively; a feature unification 1D CNN layer with 512 filters, 5 kernel size and 1 dilation rate followed by rectified linear unit (ReLU) activation; an ASP (attentive statistical pooling) layer followed by batch normalization (BN); a fully connected (FC) layer followed by batch normalization (BN); and an output layer with sigmoid activation.

The ECAPA-TDNN enhances the traditional TDNN-based x-vector architecture by incorporating channel attention mechanisms, multi-scale temporal modelling, and residual learning, resulting in more robust and generalizable speaker representations. The model starts by extracting 80-dimensional log Mel-filter bank energies from each speech utterance (voice command digital signals) using a 25 msec Hamming window and 10 msec frame shift. These features are effective in capturing perceptual spectral properties of speech and are widely used in speaker recognition systems. The input to the network is an 80×T matrix, where T denotes the number of frames. Standard mean and variance normalization is applied across time. The Mel-filter bank energies are sequentially passed through the 1D CNN layer, SE-Res2Net layers, feature unification 1D CNN layer, ASP layer, and FC layer to extract speaker unique vocal trait embeddings.
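
For illustration, such 80-dimensional log Mel-filter bank features could be computed as in the following Python sketch; the librosa library and the 16 kHz rate are assumptions for the example, while the 25 ms Hamming window and 10 ms shift follow the description above.

import numpy as np
import librosa

def log_mel_features(waveform: np.ndarray, fs: int = 16000) -> np.ndarray:
    """Return an 80 x T matrix of mean/variance normalized log Mel-filter bank energies."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=fs,
        n_fft=int(0.025 * fs),       # 25 ms analysis window
        hop_length=int(0.010 * fs),  # 10 ms frame shift
        window="hamming", n_mels=80,
    )
    log_mel = librosa.power_to_db(mel)
    # Standard per-utterance mean and variance normalization across time.
    return (log_mel - log_mel.mean(axis=1, keepdims=True)) / (log_mel.std(axis=1, keepdims=True) + 1e-8)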

The ReLU activation allows the network to model non-linear dependencies while maintaining computational efficiency and sparsity, as expressed in equation 1.
ReLU(x) = max(0,x) equation 1
The BN stabilizes and accelerates training by normalizing the activations of each layer.

The SE-Res2 blocks split the input channels into groups, apply convolutions at different dilation rates, and aggregate the results, enabling the network to capture both short- and long-term temporal dependencies. A Squeeze-and-Excitation (SE) module learns channel-wise weights to emphasize informative features. Residual connections are used to ease gradient flow and preserve information. The number of output channels increases progressively, reaching 1536. The outputs from the SE-Res2 blocks are concatenated and passed through a 1×1 convolution (pointwise projection layer) to unify the representation back to 1536 channels.
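
By way of a simplified illustration (not the exact blocks used in the claimed network), the channel-wise re-weighting performed by an SE module can be sketched in PyTorch as follows; the bottleneck width of 128 is an assumed example value.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation over the channel dimension of a (batch, channels, time) tensor."""
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(channels, bottleneck)
        self.fc2 = nn.Linear(bottleneck, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=2)                                     # squeeze: global average over time
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # excitation: channel weights in [0, 1]
        return x * w.unsqueeze(2)                             # emphasize informative channels

x = torch.randn(4, 512, 200)   # batch of 4 utterances, 512 channels, 200 frames
y = SEBlock(512)(x)            # same shape, with channels re-weighted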

To aggregate the variable-length temporal features into a fixed-size vector, the model employs the Attentive Statistics Pooling (ASP) layer. This layer learns attention weights α_t∈[0,1] for each frame t, and computes a weighted mean (μ) and weighted standard deviation (σ) across time as expressed in equation 2:
μ = ∑_(t=1)^T α_t h_t ,   σ = √( ∑_(t=1)^T α_t (h_t − μ)^2 )        equation 2

where h_t is the feature vector at time t, and ∑_(t=1)^T α_t = 1.
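
A minimal PyTorch sketch of this attentive statistics pooling (simplified relative to the full ECAPA-TDNN implementation; the 128-unit attention bottleneck is an assumed example) is shown below.

import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Pool a (batch, channels, time) sequence into a weighted mean and standard deviation (equation 2)."""
    def __init__(self, channels: int, attention_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, attention_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attention_dim, channels, kernel_size=1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.attention(h), dim=2)   # attention weights summing to 1 over time
        mu = (alpha * h).sum(dim=2)                       # weighted mean
        var = (alpha * (h - mu.unsqueeze(2)) ** 2).sum(dim=2)
        sigma = var.clamp(min=1e-8).sqrt()                # weighted standard deviation
        return torch.cat([mu, sigma], dim=1)              # (batch, 2 * channels)

h = torch.randn(4, 1536, 200)            # 1536-channel frame-level features
pooled = AttentiveStatsPooling(1536)(h)  # shape (4, 3072): mean and standard deviation per channel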

This yields a 3072-dimensional output (1536 for the mean and 1536 for the standard deviation), encoding both the central tendency and variation in the speaker’s voice. The model finally outputs the 192-dimensional speaker embeddings (associated with temporal contextual features, phonetic and prosodic pattern, and jitter and shimmer of audio signal). The Sigmoid activation function, as expressed in equation 3, outputs a probability indicating whether the extracted embeddings belong to the same speaker.
σ(z) = 1/(1+e^(-z) ) equation 3
In other words, the sigmoid activation is used to classify the extracted vocal trait embeddings into real speaker class or fake speaker class.

To cross-check classification accuracy, the extracted vocal trait embeddings are further compared with reference vocal embeddings in the enrolled speaker list as stored in the database (408) through cosine similarity, as expressed in equation 4.
cosine_sim(x_1, x_2) = (x_1 · x_2)/(||x_1|| · ||x_2||)        equation 4

The outputs of equations 3 and 4 are validated against a predefined threshold to verify speaker identity based on the classification and comparison results. If both the probability value and the cosine similarity value exceed the threshold value (for example > 0.5), then the voice is considered to be from the real speaker. On the other hand, if the probability value or the cosine similarity value is equal to or less than the threshold value (for example ≤ 0.5), then the voice is considered to be from a fake speaker.
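
For illustration, the combined decision rule could look like the following Python sketch; the 0.5 threshold is merely the example value stated above, and the embeddings are random stand-ins.

import numpy as np

def cosine_similarity(x1: np.ndarray, x2: np.ndarray) -> float:
    """Equation 4: cosine similarity between a test embedding and a stored reference embedding."""
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def verify_identity(probability: float, test_emb: np.ndarray,
                    enrolled_emb: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the speaker only if both the classifier probability and the similarity exceed the threshold."""
    return probability > threshold and cosine_similarity(test_emb, enrolled_emb) > threshold

test = np.random.randn(192)                       # 192-dimensional embedding of the incoming voice
enrolled = test + 0.05 * np.random.randn(192)     # stand-in for the enrolled reference embedding
accepted = verify_identity(0.93, test, enrolled)  # True: treated as the real enrolled speaker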

In accordance with an embodiment of the present invention as shown in Fig. 5, the second deep learning model is a Wav2Vec2 architecture comprising a seven-layer 1D CNN (one dimensional convolutional) feature encoder (with kernel sizes 10, 3, 3, 3, 3, 2, 2 for the seven layers respectively), a twelve-layer transformer encoder having an attention mechanism, two feedforward (FF) layers, an ASP (attentive statistical pooling) layer, a feedforward (FF) linear layer, and an output layer with sigmoid activation. The raw waveforms from the digital signals are passed through the 1D CNN layers, transformer layers, feedforward layers, and the ASP layer to extract voice liveness features (whether the voice is spoken by a real user or synthesized/mimicked/recorded by some computing/AI machine). The CNN layers extract low-level acoustic features associated with energy patterns, phase distortion, and fine temporal structure, whose outputs are projected from 512-dimensional to 768-dimensional vectors. The transformer encoder captures temporal context and long-range dependencies, and detects repetitive or unnatural prosody in synthetic voices, missing coarticulation effects, flat intonation, or abrupt transitions. The outputs from the transformer encoder layers are combined using trainable attention weights, from which it is learned which layers are most informative for spoofed voice detection. The FF layers capture artifacts in frequency bands and over-smoothed or over-pronounced phonemes in synthetic speech. The ASP layer converts variable-length sequences into fixed-length embeddings using attention to weigh frames with spoof-specific anomalies (e.g., sudden energy bursts, unnatural pauses). The FF linear layer projects the embeddings from 768 to 256 dimensions. Then, the sigmoid activation is performed using equation 3 to classify the extracted voice source features into biological voice class or electronic voice class, and the speaker voice is verified based on the classification results. If the probability value (output of the sigmoid activation) exceeds a threshold value (for example > 0.5), then the voice is considered biological. On the other hand, if the probability value is equal to or less than the threshold value (for example ≤ 0.5), then the voice is considered electronic.
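
Purely as an illustration of obtaining 768-dimensional frame features of this kind in practice, the following Python sketch uses a publicly available pretrained Wav2Vec2 encoder from the Hugging Face transformers library; the chosen checkpoint and the simplified mean-pooled classifier head are assumptions for the example and are not the trained anti-spoofing model described above.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # assumed public checkpoint
encoder.eval()

class SpoofHead(nn.Module):
    """Toy anti-spoofing head: mean-pool the frame features, then output a 'biological voice' probability."""
    def __init__(self, in_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(frames.mean(dim=1)))

waveform = torch.randn(1, 16000)                  # one second of 16 kHz audio (stand-in)
with torch.no_grad():
    frames = encoder(waveform).last_hidden_state  # (1, T, 768) frame-level features
probability = SpoofHead()(frames)                 # > 0.5 treated as biological, otherwise electronic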

Further, both the models (ECAPA-TDNN and Wav2Vec2) are trained using both English and Indian language (Hindi) spoken speech/voice databases for 10 epochs with an 80:20 train-test split, and their training is optimized using the Binary Cross-Entropy (BCE) loss function to improve their performance.
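
A minimal PyTorch training-loop sketch following this BCE optimization is given below; the placeholder tensors stand in for the actual speech datasets, and the small classifier head is illustrative only.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder data: 1000 utterance-level embeddings with binary labels (real/fake or biological/electronic).
features = torch.randn(1000, 192)
labels = torch.randint(0, 2, (1000, 1)).float()
train_set, test_set = random_split(TensorDataset(features, labels), [800, 200])  # 80:20 train-test split

model = nn.Sequential(nn.Linear(192, 64), nn.ReLU(), nn.Linear(64, 1))  # placeholder classifier head
criterion = nn.BCEWithLogitsLoss()                  # binary cross-entropy loss with built-in sigmoid
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):                             # 10 epochs, as stated above
    for x, y in DataLoader(train_set, batch_size=32, shuffle=True):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()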

The models' performance is evaluated in terms of equal error rate (EER), training accuracy, and validation accuracy. It is observed that the primitive model shows 2.6%-3.0% EER, whereas the proposed ECAPA-TDNN model shows 2.1% EER, indicating a high level of accuracy in distinguishing between speakers' voices. The primitive model shows 88% training accuracy, whereas the proposed ECAPA-TDNN model shows 92%-94% training/validation accuracy, confirming good generalization without overfitting. Also, the d-vector embeddings generated by the proposed ECAPA-TDNN model maintain low intra-speaker distance and high inter-speaker separation, demonstrating effective speaker discrimination. Similarly, the proposed Wav2Vec2 model achieves a training accuracy of 98.5% and a test accuracy of 95%, indicating that it effectively distinguishes biological voices from electronic/spoofed voices.

In accordance with an embodiment of the present invention as shown in Fig. 6, the voice command interpretation module (406) deploys a natural language processing (NLP) tool upon successful verification of the speaker identity and voice liveness to convert the voice commands into text format; process the text through normalization, tokenization, and lemmatization; extract therefrom features associated with intent, name, and entity present in the voice commands; and give command execution signals to the banking servers (800) via the API (500) based on the extracted intent, name, and entity features.

The voice commands (acoustic digital signals) are converted into text (an actionable instruction sentence) using "Speech to Text" (STT). During normalization, all the text characters are converted to lowercase, and unwanted punctuation, special characters, and stop words (if needed) are removed. Then tokenization is applied to split the sentences into individual words or meaningful segments (e.g., "send", "₹500", "to", "ramesh", "via", "UPI"). Then lemmatization or stemming is used to reduce words to their root forms (e.g., "sending" → "send"). Then the speaker's intention behind the spoken command is recognized using a model like SVM, RNNs, BERT, GPT, or LLaMA, which are trained on vast amounts of human language to detect context, meaning, and user intent with high accuracy. For example, if the speaker says, "Send ₹1,000 to Rahul," the model understands that the intent is to transfer money. Similarly, if the speaker says, "Check my balance," the model identifies the intent as a balance inquiry. Commands like "Show my last transactions" trigger the intent to view transaction history, while "Scan this QR code" is recognized as the intent to initiate a QR-based payment. Once the intent is recognized, the next step pulls out specific pieces of information (called names and entities) from the spoken command. For example, if the speaker says, "Send ₹1,000 to Rahul," the model understands "₹1,000" as the amount to be debited (entity) and "Rahul" as the recipient (name). Other entities might include dates, account types, phone numbers, or keywords like "last week" or "savings account". The common entity types used in banking are monetary amounts (₹500, $100, etc.), account types (savings, current), payment methods (UPI, credit card), and dates (e.g., for transaction history).
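
For illustration only, a toy rule-based version of this intent and entity extraction is sketched below in Python; the regular-expression patterns and intent labels are assumptions chosen to cover the example commands above and do not represent the trained NLP models themselves.

import re

INTENT_PATTERNS = {
    "transfer_money": re.compile(r"\b(send|pay|transfer)\b", re.IGNORECASE),
    "check_balance": re.compile(r"\bbalance\b", re.IGNORECASE),
    "view_history": re.compile(r"\b(transactions|history)\b", re.IGNORECASE),
    "scan_qr": re.compile(r"\bqr\b", re.IGNORECASE),
}

def interpret(command: str) -> dict:
    """Return the recognized intent, the amount entity, and the recipient name from a transcribed command."""
    intent = next((name for name, pattern in INTENT_PATTERNS.items() if pattern.search(command)), "unknown")
    amount = re.search(r"(?:₹|rs\.?\s*)([\d,]+)", command, re.IGNORECASE)
    recipient = re.search(r"\bto\s+([A-Za-z]+(?:\s+[A-Za-z]+)?)", command)
    return {
        "intent": intent,
        "amount": amount.group(1).replace(",", "") if amount else None,
        "recipient": recipient.group(1) if recipient else None,
    }

print(interpret("Send ₹1,000 to Rahul"))
# {'intent': 'transfer_money', 'amount': '1000', 'recipient': 'Rahul'}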

Once the intent, name, and entity of the voice commands are understood by the NLP models, the command execution signals are transmitted to the corresponding banking server via the API, and the banking server performs all the necessary security checking (authentication) by comparing the retrieved data (e.g., user name, date of birth, PAN, AADHAR, UPI/account number, beneficiary details, mobile number, etc.) with the data stored in the banking database. If the said data security checking (authentication) is validated, then the necessary transaction (e.g., payment of a specific amount to a specific person stored in the phone contact list) is initiated. At the same time, a voice feedback/alert is triggered by the API using "Text to Speech" (TTS) to ensure the transaction information is affirmed back to the user. This feature facilitates a smooth and adequate voice-based interface, improving accessibility, usability, and accuracy, especially for users with low literacy or technical skills.

In an example, prior to a financial transaction, a voice-based debit alert is activated by the NLP model to inform the user of the deduction from their account. The alert provides important transaction information like amount, recipient's name, and the purpose of transfer. Through synthesized speech, the alert serves as a security checkpoint, enabling the user to audit and authorize the transaction by uttering a voice confirmation, typically a clear "Yes" or "No" response. If the user accepts (e.g., "Yes"), the transaction proceeds to completion. If the user rejects or does not respond, the transaction is cancelled or kept on hold. This mechanism enhances transactional transparency, fraud prevention, and user control, ensuring that funds are only transferred with explicit user consent.
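
As a sketch of such a spoken debit alert and confirmation step, the Python example below uses the pyttsx3 text-to-speech package as an assumed TTS engine and treats the user's reply as text already produced by the speech-to-text stage.

import pyttsx3

def confirm_debit(amount: str, recipient: str, recognized_reply: str) -> bool:
    """Speak the debit alert and approve the transaction only on an explicit 'yes' from the user."""
    engine = pyttsx3.init()
    engine.say(f"You are about to send {amount} rupees to {recipient}. Say yes to approve or no to cancel.")
    engine.runAndWait()
    return recognized_reply.strip().lower() == "yes"

# 'recognized_reply' stands in for the transcribed spoken response of the user.
if confirm_debit("1000", "Rahul", recognized_reply="yes"):
    print("Transaction approved; proceeding to payment.")
else:
    print("Transaction cancelled or kept on hold.")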

In another example, while making a payment through a QR code, scanning the code directly takes the user to the transaction page. In such cases, a fraudster might claim that the QR code leads to a lottery prize or reward. To help prevent such scams, the user should also receive an immediate debit alert after the transaction. For added user security, an additional voice alert is triggered during any concurrent call session. The real-time alert serves as a proactive warning to inform the user of possible fraudulent transactions. Further, the same alerts/feedbacks are simultaneously shown in visual formats on the display in real time for regular user convenience.

Referring to Fig. 9, a mobile application (named the 'Voice Pay' mobile App) is developed using the proposed models and API. After a successful login, the home screen of the Voice Pay mobile application greets the user by name and provides the primary functional options: Make Payment, Check Balance, Add Bank, and History. At this stage, the user initiates interaction through a voice command, enabling hands-free navigation within the application. Once the user speaks a command, the user identity and the voice liveness are verified, and the intent and entity are interpreted from the command for further processing. For example, the user speaks the command, "Send ₹100 to Parv Patidar". The application processes the spoken input using Natural Language Processing (NLP) techniques to extract both intent and entities from the command. Here, the intent is identified as making a payment, while the entities include the recipient name ("Parv Patidar") and the transaction amount (₹100). After the NLP processing, the app displays the beneficiary details, including the recipient's name and mobile number, and provides a voice-based alert stating that the specified amount is about to be transferred. This ensures the user is fully aware of the transaction details. The user is then prompted to provide confirmation, either through a voice command or by manual input, before the payment process proceeds to the next stage. Once the user provides confirmation, the request is forwarded to the NPCI (National Payments Corporation of India) server, which handles the interbank transaction processing. During this stage, a short buffering or processing period occurs while the NPCI system validates, authorizes, and executes the payment. Upon successful processing, the app displays a "Payment Successful" message along with confirmation details, assuring the user that the transaction has been securely completed.

Referring to Fig. 10, a web application (named the 'Voice Pay' web App) is developed using the proposed models and API. The login screen of the Voice Pay web application prompts the user to sign in with their email address and password. Security features highlighted include Voice Biometrics for identity verification, Advanced Security with bank-grade encryption, Multi-Factor Authentication for layered security, and Fraud Detection with real-time transaction monitoring. Further, a QR code authentication screen enables secure login through double authentication. The users can open the Voice Pay mobile app, access the QR scanner, and scan the code displayed on the web interface. Upon a successful scan, the application grants login access without requiring manual credential entry. The QR code has a time limit for added security. The user dashboard of the Voice Pay web application is displayed after login. It greets the user by name and shows account details such as account number and current balance. The interface provides quick-access options including Send Money, Bank Statements, Notifications, and Security, along with navigation options like Home, Transactions, Settings, and Logout in the sidebar. A similar procedure is then followed as in the mobile App.

The foregoing descriptions of exemplary embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiment was chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable persons skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is understood that various omissions and substitutions of equivalents are contemplated as circumstances may suggest or render expedient, but are intended to cover the application or implementation without departing from the scope of the claims of the present invention.

Claims:

We claim:

1. A system for voice assisted digital payment application, the system comprises:
a microphone (100) adapted to receive and convert voice commands of a user into analogue electrical signals;
a signal conditioner (200) adapted to filter and amplify the analogue electrical signals;
an analogue-to-digital converter (ADC) (300) adapted to convert the conditioned analogue electrical signals into digital signals;
a server (400) having hosted therein an identity verification module (402), a voice verification module (404), a voice command interpretation module (406), and a database (408) linked with banking servers (800); and
an application programming interface (API) (500) adapted to wirelessly transmit the digital signals of the ADC to the server (400) and therefrom receive signal processing outcomes to be delivered through a speaker (600) or a display (700) in real time;
wherein the identity verification module (402) deploys a first deep learning model trained to capture mel-filter bank energies from the digital signals; pass them through one dimensional convolutional (1D CNN) block, squeeze-and-excitation Res2Net (SE-Res2Net) block, attentive statistical pooling (ASP) block, and fully connected block to extract user unique vocal trait embeddings, perform sigmoid activation to classify the extracted vocal trait embeddings into real user class or fake user class; compare the extracted vocal trait embeddings with reference vocal embeddings in an enrolled user list as stored in the database (408) through cosine similarity; and verify user identity based on the classification and comparison results;
wherein the voice verification module (404) deploys a second deep learning model trained to capture raw waveform from the digital signals, pass them through one dimensional convolutional (1D CNN) block, transformer block, feed forward block, and attentive statistical pooling (ASP) block to extract voice liveness features, perform sigmoid activation to classify the extracted voice liveness features into biological voice class or electronic voice class; and verify user voice liveness based on the classification results;
wherein the voice command interpretation module (406) deploys a natural language processing (NLP) tool upon successful verification of the user identity and voice liveness to convert the voice commands into text format; process the text through normalization, tokenization, and lemmatization; extract therefrom features associated with intent, name, and entity present in the voice commands; and give command execution signals to the banking servers (800) via the API (500) based on the extracted intent, name, and entity features.

2. The system as claimed in claim 1, wherein the application programming interface (API) is configured for one-time enrolment of user details including voice samples to be mapped with corresponding banking server database.

3. The system as claimed in claim 1, wherein the application programming interface (API) is configured to generate a voice alert before and after every debit transaction in real time.

4. The system as claimed in claim 1, wherein the first deep learning model is an emphasized channel attention, propagation, and aggregation time delay neural network (ECAPA-TDNN) architecture comprising an initial 1D CNN layer with 512 filters, 5 kernel size and 1 dilation rate followed by rectified linear unit activation and batch normalization; three SE-Res2Net layers with kernel size 3 and dilation rates 2, 3, and 4 respectively; a feature unification 1D CNN layer with 512 filters, 5 kernel size and 1 dilation rate followed by rectified linear unit activation; an ASP layer followed by batch normalization; a fully connected layer followed by batch normalization, and an output layer with sigmoid activation.

5. The system as claimed in claim 1, wherein the second deep learning model is a Wav2Vec2 architecture comprising a seven 1D CNN layered feature encoder, a twelve layered transformer encoder having attention mechanism, two feedforward layers, an ASP layer, a feedforward linear layer, and an output layer with sigmoid activation.

6. The system as claimed in claim 1, wherein the deep learning models are optimized through binary cross entropy loss function.

7. The system as claimed in claim 1, wherein the NLP tool includes Speech-to-text (STT) and text-to-speech (TTS) models.

8. The system as claimed in claim 1, wherein the system is deployable in form of a mobile App or a web App.

9. A method for voice assisted digital payment application, the method comprises steps of:
receiving (S1) analogue electrical signals of voice commands of a user through a microphone (100);
converting (S2) the analogue signals into digital signals by an analogue-to-digital converter (ADC) (300);
configuring (S3) an application programming interface (API) (500) to wirelessly transmit the digital signals of the ADC (300) to a server (400), receive therefrom signal processing outcomes, and deliver the outcomes through a speaker (600) or a display (700) in real time;
feeding (S4) the digital signals into an ECAPA-TDNN model hosted in the server (400) to extract mel-filter bank energies associated with user unique vocal trait embeddings therefrom, classify the extracted vocal trait embeddings into real user class or fake user class; compare the extracted vocal trait embeddings with reference vocal embeddings in an enrolled user list as stored in a database (408) through cosine similarity, and verify user identity based on the classification and comparison results;
passing (S5) the digital signals into a Wav2Vec2 model hosted in the server (400) to extract raw waveform associated with voice liveness features therefrom, classify the extracted voice liveness features into biological voice class or electronic voice class, and verify user voice liveness based on the classification results;
deploying (S6) a natural language processing (NLP) tool upon successful verification of the user identity and voice to convert the voice commands into text format; process the text through normalization, tokenization, and lemmatization; extract therefrom features associated with intent, name, and entity present in the voice commands; and give command execution signals to banking servers (800) via the API (500) based on the extracted intent, name, and entity features;
performing (S7) payment transaction via the corresponding banking servers (800) upon receipt of the command execution signals; and
generating (S8) a voice alert before and after every debit transaction by the API in real time.

Documents

Application Documents

# Name Date
1 202521079132-FORM 1 [20-08-2025(online)].pdf 2025-08-20
2 202521079132-DRAWINGS [20-08-2025(online)].pdf 2025-08-20
3 202521079132-COMPLETE SPECIFICATION [20-08-2025(online)].pdf 2025-08-20
4 Abstract.jpg 2025-09-11
5 202521079132-FORM-9 [15-09-2025(online)].pdf 2025-09-15
6 202521079132-FORM-26 [15-09-2025(online)].pdf 2025-09-15
7 202521079132-FORM 3 [15-09-2025(online)].pdf 2025-09-15
8 202521079132-FORM 18A [24-10-2025(online)].pdf 2025-10-24