
System/Method To Voicechat Using Deep Learning

Abstract: The proposed invention offers users the capability to convert text-based messages into personalized audio messages, delivered in the sender's own voice or a chosen voice. By seamlessly integrating AI algorithms to replicate vocal patterns and nuances, the application captures the authenticity of the sender's or selected voice. Leveraging advancements in neural networks and speech processing, this research aims to develop a robust and versatile voice cloning system. This invention enhances emotional expression and overall communication quality. The application balances data privacy with advanced features, positioning itself at the forefront of technological innovation. The proposed invention aims to bridge the gap between text and voice, transforming digital interactions into a more personal, engaging, and accessible experience. 3 Claims & 1 Figure


Patent Information

Application #: 202341078533
Filing Date: 18 November 2023
Publication Number: 52/2023
Publication Type: INA
Invention Field: COMPUTER SCIENCE

Applicants

MLR Institute of Technology
Laxman Reddy Avenue, Dundigal – 500 043

Inventors

1. Mr. K. Laxman Vikas
Department of Computer Science and Engineering – Artificial Intelligence and Machine Learning, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043
2. Mr. Rayudu Yogeshwar
Department of Computer Science and Engineering – Artificial Intelligence and Machine Learning, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043
3. Mr. A. Pavan Kalyan
Department of Computer Science and Engineering – Artificial Intelligence and Machine Learning, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043
4. Mr. Md. Ikramuddin
Department of Computer Science and Engineering – Artificial Intelligence and Machine Learning, MLR Institute of Technology, Laxman Reddy Avenue, Dundigal – 500 043

Specification

Description

Field of the Invention
The proposed invention relates to the field of Artificial Intelligence, and specifically Deep Learning. The project involves the development of an application that uses voice cloning techniques and deep learning to replicate the sender's voice and convert text messages into speech.
Background of the Invention
The invention of an app that converts text messages into spoken words in the sender's voice using voice cloning technology is rooted in several technological advancements. Over the past decade, there have been significant breakthroughs in deep learning, specifically in the fields of natural language processing (NLP) and speech synthesis. These advancements have paved the way for the development of applications that can replicate human speech patterns and generate realistic synthetic voices.
The evolution of deep learning techniques, particularly recurrent neural networks (RNNs) and convolutional neural networks (CNNs), has enabled more sophisticated analysis and synthesis of audio data. This progress has led to the creation of neural text-to-speech (TTS) models that can generate human-like speech from text inputs. These models can capture intonations, accents, and other vocal nuances, making them well-suited for applications such as the proposed app.
Moreover, the concept of personalization and user experience enhancement has gained traction in the tech industry. As people increasingly rely on messaging apps for communication, the idea of adding an auditory layer to text conversations has emerged. Voice cloning technology can bridge the gap between written and spoken communication, offering a unique and engaging way for users to interact with their messages. Overall, the background of this invention is a convergence of advancements in deep learning, speech synthesis, and user experience design, all culminating in the creation of an app that introduces a new dimension to text-based communication by seamlessly blending synthesized speech with the sender's unique voice characteristics.
For instance, US11380327B2 provides a system and method that aims to improve speech communication between humans and machines using natural language processing, voice cloning, and an intervention-willingness probability. The system can recognize speech intentions, generate cloned audio, and provide guidance suggestions to ensure that an order is completed. The patent also includes a quality-inspection module that inspects speaking speed and content and feeds the inspection result back to the human agent in real time. The system and method offer several improvements to speech communication, though implementing them may involve limitations and drawbacks.
Similarly, US10772871B2 describes a system and method for generating a personalized voice assistant. The invention aims to provide a voice assistant that can be customized to sound like a specific person, such as a celebrity or a loved one. The system uses a database of audio recordings to create a voice model that can be used to generate new speech. The system can also use text-to-speech synthesis to generate speech in real-time. The patent includes detailed descriptions of the various components of the system, including the voice model generation module, the speech synthesis module, and the user interface. The invention has potential applications in a variety of fields, including entertainment, education, and healthcare. The personalized voice assistant could be used to provide a more engaging and interactive experience for users and could also be used to assist people with disabilities.
US10096319B1 describes a system that uses voice recognition technology to determine the physical and emotional characteristics of users. The system receives voice input from a user and processes the voice data with signal-processing algorithms to determine the user's real-time traits. It then generates data tags corresponding to those traits and uses them to select candidate audio content, which is presented via a speaker device. Certain embodiments include voice assistants that process voice or speech and/or determine its meaning. Notably, the patent does not address potential privacy concerns associated with using voice recognition to infer physical and emotional characteristics.
PCT/JP2018/007086 describes a learning device, learning method, voice synthesis device, and voice synthesis method that provide information via voice in a way users can easily understand. The technology performs voice recognition, estimates statuses, and learns voice-synthesis data so that synthesized speech can be generated according to those statuses. The invention aims to solve the problem of monotonous voice quality and tone in synthesized speech, and can be used in home agent devices, smartphones, and other electronic devices to provide various types of information via voice.
US4692941 is a real-time text-to-speech conversion system that uses a microcomputer-compatible, software-based time-domain methodology. The system handles an unlimited vocabulary with minimal hardware requirements and includes an exception dictionary for words not found in the main dictionary. It can be customized for different languages or accents, and its components include a phoneme-and-transition table, a prosody evaluator, and a codes-to-indices converter. The system's real-time capabilities make it well suited to applications that require fast, accurate speech synthesis, such as reading aids for the visually impaired and automated telephone systems.
The invention introduces a novel approach to real-time speech synthesis using compact waveforms and software-based techniques. It eliminates the robotic sound of early devices, enabling a natural pitch variation in the synthesized voice. It identifies clauses in text sentences and converts them into prosody data for intonation. Words are parsed and matched against a pronunciation dictionary, and if not found, pronunciation rules are applied. Speech segments are generated through lookup tables and digitally encoded waveforms. The technique adjusts pitch by altering waveform length digitally. Segments are phased to minimize discontinuities during concatenation, and transitions are interpolated for memory efficiency. Overall, the invention achieves real-time, high-quality text-to-speech conversion with minimal hardware requirements.
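As a rough illustration of the dictionary-with-fallback lookup pattern described above, the following Python sketch tries an exception dictionary first and falls back to letter-to-sound rules. The entries and rules here are illustrative placeholders, not the patent's actual tables.

# Sketch of the pronunciation-lookup pattern: exception dictionary first,
# then letter-to-sound fallback rules. All entries are illustrative.
EXCEPTION_DICT = {
    "one": ["W", "AH", "N"],                 # irregular spellings live here
    "colonel": ["K", "ER", "N", "AH", "L"],
}

# Naive one-letter-per-phoneme fallback (real systems use context-sensitive rules).
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "h": "HH",
    "l": "L", "n": "N", "o": "OW", "t": "T", "w": "W",
}

def to_phonemes(word: str) -> list[str]:
    word = word.lower()
    if word in EXCEPTION_DICT:               # dictionary hit: stored pronunciation
        return EXCEPTION_DICT[word]
    # Fallback: apply letter-to-sound rules letter by letter.
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]

print(to_phonemes("one"))   # ['W', 'AH', 'N']  (exception dictionary)
print(to_phonemes("ban"))   # ['B', 'AE', 'N']  (fallback rules)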

Summary of the Invention
The invention is a mobile application that integrates cutting-edge voice cloning technology with messaging functionality. It enables users to convert text-based chats into audio messages delivered in the sender's own voice. This innovation addresses the limitations of traditional text messaging by infusing conversations with a personalized and emotionally rich dimension. By seamlessly replicating voices and providing customization options, the app enhances user engagement and emotional expression. It promotes accessibility for visually impaired individuals and aligns with the evolving landscape of digital communication. The project leverages advancements in AI, voice cloning, and mobile app development to create a revolutionary tool that transforms how people connect and converse in the digital age.
Brief Description of Drawings
The invention will be described in detail with reference to the exemplary embodiments shown in the figures, wherein:
Figure-1: Flowgorithm representing the process flow of VoiceChat app.

Detailed Description of the Invention
The invention is a revolutionary mobile application that transforms text-based messages into audio messages using advanced voice cloning technology. When a user composes a text message within the app and selects the recipient, they can choose to utilize the voice cloning feature. The process begins with the app analysing existing voice recordings of the sender to capture their distinct vocal patterns, intonations, and nuances. These acoustic features are then fed into sophisticated AI algorithms, which generate a synthetic voice that closely emulates the sender's authentic voice. This synthesized voice is seamlessly applied to the text message, resulting in an audio message that retains the emotional depth and personal touch of the original speaker.
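As a concrete illustration of this message flow, the following Python sketch uses the open-source Coqui TTS library to clone a sender's voice from a short reference recording and render a text message as audio. The model name, file paths, and sample text are assumptions for illustration; the specification does not name a particular library.

# Prototype of the message flow described above using Coqui TTS (pip install TTS).
# Model name, file paths, and message text are illustrative assumptions.
from TTS.api import TTS

# Load a multi-speaker model that supports zero-shot voice cloning from a
# short reference recording of the sender.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the sender's voice and speak the text message as an audio file.
tts.tts_to_file(
    text="Hey, running ten minutes late, see you soon!",
    speaker_wav="sender_reference.wav",   # existing recording of the sender
    language="en",
    file_path="audio_message.wav",        # synthesized audio message
)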
Voice cloning using deep learning is an advanced technique that involves creating highly realistic replicas of human voices by harnessing the capabilities of deep neural networks. This technology has evolved significantly in recent years, fuelled by the progress in deep learning algorithms and the availability of vast amounts of audio data for training. The process begins with the collection of a diverse dataset comprising numerous hours of speech recordings from the target person. These recordings capture a wide range of linguistic variations, emotional expressions, and speaking styles. This dataset serves as the foundation for training deep learning models, such as recurrent neural networks (RNNs) or more specialized architectures like WaveNet and Tacotron. During training, the selected neural network architecture learns to transform the acoustic features of speech into a format that can be efficiently learned and reproduced. This often involves converting raw audio data into spectrogram representations, which provide a visual representation of the frequency and amplitude components of sound over time. The model learns to associate different linguistic and acoustic elements with specific patterns in the spectrogram.
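The raw-audio-to-spectrogram step can be sketched with librosa; the frame sizes and 80 mel bands below follow common Tacotron-style settings and are assumptions rather than values taken from the specification.

# Sketch of the audio-to-spectrogram conversion described above
# (pip install librosa). Settings are common defaults, not prescribed values.
import librosa

y, sr = librosa.load("training_clip.wav", sr=22050)   # resample to a fixed rate

# 80-band mel spectrogram: frequency/amplitude content of the voice over time.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)   # log scale, as typically fed to TTS models

print(log_mel.shape)   # (80, num_frames)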
The training process requires significant computational power and time to fine-tune the model's parameters, enabling it to capture intricate details of the target voice. As the training progresses, the model refines its ability to replicate the person's speech patterns, accents, prosody, and other characteristics that make their voice unique. Once the model is trained, it can generate spectrogram representations for new text inputs. These representations are then converted back into raw audio using techniques like Griffin-Lim reconstruction or WaveGAN synthesis. The result is a synthesized voice that sounds remarkably similar to the original person, capable of articulating any given text in a way that mimics their natural speaking style.
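The inverse step, turning a (possibly model-predicted) mel spectrogram back into a waveform via Griffin-Lim phase reconstruction, might look like the following librosa sketch; in practice a neural vocoder would usually replace this step for higher quality.

# Sketch of spectrogram-to-waveform reconstruction via Griffin-Lim
# (librosa runs it inside mel_to_audio). For illustration, the mel
# spectrogram is extracted from real audio rather than predicted by a model.
import librosa
import soundfile as sf   # pip install soundfile

y, sr = librosa.load("training_clip.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert the mel spectrogram back to a waveform (Griffin-Lim phase estimation).
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60
)
sf.write("reconstructed.wav", audio, sr)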
To enhance personalization, the app offers customization options that allow users to fine-tune the cloned voice. Parameters such as pitch, tone, and speed can be adjusted, enabling users to match the audio output to their preferences while ensuring fidelity to the original voice. The app also ensures data privacy by employing robust encryption methods to safeguard voice recordings and sensitive information. Consent is sought from users before their voices are used for cloning, respecting ethical and legal considerations. As users engage in conversations, the app continually refines its AI models based on user feedback and ratings regarding the accuracy of the cloned voice. This iterative improvement process ensures that the synthesized voices become increasingly indistinguishable from the real ones over time. By catering to those seeking to express themselves more authentically, the app bridges the gap between text and spoken communication. The synthesized voice imbues the conversations with emotions, nuances, and subtleties, revolutionizing the way people connect and engage through digital messaging. Through seamless integration of voice cloning, AI algorithms, and a user-friendly interface, the invention offers a transformative communication experience that is poised to redefine digital conversations for a more personal and expressive future.
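The pitch and speed customization knobs could be prototyped as a simple post-processing pass over the synthesized audio, as in this minimal librosa sketch; the parameter values are illustrative defaults, not the app's actual settings.

# Sketch of the pitch/speed customization described above, applied as
# post-processing on the synthesized message. Values are illustrative.
import librosa
import soundfile as sf

y, sr = librosa.load("audio_message.wav", sr=None)       # synthesized message

y = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)   # raise pitch 2 semitones
y = librosa.effects.time_stretch(y, rate=1.1)            # speak ~10% faster

sf.write("audio_message_custom.wav", y, sr)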

Advantages of the Proposed Model
The proposed model offers several distinct advantages that set it apart from traditional text-based messaging applications:
1. Personalization and Authenticity:
The model enables users to send and receive messages in their own voice or the sender's voice, creating a more authentic and personal communication experience that goes beyond plain text.
2. Enhanced Communication:
Voice-based communication offers a more natural and intuitive means of interaction compared to typing, allowing for faster and more fluid conversations.
3. Emotional Expression:
By capturing vocal nuances and inflections, the model allows users to convey emotions and nuances more effectively, adding depth and emotional richness to their conversations.
4. Revolutionized Accessibility:
The model transforms the way users interact with digital messages, enabling them to listen to messages in situations where reading text might be inconvenient or impossible, such as while driving.
5. Richer Context:
Audio messages provide context beyond words, helping to prevent misunderstandings that can arise from text-based communication lacking tone and intonation.
6. Entertainment and Engagement:
The model offers a fun and engaging way to communicate, making conversations more interactive and enjoyable for users.
In essence, the proposed model presents a paradigm shift in digital communication, enriching conversations with personalization, emotion, and accessibility while harnessing the power of advanced AI-driven voice cloning technology.
Claims
The scope of the invention is defined by the following claims:
1. The system/method to voicechat using deep learning comprising:
a) a voice cloning technology to convert text-based messages into authentic and personalized audio messages, wherein an accurate voice replication model asserts the ability to accurately replicate the sender's voice or a chosen voice using advanced AI algorithms, ensuring that the synthesized voice closely resembles the original speaker's vocal characteristics;
b) an enhanced communication efficiency model that asserts that audio messages facilitate more efficient and natural communication than text-based messaging, resulting in faster and smoother interactions;
c) a contextual enrichment model that provides richer context to conversations by adding vocal tone and intonation, reducing the likelihood of misunderstandings that can arise from purely text-based communication; and
d) an improved user engagement model that asserts that audio messages enhance user engagement and interaction, leading to more dynamic and enjoyable conversations.

2. According to claim 1, the continuous enhancement through user feedback model actively seeks user feedback and refines the AI models to enhance voice cloning accuracy and overall user satisfaction.

3. According to claim 1, a unique market positioning is achieved through the distinct feature of voice-cloned chats, contributing to the system's appeal and potential for widespread adoption.

Documents

Application Documents

# Name Date
1 202341078533-REQUEST FOR EARLY PUBLICATION(FORM-9) [18-11-2023(online)].pdf 2023-11-18
2 202341078533-FORM-9 [18-11-2023(online)].pdf 2023-11-18
3 202341078533-FORM FOR STARTUP [18-11-2023(online)].pdf 2023-11-18
4 202341078533-FORM FOR SMALL ENTITY(FORM-28) [18-11-2023(online)].pdf 2023-11-18
5 202341078533-FORM 1 [18-11-2023(online)].pdf 2023-11-18
6 202341078533-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [18-11-2023(online)].pdf 2023-11-18
7 202341078533-EVIDENCE FOR REGISTRATION UNDER SSI [18-11-2023(online)].pdf 2023-11-18
8 202341078533-EDUCATIONAL INSTITUTION(S) [18-11-2023(online)].pdf 2023-11-18
9 202341078533-DRAWINGS [18-11-2023(online)].pdf 2023-11-18
10 202341078533-COMPLETE SPECIFICATION [18-11-2023(online)].pdf 2023-11-18