Abstract: REAL TIME DEEPFAKE AUDIO DETECTION SYSTEM The disclosure pertains to a system and method for the detection of deepfake audio content through the differentiation of authentic and altered audio recordings. Said system employs machine learning algorithms, including a Convolutional Neural Network (CNN) model, to convert audio data into image form and analyze said images for characteristics indicative of audio authenticity. Through said system, variations in speech attributes such as pitch, tone, and cadence are assessed, allowing for a distinction to be made between authentic and altered audio. Discrepancies, artefacts, or abnormalities that are suggestive of deepfake manipulation are identified. Furthermore, said machine learning algorithms are adapted to recognize deviations from expected speech patterns and to highlight probable deepfake irregularities within the converted audio-to-image data. Fig. 1
Description: REAL TIME DEEPFAKE AUDIO DETECTION SYSTEM
Field of the Invention
[0001] The present disclosure relates to digital audio verification, specifically to detecting and analyzing deepfake audio within social media platforms, by employing machine learning algorithms, such as convolutional neural networks, to convert audio data into images.
Background
[0002] The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
[0003] In the field of digital communication, particularly within social media platforms, the emergence of deepfake audio technology has manifested a significant challenge. Deepfake audio, characterized by the artificial generation or manipulation of human speech using advanced algorithms, poses severe threats to the authenticity of digital content and the security of personal and public discourse. The proficiency of said technology in creating convincingly altered audio has necessitated the development of reliable detection methods.
[0004] Historically, efforts to identify deepfake audio have predominantly centered around traditional audio analysis techniques. Such techniques typically involve the examination of acoustic features and waveform patterns. However, said methods have exhibited substantial limitations in the face of increasingly sophisticated deepfake technologies. The primary drawback of traditional methods lies in their reliance on surface-level audio characteristics, which advanced deepfake algorithms can adeptly mimic. Consequently, the effectiveness of said methods has been progressively undermined.
[0005] Further compounding the issue is the rapid evolution of deepfake technology, which continuously refines the ability to simulate human speech nuances. As such, prior art in deepfake detection often struggles to keep pace with the advancements in deepfake generation techniques. The disparity between the evolution of deepfake technology and the development of detection methods has created a significant gap in digital security measures.
[0006] Moreover, existing methods have shown a tendency to generate false positives, wherein authentic audio is erroneously flagged as deepfake. Said tendency not only undermines the reliability of said detection systems but also poses risks of unwarranted censorship or mislabeling of legitimate content. The challenge is further exacerbated in a social media context, where the sheer volume and diversity of audio content necessitate highly efficient and accurate detection mechanisms.
[0007] The incorporation of machine learning algorithms, such as Convolutional Neural Networks (CNNs), in deepfake detection has offered some advancement. However, the integration of such algorithms into a system that can effectively tackle the nuances of deepfake audio remains in nascent stages. Prior art in said domain often suffers from limitations in processing complex audio data, a lack of robustness against varying audio qualities, and difficulties in distinguishing subtle manipulations characteristic of high-quality deepfake audio.
[0008] Additionally, the focus of existing methods has predominantly been on the technical aspects of audio, largely neglecting the biometric and forensic analysis of speech. Such oversight limits the scope of detection and fails to address the intricate aspects of speech patterns and voice biometrics that are crucial in identifying deepfake audio.
[0009] Thus, the disadvantages and drawbacks of prior art in the field of deepfake audio detection, particularly in the context of social media platforms, are manifold. Said disadvantages include a reliance on outdated techniques, a lag in keeping pace with advancing deepfake technologies, a propensity for false positives, limitations in processing complex audio data, and a lack of analysis encompassing biometric and forensic aspects of speech. The advent of malicious deepfake technology thus underscores an urgent need for a more advanced, integrated approach to reliably detect and analyze deepfake audio in the ever-evolving landscape of digital media. Thus, there exists a need in the art for a system and a method for detecting deepfake audio to address said drawbacks.
Summary
[00010] The present disclosure relates to digital audio verification, specifically to detecting and analyzing deepfake audio within social media platforms, by employing machine learning algorithms, such as Convolutional Neural Networks, to convert audio data into images.
[00011] The following presents a simplified summary of various aspects of this disclosure in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements nor delineate the scope of such aspects. Its purpose is to present some concepts of this disclosure in a simplified form as a prelude to the more detailed description that is presented later.
[00012] The following paragraphs provide additional support for the claims of the subject application.
[00013] The disclosed system represents an approach in the field of digital security, specifically targeting the detection of deepfake audio. The system is distinguished by a series of interconnected modules, each designed to address specific aspects of audio analysis and deepfake detection. The foundation of said system is a data conversion module, which is configured to transform audio data into image data, including spectrograms. The conversion facilitates a more nuanced analysis of audio properties, allowing for a visual representation of various audio frequencies over time.
[00014] Central to the system is an analysis module, which encompasses a machine learning algorithm. Said algorithm, particularly a Convolutional Neural Network (CNN) model, is a critical component. The CNN model is trained on a diverse dataset that includes both authentic and deepfake audio recordings, enhancing the accuracy of deepfake detection.
[00015] Within the analysis module, several specialized sub-modules are integrated. The voice biometrics assessment module is designed to analyze the image data for voice biometrics characteristics. Said assessment module focuses on specific biometric parameters such as vocal tract shape, speech rhythm, and articulation patterns, ensuring an accurate biometric assessment. Additionally, a speaker recognition module is included, which is tasked with identifying speaker-specific features in the image data. Said module is configured to compare said features against a pre-established database of known voice profiles, facilitating accurate speaker identification.
[00016] The system also comprises an audio forensics module. Said module employs advanced signal processing techniques to detect anomalies in the frequency, amplitude, and phase properties of the audio signal. The role is crucial in identifying audio forensics attributes that are typical of manipulated audio.
[00017] Furthermore, a discrepancy detection module is incorporated within the analysis module. Said detection module utilizes anomaly detection algorithms to identify subtle inconsistencies characteristic of deepfake audio, such as unnatural pauses, breath sounds, or intonation patterns. Similarly, a pattern deviation identification module is configured to recognize deviations from anticipated patterns in the image data. Said module utilizes machine learning algorithms to learn and predict normal speech patterns, thereby enabling said identification module to flag deviations indicative of deepfake manipulation.
[00018] Each module of the system, including the data conversion module, the analysis module, the voice biometrics assessment component, the speaker recognition component, the audio forensics component, the discrepancy detection component, and the pattern deviation identification component, is executed by a processor. Said processor is coupled with a non-transitory storage medium that comprises machine-executable instructions. Collectively, said components form a system for the detection of deepfake audio, significantly enhancing the capability to differentiate between authentic and altered audio recordings.
[00019] A method for detecting deepfake audio has been developed, characterized by a series of sequential steps executed by various specialized modules. Initially, audio data is converted into image data through a data conversion module. Subsequently, the converted image data undergoes analysis by a machine learning algorithm within an analysis module, wherein said algorithm comprises a Convolutional Neural Network (CNN) model.
[00020] Voice biometrics characteristics within said image data are assessed using a voice biometrics assessment module. Concurrently, speaker-specific features in said image data are identified using a speaker recognition module. Further, said image data is examined for audio forensics attributes by an audio forensics module. Discrepancies, artefacts, or abnormalities in said image data, suggestive of deepfake manipulation, are detected using a discrepancy detection module.
[00021] Finally, deviations from anticipated patterns in said image data are recognized, and probable deepfake irregularities are highlighted using a pattern deviation identification module. In said process, variations in speech characteristics in said audio data, including pitch, tone, and cadence, are assessed by said analysis module to differentiate between authentic and altered audio recordings. Said method presents an approach to identifying and analyzing deepfake audio, enhancing the accuracy and reliability of deepfake detection in digital media.
Brief Description of the Drawings
[00022] The features and advantages of the present disclosure would be more clearly understood from the following description taken in conjunction with the accompanying drawings in which:
[00023] FIG. 1 represents an architecture of a system for detecting deepfake audio, in accordance with the embodiments of the present disclosure.
[00024] FIG. 2 illustrates a flow diagram of a method for detecting deepfake audio, in accordance with the embodiments of the present disclosure.
[00025] FIG. 3 illustrates analysis and determination of audio authenticity, in accordance with the embodiments of the present disclosure.
Detailed Description
[00026] In the following detailed description of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, specific embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims and equivalents thereof.
[00027] The use of the terms “a” and “an” and “the” and “at least one” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
[00028] The present disclosure relates to digital audio verification, specifically to detecting and analyzing deepfake audio within social media platforms, by employing machine learning algorithms, such as Convolutional Neural Networks, to convert audio data into images.
[00029] Pursuant to the "Detailed Description" section herein, whenever an element is explicitly associated with a specific numeral for the first time, such association shall be deemed consistent and applicable throughout the entirety of the "Detailed Description" section, unless otherwise expressly stated or contradicted by the context.
[00030] The proposed system 100 for detecting deepfake audio comprises several interconnected modules, each designed to perform specific functions in the process of analyzing and identifying deepfake audio content. The system 100, geared towards enhancing digital media authenticity, is particularly significant in the context of the rampant spread of deepfake technologies.
[00031] According to a figurative elucidation of FIG. 1, showcasing an architectural composition of the system 100, the system 100 can comprise functional elements including, yet not limited to, a data conversion module 102, an analysis module 104, a voice biometrics assessment module 106, a speaker recognition module 108, an audio forensics module 110, a discrepancy detection module 112, and a pattern deviation identification module 114. A person of ordinary skill in the art would appreciate that said elements or components of the system 100 may be functionally or operationally coupled with each other, in accordance with the embodiments of the present disclosure.
[00032] In an embodiment, the data conversion module is configured to convert audio data into image data. Said conversion is crucial for enabling the subsequent analysis modules to process and analyze the audio data more effectively. The module may convert audio data into various forms of image data, such as spectrograms, which provide a visual representation of the audio's frequency spectrum over time. Said visual representation is instrumental in identifying unique patterns and anomalies in the audio data that may not be discernible in the audio format alone.
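By way of a non-limiting illustration, the audio-to-image conversion described above can be sketched in Python using SciPy's spectrogram routine; the function name, parameters, and the synthetic test tone below are illustrative assumptions, not part of the disclosed system:

```python
import numpy as np
from scipy.signal import spectrogram

def audio_to_spectrogram_image(samples, sample_rate, n_fft=512):
    """Convert a 1-D audio signal into a 2-D log-magnitude spectrogram
    (frequency bins x time frames), suitable for image-based analysis."""
    freqs, times, sxx = spectrogram(samples, fs=sample_rate, nperseg=n_fft)
    log_sxx = 10.0 * np.log10(sxx + 1e-10)  # decibel scale
    # Normalize to 0-255 so the array can be stored or analyzed as an image.
    lo, hi = log_sxx.min(), log_sxx.max()
    img = (log_sxx - lo) / (hi - lo + 1e-10) * 255.0
    return img.astype(np.uint8)

# Illustrative input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
image = audio_to_spectrogram_image(np.sin(2 * np.pi * 440 * t), sr)
```

In such a representation, each column is a short time frame and each row a frequency bin, so anomalies localized in time or frequency become spatial patterns that image-oriented models can detect.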
[00033] In an embodiment, the system is arranged with the analysis module, wherein said analysis module can be equipped with a machine learning algorithm, specifically a Convolutional Neural Network (CNN) model. Said CNN model is trained on a diverse dataset that includes both authentic and deepfake audio recordings. The training enables the CNN to learn and recognize patterns and features characteristic of deepfake audio, thus enhancing the detection accuracy.
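As a minimal sketch of such a CNN model using the Keras library named later in this disclosure, the following binary classifier accepts fixed-size spectrogram images; the layer counts, filter sizes, and input shape are assumptions for demonstration only, not the disclosed model:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Binary classifier over fixed-size spectrogram "images":
# output is a sigmoid score interpreted as P(deepfake).
def build_detector(input_shape=(128, 128, 1)):
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

model = build_detector()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# In training, the model would be fit on a labeled mix of authentic and
# deepfake spectrograms, e.g.:
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels))
scores = model.predict(np.zeros((2, 128, 128, 1), dtype="float32"), verbose=0)
```

The sigmoid output yields a continuous confidence score per sample, which is what the smoothing and confidence-filtering stages described later in this disclosure operate on.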
[00034] Within the analysis module, the voice biometrics assessment module is configured to analyze the image data for voice biometrics characteristics. Said assessment module examines specific biometric parameters such as vocal tract shape, speech rhythm, and articulation patterns. Said analysis is vital in ensuring an accurate biometric assessment of the audio data, aiding in the identification of deepfake content.
[00035] In an embodiment, the speaker recognition module, arranged within the analysis module, is tasked with identifying speaker-specific features in the image data. Said recognition module is configured to compare said features against a pre-established database of known voice profiles. Said comparison is crucial for accurate speaker identification and verification, further contributing to the system's ability to detect deepfake audio.
[00036] In an embodiment, the audio forensics module, which can be configured within the analysis module, utilizes advanced signal processing techniques. Said forensics module examines the image data for audio forensics attributes, focusing on detecting anomalies in frequency, amplitude, and phase properties of the audio signal. Said module plays a significant role in uncovering audio manipulations that may indicate the presence of deepfake audio.
[00037] In an embodiment, the discrepancy detection module can be configured within the analysis module. Said detection module is designed to identify discrepancies, artifacts, or abnormalities in the image data suggestive of deepfake manipulation. Said detection module employs anomaly detection algorithms to identify subtle inconsistencies characteristic of deepfake audio, such as unnatural pauses, breath sounds, or intonation patterns. The ability to detect said subtle inconsistencies is vital for distinguishing deepfake audio from authentic recordings.
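One common family of anomaly detection algorithms is the isolation forest; the following sketch uses scikit-learn's implementation on hypothetical per-frame feature vectors (e.g. pause length, energy, pitch variance). The feature set and synthetic values are assumptions for illustration, not the disclosed algorithm:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-frame features extracted from the
# converted image data; real features would come from the analysis module.
normal_frames = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
suspect_frame = np.array([[8.0, 8.0, 8.0]])  # an extreme outlier

detector = IsolationForest(contamination=0.05, random_state=0)
detector.fit(normal_frames)

# predict() returns +1 for inliers and -1 for anomalies.
flag = detector.predict(suspect_frame)[0]
```

A frame flagged as anomalous would then be surfaced as a discrepancy suggestive of manipulation, subject to the confidence filtering described later in this disclosure.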
[00038] In an embodiment, the pattern deviation identification module is configured to recognize deviations from anticipated patterns in the image data and to highlight probable deepfake irregularities. Said deviation identification module utilizes machine learning algorithms to learn and predict normal speech patterns, thereby enabling said identification module to flag deviations indicative of deepfake manipulation. Said module assesses variations in speech characteristics in the audio data, including pitch, tone, and cadence, to differentiate between authentic and altered audio recordings.
[00039] Referring to one or more preceding embodiments, each module of the system 100, including the data conversion module, the analysis module, the voice biometrics assessment component, the speaker recognition component, the audio forensics component, the discrepancy detection component, and the pattern deviation identification component, is executed by a processor. Said processor is coupled with a non-transitory storage medium that comprises machine-executable instructions. The integration of said modules ensures a holistic approach to detecting deepfake audio, making the system highly effective in digital media authenticity verification. The proposed system 100, through the integrated approach, addresses the challenges posed by deepfake audio technology, offering a robust solution to ensure the integrity of digital audio content.
[00040] Referring to one or more preceding embodiments, the present disclosure discloses a multifaceted system for the detection of deepfake audio content, characterized by the integration of dynamic threshold adjustment techniques, multimodal analysis, adversarial training methodologies, privacy-preserving mechanisms, and real-time processing capabilities. Specifically, the system incorporates adaptive decision-making processes based on real-time analysis to effectively mitigate false positives and negatives. Said system embodies an approach by combining audio-to-image conversion with voice biometrics, speaker recognition, and audio forensics, thereby enhancing detection precision.
[00041] Referring to one or more preceding embodiments, the system includes training protocols for deepfake detection models using adversarial examples, significantly improving the model's robustness against evolving deepfake generation techniques and adversarial attacks. A notable feature of said system is the commitment to privacy preservation, achieved through advanced techniques such as federated learning or on-device model inference, thereby ensuring secure and private processing of audio data. Lastly, the system is equipped with a real-time deepfake audio recognition pipeline that efficiently processes and analyzes audio data using convolutional neural networks, ensuring instantaneous and accurate detection of deepfake content. Said system represents a significant advancement in digital security and audio analysis, providing a reliable, adaptable, and privacy-conscious solution for deepfake audio detection.
[00042] The proposed method 200 for detecting deepfake audio is characterized by a series of sequential and interconnected processes, each executed by specialized modules. Said method represents an approach to identifying and analyzing deepfake audio content, which is increasingly prevalent and sophisticated in digital media.
[00043] Referring to a pictorial depiction put forth in FIG. 2, representing a flow diagram of the method 200 that can comprise steps including, yet not restricted to, (at step 202) converting audio data, (at step 204) analyzing the converted image data, (at step 206) assessing voice biometrics characteristics, (at step 208) identifying speaker-specific features, (at step 210) examining the image data, (at step 212) detecting discrepancies, artefacts, or abnormalities, and (at step 214) recognizing deviations from anticipated patterns in the image data and highlighting probable deepfake irregularities. Said steps of the method 200 can be performed or executed, collectively or selectively, sequentially or in any combination thereof, in accordance with the embodiments of the current disclosure.
[00044] In an embodiment, the initial step in the method involves converting audio data into image data. Said conversion is performed by a data conversion module, which is configured to transform audio signals into a visual format, such as spectrograms. Said transformation is crucial for enabling more detailed and nuanced analysis of the audio characteristics. Said transformation allows subsequent modules to process and interpret audio data in a visual context.
[00045] In an embodiment, the method involves an implementation of the analysis module that is equipped with a machine learning algorithm, specifically a Convolutional Neural Network (CNN) model. The CNN model is trained on a diverse dataset comprising both authentic and deepfake audio recordings, enhancing the capability to discern patterns and features indicative of deepfakes. The training process involves exposing the CNN to various examples of audio, allowing said training process to learn and adapt to the intricacies of deepfake audio.
[00046] Within the analysis module, the voice biometrics assessment module is configured to analyze the image data for voice biometrics characteristics. Said module assesses parameters such as vocal tract shape, speech rhythm, and articulation patterns. The analysis of said biometric characteristics plays a vital role in determining the authenticity of the audio data.
[00047] In an embodiment, the speaker recognition module can be a part of the analysis module. Said recognition module is tasked with identifying speaker-specific features in the image data. Said recognition module compares said features against a database of known voice profiles for accurate speaker identification. Said comparison is crucial for verifying the identity of the speaker in the audio and detecting potential impersonations or deepfake manipulations.
[00048] In an embodiment, the audio forensics module is responsible for examining the image data for audio forensics attributes. Said forensics module utilizes advanced signal processing techniques to detect anomalies in frequency, amplitude, and phase properties of the audio signal. Said module plays a significant role in uncovering audio manipulations that may indicate the presence of deepfake audio.
[00049] In an embodiment, the discrepancy detection module is designed to identify discrepancies, artifacts, or abnormalities in the image data suggestive of deepfake manipulation. Said detection module employs anomaly detection algorithms to identify subtle inconsistencies characteristic of deepfake audio. Said inconsistencies may include unnatural pauses, breath sounds, or intonation patterns that are not typical of natural speech.
[00050] Referring to one or more preceding embodiments, the pattern deviation identification module recognizes deviations from anticipated patterns in the image data and highlights probable deepfake irregularities. Said identification module utilizes machine learning algorithms to learn and predict normal speech patterns, thereby enabling said module to flag deviations indicative of deepfake manipulation.
[00051] Said identification module assesses variations in speech characteristics, including pitch, tone, and cadence, to differentiate between authentic and altered audio recordings. Together, said modules form a method 200 for detecting deepfake audio. Said method enhances the ability to differentiate between authentic and altered audio recordings, thereby contributing significantly to the integrity and trustworthiness of digital audio content.
[00052] FIG. 3 illustrates analysis and determination of audio authenticity. Audio signals are first converted into a visual format, depicted as waveform images. Said images then serve as the input to a convolutional processing sequence, where they undergo a series of transformations to extract distinguishing features. The convolutional layers are designed to identify and amplify subtle characteristics inherent in the waveform that may indicate manipulation. Augmentation techniques are applied to enhance the robustness of the feature extraction process, introducing variability to account for different forms of audio manipulation. The system culminates in an output module that employs classification algorithms to ascertain the authenticity of the audio. Said output module determines whether the analyzed visual representation corresponds to a deepfake or genuine audio. Said determination is based on the presence or absence of identified features indicative of audio authenticity or manipulation, thus providing a conclusive decision regarding the nature of the audio signal.
[00053] Referring to one or more preceding embodiments, the disclosed method for detecting deepfake audio content utilizes machine learning algorithms with audio-to-image conversion techniques. Said method employs convolutional neural networks (CNNs) to analyze visual representations of audio data, enhancing the capability to detect manipulated content. The process integrates voice biometrics and speaker recognition, utilizing unique vocal characteristics to distinguish between genuine and deepfake audio.
[00054] Referring to one or more preceding embodiments, the method involves a thorough audio forensic analysis to identify discrepancies and abnormalities indicative of deepfake manipulation. Said method includes the utilization of essential libraries such as NumPy, Keras, TensorFlow, OpenCV, and SciKit-Learn for data processing and model development. The method ensures consistency in the dataset by using standardized image sizes, which simplifies subsequent processing steps. The dataset is divided into training, validation, and testing subsets for effective model training and performance evaluation. Said method also incorporates smoothing and confidence filtering techniques to refine predictions and employs continuous monitoring and updating mechanisms to keep pace with evolving deepfake techniques. Said method provides a reliable and precise solution for the detection of deepfake audio content.
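The standardization of image sizes mentioned above can be sketched as follows; in practice OpenCV's cv2.resize is typically used for this purpose, but this dependency-light NumPy version (with an assumed 128x128 target and zero-padding/cropping policy) illustrates the idea:

```python
import numpy as np

def standardize_image(img, target=(128, 128)):
    """Pad with zeros or crop a 2-D spectrogram image to a fixed size so
    every sample in the dataset has identical dimensions.
    (cv2.resize from OpenCV would serve the same purpose in practice.)"""
    h, w = target
    out = np.zeros((h, w), dtype=img.dtype)
    ch = min(h, img.shape[0])
    cw = min(w, img.shape[1])
    out[:ch, :cw] = img[:ch, :cw]  # copy the overlapping region
    return out

small = standardize_image(np.ones((100, 90), dtype=np.uint8))   # padded
large = standardize_image(np.ones((300, 200), dtype=np.uint8))  # cropped
```

Whether padding/cropping or resampling is appropriate depends on whether absolute time/frequency scale must be preserved; resizing distorts scale, while padding preserves it at the cost of empty regions.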
[00055] Referring to one or more preceding embodiments, the transformation of auditory data into images employs CNNs for pattern recognition within images when analyzing audio data. The use of visual representations enables a more detailed and nuanced analysis than traditional audio-only methods. Essential programming and data analysis libraries, such as NumPy, Keras, TensorFlow, OpenCV (cv2), and SciKit-Learn, are utilized in said process. Said libraries provide the necessary framework for manipulating data, developing and optimizing machine learning models, and evaluating their performance.
[00056] Referring to one or more preceding embodiments, to ensure uniformity and consistency in the dataset, the audio data is transformed into visuals with standardized dimensions. Said standardization streamlines subsequent phases of image processing and analysis, facilitating more effective model training and evaluation. The dataset is carefully partitioned into training, validation, and testing subsets. The training subset is used to train the CNN, while the validation set aids in hyperparameter adjustment and model evaluation. The testing subset ensures that the model's effectiveness is accurately assessed against both authentic and deepfake audio samples.
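The partitioning into training, validation, and testing subsets can be sketched with scikit-learn; the 80/20 ratios, array shapes, and synthetic data below are illustrative assumptions, not prescribed by the disclosure:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 spectrogram images with binary labels
# (0 = authentic, 1 = deepfake).
images = np.random.rand(100, 128, 128)
labels = np.array([0, 1] * 50)

# First carve out 20% for testing, then 20% of the remainder for
# validation; stratify keeps the class balance in every subset.
x_rest, x_test, y_rest, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42)
```

The validation subset then drives hyperparameter adjustment, while the held-out test subset is touched only once for the final assessment against both authentic and deepfake samples.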
[00057] Referring to one or more preceding embodiments, the method includes smoothing techniques and confidence filtering. Smoothing techniques, such as employing moving averages, help mitigate the impact of occasional inaccurate predictions. Confidence filtering involves setting a predetermined level of certainty for accepting predictions. If a prediction falls below a certain threshold of certainty, the sample may be marked for further examination or classified as uncertain, rather than being categorically identified. The standardization of image sizes and the systematic approach to training and evaluating the model further enhance the efficiency of the process, making said method a significant aspect in the field of audio analysis and deepfake detection.
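The smoothing and confidence-filtering steps described above can be sketched as follows; the window size, the 0.7 threshold, and the label names are illustrative assumptions:

```python
import numpy as np

def smooth_and_filter(scores, window=3, threshold=0.7):
    """Apply a moving average over per-frame deepfake scores, then keep
    only predictions whose smoothed confidence clears the threshold;
    everything in between is labeled 'uncertain' for further review."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode="same")
    labels = []
    for s in smoothed:
        if s >= threshold:
            labels.append("deepfake")
        elif s <= 1.0 - threshold:
            labels.append("authentic")
        else:
            labels.append("uncertain")
    return smoothed, labels

# A single spurious spike (0.9) amid low scores is damped by the
# moving average, so no frame is categorically labeled "deepfake".
scores = np.array([0.1, 0.1, 0.9, 0.1, 0.1])
smoothed, labels = smooth_and_filter(scores)
```

This illustrates the stated design goal: occasional inaccurate predictions are absorbed rather than propagated, and low-certainty samples are routed to further examination instead of being categorically identified.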
[00058] Example embodiments herein have been described above with reference to block diagrams and flowchart illustrations of methods and apparatuses. It will be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, can be implemented by various means including hardware, software, firmware, and a combination thereof. For example, in one embodiment, each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations can be implemented by computer program instructions. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks.
[00059] Throughout the present disclosure, the term ‘processing means’ or ‘microprocessor’ or ‘processor’ or ‘processors’ includes, but is not limited to, a general purpose processor (such as, for example, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a microprocessor implementing other types of instruction sets, or a microprocessor implementing a combination of types of instruction sets) or a specialized processor (such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), or a network processor).
[00060] The term “non-transitory storage device” or “storage” or “memory,” as used herein relates to a random access memory, read only memory and variants thereof, in which a computer can store data or software for any duration.
[00061] Operations in accordance with a variety of aspects of the disclosure, as described above, need not be performed in the precise order described. Rather, various steps can be handled in reverse order, simultaneously, or not at all.
[00062] While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary, and the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.
Claims
I/We Claim:
1. A system for detecting deepfake audio, characterized by:
a data conversion module configured to convert audio data into image data;
an analysis module equipped with a machine learning algorithm, wherein said machine learning algorithm comprises a Convolutional Neural Network (CNN) model;
a voice biometrics assessment module within said analysis module, configured to analyze said image data for voice biometrics characteristics;
a speaker recognition module within said analysis module, tasked with identifying speaker-specific features in said image data;
an audio forensics module within said analysis module, responsible for examining said image data for audio forensics attributes;
a discrepancy detection module within said analysis module, designed to identify discrepancies, artefacts, or abnormalities in said image data suggestive of deepfake manipulation; and
a pattern deviation identification module within said analysis module, configured to recognize deviations from anticipated patterns in said image data and to highlight probable deepfake irregularities, wherein variations in speech characteristics in said audio data, including pitch, tone, and cadence, are assessed by said analysis module to differentiate between authentic and altered audio recordings.
2. The system according to claim 1, wherein said data conversion module is further configured to convert audio data into spectrograms, allowing for a visual representation of various audio frequencies over time.
3. The system according to claim 1, wherein said Convolutional Neural Network (CNN) model in the analysis module is trained on a diverse dataset comprising both authentic and deepfake audio recordings to enhance the detection accuracy.
4. The system according to claim 1, wherein the voice biometrics assessment module analyzes specific biometric parameters including vocal tract shape, speech rhythm, and articulation patterns to ensure accurate biometric assessment.
5. The system according to claim 1, wherein the speaker recognition module is configured to compare speaker-specific features against a pre-established database of known voice profiles for accurate speaker identification.
6. The system according to claim 1, wherein the audio forensics module utilizes advanced signal processing techniques to detect anomalies in frequency, amplitude, and phase properties of the audio signal.
7. The system according to claim 1, wherein the discrepancy detection module employs anomaly detection algorithms to identify subtle inconsistencies that are characteristic of deepfake audio, such as unnatural pauses, breath sounds, or intonation patterns.
8. The system according to claim 1, wherein the pattern deviation identification module utilizes machine learning algorithms to learn and predict normal speech patterns, thereby enabling it to flag deviations indicative of deepfake manipulation.
9. The system according to claim 1, wherein each module of said system, including the data conversion module, the analysis module, the voice biometrics assessment module, the speaker recognition module, the audio forensics module, the discrepancy detection module, and the pattern deviation identification module, is executed by a processor, said processor being coupled with a non-transitory storage medium that comprises machine-executable instructions.
10. A method for detecting deepfake audio, comprising:
converting audio data into image data using a data conversion module;
analyzing the converted image data using a machine learning algorithm equipped in an analysis module, wherein the machine learning algorithm includes a Convolutional Neural Network (CNN) model;
assessing voice biometrics characteristics within the image data using a voice biometrics assessment module;
identifying speaker-specific features in the image data using a speaker recognition module;
examining the image data for audio forensics attributes using an audio forensics module;
detecting discrepancies, artefacts, or abnormalities in the image data suggestive of deepfake manipulation using a discrepancy detection module; and
recognizing deviations from anticipated patterns in the image data and highlighting probable deepfake irregularities using a pattern deviation identification module,
wherein the analysis module assesses variations in speech characteristics in the audio data, including pitch, tone, and cadence, to differentiate between authentic and altered audio recordings.
REAL TIME DEEPFAKE AUDIO DETECTION SYSTEM
Abstract
The disclosure pertains to a system and method for the detection of deepfake audio content through the differentiation of authentic and altered audio recordings. Said system employs machine learning algorithms, including a Convolutional Neural Network (CNN) model, to convert audio data into image form and analyze said images for characteristics indicative of audio authenticity. Through said system, variations in speech attributes such as pitch, tone, and cadence are assessed, allowing for a distinction to be made between authentic and altered audio. Discrepancies, artefacts, or abnormalities that are suggestive of deepfake manipulation are identified. Furthermore, said machine learning algorithms are adapted to recognize deviations from expected speech patterns and to highlight probable deepfake irregularities within the converted audio-to-image data.
Fig. 1
| # | Name | Date |
|---|---|---|
| 1 | 202321089685-REQUEST FOR EARLY PUBLICATION(FORM-9) [29-12-2023(online)].pdf | 2023-12-29 |
| 2 | 202321089685-POWER OF AUTHORITY [29-12-2023(online)].pdf | 2023-12-29 |
| 3 | 202321089685-OTHERS [29-12-2023(online)].pdf | 2023-12-29 |
| 4 | 202321089685-FORM-9 [29-12-2023(online)].pdf | 2023-12-29 |
| 5 | 202321089685-FORM FOR SMALL ENTITY(FORM-28) [29-12-2023(online)].pdf | 2023-12-29 |
| 6 | 202321089685-FORM 18 [29-12-2023(online)].pdf | 2023-12-29 |
| 7 | 202321089685-FORM 1 [29-12-2023(online)].pdf | 2023-12-29 |
| 8 | 202321089685-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [29-12-2023(online)].pdf | 2023-12-29 |
| 9 | 202321089685-EDUCATIONAL INSTITUTION(S) [29-12-2023(online)].pdf | 2023-12-29 |
| 10 | 202321089685-DRAWINGS [29-12-2023(online)].pdf | 2023-12-29 |
| 11 | 202321089685-DECLARATION OF INVENTORSHIP (FORM 5) [29-12-2023(online)].pdf | 2023-12-29 |
| 12 | 202321089685-COMPLETE SPECIFICATION [29-12-2023(online)].pdf | 2023-12-29 |
| 13 | 202321089685-POA [17-01-2024(online)].pdf | 2024-01-17 |
| 14 | 202321089685-MARKED COPIES OF AMENDEMENTS [17-01-2024(online)].pdf | 2024-01-17 |
| 15 | 202321089685-FORM 13 [17-01-2024(online)].pdf | 2024-01-17 |
| 16 | 202321089685-AMMENDED DOCUMENTS [17-01-2024(online)].pdf | 2024-01-17 |
| 17 | 202321089685-RELEVANT DOCUMENTS [01-10-2024(online)].pdf | 2024-10-01 |
| 18 | 202321089685-POA [01-10-2024(online)].pdf | 2024-10-01 |
| 19 | 202321089685-FORM 13 [01-10-2024(online)].pdf | 2024-10-01 |
| 20 | 202321089685-FORM 3 [02-07-2025(online)].pdf | 2025-07-02 |
| 21 | 202321089685-FER.pdf | 2025-11-11 |
| 1 | 202321089685E_10-06-2024.pdf | |