
Psychoacoustic Anomaly Detection Method For Audio Deepfake Detection

Abstract: The present disclosure relates to a system (100) designed for the detection of audio deepfake anomalies through an approach integrating several modules. A data acquisition module (102) is tasked with collecting a diverse array of audio samples, including human speech, music, and ambient sounds. An audio conditioning unit (104) then normalizes volume levels and performs noise reduction and filtering to improve the audio quality of said samples. The core of the system, an audio feature extraction component (106), extracts both acoustic and psychoacoustic features, leveraging said features for the identification of genuine versus manipulated audio. Utilizing machine learning algorithms, a psychoacoustic model training engine (108) discerns between authentic and deepfake audio based on human auditory perception. Anomaly detection (110) and scoring (112) modules identify and quantify deviations from genuine audio characteristics. Finally, a threshold comparison module (114) classifies the audio as deepfake or genuine, providing a tool against the proliferation of audio misinformation.


Patent Information

Filing Date
26 April 2024
Publication Number
23/2024
Publication Type
INA
Invention Field
COMPUTER SCIENCE

Applicants

MARWADI UNIVERSITY
MARWADI UNIVERSITY, RAJKOT- MORBI HIGHWAY, AT GAURIDAD, RAJKOT – 360003, GUJARAT, INDIA
MS. RESHMA SUNIL
MARWADI UNIVERSITY, RAJKOT- MORBI HIGHWAY, AT GAURIDAD, RAJKOT – 360003, GUJARAT, INDIA
MS. PARITA MER
MARWADI UNIVERSITY, RAJKOT- MORBI HIGHWAY, AT GAURIDAD, RAJKOT – 360003, GUJARAT, INDIA
DR. ANJALI DIWAN
MARWADI UNIVERSITY, RAJKOT- MORBI HIGHWAY, AT GAURIDAD, RAJKOT – 360003, GUJARAT, INDIA

Inventors

1. MS. RESHMA SUNIL
MARWADI UNIVERSITY, RAJKOT- MORBI HIGHWAY, AT GAURIDAD, RAJKOT – 360003, GUJARAT, INDIA
2. MS. PARITA MER
MARWADI UNIVERSITY, RAJKOT- MORBI HIGHWAY, AT GAURIDAD, RAJKOT – 360003, GUJARAT, INDIA
3. DR. ANJALI DIWAN
MARWADI UNIVERSITY, RAJKOT- MORBI HIGHWAY, AT GAURIDAD, RAJKOT – 360003, GUJARAT, INDIA

Specification

Description

Field of the Invention

The present disclosure relates to audio security, particularly to a system for detecting and analyzing audio deepfake anomalies using an integrated, multi-module approach based on psychoacoustic and machine learning principles.
Background
The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
The proliferation of audio deepfake technology has emerged as a formidable challenge in the field of digital security. Said technology enables the creation of audio recordings that mimic the voice of individuals with high accuracy, posing significant risks to privacy, security, and the integrity of information dissemination. Prior art in the field of anomaly detection for audio deepfake identification has traditionally focused on acoustic feature analysis, wherein the spectral and waveform characteristics of audio samples are scrutinized for inconsistencies indicative of manipulation. Despite the advancements, several drawbacks have been identified in said conventional systems, necessitating a more nuanced approach.
One significant drawback associated with prior art is the reliance on basic acoustic features, which, while effective in detecting rudimentary deepfakes, often fall short when confronted with sophisticated manipulation techniques. Advanced deepfake algorithms can seamlessly blend manipulated audio segments with genuine ones, thereby evading detection by systems that solely analyse acoustic properties. Such limitations highlight the need for methods that go beyond mere spectral analysis to consider the subtleties of human auditory perception.
Furthermore, prior art in the detection of audio deepfakes has not adequately addressed the challenge of real-time analysis. Many existing systems require extensive computational resources and time to analyse audio samples, rendering them impractical for applications where timely detection is important. The latency inherent in said methods undermines their utility in dynamic environments where immediate identification of deepfakes is essential to prevent the spread of misinformation or to secure digital communications in real time.
Additionally, the adaptability of conventional systems poses another concern. With the rapid evolution of deepfake generation techniques, anomaly detection methods based on static criteria quickly become obsolete. Many prior art systems lack the flexibility to learn from new examples of deepfakes, limiting their effectiveness over time. Said drawback necessitates continuous manual updates to the detection models, a process that is both time-consuming and resource-intensive.
The challenge of false positives and negatives also plagues existing anomaly detection methods for audio deepfakes. Systems that set rigid thresholds for anomaly detection risk misclassifying genuine audio as deepfake and vice versa. Such inaccuracies can have serious repercussions, from unjustly impugning the integrity of authentic audio recordings to failing to identify malicious deepfakes. The balance between sensitivity and specificity remains a critical issue that prior art has struggled to optimize.
Moreover, the focus of prior art on singular detection techniques without considering the integration of various data sources and analytical methods has further limited the effectiveness of existing systems. Anomalies indicative of deepfakes are often subtle and varied, necessitating a multifaceted approach to detection that leverages both acoustic and psychoacoustic analysis. The failure to incorporate an integrated analysis framework has rendered many conventional systems inadequate in the face of complex deepfake manipulations.
Prior art approaches failed to integrate advanced machine learning algorithms capable of adapting to new forms of manipulation, to employ a nuanced approach to feature analysis that encompasses both acoustic and psychoacoustic properties, and to facilitate real-time processing for timely detection. Thus, there exists a persistent need in the art for optimizing the trade-off between false positives and false negatives, thereby maintaining high accuracy and reliability in identifying deepfakes. In light of said challenges, there exists an urgent need for an anomaly detection method for audio deepfake detection that transcends the limitations of prior art.
Summary
The following presents a simplified summary of various aspects of this disclosure in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements nor delineate the scope of such aspects. Its purpose is to present some concepts of this disclosure in a simplified form as a prelude to the more detailed description that is presented later.
The following paragraphs provide additional support for the claims of the subject application.
The disclosure pertains to a system for detecting audio deepfake anomalies. The system comprises a data acquisition module configured to collect a set of audio samples including human speech, music, and ambient sounds. An audio conditioning unit is configured to normalize volume levels across the collected audio samples and perform noise reduction and filtering to enhance audio quality. An audio feature extraction component is configured to extract acoustic and psychoacoustic features from the collected set of audio samples, wherein the acoustic features relate to signal properties.
A psychoacoustic model training engine utilizes machine learning algorithms to discern genuine audio from deepfake audio based on the extracted psychoacoustic features indicative of human auditory perception. An anomaly detection module is configured to compare the extracted psychoacoustic features of the audio samples under test against features of genuine samples to identify deviations characteristic of deepfake audio. An anomaly scoring module is configured to calculate anomaly scores based on the identified deviations from genuine audio characteristics.
A threshold comparison module is configured to classify one or more of the audio samples under test as deepfake or genuine based on the anomaly scores exceeding a predefined threshold and a score not exceeding the predefined threshold, respectively. The system further comprises enhancements in several modules. The audio conditioning unit employs a dynamic range compression algorithm to perform uniform loudness normalization across the set of audio samples.
The audio feature extraction component utilizes a Fourier transform to convert the audio samples from the time domain to the frequency domain for detailed acoustic feature extraction. The psychoacoustic model training engine implements a neural network architecture for adaptive learning to discern between genuine and deepfake audio with each iteration. The anomaly detection module comprises a temporal analysis unit to assess the consistency of the psychoacoustic features.
The anomaly scoring module applies a weighted scoring unit that assigns different levels of importance to the psychoacoustic features based on the relevance to the human auditory perception. The threshold comparison module comprises a feedback loop component to the psychoacoustic model training engine to refine the detection algorithm based on the classification results of the audio samples.
The anomaly detection module employs a multivariate analysis approach to detect complex patterns and interactions among psychoacoustic features indicative of deepfake audio. The system further comprises a reporting interface module configured to generate detailed reports of the analysis, classification results, and characteristics of detected deepfake audio samples for forensic analysis.
The present disclosure pertains to a method for detecting audio deepfake anomalies. The method comprises collecting a set of audio samples including human speech, music, and ambient sounds using a data acquisition module. Volume levels across the collected audio samples are normalized, and noise reduction and filtering are performed to enhance audio quality using an audio conditioning unit.
Acoustic and psychoacoustic features from the collected set of audio samples are extracted, wherein the acoustic features relate to signal properties, using an audio feature extraction component. Machine learning algorithms are utilized to discern genuine audio from deepfake audio based on the extracted psychoacoustic features indicative of human auditory perception with a psychoacoustic model training engine.
The extracted psychoacoustic features of the audio samples under test are compared against features of genuine samples to identify deviations characteristic of deepfake audio using an anomaly detection module. Anomaly scores based on the identified deviations from genuine audio characteristics are calculated with an anomaly scoring module. One or more of the audio samples under test are classified as a deepfake or genuine based on the anomaly scores exceeding a predefined threshold and a score not exceeding the predefined threshold, respectively.

Brief Description of the Drawings

The features and advantages of the present disclosure would be more clearly understood from the following description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a system for detecting audio deepfake anomalies, in accordance with the embodiments of the present disclosure.
FIG. 2 illustrates a method for detecting audio deepfake anomalies, in accordance with the embodiments of the present disclosure.
FIG. 3 illustrates a flowchart for the method of detecting audio deepfake anomalies.
FIG. 4 illustrates a working decision flow diagram (DFD) for the method to detect audio anomalies, in accordance with the embodiments of the present disclosure.
Detailed Description
In the following detailed description of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, specific embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims and equivalents thereof.
The use of the terms “a” and “an” and “the” and “at least one” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Pursuant to the "Detailed Description" section herein, whenever an element is explicitly associated with a specific numeral for the first time, such association shall be deemed consistent and applicable throughout the entirety of the "Detailed Description" section, unless otherwise expressly stated or contradicted by the context.
Disclosed herein is a system 100 for detecting audio deepfake anomalies, comprising several components, each dedicated to specific functions within the process of identifying and classifying audio samples as genuine or deepfake. According to the pictorial illustration of FIG. 1, showcasing an architectural paradigm of the system 100, the system can comprise functional elements including, yet not limited to, a data acquisition module 102, an audio conditioning unit 104, an audio feature extraction component 106, a psychoacoustic model training engine 108, an anomaly detection module 110, an anomaly scoring module 112, and a threshold comparison module 114.
Referring to the preceding embodiment, a person ordinarily skilled in the art would appreciate that the elements or components of the system 100 are functionally or operationally coupled with each other, in accordance with the embodiments of the present disclosure. For instance, as used herein, and unless the context dictates otherwise, the term “coupled to/with” is intended to include both a direct coupling (two elements directly interlinked with each other) and an indirect coupling (one or more elements positioned between the two interlinked elements). Thus, the terms “coupled to” and “coupled with” can be used synonymously or interchangeably.
In an embodiment, the data acquisition module 102 is configured to collect a set of audio samples encompassing human speech, music, and ambient sounds. The collection of a diverse set of audio samples is important for the system 100 to have a broad baseline for comparison, which aids in the accurate detection of anomalies indicative of deepfake audio.
In an embodiment, the audio conditioning unit 104 is tasked with the normalization of volume levels across the collected audio samples. Furthermore, said audio conditioning unit 104 performs noise reduction and filtering to enhance the quality of the audio. The normalization and enhancement of audio quality are important for maintaining the fidelity of the audio features to be extracted, thereby improving the overall accuracy of the deepfake detection process.
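The conditioning stage described above can be sketched in plain Python. The peak-normalization target and the moving-average smoothing filter below are illustrative stand-ins for the unit's normalization and noise-reduction steps, not the specific algorithms claimed:

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the maximum absolute amplitude equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]

def moving_average_filter(samples, window=3):
    """Simple low-pass smoothing as a crude stand-in for noise reduction."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

raw = [0.1, 0.5, -0.25, 0.05]
conditioned = moving_average_filter(normalize_peak(raw))
```

In practice this stage would operate on full sample arrays decoded from audio files; the four-sample list here only demonstrates the data flow.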
In an embodiment, the audio feature extraction component 106 is configured to extract acoustic and psychoacoustic features from the collected set of audio samples, wherein said acoustic features relate to signal properties. The extraction of both acoustic and psychoacoustic features allows for an analysis of the audio samples, taking into account not just the signal properties but also the aspects of human auditory perception.
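As a hedged illustration of such feature extraction, the sketch below computes two common per-frame features: RMS energy (a crude correlate of perceived loudness) and zero-crossing rate (a rough pitch/noisiness cue). These are representative examples only; the disclosure does not enumerate the exact features used:

```python
import math

def rms_loudness(frame):
    """Root-mean-square energy: a crude correlate of perceived loudness."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign (rough pitch/noisiness cue)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

frame = [1.0, -1.0, 1.0, -1.0]
features = (rms_loudness(frame), zero_crossing_rate(frame))
```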
In an embodiment, the psychoacoustic model training engine 108 utilizes machine learning algorithms to discern genuine audio from deepfake audio based on the extracted psychoacoustic features indicative of human auditory perception. The application of machine learning algorithms in training the psychoacoustic model enables the system 100 to learn and adapt to the subtle nuances that differentiate genuine audio from the deepfake counterparts, significantly enhancing the detection capabilities of the system 100.
In an embodiment, the anomaly detection module 110 is configured to compare the extracted psychoacoustic features of the audio samples under test against features of genuine samples to identify deviations characteristic of deepfake audio. The identification of deviations allows for the detection of anomalies within the audio samples that are indicative of tampering or fabrication, which is significant for the accurate classification of audio samples as genuine or deepfake.
In an embodiment, the anomaly scoring module 112 is configured to calculate anomaly scores based on the identified deviations from genuine audio characteristics. The calculation of anomaly scores provides a quantitative measure of the deviations, facilitating a systematic approach to the classification of audio samples.
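One simple way to quantify such deviations, shown here as an assumed illustration rather than the claimed scoring method, is a mean absolute z-score of the test sample's features against a baseline of genuine samples:

```python
import statistics

def anomaly_score(test_features, genuine_samples):
    """Mean absolute z-score of each feature against the genuine baseline.

    `genuine_samples` is a list of feature vectors from known-genuine audio.
    """
    score = 0.0
    for i, value in enumerate(test_features):
        column = [s[i] for s in genuine_samples]
        mu = statistics.mean(column)
        sigma = statistics.pstdev(column) or 1.0  # guard against zero spread
        score += abs(value - mu) / sigma
    return score / len(test_features)
```

A sample whose features sit at the genuine means scores 0; larger scores indicate larger deviations.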
In an embodiment, the threshold comparison module 114 is configured to classify one or more of the audio samples under test as a deepfake or genuine based on the anomaly scores exceeding a predefined threshold and a score not exceeding the predefined threshold, respectively. The classification based on threshold comparison allows for a clear demarcation between genuine and deepfake audio samples, maintaining the reliability of the detection process.
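The threshold comparison itself reduces to a one-line rule; the threshold value below is purely illustrative, as the disclosure leaves the predefined threshold unspecified:

```python
def classify(score, threshold=1.5):
    # Scores above the predefined threshold are flagged as deepfake;
    # scores at or below it are treated as genuine.
    return "deepfake" if score > threshold else "genuine"

labels = [classify(s) for s in (0.4, 2.7, 1.5)]  # -> ['genuine', 'deepfake', 'genuine']
```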
Referring to one or more preceding embodiments, each component within the system 100 plays a pivotal role in the detection of audio deepfake anomalies, from the initial acquisition of audio samples to the final classification based on anomaly scores. The collaborative functioning of said components enables the system 100 to effectively discern between genuine and deepfake audio, thereby contributing to the integrity of digital audio content.
In an embodiment, the audio conditioning unit 104 additionally employs a dynamic range compression algorithm to perform uniform loudness normalization across the set of audio samples. The application of a dynamic range compression algorithm facilitates the achievement of consistent loudness levels, which is important for reducing variability between samples and enhancing the effectiveness of subsequent analysis stages.
By maintaining uniform loudness, the system 100 improves the reliability of feature extraction and anomaly detection, leading to more accurate identification of audio deepfake anomalies. Said uniform loudness normalization is significant in maintaining the integrity of the audio analysis process, by minimizing the probability for discrepancies that could affect the ability of the system 100 to accurately discern genuine from deepfake audio.
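A minimal sketch of such compression, assuming a basic static compressor (threshold and ratio values are illustrative, not taken from the disclosure):

```python
def compress(samples, threshold=0.5, ratio=4.0):
    """Reduce above-threshold amplitudes by `ratio`, leaving quieter parts intact."""
    out = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        out.append(mag if s >= 0 else -mag)
    return out
```

A real compressor would additionally smooth the gain over time (attack/release); this static version only shows the core gain rule.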
In another embodiment, the audio feature extraction component 106 is further configured to utilize a Fourier transform to convert said audio samples from the time domain to the frequency domain for detailed acoustic feature extraction. The use of a Fourier transform enables the system 100 to analyse the frequency components of audio signals, providing a more granular view of the acoustic features present within the audio samples. Said conversion is fundamental to identifying subtle manipulations characteristic of deepfake audio, as it allows for the examination of the audio content at a level of detail that is not possible in the time domain alone. The ability to perform detailed acoustic feature extraction enhances the precision of the system 100 in detecting anomalies, thereby improving the overall effectiveness of the deepfake detection process.
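The time-to-frequency conversion can be illustrated with a direct discrete Fourier transform (an FFT would be used in practice for speed; this O(n²) form is for clarity only):

```python
import cmath
import math

def dft_magnitudes(frame):
    """Discrete Fourier transform magnitudes for the first half of the spectrum."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

# A pure cosine completing 2 cycles over 8 samples: energy concentrates in bin 2.
frame = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
mags = dft_magnitudes(frame)
```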
In a further embodiment, the psychoacoustic model training engine 108 is further configured to implement a neural network architecture for adaptive learning to discern between genuine and deepfake audio with each iteration. The implementation of a neural network architecture enables the system 100 to learn and adapt continuously based on new data, improving the ability to distinguish between genuine and deepfake audio over time. Said adaptive learning process is vital for keeping pace with the evolving techniques used in the creation of deepfake audio, so that the system 100 remains effective in identifying such content as technology advances.
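As a minimal stand-in for the neural network architecture (which the disclosure does not specify), the sketch below trains a single logistic neuron by iterative gradient updates on labeled feature vectors, showing the adapt-with-each-iteration idea in miniature:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=200):
    """One-neuron logistic model; labels are 1 = deepfake, 0 = genuine."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the pre-activation
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

A production system would use a deeper network over richer psychoacoustic features; the training loop structure (forward pass, error, weight update per iteration) is the part being illustrated.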
In a further embodiment, the anomaly detection module 110 further comprises a temporal analysis unit to assess the consistency of the psychoacoustic features over time. The inclusion of a temporal analysis unit allows for the examination of psychoacoustic features in a dynamic context, providing insight into the temporal consistency of the audio samples. Said assessment is crucial for identifying deepfake audio, as inconsistencies in psychoacoustic features over time can be indicative of manipulation. The ability to assess temporal consistency strengthens the anomaly detection capabilities, further enhancing accuracy in identifying deepfake content.
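One plausible realization of such temporal analysis, offered as an assumption rather than the claimed method, tracks a feature across frames and flags windows where its variance spikes (e.g., at a splice point):

```python
import statistics

def frame_consistency(feature_track, window=4):
    """Maximum variance of a per-frame feature over sliding windows.

    Genuine audio tends to vary smoothly; a spike in windowed variance can
    mark a spliced or synthesized segment.
    """
    variances = [statistics.pvariance(feature_track[i:i + window])
                 for i in range(len(feature_track) - window + 1)]
    return max(variances)
```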
In a further embodiment, the anomaly scoring module 112 is further configured to apply a weighted scoring unit that assigns different levels of importance to said psychoacoustic features based on their relevance to the human auditory perception. The application of weighted scoring allows for a nuanced analysis of psychoacoustic features, taking into account the varying degrees of impact different features have on the perception of audio authenticity. Said anomaly scoring module 112 facilitates that features more strongly indicative of deepfake audio are given greater consideration in the anomaly detection process, improving the precision in identifying fraudulent content.
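Such weighted scoring can be sketched as a normalized weighted sum; the feature names and weight values below are hypothetical placeholders, as the disclosure does not fix them:

```python
def weighted_anomaly_score(deviations, weights):
    """Combine per-feature deviations, weighting perceptually salient features more."""
    total_weight = sum(weights.values())
    return sum(weights[name] * dev for name, dev in deviations.items()) / total_weight

# Illustrative weights: loudness and pitch cues count for more than timbre here.
weights = {"loudness": 0.4, "pitch": 0.4, "timbre": 0.2}
score = weighted_anomaly_score({"loudness": 1.0, "pitch": 0.5, "timbre": 2.0}, weights)
```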
In a further embodiment, the threshold comparison module 114 further comprises a feedback loop component to the psychoacoustic model training engine 108 to refine the detection algorithm based on the classification results of said audio samples. The integration of a feedback loop enables continuous improvement of the detection algorithm by adjusting the model in response to the performance in classifying audio samples. Said self-refining mechanism maintains that the system 100 evolves in response to the findings, enhancing the effectiveness over time by incorporating insights gained from previous classifications into the detection efforts.
In an additional embodiment, the anomaly detection module 110 employs a multivariate analysis approach to detect complex patterns and interactions among psychoacoustic features indicative of deepfake audio. The use of multivariate analysis allows for an examination of the relationships between different psychoacoustic features, enabling the detection of intricate patterns that may not be evident through univariate analysis. Said approach is instrumental in identifying deepfake audio, as the manipulation of audio content often results in complex alterations across multiple features. The ability to detect such complex patterns enhances the capability of the system 100 to accurately identify deepfake anomalies.
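One standard multivariate technique that captures such feature interactions is the Mahalanobis distance; the two-feature version below is an illustrative choice, not necessarily the analysis the disclosure intends. It assumes a non-degenerate baseline (invertible 2x2 covariance):

```python
import statistics

def mahalanobis_2d(x, baseline):
    """Squared Mahalanobis distance of a 2-feature vector from a genuine baseline.

    Unlike per-feature z-scores, this accounts for correlation between the
    two features (the off-diagonal covariance term).
    """
    xs = [p[0] for p in baseline]
    ys = [p[1] for p in baseline]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx = statistics.pvariance(xs)
    vy = statistics.pvariance(ys)
    cov = sum((a - mx) * (b - my) for a, b in baseline) / len(baseline)
    det = vx * vy - cov * cov  # assumed non-zero (non-degenerate baseline)
    dx, dy = x[0] - mx, x[1] - my
    # Inverse of the 2x2 covariance matrix applied to the deviation vector.
    return (vy * dx * dx - 2 * cov * dx * dy + vx * dy * dy) / det

baseline = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
```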
In a further embodiment, the system 100 further comprises a reporting interface module configured to generate detailed reports of the analysis, classification results, and characteristics of detected deepfake audio samples for forensic analysis. The provision of a reporting interface module facilitates the documentation and review of the findings of the system 100, providing valuable insights into the nature and characteristics of detected deepfake content. Said capability is significant for forensic analysis, as it allows for the detailed examination of deepfake audio, contributing to efforts to understand and combat said form of digital manipulation.
The method 200 for detecting audio deepfake anomalies encompasses a series of steps designed to accurately identify and classify audio samples as either genuine or deepfake. Referring to the diagrammatic depiction put forth in FIG. 2, representing a flow diagram of the method 200, the method can comprise the steps of, yet is not restricted to: (at step 202) collecting a set of audio samples; (at step 204) normalizing volume levels across the collected audio samples; (at step 206) extracting acoustic and psychoacoustic features from the collected set of audio samples; (at step 208) utilizing machine learning algorithms to discern genuine audio from deepfake audio; (at step 210) comparing the extracted psychoacoustic features of the audio samples under test; (at step 212) calculating anomaly scores based on the identified deviations; and (at step 214) classifying one or more of said audio samples under test. Said steps of the method 200 can be performed or executed, collectively or selectively, randomly, sequentially, or in a combination thereof, in accordance with the embodiments of the current disclosure.
In an embodiment, at step 202, the collection of a set of audio samples is carried out, including human speech, music, and ambient sounds, using a data acquisition module (102). Said step 202 is foundational for the method 200, as the method 200 can maintain a diverse dataset upon which subsequent analysis can be conducted. The variety in audio samples is significant for the robustness of the detection process, enabling the system 100 to accurately identify anomalies across a wide range of audio types.
In an embodiment, at step 204, normalization of volume levels across the collected audio samples is performed alongside noise reduction and filtering to enhance audio quality, using an audio conditioning unit (104). Said step 204 is critical for establishing a uniform baseline across all samples, which is necessary for effective comparison and analysis. The enhancement of audio quality through noise reduction and filtering is important in facilitating that the features extracted in subsequent steps are not obscured by extraneous noise or variations in volume.
In an embodiment, at step 206, the extraction of acoustic and psychoacoustic features from the collected set of audio samples is executed, wherein the acoustic features relate to signal properties, using an audio feature extraction component (106). Said step 206 enables the method 200 to analyse the intrinsic properties of the audio samples, including both the physical signal properties and the aspects that influence human auditory perception. The extraction of said features is an important aspect of the method 200 in providing the detailed data necessary for the identification of deepfake audio.
In an embodiment, at step 208, machine learning algorithms are utilized to discern genuine audio from deepfake audio based on the extracted psychoacoustic features indicative of human auditory perception, with a psychoacoustic model training engine (108). Said step 208 leverages the power of machine learning to analyze the complex patterns within the psychoacoustic features, facilitating the differentiation between genuine and manipulated audio. The use of machine learning algorithms allows the method 200 to adapt and improve over time, enhancing the ability to detect deepfake anomalies as new examples are encountered.
In an embodiment, at step 210, a comparison of the extracted psychoacoustic features of the audio samples under test against features of genuine samples is made to identify deviations characteristic of deepfake audio, using an anomaly detection module (110). Said step 210 is crucial for pinpointing the specific differences that indicate a sample may be a deepfake, based on deviations from the established norms of genuine audio features. The ability to identify said deviations is key to the effectiveness of the method 200 in detecting deepfake content.
In an embodiment, at step 212, the calculation of anomaly scores based on the identified deviations from genuine audio characteristics is conducted with an anomaly scoring module (112). Said step 212 quantifies the deviations observed in the previous step, assigning a numerical value to the likelihood that a sample is deepfake. The anomaly scores are instrumental in providing a systematic and objective basis for classifying audio samples.
In an embodiment, at step 214, classification of one or more of the audio samples under test as a deepfake or genuine is based on the anomaly scores exceeding a predefined threshold and a score not exceeding the predefined threshold, respectively. Said final step 214 uses the anomaly scores to make a definitive classification of each sample, effectively separating genuine audio from deepfake. The use of a predefined threshold facilitates that the classification process is consistent and based on objective criteria, thereby enhancing the reliability of the method 200 in detecting audio deepfake anomalies.

In an embodiment, the method 200 described herein systematically addresses the challenge of detecting audio deepfake anomalies through a series of well-defined steps, each contributing to the overall accuracy and effectiveness of the process. The integration of machine learning algorithms and the focus on psychoacoustic features are particularly noteworthy, as they represent advanced approaches to distinguishing between genuine and manipulated audio content.
One or more embodiments of the system 100 may be described in detail with reference to the drawings, wherein like reference numerals represent like elements and assemblies throughout the present disclosure. The pictorial portrayals in FIG. 1, FIG. 2, and so forth are mere depictions or demonstrations of the system 100 and thus cannot limit its scope. However, those skilled in the art may prefer that the functional elements/embodiments included in the architectural setup of the system 100 be modified and updated (such as detachably coupled and replaced), as and when necessary, in accordance with the embodiments of the present disclosure.
FIG. 3 illustrates a flowchart for the method of detecting audio deepfake anomalies. The process begins with the collection of audio samples and proceeds with preprocessing. Subsequently, training of the psychoacoustic model takes place, which includes the extraction of both acoustic and psychoacoustic features. Anomaly detection is then performed, followed by a comparison and analysis of the samples. If an anomaly is detected, the process moves on to calculate an anomaly score. Said score is then compared to a predefined threshold to determine if the audio is deepfake. If the score exceeds the threshold, the audio is classified as deepfake; otherwise, it is classified as genuine. The process concludes after this classification step.
FIG. 4 illustrates a working decision flow diagram (DFD) for the method to detect audio anomalies, starting with an input phase and followed by preprocessing. Said DFD then emphasizes the psychoacoustic model training and feature extraction, where both acoustic and psychoacoustic features are extracted. The process continues with anomaly detection, sample comparison, and analysis, followed by the calculation of an anomaly score. The score is then evaluated and compared against a threshold, which leads to the decision-making step. The final step in said DFD is the output, where the result of the decision-making process is presented.
Referring to the preceding embodiment, the disclosure pertains to the field of cyber security and digital forensics, introducing said system 100 for detecting audio deepfake anomalies. With the ability of deepfake technology to generate convincing fake audio, there is a need for effective detection systems. The system 100 meets this need through a psychoacoustic anomaly detection approach that employs machine learning algorithms to scrutinize audio samples for subtle anomalies suggesting the presence of deepfakes.
Referring to the preceding embodiment, said system 100 is particularly significant in an era where misinformation can spread quickly, with audio deepfakes posing risks of deception, manipulation of public opinion, and threats to national security. By reliably identifying deepfakes, the system 100 upholds the integrity of audio content, allowing users to trust the media they consume. The system 100 starts by gathering a broad array of audio samples, essential for training the machine learning model to recognize authentic audio and identify deepfakes. The psychoacoustic model is pivotal in this process, focusing on features such as loudness, pitch, and timbre that are key to distinguishing manipulated audio.
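The loudness, pitch, and timbre features mentioned above can be approximated with simple signal-processing proxies. The proxies chosen here (RMS level in decibels, autocorrelation-based pitch, and spectral centroid) are illustrative assumptions, not the disclosure's prescribed psychoacoustic feature set:

```python
import numpy as np

def psychoacoustic_features(x, sr=16000):
    """Illustrative proxies for loudness, pitch, and timbre."""
    # Loudness proxy: RMS level in decibels relative to full scale.
    rms = np.sqrt(np.mean(x ** 2))
    loudness_db = 20 * np.log10(rms + 1e-12)
    # Pitch proxy: strongest autocorrelation lag in the 50-400 Hz range.
    ac = np.correlate(x, x, mode="full")[x.size - 1:]  # ac[k] = lag k
    lo, hi = sr // 400, sr // 50
    lag = lo + int(np.argmax(ac[lo:hi]))
    pitch_hz = sr / lag
    # Timbre proxy: spectral centroid (brightness of the spectrum).
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sr)
    centroid_hz = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    return {"loudness_db": float(loudness_db),
            "pitch_hz": float(pitch_hz),
            "timbre_centroid_hz": centroid_hz}

t = np.arange(16000) / 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * t)  # one second of a 220 Hz test tone
f = psychoacoustic_features(tone)
```

For the 220 Hz test tone, the pitch and centroid proxies both land near 220 Hz and the loudness proxy near -9 dBFS, giving a feature vector a downstream model could compare against genuine-audio statistics.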
Referring to the preceding embodiment, after training, the model examines new samples for deepfake indicators by analyzing acoustic and psychoacoustic features and comparing them to those of genuine samples. Detected anomalies result in an anomaly score which, when it surpasses a certain threshold, classifies the audio as deepfake; otherwise, the audio is considered genuine.
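One way to turn the feature comparison described above into a single anomaly score is a weighted |z|-deviation against statistics gathered from genuine samples. The feature order, baseline statistics, and weights below are hypothetical values chosen purely for illustration:

```python
import numpy as np

def weighted_anomaly_score(test_feats, genuine_mean, genuine_std, weights):
    """Weighted mean of per-feature |z|-deviations from genuine statistics."""
    z = np.abs((test_feats - genuine_mean) / genuine_std)
    return float(np.dot(weights, z) / np.sum(weights))

# Hypothetical baseline over (loudness_db, pitch_hz, centroid_hz).
mean = np.array([-9.0, 220.0, 220.0])
std  = np.array([1.5, 10.0, 25.0])
w    = np.array([0.2, 0.5, 0.3])  # hypothetical perceptual weights

score = weighted_anomaly_score(np.array([-8.5, 231.0, 260.0]), mean, std, w)
```

A sample matching the genuine baseline exactly scores zero, and larger deviations on heavily weighted features dominate the score, so the downstream threshold comparison acts on a perceptually weighted distance.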
Referring to the preceding embodiment, the approach of combining psychoacoustic principles with threshold-based classification enhances accuracy and reduces false positives. The applications are extensive, benefiting sectors where audio authenticity is critical. Additionally, the model adapts to new deepfake methods, maintaining ongoing efficacy. The system 100 is thus a valuable tool not only in countering misinformation but also in reinforcing security across various industries, highlighting its importance in contemporary digital security measures.
Example embodiments herein have been described above with reference to block diagrams and flowchart illustrations of methods and apparatuses. It will be understood that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, can be implemented by various means including hardware, software, firmware, and a combination thereof. For example, in one embodiment, each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations can be implemented by computer program instructions. These computer program instructions may be loaded onto a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks.
Throughout the present disclosure, the term ‘processing means’ or ‘microprocessor’ or ‘processor’ or ‘processors’ includes, but is not limited to, a general purpose processor (such as, for example, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a microprocessor implementing other types of instruction sets, or a microprocessor implementing a combination of types of instruction sets) or a specialized processor (such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), or a network processor).
The term “non-transitory storage device” or “storage” or “memory,” as used herein relates to a random access memory, read only memory and variants thereof, in which a computer can store data or software for any duration.
Operations in accordance with a variety of aspects of the disclosure described above need not be performed in the precise order described. Rather, various steps can be handled in reverse order, simultaneously, or not at all.
While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary, and the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.
Claims

I/We claim:

1. A system 100 for detecting audio deepfake anomalies, comprising:
a data acquisition module 102 configured to collect a set of audio samples including human speech, music, and ambient sounds;
an audio conditioning unit 104 configured to:
normalize volume levels across the collected audio samples; and
perform noise reduction and filtering to enhance audio quality;
an audio feature extraction component 106 configured to extract acoustic and psychoacoustic features from said collected set of audio samples, wherein said acoustic features relate to signal properties;
a psychoacoustic model training engine 108 configured to utilize machine learning algorithms to discern genuine audio from deepfake audio based on said extracted psychoacoustic features indicative of human auditory perception;
an anomaly detection module 110 configured to compare said extracted psychoacoustic features of said audio samples under test against features of genuine samples to identify deviations characteristic of deepfake audio;
an anomaly scoring module 112 configured to calculate anomaly scores based on the identified deviations from genuine audio characteristics; and
a threshold comparison module 114 configured to:
classify one or more of said audio samples under test as a deepfake or genuine based on:
the anomaly scores exceeding a predefined threshold; and
a score not exceeding the predefined threshold, respectively.
2. The system of claim 1, wherein the audio conditioning unit 104 additionally employs a dynamic range compression algorithm to perform uniform loudness normalization across said set of audio samples.
3. The system of claim 1, wherein the audio feature extraction component 106 is further configured to utilize a Fourier transform to convert said audio samples from the time domain to the frequency domain for detailed acoustic feature extraction.
4. The system of claim 1, wherein the psychoacoustic model training engine 108 is further configured to implement a neural network architecture for adaptive learning to discern between genuine and deepfake audio with each iteration.
5. The system of claim 1, wherein the anomaly detection module 110 further comprises a temporal analysis unit to assess the consistency of the psychoacoustic features.
6. The system of claim 1, wherein the anomaly scoring module 112 is further configured to apply a weighted scoring unit that assigns different levels of importance to said psychoacoustic features based on their relevance to human auditory perception.
7. The system of claim 1, wherein the threshold comparison module 114 further comprises a feedback loop component to the psychoacoustic model training engine 108 to refine the detection algorithm based on the classification results of said audio samples.
8. The system of claim 1, wherein the anomaly detection module 110 employs a multivariate analysis approach to detect complex patterns and interactions among psychoacoustic features indicative of deepfake audio.
9. The system of claim 1, further comprising a reporting interface module configured to generate detailed reports of the analysis, classification results, and characteristics of detected deepfake audio samples for forensic analysis.
10. A method 200 for detecting audio deepfake anomalies, the method 200 comprising the steps of:
(at step 202) collecting a set of audio samples including human speech, music, and ambient sounds using a data acquisition module (102);
(at step 204) normalizing volume levels across the collected audio samples and performing noise reduction and filtering to enhance audio quality using an audio conditioning unit (104);
(at step 206) extracting acoustic and psychoacoustic features from the collected set of audio samples, wherein the acoustic features relate to signal properties, using an audio feature extraction component (106);
(at step 208) utilizing machine learning algorithms to discern genuine audio from deepfake audio based on the extracted psychoacoustic features indicative of human auditory perception with a psychoacoustic model training engine (108);
(at step 210) comparing the extracted psychoacoustic features of the audio samples under test against features of genuine samples to identify deviations characteristic of deepfake audio using an anomaly detection module (110);
(at step 212) calculating the anomaly scores based on the identified deviations from genuine audio characteristics with an anomaly scoring module (112); and
(at step 214) classifying one or more of said audio samples under test as a deepfake or genuine based on:
the anomaly scores exceeding a predefined threshold; and
a score not exceeding the predefined threshold, respectively.

PSYCHOACOUSTIC ANOMALY DETECTION METHOD FOR AUDIO DEEPFAKE DETECTION

The present disclosure relates to a system (100) designed for the detection of audio deepfake anomalies through an approach integrating several modules. A data acquisition module (102) is tasked with collecting a diverse array of audio samples, including human speech, music, and ambient sounds. An audio conditioning unit (104) then normalizes volume levels and performs noise reduction and filtering to improve the audio quality of said samples. The core of the system, an audio feature extraction component (106), extracts both acoustic and psychoacoustic features, leveraging said features for the identification of genuine versus manipulated audio. Utilizing machine learning algorithms, a psychoacoustic model training engine (108) discerns between authentic and deepfake audio based on human auditory perception. Anomaly detection (110) and scoring modules (112) identify and quantify deviations from genuine audio characteristics. Finally, a threshold comparison module (114) classifies the audio as deepfake or genuine, providing a tool against the proliferation of audio misinformation.

Drawings
Fig. 1
Fig. 2
Fig. 3
Fig. 4


Documents

Application Documents

# Name Date
1 202421033114-OTHERS [26-04-2024(online)].pdf 2024-04-26
2 202421033114-FORM FOR SMALL ENTITY(FORM-28) [26-04-2024(online)].pdf 2024-04-26
3 202421033114-FORM 1 [26-04-2024(online)].pdf 2024-04-26
4 202421033114-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [26-04-2024(online)].pdf 2024-04-26
5 202421033114-EDUCATIONAL INSTITUTION(S) [26-04-2024(online)].pdf 2024-04-26
6 202421033114-DRAWINGS [26-04-2024(online)].pdf 2024-04-26
7 202421033114-DECLARATION OF INVENTORSHIP (FORM 5) [26-04-2024(online)].pdf 2024-04-26
8 202421033114-COMPLETE SPECIFICATION [26-04-2024(online)].pdf 2024-04-26
9 202421033114-FORM-9 [07-05-2024(online)].pdf 2024-05-07
10 202421033114-FORM 18 [08-05-2024(online)].pdf 2024-05-08
11 202421033114-FORM-26 [12-05-2024(online)].pdf 2024-05-12
12 202421033114-FORM 3 [13-06-2024(online)].pdf 2024-06-13
13 202421033114-RELEVANT DOCUMENTS [09-10-2024(online)].pdf 2024-10-09
14 202421033114-POA [09-10-2024(online)].pdf 2024-10-09
15 202421033114-FORM 13 [09-10-2024(online)].pdf 2024-10-09