Abstract: The present invention relates to a system and method for detecting AI-generated deepfake images, comprising a pre-processing module to resize and normalize input images, and a latent space inversion module comprising a StyleGAN2 architecture (100) to extract latent vectors from input images. A synthesis network (108) is configured to regenerate images from the extracted latent vectors, and a similarity computation module calculates similarity metrics, including Euclidean Distance, Cosine Similarity (CLIP ViT), SSIM, MSE, Perceptual Loss, PSNR, and Feature Extraction with SIFT, between original and regenerated images. A machine learning module is configured to train and deploy classification models, including an ANN, XGBoost, or KNN, using the computed similarity metrics, and an output module is configured to provide binary classification labels indicating whether the input image is real or AI-generated. The system is efficient, has fast processing speed, and has a low carbon footprint.
Description:
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of Invention:
SYSTEM AND METHOD FOR DETECTING AI-GENERATED DEEPFAKE IMAGES
Applicant:
TECH MAHINDRA LIMITED
A company Incorporated in India under the Companies Act, 1956
Having address:
Tech Mahindra Limited, Phase III, Rajiv Gandhi Infotech Park Hinjewadi,
Pune - 411057, Maharashtra, India
The following specification particularly describes the invention and the manner in which it is to be performed.
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[001] The present application does not claim priority from any patent application.
TECHNICAL FIELD
[002] The present invention relates to the field of detection of AI-generated DeepFake images. Particularly, the present invention involves the detection of AI-generated images (DeepFakes) using advanced inversion techniques, similarity metrics, and machine learning models. Further, the present invention relates to artificial intelligence, image processing, and cybersecurity.
BACKGROUND OF THE INVENTION
[003] Conventional Generative Adversarial Networks (GANs) like StyleGAN2 enable the creation of highly realistic synthetic images, leading to challenges in verifying digital authenticity. Deepfake content has proliferated widely since the evolution of GenAI, which may be used to create fake social media profiles, fake images, and videos with human faces and identities that do not exist. As the world moves into Web 3.0, where the internet is expected to become democratized, such fake identities can create serious challenges to maintaining the internet's sanity. Traditional methods lack robustness against evolving AI techniques.
[004] The rapid advancements in generative models, particularly StyleGAN, have revolutionized the field of image synthesis, enabling the creation of highly realistic artificial images. While these advancements have numerous positive applications, they also pose significant challenges, particularly in the realm of digital media integrity and the proliferation of misinformation. The ability to distinguish between AI-generated images and real images has become increasingly critical. Existing methods for detecting AI-generated images primarily focus on direct analysis techniques, which often fall short in accuracy and robustness when faced with sophisticated generative models.
[005] Therefore, to overcome the problems associated with the traditional methods, there is a need for a scalable and efficient system that combines StyleGAN2 inversion and sophisticated machine learning to detect such GenAI-generated human faces with 100% accuracy.
OBJECTS OF THE INVENTION
[006] The primary objective of the present invention is to provide a system for distinguishing AI-generated deepfake images from real ones in real-world applications.
[007] Another objective of the present invention is to provide a machine learning model with 100% Accuracy, Precision, Recall and F1-Score by demonstrating high effectiveness for real-world applications in digital media integrity and combating misinformation spread through AI-generated content.
[008] Yet another objective of the present invention is to provide a graphics processing unit (GPU) for inverting the StyleGAN and regenerating the image from the latent vector, and for evaluating the efficiency and scalability of the proposed methodology.
[009] Another objective of the present invention is to detect AI-generated DeepFakes while offering scalability, precision, and adaptability for real-world applications in cybersecurity, forensics, and media verification.
[0010] Yet another objective of the present invention is to provide an efficient system with fast processing speed and a low carbon footprint.
SUMMARY OF THE INVENTION
[0011] Before the present system is described, it is to be understood that this application is not limited to the particular machine, device, or system, as there can be multiple possible embodiments that are not expressly illustrated in the present disclosures. It is also to be understood that the terminology used in the description is to describe the particular versions or embodiments only, and is not intended to limit the scope of the present application. This summary is provided to introduce aspects related to a system for detecting AI-generated deepfake images, and the aspects are further elaborated below in the detailed description. This summary is not intended to identify essential features of the proposed subject matter nor is it intended for use in determining or limiting the scope of the proposed subject matter.
[0012] In an embodiment, the present invention provides a system for detecting AI-generated deepfake images. The system comprises a database configured to store input images. In an embodiment, the input images comprise real and AI-generated images. The system further comprises a pre-processing module configured to resize and normalize input images; a latent space inversion module comprising a StyleGAN2 architecture, configured to extract latent vectors from input images; a synthesis network, configured to regenerate images from the extracted latent vectors; a similarity computation module, configured to calculate similarity metrics, including Euclidean Distance, Cosine Similarity using Contrastive Language-Image Pre-training with Vision Transformer (CLIP ViT), Structural Similarity Index Measure (SSIM), Mean Squared Error (MSE), Perceptual Loss, Peak Signal-to-Noise Ratio (PSNR), and Feature Extraction with Scale-Invariant Feature Transform (SIFT), between original and regenerated images; a machine learning module configured to train and deploy classification models, including an Artificial Neural Network (ANN), XGBoost, or K-Nearest Neighbors (KNN), using the computed similarity metrics; and an output module configured to provide binary classification labels indicating whether the input image is real or AI-generated. The system is efficient, has fast processing speed, and has a low carbon footprint. The system achieves an impressive 100% accuracy on the training and testing datasets in distinguishing between real and AI-generated images, highlighting its potential for real-world applications.
[0013] In an embodiment, the system achieves scalability by leveraging a modular architecture that supports multiple GAN models (e.g., StyleGAN2, BigGAN, CycleGAN) without retraining from scratch. The use of latent space inversion ensures adaptability across various AI-generated image sources, making it effective for evolving deepfake techniques. The system optimizes the GPU utilization through batch processing and lightweight machine learning models like XGBoost to maintain a low carbon footprint, which reduces computational overhead. Additionally, similarity metric-based detection minimizes redundant processing, ensuring energy-efficient deepfake verification. The method's real-time inference capability allows seamless deployment on edge devices, further reducing cloud dependency and power consumption.
[0014] In an embodiment, the present invention provides a system for detecting AI-generated deepfake images that includes a database configured to store input images, which comprise both real and AI-generated images. A pre-processing module, coupled with the database, receives at least one input image and performs resizing and normalization on it. The module then performs StyleGAN inversion on the resized and normalized input image using a pre-trained StyleGAN2 model. This process initializes a latent vector in a latent space and optimizes the latent vector through iterative optimization to generate a regenerated version of the input image. The system calculates a plurality of similarity metrics between the input image and the regenerated image, including Euclidean distance, cosine similarity using CLIP Vision Transformer embeddings, mean squared error (MSE), perceptual loss using a VGG16 model, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and a scale-invariant feature transform (SIFT) metric. These calculated similarity metrics are then input into a trained classification model, which determines, based on its output, whether the input image is AI-generated.
[0015] In an embodiment, the present disclosure provides that the StyleGAN2 model (100) comprises a mapping network configured to transform an initial latent vector into a disentangled intermediate space, and a synthesis network comprising: adaptive instance normalization (AdaIN) layers configured to modulate activations based on the latent vector; and noise inputs configured to add stochastic variation to generated images.
[0016] In an embodiment, the present disclosure provides that optimizing the latent vector comprises: utilizing a perceptual loss function based on a pre-trained VGG network to measure similarity between the regenerated image and the input image; applying noise regularization to stabilize the optimization; and using an Adam optimizer to iteratively update the latent vector and noise patterns.
[0017] In an embodiment, the present disclosure provides that calculating the cosine similarity metric comprises: converting the input image and regenerated image to RGB format; pre-processing the images using CLIP's preprocessing pipeline; generating high-dimensional vector representations using CLIP's image encoder; and computing a cosine similarity score between the vector representations.
[0018] In an embodiment, the present disclosure provides that the classification model comprises an XGBoost model configured with: a colsample_bytree parameter set to 1.0; a gamma parameter set to 0.0; a learning rate of 0.1; a maximum tree depth of 3; a minimum child weight of 3; 100 estimators; and a subsample rate of 0.8.
[0019] In still another embodiment, the present disclosure provides that the system is configured to standardize the calculated similarity metrics using a standard scaler before inputting them into the classification model.
[0020] In yet another embodiment, the present disclosure provides that determining whether the input image is AI-generated comprises: generating a probability score indicating likelihood of the input image being AI-generated; and classifying the input image as AI-generated when the probability score exceeds a predetermined threshold.
[0021] In still another embodiment, the present disclosure provides that the StyleGAN2 model (100) is configured to: map input images to a latent space (104); generate high-resolution images using the synthesis network (108), wherein the synthesis network (108) comprises an Adaptive Instance Normalization (AdaIN) (110) and convolutional layers (112); incorporate noise inputs (114) at various stages of image synthesis, configured to enhance realism and authenticity; employ a progressive growing architecture to upscale images from a low initial resolution to a high final resolution; and implement the machine learning models, configured to train on the similarity metrics for robust and accurate classification.
[0022] In an embodiment, the present disclosure provides that the system is configured to train the classification models using a labelled dataset of the original and regenerated images.
[0023] In an embodiment, the present disclosure provides that the system comprises an evaluation module for assessing system performance using accuracy, precision, recall, and F1 score metrics.
[0024] In an embodiment, the present disclosure provides that the classification model comprises machine-learning models, such as artificial neural networks (ANN), XGBoost, and K-Nearest Neighbors (KNN), to classify the images based on the calculated image similarity metrics.
[0025] In an embodiment, the present disclosure provides that the Euclidean distance computes the straight-line distance between two feature vectors, and the system converts images into 1-dimensional feature vectors by resizing and flattening the images.
[0026] In an embodiment, the present disclosure provides that the system employs the Vision Transformer (ViT) model from CLIP (Contrastive Language-Image Pre-training) to compare the similarity between two images using cosine similarity.
[0027] In an embodiment, the present disclosure provides that the system calculates the resulting score for cosine similarity, which ranges from -1 to 1 and is scaled to a percentage value between 0 and 100 for easier interpretation.
[0028] In an embodiment, the present disclosure provides that the Mean Squared Error (MSE) measures the average squared difference between corresponding pixels of two images, providing a quantitative assessment of image similarity.
[0031] In yet another embodiment, the present disclosure provides that the system calculates the Perceptual Loss using the VGG16 model to compare high-level perceptual features between images.
[0032] In an embodiment, the system is configured to be used for audio and video deepfakes, and wherein the processor comprises a graphics processing unit (GPU).
[0033] In yet another embodiment, the present disclosure provides a method for detecting AI-generated deepfake images that involves storing input images in a database, where the images include both real and AI-generated images. The method includes receiving at least one input image from the database, performing resizing and normalization on the received image, and then performing StyleGAN inversion using a pre-trained StyleGAN2 model. This process initializes a latent vector in a latent space and optimizes the latent vector through iterative optimization to generate a regenerated version of the input image. The method calculates several similarity metrics between the input image and the regenerated image, including Euclidean distance, cosine similarity using CLIP Vision Transformer embeddings, mean squared error (MSE), perceptual loss using a VGG16 model, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and a scale-invariant feature transform (SIFT) metric. These calculated similarity metrics are then input into a trained classification model, which determines whether the input image is AI-generated based on its output.
BRIEF DESCRIPTION OF DRAWING
[0034] The foregoing summary, as well as the following detailed description of embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the disclosure, there is shown in the present document example constructions of the disclosure, however, the disclosure is not limited to the specific methods and device disclosed in the document and the drawing. The detailed description is described with reference to the following accompanying figures.
[0035] Figure 1: illustrates the StyleGAN2 Architecture, in accordance with an embodiment of the present subject matter.
[0036] Figure 2: illustrates example images of a generative adversarial network (GAN) Inversion Output, in accordance with an embodiment of the present subject matter.
[0037] Figure 3(a): illustrates the graphical representation of the histogram of the Euclidean Distance metrics, in accordance with an embodiment of the present subject matter.
[0038] Figure 3(b): illustrates the graphical representation of the histogram of the Cosine Similarity metrics using CLIP Vision Transformer, in accordance with an embodiment of the present subject matter.
[0039] Figure 3(c): illustrates the graphical representation of the histogram of the Mean Squared Error (MSE) metrics, in accordance with an embodiment of the present subject matter.
[0040] Figure 3(d): illustrates the graphical representation of the histogram of the Perceptual Loss using VGG16, in accordance with an embodiment of the present subject matter.
[0041] Figure 3(e): illustrates the graphical representation of the histogram of the Structural Similarity Index Measure (SSIM), in accordance with an embodiment of the present subject matter.
[0042] Figure 3(f): illustrates the graphical representation of the histogram of the Peak Signal-to-Noise Ratio, in accordance with an embodiment of the present subject matter.
[0043] Figure 3(g): illustrates the graphical representation of the histogram of the Feature Extraction with SIFT (Scale-Invariant Feature Transform), in accordance with an embodiment of the present subject matter.
[0044] Figure 4: illustrates the graphical representation of the Feature Importance Graph, in accordance with an embodiment of the present subject matter.
[0045] Figure 5: illustrates the graphical representation of the loss of an artificial neural network (ANN) model, in accordance with an embodiment of the present subject matter.
[0046] Figure 6: illustrates the graphical representation of the accuracy of an artificial neural network (ANN) model, in accordance with an embodiment of the present subject matter.
[0047] Figure 7: illustrates the graphical representation of the 2D Decision Boundary Plot of the XGBoost model, in accordance with an embodiment of the present subject matter.
[0048] Figure 8: illustrates the graphical representation of the 3D Decision Boundary Plot of the XGBoost model, in accordance with an embodiment of the present subject matter.
[0049] Figure 9: illustrates the graphical representation of the KNN Classifier Decision Boundary Plot, in accordance with an embodiment of the present subject matter.
[0050] Figure 10: illustrates the graphical representation of the Confusion Matrix, in accordance with an embodiment of the present subject matter.
[0051] Figure 11 illustrates a flow chart performing a method for detection of the AI-generated deepfake images, in accordance with an embodiment of the present subject matter.
[0052] The figures depict various embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures illustrated herein may be employed without departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION OF THE INVENTION
[0053] Some embodiments of this disclosure, illustrating all its features, will now be discussed in detail. The words "comprising", "having", and "including", and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any devices and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the exemplary devices and methods are now described. The disclosed embodiments are merely exemplary of the disclosure, which may be embodied in various forms.
[0054] Various modifications to the embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. However, one of ordinary skill in the art will readily recognize that the present disclosure is not intended to be limited to the embodiments illustrated, but is to be accorded the widest scope consistent with the principles and features described herein.
[0055] Following is a list of elements and reference numerals used to explain various embodiments of the present subject matter.
Reference Numeral    Element Description
100                  StyleGAN2 Architecture/model
102                  Mapping Network
104                  Latent Space
106                  Intermediate Latent Space 𝑊
108                  Synthesis Network
110                  Adaptive Instance Normalization (AdaIN)
112                  Convolution layers
114                  Noise input
1100                 Method
[0056] The present invention relates to a system for detecting AI-generated images using StyleGAN2 inversion and comprehensive image similarity metrics. The proposed methodology calculates similarity metrics between original and regenerated images to train machine learning models for high-accuracy classification. The system supports various GAN architectures and achieves 100% precision, recall, and F1-scores, making it a highly robust tool for combating misinformation, ensuring media authenticity, and strengthening cybersecurity frameworks. The system achieved 100% accuracy in detecting AI-generated images, offering substantial applications in cybersecurity, digital media integrity, and forensic investigations.
[0057] The system leverages the inversion of StyleGAN to extract latent vectors from images, both real and AI-generated. By regenerating images from these latent vectors, the system can compute a comprehensive set of image similarity metrics that quantify the fidelity of the regenerated images relative to their originals.
[0058] In an embodiment, a method is proposed to detect AI-generated images and maintain the authenticity of visual content, thus helping to combat the spread of manipulated media and misinformation. The method demonstrates high accuracy in distinguishing AI-generated images from real ones in real-world applications.
[0059] The present system relates to the classification of AI-generated images and real-world photographs using advanced image analysis and machine learning techniques. The system comprises at least two datasets. The first dataset may be constructed by gathering AI-generated images produced by StyleGAN2, obtained from thispersondoesnotexist.com. The second dataset gathers real images obtained from the CineFace10 dataset on Kaggle, ensuring a diverse representation of real-world photographs. The system initially pre-processes the images and employs StyleGAN inversion to extract latent vectors from both sets of images, followed by image re-generation to create corresponding re-generated images. Further, image similarity metrics may be computed between original images and their re-generated counterparts, including Euclidean Distance, Cosine Similarity (CLIP ViT), Mean Squared Error (MSE), Perceptual Loss, Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Feature Extraction with SIFT.
[0060] In an embodiment, the dataset may be structured to include pairs of original and re-generated images, labelled as Class 0 (AI-generated and re-generated pairs) and Class 1 (real and re-generated pairs). Further, the system implements multiple machine learning and deep learning models to select the optimal model for classification. In an embodiment, evaluation metrics such as accuracy, precision, recall, and F1 score were used to assess the performance of the models. The chosen model may be then deployed to classify new images, ensuring robust performance in distinguishing AI-generated images from real photographs based on their similarity metrics. The methodology ensures a systematic approach to accurately classify and differentiate between AI-generated and real images using state-of-the-art techniques in image analysis and machine learning.
[0061] Figure 1 illustrates the StyleGAN2 Architecture (100), in accordance with an embodiment of the present subject matter. StyleGAN2 introduces advancements in generative adversarial networks (GANs), particularly in image synthesis and manipulation. The architecture comprises a mapping network (102) that transforms initial latent vectors into a disentangled intermediate space 𝑊 (106), and a synthesis network (108) that utilizes Adaptive Instance Normalization (AdaIN) (110) and noise inputs (114) to produce high-resolution, realistic images.
[0062] In an embodiment, the Mapping Network (102) and Latent Spaces (104) are explained further. The mapping network 𝑓 (102) transforms a simple input latent vector 𝑧 (sampled from a standard normal distribution, 𝑍) into an intermediate latent vector 𝑤. This network consists of a plurality of fully connected (FC) layers configured to disentangle the latent space. The mapping network (102) distributes the latent vectors more evenly in the latent space 𝑊 (106), facilitating better control and manipulation of the generated images.
[0063] In an embodiment, the Latent Spaces 𝑍 (104) and 𝑊 (106) are illustrated further. The latent space 𝑍 (104) is the initial space from which the input latent vectors are sampled. These vectors are then mapped to the intermediate latent space 𝑊 (106) through the mapping network (102). The latent space 𝑊 (106) is designed to be more disentangled, meaning that different aspects of the image such as pose, texture, and color are controlled more independently. This disentanglement enhances the ability to manipulate specific features of the generated images without affecting others.
[0064] The system further exploits disentanglement, which refers to the separation of different generative factors within the latent space. In a well-disentangled latent space, changes in one dimension of the latent vector correspond to changes in a specific aspect of the generated image (e.g., changing the vector might alter only the pose or color of the image). This property is crucial for tasks such as image editing and style transfer, as it allows for precise control over the generated content.
[0065] In an embodiment, the Synthesis Network (108) further comprises Adaptive Instance Normalization (AdaIN) (110), Noise Inputs (114), and the network architecture. In an embodiment, Adaptive Instance Normalization (AdaIN) (110) is a key feature in StyleGAN2 that modulates the activations of the generator network based on the latent vector 𝑤. This modulation is achieved by adjusting the mean and variance of the activations, effectively controlling the style of the generated image at different levels of the network. AdaIN (110) allows for fine-grained control over the image synthesis process, enabling the generation of diverse and high-quality images.
[0066] In an embodiment, the synthesis network (108) also incorporates noise inputs (114) at various stages. These noise inputs (114) add stochastic variation to the generated images, enhancing the realism by introducing minor variations and details. The synthesis network 𝑔 (108) consists of multiple layers where each layer applies AdaIN (110) and convolution operations. The network progressively up-samples the spatial resolution of the image from a small starting resolution (e.g., 4x4) to the final output resolution (e.g., 1024x1024). This progressive growing approach helps in generating high-resolution images with fine details.
[0067] Figure 2 illustrates example images of a generative adversarial network (GAN) Inversion Output (200), in accordance with an embodiment of the present subject matter. Figure 2 illustrates the process of inverting an image to obtain its latent code using a pre-trained StyleGAN2 model. In an embodiment, GAN inversion is configured to understand and navigate the latent space, allowing the system to regenerate images and compare them to the originals for deepfake detection.
[0068] The system first performs image pre-processing. The target images are resized to 256x256 pixels, center-cropped, and normalized to match the input requirements of the StyleGAN2 model. This ensures that the images are in a consistent format for processing. The system further performs latent space initialization, starting with a random latent vector 𝑧 from the latent space. The optimization is configured to find a latent vector 𝑤 that, when passed through the generator, produces an image that closely resembles the target image. The system uses a perceptual loss function based on a pre-trained VGG network to measure the similarity between the generated and target images. The loss function also includes a noise regularization term to stabilize the optimization.
[0069] In an embodiment, the system further performs an iterative optimization. An Adam optimizer is used to iteratively update the latent vector and noise patterns. The learning rate is dynamically adjusted to fine-tune the generated image. The system monitors the progress by visualizing the generated image alongside the target image at regular intervals throughout the optimization. In an embodiment, the system generates the final output. The optimized latent vector 𝑧 is obtained, which is then used to regenerate the target image.
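By way of a non-limiting illustration, the following Python sketch outlines one possible form of the inversion loop described above. It assumes a pre-trained StyleGAN2 generator G exposing mapping and synthesis sub-networks in the style of the public stylegan2-ada-pytorch code and uses the LPIPS library as a stand-in for the VGG-based perceptual loss; the function name, step count, learning rate, and noise-regularization weight are illustrative only and are not prescribed by the present disclosure.

import torch
import torch.nn.functional as F
import lpips  # VGG-based perceptual similarity, standing in for the VGG perceptual loss described above

def invert_image(G, target, num_steps=500, lr=0.1, noise_weight=1e5, device="cuda"):
    """Optimize a latent so that G regenerates an image close to `target` (1x3xHxW, values in [-1, 1])."""
    percept = lpips.LPIPS(net="vgg").to(device)

    # Initialize the latent from the mean of many mapped samples, a common StyleGAN inversion heuristic.
    z = torch.randn(1000, G.z_dim, device=device)
    w = G.mapping(z, None).mean(dim=0, keepdim=True).clone().requires_grad_(True)  # 1 x num_ws x w_dim

    noise_bufs = [buf for name, buf in G.synthesis.named_buffers() if "noise_const" in name]
    for buf in noise_bufs:
        buf.requires_grad_(True)

    opt = torch.optim.Adam([w] + noise_bufs, lr=lr)
    for _ in range(num_steps):
        synth = G.synthesis(w, noise_mode="const")
        synth_small = F.interpolate(synth, size=target.shape[-2:], mode="area")
        loss = percept(synth_small, target).sum()
        # Noise regularization discourages the per-pixel noise maps from absorbing image content.
        reg = sum((buf * torch.roll(buf, 1, dims=-1)).mean() ** 2 for buf in noise_bufs)
        opt.zero_grad()
        (loss + noise_weight * reg).backward()
        opt.step()

    with torch.no_grad():
        regenerated = G.synthesis(w, noise_mode="const")
    return w.detach(), regenerated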
[0070] In another embodiment, the image generation from the optimized Latent vector using StyleGAN Generator is illustrated further:
[0071] In an embodiment, once the latent vector 𝑧 has been optimized through the inversion process, the system proceeds to regenerate the image using the StyleGAN2 generator. StyleGAN2 incorporates a mapping network (102) that disentangles initial latent vectors into an intermediate space 𝑊 (106) and a synthesis network (108) that utilizes Adaptive Instance Normalization (AdaIN) (110) and noise inputs (114) to produce high-resolution, realistic images.
[0072] Firstly, the optimized latent vector 𝑧, obtained from the inversion process, serves as input to the StyleGAN2 generator. The system further performs image synthesis: StyleGAN's synthesis network 𝑔 (108) utilizes 𝑧 to generate an image, involving AdaIN (110) for style control and noise inputs (114) for stochastic variation. Progressive growing is then performed: starting from a low resolution (e.g., 4x4), the generator progressively up-samples the image to the desired resolution (e.g., 1024x1024), maintaining fine details and high visual fidelity.
[0073] The system finally generates the Final Output. The output is a regenerated image based on the optimized latent vector 𝑧. This image is used for further analysis and comparison with the original target image.
[0074] In an embodiment, the system performs image similarity computation between the re-generated image and the input image. The system performs a rigorous evaluation of various metrics to quantify the similarity or dissimilarity between real and generated images.
[0075] As per one example, the system focuses on distinguishing AI-generated images from real images, particularly facial images. The dataset comprises a total of 10,000 images, divided equally between AI-generated faces and real celebrity faces. The following sections provide detailed descriptions of the sources and characteristics of the images.
[0076] A. AI generated images
• Source: The AI-generated images are obtained from the website thispersondoesnotexist.com.
• Generation Method: These images are created using StyleGAN2, an advanced generative adversarial network (GAN) that synthesizes highly realistic human faces.
• Number of Images: 5,000
• Image Characteristics: Each image is a high-quality representation of a human face, typically devoid of common artefacts or distortions associated with less sophisticated GANs. These images are entirely synthetic and do not correspond to any real individuals.
[0077] B. Real Images
• Source: The real images are sourced from the CineFace10 dataset on Kaggle.
• Content: This dataset contains facial images of celebrities, covering a diverse range of actors and actresses from various movies and television shows.
• Number of Images: 5,000
• Image Characteristics: The real images feature a variety of facial expressions, lighting conditions, and backgrounds typical of professional and candid photography. The celebrities' faces provide a rich set of real-world variations.
[0078] As illustrated above, Images are stored in a common format (e.g., JPG or PNG) suitable for processing by algorithms. Each image is labelled with its source category (AI generated or real) to facilitate supervised learning tasks. This dataset provides a robust foundation for training and evaluating machine learning models aimed at distinguishing between AI-generated and real human faces. The inclusion of diverse and high-quality images from both categories ensures that the models can generalize well to different types of facial images.
[0079] Figure 3(a) illustrates the graphical representation of the histogram of the Euclidean Distance metrics (300a), in accordance with an embodiment of the present subject matter. The Euclidean distance computes the straight-line distance between two feature vectors. The system converts each image into a 1-dimensional feature vector. Initially, the generated image is resized to match the dimensions of the real image for direct comparison. Subsequently, both images are flattened into 1-dimensional arrays. This transforms the images into feature vectors where each element represents a pixel value. The formula for calculating the Euclidean distance is shown below:
d(a, b) = √(Σᵢ (aᵢ - bᵢ)²)
Where, a and b are the flattened vectors representing the real and generated images, respectively.
[0080] In an embodiment, the Resulting Score is also calculated. In an embodiment, a smaller Euclidean distance indicates greater similarity between the images, suggesting that they share similar pixel patterns. In another embodiment, a larger Euclidean distance indicates greater dissimilarity between the images, suggesting that the pixel values in the two images differ more significantly.
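As a non-limiting illustration, a minimal Python sketch of the Euclidean distance computation described above is given below; the file-path arguments and the 256x256 working size are assumptions.

import numpy as np
from PIL import Image

def euclidean_distance(real_path, regenerated_path, size=(256, 256)):
    """Resize both images to a common size, flatten them to 1-D vectors, and return the straight-line distance."""
    a = np.asarray(Image.open(real_path).convert("RGB").resize(size), dtype=np.float32).ravel()
    b = np.asarray(Image.open(regenerated_path).convert("RGB").resize(size), dtype=np.float32).ravel()
    return float(np.linalg.norm(a - b))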
[0081] Figure 3(b) illustrates the graphical representation of the histogram of the Cosine Similarity metrics using CLIP Vision Transformer (300b), in accordance with an embodiment of the present subject matter. The system employs the Vision Transformer (ViT) model from CLIP (Contrastive Language-Image Pre-training), specifically the 'ViT-B-16-plus-240' model pre-trained on the LAION-400M dataset, to compare the similarity between two images. The embeddings of these images are compared using cosine similarity. Cosine similarity measures the cosine of the angle between two vectors, providing a metric of similarity. The system performs image conversion and pre-processing: each image is converted to RGB format, and the images are pre-processed using CLIP's pre-processing pipeline.
[0082] In an embodiment, the system further performs the Image Encoding into a High-Dimensional Vector. The pre-processed images are passed through the CLIP model's image encoder. This process generates high-dimensional vector representations (embeddings) that capture the essential visual features of the images. The formula for calculating the cosine similarity is shown below:
cos(θ) = (A · B) / (||A|| · ||B||)
Where, A and B are the high-dimensional vectors (embeddings) of the two images.
A · B is the dot product of the vectors.
|| A || and || B || are the magnitudes (norms) of the vectors.
[0083] Further, the system calculates the resulting score. The cosine similarity score ranges from -1 to 1. The cosine similarity score is then scaled to a percentage value between 0 and 100 for easier interpretation by multiplying the cosine similarity score by 100.
Scaled Similarity Score = (cos (θ)) · 100
[0084] This transformation provides a more intuitive percentage score, where: 100% indicates the images are identical. Lower scores indicate decreasing similarity. A score of 0% would indicate no similarity.
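By way of a non-limiting example, the following sketch computes the scaled cosine similarity using the open_clip library; the 'laion400m_e32' pretrained tag is an assumption about which published LAION-400M weights correspond to the 'ViT-B-16-plus-240' model referenced above.

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-plus-240", pretrained="laion400m_e32")
model.eval()

def clip_cosine_similarity(path_a, path_b):
    """Encode both images with CLIP's image encoder and return the cosine similarity scaled to 0-100."""
    with torch.no_grad():
        emb = [model.encode_image(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
               for p in (path_a, path_b)]
    a, b = (e / e.norm(dim=-1, keepdim=True) for e in emb)
    return float((a @ b.T).item() * 100.0)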
[0085] Figure 3(c) illustrates the graphical representation of the histogram of the Mean Squared Error (MSE) metrics (300c), in accordance with an embodiment of the present subject matter. The Mean Squared Error (MSE) measures the average squared difference between corresponding pixels of two images, providing a quantitative assessment of image similarity. The formula for calculating the MSE is shown below:
MSE = (1/n) Σᵢ (aᵢ - bᵢ)²
Where, aᵢ and bᵢ are the pixel values of the real and generated images, respectively, and n is the number of pixels.
[0086] The system further calculates the Resulting Score of the MSE. In an embodiment, a lower MSE indicates higher similarity. In another embodiment, a higher MSE indicates greater dissimilarity between the images.
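A minimal sketch of the MSE computation follows, assuming both images are already provided as equally sized NumPy arrays:

import numpy as np

def mse(a, b):
    """Average squared difference between corresponding pixels of two equally sized images."""
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))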
[0087] Figure 3(d) illustrates the graphical representation of the histogram of the Perceptual Loss using VGG16 (300d), in accordance with an embodiment of the present subject matter. The system calculates Perceptual Loss using the VGG16 model to compare high-level perceptual features between images. The images are passed through the pre-trained VGG16 model to extract feature representations from specific layers. In an embodiment, for feature extraction, the images are fed into the VGG16 model and feature maps are extracted from chosen convolutional layers (112).
[0088] The system further calculates the loss. The perceptual loss is computed as the difference between these feature maps. The formula for the Perceptual Loss is shown below:
Perceptual Loss = Σᵢ ||φᵢ(a) - φᵢ(b)||²
[0089] Where, φᵢ(a) and φᵢ(b) represent the feature maps extracted from the i-th layer for the real and generated images, respectively.
[0090] In an embodiment, the resulting score for the perceptual loss is calculated. In one embodiment, a lower perceptual loss indicates higher perceptual similarity between the images. In another embodiment, a higher perceptual loss indicates lower perceptual similarity between the images.
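A non-limiting sketch of the perceptual loss using torchvision's pre-trained VGG16 is given below; the particular layer indices (relu1_2, relu2_2, relu3_3, relu4_3) and the ImageNet normalization are assumptions, as the disclosure does not fix the layers used.

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

_vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
_layers = {3, 8, 15, 22}  # relu1_2, relu2_2, relu3_3, relu4_3 (illustrative choice)
_prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def perceptual_loss(path_a, path_b):
    """Sum of mean squared differences between VGG16 feature maps of the two images."""
    a = _prep(Image.open(path_a).convert("RGB")).unsqueeze(0)
    b = _prep(Image.open(path_b).convert("RGB")).unsqueeze(0)
    loss = 0.0
    with torch.no_grad():
        for i, layer in enumerate(_vgg):
            a, b = layer(a), layer(b)
            if i in _layers:
                loss += F.mse_loss(a, b).item()
    return loss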
[0091] Figure 3(e) illustrates the graphical representation of the histogram of the Structural Similarity Index Measure (SSIM) (300e), in accordance with an embodiment of the present subject matter. The system further calculates the Structural Similarity Index Measure. The SSIM evaluates the similarity between two images based on luminance, contrast, and structure, providing a more perceptually relevant measure of image quality. The formula for the calculation of the SSIM is shown below:
SSIM(a, b) = ((2·μa·μb + C1)(2·σab + C2)) / ((μa² + μb² + C1)(σa² + σb² + C2))
Where, μa, μb and σa, σb denote the mean and standard deviation of the images, respectively, σab denotes their covariance, and C1 and C2 are constants.
[0092] The Resulting Score of the SSIM ranges from -1 to 1. In one embodiment, a score of 1 indicates perfect similarity. In another embodiment, values closer to -1 indicate dissimilarity.
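A minimal sketch using scikit-image's SSIM implementation follows; channel_axis=-1 assumes RGB images stored as HxWx3 uint8 arrays of identical shape.

from skimage.metrics import structural_similarity

def ssim_score(a, b):
    """SSIM between two RGB uint8 images of identical shape; 1.0 indicates perfect similarity."""
    return float(structural_similarity(a, b, channel_axis=-1, data_range=255))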
[0093] Figure 3(f) illustrates the graphical representation of the histogram of the Peak Signal-to-Noise Ratio (300f), in accordance with an embodiment of the present subject matter. The Peak Signal-to-Noise Ratio (PSNR) quantifies the peak error between the original and generated images, providing a measure of image reconstruction quality. The formula for calculating the PSNR value is shown below:
PSNR = 10 · log₁₀(MAX² / MSE)
Where, MAX is the maximum possible pixel value and MSE is the mean squared error.
[0094] The system further calculates the Resulting Score of the PSNR. In one embodiment, higher PSNR values indicate better image quality and higher similarity. In another embodiment, lower PSNR values indicate poorer image quality and lower similarity.
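A minimal sketch of the PSNR computation for 8-bit images (MAX = 255) is given below, assuming equally sized NumPy array inputs:

import numpy as np

def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; higher values indicate greater similarity."""
    err = np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2)
    return float("inf") if err == 0 else float(10.0 * np.log10(max_value ** 2 / err))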
[0095] Figure 3(g) illustrates the graphical representation of the histogram of the Feature Extraction with SIFT (Scale-Invariant Feature Transform) (300g), in accordance with an embodiment of the present subject matter. Feature Extraction with SIFT (Scale-Invariant Feature Transform) detects key points and descriptors in images to enable accurate image comparisons based on distinct features. SIFT identifies key points in the images that are invariant to scale and rotation and computes descriptors for each key point to characterize local image regions.
[0096] The system further calculates the Resulting Score of SIFT. The similarity between images is assessed by matching key points and comparing their descriptors. In one embodiment, a higher number of matching key points indicates greater similarity. In another embodiment, a lower number of matching key points indicates lower similarity.
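By way of a non-limiting example, one way to turn SIFT keypoint matching into a single similarity score is to count descriptor matches that pass Lowe's ratio test, as sketched below using OpenCV; the ratio threshold of 0.75 is an assumption.

import cv2

def sift_match_count(path_a, path_b, ratio=0.75):
    """Count SIFT descriptor matches passing the ratio test; more matches imply greater similarity."""
    sift = cv2.SIFT_create()
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    return sum(1 for pair in matches
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)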
[0097] Figure 4 illustrates the graphical representation of the Feature Importance Graph (400), in accordance with an embodiment of the present subject matter. In an embodiment, the Euclidean distance and the Cosine Similarity using CLIP Vision Transformer may be the most important features. These metrics distinguish between re-generated and original images. They are particularly strong at capturing the overall meaning of images and the exact differences at a pixel level, which is crucial for accurately assessing how similar two images are.
[0098] In another embodiment, the dataset preparation for model training of the system is illustrated further. In an embodiment, for the dataset preparation, each real image is paired with its corresponding regenerated version obtained through StyleGAN inversion. Similarly, every generated (fake) image is paired with its regenerated counterpart. This pairing process ensures that each pair is aligned accurately for precise metric calculation, utilizing metrics such as Euclidean Distance, Cosine Similarity (using CLIP ViT), Mean Squared Error, Perceptual Loss using VGG16, Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Feature Extraction with SIFT (Scale-Invariant Feature Transform). The system further assigns labels of real or fake to all the records. The inclusion of comprehensive image similarity metrics further enriches the dataset by providing quantitative measures between images and their regenerated versions.
[0099] In an embodiment, splitting the data for training and testing of the system is illustrated further. In one embodiment, the data may be split into training and testing sets to effectively evaluate the model's performance. Using the "train_test_split" function from the sklearn library, 80% of the data (8,000 images) is allocated for training, while the remaining 20% (2,000 images) is reserved for testing.
[00100] In another embodiment, the feature scaling of the system is explained further in detail. In one embodiment, the features of the dataset are standardized to ensure that the model performs optimally. Standardization is a pre-processing step that scales the features such that they have a mean of 0 and a standard deviation of 1. The "StandardScaler" is first fitted to the training data, which calculates the mean and standard deviation of the training set and then scales it accordingly. The same transformation is applied to the testing data to ensure consistency. The formula for standardization is given by:
X_scaled = (X - μ) / σ
Where, X is the original feature, μ is the mean of the feature in the training data, and σ is the standard deviation of the feature in the training data.
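A minimal sketch of the train/test split and standardization described above follows, assuming X holds one row of the seven similarity metrics per image pair and y holds the corresponding 0/1 labels; the random_state and stratify arguments are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X: one row of the seven similarity metrics per image pair; y: 0 = AI-generated pair, 1 = real pair.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # 8,000 training / 2,000 testing records

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and standard deviation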
[00101] Figure 5 illustrates the graphical representation of the loss of an artificial neural network (ANN) model (500), in accordance with an embodiment of the present subject matter. In an embodiment, the neural network model is constructed with five layers, starting with an input layer and ending with an output layer. The first four layers consist of seven units each, utilizing the rectified linear unit (ReLU) activation function to introduce non-linearity. Following these hidden layers, the final layer incorporates a single unit with a sigmoid activation function, facilitating binary classification between real and generated images. The model is optimized using the Adam optimizer, chosen for its efficiency in handling large datasets and dynamic learning rates. Given the binary nature of the classification task, binary cross-entropy is employed as the loss function to measure the discrepancy between predicted and actual labels.
[00102] Figure 6 illustrates the graphical representation of the accuracy of an artificial neural network (ANN) model (600), in accordance with an embodiment of the present subject matter. In an embodiment, the training parameters include a batch size of 32 instances per iteration and a total of 20 epochs to iteratively improve the ANN model's performance and convergence. By the end of the training, the model achieved an accuracy of 100% with a corresponding loss of 0.0032, demonstrating its high effectiveness in distinguishing between real and AI-generated images.
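A minimal Keras sketch of the five-layer network described above (four hidden ReLU layers of seven units each and a single sigmoid output), trained with the Adam optimizer, binary cross-entropy, a batch size of 32, and 20 epochs, is given below; the input dimension of seven corresponds to the seven similarity metrics, and the variable names follow the split/scaling sketch above.

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(7,)),               # seven similarity metrics per record
    keras.layers.Dense(7, activation="relu"),
    keras.layers.Dense(7, activation="relu"),
    keras.layers.Dense(7, activation="relu"),
    keras.layers.Dense(7, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # binary output: real vs. AI-generated
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_train_scaled, y_train, batch_size=32, epochs=20,
                    validation_data=(X_test_scaled, y_test))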
[00103] Figure 7 illustrates the graphical representation of the 2D Decision Boundary Plot of the XGBoost model (700), in accordance with an embodiment of the present subject matter.
[00104] The system utilizes the XGBoost model for the classification task. The XGBoost model is known for its efficiency and superior performance in handling structured data. The XGBoost model is configured with the following hyperparameters: "colsample_bytree" is set to 1.0, indicating that all features are considered for each tree.
[00105] The "gamma" parameter is set to 0.0, allowing for unrestricted tree partitioning. A "learning_rate" of 0.1 is chosen to balance the trade-off between convergence speed and model accuracy. The "max_depth" of each tree is limited to 3 to prevent overfitting while capturing sufficient data complexity. The "min_child_weight" is set to 3, ensuring that leaf nodes contain a minimum sum of instance weights to maintain model generalization. The system uses "n_estimators" set to 100, specifying the number of trees to build, which strikes a balance between model robustness and computational efficiency. The objective is defined as "binary:logistic", suitable for the binary classification task. Finally, a "subsample" rate of 0.8 is employed, which randomly samples 80% of the training data for each tree, further mitigating overfitting. The XGBoost model achieved an accuracy of 100% on the testing dataset comprising 2,000 images, demonstrating its exceptional capability in performing this classification task.
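A minimal sketch configuring XGBoost with the hyperparameters listed above is given below; the 0.5 probability threshold is illustrative of the thresholding step described in the summary, and the data variables follow the split/scaling sketch above.

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb = XGBClassifier(
    colsample_bytree=1.0, gamma=0.0, learning_rate=0.1, max_depth=3,
    min_child_weight=3, n_estimators=100, objective="binary:logistic", subsample=0.8)
xgb.fit(X_train_scaled, y_train)

proba = xgb.predict_proba(X_test_scaled)[:, 1]   # probability of Class 1 (real pair)
pred = (proba > 0.5).astype(int)                 # illustrative decision threshold
print("Test accuracy:", accuracy_score(y_test, pred))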
[00106] Figure 8 illustrates the graphical representation of the 3D Decision Boundary Plot of the XGBoost model (800), in accordance with an embodiment of the present subject matter.
[00107] Figure 9 illustrates the graphical representation of the KNN Classifier Decision Boundary Plot (900), in accordance with an embodiment of the present subject matter. The system employs a K-Nearest Neighbors (KNN) classifier for distinguishing between AI-generated and real images. The KNN is used for its simplicity and effectiveness in handling structured data. The KNN model may be configured with a Manhattan distance metric, utilizing 11 neighbors for classification, and employing uniform weights across neighbors. This setup ensures each neighbor contributes equally to the decision-making process. Following training, the KNN model demonstrated exceptional performance, achieving a perfect accuracy rate of 100% on the testing dataset comprising 2,000 images. This result underscores the robustness and reliability of using image similarity metrics in conjunction with KNN for accurately classifying AI-generated images.
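A minimal sketch of the KNN classifier described above, with the Manhattan distance metric, 11 neighbors, and uniform weights, is given below; the data variables follow the earlier sketches.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=11, metric="manhattan", weights="uniform")
knn.fit(X_train_scaled, y_train)
print("Test accuracy:", accuracy_score(y_test, knn.predict(X_test_scaled)))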
[00108] In an embodiment, the Model Comparison and Selection is shown below in Table 1.
TABLE I: COMPARISON OF MODELS
Model        Testing Accuracy
ANN          100%
XGBoost      100%
KNN          100%
[00109] In selecting XGBoost over ANN and KNN, several factors prove pivotal. In an embodiment, the XGBoost model excels in scalability, efficiently handling larger datasets and computation demands compared to KNN. The KNN can become impractical with extensive data due to its storage requirements. Moreover, XGBoost model offers clearer interpretability than ANN, making it easier to understand and explain model decisions, crucial for quality assurance and validation in complex projects like image classification. Despite achieving comparable 100% accuracy on testing data, XGBoost's computational efficiency further underscores its suitability for real-time applications, ensuring robust performance without compromising on speed or interpretative clarity.
[00110] Figure 10 illustrates the graphical representation of the Confusion Matrix (1000), in accordance with an embodiment of the present subject matter. In an embodiment, the neural network model of the system, comprising five layers with rectified linear unit (ReLU) activations and a sigmoid output, achieved exceptional performance with 100% accuracy and a minimal loss of 0.0032. The model was trained over 20 epochs using the Adam optimizer and binary cross-entropy loss, and demonstrated robust convergence and efficacy in binary classification. Visualizations of the training process showed steady improvements in both accuracy and loss metrics. Concurrently, the XGBoost algorithm, optimized with hyperparameters such as "colsample_bytree=1.0", "gamma=0.0", and "max_depth=3", achieved identical accuracy on a testing dataset of 2,000 images. The model's confusion matrix further validated its reliability, with all predictions aligning perfectly with actual classifications. Comparative analysis highlighted XGBoost's superiority in scalability and interpretability over alternative models like KNN and neural networks, underscoring its suitability for real-time applications and complex image classification tasks. The Confusion Matrix is shown in Table 2:
               Predicted Fake    Predicted Real
Actual Fake    1,000             0
Actual Real    0                 1,000
Table 2: Confusion Matrix
[00111] In an embodiment, the K-Fold Cross Validation of the system is illustrated further. The system implemented a 10-fold cross-validation to rigorously evaluate the performance of the XGBoost model. This approach involved dividing the training dataset into 10 equal parts, training the model on 9 parts, and validating it on the remaining part, repeating the process 10 times. The model achieved a mean accuracy of 99.91%, showcasing its high capability in distinguishing between real and AI-generated images. The standard deviation of 0.10% indicates the model's consistent performance across different data subsets, underscoring its reliability and generalization capacity.
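A minimal sketch of the 10-fold cross-validation described above, run on the scaled training set with the XGBoost model from the earlier sketch, is given below; the use of StratifiedKFold with shuffling and a fixed random_state is an assumption.

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(xgb, X_train_scaled, y_train, cv=cv, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")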
[00112] In an embodiment, the performance of the XGBoost model of the system may be evaluated on a dataset comprising 10,000 images, equally divided between AI-generated faces (from StyleGAN2 via thispersondoesnotexist.com) and real faces (from the CineFace10 dataset on Kaggle). The model utilized Euclidean Distance and CLIP Similarity as features to classify the images. The model achieved 100% Accuracy, Precision, Recall, and F1-Score, demonstrating high effectiveness. This performance of the model shows significant potential for real-world applications in digital media integrity and combating misinformation spread through AI-generated content. The system provides support for organizations to comply with upcoming digital media integrity laws and create a certification program to validate the authenticity of digital assets.
[00113] In an embodiment, the Pseudocode and Algorithms are explained below in detail:
[00114] Algorithm 1: Pre-processing and Similarity Metric Computation
Input: Image dataset (real and AI-generated).
Output: Similarity metrics dataset.
Step 1. Resize all images to 256×256 pixels.
Step 2. Normalize pixel values.
Step 3. Perform StyleGAN2 inversion to obtain latent vectors.
Step 4. Regenerate images using latent vectors.
Step 5. Compute similarity metrics:
a. Euclidean Distance.
b. Cosine Similarity.
c. Structural Similarity Index (SSIM).
d. Perceptual Loss (VGG16).
e. MSE and PSNR.
Step 6. Save metrics for model training.
[00115] Algorithm 2: XGBoost Classification
Input: Similarity metrics dataset.
Output: Classification results (real vs. AI-generated).
Step 1. Load training and testing datasets.
Step 2. Configure XGBoost:
- colsample_bytree=1.0, max_depth=3, n_estimators=100.
Step 3. Train model on similarity metrics.
Step 4. Evaluate model on test set.
Step 5. Save model for deployment.
[00116] In an embodiment, the performance analysis of the graphics processing unit (GPU) used for inverting the StyleGAN and regenerating the image from the latent vector is a critical aspect of evaluating the efficiency and scalability of the proposed methodology. The graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations at high speed.
[00117] In one embodiment, the GPU usage is evaluated on the following hardware, with images stored in a common format (e.g., JPG or PNG) suitable for processing by the models:
• Nvidia H100 80GB
• Nvidia A100 80GB
• Nvidia RTX 3060 Ti
• Google Colab T4 GPU
[00118] In an embodiment, the Inversion and Generation Time of the images used in the system are shown below:
• Nvidia H100 80GB: 25 seconds
• Nvidia A100 80GB: 40 seconds
• Nvidia RTX 3060 Ti: 60 seconds
• Google Colab T4 GPU: 130 seconds
[00119] In an embodiment, the choice of GPU has a significant impact on the time efficiency of inverting the StyleGAN and regenerating images from latent vectors. The Nvidia H100 80GB provides the best performance, making it the most suitable for large-scale and real-time applications. The Nvidia A100 80GB offers a good balance of performance and cost-efficiency, while the Nvidia RTX 3060 Ti is a viable option for more budget-conscious projects. The Google Colab T4 GPU, despite its longer processing times, remains a valuable resource for those with limited access to advanced hardware.
[00120] In an embodiment, the system addresses the critical need for effective detection of AI-generated images in the present real world, where generative models like StyleGAN are increasingly sophisticated. By combining StyleGAN inversion with a detailed analysis of image similarity metrics, the system demonstrates a highly accurate method for distinguishing between real and AI-generated images. The computed image similarity metrics form a comprehensive dataset, enabling the training of robust machine learning models. The use of XGBoost over the ANN, for its lower computational cost and better interpretability, has proven to be highly effective. These results highlight the potential of the system to be implemented in real-world applications, providing a reliable tool for maintaining the authenticity of digital media. The system not only contributes a new technique to the field of AI-generated image detection but also lays the groundwork for future advancements. The implications of this system extend to various domains, including cybersecurity, media verification, and beyond, demonstrating the far-reaching impact of these findings.
[00121] In an embodiment, for future work, the system's methodology can be expanded to detect AI-generated images from various types of GANs by developing multiple end-to-end systems tailored to each GAN architecture. By constructing a battery of powerful GANs, such as StyleGAN2, BigGAN, CycleGAN, and SAGAN, each capable of inverting and regenerating images specific to their design, the system can create distinct sets of image similarity metrics for each GAN type. These individual systems will be trained separately to recognize patterns unique to their corresponding GANs, allowing for precise detection of AI-generated images from a wide range of GAN architectures. This approach would significantly enhance the robustness and versatility of the detection capabilities, ensuring accurate identification of AI-generated images regardless of the specific GAN used in their creation.
[00122] In an embodiment, the system also provides an extension to video deepfakes. The system may be adapted for video deepfake detection by applying temporal consistency analysis across frames. The system may detect inconsistencies in facial motion, lighting, and texture changes that are characteristic of AI-generated videos by extending latent space inversion to sequential frames and integrating optical flow-based similarity metrics. Additionally, the system may use recurrent neural networks (RNNs) or transformers to enhance the detection of frame-by-frame anomalies in synthetic content.
[00123] In an embodiment, the system is also capable of providing detection of audio deepfakes. The proposed methodology may be extended to voice deepfake detection by incorporating latent space representations of audio waveforms. The system incorporates tools for audio deepfake detection by analyzing phonetic inconsistencies and spectral patterns. Similar to image similarity metrics, the system may analyze spectrograms using cosine similarity, MFCC (Mel-Frequency Cepstral Coefficients), and perceptual loss to detect synthetic speech. Additionally, adversarial training with GAN-based speech synthesis models (e.g., WaveGAN, MelGAN) may enhance the robustness of classification.
[00124] In an embodiment, the system provides multimodal deepfake detection. Future iterations of the system may integrate image, video, and audio modalities into a unified framework. By fusing image-based inversion, temporal tracking, and voice spectrogram analysis, the system may achieve a holistic deepfake detection pipeline capable of verifying synthetic media across multiple formats.
[00125] In an embodiment, the system provides adaptive AI for evolving threats. The system may be designed to continuously learn from new deepfake techniques by incorporating self-supervised learning and few-shot learning models. This enables the detection of novel synthetic media generated by future GAN, diffusion, or transformer-based architectures without requiring extensive retraining.
[00126] In an embodiment, the system provides real-time edge deployment for scalability. The system may be deployed on mobile and IoT devices with optimizations in hardware acceleration (e.g., TensorRT, ONNX for inference on edge devices), for real-time deepfake detection in communication apps, social media, and digital forensics.
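The following is a minimal, non-limiting illustrative sketch of one way such an edge deployment could be realized, assuming PyTorch and ONNX Runtime are available; the classifier architecture, file name, and input dimensionality are hypothetical and not prescribed by the specification.

```python
# Illustrative sketch: exporting a hypothetical similarity-metric classifier
# to ONNX and running low-latency inference with ONNX Runtime.
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Hypothetical classifier over 7 similarity metrics -> real/fake logit.
model = nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 1))
model.eval()

dummy = torch.randn(1, 7)
torch.onnx.export(model, dummy, "deepfake_classifier.onnx",
                  input_names=["metrics"], output_names=["logit"])

# Inference on an edge device (CPU provider shown; TensorRT/CUDA execution
# providers may be substituted where available).
session = ort.InferenceSession("deepfake_classifier.onnx",
                               providers=["CPUExecutionProvider"])
metrics = np.random.rand(1, 7).astype(np.float32)   # placeholder metrics
logit = session.run(["logit"], {"metrics": metrics})[0]
print("AI-generated" if logit.item() > 0.0 else "real")  # placeholder threshold
```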
[00127] In an embodiment, the system uses an adaptive machine learning technique for DeepFake detection. The system incorporates multiple machine learning classifiers (e.g., XGBoost, ANN, and KNN) trained on similarity metrics to enhance robustness against adversarial attacks.
[00128] In an embodiment, a hybrid ensemble learning approach may be used, where XGBoost detects fine-grained pixel-level variations, ANN identifies feature embeddings, and KNN improves generalization on unseen deepfake patterns.
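The following is a minimal, non-limiting illustrative sketch of such a hybrid ensemble, assuming scikit-learn and XGBoost are available; the placeholder feature matrix, labels, and hyper-parameters are assumptions for illustration only.

```python
# Illustrative sketch of a soft-voting ensemble over similarity-metric
# features (X) and real/fake labels (y); data and hyper-parameters are
# assumptions, not values prescribed by the specification.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X = np.random.rand(200, 7)             # placeholder: 7 similarity metrics
y = np.random.randint(0, 2, size=200)  # placeholder: 0 = real, 1 = AI-generated

ensemble = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)),
        ("ann", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",  # average class probabilities from the three classifiers
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```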
[00129] In an embodiment, the method supports transfer learning, allowing pre-trained models to be fine-tuned for new deepfake architectures without full retraining.
[00130] In an embodiment, the system is optimized for GPU acceleration by employing TensorRT and ONNX Runtime, enabling low-latency inference without sacrificing accuracy.
[00131] In an embodiment, the parallel processing of image similarity metrics (e.g., Euclidean Distance, Cosine Similarity, SSIM) is implemented using CUDA and cuDNN, significantly reducing computational overhead.
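The following is a minimal, non-limiting illustrative sketch of one way such metric computation could be parallelized on a GPU, using batched PyTorch tensor operations; the batch size, image shape, and choice of metrics shown (MSE and cosine similarity) are assumptions for illustration only.

```python
# Illustrative sketch: batched MSE and cosine similarity computed on the GPU
# with PyTorch, one possible way to parallelize the metric stage.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder batches of original and regenerated images (N, C, H, W).
originals = torch.rand(32, 3, 256, 256, device=device)
regenerated = torch.rand(32, 3, 256, 256, device=device)

# Per-image MSE over all pixels.
mse = ((originals - regenerated) ** 2).flatten(1).mean(dim=1)

# Per-image cosine similarity over flattened pixel vectors.
cos = torch.nn.functional.cosine_similarity(
    originals.flatten(1), regenerated.flatten(1), dim=1)

print(mse.shape, cos.shape)  # both: torch.Size([32])
```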
[00132] In an embodiment, the batch processing of deepfake detection requests ensures optimal GPU memory utilization, preventing bottlenecks in real-time applications.
[00133] In an embodiment, the system supports GPU deployment on NVIDIA Jetson (Nano, Xavier, Orin) and Qualcomm AI processors, optimizing deepfake detection for low-power edge devices.
[00134] In an embodiment, the system uses model quantization techniques (e.g., INT8 and FP16 precision scaling) to reduce memory footprint and enhance energy efficiency.
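The following is a minimal, non-limiting illustrative sketch of the two precision-scaling options mentioned above, assuming PyTorch and ONNX Runtime are available; the classifier and the ONNX file names are hypothetical (e.g., a model exported as in the earlier sketch).

```python
# Illustrative sketch: FP16 casting of a hypothetical PyTorch classifier and
# dynamic INT8 quantization of an already-exported ONNX model.
import torch.nn as nn
from onnxruntime.quantization import quantize_dynamic, QuantType

# FP16: halve the memory footprint of the classifier's weights.
model = nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 1))
model_fp16 = model.half()

# INT8: dynamically quantize an exported ONNX model for edge devices.
quantize_dynamic("deepfake_classifier.onnx",
                 "deepfake_classifier_int8.onnx",
                 weight_type=QuantType.QInt8)
```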
[00135] In an embodiment, the edge inference is accelerated through pruning and distillation of deep learning models, ensuring lightweight deployment without compromising detection accuracy.
[00136] In an embodiment, the video-based deepfake detection is optimized using frame-level parallelization, in which the GPU cores process consecutive frames concurrently to detect temporal inconsistencies. A WaveNet-based spectral analysis is accelerated using cuFFT (CUDA Fast Fourier Transform) for voice deepfake detection, to quickly analyze frequency domain anomalies.
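The following is a minimal, non-limiting illustrative sketch of frequency-domain analysis on the GPU using torch.fft, which dispatches to cuFFT on CUDA devices; the synthetic waveform and sampling rate are assumptions for illustration only.

```python
# Illustrative sketch: GPU magnitude spectrum of an audio frame via torch.fft
# (cuFFT-backed on CUDA devices), as one way to inspect frequency-domain
# anomalies. The waveform here is a synthetic placeholder tone.
import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

sample_rate = 16000
t = torch.arange(0, 1.0, 1.0 / sample_rate, device=device)
waveform = torch.sin(2 * math.pi * 440.0 * t)   # placeholder 440 Hz tone

spectrum = torch.fft.rfft(waveform)             # one-sided FFT
magnitude = spectrum.abs()
freqs = torch.fft.rfftfreq(waveform.numel(), d=1.0 / sample_rate, device=device)

peak = freqs[magnitude.argmax()]
print(f"dominant frequency: {peak.item():.1f} Hz")
```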
[00137] In an embodiment, the system supports hybrid cloud inference, allowing deepfake detection models to run on AWS Inferentia, Google TPUs, or Azure ML accelerators for large-scale media verification.
[00138] In an embodiment, federated learning is employed for privacy-preserving detection, enabling model updates across distributed devices without centralizing user data.
[00139] The system uses advanced AI detection techniques to combat misinformation in Web 3.0 by ensuring the authenticity of digital content. Using StyleGAN2 inversion and similarity metrics, it accurately identifies AI-generated deepfakes by comparing original and regenerated images through robust machine-learning models like XGBoost and ANN. This method offers 100% precision, recall, and accuracy, making it a powerful tool to verify media authenticity across decentralized platforms. Additionally, the system is scalable to other GAN architectures (e.g., BigGAN, CycleGAN) and can be deployed in real-time, making it highly effective in detecting and mitigating false digital information.
[00140] In an embodiment, the technical effect achieved by the proposed system is that the system is efficient, has fast processing speed, and has a lower carbon footprint.
[00141] In some embodiments, the system leverages customized training pipelines with models like XGBoost and ANN. The system preprocesses the similarity metrics into standardized datasets optimized for high-precision binary classification, achieving 100% accuracy in validation tests.
[00142] In some embodiments, the system implements advanced preprocessing techniques such as image resizing, normalization, and noise regularization, ensuring balanced and unbiased datasets. The statistical standardization of similarity metrics further differentiates the system’s approach.
[00143] In some embodiments, the system is optimized for real-time deployment on GPUs, ensuring low-latency processing suitable for high-throughput environments such as media forensics and cybersecurity operations.
[00144] In some embodiments, the latent space inversion addresses challenges in detecting subtle manipulations with unprecedented accuracy.
[00145] In some embodiments, the system achieves 100% classification accuracy using sophisticated machine learning models.
[00146] In some embodiments, the system utilizes diverse metrics, including perceptual and statistical similarity measures.
[00147] In some embodiments, the system may have a scalable architecture adaptable to different GAN types.
[00148] In some embodiments, the system may be used in cybersecurity for the detection of deepfakes in sensitive media.
[00149] In some embodiments, the system performs media verification and authenticity validation for news outlets.
[00150] In some embodiments, the system provides forensic analysis, which is further used in criminal investigations of digital forgeries.
[00151] In some embodiments, the system provides content moderation, wherein social media platforms identify synthetic content. The system collaborates with media outlets to integrate into their workflows for real-time content verification. The system integrates with social media platforms to flag and remove deepfake content at scale.
[00152] In some embodiments, the system provides educational training tools for understanding generative AI.
[00153] In some embodiments, the system is also used in the legal sector for evidence validation in digital cases.
[00154] In some embodiments, the present invention provides enhanced accuracy due to the combination of metrics.
[00155] In some embodiments, the system is robust against new GAN architectures.
[00156] In some embodiments, the system is scalable for diverse datasets and real-time applications.
[00157] In some embodiments, the system provides a comprehensive evaluation of image features beyond pixel artifacts.
[00158] In some embodiments, the system provides robustness against new GAN architectures. Traditional methods require frequent retraining, whereas the system uses latent space inversion and similarity metrics, making it adaptable to emerging GAN architectures (e.g., StyleGAN3, BigGAN, CycleGAN, SAGAN) without major modifications.
[00159] In some embodiments, the system provides high accuracy and precision. The system achieves 100% precision, recall, and F1-score. The method outperforms existing detection models that suffer from false positives due to limited feature extraction.
[00160] In some embodiments, the system provides comprehensive similarity metrics. The system employs a diverse set of similarity metrics (Euclidean Distance, Cosine Similarity, SSIM, Perceptual Loss) rather than relying on a single pixel-based or deep-feature comparison, improving robustness against adversarial attacks.
[00161] In some embodiments, the system improves scalability and efficiency. The system provides real-time processing with lower computational cost by optimizing GPU utilization and using lightweight ML models like XGBoost. The system outperforms deep learning-based classifiers that require extensive retraining. The system enhances real-time detection speeds for live-streamed content using optimized GPU/TPU algorithms.
[00162] In some embodiments, the system provides a low carbon footprint. Compute-heavy deepfake detectors depend on large-scale cloud training, whereas the proposed technique focuses on energy-efficient processing by minimizing redundant computations and supporting edge deployment.
[00163] In an embodiment, the system is used in real-world applications, such as fraud prevention, social media content moderation, and digital forensics.
[00164] In an embodiment, the system includes examples showcasing scalability and low-carbon processing benefits.
[00165] In an embodiment, the system uses adversarial learning to generate more diverse datasets of deepfakes for training purposes. The system integrates with organizations to access proprietary datasets for improving model generalization.
[00166] In an embodiment, continual learning pipelines for the system are implemented to self-update with new training data. The system creates a feedback loop where flagged real-world cases improve model performance.
[00167] Figure 11 illustrates a flow chart of a method for detection of the AI-generated deepfake images, in accordance with an embodiment of the present subject matter. The order in which the method (1100) is described is not intended to be construed as a limitation, and any number of the described method blocks may be combined in any order to implement the method (1100) or alternate methods. Additionally, individual blocks may be deleted from the method (1100) without departing from the spirit and scope of the subject matter described herein. Furthermore, the method may be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method (1100) may be considered to be implemented as described in the system for detection of the AI-generated images.
[00168] At block 1102, the system stores input images in a database, wherein the input images comprise a real and an AI-generated image.
[00169] At block 1104, the system receives at least an input image from the database.
[00170] At block 1106, the system performs resizing and normalization on the received input image.
[00171] At block 1108, the system performs StyleGAN inversion on the resized and normalized input image using a pre-trained StyleGAN2 model to initialize a latent vector in a latent space and optimize the latent vector through iterative optimization to generate a regenerated version of the input image.
[00172] At block 1110, the system calculates a plurality of similarity metrics between the input image and the regenerated image, wherein the similarity metrics comprise: an Euclidean distance metric between feature vectors of the input image and the regenerated image; a cosine similarity metric using CLIP Vision Transformer embeddings; a mean squared error (MSE) metric; a perceptual loss metric using a VGG16 model; a structural similarity index measure (SSIM); a peak signal-to-noise ratio (PSNR); and a scale-invariant feature transform (SIFT) metric.
[00173] At block 1112, the system inputs the calculated similarity metrics into a trained classification model.
[00174] At block 1114, the system determines, based on an output of the classification model, whether the input image is AI-generated. A non-limiting illustrative sketch of the metric and classification stages is provided below.
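The following is a minimal, non-limiting illustrative sketch of blocks 1110 to 1114, assuming NumPy and a recent scikit-image are available; only a subset of the listed metrics (Euclidean distance, cosine similarity, MSE, SSIM, PSNR) is shown, the image pair is a placeholder, and the fitted classifier is assumed to exist. The StyleGAN2 inversion of block 1108 is not shown.

```python
# Illustrative sketch of blocks 1110-1114: compute a subset of the listed
# similarity metrics for an (original, regenerated) image pair and feed them
# to a trained classifier. Images and the fitted model are assumptions.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def similarity_features(original: np.ndarray, regenerated: np.ndarray) -> np.ndarray:
    """original/regenerated: float arrays in [0, 1] with shape (H, W, 3)."""
    a, b = original.ravel(), regenerated.ravel()
    euclidean = float(np.linalg.norm(a - b))
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    mse = float(np.mean((a - b) ** 2))
    ssim = structural_similarity(original, regenerated,
                                 channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(original, regenerated, data_range=1.0)
    return np.array([[euclidean, cosine, mse, ssim, psnr]], dtype=np.float32)

# Placeholder pair; in practice these come from blocks 1104-1108.
original = np.random.rand(256, 256, 3)
regenerated = np.clip(original + 0.01 * np.random.randn(256, 256, 3), 0, 1)

features = similarity_features(original, regenerated)
# label = trained_classifier.predict(features)   # e.g., a fitted XGBClassifier
```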
[00175] Equivalents
[00176] With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for the sake of clarity.
[00177] It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as "open" terms (e.g., the term "including" should be interpreted as "including but not limited to," the term "having" should be interpreted as "having at least," the term "includes" should be interpreted as "includes but is not limited to," etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present.
[00178] Although implementations for the system and method for detecting AI-generated deepfake images have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features described. Rather, the specific features are disclosed as examples of implementation for the system and method for detecting AI-generated deepfake images.
Claims:
1. A system for detecting AI-generated deepfake images, comprising:
a database, configured to store input images, wherein the input images comprise a real and an AI-generated image;
a pre-processing module coupled with the database, configured to
receive at least an input image and perform resizing and normalization;
perform StyleGAN inversion on the resized and normalized input image using a pre-trained StyleGAN2 model (100) to: initialize a latent vector in a latent space (104); optimize the latent vector through iterative optimization to generate a regenerated version of the input image;
calculate a plurality of similarity metrics between the input image and the regenerated image, wherein the similarity metrics comprise: an Euclidean distance metric between feature vectors of the input image and the regenerated image; a cosine similarity metric using CLIP Vision Transformer embeddings; a mean squared error (MSE) metric; a perceptual loss metric using a VGG16 model; a structural similarity index measure (SSIM); a peak signal-to-noise ratio (PSNR); and a scale-invariant feature transform (SIFT) metric;
input the calculated similarity metrics into a trained classification model; and
determine, based on an output of the classification model, whether the input image is AI-generated.
2. The system as claimed in 1, wherein the StyleGAN2 model (100) comprises
a mapping network (102) configured to transform an initial latent vector into a disentangled intermediate space (106);
a synthesis network (108) comprising: adaptive instance normalization (AdaIN) layers configured to modulate activations based on the latent vector; and noise inputs (114) configured to add stochastic variation to generated images.
3. The system as claimed in 1, wherein optimizing the latent vector comprises:
utilizing a perceptual loss function based on a pre-trained VGG network to measure similarity between the regenerated image and the input image;
applying noise regularization to stabilize the optimization; and using an Adam optimizer to iteratively update the latent vector and noise patterns.
4. The system as claimed in 1, wherein calculating the cosine similarity metric comprises:
converting the input image and regenerated image to RGB format;
pre-processing the images using CLIP's preprocessing pipeline;
generating high-dimensional vector representations using CLIP's image encoder; and computing a cosine similarity score between the vector representations.
5. The system as claimed in 1, wherein the classification model comprises an XGBoost model configured with: a colsample_bytree parameter set to 1.0; a gamma parameter set to 0.0; a learning rate of 0.1; a maximum tree depth of 3; a minimum child weight of 3; 100 estimators; and a subsample rate of 0.8.
6. The system as claimed in 1, wherein the system is configured to standardize the calculated similarity metrics using a standard scaler before inputting them into the classification model, and wherein determining whether the input image is AI-generated comprises: generating a probability score indicating the likelihood of the input image being AI-generated; and classifying the input image as AI-generated when the probability score exceeds a predetermined threshold.
7. The system as claimed in 1, wherein the StyleGAN2 model (100) is configured to:
map input images to a latent space (104);
generate high-resolution images using the synthesis network (108), wherein the synthesis network (108) comprises an Adaptive Instance Normalization (AdaIN) (110) and convolutional layers (112);
incorporate noise inputs (114) at various stages of image synthesis, configured to enhance realism and authenticity;
employ a progressive growing architecture to upscale images from a low initial resolution to a high final resolution; and
implement the machine learning models, configured to train on the similarity metrics for robust and accurate classification.
8. The system as claimed in 1, wherein the system is configured to train the classification models using a labeled dataset of the original and regenerated images.
9. The system as claimed in claim 1, wherein the system comprises an evaluation module for assessing system performance using accuracy, precision, recall, and F1 score metrics.
10. The system as claimed in 1, wherein the classification model comprises machine learning models, such as artificial neural networks (ANN), XGBoost, and K-Nearest Neighbors (KNN), to classify the images based on the calculated image similarity metrics.
11. The system as claimed in 1, wherein the Euclidean distance computes the straight-line distance between two feature vectors, and the system converts images into 1-dimensional feature vectors by resizing and flattening the images.
12. The system as claimed in 1, wherein the system employs the Vision Transformer (ViT) model from the CLIP (Contrastive Language-Image Pre-training) to compare the similarity between two images using cosine similarity.
13. The system as claimed in 1, wherein the system calculates the resulting score for cosine similarity, which ranges from -1 to 1 and is scaled to a percentage value between 0 and 100 for easier interpretation.
14. The system as claimed in 1, wherein the Mean Squared Error (MSE) measures the average squared difference between corresponding pixels of two images, providing a quantitative assessment of image similarity.
15. The system as claimed in 1, wherein the system calculates the resulting score for the Perceptual Loss using the VGG16 model to compare high-level perceptual features between images.
16. The system as claimed in 1, wherein the system is configured to detect audio deepfake content and video deepfake content, and wherein the processor comprises a graphics processing unit (GPU).
17. A method (1100) for detecting AI-generated deep fake images, comprising:
storing input images in a database, wherein the input images comprise a real and an AI-generated image;
receiving at least an input image from the database;
performing resizing and normalization on the received input image;
performing StyleGAN inversion on the resized and normalized input image using a pre-trained StyleGAN2 model (100) to initialize a latent vector in a latent space (104) and optimize the latent vector through iterative optimization to generate a regenerated version of the input image;
calculating a plurality of similarity metrics between the input image and the regenerated image, wherein the similarity metrics comprise: an Euclidean distance metric between feature vectors of the input image and the regenerated image; a cosine similarity metric using CLIP Vision Transformer embeddings; a mean squared error (MSE) metric; a perceptual loss metric using a VGG16 model; a structural similarity index measure (SSIM); a peak signal-to-noise ratio (PSNR); and a scale-invariant feature transform (SIFT) metric;
inputting the calculated similarity metrics into a trained classification model; and
determining, based on an output of the classification model, whether the input image is AI-generated.