
A System And A Method For Media Analysis

Abstract: The present subject matter relates to a system (100) and a method (300) for media analysis. The system (100) is configured to receive an input media file. Further, the system (100) is configured to pre-process visual information of the received input media file and split the pre-processed visual information into one or more chunks, where each chunk from the one or more chunks corresponds to a plurality of frames. Furthermore, the system (100) extracts one or more features corresponding to each chunk and generates embeddings of each chunk from the one or more chunks based on the extracted one or more features. Moreover, the system (100) classifies each chunk based on the generated embeddings, determining whether each chunk is real or synthetic. Additionally, the user is provided with an indication of whether each classified chunk is a real chunk or a synthetic chunk. The system enhances the ability to identify synthetic portions in artificially manipulated video content. [To be published with figure 1]


Patent Information

Application #
Filing Date
22 October 2024
Publication Number
1/2025
Publication Type
INA
Invention Field
COMMUNICATION
Status
Parent Application

Applicants

ONIBER SOFTWARE PRIVATE LIMITED
Sr No 26/3 and 4, A 102, Oakwood Hills, Baner Road, Opp Pan Card Club, Baner, Pune, Maharashtra 411045

Inventors

1. Raghu Sesha Iyengar
ARGE Urban Bloom, A-101, No.30/A, Ring Road, 4th main road, Bangalore – 560022
2. Ankush Tiwari
House No. 28, Vascon Paradise, Baner Road, Baner, Pune Maharashtra 411045
3. Abhijeet Zilpelwar
K302, Swiss County, Thergaon, Pune- 411033
4. Asawari Bhagat
A1002 EPIC, New DP Road, Vishal Nagar, Pimpale Nilakh, Pune 411027
5. Srivallabh Milind Mangrulkar
201, Shree Dhavalgiri, Ivory Estates, Baner, Pune-411045

Specification

Description: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003

COMPLETE SPECIFICATION
(See Section 10 and Rule 13)

Title of invention:
A SYSTEM AND A METHOD FOR MEDIA ANALYSIS
Applicant:
ONIBER SOFTWARE PRIVATE LIMITED
An Indian Entity having address as:
Sr No 26/3 and 4, A 102, Oakwood Hills, Baner Road, Opp Pan Card
Club, Baner, Pune, Maharashtra 411045, India

The following specification particularly describes the invention and the manner in which it is to be performed. 
CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
[0001] The present application does not claim priority from any other patent application.
FIELD OF INVENTION
[0002] The presently disclosed embodiments are related, in general, to the field of digital media analysis. More particularly, the present disclosure relates to a system and a method for media analysis to identify synthetic visual portions within the media.
BACKGROUND OF THE INVENTION
[0003] This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements in this background section are to be read in this light, and not as admissions of prior art. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.
[0004] In today’s world, where artificial intelligence (AI) is rapidly advancing, the creation and manipulation of multimedia content, including video files, has become increasingly sophisticated. AI-generated videos, such as deepfakes, can mimic real human actions, appearances, and voices with exceptional realism, making it difficult for conventional systems to detect such manipulations. The problem lies in the limitations of traditional video analysis systems, which were designed to process only basic visual and temporal features, lacking the capability to effectively identify manipulated or synthetic video content.
[0005] With AI-generated video content being used to spread misinformation, manipulate public opinion, commit fraud, impersonate individuals, and incite social unrest, the inability to distinguish between real and synthetic video poses significant risks. These risks have serious implications for areas such as journalism, security, legal evidence, and public trust. Detecting synthetic portions embedded within authentic video content is a complex problem that conventional systems are unable to adequately address, resulting in frequent misclassifications and inaccurate conclusions.
[0006] One critical issue is the limited ability of conventional systems to process large, complex video files. These systems are often designed to analyse videos as a whole, making it difficult to detect small manipulations embedded in specific segments of a video. This approach often overlooks important details, particularly when synthetic content is mixed with real footage, resulting in inaccurate or incomplete analysis.
[0007] Furthermore, conventional analysis systems face significant technical challenges when processing complex video content in its entirety. This often leads to increased memory and processing power requirements, as these systems must handle large amounts of data simultaneously. As a result, the system may experience performance degradation or become inefficient, particularly when dealing with high-resolution videos or extensive durations. Additionally, the lack of flexibility in allowing users to select specific portions of media files for analysis rather than requiring them to process the entire video further reduces efficiency. This rigidity limits the user’s ability to focus on relevant segments that might contain manipulated data, reducing the overall effectiveness of the analysis and increasing the time taken to derive meaningful insights from the content.
[0008] Another significant challenge lies in the inefficiency of conventional systems in extracting relevant visual and audio features from video files. These systems often rely on basic algorithms that fail to capture the intricate visual and temporal features necessary for identifying synthetic video content. Such systems struggle to detect small modifications in facial expressions, lighting, motion, and sound that characterize synthetic videos, leading to frequent misclassification. Moreover, as AI technology continues to evolve, the tools used to generate synthetic video content have become more sophisticated and seamless, making detection even more difficult.
[0009] In light of the above stated discussion, there exists a need for an improved system and method for media analysis to overcome at least one of the above stated disadvantages.
[0010] Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through the comparison of described systems with some aspects of the present disclosure, as set forth in the remainder of the present application and with reference to the drawings.
SUMMARY OF THE INVENTION
[0011] Before the present system and device and its components are summarized, it is to be understood that this disclosure is not limited to the system and its arrangement as described, as there can be multiple possible embodiments which are not expressly illustrated in the present disclosure. The present disclosure overcomes one or more shortcomings of the prior art and provides additional advantages discussed throughout the present disclosure. Additional features and advantages are realized through the techniques of the present disclosure. It is also to be understood that the terminology used in the description is for the purpose of describing the versions or embodiments only and is not intended to limit the scope of the present application. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in detecting or limiting the scope of the claimed subject matter.
[0012] According to embodiments illustrated in the present disclosure, a system for media analysis is disclosed. In one implementation of the present disclosure, the system may involve a processor and a memory. The memory is communicatively coupled to the processor. Further, the memory is configured to store processor-executable instructions, which, on execution, may cause the processor to receive an input media file. The received input media file may comprise an audio information and a visual information. Further, the processor may be configured to pre-process the visual information of the received input media file. Further, the processor may be configured to split the visual information of the pre-processed input media file into one or more chunks. In an embodiment, each chunk from the one or more chunks corresponds to a plurality of frames of the visual information of the received input media file. Furthermore, the processor may be configured to extract one or more features corresponding to each chunk from the one or more chunks. Furthermore, the processor may be configured to generate embeddings of each chunk from the one or more chunks, using a transformer encoder model based on the extracted one or more features. Furthermore, the processor may be configured to classify, via a classification model, each chunk from the one or more chunks, to either a real chunk or a synthetic chunk based on the generated embeddings. Moreover, the processor may be configured to provide each chunk from the one or more chunks with an indication of either the real chunk or the synthetic chunk.
[0013] According to embodiments illustrated herein, there is provided a method for media analysis. In one implementation of the present disclosure, the method may involve various steps performed by the processor. The method may involve a step of receiving, via the processor, the input media file. The received input media file may comprise an audio information and a visual information. Further, the method may involve a step of pre-processing, via the processor, the visual information of the received input media file. Furthermore, the method may involve a step of splitting, via the processor, the visual information of the pre-processed input media file into the one or more chunks. In an embodiment, each chunk from the one or more chunks corresponds to the plurality of frames of the visual information of the received input media file. Furthermore, the method may involve a step of extracting, via the processor, the one or more features corresponding to each chunk from the one or more chunks. Furthermore, the method may involve a step of generating embeddings of each chunk from the one or more chunks, using the transformer encoder model based on the extracted one or more features. Furthermore, the method may involve a step of classifying, via the processor coupled with the classification model, each chunk from the one or more chunks, to either the real chunk or the synthetic chunk based on the generated embeddings. Moreover, the method may involve a step of providing each chunk from the one or more chunks with the indication of either the real chunk or the synthetic chunk.
[0014] According to embodiments illustrated herein, there is provided a non-transitory computer-readable storage medium having stored thereon, a set of computer-executable instructions causing a computer comprising one or more processors to perform various steps. The steps may involve receiving the input media file. The received input media file may comprise an audio information and a visual information. Further, the steps may involve pre-processing the visual information of the received input media file. Furthermore, the steps may involve splitting the visual information of the pre-processed input media file into the one or more chunks. In an embodiment, each chunk from the one or more chunks corresponds to the plurality of frames of the visual information of the received input media file. Furthermore, the steps may involve extracting the one or more features corresponding to each chunk from the one or more chunks. Furthermore, the steps may involve generating embeddings of each chunk from the one or more chunks, using the transformer encoder model based on the extracted one or more features. Furthermore, the steps may involve classifying, via the classification model, each chunk from the one or more chunks, to either the real chunk or the synthetic chunk based on the generated embeddings. Moreover, the steps may involve providing each chunk from the one or more chunks with the indication of either the real chunk or the synthetic chunk.
[0015] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, examples, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF DRAWINGS
[0016] The accompanying drawings illustrate the various embodiments of systems, methods, and other aspects of the disclosure. Any person with ordinary skill in the art will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. In some examples, one element may be designed as multiple elements, or multiple elements may be designed as one element. In some examples, an element shown as an internal component of one element may be implemented as an external component in another, and vice versa. Further, the elements may not be drawn to scale.
[0017] Various embodiments will hereinafter be described in accordance with the appended drawings, which are provided to illustrate and not to limit the scope in any manner, wherein similar designations denote similar elements.
[0018] The detailed description is described with reference to the accompanying figures. In the figures, the same numbers are used throughout the drawings to refer to like features and components. Embodiments of the present disclosure will now be described with reference to the following diagrams, wherein:
[0019] Figure 1 illustrates a block diagram describing a system (100) for media analysis, in accordance with at least one embodiment of the present disclosure.
[0020] Figure 2 illustrates a block diagram showing an overview of various components of an application server (101) configured for media analysis, in accordance with at least one embodiment of the present disclosure.
[0021] Figure 3 illustrates a flowchart describing a method (300) for media analysis, in accordance with at least one embodiment of the present disclosure; and
[0022] Figure 4 illustrates a block diagram (400) of an exemplary computer system (401) for implementing embodiments consistent with the present disclosure.
[0023] It should be noted that the accompanying figures are intended to present illustrations of exemplary embodiments of the present disclosure. These figures are not intended to limit the scope of the present disclosure. It should also be noted that accompanying figures are not necessarily drawn to scale.

DETAILED DESCRIPTION OF THE INVENTION
[0024] Reference throughout the specification to “various embodiments,” “some embodiments,” “one embodiment,” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases “in various embodiments,” “in some embodiments,” “in one embodiment,” or “in an embodiment” in places throughout the specification are not necessarily all referring to the same embodiment. Furthermore, the features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
[0025] The words "comprising," "having," "containing," and "including," and other forms thereof, are intended to be equivalent in meaning and be open-ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present disclosure, the exemplary methods are described. The disclosed embodiments are merely exemplary of the disclosure, which may be embodied in various forms.
[0026] The terms “synthetic”, “artificial”, “fake”, and “deepfake” have the same meaning and are used interchangeably throughout the specification. Further, the terms “chunk”, “segment”, and “portion” have the same meaning and are used interchangeably throughout the specification. Further, the terms “identify” and “detect” have the same meaning and are used interchangeably throughout the specification. Furthermore, the terms “user” and “users” have the same meaning and are used interchangeably throughout the specification.
[0027] The present disclosure relates to a system for media analysis to identify synthetic portions in a media. The system comprises a processor and a memory communicatively coupled to the processor, and the memory is configured to store processor-executable instructions which, on execution, may cause the processor to receive an input media file comprising an audio information and a visual information, and to pre-process the visual information of the received input media file. Furthermore, the system may split the visual information of the pre-processed input media file into the one or more chunks. In an embodiment, each chunk from the one or more chunks corresponds to the plurality of frames of the visual information of the received input media file. Furthermore, the system may extract one or more features corresponding to each chunk from the one or more chunks. Furthermore, the system may generate embeddings of each chunk from the one or more chunks, using a transformer encoder model based on the extracted one or more features. Moreover, the system may classify, via a classification model, each chunk from the one or more chunks, to either a real chunk or a synthetic chunk based on the generated embeddings. Additionally, the system may provide each chunk from the one or more chunks with an indication of either the real chunk or the synthetic chunk.
[0028] To address the problems of conventional systems, the disclosed system integrates advanced processing and classification techniques for more accurate detection of the real chunk and the synthetic chunk in the media. Unlike conventional methods that struggle to detect small manipulations in media files, this system splits the visual information of the pre-processed input media file into the one or more chunks corresponding to the plurality of frames. Further, the system utilizes the transformer encoder model to generate embeddings of the media chunks based on the extracted one or more features. These embeddings allow the classification model to distinguish between real and synthetic portions with greater accuracy. The system also offers a user-friendly interface, making it easy for users to analyse media files and view results, ensuring a reliable and efficient solution for identifying manipulated video content in the media.
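By way of illustration only, the following is a minimal Python sketch of the end-to-end flow summarized above. Every function shown is a hypothetical placeholder standing in for the corresponding unit of the system (100); it is not the actual implementation, and the placeholder bodies merely keep the sketch runnable.

```python
# Illustrative end-to-end sketch: pre-process, split into chunks, embed, classify.
# All functions are hypothetical placeholders, not the system's actual components.
from typing import Dict, List
import numpy as np

def preprocess_visual(frames: np.ndarray) -> np.ndarray:
    """Placeholder for sampling / filtering of the visual information."""
    return frames

def split_into_chunks(frames: np.ndarray, size: int = 15) -> List[np.ndarray]:
    """Placeholder: split the frames into chunks of `size` frames each."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def embed_and_score(chunk: np.ndarray) -> float:
    """Placeholder for feature extraction, transformer embedding, and classification."""
    return float(np.random.rand())   # stands in for a learned model's score

def analyse(frames: np.ndarray, threshold: float = 0.5) -> List[Dict]:
    chunks = split_into_chunks(preprocess_visual(frames))
    return [{"chunk_index": i, "synthetic": embed_and_score(c) > threshold}
            for i, c in enumerate(chunks)]

# Example: 60 dummy frames of 224x224 RGB video
report = analyse(np.zeros((60, 224, 224, 3), dtype=np.uint8))
```

Each placeholder is elaborated with more concrete sketches in the exemplary embodiments that follow.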
[0029] Figure 1 is a block diagram that illustrates a system (100) for media analysis, in accordance with at least one embodiment of the present disclosure. The system (100) typically includes an application server (101), a database server (102), a communication network (103), and a user computing device (104). The application server (101), the database server (102), and the user computing device (104) are typically communicatively coupled with each other via the communication network (103). In an embodiment, the application server (101) may communicate with the database server (102) and the user computing device (104) using one or more protocols such as, but not limited to, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), RF mesh, Bluetooth Low Energy (BLE), and the like.
[0030] In an embodiment, the database server (102) may refer to a computing device that may be configured to store the received input media files, the pre-processed input media files, the one or more chunks corresponding to the plurality of frames, the extracted one or more features, the generated embeddings, classification scores, and the classified one or more chunks. In an embodiment, the database server (102) may include a special purpose operating system specifically configured to perform one or more database operations on the received input media file. In an embodiment, the database server (102) may include one or more instructions specifically for storing the training data used to enhance the performance of the system's models including a feeder neural network model, the transformer encoder model, and the classification model. Examples of database operations may include, but are not limited to, storing, retrieving, comparing, and updating data related to the processing of media. In an embodiment, the database server (102) may include hardware that may be configured to perform the processing of the video. In an embodiment, the database server (102) may be realized through various technologies such as, but not limited to, Microsoft® SQL Server, Oracle®, IBM DB2®, Microsoft Access®, PostgreSQL®, MySQL®, SQLite®, distributed database technology and the like. In an embodiment, the database server (102) may be configured to utilize the application server (101) for storage and retrieval of data used for media analysis to identify synthetic portions within video content.
[0031] A person with ordinary skill in the art will understand that the scope of the disclosure is not limited to the database server (102) as a separate entity. In an embodiment, the functionalities of the database server (102) can be integrated into the application server (101) or into the user computing device (104).
[0032] In an embodiment, the application server (101) may refer to a computing device or a software framework hosting an application or a software service. In an embodiment, the application server (101) may be implemented to execute procedures such as, but not limited to, programs, routines, or scripts stored in one or more memories for supporting the hosted application or the software service. In an embodiment, the hosted application or the software service may be configured to perform the processing of the video. The application server (101) may be realized through various types of application servers such as, but not limited to, a Java application server, a .NET framework application server, a Base4 application server, a PHP framework application server, or any other application server framework.
[0033] In another embodiment, the application server (101) may be configured to utilize the database server (102) and the user computing device (104), in conjunction, for media analysis. In an implementation, the application server (101) is configured for an automated processing of the input media file including both video and audio components in various formats, such as MP4, AVI, and MKV, to identify synthetic portions within the input media file. Further, the system (100) comprising the application server (101) performs various operations, including receiving the input media file, pre-processing video components of the received input media file, splitting the pre-processed video components of the input media file into one or more chunks, and extracting the one or more features corresponding to each chunk from the one or more chunks. In an embodiment, each chunk corresponds to the plurality of frames of the video components of the input media file. Furthermore, the application server (101) generates embeddings of each chunk from the one or more chunks using the transformer encoder model. Each chunk is further classified as either the real chunk or the synthetic chunk based on the generated embeddings. This process ensures the identification of synthetic portions for subsequent operations performed by the application server (101).
[0034] In yet another embodiment, the application server (101) may be configured to receive the input media file. The received input media file may comprise an audio information and a visual information.
[0035] In yet another embodiment, the application server (101) may be configured to pre-process the visual information of the received input media file.
[0036] In yet another embodiment, the application server (101) may be configured to split the visual information of the pre-processed input media file into one or more chunks. In an embodiment, each chunk from the one or more chunks may correspond to the plurality of frames of the visual information of the received input media file.
[0037] In yet another embodiment, the application server (101) may be configured to extract the one or more features corresponding to each chunk from the one or more chunks.
[0038] In yet another embodiment, the application server (101) may be configured to generate embeddings of each chunk from the one or more chunks, using the transformer encoder model based on the extracted one or more features.
[0039] In yet another embodiment, the application server (101) may be configured to classify, via the classification model, each chunk from the one or more chunks, to either a real chunk or a synthetic chunk based on the generated embeddings.
[0040] In yet another embodiment, the application server (101) may be configured to provide each chunk from the one or more chunks with an indication of either the real chunk or the synthetic chunk.
[0041] In an embodiment, the communication network (103) may correspond to a communication medium through which the application server (101), the database server (102), and the user computing device (104) may communicate with each other. Such communication may be performed in accordance with various wired and wireless communication protocols. Examples of such wired and wireless communication protocols include, but are not limited to, Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), Wireless Application Protocol (WAP), File Transfer Protocol (FTP), ZigBee, EDGE, infrared (IR), IEEE 802.11, 802.16, 2G, 3G, 4G, 5G, 6G, 7G, cellular communication protocols, and/or Bluetooth (BT) communication protocols. The communication network (103) may either be a dedicated network or a shared network. Further, the communication network (103) may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like. The communication network (103) may include, but is not limited to, the Internet, intranet, a cloud network, a Wireless Fidelity (Wi-Fi) network, a Wireless Local Area Network (WLAN), a Local Area Network (LAN), a cable network, the wireless network, a telephone network (e.g., Analog, Digital, POTS, PSTN, ISDN, xDSL), a telephone line (POTS), a Metropolitan Area Network (MAN), an electronic positioning network, an X.25 network, an optical network (e.g., PON), a satellite network (e.g., VSAT), a packet-switched network, a circuit-switched network, a public network, a private network, and/or other wired or wireless communications network configured to carry data.
[0042] In an embodiment, the user computing device (104) may comprise one or more processors and one or more memories. The one or more memories may include computer readable code that may be executable by the one or more processors to perform the processing of the video. In an embodiment, the user computing device (104) may present a web user interface to transmit the user input to the application server (101). Example web user interfaces presented on the one or more portable devices may display each chunk from the one or more chunks with the indication of either the real chunk or the synthetic chunk to the user, to facilitate interaction within the system (100). Examples of the user computing devices may include, but are not limited to, a personal computer, a laptop, a personal digital assistant (PDA), a mobile device, a tablet, or any other computing device.
[0043] The system (100) can be implemented using hardware, software, or a combination of both, which includes using, where suitable, one or more computer programs, mobile applications, or “apps” by deploying either on-premises over the corresponding computing terminals or virtually over cloud infrastructure. The system (100) may include various micro-services or groups of independent computer programs which can act independently in collaboration with other micro-services. The system (100) may also interact with a third-party or external computer system. Internally, the system (100) may be the central processor of all requests for transactions by the various actors or users of the system. A critical attribute of the system (100) is that it can leverage the feeder neural network model, the transformer encoder model and the classification model to process various formats of media for identifying synthetic portions within the media. The system extracts the one or more features, generates embeddings, and classifies each chunk from the one or more chunks into the real chunk or the synthetic chunk. In a specific embodiment, the system (100) is implemented for media analysis.
[0044] Figure 2 illustrates a block diagram showing an overview of various components of the application server (101) configured for media analysis, in accordance with at least one embodiment of the present disclosure. Figure 2 is explained in conjunction with elements from Figure 1. In an embodiment, the application server (101) includes a processor (201), a memory (202), a transceiver (203), an input/output unit (204), a user interface unit (205), a receiving unit (206), a pre-processing unit (207), an embedding generation unit (208), and a classification unit (209). The processor (201) may be communicatively coupled to the memory (202), the transceiver (203), the input/output unit (204), the user interface unit (205), the receiving unit (206), the pre-processing unit (207), the embedding generation unit (208), and the classification unit (209). The transceiver (203) may be communicatively coupled to the communication network (103) of the system (100).
[0045] In an embodiment, the application server (101) may be configured to receive the input media file. The received input media file may comprise an audio information and a visual information. Further, the application server (101) may be configured to pre-process the visual information of the received input media file. Further, the application server (101) may be configured to split the visual information of the pre-processed input media file into the one or more chunks. In an embodiment, each chunk from the one or more chunks may correspond to the plurality of frames of the visual information of the received input media file. Furthermore, the application server (101) may be configured to extract the one or more features corresponding to each chunk from the one or more chunks. Furthermore, the application server (101) may be configured to generate embeddings of each chunk from the one or more chunks, using the transformer encoder model based on the extracted one or more features. Furthermore, the application server (101) may be configured to classify each chunk from the one or more chunks, to either the real chunk or the synthetic chunk based on the generated embeddings, using the classification model. Moreover, the application server (101) may be configured to provide each chunk from the one or more chunks with the indication of either the real chunk or the synthetic chunk.
[0046] The processor (201) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to execute a set of instructions stored in the memory (202), and may be implemented based on several processor technologies known in the art. The processor (201) works in coordination with the memory (202), the transceiver (203), the input/output unit (204), the user interface unit (205), the receiving unit (206), the pre-processing unit (207), the embedding generation unit (208), and the classification unit (209) for media analysis. Examples of the processor (201) include, but are not limited to, a standard microprocessor, microcontroller, central processing unit (CPU), an X86-based processor, a Reduced Instruction Set Computing (RISC) processor, an Application-Specific Integrated Circuit (ASIC) processor, and a Complex Instruction Set Computing (CISC) processor, distributed or cloud processing unit, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions and/or other processing logic that accommodates the requirements of the present invention.
[0047] The memory (202) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to store the set of instructions, which are executed by the processor (201). Preferably, the memory (202) is configured to store one or more programs, routines, or scripts that are executed in coordination with the processor (201). Additionally, the memory (202) may include any computer-readable medium or computer program product known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic random-access memory (DRAM), and/or non-volatile memory, such as read-only memory (ROM), erasable programmable ROM, a Hard Disk Drive (HDD), flash memories, Secure Digital (SD) card, Solid State Disks (SSD), optical disks, magnetic tapes, memory cards, virtual memory and distributed cloud storage. The memory (202) may be removable, non-removable, or a combination thereof. Further, the memory (202) may include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types. The memory (202) may include programs or coded instructions that supplement the applications and functions of the system (100). In one embodiment, the memory (202), amongst other things, serves as a repository for storing data processed, received, and generated by one or more of the programs or the coded instructions. In yet another embodiment, the memory (202) may be managed under a federated structure that enables the adaptability and responsiveness of the application server (101).
[0048] The transceiver (203) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to receive, process or transmit information, data or signals, which are stored by the memory (202) and executed by the processor (201). The transceiver (203) is preferably configured to receive, process or transmit, one or more programs, routines, or scripts that are executed in coordination with the processor (201). The transceiver (203) is preferably communicatively coupled to the communication network (103) of the system (100) for communicating all the information, data, signals, programs, routines or scripts through the communication network (103).
[0049] The transceiver (203) may implement one or more known technologies to support wired or wireless communication with the communication network (103). In an embodiment, the transceiver (203) may include but is not limited to, an antenna, a radio frequency (RF) transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a Universal Serial Bus (USB) device, a coder-decoder (CODEC) chipset, a subscriber identity module (SIM) card, and/or a local buffer. Also, the transceiver (203) may communicate via wireless communication with networks, such as the Internet, an Intranet and/or a wireless network, such as a cellular telephone network, a wireless local area network (LAN) and/or a metropolitan area network (MAN). Accordingly, the wireless communication may use any of a plurality of communication standards, protocols and technologies, such as: Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), wideband code division multiple access (W-CDMA), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wireless Fidelity (Wi-Fi) (e.g., IEEE 802.11a, IEEE 802.11b, IEEE 802.11g and/or IEEE 802.11n), voice over Internet Protocol (VoIP), Wi-MAX, a protocol for email, instant messaging, and/or Short Message Service (SMS).
[0050] The input/output (I/O) unit (204) comprises suitable logic, circuitry, interfaces, and/or code that may be configured to receive or present information. The input/output unit (204) comprises various input and output devices that are configured to communicate with the processor (201). Examples of the input devices include but are not limited to, a keyboard, a mouse, a joystick, a touch screen, a microphone, a camera, and/or a docking station. Examples of the output devices include, but are not limited to, a display screen and/or a speaker. The I/O unit (204) may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O unit (204) may allow the system (100) to interact with the user directly or through the user computing devices (104). Further, the I/O unit (204) may enable the system (100) to communicate with other computing devices, such as web servers and external data servers (not shown). The I/O unit (204) can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O unit (204) may include one or more ports for connecting a number of devices to one another or to another server. In one embodiment, the I/O unit (204) allows the application server (101) to be logically coupled to other user computing devices (104), some of which may be built in. Illustrative components include tablets, mobile phones, wireless devices, etc.
[0051] Further, the input/output (I/O) unit (204), comprising input devices such as a keyboard, touchpad, or trackpad, may be configured to allow users to select one or more portions of the received input media file to identify a synthetic portion within the media file. Further, the input/output (I/O) unit (204) may be configured to allow the user to select, via the UI, a specific chunk from the one or more chunks. In an embodiment, the microphone may be configured to receive the selection from the user. Additionally, the system (100) may allow users to upload media files from external storage devices or cloud services, enabling versatile methods for inputting media for analysis.
[0052] In an exemplary embodiment, the input media may include any one of social media videos, educational videos (like online courses, tutorials), surveillance footage, video conference recordings, webinars, news broadcasts, TV shows, movies, advertisements, live event recordings, or a combination thereof.
[0053] Further, the user interface unit (205) may include the user interface (UI) displaying specific operations such as receiving selections corresponding to the user and presenting the classified input media file to the user. The user interface unit (205) may feature the UI designed to facilitate interaction with the system (100) for media analysis. In an exemplary embodiment, the UI may allow users to upload the input media file for analysis. In an exemplary embodiment, the UI may allow the users to select the one or more portions from the media file for analysis. In another exemplary embodiment, the UI may allow users to select the specific chunk from the one or more chunks to determine whether the selected chunk is real or synthetic. Users can interact with the UI through voice or text commands to initiate various operations, such as executing media analysis tasks or adjusting parameters based on their preferences. In an exemplary embodiment, the UI may display the current or intermediate status of the media analysis process, and the classified portions of the media file, to the user. Additionally, the user interface (UI) of the system is designed to support multiple media formats, enabling user interaction with the input media file in the form of video. In an exemplary embodiment, the UI may allow users to manage and modify analysis tasks of the video content in real time, ensuring that their specific needs are met. The UI also displays relevant information regarding the analysis process, such as the pre-processing status, object extraction, and analysis of the one or more chunks. Furthermore, the UI may present the classification of each chunk as real or synthetic through visual indicators, alerts, and detailed notifications, keeping users informed of the analysis outcomes, thereby enhancing user engagement with the system. This addresses the limitations of conventional systems by accurately identifying synthetic portions within the received input media file.
[0054] Further, the receiving unit (206) may be configured to receive the input media file. In an exemplary embodiment, the receiving unit (206) may allow the system to receive the input media files including an audio information and a visual information. The visual information may be received in various formats, such as MP4, AVI, and MKV, or a combination thereof. In an exemplary embodiment, the receiving unit (206) may verify the file format and size to ensure compatibility with the system's processing capabilities. Upon successful validation, the receiving unit (206) may transmit the visual information of the input media file to the processor (201) for further analysis. It is important to note that the scope of the media analysis performed by the system (100) is limited to analysing the visual information of the input media file. The processing of the audio information of the received input media file is not covered in this application.
[0055] Further, the pre-processing unit (207) may be configured to pre-process the visual information of the received input media file. In an embodiment, the pre-processing unit (207) may be configured to pre-process the visual information of the received input media file for sampling the received input media file based on a predefined frame sampling rate. The sampling of the received input media file may be used to transform the visual information of the received input media file into a compatible frame sampling rate of the system (100). In another embodiment, the pre-processing unit (207) may be configured for filtering the sampled input media file in the time domain to reduce either noise or artifacts in the frequency domain, thereby preparing the media file for further analysis.
[0056] In an exemplary embodiment, the pre-processing unit (207) receives the input media file containing background noise, such as a video file of an outdoor interview with ambient light and high-contrast objects. For example, upon receiving the video file, the pre-processing unit (207) may sample the visual information of the video to the predefined sampling rate of 30 fps (frames per second), optimizing the visual data for further analysis. The pre-processing unit (207) may then apply a low-pass filter in the time domain to reduce background noise, while retaining the clarity of the interviewee’s visual content. The resulting pre-processed video file is then ready for further analysis.
[0057] In an exemplary embodiment, filtering of the sampled input media file may be performed using one of a Hamming window, a Hanning window, a low-pass filter, a high-pass filter, a band-pass filter, or a combination thereof.
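As an illustration of the sampling and time-domain filtering described above, the following Python sketch uses OpenCV to sample the visual information at an assumed 30 fps and applies a simple moving-average low-pass filter over time. The file name, target frame rate, and window size are assumptions for illustration; a Hamming or Hanning window could be substituted in the same place.

```python
# Sketch: sample frames at a predefined rate, then low-pass filter along time.
import cv2
import numpy as np

def sample_frames(path: str, target_fps: float = 30.0) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(src_fps / target_fps))   # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0,))

def temporal_low_pass(frames: np.ndarray, window: int = 3) -> np.ndarray:
    """Moving average over the time axis; a simple stand-in for the
    Hamming/Hanning/low-pass options listed above."""
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 0, frames.astype(np.float32))
    return smoothed.astype(np.uint8)

frames = sample_frames("input.mp4")   # "input.mp4" is an assumed example path
if frames.size:
    frames = temporal_low_pass(frames)
```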
[0058] In yet another embodiment, the pre-processing unit (207) may be configured to pre-process the visual information of the received input media file to identify the one or more portions associated with one or more users, within the received input media file. The one or more portions may correspond to facial information associated with the one or more users within the received input media file. In one embodiment, the facial information associated with the one or more users may be identified by using one of Multi-task Cascaded Convolutional Networks (MTCNN), Yoloface, deepFace, retinaFace, FaceNet or a combination thereof.
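For instance, the facial information may be located with the MTCNN detector from the facenet-pytorch package, one of the options listed above; the confidence cut-off of 0.9 and the use of keep_all are illustrative assumptions rather than values prescribed by this disclosure.

```python
# Sketch: detect facial portions in a frame with MTCNN (facenet-pytorch).
from facenet_pytorch import MTCNN
from PIL import Image
import numpy as np

detector = MTCNN(keep_all=True)  # report every face found in a frame

def facial_portions(frame_rgb: np.ndarray):
    """Return bounding boxes (x1, y1, x2, y2) for each detected face."""
    boxes, probs = detector.detect(Image.fromarray(frame_rgb))
    if boxes is None:
        return []
    return [box for box, p in zip(boxes, probs) if p is not None and p > 0.9]
```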
[0059] In an exemplary embodiment, the user may select a portion of the video file that features a female actor in a scene with both male and female performers. The user may want to identify synthetic manipulations by focusing on the specific portion of the video containing the female actor. The pre-processing unit (207) is configured to isolate the selected portion of the video, ensuring that only the relevant frames containing the female actor are processed for further analysis. This targeted approach enhances the accuracy and efficiency of detecting any synthetic alterations, allowing the system to focus on the portion selected by the user.
[0060] Furthermore, the pre-processing unit (207) may be configured to split the visual information of the pre-processed input media file into the one or more chunks. In an embodiment, each chunk from the one or more chunks may correspond to the plurality of frames of the visual information of the received input media file. In an embodiment, the plurality of frames may be associated with either a facial image or a non-facial image.
[0061] In an exemplary embodiment, after receiving the pre-processed input video, the pre-processing unit (207) may segment the video into the one or more chunks. For example, if the pre-processed video is 60 seconds long, the pre-processing unit (207) may be configured to divide the video into the one or more chunks based on either a predefined time interval or based on a predefined number of frames. In an implementation, the received video may be divided into the one or more chunks of 15 seconds each. In another implementation, the received video may be divided into the one or more chunks with 15 frames per chunk. Each chunk from the one or more chunks may include both facial images of the actors and non-facial images of the surrounding environment. This targeted segmentation enables the system to analyse specific portions of the video more effectively, facilitating the identification of any synthetic alterations in the facial expressions of the actors or other elements within the scene. In one embodiment, the splitting of the visual information of the pre-processed input media file may be performed by grouping the predefined number of frames for each face from the facial information associated with one or more faces of the one or more users. In an instance, each chunk from the one or more chunks may comprise 15 frames associated with a face from the one or more faces in the received input media file.
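The grouping described above can be sketched as follows, assuming a hypothetical mapping faces_by_id from each detected face identifier to its ordered list of frames; the value of 15 frames per chunk mirrors the example in this paragraph.

```python
# Sketch: group the frames of each face into fixed-size chunks (15 frames here).
from typing import Dict, List
import numpy as np

def split_per_face(faces_by_id: Dict[int, List[np.ndarray]],
                   frames_per_chunk: int = 15) -> Dict[int, List[List[np.ndarray]]]:
    chunks = {}
    for face_id, frames in faces_by_id.items():
        chunks[face_id] = [frames[i:i + frames_per_chunk]
                           for i in range(0, len(frames), frames_per_chunk)]
    return chunks
```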
[0062] In an exemplary embodiment, when the input media file contains a human face, the system analyses the input media file to identify whether the face is real or fake, including instances of deepfake generation or lip-syncing. For example, in the video featuring a celebrity endorsing a product, the system may analyze the facial movements and expressions to identify any manipulations that suggest the face is artificially generated or not.
[0063] In another exemplary embodiment, when the input media file lacks the human face, the system focuses on determining whether the overall scene is real or fake. This includes evaluating background elements, objects, and environmental scenes to ascertain whether the scene is real or fake. For instance, in the video showcasing a picturesque landscape, the system may analyze visual features, such as lighting inconsistencies or unnatural textures, to detect if the scenery is digitally created or altered.
[0064] In an embodiment, each chunk from the one or more chunks, obtained by the splitting, may be partially overlapping with an adjacent chunk of the one or more chunks.
[0065] In an exemplary embodiment, the splitting of the media into overlapping chunks may significantly enhance accuracy of detecting synthetic alterations, particularly when small changes occur at different parts of the video. For example, consider a 10-second video analysed for synthetic manipulations. The splitting unit divides the video into overlapping chunks to ensure that important facial details and background elements at boundaries of each chunk are analysed. For example, without overlap, the system may create different chunks such as 0-2 seconds, 2-4 seconds, and so on, which may miss small manipulations occurring at the chunk boundaries. However, by introducing a 50% overlap, the chunks might become 0-2 seconds, 1-3 seconds, 2-4 seconds, and so on. This overlapping allows each chunk to share some frames with its adjacent chunks, ensuring that critical features such as motion continuity, facial expressions, or lighting changes across boundaries are captured and not missed in the analysis. This overlapping approach enhances the accuracy of detecting synthetic manipulations, particularly when small changes occur at different parts of the video across multiple segments of the video.
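The overlapping split in this example can be sketched as below; the 2-second window, 30 fps sampling rate, and 50% overlap are the illustrative values used above, so boundary frames appear in two adjacent chunks.

```python
# Sketch: split frames into overlapping chunks (e.g. 0-2 s, 1-3 s, 2-4 s, ...).
import numpy as np

def overlapping_chunks(frames: np.ndarray, fps: int = 30,
                       window_s: float = 2.0, overlap: float = 0.5):
    size = int(window_s * fps)                   # frames per chunk (60 here)
    stride = max(1, int(size * (1 - overlap)))   # 30-frame hop = 1 s with 50% overlap
    return [frames[start:start + size]
            for start in range(0, max(1, len(frames) - size + 1), stride)]

video = np.zeros((300, 224, 224, 3), dtype=np.uint8)  # 10 s at 30 fps (dummy)
print(len(overlapping_chunks(video)))                  # 9 overlapping chunks
```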
[0066] In an embodiment, the receiving unit (206) may be configured to receive a specific chunk from the one or more chunks to determine whether the selected chunk is real or synthetic. The selection of the specific chunk for analysis is performed by the user. This allows for more targeted analysis of relevant chunks within the media to identify a synthetic portion. The selection of a particular chunk from the one or more chunks, and further analysis of the particular chunk only, leads to quicker processing by the whole system (100) along with less computational processing and less storage, which overall optimizes the functionality of the system (100) over the conventional systems.
[0067] In an exemplary embodiment, the receiving unit (206) may be configured to allow the user to select the specific chunk from the one or more chunks of the media. For example, the user might focus on a particular chunk of the video where the user suspects manipulation, such as a visually suspicious scene or an altered face. The selected chunk is then analysed to determine whether it is real or synthetic, helping the user verify the authenticity of the selected chunk.
[0068] Furthermore, the pre-processing unit (207) may be configured to extract the one or more features corresponding to each chunk from the one or more chunks. For example, each chunk may be individually examined for features such as facial key points, facial expressions, and lighting inconsistencies, facilitating more accurate classification in subsequent processing steps.
[0069] In an exemplary embodiment, the one or more features may include one of facial key points, swapped faces, cropped faces in RGB color space, YUV color space, masked faces, focused eye region, lip region, facial expression, number of pixels, number of persons, video only, audio-video, file format, frame rate, compression artifacts, lighting inconsistencies, reflections, shadows, motion blur, noise patterns, or a combination thereof. In an embodiment, the pre-processing unit (207) may be configured to stack the extracted one or more features, before passing to a feeder neural network model, to create a volume of input to the feeder neural network model. In an embodiment, a separate stack is created for the plurality of frames associated with either the facial image or the non-facial image.
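A possible way to stack the per-frame features into an input volume, keeping separate stacks for facial and non-facial frames as described above, is sketched below; the feature shapes and the boolean flags are assumptions for illustration only.

```python
# Sketch: stack per-frame feature maps into facial and non-facial input volumes.
import numpy as np

def stack_features(per_frame_features, is_facial_flags):
    """per_frame_features: list of (H, W, C) arrays, one per frame.
    is_facial_flags: parallel list of booleans (facial image or not)."""
    facial = [f for f, flag in zip(per_frame_features, is_facial_flags) if flag]
    non_facial = [f for f, flag in zip(per_frame_features, is_facial_flags) if not flag]
    facial_stack = np.stack(facial) if facial else None         # (T_f, H, W, C)
    non_facial_stack = np.stack(non_facial) if non_facial else None
    return facial_stack, non_facial_stack
```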
[0070] In an exemplary embodiment, the feeder neural network model may correspond to one of CNN (convolutional neural network), VGG (Visual Geometry Group), 3D-CNNs, fully connected layers, sigmoid activation, ReLU (Rectified Linear Unit) activation, dropout layers, ResNet (Residual Networks), RawNet2, LSTM (Long short-term memory), Deep Neural Network (DNN), or a combination thereof.
[0071] Furthermore, the embedding generation unit (208) may be configured to generate embeddings of each chunk from the one or more chunks, using a transformer encoder model, based on the extracted one or more features. In an embodiment, the embeddings generated by the transformer encoder model may be multi-dimensional vector representation of each chunk from the one or more chunks. Furthermore, the embedding generation unit (208) may be configured to input the extracted one or more features to the feeder neural network model based on the created volume of input, before passing to the transformer encoder model, for obtaining a changed activation volume of an input to the transformer encoder model. In an exemplary embodiment, the transformer encoder model may correspond to one of VideoMAE, XClip, TimeSformer, ViViT, Vision Transformers (ViTs), BEiT (BERT Pre-Training of Image Transformers), CAiT (Class-Attention in Image Transformers), DeiT (Data-efficient Image Transformers) or a combination thereof.
[0072] In an exemplary embodiment, embeddings may correspond to vector representations of the one or more features. These embeddings may be multi-dimensional vector representations of each chunk, for example, represented as vectors with dimensions that may range from 512 to 2048. Specifically, for video processing, each input video chunk may be transformed into a vector of floating-point numbers; for example, in the case of the VideoMAE transformer model, the encoder processes input video chunks consisting of 16 frames to obtain a 768-dimensional vector representation. This embedding compresses essential information from the video into a vector format, which can be used by the system for further analysis, such as classifying whether the video chunk is real or synthetic with greater accuracy and efficiency.
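As a hedged illustration, the 768-dimensional chunk embedding mentioned above could be obtained with a pre-trained VideoMAE encoder from the Hugging Face transformers library; the checkpoint name ("MCG-NJU/videomae-base") and the mean-pooling over tokens are assumptions not specified in this disclosure.

```python
# Sketch: embed one 16-frame chunk with a pre-trained VideoMAE encoder.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
encoder = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

chunk = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]  # dummy frames
inputs = processor(chunk, return_tensors="pt")        # pixel_values: (1, 16, 3, 224, 224)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (1, tokens, 768)
embedding = hidden.mean(dim=1).squeeze(0)             # 768-dimensional chunk embedding
print(embedding.shape)                                # torch.Size([768])
```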
[0073] In another exemplary embodiment, in the case of the transformer encoder model, the feeder neural network may generate an activation volume that aligns with the specific input dimensions expected by the transformer encoder model. For example, an output activation volume of a convolutional layer in a neural network could be 224x224x10, representing 10 channels with each channel having a height and a width of 224x224. The feeder neural network model is designed so that its output activation volume aligns with the input dimensions required by the transformer encoder model, namely VideoMAE, which may expect input in a specific format, such as 16x224x224x3. Here, 224x224x3 corresponds to an image with height and width of 224x224 and 3 channels (RGB), and the input consists of 16 such images stacked together. Thus, the embedding generation unit (208) ensures that the output activation volume of the feeder neural network model matches the input dimensions expected by the transformer encoder model, enabling seamless data flow and ensuring effective training of both models.
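The alignment of the feeder network's output activation volume with the encoder's expected 16x224x224x3 input can be sketched in PyTorch as below; the layer sizes and the 10 input channels are illustrative assumptions.

```python
# Sketch: a feeder CNN whose output matches a VideoMAE-style encoder input.
import torch
import torch.nn as nn

class FeederNet(nn.Module):
    def __init__(self, in_channels: int = 10):
        super().__init__()
        # Reduce a per-frame feature volume (10 channels at 224x224) to the
        # 3 RGB-like channels the transformer encoder expects.
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (frames, in_channels, 224, 224) -> (frames, 3, 224, 224)
        return self.net(x)

feeder = FeederNet()
features = torch.randn(16, 10, 224, 224)                   # 16 frames of stacked features
encoder_input = feeder(features).reshape(1, 16, 3, 224, 224)  # encoder-ready volume
```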
[0074] Furthermore, the classification unit (209) may be configured to utilize a classification model to classify each chunk from the one or more chunks, to either the real chunk or the synthetic chunk based on the generated embeddings.
[0075] In an exemplary embodiment, the input video chunk may first be processed by the transformer encoder model, which converts the input into an embedding, represented as a high-dimensional vector such as the 768-dimensional vector in the case of VideoMAE. The size of these embeddings may vary, typically ranging from 512 to 2048 dimensions. Furthermore, the classification unit (209) may be trained to transform these embeddings into a two-dimensional vector [a, b], where a = 1 - b. In an embodiment, the output "a" may represent the probability of the input chunk being classified as real, while "b" may represent the probability of the input chunk being classified as synthetic.
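A minimal sketch of such a classification head, assuming a 768-dimensional embedding and a softmax output so that the two-dimensional vector [a, b] satisfies a = 1 - b, is shown below.

```python
# Sketch: linear classification head over a chunk embedding, softmax output [a, b].
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.linear = nn.Linear(embed_dim, 2)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.linear(embedding), dim=-1)  # [P(real), P(synthetic)]

head = ClassificationHead()
probs = head(torch.randn(768))        # tensor([a, b]) with a + b = 1
is_synthetic = probs[1].item() > 0.5
```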
[0076] In an embodiment, the classification model may be configured to generate a classification score for each chunk from the one or more chunks. In an embodiment, the classification model may be configured to compare the classification score of each chunk with a predefined classification threshold to classify the chunk as the real chunk or the synthetic chunk. In an exemplary embodiment, the classification model may correspond to one of logistic regression, random forest, k-nearest neighbor (k-NN), support vector machines (SVM), or a combination thereof.
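As an illustration of the classical classifiers listed above, the following sketch trains a scikit-learn logistic regression on chunk embeddings and applies the predefined threshold of 0.5; the randomly generated training data is purely a placeholder.

```python
# Sketch: logistic regression over chunk embeddings with a 0.5 threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))      # 200 placeholder chunk embeddings
y_train = rng.integers(0, 2, size=200)     # 0 = real, 1 = synthetic

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def classify_chunk(embedding: np.ndarray, threshold: float = 0.5) -> str:
    score = clf.predict_proba(embedding.reshape(1, -1))[0, 1]  # classification score
    return "synthetic" if score > threshold else "real"

print(classify_chunk(rng.normal(size=768)))
```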
[0077] In an embodiment, the classification unit (209) may be configured for training the classification model using the one or more chunks of the input media file. In an embodiment, during the training, the classification unit (209) may be configured to utilize one or more loss functions on the one or more chunks, to identify a wrong classification by the classification model and further update the classification model using a back propagation technique. In an exemplary embodiment, the one or more loss functions may include a cross-entropy loss model, a triplet loss model, L2 regularizer, or a combination thereof.
[0078] In an exemplary embodiment, the training process is a standard procedure for any neural network model and is used to enhance the system's classification capabilities. For example, each input is fed into the classification model, which generates a score from a classification head, where an ideal score approaches 1 for fake chunks and approaches 0 for real chunks. If the model outputs an incorrect score, such as predicting a value close to 0 for a fake chunk or vice versa, a loss function, such as cross-entropy loss or triplet loss, quantifies the discrepancy by assigning a high loss value to such erroneous predictions. This calculated loss is then utilized to update the model's parameters through the technique known as backpropagation. As a result, the classification model refines its understanding with each training iteration and learns to classify real and fake inputs.
[0079] In another exemplary embodiment, the classification between real and synthetic chunks is accomplished through the training of a neural network model designed specifically for classification. During training, the classification model learns to differentiate between real and fake chunks by utilizing an appropriate loss function, such as cross-entropy loss or triplet loss, which guides the optimization process. Upon completion of the training phase, the transformer encoder model generates embeddings including vector representations of the input data, ensuring that the embeddings of real chunks may be positioned distinctly apart from those of fake chunks. The final decision may be performed via a smaller neural network referred to as the classification head. This classification head outputs a score between 0 and 1, representing the likelihood of the input being real or fake. In practice, a threshold of 0.5 is commonly employed, so if the score falls below 0.5, the input is classified as the real chunk. Conversely, if the score exceeds 0.5, it is classified as the fake chunk. This systematic approach enables robust differentiation between real and synthetic chunks, enhancing the overall accuracy of the classification process.
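For purposes of illustration only, the following non-limiting sketch shows a single training iteration of a hypothetical classification head using cross-entropy loss and backpropagation; the optimizer, learning rate, batch contents, and layer sizes are illustrative assumptions and not the disclosed training procedure.

```python
# Illustrative sketch only: one training iteration of a hypothetical
# classification head with cross-entropy loss and backpropagation.
import torch
import torch.nn as nn

embed_dim = 768
head = nn.Linear(embed_dim, 2)                      # classification head: [real, synthetic] logits
criterion = nn.CrossEntropyLoss()                   # one of the loss functions mentioned above
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

# Stand-in batch: embeddings of 8 chunks with labels (0 = real, 1 = synthetic/fake).
embeddings = torch.randn(8, embed_dim)
labels = torch.tensor([0, 1, 0, 0, 1, 1, 0, 1])

logits = head(embeddings)                           # forward pass
loss = criterion(logits, labels)                    # high loss for wrong classifications

optimizer.zero_grad()
loss.backward()                                     # backpropagation of the loss
optimizer.step()                                    # parameter update
print(f"training loss: {loss.item():.4f}")
```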
[0080] In yet another exemplary embodiment, the neural network model is often trained using a set of training data to classify inputs accurately. However, a common challenge arises when the model performs exceptionally well on this training data yet struggles to classify new, unseen data effectively. This concept is known as "overfitting," where the model learns the noise and specific patterns of the training data instead of generalizing to new inputs. To mitigate overfitting, modifications may be made to the training objective, also referred to as the loss function. One effective method to prevent overfitting is the use of an L2 regularizer, which adds a penalty for larger weights in the model. By incorporating an L2 regularizer during training, the model is encouraged to develop a more generalized representation of the data, improving its performance on unseen data. As a result, a model trained with this regularization technique is better equipped to handle diverse inputs, enhancing its robustness and reliability in real-world applications.
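For purposes of illustration only, the following non-limiting sketch shows two common ways of applying an L2 penalty during training in PyTorch, either as an explicit term added to the loss or through the optimizer's weight_decay argument; the coefficient, model, and stand-in data are illustrative assumptions.

```python
# Illustrative sketch only: adding an L2 regularizer to the training objective
# to mitigate overfitting.
import torch
import torch.nn as nn

model = nn.Linear(768, 2)
criterion = nn.CrossEntropyLoss()
embeddings, labels = torch.randn(8, 768), torch.randint(0, 2, (8,))

# Option 1: explicit L2 term added to the loss.
l2_lambda = 1e-4
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = criterion(model(embeddings), labels) + l2_lambda * l2_penalty

# Option 2: equivalent effect via weight decay in the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=l2_lambda)
```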
[0081] Moreover, the application server (101) may be configured to provide each chunk from the one or more chunks with the indication of either the real chunk or the synthetic chunk, via the UI. In an embodiment, the indication may be provided to the user via the UI in different forms. In an exemplary embodiment, the indication of the real chunk and the fake chunk may be represented in the form of different colors assigned to the chunks. For example, a fake chunk may be displayed as a red-colored chunk and a real chunk may be displayed as a green-colored chunk.
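For purposes of illustration only, the following non-limiting sketch shows one hypothetical mapping from the classification result of each chunk to a display color for the UI indication; the data structure and color names are illustrative assumptions.

```python
# Illustrative sketch only: mapping each classified chunk to a display color
# (red for synthetic/fake, green for real) for the UI indication.
def chunk_color(is_synthetic: bool) -> str:
    return "red" if is_synthetic else "green"

chunks = [
    {"chunk_id": 0, "is_synthetic": False},
    {"chunk_id": 1, "is_synthetic": True},
]
for chunk in chunks:
    print(f"chunk {chunk['chunk_id']}: display as {chunk_color(chunk['is_synthetic'])}")
```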
[0082] Figure 3 illustrates a flowchart describing a method (300) for media analysis, in accordance with at least one embodiment of the present disclosure. The flowchart is described in conjunction with Figure 1 and Figure 2. The method (300) starts at step (301) and proceeds to step (307).
[0083] In operation, the method (300) may involve a variety of steps for media analysis.
[0084] At step (301), the method (300) comprises a step of receiving the input media file. The input media file comprises an audio information and a visual information.
[0085] At step (302), the method (300) comprises a step of pre-processing the visual information of the received input media file.
[0086] At step (303), the method (300) comprises a step of splitting the visual information of the pre-processed input media file into the one or more chunks. In an embodiment, each chunk from the one or more chunks corresponds to the plurality of frames of the visual information of the received input media file.
[0087] At step (304), the method (300) comprises a step of extracting the one or more features corresponding to each chunk from the one or more chunks.
[0088] At step (305), the method (300) comprises a step of generating embeddings of each chunk from the one or more chunks, using the transformer encoder model based on the extracted one or more features.
[0089] At step (306), the method (300) comprises a step of classifying via the classification model each chunk from the one or more chunks, to either the real chunk or the synthetic chunk based on the generated embeddings.
[0090] At step (307), the method (300) comprises a step of providing each chunk from the one or more chunks with the indication of either the real chunk or the synthetic chunk.
[0091] This sequence of steps may be repeated and continues until the system stops receiving the input media file.
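For purposes of illustration only, the following non-limiting sketch outlines steps (301) to (307) of the method (300) as plain Python functions; all function bodies are placeholders, and the helper names and the 0.5 threshold are illustrative assumptions rather than the disclosed implementation.

```python
# Illustrative sketch only: a high-level outline of method (300), steps (301)-(307).
from typing import List, Tuple

def preprocess(visual_info):                      # step (302)
    return visual_info

def split_into_chunks(visual_info) -> List:       # step (303)
    return [visual_info]                          # placeholder: a single chunk

def extract_features(chunk):                      # step (304)
    return chunk

def generate_embedding(features):                 # step (305): feeder + transformer encoder
    return features

def classify(embedding) -> Tuple[str, float]:     # step (306)
    score = 0.0                                   # placeholder classification score
    return ("synthetic" if score > 0.5 else "real", score)

def analyze_media(input_media_file) -> List[Tuple[str, float]]:
    audio_info, visual_info = input_media_file    # step (301): receive the media file
    visual_info = preprocess(visual_info)
    results = []
    for chunk in split_into_chunks(visual_info):
        features = extract_features(chunk)
        embedding = generate_embedding(features)
        results.append(classify(embedding))       # step (307): indication returned to the UI
    return results
```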
[0092] Detailed working examples of the present disclosure are described below.
[0093] Working Example 1
[0094] Imagine a video from a recent news broadcast is analyzed by an advanced media analysis system designed to differentiate between real and synthetic video chunks for applications such as content verification and security assessments by law enforcement agencies. When the system receives the video for analysis, the system pre-processes the video to eliminate noise or artifacts and enhance clarity. Next, the system splits the input video into the one or more chunks, each corresponding to the plurality of frames, for example isolating facial and non-facial scenes such as human faces and background elements. Additionally, the UI allows the user to select the specific chunk from the one or more chunks to be further analyzed. Further, the selected chunk from the one or more chunks is analyzed to extract the one or more features such as facial expressions, movement patterns, and lighting conditions. The system inputs the one or more extracted features into the feeder neural network model, which adjusts the activation volumes before passing them to the transformer encoder model, which then generates embeddings for each chunk.
[0095] In the classification phase, the classification model classifies each chunk as real or synthetic based on the generated embeddings. Further, each chunk is assigned a score indicating its likelihood of being real or synthetic. For instance, if the chunk receives the score of 0.9, it suggests a high probability of being synthetic, while the score of 0.2 indicates it is likely real. The system uses a threshold condition, say 0.5, to make its determination: if the score exceeds this threshold, the chunk is classified as synthetic; if it falls below, it is classified as real. The outcomes of these classifications are then provided to the user via the UI, enabling informed decision-making regarding the authenticity of the video content.
[0096] Working Example 2
[0097] Imagine a media company wants to ensure the authenticity of a news broadcast before airing it, as there are concerns about potential deepfake manipulations in the report. The analysis begins with the system receiving the input news broadcast video. The system first pre-processes the video to eliminate any noise, compression artifacts, or irrelevant data, adjusting frame rates and standardizing the video format to facilitate effective analysis.
[0098] Next, the system splits the video into multiple overlapping chunks, each corresponding to the plurality of frames that capture both facial images and non-facial elements. For example, the focus may be on the small facial movements of the news anchor as they deliver their speech, as well as the background imagery that could have been tampered with.
[0099] For each chunk, the system extracts the one or more features such as facial key points, lighting inconsistencies, motion blur, and pixel-level details. Further, the system assesses changes in eye regions and lip sync, while monitoring for compression artifacts or unnatural background movements that could signal manipulation.
[0100] Further, the extracted features are fed into the feeder neural network model to optimize the input video before passing it to the transformer encoder model, thereby enhancing the analysis by obtaining the changed activation volume of the input to the transformer encoder model. The system further generates embeddings from the extracted one or more features, using the transformer encoder model, by creating a vector representation of the content for deeper analysis.
[0101] Finally, using the classification model, the system classifies each chunk as either real or synthetic based on the generated embeddings. For example, if the system detects small manipulations in the anchor's facial movements or lighting that match known patterns of deepfake technology, the system classifies that chunk as synthetic.
[0102] This comprehensive analysis provides the media company with confidence in the authenticity of their broadcast before it goes live, ensuring that viewers receive accurate information.
[0103] A person skilled in the art will understand that the scope of the disclosure is not limited to scenarios based on the aforementioned factors and using the aforementioned techniques and that the examples provided do not limit the scope of the disclosure.
[0104] Figure 4 illustrates a block diagram (400) of an exemplary computer system (401) for implementing embodiments consistent with the present disclosure. Variations of computer system (401) may be used for media analysis to identify synthetic portions within the media. The computer system (401) may comprise a central processing unit (“CPU” or “processor”) (402). The processor (402) may comprise at least one data processor for executing program components for executing user- or system-generated requests. A user may include a person, a person using a device such as those included in this disclosure, or such a device itself. Additionally, the processor (402) may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, or the like. In various implementations, the processor (402) may include a microprocessor, such as AMD Athlon, Duron or Opteron, ARM’s application, embedded or secure processors, IBM PowerPC, Intel’s Core, Itanium, Xeon, Celeron or other line of processors, for example. Accordingly, the processor (402) may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application-specific integrated circuits (ASICs), digital signal processors (DSPs), or Field Programmable Gate Arrays (FPGAs), for example.
[0105] The processor (402) may be disposed in communication with one or more input/output (I/O) devices via an I/O interface (403). Accordingly, the I/O interface (403) may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMAX), or the like, for example.
[0106] Using the I/O interface (403), the computer system (401) may communicate with one or more I/O devices. For example, the input device (404) may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, or visors, for example. Likewise, an output device (405) may be a user’s smartphone, tablet, cell phone, laptop, printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), or audio speaker, for example. In some embodiments, a transceiver (406) may be disposed in connection with the processor (402). The transceiver (406) may facilitate various types of wireless transmission or reception. For example, the transceiver (406) may include an antenna operatively connected to a transceiver chip (example devices include the Texas Instruments® WiLink WL1283, Broadcom® BCM4750IUB8, Infineon Technologies® X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), and/or 2G/3G/5G/6G HSDPA/HSUPA communications, for example.
[0107] In some embodiments, the processor (402) may be disposed in communication with a communication network (408) via a network interface (407). The network interface (407) is adapted to communicate with the communication network (408). The network interface (407), coupled to the processor (402), may be configured to facilitate communication between the system and one or more external devices or networks. The network interface (407) may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, or IEEE 802.11a/b/g/n/x, for example. The communication network (408) may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), or the Internet, for example. Using the network interface (407) and the communication network (408), the computer system (401) may communicate with devices such as a laptop (409) or a mobile/cellular phone (410). Other exemplary devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system (401) may itself embody one or more of these devices.
[0108] In some embodiments, the processor (402) may be disposed in communication with one or more memory devices (e.g., RAM 413, ROM 414, etc.) via a storage interface (412). The storage interface (412) may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, or solid-state drives, for example.
[0109] The memory devices may store a collection of program or database components, including, without limitation, an operating system (416), user interface application (417), web browser (418), mail client/server (419), user/application data (420) (e.g., any data variables or data records discussed in this disclosure) for example. The operating system (416) may facilitate resource management and operation of the computer system (401). Examples of operating systems include, without limitation, Apple Macintosh OS X, UNIX, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like.
[0110] The user interface (417) is for facilitating the display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system (401), such as cursors, icons, check boxes, menus, scrollers, windows, or widgets, for example. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems’ Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, or web interface libraries (e.g., ActiveX, Java, JavaScript, AJAX, HTML, Adobe Flash, etc.), for example.
[0111] In some embodiments, the computer system (401) may implement a web browser (418) stored program component. The web browser (418) may be a hypertext viewing application, such as Microsoft Internet Explorer, Google Chrome, Mozilla Firefox, Apple Safari, or Microsoft Edge, for example. Secure web browsing may be provided using HTTPS (secure hypertext transport protocol), secure sockets layer (SSL), Transport Layer Security (TLS), or the like. Web browsers may utilize facilities such as AJAX, DHTML, Adobe Flash, JavaScript, Java, or application programming interfaces (APIs), for example. In some embodiments, the computer system (401) may implement a mail client/server (419) stored program component. The mail server (419) may be an Internet mail server such as Microsoft Exchange, or the like. The mail server may utilize facilities such as ASP, ActiveX, ANSI C++/C#, Microsoft .NET, CGI scripts, Java, JavaScript, PERL, PHP, Python, or WebObjects, for example. The mail server (419) may utilize communication protocols such as internet message access protocol (IMAP), messaging application programming interface (MAPI), Microsoft Exchange, post office protocol (POP), simple mail transfer protocol (SMTP), or the like. In some embodiments, the computer system (401) may implement a mail client (420) stored program component. The mail client (420) may be a mail viewing application, such as Apple Mail, Microsoft Entourage, Microsoft Outlook, or Mozilla Thunderbird.
[0112] In some embodiments, the computer system (401) may store user/application data (421), such as the data, variables, records, or the like as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase, for example. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, struct, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among the various computer systems discussed above in this disclosure. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
[0113] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present invention. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., to be non-transitory. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, non-volatile memory, hard drives, Compact Disc (CD) ROMs, Digital Video Discs (DVDs), flash drives, disks, and any other known physical storage media.
[0114] In light of the above-mentioned advantages and the technical advancements provided by the disclosed method and system, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps enable the following solutions to the existing problems in conventional technologies. Further, the claimed steps clearly bring an improvement in the functioning of the device itself as the claimed steps provide a technical solution to a technical problem.
[0115] Various embodiments of the disclosure provide a non-transitory computer readable medium and/or storage medium, and/or a non-transitory machine-readable medium and/or storage medium having stored thereon, a machine code and/or a computer program having at least one code section executable by a machine and/or a computer for media analysis. The at least one code section in the application server (101) causes the machine and/or computer including one or more processors to perform the steps, which include receiving (301) the input media file. The received input media file comprises an audio information and a visual information. Further, the processor (201) may perform a step of pre-processing (302) the visual information of the received input media file. Furthermore, the processor (201) may perform a step of splitting (303) the visual information of the pre-processed input media file into the one or more chunks. In an embodiment, each chunk from the one or more chunks corresponds to the plurality of frames of the visual information of the received input media file. Furthermore, the processor (201) may perform a step of extracting (304) the one or more features corresponding to each chunk from the one or more chunks. Furthermore, the processor (201) may perform a step of generating (305) embeddings of each chunk from the one or more chunks, using the transformer encoder model based on the extracted one or more features. Furthermore, the processor (201) may perform a step of classifying (306), via the classification model, each chunk from the one or more chunks, to either the real chunk or the synthetic chunk based on the generated embeddings. Moreover, the processor (201) may perform a step of providing (307) each chunk from the one or more chunks with the indication of either the real chunk or the synthetic chunk, via the UI.
[0116] Various embodiments of the disclosure encompass numerous advantages including the system for media analysis. The disclosed system and method have several technical advantages, including, but not limited to, the following:
• Efficient Media Analysis: The system is equipped to handle various types of media, including video formats, by splitting the media into the one or more chunks, extracting relevant features, and generating embeddings, which enables media analysis with accurate classification. The ability to classify real and synthetic portions enhances the detection of deepfakes or manipulated media content.
• Improved Accuracy through Chunk-Based Analysis: By splitting the input pre-processed media file into the one or more chunks, the system enables more precise detection of manipulated portions. This chunk-based analysis allows for the identification of synthetic portions that might otherwise be missed in conventional systems that treat media as a whole.
• Optimized Memory Usage: By focusing on smaller chunks, the system minimizes memory consumption during analysis. This ensures that devices with limited resources can effectively utilize the system, broadening its applicability across various hardware configurations.
• User-driven customization: The system empowers users to select specific portions of the media for analysis, allowing for more accurate and focused detection of synthetic media within shorter periods.
• Scalability for complex video content: The system can handle larger, more complex media without sacrificing accuracy or increasing processing times, enabling better performance across a variety of applications, such as journalism, security, and public trust.
• Enhanced Feature Extraction: The system’s ability to segment media into chunks based on the plurality of frames containing facial or non-facial images ensures that relevant information is analyzed separately.
• Transformer Encoder for Embedding Generation: The transformer encoder model generates embeddings from the modified features, enabling more advanced pattern recognition. This helps in identifying patterns specific to synthetic media and enhances the system's capability to distinguish real from synthetic media with higher precision.
[0117] In summary, these technical advantages solve the technical challenges associated with conventional systems for media analysis, including the limitations in processing complex video content in its entirety, increased memory and processing power requirements, inability to detect small manipulations in segmented frames, and inefficiency in extracting relevant features from the plurality of frames. By enabling users to select one or more portions of a whole media, as well as the specific chunk from the one or more chunks, the system facilitates targeted analysis, significantly reducing computational load and processing time. This makes real-time analysis more efficient and feasible. Furthermore, by employing advanced processing and classification techniques, the system enhances the accuracy in identifying synthetic portions within media. This approach mitigates the risk of misclassification and improves overall reliability in distinguishing between real and artificial content.
[0118] The claimed invention of the system and the method for media analysis involves tangible components, processes, and functionalities that interact to achieve specific technical outcomes. The system integrates various elements such as processors, memory, databases, and relevant units to analyse the received input media files by performing operations including pre-processing, splitting, and extracting one or more features corresponding to each chunk from the one or more chunks. Further, the system utilizes the feeder neural network model, the transformer encoder model and classification model, to accurately identify real versus synthetic chunks in the input media file.
[0119] Furthermore, the invention involves a non-trivial combination of technologies and methodologies that provide a technical solution to a technical problem. While individual components like processors, databases, encryption, authorization, and authentication are well-known in the field of computer science, their integration into a comprehensive system for media analysis brings about an improvement and technical advancement in the field of digital media by identifying synthetic portions in the media. The scope of this application is limited to the analysis of video data or video files only. In case a digital media file contains both visual and audio information, the claimed steps of this application are limited to analyzing only the visual data of the media file.
[0120] In light of the above mentioned advantages and the technical advancements provided by the disclosed method and system, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps enable the following solutions to the existing problems in conventional technologies. Further, the claimed steps clearly bring an improvement in the functioning of the device itself as the claimed steps provide a technical solution to a technical problem.
[0121] The present disclosure may be realized in hardware, or a combination of hardware and software. The present disclosure may be realized in a centralized fashion, in at least one computer system, or in a distributed fashion, where different elements may be spread across several interconnected computer systems. A computer system or other apparatus adapted for carrying out the methods described herein may be suited. A combination of hardware and software may be a general-purpose computer system with a computer program that, when loaded and executed, may control the computer system such that it carries out the methods described herein. The present disclosure may be realized in hardware that comprises a portion of an integrated circuit that also performs other functions.
[0122] A person with ordinary skills in the art will appreciate that the systems, modules, and sub-modules have been illustrated and explained to serve as examples and should not be considered limiting in any manner. It will be further appreciated that the variants of the above disclosed system elements, modules, and other features and functions, or alternatives thereof, may be combined to create other different systems or applications.
[0123] Those skilled in the art will appreciate that any of the aforementioned steps and/or system modules may be suitably replaced, reordered, or removed, and additional steps and/or system modules may be inserted, depending on the needs of a particular application. In addition, the systems of the aforementioned embodiments may be implemented using a wide variety of suitable processes and system modules, and are not limited to any particular computer hardware, software, middleware, firmware, microcode, and the like. The claims can encompass embodiments for hardware and software, or a combination thereof.
[0124] While the present disclosure has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted without departing from the scope of the present disclosure. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present disclosure without departing from its scope. Therefore, it is intended that the present disclosure is not limited to the particular embodiment disclosed, but that the present disclosure will include all embodiments falling within the scope of the appended claims.
[0125] From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects hereinabove set forth together with other advantages which are obvious, and which are inherent to the structure.
[0126] It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features or sub-combinations. This is contemplated by and is within the scope of the claims.
Claims:WE CLAIM:
1. A system (100) for media analysis, wherein the system (100) comprises:
a processor (201);
a memory (202) communicatively coupled with the processor (201), wherein the memory (202) is configured to store one or more executable instructions that, when executed by the processor (201), cause the processor (201) to:
receive (301) an input media file, wherein the received input media file comprises an audio information and a visual information;
pre-process (302) the visual information of the received input media file;
split (303) the visual information of the pre-processed input media file into one or more chunks, wherein each chunk from the one or more chunks corresponds to a plurality of frames of the visual information of the received input media file;
extract (304) one or more features corresponding to each chunk from the one or more chunks;
generate (305) embeddings of each chunk from the one or more chunks, using a transformer encoder model based on the extracted one or more features;
classify (306), via a classification model, each chunk from the one or more chunks, to either a real chunk or a synthetic chunk based on the generated embeddings; and
provide (307), each chunk from the one or more chunks with an indication of either the real chunk or the synthetic chunk.

2. The system (100) as claimed in claim 1, wherein the processor (201) is configured to pre-process (302) the visual information of the received input media file for sampling the received input media file based on a predefined frame sampling rate, wherein the processor is configured for filtering the sampled input media file in time domain to reduce either noise or artifacts in frequency domain, wherein filtering of the sampled input media file is performed using one of hamming window, hanning window, low pass filter, high pass filter, band pass filter or a combination thereof, wherein the processor is configured to pre-process the visual information of the received input media file to identify one or more portions associated with one or more users, within the received input media file; wherein the one or more portions corresponds to facial information associated with the one or more users within the received input media file, wherein the facial information associated with the one or more users are identified by using one of Multi-task Cascaded Convolutional Networks (MTCNN), Yoloface, deepFace, retinaFace, FaceNet or a combination thereof.

3. The system (100) as claimed in claim 1, wherein the splitting (303) of the visual information of the pre-processed input media file into the one or more chunks is performed either using a predefined time interval or a predefined number of frames; wherein the splitting (303) of the visual information of the pre-processed input media file is performed by grouping the predefined number of frames for each face from the facial information associated with one or more faces of the one or more users, wherein each chunk from the one or more chunks, obtained by the splitting (303), is partially overlapping with an adjacent chunk of the one or more chunks.

4. The system (100) as claimed in claim 1, wherein the one or more features comprises one of facial key points, swapped faces, cropped faces in RGB color space, YUV color space, masked faces, focussed eye region, lip region, facial expressions, number of pixels, number of persons, video only, audio-video, file format, frame rate, compression artifacts, lighting inconsistencies, reflections, shadows, motion blur, noise patterns, or a combination thereof.

5. The system (100) as claimed in claim 1, wherein the processor (201) is configured to stack the extracted one or more features, before passing to a feeder neural network model, to create a volume of input to the feeder neural network model; wherein the processor is configured to input the extracted one or more features to the feeder neural network model based on the created volume of input, before passing to the transformer encoder model, for obtaining a changed activation volume of an input to the transformer encoder model.

6. The system (100) as claimed in claim 1, wherein the feeder neural network model corresponds to one of CNN (convolutional neural network), VGG (Visual Geometry Group), 3D-CNNs, fully connected layers, sigmoid activation, ReLU (Rectified Linear Unit) activation, dropout layers, ResNet (Residual Networks), RawNet2, LSTM (Long short-term memory), Deep Neural Network (DNN), or a combination thereof.

7. The system (100) as claimed in claim 1, wherein the embeddings generated by the transformer encoder model are multi-dimensional vector representation of each chunk of the one or more chunks, wherein the transformer encoder model corresponds to one of VideoMAE, XClip, TimeSformer, ViViT, Vision Transformers (ViTs), BEiT (BERT Pre-Training of Image Transformers), CAiT (Class-Attention in Image Transformers), DeiT (Data-efficient Image Transformers) or a combination thereof.

8. The system (100) as claimed in claim 1, wherein the classification model is configured to generate a classification score, for each chunk from the one or more chunks, wherein the classification model is configured to compare the classification score, of each chunk, with a predefined classification threshold to classify the chunk as the real chunk or the synthetic chunk, wherein the classification model corresponds to one of logistic regression, random forest, k-nearest neighbor (k-NN), support vector machines (SVM), or a combination thereof.

9. The system (100) as claimed in claim 1, the processor (201) is configured for training the classification model using the one or more chunks of the input media file; wherein during the training, the processor (201) is configured to utilize one or more loss functions on the one or more chunks, to identify a wrong classification and further update the classification model using a back propagation technique; wherein the one or more loss functions comprise a cross-entropy loss model, a triplet loss model, L2 regularizer, or a combination thereof.

10. The system (100) as claimed in claim 1, wherein the processor is configured to receive, a specific chunk from the one or more chunks to determine whether the selected chunk is real or synthetic, wherein the specific chunk is selected by a user.

11. A method (300) for media analysis, the method (300) comprising:
receiving (301), via a processor (201), an input media file, wherein the received input media file comprises an audio information and a visual information;
pre-processing (302), via the processor (201), the visual information of the received input media file;
splitting (303), via the processor (201), the visual information of the pre-processed input media file into one or more chunks, wherein each chunk from the one or more chunks corresponds to a plurality of frames of the visual information of the received input media file;
extracting (304), via the processor (201), one or more features corresponding to each chunk from the one or more chunks;
generating (305) embeddings of each chunk from the one or more chunks, using a transformer encoder model based on the extracted one or more features;
classifying (306), via the processor coupled to a classification model, each chunk from the one or more chunks, to either a real chunk or a synthetic chunk based on the generated embeddings; and
providing (307), each chunk from the one or more chunks with an indication of either the real chunk or the synthetic chunk.

12. A non-transitory computer-readable storage medium having stored thereon, a set of computer-executable instructions that, when executed by a processor (201), cause the processor (201) to perform steps comprising:
receiving (301) an input media file, wherein the received input media file comprises an audio information and a visual information;
pre-processing (302) the visual information of the received input media file;
splitting (303) the visual information of the pre-processed input media file into one or more chunks, wherein each chunk from the one or more chunks corresponds to a plurality of frames of the visual information of the received input media file;
extracting (304) one or more features corresponding to each chunk from the one or more chunks;
generating (305) embeddings of each chunk from the one or more chunks, using a transformer encoder model based on the extracted one or more features;
classifying (306), via a classification model, each chunk from the one or more chunks, to either a real chunk or a synthetic chunk based on the generated embeddings; and
providing (307), each chunk from the one or more chunks with an indication of either the real chunk or the synthetic chunk.

Dated this October 22, 2024


ABHIJEET GIDDE
AGENT FOR THE APPLICANT
IN/PA- 4407

Documents

Application Documents

# Name Date
1 202421080389-STATEMENT OF UNDERTAKING (FORM 3) [22-10-2024(online)].pdf 2024-10-22
2 202421080389-REQUEST FOR EARLY PUBLICATION(FORM-9) [22-10-2024(online)].pdf 2024-10-22
3 202421080389-FORM-9 [22-10-2024(online)].pdf 2024-10-22
4 202421080389-FORM FOR STARTUP [22-10-2024(online)].pdf 2024-10-22
5 202421080389-FORM FOR SMALL ENTITY(FORM-28) [22-10-2024(online)].pdf 2024-10-22
6 202421080389-FORM 1 [22-10-2024(online)].pdf 2024-10-22
7 202421080389-FIGURE OF ABSTRACT [22-10-2024(online)].pdf 2024-10-22
8 202421080389-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [22-10-2024(online)].pdf 2024-10-22
9 202421080389-EVIDENCE FOR REGISTRATION UNDER SSI [22-10-2024(online)].pdf 2024-10-22
10 202421080389-DRAWINGS [22-10-2024(online)].pdf 2024-10-22
11 202421080389-DECLARATION OF INVENTORSHIP (FORM 5) [22-10-2024(online)].pdf 2024-10-22
12 202421080389-COMPLETE SPECIFICATION [22-10-2024(online)].pdf 2024-10-22
13 202421080389-STARTUP [23-10-2024(online)].pdf 2024-10-23
14 202421080389-FORM28 [23-10-2024(online)].pdf 2024-10-23
15 202421080389-FORM 18A [23-10-2024(online)].pdf 2024-10-23
16 Abstract 1.jpg 2024-11-19
17 202421080389-FORM-26 [27-12-2024(online)].pdf 2024-12-27
18 202421080389-FER.pdf 2025-02-18
19 202421080389-Proof of Right [21-03-2025(online)].pdf 2025-03-21
20 202421080389-FORM 3 [27-03-2025(online)].pdf 2025-03-27
21 202421080389-FER_SER_REPLY [16-05-2025(online)].pdf 2025-05-16
22 202421080389-DRAWING [16-05-2025(online)].pdf 2025-05-16
23 202421080389-FORM-8 [05-08-2025(online)].pdf 2025-08-05
24 202421080389-FORM28 [17-10-2025(online)].pdf 2025-10-17
25 202421080389-Covering Letter [17-10-2025(online)].pdf 2025-10-17

Search Strategy

1 202421080389_SearchStrategyNew_E_202421080389E_17-02-2025.pdf