Surveillance System For Processing Video Data Based On Inter Frame Analysis And A Surveillance Method Thereof

Abstract: Embodiments of the present disclosure relate to a surveillance system for processing video data and a method thereof. The system includes one or more scene capturing devices for capturing the video data and a video processing device comprising one or more processors and a memory storing computer-executable instructions. When executed, the instructions cause the video processing device to perform inter-frame analysis to compute a difference score for each frame, indicating a visual change relative to a successive frame. If the difference score satisfies a predefined threshold, the corresponding frame is encoded, compressed, and transmitted to a cloud video server or a local user viewing device, along with metadata including at least one of a timestamp or a frame index, thereby optimizing bandwidth and storage utilization. FIG. 1


Patent Information

Application #:
Filing Date: 29 August 2025
Publication Number: 39/2025
Publication Type: INA
Invention Field: ELECTRONICS
Status:
Parent Application:

Applicants

ITRANSZ ESOLUTION PRIVATE LIMITED
NEW NO 48, OLD NO18, SRIMAN SRINIVASA ROAD, ALWARPET, CHENNAI-600018,TAMIL NADU, INDIA

Inventors

1. SRIVATHSAN KUMAR
802 MAANGALYA SURYODHAYA, VARTHUR ROAD, MARATHAHALLI, BANGALORE- 560037, KARNATAKA, INDIA
2. P LAKSHMI NARASIMHAN
S2, LAKSHMIKANTHAM FLATS, 14 BV NAGAR, 5TH STREET SOUTH, NANGANALLUR, TAMIL NADU 600061, CHENNAI, INDIA
3. KEERTHI GOWDA D
2ND CROSS, D GROUP EMPLOYEES LAYOUT, LINGADHEERANAHALLI, BENGALURU - 560091, KARNATAKA, INDIA

Specification

Description: FIELD OF INVENTION
[0001] Embodiments of the present disclosure relate generally to a surveillance system for processing video data based on inter-frame analysis of visual changes, motion, or scene attributes between frames, and to a surveillance method thereof, thereby enabling bandwidth-efficient video streaming, storage optimization, and continuous playback reconstruction. The disclosed techniques may be applicable to surveillance systems, remote monitoring, smart cameras, or other environments where efficient video transmission, intelligent storage, and event-based recording are desirable.
BACKGROUND
[0002] Video surveillance systems commonly rely on continuous capture, transmission, and storage of video data from scene capturing devices, such as network-connected surveillance cameras. In conventional systems, each frame is transmitted and stored irrespective of whether it contains meaningful visual changes. This often leads to the storage of redundant or low-value video frames, resulting in excessive use of storage resources, particularly when cloud-based storage is employed. The associated storage costs may increase significantly with the volume and duration of the stored video content. For instance, in a typical retail environment, surveillance cameras may continue to capture and transmit video data throughout non-operational hours, during which the monitored area remains unchanged. In such scenarios, frames depicting an unaltered store interior such as empty aisles, stationary merchandise, and static lighting are continuously recorded and stored. Conventional systems process and store each of these redundant frames individually, thereby leading to unnecessary consumption of network bandwidth and storage resources, especially when utilizing cloud-based services.
[0003] Furthermore, continuous transmission of video data places substantial demands on available network bandwidth. In scenarios where network connectivity is intermittent or unreliable, such as in remote or mobile deployments, sustained video transmission can lead to frame loss or latency. In such cases, retransmission of entire video sequences may be required, further straining bandwidth and impacting the timeliness and reliability of surveillance data.
[0004] The overhead caused by continuous transmission also results in increased power consumption and processing load, both at the capturing device and at the receiving end, such as a cloud video server. This overhead is especially critical when the video data is subjected to further analysis using artificial intelligence (AI) or machine learning (ML) modules deployed on the cloud. Transmitting redundant frames to cloud-based AI systems not only incurs unnecessary computational costs but also increases power consumption, reducing the overall efficiency and scalability of the surveillance system.
[0005] Moreover, in a peer-to-peer (P2P) network, where cameras communicate directly with external systems, creating network pinholes (open ports in firewalls) presents significant security risks. Open ports allow unauthorized access, potentially enabling attackers to compromise cameras and other network devices. Each camera becomes an individual vulnerability point, increasing the attack surface and providing multiple entryways for cyber-attacks. Additionally, without secure communication, video data can be intercepted or eavesdropped on during transmission, exposing sensitive information. Bypassing firewalls and Network Address Translation (NAT) devices introduces further risks, as it may require additional pinholes, which increase vulnerability. The decentralized nature of P2P communication also makes monitoring and managing security more challenging, as breaches may go undetected. Furthermore, each camera requires its own security configuration, raising the risk of mismanagement or overlooked vulnerabilities.
[0006] Accordingly, there exists a need for a video surveillance system and a method thereof that selectively process and transmit only relevant video frames containing visual changes, thereby optimizing storage and bandwidth requirements, reducing reliance on continuous high-throughput internet connections, lowering operational costs, minimizing unnecessary computational and power overhead in both local and cloud environments, and enhancing the security of the surveillance system by mitigating the risks associated with open ports and unauthorized access.
BRIEF DESCRIPTION
[0007] In accordance with an embodiment of the present disclosure, a surveillance system for processing video data is provided. The system includes one or more scene capturing devices, each configured to capture the video data comprising a plurality of frames of a surveillance area in a respective field of view; and a video processing device, comprising: one or more processors; a memory comprising computer-executable instructions that, when executed by the one or more processors, cause the video processing device to: based on inter-frame analysis, for each frame, compute a respective difference score representing a visual change between each frame and a corresponding successive frame; compare the respective difference score to a predefined threshold; and in response to the respective difference score satisfying the predefined threshold, perform the following: select, based on a degree of contribution of the each frame and the corresponding successive frame towards the visual change, the each frame corresponding to the respective difference score, or the corresponding successive frame corresponding to the respective difference score, or a frame pair comprising both the each frame and the corresponding successive frame corresponding to the respective difference score; encode and compress the selected each frame; and transmit the encoded and compressed each frame to a cloud video server or a local user viewing device, along with corresponding metadata, wherein the metadata comprises at least one of a respective timestamp indicating a time at which the transmitted each frame was captured by a corresponding scene capturing device and a respective frame index assigned to the transmitted each frame by the corresponding scene capturing device.
[0008] In accordance with another embodiment of the present disclosure, a surveillance method for processing video data is provided. The method includes capturing, by one or more scene capturing devices of a surveillance system, the video data comprising a plurality of frames of a surveillance area within respective fields of view; performing, by a video processing device of the surveillance system, inter-frame analysis to compute, for each frame, a respective difference score representing a visual change between the frame and a corresponding successive frame; comparing, by the video processing device, the respective difference score to a predefined threshold; and in response to the respective difference score satisfying the predefined threshold, performing the following: selecting, based on a degree of contribution of the each frame and the corresponding successive frame towards the visual change, the each frame corresponding to the respective difference score, or the corresponding successive frame corresponding to the respective difference score, or a frame pair comprising both the each frame and the corresponding successive frame corresponding to the respective difference score; encoding and compressing the selected each frame; and transmitting the encoded and compressed each frame to a cloud video server or a local user viewing device, along with corresponding metadata, wherein the metadata comprises at least one of a respective timestamp indicating a time at which the transmitted each frame was captured by a corresponding scene capturing device and a respective frame index assigned to the transmitted each frame by the corresponding scene capturing device.
[0009] To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. It is to be appreciated that these figures depict only typical embodiments of the disclosure and are therefore not to be considered limiting in scope. The disclosure will be described and explained with additional specificity and detail with the appended figures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:
[0011] FIG. 1 illustrates a network environment for implementing example techniques for a surveillance system for processing video data, in accordance with an example implementation of the present subject matter;
[0012] FIG. 2 illustrates a schematic diagram of a video processing device, in accordance with an example implementation of the present subject matter;
[0013] FIG. 3 illustrates selection of temporally adjacent frames of a video frame by a video processing device in the event of threshold satisfaction by the video frame, in accordance with an example implementation of the present subject matter;
[0014] FIG. 4 illustrates a schematic diagram of reconstruction of a video stream by a video processing device, in accordance with an example implementation of the present subject matter; and
[0015] FIGS. 5-7 illustrate a surveillance method implemented by a video processing system, in accordance with an example implementation of the present subject matter.
[0016] Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.
DETAILED DESCRIPTION
[0017] For the purpose of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended. Such alterations and further modifications in the illustrated system, and such further applications of the principles of the disclosure as would normally occur to those skilled in the art are to be construed as being within the scope of the present disclosure.
[0018] The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method. Similarly, one or more devices or subsystems or elements or structures or components preceded by "comprises... a" do not, without more constraints, preclude the existence of other devices, sub-systems, elements, structures, components, additional devices, additional sub-systems, additional elements, additional structures or additional components. Appearances of the phrases "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
[0019] The term “plurality,” as used herein, means two or more, i.e., it encompasses two, three, four, five, etc. For example, the expression “plurality of frames” encompasses two frames, three frames and so on.
[0020] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.
[0021] In the following specification and the claims, reference will be made to a number of terms, which shall be defined to have the following meanings. The singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.
[0022] FIG. 1 illustrates a network environment for implementing example techniques for surveillance system 1 for processing video data, in accordance with an example implementation of the present subject matter. It should be understood that, although specific systems are depicted as distinct blocks in the schematic diagrams, any of these systems may alternatively be combined or separated through hardware and/or software implementations. Referring to FIG. 1, one or more scene capturing devices 10-N are deployed within a surveillance area and are configured to capture video data or a video stream comprising a plurality of frames, each frame representing a visual snapshot of the surveillance area at a respective point in time. For example, scene capturing device 10 may capture video data comprising a plurality of frames 10(a)-10(n), scene capturing device 11 may capture video frames 11(a)-11(n), and so forth, with scene capturing device N capturing frames N(a)-N(n). Each of the one or more scene capturing devices 10-N is oriented to cover a defined field of view that is overlapping, adjacent, or distributed, which may be fixed or dynamically adjustable, thereby enabling flexible surveillance coverage. The one or more scene capturing devices (10-N) may include, but are not limited to, fixed-position cameras, pan-tilt-zoom (PTZ) cameras, fisheye or panoramic cameras, thermal imaging cameras, night-vision enabled cameras, infrared (IR) cameras, low-light cameras, stereo or depth cameras, or other types of imaging sensors suitable for surveillance and monitoring applications. The one or more scene capturing devices (10-N) operate at a predefined frame capture rate, which may be determined by factory settings, user configuration, or dynamically adjusted by authorized personnel through administrative interfaces. The capture rate may vary based on use-case requirements, such as high frame rate capture for entry points or motion-sensitive zones, and lower rates for passive areas.
[0023] The one or more scene capturing devices (10-N) may be configured to stream the captured video data using standard streaming protocols such as Real-Time Streaming Protocol (RTSP), Real-Time Transport Protocol (RTP), HTTP Live Streaming (HLS), or similar. In an example, the one or more scene capturing devices 10-N may include an onboard volatile memory configured as a ring buffer (shown as 10(1) in FIG. 2), wherein video frames are temporarily stored in a circular memory structure that overwrites the oldest data with new data once capacity is reached. The use of a ring buffer facilitates real-time streaming and minimizes memory requirements while ensuring short-term availability of the most recent video frames for retrieval or retransmission. The one or more scene capturing devices 10-N may support secure data transmission through protocols such as Secure Sockets Layer (SSL) and Transport Layer Security (TLS).
[0024] Each frame of the captured video data may be associated with metadata tagging to facilitate intelligent processing, indexing, and retrieval by a corresponding scene capturing device. The metadata may be embedded within the frame or transmitted as accompanying data, and may include, but is not limited to, one or more of the following: a timestamp indicating the exact time the frame was captured, a frame index representing its sequential position within the video stream, and scene capturing device identification information such as a unique camera ID or source address. In some embodiments, metadata may also comprise geolocation coordinates, camera orientation details (e.g., pan, tilt, zoom states), scene descriptors (e.g., lighting condition, activity label), or capture settings (e.g., resolution, frame rate). Various formats of metadata tagging may be supported, including inline tagging within video container formats (e.g., MP4, MKV) or separate sidecar files (e.g., JSON or XML metadata files). Such metadata enables robust synchronization, event-based indexing, secure transmission, and facilitates post-processing activities such as intelligent event detection, audit trails, or multi-camera correlation. The scene capturing devices 10-N may be powered via Power over Ethernet (PoE) or dedicated power supplies. The scene capturing devices 10-N may also support auxiliary features such as automatic gain control, exposure compensation, digital zoom, and time synchronization through Network Time Protocol (NTP).
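By way of a non-limiting illustration, a sidecar metadata record of the kind described above might resemble the following Python sketch; the field names, values, and file name shown are assumptions chosen for readability, not a prescribed schema.

    import json

    # hypothetical sidecar metadata record for a single captured frame
    frame_metadata = {
        "camera_id": "CAM-10",                     # scene capturing device identifier
        "frame_index": 1042,                       # sequential position in the video stream
        "timestamp": "2025-08-29T10:15:32.480Z",   # capture time (e.g., NTP-synchronised)
        "resolution": "1920x1080",
        "frame_rate": 15,
        "pan_tilt_zoom": {"pan": 30.0, "tilt": -5.0, "zoom": 1.0},
    }

    # written alongside the frame as a JSON sidecar file
    with open("frame_1042.json", "w") as sidecar:
        json.dump(frame_metadata, sidecar, indent=2)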
[0025] Although not shown for simplicity, the scene capturing devices 10-N may also include at least one processor and an associated memory comprising computer-executable instructions that, when executed by the processor, cause the processor to perform various operations. The processor may be, but is not limited to, a single-processor or multi-processor system of any of a wide array of possible architectures, including field programmable gate array (FPGA), central processing unit (CPU), application specific integrated circuits (ASIC), digital signal processor (DSP) or graphics processing unit (GPU) hardware arranged homogenously or heterogeneously. The memory may be but is not limited to a random-access memory (RAM), read only memory (ROM), or other electronic, optical, magnetic or any other computer readable medium.
[0026] Further, the system 1 may comprise a video processing device 20 configured to receive and process video data transmitted by one or more scene capturing devices 10-N. The architecture shown in Fig. 1 is scalable, and may be adapted for centralized, decentralized, or hybrid deployments depending on the requirements of the surveillance application. Fig. 1 illustrates a representative example of such a deployment where the video processing device 20 is provisioned on the same network as the scene capturing devices 10-N, to facilitate low-latency and secure communication, for example, via RTSP (Real-Time Streaming Protocol) over secure channels such as SSL (Secure Sockets Layer) or TLS (Transport Layer Security). In certain implementations, a single video processing device may be sufficient to process video data streams from a small to moderate number of scene capturing devices. However, in larger or high-density deployments, multiple video processing devices may be provisioned and distributed across the network infrastructure to handle high-throughput data, perform parallel processing, load balancing, and to ensure scalability and fault tolerance. Each video processing device 20 may be implemented as a dedicated hardware appliance, an edge computing unit, or a virtualized software instance hosted on local servers or compute nodes. The video processing device 20 may include one or more processors and memory modules configured to execute computer-executable instructions for processing video frames, managing metadata, performing event-based logic, or interfacing with cloud-based systems or local user viewing devices. Furthermore, in certain embodiments, the video processing devices may be configured to selectively store intermediate processing results, maintain audit trails, or coordinate with centralized storage systems depending on system design and policy requirements. In some examples, the video processing device 20 may be logically or physically segmented based on scene capturing device location, video content type, sensitivity levels, or processing load. Further details regarding the video processing device are provided in the description of FIG. 2.
[0027] In some embodiments, the video processing device 20 may transmit the processed video data to a local storage system, such as a Network Video Recorder (NVR) 30, or to a remote storage system, such as a cloud video server/cloud computing platform 40. The NVR 30, typically located within the same network as the video processing device 20, serves as a centralized storage and playback unit for surveillance footage. In contrast, the cloud video server 40 provides remote storage and access capabilities, enabling broader scalability and remote viewing functionalities. The selection between local (NVR) and remote (cloud) storage may be based on user preference, system architecture, or specific deployment requirements. Further, a local user viewing device 80, such as a monitor, control console, or display terminal, may be operatively connected to the network video recorder (NVR) 30. The local user viewing device 80 enables real-time or recorded video footage to be accessed and monitored directly at the location of the NVR 30, without requiring remote access over a public or external network. This local connection can facilitate secure and efficient video review, playback control, and system management by authorized personnel, thereby supporting enhanced situational awareness and reducing reliance on cloud-based or remote user interfaces.
[0028] Although details are not provided here for simplicity, the cloud video server 40, as referred to herein, may be implemented using a single computing device or a plurality of interconnected computing devices operating in a distributed environment. In some embodiments, the cloud video server comprises one or more physical or virtual machines configured to perform computing operations and data storage functions associated with remote access and video data handling. Each instance of a cloud video server 40 may include at least one processor, a memory unit, one or more I/O interfaces and one or more communication interfaces. The processor may be configured to execute software instructions for hosting applications, managing data flow, enforcing access controls, or processing client requests. The memory may include volatile (e.g., RAM) and non-volatile (e.g., SSD, HDD) components for storing executable code, metadata, video data, user credentials, and other relevant data. The communication interfaces may support wired or wireless network protocols, enabling connectivity over the Internet or private networks with external devices such as the video processing device 20 and local user viewing device 80. The cloud video server 40 infrastructure may further utilize load balancers, data replication mechanisms, and failover systems to ensure scalability, redundancy, and high availability. In a distributed configuration, the cloud video server 40 functions may be logically or physically divided across multiple geographic locations to enhance performance and reliability.
[0029] In the system 1, the scene capturing devices 10-N may be configured to communicate directly with the video processing device 20, which acts as a secure centralized communication node. This communication may occur over wired or wireless connections, including but not limited to Ethernet, Wi-Fi, or other short-range communication protocols. The video processing device 20 then transmits the processed video data to one or more external systems, such as a local Network Video Recorder (NVR) 30 or a remote cloud computing platform/cloud video server 40. These transmissions may occur via direct links or through intermediate components like switches, routers, access points, or gateways. Wired communication paths may include Ethernet cables, coaxial lines, or fiber optics, while wireless paths may include Wi-Fi, 4G/5G, Bluetooth, Zigbee, or other wireless communication protocols.
[0030] Furthermore, the system may include a remote user viewing device 50 configured to access video data remotely via a cloud computing platform/cloud video server 40. The remote user viewing device 50, which could be a smartphone, tablet, laptop, or desktop computer, connects to the cloud video server 40 over the Internet 60. An application or web-based interface hosted on the cloud video server 40 enables secure access to the video data, which is processed and then transmitted by the video processing device 20 to the cloud video server 40 for storing. The cloud video server 40 serves as an intermediary between the video processing device 20 and the remote user viewing device 50, managing access control and authentication. The user may securely log into the application and, depending on their access rights, retrieve video data stored on the cloud video server 40. This architecture enables users to remotely access video data without exposing internal network components or requiring direct communication with the scene capturing devices 10-N or the video processing device 20, thereby maintaining the security of the system 1.
[0031] Thus, the video processing device 20, acting as a centralized communication node, addresses several security challenges by consolidating communication and data management in one secure location. It eliminates the need for individual scene capturing devices to have open ports, thus preventing unauthorized access by ensuring that no direct communication occurs between the scene capturing devices 10-N and external devices, including the cloud video server 40, a local user viewing device 80, or a remote user viewing device 50. By routing all communication through a single, secure node, the video processing device reduces the attack surface and limits exposure to potential threats. To protect video data from interception or eavesdropping during transmission, the video processing device employs secure communication protocols. This ensures that all data transmitted between the cameras and external systems is safeguarded, maintaining the confidentiality and integrity of the video footage. The video processing device also centralizes all network traffic, eliminating the need for additional pinholes in firewalls or NAT devices. By managing all external communication through a single point, it simplifies the firewall configuration, minimizes the number of exposed ports, and reduces network vulnerability. Furthermore, the centralized architecture makes security monitoring more efficient. Since all data passes through the video processing device, it becomes easier to detect potential breaches and respond to threats. The device can log and analyze security events in one location, improving visibility and control over the system. Additionally, the video processing device simplifies security management by centralizing configuration. Instead of configuring security settings for each camera individually, the system ensures that uniform security policies are applied across all devices, reducing the risk of misconfiguration and ensuring consistent protection for the entire network.
[0032] In accordance with an embodiment of the present disclosure, a surveillance system for processing video data is provided. The system includes one or more scene capturing devices, each configured to capture the video data comprising a plurality of frames of a surveillance area in a respective field of view; and a video processing device, comprising: one or more processors; a memory comprising computer-executable instructions that, when executed by the one or more processors, cause the video processing device to: based on inter-frame analysis, for each frame, compute a respective difference score representing a visual change between each frame and a corresponding successive frame; compare the respective difference score to a predefined threshold; and in response to the respective difference score satisfying the predefined threshold, perform the following: encode and compress a frame corresponding to the respective difference score; and transmit the encoded and compressed frame, to a cloud video server or a local user viewing device, along with a corresponding metadata comprising at least one of: a respective timestamp indicating a time at which the frame was captured by a corresponding scene capturing device; and a respective frame index assigned to the frame by the corresponding scene capturing device.
[0033] FIG. 2 illustrates a schematic diagram of a video processing device 200, in accordance with an example implementation of the present subject matter. It may be noted that the foregoing system is an exemplary system and may be implemented as computer-executable instructions in any computing or processing environment, including in digital electronic circuitry or in computer hardware, firmware, device driver, or software. As such, the system is not limited to any specific hardware or software configuration. As shown therein, the video processing device 200 may comprise a processor(s) 202, a memory(s) 204 coupled to and accessible by the processor(s) 202, and a network interface 208 coupled to the memory(s) 204. The functions of various elements shown in the figures, including any functional blocks labelled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing instructions. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" should not be construed to refer exclusively to hardware capable of executing instructions, and may implicitly comprise, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), or field programmable gate array (FPGA) hardware. Other hardware, standard and/or custom, may also be coupled to the processor(s) 202. The video processing device 200 may further include other components such as, but not limited to, I/O interfaces, sensors, logic circuits, etc. Further, the video processing device 200 may include data 212, which may include data that may be stored, for example, video frames and their corresponding metadata; data that may be utilized, for example, the predefined threshold and predefined sub-ranges; or data that may be generated, for example, difference scores, during the operation of the video processing device 200. The video processing device 200 may be the same as the video processing device 20 shown in FIG. 1.
[0034] The memory(s) 204 may be a computer-readable medium, examples of which comprise volatile memory (e.g., RAM) and/or non-volatile memory (e.g., erasable programmable read-only memory, i.e., EPROM, flash memory, etc.). The memory(s) 204 may be an external memory or an internal memory, such as a flash drive, a compact disk drive, an external hard disk drive, or the like. As noted above, the video processing device 200 includes the network interface 208, which may allow the connection or coupling of the video processing device 200 with one or more other devices through a wired (e.g., Local Area Network, i.e., LAN) connection or through a wireless connection (e.g., Bluetooth®, WiFi), for example, for connecting to the cloud video server 40 and the one or more scene capturing devices 10-N shown in FIG. 1. The network interface 208 may also enable intercommunication between different logical as well as hardware components of the video processing device 200.
[0035] Further, the video processing device 200 may include module(s) 206. The module(s) 206 may include a retrieving module 206A, a difference detection module 206B, a selection module 206C, an encoding module 206D, a compression module 206E, a transmit module 206F and other module(s) 206G. The other module(s) 206G may implement similar or extended functionalities of the video processing device 200. In one example, the module(s) 206 may be implemented as a combination of hardware and firmware. In examples described herein, such combinations of hardware and firmware may be implemented in several different ways. For example, the firmware for the module(s) 206 may be processor 202-executable instructions stored on a non-transitory machine-readable storage medium, and the hardware for the module(s) 206 may include a processing resource (for example, implemented as either a single processor or a combination of multiple processors) to execute such instructions.
[0036] In the present examples, the non-transitory machine-readable storage medium may store instructions that, when executed by the processing resource, implement the functionalities of the module(s) 206. In such examples, the video processing device 200 may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions. In other examples of the present subject matter, the machine-readable storage medium may be located at a different location but accessible to the video processing device 200 and the processor(s) 202.
[0037] The video processing device 200 may further include a reconstruction engine 210. The reconstruction engine 210 may be implemented as a combination of hardware and programming, for example, programmable instructions to implement a variety of functionalities of the reconstruction engine 210. In examples described herein, such combinations of hardware and programming may be implemented in several different ways. For example, the programming for the reconstruction engine 210 may be executable instructions. Such instructions may be stored on a non-transitory machine-readable storage medium which may be coupled either directly with the video processing device 200 or indirectly (for example, through networked means). In an example, the reconstruction engine 210 may include a processing resource, for example, either a single processor or a combination of multiple processors, to execute such instructions. In the present examples, the non-transitory machine-readable storage medium may store instructions that, when executed by the processing resource, implement the reconstruction engine 210. In other examples, the reconstruction engine 210 may be implemented as electronic circuitry. The reconstruction engine 210 may further include a decoding module 210A, a decompression module 210B, a transcoding module 210C, and a rendering module 210D.
[0038] The video processing device 200 may include other engine(s) (not shown) that may implement functionalities that supplement functions performed by the video processing device 200 or the reconstruction engine 210. Further, the video processing device 200 includes the data 212 described above. It may be noted that such examples of the various functions are only indicative. The present approaches may be applicable to other examples without deviating from the scope of the present subject matter. It will be appreciated that one or more alternative or additional algorithms may also be stored in the memory 204 and executed by the processor 202 to achieve similar or extended functionality, without deviating from the scope of the present subject matter.
[0039] In operation, as shown in FIG. 2, the scene capturing device 10 provided as a representative example may capture video data comprising a plurality of frames 10(a), 10(b), 10(c), 10(d),…, 10(n-1), 10(n). The frames 10(a), 10(b), 10(c), 10(d),…, 10(n-1), 10(n) together with a corresponding metadata comprising at least one of a respective timestamp indicating a time at which the frame was captured by a corresponding scene capturing device 10 and a respective frame index assigned to the frame by the corresponding scene capturing device 10 may be stored on a ring buffer 10(1). The ring buffer may be implemented as a “circular first-in-first-out (FIFO) memory structure” in a corresponding scene capturing device, for instance, the ring buffer 10(1) may be implemented as a “circular first-in-first-out (FIFO) memory structure” in the scene capturing device 10, wherein new frame data continuously overwrites the oldest frame data once the buffer reaches its maximum capacity. The ring buffer 10(1) is indexed by a pointer system comprising a “write pointer” and a “read pointer”. The write pointer advances with each newly received video frame, while the read pointer may be used to access frames for processing. When the write pointer reaches the end of the ring buffer 10(1), it wraps around to the beginning of the buffer, hence the term “ring” or “circular” buffer. For example, assuming the ring buffer 10(1) can hold “K” frames, frame 10(a) is written to index position 0, followed by frame 10(b) at position 1, and so on, up to frame 10(n) at position K-1. When a new frame 10(n+1) is captured, it overwrites the data at position 0 unless the data at that position has already been processed. This mechanism ensures low-latency storage and retrieval of real-time video data, making it particularly advantageous for applications such as surveillance, where only a short time window of recent frames is needed for inter frame analysis.
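The ring buffer behaviour described above may be sketched, purely for illustration, as the following Python structure; the class and method names are assumptions, and the example holds K = 4 frames.

    class RingBuffer:
        """Minimal circular FIFO frame store: the newest frame overwrites the oldest once full."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.slots = [None] * capacity
            self.write_ptr = 0   # index where the next frame will be written
            self.read_ptr = 0    # index of the oldest unread frame
            self.count = 0       # number of frames currently held

        def write(self, frame):
            if self.count == self.capacity:
                # buffer full: the oldest frame is overwritten and the read pointer advances
                self.read_ptr = (self.read_ptr + 1) % self.capacity
            else:
                self.count += 1
            self.slots[self.write_ptr] = frame
            self.write_ptr = (self.write_ptr + 1) % self.capacity   # wrap around at the end

        def read(self):
            if self.count == 0:
                return None
            frame = self.slots[self.read_ptr]
            self.read_ptr = (self.read_ptr + 1) % self.capacity
            self.count -= 1
            return frame

    # example with K = 4: writing five frames overwrites the first one
    buffer = RingBuffer(capacity=4)
    for label in ["10(a)", "10(b)", "10(c)", "10(d)", "10(e)"]:
        buffer.write(label)
    print(buffer.read())   # "10(b)" -- frame 10(a) was overwritten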
[0040] From the ring buffer 10(1), the retrieving module 206A may be configured to retrieve the frames in real-time or in near real-time. In some examples, the retrieving module 206A may be configured to prefetch a pair of consecutive frames, such as Frame(t) and a corresponding successive Frame (t+1), and forward them to a difference detection module 206B.
[0041] Further, the difference detection module 206B may be configured to compute, for each frame, a respective difference score representing a visual change between the each frame and a corresponding successive frame based on inter-frame analysis. The inter-frame analysis may be based on application of one or more techniques selected from a group consisting of: pixel-based difference, optical flow, scene change detection, block matching, background subtraction, histogram comparison, edge detection, feature-based comparison, machine learning models, semantic segmentation, Fourier transform or wavelet analysis, and deep learning-based change detection. The one or more techniques may be selected adaptively based on the number of active video streams being handled by the video processing device 200 or a frame capture rate of the one or more scene capturing devices 10, 11, …, N.
[0042] In the pixel-based difference technique, the video processing device evaluates visual changes between two temporally successive frames, denoted as Frame(t) and Frame(t+1), by computing an absolute pixel-wise difference. Specifically, for each pixel coordinate (i, j), the absolute difference between the intensity value in Frame(t) and the corresponding intensity value in Frame(t+1) is calculated. These per-pixel differences are summed over the entire frame area to generate a respective difference score. The resulting score reflects the cumulative variation in pixel intensity between the frames, which correlates to the extent of visual change in the scene. For example, if a stationary scene in Frame(t) is interrupted by the presence of a moving object in Frame(t+1), such as a person walking into the frame, the affected pixels will exhibit significant intensity changes. This results in a high difference score indicative of substantial motion. The pixel-based difference technique is advantageous due to its simplicity and computational efficiency, making it particularly suitable for detecting abrupt or localized changes in surveillance environments with relatively static backgrounds.
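A minimal sketch of the pixel-based difference technique, expressed in Python with NumPy and normalised to a percentage score, is given below for illustration; the function name and the 8-bit greyscale input are assumptions of this sketch.

    import numpy as np

    def pixel_difference_score(frame_t, frame_t1):
        """Cumulative absolute per-pixel intensity difference, expressed as a percentage."""
        a = frame_t.astype(np.int32)
        b = frame_t1.astype(np.int32)
        abs_diff = np.abs(b - a)                 # |Frame(t+1)(i, j) - Frame(t)(i, j)|
        max_total = 255.0 * a.size               # largest possible cumulative difference
        return 100.0 * abs_diff.sum() / max_total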
[0043] On the other hand, the optical flow analysis technique may include tracking the movement of objects between two successive frames, Frame(t) and Frame(t+1), by estimating the motion of pixels or groups of pixels between the frames. The optical flow technique computes motion vectors that represent the apparent displacement of pixel intensities in the frame. These vectors are determined by comparing the spatial relationships and intensity patterns in Frame(t) and Frame(t+1). For instance, if a person walks from left to right, optical flow would generate motion vectors directed along that path. The difference score is then derived by aggregating the motion vectors across the frame, capturing the magnitude and direction of motion. This technique is particularly advantageous for tracking continuous movement between two consecutive frames.
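An illustrative, non-limiting sketch of optical-flow-based scoring using OpenCV's dense Farnebäck estimator follows; greyscale input frames and the mean flow magnitude as the aggregate score are assumptions of this sketch.

    import cv2

    def optical_flow_score(frame_t_gray, frame_t1_gray):
        """Aggregate dense optical-flow motion vectors between two greyscale frames."""
        flow = cv2.calcOpticalFlowFarneback(
            frame_t_gray, frame_t1_gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        magnitude, _angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        return float(magnitude.mean())   # mean per-pixel displacement as the difference score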
[0044] Similarly, scene change detection identifies major global shifts between Frame(t) and Frame(t+1) by comparing overall scene features such as color histograms, brightness levels, or edge content. A scene change typically occurs when there is a significant alteration in the scene, such as a transition from one camera view to another or a change in lighting conditions. This technique involves comparing global properties of the frames, such as the average color distribution or edge density, to detect abrupt transitions. For example, if a room's lights are turned on, the brightness and color properties between Frame(t) and Frame(t+1) would differ markedly. This may be particularly advantageous for detecting scene transitions or large-scale environmental changes.
[0045] Likewise, block matching is a technique wherein Frame(t) is divided into a grid of fixed-size blocks, and for each block, the system searches for the most similar block in Frame(t+1). This comparison is typically performed based on metrics such as pixel intensity or structural similarity. When a block shifts in position or its content changes significantly, it indicates that motion has occurred and aggregates the mismatch values across all blocks to compute the overall difference score. Block matching is particularly effective in detecting localized changes or object movement within the frame, and it is robust against minor variations, such as small shifts or rotations of the captured scene.
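The block-matching comparison may be sketched as follows; the block size, search window, search step, and mismatch tolerance below are illustrative assumptions.

    import numpy as np

    def block_matching_score(frame_t, frame_t1, block=16, search=8, step=4, tol_per_pixel=10):
        """Percentage of blocks in Frame(t) whose best match in Frame(t+1) still differs noticeably."""
        h, w = frame_t.shape[:2]
        blocks = mismatches = 0
        for y in range(0, h - block + 1, block):
            for x in range(0, w - block + 1, block):
                ref = frame_t[y:y + block, x:x + block].astype(np.int32)
                best = None
                # exhaustive search in a small window around the block's original position
                for dy in range(-search, search + 1, step):
                    for dx in range(-search, search + 1, step):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy <= h - block and 0 <= xx <= w - block:
                            cand = frame_t1[yy:yy + block, xx:xx + block].astype(np.int32)
                            sad = int(np.abs(ref - cand).sum())   # sum of absolute differences
                            best = sad if best is None else min(best, sad)
                blocks += 1
                if best is not None and best > tol_per_pixel * block * block:
                    mismatches += 1
        return 100.0 * mismatches / max(blocks, 1)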
[0046] Background subtraction assumes that Frame(t) represents a stable, static background, while Frame(t+1) contains potential changes that deviate from this background. The system compares Frame(t+1) to Frame(t) to detect foreground objects that are either newly introduced into the scene or removed. Any significant deviation from the expected background is flagged, and the difference score is calculated based on the area of change. This technique is highly effective in environments where the background remains relatively constant, such as in surveillance systems monitoring a fixed area, making it ideal for detecting moving objects in otherwise static scenes.
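A simple background-subtraction sketch using OpenCV, with Frame(t) taken as the static background model, is provided below for illustration; the per-pixel tolerance of 25 intensity levels is an assumed value.

    import cv2
    import numpy as np

    def background_subtraction_score(background_gray, frame_t1_gray, pixel_tol=25):
        """Percentage of the frame area deviating from the assumed static background."""
        diff = cv2.absdiff(background_gray, frame_t1_gray)            # deviation from background
        _ret, foreground = cv2.threshold(diff, pixel_tol, 255, cv2.THRESH_BINARY)
        changed_area = np.count_nonzero(foreground)
        return 100.0 * changed_area / foreground.size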
[0047] Histogram comparison involves calculating the color or intensity distribution for both Frame(t) and Frame(t+1). This is done by analyzing the frequency of color or brightness values in the frame. A significant difference in these histograms between the two frames indicates a change in the overall lighting, color composition, or presence of new objects in the scene. For example, if bright headlights of a car enter the frame in Frame(t+1), it would cause a noticeable change in the histogram when compared to Frame(t). This method is particularly useful for detecting global changes in lighting or color, which are indicative of larger environmental changes.
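The histogram comparison described above may, for example, be implemented with OpenCV as sketched below; the 64-bin greyscale histogram and the correlation metric are assumptions of the sketch.

    import cv2

    def histogram_difference_score(frame_t_gray, frame_t1_gray, bins=64):
        """0 for identical intensity distributions, approaching 100 as they diverge."""
        h0 = cv2.calcHist([frame_t_gray], [0], None, [bins], [0, 256])
        h1 = cv2.calcHist([frame_t1_gray], [0], None, [bins], [0, 256])
        cv2.normalize(h0, h0)
        cv2.normalize(h1, h1)
        similarity = cv2.compareHist(h0, h1, cv2.HISTCMP_CORREL)   # 1.0 for identical histograms
        return 100.0 * (1.0 - max(similarity, 0.0))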
[0048] Edge detection involves identification of the boundaries of objects in Frame(t) and Frame(t+1) using edge detection algorithms, such as Sobel or Canny filters. These filters highlight areas of high spatial intensity change, where object edges are located and compare the edge maps of Frame(t) and Frame(t+1) to detect any differences in the positions or presence of object boundaries. For example, when a person enters the scene, new edges will be detected in Frame(t+1) that were not present in Frame(t). This technique is well-suited for structured scenes where object contours are clearly defined, and it is effective in detecting changes in object outlines or new object appearances.
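An illustrative sketch of the edge-based comparison using the Canny filter follows; the Canny threshold values are assumptions.

    import cv2
    import numpy as np

    def edge_difference_score(frame_t_gray, frame_t1_gray):
        """Share of pixels whose edge status differs between the two Canny edge maps."""
        edges_t = cv2.Canny(frame_t_gray, 100, 200)
        edges_t1 = cv2.Canny(frame_t1_gray, 100, 200)
        changed = np.count_nonzero(edges_t != edges_t1)   # edges that appeared or disappeared
        return 100.0 * changed / edges_t.size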
[0049] Feature-based comparison detects distinctive features, such as corners or textures, in both Frame(t) and Frame(t+1) using feature detection algorithms like Harris or SIFT. These features are matched across frames, and the number of unmatched or displaced features serves as an indicator of change. For example, if a significant object moves between Frame(t) and Frame(t+1), it may be detected that many features in Frame(t) do not align with those in Frame(t+1). The difference score is derived from the number of unmatched features. This technique is robust to small camera movements or rotations and can detect local changes even if they are subtle or spatially constrained.
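A non-limiting sketch of feature-based comparison is given below; it uses OpenCV's ORB detector in place of the Harris or SIFT detectors named above, and the ratio-test value of 0.75 is an assumption.

    import cv2

    def feature_difference_score(frame_t_gray, frame_t1_gray, ratio=0.75):
        """Share of keypoints detected in Frame(t) that find no good match in Frame(t+1)."""
        orb = cv2.ORB_create(nfeatures=500)
        kp0, des0 = orb.detectAndCompute(frame_t_gray, None)
        kp1, des1 = orb.detectAndCompute(frame_t1_gray, None)
        if des0 is None or des1 is None or len(kp0) == 0:
            return 100.0                                     # no comparable features at all
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
        pairs = [p for p in matcher.knnMatch(des0, des1, k=2) if len(p) == 2]
        good = [m for m, n in pairs if m.distance < ratio * n.distance]   # ratio test
        return 100.0 * (len(kp0) - len(good)) / len(kp0)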
[0050] Semantic segmentation assigns each pixel in Frame(t) and Frame(t+1) to a specific category, such as "person," "vehicle," or "background." The system compares the pixel-wise categories between the two frames to detect changes in the composition of the scene. For example, if a previously empty area in Frame(t) is classified as a "person" in Frame(t+1), this change is flagged as significant. By focusing on the semantic meaning of the scene rather than just pixel values, this technique provides a more intuitive understanding of changes, such as the entry of a person or vehicle, and is especially useful in dynamic environments where object identification is critical.
[0051] Fourier transform or wavelet analysis converts Frame(t) and Frame(t+1) into the frequency domain, focusing on the spatial or temporal frequency components of the frames. The system compares these frequency representations to detect subtle changes in patterns, such as flickering lights, oscillating objects, or periodic motion. For example, if a light source flickers between Frame(t) and Frame(t+1), the frequency analysis will reveal a significant change in the frequency domain, which is used to compute the difference score. These techniques are especially effective for detecting periodic or low-frequency changes that may not be readily apparent in the spatial domain.
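A frequency-domain comparison may be sketched with NumPy's 2-D FFT as follows; normalising by the energy of the first spectrum and capping the result at 100 are choices made only for this illustration.

    import numpy as np

    def frequency_difference_score(frame_t, frame_t1):
        """Relative change between the magnitude spectra of two frames."""
        spec_t = np.abs(np.fft.fft2(frame_t.astype(np.float64)))
        spec_t1 = np.abs(np.fft.fft2(frame_t1.astype(np.float64)))
        relative_change = np.abs(spec_t1 - spec_t).sum() / (spec_t.sum() + 1e-9)
        return min(100.0, 100.0 * relative_change)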
[0052] Deep learning-based change detection utilizes neural networks, such as convolutional or recurrent networks, to analyze Frame(t) and Frame(t+1) for visual discrepancies. The deep learning model is trained to detect complex patterns of change, including subtle or occluded movements that might not be captured by traditional methods. The model processes the frames, extracting high-level features from each, and generates a difference score reflecting the magnitude and type of change. For example, the model might detect a person moving behind an object or a slight change in a highly cluttered environment. This technique excels in challenging conditions where traditional techniques may fail, such as in scenes with high noise or complex motion patterns.
[0053] When machine learning models or convolutional neural networks (CNNs) are used, the difference detection module 206B may be implemented as an AI model trained to detect changes between frames. It is understood that this application is not limited to the difference detection algorithms explicitly mentioned, and any known or future-developed difference detection algorithms may be utilized within the scope of the present subject matter.
[0054] Once the respective difference score for each frame is computed and obtained in percentage format, the difference detection module 206B may be configured to compare the respective difference score to a predefined threshold. The predefined threshold may be a percentage value.
[0055] Furthermore, the difference detection module 206B may be configured to discard the respective frame corresponding to the respective difference score when the score dissatisfies the predefined threshold. This operation ensures that only frames with significant visual changes are forwarded for further processing, thereby reducing the data load for downstream modules. As a result, subsequent modules, such as encoding, compression, and transmission modules, are not burdened with redundant or irrelevant frames, which optimizes both processing efficiency and resource utilization, including bandwidth and storage. This approach improves the overall performance of the device 20 by ensuring that only meaningful video data is handled.
[0056] In response to the respective difference score satisfying the predefined threshold, the selection module 206C may be configured to select, based on a degree of contribution of the each frame and the corresponding successive frame towards the visual change, either the each frame corresponding to the respective difference score, or the corresponding successive frame corresponding to the respective difference score, or a frame pair comprising both the each frame and the corresponding successive frame corresponding to the respective difference score.
[0057] In other words, based on the degree of contribution of each frame and the corresponding successive frame toward the visual change, the selection module 206C may determine which frame or frame pair to select for further processing or storage. For instance, consider a scenario where Frame(t) captures the initial moment of a person beginning to enter the monitored area, while Frame(t+1) captures the majority of the person's movement into the scene. Although both frames contribute to the detected change, Frame(t+1) may exhibit a higher concentration of novel visual features, such as body posture and motion blur, indicating that it contains more significant visual information. Accordingly, the selection module 206C may be configured to select only Frame(t+1) for storage. Conversely, if Frame(t) captures the exact moment a person triggers an action such as opening a door and Frame(t+1) shows the continuation of the motion, both frames may be selected as a pair to preserve contextual continuity. In cases where the change is mostly represented in the earlier frame (e.g., a sudden light flash or an object abruptly disappearing), the selection module 206C may select Frame(t) and discard Frame(t+1). This selective approach ensures that only the most representative frames, based on their individual or combined contribution to the visual change, are retained, thereby reducing redundancy while maintaining meaningful context.
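One possible way to express this selection logic is sketched below; the per-frame contribution estimates and the dominance cutoff of 0.7 are hypothetical inputs, not values prescribed by the present subject matter.

    def select_frames(frame_t, frame_t1, contribution_t, contribution_t1, dominance=0.7):
        """Return the frame(s) that contribute most to the detected visual change."""
        total = contribution_t + contribution_t1
        if total == 0:
            return []
        if contribution_t1 / total >= dominance:   # later frame carries most of the change
            return [frame_t1]
        if contribution_t / total >= dominance:    # change is concentrated in the earlier frame
            return [frame_t]
        return [frame_t, frame_t1]                 # both contribute: keep the pair for context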
[0058] By ensuring that only those frames contributing substantively to the visual change are identified for further processing and by excluding frames with minimal or no relevance to the change, the system avoids processing redundant data and focuses solely on content-bearing frames. This mechanism inherently reduces the downstream computational footprint. The resulting frame set is streamlined, containing only those frames with maximal informational value, thereby enabling efficient resource utilization in subsequent stages such as storage, review, or reporting. This is particularly advantageous when the system is processing video in real-time or near real-time, as it prevents the system from being overloaded with redundant data that does not contribute to meaningful analysis. This approach ensures that the video surveillance system (1), particularly, video processing device 200 operates efficiently, even in environments where computational resources are limited. The threshold comparison also provides flexibility, allowing the threshold value to be fine-tuned based on the specific application requirements, enabling the system to adapt to varying environments or use cases. For example, in environments with frequent minor motion, the threshold can be adjusted to ensure that only major events or significant changes trigger further processing.
[0059] Upon selection, the encoding module 206D may be configured to encode the selected each frame using an image encoding format suitable for web-based optimization, such as, but not limited to, WebP. The encoded each frame is then passed to the compression module 206E. The compression module 206E may be configured to apply additional compression using techniques such as discrete cosine transform (DCT) or other transformation methods. This two-layer processing including encoding and compression effectively reduces the data size of each frame, thereby minimizing the bandwidth required for transmission. By reducing the volume of data transmitted over the network, this two-layer processing alleviates the burden on network infrastructure, especially in bandwidth-constrained environments. This reduction in required bandwidth ensures that video data, particularly the encoded and compressed frames, can be transmitted more efficiently, reducing the likelihood of congestion and enabling smoother transmission even in cases where network capacity is limited. This approach is particularly beneficial for scenarios with limited or fluctuating network bandwidth, as it allows for a more efficient use of available resources without compromising the critical video information.
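For illustration only, the two-layer processing may be sketched with OpenCV's WebP encoder followed by a second, generic compression pass; zlib is used here merely as a stand-in for the additional transform-based (e.g., DCT) compression layer described above, and the quality value of 80 is an assumption.

    import cv2
    import zlib

    def encode_and_compress(frame_bgr, quality=80):
        """Encode a selected frame as WebP, then apply a second compression pass."""
        ok, webp_bytes = cv2.imencode(".webp", frame_bgr,
                                      [int(cv2.IMWRITE_WEBP_QUALITY), quality])
        if not ok:
            raise RuntimeError("WebP encoding failed")
        return zlib.compress(webp_bytes.tobytes())   # second layer further reduces the payload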
[0060] Further, the transmit module 206F may be configured to transmit the encoded and compressed each frame, to a cloud video server 40 or a local user viewing device 80, along with a corresponding metadata, wherein the metadata comprises at least one of a respective timestamp indicating a time at which the transmitted each frame was captured by a corresponding scene capturing device and a respective frame index assigned to the transmitted each frame by the corresponding scene capturing device. It may be noted that the encoding module 206D may incorporate the metadata into the corresponding encoded frame during the encoding process. This ensures that the metadata is seamlessly included with the encoded frame, facilitating synchronization and accurate temporal representation of the frames during transmission.
[0061] This configuration provides significant advantages for storage on a cloud video server 40. By transmitting only the encoded and compressed frames that are deemed to contain meaningful visual changes, the device 20 reduces the volume of video data sent to the cloud video server 40. This reduces the storage burden on the cloud video server 40, as redundant or unimportant frames (such as those with little to no visual change) are not transmitted or stored. The inclusion of metadata alongside the encoded and compressed frames further streamlines the process, as it allows for easy tracking of the frames in the cloud storage without requiring additional storage resources for separate metadata management. Consequently, this reduces the overall storage requirements and costs associated with maintaining large amounts of video data in the cloud. Additionally, by sending only relevant data, the cloud storage is more efficient and optimized, allowing for better scalability and resource management.
[0062] Further, in response to the respective difference score satisfying the predefined threshold, the selection module 206C may be further configured to classify the selected frame into a predefined sub-range of a plurality of predefined sub-ranges, each above the predefined threshold, wherein each sub-range is associated with a predetermined number of temporally adjacent frames. In some examples, the predetermined number of temporally adjacent frames preceding, succeeding, or both is determined based solely on the sub-range classification of the selected frame's difference score, wherein the predetermined number of preceding and succeeding frames may be the same or different. In other examples, the selection module 206C may be configured to evaluate the respective difference scores of both preceding and succeeding frames in the temporal video stream relative to the selected frame. The number and direction (preceding and/or succeeding) of the temporally adjacent frames to be selected is dynamically determined based on the relative difference scores of those adjacent frames, such that frames exhibiting significant changes or contributing contextual relevance to the selected frame are included, selected once, and sent onward to the cloud video server. This approach ensures that contextual continuity is preserved by transmitting not only the selected frame but also its relevant temporally adjacent frames, thereby providing a more comprehensive visual understanding of the detected visual change. By classifying the selected frame into appropriate sub-ranges based on the magnitude of the difference score and dynamically configuring the number of adjacent frames through the selection module 206C, the device 200 optimizes data transmission and reduces unnecessary bandwidth usage while retaining situational awareness and contextual continuity.
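The mapping from difference-score sub-ranges to a number of temporally adjacent frames may be sketched as follows; the sub-range boundaries and the (preceding, succeeding) counts are illustrative assumptions consistent with the example of FIG. 3.

    def adjacent_frame_window(score_percent, threshold_percent=5.0):
        """Map a selected frame's difference score to (preceding, succeeding) adjacent-frame counts."""
        sub_ranges = [
            (threshold_percent, 15.0, (0, 1)),   # modest change: one succeeding frame
            (15.0, 40.0, (1, 2)),                # moderate change: e.g., Frame 3, then Frames 5 and 6 in FIG. 3
            (40.0, 100.01, (2, 3)),              # large change: wider contextual window
        ]
        for low, high, window in sub_ranges:
            if low <= score_percent < high:
                return window
        return (0, 0)                            # below the predefined threshold: no adjacent frames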
[0063] FIG. 3 illustrates selection of temporally adjacent frames of a video frame by a video processing device in the event of threshold satisfaction by the video frame, in accordance with an example implementation of the present subject matter. Particularly, FIG. 3 illustrates an exemplary video stream/video data 300 comprising a sequence of video frames (Frames 1 through 8) captured by a scene capturing device (any one of the scene capturing devices 10, 11, …, N), demonstrating the selection and transmission of temporally adjacent frames for contextual continuity. In this example, Frame 4 is identified as a selected frame due to its difference score exceeding a predefined threshold. The difference score of the selected frame is classified into a corresponding predefined sub-range, wherein each sub-range is mapped to a predetermined number of temporally adjacent frames. The number of adjacent frames preceding, succeeding, or both is determined based solely on the sub-range classification of the selected frame's difference score, or may be dynamically determined by evaluating the individual difference scores of the adjacent frames themselves. The adjacent frames (e.g., Frame 3 preceding, and Frames 5 and 6 succeeding) are selected accordingly through the selection module 206C. These selected frames are intended to preserve the context of the detected visual change. The selected frame (Frame 4) and its associated temporally adjacent frames (Frame 3 preceding, and Frames 5 and 6 succeeding) are encoded and compressed by the encoding module 206D and the compression module 206E, respectively, and transmitted to a cloud video server 40 or a local user viewing device 80 by the transmit module 206F.
[0064] It may be noted here that cloud video server 40, upon receipt of the transmitted frames, may store the same on cloud storage in an encoded and compressed format for storage optimization purposes. Each transmitted frame is associated with metadata that includes, but is not limited to, camera identification, timestamp, and frame index. This metadata enables the cloud video server 40 to organize and store frames in a manner that prevents intermixing of frames from different scene capturing devices, thereby maintaining the integrity and traceability of camera-specific video data. Such structured storage facilitates accurate retrieval and reconstruction of video streams corresponding to individual cameras during playback or analysis operations.
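The per-camera organization described in paragraph [0064] could, for example, be realized with a camera-scoped storage key. The sketch below assumes an object-store style layout; the key format and the function name storage_key are illustrative assumptions rather than a format defined by the specification.

```python
# Illustrative sketch only: deriving a per-camera storage key from the frame
# metadata so that frames from different scene capturing devices are never
# intermixed, in the spirit of paragraph [0064]. The key layout is an assumption.
from datetime import datetime, timezone


def storage_key(camera_id: str, timestamp: float, frame_index: int) -> str:
    """Build a key like 'cam-01/2025/08/29/000000042.bin' for an object store."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return f"{camera_id}/{dt:%Y/%m/%d}/{frame_index:09d}.bin"


if __name__ == "__main__":
    print(storage_key("cam-01", 1756425600.0, 42))  # cam-01/2025/08/29/000000042.bin
```

Grouping keys by camera identifier and capture date in this way keeps retrieval and timeline reconstruction scoped to a single scene capturing device.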
[0065] The reconstruction engine 210 and the corresponding modules may be configured to process the stored frames on demand for playback or analysis. For example, the cloud video server 40 may receive an instruction from a user via a user interface of the remote user viewing device 50, wherein the instruction indicates the user's request for playback of a reconstructed video based on a selected set of video frames. Upon receipt of the instruction to render frames stored on the cloud video server 40 as a playback via the remote user viewing device 50 communicatively coupled to the cloud video server 40, the cloud video server 40 may forward, to the video processing device 200, a first request comprising the frames selected by the user and corresponding metadata associating each video frame with at least one of a corresponding timestamp and a corresponding frame index, the first request indicating playback of the video frames.
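Purely as an assumed illustration, the first request forwarded by the cloud video server 40 might be represented as a JSON document carrying the user-selected frames and their metadata; the field names below are hypothetical and are not defined by the disclosure.

```python
# Illustrative sketch only: a possible JSON shape for the "first request" of
# paragraph [0065], in which the cloud video server forwards the user-selected
# frames and their metadata to the video processing device for playback.
# Field names are assumptions.
import json

first_request = {
    "action": "playback",
    "camera_id": "cam-01",
    "frames": [
        {"frame_index": 1, "timestamp": 1756425600.0},
        {"frame_index": 2, "timestamp": 1756425600.2},
        {"frame_index": 5, "timestamp": 1756425600.8},
        {"frame_index": 9, "timestamp": 1756425601.6},
    ],
}

if __name__ == "__main__":
    print(json.dumps(first_request, indent=2))
```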
[0066] Upon receipt of the first request, the decoding module 210A may be configured to decode the video frames, and subsequently the decoded frames may be decompressed by the decompression module 210B using inverse operations such as the inverse discrete cosine transform (iDCT). The transcoding module 210C may be configured to identify a missing frame in a video timeline of the decoded and decompressed video frames based on a detected discontinuity in either the timestamps or the frame indices. Upon identification of such a missing frame, the transcoding module 210C may be configured to determine a reference frame based on the closest earlier timestamp or frame index among the decoded and decompressed video frames. The transcoding module 210C may be configured to insert the determined reference frame in place of the missing frame in the video timeline to preserve temporal continuity. The rendering module 210D may be configured to subsequently generate a seamless reconstructed video stream from the decoded and decompressed video frames, along with the inserted reference frames, thereby ensuring a continuous playback experience for the user. Thus, the selective transmission and storage of video frames offers a beneficial balance between data efficiency and user experience. By storing only selected frames, the system reduces unnecessary data retention while still enabling the generation of a coherent and uninterrupted playback stream. This selective approach is particularly advantageous in scenarios where data loss may occur, as it avoids the need to retransmit entire continuous video sequences. Instead, only the relevant selected frames are handled, simplifying recovery and preserving playback integrity. Furthermore, each stored frame is associated with metadata such as camera ID, timestamp, and frame index, allowing the device 200 to accurately distinguish and organize frames from multiple cameras without overlapping. Using the timestamps and indices, the video processing device 200 can reconstruct a consistent timeline, detect any discontinuities, and insert suitable reference frames to fill any gaps. As a result, the user is provided with a smooth, uninterrupted playback experience that reflects the original scene progression, despite the underlying use of selectively stored video frames.
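The gap detection and reference-frame insertion performed by the transcoding module 210C can be pictured with the following minimal sketch, which operates on frame indices only and represents frames as simple (index, payload) pairs; the function name fill_timeline and this data representation are assumptions made for clarity, and decoding, decompression, and rendering are out of scope.

```python
# Illustrative sketch only: detecting missing positions in a timeline of decoded
# frames by their frame indices and filling each gap by repeating the closest
# earlier frame, in the spirit of paragraphs [0066]-[0067].
from typing import Dict, List, Tuple

Frame = Tuple[int, bytes]  # (frame_index, decoded frame data)


def fill_timeline(frames: List[Frame]) -> List[Frame]:
    """Return a gap-free timeline, repeating the closest earlier frame for each
    missing index so playback remains temporally continuous."""
    ordered = sorted(frames, key=lambda f: f[0])
    by_index: Dict[int, bytes] = dict(ordered)
    first, last = ordered[0][0], ordered[-1][0]
    timeline: List[Frame] = []
    reference = ordered[0]
    for idx in range(first, last + 1):
        if idx in by_index:
            reference = (idx, by_index[idx])
            timeline.append(reference)
        else:
            # Missing frame: insert the closest earlier frame as a reference frame.
            timeline.append((idx, reference[1]))
    return timeline


if __name__ == "__main__":
    # Mirrors the example discussed with reference to FIG. 4 in the next
    # paragraph: only Frames 1, 2, 5 and 9 were stored.
    stored = [(1, b"F1"), (2, b"F2"), (5, b"F5"), (9, b"F9")]
    print([payload.decode() for _, payload in fill_timeline(stored)])
    # -> ['F1', 'F2', 'F2', 'F2', 'F5', 'F5', 'F5', 'F5', 'F9']
```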
[0067] FIG. 4 illustrates a schematic diagram of reconstruction of video stream by a video processing device, in accordance with an example implementation of the present subject matter. The user may select Frame 1, Frame 2, Frame 5 and Frame 9, each associated with corresponding metadata, stored in the memory 302 (cloud storage) of the cloud video server 40 and may provide an instruction to the cloud video server 40 for playback of the selected video frames. The cloud video server 40 may forward the first request to the video processing device 200. Upon receiving the first request, the video processing device 200 may decode and decompress the frames Frame 1, Frame 2, Frame 5 and Frame 9 and may identify the discontinuity, specifically, between Frame 2 and Frame 5, and between Frame 5 and Frame 9 based on missing intermediate frame indices or timestamp gaps. To reconstruct a continuous playback stream, the device 200 replicates the closest earlier frame to fill each missing frame position. For instance, Frame 2 is repeated to fill the timeline positions corresponding to missing Frame 3 and Frame 4, while Frame 5 is repeated to fill the positions of missing Frame 6, Frame 7, and Frame 8. These replicated frames preserve the metadata format of the original frames they replicate, ensuring continuity in indexing, timestamp alignment, and camera source attribution. The resulting reconstructed video stream 304 thus comprises both the originally transmitted frames and the replicated reference frames, enabling uninterrupted and temporally consistent playback, even when the original transmission was limited to selected frames based on inter-frame analysis.
[0068] In accordance with another embodiment of the present disclosure, a surveillance method for processing video data is provided. The method includes capturing, by one or more scene capturing devices of a surveillance system, the video data comprising a plurality of frames of a surveillance area within respective fields of view; performing, by a video processing device of the surveillance system, inter-frame analysis to compute, for each frame, a respective difference score representing a visual change between the frame and a corresponding successive frame; comparing, by the video processing device, the respective difference score to a predefined threshold; and, in response to the respective difference score satisfying the predefined threshold, performing the following: selecting, based on a degree of contribution of the frame and the corresponding successive frame towards the visual change, the frame corresponding to the respective difference score, the corresponding successive frame, or a frame pair comprising both the frame and the corresponding successive frame; encoding and compressing the selected frame; and transmitting the encoded and compressed frame to a cloud video server or a local user viewing device, along with corresponding metadata, wherein the metadata comprises at least one of a respective timestamp indicating a time at which the transmitted frame was captured by a corresponding scene capturing device and a respective frame index assigned to the transmitted frame by the corresponding scene capturing device.
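As a simplified, capture-side illustration of the method of paragraph [0068], the sketch below computes a mean absolute pixel difference as the difference score (pixel-based difference being one of the techniques contemplated by the disclosure), compares it to a threshold, and returns the indices of frames that would proceed to encoding, compression, and transmission. The threshold value and the simplified choice of keeping only the earlier frame of each pair are assumptions made for brevity.

```python
# Illustrative sketch only: the capture-side pipeline of paragraph [0068], using
# a mean absolute pixel difference as the inter-frame difference score.
# Encoding, compression, and transmission are omitted.
from typing import List

import numpy as np

THRESHOLD = 12.0  # arbitrary assumption for illustration


def difference_score(frame: np.ndarray, successive: np.ndarray) -> float:
    """Mean absolute per-pixel difference between a frame and its successor."""
    return float(np.mean(np.abs(frame.astype(np.int16) - successive.astype(np.int16))))


def select_frames(frames: List[np.ndarray]) -> List[int]:
    """Return indices of frames whose difference score satisfies the threshold."""
    selected: List[int] = []
    for i in range(len(frames) - 1):
        score = difference_score(frames[i], frames[i + 1])
        if score >= THRESHOLD:
            # In the full method the frame, its successor, or both would be
            # chosen based on their contribution to the visual change; here we
            # simply keep the earlier frame of the pair.
            selected.append(i)
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    static = rng.integers(0, 255, size=(4, 4), dtype=np.uint8)
    changed = rng.integers(0, 255, size=(4, 4), dtype=np.uint8)
    stream = [static, static.copy(), changed, changed.copy()]
    print(select_frames(stream))  # only the index at the visual transition is selected
```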
[0069] FIGS. 5-7 illustrate surveillance methods 500-700 implemented by the surveillance system 1, in accordance with an example implementation of the present subject matter. Although the methods 500-700 may be implemented in a variety of devices, for ease of explanation, the description of the methods 500-700 is provided in reference to the above-described surveillance system 1. The order in which the methods 500-700 are described is not intended to be construed as a limitation, and any number of the described method blocks may be combined in any order to implement the methods 500-700, or an alternative method. It may be understood that blocks of the methods 500-700 may be performed in the system 1. The blocks of the methods 500-700 may be executed based on instructions stored in a non-transitory computer-readable medium, as will be readily understood. The non-transitory computer-readable medium may comprise, for example, digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media.
[0070] At block 502, the video data comprising a plurality of frames of a surveillance area within respective fields of view may be captured by the one or more scene capturing devices of a surveillance system.
[0071] At block 504, inter-frame analysis may be performed by a video processing device to compute, for each frame, a respective difference score representing a visual change between the frame and a corresponding successive frame.
[0072] At block 506, the respective difference score may be compared to a predefined threshold.
[0073] At block 508, in response to the respective difference score satisfying the predefined threshold, the operations of blocks 510 to 514 may be performed.
[0074] At block 510, based on a degree of contribution of the frame and the corresponding successive frame towards the visual change, the frame corresponding to the respective difference score, the corresponding successive frame, or a frame pair comprising both the frame and the corresponding successive frame may be selected.
[0075] At block 512, the selected frame may be encoded and compressed.
[0076] At block 514, the encoded and compressed frame may be transmitted to a cloud video server or a local user viewing device, along with corresponding metadata, wherein the metadata comprises at least one of a respective timestamp indicating a time at which the transmitted frame was captured by a corresponding scene capturing device and a respective frame index assigned to the transmitted frame by the corresponding scene capturing device.
[0077] At block 602, in response to the respective difference score satisfying the predefined threshold, the selected frame may be classified into a predefined sub-range of a plurality of predefined sub-ranges above the pre-defined threshold, each predefined sub-range being associated with a predetermined number of temporally adjacent frames.
[0078] At block 604, temporally adjacent frames may be selected based on the predetermined number associated with the predefined sub-range such that each temporally adjacent frame is selected once.
[0079] At block 606, the selected temporally adjacent frames may be encoded and compressed.
[0080] At block 608, the encoded and compressed temporally adjacent frames may be transmitted along with the selected frame, to the cloud video server or the local user viewing device, wherein each temporally adjacent frame is transmitted with corresponding metadata comprising at least one of a respective timestamp indicating a time at which the temporally adjacent frame was captured by the corresponding scene capturing device and a respective frame index assigned to the temporally adjacent frame by the corresponding scene capturing device.
[0081] At block 702, a first request may be received, by the video processing device, from the cloud video server, wherein the first request comprises video frames and corresponding metadata associating each video frame to at least one of a timestamp or a frame index, wherein the first request is to indicate playback of the video frames.
[0082] At block 704, the video frames may be decoded and decompressed.
[0083] At block 706, a missing frame may be identified in a video timeline of the decoded and decompressed video frames based on a discontinuity in the timestamps or the frame indices.
[0084] At block 708, a reference frame may be determined based on the closest earlier timestamp or frame index in the set of decoded and decompressed video frames.
[0085] At block 710, the reference frame may be inserted in place of the missing frame in the video timeline to preserve temporal continuity.
[0086] At block 712, a seamless reconstructed video stream may be generated from the set of decoded and decompressed video frames and the inserted reference frames, ensuring a continuous playback experience.
[0087] Collectively, the present subject matter addresses and overcomes the limitations of conventional video surveillance systems by providing a system and method that intelligently processes and selectively transmits only video frames containing meaningful visual changes. Unlike traditional systems that continuously capture, transmit, and store every video frame regardless of whether it reflects any scene variation, the present subject matter filters out redundant frames depicting static or unaltered scenes. This substantially reduces the volume of data transmitted and stored, thereby conserving network bandwidth and optimizing storage utilization, particularly in cloud-based environments. To preserve contextual continuity and ensure coherent playback, the present subject matter further includes the selective transmission of a limited number of temporally adjacent frames surrounding the frames of interest. This enables accurate reconstruction of event sequences without requiring full-frame transmission, thereby maintaining the interpretability of video content while minimizing data overhead. By reducing the frequency and volume of frame capture, encoding, transmission, and storage, the present subject matter significantly lowers the computational burden and power consumption at the scene-capturing devices. This is particularly advantageous for resource-constrained or remotely deployed surveillance units. Additionally, on the server side, particularly when leveraging cloud-based infrastructure, reduced data ingress and storage demands translate into lower cloud storage costs and decreased computational expenses associated with video analytics and AI-based cloud processing.
[0088] Furthermore, the present subject matter enhances system security by eliminating the reliance on peer-to-peer communication models that require open firewall ports. Instead, it enables centralized and secure communication, thereby minimizing the attack surface, mitigating the risk of unauthorized access, and simplifying the overall security architecture. As such, the present subject matter offers scalable, secure, and cost-efficient video surveillance approaches.
[0089] It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the disclosure and are not intended to be restrictive thereof.
[0090] While specific language has been used to describe the disclosure, any limitations arising on account of the same are not intended. As would be apparent to a person skilled in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein.
[0091] The figures and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the order of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts need to be necessarily performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples.
Claims:
1. A surveillance system for processing video data comprising:
one or more scene capturing devices, each configured to capture the video data comprising a plurality of frames of a surveillance area in a respective field of view; and
a video processing device communicatively coupled with the one or more scene capturing devices, comprising:
one or more processors;
a memory comprising computer-executable instructions that, when executed by the one or more processors, cause the video processing device to:
based on inter-frame analysis, for each frame, compute a respective difference score representing a visual change between the each frame and a corresponding successive frame;
compare the respective difference score to a predefined threshold; and
in response to the respective difference score satisfying the predefined threshold, perform the following:
select based on degree of contribution of the each frame and the corresponding successive frame towards the visual change:
the each frame corresponding to the respective difference score;
the corresponding successive frame corresponding to the respective difference score; or
a frame pair comprising both the each frame and the corresponding successive frame corresponding to the respective difference score;
encode and compress the selected each frame; and
transmit the encoded and compressed each frame, to a cloud video server or a local user viewing device, along with a corresponding metadata, wherein the metadata comprises at least one of:
a respective timestamp indicating a time at which the transmitted each frame was captured by a corresponding scene capturing device; and
a respective frame index assigned to the transmitted each frame by the corresponding scene capturing device.
2. The system as claimed in claim 1, wherein the computer-executable instructions, when executed by the one or more processors, further cause the video processing device to:
in response to the respective difference score failing to satisfy the predefined threshold, discard the each frame corresponding to the respective difference score.
3. The system as claimed in claim 1, wherein the inter-frame analysis is based on application of one or more techniques selected from a group consisting of: pixel-based difference, optical flow, scene change detection, block matching, background subtraction, histogram comparison, edge detection, feature-based comparison, machine learning models, semantic segmentation, Fourier transform or wavelet analysis, and deep learning-based change detection.
4. The system as claimed in claim 3, wherein the one or more techniques are selected adaptively based on the number of active video streams or a frame capture rate of the one or more scene capturing devices.
5. The system as claimed in claim 1, wherein, in response to the respective difference score satisfying the predefined threshold, the computer-executable instructions further cause the video processing device to:
classify the selected frame into a predefined sub-range of a plurality of predefined sub-ranges above the predefined threshold, each predefined sub-range being associated with a predetermined number of temporally adjacent frames;
select temporally adjacent frames of the selected frame based on the predetermined number associated with the predefined sub-range such that each temporally adjacent frame is selected once;
encode and compress the selected temporally adjacent frames; and
transmit the encoded and compressed temporally adjacent frames along with the selected frame, to the cloud video server or the local user viewing device, wherein each temporally adjacent frame is transmitted with a corresponding metadata comprising at least one of:
a respective timestamp indicating a time at which the each temporally adjacent frame was captured by the corresponding scene capturing device; and
a respective frame index assigned to the each temporally adjacent frame by the corresponding scene capturing device.
6. The system as claimed in claim 1, wherein the computer-executable instructions, when executed by the one or more processors, further cause the video processing device to:
receive a first request from the cloud video server, wherein the first request comprises video frames and corresponding metadata associating each video frame to at least one of a timestamp or a frame index, wherein the first request is to indicate playback of the video frames;
decode and decompress the video frames;
identify a missing frame in a video timeline of the decoded and decompressed video frames based on a discontinuity in the timestamps or the frame indices;
determine a reference frame based on the closest earlier timestamp or frame index in the decoded and decompressed video frames;
insert the reference frame in place of the missing frame in the video timeline to preserve temporal continuity; and
generate a seamless reconstructed video stream from the decoded and decompressed video frames and the inserted reference frames, ensuring a continuous playback experience.
7. The system as claimed in claim 1, wherein the video frames are stored on the cloud video server in a compressed and encoded format, and are decoded and decompressed only on demand.
8. The system as claimed in claim 1, wherein the video processing device is deployed on the same network as the one or more scene capturing devices.
9. The system as claimed in claim 1, wherein the video processing device is configured to function as a centralized communication node, managing all network traffic between the one or more scene capturing devices and any external devices, including the cloud video server or the local user viewing device, so that all external access is routed exclusively through the video processing device, thereby safeguarding the one or more scene capturing devices from unauthorized access.
10. A surveillance method for processing video data comprising:
capturing, by one or more scene capturing devices of a surveillance system, the video data comprising a plurality of frames of a surveillance area within respective fields of view;
performing, by a video processing device of the surveillance system, inter-frame analysis to compute, for each frame, a respective difference score representing a visual change between the frame and a corresponding successive frame;
comparing, by the video processing device, the respective difference score to a predefined threshold; and
in response to the respective difference score satisfying the predefined threshold, performing the following:
selecting based on degree of contribution of the each frame and the corresponding successive frame towards the visual change:
the each frame corresponding to the respective difference score;
the corresponding successive frame corresponding to the respective difference score; or
a frame pair comprising both the each frame and the corresponding successive frame corresponding to the respective difference score;
encoding and compressing the selected each frame; and
transmitting the encoded and compressed each frame, to a cloud video server or a local user viewing device, along with a corresponding metadata, wherein the metadata comprises at least one of:
a respective timestamp indicating a time at which the transmitted each frame was captured by a corresponding scene capturing device; and
a respective frame index assigned to the transmitted each frame by the corresponding scene capturing device.
11. The method as claimed in claim 10, wherein, in response to the respective difference score satisfying the predefined threshold, the method further comprises:
classifying the selected frame into a predefined sub-range of a plurality of predefined sub-ranges above the pre-defined threshold, each predefined sub-range being associated with a predetermined number of temporally adjacent frames;
selecting temporally adjacent frames based on the predetermined number associated with the predefined sub-range such that each temporally adjacent frame is selected once;
encoding and compressing the selected temporally adjacent frames; and
transmitting the encoded and compressed temporally adjacent frames along with the selected frame, to the cloud video server or the local user viewing device, wherein each temporally adjacent frame is transmitted with a corresponding metadata comprising at least one of:
a respective timestamp indicating a time at which the each temporally adjacent frame was captured by the corresponding scene capturing device; and
a respective frame index assigned to the each temporally adjacent frame by the corresponding scene capturing device.
12. The method as claimed in claim 10, further comprising:
receiving a first request from the cloud video server, wherein the first request comprises video frames and corresponding metadata associating each video frame to at least one of a timestamp or a frame index, wherein the first request is to indicate playback of the video frames;
decoding and decompressing the video frames;
identifying a missing frame in a video timeline of the decoded and decompressed video frames based on a discontinuity in the timestamps or the frame indices;
determining a reference frame based on the closest earlier timestamp or frame index in the set of decoded and decompressed video frames;
inserting the reference frame in place of the missing frame in the video timeline to preserve temporal continuity; and
generating a seamless reconstructed video stream from the set of decoded and decompressed video frames and the inserted reference frames, ensuring a continuous playback experience.
Dated this 29th day of August 2025
Signature

Manish Kumar
Patent Agent (IN/PA-5059)
Agent for the Applicant

Documents

Application Documents

# Name Date
1 202541082277-STATEMENT OF UNDERTAKING (FORM 3) [29-08-2025(online)].pdf 2025-08-29
2 202541082277-REQUEST FOR EARLY PUBLICATION(FORM-9) [29-08-2025(online)].pdf 2025-08-29
3 202541082277-PROOF OF RIGHT [29-08-2025(online)].pdf 2025-08-29
4 202541082277-POWER OF AUTHORITY [29-08-2025(online)].pdf 2025-08-29
5 202541082277-FORM-9 [29-08-2025(online)].pdf 2025-08-29
6 202541082277-FORM FOR SMALL ENTITY(FORM-28) [29-08-2025(online)].pdf 2025-08-29
7 202541082277-FORM FOR SMALL ENTITY [29-08-2025(online)].pdf 2025-08-29
8 202541082277-FORM 1 [29-08-2025(online)].pdf 2025-08-29
9 202541082277-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [29-08-2025(online)].pdf 2025-08-29
10 202541082277-EVIDENCE FOR REGISTRATION UNDER SSI [29-08-2025(online)].pdf 2025-08-29
11 202541082277-DRAWINGS [29-08-2025(online)].pdf 2025-08-29
12 202541082277-DECLARATION OF INVENTORSHIP (FORM 5) [29-08-2025(online)].pdf 2025-08-29
13 202541082277-COMPLETE SPECIFICATION [29-08-2025(online)].pdf 2025-08-29
14 202541082277-MSME CERTIFICATE [03-09-2025(online)].pdf 2025-09-03
15 202541082277-FORM28 [03-09-2025(online)].pdf 2025-09-03
16 202541082277-FORM-8 [03-09-2025(online)].pdf 2025-09-03
17 202541082277-FORM 18A [03-09-2025(online)].pdf 2025-09-03