ABSTRACT: The present disclosure provides a method and a system for context-based video compression. The method includes combining context information with saliency information to generate a heatmap, which is used to identify one or more significant regions and one or more insignificant regions. The one or more significant regions are encoded with a higher number of bits than the number of bits used for encoding the one or more insignificant regions. By performing context-based video compression, the present disclosure yields a considerable size reduction of the overall video without degrading the viewing experience of the user. Fig. 1a
TECHNICAL FIELD
The present subject matter is generally related to video compression techniques and more particularly, but not exclusively, to a method and a system for a context-based video compression technique.
BACKGROUND
At present, many techniques are available for video compression. However, in these techniques, significant compression is achieved mainly in the temporal domain rather than in the spatial domain. At first, frames at some fixed intervals are selected as keyframes, which are stored in full. Usually, a keyframe is stored after every K frames. The frames between successive keyframes are not stored in full; only relative information with respect to the surrounding keyframes, such as motion vectors, is encoded. Hence, by virtue of not storing every frame in full, a reduction/compression of the video is achieved. Some other techniques compress the video in the spatial domain by leveraging general image processing techniques that help them encode different regions of the image with different numbers of bits, thereby reducing the overall size of the video.
Further, existing video compression is performed in both the temporal domain and the spatial domain, but within the spatial domain the compression is done by leveraging only generic image processing techniques. This may restrict the amount of compression that may be achieved. Also, the generic image processing techniques used for spatial compression have no way of judging what is important and what is not. They mainly depend on image aspects such as color uniformity, sharpness, and the like for spatial compression, which restricts the magnitude of compression that can be achieved. Using these existing techniques, the only way the magnitude of compression may be increased is by impacting quality across the frame, thereby degrading the viewing experience of the user. Some other existing techniques disclose a method for identifying context in the video based on which compression may be performed. However, the context is limited to user-specified objects and hence does not provide an accurate identification of important objects to be compressed.
The information disclosed in this background of the disclosure section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
SUMMARY
Disclosed herein is a method for a context-based video compression technique. The method comprises receiving, by a video compression system, the video from one or more data sources and providing the video to a saliency network model and a context network model associated with the video compression system. The saliency network model is a trained neural network to identify saliency information in each frame of the video and the context network model is a trained neural network to identify context information in each frame of the video. Thereafter, the method comprises identifying the saliency information which indicates regions with a high probability of being viewed by a user for each frame of the video from the saliency network model and context information which indicates context of each frame in the video from the context network model. Once the saliency information and the context information are identified, the method comprises generating a heatmap for each frame of the video by combining the saliency information and the context information using a neural network model. The heatmap comprises information of one or more significant regions and one or more insignificant regions in each frame of the video. Further, the method comprises encoding, by the video compression system, each frame of the video using the heatmap to allocate a number of bits for compressing each frame of the video based on the one or more significant regions and one or more insignificant regions.
Further, the present disclosure discloses a video compression system for performing a context-based video compression technique. The video compression system comprises a saliency network model which is a trained neural network to identify saliency information in each frame of the video, a context network model which is a trained neural network to identify context information in each frame of the video, a processor and a memory. The memory is communicatively coupled to the processor, wherein the memory stores processor-executable instructions, which, on execution, cause the processor to receive the video from one or more data sources and provide the video to the saliency network model and the context network model. The processor identifies the saliency information which indicates regions with a high probability of being viewed by a user for each frame of the video from the saliency network model and context information which indicates context of each frame in the video from the context network model. Thereafter, the processor generates a heatmap for each frame of the video by combining the saliency information and the context information using a neural network model, wherein the heatmap comprises information of one or more significant regions and one or more insignificant regions in each frame of the video. Once the heatmap is generated, the processor encodes each frame of the video using the heatmap to allocate a number of bits for compressing each frame of the video based on the one or more significant regions and one or more insignificant regions.
The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, explain the disclosed principles. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of systems and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figures, in which:
Figs. 1a-1b show an exemplary architecture for a context-based video compression technique for a video in accordance with some embodiments of the present disclosure.
Fig.2a shows a block diagram of a video compression system in accordance with some embodiments of the present disclosure.
Figs. 2b-2e show exemplary heatmaps generated in accordance with some embodiments of the present disclosure.
Fig. 3 shows a flowchart illustrating a method for a context-based video compression technique for a video in accordance with some embodiments of the present disclosure.
Figs. 4a-6b show heatmaps generated with and without context information in accordance with some embodiments of the present disclosure.
Fig.7 illustrates a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether such computer or processor is explicitly shown.
DETAILED DESCRIPTION
In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the specific forms disclosed; on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
The terms “comprises”, “comprising”, “includes”, “including” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device, or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a system or apparatus preceded by “comprises… a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or method.
The present disclosure provides a method and a system for a context-based video compression technique for a video. The system comprises a saliency network model which is a trained neural network to identify saliency information in each frame of the video, a context network model which is a trained neural network to identify context information in each frame of the video, a processor and a memory. The processor may receive the video from one or more data sources and provide the video to the saliency network model and the context network model. The saliency network model identifies saliency information in each frame of the video and the context network model identifies context information in each frame of the video. The saliency information comprises one or more regions in the frame of the video with a high probability of being viewed by the user. The context information comprises the context in each frame of the video and is based on scene descriptors and environment information in each frame of the video. Thereafter, the processor may combine the saliency information and the context information and generate a heatmap for each frame of the video using a neural network. The heatmap may comprise information of one or more significant regions and one or more insignificant regions in each frame of the video. Thereafter, each frame of the video is encoded using the heatmap by an encoder. In an embodiment, the one or more significant regions may be encoded with a higher number of bits than the number of bits used for encoding the one or more insignificant regions. Therefore, in the present invention, the one or more significant regions are encoded with a higher number of bits compared to the one or more insignificant regions in each frame of the video, so that the important regions are retained at high quality while the overall size of the video is reduced without affecting the perceptual visual quality of the video.
Fig. 1a shows an exemplary architecture for a context-based video compression technique for a video in accordance with some embodiments of the present disclosure.
The architecture 100a may include an input video 101 (also referred to as the video), a video compression system 103 (alternatively referred to as the “system”), and an encoder 105. In one embodiment, the video compression system 103 may include a saliency network model 107, a context network model 109, and a neural network 111. In another embodiment, the video compression system 103 may include a saliency network model 107, a context network model 109, a segmentation network model 110, and a neural network 111, as shown in Fig. 1b. The input video 101 may comprise multiple image frames. In one embodiment, the input video 101 may be provided to the saliency network model 107 and the context network model 109. In another embodiment, the input video 101 may be provided to the saliency network model 107, the context network model 109, and the segmentation network model 110. In an embodiment, multiple copies of the video 101 may be provided to each of the models, i.e., the input video may be processed in parallel by each of the saliency network model, the context network model, and the segmentation network model. In another embodiment, the same copy of the video 101 may be processed and provided sequentially to each of these models. The saliency network model 107 is a trained neural network to identify saliency information in each frame of the video 101. The saliency information may indicate one or more regions in each frame of the video 101 which have a high probability of being viewed by a user.
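By way of illustration only, the following Python sketch shows one possible way of running the three analysis models on a frame in parallel, as described above. The model callables, thread-based execution, and dummy frame are assumptions of this sketch and do not form part of the disclosed embodiment.

```python
# Illustrative sketch only: runs the three analysis models on one frame in
# parallel. The model callables below are hypothetical placeholders, not the
# trained networks disclosed in this specification.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def run_models_in_parallel(frame, saliency_model, context_model, segmentation_model):
    """Return (saliency_map, context_info, segmentation_map) for one frame."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        sal_future = pool.submit(saliency_model, frame)
        ctx_future = pool.submit(context_model, frame)
        seg_future = pool.submit(segmentation_model, frame)
        return sal_future.result(), ctx_future.result(), seg_future.result()


if __name__ == "__main__":
    frame = np.zeros((720, 1280, 3), dtype=np.uint8)          # dummy frame
    dummy = lambda f: np.zeros(f.shape[:2], dtype=np.float32)  # stand-in for a model
    sal, ctx, seg = run_models_in_parallel(frame, dummy, dummy, dummy)
```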
The saliency network model 107 may be trained to identify one or more regions in the frame of the video 101 with a high probability of being viewed by the user based on pre-learnt one or more regions from a plurality of videos with different video categories. As an example, in a game of cricket while a “ball” is being bowled, the focus may be on the batsman, bowler, wickets, crease, and other fielders in the view, but not on the texture or crispness of the grass and audience. Therefore, these regions in the frame may be identified as saliency information.
The context network model 109 is a trained neural network to identify context information in each frame of the video 101. As an example, the context network model 109 may be a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) model. The context network model 109 may be trained to identify context in the frame of the video 101 based on pre-learnt context from a plurality of videos with different video categories. Further, the context network may be trained, for each frame, on future frames and past frames to determine what is important in the current frame. In an embodiment, the context may be thought of as a situation within which something exists or happens. As an example, the context of a sports video 101 is markedly different from that of the usual content of a General Entertainment Channel (GEC). As an example, in the sports video, the ball, batsmen, bowler, and fielders may be considered as important regions. Games of football and rugby have nearly identical contexts, with players on the field being more important than the rest of the audience, and with the presence of a ball which is almost always part of the important region. For example, a “ball” will be remarkably more salient in a cricket video 101 as compared to a Television (TV) series. Or, as an example, in a soap opera/movie about cricket, the overall content is that of a movie, but when the shots are about cricket, the context changes once again and the “ball” becomes important. So, the context here identifies a specific category or nature of the frame based on the context information in the frames.
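For illustration only, a minimal Python sketch (assuming the PyTorch library) of an LSTM-based context model of the kind described above is given below. It consumes a sequence of per-frame feature vectors and produces a context embedding and context scores per frame; the layer sizes, the bidirectional choice, and the upstream feature extractor are assumptions of this sketch, not the disclosed architecture.

```python
# Illustrative LSTM context model: per-frame features in, per-frame context
# embeddings and context-class scores out. Dimensions are assumptions.
import torch
import torch.nn as nn


class ContextNetwork(nn.Module):
    def __init__(self, feature_dim=512, hidden_dim=256, num_contexts=10):
        super().__init__()
        # Bidirectional so each frame's context can depend on past and future frames.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_contexts)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim)
        embeddings, _ = self.lstm(frame_features)      # (batch, num_frames, 2*hidden_dim)
        context_logits = self.classifier(embeddings)   # per-frame context scores
        return embeddings, context_logits


if __name__ == "__main__":
    model = ContextNetwork()
    feats = torch.randn(1, 30, 512)       # 30 frames of hypothetical CNN features
    emb, logits = model(feats)
    print(emb.shape, logits.shape)        # torch.Size([1, 30, 512]) torch.Size([1, 30, 10])
```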
The segmentation network model 110 is a trained neural network for identifying one or more object segmentations in each frame of the video 101, wherein each of the one or more object segmentations is assigned to a predefined object category or class. The category or class may be, as an example, a chair, a human, an animal, a wall, posters, paintings, a sports instrument, and the like. The one or more object segmentations in the frame of the video 101 may be identified based on pre-learnt object segmentations from a plurality of videos with different video categories. The object segmentation provides information of all the pixel locations covered by each of the predefined classes of objects. In an embodiment, the saliency network model 107, the context network model 109, and the segmentation network model 110 may run in parallel and in real-time to avoid any latency.
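As an illustration of what such per-pixel class assignment looks like in practice, the sketch below uses a pretrained DeepLabV3 model from torchvision as a stand-in for the segmentation network model 110; the specification does not name a particular model, so this choice and the API version are assumptions.

```python
# Illustrative stand-in for the segmentation network model 110: a pretrained
# semantic segmentation network returning a per-pixel class index (person,
# chair, etc.). Model choice and weights are assumptions for illustration.
import torch
import torchvision

model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()


def segment_frame(frame_tensor):
    """frame_tensor: (3, H, W) float tensor, normalised with ImageNet statistics.
    Returns an (H, W) tensor of per-pixel class indices."""
    with torch.no_grad():
        logits = model(frame_tensor.unsqueeze(0))["out"]   # (1, num_classes, H, W)
    return logits.argmax(dim=1).squeeze(0)                  # per-pixel class id
```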
In an embodiment, the neural network 111 may combine the saliency information and the context information to generate a heatmap. As an example, the neural network 111 may be a Selective Saliency Map generator which may be based on the U-Net architecture. In another embodiment, the neural network 111 may combine the saliency information, the context information, and the segmentation information to generate the heatmap. In another embodiment, the neural network 111 may combine the saliency information and the segmentation information to generate the heatmap. In yet another embodiment, the neural network 111 may combine the context information and the segmentation information to generate the heatmap. The heatmap may be a 2-dimensional matrix with values ranging from 0 to 1. As an example, the value 1 may represent the most important pixel and the value 0 may represent the least important pixel. The heatmap may comprise information of one or more significant regions and one or more insignificant regions in each frame of the video 101. Each frame of the video may be encoded using the heatmap by the encoder 105 to allocate a number of bits for compressing each frame of the video 101. The heatmap may be associated with any existing codec for performing the compression. In an embodiment, the one or more significant regions may be compressed using a higher number of bits than the number of bits used for encoding the one or more insignificant regions, thereby reducing the overall size of the video 101 without affecting the perceptual visual quality.
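To illustrate only the heatmap format described above (a 2-dimensional matrix of per-pixel importance in [0, 1]), the short sketch below blends a saliency map and a context map with fixed weights. The disclosure combines the maps with the neural network 111; the weighted blend and its weights are assumptions of this sketch, not the disclosed combiner.

```python
# Minimal sketch of the heatmap format: a (H, W) matrix in [0, 1] where 1 marks
# the most important pixel. The weighted blend is illustrative only.
import numpy as np


def combine_maps(saliency_map, context_map, w_saliency=0.6, w_context=0.4):
    """Both inputs are (H, W) arrays of non-negative importance scores."""
    def normalise(m):
        m = m.astype(np.float32)
        return m / m.max() if m.max() > 0 else m

    heatmap = w_saliency * normalise(saliency_map) + w_context * normalise(context_map)
    return np.clip(heatmap, 0.0, 1.0)
```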
Fig.2a shows a block diagram of a video compression system in accordance with some embodiments of the present disclosure.
In some implementations, the video compression system 103 may include an I/O interface 201, a processor 203, and a memory 205, along with the saliency network model 107, the context network model 109, the segmentation network model 110, and the neural network 111. The I/O interface 201 may be configured to receive the video 101 from one or more data sources. The processor 203 may be configured to perform context-based video compression for a video 101. The video compression system 103 may include data 206 and modules 218. As an example, the data 206 is stored in a memory 205 configured in the system 103 as shown in Fig. 2a. In one embodiment, the data 206 may include saliency information 207, context information 209, object segment information 211, heatmap data 213, and other data 217. The modules 218 illustrated in Fig. 2a are described herein in detail.
In some embodiments, the data 206 may be stored in the memory 205 in form of various data structures. Additionally, the data 206 can be organized using data models, such as relational or hierarchical data models. The other data 217 may store data, including temporary data and temporary files, generated by the modules 218 for performing the various functions of the system 103.
In some embodiments, the data 206 stored in the memory 205 may be processed by the modules 218 of the system 103. The modules 218 may be stored within the memory 205. In an example, the modules 218 communicatively coupled to the processor 203 configured in the system 103, may also be present outside the memory 205 as shown in Fig.2a and implemented as hardware. As used herein, the term modules 218 may refer to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor 203 (shared, dedicated, or group) and memory 205 that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
In some embodiments, the modules 218 may include, for example, a saliency identification module 219, a context identification module 221, a segmentation module 223, a heatmap generation module 225, an encoding module 227, and other modules 231. The other modules 231 may be used to perform various miscellaneous functionalities of the system 103. It will be appreciated that such aforementioned modules 218 may be represented as a single module or a combination of different modules 218. Furthermore, a person of ordinary skill in the art will appreciate that, in an implementation, the one or more modules 218 may be stored in the memory 205, without limiting the scope of the disclosure. The said modules 218, when configured with the functionality defined in the present disclosure, will result in novel hardware.
In an embodiment, the saliency identification module 219 may be configured to identify saliency information 207 in each frame of the video 101 from the saliency network model 107. The saliency network model 107 is a trained neural network to identify saliency information 207 in each frame of the video 101. The saliency information 207 may indicate one or more regions in the frame which have a high probability of being viewed by a user. The one or more regions which humans notice after seeing an image at first glance are generally referred to as salient regions or saliency information 207. The saliency network model 107 is trained based on pre-learnt one or more regions from a plurality of videos with different video categories or genres. As an example, the videos may be categorized based on one or more genres and each genre may include one or more sub-genres. For each sub-genre, the one or more regions which have a high probability of being viewed may be identified and used to train the saliency network model 107. As an example, in a cricket-based movie, the one or more regions such as the cricket players, the ball, and the fielders may be identified as salient regions which have a high probability of being viewed by the user.
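For illustration only, a minimal sketch (again assuming PyTorch) of the kind of saliency network the saliency identification module 219 might query is shown below: a small fully convolutional network that outputs a per-pixel probability of being viewed. The layer sizes are assumptions; the specification does not disclose the saliency architecture.

```python
# Illustrative saliency network: frame in, per-pixel viewing probability out.
import torch
import torch.nn as nn


class SaliencyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, 1, 1)   # single saliency channel

    def forward(self, frame):
        # frame: (batch, 3, H, W) -> saliency map in [0, 1] of shape (batch, 1, H, W)
        return torch.sigmoid(self.head(self.features(frame)))
```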
In an embodiment, the context identification module 221 may be configured to identify the context in each frame of the video 101 from the context network model 109. The context network model 109 is a trained neural network to identify context information 209 in each frame of the video 101. The context network model 109 may identify the context information 209 based on pre-learnt context from a plurality of videos with different video categories such as movies, sports, soap operas, and the like. The context may refer to the scene/content of the video 101. The importance of the objects present in the video 101 depends on the current scene in the video 101. Further, the context may refer to the content or relevant things that are happening in a video 101; in other words, the portions of a video 101 that are relevant to the human eye.
Scenario 1: A couple may be conversing while standing next to a crowded bus stop in a TV series. A TV viewer’s eyes would fixate on the couple’s faces/expressions and unconsciously ignore the rest of the crowd. Here, the context, imitating the above viewer, would identify the speaking couple as important or significant to the scene and the rest of the crowd as unimportant or insignificant. Hence, the context-aware portion or region may be the faces of the couple, which would remain in focus.
Scenario 2: A cricket match is being telecast and the bowler’s ball gets hit into a very high trajectory. The camera zooms to the sky (following the ball) and then to the crowded end to capture a potential catch or a potential boundary. The next camera shots are interspersed between the umpire, the scoreboard, and the jubilant crowd. Meanwhile, the TV audience fixates only on the ball while it is in the sky, only on the fielder while the ball is close to the boundary, and only on the scoreboard while the camera pans the crowd. Therefore, the context network model 109 identifies these important portions of the video 101 based on context and focuses only on the portions of the video 101 that a human would be interested in. In an embodiment, the context network model 109 may be trained such that the model understands a class and consequently the global context of the video 101. This way, the context network model 109 knows where to look in a video 101 of “players and spectators” and classifies it as a cricket video 101. This may help to identify the global context of the video 101 and differentiate between significant and insignificant regions in the video 101. Further, as an example, a frame may show a kitchen scene in which there is a “guitar” in the foreground. The saliency network model 107 may indicate the guitar as important because it is in the foreground, but the context embedding provides the information that this is a kitchen scene and hence the guitar is not included in the significant region. That is the benefit of having the global context.
In an embodiment, the segmentation module 223 may be configured to identify one or more object segments in each frame of the video 101 using a segmentation network model 110. In an embodiment, the segmentation network model 110 is a trained neural network to identify the one or more object segmentations in the frame of the video 101 based on pre-learnt object segmentations from a plurality of videos with different video categories. Each segment may be assigned to a predefined object category or class. As an example, the object category or class may be a chair, a human, an animal, a wall, posters, paintings, a sports instrument, and the like. In an embodiment, the object segmentation is precise, providing exactly all the pixel locations covered by each of the predefined object categories or classes. The segmentation module 223 may provide object segment information 211 which comprises one or more objects in each frame of the video 101, typically the most frequently occurring objects in the frame.
In an embodiment, the heatmap generation module 225 may be configured to generate a heatmap for each frame of the video 101 by combining the saliency information 207 and the context information 209 using a neural network 111, and the generated heatmap may be stored as heatmap data 213. Combining knowledge of the saliency information 207 with the context information 209 in the heatmap facilitates more accurate identification of one or more significant regions and one or more insignificant regions in each frame of the video 101. The heatmap may comprise information of one or more significant regions and one or more insignificant regions in each frame of the video 101. As an example, the one or more significant regions may be represented by the value 1 and the one or more insignificant regions may be represented by the value 0. As an example, the saliency information 207 may only provide salient regions, which are generally the regions of the video 101 most relevant to the human eye. The regions which humans notice after seeing an image at first glance are generally the salient regions. However, when the focus is at infinity, where all the regions of the image are in focus, saliency alone fails to identify the salient regions. As an example, in sports like cricket, when the “ball” is reaching the boundary, part of the audience is also shown at infinity focus. The ball and the boundary are more important than the audience in this scene. So, when the saliency information 207 and the context information 209 are combined, the significant regions may be identified more accurately. Fig. 2b shows an exemplary input image which may be provided to the saliency network model 107 and the context network model 109. Fig. 2c shows the output of the saliency network model 107, wherein salient regions are highlighted in green color, and Fig. 2d shows the output of the combination of the saliency information 207 and the context information 209, wherein the significant regions are highlighted in green color. When the saliency information 207 and the context information 209 are combined, the output is more accurate. Therefore, as seen from Fig. 2d, the car is highlighted along with the humans, since the humans are approaching the car; the context considers the car and highlights it as well.
In another embodiment, the heatmap generation module 225 may be configured to generate the heatmap for each frame in the video 101 by combining the saliency information 207, the context information 209, and the segment information using the neural network 111 to accurately identify one or more significant regions and one or more insignificant regions in each frame of the video 101. As an example, Fig. 2e shows various input images 240 and output images 242 and 244. The output images 242 are the images obtained from the saliency network model 107 and the output images 244 are the images obtained by combining the saliency information 207, the context information 209, and the segment information. As seen in Fig. 2e, in the output images 244, the important characters are better highlighted in green color when compared to the highlighted characters in the output images 242. In another embodiment, the heatmap generation module 225 may combine the saliency information and the segmentation information to generate the heatmap. In yet another embodiment, the heatmap generation module 225 may combine the context information and the segmentation information to generate the heatmap.
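Since the neural network 111 is exemplified earlier as a U-Net-based Selective Saliency Map generator, the sketch below shows a deliberately small U-Net-style combiner that stacks the saliency, context, and segmentation maps as input channels and fuses them into a single heatmap in [0, 1]. The depth, channel counts, and input layout are assumptions of this sketch, not the disclosed network.

```python
# Illustrative U-Net-style fusion of saliency, context and segmentation maps
# into one heatmap. Architecture details are assumptions.
import torch
import torch.nn as nn


class HeatmapUNet(nn.Module):
    def __init__(self, in_channels=3):   # saliency + context + segmentation maps
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))

    def forward(self, maps):
        # maps: (batch, in_channels, H, W) with H and W divisible by 2
        e = self.enc(maps)
        d = self.up(self.down(e))
        fused = torch.cat([e, d], dim=1)        # skip connection, U-Net style
        return torch.sigmoid(self.dec(fused))   # heatmap in [0, 1]
```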
In an embodiment, once the heatmap is generated, it may be post-processed for each frame of the video 101. Each pixel in the heatmap may be adjusted based on the values of the other pixels in its neighborhood. The heatmap is thus processed by performing morphological operations to prune the heatmap, and the pruned heatmap may be further processed to remove false positives to obtain a final heatmap.
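A minimal sketch of such post-processing, assuming OpenCV is available, is given below: morphological opening and closing smooth the thresholded heatmap, and small connected components are discarded as likely false positives. The threshold, kernel size, and minimum area are assumptions, not disclosed values.

```python
# Illustrative heatmap pruning: morphological operations followed by removal of
# small connected components. Parameter values are assumptions.
import cv2
import numpy as np


def prune_heatmap(heatmap, threshold=0.5, min_area=200):
    """heatmap: (H, W) float array in [0, 1]. Returns a pruned binary mask."""
    mask = (heatmap >= threshold).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove speckles
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    # Drop small connected components as likely false positives.
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    pruned = np.zeros_like(mask)
    for label in range(1, num):                              # label 0 is background
        if stats[label, cv2.CC_STAT_AREA] >= min_area:
            pruned[labels == label] = 1
    return pruned
```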
In an embodiment, once the heatmap is generated, the encoding module 227 may encode each frame of the video using the heatmap by means of an encoder 105 to allocate a number of bits for compressing each frame. The encoder 105 may allocate the number of bits based on the one or more significant regions and the one or more insignificant regions. As an example, the encoder 105 may allocate a higher number of bits to the one or more significant regions compared to the number of bits used for encoding the one or more insignificant regions.
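As one possible realisation of this bit allocation, the sketch below maps the heatmap to per-macroblock quantiser (QP) offsets, assuming an encoder that accepts region-of-interest offsets (a lower QP spends more bits). The block size and offset range are assumptions of this sketch; the specification does not tie the heatmap to a particular codec interface.

```python
# Illustrative mapping from heatmap importance to per-block QP offsets:
# negative offsets (more bits) for significant blocks, positive for
# insignificant ones. Block size and offset range are assumptions.
import numpy as np


def heatmap_to_qp_offsets(heatmap, block_size=16, max_offset=8):
    """heatmap: (H, W) array in [0, 1]. Returns an (H//block, W//block) array of
    integer QP offsets in [-max_offset, +max_offset]."""
    h, w = heatmap.shape
    rows, cols = h // block_size, w // block_size
    blocks = heatmap[:rows * block_size, :cols * block_size]
    blocks = blocks.reshape(rows, block_size, cols, block_size).mean(axis=(1, 3))
    # Importance 1.0 -> -max_offset (highest quality), 0.0 -> +max_offset.
    return np.round((0.5 - blocks) * 2 * max_offset).astype(np.int32)
```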
Fig.3 shows a flowchart illustrating a method of context-based video compression technique in accordance with some embodiments of the present disclosure.
As illustrated in Fig.3, the method 300 includes one or more blocks illustrating a method of context-based video compression technique. The method 300 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, and functions, which perform specific functions or implement specific abstract data types.
The order in which the method 300 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method. Additionally, individual blocks may be deleted from the methods without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.
At block 301, the method may include receiving, by a video compression system 103, the video 101 from one or more data sources and providing the video 101 to a saliency network model 107 and a context network model 109 associated with the video compression system 103. The saliency network model 107 is a trained neural network to identify saliency information 207 in each frame of the video 101 and the context network model 109 is a trained neural network to identify context information 209 in each frame of the video 101. The received video 101 may be provided to the saliency network model 107 and the context network model 109 in a sequential manner, or multiple copies of the video 101 may be obtained and provided to the different network models.
At block 303, the method may include identifying, by the video compression system 103, the saliency information 207 for each frame in the video 101 from the saliency network model 107. The saliency information 207 may indicate one or more regions in the frame with a high probability of being viewed by a user. The method may also include identifying context information 209, which indicates the context of each frame in the video 101, from the context network model 109. The saliency network model 107 is trained to identify one or more regions in the frame of the video 101 with a high probability of being viewed by the user based on pre-learnt one or more regions from a plurality of videos with different video categories. The context network model 109 is trained to identify context in the frame of the video 101 based on pre-learnt context from a plurality of videos with different video categories.
At block 305, the method may include generating, by the video compression system 103, a heatmap for each frame of the video 101. The heatmap may be generated by combining the saliency information 207 and the context information 209 using a neural network 111. The heatmap may comprise information of one or more significant regions and one or more insignificant regions in each frame of the video 101.
At block 307, the method may include encoding, by the video compression system 103, each frame of the video using the heatmap by an encoder 105 to allocate number of bits for compressing each frame of the video 101. Each frame of the video may be encoded based on the one or more significant regions and one or more insignificant regions.
Figs. 4a-6b show heatmaps generated with and without context information 209. Fig. 4a shows an input image 240 and an output image 244 (showing the heatmap) generated without adding context, wherein some regions are highlighted in green color in the output image 244, and Fig. 4b shows, for the same input image 240, the output image generated by adding the context information 209. As seen in Fig. 4b, adding context not only widens the scope, marking the entire region of the humans as important, but also assigns importance to the human in a side pose. Therefore, in Fig. 4b, the significant regions are highlighted in green color in the output image 244. Based on this, the heatmap is generated, which comprises one or more significant regions and one or more insignificant regions. The one or more significant regions are highlighted using green color and the one or more insignificant regions are indicated using red color. Similarly, Fig. 5a shows an input image 240 and an output image 244 (showing the heatmap), wherein the output image 244 is generated without adding context and hence only a few regions are highlighted in green color, and in Fig. 5b the output image 244 is generated by adding context. As seen in Fig. 5b, adding context helps in detecting the person in the rear view as important as well, despite low illumination, while also extending the saliency to the entire segment of the important regions. Therefore, in Fig. 5b, the significant regions are highlighted in green color in the output image 244. Fig. 6a shows an input image 240 and an output image 244 (showing the heatmap), wherein the output image 244 is generated without adding context, and in Fig. 6b the output image 244 (showing the heatmap) is generated by adding the context. As seen in Fig. 6b, adding the context helps to identify even the car as important, since the characters on the screen are moving towards it.
Computer System
Fig.7 illustrates a block diagram of an exemplary computer system 700 for implementing embodiments consistent with the present disclosure. In an embodiment, the computer system 700 may be a video compression system 103, which is used for context-based video compression technique for a video 101. The computer system 700 may include a central processing unit (“CPU” or “processor”) 702. The processor 702 may comprise at least one data processor for executing program components for executing user or system-generated business processes. The processor 702 may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
The processor 702 may be disposed in communication with one or more input/output (I/O) devices (711 and 712) via I/O interface 701. The I/O interface 701 may employ communication protocols/methods such as, without limitation, audio, analog, digital, stereo, IEEE-1394, serial bus, Universal Serial Bus (USB), infrared, PS/2, BNC, coaxial, component, composite, Digital Visual Interface (DVI), High-Definition Multimedia Interface (HDMI), Radio Frequency (RF) antennas, S-Video, Video Graphics Array (VGA), IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., Code-Division Multiple Access (CDMA), High-Speed Packet Access (HSPA+), Global System for Mobile Communications (GSM), Long-Term Evolution (LTE) or the like), etc. Using the I/O interface 701, the computer system 700 may communicate with one or more I/O devices 711 and 712.
In some embodiments, the processor 702 may be disposed in communication with a communication network 709 via a network interface 703. The network interface 703 may communicate with the communication network 709. The network interface 703 may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), Transmission Control Protocol/Internet Protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc.
The communication network 709 can be implemented as one of the several types of networks, such as intranet or Local Area Network (LAN) and such within the organization. The communication network 709 may either be a dedicated network or a shared network, which represents an association of several types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other. Further, the communication network 709 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc.
In some embodiments, the processor 702 may be disposed in communication with a memory 705 (e.g., RAM 713, ROM 714, etc. as shown in Fig. 7) via a storage interface 704. The storage interface 704 may connect to memory 705 including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as Serial Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), fiber channel, Small Computer Systems Interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, Redundant Array of Independent Discs (RAID), solid-state memory devices, solid-state drives, etc.
The memory 705 may store a collection of program or database components, including, without limitation, user/application data 706, an operating system 707, a web browser 708, a mail client 715, a mail server 716, a web server 717, and the like. In some embodiments, the computer system 700 may store user/application data 706, such as the data, variables, records, etc. as described in this invention. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle® or Sybase®.
The operating system 707 may facilitate resource management and operation of the computer system 700. Examples of operating systems 707 include, without limitation, APPLE MACINTOSH® OS X, UNIX®, UNIX-like system distributions (e.g., BERKELEY SOFTWARE DISTRIBUTION™ (BSD), FREEBSD™, NETBSD™, OPENBSD™, etc.), LINUX DISTRIBUTIONS™ (e.g., RED HAT™, UBUNTU™, KUBUNTU™, etc.), IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8, 10 etc.), APPLE® IOS™, GOOGLE® ANDROID™, BLACKBERRY® OS, or the like. A user interface may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 700, such as cursors, icons, check boxes, menus, windows, widgets, etc. Graphical User Interfaces (GUIs) may be employed, including, without limitation, APPLE MACINTOSH® operating systems, IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8, 10 etc.), UNIX® X-Windows, web interface libraries (e.g., AJAX™, DHTML™, ADOBE® FLASH™, JAVASCRIPT™, JAVA™, etc.), or the like.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present invention. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., non-transitory. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, nonvolatile memory, hard drives, Compact Disc (CD) ROMs, Digital Video Disc (DVDs), flash drives, disks, and any other known physical storage media.
Advantages of the embodiment of the present disclosure are illustrated herein.
In an embodiment, the present disclosure provides a method and a system for performing a context-based video compression technique.
In an embodiment, the present disclosure enables identifying defining moments or important regions in an image and facilitates compressing a video in real time, which may help live-streaming applications, as less data may be transferred.
In an embodiment, the present disclosure may also be implemented for non-live streaming applications. This helps reduce storage and data costs by reducing the size of the content stored on the server, or the original may be stored on the server while a reduced-size version of it is transferred to the users.
In an embodiment, in the present disclosure, important regions are identified based on the saliency information 207 and the context information 209, and hence the important regions are encoded with a higher number of bits and are sent in high quality, while the rest of the content can be compressed, which drastically reduces the required bandwidth without losing the quality of the important content.
The terms "an embodiment", "embodiment", "embodiments", "the embodiment", "the embodiments", "one or more embodiments", "some embodiments", and "one embodiment" mean "one or more (but not all) embodiments of the invention(s)" unless expressly specified otherwise.
The terms "including", "comprising", “having” and variations thereof mean "including but not limited to", unless expressly specified otherwise. The enumerated listing of items does not imply that any or all the items are mutually exclusive, unless expressly specified otherwise.
The terms "a", "an" and "the" mean "one or more", unless expressly specified otherwise.
A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention.
When a single device or article is described herein, it will be clear that more than one device/article (whether they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether they cooperate), it will be clear that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
Referral Numerals:
Reference Number Description
100a,100b Architecture
101 Input video
103 Video compression system
105 Encoder
107 Saliency network model
109 Context network model
110 Segmentation network model
111 Neural network
201 I/O interface
203 Processor
205 Memory
206 Data
207 Saliency information
209 Context information
211 Object segment information
213 Heatmap data
217 Other data
218 Modules
219 Saliency identification module
221 Context identification module
223 Segmentation module
225 Heatmap generation module
227 Encoding module
231 Other modules
240 Input images
242,244 Output images
700 Exemplary computer system
701 I/O Interface of the exemplary computer system
702 Processor of the exemplary computer system
703 Network interface
704 Storage interface
705 Memory of the exemplary computer system
706 User /Application
707 Operating system
708 Web browser
709 Communication network
711 Input devices
712 Output devices
713 RAM
714 ROM
715 Mail Client
716 Mail Server
717 Web Server
CLAIMS:
We claim:
1. A method of context-based video compression technique, the method comprising:
receiving, by a video compression system 103, the video 101 from one or more data sources and providing the video 101 to a saliency network model 107 and a context network model 109 associated with the video compression system 103, wherein the saliency network model 107 is a trained neural network to identify saliency information 207 in each frame of the video 101, wherein the context network model 109 is a trained neural network to identify context information 209 in each frame of the video 101;
identifying, by the video compression system 103, the saliency information 207 which indicates region of high probability of being viewed by a user for each frame of the video 101 from the saliency network model 107 and context information 209 which indicates context of each frame in the video 101 from the context network model 109;
generating, by the video compression system 103, a heatmap for each frame of the video 101 by combining the saliency information 207 and the context information 209 using a neural network 111, wherein the heatmap comprises information of one or more significant regions and one or more insignificant regions in each frame of the video 101; and
encoding, by an encoder 105 associated with the video compression system 103, each frame of the video 101 to allocate number of bits for compressing each frame of the video 101 based on the one or more significant regions and one or more insignificant regions obtained using the heatmap.
2. The method as claimed in claim 1, wherein the saliency network model 107 is trained to identify one or more regions in the frame of the video 101 with high probability of being viewed by the user based on pre-learnt one or more regions from plurality of videos with different video categories.
3. The method as claimed in claim 1, wherein the context network model 109 is trained to identify context in the frame of the video 101 based on pre-learnt context from plurality of videos with different video categories.
4. The method as claimed in claim 1, wherein the one or more significant regions are encoded with high number of bits as compared to number of bits used for encoding the one or more insignificant regions.
5. The method as claimed in claim 1 comprises identifying one or more object segmentations in each frame of the video 101 using a segmentation network model 110, wherein each of the one or more object segmentations is assigned to a predefined object category, wherein the segmentation network model 110 is a trained neural network to identify the one or more object segmentations in the frame of the video 101 based on pre-learnt object segmentations from plurality of videos with different video categories.
6. The method as claimed in claim 5, wherein information associated with the one or more object segmentations in each frame of the video 101 is used to generate the heatmap for each frame of the video 101.
7. A video compression system 103 for performing a context-based video compression, the video compression system 103 comprising:
a saliency network model 107 which is a trained neural network to identify saliency information 207 in each frame of the video 101;
a context network model 109 which is a trained neural network to identify context information 209 in each frame of the video 101;
a processor 203; and
a memory 205 communicatively coupled to the processor 203, wherein the memory 205 stores processor-executable instructions, which, on execution, cause the processor 203 to:
receive the video 101 from one or more data sources and provide the video 101 to the saliency network model 107 and the context network model 109;
identify the saliency information 207 which indicates region of high probability of being viewed by a user for each frame of the video 101 from the saliency network model 107 and context information 209 which indicates context of each frame in the video 101 from the context network model 109;
generate a heatmap for each frame of the video 101 by combining the saliency information 207 and the context information 209 using a neural network 111, wherein the heatmap comprises information of one or more significant regions and one or more insignificant regions in each frame of the video 101; and
encode each frame of the video 101 by an encoder 105 to allocate number of bits for compressing each frame of the video 101 based on the one or more significant regions and one or more insignificant regions obtained using the heatmap.
8. The system as claimed in claim 7, wherein the saliency network model 107 is trained to identify one or more regions in the frame of the video 101 with high probability of being viewed by the user based on pre-learnt one or more regions from plurality of videos with different video categories.
9. The system as claimed in claim 7, wherein the context network model 109 is trained to identify context in the frame of the video 101 based on pre-learnt context from plurality of videos with different video categories.
10. The system as claimed in claim 7, wherein the encoder 105 encodes the one or more significant regions with high number of bits as compared to number of bits used for encoding the one or more insignificant regions.
11. The system as claimed in claim 7 further comprises a segmentation network model 110 for identifying one or more object segmentations in each frame of the video 101, wherein each of the one or more object segmentations is assigned to a predefined object category, wherein the segmentation network model 110 is a trained neural network to identify the one or more object segmentations in the frame of the video 101 based on pre-learnt object segmentations from plurality of videos with different video categories.
12. The system as claimed in claim 11 uses information associated with the one or more object segmentations in each frame of the video 101 to generate the heatmap for each frame of the video 101.