Selective Inter-Component Transform (ICT) for Image and Video Coding
Abstract:
An encoder for encoding a plurality of components of an image content region of an image to be encoded is configured for obtaining the plurality of components representing the image content region; selecting an intercomponent transform from a set of intercomponent transforms; encoding the plurality of components using the selected intercomponent transform to obtain encoded components; and providing the encoded components.
c/o Fraunhofer-Institut für Nachrichtentechnik, Heinrich-Hertz-Institut, HHI
Einsteinufer 37
10587 Berlin
2. RUDAT, Christian
3. NGUYEN, Tung Hoang
4. SCHWARZ, Heiko
5. MARPE, Detlev
6. WIEGAND, Thomas
(each inventor c/o the same Fraunhofer HHI address given above)
Specification
Selective Inter-Component Transform (ICT) for Image and Video Coding
The following description of the figures starts with a description of an encoder and a decoder of a block-based predictive codec for coding pictures of a video, in order to form an example for a coding framework into which embodiments of the present invention may be built. The respective encoder and decoder are described with respect to Figures 1 to 3. Thereafter, the description of embodiments of the concept of the present invention is presented along with a description as to how such concepts could be built into the encoder and decoder of Figures 1 and 2, respectively, although the embodiments described with the subsequent Figures 4 and following may also be used to form encoders and decoders not operating according to the coding framework underlying the encoder and decoder of Figures 1 and 2.
Equal or equivalent elements or elements with equal or equivalent functionality are denoted in the following description by equal or equivalent reference numerals even if occurring in different figures.
In the following description, a plurality of details is set forth to provide a more thorough explanation of embodiments of the present invention. However, it will be apparent to those skilled in the art that embodiments of the present invention may be practiced without these specific details. In other instances, well known structures and devices are shown in block diagram form rather than in detail in order to avoid obscuring embodiments of the present invention. In addition, features of the different embodiments described hereinafter may be combined with each other, unless specifically noted otherwise.
Figure 1 shows an apparatus for predictively coding a picture 12 into a data stream 14 exemplarily using transform-based residual coding. The apparatus, or encoder, is indicated using reference sign 10. Figure 2 shows a corresponding decoder 20, i.e. an apparatus 20 configured to predictively decode the picture 12’ from the data stream 14 also using transform-based residual decoding, wherein the apostrophe has been used to indicate that the picture 12' as reconstructed by the decoder 20 deviates from picture 12 originally encoded by apparatus 10 in terms of coding loss introduced by a quantization of the prediction residual signal. Figure 1 and Figure 2 exemplarily use transform based prediction residual coding, although embodiments of the present application are not restricted to this kind of prediction residual coding. This is true for other details described with respect to Figures 1 and 2, too, as will be outlined hereinafter.
The encoder 10 is configured to subject the prediction residual signal to spatial-to-spectral transformation and to encode the prediction residual signal thus obtained into the data stream 14. Likewise, the decoder 20 is configured to decode the prediction residual signal from the data stream 14 and subject the prediction residual signal thus obtained to spectral-to-spatial transformation.
Internally, the encoder 10 may comprise a prediction residual signal former 22 which generates a prediction residual 24 so as to measure a deviation of a prediction signal 26 from the original signal, i.e. from the picture 12. The prediction residual signal former 22 may, for instance, be a subtractor which subtracts the prediction signal from the original signal, i.e. from the picture 12. The encoder 10 then further comprises a transformer 28 which subjects the prediction residual signal 24 to a spatial-to-spectral transformation to obtain a spectral-domain prediction residual signal 24' which is then subject to quantization by a quantizer 32, also comprised by the encoder 10. The thus quantized prediction residual signal 24'' is coded into bitstream 14. To this end, encoder 10 may optionally comprise an entropy coder 34 which entropy codes the prediction residual signal as transformed and quantized into data stream 14. The prediction signal 26 is generated by a prediction stage 36 of encoder 10 on the basis of the prediction residual signal 24'' encoded into, and decodable from, data stream 14. To this end, the prediction stage 36 may internally, as is shown in Figure 1, comprise a dequantizer 38 which dequantizes prediction residual signal 24'' so as to gain spectral-domain prediction residual signal 24''', which corresponds to signal 24' except for quantization loss, followed by an inverse transformer 40 which subjects the latter prediction residual signal 24''' to an inverse transformation, i.e. a spectral-to-spatial transformation, to obtain prediction residual signal 24'''', which corresponds to the original prediction residual signal 24 except for quantization loss. A combiner 42 of the prediction stage 36 then recombines, such as by addition, the prediction signal 26 and the prediction residual signal 24'''' so as to obtain a reconstructed signal 46, i.e. a reconstruction of the original signal 12. Reconstructed signal 46 may correspond to signal 12'.
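The encode/decode round trip described above can be sketched as follows. This is a minimal NumPy illustration; the orthonormal DCT-II, the 4x4 block size, and the uniform quantizer step are illustrative choices, not part of the specification:

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II basis; rows are basis functions, so T @ T.T = I."""
    k, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    t = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * j + 1) / (2 * n))
    t[0] /= np.sqrt(2.0)
    return t

def encode_block(orig, pred, T, qstep):
    """Residual former 22 -> transformer 28 -> quantizer 32."""
    resid = orig - pred                  # prediction residual 24
    coeffs = T @ resid @ T.T             # spectral-domain residual 24'
    return np.round(coeffs / qstep)      # quantized levels 24''

def decode_block(levels, pred, T, qstep):
    """Dequantizer 38/52 -> inverse transformer -> combiner 42/56."""
    resid = T.T @ (levels * qstep) @ T   # reconstructed residual 24''''
    return pred + resid                  # reconstructed signal 46 / picture 12'

T = dct2_matrix(4)
orig = np.arange(16.0).reshape(4, 4)
pred = np.full((4, 4), 7.0)
rec = decode_block(encode_block(orig, pred, T, qstep=0.01), pred, T, qstep=0.01)
assert np.allclose(T @ T.T, np.eye(4))   # inverse transform is the transpose
assert np.allclose(rec, orig, atol=0.1)  # only small quantization loss remains
```

The residual reaching the decoder deviates from the original residual only by the quantization loss, which is the coding loss mentioned for picture 12'.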
A prediction module 44 of prediction stage 36 then generates the prediction signal 26 on the basis of signal 46 by using, for instance, spatial prediction, i.e. intra-picture prediction, and/or temporal prediction, i.e. inter-picture prediction.
Likewise, decoder 20, as shown in Figure 2, may be internally composed of components corresponding to, and interconnected in a manner corresponding to, prediction stage 36. In particular, entropy decoder 50 of decoder 20 may entropy decode the quantized spectral-domain prediction residual signal 24” from the data stream, whereupon dequantizer 52, inverse transformer 54, combiner 56 and prediction module 58, interconnected and
cooperating in the manner described above with respect to the modules of prediction stage 36, recover the reconstructed signal on the basis of prediction residual signal 24” so that, as shown in Figure 2, the output of combiner 56 results in the reconstructed signal, namely picture 12'.
Although not specifically described above, it is readily clear that the encoder 10 may set some coding parameters including, for instance, prediction modes, motion parameters and the like, according to some optimization scheme such as, for instance, in a manner optimizing some rate and distortion related criterion, i.e. coding cost. For example, encoder 10 and decoder 20 and the corresponding modules 44, 58, respectively, may support different prediction modes such as intra-coding modes and inter-coding modes. The granularity at which encoder and decoder switch between these prediction mode types may correspond to a subdivision of picture 12 and 12’, respectively, into coding segments or coding blocks. In units of these coding segments, for instance, the picture may be subdivided into blocks being intra-coded and blocks being inter-coded. Intra-coded blocks are predicted on the basis of a spatial, already coded/decoded neighborhood of the respective block as is outlined in more detail below. Several intra-coding modes may exist and be selected for a respective intra-coded segment including directional or angular intra-coding modes according to which the respective segment is filled by extrapolating the sample values of the neighborhood along a certain direction which is specific for the respective directional intra-coding mode, into the respective intra-coded segment. 
The intra-coding modes may, for instance, also comprise one or more further modes such as a DC coding mode, according to which the prediction for the respective intra-coded block assigns a DC value to all samples within the respective intra-coded segment, and/or a planar intra-coding mode according to which the prediction of the respective block is approximated or determined to be a spatial distribution of sample values described by a two-dimensional linear function over the sample positions of the respective intra-coded block, with tilt and offset of the plane defined by the two-dimensional linear function being derived on the basis of the neighboring samples. Compared thereto, inter-coded blocks may be predicted, for instance, temporally. For inter-coded blocks, motion vectors may be signaled within the data stream, the motion vectors indicating the spatial displacement of the portion of a previously coded picture of the video to which picture 12 belongs, at which the previously coded/decoded picture is sampled in order to obtain the prediction signal for the respective inter-coded block. This means, in addition to the residual signal coding comprised by data stream 14, such as the entropy-coded transform coefficient levels representing the quantized spectral-domain prediction residual signal 24'', data stream 14 may have encoded thereinto coding
mode parameters for assigning the coding modes to the various blocks, prediction parameters for some of the blocks, such as motion parameters for inter-coded segments, and optional further parameters such as parameters for controlling and signaling the subdivision of picture 12 and 12', respectively, into the segments. The decoder 20 uses these parameters to subdivide the picture in the same manner as the encoder did, to assign the same prediction modes to the segments, and to perform the same prediction to result in the same prediction signal.
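The rate-and-distortion-related criterion mentioned above is commonly implemented as a Lagrangian coding cost J = D + λ·R. A toy sketch of such a mode decision; the candidate list, its D/R values, and the λ values are purely illustrative:

```python
def select_mode(candidates, lam):
    """Pick the coding mode minimizing the Lagrangian cost J = D + lam * R."""
    return min(candidates, key=lambda c: c["distortion"] + lam * c["rate"])["mode"]

candidates = [
    {"mode": "intra_dc",      "distortion": 120.0, "rate": 10.0},  # cheap, less accurate
    {"mode": "intra_angular", "distortion":  80.0, "rate": 30.0},
    {"mode": "inter",         "distortion":  40.0, "rate": 55.0},  # accurate, costly
]
assert select_mode(candidates, lam=0.5) == "inter"      # low lambda favors quality
assert select_mode(candidates, lam=10.0) == "intra_dc"  # high lambda favors low rate
```

The encoder evaluates such costs; only the winning mode parameters are written into data stream 14, and the decoder simply applies them.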
Figure 3 illustrates the relationship between the reconstructed signal, i.e. the reconstructed picture 12', on the one hand, and the combination of the prediction residual signal 24'''' as signaled in the data stream 14, and the prediction signal 26, on the other hand. As already denoted above, the combination may be an addition. The prediction signal 26 is illustrated in Figure 3 as a subdivision of the picture area into intra-coded blocks which are illustratively indicated using hatching, and inter-coded blocks which are illustratively indicated not-hatched. The subdivision may be any subdivision, such as a regular subdivision of the picture area into rows and columns of square blocks or non-square blocks, or a multi-tree subdivision of picture 12 from a tree root block into a plurality of leaf blocks of varying size, such as a quadtree subdivision or the like, wherein a mixture thereof is illustrated in Figure 3 in which the picture area is first subdivided into rows and columns of tree root blocks which are then further subdivided in accordance with a recursive multi-tree subdivisioning into one or more leaf blocks.
Again, data stream 14 may have an intra-coding mode coded thereinto for intra-coded blocks 80, which assigns one of several supported intra-coding modes to the respective intra-coded block 80. For inter-coded blocks 82, the data stream 14 may have one or more motion parameters coded thereinto. Generally speaking, inter-coded blocks 82 are not restricted to being temporally coded. Alternatively, inter-coded blocks 82 may be any block predicted from previously coded portions beyond the current picture 12 itself, such as previously coded pictures of a video to which picture 12 belongs, or pictures of another view or a hierarchically lower layer in the case of encoder and decoder being scalable encoders and decoders, respectively.
The prediction residual signal 24'''' in Figure 3 is also illustrated as a subdivision of the picture area into blocks 84. These blocks might be called transform blocks in order to distinguish same from the coding blocks 80 and 82. In effect, Figure 3 illustrates that encoder 10 and decoder 20 may use two different subdivisions of picture 12 and picture 12', respectively, into blocks, namely one subdivisioning into coding blocks 80 and 82,
respectively, and another subdivision into transform blocks 84. Both subdivisions might be the same, i.e. each coding block 80 and 82 may concurrently form a transform block 84, but Figure 3 illustrates the case where, for instance, a subdivision into transform blocks 84 forms an extension of the subdivision into coding blocks 80, 82 so that any border between two blocks of blocks 80 and 82 overlays a border between two blocks 84, or alternatively speaking each block 80, 82 either coincides with one of the transform blocks 84 or coincides with a cluster of transform blocks 84. However, the subdivisions may also be determined or selected independent from each other so that transform blocks 84 could alternatively cross block borders between blocks 80, 82. As far as the subdivision into transform blocks 84 is concerned, similar statements are thus true as those brought forward with respect to the subdivision into blocks 80, 82, i.e. the blocks 84 may be the result of a regular subdivision of picture area into blocks (with or without arrangement into rows and columns), the result of a recursive multi-tree subdivisioning of the picture area, or a combination thereof or any other sort of block partitioning. Just as an aside, it is noted that blocks 80, 82 and 84 are not restricted to being of quadratic, rectangular or any other shape.
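A recursive multi-tree subdivisioning of a tree root block into leaf blocks, as sketched for Figure 3, can be illustrated with a quadtree. The split decision below is a stand-in for whatever the encoder would actually signal in data stream 14:

```python
def quadtree_leaves(x, y, size, split, min_size=4):
    """Recursively subdivide a tree root block into leaf blocks (x, y, size)."""
    if size <= min_size or not split(x, y, size):
        return [(x, y, size)]            # leaf block
    half = size // 2
    leaves = []
    for dy in (0, half):                 # visit the four quadrants
        for dx in (0, half):
            leaves += quadtree_leaves(x + dx, y + dy, half, split, min_size)
    return leaves

# toy split decision: keep splitting only the top-left quadrant
leaves = quadtree_leaves(0, 0, 16, lambda x, y, s: x == 0 and y == 0)
assert len(leaves) == 7                            # four 4x4 leaves + three 8x8 leaves
assert sum(s * s for _, _, s in leaves) == 16 * 16  # leaves tile the root block exactly
```

The same recursion serves both coding-block and transform-block subdivisions; only the signaled split decisions differ.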
Figure 3 further illustrates that the combination of the prediction signal 26 and the prediction residual signal 24'''' directly results in the reconstructed signal 12'. However, it should be noted that more than one prediction signal 26 may be combined with the prediction residual signal 24'''' to result into picture 12' in accordance with alternative embodiments.
In Figure 3, the transform blocks 84 shall have the following significance. Transformer 28 and inverse transformer 54 perform their transformations in units of these transform blocks 84. For instance, many codecs use some sort of DST or DCT for all transform blocks 84. Some codecs allow for skipping the transformation so that, for some of the transform blocks 84, the prediction residual signal is coded in the spatial domain directly. However, in accordance with embodiments described below, encoder 10 and decoder 20 are configured in such a manner that they support several transforms. For example, the transforms supported by encoder 10 and decoder 20 could comprise:
○ DCT-II (or DCT-III), where DCT stands for Discrete Cosine Transform
○ DST-IV, where DST stands for Discrete Sine Transform
○ DCT-IV
○ DST-VII
○ Identity Transformation (IT)
Naturally, while transformer 28 would support all of the forward transform versions of these transforms, the decoder 20 or inverse transformer 54 would support the corresponding backward or inverse versions thereof:
○ Inverse DCT-II (or inverse DCT-III)
○ Inverse DST-IV
○ Inverse DCT-IV
○ Inverse DST-VII
○ Identity Transformation (IT)
The subsequent description provides more details on which transforms could be supported by encoder 10 and decoder 20. In any case, it should be noted that the set of supported transforms may comprise merely one transform such as one spectral-to-spatial or spatial-to-spectral transform.
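For the transform pairs listed above, the inverse of an orthonormal transform matrix is simply its transpose, and the identity transformation corresponds to transform skip. A sketch using the standard textbook definitions of the DCT-II and DST-VII (orthonormally scaled; the block size 8 is illustrative):

```python
import numpy as np

def dct2(n):
    """Orthonormal DCT-II; its inverse (the DCT-III) is the transpose."""
    k, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    t = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * j + 1) / (2 * n))
    t[0] /= np.sqrt(2.0)
    return t

def dst7(n):
    """Orthonormal DST-VII, favored for residuals that grow away from the predictor edge."""
    k, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return 2.0 / np.sqrt(2 * n + 1) * np.sin(np.pi * (2 * k + 1) * (j + 1) / (2 * n + 1))

for T in (dct2(8), dst7(8), np.eye(8)):      # identity = transform skip
    assert np.allclose(T @ T.T, np.eye(8))   # forward/inverse pair via transpose
```

In a real codec these matrices are approximated with scaled integer coefficients, but the transpose relationship between forward and inverse versions carries over.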
As already outlined above, Figures 1 to 3 have been presented as an example where the inventive concept described further below may be implemented in order to form specific examples for encoders and decoders according to the present application. Insofar, the encoder and decoder of Figures 1 and 2, respectively, may represent possible implementations of the encoders and decoders described herein below. Figures 1 and 2 are, however, only examples. An encoder according to embodiments of the present application may, however, perform block-based encoding of a picture 12 using the concept outlined in more detail below and being different from the encoder of Figure 1 such as, for instance, in that same is no video encoder, but a still picture encoder, in that same does not support inter-prediction, or in that the sub-division into blocks 80 is performed in a manner different than exemplified in Figure 3. Likewise, decoders according to embodiments of the present application may perform block-based decoding of picture 12' from data stream 14 using the coding concept further outlined below, but may differ, for instance, from the decoder 20 of Figure 2 in that same is no video decoder, but a still picture decoder, in that same does not support intra-prediction, or in that same sub-divides picture 12' into blocks in a manner different than described with respect to Figure 3 and/or in that same does not derive the prediction residual from the data stream 14 in transform domain, but in spatial domain, for instance.
Embodiments of the present invention will now be described whilst making at least in parts reference to Fig. 4a and Fig. 4b, which show the functionality of a respective encoder 601, 602 and a respective decoder 651, 652. The configurations of Fig. 4a and Fig. 4b deviate from each other in view of the sequential order at which the inventive selected inter-component transform 621 or 622, or its inverse version 621' or 622', respectively, is applied.
1. Introduction, State of the Art
In natural still and moving color pictures (simply referred to as images and videos hereafter), a significant amount of signal correlation between the individual color components can generally be observed. This is particularly the case with content represented in a YUV or YCbCr (luma-chroma) or an RGB (red-green-blue) domain. To efficiently exploit such inter-component redundancy in image or video coding, several predictive techniques have recently been proposed. Of these, the most notable are
• cross-component linear-model (CCLM) prediction, a linear predictive coding (LPC) method which predicts, on a block level, one component's input signal from another (usually the luma) decoded component's signal and encodes only the error, i.e., the difference between input and prediction;
• joint chroma coding (JCC), an approach which encodes only the difference between two chroma residual signals (i.e., only a single downmix) and decodes said two chroma signals using the simple sample-wise upmix rule "V = -U" or "Cr = -Cb" for YUV or YCbCr coding, respectively. In other words, the JCC upmix represents a prediction of V or Cr from U or Cb, respectively, without coding an associated error, or residual, for V respectively Cr during the JCC downmix process.
Both the CCLM and JCC techniques, which are described in detail in [1] and [2], respectively, signal their activation in a particular coding block to the decoder by means of a single flag. Moreover, it is worth noting that both schemes can, in principle, be applied between an arbitrary component pair, i.e.,
• between a luma and a chroma signal, or between two chroma signals, in YUV or YCbCr coding,
• between an R and a G signal or an R and a B signal or, finally, a G and a B signal in RGB coding.
In the above list, the term "signal" may denote a spatial-domain input signal within a particular region, or block, of the input image or video, or it may represent the residual (i.e., difference or error) between said spatial-domain input signal and its spatial-domain prediction signal obtained using an arbitrary spatial, spectral, or temporal predictive coding technique (e.g. angular intra prediction or motion compensation).
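The JCC down-/upmix can be sketched as follows. The averaging downmix is an assumption on the encoder side (the text above fixes only the decoder-side upmix rule Cr = -Cb); it is chosen so that perfectly anti-correlated residuals reconstruct exactly:

```python
import numpy as np

def jcc_downmix(res_cb, res_cr):
    """Encoder: code a single joint channel for two chroma residuals (assumed averaging rule)."""
    return (res_cb - res_cr) / 2.0

def jcc_upmix(joint):
    """Decoder: sample-wise rule Cr = -Cb (V = -U); no residual is coded for Cr."""
    return joint, -joint                     # (Cb', Cr')

cb = np.array([4.0, -2.0, 6.0])
cr = np.array([-4.0, 2.0, -6.0])             # perfectly anti-correlated residuals
cb_rec, cr_rec = jcc_upmix(jcc_downmix(cb, cr))
assert np.allclose(cb_rec, cb) and np.allclose(cr_rec, cr)  # lossless for this content
```

For content that deviates from the Cr = -Cb model, the reconstruction error grows, which is exactly the inflexibility criticized in Section 2.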
2. Shortcomings of State of the Art
While the abovementioned solutions succeed in increasing the coding efficiency in a modern image or video codec, two shortcomings can be identified in connection with the CCLM and JCC approaches:
• Applying the CCLM method between two chroma-channel signals requires, in both the encoder and decoder, a computationally relatively complex derivation of a particular prediction parameter (a CCLM weight) from top and left neighboring samples of the coding block under consideration.
• Employing the JCC technique was found to be relatively inflexible since only a signal difference is supported for downmixing and upmixing. While, on average, this approach works well for YUV or YCbCr coded content, the coding gains were found to be relatively low on RGB coded input and on natural images or videos recorded with cameras suffering from notable chromatic aberration.
It is, therefore, desirable to provide a more flexible method and apparatus for joint-component coding of images or videos, which retains the low complexity of the JCC approach.
3. Summary of Invention
To address the above-noted shortcomings, the present invention comprises the following aspects, where the term signaling denotes the transmission of coding information from an encoder to a decoder. Each of these aspects will be described in detail in a separate section.
1. Block or picture-selective application (i.e., activation) of one of at least two inter-component joint coding/decoding methods, along with a corresponding block or picture-wise explicit signaling of the application of said joint coding/decoding by means of a (possibly entropy coded) on/off flag, or, alternatively, a non-binary index. The two or more inter-component methods may represent any of the following:
• Coding of a single downmix channel which represents two color channels; with C' representing the decoded downmix channel, the decoded color channels are obtained by Cb' = a · C' and Cr' = b · C', where a and b represent specific mixing factors (often either a or b is set equal to 1);
• Coding of two mixing channels; with C1' and C2' being the decoded mixing channels, the decoded color components Cb' and Cr' are obtained by applying an orthogonal (or nearly orthogonal) transform of size 2 to the decoded mixing channels C1' and C2'.
Both methods can be extended to more than two color components. If the mixing is applied to N > 2 color components, it is also possible to code M (with M < N) mixing channels and reconstruct the N color components, given that the M coded color channels are linearly mapped to N reconstructed color components. The applied transform can be specified by multiple rotation angles or, more generally, an NxN transform matrix (with at least nearly orthogonal basis functions). As for the N = 2 case, the actually applied transform can be specified by linear combinations using integer operations.
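For N = 2, the orthogonal size-2 transform of the second method is a rotation by an angle α. A sketch of this transform-based ICT variant; the angle and sample values are illustrative:

```python
import numpy as np

def ict_forward(cb, cr, alpha):
    """Size-2 rotation applied to the two chroma residuals (encoder/downmix side)."""
    c, s = np.cos(alpha), np.sin(alpha)
    return c * cb + s * cr, -s * cb + c * cr   # mixing channels C1, C2

def ict_inverse(c1, c2, alpha):
    """Inverse rotation (decoder/upmix side); orthogonal, so the transpose inverts it."""
    c, s = np.cos(alpha), np.sin(alpha)
    return c * c1 - s * c2, s * c1 + c * c2    # decoded Cb', Cr'

cb = np.array([3.0, -1.0, 2.0])
cr = np.array([2.5, -0.8, 1.9])                # strongly correlated chroma residuals
c1, c2 = ict_forward(cb, cr, np.pi / 4)
cb_rec, cr_rec = ict_inverse(c1, c2, np.pi / 4)
assert np.allclose(cb_rec, cb) and np.allclose(cr_rec, cr)  # perfect reconstruction
assert np.var(c2) < np.var(c1)                 # energy compacted into one channel
```

The energy compaction shown in the last assertion is what makes the second channel cheap to code, and motivates the down-mixing-based class described next.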
ICT class 2: Down-mixing-based coding with a reduction of the number of color channels
As mentioned above, the main advantage of the transform-based ICT variant described above is that the variance of one of the resulting components becomes small compared to the variance of the other component (for blocks with a certain amount of correlation). Often, this results in one of the components being quantized to zero (for the entire block). For simplifying implementations, the color transform can be implemented in a way that one of the resulting components (C1 or C2) is forced to be quantized to zero. In this case, both original color channels Cb and Cr are represented by a single transmitted component C. And given the reconstructed version of the color component, denoted by C’, the reconstructed color channels Cb' and Cr’ can be obtained according to
Cb' = w · cos(α) · C', Cr' = w · sin(α) · C',

where α represents a rotation angle and w represents a scaling factor. Similar as above, the actual implementation can be simplified, for example according to
Cb' = C', Cr' = a · C'; or
Cr' = C', Cb' = b · C'.
One or more of the multiple supported ICT transforms may correspond to such a joint component coding with different rotation angles α, or different scaling factors a, b (in combination with a decision which of the color components is set equal to the transmitted component C). At the encoder, the actually coded color component C is obtained by a so-called down-mixing, which can be represented as a linear combination C = m1 · Cb + m2 · Cr, where the factors m1 and m2 may, for example, be chosen in a way that the distortion of the reconstructed color components Cb' and Cr' is minimized.
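For the simplified upmix Cb' = C', Cr' = a · C', distortion-minimizing downmix factors follow from least squares: minimizing (Cb - C)^2 + (Cr - a·C)^2 per sample gives m1 = 1/(1 + a^2) and m2 = a/(1 + a^2). A sketch under this assumption (the value of a and the residuals are illustrative):

```python
import numpy as np

def ict2_downmix(cb, cr, a):
    """Encoder: least-squares downmix C = m1*Cb + m2*Cr for the upmix (C, a*C)."""
    return (cb + a * cr) / (1.0 + a * a)   # m1 = 1/(1+a^2), m2 = a/(1+a^2)

def ict2_upmix(c, a):
    """Decoder: both color channels from the single transmitted component."""
    return c, a * c                         # Cb' = C', Cr' = a * C'

a = 0.5
cb = np.array([2.0, -4.0, 6.0])
cr = a * cb                                 # content exactly matching the model
cb_rec, cr_rec = ict2_upmix(ict2_downmix(cb, cr, a), a)
assert np.allclose(cb_rec, cb) and np.allclose(cr_rec, cr)  # lossless for model content
```

For content deviating from Cr = a · Cb the downmix is still the distortion-optimal single-channel representation, unlike the fixed JCC difference rule.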
Similar as for the variant 1 above, this second variant can also be generalized to more than two color components. Here, multiple configurations are possible. In a first configuration, the N > 2 original color channels are represented by a single joint color channel (M = 1 resulting coded components). In another configuration, the N > 2 original color channels are represented by M (with 1 < M < N) resulting channels (for example, M = N - 1 channels). For both configurations, the reconstruction of the original color channels can be represented by a matrix (with N rows and M
Documents & Orders

1. 202137040497-IntimationOfGrant02-09-2024.pdf (2024-09-02)
2. 202137040497-PatentCertificate02-09-2024.pdf (2024-09-02)
3. 202137040497-Written submissions and relevant documents [11-03-2024(online)].pdf (2024-03-11)
4. 202137040497-FORM-26 [25-02-2024(online)].pdf (2024-02-25)
5. 202137040497-Correspondence to notify the Controller [22-02-2024(online)].pdf

Application Documents

1. 202137040497-STATEMENT OF UNDERTAKING (FORM 3) [07-09-2021(online)].pdf (2021-09-07)
2. 202137040497-FORM 1 [07-09-2021(online)].pdf (2021-09-07)
3. 202137040497-FIGURE OF ABSTRACT [07-09-2021(online)].pdf (2021-09-07)
4. 202137040497-DRAWINGS [07-09-2021(online)].pdf (2021-09-07)
5. 202137040497-DECLARATION OF INVENTORSHIP (FORM 5) [07-09-2021(online)].pdf (2021-09-07)