Specification
3. PREAMBLE TO THE DESCRIPTION:
The following specification particularly describes the invention and the manner in which it
is to be performed:
FIELD OF THE INVENTION
The present invention relates to audio signal coding and, in particular, to an apparatus
and method for stereo filling in multichannel coding.
BACKGROUND OF THE INVENTION
Audio coding is the domain of compression that deals with exploiting redundancy and
irrelevancy in audio signals.
In MPEG USAC (see, e.g., [3]), joint stereo coding of two channels is performed using
complex prediction, MPS 2-1-2 or unified stereo with band-limited or full-band residual
signals. MPEG surround (see, e.g., [4]) hierarchically combines One-To-Two (OTT) and
Two-To-Three (TTT) boxes for joint coding of multichannel audio with or without
transmission of residual signals.
In MPEG-H, Quad Channel Elements hierarchically apply MPS 2-1-2 stereo boxes
followed by complex prediction/MS stereo boxes building a fixed 4x4 remixing tree (see,
e.g., [1]).
AC-4 (see, e.g., [6]) introduces new 3-, 4- and 5-channel elements that allow for remixing
transmitted channels via a transmitted mix matrix and subsequent joint stereo coding
information. Further, prior publications suggest using orthogonal transforms such as the
Karhunen-Loeve Transform (KLT) for enhanced multichannel audio coding (see, e.g., [7]).
For example, in the 3D audio context, loudspeaker channels are distributed in several
height layers, resulting in horizontal and vertical channel pairs. Joint coding of only two
channels as defined in USAC is not sufficient to consider the spatial and perceptual
relations between channels. MPEG Surround is applied in an additional pre-/
postprocessing step, and residual signals are transmitted individually without the possibility of
joint stereo coding, e.g. to exploit dependencies between left and right vertical residual
signals. In AC-4, dedicated N-channel elements are introduced that allow for efficient
encoding of joint coding parameters, but fail for generic speaker setups with more
channels as proposed for new immersive playback scenarios (7.1+4, 22.2). The MPEG-H
Quad Channel Element is also restricted to only 4 channels and cannot be dynamically
applied to arbitrary channels but only to a pre-configured and fixed number of channels.
The MPEG-H Multichannel Coding Tool allows the creation of an arbitrary tree of
discretely coded stereo boxes, i.e. jointly coded channel pairs, see [2].
A problem that often arises in audio signal coding is caused by quantization, e.g., spectral
quantization. Quantization may result in spectral holes. For example, all spectral
values in a particular frequency band may be set to zero on the encoder side as a result of
quantization. For example, the exact value of such spectral lines before quantization may
be relatively low, and quantization may then lead to a situation where the spectral values
of all spectral lines within a particular frequency band have been set to zero.
On the decoder side, when decoding, this may lead to undesired spectral holes.
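The effect described above can be illustrated with a minimal sketch. It assumes a plain uniform quantizer with a single step size and a hypothetical band layout (`band_offsets`); neither is taken from a particular codec, and the function names are illustrative only.

```python
import numpy as np

def quantize(spectrum, step):
    """Uniform quantizer (an assumption): round each line to the nearest multiple of step."""
    return np.round(np.asarray(spectrum, dtype=float) / step).astype(int)

def zero_bands(quantized, band_offsets):
    """Return the indices of all bands in which every quantized line is zero."""
    return [b for b in range(len(band_offsets) - 1)
            if not np.any(quantized[band_offsets[b]:band_offsets[b + 1]])]

# Hypothetical spectrum with three bands; the middle band holds only low-level lines.
spectrum = np.array([4.0, -3.2, 2.7, 0.3, -0.2, 0.4, 1.9, -2.5])
band_offsets = [0, 3, 6, 8]

q = quantize(spectrum, step=1.0)
holes = zero_bands(q, band_offsets)  # the middle band becomes a spectral hole
```

In this sketch the low-level lines of the middle band all round to zero, so a decoder that only sees `q` has no information about that band other than its side information.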
Modern frequency-domain speech/audio coding systems such as the Opus/Celt codec of
the IETF [9], MPEG-4 (HE-)AAC [10] or, in particular, MPEG-D xHE-AAC (USAC) [11],
offer means to code audio frames using either one long transform – a long block – or eight
sequential short transforms – short blocks – depending on the temporal stationarity of the
signal. In addition, for low-bitrate coding these schemes provide tools to reconstruct
frequency coefficients of a channel using pseudorandom noise or lower-frequency
coefficients of the same channel. In xHE-AAC, these tools are known as noise filling and
spectral band replication, respectively.
However, for very tonal or transient stereophonic input, noise filling and/or spectral band
replication alone limit the achievable coding quality at very low bitrates, mostly since too
many spectral coefficients of both channels need to be transmitted explicitly.
MPEG-H Stereo Filling is a parametric tool which relies on the use of a previous frame’s
downmix to improve the filling of spectral holes caused by quantization in the frequency
domain. Like noise filling, Stereo Filling operates directly in the MDCT domain of the
MPEG-H core coder, see [1], [5], [8].
However, the use of MPEG Surround and Stereo Filling in MPEG-H is restricted to fixed
channel pair elements and therefore cannot exploit time-variant inter-channel
dependencies.
The Multichannel Coding Tool (MCT) in MPEG-H allows adapting to varying inter-channel
dependencies but, due to usage of single channel elements in typical operating
configurations, does not allow Stereo Filling. The prior art does not disclose perceptually
optimal ways to generate previous frame’s downmixes in case of time-variant, arbitrary
jointly coded channel pairs. Using noise filling as a substitute for stereo filling in
combination with the MCT to fill spectral holes would lead to noise artifacts, especially for
tonal signals.
SUMMARY OF THE INVENTION
The object of the present invention is to provide improved audio coding concepts. The
object of the present invention is solved by an apparatus for decoding according to claim
1, by an apparatus for encoding according to claim 15, by a method for decoding
according to claim 18, by a method for encoding according to claim 19, by a computer
program according to claim 20 and by an encoded multichannel signal according to claim
21.
An apparatus for decoding an encoded multichannel signal of a current frame to obtain
three or more current audio output channels is provided. A multichannel processor is
adapted to select two decoded channels from three or more decoded channels depending
on first multichannel parameters. Moreover, the multichannel processor is adapted to
generate a first group of two or more processed channels based on said selected
channels. A noise filling module is adapted to identify for at least one of the selected
channels, one or more frequency bands, within which all spectral lines are quantized to
zero, and to generate a mixing channel using, depending on side information, a proper
subset of three or more previous audio output channels that have been decoded, and to
fill the spectral lines of frequency bands, within which all spectral lines are quantized to
zero, with noise generated using spectral lines of the mixing channel.
According to embodiments, an apparatus for decoding a previous encoded multichannel
signal of a previous frame to obtain three or more previous audio output channels, and for
decoding a current encoded multichannel signal of a current frame to obtain three or more
current audio output channels is provided.
The apparatus comprises an interface, a channel decoder, a multichannel processor for
generating the three or more current audio output channels, and a noise filling module.
The interface is adapted to receive the current encoded multichannel signal, and to
receive side information comprising first multichannel parameters.
The channel decoder is adapted to decode the current encoded multichannel signal of the
current frame to obtain a set of three or more decoded channels of the current frame.
The multichannel processor is adapted to select a first selected pair of two decoded
channels from the set of three or more decoded channels depending on the first
multichannel parameters.
Moreover, the multichannel processor is adapted to generate a first group of two or more
processed channels based on said first selected pair of two decoded channels to obtain
an updated set of three or more decoded channels.
Before the multichannel processor generates the first group of two or more processed
channels based on said first selected pair of two decoded channels, the noise filling
module is adapted to identify for at least one of the two channels of said first selected pair
of two decoded channels, one or more frequency bands, within which all spectral lines are
quantized to zero, and to generate a mixing channel using two or more, but not all of the
three or more previous audio output channels, and to fill the spectral lines of the one or
more frequency bands, within which all spectral lines are quantized to zero, with noise
generated using spectral lines of the mixing channel, wherein the noise filling module is
adapted to select the two or more previous audio output channels that are used for
generating the mixing channel from the three or more previous audio output channels
depending on the side information.
A particular concept of embodiments that may be employed by the noise filling module
that specifies how to generate and fill noise is referred to as Stereo Filling.
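The decoder-side filling step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the filling rule of any embodiment: the mixing channel is assumed to be a plain average of the selected previous output channels, the `selected` index list stands in for the side information, and the fixed `gain` is a hypothetical placeholder for whatever level control the bitstream provides.

```python
import numpy as np

def fill_zero_bands(current, previous_outputs, selected, band_offsets, gain=0.5):
    """Fill all-zero bands of one selected channel from a mixing channel.

    current          : 1-D array of spectral lines of one selected decoded channel
    previous_outputs : 2-D array (channels x lines), previous frame's output spectra
    selected         : indices of previous channels to mix (stands in for side information)
    gain             : hypothetical fill level (assumption, not from the source)
    """
    current = np.array(current, dtype=float)
    prev = np.asarray(previous_outputs, dtype=float)
    # The text requires two or more, but not all, of the previous output channels.
    if not 2 <= len(selected) < prev.shape[0]:
        raise ValueError("selected must be a proper subset with at least two channels")
    # Assumption: the mixing channel is the average of the selected channels.
    mixing = prev[list(selected)].mean(axis=0)
    for lo, hi in zip(band_offsets[:-1], band_offsets[1:]):
        if not np.any(current[lo:hi]):          # band quantized entirely to zero
            current[lo:hi] = gain * mixing[lo:hi]
    return current

prev_frame = np.array([[1.0, 1.0, 1.0, 1.0],
                       [3.0, 3.0, 3.0, 3.0],
                       [9.0, 9.0, 9.0, 9.0]])
filled = fill_zero_bands([2.0, 0.0, 0.0, 0.0], prev_frame,
                         selected=[0, 1], band_offsets=[0, 2, 4])
```

Only the second band is touched here, because the first band still contains a non-zero line; the filling runs before any joint processing of the selected pair, as the text above requires.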
Moreover, an apparatus for encoding a multichannel signal having at least three channels
is provided.
The apparatus comprises an iteration processor being adapted to calculate, in a first
iteration step, inter-channel correlation values between each pair of the at least three
channels, for selecting, in the first iteration step, a pair having a highest value or having a
value above a threshold, and for processing the selected pair using a multichannel
processing operation to derive initial multichannel parameters for the selected pair and to
derive first processed channels.
The iteration processor is adapted to perform the calculating, the selecting and the
processing in a second iteration step using at least one of the processed channels to
derive further multichannel parameters and second processed channels.
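The two iteration steps can be sketched as follows. The sketch makes several assumptions that are not taken from the source: the inter-channel correlation value is a normalized absolute cross-correlation, the pair with the highest value is always selected (the threshold variant is not shown), the multichannel processing operation is a simple mid/side rotation, and the multichannel parameters are reduced to the indices of the selected pair.

```python
import itertools
import numpy as np

def normalized_correlation(a, b):
    """Normalized absolute cross-correlation; 0.0 if either channel is silent."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else abs(float(np.dot(a, b))) / denom

def select_pair(channels):
    """Return (i, j, value) for the channel pair with the highest correlation value."""
    best = None
    for i, j in itertools.combinations(range(len(channels)), 2):
        value = normalized_correlation(channels[i], channels[j])
        if best is None or value > best[2]:
            best = (i, j, value)
    return best

def iterate(channels, steps=2):
    """Run `steps` iteration steps; processed channels replace the selected pair."""
    channels = [np.asarray(c, dtype=float) for c in channels]
    params = []
    for _ in range(steps):
        i, j, _value = select_pair(channels)
        mid = 0.5 * (channels[i] + channels[j])   # assumed processing operation
        side = 0.5 * (channels[i] - channels[j])
        channels[i], channels[j] = mid, side
        params.append((i, j))                     # stands in for multichannel parameters
    return channels, params

processed, params = iterate([[1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0]])
```

In the second step the processed channel at index 0 takes part in the pairing again, which mirrors the statement that the second iteration step uses at least one of the processed channels.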
Moreover, the apparatus comprises a channel encoder being adapted to encode channels
resulting from an iteration processing performed by the iteration processor to obtain
encoded channels.
Furthermore, the apparatus comprises an output interface being adapted to generate an
encoded multichannel signal having the encoded channels, the initial multichannel
parameters and the further multichannel parameters and having information indicating
whether or not an apparatus for decoding shall fill spectral lines of one or more frequency
bands, within which all spectral lines are quantized to zero, with noise generated based on
previously decoded audio output channels that have been previously decoded by the
apparatus for decoding.
Moreover, a method for decoding a previous encoded multichannel signal of a previous
frame to obtain three or more previous audio output channels, and for decoding a current
encoded multichannel signal of a current frame to obtain three or more current audio
output channels is provided. The method comprises:
- Receiving the current encoded multichannel signal, and receiving side information
comprising first multichannel parameters.
- Decoding the current encoded multichannel signal of the current frame to obtain a
set of three or more decoded channels of the current frame.
- Selecting a first selected pair of two decoded channels from the set of three or
more decoded channels depending on the first multichannel parameters.
- Generating a first group of two or more processed channels based on said first
selected pair of two decoded channels to obtain an updated set of three or more
decoded channels.
Before the first group of two or more processed channels is generated based on said first
selected pair of two decoded channels, the following steps are conducted:
- Identifying for at least one of the two channels of said first selected pair of two
decoded channels, one or more frequency bands, within which all spectral lines
are quantized to zero, and generating a mixing channel using two or more, but not
all of the three or more previous audio output channels, and filling the spectral
lines of the one or more frequency bands, within which all spectral lines are
quantized to zero, with noise generated using spectral lines of the mixing channel,
wherein selecting the two or more previous audio output channels that are used for
generating the mixing channel from the three or more previous audio output
channels is conducted depending on the side information.
Furthermore, a method for encoding a multichannel signal having at least three channels
is provided. The method comprises:
- Calculating, in a first iteration step, inter-channel correlation values between each
pair of the at least three channels, for selecting, in the first iteration step, a pair
having a highest value or having a value above a threshold, and processing the
selected pair using a multichannel processing operation to derive initial
multichannel parameters for the selected pair and to derive first processed
channels.
- Performing the calculating, the selecting and the processing in a second iteration
step using at least one of the processed channels to derive further multichannel
parameters and second processed channels.
- Encoding channels resulting from an iteration processing performed by the
iteration processor to obtain encoded channels. And:
- Generating an encoded multichannel signal having the encoded channels, the
initial multichannel parameters and the further multichannel parameters and having
information indicating whether or not an apparatus for decoding shall fill
spectral lines of one or more frequency bands, within which all spectral lines are
quantized to zero, with noise generated based on previously decoded audio output
channels that have been previously decoded by the apparatus for decoding.
Moreover, computer programs are provided, wherein each of the computer programs is
configured to implement one of the above-described methods when being executed on a
computer or signal processor, so that each of the above-described methods is
implemented by one of the computer programs.
Furthermore, an encoded multichannel signal is provided. The encoded multichannel
signal comprises encoded channels and multichannel parameters and information
indicating whether or not an apparatus for decoding shall fill spectral lines of one or more
frequency bands, within which all spectral lines are quantized to zero, with spectral data
generated based on previously decoded audio output channels that have been previously
decoded by the apparatus for decoding.
BRIEF DESCRIPTION OF DRAWINGS
In the following, embodiments of the present invention are described in more detail with
reference to the figures, in which:
Fig. 1a shows an apparatus for decoding according to an embodiment;
Fig. 1b shows an apparatus for decoding according to another embodiment;
Fig. 2 shows a block diagram of a parametric frequency-domain decoder according to
an embodiment of the present application;
Fig. 3 shows a schematic diagram illustrating the sequence of spectra forming the
spectrograms of channels of a multichannel audio signal in order to ease the
understanding of the description of the decoder of Fig. 2;
Fig. 4 shows a schematic diagram illustrating current spectra out of the spectrograms
shown in Fig. 3 in order to ease the understanding of the description of
Fig. 2;
Fig. 5a and 5b show a block diagram of a parametric frequency-domain audio decoder
in accordance with an alternative embodiment according to which the downmix of
the previous frame is used as a basis for inter-channel noise filling;
Fig. 6 shows a block diagram of a parametric frequency-domain audio encoder in
accordance with an embodiment;
Fig. 7 shows a schematic block diagram of an apparatus for encoding a multichannel
signal having at least three channels, according to an embodiment;
Fig. 8 shows a schematic block diagram of an apparatus for encoding a multichannel
signal having at least three channels, according to an embodiment;
Fig. 9 shows a schematic block diagram of a stereo box, according to an embodiment;
Fig. 10 shows a schematic block diagram of an apparatus for decoding an encoded
multichannel signal having encoded channels and at least two multichannel
parameters, according to an embodiment;
Fig. 11 shows a flowchart of a method for encoding a multichannel signal having at least
three channels, according to an embodiment;
Fig. 12 shows a flowchart of a method for decoding an encoded multichannel signal
having encoded channels and at least two multichannel parameters, according to
an embodiment;
Fig. 13 shows a system according to an embodiment;
Fig. 14 shows in scenario (a) a generation of combination channels for a first frame,
and in scenario (b) a generation of combination channels for a second
frame succeeding the first frame according to an embodiment; and
Fig. 15 shows an indexing scheme for the multichannel parameters according to
embodiments.
DETAILED DESCRIPTION OF THE INVENTION
Equal or equivalent elements or elements with equal or equivalent functionality are
denoted in the following description by equal or equivalent reference numerals.
In the following description, a plurality of details are set forth to provide a more thorough
explanation of embodiments of the present invention. However, it will be apparent to those
skilled in the art that embodiments of the present invention may be practiced without these
specific details. In other instances, well-known structures and devices are shown in block
diagram form rather than in detail in order to avoid obscuring embodiments of the present
invention. In addition, features of the different embodiments described hereinafter may be
combined with each other, unless specifically noted otherwise.
Before describing the apparatus 201 for decoding of Fig. 1a, noise filling for
multichannel audio coding is described first. In embodiments, the noise filling module 220 of
Fig. 1a may, e.g., be configured to conduct one or more of the technologies below that are
described regarding noise filling for multichannel audio coding.
Fig. 2 shows a frequency-domain audio decoder in accordance with an embodiment of the
present application. The decoder is generally indicated using reference sign 10 and
comprises a scale factor band identifier 12, a dequantizer 14, a noise filler 16 and an
inverse transformer 18 as well as a spectral line extractor 20 and a scale factor extractor
22. Optional further elements which might be comprised by decoder 10 encompass a
complex stereo predictor 24, an MS (mid-side) decoder 26 and an inverse TNS (Temporal
Noise Shaping) filter tool of which two instantiations 28a and 28b are shown in Fig. 2. In
addition, a downmix provider is shown and outlined in more detail below using reference
sign 30.
The frequency-domain audio decoder 10 of Fig. 2 is a parametric decoder supporting
noise filling according to which a certain zero-quantized scale factor band is filled with
noise using the scale factor of that scale factor band as a means to control the level of the
noise filled into that scale factor band. Beyond this, the decoder 10 of Fig. 2 represents a
multichannel audio decoder configured to reconstruct a multichannel audio signal from an
inbound data stream 30. Fig. 2, however, concentrates on the elements of decoder 10 involved
in reconstructing one of the channels of the multichannel audio signal coded into data stream 30 and
outputting this (output) channel at an output 32. A reference sign 34 indicates that decoder
10 may comprise further elements or may comprise some pipeline operation control
responsible for reconstructing the other channels of the multichannel audio signal, wherein
the description brought forward below indicates how the reconstruction by decoder 10 of
the channel of interest at output 32 interacts with the decoding of the other channels.
The multichannel audio signal represented by data stream 30 may comprise two or more
channels. In the following, the description of the embodiments of the present application
concentrates on the stereo case where the multichannel audio signal merely comprises two
channels, but in principle the embodiments brought forward in the following may be readily
transferred onto alternative embodiments concerning multichannel audio signals and their
coding comprising more than two channels.
As will further become clear from the description of Fig. 2 below, the decoder 10 of Fig. 2
is a transform decoder. That is, according to the coding technique underlying decoder 10,
the channels are coded in a transform domain such as using a lapped transform of the
channels. Moreover, depending on the creator of the audio signal, there are time phases
10 during which the channels of the audio signal largely represent the same audio content,
deviating from each other merely by minor or deterministic changes therebetween, such
as different amplitudes and/or phase in order to represent an audio scene where the
differences between the channels enable the virtual positioning of an audio source of the
audio scene with respect to virtual speaker positions associated with the output channels
of the multichannel audio signal. At some other temporal phases, however, the different
channels of the audio signal may be more or less uncorrelated to each other and may
even represent, for example, completely different audio sources.
In order to account for the possibly time-varying relationship between the channels of the
audio signal, the audio codec underlying decoder 10 of Fig. 2 allows for a time-varying
use of different measures to exploit inter-channel redundancies. For example, MS coding
allows for switching between representing the left and right channels of a stereo audio
signal as they are or as a pair of M (mid) and S (side) channels representing the left and
right channels’ downmix and the halved difference thereof, respectively. That is, there are
continuously – in a spectrotemporal sense – spectrograms of two channels transmitted by
data stream 30, but the meaning of these (transmitted) channels may change in time and
relative to the output channels, respectively.
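The MS representation mentioned above can be sketched as a reversible transform. The sketch assumes that the downmix M is the halved sum of the left and right channels, so that it mirrors the halved difference used for S; the scaling of the downmix is an assumption, and the function names are illustrative.

```python
import numpy as np

def ms_encode(left, right):
    """Map left/right spectral lines to mid (assumed halved sum) and side (halved difference)."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    return 0.5 * (left + right), 0.5 * (left - right)

def ms_decode(mid, side):
    """Invert ms_encode: recover left and right from mid and side."""
    mid = np.asarray(mid, dtype=float)
    side = np.asarray(side, dtype=float)
    return mid + side, mid - side

left = np.array([1.0, 0.5, -2.0])
right = np.array([0.8, 0.5, -1.0])
mid, side = ms_encode(left, right)
left_rec, right_rec = ms_decode(mid, side)
```

When both channels carry nearly the same content, `side` is close to zero, which is what makes the M/S pair cheaper to code than the L/R pair in those time phases.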
Complex stereo prediction – another inter-channel redundancy exploitation tool – enables,
in the spectral domain, predicting one channel’s frequency-domain coefficients or spectral
lines using spectrally co-located lines of another channel. More details concerning this are
described below.
In order to facilitate the understanding of the subsequent description of Fig. 2 and its
components shown therein, Fig. 3 shows, for the exemplary case of a stereo audio signal
represented by data stream 30, one possible way in which sample values for the spectral lines of
the two channels might be coded into data stream 30 so as to be processed by decoder
10 of Fig. 2. In particular, while at the upper half of Fig. 3 the spectrogram 40 of a first
channel of the stereo audio signal is depicted, the lower half of Fig. 3 illustrates the
spectrogram 42 of the other channel of the stereo audio signal. Again, it is worthwhile to
note that the “meaning” of spectrograms 40 and 42 may change over time due to, for
example, a time-varying switching between an MS coded domain and a non-MS-coded
domain. In the first instance, spectrograms 40 and 42 relate to an M and S channel,
respectively, whereas in the latter case spectrograms 40 and 42 relate to left and right
channels. The switching between the MS-coded domain and the non-MS-coded domain
may be signaled in the data stream 30.
Fig. 3 shows that the spectrograms 40 and 42 may be coded into data stream 30 at a
time-varying spectrotemporal resolution. For example, both (transmitted) channels may
be, in a time-aligned manner, subdivided into a sequence of frames indicated using curly
brackets 44 which may be equally long and abut each other without overlap. As just
mentioned, the spectral resolution at which spectrograms 40 and 42 are represented in
data stream 30 may change over time. Preliminarily, it is assumed that the
spectrotemporal resolution changes in time equally for spectrograms 40 and 42, but an
extension of this simplification is also feasible as will become apparent from the following
description. The change of the spectrotemporal resolution is, for example, signaled in data
stream 30 in units of the frames 44. That is, the spectrotemporal resolution changes in
units of frames 44. The change in the spectrotemporal resolution of the spectrograms 40
and 42 is achieved by switching the transform length and the number of transforms used
to describe the spectrograms 40 and 42 within each frame 44. In the example of Fig. 3,
frames 44a and 44b exemplify frames where one long transform has been used in order
to sample the audio signal’s channels therein, thereby resulting in highest spectral
resolution with one spectral line sample value per spectral line for each of such frames per
channel. In Fig. 3, the sample values of the spectral lines are indicated using small
crosses within the boxes, wherein the boxes, in turn, are arranged in rows and columns
and shall represent a spectrotemporal grid with each row corresponding to one spectral
line and each column corresponding to sub-intervals of frames 44 corresponding to the
shortest transforms involved in forming spectrograms 40 and 42. In particular, Fig. 3
illustrates, for example, for frame 44d, that a frame may alternatively be subject to
consecutive transforms of shorter length, thereby resulting, for such frames such as frame
44d, in several temporally succeeding spectra of reduced spectral resolution. Eight short
transforms are exemplarily used for frame 44d, resulting in a spectrotemporal sampling of
the spectrograms 40 and 42 within that frame 44d, at spectral lines spaced apart from
each other so that merely every eighth spectral line is populated, but with a sample value
for each of the eight transform windows or transforms of shorter length used to transform
frame 44d. For illustration purposes, it is shown in Fig. 3 that other numbers of transforms
for a frame would be feasible as well, such as the usage of two transforms of a transform
length which is, for example, half the transform length of the long transforms for frames
44a and 44b, thereby resulting in a sampling of the spectrotemporal grid or spectrograms
40 and 42 where two spectral line sample values are obtained for every second spectral
line, one of which relates to the leading transform, the other to the trailing transform.
The transform windows for the transforms into which the frames are subdivided are illustrated in Fig. 3 below each spectrogram using overlapping window-like lines. The temporal
overlap serves, for example, for TDAC (Time-Domain Aliasing Cancellation) purposes.
Although the embodiments described further below could also be implemented in another
fashion, Fig. 3 illustrates the case where the switching between different spectrotemporal
resolutions for the individual frames 44 is performed in a manner such that for each frame
44, the same number of spectral line values indicated by the small crosses in Fig. 3 result
for spectrogram 40 and spectrogram 42, the difference merely residing in the way the
lines spectrotemporally sample the respective spectrotemporal tile corresponding to the
respective frame 44, spanned temporally over the time of the respective frame 44 and
spanned spectrally from zero frequency to the maximum frequency fmax.
Using arrows, Fig. 3 illustrates with respect to frame 44d that similar spectra may
be obtained for all of the frames 44 by suitably distributing the spectral line sample values
belonging to the same spectral line but short transform windows within one frame of one
channel, onto the un-occupied (empty) spectral lines within that frame up to the next
occupied spectral line of that same frame. Such resulting spectra are called “interleaved
spectra” in the following. In interleaving n transforms of one frame of one channel, for
example, spectrally co-located spectral line values of the n short transforms follow each
other before the set of n spectrally co-located spectral line values of the n short transforms
of the spectrally succeeding spectral line follows. An intermediate form of interleaving
would be feasible as well: instead of interleaving all spectral line coefficients of one frame,
it would be feasible to interleave merely the spectral line coefficients of a proper subset of
the short transforms of a frame 44d. In any case, whenever spectra of frames of the two
channels corresponding to spectrograms 40 and 42 are discussed, these spectra may
refer to interleaved ones or non-interleaved ones.
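The full interleaving of n short transforms described above can be sketched as a simple reordering. The array layout (one row per short transform, one column per spectral line) and the function names are assumptions made for illustration; the partial interleaving of only a subset of the short transforms is not shown.

```python
import numpy as np

def interleave(short_spectra):
    """Interleave n short-transform spectra of one frame of one channel.

    short_spectra has shape (n, k): n short transforms with k spectral lines each.
    The result has n * k values: the n co-located values of the first spectral line,
    then the n co-located values of the spectrally succeeding line, and so on.
    """
    return np.asarray(short_spectra).T.reshape(-1)

def deinterleave(interleaved, n):
    """Undo interleave() for n short transforms; returns an array of shape (n, k)."""
    return np.asarray(interleaved).reshape(-1, n).T

# Two hypothetical short transforms with three spectral lines each.
short = np.array([[0, 1, 2],
                  [10, 11, 12]])
flat = interleave(short)
restored = deinterleave(flat, n=2)
```

After interleaving, a frame coded with short transforms has the same number of values along the frequency axis as a frame coded with one long transform, which is why the later description can treat both kinds of spectra alike.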
In order to efficiently code the spectral line coefficients representing the spectrograms 40
and 42 via data stream 30 passed to decoder 10, these coefficients are quantized. In order to control
the quantization noise spectrotemporally, the quantization step size is controlled via scale
factors which are set in a certain spectrotemporal grid. In particular, within each of the
sequence of spectra of each spectrogram, the spectral lines are grouped into spectrally
consecutive non-overlapping scale factor groups. Fig. 4 shows a spectrum 46 of the
spectrogram 40 at the upper half thereof, and a co-temporal spectrum 48 out of
spectrogram 42. As shown therein, the spectra 46 and 48 are subdivided into scale factor
bands along the spectral axis f so as to group the spectral lines into non-overlapping
groups. The scale factor bands are illustrated in Fig. 4 using curly brackets 50. For the
sake of simplicity, it is assumed that the boundaries between the scale factor bands
coincide between spectrum 46 and 48, but this does not need to necessarily be the case.
That is, by way of the coding in data stream 30, the spectrograms 40 and 42 are each
subdivided into a temporal sequence of spectra and each of these spectra is spectrally
subdivided into scale factor bands, and for each scale factor band the data stream 30
codes or conveys information about a scale factor corresponding to the respective scale
factor band. The spectral line coefficients falling into a respective scale factor band 50 are
quantized using the respective scale factor or, as far as decoder 10 is concerned, may be
dequantized using the scale factor of the corresponding scale factor band.
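The per-band dequantization described above can be sketched as follows. For simplicity the sketch assumes that each scale factor acts directly as a linear quantization step size for its band; real codecs typically map a transmitted scale factor index to a gain in a non-linear way, which is not modeled here. The band layout in `band_offsets` is hypothetical.

```python
import numpy as np

def dequantize(quantized, band_offsets, scale_factors):
    """Dequantize one spectrum band by band.

    quantized     : integer spectral line values of one spectrum
    band_offsets  : start index of each scale factor band plus the end index
    scale_factors : one step size per band (assumed linear, see lead-in)
    """
    q = np.asarray(quantized, dtype=float)
    out = np.zeros_like(q)
    for band, (lo, hi) in enumerate(zip(band_offsets[:-1], band_offsets[1:])):
        out[lo:hi] = q[lo:hi] * scale_factors[band]
    return out

# Two hypothetical scale factor bands with two spectral lines each.
lines = dequantize([2, -1, 0, 3], band_offsets=[0, 2, 4], scale_factors=[0.5, 2.0])
```

A band whose quantized lines are all zero stays zero after this step regardless of its scale factor, which leaves that scale factor free to control the level of filled noise.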
Before changing back again to Fig. 2 and the description thereof, it shall be assumed in
the following that the specifically treated channel, i.e. the one the decoding of which the
specific elements of the decoder of Fig. 2 except 34 are involved with, is the transmitted
channel of spectrogram 40 which, as already stated above, may represent one of left and
right channels, an M channel or an S channel with the assumption that the multichannel
audio signal coded into data stream 30 is a stereo audio signal.
While the spectral line extractor 20 is configured to extract the spectral line data, i.e. the
spectral line coefficients for frames 44 from data stream 30, the scale factor extractor 22 is
configured to extract for each frame 44 the corresponding scale factors. To this end,
extractors 20 and 22 may use entropy decoding. In accordance with an embodiment, the
scale factor extractor 22 is configured to sequentially extract the scale factors of, for
example, spectrum 46 in Fig. 4, i.e. the scale factors of scale factor bands 50, from the
data stream 30 using context-adaptive entropy decoding. The order of the sequential
decoding may follow the spectral order defined among the scale factor bands leading, for
example, from low frequency to high frequency. The scale factor extractor 22 may use
context-adaptive entropy decoding and may determine the context for each scale factor
depending on already extracted scale factors in a spectral neighborhood of a currently
extracted scale factor, such as depending on the scale factor of the immediately
preceding scale factor band. Alternatively, the scale factor extractor 22 may predictively
decode the scale factors from the data stream 30 such as, for example, using differential
decoding while predicting a currently decoded scale factor based on any of the previously
decoded scale factors such as the immediately preceding one. Notably, this process of
scale factor extraction is agnostic with respect to a scale factor belonging to a scale factor
band populated by zero-quantized spectral lines exclusively, or populated by spectral lines
among which at least one is quantized to a non-zero value. A scale factor belonging to a
15 scale factor band populated by zero-quantized spectral lines only may both serve as a
prediction basis for a subsequent decoded scale factor which possibly belongs to a scale
factor band populated by spectral lines among which one is non-zero, and be predicted
based on a previously decoded scale factor which possibly belongs to a scale factor band
populated by spectral lines among which one is non-zero.
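By way of illustration only, the differential variant of the scale factor extraction may be sketched as follows. The sketch assumes that the differences have already been entropy decoded; the function name, the start value first_sf and the array layout are hypothetical and are not taken from any standard.

```c
#include <stddef.h>

/* Hypothetical helper: reconstructs scale factors from already
 * entropy-decoded differences, predicting each value from the
 * immediately preceding scale factor band. */
void decode_scale_factors(int first_sf, const int *delta, size_t num_bands,
                          int *sf)
{
    int prev = first_sf; /* assumed start value of the prediction chain */
    for (size_t b = 0; b < num_bands; ++b) {
        /* No distinction is made between bands holding only
         * zero-quantized lines and bands holding non-zero lines. */
        sf[b] = prev + delta[b];
        prev = sf[b];
    }
}
```

Since the loop never inspects the spectral lines, a zero-quantized band both serves as a prediction basis and is itself predicted, in line with the above description.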
For the sake of completeness only, it is noted that the spectral line extractor 20 extracts
the spectral line coefficients with which the scale factor bands 50 are populated likewise
using, for example, entropy decoding and/or predictive decoding. The entropy decoding may use
context-adaptivity based on spectral line coefficients in a spectrotemporal neighborhood of
a currently decoded spectral line coefficient, and likewise, the prediction may be a spectral
prediction, a temporal prediction or a spectrotemporal prediction predicting a currently
decoded spectral line coefficient based on previously decoded spectral line coefficients in
a spectrotemporal neighborhood thereof. For the sake of an increased coding efficiency,
spectral line extractor 20 may be configured to perform the decoding of the spectral lines
or line coefficients in tuples, which collect or group spectral lines along the frequency axis.
Thus, at the output of spectral line extractor 20 the spectral line coefficients are provided
such as, for example, in units of spectra such as spectrum 46 collecting, for example, all
of the spectral line coefficients of a corresponding frame, or alternatively collecting all of
the spectral line coefficients of certain short transforms of a corresponding frame. At the
output of scale factor extractor 22, in turn, corresponding scale factors of the respective
spectra are output.
Scale factor band identifier 12 as well as dequantizer 14 have spectral line inputs coupled
to the output of spectral line extractor 20, and dequantizer 14 and noise filler 16 have
scale factor inputs coupled to the output of scale factor extractor 22. The scale factor band
identifier 12 is configured to identify so-called zero-quantized scale factor bands within a
current spectrum 46, i.e. scale factor bands within which all spectral lines are quantized to
zero, such as scale factor band 50d in Fig. 4, and the remaining scale factor bands of the
spectrum within which at least one spectral line is quantized to non-zero. In particular,
the spectral line coefficients are indicated using hatched areas in Fig. 4. It is visible
therefrom that in spectrum 46, all scale factor bands but scale factor band 50d have at
least one spectral line, the spectral line coefficient of which is quantized to a non-zero
value. Later on it will become clear that the zero-quantized scale factor bands such as 50d
form the subject of the inter-channel noise filling described further below. Before
proceeding with the description, it is noted that scale factor band identifier 12 may restrict
its identification to merely a proper subset of the scale factor bands 50, such as to
scale factor bands above a certain start frequency 52. In Fig. 4, this would restrict the
identification procedure to scale factor bands 50d, 50e and 50f.
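Merely as a sketch, the identification performed by scale factor band identifier 12 might look as follows. The band boundary table band_offset, the index start_band standing in for start frequency 52, and the function name are assumptions made for illustration.

```c
#include <stddef.h>

/* band_offset has num_bands + 1 entries; band b covers the lines
 * band_offset[b] .. band_offset[b + 1] - 1. is_zero[b] is set to 1 for
 * bands in which every quantized line is zero, and to 0 otherwise.
 * Bands below start_band (assumed to correspond to the start
 * frequency) are never flagged. */
void identify_zero_bands(const int *q, const size_t *band_offset,
                         size_t num_bands, size_t start_band, int *is_zero)
{
    for (size_t b = 0; b < num_bands; ++b) {
        is_zero[b] = 0;
        if (b < start_band)
            continue;
        is_zero[b] = 1;
        for (size_t k = band_offset[b]; k < band_offset[b + 1]; ++k) {
            if (q[k] != 0) {
                is_zero[b] = 0;
                break;
            }
        }
    }
}
```

The resulting flags are what the identifier would pass on to the noise filler.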
The scale factor band identifier 12 informs the noise filler 16 on those scale factor bands
which are zero-quantized scale factor bands. The dequantizer 14 uses the scale factors
associated with an inbound spectrum 46 so as to dequantize, or scale, the spectral line
coefficients of the spectral lines of spectrum 46 according to the associated scale factors,
i.e. the scale factors associated with the scale factor bands 50. In particular, dequantizer
14 dequantizes and scales spectral line coefficients falling into a respective scale factor
band with the scale factor associated with the respective scale factor band. Fig. 4 shall be
interpreted as showing the result of the dequantization of the spectral lines.
The noise filler 16 obtains the information on the zero-quantized scale factor bands which
form the subject of the following noise filling, the dequantized spectrum as well as the
scale factors of at least those scale factor bands identified as zero-quantized scale factor
bands and a signalization obtained from data stream 30 for the current frame revealing
whether inter-channel noise filling is to be performed for the current frame.
The inter-channel noise filling process described in the following example actually involves
two types of noise filling, namely the insertion of a noise floor 54 pertaining to all spectral
lines having been quantized to zero irrespective of their potential membership to any
zero-quantized scale factor band, and the actual inter-channel noise filling procedure. Although
this combination is described hereinafter, it is to be emphasized that the noise floor
insertion may be omitted in accordance with an alternative embodiment. Moreover, the
signalization concerning the noise filling switch-on and switch-off relating to the current
frame and obtained from data stream 30 could relate to the inter-channel noise filling only,
or could control the combination of both kinds of noise filling together.
As far as the noise floor insertion is concerned, noise filler 16 could operate as follows. In
particular, noise filler 16 could employ artificial noise generation such as a pseudorandom
number generator or some other source of randomness in order to fill spectral lines, the
spectral line coefficients of which were zero. The level of the noise floor 54 thus inserted
at the zero-quantized spectral lines could be set according to an explicit signaling within
data stream 30 for the current frame or the current spectrum 46. The “level” of noise floor
54 could be determined using a root-mean-square (RMS) or energy measure for example.
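As a non-limiting sketch, the noise floor insertion could be realized as shown below. The linear congruential generator is merely an assumed stand-in for the pseudorandom source and is not the generator of any particular standard; the function name and the fixed-magnitude, random-sign filling are likewise illustrative choices.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills every zero-quantized line with +level or -level, the sign being
 * drawn from a simple LCG (an assumption for illustration). With this
 * choice the RMS of the inserted lines equals level. */
void insert_noise_floor(float *spec, const int *q, size_t num_lines,
                        float level, uint32_t *seed)
{
    for (size_t k = 0; k < num_lines; ++k) {
        if (q[k] != 0)
            continue; /* lines quantized to non-zero are left untouched */
        *seed = *seed * 1664525u + 1013904223u;
        spec[k] = (*seed & 0x80000000u) ? -level : level;
    }
}
```

Note that the filling applies to all zero-quantized lines, irrespective of whether their band is a zero-quantized scale factor band.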
The noise floor insertion thus represents a kind of pre-filling for those scale factor bands
having been identified as zero-quantized ones such as scale factor band 50d in Fig. 4. It
also affects other scale factor bands beyond the zero-quantized ones, but the latter are
further subject to the following inter-channel noise filling. As described below, the
inter-channel noise filling process fills up zero-quantized scale factor bands to a level
which is controlled via the scale factor of the respective zero-quantized scale factor band.
The latter may be directly used to this end due to all spectral lines of the respective
zero-quantized scale factor band being quantized to zero. Nevertheless, data stream 30 may
contain an additional signalization of a parameter, for each frame or each spectrum 46,
which commonly applies to the scale factors of all zero-quantized scale factor bands of
the corresponding frame or spectrum 46 and results, when applied onto the scale factors
of the zero-quantized scale factor bands by the noise filler 16, in a respective fill-up level
which is individual for the zero-quantized scale factor bands. That is, noise filler 16 may
modify, using the same modification function, for each zero-quantized scale factor band of
spectrum 46, the scale factor of the respective scale factor band using the just mentioned
parameter contained in data stream 30 for that spectrum 46 of the current frame so as to
obtain a fill-up target level for the respective zero-quantized scale factor band measuring,
in terms of energy or RMS, for example, the level up to which the inter-channel noise
filling process shall fill up the respective zero-quantized scale factor band with (optionally)
additional noise (in addition to the noise floor 54).
In particular, in order to perform the inter-channel noise filling 56, noise filler 16 obtains a
spectrally co-located portion of the other channel’s spectrum 48, in a state already largely
or fully decoded, and copies the obtained portion of spectrum 48 into the zero-quantized
scale factor band to which this portion was spectrally co-located, scaled in such a manner
that the resulting overall noise level within that zero-quantized scale factor band – derived
by an integration over the spectral lines of the respective scale factor band – equals the
aforementioned fill-up target level obtained from the zero-quantized scale factor band’s
scale factor. By this measure, the tonality of the noise filled into the respective
zero-quantized scale factor band is improved in comparison to artificially generated noise such
as the one forming the basis of the noise floor 54, and is also better than an uncontrolled
spectral copying/replication from very-low-frequency lines within the same spectrum 46.
To be even more precise, the noise filler 16 locates, for a current band such as 50d, a
spectrally co-located portion within spectrum 48 of the other channel, scales the spectral
lines thereof depending on the scale factor of the zero-quantized scale factor band 50d in
a manner just described involving, optionally, some additional offset or noise factor
parameter contained in data stream 30 for the current frame or spectrum 46, so that the
result thereof fills up the respective zero-quantized scale factor band 50d up to the desired
level as defined by the scale factor of the zero-quantized scale factor band 50d. In the
present embodiment, this means that the filling-up is done in an additive manner relative
to the noise floor 54.
In accordance with a simplified embodiment, the resulting noise-filled spectrum 46 would
directly be input into the input of inverse transformer 18 so as to obtain, for each transform
window to which the spectral line coefficients of spectrum 46 belong, a time-domain
portion of the respective channel audio time-signal, whereupon (not shown in Fig. 2) an
overlap-add process may combine these time-domain portions. That is, if spectrum 46 is a
non-interleaved spectrum, the spectral line coefficients of which merely belong to one
transform, then inverse transformer 18 inversely transforms that spectrum so as to result in one
time-domain portion, the preceding and trailing ends of which would be subject to an
overlap-add process with preceding and trailing time-domain portions obtained by inverse
transforming preceding and succeeding transforms so as to realize, for example,
time-domain aliasing cancelation. If, however, the spectrum 46 has interleaved there-into
spectral line coefficients of more than one consecutive transform, then inverse transformer
18 would subject same to separate inverse transformations so as to obtain one
time-domain portion per inverse transformation, and in accordance with the temporal order
defined thereamong, these time-domain portions would be subject to an overlap-add
process therebetween, as well as with respect to preceding and succeeding time-domain
portions of other spectra or frames.
However, for the sake of completeness it must be noted that further processing may be
performed onto the noise-filled spectrum. As shown in Fig. 2, the inverse TNS filter may
perform an inverse TNS filtering onto the noise-filled spectrum. That is, controlled via TNS
filter coefficients for the current frame or spectrum 46, the spectrum obtained so far is
subject to a linear filtering along spectral direction.
With or without inverse TNS filtering, complex stereo predictor 24 could then treat the
spectrum as a prediction residual of an inter-channel prediction. More specifically,
inter-channel predictor 24 could use a spectrally co-located portion of the other channel to
predict the spectrum 46 or at least a subset of the scale factor bands 50 thereof. The
complex prediction process is illustrated in Fig. 4 with dashed box 58 in relation to scale
factor band 50b. That is, data stream 30 may contain inter-channel prediction parameters
controlling, for example, which of the scale factor bands 50 shall be inter-channel
predicted and which shall not be predicted in such a manner. Further, the inter-channel
prediction parameters in data stream 30 may further comprise complex inter-channel
prediction factors applied by inter-channel predictor 24 so as to obtain the inter-channel
prediction result. These factors may be contained in data stream 30 individually for each
scale factor band, or alternatively each group of one or more scale factor bands, for which
inter-channel prediction is activated or signaled to be activated in data stream 30.
The source of inter-channel prediction may, as indicated in Fig. 4, be the spectrum 48 of
the other channel. To be more precise, the source of inter-channel prediction may be the
spectrally co-located portion of spectrum 48, co-located to the scale factor band 50b to be
inter-channel predicted, extended by an estimation of its imaginary part. The estimation of
the imaginary part may be performed based on the spectrally co-located portion 60 of
spectrum 48 itself, and/or may use a downmix of the already decoded channels of the
previous frame, i.e. the frame immediately preceding the currently decoded frame to
which spectrum 46 belongs. In effect, inter-channel predictor 24 adds to the scale factor
bands to be inter-channel predicted such as scale factor band 50b in Fig. 4, the prediction
signal obtained as just-described.
As already noted in the preceding description, the channel to which spectrum 46 belongs
may be an MS coded channel, or may be a loudspeaker related channel, such as a left or
right channel of a stereo audio signal. Accordingly, optionally an MS decoder 26 subjects
the optionally inter-channel predicted spectrum 46 to MS decoding, in that it performs,
per spectral line or spectrum 46, an addition or subtraction with spectrally corresponding
spectral lines of the other channel corresponding to spectrum 48. For example, although
not shown in Fig. 2, spectrum 48 as shown in Fig. 4 has been obtained by way of portion
34 of decoder 10 in a manner analogous to the description brought forward above with
respect to the channel to which spectrum 46 belongs, and the MS decoding module 26, in
performing MS decoding, subjects the spectra 46 and 48 to spectral line-wise addition or
spectral line-wise subtraction, with both spectra 46 and 48 being at the same stage within
the processing line, meaning, both have just been obtained by inter-channel prediction, for
example, or both have just been obtained by noise filling or inverse TNS filtering.
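For illustration only, the spectral line-wise addition and subtraction may be sketched as below. The convention left = M + S and right = M - S without further normalization, the per-band switch ms_used and the function name are assumptions of this sketch.

```c
#include <stddef.h>

/* In-place MS decoding. On entry, a[] holds the mid and b[] the side
 * lines; on return, a[] holds the left and b[] the right lines for
 * every band in which ms_used is set. Both spectra are assumed to share
 * identical band boundaries, given by band_offset (num_bands + 1
 * entries). */
void ms_decode(float *a, float *b, const size_t *band_offset,
               size_t num_bands, const unsigned char *ms_used)
{
    for (size_t band = 0; band < num_bands; ++band) {
        if (!ms_used[band])
            continue; /* band is coded as plain left/right */
        for (size_t k = band_offset[band]; k < band_offset[band + 1]; ++k) {
            float m = a[k];
            float s = b[k];
            a[k] = m + s; /* assumed convention: left  = M + S */
            b[k] = m - s; /* assumed convention: right = M - S */
        }
    }
}
```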
It is noted that, optionally, the MS decoding may be performed in a manner globally
concerning the whole spectrum 46, or being individually activatable by data stream 30 in
units of, for example, scale factor bands 50. In other words, MS decoding may be
switched on or off using respective signalization in data stream 30 in units of, for example,
frames or some finer spectrotemporal resolution such as, for example, individually for the
scale factor bands of the spectra 46 and/or 48 of the spectrograms 40 and/or 42, wherein
it is assumed that identical boundaries of both channels’ scale factor bands are defined.
As illustrated in Fig. 2, the inverse TNS filtering by inverse TNS filter 28 could also be
performed after any inter-channel processing such as inter-channel prediction 58 or the
MS decoding by MS decoder 26. The performance in front of, or downstream of, the
inter-channel processing could be fixed or could be controlled via a respective signalization for
each frame in data stream 30 or at some other level of granularity. Wherever inverse TNS
filtering is performed, respective TNS filter coefficients present in the data stream for the
current spectrum 46 control a TNS filter, i.e. a linear prediction filter running along spectral
direction so as to linearly filter the spectrum inbound into the respective inverse TNS filter
module 28a and/or 28b.
Thus, the spectrum 46 arriving at the input of inverse transformer 18 may have been
subject to further processing as just described. Again, the above description is not meant
to be understood in such a manner that all of these optional tools are to be present either
concurrently or not. These tools may be present in decoder 10 partially or collectively.
In any case, the resulting spectrum at the inverse transformer’s input represents the final
reconstruction of the channel’s output signal and forms the basis of the aforementioned
downmix for the current frame which serves, as described with respect to the complex
prediction 58, as the basis for the potential imaginary part estimation for the next frame to
be decoded. It may further serve as the final reconstruction for inter-channel predicting
another channel than the one which the elements except 34 in Fig. 2 relate to.
The respective downmix is formed by downmix provider 31 by combining this final
spectrum 46 with the respective final version of spectrum 48. The latter entity, i.e. the
respective final version of spectrum 48, formed the basis for the complex inter-channel
prediction in predictor 24.
Fig. 5 shows an alternative relative to Fig. 2 insofar as the basis for inter-channel noise
filling is represented by the downmix of spectrally co-located spectral lines of a previous
frame so that, in the optional case of using complex inter-channel prediction, the source of
this complex inter-channel prediction is used twice, as a source for the inter-channel noise
filling as well as a source for the imaginary part estimation in the complex inter-channel
prediction. Fig. 5 shows a decoder 10 including the portion 70 pertaining to the decoding
of the first channel to which spectrum 46 belongs, as well as the internal structure of the
aforementioned other portion 34, which is involved in the decoding of the other channel
comprising spectrum 48. The same reference sign has been used for the internal
elements of portion 70 on the one hand and 34 on the other hand. As can be seen, the
construction is the same. At output 32, one channel of the stereo audio signal is output,
and at the output of the inverse transformer 18 of second decoder portion 34, the other
(output) channel of the stereo audio signal results, with this output being indicated by
reference sign 74. Again, the embodiments described above may be easily transferred to
a case of using more than two channels.
The downmix provider 31 is co-used by both portions 70 and 34 and receives temporally
co-located spectra 48 and 46 of spectrograms 40 and 42 so as to form a downmix based
thereon by summing up these spectra on a spectral line by spectral line basis, potentially
with forming the average therefrom by dividing the sum at each spectral line by the
number of channels downmixed, i.e. two in the case of Fig. 5. At the downmix provider’s
31 output, the downmix of the previous frame results by this measure. It is noted in this
regard that in case of the previous frame containing more than one spectrum in either one
of spectrograms 40 and 42, different possibilities exist as to how downmix provider 31
operates in that case. For example, in that case downmix provider 31 may use the
spectrum of the trailing transforms of the current frame, or may use an interleaving result
of interleaving all spectral line coefficients of the current frame of spectrogram 40 and 42.
The delay element 74 shown in Fig. 5 as connected to the downmix provider’s 31 output,
shows that the downmix thus provided at downmix provider’s 31 output forms the
downmix of the previous frame 76 (see Fig. 4 with respect to the inter-channel noise filling 56
and complex prediction 58, respectively). Thus, the output of delay element 74 is
connected to the inputs of inter-channel predictors 24 of decoder portions 34 and 70 on the one
hand, and the inputs of noise fillers 16 of decoder portions 70 and 34, on the other hand.
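By way of example only, the operation of downmix provider 31 for two channels may be sketched as follows. The function name and the average switch are hypothetical; the one-frame delay would simply correspond to keeping the output buffer until the next frame is decoded.

```c
#include <stddef.h>

/* Forms the downmix of two temporally co-located spectra on a spectral
 * line by spectral line basis. If average is non-zero, the sum is
 * divided by the number of channels downmixed, i.e. two. */
void form_downmix(const float *ch0, const float *ch1, size_t num_lines,
                  int average, float *dmx)
{
    for (size_t k = 0; k < num_lines; ++k) {
        float sum = ch0[k] + ch1[k];
        dmx[k] = average ? 0.5f * sum : sum;
    }
}
```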
That is, while in Fig. 2, the noise filler 16 receives the other channel’s finally reconstructed
temporally co-located spectrum 48 of the same current frame as a basis of the
inter-channel noise filling, in Fig. 5 the inter-channel noise filling is performed instead based on
the downmix of the previous frame as provided by downmix provider 31. The way in which
the inter-channel noise filling is performed remains the same. That is, the inter-channel
noise filler 16 takes a spectrally co-located portion out of the respective spectrum of
the other channel of the current frame, in case of Fig. 2, or out of the largely or fully
decoded, final spectrum as obtained from the previous frame representing the downmix of
the previous frame, in case of Fig. 5, and adds same “source” portion to the spectral lines
within the scale factor band to be noise filled, such as 50d in Fig. 4, scaled according to a
target noise level determined by the respective scale factor band’s scale factor.
Concluding the above discussion of embodiments describing inter-channel noise filling in
an audio decoder, it should be evident to readers skilled in the art that, before adding the
grabbed-out spectrally or temporally co-located portion of the “source” spectrum to the
spectral lines of the “target” scale factor band, a certain pre-processing may be applied to
the “source” spectral lines without digressing from the general concept of the inter-channel
filling. In particular, it may be beneficial to apply a filtering operation such as, for example,
a spectral flattening, or tilt removal, to the spectral lines of the “source” region to be added
to the “target” scale factor band, like 50d in Fig. 4, in order to improve the audio quality of
the inter-channel noise filling process. Likewise, and as an example of a largely (instead
of fully) decoded spectrum, the aforementioned “source” portion may be obtained from a
spectrum which has not yet been filtered by an available inverse (i.e. synthesis) TNS filter.
Thus, the above embodiments concerned a concept of an inter-channel noise filling. In the
following, a possibility is described how the above concept of inter-channel noise filling
may be built into an existing codec, namely xHE-AAC, in a semi-backward compatible
manner. In particular, hereinafter a preferred implementation of the above embodiments is
described, according to which a stereo filling tool is built into an xHE-AAC based audio
codec in a semi-backward compatible signaling manner. By use of the implementation
described further below, for certain stereo signals, stereo filling of transform coefficients in
either one of the two channels in an audio codec based on an MPEG-D xHE-AAC (USAC)
is feasible, thereby improving the coding quality of certain audio signals especially at low
bitrates. The stereo filling tool is signaled semi-backward-compatibly such that legacy
xHE-AAC decoders can parse and decode the bitstreams without obvious audio errors or
drop-outs. As was already described above, a better overall quality can be attained if an
audio coder can use a combination of previously decoded/quantized coefficients of two
stereo channels to reconstruct zero-quantized (non-transmitted) coefficients of either one
of the currently decoded channels. It is therefore desirable to allow such stereo filling
(from previous to present channel coefficients) in addition to spectral band replication
(from low- to high-frequency channel coefficients) and noise filling (from an uncorrelated
pseudorandom source) in audio coders, especially xHE-AAC or coders based on it.
To allow coded bitstreams with stereo filling to be read and parsed by legacy xHE-AAC
decoders, the desired stereo filling tool shall be used in a semi-backward compatible way:
its presence should not cause legacy decoders to stop – or not even start – decoding.
Readability of the bitstream by xHE-AAC infrastructure can also facilitate market adoption.
To achieve the aforementioned wish for semi-backward compatibility for a stereo filling
tool in the context of xHE-AAC or its potential derivatives, the following implementation
involves the functionality of stereo filling as well as the ability to signal the same via syntax
in the data stream actually concerned with noise filling. The stereo filling tool would work
in line with the above description. In a channel pair with common window configuration, a
coefficient of a zero-quantized scale factor band is, when the stereo filling tool is activated,
as an alternative (or, as described, in addition) to noise filling, reconstructed by a sum or
difference of the previous frame’s coefficients in either one of the two channels, preferably
the right channel. Stereo filling is performed similar to noise filling. The signaling would be
done via the noise filling signaling of xHE-AAC. Stereo filling is conveyed by means of the
8-bit noise filling side information. This is feasible because the MPEG-D USAC standard
[3] states that all 8 bits are transmitted even if the noise level to be applied is zero. In that
situation, some of the noise-fill bits can be reused for the stereo filling tool.
Semi-backward-compatibility regarding bitstream parsing and playback by legacy
xHE-AAC decoders is ensured as follows. Stereo filling is signaled via a noise level of zero (i.e.
the first three noise-fill bits all having a value of zero) followed by five non-zero bits (which
traditionally represent a noise offset) containing side information for the stereo filling tool
as well as the missing noise level. Since a legacy xHE-AAC decoder disregards the value
of the 5-bit noise offset if the 3-bit noise level is zero, the presence of the stereo filling tool
signaling only has an effect on the noise filling in the legacy decoder: noise filling is turned
off since the first three bits are zero, and the remainder of the decoding operation runs as
intended. In particular, stereo filling is not performed due to the fact that it is operated like
the noise-fill process, which is deactivated. Hence, a legacy decoder still offers “graceful”
decoding of the enhanced bitstream 30 because it does not need to mute the output
signal or even abort the decoding upon reaching a frame with stereo filling switched on.
Naturally, it is however unable to provide a correct, intended reconstruction of stereo-filled
line coefficients, leading to a deteriorated quality in affected frames in comparison with
decoding by an appropriate decoder capable of appropriately dealing with the new stereo
filling tool. Nonetheless, assuming the stereo filling tool is used as intended, i.e. only on
stereo input at low bitrates, the quality through xHE-AAC decoders should be better than if
the affected frames would drop out due to muting or lead to other obvious playback errors.
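Merely to illustrate why a legacy decoder is unaffected, the 8-bit noise filling side information may be split as sketched below. The packing of the eight bits into one byte with the noise level in the three most significant bits is an assumption of this sketch, as are the type and function names.

```c
#include <stdint.h>

/* Hypothetical container for the two legacy noise-fill fields. */
typedef struct {
    unsigned noise_level;  /* first three bits */
    unsigned noise_offset; /* following five bits */
} NoiseFillInfo;

/* Splits the 8-bit field, assuming the first transmitted bit is the
 * most significant bit of the byte. */
NoiseFillInfo parse_noise_fill(uint8_t bits)
{
    NoiseFillInfo nf;
    nf.noise_level = (bits >> 5) & 0x7u;
    nf.noise_offset = bits & 0x1Fu;
    return nf;
}

/* Legacy view: noise filling is active only for a non-zero level, so
 * the offset bits are disregarded whenever the level is zero. */
int legacy_noise_filling_active(NoiseFillInfo nf)
{
    return nf.noise_level != 0;
}
```

For instance, under the above packing assumption the byte 0x1B yields a noise level of 0 and a noise offset of 27, which a legacy decoder would treat as noise filling being switched off.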
In the following, a detailed description is presented how a stereo filling tool may be built
into, as an extension, the xHE-AAC codec.
When built into the standard, the stereo filling tool could be described as follows. In
particular, such a stereo filling (SF) tool would represent a new tool in the
frequency-domain (FD) part of MPEG-H 3D-audio. In line with the above discussion, the aim of such
a stereo filling tool would be the parametric reconstruction of MDCT spectral coefficients
at low bitrates, similar to what already can be achieved with noise filling according to
section 7.2 of the standard described in [3]. However, unlike noise filling, which employs a
pseudorandom noise source for generating MDCT spectral values of any FD channel, SF
would be available also to reconstruct the MDCT values of the right channel of a jointly
coded stereo pair of channels using a downmix of the left and right MDCT spectra of the
previous frame. SF, in accordance with the implementation set forth below, is signaled
semi-backward-compatibly by means of the noise filling side information which can be
parsed correctly by a legacy MPEG-D USAC decoder.
The tool description could be as follows. When SF is active in a joint-stereo FD frame, the
MDCT coefficients of empty (i.e. fully zero-quantized) scale factor bands of the right
(second) channel, such as 50d, are replaced by a sum or difference of the corresponding
decoded left and right channels’ MDCT coefficients of the previous frame (if FD). If legacy
noise filling is active for the second channel, pseudorandom values are also added to
each coefficient. The resulting coefficients of each scale factor band are then scaled such
that the RMS (root of the mean coefficient square) of each band matches the value
transmitted by way of that band’s scale factor. See section 7.3 of the standard in [3].
Some operational constraints could be provided for the use of the new SF tool in the
MPEG-D USAC standard. For example, the SF tool may be available for use only in the
right FD channel of a common FD channel pair, i.e. a channel pair element transmitting a
StereoCoreToolInfo( ) with common_window == 1. Besides, due to the
semi-backward-compatible signaling, the SF tool may be available for use only when noiseFilling == 1 in
the syntax container UsacCoreConfig( ). If either of the channels in the pair is in LPD
core_mode, the SF tool may not be used, even if the right channel is in the FD mode.
The following terms and definitions are used hereafter in order to more clearly describe
the extension of the standard as described in [3].
In particular, as far as the data elements are concerned, the following data element is
newly introduced:
stereo_filling      binary flag indicating whether SF is utilized in the current frame and channel
Further, new help elements are introduced:
noise_offset        noise-fill offset to modify the scale factors of zero-quantized bands (section 7.2)
noise_level         noise-fill level representing the amplitude of added spectrum noise (section 7.2)
downmix_prev[ ]     downmix (i.e. sum or difference) of the previous frame’s left and right channels
sf_index[g][sfb]    scale factor index (i.e. transmitted integer) for window group g and band sfb
The decoding process of the standard would be extended in the following manner. In
particular, the decoding of a joint-stereo coded FD channel with the SF tool being
activated is executed in three sequential steps as follows:
First of all, the decoding of the stereo_filling flag would take place.
stereo_filling does not represent an independent bit-stream element but is derived from
the noise-fill elements, noise_offset and noise_level, in a UsacChannelPairElement() and
the common_window flag in StereoCoreToolInfo(). If noiseFilling == 0 or common_window
== 0 or the current channel is the left (first) channel in the element, stereo_filling is 0, and
the stereo filling process ends. Otherwise,
if ((noiseFilling != 0) && (common_window != 0) && (noise_level == 0)) {
    stereo_filling = (noise_offset & 16) / 16;
    noise_level = (noise_offset & 14) / 2;
    noise_offset = (noise_offset & 1) * 16;
}
else {
    stereo_filling = 0;
}
In other words, if noise_level == 0, noise_offset contains the stereo_filling flag followed by
4 bits of noise filling data, which are then rearranged. Since this operation alters the
values of noise_level and noise_offset, it needs to be performed before the noise filling
process of section 7.2. Moreover, the above pseudo-code is not executed in the left (first)
channel of a UsacChannelPairElement( ) or any other element.
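As a self-contained sketch, the above pseudo-code may be wrapped into a function and checked with a worked example. The struct, the function name and the is_second_channel argument are introduced here for illustration only; the bit operations are those of the pseudo-code.

```c
/* Hypothetical container for the rearranged side information. */
typedef struct {
    int stereo_filling;
    int noise_level;
    int noise_offset;
} SfSideInfo;

/* Derives stereo_filling and rearranges the noise-fill values.
 * is_second_channel is non-zero for the right (second) channel of the
 * element; for any other channel the values pass through unchanged. */
SfSideInfo derive_stereo_filling(int noiseFilling, int common_window,
                                 int is_second_channel,
                                 int noise_level, int noise_offset)
{
    SfSideInfo out = { 0, noise_level, noise_offset };

    if (is_second_channel && (noiseFilling != 0) &&
        (common_window != 0) && (noise_level == 0)) {
        out.stereo_filling = (noise_offset & 16) / 16; /* top bit     */
        out.noise_level = (noise_offset & 14) / 2;     /* middle bits */
        out.noise_offset = (noise_offset & 1) * 16;    /* lowest bit  */
    }
    return out;
}
```

For example, a transmitted noise_level of 0 with noise_offset equal to 27 (binary 11011) yields stereo_filling = 1, noise_level = 5 and noise_offset = 16.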
Then, the calculation of downmix_prev would take place.
downmix_prev[ ], the spectral downmix which is to be used for stereo filling, is identical to
the dmx_re_prev[ ] used for the MDST spectrum estimation in complex stereo prediction
(section 7.7.2.3). This means that
- All coefficients of downmix_prev[ ] must be zero if any of the channels of the frame
  and element with which the downmixing is performed – i.e. the frame before the
  currently decoded one – use core_mode == 1 (LPD) or the channels use unequal
  transform lengths (split_transform == 1 or block switching to window_sequence ==
  EIGHT_SHORT_SEQUENCE in only one channel) or usacIndependencyFlag == 1.
- All coefficients of downmix_prev[ ] must be zero during the stereo filling process if
  the channel’s transform length changed from the last to the current frame (i.e.
  split_transform == 1 preceded by split_transform == 0, or window_sequence ==
  EIGHT_SHORT_SEQUENCE preceded by window_sequence !=
  EIGHT_SHORT_SEQUENCE, or vice versa resp.) in the current element.
- If transform splitting is applied in the channels of the previous or current frame,
  downmix_prev[ ] represents a line-by-line interleaved spectral downmix. See the
  transform splitting tool for details.
- If complex stereo prediction is not utilized in the current frame and element,
  pred_dir equals 0.
Consequently, the previous downmix only has to be computed once for both tools, saving
complexity. The only difference between downmix_prev[ ] and dmx_re_prev[ ] in section
7.7.2 is the behavior when complex stereo prediction is not currently used, or when it is
active but use_prev_frame == 0. In that case, downmix_prev[ ] is computed for stereo
filling decoding according to section 7.7.2.3 even though dmx_re_prev[ ] is not needed for
complex stereo prediction decoding and is, therefore, undefined/zero.
Thereafter, the stereo filling of empty scale factor bands would be performed.
If stereo_filling == 1, the following procedure is carried out after the noise filling process in
all initially empty scale factor bands sfb[ ] below max_sfb_ste, i.e. all bands in which all
MDCT lines were quantized to zero. First, the energies of the given sfb[ ] and the
corresponding lines in downmix_prev[ ] are computed via sums of the line squares. Then,
given sfbWidth containing the number of lines per sfb[ ],
if (energy[sfb] < sfbWidth[sfb]) { /* noise level isn't maximum, or band starts below
                                      noise-fill region */
    facDmx = sqrt((sfbWidth[sfb] - energy[sfb]) / energy_dmx[sfb]);
    factor = 0.0;
    /* if the previous downmix isn't empty, add the scaled downmix lines such that band
       reaches unity energy */
    for (index = swb_offset[sfb]; index < swb_offset[sfb+1]; index++) {
        spectrum[window][index] += downmix_prev[window][index] * facDmx;
        factor += spectrum[window][index] * spectrum[window][index];
    }
    if ((factor != sfbWidth[sfb]) && (factor > 0)) { /* unity energy isn't reached, so
                                                        modify band */
        factor = sqrt(sfbWidth[sfb] / (factor + 1e-8));
        for (index = swb_offset[sfb]; index < swb_offset[sfb+1]; index++) {
            spectrum[window][index] *= factor;
        }
    }
}
for the spectrum of each group window. Then the scale factors are applied onto the
resulting spectrum as in section 7.3, with the scale factors of the empty bands being
processed like regular scale factors.
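The band-wise procedure above may be sketched in Python for a single window as follows. The function name stereo_fill_band, the flat list representation of the spectrum and the explicit guard for an empty previous downmix are assumptions of this sketch; the target is an average energy of one per spectral line, as in the pseudo-code.

```python
import math

def stereo_fill_band(spectrum, downmix_prev, start, stop):
    """Fill one initially empty band in place and return its energy afterwards.

    spectrum and downmix_prev are lists of MDCT lines of one window;
    start and stop play the role of swb_offset[sfb] and swb_offset[sfb+1].
    """
    width = stop - start
    energy = sum(x * x for x in spectrum[start:stop])
    energy_dmx = sum(x * x for x in downmix_prev[start:stop])
    if energy >= width:
        return energy  # noise filling already reached the target energy
    # Assumption: an empty previous downmix contributes nothing, which also
    # avoids a division by zero in the facDmx expression.
    fac_dmx = math.sqrt((width - energy) / energy_dmx) if energy_dmx > 0 else 0.0
    for i in range(start, stop):
        spectrum[i] += downmix_prev[i] * fac_dmx
    factor = sum(x * x for x in spectrum[start:stop])
    if factor != width and factor > 0:
        # Unity energy per line is not reached yet, so rescale the band.
        gain = math.sqrt(width / (factor + 1e-8))
        for i in range(start, stop):
            spectrum[i] *= gain
        factor = sum(x * x for x in spectrum[start:stop])
    return factor
```

In this sketch a band that contains neither noise nor previous downmix energy simply remains empty.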
An alternative to the above extension of the xHE-AAC standard would use an implicit
semi-backward compatible signaling method.
The above implementation in the xHE-AAC code framework describes an approach which
employs one bit in a bitstream to signal usage of the new stereo filling tool, contained in
stereo_filling, to a decoder in accordance with Fig. 2. More precisely, such signaling (let’s
call it explicit semi-backward-compatible signaling) allows the following legacy bitstream
data (here the noise filling side information) to be used independently of the SF
signalization: In the present embodiment, the noise filling data does not depend on the
stereo filling information, and vice versa. For example, noise filling data consisting of
all-zeros (noise_level = noise_offset = 0) may be transmitted while stereo_filling may signal
any possible value (being a binary flag, either 0 or 1).
In cases where strict independence between the legacy and the inventive bitstream data
is not required and the inventive signal is a binary decision, the explicit transmission of a
signaling bit can be avoided, and said binary decision can be signaled by the presence or
absence of what may be called implicit semi-backward-compatible signaling. Taking again
the above embodiment as an example, the usage of stereo filling could be transmitted by
simply employing the new signaling: If noise_level is zero and, at the same time,
noise_offset is not zero, the stereo_filling flag is set equal to 1. If both noise_level and
noise_offset are not zero, stereo_filling is equal to 0. A dependency of this implicit signal on
the legacy noise-fill signal occurs when both noise_level and noise_offset are zero. In this
case, it is unclear whether legacy or new SF implicit signaling is being used. To avoid
such ambiguity, the value of stereo_filling must be defined in advance. In the present
example, it is appropriate to define stereo_filling = 0 if the noise filling data consists of
all-zeros, since this is what legacy encoders without stereo filling capability signal when noise
filling is not to be applied in a frame.
The issue which remains to be solved in the case of implicit semi-backward-compatible
signaling is how to signal stereo_filling == 1 and no noise filling at the same time. As
explained, the noise filling data must not be all-zero, and if a noise magnitude of zero is
requested, noise_level ((noise_offset & 14)/2 as mentioned above) must equal 0. This
leaves only a noise_offset ((noise_offset & 1)*16 as mentioned above) greater than 0 as a
solution. The noise_offset, however, is considered in case of stereo filling when applying
the scale factors, even if noise_level is zero. Fortunately, an encoder can compensate for
the fact that a noise_offset of zero might not be transmittable by altering the affected scale
factors such that upon bitstream writing, they contain an offset which is undone in the
decoder via noise_offset. This allows said implicit signaling in the above embodiment at
the cost of a potential increase in scale factor data rate. Hence, the signaling of stereo
filling in the pseudo-code of the above description could be changed as follows, using the
saved SF signaling bit to transmit noise_offset with 2 bits (4 values) instead of 1 bit:
if ((noiseFilling) && (common_window) && (noise_level == 0) &&
    (noise_offset > 0)) {
    stereo_filling = 1;
    noise_level = (noise_offset & 28) / 4;
    noise_offset = (noise_offset & 3) * 8;
}
else {
    stereo_filling = 0;
}
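The implicit variant may be sketched in Python as follows. The decoder-side function mirrors the pseudo-code above; the encoder-side helper pack_implicit and its parameter offset_code (the 2-bit offset value) are hypothetical names introduced only to show why an all-zero field cannot be used to signal stereo filling.

```python
def derive_stereo_filling_implicit(noise_filling, common_window, noise_level, noise_offset):
    """Return (stereo_filling, noise_level, noise_offset) under implicit signaling."""
    if noise_filling and common_window and noise_level == 0 and noise_offset > 0:
        new_level = (noise_offset & 28) // 4  # bits 4..2 carry the noise level
        new_offset = (noise_offset & 3) * 8   # bits 1..0 carry the offset
        return 1, new_level, new_offset
    return 0, noise_level, noise_offset

def pack_implicit(noise_level, offset_code):
    """Hypothetical encoder helper: build the 5-bit noise_offset field.

    The transmitted noise_level field is set to 0 by the encoder. An all-zero
    result would be read as "no stereo filling", so it is rejected here.
    """
    if not (0 <= noise_level <= 7 and 0 <= offset_code <= 3):
        raise ValueError("noise_level must be 0..7 and offset_code 0..3")
    value = noise_level * 4 + offset_code
    if value == 0:
        raise ValueError("all-zero field cannot signal stereo filling")
    return value
```

Stereo filling without noise filling would thus be requested with noise_level = 0 and a non-zero offset_code, the resulting offset being compensated in the scale factors as described above.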
For the sake of completeness, Fig. 6 shows a parametric audio encoder in accordance
with an embodiment of the present application. First of all, the encoder of Fig. 6 which is
generally indicated using reference sign 90 comprises a transformer 92 for performing the
transformation of the original, non-distorted version of the audio signal reconstructed at
the output 32 of Fig. 2. As described with respect to Fig. 3, a lapped transform may be
used with a switching between different transform lengths with corresponding transform
windows in units of frames 44. The different transform lengths and corresponding transform
windows are illustrated in Fig. 3 using reference sign 104. In a manner similar to Fig. 2,
Fig. 6 concentrates on a portion of encoder 90 responsible for encoding one channel of
the multichannel audio signal, whereas another channel domain portion of encoder 90 is
generally indicated using reference sign 96 in Fig. 6.
At the output of transformer 92 the spectral lines and scale factors are unquantized and
substantially no coding loss has occurred yet. The spectrogram output by transformer 92
enters a quantizer 98, which is configured to quantize the spectral lines of the
spectrogram output by transformer 92, spectrum by spectrum, setting and using preliminary scale
factors of the scale factor bands. That is, at the output of quantizer 98, preliminary scale
factors and corresponding spectral line coefficients result, and a sequence of a noise filler
16’, an optional inverse TNS filter 28a’, inter-channel predictor 24’, MS decoder 26’ and
inverse TNS filter 28b’ are sequentially connected so as to provide the encoder 90 of Fig.
6 with the ability to obtain a reconstructed, final version of the current spectrum as
obtainable at the decoder side at the downmix provider’s input (see Fig. 2). In case of
using inter-channel prediction 24’ and/or using the inter-channel noise filling in the version
forming the inter-channel noise using the downmix of the previous frame, encoder 90 also
comprises a downmix provider 31’ so as to form a downmix of the reconstructed, final
versions of the spectra of the channels of the multichannel audio signal. Of course, to
save computations, instead of the final, the original, unquantized versions of said spectra
of the channels may be used by downmix provider 31’ in the formation of the downmix.
The encoder 90 may use the information on the available reconstructed, final version of
the spectra in order to perform inter-frame spectral prediction such as the aforementioned
possible version of performing inter-channel prediction using an imaginary part estimation,
and/or in order to perform rate control, i.e. in order to determine, within a rate control loop,
that the possible parameters finally coded into data stream 30 by encoder 90 are set in a
rate/distortion optimal sense.
For example, one such parameter set in such a prediction loop and/or rate control loop of
encoder 90 is, for each zero-quantized scale factor band identified by identifier 12’, the
scale factor of the respective scale factor band which has merely been preliminarily set by
quantizer 98. In a prediction and/or rate control loop of encoder 90, the scale factor of the
zero-quantized scale factor bands is set in some psychoacoustically or rate/distortion
optimal sense so as to determine the aforementioned target noise level along with, as
described above, an optional modification parameter also conveyed by the data stream for
the corresponding frame to the decoder side. It should be noted that this scale factor may
be computed using only the spectral lines of the spectrum and channel to which it belongs
(i.e. the “target” spectrum, as described earlier) or, alternatively, may be determined using
both the spectral lines of the “target” channel spectrum and, in addition, the spectral lines
of the other channel spectrum or the downmix spectrum from the previous frame (i.e. the
“source” spectrum, as introduced earlier) obtained from downmix provider 31’. In particular,
to stabilize the target noise level and to reduce temporal level fluctuations in the decoded
audio channels onto which the inter-channel noise filling is applied, the target scale factor
may be computed using a relation between an energy measure of the spectral lines in the
“target” scale factor band, and an energy measure of the co-located spectral lines in the
corresponding “source” region. Finally, as noted above, this “source” region may originate
from a reconstructed, final version of another channel or the previous frame’s downmix, or
if the encoder complexity is to be reduced, the original, unquantized version of same other
channel or the downmix of original, unquantized versions of the previous frame’s spectra.
In the following, multichannel encoding and multichannel decoding according to
embodiments is explained. In embodiments, the multichannel processor 204 of the
apparatus 201 for decoding of Fig. 1a may, e.g., be configured to conduct one or more of
the technologies below that are described regarding multichannel decoding.
At first, however, before describing multichannel decoding, multichannel encoding
according to embodiments is explained with reference to Fig. 7 to Fig. 9 and, then,
multichannel decoding is explained with reference to Fig. 10 and Fig. 12.
Now, multichannel encoding according to embodiments is explained with reference to Fig.
7 to Fig. 9 and Fig. 11:
Fig. 7 shows a schematic block diagram of an apparatus (encoder) 100 for encoding a
multichannel signal 101 having at least three channels CH1 to CH3.
The apparatus 100 comprises an iteration processor 102, a channel encoder 104 and an
output interface 106.
The iteration processor 102 is configured to calculate, in a first iteration step, inter-channel
correlation values between each pair of the at least three channels CH1 to CH3 for
selecting, in the first iteration step, a pair having a highest value or having a value above a
threshold, and for processing the selected pair using a multichannel processing operation
to derive multichannel parameters MCH_PAR1 for the selected pair and to derive first
processed channels P1 and P2. In the following, such a processed channel P1 and such
a processed channel P2 may also be referred to as a combination channel P1 and a
combination channel P2, respectively. Further, the iteration processor 102 is configured to
perform the calculating, the selecting and the processing in a second iteration step using
at least one of the processed channels P1 or P2 to derive multichannel parameters
MCH_PAR2 and second processed channels P3 and P4.
For example, as indicated in Fig. 7, the iteration processor 102 may calculate in the first
iteration step an inter-channel correlation value between a first pair of the at least three
channels CH1 to CH3, the first pair consisting of a first channel CH1 and a second
channel CH2, an inter-channel correlation value between a second pair of the at least
three channels CH1 to CH3, the second pair consisting of the second channel CH2 and a
third channel CH3, and an inter-channel correlation value between a third pair of the at
least three channels CH1 to CH3, the third pair consisting of the first channel CH1 and the
third channel CH3.
In Fig. 7 it is assumed that in the first iteration step the third pair consisting of the first
channel CH1 and the third channel CH3 comprises the highest inter-channel correlation
value, such that the iteration processor 102 selects in the first iteration step the third pair
having the highest inter-channel correlation value and processes the selected pair, i.e.,
the third pair, using a multichannel processing operation to derive multichannel
parameters MCH_PAR1 for the selected pair and to derive first processed channels P1
and P2.
Further, the iteration processor 102 can be configured to calculate, in the second iteration
step, inter-channel correlation values between each pair of the at least three channels
CH1 to CH3 and the processed channels P1 and P2, for selecting, in the second iteration
step, a pair having a highest inter-channel correlation value or having a value above a
threshold. Thereby, the iteration processor 102 can be configured to not select the
selected pair of the first iteration step in the second iteration step (or in any further
iteration step).
Referring to the example shown in Fig. 7, the iteration processor 102 may further calculate
an inter-channel correlation value between a fourth pair of channels consisting of the first
channel CH1 and the first processed channel P1, an inter-channel correlation value
between a fifth pair consisting of the first channel CH1 and the second processed channel
P2, an inter-channel correlation value between a sixth pair consisting of the second
channel CH2 and the first processed channel P1, an inter-channel correlation value
between a seventh pair consisting of the second channel CH2 and the second processed
channel P2, an inter-channel correlation value between an eighth pair consisting of the
third channel CH3 and the first processed channel P1, an inter-channel correlation value
between a ninth pair consisting of the third channel CH3 and the second processed channel P2,
and an inter-channel correlation value between a tenth pair consisting of the first
processed channel P1 and the second processed channel P2.
In Fig. 7, it is assumed that in the second iteration step the sixth pair consisting of the
second channel CH2 and the first processed channel P1 comprises the highest
inter-channel correlation value, such that the iteration processor 102 selects in the second
iteration step the sixth pair and processes the selected pair, i.e., the sixth pair, using a
multichannel processing operation to derive multichannel parameters MCH_PAR2 for the
selected pair and to derive second processed channels P3 and P4.
The iteration processor 102 can be configured to only select a pair when the level
difference of the pair is smaller than a threshold, the threshold being smaller than 40 dB,
25 dB, 12 dB or smaller than 6 dB. Thereby, the thresholds of 25 or 40 dB correspond to
rotation angles of approximately 3 or 0.5 degrees, respectively.
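The stated correspondence can be checked numerically under an assumption that is not spelled out in the text, namely that the two channels are fully correlated, in which case the KLT rotation angle reduces to the arctangent of their amplitude ratio. The function name rotation_angle_degrees is chosen for this sketch only.

```python
import math

def rotation_angle_degrees(level_difference_db):
    """Rotation angle, in degrees, for two fully correlated channels
    whose levels differ by level_difference_db (assumption of this sketch)."""
    amplitude_ratio = 10.0 ** (-level_difference_db / 20.0)
    return math.degrees(math.atan(amplitude_ratio))
```

Under this assumption, 25 dB gives about 3.2 degrees and 40 dB about 0.57 degrees, in line with the rounded values of 3 and 0.5 degrees.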
The iteration processor 102 can be configured to calculate normalized integer correlation
values, wherein the iteration processor 102 can be configured to select a pair, when the
integer correlation value is greater than e.g. 0.2 or preferably 0.3.
Further, the iteration processor 102 may provide the channels resulting from the
multichannel processing to the channel encoder 104. For example, referring to Fig. 7, the
iteration processor 102 may provide the third processed channel P3 and the fourth
processed channel P4 resulting from the multichannel processing performed in the
second iteration step and the second processed channel P2 resulting from the
multichannel processing performed in the first iteration step to the channel encoder 104.
Thereby, the iteration processor 102 may only provide those processed channels to the
channel encoder 104 which are not (further) processed in a subsequent iteration step. As
shown in Fig. 7, the first processed channel P1 is not provided to the channel encoder 104
since it is further processed in the second iteration step.
The channel encoder 104 can be configured to encode the channels P2 to P4 resulting
from the iteration processing (or multichannel processing) performed by the iteration
processor 102 to obtain encoded channels E1 to E3.
For example, the channel encoder 104 can be configured to use mono encoders (or mono
boxes, or mono tools) 120_1 to 120_3 for encoding the channels P2 to P4 resulting from
the iteration processing (or multichannel processing). The mono boxes may be configured
to encode the channels such that fewer bits are required for encoding a channel having less
energy (or a smaller amplitude) than for encoding a channel having more energy (or a
higher amplitude). The mono boxes 120_1 to 120_3 can be, for example, transformation
based audio encoders. Further, the channel encoder 104 can be configured to use stereo
encoders (e.g., parametric stereo encoders, or lossy stereo encoders) for encoding the
channels P2 to P4 resulting from the iteration processing (or multichannel processing).
The output interface 106 can be configured to generate an encoded multichannel signal
107 having the encoded channels E1 to E3 and the multichannel parameters MCH_PAR1
and MCH_PAR2.
For example, the output interface 106 can be configured to generate the encoded
multichannel signal 107 as a serial signal or serial bit stream, and so that the multichannel
parameters MCH_PAR2 are in the encoded signal 107 before the multichannel
parameters MCH_PAR1. Thus, a decoder, an embodiment of which will be described later
with respect to Fig. 10, will receive the multichannel parameters MCH_PAR2 before the
multichannel parameters MCH_PAR1.
In Fig. 7 the iteration processor 102 exemplarily performs two multichannel processing
operations, a multichannel processing operation in the first iteration step and a
multichannel processing operation in the second iteration step. Naturally, the iteration
processor 102 also can perform further multichannel processing operations in subsequent
iteration steps. Thereby, the iteration processor 102 can be configured to perform iteration
steps until an iteration termination criterion is reached. The iteration termination criterion
can be that a maximum number of iteration steps is equal to or higher than a total number
of channels of the multichannel signal 101 by two, or that the inter-channel correlation
values do not have a value greater than the threshold, the threshold preferably being
greater than 0.2 or the threshold preferably being 0.3. In further embodiments, the iteration
termination criterion can be that a maximum number of iteration steps is equal to or higher
than a total number of channels of the multichannel signal 101, or that the inter-channel
correlation values do not have a value greater than the threshold, the threshold
preferably being greater than 0.2 or the threshold preferably being 0.3.
For illustration purposes the multichannel processing operations performed by the
iteration processor 102 in the first iteration step and the second iteration step are
exemplarily illustrated in Fig. 7 by processing boxes 110 and 112. The processing boxes
110 and 112 can be implemented in hardware or software. The processing boxes 110 and
112 can be stereo boxes, for example.
Thereby, inter-channel signal dependency can be exploited by hierarchically applying
known joint stereo coding tools. In contrast to previous MPEG approaches, the signal
pairs to be processed are not predetermined by a fixed signal path (e.g., stereo coding
tree) but can be changed dynamically to adapt to input signal characteristics. The inputs
of the actual stereo box can be (1) unprocessed channels, such as the channels CH1 to
CH3, (2) outputs of a preceding stereo box, such as the processed signals P1 to P4, or (3)
a combination channel of an unprocessed channel and an output of a preceding stereo
box.
The processing inside the stereo box 110 and 112 can either be prediction based (like
complex prediction box in USAC) or KLT/PCA based (the input channels are rotated (e.g.,
via a 2x2 rotation matrix) in the encoder to maximize energy compaction, i.e., concentrate
signal energy into one channel; in the decoder the rotated signals will be retransformed to
the original input signal directions).
In a possible implementation of the encoder 100, (1) the encoder calculates an
inter-channel correlation between every channel pair and selects one suitable signal pair out of
the input signals and applies the stereo tool to the selected channels; (2) the encoder
recalculates the inter-channel correlation between all channels (the unprocessed channels
as well as the processed intermediate output channels) and selects one suitable signal
pair out of the input signals and applies the stereo tool to the selected channels; and (3)
the encoder repeats step (2) until all inter-channel correlations are below a threshold or a
maximum number of transformations has been applied.
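Steps (1) to (3) may be sketched in Python as follows. Several points are assumptions of this sketch and not taken from the description: the correlation measure (absolute normalized cross-correlation over one frame), the in-place replacement of the two selected channels by the two outputs of the stereo tool, and the simple mid/side function used as a stand-in for the stereo tool. The names correlation, mid_side and build_stereo_tree are hypothetical.

```python
import math
from itertools import combinations

def correlation(x, y):
    """Absolute normalized cross-correlation of two frames (one possible measure)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return abs(num) / den if den > 0 else 0.0

def mid_side(x, y):
    """Stand-in stereo tool returning (mid, side)."""
    mid = [0.5 * (a + b) for a, b in zip(x, y)]
    side = [0.5 * (a - b) for a, b in zip(x, y)]
    return mid, side

def build_stereo_tree(channels, threshold=0.3, max_steps=None, stereo_tool=mid_side):
    """Greedy pairing: returns (list of selected pairs, processed channels)."""
    channels = [list(ch) for ch in channels]  # work on a copy
    if max_steps is None:
        max_steps = len(channels)
    tree, used = [], set()
    for _ in range(max_steps):
        scored = [(correlation(channels[i], channels[j]), (i, j))
                  for i, j in combinations(range(len(channels)), 2)
                  if (i, j) not in used]  # a pair is not selected twice
        if not scored:
            break
        value, (i, j) = max(scored)
        if value <= threshold:
            break  # all remaining correlations are at or below the threshold
        channels[i], channels[j] = stereo_tool(channels[i], channels[j])
        used.add((i, j))
        tree.append((i, j))
    return tree, channels
```

The returned list of pairs corresponds to the dynamically built stereo tree; with three input frames in which only the first and third are strongly correlated, a single pair (0, 2) would be selected before the threshold criterion terminates the loop.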
As already mentioned, the signal pairs to be processed by the encoder 100, or more
precisely the iteration processor 102, are not predetermined by a fixed signal path (e.g.,
stereo coding tree) but can be changed dynamically to adapt to input signal
characteristics. Thereby, the encoder 100 (or the iteration processor 102) can be
configured to construct the stereo tree in dependence on the at least three channels CH1
to CH3 of the multichannel (input) signal 101. In other words, the encoder 100 (or the
iteration processor 102) can be configured to build the stereo tree based on an
inter-channel correlation (e.g., by calculating, in the first iteration step, inter-channel correlation
values between each pair of the at least three channels CH1 to CH3, for selecting, in the
first iteration step, a pair having the highest value or a value above a threshold, and by
calculating, in a second iteration step, inter-channel correlation values between each pair
of the at least three channels and previously processed channels, for selecting, in the
second iteration step, a pair having the highest value or a value above a threshold).
According to a one-step approach, a correlation matrix containing the correlations of all
channels, including those possibly processed in previous iterations, may be calculated for
possibly each iteration.
As indicated above, the iteration processor 102 can be configured to derive multichannel
parameters MCH_PAR1 for the selected pair in the first iteration step and to derive
multichannel parameters MCH_PAR2 for the selected pair in the second iteration step.
The multichannel parameters MCH_PAR1 may comprise a first channel pair identification
(or index) identifying (or signaling) the pair of channels selected in the first iteration step,
wherein the multichannel parameters MCH_PAR2 may comprise a second channel pair
identification (or index) identifying (or signaling) the pair of channels selected in the
second iteration step.
In the following, an efficient indexing of input signals is described. For example, channel
pairs can be efficiently signaled using a unique index for each pair, dependent on the total
number of channels. For example, the indexing of pairs for six channels can be as shown
in the following table (rows and columns are labeled with the channel numbers 0 to 5):
       0    1    2    3    4    5
  0         0    1    2    3    4
  1              5    6    7    8
  2                   9   10   11
  3                       12   13
  4                            14
  5
For example, in the above table the index 5 may signal the pair consisting of channel 1
and channel 2 (row 1, column 2). Similarly, the index 6 may signal the pair consisting of
channel 1 and channel 3.
The total number of possible channel pair indices for n channels can be calculated as:
numPairs = numChannels*(numChannels-1)/2
Hence, the number of bits needed for signaling one channel pair amounts to:
numBits = floor(log2(numPairs-1))+1
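The indexing and the two formulas above may be sketched in Python as follows. The row-major closed form used in pair_to_index is inferred from the six-channel table; the function names and the zero-based channel numbering are assumptions of this sketch.

```python
def num_pairs(num_channels):
    """numPairs = numChannels*(numChannels-1)/2."""
    return num_channels * (num_channels - 1) // 2

def num_bits(num_channels):
    """floor(log2(numPairs-1))+1, evaluated with integer arithmetic."""
    if num_channels < 2:
        raise ValueError("at least two channels are needed to form a pair")
    return (num_pairs(num_channels) - 1).bit_length()

def pair_to_index(ch_a, ch_b, num_channels):
    """Index of the pair (ch_a, ch_b), ch_a < ch_b, in the table's row-major order."""
    if not 0 <= ch_a < ch_b < num_channels:
        raise ValueError("expected 0 <= ch_a < ch_b < num_channels")
    return ch_a * num_channels - ch_a * (ch_a + 1) // 2 + (ch_b - ch_a - 1)

def index_to_pair(index, num_channels):
    """Inverse mapping: walk the rows of the table until the index fits."""
    if index < 0:
        raise ValueError("index out of range")
    for ch_a in range(num_channels - 1):
        row_length = num_channels - 1 - ch_a
        if index < row_length:
            return ch_a, ch_a + 1 + index
        index -= row_length
    raise ValueError("index out of range")
```

For six channels this yields 15 pairs and 4 bits per pair index. For two channels the integer evaluation gives 0 bits, a case in which the logarithm in the formula itself is undefined.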
Further, the encoder 100 may use a channel mask. The multichannel tool’s configuration
may contain a channel mask indicating for which channels the tool is active. Thus, LFEs
(LFE = low frequency effects/enhancement channels) can be removed from the channel
pair indexing, allowing for a more efficient encoding. For example, for an 11.1 setup, this
reduces the number of channel pair indices from 12*11/2 = 66 to 11*10/2 = 55, allowing
signaling with 6 instead of 7 bits. This mechanism can also be used to exclude channels
intended to be mono objects (e.g. multiple language tracks). On decoding of the channel mask
(channelMask), a channel map (channelMap) can be generated to allow re-mapping of
channel pair indices to decoder channels.
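The effect of the channel mask may be sketched in Python as follows. The position of the LFE channel in the example mask (index 3) and the function names are hypothetical; the construction of the map follows the idea of listing the channels whose mask bit is set.

```python
def build_channel_map(chan_mask):
    """Return the decoder channel numbers for which the tool is active."""
    return [chan for chan, active in enumerate(chan_mask) if active > 0]

def pair_index_bits(num_channels):
    """Bits needed for one channel pair index among num_channels channels."""
    if num_channels < 2:
        return 0
    return (num_channels * (num_channels - 1) // 2 - 1).bit_length()

# Example: 11.1 setup with the LFE assumed at position 3.
mask = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
channel_map = build_channel_map(mask)
bits_with_lfe = pair_index_bits(len(mask))            # 12 channels
bits_without_lfe = pair_index_bits(len(channel_map))  # 11 channels
```

A pair signaled on masked channel numbers (a, b) would then refer to the decoder channels channel_map[a] and channel_map[b].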
Moreover, the iteration processor 102 can be configured to derive, for a first frame, a
plurality of selected pair indications, wherein the output interface 106 can be configured to
include, into the multichannel signal 107, for a second frame, following the first frame, a
keep indicator, indicating that the second frame has the same plurality of selected pair
indications as the first frame.
The keep indicator or the keep tree flag can be used to signal that no new tree is
transmitted, but the last stereo tree shall be used. This can be used to avoid multiple
transmission of the same stereo tree configuration if the channel correlation properties
stay stationary for a longer time.
Fig. 8 shows a schematic block diagram of a stereo box 110, 112. The stereo box 110,
112 comprises inputs for a first input signal I1 and a second input signal I2, and outputs
for a first output signal O1 and a second output signal O2. As indicated in Fig. 8,
dependencies of the output signals O1 and O2 on the input signals I1 and I2 can be
described by the s-parameters S1 to S4.
The iteration processor 102 can use (or comprise) stereo boxes 110, 112 in order to
perform the multichannel processing operations on the input channels and/or processed
channels in order to derive (further) processed channels. For example, the iteration
processor 102 can be configured to use generic, prediction based or KLT
(Karhunen-Loève-Transformation) based rotation stereo boxes 110, 112.
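A generic encoder (or encoder-side stereo box) can be configured to encode the input
signals I1 and I2 to obtain the output signals O1 and O2 based on the equation:
\[ \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} = \begin{bmatrix} S_1 & S_2 \\ S_3 & S_4 \end{bmatrix} \cdot \begin{bmatrix} I_1 \\ I_2 \end{bmatrix} . \]
A generic decoder (or decoder-side stereo box) can be configured to decode the input
signals I1 and I2 to obtain the output signals O1 and O2 based on the equation:
\[ \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} = \begin{bmatrix} S_1 & S_2 \\ S_3 & S_4 \end{bmatrix}^{-1} \cdot \begin{bmatrix} I_1 \\ I_2 \end{bmatrix} . \]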
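A prediction based encoder (or encoder-side stereo box) can be configured to encode the
input signals I1 and I2 to obtain the output signals O1 and O2 based on the equation
\[ \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} = 0.5 \cdot \begin{bmatrix} 1 & 1 \\ 1-p & -(1+p) \end{bmatrix} \cdot \begin{bmatrix} I_1 \\ I_2 \end{bmatrix} , \]
wherein p is the prediction coefficient.
A prediction based decoder (or decoder-side stereo box) can be configured to decode the
input signals I1 and I2 to obtain the output signals O1 and O2 based on the equation:
\[ \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} = \begin{bmatrix} 1+p & 1 \\ 1-p & -1 \end{bmatrix} \cdot \begin{bmatrix} I_1 \\ I_2 \end{bmatrix} . \]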
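A prediction based stereo box may be sketched in Python as follows, assuming a real-valued prediction in which the first output is the mid signal and the second output is the side signal minus p times the mid signal. This particular form, the least-squares estimate of p and all function names are assumptions of the sketch.

```python
def prediction_encode(i1, i2, p):
    """Encoder side: returns (mid, residual) for prediction coefficient p."""
    mid = [0.5 * (a + b) for a, b in zip(i1, i2)]
    res = [0.5 * ((1 - p) * a - (1 + p) * b) for a, b in zip(i1, i2)]
    return mid, res

def prediction_decode(mid, res, p):
    """Decoder side: reconstructs the two channels from (mid, residual)."""
    o1 = [(1 + p) * m + r for m, r in zip(mid, res)]
    o2 = [(1 - p) * m - r for m, r in zip(mid, res)]
    return o1, o2

def estimate_coefficient(i1, i2):
    """Least-squares p minimising the residual energy (an assumed criterion)."""
    mid = [0.5 * (a + b) for a, b in zip(i1, i2)]
    side = [0.5 * (a - b) for a, b in zip(i1, i2)]
    energy = sum(m * m for m in mid)
    return sum(s * m for s, m in zip(side, mid)) / energy if energy > 0 else 0.0
```

When the second channel is a scaled copy of the first, the estimated coefficient drives the residual to (numerically) zero, and decoding with the same p restores both inputs.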
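A KLT based rotation encoder (or encoder-side stereo box) can be configured to encode
the input signals I1 and I2 to obtain the output signals O1 and O2 based on the equation:
\[ \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} \cdot \begin{bmatrix} I_1 \\ I_2 \end{bmatrix} . \]
A KLT based rotation decoder (or decoder-side stereo box) can be configured to decode
the input signals I1 and I2 to obtain the output signals O1 and O2 based on the equation
(inverse rotation):
\[ \begin{bmatrix} O_1 \\ O_2 \end{bmatrix} = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix} \cdot \begin{bmatrix} I_1 \\ I_2 \end{bmatrix} . \]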
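The rotation and its inverse may be sketched in Python as follows. The sign convention (energy compacted into the first output) and the function names klt_encode and klt_decode are assumptions of this sketch.

```python
import math

def klt_encode(i1, i2, alpha):
    """Rotate the input pair by alpha (radians)."""
    c, s = math.cos(alpha), math.sin(alpha)
    o1 = [c * a + s * b for a, b in zip(i1, i2)]
    o2 = [-s * a + c * b for a, b in zip(i1, i2)]
    return o1, o2

def klt_decode(o1, o2, alpha):
    """Inverse rotation: restores the original signal directions."""
    c, s = math.cos(alpha), math.sin(alpha)
    i1 = [c * a - s * b for a, b in zip(o1, o2)]
    i2 = [s * a + c * b for a, b in zip(o1, o2)]
    return i1, i2
```

Because the rotation matrix is orthogonal, the total energy of the pair is preserved and decoding with the same angle restores the inputs up to rounding.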
In the following, a calculation of the rotation angle α for the KLT based rotation is
described.
The rotation angle for the KLT based rotation can be defined as:
\[ \alpha = \frac{1}{2} \tan^{-1}\left( \frac{2 c_{12}}{c_{11} - c_{22}} \right) \]
with c_{xy} being the entries of a non-normalized correlation matrix, wherein c_{11}, c_{22} are the
channel energies.
This can be implemented using the atan2 function to allow for differentiation between
negative correlations in the numerator and negative energy difference in the denominator:
alpha = 0.5*atan2(2*correlation[ch1][ch2],
                  (correlation[ch1][ch1] - correlation[ch2][ch2]));
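The angle computation may be sketched in Python as follows; the helper correlation_matrix, which builds the non-normalized 2x2 correlation matrix of one frame pair, is a hypothetical name introduced for this illustration.

```python
import math

def correlation_matrix(x, y):
    """Non-normalized 2x2 correlation matrix of two frames."""
    c11 = sum(a * a for a in x)
    c22 = sum(b * b for b in y)
    c12 = sum(a * b for a, b in zip(x, y))
    return [[c11, c12], [c12, c22]]

def rotation_angle(correlation, ch1=0, ch2=1):
    """alpha = 0.5*atan2(2*c12, c11 - c22), in radians."""
    return 0.5 * math.atan2(2 * correlation[ch1][ch2],
                            correlation[ch1][ch1] - correlation[ch2][ch2])
```

For x = [1, 2, 3] and y = [2, 4, 6] the energy difference is negative and the correlation positive; atan2 then returns an angle of about 63.4 degrees, whereas a plain arctangent of the quotient would give about -26.6 degrees, i.e. the perpendicular direction.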
Further, the iteration processor 102 can be configured to calculate an inter-channel
correlation using a frame of each channel comprising a plurality of bands so that a single
inter-channel correlation value for the plurality of bands is obtained, wherein the iteration
processor 102 can be configured to perform the multichannel processing for each of the
plurality of bands so that the multichannel parameters are obtained from each of the
plurality of bands.
Thereby, the iteration processor 102 can be configured to calculate stereo parameters in
the multichannel processing, wherein the iteration processor 102 can be configured to
only perform a stereo processing in bands, in which a stereo parameter is higher than a
quantized-to-zero threshold defined by a stereo quantizer (e.g., KLT based rotation
encoder). The stereo parameters can be, for example, MS On/Off or rotation angles or
prediction coefficients.
For example, the iteration processor 102 can be configured to calculate rotation angles in
the multichannel processing, wherein the iteration processor 102 can be configured to
only perform a rotation processing in bands, in which a rotation angle is higher than a
quantized-to-zero threshold defined by a rotation angle quantizer (e.g., KLT based rotation
encoder).
Thus, the encoder 100 (or output interface 106) can be configured to transmit the
transformation/rotation information either as one parameter for the complete spectrum (full
band box) or as multiple frequency dependent parameters for parts of the spectrum.
The encoder 100 can be configured to generate the bit stream 107 based on the following
tables:
Table 1 — Syntax of mpegh3daExtElementConfig()
Syntax                                                          No. of bits   Mnemonic
mpegh3daExtElementConfig()
{
    usacExtElementType = escapedValue(4, 8, 16);
    usacExtElementConfigLength = escapedValue(4, 8, 16);
    if (usacExtElementDefaultLengthPresent) {                   1             uimsbf
        usacExtElementDefaultLength = escapedValue(8, 16, 0) + 1;
    } else {
        usacExtElementDefaultLength = 0;
    }
    usacExtElementPayloadFrag;                                  1             uimsbf
    switch (usacExtElementType) {
    case ID_EXT_ELE_FILL:
        /* No configuration element */
        break;
    case ID_EXT_ELE_MPEGS:
        SpatialSpecificConfig();
        break;
    case ID_EXT_ELE_SAOC:
        SAOCSpecificConfig();
        break;
    case ID_EXT_ELE_AUDIOPREROLL:
        /* No configuration element */
        break;
    case ID_EXT_ELE_UNI_DRC:
        mpegh3daUniDrcConfig();
        break;
    case ID_EXT_ELE_OBJ_METADATA:
        ObjectMetadataConfig();
        break;
    case ID_EXT_ELE_SAOC_3D:
        SAOC3DSpecificConfig();
        break;
    case ID_EXT_ELE_HOA:
        HOAConfig();
        break;
    case ID_EXT_ELE_MCC: /* multi channel coding */
        MCCConfig(grp);
        break;
    case ID_EXT_ELE_FMT_CNVRTR:
        /* No configuration element */
        break;
    default:                                                    NOTE
        while (usacExtElementConfigLength--) {
            tmp;                                                8             uimsbf
        }
        break;
    }
}
NOTE: The default entry for the usacExtElementType is used for unknown extElementTypes so that
legacy decoders can cope with future extensions.
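Table 1 reads several fields with escapedValue(). A minimal sketch of such a reader is given below, assuming the escape mechanism known from USAC: a field of nBits1 bits is read, and only if it has its maximum value is a further field of nBits2 bits read and added, and likewise for nBits3. The BitReader type and the read_bits helper are hypothetical and serve only to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader; not part of the specification. */
typedef struct {
    const uint8_t *data;
    size_t size_bits;
    size_t pos_bits;
} BitReader;

/* Reads n bits (0..32), most significant bit first. */
static uint32_t read_bits(BitReader *br, unsigned n)
{
    uint32_t v = 0;
    assert(n <= 32 && br->pos_bits + n <= br->size_bits);
    for (unsigned i = 0; i < n; i++) {
        const size_t p = br->pos_bits++;
        v = (v << 1) | ((br->data[p >> 3] >> (7 - (p & 7))) & 1u);
    }
    return v;
}

/* Escaped value: each further field is read only if the previous field
 * was all ones, and its value is added to the running total. */
static uint32_t escaped_value(BitReader *br, unsigned n1, unsigned n2, unsigned n3)
{
    uint32_t value = read_bits(br, n1);
    if (value == (1u << n1) - 1u) {
        const uint32_t add = read_bits(br, n2);
        value += add;
        if (add == (1u << n2) - 1u) {
            value += read_bits(br, n3);
        }
    }
    return value;
}
```

With this scheme, small values of usacExtElementType cost only four bits, while larger values remain representable at the price of the additional fields.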
Table 2 — Syntax of MCCConfig()
Syntax                                                          No. of bits   Mnemonic
MCCConfig(grp)
{
    nChannels = 0;
    for (chan = 0; chan < bsNumberOfSignals[grp]; chan++) {
        chanMask[chan];                                         1
        if (chanMask[chan] > 0) {
            mctChannelMap[nChannels] = chan;
            nChannels++;
        }
    }
}
NOTE: The corresponding ID_USAC_EXT element shall precede any audio element of the respective
signal group grp.
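The channel map construction in MCCConfig() can be sketched as follows, assuming the channel mask has already been read from the bit stream into an array. The function name build_mct_channel_map and its parameter names are illustrative and not part of the syntax.

```c
#include <assert.h>
#include <stddef.h>

/* Collects the indices of all signals whose mask bit is set.
 * Returns nChannels, the number of channels taking part in the
 * multichannel coding of the signal group. */
static int build_mct_channel_map(const unsigned char *chan_mask,
                                 int num_signals, int *mct_channel_map)
{
    assert(chan_mask != NULL && mct_channel_map != NULL && num_signals >= 0);
    int n_channels = 0;
    for (int chan = 0; chan < num_signals; chan++) {
        if (chan_mask[chan] > 0) {
            mct_channel_map[n_channels] = chan;
            n_channels++;
        }
    }
    return n_channels;
}
```

For a mask of 1, 0, 1, 1, 0 the map becomes 0, 2, 3 and nChannels is 3, so that subsequent channel indices of the tool refer only to the selected signals.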
Table 3 — Syntax of MultichannelCodingBoxBandWise()
Syntax No. of bits Mnemonic
MultichannelCodingBoxBandWise()
{
for(pair=0; pair 0) {
delta_code_time = 0;
} else {
delta_code_time; 1
}
if (hasBandwiseAngles == 0) {
hcod_angle[dpcm_alpha[pair][0]]; 1..10 vlclbf
}
else {
for(j=0;j< numMaskBands;j++) {
if (msMask[pair][j] ==1) {
hcod_angle[dpcm_alpha[pair][j]]; 1..10 vlclbf
}
}
}
}
}
NOTE 1) nBits = floor(log2(nChannels*(nChannels-1)/2 – 1))+1
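As a worked illustration of NOTE 1: nChannels channels form nChannels*(nChannels-1)/2 unordered pairs, so the expression is the number of bits needed for the largest zero-based pair index. The sketch below evaluates it with integer arithmetic. The function name is hypothetical, and returning 0 for nChannels = 2, where the logarithm in NOTE 1 is undefined, is an assumption of this example.

```c
#include <assert.h>

/* Evaluates floor(log2(nChannels*(nChannels-1)/2 - 1)) + 1 without
 * floating point: for x >= 1 this equals the bit length of x. */
static unsigned pair_index_bits(unsigned n_channels)
{
    assert(n_channels >= 2);
    unsigned max_index = n_channels * (n_channels - 1) / 2 - 1;
    unsigned bits = 0;
    while (max_index > 0) { /* count the bits of the largest index */
        bits++;
        max_index >>= 1;
    }
    return bits; /* 0 for n_channels == 2 (assumption, see lead-in) */
}
```

For example, 4 channels give 6 pairs and 3 bits, and 12 channels give 66 pairs and 7 bits.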
Table 4 — Syntax of MultichannelCodingBoxFullband()
Syntax No. of bits Mnemonic
MultichannelCodingBoxFullband()
{
for (pair=0; pair 0) {
    lastVal = alpha_prev_fullband[pair];
} else {
    lastVal = DEFAULT_ALPHA;
}
newAlpha = lastVal + dpcm_alpha[pair][0];
if(newAlpha >= 64) {
    newAlpha -= 64;
}
for (band=0; band < numMaskBands; band++){
    /* set all angles to fullband angle */
    pairAlpha[pair][band] = newAlpha;
    /* set previous angles according to mctMask */
    if(mctMask[pair][band] > 0) {
        alpha_prev_frame[pair][band%mctBandsPerWindow] = newAlpha;
    }
    else {
        alpha_prev_frame[pair][band%mctBandsPerWindow] = DEFAULT_ALPHA;
    }
}
alpha_prev_fullband[pair] = newAlpha;
for(band=bandsPerWindow ; band 0) {
        lastVal = alpha_prev_frame[pair][band%mctBandsPerWindow];
    }
    else {
        if ((band % mctBandsPerWindow) == 0) {
            lastVal = DEFAULT_ALPHA;
        }
    }
    if (msMask[pair][band] > 0 ) {
        newAlpha = lastVal + dpcm_alpha[pair][band];
        if(newAlpha >= 64) {
            newAlpha -= 64;
        }
        pairAlpha[pair][band] = newAlpha;
        alpha_prev_frame[pair][band%mctBandsPerWindow] = newAlpha;
        lastVal = newAlpha;
    }
    else {
        alpha_prev_frame[pair][band%mctBandsPerWindow] = DEFAULT_ALPHA; /* -45° */
    }
    /* reset fullband angle */
    alpha_prev_fullband[pair] = DEFAULT_ALPHA;
}
for(band=bandsPerWindow ; bandnumPairs; pair++) {
mctBandOffset = 0;
/* inverse MCT rotation */
for (win = 0, group = 0; group MCCSignalingType == 0) {
    }
    else if (self->MCCSignalingType == 1) {
        /* apply fullband box */
        if (!self->bHasBandwiseAngles[pair] && !self->bHasMctMask[pair]) {
            apply_mct_rotation(dmx, res, alphaSfb[0], nSamples);
        }
        else {
            /* apply bandwise processing */
            for (i = 0; i< mctBandsPerWindow; i++) {
                if (mctMask[i] == 1) {
                    startLine = swb_offset [sfb];
stopLine = (sfb+2= totalSfb) {
                    break;
                }
            }
        }
    }
    else if (self->MCCSignalingType == 2) {
    }
    else if (self->MCCSignalingType == 3) {
        apply_mct_rotation(dmx, res, alpha, nSamples);
    }
}
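The differential update of the angle indices used in the listings above (newAlpha = lastVal + dpcm_alpha, with wrap-around at 64) can be isolated in a small sketch. The function names, the restriction of the differential values to the range 0..63, and the assignment of default_alpha to bands without a transmitted value are assumptions made for this example. Only the frequency-direction case within a single window is shown.

```c
#include <assert.h>

#define NUM_ANGLE_STEPS 64 /* angle indices wrap at 64, as in the listings above */

/* Adds a differential value to the previous angle index, with wrap-around.
 * Assumes both inputs are already in the range 0..63. */
static int update_angle_index(int last_val, int dpcm_alpha)
{
    assert(last_val >= 0 && last_val < NUM_ANGLE_STEPS);
    assert(dpcm_alpha >= 0 && dpcm_alpha < NUM_ANGLE_STEPS);
    int new_alpha = last_val + dpcm_alpha;
    if (new_alpha >= NUM_ANGLE_STEPS) {
        new_alpha -= NUM_ANGLE_STEPS;
    }
    return new_alpha;
}

/* Hypothetical helper: decodes the angle indices of one channel pair along
 * the frequency axis. Bands with mask[band] == 0 carry no differential value;
 * they receive default_alpha here, which is an assumption of this sketch. */
static void decode_angles_along_frequency(const int *dpcm_alpha,
                                          const unsigned char *mask,
                                          int num_bands, int default_alpha,
                                          int *pair_alpha)
{
    int last_val = default_alpha;
    for (int band = 0; band < num_bands; band++) {
        if (mask[band] > 0) {
            last_val = update_angle_index(last_val, dpcm_alpha[band]);
            pair_alpha[band] = last_val;
        } else {
            pair_alpha[band] = default_alpha;
        }
    }
}
```

Because the sum of two indices below 64 is always below 128, a single conditional subtraction is sufficient to keep the result in range.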
For an application of the KLT rotation, the following C code can be used.
apply_mct_rotation(*dmx, *res, alpha, nSamples)
{
for (n=0;n
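A plane rotation of a downmix/residual pair, as a KLT rotation of two channels amounts to, can be sketched as follows. This is not the body of apply_mct_rotation, which is not reproduced here: the function name rotate_pair, the sign convention, and passing the cosine and sine of the rotation angle directly (for example from a lookup table indexed by the angle index) are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Rotates each sample pair (dmx[n], res[n]) in place by the angle whose
 * cosine and sine are given. The sign convention is an assumption. */
static void rotate_pair(float *dmx, float *res,
                        float cos_alpha, float sin_alpha, size_t n_samples)
{
    assert(dmx != NULL && res != NULL);
    for (size_t n = 0; n < n_samples; n++) {
        const float first  = dmx[n] * cos_alpha - res[n] * sin_alpha;
        const float second = dmx[n] * sin_alpha + res[n] * cos_alpha;
        dmx[n] = first;  /* both outputs are computed before overwriting */
        res[n] = second;
    }
}
```

Both outputs are computed from the unmodified inputs before either buffer is written, which is what allows the rotation to operate in place.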