Abstract:
An audio decoder for decoding a multi-audio-object signal
having an audio signal of a first type and an audio signal
of a second type encoded therein is described, the multi-
audio-object signal consisting of a downmix signal (56) and
side information (58), the side information comprising
level information (60) of the audio signal of the first
type and the audio signal of the second type in a first
predetermined time/frequency resolution (42), and a
residual signal (62) specifying residual level values in a
second predetermined time/frequency resolution, the audio
decoder comprising means (52) for computing prediction
coefficients (64) based on the level information (60); and
means (54) for up-mixing the downmix signal (56) based on
the prediction coefficients (64) and the residual signal
(62) to obtain a first up-mix audio signal approximating
the audio signal of the first type and/or a second up-mix
audio signal approximating the audio signal of the second
type.
AM EUROPAKANAL 36, APP. 11 91056 ERLANGEN / GERMANY
4. ANDREAS HOELZER
OBERE KARLSTRASSE 23 91054 ERLANGEN / GERMANY
5. CORNELIA FALCH
KASSELER STRASSE 12 90491 NÜRNBERG / GERMANY
6. JOHANNES HILPERT
HERRNHUETTESTRASSE 46 90411 NÜRNBERG / GERMANY
Specification
Audio Coding using Downmix
Description
The present application is concerned with audio coding
using down-mixing of signals.
Many audio encoding algorithms have been proposed in order
to effectively encode or compress audio data of one
channel, i.e., mono audio signals. Using psychoacoustics,
audio samples are appropriately scaled, quantized or even
set to zero in order to remove irrelevancy from, for
example, the PCM coded audio signal. Redundancy removal is
also performed.
As a further step, the similarity between the left and
right channel of stereo audio signals has been exploited in
order to effectively encode/compress stereo audio signals.
However, upcoming applications pose further demands on
audio coding algorithms. For example, in teleconferencing,
computer games, music performance and the like, several
audio signals which are partially or even completely
uncorrelated have to be transmitted in parallel. In order
to keep the bit rate necessary for encoding these audio
signals low enough to be compatible with low-bit-rate
transmission applications, audio codecs have recently
been proposed which downmix the multiple input audio
signals into a downmix signal, such as a stereo or even
mono downmix signal. For example, the MPEG Surround
standard downmixes the input channels into the downmix
signal in a manner prescribed by the standard. The
downmixing is performed by use of so-called OTT⁻¹ and TTT⁻¹
boxes for downmixing two signals into one and three signals
into two, respectively. In order to downmix more than three
signals, a hierarchic structure of these boxes is used.
Each OTT⁻¹ box outputs, besides the mono downmix signal,
channel level differences between the two input channels,
as well as inter-channel coherence/cross-correlation
parameters representing the coherence or cross-correlation
between the two input channels. The parameters are output
along with the downmix signal of the MPEG Surround coder
within the MPEG Surround data stream. Similarly, each TTT⁻¹
box transmits channel prediction coefficients enabling
recovering the three input channels from the resulting
stereo downmix signal. The channel prediction coefficients
are also transmitted as side information within the MPEG
Surround data stream. The MPEG Surround decoder upmixes the
downmix signal by use of the transmitted side information
and recovers the original channels input into the MPEG
Surround encoder.
However, MPEG Surround, unfortunately, does not fulfill all
requirements posed by many applications. For example, the
MPEG Surround decoder is dedicated for upmixing the downmix
signal of the MPEG Surround encoder such that the input
channels of the MPEG Surround encoder are recovered as they
are. In other words, the MPEG Surround data stream is
dedicated to be played back by use of the loudspeaker
configuration having been used for encoding.
However, for some applications it would be favorable if
the loudspeaker configuration could be changed at the
decoder's side.
In order to address the latter needs, the spatial audio
object coding (SAOC) standard is currently designed. Each
channel is treated as an individual object, and all objects
are downmixed into a downmix signal. In addition, the
individual objects may also comprise individual sound
sources such as, e.g., instruments or vocal tracks.
Differing from the MPEG Surround decoder, however, the SAOC decoder
is free to individually upmix the downmix signal to replay
the individual objects onto any loudspeaker configuration.
In order to enable the SAOC decoder to recover the
individual objects having been encoded into the SAOC data
stream, object level differences and, for objects forming
together a stereo (or multi-channel) signal, inter-object
cross correlation parameters are transmitted as side
information within the SAOC bitstream. Besides this, the
SAOC decoder/transcoder is provided with information
revealing how the individual objects have been downmixed
into the downmix signal. Thus, on the decoder's side, it is
possible to recover the individual SAOC channels and to
render these signals onto any loudspeaker configuration by
utilizing user-controlled rendering information.
However, although the SAOC codec has been designed for
individually handling audio objects, some applications are
even more demanding. For example, Karaoke applications
require a complete separation of the background audio
signal from the foreground audio signal or foreground audio
signals. Vice versa, in the solo mode, the foreground
objects have to be separated from the background object.
However, owing to the equal treatment of the individual
audio objects it was not possible to completely remove the
background objects or the foreground objects, respectively,
from the downmix signal.
Thus, it is the object of the present invention to provide
an audio codec using downmixing of audio signals such that
a better separation of individual objects such as, for
example, in a Karaoke/solo mode application, is achieved.
This object is achieved by an audio decoder according to
claim 1, an audio encoder according to claim 18, a decoding
method according to claim 20, an encoding method according
to claim 21, and a multi-audio-object signal according to
claim 23.
Referring to the Figs., preferred embodiments of the
present application are described in more detail. Among
these Figs.,
Fig. 1 shows a block diagram of an SAOC encoder/decoder
arrangement in which the embodiments of the
present invention may be implemented;
Fig. 2 shows a schematic and illustrative diagram of a
spectral representation of a mono audio signal;
Fig. 3 shows a block diagram of an audio decoder
according to an embodiment of the present
invention;
Fig. 4 shows a block diagram of an audio encoder
according to an embodiment of the present
invention;
Fig. 5 shows a block diagram of an audio encoder/decoder
arrangement for Karaoke/Solo mode application, as
a comparison embodiment;
Fig. 6 shows a block diagram of an audio encoder/decoder
arrangement for Karaoke/Solo mode application
according to an embodiment;
Fig. 7a shows a block diagram of an audio encoder for a
Karaoke/Solo mode application, according to a
comparison embodiment;
Fig. 7b shows a block diagram of an audio encoder for a
Karaoke/Solo mode application, according to an
embodiment;
Fig. 8a and b show plots of quality measurement results;
Fig. 9 shows a block diagram of an audio encoder/decoder
arrangement for Karaoke/Solo mode application,
for comparison purposes;
Fig. 10 shows a block diagram of an audio encoder/decoder
arrangement for Karaoke/Solo mode application
according to an embodiment;
Fig. 11 shows a block diagram of an audio encoder/decoder
arrangement for Karaoke/Solo mode application
according to a further embodiment;
Fig. 12 shows a block diagram of an audio encoder/decoder
arrangement for Karaoke/Solo mode application
according to a further embodiment;
Fig. 13a to h show tables reflecting a possible syntax for
the SAOC bitstream according to an embodiment of
the present invention;
Fig. 14 shows a block diagram of an audio decoder for a
Karaoke/Solo mode application, according to an
embodiment; and
Fig. 15 shows a table reflecting a possible syntax for
signaling the amount of data spent for
transferring the residual signal.
Before embodiments of the present invention are described
in more detail below, the SAOC codec and the SAOC
parameters transmitted in an SAOC bitstream are presented
in order to ease the understanding of the specific
embodiments outlined in further detail below.
Fig. 1 shows a general arrangement of an SAOC encoder 10
and an SAOC decoder 12. The SAOC encoder 10 receives as an
input N objects, i.e., audio signals 14_1 to 14_N. In
particular, the encoder 10 comprises a downmixer 16 which
receives the audio signals 14_1 to 14_N and downmixes same to
a downmix signal 18. In Fig. 1, the downmix signal is
exemplarily shown as a stereo downmix signal. However, a
mono downmix signal is possible as well. The channels of
the stereo downmix signal 18 are denoted L0 and R0; in case
of a mono downmix, same is simply denoted L0. In order to
enable the SAOC decoder 12 to recover the individual
objects 14_1 to 14_N, downmixer 16 provides the SAOC decoder
12 with side information including SAOC-parameters
including object level differences (OLD), inter-object
cross correlation parameters (IOC), downmix gain values
(DMG) and downmix channel level differences (DCLD). The
side information 20 including the SAOC-parameters, along
with the downmix signal 18, forms the SAOC output data
stream received by the SAOC decoder 12.
The SAOC decoder 12 comprises an upmixer 22 which receives
the downmix signal 18 as well as the side information 20 in
order to recover and render the audio signals 14_1 to 14_N
onto any user-selected set of channels 24_1 to 24_M, with the
rendering being prescribed by rendering information 26
input into SAOC decoder 12.
The audio signals 14_1 to 14_N may be input into the
downmixer 16 in any coding domain, such as, for example, in
the time or spectral domain. In case the audio signals 14_1 to
14_N are fed into the downmixer 16 in the time domain, such
as PCM coded, downmixer 16 uses a filter bank, such as a
hybrid QMF bank, i.e., a bank of complex exponentially
modulated filters with a Nyquist filter extension for the
lowest frequency bands to increase the frequency resolution
therein, in order to transfer the signals into spectral
domain in which the audio signals are represented in
several subbands associated with different spectral
portions, at a specific filter bank resolution. If the
audio signals 14_1 to 14_N are already in the representation
expected by downmixer 16, same does not have to perform the
spectral decomposition.
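For illustration, the spectral decomposition just described can be sketched with a short-time DFT standing in for the hybrid QMF bank; the function name, band count and slot length below are assumptions of this sketch, not part of the codec:

```python
import numpy as np

def spectral_decompose(x, num_bands=8, slot_len=64):
    """Crude stand-in for the hybrid QMF bank: one DFT per filter-bank
    time slot, keeping the first num_bands bins as the subband values."""
    num_slots = len(x) // slot_len
    frames = x[:num_slots * slot_len].reshape(num_slots, slot_len)
    spec = np.fft.rfft(frames, axis=1)[:, :num_bands]
    return spec.T  # shape: (subbands, filter-bank time slots)

x = np.random.randn(1024)
X = spectral_decompose(x)
print(X.shape)  # (8, 16)
```

The resulting array corresponds to the grid of subband values shown in Fig. 2: one complex value per subband and filter bank time slot.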
Fig. 2 shows an audio signal in the just-mentioned spectral
domain. As can be seen, the audio signal is represented as
a plurality of subband signals. Each subband signal 30_1 to
30_P consists of a sequence of subband values indicated by
the small boxes 32. As can be seen, the subband values 32
of the subband signals 30_1 to 30_P are synchronized to each
other in time so that for each of the consecutive filter bank
time slots 34, each subband signal 30_1 to 30_P comprises
exactly one subband value 32. As illustrated by the frequency
axis 36, the subband signals 30_1 to 30_P are associated with
different frequency regions, and as illustrated by the time
axis 38, the filter bank time slots 34 are consecutively
arranged in time.
As outlined above, downmixer 16 computes SAOC-parameters
from the input audio signals 14_1 to 14_N. Downmixer 16
performs this computation in a time/frequency resolution
which may be decreased relative to the original
time/frequency resolution as determined by the filter bank
time slots 34 and subband decomposition, by a certain
amount, with this certain amount being signaled to the
decoder side within the side information 20 by respective
syntax elements bsFrameLength and bsFreqRes. For example,
groups of consecutive filter bank time slots 34 may form a
frame 40. In other words, the audio signal may be divided-
up into frames overlapping in time or being immediately
adjacent in time, for example. In this case, bsFrameLength
may define the number of parameter time slots 41, i.e. the
time unit at which the SAOC parameters such as OLD and IOC,
are computed in an SAOC frame 40 and bsFreqRes may define
the number of processing frequency bands for which SAOC
parameters are computed. By this measure, each frame is
divided-up into time/frequency tiles exemplified in Fig. 2
by dashed lines 42.
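The partitioning of a frame into parameter time slots and processing bands can be sketched as follows; an even split is assumed here for illustration, whereas the actual bsFrameLength/bsFreqRes grouping in the bitstream may be non-uniform:

```python
import numpy as np

def tile_boundaries(num_slots, num_bands, num_param_slots, num_param_bands):
    """Partition a frame of filter-bank time slots x subbands into
    time/frequency tiles; an even split is assumed for illustration."""
    t_edges = np.linspace(0, num_slots, num_param_slots + 1, dtype=int)
    f_edges = np.linspace(0, num_bands, num_param_bands + 1, dtype=int)
    return [((int(t_edges[i]), int(t_edges[i + 1])),
             (int(f_edges[j]), int(f_edges[j + 1])))
            for i in range(num_param_slots)
            for j in range(num_param_bands)]

tiles = tile_boundaries(num_slots=16, num_bands=8,
                        num_param_slots=2, num_param_bands=4)
print(len(tiles))  # 8 tiles: 2 parameter time slots x 4 processing bands
```

Each returned pair of (time, frequency) index ranges corresponds to one of the tiles 42 for which one set of SAOC parameters is computed.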
The downmixer 16 calculates SAOC parameters according to
the following formulas. In particular, downmixer 16
computes object level differences OLD_i for each object i as

    OLD_i = ( Σ_n Σ_k |x_i^{n,k}|² ) / max_j ( Σ_n Σ_k |x_j^{n,k}|² ),

wherein the sums and the indices n and k, respectively, go
through all filter bank time slots 34 and all filter bank
subbands 30 which belong to a certain time/frequency tile
42. Thereby, the energies of all subband values x_i of an
audio signal or object i are summed up and normalized to
the highest energy value of that tile among all objects or
audio signals.
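A minimal sketch of this OLD computation, assuming the subband values of one tile are given as a complex array per object (function and variable names are illustrative):

```python
import numpy as np

def object_level_differences(X):
    """X: complex subband values of shape (num_objects, subbands, slots)
    for one time/frequency tile. Sums the energies over k and n and
    normalizes to the loudest object in the tile (sketch)."""
    energy = np.sum(np.abs(X) ** 2, axis=(1, 2))
    return energy / np.max(energy)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 8)) + 1j * rng.standard_normal((3, 4, 8))
old = object_level_differences(X)
print(old.max())  # 1.0: the dominant object's OLD is one by construction
```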
Further the SAOC downmixer 16 is able to compute a
similarity measure of the corresponding time/frequency
tiles of pairs of different input objects 14_1 to 14_N.
Although the SAOC downmixer 16 may compute the similarity
measure between all the pairs of input objects 14_1 to 14_N,
downmixer 16 may also suppress the signaling of the
similarity measures or restrict the computation of the
similarity measures to audio objects 14_1 to 14_N which form
left or right channels of a common stereo channel. In any
case, the similarity measure is called the inter-object
cross-correlation parameter IOC_{i,j}. The computation is as
follows:

    IOC_{i,j} = Re{ ( Σ_n Σ_k x_i^{n,k} (x_j^{n,k})* ) /
                    √( Σ_n Σ_k |x_i^{n,k}|² · Σ_n Σ_k |x_j^{n,k}|² ) },

with again indexes n and k going through all subband values
belonging to a certain time/frequency tile 42, and i and j
denoting a certain pair of audio objects 14_1 to 14_N.
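The IOC computation may be sketched as follows; taking the real part of the normalized cross-correlation is an assumption consistent with the description above:

```python
import numpy as np

def inter_object_correlation(Xi, Xj):
    """Normalized cross-correlation of two objects' subband values over one
    time/frequency tile, real part taken (sketch, not the normative form)."""
    num = np.sum(Xi * np.conj(Xj))
    den = np.sqrt(np.sum(np.abs(Xi) ** 2) * np.sum(np.abs(Xj) ** 2))
    return float(np.real(num / den))

rng = np.random.default_rng(1)
Xi = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
print(round(inter_object_correlation(Xi, Xi), 6))  # 1.0 for an object with itself
```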
The downmixer 16 downmixes the objects 14_1 to 14_N by use of
gain factors applied to each object 14_1 to 14_N. That is, a
gain factor D_i is applied to object i, and then all thus
weighted objects 14_1 to 14_N are summed up to obtain a mono
downmix signal. In the case of a stereo downmix signal,
which case is exemplified in Fig. 1, a gain factor D_{1,i} is
applied to object i and then all such gain-amplified
objects are summed up in order to obtain the left downmix
channel L0, and gain factors D_{2,i} are applied to object i
and then the thus gain-amplified objects are summed up in
order to obtain the right downmix channel R0.
This downmix prescription is signaled to the decoder side
by means of downmix gains DMG_i and, in case of a stereo
downmix signal, downmix channel level differences DCLD_i.
The downmix gains are calculated according to:

    DMG_i = 20 log₁₀( D_i + ε )                  (mono downmix), or
    DMG_i = 10 log₁₀( D_{1,i}² + D_{2,i}² + ε )  (stereo downmix),

where ε is a small number such as 10⁻⁹.

For the DCLD_i, the following formula applies:

    DCLD_i = 20 log₁₀( ( D_{1,i} + ε ) / ( D_{2,i} + ε ) ).

In the normal mode, downmixer 16 generates the downmix
signal according to:

    L0 = Σ_i D_i x_i                              (mono downmix), or

    L0 = Σ_i D_{1,i} x_i ,   R0 = Σ_i D_{2,i} x_i

for a stereo downmix, respectively.
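The downmix and the signaled DMG/DCLD parameters can be sketched as follows; the exact logarithmic formulas and the value of ε are assumptions along the lines of the description above:

```python
import numpy as np

def downmix_stereo(X, D):
    """X: (num_objects, subbands, slots) subband values; D: (2, num_objects)
    gain factors. Returns the stereo downmix channels (L0, R0) (sketch)."""
    return np.tensordot(D, X, axes=(1, 0))

def dmg_dcld(D, eps=1e-9):
    """Signaled downmix gains and channel level differences; the log-domain
    formulas and eps are assumptions in line with the text."""
    d1, d2 = D[0], D[1]
    dmg = 10 * np.log10(d1 ** 2 + d2 ** 2 + eps)
    dcld = 20 * np.log10((d1 + eps) / (d2 + eps))
    return dmg, dcld

D = np.array([[1.0, 0.5, 0.7],   # gains into L0 for 3 objects
              [0.0, 0.5, 0.7]])  # gains into R0
X = np.ones((3, 4, 8))
L0R0 = downmix_stereo(X, D)
print(L0R0.shape)  # (2, 4, 8)
```

Note that D may vary in time, as stated below; this sketch uses a single constant matrix for simplicity.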
Thus, in the abovementioned formulas, parameters OLD and
IOC are a function of the audio signals and parameters DMG
and DCLD are a function of D. By the way, it is noted that
D may be varying in time.
Thus, in the normal mode, downmixer 16 mixes all objects
14_1 to 14_N with no preferences, i.e., handling all
objects 14_1 to 14_N equally.
The upmixer 22 performs the inversion of the downmix
procedure and the implementation of the "rendering
information" represented by matrix A in one computation
step, namely

    output = A · E · Dᴴ · ( D · E · Dᴴ )⁻¹ · d,

where matrix E is a function of the parameters OLD and IOC,
and d denotes the downmix signal 18.
In other words, in the normal mode, no classification of
the objects 14_1 to 14_N into BGO, i.e., background object,
or FGO, i.e., foreground object, is performed. The
information as to which object shall be presented at the
output of the upmixer 22 is to be provided by the rendering
matrix A. If, for example, the object with index 1 was the
left channel of a stereo background object, the object with
index 2 was the right channel thereof, and the object with
index 3 was the foreground object, then rendering matrix A
would be

    A = ( 1  0  0
          0  1  0 )

to produce a Karaoke-type of output signal.
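The normal-mode computation may be sketched numerically as follows; a common least-squares form, out = A E Dᴴ (D E Dᴴ)⁻¹ d, is assumed, and E is passed in directly rather than being derived from OLD/IOC:

```python
import numpy as np

def normal_mode_upmix(d, D, E, A):
    """Sketch of a least-squares normal-mode up-mix,
    out = A @ E @ D^H @ inv(D @ E @ D^H) @ d, with E built from OLD/IOC
    in the codec but passed in directly here."""
    G = E @ D.conj().T @ np.linalg.inv(D @ E @ D.conj().T)
    return A @ G @ d

# 3 objects, stereo downmix; the rendering keeps objects 1 and 2 (the
# stereo BGO) and drops object 3 (the FGO), Karaoke-style.
D = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7]])
E = np.eye(3)                      # uncorrelated, equal-level objects
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
d = np.array([1.0, 2.0])
print(normal_mode_upmix(d, D, E, A).shape)  # (2,)
```

Because the object estimate is only a least-squares approximation, the FGO leaks into the BGO estimate; this is exactly the cross-talk problem the residual signal of the following embodiments addresses.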
However, as already indicated above, transmitting BGO and
FGO by use of this normal mode of the SAOC codec does not
achieve acceptable results.
Figs. 3 and 4 describe an embodiment of the present
invention which overcomes the deficiency just described.
The decoder and encoder described in these Figs. and their
associated functionality may represent an additional mode
such as an "enhanced mode" into which the SAOC codec of
Fig. 1 could be switchable. Examples for the latter
possibility will be presented hereinafter.
Fig. 3 shows a decoder 50. The decoder 50 comprises means
52 for computing prediction coefficients and means 54 for
upmixing a downmix signal.
The audio decoder 50 of Fig. 3 is dedicated for decoding a
multi-audio-object signal having an audio signal of a first
type and an audio signal of a second type encoded therein.
The audio signal of the first type and the audio signal of
the second type may be a mono or stereo audio signal,
respectively. The audio signal of the first type is, for
example, a background object whereas the audio signal of
the second type is a foreground object. That is, the
embodiment of Fig. 3 and Fig. 4 is not necessarily
restricted to Karaoke/Solo mode applications. Rather, the
decoder of Fig. 3 and the encoder of Fig. 4 may be
advantageously used elsewhere.
The multi-audio-object signal consists of a downmix signal
56 and side information 58. The side information 58
comprises level information 60 describing, for example,
spectral energies of the audio signal of the first type and
the audio signal of the second type in a first
predetermined time/frequency resolution such as, for
example, the time/frequency resolution 42. In particular,
the level information 60 may comprise a normalized spectral
energy scalar value per object and time/frequency tile. The
normalization may be related to the highest spectral energy
value among the audio signals of the first and second type
at the respective time/frequency tile. The latter
possibility results in OLDs for representing the level
information, also called level difference information
herein. Although the following embodiments use OLDs, they
may, although not explicitly stated there, use an otherwise
normalized spectral energy representation.
The side information 58 comprises also a residual signal 62
specifying residual level values in a second predetermined
time/frequency resolution which may be equal to or
different to the first predetermined time/frequency
resolution.
The means 52 for computing prediction coefficients is
configured to compute prediction coefficients based on the
level information 60. Additionally, means 52 may compute
the prediction coefficients further based on inter-
correlation information also comprised by side information
58. Even further, means 52 may use time varying downmix
prescription information comprised by side information 58
to compute the prediction coefficients. The prediction
coefficients computed by means 52 are necessary for
retrieving or upmixing the original audio objects or audio
signals from the downmix signal 56.
Accordingly, means 54 for upmixing is configured to upmix
the downmix signal 56 based on the prediction coefficients
64 received from means 52 and the residual signal 62. By
using the residual 62, decoder 50 is able to better
suppress cross talks from the audio signal of one type to
the audio signal of the other type. In addition to the
residual signal 62, means 54 may use the time varying
downmix prescription to upmix the downmix signal. Further,
means 54 for upmixing may use user input 66 in order to
decide which of the audio signals recovered from the
downmix signal 56 to be actually output at output 68 or to
what extent. As a first extreme, the user input 66 may
instruct means 54 to merely output the first up-mix signal
approximating the audio signal of the first type. The
opposite is true for the second extreme according to which
means 54 is to output merely the second up-mix signal
approximating the audio signal of the second type.
Intermediate options are possible as well, according to
which a mixture of both up-mix signals is rendered and
output at output 68.
Fig. 4 shows an embodiment for an audio encoder suitable
for generating a multi-audio object signal decoded by the
decoder of Fig. 3. The encoder of Fig. 4, which is indicated
by reference sign 80, may comprise means 82 for spectrally
decomposing in case the audio signals 84 to be encoded are
not within the spectral domain. Among the audio signals 84,
in turn, there is at least one audio signal of a first type
and at least one audio signal of a second type. The means
82 for spectrally decomposing is configured to spectrally
decompose each of these signals 84 into a representation as
shown in Fig. 2, for example. That is, the means 82 for
spectrally decomposing spectrally decomposes the audio
signals 84 at a predetermined time/frequency resolution.
Means 82 may comprise a filter bank, such as a hybrid QMF
bank.
The audio encoder 80 further comprises means 86 for
computing level information, means 88 for downmixing, means
90 for computing prediction coefficients and means 92 for
setting a residual signal. Additionally, audio encoder 80
may comprise means for computing inter-correlation
information, namely means 94. Means 86 computes level
information describing the level of the audio signal of the
first type and the audio signal of the second type in the
first predetermined time/frequency resolution from the
audio signal as optionally output by means 82. Similarly,
means 88 downmixes the audio signals. Means 88 thus outputs
the downmix signal 56. Means 86 also outputs the level
information 60. Means 90 for computing prediction
coefficients acts similarly to means 52. That is, means 90
computes prediction coefficients from the level information
60 and outputs the prediction coefficients 64 to means 92.
Means 92, in turn, sets the residual signal 62 based on the
downmix signal 56, the prediction coefficients 64 and the
original audio signals at a second predetermined
time/frequency resolution such that up-mixing the downmix
signal 56 based on both the prediction coefficients 64 and
the residual signal 62 results in a first up-mix audio
signal approximating the audio signal of the first type and
the second up-mix audio signal approximating the audio
signal of the second type, the approximation being improved
compared to the absence of the residual signal 62.
The residual signal 62 and the level information 60 are
comprised by the side information 58 which forms, along
with the downmix signal 56, the multi-audio-object signal
to be decoded by the decoder of Fig. 3.
As shown in Fig. 4, and analogous to the description of
Fig. 3, means 90 may additionally use the inter-correlation
information output by means 94 and/or time varying downmix
prescription output by means 88 to compute the prediction
coefficients 64. Further, means 92 for setting the
residual signal 62 may additionally use the time varying
downmix prescription output by means 88 in order to
appropriately set the residual signal 62.
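The role of means 92 can be illustrated for the simplest case of one mono BGO, one mono FGO, a unit-gain mono downmix and a given prediction coefficient; all of these are assumptions of this sketch, not the normative processing:

```python
import numpy as np

def set_residual(s1, s2, c):
    """One mono BGO s1, one mono FGO s2, unit-gain mono downmix d = s1 + s2
    and a given prediction coefficient c (all assumptions of this sketch).
    The residual is whatever the prediction-based up-mix would miss."""
    d = s1 + s2              # downmix signal 56
    predicted = c * d        # decoder-side prediction of s1 from d
    res = s1 - predicted     # residual signal 62 set by the encoder
    return d, res

rng = np.random.default_rng(2)
s1, s2 = rng.standard_normal(8), rng.standard_normal(8)
d, res = set_residual(s1, s2, c=0.5)
# Decoder side: prediction plus residual reproduces s1.
print(np.allclose(0.5 * d + res, s1))  # True
```

In the codec the residual is of course quantized and band-limited, so the reconstruction is improved rather than perfect; the sketch only shows why adding the residual reduces the cross-talk left by the prediction.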
Again, it is noted that the audio signal of the first type
may be a mono or stereo audio signal. The same applies for
the audio signal of the second type. The residual signal 62
may be signaled within the side information in the same
time/frequency resolution as the parameter time/frequency
resolution used to compute, for example, the level
information, or a different time/frequency resolution may
be used. Further, it may be possible that the signaling of
the residual signal is restricted to a sub-portion of the
spectral range occupied by the time/frequency tiles 42 for
which level information is signaled. For example, the
time/frequency resolution at which the residual signal is
signaled, may be indicated within the side information 58
by use of syntax elements bsResidualBands and
bsResidualFramesPerSAOCFrame. These two syntax elements may
define another sub-division of a frame into time/frequency
tiles than the sub-division leading to tiles 42.
By the way, it is noted that the residual signal 62 may or
may not reflect information loss resulting from a
potentially used core encoder 96 optionally used to encode
the downmix signal 56 by audio encoder 80. As shown in Fig.
4, means 92 may perform the setting of the residual signal
62 based on the version of the downmix signal re-
constructible from the output of core coder 96 or from the
version input into core encoder 96. Similarly, the audio
decoder 50 may comprise a core decoder 98 to decode or
decompress downmix signal 56.
The ability to set, within the multiple-audio-object
signal, the time/frequency resolution used for the residual
signal 62 different from the time/frequency resolution used
for computing the level information 60 enables to achieve a
good compromise between audio quality on the one hand and
compression ratio of the multiple-audio-object signal on
the other hand. In any case, the residual signal 62 enables
to better suppress cross-talk from one audio signal to the
other within the first and second up-mix signals to be
output at output 68 according to the user input 66.
As will become clear from the following embodiment, more
than one residual signal 62 may be transmitted within the
side information in case more than one foreground object or
audio signal of the second type is encoded. The side
information may allow for an individual decision as to
whether a residual signal 62 is transmitted for a specific
audio signal of a second type or not. Thus, the number of
residual signals 62 may vary from one up to the number of
audio signals of the second type.
In the audio decoder of Fig. 3, the means 52 for computing
may be configured to compute a prediction coefficient
matrix C consisting of the prediction coefficients based on
the level information (OLD), and means 54 may be configured
to yield the first up-mix signal S1 and/or the second up-
mix signal S2 from the downmix signal d according to a
computation representable by

    ( Ŝ₁ , Ŝ₂ )ᵀ = D⁻¹ · ( 1 , C )ᵀ · d + H,
where the "1" denotes (depending on the number of channels
of d) a scalar or an identity matrix, and D⁻¹ is a matrix
uniquely determined by a downmix prescription according to
which the audio signal of the first type and the audio
signal of the second type are downmixed into the downmix
signal, and which is also comprised by the side
information, and H is a term being independent of d but
dependent on the residual signal.
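A numeric sketch of this up-mix structure for a mono downmix, with the "1"/C stacking, an inverse downmix matrix and a residual-dependent term H written out; the names and the exact arrangement are illustrative assumptions, not the normative computation:

```python
import numpy as np

def enhanced_upmix(d, C, Dinv, H):
    """Stack the mono downmix d with its predicted counterpart C*d, add the
    residual-dependent term H and apply the inverse downmix matrix Dinv.
    The exact normative arrangement may differ; this mirrors the text."""
    stacked = np.stack([d, C * d]) + H   # (1, C)^T * d + H
    return Dinv @ stacked                # -> (S1_hat, S2_hat)

d = np.array([1.0, 2.0, 3.0])            # mono downmix, 3 subband values
C = 0.5                                  # prediction coefficient from OLDs
H = np.zeros((2, 3))                     # residual contribution, zero here
Dinv = np.linalg.inv(np.array([[1.0, 1.0],
                               [1.0, -1.0]]))
S_hat = enhanced_upmix(d, C, Dinv, H)
print(S_hat.shape)  # (2, 3): one up-mix signal per row
```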
As noted above and described further below, the downmix
prescription may vary in time and/or may spectrally vary
within the side information. If the audio signal of the
first type is a stereo audio signal having a first (L) and
a second input channel (R), the level information, for
example, describes normalized spectral energies of the
first input channel (L), the second input channel (R) and
the audio signal of the second type, respectively, at the
time/frequency resolution 42.
The aforementioned computation according to which the means
54 for up-mixing performs the up-mixing may even be
representable by

wherein L̂ is a first channel of the first up-mix signal,
approximating L, and R̂ is a second channel of the first up-
mix signal, approximating R, and the "1" is a scalar in
case d is mono, and a 2x2 identity matrix in case d is
stereo. If the downmix signal 56 is a stereo audio signal
having a first (L0) and a second output channel (R0), the
computation according to which the means 54 for up-mixing
performs the up-mixing may be representable by
As far as the term H being dependent on the residual signal
res is concerned, the computation according to which the
means 54 for up-mixing performs the up-mixing may be
representable by
The multi-audio-object signal may even comprise a plurality
of audio signals of the second type and the side
information may comprise one residual signal per audio
signal of the second type. A residual resolution parameter
may be present in the side information defining a spectral
range over which the residual signal is transmitted within
the side information. It may even define a lower and an
upper limit of the spectral range.
Further, the multi-audio-object signal may also comprise
spatial rendering information for spatially rendering the
audio signal of the first type onto a predetermined
loudspeaker configuration. In other words, the audio signal
of the first type may be a multi-channel (more than two
channels) MPEG Surround signal downmixed down to stereo.
In the following, embodiments will be described which make
use of the above residual signal signaling. However, it is
noted that the term "object" is often used in a double
sense. Sometimes, an object denotes an individual mono
audio signal. Thus, a stereo object may have a mono audio
signal forming one channel of a stereo signal. In other
situations, however, a stereo object may denote, in fact, two
objects, namely an object concerning the right channel and
a further object concerning the left channel of the stereo
object. The actual sense will become apparent from the
context.
Before describing the next embodiment, same is motivated by
deficiencies realized with the baseline technology of the
SAOC standard selected as reference model 0 (RM0) in 2007.
The RM0 allowed the individual manipulation of a number of
sound objects in terms of their panning position and
amplification/attenuation. A special scenario has been
presented in the context of a "Karaoke" type application.
In this case
• a mono, stereo or surround background scene (in the
following called Background Object, BGO) is conveyed
from a set of certain SAOC objects, which is
reproduced without alteration, i.e. every input
channel signal is reproduced through the same output
channel at an unaltered level, and
• a specific object of interest (in the following called
Foreground Object FGO) (typically the lead vocal)
which is reproduced with alterations (the FGO is
typically positioned in the middle of the sound stage
and can be muted, i.e. attenuated heavily to allow
sing-along).
As it is visible from subjective evaluation procedures, and
could be expected from the underlying technology principle,
manipulations of the object position lead to high-quality
results, while manipulations of the object level are
generally more challenging. Typically, the higher the
additional signal amplification/attenuation is, the more
potential artefacts arise. In this sense, the Karaoke
scenario is extremely demanding since an extreme (ideally:
total) attenuation of the FGO is required.
The dual usage case is the ability to reproduce only the
FGO without the background/MBO, and is referred to in the
following as the solo mode.
It is noted, however, that if a surround background scene
is involved, it is referred to as a Multi-Channel
Background Object (MBO). The handling of the MBO is the
following, which is shown in Fig.5:
• The MBO is encoded using a regular 5-2-5 MPEG Surround
tree 102. This results in a stereo MBO downmix signal
104, and an MBO MPS side information stream 106.
• The MBO downmix is then encoded by a subsequent SAOC
encoder 108 as a stereo object (i.e. two object level
differences plus an inter-channel correlation), together
with the FGO 110 (or several FGOs). This results
in a common downmix signal 112, and a SAOC side
information stream 114.
In the transcoder 116, the downmix signal 112 is
preprocessed and the SAOC and MPS side information streams
106, 114 are transcoded into a single MPS output side
information stream 118. This currently happens in a
discontinuous way, i.e. only full suppression of either
the FGO(s) or the MBO is supported.
Finally, the resulting downmix 120 and MPS side information
118 are rendered by an MPEG Surround decoder 122.
In Fig. 5, both the MBO downmix 104 and the controllable
object signal(s) 110 are combined into a single stereo
downmix 112. This "pollution" of the downmix by the
controllable object 110 is the reason why it is difficult
to recover a Karaoke version of sufficiently high audio
quality, i.e. one with the controllable object 110
removed. The following proposal aims at circumventing this
problem.
Assuming one FGO (e.g. one lead vocal), the key observation
used by the following embodiment of Fig. 6 is that the SAOC
downmix signal is a combination of the BGO and the FGO
signal, i.e. three audio signals are downmixed and
transmitted via 2 downmix channels. Ideally, these signals
should be separated again in the transcoder in order to
produce a clean Karaoke signal (i.e. to remove the FGO
signal), or to produce a clean solo signal (i.e. to remove
the BGO signal). This is achieved, in accordance with the
embodiment of Fig. 6, by using a "two-to-three" (TTT)
encoder element 124 (TTT-1 as it is known from the MPEG
Surround specification) within SAOC encoder 108 to combine
the BGO and the FGO into a single SAOC downmix signal in
the SAOC encoder. Here, the FGO feeds the "center" signal
input of the TTT-1 box 124 while the BGO 104 feeds the
"left/right" TTT-1 inputs L, R. The transcoder 116 can then
produce approximations of the BGO 104 by using a TTT
decoder element 126 (TTT as it is known from MPEG
Surround), i.e. the "left/right" TTT outputs L,R carry an
approximation of the BGO, whereas the "center" TTT output C
carries an approximation of the FGO 110.
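Under simplifying assumptions (a symmetric 1/√2 center weight and full-band transmission of the residual signal 132; the actual MPEG Surround TTT matrices and coefficient handling differ in detail), the signal flow around the TTT-1 box 124 and the TTT box 126 may be sketched as:

```python
import numpy as np

W = 1.0 / np.sqrt(2.0)  # assumed symmetric center weight

def ttt_encode(l, r, c, c1, c2):
    """TTT-1: fold the FGO (center) c into the stereo BGO (l, r)."""
    l0 = l + W * c
    r0 = r + W * c
    res = c - (c1 * l0 + c2 * r0)  # residual signal 132
    return l0, r0, res

def ttt_decode(l0, r0, res, c1, c2):
    """TTT: recover approximations of the BGO (L, R) and FGO (C)."""
    c_hat = c1 * l0 + c2 * r0 + res  # prediction plus residual
    l_hat = l0 - W * c_hat
    r_hat = r0 - W * c_hat
    return l_hat, r_hat, c_hat
```

With the full residual available, this sketch reconstructs all three signals exactly; band-limiting or coarsely quantizing the residual trades bitrate against reconstruction quality.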
When comparing the embodiment of Fig. 6 with the embodiment
of an encoder and decoder of Figs. 3 and 4, reference sign
104 corresponds to the audio signal of the first type among
audio signals 84, means 82 is comprised by MPS encoder 102,
reference sign 110 corresponds to the audio signals of the
second type among audio signal 84, TTT-1 box 124 assumes
the responsibility for the functionalities of means 88 to
92, with the functionalities of means 86 and 94 being
implemented in SAOC encoder 108, reference sign 112
corresponds to reference sign 56, reference sign 114
corresponds to side information 58 less the residual signal
62, TTT box 126 assumes responsibility for the
functionality of means 52 and 54 with the functionality of
the mixing box 128 also being comprised by means 54.
Lastly, signal 120 corresponds to the signal output at
output 68. Further, it is noted that Fig. 6 also shows a
core coder/decoder path 131 for the transport of the down
mix 112 from SAOC encoder 108 to SAOC transcoder 116. This
core coder/decoder path 131 corresponds to the optional
core coder 96 and core decoder 98. As indicated in Fig. 6,
this core coder/decoder path 131 may also encode/compress
the side information transported from encoder 108 to
transcoder 116.
The advantages resulting from the introduction of the TTT
box of Fig. 6 will become clear by the following
description. For example, by
simply feeding the "left/right" TTT outputs L, R into
the MPS downmix 120 (and passing on the transmitted
MBO MPS bitstream 106 in stream 118), only the MBO is
reproduced by the final MPS decoder. This corresponds
to the Karaoke mode.
simply feeding the "center" TTT output C into left
and right MPS downmix 120 (and producing a trivial MPS
bitstream 118 that renders the FGO 110 to the desired
position and level), only the FGO 110 is reproduced by
the final MPS decoder 122. This corresponds to the
Solo mode.
The handling of the three TTT output signals L, R, C is
performed in the "mixing" box 128 of the SAOC transcoder
116.
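The routing performed by the mixing box 128 for the two modes described above can be illustrated as follows (the function name and mode strings are illustrative, not taken from any specification):

```python
def mixing_box(l, r, c, mode):
    """Route the three TTT outputs L, R, C to the stereo MPS downmix 120.

    Karaoke mode: pass the BGO approximation (L, R) and drop the FGO (C).
    Solo mode: feed the FGO approximation C into both downmix channels.
    """
    if mode == "karaoke":
        return l, r
    if mode == "solo":
        return c, c
    raise ValueError(f"unknown mode: {mode!r}")
```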
The processing structure of Fig. 6 provides a number of
distinct advantages over Fig. 5:
• The framework provides a clean structural separation
of background (MBO) 100 and FGO signals 110
• The structure of the TTT element 126 attempts a best
possible reconstruction of the three signals L, R, C on
a waveform basis. Thus, the final MPS output signals
130 are not only formed by energy weighting (and
decorrelation) of the downmix signals, but also are
closer in terms of waveforms due to the TTT
processing.
• Along with the MPEG Surround TTT box 126 comes the
possibility to enhance the reconstruction precision by
using residual coding. In this way, a significant
enhancement in reconstruction quality can be achieved
as the residual bandwidth and residual bitrate for the
residual signal 132 output by TTT-1 box 124 and used by
the TTT box 126 for upmixing are increased. Ideally (i.e. for
infinitely fine quantization in the residual coding
and the coding of the downmix signal), the
interference between the background (MBO) and the FGO
signal is cancelled.
The processing structure of Fig. 6 possesses a number of
characteristics:
• Duality Karaoke/Solo mode: The approach of Fig. 6
offers both Karaoke and Solo functionality by using
the same technical means. That is, SAOC parameters are
reused, for example.
• Refineability: The quality of the Karaoke/Solo signal
can be refined as needed by controlling the amount of
residual coding information used in the TTT boxes. For
example, the parameters bsResidualSamplingFrequencyIndex,
bsResidualBands and bsResidualFramesPerSAOCFrame may
be used.
• Positioning of FGO in downmix: When using a TTT box as
specified in the MPEG Surround specification, the FGO
would always be mixed into the center position between
the left and right downmix channels. In order to allow
more flexibility in positioning, a generalized TTT
encoder box is employed which follows the same
principles while allowing non-symmetric positioning of
the signal associated with the "center" inputs/outputs.
• Multiple FGOs: In the configuration described so far,
only one FGO was used (this may correspond to the
most important application case). However, the
proposed concept is also able to accommodate several
FGOs by using one or a combination of the following
measures:
o Grouped FGOs: As shown in Fig. 6, the signal
that is connected to the center input/output of
the TTT box can actually be the sum of several
FGO signals rather than only a single one. These
FGOs can be independently positioned/controlled
in the multi-channel output signal 130 (maximum
quality advantage is achieved, however, when they
are scaled and positioned in the same way). They
share a common position in the stereo downmix
signal 112, and there is only one residual signal
132. In any case, the interference between the
background (MBO) and the controllable objects is
cancelled (although not between the controllable
objects).
o Cascaded FGOs: The restrictions regarding the
common FGO position in the downmix 112 can be
overcome by extending the approach of Fig. 6.
Multiple FGOs can be accommodated by cascading
several stages of the described TTT structure,
each stage corresponding to one FGO and producing
a residual coding stream. In this way,
interference ideally would be cancelled also
between each FGO. Of course, this option requires
a higher bitrate than using a grouped FGO
approach. An example will be described later.
• SAOC side information: In MPEG Surround, the side
information associated to a TTT box is a pair of
Channel Prediction Coefficients (CPCs). In contrast,
the SAOC parametrization and the MBO/Karaoke scenario
transmit object energies for each object signal, and
an inter-signal correlation between the two channels
of the MBO downmix (i.e. the parametrization for a
"stereo object"). In order to minimize the number of
changes in the parametrization, and thus in the bitstream
format, relative to the case without the enhanced
Karaoke/Solo mode, the CPCs can be calculated from the
energies of the downmixed signals (MBO downmix and
FGOs) and the inter-signal correlation of the MBO
downmix stereo object. Therefore, there is no need to
change or augment the transmitted parametrization and
the CPCs can be calculated from the transmitted SAOC
parametrization in the SAOC transcoder 116. In this
way, a bitstream using the Enhanced Karaoke/Solo mode
could also be decoded by a regular mode decoder
(without residual coding) when ignoring the residual
data.
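One way such a derivation could look is sketched below: the CPCs follow from the 2x2 least-squares normal equations built from the downmix (co)variances, which in turn are expressed through the transmitted object energies and the stereo-object correlation. The symmetric 1/√2 FGO downmix weight and the assumption that the FGO is uncorrelated with the MBO downmix channels are simplifications of this sketch; the normative equations differ in detail.

```python
import numpy as np

def cpcs_from_saoc_params(e_l, e_r, e_c, icc, w=1.0 / np.sqrt(2.0)):
    """Derive CPCs (c1, c2) from object energies and the MBO stereo-
    object correlation icc, assuming the FGO (energy e_c) is
    uncorrelated with the background channels (energies e_l, e_r)."""
    # Downmix (co)variances after the TTT-1 summation with weight w:
    p_ll = e_l + w * w * e_c
    p_rr = e_r + w * w * e_c
    p_lr = icc * np.sqrt(e_l * e_r) + w * w * e_c
    p_lc = w * e_c  # cross-energy of l0 with the center signal
    p_rc = w * e_c
    # Solve the 2x2 normal equations for the least-squares predictor
    # of the center from (l0, r0).
    det = p_ll * p_rr - p_lr * p_lr
    c1 = (p_rr * p_lc - p_lr * p_rc) / det
    c2 = (p_ll * p_rc - p_lr * p_lc) / det
    return c1, c2
```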
In summary, the embodiment of Fig. 6 aims at an enhanced
reproduction of certain selected objects (or the scene
without those objects) and extends the current SAOC
encoding approach using a stereo downmix in the following
way:
• In the normal mode, each object signal is weighted by
its entries in the downmix matrix (for its
contribution to the left and to the right downmix
channel, respectively). Then, all weighted
contributions to the left and right downmix channel
are summed to form the left and right downmix
channels.
• For enhanced Karaoke/Solo performance, i.e. in the
enhanced mode, all object contributions are
partitioned into a set of object contributions that
form a Foreground Object (FGO) and the remaining
object contributions (BGO). The FGO contribution is
summed into a mono downmix signal, the remaining
background contributions are summed into a stereo
downmix, and both are summed using a generalized TTT
encoder element to form the common SAOC stereo
downmix.
Thus, a regular summation is replaced by a "TTT summation"
(which can be cascaded when desired).
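The difference between the regular summation and the "TTT summation" can be sketched as follows (a simplified model with a fixed symmetric TTT weight; matrix shapes and names are illustrative):

```python
import numpy as np

W = 1.0 / np.sqrt(2.0)  # assumed TTT center weight

def normal_downmix(objects, D):
    """Normal mode: weight object j by D[i, j] and sum into downmix
    channel i, yielding the (2, num_samples) stereo downmix L0/R0."""
    return D @ objects

def enhanced_downmix(bgo_l, bgo_r, fgos, d):
    """Enhanced mode: sum the weighted FGO contributions into a mono
    signal first, then combine with the stereo background via a
    TTT-1-style summation."""
    c = sum(di * f for di, f in zip(d, fgos))  # grouped FGOs
    return bgo_l + W * c, bgo_r + W * c
```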
In order to emphasize the just-mentioned difference between
the normal mode of the SAOC encoder and the enhanced mode,
reference is made to Figs. 7a and 7b, where Fig. 7a
concerns the normal mode, whereas Fig. 7b concerns the
enhanced mode. As can be seen, in the normal mode, the SAOC
encoder 108 uses the afore-mentioned DMX parameters Dij for
weighting objects j and adding the thus weighted object j to
SAOC channel i, i.e. L0 or R0. In case of the enhanced mode
of Fig. 6, merely a vector of DMX-parameters Di is
necessary, namely, DMX-parameters Di indicating how to form
a weighted sum of the FGOs 110, thereby obtaining the
center channel C for the TTT-1 box 124, and DMX-parameters
Di instructing the TTT-1 box how to distribute the center
signal C to the left MBO channel and the right MBO channel,
respectively, thereby obtaining LDMX and RDMX,
respectively.
Problematically, the processing according to Fig. 6 does
not work very well with non-waveform-preserving codecs
(HE-AAC/SBR). A solution for that problem may be an energy-
based generalized TTT mode for HE-AAC and high frequencies.
An embodiment addressing the problem will be described
later.
A possible bitstream format for the variant with cascaded
TTTs could be as follows. It is an addition to the SAOC
bitstream that needs to be skippable when the bitstream is
digested in "regular decode mode":
numTTTs int
for (ttt=0; ttt