Audio Signal Decoder, Method for Decoding an Audio Signal and Computer Program
using Cascaded Audio Object Processing Stages
Description
Technical Field
Embodiments according to the invention are related to an audio signal decoder for
providing an upmix signal representation in dependence on a downmix signal
representation and an object-related parametric information.
Further embodiments according to the invention are related to a method for providing an
upmix signal representation in dependence on a downmix signal representation and an
object-related parametric information.
Further embodiments according to the invention are related to a computer program.
Some embodiments according to the invention are related to an enhanced Karaoke/Solo
SAOC system.
Background of the Invention
In modern audio systems, it is desired to transfer and store audio information in a bitrate-
efficient way. In addition, it is often desired to reproduce an audio content using two
or even more speakers, which are spatially distributed in a room. In such cases, it is
desired to exploit the capabilities of such a multi-speaker arrangement to allow a user
to spatially identify different audio contents or different items of a single audio content.
This may be achieved by individually distributing the different audio contents to the
different speakers.
In other words, in the art of audio processing, audio transmission and audio storage, there
is an increasing desire to handle multi-channel contents in order to improve the hearing
impression. Usage of multi-channel audio content brings along significant improvements
for the user. For example, a 3-dimensional hearing impression can be obtained, which
brings along an improved user satisfaction in entertainment applications. However, multi-
channel audio contents are also useful in professional environments, for example in
telephone conferencing applications, because the speaker intelligibility can be improved by
using a multi-channel audio playback.
However, it is also desirable to have a good tradeoff between audio quality and bitrate
requirements in order to avoid an excessive resource load caused by multi-channel
applications.
Recently, parametric techniques for the bitrate-efficient transmission and/or storage of
audio scenes containing multiple audio objects have been proposed, for example, Binaural
Cue Coding (Type I) (see, for example reference [BCC]), Joint Source Coding (see, for
example, reference [JSC]), and MPEG Spatial Audio Object Coding (SAOC) (see, for
example, references [SAOC1], [SAOC2]).
These techniques aim at perceptually reconstructing the desired output audio scene rather
than at achieving a waveform match.
Fig. 8 shows a system overview of such a system (here: MPEG SAOC). The MPEG SAOC
system 800 shown in Fig. 8 comprises an SAOC encoder 810 and an SAOC decoder 820.
The SAOC encoder 810 receives a plurality of object signals x1 to xN, which may be
represented, for example, as time-domain signals or as time-frequency-domain signals (for
example, in the form of a set of transform coefficients of a Fourier-type transform, or in the
form of QMF subband signals). The SAOC encoder 810 typically also receives downmix
coefficients d1 to dN, which are associated with the object signals x1 to xN. Separate sets of
downmix coefficients may be available for each channel of the downmix signal. The
SAOC encoder 810 is typically configured to obtain a channel of the downmix signal by
combining the object signals x1 to xN in accordance with the associated downmix
coefficients d1 to dN. Typically, there are fewer downmix channels than object signals x1 to
xN. In order to allow (at least approximately) for a separation (or separate treatment) of the
object signals at the side of the SAOC decoder 820, the SAOC encoder 810 provides both
the one or more downmix signals (designated as downmix channels) 812 and a side
information 814. The side information 814 describes characteristics of the object signals x1
to xN, in order to allow for a decoder-sided object-specific processing.
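The encoder-side downmix and the most basic form of side information described above can be sketched as follows. This is an illustrative simplification (all function and variable names are assumptions, not taken from the SAOC standard): a mono downmix as a weighted sum of object signals, and relative object powers as side information, computed broadband here although a real system works per frequency band.

```python
import numpy as np

def mono_downmix(object_signals, downmix_coeffs):
    """Combine N object signals x_1..x_N into one downmix channel
    using the associated downmix coefficients d_1..d_N."""
    X = np.asarray(object_signals)   # shape (N, num_samples)
    d = np.asarray(downmix_coeffs)   # shape (N,)
    return d @ X                     # weighted sum over objects

def object_level_differences(object_signals):
    """Side information: object powers relative to the strongest
    object (per band in a real system; broadband here)."""
    powers = np.sum(np.asarray(object_signals) ** 2, axis=1)
    return powers / np.max(powers)

# Example: two objects, the second attenuated in the downmix.
x = np.array([[1.0, 0.0, -1.0, 0.0],
              [0.5, 0.5,  0.5, 0.5]])
d = np.array([1.0, 0.5])
y = mono_downmix(x, d)          # one downmix channel
old = object_level_differences(x)
```

The decoder never sees `x` itself; it only receives `y` together with the compact side information `old`, which is what makes the scheme bitrate-efficient.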
The SAOC decoder 820 is configured to receive both the one or more downmix signals
812 and the side information 814. Also, the SAOC decoder 820 is typically configured to
receive a user interaction information and/or a user control information 822, which
describes a desired rendering setup. For example, the user interaction information/user
control information 822 may describe a speaker setup and the desired spatial placement of
the objects provided by the object signals x1 to xN.
The SAOC decoder 820 is configured to provide, for example, a plurality of decoded
upmix channel signals y1 to yM. The upmix channel signals may for example be associated
with individual speakers of a multi-speaker rendering arrangement. The SAOC decoder
820 may, for example, comprise an object separator 820a, which is configured to
reconstruct, at least approximately, the object signals x1 to xn on the basis of the one or
more downmix signals 812 and the side information 814, thereby obtaining reconstructed
object signals 820b. However, the reconstructed object signals 820b may deviate
somewhat from the original object signals x1 to xN, for example because the side
information 814 is not quite sufficient for a perfect reconstruction due to the bitrate
constraints. The SAOC decoder 820 may further comprise a mixer 820c, which may be
configured to receive the reconstructed object signals 820b and the user interaction
information/user control information 822, and to provide, on the basis thereof, the upmix
channel signals y1 to yM. The mixer 820c may be configured to use the user interaction
information/user control information 822 to determine the contribution of the individual
reconstructed object signals 820b to the upmix channel signals y1 to yM. The user
interaction information/user control information 822 may, for example, comprise rendering
parameters (also designated as rendering coefficients), which determine the contribution of
the individual reconstructed object signals 820b to the upmix channel signals y1 to yM.
However, it should be noted that in many embodiments, the object separation, which is
indicated by the object separator 820a in Fig. 8, and the mixing, which is indicated by the
mixer 820c in Fig. 8, are performed in one single step. For this purpose, overall parameters
may be computed which describe a direct mapping of the one or more downmix signals
812 onto the upmix channel signals y1 to yM. These parameters may be computed on the
basis of the side information 814 and the user interaction information/user control
information 822.
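The collapse of object separation and mixing into a single step can be illustrated with a toy linear-algebra sketch (matrix values and names are purely illustrative assumptions): because both the separator and the mixer are linear, their matrices can be multiplied once per parameter update, and the downmix is then mapped directly onto the upmix channels.

```python
import numpy as np

# Parametric object-separation matrix (decoder-side estimate derived
# from the side information 814): maps downmix channels to objects.
S = np.array([[0.4], [0.3], [0.2], [0.1]])    # (4 objects x 1 downmix ch.)

# Rendering matrix from the user interaction information 822:
# maps objects to upmix channels y_1..y_M.
R = np.array([[1.0, 0.0, 0.7, 0.3],
              [0.0, 1.0, 0.3, 0.7]])          # (2 upmix ch. x 4 objects)

# Overall parameters for the direct one-step mapping:
M = R @ S                                     # (upmix ch. x downmix ch.)

downmix = np.array([[0.5, -0.5, 1.0]])        # one downmix channel, 3 samples
upmix_two_step = R @ (S @ downmix)            # separate then mix
upmix_one_step = M @ downmix                  # single transcoding step
assert np.allclose(upmix_two_step, upmix_one_step)
```

The one-step variant applies a 2x1 matrix per sample instead of a 4x1 followed by a 2x4 matrix, which is the source of the complexity reduction mentioned above.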
Taking reference now to Figs. 9a, 9b and 9c, different apparatus for obtaining an upmix
signal representation on the basis of a downmix signal representation and object-related
side information will be described. Fig. 9a shows a block schematic diagram of an MPEG
SAOC system 900 comprising an SAOC decoder 920. The SAOC decoder 920 comprises,
as separate functional blocks, an object decoder 922 and a mixer/renderer 926. The object
decoder 922 provides a plurality of reconstructed object signals 924 in dependence on the
downmix signal representation (for example, in the form of one or more downmix signals
represented in the time domain or in the time-frequency-domain) and object-related side
information (for example, in the form of object metadata). The mixer/renderer 926
receives the reconstructed object signals 924 associated with a plurality of N objects and
provides, on the basis thereof, one or more upmix channel signals 928. In the SAOC
decoder 920, the extraction of the object signals 924 is performed separately from the
mixing/rendering, which allows for a separation of the object decoding functionality from
the mixing/rendering functionality, but brings along a relatively high computational
complexity.
Taking reference now to Fig. 9b, another MPEG SAOC system 930 will be briefly
discussed, which comprises an SAOC decoder 950. The SAOC decoder 950 provides a
plurality of upmix channel signals 958 in dependence on a downmix signal representation
(for example, in the form of one or more downmix signals) and an object-related side
information (for example, in the form of object metadata). The SAOC decoder 950
comprises a combined object decoder and mixer/renderer, which is configured to obtain
the upmix channel signals 958 in a joint mixing process without a separation of the object
decoding and the mixing/rendering, wherein the parameters for said joint upmix process
are dependent on both the object-related side information and the rendering information.
The joint upmix process also depends on the downmix information, which is considered to
be part of the object-related side information.
To summarize the above, the provision of the upmix channel signals 928, 958 can be
performed in a one-step process or in a two-step process.
Taking reference now to Fig. 9c, an MPEG SAOC system 960 will be described. The
SAOC system 960 comprises an SAOC to MPEG Surround transcoder 980, rather than an
SAOC decoder.
The SAOC to MPEG Surround transcoder 980 comprises a side information transcoder 982,
which is configured to receive the object-related side information (for example, in the form
of object metadata) and, optionally, information on the one or more downmix signals and
the rendering information. The side information transcoder is also configured to provide an
MPEG Surround side information 984 (for example, in the form of an MPEG Surround
bitstream) on the basis of the received data. Accordingly, the side information transcoder 982
is configured to transform an object-related (parametric) side information, which is
received from the object encoder, into a channel-related (parametric) side information 984,
taking into consideration the rendering information and, optionally, the information about
the content of the one or more downmix signals.
Optionally, the SAOC to MPEG Surround transcoder 980 may comprise a downmix signal
manipulator 986 configured to manipulate the one or more downmix signals, described, for
example, by the downmix signal representation, to obtain a manipulated downmix signal
representation 988. However, the
downmix signal manipulator 986 may be omitted, such that the output downmix signal
representation 988 of the SAOC to MPEG Surround transcoder 980 is identical to the input
downmix signal representation of the SAOC to MPEG Surround transcoder. The downmix
signal manipulator 986 may, for example, be used if the channel-related MPEG Surround
side information 984 would not allow a desired hearing impression to be provided on the basis
of the input downmix signal representation of the SAOC to MPEG Surround transcoder
980, which may be the case in some rendering constellations.
Accordingly, the SAOC to MPEG Surround transcoder 980 provides the downmix signal
representation 988 and the MPEG Surround bitstream 984 such that a plurality of upmix
channel signals, which represent the audio objects in accordance with the rendering
information input to the SAOC to MPEG Surround transcoder 980, can be generated using
an MPEG Surround decoder which receives the MPEG Surround bitstream 984 and the
downmix signal representation 988.
To summarize the above, different concepts for decoding SAOC-encoded audio signals can
be used. In some cases, an SAOC decoder is used, which provides upmix channel signals
(for example, upmix channel signals 928, 958) in dependence on the downmix signal
representation and the object-related parametric side information. Examples for this
concept can be seen in Figs. 9a and 9b. Alternatively, the SAOC-encoded audio
information may be transcoded to obtain a downmix signal representation (for example, a
downmix signal representation 988) and a channel-related side information (for example,
the channel-related MPEG Surround bitstream 984), which can be used by an MPEG
Surround decoder to provide the desired upmix channel signals.
In the MPEG SAOC system 800, a system overview of which is given in Fig. 8, the
general processing is carried out in a frequency selective way and can be described as
follows within each frequency band:
• N input audio object signals x1 to xN are downmixed as part of the SAOC encoder
processing. For a mono downmix, the downmix coefficients are denoted by d1 to dN. In
addition, the SAOC encoder 810 extracts side information 814 describing the
characteristics of the input audio objects. For MPEG SAOC, the relations of the object
powers with respect to each other are the most basic form of such a side information.
• Downmix signal (or signals) 812 and side information 814 are transmitted and/or
stored. To this end, the downmix audio signal may be compressed using well-known
perceptual audio coders such as MPEG-1 Layer II or III (also known as ".mp3"),
MPEG Advanced Audio Coding (AAC), or any other audio coder.
• On the receiving end, the SAOC decoder 820 conceptually tries to restore the original
object signals ("object separation") using the transmitted side information 814 (and,
naturally, the one or more downmix signals 812). These approximated object signals
(also designated as reconstructed object signals 820b) are then mixed into a target scene
represented by M audio output channels (which may, for example, be represented by
the upmix channel signals y1 to yM) using a rendering matrix. For a mono output, the
rendering matrix coefficients are given by r1 to rN.
• Effectively, the separation of the object signals is rarely executed (or even never
executed), since both the separation step (indicated by the object separator 820a) and
the mixing step (indicated by the mixer 820c) are combined into a single transcoding
step, which often results in an enormous reduction in computational complexity.
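The per-band processing listed above can be condensed into a small numerical sketch for the simplest case of a mono downmix and a mono output. Everything here is an illustrative assumption rather than the normative SAOC processing: object estimates use a simple power-weighted gain (uncorrelated objects assumed), and the separation and mixing steps are merged into one scalar per band, as described in the last bullet.

```python
import numpy as np

def transcoding_gain(d, r, powers):
    """Single combined gain applied to the mono downmix.

    Per-object separation gain (LMMSE-like, objects assumed
    uncorrelated):  g_i = d_i * P_i / sum_j d_j^2 * P_j
    The mono rendering then weights each object estimate by r_i,
    so separation and mixing collapse into one scalar."""
    d, r, powers = map(np.asarray, (d, r, powers))
    sep_gains = d * powers / np.sum(d ** 2 * powers)
    return float(np.dot(r, sep_gains))

d = [1.0, 1.0]   # downmix coefficients d_1..d_N
r = [2.0, 0.0]   # rendering coefficients r_1..r_N: boost object 1, mute object 2
P = [1.0, 1.0]   # equal object powers (the transmitted side information)
g = transcoding_gain(d, r, P)
# With equal powers, each object is estimated as half the downmix,
# so the combined gain is 2.0 * 0.5 = 1.0.
```

Note that the cost of computing `g` depends on N only once per parameter update, while the per-sample cost is a single multiplication, which reflects the complexity advantage claimed above.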
It has been found that such a scheme is tremendously efficient, both in terms of
transmission bitrate (it is only necessary to transmit a few downmix channels plus some
side information instead of N discrete object audio signals or a discrete system) and
computational complexity (the processing complexity relates mainly to the number of
output channels rather than the number of audio objects). Further advantages for the user
on the receiving end include the freedom of choosing a rendering setup of his/her choice
(mono, stereo, surround, virtualized headphone playback, and so on) and the feature of
user interactivity: the rendering matrix, and thus the output scene, can be set and changed
interactively by the user according to will, personal preference or other criteria. For
example, it is possible to locate the talkers from one group together in one spatial area to
maximize discrimination from other remaining talkers. This interactivity is achieved by
providing a decoder user interface.
For each transmitted sound object, its relative level and (for non-mono rendering) spatial
position of rendering can be adjusted. This may happen in real-time as the user changes the
position of the associated graphical user interface (GUI) sliders (for example: object level
= +5dB, object position = -30deg).
However, it has been found that it is difficult to handle audio objects of different audio
object types in such a system. In particular, it has been found that it is difficult to process
audio objects of different audio object types, for example, audio objects to which different
side information is associated, if the total number of audio objects to be processed is not
predetermined.
In view of this situation, it is an objective of the present invention to create a concept,
which allows for a computationally-efficient and flexible decoding of an audio signal
comprising a downmix signal representation and an object-related parametric information,
wherein the object-related parametric information describes audio objects of two or more
different audio object types.
Summary of the Invention
This objective is achieved by an audio signal decoder for providing an upmix signal
representation in dependence on a downmix signal representation and an object-related
parametric information, a method for providing an upmix signal representation in
dependence on a downmix signal representation and an object-related parametric
information, and a computer program, as defined by the independent claims.
An embodiment according to the invention creates an audio signal decoder for providing
an upmix signal representation in dependence on a downmix signal representation and an
object-related parametric information. The audio signal decoder comprises an object
separator configured to decompose the downmix signal representation, to provide a first
audio information describing a first set of one or more audio objects of a first audio object
type and a second audio information describing a second set of one or more audio objects
of a second audio object type in dependence on the downmix signal representation and
using at least a part of the object-related parametric information. The audio signal decoder
also comprises an audio signal processor configured to receive the second audio
information and to process the second audio information in dependence on the object-
related parametric information, to obtain a processed version of the second audio
information. The audio signal decoder also comprises an audio signal combiner configured
to combine the first audio information with the processed version of the second audio
information to obtain the upmix signal representation.
It is a key idea of the present invention that an efficient processing of different types of
audio objects can be obtained in a cascaded structure, which allows for a separation of the
different types of audio objects using at least a part of the object-related parametric
information in a first processing step performed by the object separator, and which allows
for an additional spatial processing in a second processing step performed in dependence
on at least a part of the object-related parametric information by the audio signal processor.
It has been found that extracting a second audio information, which comprises audio
objects of the second audio object type, from a downmix signal representation can be
performed with a moderate complexity even if there is a larger number of audio objects of
the second audio object type. In addition, it has been found that a spatial processing of the
audio objects of the second audio object type can be performed efficiently once the second audio
information is separated from the first audio information describing the audio objects of
the first audio object type.
Additionally, it has been found that the processing algorithm performed by the object
separator for separating the first audio information and the second audio information can
be performed with comparatively small complexity if the object-individual processing of
the audio objects of the second audio object type is postponed to the audio signal processor
and not performed at the same time as the separation of the first audio information and the
second audio information.
In a preferred embodiment, the audio signal decoder is configured to provide the upmix
signal representation in dependence on the downmix signal representation, the object-
related parametric information and a residual information associated to a sub-set of audio
objects represented by the downmix signal representation. In this case, the object separator
is configured to decompose the downmix signal representation to provide the first audio
information describing the first set of one or more audio objects (for example, foreground
objects FGO) of the first audio object type to which residual information is associated and
the second audio information describing the second set of one or more audio objects (for
example, background objects BGO) of the second audio object type to which no residual
information is associated in dependence on the downmix signal representation and using at
least part of the object-related parametric information and the residual information.
This embodiment is based on the finding that a particularly accurate separation between
the first audio information describing the first set of audio objects of the first audio object
type and the second audio information describing the second set of audio objects of the
second audio object type can be obtained by using a residual information in addition to the
object-related parametric information. It has been found that the mere use of the object-
related parametric information would result in distortions in many cases, which can be
reduced significantly or even entirely eliminated by the use of residual information. The
residual information describes, for example, a residual distortion, which is expected to
remain if an audio object of the first audio object type is isolated merely using the object-
related parametric information. The residual information is typically estimated by an audio
signal encoder. By applying the residual information, the separation between the audio
objects of the first audio object type and the audio objects of the second audio object type
can be improved.
This makes it possible to obtain the first audio information and the second audio information
with a particularly good separation between the audio objects of the first audio object type
and the audio objects of the second audio object type, which, in turn, allows for a
high-quality spatial processing of the audio objects of the second audio object type when
processing the second audio information in the audio signal processor.
In a preferred embodiment, the object separator is therefore configured to provide the first
audio information such that audio objects of the first audio object type are emphasized over
audio objects of the second audio object type in the first audio information. The object
separator is also configured to provide the second audio information such that audio
objects of the second audio object type are emphasized over audio objects of the first audio
object type in the second audio information.
In a preferred embodiment, the audio signal decoder is configured to perform a two-step
processing, such that a processing of the second audio information in the audio signal
processor is performed subsequently to a separation between the first audio information
describing the first set of one or more audio objects of the first audio object type and the
second audio information describing the second set of one or more audio objects of the
second audio object type.
In a preferred embodiment, the audio signal processor is configured to process the second
audio information in dependence on the object-related parametric information associated
with the audio objects of the second audio object type and independent from the object-
related parametric information associated with the audio objects of the first audio object
type. Accordingly, a separate processing of the audio objects of the first audio object type
and the audio objects of the second audio object type can be obtained.
In a preferred embodiment, the object separator is configured to obtain the first audio
information and the second audio information using a linear combination of one or more
downmix channels and one or more residual channels. In this case, the object separator is
configured to obtain combination parameters for performing the linear combination in
dependence on downmix parameters associated with the audio objects of the first audio
object type and in dependence on channel prediction coefficients of the audio objects of the
first audio object type. The computation of the channel prediction coefficients of the audio
objects of the first audio object type may, for example, take into consideration the audio
objects of the second audio object type as a single, common audio object. Accordingly, a
separation process can be performed with sufficiently small computational complexity,
which may, for example, be almost independent from the number of audio objects of the
second audio object type.
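The linear combination described in this embodiment can be sketched in a deliberately simplified form (all coefficient values and names are assumptions): one mono downmix channel, one residual channel, one foreground object (FGO) of the first audio object type, and the background objects (BGO) of the second audio object type treated as a single common object.

```python
import numpy as np

def separate(downmix, residual, c_fgo, d_fgo):
    """Object separation as a linear combination of downmix and
    residual channels.

    c_fgo: channel prediction coefficient of the FGO, derived from the
           object-related parametric information (the BGOs enter its
           computation only as one common object);
    d_fgo: downmix coefficient of the FGO.
    The residual channel corrects the prediction error that would
    remain with purely parametric separation."""
    fgo = c_fgo * downmix + residual    # first audio information
    bgo = downmix - d_fgo * fgo         # second audio information
    return fgo, bgo

dmx = np.array([1.0, 2.0, 3.0])        # mono downmix, 3 samples
res = np.array([0.1, -0.1, 0.0])       # transmitted residual channel
fgo, bgo = separate(dmx, res, c_fgo=0.5, d_fgo=1.0)
```

Because the BGOs appear only as a single common object, the cost of this step is essentially independent of how many audio objects of the second audio object type are present, as stated above.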
In a preferred embodiment, the object separator is configured to apply a rendering matrix
to the first audio information to map object signals of the first audio information onto
audio channels of the upmix audio signal representation. This can be done because the
object separator may be capable of extracting separate audio signals individually
representing the audio objects of the first audio object type. Accordingly, it is possible to
map the object signals of the first audio information directly onto the audio channels of the
upmix audio signal representation.
In a preferred embodiment, the audio processor is configured to perform a stereo
processing of the second audio information in dependence on a rendering information, an
object-related covariance information and a downmix information, to obtain audio
channels of the upmix audio signal representation.
Accordingly, the stereo processing of the audio objects of the second audio object type is
separated from the separation between the audio objects of the first audio object type and
the audio objects of the second audio object type. Thus, the efficient separation between
audio objects of the first audio object type and audio objects of the second audio object
type is not affected (or degraded) by the stereo processing, which typically leads to a
distribution of audio objects over a plurality of audio channels without providing the high
degree of object separation, which can be obtained in the object separator, for example,
using the residual information.
In another preferred embodiment, the audio processor is configured to perform a post-
processing of the second audio information in dependence on a rendering information, an
object-related covariance information and a downmix information. This form of post-
processing allows for a spatial placement of the audio objects of the second audio object
type within an audio scene. Nevertheless, due to the cascaded concept, the computational
complexity of the audio processor can be kept sufficiently small, because the audio
processor does not need to consider the object-related parametric information associated
with the audio objects of the first audio object type.
In addition, different types of processing can be performed by the audio processor, like, for
example, a mono-to-binaural processing, a mono-to-stereo processing, a stereo-to-binaural
processing or a stereo-to-stereo processing.
In a preferred embodiment, the object separator is configured to treat audio objects of the
second audio object type, to which no residual information is associated, as a single audio
object. In addition, the audio signal processor is configured to consider object-specific
rendering parameters to adjust contributions of the objects of the second audio object type
to the upmix signal representation. Thus, the audio objects of the second audio object type
are considered as a single audio object by the object separator, which significantly reduces
the complexity of the object separator and also allows for a unique residual
information, which is independent from the rendering parameters associated with the audio
objects of the second audio object type.
In a preferred embodiment, the object separator is configured to obtain a common object-
level difference value for a plurality of audio objects of the second audio object type. The
object separator is configured to use the common object-level difference value for a
computation of channel prediction coefficients. In addition, the object separator is
configured to use the channel prediction coefficients to obtain one or two audio channels
representing the second audio information. By obtaining a common object-level difference
value, the audio objects of the second audio object type can be handled efficiently as a
single audio object by the object separator.
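A hedged sketch of this embodiment follows. The BGOs are merged into one common object-level difference (OLD) value; summing the individual powers is one plausible choice for the common value (an assumption here, not the normative formula), since uncorrelated objects add in power. The common OLD then enters a toy LMMSE-style computation of channel prediction coefficients for a mono downmix.

```python
import numpy as np

def common_old(bgo_powers, reference_power):
    """One common OLD for all BGOs: total BGO power relative to the
    reference (assumed combination rule, for illustration)."""
    return float(np.sum(bgo_powers)) / reference_power

def prediction_coefficients(old_fgo, old_bgo_common, d_fgo, d_bgo):
    """Toy prediction of the FGO and of the common BGO from a mono
    downmix, assuming uncorrelated objects (illustrative only)."""
    denom = d_fgo ** 2 * old_fgo + d_bgo ** 2 * old_bgo_common
    c_fgo = d_fgo * old_fgo / denom
    c_bgo = d_bgo * old_bgo_common / denom
    return c_fgo, c_bgo

# Three BGOs collapse into one common OLD value.
old_bgo = common_old(np.array([0.25, 0.25, 0.5]), reference_power=1.0)
c_fgo, c_bgo = prediction_coefficients(1.0, old_bgo, d_fgo=1.0, d_bgo=1.0)
```

Only two prediction coefficients are needed regardless of the number of BGOs, which is the efficiency gain of treating them as a single audio object.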
In a preferred embodiment, the object separator is configured to obtain a common object-level difference value for a plurality of audio objects of the second audio object type, and
the object separator is configured to use the common object-level difference value for a
computation of entries of an energy-mode mapping matrix. The object separator is
configured to use the energy-mode mapping matrix to obtain the one or more audio
channels representing the second audio information. Again, the common object-level
difference value allows for a computationally efficient common treating of the audio
objects of the second audio object type by the object separator.
In a preferred embodiment, the object separator is configured to selectively obtain a
common inter-object correlation value associated to the audio objects of the second audio
object type in dependence on the object-related parametric information if it is found that
there are two audio objects of the second audio object type, and to set the inter-object
correlation value associated to the audio objects of the second audio object type to zero if it
is found that there are more or less than two audio objects of the second audio object type.
The object separator is configured to use the common inter-object correlation value
associated to the audio objects of the second audio object type to obtain the one or more
audio channels representing the second audio information. Using this approach, the inter-
object correlation value is exploited if it is obtainable with high computational efficiency,
i.e. if there are two audio objects of the second audio object type. Otherwise, it would be
computationally demanding to obtain inter-object correlation values. Accordingly, it has
been found to be a good compromise in terms of hearing impression and computational
complexity to set the inter-object correlation value associated to the audio objects of the
second audio object type to zero if there are more or less than two audio objects of the
second object type.
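The selective rule described in this embodiment reduces to a small decision function (names are assumed for illustration): a common inter-object correlation (IOC) value for the BGOs is taken from the transmitted parametric information only when there are exactly two BGOs, and is set to zero for any other count.

```python
def common_bgo_ioc(num_bgo, transmitted_ioc):
    """Common inter-object correlation for the BGOs.

    For exactly two BGOs the value is cheap to obtain from the
    object-related parametric information; for any other count it is
    set to zero, trading a small loss in hearing impression for a
    large saving in computational complexity."""
    if num_bgo == 2:
        return transmitted_ioc
    return 0.0

assert common_bgo_ioc(2, 0.8) == 0.8   # stereo-like BGO pair: use IOC
assert common_bgo_ioc(3, 0.8) == 0.0   # otherwise: assume uncorrelated
assert common_bgo_ioc(1, 0.8) == 0.0
```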
In a preferred embodiment, the audio signal processor is configured to render the second
audio information in dependence on (at least a part of) the object-related parametric
information, to obtain a rendered representation of the audio objects of the second audio
object type as a processed version of the second audio information. In this case, the
rendering can be made independent from the audio objects of the first audio object type.
In a preferred embodiment, the object separator is configured to provide the second audio
information such that the second audio information describes more than two audio objects
of the second audio object type. Embodiments according to the invention allow for a
flexible adjustment of the number of audio objects of the second audio object type, which
is significantly facilitated by the cascaded structure of the processing.
In a preferred embodiment, the object separator is configured to obtain, as the second audio
information, a one-channel audio signal representation or a two-channel audio signal
representation representing more than two audio objects of the second audio object type.
Extracting one or two audio signal channels can be performed by the object separator with
low computational complexity. In particular, the complexity of the object separator can be
kept significantly smaller when compared to a case in which the object separator would
need to deal with more than two audio objects of the second audio object type.
Nevertheless, it has been found that using one or two channels of an audio signal is a
computationally efficient representation of the audio objects of the second audio object type.
In a preferred embodiment, the audio signal processor is configured to receive the second
audio information and to process the second audio information in dependence on (at least a
part of) the object-related parametric information, taking into consideration object-related
parametric information associated with more than two audio objects of the second audio
object type. Accordingly, an object-individual processing is performed by the audio signal
processor, while such an object-individual processing is not performed for the audio objects of
the second audio object type by the object separator.
In a preferred embodiment, the audio decoder is configured to extract a total object number
information and a foreground object number information from a configuration information
related to the object-related parametric information. The audio decoder is also configured
to determine a number of audio objects of the second audio object type by forming a
difference between the total object number information and the foreground object number
information. Accordingly, efficient signalling of the number of audio objects of the second
audio object type is achieved. In addition, this concept provides for a high degree of
flexibility regarding the number of audio objects of the second audio object type.
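The determination of the number of background objects described above is a simple difference. The following Python sketch illustrates it; the field names of the parsed configuration information are hypothetical, chosen only for illustration.

```python
def count_background_objects(config):
    """Derive the number of audio objects of the second (non-enhanced)
    audio object type from the configuration information.

    `config` is a hypothetical dict standing in for the parsed SAOC
    configuration; the field names are illustrative only.
    """
    total = config["num_objects"]   # total object number information
    foreground = config["num_eao"]  # foreground (enhanced) object number
    return total - foreground       # objects of the second audio object type

# Example: 6 objects in total, 2 of which are enhanced audio objects
print(count_background_objects({"num_objects": 6, "num_eao": 2}))  # 4
```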
In a preferred embodiment, the object separator is configured to use object-related
parametric information associated with N_EAO audio objects of the first audio object type to
obtain, as the first audio information, N_EAO audio signals representing (preferably,
individually) the N_EAO audio objects of the first audio object type, and to obtain, as the
second audio information, one or two audio signals representing the N - N_EAO audio objects
of the second audio object type, treating the N - N_EAO audio objects of the second audio
object type as a single one-channel or two-channel audio object. The audio signal
processor is configured to individually render the N - N_EAO audio objects represented by the
one or two audio signals of the second audio information using the object-related
parametric information associated with the N - N_EAO audio objects of the second audio object
type. Accordingly, the audio object separation between the audio objects of the first audio
object type and the audio objects of the second audio object type is separated from the
subsequent processing of the audio objects of the second audio object type.
An embodiment according to the invention creates a method for providing an upmix signal
representation in dependence on a downmix signal representation and an object-related
parametric information.
Another embodiment according to the invention creates a computer program for
performing said method.
Brief Description of the Figs.
Embodiments according to the invention will subsequently be described taking reference to
the enclosed Figs., in which:
Fig. 1 shows a block schematic diagram of an audio signal decoder, according to
an embodiment of the invention;
Fig. 2 shows a block schematic diagram of another audio signal decoder,
according to an embodiment of the invention;
Figs. 3a and 3b show block schematic diagrams of a residual processor, which can
be used as an object separator in an embodiment of the invention;
Figs. 4a to 4e show block schematic diagrams of audio signal processors, which can be
used in an audio signal decoder according to an embodiment of the
invention;
Fig. 4f shows a block diagram of an SAOC transcoder processing mode;
Fig. 4g shows a block diagram of an SAOC decoder processing mode;
Fig. 5a shows a block schematic diagram of an audio signal decoder, according to
an embodiment of the invention;
Fig. 5b shows a block schematic diagram of another audio signal decoder,
according to an embodiment of the invention;
Fig. 6a shows a Table representing a listening test design description;
Fig. 6b shows a Table representing systems under test;
Fig. 6c shows a Table representing the listening test items and rendering matrices;
Fig. 6d shows a graphical representation of average MUSHRA scores for a
Karaoke/Solo type rendering listening test;
Fig. 6e shows a graphical representation of average MUSHRA scores for a classic
rendering listening test;
Fig. 7 shows a flow chart of a method for providing an upmix signal
representation, according to an embodiment of the invention;
Fig. 8 shows a block schematic diagram of a reference MPEG SAOC system;
Fig. 9a shows a block schematic diagram of a reference SAOC system using a
separate decoder and mixer;
Fig. 9b shows a block schematic diagram of a reference SAOC system using an
integrated decoder and mixer; and
Fig. 9c shows a block schematic diagram of a reference SAOC system using an
SAOC-to-MPEG transcoder.
Detailed Description of the Embodiments
1. Audio signal decoder according to Fig. 1
Fig. 1 shows a block schematic diagram of an audio signal decoder 100 according to an
embodiment of the invention.
The audio signal decoder 100 is configured to receive an object-related parametric
information 110 and a downmix signal representation 112. The audio signal decoder 100 is
configured to provide an upmix signal representation 120 in dependence on the downmix
signal representation and the object-related parametric information 110. The audio signal
decoder 100 comprises an object separator 130, which is configured to decompose the
downmix signal representation 112 to provide a first audio information 132 describing a
first set of one or more audio objects of a first audio object type and a second audio
information 134 describing a second set of one or more audio objects of a second audio
object type in dependence on the downmix signal representation 112 and using at least a
part of the object-related parametric information 110. The audio signal decoder 100 also
comprises an audio signal processor 140, which is configured to receive the second audio
information 134 and to process the second audio information in dependence on at least a
part of the object-related parametric information 110, to obtain a processed version 142 of
the second audio information 134. The audio signal decoder 100 also comprises an audio
signal combiner 150 configured to combine the first audio information 132 with the
processed version 142 of the second audio information 134, to obtain the upmix signal
representation 120.
The audio signal decoder 100 implements a cascaded processing of the downmix signal
representation, which represents audio objects of the first audio object type and audio
objects of the second audio object type in a combined manner.
In a first processing step, which is performed by the object separator 130, the second audio
information describing a second set of audio objects of the second audio object type is
separated from the first audio information 132 describing a first set of audio objects of a
first audio object type using the object-related parametric information 110. However, the
second audio information 134 is typically an audio information (for example, a one-
channel audio signal or a two-channel audio signal) describing the audio objects of the
second audio object type in a combined manner.
In the second processing step, the audio signal processor 140 processes the second audio
information 134 in dependence on the object-related parametric information. Accordingly,
the audio signal processor 140 is capable of performing an object-individual processing or
rendering of the audio objects of the second audio object type, which are described by the
second audio information 134, and which is typically not performed by the object separator
130.
Thus, while the audio objects of the second audio object type are preferably not processed
in an object-individual manner by the object separator 130, the audio objects of the second
audio object type are, indeed, processed in an object-individual manner (for example,
rendered in an object-individual manner) in the second processing step, which is performed
by the audio signal processor 140. Thus, the separation between the audio objects of the
first audio object type and the audio objects of the second audio object type, which is
performed by the object separator 130, is separated from the object-individual processing
of the audio objects of the second audio object type, which is performed afterwards by the
audio signal processor 140. Accordingly, the processing which is performed by the object
separator 130 is substantially independent from a number of audio objects of the second
audio object type. In addition, the format (for example, one-channel audio signal or the
two-channel audio signal) of the second audio information 134 is typically independent
from the number of audio objects of the second audio object type. Thus, the number of
audio objects of the second audio object type can be varied without having the need to
modify the structure of the object separator 130. In other words, the audio objects of the
second audio object type are treated as a single (for example, one-channel or two-channel)
audio object for which a common object-related parametric information (for example, a
common object-level-difference value associated with one or two audio channels) is
obtained by the object separator 130.
Accordingly, the audio signal decoder 100 according to Fig. 1 is capable of handling a
variable number of audio objects of the second audio object type without a structural
modification of the object separator 130. In addition, different audio object processing
algorithms can be applied by the object separator 130 and the audio signal processor 140.
Accordingly, for example, it is possible to perform an audio object separation using a
residual information by the object separator 130, which allows for a particularly good
separation of different audio objects, making use of the residual information, which
constitutes a side information for improving the quality of an object separation. In contrast,
the audio signal processor 140 may perform an object-individual processing without using
a residual information. For example, the audio signal processor 140 may be configured to
perform a conventional spatial-audio-object-coding (SAOC) type audio signal processing
to render the different audio objects.
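The two-stage processing described above can be summarized by the following Python sketch. The functions `separate` and `render_objects` are hypothetical placeholders for the object separator 130 and the audio signal processor 140; the sketch only illustrates the cascaded data flow, not the actual SAOC algorithms.

```python
import numpy as np

def decode_upmix(downmix, object_params, separate, render_objects):
    """Illustrative sketch of the cascaded two-stage structure of the audio
    signal decoder 100 of Fig. 1. `separate` and `render_objects` are
    hypothetical placeholders for the object separator 130 and the audio
    signal processor 140."""
    # First stage: decompose the downmix into the first audio information
    # (objects of the first type) and the second audio information (all
    # remaining objects, kept as a combined one/two-channel signal).
    first_info, second_info = separate(downmix, object_params)
    # Second stage: object-individual processing of the second audio
    # information, performed independently of the first-stage objects.
    processed_second = render_objects(second_info, object_params)
    # Audio signal combiner 150: channel-wise combination yields the
    # upmix signal representation.
    return first_info + processed_second

# Toy usage with placeholder stages: the "separator" splits the signal
# evenly, the "processor" doubles its share.
dm = np.ones((2, 4))  # two channels, four time slots
out = decode_upmix(dm, None,
                   lambda x, p: (0.5 * x, 0.5 * x),
                   lambda x, p: 2.0 * x)
print(out[0, 0])  # 1.5
```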
2. Audio Signal Decoder according to Fig. 2
In the following, an audio signal decoder 200 according to an embodiment of the invention
will be described. A block schematic diagram of this audio signal decoder 200 is shown in
Fig. 2.
The audio decoder 200 is configured to receive a downmix signal 210, a so-called SAOC
bitstream 212, rendering matrix information 214 and, optionally, head-related-transfer-
function (HRTF) parameters 216. The audio signal decoder 200 is also configured to
provide an output/MPS downmix signal 220 and (optionally) a MPS bitstream 222.
2.1. Input signals and output signals of the audio signal decoder 200
In the following, various details regarding input signals and output signals of the audio
decoder 200 will be described.
The downmix signal 210 may, for example, be a one-channel audio signal or a two-channel
audio signal. The downmix signal 210 may, for example, be derived from an encoded
representation of the downmix signal.
The spatial-audio-object-coding bitstream (SAOC bitstream) 212 may, for example,
comprise object-related parametric information. For example, the SAOC bitstream 212
may comprise object-level-difference information, for example, in the form of object-level-
difference parameters OLD, and an inter-object-correlation information, for example, in the
form of inter-object-correlation parameters IOC.
In addition, the SAOC bitstream 212 may comprise a downmix information describing
how the downmix signals have been provided on the basis of a plurality of audio object
signals using a downmix process. For example, the SAOC bitstream may comprise a
downmix gain parameter DMG and (optionally) downmix-channel-level difference
parameters DCLD.
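As an illustration of how such downmix parameters may relate to actual per-object channel gains, the sketch below applies an SAOC-style dequantization of DMG and DCLD. The exact formula is stated here as an assumption for illustration, not quoted from this document.

```python
import math

def downmix_gains(dmg_db, dcld_db):
    """Derive per-object stereo downmix gains from a downmix gain DMG
    (in dB) and a downmix channel level difference DCLD (in dB).

    This follows a commonly used SAOC-style dequantization; treat it as
    an illustrative assumption rather than a normative formula.
    """
    g = 10.0 ** (0.05 * dmg_db)  # overall downmix gain (linear)
    r = 10.0 ** (0.1 * dcld_db)  # left/right power ratio
    d_left = g * math.sqrt(r / (1.0 + r))
    d_right = g * math.sqrt(1.0 / (1.0 + r))
    return d_left, d_right

# An object with DMG = 0 dB and DCLD = 0 dB is panned to the center
# with unit total power:
left, right = downmix_gains(0.0, 0.0)
print(round(left, 4), round(right, 4))  # 0.7071 0.7071
```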
The rendering matrix information 214 may, for example, describe how the different audio
objects should be rendered by the audio decoder. For example, the rendering matrix
information 214 may describe an allocation of an audio object to one or more channels of
the output/MPS downmix signal 220.
The optional head-related-transfer-function (HRTF) parameter information 216 may
describe transfer functions for deriving a binaural headphone signal.
The output/MPEG-Surround downmix signal (also briefly designated with "output/MPS
downmix signal") 220 represents one or more audio channels, for example, in the form of a
time domain audio signal representation or a frequency-domain audio signal
representation. Alone or in combination with the optional MPEG-Surround bitstream
(MPS bitstream) 222, which comprises MPEG-Surround parameters describing a mapping
of the output/MPS downmix signal 220 onto a plurality of audio channels, an upmix signal
representation is formed.
2.2. Structure and functionality of the audio signal decoder 200
In the following, the structure of the audio signal decoder 200, which may fulfill the
functionality of an SAOC transcoder or the functionality of a SAOC decoder, will be
described in more detail.
The audio signal decoder 200 comprises a downmix processor 230, which is configured to
receive the downmix signal 210 and to provide, on the basis thereof, the output/MPS
downmix signal 220. The downmix processor 230 is also configured to receive at least a
part of the SAOC bitstream information 212 and at least a part of the rendering matrix
information 214. In addition, the downmix processor 230 may also receive a processed
SAOC parameter information 240 from a parameter processor 250.
The parameter processor 250 is configured to receive the SAOC bitstream information
212, the rendering matrix information 214 and, optionally, the head-related-transfer-
function parameter information 216, and to provide, on the basis thereof, the MPEG
Surround bitstream 222 carrying the MPEG surround parameters (if the MPEG surround
parameters are required, which is, for example, true in the transcoding mode of operation).
In addition, the parameter processor 250 provides the processed SAOC information 240 (if
this processed SAOC information is required).
In the following, the structure and functionality of the downmix processor 230 will be
described in more detail.
The downmix processor 230 comprises a residual processor 260, which is configured to
receive the downmix signal 210 and to provide, on the basis thereof, a first audio object
signal 262 describing so-called enhanced audio objects (EAOs), which may be considered
as audio objects of a first audio object type. The first audio object signal may comprise one
or more audio channels and may be considered as a first audio information. The residual
processor 260 is also configured to provide a second audio object signal 264, which
describes audio objects of a second audio object type and may be considered as a second
audio information. The second audio object signal 264 may comprise one or more channels
and may typically comprise one or two audio channels describing a plurality of audio
objects. Typically, the second audio object signal may describe even more than two audio
objects of the second audio object type.
The downmix processor 230 also comprises an SAOC downmix pre-processor 270, which
is configured to receive the second audio object signal 264 and to provide, on the basis
thereof, a processed version 272 of the second audio object signal 264, which may be
considered as a processed version of the second audio information.
The downmix processor 230 also comprises an audio signal combiner 280, which is
configured to receive the first audio object signal 262 and the processed version 272 of the
second audio object signal 264, and to provide, on the basis thereof, the output/MPS
downmix signal 220, which may be considered, alone or together with the (optional)
corresponding MPEG-Surround bitstream 222, as an upmix signal representation.
In the following, the functionality of the individual units of the downmix processor 230
will be discussed in more detail.
The residual processor 260 is configured to separately provide the first audio object signal
262 and the second audio object signal 264. For this purpose, the residual processor 260
may be configured to apply at least a part of the SAOC bitstream information 212. For
example, the residual processor 260 may be configured to evaluate an object-related
parametric information associated with the audio objects of the first audio object type, i.e.
the so-called "enhanced audio objects" EAO. In addition, the residual processor 260 may
be configured to obtain an overall information describing the audio objects of the second
audio object type, for example, the so-called "non-enhanced audio objects", commonly.
The residual processor 260 may also be configured to evaluate a residual information,
which is provided in the SAOC bitstream information 212, for a separation between
enhanced audio objects (audio objects of the first audio object type) and non-enhanced
audio objects (audio objects of the second audio object type). The residual information
may, for example, encode a time domain residual signal, which is applied to obtain a
particularly clean separation between the enhanced audio objects and the non-enhanced
audio objects. In addition, the residual processor 260 may, optionally, evaluate at least a
part of the rendering matrix information 214, for example, in order to determine a
distribution of the enhanced audio objects to the audio channels of the first audio object
signal 262.
The SAOC downmix pre-processor 270 comprises a channel re-distributor 274, which is
configured to receive the one or more audio channels of the second audio object signal 264
and to provide, on the basis thereof, one or more (typically two) audio channels of the
processed second audio object signal 272. In addition, the SAOC downmix pre-processor
270 comprises a decorrelated-signal-provider 276, which is configured to receive the one
or more audio channels of the second audio object signal 264 and to provide, on the basis
thereof, one or more decorrelated signals 278a, 278b, which are added to the signals
provided by the channel re-distributor 274 in order to obtain the processed version 272 of
the second audio object signal 264.
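A minimal sketch of the data flow in the SAOC downmix pre-processor 270 follows. The redistribution matrix, the stand-in decorrelator, and the wet gain are illustrative assumptions, not the standardized processing.

```python
import numpy as np

def preprocess_downmix(second_info, redistribution, decorrelate, wet_gain=0.3):
    """Simplified sketch of the SAOC downmix pre-processor 270.

    `redistribution` models the channel re-distributor 274 (a matrix
    mapping the channels of the second audio object signal 264 to the
    output channels), and `decorrelate` models the decorrelated-signal
    provider 276. Both, like `wet_gain`, are illustrative placeholders.
    """
    dry = redistribution @ second_info  # channel re-distributor 274
    wet = wet_gain * decorrelate(dry)   # decorrelated signals 278a, 278b
    return dry + wet                    # processed second audio signal 272

# Toy usage: mono-to-stereo redistribution; time reversal serves as a
# crude stand-in decorrelator (a real one would use all-pass filtering).
second = np.array([[1.0, 2.0, 3.0, 4.0]])  # one channel, four time slots
G = np.array([[0.8], [0.6]])               # mono -> stereo redistribution
out = preprocess_downmix(second, G, lambda x: x[:, ::-1])
print(out.shape)  # (2, 4)
```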
Further details regarding the SAOC downmix processor will be discussed below.
The audio signal combiner 280 combines the first audio object signal 262 with the
processed version 272 of the second audio object signal. For this purpose, a channel-wise
combination may be performed. Accordingly, the output/MPS downmix signal 220 is
obtained.
The parameter processor 250 is configured to obtain the (optional) MPEG-Surround
parameters, which make up the MPEG-Surround bitstream 222 of the upmix signal
representation, on the basis of the SAOC bitstream, taking into consideration the rendering
matrix information 214 and, optionally, the HRTF parameter information 216. In other
words, the parameter processor 250 is configured to translate the object-related
parameter information, which is described by the SAOC bitstream information 212, into a
channel-related parametric information, which is described by the MPEG Surround bit
stream 222.
In the following, a short overview of the structure of the SAOC transcoder/decoder
architecture shown in Fig. 2 will be given. Spatial audio object coding (SAOC) is a
parametric multiple object coding technique. It is designed to transmit a number of audio
objects in an audio signal (for example the downmix audio signal 210) that comprises M
channels. Together with this backward compatible downmix signal, object parameters are
transmitted (for example, using the SAOC bitstream information 212) that allow for
recreation and manipulation of the original object signals. An SAOC encoder (not shown
here) produces a downmix of the object signals at its input and extracts these object
parameters. The number of objects that can be handled is in principle not limited. The
object parameters are quantized and coded efficiently into the SAOC bitstream 212. The
downmix signal 210 can be compressed and transmitted without the need to update
existing coders and infrastructures. The object parameters, or SAOC side information, are
transmitted in a low bit rate side channel, for example, the ancillary data portion of the
downmix bitstream.
On the decoder side, the input objects are reconstructed and rendered to a certain number
of playback channels. The rendering information containing reproduction level and
panning position for each object is user-supplied or can be extracted from the SAOC
bitstream (for example, as a preset information). The rendering information can be time-
variant. Output scenarios can range from mono to multi-channel (for example, 5.1) and are
independent of both the number of input objects and the number of downmix channels.
Binaural rendering of objects is possible including azimuth and elevation of virtual object
positions. An optional effect interface allows for advanced manipulation of object signals,
besides level and panning modification.
The objects themselves can be mono signals, stereophonic signals, as well as multi-
channel signals (for example, 5.1 channels). Typical downmix configurations are mono and
stereo.
In the following, the basic structure of the SAOC transcoder/decoder, which is shown in
Fig. 2, will be explained. The SAOC transcoder/decoder module described herein may act
either as a stand-alone decoder or as a transcoder from an SAOC to an MPEG-surround
bitstream, depending on the intended output channel configuration. In a first mode of
operation, the output signal configuration is mono, stereo or binaural, and two output
channels are used. In this first case, the SAOC module may operate in a decoder mode, and
the SAOC module output is a pulse-code-modulated output (PCM output). In the first case,
an MPEG surround decoder is not required. Rather, the upmix signal representation may
only comprise the output signal 220, while the provision of the MPEG surround bit stream
222 may be omitted. In a second case, the output signal configuration is a multi-channel
configuration with more than two output channels. The SAOC module may be operational
in a transcoder mode. The SAOC module output may comprise both a downmix signal 220
and an MPEG surround bit stream 222 in this case, as shown in Fig. 2. Accordingly, an
MPEG surround decoder is required in order to obtain a final audio signal representation
for output by the speakers.
Fig. 2 shows the basic structure of the SAOC transcoder/decoder architecture. The residual
processor 260 extracts the enhanced audio objects from the incoming downmix signal 210
using the residual information contained in the SAOC bit stream 212. The downmix
preprocessor 270 processes the regular audio objects (which are, for example, non-
enhanced audio objects, i.e., audio objects for which no residual information is transmitted
in the SAOC bit stream 212). The enhanced audio objects (represented by the first audio
object signal 262) and the processed regular audio objects (represented, for example, by
the processed version 272 of the second audio object signal 264) are combined into the
output signal 220 for the SAOC decoder mode or into the MPEG surround downmix signal
220 for the SAOC transcoder mode. Detailed descriptions of the processing blocks are
given below.
3. Architecture and functionality of Residual Processor and Energy Mode Processor
In the following, details regarding a residual processor will be described, which may, for
example, take over the functionality of the object separator 130 of the audio signal decoder
100 or of the residual processor 260 of the audio signal decoder 200. For this purpose,
Figs. 3a and 3b show block schematic diagrams of such a residual processor 300, which
may take the place of the object separator 130 or of the residual processor 260. Fig. 3a
shows fewer details than Fig. 3b. However, the following description applies to the residual
processor 300 according to Fig. 3a and also to the residual processor 380 according to Fig.
3b.
The residual processor 300 is configured to receive an SAOC downmix signal 310, which
may be equivalent to the downmix signal representation 112 of Fig. 1 or the downmix
signal representation 210 of Fig. 2. The residual processor 300 is configured to provide, on
the basis thereof, a first audio information 320 describing one or more enhanced audio
objects, which may, for example, be equivalent to the first audio information 132 or to the
first audio object signal 262. Also, the residual processor 300 may provide a second audio
information 322 describing one or more other audio objects (for example, non-enhanced
audio objects, for which no residual information is available), wherein the second audio
information 322 may be equivalent to the second audio information 134 or to the second
audio object signal 264.
The residual processor 300 comprises a 1-to-N/2-to-N unit (OTN/TTN unit) 330, which
receives the SAOC downmix signal 310 and which also receives SAOC data and residuals
332. The 1-to-N/2-to-N unit 330 also provides an enhanced-audio-object signal 334, which
describes the enhanced audio objects (EAO) contained in the SAOC downmix signal 310.
Also, the 1-to-N/2-to-N unit 330 provides the second audio information 322. The residual
processor 300 also comprises a rendering unit 340, which receives the enhanced-audio-
object signal 334 and a rendering matrix information 342 and provides, on the basis
thereof, the first audio information 320.
In the following, the enhanced audio object processing (EAO processing), which is
performed by the residual processor 300, will be described in more detail.
3.1. Introduction into the Operation of the Residual Processor 300
Regarding the functionality of the residual processor 300, it should be noted that the SAOC
technology allows for the individual manipulation of a number of audio objects, in terms of
their level amplification/attenuation, without a significant decrease in the resulting sound
quality only in a very limited way. A special "karaoke-type" application scenario requires a
total (or almost total) suppression of the specific objects, typically the lead vocal, while
keeping the perceptual quality of the background sound scene unharmed.
A typical application case contains up to four enhanced audio objects (EAO) signals,
which can, for example, represent two independent stereo objects (for example, two
independent stereo objects which are prepared to be removed at the side of the decoder).
It should be noted that the (one or more) enhanced audio objects (or, more
precisely, the audio signal contributions associated with the enhanced audio objects) are
included in the SAOC downmix signal 310. Typically, the audio signal contributions
associated with the (one or more) enhanced audio objects are mixed, by the downmix
processing performed by the audio signal encoder, with audio signal contributions of other
audio objects, which are not enhanced audio objects. Also, it should be noted that audio
signal contributions of a plurality of enhanced audio objects are also typically overlapped
or mixed by the downmix processing performed by the audio signal encoder.
3.2 SAOC Architecture Supporting Enhanced Audio Objects
In the following, details regarding the residual processor 300 will be described. Enhanced
audio object processing incorporates the 1-to-N or 2-to-N units, depending on the SAOC
downmix mode. The 1-to-N processing unit is dedicated to a mono downmix signal and
the 2-to-N processing unit is dedicated to a stereo downmix signal 310. Both these units
represent a generalized and enhanced modification of the 2-to-3 box (TTT box) known
from ISO/IEC 23003-1:2007. In the encoder, regular and EAO signals are combined into
the downmix. The OTN^-1/TTN^-1 processing units (which are inverse one-to-N processing
units or inverse two-to-N processing units) are employed to produce and encode the
corresponding residual signals.
The EAO and regular signals are recovered from the downmix 310 by the OTN/TTN units
330 using the SAOC side information and incorporated residual signals. The recovered
EAOs (which are described by the enhanced audio object signal 334) are fed into the
rendering unit 340 which represents (or provides) the product of the corresponding
rendering matrix (described by the rendering matrix information 342) and the resulting
output of the OTN/TTN unit. The regular audio objects (which are described by the second
audio information 322) are delivered to the SAOC downmix pre-processor, for example,
the SAOC downmix preprocessor 270, for further processing. Figs. 3a and 3b depict the
general structure of the residual processor, i.e., the architecture of the residual processor.
The residual processor output signals 320, 322 are computed as

X_OBJ = M_OBJ X_res,
X_EAO = A_EAO M_EAO X_res,

where X_OBJ represents the downmix signal of the regular audio objects (i.e. non-EAOs)
and X_EAO is the rendered EAO output signal for the SAOC decoding mode or the
corresponding EAO downmix signal for the SAOC transcoding mode.
The residual processor can operate in prediction (using residual information) mode or
energy (without residual information) mode. The extended input signal X_res is defined
accordingly:

X_res = [X; res] for the prediction mode, and X_res = X for the energy mode.
Here, X may, for example, represent the one or more channels of the downmix signal
representation 310, which may be transported in the bitstream representing the multi-
channel audio content. res may designate one or more residual signals, which may be
described by the bitstream representing the multi-channel audio content.
The OTN/TTN processing is represented by the matrix M and the EAO pre-rendering by
the matrix A_EAO.

The OTN/TTN processing matrix M is defined according to the EAO operation mode (i.e.
prediction or energy) as

M = M_Prediction for the prediction mode, and M = M_Energy for the energy mode.

The OTN/TTN processing matrix M is represented as

M = [M_OBJ; M_EAO],

where the matrix M_OBJ relates to the regular audio objects (i.e. non-EAOs) and M_EAO to
the enhanced audio objects (EAOs).
In some embodiments, one or more multichannel background objects (MBO) may be
treated the same way by the residual processor 300.
A Multi-channel Background Object (MBO) is an MPS mono or stereo downmix that is
part of the SAOC downmix. As opposed to using individual SAOC objects for each
channel in a multi-channel signal, an MBO can be used, enabling SAOC to more efficiently
handle a multi-channel object. In the MBO case, the SAOC overhead gets lower, as the
MBO's SAOC parameters are related only to the downmix channels rather than to all the
upmix channels.
3.3 Further Definitions
3.3.1 Dimensionality of Signals and Parameters
In the following, the dimensionality of the signals and parameters will be briefly discussed
in order to provide an understanding how often the different calculations are performed.
The audio signals are defined for every time slot n and every hybrid subband (which may
be a frequency subband) k. The corresponding SAOC parameters are defined for each
parameter time slot l and processing band m. A subsequent mapping between the hybrid
and parameter domain is specified by table A.31 of ISO/IEC 23003-1:2007. Hence, all
calculations are performed with respect to certain time/band indices, and the
corresponding dimensionalities are implied for each introduced variable.
However, in the following, the time and frequency band indices will be omitted sometimes
to keep the notation concise.
3.3.2 Calculation of the matrix A_EAO
The EAO pre-rendering matrix A_EAO is defined according to the number of output
channels (i.e. mono, stereo or binaural).

The matrices A_EAO of size 2 x N_EAO (stereo case) and of size 1 x N_EAO (mono case) are
defined such that the rendering sub-matrix M_ren^EAO corresponds to the EAO rendering
(and describes a desired mapping of enhanced audio objects onto channels of the upmix
signal representation).

The values w_i^EAO are computed in dependence on rendering information associated with
the enhanced audio objects using the corresponding EAO elements and using the equations
of section 4.2.2.1.

In case of binaural rendering, the matrix A_EAO is defined by the equations given in section
4.1.2, for which the corresponding target binaural rendering matrix contains only EAO-
related elements.
3.4 Calculation of the OTN/TTN Elements in the Residual Mode
In the following, it will be discussed how the SAOC downmix signal 310, which typically
comprises one or two audio channels, is mapped onto the enhanced audio object signal
334, which typically comprises one or more enhanced audio object channels, and the
second audio information 322, which typically comprises one or two regular audio object
channels.
The functionality of the 1-to-N unit or 2-to-N unit 330 may, for example, be implemented
using a matrix vector multiplication, such that a vector describing both the channels of the
enhanced audio object signal 334 and the channels of the second audio information 322 is
obtained by multiplying a vector describing the channels of the SAOC downmix signal 310
and (optionally) one or more residual signals with a matrix MPrediction or MEnergy.
Accordingly, the determination of the matrix MPrediction or MEnergy is an important step in
the derivation of the first audio information 320 and the second audio information 322
from the SAOC downmix 310.
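For illustration only, the matrix-vector multiplication described above may be sketched as follows. All sizes and matrix contents are hypothetical placeholders (a stereo downmix with two enhanced audio objects is assumed); they are not values taken from the standard:

```python
import numpy as np

# Hypothetical sketch: a stereo SAOC downmix (l0, r0) plus N_EAO residual
# signals is mapped by a single matrix M onto the regular-object channels
# (yL, yR) and the N_EAO enhanced-object channels, per time/band tile.
N_EAO = 2                                 # number of enhanced audio objects (assumed)
rng = np.random.default_rng(0)

downmix = rng.standard_normal(2)          # [l0, r0] for one time/band tile
residuals = rng.standard_normal(N_EAO)    # res_0 ... res_{N_EAO-1}
x = np.concatenate([downmix, residuals])  # input vector of length 2 + N_EAO

# M stands for either MPrediction or MEnergy; here it is just a placeholder.
M = rng.standard_normal((2 + N_EAO, 2 + N_EAO))

y = M @ x                                 # one matrix-vector product per tile
y_obj = y[:2]                             # channels of the second audio information (322)
y_eao = y[2:]                             # channels of the enhanced audio object signal (334)
```

The same product is evaluated once per time slot and processing band, with M re-derived whenever the SAOC parameters change.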
To summarize, the OTN/TTN upmix process is represented by either a matrix MPrediction for a
prediction mode or a matrix MEnergy for an energy mode.
The energy-based encoding/decoding procedure is designed for non-waveform-preserving
coding of the downmix signal. Thus, the OTN/TTN upmix matrix for the corresponding
energy mode does not rely on specific waveforms, but only describes the relative energy
distribution of the input audio objects, as will be discussed in more detail below.
3.4.1 Prediction mode
For the prediction mode, the matrix MPrediction is defined exploiting the downmix
information contained in the matrix D^-1 and the CPC data from matrix C:

MPrediction = D^-1 C.
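The relation MPrediction = D^-1 C may be sketched numerically as follows. The matrices below are hypothetical placeholders, chosen only so that the extended downmix matrix D is square and invertible:

```python
import numpy as np

# Hypothetical sketch of MPrediction = D^-1 C for a stereo downmix with
# N_EAO = 2 enhanced audio objects; the extended downmix matrix D is square
# (size (2 + N_EAO) x (2 + N_EAO)) so that it can be inverted.
N_EAO = 2
n = 2 + N_EAO

# Placeholder extended downmix matrix D (in practice it is built from the
# downmix values m_j, n_j described below); it must be non-singular.
D = np.eye(n) + 0.1 * np.arange(n * n).reshape(n, n) / n

# Placeholder CPC matrix C of matching size.
C = np.eye(n)

M_prediction = np.linalg.inv(D) @ C
```

In a practical implementation one would solve the linear system rather than form the explicit inverse, but the explicit form mirrors the notation of the text.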
With respect to the several SAOC modes, the extended downmix matrix D and CPC
matrix C exhibit the following dimensions and structures:
3.4.1.1 Stereo downmix modes (TTN):
For stereo downmix modes (TTN) (for example, for the case of a stereo downmix on the
basis of two regular-audio-object channels and NEAO enhanced-audio-object channels), the
(extended) downmix matrix D and the CPC matrix C can be obtained as follows:
With a stereo downmix, each EAO j holds two CPCs cj,0 and cj,1, yielding the matrix C.
The residual processor output signals are computed as
Accordingly, two signals yL, yR (which are represented by XOBJ) are obtained, which
represent one or two or even more than two regular audio objects (also designated as non-enhanced
audio objects). Also, NEAO signals (represented by XEAO) representing the NEAO
enhanced audio objects are obtained. These signals are obtained on the basis of the two SAOC
downmix signals l0, r0 and NEAO residual signals res0 to resNEAO-1, which are encoded in
the SAOC side information, for example, as a part of the object-related parametric
information.
It should be noted that the signals yL and yR may be equivalent to the signal 322, and that
the signals y0,EAO to yNEAO-1,EAO (which are represented by XEAO) may be equivalent to the
signals 320.
The matrix AEAO is a rendering matrix. Entries of the matrix AEAO may describe, for
example, a mapping of enhanced audio objects to the channels of the enhanced audio
object signal 334 (XEAO).
Accordingly, an appropriate choice of the matrix AEAO may allow for an optional
integration of the functionality of the rendering unit 340, such that the multiplication of the
vector describing the channels (l0, r0) of the SAOC downmix signal 310 and one or more
residual signals (res0, ..., resNEAO-1) with the matrix AEAO MPrediction may directly result in a
representation XEAO of the first audio information 320.
3.4.1.2 Mono downmix modes (OTN):
In the following, the derivation of the enhanced audio object signals 320 (or, alternatively,
of the enhanced audio object signals 334) and of the regular audio object signal 322 will be
described for the case in which the SAOC downmix signal 310 comprises a single channel
only.
For mono downmix modes (OTN) (e.g., a mono downmix on the basis of one regular-audio-object
channel and NEAO enhanced-audio-object channels), the (extended) downmix
matrix D and the CPC matrix C can be obtained as follows:
With a mono downmix, each EAO j is predicted by only one coefficient cj, yielding the
matrix C. All matrix elements cj are obtained, for example, from the SAOC parameters
(for example, from the SAOC data 332) according to the relationships provided below
(section 3.4.1.4).
The residual processor output signals are computed as
The output signal XOBJ comprises, for example, one channel describing the regular audio
objects (non-enhanced audio objects). The output signal XEAO comprises, for example,
one, two, or even more channels describing the enhanced audio objects (preferably NEAO
channels describing the enhanced audio objects). Again, said signals are equivalent to the
signals 320, 322.
3.4.1.3 Calculation of the inverse extended downmix matrix
The matrix D^-1 is the inverse of the extended downmix matrix D, and C implies the
CPCs.
The inverse D^-1 of the extended downmix matrix D can be calculated as
The elements di,j (for example, of the inverse D^-1 of the extended downmix matrix D of
size 6 x 6) are derived using the following values:
The coefficients mj and nj of the extended downmix matrix D denote the downmix
values for every EAO j for the right and left downmix channel as
The elements di,j of the downmix matrix D are obtained using the downmix gain
information DMG and the (optional) downmix channel level difference information DCLD,
which is included in the SAOC information 332, which is represented, for example, by the
object-related parametric information 110 or the SAOC bitstream information 212.
For the stereo downmix case, the downmix matrix D of size 2 x N with elements di,j
(i = 0,1; j = 0, ..., N-1) is obtained from the DMG and DCLD parameters as
For the mono downmix case, the downmix matrix D of size 1 x N with elements di,j
(i = 0; j = 0, ..., N-1) is obtained from the DMG parameters as
Here, the dequantized downmix parameters DMGj and DCLDj are obtained, for example,
from the parametric side information 110 or from the SAOC bitstream 212.
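As an illustration, one common dequantization convention (DMG as an overall object gain in dB, DCLD as a left/right power ratio in dB) may be sketched as follows. The exact normative formulas are those of the referenced standard, so the formulas below should be read as an assumption:

```python
import numpy as np

def downmix_matrix_stereo(dmg_db, dcld_db):
    """Sketch: build the 2 x N downmix matrix from DMG/DCLD parameters.

    Assumed convention: DMG_j is the overall downmix gain of object j in dB,
    DCLD_j the power ratio between left and right downmix channel in dB.
    """
    dmg = np.asarray(dmg_db, dtype=float)
    dcld = np.asarray(dcld_db, dtype=float)
    gain = 10.0 ** (0.05 * dmg)          # overall linear gain per object
    ratio = 10.0 ** (0.1 * dcld)         # linear power ratio left/right
    d_left = gain * np.sqrt(ratio / (1.0 + ratio))
    d_right = gain * np.sqrt(1.0 / (1.0 + ratio))
    return np.vstack([d_left, d_right])  # rows: left (i = 0) and right (i = 1)

def downmix_matrix_mono(dmg_db):
    """Sketch: 1 x N downmix matrix from the DMG parameters alone."""
    return (10.0 ** (0.05 * np.asarray(dmg_db, dtype=float)))[np.newaxis, :]
```

Under this convention, DCLDj = 0 dB splits an object equally (in power) between the two downmix channels, and the squared column entries sum to the squared overall gain.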
The function EAO(j) determines the mapping between the indices of the input audio object
channels and the EAO signals:
3.4.1.4 Calculation of the matrix C
The matrix C implies the CPCs and is derived from the transmitted SAOC parameters
(i.e. the OLDs, IOCs, DMGs and DCLDs) as
In other words, the constrained CPCs are obtained in accordance with the above equations,
which may be considered as a constraining algorithm. However, the constrained CPCs may
also be derived from the unconstrained values using a different limitation approach
(constraining algorithm), or can be set to be equal to the unconstrained values.
It should be noted that the matrix entries cj,1 (and the intermediate quantities on the basis of
which the matrix entries cj,1 are computed) are typically only required if the downmix
signal is a stereo downmix signal.
The CPCs are constrained by the subsequent limiting functions:
For one specific EAO channel j = 0 ... NEAO-1, the unconstrained CPCs are estimated by
The energy quantities PLo, PRo, PLoRo, PLoCo,j and PRoCo,j are computed as
The covariance matrix ei,j is defined in the following way: The covariance matrix E of
size N x N with elements ei,j represents an approximation of the original signal
covariance matrix E ≈ SS* and is obtained from the OLD and IOC parameters as
Here, the dequantized object parameters OLDi, IOCi,j are obtained, for example, from the
parametric side information 110 or from the SAOC bitstream 212.
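Assuming the usual SAOC relation ei,j = sqrt(OLDi OLDj) IOCi,j (the concrete equation is given by the referenced standard), the construction of E from the dequantized parameters may be sketched as:

```python
import numpy as np

def covariance_matrix(old, ioc):
    """Sketch of E with e_ij = sqrt(OLD_i * OLD_j) * IOC_ij.

    `old` is the vector of dequantized object level differences, `ioc` the
    N x N matrix of inter-object correlations (diagonal assumed to be 1).
    """
    old = np.asarray(old, dtype=float)
    return np.sqrt(np.outer(old, old)) * np.asarray(ioc, dtype=float)
```

With this relation the diagonal of E simply reproduces the OLD values, while off-diagonal entries scale the geometric mean of two object levels by their correlation.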
In addition, eL,R may, for example, be obtained as
The parameters OLDL, OLDR and IOCL,R correspond to the regular (audio) objects
and can be derived using the downmix information:
As can be seen, two common object-level-difference values OLDL and OLDR are computed
for the regular audio objects in the case of a stereo downmix signal (which preferably
implies a two-channel regular audio object signal). In contrast, only one common object-level-difference
value OLDL is computed for the regular audio objects in the case of a one-channel
(mono) downmix signal (which preferably implies a one-channel regular audio
object signal).
As can be seen, the first (in the case of a two-channel downmix signal) or sole (in the case
of a one-channel downmix signal) common object-level-difference value OLDL is obtained
by summing contributions of the regular audio objects having audio object index (or
indices) i to the left channel (or sole channel) of the SAOC downmix signal 310.
The second common object-level-difference value OLDR (which is used in the case of a
two-channel downmix signal) is obtained by summing the contributions of the regular
audio objects having the audio object index (or indices) i to the right channel of the SAOC
downmix signal 310.
The contribution OLDL of the regular audio objects (having audio object indices i = 0 to
i = N-NEAO-1) onto the left channel signal (or sole channel signal) of the SAOC downmix
signal 310 is computed, for example, taking into consideration the downmix gain d0,i,
describing the downmix gain applied to the regular audio object having audio object index
i when obtaining the left channel signal of the SAOC downmix signal 310, and also the
object level of the regular audio object having the audio object index i, which is represented by
the value OLDi.
Similarly, the common object-level-difference value OLDR is obtained using the downmix
coefficients d1,i, describing the downmix gain which is applied to the regular audio object
having the audio object index i when forming the right channel signal of the SAOC
downmix signal 310, and the level information OLDi associated with the regular audio
object having the audio object index i.
As can be seen, the equations for the calculation of the quantities PLo, PRo, PLoRo, PLoCo,j and
PRoCo,j do not distinguish between the individual regular audio objects, but merely make
use of the common object-level-difference values OLDL, OLDR, thereby considering the
regular audio objects (having audio object indices i) as a single audio object.
Also, the inter-object-correlation value IOCL,R, which is associated with the regular audio
objects, is set to 0 unless there are two regular audio objects.
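The folding of the regular audio objects into one common one- or two-channel object, as described above, may be sketched as follows. The quadratic weighting d^2 * OLD is an assumption based on the description of "summing contributions"; the normative formulas are those of the standard:

```python
import numpy as np

def common_regular_object_levels(D, old, ioc_01=0.0):
    """Sketch: fold the regular objects into one 'common' object.

    D      : 2 x N_reg downmix matrix restricted to the regular objects
             (indices i = 0 .. N - N_EAO - 1); pass 1 x N_reg for a mono downmix.
    old    : OLD_i values of the regular objects.
    ioc_01 : IOC of the (only) regular object pair; used only if there are
             exactly two regular objects, otherwise IOC_{L,R} is set to 0.
    """
    D = np.atleast_2d(np.asarray(D, dtype=float))
    old = np.asarray(old, dtype=float)
    old_L = np.sum(D[0] ** 2 * old)   # contributions to the left / sole channel
    old_R = np.sum(D[1] ** 2 * old) if D.shape[0] > 1 else None
    ioc_LR = ioc_01 if old.size == 2 else 0.0
    return old_L, old_R, ioc_LR
```

For a mono downmix only OLDL is produced, matching the one-channel case described in the text.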
The covariance matrix elements ei,j (and eL,R) are defined as follows: The covariance matrix E of size N x N with elements ei,j represents an approximation of
the original signal covariance matrix E ≈ SS* and is obtained from the OLD and IOC
parameters as
For example,
wherein OLDL and OLDR and IOCL,R are computed as described above.
Here, the dequantized object parameters are obtained as
wherein DOLD and DIOC are matrices comprising object-level-difference parameters and
inter-object-correlation parameters.
3.4.2 Energy Mode
In the following, another concept will be described, which can be used to separate the
enhanced-audio-object signals 320 and the regular-audio-object (non-enhanced-audio-object)
signals 322, and which can be used in combination with a non-waveform-preserving
audio coding of the SAOC downmix channels 310.
In other words, the energy-based encoding/decoding procedure is designed for non-waveform-preserving
coding of the downmix signal. Thus, the OTN/TTN upmix matrix for
the corresponding energy mode does not rely on specific waveforms, but only describes the
relative energy distribution of the input audio objects.
Also, the concept discussed here, which is designated as an "energy mode" concept, can be
used without transmitting residual signal information. Again, the regular audio objects
(non-enhanced audio objects) are treated as a single one-channel or two-channel audio
object having one or two common object-level-difference values OLDL, OLDR.
For the energy mode, the matrix MEnergy is defined exploiting the downmix information
and the OLDs, as will be described in the following.
3.4.2.1 Energy Mode for Stereo Downmix Modes (TTN)
In case of a stereo downmix (for example, a stereo downmix on the basis of two regular-audio-object
channels and NEAO enhanced-audio-object channels), the matrices MEnergy,OBJ and MEnergy,EAO are
obtained from the corresponding OLDs according to
The residual processor output signals are computed as
The signals yL, yR, which are represented by the signal XOBJ, describe the regular audio
objects (and may be equivalent to the signal 322), and the signals y0,EAO to yNEAO-1,EAO,
which are represented by the signal XEAO, describe the enhanced audio objects (and may be
equivalent to the signal 334 or to the signal 320).
If a mono upmix signal is desired for the case of a stereo downmix signal, a 2-to-1
processing may be performed, for example, by the pre-processor 270 on the basis of the
two-channel signal XOBJ.
3.4.2.2 Energy Mode for Mono Downmix Modes (OTN)
For the mono case (for example, a mono downmix on the basis of one regular-audio-object
channel and NEAO enhanced-audio-object channels), the matrices MEnergy,OBJ and MEnergy,EAO are
obtained from the corresponding OLDs according to
The residual processor output signals are computed as
A single regular-audio-object channel 322 (represented by XOBJ) and NEAO enhanced-audio-object
channels 320 (represented by XEAO) can be obtained by applying the matrices
MEnergy,OBJ and MEnergy,EAO to a representation of the single-channel SAOC downmix signal 310
(represented here by d0).
If a two-channel (stereo) upmix signal is desired for the case of a one-channel (mono)
downmix signal, a 1-to-2 processing may be performed, for example, by the pre-processor
270 on the basis of the one-channel signal XOBJ.
4. Architecture and operation of the SAOC Downmix Pre-Processor
In the following, the operation of the SAOC downmix pre-processor 270 will be described
both for some decoding modes of operation and for some transcoding modes of operation.
4.1 Operation in the Decoding Modes
4.1.1 Introduction
In the following, a method for obtaining an output signal using SAOC parameters and
panning information (or rendering information) associated with each audio object is
described. The SAOC decoder 495 is depicted in Fig. 4g and consists of the SAOC
parameter processor 496 and the downmix processor 497.
It should be noted that the SAOC decoder 495 may be used to process the regular audio
objects, and may therefore receive, as the downmix signal 497a, the second audio object
signal 264 or the regular-audio-object signal 322 or the second audio information 134.
Accordingly, the downmix processor 497 may provide, as its output signals 497b, the
processed version 272 of the second audio object signal 264 or the processed version 142
of the second audio information 134. Accordingly, the downmix processor 497 may take
the role of the SAOC downmix pre-processor 270, or the role of the audio signal processor
140.
The SAOC parameter processor 496 may take the role of the SAOC parameter processor
252 and consequently provides downmix information 496a.
4.1.2 Downmix Processor
In the following, the downmix processor, which is part of the audio signal processor 140,
and which is designated as a "SAOC downmix pre-processor" 270 in the embodiment of
Fig. 2, and which is designated with 497 in the SAOC decoder 495, will be described in
more detail.
For the decoder mode of the SAOC system, the output signal 142, 272, 497b of the
downmix processor (represented in the hybrid QMF domain) is fed into the corresponding
synthesis filterbank (not shown in Figs. 1 and 2) as described in ISO/IEC 23003-1: 2007
yielding the final output PCM signal. Nevertheless, the output signal 142, 272, 497b of the
downmix processor is typically combined with one or more audio signals 132, 262
representing the enhanced audio objects. This combination may be performed before the
corresponding synthesis filterbank (such that a combined signal combining the output of
the downmix processor and the one or more signals representing the enhanced audio
objects is input to the synthesis filterbank). Alternatively, the output signal of the downmix
processor may be combined with one or more audio signals representing the enhanced
audio objects only after the synthesis filterbank processing. Accordingly, the upmix signal
representation 120, 220 may be either a QMF domain representation or a PCM domain
representation (or any other appropriate representation). The downmix processing
incorporates, for example, the mono processing, the stereo processing and, if required, the
subsequent binaural processing.
The output signal X̂ of the downmix processor 270, 497 (also designated with 142, 272,
497b) is computed from the mono downmix signal X (also designated with 134, 264,
497a) and the decorrelated mono downmix signal Xd as

X̂ = GX + P2 Xd.
The decorrelated mono downmix signal Xd is computed as
Xd = decorrFunc(X) .
The decorrelated signals Xd are created from the decorrelator described in ISO/IEC
23003-1:2007, subclause 6.6.2. Following this scheme, the bsDecorrConfig = 0
configuration should be used with a decorrelator index X = 8, according to Table A.26 to
Table A.29 in ISO/IEC 23003-1:2007. Hence, decorrFunc( ) denotes the decorrelation
process:
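A minimal sketch of the processing X̂ = GX + P2 Xd is given below; the stand-in decorrelator is a deliberate simplification (the real decorrFunc is the MPEG Surround decorrelator referenced above), and all matrix values are placeholders:

```python
import numpy as np

def downmix_postprocess(X, G, P2, decorr_func):
    """Sketch of the downmix processor output X_hat = G @ X + P2 @ Xd.

    X           : downmix signal, here one sample vector per channel.
    G, P2       : upmix matrices derived from the SAOC data (placeholders).
    decorr_func : stands in for the MPEG Surround decorrelator
                  (ISO/IEC 23003-1:2007, subclause 6.6.2).
    """
    Xd = decorr_func(X)
    return G @ X + P2 @ Xd

# Toy usage with an identity 'decorrelator' (a real decorrelator is an
# all-pass structure; the identity is used only to keep the sketch
# self-contained).
X = np.array([[1.0], [0.5]])
G = np.eye(2)
P2 = 0.5 * np.eye(2)
X_hat = downmix_postprocess(X, G, P2, lambda x: x)
```

In the real decoder this combination is evaluated in the hybrid QMF domain before the synthesis filterbank.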
In case of binaural output, the upmix parameters G and P2 derived from the SAOC data,
the rendering information Mren and the HRTF parameters are applied to the downmix signal X
(and Xd), yielding the binaural output X̂; see Fig. 2, reference numeral 270, where the
basic structure of the downmix processor is shown.
The target binaural rendering matrix A^{l,m} of size 2 x N consists of the elements a_{x,y}^{l,m}. Each
element a_{x,y}^{l,m} is derived from HRTF parameters and the rendering matrix M_ren^{l,m} with elements
m_y^{l,m}, for example, by the SAOC parameter processor. The target binaural rendering matrix
A^{l,m} represents the relation between all audio input objects y and the desired binaural
output.
The HRTF parameters are given by H_{i,L}^m, H_{i,R}^m and φ_i^m for each processing band m. The
spatial positions for which HRTF parameters are available are characterized by the index
i. These parameters are described in ISO/IEC 23003-1:2007.
4.1.2.1 Overview
In the following, an overview over the downmix processing will be given taking reference
to Figs. 4a and 4b, which show a block representation of the downmix processing, which
may be performed by the audio signal processor 140 or by the combination of the SAOC
parameter processor 252 and the SAOC downmix pre-processor 270, or by the
combination of the SAOC parameter processor 496 and the downmix processor 497.
Taking reference now to Fig. 4a, the downmix processing receives a rendering matrix M,
an object level difference information OLD, an inter-object-correlation information IOC, a
downmix gain information DMG and (optionally) a downmix channel level difference
information DCLD. The downmix processing 400 according to Fig. 4a obtains a rendering
matrix A on the basis of the rendering matrix M, for example, using a parameter adjuster
and an M-to-A mapping. Also, entries of a covariance matrix E are obtained in dependence
on the object level difference information OLD and the inter-object correlation information
IOC, for example, as discussed above. Similarly, entries of a downmix matrix D are
obtained in dependence on the downmix gain information DMG and the downmix channel
level difference information DCLD.
Entries f of a desired covariance matrix F are obtained in dependence on the rendering
matrix A and the covariance matrix E. Also, a scalar value v is obtained in dependence on
the covariance matrix E and the downmix matrix D (or in dependence on the entries
thereof).
Gain values PL, PR for two channels are obtained in dependence on entries of the desired
covariance matrix F and the scalar value v. Also, an inter-channel phase difference value
φc is obtained in dependence on entries f of the desired covariance matrix F. A rotation angle
α is also obtained in dependence on entries f of the desired covariance matrix F, taking into
consideration, for example, a constant c. In addition, a second rotation angle β is obtained,
for example, in dependence on the channel gains PL, PR and the first rotation angle α.
Entries of a matrix G are obtained, for example, in dependence on the two channel gain
values PL, PR and also in dependence on the inter-channel phase difference.
The eigenvector corresponding to
the larger eigenvalue is calculated according to the equation above. It is assured to lie in
the positive x-plane (the first element has to be positive). The second eigenvector is obtained
from the first by a -90 degrees rotation:
Incorporating E_d = (1 1)G, R_d can be calculated according to:
which gives
and finally the mix matrix,
4.2.2.4 Dual mode
The SAOC transcoder can let the mix matrices P1, P2 and the prediction matrix C3 be
calculated according to an alternative scheme for the upper frequency range. This
alternative scheme is particularly useful for downmix signals where the upper frequency
range is coded by a non-waveform-preserving coding algorithm, e.g. SBR in High
Efficiency AAC.
For the upper parameter bands, defined by bsTttBandsLow,
OLDR) for a plurality of audio objects of the second audio object type, and
wherein the object separator is configured to use the common object level
difference value for a computation of entries of a matrix (M); and
wherein the object separator is configured to use the matrix (M) to obtain one or
more audio channels representing the second audio information.
22. The audio signal decoder according to one of claims 1 to 21, wherein the object
separator is configured to selectively obtain a common inter-object correlation
value (IOCL,R) associated to the audio objects of the second audio object type in
dependence on the object-related parametric information if it is found that there are
two audio objects of the second audio object type, and to set the inter-object
correlation value associated to the audio objects of the second audio object type to
zero if it is found that there are more or less than two audio objects of the second
audio object type; and
wherein the object separator is configured to use the common inter-object
correlation value for a computation of entries of a matrix (M); and
wherein the object separator is configured to use the common inter-object
correlation value associated to the audio objects of the second audio object type to
obtain the one or more audio channels representing the second audio information.
23. The audio signal decoder according to one of claims 1 to 22, wherein the audio
signal processor is configured to render the second audio information in
dependence on the object-related parametric information, to obtain a rendered
representation of the audio objects of the second audio object type as the processed
version of the second audio information.
24. The audio signal decoder according to one of claims 1 to 23, wherein the object
separator is configured to provide the second audio information such that the
second audio information describes more than two audio objects of the second
audio object type.
25. The audio signal decoder according to claim 24, wherein the object separator is
configured to obtain, as the second audio information, a one-channel audio signal
representation or a two-channel audio signal representation representing more than
two audio objects of the second audio object type.
26. The audio signal decoder according to one of claims 1 to 25, wherein the audio
signal processor is configured to receive the second audio information and to
process the second audio information in dependence of the object-related
parametric information, taking into consideration object-related parametric
information associated with more than two audio objects of the second audio object
type.
27. The audio signal decoder according to one of claims 1 to 26, wherein the audio
signal decoder is configured to extract a total object number information
(bsNumObjects) and a foreground object number information (bsNumGroupsFGO)
from a configuration information (SAOCSpecificConfig) of the object-related
parametric information, and to determine the number of audio objects of the second
audio object type by forming a difference between the total object number
information and the foreground object number information.
28. The audio signal decoder according to one of claims 1 to 27, wherein the object
separator is configured to use object-related parametric information associated with
NEAO audio objects of the first audio object type to obtain, as the first audio
information, NEAO audio signals (XEAO) representing the NEAO audio objects of the
first audio object type and to obtain, as the second audio information, one or two
audio signals (XOBJ) representing the N-NEAO audio objects of the second audio
object type, treating the N-NEAO audio objects of the second audio object type as a
single one-channel or a two-channel audio object; and
wherein the audio signal processor is configured to individually render the N-NEAO
audio objects represented by the one or two audio signals of the second audio
information using the object-related parametric information associated with the N-NEAO
audio objects of the second audio object type.
29. A method for providing an upmix signal representation in dependence on a
downmix signal representation and an object-related parametric information, the
method comprising:
decomposing the downmix signal representation, to provide a first audio
information describing a first set of one or more audio objects of a first audio object
type, and a second audio information describing a second set of one or more audio
objects of a second audio object type in dependence on the downmix signal
representation and using at least a part of the object-related parametric information,
wherein the second audio information is an audio information describing the audio
objects of the second audio object type in a combined manner; and
processing the second audio information in dependence on the object-related
parametric information, to obtain a processed version of the second audio
information; and
combining the first audio information with the processed version of the second
audio information, to obtain the upmix signal representation;
wherein the upmix signal representation is provided in dependence on a residual
information associated to a subset of audio objects represented by the downmix
signal representation,
wherein the downmix signal representation is decomposed, to provide the first
audio information describing a first set of one or more audio objects of a first audio
object type to which residual information is associated, and the second audio
information describing a second set of one or more audio objects of a second audio
object type, to which no residual information is associated, in dependence on the
downmix signal representation and using the residual information;
wherein an object-individual processing of the audio objects of the second audio
object type is performed, taking into consideration object-related parametric
information associated with more than two audio objects of the second audio object
type; and
wherein the residual information describes a residual distortion, which is expected
to remain if an audio object of the first audio object type is isolated merely using
the object-related parametric information.
30. A computer program for performing the method according to claim 29 when the
computer program runs on a computer.
31. An audio signal decoder (100; 200; 500; 590) for providing an upmix signal
representation in dependence on a downmix signal representation (112; 210; 510;
510a) and an object-related parametric information (110; 212; 512; 512a), the audio
signal decoder comprising:
an object separator (130; 260; 520; 520a) configured to decompose the downmix
signal representation, to provide a first audio information (132; 262; 562; 562a)
describing a first set of one or more audio objects of a first audio object type, and a
second audio information (134; 264; 564; 564a) describing a second set of one or
more audio objects of a second audio object type in dependence on the downmix
signal representation and using at least a part of the object-related parametric
information;
an audio signal processor configured to receive the second audio information (134;
264; 564; 564a) and to process the second audio information in dependence on the
object-related parametric information, to obtain a processed version (142; 272; 572;
572a) of the second audio information; and
an audio signal combiner (150; 280; 580; 580a) configured to combine the first
audio information with the processed version of the second audio information, to
obtain the upmix signal representation;
wherein the object separator is configured to obtain the first audio information and
the second audio information according to
wherein
wherein
wherein XOBJ represent channels of the second audio information;
wherein XEAO represent object signals of the first audio information;
wherein D^-1 represents a matrix which is an inverse of an extended downmix
matrix;
wherein C describes a matrix representing a plurality of channel prediction
coefficients;
wherein l0 and r0 represent channels of the downmix signal representation;
wherein res0 to resNEAO-1 represent residual channels; and
wherein AEAO is an EAO pre-rendering matrix, entries of which describe a mapping
of enhanced audio objects to channels of an enhanced audio object signal XEAO;
wherein the object separator is configured to obtain the inverse downmix matrix D^-1
as an inverse of an extended downmix matrix which is defined as
wherein the object separator is configured to obtain the matrix C as
wherein m0 to mNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein n0 to nNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein the object separator is configured to compute the prediction coefficients
wherein the object separator is configured to derive constrained prediction
coefficients cj,0 and cj,1 from the prediction coefficients using a
constraining algorithm, or to use the prediction coefficients as the
prediction coefficients cj,0 and cj,1;
wherein energy quantities PLo, PRo, PLoRo, PLoCo,j and PRoCo,j are defined as
wherein parameters OLDL, OLDR and IOCL,R correspond to audio objects of the
second audio object type and are defined according to
wherein d0,i and d1,i are downmix values associated with the audio objects of the
second audio object type;
wherein OLDj are object level difference values associated with the audio objects
of the second audio object type;
wherein N is a total number of audio objects;
wherein NEAO is a number of audio objects of the first audio object type;
wherein IOC0,1 is an inter-object-correlation value associated with a pair of audio
objects of the second audio object type;
wherein ei,j and eL,R are covariance values derived from object-level-difference
parameters and inter-object-correlation parameters; and
wherein ei,j are associated with a pair of audio objects of the first audio object type
and eL,R is associated with a pair of audio objects of the second audio object type.
32. An audio signal decoder (100; 200; 500; 590) for providing an upmix signal
representation in dependence on a downmix signal representation (112; 210; 510;
510a) and an object-related parametric information (110; 212; 512; 512a), the audio
signal decoder comprising:
an object separator (130; 260; 520; 520a) configured to decompose the downmix
signal representation, to provide a first audio information (132; 262; 562; 562a)
describing a first set of one or more audio objects of a first audio object type, and a
second audio information (134; 264; 564; 564a) describing a second set of one or
more audio objects of a second audio object type in dependence on the downmix
signal representation and using at least a part of the object-related parametric
information;
an audio signal processor configured to receive the second audio information (134;
264; 564; 564a) and to process the second audio information in dependence on the
object-related parametric information, to obtain a processed version (142; 272; 572;
572a) of the second audio information; and
an audio signal combiner (150; 280; 580; 580a) configured to combine the first
audio information with the processed version of the second audio information, to
obtain the upmix signal representation;
wherein the object separator is configured to obtain the first audio information and
the second audio information according to
wherein XOBJ represent channels of the second audio information;
wherein XEAO represent object signals of the first audio information;
wherein
wherein m0 to mNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein n0 to nNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein OLDj are object level difference values associated with the audio objects
of the first audio object type;
wherein OLDL and OLDR are common object level difference values associated
with the audio objects of the second audio object type; and
wherein AEAO is an EAO pre-rendering matrix.
33. An audio signal decoder (100; 200; 500; 590) for providing an upmix signal
representation in dependence on a downmix signal representation (112; 210; 510;
510a), an object-related parametric information (110; 212; 512; 512a), the audio
signal decoder comprising:
an object separator (130; 260; 520; 520a) configured to decompose the downmix
signal representation, to provide a first audio information (132; 262; 562; 562a)
describing a first set of one or more audio objects of a first audio object type, and a
second audio information (134; 264; 564; 564a) describing a second set of one or
more audio objects of a second audio object type in dependence on the downmix
signal representation and using at least a part of the object-related parametric
information;
an audio signal processor configured to receive the second audio information (134;
264; 564; 564a) and to process the second audio information in dependence on the
object-related parametric information, to obtain a processed version (142; 272; 572;
572a) of the second audio information; and
an audio signal combiner (150; 280; 580; 580a) configured to combine the first
audio information with the processed version of the second audio information, to
obtain the upmix signal representation;
wherein the object separator is configured to obtain the first audio information and
the second audio information according to
wherein XOBJ represents a channel of the second audio information;
wherein XEAO represent object signals of the first audio information;
wherein
wherein m0 to mNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein OLDj are object level difference values associated with the audio objects
of the first audio object type;
wherein OLDL is a common object level difference value associated with the audio
objects of the second audio object type; and
wherein AEAO is an EAO pre-rendering matrix;
wherein the matrices MOBJEnergy and MEAOEnergy are applied to a representation d0 of a
single SAOC downmix signal.
34. A method for providing an upmix signal representation in dependence on a
downmix signal representation and an object-related parametric information, the
method comprising:
decomposing the downmix signal representation, to provide a first audio
information describing a first set of one or more audio objects of a first audio object
type, and a second audio information describing a second set of one or more audio
objects of a second audio object type in dependence on the downmix signal
representation and using at least a part of the object-related parametric information;
and
processing the second audio information in dependence on the object-related
parametric information, to obtain a processed version of the second audio
information; and
combining the first audio information with the processed version of the second
audio information, to obtain the upmix signal representation;
wherein the first audio information and the second audio information are obtained
according to
wherein
wherein XOBJ represent channels of the second audio information;
wherein XEAO represent object signals of the first audio information;
wherein represents a matrix which is an inverse of an extended downmix
matrix;
wherein C describes a matrix representing a plurality of channel prediction
coefficients,
wherein l0 and r0 represent channels of the downmix signal representation;
wherein res0 to resNEAO-1 represent residual channels; and
wherein AEAO is an EAO pre-rendering matrix, entries of which describe a mapping
of enhanced audio objects to channels of an enhanced audio object signal XEAO;
wherein the inverse downmix matrix is obtained as an inverse of an extended
downmix matrix which is defined as
wherein the matrix C is obtained as
wherein m0 to mNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein n0 to nNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein the prediction coefficients cj,0 and cj,1 are computed as
wherein constrained prediction coefficients c̃j,0 and c̃j,1 are derived from the
prediction coefficients cj,0 and cj,1 using a constraining algorithm, or wherein the
prediction coefficients cj,0 and cj,1 are used as the prediction coefficients c̃j,0 and c̃j,1;
wherein energy quantities PLo, PRo, PLoRo, PLoCo,j and PRoCo,j are defined as
wherein parameters OLDL, OLDR and IOCL,R correspond to audio objects of the
second audio object type and are defined according to
wherein d0,i and d1,i are downmix values associated with the audio objects of the
second audio object type;
wherein OLDi are object level difference values associated with the audio objects
of the second audio object type;
wherein N is a total number of audio objects;
wherein NEAO is a number of audio objects of the first audio object type;
wherein IOCi,j is an inter-object-correlation value associated with a pair of audio
objects of the second audio object type;
wherein ei,j and eL,R are covariance values derived from object-level-difference
parameters and inter-object-correlation parameters; and
wherein ei,j are associated with a pair of audio objects of the first audio object type
and eL,R is associated with a pair of audio objects of the second audio object type.
35. A method for providing an upmix signal representation in dependence on a
downmix signal representation and an object-related parametric information, the
method comprising:
decomposing the downmix signal representation, to provide a first audio
information describing a first set of one or more audio objects of a first audio object
type, and a second audio information describing a second set of one or more audio
objects of a second audio object type in dependence on the downmix signal
representation and using at least a part of the object-related parametric information;
and
processing the second audio information in dependence on the object-related
parametric information, to obtain a processed version of the second audio
information; and
combining the first audio information with the processed version of the second
audio information, to obtain the upmix signal representation;
wherein the first audio information and the second audio information are obtained
according to
wherein XOBJ represent channels of the second audio information;
wherein XEAO represent object signals of the first audio information;
wherein
wherein m0 to mNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein n0 to nNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein OLDj are object level difference values associated with the audio objects
of the first audio object type;
wherein OLDL and OLDR are common object level difference values associated
with the audio objects of the second audio object type; and
wherein AEAO is an EAO pre-rendering matrix.
36. A method for providing an upmix signal representation in dependence on a
downmix signal representation and an object-related parametric information, the
method comprising:
decomposing the downmix signal representation, to provide a first audio
information describing a first set of one or more audio objects of a first audio object
type, and a second audio information describing a second set of one or more audio
objects of a second audio object type in dependence on the downmix signal
representation and using at least a part of the object-related parametric information;
and
processing the second audio information in dependence on the object-related
parametric information, to obtain a processed version of the second audio
information; and
combining the first audio information with the processed version of the second
audio information, to obtain the upmix signal representation;
wherein the first audio information and the second audio information are obtained
according to
wherein XOBJ represents a channel of the second audio information;
wherein XEAO represent object signals of the first audio information;
wherein
wherein m0 to mNEAO-1 are downmix values associated with the audio objects of the
first audio object type;
wherein OLDj are object level difference values associated with the audio objects
of the first audio object type;
wherein OLDL is a common object level difference value associated with the audio
objects of the second audio object type; and
wherein AEAO is an EAO pre-rendering matrix;
wherein the matrices MOBJEnergy and MEAOEnergy are applied to a representation d0 of a
single SAOC downmix signal.
37. A computer program for performing the method according to one of claims 34 to 36
when the computer program runs on a computer.
ABSTRACT
An audio signal decoder for providing an upmix signal representation in dependence on a
downmix signal representation and an object-related parametric information
comprises an object separator configured to decompose the downmix signal
representation, to provide a first audio information describing a first set of one or
more audio objects of a first audio object type and a second audio information
describing a second set of one or more audio objects of a second audio object type,
in dependence on the downmix signal representation and using at least a part of the
object-related parametric information. The audio signal decoder also comprises an
audio signal processor configured to receive the second audio information and to
process the second audio information in dependence on the object-related
parametric information, to obtain a processed version of the second audio
information. The audio signal decoder also comprises an audio signal combiner
configured to combine the first audio information with the processed version of the
second audio information, to obtain the upmix signal representation.
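The cascaded structure summarized in the abstract (object separator, audio signal processor, audio signal combiner) can be illustrated by the following minimal sketch. All function names, matrix shapes, and coefficient values are illustrative assumptions for a toy two-channel case; they are not the SAOC-standard computations defined in the claims.

```python
import numpy as np

def object_separator(downmix, inv_extended_downmix, prerender):
    """First stage (hypothetical): split the downmix into audio objects of the
    first type (EAOs, pre-rendered by a matrix AEAO-like mapping) and a signal
    carrying the audio objects of the second type, kept in mixed form."""
    separated = inv_extended_downmix @ downmix   # undo the (extended) downmix
    x_obj = separated[:1, :]                     # second-type objects, still mixed
    x_eao = prerender @ separated[1:, :]         # first-type objects, pre-rendered
    return x_eao, x_obj

def audio_signal_processor(x_obj, rendering_gains):
    """Second stage (hypothetical): process the second audio information in
    dependence on object-related parameters, here reduced to plain gains."""
    return rendering_gains @ x_obj

def audio_signal_combiner(x_eao, processed_obj):
    """Third stage: combine both contributions into the upmix representation."""
    return x_eao + processed_obj

# Toy example: 2-channel downmix, identity "inverse extended downmix" matrix.
downmix = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
inv_extended_downmix = np.eye(2)
prerender = np.array([[1.0], [0.5]])   # map one EAO to two output channels
gains = np.array([[0.5], [0.25]])      # render the mixed regular objects

x_eao, x_obj = object_separator(downmix, inv_extended_downmix, prerender)
processed = audio_signal_processor(x_obj, gains)
upmix = audio_signal_combiner(x_eao, processed)
```

The point of the cascade is that the two object types take different paths: the first-type objects bypass the parametric object processing, while the second-type objects are processed before both contributions are recombined.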