
Apparatus And Method For Generating Audio Output Signals Using Object Based Metadata

Abstract: An apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects comprises a processor for processing an audio input signal to provide an object representation of the audio input signal, where this object representation can be generated by a parametrically guided approximation of original objects using an object downmix signal. An object manipulator individually manipulates objects using audio object based metadata referring to the individual audio objects to obtain manipulated audio objects. The manipulated audio objects are mixed using an object mixer for finally obtaining an audio output signal having one or several channel signals depending on a specific rendering setup.

Patent Information

Application #:
Filing Date: 13 January 2011
Publication Number: 12/2011
Publication Type: INA
Invention Field: ELECTRONICS
Status:
Email:
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2017-12-27
Renewal Date:

Applicants

FRAUNHOFER-GESELLSCHAFT ZUR FOERDERUNG DER ANGEWANDTEN FORSCHUNG E.V.
HANSASTRASSE 27C, 80686 MUENCHEN, GERMANY

Inventors

1. STEPHAN SCHREINER
KIRCHENSTRASSE 20A 91186 BUECHENBACH, GERMANY
2. WOLFGANG FIESEL
SONNENSTRASSE 19 90596 SCHWANSTETTEN, GERMANY
3. MATTHIAS NEUSINGER
BERGSTRASSE 10 91186 ROHR, GERMANY
4. OLIVER HELLMUTH
GESCHWISTER-VOEMEL-WEG 60 91052 ERLANGEN, GERMANY
5. RALPH SPERSCHNEIDER
DEBERT 75 91320 EBERMANNSTADT, GERMANY

Specification

Apparatus and Method for Generating Audio Output Signals Using Object based Metadata

Field of the Invention

The present invention relates to audio processing and, particularly, to audio processing in the context of audio object coding, such as spatial audio object coding.
Background of the Invention and Prior Art

In modern broadcasting systems such as television, it is under certain circumstances desirable not to reproduce the audio tracks as the sound engineer designed them, but rather to perform special adjustments to address constraints given at rendering time. A well-known technology to control such post-production adjustments is to provide appropriate metadata along with those audio tracks.
Traditional sound reproduction systems, e.g. old home television systems, consist of one loudspeaker or a stereo pair of loudspeakers. More sophisticated multichannel reproduction systems use five or even more loudspeakers.

When multichannel reproduction systems are considered, sound engineers can be much more flexible in placing single sources in a two-dimensional plane and can therefore also use a higher dynamic range for their overall audio tracks, since voice intelligibility is much easier to maintain due to the well-known cocktail party effect.
However, such realistic, highly dynamic sounds may cause problems on traditional reproduction systems. There may be scenarios where a consumer does not want this high-dynamic signal, be it because she or he is listening to the content in a noisy environment (e.g. in a driving car or with an in-flight or mobile entertainment system), because she or he is wearing hearing aids, or because she or he does not want to disturb her or his neighbors (late at night, for example).

Furthermore, broadcasters face the problem that different items in one program (e.g. commercials) may be at different loudness levels due to different crest factors, requiring level adjustment of consecutive items.
In a classical broadcast transmission chain, the end user receives the already mixed audio track. Any further manipulation on the receiver side is possible only in a very limited form. Currently, a small feature set of Dolby metadata allows the user to modify some properties of the audio signal.

Usually, manipulations based on the above-mentioned metadata are applied without any frequency-selective distinction, since the metadata traditionally attached to the audio signal does not provide sufficient information to do so.

Furthermore, only the whole audio stream itself can be manipulated. Additionally, there is no way to adapt and separate each audio object inside this audio stream. Especially in improper listening environments, this may be unsatisfactory.
In the midnight mode, it is impossible for the current audio processor to distinguish between ambience noise and dialogue because of missing guiding information. Therefore, in the case of high-level noise (which must be compressed or limited in loudness), dialogue will be manipulated in parallel as well. This can be harmful for speech intelligibility.

Increasing the dialogue level compared to the ambient sound helps to improve the perception of speech, especially for hearing-impaired people. This technique only works if the audio signal is actually separated into dialogue and ambient components on the receiver side, together with appropriate control information. If only a stereo downmix signal is available, no further separation can be applied to distinguish and manipulate the speech information separately.

Current downmix solutions allow a dynamic stereo level tuning for the center and surround channels. But for any loudspeaker configuration other than stereo, there is no real description from the transmitter of how to downmix the final multichannel audio source. Only a default formula inside the decoder performs the signal mix, in a very inflexible way.
In all described scenarios, generally two different approaches exist. The first approach is that, when generating the audio signal to be transmitted, a set of audio objects is downmixed into a mono, stereo or multichannel signal. This signal, which is to be transmitted to a user via broadcast, via any other transmission protocol, or via distribution on a computer-readable storage medium, normally has a number of channels that is smaller than the number of original audio objects which were downmixed by a sound engineer, for example in a studio environment. Furthermore, metadata can be attached in order to allow several different modifications, but these modifications can only be applied to the whole transmitted signal or, if the transmitted signal has several different transmitted channels, to individual transmitted channels as a whole. Since, however, such transmitted channels are always superpositions of several audio objects, an individual manipulation of a certain audio object, while a further audio object is not manipulated, is not possible at all.
The other approach is to not perform the object downmix, but to transmit the audio object signals as they are, as separate transmitted channels. Such a scenario works well when the number of audio objects is small. When, for example, only five audio objects exist, then it is possible to transmit these five different audio objects separately from each other within a 5.1 scenario. Metadata can be associated with these channels which indicate the specific nature of an object/channel. Then, on the receiver side, the transmitted channels can be manipulated based on the transmitted metadata.

A disadvantage of this approach is that it is not backward-compatible and only works well in the context of a small number of audio objects. When the number of audio objects increases, the bitrate required for transmitting all objects as separate explicit audio tracks rapidly increases. This increasing bitrate is specifically not useful in the context of broadcast applications.

Therefore, current bitrate-efficient approaches do not allow an individual manipulation of distinct audio objects. Such an individual manipulation would only be possible if each object were transmitted separately. This approach, however, is not bitrate-efficient and is, therefore, not feasible, specifically in broadcast scenarios.
It is an object of the present invention to provide a bitrate-efficient but flexible solution to these problems.

In accordance with a first aspect of the present invention, this object is achieved by an apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: a processor for processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, the at least two different audio objects are available as separate audio object signals, and the at least two different audio objects are manipulatable independently from each other; an object manipulator for manipulating the audio object signal or a mixed audio object signal of at least one audio object based on audio object based metadata referring to the at least one audio object, to obtain a manipulated audio object signal or a manipulated mixed audio object signal for the at least one audio object; and an object mixer for mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a manipulated different audio object manipulated in a different way than the at least one audio object.

In accordance with a second aspect of the present invention, this object is achieved by a method of generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, the at least two different audio objects are available as separate audio object signals, and the at least two different audio objects are manipulatable independently from each other; manipulating the audio object signal or a mixed audio object signal of at least one audio object based on audio object based metadata referring to the at least one audio object, to obtain a manipulated audio object signal or a manipulated mixed audio object signal for the at least one audio object; and mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a manipulated different audio object manipulated in a different way than the at least one audio object.

In accordance with a third aspect of the present invention, this object is achieved by an apparatus for generating an encoded audio signal representing a superposition of at least two different audio objects, comprising: a data stream formatter for formatting a data stream so that the data stream comprises an object downmix signal representing a combination of the at least two different audio objects and, as side information, metadata referring to at least one of the different audio objects.

In accordance with a fourth aspect of the present invention, this object is achieved by a method of generating an encoded audio signal representing a superposition of at least two different audio objects, comprising: formatting a data stream so that the data stream comprises an object downmix signal representing a combination of the at least two different audio objects and, as side information, metadata referring to at least one of the different audio objects.

Further aspects of the present invention relate to computer programs implementing the inventive methods and to a computer-readable storage medium having stored thereon an object downmix signal and, as side information, object parameter data and metadata for one or more audio objects included in the object downmix signal.
The present invention is based on the finding that an individual manipulation of separate audio object signals or separate sets of mixed audio object signals allows an individual object-related processing based on object-related metadata. In accordance with the present invention, the result of the manipulation is not directly output to a loudspeaker, but is provided to an object mixer, which generates output signals for a certain rendering scenario, where the output signals are generated by a superposition of at least one manipulated object signal or a set of mixed object signals together with other manipulated object signals and/or an unmodified object signal. Naturally, it is not necessary to manipulate each object; in some instances, it can be sufficient to manipulate only one object and to leave a further object of the plurality of audio objects unmodified. The result of the object mixing operation is one or a plurality of audio output signals, which are based on manipulated objects. These audio output signals can be transmitted to loudspeakers, can be stored for further use, or can even be transmitted to a further receiver, depending on the specific application scenario.
Preferably, the signal input into the inventive manipulation/mixing device is a downmix signal generated by downmixing a plurality of audio object signals. The downmix operation can be metadata-controlled for each object individually or can be uncontrolled, i.e. the same for each object. In the former case, the manipulation of the object in accordance with the metadata is the object-controlled, individual and object-specific upmix operation, in which a speaker component signal representing this object is generated. Preferably, spatial object parameters are provided as well, which can be used for reconstructing the original signals by approximated versions thereof using the transmitted object downmix signal. Then, the processor for processing an audio input signal to provide an object representation of the audio input signal is operative to calculate reconstructed versions of the original audio objects based on the parametric data, where these approximated object signals can then be individually manipulated by object-based metadata.
Preferably, object rendering information is provided as well, where the object rendering information includes information on the intended audio reproduction setup and information on the positioning of the individual audio objects within the reproduction scenario. Specific embodiments, however, can also work without such object-location data. Such configurations provide, for example, stationary object positions, which can be fixedly set or which can be negotiated between a transmitter and a receiver for a complete audio track.
Brief Description of the Drawings

Preferred embodiments of the present invention are subsequently discussed in the context of the enclosed figures, in which:

Fig. 1 illustrates a preferred embodiment of an apparatus for generating at least one audio output signal;
Fig. 2 illustrates a preferred implementation of the processor of Fig. 1;
Fig. 3a illustrates a preferred embodiment of the manipulator for manipulating object signals;
Fig. 3b illustrates a preferred implementation of the object mixer in the context of a manipulator as illustrated in Fig. 3a;
Fig. 4 illustrates a processor/manipulator/object mixer configuration in a situation in which the manipulation is performed subsequent to an object downmix, but before a final object mix;
Fig. 5a illustrates a preferred embodiment of an apparatus for generating an encoded audio signal;
Fig. 5b illustrates a transmission signal having an object downmix, object based metadata, and spatial object parameters;
Fig. 6 illustrates a map indicating several audio objects identified by a certain ID, having an object audio file, and a joint audio object information matrix E;
Fig. 7 illustrates an explanation of an object covariance matrix E of Fig. 6;
Fig. 8 illustrates a downmix matrix and an audio object encoder controlled by the downmix matrix D;
Fig. 9 illustrates a target rendering matrix A, which is normally provided by a user, and an example for a specific target rendering scenario;
Fig. 10 illustrates a preferred embodiment of an apparatus for generating at least one audio output signal in accordance with a further aspect of the present invention;
Fig. 11a illustrates a further embodiment;
Fig. 11b illustrates an even further embodiment;
Fig. 11c illustrates a further embodiment;
Fig. 12a illustrates an exemplary application scenario; and
Fig. 12b illustrates a further exemplary application scenario.
Detailed Description of the Preferred Embodiments

To address the above-mentioned problems, a preferred approach is to provide appropriate metadata along with those audio tracks. Such metadata may consist of information to control the following three factors (the three "classical" D's):

• dialog normalization
• dynamic range control
• downmix

Such audio metadata helps the receiver to manipulate the received audio signal based on adjustments performed by a listener. To distinguish this kind of audio metadata from other metadata (e.g. descriptive metadata like author, title, ...), it is usually referred to as "Dolby Metadata" (because so far it has only been implemented by Dolby). Subsequently, only this kind of audio metadata is considered, and it is simply called metadata.
Audio metadata is additional control information that is carried along with the audio program and provides essential information about the audio to a receiver. Metadata provides many important functions, including dynamic range control for less-than-ideal listening environments, level matching between programs, downmixing information for the reproduction of multichannel audio through fewer speaker channels, and other information.

Metadata provides the tools necessary for audio programs to be reproduced accurately and artistically in many different listening situations, from full-blown home theaters to in-flight entertainment, regardless of the number of speaker channels, the quality of the playback equipment, or the relative ambient noise level.
While an engineer or content producer takes great care in providing the highest quality audio possible within their program, she or he has no control over the vast array of consumer electronics or listening environments that will attempt to reproduce the original soundtrack. Metadata provides the engineer or content producer greater control over how their work is reproduced and enjoyed in almost every conceivable listening environment.
Dolby Metadata is a special format that provides information to control the three factors mentioned.

The three most important Dolby metadata functionalities are:

• Dialogue Normalization to achieve a long-term average level of dialogue within a presentation, which frequently consists of different program types, such as feature film, commercials, etc.
• Dynamic Range Control to satisfy most of the audience with pleasing audio compression, but at the same time allow each individual customer to control the dynamics of the audio signal and adjust the compression to her or his personal listening environment.
• Downmix to map the sounds of a multichannel audio signal to two channels or one channel in case no multichannel audio playback equipment is available.
Dolby metadata are used along with Dolby Digital (AC-3) and Dolby E. The Dolby E audio metadata format is described in [16]. Dolby Digital (AC-3) is intended for the translation of audio into the home through digital television broadcast (either high or standard definition), DVD or other media. Dolby Digital can carry anything from a single channel of audio up to a full 5.1-channel program, including metadata. In both digital television and DVD, it is commonly used for the transmission of stereo as well as full 5.1 discrete audio programs.

Dolby E is specifically intended for the distribution of multichannel audio within professional production and distribution environments. Any time prior to delivery to the consumer, Dolby E is the preferred method for distribution of multichannel/multiprogram audio with video. Dolby E can carry up to eight discrete audio channels configured into any number of individual program configurations (including metadata for each) within an existing two-channel digital audio infrastructure. Unlike Dolby Digital, Dolby E can handle many encode/decode generations and is synchronous with the video frame rate. Like Dolby Digital, Dolby E carries metadata for each individual audio program encoded within the data stream. The use of Dolby E allows the resulting audio data stream to be decoded, modified, and re-encoded with no audible degradation. As the Dolby E stream is synchronous to the video frame rate, it can be routed, switched, and edited in a professional broadcast environment.

Apart from this, means are provided along with MPEG AAC to perform dynamic range control and to control the downmix generation.
In order to handle source material with variable peak levels, mean levels and dynamic range in a manner that minimizes the variability for the consumer, it is necessary to control the reproduced level such that, for instance, the dialogue level or the mean music level is set to a consumer-controlled level at reproduction, regardless of how the program was originated. Additionally, not all consumers will be able to listen to the programs in a good (i.e. low-noise) environment with no constraint on how loud they make the sound. The car environment, for instance, has a high ambient noise level, and it can therefore be expected that the listener will want to reduce the range of levels that would otherwise be reproduced.

For both of these reasons, dynamic range control has to be available within the specification of AAC. To achieve this, it is necessary to accompany the bit-rate-reduced audio with data used to set and control the dynamic range of the program items. This control has to be specified relative to a reference level and in relationship to the important program elements, e.g. the dialogue.
The features of the dynamic range control are as follows:

1. Dynamic Range Control is entirely optional. Therefore, with correct syntax, there is no change in complexity for those not wishing to invoke DRC.
2. The bit-rate-reduced audio data is transmitted with the full dynamic range of the source material, with supporting data to assist in dynamic range control.
3. The dynamic range control data can be sent every frame to reduce to a minimum the latency in setting replay gains.
4. The dynamic range control data is sent using the "fill_element" feature of AAC.
5. The Reference Level is defined as full-scale.
6. The Program Reference Level is transmitted to permit level parity between the replay levels of different sources and to provide a reference about which the dynamic range control may be applied. It is that feature of the source signal that is most relevant to the subjective impression of the loudness of a program, such as the level of the dialogue content of a program or the average level of a music program.
7. The Program Reference Level represents that level of the program that may be reproduced at a set level relative to the Reference Level in the consumer hardware to achieve replay level parity. Relative to this, the quieter portions of the program may be increased in level and the louder portions of the program may be reduced in level.
8. The Program Reference Level is specified within the range 0 to -31.75 dB relative to the Reference Level.
9. The Program Reference Level uses a 7-bit field with 0.25 dB steps.
10. The dynamic range control is specified within the range ±31.75 dB.
11. The dynamic range control uses an 8-bit field (1 sign, 7 magnitude) with 0.25 dB steps (see the sketch after this list).
12. The dynamic range control can be applied to all of an audio channel's spectral coefficients or frequency bands as a single entity, or the coefficients can be split into different scalefactor bands, each being controlled separately by separate sets of dynamic range control data.
13. The dynamic range control can be applied to all channels (of a stereo or multichannel bitstream) as a single entity or can be split, with sets of channels being controlled separately by separate sets of dynamic range control data.
14. If an expected set of dynamic range control data is missing, the most recently received valid values should be used.
15. Not all elements of the dynamic range control data are sent every time. For instance, the Program Reference Level may only be sent on average once every 200 ms.
16. Where necessary, error detection/protection is provided by the Transport Layer.
17. The user shall be given the means to alter the amount of dynamic range control, present in the bitstream, that is applied to the level of the signal.
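
Items 8 to 11 fully determine the arithmetic of these two fields. The following minimal Python sketch illustrates the stated step sizes and ranges only; it is not code from the AAC specification, and the actual bitstream packing differs.

    def encode_prog_ref_level(level_db: float) -> int:
        # Program Reference Level: 0 to -31.75 dB in 0.25 dB steps -> 7-bit code
        if not -31.75 <= level_db <= 0.0:
            raise ValueError("level out of range")
        return round(-level_db / 0.25)          # 0 .. 127

    def encode_drc_gain(gain_db: float) -> int:
        # DRC gain: +/-31.75 dB in 0.25 dB steps -> 8-bit code (1 sign, 7 magnitude)
        if not -31.75 <= gain_db <= 31.75:
            raise ValueError("gain out of range")
        sign = 1 if gain_db < 0 else 0
        return (sign << 7) | round(abs(gain_db) / 0.25)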

Besides the possibility to transmit separate mono or stereo mixdown channels in a 5.1-channel transmission, AAC also allows an automatic mixdown generation from the 5-channel source track. The LFE channel shall be omitted in this case.

This matrix mixdown method may be controlled by the editor of an audio track with a small set of parameters defining the amount of the rear channels added to the mixdown.

The matrix mixdown method applies only to mixing a 5-channel program with a 3-front/2-back speaker configuration down to a stereo or a mono program. It is not applicable to any program with other than the 3/2 configuration.
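
As an illustration of such an editor-guided mixdown, the following Python sketch mixes a 3/2 program down to stereo. The -3 dB center weighting and the four-value rear coefficient set are assumptions following common practice, not normative values.

    import math

    REAR_COEFFS = [1 / math.sqrt(2), 0.5, 1 / (2 * math.sqrt(2)), 0.0]  # assumed set

    def matrix_mixdown(L, R, C, Ls, Rs, idx):
        k = REAR_COEFFS[idx]           # editor-chosen amount of rear channels
        c = 1 / math.sqrt(2)           # center contributes at -3 dB to each side
        return L + c * C + k * Ls, R + c * C + k * Rs   # LFE omitted, as stated above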

Within MPEG, several means are provided to control the audio rendering on the receiver side.

A generic technology is provided by a scene description language, e.g. BIFS or LASeR. Both technologies are used for rendering audio-visual elements from separately coded objects into a playback scene. BIFS is standardized in [5] and LASeR in [6].

MPEG-D mainly deals with (parametric) descriptions (i.e. metadata)

• to generate multichannel audio based on downmixed audio representations (MPEG Surround); and
• to generate MPEG Surround parameters based on audio objects (MPEG Spatial Audio Object Coding).

MPEG Surround exploits inter-channel differences in level, phase and coherence equivalent to the ILD, ITD and IC cues to capture the spatial image of a multichannel audio signal relative to a transmitted downmix signal, and encodes these cues in a very compact form such that the cues and the transmitted signal can be decoded to synthesize a high-quality multichannel representation. The MPEG Surround encoder receives a multichannel audio signal, where N is the number of input channels (e.g. 5.1). A key aspect of the encoding process is that a downmix signal, xt1 and xt2, which is typically stereo (but could also be mono), is derived from the multichannel input signal, and it is this downmix signal that is compressed for transmission over the channel rather than the multichannel signal. The encoder may be able to exploit the downmix process to advantage, such that it creates a faithful equivalent of the multichannel signal in the mono or stereo downmix and also creates the best possible multichannel decoding based on the downmix and the encoded spatial cues. Alternatively, the downmix could be supplied externally. The MPEG Surround encoding process is agnostic to the compression algorithm used for the transmitted channels; it could be any of a number of high-performance compression algorithms such as MPEG-1 Layer III, MPEG-4 AAC or MPEG-4 High Efficiency AAC, or it could even be PCM.

The MPEG Surround technology supports very efficient parametric coding of multichannel audio signals. The idea of MPEG SAOC is to apply similar basic assumptions together with a similar parameter representation for very efficient parametric coding of individual audio objects (tracks). Additionally, a rendering functionality is included to interactively render the audio objects into an acoustical scene for several types of reproduction systems (1.0, 2.0, 5.0, ... for loudspeakers or binaural for headphones). SAOC is designed to transmit a number of audio objects in a joint mono or stereo downmix signal to later allow a reproduction of the individual objects in an interactively rendered audio scene. For this purpose, SAOC encodes Object Level Differences (OLD), Inter-Object Cross Coherences (IOC) and Downmix Channel Level Differences (DCLD) into a parameter bitstream. The SAOC decoder converts the SAOC parameter representation into an MPEG Surround parameter representation, which is then decoded together with the downmix signal by an MPEG Surround decoder to produce the desired audio scene. The user interactively controls this process to alter the representation of the audio objects in the resulting audio scene. Among the numerous conceivable applications for SAOC, a few typical scenarios are listed in the following.
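
As a non-normative illustration of the kind of parameters SAOC encodes, the following Python sketch computes, for one time/frequency tile, object level differences as per-object powers relative to the strongest object and a simple coherence measure from the normalized cross-correlation; the function name and normalization are illustrative, not taken from the standard.

    import numpy as np

    def saoc_tile_parameters(s):
        # s: objects x samples array of subband signals for one time/frequency tile
        power = np.sum(np.abs(s) ** 2, axis=1)         # per-object power in the tile
        old = power / np.max(power)                    # object level differences
        norm = np.sqrt(np.outer(power, power)) + 1e-12
        ioc = np.abs(s @ s.conj().T) / norm            # inter-object cross coherences
        return old, ioc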

Consumers can create personal interactive remixes using a virtual mixing desk. Certain instruments can, e.g., be attenuated for playing along (like karaoke), the original mix can be modified to suit personal taste, the dialogue level in movies/broadcasts can be adjusted for better speech intelligibility, etc.

For interactive gaming, SAOC is a storage- and computationally efficient way of reproducing sound tracks. Moving around in the virtual scene is reflected by an adaptation of the object rendering parameters. Networked multi-player games benefit from the transmission efficiency of using one SAOC stream to represent all sound objects that are external to a certain player's terminal.

In the context of this application, the term "audio object" also comprises a "stem" as known in sound production scenarios. Particularly, stems are the individual components of a mix, separately saved (usually to disc) for the purposes of use in a remix. Related stems are typically bounced from the same original location. Examples could be a drum stem (includes all related drum instruments in a mix), a vocal stem (includes only the vocal tracks) or a rhythm stem (includes all rhythm-related instruments such as drums, guitar, keyboard, ...).

Current telecommunication infrastructure is monophonic and can be extended in its functionality. Terminals equipped with an SAOC extension pick up several sound sources (objects) and produce a monophonic downmix signal, which is transmitted in a compatible way by using the existing (speech) coders. The side information can be conveyed in an embedded, backward-compatible way. Legacy terminals will continue to produce monophonic output, while SAOC-enabled ones can render an acoustic scene and thus increase intelligibility by spatially separating the different speakers ("cocktail party effect").

The following section gives an overview of currently available Dolby audio metadata applications:

Midnight mode

As mentioned above, there may be scenarios where the listener does not want a high-dynamic signal. Therefore, she or he may activate the so-called "midnight mode" of her or his receiver. A compressor is then applied to the total audio signal. To control the parameters of this compressor, transmitted metadata are evaluated and applied to the total audio signal.

Clean Audio

Another scenario concerns hearing-impaired people who do not want to have high-dynamic ambience noise, but who want a quite clean signal containing dialogue ("clean audio"). This mode may also be enabled using metadata.

A currently proposed solution is defined in [15], Annex E. The balance between the stereo main signal and the additional mono dialogue description channel is handled here by an individual level parameter set. The proposed solution, based on a separate syntax, is called supplementary audio service in DVB.

Downmix

There are separate metadata parameters that govern the L/R downmix. Certain metadata parameters allow the engineer to select how the stereo downmix is constructed and which stereo analog signal is preferred. Here, the center and the surround downmix level define the final mixing balance of the downmix signal for every decoder.

Fig. 1 illustrates an apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects in accordance with a preferred embodiment of the present invention. The apparatus of Fig. 1 comprises a processor 10 for processing an audio input signal 11 to provide an object representation 12 of the audio input signal, in which the at least two different audio objects are separated from each other, in which the at least two different audio objects are available as separate audio object signals, and in which the at least two different audio objects are manipulatable independently from each other.

The manipulation of the object representation is performed in an object manipulator 13 for manipulating the audio object signal, or a mixed representation of the audio object signal, of at least one audio object based on audio object based metadata 14 referring to the at least one audio object. The object manipulator 13 is adapted to obtain a manipulated audio object signal or a manipulated mixed audio object signal representation 15 for the at least one audio object.

The signals generated by the object manipulator are input into an object mixer 16 for mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a manipulated different audio object, where the manipulated different audio object has been manipulated in a different way than the at least one audio object. The result of the object mixer comprises one or more audio output signals 17a, 17b, 17c. Preferably, the one or more output signals 17a to 17c are designed for a specific rendering setup such as a mono rendering setup, a stereo rendering setup, or a multichannel rendering setup comprising three or more channels, such as a surround setup requiring at least five or at least seven different audio output signals.
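
The following Python sketch shows the structure of this signal flow (separation, per-object manipulation, mixing) under illustrative assumptions; the function and parameter names are not from the patent, and the per-object manipulations are modeled simply as callables derived from the metadata 14.

    import numpy as np

    def generate_output(objects, manipulators, A):
        # objects: list of 1-D arrays, the object representation 12
        # manipulators: dict object index -> callable derived from metadata 14
        # A: (channels x objects) mixing matrix for the chosen rendering setup
        manipulated = [manipulators.get(i, lambda x: x)(obj)
                       for i, obj in enumerate(objects)]      # object manipulator 13
        return A @ np.vstack(manipulated)                     # object mixer 16

    # e.g. attenuate the second (ambience) object and render both to stereo:
    # out = generate_output([speech, ambience], {1: lambda x: 0.3 * x},
    #                       np.array([[0.7, 0.7], [0.7, 0.7]]))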

Fig. 2 illustrates a preferred implementation of the processor 10 for processing the audio input signal. Preferably, the audio input signal 11 is implemented as an object downmix 11 as obtained by an object downmixer 101a of Fig. 5a, which is described later. In this situation, the processor additionally receives object parameters 18 as, for example, generated by the object parameter calculator 101b in Fig. 5a as described later. Then, the processor 10 is in the position to calculate separate audio object signals 12. The number of audio object signals 12 can be higher than the number of channels in the object downmix 11. The object downmix 11 can include a mono downmix, a stereo downmix or even a downmix having more than two channels. However, the processor 10 can be operative to generate more audio object signals 12 compared to the number of individual signals in the object downmix 11. Due to the parametric processing performed by the processor 10, the audio object signals are not a true reproduction of the original audio objects which were present before the object downmix 11 was performed; rather, the audio object signals are approximated versions of the original audio objects, where the accuracy of the approximation depends on the kind of separation algorithm performed in the processor 10 and, of course, on the accuracy of the transmitted parameters. Preferred object parameters are the parameters known from spatial audio object coding, and a preferred reconstruction algorithm for generating the individually separated audio object signals is the reconstruction algorithm performed in accordance with the spatial audio object coding standard. A preferred embodiment of the processor 10 and the object parameters is subsequently discussed in the context of Figs. 6 to 9.

Fig. 3a and Fig. 3b collectively illustrate an implementation in which the object manipulation is performed before an object downmix to the reproduction setup, while Fig. 4 illustrates a further implementation in which the object downmix is performed before manipulation, and the manipulation is performed before the final object mixing operation. The result of the procedure in Figs. 3a and 3b, compared to Fig. 4, is the same, but the object manipulation is performed at different levels in the processing scenario. When the manipulation of the audio object signals is an issue in the context of efficiency and computational resources, the Fig. 3a/3b embodiment is preferred, since the audio signal manipulation has to be performed only on a single audio signal rather than on a plurality of audio signals as in Fig. 4. In a different implementation, in which there might be a requirement that the object downmix has to be performed using an unmodified object signal, the configuration of Fig. 4 is preferred, in which the manipulation is performed subsequent to the object downmix, but before the final object mix to obtain the output signals for, for example, the left channel L, the center channel C or the right channel R.

Fig. 3a illustrates the situation in which the processor 10 of Fig. 2 outputs separate audio object signals. At least one audio object signal, such as the signal for object 1, is manipulated in a manipulator 13a based on metadata for this object 1. Depending on the implementation, other objects, such as object 2, are manipulated as well, by a manipulator 13b. Naturally, the situation can arise that there actually exists an object, such as object 3, which is not manipulated but which is nevertheless generated by the object separation. The result of the Fig. 3a processing is, in the Fig. 3a example, two manipulated object signals and one non-manipulated signal.

These results are input into the object mixer 16, which includes a first mixer stage implemented as object downmixers 19a, 19b, 19c, and which furthermore comprises a second object mixer stage implemented by devices 16a, 16b, 16c.

The first stage of the object mixer 16 includes, for each output of Fig. 3a, an object downmixer, such as object downmixer 19a for output 1 of Fig. 3a, object downmixer 19b for output 2 of Fig. 3a and object downmixer 19c for output 3 of Fig. 3a. The purpose of the object downmixers 19a to 19c is to "distribute" each object to the output channels. Therefore, each object downmixer 19a, 19b, 19c has an output for a left component signal L, a center component signal C and a right component signal R. Thus, if for example object 1 were the single object, downmixer 19a would be a straightforward downmixer and the output of block 19a would be the same as the final output L, C, R indicated at 17a, 17b, 17c. The object downmixers 19a to 19c preferably receive rendering information indicated at 30, where the rendering information may describe the rendering setup, i.e., as in the Fig. 3b embodiment, only three output speakers exist. These outputs are a left speaker L, a center speaker C and a right speaker R. If, for example, the rendering setup or reproduction setup comprises a 5.1 scenario, then each object downmixer would have six output channels, and there would exist six adders, so that a final output signal for the left channel, a final output signal for the right channel, a final output signal for the center channel, a final output signal for the left surround channel, a final output signal for the right surround channel and a final output signal for the low-frequency enhancement (subwoofer) channel would be obtained.

Specifically, the adders 16a, 16b, 16c are adapted to combine the component signals for the respective channel, which were generated by the corresponding object downmixers. This combination preferably is a straightforward sample-by-sample addition, but, depending on the implementation, weighting factors can be applied as well. Furthermore, the functionalities in Figs. 3a, 3b can be performed in the frequency or subband domain, so that elements 19a to 16c might operate in the frequency domain, and there would then be some kind of frequency/time conversion before actually outputting the signals to the speakers in a reproduction setup.
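
A minimal sketch of this two-stage mixer for the three-speaker case follows; the per-object channel gains are assumed to be derived from the rendering information 30, and the function name is illustrative.

    import numpy as np

    def object_mixer(objects, gains):
        # objects: list of 1-D arrays; gains: one (gL, gC, gR) tuple per object
        n = len(objects[0])
        L, C, R = np.zeros(n), np.zeros(n), np.zeros(n)
        for obj, (gL, gC, gR) in zip(objects, gains):
            L += gL * obj    # left component signal from an object downmixer 19x
            C += gC * obj    # center component signal
            R += gR * obj    # right component signal
        return L, C, R       # sample-by-sample sums, i.e. the adders 16a, 16b, 16c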

Fig. 4 illustrates an alternative implementation in which the functionalities of the elements 19a, 19b, 19c, 16a, 16b, 16c are similar to the Fig. 3b embodiment. Importantly, however, the manipulation that took place in Fig. 3a before the object downmix 19a now takes place subsequent to the object downmix 19a. Thus, the object-specific manipulation, which is controlled by the metadata for the respective object, is done in the downmix domain, i.e., before the actual addition of the then manipulated component signals. When Fig. 4 is compared to Fig. 1, it becomes clear that object downmixers such as 19a, 19b, 19c will be implemented within the processor 10, and the object mixer 16 will comprise the adders 16a, 16b, 16c. When Fig. 4 is implemented and the object downmixers are part of the processor, then the processor will receive, in addition to the object parameters 18 of Fig. 1, the rendering information 30, i.e. information on the position of each audio object, information on the rendering setup, and additional information as the case may be.

Furthermore, the manipulation can include the downmix operation implemented by blocks 19a, 19b, 19c. In this embodiment, the manipulator includes these blocks, and additional manipulations can take place, but are not required in any case.

Fig. 5a illustrates an encoder-side embodiment which can generate a data stream as schematically illustrated in Fig. 5b. Specifically, Fig. 5a illustrates an apparatus for generating an encoded audio signal 50 representing a superposition of at least two different audio objects. Basically, the apparatus of Fig. 5a illustrates a data stream formatter 51 for formatting the data stream 50 so that the data stream comprises an object downmix signal 52, representing a combination, such as a weighted or unweighted combination, of the at least two audio objects. Furthermore, the data stream 50 comprises, as side information, object-related metadata 53 referring to at least one of the different audio objects. Preferably, the data stream 50 furthermore comprises parametric data 54, which are time- and frequency-selective and which allow a high-quality separation of the object downmix signal into several audio objects, where this operation is also termed an object upmix operation, which is performed by the processor 10 in Fig. 1 as discussed earlier.

The object downmix signal 52 is preferably generated by an object downmixer 101a. The parametric data 54 are preferably generated by an object parameter calculator 101b, and the object-selective metadata 53 are generated by an object-selective metadata provider 55. The object-selective metadata provider may be an input for receiving metadata as generated by an audio producer within a sound studio, or may receive data generated by an object-related analysis, which could be performed subsequent to the object separation. Specifically, the object-selective metadata provider could be implemented to analyze the objects output by the processor 10 in order to, for example, find out whether an object is a speech object, a sound object or a surround sound object. Thus, a speech object could be analyzed by some of the well-known speech detection algorithms known from speech coding, and the object-selective analysis could be implemented to also find out sound objects stemming from instruments. Such sound objects have a highly tonal nature and can, therefore, be distinguished from speech objects or surround sound objects. Surround sound objects will have a quite noisy nature reflecting the background sound which typically exists in, for example, cinema movies, where, for example, background noises are traffic sounds or any other stationary noisy signals, or non-stationary signals having a broadband spectrum, such as is generated when, for example, a shooting scene takes place in a cinema.

Based on this analysis, one could amplify a speech object and attenuate the other objects in order to emphasize the speech, as this is useful for a better understanding of the movie for hearing-impaired or elderly people. As stated before, other implementations include the provision of the object-specific metadata, such as an object identification and the object-related data, by a sound engineer generating the actual object downmix signal on a CD or a DVD, such as a stereo downmix or a surround sound downmix.

Fig. 5b illustrates an exemplary data stream 50 which has, as main information, the mono, stereo or multichannel object downmix and which has, as side information, the object parameters 54 and the object based metadata 53, which are stationary in the case of only identifying objects as speech or surround, or which are time-variable in the case of the provision of level data as object based metadata, such as required by the midnight mode. Preferably, however, the object based metadata are not provided in a frequency-selective way, in order to save data rate.

Fig. 6 illustrates an embodiment of an audio object map illustrating a number of N objects. In the exemplary explanation of Fig. 6, each object has an object ID, a corresponding object audio file and, importantly, audio object parameter information which is, preferably, information relating to the energy of the audio object and to the inter-object correlation of the audio object. Specifically, the audio object parameter information includes an object covariance matrix E for each subband and for each time block.

An example of such an object audio parameter information matrix E is illustrated in Fig. 7. The diagonal elements e_ii include power or energy information of the audio object i in the corresponding subband and the corresponding time block. To this end, the subband signal representing a certain audio object i is input into a power or energy calculator which may, for example, perform an autocorrelation function (acf) to obtain the value e_ii with or without some normalization. Alternatively, the energy can be calculated as the sum of the squares of the signal over a certain length (i.e. the vector product ss*). The acf can in some sense describe the spectral distribution of the energy, but due to the fact that a T/F transform for frequency selection is preferably used anyway, the energy calculation can be performed without an acf for each subband separately. Thus, the main diagonal elements of the object audio parameter matrix E indicate a measure for the power or energy of an audio object in a certain subband in a certain time block.

On the other hand, the off-diagonal elements e_ij indicate a respective correlation measure between audio objects i, j in the corresponding subband and time block. It is clear from Fig. 7 that matrix E is, for real-valued entries, symmetric with respect to the main diagonal. Generally, this matrix is a Hermitian matrix. The correlation measure element e_ij can be calculated, for example, by a cross-correlation of the two subband signals of the respective audio objects, so that a cross-correlation measure is obtained which may or may not be normalized. Other correlation measures can be used which are not calculated using a cross-correlation operation but which are calculated by other ways of determining the correlation between two signals. For practical reasons, all elements of matrix E are normalized so that they have magnitudes between 0 and 1, where 1 indicates a maximum power or a maximum correlation, 0 indicates a minimum power (zero power), and -1 indicates a minimum correlation (out of phase).
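
A small Python sketch of one possible computation of such a matrix for a single subband and time block follows; the exact normalization here is one choice consistent with the description above, not the only one.

    import numpy as np

    def covariance_matrix(s):
        # s: objects x samples array of subband signals for one time block
        E = np.real(s @ s.conj().T)                 # ss*: energies and cross energies
        p = np.diag(E).copy()                       # per-object powers e_ii
        C = E / (np.sqrt(np.outer(p, p)) + 1e-12)   # correlation measures in [-1, 1]
        np.fill_diagonal(C, p / (p.max() + 1e-12))  # powers normalized to [0, 1]
        return C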

The downmix matrix D of size K×N, where K > 1, determines the K-channel downmix signal in the form of a matrix with K rows through the matrix multiplication

X = DS.   (2)

Fig. 8 illustrates an example of a downmix matrix D having downmix matrix elements d_ij. Such an element d_ij indicates whether a portion of or the whole object j is included in the object downmix signal i or not. When, for example, d_12 is equal to zero, this means that object 2 is not included in the object downmix signal 1. On the other hand, a value of d_23 equal to 1 indicates that object 3 is fully included in object downmix signal 2.

Values of downmix matrix elements between 0 and 1 are possible. Specifically, the value of 0.5 indicates that a certain object is included in a downmix signal, but only with half its energy. Thus, when an audio object such as object number 4 is equally distributed to both downmix signal channels, then d_24 and d_14 would be equal to 0.5. This way of downmixing is an energy-conserving downmix operation, which is preferred for some situations. Alternatively, however, a non-energy-conserving downmix can be used as well, in which the whole audio object is introduced into both the left downmix channel and the right downmix channel, so that the energy of this audio object has been doubled with respect to the other audio objects within the downmix signal.
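
The following worked example of equation (2) in Python uses K = 2 downmix channels and N = 4 objects; the concrete matrix values mirror the cases discussed above (d_12 = 0, d_23 = 1, d_14 = d_24 = 0.5), while the signal content is placeholder data.

    import numpy as np

    D = np.array([[1.0, 0.0, 0.0, 0.5],
                  [0.0, 1.0, 1.0, 0.5]])
    S = np.random.randn(4, 1024)     # four object signals, one per row (placeholder)
    X = D @ S                        # the 2-channel object downmix signal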

At the lower portion of Fig. 8, a schematic diagram of the object encoder 101 of Fig. 1 is given. Specifically, the object encoder 101 includes two different portions 101a and 101b. Portion 101a is a downmixer which preferably performs a weighted linear combination of audio objects 1, 2, ..., N, and the second portion of the object encoder 101 is an audio object parameter calculator 101b, which calculates the audio object parameter information such as matrix E for each time block or subband in order to provide the audio energy and correlation information, which is parametric information and can, therefore, be transmitted with a low bit rate or can be stored consuming a small amount of memory resources.

The user-controlled object rendering matrix A of size M×N determines the M-channel target rendering of the audio objects in the form of a matrix with M rows through the matrix multiplication

Y = AS.   (3)

It will be assumed throughout the following derivation that M = 2, since the focus is on stereo rendering. Given an initial rendering matrix to more than two channels and a downmix rule from those several channels into two channels, it is obvious for those skilled in the art to derive the corresponding rendering matrix A of size 2×N for stereo rendering. It will also be assumed for simplicity that K = 2, such that the object downmix is also a stereo signal. The case of a stereo object downmix is furthermore the most important special case in terms of application scenarios.

Fig. 9 illustrates a detailed explanation of the target rendering matrix A. Depending on the application, the target rendering matrix A can be provided by the user. The user has full freedom to indicate where an audio object should be located, in a virtual manner, for a replay setup. The strength of the audio object concept is that the downmix information and the audio object parameter information are completely independent of a specific localization of the audio objects. This localization of audio objects is provided by a user in the form of target rendering information. Preferably, the target rendering information can be implemented as a target rendering matrix A, which may be in the form of the matrix in Fig. 9. Specifically, the rendering matrix A has M rows and N columns, where M is equal to the number of channels in the rendered output signal and N is equal to the number of audio objects. M is equal to two in the preferred stereo rendering scenario, but if an M-channel rendering is performed, then the matrix A has M rows.

Specifically, a matrix element a_ij indicates whether a portion of or the whole object j is to be rendered in the specific output channel i or not. The lower portion of Fig. 9 gives a simple example of the target rendering matrix for a scenario in which there are six audio objects A01 to A06, where only the first five audio objects should be rendered at specific positions and the sixth audio object should not be rendered at all.

Regarding audio object A01, the user wants this audio object to be rendered at the left side of a replay scenario. Therefore, this object is placed at the position of a left speaker in a (virtual) replay room, which results in the first column of the rendering matrix A being (1, 0). Regarding the second audio object, a_22 is 1 and a_12 is 0, which means that the second audio object is to be rendered on the right side.

Audio object A03 is to be rendered in the middle between the left speaker and the right speaker, so that 50% of the level or signal of this audio object goes into the left channel and 50% of the level or signal goes into the right channel, so that the corresponding third column of the target rendering matrix A is (0.5, 0.5).

Similarly, any placement between the left speaker and the right speaker can be indicated by the target rendering matrix. Regarding audio object A04, the placement is more to the right side, since the matrix element a_24 is larger than a_14. Similarly, the fifth audio object A05 is rendered to be closer to the left speaker, as indicated by the target rendering matrix elements a_15 and a_25. The target rendering matrix A additionally allows a certain audio object not to be rendered at all. This is exemplarily illustrated by the sixth column of the target rendering matrix A, which has zero elements.
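
Written out in Python, the Fig. 9 example takes the following form; the columns for A04 and A05 use illustrative values, since the text only requires a_24 > a_14 and a_15 > a_25.

    import numpy as np

    A = np.array([[1.0, 0.0, 0.5, 0.3, 0.7, 0.0],   # left-channel weights a_1j
                  [0.0, 1.0, 0.5, 0.7, 0.3, 0.0]])  # right-channel weights a_2j
    S = np.random.randn(6, 1024)                    # six object signals A01..A06 (placeholder)
    Y = A @ S                                       # equation (3): stereo rendering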

Subsequently, a preferred embodiment of the present invention is summarized with reference to Fig. 10.

Preferably, the methods known from SAOC (Spatial Audio Object Coding) split up one audio signal into different parts. These parts may be, for example, different sound objects, but they are not limited to this.

If metadata is transmitted for each single part of the audio signal, it allows adjusting just some of the signal components, while other parts remain unchanged or may even be modified with different metadata. This may be done for different sound objects, but also for individual spectral ranges.

Parameters for object separation are classical, or even new, metadata (gain, compression, level, ...) for every individual audio object. These data are preferably transmitted.

The decoder processing box is implemented in two different stages: in a first stage, the object separation parameters are used to generate (10) individual audio objects. In the second stage, the processing unit 13 has multiple instances, where each instance is for an individual object. Here, the object-specific metadata should be applied. At the end of the decoder, all individual objects are again combined (16) into one single audio signal. Additionally, a dry/wet controller 20 may allow a smooth fade-over between the original and the manipulated signal, to give the end user a simple possibility to find her or his preferred setting.
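
The dry/wet controller 20 can be understood as a simple crossfade, as in this illustrative one-liner (the linear blend is an assumption; other fade laws are possible):

    def dry_wet(original, manipulated, w):
        # w = 0 keeps the original mix, w = 1 gives the fully manipulated mix;
        # intermediate values fade smoothly between the two signals
        return (1.0 - w) * original + w * manipulated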

Depending on the specific implementation, Fig. 10 illustrates two aspects. In a base aspect, the object-related metadata just indicate an object description for a specific object. Preferably, the object description is related to an object ID, as indicated at 21 in Fig. 10. Therefore, the object based metadata for the upper object, manipulated by device 13a, is just the information that this object is a "speech" object. The object based metadata for the other object, processed by item 13b, have the information that this second object is a surround object.

This basic object-related metadata for both objects might be sufficient for implementing an enhanced clean-audio mode, in which the speech object is amplified and the surround object is attenuated or, generally speaking, the speech object is amplified with respect to the surround object or the surround object is attenuated with respect to the speech object. The user, however, can preferably implement different processing modes on the receiver/decoder side, which can be programmed via a mode control input. These different modes can be a dialogue level mode, a compression mode, a downmix mode, an enhanced midnight mode, an enhanced clean-audio mode, a dynamic downmix mode, a guided upmix mode, a mode for the relocation of objects, etc.

Depending on the implementation, the different modes require different object based metadata in addition to the basic information indicating the kind or characteristic of an object, such as speech or surround. In the midnight mode, in which the dynamic range of an audio signal has to be compressed, it is preferred that, for each object such as the speech object and the surround object, either the actual level or the target level for the midnight mode is provided as metadata. When the actual level of the object is provided, then the receiver has to calculate the target level for the midnight mode. When, however, the target relative level is given, then the decoder/receiver-side processing is reduced.
In this implementation, each object has a time-varying ob-
ject based sequence of level information which are used by
a receiver to compress the dynamic range so that the level
differences within a single object are reduced. This, auto-
matically, results in a final audio signal, in which the
level differences from time to time are reduced as required
by a midnight mode implementation. For clean audio applica-
tions, a target level for the speech object can be provided
as well. Then, the surround object might be set to zero or
almost to zero in order to heavily emphasize the speech ob-
ject within the sound generated by a certain loudspeaker
setup. In a high fidelity application, which is the opposite of the midnight mode, the dynamic range of the object
or the dynamic range of the difference between the objects
could even be enhanced. In this implementation, it would be
preferred to provide target object gain levels, since these
target levels guarantee that, in the end, a sound is ob-
tained which is created by an artistic sound engineer
within a sound studio and, therefore, has the highest qual-
ity compared to an automatic or user defined setting.
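A minimal sketch of such a per-object dynamic range control, assuming one transmitted level value per block of samples and a simple static compression law (ratio > 1 compresses toward a reference level as in the midnight mode; 0 < ratio < 1 expands as in the high fidelity mode); all names and values are illustrative:

```python
import numpy as np

def object_drc(samples, level_db, ratio, hop=1024, ref_db=-20.0):
    # One transmitted level value applies to each block of `hop` samples.
    # ratio > 1 compresses level differences toward ref_db (midnight
    # mode); 0 < ratio < 1 expands them (high fidelity mode).
    out = np.asarray(samples, dtype=float).copy()
    for i, lvl in enumerate(level_db):
        gain_db = (ref_db - lvl) * (1.0 - 1.0 / ratio)
        block = slice(i * hop, (i + 1) * hop)
        out[block] *= 10.0 ** (gain_db / 20.0)
    return out
```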
In other implementations, in which the object based meta-
data relate to advanced downmixes, the object manipulation
includes a downmix that differs for specific rendering
setups. Then, the object based metadata is introduced into
the object downmixer blocks 19a to 19c in Fig. 3b or Fig.
4. In this implementation, the manipulator may include
blocks 19a to 19c, when an individual object downmix is
performed depending on the rendering setup. Specifically,
the object downmix blocks 19a to 19c can be set differently from each other. In this case, a speech object might be in-
troduced only into the center channel rather than in a left
or right channel, depending on the channel configuration.
Then, the downmixer blocks 19a to 19c might have different
numbers of component signal outputs. The downmix can also
be implemented dynamically.
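A sketch of such a rendering-setup-dependent object downmix; the channel layouts, gain values and the routing of a speech object exclusively to the center channel are illustrative assumptions:

```python
import numpy as np

def object_downmix(obj, kind, setup):
    # Route one object (a mono signal) to the output channels of the
    # given reproduction setup; the weights below are illustrative.
    if setup == "3.0":                                   # channels: L, C, R
        w = np.array([0.0, 1.0, 0.0]) if kind == "speech" \
            else np.array([0.7, 0.0, 0.7])
    elif setup == "stereo":                              # channels: L, R
        w = np.array([0.707, 0.707]) if kind == "speech" \
            else np.array([1.0, 1.0])
    else:
        raise ValueError("unknown reproduction setup")
    # one object component signal per output channel
    return w[:, None] * obj[None, :]
```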
Additionally, guided upmix information and information for
relocation of objects can be provided as well.
Subsequently, a summary of preferred ways of providing
metadata and the application of object-specific metadata is
given.
Audio objects may not be separated ideally as in a typical SAOC application. For the manipulation of audio, it may be sufficient to have a "mask" of the objects rather than a total separation.
This could lead to fewer/coarser parameters for object separation.
For the application called "midnight mode", the audio engineer needs to define all metadata parameters independently for each object, yielding, for example, a constant dialog volume but manipulated ambience noise ("enhanced midnight mode").
This may be also useful for people wearing hearing aids
("enhanced clean audio").
New downmix scenarios: Different separated objects may be treated differently for each specific downmix situation. For example, a 5.1-channel signal must be downmixed for a stereo home television system, while another receiver may even have only a mono playback system. Therefore, different objects may be treated in different ways (and all this is controlled by the sound engineer during production via the metadata provided by the sound engineer).
Downmixes to 3.0, etc., are also preferred.
The generated downmix will not be defined by a fixed global
parameter (set), but it may be generated from time-varying
object dependent parameters.
With new object based metadata, it is possible to perform a
guided upmix as well.
Objects may be placed at different positions, e.g. to make
the spatial image broader when ambience is attenuated. This
will help speech intelligibility for hearing-disabled peo-
ple.
The proposed method in this paper extends the existing
metadata concept implemented and mainly used in Dolby Co-
decs. Now, it is possible to apply the known metadata con-
cept not only to the whole audio stream, but to extracted
objects within this stream. This gives audio engineers and
artists much more flexibility, greater ranges of adjust-
ments and therefore better audio quality and enjoyment for
the listeners.
Figs. 12a, 12b illustrate different application scenarios
of the inventive concept. In a classical scenario, there are sports broadcasts on television, where one has the stadium atmosphere in all 5.1 channels, and where the speaker channel is mapped to the center channel. This "mapping" can be per-
formed by a straight-forward addition of the speaker chan-
nel to a center channel existing for the 5.1 channels car-
rying the stadium atmosphere. Now, the inventive process allows having such a center channel in the stadium atmosphere sound description. Then, the addition operation mixes the center channel from the stadium atmosphere and the speaker. By generating object parameters for the speaker and for the center channel from the stadium atmosphere, the present invention makes it possible to separate these two sound objects on the decoder-side and to enhance or attenuate the speaker or the center channel from the stadium atmosphere. A further scenario arises when there are two speakers. Such a situation may occur when two persons are commenting on one and the same soccer game. Specifically, when two speakers are speaking simultaneously, it might be useful to have these two speakers as separate objects and, additionally, to have them separate from the stadium atmosphere channels. In such an application, the 5.1 channels and the two speaker channels
can be processed as eight different audio objects or seven
different audio objects, when the low frequency enhancement
channel (sub-woofer channel) is neglected. Since the straightforward distribution infrastructure is adapted to a 5.1 channel sound signal, the seven (or eight) objects can be downmixed into a 5.1 channel downmix signal, and the object parameters can be provided in addition to the 5.1 downmix channels. On the receiver side, the objects can then be separated again and, since the object based metadata identify the speaker objects among the stadium atmosphere objects, an object-specific processing is possible before a final 5.1 channel downmix by the object mixer takes place on the receiver side.
In this scenario, one could also have a first object com-
prising the first speaker, a second object comprising the
second speaker and a third object comprising the complete
stadium atmosphere.
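The Fig. 12a/12b scenario can be sketched as follows, assuming the atmosphere is available as a 5.1 object and the commentators as separate mono objects; the gain handling is illustrative, and the accompanying object parameters needed for receiver-side re-separation are not shown:

```python
import numpy as np

CHANNELS = ["L", "R", "C", "LFE", "Ls", "Rs"]

def stadium_downmix(atmosphere, commentators, gain_db=0.0):
    # atmosphere: array of shape (6, n) holding the 5.1 stadium object;
    # commentators: list of mono arrays of shape (n,).
    # The commentator objects are added to the centre channel so that the
    # result fits the existing 5.1 distribution infrastructure; object
    # parameters (not shown) would let the receiver re-separate them.
    mix = atmosphere.copy()
    g = 10.0 ** (gain_db / 20.0)
    for speaker in commentators:
        mix[CHANNELS.index("C")] += g * speaker
    return mix
```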
Subsequently, different implementations of object based downmix scenarios are discussed in the context of Figs. 11a to 11c.
When, for example, the sound generated by the Fig. 12a or
12b scenario has to be replayed on a conventional 5.1 play-
back system, then the embedded metadata stream can be dis-
regarded and the received stream can be played as it is.
When, however, playback has to take place on a stereo speaker setup, a downmix from 5.1 to stereo has to take place. If the surround channels are just added to left/right, the moderators may be at a level that is too low. Therefore, it is preferred to reduce the atmosphere level, before or after the downmix, before the moderator object is (re-)added.
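A sketch of this downmix strategy, assuming ITU-style stereo downmix weights and an illustrative 6 dB atmosphere attenuation; neither value is prescribed by the text:

```python
import numpy as np

def stereo_downmix_with_attenuated_atmosphere(atmosphere, moderator,
                                              atmo_att_db=-6.0):
    # atmosphere: (6, n) array in L, R, C, LFE, Ls, Rs order; the LFE
    # channel is discarded in this sketch. The atmosphere is attenuated
    # before the moderator object is (re-)added to left and right.
    L, R, C, LFE, Ls, Rs = atmosphere
    a = 10.0 ** (atmo_att_db / 20.0)
    left = a * (L + 0.707 * C + 0.707 * Ls)
    right = a * (R + 0.707 * C + 0.707 * Rs)
    return np.stack([left + 0.707 * moderator,
                     right + 0.707 * moderator])
```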
Hearing-impaired people may want to reduce the atmosphere level to obtain better speech intelligibility while still having both speakers separated in left/right. This exploits the "cocktail party effect": one hears her or his name and then concentrates on the direction from which the name was heard. This direction-specific concentration will, from a psychoacoustic point of view, attenuate the sound coming from different directions. Therefore, a sharp location of a specific object, such as the speaker placed on left or right, or on both left and right so that the speaker appears in the middle between left and right, might increase intelligibility. To this end, the input audio stream is preferably divided into separate objects, where the objects have a ranking in the metadata saying that an object is more or less important. Then, the level difference between them can be adjusted in accordance with the metadata, or the object position can be relocated to increase intelligibility in accordance with the metadata.
To achieve this goal, metadata are applied not to the transmitted signal as a whole but to single separable audio objects, before or after the object downmix as the case may be. The present invention no longer requires that objects be limited to spatial channels so that these channels can be individually manipulated. Instead, the inventive object based metadata concept does not require a specific object to reside in a specific channel; objects can be downmixed to several channels and can still be individually manipulated.
Fig. 11a illustrates a further implementation of a preferred embodiment. The object downmixer 16 generates m output channels out of k x n input channels, where k is the number of objects and where n channels are generated per object. Fig. 11a corresponds to the scenario of Figs. 3a, 3b, where the manipulation 13a, 13b, 13c takes place before the object downmix.
Fig. 11a furthermore comprises level manipulators 19d, 19e, 19f, which can be implemented without a metadata control. Alternatively, however, these level manipulators can be controlled by object based metadata as well, so that the level modification implemented by blocks 19d to 19f is also part of the object manipulator 13 of Fig. 1. The same is true for the downmix operations 19a to 19c, when these downmix operations are controlled by the object based metadata. This case, however, is not illustrated in Fig. 11a, but could be implemented as well, when the object based metadata are forwarded to the downmix blocks 19a to 19c. In the latter case, these blocks would also be part of the object manipulator 13 of Fig. 11a, and the remaining functionality of the object mixer 16 is implemented by the output-channel-wise combination of the manipulated object component signals for the corresponding output channels. Fig. 11a furthermore comprises a dialogue normalization functionality 25, which may be implemented with conventional metadata, since this dialogue normalization does not take place in the object domain but in the output channel domain.
Fig. 11b illustrates an implementation of an object based 5.1-stereo-downmix. Here, the downmix is performed before the manipulation and, therefore, Fig. 11b corresponds to the scenario of Fig. 4. The level modification 13a, 13b is performed by object based metadata where, for example, the upper branch corresponds to a speech object and the lower branch corresponds to a surround object or, for the example in Figs. 12a, 12b, the upper branch corresponds to one or both speakers and the lower branch corresponds to all surround information. Then, the level manipulator blocks 13a, 13b would manipulate both objects based on fixedly set parameters so that the object based metadata would just be an identification of the objects, but the level manipulators 13a, 13b could also manipulate the levels based on target levels or on actual levels provided by the metadata 14. Therefore, to generate a stereo downmix for multichannel input, a downmix formula for each object is applied and the objects are weighted by a given level before remixing them to an output signal again.
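The per-object downmix formula with metadata-driven level weighting can be written compactly as follows; shapes and names are illustrative:

```python
import numpy as np

def weighted_object_remix(objects, downmix_matrices, gains_db):
    # objects: list of arrays of shape (n_in, n_samples);
    # downmix_matrices: one (n_out, n_in) matrix per object (its
    # individual downmix formula); gains_db: one metadata level weight
    # per object. The weighted contributions are summed per channel.
    out = None
    for obj, D, g_db in zip(objects, downmix_matrices, gains_db):
        contribution = 10.0 ** (g_db / 20.0) * (D @ obj)
        out = contribution if out is None else out + contribution
    return out
```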
For clean audio applications as illustrated in Fig. 11c, an importance level is transmitted as metadata to enable a reduction of less important signal components. Then, the upper branch would correspond to the important components, which are amplified, while the lower branch might correspond to the less important components, which can be attenuated. How the specific attenuation and/or amplification of the different objects is performed can be fixedly set by a receiver but can also be controlled, in addition, by object based metadata as implemented by the "dry/wet" control 14 in Fig. 11c.
Generally, a dynamic range control can be performed in the object domain, similar to the AAC dynamic range control implementation, as a multi-band compression. The object based metadata can even be frequency-selective data so that a frequency-selective compression similar to an equalizer implementation is performed.
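A sketch of a frequency-selective, object-based manipulation in this spirit, using a simple two-band split with static per-band gains in place of a full multi-band compressor; the split frequency and the gains are illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def two_band_object_gain(obj, fs, split_hz=2000.0, gains_db=(0.0, -6.0)):
    # Split the object signal at split_hz and give each band its own
    # metadata-driven gain; a full implementation would compute proper
    # DRC gains per band instead of the static values used here.
    low = sosfilt(butter(4, split_hz, btype="lowpass", fs=fs,
                         output="sos"), obj)
    high = sosfilt(butter(4, split_hz, btype="highpass", fs=fs,
                          output="sos"), obj)
    g_low, g_high = (10.0 ** (g / 20.0) for g in gains_db)
    return g_low * low + g_high * high
```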
As stated before, a dialogue normalization is preferably
performed subsequent to the downmix, i.e., in the downmix
signal. The downmixing should, in general, be able to proc-
ess k objects with n input channels into m output channels.
It is not necessarily important to separate objects into
discrete objects. It may be sufficient to "mask out" signal
components which are to be manipulated. This is similar to
editing masks in image processing. Then, a generalized "ob-
ject" is a superposition of several original objects, where
this superposition includes a number of objects which is
smaller than the total number of original objects. All ob-
jects are again added up at a final stage. There might be no interest in separated single objects, and for some objects, the level value may be set to 0 (i.e., a high negative dB figure) when a certain object has to be removed completely, such as for karaoke applications, where one might be interested in completely removing the vocal object so that the karaoke singer can introduce her or his own vocals to the remaining instrumental objects.
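The karaoke case thus reduces to applying a practically-zero level to the vocal object before the final summation; a minimal sketch with illustrative names:

```python
def karaoke_mix(objects, kinds, vocal_gain_db=-100.0):
    # Apply a practically-zero level (a high negative dB figure) to the
    # vocal object and add all remaining objects up unchanged.
    out = None
    for obj, kind in zip(objects, kinds):
        g = 10.0 ** (vocal_gain_db / 20.0) if kind == "vocal" else 1.0
        out = g * obj if out is None else out + g * obj
    return out
```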
Other preferred applications of the invention are, as stated before, an enhanced midnight mode, where the dynamic range of single objects can be reduced, or a high fidelity mode, where the dynamic range of objects is expanded. In this context, the transmitted signal may be compressed and it is intended to invert this compression. The application of a dialogue normalization is mainly preferred to take place for the total signal as output to the speakers, but a non-linear attenuation/amplification for different objects is useful, when the dialogue normalization is adjusted. In addition to the parametric data for separating the different audio objects from the object downmix signal, it is preferred to transmit, for each object and for the sum signal, in addition to the classical metadata related to the sum signal, level values for the downmix, importance values indicating an importance level for clean audio, an object identification, actual absolute or relative levels as time-varying information, or absolute or relative target levels as time-varying information, etc.
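The per-object side information enumerated above might be collected in a structure like the following; the field names are illustrative and do not represent a normative bitstream syntax:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObjectMetadata:
    object_id: int
    kind: str                                  # e.g. "speech", "surround"
    downmix_levels_db: List[float] = field(default_factory=list)
    importance: Optional[int] = None           # importance level, clean audio
    actual_levels_db: List[float] = field(default_factory=list)   # time-varying
    target_levels_db: List[float] = field(default_factory=list)   # time-varying
```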
The described embodiments are merely illustrative for the
principles of the present invention. It is understood that
modifications and variations of the arrangements and the
details described herein will be apparent to others
skilled in the art. It is the intent, therefore, to be
limited only by the scope of the appended patent claims
and not by the specific details presented by way of de-
scription and explanation of the embodiments herein.
Depending on certain implementation requirements of the in-
ventive methods, the inventive methods can be implemented
in hardware or in software. The implementation can be per-
formed using a digital storage medium, in particular, a
disc, a DVD or a CD having electronically-readable control
signals stored thereon, which co-operate with programmable
computer systems such that the inventive methods are per-
formed. Generally, the present invention is therefore a
computer program product with a program code stored on a
machine-readable carrier, the program code being operative for performing the inventive methods when the computer pro-
gram product runs on a computer. In other words, the inven-
tive methods are, therefore, a computer program having a
program code for performing at least one of the inventive
methods when the computer program runs on a computer.
References
[1] ISO/IEC 13818-7: MPEG-2 (Generic coding of moving pictures and associated audio information) - Part 7: Advanced Audio Coding (AAC)
[2] ISO/IEC 23003-1: MPEG-D (MPEG audio technologies) - Part 1: MPEG Surround
[3] ISO/IEC 23003-2: MPEG-D (MPEG audio technologies) - Part 2: Spatial Audio Object Coding (SAOC)
[4] ISO/IEC 13818-7: MPEG-2 (Generic coding of moving pictures and associated audio information) - Part 7: Advanced Audio Coding (AAC)
[5] ISO/IEC 14496-11: MPEG-4 (Coding of audio-visual objects) - Part 11: Scene Description and Application Engine (BIFS)
[6] ISO/IEC 14496-20: MPEG-4 (Coding of audio-visual objects) - Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
[7] http://www.dolby.com/assets/pdf/techlibrary/17_AllMetadata.pdf
[8] http://www.dolby.com/assets/pdf/tech_library/18_Metadata.Guide.pdf
[9] Krauss, Kurt; Roden, Jonas; Schildbach, Wolfgang: Transcoding of Dynamic Range Control Coefficients and Other Metadata into MPEG-4 HE-AAC, AES Convention 123, October 2007, Preprint 7217
[10] Robinson, Charles Q.; Gundry, Kenneth: Dynamic Range Control via Metadata, AES Convention 102, September 1999, Preprint 5028
[11] Dolby, "Standards and Practices for Authoring Dolby Digital and Dolby E Bitstreams", Issue 3
[14] Coding Technologies/Dolby, "Dolby E / aacPlus Metadata Transcoder Solution for aacPlus Multichannel Digital Video Broadcast (DVB)", V1.1.0
[15] ETSI TS 101 154: Digital Video Broadcasting (DVB), V1.8.1
[16] SMPTE RDD 6-2008: Description and Guide to the Use of Dolby E audio Metadata Serial Bitstream
Claims
1. Apparatus for generating at least one audio output
signal representing a superposition of at least two
different audio objects, comprising:
a processor for processing an audio input signal to
provide an object representation of the audio input
signal, in which the at least two different audio ob-
jects are separated from each other, the at least two
different audio objects are available as separate au-
dio object signals, and the at least two different au-
dio objects are manipulatable independently from each
other;
an object manipulator for manipulating the audio ob-
ject signal or a mixed audio object signal of at least
one audio object based on audio object based metadata
referring to the at least one audio object to obtain a
manipulated audio object signal or a manipulated mixed
audio object signal for the at least one audio object;
and
an object mixer for mixing the object representation
by combining the manipulated audio object with an un-
modified audio object or with a manipulated different
audio object manipulated in a different way than the at least one audio object.
2. Apparatus in accordance with claim 1, which is adapted
to generate m output signals, m being an integer
greater than 1,
wherein the processor is operative to provide an ob-
ject representation having k audio objects, k being an
integer and greater than m,
wherein the object manipulator is adapted to manipu-
late at least two objects different from each other
based on metadata associated with at least one object
of the at least two objects, and
wherein the object mixer is operative to combine the
manipulated audio signals of the at least two differ-
ent objects to obtain the m output signals so that
each output signal is influenced by the manipulated
audio signals of the at least two different objects.
3. Apparatus in accordance with claim 1,
in which the processor is adapted to receive the input
signal, the input signal being a downmixed representa-
tion of a plurality of original audio objects,
in which the processor is adapted to receive audio ob-
ject parameters for controlling a reconstruction algo-
rithm for reconstructing an approximated representa-
tion of the original audio objects, and
in which the processor is adapted to conduct the re-
construction algorithm using the input signal and the
audio object parameters to obtain the object represen-
tation comprising audio object signals being an ap-
proximation of audio object signals of the original
audio objects.
4. Apparatus in accordance with claim 1,
in which the audio input signal is a downmixed repre-
sentation of a plurality of original audio objects and
comprises, as side information, object based metadata
having information on one or more audio objects in-
cluded in the downmix representation, and
in which the object manipulator is adapted to extract
the object based metadata from the audio input signal.
5. Apparatus in accordance with claim 3, in which the au-
dio input signal comprises, as side information, the
audio object parameters, and in which the processor is
adapted to extract the side information from the audio
input signal.
6. Apparatus in accordance with claim 1,
in which the object manipulator is operative to ma-
nipulate the audio object signal, and
in which the object mixer is operative to apply a
downmix rule for each object based on a rendering po-
sition for the object and a reproduction setup to ob-
tain an object component signal for each audio output
signal, and
wherein the object mixer is adapted to add object com-
ponent signals from different objects for the same
output channel to obtain the audio output signal for
the output channel.
7. Apparatus in accordance with claim 1, in which the ob-
ject manipulator is operative to manipulate each of a
plurality of object component signals in the same man-
ner based on metadata for the object to obtain object
component signals for the audio object, and
in which the object mixer is adapted to add the object
component signals from different objects for the same
output channel to obtain the audio output signal for
the output channel.
8. Apparatus in accordance with claim 1, further compris-
ing an output signal mixer for mixing the audio output
signal obtained based on a manipulation of at least
one audio object and a corresponding audio output sig-
nal obtained without the manipulation of the at least
one audio object.
9. Apparatus in accordance with claim 1, in which the
metadata comprises the information on a gain, a com-
pression, a level, a downmix setup or a characteristic
specific for a certain object, and
wherein the object manipulator is adapted to manipu-
late the object or other objects based on the metadata
to implement, in an object specific way, a midnight
mode, a high fidelity mode, a clean audio mode, a dia-
logue normalization, a downmix specific manipulation,
a dynamic downmix, a guided upmix, a relocation of
speech objects or an attenuation of an ambience ob-
ject.
10. Apparatus in accordance with claim 1, in which the ob-
ject parameters comprise, for a plurality of time por-
tions of an object audio signal, parameters for each
band of a plurality of frequency bands in the respec-
tive time portion, and
wherein the metadata only include non-frequency-
selective information for an audio object.
11. Apparatus for generating an encoded audio signal rep-
resenting a superposition of at least two different
audio objects, comprising:
a data stream formatter for formatting a data stream
so that the data stream comprises an object downmix
signal representing a combination of the at least two
different audio objects, and, as side information,
metadata referring to at least one of the different
audio objects.
12. Apparatus in accordance with claim 11, wherein the
data stream formatter is operative to additionally in-
troduce, as side information, parametric data allowing
an approximation of the at least two different audio
objects, into the data stream.
13. Apparatus in accordance with claim 11, the apparatus
further comprising a parameter calculator for calcu-
lating parametric data for an approximation of the at
least two different audio objects, a downmixer for
downmixing the at least two different audio objects to
obtain the downmix signal, and an input for metadata
individually relating to the at least two different
audio objects.
14. Method of generating at least one audio output signal
representing a superposition of at least two different
audio objects, comprising:
processing an audio input signal to provide an object
representation of the audio input signal, in which the
at least two different audio objects are separated
from each other, the at least two different audio ob-
jects are available as separate audio object signals,
and the at least two different audio objects are ma-
nipulatable independently from each other;
manipulating the audio object signal or a mixed audio
object signal of at least one audio object based on
audio object based metadata referring to the at least
one audio object to obtain a manipulated audio object
signal or a manipulated mixed audio object signal for
the at least one audio object; and
mixing the object representation by combining the ma-
nipulated audio object with an unmodified audio object
or with a manipulated different audio object manipu-
lated in a different way than the at least one audio object.
15. Method of generating an encoded audio signal repre-
senting a superposition of at least two different au-
dio objects, comprising:
formatting a data stream so that the data stream com-
prises an object downmix signal representing a combi-
nation of the at least two different audio objects,
and, as side information, metadata referring to at
least one of the different audio objects.
16. Computer program for performing, when being executed
on a computer, a method for generating at least one
audio output signal in accordance with claim 14 or a
method for generating an encoded audio signal in ac-
cordance with claim 15.


Application Documents

# Name Date
1 198-KOLNP-2011-RELEVANT DOCUMENTS [05-09-2023(online)].pdf 2023-09-05
2 198-KOLNP-2011-RELEVANT DOCUMENTS [08-09-2022(online)].pdf 2022-09-08
3 198-KOLNP-2011-RELEVANT DOCUMENTS [24-09-2021(online)].pdf 2021-09-24
4 198-KOLNP-2011-RELEVANT DOCUMENTS [22-02-2020(online)].pdf 2020-02-22
5 198-KOLNP-2011-RELEVANT DOCUMENTS [06-02-2019(online)].pdf 2019-02-06
6 198-KOLNP-2011-IntimationOfGrant27-12-2017.pdf 2017-12-27
7 198-KOLNP-2011-PatentCertificate27-12-2017.pdf 2017-12-27
8 198-KOLNP-2011-Written submissions and relevant documents (MANDATORY) [14-10-2017(online)].pdf 2017-10-14
9 198-KOLNP-2011-HearingNoticeLetter.pdf 2017-09-13
10 198-KOLNP-2011-Information under section 8(2) (MANDATORY) [02-09-2017(online)].pdf 2017-09-02
11 Abstract [25-03-2017(online)].pdf 2017-03-25
12 Claims [25-03-2017(online)].pdf 2017-03-25
13 Correspondence [25-03-2017(online)].pdf 2017-03-25
14 Description(Complete) [25-03-2017(online)].pdf 2017-03-25
15 Description(Complete) [25-03-2017(online)].pdf_1095.pdf 2017-03-25
16 Drawing [25-03-2017(online)].pdf 2017-03-25
17 Examination Report Reply Recieved [25-03-2017(online)].pdf 2017-03-25
18 Other Document [25-03-2017(online)].pdf 2017-03-25
19 Petition Under Rule 137 [25-03-2017(online)].pdf 2017-03-25
20 Petition Under Rule 137 [25-03-2017(online)].pdf_826.pdf 2017-03-25
21 198-KOLNP-2011_EXAMREPORT.pdf 2016-06-30
22 198-KOLNP-2011-(14-05-2012)-PA.pdf 2012-05-14
23 198-KOLNP-2011-(14-05-2012)-CORRESPONDENCE.pdf 2012-05-14
24 198-kolnp-2011-specification.pdf 2011-10-06
25 198-kolnp-2011-pct request form.pdf 2011-10-06
26 198-kolnp-2011-pct priority document notification.pdf 2011-10-06
27 198-kolnp-2011-international search report.pdf 2011-10-06
28 198-kolnp-2011-international publication.pdf 2011-10-06
29 198-KOLNP-2011-IPRB.pdf 2011-10-06
30 198-kolnp-2011-form-5.pdf 2011-10-06
31 198-kolnp-2011-form-3.pdf 2011-10-06
32 198-kolnp-2011-form-2.pdf 2011-10-06
33 198-kolnp-2011-form-1.pdf 2011-10-06
34 198-KOLNP-2011-FORM 3-1.1.pdf 2011-10-06
35 198-KOLNP-2011-FORM 18.pdf 2011-10-06
36 198-kolnp-2011-drawings.pdf 2011-10-06
37 198-kolnp-2011-description (complete).pdf 2011-10-06
38 198-kolnp-2011-correspondence.pdf 2011-10-06
39 198-KOLNP-2011-CORRESPONDENCE-1.2.pdf 2011-10-06
40 198-KOLNP-2011-CORRESPONDENCE 1.1.pdf 2011-10-06
41 198-kolnp-2011-claims.pdf 2011-10-06
42 198-kolnp-2011-abstract.pdf 2011-10-06
43 abstract-198-kolnp-2011.jpg 2011-10-06

ERegister / Renewals

3rd: 20 Jan 2018 (06/07/2011 - 06/07/2012)
4th: 20 Jan 2018 (06/07/2012 - 06/07/2013)
5th: 20 Jan 2018 (06/07/2013 - 06/07/2014)
6th: 20 Jan 2018 (06/07/2014 - 06/07/2015)
7th: 20 Jan 2018 (06/07/2015 - 06/07/2016)
8th: 20 Jan 2018 (06/07/2016 - 06/07/2017)
9th: 20 Jan 2018 (06/07/2017 - 06/07/2018)
10th: 20 Jan 2018 (06/07/2018 - 06/07/2019)
11th: 24 Jun 2019 (06/07/2019 - 06/07/2020)
12th: 23 Jun 2020 (06/07/2020 - 06/07/2021)
13th: 23 Jun 2021 (06/07/2021 - 06/07/2022)
14th: 27 Jun 2022 (06/07/2022 - 06/07/2023)
15th: 26 Jun 2023 (06/07/2023 - 06/07/2024)
16th: 20 Jun 2024 (06/07/2024 - 06/07/2025)
17th: 05 Jul 2025 (06/07/2025 - 06/07/2026)