System, Apparatus and Method for Consistent Acoustic Scene Reproduction
Based on Adaptive Functions
Description
The present invention relates to audio signal processing, and, in particular, to a system,
an apparatus and a method for consistent acoustic scene reproduction based on informed
spatial filtering.
In spatial sound reproduction the sound at the recording location (near-end side) is
captured with multiple microphones and then reproduced at the reproduction side (far-end
side) using multiple loudspeakers or headphones. In many applications, it is desired to
reproduce the recorded sound such that the spatial image recreated at the far-end side is
consistent with the original spatial image at the near-end side. This means for instance
that the sound of the sound sources is reproduced from the directions where the sources
were present in the original recording scenario. Alternatively, when for instance a video is
complementing the recorded audio, it is desirable that the sound is reproduced such that
the recreated acoustical image is consistent with the video image. This means for
instance that the sound of a sound source is reproduced from the direction where the
source is visible in the video. Additionally, the video camera may be equipped with a
visual zoom function or the user at the far-end side may apply a digital zoom to the video
which would change the visual image. In this case, the acoustical image of the reproduced
spatial sound should change accordingly. In many cases, the spatial image to which the
reproduced sound should be consistent is determined either at the far-end side or during
playback, for instance when a video image is involved.
Consequently, the spatial sound at the near-end side must be recorded, processed, and
transmitted such that at the far-end side we can still control the recreated acoustical
image.
The possibility to reproduce a recorded acoustical scene consistently with a desired
spatial image is required in many modern applications. For instance modern consumer
devices such as digital cameras or mobile phones are often equipped with a video camera
and multiple microphones. This enables recording videos together with spatial sound, e.g.,
stereo sound. When reproducing the recorded audio together with the video, it is desired
that the visual and acoustical images are consistent. When the user zooms in with the
camera, it is desirable to recreate the visual zooming effect acoustically so that the visual
and acoustical images are aligned when watching the video. For instance, when the user
zooms in on a person, the voice of this person should become less reverberant as the
person appears to be closer to the camera. Moreover, the voice of the person should be
reproduced from the same direction where the person appears in the visual image.
Mimicking the visual zoom of a camera acoustically is referred to as acoustical zoom in
the following and represents one example of a consistent audio-video reproduction. The
consistent audio-video reproduction which may involve an acoustical zoom is also useful
in teleconferencing, where the spatial sound at the near-end side is reproduced at the
far-end side together with a visual image. Moreover, it is desirable to recreate the visual
zooming effect acoustically so that the visual and acoustical images are aligned.
The first implementation of an acoustical zoom was presented in [1], where the zooming
effect was obtained by increasing the directivity of a second-order directional microphone,
whose signal was generated based on the signals of a linear microphone array. This
approach was extended in [2] to a stereo zoom. A more recent approach for a mono or
stereo zoom was presented in [3], which consists in changing the sound source levels
such that the source from the frontal direction was preserved, whereas the sources
coming from other directions and the diffuse sound were attenuated. The approaches
proposed in [1, 2] result in an increase of the direct-to-reverberation ratio (DRR) and the
approach in [3] additionally allows for the suppression of undesired sources. The
aforementioned approaches assume that the sound sources are located in front of the camera, and
do not aim to capture the acoustical image that is consistent with the video image.
A well-known approach for a flexible spatial sound recording and reproduction is
represented by directional audio coding (DirAC) [4]. In DirAC, the spatial sound at the
near-end side is described in terms of an audio signal and parametric side information,
namely the direction-of-arrival (DOA) and diffuseness of the sound. The parametric
description enables the reproduction of the original spatial image with arbitrary
loudspeaker setups. This means that the recreated spatial image at the far-end side is
consistent with the spatial image during recording at the near-end side. However, if for
instance a video is complementing the recorded audio, then the reproduced spatial sound
is not necessarily aligned to the video image. Moreover, the recreated acoustical image
cannot be adjusted when the visual image changes, e.g., when the look direction and
zoom of the camera is changed. This means that DirAC provides no possibility to adjust
the recreated acoustical image to an arbitrary desired spatial image.
In [5], an acoustical zoom was realized based on DirAC. DirAC represents a reasonable
basis to realize an acoustical zoom as it is based on a simple yet powerful signal model
assuming that the sound field in the time-frequency domain is composed of a single plane
wave plus diffuse sound. The underlying model parameters, e.g., the DOA and
diffuseness, are exploited to separate the direct sound and diffuse sound and to create
the acoustical zoom effect. The parametric description of the spatial sound enables an
efficient transmission of the sound scene to the far-end side while still providing the user
full control over the zoom effect and spatial sound reproduction. Even though DirAC
employs multiple microphones to estimate the model parameters, only single-channel
filters are applied to extract the direct sound and diffuse sound, limiting the quality of the
reproduced sound. Moreover, all sources in the sound scene are assumed to be
positioned on a circle and the spatial sound reproduction is performed with reference to a
changing position of an audio-visual camera, which is inconsistent with the visual zoom.
In fact, zooming changes the view angle of the camera while the distance to the visual
objects and their relative positions in the image remain unchanged, which is in contrast to
moving a camera.
A related approach is the so-called virtual microphone (VM) technique [6,7] which
considers the same signal model as DirAC but allows synthesizing the signal of a
non-existing (virtual) microphone at an arbitrary position in the sound scene. Moving the VM
towards a sound source is analogous to the movement of the camera to a new position.
The VM was realized using multi-channel filters to improve the sound quality, but requires
several distributed microphone arrays to estimate the model parameters.
However, it would be highly appreciated if further improved concepts for audio signal
processing were provided.
Thus, the object of the present invention is to provide improved concepts for audio signal
processing. The object of the present invention is solved by a system according to claim
1, by an apparatus according to claim 14, by a method according to claim 15, by a method
according to claim 16 and by a computer program according to claim 17.
A system for generating one or more audio output signals is provided. The system
comprises a decomposition module, a signal processor, and an output interface. The
decomposition module is configured to receive two or more audio input signals, wherein
the decomposition module is configured to generate a direct component signal,
comprising direct signal components of the two or more audio input signals, and wherein
the decomposition module is configured to generate a diffuse component signal,
comprising diffuse signal components of the two or more audio input signals. The signal
processor is configured to receive the direct component signal, the diffuse component
signal and direction information, said direction information depending on a direction of
arrival of the direct signal components of the two or more audio input signals. Moreover,
the signal processor is configured to generate one or more processed diffuse signals
depending on the diffuse component signal. For each audio output signal of the one or
more audio output signals, the signal processor is configured to determine, depending on
the direction of arrival, a direct gain, the signal processor is configured to apply said direct
gain on the direct component signal to obtain a processed direct signal, and the signal
processor is configured to combine said processed direct signal and one of the one or
more processed diffuse signals to generate said audio output signal. The output interface
is configured to output the one or more audio output signals. The signal processor
comprises a gain function computation module for calculating one or more gain functions,
wherein each gain function of the one or more gain functions comprises a plurality of gain
function argument values, wherein a gain function return value is assigned to each of said
gain function argument values, and wherein, when said gain function receives one of said
gain function argument values, said gain function is configured to return the gain
function return value being assigned to said one of said gain function argument values.
Moreover, the signal processor further comprises a signal modifier for selecting,
depending on the direction of arrival, a direction dependent argument value from the gain
function argument values of a gain function of the one or more gain functions, for
obtaining the gain function return value being assigned to said direction dependent
argument value from said gain function, and for determining the gain value of at least one
of the one or more audio output signals depending on said gain function return value
obtained from said gain function.
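By way of a non-limiting illustration, the processing described above, i.e., determining a direction-dependent direct gain, applying it to the direct component signal, and combining the result with a processed diffuse signal, may, e.g., be sketched as follows. The function and parameter names (generate_output, direct_gain_fn, diffuse_gain) are assumptions introduced only for this sketch and do not appear in the text.

```python
import math

def generate_output(direct, diffuse, doa, direct_gain_fn, diffuse_gain):
    """Combine a direct and a diffuse component signal into one audio
    output signal. `direct` and `diffuse` hold per-sample (or per
    time-frequency-bin) values; `direct_gain_fn` is a gain function
    evaluated at the direction of arrival `doa` (hypothetical names).
    """
    g = direct_gain_fn(doa)                          # direction-dependent direct gain
    processed_direct = [g * s for s in direct]       # apply the direct gain
    processed_diffuse = [diffuse_gain * s for s in diffuse]
    # weighted sum of processed direct and processed diffuse sound
    return [d + q for d, q in zip(processed_direct, processed_diffuse)]
```

For example, with a squared-cosine direct gain and a fixed diffuse gain of 0.5, a direct sound arriving from 30 degrees is weighted by cos²(30°) = 0.75 before being summed with the attenuated diffuse part.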
According to an embodiment, the gain function computation module may, e.g., be
configured to generate a lookup table for each gain function of the one or more gain
functions, wherein the lookup table comprises a plurality of entries, wherein each of the
entries of the lookup table comprises one of the gain function argument values and the
gain function return value being assigned to said gain function argument value, wherein
the gain function computation module may, e.g., be configured to store the lookup table of
each gain function in persistent or non-persistent memory, and wherein the signal modifier
may, e.g., be configured to obtain the gain function return value being assigned to said
direction dependent argument value by reading out said gain function return value from
one of the one or more lookup tables being stored in the memory.
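The lookup-table embodiment may, e.g., be sketched as follows: the gain function is tabulated once over its argument values (here DOA angles in degrees, an assumed discretization), and the signal modifier later reads out the return value assigned to the argument value closest to the current direction of arrival.

```python
def build_lookup_table(gain_fn, step_deg=1):
    """Tabulate a gain function over DOA argument values in degrees
    (the 1-degree grid from -180 to 180 is an illustrative choice)."""
    return [(a, gain_fn(a)) for a in range(-180, 181, step_deg)]

def read_gain(table, doa_deg):
    """Read out the gain function return value assigned to the tabulated
    argument value closest to the given direction of arrival."""
    _, gain = min(table, key=lambda entry: abs(entry[0] - doa_deg))
    return gain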
In an embodiment, the signal processor may, e.g., be configured to determine two or more
audio output signals, wherein the gain function computation module may, e.g., be
configured to calculate two or more gain functions, wherein, for each audio output signal
of the two or more audio output signals, the gain function computation module may, e.g.,
be configured to calculate a panning gain function being assigned to said audio output
signal as one of the two or more gain functions, wherein the signal modifier may, e.g., be
configured to generate said audio output signal depending on said panning gain function.
According to an embodiment, the panning gain function of each of the two or more audio
output signals may, e.g., have one or more global maxima, being one of the gain function
argument values of said panning gain function, wherein for each of the one or more global
maxima of said panning gain function, no other gain function argument value exists for
which said panning gain function returns a greater gain function return value than for said
global maxima, and wherein, for each pair of a first audio output signal and a second
audio output signal of the two or more audio output signals, at least one of the one or
more global maxima of the panning gain function of the first audio output signal may, e.g.,
be different from any of the one or more global maxima of the panning gain function of the
second audio output signal.
According to an embodiment, for each audio output signal of the two or more audio output
signals, the gain function computation module may, e.g., be configured to calculate a
window gain function being assigned to said audio output signal as one of the two or more
gain functions, wherein the signal modifier may, e.g., be configured to generate said audio
output signal depending on said window gain function, and wherein, if the argument value
of said window gain function is greater than a lower window threshold and smaller than an
upper window threshold, the window gain function is configured to return a gain function
return value being greater than any gain function return value returned by said window
gain function, if the window function argument value is smaller than the lower threshold, or
greater than the upper threshold.
In an embodiment, the window gain function of each of the two or more audio output
signals has one or more global maxima, being one of the gain function argument values of
said window gain function, wherein for each of the one or more global maxima of said
window gain function, no other gain function argument value exists for which said window
gain function returns a greater gain function return value than for said global maxima, and
wherein, for each pair of a first audio output signal and a second audio output signal of the
two or more audio output signals, at least one of the one or more global maxima of the
window gain function of the first audio output signal may, e.g., be equal to one of the one
or more global maxima of the window gain function of the second audio output signal.
According to an embodiment, the gain function computation module may, e.g., be
configured to further receive orientation information indicating an angular shift of a look
direction with respect to the direction of arrival, and wherein the gain function computation
module may, e.g., be configured to generate the panning gain function of each of the
audio output signals depending on the orientation information.
In an embodiment, the gain function computation module may, e.g., be configured to
generate the window gain function of each of the audio output signals depending on the
orientation information.
According to an embodiment, the gain function computation module may, e.g., be
configured to further receive zoom information, wherein the zoom information indicates an
opening angle of a camera, and wherein the gain function computation module may, e.g.,
be configured to generate the panning gain function of each of the audio output signals
depending on the zoom information.
In an embodiment, the gain function computation module may, e.g., be configured to
generate the window gain function of each of the audio output signals depending on the
zoom information.
According to an embodiment, the gain function computation module may, e.g., be
configured to further receive a calibration parameter for aligning a visual image and an
acoustical image, and wherein the gain function computation module may, e.g., be
configured to generate the panning gain function of each of the audio output signals
depending on the calibration parameter.
In an embodiment, the gain function computation module may, e.g., be configured to
generate the window gain function of each of the audio output signals depending on the
calibration parameter.
A system according to one of the preceding claims, the gain function computation module
may, e.g., be configured to receive information on a visual image, and the gain function
computation module may, e.g., be configured to generate, depending on the information
on a visual image, a blurring function returning complex gains to realize perceptual
spreading of a sound source.
Moreover, an apparatus for generating one or more audio output signals is provided. The
apparatus comprises a signal processor and an output interface. The signal processor is
configured to receive a direct component signal, comprising direct signal components of
the two or more original audio signals, wherein the signal processor is configured to
receive a diffuse component signal, comprising diffuse signal components of the two or
more original audio signals, and wherein the signal processor is configured to receive
direction information, said direction information depending on a direction of arrival of the
direct signal components of the two or more audio input signals. Moreover, the signal
processor is configured to generate one or more processed diffuse signals depending on
the defuse component signal. For each audio output signal of the one or more audio
output signals, the signal processor is configured to determine, depending on the direction
of arrival, a direct gain, the signal processor is configured to apply said direct gain on the
direct component signal to obtain a processed direct signal, and the signal processor is
configured to combine said processed direct signal and one of the one or more processed
diffuse signals to generate said audio output signal. The output interface is configured to
output the one or more audio output signals. The signal processor comprises a gain
function computation module for calculating one or more gain functions, wherein each
gain function of the one or more gain functions, comprises a plurality of gain function
argument values, wherein a gain function return value is assigned to each of said gain
function argument values, wherein, when said gain function receives one of said gain
function argument values, wherein said gain function is configured to return the gain
function return value being assigned to said one of said gain function argument values.
Moreover, the signal processor further comprises a signal modifier for selecting,
depending on the direction of arrival, a direction dependent argument value from the gain
function argument values of a gain function of the one or more gain functions, for
obtaining the gain function return value being assigned to said direction dependent
argument value from said gain function, and for determining the gain value of at least one
of the one or more audio output signals depending on said gain function return value
obtained from said gain function.
Furthermore, a method for generating one or more audio output signals is provided. The
method comprises:
- Receiving two or more audio input signals.
Generating a direct component signal, comprising direct signal components of the
two or more audio input signals.
- Generating a diffuse component signal, comprising diffuse signal components of
the two or more audio input signals.
Receiving direction information depending on a direction of arrival of the direct
signal components of the two or more audio input signals.
Generating one or more processed diffuse signals depending on the defuse
component signal.
For each audio output signal of the one or more audio output signals, determining,
depending on the direction of arrival, a direct gain, applying said direct gain on the
direct component signal to obtain a processed direct signal, and combining said
processed direct signal and one of the one or more processed diffuse signals to
generate said audio output signal. And:
Outputting the one or more audio output signals.
Generating the one or more audio output signals comprises calculating one or more gain
functions, wherein each gain function of the one or more gain functions, comprises a
plurality of gain function argument values, wherein a gain function return value is assigned
to each of said gain function argument values, wherein, when said gain function receives
one of said gain function argument values, wherein said gain function is configured to
return the gain function return value being assigned to said one of said gain function
argument values. Moreover, generating the one or more audio output signals comprises
selecting, depending on the direction of arrival, a direction dependent argument value
from the gain function argument values of a gain function of the one or more gain
functions, for obtaining the gain function return value being assigned to said direction
dependent argument value from said gain function, and for determining the gain value of
at least one of the one or more audio output signals depending on said gain function
return value obtained from said gain function.
Moreover, a method for generating one or more audio output signals is provided. The
method comprises:
Receiving a direct component signal, comprising direct signal components of the
two or more original audio signals.
- Receiving a diffuse component signal, comprising diffuse signal components of the
two or more original audio signals.
Receiving direction information, said direction information depending on a direction
of arrival of the direct signal components of the two or more audio input signals.
Generating one or more processed diffuse signals depending on the defuse
component signal.
For each audio output signal of the one or more audio output signals, determining,
depending on the direction of arrival, a direct gain, applying said direct gain on the
direct component signal to obtain a processed direct signal, and the combining
said processed direct signal and one of the one or more processed diffuse signals
to generate said audio output signal. And:
Outputting the one or more audio output signals.
Generating the one or more audio output signals comprises calculating one or more gain
functions, wherein each gain function of the one or more gain functions, comprises a
plurality of gain function argument values, wherein a gain function return value is assigned
to each of said gain function argument values, wherein, when said gain function receives
one of said gain function argument values, wherein said gain function is configured to
return the gain function return value being assigned to said one of said gain function
argument values. Moreover, generating the one or more audio output signals comprises
selecting, depending on the direction of arrival, a direction dependent argument value
from the gain function argument values of a gain function of the one or more gain
functions, for obtaining the gain function return value being assigned to said direction
dependent argument value from said gain function, and for determining the gain value of
at least one of the one or more audio output signals depending on said gain function
return value obtained from said gain function.
Moreover, computer programs are provided, wherein each of the computer programs is
configured to implement one of the above-described methods when being executed on a
computer or signal processor, so that each of the above-described methods is
implemented by one of the computer programs.
Furthermore, a system for generating one or more audio output signals is provided. The
system comprises a decomposition module, a signal processor, and an output interface.
The decomposition module is configured to receive two or more audio input signals,
wherein the decomposition module is configured to generate a direct component signal,
comprising direct signal components of the two or more audio input signals, and wherein
the decomposition module is configured to generate a diffuse component signal,
comprising diffuse signal components of the two or more audio input signals. The signal
processor is configured to receive the direct component signal, the diffuse component
signal and direction information, said direction information depending on a direction of
arrival of the direct signal components of the two or more audio input signals. Moreover,
the signal processor is configured to generate one or more processed diffuse signals
depending on the defuse component signal. For each audio output signal of the one or
more audio output signals, the signal processor is configured to determine, depending on
the direction of arrival, a direct gain, the signal processor is configured to apply said direct
gain on the direct component signal to obtain a processed direct signal, and the signal
processor is configured to combine said processed direct signal and one of the one or
more processed diffuse signals to generate said audio output signal. The output interface
is configured to output the one or more audio output signals.
According to embodiments, concepts are provided to achieve spatial sound recording and
reproduction such that the recreated acoustical image may, e.g., be consistent to a
desired spatial image, which is, for example, determined by the user at the far-end side or
by a video-image. The proposed approach uses a microphone array at the near-end side
which allows us to decompose the captured sound into direct sound components and a
diffuse sound component. The extracted sound components are then transmitted to the
far-end side. The consistent spatial sound reproduction may, e.g., be realized by a
weighted sum of the extracted direct sound and diffuse sound, where the weights depend
on the desired spatial image to which the reproduced sound should be consistent, e.g.,
the weights depend on the look direction and zooming factor of the video camera, which
may, e.g., be complimenting the audio recording. Concepts are provided which employ
informed multi-channel filters for the extraction of the direct sound and diffuse sound.
According to an embodiment, the signal processor may, e.g., be configured to determine
two or more audio output signals, wherein for each audio output signal of the two or more
audio output signals a panning gain function may, e.g., be assigned to said audio output
signal, wherein the panning gain function of each of the two or more audio output signals
comprises a plurality of panning function argument values, wherein a panning function
return value may, e.g., be assigned to each of said panning function argument values,
wherein, when said panning gain function receives one of said panning function argument
values, said panning gain function may, e.g., be configured to return the panning function
return value being assigned to said one of said panning function argument values, and
wherein the signal processor may, e.g., be configured to determine each of the two or
more audio output signals depending on a direction dependent argument value of the
panning function argument values of the panning gain function being assigned to said
audio output signal, wherein said direction dependent argument value depends on the
direction of arrival.
In an embodiment, the panning gain function of each of the two or more audio output
signals has one or more global maxima, being one of the panning function argument
values, wherein for each of the one or more global maxima of each panning gain function,
no other panning function argument value exists for which said panning gain function
returns a greater panning function return value than for said global maxima, and wherein,
for each pair of a first audio output signal and a second audio output signal of the two or
more audio output signals, at least one of the one or more global maxima of the panning
gain function of the first audio output signal may, e.g., be different from any of the one or
more global maxima of the panning gain function of the second audio output signal.
According to an embodiment, the signal processor may, e.g., be configured to generate
each audio output signal of the one or more audio output signals depending on a window
gain function, wherein the window gain function may, e.g., be configured to return a
window function return value when receiving a window function argument value, wherein,
if the window function argument value may, e.g., be greater than a lower window
threshold and smaller than an upper window threshold, the window gain function may,
e.g., be configured to return a window function return value being greater than any
window function return value returned by the window gain function, if the window function
argument value may, e.g., be smaller than the lower threshold, or greater than the upper
threshold.
In an embodiment, the signal processor may, e.g., be configured to further receive
orientation information indicating an angular shift of a look direction with respect to the
direction of arrival, and wherein at least one of the panning gain function and the window
gain function depends on the orientation information; or wherein the gain function
computation module may, e.g., be configured to further receive zoom information, wherein
the zoom information indicates an opening angle of a camera, and wherein at least one of
the panning gain function and the window gain function depends on the zoom information;
or wherein the gain function computation module may, e.g., be configured to further
receive a calibration parameter, and wherein at least one of the panning gain function and
the window gain function depends on the calibration parameter.
According to an embodiment, the signal processor may, e.g., be configured to receive
distance information, wherein the signal processor may, e.g., be configured to generate
each audio output signal of the one or more audio output signals depending on the
distance information.
According to an embodiment, the signal processor may, e.g., be configured to receive an
original angle value depending on an original direction of arrival, being the direction of
arrival of the direct signal components of the two or more audio input signals, and may,
e.g., be configured to receive the distance information, wherein the signal processor may,
e.g., be configured to calculate a modified angle value depending on the original angle
value and depending on the distance information, and wherein the signal processor may,
e.g., be configured to generate each audio output signal of the one or more audio output
signals depending on the modified angle value.
According to an embodiment, the signal processor may, e.g., be configured to generate
the one or more audio output signals by conducting low pass filtering, or by adding
delayed direct sound, or by conducting direct sound attenuation, or by conducting
temporal smoothing, or by conducting direction of arrival spreading, or by conducting
decorrelation.
In an embodiment, the signal processor may, e.g., be configured to generate two or more
audio output channels, wherein the signal processor may, e.g., be configured to apply the
diffuse gain on the diffuse component signal to obtain an intermediate diffuse signal, and
wherein the signal processor may, e.g., be configured to generate one or more
decorrelated signals from the intermediate diffuse signal by conducting decorrelation,
wherein the one or more decorrelated signals form the one or more processed diffuse
signals, or wherein the intermediate diffuse signal and the one or more decorrelated
signals form the one or more processed diffuse signals.
According to an embodiment, the direct component signal and one or more further direct
component signals form a group of two or more direct component signals, wherein the
decomposition module may, e.g., be configured may, e.g., be configured to generate the
one or more further direct component signals comprising further direct signal components
of the two or more audio input signals, wherein the direction of arrival and one or more
further direction of arrivals form a group of two or more direction of arrivals, wherein each
direction of arrival of the group of the two or more direction of arrivals may, e.g., be
assigned to exactly one direct component signal of the group of the two or more direct
component signals, wherein the number of the direct component signals of the two or more direct component signals and the number of the direction of arrivals of the two or more direction of arrivals may, e.g., be equal, wherein the signal processor may, e.g., be
configured to receive the group of the two or more direct component signals, and the
group of the two or more direction of arrivals, and wherein, for each audio output signal of
the one or more audio output signals, the signal processor may, e.g., be configured to
determine, for each direct component signal of the group of the two or more direct
component signals, a direct gain depending on the direction of arrival of said direct
component signal, the signal processor may, e.g., be configured to generate a group of
two or more processed direct signals by applying, for each direct component signal of the
group of the two or more direct component signals, the direct gain of said direct
component signal on said direct component signal, and the signal processor may, e.g., be
configured to combine one of the one or more processed diffuse signals and each processed direct signal of the group of the two or more processed direct signals to generate said audio
output signal.
In an embodiment, the number of the direct component signals of the group of the two or
more direct component signals plus 1 may, e.g., be smaller than the number of the audio
input signals being received by the receiving interface.
Moreover, a hearing aid or an assistive listening device comprising a system as described
above may, e.g., be provided.
Moreover, an apparatus for generating one or more audio output signals is provided. The
apparatus comprises a signal processor and an output interface. The signal processor is
configured to receive a direct component signal, comprising direct signal components of
the two or more original audio signals, wherein the signal processor is configured to
receive a diffuse component signal, comprising diffuse signal components of the two or
more original audio signals, and wherein the signal processor is configured to receive
direction information, said direction information depending on a direction of arrival of the
direct signal components of the two or more audio input signals. Moreover, the signal
processor is configured to generate one or more processed diffuse signals depending on
the diffuse component signal. For each audio output signal of the one or more audio
output signals, the signal processor is configured to determine, depending on the direction
of arrival, a direct gain, the signal processor is configured to apply said direct gain on the
direct component signal to obtain a processed direct signal, and the signal processor is
configured to combine said processed direct signal and one of the one or more processed
diffuse signals to generate said audio output signal. The output interface is configured to
output the one or more audio output signals.
Furthermore, a method for generating one or more audio output signals is provided. The
method comprises:
- Receiving two or more audio input signals.
- Generating a direct component signal, comprising direct signal components of the two or more audio input signals.
- Generating a diffuse component signal, comprising diffuse signal components of the two or more audio input signals.
- Receiving direction information depending on a direction of arrival of the direct signal components of the two or more audio input signals.
- Generating one or more processed diffuse signals depending on the diffuse component signal.
- For each audio output signal of the one or more audio output signals, determining, depending on the direction of arrival, a direct gain, applying said direct gain on the direct component signal to obtain a processed direct signal, and combining said processed direct signal and one of the one or more processed diffuse signals to generate said audio output signal. And:
- Outputting the one or more audio output signals.
Moreover, a method for generating one or more audio output signals is provided. The
method comprises:
- Receiving a direct component signal, comprising direct signal components of the two or more original audio signals.
- Receiving a diffuse component signal, comprising diffuse signal components of the two or more original audio signals.
- Receiving direction information, said direction information depending on a direction of arrival of the direct signal components of the two or more audio input signals.
- Generating one or more processed diffuse signals depending on the diffuse component signal.
- For each audio output signal of the one or more audio output signals, determining, depending on the direction of arrival, a direct gain, applying said direct gain on the direct component signal to obtain a processed direct signal, and combining said processed direct signal and one of the one or more processed diffuse signals to generate said audio output signal. And:
- Outputting the one or more audio output signals.
Moreover, computer programs are provided, wherein each of the computer programs is
configured to implement one of the above-described methods when being executed on a
computer or signal processor, so that each of the above-described methods is
implemented by one of the computer programs.
In the following, embodiments of the present invention are described in more detail with
reference to the figures, in which:
Fig. 1a illustrates a system according to an embodiment,
Fig. 1b illustrates an apparatus according to an embodiment,
Fig. 1c illustrates a system according to another embodiment,
Fig. 1d illustrates an apparatus according to another embodiment,
Fig. 2 shows a system according to another embodiment,
Fig. 3 depicts modules for direct/diffuse decomposition and for parameter estimation of a system according to an embodiment,
Fig. 4 shows a first geometry for acoustic scene reproduction with acoustic
zooming according to an embodiment, wherein a sound source is located
on a focal plane,
Fig. 5 illustrates panning functions for consistent scene reproduction and for
acoustical zoom,
Fig. 6 depicts further panning functions for consistent scene reproduction and for
acoustical zoom according to embodiments,
Fig. 7 illustrates example window gain functions for various situations according to embodiments,
Fig. 8 shows a diffuse gain function according to an embodiment,
Fig. 9 depicts a second geometry for acoustic scene reproduction with acoustic zooming according to an embodiment, wherein a sound source is not located on a focal plane,
Fig. 10 illustrates functions to explain the direct sound blurring, and
Fig. 11 visualizes hearing aids according to embodiments.
Fig. 1a illustrates a system for generating one or more audio output signals according to an embodiment. The system comprises a decomposition module 101, a signal processor 105, and an output interface 106.

The decomposition module 101 is configured to generate a direct component signal Xdir(k, n), comprising direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n). Moreover, the decomposition module 101 is configured to generate a diffuse component signal Xdiff(k, n), comprising diffuse signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n).

The signal processor 105 is configured to receive the direct component signal Xdir(k, n), the diffuse component signal Xdiff(k, n) and direction information, said direction information depending on a direction of arrival of the direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n).

Moreover, the signal processor 105 is configured to generate one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) depending on the diffuse component signal Xdiff(k, n).

For each audio output signal Yi(k, n) of the one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n), the signal processor 105 is configured to determine, depending on the direction of arrival, a direct gain Gi(k, n), the signal processor 105 is configured to apply said direct gain Gi(k, n) on the direct component signal Xdir(k, n) to obtain a processed direct signal Ydir,i(k, n), and the signal processor 105 is configured to combine said processed direct signal Ydir,i(k, n) and one Ydiff,i(k, n) of the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) to generate said audio output signal Yi(k, n).

The output interface 106 is configured to output the one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n).
As outlined, the direction information depends on a direction of arrival φ(k, n) of the direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n). For example, the direction of arrival of the direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n) may, e.g., itself be the direction information. Or, for example, the direction information may, for example, be the propagation direction of the direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n). While the direction of arrival points from a receiving microphone array to a sound source, the propagation direction points from the sound source to the receiving microphone array. Thus, the propagation direction points in exactly the opposite direction of the direction of arrival and therefore depends on the direction of arrival.
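As a minimal illustration of this relationship (the function name and the unit-vector convention are assumptions, not part of the embodiments), the propagation direction is obtained by negating the DOA unit vector:

```python
import numpy as np

def doa_to_propagation_direction(doa_unit_vector):
    """Negate the DOA unit vector (pointing from the array to the source)
    to obtain the propagation direction (pointing from the source to the
    array), as explained above."""
    return -np.asarray(doa_unit_vector, dtype=float)
```

Because the mapping is a simple negation, either convention carries the same directional information.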
To generate one Yi(k, n) of the one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n), the signal processor 105
- determines, depending on the direction of arrival, a direct gain Gi(k, n),
- applies said direct gain Gi(k, n) on the direct component signal Xdir(k, n) to obtain a processed direct signal Ydir,i(k, n), and
- combines said processed direct signal Ydir,i(k, n) and one Ydiff,i(k, n) of the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) to generate said audio output signal Yi(k, n).

This is done for each of the one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n) that shall be generated. The signal processor may, for example, be configured to generate one, two, three or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n).
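The per-bin processing just described can be sketched in a few lines; this is a hedged illustration only, and the cosine/sine panning gains used below are placeholder assumptions, not the gain functions of the embodiments:

```python
import numpy as np

def synthesize_outputs(X_dir, X_diff, doa, direct_gain_fns, diffuse_gain):
    """Per time-frequency bin: Y_i = G_i(doa) * X_dir + Q * X_diff.

    X_dir, X_diff   : complex values for one (k, n) bin.
    doa             : direction of arrival in radians for that bin.
    direct_gain_fns : one direct gain function per output channel (assumed).
    diffuse_gain    : scalar diffuse gain Q (assumed).
    """
    outputs = []
    for g in direct_gain_fns:
        Y_dir_i = g(doa) * X_dir           # processed direct signal
        Y_diff_i = diffuse_gain * X_diff   # processed diffuse signal
        outputs.append(Y_dir_i + Y_diff_i) # combine to audio output signal
    return np.array(outputs)

# Example: two output channels with simple cosine-lobe panning gains
# (assumed values, for illustration only).
gains = [lambda phi: np.cos(phi / 2), lambda phi: np.sin(phi / 2)]
Y = synthesize_outputs(1.0 + 0.0j, 0.5 + 0.0j, np.pi / 3, gains, 0.7)
```

Note that the diffuse branch is computed once per output here; as the text explains below, a single diffuse gain application can serve all output channels.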
Regarding the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n), according to an embodiment, the signal processor 105 may, for example, be configured to generate the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) by applying a diffuse gain Q(k, n) on the diffuse component signal Xdiff(k, n).

The decomposition module 101 may, e.g., be configured to generate the direct component signal Xdir(k, n), comprising the direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n), and the diffuse component signal Xdiff(k, n), comprising the diffuse signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n), by decomposing the two or more audio input signals into the direct component signal and into the diffuse component signal.
In a particular embodiment, the signal processor 105 may, e.g., be configured to generate two or more audio output channels Y1(k, n), Y2(k, n), ..., Yv(k, n). The signal processor 105 may, e.g., be configured to apply the diffuse gain Q(k, n) on the diffuse component signal Xdiff(k, n) to obtain an intermediate diffuse signal. Moreover, the signal processor 105 may, e.g., be configured to generate one or more decorrelated signals from the intermediate diffuse signal by conducting decorrelation, wherein the one or more decorrelated signals form the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n), or wherein the intermediate diffuse signal and the one or more decorrelated signals form the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n).

For example, the number of processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) and the number of audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n) may, e.g., be equal.
Generating the one or more decorrelated signals from the intermediate diffuse signal may, e.g., be conducted by applying delays on the intermediate diffuse signal, or, e.g., by convolving the intermediate diffuse signal with a noise burst, or, e.g., by convolving the intermediate diffuse signal with an impulse response, etc. Any other state-of-the-art decorrelation technique may, e.g., alternatively or additionally be applied.
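The delay and noise-burst options above can be sketched as follows; this is a hedged illustration only (the function names, the burst length, and the energy normalization are assumptions, not taken from the embodiments):

```python
import numpy as np

def decorrelate_by_delay(intermediate_diffuse, delays):
    """Generate decorrelated copies of a (time-domain) intermediate diffuse
    signal by applying a different sample delay per copy, one of the
    decorrelation options mentioned in the text."""
    out = []
    for d in delays:
        shifted = np.concatenate([np.zeros(d), intermediate_diffuse])
        out.append(shifted[:len(intermediate_diffuse)])
    return np.stack(out)

def decorrelate_by_noise_burst(intermediate_diffuse, rng, burst_len=64):
    """Alternative option mentioned in the text: convolve the intermediate
    diffuse signal with a short noise burst (burst_len is an assumed
    parameter)."""
    burst = rng.standard_normal(burst_len)
    burst /= np.linalg.norm(burst)  # keep the energy roughly constant
    return np.convolve(intermediate_diffuse, burst)[:len(intermediate_diffuse)]
```

In both variants the copies carry the same diffuse energy but reduced mutual coherence, which is the purpose of the decorrelation step.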
For obtaining v audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n), v determinations of the v direct gains G1(k, n), G2(k, n), ..., Gv(k, n) and v applications of the respective gain on the direct component signal Xdir(k, n) may, for example, be employed to obtain the v audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n).

Only a single diffuse component signal Xdiff(k, n), only one determination of a single diffuse gain Q(k, n) and only one application of the diffuse gain Q(k, n) on the diffuse component signal Xdiff(k, n) may, e.g., be needed to obtain the v audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n). To achieve decorrelation, decorrelation techniques may be applied only after the diffuse gain has already been applied on the diffuse component signal.

According to the embodiment of Fig. 1a, the same processed diffuse signal Ydiff(k, n) is then combined with the corresponding one (Ydir,i(k, n)) of the processed direct signals to obtain the corresponding one (Yi(k, n)) of the audio output signals.
The embodiment of Fig. 1a takes the direction of arrival of the direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n) into account. Thus, the audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n) can be generated by flexibly adjusting the direct component signal Xdir(k, n) and the diffuse component signal Xdiff(k, n) depending on the direction of arrival. Advanced adaptation possibilities are achieved.

According to embodiments, the audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n) may, e.g., be determined for each time-frequency bin (k, n) of a time-frequency domain.

According to an embodiment, the decomposition module 101 may, e.g., be configured to receive two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n). In another embodiment, the decomposition module 101 may, e.g., be configured to receive three or more audio input signals x1(k, n), x2(k, n), ... xp(k, n). The decomposition module 101 may, e.g., be configured to decompose the two or more (or three or more) audio input signals x1(k, n), x2(k, n), ... xp(k, n) into the diffuse component signal Xdiff(k, n), which is not a multi-channel signal, and into the one or more direct component signals Xdir(k, n). That an audio signal is not a multi-channel signal means that the audio signal does itself not comprise more than one audio channel. Thus, the audio information of the plurality of audio input signals is transmitted within the two component signals (Xdir(k, n), Xdiff(k, n)) (and possibly in additional side information), which allows efficient transmission.
The signal processor 105 may, e.g., be configured to generate each audio output signal Yi(k, n) of two or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n) by determining the direct gain Gi(k, n) for said audio output signal Yi(k, n), by applying said direct gain Gi(k, n) on the direct component signal Xdir(k, n) to obtain the processed direct signal Ydir,i(k, n) for said audio output signal Yi(k, n), and by combining said processed direct signal Ydir,i(k, n) for said audio output signal Yi(k, n) and the processed diffuse signal Ydiff(k, n) to generate said audio output signal Yi(k, n). The output interface 106 is configured to output the two or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n). Generating two or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n) by determining only a single processed diffuse signal Ydiff(k, n) is particularly advantageous.
Fig. 1b illustrates an apparatus for generating one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n) according to an embodiment. The apparatus implements the so-called "far-end" side of the system of Fig. 1a.

The apparatus of Fig. 1b comprises a signal processor 105 and an output interface 106.

The signal processor 105 is configured to receive a direct component signal Xdir(k, n), comprising direct signal components of the two or more original audio signals x1(k, n), x2(k, n), ... xp(k, n) (e.g., the audio input signals of Fig. 1a). Moreover, the signal processor 105 is configured to receive a diffuse component signal Xdiff(k, n), comprising diffuse signal components of the two or more original audio signals x1(k, n), x2(k, n), ... xp(k, n). Furthermore, the signal processor 105 is configured to receive direction information, said direction information depending on a direction of arrival of the direct signal components of the two or more audio input signals.

The signal processor 105 is configured to generate one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) depending on the diffuse component signal Xdiff(k, n).

For each audio output signal Yi(k, n) of the one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n), the signal processor 105 is configured to determine, depending on the direction of arrival, a direct gain Gi(k, n), the signal processor 105 is configured to apply said direct gain Gi(k, n) on the direct component signal Xdir(k, n) to obtain a processed direct signal Ydir,i(k, n), and the signal processor 105 is configured to combine said processed direct signal Ydir,i(k, n) and one Ydiff,i(k, n) of the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) to generate said audio output signal Yi(k, n).

The output interface 106 is configured to output the one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n).
All configurations of the signal processor 105 described in the following with reference to the system may also be implemented in an apparatus according to Fig. 1b. This relates in particular to the various configurations of the signal modifier 103 and the gain function computation module 104 which are described below. The same applies for the various application examples of the concepts described below.
Fig. 1c illustrates a system according to another embodiment. In Fig. 1c, the signal processor 105 of Fig. 1a further comprises a gain function computation module 104 for calculating one or more gain functions, wherein each gain function of the one or more gain functions comprises a plurality of gain function argument values, wherein a gain function return value is assigned to each of said gain function argument values, and wherein, when said gain function receives one of said gain function argument values, said gain function is configured to return the gain function return value being assigned to said one of said gain function argument values.
Furthermore, the signal processor 105 further comprises a signal modifier 103 for
selecting, depending on the direction of arrival, a direction dependent argument value
from the gain function argument values of a gain function of the one or more gain
functions, for obtaining the gain function return value being assigned to said direction
dependent argument value from said gain function, and for determining the gain value of
at least one of the one or more audio output signals depending on said gain function
return value obtained from said gain function.
Fig. 1d illustrates a system according to another embodiment. In Fig. 1d, the signal processor 105 of Fig. 1b further comprises a gain function computation module 104 for calculating one or more gain functions, wherein each gain function of the one or more gain functions comprises a plurality of gain function argument values, wherein a gain function return value is assigned to each of said gain function argument values, and wherein, when said gain function receives one of said gain function argument values, said gain function is configured to return the gain function return value being assigned to said one of said gain function argument values.
Furthermore, the signal processor 105 further comprises a signal modifier 103 for
selecting, depending on the direction of arrival, a direction dependent argument value
from the gain function argument values of a gain function of the one or more gain
functions, for obtaining the gain function return value being assigned to said direction
dependent argument value from said gain function, and for determining the gain value of
at least one of the one or more audio output signals depending on said gain function
return value obtained from said gain function.
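The gain function structure described for Fig. 1c and Fig. 1d (tabulated argument values with assigned return values, plus a direction-dependent selection by the signal modifier 103) can be sketched as a lookup table. The class name, the tabulated values, and the nearest-neighbour selection are illustrative assumptions:

```python
import numpy as np

class LookupGainFunction:
    """A gain function as described above: a set of argument values with
    one return value assigned to each argument value."""

    def __init__(self, argument_values, return_values):
        self.argument_values = np.asarray(argument_values, dtype=float)
        self.return_values = np.asarray(return_values, dtype=float)

    def __call__(self, argument):
        # Return the gain function return value assigned to the argument
        # value closest to the received argument (assumed selection rule).
        idx = int(np.argmin(np.abs(self.argument_values - argument)))
        return self.return_values[idx]

def select_direction_dependent_gain(gain_fn, direction_of_arrival):
    """Signal-modifier step: choose the direction dependent argument value
    (here simply the DOA itself) and obtain its assigned return value."""
    return gain_fn(direction_of_arrival)

# Example: gains tabulated on a coarse DOA grid in degrees (assumed values).
g = LookupGainFunction([-90, -45, 0, 45, 90], [0.0, 0.3, 1.0, 0.3, 0.0])
```

Precomputing the gain function in module 104 and then merely selecting values in the signal modifier 103 keeps the per-bin cost of the direction-dependent weighting low.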
Embodiments provide recording and reproducing the spatial sound such that the acoustical image is consistent with a desired spatial image, which is determined for instance by a video which is complementing the audio at the far-end side. Some embodiments are based on recordings with a microphone array located in the reverberant near-end side. Embodiments provide, for example, an acoustical zoom which is consistent with the visual zoom of a camera. For example, when zooming in, the direct sound of the speakers is reproduced from the direction where the speakers would be located in the zoomed visual image, such that the visual and acoustical image are aligned. If the speakers are located outside the visual image (or outside a desired spatial region) after zooming in, the direct sound of these speakers can be attenuated, as these speakers are not visible anymore, or, for example, as the direct sound from these speakers is not desired. Moreover, the direct-to-reverberation ratio may, e.g., be increased when zooming in to mimic the smaller opening angle of the visual camera.
Embodiments are based on the concept to separate the recorded microphone signals into the direct sound of the sound sources and the diffuse sound, e.g., reverberant sound, by applying two recently proposed multi-channel filters at the near-end side. These multi-channel filters may, e.g., be based on parametric information of the sound field, such as the DOA of the direct sound. In some embodiments, the separated direct sound and diffuse sound may, e.g., be transmitted to the far-end side together with the parametric information.
For example, at the far-end side, specific weights may, e.g., be applied to the extracted
direct sound and diffuse sound, which adjust the reproduced acoustical image such that
the resulting audio output signals are consistent with a desired spatial image. These
weights model, for example, the acoustical zoom effect and depend, for example, on the
direction of arrival (DOA) of the direct sound and, for example, on a zooming factor and/or
a look direction of a camera. The final audio output signals may, e.g., then be obtained by
summing up the weighted direct sound and diffuse sound.
The provided concepts realize an efficient usage in the aforementioned video recording
scenario with consumer devices or in a teleconferencing scenario: For example, in the
video recording scenario, it may, e.g., be sufficient to store or transmit the extracted direct
sound and diffuse sound (instead of all microphone signals) while still being able to control
the recreated spatial image.
This means, if for instance a visual zoom is applied in a post-processing step (digital
zoom), the acoustical image may still be modified accordingly without the need to store
and access the original microphone signals. In the teleconferencing scenario, the
proposed concepts can also be used efficiently, since the direct and diffuse sound
extraction can be carried out at the near-end side while still being able to control the
spatial sound reproduction (e.g. , changing the loudspeaker setup) at the far-end side and
to align the acoustical and visual image. Therefore, it is only necessary to transmit a few audio signals and the estimated DOAs as side information, while the computational complexity at the far-end side is low.
Fig. 2 illustrates a system according to an embodiment. The near-end side comprises the modules 101 and 102. The far-end side comprises the modules 105 and 106. Module 105 itself comprises the modules 103 and 104. When reference is made to a near-end side and to a far-end side, it is understood that in some embodiments, a first apparatus may implement the near-end side (for example, comprising the modules 101 and 102), and a second apparatus may implement the far-end side (for example, comprising the modules 103 and 104), while in other embodiments, a single apparatus implements the near-end side as well as the far-end side, wherein such a single apparatus, e.g., comprises the modules 101, 102, 103 and 104.
In particular, Fig. 2 illustrates a system according to an embodiment comprising a decomposition module 101, a parameter estimation module 102, a signal processor 105, and an output interface 106. In Fig. 2, the signal processor 105 comprises a gain function computation module 104 and a signal modifier 103. The signal processor 105 and the output interface 106 may, e.g., realize an apparatus as illustrated by Fig. 1b.
In Fig. 2, inter alia, the parameter estimation module 102 may, e.g., be configured to receive the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n). Furthermore, the parameter estimation module 102 may, e.g., be configured to estimate the direction of arrival of the direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n) depending on the two or more audio input signals. The signal processor 105 may, e.g., be configured to receive the direction of arrival information comprising the direction of arrival of the direct signal components of the two or more audio input signals from the parameter estimation module 102.
The input of the system of Fig. 2 consists of M microphone signals X1...M(k, n) in the time-frequency domain (frequency index k, time index n). It may, e.g., be assumed that the sound field, which is captured by the microphones, consists for each (k, n) of a plane wave propagating in an isotropic diffuse field. The plane wave models the direct sound of the sound sources (e.g., speakers) while the diffuse sound models the reverberation. According to such a model, the m-th microphone signal can be written as

   Xm(k, n) = Xdir,m(k, n) + Xdiff,m(k, n) + Xn,m(k, n)     (1)

where Xdir,m(k, n) is the measured direct sound (plane wave), Xdiff,m(k, n) is the measured diffuse sound, and Xn,m(k, n) is a noise component (e.g., a microphone self-noise).
In decomposition module 101 in Fig. 2 (direct/diffuse decomposition), the direct sound Xdir(k, n) and the diffuse sound Xdiff(k, n) are extracted from the microphone signals. For this purpose, for example, informed multi-channel filters as described below may be employed. For the direct/diffuse decomposition, specific parametric information on the sound field may, e.g., be employed, for example, the DOA of the direct sound φ(k, n) and/or r(k, n), such that the desired consistent spatial image is obtained.
For example, when zooming in with the visual camera, the gain functions are adjusted such that the sound is reproduced from the directions where the sources are visible in the video. The weights Gi(k, n) and Q(k, n) and the underlying gain functions gi and q are further described below. It should be noted that the weights Gi(k, n) and Q(k, n) and the underlying gain functions gi and q may, e.g., be complex-valued. Computing the gain functions requires information such as the zooming factor, the width of the visual image, the desired look direction, and the loudspeaker setup.

In other embodiments, the weights Gi(k, n) and Q(k, n) are directly computed within the signal modifier 103, instead of at first computing the gain functions in module 104 and then selecting the weights Gi(k, n) and Q(k, n) from the computed gain functions in the gain selection units 201 and 202.
According to embodiments, more than one plane wave per time-frequency bin may, e.g., be specifically processed. For example, two or more plane waves in the same frequency band from two different directions may, e.g., be recorded by a microphone array at the same point in time. These two plane waves may each have a different direction of arrival. In such scenarios, the direct signal components of the two or more plane waves and their direction of arrivals may, e.g., be separately considered.
According to embodiments, the direct component signal Xdir1(k, n) and one or more further direct component signals Xdir2(k, n), ..., Xdirq(k, n) may, e.g., form a group of two or more direct component signals Xdir1(k, n), Xdir2(k, n), ..., Xdirq(k, n), wherein the decomposition module 101 may, e.g., be configured to generate the one or more further direct component signals Xdir2(k, n), ..., Xdirq(k, n) comprising further direct signal components of the two or more audio input signals x1(k, n), x2(k, n), ... xp(k, n).

The direction of arrival and one or more further direction of arrivals form a group of two or more direction of arrivals, wherein each direction of arrival of the group of the two or more direction of arrivals is assigned to exactly one direct component signal Xdirj(k, n) of the group of the two or more direct component signals Xdir1(k, n), Xdir2(k, n), ..., Xdirq(k, n), wherein the number of the direct component signals of the two or more direct component signals and the number of the direction of arrivals of the two or more direction of arrivals is equal.

The signal processor 105 may, e.g., be configured to receive the group of the two or more direct component signals Xdir1(k, n), Xdir2(k, n), ..., Xdirq(k, n), and the group of the two or more direction of arrivals.

For each audio output signal Yi(k, n) of the one or more audio output signals Y1(k, n), Y2(k, n), ..., Yv(k, n):
- The signal processor 105 may, e.g., be configured to determine, for each direct component signal Xdirj(k, n) of the group of the two or more direct component signals Xdir1(k, n), Xdir2(k, n), ..., Xdirq(k, n), a direct gain Gj,i(k, n) depending on the direction of arrival of said direct component signal Xdirj(k, n).
- The signal processor 105 may, e.g., be configured to generate a group of two or more processed direct signals Ydir1,i(k, n), Ydir2,i(k, n), ..., Ydirq,i(k, n) by applying, for each direct component signal Xdirj(k, n) of the group of the two or more direct component signals Xdir1(k, n), Xdir2(k, n), ..., Xdirq(k, n), the direct gain Gj,i(k, n) of said direct component signal Xdirj(k, n) on said direct component signal Xdirj(k, n). And:
- The signal processor 105 may, e.g., be configured to combine one Ydiff,i(k, n) of the one or more processed diffuse signals Ydiff,1(k, n), Ydiff,2(k, n), ..., Ydiff,v(k, n) and each processed direct signal Ydirj,i(k, n) of the group of the two or more processed direct signals Ydir1,i(k, n), Ydir2,i(k, n), ..., Ydirq,i(k, n) to generate said audio output signal Yi(k, n).
Thus, if two or more plane waves are separately considered, the model of formula (1) becomes:

   Xm(k, n) = Xdir1,m(k, n) + Xdir2,m(k, n) + ... + Xdirq,m(k, n) + Xdiff,m(k, n) + Xn,m(k, n)

and the weights may, e.g., be computed analogously to formulae (2a) and (2b) according to:

   Yi(k, n) = G1,i(k, n) Xdir1(k, n) + G2,i(k, n) Xdir2(k, n) + ... + Gq,i(k, n) Xdirq(k, n) + Q(k, n) Xdiff(k, n)
            = Ydir1,i(k, n) + Ydir2,i(k, n) + ... + Ydirq,i(k, n) + Ydiff,i(k, n)
It is sufficient that only a few direct component signals, a diffuse component signal and side information are transmitted from a near-end side to a far-end side. In an embodiment, the number of the direct component signals of the group of the two or more direct component signals Xdir1(k, n), Xdir2(k, n), ..., Xdirq(k, n) plus 1 is smaller than the number of the audio input signals x1(k, n), x2(k, n), ... xp(k, n) being received by the receiving interface 101 (using the indices: q + 1 < p). The "plus 1" represents the diffuse component signal Xdiff(k, n) that is needed.
When in the following, explanations are provided with respect to a single plane wave, to a
single direction of arrival and to a single direct component signal, it is to be understood
that the explained concepts are equally applicable to more than one plane wave, more
than one direction of arrival and more than one direct component signal.
In the following, direct and diffuse sound extraction is described. Practical realizations of
the decomposition module 101 of Fig. 2, which realizes the direct/diffuse decomposition,
are provided.
In embodiments, to realize the consistent spatial sound reproduction, the outputs of two
recently proposed informed linearly constrained minimum variance (LCMV) filters
described in [8] and [9] are combined. These filters enable an accurate multi-channel
extraction of direct sound and diffuse sound with a desired arbitrary response, assuming a
sound field model similar to that of DirAC (Directional Audio Coding). A specific way of
combining these filters according to an embodiment is described in the following:
At first, direct sound extraction according to an embodiment is described.
The direct sound is extracted using the recently proposed informed spatial filter described
in [8]. This filter is briefly reviewed in the following and then formulated such that it can be
used in embodiments according to Fig. 2.
The estimated desired direct signal Ydir,i(k, n) for the i-th loudspeaker channel in (2b) and
Fig. 2 is computed by applying a linear multi-channel filter to the microphone signals, e.g.,

Ydir,i(k, n) = wdir^H(k, n) x(k, n), (4)

where the vector x(k, n) = [X1(k, n), ..., XM(k, n)]^T comprises the M microphone signals
and wdir is a complex-valued weight vector. Here, the filter weights minimize the noise
and diffuse sound comprised in the microphone signals while capturing the direct sound with the
desired gain Gi(k, n). Expressed mathematically, the weights may, e.g., be computed as

wdir(k, n) = argmin_w  w^H Φu(k, n) w (5)

subject to the linear constraint

wdir^H(k, n) a(k, φ) = Gi(k, n). (6)
Here, a(k, φ) is the so-called array propagation vector. The m-th element of this vector is
the relative transfer function of the direct sound between the m-th microphone and a
reference microphone of the array (without loss of generality, the first microphone is used
as the reference in the following description). This vector depends on the DOA φ(k, n) of
the direct sound.
The array propagation vector is, for example, defined in [8]. In formula (6) of document [8],
the array propagation vector is defined according to

a(k, φ) = [a1(k, φ), ..., aM(k, φ)]^T,

wherein Φu(k, n) indicates a power spectral density matrix of the noise and diffuse sound
of the two or more audio input signals, wherein a(k, φ) indicates the array propagation
vector, and wherein φ indicates the azimuth angle of the direction of arrival of the direct
signal components of the two or more audio input signals.
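As a sketch of the constrained minimization in (5) and (6) (a toy with assumed inputs, not the informed filter of [8] with its estimated quantities), the closed-form LCMV solution can be written as w = conj(G) Φu⁻¹a / (aᴴ Φu⁻¹ a):

```python
import numpy as np

def direct_filter_weights(phi_u, a, gain):
    """Solve min_w w^H Phi_u w subject to w^H a = G in closed form:
    w = conj(G) * Phi_u^{-1} a / (a^H Phi_u^{-1} a)."""
    inv_a = np.linalg.solve(phi_u, a)              # Phi_u^{-1} a
    return np.conj(gain) * inv_a / (a.conj() @ inv_a)

# Toy example: M = 3 microphones, assumed diagonal noise/diffuse PSD
phi_u = np.diag([1.0, 2.0, 0.5]).astype(complex)
a = np.exp(1j * np.array([0.0, 0.4, 0.8]))         # assumed propagation vector
w = direct_filter_weights(phi_u, a, gain=0.7)
# The linear constraint of (6) holds by construction: w^H a = G
```

In practice Φu(k, n) and a(k, φ) would be estimated per time-frequency bin rather than fixed as above.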
Fig. 3 illustrates parameter estimation module 102 and a decomposition module 101
implementing direct/diffuse decomposition according to an embodiment.
The embodiment illustrated by Fig. 3 realizes direct sound extraction by direct sound
extraction module 203 and diffuse sound extraction by diffuse sound extraction module
204.
The direct sound extraction is carried out in direct sound extraction module 203 by
applying the filter weights to the microphone signals as given in (10). The direct filter
weights are computed in direct weights computation unit 301, which can be realized for
instance with (8). The gains Gi(k, n) of, e.g., equation (9), are then applied at the far-end
side as shown in Fig. 2.
In the following, diffuse sound extraction is described. Diffuse sound extraction may, e.g.,
be implemented by diffuse sound extraction module 204 of Fig. 3. The diffuse filter
weights are computed in diffuse weights computation unit 302 of Fig. 3 , e.g., as described
in the following.
In embodiments, the diffuse sound may, e.g., be extracted using the spatial filter which
was recently proposed in [9]. The diffuse sound Xdiff(k, n) in (2a) and Fig. 2 may, e.g., be
estimated by applying a second spatial filter to the microphone signals, e.g.,

Xdiff(k, n) = hdiff^H(k, n) x(k, n). (11)

To find the optimal filter for the diffuse sound hdiff(k, n), we consider the recently proposed
filter in [9], which can extract the diffuse sound with a desired arbitrary response while
minimizing the noise at the filter output. For spatially white noise, the filter is given by

hdiff(k, n) = argmin_h  h^H h (12)

subject to h^H a(k, φ) = 0 and h^H γ1(k) = 1. The first linear constraint ensures that the direct
sound is suppressed, while the second constraint ensures that, on average, the diffuse
sound is captured with the desired gain Q, see document [9]. Note that γ1(k) is the diffuse
sound coherence vector defined in [9]. The solution to (12) is given by

hdiff(k, n) = A(k, φ) γ1(k) / [γ1^H(k) A(k, φ) γ1(k)], (13)
where

A(k, φ) = I − a(k, φ) [a^H(k, φ) a(k, φ)]^(−1) a^H(k, φ),

with I being the identity matrix of size M × M. The filter hdiff(k, n) does not depend on
the weights Gi(k, n) and Q, and thus it can be computed and applied at the near-end side
to obtain Xdiff(k, n). In doing so, only a single audio signal needs to be transmitted to the
far-end side, namely Xdiff(k, n), while it is still possible to fully control the spatial sound
reproduction of the diffuse sound.
Fig. 3 moreover illustrates the diffuse sound extraction according to an embodiment. The
diffuse sound extraction is carried out in diffuse sound extraction module 204 by applying
the filter weights to the microphone signals as given in formula (11). The filter weights are
computed in diffuse weights computation unit 302, which can be realized, for example, by
employing formula (13).
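The two constraints of (12) can be checked numerically. Below is a minimal sketch of a minimum-norm filter satisfying both constraints via the projection matrix A = I − a(aᴴa)⁻¹aᴴ; the propagation and coherence vectors are synthetic assumptions, not values from [9]:

```python
import numpy as np

def diffuse_filter_weights(a, gamma):
    """Minimum-norm filter for (12): minimize h^H h subject to
    h^H a = 0 (suppress direct sound) and h^H gamma = 1 (unit average
    diffuse response), using A = I - a (a^H a)^{-1} a^H."""
    M = len(a)
    A = np.eye(M) - np.outer(a, a.conj()) / (a.conj() @ a)
    return A @ gamma / (gamma.conj() @ (A @ gamma))

a = np.exp(1j * np.array([0.0, 0.5, 1.0]))        # assumed propagation vector
gamma = np.array([1.0, 0.6, 0.3], dtype=complex)  # assumed coherence vector
h = diffuse_filter_weights(a, gamma)
# h^H a = 0 (direct sound suppressed), h^H gamma = 1 (unit diffuse gain)
```

Because neither Gi(k, n) nor Q enters this computation, the filter can indeed be applied at the near-end side, as stated above.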
In the following, parameter estimation is described. Parameter estimation may, e.g., be
conducted by parameter estimation module 102, in which the parametric information
about the recorded sound scene may, e.g., be estimated. This parametric information is
employed for computing the two spatial filters in the decomposition module 101 and for
the gain selection for consistent spatial audio reproduction in the signal modifier 103.
At first, determination/estimation of DOA information is described.
In the following, embodiments are described wherein the parameter estimation module
102 comprises a DOA estimator for the direct sound, e.g., for the plane wave that
originates from the sound source position and arrives at the microphone array. Without
loss of generality, it is assumed that a single plane wave exists for each time and
frequency. Other embodiments consider cases where multiple plane waves exist, and
extending the single plane wave concepts described here to multiple plane waves is
straightforward. Therefore, the present invention also covers embodiments with multiple
plane waves.
The narrowband DOAs can be estimated from the microphone signals using one of the
state-of-the-art narrowband DOA estimators, such as ESPRIT [10] or root MUSIC [11].
Instead of the azimuth angle φ(k, n), the DOA information may, for example, also be
provided in another form, e.g., as a propagation vector a[k | φ(k, n)], for one or more
waves arriving at the microphone array. It should be noted that the
DOA information can also be provided externally. For example, the DOA of the plane
wave can be determined by a video camera together with a face recognition algorithm,
assuming that human talkers form the acoustic scene.
Finally, it should be noted that the DOA information can also be estimated in 3D (in three
dimensions). In that case, both the azimuth φ(k, n) and elevation ϑ(k, n) angles are
estimated in the parameter estimation module 102 and the DOA of the plane wave is in
such a case provided, for example, as (φ, ϑ).
Thus, when reference is made below to the azimuth angle of the DOA, it is understood
that all explanations are also applicable to the elevation angle of the DOA, to an angle
derived from the azimuth angle of the DOA, to an angle derived from the elevation angle
of the DOA, or to an angle derived from both the azimuth angle and the elevation angle of
the DOA. More generally, all explanations provided below are equally applicable to any
angle depending on the DOA.
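ESPRIT and root MUSIC are too involved for a short sketch, but the idea of a narrowband DOA estimate can be illustrated with a simple two-microphone phase-difference estimator; this is a deliberate simplification, not one of the cited methods, and all numeric values are assumptions:

```python
import numpy as np

def doa_from_phase(x1, x2, freq_hz, mic_dist_m, c=343.0):
    """Far-field azimuth estimate from the inter-microphone phase
    difference of one STFT bin: delta_phi = 2*pi*f*d*sin(phi)/c."""
    delta_phi = np.angle(x2 * np.conj(x1))
    s = delta_phi * c / (2.0 * np.pi * freq_hz * mic_dist_m)
    return np.arcsin(np.clip(s, -1.0, 1.0))

# Simulate a plane wave from 30 degrees at 1 kHz, 8 cm mic spacing
phi_true = np.deg2rad(30.0)
f, d = 1000.0, 0.08
tau = d * np.sin(phi_true) / 343.0       # wave reaches microphone 2 earlier
x1 = 1.0 + 0.0j
x2 = np.exp(2j * np.pi * f * tau) * x1   # phase lead at microphone 2
est = doa_from_phase(x1, x2, f, d)       # est is close to phi_true
```

Note that such a phase-based estimate is only unambiguous as long as the phase difference does not wrap, i.e., for frequencies below c/(2d); the cited subspace methods avoid several of these limitations.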
Now, distance information determination/estimation is described.
Some embodiments relate to acoustic zoom based on DOAs and distances. In such
embodiments, the parameter estimation module 102 may, for example, comprise two
sub-modules, e.g., the DOA estimator sub-module described above and a distance
estimation sub-module that estimates the distance r(k, n) from the recording position to
the sound source. In such embodiments, it may, for example, be assumed that each plane
wave that arrives at the recording microphone array originates from the sound source and
propagates along a straight line to the array (which is also known as the direct
propagation path).
Several state-of-the-art approaches exist for distance estimation using microphone
signals. For example, the distance to the source can be found by computing the power
ratios between the microphone signals as described in [12]. Alternatively, the distance to
the source r(k, n) in acoustic enclosures (e.g., rooms) can be computed based on the
estimated signal-to-diffuse ratio (SDR) [13]. The SDR estimates can then be combined
with the reverberation time of the room (known or estimated using state-of-the-art methods)
to calculate the distance. For a high SDR, the direct sound energy is high compared to the
diffuse sound, which indicates that the distance to the source is small. When the SDR
value is low, the direct sound power is weak in comparison to the room reverberation,
which indicates a large distance to the source.
In other embodiments, instead of calculating/estimating the distance by employing a
distance computation module in the parameter estimation module 102, external distance
information may, e.g., be received, for example, from a visual system. For example,
state-of-the-art techniques used in vision may, e.g., be employed that can provide the
distance information, for example, Time of Flight (ToF), stereoscopic vision, and
structured light. For example, in ToF cameras, the distance to the source can be
computed from the measured time-of-flight of a light signal emitted by the camera,
traveling to the source and back to the camera sensor. Computer stereo vision, for
example, utilizes two vantage points from which the visual image is captured to compute
the distance to the source.
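The ToF relation mentioned above is simple enough to state directly. The following is a generic sketch of the geometry, not any particular camera's API, and the example round-trip time is an assumed value:

```python
C_LIGHT = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_seconds):
    """Distance from a time-of-flight measurement: the emitted light
    travels to the object and back, so halve the round-trip path."""
    return C_LIGHT * round_trip_seconds / 2.0

# A round trip of about 13.34 ns corresponds to roughly 2 m
r = tof_distance(13.34e-9)
```

The nanosecond scale of the round trip illustrates why ToF cameras require dedicated sensing hardware rather than ordinary frame timing.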
Or, for example, structured light cameras may be employed, where a known pattern of
pixels is projected on a visual scene. The analysis of the deformations after the projection
allows the visual system to estimate the distance to the source. It should be noted that the
distance information r(k, n) for each time-frequency bin is required for consistent audio
scene reproduction. If the distance information is provided externally by a visual system,
the distance to the source r(k, n) that corresponds to the DOA φ(k, n) may, for example,
be selected as the distance value from the visual system that corresponds to that
particular direction φ(k, n).
Consistent audio scene reproduction may, for example, be achieved by selecting the
direct sound gain Gi(k, n) in gain selection unit 201 ("Direct Gain Selection") from a fixed
look-up table provided by gain function computation module 104 for the estimated DOA
φ(k, n). In an example geometry, the DOA of the direct sound at the recording side is
φg(k, n) and the location of the source on the x-axis is given by xg(k, n). Here, it is
assumed that all sound sources are located at the same distance g to the x-axis, e.g., the
source positions are located on the left dashed line, which is referred to in optics as a
focal plane. It should be noted that this assumption is only made to ensure that the visual
and acoustical images are aligned; the actual distance value g is not needed for the
presented processing.
On the reproduction side (far-end side), the display is located at b and the position of the
source on the display is given by xb(k, n). Moreover, x is the display size (or, in some
embodiments, for example, x indicates half of the display size), φd is the corresponding
maximum visual angle, S is the sweet spot of the sound reproduction system, and φb(k, n)
is the angle from which the direct sound should be reproduced so that the visual and
acoustical images are aligned. φb(k, n) depends on xb(k, n) and on the distance between
the sweet spot S and the display located at b. Moreover, xb(k, n) depends on several
parameters such as the distance g of the source from the camera, the image sensor size,
and the display size. Unfortunately, at least some of these parameters are often
unknown in practice, such that xb(k, n) and φb(k, n) cannot be determined for a given DOA
φg(k, n). However, assuming the optical system is linear, according to formula (17):
tan φb(k, n) = c tan φg(k, n), (17)

where c is an unknown constant compensating for the aforementioned unknown
parameters. It should be noted that c is constant only if all source positions have the same
distance g to the x-axis.
In the following, c is assumed to be a calibration parameter which should be adjusted
during the calibration stage until the visual and acoustical images are consistent. To
perform the calibration, the sound sources should be positioned on the focal plane and the
value of c is found such that the visual and acoustical images are aligned. Once
calibrated, the value of c remains unchanged and the angle from which the direct sound
should be reproduced is given by

φb(k, n) = arctan[c tan(φg(k, n))]. (18)
To ensure that both acoustic and visual scenes are consistent, the original panning
function pi(φ) is modified to a consistent (modified) panning function pb,i(φ). The direct
sound gain Gi(k, n) is now selected according to

Gi(k, n) = pb,i(φ(k, n)), (19)

pb,i(φ) = pi(φb), (20)

where pb,i(φ) is the consistent panning function returning the panning gains for the i-th
loudspeaker across all possible source DOAs. For a fixed value of c, such a consistent
panning function is computed in the gain function computation module 104 from the
original (e.g., VBAP) panning gain table as

pb,i(φ) = pi(arctan[c tan φ]). (21)
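The consistent panning table pb,i(φ) = pi(arctan(c · tan φ)) can be precomputed once after calibration. The sketch below uses a hypothetical stereo sine-law panning stand-in and an assumed calibration value c; a real system would use the VBAP gain table instead:

```python
import numpy as np

def stereo_pan(phi):
    """Hypothetical stand-in for the original panning function p_i(phi):
    sine-law left-channel gain over +/-45 degrees."""
    phi = np.clip(phi, -np.pi / 4, np.pi / 4)
    return np.cos(phi + np.pi / 4)

def consistent_pan(phi, c=1.2):
    """Consistent panning function:
    p_b,i(phi) = p_i(arctan(c * tan(phi)))."""
    return stereo_pan(np.arctan(c * np.tan(phi)))

# Precompute the modified gain table once for all candidate DOAs
phis = np.deg2rad(np.arange(-45.0, 46.0))
table = consistent_pan(phis)
```

For c = 1 (already consistent optics), the modified table coincides with the original panning function, which is a useful sanity check for the calibration stage.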
Thus, in embodiments, the signal processor 105 may, e.g., be configured to determine, for
each audio output signal of the one or more audio output signals, the direct gain
Gi(k, n) according to

Gi(k, n) = pi(arctan[c tan(φ(k, n))]),

wherein i indicates an index of said audio output signal, wherein k indicates frequency,
and wherein n indicates time, wherein Gi(k, n) indicates the direct gain, wherein φ(k, n)
indicates an angle depending on the direction of arrival (e.g., the azimuth angle of the
direction of arrival), wherein c indicates a constant value, and wherein pi indicates a
panning function.
In embodiments, the direct sound gain Gi(k, n) is selected in gain selection unit 201 based
on the estimated DOA φ(k, n) from a fixed look-up table provided by the gain function
computation module 104, which is computed once (after the calibration stage) using (19).
Thus, according to an embodiment, the signal processor 105 may, e.g., be configured to
obtain, for each audio output signal of the one or more audio output signals, the direct
gain for said audio output signal from a lookup table depending on the direction of arrival.
In an embodiment, the signal processor 105 calculates a lookup table for the direct gain
function Gi(k, n). For example, for every possible full degree, e.g., 1°, 2°, 3°, ..., for the
azimuth value φ of the DOA, the direct gain Gi(k, n) may be computed and stored in
advance. Then, when a current azimuth value φ of the direction of arrival is received, the
signal processor 105 reads the direct gain Gi(k, n) for the current azimuth value φ from the
lookup table. (The current azimuth value φ may, e.g., be the lookup table argument value;
and the direct gain Gi(k, n) may, e.g., be the lookup table return value.) Instead of the
azimuth φ of the DOA, in other embodiments, the lookup table may be computed for any
angle depending on the direction of arrival. This has the advantage that the gain value
does not always have to be calculated for every point-in-time, or for every time-frequency
bin; instead, the lookup table is calculated once and then, for a received angle φ, the
direct gain Gi(k, n) is read from the lookup table.
Thus, according to an embodiment, the signal processor 105 may, e.g., be configured to
calculate a lookup table, wherein the lookup table comprises a plurality of entries, wherein
each of the entries comprises a lookup table argument value and a lookup table return
value being assigned to said argument value. The signal processor 105 may, e.g., be
configured to obtain one of the lookup table return values from the lookup table by
selecting one of the lookup table argument values of the lookup table depending on the
direction of arrival. Furthermore, the signal processor 105 may, e.g., be configured to
determine the gain value for at least one of the one or more audio output signals
depending on said one of the lookup table return values obtained from the lookup table.
The signal processor 105 may, e.g., be configured to obtain another one of the lookup
table return values from the (same) lookup table by selecting another one of the lookup
table argument values depending on another direction of arrival to determine another gain
value. E.g., the signal processor may, for example, receive further direction information,
e.g., at a later point-in-time, which depends on said other direction of arrival.
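The lookup-table embodiment described above can be sketched as follows; the cosine-taper gain function is a hypothetical placeholder, not a gain function from this description:

```python
import numpy as np

def build_gain_table(gain_fn, step_deg=1.0):
    """Precompute the direct gain for every candidate azimuth so that
    per-bin processing reduces to a table read; the azimuth is the
    lookup table argument value, the gain the lookup table return value."""
    angles = np.arange(-180.0, 180.0, step_deg)
    gains = np.array([gain_fn(np.deg2rad(a)) for a in angles])
    return angles, gains

def read_gain(angles, gains, azimuth_deg):
    """Read the gain stored for the table argument nearest to the
    received azimuth value."""
    idx = int(np.argmin(np.abs(angles - azimuth_deg)))
    return gains[idx]

# Hypothetical gain function: cosine taper around the frontal direction
angles, gains = build_gain_table(lambda phi: max(0.0, float(np.cos(phi))))
g = read_gain(angles, gains, 30.2)   # nearest stored argument: 30 degrees
```

The table is built once (e.g., after the calibration stage) and then queried for each received angle, which is exactly the computational saving described above.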
Examples of VBAP panning and consistent panning gain functions are shown in Fig. 5(a)
and Fig. 5(b).
It should be noted that instead of recomputing the panning gain tables, one could
alternatively calculate the modified DOA φb(k, n) and use it directly with the original
panning gain table. The direct sound gain may then, e.g., be determined according to

Gi(k, n) = pb,i(φ) wb(φ), (25)

where pb,i(φ) denotes the panning gain function and wb(φ) is the window gain function for
a consistent audio-visual zoom. The panning gain function for a consistent audio-visual
zoom is computed in the gain function computation module 104 from the original (e.g.,
VBAP) panning gain function pi(φ) as

pb,i(φ) = pi(arctan[b tan φ]). (26)
Thus the direct sound gain Gi(k, n), e.g., selected in the gain selection unit 201, is
determined based on the estimated DOA φ(k, n). Moreover, wb(φ) is a window gain
function for an acoustic zoom that attenuates the direct sound if the source is mapped to a
position outside the visual image for the zoom factor b. The window function wb(φ) may,
e.g., be evaluated for the estimated DOA φ(k, n).
The required distance information can be estimated as explained above (the distance g of
the focal plane can be obtained from the lens system or autofocus information). It should
be noted that, for example, in this embodiment, the distance r(k, n) between the source
and the focal plane is transmitted to the far-end side together with the (mapped) DOA φ(k, n).
Moreover, by analogy to the visual zoom, the sources lying at a large distance r from the
focal plane do not appear sharp in the image. This effect is well-known in optics as the
so-called depth-of-field (DOF), which defines the range of source distances that appear
acceptably sharp in the visual image.
An example of the DOF curve as a function of the distance r is depicted in Fig. 10(a).
Fig. 10 illustrates example figures for the depth-of-field (Fig. 10(a)), for a cut-off frequency
of a low-pass filter (Fig. 10(b)), and for the time-delay in seconds for the repeated direct
sound (Fig. 10(c)).
In Fig. 10(a), the sources at a small distance from the focal plane are still sharp, whereas
sources at larger distances (either closer to or further away from the camera) appear
blurred. So, according to an embodiment, the corresponding sound sources are blurred
such that their visual and acoustical images are consistent.
To derive the gains Gi(k, n) and Q in (2a), which realize the acoustic blurring and
consistent spatial sound reproduction, the angle is considered at which the source
positioned at (φ, r) will appear on the display. The blurred source will be displayed at the
angle given by

tan φb(k, n) = b c tan φ(k, n), (30)

where c is the calibration parameter and b ≥ 1 is the user-controlled zoom factor. The
gains may, e.g., be determined according to

φb(k, n) = arctan[b c tan(φ(k, n))], (31)

Gi(k, n) = pb,i(φ) wb(φ), (32)

Q = b(r), (33)

wherein pb,i(φ) denotes the panning gain function (to assure that the sound is reproduced
from the right direction), wherein wb(φ) is the window gain function (to assure that the
direct sound is attenuated if the source is not visible in the video), and wherein b(r) is the
blurring function (to blur sources acoustically if they are not located on the focal plane).
It should be noted that all gain functions can be defined frequency-dependent (which is
omitted here for brevity). It should be further noted that in this embodiment the direct gain
G is found by selecting and multiplying gains from two different gain functions, as shown
in formula (32).
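A toy sketch of the gain computation in formulas (32) and (33) follows. The description only names the window and blurring functions, so the hard-edged window, the 0.2 attenuation floor, the quadratic blur roll-off and the depth-of-field value below are all invented placeholders:

```python
import numpy as np

def window_gain(phi, phi_max, b):
    """Window gain function w_b(phi): pass the direct sound when the
    zoomed source maps inside the visible image, otherwise attenuate.
    The hard edge and the 0.2 floor are assumptions."""
    return 1.0 if abs(phi) <= phi_max / b else 0.2

def blur_gain(r, dof=0.5):
    """Blurring function b(r): reduce the gain for sources far from the
    focal plane; the quadratic roll-off and dof value are assumptions."""
    return 1.0 / (1.0 + (r / dof) ** 2)

def zoom_gains(pan_gain, phi, r, phi_max=np.deg2rad(30.0), b=2.0):
    """Direct gain as the product of panning and window gains, in the
    spirit of formula (32); diffuse gain Q from the blurring function,
    in the spirit of formula (33)."""
    G = pan_gain * window_gain(phi, phi_max, b)
    Q = blur_gain(r)
    return G, Q

# Source at 10 degrees on the focal plane: visible and not blurred
G, Q = zoom_gains(0.8, np.deg2rad(10.0), 0.0)
```

Zooming in (larger b) narrows the window, so off-axis sources are attenuated, while sources leaving the focal plane (larger r) receive a reduced, "blurred" gain.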
Both gain functions pb,i(