Apparatus And Method For Geometry Based Spatial Audio Coding
Abstract:
An apparatus for generating at least one audio output signal based on an audio data stream comprising audio data relating to one or more sound sources is provided. The apparatus comprises a receiver for receiving the audio data stream comprising the audio data. The audio data comprises one or more pressure values for each one of the sound sources. Furthermore, the audio data comprises one or more position values indicating a position of one of the sound sources for each one of the sound sources. Moreover, the apparatus comprises a synthesis module for generating the at least one audio output signal based on at least one of the one or more pressure values of the audio data of the audio data stream and based on at least one of the one or more position values of the audio data of the audio data stream.
Apparatus and Method for Geometry-based Spatial Audio Coding
Description
The present invention relates to audio processing and, in particular, to an apparatus and
method for geometry-based spatial audio coding.
Audio processing and, in particular, spatial audio coding, becomes more and more
important. Traditional spatial sound recording aims at capturing a sound field such that at
the reproduction side, a listener perceives the sound image as it was at the recording
location. Different approaches to spatial sound recording and reproduction techniques are
known from the state of the art, which may be based on channel-, object- or parametric
representations.
Channel-based representations represent the sound scene by means of N discrete audio
signals meant to be played back by N loudspeakers arranged in a known setup, e.g. a 5.1
surround sound setup. The approach for spatial sound recording usually employs spaced,
omnidirectional microphones, for example, in AB stereophony, or coincident directional
microphones, for example, in intensity stereophony. Alternatively, more sophisticated
microphones, such as a B-format microphone, may be employed, for example, in
Ambisonics, see:
[1] Michael A. Gerzon. Ambisonics in multichannel broadcasting and video. J. Audio
Eng. Soc., 33(11):859-871, 1985.
The desired loudspeaker signals for the known setup are derived directly from the recorded
microphone signals and are then transmitted or stored discretely. A more efficient
representation is obtained by applying audio coding to the discrete signals, which in some
cases codes the information of different channels jointly for increased efficiency, for
example in MPEG Surround for 5.1, see:
[21] J. Herre, K. Kjörling, J. Breebaart, C. Faller, S. Disch, H. Purnhagen, J. Koppens, J.
Hilpert, J. Rödén, W. Oomen, K. Linzmeier, K. S. Chong: "MPEG Surround - The
ISO/MPEG Standard for Efficient and Compatible Multichannel Audio Coding",
122nd AES Convention, Vienna, Austria, 2007, Preprint 7084.
A major drawback of these techniques is that the sound scene, once the loudspeaker
signals have been computed, cannot be modified.
Object-based representations are, for example, used in Spatial Audio Object Coding
(SAOC), see
[25] Jeroen Breebaart, Jonas Engdegård, Cornelia Falch, Oliver Hellmuth, Johannes
Hilpert, Andreas Hoelzer, Jeroen Koppens, Werner Oomen, Barbara Resch, Erik
Schuijers, and Leonid Terentiev. Spatial Audio Object Coding (SAOC) - the upcoming
MPEG standard on parametric object based audio coding. In Audio Engineering
Society Convention 124, May 2008.
Object-based representations represent the sound scene with N discrete audio objects. This
representation gives high flexibility at the reproduction side, since the sound scene can be
manipulated by changing e.g. the position and loudness of each object. While this
representation may be readily available from an e.g. multitrack recording, it is very
difficult to obtain from a complex sound scene recorded with a few microphones (see,
for example, [21]). In fact, the talkers (or other sound emitting objects) have to be first
localized and then extracted from the mixture, which might cause artifacts.
Parametric representations often employ spatial microphones to determine one or more
audio downmix signals together with spatial side information describing the spatial sound.
An example is Directional Audio Coding (DirAC), as discussed in
[22] Ville Pulkki. Spatial sound reproduction with directional audio coding. J. Audio
Eng. Soc., 55(6):503-516, June 2007.
The term "spatial microphone" refers to any apparatus for the acquisition of spatial sound
capable of retrieving the direction of arrival of sound (e.g. a combination of directional
microphones, microphone arrays, etc.).
The term "non-spatial microphone" refers to any apparatus that is not adapted for
retrieving direction of arrival of sound, such as a single omnidirectional or directive
microphone.
Another example is proposed in:
[23] C. Faller. Microphone front-ends for spatial audio coders. In Proc. of the AES 125th
International Convention, San Francisco, Oct. 2008.
In DirAC, the spatial cue information comprises the direction of arrival (DOA) of sound
and the diffuseness of the sound field computed in a time-frequency domain. For the sound
reproduction, the audio playback signals can be derived based on the parametric
description. These techniques offer great flexibility at the reproduction side because an
arbitrary loudspeaker setup can be employed, because the representation is particularly
flexible and compact as it comprises a downmix mono audio signal and side information,
and because it allows easy modifications on the sound scene, for example, acoustic
zooming, directional filtering, scene merging, etc.
However, these techniques are still limited in that the spatial image recorded is always
relative to the spatial microphone used. Therefore, the acoustic viewpoint cannot be varied
and the listening-position within the sound scene cannot be changed.
A virtual microphone approach is presented in
[20] Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E. A. P. Habets.
Generating virtual microphone signals using geometrical information gathered by
distributed arrays. In Third Joint Workshop on Hands-free Speech Communication
and Microphone Arrays (HSCMA), Edinburgh, United Kingdom, May 2011,
which allows computing the output signals of an arbitrary spatial microphone virtually
placed at will (i.e., arbitrary position and orientation) in the environment. The flexibility
characterizing the virtual microphone (VM) approach allows the sound scene to be
virtually captured at will in a postprocessing step, but no sound field representation is
made available, which can be used to transmit and/or store and/or modify the sound scene
efficiently. Moreover, only one source per time-frequency bin is assumed to be active, and
therefore, it cannot correctly describe the sound scene if two or more sources are active in
the same time-frequency bin. Furthermore, if the virtual microphone (VM) is applied at the
receiver side, all the microphone signals need to be sent over the channel, which makes the
representation inefficient, whereas if the VM is applied at the transmitter side, the sound
scene cannot be further manipulated and the model loses flexibility and becomes limited to
a certain loudspeaker setup. Moreover, it does not consider a manipulation of the sound
scene based on parametric information.
In
[24] Emmanuel Gallo and Nicolas Tsingos. Extracting and re-rendering structured
auditory scenes from field recordings. In AES 30th International Conference on
Intelligent Audio Environments, 2007,
the sound source position estimation is based on pairwise time differences of arrival
measured by means of distributed microphones. Furthermore, the receiver is dependent on
the recording and requires all microphone signals for the synthesis (e.g., the generation of
the loudspeaker signals).
The method presented in
[28] Svein Berge. Device and method for converting spatial audio signal. US patent
application, Appl. No. 10/547,151,
uses, similarly to DirAC, direction of arrival as a parameter, thus limiting the
representation to a specific point of view of the sound scene. Moreover, it does not propose
the possibility to transmit/store the sound scene representation, since the analysis and
synthesis need both to be applied at the same side of the communication system.
The object of the present invention is to provide improved concepts for spatial sound
acquisition and description via the extraction of geometrical information. The object of the
present invention is solved by an apparatus for generating at least one audio output signal
based on an audio data stream according to claim 1, by an apparatus for generating an
audio data stream according to claim 10, by a system according to claim 19, by an audio
data stream according to claim 20, by a method for generating at least one audio output
signal according to claim 23, by a method for generating an audio data stream according to
claim 24 and by a computer program according to claim 25.
An apparatus for generating at least one audio output signal based on an audio data stream
comprising audio data relating to one or more sound sources is provided. The apparatus
comprises a receiver for receiving the audio data stream comprising the audio data. The
audio data comprises one or more pressure values for each one of the sound sources.
Furthermore, the audio data comprises one or more position values indicating a position of
one of the sound sources for each one of the sound sources. Moreover, the apparatus
comprises a synthesis module for generating the at least one audio output signal based on
at least one of the one or more pressure values of the audio data of the audio data stream
and based on at least one of the one or more position values of the audio data of the audio
data stream. In an embodiment, each one of the one or more position values may comprise
at least two coordinate values.
The audio data may be defined for a time-frequency bin of a plurality of time-frequency
bins. Alternatively, the audio data may be defined for a time instant of a plurality of time
instants. In some embodiments, one or more pressure values of the audio data may be
defined for a time instant of a plurality of time instants, while the corresponding
parameters (e.g., the position values) may be defined in a time-frequency domain. This can
be readily obtained by transforming back to time domain the pressure values otherwise
defined in time-frequency. For each one of the sound sources, at least one pressure value is
comprised in the audio data, wherein the at least one pressure value may be a pressure
value relating to an emitted sound wave, e.g. originating from the sound source. The
pressure value may be a value of an audio signal, for example, a pressure value of an audio
output signal generated by an apparatus for generating an audio output signal of a virtual
microphone, wherein the virtual microphone is placed at the position of the sound
source.
The above-described embodiment allows computing a sound field representation which is
truly independent of the recording position and provides for efficient transmission and
storage of a complex sound scene, as well as for easy modifications and an increased
flexibility at the reproduction system.
Inter alia, important advantages of this technique are that at the reproduction side the
listener can freely choose its position within the recorded sound scene, use any loudspeaker
setup, and additionally manipulate the sound scene based on the geometrical information,
e.g. position-based filtering. In other words, with the proposed technique the acoustic
viewpoint can be varied and the listening-position within the sound scene can be changed.
According to the above-described embodiment, the audio data comprised in the audio data
stream comprises one or more pressure values for each one of the sound sources. Thus, the
pressure values indicate an audio signal relative to one of the sound sources, e.g. an audio
signal originating from the sound source, and not relative to the position of the recording
microphones. Similarly, the one or more position values that are comprised in the audio
data stream indicate positions of the sound sources and not of the microphones.
By this, a plurality of advantages are realized: For example, a representation of an audio
scene is achieved that can be encoded using few bits. If the sound scene only comprises a
single sound source in a particular time frequency bin, only the pressure values of a single
audio signal relating to the only sound source have to be encoded together with the
position value indicating the position of the sound source. In contrast, traditional methods
may have to encode a plurality of pressure values from the plurality of recorded
microphone signals to reconstruct an audio scene at a receiver. Moreover, the above-described
embodiment allows easy modification of a sound scene on a transmitter side, as well
as on a receiver side, as will be described below. Thus, scene composition (e.g., deciding
the listening position within the sound scene) can also be carried out at the receiver side.
Embodiments employ the concept of modeling a complex sound scene by means of sound
sources, for example, point-like sound sources (PLS = point-like sound source), e.g.
isotropic point-like sound sources (IPLS), which are active at specific slots in a time-frequency
representation, such as the one provided by the Short-Time Fourier Transform
(STFT).
According to an embodiment, the receiver may be adapted to receive the audio data stream
comprising the audio data, wherein the audio data furthermore comprises one or more
diffuseness values for each one of the sound sources. The synthesis module may be
adapted to generate the at least one audio output signal based on at least one of the one or
more diffuseness values.
In another embodiment, the receiver may furthermore comprise a modification module for
modifying the audio data of the received audio data stream by modifying at least one of the
one or more pressure values of the audio data, by modifying at least one of the one or more
position values of the audio data or by modifying at least one of the diffuseness values of
the audio data. The synthesis module may be adapted to generate the at least one audio
output signal based on the at least one pressure value that has been modified, based on the
at least one position value that has been modified or based on the at least one diffuseness
value that has been modified.
In a further embodiment, each one of the position values of each one of the sound sources
may comprise at least two coordinate values. Furthermore, the modification module may
be adapted to modify the coordinate values by adding at least one random number to the
coordinate values, when the coordinate values indicate that a sound source is located at a
position within a predefined area of an environment.
According to another embodiment, each one of the position values of each one of the
sound sources may comprise at least two coordinate values. Moreover, the modification
module is adapted to modify the coordinate values by applying a deterministic function on
the coordinate values, when the coordinate values indicate that a sound source is located at
a position within a predefined area of an environment.
In a further embodiment, each one of the position values of each one of the sound sources
may comprise at least two coordinate values. Moreover, the modification module may be
adapted to modify a selected pressure value of the one or more pressure values of the audio
data, relating to the same sound source as the coordinate values, when the coordinate
values indicate that a sound source is located at a position within a predefined area of an
environment.
According to an embodiment, the synthesis module may comprise a first stage synthesis
unit and a second stage synthesis unit. The first stage synthesis unit may be adapted to
generate a direct pressure signal comprising direct sound, a diffuse pressure signal
comprising diffuse sound and direction of arrival information based on at least one of the
one or more pressure values of the audio data of the audio data stream, based on at least
one of the one or more position values of the audio data of the audio data stream and based
on at least one of the one or more diffuseness values of the audio data of the audio data
stream. The second stage synthesis unit may be adapted to generate the at least one audio
output signal based on the direct pressure signal, the diffuse pressure signal and the
direction of arrival information.
According to an embodiment, an apparatus for generating an audio data stream comprising
sound source data relating to one or more sound sources is provided. The apparatus for
generating an audio data stream comprises a determiner for determining the sound source
data based on at least one audio input signal recorded by at least one microphone and based
on audio side information provided by at least two spatial microphones. Furthermore, the
apparatus comprises a data stream generator for generating the audio data stream such that
the audio data stream comprises the sound source data. The sound source data comprises
one or more pressure values for each one of the sound sources. Moreover, the sound source
data furthermore comprises one or more position values indicating a sound source position
for each one of the sound sources. Furthermore, the sound source data is defined for a
time-frequency bin of a plurality of time-frequency bins.
In a further embodiment, the determiner may be adapted to determine the sound source
data based on diffuseness information provided by at least one spatial microphone. The data stream
generator may be adapted to generate the audio data stream such that the audio data stream
comprises the sound source data. The sound source data furthermore comprises one or
more diffuseness values for each one of the sound sources.
In another embodiment, the apparatus for generating an audio data stream may furthermore
comprise a modification module for modifying the audio data stream generated by the data
stream generator by modifying at least one of the pressure values of the audio data, at least
one of the position values of the audio data or at least one of the diffuseness values of the
audio data relating to at least one of the sound sources.
According to another embodiment, each one of the position values of each one of the
sound sources may comprise at least two coordinate values (e.g., two coordinates of a
Cartesian coordinate system, or azimuth and distance, in a polar coordinate system). The
modification module may be adapted to modify the coordinate values by adding at least
one random number to the coordinate values or by applying a deterministic function on the
coordinate values, when the coordinate values indicate that a sound source is located at a
position within a predefined area of an environment.
According to a further embodiment, an audio data stream is provided. The audio data
stream may comprise audio data relating to one or more sound sources, wherein the audio
data comprises one or more pressure values for each one of the sound sources. The audio
data may furthermore comprise at least one position value indicating a sound source
position for each one of the sound sources. In an embodiment, each one of the at least one
position values may comprise at least two coordinate values. The audio data may be
defined for a time-frequency bin of a plurality of time-frequency bins.
In another embodiment, the audio data furthermore comprises one or more diffuseness
values for each one of the sound sources.
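To make this layout concrete, the following minimal sketch models the per-bin payload of such an audio data stream in Python; the class and field names (TFBinData, etc.) are illustrative assumptions, not the stream syntax defined by the claims.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TFBinData:
    """Audio data of one sound source for one time-frequency bin (k, n)."""
    k: int                                # frequency index
    n: int                                # time index
    pressure: complex                     # pressure value of the source signal in this bin
    position: Tuple[float, ...]           # at least two coordinates, e.g. (x, y) or (x, y, z)
    diffuseness: Optional[float] = None   # value in [0, 1]; present in some embodiments

# Example: one source active in bin (k=12, n=3) at position (1.0, 2.5)
bin_data = TFBinData(k=12, n=3, pressure=0.3 + 0.1j, position=(1.0, 2.5), diffuseness=0.2)
```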
Preferred embodiments of the present invention will be described in the following, in
which:
Fig. 1 illustrates an apparatus for generating at least one audio output signal based
on an audio data stream comprising audio data relating to one or more sound
sources according to an embodiment,
Fig. 2 illustrates an apparatus for generating an audio data stream comprising
sound source data relating to one or more sound sources according to an
embodiment,
Fig. 3a-3c illustrate audio data streams according to different embodiments,
Fig. 4 illustrates an apparatus for generating an audio data stream comprising sound source data relating to one or more sound sources according to another embodiment,
Fig. 5 illustrates a sound scene composed of two sound sources and two uniform linear microphone arrays,
Fig. 6a illustrates an apparatus 600 for generating at least one audio output signal based on an audio data stream according to an embodiment,
Fig. 6b illustrates an apparatus 660 for generating an audio data stream comprising sound source data relating to one or more sound sources according to an embodiment,
Fig. 7 depicts a modification module according to an embodiment,
Fig. 8 depicts a modification module according to another embodiment,
Fig. 9 illustrates transmitter/analysis units and receiver/synthesis units according to an embodiment,
Fig. 10a depicts a synthesis module according to an embodiment,
Fig. 10b depicts a first synthesis stage unit according to an embodiment,
Fig. 10c depicts a second synthesis stage unit according to an embodiment,
Fig. 11 depicts a synthesis module according to another embodiment,
Fig. 12 illustrates an apparatus for generating an audio output signal of a virtual microphone according to an embodiment,
Fig. 13 illustrates the inputs and outputs of an apparatus and a method for generating an audio output signal of a virtual microphone according to an embodiment,
Fig. 14 illustrates the basic structure of an apparatus for generating an audio output signal of a virtual microphone according to an embodiment, which comprises a sound events position estimator and an information computation module,
Fig. 15 shows an exemplary scenario in which the real spatial microphones are depicted as Uniform Linear Arrays of 3 microphones each,
Fig. 16 depicts two spatial microphones for estimating the direction of arrival in 3D space,
Fig. 17 illustrates a geometry where an isotropic point-like sound source of the current time-frequency bin (k, n) is located at a position p_IPLS(k, n),
Fig. 18 depicts the information computation module according to an embodiment,
Fig. 19 depicts the information computation module according to another embodiment,
Fig. 20 shows two real spatial microphones, a localized sound event and a position of a virtual spatial microphone,
Fig. 21 illustrates how to obtain the direction of arrival relative to a virtual microphone according to an embodiment,
Fig. 22 depicts a possible way to derive the DOA of the sound from the point of view of the virtual microphone according to an embodiment,
Fig. 23 illustrates an information computation block comprising a diffuseness computation unit according to an embodiment,
Fig. 24 depicts a diffuseness computation unit according to an embodiment,
Fig. 25 illustrates a scenario where the sound events position estimation is not possible,
Fig. 26 illustrates an apparatus for generating a virtual microphone data stream according to an embodiment,
Fig. 27 illustrates an apparatus for generating at least one audio output signal based on an audio data stream according to another embodiment, and
Fig. 28a-28c illustrate scenarios where two microphone arrays receive direct sound, sound reflected by a wall and diffuse sound.
Before providing a detailed description of embodiments of the present invention, an
apparatus for generating an audio output signal of a virtual microphone is described to
provide background information regarding the concepts of the present invention.
Fig. 12 illustrates an apparatus for generating an audio output signal to simulate a
recording of a microphone at a configurable virtual position posVmic in an environment.
The apparatus comprises a sound events position estimator 110 and an information
computation module 120. The sound events position estimator 110 receives a first direction
information di1 from a first real spatial microphone and a second direction information di2
from a second real spatial microphone. The sound events position estimator 110 is adapted
to estimate a sound source position ssp indicating a position of a sound source in the
environment, the sound source emitting a sound wave, wherein the sound events position
estimator 110 is adapted to estimate the sound source position ssp based on a first direction
information di1 provided by a first real spatial microphone being located at a first real
microphone position pos1mic in the environment, and based on a second direction
information di2 provided by a second real spatial microphone being located at a second
real microphone position in the environment. The information computation module 120 is
adapted to generate the audio output signal based on a first recorded audio input signal is1
being recorded by the first real spatial microphone, based on the first real microphone
position pos1mic and based on the virtual position posVmic of the virtual microphone. The
information computation module 120 comprises a propagation compensator being adapted
to generate a first modified audio signal by modifying the first recorded audio input signal
is1 by compensating a first delay or amplitude decay between an arrival of the sound wave
emitted by the sound source at the first real spatial microphone and an arrival of the sound
wave at the virtual microphone by adjusting an amplitude value, a magnitude value or a
phase value of the first recorded audio input signal is1, to obtain the audio output signal.
Fig. 13 illustrates the inputs and outputs of an apparatus and a method according to an
embodiment. Information from two or more real spatial microphones 111, 112, ..., 11N is
fed to the apparatus/is processed by the method. This information comprises audio signals
picked up by the real spatial microphones as well as direction information from the real
spatial microphones, e.g. direction of arrival (DOA) estimates. The audio signals and the
direction information, such as the direction of arrival estimates, may be expressed in a time-frequency
domain. If, for example, a 2D geometry reconstruction is desired and a
traditional STFT (short time Fourier transformation) domain is chosen for the
representation of the signals, the DOA may be expressed as azimuth angles dependent on k
and n, namely the frequency and time indices.
In embodiments, the sound event localization in space, as well as describing the position of
the virtual microphone may be conducted based on the positions and orientations of the
real and virtual spatial microphones in a common coordinate system. This information may
be represented by the inputs 121 ... 12N and input 104 in Fig. 13. The input 104 may
additionally specify the characteristic of the virtual spatial microphone, e.g., its position
and pick-up pattern, as will be discussed in the following. If the virtual spatial microphone
comprises multiple virtual sensors, their positions and the corresponding different pick-up
patterns may be considered.
The output of the apparatus or a corresponding method may be, when desired, one or more
sound signals 105, which may have been picked up by a spatial microphone defined and
placed as specified by 104. Moreover, the apparatus (or rather the method) may provide as
output corresponding spatial side information 106 which may be estimated by employing
the virtual spatial microphone.
Fig. 14 illustrates an apparatus according to an embodiment, which comprises two main
processing units, a sound events position estimator 201 and an information computation
module 202. The sound events position estimator 201 may carry out geometrical
reconstruction on the basis of the DOAs comprised in inputs 111 ... 11N and based on the
knowledge of the position and orientation of the real spatial microphones, where the DOAs
have been computed. The output 205 of the sound events position estimator comprises the
position estimates (either in 2D or 3D) of the sound sources where the sound events occur
for each time and frequency bin. The second processing block 202 is an information
computation module. According to the embodiment of Fig. 14, the second processing
block 202 computes a virtual microphone signal and spatial side information. It is therefore
also referred to as virtual microphone signal and side information computation block 202.
The virtual microphone signal and side information computation block 202 uses the sound
events' positions 205 to process the audio signals comprised in 111 ... 11N to output the
virtual microphone audio signal 105. Block 202, if required, may also compute the spatial
side information 106 corresponding to the virtual spatial microphone. Embodiments below
illustrate possibilities of how blocks 201 and 202 may operate.
In the following, position estimation of a sound events position estimator according to an
embodiment is described in more detail.
Depending on the dimensionality of the problem (2D or 3D) and the number of spatial
microphones, several solutions for the position estimation are possible.
If two spatial microphones exist in 2D (the simplest possible case), a simple triangulation
is possible. Fig. 15 shows an exemplary scenario in which the real spatial microphones are
depicted as Uniform Linear Arrays (ULAs) of 3 microphones each. The DOAs, expressed
as the azimuth angles a1(k, n) and a2(k, n), are computed for the time-frequency bin (k, n).
This is achieved by employing a proper DOA estimator, such as ESPRIT,
[13] R. Roy, A. Paulraj, and T. Kailath, "Direction-of-arrival estimation by subspace
rotation methods - ESPRIT," in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), Stanford, CA, USA, April 1986,
or (root) MUSIC, see
[14] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE
Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986,
to the pressure signals transformed into the time-frequency domain.
In Fig. 15, two real spatial microphones, here, two real spatial microphone arrays 410, 420
are illustrated. The two estimated DOAs a1(k, n) and a2(k, n) are represented by two lines,
a first line 430 representing DOA a1(k, n) and a second line 440 representing DOA a2(k, n).
The triangulation is possible via simple geometrical considerations knowing the
position and orientation of each array.
The triangulation fails when the two lines 430, 440 are exactly parallel. In real
applications, however, this is very unlikely. However, not all triangulation results
correspond to a physical or feasible position for the sound event in the considered space.
For example, the estimated position of the sound event might be too far away or even
outside the assumed space, indicating that the DOAs probably do not correspond to any
sound event which can be physically interpreted with the used model. Such results may be
caused by sensor noise or too strong room reverberation. Therefore, according to an
embodiment, such undesired results are flagged such that the information computation
module 202 can treat them properly.
Fig. 16 depicts a scenario where the position of a sound event is estimated in 3D space.
Proper spatial microphones are employed, for example, a planar or 3D microphone array.
In Fig. 16, a first spatial microphone 510, for example, a first 3D microphone array, and a
second spatial microphone 520, e.g., a second 3D microphone array, are illustrated. The DOA
in the 3D space may, for example, be expressed as azimuth and elevation. Unit vectors
530, 540 may be employed to express the DOAs. Two lines 550, 560 are projected
according to the DOAs. In 3D, even with very reliable estimates, the two lines 550, 560
projected according to the DOAs might not intersect. However, the triangulation can still
be carried out, for example, by choosing the middle point of the smallest segment
connecting the two lines.
Similarly to the 2D case, the triangulation may fail or may yield unfeasible results for
certain combinations of directions, which may then also be flagged, e.g. to the information
computation module 202 of Fig. 14.
If more than two spatial microphones exist, several solutions are possible. For example, the
triangulation explained above could be carried out for all pairs of the real spatial
microphones (if N = 3, 1 with 2, 1 with 3, and 2 with 3). The resulting positions may then
be averaged (along x and y, and, if 3D is considered, z).
Alternatively, more complex concepts may be used. For example, probabilistic approaches
may be applied as described in
[15] J. Michael Steele, "Optimal Triangulation of Random Samples in the Plane", The
Annals of Probability, Vol. 10, No. 3 (Aug., 1982), pp. 548-553.
According to an embodiment, the sound field may be analyzed in the time-frequency
domain, for example, obtained via a short-time Fourier transform (STFT), in which k and n
denote the frequency index k and time index n, respectively. The complex pressure Pv(k, n)
at an arbitrary position pv for a certain k and n is modeled as a single spherical wave
emitted by a narrow-band isotropic point-like source, e.g. by employing the formula:
P_v(k, n) = P_IPLS(k, n) · γ(k, p_IPLS(k, n), p_v),    (1)

where P_IPLS(k, n) is the signal emitted by the IPLS at its position p_IPLS(k, n). The complex
factor γ(k, p_IPLS, p_v) expresses the propagation from p_IPLS(k, n) to p_v, e.g., it introduces
appropriate phase and magnitude modifications. Here, the assumption may be applied that
in each time-frequency bin only one IPLS is active. Nevertheless, multiple narrow-band
IPLSs located at different positions may also be active at a single time instance.
Each IPLS either models direct sound or a distinct room reflection. Its position p_IPLS(k, n)
may ideally correspond to an actual sound source located inside the room, or a mirror
image sound source located outside, respectively. Therefore, the position p_IPLS(k, n) may
also indicate the position of a sound event.
Please note that the term "real sound sources" denotes the actual sound sources physically
existing in the recording environment, such as talkers or musical instruments. On the
contrary, with "sound sources" or "sound events" or "IPLS" we refer to effective sound
sources, which are active at certain time instants or at certain time-frequency bins, wherein
the sound sources may, for example, represent real sound sources or mirror image sources.
Fig. 28a-28b illustrate microphone arrays localizing sound sources. The localized sound
sources may have different physical interpretations depending on their nature. When the
microphone arrays receive direct sound, they may be able to localize the position of a true
sound source (e.g. talkers). When the microphone arrays receive reflections, they may
localize the position of a mirror image source. Mirror image sources are also sound
sources.
Fig. 28a illustrates a scenario, where two microphone arrays 151 and 152 receive direct
sound from an actual sound source (a physically existing sound source) 153.
Fig. 28b illustrates a scenario, where two microphone arrays 161, 162 receive reflected
sound, wherein the sound has been reflected by a wall. Because of the reflection, the
microphone arrays 161, 162 localize the position, where the sound appears to come from,
at a position of a mirror image source 165, which is different from the position of the
speaker 163.
Both the actual sound source 153 of Fig. 28a, as well as the mirror image source 165 are
sound sources.
Fig. 28c illustrates a scenario, where two microphone arrays 171, 172 receive diffuse
sound and are not able to localize a sound source.
This single-wave model is accurate only for mildly reverberant environments given
that the source signals fulfill the W-disjoint orthogonality (WDO) condition, i.e., the time-frequency
overlap is sufficiently small. This is normally true for speech signals, see, for
example,
[12] S. Rickard and Z. Yilmaz, "On the approximate W-disjoint orthogonality of speech,"
in Acoustics, Speech and Signal Processing, 2002. ICASSP 2002. IEEE International
Conference on, April 2002, vol. 1.
However, the model also provides a good estimate for other environments and is therefore
also applicable for those environments.
In the following, the estimation of the positions p_IPLS(k, n) according to an embodiment is
explained. The position p_IPLS(k, n) of an active IPLS in a certain time-frequency bin, and
thus the estimation of a sound event in a time-frequency bin, is estimated via triangulation
on the basis of the direction of arrival (DOA) of sound measured in at least two different
observation points.
Fig. 17 illustrates a geometry, where the IPLS of the current time-frequency slot (k, n) is
located in the unknown position p_IPLS(k, n). In order to determine the required DOA
information, two real spatial microphones, here, two microphone arrays, are employed
having a known geometry, position and orientation, which are placed in positions 610 and
620, respectively. The vectors p1 and p2 point to the positions 610, 620, respectively. The
array orientations are defined by the unit vectors c1 and c2. The DOA of the sound is
determined in the positions 610 and 620 for each (k, n) using a DOA estimation algorithm,
for instance as provided by the DirAC analysis (see [2], [3]). By this, a first point-of-view
unit vector e1_POV(k, n) and a second point-of-view unit vector e2_POV(k, n) with respect to a
point of view of the microphone arrays (both not shown in Fig. 17) may be provided as
output of the DirAC analysis. For example, when operating in 2D, the first point-of-view
unit vector results to:

e1_POV(k, n) = [cos(φ1(k, n)), sin(φ1(k, n))]^T.    (2)

Here, φ1(k, n) represents the azimuth of the DOA estimated at the first microphone array,
as depicted in Fig. 17. The corresponding DOA unit vectors e1(k, n) and e2(k, n), with
respect to the global coordinate system in the origin, may be computed by applying the
formulae:

e1(k, n) = R1 · e1_POV(k, n),
e2(k, n) = R2 · e2_POV(k, n),    (3)

where R are coordinate transformation matrices, e.g.,

R1 = [ c1,x  -c1,y ; c1,y  c1,x ]    (4)

when operating in 2D and c1 = [c1,x, c1,y]^T. For carrying out the triangulation, the
direction vectors d1(k, n) and d2(k, n) may be calculated as:

d1(k, n) = d1(k, n) · e1(k, n),
d2(k, n) = d2(k, n) · e2(k, n),    (5)

where d1(k, n) = ||d1(k, n)|| and d2(k, n) = ||d2(k, n)|| are the unknown distances between the
IPLS and the two microphone arrays. The following equation

p1 + d1(k, n) = p2 + d2(k, n)    (6)

may be solved for d1(k, n). Finally, the position p_IPLS(k, n) of the IPLS is given by

p_IPLS(k, n) = d1(k, n) · e1(k, n) + p1.    (7)

In another embodiment, equation (6) may be solved for d2(k, n), and p_IPLS(k, n) is
analogously computed employing d2(k, n).
Equation (6) always provides a solution when operating in 2D, unless e1(k, n) and e2(k, n)
are parallel. However, when using more than two microphone arrays or when operating in
3D, a solution cannot be obtained when the direction vectors d do not intersect. According
to an embodiment, in this case, the point which is closest to all direction vectors d is
computed and the result can be used as the position of the IPLS.
In an embodiment, all observation points p1, p2, ... should be located such that the sound
emitted by the IPLS falls into the same temporal block n. This requirement may simply be
fulfilled when the distance D between any two of the observation points is smaller than

D_max = c · n_FFT (1 - R) / f_s,    (8)

where n_FFT is the STFT window length, 0 ≤ R < 1 specifies the overlap between successive
time frames, f_s is the sampling frequency and c is the speed of sound. For example, for a
1024-point STFT at 48 kHz with 50 % overlap (R = 0.5), the maximum spacing between the
arrays to fulfill the above requirement is D = 3.65 m.
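A quick numeric check of formula (8), assuming a speed of sound of c = 343 m/s:

```python
def max_array_spacing(n_fft=1024, overlap=0.5, fs=48000, c=343.0):
    """Maximum distance D_max between observation points, formula (8)."""
    return c * n_fft * (1.0 - overlap) / fs

print(round(max_array_spacing(), 2))  # ~3.66 m; the ~3.65 m example above corresponds
                                      # to a slightly lower assumed speed of sound
```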
In the following, an information computation module 202, e.g. a virtual microphone signal
and side information computation module, according to an embodiment is described in
more detail.
Fig. 18 illustrates a schematic overview of an information computation module 202
according to an embodiment. The information computation unit comprises a propagation
compensator 500, a combiner 510 and a spectral weighting unit 520. The information
computation module 202 receives the sound source position estimates ssp estimated by a
sound events position estimator, one or more audio input signals is recorded by one or
more of the real spatial microphones, positions posRealMic of one or more of the real
spatial microphones, and the virtual position posVmic of the virtual microphone. It outputs
an audio output signal os representing an audio signal of the virtual microphone.
Fig. 19 illustrates an information computation module according to another embodiment.
The information computation module of Fig. 19 comprises a propagation compensator 500,
a combiner 510 and a spectral weighting unit 520. The propagation compensator 500
comprises a propagation parameters computation module 501 and a propagation
compensation module 504. The combiner 510 comprises a combination factors
computation module 502 and a combination module 505. The spectral weighting unit 520
comprises a spectral weights computation unit 503, a spectral weighting application
module 506 and a spatial side information computation module 507.
To compute the audio signal of the virtual microphone, the geometrical information, e.g.
the position and orientation of the real spatial microphones 121 ... 12N, the position,
orientation and characteristics of the virtual spatial microphone 104, and the position
estimates of the sound events 205 are fed into the information computation module 202, in
particular, into the propagation parameters computation module 501 of the propagation
compensator 500, into the combination factors computation module 502 of the combiner
510 and into the spectral weights computation unit 503 of the spectral weighting unit 520.
The propagation parameters computation module 501, the combination factors
computation module 502 and the spectral weights computation unit 503 compute the
parameters used in the modification of the audio signals 111 ... 11N in the propagation
compensation module 504, the combination module 505 and the spectral weighting
application module 506.
In the information computation module 202, the audio signals 111 ... 11N may at first be
modified to compensate for the effects given by the different propagation lengths between
the sound event positions and the real spatial microphones. The signals may then be
combined to improve for instance the signal-to-noise ratio (SNR). Finally, the resulting
signal may then be spectrally weighted to take the directional pick-up pattern of the virtual
microphone into account, as well as any distance-dependent gain function. These three
steps are discussed in more detail below.
Propagation compensation is now explained in more detail. In the upper portion of Fig. 20,
two real spatial microphones (a first microphone array 910 and a second microphone array
920), the position of a localized sound event 930 for time-frequency bin (k, n), and the
position of the virtual spatial microphone 940 are illustrated.
The lower portion of Fig. 20 depicts a temporal axis. It is assumed that a sound event is
emitted at time tO and then propagates to the real and virtual spatial microphones. The time
delays of arrival as well as the amplitudes change with distance, so that the further the
propagation length, the weaker the amplitude and the longer the time delay of arrival are.
The signals at the two real arrays are comparable only if the relative delay Dt12 between
them is small. Otherwise, one of the two signals needs to be temporally realigned to
compensate the relative delay Dt12, and possibly, to be scaled to compensate for the
different decays.
Compensating the delay between the arrival at the virtual microphone and the arrival at the
real microphone arrays (at one of the real spatial microphones) changes the delay
independent from the localization of the sound event, making it superfluous for most
applications.
Returning to Fig. 19, the propagation parameters computation module 501 is adapted to
compute the delays to be corrected for each real spatial microphone and for each sound
event. If desired, it also computes the gain factors to be considered to compensate for the
different amplitude decays.
The propagation compensation module 504 is configured to use this information to modify
the audio signals accordingly. If the signals are to be shifted by a small amount of time
(compared to the time window of the filter bank), then a simple phase rotation suffices. If
the delays are larger, more complicated implementations are necessary.
The output of the propagation compensation module 504 are the modified audio signals
expressed in the original time-frequency domain.
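The "simple phase rotation" mentioned above can be sketched as follows for an STFT-domain signal; a minimal illustration assuming a real-valued time signal analyzed with an FFT of length 2*(num_bins - 1), with delays small compared to the analysis window.

```python
import numpy as np

def compensate_delay_stft(X, delay_s, fs):
    """Remove a propagation delay from an STFT signal via per-bin phase rotation.

    X       : complex STFT matrix, shape (num_bins, num_frames)
    delay_s : delay to compensate, in seconds (small vs. the analysis window)
    fs      : sampling frequency in Hz
    """
    num_bins = X.shape[0]
    freqs = np.arange(num_bins) * fs / (2.0 * (num_bins - 1))  # bin center frequencies
    phase = np.exp(1j * 2.0 * np.pi * freqs * delay_s)         # advance the signal by delay_s
    return phase[:, None] * X
```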
In the following, a particular estimation of propagation compensation for a virtual
microphone according to an embodiment will be described with reference to Fig. 17, which
inter alia illustrates the position 610 of a first real spatial microphone and the position 620
of a second real spatial microphone.
In the embodiment that is now explained, it is assumed that at least a first recorded audio
input signal, e.g. a pressure signal of at least one of the real spatial microphones (e.g. the
microphone arrays) is available, for example, the pressure signal of a first real spatial
microphone. We will refer to the considered microphone as reference microphone, to its
position as reference position p_ref and to its pressure signal as reference pressure signal
P_ref(k, n). However, propagation compensation may not only be conducted with respect to
only one pressure signal, but also with respect to the pressure signals of a plurality or of all
of the real spatial microphones.
The relationship between the pressure signal P_IPLS(k, n) emitted by the IPLS and a
reference pressure signal P_ref(k, n) of a reference microphone located in p_ref can be
expressed by formula (9):

P_ref(k, n) = P_IPLS(k, n) · γ(k, p_IPLS, p_ref).    (9)

In general, the complex factor γ(k, p_a, p_b) expresses the phase rotation and amplitude decay
introduced by the propagation of a spherical wave from its origin in p_a to p_b. However,
practical tests indicated that considering only the amplitude decay in γ leads to plausible
impressions of the virtual microphone signal with significantly fewer artifacts compared to
also considering the phase rotation.
The sound energy which can be measured in a certain point in space depends strongly on
the distance r from the sound source, in Fig. 17 from the position p_IPLS of the sound source.
In many situations, this dependency can be modeled with sufficient accuracy using
well-known physical principles, for example, the 1/r decay of the sound pressure in the far-field
of a point source. When the distance of a reference microphone, for example, the first real
microphone from the sound source is known, and when also the distance of the virtual
microphone from the sound source is known, then, the sound energy at the position of the
virtual microphone can be estimated from the signal and the energy of the reference
microphone, e.g. the first real spatial microphone. This means, that the output signal of the
virtual microphone can be obtained by applying proper gains to the reference pressure
signal.
Assuming that the first real spatial microphone is the reference microphone, then p_ref = p1.
In Fig. 17, the virtual microphone is located in p_v. Since the geometry in Fig. 17 is known
in detail, the distance d1(k, n) = ||d1(k, n)|| between the reference microphone (in Fig. 17:
the first real spatial microphone) and the IPLS can easily be determined, as well as the
distance s(k, n) = ||s(k, n)|| between the virtual microphone and the IPLS, namely

s(k, n) = ||s(k, n)|| = ||p1 + d1(k, n) - p_v||.    (10)
The sound pressure P_v(k, n) at the position of the virtual microphone is computed by
combining formulas (1) and (9), leading to

P_v(k, n) = ( γ(k, p_IPLS, p_v) / γ(k, p_IPLS, p_ref) ) · P_ref(k, n).    (11)

As mentioned above, in some embodiments, the factors γ may only consider the amplitude
decay due to the propagation. Assuming for instance that the sound pressure decreases with
1/r, then

P_v(k, n) = ( d1(k, n) / s(k, n) ) · P_ref(k, n).    (12)
When the model in formula (1) holds, e.g., when only direct sound is present, then formula
(12) can accurately reconstruct the magnitude information. However, in case of pure
diffuse sound fields, e.g., when the model assumptions are not met, the presented method
yields an implicit dereverberation of the signal when moving the virtual microphone away
from the positions of the sensor arrays. In fact, as discussed above, in diffuse sound fields,
we expect that most IPLS are localized near the two sensor arrays. Thus, when moving the
virtual microphone away from these positions, we likely increase the distance s = ||s|| in
Fig. 17. Therefore, the magnitude of the reference pressure is decreased when applying a
weighting according to formula (11). Correspondingly, when moving the virtual
microphone close to an actual sound source, the time-frequency bins corresponding to the
direct sound will be amplified such that the overall audio signal will be perceived less
diffuse. By adjusting the rule in formula (12), one can control the direct sound
amplification and diffuse sound suppression at will.
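The magnitude-only compensation of formula (12) can be sketched as follows, assuming the IPLS position of the current bin has already been triangulated; the function name and arguments are illustrative.

```python
import numpy as np

def virtual_mic_pressure(P_ref, p1, p_v, p_ipls):
    """Propagation-compensated pressure at the virtual microphone, formula (12).

    P_ref  : complex reference pressure P_ref(k, n) of one time-frequency bin
    p1     : position of the reference (first real spatial) microphone
    p_v    : position of the virtual microphone
    p_ipls : triangulated IPLS position for this bin, formula (7)
    """
    d1 = np.linalg.norm(np.asarray(p_ipls, float) - np.asarray(p1, float))  # IPLS to reference mic
    s = np.linalg.norm(np.asarray(p_ipls, float) - np.asarray(p_v, float))  # IPLS to virtual mic, formula (10)
    return (d1 / s) * P_ref  # 1/r amplitude decay only, no phase rotation
```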
By conducting propagation compensation on the recorded audio input signal (e.g. the
pressure signal) of the first real spatial microphone, a first modified audio signal is
obtained.
In embodiments, a second modified audio signal may be obtained by conducting
propagation compensation on a recorded second audio input signal (second pressure
signal) of the second real spatial microphone.
In other embodiments, further audio signals may be obtained by conducting propagation
compensation on recorded further audio input signals (further pressure signals) of further
real spatial microphones.
Now, the combining in blocks 502 and 505 of Fig. 19 according to an embodiment is
explained in more detail. It is assumed that two or more audio signals from a plurality of
different real spatial microphones have been modified to compensate for the different
propagation paths to obtain two or more modified audio signals. Once the audio signals
from the different real spatial microphones have been modified to compensate for the
different propagation paths, they can be combined to improve the audio quality. By doing
so, for example, the SNR can be increased or the reverberance can be reduced.
Possible solutions for the combination comprise:
- Weighted averaging, e.g., considering SNR, or the distance to the virtual
microphone, or the diffuseness which was estimated by the real spatial
microphones. Traditional solutions, for example, Maximum Ratio Combining
(MRC) or Equal Gain Combining (EQC) may be employed, or
- Linear combination of some or all of the modified audio signals to obtain a
combination signal. The modified audio signals may be weighted in the linear
combination to obtain the combination signal, or
Selection, e.g., only one signal is used, for example, dependent on SNR or distance
or diffuseness.
The task of module 502 is, if applicable, to compute parameters for the combining, which
is carried out in module 505.
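As one possible reading of the weighted-averaging option, the following sketch (in the roles of blocks 502 and 505) combines the propagation-compensated signals with weights derived from per-microphone SNR estimates; the specific weighting rule is an illustrative assumption.

```python
import numpy as np

def combine_signals(signals, snr_db):
    """SNR-weighted combination of propagation-compensated signals.

    signals : complex array, shape (num_mics, num_bins, num_frames)
    snr_db  : per-microphone SNR estimates in dB, shape (num_mics,)
    """
    w = 10.0 ** (np.asarray(snr_db, float) / 10.0)  # combination factors (block 502)
    w /= w.sum()                                    # normalize the weights
    return np.tensordot(w, signals, axes=(0, 0))    # weighted average (block 505)
```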
Now, spectral weighting according to embodiments is described in more detail. For this,
reference is made to blocks 503 and 506 of Fig. 19. At this final step, the audio signal
resulting from the combination or from the propagation compensation of the input audio
signals is weighted in the time-frequency domain according to spatial characteristics of the
virtual spatial microphone as specified by input 104 and/or according to the reconstructed
geometry (given in 205).
For each time-frequency bin the geometrical reconstruction allows us to easily obtain the
DOA relative to the virtual microphone, as shown in Fig. 21. Furthermore, the distance
between the virtual microphone and the position of the sound event can also be readily
computed.
The weight for the time-frequency bin is then computed considering the type of virtual
microphone desired.
In case of directional microphones, the spectral weights may be computed according to a
predefined pick-up pattern. For example, according to an embodiment, a cardioid
microphone may have a pick-up pattern defined by the function g(theta),
g(theta) = 0.5 + 0.5 cos(theta),
where theta is the angle between the look direction of the virtual spatial microphone and
the DOA of the sound from the point of view of the virtual microphone.
Another possibility is artistic (non-physical) decay functions. In certain applications, it may
be desired to suppress sound events far away from the virtual microphone with a factor
greater than the one characterizing free-field propagation. For this purpose, some
embodiments introduce an additional weighting function which depends on the distance
between the virtual microphone and the sound event. In an embodiment, only sound events
within a certain distance (e.g. in meters) from the virtual microphone should be picked up.
With respect to virtual microphone directivity, arbitrary directivity patterns can be applied
for the virtual microphone. In doing so, one can for instance separate a source from a
complex sound scene.
Since the DOA of the sound can be computed in the position p_v of the virtual microphone,
namely

φ_v(k, n) = arccos( (s(k, n) · c) / ||s(k, n)|| ),    (13)

where c is a unit vector describing the orientation of the virtual microphone, arbitrary
directivities for the virtual microphone can be realized. For example, assuming that P_v(k, n)
indicates the combination signal or the propagation-compensated modified audio signal,
then the formula:

P̃_v(k, n) = P_v(k, n) [1 + cos(φ_v(k, n))]    (14)
calculates the output of a virtual microphone with cardioid directivity. The directional
patterns, which can potentially be generated in this way, depend on the accuracy of the
position estimation.
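A minimal sketch of this spectral weighting step, assuming the angle φ_v(k, n) between the look direction and the per-bin DOA has already been computed as in formula (13):

```python
import numpy as np

def apply_cardioid_weighting(P_v, phi_v):
    """Spectral weighting for a virtual cardioid microphone, formula (14).

    P_v   : complex combined or propagation-compensated signal, shape (num_bins, num_frames)
    phi_v : per-bin angle between look direction and DOA, same shape as P_v
    """
    return P_v * (1.0 + np.cos(phi_v))  # cardioid directivity, cf. g(theta) above
```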
In embodiments, one or more real, non-spatial microphones, for example, an
omnidirectional microphone or a directional microphone such as a cardioid, are placed in
the sound scene in addition to the real spatial microphones to further improve the sound
quality of the virtual microphone signals 105. These microphones are not used
to gather any geometrical information, but rather only to provide a cleaner audio signal.
These microphones may be placed closer to the sound sources than the spatial
microphones. In this case, according to an embodiment, the audio signals of the real,
non-spatial microphones and their positions are simply fed to the propagation compensation
module 504 of Fig. 19 for processing, instead of the audio signals of the real spatial
microphones. Propagation compensation is then conducted for the one or more recorded
audio signals of the non-spatial microphones with respect to the position of the one or
more non-spatial microphones. By this, an embodiment is realized using additional
non-spatial microphones.
In a further embodiment, computation of the spatial side information of the virtual
microphone is realized. To compute the spatial side information 106 of the microphone,
the information computation module 202 of Fig. 19 comprises a spatial side information
computation module 507, which is adapted to receive as input the sound sources' positions
205 and the position, orientation and characteristics 104 of the virtual microphone. In
certain embodiments, according to the side information 106 that needs to be computed, the
audio signal of the virtual microphone 105 can also be taken into account as input to the
spatial side information computation module 507.
The output of the spatial side information computation module 507 is the side information
of the virtual microphone 106. This side information can be, for instance, the DOA or the
diffuseness of sound for each time-frequency bin (k, n) from the point of view of the
virtual microphone. Another possible side information could, for instance, be the active
sound intensity vector Ia(k, n) which would have been measured in the position of the
virtual microphone. How these parameters can be derived, will now be described.
According to an embodiment, DOA estimation for the virtual spatial microphone is
realized. The information computation module 120 is adapted to estimate the direction of
arrival at the virtual microphone as spatial side information, based on a position vector of
the virtual microphone and based on a position vector of the sound event as illustrated by
Fig. 22.
Fig. 22 depicts a possible way to derive the DOA of the sound from the point of view of
the virtual microphone. The position of the sound event, provided by block 205 in Fig. 14,
can be described for each time-frequency bin (k, n) with a position vector r(k, n), the
position vector of the sound event. Similarly, the position of the virtual microphone,
provided as input 104 in Fig. 14, can be described with a position vector s(k, n), the
position vector of the virtual microphone. The look direction of the virtual microphone can
be described by a vector v(k, n). The DOA relative to the virtual microphone is given by
a(k,n). It represents the angle between v and the sound propagation path h(k,n). h(k, n) can
be computed by employing the formula:
h(k, n) = s(k,n) r(k, n).
The desired DOA a(k, n) can now be computed for each (k, n) for instance via the
definition of the dot product of h(k, n) and v(k,n), namely
a(k, n) = arcos (h(k, n) v(k,n) / ( jjh(k, n)|j j|v(k ,n) | ).
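A minimal sketch of this computation (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def doa_at_virtual_mic(s, r, v):
    """DOA a(k, n) at the virtual microphone, following the two formulas
    above: h = s - r, then the arccos of the normalized dot product.

    s : position vector of the virtual microphone, s(k, n)
    r : position vector of the sound event, r(k, n)
    v : look direction of the virtual microphone, v(k, n)
    """
    h = np.asarray(s, dtype=float) - np.asarray(r, dtype=float)
    cos_a = np.dot(h, v) / (np.linalg.norm(h) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_a, -1.0, 1.0))  # clip guards rounding errors
```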
In another embodiment, the information computation module 120 may be adapted to
estimate the active sound intensity at the virtual microphone as spatial side information,
based on a position vector of the virtual microphone and based on a position vector of the
sound event as illustrated by Fig. 22.
From the DOA a(k, n) defined above, we can derive the active sound intensity Ia(k, n) at
the position of the virtual microphone. For this, it is assumed that the virtual microphone
audio signal 105 in Fig. 9 corresponds to the output of an omnidirectional microphone,
e.g., we assume that the virtual microphone is an omnidirectional microphone. Moreover,
the looking direction v in Fig. 22 is assumed to be parallel to the x-axis of the coordinate
system. Since the desired active sound intensity vector Ia(k, n) describes the net flow of
energy through the position of the virtual microphone, Ia(k, n) can be computed, e.g.,
according to the formula:

Ia(k, n) = -(1/(2 ρ)) |Pv(k, n)|² [cos a(k, n), sin a(k, n)]^T,

where [ ]^T denotes a transposed vector, ρ is the air density, and Pv(k, n) is the sound
pressure measured by the virtual spatial microphone, e.g., the output 105 of block 506 in
Fig. 19.

If the active intensity vector is to be computed expressed in the general coordinate system,
but still at the position of the virtual microphone, the following formula may be applied:

Ia(k, n) = (1/(2 ρ)) |Pv(k, n)|² h(k, n) / ‖h(k, n)‖.
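Both intensity expressions can be sketched as follows; the default air density of approximately 1.2 kg/m³ is an assumption, as no numeric value is given in the text, and the function names are illustrative:

```python
import numpy as np

def active_intensity_local(p_v, a, rho=1.2):
    """Ia(k, n) in the virtual microphone's own coordinate system
    (look direction parallel to the x-axis), first formula above."""
    return -(np.abs(p_v) ** 2 / (2.0 * rho)) * np.array([np.cos(a), np.sin(a)])

def active_intensity_general(p_v, h, rho=1.2):
    """Ia(k, n) in the general coordinate system: same magnitude,
    directed along the propagation path h(k, n), second formula above."""
    h = np.asarray(h, dtype=float)
    return (np.abs(p_v) ** 2 / (2.0 * rho)) * h / np.linalg.norm(h)
```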
The diffuseness of sound expresses how diffuse the sound field is in a given time-frequency
slot (see, for example, [2]). Diffuseness is expressed by a value ψ, wherein 0 ≤ ψ ≤ 1. A
diffuseness of 1 indicates that the total sound field energy of a sound field is completely
diffuse. This information is important, e.g., in the reproduction of spatial sound.
Traditionally, diffuseness is computed at the specific point in space in which a microphone
array is placed.
According to an embodiment, the diffuseness may be computed as an additional parameter
to the side information generated for the Virtual Microphone (VM), which can be placed at
an arbitrary position in the sound scene. In this way, an apparatus that calculates the
diffuseness in addition to the audio signal at the virtual position of a virtual microphone
can be seen as a virtual DirAC front-end, as it is possible to produce a DirAC stream,
namely an audio signal, direction of arrival, and diffuseness, for an arbitrary point in the
sound scene.
The DirAC stream may be further processed, stored, transmitted, and played back on an
arbitrary multi-loudspeaker setup. In this case, the listener experiences the sound scene as
if he or she were in the position specified by the virtual microphone and were looking in
the direction determined by its orientation.
Fig. 23 illustrates an information computation block according to an embodiment
comprising a diffuseness computation unit 801 for computing the diffuseness at the virtual
microphone. The information computation block 202 is adapted to receive inputs 111 to
11N, which in addition to the inputs of Fig. 14 also include the diffuseness at the real
spatial microphones. Let ψ^(SM1) to ψ^(SMN) denote these values. These additional
inputs are fed to the information computation module 202. The output 103 of the
diffuseness computation unit 801 is the diffuseness parameter computed at the position of
the virtual microphone.
A diffuseness computation unit 801 of an embodiment is illustrated in more detail in
Fig. 24. According to an embodiment, the energy of direct and diffuse sound at each
of the N spatial microphones is estimated. Then, using the information on the positions of
the IPLS, and the information on the positions of the spatial and virtual microphones, N
estimates of these energies at the position of the virtual microphone are obtained. Finally,
the estimates can be combined to improve the estimation accuracy and the diffuseness
parameter at the virtual microphone can be readily computed.
Let Edir^(SM1) to Edir^(SMN) and Ediff^(SM1) to Ediff^(SMN) denote the estimates of
the energies of direct and diffuse sound for the N spatial microphones computed by energy
analysis unit 810. If Pi is the complex pressure signal and ψi is the diffuseness for the i-th
spatial microphone, then the energies may, for example, be computed according to the
formulae:

Edir^(SMi) = (1 - ψi) |Pi|²,
Ediff^(SMi) = ψi |Pi|².

The energy of diffuse sound should be equal in all positions; therefore, an estimate of the
diffuse sound energy at the virtual microphone can be computed simply by averaging
Ediff^(SM1) to Ediff^(SMN), e.g. in a diffuseness combination unit 820, for example,
according to the formula:

Ediff^(VM) = (1/N) Σ_{i=1..N} Ediff^(SMi).
A more effective combination of the estimates Ediff^(SM1) to Ediff^(SMN) could be
carried out by considering the variance of the estimators, for instance, by considering the
SNR.
The energy of the direct sound depends on the distance to the source due to the
propagation. Therefore, Edir^(SM1) to Edir^(SMN) may be modified to take this into
account. This may be carried out, e.g., by a direct sound propagation adjustment unit 830.
For example, if it is assumed that the energy of the direct sound field decays with 1 over
the distance squared, then the estimate for the direct sound at the virtual microphone for
the i-th spatial microphone may be calculated according to the formula:

Edir^(VM, i) = (d(SMi, IPLS) / d(VM, IPLS))² Edir^(SMi),

where d(SMi, IPLS) denotes the distance between the i-th spatial microphone and the
position of the IPLS, and d(VM, IPLS) denotes the distance between the virtual
microphone and the IPLS.
Similarly to the diffuseness combination unit 820, the estimates of the direct sound energy
obtained at the different spatial microphones can be combined, e.g. by a direct sound
combination unit 840. The result is Edir^(VM), e.g., the estimate for the direct sound
energy at the virtual microphone. The diffuseness at the virtual microphone ψ^(VM) may
be computed, for example, by a diffuseness sub-calculator 850, e.g. according to the
formula:

ψ^(VM) = Ediff^(VM) / (Ediff^(VM) + Edir^(VM)).
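The chain of Fig. 24 (units 810 to 850) can be summarized in a short sketch; the plain averaging in the combination units 820 and 840 and all function and variable names are illustrative assumptions, not part of the embodiment:

```python
import numpy as np

def vm_diffuseness(P, psi, mic_pos, vm_pos, ipls_pos):
    """Estimate the diffuseness at the virtual microphone from N spatial
    microphones, following units 810-850 of Fig. 24.

    P        : complex pressures P_1 ... P_N at the spatial microphones
    psi      : diffuseness values psi_1 ... psi_N at the spatial microphones
    mic_pos  : (N, 3) array of spatial microphone positions
    vm_pos   : (3,) position of the virtual microphone
    ipls_pos : (3,) estimated position of the sound source (IPLS)
    """
    P, psi = np.asarray(P), np.asarray(psi)
    e_dir = (1.0 - psi) * np.abs(P) ** 2              # energy analysis (810)
    e_diff = psi * np.abs(P) ** 2
    e_diff_vm = np.mean(e_diff)                       # diffuse combination (820)
    d_mic = np.linalg.norm(np.asarray(mic_pos) - np.asarray(ipls_pos), axis=1)
    d_vm = np.linalg.norm(np.asarray(vm_pos) - np.asarray(ipls_pos))
    e_dir_vm = np.mean((d_mic / d_vm) ** 2 * e_dir)   # units 830 and 840
    return e_diff_vm / (e_diff_vm + e_dir_vm)         # sub-calculator (850)
```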
As mentioned above, in some cases the sound events position estimation carried out by a
sound events position estimator fails, e.g., in the case of a wrong direction of arrival
estimation. Fig. 25 illustrates such a scenario. In these cases, regardless of the diffuseness
parameters estimated at the different spatial microphones and received as inputs 111 to
11N, the diffuseness for the virtual microphone 103 may be set to 1 (i.e., fully diffuse), as
no spatially coherent reproduction is possible.
Additionally, the reliability of the DOA estimates at the N spatial microphones may be
considered. This may be expressed, e.g., in terms of the variance of the DOA estimator or
the SNR. Such information may be taken into account by the diffuseness sub-calculator
850, so that the VM diffuseness 103 can be artificially increased in case the DOA
estimates are unreliable. In fact, as a consequence, the position estimates 205 will also be
unreliable.
Fig. 1 illustrates an apparatus 150 for generating at least one audio output signal based on
an audio data stream comprising audio data relating to one or more sound sources
according to an embodiment.
The apparatus 150 comprises a receiver 160 for receiving the audio data stream comprising
the audio data. The audio data comprises one or more pressure values for each one of the
one or more sound sources. Furthermore, the audio data comprises one or more position
values indicating a position of one of the sound sources for each one of the sound sources.
Moreover, the apparatus comprises a synthesis module 170 for generating the at least one
audio output signal based on at least one of the one or more pressure values of the audio
data of the audio data stream and based on at least one of the one or more position values
of the audio data of the audio data stream. The audio data is defined for a time-frequency
bin of a plurality of time-frequency bins. For each one of the sound sources, at least one
pressure value is comprised in the audio data, wherein the at least one pressure value may
be a pressure value relating to an emitted sound wave, e.g. originating from the sound
source. The pressure value may be a value of an audio signal, for example, a pressure value
of an audio output signal generated by an apparatus for generating an audio output signal
of a virtual microphone, wherein the virtual microphone is placed at the position of the
sound source.
Thus, Fig. 1 illustrates an apparatus 150 that may be employed for receiving or processing
the mentioned audio data stream, i.e. the apparatus 150 may be employed on a
receiver/synthesis side. The audio data stream comprises audio data which comprises one
or more pressure values and one or more position values for each one of a plurality of
sound sources, i.e. each one of the pressure values and the position values relates to a
particular sound source of the one or more sound sources of the recorded audio scene. This
means that the position values indicate positions of sound sources instead of the recording
microphones. With respect to the pressure values, this means that the audio data stream
comprises one or more pressure values for each one of the sound sources, i.e. the pressure
values indicate an audio signal which is related to a sound source instead of being related
to a recording of a real spatial microphone.
According to an embodiment, the receiver 160 may be adapted to receive the audio data
stream comprising the audio data, wherein the audio data furthermore comprises one or
more diffuseness values for each one of the sound sources. The synthesis module 170 may
be adapted to generate the at least one audio output signal based on at least one of the one
or more diffuseness values.
Fig. 2 illustrates an apparatus 200 for generating an audio data stream comprising sound
source data relating to one or more sound sources according to an embodiment. The
apparatus 200 for generating an audio data stream comprises a determiner 210 for
determining the sound source data based on at least one audio input signal recorded by at
least one spatial microphone and based on audio side information provided by at least two
spatial microphones. Furthermore, the apparatus 200 comprises a data stream generator
220 for generating the audio data stream such that the audio data stream comprises the
sound source data. The sound source data comprises one or more pressure values for each
one of the sound sources. Moreover, the sound source data furthermore comprises one or
more position values indicating a sound source position for each one of the sound sources.
Furthermore, the sound source data is defined for a time-frequency bin of a plurality of
time-frequency bins.
The audio data stream generated by the apparatus 200 may then be transmitted. Thus, the
apparatus 200 may be employed on an analysis/transmitter side. The audio data stream
comprises audio data which comprises one or more pressure values and one or more
position values for each one of a plurality of sound sources, i.e. each one of the pressure
values and the position values relates to a particular sound source of the one or more sound
sources of the recorded audio scene. This means that with respect to the position values,
the position values indicate positions of sound sources instead of the recording
microphones.
In a further embodiment, the determiner 210 may be adapted to determine the sound source
data based on diffuseness information provided by at least one spatial microphone. The
data stream generator 220 may be adapted to generate the audio data stream such that the
audio data stream comprises the sound source data. In this case, the sound source data
furthermore comprises one or more diffuseness values for each one of the sound sources.
Fig. 3a illustrates an audio data stream according to an embodiment. The audio data stream
comprises audio data relating to two sound sources being active in one time-frequency bin.
In particular, Fig. 3a illustrates the audio data that is transmitted for a time-frequency bin
(k, n), wherein k denotes the frequency index and n denotes the time index. The audio data
comprises a pressure value P1, a position value Q1 and a diffuseness value ψ1 of a first
sound source. The position value Q1 comprises three coordinate values X1, Y1 and Z1
indicating the position of the first sound source. Furthermore, the audio data comprises a
pressure value P2, a position value Q2 and a diffuseness value ψ2 of a second sound
source. The position value Q2 comprises three coordinate values X2, Y2 and Z2 indicating
the position of the second sound source.
Fig. 3b illustrates an audio stream according to another embodiment. Again, the audio data
comprises a pressure value P1, a position value Q1 and a diffuseness value ψ1 of a first
sound source. The position value Q1 comprises three coordinate values X1, Y1 and Z1
indicating the position of the first sound source. Furthermore, the audio data comprises a
pressure value P2, a position value Q2 and a diffuseness value ψ2 of a second sound
source. The position value Q2 comprises three coordinate values X2, Y2 and Z2 indicating
the position of the second sound source.
Fig. 3c provides another illustration of the audio data stream. As the audio data stream
provides geometry-based spatial audio coding (GAC) information, it is also referred to as
"geometry-based spatial audio coding stream" or "GAC stream". The audio data stream
comprises information which relates to the one or more sound sources, e.g. one or more
isotropic point-like sources (IPLS). As already explained above, the GAC stream may
comprise the following signals, wherein k and n denote the frequency index and the time
index of the considered time-frequency bin:

•	P(k, n): Complex pressure at the sound source, e.g. at the IPLS. This signal possibly
comprises direct sound (the sound originating from the IPLS itself) and diffuse
sound.

•	Q(k, n): Position (e.g. Cartesian coordinates in 3D) of the sound source, e.g. of the
IPLS. The position may, for example, comprise Cartesian coordinates X(k, n),
Y(k, n), Z(k, n).

•	ψ(k, n): Diffuseness at the IPLS. This parameter is related to the power ratio of
direct to diffuse sound comprised in P(k, n). If P(k, n) = Pdir(k, n) + Pdiff(k, n), then
one possibility to express diffuseness is ψ(k, n) = |Pdiff(k, n)|² / |P(k, n)|². If |P(k, n)|²
is known, other equivalent representations are conceivable, for example, the Direct
to Diffuse Ratio (DDR) Γ = |Pdir(k, n)|² / |Pdiff(k, n)|².
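Under the additional assumption that the direct and diffuse sound components are mutually uncorrelated, so that |P(k, n)|² = |Pdir(k, n)|² + |Pdiff(k, n)|², the two representations convert into one another via

Γ(k, n) = (1 - ψ(k, n)) / ψ(k, n),  and equivalently  ψ(k, n) = 1 / (1 + Γ(k, n)).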
As already stated, k and n denote the frequency and time indices, respectively. If desired
and if the analysis allows it, more than one IPLS can be represented in a given
time-frequency slot. This is depicted in Fig. 3c as M multiple layers, so that the pressure
signal for the i-th layer (i.e., for the i-th IPLS) is denoted by Pi(k, n). For convenience, the
position of the i-th IPLS can be expressed as the vector Qi(k, n) = [Xi(k, n), Yi(k, n),
Zi(k, n)]^T. Differently from the state of the art, all parameters in the GAC stream are
expressed with respect to the one or more sound sources, e.g. with respect to the IPLS,
thus achieving independence from the recording position. In Fig. 3c, as well as in Figs. 3a
and 3b, all quantities in the figure are considered in the time-frequency domain; the (k, n)
notation was omitted for reasons of simplicity, for example, Pi means Pi(k, n).
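The layered stream layout of Figs. 3a to 3c can be pictured with a small data structure; the type names and the example values below are illustrative assumptions, as the actual bit-level format is not specified here:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GacLayer:
    """One layer (one IPLS) of a GAC stream for a time-frequency bin (k, n)."""
    pressure: complex                     # P_i(k, n), direct plus diffuse sound
    position: Tuple[float, float, float]  # Q_i(k, n) = (X_i, Y_i, Z_i)
    diffuseness: float                    # psi_i(k, n), with 0 <= psi_i <= 1

@dataclass
class GacBin:
    """All M layers transmitted for one time-frequency bin."""
    k: int                                # frequency index
    n: int                                # time index
    layers: List[GacLayer]                # one entry per active sound source

# The two-source case of Fig. 3a, with arbitrary example values:
example_bin = GacBin(k=3, n=7, layers=[
    GacLayer(pressure=0.8 + 0.1j, position=(1.0, 2.0, 0.5), diffuseness=0.2),
    GacLayer(pressure=0.3 - 0.4j, position=(-1.5, 0.0, 1.2), diffuseness=0.6),
])
```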
In the following, an apparatus for generating an audio data stream according to an
embodiment is explained in more detail. Like the apparatus of Fig. 2, the apparatus of
Fig. 4 comprises a determiner 210 and a data stream generator 220, which may be similar
to those of Fig. 2. As the determiner analyzes the audio input data to determine the sound
source data based on which the data stream generator generates the audio data stream, the
determiner and the data stream generator may together be referred to as an "analysis
module" (see analysis module 410 in Fig. 4).
The analysis module 410 computes the GAC stream from the recordings of the N spatial
microphones. Depending on the number M of layers desired (e.g. the number of sound
sources for which information shall be comprised in the audio data stream for a particular
time-frequency bin), the type and number N of spatial microphones, different methods for
the analysis are conceivable. A few examples are given in the following.
As a first example, parameter estimation for one sound source, e.g. one IPLS, per
time-frequency slot is considered. In the case of M = 1, the GAC stream can be readily obtained
with the concepts explained above for the apparatus for generating an audio output signal
of a virtual microphone, in that a virtual spatial microphone can be placed in the position
of the sound source, e.g. in the position of the IPLS. This allows the pressure signals to be
calculated at the position of the IPLS, together with the corresponding position estimates,
and possibly the diffuseness. These three parameters are grouped together in a GAC stream
and can be further manipulated by module 102 in Fig. 8 before being transmitted or stored.
For example, the determiner may determine the position of a sound source by employing
the concepts proposed for the sound events position estimation of the apparatus for
generating an audio output signal of a virtual microphone. Moreover, the determiner may
comprise an apparatus for generating an audio output signal and may use the determined
position of the sound source as the position of the virtual microphone to calculate the
pressure values (e.g. the values of the audio output signal to be generated) and the
diffuseness at the position of the sound source.
In particular, the determiner 210 (e.g., in Fig. 4) is configured to determine the pressure
signals, the corresponding position estimates, and the corresponding diffuseness, while the
data stream generator 220 is configured to generate the audio data stream based on the
calculated pressure signals, position estimates and diffuseness.
As another example, parameter estimation for 2 sound sources, e.g. 2 IPLS, per
time-frequency slot is considered. If the analysis module 410 is to estimate two sound sources
per time-frequency bin, then the following concept based on state-of-the-art estimators can
be used.
Fig. 5 illustrates a sound scene composed of two sound sources and two uniform linear
microphone arrays. Reference is made to ESPRIT, see
[26] R. Roy and T. Kailath. ESPRIT - estimation of signal parameters via rotational
invariance techniques. IEEE Transactions on Acoustics, Speech and Signal
Processing, 37(7):984-995, July 1989.
ESPRIT ([26]) can be employed separately at each array to obtain two DOA estimates for
each time-frequency bin at each array. Due to a pairing ambiguity, this leads to two
possible solutions for the position of the sources. As can be seen from Fig. 5, the two
possible solutions are given by (1, 2) and (1', 2'). In order to resolve this ambiguity, the
following solution can be applied. The signal emitted at each source is estimated by using a
beamformer oriented in the direction of the estimated source positions and applying a
proper factor to compensate for the propagation (e.g., multiplying by the inverse of the
attenuation experienced by the wave). This can be carried out for each source at each array
for each of the possible solutions. We can then define an estimation error for each pair of
sources (i, j) as:

E_{i,j} = |P_{i,1} - P_{i,2}| + |P_{j,1} - P_{j,2}|,    (1)

where (i, j) ∈ {(1, 2), (1', 2')} (see Fig. 5) and P_{i,r} stands for the compensated signal
power seen by array r from sound source i. The error is minimal for the true sound source
pair.
Once the pairing issue is solved and the correct DOA estimates are computed, these are
grouped, together with the corresponding pressure signals and diffuseness estimates, into a
GAC stream. The pressure signals and diffuseness estimates can be obtained using the
same method already described for the parameter estimation for one sound source.
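The disambiguation then amounts to evaluating the error of formula (1) for both candidate pairings and keeping the smaller one, as the following sketch shows; storing the compensated powers in a dictionary is an illustrative choice:

```python
def resolve_pairing(power):
    """Resolve the ESPRIT pairing ambiguity of Fig. 5 by minimizing the
    estimation error E_{i,j} of formula (1).

    power : dict mapping (source_label, array_index) to the propagation-
            compensated signal power of that source as seen by that array.
    """
    candidates = [("1", "2"), ("1'", "2'")]    # the two possible solutions
    def pairing_error(pair):
        i, j = pair
        return (abs(power[(i, 1)] - power[(i, 2)])
                + abs(power[(j, 1)] - power[(j, 2)]))
    return min(candidates, key=pairing_error)  # minimal for the true pair
```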
Fig. 6a illustrates an apparatus 600 for generating at least one audio output signal based on
an audio data stream according to an embodiment. The apparatus 600 comprises a receiver
610 and a synthesis module 620. The receiver 610 comprises a modification module 630
for modifying the audio data of the received audio data stream by modifying at least one of
the pressure values of the audio data, at least one of the position values of the audio data or
at least one of the diffuseness values of the audio data relating to at least one of the sound
sources.
Fig. 6b illustrates an apparatus 660 for generating an audio data stream comprising sound
source data relating to one or more sound sources according to an embodiment. The
apparatus for generating an audio data stream comprises a determiner 670, a data stream
generator 680 and furthermore a modification module 690 for modifying the audio data
stream generated by the data stream generator by modifying at least one of the pressure
values of the audio data, at least one of the position values of the audio data or at least one
of the diffuseness values of the audio data relating to at least one of the sound sources.

While the modification module 630 of Fig. 6a is employed on a receiver/synthesis side, the
modification module 690 of Fig. 6b is employed on a transmitter/analysis side.

The modifications of the audio data stream conducted by the modification modules 630,
690 may also be considered as modifications of the sound scene. Thus, the modification
modules 630, 690 may also be referred to as sound scene manipulation modules.
The sound field representation provided by the GAC stream allows different kinds of
modifications of the audio data stream, i.e., as a consequence, manipulations of the sound
scene. Some examples in this context are:

1. Expanding arbitrary sections of space/volumes in the sound scene (e.g. expansion
of a point-like sound source in order to make it appear wider to the listener);

2. Transforming a selected section of space/volume to any other arbitrary section of
space/volume in the sound scene (the transformed space/volume could, e.g., contain
a source that is required to be moved to a new location);

3. Position-based filtering, where selected regions of the sound scene are enhanced or
partially/completely suppressed.
In the following, a layer of an audio data stream, e.g. a GAC stream, is assumed to
comprise all audio data of one of the sound sources with respect to a particular
time-frequency bin.
Fig. 7 depicts a modification module according to an embodiment. The modification unit
of Fig. 7 comprises a demultiplexer 401, a manipulation processor 420 and a multiplexer
405.
The demultiplexer 401 is configured to separate the different layers of the M-layer GAC
stream and form M single-layer GAC streams. Moreover, the manipulation processor 420
comprises units 402, 403 and 404, which are applied on each of the GAC streams
separately. Furthermore, the multiplexer 405 is configured to form the resulting M-layer
GAC stream from the manipulated single-layer GAC streams.
Based on the position data from the GAC stream and the knowledge about the position of
the real sources (e.g. talkers), the energy can be associated with a certain real source for
every time-frequency bin. The pressure values P are then weighted accordingly to modify
the loudness of the respective real source (e.g. talker). This requires a priori information or
an estimate of the location of the real sound sources (e.g. talkers).
In some embodiments, if knowledge about the position of the real sources is available, then
based on the position data from the GAC stream, the energy can be associated with a
certain real source for every time-frequency bin.
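A sketch of this manipulation follows; the nearest-neighbour association rule and all names are assumptions, as the text leaves the exact association method open:

```python
import numpy as np

def weight_layer_by_talker(p, q, talker_pos, talker_gain):
    """Weight the pressure value P of one GAC layer according to which
    real source (talker) its position estimate Q is associated with.

    p           : complex pressure P(k, n) of the layer
    q           : (3,) position estimate Q(k, n) of the layer
    talker_pos  : (T, 3) a priori positions of the real talkers
    talker_gain : length-T loudness weights, one per talker
    """
    dists = np.linalg.norm(np.asarray(talker_pos) - np.asarray(q), axis=1)
    nearest = int(np.argmin(dists))     # associate the bin with this talker
    return talker_gain[nearest] * p     # weighted pressure value
```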
The manipulation of the audio data stream, e.g. the GAC stream, can take place at the
modification module 630 of the apparatus 600 for generating at least one audio output
signal of Fig. 6a, i.e. at a receiver/synthesis side, and/or at the modification module 690 of
the apparatus 660 for generating an audio data stream of Fig. 6b, i.e. at a
transmitter/analysis side.
For example, the audio data stream, i.e. the GAC stream, can be modified prior to
transmission, or before the synthesis after transmission.
Unlike the modification module 630 of Fig. 6a at the receiver/synthesis side, the
modification module 690 of Fig. 6b at the transmitter/analysis side may exploit the
additional information from the inputs 111 to 11N (the recorded signals) and 121 to 12N
(relative position and orientation of the spatial microphones), as this information is
available at the transmitter side. Using this information, a modification unit according to an
alternative embodiment can be realized, which is depicted in Fig. 8.
Fig. 9 depicts an embodiment by illustrating a schematic overview of a system, wherein a
GAC stream is generated on a transmitter/analysis side, where, optionally, the GAC stream
may be modified by a modification module 102 at the transmitter/analysis side, where the
GAC stream may, optionally, be modified at a receiver/synthesis side by a modification
module 103, and wherein the GAC stream is used to generate a plurality of audio output
signals 191 ... 19L.

At the transmitter/analysis side, the sound field representation (e.g., the GAC stream) is
computed in unit 101 from the inputs 111 to 11N, i.e., the signals recorded with N ≥ 2
spatial microphones, and from the inputs 121 to 12N, i.e., the relative position and
orientation of the spatial microphones.
The output of unit 101 is the aforementioned sound field representation, which in the
following is denoted as the Geometry-based spatial Audio Coding (GAC) stream. Similarly
to the proposal in

[20] Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E. A. P. Habets.
Generating virtual microphone signals using geometrical information gathered by
distributed arrays. In Third Joint Workshop on Hands-free Speech Communication
and Microphone Arrays (HSCMA '11), Edinburgh, United Kingdom, May 2011.
and as described for the apparatus for generating an audio output signal of a virtual
microphone at a configurable virtual position, a complex sound scene is modeled by means
of sound sources, e.g. isotropic point-like sound sources (IPLS), which are active at
specific slots in a time-frequency representation, such as the one provided by the Short-
Time Fourier Transform (STFT).
The GAC stream may be further processed in the optional modification module 102, which
may also be referred to as a manipulation unit. The modification module 102 allows for a
multitude of applications. The GAC stream can then be transmitted or stored. The
parametric nature of the GAC stream is highly efficient. At the synthesis/receiver side, one
or more optional modification modules (manipulation units) 103 can be employed. The
resulting GAC stream enters the synthesis unit 104 which generates the loudspeaker
signals. Given the independence of the representation from the recording, the end user at
the reproduction side can potentially manipulate the sound scene and decide the listening
position and orientation within the sound scene freely.
The modification/manipulation of the audio data stream, e.g., the GAC stream, can take
place at the modification modules 102 and/or 103 in Fig. 9, by modifying the GAC stream
accordingly either prior to transmission in module 102 or after the transmission before the
synthesis in module 103. Unlike the modification module 103 at the receiver/synthesis
side, the modification module 102 at the transmitter/analysis side may exploit the
additional information from the inputs 111 to 11N (the audio data provided by the spatial
microphones) and 121 to 12N (relative position and orientation of the spatial
microphones), as this information is available at the transmitter side. Fig. 8 illustrates an
alternative embodiment of a modification module which employs this information.

Examples of different concepts for the manipulation of the GAC stream are described in
the following with reference to Fig. 7 and Fig. 8. Units with equal reference signs have
equal functions.
1. Volume Expansion
It is assumed that a certain energy in the scene is located within a volume V. The volume V
may indicate a predefined area of an environment. Θ denotes the set of time-frequency bins
(k, n) for which the corresponding sound sources, e.g. IPLS, are localized within the
volume V.

If expansion of the volume V to another volume V' is desired, this can be achieved by
adding a random term to the position data in the GAC stream whenever (k, n) ∈ Θ
(evaluated in the decision units 403) and substituting Q(k, n) = [X(k, n), Y(k, n), Z(k, n)]^T
(the layer index is dropped for simplicity) such that the outputs 431 to 43M of the units 404
in Figs. 7 and 8 become
Q(k, n) = [X(k, n) +