Apparatus And Method For Decomposing An Input Signal Using A Pre Calculated Reference Curve
Abstract:
An apparatus for decomposing a signal having a number of at least three channels comprises an analyzer (16) for analyzing a similarity between two channels of an analysis signal that is related to the signal and has at least two analysis channels, wherein the analyzer is configured for using a pre-calculated frequency-dependent similarity curve as a reference curve to determine the analysis result. A signal processor (20) processes the analysis signal, or a signal derived from the analysis signal, or a signal from which the analysis signal is derived, using the analysis result, to obtain a decomposed signal.
Specification
Apparatus and Method for Decomposing an Input Signal Using a Pre-Calculated
Reference Curve
The present invention relates to audio processing and, in particular, to audio signal
decomposition into different components such as perceptually distinct components.
The human auditory system senses sound from all directions. The perceived auditory (the
adjective auditory denotes what is perceived, while the word sound will be used to
describe physical phenomena) environment creates an impression of the acoustic properties
of the surrounding space and the occurring sound events. The auditory impression
perceived in a specific sound field can (at least partially) be modeled considering three
different types of signals at the ear entrances: the direct sound, early reflections, and
diffuse reflections. These signals contribute to the formation of a perceived auditory spatial
image.
Direct sound denotes the waves of each sound event that first reach the listener directly
from a sound source without disturbances. It is characteristic for the sound source and
provides the least-compromised information about the direction of incidence of the sound
event. The primary cues for estimating the direction of a sound source in the horizontal
plane are differences between the left and right ear input signals, namely interaural time
differences (ITDs) and interaural level differences (ILDs). Subsequently, a multitude of
reflections of the direct sound arrive at the ears from different directions and with different
relative time delays and levels. With increasing time delay, relative to the direct sound, the
density of the reflections increases until they constitute a statistical clutter.
The reflected sound contributes to distance perception, and to the auditory spatial
impression, which is composed of at least two components: apparent source width (ASW;
another commonly used term for ASW is auditory spaciousness) and listener
envelopment (LEV). ASW is defined as a broadening of the apparent width of a sound
source and is primarily determined by early lateral reflections. LEV refers to the listener's
sense of being enveloped by sound and is determined primarily by late-arriving reflections.
The goal of electroacoustic stereophonic sound reproduction is to evoke the perception of a
pleasing auditory spatial image. This can have a natural or architectural reference (e.g. the
recording of a concert in a hall), or it may be a sound field that is not existent in reality
(e.g. electroacoustic music).
From the field of concert hall acoustics, it is well known that - to obtain a subjectively
pleasing sound field - a strong sense of auditory spatial impression is important, with LEV
being an integral part. The ability of loudspeaker setups to reproduce an enveloping sound
field by means of reproducing a diffuse sound field is of interest. In a synthetic sound field
it is not possible to reproduce all naturally occurring reflections using dedicated
transducers. That is especially true for diffuse later reflections. The timing and level
properties of diffuse reflections can be simulated by using "reverberated" signals as
loudspeaker feeds. If those are sufficiently uncorrelated, the number and location of the
loudspeakers used for playback determines if the sound field is perceived as being diffuse.
The goal is to evoke the perception of a continuous, diffuse sound field using only a
discrete number of transducers. That is, creating sound fields where no direction of sound
arrival can be estimated and especially no single transducer can be localized. The
subjective diffuseness of synthetic sound fields can be evaluated in subjective tests.
Stereophonic sound reproductions aim at evoking the perception of a continuous sound
field using only a discrete number of transducers. The features desired the most are
directional stability of localized sources and realistic rendering of the surrounding auditory
environment. The majority of formats used today to store or transport stereophonic
recordings are channel-based. Each channel conveys a signal that is intended to be played
back over an associated loudspeaker at a specific position. A specific auditory image is
designed during the recording or mixing process. This image is accurately recreated if the
loudspeaker setup used for reproduction resembles the target setup that the recording was
designed for.
The number of feasible transmission and playback channels constantly grows and with
every emerging audio reproduction format comes the desire to render legacy format
content over the actual playback system. Upmix algorithms are a solution to this desire,
computing a signal with more channels from a legacy signal. A number of stereo upmix
algorithms have been proposed in the literature, e.g. Carlos Avendano and Jean-Marc Jot,
"A frequency-domain approach to multichannel upmix", Journal of the Audio Engineering
Society, vol. 52, no. 7/8, pp. 740-749, 2004; Christof Faller, "Multiple-loudspeaker
playback of stereo signals," Journal of the Audio Engineering Society, vol. 54, no. 11, pp.
1051-1064, November 2006; John Usher and Jacob Benesty, "Enhancement of spatial
sound quality: A new reverberation-extraction audio upmixer," IEEE Transactions on
Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2141-2150, September
2007. Most of these algorithms are based on a direct/ambient signal decomposition
followed by rendering adapted to the target loudspeaker setup.
The described direct/ambient signal decompositions are not readily applicable to
multi-channel surround signals. It is not easy to formulate a signal model and filtering to obtain
from N audio channels the corresponding N direct sound and N ambient sound channels.
The simple signal model used in the stereo case, see e.g. Christof Faller, "Multiple-loudspeaker
playback of stereo signals," Journal of the Audio Engineering Society, vol. 54,
no. 11, pp. 1051-1064, November 2006, assuming direct sound to be correlated amongst all
channels, does not capture the diversity of channel relations that can exist between
surround signal channels.
The general goal of stereophonic sound reproduction is to evoke the perception of a
continuous sound field using only a limited number of transmission channels and
transducers. Two loudspeakers are the minimum requirement for spatial sound
reproduction. Modern consumer systems often offer a larger number of reproduction
channels. Basically, stereophonic signals (independent of the number of channels) are
recorded or mixed such that for each source the direct sound goes coherent (= dependent)
into a number of channels with specific directional cues and reflected independent sounds
go into a number of channels determining cues for apparent source width and listener
envelopment. Correct perception of the intended auditory image is usually only possible in
the ideal point of observation in the playback setup the recording was intended for. Adding
more speakers to a given loudspeaker setup usually enables a more realistic
reconstruction/simulation of a natural sound field. To use the full advantage of an extended
loudspeaker setup if the input signals are given in another format, or to manipulate the
perceptually distinct parts of the input signal, those have to be separately accessible. This
specification describes a method to separate the dependent and independent components of
stereophonic recordings comprising an arbitrary number of input channels below.
A decomposition of audio signals into perceptually distinct components is necessary for
high quality signal modification, enhancement, adaptive playback, and perceptual coding.
A number of methods have recently been proposed that allow the manipulation and/or
extraction of perceptually distinct signal components from two-channel input signals.
Since input signals with more than two channels become more and more common, the
described manipulations are desirable also for multichannel input signals. However, most
of the concepts described for two-channel input cannot easily be extended to work with
input signals with an arbitrary number of channels.
If one were to perform a signal analysis into direct and ambience parts with, for example, a
5.1 channel surround signal having a left channel, a center channel, a right channel, a left
surround channel, a right surround channel and a low-frequency enhancement (subwoofer),
it is not straight-forward how one should apply a direct/ambience signal analysis. One
might think of comparing each pair of the six channels resulting in a hierarchical
processing which has, in the end, up to 15 different comparison operations. Then, when all
of these 15 comparison operations have been done, where each channel has been compared
to every other channel, one would have to determine how one should evaluate the 15
results. This is time consuming, the results are hard to interpret, and due to the
considerable amount of processing resources, not usable for e.g. real-time applications of
direct/ambience separation or, generally, signal decompositions which may be, for
example, used in the context of upmix or any other audio processing operations.
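The figure of 15 comparison operations follows from counting the unordered pairs among six channels. A minimal sketch of that count, with channel labels chosen here only for illustration:

```python
from itertools import combinations

# Channel labels of a 5.1 signal (illustrative names, not from the specification).
channels = ["L", "C", "R", "Ls", "Rs", "LFE"]

# Each unordered pair of distinct channels is one comparison operation.
pairs = list(combinations(channels, 2))
num_pairs = len(pairs)

print(num_pairs)  # 6 * 5 / 2 = 15
```

The count grows quadratically with the number of channels, which is the cost the pair-wise approach incurs.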
In M. M. Goodwin and J.-M. Jot, "Primary-ambient signal decomposition and vector-based
localization for spatial audio coding and enhancement," in Proc. of ICASSP 2007, 2007, a
principal component analysis is applied to the input channel signals to perform the primary
(= direct) and ambient signal decomposition.
The models used in Christof Faller, "Multiple-loudspeaker playback of stereo signals,"
Journal of the Audio Engineering Society, vol. 54, no. 11, pp. 1051-1064, November 2006
and C. Faller, "A highly directive 2-capsule based microphone system," in Preprint 123rd
Conv. Aud. Eng. Soc., Oct. 2007 assume de-correlated or partially correlated diffuse sound
in stereo and microphone signals, respectively. They derive filters for extracting
diffuse/ambient signal given this assumption. These approaches are limited to single and
two channel audio signals.
A further reference is C. Avendano and J.-M. Jot, "A frequency-domain approach to
multichannel upmix", Journal of the Audio Engineering Society, vol. 52, no. 7/8, pp. 740-
749, 2004. The reference M. M. Goodwin and J.-M. Jot, "Primary-ambient signal
decomposition and vector-based localization for spatial audio coding and enhancement," in
Proc. of ICASSP 2007, 2007, comments on the Avendano, Jot reference as follows. The
reference provides an approach which involves creating a time-frequency mask to extract
the ambience from a stereo input signal. The mask is based on the cross-correlation
between the left and right channel signals, however, so this approach is not immediately
applicable to the problem of extracting ambience from an arbitrary multichannel input. To
use any such correlation-based method in this higher-order case would call for a
hierarchical pairwise correlation analysis, which would entail a significant computational
cost, or some alternate measure of multichannel correlation.
Spatial Impulse Response Rendering (SIRR) (Juha Merimaa and Ville Pulkki, "Spatial
impulse response rendering", in Proc. of the 7th Int. Conf. on Digital Audio Effects
(DAFx'04), 2004) estimates the direct sound with direction and diffuse sound in B-Format
impulse responses. Very similar to SIRR, Directional Audio Coding (DirAC) (Ville Pulkki,
"Spatial sound reproduction with directional audio coding," Journal of the Audio
Engineering Society, vol. 55, no. 6, pp. 503-516, June 2007) implements similar direct and
diffuse sound analysis to B-Format continuous audio signals.
The approach presented in Julia Jakka, Binaural to Multichannel Audio Upmix, Ph.D.
thesis, Master's Thesis, Helsinki University of Technology, 2005 describes an upmix using
binaural signals as input.
The reference Boaz Rafaely, "Spatially Optimal Wiener Filtering in a Reverberant Sound
Field," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2001,
October 21 to 24, 2001, New Paltz, New York, describes the derivation of Wiener filters
which are spatially optimal for reverberant sound fields. An application to two-microphone
noise cancellation in reverberant rooms is given. The optimal filters which are derived
from the spatial correlation of diffuse sound fields capture the local behavior of the sound
fields and are therefore of lower order and potentially more spatially robust than
conventional adaptive noise cancellation filters in reverberant rooms. Formulations for
unconstrained and causally constrained optimal filters are presented and an example
application to a two-microphone speech enhancement is demonstrated using a computer
simulation.
While the Wiener-filtering approach can provide useful results for noise cancellation in
reverberant rooms, it can be computationally inefficient and it is, for some instances, not so
useful for signal decomposition.
It is the object of the present invention to provide an improved concept for decomposing an
input signal.
This object is achieved by an apparatus for decomposing an input signal in accordance
with claim 1, a method of decomposing an input signal in accordance with claim 14 or a
computer program in accordance with claim 15.
The present invention is based on the finding that a particular efficiency for the purpose of
signal decomposition is obtained when the signal analysis is performed based on a
pre-calculated frequency-dependent similarity curve as a reference curve. The term similarity
includes the correlation and the coherence, where, in a strict mathematical sense, the
correlation is calculated between two signals without an additional time shift and the
coherence is calculated by shifting the two signals in time/phase so that the signals have a
maximum correlation and the actual correlation over frequency is then calculated with the
time/phase shift applied. For this text, similarity, correlation and coherence are considered
to mean the same, i.e., a quantitative degree of similarity between two signals, e.g., where
a higher absolute value of the similarity means that the two signals are more similar and a
lower absolute value of the similarity means that the two signals are less similar.
It has been shown that the usage of such a similarity curve as a reference curve allows a
very efficiently implementable analysis, since the curve can be used for straightforward
comparison operations and/or weighting factor calculations. The use of a pre-calculated
frequency-dependent similarity curve makes it possible to perform only simple calculations
rather than more complex Wiener filtering operations. Furthermore, the application of the
frequency-dependent similarity curve is particularly useful due to the fact that the problem
is not addressed from a statistical point of view but is addressed in a more analytic way,
since as much information as possible from the current setup is introduced so as to obtain a
solution to the problem. Additionally, the flexibility of this procedure is very high, since
the reference curve can be obtained by many different ways. One way is to actually
measure the two or more signals in a certain setup and to then calculate the similarity curve
over frequency from the measured signals. To this end, one may emit independent signals
from different speakers, or signals having a certain, pre-known degree of dependency.
The other preferred alternative is to simply calculate the similarity curve under the
assumption of independent signals. In this case, no signals are actually necessary,
since the result is signal-independent.
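As a rough illustration of the measurement route, the following sketch estimates a similarity value per frequency bin from two signals by averaging the cross-spectrum over frames and normalizing by the channel powers. The function name, frame length, window and normalization are assumptions made for this example; the specification does not prescribe them, and a real reference curve would additionally reflect the loudspeaker setup and the propagation to the analysis channels.

```python
import numpy as np

def similarity_curve(a, b, frame_len=256):
    """Estimate a frequency-dependent similarity between two equally long
    signals (hypothetical helper, not taken from the specification)."""
    n_frames = len(a) // frame_len
    window = np.hanning(frame_len)
    # Cut both signals into non-overlapping frames and transform each frame.
    fa = a[:n_frames * frame_len].reshape(n_frames, frame_len) * window
    fb = b[:n_frames * frame_len].reshape(n_frames, frame_len) * window
    A = np.fft.rfft(fa, axis=1)
    B = np.fft.rfft(fb, axis=1)
    # Average cross-spectrum and auto-spectra over frames, per frequency bin.
    cross = np.mean(A * np.conj(B), axis=0)
    power_a = np.mean(np.abs(A) ** 2, axis=0)
    power_b = np.mean(np.abs(B) ** 2, axis=0)
    eps = 1e-12  # guards against division by zero in silent bins
    return np.abs(cross) / np.sqrt(power_a * power_b + eps)

rng = np.random.default_rng(0)
s1 = rng.standard_normal(256 * 200)
s2 = rng.standard_normal(256 * 200)

curve_independent = similarity_curve(s1, s2)  # low values in every bin
curve_identical = similarity_curve(s1, s1)    # values close to one
```

In this toy case the independent signals are compared directly, so the curve is nearly flat and close to zero. Once the signals pass through a setup-dependent mixing or propagation stage, the curve becomes frequency-dependent.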
The signal decomposition using a reference curve for the signal analysis can be applied for
stereo processing, i.e., for decomposing a stereo signal. Alternatively, this procedure can
also be implemented together with a downmixer for decomposing multichannel signals.
Alternatively, this procedure can also be implemented for multichannel signals without
using a downmixer when a pair-wise evaluation of signals in a hierarchical way is
envisaged.
In a further embodiment it is an advantageous approach to not perform the analysis with
respect to the different signal components with the input signal directly, i.e. with a signal
having at least three input channels. Instead, the multi-channel input signal having at least
three input channels is processed by a downmixer for downmixing the input signal to
obtain a downmixed signal. The downmixed signal has a number of downmix channels
which is smaller than the number of input channels and, preferably, is two. Then, the
analysis of the input signal is performed on the downmixed signal rather than on the input
signal directly and the analysis results in an analysis result. However, this analysis result is
not applied to the downmixed signal, but is applied to the input signal or, alternatively, to a
signal derived from the input signal where this signal derived from the input signal may be
an upmix signal or, depending on the number of channels of the input signals, also a
downmix signal, but this signal derived from the input signal will be different from the
downmixed signal, on which the analysis has been performed. When, for example, the case
is considered that the input signal is a 5.1 channel signal, then the downmix signal, on
which the analysis is performed, might be a stereo downmix having two channels. The
analysis results are then applied to the 5.1 input signal directly, to a higher upmix such as a
7.1 output signal or to a multi-channel downmix of the input signal having for example
only three channels, which are the left channel, the center channel and the right channel,
when only a three channel audio rendering apparatus is at hand. In any case, however, the
signal to which the analysis results are applied by the signal processor is different from the
downmixed signal on which the analysis has been performed, and typically has more channels
than that downmixed signal.
The so-called "indirect" analysis/processing is possible due to the fact that one can assume
that any signal components in the individual input channels also occur in the downmixed
channels, since a downmix typically consists of an addition of input channels in different
ways. One straightforward downmix is, for example, that the individual input channels are
weighted as required by a downmix rule or a downmix matrix and are then added together
after having been weighted. An alternative downmix consists of filtering the input channels
with certain filters such as HRTF filters and the downmix is performed by using filtered
signals, i.e. the signals filtered by HRTF filters as known in the art. For a five channel
input signal one requires 10 HRTF filters, and the HRTF filter outputs for the left part/left
ear are added together and the HRTF filter outputs for the right channel filters are added
together for the right ear. Alternative downmixes can be applied in order to reduce the
number of channels which have to be processed in the signal analyzer.
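The straightforward downmix described above, weighting each input channel and adding the weighted channels, amounts to a matrix product. A minimal sketch, in which the coefficient values and the channel order L, C, R, Ls, Rs are assumptions chosen for illustration and not values given in the specification:

```python
import numpy as np

def downmix(x, h):
    """Weight the input channels with the downmix matrix h and add them.

    x: array of shape (n_channels, n_samples)
    h: array of shape (m_channels, n_channels)
    """
    return h @ x

g = 1.0 / np.sqrt(2.0)  # hypothetical weight for centre and surround channels

# Rows: left and right downmix channel; columns: L, C, R, Ls, Rs (assumed order).
h = np.array([
    [1.0, g, 0.0, g, 0.0],
    [0.0, g, 1.0, 0.0, g],
])

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 1000))  # five input channels
d = downmix(x, h)                   # two downmix channels
```

An HRTF-based downmix would replace each scalar weight with a filter, which in a frequency-domain implementation becomes one complex coefficient per channel pair and frequency bin.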
Hence, embodiments of the present invention describe a novel concept to extract
perceptually distinct components from arbitrary input signals by considering an analysis
signal, while the result of the analysis is applied to the input signal. Such an analysis signal
can be gained e.g. by considering a propagation model of the channels or loudspeaker
signals to the ears. This is in part motivated by the fact that the human auditory system also
uses solely two sensors (the left and right ear) to evaluate sound fields. Thus, the extraction
of perceptually distinct components is basically reduced to the consideration of an analysis
signal that will be denoted as downmix in the following. Throughout this document, the
term downmix is used for any pre-processing of the multichannel signal resulting in an
analysis signal (this may include e.g. a propagation model, HRTFs, BRIRs, simple cross-factor
downmix).
Knowing the format of the given input and the desired characteristics of the signal to be
extracted, the ideal inter-channel relations can be defined for the downmixed format and,
as such, an analysis of this analysis signal is sufficient to generate a weighting mask (or
multiple weighting masks) for the decomposition of multichannel signals.
In an embodiment, the multi-channel problem is simplified by using a stereo downmix of a
surround signal and applying a direct/ambient analysis to the downmix. Based on the
result, i.e. short-time power spectra estimations of direct and ambient sounds, filters are
derived for decomposing an N-channel signal to N direct sound and N ambient sound
channels.
The present invention is advantageous due to the fact that signal analysis is applied on a
smaller number of channels, which significantly reduces the processing time required, so
that the inventive concept can even be applied in real time applications for upmixing or
downmixing or any other signal processing operation where different components such as
perceptually different components of a signal are required.
A further advantage of the present invention is that although a downmix is performed it has
been found out that this does not deteriorate the detectability of perceptually distinct
components in the input signal. Stated differently, even when input channels are
downmixed, the individual signal components can nevertheless be separated to a large
extent. Furthermore, the downmix operates as a kind of "collection" of all signal
components of all input channels into two channels and the single analysis applied on these
"collected" downmixed signals provides a unique result which no longer has to be
interpreted and can be directly used for signal processing.
Preferred embodiments of the present invention are subsequently discussed with respect to
the accompanying figures, in which:
Fig. 1 is a block diagram for illustrating an apparatus for decomposing an input
signal using a downmixer;
Fig. 2 is a block diagram illustrating an implementation of an apparatus for
decomposing a signal having a number of at least three input channels using
an analyzer with a pre-calculated frequency dependent correlation curve in
accordance with a further aspect of the invention;
Fig. 3 illustrates a further preferred implementation of the present invention with a
frequency-domain processing for the downmix, analysis and the signal
processing;
Fig. 4 illustrates an exemplary pre-calculated frequency dependent correlation
curve for a reference curve for the analysis indicated in Fig. 1 or Fig. 2;
Fig. 5 illustrates a block diagram illustrating a further processing in order to
extract independent components;
Fig. 6 illustrates a further implementation of a block diagram for further
processing where independent diffuse, independent direct and direct
components are extracted;
Fig. 7 illustrates a block diagram implementing the downmixer as an analysis
signal generator;
Fig. 8 illustrates a flowchart for indicating a preferred way of processing in the
signal analyzer of Fig. 1 or Fig. 2;
Figs. 9a-9e illustrate different pre-calculated frequency dependent correlation curves
which can be used as reference curves for several different setups with
different numbers and positions of sound sources (such as loudspeakers);
Fig. 10 illustrates a block diagram for illustrating another embodiment for a
diffuseness estimation where diffuse components are the components to be
decomposed; and
Figs. 11A and 11B illustrate example equations for applying a signal analysis without a
frequency-dependent correlation curve, but relying on a Wiener filtering
approach.
Fig. 1 illustrates an apparatus for decomposing an input signal 10 having a number of at
least three input channels or, generally, N input channels. These input channels are input
into a downmixer 12 for downmixing the input signal to obtain a downmixed signal 14,
wherein the downmixer 12 is arranged for downmixing so that a number of downmix
channels of the downmixed signal 14, which is indicated by "m", is at least two and
smaller than the number of input channels of the input signal 10. The m downmix channels
are input into an analyzer 16 for analyzing the downmixed signal to derive an analysis
result 18. The analysis result 18 is input into a signal processor 20, where the signal
processor is arranged for processing the input signal 10 or a signal derived from the input
signal by a signal deriver 22 using the analysis result, wherein the signal processor 20 is
configured for applying the analysis results to the input channels or to channels of the
signal 24 derived from the input signal to obtain a decomposed signal 26.
In the embodiment illustrated in Fig. 1, the number of input channels is n, the number of
downmix channels is m, the number of derived channels is l, and the number of output
channels is equal to l, when the derived signal rather than the input signal is processed by
the signal processor. Alternatively, when the signal deriver 22 does not exist then the input
signal is directly processed by the signal processor and then the number of channels of the
decomposed signal 26 indicated by "l" in Fig. 1 will be equal to n. Hence, Fig. 1 illustrates
two different examples. One example does not have the signal deriver 22 and the input
signal is directly applied to the signal processor 20. The other example is that the signal
deriver 22 is implemented and, then, the derived signal 24 rather than the input signal 10 is
processed by the signal processor 20. The signal deriver may, for example, be an audio
channel mixer such as an upmixer for generating more output channels. In this case l
would be greater than n. In another embodiment, the signal deriver could be another audio
processor which performs weighting, delay or anything else to the input channels and in
this case the number l of output channels of the signal deriver 22 would be equal to the
number n of input channels. In a further implementation, the signal deriver could be a
downmixer which reduces the number of channels from the input signal to the derived
signal. In this implementation, it is preferred that the number l is still greater than the
number m of downmixed channels in order to have one of the advantages of the present
invention, i.e. that the signal analysis is applied to a smaller number of channel signals.
The analyzer is operative to analyze the downmixed signal with respect to perceptually
distinct components. These perceptually distinct components can be independent
components in the individual channels on the one hand, and dependent components on the
other hand. Alternative signal components to be analyzed by the present invention are
direct components on the one hand and ambient components on the other hand. There are
many other components which can be separated by the present invention, such as speech
components from music components, noise components from speech components, noise
components from music components, high frequency noise components with respect to low
frequency noise components, in multi-pitch signals the components provided by the
different instruments, etc. This is due to the fact that there are powerful analysis tools such
as Wiener filtering as discussed in the context of Figs. 11A, 11B or other analysis
procedures such as using a frequency-dependent correlation curve as discussed in the
context of, for example, Fig. 8 in accordance with the present invention.
Fig. 2 illustrates another aspect, where the analyzer is implemented for using a
pre-calculated frequency-dependent correlation curve 16. Thus, the apparatus for decomposing
a signal 28 having a plurality of channels comprises the analyzer 16 for analyzing a
correlation between two channels of an analysis signal identical to the input signal or
related to the input signal, for example, by a downmixing operation as illustrated in the
context of Fig. 1. The analysis signal analyzed by the analyzer 16 has at least two analysis
channels, and the analyzer 16 is configured for using a pre-calculated frequency dependent
correlation curve as a reference curve to determine the analysis result 18. The signal
processor 20 can operate in the same way as discussed in the context of Fig. 1 and is
configured for processing the analysis signal or a signal derived from the analysis signal by
a signal deriver 22, where the signal deriver 22 can be implemented similarly to what has
been discussed in the context of the signal deriver 22 of Fig. 1. Alternatively, the signal
processor can process a signal, from which the analysis signal is derived and the signal
processing uses the analysis result to obtain a decomposed signal. Hence, in the
embodiment of Fig. 2 the input signal can be identical to the analysis signal and, in this
case, the analysis signal can also be a stereo signal having just two channels as illustrated
in Fig. 2. Alternatively, the analysis signal can be derived from an input signal by any kind
of processing, such as downmixing as described in the context of Fig. 1 or by any other
processing such as upmixing or so. Additionally, the signal processor 20 can be useful to
apply the signal processing to the same signal as has been input into the analyzer or the
signal processor can apply a signal processing to a signal, from which the analysis signal
has been derived such as indicated in the context of Fig. 1, or the signal processor can
apply a signal processing to a signal which has been derived from the analysis signal such
as by upmixing or so.
Hence, different possibilities exist for the signal processor and all of these possibilities are
advantageous due to the unique operation of the analyzer using a pre-calculated
frequency-dependent correlation curve as a reference curve to determine the analysis result.
Subsequently, further embodiments are discussed. It is to be noted that, as discussed in the
context of Fig. 2, even the use of a two-channel analysis signal (without a downmix) is
considered. Hence, in the present invention as discussed in the different aspects in the context
of Fig. 1 and Fig. 2, which can be used together or as separate aspects, either the downmix can be
processed by the analyzer, or a two-channel signal, which has possibly not been generated
by a downmix, can be processed by the signal analyzer using the pre-calculated reference
curve. In this context, it is to be noted that the subsequent description of implementation
aspects can be applied to both aspects schematically illustrated in Fig. 1 and Fig. 2 even
when certain features are only described for one aspect rather than both. If, for example,
Fig. 3 is considered, it becomes clear that the frequency-domain features of Fig. 3 are
described in the context of the aspect illustrated in Fig. 1, but it is clear that a
time/frequency transform as subsequently described with respect to Fig. 3 and the inverse
transform can also be applied to the implementation in Fig. 2, which does not have a
downmixer, but which has a specified analyzer that uses a pre-calculated frequency
dependent correlation curve.
Particularly, the time/frequency converter would be placed to convert the analysis signal
before the analysis signal is input into the analyzer, and the frequency/time converter
would be placed at the output of the signal processor to convert the processed signal back
into the time domain. When a signal deriver exists, the time/frequency converter might be
placed at an input of the signal deriver so that the signal deriver, the analyzer, and the
signal processor all operate in the frequency/subband domain. In this context, frequency
and subband basically mean a portion in frequency of a frequency representation.
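As a minimal sketch of this arrangement, the following Python code places a time/frequency converter before a per-bin weighting and a frequency/time converter after it. It assumes non-overlapping rectangular frames and a naive DFT for brevity, whereas a practical STFT would typically use overlapping windows. The names dft, idft, process and weight are hypothetical and do not appear in the figures.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform of one frame (list of floats)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * i * t / n) for t in range(n))
        for i in range(n)
    ]


def idft(coeffs):
    """Inverse of dft(); returns the real part of each time-domain sample."""
    n = len(coeffs)
    return [
        (sum(coeffs[i] * cmath.exp(2j * cmath.pi * i * t / n) for i in range(n)) / n).real
        for t in range(n)
    ]


def process(signal, frame_len, weight):
    """T/F conversion, per-bin weighting, then iT/F conversion, frame by frame.

    weight(m, i) returns the real-valued gain for frame m and bin i.
    Samples beyond the last full frame are dropped for brevity.
    """
    out = []
    for m in range(len(signal) // frame_len):
        frame = signal[m * frame_len:(m + 1) * frame_len]
        spectrum = dft(frame)
        weighted = [weight(m, i) * spectrum[i] for i in range(frame_len)]
        out.extend(idft(weighted))
    return out
```

With a weight of 1 for every bin, the output reproduces the input, which is a quick way to check that the two converters are consistent with each other.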
It is furthermore clear that the analyzer in Fig. 1 can be implemented in many different
ways, but this analyzer is also, in one embodiment, implemented as the analyzer discussed
in Fig. 2, i.e. as an analyzer which uses a pre-calculated frequency-dependent correlation
curve as an alternative to Wiener filtering or any other analysis method.
The embodiment of Fig. 3 applies a downmix procedure to an arbitrary input signal to
obtain a two-channel representation. An analysis in the time-frequency domain is
performed and weighting masks are calculated that are multiplied with the time-frequency
representation of the input signal, as is illustrated in Fig. 3.
In the picture, T/F denotes a time-frequency transform, commonly a short-time Fourier
transform (STFT). iT/F denotes the respective inverse transform. $[x_1(n), \ldots, x_N(n)]$ are the
time-domain input signals, where $n$ is the time index. $[X_1(m,i), \ldots, X_N(m,i)]$ denote the
coefficients of the frequency decomposition, where $m$ is the decomposition time index
and $i$ is the decomposition frequency index. $[D_1(m,i), D_2(m,i)]$ are the two channels of the
downmixed signal.
$W(m,i)$ is the calculated weighting. $[Y_1(m,i), \ldots, Y_N(m,i)]$ are the weighted frequency
decompositions of each channel. $H_{ij}(i)$ are the downmix coefficients, which can be real-valued
or complex-valued, and the coefficients can be constant in time or time-variant.
Hence, the downmix coefficients can be just constants or filters such as HRTF filters,
reverberation filters or similar filters.
$Y_j(m,i) = W_j(m,i) \cdot X_j(m,i)$, where $j = (1, 2, \ldots, N)$ (2)
In Fig. 3 the case of applying the same weighting to all channels is depicted.
$Y_j(m,i) = W(m,i) \cdot X_j(m,i)$ (3)
$[y_1(n), \ldots, y_N(n)]$ are the time-domain output signals comprising the extracted signal
components. (The input signal may have an arbitrary number of channels ($N$), produced
for an arbitrary target playback loudspeaker setup. The downmix may include HRTFs to
obtain ear-input signals, simulation of auditory filters, etc. The downmix may also be
carried out in the time domain.)
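The processing of a single time-frequency tile can be sketched as follows. The sketch assumes that the two downmix channels are formed as a weighted sum of the N input coefficients with the downmix coefficients of the current frequency bin, which is an assumption made for illustration, and it applies one common weight to all channels as in equation (3). The function names and the numeric values are hypothetical.

```python
def downmix(H, X):
    """Two-channel downmix of one time-frequency tile.

    H: 2 x N downmix coefficients for the current frequency bin
       (real or complex). X: the N input-channel coefficients of the tile.
    Assumption: each downmix channel is a weighted sum of the inputs.
    """
    return [sum(h * x for h, x in zip(row, X)) for row in H]


def apply_weighting(W, X):
    """Apply one common weight W to all N channels of a tile, as in equation (3)."""
    return [W * x for x in X]


# Hypothetical tile with N = 3 input channels.
X = [1 + 1j, 2 - 1j, 0.5j]
# Hypothetical constant downmix: channel 3 is split equally to both sides.
H = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]

D1, D2 = downmix(H, X)
Y = apply_weighting(0.5, X)
```

Because the weight is applied to the original N channels rather than to the downmix, the number of output channels equals the number of input channels.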
In an embodiment, the difference between a reference correlation (Throughout this text, the
term correlation is used as a synonym for inter-channel similarity and may thus also include
evaluations of time shifts, for which the term coherence is usually used. Even if time shifts
are evaluated, the resulting value may have a sign. (Commonly, the coherence is defined as
having only positive values.)) as a function of frequency ($c_{ref}(\omega)$), and the actual correlation
of the downmixed input signal ($c_{sig}(\omega)$) [...] denotes time averaging. In a steady-state sound field, the following
relations can be derived:
$r(k,d) = \frac{\sin(kd)}{kd}$ (for three-dimensional sound fields), and (5)
$r(k,d) = J_0(kd)$ (for two-dimensional sound fields), (6)
where $d$ is the distance between the two measurement points and $k = \frac{2\pi}{\lambda}$ is the
wavenumber, with $\lambda$ being the wavelength. (The physical reference curve $r(k,d)$ may
already be used as $c_{ref}$ for further processing.)
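The two physical reference curves can be evaluated numerically as in the following sketch. It assumes a speed of sound of 343 m/s to convert frequency into wavenumber, and it evaluates the Bessel function of order zero from its standard integral representation so that only the Python standard library is needed. All function names are hypothetical.

```python
import math


def r_3d(k, d):
    """sin(kd)/(kd) for three-dimensional sound fields, with the limit 1 at kd = 0."""
    x = k * d
    return 1.0 if x == 0.0 else math.sin(x) / x


def bessel_j0(x, steps=2000):
    """J0(x) = (1/pi) * integral over [0, pi] of cos(x * sin(theta)), midpoint rule."""
    h = math.pi / steps
    return sum(math.cos(x * math.sin((n + 0.5) * h)) for n in range(steps)) / steps


def r_2d(k, d):
    """J0(kd) for two-dimensional sound fields."""
    return bessel_j0(k * d)


def wavenumber(freq_hz, c=343.0):
    """k = 2*pi/lambda = 2*pi*f/c; c = 343 m/s is an assumed value for air."""
    return 2.0 * math.pi * freq_hz / c


# Reference curve sampled at a few frequencies for an assumed spacing of 0.17 m.
curve_3d = [r_3d(wavenumber(f), 0.17) for f in (100.0, 500.0, 1000.0, 2000.0)]
```

Both curves start at 1 for low frequencies or small spacings and decay towards 0 with oscillations as kd grows.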
A measure for the perceptual diffuseness of a sound field is the interaural
cross-correlation coefficient ($\rho$), measured in a sound field. Measuring $\rho$ implies that the
distance between the pressure sensors (i.e. the ears) is fixed. Including this restriction, $r$
becomes a function of frequency with the radian frequency $\omega = kc$, where $c$ is the speed
of sound in air. Furthermore, the pressure signals differ from the previously considered
free-field signals due to reflection, diffraction, and bending effects caused by the listener's
pinnae, head, and torso. Those effects, substantial for spatial hearing, are described by
head-related transfer functions (HRTFs). Considering those influences, the resulting
pressure signals at the ear entrances are $p_L(n,\omega)$ and $p_R(n,\omega)$. For the calculation,
measured HRTF data may be used or approximations can be obtained by using an
analytical model (e.g. Richard O. Duda and William L. Martens, "Range dependence of the
response of a spherical head model," Journal Of The Acoustical Society Of America, vol.
104, no. 5, pp. 3048-3058, November 1998).
Since the human auditory system acts as a frequency analyzer with limited frequency
selectivity, this frequency selectivity may furthermore be incorporated. The auditory filters
are assumed to behave like overlapping bandpass filters. In the following example
explanation, a critical band approach is used to approximate these overlapping bandpasses
by rectangular filters. The equivalent rectangular bandwidth (ERB) may be calculated as a
function of center frequency (Brian R. Glasberg and Brian C. J. Moore, "Derivation of
auditory filter shapes from notched-noise data," Hearing Research, vol. 47, pp. 103-138,
1990). Considering that the binaural processing follows the auditory filtering, $\rho$ has to be
calculated for separate frequency channels, yielding the following frequency-dependent
pressure signals
where the integration limits are given by the bounds of the critical band according to the
actual center frequency $\omega$. The factors $1/b(\omega)$ may or may not be used in equations (7)
and (8).
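A discrete counterpart of this band integration can be sketched as follows. The sketch assumes the ERB approximation ERB(f) = 24.7 (4.37 f / 1000 + 1) Hz from the cited Glasberg and Moore publication and a uniformly spaced frequency grid. The optional normalisation divides by the number of bins in the band, which is a stand-in for the factor that may or may not be used. The function names are hypothetical.

```python
def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth in Hz at centre frequency fc_hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def band_integrate(spectrum, bin_hz, fc_hz, normalize=True):
    """Sum (or average) the bins inside the rectangular band around fc_hz.

    spectrum[i] holds the value at frequency i * bin_hz.
    The band spans fc_hz plus/minus half the ERB at fc_hz.
    """
    half = erb_hz(fc_hz) / 2.0
    lo, hi = fc_hz - half, fc_hz + half
    band = [v for i, v in enumerate(spectrum) if lo <= i * bin_hz <= hi]
    total = sum(band)
    if normalize and band:
        return total / len(band)
    return total
```

For a flat spectrum the normalised result is independent of the centre frequency, whereas the unnormalised result grows with the bandwidth.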
If one of the sound pressure measurements is advanced or delayed by a frequency-independent
time difference, the coherence of the signals can be evaluated. The human
auditory system is able to make use of such a time alignment property. Usually, the
interaural coherence is calculated within ±1 ms. Depending on the available processing
power, calculations can be implemented using only the lag-zero value (for low complexity)
or the coherence with a time advance and delay (if high complexity is possible). In the
following, no distinction is made between the two cases.
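The two complexity options can be illustrated with the following sketch, which computes a normalised cross-correlation either at lag zero only or over a range of lags. Returning the signed value with the largest magnitude is an assumption made here so that the result can carry a sign. The function names and the sampling rate used in the example are hypothetical.

```python
import math


def correlation_at_lag(a, b, lag):
    """Normalised cross-correlation of a[t] and b[t + lag] over the valid overlap."""
    n = len(a)
    num = sum(a[t] * b[t + lag] for t in range(n) if 0 <= t + lag < n)
    den = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    return num / den if den > 0.0 else 0.0


def similarity(a, b, fs_hz, max_lag_ms=0.0):
    """Lag-zero correlation (max_lag_ms = 0) or lag search within +/- max_lag_ms.

    Returns the signed value with the largest magnitude over the searched lags.
    """
    max_lag = int(round(max_lag_ms * 1e-3 * fs_hz))
    values = [correlation_at_lag(a, b, lag) for lag in range(-max_lag, max_lag + 1)]
    return max(values, key=abs)


# b is a copy of a delayed by two samples (1 ms at an assumed rate of 2000 Hz).
a = [0.0, 1.0, -0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 1.0, -0.5, 0.25, 0.0, 0.0]
low_complexity = similarity(a, b, 2000.0)
high_complexity = similarity(a, b, 2000.0, max_lag_ms=1.0)
```

In this example the lag search recovers the full similarity of the delayed copy, while the lag-zero value underestimates it.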
The ideal behavior is achieved considering an ideal diffuse sound field, which can be
idealized as a wave field that is composed of equally strong, uncorrelated plane waves
propagating in all directions (i.e. a superposition of an infinite number of propagating
plane waves with random phase relations and uniformly distributed directions of
propagation). A signal radiated by a loudspeaker can be considered a plane wave for a
listener positioned sufficiently far away. This plane wave assumption is common in
stereophonic playback over loudspeakers. Thus, a synthetic sound field reproduced by
loudspeakers consists of contributing plane waves from a limited number of directions.
Consider an input signal with $N$ channels, produced for playback over a setup with
loudspeaker positions $[l_1, l_2, l_3, \ldots, l_N]$. (In the case of a horizontal-only playback setup,
$l_i$ indicates the azimuth angle. In the general case, $l_i$ = (azimuth, elevation) indicates the
position of the loudspeaker relative to the listener's head. If the setup present in the
listening room differs from the reference setup, $l_i$ may alternatively represent the
loudspeaker positions of the actual playback setup.) With this information, an interaural
coherence reference curve $\rho_{ref}$ for a diffuse field simulation can be calculated for this
setup under the assumption that independent signals are fed to each loudspeaker. The
signal power contributed by each input channel in each time-frequency tile may be
included in the calculation of the reference curve. In the example implementation, $\rho_{ref}$ is
used as $c_{ref}$.
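One way to evaluate such a reference value at a single frequency is sketched below. It relies on the fact that, for mutually independent loudspeaker signals, the cross terms between different loudspeakers vanish on average, so that the cross-spectrum of the two ear signals is the power-weighted sum of the per-loudspeaker products. The transfer coefficients h_left and h_right are hypothetical placeholders for HRTF values at the loudspeaker positions, and taking the real part corresponds to the lag-zero case.

```python
import math


def reference_coherence(h_left, h_right, powers=None):
    """Reference correlation of two ear signals for independent loudspeaker signals.

    h_left[j], h_right[j]: transfer coefficients (real or complex) from
    loudspeaker j to the left and right ear at one frequency (hypothetical values).
    powers[j]: signal power of input channel j; equal powers if omitted.
    """
    n = len(h_left)
    p = powers if powers is not None else [1.0] * n
    cross = sum(p[j] * complex(h_left[j]) * complex(h_right[j]).conjugate()
                for j in range(n))
    p_left = sum(p[j] * abs(h_left[j]) ** 2 for j in range(n))
    p_right = sum(p[j] * abs(h_right[j]) ** 2 for j in range(n))
    den = math.sqrt(p_left * p_right)
    return cross.real / den if den > 0.0 else 0.0
```

Evaluating this function for every frequency band yields one reference curve per loudspeaker setup, and passing per-channel powers shows how the signal power of each input channel shifts the reference value.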
Different reference curves as examples for frequency-dependent reference curves or
correlation curves are illustrated in Figs. 9a to 9e for different numbers of sound sources
at different positions of the sound sources and different head orientations as indicated in
the figures.
Subsequently the calculation of the analysis results as discussed in the context of Fig. 8
based on the reference curves is discussed in more detail.
The goal is to derive a weighting that equals 1 if the correlation of the downmix channels
is equal to the calculated reference correlation under the assumption of independent signals
being played back from all loudspeakers. If the correlation of the downmix equals +1 or -1,
the derived weighting should be 0, indicating that no independent components are present.
In between those extreme cases, the weighting should represent a reasonable transition
between the indication as independent (W=1) or completely dependent (W=0).
Given the reference correlation curve $c_{ref}(\omega)$ and the estimation of the correlation /
coherence of the actual input signal played back over the actual reproduction setup
($c_{sig}(\omega)$) ($c_{sig}$ is the correlation or coherence of the downmix), the deviation of $c_{sig}(\omega)$
from $c_{ref}(\omega)$ can be calculated. This deviation (possibly including an upper and lower
threshold) is mapped to the range [0;1] to obtain a weighting ($W(m,i)$) that is applied to all
input channels to separate the independent components.
The following example illustrates a possible mapping when the thresholds correspond with
the reference curve:
The magnitude of the deviation (denoted as $\Delta$) of the actual curve $c_{sig}$ from the reference
$c_{ref}$ is given by
$\Delta(\omega) = |c_{sig}(\omega) - c_{ref}(\omega)|$ (9)
Given that the correlation / coherence is bounded between [-1;+1], the maximally possible
deviation towards +1 or -1 for each frequency is given by
$\Delta_{+}(\omega) = 1 - c_{ref}(\omega)$ (10)
$\Delta_{-}(\omega) = c_{ref}(\omega) + 1$ (11)
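The weighting for each frequency is thus obtained from
$W(\omega) = \begin{cases} 1 - \frac{\Delta(\omega)}{\Delta_{+}(\omega)}, & c_{sig}(\omega) \geq c_{ref}(\omega) \\ 1 - \frac{\Delta(\omega)}{\Delta_{-}(\omega)}, & c_{sig}(\omega) < c_{ref}(\omega) \end{cases}$ (13)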
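The mapping of the deviation to a weighting can be sketched as follows. The sketch assumes a linear transition: the magnitude of the deviation from the reference is divided by the maximally possible deviation in the direction of the actual deviation, and the result is subtracted from 1. The function name and the handling of a reference value lying exactly on the bound are assumptions.

```python
def weighting(c_sig, c_ref):
    """Map the deviation of c_sig from c_ref to a weight in [0, 1].

    Returns 1 when c_sig equals c_ref and 0 when c_sig reaches +1 or -1.
    Both arguments are expected to lie in [-1, +1].
    """
    delta = abs(c_sig - c_ref)
    # Largest possible deviation in the direction of the actual deviation.
    max_dev = (1.0 - c_ref) if c_sig >= c_ref else (c_ref + 1.0)
    if max_dev <= 0.0:
        # The reference sits on the bound itself (assumed special case).
        return 1.0 if delta == 0.0 else 0.0
    # Clamp to guard against rounding slightly outside [0, 1].
    return min(1.0, max(0.0, 1.0 - delta / max_dev))


# One weight per frequency for hypothetical measured and reference curves.
c_ref_curve = [0.9, 0.5, 0.1]
c_sig_curve = [0.9, 0.75, -1.0]
weights = [weighting(s, r) for s, r in zip(c_sig_curve, c_ref_curve)]
```

In the example, the first frequency matches the reference and is treated as independent, the second lies halfway towards +1, and the third is fully dependent.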