Method And Apparatus For Synthesizing A Plurality Of Audio Channels
Abstract:
The following coding scenario is addressed: A number of audio source signals need
to be transmitted or stored for the purpose of mixing wave field synthesis, multi-
channel surround, or stereo signals after decoding the source signals. The proposed
technique offers significant coding gain when jointly coding the source signals,
compared to separately coding them, even when no redundancy is present between
the source signals. This is possible by considering statistical properties of the source
signals, the properties of mixing techniques, and spatial hearing. The sum of the
source signals is transmitted plus the statistical properties of the source signals which
mostly determine the perceptually important spatial cues of the final mixed audio
channels. Source signals are recovered at the receiver such that their statistical
properties approximate the corresponding properties of the original source signals.
Subjective evaluations indicate that high audio quality is achieved by the proposed
scheme.
PARAMETRIC JOINT-CODING OF AUDIO SOURCES
I. INTRODUCTION
In a general coding problem, we have a number of (mono) source signals s_i(n) (1 ≤ i
≤ M) and a scene description vector S(n), where n is the time index. The scene
description vector contains parameters such as (virtual) source positions, source
widths, and acoustic parameters such as (virtual) room parameters. The scene
description may be time-invariant or may be changing over time. The source signals
and scene description are coded and transmitted to a decoder. The coded source
signals, ŝ_i(n), are successively mixed as a function of the scene description, S(n), to
generate wavefield synthesis, multi-channel, or stereo signals as a function of the
scene description vector. The decoder output signals are denoted x̂_i(n) (0 ≤ i [...]
[...] c_i, d_i) and on the short-time
subband power of the sources, [...]. The normalized subband cross-
correlation function Φ(n,d) (12), that is needed for ICTD (10) and ICC (9)
computation, depends on [...] and additionally on the normalized subband
auto-correlation function, Φ_i(n,e) (13), for each source signal. The maximum of
Φ(n,d) lies within the range min_i{T_i} [...] and less de-
correlation processing can be applied than would be needed for generating
independent M or N channels.
Due to less de-correlation processing, better audio quality is expected.
Best audio quality is expected when the mixer parameters are constrained such that
[...]. In this case, the power of each source in the transmitted
sum signal (1) is the same as the power of the same source in the mixed decoder
output signal. The decoder output signal (Figure 10) is the same as if the mixer
output signal (Figure 4) were encoded and decoded by a BCC encoder/decoder in
this case. Thus, similar quality can also be expected.
The decoder can not only determine the direction at which each source is to appear,
but the gain of each source can also be varied. The gain is increased by choosing
[...] and decreased by choosing [...].
B. Using no de-correlation processing
The restriction of the previously described technique is that mixing is carried out with
a BCC synthesis scheme. One could imagine implementing not only ICTD, ICLD, and
ICC synthesis but additionally effects processing within the BCC synthesis.
However, it may be desired that existing mixers and effects processors can be used.
This also includes wavefield synthesis mixers (often denoted "convoluters"). For
using existing mixers and effects processors, the ŝ_i(n) are computed explicitly and
used as if they were the original source signals.
When applying no de-correlation processing (h_i(n) = δ(n) in (16)), good audio quality
can also be achieved. It is a compromise between artifacts introduced due to de-
correlation processing and artifacts due to the fact that the source signals ŝ_i(n) are
correlated. When no de-correlation processing is used, the resulting auditory spatial
image may suffer from instability [1]. But the mixer may itself introduce some de-
correlation when reverberators or other effects are used and thus there is less need
for de-correlation processing.
If ŝ_i(n) are generated without de-correlation processing, the level of the sources
depends on the direction to which they are mixed relative to the other sources. By
replacing amplitude panning algorithms in existing mixers with an algorithm
compensating for this level dependence, the negative effect of loudness dependence
on mixing parameters can be circumvented. A level compensating amplitude
algorithm is shown in Figure 11 which aims to compensate the source level
dependence on mixing parameters. Given the gain factors of a conventional
amplitude panning algorithm (e.g. Figure 4), a_i and b_i, the weights in Figure 11, ā_i
and b̄_i, are computed by
[...]
Note that ā_i and b̄_i are computed such that the output subband power is the same
as if ŝ_i(n) were independent in each subband.
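As a rough illustration of the level compensation idea, the sketch below rescales the conventional panning gains of one output channel so that the channel carries the subband power that independent sources would produce. It rests on an assumption that is not stated in the text: without de-correlation, the recovered sources within a subband are treated as fully coherent, so their amplitudes add rather than their powers. The function name `level_compensated_gains` and the use of a single common scale factor per channel are likewise illustrative choices, not the expression used for the weights in Figure 11.

```python
import math

def level_compensated_gains(gains, power):
    """Rescale panning gains for ONE output channel in ONE subband.

    gains : conventional panning gain of each source for this channel
    power : short-time subband power of each source (same order)

    Assumption (not from the source text): without de-correlation the
    recovered sources are fully coherent, so their amplitudes add up.
    """
    # Power the channel would have if the sources were independent.
    target = sum(g * g * p for g, p in zip(gains, power))
    # Power the channel actually gets when coherent sources are summed.
    actual = sum(g * math.sqrt(p) for g, p in zip(gains, power)) ** 2
    scale = math.sqrt(target / actual) if actual > 0 else 1.0
    return [scale * g for g in gains]

# Hypothetical example: two equally strong sources panned to one channel.
a = [0.6, 0.8]
power = [1.0, 1.0]
a_comp = level_compensated_gains(a, power)
```

The same function would be called once per output channel (for example once with the left gains and once with the right gains), so the relative panning of the sources within a channel is preserved and only the overall level changes.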
C. Reducing the amount of de-correlation processing
As mentioned previously, the generation of independent ŝ_i(n) is problematic. Here
strategies are described for applying less de-correlation processing, while effectively
getting a similar effect as if the ŝ_i(n) were independent.
Consider for example a wavefield synthesis system as shown in Figure 12. The
desired virtual source positions for s_1, s_2, ..., s_6 (M = 6) are indicated. A strategy for
computing ŝ_i(n) (16) without generating M fully independent signals is:
1. Generate groups of source indices corresponding to sources close to each
other. For example in Figure 8 these could be: {1}, {2, 5}, {3}, and {4, 6}.
2. At each time in each subband select the source index of the strongest source,
i_max. Apply no de-correlation processing for the source indices part of the group containing
i_max, i.e. h_i(n) = δ(n).
3. For each other group choose the same h_i(n) within the group.
The described algorithm modifies the strongest signal components least. Additionally,
the number of different h_i(n) that are used is reduced. This is an advantage
because de-correlation is easier the fewer independent channels need to be
generated. The described technique is also applicable when stereo or multi-channel
audio signals are mixed.
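The three steps above can be sketched for a single time/subband tile as follows. This is only a minimal illustration: the function name, the dictionary of subband powers, and the integer filter identifiers are hypothetical, with identifier 0 standing for the unit impulse (no de-correlation) and 1, 2, ... standing for distinct de-correlation filters that are not specified here. The strongest source is taken to be the one with the largest short-time subband power.

```python
def assign_decorrelation_filters(groups, subband_power):
    """Return {source_index: filter_id} for one time/subband tile.

    groups        : list of lists of source indices (sources close together)
    subband_power : {source_index: short-time subband power}
    filter_id 0 means "no de-correlation"; ids 1, 2, ... are placeholders
    for distinct (hypothetical) de-correlation filters.
    """
    # Step 2: index of the strongest source in this tile.
    i_max = max(subband_power, key=subband_power.get)
    assignment = {}
    next_filter = 1
    for group in groups:
        if i_max in group:
            fid = 0                # group of the strongest source stays untouched
        else:
            fid = next_filter      # step 3: one shared filter per other group
            next_filter += 1
        for i in group:
            assignment[i] = fid
    return assignment

# Step 1: groups as in the example above; the powers are made-up values.
groups = [[1], [2, 5], [3], [4, 6]]
power = {1: 0.2, 2: 0.9, 3: 0.1, 4: 0.4, 5: 0.3, 6: 0.05}
result = assign_decorrelation_filters(groups, power)
```

With four groups, at most three distinct de-correlation filters are needed in any tile instead of six, which is the reduction the text describes.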
V. SCALABILITY IN TERMS OF QUALITY AND BITRATE
The proposed scheme transmits only the sum of all source signals, which can be
coded with a conventional mono audio coder. When no mono backwards
compatibility is needed and capacity is available for transmission/storage of more
than one audio waveform, the proposed scheme can be scaled for use with more
than one transmission channel. This is implemented by generating several sum
signals with different subsets of the given source signals, i.e. to each subset of
source signals the proposed coding scheme is applied individually. Audio quality is
expected to improve as the number of transmitted audio channels is increased
because fewer independent channels have to be generated by de-correlation from
each transmitted channel (compared to the case of one transmitted channel).
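A minimal sketch of forming several sum signals from subsets of the sources is given below. How the sources are partitioned into subsets is not specified in the text, so the partition in the example, the function name `sum_signals`, and the sample values are purely illustrative; each subset is assumed to be non-empty and all signals are assumed to have the same length.

```python
def sum_signals(sources, subsets):
    """Form one sum signal per subset of sources.

    sources : {source_index: list of samples}, all lists of equal length
    subsets : list of non-empty lists of source indices
    Returns a list with one sum signal (list of samples) per subset.
    """
    out = []
    for subset in subsets:
        length = len(sources[subset[0]])
        out.append([sum(sources[i][n] for i in subset) for n in range(length)])
    return out

# Hypothetical example: three short sources sent as two sum signals.
sources = {1: [1, 2], 2: [3, 4], 3: [5, 6]}
channels = sum_signals(sources, [[1, 2], [3]])
```

Each resulting sum signal would then be coded and accompanied by the side information of the sources it contains.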
VI. BACKWARDS COMPATIBILITY TO EXISTING STEREO AND SURROUND
AUDIO FORMATS
Consider the following audio delivery scenario. A consumer obtains a maximum
quality stereo or multi-channel surround signal (e.g. by means of an audio CD, DVD,
or on-line music store, etc.). The goal is to optionally deliver to the consumer the
flexibility to generate a custom mix of the obtained audio content, without
compromising standard stereo/surround playback quality.
This is implemented by delivering to the consumer (e.g. as an optional buying option in
an on-line music store) a bit stream of side information which allows computation of
ŝ_i(n) as a function of the given stereo or multi-channel audio signal. The consumer's
mixing algorithm is then applied to the ŝ_i(n). In the following, two possibilities for
computing ŝ_i(n), given stereo or multi-channel audio signals, are described.
A. Estimating the sum of the source signals at the receiver
The most straightforward way of using the proposed coding scheme with a stereo or
multi-channel audio transmission is illustrated in Figure 13, where y_i(n) (1 ≤ i ≤ L) are
the L channels of the given stereo or multi-channel audio signal. The sum signal of
the sources is estimated by downmixing the transmitted channels to a single audio
channel. Downmixing is carried out by means of computing the sum of the channels
y_i(n) (1 ≤ i ≤ L) or more sophisticated techniques may be applied.
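The plain-sum downmix mentioned above can be written in a few lines. The sketch assumes time-domain channels of equal length held as Python lists; the function name `downmix` and the sample values are illustrative, and none of the more sophisticated techniques the text alludes to are attempted.

```python
def downmix(channels):
    """Sum L equal-length channels sample by sample.

    channels : list of L lists of samples
    Returns the estimated sum signal as one list of samples.
    """
    return [sum(frame) for frame in zip(*channels)]

# Hypothetical stereo input (L = 2).
left = [0.1, 0.2, -0.3]
right = [0.0, 0.1, 0.3]
estimate = downmix([left, right])
```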
For best performance, it is recommended that the level of the source signals is
adapted prior to estimation (6) such that the power ratio between the
source signals approximates the power ratio with which the sources are contained in
the given stereo or multi-channel signal. In this case, the downmix of the transmitted
channels is a relatively good estimate of the sum of the sources (1) (or a scaled
version thereof).
An automated process may be used to adjust the level of the encoder source signal
inputs s_i(n) prior to computation of the side information. This process estimates, adaptively in
time, the level at which each source signal is contained in the given stereo
or multi-channel signal. Prior to side information computation, the level of each
source signal is then adjusted, adaptively in time, such that it is equal to the level at
which the source is contained in the stereo or multi-channel audio signal.
B. Using the transmitted channels individually
Figure 14 shows a different implementation of the proposed scheme with stereo or
multi-channel surround signal transmission. Here, the transmitted channels are not
downmixed, but used individually for generation of the ŝ_i(n). Most generally, the
subband signals of ŝ_i(n) are computed by
[...]
where w_l(n) are weights determining specific linear combinations of the transmitted
channels' subbands. The linear combinations are chosen such that the ŝ_i(n) are
already decorrelated as much as possible. Thus, no or only a small amount of de-
correlation processing needs to be applied, which is favorable as discussed earlier.
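To make the idea of a weighted linear combination of the transmitted channels' subbands concrete, the sketch below forms one output subband signal from L channel subbands. It assumes real-valued, time-varying weights with one weight sequence per transmitted channel; the function name `combine_channels`, the data layout, and the example values are hypothetical, and how the weights would be chosen to minimise correlation is not addressed.

```python
def combine_channels(subbands, weights):
    """Weighted sum of the transmitted channels' subband signals.

    subbands : list of L lists, subband samples of each transmitted channel
    weights  : list of L lists, one weight per channel and time index
    Returns the combined subband signal as a list of samples.
    """
    n_samples = len(subbands[0])
    return [sum(w[n] * y[n] for w, y in zip(weights, subbands))
            for n in range(n_samples)]

# Hypothetical example with L = 2 channels and two time indices.
subbands = [[1.0, 2.0], [3.0, 4.0]]
weights = [[0.5, 0.5], [1.0, -1.0]]
combined = combine_channels(subbands, weights)
```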
VII. APPLICATIONS
A number of applications for the proposed coding schemes have already been
mentioned. Here, we summarize these and mention a few more applications.
A. Audio coding for mixing
Whenever audio source signals need to be stored or transmitted prior to mixing them
to stereo, multi-channel, or wavefield synthesis audio signals, the proposed scheme
can be applied. With prior art, a mono audio coder would be applied to each source
signal independently, resulting in a bitrate which scales with the number of sources.
The proposed coding scheme can encode a high number of audio source signals
with a single mono audio coder plus relatively low bitrate side information. As
described in Section V, the audio quality can be improved by using more than one
transmitted channel, if the memory/capacity to do so is available.
B. Re-mixing with meta-data
As described in Section VI, existing stereo and multi-channel audio signals can be re-
mixed with the help of additional side information (i.e. "meta-data"). As opposed to
only selling optimized stereo and multi-channel mixed audio content, meta-data can
be sold allowing a user to re-mix his stereo and multi-channel music. This can for
example also be used for attenuating the vocals in a song for karaoke, or for
attenuating specific instruments for playing an instrument along with the music.
Even if storage were not an issue, the described scheme would be very attractive
for enabling custom mixing of music. That is because it is likely that the music
industry would never be willing to give away the multi-track recordings. There is too
much danger of abuse. The proposed scheme enables re-mixing capability without
giving away the multi-track recordings.
Furthermore, as soon as stereo or multi-channel signals are re-mixed a certain
degree of quality reduction occurs, making illegal distribution of re-mixes less
attractive.
C. Stereo/multi-channel to wavefield synthesis conversion
Another application for the scheme described in Section VI is described in the
following. The stereo and multi-channel (e.g. 5.1 surround) audio accompanying
moving pictures can be extended for wavefield synthesis rendering by adding side
information. For example, Dolby AC-3 (audio on DVD) can be extended for 5.1
backwards-compatible coding of audio for wavefield synthesis systems, i.e. DVDs play
back 5.1 surround sound on conventional legacy players and wavefield synthesis
sound on a new generation of players supporting processing of the side information.
VIII. SUBJECTIVE EVALUATIONS
We implemented a real-time decoder of the algorithms proposed in Section IV-A and
IV-B. An FFT-based STFT filterbank is used. A 1024-point FFT and an STFT window
size of 768 (with zero padding) are used. The spectral coefficients are grouped
together such that each group represents a signal with a bandwidth of two times the
equivalent rectangular bandwidth (ERB). Informal listening revealed that the audio
quality did not notably improve when choosing higher frequency resolution. A lower
frequency resolution is favorable since it results in fewer parameters to be transmitted.
For each source, the amplitude/delay panning and gain can be adjusted individually.
The algorithm was used for coding of several multi-track audio recordings with 12 -
14 tracks.
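The grouping of spectral coefficients into bands of roughly two ERB can be sketched as below. Two things are assumed here and are not given in the text: a sampling rate of 44.1 kHz, and the commonly used Glasberg and Moore approximation of the ERB-rate scale, 21.4 log10(0.00437 f + 1) with f in Hz. The function names are illustrative, and the actual partition boundaries used in the described decoder may differ.

```python
import math

def erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore approximation), f in Hz."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def partition_bins(fft_size=1024, sample_rate=44100.0, erbs_per_band=2.0):
    """Group FFT bins 0..fft_size/2 into bands about erbs_per_band wide.

    sample_rate is an assumed value; the text does not state it.
    Returns a list of (first_bin, last_bin) tuples, both inclusive.
    """
    n_bins = fft_size // 2 + 1
    bands = []
    start = 0
    band_index = 0
    for k in range(n_bins):
        f = k * sample_rate / fft_size
        idx = int(erb_rate(f) // erbs_per_band)
        if idx != band_index:
            # Bin k opens a new band, so close the previous one at k - 1.
            bands.append((start, k - 1))
            start, band_index = k, idx
    bands.append((start, n_bins - 1))
    return bands

bands = partition_bins()
```

Because the ERB-rate scale is compressive, the resulting bands contain only one or two bins at low frequencies and many bins at high frequencies, which is what keeps the number of transmitted parameters per frame small.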
The decoder allows 5.1 surround mixing using a vector base amplitude panning
(VBAP) mixer. Direction and gain of each source signal can be adjusted. The
software allows on-the-fly switching between mixing the coded source signal and
mixing the original discrete source signals.
Casual listening usually reveals no or little difference between mixing the coded or
original source signals if for each source a gain G_i of zero dB is used. The more the
source gains are varied the more artifacts occur. Slight amplification and attenuation
of the sources (e.g. up to ± 6 dB) still sounds good. A critical scenario is when all the
sources are mixed to one side and only a single source to the opposite side. In
this case the audio quality may be reduced, depending on the specific mixing and
source signals.
IX. CONCLUSIONS
A coding scheme for joint-coding of audio source signals, e.g. the channels of a
multi-track recording, was proposed. The goal is not to code the source signal
waveforms with high quality, in which case joint-coding would give minimal coding
gain since the audio sources are usually independent. The goal is that when the
coded source signals are mixed a high quality audio signal is obtained. By
considering statistical properties of the source signals, the properties of mixing
schemes, and spatial hearing it was shown that significant coding gain improvement
is achieved by jointly coding the source signals.
The coding gain improvement is due to the fact that only one audio waveform is
transmitted.
Additionally, side information representing the statistical properties of the source
signals, which are the relevant factors determining the spatial perception of the final
mixed signal, is transmitted.
The side information rate is about 3 kb/s per source signal. Any mixer can be applied
with the coded source signals, e.g. stereo, multi-channel, or wavefield synthesis
mixers.
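The bitrate argument can be checked with simple arithmetic. In the sketch below only the side information rate of about 3 kb/s per source comes from the text; the mono coder bitrate of 64 kb/s and the source count of 12 are assumed example values, and the function names are illustrative.

```python
def separate_coding_kbps(num_sources, mono_coder_kbps):
    """Total rate when every source gets its own mono audio coder."""
    return num_sources * mono_coder_kbps

def joint_coding_kbps(num_sources, mono_coder_kbps,
                      side_info_kbps=3.0, num_sum_signals=1):
    """Total rate for sum signal(s) plus per-source side information."""
    return num_sum_signals * mono_coder_kbps + num_sources * side_info_kbps

# Assumed example: 12 sources and a 64 kb/s mono coder.
separate = separate_coding_kbps(12, 64)
joint = joint_coding_kbps(12, 64)
```

Under these assumed numbers the separately coded sources need 768 kb/s, while one sum signal plus side information needs 100 kb/s; transmitting a second sum signal, as described in Section V, would add one further mono coder rate.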
It is straightforward to scale the proposed scheme for higher bitrate and quality by
means of transmitting more than one audio channel. Furthermore, a variation of the
scheme was proposed which allows re-mixing of the given stereo or multi-channel
audio signal (and even changing of the audio format, e.g. stereo to multi-channel or
wavefield synthesis).
The applications of the proposed scheme are manifold. For example MPEG-4 could
be extended with the proposed scheme to reduce bitrate when more than one
"natural audio object" (source signal) needs to be transmitted. Also, the proposed
scheme offers compact representation of content for wavefield synthesis systems. As
mentioned, existing stereo or multi-channel signals could be complemented with side
information to allow the user to re-mix the signals to his liking.
REFERENCES
[1] C. Faller, Parametric Coding of Spatial Audio, Ph.D. thesis, Swiss Federal
Institute of Technology Lausanne (EPFL), 2004, Ph.D. Thesis No. 3062.
[2] C. Faller and F. Baumgarte, "Binaural Cue Coding - Part II: Schemes and
applications," IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, Nov. 2003.
We Claim:
1. Method for synthesizing a plurality of audio channels, comprising:
retrieving from an audio stream at least one sum signal representing a sum
of source signals,
retrieving from the audio stream statistical information about one or more
source signals,
receiving from the audio stream, or determining locally, parameters
describing an output audio format and source mixing parameters,
computing output mixer parameters from the received statistical information,
the parameters describing an output audio format, and the source mixing
parameters,
synthesizing the plurality of audio channels from the at least one sum signal
based on the computed output mixer parameters.
2. Method as claimed in claim 1, wherein the statistical information represents
spectral envelopes of the source signals, or the spectral envelopes of the
one or more audio source signals comprise lattice filter parameters or line
spectral parameters or in which the statistical information represents a
relative power as a function of frequency and time of the plurality of
source signals.
3. Method as claimed in claim 1, wherein the step of computing the output
mixer parameters comprises computing the cues of the plurality of audio
channels and computing the output mixer parameters using the calculated
cues of the plurality of audio channels.
4. Method as claimed in claim 1, wherein the audio channels are synthesized
in a subband domain of a filterbank.
5. Method as claimed in claim 4, wherein a number and bandwidths of the
subband domain are determined according to a spectral and temporal
resolution of a human auditory system.
6. Method as claimed in claim 4, wherein a number of subbands is between
3 and 40.
7. Method as claimed in claim 4, wherein subbands in the subband domain
have different bandwidths, wherein subbands at lower frequencies have
smaller bandwidths than subbands at higher frequencies.
8. Method as claimed in claim 4, wherein a short time Fourier transform
(STFT) based filterbank is used and spectral coefficients are combined to
form groups of spectral coefficients such that each group of spectral
coefficients forms a subband.
9. Method as claimed in claim 1, wherein the statistical information also
comprises auto-correlation functions.
10. Method as claimed in claim 2, wherein spectral envelopes are represented
as linear predictive coding (LPC) parameters.
11. Method as claimed in claim 3, wherein the computed cues are level
difference, time difference, or coherence for different frequencies and
time instants.
12. Apparatus for synthesizing a plurality of audio channels, wherein the
apparatus is operative for:
retrieving from an audio stream at least one sum signal representing a
sum of source signals,
retrieving from the audio stream statistical information about one or more
source signals,
receiving from the audio stream, or determining locally, parameters
describing an output audio format and source mixing parameters,
computing output mixer parameters from the received statistical information,
the parameters describing an output audio format, and the source mixing
parameters,
synthesizing the plurality of audio channels from the at least one sum signal
based on the computed output mixer parameters.
ABSTRACT
TITLE: Method and Apparatus for synthesizing a plurality of audio channels
The invention relates to a method for synthesizing a plurality of audio channels,
comprising retrieving from an audio stream at least one sum signal
representing a sum of source signals, retrieving from the audio stream
statistical information about one or more source signals, receiving from the
audio stream, or determining locally, parameters describing an output audio
format and source mixing parameters, computing output mixer parameters
from the received statistical information, the parameters describing an output
audio format, and the source mixing parameters, synthesizing the plurality of
audio channels from the at least one sum signal based on the computed
output mixer parameters.