Abstract:
A downscaled version of an audio decoding procedure may be achieved more effectively and/or with better-maintained compliance if the synthesis window used for downscaled audio decoding is a downsampled version of a reference synthesis window involved in the non-downscaled audio decoding procedure, the downsampling being performed by the downsampling factor by which the downsampled sampling rate and the original sampling rate deviate, and using a segmental interpolation in segments of 1/4 of the frame length.
Description
The present application is concerned with a downscaled decoding concept.
The MPEG-4 Enhanced Low Delay AAC (AAC-ELD) codec usually operates at sample rates up to 48 kHz, which results in an algorithmic delay of 15 ms. For some applications, e.g. lip-sync transmission of audio, an even lower delay is desirable. AAC-ELD already provides such an option by operating at higher sample rates, e.g. 96 kHz, and therefore provides operation modes with even lower delay, e.g. 7.5 ms. However, this operation mode comes with an unnecessarily high complexity due to the high sample rate.
The solution to this problem is to apply a downscaled version of the filter bank and thereby render the audio signal at a lower sample rate, e.g. 48 kHz instead of 96 kHz. The downscaling operation is already part of AAC-ELD, as it is inherited from the MPEG-4 AAC-LD codec, which serves as a basis for AAC-ELD.
The question which remains, however, is how to find the downscaled version of a specific filter bank. That is, the only uncertainty is the way the window coefficients are derived whilst enabling clear conformance testing of the downscaled operation modes of the AAC-ELD decoder.
In the following the principles of the down-scaled operation mode of the AAC-(E)LD codecs are described.
The downscaled operation mode of AAC-LD is described in ISO/IEC 14496-3:2009, section 4.6.17.2.7 "Adaptation to systems using lower sampling rates", as follows:
"In certain applications it may be necessary to integrate the low delay decoder into an audio system running at lower sampling rates (e.g. 16 kHz) while the nominal sampling rate of the bitstream payload is much higher (e.g. 48 kHz, corresponding to an algorithmic codec delay of approx. 20 ms). In such cases, it is favorable to decode the output of the low delay codec directly at the target sampling rate rather than using an additional sampling rate conversion operation after decoding.
This can be approximated by appropriate downscaling of both, the frame size and the sampling rate, by some integer factor (e.g. 2, 3), resulting in the same time/frequency resolution of the codec. For example, the codec output can be generated at 16 kHz sampling rate instead of the nominal 48 kHz by retaining only the lowest third (i.e. 480/3 = 160) of the spectral coefficients prior to the synthesis filterbank and reducing the inverse transform size to one third (i.e. window size 960/3 = 320).
As a consequence, decoding for lower sampling rates reduces both memory and computational requirements, but may not produce exactly the same output as a full-bandwidth decoding, followed by band limiting and sample rate conversion.
Please note that decoding at a lower sampling rate, as described above, does not affect the interpretation of levels, which refers to the nominal sampling rate of the AAC low delay bitstream payload."
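The arithmetic in the quoted passage can be sketched in a few lines of Python. This is an illustration only: the function names `downscale_parameters` and `retain_lowest` are hypothetical and are not taken from ISO/IEC 14496-3. The sketch assumes an integer factor and an MDCT window of twice the frame size, as in the quoted 480/960 example.

```python
def downscale_parameters(nominal_rate_hz, frame_size, factor):
    """Return (output_rate_hz, retained_coefficients, window_size).

    Hypothetical helper: assumes an integer factor that divides both the
    nominal sampling rate and the frame size.
    """
    if nominal_rate_hz % factor or frame_size % factor:
        raise ValueError("factor must divide the sampling rate and the frame size")
    retained = frame_size // factor
    # The AAC-LD window spans two frames (960 for a frame size of 480).
    return nominal_rate_hz // factor, retained, 2 * retained


def retain_lowest(spectrum, factor):
    """Keep only the lowest 1/factor of the spectral coefficients."""
    return spectrum[: len(spectrum) // factor]


rate, coefficients, window = downscale_parameters(48000, 480, 3)
print(rate, coefficients, window)
```

With a nominal rate of 48 kHz, a frame size of 480 and a factor of 3, this yields the 16 kHz, 160 coefficients and window size of 320 named in the quotation.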
Please note that AAC-LD works with a standard MDCT framework and two window shapes, i.e. sine-window and low-overlap-window. Both windows are fully described by formulas and therefore, window coefficients for any transformation lengths can be determined.
Compared to AAC-LD, the AAC-ELD codec shows two major differences:
• The Low Delay MDCT window (LD-MDCT)
• The possibility of utilizing the Low Delay SBR tool
The IMDCT algorithm using the low delay MDCT window is described in 4.6.20.2 in [1], which is very similar to the standard IMDCT version using e.g. the sine window. The coefficients of the low delay MDCT windows (480 and 512 samples frame size) are given in Table 4.A.15 and 4.A.16 in [1]. Please note that the coefficients cannot be determined by a formula, as the coefficients are the result of an optimization algorithm. Fig. 9 shows a plot of the window shape for frame size 512.
In case the low delay SBR (LD-SBR) tool is used in conjunction with the AAC-ELD coder, the filter banks of the LD-SBR module are downscaled as well. This ensures that the SBR module operates with the same frequency resolution and therefore, no more adaptions are required.
Thus, the above description reveals that there is a need for downscaling decoding operations such as, for example, downscaling the decoding in AAC-ELD. It would be feasible to determine the coefficients for the downscaled synthesis window function anew, but this is a cumbersome task, necessitates additional storage for the downscaled version, and renders a conformity check between the non-downscaled decoding and the downscaled decoding more complicated or, from another perspective, does not comply with the manner of downscaling requested in AAC-ELD, for example. Depending on the downscale ratio, i.e. the ratio between the original sampling rate and the downscaled sampling rate, one could derive the downscaled synthesis window function simply by downsampling, i.e. picking out every second, third, ... window coefficient of the original synthesis window function, but this procedure does not result in sufficient conformity between the non-downscaled decoding and the downscaled decoding. Using more sophisticated decimating procedures applied to the synthesis window function leads to unacceptable deviations from the original synthesis window function shape. Therefore, there is a need in the art for an improved downscaled decoding concept.
Accordingly, it is an object of the present invention to provide an audio decoding scheme which allows for such an improved downscaled decoding.
This object is achieved by the subject matter of the independent claims.
The present invention is based on the finding that a downscaled version of an audio decoding procedure may be achieved more effectively and/or with better-maintained compliance if the synthesis window used for downscaled audio decoding is a downsampled version of a reference synthesis window involved in the non-downscaled audio decoding procedure, the downsampling being performed by the downsampling factor by which the downsampled sampling rate and the original sampling rate deviate, and using a segmental interpolation in segments of 1/4 of the frame length.
Advantageous aspects of the present application are the subject of dependent claims. Preferred embodiments of the present application are described below with respect to the figures, among which:
Fig. 1 shows a schematic diagram illustrating perfect reconstruction requirements needed to be obeyed when downscaling decoding in order to preserve perfect reconstruction;
Fig. 2 shows a block diagram of an audio decoder for downscaled decoding according to an embodiment;
Fig. 3 shows a schematic diagram illustrating in the upper half the manner in which an audio signal has been coded at an original sampling rate into a data stream and, in the lower half separated from the upper half by a dashed horizontal line, a downscaled decoding operation for reconstructing the audio signal from the data stream at a reduced or downscaled sampling rate, so as to illustrate the mode of operation of the audio decoder of Fig. 2;
Fig. 4 shows a schematic diagram illustrating the cooperation of the windower and time domain aliasing canceler of Fig. 2;
Fig. 5 illustrates a possible implementation for achieving the reconstruction according to Fig. 4 using a special treatment of the zero-weighted portions of the spectral-to-time modulated time portions;
Fig. 6 shows a schematic diagram illustrating the downsampling to obtain the downsampled synthesis window;
Fig. 7 shows a block diagram illustrating a downscaled operation of AAC-ELD including the low delay SBR tool;
Fig. 8 shows a block diagram of an audio decoder for downscaled decoding according to an embodiment where modulator, windower and canceller are implemented according to a lifting implementation; and
Fig. 9 shows a graph of the window coefficients of a low delay window according to AAC-ELD for 512 sample frame size as an example of a reference synthesis window to be downsampled.
The following description starts with an illustration of an embodiment for downscaled decoding with respect to the AAC-ELD codec. That is, the following description starts with an embodiment which could form a downscaled mode for AAC-ELD. This description concurrently forms a kind of explanation of the motivation underlying the embodiments of the present application. Later on, this description is generalized, thereby leading to a description of an audio decoder and audio decoding method in accordance with an embodiment of the present application.
As described in the introductory portion of the specification of the present application, AAC-ELD uses low delay MDCT windows. In order to generate downscaled versions thereof, i.e. downscaled low delay windows, the subsequently explained proposal for forming a downscaled mode for AAC-ELD uses a segmental spline interpolation algorithm which maintains the perfect reconstruction property (PR) of the LD-MDCT window with very high precision. Therefore, the algorithm allows the generation of window coefficients in the direct form, as described in ISO/IEC 14496-3:2009, as well as in the lifting form, as described in [2], in a compatible way. This means both implementations generate 16-bit-conformant output.
The interpolation of the Low Delay MDCT window is performed as follows.
In general, a spline interpolation is to be used for generating the downscaled window coefficients in order to maintain the frequency response and, for the most part, the perfect reconstruction property (around 170 dB SNR). The interpolation needs to be constrained in certain segments to maintain the perfect reconstruction property. For the window coefficients c covering the DCT kernel of the transformation (see also Fig. 1, c(1024)..c(2048)), the following constraint is required:
1 = |(sgn · c(i) · c(2N – 1 – i) + c(N + i) · c(N – 1 – i))| for i = 0 ... N/2 – 1 (1)
where N denotes the frame size. Some implementations may use different signs to optimize complexity, denoted here by sgn. The requirement in (1) can be illustrated by Fig. 1. It should be recalled that even in the case of F = 2, i.e. halving the sample rate, simply leaving out every second window coefficient of the reference synthesis window to obtain the downscaled synthesis window does not fulfil the requirement.
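Constraint (1) and the effect of naive decimation can be checked numerically. The following Python sketch is an illustration under stated assumptions: it uses a plain sine window as a stand-in, because the optimized LD-MDCT coefficients are tabulated in the standard and not reproduced here, it takes sgn = +1, and it assumes that `c` holds exactly the 2N coefficients indexed by (1). The function names are hypothetical.

```python
import math


def pr_deviation(c, n):
    """Largest deviation from constraint (1) over i = 0 .. n/2 - 1, with sgn = +1."""
    return max(
        abs(1.0 - abs(c[i] * c[2 * n - 1 - i] + c[n + i] * c[n - 1 - i]))
        for i in range(n // 2)
    )


def sine_window(n):
    """Stand-in window with 2n coefficients (not the LD-MDCT window)."""
    return [math.sin(math.pi * (k + 0.5) / (2 * n)) for k in range(2 * n)]


N = 512
reference = sine_window(N)
decimated = reference[::2]       # naive: keep every second coefficient
resampled = sine_window(N // 2)  # formula evaluated on the coarser, centered grid

print(pr_deviation(reference, N))
print(pr_deviation(decimated, N // 2))
print(pr_deviation(resampled, N // 2))
```

For this stand-in, the reference and the resampled window meet (1) up to floating-point rounding, whereas the decimated window deviates by 1 - cos(π/(2N)), on the order of 1e-6 for N = 512. The sine window can simply be re-evaluated from its formula; the LD-MDCT window cannot, which is why an interpolation is needed there.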
The coefficients c(0) ... c(2N – 1) are listed along the diamond shape. The N/4 zeros in the window coefficients, which are responsible for the delay reduction of the filter bank, are marked using a bold arrow. Fig. 1 shows the dependencies of the coefficients caused by the folding involved in the MDCT and also the points where the interpolation needs to be constrained in order to avoid any undesired dependencies.
• Every N/2 coefficients, the interpolation needs to stop to maintain (1).
• Additionally, the interpolation algorithm needs to stop every N/4 coefficients due to the inserted zeros. This ensures that the zeros are maintained and that the interpolation error is not spread, which maintains the PR.
The second constraint is required not only for the segment containing the zeros but also for the other segments. Knowing that some coefficients in the DCT kernel were not determined by the optimization algorithm but by formula (1) to enable PR, several discontinuities in the window shape can be explained, e.g. around c(1536+128) in Fig. 1. In order to minimize the PR error, the interpolation needs to stop at such points, which appear on an N/4 grid.
For that reason, the segment size of N/4 is chosen for the segmental spline interpolation to generate the downscaled window coefficients. The source window coefficients are always given by the coefficients used for N = 512, also for downscaling operations resulting in frame sizes of N = 240 or N = 120. The basic algorithm is outlined very briefly in the following as MATLAB code:
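The MATLAB listing is not reproduced in this text. As a stand-in, the following Python sketch illustrates the segmental idea described above: each segment of the source window is interpolated on its own, so that no interpolation error crosses a segment boundary. Several points are assumptions rather than part of the source: the names `second_derivatives`, `evaluate` and `downscale_window`, the use of a natural cubic spline (MATLAB's `spline` uses different end conditions, so values near segment edges will differ slightly), and the sample positions, which are chosen so that the coarse grid stays centered within each segment. The toy window is not the LD-MDCT window.

```python
import math


def second_derivatives(y):
    """Second derivatives of a natural cubic spline through y at knots 0..len(y)-1."""
    n = len(y)
    m = [0.0] * n                      # natural ends: m[0] = m[n-1] = 0
    cp, dp = [0.0] * n, [0.0] * n
    for j in range(1, n - 1):          # forward sweep of the tridiagonal solve
        rhs = 6.0 * (y[j - 1] - 2.0 * y[j] + y[j + 1])
        denom = 4.0 - cp[j - 1]
        cp[j] = 1.0 / denom
        dp[j] = (rhs - dp[j - 1]) / denom
    for j in range(n - 2, 0, -1):      # back substitution
        m[j] = dp[j] - cp[j] * m[j + 1]
    return m


def evaluate(y, m, x):
    """Evaluate the spline at position x, 0 <= x <= len(y) - 1."""
    j = min(int(x), len(y) - 2)
    t = x - j
    return ((m[j] * (1 - t) ** 3 + m[j + 1] * t ** 3) / 6.0
            + (y[j] - m[j] / 6.0) * (1 - t)
            + (y[j + 1] - m[j + 1] / 6.0) * t)


def downscale_window(w, seg, seg_out):
    """Interpolate each block of `seg` coefficients down to `seg_out` coefficients."""
    if len(w) % seg or not 0 < seg_out <= seg:
        raise ValueError("window must split into whole segments, with seg_out <= seg")
    step = seg / seg_out
    xs = [(k + 0.5) * step - 0.5 for k in range(seg_out)]  # centered coarse grid
    out = []
    for start in range(0, len(w), seg):
        y = w[start:start + seg]       # the spline never sees neighbouring segments
        m = second_derivatives(y)
        out.extend(evaluate(y, m, x) for x in xs)
    return out


# Toy input: one all-zero segment followed by one smooth segment.
toy = [0.0] * 128 + [math.sin(math.pi * (n + 0.5) / 256) for n in range(128)]
half = downscale_window(toy, 128, 64)
print(len(half), max(abs(v) for v in half[:64]))
```

Because the zero segment is interpolated in isolation, its downscaled coefficients remain exactly zero. The same function accepts non-integer ratios, for example 128 source coefficients to 60 per segment, which corresponds to going from a frame size of 512 to 240.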
As the spline function may not be fully deterministic, the complete algorithm is exactly specified in the following section, which may be included into ISO/IEC 14496-3:2009, in order to form an improved downscaled mode in AAC-ELD.
In other words, the following section provides a proposal as to how the above-outlined idea could be applied to ER AAC ELD, i.e. as to how a low-complexity decoder could decode an ER AAC ELD bitstream, coded at a first data rate, at a second data rate lower than the first data rate. It is emphasized, however, that the definition of N as used in the following adheres to the standard. Here, N corresponds to the length of the DCT kernel, whereas hereinabove, in the claims, and in the subsequently described generalized embodiments, N corresponds to the frame length, namely the mutual overlap length of the DCT kernels, i.e. half of the DCT kernel length. Accordingly, while N was indicated to be 512 hereinabove, for example, it is indicated to be 1024 in the following.
The following paragraphs are proposed for inclusion in ISO/IEC 14496-3:2009 via amendment.
A.0 Adaptation to systems using lower sampling rates
For certain applications, ER AAC LD can change the playout sample rate in order to avoid additional resampling steps (see 4.6.17.2.7). ER AAC ELD can apply similar downscaling steps using the Low Delay MDCT window and the LD-SBR tool. In case AAC-ELD operates with the LD-SBR tool, the downscaling factor is limited to multiples of 2. Without LD-SBR, the downscaled frame size needs to be an integer number.
A.1 Downscaling of Low Delay MDCT window
The LD-MDCT window WLD for N=1024 is downscaled by a factor F using a segmental spline interpolation. The number of leading zeros in the window coefficients, i.e. N/8, determines the segment size. The downscaled window coefficients WLD_d are used for the inverse MDCT as described in 4.6.20.2 but with a downscaled window length Nd = N / F. Please note that the algorithm is also able to generate downscaled lifting coefficients of the LD-MDCT.
A.2 Downscaling of Low Delay SBR tool
In case the Low Delay SBR tool is used in conjunction with ELD, this tool can be downscaled to lower sample rates, at least for downscaling factors of a multiple of 2. The downscale factor F controls the number of bands used for the CLDFB analysis and synthesis filter bank. The following two paragraphs describe a downscaled CLDFB analysis and synthesis filter bank, see also 4.6.19.4.
4.6.20.5.2.1 Downscaled analysis CLDFB filter bank
• Define number of downscaled CLDFB bands B = 32/F.
• Shift the samples in the array x by B positions. The oldest B samples are discarded and B new samples are stored in positions 0 to B – 1.
• Multiply the samples of array x by the coefficient of window ci to get array z. The window coefficients ci are obtained by linear interpolation of the coefficients c, i.e. through the equation
The window coefficients of c can be found in Table 4.A.90.
• Sum the samples to create the 2B-element array u:
u(n) = z(n) + z(n + 2B) + z(n + 4B) + z(n + 6B) + z(n + 8B), 0 ≤ n < 2B.
• Calculate B new subband samples by the matrix operation Mu, where
In the equation, exp( ) denotes the complex exponential function and j is the imaginary unit.
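The buffer handling in the first steps of the analysis filter bank (shift, window multiplication and folding into u) can be sketched as follows. This Python fragment is a sketch only. The interpolated window `ci` is passed in, because its defining equation is not reproduced in this text, and the modulation by the matrix M is omitted for the same reason. The buffer length of 10B is inferred from the index range of u(n), and the order of the new samples within positions 0 to B – 1 is left to the caller. The function name is hypothetical.

```python
def analysis_fold(x, new_samples, ci, bands):
    """Shift the state x by B, apply the window ci and fold into the 2B-element array u.

    x and ci hold 10*B values each; new_samples holds B values.
    Returns the updated state and u.
    """
    b = bands
    if len(x) != 10 * b or len(ci) != 10 * b or len(new_samples) != b:
        raise ValueError("expected 10*B state and window values and B new samples")
    # Shift by B positions: the oldest B samples (at the end) are discarded,
    # the new samples are stored in positions 0 .. B-1.
    x = list(new_samples) + x[:-b]
    # Window multiplication: z(n) = x(n) * ci(n)
    z = [xv * cv for xv, cv in zip(x, ci)]
    # u(n) = z(n) + z(n + 2B) + z(n + 4B) + z(n + 6B) + z(n + 8B), 0 <= n < 2B
    u = [sum(z[n + 2 * b * j] for j in range(5)) for n in range(2 * b)]
    return x, u


F = 2
B = 32 // F
state, u = analysis_fold([0.0] * (10 * B), [1.0] * B, [1.0] * (10 * B), B)
print(len(state), len(u))
```

With an all-ones placeholder window and an initially empty state, only the first B entries of u are non-zero after one call, which makes the shift and the folding easy to inspect.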
4.6.20.5.2.2 Downscaled synthesis CLDFB filter bank
• Define number of downscaled CLDFB bands B = 64/F.
• Shift the samples in the array v by 2B positions. The oldest 2B samples are discarded.
• The B new complex-valued subband samples are multiplied by the matrix N, where
In the equation, exp( ) denotes the complex exponential function and j is the imaginary unit. The real part of the output from this operation is stored in the positions 0 to 2B - 1 of array v.
• Extract samples from v to create the 10B-element array g.
• Multiply the samples of array g by the coefficient of window ci to produce array w. The window coefficients ci are obtained by linear interpolation of the coefficients c, i.e. through the equation
The window coefficients of c can be found in Table 4.A.90.
• Calculate B new output samples by summation of samples from array w according to
Please note that setting F = 2 provides the downsampled synthesis filter bank according to 4.6.19.4.3. Therefore, to process a downsampled LD-SBR bit stream with an additional downscale factor F, F needs to be multiplied by 2.
4.6.20.5.2.3 Downscaled real-valued CLDFB filter bank
The downscaling of the CLDFB can be applied for the real valued versions of the low power SBR mode as well. For illustration, please also consider 4.6.19.5.
For the downscaled real-valued analysis and synthesis filter bank, follow the description in 4.6.20.5.2.1 and 4.6.20.5.2.2 and replace the exp() modulator in M by a cos() modulator.
A.3 Low Delay MDCT Analysis
This subclause describes the Low Delay MDCT filter bank utilized in the AAC ELD encoder. The core MDCT algorithm is mostly unchanged, but with a longer window, such that n is now running from -N to N-1 (rather than from 0 to N-1).
The spectral coefficients, Xi,k, are defined as follows:
where:
zi,n = windowed input sequence
n = sample index
k = spectral coefficient index
i = block index
N = window length
n0 = (-N / 2 + 1) / 2
The window length N (based on the sine window) is 1024 or 960.
The window length of the low-delay window is 2*N. The windowing is extended to the past in the following way:
zi,n = wLD (N – 1 – n) · x'i,n
for n = -N, ..., N-1, with the synthesis window w used as the analysis window by inverting the order.
A.4 Low Delay MDCT Synthesis
The synthesis filter bank is modified compared to the standard IMDCT algorithm using a sine window in order to adopt a low-delay filter bank. The core IMDCT algorithm is mostly unchanged, but with a longer window, such that n is now running up to 2N-1 (rather than up to N-1).
where:
n = sample index
i = window index
k = spectral coefficient index
N = window length / twice the frame length
n0 = (-N / 2 + 1) / 2
with N = 960 or 1024.
The windowing and overlap-add is conducted in the following way:
The length N window is replaced by a length 2N window with more overlap in the past, and less overlap to the future (N/8 values are actually zero).
Windowing for the Low Delay Window:
zi,n = wLD (n) · x'i,n
where the window now has a length of 2N, hence n = 0, ..., 2N-1.
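The two windowing rules, zi,n = wLD(N – 1 – n) · x'i,n on the analysis side and zi,n = wLD(n) · xi,n on the synthesis side, differ only in the direction in which the window is read. A minimal Python sketch follows. The function names are hypothetical, and each block is assumed to be stored as a list of 2N samples in increasing order of n, so that list position m corresponds to n = m – N for analysis and to n = m for synthesis.

```python
def analysis_windowing(w_ld, x_block):
    """z(n) = w_ld(N - 1 - n) * x(n) for n = -N .. N-1: the window is read in reverse."""
    length = len(w_ld)  # 2N
    if len(x_block) != length:
        raise ValueError("block and window must both hold 2N values")
    # List position m holds n = m - N, so the window index N - 1 - n equals 2N - 1 - m.
    return [w_ld[length - 1 - m] * x_block[m] for m in range(length)]


def synthesis_windowing(w_ld, x_block):
    """z(n) = w_ld(n) * x(n) for n = 0 .. 2N-1: the window is read in its stored order."""
    if len(x_block) != len(w_ld):
        raise ValueError("block and window must both hold 2N values")
    return [w * x for w, x in zip(w_ld, x_block)]


toy_window = [0.0, 1.0, 2.0, 3.0]  # placeholder values, not LD-MDCT coefficients
print(analysis_windowing(toy_window, [1.0] * 4))
print(synthesis_windowing(toy_window, [1.0] * 4))
```

Since the leading window coefficients are zero, the synthesis side zeroes the start of each block, while the analysis side, reading the window in reverse, zeroes the end.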
Overlap and add:
for 0<=n
Documents
Application Documents
# | Name | Date
1 | 202138020939-FORM 1 [08-05-2021(online)].pdf | 2021-05-08
2 | 202138020939-FIGURE OF ABSTRACT [08-05-2021(online)].pdf | 2021-05-08
3 | 202138020939-DRAWINGS [08-05-2021(online)].pdf | 2021-05-08
4 | 202138020939-DECLARATION OF INVENTORSHIP (FORM 5) [08-05-2021(online)].pdf