Abstract: The subject of the invention is an apparatus (2), described by means of a schematic block diagram, for processing an audio signal (4) to obtain a processed audio signal (6). The apparatus (2) comprises a phase calculator (8) for calculating phase values (10) for spectral values of a sequence of frequency-domain frames (12) representing overlapping frames of the audio signal (4). Moreover, the phase calculator (8) is configured to calculate the phase values (10) based on information on a target time-domain envelope (14) related to the processed audio signal (6), so that the processed audio signal (6) has, at least approximately, the target time-domain envelope (14) and a spectral envelope determined by the sequence of frequency-domain frames (12).
Apparatus and Method for Processing an Audio Signal to Obtain a Processed Audio Signal using a target time-domain envelope
Specification
The present invention relates to an apparatus and a method for processing an audio signal to obtain a processed audio signal. Embodiments further show an audio decoder comprising the apparatus and a corresponding audio encoder, as well as an audio source separation processor and a bandwidth enhancement processor, both comprising the apparatus. According to further embodiments, transient restoration in signal reconstruction and transient restoration in score-informed audio decomposition are shown.
The task of separating a mixture of superimposed sound sources into its constituent components has gained importance in digital audio signal processing. In speech processing, these components are usually the utterances of target speakers interfered with by noise or by simultaneously speaking persons. In music, these components can be individual instrumental or vocal melodies, percussive instruments, or even individual note events. Relevant topics are signal reconstruction, transient preservation, and score-informed audio decomposition (i.e. source separation).
Music source separation aims at decomposing a polyphonic, multitimbral music recording into component signals such as singing voice, instrumental melodies, percussive instruments, or individual note events occurring in a mixture signal. Besides being an important step in many music analysis and retrieval tasks, music source separation is also a fundamental prerequisite for applications such as music restoration, upmixing, and remixing. For these purposes, high fidelity in terms of perceptual quality of the separated components is desirable. The majority of existing separation techniques work on a time-frequency (TF) representation of the mixture signal, often the Short-Time Fourier Transform (STFT). The target component signals are usually reconstructed using a suitable inverse transform, which in turn can introduce audible artifacts such as musical noise, phase interference, smeared transients, or pre-echos. These artifacts are often quite disturbing for the human listener.
There are a number of recent papers on music source separation. In most approaches, the separation is carried out in the time-frequency (TF) domain by modifying the magnitude spectrogram. The corresponding time-domain signals of the separated components are derived by using the original phase information and applying suitable inverse transforms. When striving for good perceptual quality of the separated solo signals, many authors resort to score-informed decomposition techniques. This has the advantage that the separation can be guided by information on the approximate location of component signals in time (onset, offset) and frequency (pitch, timbre). Fewer publications deal with source separation of transient signals such as drums. Others have focused on the separation of harmonic vs. percussive components [5].
Moreover, the problem of pre-echos has been addressed in the field of perceptual audio coding, where pre-echos are typically caused by the use of relatively long analysis and synthesis windows in conjunction with intermediate manipulation of TF bins such as quantization of spectral magnitudes according to a psycho-acoustic model. It can be considered state-of-the-art to use block-switching in the vicinity of transient events [6]. An interesting approach was proposed in [13] where spectral coefficients are encoded by linear prediction along the frequency axis, automatically reducing pre-echos. Later works proposed to decompose the signal into transient and residual components and use optimized coding parameters for each stream [3]. Transient preservation has also been investigated in the context of time-scale modification methods based on the phase-vocoder. In addition to optimized treatment of the transient components, several authors follow the principle of phase-locking or re-initialization of phase in transient frames [8].
The problem of signal reconstruction, also known as magnitude spectrogram inversion or phase estimation, is a well-researched topic. In their classic paper [1], Griffin and Lim proposed the so-called LSEE-MSTFTM algorithm for iterative, blind signal reconstruction from modified STFT magnitude (MSTFTM) spectrograms. In [2], Le Roux et al. developed a different view on this method by describing it using a TF consistency criterion. By keeping the necessary operations entirely in the TF domain, several simplifications and approximations could be introduced that lower the computational load compared to the original procedure. Since the phase estimates obtained using LSEE-MSTFTM can only converge to local optima, several publications were concerned with finding a good initial estimate for the phase information [3, 4]. Sturmel and Daudet [5] provided an in-depth review of signal reconstruction methods and pointed out unsolved problems. An extension of LSEE-MSTFTM with respect to convergence speed was proposed in [6]. Other authors
tried to formulate the phase estimation problem as a convex optimization scheme and arrived at promising results, hampered by high computational complexity [7]. Another work [8] was concerned with applying the spectrogram consistency framework to signal reconstruction from wavelet-based magnitude spectrograms.
However, the described approaches for signal reconstruction share the problem that rapid changes of the audio signal, which are typical, for example, for transients, may suffer from the artifacts described above, such as pre-echos.
Therefore, there is a need for an improved approach.
It is an object of the present invention to provide an improved concept for processing an audio signal. This object is solved by the subject matter of the independent claims.
The present invention is based on the finding that a target time-domain amplitude envelope can be applied to the spectral values of the sequence of frequency-domain frames either in the time domain or in the frequency domain. In other words, the phase of a signal may be corrected after signal processing using time-frequency and frequency-time conversion, while the amplitude or magnitude of this signal is maintained (kept unchanged). The phase may be restored using, for example, an iterative algorithm such as the one proposed by Griffin and Lim. However, using the target time-domain envelope significantly improves the quality of the phase restoration, which results in a reduced number of iterations when the iterative algorithm is used. The target time-domain envelope may be calculated or approximated.
Embodiments show an apparatus for processing an audio signal to obtain a processed audio signal. The apparatus may comprise a phase calculator for calculating phase values for spectral values of a sequence of frequency-domain frames representing overlapping frames of the audio signal. The phase calculator may be configured to calculate the phase values based on information on a target time-domain envelope related to the processed audio signal, so that the processed audio signal has, at least approximately, the target time-domain envelope and a spectral envelope determined by the sequence of frequency-domain frames. The information on the target time-domain amplitude envelope may be applied to the sequence of frequency-domain frames in the time domain or in the frequency domain.
To overcome the aforementioned limitations of the known approaches, embodiments show a technique, method or an apparatus for better preserving transient components in reconstructed source signals. In particular, an objective may be to attenuate pre-echos that deteriorate onset clarity of note events from drums and percussion as well as piano and guitar.
Embodiments further show an extension or improvement of the signal reconstruction procedure by Griffin and Lim [1] which, e.g., better preserves transient signal components. The original method iteratively estimates the phase information necessary for time-domain reconstruction from an STFT magnitude (STFTM) by going back and forth between the STFT and the time-domain signal, updating only the phase information while keeping the STFTM fixed. The proposed extension manipulates the intermediate time-domain reconstructions in order to attenuate the pre-echos that potentially precede the transients.
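The baseline iteration described above can be sketched as follows. This is a minimal NumPy illustration of the Griffin-Lim procedure under stated assumptions, not a reference implementation: the helper names `stft`, `istft`, and `griffin_lim`, the Hann window, and the hop size are all chosen for the example.

```python
import numpy as np

def stft(x, win, hop):
    """Frame-wise real FFT of x using analysis window `win` and hop size `hop`."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n]) for s in starts])

def istft(X, win, hop, length):
    """Least-squares (weighted overlap-add) inverse, as used by Griffin and Lim."""
    n = len(win)
    y = np.zeros(length)
    norm = np.zeros(length)
    for m, frame in enumerate(X):
        s = m * hop
        y[s:s + n] += win * np.fft.irfft(frame, n)
        norm[s:s + n] += win ** 2
    return y / np.maximum(norm, 1e-12)  # guard samples where the window sum is zero

def griffin_lim(mag, win, hop, length, n_iter=50, phase0=None):
    """Estimate phases for a fixed STFT magnitude `mag` (frames x bins)."""
    phase = np.zeros_like(mag) if phase0 is None else phase0
    for _ in range(n_iter):
        x = istft(mag * np.exp(1j * phase), win, hop, length)  # intermediate time-domain reconstruction
        phase = np.angle(stft(x, win, hop))                     # keep the new phase, discard its magnitude
    return istft(mag * np.exp(1j * phase), win, hop, length)
```

Only `phase` changes between iterations; `mag` stays fixed throughout, which is the constraint described in the paragraph above.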
According to a first embodiment, the information on the target time-domain envelope is applied to the sequence of frequency-domain frames in the time domain. To this end, a modified Short-Time Fourier Transform (MSTFT) may be derived from the sequence of frequency-domain frames, and an inverse Short-Time Fourier Transform (ISTFT) may be performed on the basis of this MSTFT. Since the ISTFT performs an overlap-and-add procedure, the magnitude values and phase values of the initial MSTFT are changed (updated, adapted or adjusted). This leads to an intermediate time-domain reconstruction of the audio signal. Moreover, the target time-domain envelope may be applied to the intermediate time-domain reconstruction. This can, e.g., be performed by multiplying the time-domain signal by the envelope (amplitude modulation) or, equivalently, by convolving its spectrum with the spectrum of the envelope. The intermediate time-domain reconstruction of the audio signal having (an approximation of) the target time-domain envelope may then be time-frequency converted using a Short-Time Fourier Transform (STFT). For this purpose, overlapping analysis and/or synthesis windows may be used.
Even if the target time-domain envelope were not applied, the STFT of the intermediate time-domain reconstruction of the audio signal would differ from the preceding MSTFT due to the overlap-and-add procedure in the ISTFT and the STFT. This may be exploited in an iterative algorithm wherein, for an updated MSTFT, the phase values of the previous STFT operation are used and the corresponding amplitude or magnitude values are discarded. Instead, the initial magnitude values are used as the amplitude or magnitude values of the updated MSTFT, since it is assumed that the amplitude (or magnitude) is already (perfectly) reconstructed and only the phase information is wrong. Thus, in each iteration step, the phase values are adapted towards the correct (or original) phase values.
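A minimal sketch of the time-domain variant follows, assuming the same simple NumPy STFT/ISTFT helpers as in a plain Griffin-Lim implementation. The function name `envelope_griffin_lim` and the step-shaped test envelope are hypothetical choices for illustration. The only change compared to plain Griffin-Lim is the multiplication of each intermediate time-domain reconstruction by the target envelope before the next STFT.

```python
import numpy as np

def stft(x, win, hop):
    n = len(win)
    return np.array([np.fft.rfft(win * x[s:s + n])
                     for s in range(0, len(x) - n + 1, hop)])

def istft(X, win, hop, length):
    n = len(win)
    y, norm = np.zeros(length), np.zeros(length)
    for m, frame in enumerate(X):
        y[m * hop:m * hop + n] += win * np.fft.irfft(frame, n)
        norm[m * hop:m * hop + n] += win ** 2
    return y / np.maximum(norm, 1e-12)

def envelope_griffin_lim(mag, env, win, hop, n_iter=50, phase0=None):
    """Phase estimation with a target time-domain envelope `env` (hypothetical helper).

    Each iteration: ISTFT with the fixed magnitude -> multiply by `env` -> STFT,
    keeping only the phase for the next iteration.
    """
    phase = np.zeros_like(mag) if phase0 is None else phase0
    for _ in range(n_iter):
        x = istft(mag * np.exp(1j * phase), win, hop, len(env))
        x = env * x                          # amplitude modulation with the target envelope
        phase = np.angle(stft(x, win, hop))  # the magnitude of this STFT is discarded
    return istft(mag * np.exp(1j * phase), win, hop, len(env))
```

Passing `env = np.ones(len(x))` reduces the sketch to the plain iteration; an envelope that is zero before a known transient is one way to suppress pre-echo energy in front of it.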
According to a second embodiment, the target time-domain envelope may be applied to the sequence of frequency-domain frames in the frequency domain. To this end, the steps performed in the time domain in the first embodiment may be transferred (transformed or converted) to the frequency domain. In detail, this may involve a time-frequency transform of the synthesis window of the ISTFT and of the analysis window of the STFT. This yields a frequency representation of those portions of neighboring frames that would overlap the current frame if the ISTFT and the STFT were carried out in the time domain. These portions are shifted to the correct position within the current frame and added to derive an intermediate frequency-domain representation of the audio signal. Moreover, the target time-domain envelope may be transformed to the frequency domain, for example using an STFT, such that the frequency representation of the target time-domain envelope may be applied to the intermediate frequency-domain representation. Again, this procedure may be performed iteratively using the updated phase of the intermediate frequency-domain representation having (approximately) the target time-domain envelope. Furthermore, the initial magnitude of the MSTFT is used, since it is assumed that the magnitude is already perfectly reconstructed.
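The core identity behind the frequency-domain variant is that multiplying a frame by an envelope in time corresponds to a circular convolution of the two spectra. The sketch below checks this for a single frame only, assuming NumPy's unnormalized forward FFT; the helper names are hypothetical, and the emulation of overlap-add across neighboring frames described above is deliberately not shown.

```python
import numpy as np

def circular_convolve(a, b):
    """Circular convolution of two equal-length (complex) sequences."""
    n = len(a)
    k = np.arange(n)
    return np.array([np.sum(a * b[(m - k) % n]) for m in range(n)])

def modulate_in_frequency(X, env):
    """Apply the time-domain envelope `env` to one frame given by its full DFT `X`.

    DFT(x * e)[m] = (1/N) * sum_k X[k] * E[(m - k) mod N]
    """
    return circular_convolve(X, np.fft.fft(env)) / len(X)
```

Since the target envelope is smooth, its spectrum is concentrated at low frequencies, so in practice the convolution kernel can be truncated to a few bins; that truncation is an additional design choice not shown here.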
Building on the apparatus described above, further embodiments differ in how the target time-domain envelope is derived. Embodiments show an audio decoder comprising the aforementioned apparatus. The audio decoder may receive the audio signal from an (associated) audio encoder. The audio encoder may analyze the audio signal to derive a target time-domain envelope, for example for each time frame of the audio signal. The derived target time-domain envelope may be compared to a predetermined list of exemplary target time-domain envelopes. The predetermined target time-domain envelope which is closest to the calculated target time-domain envelope of the audio signal may be associated with a certain sequence of bits, for example a sequence of four bits to address 16 different target time-domain envelopes. The audio decoder may comprise the same predetermined target time-domain envelopes, for example as a codebook or a lookup table, and is able to determine (read, compute or calculate) the (encoded) predetermined target time-domain envelope from the sequence of bits transmitted by the encoder.
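The encoder/decoder exchange can be sketched as a nearest-neighbour codebook lookup. This is an illustration under assumptions: the codebook contents (16 step envelopes with equally spaced onset positions), the squared-error criterion, and the helper names are hypothetical, since the text does not specify them.

```python
import numpy as np

def build_codebook(frame_len, n_bits=4):
    """Hypothetical codebook: 2**n_bits step envelopes with equally spaced onsets."""
    onsets = np.arange(2 ** n_bits) * frame_len // (2 ** n_bits)
    t = np.arange(frame_len)
    return np.array([(t >= o).astype(float) for o in onsets])

def encode_envelope(env, codebook):
    """Encoder side: index of the closest codebook entry (squared error)."""
    return int(np.argmin(np.sum((codebook - env) ** 2, axis=1)))

def decode_envelope(index, codebook):
    """Decoder side: table lookup with the transmitted index."""
    return codebook[index]
```

With `n_bits=4`, the transmitted index fits in four bits, and both sides only need to share `build_codebook`'s output.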
According to further embodiments, the above-mentioned apparatus may be part of an audio source separation processor. An audio source separation processor may use a rough approximation of the target time-domain envelope, since the original signal of an individual source within the audio signal is usually not available. Therefore, especially for transient restoration, the part of a current frame up to an initial transient position may be forced to zero. This may effectively reduce the pre-echos in front of a transient that are usually introduced by the signal processing algorithm. Furthermore, a common onset may be used as an approximation of the target time-domain envelope, e.g. the same onset for each frame. According to a further embodiment, different onsets may be used for different components of the audio signal, e.g. derived from a predetermined list of onsets. For example, the target time-domain envelope or onset of a piano may differ from that of a guitar, a hi-hat, or speech. Therefore, the current source or component of the audio signal may be analyzed, e.g. to detect the kind of audio information (instrument, speech, etc.), in order to determine the (theoretically) best-fitting approximation of the target time-domain envelope. According to further embodiments, the kind of audio information may be preset (by a user) if the audio source separation is, e.g., intended to separate one or more instruments (e.g. guitar, hi-hat, flute, or piano) or speech from the remaining part of the audio signal. Based on the preset, a corresponding onset for the separated or isolated audio track may be chosen.
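A rough target envelope of the kind described above can be sketched as follows, assuming the onset position is already known (for example from the score in score-informed separation). The helper name and the optional linear fade-in `ramp` are assumptions added for illustration; the text itself only requires forcing the signal to zero before the transient.

```python
import numpy as np

def transient_envelope(length, onset, ramp=0):
    """Hypothetical approximate envelope: zero before `onset`, optional linear fade-in, then one."""
    env = np.ones(length)
    env[:onset] = 0.0
    if ramp > 0:
        stop = min(onset + ramp, length)
        env[onset:stop] = np.arange(1, stop - onset + 1) / (ramp + 1)
    return env
```

Multiplying an intermediate reconstruction `x` by `transient_envelope(len(x), onset)` removes everything before the onset, i.e. any pre-echo in that region.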
According to further embodiments, a bandwidth enhancement processor may use the aforementioned apparatus. The bandwidth enhancement processor uses a core coder to code a high-resolution representation of one or more bands of the audio signal. Moreover, bands which are not coded using the core coder may be approximated in a bandwidth enhancement decoder using parameters of the bandwidth enhancement encoder. The target time-domain envelope may be transmitted by the encoder, e.g. as a parameter. However, according to a preferred embodiment, the target time-domain envelope is not transmitted (as a parameter) by the encoder. Instead, the target time-domain envelope may be derived directly from the core-decoded part or frequency band(s) of the audio signal. The shape or envelope of the core-decoded part of the audio signal is a good approximation of the target time-domain envelope of the original audio signal. However, high-frequency components may be missing in the core-decoded part of the audio signal,
leading to a target time-domain envelope which may be less accentuated than the original envelope. For example, the target time-domain envelope may be similar to the envelope of a low-pass filtered version of the audio signal or of a part of the audio signal. Nevertheless, the approximation of the target time-domain envelope from the core-decoded audio signal may be (on average) more precise than, for example, using a codebook through which information on the target time-domain envelope is transmitted from a bandwidth enhancement encoder to the bandwidth enhancement decoder.
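One way to derive such an envelope from the core-decoded signal is a normalized moving RMS. This is a sketch under assumptions: the estimator, the window length `win_len`, and the helper name are hypothetical; the text only states that the core-decoded signal's envelope serves as the approximation.

```python
import numpy as np

def envelope_from_core(core, win_len=64):
    """Hypothetical estimate: normalized moving RMS of the core-decoded signal."""
    kernel = np.ones(win_len) / win_len
    env = np.sqrt(np.convolve(core ** 2, kernel, mode="same"))  # same length as `core`
    return env / max(env.max(), 1e-12)
```

Because the core signal lacks the high bands, this estimate tends to be smoother than the true envelope, which matches the "less accentuated" behaviour noted above.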
According to further embodiments, an effective extension of the iterative signal reconstruction algorithm proposed by Griffin and Lim is shown. The extension introduces an intermediate step within the iterative reconstruction using a modified Short-Time Fourier Transform. The intermediate step may enforce a desired or predetermined shape of the signal which is to be reconstructed. To this end, a predetermined envelope may be applied to the reconstructed (time-domain) signal, for example using amplitude modulation, within each iteration step. Alternatively, the envelope may be applied to the reconstructed signal by convolving the STFT with the time-frequency representation of the envelope. The second approach may be advantageous or more effective, since the inverse STFT and the STFT may be emulated in the time-frequency domain and therefore do not need to be performed explicitly. Moreover, further simplifications, such as, for example, a sequence-selective processing, may be realized. Furthermore, initializing the phases (of the first MSTFT step) with meaningful values is advantageous, since faster convergence is achieved.
Before embodiments are described in detail using the accompanying figures, it is to be pointed out that the same or functionally equal elements are given the same reference numbers in the figures and that a repeated description of elements provided with the same reference numbers is omitted. Hence, descriptions provided for elements having the same reference numbers are mutually exchangeable.
Embodiments of the present invention will be discussed subsequently referring to their enclosed drawings, wherein:
Fig. 1 shows a schematic block diagram of an apparatus for processing an audio signal to obtain a processed audio signal;
Fig. 2 shows a schematic block diagram of the apparatus according to a further embodiment using time-frequency-domain or frequency-domain processing;
Fig. 3 shows the apparatus according to a further embodiment in a schematic block diagram using time-frequency-domain processing;
Fig. 4 shows a schematic block diagram of the apparatus according to an embodiment using frequency-domain processing;
Fig. 5 shows a schematic block diagram of the apparatus according to a further embodiment using time-frequency-domain processing;
show a schematic plot of the transient restoration according to an embodiment;
Fig. 7 shows a schematic block diagram of the apparatus according to a further embodiment using frequency-domain processing;
shows a schematic time-domain diagram illustrating one segment of an audio signal;
illustrate schematic diagrams of different hi-hat component signals separated from an example drum loop;
show a schematic illustration of a percussive signal mixture containing three instruments as sources for source-separation of drum loops;
shows an evolution of the normalized inconsistency measure vs. the number of iterations;
shows the evolution of the pre-echo energy vs. the number of iterations;
shows a schematic diagram of an evolution of the normalized inconsistency measure vs. the number of iterations;
shows the evolution of the pre-echo energy vs. the number of iterations;
Fig. 13 shows a schematic diagram of a typical NMF decomposition result, illustrating that the extracted templates (three leftmost plots) indeed resemble prototype versions of the onset events in V (lower right plot);
Fig. 14a shows a schematic diagram of an evolution of the normalized inconsistency measure vs. the number of iterations;
Fig. 14b shows a schematic diagram of an evolution of the pre-echo energy vs. the number of iterations;
Fig. 15 shows an audio encoder for encoding an audio signal according to an embodiment;
Fig. 16 shows an audio decoder comprising the apparatus and an input interface;
Fig. 17 shows an audio signal comprising a representation of a sequence of frequency-domain frames and a representation of a target time-domain envelope;
Fig. 18 shows a schematic block diagram of an audio source separation processor according to an embodiment;
Fig. 19 shows a schematic block diagram of a bandwidth enhancement processor according to an embodiment;
Fig. 20 shows a schematic frequency-domain diagram illustrating bandwidth enhancement;
Fig. 21 shows a schematic representation of the (intermediate) time-domain reconstruction;
Fig. 22 shows a schematic block diagram of a method for processing an audio signal to obtain a processed audio signal;
Fig. 23 shows a schematic block diagram of a method of audio decoding;
Fig. 24 shows a schematic block diagram of a method of audio source separation;
Fig. 25 shows a schematic block diagram of a method of bandwidth enhancement of an encoded audio signal;
Fig. 26 shows a schematic block diagram of a method of audio encoding.
In the following, embodiments of the invention will be described in further detail. Elements shown in the respective figures having the same or a similar functionality will have associated therewith the same reference signs.
Fig. 1 shows a schematic block diagram of an apparatus 2 for processing an audio signal 4 to obtain a processed audio signal 6. The apparatus 2 comprises a phase calculator 8 for calculating phase values 10 for spectral values of a sequence of frequency-domain frames 12 representing overlapping frames of the audio signal 4. Moreover, the phase calculator 8 is configured to calculate the phase values 10 based on information on a target time-domain envelope 14 related to the processed audio signal 6, so that the processed audio signal 6 has at least in an approximation the target time-domain amplitude envelope 14 and a spectral envelope determined by the sequence of frequency-domain frames 12. Therefore, the phase calculator 8 may be configured to receive the information on the target time-domain envelope or to extract the information on the target time-domain envelope from (a representation of) the target time-domain envelope.
The spectral values of the sequence of frequency-domain frames 12 may be calculated using a Short-Time Fourier Transform (STFT) of the audio signal 4. The STFT may use analysis windows with an overlap of, for example, 50%, 67%, 75%, or even more. In other words, the STFT may use a hop size of, for example, one half, one third, or one fourth of the length of the analysis window.
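The relation between overlap and hop size can be checked numerically. The sketch below is an illustration, not part of the described apparatus: it converts an overlap fraction to a hop size (treating "67%" as exactly 2/3) and verifies the well-known property that periodic Hann windows shifted by one third or one fourth of their length sum to a constant (1.5 and 2.0, respectively). The helper names are hypothetical.

```python
import numpy as np

def hop_size(window_len, overlap):
    """Hop size for a given overlap fraction, e.g. 0.75 -> one fourth of the window."""
    return int(round(window_len * (1.0 - overlap)))

def periodic_hann(n):
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)

def overlap_sum(win, hop, n_frames):
    """Sum of `n_frames` copies of `win`, each shifted by `hop` samples."""
    total = np.zeros(hop * (n_frames - 1) + len(win))
    for m in range(n_frames):
        total[m * hop:m * hop + len(win)] += win
    return total
```

Away from the signal edges, a constant overlap sum means the windows weight every sample equally, which is why these hop sizes are convenient for overlap-and-add reconstruction.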
The information on the target time-domain envelope 14 may be derived in different ways, depending on the embodiment. In a coding environment, for example, an encoder may analyze the (original) audio signal (before encoding) and transmit to the decoder, for example, a codebook or lookup table index representing a predefined target time-domain envelope close to the calculated target time-domain envelope. The
decoder, having the same codebook or lookup table as the encoder, may derive the target time-domain envelope using the received codebook index.
In a bandwidth enhancement environment, the envelope of the core-decoded representation of the audio signal may be a good approximation to the original target time-domain envelope.
Bandwidth enhancement covers any form of enhancing the bandwidth of a processed signal compared to the bandwidth of the input signal before processing. One way of bandwidth enhancement is a gap filling implementation, such as Intelligent Gap Filling as e.g. disclosed in WO2015010948, or semi-parametric gap filling, where spectral gaps in an input signal are filled or "enhanced" by other spectral portions of the input signal with or without the help of transmitted parametric information. A further way of bandwidth enhancement is spectral band replication (SBR) as used in HE-AAC (MPEG-4) or related procedures, where a band above a crossover frequency is generated by the processing. In contrast to gap filling implementations, which have a full-band core signal, the bandwidth of the core signal in SBR is limited. Hence, bandwidth enhancement may be a bandwidth extension to frequencies above a crossover frequency or a bandwidth extension into spectral gaps located, with respect to frequency, below the maximum frequency of the core signal.
Moreover, in a source separation environment, the target time-domain envelope may be approximated. This may be done by zeroing the signal up to the initial position of a transient or by using (different) onsets as an approximation or rough estimate of the target time-domain envelope. In other words, an approximated target time-domain envelope may be derived from the current time-domain envelope of the intermediate time-domain signal by forcing the current time-domain envelope to zero from the beginning of the frame or part of the audio signal up to the initial position of a transient. According to further embodiments, the current time-domain envelope is (amplitude) modulated by one or more (predefined) onsets. The onset may be fixed for the (whole) processing of the audio signal or, in other words, chosen once before (or for) processing the first (time) frame or part of the audio signal.
The (approximated or estimated) target time-domain envelope may be used to shape the processed audio signal, for example using amplitude modulation or multiplication, such that the processed audio signal has at least an approximation of the target time-domain envelope. Nevertheless, the spectral envelope of the processed audio signal is determined by the sequence of frequency-domain frames, since the target time-domain envelope mainly comprises low-frequency components compared to the spectrum of the sequence of frequency-domain frames, such that the majority of frequencies remains unchanged.
Fig. 2 shows a schematic block diagram of the apparatus 2 according to a further embodiment. The apparatus of Fig. 2 shows a phase calculator 8 comprising an iteration processor 16 for performing an iterative algorithm to calculate, starting from initial phase values 18, the phase values 10 for the spectral values using an optimization target requiring consistency of overlapping blocks in the overlapping range. Moreover, the iteration processor 16 is configured to use, in a further iteration step, an updated phase estimate 20, depending on the target time-domain envelope. In other words, the calculation of the phase values 10 may be performed using an iterative algorithm performed by the iteration processor 16. Therefore, magnitude values of the sequence of frequency-domain frames may be known and remain unchanged. Starting from the initial phase value 18, the iteration processor may iteratively update the phase values for the spectral values using, after each iteration, an updated phase estimate 20 to perform the iterations.
The optimization target may be, e.g., a number of iterations. According to further embodiments, the optimization target may be a threshold below which the phase values are updated only to a minor extent compared to the phase values of the previous iteration step, or the optimization target may be the difference between the (initial) constant magnitude of the sequence of frequency-domain frames and the magnitude of the spectral values after an iteration. The phase values are thereby refined such that the frequency spectra of the overlapping portions of the frames of the audio signal are equal or differ only to a minor extent. In other words, all frame portions of the audio signal that overlap one another should have the same or a similar frequency representation.
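One way to quantify the consistency requirement described above is a normalized inconsistency measure. The definition used here, ||STFT(ISTFT(X)) - X|| / ||X|| in the spirit of the spectrogram-consistency view of [2], is an assumption for illustration, as are the helper names. A spectrogram computed from an actual signal scores (numerically) zero, while one with arbitrary phases does not.

```python
import numpy as np

def stft(x, win, hop):
    n = len(win)
    return np.array([np.fft.rfft(win * x[s:s + n])
                     for s in range(0, len(x) - n + 1, hop)])

def istft(X, win, hop, length):
    n = len(win)
    y, norm = np.zeros(length), np.zeros(length)
    for m, frame in enumerate(X):
        y[m * hop:m * hop + n] += win * np.fft.irfft(frame, n)
        norm[m * hop:m * hop + n] += win ** 2
    return y / np.maximum(norm, 1e-12)

def normalized_inconsistency(X, win, hop):
    """||STFT(ISTFT(X)) - X|| / ||X||; zero for a consistent STFT."""
    length = hop * (len(X) - 1) + len(win)
    X_cons = stft(istft(X, win, hop, length), win, hop)
    return np.linalg.norm(X_cons - X) / np.linalg.norm(X)
```

Tracking this value over the iterations gives a convergence curve; stopping when it falls below a threshold is one possible termination condition.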
According to embodiments, the phase calculator is configured to perform the iterative algorithm in accordance with the iterative signal reconstruction procedure by Griffin and Lim. Further (more detailed) embodiments are shown with respect to the following figures. Therein, the iteration processor is subdivided into or replaced by a sequence of processing blocks, namely the frequency-to-time converter 22, the amplitude modulator
24, and the time-to-frequency converter 26. For convenience, the iteration processor 16 is usually not explicitly shown in the further figures; however, the aforementioned processing blocks perform the same operations as the iteration processor 16, or the iteration processor supervises or monitors the termination condition (or exit condition) of the iterative processing, such as, e.g., the optimization target. Furthermore, the iteration processor may perform the operations according to a frequency-domain processing as shown, e.g., with respect to Fig. 4 and Fig. 7.
Fig. 3 shows the apparatus 2 according to a further embodiment in a schematic block diagram. The apparatus 2 comprises a frequency-to-time converter 22, an amplitude modulator 24, and a time-to-frequency converter 26, wherein the frequency-to-time conversion and/or the time-to-frequency conversion may perform an overlap-and-add procedure. The frequency-to-time converter 22 may calculate an intermediate time-domain reconstruction 28 of the audio signal 4 from the sequence of frequency-domain frames 12 and an initial phase value estimate 18 or phase value estimates 10 of a preceding iteration step. The amplitude modulator 24 may modulate the intermediate time-domain reconstruction 28 using the (information on) the target time-domain envelope 14 to obtain an amplitude modulated audio signal 30. Moreover, the time-to-frequency converter is configured to convert the amplitude modulated signal 30 into a further sequence of frequency-domain frames 32 having phase values 10. Therefore, the phase calculator 8 is configured to use, for a next iteration step, the phase values 10 (of the further sequence of frequency-domain frames) and the spectral values of the sequence of frequency-domain frames (which is not the further sequence of frequency-domain frames). In other words, the phase calculator uses updated phase values of the further sequence of frequency-domain frames 32 after each iteration step. Magnitude values of the further sequence of frequency-domain frames may be discarded or not used for further processing. Moreover, the phase calculator 8 uses magnitude values of the (initial) sequence of frequency-domain frames 12, since it is assumed that the magnitude values are already (perfectly) reconstructed.
More generally, the phase calculator 8 is configured to apply an amplitude modulation, for example in the amplitude modulator 24, to an intermediate time-domain reconstruction 28 of the audio signal 4, based on the target time-domain envelope 14. The amplitude modulation may be performed using single-sideband modulation, double-sideband modulation with or without suppressed-carrier transmission, or a multiplication of the target time-domain envelope with the intermediate time-domain reconstruction of the
audio signal. The initial phase value estimate may be a phase value of the audio signal, an arbitrarily chosen value such as, for example, zero or a random value, an estimate of a phase of a frequency band of the audio signal, or a phase of a source of the audio signal, for example when using audio source separation.
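The initialization options listed above can be sketched as a small helper. The function name, its `mode` values, and the use of a reference STFT (for instance the mixture STFT in source separation) are hypothetical choices for illustration only.

```python
import numpy as np

def initial_phase(shape, mode="zero", reference_stft=None, seed=None):
    """Hypothetical helper returning initial phase values for the first iteration."""
    if mode == "zero":
        return np.zeros(shape)
    if mode == "random":
        # uniformly distributed phases in [-pi, pi)
        return np.random.default_rng(seed).uniform(-np.pi, np.pi, shape)
    if mode == "reference":
        # e.g. the phase of the mixture STFT when separating one source
        if reference_stft is None or reference_stft.shape != tuple(shape):
            raise ValueError("reference_stft with matching shape required")
        return np.angle(reference_stft)
    raise ValueError(f"unknown mode: {mode}")
```

A meaningful starting point such as the reference phase usually needs fewer iterations than zero or random phases, in line with the remark on faster convergence.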
According to further embodiments, the phase calculator 8 is configured to output the intermediate time-domain reconstruction 28 of the audio signal 4 as the processed audio signal 6 when an iteration determination condition (e.g. an iteration termination condition) is fulfilled. The iteration determination condition may be closely related to the optimization target and may define a maximum deviation of a current optimization value from the optimization target. Moreover, the iteration determination condition may be a (maximum) number of iterations, a (maximum) deviation of the magnitude of the further sequence of frequency-domain frames 32 from the magnitude of the sequence of frequency-domain frames 12, or a (maximum) update effort of the phase values 10 between a current and a previous frame.
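The iteration of Fig. 3 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation. The helper names (`stft`, `istft`, `envelope_constrained_phase`), the periodic Hann window, and the phase-update threshold `tol` are hypothetical choices. The target envelope is applied by plain multiplication, which is one of the modulation options named above. The inverse transform uses the normalized weighted overlap-add of LSEE-MSTFTM [1]. Two of the listed termination conditions are used: a maximum number of iterations and a small phase update between consecutive iterations.

```python
import numpy as np

def stft(x, w, H):
    """Frame-wise FFT of overlapping frames (analysis window w, hop size H)."""
    N = len(w)
    n_frames = 1 + (len(x) - N) // H
    frames = np.stack([x[m * H:m * H + N] * w for m in range(n_frames)])
    return np.fft.fft(frames, axis=1)

def istft(X, w, H, length):
    """Weighted overlap-add, normalized by the sum of time-shifted squared
    windows (least-squares inverse in the Griffin-Lim sense)."""
    N = len(w)
    y = np.zeros(length)
    norm = np.zeros(length)
    for m, frame in enumerate(np.real(np.fft.ifft(X, axis=1))):
        y[m * H:m * H + N] += frame * w
        norm[m * H:m * H + N] += w ** 2
    return y / np.maximum(norm, 1e-12)   # samples with zero window sum stay 0

def envelope_constrained_phase(mag, env, w, H, length, phase0,
                               max_iter=200, tol=1e-6):
    """ISTFT (converter 22) -> multiply by target envelope (modulator 24)
    -> STFT (converter 26). Only the new phases are kept; the given
    magnitudes are reused in every iteration."""
    phase = phase0
    for _ in range(max_iter):                # stop: maximum number of iterations
        x_rec = istft(mag * np.exp(1j * phase), w, H, length)  # reconstruction 28
        Y = stft(x_rec * env, w, H)                             # further frames 32
        new_phase = np.angle(Y)                                 # |Y| is discarded
        update = np.mean(np.abs(np.exp(1j * new_phase) - np.exp(1j * phase)))
        phase = new_phase
        if update < tol:                     # stop: negligible phase update
            break
    return istft(mag * np.exp(1j * phase), w, H, length), phase
```

Setting `env` to zero before a known transient position and to one afterwards forces the samples ahead of the transient to zero before the STFT is recomputed. This matches the kind of transient constraint discussed for Fig. 6. With `env` equal to one everywhere and the true phase as `phase0`, the loop reduces to the unconstrained LSEE-MSTFTM fixed point.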
Fig. 4 shows a schematic block diagram of the apparatus 2 according to an embodiment, which may be an alternative to the embodiment of Fig. 3. The phase calculator 8 is configured to apply a convolution 34 of a spectral representation 14' of at least one target time-domain envelope 14 and at least one intermediate frequency-domain representation 28' of the audio signal 4, or of selected parts or bands, only a high-pass portion, or only several bandpass portions of the at least one target time-domain envelope 14 or of the at least one intermediate frequency-domain representation 28'. In other words, the processing of Fig. 3 may be performed in the frequency domain instead of the time domain. Therefore, the target time-domain envelope 14, more specifically its frequency representation 14', may be applied to the intermediate frequency-domain representation 28' using convolution instead of amplitude modulation. As before, the (original) magnitude of the sequence of frequency-domain frames is used in each iteration, and after the initial phase value 18 has been used in the first iteration step, the updated phase value estimates 10 are used in each further iteration step. In other words, the phase calculator is configured to use the phase values 10 obtained by the convolution 34 as updated phase value estimates for the next iteration step. Moreover, the apparatus may comprise a target envelope converter 36 for converting the target time-domain envelope into the spectral domain. Furthermore, the apparatus 2 may comprise a frequency-to-time converter 38 for calculating the time-domain reconstruction 28 from the intermediate frequency-domain reconstruction 28' using the phase value estimates 10 obtained from the most recent iteration step and the sequence of frequency-domain frames 12. In other words, the intermediate frequency-domain representation 28' may comprise the magnitude values of the sequence of frequency-domain frames and the phase values 10 of the updated phase value estimates. The time-domain reconstruction 28 may be the processed audio signal 6 or at least a portion of the processed audio signal 6. The portion may relate, for example, to a reduced number of frequency bands compared to the total number of frequency bands of the processed audio signal or the audio signal 4.
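The convolution 34 can replace the amplitude modulation because of the DFT convolution theorem. Multiplying a frame by the envelope segment under that frame corresponds to a circular convolution of their N-point spectra, scaled by 1/N. A minimal NumPy sketch follows. The function name is hypothetical, and the direct O(N²) sum is written out for clarity rather than speed.

```python
import numpy as np

def convolve_spectra(frame_spectrum, env_spectrum):
    """Circular convolution of two N-point spectra, scaled by 1/N.

    By the DFT convolution theorem this equals the spectrum of the
    sample-wise product of the two underlying time-domain frames."""
    N = len(frame_spectrum)
    l = np.arange(N)
    return np.array([np.sum(frame_spectrum * env_spectrum[(k - l) % N])
                     for k in range(N)]) / N
```

In this sketch, `env_spectrum` plays the role of the spectral representation 14' for one frame. In the iteration, only the phase of the result would be kept.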
According to further embodiments, the phase calculator 8 comprises a convolution processor 40. The convolution processor 40 may apply a convolution kernel, a shift kernel, and/or an add-to-center-frame operation to obtain the intermediate frequency-domain representation 28' of the audio signal 4. In other words, the convolution processor may process the sequence of frequency-domain frames 12, wherein the convolution processor 40 may be configured to apply a frequency-domain equivalent of a time-domain overlap-and-add procedure to the sequence of frequency-domain frames 12 to determine the intermediate frequency-domain reconstruction. According to further embodiments, the convolution processor is configured to determine, based on a current frequency-domain frame, a portion of adjacent frequency-domain frames which contributes to the current frequency-domain frame after time-domain overlap-and-add is performed in the frequency domain. Moreover, the convolution processor 40 may determine an overlapping position of the portion of the adjacent frequency-domain frame within the current frequency-domain frame and perform an addition of the portions of adjacent frequency-domain frames with the current frequency-domain frame at the overlapping position. According to a further embodiment, the convolution processor 40 is configured to time-to-frequency transform a time-domain synthesis window and a time-domain analysis window to determine a portion of an adjacent frequency-domain frame which contributes to the current frequency-domain frame after time-domain overlap-and-add is performed in the frequency domain. Moreover, the convolution processor is further configured to shift the portion of the adjacent frequency-domain frame to an overlapping position within the current frequency-domain frame and to apply the portion of the adjacent frequency-domain frame to the current frame at the overlapping position.
In other words, the time-domain procedure shown in Fig. 3 may be transferred (transformed, applied or converted) to the frequency domain. To this end, the synthesis and analysis windows of the frequency-to-time converter 22 and the time-to-frequency converter 26 may also be transferred (transformed, applied or converted) to the frequency domain. The (resulting) frequency-domain representation of the synthesis and analysis windows determines (or cuts out) those portions of the frames adjacent to a current frame that would overlap the current frame in a time-domain overlap-and-add procedure. Moreover, the cut-out portions are shifted to the correct position within the current frame and added to the current frame. In this way, the combined effect of the frequency-to-time transform and the subsequent time-to-frequency transform is reproduced in the frequency domain. This is advantageous, since an explicit signal transformation can be omitted, which may increase the computational efficiency of the phase calculator 8 and the apparatus 2.
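The frequency-domain emulation of overlap-and-add can be checked numerically. The sketch below rests on several assumptions. N is a multiple of H. The window-sum normalization is omitted, as in the TF-domain formulation discussed further below. All function names are hypothetical.

For each overlapping neighbor q, the convolution kernel is the spectrum of the analysis window multiplied by the synthesis window shifted by qH. The shift kernel is the linear phase ramp of a circular shift by qH samples. The weighted contributions are then summed into the central frame. The kernels are precomputed once with an FFT. The time-domain function is included only as a reference to verify the equivalence.

```python
import numpy as np

def neighbor_kernels(wa, ws, H):
    """Convolution kernel and shift kernel for every neighbor q that overlaps
    the central frame (assumes len(wa) is a multiple of H)."""
    N = len(wa)
    n = np.arange(N)
    kernels = {}
    for q in range(-(N // H) + 1, N // H):
        local = n - q * H                      # sample index inside frame m0 + q
        valid = (local >= 0) & (local < N)     # overlap with the central frame
        g = np.zeros(N)
        g[valid] = wa[valid] * ws[local[valid]]
        conv_kernel = np.fft.fft(g)                          # window-product spectrum
        shift_kernel = np.exp(-2j * np.pi * n * q * H / N)   # circular shift by q*H
        kernels[q] = (conv_kernel, shift_kernel)
    return kernels

def reanalyze_in_frequency(X, m0, kernels):
    """Frame m0 of STFT(unnormalized ISTFT(X)), computed from spectra only."""
    N = X.shape[1]
    circ = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N  # circulant index
    Y = np.zeros(N, dtype=complex)
    for q, (conv_kernel, shift_kernel) in kernels.items():
        if 0 <= m0 + q < X.shape[0]:
            shifted = X[m0 + q] * shift_kernel       # move to overlap position
            Y += conv_kernel[circ] @ shifted / N     # cut out overlapping part
    return Y                                         # added to the central frame

def reanalyze_in_time(X, m0, wa, ws, H):
    """Reference: explicit overlap-add, then re-analysis of frame m0."""
    N = X.shape[1]
    y = np.zeros((X.shape[0] - 1) * H + N, dtype=complex)
    for m in range(X.shape[0]):
        y[m * H:m * H + N] += ws * np.fft.ifft(X[m])
    return np.fft.fft(wa * y[m0 * H:m0 * H + N])
```

Both functions agree for arbitrary complex frames. The dictionary returned by `neighbor_kernels` can therefore be computed once and reused in every iteration.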
Fig. 5 shows a schematic block diagram of the apparatus 2 according to a further embodiment, focusing on signal reconstruction of separated channels or bands of the audio signal 4. To this end, the audio signal 4 in the time domain may be transformed into the sequence of frequency-domain frames 12 representing overlapping frames of the audio signal 4 using a time-frequency converter, for example an STFT 42. From this, a modified magnitude estimator 44' may derive a magnitude 44 of the sequence of frequency-domain frames or of components or component signals of the sequence of frequency-domain frames. Moreover, an initial phase estimate 18 may be calculated from the sequence of frequency-domain frames 12 using an initial phase estimator 18'. Alternatively, the initial phase estimator 18' may choose, for example, an arbitrary phase estimate 18 that is not derived from the sequence of frequency-domain frames 12. Based on the magnitude 44 of the sequence of frequency-domain frames 12 and the initial phase estimate 18, an MSTFT 12' may be calculated as an initial sequence of frequency-domain frames 12". This initial sequence has a (perfectly) reconstructed magnitude 44, which remains unchanged during further processing, and only an initial phase estimate 18. The initial phase estimate 18 is updated using the phase calculator 8.
In a further step, the frequency-to-time converter 22, for example an inverse STFT (ISTFT), may calculate the intermediate time-domain reconstruction 28 of the (initial) sequence of frequency-domain frames 12". The intermediate time-domain reconstruction 28 may be amplitude-modulated, for example multiplied, with a target envelope, or more precisely, with the target time-domain envelope 14. The time-to-frequency converter 26, for example an STFT, may calculate the further sequence of frequency-domain frames 32 having phase values 10. The MSTFT 12' may combine the updated phase estimates 10 and the magnitude 44 of the sequence of frequency-domain frames 12 into an updated sequence of frequency-domain frames. This iterative algorithm may be performed or repeated L times, for example within the iteration processor 16, which may perform the aforementioned processing steps of the phase calculator 8. For example, after the iteration process is completed, the time-domain reconstruction 28" is derived from the intermediate time-domain reconstruction 28. According to embodiments, an advantageous aspect of the described methods, encoder or decoder is the intermediate step 2, which enforces transient constraints in the LSEE-MSTFTM procedure.
Figs. 6a-d show schematic plots of the transient restoration according to an embodiment, indicating a time-domain signal 46, an analytic signal envelope 48, and a transient location 50. Fig. 6a illustrates the proposed method or apparatus with the target component signal 46, overlaid with the envelope 48 of its analytic signal. The example signal exhibits transient behavior around n0 (transient location 50), where the waveform transitions from silence to an exponentially decaying sinusoid. Fig. 6b shows the time-domain reconstruction obtained from the iSTFT with φ^(0) = 0 (i.e., zero phase for all TF bins). Through destructive interference of overlapping frames, the transient is completely destroyed, the amplitude of the sinusoid is strongly decreased, and the envelope looks nearly flat. Fig. 6c shows the reconstruction with pronounced transient smearing after L = 200 LSEE-MSTFTM iterations. Fig. 6d shows that the transient restored after L = 200 iterations of the proposed method is much closer to the original signal. Small ripples are visible in the envelope ahead of n0, but overall the restoration closely follows the original. In real-world recordings, there are usually multiple transient onset events throughout the signal. In this case, the proposed method may be applied to signal excerpts localized between consecutive transients (or onsets), as shown in Fig. 9.
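The envelope overlay of Fig. 6a uses the magnitude of the analytic signal. Below is a minimal NumPy sketch of that computation using the FFT-based Hilbert transform. It also includes a hypothetical test signal of the kind described above: silence up to n0, then an exponentially decaying sinusoid. The function names and parameter values are illustrative and are not those used for Fig. 6.

```python
import numpy as np

def analytic_envelope(x):
    """Magnitude of the analytic signal (FFT-based Hilbert transform):
    keep DC (and Nyquist for even N), double positive frequencies,
    zero negative frequencies."""
    N = len(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[1:N // 2] = 2.0
        h[N // 2] = 1.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def decaying_sinusoid(length, n0, freq, decay):
    """Hypothetical test signal: zeros before n0, then a decaying sinusoid
    (freq in cycles per sample, decay time constant in samples)."""
    n = np.arange(length)
    return np.where(n >= n0,
                    np.exp(-(n - n0) / decay) * np.sin(2 * np.pi * freq * (n - n0)),
                    0.0)
```

Applying `analytic_envelope` to `decaying_sinusoid(1024, 300, 0.02, 100.0)` gives an envelope that rises near sample 300, which is the onset a transient-preserving reconstruction should retain.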
Fig. 7 shows a schematic block diagram of the apparatus 2 according to a further embodiment. Similar to Fig. 4, the phase calculator performs the phase calculation in the frequency domain. The frequency-domain processing may be equivalent to the time-domain processing described with respect to the embodiment shown in Fig. 5. Again, the time-domain signal 4 may be time-frequency transformed using the STFT (performer) 42 to derive the sequence of frequency-domain frames 12. The modified magnitude estimator 44' may then derive the modified magnitude 44 from the sequence of frequency-domain frames 12. The initial phase estimator 18' may derive the initial phase estimate 18 from the sequence of frequency-domain frames, or it may provide, for example, an arbitrary initial phase estimate. Using the modified magnitude estimate and the initial phase estimate, the MSTFT 12' calculates the initial sequence of frequency-domain frames 12", which receives updated phase values after each iteration step. In contrast to the embodiment of Fig. 5, the (initial) sequence of frequency-domain frames 12" is processed in the frequency domain within the phase calculator 8. A convolution kernel calculator 52' may calculate the convolution kernel 52 using a frequency-domain representation of time-domain synthesis and analysis windows, for example the synthesis and analysis windows used in the ISTFT 22 or the STFT 26 in Fig. 5. The convolution kernel cuts out (slices out or uses) the parts of the frames adjacent to a current frequency-domain frame that would overlap the current frame in the overlap-and-add of the ISTFT 22. A kernel shift calculator 54' may calculate a shift kernel 54 and apply it to the parts of the adjacent frequency-domain frames to shift those parts to the correct overlapping position within the current frequency-domain frame. This may emulate the overlapping operation of the overlap-and-add procedure of the ISTFT 22.
Moreover, block 56 performs the addition of the overlap-and-add procedure and adds the overlapping parts of the adjacent frames to the central frame. The convolution kernel calculation and application, the shift kernel calculation and application, and the addition in block 56 may be performed in the convolution processor 40. The output of the convolution processor 40 may be an intermediate frequency-domain reconstruction 28' of the sequence of frequency-domain frames 12 or of the initial sequence of frequency-domain frames 12". The intermediate frequency-domain reconstruction 28' may be (frame-wise) convolved with a frequency-domain representation of the target envelope 14 using the convolution 34. The output of the convolution 34 may be the further sequence of frequency-domain frames 32' having phase values 10. The phase values 10 replace the initial phase estimate 18 in the MSTFT 12' in the next iteration step. The iteration may be performed L times using the iteration processor 16. After the iteration process has stopped, or at a certain point in time within the iteration process, a final frequency-domain reconstruction 28"' may be derived from the convolution processor 40. The final frequency-domain reconstruction 28"' may be the intermediate frequency-domain reconstruction 28' of the most recent iteration step. Using a frequency-to-time converter 38, for example an ISTFT, the time-domain reconstruction 28" may be obtained, which may be the processed audio signal 6.
In other words, it is advantageous to apply an intermediate step in the LSEE-MSTFTM iteration. This step may force all samples ahead of the transient to be zero before the STFT is computed again to obtain an updated estimate of the phases φ^(i+1). This constraint can also be enforced directly in the TF domain. To this end, some prerequisites are advantageous. First, the normalization by the sum of the time-shifted and squared window functions in the denominator of (6) can be omitted by imposing certain constraints on w and H (e.g., using a symmetric Hann window and requiring the redundancy Q = N/H to be radix 4 [2]). The number of unique (up to conjugation) spectral bins per frame is K = N/2, and the frequency argument is evaluated for k ∈ [−K : K]. Focusing for the moment on a single spectral frame, the operation of successively applying the iSTFT and the STFT again can be expressed in the TF domain as a superposition of weighted spectral contributions from the preceding and subsequent frames. Only frames that overlap with the central one need to be considered. This is expressed by a neighborhood frame index q ∈ [−(

of Case 1 66a, 66a' may exhibit a considerable head start in terms of pre-echo reduction compared to Case 2 66b, 66b'. Surprisingly, the proposed TR processing applied to Case 2 slightly outperforms GL applied to Case 1 in terms of pre-echo reduction for L > 100. From these results, it may be inferred that it is sufficient to apply only a few iterations (e.g., L < 20) of the proposed method in scenarios where a reasonable initial phase and magnitude estimate is available. However, more iterations (e.g., L < 200) may be applied when a good magnitude estimate is available in conjunction with a weak phase estimate, or vice versa. Fig. 8 shows different versions of a segment from one test item of test case 2. The TR reconstruction 61d clearly exhibits reduced pre-echos in comparison to the reconstruction with LSEE-MSTFTM 61c.
The reference hi-hat signal 61b and the mixture signal 61a are shown above.
However, the following figures are derived using a different hop size and a different window length as described below.
For each mixture excerpt, the STFT is computed via (1) with H = 512 and N = 2048 and denoted as X_Mix. Since all test items have a sampling rate of 44.1 kHz, the frequency resolution is approx. 21.5 Hz and the temporal resolution is approx. 11.6 ms. A symmetric Hann window of size N is used for w. As a reference target, the same excerpt boundaries are taken and the same zero-padding is applied, but this time to the single track of each individual drum instrument. The resulting STFT is denoted as X_Oracle. Subsequently, two different cases for the initialization (iteration 0) are defined as detailed above. Using these settings, the inconsistency of the resulting initial estimate is expected to be lower in Case 1 than in Case 2. Knowing that there exists a consistent L = 200 iterations of both LSEE-MSTFTM (GL) and the proposed method or apparatus (TR) are carried out.
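The resolution figures quoted above follow directly from the STFT settings. With N = 2048 and H = 512, the redundancy is Q = N/H = 4. The short NumPy check below covers both points and assumes a periodic (DFT-even) Hann window. For Q = 4, the sum of time-shifted squared windows is constant (3/2) away from the signal boundaries, so the window-sum normalization reduces to a constant scale factor. The periodic form is used because the property holds exactly for it. The strictly symmetric variant (e.g. `np.hanning`) satisfies it only approximately. The helper name is hypothetical.

```python
import numpy as np

fs, N, H = 44100, 2048, 512          # sampling rate and STFT settings from the text

freq_resolution_hz = fs / N          # DFT bin spacing, approx. 21.5 Hz
time_resolution_ms = 1000 * H / fs   # hop duration, approx. 11.6 ms
redundancy = N // H                  # Q = N / H

def squared_window_sum(w, H):
    """Sum of time-shifted squared windows w^2(n - mH) over all m,
    evaluated for one hop period n = 0 .. H-1 (assumes len(w) % H == 0)."""
    N = len(w)
    assert N % H == 0
    return sum(w[j * H:(j + 1) * H] ** 2 for j in range(N // H))

periodic_hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)
```

`squared_window_sum(periodic_hann, H)` returns an array of length H whose entries are all 1.5.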
Fig. 12a shows the evolution of the normalized consistency measure vs. the number of iterations. Fig. 12b shows the evolution of the pre-echo energy vs. the number of iterations. The curves show the average over all test excerpts. In other words, Fig. 12 shows the evolution of both quality measures from (6) and (7) with respect to L. Fig. 12a indicates that, on average, the proposed method (TR) performs as well as LSEE-MSTFTM (GL) in terms of inconsistency reduction. In both test cases, the curves for TR (solid line) and GL (dashed line) are almost indistinguishable. This indicates that the new approach, meaning the method or apparatus, has convergence properties similar to those of the original method. As expected, the curves 66a, 66a' (Case 1) start at a much lower initial inconsistency than the curves 66b, 66b' (Case 2), which is clearly due to the initialization with the mixture phase φ_Mix. Fig. 12b shows the benefit of TR for pre-echo reduction. In both test cases, the pre-echo energy for TR (solid lines) is around 15 dB lower and decreases more steeply during the first few iterations compared to GL (dashed lines). Again, the more consistent initial estimates of Case 1 66a, 66a' exhibit a considerable head start in terms of pre-echo reduction compared to Case 2 66b, 66b'. From these results, it is inferred that it is sufficient to apply only a few iterations (e.g., L < 20) of the proposed method in scenarios where a reasonable initial phase and magnitude estimate is available. However, applying more iterations (e.g., L < 200) may be advantageous when a good magnitude estimate is available in conjunction with a weak phase estimate, or vice versa.
The following describes embodiments of how to apply the proposed transient restoration method or apparatus in a score-informed audio decomposition scenario. An objective is the extraction of isolated drum sounds from polyphonic drum recordings with enhanced transient preservation. In contrast to the idealized laboratory conditions used before, the magnitude spectrograms of the component signals are now estimated from the mixture. To this end, NMFD (Non-Negative Matrix Factor Deconvolution) [3, 4] may be employed as the decomposition technique. Embodiments describe a strategy to enforce score-informed constraints on NMFD. Finally, the experiments are repeated under these more realistic conditions and the observations are discussed.
In the following, the NMFD method employed for decomposing the TF representation of x is briefly described. As already indicated, a wide variety of alternative separation approaches exists. Previous works [3, 4] successfully applied NMFD, a convolutive version of NMF, to drum sound separation. Intuitively speaking, the underlying convolutive model assumes that all audio events in one of the component signals can be explained by a prototype event that acts as an impulse response to some onset-related activation (e.g., striking a particular drum). In Fig. 10b, this kind of behavior can be seen in the hi-hat component V3. There, all instances of the 8 onset events look more or less like copies of each other that could be explained by inserting a prototype event at each onset position.
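The convolutive model can be written as V ≈ Σ_τ W_τ · shift_τ(A). Here each component has a spectro-temporal template spanning T frames, and an activation row `A` whose non-zero entries mark onsets. Activations are named `A` to avoid a clash with the hop size H. The NumPy sketch below covers only this reconstruction model, not the NMFD update rules. Array shapes and function names are hypothetical.

```python
import numpy as np

def shift_right(A, tau):
    """Shift activations tau frames to the right, filling with zeros."""
    if tau == 0:
        return A.copy()
    shifted = np.zeros_like(A)
    shifted[:, tau:] = A[:, :-tau]
    return shifted

def nmfd_model(W, A):
    """Convolutive NMF model: sum over tau of W[tau] @ shift_right(A, tau).

    W: (T, K, R) templates (T frames, K frequency bins, R components)
    A: (R, M) non-negative activations over M frames
    Returns a (K, M) magnitude spectrogram approximation."""
    return sum(W[tau] @ shift_right(A, tau) for tau in range(W.shape[0]))
```

With a single component and impulses in `A` at two onset frames, the model places a copy of the prototype template at each onset. This corresponds to the behavior described for the hi-hat component.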
Fig. 16 shows an audio decoder 110 comprising the apparatus 2 and an input interface 112. The input interface 112 may receive an encoded audio signal. The encoded audio signal may comprise a representation of the sequence of frequency-domain frames and a representation of the target time-domain envelope.
In other words, the decoder 110 may receive the encoded audio signal, for example from the encoder 100. The input interface 112, the apparatus 2, or a further unit may extract the target time-domain envelope 14 or a representation thereof, for example a sequence of bits indicating the position of the target time-domain envelope in a lookup table or a codebook. Furthermore, the apparatus 2 may decode the encoded audio signal 108, for example by adjusting corrupted phases of the encoded audio signal that still has uncorrupted magnitude values. Alternatively, the apparatus may correct the phase values of a decoded audio signal. For example, a decoding unit may sufficiently or even perfectly decode the spectral magnitude of the encoded audio signal, and the apparatus then adjusts the phase of the decoded audio signal, which may have been corrupted by the decoding unit.
Fig. 17 shows an audio signal 114 comprising a representation of a sequence of frequency-domain frames 12 and a representation of a target time-domain envelope 14. The representation of the sequence of frequency-domain frames 12 of the time-domain audio signal may be an encoded audio signal according to a standard audio encoding scheme. Furthermore, the representation of the target time-domain envelope 14 may be a bit representation of the target time-domain envelope. The bit representation may be derived, for example, by sampling and quantizing the target time-domain envelope or by a further digitization method. Moreover, the representation of the target time-domain envelope 14 may be an index into, for example, a codebook or a lookup table, coded with a number of bits.
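The two envelope representations mentioned above are (i) sampling followed by quantization and (ii) a codebook index. Both can be sketched in a few lines of NumPy. This is an illustration under stated assumptions. The uniform quantizer, the sample-and-hold reconstruction, the squared-error codebook search, and all function names are hypothetical choices, not a coding scheme defined by this document.

```python
import numpy as np

def quantize_envelope(env, decimation, step, n_bits):
    """Sample the envelope every `decimation` samples and quantize it
    uniformly to integer codes in [0, 2**n_bits - 1] (n_bits bits each)."""
    sampled = env[::decimation]
    codes = np.round(sampled / step)
    return np.clip(codes, 0, 2 ** n_bits - 1).astype(int)

def dequantize_envelope(codes, decimation, step, length):
    """Decoder side: rescale the codes and hold each value for
    `decimation` samples."""
    return np.repeat(codes * step, decimation)[:length]

def codebook_index(env, codebook):
    """Index of the codebook row closest to env (squared error); only this
    index would need to be transmitted."""
    return int(np.argmin(np.sum((codebook - env) ** 2, axis=1)))
```

A codebook of R entries costs ceil(log2(R)) bits per envelope. The sample-and-quantize variant costs n_bits per retained sample.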
Fig. 18 shows a schematic block diagram of an audio source separation processor 116 according to an embodiment. The audio source separation processor comprises the apparatus 2 and a spectral masker 118. The spectral masker may mask a spectrum of the original audio signal 4 to derive a modified audio signal 120. Compared to the original audio signal 4, the modified audio signal 120 may comprise a reduced number of frequency bands or time-frequency bins. Furthermore, the modified audio signal may comprise only one source, one instrument, or one (human) speaker of the audio signal 4, with the frequency contributions of other sources, speakers, or instruments hidden or masked out. However, while the magnitude values of the modified audio signal 120 may match the magnitude values of a (desired) processed audio signal 6, the phase values of the modified audio signal may be corrupted. Therefore, the apparatus 2 may correct the phase values of the modified audio signal with respect to the target time-domain envelope 14.
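The split between a usable magnitude and a corrupted phase can be made concrete with a real-valued spectral mask. The NumPy sketch below rests on assumptions. The soft mask with values in [0, 1] and the function name are hypothetical, and how the mask is obtained is outside its scope. It shows that masking rescales the magnitude while the phase is simply inherited from the mixture. The magnitude would be kept fixed by the phase calculator, and the inherited phase is what the apparatus 2 re-estimates.

```python
import numpy as np

def masked_component(X_mix, mask):
    """Apply a real-valued mask in [0, 1] to a mixture STFT.

    Returns the masked magnitude (kept fixed by the phase calculator) and
    the phase, which is inherited unchanged from the mixture and is
    therefore the part to be re-estimated."""
    mask = np.clip(mask, 0.0, 1.0)
    X_masked = mask * X_mix
    return np.abs(X_masked), np.angle(X_mix)
```

The returned magnitude and phase could serve as the magnitude 44 and the initial phase estimate 18 in the iterative procedure described for Fig. 5.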
Fig. 19 shows a schematic block diagram of a bandwidth enhancement processor 122 according to an embodiment. The bandwidth enhancement processor 122 is configured to process an encoded audio signal 124. Moreover, the bandwidth enhancement processor 122 comprises an enhancement processor 126 and the apparatus 2. The enhancement processor 126 is configured to generate an enhancement signal 127 from an audio signal band included in the encoded signal. It is further configured to extract the target time-domain envelope 14 from an encoded representation included in the encoded signal 124 or from the audio signal band included in the encoded signal. Furthermore, the apparatus 2 may process the enhancement signal 127 using the target time-domain envelope.
In other words, the enhancement processor 126 may core-encode the audio signal band or receive a core-encoded audio signal band of the encoded audio signal. Furthermore, the enhancement processor 126 may calculate further bands of the audio signal using, for example, parameters of the encoded audio signal and the core-encoded baseband portion of the audio signal. Moreover, the target time-domain envelope 14 may be present in the encoded audio signal 124, or the enhancement processor may be configured to calculate the target time-domain envelope from the baseband portion of the audio signal.
Fig. 20 illustrates a schematic representation of the spectrum. The spectrum is subdivided into scale factor bands SCB, with seven scale factor bands SCB1 to SCB7 in the illustrated example of Fig. 20. The scale factor bands can be AAC scale factor bands, which are defined in the AAC standard and have a bandwidth that increases toward higher frequencies, as illustrated schematically in Fig. 20. It is preferred to perform intelligent gap filling not from the very beginning of the spectrum, i.e., at low frequencies, but to start the IGF operation at an IGF start frequency illustrated at 309. Therefore, the core frequency band extends from the lowest frequency to the IGF start frequency. Above the IGF start frequency, the spectrum analysis is applied to separate high-resolution spectral components 304, 305, 306, 307 (the first set of first spectral portions) from low-resolution components represented by the second set of second spectral portions. Fig. 20 illustrates a spectrum which is, for example, input into the enhancement processor 126. The core encoder may operate in the full range but encode a significant number of zero spectral values, i.e., spectral values that are quantized to zero or set to zero before or after quantizing. In any case, the core encoder operates in the full range, i.e., as if the spectrum were as illustrated, i.e., the core decoder does not necessarily have to be aware of any intelligent
described herein. Generally, the methods are preferably performed by any hardware apparatus.
The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
REFERENCES
[1] Daniel W. Griffin and Jae S. Lim, "Signal estimation from modified short-time Fourier transform", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 2, pp. 236-243, April 1984.
[2] Jonathan Le Roux, Nobutaka Ono, and Shigeki Sagayama, "Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction" in Proceedings of the ISCA Tutorial and Research Workshop on Statistical And Perceptual Audition, Brisbane, Australia, September 2008, pp. 23-28.
[3] Xinglei Zhu, Gerald T. Beauregard, and Lonce L. Wyse, "Real-time signal estimation from modified short-time Fourier transform magnitude spectra", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1645-1653, July 2007.
[4] Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, and Shigeki Sagayama, "Phase initialization schemes for faster spectrogram-consistency-based signal reconstruction" in Proceedings of the Acoustical Society of Japan Autumn Meeting, September 2010, number 3-10-3.
[5] Nicolas Sturmel and Laurent Daudet, "Signal reconstruction from STFT magnitude: a state of the art" in Proceedings of the International Conference on Digital Audio Effects (DAFx), Paris, France, September 2011, pp. 375-386.
[6] Nathanael Perraudin, Peter Balazs, and Peter L. Sondergaard, "A fast Griffin-Lim algorithm" in Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, October 2013, pp. 1-4.
[7] Dennis L. Sun and Julius O. Smith III, "Estimating a signal from a magnitude spectrogram via convex optimization" in Proceedings of the Audio Engineering Society (AES) Convention, San Francisco, USA, October 2012, Preprint 8785.
[8] Tomohiko Nakamura and Hirokazu Kameoka, "Fast signal reconstruction from magnitude spectrogram of continuous wavelet transform based on spectrogram consistency" in Proceedings of the International Conference on Digital Audio Effects (DAFx), Erlangen, Germany, September 2014, pp. 129-135.
[9] Volker Gnann and Martin Spiertz, "Inversion of short-time Fourier transform magnitude spectrograms with adaptive window lengths" in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 325-328.
[10] Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, and Shigeki Sagayama, "Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency" in Proceedings of the International Conference on Digital Audio Effects (DAFx), Graz, Austria, September 2010, pp. 397-403.
Claims
1. Apparatus (2) for processing an audio signal (4) to obtain a processed audio signal (6), comprising:
a phase calculator (8) for calculating phase values (10) for spectral values of a sequence of frequency-domain frames (12) representing overlapping frames of the audio signal (4),
wherein the phase calculator (8) is configured to calculate the phase values (10) based on information on a target time-domain envelope (14) related to the processed audio signal (6), so that the processed audio signal has at least in an approximation the target time-domain envelope (14) and a spectral envelope determined by the sequence of frequency-domain frames (12).
2. Apparatus (2) of claim 1,
wherein the phase calculator (8) comprises:
an iteration processor (16) for performing an iterative algorithm to calculate, starting from initial phase values (18), the phase values for the spectral values using an optimization target requiring consistency of overlapping blocks in the overlapping range,
wherein the iteration processor (16) is configured to use, in a further iteration step, an updated phase estimate (20) depending on the target time-domain envelope (14).
3. Apparatus (2) of claim 1 or 2, wherein the phase calculator (8) is configured to apply a convolution of a spectral representation of at least one target time-domain envelope (14) and at least one intermediate frequency-domain reconstruction (28') or selected parts or bands or only a high-pass portion or only several bandpass portions of the at least one target time-domain envelope or the at least one intermediate frequency-domain reconstruction of an audio signal.
4. Apparatus (2) of claim 3, wherein the phase calculator comprises:
a frequency-to-time converter (22) for calculating the intermediate time-domain reconstruction (28) of the audio signal (4) from the sequence of frequency-domain frames (12) and initial phase value estimates (18) or phase value estimates (20) of a preceding iteration step,
an amplitude modulator (24) for modulating the intermediate time-domain reconstruction (28) using a target time-domain envelope (14) to obtain an amplitude-modulated audio signal (30), and
a time-to-frequency converter (26) for converting the amplitude-modulated signal (30) into a further sequence of frequency-domain frames (32) having phase values (10), and
wherein the phase calculator is configured to use, for a next iteration step, the phase values and the spectral values of the sequence of frequency-domain frames (12).
6. Apparatus (2) of claim 5,
wherein the phase calculator (8) is configured to output the intermediate time-domain reconstruction (28) as the processed audio signal (6), when an iteration determination condition is fulfilled.
7. Apparatus (2) of claim 4,
wherein the phase calculator comprises:
a convolution processor (40) for applying a convolution kernel and for applying a shift kernel and for adding an overlapping part of an adjacent frame of a central frame to the central frame to obtain the intermediate frequency-domain reconstruction (28') of the audio signal (4).
8. Apparatus (2) of claim 4 or 7,
wherein the phase calculator (8) is configured to use phase values (10) obtained by the convolution (34) as updated phase value estimates (20) for a next iteration step.
9. Apparatus (2) of one of claims 4, 7 or 8,
further comprising a target envelope converter (36) for converting the target time-domain envelope into the spectral domain.
10. Apparatus (2) of one of claims 4, 7, 8, 9, further comprising:
a frequency-to-time converter (38) for calculating the time-domain reconstruction (28") from the intermediate frequency-domain reconstruction (28', 28"') using the phase value estimates (10, 20) obtained from a most recent iteration step and the sequence of frequency-domain frames (12).
11. Apparatus (2) of one of claims 4, 7, 8, 9, 10,
wherein the phase calculator (8) comprises a convolution processor (40) to process the sequence of frequency-domain frames (12), wherein the convolution processor is configured to apply a time-domain overlap-and-add procedure to the sequence of frequency-domain frames (12) in the frequency-domain to determine the intermediate frequency-domain reconstruction.
12. Apparatus (2) of claim 11,
wherein the convolution processor (40) is configured to determine, based on a current frequency-domain frame, a portion of an adjacent frequency-domain frame which contributes to the current frequency-domain frame after time-domain overlap-and-add is performed in the frequency-domain,
wherein the convolution processor is further configured to determine an overlapping position of the portion of the adjacent frequency-domain frame within the current frequency-domain frame and to perform an addition of the portions of adjacent frequency-domain frames with the current frequency-domain frame at the overlapping position.
13. Apparatus (2) of one of claims 11 or 12, wherein the convolution processor is configured to frequency-to-time transform a time-domain synthesis and a time-domain analysis window to determine a portion of an adjacent frequency-domain frame which contributes to the current frequency-domain frame after time-domain overlap-and-add is performed in the frequency-domain, wherein the convolution processor is further configured to shift the position of the adjacent frequency-domain frame to an overlapping position within the current frequency-domain frame and to apply the portion of the adjacent frequency-domain frame to the current frame at the overlapping position.
14. Apparatus (2) of one of the preceding claims,
wherein the phase calculator (8) is configured to perform the iterative algorithm in accordance with the iterative signal reconstruction procedure by Griffin and Lim.
15. Audio encoder (100) for encoding an audio signal, comprising:
an audio signal processor (102) configured for encoding the audio signal such that the encoded audio signal (108) comprises a representation of a sequence of frequency-domain frames of the audio signal and a representation of a target time-domain envelope, and
an envelope determiner (104) configured for determining a time-domain envelope from the audio signal, wherein the envelope determiner (104) is further configured to compare the envelope to a set of predetermined envelopes to determine a representation of the target time-domain envelope (14) based on the comparing.
16. Audio decoder (110), comprising:
the apparatus (2) of one of claims 1 to 15, and
an input interface (112) for receiving an encoded signal (108), the encoded signal comprising a representation of the sequence of frequency-domain frames and a representation of the target time-domain envelope (14).
17. Audio signal (114), comprising:
a representation of a sequence of frequency-domain frames (12) of the time-domain audio signal (4) and a representation of a target time-domain envelope (14).
18. Audio source separation processor (116), comprising:
an apparatus (2) for processing according to one of claims 1 to 15, and a spectral masker (118) for masking a spectrum of an original audio signal to obtain a modified audio signal input into the apparatus for processing,
wherein the processed audio signal (6) is a separated source signal related to the target time-domain envelope (14).
19. Bandwidth enhancement processor (122) for processing an encoded audio signal, comprising:
an enhancement processor (126) for generating an enhancement signal (127) from an audio signal band included in the encoded signal, and
an apparatus (2) for processing in accordance with one of claims 1 to 15,
wherein the enhancement processor (126) is configured to extract the target time-domain envelope (14) from an encoded representation included in the encoded signal or from the audio signal band included in the encoded signal.
20. Method (2200) for processing an audio signal to obtain a processed audio signal, comprising:
calculating phase values for spectral values of a sequence of frequency-domain frames representing overlapping frames of the audio signal,
wherein the phase values are calculated based on information on a target time-domain envelope related to the processed audio signal, so that the processed audio signal has at least in an approximation the target time-domain envelope and a spectral envelope determined by the sequence of frequency-domain frames.
21. Method (2300) of audio decoding, comprising:
the method of claim 20;
receiving an encoded signal, the encoded signal comprising a representation of the sequence of frequency-domain frames, and a representation of the target time-domain envelope.
22. Method (2400) of audio source separation, comprising:
the method of claim 20, and
masking a spectrum of an original audio signal to obtain a modified audio signal input into the apparatus for processing;
wherein the processed audio signal is a separated source signal related to the target time-domain envelope.
23. Method (2500) of bandwidth enhancement of an encoded audio signal, comprising:
generating an enhancement signal from an audio signal band included in the encoded signal;
the method of claim 20;
wherein the generating comprises extracting the target time-domain envelope from an encoded representation included in the encoded signal or from the audio signal band included in the encoded signal.
24. Method (2600) of audio encoding, comprising:
encoding the audio signal such that the encoded audio signal comprises a representation of a sequence of frequency-domain frames of the audio signal and a representation of a target time-domain envelope; and determining a time-domain envelope from the audio signal and comparing the envelope to a set of predetermined envelopes to determine a representation of the target time-domain envelope based on the comparing.
25. Computer program for performing, when running on a computer or a processor, the method of one of claims 20, 21, 22, 23, or 24.
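Claim 3 relies on the fact that imposing a time-domain envelope on a frame can be carried out in the spectral domain as a convolution. The following minimal Python sketch checks that identity for a single frame, using a naive DFT and circular convolution. It assumes one rectangular frame of length N and ignores analysis/synthesis windows, frame overlap and the band-limited variants the claim also covers; the function names are hypothetical and not taken from the specification.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


def modulate_in_frequency_domain(frame_spectrum, envelope):
    """Spectrum of (frame * envelope), computed without an inverse transform.

    Uses DFT(x * e)[k] = (1/N) * sum_m X[m] * E[(k - m) mod N].
    """
    n = len(frame_spectrum)
    return [c / n for c in circular_convolve(frame_spectrum, dft(envelope))]
```

Comparing `modulate_in_frequency_domain(dft(frame), env)` with `dft([f * e for f, e in zip(frame, env)])` should give the same values up to rounding, which is the property a frequency-domain convolution processor can exploit.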
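The iterative loop of claims 4, 6 and 14 (frequency-to-time conversion, amplitude modulation with the target envelope, time-to-frequency conversion, then reuse of the resulting phases with the given spectral values) can be sketched as a Griffin-Lim-style procedure. This is an illustrative sketch, not the claimed implementation: it assumes a periodic Hann window, 50 % overlap, the weighted overlap-add inverse used by Griffin and Lim, sample-wise multiplication as the "modulation", and a fixed iteration count as the termination condition. Pure-Python DFTs keep it dependency-free but only suit tiny inputs; all names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]


def dft(x):
    """Naive O(N^2) DFT; adequate for the tiny frames used here."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def stft(x, n, hop):
    """Hann-windowed DFT frames starting every `hop` samples (no padding)."""
    w = hann(n)
    return [dft([w[t] * x[s + t] for t in range(n)])
            for s in range(0, len(x) - n + 1, hop)]


def istft(frames, n, hop, length):
    """Weighted overlap-add: least-squares signal estimate from modified frames."""
    w = hann(n)
    out, norm = [0.0] * length, [0.0] * length
    for m, spectrum in enumerate(frames):
        for t, sample in enumerate(idft(spectrum)):
            out[m * hop + t] += w[t] * sample
            norm[m * hop + t] += w[t] ** 2
    # Samples without any non-zero window weight are left at zero.
    return [o / z if z > 1e-12 else 0.0 for o, z in zip(out, norm)]


def combine(magnitudes, phases):
    """Attach phase estimates to the given spectral magnitudes."""
    return [[a * cmath.exp(1j * p) for a, p in zip(mag, ph)]
            for mag, ph in zip(magnitudes, phases)]


def reconstruct_with_envelope(magnitudes, phases, envelope, n, hop, iterations):
    """Estimate phases so the output follows `envelope` (hypothetical helper)."""
    length = len(envelope)
    for _ in range(iterations):
        # Frequency-to-time conversion: intermediate time-domain reconstruction.
        x = istft(combine(magnitudes, phases), n, hop, length)
        # Amplitude modulation with the target time-domain envelope.
        y = [s * e for s, e in zip(x, envelope)]
        # Time-to-frequency conversion; only the new phases are kept.
        phases = [[cmath.phase(c) for c in frame] for frame in stft(y, n, hop)]
    # Termination (fixed iteration count here): output the reconstruction.
    return istft(combine(magnitudes, phases), n, hop, length)
```

For transient restoration one could, for example, pass zero-initialised phases and an envelope that is 0 before a known onset and 1 after it; how closely the result follows that envelope depends on the signal and the number of iterations. With an all-ones envelope and the true phases as initial estimates, the loop is at a fixed point and returns the original signal.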
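Claims 15 and 24 leave open how the time-domain envelope is measured and how it is compared to the set of predetermined envelopes. As one possible reading, the sketch below uses a peak-normalised block-RMS envelope and selects the predetermined envelope with the smallest squared error; an encoder could then transmit only that index as the representation of the target envelope. The block size, the distance measure and the example codebook are all assumptions, not details from the specification.

```python
import math


def block_rms_envelope(signal, block):
    """Peak-normalised RMS of non-overlapping blocks (assumed envelope measure)."""
    env = [math.sqrt(sum(s * s for s in signal[i:i + block]) / block)
           for i in range(0, len(signal) - block + 1, block)]
    peak = max(env, default=0.0)
    return [e / peak for e in env] if peak > 0 else env


def nearest_envelope_index(envelope, codebook):
    """Index of the predetermined envelope with the smallest squared error."""
    def squared_error(candidate):
        return sum((a - b) ** 2 for a, b in zip(envelope, candidate))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


# Hypothetical codebook of predetermined envelopes: flat, late onset, decay.
CODEBOOK = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0], [1.0, 0.5, 0.25, 0.1]]
```

A signal that is silent for its first half and active in its second half maps to the "late onset" entry, index 1, of this example codebook.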