
Audio Similarity Evaluator, Audio Encoder, Methods And Computer Program

Abstract: An audio similarity evaluator obtains envelope signals for a plurality of frequency ranges on the basis of an input audio signal. The audio similarity evaluator is configured to obtain a modulation information associated with the envelope signals for a plurality of modulation frequency ranges, wherein the modulation information describes the modulation of the envelope signals. The audio similarity evaluator is configured to compare the obtained modulation information with a reference modulation information associated with a reference audio signal, in order to obtain an information about a similarity between the input audio signal and the reference audio signal. An audio encoder uses such an audio similarity evaluator. Another audio similarity evaluator uses a neural net trained using the audio similarity evaluator.


Patent Information

Application #:
Filing Date: 24 December 2020
Publication Number: 11/2021
Publication Type: INA
Invention Field: ELECTRONICS
Status:
Email: IPRDEL@LAKSHMISRI.COM
Parent Application:
Patent Number:
Legal Status:
Grant Date: 2024-04-29
Renewal Date:

Applicants

FRAUNHOFER-GESELLSCHAFT ZUR FÖRDERUNG DER ANGEWANDTEN FORSCHUNG E.V.
Hansastraße 27c 80686 München

Inventors

1. DISCH, Sascha
c/o Fraunhofer-Institut für Integrierte Schaltungen IIS Am Wolfsmantel 33 91058 Erlangen
2. VAN DER PAR, Steven
c/o Fraunhofer-Institut für Digitale Medientechnologie IDMT, Ehrenbergstraße 31 98693 Ilmenau
3. NIEDERMEIER, Andreas
c/o Fraunhofer-Institut für Integrierte Schaltungen IIS Am Wolfsmantel 33 91058 Erlangen
4. BURDIEL PÉREZ, Elena
c/o Fraunhofer-Institut für Integrierte Schaltungen IIS Am Wolfsmantel 33 91058 Erlangen
5. EDLER, Bernd
c/o Fraunhofer-Institut für Integrierte Schaltungen IIS Am Wolfsmantel 33 91058 Erlangen

Specification

Audio Similarity Evaluator, Audio Encoder, Methods and Computer Program

Technical Field

Embodiments according to the invention are related to audio similarity evaluators.

Further embodiments according to the invention are related to audio encoders.

Further embodiments according to the invention are related to methods for evaluating a similarity between audio signals.

Further embodiments according to the invention are related to methods for encoding an audio signal.

Further embodiments according to the invention are related to a computer program for performing said methods.

Generally, embodiments according to the invention are related to an improved psycho-acoustic model for efficient perceptual audio codecs.

Background of the Invention

Audio coding is an emerging technical field, since the encoding and decoding of audio contents is important in many technical fields, like mobile communication, audio streaming, audio broadcast, television, etc.

In the following, an introduction to perceptual coding will be provided. It should be noted that the definitions and details discussed in the following can optionally be applied in conjunction with the embodiments disclosed herein.

Perceptual Codecs

Perceptual audio codecs like mp3 or AAC are widely used to code the audio in today's multimedia applications [1]. The most popular codecs are so-called waveform coders; that is, they preserve the audio's time-domain waveform and mostly add (inaudible) noise to it through the perceptually controlled application of quantisation. Quantisation typically happens in a time-frequency domain, but can also be applied in the time domain [2]. To render the added noise inaudible, it is shaped under the control of a psychoacoustic model, typically a perceptual masking model.

In today's audio applications, there is a constant request for lower bitrates. Perceptual audio codecs traditionally limit audio bandwidth to still achieve decent perceptual quality at these low bitrates. Efficient semi-parametric techniques like Spectral Band Replication (SBR) [3] in High Efficiency Advanced Audio Coding (HE-AAC) [4] or Intelligent Gap Filling (IGF) [5] in MPEG-H 3D Audio [6] and 3GPP Enhanced Voice Services (EVS) [7] are used for extending the band-limited audio up to full bandwidth at the decoder side. Such a technique is called Bandwidth Extension (BWE). These techniques insert an estimate of the missing high frequency content controlled by a few parameters. Typically, the most important BWE side information is envelope-related data. Usually, the estimation process is steered by heuristics rather than a psychoacoustic model.

Perceptual Models

Psychoacoustic models used in audio coding mainly rely on evaluating whether the error signal is perceptually masked by the original audio signal to be encoded. This approach works well when the error signal is caused by a quantisation process typically used in waveform encoders. For parametric signal representations, however, such as SBR or IGF, the error signal will be large even when artefacts are hardly audible.

This is a consequence of the fact that the human auditory system does not process the exact waveform of an audio signal; in certain situations the auditory system is phase-insensitive, and the temporal envelope of a spectral band becomes the main auditory information that is evaluated. For example, different starting phases of a sinusoid (with smooth onsets and offsets) have no perceivable effect. For a harmonic complex tone, however, relative starting phases can be perceptually important, specifically when multiple harmonics fall within one auditory critical band [8]. The relative phases of these harmonics, as well as their amplitudes, will influence the temporal envelope shape that is represented within one auditory critical band and which, in principle, can be processed by the human auditory system.

In view of this situation, there is a need for a concept to compare audio signals and/or to decide about coding parameters which provides an improved tradeoff between computational complexity and perceptual relevance and/or which allows for the first time to use parametric techniques under control of a psychoacoustic model.

Summary of the Invention

An embodiment according to the invention creates an audio similarity evaluator.

The audio similarity evaluator is configured to obtain envelope signals for a plurality of (preferably overlapping) frequency ranges (for example, using a filterbank or a Gammatone filterbank and a rectification and a temporal low pass filtering and one or more adaptation processes which may, for example, model a pre-masking and/or a post-masking in an auditory system) on the basis of an input audio signal (for example, to perform an envelope demodulation in spectral sub-bands).
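As an illustration, the envelope demodulation described above (rectification of a bandpass output followed by temporal low-pass filtering) can be sketched as follows. This is a minimal sketch: the one-pole low-pass, the cut-off frequency, and feeding a pure tone in place of a real bandpass filter output are illustrative assumptions, not the specific filters of the embodiment.

```python
import math

def half_wave_rectify(x):
    # Keep only positive half-waves (a crude inner-hair-cell model)
    return [max(s, 0.0) for s in x]

def one_pole_lowpass(x, cutoff_hz, fs):
    # First-order IIR low-pass; coefficient derived from the RC analogue
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, state = [], 0.0
    for s in x:
        state = a * state + (1.0 - a) * s
        y.append(state)
    return y

def envelope(band_signal, fs, cutoff_hz):
    # Rectification followed by low-pass filtering yields the temporal envelope
    return one_pole_lowpass(half_wave_rectify(band_signal), cutoff_hz, fs)

# A 2 kHz tone standing in for the output of one bandpass filter
fs = 16000
tone = [math.sin(2.0 * math.pi * 2000.0 * n / fs) for n in range(fs // 10)]
env = envelope(tone, fs, cutoff_hz=100.0)
```

For a steady tone, the envelope settles near a constant value (about 1/pi for a unit-amplitude sine after half-wave rectification), while the fast carrier oscillation is removed.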

The audio similarity evaluator is configured to obtain a modulation information (for example, output signals of the modulation filters) associated with the envelope signals for a plurality of modulation frequency ranges (for example, using a modulation filterbank or using modulation filters), wherein the modulation information describes (for example, in the form of output signals of the modulation filterbank or in the form of output signals of the modulation filters) the modulation of the envelope signals (and may, for example, be considered as an internal representation). For example, the audio similarity evaluator may be configured to perform an envelope modulation analysis.

The audio similarity evaluator is configured to compare the obtained modulation information (for example, an internal representation) with a reference modulation information associated with a reference audio signal (for example, using an internal difference representation, wherein the internal difference representation may, for example, describe a difference between the obtained modulation information and the reference modulation information, wherein one or more weighting operations or modification operations may be applied, like a scaling of the internal difference representation based on a degree of co-modulation or an asymmetrical weighting of positive and negative values of the internal difference representation), in order to obtain an information about a similarity between the input audio signal and the reference audio signal (for example, a single value describing a perceptual similarity between the input audio signal and the reference audio signal).

This embodiment according to the invention is based on the finding that a modulation information, which is associated with envelope signals for a plurality of modulation frequency ranges, can be obtained with moderate effort (for example, using a first filterbank to obtain the envelope signals and using a second filterbank, which may be a modulation filterbank, to obtain the modulation information, wherein some minor additional processing steps will also be used to improve accuracy).

Moreover, it has been found that such a modulation information is well-adapted to the human hearing impression in many situations, which means that a similarity of the modulation information corresponds to a similar perception of an audio content, while a major difference of the modulation information typically indicates that an audio content will be perceived as being different. Thus, by comparing the modulation information of an input audio signal with the modulation information associated with a reference audio signal, it can be concluded whether the input audio signal will be perceived as being similar to the audio content of the reference audio signal or not. In other words, a quantitative measure which represents the similarity or difference between the modulation information associated with the input audio signal and the modulation information associated with the reference audio signal can serve as a (quantitative) similarity information, representing the similarity between the audio content of the input audio signal and the audio content of the reference audio signal in a perceptually-weighted manner.

Thus, the similarity information obtained by the audio similarity evaluator (for example, a single scalar value associated with a certain passage (for example, a frame) of the input audio signal and/or of the reference audio signal) is well-suited to determine (for example, in a quantitative manner) how much the "input audio signal" is perceptually degraded with respect to the reference audio signal (for example, if it is assumed that the input audio signal is a degraded version of the reference audio signal).

It has been found that this similarity measure may, for example, be used for determining the quality of a lossy audio encoding and, in particular, of a lossy non-waveform-preserving audio encoding. For example, the similarity information indicates a comparatively big deviation if the "modulation" (of the envelope signal) in one or more frequency ranges is changed significantly, which would typically result in a degraded hearing impression. On the other hand, the similarity information provided by the similarity evaluator would typically indicate a comparatively high similarity (or, equivalently, a comparatively small difference or deviation) if the modulation in different frequency bands is similar in the input audio signal and in the reference audio signal, even if the actual signal waveforms are substantially different. Thus, an outcome is in agreement with the finding that a human listener is typically not particularly sensitive to the actual waveform, but more sensitive with respect to modulation characteristics of an audio content in different frequency bands.

To conclude, the similarity evaluator described here provides a similarity information which is well-adapted to the human hearing impression.

In a preferred embodiment, the audio similarity evaluator is configured to apply a plurality of filters or filtering operations (for example, of a filterbank or of a Gammatone filterbank) having overlapping filter characteristics (e.g. overlapping passbands), in order to obtain the envelope signals (wherein, preferably, bandwidths of the filters or filtering operations are increasing with increasing center frequencies of the filters). For example, the different envelope signals may be associated with different acoustic frequency ranges of the input audio signal.

This embodiment is based on the finding that the envelope signals can be obtained with moderate effort using filters or filtering operations having overlapping filter characteristics, because this is well in agreement with the human auditory system. Furthermore, it has been found that it is advantageous to increase the bandwidth of the filters or filtering operations with increasing frequency, because this is well in agreement with the human auditory system and furthermore helps to keep the number of filters reasonably small while providing a good frequency resolution in the perceptually important low frequency region. Accordingly, the different envelope signals are typically associated with different acoustic frequency ranges of the input audio signal, which helps to obtain an accurate similarity information having a reasonable frequency resolution. For example, different signal degradation (for example, of the input audio signal with respect to the reference audio signal) in different frequency ranges can be considered in this manner.
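The increase of filter bandwidth with centre frequency can be illustrated with the widely used equivalent rectangular bandwidth (ERB) scale. The Glasberg-Moore ERB formula below is one common choice for spacing auditory filters and is used here purely as an illustrative assumption, not as the exact filterbank of the embodiment.

```python
def erb(f_hz):
    # Glasberg & Moore equivalent rectangular bandwidth in Hz:
    # narrow at low frequencies, growing roughly linearly at high ones
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_center_frequencies(f_lo=80.0, f_hi=8000.0, step=1.0):
    # Walk the frequency axis in steps of one ERB, yielding centre
    # frequencies whose spacing (and bandwidth) grows with frequency
    freqs, f = [], f_lo
    while f < f_hi:
        freqs.append(f)
        f += step * erb(f)
    return freqs

centers = erb_center_frequencies()
bandwidths = [erb(f) for f in centers]
```

With this spacing, the perceptually important low-frequency region is covered by many narrow bands, while a moderate total number of filters (a few dozen) suffices up to 8 kHz.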

In a preferred embodiment, the audio similarity evaluator is configured to apply a rectification (e.g. a half wave rectification) to output signals of the filters or filtering operation, to obtain a plurality of rectified signals (for example, to model inner hair cells).

By applying a rectification to the output signals of the filters or of the filtering operation, it is possible to assimilate a behavior of inner hair cells. Furthermore, the rectification in combination with a low-pass filter provides for envelope signals which reflect intensities in different frequency ranges. Also, due to the rectification (and possibly a low-pass filtering), a number representation is comparatively easy (for example, since only positive values need to be represented). Moreover, the phenomenon of phase locking and the loss thereof for higher frequencies is modeled by said processing.

In a preferred embodiment, the audio similarity evaluator is configured to apply a low-pass filter or a low-pass filtering (for example, having a cut-off frequency which is smaller than 2500 Hz or which is smaller than 1500 Hz) to the half-wave rectified signals (for example, to model inner hair cells).

By using a low-pass filter or a low-pass filtering (which may, for example, be applied separately to each envelope signal out of a plurality of envelope signals associated with different frequency ranges), an inertness of inner hair cells may be modeled. Furthermore, an amount of data samples is reduced by performing a low-pass filtering, and a further processing of the low-pass filtered (preferably rectified) bandpass signals is facilitated. Thus, the preferably rectified and low-pass filtered output signal of a plurality of filters or filtering operations may serve as the envelope signals.

In a preferred embodiment, the audio similarity evaluator is configured to apply an automatic gain control, in order to obtain the envelope signals.

By applying an automatic gain control in order to obtain the envelope signals, a dynamic range of the envelope signals may be limited, which reduces numeric problems. Furthermore, it has been found that the usage of an automatic gain control, which uses certain time constants for the adaptation of the gain, models masking effects that occur in a hearing system, such that a similarity of information obtained by the audio similarity evaluator reflects a human hearing impression.

In a preferred embodiment, the audio similarity evaluator is configured to vary a gain applied to derive the envelope signals on the basis of rectified and low-pass filtered signals provided by a plurality of filters or filter operations on the basis of the input audio signal.

It has been found that varying a gain, which is applied to derive the envelope signals on the basis of rectified and low-pass filtered signals provided by a plurality of filters or filter operations (on the basis of the input audio signal) is an efficient means for implementing an automatic gain control. It has been found that the automatic gain control can easily be implemented after the rectification and low-pass filtering of signals provided by a plurality of filters or filter operations. In other words, the automatic gain control is applied individually per frequency range, and it has been found that such a behavior is well in agreement with the human auditory system.

In a preferred embodiment, the audio similarity evaluator is configured to process rectified and low-pass filtered versions of signals provided by a plurality of filters or filtering operations (e.g. provided by the Gammatone filterbank) on the basis of the input audio signal using a series of two or more adaptation loops (preferably five adaptation loops), which apply a time-variant scaling in dependence on time variant gain values (for example, to effect a multi-stage automatic gain control, wherein the gain value is set to a comparatively small value for a comparatively large input signal or output signal of a respective stage, and wherein a gain value is set to a comparatively larger value for a comparatively smaller input value or output value of the respective stage). Optionally, there is a limitation of one or more output signals, for example to limit or avoid overshoots, e.g. a "Limiter".

The audio similarity evaluator is configured to adjust different time variant gain values (which are associated with different stages within the series of adaptation loops) using different time constants (for example, to model a pre-masking at an onset of an audio signal and/or to model a post-masking after an offset of an audio signal).

It has been recognized that the usage of a series of two or more adaptation loops which apply a time-variant scaling in dependence on time-variant gain values is well-adapted to model different time constants which occur in the human auditory system. When adjusting the different time variant gain values, which are used in different of the cascaded adaptation loops, different time constants of pre-masking and post-masking can be considered. Also, additional adaptation masking processes, which occur in the human auditory system, can be modeled in such a manner with moderate computational effort. For example, the different time constants, which are used to adjust different of the time variant gain values, may be chosen in accordance with the corresponding time constants of the human auditory system.

To conclude, using a series (or a cascade) of two or more adaptation loops, which apply a time-variant scaling in dependence on time-variant gain values, provides envelope signals which are well-suited for obtaining a similarity information describing a similarity between an input audio signal and a reference audio signal.
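A much-simplified sketch of such cascaded adaptation loops is given below. The five time constants and the divide-by-state update are modelled loosely on published adaptation-loop designs and are purely illustrative; they are not the embodiment's exact parameters. Each stage divides by a low-pass-filtered copy of its own output, so an onset passes with a large gain (pre-masking-like overshoot) before the gain settles.

```python
import math

def adaptation_loops(env, fs, time_constants=(0.005, 0.05, 0.129, 0.253, 0.5)):
    # Cascade of divider stages: each stage applies a time-variant gain
    # (division by its state) and tracks its own output with its own
    # time constant, yielding onset overshoot and compressed steady state.
    states = [1.0] * len(time_constants)
    coeffs = [math.exp(-1.0 / (fs * tc)) for tc in time_constants]
    out = []
    for x in env:
        v = x
        for i in range(len(states)):
            v = v / max(states[i], 1e-5)   # small state -> large gain
            a = coeffs[i]
            states[i] = a * states[i] + (1.0 - a) * v
        out.append(v)
    return out

# A sudden onset: constant-intensity envelope segment of amplitude 4
fs = 16000
out = adaptation_loops([4.0] * (2 * fs), fs)
```

The first output sample passes almost unattenuated, while the steady-state output is strongly compressed, mimicking the adaptation behaviour described above.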

In a preferred embodiment, the audio similarity evaluator is configured to apply a plurality of modulation filters (for example, of a modulation filterbank) having different (but possibly overlapping) passbands to the envelope signals (for example, such that components of the envelope signals having different modulation frequencies are at least partially separated), to obtain the modulation information (wherein, for example, a plurality of modulation filters associated with different modulation frequency ranges are applied to a first envelope signal associated with a first acoustic frequency range wherein, for example, a plurality of modulation filters associated with the different modulation frequency ranges are applied to a second envelope signal associated with a second acoustic frequency range which is different from the first acoustic frequency range).

It has been found that a meaningful information representing a modulation of envelope signals (associated with different frequency ranges) can be obtained with little effort using modulation filters which filter the envelope signals. For example, applying a set of modulation filters having different passbands to one of the envelope signals results in a set of signals (or values) for the given envelope signal (or associated with the given envelope signal, or associated with a frequency range of the input audio signal). Thus, a plurality of modulation signals may be obtained on the basis of a single envelope signal, and different sets of modulation signals may be obtained on the basis of a plurality of envelope signals. Each of the modulation signals may be associated with a modulation frequency or a range of modulation frequencies. Consequently, the modulation signals (which may be output by the modulation filters) or, more precisely, an intensity thereof may describe how an envelope signal (associated with a certain frequency range) is modulated (for example, time-modulated). Thus, separate sets of modulation signals may be obtained for the different envelope signals.

These modulation signals may be used to obtain the modulation information, wherein different post-processing operations may be used to derive the modulation information (which is compared with the modulation information associated with the reference audio signal) from the modulation signals provided by the modulation filters.
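A minimal modulation filter can be sketched as a complex one-pole resonator applied to a DC-free envelope; the centre frequencies, the Q factor, and the complex-magnitude readout are illustrative assumptions rather than the embodiment's exact filter design.

```python
import cmath
import math

def modulation_filter(env_ac, fs, fc_hz, q=2.0):
    # Complex one-pole resonator centred at fc_hz; taking the magnitude
    # of its complex output later discards the phase of the modulation
    bw = fc_hz / q
    r = math.exp(-math.pi * bw / fs)
    pole = r * cmath.exp(2j * math.pi * fc_hz / fs)
    y, state = [], 0j
    for x in env_ac:
        state = pole * state + (1.0 - r) * x
        y.append(state)
    return y

# Envelope modulated at 4 Hz; mean removed (cf. the DC-removal step)
fs = 1000
env = [1.0 + 0.5 * math.sin(2.0 * math.pi * 4.0 * n / fs) for n in range(2 * fs)]
mean = sum(env) / len(env)
ac = [x - mean for x in env]
# Compare the response of a matched (4 Hz) and a mismatched (64 Hz) filter
peak_4 = max(abs(v) for v in modulation_filter(ac, fs, 4.0)[fs:])
peak_64 = max(abs(v) for v in modulation_filter(ac, fs, 64.0)[fs:])
```

The filter tuned to the actual modulation frequency responds much more strongly, so the set of filter-output intensities characterises how the envelope is modulated.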

To conclude, it has been found that the usage of a plurality of modulation filters is a simple-to-implement approach which can be used in the derivation of the modulation information.

In a preferred embodiment, the modulation filters are configured to at least partially separate components of the envelope signal having different frequencies (e.g. different modulation frequencies), wherein a center frequency of a first, lowest frequency modulation filter is smaller than 5 Hz, and wherein a center frequency of a highest frequency modulation filter is in a range between 200 Hz and 300 Hz.

It has been found that using such center frequencies of the modulation filters covers a range of modulation frequencies which is most relevant for the human perception.

In a preferred embodiment, the audio similarity evaluator is configured to remove DC components when obtaining the modulation information (for example, by low-pass filtering output signals of the modulation filters, for example, with a cut-off frequency of half a center frequency of the respective modulation filter, and by subtracting signals resulting from the low-pass filtering from the output signals of the modulation filters).

It has been found that a removal of DC components when obtaining the modulation information helps to avoid a degradation of the modulation information by strong DC components which are typically included in the envelope signals. Also, by using a DC removal when obtaining the modulation information on the basis of the envelope signals, a steepness of modulation filters can be kept reasonably small, which facilitates the implementation of the modulation filters.
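The DC-removal step can be sketched as follows, assuming (as suggested above) a low-pass filter at half the modulation filter's centre frequency whose output is subtracted from the signal; the first-order filter is an illustrative choice.

```python
import math

def remove_dc(mod_out, fs, fc_hz):
    # Low-pass the signal at half the modulation filter's centre
    # frequency and subtract, leaving only the AC (modulation) part
    cutoff = fc_hz / 2.0
    a = math.exp(-2.0 * math.pi * cutoff / fs)
    ac, state = [], 0.0
    for x in mod_out:
        state = a * state + (1.0 - a) * x   # running DC estimate
        ac.append(x - state)
    return ac

# A purely constant (DC) input should be suppressed towards zero
fs = 1000
ac = remove_dc([1.0] * (2 * fs), fs, fc_hz=4.0)
```

After the DC estimate has settled, a constant input is cancelled almost completely, so only genuine envelope modulation survives.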

In a preferred embodiment, the audio similarity evaluator is configured to remove a phase information when obtaining the modulation information.

By removing a phase information, it is possible to neglect such information, which is typically not of particularly high relevance for a human listener under many circumstances, in the comparison of the modulation information associated with the input audio signal with the modulation information associated with the reference audio signal. It has been found that the phase information of the output signals of the modulation filters would typically degrade the comparison result, in particular if non-waveform-preserving modification (like, for example, a non-waveform-preserving encoding and decoding operation) is applied to the input audio signal. Thus, it is avoided to classify an input audio signal and a reference audio signal as having a small level of similarity, even though a human perception would classify the signals as being very similar.

In a preferred embodiment, the audio similarity evaluator is configured to derive a scalar value representing a difference between the obtained modulation information (for example, an internal representation) and the reference modulation information associated with a reference audio signal (for example, a value representing a sum of squared differences between the obtained modulation information, which may comprise sample values for a plurality of acoustic frequency ranges and for a plurality of modulation frequency ranges per acoustic frequency range, and the reference modulation information, which may also comprise sample values for a plurality of acoustic frequency ranges and for a plurality of modulation frequency ranges per acoustic frequency range).

It has been found that a (single) scalar value may well represent differences between the modulation information associated with the input audio signal and modulation information associated with the reference audio signal. For example, the modulation information may comprise individual signals or values for different modulation frequencies and for a plurality of frequency ranges. By combining differences between all these signals or values into a single scalar value (which may take the form of a "distance measure" or a "norm"), it is possible to have a compact and meaningful assessment of the similarity between the input audio signal and the reference audio signal. Also, such a single scalar value may easily be usable by a mechanism for selecting coding parameters (for example, encoding parameters and/or decoding parameters), or for deciding about any other audio signal processing parameters which may be applied for a processing of the input audio signal.
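As a sketch, collapsing the per-band, per-modulation-frequency differences into one scalar could look like the following, where the keys of the hypothetical dictionaries identify (acoustic band, modulation band) cells and the scalar is a simple L2 norm; the actual embodiment may weight the differences before combining them.

```python
import math

def modulation_distance(mod_a, mod_b):
    # Each key is an (acoustic band, modulation band) cell; the scalar
    # distance is the square root of the summed squared differences
    keys = set(mod_a) | set(mod_b)
    return math.sqrt(sum((mod_a.get(k, 0.0) - mod_b.get(k, 0.0)) ** 2
                         for k in keys))

ref = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.2}
coded = {(0, 0): 1.0, (0, 1): 0.1, (1, 0): 0.5}
d = modulation_distance(ref, coded)
```

Identical modulation information yields a distance of zero; any mismatch in any cell increases the single scalar value.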

It has been found that the determination of a difference representation may be an efficient intermediate step for deriving the similarity information. For example, the difference representation may represent differences between different modulation frequency bins (wherein, for example, a separate set of modulation frequency bins may be associated with different envelope signals being associated with different frequency ranges) when comparing the input audio signal with the reference audio signal.

For example, the difference representation may be a vector, wherein each entry of the vector may be associated with a modulation frequency and with a frequency range (of the input audio signal or of the reference audio signal) under consideration. Such a difference representation is well-suited for a post-processing, and also allows for a simple derivation of a single scalar value representing the similarity information.

In a preferred embodiment, the audio similarity evaluator is configured to determine a difference representation (for example, IDR) in order to compare the obtained modulation information (for example, an internal representation) with the reference modulation information associated with a reference audio signal.

In a preferred embodiment, the audio similarity evaluator is configured to adjust a weighting of a difference between the obtained modulation information (for example, an internal representation) and the reference modulation information associated with a reference audio signal in dependence on a comodulation between the obtained envelope signals or modulation information in two or more adjacent acoustic frequency ranges or between envelope signals associated with the reference signal or between the reference modulation information in two or more adjacent acoustic frequency ranges (wherein, for example, an increased weight is given to the difference between the obtained modulation information and the reference modulation information in case that a comparatively high degree of comodulation is found when compared to a case in which a comparatively low degree of comodulation is found) (and wherein the degree of comodulation is, for example, found by determining a covariance between temporal envelopes associated with different acoustic frequency ranges).

It has been found that adjusting the weighting of the difference between the obtained modulation information and the reference modulation information (which may, for example, be represented by the "difference representation") in dependence on the comodulation information is advantageous because differences between the modulation information may be perceived as stronger by a human listener if there is a comodulation in adjacent frequency ranges. For example, by associating an increased weight to the difference between the obtained modulation information and the reference modulation information in the case that a comparatively high degree of comodulation is found when compared to a case in which a comparatively low degree or amount of comodulation is found, the determination of the similarity information can be adapted to characteristics of the human auditory system. Consequently, the quality of the similarity information can be improved.
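A possible sketch of such a comodulation-dependent weight uses the normalised covariance (correlation) between the temporal envelopes of adjacent bands; the weight range of 1.0 to 2.0 and the clipping of negative correlations are illustrative assumptions.

```python
import math

def comodulation_weight(env_a, env_b, w_lo=1.0, w_hi=2.0):
    # Correlation between adjacent-band envelopes; strong comodulation
    # maps to a larger weight on the modulation differences
    n = len(env_a)
    ma, mb = sum(env_a) / n, sum(env_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(env_a, env_b)) / n
    va = sum((a - ma) ** 2 for a in env_a) / n
    vb = sum((b - mb) ** 2 for b in env_b) / n
    if va == 0.0 or vb == 0.0:
        return w_lo
    corr = max(0.0, cov / math.sqrt(va * vb))  # ignore anti-correlation
    return w_lo + (w_hi - w_lo) * corr

up = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
anti = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

Fully comodulated envelopes receive the maximum weight, while uncorrelated or anti-correlated envelopes fall back to the baseline weight.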

In a preferred embodiment, the audio similarity evaluator is configured to put a higher weighting on differences between the obtained modulation information (for example, an internal representation) and the reference modulation information associated with a reference audio signal indicating that the input audio signal comprises an additional signal component when compared to differences between the obtained modulation information (for example, an internal representation) and the reference modulation information associated with a reference audio signal indicating that the input audio signal lacks a signal component when determining the information about the similarity between the input audio signal and the reference audio signal (for example, a single scalar value describing the information about the similarity).

Putting higher weighting on differences between the obtained modulation information and the reference modulation information associated with a reference signal indicating that the audio signal comprises an additional signal component (when compared to differences indicating that the input audio signal lacks a signal component) emphasizes a contribution of added signals (or signal components, or carriers) when determining an information about the difference between the input audio signal and the reference audio signal. It has been found that added signals (or signal components or carriers) are typically perceived as being more distorting when compared to missing signals (or signal components or carriers). This fact can be considered by such an "asymmetric" weighting of positive and negative differences between the modulation information associated with the input audio signal and the modulation information associated with the reference audio signal. A similarity information can be adapted to the characteristics of the human auditory system in this manner.

In a preferred embodiment, the audio similarity evaluator is configured to weight positive and negative values of a difference between the obtained modulation information and the reference modulation information (which typically comprises a large number of values) using different weights when determining the information about the similarity between the input audio signal and the reference audio signal.

By applying different weights to positive and negative values of the difference between the obtained modulation information and the reference modulation information (or, more precisely, between entries of a vector as mentioned above), the different impact of added and missing signals or signal components or carriers can be considered with very small computational effort.
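The asymmetric weighting can be sketched in a few lines; the weight values 2.0 (added components) and 1.0 (missing components) are arbitrary placeholders, not the embodiment's tuned values.

```python
def weight_asymmetrically(diff, w_added=2.0, w_missing=1.0):
    # Positive entries (modulation energy present in the input but not
    # in the reference, i.e. added components) are weighted more
    # strongly than negative ones (missing components)
    return [d * (w_added if d > 0.0 else w_missing) for d in diff]

weighted = weight_asymmetrically([0.3, -0.3, 0.0])
```

A positive and a negative difference of equal magnitude thus contribute differently to the final similarity value, reflecting that added components are more audible.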

Another embodiment according to the invention creates an audio encoder for encoding an audio signal. The audio encoder is configured to determine one or more coding parameters (for example, encoding parameters or decoding parameters, which are preferably signaled to an audio decoder by the audio encoder) in dependence on an evaluation of a similarity between an audio signal to be encoded and an encoded audio signal. The audio encoder is configured to evaluate the similarity between the audio signal to be encoded and the encoded audio signal (for example, a decoded version thereof) using an audio similarity evaluator as discussed herein (wherein the audio signal to be encoded is used as the reference audio signal and wherein a decoded version of an audio signal encoded using one or more candidate parameters is used as the input audio signal for the audio similarity evaluator).

This audio encoder is based on the finding that the above mentioned determination of the similarity information is well-suited for an assessment of a hearing impression obtainable by an audio encoding. For example, by obtaining the similarity information using an audio signal to be encoded as a reference signal and using an encoded and subsequently decoded version of the audio signal to be encoded as the input audio signal for the determination of the similarity information, it can be evaluated whether the encoding and decoding process is suited to reconstruct the audio signal to be encoded with little perceptual losses. However, the above mentioned determination of the similarity information focuses on the hearing impression which can be achieved, rather than on an agreement of waveforms. Accordingly, it can be found out, using the similarity information obtained, which coding parameters (out of a certain choice of coding parameters) provide a best (or at least sufficiently good) hearing impression. Thus, the above mentioned determination of the similarity information can be used to make a decision about the coding parameter without requiring identity (or similarity) of waveforms.

Accordingly, the coding parameters can be chosen reliably, while avoiding impractical restrictions (like waveform-similarity).

In a preferred embodiment, the audio encoder is configured to encode one or more bandwidth extension parameters which define a processing rule to be used at the side of an audio decoder to derive a missing audio content (for example, a high frequency content, which is not encoded in a waveform-preserving manner by the audio encoder) on the basis of an audio content of a different frequency range encoded by the audio encoder (e.g. the audio encoder is a parametric or semi-parametric audio encoder).

It has been found that the above-mentioned determination of the similarity information is well-suited for the selection of bandwidth extension parameters. It should be noted that a parametric bandwidth extension, which is controlled by bandwidth extension parameters, is typically not waveform-preserving. Also, it has been found that the above mentioned determination of the similarity of audio signals is very well-suited for assessing similarities or differences in a higher audio frequency range, in which the bandwidth extension is typically active, and in which the human auditory system is typically insensitive to phase. Thus, the concept allows judging bandwidth extension concepts, which may, for example, derive high-frequency components on the basis of low-frequency components, in an efficient and perceptually accurate manner.

In a preferred embodiment, the audio encoder is configured to use an Intelligent Gap Filling (for example, as defined in the MPEG-H 3D Audio standard, for example, in the version available on the filing date of the present application, or in modifications thereof), and the audio encoder is configured to determine one or more parameters of the Intelligent Gap Filling using an evaluation of the similarity between the audio signal to be encoded and the encoded audio signal (wherein, for example, the audio signal to be encoded is used as the reference audio signal and wherein, for example, a decoded version of an audio signal encoded using one or more candidate intelligent gap filling parameters is used as the input audio signal for the audio similarity evaluation).

It has been found that the above-mentioned concept for the evaluation of similarities between audio signals is well-suited for usage in the context of an “intelligent gap filling", because the determination of the similarity between audio signals considers criteria, which are highly important for the hearing impression.

In a preferred embodiment, the audio encoder is configured to select one or more associations between a source frequency range and a target frequency range for a bandwidth extension (for example, an association which determines on the basis of which source frequency range out of a plurality of selectable source frequency ranges an audio content of a target frequency range should be determined) and/or one or more processing operation parameters for a bandwidth extension (which may, for example determine parameters of a processing operation, like a whitening operation or a random noise replacement, which is executed when providing an audio content of a target frequency range on the basis of a source frequency range, and/or an adaptation of tonal properties and/or an adaptation of a spectral envelope) in dependence on the evaluation of a similarity between an audio signal to be encoded and an encoded audio signal.

It has been found that the selection of one or more associations between a source frequency range and a target frequency range and/or the selection of one or more processing operation parameters for a bandwidth extension may be performed with good results using the above mentioned approach for the evaluation of a similarity between audio signals. By comparing an "original” audio signal to be encoded with an encoded and decoded version (encoded and decoded again using a specific association and/or a specific processing between a source frequency range and a target frequency range, or between source frequency ranges and target frequency ranges), it can be judged whether the specific association provides a hearing impression similar to the original or not.

The same also holds for the choice of other processing operation parameters. Thus, by checking, for different settings of the audio encoding (and of the audio decoding), how good the encoded and decoded audio signal agrees with the (original) input audio signal, it can be found out which specific association (between a source frequency range and a target frequency range, or between source frequency ranges and target frequency ranges) provides the best similarity (or at least a sufficiently good similarity) when comparing the encoded and decoded version of the audio content with the original version of the audio content. Thus, adequate encoding settings (for example, an adequate association between a source frequency range and a target frequency range) can be chosen. Moreover, additional processing operation parameters may also be selected using the same approach.
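
The selection described above can be sketched as a simple search over candidate coding parameters. The callables `encode`, `decode` and `similarity` stand in for the actual codec and the audio similarity evaluator; this is a hypothetical interface for illustration only:

```python
def select_coding_params(signal, candidates, encode, decode, similarity):
    """Pick the candidate coding parameters whose encoded-and-decoded
    version of the signal is most similar to the original.

    The original signal serves as the reference audio signal, and each
    decoded candidate serves as the input audio signal of the similarity
    evaluation, as described in the text. All callables are placeholders.
    """
    best_params, best_score = None, float("-inf")
    for params in candidates:
        decoded = decode(encode(signal, params))
        score = similarity(decoded, signal)  # input vs. reference
        if score > best_score:
            best_params, best_score = params, score
    return best_params
```

The same loop applies whether the candidates are tile associations, whitening levels, or other processing operation parameters.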

In a preferred embodiment, the audio encoder is configured to select one or more associations between a source frequency range and a target frequency range for a bandwidth extension. The audio encoder is configured to selectively allow or prohibit a change of an association between a source frequency range and a target frequency range in dependence on an evaluation of a modulation of an envelope (for example, of an audio signal to be encoded) in an old or a new target frequency range.

By using such a concept, a change of an association between a source frequency range and a target frequency range can be prohibited, if such a change of the association between the source frequency range and the target frequency range would bring along noticeable artefacts. Thus, a switching between frequency shifts of the intelligent gap filling may be limited. For example, a change of the association between the source frequency range and the target frequency range may selectively be allowed if it is found that there is a sufficient modulation of the envelope (for example, higher than a certain threshold) which (sufficiently) masks the modulation caused by the change of the association.

In a preferred embodiment, the audio encoder is configured to determine a modulation strength of an envelope in an (old or new) target frequency range in a modulation frequency range corresponding to a frame rate of the encoder and to determine a sensitivity measure in dependence on the determined modulation strength (for example, such that the sensitivity measure is inversely proportional to the modulation strength).

The audio encoder is configured to decide whether it is allowed or prohibited to change an association between a target frequency range and a source frequency range in dependence on the sensitivity measure (for example, only to allow a change of an association between a target frequency range and a source frequency range when the sensitivity measure is smaller than a predetermined threshold value, or only to allow a change of an association between a target frequency range and a source frequency range when there is a modulation strength which is larger than a threshold level in the target frequency range).

Accordingly, it can be achieved that the change of the association between a target frequency range and a source frequency range only occurs if a (parasitic) modulation caused by such a change is sufficiently masked by the (original) modulation in the target frequency range (into which the parasitic modulation would be introduced). Thus, audible artefacts can be avoided efficiently.
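
A possible sketch of such a gating decision, with the sensitivity measure taken as inversely proportional to the modulation strength as suggested above; the threshold value is an illustrative assumption:

```python
def may_change_association(modulation_strength, threshold=0.2):
    """Gate a change of the source-to-target frequency range association.

    The sensitivity measure is modeled as inversely proportional to the
    modulation strength of the target-range envelope at the encoder frame
    rate, as suggested in the text. A change is allowed only when the
    sensitivity is low enough, i.e. when existing envelope modulation can
    mask the parasitic modulation the change would introduce. The threshold
    value 0.2 is an illustrative assumption.
    """
    eps = 1e-9  # avoid division by zero for a flat envelope
    sensitivity = 1.0 / (modulation_strength + eps)
    return sensitivity < 1.0 / threshold  # equivalently: strength > threshold
```

A strongly modulated target range thus permits the change, while a nearly flat envelope prohibits it.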

An embodiment according to the present invention creates an audio encoder for encoding an audio signal, wherein the audio encoder is configured to determine one or more coding parameters in dependence on an audio signal to be encoded using a neural network. The neural network is trained using an audio similarity evaluator as discussed herein.

By using a neural network, which is trained using the audio similarity evaluator mentioned above, to decide about the one or more coding parameters, a computational complexity can further be reduced. In other words, the audio similarity evaluation, as mentioned herein, can be used to provide the training data for a neural network, and the neural network can adapt itself (or can be adapted) to make coding parameter decisions which are sufficiently similar to coding parameter decisions which would be obtained by assessing the audio quality using the audio similarity evaluator.
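
One way such training data could be generated is sketched below; all interfaces (`encode`, `decode`, `evaluator`) are hypothetical placeholders, and the evaluator labels each training signal with the index of the best-scoring candidate coding parameters:

```python
def make_training_data(signals, candidates, encode, decode, evaluator):
    """Label training examples for a neural network that will later predict
    coding-parameter decisions directly.

    For each signal, the audio similarity evaluator scores every candidate
    parameter set (decoded candidate vs. original reference), and the index
    of the best candidate becomes the target class. All callables are
    placeholders for the actual codec and evaluator.
    """
    data = []
    for sig in signals:
        scores = [evaluator(decode(encode(sig, p)), sig) for p in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        data.append((sig, best))  # (features, target class)
    return data
```

At inference time, the trained network replaces the costly per-candidate evaluation, which is the complexity reduction mentioned above.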

An embodiment according to the present invention creates an audio similarity evaluator.

The audio similarity evaluator is configured to obtain envelope signals for a plurality of (preferably overlapping) frequency ranges (for example, using a filterbank or a Gammatone filterbank and a rectification and a temporal low pass filtering and one or more adaptation processes which may, for example, model a pre-masking and/or a post-masking in an auditory system) on the basis of an input audio signal (for example, to perform an envelope demodulation in spectral sub-bands).

The audio similarity evaluator is configured to compare an analysis representation of the input audio signal (for example, an “internal representation", like the obtained modulation information or a time-frequency-domain representation) with a reference analysis representation associated with a reference audio signal (for example, using an internal difference representation, wherein the internal difference representation may, for example describe a difference between the obtained analysis representation and the reference analysis representation, wherein one or more weighting operations or modification operations may be applied, like a scaling of the internal difference representation based on a degree of co-modulation or an asymmetrical weighting of positive and negative values of the internal difference representation), in order to obtain an information about a similarity between the input audio signal and the reference audio signal (for example, a single value describing a perceptual similarity between the input audio signal and the reference audio signal).

The audio similarity evaluator is configured to adjust a weighting of a difference between the obtained analysis representation (e.g. a modulation information; for example, an internal representation) and the reference analysis representation (for example, a reference modulation information associated with a reference audio signal) in dependence on a comodulation (e.g. between the obtained envelope signals or an obtained modulation information) in two or more adjacent acoustic frequency ranges of the input audio signal or in dependence on a comodulation (e.g. between envelope signals associated with the reference signal or between the reference modulation information) in two or more adjacent acoustic frequency ranges of the reference audio signal (wherein, for example, an increased weight is given to the difference in case that a comparatively high degree of comodulation is found when compared to a case in which a comparatively low degree of comodulation is found) (and wherein the degree of comodulation is, for example, found by determining a covariance between temporal envelopes associated with different acoustic frequency ranges).

This embodiment is based on the finding that a comodulation in two or more adjacent frequency ranges typically has the effect that distortions in such comodulated frequency ranges are perceived more strongly than distortions in non-comodulated (or weakly comodulated) adjacent frequency ranges. Accordingly, by weighting deviations between audio signals to be compared (for example, between an input audio signal and a reference audio signal) relatively more strongly in strongly comodulated frequency ranges (when compared to a weighting in non-comodulated or more weakly comodulated frequency ranges), the evaluation of the audio quality can be performed in a manner which is well-adapted to human perception. Typically, differences between obtained analysis representations, which may be based on envelope signals for a plurality of frequency ranges, may be compared, and in such analysis representations, frequency ranges which comprise a comparatively higher comodulation may be weighted more strongly than frequency ranges comprising a comparatively smaller comodulation. Accordingly, the similarity evaluation may be well-adapted to a human perception.
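
A minimal sketch of deriving such a comodulation-dependent weight from the covariance of two temporal envelopes, as the text suggests; the covariance threshold and the two weight values are illustrative assumptions:

```python
def comodulation_weight(env_a, env_b, w_low=1.0, w_high=2.0, cut=0.0):
    """Derive a difference weight from the degree of comodulation of two
    adjacent frequency ranges.

    The degree of comodulation is measured, as suggested in the text, by the
    covariance of the temporal envelopes of the two ranges. The mapping from
    covariance to weight is an illustrative choice: strongly comodulated
    ranges receive the higher weight w_high, so that deviations there count
    more in the similarity evaluation.
    """
    n = len(env_a)
    mean_a = sum(env_a) / n
    mean_b = sum(env_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(env_a, env_b)) / n
    return w_high if cov > cut else w_low
```

A real implementation might use a normalized correlation or a continuous weight mapping instead of this hard threshold.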

An embodiment according to the invention creates a method for evaluating a similarity between audio signals.

The method comprises obtaining envelope signals for a plurality of (preferably overlapping) frequency ranges (for example, using a filterbank or a Gammatone filterbank and a rectification and a temporal low pass filtering and one or more adaptation processes which may, for example, model a pre-masking and/or a post-masking in an auditory system) on the basis of an input audio signal (for example, to perform an envelope demodulation in spectral sub-bands).

The method comprises obtaining a modulation information (for example, output signals of the modulation filters) associated with the envelope signals for a plurality of modulation frequency ranges (for example, using a modulation filterbank or using modulation filters). The modulation information describes (for example, in the form of output signals of the modulation filterbank or in the form of output signals of the modulation filters) the modulation of the envelope signals (for example, temporal envelope signals or spectral envelope signals). The modulation information may, for example, be considered as an internal representation and may, for example, be used to perform an envelope modulation analysis.

The method comprises comparing the obtained modulation information (for example, an internal representation) with a reference modulation information associated with a reference audio signal (for example, using an internal difference representation, wherein the internal difference representation may, for example describe a difference between the obtained modulation information and the reference modulation information, wherein one or more weighting operations or modification operations may be applied, like a scaling of the internal difference representation based on a degree of co-modulation or an asymmetrical weighting of positive and negative values of the internal difference representation), in order to obtain an information about a similarity between the input audio signal and the reference audio signal (for example, a single value describing a perceptual similarity between the input audio signal and the reference audio signal).

An embodiment according to the invention creates a method for encoding an audio signal, wherein the method comprises determining one or more coding parameters in dependence on an evaluation of a similarity between an audio signal to be encoded and an encoded audio signal, and wherein the method comprises evaluating the similarity between the audio signal to be encoded and the encoded audio signal as discussed herein (wherein, for example, the audio signal to be encoded is used as the reference audio signal and wherein a decoded version of an audio signal encoded using one or more candidate parameters is used as the input audio signal for the audio similarity evaluator).

An embodiment according to the invention creates a method for encoding an audio signal.

The method comprises determining one or more coding parameters in dependence on an audio signal to be encoded using a neural network,

wherein the neural network is trained using a method for evaluating a similarity between audio signals as discussed herein.

An embodiment according to the invention creates a method for evaluating a similarity between audio signals (for example, between an input audio signal and a reference audio signal).

The method comprises obtaining envelope signals for a plurality of (preferably overlapping) frequency ranges (for example, using a filterbank or a Gammatone filterbank and a rectification and a temporal low pass filtering and one or more adaptation processes which may, for example, model a pre-masking and/or a post-masking in an auditory system) on the basis of an input audio signal (for example, to perform an envelope demodulation in spectral sub-bands).

The method comprises comparing an analysis representation of the input audio signal (for example, an "internal representation", like the obtained modulation information or a time-frequency-domain representation) with a reference analysis representation associated with a reference audio signal (for example, using an internal difference representation, wherein the internal difference representation may, for example describe a difference between the obtained analysis representation and the reference analysis representation, wherein one or more weighting operations or modification operations may be applied, like a scaling of the internal difference representation based on a degree of co-modulation or an asymmetrical weighting of positive and negative values of the internal difference representation), in order to obtain an information about a similarity between the input audio signal and the reference audio signal (for example, a single value describing a perceptual similarity between the input audio signal and the reference audio signal).

The method comprises adjusting a weighting of a difference between the obtained analysis representation (e.g. a modulation information; for example, an internal representation) and the reference analysis representation (for example, a reference modulation information associated with a reference audio signal) in dependence on a comodulation. For example, the weighting is adjusted in dependence on a comodulation (e.g. between the obtained envelope signals or an obtained modulation information) in two or more adjacent acoustic frequency ranges of the input audio signal. Alternatively, the weighting is adjusted in dependence on a comodulation (e.g. between envelope signals associated with the reference signal or between the reference modulation information) in two or more adjacent acoustic frequency ranges of the reference audio signal. For example, an increased weight is given to the difference in case that a comparatively high degree of comodulation is found when compared to a case in which a comparatively low degree of comodulation is found. The degree of comodulation is, for example, found by determining a covariance between temporal envelopes associated with different acoustic frequency ranges.

These methods are based on the same considerations as the above-mentioned audio similarity evaluators and the above-mentioned audio encoders.

Moreover, the methods can be supplemented by any features, functionalities and details discussed herein with respect to the audio similarity evaluators and with respect to the audio encoders. The methods can be supplemented by such features, functionalities and details both individually and taken in combination.

An embodiment according to the invention creates a computer program for performing the methods as discussed herein when the computer program runs on a computer.

The computer program can be supplemented by any of the features, functionalities and details described herein with respect to the corresponding apparatuses and methods.

Brief Description of the Figures

Embodiments according to the present invention will subsequently be described taking reference to the enclosed figures in which:

Fig. 1 shows a block schematic diagram of an audio similarity evaluator, according to an embodiment of the present invention;

Fig. 2a, 2b show a block schematic diagram of an audio similarity evaluator, according to an embodiment of the present invention;

Fig. 3 shows a block schematic diagram of an audio encoder with automated selection, according to an embodiment of the present invention;

Fig. 4 shows a block schematic diagram of an audio encoder with change gating according to an embodiment of the present invention;

Fig. 5a shows a block schematic diagram of an audio encoder with a neural net in an operation mode, according to an embodiment of the present invention;

Fig. 5b shows a block schematic diagram of a neural net for use in an audio encoder in a training mode, according to an embodiment of the present invention;

Fig. 6 shows a block schematic diagram of an audio similarity evaluator, according to an embodiment of the present invention;

Fig. 7 shows a schematic representation of a signal flow and of processing blocks of a Dau et al. auditory processing model;

Fig. 8 shows a schematic representation of gamma-tone filterbank impulse responses;

Fig. 9 shows a schematic representation of an Organ of Corti (modified from [14]);

Fig. 10 shows a block schematic diagram of an audio decoder using IGF;

Fig. 11 shows a schematic representation of an IGF tile selection;

Fig. 12 shows a block schematic diagram of a generation of IGF automated choice items;

Fig. 13 shows a schematic representation of a choice of IGF tiles for the audio excerpt "trilogy” through automated control, wherein for each frame (circles), the source tile “sT” choice [0,1,2,3] is shown for each of the three target tiles as a black line overlaid on the spectrogram;

Fig. 14 shows a schematic representation of a choice of IGF whitening levels for the audio excerpt “trilogy” through automated control, wherein for each frame (circles), the whitening level choice [0,1,2] is shown for each of the three target tiles as a black line overlaid on the spectrogram;

Table 1 shows items of a listening test;

Table 2 shows conditions of a listening test;

Fig. 15 shows a graphic representation of absolute MUSHRA scores of proposed automated and fixed IGF controls; and

Fig. 16 shows a graphic representation of difference MUSHRA scores comparing proposed automated against fixed IGF control.

Detailed Description of the Embodiments

In the following, embodiments according to the present application will be described. However, it should be noted that the embodiments described in the following can be used individually, and can also be used in combination.

Moreover, it should be noted that features, functionalities and details described with respect to the following embodiments can optionally be introduced into any of the embodiments as defined by the claims, both individually and taken in combination.

Moreover, it should be noted that the embodiments described in the following can optionally be supplemented by any of the features, functionalities and details as defined in the claims.

1. Audio Similarity Evaluator According to Fig. 1

Fig. 1 shows a block schematic diagram of an audio similarity evaluator, according to an embodiment of the invention.

The audio similarity evaluator 100 according to Fig. 1 receives an input audio signal 110 (for example, an input audio signal of the audio similarity evaluator) and provides, on the basis thereof, a similarity information 112, which may, for example, take the form of a scalar value.

The audio similarity evaluator 100 comprises an envelope signal determination (or envelope signal determinator) 120 which is configured to obtain envelope signals 122a, 122b, 122c for a plurality of frequency ranges on the basis of the input audio signal. Preferably, the frequency ranges for which the envelope signals 122a-122c are provided may be overlapping. For example, the envelope signal determinator may use a filterbank or a Gamma-tone filterbank and a rectification and a temporal low-pass filtering and one or more adaptation processes which may, for example, model a pre-masking and/or a post-masking in an auditory system. In other words, the envelope signal determination 120 may, for example, perform an envelope demodulation of spectral subbands of the input audio signal.

Moreover, the audio similarity evaluator 100 comprises a modulation information determination (or modulation information determinator) 160, which receives the envelope signals 122a-122c and provides, on the basis thereof, modulation information 162a-162c. Generally speaking, the modulation information determination 160 is configured to obtain a modulation information 162a-162c associated with the envelope signals 122a-122c for a plurality of modulation frequency ranges. The modulation information describes the (temporal) modulation of the envelope signals.

The modulation information 162a-162c may, for example, be provided on the basis of output signals of modulation filters or on the basis of output signals of a modulation filterbank. For example, the modulation information 162a may be associated to a first frequency range, and may, for example, describe the modulation of a first envelope signal 122a (which is associated with this first frequency range) for a plurality of modulation frequency ranges. In other words, the modulation information 162a may not be a scalar value, but may comprise a plurality of values (or even a plurality of sequences of values) which are associated with different modulation frequencies that are present in the first envelope signal 122a which is associated with a first frequency range of the input audio signal. Similarly, the second modulation information 162b may not be a scalar value, but may comprise a plurality of values or even a plurality of sequences of values associated with different modulation frequency ranges which are present in the second envelope signal 122b, which is associated with a second frequency range of the input audio signal 110. Thus, for each of a plurality of frequency ranges under consideration (for which separate envelope signals 122a-122c are provided by the envelope signal determinator 120), modulation information may be provided for a plurality of modulation frequency ranges. Worded yet differently, for a portion (for example a frame) of the input audio signal 110, a plurality of sets of modulation information values are provided, wherein the different sets are associated with different frequency ranges of the input audio signal, and where each of the sets describes a plurality of modulation frequency ranges (i.e. each of the sets describes the modulation of one envelope signal).
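
As a rough illustration of how one such set of modulation values could be obtained for a single envelope signal, the following sketch projects the envelope onto a few modulation centre frequencies. A real implementation would use a modulation filterbank; the single-bin projection, the centre frequencies and the sampling rate are illustrative assumptions, not taken from the text:

```python
import cmath


def modulation_info(envelope, mod_freqs, fs):
    """Compute one set of modulation values for a single envelope signal.

    For each modulation frequency range (represented here by a single centre
    frequency) the magnitude of the corresponding DFT projection of the
    envelope is taken as the modulation value. This stands in for the output
    of one modulation filter per modulation frequency range.
    """
    n = len(envelope)
    info = []
    for f in mod_freqs:
        acc = sum(x * cmath.exp(-2j * cmath.pi * f * k / fs)
                  for k, x in enumerate(envelope))
        info.append(abs(acc) / n)
    return info
```

Applying this to every envelope signal 122a-122c yields one set of modulation values per frequency range, matching the structure described above.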

Moreover, the audio similarity evaluator comprises a comparison or comparator 180, which receives the modulation information 162a-162c and also a reference modulation information 182a-182c which is associated with a reference audio signal. Moreover, the comparison 180 is configured to compare the obtained modulation information 162a-162c (obtained on the basis of input audio signal 110) with the reference modulation information 182a-182c associated with a reference signal, in order to obtain an information about a (perceptually-judged) similarity between the input audio signal 110 and the reference audio signal.

For example, the comparison 180 may obtain a single value describing a perceptual similarity between the input audio signal and the reference audio signal as the similarity information 112. Moreover, it should be noted that the comparison 180 may, for example, use an internal difference representation, wherein the internal difference representation may, for example, describe a difference between the obtained modulation information and the reference modulation information. For example, one or more weighting operations or modification operations may be applied, like a scaling of the internal difference representation based on a degree of comodulation and/or an asymmetrical weighting of positive and negative values of the internal difference representation when deriving the similarity information.

However, it should be noted that additional (optional) details of the envelope signal determination 120, of the modulation information determination 160 and of the comparison 180 are described below and can optionally be introduced into the audio similarity evaluator 100 of Fig. 1, both individually and taken in combination.

Optionally, the reference modulation information 182a-182c may be obtained using an optional reference modulation information determination 190 on the basis of a reference audio signal 192. The reference modulation information determination may, for example, perform the same functionality as the envelope signal determination 120 and the modulation information determination 160 on the basis of the reference audio signal 192.

However, it should be noted that the reference modulation information 182a-182c can also be obtained from a different source, for example, from a data base or from a memory or from a remote device which is not part of the audio similarity evaluator.

It should further be noted that the blocks shown in Fig. 1 may be considered as (functional) blocks or (functional) units of a hardware implementation or of a software implementation, as will be detailed below.

2. Audio Similarity Evaluator According to Fig. 2

Figs. 2a and 2b show a block schematic diagram of an audio similarity evaluator 200, according to an embodiment of the present invention.

The audio similarity evaluator 200 is configured to receive an input audio signal 210 and to provide, on the basis thereof, a similarity information 212. Moreover, the audio similarity evaluator 200 may be configured to receive a reference modulation information 282 or to compute the reference modulation information 282 by itself (for example, in the same manner in which the modulation information is computed). The reference modulation information 282 is typically associated with a reference audio signal.

The audio similarity evaluator 200 comprises an envelope signal determination 220, which may, for example, comprise the functionality of the envelope signal determination 120. The audio similarity evaluator may also comprise a modulation information determination 260 which may, for example, comprise the functionality of the modulation information determination 160. Moreover, the audio similarity evaluator may comprise a comparison 280 which may, for example, correspond to the comparison 180.

Moreover, the audio similarity evaluator 200 may optionally comprise a comodulation determination, which may operate on the basis of different input signals and which may be implemented in different manners. Examples for the comodulation determination are also shown in the audio similarity evaluator.

In the following, details of the individual functional blocks or functional units of the audio similarity evaluator 200 will be described.

The envelope signal determination 220 comprises a filtering 230, which receives the input audio signal 210 and which provides, on the basis thereof, a plurality of filtered (preferably band-pass-filtered) signals 232a-232e. The filtering 230 may, for example, be implemented using a filterbank and may, for example, model a basilar-membrane filtering. For example, the filters may be considered as "auditory filters” and may, for example, be implemented using a Gamma-tone filterbank. In other words, bandwidths of bandpass filters which perform the filtering may increase with increasing center frequency of the filters. Thus, each of the filtered signals 232a-232e may represent a certain frequency range of the input audio signal, wherein the frequency ranges may be overlapping (or may be non-overlapping in some implementations).

Moreover, similar processing may be applied to each of the filtered signals 232a-232e, such that only one processing path for one given (representative) filtered signal 232c will be described in the following. However, the explanations provided with respect to the processing of the filtered signal 232c also apply to the processing of the other filtered signals 232a, 232b, 232d, 232e (wherein, in the present example, only five filtered signals are shown for the sake of simplicity, while a significantly higher number of filtered signals could be used in actual implementations).

A processing chain, which processes the filtered signal 232c under consideration, may, for example, comprise a rectification 236, a low-pass filtering 240 and an adaptation 250.

For example, a half-wave rectification 236 (which may, for example, remove the negative half-wave and create pulsating positive half-waves) may be applied to the filtered signal 232c, to thereby obtain a rectified signal 238. Furthermore, a low-pass filtering 240 is applied to the rectified signal 238 to thereby obtain a smooth low-pass signal 242. The low-pass filtering may, for example, use a cutoff frequency of 1000 Hz, but different cutoff frequencies (which are preferably smaller than 1500 Hz or smaller than 2000 Hz) may be applied.

Claims

1. An audio similarity evaluator (100;200;340),

wherein the audio similarity evaluator is configured to obtain envelope signals (122a-122c; 222a-222e) for a plurality of frequency ranges on the basis of an input audio signal (110;210;362), and

wherein the audio similarity evaluator is configured to obtain a modulation information (162a-162c; 262a-262e) associated with the envelope signals for a plurality of modulation frequency ranges, wherein the modulation information describes the modulation of the envelope signals; and

wherein the audio similarity evaluator is configured to compare the obtained modulation information with a reference modulation information (182a-182c; 282a-282e) associated with a reference audio signal (310), in order to obtain an information (112;212;342) about a similarity between the input audio signal and the reference audio signal.

2. The audio similarity evaluator (100;200;340) according to claim 1, wherein the audio similarity evaluator is configured to apply a plurality of filters or filtering operations (230) having overlapping filter characteristics, in order to obtain the envelope signals (122a-122c; 222a-222e).

3. The audio similarity evaluator (100;200;340) according to claim 1 or claim 2, wherein the audio similarity evaluator is configured to apply a rectification (236) to output signals (232a-232e) of the filters or filtering operations (230), to obtain a plurality of rectified signals (238), or wherein the audio similarity evaluator is configured to obtain a Hilbert envelope on the basis of output signals (232a-232e) of the filters or filtering operations (230), or wherein the audio similarity evaluator is configured to demodulate the output signals (232a-232e) of the filters or filtering operations (230).

4. The audio similarity evaluator (100;200;340) according to claim 3, wherein the audio similarity evaluator is configured to apply a low-pass filter or a low-pass filtering (240) to the rectified signals (238).

5. The audio similarity evaluator (100;200;340) according to one of claims 1 to 4, wherein the audio similarity evaluator is configured to apply an automatic gain control (250), in order to obtain the envelope signals (222a to 222e), or to apply a logarithmic transform, in order to obtain the envelope signals (222a to 222e), or to apply a modeling of a forward masking, in order to obtain the envelope signals (222a to 222e).

6. The audio similarity evaluator (100;200;340) according to claim 5, wherein the audio similarity evaluator is configured to vary a gain applied to derive the envelope signals (222a to 222e) on the basis of rectified and low-pass filtered signals (242) provided by a plurality of filters or filter operations (240) on the basis of the input audio signal.

7. The audio similarity evaluator (100;200;340) according to one of claims 1 to 6, wherein the audio similarity evaluator is configured to process rectified and low-pass filtered versions (242) of signals (232a to 232e) provided by a plurality of filters or filtering operations (230) on the basis of the input audio signal (210) using a series of two or more adaptation loops (254,256,257), which apply a time-variant scaling in dependence on time variant gain values (258),

wherein the audio similarity evaluator is configured to adjust different of the time variant gain values (258) using different time constants.

8. The audio similarity evaluator (100;200;340) according to one of claims 1 to 7,

wherein the audio similarity evaluator is configured to apply a plurality of modulation filters (264) having different passbands to the envelope signals (222a to 222e), to obtain the modulation information (262a to 262e), and/or wherein the audio similarity evaluator is configured to apply a down-sampling to the envelope signals (222a to 222e), to obtain the modulation information (262a to 262e).

9. The audio similarity evaluator (100;200;340) according to claim 8, wherein the modulation filters (264) are configured to at least partially separate components of the envelope signal (222a-222e) having different frequencies, wherein a center frequency of a first, lowest frequency modulation filter is smaller than 5 Hz, and wherein a center frequency of a highest frequency modulation filter is in a range between 200 Hz and 300 Hz.

10. The audio similarity evaluator (100;200;340) according to claim 8 or claim 9, wherein the audio similarity evaluator is configured to remove DC components when obtaining the modulation information (262a to 262e).

11. The audio similarity evaluator (100;200;340) according to one of claims 8 to 10, wherein the audio similarity evaluator is configured to remove a phase information when obtaining the modulation information (262a to 262e).

12. The audio similarity evaluator (100;200;340) according to one of claims 1 to 11, wherein the audio similarity evaluator is configured to derive a scalar value (112;212;342) representing a difference between the obtained modulation information (262a to 262e) and the reference modulation information (282a to 282e) associated with a reference audio signal (310).

13. The audio similarity evaluator (100;200;340) according to one of claims 1 to 12, wherein the audio similarity evaluator is configured to determine a difference representation (294a-294e) in order to compare the obtained modulation information (262a to 262e) with the reference modulation information (282a-282e) associated with a reference audio signal.

14. The audio similarity evaluator (100;200;340) according to one of claims 1 to 13, wherein the audio similarity evaluator is configured to adjust a weighting of a difference (289a-289e) between the obtained modulation information (262a-262e) and the reference modulation information (282a-282e) associated with a reference audio signal in dependence on a comodulation between the obtained envelope signals (222a-222e) or modulation information (262a-262e) in two or more adjacent acoustic frequency ranges or between envelope signals associated with the reference signal or between the reference modulation information (282a-282e) in two or more adjacent acoustic frequency ranges.

15. The audio similarity evaluator (100;200;340) according to one of claims 1 to 14, wherein the audio similarity evaluator is configured to put a higher weighting on differences (289a-289e) between the obtained modulation information (262a-262e) and the reference modulation information (282a-282e) associated with a reference audio signal indicating that the input audio signal (210) comprises an additional signal component when compared to differences (289a-289e) between the obtained modulation information (262a-262e) and the reference modulation information (282a-282e)

associated with a reference audio signal indicating that the input audio signal lacks a signal component when determining the information (212) about the similarity between the input audio signal and the reference audio signal,

16. The audio similarity evaluator (100;200;340) according to one of claims 1 to 15, wherein the audio similarity evaluator is configured to weight positive and negative values of a difference (289a-289e) between the obtained modulation information (262a-262e) and the reference modulation information (282a-282e) using different weights when determining the information about the similarity between the input audio signal and the reference audio signal.

17. An audio encoder (300;400) for encoding an audio signal (310;410),

wherein the audio encoder is configured to determine one or more coding parameters (324;424) in dependence on an evaluation of a similarity between an audio signal to be encoded (310;410) and an encoded audio signal (362),

wherein the audio encoder is configured to evaluate the similarity between the audio signal (310; 410) to be encoded and the encoded audio signal (352) using an audio similarity evaluator (100;200;340) according to one of claims 1 to 16.

18. The audio encoder (300;400) according to claim 17, wherein the audio encoder is configured to encode one or more bandwidth extension parameters (324;424) which define a processing rule to be used at the side of an audio decoder (1000) to derive a missing audio content (1052) on the basis of an audio content (1042) of a different frequency range encoded by the audio encoder; and/or

wherein the audio encoder is configured to encode one or more audio decoder configuration parameters which define a processing rule to be used at the side of an audio decoder.

19. The audio encoder (300;400) according to claim 17 or claim 18, wherein the audio encoder is configured to support an Intelligent Gap Filling, and

wherein the audio encoder is configured to determine one or more parameters (324;424) of the Intelligent Gap Filling using an evaluation of the similarity between the audio signal (310;410) to be encoded and the encoded audio signal (352).

20. The audio encoder (300;400) according to one of claims 17 to 19, wherein the audio encoder is configured to select one or more associations between a source frequency range (sT[.]) and a target frequency range (tile[.]) for a bandwidth extension and/or one or more processing operation parameters for a bandwidth extension in dependence on the evaluation of a similarity between an audio signal (310;410) to be encoded and an encoded audio signal (362).

21. The audio encoder (300;400) according to one of claims 17 to 20,

wherein the audio encoder is configured to select one or more associations between a source frequency range and a target frequency range for a bandwidth extension, wherein the audio encoder is configured to selectively allow or prohibit a change of an association between a source frequency range and a target frequency range in dependence on an evaluation of a modulation of an envelope in an old or a new target frequency range.

22. The audio encoder (300;400) according to claim 21,

wherein the audio encoder is configured to determine a modulation strength (485) of an envelope in a target frequency range in a modulation frequency range corresponding to a frame rate of the encoder and to determine a sensitivity measure (487) in dependence on the determined modulation strength, and

wherein the audio encoder is configured to decide whether it is allowed or prohibited to change an association between a target frequency range and a source frequency range in dependence on the sensitivity measure.

23. An audio encoder (500) for encoding an audio signal,

wherein the audio encoder is configured to determine one or more coding parameters (524) in dependence on an audio signal (510) to be encoded using a neural network (524),

wherein the neural network is trained using an audio similarity evaluator (100;200) according to one of claims 1 to 16.

24. An audio similarity evaluator (600),

wherein the audio similarity evaluator is configured to obtain envelope signals (622a-622c) for a plurality of frequency ranges on the basis of an input audio signal (610), and

wherein the audio similarity evaluator is configured to compare an analysis representation (622a-622c) of the input audio signal with a reference analysis representation (682a-682c) associated with a reference audio signal, in order to obtain an information (612) about a similarity between the input audio signal and the reference audio signal,

wherein the audio similarity evaluator is configured to adjust a weighting of a difference between the obtained analysis representation (622a-622c) and the reference analysis representation (682a-682c) in dependence on a comodulation in two or more adjacent acoustic frequency ranges of the input audio signal or in dependence on a comodulation in two or more adjacent acoustic frequency ranges of the reference audio signal.

25. A method for evaluating a similarity between audio signals,

wherein the method comprises obtaining envelope signals for a plurality of frequency ranges on the basis of an input audio signal, and

wherein the method comprises obtaining a modulation information associated with the envelope signals for a plurality of modulation frequency ranges, wherein the modulation information describes the modulation of the envelope signals; and

wherein the method comprises comparing the obtained modulation information with a reference modulation information associated with a reference audio signal, in order to obtain an information about a similarity between the input audio signal and the reference audio signal.

26. A method for encoding an audio signal,

wherein the method comprises determining one or more coding parameters in dependence on an evaluation of a similarity between an audio signal to be encoded and an encoded audio signal,

wherein the method comprises evaluating the similarity between the audio signal to be encoded and the encoded audio signal according to claim 25.

27. A method for encoding an audio signal,

wherein the method comprises determining one or more coding parameters in dependence on an audio signal to be encoded using a neural network,

wherein the neural network is trained using a method for evaluating a similarity between audio signals according to claim 25.

28. A method for evaluating a similarity between audio signals,

wherein the method comprises obtaining envelope signals for a plurality of frequency ranges on the basis of an input audio signal, and

wherein the method comprises comparing an analysis representation of the input audio signal with a reference analysis representation associated with a reference audio signal, in order to obtain an information about a similarity between the input audio signal and the reference audio signal,

wherein the method comprises adjusting a weighting of a difference between the obtained analysis representation and the reference analysis representation in dependence on a comodulation in two or more adjacent acoustic frequency ranges of the input audio signal or in dependence on a comodulation in two or more adjacent acoustic frequency ranges of the reference audio signal.

29. A computer program for performing the method of one of claims 25 to 28, when the computer program runs on a computer.

Documents

Application Documents

# Name Date
1 202017056438-STATEMENT OF UNDERTAKING (FORM 3) [24-12-2020(online)].pdf 2020-12-24
2 202017056438-REQUEST FOR EXAMINATION (FORM-18) [24-12-2020(online)].pdf 2020-12-24
3 202017056438-NOTIFICATION OF INT. APPLN. NO. & FILING DATE (PCT-RO-105) [24-12-2020(online)].pdf 2020-12-24
4 202017056438-FORM 18 [24-12-2020(online)].pdf 2020-12-24
5 202017056438-FORM 1 [24-12-2020(online)].pdf 2020-12-24
6 202017056438-DRAWINGS [24-12-2020(online)].pdf 2020-12-24
7 202017056438-DECLARATION OF INVENTORSHIP (FORM 5) [24-12-2020(online)].pdf 2020-12-24
8 202017056438-COMPLETE SPECIFICATION [24-12-2020(online)].pdf 2020-12-24
9 202017056438-FORM-26 [10-03-2021(online)].pdf 2021-03-10
10 202017056438-RELEVANT DOCUMENTS [09-04-2021(online)].pdf 2021-04-09
11 202017056438-Proof of Right [09-04-2021(online)].pdf 2021-04-09
12 202017056438-FORM 13 [09-04-2021(online)].pdf 2021-04-09
13 202017056438-FORM 3 [22-04-2021(online)].pdf 2021-04-22
14 202017056438.pdf 2021-10-19
15 202017056438-FER.pdf 2021-10-19
16 202017056438-Information under section 8(2) [22-10-2021(online)].pdf 2021-10-22
17 202017056438-FORM 3 [22-10-2021(online)].pdf 2021-10-22
18 202017056438-Response to office action [16-02-2022(online)].pdf 2022-02-16
19 202017056438-FORM 3 [04-05-2022(online)].pdf 2022-05-04
20 202017056438-Response to office action [25-08-2022(online)].pdf 2022-08-25
21 202017056438-FORM 3 [23-12-2022(online)].pdf 2022-12-23
22 202017056438-FORM 3 [06-03-2023(online)].pdf 2023-03-06
23 202017056438-FORM 3 [20-04-2023(online)].pdf 2023-04-20
24 202017056438-Information under section 8(2) [29-08-2023(online)].pdf 2023-08-29
25 202017056438-Information under section 8(2) [12-10-2023(online)].pdf 2023-10-12
26 202017056438-FORM 3 [12-10-2023(online)].pdf 2023-10-12
27 202017056438-Information under section 8(2) [20-12-2023(online)].pdf 2023-12-20
28 202017056438-PatentCertificate29-04-2024.pdf 2024-04-29
29 202017056438-IntimationOfGrant29-04-2024.pdf 2024-04-29

Search Strategy

1 serach70E_02-07-2021.pdf

ERegister / Renewals

3rd: 06 May 2024

From 29/05/2021 - To 29/05/2022

4th: 06 May 2024

From 29/05/2022 - To 29/05/2023

5th: 06 May 2024

From 29/05/2023 - To 29/05/2024

6th: 06 May 2024

From 29/05/2024 - To 29/05/2025

7th: 02 May 2025

From 29/05/2025 - To 29/05/2026