
Audio Coding Using Upmix

Abstract: A method for decoding a multi-audio-object signal having an audio signal of a first type and an audio signal of a second type encoded therein is described, the multi-audio-object signal consisting of a downmix signal (112) and side information, the side information comprising level information of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution, the method comprising computing a prediction coefficient matrix C based on the level information (OLD); and up-mixing the downmix signal based on the prediction coefficients to obtain a first up-mix audio signal approximating the audio signal of the first type and/or a second up-mix audio signal approximating the audio signal of the second type, wherein the up-mixing yields the first up-mix signal S1 and/or the second up-mix signal S2 from the downmix signal d according to a computation representable by (S1; S2) = D^-1 (1; C) d + H, where the "1" denotes - depending on the number of channels of d - a scalar, or an identity matrix, and D^-1 is a matrix uniquely determined by a downmix prescription according to which the audio signal of the first type and the audio signal of the second type are downmixed into the downmix signal, and which is also comprised by the side information, and H is a term being independent of d.


Patent Information

Application #
Filing Date
16 April 2010
Publication Number
31/2010
Publication Type
INA
Invention Field
ELECTRONICS
Status
Parent Application
Patent Number
Legal Status
Grant Date
2019-07-23
Renewal Date

Applicants

FRAUNHOFER-GESELLSCHAFT ZUR FÖRDERUNG DER ANGEWANDTEN FORSCHUNG E.V.
HANSASTRASSE 27C, 80686 MÜNCHEN, GERMANY

Inventors

1. OLIVER HELLMUTH
GESCHWISTER-VÖMEL-WEG 60 91052 ERLANGEN / GERMANY
2. JUERGEN HERRE
HALLERSTR. 24 91054 BUCKENHOF / GERMANY
3. LEONID TERENTIEV
AM EUROPAKANAL 36, APP. 11 91056 ERLANGEN / GERMANY
4. ANDREAS HOELZER
OBERE KARISTRASSE 23 91054 ERLANGEN / GERMANY
5. CORNELIA FALCH
KASSELER STRASSE 12 90491 NÜRNBERG / GERMANY
6. JOHANNES HILPERT
HERRNHUETTESTRASSE 46 90411 NÜRNBERG / GERMANY

Specification

Audio Coding using Upmix

Description

The present application is concerned with audio coding using up-mixing of signals. Many audio encoding algorithms have been proposed in order to effectively encode or compress audio data of one channel, i.e., mono audio signals. Using psychoacoustics, audio samples are appropriately scaled, quantized or even set to zero in order to remove irrelevancy from, for example, the PCM coded audio signal. Redundancy removal is also performed. As a further step, the similarity between the left and right channels of stereo audio signals has been exploited in order to effectively encode/compress stereo audio signals. However, upcoming applications pose further demands on audio coding algorithms. For example, in teleconferencing, computer games, music performance and the like, several audio signals which are partially or even completely uncorrelated have to be transmitted in parallel. In order to keep the bit rate necessary for encoding these audio signals low enough to be compatible with low-bit-rate transmission applications, audio codecs have recently been proposed which downmix the multiple input audio signals into a downmix signal, such as a stereo or even mono downmix signal. For example, the MPEG Surround standard downmixes the input channels into the downmix signal in a manner prescribed by the standard. The downmixing is performed by use of so-called OTT-1 and TTT-1 boxes for downmixing two signals into one and three signals into two, respectively. In order to downmix more than three signals, a hierarchic structure of these boxes is used. Each OTT-1 box outputs, besides the mono downmix signal, channel level differences between the two input channels, as well as inter-channel coherence/cross-correlation parameters representing the coherence or cross-correlation between the two input channels. The parameters are output, along with the downmix signal of the MPEG Surround coder, within the MPEG Surround data stream. 
Similarly, each TTT-1 box transmits channel prediction coefficients enabling recovery of the three input channels from the resulting stereo downmix signal. The channel prediction coefficients are also transmitted as side information within the MPEG Surround data stream. The MPEG Surround decoder upmixes the downmix signal by use of the transmitted side information and recovers the original channels input into the MPEG Surround encoder. However, MPEG Surround, unfortunately, does not fulfill all requirements posed by many applications. For example, the MPEG Surround decoder is dedicated to upmixing the downmix signal of the MPEG Surround encoder such that the input channels of the MPEG Surround encoder are recovered as they are. In other words, the MPEG Surround data stream is dedicated to be played back by use of the loudspeaker configuration having been used for encoding. However, in some applications, it would be favorable if the loudspeaker configuration could be changed at the decoder's side. In order to address the latter needs, the spatial audio object coding (SAOC) standard is currently being designed. Each channel is treated as an individual object, and all objects are downmixed into a downmix signal. In addition, the individual objects may also comprise individual sound sources such as instruments or vocal tracks. However, differing from the MPEG Surround decoder, the SAOC decoder is free to individually upmix the downmix signal to replay the individual objects onto any loudspeaker configuration. In order to enable the SAOC decoder to recover the individual objects having been encoded into the SAOC data stream, object level differences and, for objects together forming a stereo (or multi-channel) signal, inter-object cross-correlation parameters are transmitted as side information within the SAOC bitstream. 
Besides this, the SAOC decoder/transcoder is provided with information revealing how the individual objects have been downmixed into the downmix signal. Thus, on the decoder's side, it is possible to recover the individual SAOC channels and to render these signals onto any loudspeaker configuration by utilizing user-controlled rendering information. However, although the SAOC codec has been designed for individually handling audio objects, some applications are even more demanding. For example, Karaoke applications require a complete separation of the background audio signal from the foreground audio signal or foreground audio signals. Vice versa, in the solo mode, the foreground objects have to be separated from the background object. However, owing to the equal treatment of the individual audio objects, it was not possible to completely remove the background objects or the foreground objects, respectively, from the downmix signal. Thus, it is the object of the present invention to provide an audio codec using downmixing and upmixing of audio signals, respectively, such that a better separation of individual objects, for example in a Karaoke/solo mode application, is achieved. This object is achieved by an audio decoder according to claim 1, a decoding method according to claim 19, and a program according to claim 20. Referring to the Figs., preferred embodiments of the present application are described in more detail. Among these Figs., Fig. 1 shows a block diagram of an SAOC encoder/decoder arrangement in which the embodiments of the present invention may be implemented; Fig. 2 shows a schematic and illustrative diagram of a spectral representation of a mono audio signal; Fig. 3 shows a block diagram of an audio decoder according to an embodiment of the present invention; Fig. 4 shows a block diagram of an audio encoder according to an embodiment of the present invention; Fig. 
5 shows a block diagram of an audio encoder/decoder arrangement for Karaoke/Solo mode application, as a comparison embodiment; Fig. 6 shows a block diagram of an audio encoder/decoder arrangement for Karaoke/Solo mode application according to an embodiment; Fig. 7a shows a block diagram of an audio encoder for a Karaoke/Solo mode application, according to a comparison embodiment; Fig. 7b shows a block diagram of an audio encoder for a Karaoke/Solo mode application, according to an embodiment; Figs. 8a and 8b show plots of quality measurement results; Fig. 9 shows a block diagram of an audio encoder/decoder arrangement for Karaoke/Solo mode application, for comparison purposes; Fig. 10 shows a block diagram of an audio encoder/decoder arrangement for Karaoke/Solo mode application according to an embodiment; Fig. 11 shows a block diagram of an audio encoder/decoder arrangement for Karaoke/Solo mode application according to a further embodiment; Fig. 12 shows a block diagram of an audio encoder/decoder arrangement for Karaoke/Solo mode application according to a further embodiment; Figs. 13a to 13h show tables reflecting a possible syntax for the SAOC bitstream according to an embodiment of the present invention; Fig. 14 shows a block diagram of an audio decoder for a Karaoke/Solo mode application, according to an embodiment; and Fig. 15 shows a table reflecting a possible syntax for signaling the amount of data spent for transferring the residual signal. Before embodiments of the present invention are described in more detail below, the SAOC codec and the SAOC parameters transmitted in an SAOC bitstream are presented in order to ease the understanding of the specific embodiments outlined in further detail below. Fig. 1 shows a general arrangement of an SAOC encoder 10 and an SAOC decoder 12. The SAOC encoder 10 receives as an input N objects, i.e., audio signals 141 to 14N. 
In particular, the encoder 10 comprises a downmixer 16 which receives the audio signals 141 to 14N and downmixes same to a downmix signal 18. In Fig. 1, the downmix signal is exemplarily shown as a stereo downmix signal. However, a mono downmix signal is possible as well. The channels of the stereo downmix signal 18 are denoted LO and RO; in case of a mono downmix, same is simply denoted LO. In order to enable the SAOC decoder 12 to recover the individual objects 141 to 14N, downmixer 16 provides the SAOC decoder 12 with side information including SAOC parameters including object level differences (OLD), inter-object cross-correlation parameters (IOC), downmix gain values (DMG) and downmix channel level differences (DCLD). The side information 20 including the SAOC parameters, along with the downmix signal 18, forms the SAOC output data stream received by the SAOC decoder 12. The SAOC decoder 12 comprises an upmixer 22 which receives the downmix signal 18 as well as the side information 20 in order to recover and render the audio signals 141 to 14N onto any user-selected set of channels 241 to 24M, with the rendering being prescribed by rendering information 26 input into SAOC decoder 12. The audio signals 141 to 14N may be input into the downmixer 16 in any coding domain, such as, for example, in time or spectral domain. In case the audio signals 141 to 14N are fed into the downmixer 16 in the time domain, such as PCM coded, downmixer 16 uses a filter bank, such as a hybrid QMF bank, i.e., a bank of complex exponentially modulated filters with a Nyquist filter extension for the lowest frequency bands to increase the frequency resolution therein, in order to transfer the signals into the spectral domain, in which the audio signals are represented in several subbands associated with different spectral portions, at a specific filter bank resolution. 
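As a rough illustration of this spectral decomposition, the following Python sketch uses a short-time DFT as a simplified stand-in for the hybrid QMF bank; function names, frame sizes and the rectangular window are illustrative assumptions, not taken from the standard:

```python
import numpy as np

def spectral_decompose(x, n_sub=8, hop=None):
    """Toy filter bank: split a mono time signal into complex subband
    signals via a short-time DFT. A stand-in for the hybrid QMF bank
    described above (the real bank uses complex-exponentially modulated
    filters with a Nyquist extension); sizes are illustrative."""
    hop = hop or n_sub
    n_slots = len(x) // hop
    # one row per subband (frequency region), one column per time slot
    sub = np.empty((n_sub, n_slots), dtype=complex)
    for t in range(n_slots):
        frame = x[t * hop : t * hop + hop]
        spec = np.fft.rfft(frame, n=2 * n_sub)[:n_sub]
        sub[:, t] = spec
    return sub

x = np.sin(2 * np.pi * 0.05 * np.arange(64))
S = spectral_decompose(x)
print(S.shape)  # (8, 8): 8 subband signals over 8 filter bank time slots
```

Each column of the result corresponds to one filter bank time slot 34 and each row to one subband signal, matching the grid of Fig. 2.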
If the audio signals 141 to 14N are already in the representation expected by downmixer 16, same does not have to perform the spectral decomposition. Fig. 2 shows an audio signal in the just-mentioned spectral domain. As can be seen, the audio signal is represented as a plurality of subband signals. Each subband signal 301 to 30P consists of a sequence of subband values indicated by the small boxes 32. As can be seen, the subband values 32 of the subband signals 301 to 30P are synchronized to each other in time so that for each of the consecutive filter bank time slots 34, each subband 301 to 30P comprises exactly one subband value 32. As illustrated by the frequency axis 36, the subband signals 301 to 30P are associated with different frequency regions, and as illustrated by the time axis 38, the filter bank time slots 34 are consecutively arranged in time. As outlined above, downmixer 16 computes SAOC parameters from the input audio signals 141 to 14N. Downmixer 16 performs this computation in a time/frequency resolution which may be decreased relative to the original time/frequency resolution as determined by the filter bank time slots 34 and subband decomposition, by a certain amount, with this certain amount being signaled to the decoder side within the side information 20 by respective syntax elements bsFrameLength and bsFreqRes. For example, groups of consecutive filter bank time slots 34 may form a frame 40. In other words, the audio signal may be divided up into frames overlapping in time or being immediately adjacent in time, for example. In this case, bsFrameLength may define the number of parameter time slots 41, i.e. the time unit at which the SAOC parameters such as OLD and IOC are computed in an SAOC frame 40, and bsFreqRes may define the number of processing frequency bands for which SAOC parameters are computed. By this measure, each frame is divided up into time/frequency tiles exemplified in Fig. 2 by dashed lines 42. 
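The grouping of filter bank samples into coarser parameter tiles can be sketched as follows; the equal-width band split is an assumption made here for simplicity (the standard prescribes specific, generally non-uniform band groupings via bsFreqRes):

```python
import numpy as np

def tile_energies(sub, slots_per_frame=4, n_param_bands=2):
    """Group filter-bank samples into time/frequency tiles 42 and sum
    the energy per tile, mirroring the coarser SAOC parameter
    resolution (bsFrameLength / bsFreqRes). The equal band split is an
    illustrative assumption."""
    n_sub, n_slots = sub.shape
    band_edges = np.linspace(0, n_sub, n_param_bands + 1, dtype=int)
    n_frames = n_slots // slots_per_frame
    tiles = np.zeros((n_param_bands, n_frames))
    for b in range(n_param_bands):
        for f in range(n_frames):
            block = sub[band_edges[b]:band_edges[b + 1],
                        f * slots_per_frame:(f + 1) * slots_per_frame]
            tiles[b, f] = np.sum(np.abs(block) ** 2)
    return tiles

sub = np.ones((8, 8), dtype=complex)   # 8 subbands x 8 time slots
print(tile_energies(sub))              # each tile sums 4 x 4 unit energies
```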
The downmixer 16 calculates SAOC parameters according to the following formulas. In particular, downmixer 16 computes object level differences for each object i as

OLD_i = ( Σ_n Σ_k |x_i^(n,k)|^2 ) / max_j ( Σ_n Σ_k |x_j^(n,k)|^2 ),

wherein the sums and the indices n and k, respectively, go through all filter bank time slots 34, and all filter bank subbands 30 which belong to a certain time/frequency tile 42. Thereby, the energies of all subband values x_i of an audio signal or object i are summed up and normalized to the highest energy value of that tile among all objects or audio signals. Further, the SAOC downmixer 16 is able to compute a similarity measure of the corresponding time/frequency tiles of pairs of different input objects 141 to 14N. Although the SAOC downmixer 16 may compute the similarity measure between all the pairs of input objects 141 to 14N, downmixer 16 may also suppress the signaling of the similarity measures or restrict the computation of the similarity measures to audio objects 141 to 14N which form left or right channels of a common stereo signal. In any case, the similarity measure is called the inter-object cross-correlation parameter IOC_i,j. The computation is as follows:

IOC_i,j = Re{ ( Σ_n Σ_k x_i^(n,k) (x_j^(n,k))* ) / sqrt( Σ_n Σ_k |x_i^(n,k)|^2 · Σ_n Σ_k |x_j^(n,k)|^2 ) },

with again indexes n and k going through all subband values belonging to a certain time/frequency tile 42, and i and j denoting a certain pair of audio objects 141 to 14N. The downmixer 16 downmixes the objects 141 to 14N by use of gain factors applied to each object 141 to 14N. That is, in the case of a mono downmix signal, a gain factor D_i is applied to object i and then all thus weighted objects 141 to 14N are summed up to obtain the mono downmix signal. In the case of a stereo downmix signal, which case is exemplified in Fig. 1, a gain factor D_1,i is applied to object i and then all such gain-amplified objects are summed up in order to obtain the left downmix channel LO, and gain factors D_2,i are applied to object i and then the thus gain-amplified objects are summed up in order to obtain the right downmix channel RO. 
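A minimal Python sketch of the OLD and IOC computations as just described, evaluated for one time/frequency tile; the epsilon guard against division by zero is an implementation assumption:

```python
import numpy as np

def saoc_params(X):
    """Compute OLD and IOC per time/frequency tile for N objects.
    X has shape (N, n, k): complex subband values of each object inside
    one tile (indices n, k as in the text). Energies are normalized to
    the loudest object; eps is an illustrative numerical guard."""
    eps = 1e-12
    P = np.sum(np.abs(X) ** 2, axis=(1, 2))   # tile energy per object
    OLD = P / (np.max(P) + eps)               # normalized level information
    N = X.shape[0]
    IOC = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            num = np.sum(X[i] * np.conj(X[j]))
            IOC[i, j] = np.real(num / np.sqrt(P[i] * P[j] + eps))
    return OLD, IOC

# two perfectly correlated objects, the second twice as loud
X = np.stack([np.ones((2, 3)), 2 * np.ones((2, 3))]).astype(complex)
OLD, IOC = saoc_params(X)
print(OLD)  # [0.25 1.]: object 0 carries a quarter of object 1's energy
```

For fully correlated objects like these, the off-diagonal IOC entries come out close to 1, as expected from the normalization.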
This downmix prescription is signaled to the decoder side by means of downmix gains DMG_i and, in case of a stereo downmix signal, downmix channel level differences DCLD_i. The downmix gains are calculated according to:

DMG_i = 20 log10(D_i + ε)  (mono downmix), or
DMG_i = 10 log10(D_1,i^2 + D_2,i^2 + ε)  (stereo downmix),

where ε is a small number such as 10^-9. For the DCLDs the following formula applies:

DCLD_i = 20 log10( D_1,i / (D_2,i + ε) ).

In the normal mode, downmixer 16 generates the downmix signal according to:

LO = Σ_i D_i x_i  (mono downmix), or
(LO; RO) = Σ_i (D_1,i; D_2,i) x_i  (stereo downmix),

respectively. Thus, in the abovementioned formulas, parameters OLD and IOC are a function of the audio signals, and parameters DMG and DCLD are a function of D. By the way, it is noted that D may vary in time. Thus, in the normal mode, downmixer 16 mixes all objects 141 to 14N with no preferences, i.e., with handling all objects 141 to 14N equally. The upmixer 22 performs the inversion of the downmix procedure and the implementation of the "rendering information" represented by matrix A in one computation step, namely

output = A E D* (D E D*)^-1 d,

where matrix E is a function of the parameters OLD and IOC. In other words, in the normal mode, no classification of the objects 141 to 14N into BGO, i.e., background object, or FGO, i.e., foreground object, is performed. The information as to which object shall be presented at the output of the upmixer 22 is to be provided by the rendering matrix A. If, for example, the object with index 1 was the left channel of a stereo background object, the object with index 2 was the right channel thereof, and the object with index 3 was the foreground object, then rendering matrix A would be

A = (1 0 0; 0 1 0)

to produce a Karaoke-type of output signal. However, as already indicated above, transmitting BGO and FGO by use of this normal mode of the SAOC codec does not achieve acceptable results. Figs. 3 and 4 describe an embodiment of the present invention which overcomes the deficiency just described. The decoder and encoder described in these Figs. and their associated functionality may represent an additional mode such as an "enhanced mode" into which the SAOC codec of Fig. 
1 could be switchable. Examples for the latter possibility will be presented hereinafter. Fig. 3 shows a decoder 50. The decoder 50 comprises means 52 for computing prediction coefficients and means 54 for upmixing a downmix signal. The audio decoder 50 of Fig. 3 is dedicated to decoding a multi-audio-object signal having an audio signal of a first type and an audio signal of a second type encoded therein. The audio signal of the first type and the audio signal of the second type may each be a mono or stereo audio signal. The audio signal of the first type is, for example, a background object, whereas the audio signal of the second type is a foreground object. That is, the embodiment of Fig. 3 and Fig. 4 is not necessarily restricted to Karaoke/solo mode applications. Rather, the decoder of Fig. 3 and the encoder of Fig. 4 may be advantageously used elsewhere. The multi-audio-object signal consists of a downmix signal 56 and side information 58. The side information 58 comprises level information 60 describing, for example, spectral energies of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution such as, for example, the time/frequency resolution 42. In particular, the level information 60 may comprise a normalized spectral energy scalar value per object and time/frequency tile. The normalization may be related to the highest spectral energy value among the audio signals of the first and second type at the respective time/frequency tile. The latter possibility results in OLDs for representing the level information, also called level difference information herein. Although the following embodiments use OLDs, they may, even where not explicitly stated, use an otherwise normalized spectral energy representation. 
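For concreteness, the normal-mode downmix and its signaled gain parameters described earlier can be sketched as follows; the log-domain formulas follow a common SAOC-style convention and their exact constants are assumptions here, not the normative definitions:

```python
import numpy as np

def downmix_info(D, eps=1e-9):
    """Downmix gains (DMG) and downmix channel level differences (DCLD)
    for a stereo downmix matrix D of shape (2, N). Formulas follow a
    common SAOC-style convention; exact normative definitions may
    differ in detail."""
    DMG = 10 * np.log10(D[0] ** 2 + D[1] ** 2 + eps)
    DCLD = 10 * np.log10((D[0] ** 2 + eps) / (D[1] ** 2 + eps))
    return DMG, DCLD

# three objects mixed into stereo: object 0 hard left, 1 hard right, 2 center
D = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7]])
objs = np.array([[1.0, 1.0],
                 [2.0, 2.0],
                 [4.0, 4.0]])          # (N objects, time samples)
L0R0 = D @ objs                        # normal-mode downmix: weighted sums
DMG, DCLD = downmix_info(D)
print(L0R0[0])                         # left channel: 1*1 + 0.7*4 = [3.8 3.8]
```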
The side information 58 optionally comprises a residual signal 62 specifying residual level values in a second predetermined time/frequency resolution which may be equal to or different from the first predetermined time/frequency resolution. The means 52 for computing prediction coefficients is configured to compute prediction coefficients based on the level information 60. Additionally, means 52 may compute the prediction coefficients further based on inter-correlation information also comprised by side information 58. Even further, means 52 may use time-varying downmix prescription information comprised by side information 58 to compute the prediction coefficients. The prediction coefficients computed by means 52 are necessary for retrieving or upmixing the original audio objects or audio signals from the downmix signal 56. Accordingly, means 54 for upmixing is configured to upmix the downmix signal 56 based on the prediction coefficients 64 received from means 52 and, optionally, the residual signal 62. When using the residual signal 62, decoder 50 is able to even better suppress cross-talk from the audio signal of one type to the audio signal of the other type. Means 54 may also use the time-varying downmix prescription to upmix the downmix signal. Further, means 54 for upmixing may use user input 66 in order to decide which of the audio signals recovered from the downmix signal 56 is to be actually output at output 68, and to what extent. As a first extreme, the user input 66 may instruct means 54 to merely output the first up-mix signal approximating the audio signal of the first type. The opposite is true for the second extreme, according to which means 54 is to output merely the second up-mix signal approximating the audio signal of the second type. Intermediate options are possible as well, according to which a mixture of both up-mix signals is rendered and output at output 68. Fig. 
4 shows an embodiment for an audio encoder suitable for generating a multi-audio-object signal decoded by the decoder of Fig. 3. The encoder of Fig. 4, which is indicated by reference sign 80, may comprise means 82 for spectrally decomposing in case the audio signals 84 to be encoded are not already within the spectral domain. Among the audio signals 84, in turn, there is at least one audio signal of a first type and at least one audio signal of a second type. The means 82 for spectrally decomposing is configured to spectrally decompose each of these signals 84 into a representation as shown in Fig. 2, for example. That is, the means 82 for spectrally decomposing decomposes the audio signals 84 at a predetermined time/frequency resolution. Means 82 may comprise a filter bank, such as a hybrid QMF bank. The audio encoder 80 further comprises means 86 for computing level information, means 88 for downmixing and, optionally, means 90 for computing prediction coefficients and means 92 for setting a residual signal. Additionally, audio encoder 80 may comprise means for computing inter-correlation information, namely means 94. Means 86 computes level information describing the level of the audio signal of the first type and the audio signal of the second type in the first predetermined time/frequency resolution from the audio signals as optionally output by means 82. Similarly, means 88 downmixes the audio signals. Means 88 thus outputs the downmix signal 56. Means 86 also outputs the level information 60. Means 90 for computing prediction coefficients acts similarly to means 52. That is, means 90 computes prediction coefficients from the level information 60 and outputs the prediction coefficients 64 to means 92. 
Means 92, in turn, sets the residual signal 62 based on the downmix signal 56, the prediction coefficients 64 and the original audio signals at a second predetermined time/frequency resolution such that up-mixing the downmix signal 56 based on both the prediction coefficients 64 and the residual signal 62 results in a first up-mix audio signal approximating the audio signal of the first type and a second up-mix audio signal approximating the audio signal of the second type, the approximation being improved compared to the absence of the residual signal 62. The residual signal 62, if present, and the level information 60 are comprised by the side information 58 which forms, along with the downmix signal 56, the multi-audio-object signal to be decoded by the decoder of Fig. 3. As shown in Fig. 4, and analogous to the description of Fig. 3, means 90 - if present - may additionally use the inter-correlation information output by means 94 and/or the time-varying downmix prescription output by means 88 to compute the prediction coefficients 64. Further, means 92 for setting the residual signal 62 - if present - may additionally use the time-varying downmix prescription output by means 88 in order to appropriately set the residual signal 62. Again, it is noted that the audio signal of the first type may be a mono or stereo audio signal. The same applies for the audio signal of the second type. The residual signal 62 is optional. However, if present, it may be signaled within the side information in the same time/frequency resolution as the parameter time/frequency resolution used to compute, for example, the level information, or a different time/frequency resolution may be used. Further, it may be possible that the signaling of the residual signal is restricted to a sub-portion of the spectral range occupied by the time/frequency tiles 42 for which level information is signaled. 
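Setting the residual signal at the encoder, as described, amounts to subtracting the prediction-based up-mix from the original signals. A simplified mono-downmix sketch, with shapes and the helper name being illustrative assumptions:

```python
import numpy as np

def set_residual(d, c_pred, D_inv, originals):
    """Encoder-side residual: the difference between the original
    object signals and the prediction-based up-mix D_inv @ [d; c*d]
    (mono downmix, one signal per type). A sketch; not the normative
    computation of means 92."""
    up = D_inv @ np.vstack([d, c_pred * d])   # predicted S1 and S2
    return originals - up                     # what the prediction misses

d = np.array([[3.0, 3.0]])                    # mono downmix, 2 subband samples
originals = np.array([[1.0, 1.0],             # audio signal of the first type
                      [2.0, 2.0]])            # audio signal of the second type
res = set_residual(d, 0.5, np.eye(2), originals)
print(res[1])  # prediction error for the second-type signal: [0.5 0.5]
```

Transmitting this difference lets the decoder cancel exactly the cross-talk the prediction leaves behind, which is why the approximation improves when the residual is present.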
For example, the time/frequency resolution at which the residual signal is signaled may be indicated within the side information 58 by use of syntax elements bsResidualBands and bsResidualFramesPerSAOCFrame. These two syntax elements may define another sub-division of a frame into time/frequency tiles than the sub-division leading to tiles 42. By the way, it is noted that the residual signal 62 may or may not reflect information loss resulting from a core encoder 96 optionally used by audio encoder 80 to encode the downmix signal 56. As shown in Fig. 4, means 92 may perform the setting of the residual signal 62 based on the version of the downmix signal reconstructible from the output of core coder 96 or from the version input into core encoder 96'. Similarly, the audio decoder 50 may comprise a core decoder 98 to decode or decompress downmix signal 56. The ability to set, within the multiple-audio-object signal, the time/frequency resolution used for the residual signal 62 differently from the time/frequency resolution used for computing the level information 60 makes it possible to achieve a good compromise between audio quality on the one hand and compression ratio of the multiple-audio-object signal on the other hand. In any case, the residual signal 62 makes it possible to better suppress cross-talk from one audio signal to the other within the first and second up-mix signals to be output at output 68 according to the user input 66. As will become clear from the following embodiment, more than one residual signal 62 may be transmitted within the side information in case more than one foreground object or audio signal of the second type is encoded. The side information may allow for an individual decision as to whether a residual signal 62 is transmitted for a specific audio signal of the second type or not. Thus, the number of residual signals 62 may vary from one up to the number of audio signals of the second type. In the audio decoder of Fig. 
3, the means 52 for computing may be configured to compute a prediction coefficient matrix C consisting of the prediction coefficients based on the level information (OLD), and means 54 may be configured to yield the first up-mix signal S1 and/or the second up-mix signal S2 from the downmix signal d according to a computation representable by

(S1; S2) = D^-1 (1; C) d + H,

where the "1" denotes - depending on the number of channels of d - a scalar, or an identity matrix, and D^-1 is a matrix uniquely determined by a downmix prescription according to which the audio signal of the first type and the audio signal of the second type are downmixed into the downmix signal, and which is also comprised by the side information, and H is a term independent of d but dependent on the residual signal if the latter is present. As noted above and described further below, the downmix prescription may vary in time and/or spectrally within the side information. If the audio signal of the first type is a stereo audio signal having a first input channel (L) and a second input channel (R), the level information, for example, describes normalized spectral energies of the first input channel (L), the second input channel (R) and the audio signal of the second type, respectively, at the time/frequency resolution 42. The aforementioned computation according to which the means 54 for up-mixing performs the up-mixing may even be representable by

(L'; R'; S2) = D^-1 (1; C) d + H,

wherein L' is a first channel of the first up-mix signal, approximating L, and R' is a second channel of the first up-mix signal, approximating R, and the "1" is a scalar in case d is mono, and a 2x2 identity matrix in case d is stereo. 
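The up-mixing computation of the form D^-1 [1; C] d + H, reconstructed here from the text's description, can be sketched for the mono-downmix case as follows; shapes, the scalar prediction coefficient and the identity D^-1 are illustrative assumptions:

```python
import numpy as np

def upmix(d, C, D_inv, H=None):
    """Up-mix of the form D_inv @ [1; C] d + H as described in the
    text (mono downmix, so the "1" is the scalar 1). H, which is
    independent of d and derived from the residual when present,
    defaults to zero. Shapes are illustrative."""
    out = D_inv @ np.vstack([d, C * d])
    if H is not None:
        out = out + H
    return out

d = np.array([[2.0, 2.0]])                 # mono downmix signal
S = upmix(d, C=0.25, D_inv=np.eye(2))      # row 0 ~ S1, row 1 ~ S2
print(S[1])  # [0.5 0.5]
```

With D^-1 set to the identity for illustration, the first row reproduces the downmix and the second row is the prediction C·d of the second-type signal; a transmitted residual would enter through H.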
If the downmix signal 56 is a stereo audio signal having a first (LO) and a second output channel (RO), the computation according to which the means 54 for up-mixing performs the up-mixing may be representable accordingly. As far as the term H being dependent on the residual signal res is concerned, the computation according to which the means 54 for up-mixing performs the up-mixing is extended accordingly. The multi-audio-object signal may even comprise a plurality of audio signals of the second type, and the side information may comprise one residual signal per audio signal of the second type. A residual resolution parameter may be present in the side information defining a spectral range over which the residual signal is transmitted within the side information. It may even define a lower and an upper limit of the spectral range. Further, the multi-audio-object signal may also comprise spatial rendering information for spatially rendering the audio signal of the first type onto a predetermined loudspeaker configuration. In other words, the audio signal of the first type may be a multi-channel (more than two channels) MPEG Surround signal downmixed to stereo. In the following, embodiments will be described which make use of the above residual signal signaling. However, it is noted that the term "object" is often used in a double sense. Sometimes, an object denotes an individual mono audio signal. Thus, a stereo object may have a mono audio signal forming one channel of a stereo signal. However, in other situations, a stereo object may denote, in fact, two objects, namely an object concerning the right channel and a further object concerning the left channel of the stereo object. The actual sense will become apparent from the context. Before describing the next embodiment, same is motivated by deficiencies realized with the baseline technology of the SAOC standard selected as reference model 0 (RM0) in 2007. 
RM0 allowed the individual manipulation of a number of sound objects in terms of their panning position and amplification/attenuation. A special scenario has been presented in the context of a "Karaoke" type application. In this case:

• a mono, stereo or surround background scene (in the following called Background Object, BGO) is conveyed from a set of certain SAOC objects, which is reproduced without alteration, i.e. every input channel signal is reproduced through the same output channel at an unaltered level, and

• a specific object of interest (in the following called Foreground Object, FGO), typically the lead vocal, is reproduced with alterations (the FGO is typically positioned in the middle of the sound stage and can be muted, i.e. attenuated heavily to allow sing-along).

As is visible from subjective evaluation procedures, and could be expected from the underlying technology principle, manipulations of the object position lead to high-quality results, while manipulations of the object level are generally more challenging. Typically, the higher the additional signal amplification/attenuation is, the more potential artefacts arise. In this sense, the Karaoke scenario is extremely demanding since an extreme (ideally: total) attenuation of the FGO is required. The dual usage case is the ability to reproduce only the FGO without the background/MBO, and is referred to in the following as the solo mode. It is noted, however, that if a surround background scene is involved, it is referred to as a Multi-Channel Background Object (MBO). The handling of the MBO is the following, which is shown in Fig. 5:

• The MBO is encoded using a regular 5-2-5 MPEG Surround tree 102. This results in a stereo MBO downmix signal 104, and an MBO MPS side information stream 106.

• The MBO downmix is then encoded by a subsequent SAOC encoder 108 as a stereo object (i.e. 
two object level differences, plus an inter-channel correlation), together with the (or several) FGO 110. This results in a common downmix signal 112 and a SAOC side information stream 114.

In the transcoder 116, the downmix signal 112 is preprocessed, and the SAOC and MPS side information streams 106, 114 are transcoded into a single MPS output side information stream 118. This currently happens in a discontinuous way, i.e. either only full suppression of the FGO(s) is supported, or full suppression of the MBO. Finally, the resulting downmix 120 and MPS side information 118 are rendered by an MPEG Surround decoder 122.

In Fig. 5, both the MBO downmix 104 and the controllable object signal(s) 110 are combined into a single stereo downmix 112. This "pollution" of the downmix by the controllable object 110 is the reason for the difficulty of recovering, at sufficiently high audio quality, a Karaoke version with the controllable object 110 removed. The following proposal aims at circumventing this problem.

Assuming one FGO (e.g. one lead vocal), the key observation used by the following embodiment of Fig. 6 is that the SAOC downmix signal is a combination of the BGO and the FGO signal, i.e. three audio signals are downmixed and transmitted via two downmix channels. Ideally, these signals should be separated again in the transcoder in order to produce a clean Karaoke signal (i.e. to remove the FGO signal), or to produce a clean solo signal (i.e. to remove the BGO signal). This is achieved, in accordance with the embodiment of Fig. 6, by using a "two-to-three" (TTT) encoder element 124 (TTT-1 as it is known from the MPEG Surround specification) within SAOC encoder 108 to combine the BGO and the FGO into a single SAOC downmix signal in the SAOC encoder. Here, the FGO feeds the "center" signal input of the TTT-1 box 124, while the BGO 104 feeds the "left/right" TTT-1 inputs L, R.
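The TTT-1 combination just described can be sketched as follows. The center-channel downmix weight and the CPC values below are illustrative assumptions for a single sample, not normative values from the MPEG Surround specification, which operates per time/frequency tile:

```python
# Sketch of the TTT-1 (two-to-three encoder) step: the BGO stereo pair (l, r)
# and the FGO center signal c are combined into a stereo downmix, and a
# residual is derived that lets the decoder-side TTT element restore c.

W = 0.7071  # hypothetical center-to-left/right downmix weight (~1/sqrt(2))

def ttt_inverse(l, r, c, cpc1, cpc2):
    l0 = l + W * c                    # left downmix channel
    r0 = r + W * c                    # right downmix channel
    c_pred = cpc1 * l0 + cpc2 * r0    # decoder-side prediction of the center
    res = c - c_pred                  # residual correcting that prediction
    return l0, r0, res

l0, r0, res = ttt_inverse(l=1.0, r=-0.5, c=0.3, cpc1=0.2, cpc2=0.1)
```

The decoder-side TTT element recomputes the same prediction from (l0, r0) and adds the transmitted residual, so the center (FGO) signal is recovered exactly when the residual is conveyed at full bandwidth and resolution.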
The transcoder 116 can then produce approximations of the BGO 104 by using a TTT decoder element 126 (TTT as it is known from MPEG Surround), i.e. the "left/right" TTT outputs L, R carry an approximation of the BGO, whereas the "center" TTT output C carries an approximation of the FGO 110.

When comparing the embodiment of Fig. 6 with the embodiment of an encoder and decoder of Figs. 3 and 4, reference sign 104 corresponds to the audio signal of the first type among audio signals 84; means 82 is comprised by MPS encoder 102; reference sign 110 corresponds to the audio signals of the second type among audio signals 84; TTT-1 box 124 assumes the responsibility for the functionalities of means 88 to 92, with the functionalities of means 86 and 94 being implemented in SAOC encoder 108; reference sign 112 corresponds to reference sign 56; reference sign 114 corresponds to side information 58 less the residual signal 62; and TTT box 126 assumes responsibility for the functionality of means 52 and 54, with the functionality of the mixing box 128 also being comprised by means 54. Lastly, signal 120 corresponds to the signal output at output 68.

Further, it is noted that Fig. 6 also shows a core coder/decoder path 131 for the transport of the downmix 112 from SAOC encoder 108 to SAOC transcoder 116. This core coder/decoder path 131 corresponds to the optional core coder 96 and core decoder 98. As indicated in Fig. 6, this core coder/decoder path 131 may also encode/compress the side information transported from encoder 108 to transcoder 116.

The advantages resulting from the introduction of the TTT box of Fig. 6 will become clear from the following description. For example, by

• simply feeding the "left/right" TTT outputs L, R into the MPS downmix 120 (and passing on the transmitted MBO MPS bitstream 106 in stream 118), only the MBO is reproduced by the final MPS decoder. This corresponds to the Karaoke mode.

• simply feeding the "center" TTT output C
into the left and right MPS downmix 120 (and producing a trivial MPS bitstream 118 that renders the FGO 110 to the desired position and level), only the FGO 110 is reproduced by the final MPS decoder 122. This corresponds to the Solo mode.

The handling of the three TTT output signals L, R, C is performed in the "mixing" box 128 of the SAOC transcoder 116. The processing structure of Fig. 6 provides a number of distinct advantages over Fig. 5:

• The framework provides a clean structural separation of background (MBO) 100 and FGO signals 110.

• The structure of the TTT element 126 attempts a best possible reconstruction of the three signals L, R, C on a waveform basis. Thus, the final MPS output signals 130 are not only formed by energy weighting (and decorrelation) of the downmix signals, but are also closer in terms of waveforms due to the TTT processing.

• Along with the MPEG Surround TTT box 126 comes the possibility to enhance the reconstruction precision by using residual coding. In this way, a significant enhancement in reconstruction quality can be achieved as the residual bandwidth and residual bitrate for the residual signal 132, output by TTT-1 124 and used by the TTT box for upmixing, are increased. Ideally (i.e. for infinitely fine quantization in the residual coding and the coding of the downmix signal), the interference between the background (MBO) and the FGO signal is cancelled.

The processing structure of Fig. 6 possesses a number of characteristics:

• Duality of Karaoke/Solo mode: The approach of Fig. 6 offers both Karaoke and Solo functionality by using the same technical means. That is, SAOC parameters are reused, for example.

• Refineability: The quality of the Karaoke/Solo signal can be refined as needed by controlling the amount of residual coding information used in the TTT boxes. For example, the parameters bsResidualSamplingFrequencyIndex, bsResidualBands and bsResidualFramesPerSAOCFrame may be used.
• Positioning of the FGO in the downmix: When using a TTT box as specified in the MPEG Surround specification, the FGO would always be mixed into the center position between the left and right downmix channels. In order to allow more flexibility in positioning, a generalized TTT encoder box is employed which follows the same principles while allowing non-symmetric positioning of the signal associated to the "center" inputs/outputs.

• Multiple FGOs: In the configuration described, the use of only one FGO was described (this may correspond to the most important application case). However, the proposed concept is also able to accommodate several FGOs by using one or a combination of the following measures:

o Grouped FGOs: As shown in Fig. 6, the signal that is connected to the center input/output of the TTT box can actually be the sum of several FGO signals rather than only a single one. These FGOs can be independently positioned/controlled in the multi-channel output signal 130 (the maximum quality advantage is achieved, however, when they are scaled and positioned in the same way). They share a common position in the stereo downmix signal 112, and there is only one residual signal 132. In any case, the interference between the background (MBO) and the controllable objects is cancelled (although not between the controllable objects).

o Cascaded FGOs: The restrictions regarding the common FGO position in the downmix 112 can be overcome by extending the approach of Fig. 6. Multiple FGOs can be accommodated by cascading several stages of the described TTT structure, each stage corresponding to one FGO and producing one residual coding stream. In this way, interference would ideally be cancelled also between the individual FGOs. Of course, this option requires a higher bitrate than the grouped FGO approach. An example will be described later.

• SAOC side information: In MPEG Surround, the side information associated to a TTT box is a pair of Channel Prediction Coefficients (CPCs).
In contrast, the SAOC parametrization and the MBO/Karaoke scenario transmit object energies for each object signal, and an inter-signal correlation between the two channels of the MBO downmix (i.e. the parametrization for a "stereo object"). In order to minimize the number of changes in the parametrization, and thus in the bitstream format, relative to the case without the enhanced Karaoke/Solo mode, the CPCs can be calculated from the energies of the downmixed signals (MBO downmix and FGOs) and the inter-signal correlation of the MBO downmix stereo object. Therefore, there is no need to change or augment the transmitted parametrization, and the CPCs can be calculated from the transmitted SAOC parametrization in the SAOC transcoder 116. In this way, a bitstream using the enhanced Karaoke/Solo mode could also be decoded by a regular-mode decoder (without residual coding) when ignoring the residual data.

In summary, the embodiment of Fig. 6 aims at an enhanced reproduction of certain selected objects (or the scene without those objects) and extends the current SAOC encoding approach using a stereo downmix in the following way:

• In the normal mode, each object signal is weighted by its entries in the downmix matrix (for its contribution to the left and to the right downmix channel, respectively). Then, all weighted contributions to the left and right downmix channel are summed to form the left and right downmix channels.

• For enhanced Karaoke/Solo performance, i.e. in the enhanced mode, all object contributions are partitioned into a set of object contributions that form a Foreground Object (FGO) and the remaining object contributions (BGO). The FGO contribution is summed into a mono downmix signal, the remaining background contributions are summed into a stereo downmix, and both are summed using a generalized TTT encoder element to form the common SAOC stereo downmix. Thus, a regular summation is replaced by a "TTT summation" (which can be cascaded when desired).
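The normal-mode summation described in the first bullet above can be sketched as a plain downmix-matrix multiplication. All object samples and matrix entries below are illustrative:

```python
# Sketch of the normal SAOC mode: each object j is weighted by its downmix
# matrix entries D[i][j] and summed into downmix channel i (L0 or R0).
# Signals are reduced to single samples for illustration.

def saoc_downmix(objects, D):
    left  = sum(D[0][j] * s for j, s in enumerate(objects))  # L0
    right = sum(D[1][j] * s for j, s in enumerate(objects))  # R0
    return left, right

objects = [0.5, -0.25, 0.125]       # three object signals (illustrative)
D = [[1.0, 0.5, 0.25],              # contributions to the left channel
     [0.0, 0.5, 0.75]]              # contributions to the right channel
print(saoc_downmix(objects, D))     # -> (0.40625, -0.03125)
```

In the enhanced mode, by contrast, the FGO contributions would first be summed into a separate mono signal feeding the "center" input of the generalized TTT-1 element, which then combines that center signal with the BGO stereo downmix.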
In order to emphasize the just-mentioned difference between the normal mode of the SAOC encoder and the enhanced mode, reference is made to Figs. 7a and 7b, where Fig. 7a concerns the normal mode and Fig. 7b concerns the enhanced mode. As can be seen, in the normal mode, the SAOC encoder 108 uses the afore-mentioned DMX parameters Dij for weighting objects j and adding the thus weighted objects j to SAOC channel i, i.e. L0 or R0. In the case of the enhanced mode of Fig. 6, merely a vector of DMX parameters Di is necessary, namely DMX parameters Di indicating how to form a weighted sum of the FGOs 110, thereby obtaining the center channel C for the TTT-1 box 124, and DMX parameters Di instructing the TTT-1 box how to distribute the center signal C to the left MBO channel and the right MBO channel, respectively, thereby obtaining LDMX or RDMX, respectively.

Problematically, the processing according to Fig. 6 does not work very well with non-waveform-preserving codecs (HE-AAC/SBR). A solution for that problem may be an energy-based generalized TTT mode for HE-AAC and high frequencies. An embodiment addressing the problem will be described later.

A possible bitstream format for the variant with cascaded TTTs could be as follows: an addition to the SAOC bitstream that needs to be able to be skipped if digested in "regular decode mode":

numTTTs int
for (ttt=0; ttt
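The skippable bitstream addition indicated by the truncated syntax fragment above could be handled along the following lines. The loop structure (a count numTTTs, then the residual configuration parameters named earlier per TTT element) follows the text; the representation of the already-decoded values as a flat sequence is an assumption made purely for illustration:

```python
# Sketch of digesting the cascaded-TTT addition to the SAOC bitstream:
# a count numTTTs followed, per TTT element, by the residual configuration
# parameters named in the text. A regular-mode decoder would skip this block.

def parse_ttt_extension(fields):
    """fields: iterable of already-decoded bitstream values (an assumption;
    real parsing would read bit fields of specified widths)."""
    it = iter(fields)
    num_ttts = next(it)
    ttts = []
    for _ in range(num_ttts):
        ttts.append({
            "bsResidualSamplingFrequencyIndex": next(it),
            "bsResidualBands": next(it),
            "bsResidualFramesPerSAOCFrame": next(it),
        })
    return ttts

cfg = parse_ttt_extension([2, 3, 10, 1, 3, 5, 1])  # numTTTs=2, then 2x3 values
```

Because the block is self-delimiting (numTTTs determines its extent), a decoder that ignores residual data can step over it, which is what allows the enhanced-mode bitstream to remain decodable in regular mode.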

Documents

Orders

Section Controller Decision Date

Application Documents

# Name Date
1 1356-KOLNP-2010-RELEVANT DOCUMENTS [06-09-2023(online)].pdf 2023-09-06
2 1356-KOLNP-2010-RELEVANT DOCUMENTS [08-09-2022(online)].pdf 2022-09-08
3 1356-KOLNP-2010-RELEVANT DOCUMENTS [26-09-2021(online)].pdf 2021-09-26
4 1356-KOLNP-2010-RELEVANT DOCUMENTS [23-03-2020(online)].pdf 2020-03-23
5 1356-KOLNP-2010-IntimationOfGrant23-07-2019.pdf 2019-07-23
6 1356-KOLNP-2010-PatentCertificate23-07-2019.pdf 2019-07-23
7 1356-KOLNP-2010-Written submissions and relevant documents (MANDATORY) [28-02-2019(online)].pdf 2019-02-28
8 1356-KOLNP-2010-HearingNoticeLetter.pdf 2019-01-16
9 1356-KOLNP-2010-Information under section 8(2) (MANDATORY) [05-01-2019(online)].pdf 2019-01-05
10 1356-KOLNP-2010-Information under section 8(2) (MANDATORY) [30-05-2018(online)].pdf 2018-05-30
11 1356-KOLNP-2010-Information under section 8(2) (MANDATORY) [02-09-2017(online)].pdf 2017-09-02
12 ABSTRACT-1356KOLNP2010.pdf 2017-08-31
13 CLAIMS-1356KOLNP2010.pdf 2017-08-31
14 DRAWINGS-1356KOLNP2010.pdf 2017-08-31
15 FORMS FRAUNHOFER- 1356-KOLNP-2010.pdf 2017-08-31
16 LETTER-1356KOLNP2010.pdf 2017-08-31
17 MARK UP COPY-1356KOLNP2010.pdf 2017-08-31
18 Other Patent Document [17-02-2017(online)].pdf 2017-02-17
19 Abstract [19-10-2016(online)].pdf 2016-10-19
20 Claims [19-10-2016(online)].pdf 2016-10-19
21 Correspondence [19-10-2016(online)].pdf 2016-10-19
22 Description(Complete) [19-10-2016(online)].pdf 2016-10-19
23 Examination Report Reply Recieved [19-10-2016(online)].pdf 2016-10-19
24 Other Document [19-10-2016(online)].pdf 2016-10-19
25 Petition Under Rule 137 [19-10-2016(online)].pdf 2016-10-19
26 Other Patent Document [09-08-2016(online)].pdf 2016-08-09
27 1356-KOLNP-2010_EXAMREPORT.pdf 2016-06-30
28 1356-kolnp-2010-abstract.pdf 2011-10-07
29 1356-KOLNP-2010-ASSIGNMENT.pdf 2011-10-07
30 1356-kolnp-2010-claims.pdf 2011-10-07
31 1356-KOLNP-2010-CORRESPONDENCE 1.1.pdf 2011-10-07
32 1356-KOLNP-2010-CORRESPONDENCE 1.2.pdf 2011-10-07
33 1356-KOLNP-2010-CORRESPONDENCE 1.3.pdf 2011-10-07
34 1356-KOLNP-2010-CORRESPONDENCE-1.4.pdf 2011-10-07
35 1356-kolnp-2010-correspondence.pdf 2011-10-07
36 1356-kolnp-2010-description (complete).pdf 2011-10-07
37 1356-kolnp-2010-drawings.pdf 2011-10-07
38 1356-kolnp-2010-form 1.pdf 2011-10-07
39 1356-KOLNP-2010-FORM 18.pdf 2011-10-07
40 1356-kolnp-2010-form 2.pdf 2011-10-07
41 1356-kolnp-2010-form 3.pdf 2011-10-07
42 1356-KOLNP-2010-FORM 3-1.1.pdf 2011-10-07
43 1356-kolnp-2010-form 5.pdf 2011-10-07
44 1356-kolnp-2010-international publication.pdf 2011-10-07
45 1356-kolnp-2010-international search report.pdf 2011-10-07
46 1356-KOLNP-2010-PA.pdf 2011-10-07
47 1356-KOLNP-2010-PCT IPER.pdf 2011-10-07
48 1356-kolnp-2010-pct priority document notification.pdf 2011-10-07
49 1356-kolnp-2010-pct request form.pdf 2011-10-07
50 1356-kolnp-2010-specification.pdf 2011-10-07

ERegister / Renewals

3rd: 10 Sep 2019

From 17/10/2010 - To 17/10/2011

4th: 10 Sep 2019

From 17/10/2011 - To 17/10/2012

5th: 10 Sep 2019

From 17/10/2012 - To 17/10/2013

6th: 10 Sep 2019

From 17/10/2013 - To 17/10/2014

7th: 10 Sep 2019

From 17/10/2014 - To 17/10/2015

8th: 10 Sep 2019

From 17/10/2015 - To 17/10/2016

9th: 10 Sep 2019

From 17/10/2016 - To 17/10/2017

10th: 10 Sep 2019

From 17/10/2017 - To 17/10/2018

11th: 10 Sep 2019

From 17/10/2018 - To 17/10/2019

12th: 10 Sep 2019

From 17/10/2019 - To 17/10/2020

13th: 02 Oct 2020

From 17/10/2020 - To 17/10/2021

14th: 07 Oct 2021

From 17/10/2021 - To 17/10/2022

15th: 13 Oct 2022

From 17/10/2022 - To 17/10/2023

16th: 13 Oct 2023

From 17/10/2023 - To 17/10/2024

17th: 09 Oct 2024

From 17/10/2024 - To 17/10/2025