Abstract: The invention relates to the processing of a digital audio signal comprising a series of samples distributed in consecutive frames. The processing is implemented in particular when decoding said signal, in order to replace at least one signal frame lost during decoding. The method includes the following steps: a) searching, in a valid signal segment available when decoding, for at least one period in the signal, determined based on said valid signal; b) analyzing the signal in said period in order to determine spectral components of the signal in said period; c) synthesizing at least one frame for replacing the lost frame, by constructing a synthesis signal from: an addition of components selected from among said determined spectral components, and noise added to the addition of components. In particular, the amount of noise added to the addition of components is weighted based on voice information of the valid signal obtained when decoding.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
(See section 10, rule 13)
“IMPROVED FRAME LOSS CORRECTION WITH VOICE
INFORMATION”
ORANGE of 78 rue Olivier de Serres, 75015 Paris, France
The following specification particularly describes the invention and the manner in which it is
to be performed.
Improved frame loss correction with voice information
The present invention relates to the field of encoding/decoding in telecommunications, and
more particularly to the field of frame loss correction in decoding.
A "frame" is an audio segment composed of at least one sample (the invention applies to the
loss of one or more samples in coding according to G.711 as well as to a loss one or more
packets of samples in coding according to standards G.723, G.729, etc.).
Losses of audio frames occur when a real-time communication using an encoder and a
decoder is disrupted by the conditions of a telecommunications network (radiofrequency
problems, congestion of the access network, etc.). In this case, the decoder uses frame loss
correction mechanisms to attempt to replace the missing signal with a signal reconstructed
using information available at the decoder (for example the audio signal already decoded for
one or more past frames). This technique can maintain a quality of service despite degraded
network performance.
Frame loss correction techniques are often highly dependent on the type of coding used.
In the case of CELP coding, it is common to repeat certain parameters decoded in the
previous frame (spectral envelope, pitch, gains from codebooks), with adjustments such as
modifying the spectral envelope to converge toward an average envelope or using a random
fixed codebook.
In the case of transform coding, the most widely used technique for correcting frame loss
consists of repeating the last frame received if a frame is lost and setting the repeated frame
to zero as soon as more than one frame is lost. This technique is found in many coding
standards (G.719, G.722.1, G.722.1C). One can also cite the case of the G.711 coding
standard, for which an example of frame loss correction described in Appendix I to G.711
identifies a fundamental period (called the "pitch period") in the already decoded signal and
repeats it, overlapping and adding the already decoded signal and the repeated signal
("overlap-add"). Such overlap-add "erases" audio artifacts, but in order to be implemented
requires an additional delay in the decoder (corresponding to the duration of the overlap).
Moreover, in the case of coding standard G.722.1, a modulated lapped transform (or MLT)
with an overlap-add of 50% and sinusoidal windows ensures a transition between the last
lost frame and the repeated frame that is slow enough to erase artifacts related to simple
repetition of the frame in the case of a single lost frame. Unlike the frame loss correction
described in the G.711 standard (Appendix I), this embodiment requires no additional delay
because it makes use of the existing delay and the temporal aliasing of the MLT transform
to implement an overlap-add with the reconstructed signal.
This technique is inexpensive, but its main fault is an inconsistency between the signal
decoded before the frame loss and the repeated signal. This results in a phase discontinuity
that can produce significant audio artifacts if the duration of the overlap between the two
frames is short, as is the case when the windows used for the MLT transform are "short delay"
as described in document FR 1350845 with reference to figures 1A and 1B of that document.
In such a case, even a solution combining a pitch search as in the case of the coder according
to standard G.711 (Appendix I) and an overlap-add using the window of the MLT transform
is not sufficient to eliminate audio artifacts.
Document FR 1350845 proposes a hybrid method that combines the advantages of both these methods to maintain phase continuity in the transform domain. The present invention is defined within this framework. The solution proposed in FR 1350845 is described in detail below with reference to Figure 1.
Although it is particularly promising, this solution requires improvement because, when the
encoded signal has only one fundamental period ("mono pitch"), for example in a voiced
segment of a speech signal, the audio quality after frame loss correction may be degraded
and not as good as with frame loss correction by a speech model of a type such as CELP
("Code-Excited Linear Prediction").
The invention improves the situation.
For this purpose, it proposes a method for processing a digital audio signal comprising a
series of samples distributed in successive frames, the method being implemented when
decoding said signal in order to replace at least one lost signal frame during decoding.
The method comprises the steps of:
a) searching, in a valid signal segment available when decoding, for at least one period in
the signal, determined based on said valid signal,
b) analyzing the signal in said period, in order to determine spectral components of the signal
in said period,
c) synthesizing at least one replacement for the lost frame, by constructing a synthesis signal
from:
- an addition of components selected from among said determined spectral components, and
- noise added to the addition of components.
In particular, the amount of noise added to the addition of components is weighted based on
voice information of the valid signal, obtained when decoding.
Advantageously, the voice information used when decoding, transmitted at at least one bitrate of the encoder, gives more weight to the sinusoidal components of the past signal if this signal is voiced, or gives more weight to the noise if not, which yields a much more satisfactory audible result. In the case of an unvoiced signal or in the case of a music signal, it is unnecessary to keep so many components for synthesizing the signal replacing the lost frame. In this case, more weight can be given to the noise injected for the synthesis of the signal. This advantageously reduces the complexity of the processing, particularly in the case of an unvoiced signal, without degrading the quality of the synthesis.
In an embodiment in which a noise signal is added to the components, this noise signal is
therefore weighted by a smaller gain in the case of voicing in the valid signal. For example,
the noise signal may be obtained from the previously received frame, as a residual between the received signal and the addition of selected components.
In an additional or alternative embodiment, the number of components selected for the
addition is larger in the case of voicing in the valid signal. Thus, if the signal is voiced, the
spectrum of the past signal is given more consideration, as indicated above.
Advantageously, a complementary embodiment may be chosen in which more components are selected if the signal is voiced, while reducing the gain to be applied to
the noise signal. Thus, the total amount of energy attenuated by applying a gain of less than
1 to the noise signal is partially offset by the selection of more components. Conversely, the
gain to be applied to the noise signal is not decreased and fewer components are selected if
the signal is not voiced or is weakly voiced.
In addition, it is possible to further improve the quality/complexity compromise in decoding: in step a), the above period may be searched for in a valid signal segment of greater length in the case of voicing in the valid signal. In an embodiment presented in the detailed description below, a repetition period, typically corresponding to at least one pitch period if the signal is voiced, is searched for by correlation in the valid signal; in this case, particularly for male voices, the pitch search may be carried out over more than 30 milliseconds, for example.
In an optional embodiment, the voice information is supplied in an encoded stream
("bitstream") received in decoding and corresponding to said signal comprising a series of
samples distributed in successive frames. In the case of frame loss in decoding, the voice
information contained in a valid signal frame preceding the lost frame is then used.
The voice information thus comes from an encoder generating a bitstream and determining
the voice information, and in one particular embodiment the voice information is encoded
in a single bit in the bitstream. However, as an exemplary embodiment, the generation of
this voice data in the encoder may be dependent on whether there is sufficient bandwidth on
a communication network between the encoder and the decoder. For example, if the
bandwidth is below a threshold, the voice data is not transmitted by the encoder in order to
save bandwidth. In this case, purely as an example, the last voice information acquired at
the decoder can be used for the frame synthesis, or alternatively it may be decided to apply
the unvoiced case for the synthesis of the frame.
In one implementation where the voice information is encoded in one bit in the bitstream, the value of the gain applied to the noise signal may also be binary: if the signal is voiced, the gain value is set to 0.25, and otherwise to 1.
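By way of illustration only, this binary weighting can be sketched in a few lines of Python (the helper name and types are hypothetical; the specification does not prescribe any implementation):

```python
# Illustrative sketch: map the one-bit voice information to the binary
# noise gain described above (gain values 0.25 / 1 taken from the text).
def noise_gain(voiced_bit: int) -> float:
    # Voiced speech (bit = 1): strongly attenuate the injected noise.
    return 0.25 if voiced_bit == 1 else 1.0
```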
Alternatively, the voice information comes from an encoder determining a value for the
harmonicity or flatness of the spectrum (obtained for example by comparing amplitudes of
the spectral components of the signal to a background noise), the encoder then delivering
this value in binary form in the bitstream (using more than one bit).
In such an alternative, the gain value may be determined as a function of said flatness value
(for example continuously increasing as a function of this value).
Generally, said flatness value can be compared to a threshold in order to determine:
- that the signal is voiced if the flatness value is below the threshold, and
- that the signal is unvoiced otherwise,
(which characterizes voicing in a binary manner).
Thus, in the single-bit implementation as well as in its variant, the criteria for selecting
components and/or choosing the duration of the signal segment in which the pitch search
occurs may be binary.
For example, for the selection of components:
- if the signal is voiced, the spectral components having amplitudes greater than those of the
neighboring first spectral components are selected, as well as the neighboring first spectral
components, and
- otherwise, only the spectral components having amplitudes greater than those of the
neighboring first spectral components are selected.
For selecting the duration of the pitch search segment, for example:
- if the signal is voiced, the period is searched for in a valid signal segment of a duration of
more than 30 milliseconds (for example 33 milliseconds),
- and if not, the period is searched for in a valid signal segment of a duration of less than 30
milliseconds (for example 28 milliseconds).
Thus, the invention aims to improve the prior art in the sense of document FR 1350845 by
modifying various steps in the processing presented in that document (pitch search, selection
of components, noise injection), but is still based in particular on characteristics of the
original signal.
These characteristics of the original signal can be encoded as special information in the data
stream to the decoder (or "bitstream"), according to the speech and/or music classification,
and if appropriate on the speech class in particular.
This information in the bitstream, at decoding, allows optimizing the compromise between quality and complexity by, in particular:
- changing the gain of the noise to be injected into the sum of the selected spectral components in order to construct the synthesis signal replacing the lost frame,
- changing the number of components selected for the synthesis,
- changing the duration of the pitch search segment.
Such an embodiment may be implemented in an encoder for the determination of voice
information, and more particularly in a decoder, for the case of frame loss. It may be
implemented as software to carry out encoding/decoding for the enhanced voice services (or
"EVS") specified by the 3GPP group (SA4).
In this capacity, the invention also provides a computer program comprising instructions for
implementing the above method when this program is executed by a processor. An
exemplary flowchart of such a program is presented in the detailed description below, with
reference to Figure 4 for decoding and with reference to Figure 3 for encoding.
The invention also relates to a device for decoding a digital audio signal comprising a series
of samples distributed in successive frames. The device comprises means (such as a
processor and a memory, or an ASIC component or other circuit) for replacing at least one
lost signal frame, by:
a) searching, in a valid signal segment available when decoding, for at least one period in
the signal, determined based on said valid signal,
b) analyzing the signal in said period, in order to determine spectral components of the signal
in said period,
c) synthesizing at least one frame for replacing the lost frame, by constructing a synthesis
signal from:
- an addition of components selected from among said determined spectral components, and
- noise added to the addition of components,
the amount of noise added to the addition of components being weighted based on voice
information of the valid signal, obtained when decoding.
Similarly, the invention also relates to a device for encoding a digital audio signal,
comprising means (such as a memory and a processor, or an ASIC component or other
circuit) for providing voice information in a bitstream delivered by the encoding device,
distinguishing a speech signal likely to be voiced from a music signal, and in the case of a
speech signal:
- identifying that the signal is voiced or generic, in order to consider it as generally voiced,
or
- identifying that the signal is inactive, transient, or unvoiced, in order to consider it as
generally unvoiced.
10
Other features and advantages of the invention will be apparent from examining the
following detailed description and the appended drawings in which:
- Figure 1 summarizes the main steps of the method for correcting frame loss in the
sense of document FR 1350845;
- Figure 2 schematically shows the main steps of a method according to the invention;
- Figure 3 illustrates an example of steps implemented in encoding, in one
embodiment in the sense of the invention;
- Figure 4 shows an example of steps implemented in decoding, in one embodiment
in the sense of the invention;
- Figure 5 illustrates an example of steps implemented in decoding, for the pitch search
in a valid signal segment Nc;
- Figure 6 schematically illustrates an example of encoder and decoder devices in the
sense of the invention.
We now refer to Figure 1, illustrating the main steps described in document FR 1350845. A
series of N audio samples, denoted b(n) below, is stored in a buffer memory of the decoder.
These samples correspond to samples already decoded and are therefore accessible for
correcting frame loss at the decoder. If the first sample to be synthesized is sample N, the
audio buffer corresponds to previous samples 0 to N-1. In the case of transform coding, the
audio buffer corresponds to samples in the previous frame, which cannot be changed because
this type of encoding/decoding does not provide for delay in reconstructing the signal;
therefore the implementation of a crossfade of sufficient duration to cover a frame loss is
not provided for.
Next is a step S2 of frequency filtering, in which the audio buffer b(n) is divided into two
bands, a low band LB and a high band HB, with a separation frequency denoted Fc (for
example Fc=4kHz). This filtering is preferably a delayless filtering. The size of the audio
buffer is now reduced to N' = N·Fc/fs following decimation from fs to Fc. In variants of the invention, this filtering step may be optional, the next steps being carried out on the full band.
The next step S3 consists of searching the low band for a loop point and a segment p(n)
corresponding to the fundamental period (or "pitch") within buffer b(n) re-sampled at
frequency Fc. This embodiment allows taking into account pitch continuity in the lost
frame(s) to be reconstructed.
Step S4 consists of breaking apart segment p(n) into a sum of sinusoidal components. For
example, the discrete Fourier transform (DFT) of signal p(n) over a duration corresponding
to the length of the signal can be calculated. The frequency, phase, and amplitude of each of
the sinusoidal components (or "peaks") of the signal are thus obtained. Transforms other
than DFT are possible. For example, transforms such as DCT, MDCT, or MCLT may be
applied.
Step S5 is a step of selecting K sinusoidal components in order to retain only the most significant components. In one particular embodiment, the selection of components first corresponds to selecting the amplitudes A(n) for which A(n) > A(n−1) and A(n) > A(n+1), where n ∈ [0; P'/2 − 1], which ensures that the amplitudes correspond to spectral peaks.
To do this, the samples of segment p(n) (pitch) are interpolated to obtain a segment p'(n) composed of P' samples, where P' = 2^ceil(log2(P)) ≥ P, ceil(x) being the smallest integer greater than or equal to x. Analysis by FFT is therefore done more efficiently, over a length which is a power of 2, without modifying the actual pitch period (due to the interpolation). The FFT transform of p'(n) is calculated:

Π(k) = FFT(p'(n))

and, from this FFT transform, the phases φ(k) and amplitudes A(k) of the sinusoidal components are directly obtained, the normalized frequencies between 0 and 1 being given here by:

f(k) = 2k / P',  k ∈ [0; P'/2 − 1]
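As an illustration, a minimal Python sketch of this analysis step (assuming numpy and a linear interpolation, which the text does not specify) could be:

```python
import numpy as np

def analyze_pitch_segment(p):
    """Sketch of step S4: interpolate the pitch segment p(n) to a
    power-of-two length P' and extract amplitude, phase and normalized
    frequency for each FFT bin."""
    P = len(p)
    P2 = 2 ** int(np.ceil(np.log2(P)))        # P' = 2^ceil(log2(P)) >= P
    # One possible choice: linear resampling of p onto P' points.
    p_i = np.interp(np.arange(P2) * P / P2, np.arange(P), p)
    spectrum = np.fft.fft(p_i)
    k = np.arange(P2 // 2)                    # k in [0; P'/2 - 1]
    A = np.abs(spectrum[k])                   # amplitudes A(k)
    phi = np.angle(spectrum[k])               # phases phi(k)
    f = 2.0 * k / P2                          # f(k) = 2k/P', in [0; 1)
    return A, phi, f
```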
Next, among the amplitudes of this first selection, the components are selected in descending
order of amplitude, so that the cumulative amplitude of the selected peaks is at least x% (for
example x=70%) of the cumulative amplitude over typically half the spectrum at the current
frame.
In addition, it is also possible to limit the number of components (for example to 20) in order
to reduce the complexity of the synthesis.
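A possible Python sketch of this selection (illustrative only; the x% criterion and the cap of 20 components follow the text, everything else is an assumption):

```python
import numpy as np

def select_peaks(A, x_pct=70.0, max_peaks=20):
    """Sketch of step S5: keep local maxima of the amplitude spectrum,
    then the largest of them until their cumulative amplitude reaches
    x% of the cumulative amplitude of the analyzed (half-)spectrum."""
    n = np.arange(1, len(A) - 1)
    # First selection: A(n) > A(n-1) and A(n) > A(n+1) (spectral peaks).
    peaks = n[(A[n] > A[n - 1]) & (A[n] > A[n + 1])]
    # Then keep peaks in descending order of amplitude.
    peaks = peaks[np.argsort(A[peaks])[::-1]]
    total, cumulative, kept = float(np.sum(A)), 0.0, []
    for idx in peaks:
        if cumulative >= total * x_pct / 100.0 or len(kept) >= max_peaks:
            break
        kept.append(int(idx))
        cumulative += float(A[idx])
    return kept
```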
The sinusoidal synthesis step S6 consists of generating a segment s(n) of a length at least equal to the size of the lost frame (T). The synthesis signal s(n) is calculated as a sum of the selected sinusoidal components:

s(n) = Σ_k A(k) · sin(π · f(k) · n + φ(k)),  n ∈ [0; 2T + T/2]
where k is the index of the K peaks selected in step S5.
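Illustratively, and continuing the sketch above (the pi factor reflects the frequency normalization f(k) = 2k/P' reconstructed earlier; amplitude normalization of the raw FFT magnitudes is omitted here):

```python
import numpy as np

def synthesize_peaks(A, phi, f, kept, T):
    """Sketch of step S6: sum of the selected sinusoids over
    n in [0; 2T + T/2], covering the lost frame plus overlap."""
    n = np.arange(2 * T + T // 2 + 1, dtype=float)
    s = np.zeros_like(n)
    for k in kept:
        # f(k) is normalized to [0; 1] (1 = Nyquist), hence the pi factor.
        s += A[k] * np.sin(np.pi * f[k] * n + phi[k])
    return s
```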
Step S7 consists of "noise injection" (filling in the spectral regions corresponding to the lines not selected) in order to compensate for the energy loss due to the omission of certain frequency peaks in the low band. One particular implementation consists of calculating the residual r(n) between the segment corresponding to the pitch, p(n), and the synthesis signal s(n):

r(n) = p(n) − s(n),  n ∈ [0; P − 1]
This residual of size P is transformed, for example windowed and repeated with overlaps between windows of varying sizes, as described in patent application FR 1353551, to obtain a noise signal r'(n) = F(r(n)) defined over n ∈ [0; 2T + T/2], where F denotes this windowing and repetition operation.
Signal s(n) is then combined with signal r'(n):

s(n) = s(n) + r'(n),  n ∈ [0; 2T + T/2]
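A hedged Python sketch of this noise injection (the periodic extension below is a simplification; the text cites FR 1353551 for a windowed overlap-add extension instead):

```python
import numpy as np

def inject_noise(p, s, gain=1.0):
    """Sketch of step S7: residual between the pitch segment and its
    sinusoidal model, extended and mixed back with a gain."""
    P = len(p)
    r = p - s[:P]                      # r(n) = p(n) - s(n), n in [0; P-1]
    # Simplified extension of r to the synthesis length by repetition.
    repeats = int(np.ceil(len(s) / P))
    r_ext = np.tile(r, repeats)[:len(s)]
    return s + gain * r_ext            # s(n) + G * r'(n)
```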
Step S8 applied to the high band may simply consist of repeating the past signal.
In step S9, the signal is synthesized by resampling the low band at its original frequency fs, after having been mixed with the filtered high band from step S8 (simply repeated in step S11).
Step S10 is an overlap-add to ensure continuity between the signal before the frame loss and
the synthesis signal.
We now describe elements added to the method of Figure 1, in one embodiment in the sense
of the invention.
According to a general approach presented in Figure 2, voice information of the signal before the frame loss, transmitted at at least one bitrate of the coder, is used in decoding (step DI-1) in order to quantitatively determine a proportion of noise to be added to the synthesis signal replacing one or more lost frames. Thus, the decoder uses the voice information to decrease, based on the voicing, the overall amount of noise mixed into the synthesis signal (by assigning a lower gain G(res) to the noise signal r'(k) originating from a residual in step DI-3, and/or by selecting more components of amplitudes A(k) for use in constructing the synthesis signal in step DI-4).
In addition, the decoder may adjust its parameters, particularly for the pitch search, to optimize the quality/complexity compromise of the processing based on the voice information. For example, if the signal is voiced, the pitch search window Nc may be larger (in step DI-5), as we will see below with reference to Figure 5.
For determining the voicing, information may be provided by the encoder, in two ways, at
at least one bitrate of the encoder:
- in the form of a bit of value 1 or 0 depending on a degree of voicing identified in the
encoder (received from the encoder in step DI-1 and read in step DI-2 in case of
frame loss for the subsequent processing), or
- as a value of the average amplitude of the peaks composing the signal in encoding,
compared to a background noise.
This spectrum "flatness" data Pl may be received in multiple bits at the decoder in optional
step DI-10 of Figure 2, then compared to a threshold in step DI-11, which is the same as
30 determining in steps DI-1 and DI-2 whether the voicing is above or below a threshold, and
deducing the appropriate processing, particularly for the selection of peaks and for the choice
of length of the pitch search segment.
This information (whether in the form of a single bit or as a multi-bit value) is received from
the encoder (at at least one bitrate of the codec), in the example described here.
Indeed, with reference to Figure 3, in the encoder, the input signal presented in the form of
frames C1 is analyzed in step C2. The analysis step consists of determining whether the
audio signal of the current frame has characteristics that require special processing in case
of frame loss at the decoder, as is the case for example with voiced speech signals.
In one particular embodiment, a classification (speech/music or other) already determined
at the encoder is advantageously used in order to avoid increasing the overall complexity of
the processing. Indeed, in the case of encoders that can switch coding modes between speech
or music, classification at the encoder already allows adapting the encoding technique
employed to the nature of the signal (speech or music). Similarly, in the case of speech,
predictive encoders such as the encoder of the G.718 standard also use classification in order
to adapt the encoder parameters to the type of signal (sounds that are voiced/unvoiced,
transient, generic, inactive).
In a first particular embodiment, only one bit is reserved for "frame loss characterization."
It is added to the encoded stream (or "bitstream") in step C3 to indicate whether the signal
is a speech signal (voiced or generic). This bit is, for example, set to 1 or 0 according to the
following table, based on:
• the decision of the speech/music classifier
• and also on the decision of the speech coding mode classifier.
| Decision of the encoder's speech/music classifier | Decision of the coding mode classifier | Value of the "frame loss characterization" bit |
|---|---|---|
| Speech | Voiced | 1 |
| Speech | Not voiced | 0 |
| Speech | Transient | 0 |
| Speech | Generic | 1 |
| Speech | Inactive | 0 |
| Music | — | 0 |
Here, the term "generic" refers to a common speech signal (which is not a transient related
to the pronunciation of a plosive, is not inactive, and is not necessarily purely voiced such
as the pronunciation of a vowel without a consonant).
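The mapping of the table can be sketched as follows (illustrative Python; the classifier labels are those of the table, the function name is hypothetical):

```python
def frame_loss_bit(is_speech: bool, coding_mode: str) -> int:
    """Sketch of the table above: derive the one-bit 'frame loss
    characterization' from the encoder's existing classifiers."""
    if not is_speech:                    # music: bit set to 0
        return 0
    # Voiced or generic speech -> 1; transient, inactive, unvoiced -> 0.
    return 1 if coding_mode in ("voiced", "generic") else 0
```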
In a second alternative embodiment, the information transmitted to the decoder in the
bitstream is not binary but corresponds to a quantification of the ratio between the peaks and
valleys in the spectrum. This ratio can be expressed as a measurement of the "flatness" of
the spectrum, denoted Pl:
Pl = log2( exp( (1/N) · Σ_{k=0}^{N−1} ln(x(k)) ) / ( (1/N) · Σ_{k=0}^{N−1} x(k) ) )
In this expression, x(k) is the amplitude spectrum of size N resulting from the analysis of the current frame in the frequency domain (after FFT); Pl thus measures the ratio of the geometric mean to the arithmetic mean of the spectrum.
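Illustratively (a sketch assuming the log base of the formula as reconstructed above; the small epsilon guarding against log(0) is an implementation detail, not part of the text):

```python
import numpy as np

def spectral_flatness(x, eps=1e-12):
    """Sketch of the flatness measure Pl: log-ratio of the geometric
    mean to the arithmetic mean of the amplitude spectrum x(k);
    0 for a flat spectrum, negative for pronounced peaks."""
    x = np.asarray(x, dtype=float) + eps
    geometric = np.exp(np.mean(np.log(x)))
    arithmetic = np.mean(x)
    return float(np.log2(geometric / arithmetic))
```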
In an alternative, a sinusoidal analysis is provided, breaking down the signal at the encoder into sinusoidal components and noise, and the flatness measurement is obtained as a ratio between the energy of the sinusoidal components and the total energy of the frame.
After step C3 (including the one bit of voice information or the multiple bits of the flatness
measurement), the audio buffer of the encoder is conventionally encoded in step C4 before
any subsequent transmission to the decoder.
Referring now to Figure 4, we will describe the steps implemented in the decoder in one
exemplary embodiment of the invention.
In the case where there is no frame loss in step D1 (NOK arrow exiting test D1 of Figure 4), in step D2 the decoder reads the information contained in the bitstream, including the
"frame loss characterization" information (at at least one bitrate of the codec). This
information is stored in memory so it can be reused when a following frame is missing. The
decoder then continues with the conventional steps of decoding D3, etc., to obtain the
synthesized output frame FR SYNTH.
In the case where one or more frames are lost (OK arrow exiting test D1), steps D4, D5, D6, D7, D8, and D12 are applied, respectively corresponding to steps S2, S3, S4, S5, S6, and S11 of Figure 1. However, a few changes are made concerning steps S3 and S5, i.e. steps D5 (searching for a loop point for the pitch determination) and D7 (selecting sinusoidal components). Furthermore, the noise injection of step S7 of Figure 1 is carried out with a gain determination according to two steps D9 and D10 in Figure 4 of the decoder in the sense of the invention.
In the case where the "frame loss characterization" information is known (when the previous
frame has been received), the invention consists of modifying the processing of steps D5,
D7, and D9-D10, as follows.
In a first embodiment, the "frame loss characterization" information is binary, of a value:
- equal to 0 for an unvoiced signal, of a type such as music or transient,
- equal to 1 otherwise (the above table).
Step D5 consists of searching for a loop point and a segment p(n) corresponding to the pitch
within the audio buffer resampled at frequency Fc. This technique, described in document
FR 1350845, is illustrated in Figure 5, in which:
- the audio buffer in the decoder is of sample size N',
- the size of a target buffer BC of Ns samples is determined,
- the correlation search is performed over Nc samples,
- the correlation curve "Correl" has a maximum at mc,
- the loop point is designated Loop pt and is positioned Ns samples from the correlation maximum,
- the pitch is then determined over the remaining samples p(n), up to N'−1.
In particular, a normalized correlation corr(n) is calculated between the target buffer segment of size Ns, located between N'−Ns and N'−1 (of a duration of 6 ms for example), and the sliding segment of size Ns which begins between sample 0 and sample Nc (where Nc < N'−Ns):

corr(n) = [ Σ_{k=0}^{Ns−1} b(n+k) · b(N'−Ns+k) ] / √( Σ_{k=0}^{Ns−1} b(n+k)² · Σ_{k=0}^{Ns−1} b(N'−Ns+k)² ),  n ∈ [0; Nc]
For music signals, due to the nature of the signal, the value Nc does not need to be very large
(for example Nc = 28 ms). This limitation saves computational complexity during the pitch search.
However, voice information from the last valid frame previously received allows
determining whether the signal to be reconstructed is a voiced speech signal (mono pitch).
It is therefore possible, in such cases and with such information, to increase the size of
segment Nc (for example Nc=33 ms) in order to optimize the pitch search (and potentially
find a higher correlation value).
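By way of illustration, a Python sketch of this voicing-dependent pitch search (the 33 ms / 28 ms durations come from the text; the clipping of Nc to the buffer length is an added safeguard):

```python
import numpy as np

def pitch_search(b, Ns, voiced, fs):
    """Sketch of step D5: normalized correlation between the target
    segment (last Ns samples of buffer b) and a sliding segment,
    with a search range Nc widened when the signal is voiced."""
    Nc = int((0.033 if voiced else 0.028) * fs)   # 33 ms vs 28 ms
    Nc = min(Nc, len(b) - 2 * Ns)                 # keep Nc < N' - Ns
    target = b[len(b) - Ns:]
    t_energy = np.dot(target, target)
    corr = np.zeros(Nc + 1)
    for n in range(Nc + 1):
        seg = b[n:n + Ns]
        denom = np.sqrt(np.dot(seg, seg) * t_energy)
        corr[n] = np.dot(seg, target) / denom if denom > 0 else 0.0
    return int(np.argmax(corr))                   # position mc of the maximum
```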
In step D7 in Figure 4, sinusoidal components are selected such that only the most significant components are retained. In one particular embodiment, also presented in document FR 1350845, the first selection of components is equivalent to selecting the amplitudes A(n) where A(n) > A(n−1) and A(n) > A(n+1), with n ∈ [0; P'/2 − 1].
In the case of the invention, it is advantageously known whether the signal to be
reconstructed is a speech signal (voiced or generic) and therefore has pronounced peaks and
a low level of noise. Under these conditions, it is preferable to select not only the peaks A(n)
where A(n) > A(n−1) and A(n) > A(n+1) as shown above, but also to expand the selection to
A(n-1) and A(n+1) so that the selected peaks represent a larger portion of the total energy
of the spectrum. This modification allows lowering the level of noise (and in particular the
level of noise injected in steps D9 and D10 presented below) compared to the level of the
signal synthesized by sinusoidal synthesis in step D8, while retaining an overall energy level
sufficient to cause no audible artifacts related to energy fluctuations.
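A minimal sketch of this expanded selection (illustrative; the clipping of indices to the spectrum bounds is an assumption):

```python
def expand_selection(peaks, voiced, n_bins):
    """Sketch of the modified step D7: when the last valid frame was
    flagged as voiced, also keep the first neighbors A(n-1), A(n+1)
    of every selected peak A(n)."""
    if not voiced:
        return sorted(peaks)
    expanded = set()
    for n in peaks:
        expanded.update({max(n - 1, 0), n, min(n + 1, n_bins - 1)})
    return sorted(expanded)
```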
Next, in the case where the signal is without noise (at least at low frequencies), as is the case for a generic or voiced speech signal, we observe that the addition of noise corresponding to the transformed residual r'(n) in the sense of FR 1350845 actually degrades the quality.
Therefore the voice information is advantageously used to reduce noise by applying a gain
G in step D10. Signal s(n) resulting from step D8 is mixed with the noise signal r'(n) resulting
from step D9, but a gain G is applied here which depends on the "frame loss characterization" information originating from the bitstream of the previous frame, which gives:

s(n) = s(n) + G · r'(n),  n ∈ [0; 2T + T/2]
In this particular embodiment, G may be a constant equal to 1 or 0.25 depending on the
voiced or unvoiced nature of the signal of the previous frame, according to the table given
below by way of example:
Value of "frame loss
characterization" bit
0 1
Gain G 1 0.25
In the alternative embodiment where the "frame loss characterization" information has a plurality of discrete levels characterizing the flatness Pl of the spectrum, the gain G may be expressed directly as a function of the Pl value. The same is true for the bounds of segment Nc for the pitch search and/or for the number of peaks A(n) to be taken into account for synthesis of the signal.
Processing such as the following can be defined as an example.
The gain G may be defined directly as a function of the Pl value: G(Pl) = 2^Pl.
In addition, the Pl value is compared to an average threshold value of −3 dB, given that a value of 0 corresponds to a flat spectrum and −5 dB corresponds to a spectrum with pronounced peaks.

If the Pl value is less than the threshold value −3 dB (thus corresponding to a spectrum with pronounced peaks, typical of a voiced signal), then we can set the duration of the segment for the pitch search Nc to 33 ms, and we can select the peaks A(n) such that A(n) > A(n−1) and A(n) > A(n+1), as well as the first neighboring peaks A(n−1) and A(n+1).
Otherwise (if the Pl value is above the threshold, corresponding to less pronounced peaks and more background noise, as in a music signal for example), the duration Nc can be chosen to be shorter, for example 25 ms, and only the peaks A(n) satisfying A(n) > A(n−1) and A(n) > A(n+1) are selected.
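These decisions can be summarized in a short sketch (illustrative only; the returned durations are in seconds, and the threshold follows the −3 dB example of the text):

```python
def params_from_flatness(Pl, threshold=-3.0):
    """Sketch of the multi-bit variant: derive the noise gain, the
    pitch-search duration Nc and the neighbor-selection flag from
    the flatness value Pl."""
    gain = 2.0 ** Pl                  # G(Pl) = 2^Pl, as in the text
    if Pl < threshold:                # pronounced peaks: voiced-like case
        return gain, 0.033, True      # Nc = 33 ms, keep first neighbors
    return gain, 0.025, False         # Nc = 25 ms, peaks only
```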
The decoding can then continue by mixing the noise, with the gain thus obtained, with the components selected in this manner, to obtain the synthesis signal in the low frequencies in step D13, which is added to the synthesis signal in the high frequencies obtained in step D14, in order to obtain the overall synthesis signal in step D15.
Referring to Figure 6, one possible implementation of the invention is illustrated in which a
decoder DECOD (comprising for example software and hardware such as a suitably
programmed memory MEM and a processor PROC cooperating with this memory, or
alternatively a component such as an ASIC, or other, as well as a communication interface
COM) embedded for example in a telecommunications device such as a telephone TEL, for
the implementation of the method of Figure 4, uses voice information that it receives from
an encoder ENCOD. This encoder comprises, for example, software and hardware such as
a suitably programmed memory MEM' for determining the voice information and a
processor PROC' cooperating with this memory, or alternatively a component such as an
ASIC, or other, and a communication interface COM'. The encoder ENCOD is embedded
in a telecommunications device such as a telephone TEL'.
Of course, the invention is not limited to the embodiments described above by way of
example; it extends to other variants.
Thus, for example, it is understood that the voice information may take different forms in variants. In the example described above, this may be the binary value of a single bit (voiced or not voiced), or a multi-bit value that may concern a parameter such as the flatness of the signal spectrum or any other parameter that allows characterizing voicing (quantitatively or qualitatively). Furthermore, this parameter may be determined at decoding, for example based on the degree of correlation that can be measured when identifying the pitch period.
An embodiment was presented above by way of example which included a separation, into
a high frequency band and a low frequency band, of the signal from preceding valid frames,
in particular with a selection of spectral components in the low frequency band. This
implementation is optional, however, although it is advantageous as it reduces the
complexity of the processing. Alternatively, the method of frame replacement with the
assistance of voice information in the sense of the invention can be carried out while
considering the entire spectrum of the valid signal.
An embodiment was described above in which the invention is implemented in a context of transform coding with overlap-add. However, this type of method can be adapted to any
other type of coding (CELP in particular).
It should be noted that in the context of transform coding with overlap-add (where typically the synthesis signal is constructed over at least two frame durations because of the overlap), said noise signal can be obtained from the residual (between the valid signal and the sum of the peaks) by temporally weighting that residual. For example, it can be weighted by overlap windows, as in the usual context of encoding/decoding by transform with overlap.
It is understood that applying a gain as a function of the voice information adds another weighting, this time based on the voicing.
We Claim:-
1. Method for processing a digital audio signal comprising a series of samples distributed
in successive frames, the method being implemented when decoding said signal in order to
replace at least one lost signal frame during decoding,
the method comprising the steps of:
a) searching, in a valid signal segment available when decoding, for at least one period in
the signal, determined based on said valid signal,
b) analyzing the signal in said period, in order to determine spectral components of the
signal in said period,
c) synthesizing at least one replacement for the lost frame, by constructing a synthesis
signal from:
- an addition of components selected from among said determined spectral components,
and
- noise added to the addition of components,
wherein the amount of noise added to the addition of components is weighted based on
voice information of the valid signal, obtained when decoding.
2. Method according to claim 1, wherein a noise signal added to the addition of
components is weighted by a smaller gain in the case of voicing in the valid signal.
3. Method according to claim 2, wherein the noise signal is obtained by a residual
between the valid signal and the addition of selected components.
4. Method according to claim 1, wherein the number of components selected for the
addition is larger in the case of voicing in the valid signal.
5. Method according to claim 1, wherein, in step a), the period is searched for in a valid
signal segment of greater length in the case of voicing in the valid signal.
6. Method according to claim 1, wherein the voice information is supplied in a bitstream
received in decoding and corresponding to said signal comprising a series of samples
distributed in successive frames,
and wherein, in the case of frame loss in decoding, the voice information contained in a
valid signal frame preceding the lost frame is used.
7. Method according to claim 6, wherein the voice information comes from an encoder
generating the bitstream and determining the voice information, and wherein the voice
information is encoded in a single bit in the bitstream.
8. Method according to claim 7, wherein a noise signal added to the addition of
components is weighted by a smaller gain in the case of voicing in the valid signal, and, if
the signal is voiced, the gain value is 0.25, and otherwise is 1.
9. Method according to claim 6, wherein the voice information comes from an encoder
determining a spectrum flatness value, obtained by comparing amplitudes of the spectral
components of the signal to a background noise, said encoder delivering said value in
binary form in the bitstream.
10. Method according to claim 9, wherein a noise signal added to the addition of
components is weighted by a smaller gain in the case of voicing in the valid signal, and the
gain value is determined as a function of said flatness value.
11. Method according to claim 9, wherein said flatness value is compared to a threshold in
order to determine:
- that the signal is voiced if the flatness value is below the threshold, and
- that the signal is unvoiced otherwise.
12. Method according to claim 7, wherein the number of components selected for the
addition is larger in the case of voicing in the valid signal, and wherein:
- if the signal is voiced, the spectral components having amplitudes greater than those of
the neighboring first spectral components are selected, as well as the neighboring first
spectral components, and
- otherwise only the spectral components having amplitudes greater than those of the
neighboring first spectral components are selected.
13. Method according to claim 7, wherein, in step a), the period is searched for in a valid
signal segment of greater length in the case of voicing in the valid signal, and wherein:
- if the signal is voiced, the period is searched for in a valid signal segment of a duration of
more than 30 milliseconds,
- and if not, the period is searched for in a valid signal segment of a duration of less than
30 milliseconds.
14. A readable computer medium storing a code of a computer program, wherein said
computer program comprises instructions for implementing the method according to one
of claims 1 to 13 when the program is executed by a processor.
15. Device for decoding a digital audio signal comprising a series of samples distributed in
successive frames, the device comprising a computer circuit for replacing at least one lost
signal frame, by:
a) searching, in a valid signal segment available when decoding, for at least one period in
the signal, determined based on said valid signal,
b) analyzing the signal in said period, in order to determine spectral components of the
signal in said period,
c) synthesizing at least one frame for replacing the lost frame, by constructing a synthesis
signal from:
- an addition of components selected from among said determined spectral components,
and
- noise added to the addition of components,
the amount of noise added to the addition of components being weighted based on voice
information of the valid signal, obtained when decoding.
16. Device for encoding a digital audio signal, comprising a computer circuit for providing
voice information in a bitstream delivered by the encoding device, distinguishing a speech
signal likely to be voiced from a music signal, and, in the case of a speech signal:
- identifying that the signal is voiced or generic, in order to consider it as generally voiced,
or
- identifying that the signal is inactive, transient, or unvoiced, in order to consider it as
generally unvoiced.
Dated this 7th day of October, 2016
SENTHIL KUMAR S.
IN/PA-1546
OF K&S PARTNERS
AGENT FOR THE APPLICANT(S)