Description
System and method for speech coding
The present invention relates to speech coding, and more
particularly, to coding of speech at low bit rates.
Speech communication in applications with limited bandwidth
and/or storage capabilities requires coding and transmission
of speech at very low bit rates. However, at very low bit
rates, for example on the order of 150 bits/second, speech
coders are unable to produce high quality speech, due to the
reduced number of bits available for accurate modeling of the
input speech utterance.
Low bit rate speech coding has been achieved in the past by
segment vocoders. Segment vocoders based on variable-length
segment quantization provide the means for achieving
reasonably low bit rates while still offering intelligible
speech quality. The basic functioning of a segment vocoder
includes segmentation of a sequence of spectral feature
vectors of input speech into a sequence of variable length
segments, and then quantization of each of these segments
using a segment codebook and transmission of the best-match
code-segment index and input segment duration to a decoder.
The decoder functions to synthesize speech by linear
predictive synthesis using the code-segment corresponding to
the code segment index received, and time-normalizing this
code-segment to match the input segment duration. However,
the main issues with the above segment vocoder framework are
how to define the segmental units to be quantized and how
segmentation and segment quantization are actually realized.
Moreover, the type of segment codebook used and its design
are also of concern.
More recently, segment quantization has been achieved by
speech coders based on a speech recognition/synthesis
paradigm involving the principle of unit selection as done in
concatenative speech synthesis.
The article K. S. Lee and R. V. Cox, "A very low bit rate
speech coder based on a recognition/synthesis paradigm", IEEE
Trans. on Speech and Audio Proc., 9(5):482-491, July 2001,
proposes a very low bit rate speech coder based on the
principle of unit selection as done in a concatenative speech
synthesis system. Herein, a large TTS (text-to-speech)
labeled database is used as the codebook, having units made
up of single frame-vectors. The technique described uses a
Viterbi search to perform segmentation and quantization using
unit-selection from the single-frame codebook.
The article K. S. Lee and R. V. Cox, "A segmental speech
coder based on a concatenative TTS", Speech Commun., 38:89-
100, 2002 proposes a segmental speech coder based on unit
selection principles as done in concatenative TTS. The
technique described involves an offline process of
constructing a variable-length segment quantization (VLSQ)
codebook from a speech database, and then segmenting and
classifying (or indexing) the speech database with the
constructed VLSQ codebook. In online processing, an input
speech signal is first segmented and classified with the pre-
constructed VLSQ codebook. Unit-selection is then performed
for each segment according to the classification index in the
speech database.
However, the existing techniques have several sub-
optimalities in the segment quantization procedure. For
example, pre-quantization of the input speech before Viterbi
decoding produces a segmentation that is sub-optimal with
respect to the actual units in the database.
Moreover, the use of a unit-selection quantization on such
pre-segmented and pre-quantized input utterance results in
further loss of optimality as the optimal segmentation (and
the corresponding quantization) would be significantly
different, particularly with respect to the overall spectral
distortion. Further, using only those segments from the
database which have the labels of the pre-quantized input
speech restricts the units available for quantization to a
small sub-set of units. Still further, unit selection by
Viterbi decoding essentially works only on segments defined
by pre-quantization, and hence incurs a sub-optimality with
respect to the overall spectral distortion of the final
segmentation and quantization of the input speech with
respect to the database units, which, after all, are the
actual units used for synthesis at the receiver.
The object of the present invention is to provide an improved
system and method for coding of speech at very low bit rates.
The above object is achieved by a system for coding speech,
comprising:
- means for dividing input speech into a sequence of speech
frames and analyzing said sequence of speech frames to
obtain a sequence of feature vectors corresponding to said
sequence of speech frames, and
- means for quantizing said sequence of feature vectors
obtained from the input speech, including a dynamic
programming framework adapted for unit-selection directly
from a continuous codebook, said continuous codebook
comprising sequentially indexed units made up of feature
vectors corresponding to frames of continuous speech,
wherein said dynamic programming framework further
comprises:
- means for determining a warping path for an optimized
alignment of said sequence of feature vectors obtained
from input speech with a sequence of feature vectors of
units from the continuous codebook using a within-unit
recursion and a cross-unit recursion, said within-unit
recursion and cross-unit recursion minimizing an overall
distortion defined by a weighted function of a unit-cost
and a concatenation-cost, and
- means for computing, from said warping path, an
optimized number of segments and segment boundaries of
the sequence of feature vectors obtained from input
speech, and a corresponding sequence of units selected
from the continuous codebook along the warping path,
wherein said unit-cost is defined as a distortion in
quantizing a segment of sequence of feature vectors of input
speech with a unit selected from the continuous codebook and
said concatenation-cost is defined as a distortion between
two units of the continuous codebook that are selected to
quantize two consecutive segments of the sequence of feature
vectors of input speech.
The above object is achieved by a method for coding speech,
comprising the steps of:
- dividing input speech into a sequence of speech frames and
analyzing said sequence of speech frames to obtain a
sequence of feature vectors corresponding to said sequence
of speech frames, and
- quantizing said sequence of feature vectors obtained from
the input speech via unit-selection directly from a
continuous codebook, said continuous codebook comprising
sequentially indexed units made up of feature vectors
corresponding to frames of continuous speech, wherein unit-
selection from said continuous codebook is based upon a
dynamic programming algorithm comprising:
- determining a warping path for an optimized alignment of
said sequence of feature vectors obtained from input
speech with a sequence of feature vectors of units from
the continuous codebook using a within-unit recursion
and a cross-unit recursion, said within-unit recursion
and cross-unit recursion minimizing an overall
distortion defined by a weighted function of a unit-cost
and a concatenation-cost, and
- computing, from said warping path, an optimized number
of segments and segment boundaries of the sequence of
feature vectors obtained from input speech, and a
corresponding sequence of units selected from the
continuous codebook along the warping path,
wherein said unit-cost is defined as a distortion in
quantizing a segment of sequence of feature vectors of input
speech with a unit selected from the continuous codebook and
said concatenation-cost is defined as a distortion between
two units of the continuous codebook that are selected to
quantize two consecutive segments of the sequence of feature
vectors of input speech.
The underlying idea of the present invention is to make use
of the principle of unit selection as done in concatenative
speech synthesis, while providing a unified framework for
segmentation and quantization at very low bit rates, based on
unit-selection from a continuous unit database (codebook) of
fixed or variable length segments of speech using a dynamic
programming (DP) algorithm. To make use of the
principle of concatenative speech synthesis, a continuous
codebook is employed. The use of the continuous codebook
along with the use of concatenation-costs in the DP algorithm
favors quantizing of consecutive segments of input speech
with units having consecutive indices in the continuous
codebook. Advantageously, the within-unit and cross-unit
recursions of the proposed DP algorithm are applicable for
quantization using variable-length codebooks (made of
variable length segments) or fixed length codebook (with
fixed length segments including the case of single-frame
segments), thus providing a generalized and unified framework
for unit-selection.
According to a preferred embodiment of the invention, to
achieve reduced bit-rates, run-length coding is performed on
the sequence of selected units from the continuous codebook.
Run-length coding identifies at least one contiguous group
from the sequence of selected units which comprises a partial
sequence of units having consecutive indices in the
continuous codebook. From this partial sequence, a base-index
(i.e., the index of the first unit in the contiguous group)
and the contiguity (the number of units having consecutive
indices following the base-index unit) of said contiguous
group are determined. Run-length coding reduces the total number of
bits transmitted by reducing the total number of quantized
parameters to be transmitted for decoding.
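A minimal sketch of this run-length coding step may help illustrate the idea; the function name and list-based representation are illustrative and not taken from the patent:

```python
def run_length_code(indices):
    """Group a sequence of selected unit indices into contiguous groups.

    Each group is reported as (base_index, contiguity): base_index is
    the index of the first unit in the group, and contiguity is the
    number of units with consecutive indices that follow it.
    """
    groups = []
    k = 0
    while k < len(indices):
        base = indices[k]
        contiguity = 0
        # extend the group while the next selected unit is the next codebook unit
        while k + 1 < len(indices) and indices[k + 1] == indices[k] + 1:
            contiguity += 1
            k += 1
        groups.append((base, contiguity))
        k += 1
    return groups

# units 7, 8, 9 form one contiguous group of contiguity 2
print(run_length_code([7, 8, 9, 42, 3, 4]))  # [(7, 2), (42, 0), (3, 1)]
```

Only one (base-index, contiguity) pair is transmitted per group instead of one index per segment, which is where the bit saving comes from.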
According to a further embodiment of the invention, to
facilitate speech transmission, quantized speech parameters
including the base-index and the contiguity of identified
contiguous groups and individual segment lengths are
communicated as a plurality of bits over a communication
channel, for reconstruction by a decoding means, wherein the
total number of bits (B) transmitted over a time duration T
is computed based upon the relationship given by:

B = P (log2 N + log2 cmax) + K log2 Lmax     (7)

wherein,
P is the total number of contiguous groups during the time
duration T,
N is the size of the continuous codebook,
cmax is a maximum allowed contiguity,
K is the total number of segments of input speech during the
time duration T, and
Lmax is a maximum allowed length of a segment k of input
speech obtained after decoding of the feature vectors of
input speech.
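As a sketch, assuming each contiguous group is transmitted as a base-index and a contiguity, and each of the K segments as a length, and rounding each field up to a whole number of bits (the rounding is an assumption of this sketch), the bit count may be computed as:

```python
import math

def total_bits(P, K, N, c_max, L_max):
    """Bits for duration T: P contiguous groups, each coded by a
    base-index (log2 N bits) and a contiguity (log2 c_max bits),
    plus K segment lengths (log2 L_max bits each)."""
    return (P * (math.ceil(math.log2(N)) + math.ceil(math.log2(c_max)))
            + K * math.ceil(math.log2(L_max)))

# e.g. 10 groups, 40 segments, 4096-unit codebook, c_max = 16, L_max = 32
print(total_bits(10, 40, 4096, 16, 32))  # 10*(12+4) + 40*5 = 360
```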
In a particular embodiment of the present invention, units of
said continuous codebook comprise variable number of feature
vectors corresponding to a variable number of frames of said
continuous speech. This provides lower bit rates for a given
spectral distortion than fixed length segments.
In an alternate embodiment of the invention, each of the
units of said continuous codebook comprises an equal number
of feature vectors corresponding to an equal number of frames
of said continuous speech. It has been shown that fixed-
length segments with frame sizes of 6-8 provide spectral
distortions comparable to variable length segments defined as
phonetic units and lower bit-rates and lower spectral
distortions than single-frame units.
In another embodiment of the invention, each unit of said
continuous codebook comprises a single feature vector
obtained corresponding to a single frame of said continuous
speech. This provides lower contiguity and higher spectral
distortions than fixed length segments of length greater than
1. This embodiment also provides a more gradual increase of
spectral distortion with decrease in bit-rate.
According to an especially preferred further embodiment of
the invention, the within-unit recursion is carried out over
all feature vectors of a unit except the feature vector
corresponding to the first frame of said unit, based upon a
relationship given by:

D(i,j,n) = α·d(i,j,n) + min { D(i-1,j,n), D(i-1,j-1,n), D(i-1,j-2,n) },
for j = 2, ..., ln,

and the cross-unit recursion is carried out over the feature
vector corresponding to the first frame of a unit, based upon
a relationship given by:

D(i,1,n) = α·d(i,1,n) + min over r = 1, ..., N of
[ D(i-1,lr,r) + (1-α)·βr,n·dE(lr,1n) ],

wherein,
D(i,j,n) is defined as a minimum accumulated distortion by
any path reaching a grid point defined by feature vector 'i'
of input speech and feature vector 'j' of unit 'n' in the
continuous codebook,
d(i,j,n) is defined as a local distance between feature
vector 'i' of the input speech and feature vector 'j' of unit
'n',
N denotes the total number of units in the continuous codebook,
lr denotes the feature vector corresponding to the last frame
of unit 'r' in the continuous codebook,
α and 1-α are respectively defined as relative weights of
the unit-cost and concatenation-cost,
dE(lr,1n) is the Euclidean distance between the feature
vector corresponding to the last frame of unit ur and the
feature vector corresponding to the first frame of unit un,
and
βr,n = 0 if n = r + 1, and βr,n = 1 otherwise.
The above described within-unit and cross-unit recursions can
be advantageously used for quantization using both variable-
length as well as fixed-length codebooks including single-
frame codebooks. The use of the weighing parameter α
facilitates controlling the relative importance of the unit-
cost and concatenation-cost in determining the optimal
warping path. Moreover, the proposed embodiment is
advantageous in reducing storage requirements as only a small
portion of the array of accumulated distortions is needed to
perform the recursions.
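The within-unit and cross-unit recursions may be sketched in Python as follows. This is a minimal illustration under stated assumptions: Euclidean local distances, a plain list-of-arrays codebook, and only the minimum accumulated distortion returned; a full coder would also keep back-pointers to recover the warping path, segment boundaries, and selected units:

```python
import numpy as np

def one_pass_dp(obs, units, alpha=0.7):
    """Minimum accumulated distortion of a warping path aligning the
    input frames obs (T, d) with a concatenation of codebook units
    (each a (l_n, d) array), using within-unit and cross-unit recursions.
    """
    T = len(obs)
    INF = np.inf
    # D[n][j]: accumulated distortion at frame j of unit n for current i
    D = [np.full(len(u), INF) for u in units]

    def d(i, j, n):                      # local distance (Euclidean, assumed)
        return np.linalg.norm(obs[i] - units[n][j])

    # endpoint constraint: a path may start at the first frame of any unit
    for n in range(len(units)):
        D[n][0] = alpha * d(0, 0, n)

    for i in range(1, T):
        D_prev = [col.copy() for col in D]
        for n, u in enumerate(units):
            # cross-unit recursion: enter the first frame of unit n from the
            # last frame of any unit r, paying a concatenation-cost unless
            # n directly follows r in the continuous codebook
            best = INF
            for r, ur in enumerate(units):
                beta = 0.0 if n == r + 1 else 1.0
                cc = (1 - alpha) * beta * np.linalg.norm(ur[-1] - u[0])
                best = min(best, D_prev[r][-1] + cc)
            D[n][0] = best + alpha * d(i, 0, n)
            # within-unit recursion: predecessors (i-1,j), (i-1,j-1), (i-1,j-2)
            for j in range(1, len(u)):
                D[n][j] = (D_prev[n][max(0, j - 2):j + 1].min()
                           + alpha * d(i, j, n))
    # endpoint constraint: the path must end at the last frame of some unit
    return min(col[-1] for col in D)

# quantizing an utterance identical to a codebook unit yields zero distortion
obs = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(one_pass_dp(obs, [obs.copy()], alpha=1.0))  # 0.0
```

Note that only the previous column of accumulated distortions (D_prev) is retained, reflecting the reduced storage requirement mentioned above.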
In a further embodiment of the invention feature vectors
comprise linear prediction (LP) parameters (obtained by LP
analysis) or mel-frequency cepstrum coefficients. These
provide for particularly effective extraction of features.
The present invention is further described hereinafter with
reference to preferred embodiments shown in the accompanying
drawings, in which:
FIG 1 shows a schematic overview of a speech communication
system in accordance with an embodiment of the present
invention,
FIG 2 is a flowchart illustrating a method for coding speech
in accordance with an embodiment of the present invention,
FIG 3 is an exemplary illustration of a warping path for an
optimized alignment between a sequence of feature vectors of
input speech and a sequence of feature vectors of units from
the continuous codebook,
FIG 4 is an exemplary illustration of a transition rule for
within-unit recursion in accordance with an embodiment of the
present invention,
FIG 5 is an exemplary illustration of a transition rule for
cross-unit recursion in accordance with an embodiment of the
present invention,
FIG 6 shows a graph illustrating rate-distortion curves for
different fixed-length units and variable-length phonetic
units,
FIG 7 shows a graph illustrating contiguity distribution for
fixed length units and variable length phonetic units, and
FIG 8 is a graph illustrating comparative rate-distortion
performances of the proposed unit-selection algorithm and the
existing techniques.
The method proposed herein provides a unified framework for
segmenting and quantizing input speech using a dynamic
programming algorithm for performing unit-selection from a
continuous codebook. The proposed method implements segment
quantization of input speech directly with respect to the
units of the continuous codebook, instead of using a pre-
constructed VLSQ codebook. From the description that follows,
it can be understood and appreciated that this unified
framework can be generally applied to single-frame codebooks
(whose segments are single-frames) as well as segmental
codebooks (whose segments can be of fixed or variable
length).
FIG 1 shows a schematic overview of a speech communication
system 5 in accordance with one embodiment of the present
invention. The speech communication system 5 includes a
coding means 10 that receives input speech 12 and analyzes
and quantizes it in a form suitable for transmission to a
communication channel 22, for example a telephone or radio
link, internet, etc. Having passed through the communication
channel 22, the quantized speech parameters arrive at a
decoding means 24 which uses these quantized speech
parameters to synthesize a replica 36 of the input speech 12
for delivery to a listener (not shown). The subsystems of the
illustrated embodiment and their functions are discussed in
greater detail below.
The input speech 12 is first provided to a speech analysis
means 14, wherein important speech parameters are extracted
and forwarded to a quantizing means 16. The speech analysis
means 14 divides the input speech 12 into a sequence of
speech frames and analyzes them to extract, for each frame, a
feature vector comprising a set of spectral parameters of the
frame. As used herein, the term 'frame' is intended to refer
to a sample of input speech of fixed duration in time domain
wherein the change in spectral information is insignificant
for practical purposes. A typical (but not limiting) example
of a frame duration is 10 milliseconds. In one embodiment,
the set of spectral parameters may comprise a set of linear
prediction (LP) parameters obtained by LP analysis of the
input speech 12. In another embodiment, the set of spectral
parameters may comprise mel-frequency-cepstrum coefficients
(MFCC). The speech analysis means 14 thus generates a
sequence of feature vectors 15 of input speech 12, which is
forwarded to the quantizing means 16.
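As an illustration of this framing and analysis step, the following sketch divides a signal into 10 ms frames and extracts one feature vector per frame. The log-magnitude spectrum is used purely as a stand-in for the LP parameters or MFCCs named above, and the function name is illustrative:

```python
import numpy as np

def analyze(speech, fs=8000, frame_ms=10):
    """Divide speech into fixed-duration frames and extract one
    spectral feature vector per frame."""
    frame_len = int(fs * frame_ms / 1000)        # e.g. 80 samples at 8 kHz
    n_frames = len(speech) // frame_len
    frames = speech[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)               # taper each frame
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(spectra + 1e-10)               # (n_frames, frame_len//2 + 1)

feats = analyze(np.random.randn(8000))           # 1 s of speech -> 100 frames
print(feats.shape)  # (100, 41)
```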
As used herein, 'quantizing' is intended to refer to the
process of assigning a single quantized value to a feature
vector of a frame of speech or a group of feature vectors
corresponding to a group of frames of speech (also referred
to as a segment) of input speech.
The quantizing value is selected from a sequence of indexed
units stored in a database, referred to as a codebook,
wherein feature vectors corresponding to a frame or segments
of input speech are assigned a quantizing value based upon
the index of the unit that best matches the frame or segment.
An identical codebook is typically used by the decoding means
to reconstruct speech from the quantized speech parameters.
In this description, a variable length codebook refers to
continuous speech data segmented into units which are of
variable lengths. A fixed length codebook refers to
continuous speech data whose segmental units are all fixed
length segments (of some arbitrary fixed length), and a
single-frame codebook refers to continuous speech data whose
segmental units are all single frames (i.e., fixed length
segments of length 1).
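As an illustration of these definitions, a fixed length continuous codebook may be sketched as consecutive slices of a continuous stream of training feature vectors, so that concatenating consecutive units recovers the stream (the function name and layout are illustrative, not from the patent):

```python
import numpy as np

def build_codebook(train_feats, unit_len):
    """Slice a continuous stream of training feature vectors
    (n_frames, d) into sequentially indexed fixed-length units of
    unit_len frames each; unit_len = 1 gives a single-frame codebook."""
    n_units = len(train_feats) // unit_len
    return [train_feats[n * unit_len:(n + 1) * unit_len]
            for n in range(n_units)]

cb = build_codebook(np.arange(20.0).reshape(10, 2), unit_len=2)
print(len(cb))  # 5 units of 2 frames each
```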
In the described embodiment, the quantizing means 16
comprises a dynamic programming framework 17 which is adapted
for unit-selection directly from a codebook 18. The codebook
18 comprises units made up of feature vectors corresponding
to speech frames derived from training data comprising
continuous speech. That is, when consecutive units of the
codebook 18 are concatenated, a continuous waveform may be
obtained. The codebook 18 is hence also referred to as a
continuous codebook. The units of the continuous codebook 18
are sequentially indexed. In accordance with different
embodiments of the present invention, the continuous codebook
18 may comprise single-frame units, or units made up of
segments of fixed or variable lengths (i.e., the number of
frames). The dynamic programming framework 17 performs joint
segmentation and quantization of the sequence of feature
vectors 15 of input speech 12. To that end, the dynamic
programming framework 17 includes means to determine a
warping path (see FIG 3) for an optimized alignment of the
sequence of feature vectors 15 of the input speech 12 with a
sequence of feature vectors of units from the continuous
codebook 18. As discussed in greater detail with reference to
FIG 4 and FIG 5, the optimized warping path is determined
using a within-unit recursion and a cross-unit recursion
configured so as to minimize an overall distortion defined by
a weighted function of a unit-cost and a concatenation-cost.
The dynamic programming framework 17 further includes means
to compute, using the warping path, an optimized number of
segments and segment boundaries of the sequence of feature
vectors 15 of the input speech 12, and a sequence of units
selected from the continuous codebook along the warping path.
The dynamic programming framework according to embodiments of
the present invention is discussed in greater detail in the
sections that follow.
The incorporation of concatenation-cost in unit-selection
favors quantizing consecutive segments (or frames) of the
input speech 12 with units having consecutive indices in the
continuous codebook 18, a feature which may be exploited to
achieve reduced bit-rates in speech transmission. To that
end, quantizing means 16 of the described embodiment further
includes a run-length coding module 19 to identify at least
one contiguous group from the sequence of selected units. A
contiguous group may be defined as a partial sequence of
units having consecutive indices in the continuous codebook.
For each contiguous group, the base-index (i.e., the index of
the first unit in the contiguous group) and the number of
units in the contiguous group (also referred to as the
'contiguity' of the group) are transmitted as quantized
speech parameters 21 using an appropriate number of bits over
the channel 22 to the decoding means 24. Additionally, the
total number of transmitted bits may also include an
appropriate number of bits to represent individual segment
lengths (i.e., the total number of frames in each segment) of
input speech 12, as computed by the dynamic programming
module 17. Run-length coding achieves a significant reduction
in the total number of bits 21 transmitted over the
communication channel 22. The total number of bits
transmitted for a given time duration can be computed with
reference to Eqn. (7) discussed below.
The decoding means 24 comprises a segment decoder 26 that
reconstructs segments of speech based upon the received bits
representing quantized speech parameters 21. The quantized
speech parameters are decoded using a continuous codebook 28
identical to the continuous codebook 18. Segments are
reconstructed based upon a time-normalization of the lengths
of the codebook unit segments to the lengths of the input
speech segments. The reconstructed segments 29 of speech
frame vectors are then converted by a speech synthesis means
30, typically a linear prediction (LP) speech synthesizer,
into speech waveform samples and/or an analog output signal
36 that represents a replica of the input speech 12. Details
on the functional aspects of the decoding means 24 are known
to one skilled in the art and have hence been omitted from
the present discussion.
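The time-normalization mentioned above can be sketched as follows. Linear interpolation of the unit's frame vectors is an assumed choice of this sketch; the text only requires that codebook unit lengths be normalized to the input segment lengths:

```python
import numpy as np

def time_normalize(unit, target_len):
    """Stretch or compress a codebook unit (l_n, d) to the decoded
    input segment length, giving a (target_len, d) array."""
    l_n, d = unit.shape
    if l_n == 1:
        return np.repeat(unit, target_len, axis=0)
    src = np.linspace(0.0, 1.0, l_n)
    dst = np.linspace(0.0, 1.0, target_len)
    # interpolate each feature dimension independently
    return np.stack([np.interp(dst, src, unit[:, k]) for k in range(d)],
                    axis=1)

seg = time_normalize(np.array([[0.0, 0.0], [2.0, 4.0]]), 3)
print(seg)  # [[0. 0.] [1. 2.] [2. 4.]]
```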
The proposed technique for quantization may be incorporated
along with residual modeling and quantization to result in a
complete ultra low bit-rate speech coder. Accordingly, in
certain embodiments of the present invention, the coding
means 10 may further include means 32 to quantize residuals
31 obtained from the analysis means 14. Quantized residuals
33 are communicated via the channel 22 to the decoding means
24, wherein a residual decoder 34 reconstructs the residuals.
Reconstructed residuals 35 are then used by the speech
synthesis means 30 to generate the speech waveform samples
and/or the analog output speech signal 36. Residual
quantization is known in the art and further discussion on
the same has hence been omitted herein.
The speech coding means 10 (and likewise, the decoding means
24) may, for example, be incorporated in a microprocessor for
real-time operations in communication with a memory device,
such as a ROM, that contains instructions that enable the
microprocessor to implement the functions in accordance with
embodiments of the present invention.
Referring to FIG 2, a method 37 is illustrated for coding
speech in accordance with an embodiment of the present
invention. The method begins by providing input speech to a
speech analysis means (step 38). The input speech is then
divided into a sequence of speech frames (step 40) and then
analyzed to obtain a sequence of feature vectors of input
speech (step 42). Steps 44 and 45 involve segmentation
and quantization of the feature vectors of the frames of
input speech by a dynamic programming algorithm adapted for
unit-selection directly from a continuous codebook. At step
44, a warping path is determined that provides an optimized
alignment of a sequence of feature vectors corresponding to
frames of input speech with a sequence of feature vectors of
units from the continuous codebook. According to the proposed
method, the step 44 uses a within-unit recursion and a cross-
unit recursion to determine said warping path so as to
minimize an overall distortion defined by a weighted function
of a unit-cost and a concatenation-cost. Step 45 involves
computation of an optimized number of segments and segment
boundaries of the sequence of feature vectors obtained from
input speech, and a sequence of units selected from the
warping path determined at step 44. Next, at step 46, run-
length coding is performed to identify contiguous groups from
the computed sequence of selected units and to determine the
base-index and contiguity therefrom. Subsequently, at step
48, the quantized speech parameters, including the base and
contiguity of the identified contiguous groups along with
individual segment lengths of input speech are transmitted
over a communication channel to a decoding means.
At the decoding means, segments of speech frames are
reconstructed (step 50) based upon the received quantized
speech parameters. Step 52 involves speech synthesis,
typically by linear prediction techniques, to generate an
analog output speech signal, representing a replica of the
input speech.
In certain embodiments, the method 37 may further comprise
quantization of residuals (step 54) obtained from step 42 and
transmitting (step 56) said quantized residuals to the
decoding means, and subsequently decoding the quantized
residuals (step 58) for speech synthesis (step 52).
The proposed dynamic programming framework for unit-selection
from a continuous codebook is implemented by a one-pass
dynamic programming (DP) algorithm. An exemplary embodiment
of the proposed framework is presented below. In describing
the proposed framework and the one-pass DP algorithm, a
'frame' essentially refers to a feature vector corresponding
to said frame, made up of spectral features extracted from
the frame.
A continuous codebook, which is a sequence of MFCC or linear-
prediction (LP) parameter vectors as occurring in continuous
speech, can be viewed as being composed of N units
represented as (u1, u2, ..., uN). The units in the
continuous codebook may comprise single frames or segments of
fixed or variable lengths. The length of a unit un (i.e., the
number of frames) is represented by ln. The codebook is said
to be made of 'fixed length' units if ln = l for all n =
1, ..., N, i.e., each unit has l frames. When l = 1 (one), the
codebook is said to be a 'single-frame' codebook. The
codebook is said to be made of 'variable length' units if ln
varies over n.
The input speech utterance which is to be quantized using the
above continuous codebook may be viewed as a sequence of
frames comprising feature vectors (MFCC or LP parameters)
represented as O = (o1, o2, ..., oT). Let an arbitrary initial
sequence of K segments of the input speech utterance be
represented as S = (s1, s2, ..., sk-1, sk, ..., sK) with
corresponding segment lengths (L1, L2, ..., Lk, ..., LK).
This segmentation is specified by the segment boundaries
B = ((b0 = 0), b1, b2, ..., bk-1, bk, ..., (bK = T)), such
that the kth segment sk is given by sk = (obk-1+1, ..., obk).
Each segment may be considered to be quantized by a label,
which is an index of a unit selected from the continuous
codebook, having a value from 1 to N. This sequence of unit
indices may be represented as Q = (q1, q2, ..., qk-1, qk, ..., qK).
The modified one-pass DP algorithm for unit-selection
decoding proposed herein performs an optimal segment
quantization of the input utterance O by employing
concatenation-costs, which constrain the resultant decoding
by a measure of how favorable the sequence Q is with respect
to ease of run-length coding, in addition to minimizing the
unit-cost representing the distortion incurred in quantizing
an input segment by the corresponding quantizing unit from
the continuous codebook. To that end, the proposed one-pass
DP algorithm provides an optimal solution for K, B and Q
(represented hereinafter as K*, B*, Q*), so as to minimize an
overall distortion (or quantization error) given by Eqn. (1)
and Eqn. (1.1) below:

D(K,B,Q) = Σ (k=1 to K) α·Du(sk, uqk) + Σ (k=2 to K) (1-α)·Cc(uqk-1, uqk)   (1)

(K*, B*, Q*) = arg min over (K,B,Q) of D(K,B,Q)   (1.1)
Herein, the term Du(sk, uqk) represents a unit-cost, which
may be defined as a distortion or error in quantizing segment
sk using the unit un having index qk in the continuous
codebook. Typically, the unit-cost is measured along the
warping path between sk and uqk.
The term Cc(uqk-1, uqk) represents a concatenation-cost (or
distortion) incurred when unit uqk-1 is followed by unit uqk.
In one embodiment, the concatenation-cost may be determined
in accordance with Eqn. (2) below:

Cc(uqk-1, uqk) = βk-1,k · dE(lqk-1, 1qk)   (2)

wherein dE(lqk-1, 1qk) is the Euclidean distance between the
last frame of unit qk-1 and the first frame of unit qk, and
βk-1,k = 0 if qk = qk-1 + 1 and βk-1,k = 1 otherwise.
The terms α and 1-α are respectively defined as relative
weights of the unit-cost and concatenation-cost.
The use of the concatenation-cost in the one-pass DP
algorithm favors quantizing two consecutive segments (sk-1, sk)
with two units which are consecutive in the continuous
codebook. This feature may be further exploited by run-length
coding to achieve lowered bit-rates.
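In sketch form, the concatenation-cost behavior described in Eqn. (2) may be written as follows (the function name and list-based codebook are illustrative):

```python
import numpy as np

def concatenation_cost(codebook, q_prev, q_cur):
    """Zero cost when the two selected unit indices are consecutive in
    the continuous codebook; otherwise the Euclidean distance between
    the last frame of the previous unit and the first frame of the
    current unit."""
    if q_cur == q_prev + 1:
        return 0.0
    return float(np.linalg.norm(codebook[q_prev][-1] - codebook[q_cur][0]))

# consecutive units concatenate for free, favoring run-length coding
cb = [np.zeros((2, 3)), np.ones((2, 3)), 2 * np.ones((2, 3))]
print(concatenation_cost(cb, 0, 1))  # 0.0
```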
Based on the above-described framework, the one-pass DP based
unit selection algorithm determines a warping path that
provides an optimized alignment of the sequence of frames of
input speech with a sequence of frames of units from the
continuous codebook.
An exemplary warping path is illustrated with reference to
FIG 3, wherein the x-axis 60 represents the sequence of
frames i of input speech (i = 1, 2, 3, ..., T), T
representing the last frame of input speech, and the y-axis
62 represents units from the continuous codebook sequentially
arranged based upon their indices in the continuous codebook
(represented as n). Each unit un is made up of frames j
(j = 1, 2, ..., ln), ln representing the last frame of the unit un.
In the illustrated embodiment, the units are shown to have
variable lengths. However, it should be appreciated that the
method described below can also be used for units having
fixed lengths, as well as for 'single-frame' units.
The frames i of the input speech utterance and the frames j
of each unit of index n of the continuous codebook define a
set of grid points 64 (i,j,n). Each grid point 64 is
associated with a local distance measure d(i,j,n) defining a
'dissimilarity' between a frame i of input speech and frame j
of the unit with index n. The proposed one-pass DP algorithm
determines a path 66 through the set of grid points (i,j,n)
that provides the best match between the sequence of input
speech frames and a sequence of units from the continuous
codebook. The path 66 is referred to as the warping path.
The criterion for the matching procedure is to minimize the
overall accumulated distortion based upon the unit-cost,
which is a function of the sum of the local distances along a
given path, and the concatenation-cost. In the embodiment
illustrated in FIG 3, the unit-cost is a function of the
local distances of all the grid points (i,j,n) along the
warping path 66 between a segment of input speech and a
selected unit from the continuous codebook.
In addition to minimizing the overall distortion, the one-
pass DP algorithm requires the warping path to obey certain
continuity constraints that are implied by the physical
nature of the sequences to be matched and to facilitate
concatenation of units for the benefit of deriving an
advantage by later run-length coding. These constraints apply
to consecutive points of the path. The constraints result
from the requirement of the preservation of time order along
the time axes. Further, endpoint constraints may have to be
taken into consideration, requiring that the warping path 66
begins at a first frame of any unit, and ends at the last
frame of any unit. These constraints determine the possible
preceding grid points for a given grid point (i,j,n) and are
hence also referred to as transition rules. These transition
rules are distinguished into transition rules that apply
within a unit (called within-unit transition rules) and
transition rules that apply at the unit boundaries (called
cross-unit transition rules). The cross-unit transition rules
incorporate concatenation costs, unlike the conventional
one-pass DP algorithm, which does not impose any such
constraints. This allows maximally contiguous decoding to be
derived, while also achieving optimal overall distortion in a
constrained optimization manner.
FIG 4 illustrates an exemplary within-unit transition rule
according to an embodiment of the present invention. Herein
the x-axis 68 represents the sequence of input speech frames
i (1, 2, 3, ..., T), T representing the last input speech
frame, and the y-axis 70 represents frames j of a unit having
an index n in the continuous codebook. According to the
proposed within-unit transition rule, a grid point 72 (i,j,n)
can be reached only from three possible preceding grid points
based upon the local continuity constraints employed. As
shown in FIG 4, these possible preceding grid points are grid
point 74 (i-1,j,n), grid point 76 (i-1,j-1,n) and grid point
78 (i-1,j-2,n). The above-described within-unit transition
rule employs a specific local continuity constraint which
defines the set of candidate frames of the unit n (at input
frame i-1) that can reach the current grid point (i,j,n).
In principle, it is possible to define any other type of
local continuity constraint, each with its own specific
properties with respect to the degree of warping allowed and
in turn on the degree of matching between the input frame
sequence and the unit frame sequence.
FIG 5 illustrates an exemplary cross-unit transition rule
according to an embodiment of the present invention. Herein
the x-axis 80 represents the sequence of input speech frames
i (1, 2, 3, ..., T), T representing the last input speech
frame. The axis 82 represents frames j (1, 2, ..., ln) of a
unit having an index n in the continuous codebook, while the
axis 84 represents frames j (1, 2, ..., lr) of any unit
having an index r in the continuous codebook, r = 1, 2, ..., N,
N being the size of the codebook. The grid point 86 (i,1,n)
corresponds to the first frame of the unit having index n,
and can hence be reached either from the last frame of any
unit having an index r (including n itself) or from the first
frame of unit n. According to the proposed cross-unit
transition rule, the grid point 86 (i,1,n) can be reached
only from two types of preceding grid points. As shown in
FIG 5, these possible preceding grid points are grid point 88
(i-1,1,n) and grid point 76 (i-1,lr,r), r = 1, ..., N, lr
being the last frame of any unit r in the continuous codebook.
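The two transition rules together define, for every grid point, its admissible predecessors. They can be sketched as a small Python helper, assuming 0-based frame indices (the description uses 1-based indices; the function and argument names are illustrative):

```python
def predecessors(i, j, n, unit_lengths):
    """Enumerate the grid points (i', j', n') that may precede (i, j, n).

    Within-unit rule (j > 0): (i-1, j, n), (i-1, j-1, n), (i-1, j-2, n).
    Cross-unit rule (j == 0): stay on the first frame of unit n, or enter
    from the last frame of any unit r in the codebook.
    unit_lengths[r] is the number of frames of unit r (0-based frames here).
    """
    if j > 0:  # within-unit transitions
        return [(i - 1, jp, n) for jp in (j, j - 1, j - 2) if jp >= 0]
    preds = [(i - 1, 0, n)]  # remain on the first frame of unit n
    preds += [(i - 1, unit_lengths[r] - 1, r) for r in range(len(unit_lengths))]
    return preds
```

Only the cross-unit transitions from the last frame of a unit r carry a concatenation cost; the within-unit transitions contribute local distances alone.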
Taking into consideration the above-described transition
rules, the warping path 66 is computed recursively using
within-unit recursions and cross-unit recursions. These
recursions are applied over all frames of all the units in
the continuous codebook for every frame i of the input
utterance. The proposed within-unit and cross-unit
recursions are adapted to minimize an accumulated distortion
D(i,j,n) along any path to the grid point (i,j,n), which is
in turn a function of the accumulated distortion along the
best path to its predecessor grid points (based upon the
above-described transition rules), the local distance
associated with the grid point (i,j,n) itself and the
concatenation costs, applicable in cross-unit recursions.
Within-unit recursion is carried out over all frames of a
unit, except the first frame of the unit, i.e., for j ≠ 1. The
cross-unit recursion is applied only for the starting frames
of all units, i.e., for j = 1, to account for a potential
entry into unit n from the last frame lr of any of the other
units r = 1, . . . , N in the continuous codebook.
In the illustrated embodiment, the within-unit recursion is
carried out based upon a relationship given by Eqn. (3)
below:

D(i,j,n) = α·d(i,j,n) + min { D(i-1,j,n), D(i-1,j-1,n), D(i-1,j-2,n) }   (3)

and the cross-unit recursion is carried out based upon a
relationship given by Eqn. (4) below:

D(i,1,n) = α·d(i,1,n) + min { a, min over r = 1, ..., N of [ D(i-1,lr,r) + (1-α)·Dc(r,n) ] }   (4)

wherein,
min denotes a 'minimum' operation,
D(i,j,n) is defined as the minimum accumulated distortion of
any path reaching the grid point defined by frame 'i' of input
speech and frame 'j' of unit 'n' in the continuous codebook,
d(i,j,n) is defined as the local distance between frame 'i'
of the input speech and frame 'j' of unit n,
a = D(i-1,1,n),
N denotes the total number of units in the continuous codebook,
lr denotes the last frame of unit 'r' in the continuous
codebook, and
α and 1-α respectively weigh the unit-cost and the
concatenation-cost, thereby realizing Eqn. (1). Dc(r,n) is as
defined in Eqn. (1.1)
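The recursions and the trace-back can be sketched end-to-end in Python. This is a sketch under assumptions: a squared-Euclidean local distance, the concatenation-cost Dc(r,n) supplied as a function, and the two backpointer arrays Ψ1 and Ψ2 folded into one array of (frame, unit) pairs; all names are illustrative, not from the source.

```python
import numpy as np

def one_pass_dp(x, units, concat_cost, alpha=0.5):
    """Constrained one-pass DP unit selection over a continuous codebook.

    x           : (T, D) input feature vectors.
    units       : list of (l_n, D) arrays, the codebook units.
    concat_cost : function Dc(r, n), cost of entering unit n after unit r.
    alpha       : weight on the unit-cost; 1 - alpha weighs the
                  concatenation-cost, as in the recursions above.
    Returns (D_star, path), path being one (frame, unit) pair per input frame.
    """
    T, N = len(x), len(units)
    INF = float("inf")
    D = [np.full((T, len(u)), INF) for u in units]          # accumulated distortions
    psi = [np.full((T, len(u), 2), -1, dtype=int) for u in units]  # backpointers

    # Endpoint constraint: every path begins at the first frame of some unit.
    for n, u in enumerate(units):
        D[n][0, 0] = alpha * np.sum((x[0] - u[0]) ** 2)

    for i in range(1, T):
        # Accumulated distortion of ending any unit r at input frame i-1.
        ends = [(D[r][i - 1, len(units[r]) - 1], r) for r in range(N)]
        for n, u in enumerate(units):
            for j in range(len(u)):
                d_loc = alpha * np.sum((x[i] - u[j]) ** 2)
                if j == 0:
                    # Cross-unit recursion: stay on frame 1 of unit n, or
                    # enter from the last frame of any unit r with Dc(r, n).
                    best, arg = D[n][i - 1, 0], (0, n)
                    for cost_r, r in ends:
                        c = cost_r + (1 - alpha) * concat_cost(r, n)
                        if c < best:
                            best, arg = c, (len(units[r]) - 1, r)
                else:
                    # Within-unit recursion: predecessors j, j-1, j-2.
                    best, arg = INF, (-1, -1)
                    for jp in (j, j - 1, j - 2):
                        if jp >= 0 and D[n][i - 1, jp] < best:
                            best, arg = D[n][i - 1, jp], (jp, n)
                if best < INF:
                    D[n][i, j] = best + d_loc
                    psi[n][i, j] = arg

    # Endpoint constraint: the path must end at the last frame of a unit n*.
    n_star = min(range(N), key=lambda n: D[n][T - 1, len(units[n]) - 1])
    j, n = len(units[n_star]) - 1, n_star
    path = [(j, n)]
    for i in range(T - 1, 0, -1):                # trace the backpointers
        prev = psi[n][i, j]
        j, n = int(prev[0]), int(prev[1])
        path.append((j, n))
    path.reverse()
    return D[n_star][T - 1, len(units[n_star]) - 1], path
```

Only one column of each accumulated-distortion array is strictly needed per input frame, which is the storage advantage of time-synchronous decoding noted below; the sketch keeps full arrays for clarity.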
The use of the weighting parameter α facilitates controlling
the relative importance of the unit-cost and concatenation-
cost in determining the optimal warping path 66. The proposed
embodiment retains the time-synchronous decoding property of
the one-pass DP algorithm and is hence particularly
advantageous in reducing storage requirements, because to
perform the recursions, only a small portion of the array of
accumulated distortions D(i,j,n) is needed, namely the
accumulated distortions of the preceding points as determined
by the transition rules.
Eqns. (3) and (4) are the recurrence relationships of the
dynamic programming according to embodiments of the present
invention. Using these recurrence relations, the accumulated
distortion D(i,j,n) can be recursively evaluated point by
point. The warping path may be determined by tracing back the
best path using additional arrays Ψ1(i,j,n) and Ψ2(i,j,n) for
storing two backpointers at each grid point (i,j,n), which
are, respectively, i) the best unit index leading to (i,j,n)
in either of the two recursions and ii) the best frame at
(i-1) leading to (i,j,n) in either of the two recursions. The
trace-back starts from a grid point (T, ln*, n*) having the
least value in the array D(i,j,n) of minimum accumulated
distortions, wherein T is the last frame of said input speech
and ln* is the last frame of a unit n*.
The final optimal distortion hence is given by Eqn. (5)
below:

D* = min over n = 1, ..., N of D(T, ln, n)   (5)

n* and ln* are defined by Eqn. (6) below:

n* = argmin over n = 1, ..., N of D(T, ln, n)   (6)

wherein ln* is the last frame of the unit n*.
Once the warping path is determined, the optimal number of
segments K*, the segment boundaries B* and the sequence of
indices of the selected units Q* (corresponding to this
optimal D* in Eqn. (1)) are retrieved by backtracking as in the
conventional one-pass DP algorithm. This may be achieved by
tracing back the decisions taken by the 'minimum' operator at
each grid point. In the example illustrated in FIG 3, the
computed segments of the input speech sequence are shown by
the reference numerals 92, 94, 96, 98, 100 and 102. As can be
understood from the FIG 3, the corresponding sequence of unit
indices is Q*= (3,2,3,4,1,5).
As will be appreciated, a 'single frame' continuous codebook
is a special case of the above one-pass DP algorithm when the
units in the continuous codebook are of fixed length equal to
one. For variable length units, the above algorithm performs
a decoding of the input utterance directly using the units of
the unit codebook, instead of using a two-stage procedure
employing an intermediate segmentation (and labeling) using a
pre-constructed clustered VLSQ codebook, and thereby has the
advantage of not incurring any of the sub-optimalities
thereof. Thus, the above algorithm handles fixed-length
segments of any size as well as variable length segments in a
unified and optimal manner.
As discussed above, the incorporation of concatenation-cost
in unit-selection favors quantizing consecutive segments (or
frames) of the input speech with units which have
consecutive indices in the continuous codebook, a feature
which may be exploited by run-length coding to achieve
reduced bit-rates in speech transmission. Run length coding
refers to the following coding scheme applied on the sequence
Q* of indices of selected units obtained by the above one-
pass DP algorithm that solves Eqn. (1.1). Let a partial
sequence of unit-indices in Q* be (qk, qk+1, ..., qk+m-1),
which are such that qk+j+1 = qk+j + 1 for j = 0, ..., m-2.
This partial sequence is
referred to as a 'contiguous group' with a 'contiguity' of m,
i.e., a group of m segments whose indices are consecutive in
the unit codebook. Run-length coding exploits this contiguity
in coding the above contiguous group by transmitting the
index of the first unit q1 (henceforth referred to as the
base-index), followed by the contiguity value m-1 (quantized
using an appropriate number of bits). At the decoding means,
this indicates that q1 is to be followed by its m-1
successive units in the continuous codebook, which the
decoder retrieves for reconstruction. All the m segment
lengths lk+j, j = 0, ..., m-1 are quantized and transmitted as
in a normal segment vocoder. In the example illustrated in FIG 3,
wherein the optimized unit sequence Q* = (3,2,3,4,1,5), a
partial sequence (2,3,4) may be identified as a contiguous
group having a base index 2 and a contiguity 3.
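The grouping step above can be sketched as follows; the function name is illustrative, and the contiguity is returned here as the group size m (the scheme transmits m-1):

```python
def run_length_code(q):
    """Group a unit-index sequence Q* into (base-index, contiguity) pairs.

    A run of m indices that are consecutive in the continuous codebook is
    coded as its base-index plus the contiguity value m - 1; the m segment
    lengths are quantized separately, as in a normal segment vocoder.
    """
    groups = []
    k = 0
    while k < len(q):
        m = 1
        while k + m < len(q) and q[k + m] == q[k + m - 1] + 1:
            m += 1
        groups.append((q[k], m))  # (base-index, contiguity m)
        k += m
    return groups
```

For the FIG 3 example Q* = (3,2,3,4,1,5), the sketch yields the groups (3,1), (2,3), (1,1) and (5,1), the second being the contiguous group with base-index 2 and contiguity 3.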
Use of an appropriate concatenation-cost favors the optimal
unit sequence to have several 'contiguous' groups with high
"contiguity" thereby aiding run-length coding and decreasing
the bit-rate effectively. The unit-cost represents the
spectral distortion and the concatenation-cost (indirectly)
the bit-rate; a trade-off between the two costs allows for
obtaining different rate-distortion points for the above
algorithm. This is achieved by the factor α (which takes
values from 0 to 1).
The effective bit-rate with the run-length coding depends
entirely on the specific contiguity pattern for a given data
being quantized. For a given input utterance O = (o1, o2, ...,
oT), let Q* = (q1*, q2*, ..., qk*, ..., qK*) be the
optimal unit-indices obtained by the one-pass DP algorithm as
above. Let there be P 'contiguous groups' in this K-segment
unit-index sequence, given by g1, g2, ..., gp, ..., gP,
where the group gp has a 'contiguity' cp, i.e., cp
units whose indices are contiguous in the unit codebook. Then
the total number of bits B for quantization of the input
utterance O with run-length coding is given by Eqn. (7)
below:

B = P·(log2 N + log2 Cmax) + K*·log2 Lmax   (7)
wherein,
P is the total number of contiguous groups during the time
duration T,
N is the size of the continuous codebook,
Cmax is a maximum allowed contiguity,
K* is the total number of segments of input speech during the
time duration T, and
Lmax is a maximum allowed length of a segment k of input
speech obtained after decoding of the feature vectors of
input speech.
In Eqn. (7), the first term represents the total number of
bits for the base-indices of the P contiguous groups, each
being quantized to the address of the size-N continuous
codebook. The expression log2 Cmax denotes the number of bits
to quantize the contiguity of a group, with a maximum allowed
contiguity of Cmax. The expression log2 Lmax denotes the
number of bits to quantize an input segment with a maximum
allowed length of Lmax frames.
The effective bit-rate in bits/second is obtained by dividing
the total number of bits B by the duration of the speech
utterance T*f (for an input of T frames with a frame-size of
f milliseconds).
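Eqn. (7) and the bits-per-second conversion can be sketched as below, assuming each quantity is rounded up to a whole number of bits (the rounding is an assumption; the function and parameter names are illustrative):

```python
from math import ceil, log2

def total_bits(P, K, N, c_max, l_max):
    """Total bits B for run-length coded quantization, per Eqn. (7):
    each of the P contiguous groups costs log2(N) bits for its base-index
    plus log2(c_max) bits for its contiguity, and each of the K* segments
    costs log2(l_max) bits for its length."""
    return P * (ceil(log2(N)) + ceil(log2(c_max))) + K * ceil(log2(l_max))

def bit_rate(B, T, frame_ms):
    """Effective bit-rate: B divided by the utterance duration T*f seconds,
    for T input frames of frame-size f milliseconds."""
    return B / (T * frame_ms / 1000.0)
```

For example, 4 groups and 6 segments with N = 4096, Cmax = 8 and Lmax = 32 give B = 4·(12 + 3) + 6·5 = 90 bits; over 60 frames of 10 ms this is an effective rate of 150 bits/s.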
Thus, the present invention uses the principle of unit
selection as used in concatenative speech synthesis while
providing a unified framework for segmentation and
quantization of speech at very low bit rates based on a
dynamic programming (DP) algorithm. To make use of the
principle of unit selection, a continuous codebook is
employed. The use of the continuous codebook along with the
use of concatenation-costs in the DP algorithm favors
quantizing of consecutive segments of input speech with units
having consecutive indices in the continuous codebook.
Advantageously, the unified framework according to the
present invention is applicable for optimal quantization
using both variable-length as well as fixed-length codebooks
including single-frame codebook.
FIG 6 and FIG 7 illustrate the results of the proposed unit-
selection based segment quantization algorithm with respect
to its quantization accuracy in terms of rate-distortion
curves showing spectral distortion versus the effective bit-
rate with run-length coding.
The segment quantization performance is measured in terms of
the average spectral distortion between the original sequence
of linear-prediction vectors and the sequence obtained after
segment quantization and length renormalization. The average
spectral distortion is the average of the single frame
spectral distortion over the number of frames in the input
speech; the single frame spectral distortion is the squared
difference between the log of the linear-prediction power
spectra of the original frame and the quantized frame,
averaged over frequency. The bit-rate for segment
quantization is measured as given in Eqn. (7) using the run-
length coding. The TIMIT (Texas Instruments and Massachusetts
Institute of Technology) database has been used for all the
experiments. A value of α = 0.5 has been used in Eqns. (1),
(3), (4) and (6) in all the experiments, giving equal weight
to both the unit-cost and the concatenation-cost.
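The distortion measure described above can be sketched as follows. Taking the square root of the frequency-averaged squared log-spectral difference, to obtain a figure in dB, is a conventional choice assumed here, and the gain of the LP filter is ignored for simplicity; all names are illustrative.

```python
import numpy as np

def lpc_power_spectrum(a, n_fft=512):
    """Power spectrum 1/|A(e^jw)|^2 of an LP synthesis filter with
    coefficients a = [1, a1, ..., ap] (filter gain omitted)."""
    A = np.fft.rfft(a, n_fft)
    return 1.0 / np.abs(A) ** 2

def spectral_distortion(a_orig, a_quant, n_fft=512):
    """Average spectral distortion (dB) between two sequences of LP
    coefficient vectors: per frame, the squared difference of the log
    power spectra averaged over frequency (with a square root for a dB
    figure), then averaged over the frames."""
    sds = []
    for a1, a2 in zip(a_orig, a_quant):
        diff = 10 * np.log10(lpc_power_spectrum(a1, n_fft) /
                             lpc_power_spectrum(a2, n_fft))
        sds.append(np.sqrt(np.mean(diff ** 2)))
    return float(np.mean(sds))
```

Identical coefficient sequences give a distortion of zero, and the measure grows as the quantized spectra deviate from the originals.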
In FIG 6, the rate-distortion performance of the unit-
selection algorithm is shown for two basic kinds of unit
sizes: i) fixed length units with lengths ranging from 1 to 8
and ii) variable-length phonetic units. Herein, the x-axis
104 represents bit-rate (bits/sec) and the y-axis 106
represents spectral distortion (dB). The curves 108, 110,
112, 114 and 116 represent rate-distortion performances for
fixed length segments of lengths 1,2,4,6, and 8 respectively.
The curve 118 represents the rate-distortion performance for
variable length phonetic segments.
In all cases, the codebook is a continuous sequence of
linear-prediction vectors (log-area ratios) of continuous
speech utterances in the TIMIT database, but treated as being
made of fixed length units or variable sized units. Since
TIMIT is phonetically segmented, this phonetic segmentation
is used to define the variable-length units. This represents
the best performance achievable for variable length units,
such as when the automatic segmentation used to obtain the
units is as good as manual segmentation that defines phonetic
segments. In both cases, codebooks of sizes 32 to 4096 have
been used, which are essentially the first 32 (or 4096)
vectors of the TIMIT sentences ordered with male and female
sentences interleaved. The number of sentences used to form
these codebooks range from 1 to 128 sentences corresponding
to multi-speaker codebook with sentences from up to 14
speakers. The test data used was 8 sentences with 4 male and
4 female speakers from outside the speakers used in the
codebook.
From FIG 6, it can be observed that the effective bit-rate
reduces significantly (nearly halves, such as from 200 bits/s
to 100 bits/s), with increase in the fixed length unit-size
from 1 to 4 to 8 frames. This is largely due to the fact that
with a larger unit, the segment rate (number of segments per
second) is reduced, and even without run-length coding, the
number of bits used for base-index quantization would
decrease proportionately. In addition, the use of run-length
coding further reduces the effective bit-rate; the contiguity
of larger length units implies that more frames are quantized
with the same run-length base indices resulting in improved
run-length advantage for longer fixed length units. Fixed
length units of length 6 and 8 also provide a performance
comparable to that of variable length phonetic units. This
circumvents the need for defining variable length units in a
continuous codebook by automatically segmenting it or by
other means. It would be sufficient to simply use a large
continuous speech data and define fixed length units of
lengths comparable to phonetic units.
It can be noted that the variable length phonetic units
perform the best, offering a halving of the bit-rate from
300 bits/s (for single-frame units) to 150 bits/s for the
same distortion, clearly validating the potential of the
unit-selection algorithm to gain rate-distortion with larger
unit sizes that approximate phone-like units. However, fixed
length units of length 6 and 8 (shown up to codebook sizes
4096 and 2048) also provide a performance comparable to that
of variable length phonetic units, again circumventing, as
discussed above, the need to pre-define variable length units
in the continuous codebook.
The effect of increasing the fixed length on the run-length
based bit-rate advantage is brought out clearly from the
distribution of contiguity in FIG 7 which plots the number of
times a contiguity group of contiguity 'm' occurs. Herein the
number of segments that are contiguous with contiguity (L)
are represented along y-axis 122, while the length (L) of the
contiguous units are represented along the x-axis 120. Curves
124, 126 and 128 respectively represent contiguity
distribution for fixed length segments of lengths 1, 2 and 4.
The curve 130 represents contiguity distribution for variable
length phonetic segments.
As can be expected, the contiguity is high even for units of
length one. With increase in the unit lengths from 1 to 2, 4,
6, and 8, and finally to the variable length phonetic units,
the largest contiguity tends to come down, since each unit
now already spans multiple frames. However, the effective
number of frames grouped by a contiguity has increased
considerably even with limited contiguity for longer units.
For instance, for unit lengths of 4, a contiguity of 4
performs an effective run-length coding over 16 frames in
comparison to the maximum of 9 frames of single-frame units.
Referring back to FIG 6, it is important to note that for
longer fixed length units and variable length units, there is
a sharper decrease in spectral distortion (SD) for a given
bit-rate increase, in comparison to single-frame units. This
steep trend in the rate-distortion curves of the proposed
unit-selection algorithm even with fixed-length units
indicates that large reductions in spectral distortion can be
achieved by using suitably large codebook sizes. This is
particularly appealing since the codebook need not be
'designed' by any complex algorithms and nor does it have to
be segmented (phonetically or otherwise) prior to use.
Solutions to make the one-pass DP unit-selection algorithm
perform with low computational complexities at very large
codebook sizes will enable it to achieve close to 1-2 dB
average spectral distortions, which make this ultra low bit
rate quantization as good as higher rate spectral quantizers.
FIG 8 shows the rate-distortion performance of the proposed
unit-selection algorithm and the algorithm set forth in the
article K. S. Lee and R. V. Cox, "A segmental speech coder
based on a concatenative TTS", Speech Commun., 38:89-100,
2002, hereinafter simply referred to as
Lee-Cox'02. Herein, bit-rate (bits/sec) is represented along
the x-axis 130 while the spectral distortion (dB) is
represented along the y-axis 132. The curves 134 and 138
respectively represent rate-distortion performances of the
proposed algorithm and the Lee-Cox'02 algorithm.
For both the algorithms, the same continuous speech codebook
as the unit database is used, which is a continuous sequence
of linear-prediction vectors (log-area ratios) of continuous
speech utterances in the TIMIT database, treated as being
made of variable sized units as defined by the manually
defined phonetic units. Since TIMIT is phonetically
segmented, this phonetic segmentation has been used to define
the variable-length units for both the algorithms. This
represents the best performance achievable for variable
length units, and can be expected to provide an optimal
baseline performance to the case when automatic segmentation
is used to obtain the units such as using a clustered
codebook (by the variable-length segment quantization (VLSQ)
technique) as used in Lee-Cox'02.
Herein, unit databases of sizes ranging from 512 to 65536,
corresponding to 9 to 17 bits per unit-index, have been used. These
are essentially the first 65536 phonetic segments of the
TIMIT sentences ordered with male and female sentences
interleaved, from about 200 sentences from 20 speakers
constituting nearly 2 hours of continuous speech. The number
alongside each point in the curves is the codebook size (in
bits/unit). In the case of the proposed algorithm, a database
size up to 8192 has been used, as this achieves the same
spectral distortions as the Lee-Cox'02 algorithm with a
database size of 65536 and was hence adequate to bring out
the performance advantage achievable (at significantly lower
bit-rates), due to optimality of the proposed algorithm. The
test data used was 10 sentences with 5 male and 5 female
speakers from outside the speakers used in the codebook.
From FIG 8, some important differences between the
proposed optimal unified unit-selection algorithm and the
sub-optimal algorithm of Lee-Cox'02 can be noted. In general,
the rate-distortion curve of the proposed algorithm has the
ideal shift towards the left-bottom, with significant
distortion and rate margins over the rate-distortion curve of
Lee-Cox'02. This is as would be expected for an enhanced
quantization scheme with both rate and distortion advantages.
Specifically, it can be seen that the proposed algorithm has
significantly lower distortions for a given database size
(given in bits alongside) and final effective bit-rate. For
instance, the spectral distortion of the proposed algorithm
is about 1.5 dB less than that of Lee-Cox'02 algorithm for a
database size of size 512 (9 bits), with a corresponding
effective bit-rate that is 75 bits/sec less. For a size of 13
bits, the proposed algorithm is able to provide much lower
spectral distortion (as much as 3 dB less) than the Lee-Cox'02
algorithm at the same effective bit-rate. It should be
noted that this 3 dB difference is highly significant for the
ultra low-rate ranges being dealt with here. It can be
further noted that the 13 bit database with the proposed
algorithm gives about 1.5 dB performance improvement over
that of Lee-Cox'02 and at a much lower bit-rate. Further, it
can be noted that the Lee-Cox'02 algorithm needs extremely
large database sizes (of the order of 17 bits which is 65536
segmental units or approximately 2 hours of continuous
speech), to achieve distortions comparable to that achievable
by the proposed algorithm with a much smaller database size
13 bits (8192 segmental units, or about 13 minutes of
continuous speech, which is nearly 8 times less than that
needed by Lee-Cox'02 algorithm).
FIG 8 also shows curves 136 and 140 respectively indicating
rate-distortion characteristics of the proposed algorithm and
the Lee-Cox'02 algorithm when the concatenation cost is not
used; i.e., neither the Viterbi decoding in Lee-Cox'02 nor
the proposed one-pass DP herein uses the second term in
Eqn. (1) above. In this case,
the two algorithms have better (i.e., lower) spectral
distortions, since not using the concatenation constraint
leads to more optimal decoding with respect to the unit-cost
of term 1 in this equation. It can be observed that the
proposed algorithm has significantly lower spectral
distortions than Lee-Cox'02 for a given unit database size.
This clearly brings out the gain in optimality
resulting from quantizing the input utterance 'directly'
using 'all' the units of the database, unlike the two-step
procedure of Lee-Cox'02, which uses an intermediate
quantization and a subsequent unit-selection using a highly
reduced choice of units for each segment of the pre-segmented
input utterance.
Further, it can be noted that the proposed algorithm gains
significantly in achieving much lower effective bit-rates,
once the concatenation-constraints are restored, again
indicating another performance advantage of the optimality of
the proposed algorithm. The proposed algorithm is able to
produce more contiguous decoding, which in turn reduces the
effective bit-rate with run-length coding. Again, this is due
to the fact that the decoding is done with the entire unit-
database, in comparison to the highly reduced unit choices
available in Lee-Cox'02 algorithm due to the pre-quantization
using the clustered intermediate codebook.
While phonetically defined variable length units have been
used as available in the TIMIT database, the above algorithms
should in principle be used with a unit database defined
automatically, i.e. with units defined by automatic methods
such as the variable length segment quantization method.
However, in an interesting result shown in FIG 6, it turns
out that it is possible to completely avoid such expensive
segmentation and labeling (either manually or by automatic
methods), by using fixed-length units of sufficient lengths
(comparable to average phonetic units) such as 6 to 8 frames
and still get rate-distortion performances comparable to what
is possible with variable-length units. This leads to the
conclusion that the algorithm proposed herein is able to
firstly overcome the sub-optimalities of the Lee-Cox'02
algorithm with a consequent improved rate-distortion
performance and in addition, completely circumvent the need
to have pre-defined variable-length units, as was obtained by
using clustered codebooks in the Lee-Cox'02 algorithm.
The aforementioned embodiments of the present invention
provide several important advantages. The proposed unified
framework allows segmenting and quantizing the input speech
using a constrained one-pass dynamic programming algorithm
for performing unit-selection on continuous codebooks. This
is an optimal and unified algorithm for unit-selection based
segment quantization for continuous codebooks which can be of
two types, i.e. codebooks having segments of fixed lengths
including the case of the fixed length of one frame and
codebooks having segments of variable lengths. The proposed
framework is optimal for both the above cases and does not
suffer from the sub-optimalities incurred in the existing
techniques as discussed earlier.
The proposed framework is advantageous over the existing
techniques discussed above in a number of ways. First, the
proposed framework is based on a single algorithm, which is a
generalization of both the single-frame system and segmental
system. This allows evaluation of the unit-selection based
system for fixed unit sizes greater than one directly from a
single codebook, which was not attempted in the past. It has
been shown that long fixed sized units of 6-8 frames offer
significantly improved performance over 'single-frame' units.
Further, the algorithm proposed, with fixed size units of 6-8
frames that approximate phone-like segments, offer
performance comparable to variable sized units (such as
phonetic units), thereby completely obviating the need to
segment and label the continuous speech database manually or
automatically using phonetic or clustered codebooks. Also,
unlike the existing techniques, the algorithm proposed
provides an optimal segment quantization of the input speech
directly with respect to the actual units of the continuous
codebook. By this, significantly superior performance can be
achieved in comparison to existing techniques discussed
above, as the proposed algorithm has considerably lower
spectral distortions (up to 3 dB lower distortions) for a
given bit-rate as well as much lower bit-rates for a given
distortion than existing techniques.
The present invention is particularly useful in speech
transmission applications where there is no more bandwidth
available than 400-500 bits/second such as underwater
communication or captive RF channels for highly secure
communications. Further, the proposed method allows storage
of large amounts of speech data (manuals, instructions) where
it is important to preserve intelligibility and speaker-
identity, encoding long speech surveillance data or the
speech/audio part of long duration audio and video
surveillance data. Further, in applications involving secure
voice communication, a relatively large bandwidth can be
provided for encryption while coding speech using the
proposed ultra low bit rate coding method.
The proposed framework and algorithm can be applied to any of
the linear prediction parameters or the mel-frequency
cepstral coefficients (MFCCs). In particular, line spectral
frequencies (LSFs) are known to be efficient for spectral
quantization and the system might be extended to use LSFs.
The proposed method for quantization may be incorporated
along with residual modeling and quantization to result in a
complete ultra low bit-rate speech coder.
Summarizing, the present invention relates to a system and
method for coding speech. In order to improve coding of
speech at very low bit rates, the proposed system comprises
means for dividing input speech into a sequence of speech
frames and analyzing said sequence of speech frames to obtain
a sequence of feature vectors therefrom, and means for quantizing
said sequence of feature vectors obtained from the input
speech. The means for quantizing includes a dynamic
programming framework adapted for unit-selection directly
from a continuous codebook. The continuous codebook comprises
sequentially indexed units made up of feature vectors
corresponding to frames of continuous speech. The dynamic
programming framework includes means for determining a
warping path for an optimized alignment of said sequence of
feature vectors obtained from input speech with a sequence of
feature vectors of units from the continuous codebook using a
within-unit recursion and a cross-unit recursion, said
within-unit recursion and cross-unit recursion minimizing an
overall distortion defined by a weighted function of a unit-
cost and a concatenation-cost. The dynamic programming
framework also includes means for computing, from said
warping path, an optimized number of segments and segment
boundaries of the sequence of feature vectors obtained from
input speech, and a corresponding sequence of units selected
from the continuous codebook along the warping path.
Although the invention has been described with reference to
specific embodiments, this description is not meant to be
construed in a limiting sense. Various modifications of the
disclosed embodiments, as well as alternate embodiments of
the invention, will become apparent to persons skilled in the
art upon reference to the description of the invention. It is
therefore contemplated that such modifications can be made
without departing from the spirit or scope of the present
invention as defined.
Patent Claims
1. A system for coding speech, comprising:
- means (14) for dividing input speech (12) into a
sequence of speech frames and analyzing said
sequence of speech frames to obtain a sequence of
feature vectors (15) corresponding to said sequence
of speech frames, and
- means (16) for quantizing said sequence of feature
vectors (15) obtained from the input speech (12),
including a dynamic programming framework (17)
adapted for unit-selection directly from a
continuous codebook (18), said continuous codebook
(18) comprising sequentially indexed units made up
of feature vectors corresponding to frames of
continuous speech, wherein said dynamic programming
framework (17) further comprises:
- means for determining a warping path (66) for an
optimized alignment of said sequence of feature
vectors (15) obtained from input speech (12) with
a sequence of feature vectors of units from the
continuous codebook (18) using a within-unit
recursion and a cross-unit recursion, said
within-unit recursion and cross-unit recursion
minimizing an overall distortion defined by a
weighted function of a unit-cost and a
concatenation-cost, and
- means for computing, from said warping path (66),
an optimized number of segments and segment
boundaries of the sequence of feature vectors
(15) obtained from input speech (12), and a
corresponding sequence of units selected from the
continuous codebook (18) along the warping path
(66),
wherein said unit-cost is defined as a distortion in
quantizing a segment of sequence of feature vectors
(15) of input speech (12) with a unit selected from
the continuous codebook (18) and said concatenation-
cost is defined as a distortion between two units of
the continuous codebook (18) that are selected to
quantize two consecutive segments of the sequence of
feature vectors (15) of input speech (12).
2. The system according to claim 1, further comprising a
run-length coding module (19) for identifying, from
said sequence of selected units, at least one
contiguous group comprising a partial sequence of
units having consecutive indices in the continuous
codebook (18), and determining therefrom, a base-index
and contiguity of said contiguous group, wherein base
index is defined as the index of the first unit in a
contiguous group and contiguity is defined by the
total number of units in a contiguous group.
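The run-length coding of claim 2 can be sketched as follows (an illustrative Python reading with a hypothetical function name, not the claimed module itself): a maximal run of units whose codebook indices are consecutive collapses into one (base-index, contiguity) pair.

```python
def run_length_groups(unit_indices):
    """Group a sequence of selected unit indices into contiguous groups:
    base-index = index of the first unit in the group,
    contiguity = number of units with consecutive codebook indices."""
    groups = []
    for idx in unit_indices:
        if groups and idx == groups[-1][0] + groups[-1][1]:
            base, contig = groups[-1]
            groups[-1] = (base, contig + 1)  # extend the current run
        else:
            groups.append((idx, 1))          # start a new group
    return groups

# units 41,42,43 form one group; 7 and 90,91 form two more:
print(run_length_groups([41, 42, 43, 7, 90, 91]))
# [(41, 3), (7, 1), (90, 2)]
```

Only the base-index and contiguity of each group need be transmitted, which is what permits the bit-rate reduction exploited in claim 3.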
3. The system according to claim 2, further comprising
means for communicating quantized speech parameters
(21) including the base-index and the contiguity of
identified contiguous groups and individual segment
lengths as a plurality of bits over a communication
channel (22), for reconstruction by a decoding means
(24), wherein total number of bits (B) transmitted
over a time duration T is computed based upon the
relationship given by:

B = P·(⌈log2 N⌉ + ⌈log2 Cmax⌉) + K*·⌈log2 Lmax⌉

wherein,
P is the total number of contiguous groups during the
time duration T,
N is the size of the continuous codebook,
Cmax is a maximum allowed contiguity,
K* is the total number of segments of input speech
during the time duration T, and
Lmax is a maximum allowed length of a segment k of
input speech obtained after decoding of the feature
vectors of input speech.
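A hedged reading of this bit-count relationship (the exact formula is not reproduced in this text; the form below is an assumption consistent with the variable definitions: each group costs a base-index plus a contiguity, each segment costs a length) can be computed as:

```python
from math import ceil, log2

def total_bits(P, N, C_max, K_star, L_max):
    """Bits sent over a duration T, assuming each of the P contiguous
    groups is coded as a base-index (log2 N bits) and a contiguity
    (log2 C_max bits), and each of the K* segments sends its length
    (log2 L_max bits). Illustrative reading, not the claimed formula."""
    return (P * (ceil(log2(N)) + ceil(log2(C_max)))
            + K_star * ceil(log2(L_max)))

# e.g. 4 groups, a 1024-unit codebook, contiguity up to 16,
# 10 segments, segment lengths up to 32 frames:
print(total_bits(4, 1024, 16, 10, 32))  # 4*(10+4) + 10*5 = 106
```

Note how grouping helps: the fewer the groups P relative to the number of segments K*, the fewer index bits are spent per segment.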
4. The system according to any of claims 1 to 3, wherein
units of said continuous codebook (18) comprise
variable number of feature vectors corresponding to a
variable number of frames of said continuous speech.
5. The system according to any of claims 1 to 3, wherein
each of the units of said continuous codebook (18)
comprises an equal number of feature vectors
corresponding to an equal number of frames of said
continuous speech.
6. The system according to any of claims 1 to 3, wherein
each unit of said continuous codebook (18) comprises a
single feature vector obtained corresponding to a
single frame of said continuous speech.
7. The system according to any of claims 1 to 6, wherein
said quantizing means (16) is adapted to carry out
within-unit recursion over all feature vectors of a
unit except a feature vector corresponding to the
first frame of said unit based upon a relationship
given by:

D(i,j,n) = α·d(i,j,n) + min[D(i-1,j,n), D(i-1,j-1,n),
D(i-1,j-2,n)], for j = 2, ..., ln, and

said quantizing means is adapted to carry out cross-
unit recursion over the feature vector corresponding
to the first frame of a unit based upon a relationship
given by:

D(i,1,n) = α·d(i,1,n) + min over r = 1, ..., N of
[D(i-1,lr,r) + (1-α)·Dc(r,n)],
wherein,
D (i,j,n) is defined as a minimum accumulated
distortion by any path reaching a grid point defined
by feature vector 'i' of input speech and feature
vector 'j' of unit 'n' in the continuous codebook,
d (i,j,n) is defined as a local distance between
feature vector 'i' of the input speech and feature
vector 'j' of unit n,
N denotes total number of units in the continuous
codebook,
lr denotes the index of the feature vector
corresponding to the last frame of unit 'r' in the
continuous codebook,
α and 1-α are respectively defined as relative
weights of the unit-cost and concatenation-cost, and
Dc(r,n) is defined as:

Dc(r,n) = δ(r,n)·dE(ur(lr), un(1)),

wherein dE(·,·) is the Euclidean distance between
the feature vector corresponding to the last frame of
unit ur and the feature vector corresponding to the
first frame of unit un, and δ(r,n) = 0 if n = r+1 and
δ(r,n) = 1 otherwise.
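The two recursions can be sketched in code. The following is an illustrative reading only: the exact local path constraints within a unit, the handling of the first input frame, and the α-weighting of the local distance are assumptions where the text leaves them open, and the function name is hypothetical.

```python
import numpy as np

def unit_selection_dp(X, units, alpha=0.5):
    """Accumulated-distortion DP aligning input frames X (T x d) with a
    list of codebook units (each an l_n x d array of feature vectors).
    Within-unit steps accumulate alpha-weighted local distances; the
    cross-unit step adds a (1-alpha)-weighted concatenation cost."""
    N, T = len(units), len(X)
    D = [np.full((T, len(u)), np.inf) for u in units]   # D[n][i, j]
    # concatenation cost Dc(r, n): Euclidean distance between the last
    # frame of unit r and the first frame of unit n; zero when n = r + 1
    Dc = np.array([[0.0 if n == r + 1 else
                    float(np.linalg.norm(units[r][-1] - units[n][0]))
                    for n in range(N)] for r in range(N)])
    for n in range(N):          # a path must enter each unit at frame 1
        D[n][0, 0] = alpha * np.linalg.norm(X[0] - units[n][0])
    for i in range(1, T):
        ends = np.array([D[r][i - 1, -1] for r in range(N)])
        for n in range(N):
            d_in = np.linalg.norm(X[i] - units[n], axis=1)  # d(i, j, n)
            # cross-unit recursion at the first frame of unit n
            D[n][i, 0] = alpha * d_in[0] + np.min(
                ends + (1 - alpha) * Dc[:, n])
            # within-unit recursion over the remaining frames of unit n
            for j in range(1, len(units[n])):
                prev = [D[n][i - 1, j], D[n][i - 1, j - 1]]
                if j >= 2:
                    prev.append(D[n][i - 1, j - 2])
                D[n][i, j] = alpha * d_in[j] + min(prev)
    # overall distortion: best path ending at a unit's last frame
    return min(float(D[n][-1, -1]) for n in range(N))
```

With a perfectly matching input (and alpha = 1, so the concatenation cost is switched off) the accumulated distortion is zero, which is a convenient sanity check for the recursion structure.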
8. The system according to any of claims 1 to 7, wherein
the feature vectors comprise linear prediction
parameters obtained from linear prediction analysis of
speech frames.
9. The system according to any of claims 1 to 7, wherein
the feature vectors comprise mel-frequency cepstrum
coefficients.
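The linear-prediction analysis referred to in claim 8 can be sketched generically (the autocorrelation method with illustrative frame sizes and a small ridge term added for numerical stability; this is a textbook sketch, not the analysis means of the claims):

```python
import numpy as np

def lpc_features(signal, frame_len=200, hop=80, order=10):
    """Frame the signal and compute linear-prediction coefficients per
    frame via the autocorrelation method. Frame length, hop and order
    are illustrative (e.g. 25 ms frames at 8 kHz, 10 ms hop)."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        # autocorrelation at lags 0..order
        r = np.correlate(frame, frame, mode="full")[
            frame_len - 1:frame_len + order]
        # solve the normal equations R a = r(1..p); a small ridge keeps
        # the Toeplitz system well conditioned
        R = np.array([[r[abs(i - j)] for j in range(order)]
                      for i in range(order)])
        a = np.linalg.solve(R + 1e-6 * np.eye(order), r[1:])
        feats.append(a)
    return np.array(feats)
```

Each row of the returned array is one feature vector, i.e. the per-frame representation that the sequence (15) of the claims is made of; mel-frequency cepstrum coefficients (claim 9) would be computed from the same framing.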
10. A method for coding speech, comprising the steps of:
- dividing input speech (12) into a sequence of speech
frames and analyzing said sequence of speech frames
to obtain a sequence of feature vectors (15)
corresponding to said sequence of speech frames, and
- quantizing said sequence of feature vectors (15)
obtained from the input speech (12) via unit-
selection directly from a continuous codebook (18),
said continuous codebook (18) comprising
sequentially indexed units made up of feature
vectors corresponding to frames of continuous
speech, wherein unit-selection from said continuous
codebook (18) is based upon a dynamic programming
algorithm comprising:
- determining a warping path (66) for an optimized
alignment of said sequence of feature vectors
(15) obtained from input speech (12) with a
sequence of feature vectors of units from the
continuous codebook (18) using a within-unit
recursion and a cross-unit recursion, said
within-unit recursion and cross-unit recursion
minimizing an overall distortion defined by a
weighted function of a unit-cost and a
concatenation-cost, and
- computing, from said warping path (66), an
optimized number of segments and segment
boundaries of the sequence of feature vectors
(15) obtained from input speech (12), and a
corresponding sequence of units selected from the
continuous codebook (18) along the warping path
(66),
wherein said unit-cost is defined as a distortion in
quantizing a segment of sequence of feature vectors
(15) of input speech (12) with a unit selected from
the continuous codebook (18) and said concatenation-
cost is defined as a distortion between two units of
the continuous codebook (18) that are selected to
quantize two consecutive segments of the sequence of
feature vectors (15) of input speech (12).
11. The method according to claim 10, further comprising
run-length coding of said sequence of selected units,
to identify, from said sequence of selected units, at
least one contiguous group comprising a partial
sequence of units having consecutive indices in the
continuous codebook (18), and to determine therefrom,
a base-index and contiguity of said contiguous group,
wherein base index is defined as the index of the
first unit in a contiguous group and contiguity is
defined by the total number of units in a contiguous
group.
12. The method according to claim 11, further comprising
communicating quantized speech parameters (21)
including the base-index and the contiguity of
identified contiguous groups and individual segment
lengths as a plurality of bits, for reconstruction by
a decoding means (24), wherein total number of bits
(B) transmitted over a time duration T is computed
based upon the relationship given by:

B = P·(⌈log2 N⌉ + ⌈log2 Cmax⌉) + K*·⌈log2 Lmax⌉

wherein,
P is the total number of contiguous groups during the
time duration T,
N is the size of the continuous codebook,
Cmax is a maximum allowed contiguity,
K* is the total number of segments of input speech
during the time duration T, and
Lmax is a maximum allowed length of a segment k of
input speech obtained after decoding of the feature
vectors of input speech.
13. The method according to any of claims 10 to 12,
wherein said within-unit recursion is carried out over
all feature vectors of a unit except a feature vector
corresponding to the first frame of said unit based
upon a relationship given by:

D(i,j,n) = α·d(i,j,n) + min[D(i-1,j,n), D(i-1,j-1,n),
D(i-1,j-2,n)], for j = 2, ..., ln, and

said cross-unit recursion is carried out over the
feature vector corresponding to the first frame of a
unit based upon a relationship given by:

D(i,1,n) = α·d(i,1,n) + min over r = 1, ..., N of
[D(i-1,lr,r) + (1-α)·Dc(r,n)],
wherein,
D(i,j,n) is defined as a minimum accumulated
distortion by any path reaching a grid point defined
by feature vector 'i' of input speech and feature
vector 'j' of unit 'n' in the continuous codebook,
d (i,j,n) is defined as a local distance between
feature vector 'i' of the input speech and feature
vector 'j' of unit n,
N denotes total number of units in the continuous
codebook,
lr denotes the index of the feature vector
corresponding to the last frame of unit 'r' in the
continuous codebook,
α and 1-α are respectively defined as relative
weights of the unit-cost and concatenation-cost, and
Dc(r,n) is defined as:

Dc(r,n) = δ(r,n)·dE(ur(lr), un(1)),

wherein dE(·,·) is the Euclidean distance between
the feature vector corresponding to the last frame of
unit ur and the feature vector corresponding to the
first frame of unit un, and δ(r,n) = 0 if n = r+1 and
δ(r,n) = 1 otherwise.
14. A system or method substantially as herein above
described in the specification with reference to the
accompanying drawings.
The present invention relates to a system and method for coding speech. In order to improve coding of speech at very low bit rates, the proposed system comprises means (14) for
dividing input speech (12) into a sequence of speech frames and analyzing said sequence of speech frames to obtain a sequence of feature vectors (15) therefrom, and means (16) for
quantizing said sequence of feature vectors (15) obtained from the input speech (12). The means (16) includes a dynamic programming framework (17) adapted for unit-selection
directly from a continuous codebook (18). The continuous codebook (18) comprises sequentially indexed units made up of feature vectors corresponding to frames of continuous speech.
The dynamic programming framework (17) includes means for determining a warping path (66) for an optimized alignment of said sequence of feature vectors (15) obtained from input
speech (12) with a sequence of feature vectors of units from the continuous codebook (18) using a within-unit recursion and a cross-unit recursion, said within-unit recursion and
cross-unit recursion minimizing an overall distortion defined by a weighted function of a unit-cost and a concatenation-cost.
The dynamic programming framework (17) also includes means for computing, from said warping path (66), an optimized number of segments and segment boundaries of the sequence of feature vectors (15) obtained from input speech
(12), and a corresponding sequence of units selected from the continuous codebook (18) along the warping path (66).
| # | Name | Date |
|---|---|---|
| 1 | 00919-kol-2006-correspondence others-1.1.pdf | 2011-10-07 |
| 2 | 00919-kol-2006-correspondence others.pdf | 2011-10-07 |
| 3 | 00919-kol-2006-description(provisional).pdf | 2011-10-07 |
| 4 | 00919-kol-2006-drawings.pdf | 2011-10-07 |
| 5 | 00919-kol-2006-form-1-1.1.pdf | 2011-10-07 |
| 6 | 00919-kol-2006-form-1.pdf | 2011-10-07 |
| 7 | 00919-kol-2006-form-2.pdf | 2011-10-07 |
| 8 | 00919-kol-2006-form-3.pdf | 2011-10-07 |
| 9 | 919-kol-2006-abstract.pdf | 2011-10-07 |
| 10 | 919-kol-2006-claims.pdf | 2011-10-07 |
| 11 | 919-kol-2006-correspondence.pdf | 2011-10-07 |
| 12 | 919-kol-2006-description (complete).pdf | 2011-10-07 |
| 13 | 919-kol-2006-drawings.pdf | 2011-10-07 |
| 14 | 919-kol-2006-form 1.pdf | 2011-10-07 |
| 15 | 919-kol-2006-form 2.pdf | 2011-10-07 |
| 16 | 919-kol-2006-form 3.pdf | 2011-10-07 |
| 17 | 919-kol-2006-form 5.pdf | 2011-10-07 |
| 18 | 919-kol-2006-specification.pdf | 2011-10-07 |