Speaker Authentication


Abstract: Speaker authentication is performed by determining a similarity score for a test utterance and a stored training utterance. Computing the similarity score involves determining the sum of a group of functions, where each function includes the product of a posterior probability of a mixture component and a difference between an adapted mean and a background mean. The adapted mean is formed based on the background mean and the test utterance. The speech content provided by the speaker for authentication can be text-independent (i.e., any content they want to say) or text-dependent (i.e., a particular phrase used for training).


Patent Information

Application #

Filing Date

11 August 2008

Publication Number

11/2009

Publication Type

INA

Invention Field

COMPUTER SCIENCE

Status

Parent Application

Applicants

MICROSOFT CORPORATION

ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399

Inventors

1. ZHANG, ZHENGYOU

ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399

2. LIU, MING

ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399

Specification

BACKGROUND
Speaker authentication is the process of verifying the claimed identity of a speaker based on a speech signal. The authentication is typically performed using speech models that have been trained for each person who uses the system.
In general, there are two types of speaker authentication: text-independent and text-dependent. In text-independent speaker authentication, the speaker provides any speech content that they want to provide. In text-dependent speaker authentication, the speaker recites a particular phrase during model training and during use of the authentication system. By repeating the same phrase, a strong model of the phonetic units and transitions between those phonetic units can be constructed for the text-dependent speaker authentication system. This is less true of text-independent speaker authentication systems, since many phonetic units and many transitions between phonetic units will not be observed during training and thus will not be represented well in the models.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
SUMMARY
Speaker authentication is performed by determining a similarity score for a test utterance and a stored training utterance. Computing the similarity score involves determining the sum of a group of functions, where each function includes the product of a posterior probability of a mixture component and a difference between an adapted mean and a background mean. The adapted mean is formed based on the background mean and the test utterance.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of one computing environment in which some embodiments may be practiced.
FIG. 2 is a block diagram of an alternative computing environment in which some embodiments may be practiced.
FIG. 3 is a flow diagram of a method of training a text-independent authentication system.
FIG. 4 is a block diagram of elements used to train a text-independent authentication system.
FIG. 5 is a flow diagram of a method for setting thresholds during training.
FIG. 6 is a flow diagram of a method of identifying model parameters for a test utterance.
FIG. 7 is a block diagram of elements used in the methods of FIGS. 6 and 8.
FIG. 8 is a flow diagram of a method for determining thresholds for a test utterance.

FIG. 9 is a flow diagram of a method of authenticating a test utterance.
FIG. 10 is a block diagram of elements used to authenticate a test utterance.
FIG. 11 is a flow diagram of a method of training a Hidden Markov Model for a text-dependent authentication system.
FIG. 12 is a block diagram of elements used to train a Hidden Markov Model.
FIG. 13 is a flow diagram of a method of authenticating a test utterance using a Hidden Markov Model.
FIG. 14 is a block diagram of elements used to authenticate a test utterance using a Hidden Markov Model.
DETAILED DESCRIPTION
FIG. 1 illustrates an example of a suitable computing system environment 100 on which embodiments may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the claimed subject matter. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to FIG. 1, an exemplary system for implementing some embodiments includes a general-purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

FIG. 2 is a block diagram of a mobile device 200, which is an exemplary computing environment. Mobile device 200 includes a microprocessor 202, memory 204, input/output (I/O) components 206, and a communication interface 208 for communicating with remote computers or other mobile devices. In one embodiment, the aforementioned components are coupled for communication with one another over a suitable bus 210.
Memory 204 is implemented as non-volatile electronic memory such as random access memory (RAM) with a battery back-up module (not shown) such that information stored in memory 204 is not lost when the general power to mobile device 200 is shut down. A portion of memory 204 is preferably allocated as addressable memory for program execution, while another portion of memory 204 is preferably used for storage, such as to simulate storage on a disk drive.
Memory 204 includes an operating system 212, application programs 214 as well as an object store 216. During operation, operating system 212 is preferably executed by processor 202 from memory 204. Operating system 212, in one preferred embodiment, is a WINDOWS® CE brand operating system commercially available from Microsoft Corporation. Operating system 212 is preferably designed for mobile devices, and implements database features that can be utilized by applications 214 through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.
Communication interface 208 represents numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include wired and wireless modems, satellite receivers and broadcast tuners to name a few. Mobile device 200 can also be directly connected to a computer to exchange data therewith. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.
Input/output components 206 include a variety of input devices such as a touch-sensitive screen, buttons, rollers, and a microphone as well as a variety of output devices including an audio generator, a vibrating device, and a display. The devices listed above are by way of example and need not all be present on mobile device 200. In addition, other input/output devices may be attached to or found with mobile device 200.
TEXT-INDEPENDENT SPEAKER VERIFICATION
Under one embodiment of the present invention, a text-independent speaker authentication system is provided which authenticates a test speech signal by forming a similarity measure that is based on a model adapted to training speech for a user and a model adapted to the test speech signal. In particular, the similarity measure uses the differences between the two adapted models and a background model.
In one embodiment, the background model is a Gaussian Mixture Model that is defined as:
$$P(x_t \mid \lambda_0) = \sum_{i=1}^{M} w_i\, \mathcal{N}(x_t;\, m_i, \Sigma_i)$$
where $M$ is the number of mixture components in the model, $w_i$ is a weight for the ith mixture component, $m_i$ is the mean for the ith mixture component and $\Sigma_i$ is the covariance matrix of the ith component. Notation $\lambda_0$ denotes the set of parameters of the background model (the weight, mean and covariance for each component). The background model is adapted to training speech using the following equations:
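The mixture density described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the covariance matrices are taken to be diagonal, and the names `log_gaussian_diag`, `gmm_log_likelihood` and the toy parameter values are hypothetical, not taken from the specification.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian; a diagonal covariance is assumed for brevity."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
        for xi, mi, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log of sum_i w_i N(x; m_i, Sigma_i), computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Toy background model (illustrative values): M = 2 components, 2-D features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frame = [0.0, 0.0]
log_p = gmm_log_likelihood(frame, weights, means, variances)
print(log_p)
```

Working in the log domain is a design choice of this sketch; it avoids underflow when a frame lies far from every component mean.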

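$$\hat{\gamma}(i \mid \hat{x}_t) = \frac{w_i\, \mathcal{N}(\hat{x}_t;\, m_i, \Sigma_i)}{\sum_{j=1}^{M} w_j\, \mathcal{N}(\hat{x}_t;\, m_j, \Sigma_j)}$$
$$\hat{\gamma}(i) = \sum_{t=1}^{\hat{T}} \hat{\gamma}(i \mid \hat{x}_t)$$
$$\hat{m}_i = \frac{\sum_{t=1}^{\hat{T}} \hat{\gamma}(i \mid \hat{x}_t)\, \hat{x}_t + \alpha\, m_i}{\hat{\gamma}(i) + \alpha}$$
$$\hat{\Sigma}_i = \Sigma_i$$
where $\hat{x}_t$ is a training feature vector from a particular speaker, $\hat{\gamma}(i \mid \hat{x}_t)$ is the posterior probability of the ith mixture component given the feature vector from the speaker, $\hat{T}$ is the number of frames in the training utterance from the particular speaker, $\hat{\gamma}(i)$ is the soft count of the frames belonging to the ith mixture component across the entire training utterance from the particular speaker, and $\alpha$ is a smoothing factor that causes the mean $\hat{m}_i$ of the adapted model to adopt the mean of the background model if there are few observed frames for the ith mixture component in the training utterance. Note that in the embodiment described above, the covariance for the adapted model is equal to the covariance for the background model.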
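The adaptation step can be illustrated with a small Python sketch. It assumes diagonal covariances and a smoothed-mean update of the form (weighted frame sum + alpha * background mean) / (soft count + alpha), which matches the behavior described for the smoothing factor; the function names and toy values are hypothetical.

```python
import math

def component_posteriors(x, weights, means, variances):
    """Posterior of each mixture component given one frame (diagonal covariance assumed)."""
    logs = []
    for w, m, v in zip(weights, means, variances):
        log_n = -0.5 * sum(
            math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd
            for xd, md, vd in zip(x, m, v)
        )
        logs.append(math.log(w) + log_n)
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_means(frames, weights, means, variances, alpha):
    """Return (adapted means, soft counts); covariances are left unchanged."""
    n_comp, dim = len(weights), len(means[0])
    soft = [0.0] * n_comp
    first = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        post = component_posteriors(x, weights, means, variances)
        for i in range(n_comp):
            soft[i] += post[i]
            for d in range(dim):
                first[i][d] += post[i] * x[d]
    adapted = [
        [(first[i][d] + alpha * means[i][d]) / (soft[i] + alpha) for d in range(dim)]
        for i in range(n_comp)
    ]
    return adapted, soft

# Toy example: all training frames fall near component 0.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
training_frames = [[1.0, 1.0]] * 4
adapted, soft = adapt_means(training_frames, weights, means, variances, alpha=4.0)
print(adapted, soft)
```

In this toy run component 0 moves halfway toward the training frames, while component 1, which observes almost no frames, stays at its background mean.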

Under one embodiment, the similarity measure is defined as:
where
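$$\delta_i = \bar{m}_i - m_i \qquad \text{EQ. 8}$$
$$\hat{\delta}_i = \hat{m}_i - m_i \qquad \text{EQ. 9}$$
$$\gamma(i) = \sum_{t=1}^{T} \gamma(i \mid x_t) \qquad \text{EQ. 10}$$
where $x_t$ is a feature vector of the test utterance, $T$ is the number of frames of the test utterance and $\bar{m}_i$ is the sample mean of the test utterance which is defined as:
$$\bar{m}_i = \frac{1}{\gamma(i)} \sum_{t=1}^{T} \gamma(i \mid x_t)\, x_t \qquad \text{EQ. 11}$$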
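The quantities in EQ. 8 through EQ. 11 can be computed as in the following Python sketch. The final combination is an assumption, since the similarity measure of equation 7 is not reproduced in the text: the sketch uses the average per-frame log-likelihood ratio between Gaussians with adapted and background means and a shared diagonal covariance, which reduces to a posterior-weighted sum over components. All names and toy values are hypothetical.

```python
import math

def component_posteriors(x, weights, means, variances):
    """gamma(i | x) under the background model (diagonal covariance assumed)."""
    logs = []
    for w, m, v in zip(weights, means, variances):
        log_n = -0.5 * sum(
            math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd
            for xd, md, vd in zip(x, m, v)
        )
        logs.append(math.log(w) + log_n)
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def similarity(frames, weights, means, variances, adapted_means):
    """Assumed score: sum_i gamma(i) * d_hat_i * (d_i - d_hat_i / 2) / var_i, over sum_i gamma(i)."""
    n_comp, dim = len(weights), len(means[0])
    gamma = [0.0] * n_comp                      # soft counts, EQ. 10
    first = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        post = component_posteriors(x, weights, means, variances)
        for i in range(n_comp):
            gamma[i] += post[i]
            for d in range(dim):
                first[i][d] += post[i] * x[d]
    score = 0.0
    for i in range(n_comp):
        if gamma[i] == 0.0:
            continue  # component never observed; it contributes nothing
        for d in range(dim):
            sample_mean = first[i][d] / gamma[i]            # EQ. 11
            delta = sample_mean - means[i][d]               # EQ. 8
            delta_hat = adapted_means[i][d] - means[i][d]   # EQ. 9
            score += gamma[i] * delta_hat * (delta - 0.5 * delta_hat) / variances[i][d]
    return score / sum(gamma)

# Toy example: adapted means as they might come from an enrolled speaker.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
adapted_means = [[0.5, 0.5], [10.0, 10.0]]
genuine = similarity([[1.0, 1.0]] * 2, weights, means, variances, adapted_means)
impostor = similarity([[-1.0, -1.0]] * 2, weights, means, variances, adapted_means)
print(genuine, impostor)
```

Test frames that shift in the same direction as the enrolled speaker's adapted mean score higher than frames that shift the opposite way, which is the behavior a threshold test would rely on.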
Thus, in the similarity measure of equation 7, a product is formed from the posterior probability $\gamma$ for the test utterance, the difference, $\hat{\delta}_i$, between an adapted mean for the test speaker and a background mean and the difference,

Documents

Application Documents

#	Name	Date
1	4242-chenp-2008 pct.pdf	2011-09-04
2	4242-chenp-2008 form-5.pdf	2011-09-04
3	4242-chenp-2008 form-3.pdf	2011-09-04
4	4242-chenp-2008 form-26.pdf	2011-09-04
5	4242-chenp-2008 form-1.pdf	2011-09-04
6	4242-chenp-2008 drawings.pdf	2011-09-04
7	4242-chenp-2008 description(complete).pdf	2011-09-04
8	4242-chenp-2008 correspondence-others.pdf	2011-09-04
9	4242-chenp-2008 claims.pdf	2011-09-04
10	4242-chenp-2008 abstract.pdf	2011-09-04