Abstract: A system and method for localizing spoken words in utterances using a mismatched crowd. The process creates an audio annotation task comprising a list of spoken utterances and a list of keywords selected from a predefined dictionary, and presents the task comprising the utterances and the transliterated keywords to workers for annotation. The annotation process requires a worker to choose and localize a keyword from the plurality of keywords, wherein one option allows the worker to choose 'None of the above', indicating that none of the keywords from the list is spoken in the utterance. If the worker chooses one of the keywords from the list, the worker is asked to mark its boundaries on the waveform of the spoken utterance. Further, the process aggregates the workers' responses to generate a ground-truth label using an expectation-maximization (EM) framework. In addition, the system maintains each worker's reputation in terms of accuracy and bias.
Claims: 1. A method for localizing a keyword in spoken utterances using a mismatched crowd, wherein the method comprises:
creating an audio annotation task comprising a list of spoken utterances and a list of keywords selected from a predefined dictionary;
transliterating the list of keywords from the spoken script into the script of each worker of the mismatched crowd;
presenting the audio annotation task comprising the list of spoken utterances and the transliterated list of keywords to one or more workers of the mismatched crowd to localize the keywords from the list of spoken utterances;
aggregating one or more responses localized by each worker of the mismatched crowd to generate a ground-truth label using an expectation-maximization framework; and
analyzing the performance of each worker from the aggregated one or more responses to localize the keyword in the spoken utterances.
2. The method claimed in claim 1, further comprising:
an annotating process requiring a worker to choose a keyword from the list of keywords, wherein one option allows the worker to choose 'None of the above', indicating that none of the keywords from the list is spoken in the utterance.
3. The method claimed in claim 1, further comprising:
if the worker chooses one of the keywords from the list, then the worker is required to mark boundaries of the selected keyword on the waveform of the spoken utterance.
4. The method claimed in claim 3, wherein the worker's label consists of the keyword identity as well as its boundary markings.
5. The method claimed in claim 1, further comprising:
estimating a reputation of each worker of the mismatched crowd using the expectation-maximization framework, wherein the reputation is modelled in terms of the worker's accuracy and bias towards rejecting the existence of a keyword.
6. A system for localizing a keyword in spoken utterances using a mismatched crowd, wherein the system comprises:
a plurality of crowd interfaces;
a memory;
a processor communicatively coupled with the memory, wherein the processor is configured to perform one or more steps of:
creating an audio annotation task comprising a list of spoken utterances and a list of keywords selected from a predefined dictionary;
transliterating the list of keywords from the spoken script into the script of each worker of the mismatched crowd;
presenting the audio annotation task comprising the list of spoken utterances and the transliterated list of keywords to one or more workers of the mismatched crowd to identify the keywords from the list of spoken utterances;
localizing the keyword from the list of spoken utterances;
aggregating one or more responses localized by each worker of the mismatched crowd to generate a ground-truth label using an expectation-maximization framework; and
analyzing the performance of each worker from the aggregated one or more responses to localize the keyword in the spoken utterances.
7. The system claimed in claim 6, further comprising:
an annotating process requiring a worker to choose a keyword from the plurality of keywords, wherein one option allows the worker to choose 'None of the above', indicating that none of the keywords from the list is spoken in the utterance.
8. The system claimed in claim 6, further comprising:
if the worker chooses one of the keywords from the list, then the worker is required to mark boundaries of the selected keyword on the waveform of the spoken utterance.
9. The system claimed in claim 8, wherein the worker's label consists of the identity of the keyword and its boundary markings.
10. The system claimed in claim 6, further comprising:
estimating a reputation of each worker of the mismatched crowd using the expectation-maximization framework, wherein the reputation is modelled in terms of the worker's accuracy and bias towards rejecting the existence of a keyword.
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
A METHOD AND SYSTEM FOR LOCALIZING SPOKEN WORDS IN UTTERANCES USING MISMATCHED CROWD
Applicant:
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th Floor,
Nariman Point, Mumbai 400021,
Maharashtra, India
The following specification particularly describes the embodiments and the manner in which it is to be performed.
TECHNICAL FIELD
The embodiments herein generally relate to a system and method for localizing a keyword in spoken utterances and, more particularly, to identifying the correct word label across different options and its boundary on the utterance from the responses of a mismatched crowd.
BACKGROUND
In this digital age, mobile phone and internet accessibility has reached almost everyone in the world. This advancement can be leveraged to derive a demographic advantage for crowd work. However, there are various scripts and languages in the world for reading and writing.
Despite advancements in automatic speech recognition, the availability of additional training data is always desirable, especially for under-resourced languages. Crowdsourcing provides a viable option for constructing such a corpus. Conventionally, speech annotation tasks have required an annotator to be familiar with the language being spoken. This can limit the crowd size that can be enlisted, owing to the significant mismatch between the population of native speakers of a language and the available online crowd workers.
In order to overcome this difficulty, there has been interest in studying the annotation performance of a mismatched crowd. A crowd employed for speech annotation without being familiar with the spoken language is called a mismatched crowd.
SUMMARY
The following presents a simplified summary of some embodiments of the disclosure in order to provide a basic understanding of the embodiments. This summary is not an extensive overview of the embodiments. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the embodiments. Its sole purpose is to present some embodiments in a simplified form as a prelude to the more detailed description that is presented below.
In view of the foregoing, an embodiment herein provides a system and method for localizing spoken words in utterances using a mismatched crowd.
In one aspect, a system for localizing spoken words in utterances using a mismatched crowd is provided. The system comprises a processor and a memory communicatively coupled to the processor, wherein the memory contains instructions that are readable by the processor. The processor is configured to perform one or more steps of: creating an audio annotation task comprising a list of spoken utterances and a list of keywords selected from a predefined dictionary; transliterating the list of keywords from the spoken script into the script of each worker of the mismatched crowd; presenting the audio annotation task comprising the list of spoken utterances and the transliterated list of keywords to one or more workers of the mismatched crowd to identify the keywords from the list of spoken utterances; aggregating one or more responses localized by each worker of the mismatched crowd to generate a ground-truth label using an expectation-maximization framework; and analyzing the performance of each worker from the aggregated one or more responses to localize the keyword in the spoken utterances.
In another aspect, a method for localizing a keyword in spoken utterances using a mismatched crowd is provided. The method comprises one or more steps of: creating an audio annotation task comprising a list of spoken utterances and a list of keywords selected from a predefined dictionary; transliterating the list of keywords from the spoken script into the script of each worker of the mismatched crowd; presenting the audio annotation task comprising the list of spoken utterances and the transliterated list of keywords to one or more workers of the mismatched crowd to identify the keywords from the list of spoken utterances; aggregating one or more responses localized by each worker of the mismatched crowd to generate a ground-truth label using an expectation-maximization framework; and analyzing the performance of each worker from the aggregated one or more responses to localize the keyword in the spoken utterances.
It should be appreciated by those skilled in the art that any block diagram herein represents conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer-readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.
BRIEF DESCRIPTION OF THE DRAWINGS
The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
Figure 1 illustrates a method for localizing spoken words in utterances using mismatched crowd by identifying the correct word label across different options and its boundary on the utterance from the responses of mismatched crowd according to an embodiment of the present disclosure;
Figure 2 illustrates a system for localizing spoken words in utterances using mismatched crowd by identifying the correct word label across different options and its boundary on the utterance from the responses of mismatched crowd according to an embodiment of the present disclosure;
Figure 3 is a schematic diagram to get annotated audio for localizing spoken words in utterances using mismatched crowd according to an embodiment of the present disclosure;
Figure 4 is a schematic diagram for task generation and to allocate task to mismatched crowd for localizing spoken words in utterances according to an embodiment of the present disclosure;
Figure 5 illustrates results in terms of average accuracies for a defined language dataset according to an embodiment of the present disclosure; and
Figure 6 is a schematic diagram for comparison of estimated parameters of performance of the mismatched crowd according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
Referring to fig. 1, a method 200 for localizing a keyword in spoken utterances using a mismatched crowd is illustrated.
At step 202, the process creates an audio annotation task comprising a list of spoken utterances and a list of keywords selected from a predefined dictionary. The audio recordings are of arbitrary sizes and may have durations ranging from seconds to hours. The process applies a voice activity detection algorithm on the audio recordings and creates segments of shorter duration, because shorter audio clips can be annotated well by a mismatched crowd worker.
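By way of illustration only, a simple energy-based voice activity detector can split a long recording into short, annotatable clips; the specification does not prescribe a particular detection algorithm, and the frame size, threshold and maximum clip length below are assumptions.

```python
# Illustrative sketch only: an energy-based voice activity detector that segments a long
# recording into short clips suitable for mismatched-crowd annotation.
import numpy as np

def segment_utterances(samples, sr, frame_ms=30, energy_thresh=0.01, max_clip_s=10.0):
    """Return (start, end) sample indices of speech segments no longer than max_clip_s."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # Per-frame RMS energy decides speech vs. silence (samples: float numpy array).
    energy = np.array([
        np.sqrt(np.mean(samples[k * frame_len:(k + 1) * frame_len] ** 2))
        for k in range(n_frames)
    ])
    voiced = energy > energy_thresh

    segments, start = [], None
    for k, v in enumerate(voiced):
        if v and start is None:
            start = k * frame_len
        elif not v and start is not None:
            segments.append((start, k * frame_len))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))

    # Split any segment that is still too long for comfortable annotation.
    max_len = int(max_clip_s * sr)
    clips = []
    for s, e in segments:
        for off in range(s, e, max_len):
            clips.append((off, min(off + max_len, e)))
    return clips
```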
At step 204, the process transliterates the list of keywords from the spoken script into the script of each worker of the mismatched crowd. The mismatched crowd includes a plurality of workers who are unfamiliar with the source language being spoken in the utterance. The script of the plurality of workers can differ from worker to worker even for the same spoken utterance in the task. The process uses either publicly available utilities or a built-in transliteration engine to achieve this, as sketched below.
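As an illustration, a publicly available utility such as the indic_transliteration package can transliterate Devanagari keywords into a Roman scheme for workers who read Latin script; the choice of package, scheme and example keywords here is an assumption, not part of the specification.

```python
# Illustrative sketch only: transliterating Hindi (Devanagari) keywords into a Roman
# scheme readable by a non-Hindi worker, using one publicly available utility.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

keywords_devanagari = ["पानी", "किताब"]  # example keywords in the spoken script
keywords_for_worker = [
    transliterate(kw, sanscript.DEVANAGARI, sanscript.ITRANS)
    for kw in keywords_devanagari
]
print(keywords_for_worker)  # e.g. ['pAnI', 'kitAba'] in the worker's (Roman) script
```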
At step 206, the process presents the audio annotation task comprising the list of spoken utterances and the transliterated list of keywords to one or more workers of the mismatched crowd to localize the keywords from the list of spoken utterances. The workers are asked to listen to the spoken utterances, to choose a word from the given list, and to mark it on the waveform. The worker is asked to follow these steps:
Choose a keyword from the list of given options, among which one option allows the worker to choose 'None of the above'; and
If the worker chooses one of the words, then the worker is required to mark its boundaries on the waveform of the utterance. Hence the worker response contains the keyword identity and its marking; a minimal representation of such a response is sketched below.
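A minimal sketch of how such a worker response could be represented follows; the field names are illustrative assumptions and not the specification's data model.

```python
# Illustrative sketch only: one worker response, holding the chosen keyword identity
# (or None for 'None of the above') and its boundary markings in samples.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WorkerResponse:
    task_id: int                           # index i of the spoken utterance (task)
    worker_id: int                         # index j of the crowd worker
    keyword: Optional[str]                 # chosen keyword, or None for 'None of the above'
    boundaries: Optional[Tuple[int, int]]  # (start, end) marks on the waveform, in samples

# Example: worker 3 hears the keyword 'pAnI' between samples 4000 and 9600 of task 17.
resp = WorkerResponse(task_id=17, worker_id=3, keyword="pAnI", boundaries=(4000, 9600))
# A rejection would look like: WorkerResponse(17, 5, None, None)
```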
At step 208, the process aggregates the one or more responses provided by each worker of the mismatched crowd to generate a ground-truth label using an expectation-maximization (EM) framework. It would be appreciated that the method is not limited to the EM framework; belief propagation or ordinal labelling methods can be used for the same purpose.
In one example, the modelling of a worker's parameters is performed using the EM framework. Let the workers be indexed by j ∈ W = {1, 2, …, M} and the tasks be indexed by i ∈ T = {1, 2, …, N}. The label l_ij is provided by the j-th worker for the i-th task. Moreover, each task i has an underlying target label (to be estimated) z_i ∈ S_i. The set S_i ⊆ (Z⁺ × Z⁺ × O_i) ∪ {NK}, where O_i represents the set of keyword identities in the option list shown to the worker for task i, NK represents the label indicating that no keyword is present in the utterance, and the integer tuple represents the boundaries of a marked keyword. It is assumed that l_ij belongs to the same set as z_i.
Further, the parameters of worker j are modelled as θ_j = {α_j, β_j}. The parameter α_j represents the probability that the worker spams the keyword identity, and β_j represents the probability of the worker's bias towards choosing the 'None of the above' option when spamming. Each worker j provides labels L^j = {l_ij}_(i ∈ T_j) for a subset T_j ⊆ T. Similarly, each task i has labels L_i = {l_ij}_(j ∈ W_i) provided by a subset of workers W_i ⊆ W.
In another embodiment, consider the case when the correct keyword is absent but the worker chooses a wrong keyword x from the list and provides its markings on the waveform as a_1 and a_2, respectively. The likelihood of such a response is the spam term α_j(1 − β_j)(1/τ²)(1/D), where τ represents the number of samples in the utterance and D represents the number of words in the list.
In another embodiment, the process computes the likelihood probability of the worker's label l_ij given the target label z_i and the parameters θ_j: the likelihood of the worker providing the true label when no keyword from the list is present in the utterance; the likelihood that the worker spams the label given that one of the keywords is present; and the likelihood that the worker attempts to provide the correct label for the keyword. It would be appreciated that in the last case the worker's markings are normally distributed around the correct markings, and the model also accounts for the possibility of the worker spamming the label with an additional term α_j(1 − β_j)(1/τ²)(1/D). It would also be appreciated that S can also be modelled in terms of a worker-language pair S_jk, where j is the worker index and k is the language index, and the corresponding likelihoods can be computed similarly. One way to write out these cases is sketched below.
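A minimal sketch of these likelihood cases, assembled from the definitions above and assuming a Gaussian model with standard deviation σ for the boundary markings (σ is not named in this description and is an assumption here), is:

```latex
% Illustrative reconstruction of the likelihood cases (sigma is an assumed
% marking-noise standard deviation; NK denotes the 'None of the above' label).
\begin{align*}
P\bigl(l_{ij}=\mathrm{NK}\mid z_i=\mathrm{NK},\theta_j\bigr)
  &= (1-\alpha_j) + \alpha_j\beta_j \\
P\bigl(l_{ij}=(a_1,a_2,x)\mid z_i=\mathrm{NK},\theta_j\bigr)
  &= \alpha_j(1-\beta_j)\,\tfrac{1}{\tau^2}\,\tfrac{1}{D} \\
P\bigl(l_{ij}=\mathrm{NK}\mid z_i=(b_1,b_2,k),\theta_j\bigr)
  &= \alpha_j\beta_j \\
P\bigl(l_{ij}=(a_1,a_2,k)\mid z_i=(b_1,b_2,k),\theta_j\bigr)
  &= (1-\alpha_j)\,\mathcal{N}(a_1;b_1,\sigma^2)\,\mathcal{N}(a_2;b_2,\sigma^2)
     + \alpha_j(1-\beta_j)\,\tfrac{1}{\tau^2}\,\tfrac{1}{D}
\end{align*}
```

Under this reading, the spam probability mass is spread uniformly over the D candidate keywords and the τ² possible boundary pairs, which matches the additional spam term quoted above.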
At the last step 210, the process analyses the performance of each worker from the aggregated responses to localize the keyword. Further, the process estimates the reputation, indicated by the parameters, of each worker of the mismatched crowd using the expectation-maximization framework. The reputation is modelled in terms of the worker's accuracy and bias towards rejecting the existence of a keyword.
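A simplified sketch of such an EM aggregation is given below. It considers only the keyword-identity part of each label (boundary markings are omitted), assumes a uniform prior over the candidate labels, and fits each worker's (α_j, β_j) with a coarse grid search in the M-step; all function and variable names are illustrative assumptions, not the filed implementation.

```python
# Illustrative sketch only: EM aggregation over keyword identities under the worker model
# described above. alpha_j = probability worker j spams; beta_j = bias towards choosing
# 'None of the above' (encoded as the label "NK") when spamming.
import numpy as np

NK = "NK"  # 'None of the above' label

def response_likelihood(label, z, alpha, beta, D):
    """P(worker label | true label z, worker parameters), per the model in the text."""
    spam = alpha * (beta if label == NK else (1.0 - beta) / D)
    honest = (1.0 - alpha) if label == z else 0.0
    return honest + spam

def em_aggregate(labels, options, n_workers, n_iter=50):
    """labels: list of (task_id, worker_id, label); options: dict task_id -> keyword list."""
    tasks = sorted(options)
    alpha = np.full(n_workers, 0.3)   # initial spam probabilities
    beta = np.full(n_workers, 0.5)    # initial 'None of the above' bias
    grid = np.linspace(0.01, 0.99, 25)
    posteriors = {}

    for _ in range(n_iter):
        # E-step: posterior over the true label of every task, assuming a uniform prior.
        posteriors = {}
        for i in tasks:
            support = options[i] + [NK]
            logp = np.zeros(len(support))
            for (ti, j, l) in labels:
                if ti != i:
                    continue
                for s, z in enumerate(support):
                    logp[s] += np.log(response_likelihood(l, z, alpha[j], beta[j],
                                                          len(options[i])) + 1e-12)
            p = np.exp(logp - logp.max())
            posteriors[i] = dict(zip(support, p / p.sum()))

        # M-step: per-worker grid search maximizing the expected log-likelihood.
        for j in range(n_workers):
            mine = [(ti, l) for (ti, w, l) in labels if w == j]
            if not mine:
                continue
            best, best_val = (alpha[j], beta[j]), -np.inf
            for a in grid:
                for b in grid:
                    val = sum(q * np.log(response_likelihood(l, z, a, b,
                                                             len(options[ti])) + 1e-12)
                              for (ti, l) in mine
                              for z, q in posteriors[ti].items())
                    if val > best_val:
                        best, best_val = (a, b), val
            alpha[j], beta[j] = best

    # Aggregated ground-truth estimate: the most probable label per task.
    truth = {i: max(posteriors[i], key=posteriors[i].get) for i in tasks}
    return truth, alpha, beta
```

A call such as em_aggregate(labels, options, n_workers=M) returns the estimated ground-truth label for every task together with the per-worker parameters that serve as reputation estimates.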
Referring to fig. 2, a system 200 for localizing a keyword in spoken utterances using a mismatched crowd is illustrated. The system 200 comprises a processor 202, a memory 204 communicatively coupled to the processor 202, a plurality of crowd interfaces 206, a predefined dictionary 208, a task allocator module 210, and a task response collector module 212. The processor is configured to create an audio annotation task comprising a list of spoken utterances and a list of keywords selected from the predefined dictionary.
Referring to fig. 3, a schematic diagram for generating annotated audio from audio recordings or speech files is shown. The audio recordings are usually of arbitrary sizes and may have durations ranging from seconds to hours. Longer-duration audio may bring distraction and stress to the worker. Therefore, the process applies a voice activity detection algorithm on the audio recordings and creates segments of shorter duration, because shorter audio clips can be annotated well by a mismatched crowd worker.
Further, the workers are asked to listen to the list of spoken utterances through a plurality of crowd interfaces, and to choose a word from the given list and mark it on the waveform, using the task allocator module of the system. The worker is asked to follow these steps:
Choose a keyword from the list of given options, among which one option allows the worker to choose 'None of the above'; and
If the worker chooses one of the words, then the worker is required to mark its boundaries on the waveform of the utterance. Hence the worker response contains the keyword identity and its marking. These worker responses are accumulated at the task response collector module.
Referring to fig. 4, a schematic diagram for generating tasks and allocating them to workers is shown. The system samples a certain number of audio chunks. The sampling strategy can be based on task confidence, on a threshold on the number of labels obtained, or on any other criterion; one possible strategy is sketched after this paragraph. Further, the processor is configured to transliterate the list of keywords from the spoken script into the script of each worker of the mismatched crowd. The mismatched crowd includes a plurality of workers who are unfamiliar with the source language being spoken. The script of the plurality of workers can differ from worker to worker even for the same source-language word. The system uses either publicly available utilities or a built-in transliteration engine to achieve this. Furthermore, the processor is configured to present the audio annotation task comprising the list of spoken utterances and the transliterated list of keywords to one or more workers of the mismatched crowd to identify the keywords from the list of spoken utterances.
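One possible sampling strategy for the task allocator, sketched under the assumption that per-chunk label counts and aggregation confidences are tracked, is:

```python
# Illustrative sketch only: prioritize audio chunks that still have few labels or low
# aggregation confidence. The specification leaves the sampling strategy open; the
# function name, thresholds and batch size are assumptions.
def select_chunks(chunks, label_counts, confidences, batch_size=50,
                  min_labels=3, confidence_target=0.9):
    """Pick chunks that still need annotation, least-confident and least-labelled first."""
    pending = [
        c for c in chunks
        if label_counts.get(c, 0) < min_labels
        or confidences.get(c, 0.0) < confidence_target
    ]
    pending.sort(key=lambda c: (confidences.get(c, 0.0), label_counts.get(c, 0)))
    return pending[:batch_size]
```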
In the preferred embodiment, the processor is configured to localize the keyword from the list of spoken utterances and to aggregate the one or more responses localized by each worker of the mismatched crowd, wherein the one or more responses are collected at the task response collector module of the system. Further, the one or more responses are used to generate a ground-truth label using the expectation-maximization framework, and the performance of each worker is analysed from the aggregated one or more responses to localize the keyword in the spoken utterances.
Referring to fig. 5, an example reports initial results, in a table, in terms of average accuracies for four languages, namely Arabic, German, Hindi and Russian, on a small exemplary dataset of the disclosure. It would be appreciated that the proposed system and method can be used for any number of languages and for more data. The term KW represents the case where the keyword is present in the utterance, and NKW refers to the case where the keyword is not present. The numbers in the table represent percent accuracies. The performance of the disclosure is compared with the basic majority voting (MV) method. It may be observed that the overall performance of the proposed method and system is far better than the baseline. Since the crowd is not familiar with the spoken language, the workers may be biased towards choosing 'None of the above', i.e. rejecting the presence of any keyword from the list. Therefore, the baseline MV method achieves higher accuracy in the NKW case, whereas its accuracy in the KW case is significantly lower. The proposed system and method is able to capture such bias explicitly and improve the KW accuracy significantly, with only a slight drop in NKW accuracy.
Referring to fig. 6, it is shown that the method and system can effectively capture the spammers when using the EM framework. It would be appreciated that the method and system can compute the parameters of the workers (α and β, indicating the spam probability and the bias probability). These parameters can be used to indicate the reputation of a worker: parameter values towards 1 (one) could be used to declare the worker a spammer. Mistakes in the case of mismatched work can occur due to deliberate spam or due to a genuine inability to comprehend the speech. One can define thresholds for acceptable behaviour, or follow a non-parametric clustering approach, to decide what poor or acceptable behaviour is, as sketched below. The visualizations in the figures indicate that the chosen threshold can delineate these two categories of workers. A colour coding has been chosen to demarcate spammers and non-spammers; however, this is clearly not entirely in agreement with the ground truth. All the dark markings (star points) above the thresholds (0.8 dashed lines) indicate misclassified workers, while bubble points represent spammers (or non-spammer annotators) according to the estimates of the proposed method. Low-reputation workers can then be filtered out from further annotation tasks.
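A minimal sketch of such threshold-based filtering, reusing the (α, β) estimates from the EM aggregation sketched earlier, follows; the 0.8 threshold mirrors the dashed line in fig. 6 and is a tunable assumption.

```python
# Illustrative sketch only: keep workers whose estimated spam probability (alpha) and
# 'None of the above' bias (beta) both stay below a chosen reputation threshold.
def trusted_workers(alpha, beta, spam_thresh=0.8, bias_thresh=0.8):
    """Return indices of workers considered reliable enough for further annotation tasks."""
    return [j for j in range(len(alpha))
            if alpha[j] < spam_thresh and beta[j] < bias_thresh]
```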
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
A method and system for localizing a keyword in spoken utterances using a mismatched crowd is provided. The method comprises one or more steps of: creating an audio annotation task comprising a list of spoken utterances and a list of keywords selected from a predefined dictionary; transliterating the list of keywords from the spoken script into the script of each worker of the mismatched crowd; presenting the audio annotation task comprising the list of spoken utterances and the transliterated list of keywords to one or more workers of the mismatched crowd to identify the keywords in the list of spoken utterances; localizing the keywords within the list of spoken utterances; aggregating one or more responses localized by each worker of the mismatched crowd to generate a ground-truth label using an expectation-maximization framework; and analyzing the performance of each worker from the aggregated one or more responses to localize the keyword in the spoken utterances.
The embodiments of the present disclosure address the unresolved problem of spotting keywords of low-resourced languages with the help of a mismatched crowd. The use of a mismatched crowd is motivated by the fact that human annotations of a low-resourced language cannot be easily obtained online, as a crowd speaking that language may not be readily available. Furthermore, a low-resourced setting may also not allow building an automatic keyword spotting engine without ample annotated data.
It is, however to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
A representative hardware environment for practicing the embodiments may include a hardware configuration of an information handling/computer system in accordance with the embodiments herein. The system herein comprises at least one processor or central processing unit (CPU). The CPUs are interconnected via system bus to various devices such as a random access memory (RAM), read-only memory (ROM), and an input/output (I/O) adapter. The I/O adapter can connect to peripheral devices, such as disk units and tape drives, or other program storage devices that are readable by the system. The system can read the inventive instructions on the program storage devices and follow these instructions to execute the methodology of the embodiments herein.
The system further includes a user interface adapter that connects a keyboard, mouse, speaker, microphone, and/or other user interface devices such as a touch screen device (not shown) to the bus to gather user input. Additionally, a communication adapter connects the bus to a data processing network, and a display adapter connects the bus to a display device which may be embodied as an output device such as a monitor, printer, or transmitter, for example.
The preceding description has been presented with reference to various embodiments. Persons having ordinary skill in the art and technology to which this application pertains will appreciate that alterations and changes in the described structures and methods of operation can be practiced without meaningfully departing from the principle, spirit and scope.