Abstract: The present invention relates to a device, a system and a method for the collection of labelled speech data using adaptive thresholding. The device 106 includes a processor 110 configured to: receive (302) a plurality of spoken utterances of a user during a period of use of the speech data collection device; determine variations in the plurality of spoken utterances for labeling each of the plurality of spoken utterances based on the variations in the plurality of spoken utterances and the presence of noise in the plurality of spoken utterances; and store, upon filtering the noise, each of the plurality of labelled spoken utterances associated with each user while maintaining the natural fluency of the plurality of spoken utterances associated with each user.
Claims:1. A speech data collection device (106) for collecting speech data, the device comprising:
a processor (110) configured to:
receive (302) a plurality of spoken utterances (104) of a user during a period of use of the speech data collection device;
determine (304) variations in the plurality of spoken utterances for labeling each of the plurality of spoken utterances based on the variations in the plurality of spoken utterances and the presence of noise in the plurality of spoken utterances; and
store (306), upon filtering the noise, each of the plurality of labelled spoken utterances associated with each user while maintaining the natural fluency of the plurality of spoken utterances associated with each user.
2. The speech data collection device as claimed in claim 1, wherein the device (106) further comprises a receiving port (108) to receive the plurality of spoken utterances of the user.
3. The speech data collection device as claimed in claim 1, wherein labeling each of the plurality of spoken utterances is performed using an adaptive thresholding technique.
4. The speech data collection device as claimed in claim 3, wherein the processor (110) is configured to calculate an initial threshold automatically based on the noise present in each of the plurality of the labelled spoken utterances.
5. The speech data collection device as claimed in claim 1, wherein the processor (110) is configured to detect a start of spoken utterances and an end of spoken utterances in the plurality of spoken utterances.
6. The speech data collection device as claimed in claim 1, wherein the processor (110) is configured to identify at least a speech and a silence present inside each of the plurality of spoken utterances, label spoken utterances having a presence of speech from the plurality of spoken utterances, and store the labelled spoken utterances having speech.
7. The speech data collection device as claimed in claim 1, wherein each of the plurality of the labelled spoken utterances is stored as audio chunks that are subsequently combined to form an audio file.
8. The speech data collection device as claimed in claim 1, wherein the speech data collection device is an audio processor.
9. A speech data collection system (100) for collecting speech data, the system comprising:
a receiving port (108) to receive a plurality of spoken utterances of a user; and
a speech data collection device (106) as claimed in claim 1.
10. A method (300) for collecting speech data, the method comprising:
receiving (302), by a receiving port (108) of a speech data collection device (106), a plurality of spoken utterances (104) of a user during a period of use of the speech data collection device;
determining (304), by a processor (110) of the speech data collection device (106), variations in the plurality of spoken utterances for labeling each of the plurality of spoken utterances based on the variations in the plurality of spoken utterances and the presence of noise in the plurality of spoken utterances; and
storing (306), upon filtering the noise, by the processor (110), each of the plurality of labelled spoken utterances associated with each user while maintaining the natural fluency of the plurality of spoken utterances associated with each user.
Description: TECHNICAL FIELD
[001] The present disclosure relates to the field of speech and speaker recognition technologies. More particularly, the present invention relates to a device, a system and a method for the collection of labelled speech data using adaptive thresholding.
BACKGROUND
[002] Background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
[003] There have been huge advancements in speech and speaker recognition technologies, with innumerable speech-to-text engines available to users worldwide. All these engines require speech data to train the machine learning model. In speech recognition use cases, several issues are faced in speech data collection and cleaning. The available data collection systems are merely audio recorders which are invoked on pressing the start/play button and record whatever is spoken or not spoken (including silence/noise) for a pre-specified time. These recorded files then need data cleaning and pre-processing to remove the underlying noise/silence based on a pre-defined fixed noise/silence threshold, which does not consider the changing background noise level while processing the data. Moreover, after data cleaning, the audio data needs to be labelled manually.
[004] The most commonly used data collection systems nowadays are Mozilla Common Voice and the Google API. Mozilla Common Voice is an open-source data collection system where people can contribute by giving their voice samples. Here, the user has to record his/her voice by clicking a button, and the voice recorder runs for a specified fixed time. This data is then manually verified by at least two other members, after which an audio processing task has to be performed on the verified recorded data.
[005] The Google API, on the other hand, collects speech and the corresponding subtitles using OCR over video frames. The subtitles appearing on the video screen are already an output of an AI model. Here again, a processing task has to be performed after collecting the data.
[006] A prior-art document, US7464029B2, provides a method for improving a speech signal using a voice activity detector, by using the energy levels of consecutive sound signals. Another prior-art document, US9772993B2, provides that utterances are saved for NER systems based on the parameters specified by the user for each token, using a calibration system, and the end of the utterance needs to be triggered by the user. Yet another prior-art document, US9117460B2, relates to speech recognition systems; it merely detects the end of an utterance in such systems. Still another prior-art document, US7756709B2, provides a speech discriminator method that classifies an audio window (which has to be configured manually) into speech/silence/noise using a state machine.
[007] However, the above-recited and conventionally available techniques most commonly collect the speech data using audio recorders, which merely provide the feature of starting to record data (using in-built sound libraries) on invocation, for a specified time duration. Also, there was no provision for capturing only the speech utterances. Once the recording was invoked, the complete sound (noise and speech utterances) for the defined duration was saved. This increased the manual effort of listening to all the saved files and separating the usable speech utterances from the noise data.
[008] The drawbacks of earlier sound recording systems were, to an extent, addressed by sound capturing systems. Sound capturing systems capture the speech utterances from the live audio stream, using the loudness value of the sound and noise. They depend upon a pre-decided loudness threshold and a defined time frame for capturing the speech utterances. The advantage of these systems lies in the fact that they auto-detect the speech utterances before saving them for the next n seconds. Their drawback, however, is that, though the recording auto-started on a speech utterance, it did not auto-end when there was no utterance. Also, the threshold value is fixed for such systems, which leads to less accurate data collection.
[009] Therefore, there exists a need to provide an improved and efficient device, system and method for audio processing wherein the collection of labelled speech data can be done in real-time even before writing digital audio data to a file without requiring any Artificial Intelligence (AI) model/s for audio processing.
[0010] As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
[0011] In some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.
[0012] The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g. “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the invention.
SUMMARY
[0013] The present disclosure relates to the field of speech and speaker recognition technologies. More particularly, the present invention relates to a device, a system and a method for the collection of labelled speech data using adaptive thresholding.
[0014] To deal with the above issues, an improved and efficient device, system and method for audio processing are provided, wherein the collection of labelled speech data is done in real time, even before writing digital audio data to a file, without requiring any artificial intelligence (AI) model/s for audio processing.
[0015] A device, a system and a method for the collection of labelled speech data using adaptive thresholding are provided. Adaptive thresholding helps take the variations in sound and noise into consideration effectively. This technique has a number of inherent features which are essential for the collection of proper speech data, as required to train a machine learning model and to run it in a real-world scenario.
[0016] In an aspect, as an ideal speech data collection system, the device, system and method of the present invention capture the human speech utterances and filter out the silence/noise. The adaptive approach is used since the environmental noise in real time keeps changing, and hence it becomes necessary to take this change into consideration while collecting speech data. This makes the process of data collection very close to the real-time scenario in machine learning and reduces the manual effort for data cleaning and pre-processing of speech data for training the machine learning model.
[0017] As compared to the conventional techniques available for audio processing, speech recognition and audio data collection, the present invention is a speech data collection system wherein only the speech utterances are captured and saved using adaptive thresholding. The system allows the user to speak according to his/her own fluency; it captures only the speech utterances, filters out the silence/noise and saves the utterances in a labelled fashion. The system enables the detection of the start and end of speech utterances in an audio stream. It identifies speech/silence/noise using adaptive thresholding and saves only the speech utterances in a labelled fashion.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The accompanying drawings are included to provide a further understanding of the present disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
[0019] In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.
[0020] FIG. 1 illustrates an exemplary block diagram for a labelled speech data collection system according to some implementations.
[0021] FIG. 2 illustrates an exemplary audio data in the data collection unit, in accordance with an exemplary embodiment of the present disclosure.
[0022] FIG. 3 illustrates an exemplary method of working of the labelled speech data collection system, in accordance with an exemplary embodiment of the present disclosure.
DETAILED DESCRIPTION
[0023] The following detailed description is made with reference to the technology disclosed. Preferred implementations are described to illustrate the technology disclosed, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art will recognize a variety of equivalent variations on the description.
[0024] Examples of systems, apparatus, computer-readable storage media, and methods according to the disclosed implementations are described in this section. These examples are being provided solely to add context and aid in the understanding of the disclosed implementations. It will thus be apparent to one skilled in the art that the disclosed implementations may be practiced without some or all of the specific details provided. In other instances, certain process or method operations, also referred to herein as "blocks," have not been described in detail in order to avoid unnecessarily obscuring the disclosed implementations. Other implementations and applications are also possible, and as such, the following examples should not be taken as definitive or limiting either in scope or setting.
[0025] In the following detailed description, references are made to the accompanying drawings, which form a part of the description and in which are shown, by way of illustration, specific implementations. Although these disclosed implementations are described in sufficient detail to enable one skilled in the art to practice the implementations, it is to be understood that these examples are not limiting, such that other implementations may be used and changes may be made to the disclosed implementations without departing from their spirit and scope. For example, the blocks of the methods shown and described herein are not necessarily performed in the order indicated in some other implementations. Additionally, in some other implementations, the disclosed methods may include more or fewer blocks than are described. As another example, some blocks described herein as separate blocks may be combined in some other implementations. Conversely, what may be described herein as a single block may be implemented in multiple blocks in some other implementations. Additionally, the conjunction “or” is intended herein in the inclusive sense where appropriate unless otherwise indicated; that is, the phrase “A, B or C” is intended to include the possibilities of “A,” “B,” “C,” “A and B,” “B and C,” “A and C” and “A, B and C.”
[0026] Some implementations described and referenced herein are directed to systems, apparatus, computer-implemented methods and computer-readable storage media for the collection of labelled speech data using adaptive thresholding.
[0027] Thus, for example, it will be appreciated by those of ordinary skill in the art that the diagrams, schematics, illustrations, and the like represent conceptual views or processes illustrating systems and methods embodying this disclosure. The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing associated software. Similarly, any electronic code generator shown in the figures is conceptual only. Its function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the entity implementing this disclosure. Those of ordinary skill in the art further understand that the exemplary hardware, software, processes, methods, and/or operating systems described herein are for illustrative purposes and, thus, are not intended to be limited to any particular named implementation.
[0028] Various terms as used herein are shown below. To the extent a term used in a claim is not defined below, it should be given the broadest definition persons in the pertinent art have given that term as reflected in printed publications and issued patents at the time of filing.
[0029] Although the present disclosure has been described with the purpose of device, system and method for collection of labelled speech data using adaptive thresholding, it should be appreciated that the same has been done merely to illustrate the invention in an exemplary manner and any other purpose or function for which the explained structure or configuration can be used, is covered within the scope of the present disclosure.
[0030] FIG. 1 illustrates an exemplary block diagram for a labelled speech data collection system according to some implementations.
[0031] In an exemplary embodiment, FIG. 1 shows the block diagram for a labelled data collection system (100) using adaptive thresholding. Audio data (104) is received from the speaker through a microphone (102) at a sampling frequency of 16 kHz. This audio data is received by the device (106) (system 100), wherein the data collection unit (112) captures the speech utterances, and finally the labelled speech utterances are saved as files in a storage unit (114).
[0032] In an embodiment, a speech data collection device (106) for collecting speech data is provided. The speech data collection device (106) includes a processor (110) that is configured to receive a plurality of spoken utterances of a user during a period of use of the speech data collection device; determine variations in the plurality of spoken utterances for labeling each of the plurality of spoken utterances based on the variations in the plurality of spoken utterances and the presence of noise in the plurality of spoken utterances; and store, upon filtering the noise, each of the plurality of labelled spoken utterances associated with each user while maintaining the natural fluency of the plurality of spoken utterances associated with each user.
[0033] In an exemplary embodiment, the device (106) can also include a receiving port (108) to receive the plurality of spoken utterances of the user.
[0034] In an exemplary embodiment, labeling each of the plurality of spoken utterances is performed using an adaptive thresholding technique.
[0035] In an exemplary embodiment, the processor (110) is configured to calculate an initial threshold automatically based on the noise present in each of the plurality of the labelled spoken utterances.
[0036] In an exemplary embodiment, the processor (110) is configured to detect a start of spoken utterances and an end of spoken utterances in the plurality of spoken utterances.
[0037] In an exemplary embodiment, the processor (110) is configured to identify at least a speech and a silence present inside each of the plurality of spoken utterances, label spoken utterances having a presence of speech from the plurality of spoken utterances, and store the labelled spoken utterances having speech.
[0038] In an exemplary embodiment, each of the plurality of the labelled spoken utterances is stored as audio chunks that are subsequently combined to form an audio file.
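By way of a non-limiting illustration, the combination of stored audio chunks into a single audio file may be sketched in Python using the standard `wave` module. The function name, file name and chunk contents below are illustrative assumptions, not part of the claimed embodiments:

```python
import wave

def chunks_to_wav(chunks, path, rate=16000, sample_width=2):
    """Concatenate raw PCM audio chunks and write them out as one WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)             # mono microphone input
        wf.setsampwidth(sample_width)  # 16-bit samples
        wf.setframerate(rate)          # 16 kHz, as used in the description
        for chunk in chunks:
            wf.writeframes(chunk)

# Illustrative example: three silent chunks of 1024 16-bit samples each
chunks = [bytes(1024 * 2) for _ in range(3)]
chunks_to_wav(chunks, "utterance.wav")
```

The resulting file holds 3 × 1024 = 3072 frames, i.e. about 0.19 seconds of 16 kHz audio.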
[0039] In an exemplary embodiment, the speech data collection device (106) is an audio processor.
[0040] In another embodiment, a speech data collection system (100) for collecting speech data is provided. The speech data collection system can include a receiving port (108) to receive a plurality of spoken utterances of a user, and a speech data collection device having a processor (110) to determine variations in the plurality of spoken utterances for labeling each of the plurality of spoken utterances based on the variations in the plurality of spoken utterances and the presence of noise in the plurality of spoken utterances, and to store, upon filtering the noise, each of the plurality of labelled spoken utterances associated with each user while maintaining the natural fluency of the plurality of spoken utterances associated with each user.
[0041] In an exemplary embodiment, the speech data collection device and/or system enable the collection of labelled speech data, covering both who spoke and what was spoken. This facilitates easy labelling of classes for training the machine learning model, which can be used either for speech recognition or for speaker identification.
[0042] In an exemplary embodiment, the speech data collection device and/or system enable the detection of word utterances. The system detects when a voice sample is given, depending upon the variation in speech, and saves each word spoken irrespective of the length of the word or the duration for which the word is spoken. In other words, it neglects the noise and captures only the voice activity, which is what is required for speech data collection.
[0043] In an exemplary embodiment, the speech data collection device and/or system maintain the natural fluency of the speaker. The system maintains the flow in which the speaker gives the data, making it as close as possible to the testing/live scenario: the speaker can give data according to his/her own comfort and does not have to click the record button and immediately say the word every time. The speaker can take his/her own time to give the speech data, as only the voice samples are saved, not the silence or noise samples. Also, there is no fixed time duration within which the word has to be spoken; this helps make the process of data collection very close to the real scenario.
[0044] In an exemplary embodiment, the speech data collection device and/or system reduce the manual effort of data preparation, which previously involved listening to all the saved files and separating the useful ones from the non-useful ones (silence and noise), as here only the word utterances are saved.
[0045] In an exemplary embodiment, the speech data collection device and/or system capture labelled speech data, in contrast to the conventional techniques, among which no labelled speech data collection system is available that captures labelled speech utterances from a live audio stream.
[0046] FIG. 2 illustrates an exemplary audio data (200) in the data collection unit, in accordance with an exemplary embodiment of the present disclosure.
[0047] As shown in FIG. 2, the audio data includes mix of data, which includes:
[0048] Initial Environment Noise/Silence: The first few data chunks (min 4-8, found experimentally) are taken as the initial environment noise/silence, which is used for calculating initial noise (environmental) variation threshold nti.
[0049] Silence: The sound chunks having variation values below nti.
[0050] Overlapping Speech Utterance: The sound chunks (2-3, found experimentally) immediately before and after the speech utterance, which may be a part of the speech utterance, as these chunks also contain the initial and final parts of the words respectively.
[0051] Speech Utterance: The sound chunks having values more than double that of the noise threshold. When we have n continuous chunks having value more than double of nti, it implies a word is spoken (here minimum value of n=3).
[0052] The legend shown in FIG. 2 (top left) identifies the corresponding audio chunks. These chunks are explained above, and their actual implementation is described below.
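For illustration only, the per-chunk decision against the noise variation threshold nti may be sketched as follows. The function name and the handling of the intermediate band between nti and double of nti are assumptions of this sketch, not part of the embodiments described above:

```python
def classify_chunk(variance, nti):
    """Classify one sound chunk by its variation value, per FIG. 2:
    more than double the noise threshold nti -> speech candidate;
    below nti                                -> silence.
    The band in between is left undecided here (an assumption)."""
    if variance > 2 * nti:
        return "speech"
    if variance < nti:
        return "silence"
    return "undecided"

print(classify_chunk(25.0, 10.0))  # speech: 25 > 2 * 10
print(classify_chunk(5.0, 10.0))   # silence: 5 < 10
```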
[0053] The main aim of the present invention is to collect labelled speech data using adaptive thresholding, considering the variations in speech and noise in the data. Before starting the process, the label/transcript needs to be specified beforehand; a folder is automatically created with the specified label, and the following utterances are saved in that folder.
[0054] The steps involved in this process are as follows:
[0055] Step 1: Find the initial noise variation threshold (nti): Initially, as soon as the recording starts, the first few chunks (4-8, found experimentally) of data are silence/noise (Initial Environment Noise/Silence, as shown in FIG. 2). Each of these chunks comprises 1024 digital sound values. As the frame rate is 16000 samples/sec (standard), we have
1 chunk = 1024/16000 ≈ 0.064 second
[0056] Hence, 4-8 chunks last for less than a second. So, these initial chunks are considered to find the initial noise (environmental) variation threshold. The variance V of the noise is calculated as follows:
V = Σ(X − Xmean)² / N
Here, X is a single value of noise in a chunk, Xmean is the mean of all the values in a noise chunk, and N (N = 1024) is the total number of noise values in a chunk.
[0057] Now, after finding the variation Vi of each of these initial chunks, the mean value Vmean of all these variations is calculated. This gives the initial noise variation threshold nti:
nti = Vmean = (ΣVi) / k
where k is the number of initial chunks considered.
Note: The above assumption of 4-8 chunks being silence holds true only when the speaker starts speaking immediately after the recording is started. If that is not the case, there may be more noise chunks until the first speech utterance.
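Step 1 can be sketched minimally in Python as below. The function names are illustrative, and the example chunks are shortened for readability (a real chunk has 1024 values):

```python
def chunk_variance(samples):
    """V = sum((X - Xmean) ** 2) / N for one chunk of digital sound values."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / n

def initial_noise_threshold(initial_chunks):
    """nti: mean of the variances Vi of the first few silence/noise chunks."""
    variances = [chunk_variance(c) for c in initial_chunks]
    return sum(variances) / len(variances)

# Two tiny illustrative "chunks"; each has variance 1.0, so nti = 1.0
nti = initial_noise_threshold([[0, 2, 0, 2], [1, 3, 1, 3]])
print(nti)  # 1.0
```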
[0058] Step 2: Find word utterances: When the variation value of a sound chunk increases to more than double (found experimentally) the initial noise threshold value nti (Speech Utterance, as shown in FIG. 2), it implies something is uttered. When there is a minimum of n continuous chunks having values more than double of nti, it implies a word is spoken. In our set of data to be recorded, the value of n was found to be 3 (experimentally). The value of n can vary according to the length of the words in the data.
[0059] Step 3: Save the word utterances: After these n continuous chunks of speech, when 5 (an experimental value) continuous sound chunks having variation values below nti are found, the word utterance is saved along with 2-3 of these trailing chunks and 2-3 leading chunks (Overlapping Speech Utterances, as shown in FIG. 2), as these may contain some final and initial parts of the word respectively, which have very low amplitude. These words are saved in their labelled folders (word name and speaker) as given at the time of recording.
[0060] Step 4: Recalculating noise variation threshold ntn, adaptively:
Case 1: Next, when about 4 sound chunks falling below nti (Silence, as shown in FIG. 2) are found after saving the word utterance, the sound variation of these chunks is determined and the mean value of these variations, say ntj, is computed. The sound variation threshold keeps being updated at every step, whenever noise/silence is found, by considering the mean of all the previous variation threshold values and the current mean value. So, in this case, the adaptive noise variation threshold ntn is the mean of both thresholds:
ntn = (nti + ntj)/2
Case 2: When, after saving the word utterances, n (n = 3 in our case, as shown earlier) sound chunks exceeding double the noise variation threshold nti are directly found, the threshold is not updated and steps 2 and 3 directly continue again.
[0061] Step 5: This process continues till the defined time frame of recording: This new threshold value is now used for finding the next word utterance. When the variation value exceeds more than double of ntn, it implies a sound utterance. The threshold is again updated according to the environmental noise, and this process continues till the given time limit of recording. In this way, each word is saved according to its class and speaker, i.e., in a properly labelled manner.
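The five steps above can be sketched end-to-end as follows. This is a simplified, non-limiting illustration: the function name, the flushing of a trailing word at the end of the recording and the exact point at which the threshold is updated are assumptions of the sketch, and the overlapping chunks of Step 3 are omitted for brevity:

```python
def collect_utterances(chunks, n_init=4, n_word=3, n_end=5, n_update=4):
    """Sketch of the adaptive-threshold collection loop (Steps 1-5).

    chunks   : list of chunks, each a list of digital sound values
    n_init   : initial chunks used to seed the noise threshold (Step 1)
    n_word   : continuous loud chunks needed to declare a word (Step 2)
    n_end    : continuous quiet chunks that end a word (Step 3)
    n_update : quiet chunks averaged to adapt the threshold (Step 4)
    Returns a list of utterances, each given as the indices of its chunks.
    """
    def variance(c):
        mean = sum(c) / len(c)
        return sum((x - mean) ** 2 for x in c) / len(c)

    vs = [variance(c) for c in chunks]
    nt = sum(vs[:n_init]) / n_init          # Step 1: initial threshold nti
    utterances, current, quiet, noise = [], [], 0, []
    for i in range(n_init, len(chunks)):
        if vs[i] > 2 * nt:                  # Step 2: chunk exceeds 2 * nt
            current.append(i)
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= n_end:              # Step 3: word has ended; save it
                if len(current) >= n_word:
                    utterances.append(current)
                current, quiet = [], 0
        else:
            noise.append(vs[i])             # Step 4: silence between words
            if len(noise) >= n_update:      # adapt: ntn = (nt + ntj) / 2
                nt = (nt + sum(noise) / len(noise)) / 2
                noise = []
    if len(current) >= n_word:              # Step 5 ends with the recording
        utterances.append(current)
    return utterances

# Illustrative stream: 4 quiet chunks, 3 loud chunks (a word), 5 quiet chunks
quiet, loud = [1, -1], [3, -3]              # variances 1.0 and 9.0
print(collect_utterances([quiet] * 4 + [loud] * 3 + [quiet] * 5))  # [[4, 5, 6]]
```

For the illustrative stream above, the sketch reports one word spanning chunk indices 4-6; an all-quiet stream yields no utterances at all.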
[0062] FIG. 3 illustrates an exemplary method of working of the labelled speech data collection system, in accordance with an exemplary embodiment of the present disclosure. The method may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
[0063] The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method or alternate methods. Additionally, individual blocks may be deleted from the method without departing from the protection scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method may be considered to be implemented in the above-described speech data collection device (106).
[0064] In an embodiment, a method 300 for collecting speech data is provided.
[0065] At step 302, a plurality of spoken utterances of a user are received by a receiving port of a speech data collection device during a period of use of the speech data collection device.
[0066] At step 304, variations in the plurality of spoken utterances are determined by a processor (110) of the speech data collection device (106) for labeling each of the plurality of spoken utterances based on the variations in the plurality of spoken utterances and the presence of noise in the plurality of spoken utterances.
[0067] At step 306, each of the plurality of labelled spoken utterances associated with each user is stored, upon filtering the noise, by the processor (110), while maintaining the natural fluency of the plurality of spoken utterances associated with each user.
[0068] In an exemplary embodiment, the system helps to easily save the data in a labelled manner, i.e., who spoke and what was spoken, which is helpful for training the model.
[0069] In an exemplary embodiment, the system captures only the speech utterances and not the silence and noise, so one can speak with natural flow, as it’s not necessary to speak as soon as the record button is clicked.
[0070] In an exemplary embodiment, the system is very close to the live testing scenario, where the operator speaks in the natural flow.
[0071] In an exemplary embodiment, the system reduces the manual effort required for data preparation, as one does not have to listen to the sound files manually and segregate empty files/noise files from speech utterances.
[0072] In an exemplary embodiment, the automatic nature of this system helps to speed up the task of data collection.
[0073] In an exemplary embodiment, the system can hence be used for data collection in all AI based speech applications, such as speaker recognition, speech-to-text engines, etc.
[0074] In an exemplary embodiment, the system saves the labeled speech utterances during data collection from the audio stream and rejects the non-speech utterances.
[0075] In an exemplary embodiment, the system uses dynamic thresholding and calculates the initial threshold itself based on the environmental/background noise.
[0076] In an exemplary embodiment, the system is adaptive towards the varying background/environmental noise levels and the threshold keeps on updating itself according to the background noise levels.
[0077] In an exemplary embodiment, the system has the unique capability of collecting speech data from a sound stream with the inherent feature of removing silence/noise in between the words. It takes the environmental noise into consideration for finding the word utterance and is adaptive towards the noise threshold as per the environmental noise.
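One plausible way to realize the adaptive thresholding described in the embodiments above is an exponential moving average of the noise-floor estimate that is updated only on non-speech frames. The disclosure does not give the exact update rule, so the smoothing factor `alpha`, the `margin` multiplier, and the class name are assumptions made for this sketch.

```python
class AdaptiveThreshold:
    """Tracks the background-noise level and exposes a speech/noise
    decision threshold that follows slow changes in that level."""

    def __init__(self, initial_noise, margin=3.0, alpha=0.05):
        self.noise_floor = initial_noise  # running background-noise estimate
        self.margin = margin              # speech must exceed margin * noise
        self.alpha = alpha                # smoothing factor for the update

    @property
    def threshold(self):
        """Current decision threshold, derived from the noise floor."""
        return self.margin * self.noise_floor

    def update(self, frame_energy):
        """Classify one frame as speech/non-speech; adapt the noise floor
        only on non-speech frames so speech does not inflate the estimate."""
        is_speech = frame_energy > self.threshold
        if not is_speech:
            # Exponential moving average keeps the threshold tracking
            # slowly varying environmental/background noise levels.
            self.noise_floor = ((1 - self.alpha) * self.noise_floor
                                + self.alpha * frame_energy)
        return is_speech
```

Updating the floor only on frames already judged as noise is a common design choice in energy-based voice activity detection: it lets the threshold drift upward when the environment becomes noisier, while loud speech frames leave the estimate untouched.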
[0078] In an exemplary embodiment, the system is very useful in real life AI based speech applications, where a huge amount of labeled speech data is required for training purposes. It becomes a tedious task to manually remove the noise/silence from the audio data and then to label the extracted words. This system simplifies the task of data collection for AI based speech applications and hence aids in the task of data cleaning as well.
[0079] In an exemplary embodiment, the system 100 or the device 106 can include tangible computer-readable media having non-transitory instructions stored thereon/in that are executable by or used to program a server or other computing system (or collection of such servers or computing systems) to perform some of the implementation of processes described herein. For example, computer program code can implement instructions for operating and configuring the system 100 or the device 106 to intercommunicate and to process web pages, applications and other data and media content as described herein. In some implementations, the computer code can be downloadable and stored on a hard disk, but the entire program code, or portions thereof, also can be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as any type of rotating media including floppy disks, optical discs, digital versatile disks (DVD), compact disks (CD), microdrives, and magneto-optical disks, and magnetic or optical cards, nano-systems (including molecular memory ICs), or any other type of computer-readable medium or device suitable for storing instructions or data. Additionally, the entire program code, or portions thereof, may be transmitted and downloaded from a software source over a transmission medium, for example, over the Internet, or from another server, as is well known, or transmitted over any other existing network connection as is well known (for example, extranet, VPN, LAN, etc.) using any communication medium and protocols (for example, TCP/IP, HTTP, HTTPS, Ethernet, etc.) as are well known.
It will also be appreciated that computer code for the disclosed implementations can be realized in any programming language that can be executed on a server or other computing system, such as, for example, C, C++, HTML, any other markup language, Java™, JavaScript, ActiveX, any other scripting language such as VBScript, and many other well-known programming languages. (Java™ is a trademark of Sun Microsystems, Inc.)
[0080] In an exemplary embodiment, the system 100 or the device 106 may include one or more processors, an input/output (I/O) interface 108, and a memory. Each of the one or more processors may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, each of the one or more processors is configured to fetch and execute computer-readable instructions stored in the memory.
[0081] The I/O interface may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface may allow the system 100 or the device 106 to interact with a user directly or through the client/computing devices. Further, the I/O interface may enable the system 100 or the device 106 to communicate with other computing devices, such as web servers and external data servers. The I/O interface 108 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 108 may include one or more ports for connecting a number of devices to one another or to another server.
[0082] The memory may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory may include modules, routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types.
[0083] In an embodiment, the system 100 or the device 106 can be implemented in the computer system to enable aspects of the present disclosure. Embodiments of the present disclosure include various steps, which have been described above. A variety of these steps may be performed by hardware components or may be tangibly embodied on a computer-readable storage medium in the form of machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with instructions to perform these steps. Alternatively, the steps may be performed by a combination of hardware, software, and/or firmware.
[0084] The computer system includes an external storage device, a bus, a main memory, a read only memory, a mass storage device, a communication port, and a processor. A person skilled in the art will appreciate that the computer system may include more than one processor and communication ports. Examples of the processor include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on a chip processors or other future processors. The processor may include various modules associated with embodiments of the present invention. The communication port can be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. The communication port may be chosen depending on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system connects. Memory can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. Read only memory can be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for storing static information, e.g., start-up or BIOS instructions for the processor. Mass storage may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g. those available from Seagate (e.g., the Seagate Barracuda 7200 family) or Hitachi (e.g., the Hitachi Deskstar 7K1000), one or more optical discs, Redundant Array of Independent Disks (RAID) storage, e.g.
an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc. The bus communicatively couples the processor(s) with the other memory, storage and communication blocks. The bus can be, e.g., a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems, as well as other buses, such as a front side bus (FSB), which connects the processor to the software system. Optionally, operator and administrative interfaces, e.g., a display, keyboard, and a cursor control device, may also be coupled to the bus to support direct operator interaction with the computer system. Other operator and administrative interfaces can be provided through network connections connected through the communication port. The external storage device can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc - Read Only Memory (CD-ROM), Compact Disc - Re-Writable (CD-RW), Digital Video Disk - Read Only Memory (DVD-ROM). The components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.
[0085] Although the proposed system has been elaborated as above to include all the main modules, it is completely possible that actual implementations may include only a part of the proposed modules or a combination of those or a division of those into sub-modules in various combinations across multiple devices that can be operatively coupled with each other, including in the cloud. Further the modules can be configured in any sequence to achieve objectives elaborated. Also, it can be appreciated that proposed system can be configured in a computing device or across a plurality of computing devices operatively connected with each other, wherein the computing devices can be any of a computer, a laptop, a smartphone, an Internet enabled mobile device and the like. All such modifications and embodiments are completely within the scope of the present disclosure.
[0086] As used herein, and unless the context dictates otherwise, the term "coupled to" is intended to include both direct coupling (in which two elements are coupled to or in contact with each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms "coupled to" and "coupled with" are used synonymously. Within the context of this document, the terms "coupled to" and "coupled with" are also used euphemistically to mean "communicatively coupled with" over a network, where two or more devices are able to exchange data with each other over the network, possibly via one or more intermediary devices.
[0087] Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms "comprises" and "comprising" should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C ... and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc.
[0088] While some embodiments of the present disclosure have been illustrated and described, those are completely exemplary in nature. The disclosure is not limited to the embodiments as elaborated herein only and it would be apparent to those skilled in the art that numerous modifications besides those already described are possible without departing from the inventive concepts herein. All such modifications, changes, variations, substitutions, and equivalents are completely within the scope of the present disclosure. The inventive subject matter, therefore, is not to be restricted except in the protection scope of the appended claims.
| # | Name | Date |
|---|---|---|
| 1 | 202141013764-STATEMENT OF UNDERTAKING (FORM 3) [27-03-2021(online)].pdf | 2021-03-27 |
| 2 | 202141013764-POWER OF AUTHORITY [27-03-2021(online)].pdf | 2021-03-27 |
| 3 | 202141013764-FORM 1 [27-03-2021(online)].pdf | 2021-03-27 |
| 4 | 202141013764-DRAWINGS [27-03-2021(online)].pdf | 2021-03-27 |
| 5 | 202141013764-DECLARATION OF INVENTORSHIP (FORM 5) [27-03-2021(online)].pdf | 2021-03-27 |
| 6 | 202141013764-COMPLETE SPECIFICATION [27-03-2021(online)].pdf | 2021-03-27 |
| 7 | 202141013764-Proof of Right [22-09-2021(online)].pdf | 2021-09-22 |
| 8 | 202141013764-POA [07-10-2024(online)].pdf | 2024-10-07 |
| 9 | 202141013764-FORM 13 [07-10-2024(online)].pdf | 2024-10-07 |
| 10 | 202141013764-AMENDED DOCUMENTS [07-10-2024(online)].pdf | 2024-10-07 |
| 11 | 202141013764-Response to office action [01-11-2024(online)].pdf | 2024-11-01 |
| 12 | 202141013764-FORM 18 [11-03-2025(online)].pdf | 2025-03-11 |