Claims:

We Claim:
1. A computer-implemented method for speech to text transcription, the method comprises
steps of:
building one or more audio models, wherein the one or more audio models are
specific to one or more users' accents of at least one language such that the one or more
audio models are trained by utilizing pre-stored audio data;
building one or more contextually aware neural network models, wherein the one or
more contextually aware neural network models are trained using a first set of words;
receiving, by an audio codec converter, speech data to convert the speech data into a
first code;
processing, by a pre-processing unit, the first code to convert the first code into a
second code and to load at least one of the one or more audio models and the one or
more contextually aware neural network models in an inference model loader;
invoking at least one of the one or more audio models based on one or more predetermined accents of the speech data;
generating, by the at least one invoked audio model, a first text data based on
inference from the second code; and
invoking, by a middleware, at least one of the one or more contextually aware neural
network models to generate, from the first text data, a text output and to iteratively
correct the generated text output.
2. The method as claimed in claim 1 comprises validating, using a look-up tool, the text
output by implementing a non-machine learning based validation architecture.
3. The method as claimed in claim 2 comprises determining if one or more words should be
replaced with one or more values.
4. The method as claimed in claim 2, wherein the non-machine learning based validation
architecture comprises a string matching algorithm to classify the text output.
5. The method as claimed in claim 4, wherein the string matching algorithm is selected
from at least one of Levenshtein string matching algorithm and Cosine distance based
string matching algorithm to enable replacement of keywords.
6. The method as claimed in claim 1, wherein the middleware is adapted to provide the first
text data from the audio model to the contextually aware neural network model such that
the audio model is free to process a next batch of speech data.
7. The method as claimed in claim 1, wherein the middleware is adapted to provide the first
text data from the audio model to the contextually aware neural network model such that
the audio model processes a next batch of speech data only after output text data is
generated from the contextually aware neural network.
8. The method as claimed in claim 1, wherein the middleware is adapted to maintain two
distinct queues for the audio model and the contextually aware neural network model and
wherein the two distinct queues are maintained on separate CPU threads and both the
models are triggered based on the tasks available in the queue.
9. The method as claimed in claim 1, wherein the pre-stored audio data comprises at least
one of audio data of the one or more users' accented speech data and transcription data.
10. The method as claimed in claim 1, wherein the first set of words comprises words
specific to healthcare industry.
11. The method as claimed in claim 1, wherein the first set of words comprises words
specific to hospitality industry.
12. The method as claimed in claim 1, wherein the first set of words comprises words
specific to legal industry.
13. The method as claimed in claim 1, wherein the first set of words comprises words
specific to transportation industry.
14. The method as claimed in claim 1, wherein the first set of words comprises words specific
to sales and marketing.
15. The method as claimed in claim 1, wherein the first set of words comprises words specific
to customer support industry.
16. The method as claimed in claim 1, wherein the first set of words comprises words specific
to engineering documentation industry.
17. The method as claimed in claim 1, wherein the second code is a 16-bit signed Pulse
Coded Modulation (PCM) data.
18. A deep neural network system for providing speech to text transcription, the system
comprises:
one or more audio models built and specific to one or more users' accents of at least
one language, wherein the one or more audio models are trained by utilizing
pre-stored audio data;
one or more contextually aware neural network models built, wherein the one or
more contextually aware neural network models are trained using a first set of words;
an audio codec converter configured to receive speech data to convert the speech
data into a first code;
a pre-processing unit operatively coupled to the audio codec converter and adapted
to convert the first code into a second code and to load at least one of the one or more
audio models and the one or more contextually aware neural network models in an
inference model loader,
wherein at least one of the one or more audio models are invoked based on one or more
pre-determined accents of the speech data, and
wherein the at least one invoked audio model generates a first text data based on
inference from the second code; and
a middleware configured to invoke at least one of the one or more contextually
aware neural network models to generate, from the first text data, a text output and to iteratively
correct the generated text output.
19. The system as claimed in claim 18, wherein the pre-stored audio data comprises at least
one of audio data of the one or more users' accented speech data and transcription data.
20. The system as claimed in claim 18, wherein the first set of words comprises words
specific to healthcare industry.
21. The system as claimed in claim 18, wherein the first set of words comprises words
specific to hospitality industry.
22. The system as claimed in claim 18, wherein the first set of words comprises words
specific to legal industry.
23. The system as claimed in claim 18, wherein the first set of words comprises words
specific to transportation industry.
24. The system as claimed in claim 18, wherein the first set of words comprises words
specific to sales and marketing.
25. The system as claimed in claim 18, wherein the first set of words comprises words
specific to the customer support industry.
26. The system as claimed in claim 18, wherein the first set of words comprises words
specific to engineering documentation industry.

Description:

TECHNICAL FIELD
[001] The present disclosure relates to speech recognition. More particularly, the
present disclosure relates to a system and method for providing an end-to-end speech to text
transcription that is adaptive to various languages and dialects while being localised to
various domains.
BACKGROUND
[002] Background description includes information that may be useful in
understanding the present invention. It is not an admission that any of the information
provided herein is prior art or relevant to the presently claimed invention, or that any
publication specifically or implicitly referenced is prior art.
[003] Computing devices are becoming more and more popular. Computing devices
are used in different ways, in different settings, and are available in different form factors.
For example, computing devices are used in appliances (televisions, refrigerators,
thermostats, etc.), mobile devices (smartphones, tablets, etc.), and wearable devices (smart
watches, etc.). The use of computing devices has also led to finding better ways to interface
with these devices. Interface problems are especially serious when dealing with computing
devices with limited space or limited I/O capabilities. An improved interface is always needed,
regardless of a device's existing interface capabilities. Creating an easier or more natural interface
can provide a great competitive advantage. One of the interface areas that is attracting
attention is the area of speech recognition.
[004] As is well known in the art, machines capable of responding to human
utterances, such as machines capable of following human commands and machines capable
of transmitting human utterances, have long been desired. Such machines increase the speed
and ease with which people can communicate with computers to record and organize words
and thoughts. Recent advances in computer technology and speech recognition algorithms
have led to the emergence of speech recognition machines, which are becoming more
powerful and cheaper. Advances have made it possible to bring large-scale vocabulary based
speech recognition systems to market. Such a system recognizes most of the words used in
normal everyday dictation and is therefore very suitable for automatic transcription of such
dictation.
[005] The use of speech recognition as an alternative way to input data into computers
has become increasingly popular as speech recognition algorithms become more
sophisticated and the processing power of modern computers increases. Speech recognition
systems are especially attractive to those who want to implement a different human-machine
interface to enhance certain workflows.
[006] In the past, speech/ voice recognition has been used as a way to control
computer programs. However, current voice recognition systems are usually far from foolproof,
and the likelihood of their failure to recognize a word and/or misrecognise the word
increases with the size of the system's terminology. For this reason, and to reduce the amount
of computation required for recognition, many speech recognition systems work with pre-compiled artificial grammar. Such an artificial grammar associates a different sub-vocabulary
with each of a plurality of grammar states, providing rules to determine which grammar state the
system is in at present, and the sub-vocabulary associated with the current machine state. However,
pre-compiled grammar is not suitable for normal dictation, as it does not allow users to freely
choose the words required for normal dictation. Further, adding custom words for a
particular domain is a challenge in these systems; for example, adding various medical
words in the health-tech domain is difficult.
[007] Further, speech recognition and conversion to text is achieved by ASR
(Automatic Speech Recognition) technology. Automatic speech recognition is an important
technology that can be used on mobile and other devices. In general, speech recognition
automatically seeks to provide an accurate transcript of what is being said. Speech
recognition systems often use one or more models for speech transcription. For example,
acoustic models can be used to identify the sound of what is happening during a speech. The
language model can be used to determine what words or word orders are most important
given the identified sounds.
[008] Efforts have been made in the related art to provide systems, methods or
techniques for providing speech to text transcription using neural networks.
[009] For example, the U.S. Patent reference 10,319,374 discloses a system and
method to recognize speech of vastly different languages, such as English or Mandarin
Chinese. In embodiments, the entire pipelines of hand-engineered components are replaced
with neural networks, and the end-to-end learning allows handling of a diverse variety of
speech including noisy environments, accents, and different languages. Using a trained
embodiment and an embodiment of a batch dispatch technique with GPUs in a data center, an
end-to-end deep learning system can be inexpensively deployed in an online setting,
delivering low latency when serving users at scale.
[0010] The U.S. patent reference 10,860,685 discloses a method to obtain input
acoustic sequence that represents one or more utterances. The input acoustic sequences are
processed using a speech recognition model to generate a transcription of the input acoustic
sequence, and the speech recognition model includes a domain-specific language model. The
transcription of the input acoustic sequence is provided as input to a domain-specific
predictive model to generate structured text content that is derived from the transcription of
the input acoustic sequence. However, these patent references do not disclose the utilization of
large amounts of audio/voice data of various accents in building speech recognition models in
combination with the utilization of a dictionary of keywords of specific domains.
[0011] Some of the related art references including above-mentioned references
implement machine-learning models in speech to text transcription systems. Most of the
machine learning models work on a classic workflow where a large amount of initial voice
data (hundreds of hours) is collected to train the models. The collection of this large amount of
voice data poses a challenge, as voice data needs to be extracted from different databases and
sources particular to a domain. Some of the traditional speech to text systems do not follow a
hyper-local approach; for example, the Google speech to text system, Amazon speech to text
transcription system, Rev.AI speech to text transcription system, Microsoft speech to text
transcription system, and other generic VoIP based speech to text transcription systems are
practically applicable for generic conversation-style dictation, but they are not accurate at
transcribing domain-specific and accent-specific words and terms while dictating. In order to
build a hyper-local model that is aligned with a particular accent, many of these existing
systems will need localised accent-specific data. Adding this dual constraint on the data greatly
reduces the data acquisition possibilities, which may result in a product that is not very
accurate for a particular accent in a particular domain.
[0012] Whereas there is certainly nothing wrong with traditional systems or methods,
nonetheless, there is a need in the art to provide an efficient, scalable, modular, cost-effective,
fast and accurate system and method for building speech to text neural network models with
minimal audio/voice data requirements, and for providing speech to text transcription using
deep neural network models. In addition, there is a need to adapt the generic speech to text
transcription system for various domains, languages, dialects and accents where the accent
of an end user is pre-determined for Neural Network models.
[0013] All publications herein are incorporated by reference to the same extent as if
each individual publication or patent application were specifically and individually indicated
to be incorporated by reference. Where a definition or use of a term in an incorporated
reference is inconsistent or contrary to the definition of that term provided herein, the
definition of that term provided herein applies and the definition of that term in the reference
does not apply.
[0014] The recitation of ranges of values herein is merely intended to serve as a
shorthand method of referring individually to each separate value falling within the range.
Unless otherwise indicated herein, each individual value is incorporated into the specification
as if it were individually recited herein. All methods described herein can be performed in
any suitable order unless otherwise indicated herein or otherwise clearly contradicted by
context. The use of any and all examples, or exemplary language (e.g. “such as”) provided
with respect to certain embodiments herein is intended merely to better illuminate the
invention and does not pose a limitation on the scope of the invention otherwise claimed. No
language in the specification should be construed as indicating any non-claimed element
essential to the practice of the invention.
[0015] Groupings of alternative elements or embodiments of the invention disclosed
herein are not to be construed as limitations. Each group member can be referred to and
claimed individually or in any combination with other members of the group or other
elements found herein. One or more members of a group can be included in, or deleted from,
a group for reasons of convenience and/or patentability. When any such inclusion or deletion
occurs, the specification is herein deemed to contain the group as modified thus fulfilling the
written description of all groups used in the appended claims.
OBJECTS OF THE PRESENT DISCLOSURE
[0016] Some of the objects of the present disclosure, which at least one embodiment
herein satisfies are as listed herein below.
[0017] It is an object of the present disclosure to provide an end-to-end speech to text
transcription system that is scalable across multiple domains.
[0018] It is another object of the present disclosure to provide an accurate, robust
speech to text transcription system for human machine interactions.
[0019] It is another object of the present disclosure to provide a computer-implemented method to build a modular speech to text transcription system trained on
different accents and dialects.
[0020] It is another object of the present disclosure to provide the speech to text
transcription system that offers plug-and-play model for various accents and corresponding
domains.
[0021] It is another object of the present disclosure to provide the speech to text
transcription system that utilizes low amounts of audio data required for training audio
models for the first time.
[0022] It is another object of the present disclosure to provide a highly trainable
system to learn newly spoken words or phrases.
[0023] It is another object of the present disclosure to provide a scalable system for
high concurrent usage of machine learning models.
SUMMARY
[0024] The present disclosure relates to speech recognition. More particularly, the
present disclosure relates to a system and method for providing an end-to-end speech to text
transcription that is adaptive to various languages and dialects localised to various domains.
[0025] This summary is provided to introduce simplified concepts of a system and method for
speech to text transcription, which are further described below in the detailed
description. This summary is not intended to identify key or essential features of the claimed
subject matter, nor is it intended for use in determining/limiting the scope of the claimed
subject matter.
[0026] An aspect of the present disclosure pertains to a method for speech to text
transcription. The method includes steps of: building one or more audio models, wherein the
one or more audio models are specific to one or more users' accents of at least one language
such that the one or more audio models are trained by utilizing pre-stored audio data; and
building one or more contextually aware neural network models, wherein the one or more
contextually aware neural network models are trained using a first set of words. The pre-stored audio data can include audio data of users' accents and transcription data. The first set
of words can include words specific to the healthcare industry, hospitality industry, legal
industry, transportation industry, marketing industry and customer support industry. The
method further includes steps of: receiving, by an audio codec converter, speech data to
convert the speech data into a first code; processing, by a pre-processing unit, the first code to
convert the first code into a second code and to load at least one of the one or more audio
models and the one or more contextually aware neural network models in an inference model
loader; invoking at least one of the one or more audio models based on one or more
predetermined accents of the speech data; generating, by the at least one invoked audio
model, a first text data based on inference from the second code; and invoking, by a
middleware, at least one of the one or more contextually aware neural network models to
generate, from the first text data, a text output and to iteratively correct the generated text
output based on the domain of various input parameters like voice dictation, audio file
transcription and/or machine dictation.
[0027] In an aspect, the method includes steps of validating, using a look-up tool, the
text output by implementing a non-machine learning based validation architecture; and
determining if one or more words should be replaced with one or more values. The non-machine learning based validation architecture can include a string-matching algorithm to
classify the text output. The string matching algorithm can be selected from Levenshtein string
matching algorithm and Cosine distance based string matching algorithm to enable
replacement of keywords.
[0028] In an aspect, the middleware can be adapted to provide the first text data from
the audio model to the contextually aware neural network model such that the audio model is
free to process a next batch of speech data. In another aspect, the middleware is adapted to
provide the first text data from the audio model to the contextually aware neural network
model such that the audio model processes a next batch of speech data after output text data
is generated. In another aspect, the middleware is adapted to maintain two distinct queues for
the audio model and the contextually aware neural network model, wherein the two distinct
queues are maintained on separate CPU threads and both the models are triggered on separate
Graphic Processing Unit threads.
[0029] In an aspect, the method can include determining, by a time coding and voting
inter-process communication unit, if one or more words should be replaced with one or more
values.
[0030] In an aspect, the second code can be at least one of, including but not limited
to, 16-bit signed Pulse Coded Modulation (PCM) data and 32-bit PCM data.
[0031] Another aspect of the present disclosure pertains to a deep neural network
system for providing speech to text transcription. The system includes one or more audio
models, one or more contextually aware neural network models, an audio codec converter, a
pre-processing unit operatively coupled to the audio codec converter, and a middleware. The
one or more audio models are built and specific to one or more users' accents of at least
language. The one or more audio models are trained by utilizing pre-stored audio data. The
pre-stored audio data can include audio data of users' accents and transcription data. The one
or more contextually aware neural network models are built using a dimensionality reduction
model and a memory embedding layer model and are trained using a first set of words. The
first set of words can include words specific to healthcare industry, hospitality industry, legal
industry, transportation industry, sales and marketing industry and customer support industry.
The audio codec converter is configured to receive speech data to convert the speech data
into a first code. The pre-processing unit is adapted to convert the first code into a second
code and to load the one or more audio models and the one or more contextually aware neural
network models in an inference model loader, wherein at least one of the one or more audio
models are invoked based on one or more pre-determined accents of the speech data and
wherein the at least one invoked audio model generates a first text data based on inference
from the second code. Further, the middleware is configured to invoke at least one of the one
or more contextually aware neural networks to generate, from the first text data, a text output
and to iteratively correct the generated text output.
[0032] Within the scope of this application, it is
expressly envisaged that the various aspects, embodiments, examples and alternatives set out
in the preceding paragraphs, in the claims and/or in the following description and drawings,
and in particular the individual features thereof, may be taken independently or in any
combination. Features described in connection with one embodiment are applicable to all
embodiments, unless such features are incompatible.
[0033] Various objects, features, aspects and advantages of the inventive subject
matter will become more apparent from the following detailed description of preferred
embodiments, along with the accompanying drawing figures in which like numerals represent
like components.
BRIEF DESCRIPTION OF THE DRAWINGS
[0034] The present invention is described in detail below with reference to the
attached drawing figures, wherein:
[0035] FIG. 1 is a block diagram of an exemplary computing environment suitable for
use in implementing the present invention.
[0036] FIG. 2 shows a block diagram illustrating high-level machine learning
workflow, according to one embodiment of the present invention.
[0037] FIG. 3 shows a training pipeline diagram, according to one embodiment of the
present invention.
[0038] FIG. 4 shows a flow chart of a method for speech to text transcription,
according to one embodiment of the present invention.
[0039] Further, skilled artisans will appreciate that elements in the figures are
illustrated for simplicity and may not necessarily have been drawn to scale.
Furthermore, in terms of the construction of the models, one or more part or element of the
model may have been represented in the figures by conventional symbols, and the figures
may show only those specific details that are pertinent to understanding the embodiments of
the present invention so as not to obscure the figures with details that will be readily apparent
to those of ordinary skill in the art having benefit of the description herein.
DETAILED DESCRIPTION
[0040] The following is a detailed description of embodiments of the disclosure
depicted in the accompanying drawings. The embodiments are in such detail as to clearly
communicate the disclosure. However, the amount of detail offered is not intended to limit
the anticipated variations of embodiments; on the contrary, the intention is to cover all
modifications, equivalents, and alternatives falling within the spirit and scope of the present
disclosure as defined by the appended claims.
[0041] In the following description, numerous specific details are set forth in order to
provide a thorough understanding of embodiments of the present invention. It will be
apparent to one skilled in the art that embodiments of the present invention may be practiced
without some of these specific details.
[0042] Embodiments of the present invention include various steps, which will be
described below. The steps may be performed by hardware components or may be embodied
in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps
may be performed by a combination of hardware, software, and firmware and/or by human
operators.
[0043] Exemplary embodiments will now be described more fully hereinafter with
reference to the accompanying drawings, in which exemplary embodiments are shown. These
exemplary embodiments are provided only for illustrative purposes and so that this disclosure
will be thorough and complete and will fully convey the scope of the invention to those of
ordinary skill in the art. The invention disclosed may, however, be embodied in many
different forms and should not be construed as limited to the embodiments set forth herein.
Various modifications will be readily apparent to persons skilled in the art. The general
principles defined herein may be applied to other embodiments and applications without
departing from the spirit and scope of the invention. Moreover, all statements herein reciting
embodiments of the invention, as well as specific examples thereof, are intended to
encompass both structural and functional equivalents thereof. Additionally, it is intended that
such equivalents include both currently known equivalents as well as equivalents developed
in the future (i.e., any elements developed that perform the same function, regardless of
structure). Also, the terminology and phraseology used is for describing exemplary
embodiments and should not be considered limiting. Thus, the present invention is to be
accorded the widest scope encompassing numerous alternatives, modifications and
equivalents consistent with the principles and features disclosed. For purpose of clarity,
details relating to technical material that is known in the technical fields related to the
invention have not been described in detail so as not to unnecessarily obscure the present
invention.
[0044] Thus, for example, it will be appreciated by those of ordinary skill in the art
that the diagrams, schematics, illustrations, and the like represent conceptual views or
processes illustrating systems and methods embodying this invention. The functions of the
various elements shown in the figures may be provided through the use of dedicated
hardware as well as hardware capable of executing associated software. Similarly, any
switches shown in the figures are conceptual only. Their function may be carried out through
the operation of program logic, through dedicated logic, through the interaction of program
control and dedicated logic, or even manually, the particular technique being selectable by
the entity implementing this invention. Those of ordinary skill in the art further understand
that the exemplary hardware, software, processes, methods, and/or operating systems
described herein are for illustrative purposes and, thus, are not intended to be limited to any
particular named element.
[0045] Systems depicted in some of the figures may be provided in various
configurations. In some embodiments, the systems may be configured as a distributed system
where one or more components of the system are distributed across one or more networks in
a cloud computing system.
[0046] Various terms as used herein are shown below. To the extent a term used in a
claim is not defined below, it should be given the broadest definition persons in the pertinent
art have given that term as reflected in printed publications and issued patents at the time of
filing.
[0047] Embodiments of the present invention facilitate the building of an accurate
voice to text transcription system using Neural Networks trained on various domains, accents
and dialects. Moreover, the system of the present invention is modular and can offer a plug-and-play type model for various accents and corresponding domains. Further, it requires a very
low amount of audio data for training models.
[0048] An aspect of the present disclosure pertains to a deep neural network system
for providing speech to text transcription. The system includes one or more audio models,
one or more contextually aware neural network models, an audio codec converter, a pre-processing unit operatively coupled to the audio codec converter, and a middleware. The one
or more audio models are built and specific to one or more users' accents of at least one
language. The one or more audio models are trained by utilizing pre-stored audio data. The
pre-stored audio data can include audio data of users' accents and transcription data. The one
or more contextually aware neural network models are built using a dimensionality reduction
model and a memory embedding layer model and are trained using a first set of words. The
first set of words can include words specific to healthcare industry, hospitality industry, legal
industry, transportation industry, sales and marketing industry and customer support industry.
The audio codec converter is configured to receive speech data to convert the speech data
into a first code. The pre-processing unit is adapted to convert the first code into a second
code and to load the one or more audio models and the one or more contextually aware neural
network models in an inference model loader, wherein at least one of the one or more audio
models are invoked based on one or more pre-determined accents of the speech data and
wherein the at least one invoked audio model generates a first text data based on inference
from the second code. Further, the middleware is configured to invoke at least one of the one
or more contextually aware neural network models to generate, from the first text data, a text output
and to iteratively correct the generated text output.
[0049] Having briefly described an overview of the present invention, an exemplary
operating environment in which various aspects of the present invention may be implemented
is described below in order to provide a general context for various aspects of the present
invention. Referring initially to FIG. 1 in particular, an exemplary operating environment for
implementing embodiments of the present invention is shown and designated generally as
computing device 100. Computing device 100 is but one example of a suitable computing
environment and is not intended to suggest any limitation as to the scope of use or
functionality of the invention. Neither should the computing device 100 be interpreted as
having any dependency or requirement relating to any one or combination of components
illustrated.
[0050] The invention may be described in the general context of computer code or
machine-useable instructions, including computer-executable instructions such as program
modules, being executed by the computer or other machine, such as a personal data assistant
or other handheld device. Generally, program modules including routines, programs, objects,
components, data structures, sub-processes, machine learning models and data pipelines etc.,
refer to code that performs particular tasks or implements particular abstract data types. The
invention may be practised in a variety of system configurations, including hand-held
devices, consumer electronics, general-purpose computers, more speciality computing
devices, etc. The invention may also be practised in distributed computing environments
where tasks are performed by remote processing devices that are linked through a
communications network.
[0051] With reference to FIG. 1, computing device 100 includes a bus 110 that
directly or indirectly couples the following devices: memory 112, one or more processors
114, one or more presentation components 116, input/output ports 118, input/output
components 120, and an illustrative power supply 122. Bus 110 represents what may be one
or more busses (such as an address bus, data bus, or combination thereof). Although the
various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality, delineating
various components is not so clear, and metaphorically, the lines would more accurately be
grey and fuzzy. For example, one may consider a presentation component such as a display
device to be an I/O component. In addition, processors have memory. We recognize that such
is the nature of the art, and reiterate that the diagram of FIG. 1 is merely illustrative of an
exemplary computing device that can be used in connection with one or more embodiments
of the present invention. Distinction is not made between such categories as “workstation,”
“server,” “laptop,” “hand-held device,” etc., as all are contemplated within the scope of FIG.
1 and reference to “computing device.”
[0052] Computing device 100 typically includes a variety of computer-readable
media. By way of example, and not limitation, computer-readable media may comprise
Random Access Memory (RAM); Read Only Memory (ROM); Electronically Erasable
Programmable Read Only Memory (EEPROM); flash memory like Non Volatile Memory
Express (NVME) solid state memory or other memory technologies; CDROM, digital
versatile disks (DVD) or other optical or holographic media; magnetic cassettes, magnetic
tape, magnetic disk storage or other magnetic storage devices, or any other storage medium
that can be used to encode and store desired information and be accessed by computing
device 100.
[0053] Memory 112 includes computer-storage media in the form of volatile and/or
non-volatile memory. The memory may be removable, non-removable, or a combination
thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc
drives, etc. Computing device 100 includes one or more processors that read data from
various entities such as memory 112 or I/O components 120. Presentation component(s) 116
present data indications to a user or other device. Exemplary presentation components
include a display device, speaker, printing component, vibrating component, etc.
[0054] I/O ports 118 allow computing device 100 to be logically coupled to other
devices including I/O components 120, some of which may be built in. Illustrative
components include a microphone, joystick, game pad, satellite dish, scanner, printer,
wireless device, etc. Some examples of the I/O ports include, but are not limited to, a 3.5 mm audio
jack, Universal Serial Bus (USB) and any form of wireless communication between various
I/O devices.
[0055] The Processor 114 may communicate with a user through a control interface
and display interface coupled to a display. The display may be, for example, a TFT LCD
(Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode)
display, or other appropriate display technology. The display interface may comprise
appropriate circuitry for driving the display to present graphical and other information to an
entity/user. The control interface may receive commands from a user and convert them for
submission to the processor. In addition, an external interface may be provided in
communication with processor 114, to enable near area communication of the device with other
devices. External interface may provide, for example, for wired communication in some
implementations, or for wireless communication in other implementations, and multiple
interfaces may be used.
[0056] FIG. 2 shows a block diagram illustrating high-level machine learning
workflow, according to one embodiment of the present invention. The workflow as shown
has two parts: training 210 and inference 220. Both of these workflows are triggered based on
different real-life events.
[0057] The training part 210 includes a data loader 202, a pre-processing unit 204, a
model loader/de-loader 206, and a cloud interface 208.
[0058] In an exemplary embodiment, the pre-processing unit 204 includes data
processing with a pre-processing element. The pre-processing element can be configured to
receive data to be processed by the data loader, and to pre-process the received data by
performing a series of tasks including, but not limited to, amplitude modulation, speed correction,
background noise reduction, sound profiling, normalisation and feature extraction, to output
the pre-processed data. In an exemplary embodiment, the data loader 202 can be used to
convey digital data in a secure manner to another device. The data loader can be, but is not limited to,
a digital program that can load the data stored on the servers or cloud environment onto an
allocated computer machine (Virtual Machine). This data is converted into different formats
such as, but not limited to, the tfrecord format for fast transportation. The data loader 202 encrypts
the digital data using two-way encryption protocols. The digital data that is conveyed is
unrestricted in nature, and can include keys, navigational information, watermarking
parameters, or any other digital content requiring secure delivery.
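By way of a non-limiting illustration, the pre-processing element described above could be sketched in Python as shown below. The function names, frame sizes and the use of plain NumPy are assumptions made for this example only and do not represent the actual implementation of the pre-processing unit 204.

import numpy as np

def normalise(samples: np.ndarray) -> np.ndarray:
    # Peak-normalise floating-point audio to the range [-1.0, 1.0].
    peak = np.max(np.abs(samples))
    return samples / peak if peak > 0 else samples

def to_pcm16(samples: np.ndarray) -> np.ndarray:
    # Convert normalised float audio to 16-bit signed PCM values.
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)

def frame_log_energy(pcm: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    # Very simple feature extraction: per-frame log energy.
    frames = [pcm[i:i + frame_len].astype(np.float64)
              for i in range(0, len(pcm) - frame_len + 1, hop)]
    return np.array([np.log(np.sum(f ** 2) + 1e-9) for f in frames])

# Example: a synthetic 1-second, 16 kHz tone standing in for loaded audio data.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = frame_log_energy(to_pcm16(normalise(audio)))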
[0059] The training part has a pipeline, which is triggered when a large amount of
audio data is generated by using the system of the present invention in the real world, and this data is used
to help increase the accuracy of the system. In order to do this, user-generated data 210 is
stored in a database. The system may also get data from new clients, which can be possible
customers, research organisations or strategic partnerships with other companies. The data is
loaded onto the system memory, i.e. the data loader 202, in a batch-wise process. The loaded data
needs to be converted into a format that the system can understand. In an example
embodiment, the audio data is converted into, including but not limited to, 16-bit signed
PCM values by the pre-processing unit 204. In order to train the model, the model
loader/deloader 206 is used, where a machine-learning model, which is stored on the cloud, is
copied to the training machine. This model, along with the loaded data, can be used for batch-wise training of the model. Parameters of the machine-learning model are updated by the
model loader/deloader 206; after the batch training process is completed, the updated machine
learning model parameters are again stored on the cloud.
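A minimal sketch of this batch-wise training orchestration is given below, assuming hypothetical storage helpers and a toy parameter update; these are stand-ins for the model loader/de-loader 206 and the actual training algorithms, not the disclosed implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelState:
    weights: List[float] = field(default_factory=lambda: [0.0])
    biases: List[float] = field(default_factory=lambda: [0.0])

def load_model_from_cloud() -> ModelState:
    # Stand-in for the model loader/de-loader fetching stored parameters.
    return ModelState()

def save_model_to_cloud(state: ModelState) -> None:
    # Stand-in for writing the updated parameters back to cloud storage.
    print("updated parameters stored:", state)

def train_on_batch(state: ModelState, batch: List[int]) -> ModelState:
    # Placeholder update; a real system would run a training algorithm here.
    state.weights = [w + 0.01 * len(batch) for w in state.weights]
    return state

def training_pipeline(batches: List[List[int]]) -> None:
    state = load_model_from_cloud()      # copy the model to the training machine
    for batch in batches:                # batch-wise training of the model
        state = train_on_batch(state, batch)
    save_model_to_cloud(state)           # persist the updated weights and biases

training_pipeline([[1, 2, 3], [4, 5]])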
[0060] The inference part 220 includes an audio codec converter 212, a pre-processing unit 214, an inference model loader 216, a model output unit 218, and an output
processing unit 222.
[0061] In an exemplary embodiment, the audio codec converter 212 encodes and/or
decodes audio by implementing algorithms to compress and decompress digital audio data
according to a given audio file or streaming media audio coding format, and the algorithms
represent high-fidelity audio signals while retaining quality. The audio codec converter 212
implements software codecs as libraries to provide interfaces to one or more multimedia
players. In an exemplary embodiment, audio compression algorithms are generally based on
modified discrete cosine transform (MDCT) coding and linear predictive coding (LPC).
Further, the audio codec can be a stand-alone device to encode analog audio as digital signals
and to decode the digital signals into analog signals. The device can include an analog-to-digital
converter (ADC) and a digital-to-analog converter (DAC).
[0062] In an example embodiment, the inference pipeline is triggered when the user
starts using the system. The audio data is sent to a backend server using web socket
connections. Here, a layer of codec conversion is first applied using the audio codec
converter 212, which ensures that the audio codecs are correct and that conversion happens if
required. After this, the data is again converted to, including but not limited to, a similar 16-bit signed PCM wave format by the pre-processing unit 214. The machine learning models
are now loaded on a device where the inference will happen (typically the
production server), i.e. the inference model loader 216. The input is given to the model output unit 218 and
the audio model generates text output. This text output is then given to the context validator
model or contextually aware neural network model, which gives a validated output as text.
This text then undergoes a series of non-ML based processes/validation architecture at the
output using the output processing unit 222 for text classification. The validated text is then
returned to end users.
[0063] In an exemplary embodiment, the non-machine learning based validation
architecture includes a string matching algorithm to classify the text output. The string
matching algorithm can be selected from at least one of the naive string matching
algorithm, the Levenshtein string matching algorithm and the cosine distance string matching
algorithm to enable replacement of keywords.
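As a non-limiting illustration of this validation step, the sketch below uses a Levenshtein distance together with a character-level cosine distance to decide whether a transcribed word should be replaced by a known domain keyword; the thresholds and the example keyword list are assumptions for this example only.

from collections import Counter
import math

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine_distance(a: str, b: str) -> float:
    # Cosine distance between character-count vectors of the two strings.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def replace_keywords(text: str, keywords: list, max_edits: int = 2) -> str:
    out = []
    for word in text.split():
        best = min(keywords, key=lambda k: levenshtein(word.lower(), k.lower()))
        close = (levenshtein(word.lower(), best.lower()) <= max_edits
                 or cosine_distance(word.lower(), best.lower()) < 0.1)
        out.append(best if close else word)
    return " ".join(out)

print(replace_keywords("patient has hypertenson", ["hypertension", "diabetes"]))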
[0064] In an exemplary embodiment, the system of the present invention includes an
Artificial Intelligence (AI) engine coupled to the processor 114 and configured for processing
the received audio data to identify the one or more data attributes for enabling identification
of the data language. The data attributes include characteristics associated with voice data that
enable identification of language, accent, pronunciation modifications etc. Depending on one
or more such processing parameters associated with the characteristics, the data language is
identified and the audio model is trained for all such languages.
[0065] In another exemplary embodiment, in case of a broken audio file (file that may
have missing audio information or inadequate audio information), the AI engine enables
accurate determination of the spoken words for mapping with terminologies associated with
different domains. The AI engine utilizes machine-learning techniques of normalization,
cleansing and classification for identifying the accurate text to be processed by the
contextually aware neural network data model.
[0066] FIG. 3 shows a training pipeline diagram, according to one embodiment of the
present invention. In an embodiment of the present invention, a clock trigger can start the
training process. Usually after a pre-set interval of time, the training pipeline is triggered. The
training pipeline can be triggered either by a clock trigger or manually through the means of
HTTP/HTTPS requests. After the clock trigger, two different processes are executed i.e.,
audio model trainer 320 and context model trainer 340. The audio model trainer 320 provides
the audio model training such that the audio model trainer 320 collects the weights 302 and
biases 304 from the cloud. These weights and biases together constitute the machine-learning
model(s), which can be loaded on a training machine. Post loading, the audio can be loaded
to the training machine from the databases, i.e., the generated transcription database 306 and the
accented audio data database 308. This is then used to initiate the training process for a pre-set number of epochs or until a certain accuracy value is attained. These training processes
regenerate the weights and biases for the neural networks through multiple training
algorithms. The multiple training algorithms may be or may include, but are not limited to,
gradient descent, conjugate gradient, the quasi-Newton method or the Levenberg-Marquardt
algorithm.
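For illustration only, the toy example below applies the first of the listed algorithms, plain gradient descent, to regenerate the weights and biases of a single linear layer; the synthetic data and learning rate are assumptions, and the actual neural networks are far larger.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                       # batch of 32 feature vectors
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.25     # synthetic targets

w = np.zeros(4)                                    # old weights
b = 0.0                                            # old bias
lr = 0.05
for _ in range(500):                               # pre-set number of epochs
    err = X @ w + b - y
    w -= lr * (X.T @ err) / len(y)                 # gradient of mean squared error w.r.t. w
    b -= lr * err.mean()                           # gradient w.r.t. b
# w and b are the regenerated parameters that would be stored back on the cloud.
print(np.round(w, 2), round(b, 2))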
[0067] In an exemplary embodiment, the audio model can be trained by utilizing pre-acquired/predefined audio data. The pre-stored audio data can be audio data of the one or
more users' accents and transcription data. The audio model can be specific to users' accents
of a particular language. The audio model can be built using a transducer-like model.
[0068] The second part is the context validator/ contextually aware neural network
model training i.e., context model trainer 340. The context model trainer 340 receives input
from the audio model in order to undergo a series of non-ML based processes. The context
model trainer 340 is loaded with a domain-specific dictionary 342, a user-specific dictionary 344,
and in-production user edits 346. The context model trainer 340 collects data from old weights
348 and old biases 350, which are used to create the machine learning model. This machine
learning model is trained on the data mentioned above and new weights and biases are
generated. The output of the audio model is given to the context model trainer 340 on a
separate processing thread. Once the output is generated, the context model trainer 340
returns the corrected text to the user.
[0069] In another exemplary embodiment, the context validator model can be built
using any or a combination of a dimensionality reduction model and a memory embedding
layer model. The dimensionality reduction model can be an auto-encoder and the memory-embedding layer model can be a Long Short Term Memory (LSTM) intent classification
model. The context validator model can be trained using a set of words. The set of words can
include the domain-specific dictionary 342, the user-specific dictionary 344, and the in-production user edits
346. Further, the domain can be any or a combination of the legal industry, hospitality industry,
health industry, and transportation industry.
[0070] In an exemplary embodiment, in the dimensionality reduction model, the goal
is to achieve dimensionality reduction for the given text. This can be done by training an auto-encoder. An auto-encoder consists of two parts (mainly the encoder and the decoder), and the
encoder part is mainly considered. After building the encoder, the LSTM model is required to
be built. This is required to incorporate memory embeddings into the model.
[0071] In an exemplary embodiment, the contextually aware neural network model
can be termed as context validator model. The model is optimized in two stages by utilizing
genetic optimization framework, wherein stage 1 is dimensionality reduction and stage 2 is
LSTM for memory embeddings. In this model, each keyword from datasets of keywords is
represented as, but not limited to, a vector in a Hilbert space of n-dimensional vectors such that 'n' is the
number of words in a dictionary. The vectors can be calculated by using straightforward
algorithms like the word2vec algorithm, word embeddings, etc.
[0072] In stage 1, dimensionality reduction for a given text is achieved by training an
auto-encoder. The auto-encoder can include, mainly, an encoder and a decoder. After
building the encoder, the LSTM model is built to incorporate memory embeddings into the model.
This increases the scalability and accuracy of speech to text transcription across multiple
domains and with respect to multiple accents.
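A minimal sketch of this two-stage arrangement is shown below, assuming PyTorch and made-up dimensions: the encoder half of an auto-encoder reduces the word vectors, and an LSTM with a classification head supplies the memory embeddings. It is an illustration of the described architecture, not the actual context validator model.

import torch
import torch.nn as nn

class TextAutoEncoder(nn.Module):
    # Stage 1: dimensionality reduction of n-dimensional word vectors.
    def __init__(self, n_vocab_dim: int, n_latent: int):
        super().__init__()
        self.encoder = nn.Linear(n_vocab_dim, n_latent)
        self.decoder = nn.Linear(n_latent, n_vocab_dim)

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        return self.decoder(z), z

class ContextValidator(nn.Module):
    # Stage 2: LSTM over the reduced representations for memory embeddings / intent classification.
    def __init__(self, n_latent: int, n_hidden: int, n_intents: int):
        super().__init__()
        self.lstm = nn.LSTM(n_latent, n_hidden, batch_first=True)
        self.classifier = nn.Linear(n_hidden, n_intents)

    def forward(self, latent_seq):
        _, (h_n, _) = self.lstm(latent_seq)
        return self.classifier(h_n[-1])

# Example shapes: one sentence of 12 words, each a 300-dimensional vector, reduced to 32 dimensions.
words = torch.randn(1, 12, 300)
autoencoder = TextAutoEncoder(300, 32)
validator = ContextValidator(32, 64, n_intents=10)
_, latent = autoencoder(words)
intent_logits = validator(latent)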
[0073] Further, a middleware handles the communication between the different
machine learning models and the end user. In one example embodiment, the middleware
provides audio data as an input to the audio model, and an output of the audio model is
directly returned to the user. Further, the output of the audio model is given to the
context validator model on a separate CPU thread. Once the output is generated, it is
returned to the user and the corrected text is displayed. For example, from the
perspective of the user, after dictation, the user sees the transcribed text within 1 second (this
is produced by the audio model). This text is then updated after a few seconds (this is
performed by the context validator model). In an example embodiment, the middleware is
operatively coupled to the audio model and the contextually aware neural network model
such that the middleware is adapted to classify the text output by implementing the string-matching algorithm.
[0074] In an exemplary embodiment, the middleware can be implemented in various
ways. These ways may include, but are not limited to: Model Cascading; Model Extension; and
Asynchronous Queue Generation. 1) Model Cascading: In this model, the output of the audio
model is directly given to the context validator model or the contextually aware neural
network model. Once the output of the audio model is generated, the audio model is free to
process the next batch of audio to text conversion. 2) Model Extension: In this model, the
output of the audio model is given to the context validator model or the contextually aware
neural network model. In this approach, both the models are presented as a single unit that is
used to generate the text output given speech input. A new set of speech data can be taken by
the audio model only after the text output of the previous speech input has been generated. 3) Asynchronous
Queue Generation: In this approach, the middleware may maintain two separate queues for
both the models. The user-generated data is added to the queue of the audio model and the
output from the audio model is added to the queue of the context validator model or
contextually aware neural network model. These queues are maintained on separate CPU
threads and the models are triggered on separate, but not limited to, GPUs or Graphic
Processing Units optimised for high-speed model execution.
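The asynchronous queue generation approach could, for example, be sketched as follows using standard Python queues and threads; the stand-in model functions and the sentinel-based shutdown are assumptions made for this example, not the actual middleware.

import queue
import threading

audio_queue = queue.Queue()    # queue of user-generated speech data for the audio model
context_queue = queue.Queue()  # queue of first text data for the context validator model
results = queue.Queue()

def run_audio_model(pcm_chunk):
    return "raw transcription"          # stand-in for audio model inference

def run_context_model(text):
    return text + " (corrected)"        # stand-in for the context validator model

def audio_worker():
    while True:
        chunk = audio_queue.get()
        if chunk is None:               # sentinel: propagate shutdown and stop
            context_queue.put(None)
            break
        context_queue.put(run_audio_model(chunk))

def context_worker():
    while True:
        text = context_queue.get()
        if text is None:
            break
        results.put(run_context_model(text))

threads = [threading.Thread(target=audio_worker), threading.Thread(target=context_worker)]
for t in threads:
    t.start()
audio_queue.put(b"\x00\x01")            # enqueue one chunk of speech data
audio_queue.put(None)                   # stop the workers once the queue drains
for t in threads:
    t.join()
print(results.get())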
[0075] In an embodiment, the present invention is configured to build/train various
audio models by utilizing audio of various accents of a particular language. Based on this
training, the audio model is adapted to a particular accent. The source of the speech data is
known before the invocation of the model. This data can be used to deploy a corresponding
audio model, which will be most accurate for that user. The accuracy of this speech to text
conversion is determined in real time, and if the accuracy is very low, different audio models
are automatically provided to the user until an audio model is found that generates a very
high accuracy for the corresponding user.
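A minimal sketch of this accuracy-driven fallback is shown below; the accent map, the dummy models and the accuracy estimator are hypothetical stand-ins for the deployed audio models and the real-time accuracy determination.

from typing import Callable, Dict, List

def pick_audio_model(accent: str,
                     models_by_accent: Dict[str, List[Callable[[bytes], str]]],
                     estimate_accuracy: Callable[[str], float],
                     audio: bytes,
                     threshold: float = 0.85) -> str:
    # Try the models registered for the user's accent until one is accurate enough.
    for model in models_by_accent.get(accent, []):
        text = model(audio)
        if estimate_accuracy(text) >= threshold:
            return text
    return ""  # no sufficiently accurate model was found

# Usage with dummy models standing in for deployed audio models.
models = {"en-IN": [lambda audio: "namaste world", lambda audio: "hello world"]}
print(pick_audio_model("en-IN", models,
                       estimate_accuracy=lambda t: 0.9 if "hello" in t else 0.5,
                       audio=b""))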
[0076] In another related embodiment of the present invention, the building/training of the
contextually aware neural network model (context validator model) is done by using domain-specific words, user-specific words, and in-production user edits. The domain-specific words
can be obtained in multiple ways, including but not limited to domain-specific dictionaries, web
scraping engines, domain experts and pre-acquired datasets. The built contextually aware
neural network model analyses contexts of the text data and corrects the text data to provide
an accurate speech to text transcription.
[0077] FIG. 4 shows a flow chart of a method for speech to text transcription,
according to one embodiment of the present invention. The method 400 includes step 410 of
building one or more audio models, wherein the one or more audio models are specific to one
or more users' accents of at least one language such that the one or more audio models are
trained by utilizing pre-stored audio data; step 420 of building one or more contextually
aware neural network models, wherein the one or more contextually aware neural network
models are trained using a first set of words; step 430 of receiving, by an audio codec
converter, speech data to convert the speech data into a first code; step 440 of processing, by
a pre-processing unit, the first code to convert the first code into a second code and to load at
least one of the one or more audio models and the one or more contextually aware neural
network models in an inference model loader; step 450 of invoking at least one of the one or
more audio models based on one or more pre-determined accents of the speech data; step 460
of generating, by the at least one invoked audio model, a first text data based on inference
from the second code; and step 470 of invoking, by a middleware, at least one of the one or
more contextually aware neural network models to generate, from the first text data, a text
output and to iteratively correct the generated text output. The speech data can be real-time
speech data or recorded speech data of, including but not limited to, humans, robots,
machines and other audio generating units.
[0078] In an exemplary embodiment, the pre-stored audio data comprises at least one
of audio data of the one or more users' accented speech data and transcription data. The first
set of words comprises words specific to, including but not limited to, healthcare industry,
hospitality industry, legal industry, transportation industry, sales and marketing, engineering
documentation industry and other industries. In another exemplary embodiment, the second
code can be, including but not limited to, a 16-bit signed Pulse Coded Modulation (PCM)
data.
[0079] In an embodiment, the method 400 can include steps of validating, using a
look-up tool, the text output by implementing a non-machine learning based validation
architecture; and determining if one or more words should be replaced with one or more
values. The non-machine learning based validation architecture comprises a string matching
algorithm to classify the text output. The string matching algorithm is selected from at least
one of Levenshtein string matching algorithm and Cosine distance based string matching
algorithm to enable replacement of keywords.
[0080] In an exemplary embodiment, the middleware is adapted to provide the first
text data from the audio model to the contextually aware neural network model such that the
audio model is free to process a next batch of speech data. The middleware is also adapted to
provide the first text data from the audio model to the contextually aware neural network
model such that the audio model processes a next batch of speech data after output text data
is generated. The middleware is also adapted to maintain two distinct queues for the audio
model and the contextually aware neural network model and wherein the two distinct queues
are maintained on separate CPU threads and both the models are triggered on separate
Graphic Processing Unit threads.
[0081] In an example embodiment, the method 400 includes the steps of receiving
data corresponding to audio data (e.g., a speech, a spoken word or a vocal sound) and
invoking a relevant audio model based on the accent of the user. This accent can be
determined by the geographical location or pre-acquired information. The receiving of data
corresponding to the speech comprises receiving data corresponding to the speech from at
least one of a client device, batch-wise loaded audio files, and machine-spoken audio. The
obtaining of location information includes receiving location information where the speech
was spoken. In response to reception of the speech, the client device initiates a process to
obtain location information for the area where the speech was spoken. Location information
as referred in the present disclosure is data that indicate a relative possibility that the audio
(speech) data was obtained from a particular geographical location.
[0082] The method 400 obtains location information where the speech was spoken
and derives features from the identified accent, and selects one or more models for speech
recognition based on the location information, where each of the selected one or more models
is associated with a weight based on the location information. The derivation of features is
achieved by evaluating the features for recognizing speech using at least a speech recognition
model that is deemed appropriate for the identified accent. The speech recognition model that
is deemed appropriate for the identified accent includes an acoustic model that has been
adapted for the identified accent. The acoustic model that has been adapted for the identified
accent was adapted using accented training speech data not necessarily in the healthcare
domain. Further, the acoustic model that has been adapted for the identified accent was
adapted using speech data of a language that is associated with the identified accent along
with some general speech data belonging to that language. For example, the accent of the speech to be recognized may be a regional Indian English accent, while the language that is associated with the identified accent is United Kingdom English.
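As a non-limiting sketch of the weight-based selection described above, the following Python example ranks candidate recognition models by location-derived weights and selects the highest-weighted ones; the weights and model names are illustrative assumptions.

def select_models(location_weights: dict, top_k: int = 2) -> list:
    """Return the top-k model identifiers ranked by their location-derived weights."""
    ranked = sorted(location_weights.items(), key=lambda item: item[1], reverse=True)
    return [model for model, _ in ranked[:top_k]]

weights = {
    "acoustic_indian_english": 0.7,   # relative possibility the speech came from this region
    "acoustic_uk_english": 0.2,
    "acoustic_us_english": 0.1,
}
print(select_models(weights))   # -> ['acoustic_indian_english', 'acoustic_uk_english']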
[0083] Furthermore, the method 400 analyses the text using the context dictionary, where
the context dictionary is operative to maintain one or more query refinement keywords or
phrases or terminology associated with a given concept. The context dictionary is operative to
maintain one or more keywords or phrases submitted by a user during a query session, one or
more frequently co-occurring keywords or phrases appearing in a corpus of acquired data,
and one or more keywords or phrases associated with a given concept as specified by a
human editor.
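One possible in-memory representation of such a context dictionary is sketched below in Python; the concepts and keywords shown are illustrative assumptions and do not limit the disclosure.

CONTEXT_DICTIONARY = {
    "cardiology": {
        "query_session_terms": ["ecg", "echo report"],             # keywords submitted during query sessions
        "corpus_cooccurring_terms": ["myocardial", "infarction"],  # frequently co-occurring corpus terms
        "editor_specified_terms": ["angiogram"],                   # terms specified by a human editor
    },
}

def keywords_for(concept: str) -> list:
    """Flatten all keyword sources maintained for a given concept."""
    entry = CONTEXT_DICTIONARY.get(concept, {})
    return [keyword for terms in entry.values() for keyword in terms]

print(keywords_for("cardiology"))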
[0084] The method 400 includes correcting the text data after analysis by the contextually aware neural network data model, replacing relevant keywords and assigning values in the final output. The speech to text neural network model processes accented speech from one or more domain subjects including medical, legal, radiology, secretarial, sales and marketing, customer support, engineering documentation, website navigation, etc. In an example embodiment, a middleware comprises a look-up tool to classify the text output by implementing a non-machine learning based validation architecture.
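A minimal, non-limiting Python sketch of this correction step is shown below, in which recognised keywords or phrases are replaced and mapped values are assigned in the final output; the replacement table is a hypothetical example.

REPLACEMENTS = {"b p": "blood pressure", "one twenty over eighty": "120/80"}

def finalize(text: str) -> str:
    """Replace recognised keywords or phrases with their assigned values in the final output."""
    for phrase, value in REPLACEMENTS.items():
        text = text.replace(phrase, value)
    return text

print(finalize("patient b p is one twenty over eighty"))
# -> "patient blood pressure is 120/80"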
[0085] In an exemplary embodiment, the method 400 further comprises evaluating
features for recognizing audio data using a speech recognition model that is deemed
appropriate for the identified accent. The speech recognition model includes an acoustic
model that has been adapted for the identified accent without using domain specific accented
training speech data.
[0086] In an exemplary embodiment, the method and system of the present disclosure
create models, which are of high accuracy in a specific domain for a specific accent. The
models have low audio training data requirements, which reduces the dependency on obtaining audio data subject to simultaneous constraints (accent and domain) directly. Moreover, the system and method of voice to text transcription have the ability to plug and play various audio models and match them with various context validator models, enabling the insights a model has learnt to be reused for new domains and accents. Further, the system has the ability to add custom vocabulary in any domain to enable high-level customisation and to ensure that domain-level learning is preserved. Moreover, the invention can adapt to any accent quickly, and this learning can be used as a generic base to build various context validator models. This directly improves the accuracy of the system.
[0087] In an exemplary embodiment, the system of the present invention is
configured to be implemented on a cloud network.
[0088] As can be understood, embodiments of the present invention provide a deep
neural network model for voice to text and split it into two discrete systems, which work
asynchronously with each other. The present invention has been described in relation to
particular embodiments, which are intended in all respects to be illustrative rather than
restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art
to which the present invention pertains without departing from its scope.
[0089] In an exemplary embodiment, the present invention may be a system, a
method, and/or a computer program product or a web-based application. The computer
program product may include a computer readable storage medium (or media) having
computer readable program instructions thereon for causing a processor to carry out aspects
of the present invention. The media has embodied therein, for instance, computer readable
program code (instructions) to provide and facilitate the capabilities of the present disclosure.
The article of manufacture (computer program product) can be included as a part of a computer system/computing device or as a separate product.
[0090] Computer readable program instructions described herein can be
downloaded to respective computing/processing devices from a computer readable storage
medium or to an external computer or external storage device via a network, for example, the
internet, a local area network (LAN), a wide area network (WAN) and/or a wireless network.
The network may comprise copper transmission cables, optical transmission fibres, wireless
transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network
adapter card or network interface in each computing/processing device receives computer
readable program instructions from the network and forwards the computer readable program
instructions for storage in a computer readable storage medium within the respective
computing/processing device.
[0091] Computer readable program instructions described herein can also refer to one or more of the following: a Web App (a website or product that can be directly launched by navigating to a particular Uniform Resource Locator (Web URL) in the browser of the end user) or a Progressive Web App (PWA), which is an installable application that runs inside the browser execution engine.
[0092] It should be appreciated that the machine learning technologies disclosed
herein have been primarily described in reference to a particular use-case related to the
contextually aware neural network data model related to context dictionary pertaining to
different domains such as medical, legal, etc. However, it should be appreciated that the
disclosed machine learning technologies are applicable to any other system, device, or service
utilizing a machine learning algorithm. In such embodiments, the user may be prompted to
verify or supply classifications or other determinations. As such, the present disclosure
should not be interpreted as limited to the specifically disclosed use-case or domain, but
rather understood to be applicable to other systems, devices, and services in which machine
learning is used.
[0093] From the foregoing, it will be seen that this invention is one well adapted to
attain all the ends and objects set forth above, together with other advantages, which are
obvious and inherent to the system and method. It will be understood that certain features and
sub combinations are of utility and may be employed without reference to other features and
sub combinations. This is contemplated by and is within the scope of the appended claims.
ADVANTAGES OF THE PRESENT DISCLOSURE
[0094] The present disclosure provides an end-to-end speech to text transcription
system that is scalable across multiple domains.
[0095] The present disclosure provides an accurate, robust speech to text transcription
system for human machine interactions.
[0096] The present disclosure provides a computer-implemented method to build a
modular speech to text transcription system trained on different accents and dialects.
[0097] The present disclosure provides the speech to text transcription system that
offers plug-and-play model for various accents and corresponding domains.
[0098] The present disclosure provides the speech to text transcription system that
requires only low amounts of audio data for training audio models for the first time.
[0099] The present disclosure provides a highly trainable system that learns newly spoken words or phrases.
[00100] The present disclosure provides a scalable system for high concurrent usage of
machine learning models.