Abstract: A system to convert a mixed language conversation into text is disclosed. The system includes an audio analysis module configured to identify one or more speakers in a conversation from one or more audio files, segregate one or more words corresponding to each of the one or more speakers of the one or more audio files, identify one or more languages associated with the segregated one or more words, and convert the one or more audio files into text in real time based on inferred text data for each of the segregated one or more words and a predicted sequence of words in the conversation. The system includes an entity extraction module configured to extract one or more entities present in the speaker diary. The system also includes an entity search module configured to search each of the extracted one or more entities in a repository database based on n-grams.
Embodiments of the present disclosure relate to speech-to-text conversion techniques, and more particularly to a system and a method to convert a mixed language audio input into text.
BACKGROUND
[0002] Globalization and technological advances have led to the spread of
multilingualism across the globe. Multilingualism refers to the use of a mixture
or combination of more than one language by an individual or a community of speakers.
In daily conversations as well as in business environments, people mix languages
very frequently, and many instances of spoken and written language include words
from two or more different languages or dialects. To understand such mixed languages
manually, a person must pay attention to personal details such as accent, cadence,
and the like. Speech recognition systems are also designed for automatic
understanding of different languages.
[0003] Conventional speech recognition systems and methods are based on a single
language. Such systems are ill-equipped to handle multilingual communication. They
face difficulty in accurately transcribing speech that combines words from different
languages or dialects.
[0004] Speech recognition includes the capability to recognize spoken language and
translate it into text at the same time. Known approaches lack a single system that
recognizes and transcribes spoken multilingual communication in a fast, accurate and
reliable way; instead, complex systems are combined to perform both tasks
simultaneously.
[0005] Hence, there is a need for an improved system to convert a mixed language
conversation into text, and a method to operate the same, in order to address the
aforementioned issues.
BRIEF DESCRIPTION
[0006] In accordance with one embodiment of the disclosure, a system to convert
a mixed language conversation into text is disclosed. The system includes an audio
analysis module operable by one or more processors. The audio analysis module is
configured to identify one or more speakers in a conversation from one or more audio
files. The audio analysis module is also configured to segregate one or more words
corresponding to each of the one or more speakers of the one or more audio files. The
audio analysis module is also configured to identify one or more languages associated
with one or more segregated words.
[0007] The audio analysis module is also configured to convert the one or more
audio files into text in real time based on inferred text data for each of the one or
more segregated words corresponding to the one or more identified languages and a
predicted sequence of words in the conversation. The system also includes an audio
diarization module operable by the one or more processors. The audio diarization
module is operatively coupled to the audio analysis module and configured to generate
a speaker diary for the one or more speakers. The speaker diary comprises a timestamp
for each of the one or more segregated words with respect to each of the one or more
speakers, and a conversational response of each of the one or more speakers against
other speakers in the conversation.
[0008] The system also includes an entity extraction module operable by the one
or more processors. The entity extraction module is operatively coupled to the audio
diarization module. The entity extraction module is configured to extract one or more
entities present in the speaker diary using a ML trained phonetic lexicon. The system
also includes an entity search module operable by the one or more processors. The
entity search module is operatively coupled to the entity extraction module. The entity
search module is configured to search each of the one or more extracted entities in a
repository database based on n-grams, and present a related set of information in a
presentable text format.
[0009] In accordance with one embodiment of the disclosure, a method for
converting a mixed language conversation into text is disclosed. The method includes
identifying one or more speakers in a conversation from one or more audio files. The
method also includes segregating one or more words corresponding to each of the one
or more speakers of the one or more audio files. The method also includes identifying
one or more languages associated with one or more segregated words. The method
also includes converting the one or more audio files into a text in real time based on
inferred text data for each of the one or more segregated words corresponding to the
one or more identified languages, and a predicted sequence of words in the conversation.
[0010] The method also includes generating a speaker diary for the one or more
speakers. The method also includes extracting one or more entities present in the
speaker diary using a ML trained phonetic lexicon. The method also includes
searching each of one or more extracted entities in a repository database based on n-grams. The method also includes presenting a related set of information in a presentable
text format.
[0011] To further clarify the advantages and features of the present disclosure, a
more particular description of the disclosure will follow by reference to specific
embodiments thereof, which are illustrated in the appended figures. It is to be
appreciated that these figures depict only typical embodiments of the disclosure and
are therefore not to be considered limiting in scope. The disclosure will be described
and explained with additional specificity and detail with the appended figures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The disclosure will be described and explained with additional specificity
and detail with the accompanying figures in which:
[0013] FIG. 1 is a block diagram representation of a system to convert a mixed
language conversation into text in accordance with an embodiment of the present
disclosure;
[0014] FIG. 2 is a block diagram representation of an embodiment representing the
system to convert the mixed language conversation into text of FIG. 1 in accordance
with an embodiment of the present disclosure;
[0015] FIG. 3 is a schematic representation of the mixed language conversation
text for the discussion between the agent and the customer in accordance with an
embodiment of the present disclosure;
[0016] FIG. 4 is a flowchart representing the steps of an entity extraction feature
corresponding to the system to convert the mixed language conversation into text in
accordance with an embodiment of the present disclosure;
[0017] FIG. 5 is a flowchart representing the steps of end-to-end working of the
system to convert the mixed language conversation into text in accordance with an
embodiment of the present disclosure;
[0018] FIG. 6 is a block diagram of a computer or a server in accordance with an
embodiment of the present disclosure; and
[0019] FIG. 7 is a flowchart representing the steps of a method for converting a
mixed language conversation into text in accordance with an embodiment of the
present disclosure.
[0020] Further, those skilled in the art will appreciate that elements in the figures
are illustrated for simplicity and may not have necessarily been drawn to scale.
Furthermore, in terms of the construction of the device, one or more components of
the device may have been represented in the figures by conventional symbols, and the
figures may show only those specific details that are pertinent to understanding the
embodiments of the present disclosure so as not to obscure the figures with details that
will be readily apparent to those skilled in the art having the benefit of the description
herein.
DETAILED DESCRIPTION
[0021] For the purpose of promoting an understanding of the principles of the
disclosure, reference will now be made to the embodiment illustrated in the figures
and specific language will be used to describe them. It will nevertheless be understood
that no limitation of the scope of the disclosure is thereby intended. Such alterations
and further modifications in the illustrated system, and such further
applications of the principles of the disclosure as would normally occur to those skilled
in the art, are to be construed as being within the scope of the present disclosure.
[0022] The terms "comprises", "comprising", or any other variations thereof, are
intended to cover a non-exclusive inclusion, such that a process or method that
comprises a list of steps does not include only those steps but may include other steps
not expressly listed or inherent to such a process or method. Similarly, one or more
devices or subsystems or elements or structures or components preceded by
"comprises... a" does not, without more constraints, preclude the existence of other
devices, subsystems, elements, structures, components, additional devices, additional
subsystems, additional elements, additional structures or additional components.
Appearances of the phrases "in an embodiment", "in another embodiment" and similar
language throughout this specification may, but do not necessarily, all refer to the same
embodiment.
[0023] Unless otherwise defined, all technical and scientific terms used herein have
the same meaning as commonly understood by those skilled in the art to which this
disclosure belongs. The system, methods, and examples provided herein are only
illustrative and not intended to be limiting.
[0024] In the following specification and the claims, reference will be made to a
number of terms, which shall be defined to have the following meanings. The singular
forms “a”, “an”, and “the” include plural references unless the context clearly dictates
otherwise.
[0025] A computer system (standalone, client or server computer system)
configured by an application may constitute a “module” that is configured and
operated to perform certain operations. In one embodiment, the “module” may be
implemented mechanically or electronically, so a module may comprise dedicated
circuitry or logic that is permanently configured (within a special-purpose processor)
to perform certain operations. In another embodiment, a “module” may also comprise
programmable logic or circuitry (as encompassed within a general-purpose processor
or other programmable processor) that is temporarily configured by software to
perform certain operations.
[0026] Accordingly, the term “module” should be understood to encompass a
tangible entity, be that an entity that is physically constructed, permanently configured
(hardwired), or temporarily configured (programmed) to operate in a certain manner
and/or to perform certain operations described herein.
[0027] FIG. 1 is a block diagram representation of a system (10) to convert a mixed
language conversation into text in accordance with an embodiment of the present
disclosure. The system (10), via a single robust speech-to-text engine, accurately
captures the different languages in a conversation and provides text. The designed system
(10) is fast, accurate and reliable. The system is based on a GMM-HMM-DNN
(Gaussian Mixture Model - Hidden Markov Model - Deep Neural Network) training
method.
[0028] The system (10) includes an audio analysis module (20) operable by one or
more processors. The audio analysis module (20) is configured to identify one or more
speakers in a conversation from one or more audio files. As used herein, the term
“audio file” is a file format for storing digital audio data on a computer system. The
user of the system (10) uploads the audio file into the system (10) for speech-to-text
conversion. In one embodiment, the conversation may be any form of continuous
speech.
[0029] In one embodiment, a word utterance from an unknown speaker is analysed
and compared with speech models of known speakers, thereby classifying the
unknown speaker as one of the known speakers. The system (10) essentially identifies
the number of speakers in a conversation and tags each word utterance to that particular
speaker.
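By way of illustration only, the following minimal sketch shows one way such a comparison against known speaker models may be realised, assuming per-speaker Gaussian mixture models trained on MFCC features. The libraries (librosa, scikit-learn), enrollment file names and parameters are assumptions for illustration and are not part of the disclosure.

```python
# Hypothetical sketch: score an unknown utterance against per-speaker GMMs.
# librosa / scikit-learn and the file names are assumptions, not the disclosed method.
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path):
    """Load an audio file and return frame-level MFCC features (frames x 13)."""
    signal, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T

# Train one GMM per known speaker from enrollment audio.
speaker_models = {
    name: GaussianMixture(n_components=8, covariance_type="diag").fit(mfcc_features(path))
    for name, path in {"agent": "agent_enroll.wav", "customer": "customer_enroll.wav"}.items()
}

def identify_speaker(utterance_path):
    """Tag the utterance with the known speaker whose model scores it highest."""
    feats = mfcc_features(utterance_path)
    scores = {name: gmm.score(feats) for name, gmm in speaker_models.items()}
    return max(scores, key=scores.get)
```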
[0030] The audio analysis module (20) is also configured to segregate one or more
words corresponding to each of the one or more speakers of the one or more audio
files. Basically, the system (10) isolates each word spoken in the provided audio file.
Thereafter, the audio analysis module (20) identifies one or more languages associated
with one or more segregated words.
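A minimal sketch of isolating word-level segments is given below, assuming an energy-based split on silence; the library (librosa), the file name and the 30 dB threshold are illustrative assumptions rather than the disclosed segregation technique.

```python
# Hypothetical sketch: isolate candidate word segments from an uploaded audio file
# by splitting on silence. Library choice and threshold are assumptions for illustration.
import librosa

signal, sr = librosa.load("conversation.wav", sr=16000)

# Each (start, end) pair marks a non-silent interval, treated here as a word candidate.
intervals = librosa.effects.split(signal, top_db=30)

word_segments = [
    {"start_sec": start / sr, "end_sec": end / sr, "samples": signal[start:end]}
    for start, end in intervals
]
```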
[0031] The audio analysis module (20) is also configured to convert the one or
more audio files into a text in real time. Such conversion is based on inferred text data
for each of the one or more segregated words corresponding to the one or more
identified languages and a predicted sequence of words in the conversation. In one
embodiment, the text data is inferred for each of the one or more segregated words
corresponding to the one or more identified languages by implementation of an acoustic
model and a trigram-based language model. In such embodiment, the acoustic model
generates the feature vectors for different phonemes based on an HMM model. The
sequence of words is predicted by combining the acoustic model with a language
model.
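The following sketch illustrates, under stated assumptions, how a count-based trigram language model score may be combined with an acoustic score to choose among candidate word sequences. The toy corpus, add-one smoothing and weighting are illustrative assumptions and are not prescribed by the disclosure.

```python
# Hypothetical sketch: rank candidate transcriptions by combining an acoustic score
# (assumed to come from the HMM-based acoustic model) with a trigram LM score.
from collections import defaultdict
import math

def train_trigram_lm(sentences):
    """Count trigrams and bigrams to estimate P(w3 | w1, w2) with add-one smoothing."""
    tri, bi, vocab = defaultdict(int), defaultdict(int), set()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        vocab.update(words)
        for i in range(2, len(words)):
            bi[(words[i - 2], words[i - 1])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
    V = len(vocab)

    def logprob(sentence):
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((tri[(words[i - 2], words[i - 1], words[i])] + 1)
                     / (bi[(words[i - 2], words[i - 1])] + V))
            for i in range(2, len(words))
        )
    return logprob

# Toy corpus; in practice the language model would be trained on a large mixed corpus.
lm_logprob = train_trigram_lm(["please share your account number",
                               "share the account details please"])

def rank(candidates, lm_weight=0.6):
    """candidates: list of (text, acoustic_logprob) pairs produced by the acoustic model."""
    return max(candidates,
               key=lambda c: (1 - lm_weight) * c[1] + lm_weight * lm_logprob(c[0]))
```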
[0032] An acoustic model is used in automatic speech recognition to represent the
relationship between an audio signal and the phonemes or other linguistic units that
make up speech. A hidden Markov model (HMM) is a statistical model that may be
used to describe the evolution of observable events that depend on internal factors,
which are not directly observable.
[0033] The audio analysis module (20) is also configured to perform word
embedding for the one or more audio files to produce meaningful text. A word
embedding is a learned representation for text in which words that have the same
meaning have a similar representation. The word embedding is performed using a
vectorized representation of the words spoken in the audio file being analysed. A
pre-trained word embedding is used to initialize the word vectors. The weights
associated with these word vectors are updated as the audio file gets converted to
text, and that text is used for downstream processing such as entity extraction. A
back-propagation algorithm updates the word embeddings during the training process.
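A minimal sketch of this behaviour is shown below, assuming a PyTorch embedding layer initialised from pre-trained vectors and updated by back-propagation; the vocabulary, vector size and toy downstream labels are illustrative assumptions.

```python
# Hypothetical sketch: initialise an embedding layer from pre-trained word vectors
# and let back-propagation update them during training, as described above.
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "account": 1, "balance": 2, "batao": 3}   # toy mixed Hindi/English tokens
pretrained = torch.randn(len(vocab), 50)                        # stand-in for real pre-trained vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # trainable weights
classifier = nn.Linear(50, 2)
optimizer = torch.optim.Adam(list(embedding.parameters()) + list(classifier.parameters()))

tokens = torch.tensor([1, 2, 3])          # "account balance batao"
labels = torch.tensor([1, 1, 0])          # toy downstream labels (e.g. entity / not entity)

logits = classifier(embedding(tokens))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                           # gradients flow into the embedding weights
optimizer.step()
```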
[0034] The system (10) also includes an audio diarization module (30) operable by
the one or more processors. The audio diarization module (30) is operatively coupled
to the audio analysis module (20). The audio diarization module (30) is configured to
generate a speaker diary for the one or more speakers. Speaker diarization is a
combination of speaker segmentation and speaker clustering. It aims at finding speaker
change points in an audio stream, subsequently grouping together speech segments on
the basis of speaker characteristics. The output of the speaker diarization is one or more
speaker diaries. The speaker diary comprises a timestamp for each of the one or more
segregated words with respect to each of the one or more speakers, and a
conversational response of each of the one or more speakers against other speakers in
the conversation. The timestamp provides a digital record of all words expressed in the
uploaded audio file. The speaker diary may be very useful for audio search.
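By way of illustration, a minimal diarization sketch is given below, assuming segmentation on silence followed by agglomerative clustering of per-segment MFCC statistics into two speakers. The features, the two-speaker assumption and the libraries are illustrative assumptions, not the disclosed implementation.

```python
# Hypothetical sketch: group speech segments by speaker with agglomerative clustering
# on per-segment MFCC statistics, then emit a timestamped speaker diary.
import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

signal, sr = librosa.load("conversation.wav", sr=16000)
intervals = librosa.effects.split(signal, top_db=30)      # candidate speech segments

def segment_embedding(start, end):
    """Mean MFCC vector as a crude per-segment speaker representation."""
    mfcc = librosa.feature.mfcc(y=signal[start:end], sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

embeddings = np.stack([segment_embedding(s, e) for s, e in intervals])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)   # assume 2 speakers

speaker_diary = [
    {"speaker": f"speaker_{label}", "start_sec": s / sr, "end_sec": e / sr}
    for (s, e), label in zip(intervals, labels)
]
```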
[0035] The system (10) also includes an entity extraction module (40) operable by
the one or more processors. The entity extraction module (40) is operatively coupled
to the audio diarization module (30). The entity extraction module (40) is configured
to extract one or more entities present in the speaker diary using a ML trained phonetic
lexicon. The one or more entities correspond to the intent of the speaker associated with
the conversation. By using the ML trained phonetic lexicon, the internal state of the
conversation is decoded.
[0036] The system (10) also includes an entity search module (50) operable by the
one or more processors. The entity search module (50) is operatively coupled to the
entity extraction module (40). The entity search module (50) is configured to search
each of the extracted one or more entities in a repository database based on n-grams, and
present a related set of information in a presentable text format. In the fields of
computational linguistics and probability, an n-gram is a contiguous sequence of n
items from a given sample of text or speech. The items can be phonemes, syllables,
letters, words or base pairs, according to the application. In one embodiment, the
presentable text format resembles any written document.
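A minimal sketch of an n-gram based lookup in a repository is shown below, assuming character trigrams and a Jaccard overlap score; the repository contents, scoring and threshold are illustrative assumptions.

```python
# Hypothetical sketch: look up an extracted entity in a repository by overlapping
# character n-grams. Repository contents and the trigram size are assumptions.
def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

repository = {
    "savings account interest rate": "Current savings interest rate is 3.5% p.a.",
    "credit card annual fee": "Annual fee is waived on spends above Rs. 1 lakh.",
}

def search_entity(entity):
    """Return repository entries ranked by n-gram overlap with the extracted entity."""
    query = ngrams(entity)
    scored = [
        (len(query & ngrams(key)) / len(query | ngrams(key)), key, info)
        for key, info in repository.items()
    ]
    return [(key, info) for score, key, info in sorted(scored, reverse=True) if score > 0]

print(search_entity("interest rate"))
```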
[0037] The system (10) further comprises a data storage module operable by the
one or more processors. The data storage module is operatively coupled to the audio
analysis module (20). The data storage module is configured to record and store a
timestamp for each of the one or more segregated words with respect to each of the
one or more speakers.
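A minimal sketch of recording and storing such timestamps is given below, assuming an SQLite table; the schema and database name are illustrative assumptions.

```python
# Hypothetical sketch: persist per-word, per-speaker timestamps in SQLite.
import sqlite3

conn = sqlite3.connect("speech_to_text.db")
conn.execute("""CREATE TABLE IF NOT EXISTS word_timestamps (
                    speaker TEXT, word TEXT, start_sec REAL, end_sec REAL)""")

def store_timestamp(speaker, word, start_sec, end_sec):
    """Record one segregated word with its speaker and time boundaries."""
    conn.execute("INSERT INTO word_timestamps VALUES (?, ?, ?, ?)",
                 (speaker, word, start_sec, end_sec))
    conn.commit()

store_timestamp("customer", "balance", 12.4, 12.9)
```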
[0038] FIG. 2 is a block diagram representation of an embodiment representing the
system (60) to convert the mixed language conversation into text of FIG. 1 in
accordance with an embodiment of the present disclosure. The exemplary embodiment
showcases the implementation of the speech to text system (10) in call centres. A call
centre is an office in which large numbers of phone calls are handled, especially one
providing the customer service functions of a large organization. Hence, customers from
different language backgrounds interact here.
[0039] In this specific example, a customer (70), while conversing (100) with an
agent (90) of a call centre (80), uses the Hindi and English languages together. The system
(60), via an audio analysis module (20), first identifies each of the customer
(70) and the agent (90) individually. The system (60) then segregates each Hindi or English word
spoken by the customer (70) or the agent (90). The system (60), via the same audio
analysis module (20), further identifies the language of each segregated word. The
languages identified in this particular exemplary embodiment may be Hindi and
English.
[0040] The audio analysis module (20) then converts the segregated words
into a text (110) in real time based on inferred text data for each of the segregated
words and predicted sequence of words. The system (60) for such purpose uses an
acoustic model and a trigram-based language model, whereby the acoustic model
generates the feature vectors for different phonemes based on an HMM model. FIG. 3
is a schematic representation of the mixed language conversation text (160) for the
discussion between the agent and the customer in accordance with an embodiment of
the present disclosure. The text clearly shows the discussion was in Hindi and English.
English words are output as English and Hindi portions of the conversation are output
as Devanagari. Additionally, a timestamped speaker diary is maintained by an audio
diarization module (30) for each of the segregated words of the uploaded audio file.
[0041] Furthermore, an entity extraction module (40) associated with the system
(60) extracts one or more entities (120) present in the speaker diary using a ML trained
phonetic lexicon. An entity is the representative intent of the speaker (the customer (70)
or the agent (90)) of the segregated word. An entity search module (50) searches
each of the extracted one or more entities in a repository database (130) based on n-grams
to find a related set of information. Lastly, the related set of information is presented
(140) and the agent (90) response is captured (150) for the provided text.
[0042] The audio analysis module (20), the audio diarization module (30), the
entity extraction module (40) and the entity search module (50) in FIG. 2 are
substantially equivalent to the audio analysis module (20), the audio diarization
module (30), the entity extraction module (40) and the entity search module (50) of
FIG. 1.
[0043] FIG. 4 is a flowchart representing the steps (170) of an entity extraction feature
corresponding to the system to convert the mixed language conversation into text in
accordance with an embodiment of the present disclosure. The system, via an entity
extraction module, extracts multiple entities present in the speaker conversation using
a ML trained phonetic lexicon. Here, the system receives a mixed-language input sentence
in step 180.
[0044] In a pre-processing step, the system in step 190 computes the character and
word embeddings of the words used. An embedding layer is the technique used here to
learn the word embeddings from text data. Further, the system, via a bidirectional
recurrent neural network, extracts features of the words and preserves the associated
context in step 200. The bidirectional neural network considers context both from
the past and the future to evaluate the function at the current time step. For
sequence-to-sequence temporal tasks like language processing, it proves helpful in prediction and
recognition tasks such as entity extraction. The system adds labels to the words in step 210.
Lastly, the system provides the entity result in step 230 after constructing proper sentences
with the added labels in step 220. A conditional random field (CRF) modelling method
is used to predict the structured sentences.
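The sketch below illustrates, under stated assumptions, an embedding layer feeding a bidirectional LSTM that emits per-token label scores, in the spirit of steps 190 to 220. Per-token argmax stands in for the CRF decoding of step 220, and the vocabulary size, hidden size and label set are illustrative assumptions.

```python
# Hypothetical sketch: embedding layer + bidirectional LSTM emitting per-token label
# scores. A CRF layer would normally be placed on top of these emission scores for
# structured decoding; argmax is used here only to keep the sketch short.
import torch
import torch.nn as nn

class BiRnnTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=50, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.birnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden, num_labels)   # emission scores per token

    def forward(self, token_ids):
        states, _ = self.birnn(self.embed(token_ids))   # context from past and future
        return self.emit(states)

tagger = BiRnnTagger(vocab_size=1000, num_labels=4)     # e.g. O, ACCOUNT, AMOUNT, DATE
sentence = torch.tensor([[12, 57, 903, 4]])             # toy mixed-language token ids
emissions = tagger(sentence)
labels = emissions.argmax(dim=-1)                       # CRF decoding would replace this step
```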
[0045] FIG. 5 is a flowchart representing the steps of end-to-end working (240) of
the system to convert the mixed language conversation into text in accordance with an
embodiment of the present disclosure. The user requests the system for speech
conversion in step 250 and uploads an audio file in step 260. All such interaction
takes place over a GUI interface. The server associated with this speech-to-text
conversion requests the uploaded audio file and directs such uploaded file to the
automatic speech recognition module in step 270. The automatic speech recognition
module generates the text data in step 280 and stores the data in the data storage module
in step 290.
[0046] Simultaneously, the system extracts the words and puts a query for
entity recognition via step 300. Information relatable to the extracted word is searched
in a repository database based on n-grams in step 310. The related set of information
is presented in a presentable text format in step 320. Lastly, all such extracted words
are stored in the data storage module.
[0047] FIG. 6 is a block diagram of a computer or a server (330) in accordance with
an embodiment of the present disclosure. The server (330) includes processor(s) (360),
and memory (340) coupled to the processor(s) (360).
[0048] The processor(s) (360), as used herein, means any type of computational
circuit, such as, but not limited to, a microprocessor, a microcontroller, a complex
instruction set computing microprocessor, a reduced instruction set computing
microprocessor, a very long instruction word microprocessor, an explicitly parallel
instruction computing microprocessor, a digital signal processor, or any other type of
processing circuit, or a combination thereof.
[0049] The memory (340) includes a plurality of modules stored in the form of an
executable program which instructs the processor(s) (360) via a bus (350) to perform the
method steps illustrated in FIG. 1. The memory (340) has the following modules: the audio
analysis module (20), the audio diarization module (30), the entity extraction module
(40) and the entity search module (50).
[0050] The audio analysis module (20) is configured to identify one or more
speakers in a conversation from one or more audio files. The audio analysis module
(20) is also configured to segregate one or more words corresponding to each of the
one or more speakers of the one or more audio files. The audio analysis module (20)
is also configured to identify one or more languages associated with segregated one or
more words.
[0051] The audio analysis module (20) is also configured to convert the one or
more audio files into a text in real time based on inferred text data for each of the
segregated one or more words corresponding to the one or more identified languages
and predicted sequence of words in the conversation. The audio diarization module
(30) is operatively coupled to the audio analysis module and configured to generate a
speaker diary for the one or more speakers.
[0052] The entity extraction module (40) is configured to extract one or more
entities present in the speaker diary using a ML trained phonetic lexicon. The entity
search module (50) is configured to search each of extracted one or more entities in a
repository database based on n-grams, and present related set of information in a
presentable text format.
[0053] Computer memory elements may include any suitable memory device(s) for
storing data and executable program, such as read only memory, random access
memory, erasable programmable read only memory, electrically erasable
programmable read only memory, hard drive, removable media drive for handling
memory cards and the like. Embodiments of the present subject matter may be
implemented in conjunction with program modules, including functions, procedures,
data structures, and application programs, for performing tasks, or defining abstract
data types or low-level hardware contexts. Executable program stored on any of the
above-mentioned storage media may be executable by the processor(s) (360).
[0054] FIG. 7 is a flowchart representing the steps (370) of a method for converting
a mixed language conversation into text in accordance with an embodiment of the
present disclosure. The method (370) includes identifying one or more speakers in a
conversation from one or more audio files in step 380. In one embodiment, identifying
the one or more speakers in the conversation from the one or more audio files includes
identifying the one or more speakers in the conversation by an audio analysis module.
[0055] The method (370) also includes segregating one or more words
corresponding to each of the one or more speakers of the one or more audio files in
step 390. In one embodiment, segregating the one or more words corresponding to
each of the one or more speakers of the one or more audio files includes segregating
the one or more words by the audio analysis module.
[0056] The method (370) also includes identifying one or more languages
associated with segregated one or more words in step 400. In one embodiment,
identifying the one or more languages associated with one or more segregated words
includes identifying the one or more languages by the audio analysis module.
[0057] The method (370) also includes converting the one or more audio files into
a text in real time based on inferred text data for each of the segregated one or more
words corresponding to the one or more identified languages and a predicted sequence
of words in the conversation in step 410. In one embodiment, converting the one or
more audio files into a text in real time includes converting by the audio analysis
module.
[0058] In another embodiment, converting the one or more audio files into a text in
real time based on inferred text data comprises the text data inferred for each of the
segregated one or more words corresponding to the one or more identified languages
by implementation of an acoustic model and a trigram-based language model. In yet
another embodiment, converting the one or more audio files into a text in real time
based on the predicted sequence of words in the conversation comprises prediction by
combining the Acoustic model with a language model.
[0059] The method (370) also includes performing word embedding for the one or
more audio files. In one embodiment, performing the word embedding for the one or
more audio files includes performing the word embedding for the one or more audio
files by the audio analysis module.
[0060] The method (370) also includes generating a speaker diary for the one or
more speakers in step 420. In one embodiment, generating the speaker diary for the
one or more speakers includes generating the speaker diary by an audio diarization
module.
[0061] The method (370) also includes extracting one or more entities present in
the speaker diary using a ML trained phonetic lexicon in step 430. In one embodiment,
extracting the one or more entities present in the speaker diary using a ML trained
phonetic lexicon includes extracting the one or more entities by an
entity extraction module.
[0062] The method (370) also includes searching each of extracted one or more
entities in a repository database based on n-grams in step 440. In one embodiment,
searching each of one or more extracted entities in the repository database based on n-grams includes searching by an entity search module. In another embodiment,
searching each of extracted one or more entities in the repository database comprises
search using the timestamp for each of the segregated one or more words with respect
to each of the one or more speakers.
[0063] The method (370) also includes presenting a related set of information in a
presentable text format in step 450. In one embodiment, presenting the related set of
information in the presentable text format includes presenting by the entity search
module.
[0064] The method (370) also includes recording and storing a timestamp for each
of the segregated one or more words with respect to each of the one or more speakers.
In one embodiment, recording and storing the timestamp for each of the segregated one
or more words with respect to each of the one or more speakers includes recording and
storing by a data storage module.
[0065] The present disclosure of a system to convert a mixed language conversation
into text is extremely useful in the Indian context, as most conversations happen in
such mixed-language settings. The system makes it very easy to perform all kinds of
analytics on the text output.
[0066] The system also performs speaker diarisation and segmentation, capturing not
only what is being said but also who has said it in any given conversation. The disclosed
system accurately gives timestamps for each and every word being spoken in an audio file.
Phonetic mappings are trained using machine learning algorithms. Furthermore, the
process may be deployed both in real time and in post facto speech analysis applications.
[0067] While specific language has been used to describe the disclosure, any
limitations arising on account of the same are not intended. As would be apparent to a
person skilled in the art, various working modifications may be made to the method
in order to implement the inventive concept as taught herein.
[0068] The figures and the foregoing description give examples of embodiments.
Those skilled in the art will appreciate that one or more of the described elements may
well be combined into a single functional element. Alternatively, certain elements may
be split into multiple functional elements. Elements from one embodiment may be
added to another embodiment. For example, order of processes described herein may
be changed and are not limited to the manner described herein. Moreover, the actions
of any flow diagram need not be implemented in the order shown; nor do all of the
acts need to be necessarily performed. Also, those acts that are not dependent on other
acts may be performed in parallel with the other acts. The scope of embodiments is by
no means limited by these specific examples.
WE CLAIM:
1. A system (10) to convert a mixed language conversation into text, comprising:
an audio analysis module (20), operable by one or more processors, and
configured to:
identify one or more speakers in a conversation from one or more
audio files;
segregate one or more words corresponding to each of the one or
more speakers of the one or more audio files;
identify one or more languages associated with one or more
segregated words; and
convert the one or more audio files into a text in real time based on
inferred text data for each of the one or more segregated words
corresponding to the one or more identified languages, and predicted
sequence of words in the conversation;
an audio diarization module (30), operable by the one or more processors,
operatively coupled to the audio analysis module (20) and configured to perform
speaker diarization to generate a speaker diary for the one or more speakers,
wherein the speaker diary comprises timestamp for each of the one or more
segregated words with respect to each of the one or more speakers, and a
conversational response of each of the one or more speakers against other
speakers in the conversation;
an entity extraction module (40), operable by the one or more processors,
operatively coupled to the audio diarization module (30) and configured to
extract one or more entities present in the speaker diary using a ML trained
phonetic lexicon; and
an entity search module (50), operable by the one or more processors,
operatively coupled to the entity extraction module (40) and configured to search
each of one or more extracted entities in a repository database based on n-grams,
and present related set of information in a presentable text format.
2. The system (10) as claimed in claim 1, wherein the text data is inferred for
each of the one or more segregated words corresponding to the one or more identified
languages by implementation of an acoustic model and a trigram-based language
model, where the acoustic model generates the feature vectors for different phonemes
based on an HMM model.
3. The system (10) as claimed in claim 1, wherein the sequence of words is
predicted by combining the Acoustic model with a language model.
4. The system (10) as claimed in claim 1, wherein the audio analysis module (20)
is configured to perform word embedding for the one or more audio files.
5. The system (10) as claimed in claim 1, wherein the entity search module (50)
is configured to search each of extracted one or more entities in a repository database
using the timestamp for each of the one or more segregated words with respect to
each of the one or more speakers.
6. The system (10) as claimed in claim 1, wherein the one or more entities
correspond to the intent of the speaker associated with the conversation.
7. The system (10) as claimed in claim 1, further comprising a data storage
module operable by the one or more processors and operatively coupled to the audio
analysis module (20), wherein the data storage module is configured to record and
store timestamp for each of the one or more segregated words with respect to each of
the one or more speakers.
8. A method (370) for converting a mixed language conversation into text,
comprising:
identifying, by an audio analysis module, one or more speakers in a
conversation from one or more audio files (380);
segregating, by the audio analysis module, one or more words
corresponding to each of the one or more speakers of the one or more audio
files (390);
identifying, by the audio analysis module, one or more languages
associated with one or more segregated words (400);
converting, by the audio analysis module, the one or more audio files
into a text in real time based on inferred text data for each of the segregated
one or more words corresponding to the one or more identified languages, and
predicted sequence of words in the conversation (410);
generating, by an audio diarization module, a speaker diary for the one
or more speakers, wherein the speaker diary comprises timestamp for each of
the one or more segregated words with respect to each of the one or more
speakers, and a conversational response of each of the one or more speakers
against other speakers in the conversation (420);
extracting, by an entity extraction module, one or more entities present
in the speaker diary using a ML trained phonetic lexicon (430);
searching, by an entity search module, each of one or more extracted
entities in a repository database based on n-grams (440); and
presenting, by the entity search module, related set of information in a
presentable text format (450).
9. The method (370) as claimed in claim 8, wherein converting, by the audio
analysis module, the one or more audio files into a text in real time based on inferred
text data comprises the text data inferred for each of the one or more segregated words
corresponding to the one or more identified languages by implementation of an acoustic
model and a trigram-based language model.
10. The method (370) as claimed in claim 8, wherein converting, by the audio
analysis module, the one or more audio files into a text in real time based on the
predicted sequence of words in the conversation comprises prediction by combining
the Acoustic model with a language model.
11. The method (370) as claimed in claim 8, further comprising performing, by the
audio analysis module, word embedding for the one or more audio files.
12. The method (370) as claimed in claim 8, wherein searching, by the entity
search module, each of extracted one or more entities in a repository database
comprises search using the timestamp for each of the one or more segregated words
with respect to each of the one or more speakers.
13. The method (370) as claimed in claim 8, further comprising recording and
storing, by a data storage module, timestamp for each of the one or more segregated
words with respect to each of the one or more speakers.