ABSTRACT
SPEAKER DIARIZATION AND VERIFICATION FROM A MULTI-USER CONVERSATION
The present invention relates to a server (102) and a method for performing identification of speakers in an audio file. The server (102) includes a processor (202) and a memory (204) storing programmed instructions executable by the processor (202). The processor (202) executes the programmed instructions to perform operations comprising identification of audio gaps present within the audio file including utterances of a plurality of speakers. The audio gaps correspond to portions of the audio file where audio silence is present for at least a pre-defined time period. The audio file is broken into audio segments based on the audio gaps. Numerical representations of the audio segments are generated using a sound embedding technique. A similarity between the audio segments is determined. The audio segments are clustered based on their similarity. Each cluster indicates a unique speaker of the audio file. (To be published with Fig. 3)
SPEAKER DIARIZATION AND VERIFICATION FROM A MULTI-USER CONVERSATION
FIELD OF INVENTION
The present invention generally relates to speaker diarization. More specifically, the present invention relates to speaker diarization from an audio conversation, using machine-learning techniques.
BACKGROUND
A tele-conference session is a live meeting of participants conducted over a telecommunication channel or over the Internet. Tele-conference sessions may include audio and audio-visual meetings, such as telephonic conference calls, webinars, webcasts, and peer-level web meetings. Tele-conference sessions are commonly used for business meetings, online classrooms, training sessions, lectures, and seminars.
A business organization, an educational institution, or a group of people may discuss an agenda over tele-conference sessions. It becomes crucial that the outcome of such sessions and the key contribution of each speaker are analysed. Analysing the outcome of a meeting requires identifying the contribution of each speaker. For a multi-person conversation with an uncertain number of participants, identifying the contribution of each speaker becomes a tedious process. Owing to the large differences in speaking style between speakers, the presence of passive speakers, and the varying duration of each speaker's speech, it becomes difficult to determine the number of speakers in a session and to recognise the portions of speech belonging to each speaker.
Thus, there is a need for an effective process of identification of speakers in a tele-conference session.
OBJECTS OF THE INVENTION
An object of the present invention is to achieve speaker diarization from a tele-conference session.
Another object of the present invention is to group audio segments pertaining to each speaker in a separate group.
Yet another object of the present invention is to verify grouping of audio segments pertaining to each speaker.
SUMMARY OF THE INVENTION
The summary is provided to introduce aspects related to a system and a method of identification of speakers in an audio file, and the aspects are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
In one embodiment, a method of identification of speakers in an audio file is described. The method may comprise identifying audio gaps present within the audio file including utterances of a plurality of speakers. The audio gaps correspond to portions of the audio file where audio silence is present for at least a pre-defined time period. The audio file may be broken into multiple audio segments on basis of the audio gaps. Using sound embedding, numerical representations of the multiple audio segments may be generated. Using a distance algorithm, similarity between the numerical representations of the multiple audio segments may be determined. On the basis of the similarity, the multiple audio segments may be clustered into a plurality of clusters. Each cluster of the plurality of clusters may indicate a unique speaker of the audio file. In one aspect, the similarity may be cosine similarity.
In one embodiment, a system for identification of speakers in an audio file is described. The system may comprise a processor and a memory storing programmed instructions executable by the processor. The processor may execute the programmed instructions to perform operations comprising identifying audio gaps within the audio file including utterances of a plurality of speakers. The audio gaps may be identified where silence is present within the audio file for a pre-defined time period. The audio file may be broken into multiple audio segments on basis of the audio gaps. Using sound embedding, numerical representations of the multiple audio segments may be generated. Using a distance algorithm, similarity between the numerical representations of the multiple audio segments may be determined. On the basis of the similarity, the multiple audio segments may be clustered into a plurality of clusters. Each cluster of the plurality of clusters may indicate a unique speaker of the audio file.
In one aspect, the pre-defined time period may range from 0.20 to 0.40 seconds based on length of the audio file.
In one aspect, the pre-defined time period may be set based on a list of declining values of a silence threshold.
In one aspect, the clustering of the multiple audio segments may include identifying an audio segment of longest duration among the multiple audio segments. At least one audio segment having a similarity greater than a threshold similarity compared to the audio segment of longest duration may be identified. A cluster of the plurality of clusters may be defined by including the at least one audio segment.
In one aspect, overlaps in the multiple audio segments may be detected using pre-trained machine learning models. The pre-trained machine learning models may be fine-tuned using pyannotate library.
In one aspect, dissimilarity among two or more audio segments of the multiple audio segments that are not a part of the plurality of clusters may be identified. A least dissimilar cluster from the plurality of clusters based on the dissimilarity among the two or more audio segments may be identified. The two or more audio segments may be included in the least dissimilar cluster.
In one aspect, when a count of the plurality of clusters exceeds an actual number of speakers present in the audio file, additional clusters may be identified from the plurality of clusters that are not mapped with the unique speaker of the audio file. A cluster having a least number of audio segments from the additional clusters may be identified. A pre-defined number of audio segments of a cluster of one or more clusters mapped with the unique speaker may further be identified. The pre-defined number of audio segments may have durations greater than durations of other audio segments of the cluster. Dissimilarities between the pre-defined number of audio segments and the cluster having the least number of audio segments may be determined. A normalized sum of the dissimilarities between the pre-defined number of audio segments and the cluster having the least number of audio segments may further be determined. The cluster having the least number of audio segments may be included into the cluster of the one or more clusters.
In one aspect, the dissimilarity may be determined as 1 minus cosine similarity.
In one aspect, one or more machine learning models may extract audio features from the multiple audio segments, wherein the audio features indicate dimensional aspects of sound of each unique speaker in each of the multiple audio segments. An audio signature of each unique speaker based on the audio features may be determined. Based on the audio signature of each unique speaker, it may be determined whether an audio segment of the multiple audio segments is incorrectly included within a cluster. The audio segment incorrectly included within the cluster may be shifted to another cluster of the plurality of clusters, based on a successful match of the audio signature.
In one aspect, the audio segment incorrectly included within the cluster may be determined by comparing Mel Frequency Cepstral Coefficient (MFCC) of the multiple audio segments with one or more parameters of Gaussian Mixture Model (GMM) representations. The GMM representations may indicate the audio signature of each unique speaker.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings constitute a part of the description and are used to provide further understanding of the present invention. Such accompanying drawings illustrate the embodiments of the present invention which are used to describe the principles of the present invention. The embodiments are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this invention are not necessarily to the same embodiment, and they mean at least one. In the drawings:
Fig. 1 illustrates a system for identification of speakers in an audio file through speaker diarization, in accordance with an embodiment of the present invention;
Fig. 2 illustrates a block diagram showing different components present in a server for identification of speakers, in accordance with an embodiment of the present invention; and
Fig. 3 illustrates a flowchart depicting a method for identification of speakers in an audio file through speaker diarization, in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The detailed description set forth below in connection with the appended drawings is intended as a description of various embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced. Each embodiment described in this invention is provided merely as an example or illustration of the present invention, and should not necessarily be construed as preferred or advantageous over other embodiments. The detailed description includes specific details for the purpose of providing a thorough understanding of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced without these specific details.
The terms “or” and “and/or” as used herein are to be interpreted as inclusive or meaning any one or any combination. Therefore, “A, B or C” or “A, B and/or C” mean “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
Fig. 1 illustrates a system 100 for identification of speakers in an audio file through speaker diarization, in accordance with an embodiment of the present invention. Speaker diarization is used to identify each speaker in a tele-conference session. Speaker diarization is a process of partitioning an input audio stream into homogeneous segments according to speaker identity. Mathematical algorithms, such as machine learning algorithms, are executed to identify segments of audio corresponding to different speakers and to link the segments from the same speaker. Speaker diarization is a combination of speaker segmentation and clustering. Speaker diarization is used to automatically annotate turns of the speakers using automatic speech recognition technology and provide a transcription of “who said what” in a tele-conference session.
The system 100 may include a server 102 present locally or present over a cloud network. The server 102 may be used for hosting tele-conference sessions for allowing interaction between multiple speakers. In one instance, the multiple speakers may join a tele-conference session through a plurality of user devices (104-1 through 104-n). For example, a first speaker may interact through a first user device 104-1, a second speaker may interact through a second user device 104-2, and an nth speaker may interact through an nth user device 104-n. The plurality of user devices (104-1 through 104-n) may be collectively referred to as user devices 104. Examples of the user devices 104 may include a telephone, a smartphone, a laptop, a desktop, or a wearable device.
The user devices 104 may comprise a microphone for capturing voices of their speakers. The voices of the speakers, i.e., the audio file, may be shared with the server 102 through a communication network 106. The communication network 106 may be a radio communication network, a Local Area Network (LAN), a Wide Area Network (WAN) such as the Internet, or any combination of connections. The tele-conference session may be established over Voice over Internet Protocol (VoIP) or a radio link.
Fig. 2 illustrates a block diagram showing different components present in a server 102 for identification of speakers, in accordance with an embodiment of the present invention. The server 102 may comprise an Input/Output (I/O) interface, a processor 202, and a memory 204. The memory 204 may store the audio data. The processor 202 may execute program instructions for processing the audio data for speaker diarization and verification. The memory 204 may further store a plurality of modules, where each module may include program instructions for identifying speakers present in the tele-conference session. A few such modules may include an audio splitting module 206, a sound embedding module 208, a similarity determining module 210, a clustering module 212, a false positive identification module 214, and a verification module 216. The modules 206 through 216 may be executed by the processor 202 to perform one or more steps for identification of the speakers.
The audio splitting module 206 may break an audio file into multiple audio segments. As the multiple audio segments comprise audio data of different volume and noise values, the multiple audio segments may be normalized to ensure that the volume and/or noise values of the multiple audio segments are consistent. The sound embedding module 208 may generate numerical representations of the multiple audio segments of the audio file. The similarity determining module 210 may identify similarity between the multiple audio segments. To identify the similarity, the processor 202 may apply a distance algorithm between the multiple audio segments. The clustering module 212 may cluster the multiple audio segments into a plurality of clusters, on basis of the similarity. Each cluster may indicate a unique speaker associated with the audio file. The false positive identification module 214 may determine false positive values of unique speakers. The false positive values of unique speakers may be identified by determining dissimilarity between the multiple audio segments that are not part of the clusters. The verification module 216 may verify unique speakers based on the audio signatures of the speakers. Detailed functioning of the modules 206 through 216 is provided successively.
Fig. 3 illustrates a flowchart depicting a method of speaker diarization, in accordance with an embodiment of the present invention. At step 302, the audio file may be fetched from the memory 204 of the server 102. At step 304, audio gaps present within the audio file may be identified. The audio gaps may correspond to portions of the audio file where audio silence is present for at least a pre-defined time period. The audio silence may be identified when one or more parameters, such as amplitude and perceived loudness of the audio segments, remain below a predefined value for the pre-defined time period. The pre-defined time period may be determined based on an overall length of the audio file. An optimal value of the pre-defined time period may range from 0.20 to 0.40 seconds.
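By way of a non-limiting example, a minimal sketch of such amplitude-based gap detection is given below, assuming the audio file is read with the soundfile and numpy Python libraries; the frame length, silence ratio, and 0.30 second minimum gap are illustrative values only and not parameters mandated by the invention.

    import numpy as np
    import soundfile as sf

    def find_audio_gaps(path, min_gap_s=0.30, frame_s=0.01, silence_ratio=0.05):
        audio, sample_rate = sf.read(path)
        if audio.ndim > 1:
            audio = audio.mean(axis=1)                  # mix down to mono
        frame = int(frame_s * sample_rate)
        frames = len(audio) // frame
        rms = np.array([np.sqrt(np.mean(audio[i * frame:(i + 1) * frame] ** 2))
                        for i in range(frames)])
        silent = rms < silence_ratio * rms.max()        # amplitude below a fraction of the peak level
        gaps, start = [], None
        for i, is_silent in enumerate(silent):
            if is_silent and start is None:
                start = i
            elif not is_silent and start is not None:
                if (i - start) * frame_s >= min_gap_s:  # silence held for at least the pre-defined period
                    gaps.append((start * frame_s, i * frame_s))
                start = None
        return gaps                                     # list of (gap start, gap end) in seconds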
At step 306, the audio file may be split into audio segments of optimal lengths, based on the audio gaps. The duration of the audio segments may range from 1 to 60 seconds. In one implementation, the audio file may be split using the Sound eXchange (SoX) application. The SoX application may utilise a silence threshold in a runSoxSplits() function to split the audio file into audio segments. To determine an optimal value of the silence threshold, the SoX application may be executed iteratively. In each iteration, a value of the silence threshold may be selected from a list of pre-defined values. For example, the value of the silence threshold may be selected from a list of declining values, such as 0.20, 0.15, 0.10, and 0.05, to ensure that longer audio files are split into smaller audio segments until the audio segments are shorter than a maximum value of the pre-defined time period. For instance, the maximum value of the pre-defined time period may be set to 35 seconds.
The SoX application may detect smaller gaps in the audio file by iteratively selecting a smaller silence threshold to ensure that the time periods of the audio segments are smaller than the maximum value of the pre-defined time period. The runSoxSplits() function of the SoX application may be executed using an SoX command by selecting one or more parameters. For example, a silence1 parameter is set to indicate a length of silence for determining an end of an audio segment, a {} parameter is set to replace the length of silence with a first value of a pre-defined time period, and a noise floor level parameter is set to determine a desired fraction of the maximum amplitude of the audio segments, such as 20% of the maximum amplitude of the audio file. Values of these parameters may be iteratively changed to change a value of the pre-defined time period.
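By way of a non-limiting example, a minimal sketch of a runSoxSplits()-style wrapper is given below. It invokes the SoX "silence" effect from Python through subprocess; the 0.30 second gap and the 20% noise floor mirror the values discussed above, while the function name, file-naming scheme, and the Python wrapper itself are illustrative assumptions rather than a mandated implementation.

    import subprocess

    def run_sox_splits(input_wav, out_prefix, gap_s=0.30, noise_floor="20%"):
        # Split input_wav at silences of at least gap_s seconds below noise_floor.
        cmd = [
            "sox", input_wav, f"{out_prefix}.wav",
            "silence", "1", "0.1", noise_floor,         # trim leading silence
            "1", str(gap_s), noise_floor,               # end a segment after gap_s of silence
            ":", "newfile", ":", "restart",             # write each further segment to a new file
        ]
        subprocess.run(cmd, check=True)                 # produces out_prefix001.wav, out_prefix002.wav, ...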
At step 308, overlaps between the audio segments may be detected. In one implementation, the overlaps in the audio segments may be detected by applying pre-trained machine learning models over the audio data. An overlap in audio segments may occur when two or more speakers speak during the same time period. The pre-trained machine learning models may be developed through training over an open-source package, such as the pyannotate library. At step 310, numerical representations of the audio segments may be generated using sound embedding. For generating the numerical representations, the audio segments may be mapped to vector representations, such as low-dimensional vector space numerical representations, using a machine learning model. In one implementation, the sound embedding may be a 512 vector space numerical representation of the audio segments. The machine learning model may be pre-trained using an open source toolkit, such as SpeechBrain.
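By way of a non-limiting example, a minimal sketch of generating the numerical representations with the open-source SpeechBrain toolkit is given below. The particular pre-trained x-vector model named here, which yields 512-value embeddings consistent with the dimension mentioned above, is an assumption and not a model mandated by the invention.

    import torchaudio
    from speechbrain.pretrained import EncoderClassifier

    # Pre-trained x-vector speaker-embedding model (assumed to expect 16 kHz mono audio).
    encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

    def embed_segment(path):
        signal, sample_rate = torchaudio.load(path)     # load one audio segment
        embedding = encoder.encode_batch(signal)        # tensor of shape [1, 1, 512]
        return embedding.squeeze().detach().numpy()     # 512-value numerical representation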
At step 312, the audio segments may be clustered into multiple clusters based on their similarity. For clustering the audio segments, distances between the audio segments may be identified by applying a distance metric algorithm on the numerical representations of the audio segments. Based on the distances between the audio segments, similarity among the audio segments may be identified. In one implementation, cosine similarity may be identified using an inner product of the numerical representations of the audio segments. Based on the similarity of the audio segments, the audio segments may be selected for clustering. Clustering may be performed using a clustering technique implemented using computational machine learning models, such as agglomerative hierarchical clustering, Bayesian hierarchical clustering, K-means clustering, mean-shift clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Expectation-Maximization (EM) based clustering, Gaussian Mixture Model (GMM) based clustering, or a combination thereof.
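By way of a non-limiting example, the cosine similarity and the corresponding distance may be computed from two numerical representations as sketched below, using numpy; the helper names are illustrative.

    import numpy as np

    def cosine_similarity(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        # Inner product of the vectors divided by the product of their norms.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def cosine_dissimilarity(a, b):
        return 1.0 - cosine_similarity(a, b)            # distance used for clustering decisions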
In one implementation, the clustering technique may include sorting the audio segments by their duration and selecting an audio segment of longest duration as an anchor audio segment. The anchor audio segment may be compared with other audio segments to identify at least one audio segment having the cosine similarity greater than a threshold similarity compared to the anchor audio segment. Further, a cluster of the multiple clusters may be defined by including the at least one audio segment.
The threshold similarity may be determined based on sizes of the audio segments. The size of an audio segment may be determined on basis of the duration of the audio segment. For example, when a duration of a first audio segment is determined to be greater than or equal to 5 seconds, a first size of the first audio segment may be set to 1. When the duration of the first audio segment lies in a range of 1.5 seconds to 5 seconds, the first size may be set to 0. When the duration of the first audio segment is less than 1.5 seconds, the first size may be set to -1. Similarly, a second size may be determined on basis of a duration of a second audio segment. Upon determining the first size and the second size, a total size may be determined as a sum of the first size and the second size.
The threshold similarity may be set to different values based on the total size. When the total size is determined to be equal to 2, the threshold similarity may be set to 0.75. When the total size is determined to be equal to 1, the threshold similarity may be set to 0.55. When the total size is determined to be greater than 0, the threshold similarity may be set to 0.45. When the total size is determined to be less than or equal to 0, the threshold similarity may be set to 0.35. When both the first size and the second size are large, the threshold similarity may be higher. When one or none of the first size and the second size is large, the threshold similarity may be lower. The threshold similarity may be updated by multiplication with a multiplier. For instance, when an absolute difference between the integer values of the durations of the first audio segment and the second audio segment is determined to be 1, a multiplier of value 0.9 may be selected. When both the first size and the second size are less than or equal to 1, a multiplier of value 0.75 may be selected.
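By way of a non-limiting example, the duration-based selection of the threshold similarity described above may be sketched as follows; only the 0.9 multiplier for a one-second duration difference is shown, and all helper names are illustrative.

    def segment_size(duration_s):
        if duration_s >= 5.0:
            return 1                                    # long segment
        if duration_s >= 1.5:
            return 0                                    # medium segment
        return -1                                       # short segment

    def threshold_similarity(duration1_s, duration2_s):
        total = segment_size(duration1_s) + segment_size(duration2_s)
        if total == 2:
            threshold = 0.75
        elif total == 1:
            threshold = 0.55
        elif total > 0:
            threshold = 0.45
        else:
            threshold = 0.35
        if abs(int(duration1_s) - int(duration2_s)) == 1:
            threshold *= 0.9                            # relax the threshold for unequal durations
        return threshold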
Clustering may be iteratively performed by selecting the anchor audio segment as a next longest audio segment out of the audio segments that are not clustered. Clustering the audio segments may be stopped when all the multiple audio segments are grouped into the plurality of clusters.
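By way of a non-limiting example, the anchor-based clustering loop may be sketched as follows, reusing the cosine_similarity and threshold_similarity helpers sketched above; each segment is assumed to be a dictionary holding its embedding and its duration, which is an illustrative data layout.

    def cluster_segments(segments):
        # Sort by duration so that the longest unclustered segment anchors each cluster.
        remaining = sorted(segments, key=lambda s: s["duration"], reverse=True)
        clusters = []
        while remaining:
            anchor = remaining.pop(0)
            cluster, leftover = [anchor], []
            for seg in remaining:
                sim = cosine_similarity(anchor["embedding"], seg["embedding"])
                if sim > threshold_similarity(anchor["duration"], seg["duration"]):
                    cluster.append(seg)                 # same speaker as the anchor
                else:
                    leftover.append(seg)
            clusters.append(cluster)
            remaining = leftover
        return clusters                                 # one cluster per detected speaker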
At step 314, clustering of remaining audio segments that are not part of any cluster may be performed. For clustering the remaining audio segments, a distance algorithm may be executed on all of the audio segments that are not part of any cluster. The distance algorithm may provide a distance between the numerical representations of the audio segments. In one implementation, the distance may be determined using a value obtained by subtracting the cosine similarity from a numerical value one. The distance may indicate dissimilarity between the audio segments.
A dissimilarity matrix may be obtained based on dissimilarities between all numerical representations of the audio segments. For example, a dissimilarity matrix M with dimension NxN (for N audio segments) may be generated. A record i, j in the dissimilarity matrix M may represent dissimilarity between the ith audio segment and the jth audio segment. The dissimilarity between the ith audio segment and the jth audio segment may be determined as a value obtained by subtracting the cosine similarity of the numerical representations of the ith audio segment and the jth audio segment from the numerical value one. A sum of the dissimilarities of the audio segments may be determined against all combinations of the plurality of clusters. The sum may be normalized by the length of the audio segments to find a least dissimilar cluster. The remaining audio segments may be included into the least dissimilar cluster.
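By way of a non-limiting example, the dissimilarity matrix and the assignment of a remaining segment to its least dissimilar cluster may be sketched as follows, reusing the cosine_similarity helper and the segment dictionaries from the earlier sketches; the length-normalisation shown is one possible reading of the normalisation described above.

    import numpy as np

    def dissimilarity_matrix(embeddings):
        n = len(embeddings)
        m = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                m[i, j] = 1.0 - cosine_similarity(embeddings[i], embeddings[j])
        return m                                        # record (i, j) holds the dissimilarity of segments i and j

    def least_dissimilar_cluster(segment, clusters):
        best_index, best_score = None, float("inf")
        for index, cluster in enumerate(clusters):
            total = sum(1.0 - cosine_similarity(segment["embedding"], s["embedding"])
                        for s in cluster)
            score = total / sum(s["duration"] for s in cluster)   # sum normalised by segment lengths
            if score < best_score:
                best_index, best_score = index, score
        return best_index                               # cluster that receives the remaining segment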
The dissimilarity matrix may be utilised to determine false positive values of clusters in a case when a number of the plurality of clusters is greater than a number of speakers of the audio file. The false positive values may indicate a number of the plurality of clusters exceeding the number of speakers. For determining the false positives, additional clusters other than the clusters mapped with a correct speaker are identified. In an implementation, a ResCNN_triplet AI model may be used to generate embeddings of audio components, such as Mel Frequency Cepstral Coefficient (MFCC) components. The audio components may be utilized to generate "n" clusters based on known speakers. The false positives may be corrected based on a commonality between a cluster and a speaker. The additional clusters may be processed to determine a cluster having a least number of audio segments. The cluster having the least number of audio segments may be merged with one of the clusters mapped with a unique speaker. For merging the cluster having the least number of audio segments, a pre-defined number of audio segments having durations greater than durations of other audio segments of the cluster may be selected from the other clusters. In an implementation, the pre-defined number of audio segments may be five. A sum of dissimilarities between the pre-defined number of audio segments and the cluster having the least number of audio segments may be determined, and the sum of dissimilarities may be normalized by the length of the audio file. The cluster having the least number of audio segments may be included into a cluster based on the dissimilarities normalized by the length of the audio file.
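By way of a non-limiting example, merging a surplus ("false positive") cluster into the mapped cluster it is least dissimilar to, using the five longest segments of each mapped cluster, may be sketched as follows; the data layout follows the earlier sketches and the normalisation shown is illustrative.

    def merge_surplus_cluster(surplus_cluster, mapped_clusters, top_n=5):
        best_index, best_score = None, float("inf")
        for index, cluster in enumerate(mapped_clusters):
            longest = sorted(cluster, key=lambda s: s["duration"], reverse=True)[:top_n]
            total = sum(1.0 - cosine_similarity(a["embedding"], b["embedding"])
                        for a in longest for b in surplus_cluster)
            score = total / sum(s["duration"] for s in longest + surplus_cluster)  # normalised sum of dissimilarities
            if score < best_score:
                best_index, best_score = index, score
        mapped_clusters[best_index].extend(surplus_cluster)   # absorb the surplus cluster
        return best_index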
It is to be verified whether each audio segment of the multiple audio segments is correctly included within a cluster of the plurality of clusters. For the verification of each audio segment, audio features may be extracted from the multiple audio segments. The audio features may indicate dimensional aspects of sound of each unique speaker in each of the multiple audio segments. The audio features may include MFCC determined by a spectral envelope of a sound signal. Further, an audio signature of each unique speaker may be determined based on the audio features. It may be determined whether an audio segment of the multiple audio segments is incorrectly included within the cluster. Further, the audio segment incorrectly included within the cluster may be shifted to another cluster of the plurality of clusters, based on a successful match of the audio signature.
In one implementation, Mel Frequency Cepstral Coefficient (MFCC) transformation may be applied to the multiple audio segments for the verification of each audio segment. The MFCC may indicate the short-term power spectrum of sound in the multiple audio segments. The MFCC may be used for extraction of a pre-defined number of features from each of the multiple audio segments. In an implementation, the pre-defined number of features may be forty features. The pre-defined number of features may represent a dimensional aspect of sound of each unique speaker in each of the multiple audio segments.
For verification of each audio segment using MFCC, one or more parameters of the multiple audio segments may be estimated. The one or more parameters may be estimated based on features obtained from the MFCC of the multiple audio segments. The one or more parameters may include a frame size, a frame shift, a number of Mel filterbanks, and a number of MFC coefficients. The frame size may be a size of an audio frame used to calculate the MFCC. For example, the frame size may be 20 milliseconds. The frame shift may be an amount of overlap between adjacent frames. For example, the frame shift may be 10 milliseconds. The number of Mel filterbanks may indicate a number of triangular filters used to extract Mel-scale filterbanks from a power spectrum of the audio segment. Preferably, 40 filterbanks may be used. The number of MFC coefficients may be the number of coefficients calculated from the Mel-scaled filterbank of each audio frame. Preferably, 13 MFC coefficients may be used.
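By way of a non-limiting example, MFCC features with the parameter values discussed above (20 ms frames, 10 ms shift, 40 Mel filterbanks, 13 coefficients) may be extracted as sketched below; the librosa library and the 16 kHz resampling are assumptions and are not mandated by the invention.

    import librosa

    def extract_mfcc(path, sample_rate=16000):
        y, sr = librosa.load(path, sr=sample_rate)      # load and resample one audio segment
        return librosa.feature.mfcc(
            y=y, sr=sr,
            n_mfcc=13,                                  # number of MFC coefficients
            n_fft=int(0.020 * sr),                      # 20 ms frame size
            hop_length=int(0.010 * sr),                 # 10 ms frame shift
            n_mels=40,                                  # 40 Mel filterbanks
        )                                               # array of shape (13, number of frames)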
The one or more parameters may be estimated using a Gaussian Mixture Model (GMM) representation of the audio file. The GMM may represent the audio segment as a combination of several Gaussian distributions. Each of the several Gaussian distributions may represent a cluster. The one or more parameters of each Gaussian distribution may be estimated by the mean and variance of the Gaussian distribution, which describe the location and shape of the distribution, respectively. Weights of the Gaussian distributions may represent the relative importance or proportion of each cluster. The one or more parameters of the GMM may be estimated using the Expectation-Maximization (EM) algorithm. The EM algorithm may iteratively estimate the one or more parameters of each Gaussian distribution by an expectation step and a maximization step. In the expectation step, the probability of each data point belonging to each Gaussian distribution may be computed. In the maximization step, the one or more parameters of each Gaussian distribution may be updated based on a weighted sum of the data points assigned to that Gaussian distribution.
In one implementation, the GMM representation may be used with 16 components and a “diag” covariance to obtain a unique representation of the audio signatures of the speakers. Each of the 16 components may have its own diagonal covariance, i.e., the covariance between the ith and the jth feature dimensions, when i is not equal to j (i != j), may be assumed to be zero.
The one or more parameters may denote an audio signature for the unique speaker. When the unique speaker is identified, the audio signature of the unique speaker may be matched with the available GMM distributions of the clusters to obtain a highest match. After obtaining the highest match, the association of a group of audio segments of a cluster with the unique speaker may be verified.
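By way of a non-limiting example, fitting a 16-component diagonal-covariance GMM as the audio signature of each cluster and matching a segment against the available signatures may be sketched as follows; scikit-learn's EM-based GaussianMixture is used here as an assumption, not a toolkit named by the invention.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_signature(mfcc_frames):
        # mfcc_frames: array of shape (number of frames, 13), e.g. the transpose of
        # the extract_mfcc() output pooled over all segments of one cluster.
        gmm = GaussianMixture(n_components=16, covariance_type="diag", max_iter=200)
        gmm.fit(mfcc_frames)
        return gmm                                      # audio signature of the cluster's speaker

    def best_matching_speaker(segment_mfcc_frames, signatures):
        # signatures: dictionary mapping a speaker label to its fitted GaussianMixture.
        scores = {label: gmm.score(segment_mfcc_frames) for label, gmm in signatures.items()}
        return max(scores, key=scores.get)              # highest average log-likelihood is the best match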
Above described steps 302 through 314 may be performed for identifying one or more speakers of the audio file of the tele-conference. Each speaker may correspond to a unique cluster of the multiple clusters. The unique cluster may comprise one or more audio segments associated with audio data of the corresponding speaker.
The present invention discloses an effective method of clustering audio segments of the audio file on the basis of the speech spoken by a unique speaker. Further, the present invention is helpful in identifying unique speakers in an audio file of a tele-conference session. Further, audio signatures are derived for the speakers of the audio file. The audio signatures may be utilised to verify that each audio segment of the multiple audio segments is correctly included within a cluster of the plurality of clusters.
Although the present invention is explained for post-processing of an audio file stored in a memory, it is also to be understood that the present invention is applicable for real-time processing of the audio file during progress of the tele-conference.
Although the present invention is explained in view of a centralized server having a memory for storing machine learning models, it is also to be understood that the present invention is applicable in view of a de-centralized and distributed server in which the machine learning models are distributed across multiple memories of the de-centralized and distributed server.
An interface may be used to provide input or fetch output from a cloud server. An interface may be implemented as a Command Line Interface (CLI) or a Graphical User Interface (GUI). Further, Application Programming Interfaces (APIs) may also be used for remotely interacting with a master node.
A processor may include one or more general purpose processors (e.g., INTEL® or Advanced Micro Devices® (AMD) microprocessors) and/or one or more special purpose processors (e.g., digital signal processors or Xilinx® System On Chip (SOC) Field Programmable Gate Array (FPGA) processor), MIPS/ARM-class processor, a microprocessor, a digital signal processor, an application specific integrated circuit, a microcontroller, a state machine, or any type of programmable logic array.
A memory may include, but may not be limited to, non-transitory machine-readable storage devices such as hard drives, magnetic tape, floppy diskettes, optical disks, Compact Disc Read-Only Memories (CD-ROMs), and magneto-optical disks, semiconductor memories, such as ROMs, Random Access Memories (RAMs), Programmable Read-Only Memories (PROMs), Erasable PROMs (EPROMs), Electrically Erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other type of media/machine-readable medium suitable for storing electronic instructions.
In one or more embodiments, while different steps are described as appearing as distinct acts, one or more of the steps may also be performed in different orders, simultaneously and/or sequentially. In one or more embodiments, the steps may be merged into one or more steps.
The methods described above may be executed on computational machine learning models present over the processing server. The processing server may include volatile and/or non-volatile data storage that may store data and software including machine-readable instructions. The software may include subroutines or programmed instructions for performing the methods.
In above described embodiments, the words “comprising,” “having,” “containing,” and “including,” and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.
Any combination of the above features and functionalities may be used in accordance with one or more embodiments. In the foregoing specification, embodiments have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
CLAIMS:
WE CLAIM:
1. A method of identification of speakers in an audio file, the method comprising:
identifying audio gaps present within the audio file including utterances of a plurality of speakers, wherein the audio gaps correspond to portions of the audio file where audio silence is present for at least a pre-defined time period;
breaking the audio file into multiple audio segments on basis of the audio gaps;
generating, using sound embedding, numerical representations of the multiple audio segments;
determining, using a distance algorithm, similarity between the numerical representations of the multiple audio segments; and
clustering the multiple audio segments into a plurality of clusters on basis of the similarity, wherein each cluster of the plurality of clusters indicates a unique speaker of the audio file.
2. The method as claimed in claim 1, wherein the pre-defined time period ranges from 0.20 to 0.40 seconds based on length of the audio file.
3. The method as claimed in claim 1, wherein the pre-defined time period is set based on a list of declining values of a silence threshold.
4. The method as claimed in claim 1, wherein the similarity is cosine similarity.
5. The method as claimed in claim 1, wherein the clustering of the multiple audio segments includes:
identifying an audio segment of longest duration among the multiple audio segments;
identifying at least one audio segment having a similarity greater than a threshold similarity compared to the audio segment of longest duration; and
defining a cluster of the plurality of clusters by including the at least one audio segment.
6. The method as claimed in claim 1, further comprising detecting, using pre-trained machine learning models, overlaps in the multiple audio segments, wherein the pre-trained machine learning models are fine-tuned using pyannotate library.
7. The method as claimed in claim 1, further comprising:
determining dissimilarity among two or more audio segments, of the multiple audio segments, that are not a part of the plurality of clusters;
identifying a least dissimilar cluster from the plurality of clusters based on the dissimilarity among the two or more audio segments; and
including the two or more audio segments in the least dissimilar cluster.
8. The method as claimed in claim 1, further comprising performing, when a count of the plurality of clusters exceeds an actual number of speakers present in the audio file:
identifying additional clusters, from the plurality of clusters, that are not mapped with the unique speaker of the audio file;
identifying a cluster having a least number of audio segments from the additional clusters;
identifying a pre-defined number of audio segments of a cluster of one or more clusters mapped with the unique speaker, wherein the pre-defined number of audio segments have durations greater than durations of other audio segments of the cluster;
determining dissimilarities between the pre-defined number of audio segments and the cluster having the least number of audio segments;
determining a normalized sum of the dissimilarities between the pre-defined number of audio segments and the cluster having the least number of audio segments; and
including the cluster having the least number of audio segments into the cluster of the one or more clusters.
9. The method as claimed in claim 8, wherein the dissimilarity is determined as 1 – cosine similarity.
10. The method as claimed in claim 1, further comprising:
extracting, by one or more machine learning models, audio features from the multiple audio segments, wherein the audio features indicate dimensional aspects of sound of each unique speaker in each of the multiple audio segments;
determining an audio signature of each unique speaker based on the audio features;
determining, based on the audio signature of each unique speaker, whether an audio segment of the multiple audio segments is incorrectly included within a cluster; and
shifting the audio segment incorrectly included within the cluster to another cluster of the plurality of clusters, based on a successful match of the audio signature.
11. The method as claimed in claim 10, wherein the audio segment incorrectly included within the cluster is determined by comparing Mel Frequency Cepstral Coefficient (MFCC) of the multiple audio segments with one or more parameters of Gaussian Mixture Model (GMM) representations, and wherein the GMM representations indicate the audio signature of each unique speaker.
12. A server (102) for identification of speakers in an audio file, the server (102) comprising:
a processor (202); and
a memory (204) storing programmed instructions executable by the processor (202), wherein the processor (202) executes the programmed instructions to perform operations comprising:
identifying audio gaps present within the audio file including utterances of a plurality of speakers, wherein the audio gaps correspond to portions of the audio file where audio silence is present for at least a pre-defined time period;
breaking the audio file into multiple audio segments on basis of the audio gaps;
generating, using sound embedding, numerical representations of the multiple audio segments;
determining, using a distance algorithm, similarity between the numerical representations of the multiple audio segments; and
clustering the multiple audio segments into a plurality of clusters on basis of the similarity, wherein each cluster of the plurality of clusters indicates a unique speaker of the audio file.
13. The server (102) as claimed in claim 12, wherein the pre-defined time period ranges from 0.20 to 0.40 seconds based on length of the audio file.
14. The server (102) as claimed in claim 12, wherein the pre-defined time period is set based on a list of declining values of a silence threshold.
15. The server (102) as claimed in claim 12, wherein the programmed instructions perform clustering of the multiple audio segments by:
identifying an audio segment of longest duration among the multiple audio segments;
identifying at least one audio segment having a similarity greater than a threshold similarity compared to the audio segment of longest duration; and
defining a cluster of the plurality of clusters by including the at least one audio segment.
16. The server (102) as claimed in claim 12, wherein the programmed instructions execute pre-trained machine learning models for detection of overlaps in the multiple audio segments and wherein the pre-trained machine learning models are fine-tuned using pyannotate library.
17. The server (102) as claimed in claim 12, wherein the programmed instructions perform:
determining dissimilarity among two or more audio segments, of the multiple audio segments, that are not a part of the plurality of clusters;
identifying a least dissimilar cluster from the plurality of clusters based on the dissimilarity among the two or more audio segments; and
including the two or more audio segments in the least dissimilar cluster.
18. The server (102) as claimed in claim 12, wherein when a count of the plurality of clusters exceeds an actual number of speakers present in the audio file, the programmed instructions perform:
identifying additional clusters, from the plurality of clusters, that are not mapped with the unique speaker of the audio file;
identifying a cluster having the least number of audio segments from the additional clusters;
identifying a pre-defined number of audio segments of a cluster of one or more clusters mapped with the unique speaker, wherein the pre-defined number of audio segments have durations greater than durations of other audio segments of the cluster;
determining dissimilarities between the pre-defined number of audio segments and the cluster having the least number of audio segments;
determining a normalized sum of the dissimilarities between the pre-defined number of audio segments and the cluster having the least number of audio segments; and
including the cluster having the least number of audio segments into the cluster.
19. The server (102) as claimed in claim 18, wherein the dissimilarity is determined as 1 – cosine similarity.
20. The server (102) as claimed in claim 12, wherein the programmed instructions perform:
extracting, by one or more machine learning models, audio features from the multiple audio segments, wherein the audio features indicate dimensional aspects of sound of each unique speaker in each of the multiple audio segments;
determining an audio signature of each unique speaker based on the audio features;
determining, based on the audio signature of each unique speaker, whether an audio segment of the multiple audio segments is incorrectly included within a cluster; and
shifting the audio segment incorrectly included within the cluster to another cluster of the plurality of clusters, based on a successful match of the audio signature.
21. The server (102) as claimed in claim 20, wherein the audio segment incorrectly included within the cluster is determined by comparing Mel Frequency Cepstral Coefficient (MFCC) of the multiple audio segments with one or more parameters of Gaussian Mixture Model (GMM) representations, and wherein the GMM representations indicate the audio signature of each unique speaker.
Dated this: 25.05.2023
JAYANTA PAL
IN/PA-172
OF REMFRY & SAGAR
ATTORNEY FOR THE APPLICANT(S)