
Methods And Systems For Assessment Of Speech Intelligibility In Dysarthric Subjects

Abstract: This disclosure provides systems and methods for assessing speech intelligibility in dysarthric subjects. Traditional approaches are susceptible to inter-subject and inter-clinician differences, apart from causing discomfort to subjects. Conventional automated approaches depend heavily on the quantity and quality of training datasets. Furthermore, complex machine learning based metrics are prone to over-fitting, making it difficult to generalize at a large scale. The present disclosure addresses these shortcomings by firstly requiring only a few words to be uttered by the subject. A recorded speech of the utterances may be converted to text. A raw intelligibility score is computed by comparing the similarity of the uttered words to the actual words in the form of character strings. A final intelligibility score is then obtained by correlating with a perceptual intelligibility score. [To be published with FIG. 2]


Patent Information

Application #
202021008649
Filing Date
28 February 2020
Publication Number
36/2021
Publication Type
INA
Invention Field
ELECTRONICS
Status
Email
kcopatents@khaitanco.com
Parent Application
Patent Number
Legal Status
Grant Date
2024-09-06
Renewal Date

Applicants

Tata Consultancy Services Limited
Nirmal Building, 9th Floor, Nariman Point Mumbai 400021 Maharashtra, India

Inventors

1. TRIPATHI, Ayush
Tata Consultancy Services Limited Yantra Park, Opp Voltas HRD Trg Center, Subhash Nagar, Pokhran Road No. 2, Thane West 400601 Maharashtra, India
2. BHOSALE, Swapnil Prakash
Tata Consultancy Services Limited Yantra Park, Opp Voltas HRD Trg Center, Subhash Nagar, Pokhran Road No. 2, Thane West 400601 Maharashtra, India
3. KOPPARAPU, Sunil Kumar
Tata Consultancy Services Limited Yantra Park, Opp Voltas HRD Trg Center, Subhash Nagar, Pokhran Road No. 2, Thane West 400601 Maharashtra, India

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION (See Section 10 and Rule 13)
Title of invention:
METHODS AND SYSTEMS FOR ASSESSMENT OF SPEECH INTELLIGIBILITY IN DYSARTHRIC SUBJECTS
Applicant
Tata Consultancy Services Limited A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description
The following specification particularly describes the invention and the manner in which it is to be performed.

TECHNICAL FIELD [001] The disclosure herein generally relates to neurogenic speech disorders, and, more particularly, to systems and methods for assessing speech intelligibility in dysarthric subjects.
BACKGROUND
[002] Dysarthria refers to a group of neurogenic speech disorders characterized by abnormalities in the strength, speed, range, steadiness, tone, or accuracy of movements required for breathing, phonatory, resonatory, articulatory, or prosodic aspects of speech production. Proper evaluation of speech intelligibility is a key diagnostic step in identifying the progress of patients. Traditionally, this assessment has been performed by trained Speech Language Pathologists (SLPs) who use different measures like the Hoehn and Yahr scale, Assessment of Intelligibility of Dysarthric Speech (AIDS) scale, Frenchay Dysarthria Assessment (FDA) scale, Unified Parkinson’s Disease Rating Scale (UPDRS) and Scale for Assessment and Rating of Ataxia (SARA). However, the traditional methods are susceptible to errors including but not limited to inter-listener differences, type of stimuli, phonetic context, vocal quality and patterns of articulation, and therefore the development of a standardized method for intelligibility estimation is important.
[003] Instrumentation-based direct assessment techniques visualize the velopharyngeal closing mechanism using videofluoroscopy or magnetic resonance imaging (MRI) and provide information about velopharyngeal gap size and shape. However, these methods are invasive and may cause pain and discomfort to the patients. As an alternative, nasometry seeks to measure nasalance, the modulation of the velopharyngeal opening area, by estimating the acoustic energy from the nasal cavity relative to the oral cavity. Nasalance scores yield good correlation with perceptual judgment of hypernasality; however, properly administering the evaluation requires significant training and it cannot be used to assess patients from existing speech recordings. Due to these shortcomings in the traditional evaluation methods, automated assessments with minimal human intervention are needed to evaluate speech intelligibility.

SUMMARY
[004] Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
[005] In an aspect, there is provided a processor implemented method for assessing speech intelligibility in a subject, the method comprising the steps of: obtaining, via a first set of hardware processors, a set of reference words to be provided to the subject being assessed, wherein the set of reference words is an optimal subset of n words in a dysarthric speech database and cardinality r of the set of reference words is a pre-determined optimal number based on a cost function; converting, via a second set of hardware processors serving as an Automatic Speech Recognition (ASR) system, spoken utterances of the reference words by the subject being assessed to a corresponding text; computing for each of the spoken utterances, a first metric and a second metric, via the first set of hardware processors, representative of a measure of similarity between a string of characters s1 of length l1 corresponding to each converted text and a string of characters s2 of length l2 associated with a corresponding word from the set of reference words; computing, via the first set of hardware processors, a first intelligibility score Ism(s1, s2) based on a total number of subsequence matches between the string of characters s1 and the string of characters s2 for each word from the set of reference words; computing, via the first set of hardware processors, a second intelligibility score Ild(s1, s2) based on a cost of transforming the string of characters s1 to the string of characters s2 for each word from the set of reference words; computing, via the first set of hardware processors, a raw intelligibility score (Im) as a weighted average of (i) an average of the first intelligibility score and (ii) an average of the second intelligibility score, for the words in the set of reference words; and obtaining, via the first set of hardware processors, a final intelligibility score for the subject being assessed based on a correlation between the computed raw intelligibility score (Im) and a perceptual intelligibility score (Ip) obtained from clinicians.

[006] In another aspect, there is provided a system for assessing speech intelligibility in a subject, the system comprising: one or more data storage devices operatively coupled to one or more hardware processors and configured to store instructions configured for execution via the one or more hardware processors to: obtain a set of reference words to be provided to the subject being assessed, wherein the set of reference words is an optimal subset of n words in a dysarthric speech database and cardinality r of the set of reference words is a pre-determined optimal number based on a cost function; convert spoken utterances of the reference words by the subject being assessed to a corresponding text, via some of the one or more hardware processors serving as an Automatic Speech Recognition (ASR) system; compute a first metric and a second metric representative of a measure of similarity between a string of characters s1 of length l1 corresponding to each converted text and a string of characters s2 of length l2 associated with a corresponding word from the set of reference words; compute for each of the spoken utterances a first intelligibility score Ism(s1, s2) based on a total number of subsequence matches between the string of characters s1 and the string of characters s2 for each word from the set of reference words; compute a second intelligibility score Ild(s1, s2) based on a cost of transforming the string of characters s1 to the string of characters s2 for each word from the set of reference words; compute a raw intelligibility score (Im) as a weighted average of (i) an average of the first intelligibility score and (ii) an average of the second intelligibility score, for the words in the set of reference words; and obtain a final intelligibility score for the subject being assessed based on a correlation between the computed raw intelligibility score (Im) and a perceptual intelligibility score (Ip) obtained from clinicians.
[007] In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: obtain a set of reference words to be provided to the subject being assessed, wherein the set of reference words is an optimal subset of n words in a dysarthric speech database and cardinality r of the set of reference words is a pre-determined optimal number based on a cost function; convert spoken utterances of the reference words by the subject being assessed to a corresponding text, via some of the one or more hardware processors serving as an Automatic Speech Recognition (ASR) system; compute a first metric and a second metric representative of a measure of similarity between a string of characters s1 of length l1 corresponding to each converted text and a string of characters s2 of length l2 associated with a corresponding word from the set of reference words; compute for each of the spoken utterances a first intelligibility score Ism(s1, s2) based on a total number of subsequence matches between the string of characters s1 and the string of characters s2 for each word from the set of reference words; compute a second intelligibility score Ild(s1, s2) based on a cost of transforming the string of characters s1 to the string of characters s2 for each word from the set of reference words; compute a raw intelligibility score (Im) as a weighted average of (i) an average of the first intelligibility score and (ii) an average of the second intelligibility score, for the words in the set of reference words; and obtain a final intelligibility score for the subject being assessed based on a correlation between the computed raw intelligibility score (Im) and a perceptual intelligibility score (Ip) obtained from clinicians.
[008] In accordance with an embodiment of the present disclosure, the one or more hardware processors are configured to predict a class of intelligibility to which the subject being assessed belongs based on a confidence score associated with the computed raw intelligibility score (Im).
[009] In accordance with an embodiment of the present disclosure, the one or more hardware processors are configured to obtain a set of reference words to be provided to the subject being assessed by: identifying possible subsets of the n words in the dysarthric speech database; computing the raw intelligibility score for each of the subsets; obtaining a correlation of the raw intelligibility score and the perceptual intelligibility score for each of the subsets; determining a value of r that minimizes the cost function for the subset having a highest correlation as the optimum number; and identifying the subset having the highest correlation as the set of reference words.

[010] In accordance with an embodiment of the present disclosure, the first metric is the sequence matcher technique and the second metric is the Levenshtein distance.
[011] In accordance with an embodiment of the present disclosure, the one or more hardware processors are configured to obtain a final intelligibility score based on the Pearson Correlation (PC) between the estimated raw intelligibility score (Im) and the perceptual intelligibility score (Ip).
[012] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[013] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[014] FIG.1 illustrates an exemplary block diagram of a system for assessing speech intelligibility in a subject, in accordance with some embodiments of the present disclosure.
[015] FIG.2 illustrates a functional flow diagram for the system of FIG.1, in accordance with some embodiments of the present disclosure.
[016] FIG.3 illustrates an exemplary flow diagram of a computer implemented method for assessing speech intelligibility in a subject, in accordance with some embodiments of the present disclosure.
[017] FIG.4 illustrates a scatter plot between a perceptual intelligibility score and a raw intelligibility score, in accordance with some embodiments of the present disclosure.
[018] FIG.5 illustrates a graphical illustration of predicted classes of intelligibility based on assessing the speech intelligibility, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS

[019] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
[020] Dysarthria is a common symptom in neurological disorders such as Parkinson's Disease (PD), Huntington’s Disease (HD), Amyotrophic Lateral Sclerosis (ALS), cerebral palsy or neurological trauma, manifesting as weakness, paralysis, or a lack of co-ordination of the motor-speech system. Additionally, dysarthria may also arise after a traumatic head injury or as a side effect of a brain tumor. Since the rapid movement of the velum to produce speech requires very precise motor control, dysarthria results in a reduction in intelligibility, audibility, naturalness, and efficiency of vocal communication. Some common effects that can be observed in the speech of a dysarthric patient are slurred speech, hoarse and choppy sound, hypernasal voice and articulation errors.
[021] Assessment of dysarthria and its severity level is an important diagnostic step and is crucial in order to understand the patient’s progress and the effect of medication and speech therapy. Traditional methods of assessment are susceptible to errors due to dependency on the expertise of the Speech Language Pathologist (SLP), type of stimuli, phonetic context, vocal quality and patterns of articulation. Instrumentation-based assessment techniques cause pain and discomfort to the patients. Methods for measuring and quantifying abnormalities in speech arising due to dysarthria in an automated manner may be broadly categorized into two main groups, namely, (a) features based on speech signal processing and (b) supervised methods based on machine learning and deep learning. Various temporal and spectral features have been used to classify dysarthric speech based on the type of dysarthria, to assess intelligibility and to detect speech disorders.

Representations such as neurograms, spectro-temporal modulation features and long-term spectro-temporal modulation spectrograms are used for intelligibility assessment or disorder detection. A spectrogram may also be given as input to a Convolutional Neural Network (CNN) in different applications such as automatic recognition of normal and dysarthric speech. However, due to high inter-subject variability, simple acoustic features fail to capture the complex manifestations of the disorder in dysarthric speech. The more complex machine learning based metrics are prone to over-fitting to the necessarily small disease-specific speech datasets on which they are trained, making it difficult to generalize at a large scale. Mainly due to the non-inclusion of disordered speech in the training of Automatic Speech Recognition (ASR) systems, the performance of an ASR on such speech is poor.
[022] The Applicant has addressed these problems by providing systems and methods for automated assessment of speech intelligibility that may be used by clinicians to evaluate the intelligibility of dysarthric patients. This helps them in examining the outcome of speech therapy and medications without causing discomfort to the patients, while achieving reliability in predicting an intelligibility rating (score) that is equivalent to those predicted by experienced and trained clinicians. The systems and methods of the present disclosure may be operated on a recorded voice and do not require the patient to physically visit a clinician for the speech intelligibility assessment, making the approach scalable to large populations. Furthermore, dependency on large training datasets is averted by the systems and methods of the present disclosure, which makes the intelligibility assessment data, time and effort efficient without compromising on the quality of the computed intelligibility rating. It may be noted from the experimental evaluation provided later in the description that the computed intelligibility rating in accordance with the approach of the present disclosure is comparable with the perceptual intelligibility ratings provided by the SLPs, which today serve as the gold standard for speech intelligibility assessment.
[023] In the context of the present disclosure, it may be noted that the expressions ‘rating’ and ‘score’; ‘subject’, ‘speaker’ and ‘patient’; ‘clinician’ and ‘Speech Language Pathologist (SLP)’; ‘category’ and ‘class’ may be used interchangeably.
[024] Referring now to the drawings, and more particularly to FIG. 1 through FIG.5, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.
[025] FIG.1 illustrates an exemplary block diagram of a system 100 for assessing speech intelligibility in a subject, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more processors 104, communication interface device(s) or input/output (I/O) interface(s) 106, and one or more data storage devices or memory 102 operatively coupled to the one or more processors 104. The one or more processors 104 that are hardware processors can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, graphics controllers, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) are configured to fetch and execute computer-readable instructions stored in the memory. In the context of the present disclosure, the expressions ‘processors’ and ‘hardware processors’ may be used interchangeably. In an embodiment, the system 100 can be implemented in a variety of computing systems, such as laptop computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[026] I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface(s) can include one or more ports for connecting a number of devices to one another or to another server.

[027] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, one or more modules (not shown) of the system 100 can be stored in the memory 102.
[028] FIG.2 illustrates a functional flow diagram 200 for the system of FIG.1 while FIG.3 illustrates an exemplary flow diagram of a computer implemented method 300 for assessing speech intelligibility in a subject, in accordance with some embodiments of the present disclosure. In an embodiment, the system 100 includes one or more data storage devices or memory 102 operatively coupled to the one or more processors 104 and is configured to store instructions configured for execution of steps of the method 300 by the one or more processors 104. The steps of the method 300 will now be explained in detail with reference to the components of the system 100 of FIG.1 and the functional flow diagram of FIG.2 for the same. Although process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.
[029] In an embodiment of the present disclosure, the high-level functional steps of FIG.2 are mainly executed by a first set of hardware processors that are part of the one or more hardware processors 104. Particularly, a step 304 described hereinafter may be executed by a second set of hardware processors that are part of the one or more hardware processors 104 and serve as an Automatic Speech Recognition (ASR) system.
[030] Typically, a patient is required to record utterances corresponding to a large set of words from a dysarthric speech database. For instance, the Universal Access (UA) Speech Database is a publicly available dysarthric speech database, provided by H. Kim et al. in ISCA Interspeech, 2008, containing recordings from 15 speakers with cerebral palsy, wherein each speaker spoke 765 isolated words which were recorded in 3 equally sized blocks of 255 words each at a sampling frequency of 48 kHz.
[031] Table 1 below provides a summary of speaker information (UA Speech Database) for reference, wherein the Perceptual Speech Intelligibility score is provided by clinicians (SLPs). Table 1:

Speaker ID   Age   Perceptual Speech Intelligibility score (Ip)   Dysarthria Diagnosis
M04 >18 Very low (2%) Spastic
F03 51 Very low (6%) Spastic
M12 19 Very low (7.4%) Mixed
M01 >18 Very low (15%) Spastic
M07 58 Low (28%) Spastic
F02 30 Low (29%) Spastic
M16 - Low (43%) Spastic
M05 21 Medium (58%) Spastic
M11 48 Medium (62%) Athetoid
F04 18 Medium (62%) Athetoid
M09 18 High (86%) Spastic
M14 40 High (90.4%) Spastic
M08 28 High (93%) Spastic
M10 21 High (93%) Mixed
F05 22 High (95%) Spastic

[032] Table 2 below provides the 3 equally sized blocks (B1, B2, B3) of 255 words each. Number of samples per subject is shown in {}. Table 2:

Full {765} B1{255} B2{255} B3{255}
CC {57} CC (1) {19} CC (2) {19} CC (3) {19}
Letters {78} Letters (1) {26} Letters (2) {26} Letters (3) {26}
Digits {30} Digits (1) {10} Digits (2) {10} Digits (3) {10}
CW {300} CW (1) {100} CW (2) {100} CW (3) {100}
UW {300} UW (1) {100} UW (2) {100} UW (3) {100}
Out of the 255 words per block, 155 words that include 10 digits, 26 radio alphabet letters, 19 computer commands and 100 common words selected from the Brown corpus of written English are repeated across the blocks. The 100 uncommon words, selected from children’s novels digitized by Project Gutenberg using a greedy algorithm that maximized token counts of infrequent biphones, differ across the three blocks.
[033] In order to compute the intelligibility rating of a speaker, five naive listeners were asked to provide orthographic transcriptions for the dysarthric utterances. For each transcription given by the listener, the percentage of correct responses was calculated. The correct percentage was then averaged across the five listeners to obtain each speaker’s intelligibility based on the perception (Ip) of the 5 listeners. The perceptual intelligibility ratings are in the range of 2% to 95%. Further the speakers are classified as shown in Table 3 below. Table 3:

Category Very Low Low Medium High
(Ip) range 0-25% 26%-50% 51%-75% 76%-100%
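By way of illustration only, the perceptual scoring of paragraph [033] and the categorization of Table 3 may be sketched in Python as follows; the word-by-word comparison of transcriptions and the function names are assumptions made for the sketch.

```python
# Illustrative sketch: perceptual intelligibility score (Ip) as the percentage of
# correctly transcribed words, averaged over naive listeners (paragraph [033]),
# then mapped to the categories of Table 3. The word-level comparison below is an
# assumption for illustration.
def perceptual_intelligibility(reference_words, listener_transcriptions):
    """reference_words: words spoken by the subject; listener_transcriptions:
    one list of transcribed words per naive listener. Returns Ip in percent."""
    per_listener = []
    for transcript in listener_transcriptions:
        correct = sum(ref == hyp for ref, hyp in zip(reference_words, transcript))
        per_listener.append(100.0 * correct / len(reference_words))
    return sum(per_listener) / len(per_listener)

def intelligibility_category(ip):
    # Category ranges from Table 3
    if ip <= 25:
        return "Very Low"
    if ip <= 50:
        return "Low"
    if ip <= 75:
        return "Medium"
    return "High"
```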
[034] It may be noted from the above description of a traditional approach to assess speech intelligibility that a speaker who is a dysarthric patient is required to utter a large number of words (765 words in the above example), which may be challenging since dysarthria impairs precise control of the velum and this inability increases with an increase in the severity of dysarthria. To address the discomfort that the dysarthric patient is subjected to, the present disclosure determines an optimal number of reference words from the dysarthric speech database that the patient needs to speak in order to assess the speech intelligibility.
[035] In an embodiment, as shown in FIG.2, the high-level functional flow involves the subject being assessed entering a unique identifier (ID) and being asked to record utterances corresponding to the determined reference words. In an embodiment, a speech recorder (recording at 16 kHz), active for a predefined period, say 3 seconds (depending on the reference words), is enabled. This recording process continues for all the reference words and the recorded data is provided to a pre-trained Automatic Speech Recognition (ASR) system resulting in a string of characters as an output. The string of characters is compared with the characters in the corresponding original reference words (ground truth) using two different metrics. In an embodiment, the metrics used may be the Levenshtein distance and the sequence matcher technique to obtain a raw intelligibility score. The raw intelligibility score is then correlated with the Perceptual Speech Intelligibility score obtained from the clinicians to obtain a final intelligibility score. In an embodiment, the obtained raw intelligibility score, which is in a range of 0 to 100, may need to be normalized against a set of predefined baseline scores to obtain a scaled score before it is correlated with the Perceptual Speech Intelligibility score.
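By way of illustration only, the per-word recording step outlined above may be sketched as follows; the use of the third-party sounddevice and soundfile packages and the file-naming scheme are assumptions, while the 16 kHz rate and the 3 second window follow paragraph [035].

```python
# Illustrative sketch: record one utterance per reference word at 16 kHz for a
# fixed window (paragraph [035]). Assumes the third-party 'sounddevice' and
# 'soundfile' packages; the file-naming scheme is an assumption.
import sounddevice as sd
import soundfile as sf

FS = 16_000        # sampling rate of the recorded speech
DURATION_S = 3     # per-word recording window in seconds

def record_reference_words(subject_id, reference_words):
    for word in reference_words:
        input(f"Press Enter and then say the word: {word}")
        audio = sd.rec(int(DURATION_S * FS), samplerate=FS, channels=1, dtype="int16")
        sd.wait()  # block until the recording window elapses
        sf.write(f"{subject_id}_{word}.wav", audio, FS)  # saved for the ASR step
```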
[036] Accordingly, in an embodiment of the present disclosure, the one or more processors 104, are configured to obtain, at step 302, a set of reference words to be provided to the subject being assessed, wherein the set of reference words is an optimal subset of n words in a dysarthric speech database. The number of reference words or the cardinality r of the set of reference words is a pre-determined optimal number based on a cost function. Method for determining the optimal number is explained later in the description. The selection of the dysarthric speech database may depend on the language spoken by the subject, ethnicity, age, and the like for an effective assessment. Determining the optimal number of reference words is a one-time activity for the selected dysarthric speech database.
[037] In an embodiment of the present disclosure, the one or more processors 104, are configured to convert, at step 304, spoken utterances of the reference words by the subject being assessed to a corresponding text. The one or more processors that convert the spoken utterances to text may be any pre-trained ASR system known in the art. In an embodiment, Mozilla DeepSpeech may be used.
[038] In an experimental setup, a DeepSpeech-1 (DS) model, pretrained using 1000 hours of English speech from the LibriSpeech corpus, was used as the ASR. The DS model takes Mel frequency cepstral coefficients (MFCC) as input and passes them through 5 layers, in which layers 1, 2, 3 and 5 are feedforward dense layers with Rectified Linear Unit (ReLU) activation and layer 4 is a recurrent layer with 2048 Long Short Term Memory (LSTM) units. For an input sequence of MFCCs, the DS model uses a context of 9 preceding and 9 succeeding frames and outputs a probability distribution over characters in a character set A. A greedy decoding over the output probability distribution is performed to get a final string of characters. In accordance with the present disclosure, a language model is not used with the ASR in order to retain mispronunciations, which are vital for intelligibility assessment. The output of the DS model for the word scissors for different speakers is presented in Table 4 below.
[039] Table 4: Character string obtained from the DeepSpeech (DS) model.

Speaker ID Perceptual Speech Intelligibility score (Ip) DS output
M04 Very low (2%) m-m
F03 Very low (6%) -
M12 Very low (7.4%) o-h
M01 Very low (15%) e
M07 Low (28%) o
F02 Low (29%) s-e-s-e-b-o
M16 Low (43%) f-e-s-a
M05 Medium (58%) s-i-s-a-s-s

M11 Medium (62%) s-o-t-h-e-r
F04 Medium (62%) a-n-e-r-i
M09 High (86%) s-i-z-e-r-r-s
M14 High (90.4%) s-s-i-s-a-s
M08 High (93%) s-a-r-r-s
M10 High (93%) s-i-s-s-o-r-s-s
F05 High (95%) s-i-s-o-r-s
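By way of illustration of the ASR step of paragraph [038], a minimal sketch using the Mozilla DeepSpeech Python bindings is given below; the model file name is an assumption, and no external scorer (language model) is attached, consistent with the disclosure.

```python
# Illustrative sketch: convert a recorded 16 kHz, 16-bit mono utterance into a
# character string with a pretrained DeepSpeech model (paragraph [038]). The
# model file name is an assumption; no language model / external scorer is
# enabled, so the subject's mispronunciations are retained in the output.
import wave
import numpy as np
from deepspeech import Model

def transcribe(wav_path, model_path="deepspeech-0.9.3-models.pbmm"):
    ds = Model(model_path)  # pretrained acoustic model only (no scorer attached)
    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return ds.stt(audio)    # decoded character string, e.g. "sisors" for scissors
```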
[040] In an embodiment of the present disclosure, the one or more processors 104, are configured to compute for each of the spoken utterances, at step 306, a first metric and a second metric, via the first set of hardware processors, representative of a measure of similarity between a string of characters s1 of length l1 corresponding to each converted text and a string of characters s2 of length l2 associated with a corresponding word from the set of reference words.
[041] In an embodiment of the present disclosure, the first metric is the sequence matcher technique. The longest contiguous matching subsequence in the string of characters s1 and the string of characters s2 is identified, and the procedure is applied recursively to the portions of the strings to the left and to the right of the matching subsequence. In an embodiment of the present disclosure, the one or more processors 104, are configured to compute, at step 308, a first intelligibility score Ism(s1, s2) based on a total number of subsequence matches between the string of characters s1 and the string of characters s2 for each word from the set of reference words. For instance, the first intelligibility score, being a sequence matcher intelligibility score, may be represented as Ism(s1, s2) = 2m / (l1 + l2), where m represents the total number of subsequence matches between the strings of characters s1 and s2.
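As an illustration of the first metric, the sequence matcher technique of paragraph [041] corresponds to the ratio computed by Python's difflib; the helper name below is an assumption.

```python
# Illustrative sketch: first intelligibility score via the sequence matcher
# technique (paragraph [041]). difflib.SequenceMatcher.ratio() returns
# 2*m / (l1 + l2), where m is the total number of characters covered by the
# recursively found longest contiguous matching subsequences.
from difflib import SequenceMatcher

def sequence_matcher_score(s1: str, s2: str) -> float:
    return SequenceMatcher(None, s1, s2).ratio()

# Example: sequence_matcher_score("sisors", "scissors") is about 0.857.
```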
[042] In an embodiment of the present disclosure, the second metric is the Levenshtein distance or the edit distance. In an embodiment of the present disclosure, the one or more processors 104, are configured to compute, at step 310, a second intelligibility score Ild(s1, s2) based on a cost of transforming the string of characters s1 to the string of characters s2 for each word from the set of reference words, wherein the operations involved in the transforming step include deletion, insertion and substitution. For instance, the second intelligibility score, being a Levenshtein intelligibility score, may be represented as Ild(s1, s2) = 1 - d(s1, s2) / max(l1, l2), where d(s1, s2) represents the edit distance between the strings of characters s1 and s2.
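As an illustration of the second metric of paragraph [042], an edit-distance based score may be sketched as follows; the normalization by max(l1, l2) mirrors the formula above and the helper names are assumptions.

```python
# Illustrative sketch: second intelligibility score from the edit (Levenshtein)
# distance (paragraph [042]), counting deletions, insertions and substitutions.
def levenshtein_distance(s1: str, s2: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_score(s1: str, s2: str) -> float:
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))

# Example: levenshtein_score("sisors", "scissors") is 0.75 (distance 2, max length 8).
```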
[043] In an embodiment of the present disclosure, the one or more processors 104, are configured to compute, at step 312, a raw intelligibility score (Im) as a weighted average of (i) an average of the first intelligibility score and (ii) an average of the second intelligibility score, for the words in the set of reference words. Then a correlation between the computed raw intelligibility score (Im) and the perceptual intelligibility score (Ip) obtained from the clinicians is analyzed.
[044] Accordingly, in an embodiment of the present disclosure, the one or more processors 104, are configured to obtain, at step 314, a final intelligibility score for the subject being assessed based on a correlation between the computed raw intelligibility score (Im) and the perceptual intelligibility score (Ip). The final intelligibility score for the subject being assessed is a measure of the intelligibility of the subject having dysarthria. In an embodiment, the final intelligibility score is based on the Pearson Correlation (PC) between the estimated raw intelligibility score (Im) and the perceptual intelligibility score (Ip).
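By way of illustration of paragraphs [043] and [044], a minimal sketch of the raw score and its Pearson Correlation with the perceptual score is given below; the equal weighting of the two metrics is an assumption.

```python
# Illustrative sketch: raw intelligibility score (Im) as a weighted average of the
# per-word sequence matcher and Levenshtein scores (paragraph [043]), and the
# Pearson Correlation with perceptual scores (Ip) across speakers (paragraph [044]).
# The equal weight w = 0.5 is an assumption.
from statistics import mean
import numpy as np

def raw_intelligibility(ism_scores, ild_scores, w=0.5):
    """ism_scores, ild_scores: per-word scores over the r reference words."""
    return w * mean(ism_scores) + (1.0 - w) * mean(ild_scores)

def pearson_correlation(im_per_speaker, ip_per_speaker):
    """Pearson Correlation (PC) between Im and Ip values across speakers."""
    return float(np.corrcoef(im_per_speaker, ip_per_speaker)[0, 1])
```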
[045] It may be noted from the above steps that the subject being assessed is not required to endure a tedious process of firstly speaking a large number of words. Secondly, a recorded speech may be used and hence the approach provided in the present disclosure can be performed remotely. Since the subject is only required to speak a few words (set of reference words) which are a subset of the dysarthric speech database, the computation time and effort is also considerably reduced as compared to the art.
[046] In an embodiment of the present disclosure, the step of obtaining a set of reference words to be provided to the subject being assessed comprises identifying all possible subsets of the n words in the dysarthric speech database. For instance, to identify r reference words from the dysarthric speech database containing n words, let s be the set of n words such that s = {w1, w2, ..., wn}, and let the total number of subsets of cardinality r that can be formed from s be T, namely S1(r), S2(r), ..., ST(r).
[047] For each subset Sk(r), wherein k = 1 to T, a raw intelligibility score is computed as shown below:
Im(Sk(r)) = (1/r) * sum of Im(w) over the r words w in Sk(r),
where Im(w) is the raw intelligibility score of the word w and Im(Sk(r)) is the average intelligibility score over all the r words in that subset.
[048] Then a correlation of the raw intelligibility score and the perceptual intelligibility score (Ip) is obtained for each of the subsets, i.e., cor(Im(Sk(r)), Ip).
[049] In step 302, as described above, a pre-determined optimal number r of reference words is obtained based on a cost function. A value of r that minimizes the cost function for the subset having the highest correlation is the optimal value that represents the optimal number of best words from the dysarthric speech database to be used for assessing speech intelligibility. This may be represented as below.
[050] A cost function is computed, wherein a is a parameter in the range (0,1) that can be chosen based on the ease in collecting data. For instance, when the number of words that can be spoken has to be kept very low, a is chosen close to 0. Then the value of r for which the cost function is minimized is found. Using only r reference words averts processing of a large number of words as needed in the approaches of the art and also requires the subject being assessed to speak only these r words, thereby minimizing discomfort that may be caused. For the character strings obtained from DeepSpeech and provided in Table 4 above, the determined r = 5 reference words and the computed metrics in accordance with the present disclosure are provided in Table 5 below. Table 5:

Sequence Matcher Ism Levenshtein Distance Ild
Word Pearson Correlation Word Pearson Correlation
dispossess 0.9229 dodgers 0.9654
sandpipers 0.9075 naturalization 0.9332
cowhide 0.8936 cowhide 0.9159
sierra 0.8904 sierra 0.9100
displeasure 0.8875 supervision 0.8998
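Returning to the reference-word selection of paragraphs [046] to [050], a minimal sketch is given below; the exhaustive enumeration of subsets and the exact form of the cost function are assumptions (in practice the search space must be restricted or pruned for a large vocabulary).

```python
# Illustrative sketch of the reference-word selection (paragraphs [046]-[050]):
# for a candidate cardinality r, every r-word subset is scored by the Pearson
# Correlation between its subset-averaged raw scores (Im) and the perceptual
# scores (Ip) across speakers; a cost function then trades correlation against r.
# The cost-function form and exhaustive enumeration are assumptions.
from itertools import combinations
from statistics import mean
import numpy as np

def best_subset(word_scores, ip, r):
    """word_scores: {speaker: {word: per-word raw score}}; ip: {speaker: Ip}."""
    speakers = sorted(ip)
    words = sorted(next(iter(word_scores.values())))
    ip_values = [ip[s] for s in speakers]
    best_corr, best_words = -1.0, None
    for subset in combinations(words, r):
        im = [mean(word_scores[s][w] for w in subset) for s in speakers]
        corr = float(np.corrcoef(im, ip_values)[0, 1])
        if corr > best_corr:
            best_corr, best_words = corr, subset
    return best_corr, best_words

def optimal_r(word_scores, ip, candidate_r, a=0.5):
    # Assumed cost: a * (1 - best correlation) + (1 - a) * normalized subset size.
    def cost(r):
        corr, _ = best_subset(word_scores, ip, r)
        return a * (1.0 - corr) + (1.0 - a) * r / max(candidate_r)
    return min(candidate_r, key=cost)
```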
[051] As part of an experimental evaluation, the metrics Ism and Ild were averaged across different subsets of utterances from the UA Speech database (Table 2 above) and compared with the perceptual intelligibility score (Ip). The subsets used were: Computer Commands (CC), International Radio Letters (Letters), Digits, Common Words (CW) and Uncommon Words UW1, UW2 and UW3, across the three different blocks, namely, B1, B2 and B3, as shown in Table 2 above. A combined average score across all 765 words is shown in Table 6 (parts A and B) below, wherein the combined average is compared with the approach of the present disclosure, namely, the raw intelligibility score using 5 reference words and the two metrics described above.
[052] Table 6A: Pearson Correlation of the Sequence Matcher and the Levenshtein distance technique for assessment of intelligibility

CC Digits Letters CW UW1 UW2 UW3
Utterance 57 30 78 300 100 100 100
Ism 0.8481 0.8426 0.9271 0.9417 0.9375 0.9247 0.9239
Ild 0.8771 0.8429 0.9423 0.9260 0.9073 0.8965 0.9049
Table 6B (contd.)

B1 B2 B3 All words Best 5
Utterance 255 255 255 765 5

Ism 0.9437 0.9327 0.9352 0.9410 0.9607
Ild 0.9348 0.9099 0.9279 0.9278 0.9816
It may be noted that the intelligibility metric of the determined 5 reference words (utterance) correlates best with the perceptual intelligibility score (Ip).
[053] The approach of the present disclosure was further evaluated against state-of-the-art techniques and the Pearson Correlation is as shown in Table 7 below.
Table 7: Performance of the method of the present disclosure and state of the art measures on the UA Speech database.

Method   Pearson Correlation
Martinez et al. (“Dysarthria intelligibility assessment in a factor analysis total variability space,” in ISCA Interspeech, 2013)   0.91
Hummel (“Objective estimation of dysarthric speech intelligibility,” M.S. thesis, Queen’s University, Kingston, Ontario, Canada, Sep 2011)   0.92
Janbakshi et al. (“Spectral subspace analysis for automatic assessment of pathological speech intelligibility,” in ISCA Interspeech, 2019)   0.95
Paja et al. (“Automated dysarthria severity classification for improved objective intelligibility assessment of spastic dysarthric speech,” in ISCA Interspeech, 2012)   0.96
DS + Ism (Present disclosure)   0.96
DS + Ild (Present disclosure)   0.98

[054] FIG.4 illustrates a scatter plot between the perceptual intelligibility score (Ip) and the raw (predicted) intelligibility score (Im), in accordance with some embodiments of the present disclosure. It may be noted that the perceptual intelligibility score (Ip) and the raw (predicted) intelligibility score (Im) are linearly related throughout the intelligibility range. This helps the clinician to not only easily interpret the obtained raw (predicted) intelligibility score (Im) but also to classify dysarthria into four categories: Very Low, Low, Medium and High by selecting a threshold (refer to the ranges provided in Table 3), thereby addressing the clinician’s need.
[055] Furthermore, it is observed that words such as not, of, up, all, on which are part of the UA Speech database do not show any variation in intelligibility scores across dysarthria subjects. It may be hypothesized that this may be due to the lesser articulatory movement involved in pronunciation of these words.
[056] In an embodiment of the present disclosure, the one or more processors 104, are configured to predict, at step 316, a class of intelligibility to which the subject being assessed belongs based on a confidence score associated with the computed raw intelligibility score (Im). This is further described using another experimental evaluation performed by the Applicant. The experimental evaluation was implemented in Python® using mainly an audio recorder, a speech-to-text engine and the method of the present disclosure, and is described hereinafter. A set of 5 reference words determined by the method of the present disclosure were Dodgers, Naturalization, Cowhide, Sierra and Supervision. These words involve a precise control of the velum and hence patients with dysarthria have difficulty in speaking these words. These words were recorded by the patient using a microphone attached to the desktop. The utterances may also be recorded in isolation and sent to a clinician.
[057] The recorded samples were processed through DeepSpeech to obtain a string of characters recognized by the ASR. A language model was not used for decoding, as mentioned earlier, to retain the actual pronunciation of the patient.

[058] The string of characters was compared to the string of characters in the actual word (ground truth) using the Levenshtein distance and the Sequence Matcher. The Sequence Matcher was recursively applied to get the longest contiguous matching subsequence as explained earlier. The intelligibility scores obtained using both metrics were averaged to obtain the raw intelligibility score for the patient. Using the intelligibility scores marked by clinicians for 28 patients from the UA Speech database, the raw intelligibility score was normalized to determine an intelligibility score between 0 and 100. The 28 patients were divided into 5 classes (categories), namely, Healthy, High Intelligibility, Medium Intelligibility, Low Intelligibility and Very Low Intelligibility. Each of these classes was modeled as a Gaussian with its mean (µ) and variance (σ²). The scaled raw intelligibility score was then used to obtain a final intelligibility score and to predict a class among the five classes, based on a confidence of belonging to that class. In an embodiment, the confidence score for the class Healthy (exemplary) may be represented as the likelihood of the scaled score under the Gaussian model of the class Healthy, i.e., based on its mean (µHealthy) and variance (σ²Healthy).
[059] Thus, for a given raw intelligibility score, the probability of the patient belonging to each of the five classes was predicted. This is advantageous and may be used by the clinician for obtaining a clear understanding of the patient’s progress along with a confidence score for the predicted category. FIG.5 illustrates a graphical illustration of predicted classes of intelligibility based on assessing the speech intelligibility, in accordance with some embodiments of the present disclosure, wherein the ‘star’ represents the final intelligibility score of a patient being assessed and classified.
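By way of illustration of the class prediction of paragraphs [058] and [059], each class may be modeled by a Gaussian and the scaled score assigned to the class with the highest confidence; the class statistics below are hypothetical placeholders and the normalization of the likelihoods into confidences is an assumption.

```python
# Illustrative sketch: class prediction from a scaled raw intelligibility score
# (paragraphs [058]-[059]). Each class is modeled by a Gaussian; the (mu, sigma)
# values below are hypothetical placeholders, to be fitted from clinician-scored
# patients in practice, and the likelihood normalization is an assumption.
import math

CLASS_MODELS = {
    "Very Low Intelligibility": (10.0, 8.0),
    "Low Intelligibility": (35.0, 8.0),
    "Medium Intelligibility": (60.0, 8.0),
    "High Intelligibility": (85.0, 6.0),
    "Healthy": (97.0, 3.0),
}

def class_confidences(scaled_score):
    likelihood = {
        cls: math.exp(-((scaled_score - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
        for cls, (mu, sigma) in CLASS_MODELS.items()
    }
    total = sum(likelihood.values())
    return {cls: value / total for cls, value in likelihood.items()}

def predict_class(scaled_score):
    confidences = class_confidences(scaled_score)
    return max(confidences, key=confidences.get), confidences
```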
[060] Based on the above description and the experimental evaluations of the approach of the present disclosure, it may be noted that firstly the set of reference words determined by the method described is the most predictive of intelligibility of dysarthric speech and averts discomfort caused to patients in speaking a large set of words as is done conventionally. Secondly, with minimum human intervention, the method described provides intelligibility scores that are

easily interpretable by clinicians since they map with the perceptual intelligibility score linearly over the complete range of scores and the perceptual intelligibility score serves as the gold standard for assessment today.
[061] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[062] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
[063] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a

computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[064] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[065] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory,

nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[066] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.

We Claim:
1. A processor implemented method (300) for assessing speech intelligibility
in a subject, the method comprising the steps of:
obtaining, via a first set of hardware processors, a set of reference words to be provided to the subject being assessed, wherein the set of reference words is an optimal subset of n words in a dysarthric speech database and cardinality r of the set of reference words is a pre-determined optimal number based on a cost function (302);
converting, via a second set of hardware processors serving as an Automatic Speech Recognition (ASR) system, spoken utterances of the reference words by the subject being assessed to a corresponding text (304);
computing for each of the spoken utterances, a first metric and a second metric, via the first set of hardware processors, representative of a measure of similarity between a string of characters s1 of length l1 corresponding to each converted text and a string of characters s2 of length l2 associated with a corresponding word from the set of reference words (306);
computing, via the first set of hardware processors, a first intelligibility score Ism(s1, s2) based on a total number of subsequence matches between the string of characters s1 and the string of characters s2 for each word from the set of reference words (308);
computing, via the first set of hardware processors, a second intelligibility score Ild(s1, s2) based on a cost of transforming the string of characters s1 to the string of characters s2 for each word from the set of reference words (310);
computing, via the first set of hardware processors, a raw intelligibility score (Im) as a weighted average of (i) an average of the first intelligibility score and (ii) an average of the second intelligibility score, for the words in the set of reference words (312); and

obtaining, via the first set of hardware processors, a final intelligibility score for the subject being assessed based on a correlation between the computed raw intelligibility score (Im) and a perceptual intelligibility score (Ip) obtained from clinicians (314).
2. The processor implemented method of claim 1 further comprising predicting a class of intelligibility to which the subject being assessed belongs based on a confidence score associated with the computed raw intelligibility score (Im) (316).
3. The processor implemented method of claim 1, wherein the step of obtaining a set of reference words to be provided to the subject being assessed comprises:
identifying possible subsets of the n words in the dysarthric speech database;
computing the raw intelligibility score for each of the subsets;
obtaining a correlation of the raw intelligibility score and the perceptual intelligibility score for each of the subsets;
determining a value of r that minimizes the cost function for the subset having a highest correlation as the optimum number; and
identifying the subset having the highest correlation as the set of reference words.
4. The processor implemented method of claim 1, wherein the first metric is the sequence matcher technique.
5. The processor implemented method of claim 1, wherein the second metric is the Levenshtein distance.
6. The processor implemented method of claim 1, wherein the step of obtaining a final intelligibility score is based on the Pearson Correlation (PC) between the estimated raw intelligibility score (Im) and the perceptual intelligibility score (Ip).
7. A system (100) for assessing speech intelligibility in a subject, the system
comprising:
one or more data storage devices (102) operatively coupled to one or more hardware processors (104) and configured to store instructions configured for execution via the one or more hardware processors to:
obtain a set of reference words to be provided to the subject being assessed, wherein the set of reference words is an optimal subset of n words in a dysarthric speech database and cardinality r of the set of reference words is a pre-determined optimal number based on a cost function;
convert spoken utterances of the reference words by the subject being assessed to a corresponding text, via some of the one or more hardware processors serving as an Automatic Speech Recognition (ASR) system;
compute a first metric and a second metric representative of a measure of similarity between a string of characters s1 of length l1 corresponding to each converted text and a string of characters s2 of length l2 associated with a corresponding word from the set of reference words;
compute for each of the spoken utterances a first intelligibility score Ism(s1,s2) based on a total number of subsequence matches between the string of characters s1 and the string of characters s2 for each word from the set of reference words;
compute a second intelligibility score Ild(s1,s2) based on a cost of transforming the string of characters s1 to the string of characters s2 for each word from the set of reference words;
compute a raw intelligibility score (Im) as a weighted average of (i) an average of the first intelligibility score and (ii) an average of the second intelligibility score, for the words in the set of reference words; and

obtain a final intelligibility score for the subject being assessed based on a correlation between the computed raw intelligibility score (Im) and a perceptual intelligibility score (Ip) obtained from clinicians.
8. The system of claim 7, wherein the one or more processors are further configured to predict a class of intelligibility to which the subject being assessed belongs based on a confidence score associated with the computed raw intelligibility score (Im).
9. The system of claim 7, wherein the one or more processors are further configured to obtain a set of reference words to be provided to the subject being assessed by:
identifying possible subsets of the n words in the dysarthric speech database;
computing the raw intelligibility score for each of the subsets;
obtaining a correlation of the raw intelligibility score and the perceptual intelligibility score (Ip) for each of the subsets; and
determining a value of r that minimizes the cost function for the subset having a highest correlation as the optimum number; and
identifying the subset having the highest correlation as the set of reference words.
10. The system of claim 7, wherein the first metric is the sequence matcher technique.
11. The system of claim 7, wherein the second metric is the Levenshtein distance.
12. The system of claim 7, wherein the one or more hardware processors are configured to obtain a final intelligibility score based on the Pearson Correlation (PC) between the estimated raw intelligibility score (Im) and the perceptual intelligibility score (Ip).

Documents

Orders

Section Controller Decision Date

Application Documents

# Name Date
1 202021008649-IntimationOfGrant06-09-2024.pdf 2024-09-06
2 202021008649-PatentCertificate06-09-2024.pdf 2024-09-06
3 202021008649-Written submissions and relevant documents [16-07-2024(online)].pdf 2024-07-16
4 202021008649-Correspondence to notify the Controller [03-07-2024(online)].pdf 2024-07-03
5 202021008649-FORM-26 [03-07-2024(online)].pdf 2024-07-03
6 202021008649-US(14)-HearingNotice-(HearingDate-09-07-2024).pdf 2024-05-29
7 202021008649-ORIGINAL UR 6(1A) FORM 1 & FORM 26-160222.pdf 2022-02-18
8 202021008649-CLAIMS [17-02-2022(online)].pdf 2022-02-17
9 202021008649-FER_SER_REPLY [17-02-2022(online)].pdf 2022-02-17
10 202021008649-OTHERS [17-02-2022(online)].pdf 2022-02-17
11 202021008649-FER.pdf 2021-11-15
12 202021008649-FORM-26 [09-10-2020(online)].pdf 2020-10-09
13 202021008649-Proof of Right [17-06-2020(online)].pdf 2020-06-17
14 Abstract1.jpg 2020-03-04
15 202021008649-STATEMENT OF UNDERTAKING (FORM 3) [28-02-2020(online)].pdf 2020-02-28
16 202021008649-REQUEST FOR EXAMINATION (FORM-18) [28-02-2020(online)].pdf 2020-02-28
17 202021008649-FORM 18 [28-02-2020(online)].pdf 2020-02-28
18 202021008649-FORM 1 [28-02-2020(online)].pdf 2020-02-28
19 202021008649-FIGURE OF ABSTRACT [28-02-2020(online)].jpg 2020-02-28
20 202021008649-DRAWINGS [28-02-2020(online)].pdf 2020-02-28
21 202021008649-DECLARATION OF INVENTORSHIP (FORM 5) [28-02-2020(online)].pdf 2020-02-28
22 202021008649-COMPLETE SPECIFICATION [28-02-2020(online)].pdf 2020-02-28

Search Strategy

1 sseraAE_28-10-2022.pdf
2 sserE_01-11-2021.pdf

ERegister / Renewals

3rd: 10 Sep 2024

From 28/02/2022 - To 28/02/2023

4th: 10 Sep 2024

From 28/02/2023 - To 28/02/2024

5th: 10 Sep 2024

From 28/02/2024 - To 28/02/2025

6th: 07 Jan 2025

From 28/02/2025 - To 28/02/2026