Abstract: The present invention relates to a system and method for measuring voice pitch and intensity variation with time during the phonation of a vowel. The hand-held, battery-operated voice assessment device displays vocal health parameters, such as pitch, amplitude, and breath control, along with the normal range of these parameters. The device displays graphs of pitch and intensity variations so that dynamic characteristics can be understood visually, and it generates and stores a report. Published with Figure 2 and Figure 4
Description: FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
The Patent Rules, 2003
COMPLETE SPECIFICATION
(See section 10 & rule 13)
1. TITLE OF THE INVENTION
A SYSTEM AND METHOD FOR MEASURING VOICE PITCH AND INTENSITY VARIATION WITH TIME
2. APPLICANT (S)
| NAME | NATIONALITY | ADDRESS |
|---|---|---|
| DIVYASAMPARK IHUB ROORKEE FOR DEVICES MATERIALS AND TECHNOLOGY FOUNDATION | IN | Indian Institute of Technology Roorkee, Roorkee-247667, Uttarakhand, India. |
3. PREAMBLE TO THE DESCRIPTION
COMPLETE SPECIFICATION
The following specification particularly describes the invention and the manner in which it is to be performed.
FIELD OF INVENTION:
[001] The present invention relates to the field of voice analysis. In particular, the present invention relates to a system and method for measuring voice pitch and intensity variation with time.
DESCRIPTION OF THE RELATED ART:
[002] Speech is an important and most convenient form of communication among human beings. Human speech communication starts with message formation in the brain and then the generation of appropriate neural signals to motor controls responsible for speech generation. But sometimes, because of neurological disorders, speech production is affected, which causes voice disorders. Voice disorders affect approximately 6% of children under 14 years of age and 3-9% of the adult population of India. Voice disorders affect millions globally, with conditions such as dysphonia, vocal fold nodules, and laryngeal cancer requiring precise diagnosis and continuous monitoring. Traditional methods of voice assessment, such as perceptual evaluation by clinicians, can be subjective and inconsistent. Objective, automated voice assessment tools can provide standardized, repeatable, and accurate evaluations, improving diagnostic accuracy and treatment outcomes.
[003] A person affected by Parkinson’s disease has several speech disorders, collectively termed ‘dysarthria’. Parkinson’s disease is a common neurodegenerative disorder among the elderly population of India. Around 300 to 400 people per 100,000 population are diagnosed with Parkinson’s in India. This number is expected to more than double by the year 2030, making it a major non-communicable degenerative disorder in the country. Other major Parkinson-plus syndromes include Progressive Supranuclear Palsy (PSP), Cortico Basal Degeneration (CBD), and Multiple System Atrophy (MSA). PSP and MSA are uncommon disorders. While the prevalence of PSP is approximately 7 per 100,000 population, that of MSA lies between 1.9 and 4.9 per 100,000 population. Speech abnormality is seen early in these disorders. Most often, people with dysarthria have problems communicating, and this inhibits their social interaction, resulting in depression, social isolation, or low self-esteem.
[004] Reference may be made to the following:
[005] US5794188, JPH03198100, JPH0877152, JPH04346400, KR1020060045423, US10497383, US6018706, and IN130/DEL/1997 propose devices for evaluating speech sent over a transmission channel, where the distortion can be due to compression of speech or noise. These devices use pitch information to improve the generated synthetic speech. They are not related to measuring voice disorders.
[006] Patent No. US11972774 disclosed a system for assessing the quality of a singing voice singing a song. US4093821 and JP2020067495 determine the emotional state of a person by analyzing pitch or frequency perturbations in the speech pattern.
[007] CN113035237, US11756693, CN110415824, CN109087633, CN110600052, CN115394314, CN115359808, and US2023178099 present voice evaluation methods that use a machine learning or neural network-based classifier or score predictor to assess the quality of speech. These devices require costly processors, which also consume much more power than microcontrollers. Their accuracy or efficiency depends on the size of the training data.
[008] CN113689884, IN202311054442, and CN108877836 present devices that contain multiple sensors, which makes them costly and more power-consuming compared to single-sensor systems.
[009] CN103050128 discloses a vibration distortion-based voice frequency objective quality evaluating method and system. It compares input voice quality to a test signal based on jitter distortion. CN103067322 discloses a method for evaluating the voice quality of an audio frame in a single-channel audio signal by calculating the harmonic to non-harmonic ratio (HnHR). The devices of CN103050128 and CN103067322 measure a single voice parameter and do not provide detailed variation with time.
[010] ES2432480 pertains to the acoustic assessment of voice quality using a computer system to analyze a recording of sustained vowel phonation. This method objectively measures four aspects of voice quality using just 5 seconds of recording.
[011] CN113506572, CN108766415, and CN113763992 provide pronunciation evaluation methods, not vocal health assessment. These are used in the technical field of language learning. CN213484940 discloses a voice quality evaluation device, which is large and, therefore, not portable.
[012] KR100623214 relates to a method for providing real-time perceptual quality measurements of an audio signal with respect to a pre-stored audio signal representation.
[013] The rise of telehealth, accelerated by the COVID-19 pandemic, has underscored the necessity for remote diagnostic tools. Voice assessment devices that can be used remotely allow clinicians to monitor patients’ progress without the need for in-person visits, making healthcare more accessible and convenient. This capability is particularly beneficial for patients in rural or underserved areas.
[014] Thus, there is a need for an intelligent voice assessment tool for dysarthric patients. The existing methods are computer-based solutions, which are costly, require training to operate, and take more time to generate a report than the proposed technology.
[015] To overcome the limitations of the above-listed prior art, the present invention aims to provide a system and method for measuring voice pitch and intensity variation with time. This is a lightweight device containing a single microphone sensor as input, and hence it is compact and lower in cost. The method uses variation in pitch and intensity from a phonation task to determine mental health, which is less time-consuming than a conversational speech task. It utilizes a dual-core microcontroller to manage the parallel processing of voice sample acquisition and feature extraction. The method uses fixed-point arithmetic, which speeds up multiplication operations manyfold on a microcontroller.
OBJECTS OF THE INVENTION:
[016] The principal object of the present invention is to provide a system and method for measuring voice pitch and intensity variation with time.
[017] Another object of the present invention is to provide a voice assessment device which is portable, low-cost, and user-friendly.
[018] Yet another object of the present invention is to provide a voice assessment device which can be integrated into healthcare settings for diagnostic purposes and educational institutions for language and speech training.
SUMMARY OF THE INVENTION:
[019] The present invention relates to a voice assessment device that measures voice pitch and intensity variation with time during the phonation of a vowel. This is a low-cost device which provides accurate and efficient voice analysis. The device can be integrated into healthcare settings for diagnostic purposes and educational institutions for language and speech training.
[020] This is a low-cost and non-invasive device which measures voice parameters using an ARM-based microcontroller, which consumes much less power than a computer. The hand-held, battery-operated device displays vocal health parameters, such as pitch, amplitude, and breath control, along with the normal range of these parameters. The device displays graphs of pitch and intensity variations so that dynamic characteristics can be understood visually, and it generates and stores a report.
[021] The device captures total phonation time to assess lung capacity. Powered by a dual-core microcontroller, the system enables real-time calculation of acoustic parameters, offering faster performance and lower power consumption compared to traditional computer systems. The device measures jitter, shimmer, phonation time, and HNR parameters. It is provided with a learned normative range against which vocal health is assessed.
BRIEF DESCRIPTION OF THE INVENTION:
[022] It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered for limiting of its scope, for the invention may admit to other equally effective embodiments.
[023] Fig. 1: Shows block diagram of the process of the voice assessment device.
[024] Fig. 2: Shows block diagram of the proposed voice assessment device.
[025] Fig. 3: Design of the proposed voice assessment device.
[026] Fig. 4: Use of proposed voice assessment device.
DETAILED DESCRIPTION OF THE INVENTION:
[027] The present invention provides a system and method for measuring voice pitch and intensity variation with time during the phonation of a vowel. The hand-held, battery-operated voice assessment device displays vocal health parameters, such as pitch, amplitude, and breath control, along with the normal range of these parameters. The device displays graphs of pitch and intensity variations so that dynamic characteristics can be understood visually, and it generates and stores a report.
[028] Referring to figure 1, the device comprises the following units. The input unit (101) receives the voice of sustained phonation from a sensitive microphone in a noise-free room. The pre-processing unit (102) comprises a preamplifier and an analog-to-digital converter. The processor (103) performs time-frequency analysis of the recorded speech to compute the voice characteristics. The output unit is a display screen (104), wherein the OLED display plots pitch and intensity with time for visual understanding of the variation and shows vocal health parameters, such as pitch, amplitude, and breath control, along with the normal range of these parameters. The micro SD memory card slot (105) enables storage of all voice samples and corresponding reports. The battery (106) provides power to all hardware modules.
[029] Figure 3 illustrates the proposed design of the device, which contains the following parts: (301) Start/Stop Button, (302) Upward/Backward navigation button, (303) Down/Forward navigation button, (304) OLED display, (305) Voice input.
[030] Figure 4 demonstrates the use of the proposed voice assessment device on a human subject; the device should be placed close to the mouth.
[031] The device uses a digital signal processing technique to analyze dysarthric speech. The block diagram shown in Figure 1 represents the proposed process of the voice assessment device. The methodology used in the proposed process is as follows:
I. Input: The voice of sustained phonation is input to the device from a sensitive microphone in a noise-free room.
II. Processor: Time-frequency analysis of the recorded speech is performed to compute the following voice characteristics:
Mean, std., minimum, and maximum values of pitch.
Mean, std., minimum, and maximum values of intensity.
Mean, std., minimum, and maximum values of jitter.
Mean, std., minimum, and maximum values of shimmer.
Mean Harmonic to Noise Ratio (HNR).
Phonation Time
III. Output: The real time pitch and intensity will be displayed on an OLED screen. After completion of the phonation task, a report will be generated for the record.
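The summary statistics listed under item II can be illustrated with a short sketch. The specification does not give formulas for jitter or shimmer, so the sketch assumes the commonly used "local" definitions (mean absolute difference between consecutive values divided by the mean value) and a population standard deviation; the function names `summarize` and `local_perturbation` are hypothetical.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return mean, standard deviation, minimum and maximum of a sequence.

    Assumption: population standard deviation; the specification only says "std.".
    """
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def local_perturbation(values):
    """Mean absolute difference of consecutive values divided by the mean value.

    Assumed definition: applied to pitch periods it gives local jitter,
    applied to per-cycle peak amplitudes it gives local shimmer.
    """
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Example with made-up pitch values in Hz and pitch periods in ms.
pitch_stats = summarize([100, 110, 90, 100])
jitter = local_perturbation([10.0, 12.0, 10.0])
```

The same `summarize` helper would apply unchanged to intensity, jitter, or shimmer tracks, which is why the report can list the same four statistics for each measure.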
[032] Figure 2 shows the components of the proposed voice assessment device. The device captures voice input through a sensitive microphone. The voice signal is pre-processed through a pre-amplifier and an ADC. The digital voice signal is processed by the processor. The OLED display plots pitch and intensity with time for visual understanding of the variation. A report file is generated that contains the observed values of the voice measures and the normative range (Fig 2).
[033] Method for computation of fundamental frequency and HNR
[034] The fundamental frequency F0 of a signal is the lowest frequency, or the longest wavelength, of a periodic waveform. F0 is calculated for each voice frame: the most likely candidate among the lowest possible frequencies is chosen as F0. From all of these values, the median value is returned. The frame size is 32 milliseconds (512 samples) at a sampling frequency of 16000 Hz with 16-bit representation for each sample. For each frame, the method calculates the normalized autocorrelation ra, that is, the correlation of the signal with a delayed copy of itself. ra is estimated by dividing the windowed signal's autocorrelation by the window's autocorrelation. After ra is calculated, the maxima of ra are found. These points lie in the lag domain, that is, at the delays where the correlation value has peaked. Higher peaks indicate a stronger correlation. These points in the lag domain suggest places of wave repetition and are the candidates for F0. The best candidate for F0 of each frame is picked by a cost function, which compares the cost of transitioning from the best F0 of the previous frame to all possible F0 of the current frame. Once the path of F0 of least cost has been determined, the F0 of all voiced frames is returned.
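The normalized autocorrelation described above can be sketched in a few lines. This is a minimal illustration, not the device firmware: it uses floating-point Python with a small recursive FFT instead of the fixed-point CMSIS DSP routines, assumes a Hann window (the specification does not name the window function), and assumes a non-silent frame. All function names are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def autocorrelation(frame):
    """FFT, square in the frequency domain, FFT again; scaled so lag 0 equals 1."""
    n = len(frame)
    padded = list(frame) + [0.0] * n           # zero-pad to avoid circular wrap-around
    power = [abs(c) ** 2 for c in fft(padded)]
    # The power spectrum is real and even, so a forward FFT returns the
    # autocorrelation up to a constant factor, which the normalisation removes.
    r = [c.real for c in fft(power)]
    return [v / r[0] for v in r[:n]]


def hann(n):
    """Periodic Hann window (assumed window shape)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def normalized_autocorrelation(frame):
    """Windowed-signal autocorrelation divided by the window autocorrelation."""
    mean = sum(frame) / len(frame)             # subtract the local average
    window = hann(len(frame))
    windowed = [(s - mean) * w for s, w in zip(frame, window)]
    ra = autocorrelation(windowed)
    rw = autocorrelation(window)
    # Only the first half of the lags is used, where rw stays well above zero.
    return [ra[k] / rw[k] for k in range(len(frame) // 2)]


# Example: a sine with a period of 16 samples peaks at lag 16.
frame = [math.sin(2 * math.pi * i / 16) for i in range(128)]
rx = normalized_autocorrelation(frame)
best_lag = max(range(8, 24), key=lambda k: rx[k])
```

At a 16000 Hz sampling rate, a best lag of 16 samples would correspond to a pitch candidate of 1000 Hz; real voices produce much longer lags, but the mechanics are the same.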
[035] The steps for computation of the fundamental period are as follows:
[036] Initialize global absolute peak = 0, frame = 0, input buffer [512] = 0
[037] Start if start/stop button is pressed
[038] Step 1. Fill input buffer of size 32 milliseconds or 512 samples using core 1.
[039] All upcoming steps 2.1-2.9 occur in core 2 within 24 milliseconds. Meanwhile, core 1 fills new samples. Because the frame is 32 milliseconds long and the step is 24 milliseconds, 128 of the 512 samples in each frame (25%) are old samples carried over from the previous frame. Thus, there is an overlap of 25% between new and old frames (buffer); this is a novel step. In addition, a novel approach is used for faster computation: fixed-point arithmetic is used in all computational steps in place of floating-point arithmetic, which is more than 11 times faster, as shown in Table 1.
Table 1. Calculation time for 256-point FFT on Cortex-M0+ @ 133 MHz
| Floating-point time | Fixed-point time | Improvement percentage |
|---|---|---|
| 7861 µs | 674 µs | 1166.51 % (11.66 times faster) |
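The frame overlap and the idea behind fixed-point multiplication can be sketched as follows. The frame length, step, and sampling rate come from the specification; the Q15 number format and the helper names are assumptions made for illustration, since the specification does not state which fixed-point format is used.

```python
SAMPLE_RATE_HZ = 16000
FRAME_MS = 32
STEP_MS = 24

frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000    # samples per frame
step_len = SAMPLE_RATE_HZ * STEP_MS // 1000      # new samples per step
overlap_len = frame_len - step_len               # old samples reused per frame
overlap_fraction = overlap_len / frame_len


def frames(samples, frame_len, step_len):
    """Yield successive overlapping frames from a list of samples."""
    for start in range(0, len(samples) - frame_len + 1, step_len):
        yield samples[start:start + frame_len]


Q15_ONE = 1 << 15  # assumed format: 1 sign bit, 15 fractional bits


def to_q15(x):
    """Convert a float in [-1, 1) to a saturated Q15 integer."""
    return max(-Q15_ONE, min(Q15_ONE - 1, int(round(x * Q15_ONE))))


def q15_mul(a, b):
    """Multiply two Q15 integers using only an integer multiply and a shift."""
    return (a * b) >> 15


# Example: three frames fit into 1280 samples, each sharing 128 samples
# with its predecessor.
chunks = list(frames(list(range(1280)), frame_len, step_len))
quarter = q15_mul(to_q15(0.5), to_q15(0.5))
```

On a core without a floating-point unit, such as the Cortex-M0+ named in Table 1, an integer multiply and shift avoids software-emulated floating-point routines, which is consistent with the speed-up the table reports.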
[040] Step 2.1. Compute the local absolute peak value of the signal and update the global absolute peak if it is lower than the local peak.
[041] Step 2.2. Subtract the local average.
[042] Step 2.3. A frame will be assumed to be voiceless if there are no autocorrelation peaks above the Voicing Threshold or if the local absolute peak value is less than the Silence Threshold times the global absolute peak value.
[043] Step 2.4. Multiply by the window function.
[044] Step 2.5. Perform a Fast Fourier Transform. Here, the CMSIS DSP library is used to implement the discrete Fourier transform.
[045] Step 2.6. Square the samples in the frequency domain.
[046] Step 2.7. Perform a Fast Fourier Transform to get ra(t).
[047] Step 2.8. To estimate the autocorrelation rx(t) of the original signal segment, we divide the autocorrelation ra(t) of the windowed signal by the autocorrelation rw(t) of the window:
$$ r_x(t) = \frac{r_a(t)}{r_w(t)} $$
[048] Step 2.9. Find the places and heights of the maxima of the continuous version of rx(t). Only maxima that yield a pitch between Minimum Pitch and Maximum Pitch are considered. The only candidates that are remembered are the unvoiced candidate, which has a local strength equal to
$$ R = \mathrm{VoicingThreshold} + \max\left(0,\; 2 - \frac{(\text{local absolute peak})/(\text{global absolute peak})}{\mathrm{SilenceThreshold}/(1 + \mathrm{VoicingThreshold})}\right) $$
[049] and the voiced candidates with the highest (maximum number of candidates per frame - 1) values of the local strength
$$ R = r(t_{\max}) - \mathrm{OctaveCost} \cdot \log_2\left(\mathrm{MinimumPitch} \cdot t_{\max}\right) $$
[050] The octave cost parameter favors higher fundamental frequencies. After step 2, for every frame, we are left with a number of frequency-strength pairs (Fni, Rni), where the index n runs from 1 to the number of frames, and i runs from 1 to the number of candidates in each frame. The locally best candidate in each frame is the one with the highest R, though there can be several approximately equally strong candidates in any frame. The aim of the global path finder is to minimize the number of incidental voiced-unvoiced decisions and large frequency jumps.
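The two local strength formulas of Step 2.9 translate directly into code. The sketch below is illustrative only; the threshold and cost values in the example are made-up values chosen for the demonstration and are not taken from the specification, and the function names are hypothetical.

```python
import math


def unvoiced_strength(local_peak, global_peak, voicing_threshold, silence_threshold):
    """Local strength R of the unvoiced candidate (global_peak must be non-zero)."""
    relative_peak = local_peak / global_peak
    return voicing_threshold + max(
        0.0,
        2.0 - relative_peak / (silence_threshold / (1.0 + voicing_threshold)),
    )


def voiced_strength(r_at_lag, lag_seconds, minimum_pitch_hz, octave_cost):
    """Local strength R of a voiced candidate found at lag t_max (in seconds)."""
    return r_at_lag - octave_cost * math.log2(minimum_pitch_hz * lag_seconds)


# Illustrative parameter values (assumptions, not from the specification).
VOICING_THRESHOLD = 0.45
SILENCE_THRESHOLD = 0.03
OCTAVE_COST = 0.01
MINIMUM_PITCH_HZ = 100.0

loud_frame = unvoiced_strength(1.0, 1.0, VOICING_THRESHOLD, SILENCE_THRESHOLD)
silent_frame = unvoiced_strength(0.0, 1.0, VOICING_THRESHOLD, SILENCE_THRESHOLD)
at_200_hz = voiced_strength(0.9, 0.005, MINIMUM_PITCH_HZ, OCTAVE_COST)
at_100_hz = voiced_strength(0.9, 0.010, MINIMUM_PITCH_HZ, OCTAVE_COST)
```

With equal correlation heights, the candidate at the shorter lag (200 Hz) receives the larger strength, which is the sense in which the octave cost favors higher fundamental frequencies. A silent frame raises the unvoiced strength well above any voiced candidate, so it is classified as voiceless.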
[051] Go to Step 1 by incrementing frame count if start/stop button is still pressed.
[052] Step 3. After the button is released, for every frame n, the least cost path is tracked to determine the fundamental period as follows:
[053] For every frame n, pn is a number between 1 and the number of candidates for that frame. The values {pn | 1 ≤ n ≤ number of frames} define a path through the candidates: {(Fnpn, Rnpn) | 1 ≤ n ≤ number of frames}. With every possible path, we associate a cost
$$ \mathrm{cost}(\{p_n\}) = \sum_{n=2}^{\mathrm{numberOfFrames}} \mathrm{transitionCost}\left(F_{n-1,p_{n-1}},\, F_{n,p_n}\right) - \sum_{n=1}^{\mathrm{numberOfFrames}} R_{n,p_n} $$
[054] where the transition cost function is defined by:
$$ \mathrm{transitionCost}(F_1, F_2) = \begin{cases} 0 & \text{if } F_1 = 0 \text{ and } F_2 = 0 \\ \mathrm{VoicedUnvoicedCost} & \text{if } F_1 = 0 \text{ xor } F_2 = 0 \\ \mathrm{OctaveJumpCost} \cdot \left|\log_2 \dfrac{F_1}{F_2}\right| & \text{if } F_1 \neq 0 \text{ and } F_2 \neq 0 \end{cases} $$
[055] The globally best path is the path with the lowest cost. This path might contain some candidates that are locally second-choice.
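The least-cost path can be found with dynamic programming rather than by enumerating every path. The sketch below is one possible way to do this and is not claimed to be the device's implementation; the default cost values and the candidate data are illustrative assumptions, and a frequency of 0 stands for the unvoiced candidate.

```python
import math


def transition_cost(f1, f2, voiced_unvoiced_cost, octave_jump_cost):
    """Cost of moving from frequency f1 to f2 (0 means unvoiced)."""
    if f1 == 0 and f2 == 0:
        return 0.0
    if f1 == 0 or f2 == 0:
        return voiced_unvoiced_cost
    return octave_jump_cost * abs(math.log2(f1 / f2))


def best_path(frames, voiced_unvoiced_cost=0.14, octave_jump_cost=0.35):
    """Return the chosen frequency per frame along the path of lowest cost.

    frames: one list per frame of (frequency_hz, strength) candidates.
    The default costs are illustrative values, not from the specification.
    """
    # cost[i]: lowest total cost of any path ending at candidate i of the current frame
    cost = [-strength for _, strength in frames[0]]
    back_pointers = []
    for previous, current in zip(frames, frames[1:]):
        new_cost, pointers = [], []
        for f2, strength in current:
            options = [
                cost[i] + transition_cost(f1, f2, voiced_unvoiced_cost, octave_jump_cost)
                for i, (f1, _) in enumerate(previous)
            ]
            i_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[i_best] - strength)
            pointers.append(i_best)
        cost = new_cost
        back_pointers.append(pointers)
    # Walk back from the cheapest final candidate.
    index = min(range(len(cost)), key=cost.__getitem__)
    path = [index]
    for pointers in reversed(back_pointers):
        index = pointers[index]
        path.append(index)
    path.reverse()
    return [frames[n][i][0] for n, i in enumerate(path)]


# In the middle frame, 200 Hz is locally strongest, but the global path stays at 100 Hz.
candidates = [
    [(100, 0.90), (200, 0.50)],
    [(100, 0.80), (200, 0.82)],
    [(100, 0.90), (200, 0.50)],
]
chosen = best_path(candidates)
```

The example shows the point made in paragraph [055]: the middle frame's locally second-choice candidate is selected because two octave jumps would cost more than the small gain in strength.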
[056] The harmonics-to-noise ratio (HNR) is the ratio of the energy of the periodic part of a signal to the energy of the noise in the signal, expressed in dB. This value is often used as a measure of hoarseness in a person’s voice. The highest peak is picked from ra. If the height of this peak is larger than the strength of the silent candidate, then the HNR for this frame is calculated from that peak. The height of the peak corresponds to the energy of the periodic part of the signal. Once the HNR value has been calculated for all voiced frames, the mean of these values is returned. In the computation algorithm, for HNR measurements, the path finder is turned off, and the octave cost and voicing threshold parameters are zero; the maximum pitch equals the Nyquist frequency; only the time step, minimum pitch, and silence threshold parameters are relevant for HNR measurements.
[057] The device is low cost and hence within reach of middle- and low-income people in developing countries. The device provides accurate and efficient voice analysis; therefore, it can be integrated into healthcare settings for diagnostic purposes and educational institutions for language and speech training.
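The specification does not state the formula that converts the peak height into an HNR value. A minimal sketch, assuming the widely used relation HNR = 10 log10(r / (1 - r)) for a normalized autocorrelation peak r, where r is treated as the periodic energy fraction and 1 - r as the noise fraction, might look like this; the function names are hypothetical.

```python
import math


def hnr_db(r_peak):
    """HNR in dB from a normalized autocorrelation peak, with 0 < r_peak < 1.

    Assumed relation: periodic fraction r_peak, noise fraction 1 - r_peak.
    """
    return 10.0 * math.log10(r_peak / (1.0 - r_peak))


def mean_hnr_db(peaks):
    """Mean HNR over frames whose peak lies strictly between 0 and 1."""
    values = [hnr_db(r) for r in peaks if 0.0 < r < 1.0]
    return sum(values) / len(values)


# A peak of 0.5 means equal periodic and noise energy; 0.99 is a very clean voice.
balanced = hnr_db(0.5)
clean = hnr_db(0.99)
average = mean_hnr_db([0.5, 0.99])
```

Under this assumed relation, a frame in which 99% of the energy is periodic has an HNR of about 20 dB, while equal periodic and noise energy gives 0 dB.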
[058] Thus, the device is useful to audiologists, neurologists, ENT specialists, and researchers working in the field of speech disorders across the globe. The solution and its affordability make it possible to reach all depths of the general community. The increasing awareness regarding the assessment and management of voice disorders worldwide makes the device scalable.
[059] Numerous modifications and adaptations of the system of the present invention will be apparent to those skilled in the art, and thus it is intended by the appended claims to cover all such modifications and adaptations which fall within the true spirit and scope of this invention.
Claims:
WE CLAIM:
1. A system and method for measuring voice pitch and intensity variation with time during the phonation of a vowel, comprising:
a) A voice input module (101) to receive the voice of sustained phonation from a sensitive microphone in a noise-free room.
b) A pre-processing module (102) consisting of a preamplifier and an analog-to-digital converter.
c) A dual-core microcontroller (103) to perform time-frequency analysis of the recorded speech to compute the voice characteristics.
d) An output unit comprising a display screen (104), wherein the OLED display plots pitch and intensity with time for visual understanding of the variation.
e) A micro SD memory card slot to enable storage of all voice samples and corresponding reports through an external storage module (105).
f) A battery (106) to provide power to all modules (101-105).
2. The voice assessment device, as claimed in claim 1, wherein the device captures voice input through a sensitive microphone, pre-processes the voice signal through the pre-amplifier and ADC, processes the digital voice signal in the processor, plots pitch and intensity with time for visual understanding of the variation, and generates a report file containing the observed values of the voice measures and the normative range.
3. The device for measuring voice pitch and intensity variation, as claimed in claim 1, wherein the steps for computation of the fundamental period are as follows:
a) Fill a buffer using core 1 and process the data in core 2, maintaining a step size of 24 milliseconds across the two cores so that acquisition and processing occur simultaneously, wherein the processing includes the following steps:
o Compute the local absolute peak value of the signal and update the global absolute peak if it is lower than the local peak.
o Subtract the local average.
o Parameter-based selection of voiced frame.
o Multiply by the window function.
o Perform a Fast Fourier Transform.
o Square the samples in the frequency domain.
o Perform a Fast Fourier Transform to get ra(t).
o Estimate the autocorrelation rx(t) of the original signal segment by dividing the autocorrelation ra(t) of the windowed signal by the autocorrelation rw(t) of the window.
o Find the places and heights of the maxima of the continuous version of rx(t).
b) After the button is released, the least cost path for every frame n is tracked to determine the fundamental period.
c) Intensity (local absolute peak) and fundamental period (pitch) variation with time (frame) are plotted on the OLED display (104).
d) Numeric values of jitter, shimmer, APQ, PPQ, and phonation time are displayed on the OLED display (104).
e) All voice samples and the voice report are saved to the microSD card attached to the external memory module (105).
f) Battery (106) provides power to all units (101-105).
| # | Name | Date |
|---|---|---|
| 1 | 202411074587-STATEMENT OF UNDERTAKING (FORM 3) [03-10-2024(online)].pdf | 2024-10-03 |
| 2 | 202411074587-FORM FOR SMALL ENTITY(FORM-28) [03-10-2024(online)].pdf | 2024-10-03 |
| 3 | 202411074587-FORM 1 [03-10-2024(online)].pdf | 2024-10-03 |
| 4 | 202411074587-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [03-10-2024(online)].pdf | 2024-10-03 |
| 5 | 202411074587-EDUCATIONAL INSTITUTION(S) [03-10-2024(online)].pdf | 2024-10-03 |
| 6 | 202411074587-DRAWINGS [03-10-2024(online)].pdf | 2024-10-03 |
| 7 | 202411074587-DECLARATION OF INVENTORSHIP (FORM 5) [03-10-2024(online)].pdf | 2024-10-03 |
| 8 | 202411074587-COMPLETE SPECIFICATION [03-10-2024(online)].pdf | 2024-10-03 |
| 9 | 202411074587-FORM-9 [18-10-2024(online)].pdf | 2024-10-18 |
| 10 | 202411074587-FORM-8 [18-10-2024(online)].pdf | 2024-10-18 |
| 11 | 202411074587-FORM 18 [18-10-2024(online)].pdf | 2024-10-18 |