Abstract: Speech enhancement for hearing aids has attracted considerable attention from the research community. This work proposes an improved dynamic quantile tracking method for estimating the noise spectrum from a degraded speech signal. The estimate is used with the spectral subtraction method in a single-channel speech enhancement system. The system is implemented on the MATLAB platform using the Simulink library. The results obtained from the model are evaluated through the PESQ score, waveforms and spectrograms. The PESQ score of the recovered (enhanced) speech signals is higher than that of the observed (noisy) speech signals, spoken by a male and a female speaker and corrupted with f16 cockpit noise and pink noise, respectively, at each SNR level: 1, 3, 5, 7, 9, 11, 13 and 15 dB. The waveform and spectrogram of the enhanced speech signal are very close to those of the clean speech signal. 5 Claims & 6 Figures
Description:
Field of Invention
The present invention relates to speech and audio processing, and in particular to speech technologies such as speech enhancement, echo cancellation, preprocessing, etc., for hearing aid devices. For these technologies, noise estimation is one of the essential factors. In this invention, the noise spectrum is obtained from the dynamic quantile values of the degraded speech signal in order to improve speech signal quality.
The objectives of this invention
This invention aims to improve the performance of hearing aid devices in terms of speech quality. In hearing aid devices, speech enhancement is used to improve speech signal quality and to reduce or remove the noise present in the input speech signal. If the quality of the speech signal is maintained with better PESQ values, the performance of the hearing aid device is good.
Background of the invention
Numerous speech processing applications, such as hearing aids, voice-controlled systems, mobile communication and multiparty teleconferencing, require speech enhancement techniques that remove noise with little speech distortion. Because of continually increasing noise pollution, our living places are becoming noisier, and speech processing systems are disturbed by this noise. Noise is picked up by the microphone simultaneously with the speech signal. Suppose we are talking on the phone at a railway station. Train, babble and station noise can then affect the communication. Because of the noise, the original speech signal is degraded in terms of quality and intelligibility. To reduce the noise, speech enhancement techniques are used in pre-processing. Several speech enhancement techniques, such as Wiener filtering (Jaiswal, et al., International Journal of Speech Technology, Vol. 25, No. 3, pp. 745-758, 2022), MMSE STSA (Kumar, Bittu, International Journal of Speech Technology, Vol. 21, No. 4, pp. 1033-1044, 2018), spectral subtraction (Bharti, et al., 2016 3rd International Conference on Recent Advances in Information Technology (RAIT), IEEE, 2016), the signal subspace approach and source separation, are available for recovering the desired speech signal from degraded speech signals. Among these, the techniques that recover the speech signal with less distortion are the more popular. Through a literature survey, we examined spectral subtraction methods, which have received a great deal of attention in the last few decades.
Spectral subtraction is a signal estimation method that estimates the clean speech signal from the degraded speech signal using an estimated noise spectrum (obtained through a noise estimation method). In the spectral subtraction process, the original speech signal is recovered by subtracting the estimated noise from the degraded speech signal. If the noise present in the observed noisy speech signal is estimated accurately, the spectral subtraction method can give a speech signal of better quality. Therefore, the choice of noise estimation method is essential for any speech enhancement technique.
Several noise estimation methods have been reported in recently published papers by different authors. In most cases, researchers use Voice Activity Detection (VAD) based noise estimation or statistical noise estimation (R. Martin, in Proc. Euro. Signal Processing Conf. (EUSIPCO), 1994). The literature survey shows that most researchers favour statistical noise estimation for speech enhancement applications. In the quantile-based noise estimation method (V. Stahl, A. Fisher, and R. Bipus, “Spectral subtraction based on minimum statistics”, in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. ICASSP’00, 2000), Stahl et al. observe that most frames (80-90%) of degraded speech carry a low-energy spectrum that is very close to the noise energy spectrum in a particular frequency bin, and that only a few frames (10-20%) contain a high-energy spectrum corresponding to the speech signal. Based on this observation, the noise samples are obtained as quantile values of the histogram of the observed degraded speech signal. Many sorting operations are needed to find the sample value at the chosen quantile in every frequency bin. In the cascaded-median based noise estimation method (Santosh K. Waddi, Prem C. Pandey, and Nitya Tiwari, “Speech Enhancement Using Spectral Subtraction and Cascaded-Median Based Noise Estimation for Hearing Impaired Listeners”, Communications (NCC), 2013 National Conference on, IEEE, 2013), Waddi et al. reduce the number of sorting operations, which reduces the computational complexity and the storage requirement. To improve speech quality, Kumar (Kumar, Bittu, Fluctuation and Noise Letters, vol. 18, no. 04, 1950020, 2019) proposed a noise estimation method that is a modified version of cascaded-median based noise estimation, but it still requires some storage. In (Tiwari, Nitya, and Prem C. Pandey, Journal of Signal Processing Systems, vol. 91, no. 5, pp. 411-422, 2019), the Dynamic Quantile Tracking (DQT) based noise estimation method requires no storage and has low computational complexity. Based on our observations and studies, in this invention the DQT noise estimation is improved for better speech quality of the desired speech signal. The modification has less computational complexity than other existing noise estimation methods.
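To make the sorting cost of the quantile-based approach concrete, the following is a minimal Python sketch of a sorting-based quantile noise estimate over a buffer of stored frames. It is an illustration only and not the implementation of any cited work; the function name `quantile_noise_estimate`, the buffer layout and the default quantile `q=0.5` are assumptions made for this example.

```python
def quantile_noise_estimate(frames, q=0.5):
    """Sorting-based quantile noise estimate (illustrative sketch).

    frames: list of T magnitude spectra, each a list of K bin magnitudes.
    q: quantile in [0, 1]; 0.5 (the median) is an assumed example value.
    Returns one noise magnitude per frequency bin.
    """
    num_frames = len(frames)
    num_bins = len(frames[0])
    # Index of the q-th quantile in a sorted list of T values.
    index = min(int(q * num_frames), num_frames - 1)
    estimate = []
    for k in range(num_bins):
        # One full sort per frequency bin: this is the cost that the
        # cascaded-median and quantile-tracking methods try to avoid.
        ordered = sorted(frame[k] for frame in frames)
        estimate.append(ordered[index])
    return estimate
```

The sketch needs all T frames in memory and one sort per bin for every update, which is why tracking-based estimators are attractive for hearing aid hardware.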
Several speech enhancement applications, such as hands-free hearing aids and speech-to-speech communication, are frequently deployed on real-time devices. To improve speech quality and reduce noise, noise suppression or speech enhancement algorithms are used. First, a speech filter processes the degraded speech signal taken from the microphone and outputs the filtered sound. It is additionally set up to change the frequency and volume of the auditory input. The enhanced speech component is output via the speaker, which amplifies it and makes it audible (US7191127B2).
When a voiced speech signal is processed, its amplitudes and time periods may be non-uniform, which adversely affects intelligibility. In this methodology, a few consecutive frames of the speech spectrum are processed so that every part of the spectrum has a substantially uniform period and improved intelligibility. In some examples, the processing is done so that each treated section also has roughly uniform peak amplitudes. To improve the intelligibility of unvoiced speech waveforms, methods for processing them may be used together with the improved voiced waveform produced by the enhancement approach (US4468804A).
A method of estimating the noise power spectrum in an observed noisy signal having a power spectrum comprises the steps of: (a) receiving and storing the noisy signal for T time frames; (b) for a frequency k, sorting the frames and obtaining a noise estimate, wherein the noise estimate is calculated as the quantile; and (c) applying a recursive function, the recursive function being arranged to reduce fluctuations in the noise estimate. Also described is a method of saving processing power by updating only the odd frequency bands for one time frame and only the even frequency bands for the next time frame (GB2426167A).
In some embodiments, a method comprises: dividing, using at least one processor, an audio input into speech and non-speech segments; for each frame in each non-speech segment, estimating, using at least one processor, a time-varying noise spectrum of the non-speech segment; for each frame in each speech segment, estimating, using the at least one processor, speech spectrum of the speech segment; for each frame in each speech segment, identifying one or more non-speech frequency components in the speech spectrum; comparing the one or more non-speech frequency components with one or more corresponding frequency components in a plurality of estimated noise spectra and selecting the estimated noise spectrum from the plurality of estimated noise spectra based on a result of the comparing. (WO2022066590A1).
Summary of Invention
This patent discusses a noise estimation method for speech enhancement systems commonly used in hearing aids and speech-to-speech communication systems. The proposed method, the Improved DQT (IDQT) noise estimation method, gives a good-quality speech signal with less computation time and memory. In terms of PESQ score, its performance is better than that of the existing method, and its score is higher than that of the noisy speech signal. The waveform and spectrogram of the enhanced speech signal are also observed to be very close to those of the original speech signal. This noise estimation method can therefore be used for enhancement in hearing aid devices.
Detailed description of the invention
Fig. 1 shows the block diagram of the IDQT noise estimation-based speech enhancement technique. In this technique, only the noisy speech signal is used as the input for recovering the original speech signal. The observed noisy speech signal (which contains the real speech data and background noise) first passes through a Fast Fourier Transform (FFT), which transforms it into the noisy speech spectrum $S_i(k)$. To obtain the original speech spectrum $Y_i(k)$, the spectral subtraction method uses the magnitude of the noisy speech spectrum. Spectral subtraction follows a generalized equation, which is expressed as
$$
|Y_i(k)| =
\begin{cases}
\left[\,|S_i(k)|^{\gamma} - a\,\big(E_i(k)\big)^{\gamma}\,\right]^{1/\gamma}, & \text{if } |S_i(k)| > (a+b)^{1/\gamma}\,E_i(k) \\
b^{1/\gamma}\,E_i(k), & \text{otherwise}
\end{cases}
\tag{1}
$$
where $a$ is the over-subtraction factor, $b$ is the floor factor and $\gamma$ is the exponent factor (unity for magnitude spectral subtraction and two for power spectral subtraction).
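As an illustration of equation (1), the following minimal Python sketch applies the generalized spectral subtraction rule to one frame, bin by bin. It is a sketch under stated assumptions and not the Simulink model of the invention: the function name `spectral_subtract` is hypothetical, the exponent factor is written `gamma`, and the default values of `a`, `b` and `gamma` are example values chosen only for demonstration.

```python
def spectral_subtract(noisy_mag, noise_mag, a=2.0, b=0.001, gamma=1.0):
    """Generalized spectral subtraction for one frame, following equation (1).

    noisy_mag: |S_i(k)| for each frequency bin k.
    noise_mag: E_i(k) for each frequency bin k.
    a, b, gamma: over-subtraction, floor and exponent factors
                 (the defaults are illustrative assumptions).
    Returns the enhanced magnitude |Y_i(k)| for each bin.
    """
    enhanced = []
    for s, e in zip(noisy_mag, noise_mag):
        threshold = (a + b) ** (1.0 / gamma) * e
        if s > threshold:
            # Subtract the scaled noise estimate in the gamma domain.
            enhanced.append((s ** gamma - a * e ** gamma) ** (1.0 / gamma))
        else:
            # Spectral floor: keep a small fraction of the noise estimate.
            enhanced.append(b ** (1.0 / gamma) * e)
    return enhanced
```

With `gamma=1.0` the rule reduces to magnitude subtraction and with `gamma=2.0` to power subtraction. The threshold test ensures that the bracketed term is positive whenever the first branch is taken, so the root is always real.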
In equation (1), the noise spectrum $E_i(k)$ is required, and it is estimated through a noise estimation method. Here, Improved Dynamic Quantile Tracking (IDQT) is proposed for estimating the noise spectrum; IDQT is detailed in the next paragraph. The advantage of this method is that it needs little memory and has less computational complexity than the original DQT and other methods. Next, the enhanced speech spectrum is obtained by removing the noise spectrum from the observed degraded speech spectrum, in accordance with the generalized spectral subtraction equation. The resulting magnitude spectrum (enhanced speech spectrum) and the phase spectrum of the noisy speech signal are combined into a complex spectrum. An Inverse Fast Fourier Transform (IFFT) is applied to bring this spectrum back into the time domain. In this process, we assume that speech and noise are uncorrelated and that the phase spectrum of the noisy speech is unaffected by the noise.
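The analysis, modification and synthesis path described above can be sketched for a single frame as follows. This is a minimal, dependency-free Python illustration: a direct DFT stands in for the FFT, windowing and overlap-add are not shown, and the names `enhance_frame` and `modify_magnitude` are hypothetical. Any magnitude rule, such as spectral subtraction, can be passed in as `modify_magnitude`.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform (stands in for an FFT here)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part, assuming a real-valued signal."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def enhance_frame(frame, modify_magnitude):
    """Modify only the magnitude spectrum and reuse the noisy phase."""
    spectrum = dft(frame)
    magnitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]      # phase of the noisy signal
    new_magnitude = modify_magnitude(magnitude)     # e.g. spectral subtraction
    # Recombine the enhanced magnitude with the unchanged noisy phase.
    new_spectrum = [cmath.rect(m, p) for m, p in zip(new_magnitude, phase)]
    return idft(new_spectrum)
```

Passing the identity function as `modify_magnitude` returns the input frame unchanged, which is a convenient check that the transform pair and the phase handling are consistent.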
In IDQT noise estimation, the input signal, i.e., the frequency-domain degraded speech signal, is processed frame by frame. For every frame of the input spectrum, one quantile value is assigned, which is generated from the estimated noise spectrum of the previous frame. This is done by incrementing or decrementing the samples of the previous frame's estimated noise spectrum. The increment or decrement applied to the previous noise samples is chosen as a suitable fraction of the range. However, the spread of the degraded speech spectrum is not known, so an approximation of the range must be acquired dynamically. The difference $e_i(k)$ between the noise spectrum of the present frame $E_i(k)$ and that of the previous frame $E_{i-1}(k)$ is determined from the degraded speech spectrum and the noise spectrum. If the magnitude of the degraded spectrum is larger than the previously obtained noise sample, there is a small increment in the next frame's noise spectrum. Conversely, there is a decrement in the following frame if the magnitude of the noisy speech is smaller than the previously calculated noise spectrum. The quantile value $Q_i(k)$ of the noise spectrum $E_i(k)$ is found using the following equation.
$$
Q_i(k) =
\begin{cases}
Q_{i-1}(k) + q\,R_i(k), & |S_i(k)| \ge |Q_{i-1}(k)| \\
Q_{i-1}(k) - (1-q)\,R_i(k), & \text{otherwise}
\end{cases}
\tag{2}
$$
where $R_i(k)$, $q$ and $Q_i(k)$ are the range, the initial fixed quantile level and the estimated quantile value of the degraded spectrum $S_i(k)$, respectively. The range is obtained dynamically by subtracting the valley $V_i(k)$ from the dynamic peak $P_i(k)$. The peak and valley values are updated using first-order recursive equations, which are given as follows:
$$
P_i(k) =
\begin{cases}
t\,P_{i-1}(k) + (1-t)\,|E_i(k)|, & |S_i(k)| \ge P_{i-1}(k) \\
s\,P_{i-1}(k) + (1-s)\,|V_{i-1}(k)|, & \text{otherwise}
\end{cases}
\tag{3}
$$
$$
V_i(k) =
\begin{cases}
t\,V_{i-1}(k) + (1-t)\,|E_i(k)|, & |S_i(k)| \le V_{i-1}(k) \\
s\,V_{i-1}(k) + (1-s)\,|P_{i-1}(k)|, & \text{otherwise}
\end{cases}
\tag{4}
$$
where $s$ and $t$ set the fall and rise detection durations, respectively.
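The peak and valley recursions of equations (3) and (4) can be sketched for a single frequency bin as below. This is a hedged illustration only: the function names are hypothetical, the default values of `s` and `t` are example values, and the magnitude `e_mag` that stands for $|E_i(k)|$ is passed in by the caller, because the text does not spell out how it is obtained for the current frame.

```python
def update_peak(p_prev, v_prev, s_mag, e_mag, s=0.9, t=0.1):
    """Equation (3) for one bin: rise toward |E_i(k)| or fall toward the valley."""
    if s_mag >= p_prev:
        return t * p_prev + (1.0 - t) * abs(e_mag)
    return s * p_prev + (1.0 - s) * abs(v_prev)


def update_valley(p_prev, v_prev, s_mag, e_mag, s=0.9, t=0.1):
    """Equation (4) for one bin: drop toward |E_i(k)| or rise toward the peak."""
    if s_mag <= v_prev:
        return t * v_prev + (1.0 - t) * abs(e_mag)
    return s * v_prev + (1.0 - s) * abs(p_prev)


def dynamic_range(peak, valley):
    """Range R_i(k) obtained by subtracting the valley from the peak."""
    return peak - valley
```

Only the previous peak and valley of each bin are kept between frames, so the storage is two values per frequency bin regardless of how many frames have been processed.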
The block diagram of the IDQT-based noise estimation technique is shown in Fig. 2. The peak and valley values in this diagram are determined by the peak and valley calculation blocks, which implement equations (3) and (4), respectively. The range $R_i(k)$ is calculated from the peak $P_i(k)$ and valley $V_i(k)$ values. With $R_i(k)$ and $Q_i(k)$ from equation (2), the noise spectrum $E_i(k)$ is estimated.
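The quantile update of equation (2) can be sketched in Python as follows, for a given range per frame and bin. This is an illustration under stated assumptions, not the Simulink model itself: the names `update_quantile` and `track_noise` are hypothetical, the value `q=0.25` is an example, the noise estimate $E_i(k)$ is assumed to be taken equal to $Q_i(k)$, and the tracker is assumed to start from the first frame when no initial estimate is supplied.

```python
def update_quantile(q_prev, s_mag, r, q=0.25):
    """One step of equation (2) for a single frequency bin."""
    if abs(s_mag) >= abs(q_prev):
        return q_prev + q * r              # small step up
    return q_prev - (1.0 - q) * r          # larger step down


def track_noise(noisy_frames, ranges, q=0.25, initial=None):
    """Run equation (2) over all frames and bins.

    noisy_frames[i][k] is |S_i(k)| and ranges[i][k] is R_i(k).
    Assumption: the noise estimate E_i(k) is taken equal to Q_i(k).
    Assumption: without `initial`, tracking starts from the first frame.
    Returns the list of per-frame noise estimates.
    """
    estimate = list(initial) if initial is not None else list(noisy_frames[0])
    history = []
    for frame, frame_range in zip(noisy_frames, ranges):
        estimate = [update_quantile(e, s, r, q)
                    for e, s, r in zip(estimate, frame, frame_range)]
        history.append(estimate)
    return history
```

For a stationary input and a fixed range, the upward steps of size `q * r` and downward steps of size `(1 - q) * r` balance when a fraction `q` of the input magnitudes lies below the tracked value, which is why the recursion follows the chosen quantile without sorting or storing past frames.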
Brief Description of Drawing
The list of figures, which illustrate exemplary embodiments of the invention:
Figure 1 Speech Enhancement System.
Figure 2 Improved Dynamic Quantile Tracking based Noise Estimation Method.
Figure 3 Performance of IDQT in terms of speech quality.
Figure 4 Waveform-Spectrogram of the original speech signal.
Figure 5 Waveform-Spectrogram of the noisy speech signal.
Figure 6 Waveform-Spectrogram of the enhanced speech signal.
Detailed description of the drawing
As described above, the present invention relates to IDQT-based noise estimation for speech enhancement systems.
Figure 1 shows the block diagram of the IDQT noise estimation-based speech enhancement technique. In this technique, only the noisy speech signal is used as the input for recovering the original speech signal. The observed noisy speech signal first passes through a Fast Fourier Transform (FFT), which transforms it into the noisy speech spectrum. To obtain the original speech spectrum, the spectral subtraction method uses the magnitude of the noisy speech spectrum. Improved Dynamic Quantile Tracking (IDQT) is used to estimate the noise spectrum. Next, the obtained magnitude spectrum and the phase spectrum of the noisy speech signal are combined into a complex spectrum. An Inverse Fast Fourier Transform (IFFT) is applied to bring this spectrum back into the time domain.
The block diagram of the IDQT-based noise estimation technique is shown in Figure 2. The dynamic peak and valley detectors determine the peak and valley values in this figure. Next, the Range block calculates the range from the estimated peak and valley values. Using the range, the quantile value and the degraded speech spectrum, the final noise spectrum is calculated frame by frame.
The Improved DQT-based noise estimation method has been tested using an objective measure, the PESQ score. From the SpEAR database, we chose 28 degraded speech signals corrupted at different SNR levels: 1, 3, 5, 7, 9, 11, 13 and 15 dB. These speech files (spoken by both male and female speakers) were passed through the proposed speech enhancement model, and the enhanced speech signals were stored. We then calculated the PESQ score of the enhanced speech signals as well as of the degraded speech signals. Figure 3 shows the simulation result (PESQ score versus SNR) for two speech files, “noisy speech file 1” (spoken by a male speaker) and “noisy speech file 2” (spoken by a female speaker), which are corrupted with f16 cockpit noise and pink noise, respectively. In this figure, the PESQ score of the recovered (enhanced) speech signals is higher than that of the observed (noisy) speech signals at each SNR level: 1, 3, 5, 7, 9, 11, 13 and 15 dB.
Figures 4-6 show the waveform and spectrogram of the clean, noisy and enhanced speech signals. These signals are taken from the SpEAR speech database. Figure 4 shows the waveform and spectrogram of the clean speech signal, spoken by the female speaker and sampled at 16 kHz. The speech signal contains the sentence “I am sitting in the morning at the diner on the corner; I am waiting at the counter for the man to pour the coffee; and he fills it only halfway and before I even argue; He is looking out the window at somebody coming in”. Figure 5 presents the waveform and spectrogram of the noisy speech signal, which is corrupted by pink noise at 1 dB SNR. This file was not generated by mixing clean speech and noise; it was downloaded from the speech database. This wave file (noisy speech signal) is then processed through the Simulink models, and the processed speech, i.e., the enhanced speech signal, is saved as a wave file. Figure 6 shows the waveform and spectrogram of the enhanced speech signal obtained from the speech enhancement system with the IDQT noise estimation method. From Figures 4-6, it can be seen that the spectrogram and waveform of the speech enhanced with the IDQT method are close to those of the clean speech signal.
5 Claims & 6 Figures
Field of Invention
The present invention relates tospeech and audio processing and is also significantly related to speech technologies like speech enhancement, echo cancellation, preprocessing … etc, for hearing aids devices. For these technologies, the estimation of noise is one of the essential factors. In this invention, noise spectrum is obtained based on the dynamic quantile values of degraded speech signal for improving speech signal quality.
The objectives of this invention
This invention aims to improve the performance of hearing aid devices in terms of speech quality. In hearing aids devices, speech enhancement is running to improve or enhance speech signal quality and reduce or remove the noises available in the input speech signal. If we maintain the quality of the speech signal with better PESQ values, the hearing aid device's performance is good.
Background of the invention
Numerous applications in speech processing, such as hearing aids, voice-controlled system, mobile communication and multiparty teleconferencing, have strongly required speech enhancement techniques for removing the noise with less speech distortion. Due to continually increasing pollution, our living places are nosier, however, speech processing-based systems disturb by noises. Actually, noises are picked up through the microphone simultaneously with the speech signal. Suppose we are talking from the phone ata railway station. At that time, train, bubble, and station noise could affect our communication process. Due to noises, the original speech signal has degraded in terms of speech quality and intelligibility. To reduce the noises, speech enhancement techniques are used in pre-processing. Several speech enhancement techniques like as wiener filtering (Jaiswal, et.al, International Journal of Speech Technology, Vol. 25, No..3, pp: 745-758, 2022), MMSE STSA (kumar, Bittu, International Journal of Speech Technology, Vol. 21, no.4, pp): 1033-1044, 2018), spectral subtraction (Bharti, et.al, 2016 3rd International Conference on Recent Advances in Information Technology (RAIT). IEEE, 2016), signal subspace approach and source separation are available for recovering the desired speech signal from degraded speech signals. Among all, those techniques are more popular, obtaining real speech signals with less distortion. Through a literature survey, we investigated spectral subtraction methodsthat have found lots of attention in the last few decades.
Spectral subtraction is a signal estimation method used for estimating the clean speech signal from degraded speech signal using estimated noise (obtained through the noise estimation method). According to the spectral subtraction process, the original speech signal is recovered by subtracting the estimated noise from the degraded speech signal. If we estimate accurate noises (it is present in observed noisy speech signals), then the spectral subtraction method can give a better speech quality signal. Therefore, selectinga noise estimation method is also essential for any speech enhancement techniques.
Several different types of noise estimation methods have been availed in a recently published paper by different authors. In most cases, researchers use Voice Activity Detection (VAD) based noise estimation or statistical noise estimation method (R. Martin, in Proc. Euro. Signal Processing Conf.(EUSIPCO) 1994).Through the literature survey of noise estimation methods, most researchers are more devoted to statistical noise estimation for estimating noise in speech enhancement applications. In quantile-based noise estimation method (V. Stahl, A. Fisher, and R. Bipus, “Spectral subtraction based on minimum statistics”, in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. ICASSP’00. 2000.), Stahl et. al observe that most of the frames (80-90%) of degraded speech carry low energy level spectrum which is very near to noise energy spectrum signal in the particular frequency bins and only a few numbers of frames (10-20%) of the signal contain high energy level spectrum corresponding to the speech signal. To this observation, noise samples are obtained as a few quantile values histogram of the observed degraded speech signals. More sorting operations are needed to find the sample value at every quantile chosen in every frequency bin. In cascaded median-based noise estimation method (Santosh K. Waddi, Prem C. Pandey, and Nitya Tiwari, “Speech Enhancement Using Spectral Subtractionand Cascaded-Median Based Noise Estimation for Hearing Impaired Listeners”, Communications (NCC), 2013 National Conference on. IEEE, 2013), Waddi et al reduce the sorting operation, reduces in computational complexity and storage requirement. For improvement in speech quality, in paper (Kumar, Bittu, Fluctuation and Noise Letters, vol.18, no.04, 1950020, 2019), Kumar had proposed a noise estimation which was modified version of cascaded median based noise estimation, but it requires some storage requirement. In (Tiwari, Nitya, and Prem C. 
Pandey, Journal of Signal Processing Systems, vol.91, no.5, pp: 411-422, 2019), Dynamic Quantile Tracking (DQT) based noise estimation method do not need memory requirement with less computational complexity. Based on our observation and studies, in this invention, the DQT noise estimationis improved for better speech quality of the desired speech signal. This was modified with less computational complexity as compared to other existing noise estimation methods.
Several speech enhancement applications, like hands-free hearing aids, speech-to-speech communication, etc. are frequently applied to real-time devices. To improve speech quality and reduce noise, noise suppression /speech enhancement algorithms are utilized. Initially, speech filtering filters the degraded speech signal taken from the microphone and outputs the filtered sound. Additionally, it is set up to change the frequency and volume of the auditory input and sounds. The enhanced speech component is output via the speaker, which amplifies it and makes it audible. (US7191127B2). As we know that when processing any voiced speech signal, the amplitude and time periods may be non-uniform so that intelligibility thereof is adversely unnatural. According to this statement or methodology, the continually few frames of the speech spectrum are processed so that every part of the spectrum has a significantly uniform period and improved intelligibility. In some examples, the processing could be done so that each treated section also has roughly uniform peak amplitudes. To improve the intelligibility of unvoiced speech waveforms, methods for processing them may be used with the improved voiced waveform produced by the enhancement approach. (US4468804A). A method of estimating the noise power spectrum in an observed noisy signal having a power spectrum comprising the steps of (a) Receiving and storing the noisy signal for T time frames; (b) for a frequency k sorting the frames and obtaining a noise estimate, wherein the noise estimate is calculated as the quantile (c) applying a recursive function to generate, the recursive function is arranged to reduce fluctuations in the noise estimate. Also, a method of saving processing power by only updating odd frequency bands for a one-time frame and only updating even frequency bands for the next time frame.(GB2426167A). 
In some embodiments, a method comprises: dividing, using at least one processor, an audio input into speech and non-speech segments; for each frame in each non-speech segment, estimating, using at least one processor, a time-varying noise spectrum of the non-speech segment; for each frame in each speech segment, estimating, using the at least one processor, speech spectrum of the speech segment; for each frame in each speech segment, identifying one or more non-speech frequency components in the speech spectrum; comparing the one or more non-speech frequency components with one or more corresponding frequency components in a plurality of estimated noise spectra and selecting the estimated noise spectrum from the plurality of estimated noise spectra based on a result of the comparing. (WO2022066590A1).
Summary of Invention
This patent discusses the noise estimation method for speech enhancement systems in commonly usedhearing aids and speech-to-speech communication systems. The proposed method, i.e., the Improved DQT noise estimation method, is given a good quality speech signal with less computation time and memory. Regardingthe PESQ Score, its performance is better than the existing method and PESQ score of noisy speech signal score. The waveform-spectrogram of the enhanced speech signal are also observed that it is very close to the waveform-spectrogram of the original speech signal. So, this noise estimation can use for enhancement purposes in hearing aid devices.
Detailed description of the invention
Fig.1 shows the block diagram of the IDQT noise estimation-based speech enhancement technique. In this technique, only a noisy speech signal is used as an input signal for finding the original speech signal. The observed noisy speech signal (it contains real speech data and background noises) first passes through Fast Fourier Transform (FFT) to transform into a noisy speech spectrum? S?_i (k). For obtaining original speech spectrum(? Y?_ ) ^(i,k), the magnitude of the noisy speech spectrum is used by the spectral subtraction method. The spectral subtraction methodis done by a generalized equation which is expressedas
|(Y_ ) ^(i,k) |= {¦(?[|S_i (k) |^?-a?(E_i (k))?^?]?^(1/?), if?? |S?_i (k)|>(a+b)?^(1/?) E_i (k)@b^(1/?) E_i (k), otherwise)¦ (1)
where a=over-subtraction factor,b=floor factor and ?=exponent factor (Unity for magnitude spectral subtraction and two for power spectral subtraction method)
In equation (1), the noise spectrum E_i (k) requires and it estimates through the noise estimation method. Here for estimating the noise spectrum, Improved Dynamic Quantile Tracking (IDQT)is proposed; detail of IDQT is mentioned in the next paragraph. The speciality of this method is that it doesn’t need much memory and takes less computational complexity compared with existing original DQT and othermethods. Next, The natural speech signal is obtained by removing the noise spectrum from the observed degraded speech spectrum, in accordance with the generalized equation of spectrum subtraction.. The obtained spectrum (enhanced speech spectrum) and phase spectrum of noisy speech signals are combined into the complex spectrum. We have applied Inverse Fast Fourier Transform (IFFT) to recover this spectrum into the time domain. In this process, we assumed that speech and noise were uncorrected and also, the phase spectrum of noisy speech spectrum was independent of noise.
In IDQT noise estimation, the input signal, i.e., frequency domain degraded speech signal, is applied to different frames. For every frame of the input spectrum, we assign one quantile value which is generated using the previous frame of the estimated noise spectrum. It can be done through increment/decrement on the samples of the previous frame of the generated orcalculated noise spectrum. The preferredincrement or decrement value in previous noise samples is processed through the suitable range values. However, the spreading of the degraded speech spectrum is not known, which means that the range's approximation is required to be dynamically acquired. Differences e_i (k) in between the obtained noise spectrum of the present frameE_i (k) and previous frame E_(i-1) (k) can be measuredby the degraded speech spectrum and noise spectrum. If the magnitude of that degraded spectrum is larger than the previously obtained noise samples, then there will be a minor increment in the next frame's noise spectrum. On the other hand, there will be a reduction in the following frame if the volume of the noisy speech is smaller than the previously calculated noise spectrum. For finding the quantile value? Q?_i (k) of the noise spectrumE_i (k), the following equation can be used.
Q_i (k)={¦(Q_(i-1) (k)+qR_i (k) , |?¦?S_i (k) ?| =|?¦?Q_(i-1) (k) ?|@Q_(i-1) (k)-(1-q)R_i (k), otherwise)¦ (2)
where, R_i (k),qandQ_i (k) are range, initial fixed quantile level/value and the estimated quantile value of degraded spectrum S_i (k), respectively. The range is dynamically obtained by subtracting the valley? V?_i (k) from dynamic peak? P?_i (k). The peak and valley values are updated using the first-order recursive equation, which is provided as follows:
P_i (k)={¦(?tP?_(i-1) (k)+(1-t)?|E?_i (k)|, ?|S?_i (k)| =P_(i-1) (k)@?sP?_(i-1) (k)+?(1-s)|V?_(i-1) (k)|, otherwise)¦ (3)
V_i (k)={¦(?tV?_(i-1) (k)+(1-t) ?|E?_i (k)|, ?|S?_i (k) |=V_(i-1) (k)@?sV?_(i-1) (k)+?(1-s)|P?_(i-1) (k)|, otherwise) (4)¦
Where, s and t are the fall and rise in detection duration, respectively.
The block design for the IDQT-based noise estimation technique is shown in Fig. 2. The peak and valley values in this diagram are determined using the peak and valley calculation blocks shown in equations (3) and (4), respectively. Range ? R?_i (k) is calculated using the peak P_i (k) and valley? V?_i (k) values. With the help of? R?_i (k)&Q_i (k)which is mentioned in equation (2), the noise spectrumE_i (k) is estimated.
Brief Description of Drawing
The List ofFigures, which are illustrated exemplary embodiments of the invention.
Figure 1 Speech Enhancement System.
Figure 2 Improved Dynamic Quantile Tracking based Noise Estimation Method.
Figure 3 Performance of IDQT in terms of speech quality.
Figure 4 Waveform-Spectrogram of the original speech signal.
Figure 5 Waveform-Spectrogram of the noisy speech signal.
Figure 6 Waveform-Spectrogram of the enhanced speech signal.
Detailed description of the drawing
As described above, the present invention relates to IDQT-based noise estimation for speech enhancement systems.
Figure 1 shows the block diagram of the IDQT noise estimation-based speech enhancement technique. In this technique, only a noisy speech signal is used as the input for recovering the original speech signal. The observed noisy speech signal first passes through the Fast Fourier Transform (FFT), which transforms it into a noisy speech spectrum. To obtain the original speech spectrum, the spectral subtraction method uses the magnitude of the noisy speech spectrum. Improved Dynamic Quantile Tracking (IDQT) is used to estimate the noise spectrum. Next, the obtained enhanced speech spectrum and the phase spectrum of the noisy speech signal are combined into a complex spectrum. To bring this spectrum back into the time domain, the Inverse Fast Fourier Transform (IFFT) is applied.
The block diagram of the IDQT-based noise estimation technique is shown in Figure 2. In this figure, the dynamic peak and valley detectors determine the peak and valley values. Next, the Range block calculates the range value from the estimated peak and valley values. Using the range, the quantile value and the degraded speech spectrum, the final noise spectrum is calculated frame by frame.
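The subtraction and phase-recombination steps of Figure 1 can be sketched for one frame as below, using the generalized rule of equation (1): subtract the scaled noise term when the noisy magnitude is large enough, otherwise fall back to a spectral floor. The function name, the parameter name `gamma` for the exponent factor and the default values of `a`, `b` and `gamma` are assumptions for illustration. The FFT, the IFFT and the noise estimator are outside this sketch, so the complex bins and noise magnitudes are passed in directly.

```python
import cmath


def spectral_subtract(noisy_bins, noise_mag, a=2.0, b=0.01, gamma=1.0):
    """Apply the generalized spectral subtraction rule to one frame.

    noisy_bins: complex spectrum bins S_i(k) of the noisy frame.
    noise_mag:  estimated noise magnitudes E_i(k), one per bin.
    a, b, gamma: over-subtraction, floor and exponent factors
                 (gamma=1 magnitude subtraction, gamma=2 power subtraction).
    Returns complex bins with the enhanced magnitude and the noisy phase.
    """
    enhanced = []
    for s_k, e_k in zip(noisy_bins, noise_mag):
        mag, phase = cmath.polar(s_k)
        if mag > (a + b) ** (1.0 / gamma) * e_k:
            # Subtraction branch: remove the scaled noise term.
            out_mag = (mag ** gamma - a * e_k ** gamma) ** (1.0 / gamma)
        else:
            # Floor branch: keep a small fraction of the noise estimate.
            out_mag = b ** (1.0 / gamma) * e_k
        # Reuse the phase of the noisy spectrum unchanged.
        enhanced.append(cmath.rect(out_mag, phase))
    return enhanced


# Hypothetical two-bin frame with a unit noise estimate in each bin.
frame = [3 + 4j, 0.1 + 0j]
noise = [1.0, 1.0]
out = spectral_subtract(frame, noise, a=2.0, b=0.01, gamma=1.0)
```

Reusing the noisy phase matches the description above, in which only the magnitude spectrum is modified before the complex spectrum is rebuilt.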
The Improved DQT-based noise estimation method has been tested using an objective measure, i.e., the PESQ score. From the SpEAR database, we chose 28 different degraded speech signals at SNR levels of 1, 3, 5, 7, 9, 11, 13 and 15 dB. These speech files (spoken by both male and female speakers) were passed through the proposed speech enhancement model, and the enhanced speech signals were stored. We then calculated the PESQ score of the enhanced speech signals as well as of the degraded speech signals. Figure 3 shows the simulation result (PESQ score plotted against SNR) for two speech files: “noisy speech file 1” (spoken by a male speaker) and “noisy speech file 2” (spoken by a female speaker), which are corrupted with f16 cockpit noise and pink noise, respectively. In this figure, the PESQ score of the recovered (enhanced) speech signals is higher than that of the observed (noisy) speech signals at each SNR level: 1, 3, 5, 7, 9, 11, 13 and 15 dB.
Figures 4-6 show the waveform and spectrogram of the clean, noisy and enhanced speech signals. These signals are taken from the SpEAR speech database. Figure 4 shows the waveform and spectrogram of the clean speech signal, spoken by a female speaker and sampled at 16 kHz. The speech signal contains the sentence “I am sitting in the morning at the diner on the corner; I am waiting at the counter for the man to pour the coffee; and he fills it only halfway and before I even argue; He is looking out the window at somebody coming in”. Figure 5 presents the waveform and spectrogram of the noisy speech signal, which is corrupted by pink noise at an SNR of 1 dB. This noisy file was not generated by mixing clean speech with noise; it was downloaded as part of the speech database. This wave file (the noisy speech signal) is then processed through the Simulink models, and the processed speech, i.e., the enhanced speech signal, is saved as a wave file. Figure 6 shows the waveform and spectrogram of the enhanced speech signal obtained from the speech enhancement system with the IDQT noise estimation method. From Figures 4-6, it can be seen that the spectrogram and waveform of the speech enhanced with the IDQT method are close to those of the clean speech signal.
Field of Invention
The present invention relates tospeech and audio processing and is also significantly related to speech technologies like speech enhancement, echo cancellation, preprocessing … etc, for hearing aids devices. For these technologies, the estimation of noise is one of the essential factors. In this invention, noise spectrum is obtained based on the dynamic quantile values of degraded speech signal for improving speech signal quality.
The objectives of this invention
This invention aims to improve the performance of hearing aid devices in terms of speech quality. In hearing aids devices, speech enhancement is running to improve or enhance speech signal quality and reduce or remove the noises available in the input speech signal. If we maintain the quality of the speech signal with better PESQ values, the hearing aid device's performance is good.
Background of the invention
Numerous applications in speech processing, such as hearing aids, voice-controlled system, mobile communication and multiparty teleconferencing, have strongly required speech enhancement techniques for removing the noise with less speech distortion. Due to continually increasing pollution, our living places are nosier, however, speech processing-based systems disturb by noises. Actually, noises are picked up through the microphone simultaneously with the speech signal. Suppose we are talking from the phone ata railway station. At that time, train, bubble, and station noise could affect our communication process. Due to noises, the original speech signal has degraded in terms of speech quality and intelligibility. To reduce the noises, speech enhancement techniques are used in pre-processing. Several speech enhancement techniques like as wiener filtering (Jaiswal, et.al, International Journal of Speech Technology, Vol. 25, No..3, pp: 745-758, 2022), MMSE STSA (kumar, Bittu, International Journal of Speech Technology, Vol. 21, no.4, pp): 1033-1044, 2018), spectral subtraction (Bharti, et.al, 2016 3rd International Conference on Recent Advances in Information Technology (RAIT). IEEE, 2016), signal subspace approach and source separation are available for recovering the desired speech signal from degraded speech signals. Among all, those techniques are more popular, obtaining real speech signals with less distortion. Through a literature survey, we investigated spectral subtraction methodsthat have found lots of attention in the last few decades.
Spectral subtraction is a signal estimation method used for estimating the clean speech signal from degraded speech signal using estimated noise (obtained through the noise estimation method). According to the spectral subtraction process, the original speech signal is recovered by subtracting the estimated noise from the degraded speech signal. If we estimate accurate noises (it is present in observed noisy speech signals), then the spectral subtraction method can give a better speech quality signal. Therefore, selectinga noise estimation method is also essential for any speech enhancement techniques.
Several different types of noise estimation methods have been availed in a recently published paper by different authors. In most cases, researchers use Voice Activity Detection (VAD) based noise estimation or statistical noise estimation method (R. Martin, in Proc. Euro. Signal Processing Conf.(EUSIPCO) 1994).Through the literature survey of noise estimation methods, most researchers are more devoted to statistical noise estimation for estimating noise in speech enhancement applications. In quantile-based noise estimation method (V. Stahl, A. Fisher, and R. Bipus, “Spectral subtraction based on minimum statistics”, in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. ICASSP’00. 2000.), Stahl et. al observe that most of the frames (80-90%) of degraded speech carry low energy level spectrum which is very near to noise energy spectrum signal in the particular frequency bins and only a few numbers of frames (10-20%) of the signal contain high energy level spectrum corresponding to the speech signal. To this observation, noise samples are obtained as a few quantile values histogram of the observed degraded speech signals. More sorting operations are needed to find the sample value at every quantile chosen in every frequency bin. In cascaded median-based noise estimation method (Santosh K. Waddi, Prem C. Pandey, and Nitya Tiwari, “Speech Enhancement Using Spectral Subtractionand Cascaded-Median Based Noise Estimation for Hearing Impaired Listeners”, Communications (NCC), 2013 National Conference on. IEEE, 2013), Waddi et al reduce the sorting operation, reduces in computational complexity and storage requirement. For improvement in speech quality, in paper (Kumar, Bittu, Fluctuation and Noise Letters, vol.18, no.04, 1950020, 2019), Kumar had proposed a noise estimation which was modified version of cascaded median based noise estimation, but it requires some storage requirement. In (Tiwari, Nitya, and Prem C. 
Pandey, Journal of Signal Processing Systems, vol.91, no.5, pp: 411-422, 2019), Dynamic Quantile Tracking (DQT) based noise estimation method do not need memory requirement with less computational complexity. Based on our observation and studies, in this invention, the DQT noise estimationis improved for better speech quality of the desired speech signal. This was modified with less computational complexity as compared to other existing noise estimation methods.
Several speech enhancement applications, like hands-free hearing aids, speech-to-speech communication, etc. are frequently applied to real-time devices. To improve speech quality and reduce noise, noise suppression /speech enhancement algorithms are utilized. Initially, speech filtering filters the degraded speech signal taken from the microphone and outputs the filtered sound. Additionally, it is set up to change the frequency and volume of the auditory input and sounds. The enhanced speech component is output via the speaker, which amplifies it and makes it audible. (US7191127B2). As we know that when processing any voiced speech signal, the amplitude and time periods may be non-uniform so that intelligibility thereof is adversely unnatural. According to this statement or methodology, the continually few frames of the speech spectrum are processed so that every part of the spectrum has a significantly uniform period and improved intelligibility. In some examples, the processing could be done so that each treated section also has roughly uniform peak amplitudes. To improve the intelligibility of unvoiced speech waveforms, methods for processing them may be used with the improved voiced waveform produced by the enhancement approach. (US4468804A). A method of estimating the noise power spectrum in an observed noisy signal having a power spectrum comprising the steps of (a) Receiving and storing the noisy signal for T time frames; (b) for a frequency k sorting the frames and obtaining a noise estimate, wherein the noise estimate is calculated as the quantile (c) applying a recursive function to generate, the recursive function is arranged to reduce fluctuations in the noise estimate. Also, a method of saving processing power by only updating odd frequency bands for a one-time frame and only updating even frequency bands for the next time frame.(GB2426167A). 
In some embodiments, a method comprises: dividing, using at least one processor, an audio input into speech and non-speech segments; for each frame in each non-speech segment, estimating, using at least one processor, a time-varying noise spectrum of the non-speech segment; for each frame in each speech segment, estimating, using the at least one processor, speech spectrum of the speech segment; for each frame in each speech segment, identifying one or more non-speech frequency components in the speech spectrum; comparing the one or more non-speech frequency components with one or more corresponding frequency components in a plurality of estimated noise spectra and selecting the estimated noise spectrum from the plurality of estimated noise spectra based on a result of the comparing. (WO2022066590A1).
Summary of Invention
This patent discusses the noise estimation method for speech enhancement systems in commonly usedhearing aids and speech-to-speech communication systems. The proposed method, i.e., the Improved DQT noise estimation method, is given a good quality speech signal with less computation time and memory. Regardingthe PESQ Score, its performance is better than the existing method and PESQ score of noisy speech signal score. The waveform-spectrogram of the enhanced speech signal are also observed that it is very close to the waveform-spectrogram of the original speech signal. So, this noise estimation can use for enhancement purposes in hearing aid devices.
Detailed description of the invention
Fig.1 shows the block diagram of the IDQT noise estimation-based speech enhancement technique. In this technique, only a noisy speech signal is used as an input signal for finding the original speech signal. The observed noisy speech signal (it contains real speech data and background noises) first passes through Fast Fourier Transform (FFT) to transform into a noisy speech spectrum? S?_i (k). For obtaining original speech spectrum(? Y?_ ) ^(i,k), the magnitude of the noisy speech spectrum is used by the spectral subtraction method. The spectral subtraction methodis done by a generalized equation which is expressedas
|(Y_ ) ^(i,k) |= {¦(?[|S_i (k) |^?-a?(E_i (k))?^?]?^(1/?), if?? |S?_i (k)|>(a+b)?^(1/?) E_i (k)@b^(1/?) E_i (k), otherwise)¦ (1)
where a=over-subtraction factor,b=floor factor and ?=exponent factor (Unity for magnitude spectral subtraction and two for power spectral subtraction method)
In equation (1), the noise spectrum E_i (k) requires and it estimates through the noise estimation method. Here for estimating the noise spectrum, Improved Dynamic Quantile Tracking (IDQT)is proposed; detail of IDQT is mentioned in the next paragraph. The speciality of this method is that it doesn’t need much memory and takes less computational complexity compared with existing original DQT and othermethods. Next, The natural speech signal is obtained by removing the noise spectrum from the observed degraded speech spectrum, in accordance with the generalized equation of spectrum subtraction.. The obtained spectrum (enhanced speech spectrum) and phase spectrum of noisy speech signals are combined into the complex spectrum. We have applied Inverse Fast Fourier Transform (IFFT) to recover this spectrum into the time domain. In this process, we assumed that speech and noise were uncorrected and also, the phase spectrum of noisy speech spectrum was independent of noise.
In IDQT noise estimation, the input signal, i.e., frequency domain degraded speech signal, is applied to different frames. For every frame of the input spectrum, we assign one quantile value which is generated using the previous frame of the estimated noise spectrum. It can be done through increment/decrement on the samples of the previous frame of the generated orcalculated noise spectrum. The preferredincrement or decrement value in previous noise samples is processed through the suitable range values. However, the spreading of the degraded speech spectrum is not known, which means that the range's approximation is required to be dynamically acquired. Differences e_i (k) in between the obtained noise spectrum of the present frameE_i (k) and previous frame E_(i-1) (k) can be measuredby the degraded speech spectrum and noise spectrum. If the magnitude of that degraded spectrum is larger than the previously obtained noise samples, then there will be a minor increment in the next frame's noise spectrum. On the other hand, there will be a reduction in the following frame if the volume of the noisy speech is smaller than the previously calculated noise spectrum. For finding the quantile value? Q?_i (k) of the noise spectrumE_i (k), the following equation can be used.
$$
Q_i(k)=\begin{cases}
Q_{i-1}(k)+q\,R_i(k), & |S_i(k)|\ge|Q_{i-1}(k)|\\
Q_{i-1}(k)-(1-q)\,R_i(k), & \text{otherwise}
\end{cases}
\tag{2}
$$
where R_i(k), q and Q_i(k) are the range, the initial fixed quantile level and the estimated quantile value of the degraded spectrum S_i(k), respectively. The range is obtained dynamically by subtracting the valley V_i(k) from the dynamic peak P_i(k). The peak and valley values are updated using first-order recursive equations, which are given as follows:
$$
P_i(k)=\begin{cases}
t\,P_{i-1}(k)+(1-t)\,|E_i(k)|, & |S_i(k)|\ge P_{i-1}(k)\\
s\,P_{i-1}(k)+(1-s)\,|V_{i-1}(k)|, & \text{otherwise}
\end{cases}
\tag{3}
$$

$$
V_i(k)=\begin{cases}
t\,V_{i-1}(k)+(1-t)\,|E_i(k)|, & |S_i(k)|\le V_{i-1}(k)\\
s\,V_{i-1}(k)+(1-s)\,|P_{i-1}(k)|, & \text{otherwise}
\end{cases}
\tag{4}
$$
where s and t are the coefficients that set the fall and rise times of the detection, respectively.
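The peak and valley recursions of equations (3) and (4) can be sketched for a single frequency bin as shown below. This is a hedged illustration only: the function names are hypothetical, the branch conditions are assumed to be |S_i(k)| ≥ P_(i-1)(k) for the peak and |S_i(k)| ≤ V_(i-1)(k) for the valley, and the magnitude term written as |E_i(k)| in the equations is passed in as the argument `e_mag` so that the caller decides how it is obtained.

```python
def update_peak(p_prev, v_prev, s_mag, e_mag, t, s):
    """Equation (3): first-order update of the dynamic peak P_i(k) for one bin."""
    if s_mag >= p_prev:
        # Rise branch (assumed condition): move toward the magnitude term.
        return t * p_prev + (1.0 - t) * e_mag
    # Fall branch: decay toward the previous valley.
    return s * p_prev + (1.0 - s) * abs(v_prev)


def update_valley(p_prev, v_prev, s_mag, e_mag, t, s):
    """Equation (4): first-order update of the dynamic valley V_i(k) for one bin."""
    if s_mag <= v_prev:
        # Assumed condition: the input has dropped to or below the valley.
        return t * v_prev + (1.0 - t) * e_mag
    # Otherwise drift toward the previous peak.
    return s * v_prev + (1.0 - s) * abs(p_prev)


def update_range(p_prev, v_prev, s_mag, e_mag, t, s):
    """Return (P_i, V_i, R_i), where R_i = P_i - V_i is the dynamic range."""
    peak = update_peak(p_prev, v_prev, s_mag, e_mag, t, s)
    valley = update_valley(p_prev, v_prev, s_mag, e_mag, t, s)
    return peak, valley, peak - valley


# Example with illustrative coefficients t = s = 0.5 (not values from the source).
peak, valley, rng = update_range(2.0, 1.0, 3.0, 3.0, t=0.5, s=0.5)
```

Only the previous peak and valley of each bin need to be stored between frames, which is consistent with the low memory requirement stated for the method.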
The block diagram of the IDQT-based noise estimation technique is shown in Fig. 2. The peak and valley values in this diagram are determined by the peak and valley calculation blocks, which implement equations (3) and (4), respectively. The range R_i(k) is calculated from the peak P_i(k) and valley V_i(k) values. With the help of R_i(k) and Q_i(k), as given in equation (2), the noise spectrum E_i(k) is estimated.
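The quantile update of equation (2) can be sketched for one frequency bin as follows. This is a minimal sketch under stated assumptions: the names `update_quantile` and `track_bin` are hypothetical, the increment condition is taken as |S_i(k)| ≥ |Q_(i-1)(k)|, the range R_i(k) is supplied by the caller, and the tracked quantile Q_i(k) is used directly as the noise estimate E_i(k), which the description does not state explicitly.

```python
def update_quantile(q_prev, s_mag, range_value, q):
    """Equation (2): one quantile-tracking step for a single bin.

    q_prev      : previous quantile estimate Q_(i-1)(k)
    s_mag       : magnitude of the degraded spectrum |S_i(k)|
    range_value : dynamic range R_i(k) = P_i(k) - V_i(k)
    q           : fixed quantile level, 0 < q < 1
    """
    if s_mag >= abs(q_prev):
        # Input at or above the estimate: small increment proportional to q.
        return q_prev + q * range_value
    # Input below the estimate: decrement proportional to (1 - q).
    return q_prev - (1.0 - q) * range_value


def track_bin(magnitudes, ranges, q, q_init=0.0):
    """Run the update over successive frames of one bin.

    Assumption: the returned quantile values are taken as the noise
    estimates E_i(k) for that bin.
    """
    estimates = []
    q_prev = q_init
    for s_mag, range_value in zip(magnitudes, ranges):
        q_prev = update_quantile(q_prev, s_mag, range_value, q)
        estimates.append(q_prev)
    return estimates


# Example: constant input magnitude and constant range, quantile level 0.5.
noise_track = track_bin([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], q=0.5)
```

Because each step only adds or subtracts a scaled range value, no sorting or frame history is required, which is the property that keeps the memory and computation per bin small.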
Brief Description of Drawing
The following figures illustrate exemplary embodiments of the invention.
Figure 1 Speech Enhancement System.
Figure 2 Improved Dynamic Quantile Tracking based Noise Estimation Method.
Figure 3 Performance of IDQT in terms of speech quality.
Figure 4 Waveform-Spectrogram of the original speech signal.
Figure 5 Waveform-Spectrogram of the noisy speech signal.
Figure 6 Waveform-Spectrogram of the enhanced speech signal.
Detailed description of the drawing
As described above, the present invention relates to IDQT-based noise estimation for speech enhancement systems.
Figure 1 shows the block diagram of the speech enhancement technique based on IDQT noise estimation. In this technique, only the noisy speech signal is used as input for recovering the original speech signal. The observed noisy speech signal first passes through the Fast Fourier Transform (FFT), which converts it into the noisy speech spectrum. To obtain the original speech spectrum, the spectral subtraction method uses the magnitude of the noisy speech spectrum, and Improved Dynamic Quantile Tracking (IDQT) is used to estimate the noise spectrum. Next, the obtained enhanced magnitude spectrum and the phase spectrum of the noisy speech signal are combined into a complex spectrum. The Inverse Fast Fourier Transform (IFFT) is then applied to convert this spectrum back into the time domain.
The block diagram of the IDQT-based noise estimation technique is shown in Figure 2. In this figure, the dynamic peak and valley detectors determine the peak and valley values. Next, the Range block calculates the range value from the estimated peak and valley values. Using the range, the quantile value and the degraded speech spectrum, the final noise spectrum is calculated frame by frame.
The Improved DQT-based noise estimation method has been tested using an objective measure, i.e., the PESQ score. From the SpEAR database, we chose 28 different degraded speech signals at different SNR levels: 1, 3, 5, 7, 9, 11, 13 and 15 dB. These speech files (spoken by both male and female speakers) were passed through the proposed speech enhancement model, and the enhanced speech signals were stored. Then, we calculated the PESQ scores of the enhanced speech signals as well as of the degraded speech signals. Figure 3 shows the simulation result (a plot of PESQ score versus SNR) for two different speech files, “noisy speech file 1” (spoken by a male speaker) and “noisy speech file 2” (spoken by a female speaker), which are corrupted with f16 cockpit noise and pink noise, respectively. In this figure, the PESQ score of the recovered (enhanced) speech signals is higher than that of the observed (noisy) speech signals at each SNR level: 1, 3, 5, 7, 9, 11, 13 and 15 dB.
Figures 4-6 show the waveforms and spectrograms of the clean, noisy and enhanced speech signals. These signals are taken from the SpEAR speech database. Figure 4 shows the waveform and spectrogram of the clean speech signal, spoken by a female speaker and sampled at 16 kHz. The speech signal contains the following sentence: “I am sitting in the morning at the diner on the corner; I am waiting at the counter for the man to pour the coffee; and he fills it only halfway and before I even argue; He is looking out the window at somebody coming in”. Figure 5 presents the waveform and spectrogram of the noisy speech signal, which is corrupted by pink noise at an SNR of 1 dB. This file was not generated by mixing clean speech with noise; it was downloaded as part of the speech database. This wave file (the noisy speech signal) is then processed through the different Simulink models, and the processed speech, i.e., the enhanced speech signal, is saved as a wave file. Figure 6 shows the waveform and spectrogram of the enhanced speech signal obtained from the speech enhancement system with the IDQT noise estimation method. From Figures 4-6, it can be seen that the waveform and spectrogram of the speech enhanced with the IDQT method are close to those of the clean speech signal.
5 Claims & 6 Figures
| # | Name | Date |
|---|---|---|
| 1 | 202341065911-REQUEST FOR EARLY PUBLICATION(FORM-9) [30-09-2023(online)].pdf | 2023-09-30 |
| 2 | 202341065911-FORM FOR SMALL ENTITY(FORM-28) [30-09-2023(online)].pdf | 2023-09-30 |
| 3 | 202341065911-FORM FOR SMALL ENTITY [30-09-2023(online)].pdf | 2023-09-30 |
| 4 | 202341065911-FORM 1 [30-09-2023(online)].pdf | 2023-09-30 |
| 5 | 202341065911-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [30-09-2023(online)].pdf | 2023-09-30 |
| 6 | 202341065911-EVIDENCE FOR REGISTRATION UNDER SSI [30-09-2023(online)].pdf | 2023-09-30 |
| 7 | 202341065911-EDUCATIONAL INSTITUTION(S) [30-09-2023(online)].pdf | 2023-09-30 |
| 8 | 202341065911-DRAWINGS [30-09-2023(online)].pdf | 2023-09-30 |
| 9 | 202341065911-COMPLETE SPECIFICATION [30-09-2023(online)].pdf | 2023-09-30 |