
Robust Voice Activity Detection In Adverse Environments

Abstract: Methods and a system for robust voice activity detection under adverse environments are disclosed. The present invention comprises a signal receiving module; a signal blocking module; a silent/non-silent classification module for discriminating silent blocks by comparing temporal features to pre-determined thresholds; a total variation filtering module for enhancing voiced portions and reducing the effect of background noises; a frame division module for dividing the filtered signal into small frames; a residual processing module for estimating the noise floor; a silent/non-silent frame classification module; a voice/non-voice signal frame classification module based on autocorrelation features of the total variation filtered signal; a binary-flag merging and deletion module; a voice endpoint detection and correction module; and a voice endpoint storing/sending module. A decision tree is arranged based on the time and memory complexity of the feature extraction methods. The preferred system is capable of accurately determining endpoints of voice regions in audio signals under different adverse environments. FIG. 2


Patent Information

Filing Date: 05 September 2012
Publication Number: 36/2016
Publication Type: INA
Invention Field: ELECTRONICS
Grant Date: 2021-03-25

Applicants

Samsung India Electronics Pvt Ltd.
Samsung India Electronics Pvt. Ltd., Logix Cyber Park, Plot No. C-28 & 29, Tower D, Noida Sec-62

Inventors

1. M. Sabarimalai Manikandan
2/164, West Street, Chitoor, Tirumangalam Taluk, Madurai District, Tamilnadu-625707, INDIA
2. Saurabh Tyagi
H. No. 6/159, Sector 2, Rajendra Nagar, Ghaziabad (Pin code 201005), Uttar Pradesh, INDIA

Specification

FIELD OF INVENTION
[001] The present invention relates to the field of speech and audio processing, and more particularly to voice activity detection in voice processing apparatus under adverse environmental conditions such as noise, background sound, and channel distortions.

BACKGROUND OF INVENTION
[002] Recent growth in communication technologies and the concurrent development of powerful electronic devices have enabled a variety of multimedia-related techniques. The usage of many voice-enabled devices, systems, and communication technologies is limited by issues related to device battery life (or power consumption), accuracy, and transmission and storage cost. In audio processing and communication systems, the overall performance in terms of accuracy, computational complexity, memory consumption, and other factors greatly depends on the ability to discriminate the voiced speech signal from the unvoiced/noise signal present in an input audio signal under adverse environments, where various kinds of noises exist.
[003] Existing systems and methods have attempted voice/speech activity detection, voiced and non-voiced detection, temporal- and spectral-feature based systems, source-filter based systems, time-frequency domain based systems, audio-visual based systems, statistical systems, entropy based systems, short-time spectral analysis systems, and speech endpoint/boundary detection for discriminating a voice signal portion from a non-voice signal portion by using feature information extracted from the input signal. However, it is difficult to detect and extract the voice signal portion, since the voice signal is usually corrupted by a wide range of background sounds and noises.
[004] The existing systems and methods for voice/speech detection have many shortcomings: (i) their performance may be diminished in highly non-stationary and low signal-to-noise ratio (SNR) environments; (ii) they may not be robust under various types of background sound sources, including applause, laughter, crowd, cheer, whistling, explosive sounds, babble, train, car, and so on; (iii) they have limited discriminative power in characterizing signal frames having periodic structured noise components; and (iv) fixing a peak amplitude threshold for computing the periodicity from the autocorrelation lag index is very difficult under different noises and noise levels.
[005] Due to the above-mentioned reasons, the existing systems and methods fail to provide better detection when the level of background noise increases and the signal is corrupted by time-varying noise levels. Thus, the use of appropriate noise-robust features to characterize speech and non-speech signals is critical for all detection problems. Hence, there is a need for a system which achieves better detection performance at low computational cost.

OBJECT OF INVENTION
[006] The principal object of the embodiments herein is to provide a method and system to achieve robust voice activity detection under adverse environmental conditions.
[007] Another object of the invention is to provide a method to determine endpoints of voice regions.
[008] Another object of the invention is to provide a method to perform noise reduction and to improve the robustness of voice activity detection against different kinds of realistic noises at varying noise levels.

SUMMARY
[009] Accordingly, the invention provides a method for voice activity detection (VAD) in adverse environmental conditions. The method includes receiving an input signal from a source. The method also includes classifying the input signal into a silent or non-silent signal block by comparing temporal feature information. The method also includes sending the silent or non-silent signal block to a voice endpoint storing (VES) module or a total variation (TV) filtering module by comparing the temporal feature information to pre-determined thresholds. The method also includes determining endpoint information of a voice signal or non-voice signal. The method also includes employing total variation (TV) filtering for enhancing speech features and suppressing noise levels in non-speech portions. Further, the method includes determining the noise floor in the TV filtered signal domain. Furthermore, the method includes determining feature information in the autocorrelation of the TV filtered signal sequence. Further, the method includes determining binary-flag merging and deletion (BSMD) based on the pre-determined duration threshold on the determined feature information by the BSMD module. Further, the method includes determining voice endpoint correction based on the short-term temporal feature information after the determined binary-flag merging and deletion, and outputting the input signal with the voice endpoint information.
[0010] Accordingly, the invention provides a system for voice activity detection (VAD) in adverse environmental conditions. The system is configured for receiving an input signal from a source. The system is also configured for classifying the input signal into a silent or non-silent signal block by comparing temporal feature information. The system is also configured for sending the silent or non-silent signal block to a voice endpoint storing (VES) module or a total variation filtering module by comparing the temporal feature information to the pre-determined thresholds. The system is also configured for determining endpoint information of a voice signal or non-voice signal. The system is also configured for employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions. Further, the system is configured for determining the noise floor in the total variation filtered signal domain, and for determining feature information in the autocorrelation of the total variation filtered signal sequence. Furthermore, the system is configured for determining binary-flag merging and deletion (BSMD) based on the pre-determined duration threshold on the determined feature information. Furthermore, the system is configured for determining voice endpoint correction based on the short-term temporal feature information after the determined binary-flag merging and deletion, and for outputting the input signal with the voice endpoint information.
[0011] Accordingly, the invention provides an apparatus for voice activity detection in adverse environmental conditions. The apparatus includes an integrated circuit further including a processor and a memory having computer program code within the circuit. The memory and the computer program code are configured to, with the processor, cause the apparatus to receive an input signal from a source. The processor causes the apparatus to classify the input signal into a silent or non-silent signal block by comparing temporal feature information. The processor causes the apparatus to send the silent or non-silent signal block to a voice endpoint storing (VES) module or a total variation filtering module by comparing the temporal feature information to the pre-determined thresholds. Further, the processor causes the apparatus to determine endpoint information of a voice signal or non-voice signal by the VES module or the total variation filtering module. Furthermore, the processor causes the apparatus to employ total variation filtering by the total variation filtering module for enhancing speech features and suppressing noise levels in non-speech portions. Furthermore, the processor causes the apparatus to determine the noise floor in the total variation filtered signal domain. Furthermore, the processor causes the apparatus to determine feature information in the autocorrelation of the total variation filtered signal sequence. Furthermore, the processor causes the apparatus to determine binary-flag merging and deletion (BSMD) based on the pre-determined duration threshold on the determined feature information by the BSMD module. Furthermore, the processor causes the apparatus to determine voice endpoint correction based on the short-term temporal feature information after the determined binary-flag merging and deletion, and to output the input signal with the voice endpoint information.
[0012] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.

BRIEF DESCRIPTION OF FIGURES
[0013] This invention is illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
[0014] FIG. 1 illustrates a schematic block diagram of an arrangement of speech and audio processing applications with a voice activity detector apparatus, in accordance with various embodiments of the present invention;
[0015] FIG. 2 illustrates a block diagram of a voice activity detector apparatus, in accordance with various embodiments of the present invention;
[0016] FIG. 3 illustrates a flow diagram which describes the process of voice activity detection, in accordance with various embodiments of the present invention;
[0017] FIG. 4 illustrates a flow diagram explaining the method for determining a silent signal block and a non-silent block, in accordance with various embodiments of the present invention;
[0018] FIG. 5 illustrates a graph indicating the effectiveness of total variation filtering under a realistic noise environment, in accordance with various embodiments of the present invention;
[0019] FIG. 6 illustrates a graph indicating the effectiveness of total variation filtering for a speech signal corrupted by babble noise with varying noise levels, in accordance with various embodiments of the present invention;
[0020] FIG. 7 illustrates a graph indicating the effectiveness of total variation filtering for a speech signal corrupted by airport noise, in accordance with various embodiments of the present invention;
[0021] FIG. 8 illustrates a graph indicating the noise-reduction capability of total variation filtering for a speech signal corrupted by time-varying levels of additive white Gaussian noise, in accordance with various embodiments of the present invention;
[0022] FIG. 9 illustrates a flow diagram explaining the process of the silent/non-silent frame classification (SNFC) module, in accordance with various embodiments of the present invention;
[0023] FIG. 10 illustrates a graph indicating experimental results of the SNFC module, in accordance with various embodiments of the present invention;
[0024] FIG. 11 illustrates a graph indicating experimental results of the SNFC module, in accordance with various embodiments of the present invention;
[0025] FIG. 12 illustrates a flow diagram explaining the voice/non-voice signal frame classification (VNFC), in accordance with various embodiments of the present invention;
[0026] FIG. 13 illustrates a graph indicating patterns of the features extracted from the autocorrelation of the total variation filtered signal, in accordance with various embodiments of the present invention;
[0027] FIG. 14 illustrates a flow diagram explaining the process of binary-flag merging and deletion (BSMD), in accordance with various embodiments of the present invention;
[0028] FIG. 15 illustrates a graph indicating outputs for speech corrupted by train noise, in accordance with various embodiments of the present invention;
[0029] FIG. 16 illustrates a graph indicating outputs for a clean speech signal, in accordance with various embodiments of the present invention;
[0030] FIG. 17 illustrates a flow diagram explaining the process of voice endpoint determination and correction (VEDC), in accordance with various embodiments of the present invention;
[0031] FIG. 18 illustrates a graph indicating outputs of the voice/non-voice classification module, the binary-flag merging/deletion (BSMD) module, and the voice endpoint determination and correction module, in accordance with various embodiments of the present invention;
[0032] FIG. 19 illustrates a graph indicating outputs of the voice/non-voice classification module, the binary-flag merging/deletion (BSMD) module, and the voice endpoint determination and correction module, in accordance with various embodiments of the present invention;
[0033] FIG. 20 illustrates a graph indicating outputs of the voice/non-voice classification module, the binary-flag merging/deletion (BSMD) module, and the voice endpoint determination and correction module, in accordance with various embodiments of the present invention; and
[0034] FIG. 21 illustrates a computing environment implementing the application, in accordance with various embodiments of the present invention.

DETAILED DESCRIPTION OF INVENTION
[0035] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein can be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0036] The embodiments herein achieve a method and system of voice activity detection which can be used in a wide range of audio and speech processing applications. The proposed method accurately detects voice signal regions and determines the endpoints of voice signal regions in audio signals under diverse kinds of background sounds and noises with varying noise levels.
[0037] Referring now to the drawings, and more particularly to FIGS. 1 through 21, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments.
[0038] FIG. 1 illustrates a schematic block diagram of an arrangement of speech and audio processing applications with a voice activity detector apparatus, in accordance with various embodiments of the present invention. As depicted in FIG. 1, the input signal receiving (ISR) module 101 provides a data acquisition interface to receive an input signal from different sources. In an embodiment, the sources can be portable devices, microphones, storage devices, communication channels, and the like. The ISR module 101 indicates the received data (or signal) format, such as sampling frequency (number of samples per second), sample resolution (number of bits per sample) and coding standards. Further, the ISR module 101 includes a method for converting the received data into a waveform, and provides a method to resample the received signal at a pre-determined sampling rate or change the sampling-rate conversion if required. The ISR module 101 handles the standard coding and sampling rates utilized in various audio signal processing systems. The outputs of one or more microphones are coupled with analog-to-digital converters (ADCs), which provide the digital form of the analog signal for pre-determined ADC specifications. The ISR module 101 further supports constructing the signal from the measurements received from a compressive sensing system by using sparse coding with a pre-determined convex optimization technique. The voice activity detection (VAD) module 102 can be used for several applications, which may include but are not limited to automatic speech recognition, speech enhancement (noise modeling), speech compression, pitch/formant determination, voiced/unvoiced speech recognition, speech disorder and disease analysis, HD voice telephony, vocal tract information, human emotion recognition, audio indexing and retrieval, noise suppression, automatic speaker recognition and speech-driven animation.
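By way of a hedged illustration of the resampling function described for the ISR module 101 (the document specifies no implementation), the sketch below converts a received waveform to a pre-determined sampling rate with standard polyphase resampling; the helper name resample_to is hypothetical:

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to(signal: np.ndarray, fs_in: int, fs_out: int) -> np.ndarray:
    """Hypothetical sketch of the ISR resampling step: bring a received
    waveform to the pre-determined sampling rate fs_out."""
    if fs_in == fs_out:
        return signal
    g = np.gcd(fs_in, fs_out)
    # Polyphase resampling: upsample by fs_out/g, then downsample by fs_in/g.
    return resample_poly(signal, fs_out // g, fs_in // g)
```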
[0039] In an embodiment, the VAD module 102 can be an integrated circuit, a system-on-a-chip (SoC or SOC), a communication device (mobile phone, Personal Digital Assistant (PDA), tablet) and the like.
[0040] FIG. 2 illustrates a block diagram of the voice activity detection (VAD) module 102, in accordance with various embodiments of the present invention. The voice activity detection (VAD) module 102 is used for determining the endpoints of voice signal portions in the audio signal. The VAD comprises an input signal receiving (ISR) module 101, a signal block division (SBD) module 201, a silent/non-silent block classification (SNBC) module 202, a total variation filtering (TVF) module 203, a total variation residual processing (TVRP) module 204, a total variation filtered signal frame division (SFD) module 205, a voice endpoint storing/sending (VES) module 206, a voice/non-voice signal frame classification (VNFC) module 207, a silent/non-silent frame classification (SNFC) module 208, a voice endpoint determination and correction (VEDC) module 209, and a binary-flag storing, merging and deletion (BSMD) module 210.
[0041] In an embodiment, the signal block division (SBD) module 201 includes a memory buffer, a plurality of programs, and a history of memory allocations. Further, the SBD module 201 sets a pre-determined block length based on the buffer memory size of the processing device, and divides the input discrete-time signal received from the data acquisition module into equal-sized blocks of N×1 samples. The selection of an appropriate block length depends on the type of application of interest, as well as on the memory size allocated for the scheduled task and other internal resources, such as processor power consumption, processor speed, memory, or I/O (input/output) of audio communication and processing devices.
[0042] Further, the SBD module 201 waits for a specific period of time for the audio data to be acquired sufficiently, and then releases the collected data for further processing when the memory buffer gets full. The SBD module 201 holds data for a short period of time until the finishing cycle of the VAD process. The internal memories of the SBD module 201 are refreshed periodically. Then, it continues the next block processing based on the action variable information. The SBD module 201 maintains history information including the start and endpoint positions of a block, the memory size, and the action variable state information.
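A minimal sketch of the block division performed by the SBD module 201 follows; zero-padding the final partial block is an assumption, since the document only states that blocks are equal-sized N×1 segments:

```python
import numpy as np

def divide_into_blocks(x: np.ndarray, block_len: int) -> np.ndarray:
    """Divide a discrete-time signal into equal-sized blocks of block_len
    samples (SBD module sketch); the final block is zero-padded (assumption)."""
    n_blocks = int(np.ceil(len(x) / block_len))
    padded = np.zeros(n_blocks * block_len)
    padded[:len(x)] = x
    return padded.reshape(n_blocks, block_len)
```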
[0043] FIG. 3 illustrates a flow diagram 300 which describes the process of voice activity detection, in accordance with various embodiments of the present invention. As depicted in flow diagram 300, at step 301, the input signal is initially received from communication channels, recording devices and databases. In an embodiment, the input signal can be received from portable devices, microphones, storage devices, communication channels and so on. Step 302 classifies a signal block as a silent or non-silent block using feature parameters extracted from the signal block. Step 303 applies total variation filtering to the non-silent signal block with a desired regularization parameter, which can be used for speech enhancement by smoothing out background noises and preserving the high slopes of voice components. In an embodiment, the total variation filtering prevents pitch doubling and pitch halving errors, which are introduced due to variations at the phoneme level and a prominent slowly varying wave component between two pitch peak portions.
[0044] Further, at step 304, the filtered signal is divided into signal frames, and at step 305 the signal frames are classified as silent or non-silent frames using feature parameters extracted from the signal frame and the total variation residual under a wide range of background noises encountered in real-world applications. At step 306, binary values generated by the voice/non-voice signal classification process are stored (1: voice and 0: non-voice). At step 307, merging and deleting of signal frames using the duration information, by processing the binary sequence information obtained for each signal block, takes place. At step 308, the endpoint of the voice signal is determined by using the binary sequence information and the energy envelope information. Further, at step 309, the endpoints are corrected using the feature parameter computed from the portion of signal samples extracted at the endpoints determined in the previous steps. At step 310, the voice endpoint information, or the input signal with the voice endpoint information, is output to the speech-related technologies and systems. The various actions in method 300 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 3 may be omitted.
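As a hedged illustration of steps 306 to 308, the sketch below converts the per-frame binary-flag sequence into (onset, offset) sample indices; the function name and the frame-geometry arguments are illustrative assumptions:

```python
import numpy as np

def find_endpoints(flags, frame_shift: int, frame_len: int):
    """Turn the binary-flag sequence (1: voice, 0: non-voice) into
    (onset, offset) sample indices, sketching steps 306-308 of FIG. 3."""
    f = np.concatenate(([0], np.asarray(flags, dtype=int), [0]))
    starts = np.flatnonzero(np.diff(f) == 1)      # 0 -> 1 transitions
    ends = np.flatnonzero(np.diff(f) == -1) - 1   # last voiced frame of each run
    return [(s * frame_shift, e * frame_shift + frame_len - 1)
            for s, e in zip(starts, ends)]
```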
[0045] FIG. 4 illustrates a flow diagram 400 explaining the method for determining silent and non-silent signal blocks, in accordance with various embodiments of the present invention. At step 401, the signal block from the SBD module 201 is received. At step 402, temporal features are computed for the signal block. At step 403, the feature information is compared to pre-determined thresholds. At step 404, a check is performed to find whether the feature information is greater than the pre-determined threshold. If the feature information is greater than the pre-determined threshold, the signal block is considered non-silent and is sent to the total variation filtering module 203. If the feature information is smaller than the pre-determined threshold, the signal block is considered silent and is sent to the voice endpoint storing/sending module 206.
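A minimal sketch of this block-level decision, assuming short-term energy and zero-crossing rate as the temporal features (both are named below in paragraph [0047]) and a simple AND rule over the pre-determined thresholds:

```python
import numpy as np

def is_silent_block(block: np.ndarray, energy_thr: float, zcr_thr: float) -> bool:
    """Classify a signal block as silent/non-silent from temporal features.
    The single-threshold AND rule is an illustrative assumption; the document
    uses a hierarchical decision tree over these features."""
    energy = np.mean(block ** 2)                                   # short-term energy
    zcr = np.mean(np.abs(np.diff(np.signbit(block).astype(int))))  # zero crossing rate
    return energy < energy_thr and zcr < zcr_thr
```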
[0046] In an embodiment, the silent/non-silent block classification (SNBC) module 202 includes means for receiving an input signal block from the memory buffer; means for determining temporal feature parameters from the received signal block; means for determining silent blocks by comparing the extracted temporal feature parameters to the pre-determined thresholds; means for determining endpoints of a non-silent signal block; and means for generating action variable information to send the signal block either to the voice endpoint storing/sending module 206 or to the total variation filtering module 203.
[0047] Further, the SNBC module 202 is constructed using a hierarchical decision-tree (HDT) scheme with pre-determined thresholds. The SNBC module 202 extracts one or more temporal features (energy, zero crossing rate, and energy envelope) from an input signal block received from the signal block division (SBD) module 201. The temporal features may represent the various natures of the audio signal that can be used to classify the input signal block. The HDT uses the feature information extracted from the input signal block and the pre-determined threshold for detecting a silent signal block. The HDT sends the signal block, as an output, to the total variation filtering module 203 only when the feature information of a signal block is equal to or greater than the pre-determined threshold. The method provides signal frame division (SFD) for dividing a total variation filtered signal into consecutive signal frames.
[0048] The SFD module 205 receives a filtered signal block from the TV filtering module 203 and then divides the received filtered signal into equal-sized overlapping short signal frames with a pre-determined frame length of L samples. The frame length and the frame shift are adopted based on the system requirements. The SFD module 205 sends a signal frame to the silent/non-silent frame classification module 208 according to the action variable information received from the succeeding modules. In another aspect of the HDT, the decision stage considers the signal block to be a silent block when the feature information is smaller than the pre-determined threshold. In such a scenario, the SNBC module 202 directly sends action variable information to the voice endpoint storing/sending (VES) module 206 without sending the block to the other signal processing units. The main objective of the preferred SNBC module 202 is to reduce computational cost and power consumption, since a long silent interval frequently occurs between two successive voice signal portions. The various actions in method 400 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 4 may be omitted.
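A sketch of the SFD framing step; frame_len (L) and frame_shift are the pre-determined parameters whose values the document leaves to the system requirements, and the input is assumed to be at least one frame long:

```python
import numpy as np

def divide_into_frames(x: np.ndarray, frame_len: int, frame_shift: int) -> np.ndarray:
    """Divide a TV-filtered block into equal-sized overlapping frames of
    frame_len samples advanced by frame_shift samples (SFD module sketch).
    Assumes len(x) >= frame_len."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])
```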
[0049] FIG. 5 illustrates a graph indicating the effectiveness of total variation filtering under a realistic noise environment, in accordance with various embodiments of the present invention. The graph illustrates an input speech signal corrupted by train noise. FIG. 5a is the first plot, depicting the speech signal corrupted with train noise. FIG. 5b is the second plot, depicting the output of the preferred total variation filter, that is, the signal filtered using total variation filtering. FIG. 5c is the third plot, depicting the residual signal obtained between the input signal and the TV filtered signal. FIG. 5d is the fourth plot, depicting the normalized energy envelope obtained for the input signal. FIG. 5e is the fifth plot, depicting the normalized energy envelope obtained for the TV filtered signal.
[0050] The total variation filtering technique is a process often used in digital image processing that has applications in noise removal. Total variation filtering is based on the principle that signals with excessive and possibly spurious details have high total variation; that is, the integral of the absolute gradient of the signal is high.
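For a discrete 1-D signal y, total variation denoising solves min over x of ½‖y − x‖² + λ Σₙ |x[n+1] − x[n]|, where λ is the regularization parameter mentioned in step 303. The sketch below uses the iterative-clipping algorithm (Selesnick), one standard solver for this problem; the document does not specify which solver its TVF module 203 employs:

```python
import numpy as np

def tv_filter(y: np.ndarray, lam: float, n_iter: int = 100) -> np.ndarray:
    """1-D total variation denoising by iterative clipping (one standard
    solver; not necessarily the one used by the TVF module).
    Minimizes 0.5 * ||y - x||^2 + lam * sum_n |x[n+1] - x[n]|."""
    x = y.copy()
    z = np.zeros(len(y) - 1)   # dual variable, one entry per first difference
    alpha = 4.0                # >= max eigenvalue of D D^T for the 1st-order difference D
    for _ in range(n_iter):
        # x-update: x = y - D^T z, with D the first-order difference matrix
        x = y - np.concatenate(([-z[0]], -np.diff(z), [z[-1]]))
        # z-update: clip the scaled running difference into [-lam/2, lam/2]
        z = np.clip(z + np.diff(x) / alpha, -lam / 2.0, lam / 2.0)
    return x
```

A larger λ smooths background noise more aggressively at the cost of flattening low-amplitude voiced detail, which is consistent with the document's "desired regularization parameter" of step 303.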
[0051] FIG. 6 illustrates a graph indicating the effectiveness of total variation filtering for a speech signal corrupted by babble noise with varying noise levels, in accordance with various embodiments of the present invention. FIG. 6a is the first plot, indicating the speech signal corrupted with babble noise. FIG. 6b is the second plot, indicating the signal filtered using total variation filtering. FIG. 6c is the third plot, indicating the energy envelope of the noisy speech signal. The energy envelope shown in the third plot illustrates the limitations of existing energy-threshold based VAD systems. FIG. 6d is the fourth plot, indicating the energy envelope of the TV filtered signal. The experimental result in FIG. 6d demonstrates that the preferred total variation filtering process may provide an excellent feature for more accurately detecting and determining endpoints of speech regions.
[0052] FIG. 7 illustrates a graph indicating the effectiveness of total variation filtering for a speech signal corrupted by airport noise, in accordance with various embodiments of the present invention. Under varying noise levels, the total variation filtering technique provides an effective, robust system for more accurately detecting a voice signal activity period, which can reduce the total number of false and missed detections by maintaining the energy level (or noise floor, or magnitude) of a non-voice signal portion even under varying background noise levels. From the experimental results, it can be observed that the system with the total variation filtered signal can produce better detection rates by using noise floor (or level) estimates measured from the total variation residual, which is obtained between the original and total variation filtered signals. The VAD system processes and extracts feature parameters from both the total variation filtered and total variation residual signals. The feature extraction from the total variation filtered signal may increase the robustness of the features and thus improve overall detection accuracy under different adverse conditions.
[0053] FIG. 8 illustrates a graph which indicates the noise-reduction capability of total variation filtering for a speech signal corrupted by time-varying levels of additive white Gaussian noise (AWGN), in accordance with various embodiments of the present invention. FIG. 8a is the first plot, indicating the speech signal corrupted with AWGN. FIG. 8b is the second plot, indicating the signal filtered using total variation filtering. FIG. 8c is the third plot, indicating the energy envelope of the noisy speech signal. FIG. 8d is the fourth plot, indicating the energy envelope of the TV filtered signal. The normalized energy envelope signals obtained for the input signal and the total variation filtered signal are shown in FIG. 8c and FIG. 8d, respectively. It can be noticed that the total variation filtering method provides better reduction of noise components. By using an optimal energy threshold parameter, the total variation filtered signal can provide significantly better detection rates, since the effect of time-varying noise is reduced significantly.
[0054] The experimental results on different noise types demonstrate that the total variation filtering technique can provide a solution to improve the robustness of the traditional features. The capabilities of the preferred total variation filtering technique can be observed from the energy envelopes extracted from the noisy signal and the total variation filtered signal.
[0055] Further, the total variation filtering technique improves noise-reduction capability as compared to existing filtering techniques, even if the input signal is a mixture of different background noise sources at varying amplitude levels, low-frequency voiced speech portions, and unvoiced portions, which often reduce the detection rates of most voice activity detection systems based on prior-art techniques. The main advantage of using the total variation smoothing filter is that it preserves speech properties of interest in a different manner than conventional filtering techniques used for suppressing noise components.
[0056] FIG. 9 illustrates a flow diagram explaining the process of determining silent/non-silent frame classification (SNFC), in accordance with various embodiments of the present invention. As depicted in flow diagram 900, the SNFC module 208 receives (901) the total variation filtered signal frame from the SFD module 205 and computes (902) the temporal features for the signal frame. Then, the SNFC module 208 compares (903) the features to pre-determined thresholds. Based on instructions, the hierarchical decision-tree sends the signal frame, as an output, to the voice/non-voice signal frame classification (VNFC) module 207 only when the feature information fully satisfies the logical statements with pre-determined thresholds. Otherwise, the decision tree considers (904) a signal frame to be a silent frame when the feature information fails to satisfy the logical statements with pre-determined thresholds. In this scenario, the SNFC module 208 generates (905) binary-flag information, which is produced by assignment statements of logical expressions or by if-then statements.
[0057] The SNFC module 208 comprises means for receiving total variation filtered signal frames; means for extracting temporal feature information from each signal frame; means for determining silent signal frames by comparing extracted feature information to pre-determined thresholds; means for determining binary-flag information (1: non-silent signal frame and 0: silent signal frame); and means for generating action variable information to send the signal block either to the voice/non-voice classification module or to the binary-flag storing, merging and deletion module. The main objective of the SNFC module 208 is to reduce computational cost and power consumption, since a silent portion frequently occurs between voice signal portions. Further, the SNFC module 208 with total variation filter feature information provides better discrimination of silent signal frames from non-silent signal frames.
[0058] The binary-flag information may include binary values of 0 (false statement) and 1 (true statement). Further, the decision tree of the HDT sends binary-flag information of value 0, as an output, to the binary-flag storing, merging and deletion (BSMD) module without sending the signal frame to the voice/non-voice signal frame classification (VNFC) module 207 for further signal processing. Otherwise, the input signal frame is further processed at the VNFC module 207 only when the feature information extracted from the input signal frame is equal to or greater than the pre-determined thresholds. The various actions in method 900 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 9 may be omitted.
[0059] FIG. 10 illustrates a graph which indicates experimental results of the SNFC module, in accordance with various embodiments of the present invention. Experimental results show that the total variation filtering method provides an enhanced energy feature for reducing the computational load of further processing systems by eliminating signal frames in silent regions without substantially missing the speech regions. As depicted in the graph, signal frames in silent regions are marked with a magnitude of zero in the third plot.
[0060] FIG. 11 illustrates a graph which indicates experimental results of the SNFC module, in accordance with various embodiments of the present invention. The total variation filtered signal for the input speech signal corrupted by applause sound is shown in the second plot. The output of the SNFC module is shown in the third plot. It can be noticed that the total variation filtering method significantly reduces the effect of applause sounds without distorting the shape of the envelope and the essential features used for detecting voiced speech regions. The result shows that the SNFC module can decrease the computational load by discarding the signal frames which have a very low energy value.
[0061] In FIG. 10 and FIG. 11, the first graphical plot of each figure represents the noisy speech signal corrupted by train noise and applause, respectively, wherein the x-axis represents the sample number while the y-axis represents the amplitude of the discrete sample. The second graphical plot represents the signal filtered by using the total variation filter. The third graphical plot represents the thresholded energy envelope obtained by combining the results of all the signal frames. The experiment shows that the total variation filtering method provides an energy feature for reducing the computational load of further processing systems by eliminating signal frames in silent regions without substantially missing the speech regions. The signal frames in silent regions are marked with a magnitude of zero in the third plot.
[0062] FIG. 12 illustrates a flow diagram explaining the voice/non-voice signal frame classification (VNFC), in accordance with various embodiments of the present invention. As depicted in flow diagram 1200, the VNFC module 207 receives (1201) a non-silent signal frame from the SNFC module 208. The VNFC module 207 computes (1202) the normalized one-sided autocorrelation (AC) sequence of the non-silent signal frame. Further, the VNFC module 207 computes (1203) feature parameters such as the lag index of the first zero crossing point, the zero crossing rate, the lag index of the minimum point, and the amplitude of the minimum point for a predefined lag range of the autocorrelation sequence. Then the VNFC module 207 compares (1204) the features to pre-determined thresholds. The VNFC module 207 computes (1205) feature parameters for a predefined lag range of the autocorrelation sequence. Further, the VNFC module 207 generates (1206) binary-flag information which is sent to the BSMD module. The VNFC module 207 compares (1207) the features to pre-determined thresholds. Further, the VNFC module 207 initially generates (1208) binary-flag 1 information and then generates (1209) binary-flag 0 information which is sent to the BSMD module.
[0063] The VNFC module 207 includes means for receiving a non-silent signal frame from the signal frame classification module; means for computing the normalized one-sided autocorrelation of the non-silent signal frame; means for extracting autocorrelation feature information; means for determining voice signal frames and non-voice signal frames based on the extracted total variation residual and autocorrelation features by comparing the features to pre-determined thresholds; and means for generating action variable information to send the voice signal frame to the binary-flag storing, merging and deletion module and to control the voice activity detection process. The VNFC module 207 classifies an input non-silent signal frame into a voice signal frame or a non-voice signal frame. Based on the classification results, more specifically, the VNFC module generates binary-flag information (binary-flag 0 for a non-voice signal frame and binary-flag 1 for a voice signal frame) to determine the endpoint of the voice signal activity portion.
[0064] The VNFC module 207 includes three major methods: autocorrelation computation, feature extraction, and decision. The classification method is implemented using a multi-stage hierarchical decision-tree (HDT) scheme with pre-determined thresholds. The flowchart configuration of the multi-stage HDT can be redesigned according to the computational complexity and memory space involved in extracting feature parameters from the autocorrelation sequence of the non-silent signal frame.
[0065] In an embodiment, the VNFC module 207 first receives a non-silent signal frame with a pre-determined number of signal samples. The VNFC module then computes the normalized one-sided autocorrelation of the non-silent signal frame, represented as d[n]. The autocorrelation of the signal frame d[n] with a length of N samples is computed as:
$$ r[k] \;=\; \frac{\sum_{n=0}^{N-1-k} d[n]\, d[n+k]}{\sum_{n=0}^{N-1} d[n]^{2}}, \qquad k = 0, 1, \ldots, N-1 \tag{1} $$

where
r[k] denotes the normalized autocorrelation sequence, and
k denotes the lag of the autocorrelation sequence.
[0066] The feature information from the autocorrelation sequence is used to characterize signal frames. The periodicity feature of the autocorrelation may provide temporal and spectral characteristics of the signal to be processed. For example, periodicity in the autocorrelation sequence indicates that the signal is periodic. The autocorrelation function falls to zero for highly non-stationary signals. The voiced speech sound is periodically correlated, while other background sounds from noise sources are not (or are uncorrelated). If a frame of a voiced sound signal is periodically correlated (or quasi-periodic), its autocorrelation function has the maximum peak value at the location of the pitch period of the voiced sound. In general, the autocorrelation function demonstrates the maximum peak within the lag value range corresponding to the expected pitch periods of 2 to 20 ms for voiced sounds. Conventional voice activity detection assumes that voiced speech may have a higher maximum autocorrelation peak value than the background noise frames. In an embodiment, the maximum autocorrelation peak value may be diminished, and the autocorrelation lag of the maximum peak may deviate from the pre-determined threshold range, due to phoneme variations and different background sources including applause, laughter, car, train, crowd cheer, babble, thermal noise and so on. The feature parameters that are extracted from the autocorrelation of the total variation filtered signal can increase the robustness of the VAD process.
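A sketch of the autocorrelation computation of Eq. (1) and the maximum-peak search over the expected pitch-period range of 2 to 20 ms described above; fs is the sampling frequency, and the frame is assumed long enough to cover a 20 ms lag:

```python
import numpy as np

def normalized_autocorrelation(d: np.ndarray) -> np.ndarray:
    """Normalized one-sided autocorrelation of a frame d[n], per Eq. (1);
    r[0] = 1 by construction (frame assumed non-zero)."""
    full = np.correlate(d, d, mode="full")
    r = full[len(d) - 1:]      # keep one-sided lags k = 0 .. N-1
    return r / r[0]

def pitch_peak(r: np.ndarray, fs: int):
    """Maximum ACF peak within the expected pitch-period range of 2-20 ms."""
    lo, hi = int(0.002 * fs), int(0.020 * fs)
    k = lo + int(np.argmax(r[lo:hi + 1]))
    return r[k], k             # (amplitude, lag index) of the maximum peak
```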
[0067] Further, the VNFC module 207 extracts the feature information comprising the autocorrelation lag index (or time lag) of the first zero crossing point of the autocorrelation function, the lag index of the minimum point of the autocorrelation function, the amplitude of the minimum point of the autocorrelation function, the lag indices of local maxima points of the autocorrelation function, the amplitudes of local maxima points, and decaying energy ratios. The extraction of feature information is done in a sequential manner according to the heuristic decision rules followed in the preferred HDT scheme.
[0068] The lag index (or time lag) of the first zero crossing point is used to characterize frames with highly non-stationary noises (or transients). From various experimental results, it is noted that the lag index of the first zero crossing point of the autocorrelation sequence is less than a lag value of 4 for several types of noises.
[0069] The proposed method uses the lag index of the first zero crossing point feature to detect the noise frames. For a given autocorrelation sequence with a pre-determined number of coefficients, the first zero crossing point is described as:

$$ \mathrm{fzcp}_{1} \;=\; \min\,\{\, m \;:\; r[m] \le 0,\;\; 1 \le m \le UL_{1} \,\} $$

where
fzcp1 is the lag index of the first zero crossing point,
m denotes the autocorrelation lag index variable, and
UL1 denotes the upper lag index value.
[0070] The proposed method performs the determination of the lag index of the first zero crossing point within a new autocorrelation sequence constructed with a pre-determined number of autocorrelation values. Thus, the proposed method may reduce the computational cost of the feature extraction by examining only a few autocorrelation sequence values. In addition, the power consumption, computational load and memory consumption may be drastically reduced when a particular type of noise constantly occurs.
[0071] For a given pre-determined range of the autocorrelation sequence, the lag index and amplitude of the minimum peak are computed as:

$$ r_{\mathrm{min\_amp}} = \min_{LL_{2} \le m \le UL_{2}} r[m], \qquad r_{\mathrm{min\_lag}} = \operatorname*{arg\,min}_{LL_{2} \le m \le UL_{2}} r[m] $$

where
r_min_amp is the minimum amplitude and r_min_lag is its lag index,
m is the autocorrelation lag variable,
LL2 denotes the lower lag index value, and
UL2 denotes the upper lag index value.
[0072] In an embodiment, the lag index and amplitude of the minimum peak features are extracted from the autocorrelation sequence within a pre-determined lag interval. These features are used to identify some types of noise signals having periodic structure components.
[0073] The proposed method includes extraction of the lag index and amplitude of the maximum peak of the autocorrelation sequence within a pre-determined lag interval. These features are used to represent a voiced speech sound frame. The pre-determined lag and maximum peak thresholds are used to distinguish voiced sound from other background sounds. For a given pre-determined range of autocorrelation coefficients, the lag index and amplitude of the maximum peak are computed as:

$$ r_{\mathrm{max\_amp}} = \max_{LL_{3} \le m \le UL_{3}} r[m], \qquad r_{\mathrm{max\_lag}} = \operatorname*{arg\,max}_{LL_{3} \le m \le UL_{3}} r[m] $$

where
r_max_amp is the maximum amplitude, r_max_lag is its lag index, and [LL3, UL3] is the pre-determined lag interval.
[0074] The proposed method utilizes the peak amplitude and its lag index information for reducing the computational cost of the VAD system by eliminating highly non-stationary noise frames having different noise levels. In order to reduce the number of noise frame detections, the proposed method uses decaying energy ratios.
[0075] In certain implementations, the feature extraction method computes decaying energy ratios by dividing the autocorrelation sequence into unequal blocks. For a given block of the autocorrelation sequence, the autocorrelation energy decaying ratio (t) is computed as:

$$ t_{i} \;=\; \frac{\sum_{k=L_{i}}^{U_{i}} r^{2}[k]}{\sum_{k=0}^{N-1} r^{2}[k]} $$

where
t_i denotes the ith decaying energy ratio computed for the autocorrelation lag index ranging from L_i to U_i,
N denotes the total number of autocorrelation coefficients, and
k denotes the autocorrelation lag variable.
[0076] Further, the decaying energy ratios lie between 0 and 1 and are representative features for distinguishing voiced sounds from background sounds and noises. In most sound frames, the decaying energy ratios computed in the autocorrelation domain in the way described above demonstrate high robustness against a wide variety of background sounds and noises. In addition, the decaying energy ratios are computed in a computationally efficient manner.
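The remaining autocorrelation features of paragraphs [0069] to [0076] can be sketched as follows. The lag bounds UL1, LL2 and UL2 and the block boundaries Li/Ui are pre-determined parameters whose numeric values the document does not fix, and the squared-magnitude form of the energy ratio is an assumption consistent with the term "energy":

```python
import numpy as np

def first_zero_crossing(r: np.ndarray, ul1: int) -> int:
    """Lag index of the first zero crossing of the ACF (fzcp1), searched
    only up to the upper lag index UL1 to limit computation."""
    for m in range(1, ul1 + 1):
        if r[m] <= 0.0:
            return m
    return ul1                 # no crossing found within the search range

def min_peak(r: np.ndarray, ll2: int, ul2: int):
    """Amplitude and lag index of the minimum ACF value within [LL2, UL2]."""
    k = ll2 + int(np.argmin(r[ll2:ul2 + 1]))
    return r[k], k

def decaying_energy_ratios(r: np.ndarray, bounds) -> list:
    """Decaying energy ratios t_i over unequal lag blocks [L_i, U_i];
    each ratio lies between 0 and 1."""
    total = float(np.sum(r ** 2))
    return [float(np.sum(r[lo:hi + 1] ** 2)) / total for lo, hi in bounds]
```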
[0077] In an embodiment, the method of constructing a decision tree has to take the computational cost of each feature into consideration. FIG. 13 illustrates a graph indicating patterns of the features extracted from the autocorrelation of the total variation filtered signal, in accordance with various embodiments of the present invention. FIG. 13a is the first plot, depicting the signal corrupted with train noise. FIG. 13b is the second plot, depicting the signal filtered using the total variation filter. FIG. 13c is the third plot, depicting the energy value of each signal frame. FIG. 13d is the fourth plot, depicting the decaying energy ratio value of the ACF of each signal frame. FIG. 13e is the fifth plot, depicting the maximum peak value of the ACF of each signal frame. FIG. 13f is the sixth plot, depicting the lag value of the maximum peak of the ACF of each frame. The graphical plots of feature patterns are shown to illustrate the effectiveness of the total variation autocorrelation feature information in distinguishing voice signal frames from non-voice signal frames.
[0078] Further, from FIG. 12, the VNFC module 207 includes configurable feature extraction methods that may extract feature parameters. The extracted feature parameters are used as the input to the internal decision statements or logical expressions described in accordance with the proposed method. The configuration of the feature extraction methods may be modified in different ways.
[0079] In the proposed method, each feature extraction method receives the autocorrelation sequence with a pre-determined number of autocorrelation coefficient values. The feature extraction method processes the input data according to the action variable information. Finally, the VNFC module 207 of the proposed method generates binary-flag information (binary-flag 0 for a non-voice signal frame, and binary-flag 1 for a voice signal frame) and sends the flag information to the binary-flag storing, merging, and deletion (BSMD) module. The plots of feature patterns are shown for comprehensive understanding and to illustrate the effectiveness of the total variation autocorrelation feature information in distinguishing voice signal frames from non-voice signal frames.
[0080] FIG. 14 illustrates a flow diagram 1400 explaining the process of binary-flag merging and deletion (BSMD), in accordance with various embodiments of the present invention. The merging operation is also referred to as insertion (or inclusion or addition). Referring to FIG. 14, the BSMD module 210 processes the binary-flag sequence generated for each non-silent signal block. The binary-flag sequence comprises binary-flag 1 and binary-flag 0 values for detected voice signal frames and non-voice signal frames, respectively. At step 1401, the binary-flag sequence information is received, and at step 1402, the locations of positive transitions (0 to 1) and negative transitions (1 to 0) in the input binary sequences are found. Further, at step 1403, the differences of the locations are calculated and compared with the pre-determined duration threshold. At step 1404, a binary block of 0s is replaced with a binary block of 1s. This happens when the current binary block occurs between long series of 1s and is also located in the binary block mask obtained from the energy envelope of the total variation filtered signal. Step 1404 can also replace a binary block of 1s with a binary block of 0s, that is, when the current binary block occurs between long series of 0s and is also located in the binary block mask obtained from the energy envelope of the TV filtered signal.
[0081] Based on the overlapping frame concept in VAD, the total number of missed and false signal frame detections may be reduced by using information about the possible duration of voiced speech regions. Further, in certain embodiments, the proposed method employs the minimum voiced speech duration and the interval between two successive voice signal portions. In an embodiment, the VAD system determines the feature smoothing process, which can reduce the numbers of false and missed detections. In an embodiment, the VAD system can have options to configure the construction of embodiments depending on the application. The VAD mode can be triggered by manual or automatic selection by a user. In power saving mode, the VAD application may be disabled.
[0082] According to the proposed method, the method of merging replaces binary-flag 0 with binary-flag 1 when it identifies binary-flag 0 for signal frames within a pre-determined interval from the previous endpoint of a voiced speech portion. In another aspect, binary-flag 1 is replaced with binary-flag 0 when signal frames detected as voice signal frames lie within long runs of zeros on both the left and right sides and their total duration is less than the pre-determined duration threshold.
[0083] In certain embodiments, the binary-flag merging/deletion is performed by using a set of instructions that counts the total numbers of series of ones and zeros, and also continuously compares the count values with the pre-determined thresholds. From various experiments, it was noticed that the merging and deletion methods of the proposed method may provide significantly better endpoint detection results. The main objective of the preferred method of merging is to avoid the discontinuity effect that is introduced due to the elimination of a set of signal samples of a single spoken word during the voice and non-voice classification process.
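A simplified run-length sketch of the merging and deletion rules described above; the document's exact thresholds and its energy-envelope mask logic are omitted, so this is an assumption-laden illustration rather than the preferred implementation:

```python
from itertools import groupby

def merge_and_delete(flags, min_voice_frames: int, max_gap_frames: int):
    """Run-length post-processing of the binary-flag sequence (BSMD sketch).
    Merging fills short 0-runs sandwiched between 1-runs; deletion zeroes
    short isolated 1-runs. Thresholds are pre-determined parameters."""
    runs = [(v, len(list(g))) for v, g in groupby(flags)]
    # Merging: a short run of 0s between two runs of 1s becomes 1s.
    for i in range(1, len(runs) - 1):
        v, n = runs[i]
        if v == 0 and n <= max_gap_frames and runs[i - 1][0] == 1 and runs[i + 1][0] == 1:
            runs[i] = (1, n)
    # Re-group after merging, then delete short isolated 1-runs.
    flat = [v for v, n in runs for _ in range(n)]
    out = []
    for v, n in [(v, len(list(g))) for v, g in groupby(flat)]:
        out.extend([0] * n if (v == 1 and n < min_voice_frames) else [v] * n)
    return out
```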
[0084] The main objective of the preferred method of deletion is to remove short bursts of some types of sounds that are falsely detected. Further, the voice endpoint determination and correction (VEDC) is designed for accurately determining the endpoint (or boundary, or onset/offset) of a voice signal portion and correcting it using the feature information extracted from each sub-frame of the pre-determined signal samples. The various actions in flow diagram 1400 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 14 may be omitted.
[0085] FIG. 15 illustrates a graph which indicates outputs for speech corrupted by train noise, in accordance with various embodiments of the present invention. FIG. 15a is the first plot, depicting the signal corrupted with train noise. FIG. 15b is the second plot, depicting the signal filtered using the total variation filter; it demonstrates the performance of the preferred total variation filtering module 203. FIG. 15c is the third plot, depicting the energy value of each signal frame; it depicts the output of the SNFC module 208 with temporal feature information. FIG. 15d is the fourth plot, depicting the decaying energy ratio value of the ACF of each signal frame. FIG. 15e is the fifth plot, depicting the maximum peak value of the ACF of each signal frame. The outputs obtained by comparing the feature information with pre-determined thresholds are shown in FIG. 15d and FIG. 15e. FIG. 15f is the sixth plot, depicting the binary-flag sequence information; it is the voice/non-voice classification result obtained by comparing both the decaying energy ratio and the maximum peak values with pre-determined thresholds.
[0086] FIG. 16 illustrates a graph which indicates outputs for a clean speech signal, in accordance with various embodiments of the present invention. FIG. 16a is the first plot, depicting a clean signal. FIG. 16b is the second plot, depicting the signal filtered using the total variation filter; it demonstrates the performance of the preferred total variation filtering module. FIG. 16c is the third plot, depicting the energy value of each signal frame; it is the output of the SNFC module with temporal feature information. FIG. 16d is the fourth plot, depicting the decaying energy ratio value of the ACF of each signal frame. FIG. 16e is the fifth plot, depicting the maximum peak value of the ACF of each signal frame. The outputs obtained by comparing the feature information with pre-determined thresholds are shown in FIG. 16d and FIG. 16e. FIG. 16f is the sixth plot, depicting the binary-flag sequence information; it is the voice/non-voice classification result obtained by comparing both the decaying energy ratio and the maximum peak values with pre-determined thresholds.
[0087] FIG. 17 illustrates a flow diagram 1700 explaining the process of voice endpoint determination and correction (VEDC), in accordance with various embodiments of the present invention. The VEDC module 209 is designed for more accurately determining the endpoint (or boundary, or onset/offset) of a voice signal portion and correcting it using the feature information extracted from each sub-frame of the pre-determined signal samples. As depicted in FIG. 17, the VEDC module 209 initially receives (1701) endpoints (onset and offset) of voice signal portions in the input signal block, extracts (1702) samples from an onset (or offset) location of a voice signal portion, and divides them into small frames. Further, the VEDC module 209 calculates (1703) the frame energy and compares it with the pre-determined threshold. The VEDC module 209 finds (1704) a new endpoint (onset and offset) by removing the insignificant frames, and outputs (1705) the endpoint information determined from the input signal block.
[0088] The VEDC module 209 includes endpoint determination, signal framing, feature extraction and endpoint correction. The endpoints of all detected voiced signal portions are computed by processing the binary-flag sequence information and the pre-determined values of frame length and frame shift. Further, the VEDC module 209 provides endpoints in terms of either sample index number or sample time measured in milliseconds.
[0089] In an embodiment, the endpoint is corrected using a simple feature extraction and a thresholding rule. During correction, processing of the signal frame is performed with a pre-determined number of signal samples. The signal frame is extracted at the onset and offset of each voiced speech portion. During endpoint correction, the signal frame is first divided into non-overlapping small frames. Then the energy of each sub-frame is computed and finally compared with a pre-determined threshold. The proposed method may provide accurate determination of the endpoints of voiced signal portions when the recorded/received audio signal has the high signal-to-noise ratio that mostly occurs in many realistic environments. The various actions in method 1700 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 17 may be omitted.
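A minimal sketch of the onset-side correction just described, assuming non-overlapping sub-frames and a single energy threshold (both pre-determined parameters whose values the document does not fix); offset correction would mirror this from the right:

```python
import numpy as np

def correct_onset(x: np.ndarray, onset: int, sub_len: int, energy_thr: float) -> int:
    """Refine a detected onset by skipping leading low-energy sub-frames
    (VEDC sketch). Sub-frames below the threshold are insignificant and
    are removed from the voiced portion."""
    pos = onset
    while pos + sub_len <= len(x):
        if np.mean(x[pos:pos + sub_len] ** 2) >= energy_thr:
            break              # first significant sub-frame found
        pos += sub_len         # discard an insignificant sub-frame
    return pos                 # corrected onset sample index
```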
[0090] FIG. 18 illustrates a graph which indicates outputs of voice/non-voice classification module  the binary-flag merging/deletion (BSMD) module and the voice endpoint determination and correction module  in accordance with various embodiments of the present invention. FIG.18a is the first plot depicting the signal corrupted with train noise. FIG.18b is the second plot depicting the filtered signal using total variation filter. FIG. 18c is the third plot depicting binary flag sequence information. FIG. 18c shows the output of the voice/non-voice classification module. FIG. 18d is the fourth plot depicting the binary sequence after merging  deletion and correction. FIG. 18d shows the output of the binary-flag merging/deletion module. FIG. 18e is the fifth plot depicting the detected endpoints using the preferred VAD system. FIG. 18e demonstrates the output of the voice endpoint determination and correction module. The endpoints of a voice signal portions are marked as circles.
[0091] FIG. 19 illustrates a graph which indicates outputs of the voice/non-voice classification module, the binary-flag merging/deletion (BSMD) module, and the voice endpoint determination and correction module, in accordance with various embodiments of the present invention. FIG. 19a is the first plot, depicting a clean signal. FIG. 19b is the second plot, depicting the signal filtered using the total variation filter. FIG. 19c is the third plot, depicting the binary flag sequence information, i.e., the output of the voice/non-voice classification module. FIG. 19d is the fourth plot, depicting the binary sequence after merging, deletion, and correction, i.e., the output of the binary-flag merging/deletion module. FIG. 19e is the fifth plot, depicting the endpoints detected using the VAD system, i.e., the output of the voice endpoint determination and correction module. The endpoints of the voice signal portions are marked as circles.
[0092] FIG. 20 illustrates a graph which indicates outputs of the voice/non-voice classification module, the binary-flag merging/deletion (BSMD) module, and the voice endpoint determination and correction module, in accordance with various embodiments of the present invention. FIG. 20a is the first plot, depicting a signal corrupted with airport noise. FIG. 20b is the second plot, depicting the signal filtered using the total variation filter. FIG. 20c is the third plot, depicting the binary flag sequence information, i.e., the output of the voice/non-voice classification module. FIG. 20d is the fourth plot, depicting the binary sequence after merging, deletion, and correction, i.e., the output of the binary-flag merging/deletion module. FIG. 20e is the fifth plot, depicting the endpoints detected using the VAD system, i.e., the output of the voice endpoint determination and correction module. The endpoints of the voice signal portions are marked as circles.
[0093] FIGS. 18-20 are graphical plots illustrating the outputs of the voice/non-voice classification module, the binary-flag merging/deletion module, and the voice endpoint determination and correction module, in accordance with embodiments of the present invention, for a speech signal corrupted by train noise, a clean speech signal, and a speech signal corrupted by airport noise, respectively. In each figure, the third plot is the output of the voice/non-voice classification module, the fourth plot is the output of the binary-flag merging/deletion module, and the fifth plot demonstrates the output of the voice endpoint determination and correction module, with the endpoints of the voice signal portions marked as circles. Further, in some simulations, the overall performance of the preferred voice activity detection apparatus is evaluated using different speech signals corrupted by different types of noises, such as airport, babble, car, train, exhibition, station, applause, laughter, AC, computer hardware, fan, and white noise, at varying noise levels. Experimental studies show that the techniques and configurations of the proposed method for determining the endpoints of voice signal portions in an audio signal overcome shortcomings of existing techniques.
[0094] FIG. 21 illustrates a computing environment implementing the application, in accordance with various embodiments of the present invention. As depicted, the computing environment includes at least one processing unit that is equipped with a control unit and an arithmetic logic unit (ALU), a memory, a storage unit, a plurality of networking devices, and a plurality of input/output (I/O) devices. The processing unit is responsible for processing the instructions of the algorithm; it receives commands from the control unit in order to perform its processing. Further, any logical and arithmetic operations involved in the execution of the instructions are computed with the help of the ALU.
[0095] The overall computing environment can be composed of multiple homogeneous and/or heterogeneous cores, multiple CPUs of different kinds, and special media and other accelerators. Further, the plurality of processing units may be located on a single chip or over multiple chips.
[0096] The algorithm, comprising the instructions and code required for the implementation, is stored in the memory unit, the storage unit, or both. At the time of execution, the instructions may be fetched from the corresponding memory and/or storage and executed by the processing unit.
[0097] In the case of hardware implementations, various networking devices or external I/O devices may be connected to the computing environment to support the implementation through the networking unit and the I/O device unit.
[0098] The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the elements. The elements shown in FIGS. 1 and 2 include blocks which can be at least one of a hardware device, or a combination of a hardware device and a software module.
[0099] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt such specific embodiments for various applications without departing from the generic concept, and, therefore, such adaptations and modifications should be and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.

STATEMENT OF CLAIMS
We claim:
1. A method for voice activity detection (VAD) in adverse environmental conditions, the method comprising:
receiving an input signal from at least one source;
classifying said input signal into at least one of a silent and a non-silent signal block by comparing temporal feature information;
sending said at least one of the silent or non-silent signal block to at least one of a voice endpoint storing (VES) module or a total variation filtering module by comparing said temporal feature information to a plurality of pre-determined thresholds;
determining endpoint information of at least one of a voice signal or a non-voice signal;
employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions;
determining a noise floor in said total variation filtered signal domain;
determining feature information in the autocorrelation of said total variation filtered signal sequence;
determining binary-flag merging and deletion (BSMD) based on a pre-determined duration threshold on said determined feature information by a BSMD module;
determining voice endpoint correction based on said short-term temporal feature information after said determined binary-flag merging and deletion; and
outputting said input signal with said voice endpoint information.
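Read as an engineering recipe rather than as claimed subject matter, the method above amounts to the following processing skeleton. This is a sketch under stated assumptions: the thresholds and framing values are illustrative, skimage's denoise_tv_chambolle stands in for the total variation filter, and voice_flag, merge_delete, flags_to_endpoints, and correct_onset are the sketches given elsewhere in this document (merge_delete appears after claim 16).

    import numpy as np
    from skimage.restoration import denoise_tv_chambolle   # stand-in TV filter

    def vad_block(block, fs, silence_thr=1e-5, frame_len=200, frame_shift=80):
        if np.mean(block ** 2) < silence_thr:               # silent/non-silent block
            return []                                       # silent block: no endpoints
        filt = denoise_tv_chambolle(block, weight=0.1)      # total variation filtering
        flags = []
        for s in range(0, len(filt) - frame_len + 1, frame_shift):
            frame = filt[s:s + frame_len]
            if np.mean(frame ** 2) < silence_thr:           # silent/non-silent frame
                flags.append(0)
            else:
                flags.append(voice_flag(frame, fs))         # ACF-based classification
        flags = merge_delete(flags)                         # BSMD stage
        eps = flags_to_endpoints(flags, frame_len, frame_shift, fs)
        return [(correct_onset(block, on), off)             # endpoint correction (onsets)
                for on, off, _, _ in eps]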
2. The method as in claim 1, wherein said temporal feature information comprises at least one of: energy, zerocrossing rate, and an energy envelope that represents the varying nature of the audio signal.
3. The method as in claim 1, wherein said method further comprises using the lag of the first zerocrossing point of the autocorrelation sequence for detecting at least one of noise transients and white Gaussian noise frames.
4. The method as in claim 1, wherein said method determines feature information, and wherein said feature information comprises at least one of: decaying energy ratios, amplitude, lag of minimum peak amplitude, lag of maximum peak, and zerocrossing rate from said autocorrelation of said signal for discriminating said voice signals from said non-voice signals.
5. The method as in claim 4, wherein said method determines decaying energy ratios from said autocorrelation sequence to provide accurate characterization of at least one of said voice signal and other background sounds.
6. The method as in claim 1, wherein said sending further comprises receiving a signal block from a signal block division module and computing temporal features for said signal block.
7. The method as in claim 1, wherein said method further comprises estimating a noise floor from at least one of the total variation residual and said total variation filtered signal envelope, which provides discrimination of said voice signal from said non-voice signal in said input signal.
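A minimal sketch of the noise-floor estimate of claim 7 follows; the robust median estimator is an assumed choice, since the claim only states that the total variation residual or the filtered-signal envelope is used.

    import numpy as np

    def noise_floor(block, filtered):
        residual = block - filtered                         # total variation residual
        return float(np.median(np.abs(residual)) / 0.6745)  # robust noise-level estimate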
8. The method as in claim 1, wherein said method further comprises performing sampling rate conversion on said received input signal depending on the voice processing application.
9. The method as in claim 1, wherein said method further comprises:
receiving a total variation filtered signal frame from a signal frame division module;
computing said temporal feature information for said signal frame;
comparing said feature information to pre-determined thresholds;
sending said non-silent signal frame to a voice/non-voice frame classification (VNFC) module;
generating binary flag 0 information; and
sending said binary flag 0 information to said BSMD module.
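Sketched in code, the per-frame flow of claim 9 reduces to an energy gate in front of the VNFC stage; the threshold value and the reuse of voice_flag from the earlier sketch are assumptions.

    import numpy as np

    def snfc(frame, fs, energy_thr=1e-5):
        if np.mean(np.asarray(frame) ** 2) < energy_thr:    # silent frame
            return 0                                        # binary flag 0 to BSMD
        return voice_flag(frame, fs)                        # non-silent: VNFC decides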
10. The method as in claim 1, wherein said sending further comprises extracting feature information from said input signal by a hierarchical decision-tree (HDT).
11. The method as in claim 10, wherein said HDT sends at least one of the silent or non-silent signal to at least one of the voice endpoint storing (VES) module or the total variation filtering module by comparing said temporal features to the pre-determined threshold.
12. A system for voice activity detection (VAD) in adverse environmental conditions, wherein said system is configured for:
receiving an input signal from at least one source;
classifying said input signal into at least one of a silent or non-silent signal block by comparing temporal feature information;
sending said at least one of the silent or non-silent signal block to at least one of a voice endpoint storing (VES) module or a total variation filtering module by comparing said temporal feature information to the pre-determined thresholds;
determining endpoint information of at least one of a voice signal or a non-voice signal;
employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions;
determining a noise floor in said total variation filtered signal domain;
determining feature information in the autocorrelation of said total variation filtered signal sequence;
determining binary-flag merging and deletion (BSMD) based on the pre-determined duration threshold on said determined feature information;
determining voice endpoint correction based on the short-term temporal feature information after said determined binary-flag merging and deletion; and
outputting said input signal with said voice endpoint information.
13. The system as in claim 12, wherein said system comprises a VNFC module that is configured for:
receiving said non-silent signal frame from a silent/non-silent frame classification (SNFC) module;
computing a normalized single-sided autocorrelation sequence of said non-silent signal frame;
computing feature parameters for a predefined lag range of said autocorrelation sequence;
comparing said features to said pre-determined threshold;
generating said binary flag 0 information, which is sent to said BSMD module;
computing feature parameters for a predefined lag range of said autocorrelation sequence based on said comparison;
comparing said features to said pre-determined threshold;
generating at least one of binary flag 1 or binary flag 0; and
sending said generated binary flag sequence information to said BSMD module.
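For illustration, the normalized single-sided autocorrelation and the feature parameters recited in claims 13-14 might be computed as below; the parameter names and the lag-range handling are assumptions of the sketch.

    import numpy as np

    def vnfc_parameters(frame, lag_lo, lag_hi):
        x = frame - np.mean(frame)
        acf = np.correlate(x, x, mode='full')[len(x) - 1:]  # single-sided ACF
        acf = acf / (acf[0] + 1e-12)                        # normalized to acf[0] = 1
        sign = np.signbit(acf).astype(np.int8)
        zc = np.flatnonzero(np.diff(sign))                  # lags where the sign changes
        seg = acf[lag_lo:lag_hi]                            # predefined lag range
        return {
            'first_zc_lag': int(zc[0]) if zc.size else -1,  # lag of first zerocrossing
            'zc_rate': zc.size / len(acf),                  # zerocrossing rate of the ACF
            'min_lag': int(lag_lo + np.argmin(seg)),        # lag index of minimum point
            'min_amp': float(np.min(seg)),                  # amplitude of minimum point
            'max_lag': int(lag_lo + np.argmax(seg)),        # lag index of maximum point
            'max_amp': float(np.max(seg)),                  # amplitude of maximum point
        }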
14. The system as in claim 13, wherein said parameters comprise at least one of the lag index of the first zerocrossing point, zerocrossing rate, lag index of the minimum point, amplitude of the minimum point, lag index of the maximum point, amplitude of the maximum point, and decaying energy ratios.
15. The system as in claim 12, wherein said BSMD module is configured for:
receiving said binary flag sequence information;
finding locations of positive and negative transitions in said received binary flag sequence;
calculating the difference in said locations; and
comparing said difference with said pre-determined threshold.
16. The system as in claim 12, wherein said BSMD module is configured to perform at least one of: replacing a binary block of 0s with a binary block of 1s, and replacing a binary block of 1s with a binary block of 0s, after said comparing.
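A minimal sketch of the BSMD behaviour of claims 15-16: transitions are located, each run's duration is compared with the pre-determined duration threshold, and short runs are flipped, which merges short 0-blocks into surrounding speech and deletes short 1-blocks. The threshold value min_run is illustrative, and the sketch flips boundary runs as well, which may or may not match the intended behaviour.

    import numpy as np

    def merge_delete(flags, min_run=3):
        f = np.asarray(flags, dtype=int)
        edges = np.flatnonzero(np.diff(f)) + 1              # transition locations
        bounds = np.concatenate(([0], edges, [len(f)]))     # boundaries of the runs
        for a, b in zip(bounds[:-1], bounds[1:]):
            if b - a < min_run:                             # run shorter than threshold
                f[a:b] = 1 - f[a:b]                         # replace 0-block with 1-block
        return f.tolist()                                   #   (or 1-block with 0-block)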
17. An apparatus for voice activity detection in adverse environmental conditions, said apparatus comprising:
an integrated circuit further comprising at least one processor;
at least one memory having computer program code within said circuit;
said at least one memory and said computer program code configured to, with said at least one processor, cause said apparatus to:
receive an input signal from at least one source;
classify said input signal into at least one of a silent or non-silent signal block by comparing temporal feature information;
send said at least one of the silent or non-silent signal block to at least one of a voice endpoint storing (VES) module or a total variation filtering module by comparing said temporal feature information to the pre-determined thresholds;
determine endpoint information of at least one of a voice signal or a non-voice signal by said at least one of the VES module or the total variation filtering module;
employ total variation filtering by said total variation filtering module for enhancing speech features and suppressing noise levels in non-speech portions;
determine a noise floor in said total variation filtered signal domain;
determine feature information in the autocorrelation of said total variation (TV) filtered signal sequence;
determine binary-flag merging and deletion (BSMD) based on the pre-determined duration threshold on said determined feature information by a BSMD module;
determine voice endpoint correction based on the short-term temporal feature information after said determined binary-flag merging and deletion; and
output said input signal with said voice endpoint information.
18. The apparatus as in claim 17, wherein said apparatus is configured to extract said temporal features from said input signal by a signal block division (SBD) module.
19. The apparatus as in claim 17, wherein said apparatus is configured to send the silent signal or non-silent signal by extracting feature information from said input signal using a hierarchical decision-tree (HDT) in a silent/non-silent block classification (SNBC) module.
20. The apparatus as in claim 17, wherein said apparatus is configured to send at least one of the silent or non-silent signal to at least one of the voice endpoint storing (VES) module or the filtering module by comparing said temporal features to the pre-determined threshold.
21. The apparatus as in claim 17, wherein said apparatus is configured to output said input signal with said voice endpoint information after correcting said endpoint information by a voice endpoint determination and correction (VEDC) module.
22. The apparatus as in claim 17, wherein said apparatus is configured for receiving audio data from at least one of a data acquisition module, audio communication, a storage device, and compressive sensing devices.
23. The apparatus as in claim 17, wherein said apparatus is configured for using said total variation filtering to enhance said voice features and suppress noise levels in said non-voice signal.
24. The apparatus as in claim 17, wherein said apparatus is configured for preventing pitch doubling and pitch halving errors.
25. The apparatus as in claim 17, wherein said apparatus is configured for triggering said VAD in at least one of a manual or automatic mode selected by a user.

Dated: 5th day of September, 2012    Signature:
Dr Kalyan Chakravarthy
Patent agent



Documents

Application Documents

# Name Date
1 Form-5.pdf 2012-09-13
2 Form-3.pdf 2012-09-13
3 Form-1.pdf 2012-09-13
4 Drawings.pdf 2012-09-13
5 2761-del-2012-Correspondence-others (14-09-2012).pdf 2012-09-14
6 SEL_New POA_ipmetrix.pdf 2014-10-07
7 FORM 13-change of POA - Attroney.pdf 2014-10-07
8 2761-DEL-2012-FER.pdf 2018-10-30
9 2761-DEL-2012-FORM 3 [26-02-2019(online)].pdf 2019-02-26
10 2761-DEL-2012-FORM 3 [26-02-2019(online)]-1.pdf 2019-02-26
11 2761-DEL-2012-FER_SER_REPLY [28-04-2019(online)].pdf 2019-04-28
12 2761-DEL-2012-ASSIGNMENT DOCUMENTS [10-10-2019(online)].pdf 2019-10-10
13 2761-DEL-2012-8(i)-Substitution-Change Of Applicant - Form 6 [10-10-2019(online)].pdf 2019-10-10
14 2761-DEL-2012-FORM-26 [11-10-2019(online)].pdf 2019-10-11
15 2761-DEL-2012-Proof of Right (MANDATORY) [29-11-2019(online)].pdf 2019-11-29
16 2761-DEL-2012-FORM-26 [27-11-2020(online)].pdf 2020-11-27
17 2761-DEL-2012-Correspondence to notify the Controller [27-11-2020(online)].pdf 2020-11-27
18 2761-DEL-2012-Written submissions and relevant documents [16-12-2020(online)].pdf 2020-12-16
19 2761-DEL-2012-RELEVANT DOCUMENTS [16-12-2020(online)].pdf 2020-12-16
20 2761-DEL-2012-RELEVANT DOCUMENTS [16-12-2020(online)]-1.pdf 2020-12-16
21 2761-DEL-2012-Proof of Right [16-12-2020(online)].pdf 2020-12-16
22 2761-DEL-2012-PETITION UNDER RULE 137 [16-12-2020(online)].pdf 2020-12-16
23 2761-DEL-2012-PETITION UNDER RULE 137 [16-12-2020(online)]-1.pdf 2020-12-16
24 2761-DEL-2012-FORM-26 [16-12-2020(online)].pdf 2020-12-16
25 2761-DEL-2012-FORM-26 [16-12-2020(online)]-1.pdf 2020-12-16
26 2761-DEL-2012-FORM 3 [16-12-2020(online)].pdf 2020-12-16
27 2761-DEL-2012-ENDORSEMENT BY INVENTORS [16-12-2020(online)].pdf 2020-12-16
28 2761-DEL-2012-Annexure [16-12-2020(online)].pdf 2020-12-16
29 2761-DEL-2012-PatentCertificate25-03-2021.pdf 2021-03-25
30 2761-DEL-2012-IntimationOfGrant25-03-2021.pdf 2021-03-25
31 2761-DEL-2012-US(14)-HearingNotice-(HearingDate-02-12-2020).pdf 2021-10-17
32 2761-DEL-2012-RELEVANT DOCUMENTS [24-08-2022(online)].pdf 2022-08-24
33 2761-DEL-2012-FORM 4 [06-09-2022(online)].pdf 2022-09-06
34 2761-DEL-2012-PROOF OF ALTERATION [17-01-2024(online)].pdf 2024-01-17

Search Strategy

1 2761searchstrategy_04-07-2018.pdf

ERegister / Renewals

3rd: 28 Apr 2021 (From 05/09/2014 To 05/09/2015)
4th: 28 Apr 2021 (From 05/09/2015 To 05/09/2016)
5th: 28 Apr 2021 (From 05/09/2016 To 05/09/2017)
6th: 28 Apr 2021 (From 05/09/2017 To 05/09/2018)
7th: 28 Apr 2021 (From 05/09/2018 To 05/09/2019)
8th: 28 Apr 2021 (From 05/09/2019 To 05/09/2020)
9th: 28 Apr 2021 (From 05/09/2020 To 05/09/2021)
10th: 01 Sep 2021 (From 05/09/2021 To 05/09/2022)
11th: 09 Sep 2022 (From 05/09/2022 To 05/09/2023)
12th: 04 Sep 2023 (From 05/09/2023 To 05/09/2024)
13th: 03 Sep 2024 (From 05/09/2024 To 05/09/2025)
14th: 04 Sep 2025 (From 05/09/2025 To 05/09/2026)