Speech enhancement using modified Wiener filter-based MMSE and speech presence probability estimation

ABSTRACT

estimated and reduced to improve the speech quality. During the last decade, various techniques have been presented for speech signal enhancement. Broadly, these techniques are categorized as single-channel and multi-channel speech enhancement. Usually, background noise features can be identified using single-channel speech enhancement, and the effects of reverberation can be reduced through multi-channel speech enhancement. Although multi-channel schemes show significantly better performance than single-channel schemes, the enhancement has to be done for each microphone individually. Hence this work is confined to single-channel speech enhancement.
A speech enhancement algorithm aims to restore the degraded input or output signal and to increase the performance of the communication link. A degraded speech signal causes serious problems, mainly for speech recognition applications. The quality and intelligibility of speech are degraded by the noise present in the speech signal. The term intelligibility denotes how well the final output of the speech signal can be understood. The accuracy with which the exact content of the speech signal is reproduced is termed quality. Speech enhancement algorithms used to reduce noise can be adaptive or non-adaptive, and can operate in the frequency or time domain [4]. Over recent years, several techniques have been presented for speech signal enhancement, e.g. spectral subtraction, wavelet-based methods, model-based techniques and filtering techniques [5,6]. Furthermore, speech enhancement techniques can be categorized as spectral and temporal processing techniques. In spectral processing, the corrupted signal is processed in a transform domain, whereas temporal processing uses time-domain analysis to improve the quality of the speech signal.
The rest of the article is organized as follows: Section 2 presents a brief literature survey and recent advancements in speech enhancement techniques. Section 3 presents the proposed speech enhancement system model. Results are analyzed and discussed in Section 4. Finally, concluding remarks are presented in Section 5.

LITERATURE SURVEY
Conventional multiband speech enhancement involves two operations: splitting the spectrum into frequency bands, and executing speech enhancement in each band independently. The pole-interaction problem in the spectral domain leads to the suppression of a few coefficients in the estimate of clean speech through the influence of formants in the neighboring bands, and thus results in poor quality. To reduce the domination of stronger formants over the neighboring bands, the clean speech is estimated in the temporal domain. The unsuppressed speech is filtered into various equivalent rectangular bandwidth based subbands, followed by spectral enhancement of the speech in each band based on the Discrete Cosine Transform (DCT) using Spectral Subtraction/Minimum Mean Square Error (MMSE) [7].
Park et al. [8] discussed the use of speech enhancement in mobile communication systems. The authors developed an efficient noise reduction scheme which can improve the performance of speech recognition. However, mobile devices have limited capacity, which motivates the development of a low-complexity scheme for noise reduction. In that work, a speech coder is also utilized for packet data estimation, and pitch information is used for comb filtering. Adaptive filtering has a significant impact on speech processing systems and has been used widely in applications such as channel equalization, system identification and echo cancellation. Although MMSE and the Discrete Wavelet Transform (DWT) show improvement in speech enhancement, their performance is poor at low SNR conditions [9,10].
The Dual Tree Complex Wavelet Packet Transform (DTCWPT) employed in [11] provides a solution to avoid the aliasing and oscillations caused by the lack of shift invariance in the DWT. Even though MMSE-based SPP and DTCWPT are compatible, intelligibility is lost in the reconstructed speech due to the finite wavelet filters. In this field, Choi et al. [12] improved the existing adaptive filtering and developed a novel sub-band adaptive filter. Further, this technique takes advantage of norm-optimization and uses a norm in the computation of the cost function. The authors claim that this technique is capable of improving the system performance and robustness in impulsive noise scenarios.
The optimally modified MMSE-based log Spectral Amplitude estimator (OM-LSA) improves performance compared to MMSE for stationary and some non-stationary noises, but shows significantly poor performance at low and high SNR conditions [13,14]. The SMPO-based MMSE technique for speech enhancement improves the PESQ in all kinds of noisy environments. Binary masking yields good speech quality, but sometimes the intelligibility of the speech may be lost due to over-thresholding when masking noise coefficients. SMPO-Wiener is expected to provide a better solution to these issues [15]. Hence, combining the Wiener filter with MMSE-based SPP through soft decision may improve speech intelligibility in addition to speech quality.
− Challenges in Speech Enhancement: Speech is considered a powerful mode of communication not only between humans but also for human-machine interfaces. Sometimes the speech signal travels through a noisy medium before reaching the listener (recognition system). Nowadays, automated speech processing systems have attracted a lot of attention from researchers and have been adopted widely in real-life scenarios. State-of-the-art automated speech processing systems work well in controlled environments, but real-time systems suffer from various background noises, reverberation and speech from other speakers, which cause anything from partial to complete loss of information, especially at low SNR. Noise removal can be done only by identifying the characteristics of the noise through preprocessing. Although a large amount of research has been carried out in this field, speech enhancement still remains a challenging task because of computational complexity, as the process itself introduces complexity.
− Contribution of the work. The main contributions of this work are as follows:
a. Initially, features of the speech signal, such as silence, are estimated from the noisy signal spectrum through Speech Presence Probability (SPP) estimation.
b. The noise characteristics are extracted from the silence.
c. Finally, MMSE-based (noise tracker) noise power estimation is used to suppress the noise in the noisy signal through a noise power tracking scheme.

SINGLE-CHANNEL SPEECH ENHANCEMENT
Assuming that the majority of the transmission noise is additive, single-channel speech enhancement is modeled by (1), $y(n) = x(n) + d(n)$, where x(n), d(n) and y(n) correspond to the speech, noise and noisy speech signals, and n indicates the discrete time index. x(n) and d(n) are assumed to be independent, each with zero mean. The block diagram of single-channel speech enhancement is represented in Figure 1. To estimate the clean speech signal x(n), the noisy signal y(n) is processed using speech enhancement algorithms.
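The additive model can be illustrated with a minimal sketch that mixes a signal and a noise sequence at a chosen input SNR, as is done when preparing noisy test material. The function name `mix_at_snr`, the sinusoidal stand-in for speech and the Gaussian noise source are assumptions made for illustration only and are not taken from the paper.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Return y(n) = x(n) + g*d(n), with g chosen so that the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have equal length")
    p_x = sum(s * s for s in speech) / len(speech)   # average speech power
    p_d = sum(d * d for d in noise) / len(noise)     # average noise power
    g = math.sqrt(p_x / (p_d * 10.0 ** (snr_db / 10.0)))
    return [s + g * d for s, d in zip(speech, noise)]

random.seed(0)
fs = 8000
x = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]  # stand-in for clean speech
d = [random.gauss(0.0, 1.0) for _ in range(fs)]                # zero-mean white noise
y = mix_at_snr(x, d, snr_db=5.0)
```

Scaling the noise rather than the speech keeps the clean reference unchanged, which is convenient when the enhanced output is later compared against x(n).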

Figure 1. General Speech Enhancement Systems
The analysis of a noisy speech signal y(n) for speech enhancement is illustrated in Figure 2. Initially, the noisy signal is segmented into overlapping frames, which are then transformed into the frequency domain using the DFT or STFT. The DFT is applied here because it directly exposes the spectral content of the signal. Prior to speech enhancement, the noise characteristics can be identified through a statistical model. The gain values are calculated from features extracted from the original input so that an enhanced speech signal is obtained. The target can be estimated from the power spectral densities of the noisy speech segments and of the noise, which are used to generate the gain coefficients.
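The framing and transform stage can be sketched as follows. This is a minimal illustration using only the Python standard library, with a naive DFT in place of an FFT; the 50% overlap, the periodic Hann window and all function names are assumptions for the example rather than details confirmed by the paper.

```python
import cmath
import math

def hann(n_len):
    # Periodic Hann window
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_len) for n in range(n_len)]

def split_frames(signal, frame_len, hop):
    # Overlapping frames; trailing samples that do not fill a frame are dropped
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def dft(frame):
    # Naive O(N^2) DFT, returning the non-negative frequency bins 0..N/2
    n_len = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                for n in range(n_len))
            for k in range(n_len // 2 + 1)]

def stft(signal, frame_len=256, hop=128):
    win = hann(frame_len)
    return [dft([w * v for w, v in zip(win, fr)])
            for fr in split_frames(signal, frame_len, hop)]

fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(1024)]
spec = stft(tone)
# Bin with the largest magnitude in the first frame
peak_bin = max(range(len(spec[0])), key=lambda k: abs(spec[0][k]))
```

For a 1000 Hz tone sampled at 8 kHz and a 256-point frame, the spectral peak is expected at bin 1000 * 256 / 8000 = 32.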
Wiener filtering minimizes the Mean Square Error (MSE) between the actual signal and the estimated signal and recovers the original signal from the noisy signal. Similarly, the noise can be reduced through hard or soft thresholding. Sometimes important components of the original speech might be lost due to over-thresholding.

Modified Wiener filtering technique
The advantage of the modified adaptive Wiener filter is that the speech signal is filtered according to varying local statistics such as the mean and variance. Here, the additive noise $d(n)$ is assumed to be white with zero mean and variance $\sigma_d^2$. Thus, its power spectrum $P_d(\omega)$ can be expressed as in (2). The segmented speech signal $x_j(n)$ is treated as stationary and can thus be modeled by (3), where $m_{x_j}$ is the local mean and $\sigma_{x_j}$ is the local standard deviation of $x_j(n)$, and $w(n)$ is zero-mean noise with unit variance. For convenience, $x_j$ is written as $x$; the mean of the original signal, $m_s$, is equal to the average of the means of all $j$ frames, $m_x$. The Wiener filter transfer function can be estimated by (4). The impulse response of the Wiener filter is obtained by applying the inverse transform to $H(\omega)$ and is given in (5). As $m_x$ and $\sigma_x$ are updated at each sample, the speech signal estimate, denoted $\hat{x}(n)$, can be obtained from (6).
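A minimal sketch of a locally adaptive Wiener filter of this kind is given below. It assumes the commonly used form $\hat{x}(n) = m_x + \frac{\sigma_x^2}{\sigma_x^2 + \sigma_d^2}\,(y(n) - m_x)$, with $m_x$ and $\sigma_x^2$ estimated from a short sliding window of the noisy signal; whether this matches the paper's (6) term by term is an assumption, and the window half-width, test signal and function names are illustrative.

```python
import math
import random

def adaptive_wiener(y, noise_var, half_width=8):
    """Sample-wise Wiener estimate using a local mean and a local variance."""
    out = []
    for n in range(len(y)):
        lo, hi = max(0, n - half_width), min(len(y), n + half_width + 1)
        window = y[lo:hi]
        m_x = sum(window) / len(window)                 # local mean (noise is zero mean)
        var_y = sum((v - m_x) ** 2 for v in window) / len(window)
        var_x = max(var_y - noise_var, 0.0)             # local speech variance, floored at 0
        gain = var_x / (var_x + noise_var) if noise_var > 0 else 1.0
        out.append(m_x + gain * (y[n] - m_x))
    return out

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

random.seed(1)
x = [math.sin(2 * math.pi * 5 * n / 1000) for n in range(1000)]  # slowly varying test signal
y = [v + random.gauss(0.0, 0.3) for v in x]                       # additive white noise, variance 0.09
x_hat = adaptive_wiener(y, noise_var=0.09)
```

Where the local variance is close to the noise variance the gain approaches zero and the output falls back to the local mean; where the signal dominates, the gain approaches one and the noisy sample is passed almost unchanged.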
Estimating and tracking the noise frames in the noisy signal is the crucial task in speech enhancement. Usually, a voice activity detector is employed to find noise-only segments. In this Wiener filtering method, however, the frame energy is synchronized with the minimum frame energy. The minimum energy variation is directly proportional to the signal conditions. Therefore, the spectral deviation is calculated using 32 smoothed points of the spectrum. These points are verified and further updated to obtain the noise spectrum. The energy threshold is then fixed at 10 dB. Thus, speech is identified as present when the energy level is more than 10 dB and the RMS level is more than 8 dB. When speech is absent, the noise spectrum N(ω,m) can be measured over varying time samples. The ultimate plan is to implement an MMSE-based noise power estimation technique. In this process, the speech and noise spectral coefficients are assumed to have complex Gaussian distributions, as expressed in (7).
Hence, the noisy power spectral coefficients can be expressed as given in (8).

Noise PSD estimation and tracking
The noise is represented as = and = . Spectral coefficients can be transformed into polar coordinates using (9).
In the MMSE method, several noise power estimators can be used for estimating the noise periodogram from the noisy signal. Let ξ denote the a priori SNR and $\hat{\sigma}_d^2$ the estimated noise power. The noise periodogram can then be obtained from (11).
From (11), the noise periodogram can be obtained. However, the signal changes over time, so the spectral density has to be updated frequently. To update the spectral density parameters, recursive smoothing is applied as shown in (12).
where α = 0.0. Whenever ξ < 1, the noise periodogram is updated. In the same way, the spectral noise power estimate is refined by checking the a priori SNR. Hence the MMSE estimate combines the observed noisy signal and the previous estimate of the spectral noise power $\hat{\sigma}_d^2$. From this estimate, the a priori SNR value is obtained, which lies between 0 and 1. The a priori SNR can be obtained from (13).
This expression involves the incomplete gamma function. Depending on the a priori SNR, the estimator $E(|d|^2 \mid x)$ is unbiased, whereas for low SNR values it is biased low.
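The interplay of the MMSE noise periodogram estimate and recursive smoothing can be sketched as below. The closed form used, $\left(\frac{1}{1+\xi}\right)^2 |y|^2 + \frac{\xi}{1+\xi}\,\hat{\sigma}^2$, is the one commonly found in the MMSE noise tracking literature under the complex Gaussian model and is assumed here rather than taken from (11); the smoothing constant `alpha=0.8` and the function names are likewise illustrative assumptions.

```python
def mmse_noise_periodogram(noisy_power, prior_noise_psd, xi):
    """MMSE estimate of the noise periodogram for one frequency bin.

    noisy_power     : |y|^2 of the current frame
    prior_noise_psd : noise power estimate from the previous frame
    xi              : a priori SNR estimate (linear scale)
    """
    w = 1.0 / (1.0 + xi)
    return w * w * noisy_power + (1.0 - w) * prior_noise_psd

def update_noise_psd(prev_psd, noisy_power, xi, alpha=0.8):
    """First-order recursive smoothing of the noise power estimate."""
    estimate = mmse_noise_periodogram(noisy_power, prev_psd, xi)
    return alpha * prev_psd + (1.0 - alpha) * estimate

# Track a single bin over a few frames of noisy power, with xi fixed at 0
psd = 1.0
track = []
for power in [2.0, 2.0, 2.0, 2.0]:
    psd = update_noise_psd(psd, power, xi=0.0)
    track.append(psd)
```

With ξ = 0 the estimate equals the noisy periodogram, so the tracked value moves from the initial 1.0 towards 2.0 at a rate set by the smoothing constant; with large ξ the estimate stays close to the previous noise power, so speech energy does not leak into the noise estimate.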

Soft-decision technique for noise presence estimation
Speech presence or absence has to be detected in order to identify the noise characteristics. Generally, the noise characteristics can be obtained when speech is absent. The classification into speech and silence can be done by (15).
However, the proposed model utilizes a soft-decision framework for noise estimation in the presence or absence of speech. The probabilities of speech presence and absence can be estimated from (16) and (17).
Equations (18) and (19) describe speech presence and speech absence, respectively. The noise periodogram under speech presence can be expressed as in (20). For further improvement, chi-square distribution-based hypothesis analysis can be applied.
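A soft-decision estimate of this kind can be sketched as follows. The posterior used is the closed form that results from complex Gaussian likelihoods under both hypotheses with a fixed a priori SNR for speech presence; the value of 15 dB for that SNR, the equal priors and the function names are assumptions for illustration, and the expressions are not claimed to reproduce (16) to (20) exactly.

```python
import math

def speech_presence_probability(noisy_power, noise_psd, xi_h1_db=15.0, prior_h1=0.5):
    """Posterior probability of speech presence for one frequency bin."""
    xi_h1 = 10.0 ** (xi_h1_db / 10.0)          # fixed a priori SNR under speech presence
    prior_ratio = (1.0 - prior_h1) / prior_h1  # P(H0) / P(H1)
    exponent = -(noisy_power / noise_psd) * xi_h1 / (1.0 + xi_h1)
    return 1.0 / (1.0 + prior_ratio * (1.0 + xi_h1) * math.exp(exponent))

def soft_noise_periodogram(noisy_power, noise_psd):
    """Weight the noisy periodogram by P(H0|y) and the previous noise PSD by P(H1|y)."""
    p_h1 = speech_presence_probability(noisy_power, noise_psd)
    return (1.0 - p_h1) * noisy_power + p_h1 * noise_psd

p_quiet = speech_presence_probability(0.0, 1.0)    # no energy in the bin
p_loud = speech_presence_probability(100.0, 1.0)   # energy far above the noise floor
```

Unlike a hard speech/silence classification, the noise estimate is updated in every frame: the update is driven by the observation when speech is unlikely and held near the previous noise power when speech is likely.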
Two important components are considered in estimating the speech presence probability using the chi-square approximation: the observation sequence and the estimated noise variance, as shown in (22). Spectral component computation yields the observation sequence and the estimated noise variance over a total of N frequency bins, as given in (21). In the next phase, the chi-square calculation is applied to these frequency bins, and the chi-square statistic is given by (22).
The value obtained through the chi-square calculation is compared with a threshold, with N−1 degrees of freedom for N frequency bins.
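One way to realize such a test is the classical chi-square variance test, sketched below. It assumes the statistic $(N-1)\,s^2/\hat{\sigma}_d^2$, where $s^2$ is the sample variance of the observation sequence, and declares speech present when the statistic exceeds the upper quantile of the chi-square distribution with N−1 degrees of freedom. The exact statistic in (22), the 95% confidence level and the use of the Wilson-Hilferty approximation for the threshold are assumptions made for this example.

```python
from statistics import NormalDist

def chi_square_threshold(dof, confidence=0.95):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(confidence)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3

def chi_square_statistic(observations, noise_var):
    """(N-1) * s^2 / noise_var, with s^2 the unbiased sample variance."""
    n = len(observations)
    mean = sum(observations) / n
    sample_var = sum((o - mean) ** 2 for o in observations) / (n - 1)
    return (n - 1) * sample_var / noise_var

def speech_present(observations, noise_var, confidence=0.95):
    """True when the observed variance is significantly above the noise variance."""
    stat = chi_square_statistic(observations, noise_var)
    return stat > chi_square_threshold(len(observations) - 1, confidence)

quiet = [1.0, -1.0] * 128   # N = 256 values, variance close to the noise variance 1.0
loud = [2.0, -2.0] * 128    # same length, four times the variance
```

Here `speech_present(quiet, 1.0)` evaluates the statistic at 256, below the threshold of roughly 293 for 255 degrees of freedom, while `speech_present(loud, 1.0)` yields 1024 and is flagged as speech.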

RESULTS AND DISCUSSION
The noise characteristics are estimated from the noisy speech signal during the absence of speech, and the speech quality is determined using Perceptual Evaluation of Speech Quality (PESQ) and segmental SNR (segSNR). With mobile applications in mind, TIMIT database signals with sampling rates of 16 kHz and 8 kHz are used as input to measure the quality of the proposed algorithm. Initially, the signals are TF-decomposed into Hann-windowed frames with a length of 256 samples. To calculate the objective intelligibility, however, each frame is zero-padded to 512 samples. As PESQ is highly correlated with subjective measures, it is recommended by ITU-T for measuring the performance of the processing technique. The simulation was done in MATLAB R2013b.
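For reference, segmental SNR can be computed along the following lines. This is a minimal sketch assuming non-overlapping 256-sample frames and per-frame clamping to the range −10 dB to 35 dB, which is a common convention but is not stated in the paper; the function name and test signal are illustrative.

```python
import math

def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo_db, hi_db]."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal_energy = sum(v * v for v in c)
        error_energy = sum((p - q) ** 2 for p, q in zip(c, e))
        if signal_energy == 0.0:
            continue                       # skip silent frames
        if error_energy == 0.0:
            snr = hi_db                    # perfect reconstruction of this frame
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        values.append(min(max(snr, lo_db), hi_db))
    if not values:
        raise ValueError("no frames with non-zero signal energy")
    return sum(values) / len(values)

fs = 8000
clean = [math.sin(2 * math.pi * 440 * n / fs) for n in range(1024)]
scaled = [0.9 * v for v in clean]          # 10% amplitude error in every sample
seg = segmental_snr(clean, scaled)
```

A uniform 10% amplitude error gives an error energy of 1% of the signal energy in every frame, so the result is 20 dB. Averaging per-frame values in dB, rather than computing one global ratio, gives low-energy frames the same weight as loud ones.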
Noisy signals with babble, airport, car and train noise at SNRs of 0 dB, 5 dB and 10 dB, with 30 samples of male and female speakers from the NOIZEUS database, are chosen for performance evaluation of the proposed technique. The signal is synthesized using the overlap-and-add procedure. General simulation parameters such as down-sampling, number of FFT points, window length and window overlap are described in Table 1. The original speech and noisy signals are considered for simulation in .wav form and are represented in Figure 3 and Figure 4, respectively. Through these plots, the presence and absence of speech can be identified. The probability distribution utilizes the phase analysis method, and Figure 5 shows the power spectra of the original and noisy signals. Finally, the filtered and noise-suppressed signal is represented in Figure 6.
The proposed model is compared with the existing state-of-the-art method of Pengfei Sun and Jun Qin [11]. PESQ and segSNR are the measures used to evaluate the processing approach. The average PESQ and segSNR values over all samples, for noisy input SNRs varied from 0 dB to 10 dB, are given in the tables. The NOIZEUS database, with 30 speakers in each set, contains babble, airport, car and train noise, which are also treated as important conditions for assessing the performance of the method. Table 2, Table 3, Table 4 and Table 5 describe the performance analysis for babble, airport, car and train noise, respectively.
The calculated PESQ and segmental SNR are compared with benchmark algorithms to estimate the efficiency of the system. For babble and train noise at 0 dB, 5 dB and 10 dB SNR, the proposed method achieves better results than the existing state-of-the-art method. Better results are also obtained in the presence of airport and car noise at all SNRs (Table 3 and 4). From the results it is clear that noise detection is more efficient than in the other existing approaches. Stationary noise is recognized easily, whereas in some cases non-stationary noise such as babble and train noise is tracked with the help of soft-decision-based estimation. Figure 7, Figure 8, Figure 9 and Figure 10 show the comparative analysis of the proposed approach in terms of PESQ for babble, airport, car and train noise at 0 dB, 5 dB and 10 dB, considering five state-of-the-art techniques. The PESQ measurements in Tables 2, 3, 4 and 5 show that the proposed method improves the quality of the enhanced signal by an average of 7.76% at 0 dB, 2.65% at 5 dB and 3.78% at 10 dB SNR.
Figure 9. PESQ performance analysis for car noise
Figure 10. PESQ performance analysis for train noise

CONCLUSION
The objective intelligibility measured for various voiced and unvoiced signals is found to be better with the proposed method. Wiener filtering is implemented along with spectral coefficient and probability distribution models to improve speech quality in terms of PESQ and segSNR. While extracting the noise characteristics, the phase coefficients are preserved and reused for the reconstruction. The method performs better at low SNR, where the noise power is calculated locally. Furthermore, SPP is used to make a soft decision in detecting the noise statistics in the absence of speech, and MMSE-based noise tracking is used in the presence of speech. This gives better noise estimation and improves the performance at high SNR. Finally, recursive smoothing is applied, resulting in efficient single-channel speech enhancement. Speech enhancement using adaptive filtering may reduce the cepstral smoothing and echo cancellation required, and may improve the results further. The results indicate that the proposed work improves the speech quality at 0 dB, 5 dB and 10 dB; it is better in most cases and comparable in some particular cases. Speech intelligibility is also good according to the Mean Opinion Score (MOS) of the subjects. The method also efficiently improves the quality of signals with lower energy concentration when compared to other state-of-the-art techniques.