Real-time Single-channel Speech Enhancement with Recurrent Neural Networks

[MUSIC]>>Then, I guess we can start? Yes. So welcome, everyone, to the final internship presentation talk. Yangyang (Raymond) is here; he's doing his PhD at CMU, Carnegie Mellon University in Pittsburgh, with Professor Richard Stern. He worked here for a few months with us on real-time single-channel speech enhancement with recurrent neural networks. With that, the stage is yours.>>Okay. Thank you, Sebastian. Thanks to everyone for coming to my talk today. I'm going to present the work I've been doing for the past three months in collaboration with Sebastian, and the title is Real-time Single-channel Speech Enhancement with Recurrent Neural Networks. Let's get started. Throughout the talk, I'll first be introducing single-channel speech enhancement. After formulating the problem, we will go over not only the methods based on deep learning but also the classical signal processing methods. Then we'll move on to
our method based on a recurrent neural network
and connect what we learn from classical signal processing to our decisions in building our network. We'll be doing a thorough evaluation on a number of objective speech quality measures, on a large-scale dataset collected with the help of Raj, [inaudible], and Harry. Finally, we'll be concluding the talk and reporting
some major findings. Let’s get started with
the introduction. So what is single-channel speech enhancement? Simply put, single-channel speech enhancement aims to reduce noise in noisy speech while retaining speech quality to the best extent possible. Our overall assumption is that our noisy speech comes from the addition of a clean speech signal and a noise signal, and there is no other assumed distortion such as non-linear distortion, channel distortion, or reverberation. The other general assumption we make is that the noise characteristics typically change more slowly than speech. Our goal is, as I said before, to suppress noise and retain speech to the best extent possible, to improve human or machine perception. In this project, our focus is on the end-users, the human listeners who are going to listen to the enhanced clips. Let's have an overview of the generic
speech enhancement pipeline. On top, we have the flow diagram of a classical signal processing-based
speech enhancement system. We start with our time-domain
waveform signal x of t, and throughout the
talk, I’ll be assigning x to all the noisy signals. That signal goes through a
short-time spectrum analysis, typically the short-time
Fourier transform, to get the short-time
spectral characteristics. After that, we separate the short-time spectral
features into phase and into magnitude denoted by
these little blocks there. One challenging aspect of single-channel enhancement is that the phase is typically
very hard to recover. So that is out of the scope of our talk today as well, and we will be leaving it as it is, using the noisy phase for reconstruction. We do the majority of our
work in the magnitude domain. You see there are some generic
modules to estimate noise or estimate the gain from the
magnitude of the noisy spectrum. After that, we send it into an estimator, which we will call a gain estimator, that basically applies a gain function in the frequency domain on each frame of our noisy spectra. We point-wise multiply that gain with our noisy magnitude and use the enhanced magnitude to recover the clean speech.
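In symbols, the assumptions and the gain-based reconstruction just described look roughly like this (a sketch of the standard formulation, with t the time index and l, k the frame and frequency-bin indices; not copied from the slides):

```latex
x(t) = s(t) + n(t), \qquad X(l,k) = \mathrm{STFT}\{x(t)\}
```
```latex
\hat{S}(l,k) = \hat{G}(l,k)\,\lvert X(l,k)\rvert\, e^{\,j\angle X(l,k)}, \quad \hat{G}(l,k)\in[0,1],
\qquad \hat{s}(t) = \mathrm{ISTFT}\{\hat{S}(l,k)\}
```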
That was the generic pipeline for over 30 years, until deep learning arrived on the scene. The basic pipeline is similar in the sense that we start
with our time-domain signal. We do some feature extraction which doesn’t have to be spectral features
anymore, it can be anything. But our end goal is still to estimate this time-frequency gain function
denoted g with a hat there. Then we point-wise multiply that with the noisy magnitude to recover the clean speech, hopefully. As you can see, everything else in the middle becomes more or less a black box, thanks to the neural networks. Of course, with deep learning we
have training data to leverage. That’s a huge advantage of the modern machine learning approach compared to
the classical approach. Now, in this talk, I would like to focus on several aspects we can improve in this deep learning box: first the feature extraction, then the neural network itself, the learning objective, and how we actually train our system. Our method will be broken into those four pieces. Before I get into the actual method, let me briefly go over
this short chart I picked. As you can see, we have
six methods right here. The first two are classical signal processing-based methods. The middle two are deep learning-based but cannot operate in real time. The last two rows are deep learning-based and can actually operate in real time. As you can see, I'm highlighting some of the key components that determine whether or not a method can process in real time. For things like spectral subtraction, the key part is estimating the noise with a moving-average filter, and for the decision-directed method, you have a recursive smoothing of the instantaneous SNR measures. Now, these estimates can actually be improved if we drop the assumption of online processing; things like noise estimation can be improved massively if we incorporate information from the future. But that's how the scene was set up, and we want to keep it that way, because real-time processing is, I think, our ultimate goal: we have speech coming in and we enhance it without looking at what's in the near future. It's not that there is something these methods cannot do; they just keep the assumption of real-time online processing. Although the two deep learning methods here have done very well, they use information from the future, which breaks that assumption, and for that reason we are focusing our model on an online, single-frame-in, single-frame-out basis. For that reason, the last two methods are the qualified candidates we want to compare to, and even the last one is not really real-time because it is trained on one-second waveforms. All right. Let's jump
into the method part. So as I said from the flow diagram, we break our method into four parts. The feature representation,
the learning machine, the learning objectives, and
how we train the network. Let’s start looking at the
first thing, the feature. We use the most standard
feature for a neural network, that is the short-time
Fourier transform magnitude. We also consider the
short-time log power spectra with a negative 80 dB floor. What you see on the left is actually the log power spectra, with the values mapped linearly to color and displayed with the jet color map in MATLAB. Just for ease of viewing, the x-axis is time in seconds and the y-axis is frequency in kilohertz, and we have three spectrograms stacked together. On top, we have the noisy speech; I think that's with the air conditioner noise at 20 dB. In the middle, we have the clean speech signal, and on the bottom, we have this weird-looking IRM, or ideal ratio mask, which is the ideal gain function. I plot it in dB because if I plotted it between zero and one, you would not be able
to see the contrast. As I said before, the output we're trying to estimate is a real-valued magnitude gain function ranging between zero and one. Some technical details about how we construct this spectrogram: we have a 16 kHz sampling rate for our audio, and we use a 32-millisecond analysis frame with a Hamming window and a 75 percent overlap.
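To make the front end concrete, here is a minimal sketch, assuming NumPy and SciPy, of a feature extractor with the parameters just mentioned (16 kHz audio, 32 ms Hamming window, 75 percent overlap, -80 dB floor on the log power spectrum); it is an illustration, not the actual feature code from the talk:

```python
import numpy as np
from scipy.signal import stft

def stft_features(x, fs=16000):
    """Compute STFT magnitude and floored log-power spectra (illustrative)."""
    win_len = int(0.032 * fs)            # 32 ms analysis frame -> 512 samples
    hop = win_len // 4                   # 75% overlap -> 8 ms frame shift
    _, _, X = stft(x, fs=fs, window="hamming",
                   nperseg=win_len, noverlap=win_len - hop)
    mag = np.abs(X)                                        # (257, n_frames) for a 512-point FFT
    log_power = 10.0 * np.log10(np.maximum(mag**2, 1e-8))  # floor the log power at -80 dB
    return mag, log_power
```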
>>Hamming window or Hann window?>>Hamming window.>>The four percent [inaudible] window from the zero?>>Yeah. Not the Hann window; the 0.54 plus 0.46 times the cosine.>>That's Hamming.>>Yeah.>>The Hann window is the one that goes from zero to one to zero.>>Okay. It's not Hann, it's Hamming.>>Okay.>>Yeah. It's a raised cosine.>>Have you tried different
windows, different [inaudible]?>>I briefly tried out a 20-millisecond window with 50 percent overlap, and the performance went down for the network. That was a month ago, and I set it aside and never really changed my original setup. Yeah. But I think it would work. Well, the overlap might be a problem, but a 20-millisecond window, I think, would work.>>So why do you use
the log power spectrum?>>The question is why I use the log power spectrum: our perception of audio is correlated with a log scale. As you can see, visually, we can see the contrast if we map the values obtained on the log scale linearly to color. If I did that for just the magnitude, the contrast would be so low that you wouldn't even visually see the difference.>>But the magnitude itself already contains the information about the power, right? If you just square the magnitude, it comes back
to its power, right?>>Well, the log power is just a non-linear compression
on the linear power.>>Okay.>>We’ll actually do a
comparison by feature later, so you'll see the result. Yeah.>>The dynamic range of audio signals is extremely high, and that's why we usually use a logarithmic [inaudible]. So we assume that neural networks will deal better with slightly compressed data.>>Yeah. We'll see.>>That's what most people find.>>In addition to the
feature I mentioned earlier, we are exploring some different normalization
techniques on audio features. The very first is your standard global mean and
variance normalization by frequency. The statistics are accumulated over 80 hours of randomly sampled speech from our training set. In addition to that, we also explore online mean and variance normalization. As Sebastian mentioned, the dynamic range of
speech is very high and it changes drastically over time. One way to deal with that is to smooth the spectrum in time, and we apply a three-second exponential window either globally or on a per-frequency basis. As you can see on the left here, the top graph shows the original noisy spectra, the middle one is the spectra after frequency-dependent normalization, and the bottom figure is after frequency-independent normalization. The absolute value of the color is not important; the contrast is more important.
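As a concrete illustration of the online variant, here is a minimal sketch, assuming NumPy, of frame-wise exponential mean and variance normalization per frequency bin, with a time constant of roughly three seconds at an 8-millisecond frame shift; the exact update rule and constants used in the experiments may differ:

```python
import numpy as np

def online_mvn(frames, frame_shift_s=0.008, window_s=3.0, eps=1e-6):
    """Frame-wise exponential mean/variance normalization, per frequency bin.

    frames: array of shape (n_frames, n_bins), e.g. log-power spectra.
    """
    alpha = np.exp(-frame_shift_s / window_s)   # ~3 s exponential memory
    mean = np.zeros(frames.shape[1])
    var = np.ones(frames.shape[1])
    out = np.empty_like(frames)
    for t, frame in enumerate(frames):
        mean = alpha * mean + (1.0 - alpha) * frame
        var = alpha * var + (1.0 - alpha) * (frame - mean) ** 2
        out[t] = (frame - mean) / np.sqrt(var + eps)
    return out
```

For the frequency-independent (global) version, the same recursion would simply track a single mean and variance shared across all bins.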
Okay. All right. So after the features, we are getting into probably the most important part of our system, which is
the learning machine: the neural network itself. The recurrent neural network is the most natural choice for us because, first and foremost, it has a notion of time: it outputs something for the current time instant based on the input at the current time and also on the output from the previous time steps. That is similar to the recursive filtering we have in the classical approach to speech enhancement. So that's the basic
structure we are based on. One well-known example recently that uses RNNs for speech
enhancement is called RNNoise; you can check out the paper with the reference on the last slide there. The network has, I would say, a pretty
complicated architecture, I’m not going to get into that. But there are two things
that caught our attention. The first is the use of
gated recurrent units, which have proven to learn long-term temporal patterns effectively. The second is a dense layer, which really acts like a large non-linear transformation block to bring your features, from a long-term sequence, into the gain function
at that time instant. So we take their ideas, and what we realized is that it's important to have these residual connections somewhere in the network. Residual connections make deep networks easier to train. There's a very famous paper from a couple of years ago in the field of computer vision on an image classification task. What the author found is that with this simple residual, which means, imagine you have multiple layers and you simply add the input of an earlier layer to later layers of the network, you can learn extremely deep networks, I think tens of layers, something like 20 layers. That's in computer vision; in our case here, the depth of the network actually corresponds to the number of time frames, and I'm going to explain that later. Because we are going to train the network with very long sequences, we believed that the residual connections would help within the network, and that's what we decided to do. On the right, you see a standard gated recurrent cell, and what we did was simply add a bypass connection from the input to where the cell aggregates all the learned components and propagates that to the next layer. After we did that, we did some literature research and
found out that the same idea has actually been applied to different tasks already. There's sequence classification, and probably the most similar one is a 2017 paper by Chen on feature compensation. What they did is estimate this mask for a Mel-spectrogram for speech recognition. It's for enhancing speech features as well, but it's not used to reconstruct speech. So we believed that this block would do well. But we don't stop there; we still need an entire network that goes from our input feature to
the output gain function. What we did is simply stack a few, in our case just three, GRU layers with residual connections. If you zoom in on each block, it will look like this, except for the last layer, where we don't add the residual. The justification is that the input features we're getting come from audio, which has a very high dynamic range, while the output that comes out of the [inaudible] is already compressed in the last layer before it gets transformed by a fully-connected layer. We don't want the input's dynamic range to mess up what's being learned inside, so we don't have that connection there. Everything else has the residual, and at the end we have a fully connected layer with a sigmoid function, so the outputs are again between zero and one. That is our network architecture.
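As a rough sketch of a stack like the one just described (PyTorch; the layer sizes and the exact placement of the skip connections are illustrative assumptions rather than the trained configuration):

```python
import torch
import torch.nn as nn

class ResidualGRUEnhancer(nn.Module):
    """Stacked GRU layers with an additive bypass from each layer's input to its
    output, omitted on the last recurrent layer as described in the talk,
    followed by a fully connected layer with a sigmoid that outputs per-bin
    gains in [0, 1]. Sizes are illustrative."""

    def __init__(self, n_bins=257, hidden=257, n_layers=3):
        super().__init__()
        self.grus = nn.ModuleList(
            [nn.GRU(n_bins if i == 0 else hidden, hidden, batch_first=True)
             for i in range(n_layers)]
        )
        self.fc = nn.Linear(hidden, n_bins)

    def forward(self, feats):                    # feats: (batch, frames, n_bins)
        h = feats
        for i, gru in enumerate(self.grus):
            out, _ = gru(h)
            # residual (bypass) connection on all but the last recurrent layer
            if i < len(self.grus) - 1 and out.shape[-1] == h.shape[-1]:
                out = out + h
            h = out
        return torch.sigmoid(self.fc(h))         # gains in [0, 1]
```

With these illustrative sizes (three GRU layers of 257 units on a 257-bin input), the parameter count comes out to roughly 1.26 million, in the same ballpark as the figure quoted later in the talk.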
Now, let's move on to the actual learning objective, which is probably equally as important as the network. We adopt the well-known
mean squared error. Okay, so I have an inconsistency in the notation here: X here is the clean speech short-time Fourier transform magnitude, Y is our noisy speech magnitude, and we're applying a gain function to the noisy speech. We simply take the point-wise squared error and average it across all time and frequency.
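Written out (a sketch in the talk's notation, with X the clean magnitude, Y the noisy magnitude, and G-hat the estimated gain over T frames and K frequency bins):

```latex
\mathcal{L}_{\mathrm{MSE}} \;=\; \frac{1}{TK}\sum_{t=1}^{T}\sum_{k=1}^{K}
\bigl( X_{t,k} - \hat{G}_{t,k}\,Y_{t,k} \bigr)^{2}
```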
>>So this is the mean squared error, the difference between the clean and the estimated signal?>>Yes. I would like to bring some context about minimum
mean squared error. There's a seminal work by Ephraim and Malah from 1984. They posed the problem by assuming that the complex STFTs of speech and noise have Gaussian distributions and are uncorrelated, and they solved for the optimal gain in the minimum mean squared error sense. For the deep learning-based approach, we actually don't make any assumptions about the distributions of anything; we simply learn by stochastic gradient descent and hopefully get to a point where the error is low enough for
training and low enough for test. The mean squared error has stable convergence because if you take the gradient of a square, you have a linear gradient everywhere.>>Raymond?>>Yeah.>>The short-time spectral formulation is Ephraim and Malah in 1984, but actually, in the time domain, it's Norbert Wiener back in the 1940s.>>That's right.>>[inaudible]>>That's right. Yeah, I was going to emphasize the Gaussian assumption.>>Yes.>>Yeah.>>But Wiener is actually based on the mean squared error of the time-domain signal, yeah, it's true. With that observation, we can
rewrite this mean squared error in statistical form, with an expected value instead of an actual average. If you rewrite it a little bit and ignore the cross term there, we end up with two terms; that is, by the way, a very.>>Very bold.>>Very coarse assumption that maybe doesn't hold. But our goal is to separate speech distortion from noise suppression, and by ignoring the cross term, what we end up with is the mean squared error between the clean signal, S here, and the clean signal itself multiplied by the gain function, plus the mean squared error of just the noise multiplied by the gain. Because we are not solving for any optimal solution in a statistical sense, and also because we want to balance the speech distortion and noise suppression terms, we first make this approximation and then come up with a new loss function that has two separate terms. The first one is on speech distortion and the second
term is on noise suppression. So the way to interpret this is: let's say your enhancement system does nothing, which means it just passes everything through; then your speech distortion is zero, but you have all the error coming from the noise. Then, with an enhancement system that suppresses everything, the gain function is zero, so you get no error for the noise suppression, but you get all the error from the speech.>>So technically the first term wants to keep the gain as close to one as possible, and the second term wants to make the gain as low as possible to suppress more noise.>>Yes.>>And if you want to balance.>>Yeah.>>You do so with alpha.>>Yeah.>>Okay.
For the speech-distortion term, we only compute it over the speech-active regions. So we apply a simple energy-based voice activity detector on the clean speech here; the detector is simply a threshold on the energy accumulated from 300 Hertz to 5,000 Hertz, which is typically where speech happens.>>Can you get that on the clean speech?>>Yes. Yeah, that's a very crude energy-based VAD; if we ran it on our noisy features, it would probably fail.
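To make the two-term objective concrete, here is a rough sketch in code (PyTorch; the variable names, the masking convention, and the normalization are illustrative assumptions, not the exact implementation used for the experiments):

```python
import torch

def speech_distortion_weighted_loss(gain, clean_mag, noise_mag, vad_mask, alpha=0.3):
    """alpha * speech-distortion term + (1 - alpha) * noise-suppression term.

    gain:      estimated gains in [0, 1], shape (batch, frames, bins)
    clean_mag: clean-speech STFT magnitudes, same shape
    noise_mag: noise STFT magnitudes, same shape
    vad_mask:  1.0 for speech-active frames, else 0.0, shape (batch, frames, 1)
    """
    # speech distortion: clean speech attenuated by the gain, speech-active frames only
    sd = ((clean_mag - gain * clean_mag) ** 2).mean(dim=-1, keepdim=True)
    speech_term = (sd * vad_mask).sum() / vad_mask.sum().clamp(min=1.0)
    # residual noise: noise leaking through the gain, averaged over all frames
    noise_term = ((gain * noise_mag) ** 2).mean()
    return alpha * speech_term + (1.0 - alpha) * noise_term
```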
We don't stop there. We also have an observation from the classical signal processing point of view: when we have a noisy signal that's almost clean, we don't want to destroy any speech content in there, so we pass almost all of the noisy speech through unchanged to retain the speech quality. When there's so much noise in the speech that we cannot even get a handle on where the speech is, we just apply very heavy suppression to the entire thing. So that basically says: when the SNR is approaching infinity, we want very little speech distortion, and when the SNR is approaching zero, we want very aggressive suppression of the noisy speech. Motivated by this observation, we have another loss function
built on top of the previous one, with these SNR-dependent weights multiplied into each part of the loss function there. I have to mention here that the original intention was to view this as one term: this is the weighting for speech, and one minus that multiplies this parenthesis here. That was the original intention, but that is not quite what our implementation does. So as not to confuse the audience, the results I'm showing here are the ones coming out of this implementation; one easy piece of future work is to try the corrected weighting, but we keep it as it is here.>>Well, this just scales both
terms with the same number.>>But this is per example. So imagine you have a batch of audio.>>Okay, then your global cost function will be weighted the same, okay.>>Yeah, but still, it would be more correct if you do the weighting of the alpha by the SNR.>>Sorry, why is there no sigma in that square, the lower right square?>>That was a typo, sorry. That
should be the square. Thank you. All right. So from the classical decision-directed approach of Ephraim and Malah, we have the a priori and a posteriori SNRs as your hidden states, in deep learning language. The hidden states from the previous estimate affect the current one through an exponential smoothing process.
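For reference, that classical recursion looks roughly like this in code (NumPy; beta and the floors are illustrative values), where the previous frame's enhanced magnitude plays the role of the hidden state:

```python
import numpy as np

def decision_directed_snr(noisy_mag, noise_psd, enhanced_mag_prev, beta=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR estimate for one frame (illustrative sketch)."""
    gamma = noisy_mag ** 2 / np.maximum(noise_psd, 1e-12)          # a posteriori SNR
    xi = beta * enhanced_mag_prev ** 2 / np.maximum(noise_psd, 1e-12) \
         + (1.0 - beta) * np.maximum(gamma - 1.0, 0.0)             # smoothed a priori SNR
    return np.maximum(xi, xi_min), gamma
```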
We have an analogy to this in the RNN-based approach, but what we have is almost a black box, with hidden states whose meaning we don't know. We do know that they are capable of learning very long temporal sequences; they learn through backpropagation through time. We want to study the effect of the length of the sequence we pass in, and this is just a simple toy recurrent neural network I have here. So this is your hidden state from the previous time frame and the hidden state for the current one; your input is x of t and your output is some y of t here. Let's just say your hidden state at time t simply equals your output, and your output is simply a function of your input plus your previous hidden state. Then if we take the partial derivative of the output with respect to the learning parameters of the network, we see it's a function of your current instantaneous gradient multiplied by something from the previous time frames, up to this T here. Here, I have from T all the way back to zero, but we can control this length and see how it affects the speech quality of the enhanced signal. So we are doing this comparison as well.
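For that toy recurrent model, the chain rule spells out roughly as follows (a sketch of standard backpropagation through time, not the exact slide); truncating the outer sum at a chosen depth is what controlling the training sequence length amounts to:

```latex
h_t = y_t = f(x_t, h_{t-1}; \theta), \qquad
\frac{\partial y_T}{\partial \theta}
= \sum_{t=0}^{T} \Biggl(\,\prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}\Biggr)\,
\frac{\partial f(x_t, h_{t-1}; \theta)}{\partial \theta}\bigg|_{h_{t-1}\,\text{fixed}}
```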
Okay. That's the end of our method. Now, let's move on to the evaluation. We have 84 hours of training data. The clean speech comes from the Edinburgh 56-speaker corpus, and the noise comes from 14 noise types from the DEMAND database and Freesound. For the test, we have 18
hours of test clips. The clean speech comes from the Graz University 20-speaker corpus, so there's no overlap with training at all. For noise, we pick nine challenging classes from the 14 in training, but we have different signals for test. Those are very challenging noise types; for example, we have the competing talker (neighbor speaking), and we have transient noise such as munching, a door shutting, and an airport announcement.>>The noise clips in the test data are never presented in the training set?>>No. Right? No.>>Okay.>>Yeah. We have five different SNRs from 0 dB to 40 dB with a 10 dB step, and all clips are sampled at 16 kilohertz. This is just a close-up look at the data we have. On top, we have clean speech; on the bottom, we have the noise. This is the waveform plotted in dB, and you see the same noise repeated five times here from 0 dB to 40 dB. We have the speech normalized to the same level, but it's the same speech
repeated five times. During training, what we did is augment the data a little bit by randomly drawing a segment of waveform from any clean speech file; from a noise file, we do the same, and we mix them. So the SNRs don't change; it's still the five discrete SNRs, but now you have different speech mixed with different noise.>>Do we have these [inaudible] in one file? The drastic change of the noise, can we have a segment like that?>>Yeah. That's possible.>>Thanks.>>Yeah.>>[inaudible] and conditional speech.>>It's probably a good thing for the network too. Yeah.>>It makes the data set a bit more difficult.>>Yeah. It might be even better to mix at different SNRs on the fly: just draw one noise clip, draw one speech clip, and mix them at a randomly drawn SNR level, but we didn't try that.
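As a rough illustration of that kind of on-the-fly mixing at a target SNR (NumPy; the segment handling and scaling convention are assumptions, not the exact training pipeline):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, seg_len, rng=None):
    """Draw random segments of speech and noise and mix them at a target SNR in dB."""
    if rng is None:
        rng = np.random.default_rng()
    s0 = rng.integers(0, len(speech) - seg_len)
    n0 = rng.integers(0, len(noise) - seg_len)
    s = speech[s0:s0 + seg_len].astype(np.float64)
    n = noise[n0:n0 + seg_len].astype(np.float64)
    # scale the noise so that 10*log10(P_speech / P_noise) equals snr_db
    p_s = np.mean(s ** 2) + 1e-12
    p_n = np.mean(n ** 2) + 1e-12
    n *= np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return s + n, s, n   # noisy mixture plus the aligned clean and noise targets
```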
So that's our data, and we have quite a few systems to compare. We start with the noisy, unprocessed signal, and we have the statistical one; this is the signal processing-based method developed here at MSR, without training data, of course. We have our proposed method here with the setup described, and we have a recurrent neural network which is simply our network with the residual connections removed, so we can study how effective the residual connections actually are; everything else stays the same. For RNNoise, we use the original code published by Valin 2018, because they have a package to do all the training and testing, and we couldn't augment the data, so we just kept the data as it is. In our experience, the data augmentation [inaudible] was worth about 0.1 in the test score, and they don't have this. Keep in mind that this number will be lower than what it potentially could be. For the simplified RNNoise here, what we did is we simply took
their network architecture. Theirs enhances a very crude energy contour over, I believe, 22 bands, but we have a full band of 257 bins. So what we did was take their architecture, scale up the feature dimensions to match the full band, and scale up all the other dimensions within the network to accommodate this difference. We don't use data augmentation. They also train with a voice activity detector as an additional target, and we didn't use that there because we don't have labels.>>[inaudible] or as output?>>So what they did is
they have it as an output and train it with labels. Yeah, yeah.>>So do we have this in the proposed architectures?>>The VAD? No. For the proposed method, it is kind of built into the learning objective because of the speech distortion term. Yeah.>>It is in the RNNoise, the original RNNoise. But we found [inaudible]. We randomized that and it didn't change anything.>>[inaudible].>>Sorry, Donald. Yeah, your mic. I can't hear you. [inaudible]>>Let me know the question. Finally, we have oracle information
plus the Wiener filter rule, which marks the theoretical best we can do. So we have seven systems to compare. In terms of evaluation metrics, we have four classical speech quality or intelligibility measures. They are: the scale-invariant signal-to-distortion ratio (SI-SDR), which is a more robust version of SNR; the cepstral distance, which is a distance metric in the cepstral domain, where the channel and the speech characteristics are roughly separated; the short-time objective intelligibility (STOI), which is in terms of percentage; and finally, the perceptual evaluation of speech quality, or PESQ, which predicts a mean opinion score of speech quality. Except for cepstral distance, everything else is better with a higher value; cepstral distance is better with a lower value. We also incorporate a new DNN-based mean opinion score predictor called AudioMOS, which is trained on MOS ratings from real users and has a 0.89 Pearson correlation coefficient on its test data.
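For reference, here is a minimal sketch, assuming NumPy, of how the scale-invariant SDR is commonly computed (one standard formulation; the exact evaluation scripts behind these results may differ):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # project the estimate onto the reference to remove any scale difference
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(distortion**2) + eps))
```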
All right. Ready for the results?>>[inaudible] one more thing.>>Yes. Some experiments
learning on average.>>An added question on the RNNoise baseline.>>Yes.>>In the original [inaudible], the bands are defined assuming [inaudible]. Did you change that, or is it still using that?>>Everything, I think, is at a 16-kilohertz sampling rate.>>Yeah, but then you have to change the code, because the original code assumes 48 kilohertz, and the critical bands are spaced assuming 48 kilohertz. So you have to define a new band resolution for that sampling rate.>>Okay.>>Because if that didn't happen, then it's probably not a valid comparison.>>So this is the original RNNoise; it is not the one that I'm sure you modified.>>Yeah.>>Then it's not. I think we cannot give it 16-kilohertz input, because the frequency bands are not meant for that.>>Okay.>>There's a better version of it.>>The frequency bands are based on the 0-24 kilohertz range.>>So the one that we used in our speech here, that one was a modified RNNoise. It performed way better than the original one.>>Yeah. I actually didn't remember that sampling rate, because I did this baseline almost three months ago. Yeah.>>Basically, if you used the RNNoise on GitHub and ran it on 16-kilohertz speech, I would not include it in [inaudible].>>Okay.>>We do have [inaudible] for this version as to [inaudible].>>Okay.>>Yeah. We have, I guess, suggested results.>>Well, the simplified RNNoise with the full-band enhancement measures is still worth keeping in the talk, because we changed the architecture.>>Because you are not using any [inaudible]?>>Yes. All right. So in terms of results, let's first look at the
best from each category. The "best" is in quotation marks because I picked the best based on the best PESQ score; I'll show later that none of these objective measures is actually optimal, but let's start with this comparison first. As you can see here, the first thing to notice is that RNNoise has tremendously fewer parameters because of the lower dimensionality of its crude energy contour. Our system has 1.26 million trainable parameters, but to put that number into perspective, it can actually enhance one second of audio in 39.6 milliseconds on the GCR machine I used, just using Python with a 2.6 gigahertz CPU. So it's well within the real-time processing constraint. In terms of objective measures, we find that our method outperforms the other systems in all categories. I'll explain this in the next few slides here. But again, I chose this one just because of the absolute best PESQ score obtained on the test data; this is for human listeners, though, so we want to listen to how it actually sounds. So let's start with the noisy clip.>>She had jumped away
from his shy touch like a cat confronted by a sidewinder. He had left her inviolate, thinking familiarity will gentle her in time.>>Can you hear that?>>It should if it goes to the speakers.>>Wow, I could hear it.>>Yeah, I can hear it too.>>Okay.>>So that was babble noise at 20 dB, and this is based on.>>She had jumped away from his shy touch like a cat confronted by a sidewinder. He had left her inviolate, thinking familiarity will gentle her in time.>>That's based on classical signal processing. I'm going to skip a few here, and let's listen to the full-band RNNoise.>>In a way, he couldn't blame her. She had jumped away from his shy touch like a cat confronted by a sidewinder. He had left her inviolate, thinking familiarity will gentle her in time.>>Our proposed method.>>In a way, he couldn't blame her. She had jumped away from his shy touch like a cat confronted by a sidewinder. He had left her inviolate, thinking familiarity
will gentle her in time.>>Okay. We did some comparisons on feature normalization, and we felt that normalization in general helps, but maybe not as much as we expected it to. Yeah, I'm going to skip through this part, and on the effect of sequence lengths, we found that five. Yeah. So the first four bars are based on the short-time spectral amplitude; the next four bars are based on the log spectra. Here we have the original spectra, then after global normalization, after online frequency-dependent normalization, and after online frequency-independent normalization, and the same for the log spectra, using the exact same network architecture; the only difference is the feature.>>Global normalization, [inaudible] the green log spectrum, global normalization is actually the best?>>Yes. This is based on mean squared error only, not the speech-distortion-weighted loss.>>Raymond, you mentioned that the clean speech is roughly at the same level.>>Yeah. I've either->>But in reality this
is valid if you have a [inaudible] microphone or you are at roughly the same distance from the microphone. But if you are using a local microphone, you can be half a meter or five meters away, and you then have a 20 dB difference in the voice level.>>Yes.>>That dynamic range is where the normalization would help tremendously.>>Yeah.>>Not here. So this is of not much value; any conclusion here is.>>That's true.>>Okay.>>[inaudible]>>Speech is always at the same level.>>Yeah, that's true.>>You need to augment the data, the whole input, with different levels. You have a knob on your microphone, right? You can pull it down 20 dB or crank it up. You don't know at which level you get the audio, so you need to augment it.>>This is the dynamic range
which normalization battles.>>That's true. That's true. That's true, yes. Good point. We also studied the effect of sequence length when we train the system. So basically, for every batch we are feeding one minute of speech, but it could be sixty 1-second segments or it could be two 30-second segments. I had to stop this after 53 epochs because it was taking almost a week and is very slow. The last row here is not a fair comparison, but what we found is that five seconds is actually a good number; definitely better than one second per segment. And we're surprised that the recurrent units are actually able to learn from such a long sequence: we have an 8-millisecond frame shift, so that means over 600 frames, and it's still able to learn, and learn well.>>So the sequence length is what? The training examples.>>Yeah, the number of frames in the training example, from a randomly sampled waveform, yeah. Now, the more interesting results are from the two loss
functions we had. Remember we have this speech-distortion-weighted loss with this term alpha there, and we're doing a sweep of alpha between zero and one to see where the optimal value is according to different objective measures. What we found is that, first of all, every measure has a nice shape: it starts somewhere bad, goes to somewhere optimal, and goes bad again; that happens for every measure. Second, they don't agree with each other. For the speech and noise weighting, we found that AudioMOS thinks this one, which actually has very high speech distortion here, is best; let me play an example.>>Those answers will be straightforward if you think them through carefully first. Drop five forms in the box.>>It's very aggressive [inaudible].>>Yeah. You can hear almost no noise, but the speech distortion is quite large.>>This maximizes PESQ?>>No, this maximizes the AudioMOS, the DNN-based prediction.>>[inaudible]>>Let's listen to what
PESQ thinks the best is.>>Those answers will
be straightforward if you think them
through carefully first. Drop five forms in the
box before you go out. If people were more generous, there would be no need for work then.>>It’s right here. So the
speech quality is better, but you hear this residue
of noise that happens when the speech occurs
at the same time. I'm not going to play the other two because they are worse. Cepstral distance and STOI agree with each other.>>Can I have the 0.65?>>Okay, sure.>>Those answers will
be straightforward if you think them
through carefully first. Drop five forms in the
box before you go out. If people were more generous, there would be no need to work then.>>So there's much more noise than in the previous one.>>There's not much improvement in the quality. Can you play the noisy one?>>No, it is. The speech
is not [inaudible].>>Those answers will
be straightforward if you think them
through carefully first. Drop five forms in the
box before you go out. If people were more generous, there would be no need to work then.>>If you want to compare
PESQ and audioMOS because they’re both
predicting MOS scores, you see they're agreeing more here. At this end, there's almost no suppression, so they agree more than when you get here, where there is heavy speech distortion; then they start to disagree a lot.>>The other one wasn't trained on artifacts like that.>>Yeah. So that's
something to work on. We also have this SNR weighting here. Again, a similar trend, and they don't agree with each other. Again, AudioMOS prefers very heavy suppression.>>We welcome many new students each year. George is paranoid about the future gas shortage. The carpet cleaner should include->>No, that was the speaker.>>Please shorten the
script for choice.>>PESQ and SDR here.>>We welcome many new
students each year. George is paranoid about
the future gas shortage. The carpet cleaner should
include our oriental growth. Please shorten the script for choice.>>If you just go->>We welcome many new
students each year. George is paranoid about
the future gas shortage. The carpet cleaner should
include our oriental growth. Please shorten the script for choice.>>So this is only one example; it's hard to say it works equally well on the others. But my preference is around
that 0.2, 0.3 range. All right. That’s getting to the end. We have some major findings
from all the experiments here. The first and foremost is that residual connections really, really help. If you just compare the RNN with our proposed method, the only difference is the residual connection, and it makes a vast difference. We were surprised to find that the recurrent units, in this case GRUs but probably LSTMs as well, are able to encode extremely long patterns in a high-dimensional space. Yes.>>I want to add one comment on that. You compared one and five; you didn't compare one to two, the two-second segment. So it's possible that it learns more than one second, but it doesn't actually learn all the way out to five.>>Maybe.>>Anyway.>>Yeah. But still, for one second, it's already 125 frames. So it's able to really learn very long-term temporal patterns. There was a point when I was looking at the enhanced gain function in dB and saw this constant suppression at six kilohertz. I was wondering what was going on until I saw this example in training: we have vacuum noise, and there's just a tone around six kilohertz, where speech doesn't usually occur, so it turns out the network just learns to suppress that frequency very heavily. For stationary patterns like that, there might be room to incorporate classical processing.>>As a preprocessor.>>Yeah. But whether or not you can detect that tone, that's another story.>>Jamie, come on, use the suppressors.>>Not a very good idea.>>Yeah.>>So you get that suppression of the tone even when it's not there.>>Yes, in any hertz.>>It means there's not enough variety in the noise data. It's just this one vacuum cleaner, [inaudible] vacuum cleaners.>>Yeah, data augmentation
will definitely help there. Another major finding is that with the SNR-weighted or the speech-distortion-weighted objectives, we are able to enhance the speech without broadband masks. The problem before was that it was almost acting like a VAD: it suppressed everything when there is no speech present, and when there is speech, it just opened up all the frequencies and let everything through. But with the new weighting functions we have, it's much more selective in terms of which frequencies the enhancer suppresses, and we also confirmed that by listening. So in conclusion, we propose a DNN-based online speech
enhancement system with a very compact neural network. The storage complexity is a linear function of the feature dimension squared, so by reducing that number, you can have even smaller networks. We introduced two novel learning objectives motivated by balancing speech distortion and noise suppression. Thanks to Ross, we found out a couple of days ago that one of the weighting functions, the first one, appears in a paper published like seven days ago; the other one is still new. So let's hope that by the time we write a paper we->>[inaudible]>>Yeah. We studied the impact of multiple factors associated with training a neural network for speech quality, and we explored feature normalization. But as Ivan said, we need more variation in the data to confirm that. We studied the effect of sequence length and the two objective weightings, and we compared against competitive signal processing-based and deep learning-based online systems in terms of objective speech quality measures, and ours performed better than the systems mentioned in this talk. The future directions of this work involve studying the speech quality improvement by SNR; the numbers reported before were the mean over everything, but by analyzing different SNRs we might find different patterns, because our objective is a function of SNR. We will explore more learning objectives to replace the mean squared error; there are some measures we tried before that we thought worked well in the classical processing sense, but it turned out they didn't work so well in the neural network setting, probably due to issues with training. That's another path to explore. We can also explore reducing the dimensionality of the features to reduce the model size and improve the computation speed. That's the end of my talk. I want to thank all of
you for coming here, but I want to thank in particular my mentor, Sebastian. It's been a real pleasure working with you. I want to thank Hannes for providing multiple tips on using Philly; ninety percent of the experiments shown today wouldn't have been possible without Philly. So thank you for that, and thank you to all the other mentors in the Audio and Acoustics group: Ivan, David, and Dmitri. I would like to thank Ross, Chandan, and Harry, who's not here; he's on Skype.>>He's online.>>Sorry.>>Harry is online.>>Oh, hi Harry. I would like to thank them for preparing the training and test data; on my first day here, they gave the data to me, and that made it much easier for me to work. Thank you for organizing the event, that was great. Thanks for the suggestions we had through our weekly meetings. Finally, thanks to all the interns, and I'll be missing your company when I leave. I'm staying far. Thank
you very much. All right.>>It’s time for questions.>>I have one question really. Before you started out, it looks like you decided to
adopt the same architecture as a conventional noise suppressor, in terms of masking the magnitude in the time-frequency space. Did you look at time-domain processing, where you operate directly on the waveform?>>I think we decided to take this masking approach in the first week. We are aware of time-domain enhancement, but we think that's a completely different research problem. Yeah. That's certainly a direction to look at, but we would probably need a different setup than what we have today to do it.>>One more question.
Otherwise if there’s not, let’s thank Raymond, again.>>Thank you.
