Lecture 3: Machine Learning 2 – Features, Neural Networks | Stanford CS221: AI (Autumn 2019)

Okay, welcome back everyone. This is the second lecture on machine learning. Before we get started, a couple of announcements. Homework 1, foundations, is due tomorrow at 11:00 PM. Note that it's 11:00 PM, not 11:59. I would recommend everyone try to do a test submission early. It would be unfortunate if you wait until 10:59 and realize you can't log in to the website. If that happens, please don't just bombard us with emails. You can resubmit as much as you want before the deadline, so there's no penalty for submitting something early and checking that it works. Just to remind you, you're responsible for any technical issues you encounter, so please do the test submission early; then you'll have peace of mind and can go back to finishing your homework. Okay? Homework 2, sentiment, is out. This is the homework on machine learning, and it will be due next Tuesday. And finally, there's a section this Thursday which will talk about backpropagation and nearest neighbors, and maybe an overview of scikit-learn, which might be useful for your projects. So please come to that.

Okay, so let's jump in. I'm going to spend a few minutes reviewing what we did last time, starting at the very abstract level and drilling down into the details. At the abstract level, learning is about taking a dataset and outputting a predictor f, which is able to take an input x, for example an image, and output a label y, for example whether it's a cat or a truck. If you unpack the learner, we talked about how we want to frame it as an optimization problem, which captures what we want to optimize, namely what properties the predictor should satisfy, and separately an optimization algorithm, which is how we accomplish that objective. The optimization problem we talked about last time was minimizing the training loss. In symbols, the training loss for a particular weight vector w is the average, over all examples in the training set, of the loss on that example with respect to w: TrainLoss(w) = (1/|D_train|) * sum over (x, y) in D_train of Loss(x, y, w). And we want to find the single w that minimizes the training loss, so that on average all the examples have low loss.

So looking at the loss functions: this is where it depends on what we're trying to do. If we're doing regression, then the pertinent quantity is the residual, which, remember, is the model's prediction w · Phi(x) minus the true label y. This is how much we overshoot. The loss is zero if the residual is zero, and it increases either quadratically for the square loss or linearly for the absolute deviation, depending on how much we want to penalize large deviations. For classification, or binary classification more specifically, the pertinent quantity is the margin, which is the score times the label y, where y is +1 or -1. The margin is a single number that captures how correct we are. A large margin is good; in that case we get either a zero or near-zero loss. A margin less than 0 means we're making a mistake, and the zero-one loss captures that a mistake costs a loss of 1.
But the hinge loss and the logistic loss grow linearly, because that allows us to optimize the function better. Question? A student asks what a residual would look like on this graph: would it just be a point away from the regression curve? So there are multiple graphs here. Remember, last time we looked at the residual: if you plot y against Phi(x), here's the line, here's a particular point (Phi(x), y), and the residual is basically the difference between the model's prediction and the actual point. This graph is different; it's visualizing the loss in a different space. I'll show you another graph that might make some of these things clearer in a second. So the residual won't look exactly like that on this loss curve? Well, one way to think about it is that the residual is a number. If your residual is 2, then you're here on the curve, and this is the loss that you pay; and if the residual is -2, you pay the same by symmetry. So yes, the residual is the x-axis here, and the margin is the x-axis over there. Any other questions about this?

When would you use the absolute value? The question is, when would you use the absolute deviation versus the square loss? There's a slide from the previous lecture, which I skipped over, that talks about this. Most of the time people tend to use the square loss because it's easier to optimize, but you also see the absolute deviation. The square loss penalizes large outliers a lot more, which means it has mean-like qualities, whereas the absolute deviation penalizes less, so it's more like a median; that's the intuition. But the general point is that all of these loss functions capture properties of a desired predictor. They say: hand me a predictor, and I'll assess for you how good it is. This is establishing what we want. Another comment: I'm presenting this loss minimization framework because it is so general. Basically anything you see in machine learning can be viewed as some sort of loss minimization. If you think about PCA, or deep neural networks, or different types of autoencoders, they can all be viewed as minimizing some loss function. That's why I'm keeping the framework somewhat general.

Okay, so let's go in the opposite direction of generality. Let's look at a particular example and try to put all the pieces together. Suppose we have a simple regression problem with three training examples: input (1, 0) with output 2, input (1, 0) with output 4, and input (0, 1) with output -1. How do we visualize what learning on this training set looks like? Let's try to form the training loss. The training loss, remember, is the average over the losses on the individual examples.
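(As an aside, here's a minimal sketch of these losses written as code, assuming NumPy. It just restates the formulas above; it's not code from the course.)

```python
import numpy as np

# Regression losses, written in terms of the residual = prediction - y.
def square_loss(pred, y):
    return (pred - y) ** 2

def absdev_loss(pred, y):
    return abs(pred - y)

# Binary classification losses, written in terms of the margin = score * y,
# where the label y is +1 or -1.
def zero_one_loss(score, y):
    return float(score * y <= 0)

def hinge_loss(score, y):
    return max(1 - score * y, 0)

def logistic_loss(score, y):
    return np.log(1 + np.exp(-score * y))

# The training loss averages a per-example loss over the training set.
def training_loss(w, examples, phi, loss):
    return np.mean([loss(np.dot(w, phi(x)), y) for x, y in examples])
```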
So let's look at the losses on individual examples. We're doing linear regression, x is two-dimensional, and Phi(x) = x. So we're basically trying to fit two numbers, w_1 and w_2. If you plug these values of x and y into the square loss, you get the following quantities. For the first example, the dot product between w and Phi(x) is just w_1, because x_2 is 0; you subtract 2 and square it, giving (w_1 - 2)^2. The same thing for the second point, except with a 4 instead of a 2: (w_1 - 4)^2. And for the third point, w · Phi(x) - y is now w_2 - (-1), because now x_2 is the active coordinate, so the loss is (w_2 + 1)^2. So these are the individual loss functions, each of which says what it wants out of w. This one says if w_1 is 2, that's great, I get a loss of 0. This one says if w_1 is 4, that's great, I get a loss of 0. Obviously you can't have both, and the point of the training loss is to look at the average, so that you pick one w that works, on average, for all the points.

Now, this is a function in two dimensions; it depends on w_1 and w_2. So let me try to draw it on the board to give you some intuition. I'm going to draw the w_1 and w_2 axes. The first function is (w_1 - 2)^2. What does this function want? It wants w_1 to be close to 2, and it doesn't care about w_2. I'm not really sure how to draw this function, because it really requires 3-D: you can think of a trough of parabolas coming out of the board along the line w_1 = 2, if the out-of-board direction is the loss. What about the second one, (w_1 - 4)^2? That's going to be basically the same thing, but centered around the axis w_1 = 4; again, parabolas coming out of the board. And finally, the third one is (w_2 + 1)^2, which is happiest when w_2 is -1, so it's a bunch of parabolas coming out of the board there. So you add all three functions up, and what do you get? First of all, where do you think the minimum should be? A student suggests one of the intersections of the red lines. Yeah, it's going to be some sort of intersection here. If you look at the w_2 axis, the minimum should definitely be at w_2 = -1, because the third function is the only one that cares about w_2. And for w_1, this function wants it to be 2 and this one wants it to be 4, so by symmetry the minimum is somewhere in between. You can work all of this out mathematically; I'm just giving the rough intuition. Now let me draw the level curves. They're going to look something like this, where again, if you draw it in 3-D, it's like a parabola coming out of the board, and here's the lowest point.
And as you venture away from this point, your loss is going to increase. Okay. Can I explain that middle point again, how I get w_1 = 3? One way is to add these two functions up and just plot it; the minimum turns out to be at 3. Intuitively, when you average, the square loss acts like a mean, so the minimum is going to be somewhere in between. It's also related to one of the homework problems, so hopefully you'll get a better appreciation for it there. Question: once we have the 3, how do we merge it with the -1? Do we need another addition? The 3 is regarding w_1 and the -1 is regarding w_2, so you just put them together. In this particular example they don't interact; in general, they will. Could I quickly summarize what's going on in this example? Yes: this plot shows, for every possible weight vector (w_1, w_2), a point, and the amount that the function comes out of the board is the loss. The loss function is defined on the slides, right there, and all I'm doing is plotting it. Unfortunately it's hard to draw in 3-D here, so what I'm doing is taking each of the pieces and explaining what each piece is trying to do. In general, you don't have to think about exactly how the training loss composes the individual losses; this is probably as complex an example as we'll need. But it gives you an idea of how to connect these pictures of individual parabolas with the picture of the actual training loss.

Okay. For now, let's assume you have the training loss; it's some function of the parameters. How do you optimize it? You do some sort of gradient descent. Last time we talked about vanilla gradient descent, where you initialize w to 0, compute the gradient of the entire training loss, and then update. The problem is that computing that gradient requires going through all the training examples, and if you have a million training examples, that's really slow. So instead we looked at stochastic gradient descent, which lets you pick an individual example and make a gradient step right away. Empirically, we saw in code how it can be a lot faster. Of course, there are cases where it can also be less stable, so in general there's going to be some trade-off. But by and large, stochastic gradient descent really dominates machine learning applications today, because it's the only way to scale to large datasets. Is there any other benefit of stochastic gradient descent over gradient descent, apart from being able to scale up? Another advantage, besides computation, is that your data might be coming in an online fashion, over time, and you want to update on the fly. There are cases where you don't actually have all the data at once.
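(Here's a minimal sketch of both update rules on the three-example regression problem from the board, assuming NumPy. The step size of 0.1 and the iteration counts are arbitrary choices, not from the lecture.)

```python
import numpy as np

# The three training examples from the board: phi(x) = x, square loss.
examples = [(np.array([1.0, 0.0]), 2.0),
            (np.array([1.0, 0.0]), 4.0),
            (np.array([0.0, 1.0]), -1.0)]

def gradient(w, x, y):
    # Gradient of (w.x - y)^2 with respect to w: 2 * residual * x.
    return 2 * (np.dot(w, x) - y) * x

# Batch gradient descent: one update touches every example.
w = np.zeros(2)
for t in range(100):
    g = np.mean([gradient(w, x, y) for x, y in examples], axis=0)
    w -= 0.1 * g
print(w)  # close to [3, -1], the minimizer of the average loss

# Stochastic gradient descent: one update per example, cheaper per step.
w = np.zeros(2)
for t in range(100):
    for x, y in examples:
        w -= 0.1 * gradient(w, x, y)
print(w)  # also hovers near [3, -1], typically after fewer passes
```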
Okay. So that was a quick overview of the general concepts. Now, to set the stage for this lecture, I want to ask you the following question. Can we obtain decision boundaries which are circles, using linear classifiers? Remember, a decision boundary is the line or curve that separates the region of the space that's classified positively from the region classified negatively. So we want something like this, in the space of Phi_1(x) and Phi_2(x), where the boundary is a circle: maybe these points are classified positive and these negative. Is that possible? A student suggests: if you take the square of those inputs, then you get something linear. Okay, so you're saying yes, and there's the punchline. It turns out you can actually do this, which on the surface seems surprising, because we're talking about linear classifiers. But as we'll see, it really depends on what you mean by linear classifiers, and hopefully that will become clear soon.

So we're going to start by talking about features, which will answer this question. Then we're going to shift gears a little bit and talk about neural networks, which are in some sense an automatic way to learn features. We'll show how to train neural networks using backpropagation, hopefully without tears. And then we'll talk about nearest neighbors, which is another way to get really expressive models that is much simpler in a way.

Recall that we have the score. The score is the dot product between the weight vector and the feature vector, and the score drives prediction. If you're doing regression, you just output the score as the number; if you're doing binary classification, you output the sign of the score. So far we've focused on learning, which is how you choose the weight vector based on data and how you optimize for that. Now we're going to focus on Phi(x) and talk about how you choose these features in the first place. Feature extraction is a really critical part of the machine learning pipeline that often gets neglected, because when you take a class you're told: okay, there's some feature vector, now let's focus on all these algorithms. But whenever you go and apply machine learning in the real world, feature extraction turns out to be one of the main bottlenecks. Neural nets can mitigate this to some extent, but they don't make feature extraction obsolete. Recall that a feature extractor takes an input, such as this string, and outputs a set of properties, named feature values, that are useful for prediction. Last time we didn't really say much about this; we just waved our hands and said, okay, here are some features.
So in general, how do you approach this problem? Which features do you include? Do you just start making them up, and how many features do you add? We need a better organizational principle. In general, feature engineering is somewhat of an art, so I'm not going to give you a recipe, but at least a framework for thinking about features. The first notion is a feature template, and a feature template is, informally, just a group of features that are all computed in the same way. This is a somewhat pedantic terminology point, but one I want you all to be aware of. A feature template is basically a feature name with holes. For example, "length greater than ___". Remember, the concrete feature was "length greater than 10"; now we say "length greater than ___", where the blank can be replaced with 10, 9, 8, or any number. It's a template that gives rise to multiple features. "Last three characters equals ___", "contains character ___": these are all examples of feature templates. So when you describe your features in your project, think about grouping them in terms of these blanks. Another example is "pixel intensity of position ___, ___". Even if you have what you consider a raw input, like an image, there's still implicitly a way to think of it as a feature template, one that gives rise to a number of features equal to the number of pixels in the image. And this is useful because maybe your input isn't just an image; maybe it's an image plus some metadata. Having this language for describing all the features in a unified way is really important for clarity.

Okay. So as I alluded to, each feature template maps to a set of features. By writing "last three characters equals ___", I'm implicitly saying: I'm going to define a feature for each value of the blank, and that feature is associated with the value you get by naturally evaluating it on the input. So here all of these are 0, except "ends with .com", which is 1. In general, each feature template might give rise to many, many features: the number of possible three-character endings is the number of characters cubed, which is a large number. So one question is, how do we represent this? As a vector. Yes, as a vector, good answer. Mathematically, it's really useful to think about this as a d-dimensional vector: just d numbers laid out. That's mathematically convenient, but when you go to actually implement this stuff, you might not represent things that way. In particular, what are the ways you can represent a vector? You can represent it as an array, which is just a list of numbers, but this is inefficient if you have a huge number of features. In the case where you have sparse features, which means only very few of the feature values are non-zero, you're better off representing it as a map, or in Python a dictionary, where the feature name is the key and the value is the value of that feature.
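(A minimal sketch of what this looks like in code; the specific feature templates below are modeled on the examples above, and the names are just illustrative.)

```python
def extract_features(x):
    """Map a string input to a sparse feature vector, represented
    as a dict from feature name to value (zero features are absent)."""
    phi = {}
    # Feature template: length greater than ___
    for threshold in [5, 10, 15]:
        if len(x) > threshold:
            phi[f"length>{threshold}"] = 1
    # Feature template: last three characters equals ___
    phi[f"last3={x[-3:]}"] = 1
    # Feature template: contains character ___
    for c in set(x):
        phi[f"contains_{c}"] = 1
    return phi

def sparse_dot(w, phi):
    # The dot product only has to touch the nonzero features.
    return sum(w.get(name, 0) * value for name, value in phi.items())

print(extract_features("abc@gmail.com"))
```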
Homework 2 will basically work in this sparse feature framework. And just a note: especially in NLP, where we have discrete objects, it's traditionally been common to use these sparse feature maps. One thing that has happened with the rise of neural networks is that you often take your inputs and embed them into some fixed-dimensional vector space, so dense feature representations have become more dominant. But sparse features are still a good way to go if you want to use linear classifiers, so it's important to understand this. The point is that instead of storing a possibly huge number of features, you just store the keys and values of the non-zero ones.

So that was feature templates; the overall point is that they're an organizational principle. Now let's switch gears a little bit. Which features or feature templates should you actually write down? To get at that, I want to introduce another notion that's pretty important, especially in the theory of machine learning: the hypothesis class. Remember, for a particular weight vector, the predictor is a function that maps inputs to some sort of score or prediction. The hypothesis class is just the set of all predictors you can get by varying the weight vector. Let me give you an example; we'll come back to this slide. Suppose you're doing linear regression in one dimension: here is x, and here is y. If your feature map is just the identity, Phi(x) = x, then this notation means the set of all linear functions of this form. The set of functions you get, you can visualize as follows: for every possible value of w_1 you have a slope, and you also have the zero function, and they all go through the origin. So your hypothesis class F_1 is essentially all lines through the origin. The thing to keep in mind is that when you write down a feature vector, you're implicitly committing yourself to considering all possible predictors defined by that feature map.

Here's another example. Suppose I define the feature map to be Phi(x) = (x, x^2). Now what are the possible functions I can get? Does anyone want to read it off the slide? It's going to be all quadratic functions. In particular, because there's no bias term, it's all quadratic functions that go through the origin. Let me draw that: they look like parabolas through the origin, possibly upside down; I'm not going to draw all of them. In particular, this class also includes the linear functions, because I can always set w_2 = 0 and vary w_1, which means I get all the linear functions too. So F_2, as a set of functions, is a larger set than F_1: it's more expressive. That's what we mean by expressive; it can represent more things.
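(As a sketch of this idea in code: the feature maps below are the two from the slide, and each weight vector picks out one member of the induced hypothesis class.)

```python
# Two feature maps and the hypothesis classes they induce.
phi1 = lambda x: [x]          # F1: all lines through the origin
phi2 = lambda x: [x, x ** 2]  # F2: all quadratics through the origin

def predictor(w, phi):
    # Each weight vector w picks out one function f(x) = w . phi(x).
    return lambda x: sum(wi * fi for wi, fi in zip(w, phi(x)))

f_line = predictor([2.0], phi1)            # f(x) = 2x, a member of F1
f_quad = predictor([2.0, -0.5], phi2)      # f(x) = 2x - 0.5x^2, in F2
f_line_in_F2 = predictor([2.0, 0.0], phi2) # w2 = 0 recovers F1 inside F2

print(f_line(3), f_quad(3), f_line_in_F2(3))  # 6.0, 1.5, 6.0
```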
So for every feature vector, you should also think about the set of functions you can get with that feature map. Is there a question? Yes: are the more expressive sets harder to optimize over? The short answer is, not necessarily. Sure, you have more features, so each step is more expensive; at that level, yes. But the difficulty of optimization depends on a number of different factors, and sometimes adding more features can make optimization easier, because it's easier to fit the training data.

Okay, so now let's go back to this picture. On the board are concrete examples of hypothesis classes. Now, think of this big blob as the set of all predictors: any predictor in your wildest dreams is in this set. Whenever you go and define a feature map, that carves out a much smaller set of functions. And then what is learning doing? Learning is choosing a particular element of that function family based on the data. So this picture shows the full pipeline of machine learning: you first declare, structurally, a set of functions you're interested in, and then you say, okay, based on data, let me search through that set and find the one that's best for me. And there are two places where things can go wrong. In feature extraction, maybe you didn't include enough features, so your purple set is too small; then no matter how much learning you do, you're just not going to get good accuracy. Conversely, even if you define a nice hypothesis class, if you don't optimize properly, you're not going to find the element of that hypothesis class that fulfills your goals.

Question: the feature function Phi is itself a function computed from the input, so how can you assume that the weights will be able to compute the function you actually want? So: you're defining a function Phi, which is fixed, and then learning sets the weights, and together they jointly specify a particular function or predictor. The point is that if you don't choose Phi appropriately, you're limiting the space of functions you'll be able to predict with. But doesn't learning compensate? Shouldn't the model be able to learn whatever Phi you would have picked? The short answer is no. The Phi really is a bottleneck here. For example, if you define Phi(x) = x, then linear functions are all you're going to get. If your data moves around in a sinusoidal way, you're just going to fit a line through it and get horrible accuracy, and no amount of learning can fix that.
The only way to fix that is by changing your feature representation. Does that assume the model is linear, though? Yes, all of this assumes we're talking about linear predictors. But the same general idea applies to any function family, including neural nets; the equivalent there would be not just the feature map but also the neural network architecture, which is a constraint on what you can express. If you only have a two-layer neural network of a fixed size, there are just some things you can't express. Another question, following on from that: why bother with feature engineering at all, rather than feeding in the raw data to a neural net, which is still built from linear classifiers but has enough complexity to capture non-linear behavior? So the question is: haven't neural nets basically solved feature engineering? To some extent, the amount of feature engineering you have to do today is much less. One thing that's still important to think about, though, is what sources of information you want to use for prediction. For example, suppose you want to predict some property of a movie review. The first-order bits are: what even goes into the input? Does the text go into it? Do you have metadata? Do you have star ratings? There's really no such thing as "raw" input, because there's always some code that takes the world and distills it down into something that fits in memory; you can think of that as feature extraction. One last question and then I'll go on: what's the problem with too many features? Don't you want your hypothesis class to be big, or is it an overfitting thing? Yes, the question is, why not just make Phi as large as possible and throw in all the features? Overfitting is one of the main concerns there, which we'll come back to in the next lecture. Okay, great questions.

Let's actually skip over this next part; there's another type of feature function you can define, but in the interest of time, I'm going to skip over that. Okay, so now let's come back to this question of linearity. I keep saying "linear predictors"; what exactly is linear? Remember, prediction is driven by the score. So here's a question: is the score linear in w? Yes, because a linear function is basically some kind of weighted combination of its inputs. Is it linear in Phi(x)? By symmetry, it should be, because it's just a dot product. Is it linear in x? No. In fact, the question doesn't even make sense: think about x. Remember, x was a string; it's not even a number. When there's a type error like that, you know the answer should be no.
Okay, so here's the cool thing: these predictors can be expressive non-linear functions and decision boundaries of x, in the case where x actually is a real vector, but the score is always a linear function of w. This is cool because there are two perspectives. From the point of view of actually doing prediction, you're thinking about how the function operates on x, and you can get all sorts of crazy functions coming out. We just looked at quadratic functions, which are clearly non-linear, but you can do much more. From the point of view of learning, though, it doesn't care about x; all it sees is Phi(x). Learning asks: how does this function depend on w? Because that's what it's tuning. And from that perspective, it's a linear function of w, and for reasons I'm not going to go into, these functions permit efficient learning because the loss function becomes convex. That's all I'll say about that.

One cool way to visualize what's going on here is to go back to our circle example. Remember, we have a two-dimensional classification problem where the true decision boundary is, let's say, a circle. How do we fit that, and what does it mean for something linear, when "linear" sounds like it should be a line? Here's a nice graphic. The points inside the circle can't be separated from the points outside by a line in the original space. But the feature map lifts these points into a higher-dimensional space; now I have three features. And in that higher-dimensional space, things are linear: I can slice it with a knife. The cut in the high-dimensional space induces, in the lower-dimensional space, exactly this circle. So hopefully that was a nice visualization of how you can get non-linear functions out of essentially linear machinery. The next time someone says linear classifiers are really limited and you really need neural nets, that's technically false, because you can get really expressive models out of linear models. The point of neural networks is not necessarily that they're more expressive; they can be, but they also have other advantages, for example the inductive bias that comes with the architecture, and the fact that they're more efficient as you go to more expressive models, and so on.
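(Here's a minimal sketch of that lifting trick, assuming NumPy. This particular feature map and weight vector are one choice that realizes a unit-circle boundary, not necessarily the exact ones on the slide.)

```python
import numpy as np

# A "circle" classifier: positive iff x1^2 + x2^2 < 1.
def phi(x):
    # Lift the 2-D point into 3-D; the trailing 1 acts as a bias feature.
    return np.array([x[0] ** 2, x[1] ** 2, 1.0])

# In the lifted space, the boundary x1^2 + x2^2 = 1 is the plane
# w . phi(x) = 0 with these weights: a linear cut.
w = np.array([-1.0, -1.0, 1.0])

def predict(x):
    return np.sign(np.dot(w, phi(x)))

print(predict([0.5, 0.5]))  # +1: inside the circle
print(predict([1.0, 1.0]))  # -1: outside the circle
```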
Okay, so to wrap all this up, I want to do a simple exercise. Here's a task: imagine in your final project you want to predict whether, given two consecutive messages in some forum or chat, the second one is a response to the first. So it's binary classification: the input is two messages, and you're asked to predict whether the second is a response to the first. We're going to go through the exercise of coming up with features, or feature templates, that might pick out useful properties of x, and we'll assume we're dealing with linear predictors.

Okay, so what are some features that might be useful? Let's start with a few. How about the time elapsed between the two messages; is that a useful feature or not? How many of you say yes? Okay, so this information is definitely good. One subtle point, though, is that time elapsed is a single number, and this number goes into the score in a linear fashion. What does that mean? It means that if I double the time, its contribution to the score multiplies by 2. So it's like saying that as the time increases, it becomes linearly more likely that this is, let's say, a response or not a response. That's maybe not what you want: if the time elapsed is a year, it completely dominates the score function, making it look way more likely to be a response than if it were one minute, which is not what you want. Question: can you normalize it? You have to be careful with normalization. If you normalize over a span of, say, a year, then there's no difference between five seconds and one minute, because everything gets squashed down toward 0. One way to approach this is to discretize the feature. A common trick, when you have a numerical value that you want to treat in a sensitive way, is to break it up into pieces. The feature template would look something like "time elapsed is between ___ and ___": is it between zero and five seconds, between five seconds and a minute, between a minute and an hour, between an hour and a year? And after that, it doesn't matter. This injects domain knowledge about which distinctions to look out for: the difference between a year and a year plus two seconds really doesn't matter, whereas the difference between one second and five seconds might be significant. This is all a long way of saying that whether you're using linear classifiers or even neural networks, it's really important to think about how your raw features enter the system, and to ask: if I scale this feature up, does the prediction change in a way I expect? Question: if we adopt that feature template, what prevents us from having a bucket for 30 to 40 seconds, and so on; what prevents us from having buckets covering the entire time axis? So the question is: if you allow every possible range, isn't that an infinite number of features? There are two answers. One is that even if you did that, you might still be okay. Think of it as discretizing the time-elapsed axis, where for every bucket you have a feature.
It's true that you'd have an infinite number of features, but at some point you might just cut it off. And even if you didn't cut it off, with a sparse feature representation you don't need a pre-set maximum, because most of these features are going to be zero; the chance of some data point being, say, 10 years is essentially nil. Another answer is that, in general, when you have features spanning multiple timescales, you want to space the buckets logarithmically: one to two, two to four, four to eight, and so on, so that you have sensitivity for short intervals while still covering a large range of magnitudes. Question in the back: is it possible to learn how to discretize the features? There are definitely more automatic things you can do besides specifying the spans by hand. At some level, though, you have to input the value in some form: whether you feed in x versus, say, log of x, those choices can often make a big difference. But if you use more expressive models like neural networks, you can mitigate some of this. Another question: I see the value in changing time elapsed from a number to a Boolean for whether it falls in a range; when would you want to retain the numerical value and not discretize? Good question. Essentially, when you expect the scale of that feature to really matter. If you think something behaves linearly, you want to preserve the linearity; or if you think it behaves quadratically, you keep the feature but also add a squared term. Okay, I want to move on; these are all good questions, happy to discuss more offline.

Some other features might include "first message contains ___", where the blank is a string. Maybe things like question marks are indicative of the second message being a response. "Second message contains ___." "Both messages contain ___": there are cases where it's not the presence or absence of particular words in the individual messages that matters, but the fact that they share a common word. And here's another feature: the two messages have some number of words in common. This one is interesting. For the "both messages contain ___" template (and when I say feature, I actually mean feature template), there are many, many features, one for each possible word, which again leads to sparsity, and you might not have enough data to fit all of them. Whereas the common-word count is very compact: I just have to look at the amount of overlap. The two messages might contain a word I've never seen before, but I can still recognize that it's the same word and pick up on that pattern.
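(Putting a few of these templates together, here's a hedged sketch of a feature extractor for this task; all the feature names and bucket choices are illustrative, not from the homework.)

```python
import math

def extract_features(message1, message2, seconds_elapsed):
    """Sparse features for 'is message2 a response to message1?'"""
    phi = {}
    # Time elapsed, discretized into logarithmic buckets: [1,2), [2,4), ...
    bucket = int(math.log2(max(seconds_elapsed, 1)))
    phi[f"time_in_[{2 ** bucket},{2 ** (bucket + 1)})"] = 1
    # Feature template: second message contains ___ (one feature per word).
    for word in message2.split():
        phi[f"second_contains_{word}"] = 1
    # Compact alternative: one numeric feature counting shared words,
    # which generalizes even to words never seen during training.
    common = set(message1.split()) & set(message2.split())
    phi["num_common_words"] = len(common)
    return phi

print(extract_features("where is gates", "gates is next door", 42))
```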
So there's quite a bit you can do to play around with features that capture your intuitions about what might be relevant to your task. Question: when we have a lot of these sparse features, is that a point where we'd want to do dimensionality reduction, to knock out some of those many, many features? Not necessarily. In terms of computation, having sparse features doesn't mean things will be really slow, because there are efficient ways of representing sparse features. And in terms of expressivity, in a lot of NLP applications you actually do want a lot of features; you can handle a lot more features than you might think, because the first-order bit is just being expressive enough to even fit the data. Okay, let me move on, since I'm running short on time.

Summary so far: we're looking at features. We can define feature templates, which organize features in a meaningful way. And we talked about hypothesis classes, which are defined by the features and determine what is possible to get out of learning. All of this was in the context of linear classifiers, which, incidentally, can actually produce nice non-linear decision boundaries. At this point you have enough tools to do a lot. But in the next section I want to talk about neural networks, which are even more expressive models and can be more powerful. One thing I often recommend: when you're given a problem, always try the simplest thing first. Try a linear classifier and see where it gets you, because sometimes you'd be surprised how far you can get with linear classifiers; then increase the complexity as you need it. There's sometimes a temptation to bring out the fancy new hammer, but sometimes keeping it simple is really, really good.

Okay, so: neural nets. There are a couple of ways of motivating these. One motivation comes from the brain. I'm going to use a slightly different motivation, which comes from the idea of decomposing a problem into parts. This is a somewhat contrived example, but hopefully it will build up intuition for what's going on in a neural network. Suppose I'm building some sort of system to detect whether two cars are going to collide. The way it works is: I have one car at position x_1 driving one way, and another car at position x_2 driving the other way, and I want to determine whether it's safe, which is positive, or whether they're going to collide. And let's suppose, for simplicity, that the true function is: y = +1 if |x_1 - x_2| >= 1, and -1 otherwise. It's just measuring whether the cars are at least distance 1 apart.
Now, this is a little like what we did in the last lecture, where we posited a true function and checked whether learning could recover it. In practice, obviously, we don't know the true function, but this is for pedagogical purposes. So, just to make sure we understand what function we're talking about: if x_1 = 1 and x_2 = 3, like on the board, then it's +1. That's like driving in the US. Swapped around is like driving in the UK, and that's fine too. But if the cars are too close together, that's bad news.

All right, so let's think about decomposing the problem. This could be a kind of complicated function, but let's try to break it down into linear pieces, because at the end of the day, neural networks are just a bunch of linear functions stitched together with some non-linearities. So one subproblem is detecting whether car 1 is to the far right of car 2: h_1 = 1[x_1 - x_2 >= 1]. Another subproblem is testing whether car 2 is to the far right of car 1: h_2 = 1[x_2 - x_1 >= 1]. Then you put these together by saying: if at least one of them is 1, I predict safe; otherwise I predict not safe. Here are the concrete examples. For (1, 3), car 2 is far to the right of car 1, so h_2 = 1; you add them up, take the sign, and that's +1. In the opposite direction it's still fine. And in this last case, both h_1 and h_2 are 0, so that's bad news. So this just takes the expression for the true function and rewrites it in a more modular way, with different pieces corresponding to different computations.

Now, we could just write this down to solve the problem, but that's because we already knew the right answer. Suppose we didn't know the true function and we just had data; can we learn these functions automatically? What I'm going to do is define a feature vector Phi(x) = (1, x_1, x_2), and rewrite the intermediate subproblem as follows: the test "x_1 - x_2 >= 1" is represented as v_1 · Phi(x) >= 0, where v_1 = (-1, +1, -1). Pause for a second; you can verify that v_1 · Phi(x) = -1 + x_1 - x_2, which is exactly x_1 - x_2 >= 1. So this is just another way of writing what we wanted, in terms of a dot product, and you can see how this is moving toward something more general. Why is the 1 there? The question is, why is there this 1 in the feature vector? It's what's typically known as a bias term, which allows you to threshold not just at 0 but at any arbitrary number. In the linear classifiers I've talked about, I've kind of swept this under the rug; generally you always have a bias term that modulates how likely you are to predict +1 versus -1. You can do the same thing for h_2: it's identical, with the roles of x_1 and x_2 switched, so v_2 = (-1, -1, +1).
And the final sign prediction you can write the same way, with weights w on h_1 and h_2. So now here's the punchline: for a neural network, we're just going to leave v_1, v_2, and w as unknown quantities that we try to fit through training. We motivated this by saying that in this case there is some choice of v_1, v_2, w that works; but now we're generalizing. If we didn't know these quantities, we could leave them as variables and still fit these parameters. Before, we were just tuning w; now we're tuning both V and w. V specifies the choice of the hidden subproblems we're interested in, and w governs how we take the results of those subproblems and come to a final prediction.

There's one problem here, which is that if you look at the gradient of h_1 with respect to v_1, it happens to be 0. If the horizontal axis is v_1 · Phi(x) and the vertical axis is h_1, the function looks like the step function, because it's the indicator of some quantity being at least 0: it's 1 over here and 0 over here. And remember, we don't like zero gradients, because SGD doesn't work. The solution is to take some sandpaper and sand this function down to smooth it out, so you get something differentiable. The logistic function is a smoothed-out version of the step: it never quite hits 1 or 0, but gets extremely close, and it goes up in the middle. You can think of it as a differentiable, or rather a smooth, version of the step function. It behaves and looks like the step function, serving the same intuition of testing whether some quantity is greater than 0, but it has no zero gradients anywhere. And you can double-check: if you take the derivative, it has this really nice form, the value of the function times 1 minus the value of the function. Since the function never hits 0 or 1, this quantity never hits 0.

So now we can define neural nets, in contrast to linear functions. Remember, a linear function can be visualized as: inputs go in, each input gets weighted by some component of w, and you get the score. A neural network with one hidden layer and two hidden units looks like this: you have these intermediate hidden units h_j = sigma(v_j · Phi(x)), the sigmoid (or, to be concrete, logistic) function applied to the weight vector v_j dotted with Phi(x). So h_1 takes the input, multiplies it by a vector to get some number, and sends it through the logistic function to get another number; then finally you take the outputs h_1 and h_2, take the dot product with w, and get the final score. Again, the intuition is that neural nets are trying to break the problem down into a set of subproblems, where the subproblems are the results of these intermediate computations.
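(A minimal sketch of this one-hidden-layer network, assuming NumPy, using the hand-constructed v_1, v_2, and w from the collision example.)

```python
import numpy as np

def logistic(z):
    # Smoothed step function: strictly between 0 and 1, never zero gradient.
    return 1 / (1 + np.exp(-z))

def phi(x1, x2):
    return np.array([1.0, x1, x2])  # leading 1 is the bias feature

# Hand-constructed weights from the collision example:
# h1 tests x1 - x2 >= 1, h2 tests x2 - x1 >= 1.
V = np.array([[-1.0,  1.0, -1.0],
              [-1.0, -1.0,  1.0]])
w = np.array([1.0, 1.0])

def score(x1, x2):
    h = logistic(V.dot(phi(x1, x2)))  # hidden layer: two mini classifiers
    return w.dot(h)                   # final layer combines their outputs

# The two safe configurations get a higher score than the colliding one.
print(score(1, 3), score(3, 1), score(1, 0.5))
```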
You can think of h_1 as the output of a mini linear classifier, and h_2 as the output of another mini linear classifier, and then you take those outputs and stick them through yet another linear classifier to get the score. This is what I mean when I say that, at the end of the day, it's linear classifiers packaged up and strung together, and the expressive power comes from the composition. Question: how do you get h_j when there are multiple components of Phi? How do you combine them? There's only one Phi(x): it's a three-dimensional vector, and this is its first component. Another question: aren't the h's effectively features, some function of the original features you put in, making new features that are better than the ones before? Yes, and that's exactly my next point. One way to think about it is that the h_j's are features which are learned automatically from the data, as opposed to the fixed set of features Phi. At the last layer, all w sees is these h's coming through, and they look like features. For deeper neural networks, you just keep stacking this: the output of one layer of classifiers becomes the features for the next layer, and so on. The intuition for deeper networks is that as you proceed, you can derive more and more abstract features. For example, with images: you start with pixels, then you find edges, then object parts, and then things that are closer to the actual classification problem. Question: what if h_1 and h_2 end up with the exact same values? Do you have to add something to break that? That's a good question: why don't h_1 and h_2 end up in the same place, by symmetry? If you're not careful, that will happen. If you initialize all your weights to 0, or initialize them identically, they'll move in lockstep. So what's typically done is to initialize randomly, which breaks the symmetry; and then, because of the joint learning, the network tends to learn subproblems that are complementary. Final question: how do you choose the sigma function? In general, these non-linear functions are called activation functions, and the important thing is that they're non-linear. I chose this particular logistic function because it's the classic one for neural nets, and it looks like the step function, which takes a score and outputs something like a classification result. I should responsibly note that these are maybe less in style than they used to be.
The cool thing to do now is to use what's called a ReLU, or rectified linear unit, which looks like this. You might ask, why this one? Well, there's no one reason, but this function has less of the gradients-going-to-zero problem, and it's also simpler because it doesn't require exponentials. I'll just leave it at that. The benefit of the logistic function here is pedagogical, and it's a little bit of a throwback too. If you read the notes and the lecture slides, there are more details on why you would choose one versus another.

Okay, so now we're ready to do neural net learning. Remember, we have this optimization problem: the training loss now depends on both V and w, and the training loss, remember, is the average over the losses on individual examples. The loss on an individual example, let's say we're doing regression, is the squared difference between y and the function value, and the function value is the sum over hidden units of the last-layer weight times the activation of that hidden unit. That's the basic idea. And now all I have to do is compute this gradient. You look at this and say, okay, with enough scratch paper you could probably work it out. I'm going to show you a different way to do it, without grinding through the chain rule, based on the computation graph. This will give you additional insight into the structure of the computation, and visualize what a gradient actually means in some sense. It also happens that computation graphs are really at the foundation of all the modern deep learning frameworks, like TensorFlow and PyTorch, so this is a real thing. It turns out that even when we've taught this, many people still prefer to grind out the math. I can't really tell why, except maybe it's more familiar, so I'd encourage everyone to at least try thinking about the computation graph as a way to understand your gradients, even if initially it isn't faster. It's not that you always have to draw a graph to compute gradients, but doing it a few times can give you insight you wouldn't otherwise get.

Okay, so here we go. Functions: we can think about them as just boxes. A box has some inputs going in, and you get some output; that's all a function is. And partial derivatives, or gradients, ask the following question: how much does the output change if an input changes a little bit? For example, take this function that computes 2 * in1 + in2 * in3. You take input in1 and add a little epsilon, like 0.001, read out the output, and ask: what happened to the output? In this case, the output changed by 2 * epsilon, additively. So what do you conclude the partial derivative of this function with respect to in1 is? 2, right. Because the gradient is kind of the amplification: if I put in epsilon and get out 2 * epsilon, the gradient, or partial derivative, is 2.
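(You can check this numerically; here's a tiny sketch of the perturbation argument in code.)

```python
def f(in1, in2, in3):
    return 2 * in1 + in2 * in3

eps = 1e-6
base = f(1.0, 2.0, 3.0)

# Perturb in1: the output moves by 2 * eps, so the partial is 2.
print((f(1.0 + eps, 2.0, 3.0) - base) / eps)  # ~2.0

# Perturb in2: the output moves by in3 * eps, so the partial is in3 = 3.
print((f(1.0, 2.0 + eps, 3.0) - base) / eps)  # ~3.0
```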
So if I add epsilon to in2, then simple algebra shows I get a change of in3 times epsilon in the output, so what's the partial with respect to in2? In3, right? Okay, good. So you could have done basic calculus and gotten that, but I really want to stress this interpretation of perturbing inputs and watching the output, because I think that's a useful interpretation. Okay, so now, not all functions are made out of building blocks, but most of the functions we're interested in in this class are going to be made out of these five pieces. And each of these pieces is a function: it has inputs a and b, you pump those in and you get some output. The five pieces are plus, minus, times, max, and the logistic function. So on these edges, I'm going to write down in green the partial derivative with respect to the input going into that function. So let's do this. If I have the function a plus b, the partial derivative with respect to a is 1, and the partial derivative with respect to b is 1. If you have a minus b, then it's 1 and -1. If you have a times b, then the partials are b and a. Everyone follow so far? Okay, so max, what is this? This is maybe a little bit trickier. Remember, we saw the max last time in the context of the hinge loss, where you have the max of two functions. Okay, let me do it this way; ignore that thing on the board. I just have max(a, b). So suppose a is 7 and b is 3, so a is greater than b. Now, if I change a by a little bit, that change is going to be reflected in the output of the max function, right? Because the perturbation is small, and it doesn't change which argument wins. And in this case, if I change b by a little bit, does the output change? No, because whether b is 3.1 or 2.9, the output doesn't change, so the gradient is going to be 0 there. So the max function's partial derivatives look like this: with respect to a, it's 1 if a is greater than b and 0 if a is less than b, and conversely over here, with respect to b, it's 1 if b is greater than a and 0 if b is less than a. So the partials of max are always 1 or 0, depending on that condition. Okay, and then the logistic function. This is just a fact you can derive in your free time, but I had it on a previous slide: the derivative is the logistic function times 1 minus the logistic function. Okay, so now you have these building blocks, and you can compose them and build castles out of them. It turns out basically all the functions that you see in deep learning are built out of these blocks. Um, and how do you compose things?
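To keep the five building blocks straight, here is a small sketch, my own Python encoding rather than anything from the lecture, pairing each two-input block with its local partial derivatives with respect to its inputs a and b:

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

# Each entry: (forward function, partial wrt a, partial wrt b).
BLOCKS = {
    "add": (lambda a, b: a + b,     lambda a, b: 1.0, lambda a, b: 1.0),
    "sub": (lambda a, b: a - b,     lambda a, b: 1.0, lambda a, b: -1.0),
    "mul": (lambda a, b: a * b,     lambda a, b: b,   lambda a, b: a),
    # max routes the gradient entirely to whichever input is larger.
    "max": (lambda a, b: max(a, b), lambda a, b: 1.0 if a > b else 0.0,
                                    lambda a, b: 1.0 if b > a else 0.0),
}

# The logistic block takes one input; its derivative reuses its own output.
def d_logistic(z):
    s = logistic(z)
    return s * (1 - s)

fwd, da, db = BLOCKS["max"]
print(fwd(7, 3), da(7, 3), db(7, 3))  # 7, 1.0, 0.0: matches the a=7, b=3 example
print(d_logistic(0.0))                # 0.25, the sigmoid's steepest slope
```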
Um, there's this nice thing called the chain rule, which says that if you think about an input going into one function, and that function's output going into another function, then the partial derivative of the final output with respect to the input is just the product of the partial derivatives. This is just the chain rule, right? And you can think about it as amplification: if this function amplifies by 2 and that function amplifies by 5, then the total amplification is going to be 2 times 5, okay? All right, so now let's take an example. We're going to do binary classification with the hinge loss, just as a warm-up. I'm going to draw the computation graph and then compute the partial derivative with respect to w. So what is this graph? I have w dot Phi(x), that's the score; times y, that's the margin; 1 minus the margin; and the max of 1 minus the margin and 0 is the loss, okay? So now for every edge I can write down the partial derivative. Here, remember, the partial derivative for the max edge is 1 if the left argument is bigger than the right, so here it's the indicator of 1 minus margin greater than 0. For the minus, it's -1. For this times, it's going to be whatever is on the other side, and for that times, likewise. And by the chain rule, if you multiply what's on all the edges, you get the gradient of the loss with respect to w. Okay. So this is a graphical way of doing what I did last time: if the margin is greater than 1, then everything is 0, and if the margin is less than 1, then I perform that particular update. Okay? So in the interest of time, I'm not going to do it for the simple neural network; I will do that in section. But at a high level, you basically do the same thing: you multiply all the edges along the path and you get the partial derivatives. Okay. So now we've done everything manually. I want to systematize this and talk about an algorithm called back-propagation, which allows you to compute gradients for an arbitrary computation graph. That means for any function you can build out of these building blocks, you can just get the derivatives. So one nice thing about packages like PyTorch or TensorFlow is that you actually don't have to compute the derivatives on your own. It used to be the case, before these, that people would have to implement the derivatives by hand, which is really tedious and error-prone. Part of why it's been so easy to develop new models is that all of that is done for you automatically. Okay. So back-propagation is gonna compute two types of values: a forward value and a backward value. So f_i, for every node i, is simply the value of that node's subexpression in the expression tree. And the backward value, g_i, is the partial derivative of the output with respect to the value at that node. Okay? So for example, f_i here is gonna be w_1 times sigma(v_1 dot Phi(x)), and g_i of that node is going to be the product of all the edges above it: basically, how much does perturbing this node change the output at the very top. Okay. So the algorithm itself is quite straightforward.
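As a sanity check on the hinge-loss graph, here is a sketch that multiplies the edge partials together, exactly as the chain rule prescribes. This is Python with NumPy; the function and variable names are mine, and the update it computes matches the indicator-times-edges product described above.

```python
import numpy as np

def hinge_loss_grad(w, phi, y):
    """Gradient of max(1 - (w . phi) * y, 0) with respect to w,
    obtained by multiplying the partials along the path down to w:
      max edge:     1 if 1 - margin > 0 else 0
      minus edge:   -1
      times-y edge: y
      times-phi edge: phi
    """
    margin = (w @ phi) * y
    max_edge = 1.0 if 1 - margin > 0 else 0.0
    return max_edge * (-1.0) * y * phi

w = np.array([0.5, -1.0])
phi = np.array([1.0, 2.0])
print(hinge_loss_grad(w, phi, y=1))   # margin < 1, so the gradient is -y * phi
print(hinge_loss_grad(w, phi, y=-1))  # margin > 1, so the gradient is zero
```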
There is a forward pass, which computes all the f_i's, and then a backward pass, which computes all the g_i's. In the forward pass, you start from the leaves and go to the root, and you compute each of these values recursively, where each computation depends on the node's subexpressions. And in the backward pass, you similarly have a recurrence: the g_i of a particular node is equal to the g of its parent times whatever is on the edge between them. Okay? So you take a forward pass and fill in all the f_i's, and then you take a backward pass and fill in all the g_i's that you care about. Okay? All right. Section will go through this in detail; I realize this might have been a little bit quick. One quick note about optimization: you now have all the tools to run SGD, which doesn't really care what the function is. It's just a function; you can compute the gradient, and that's all you need. But one important thing to note is that just because you can compute a gradient doesn't mean you can optimize the function. For a linear model, it turns out that if you define these loss functions on top, you get convex functions. Convex functions are functions you could hold in your hand, like a bowl, and they have one global minimum. So if you think about SGD as going downhill, you converge to the global minimum and you solve the problem. Whereas for neural nets, it turns out that the loss functions are non-convex, which means that if you try to go downhill, you might get stuck in local optima. And in general, optimization of neural nets is hard. In practice, people somehow manage to do it anyway and it works; there's a gap between theory and practice, which is an active area of research. Okay. So in one minute, I have to do nearest neighbors. [BACKGROUND] It will actually be fine, because nearest neighbors is really simple, so you can do it in one minute. So here it goes. Let's throw away everything we knew about linear classifiers and neural nets. Here's the algorithm. Training: you store your training examples. That's it. And then, to predict on a particular input x prime, you go through all the training examples and find the one whose input is closest to your input x prime, and you return its y. Okay? The intuition here is that similar inputs should get similar outputs. Okay? So here's a pictorial example. Suppose we're in two dimensions and you're doing classification, and [NOISE] you have a plus over here, and you have [NOISE] a minus here. Okay? So if you ask what label is assigned to that point, it should be plus, because the plus is closer. This region should be minus; this should be plus. And [NOISE] one kind of cool thing is: where's the decision boundary? If you look at the points that are equidistant from these two and draw the perpendicular, that's the decision boundary there, and the same thing over here. So you have basically carved out this region where this [NOISE] is minus and [NOISE] everything here is [NOISE] plus. Okay?
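Since the one-minute description goes by fast, here is a minimal sketch of the nearest-neighbor predictor in Python. Euclidean distance is my assumption; the lecture only says "closest", and other distance functions work the same way.

```python
import numpy as np

class NearestNeighbor:
    def fit(self, X, y):
        # "Training" is just storing the examples.
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, x_prime):
        # Find the stored input closest to x_prime and return its label.
        dists = np.linalg.norm(self.X - np.asarray(x_prime, dtype=float), axis=1)
        return self.y[np.argmin(dists)]

nn = NearestNeighbor().fit([[0, 0], [2, 2]], [+1, -1])
print(nn.predict([0.4, 0.1]))  # +1: the point at the origin is closer
```

Prediction scans every stored example, which is why the method is simple to train but expensive at prediction time, as noted below.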
[NOISE] Um, in general, what I've drawn is an instance of a Voronoi diagram: given a bunch of points, it defines, for each point, the region of space that is closest to that point. And everything in a particular region, like this yellow region, is assigned the same label as this point here. And this is what is called a non-parametric model, which doesn't mean there are no parameters; it means the number of parameters is not fixed. The more points you have, the more parameters; each point, in a sense, gets its own parameter. So you can actually fit really expressive models this way. It's very simple, but it's computationally expensive, because you have to store your entire set of training examples. Okay. So we looked at three different models, and, you know, there's a saying that in school there are three things, study, sleep, and party, and you can only pick two of them. Well, for learning it's kind of the same. A method can be fast to predict: linear models and neural nets. It can be easy to learn: linear models and nearest neighbors. Or it can be powerful: neural networks and nearest neighbors. But there's always some sort of compromise, and exactly which method you choose will depend on what you care about. Okay. See you next time.
