In this video, you'll learn about Least Squares loss. First, I'll go over the overall concept of least squares in statistics, and then what Least Squares loss means in GANs. This will be an adversarial loss, so it will replace BCE loss or W-loss in CycleGAN. Quick background on Least Squares loss and GANs: it came out around the same time as WGAN-GP, which was of course in week three, and this was a time when training stability in GANs was a big problem. Least Squares loss is used to help with training stability, namely the vanishing gradient problems you saw with BCE loss, which can cause mode collapse and other issues that bring learning to an end, which is the worst thing you could possibly get. Before we dive in, note that there are a ton of GAN loss functions, particularly ones designed to help with training stability, and people often experiment with them empirically, since some datasets actually fare better with one loss function versus another, even on the same model. Remember that this can also introduce bias, as people have different views on what is "better". Something to think about is that training time is also a huge component of how people judge GANs. You might have seen that WGAN-GP was a little bit slower because its calculation is a bit more complex; you might have observed that in your assignment. That also sometimes pushes people to use a certain loss function over another. Least squares is a pretty simple concept from statistics. It's a method that minimizes the sum of squared residuals. What that means is that it tries to find the best-fit line that has the smallest sum of squared distances between that line and all the points. Here's an example with three different points you want to fit a line through. Okay, cool. You can do that by finding each point's distance from a line: you draw a line first, then you find the distance from each point to that line.
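To make the statistics side concrete, here is a small sketch of fitting a best-fit line by least squares. The function names and the example points are my own for illustration; it uses the standard closed-form solution for the slope and intercept that minimizes the sum of squared residuals.

```python
# Fit a line y = m*x + b through points by least squares, i.e. by
# minimizing the sum of squared vertical distances (residuals).
def least_squares_line(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - m * sx) / n                          # intercept
    return m, b

def sum_squared_residuals(points, m, b):
    # Total cost for a candidate line: sum of squared distances
    # between each point and the line.
    return sum((y - (m * x + b)) ** 2 for x, y in points)

pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
m, b = least_squares_line(pts)   # best-fit line for these points
```

Any other line you try, such as a flat line `y = 1`, will have a strictly larger sum of squared residuals than the best-fit line, which is exactly what "minimizing the sum of squared residuals" means.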
Specifically, you want to look at their squared distances, and you want to minimize that total squared distance to find the best-fit line. When you draw a line like this, it's going to have a much larger sum of squared distances than this other line you see here in purple. You take the sum of squares as the total distance, or cost, for each line you try, and you minimize that sum of squares to find the best-fit line. That is what is meant by minimizing the sum of squared residuals, which is just the squared distance you see between the line and each point. By taking the sum of all of those squares and minimizing that value, you get the best-fit line you can. This is a really simple concept for finding the best-fit line across a few points. How this translates into GAN land is that your line is now your label, essentially: your label of real or fake. From the discriminator's point of view, a label of one is real, and the equivalent of a point relative to that line is your discriminator's prediction on a real image. The point is the prediction, and the line you want to find the distance to is that label of one. You want to get the squared distance from the prediction to that label, and you want to do this across many different points, so you take the expected value across many real images, x. For fake images it's the same, but with a label of zero. This is a very general way of putting it; of course, with CycleGAN, G is taking in a real image and the discriminator is evaluating that fake image. Then you do the same on multiple fake examples, which is just multiple z values passed into G. You can simplify this by removing that zero, because it's just minus zero. That's your discriminator's loss function using the least squares adversarial loss. On the generator side, it's quite similar to other cases you've seen before.
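The discriminator's least squares loss described above can be sketched in a few lines. This is a minimal illustration over batches of raw discriminator scores, with hypothetical names; in practice you would compute this over tensors in your deep learning framework of choice.

```python
# Least squares adversarial loss for the discriminator:
#   E[(D(x) - 1)^2] + E[(D(G(z)) - 0)^2]
# real_preds are D's scores on real images (target label: 1),
# fake_preds are D's scores on generated images (target label: 0).
def ls_disc_loss(real_preds, fake_preds):
    real_term = sum((p - 1.0) ** 2 for p in real_preds) / len(real_preds)
    fake_term = sum(p ** 2 for p in fake_preds) / len(fake_preds)
    return real_term + fake_term
```

A perfect discriminator (scoring reals at 1 and fakes at 0) gets a loss of zero; any other predictions incur the squared distance to the label.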
With the generator wanting its fake outputs to look as real as possible, it will instead want to see how far away the prediction is from one. In summary, these are the discriminator and generator loss terms under least squares adversarial loss. This might look similar to previous loss functions you've seen, namely BCE loss. But importantly, you'll see in this case that the loss isn't very flat, like it was with those sigmoids in BCE, which caused all of those vanishing gradient problems; it's only flat when the discriminator's predictions are exactly one here for real and exactly zero here for fake. Otherwise, thinking in terms of the squared distance, this Least Squares loss will not cause the vanishing gradient problem as drastically. This least squares loss function by and large helps with the vanishing gradient problem you've seen in BCE loss, and with that issue remedied comes better training stability in your GAN. For those who might be familiar with it, this is also known as mean squared error, or MSE. But no worries if you don't know what MSE is; I'm just trying to tie this into a topic you might know about already. In the context of CycleGAN, what this looks like is you have the cycle consistency loss term over here with the lambda, but your adversarial loss is this Least Squares loss that you just saw. In summary, you learned about Least Squares loss and how it computes the squared distance. This loss is used in CycleGANs for its adversarial loss and has fewer saturation and vanishing gradient problems than BCE loss, since it's only flat when your point is on the line, meaning your prediction is perfect: for example, one for real or zero for fake. Anywhere else, it'll be that squared distance, and so it doesn't have that saturation issue. You also saw in this video the equations for the least squares loss on both the generator and the discriminator. But remember that this isn't the final loss equation; there's also that cycle consistency loss term that you learned about earlier.
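The generator side can be sketched the same way. Again, this is an illustrative sketch with hypothetical names, not the course's reference implementation: the generator's loss is the squared distance between the discriminator's score on a fake and the "real" label of one, and its gradient with respect to a prediction p is 2(p - 1), which only vanishes when p is exactly 1, unlike a saturated sigmoid in BCE.

```python
# Least squares adversarial loss for the generator:
#   E[(D(G(z)) - 1)^2]
# The generator wants its fakes to be scored as real (label 1).
def ls_gen_loss(fake_preds):
    return sum((p - 1.0) ** 2 for p in fake_preds) / len(fake_preds)

def ls_gen_grad(p):
    # Gradient of (p - 1)^2 w.r.t. the prediction p: nonzero
    # everywhere except at p == 1, so gradients don't saturate.
    return 2.0 * (p - 1.0)
```

In the full CycleGAN objective, this adversarial term is combined with the lambda-weighted cycle consistency loss mentioned above.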
You will actually learn about an additional loss term in the next video as well.