Binary Cross-Entropy loss or BCE loss, is traditionally used for training GANs, but it isn't the best way to do it. With BCE loss GANs are prone to mode collapse and other problems. In this video, you'll see why GANs trained with BCE loss are susceptible to vanishing gradient problems. To that end, you'll review the BCE loss function and what that means for the generator and the discriminators objectives. Then you'll see when and why GANs with this BCE loss are likely to have those vanishing gradient problems. Remember the form of the BCE loss function, its just an average of the cost for the discriminator for misclassifying real and fake observations. Where the first term is for reals and the second term is for the fakes. The higher this cost value is, the worse the discriminator is doing at it. The generator wants to maximize this cost because that means the discriminator is doing poorly and is classifying it's fake values into reals. Whereas the discriminator wants to minimize this cost function because that means it's classifying things correctly. Of course the generator only sees the fake side of things, so it actually doesn't see anything about the reals. This maximization and minimization is often called a minimax game, and that's how you might hear it being referred to as. At the end of this minimax game, the generator and discriminator interaction translates to a more general objective for the whole GAN architecture. That is to make the real in generated data distributions of features very similar. Trying to get the generated distribution to be as close as possible to the reals. This minimax of the Binary Cross-Entropy loss function is somewhat approximating the minimization of another complex hash function that's trying to make this happen. Of course, during this whole training process, the discriminator naturally is trying to delineate this real and fake distribution as much as possible, whereas the generator is trying to make the generated distribution look more like the reals. However, let's take a step back again to the generator and discriminators roles. The discriminator and again, needs to output just a single value prediction within zero and one. Whereas the generator actually needs to produce a pretty complex output composed of multiple features to try and fool the discriminator, for example, an image. As a result that discriminators job tends to be a little bit easier. To put it in another way, it's more straightforward to look at images in a museum than it is to paint those masterpieces, right? During training it's possible for the discriminator to outperform the generator, very possible, in fact, quite common. But at the beginning of training, this isn't such a big problem because the discriminator isn't that good. It has trouble distinguishing the generated and real distributions. There's some overlap and it's not quite sure. As a result, it's able to give useful feedback in the form of a non-zero gradient back to the generator. However, as it gets better at training, it starts to delineate the generated and real distributions a little bit more such that it can start distinguishing them much more. Where the real distribution will be centered around one and the generated distribution will start to approach zero. As a result, when it's starting to get better, as this discriminator is getting better, it'll start giving less informative feedback. In fact, it might give gradients closer to zero, and that becomes unhelpful for the generator because then the generator doesn't know how to improve. This is how the vanishing gradient problem will arise. In summary, GANs try to make the generated distribution look similar to the real one by minimizing the underlying cost function that measures how different the distributions are. As a discriminator improves during training and sometimes improves more easily than the generator, that underlying cost function will have those flat regions when the distributions are very different from one another, where the discriminator is able to distinguish between the reals and the fakes much more easily, and be able to say, "Reals look really real, a label of one and fakes look really fake, a label of zero." All of this will cause vanishing gradient problems.