Finally, it's finished. A quick look at the performance: an MSE of 0.124, a logloss of 0.389, and a 17.9 percent error rate, about the same as the other models. And if we scroll down, it shows the same overfitting after roughly 10 epochs. But what I really wanted to focus on is that this model took 13 and a half minutes to build. When we used three layers of 200 neurons each, it took six minutes, six minutes ten. Our default model took, what was it, 70 seconds. So why is three layers of 200 neurons, 200 by 200 by 200, twice as quick as just two layers of 400 by 400?

Let's go and have a look. If we run summary, it tells us what layers we have and how many units are in each layer, and hidden in here is the number of weights and biases. Our default model, which is 200 by 200, has about 800,000 weights. Here's how that's calculated. If we just bring in the units, we can see we have 3,800 units in the input layer. We multiply that by the size of the next layer, so I'm multiplying those two numbers together. Then we add the number of connections between the next two layers, which is 200 times 200. Then we add the number of connections to the output layer, and we only have two output units, so that's 200 times 2. Finally, the number of units across those three non-input layers gives us the number of biases. If we run that, we get 800,600 weights plus 402 biases.

When we add one more hidden layer with 200 neurons, all we're adding is another 200 times 200, so you can see we have about 840,000 weights and biases now. With 400 by 400 we have almost 1.7 million weights and biases, so it has doubled. The reason is that we're now doing 3,800 times 400. It's all about that first multiplication: the first hidden layer has doubled in size, and because our input layer is very large, the total effectively doubles. This is really important because the number of weights is roughly proportional to how long the model will take to train. Now, you may work for Google and have an infinite amount of computing power, but for the rest of us, if a model can build in 30 minutes instead of 30 hours, we can be a lot more productive. So this kind of thing is worth paying attention to.

If you're using Python, the way to spot these kinds of problems is to look at your training data (or any of your data sets, since they should be identical) and call nlevels. A zero means it's a simple numeric column; anything more than zero means it's an enum column, a factor, or a category column. One column here has 3,500 different categories, and there are a couple with 132. The 3,500 one is the one getting a red flag from me. If we call train.structure(), it's a bit more verbose, but it lets us find out which column is the problem, and it's tail number. So we have 3,500 different values for TailNum, and you can see a sample of them there; there are thousands of them, as we've just found out. The other large ones are the 132 levels each for origin and destination. I want to keep those two in because I think they're carrying a lot of information. I don't think tail number is carrying much, so let's try removing it.

Here's a Python way to filter a single value out of a list. Now, using x2, my slightly smaller list, I'm going to build the same four models we've already built. I'll build them back to back and come back to talk about them afterwards. Over in R, you can see the number of units with this kind of command: every model has an @model slot, from which you can get the model summary, and specifically its units, to see those four numbers.
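To make that arithmetic concrete, here is a minimal Python sketch of the parameter counting and of the nlevels and list-filter steps just described. The variable names (train, x, x2) and the round figure of 3,800 input units are assumptions for illustration; the exact 800,600 weights plus 402 biases quoted above come from the real input-layer size that H2O reports for this data.

```python
# Minimal sketch of the weight/bias arithmetic for a fully connected network.
# Weights = sum of products of adjacent layer sizes; biases = one per unit
# in every layer after the input layer. 3,800 input units is an approximation.
def count_params(units):
    weights = sum(a * b for a, b in zip(units, units[1:]))
    biases = sum(units[1:])
    return weights, biases

print(count_params([3800, 200, 200, 2]))       # ~800,000 weights, 402 biases
print(count_params([3800, 200, 200, 200, 2]))  # ~840,000 weights, 602 biases
print(count_params([3800, 400, 400, 2]))       # ~1.68M weights, 802 biases

# Spotting high-cardinality columns: nlevels() returns 0 for numeric columns
# and the number of categories for enum columns (assuming an H2OFrame `train`).
# train.nlevels()

# Filtering one value out of the predictor list with a list comprehension:
# x2 = [col for col in x if col != "TailNum"]
```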
And there are a couple of R functions there that will tell you the size of our four models. We can see that the 400 by 400 model is twice as large as the others, with twice as many weights, which is why it took twice as long to build. You can also run h2o.describe on your dataset to find out the cardinality; the describe function in Python doesn't tell you the cardinality, which is why we needed to use nlevels. To get rid of TailNum I used setdiff, and then off we go, building the four models again.

Those four models have finished rebuilding, and something really interesting has happened. The first model, with the default settings, took 70 seconds before, and that has been reduced to 13 seconds. The model increased to 200 epochs, which with early stopping was only using about 50 epochs, wasn't it, has gone from five or six minutes to one minute 20, and the same for the three-layer 200 by 200 by 200 model. And the 400 by 400, the two layers of 400 neurons, has dropped from 12 minutes to two minutes. Which is great, wonderful: it's now a sixth of the time, in other words we've saved five sixths of the time.

But have we lost any performance? What I've done is put all the models into a Python list and then used a lambda to run logloss on each of those models, and then we'll do MSE as well. What have we got? This is also a useful way of seeing which is the best model so far. For logloss, lower is better. It's gone up slightly, gone up a fraction, dropped a lot. That could be random variation; I think the number for the three-layer model was too high anyway. And the 400 by 400, which was our best model, has actually got better by excluding tail number. MSE shows a very similar pattern across all four models.

Just to understand what was going on there: the original 400 by 400 model had 1.6 million weights, and the new version has gone from 1.6 million down to 280,000, roughly a sixth, and roughly a sixth of the training time. You can see we're down to just 299 input neurons now. If you're confused about why tail number created so many input neurons, this is called one-hot encoding. I think we might have covered it in week one, but basically, when you have a categorical input to either deep learning or a GLM, it creates one input for each possible value, and all of them are set to zero except the one for the category of that particular record.

So we've looked at quite a few things in this set of videos. We've looked at how to fiddle with the number of neurons and the number of layers, and H2O makes that really easy: you just set hidden as a list. And we've seen that we should be paying attention to how many input neurons we have, perhaps excluding some data if we don't think it's going to be that useful.
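As a quick recap of the comparison step described above, here is a rough Python sketch of the list-plus-lambda pattern. The names m1 through m4 are hypothetical stand-ins for the four rebuilt models, and valid=True (asking for validation-set metrics) is an assumption; adjust to however your models and frames are actually named.

```python
# Rough sketch: comparing the four rebuilt H2O models in one go.
# m1..m4 are hypothetical names for the models built earlier; valid=True
# assumes a validation frame was supplied when they were trained.
models = [m1, m2, m3, m4]

# Logloss for each model (lower is better).
print(list(map(lambda m: m.logloss(valid=True), models)))

# Same pattern for MSE.
print(list(map(lambda m: m.mse(valid=True), models)))

# Why TailNum blew up the input layer: one-hot encoding turns a categorical
# column with k levels into (roughly) k binary inputs, only one of which is
# 1 for any given row, e.g. "N123" among ["N111", "N123", "N456"] -> 0, 1, 0.
```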