Hello everyone. Today I will explain how my teammate Gert and I won a very special Kaggle competition called the Acquire Valued Shoppers Challenge.

First, let me give you some background about the competition. It was a recommender challenge, and when I say a recommender challenge, I mean you have a list of products and a list of customers, and you try to bring them together: you decide which product to target at which customer in order to increase sales, loyalty, or something else. There were around 1,000 teams, and back then at least, that was quite a lot. Kaggle has become much more popular since, but given the size of the data, which was quite big, I think this competition attracted a lot of attention. As I said, we attained first place, and it was, I think, a very clear win, because we took the lead early in the competition and we maintained it. The solution was not actually very machine-learning-heavy; it was focused on really trying to understand the problem well and on finding ways to validate properly. In general, it was very, very focused on getting sensible results, and that is why I think it is really valuable to explain what we did.

So what was the actual problem we tried to solve? Imagine you have 310,000 shoppers, split almost equally into train and test. You have all their transactions up to the point where they were given an offer, about 350 million transactions in total, so one year of transactions for all the customers involved in this challenge, and you have 37 different offers. When I say offer here, it is actually a coupon. A shopper is sent, normally, a piece of paper that says, "If you buy this item, you can get a discount." So it recommends a certain item. Now, we don't know exactly what the discount is. Maybe it is a discount, maybe it says you can get another item for free, maybe a different promotion, but in principle it is some sort of incentive for you to buy this item. I have mentioned items or products so far, but the notion of a product was not actually present in this challenge; we could only infer it. We could say that a unique combination of brand, category, and company could form a product, at least for this competition.

Let me give you a bit more detail about the actual problem we are trying to solve. Imagine a timeline that starts one year in the past and runs to one year after. In the past, a customer makes a visit to the shop, buys certain items, leaves, comes back another day, makes another visit, buys more items. Then at some point, he or she is targeted with an offer, a coupon as I said. All transactional history for this customer stops there; you don't know anything else. Up to this point you know everything, going one year back. The only other thing you know is that the customer has indeed redeemed that coupon, so he or she did buy the offered product. And for the training data only, you also have a special flag that tells you whether he or she bought that item again. You can see why this is valuable: normally, when you target someone with a coupon, you give a discount and you don't actually make that much profit, but you aim at establishing a long-term relationship with the customer. That is why they were really interested in getting a model that could predict which recommendation, which offer, will have that effect, will make a customer develop a habit of buying this item. And what we were being tested on was AUC.
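Just to make the setup concrete, here is a rough sketch of how the training table could be put together in pandas. The file and column names are my assumptions about the layout (a history file carrying the repeat flag, and an offers file carrying category, company, and brand), so treat this as illustration rather than the exact recipe:

```python
import pandas as pd

# Illustrative sketch only: file and column names are assumptions.
history = pd.read_csv("trainHistory.csv")   # one row per targeted shopper, with the repeat flag
offers = pd.read_csv("offers.csv")          # offer -> category, company, brand, ...

train = history.merge(offers, on="offer", how="left")

# There is no explicit product table, but a unique
# (category, brand, company) triple can stand in for a "product".
train["product"] = (
    train["category"].astype(str) + "_"
    + train["brand"].astype(str) + "_"
    + train["company"].astype(str)
)

# The target: did the shopper buy the offered product again?
# (assuming the flag is stored as 't' / 'f')
y = (train["repeater"] == "t").astype(int)
```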
By this point I expect you roughly know what AUC is as a metric, but in principle you are interested in how well your score discriminates between those that will buy the item again and those that will not. With a higher score you expect a higher chance of buying the item again, and with a lower score a lower chance. Higher AUC means that this discrimination by your score is stronger.

This was a challenging competition. It was challenging because, first of all, the datasets were quite big: as I said, 350 million transactions. Gert and I didn't have crazy resources back then. I have to admit I have improved my hardware since, but back then I was working only with a laptop that had 32-bit Windows and only four gigabytes of RAM. So really small, and it was a real challenge just having to deal with files of this size.

Then, we didn't have features. What we knew is that this customer was given this offer, which consists of the three elements I mentioned before, category, brand, and company, plus the time the offer was made, and nothing else. Then you had to go to the transactional history and try to generate features, and really anybody could create anything they wanted; there was not a clear answer about which features would work best. So this is not your typical challenge where you are simply given the dataset to model on, though it is quite typical for this kind of recommender challenge.

What really made this competition difficult and interesting, and what I think at the end of the day gave us the win, was the fact that the testing environment was very irregular. We can define irregular, in this context, as an environment where the train data and the test data had different customers. No overlap: the customers in one are different from the customers in the other. Also, the train and test data had, in general, different offers. Here I'm showing a graph with the distribution of each offer and whether it appears in the train data, the test data, or both, and you can see that most offers appear either only in test or only in train, with minimal overlap. That makes it a bit difficult, because you basically have to build a model with some products that were offered in the train data, while in the test data you have completely different offers, so you don't know how they will behave, as these products have never been offered before. The next element is that the test data is obviously in the future. That is expected, but given the other elements it makes things more difficult, especially since some cases were well into the future.

A less obvious element, but still crucial, was that this challenge was focused on acquisition, so there is not much history between the customer and the offered product. I say this is irregular because grocery sales are in principle based on what the customer already likes and has bought many times in the past. We refer to this type of acquisition problem, where we don't have much history, as a cold-start problem, and it is much more challenging because you don't have that direct link. You cannot say the customer really likes this product, so I will make an offer, because we don't have a past history that can verify this, or we don't have much of it. And the final element is that if you look at the propensity of each offer to be bought again in the training data, the numbers were quite different across offers.
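To make the feature-generation point concrete before I come back to the propensities: here is a minimal sketch of the kind of features one could derive from a single customer's transaction history for a given offer. The column names are placeholders of mine, not the exact schema:

```python
import pandas as pd

def offer_history_features(transactions: pd.DataFrame, offer: dict) -> dict:
    """Toy example: turn one customer's history into features for the
    (category, brand, company) of an offer. Assumes columns named
    category, brand, company, purchaseamount and date (already parsed)."""
    hist = transactions[transactions["date"] < offer["offerdate"]]
    same_cat = hist[hist["category"] == offer["category"]]
    same_brand = hist[hist["brand"] == offer["brand"]]
    same_comp = hist[hist["company"] == offer["company"]]
    return {
        "n_visits": hist["date"].nunique(),           # how often the customer shops
        "spend_total": hist["purchaseamount"].sum(),  # overall spend before the offer
        "n_bought_category": len(same_cat),           # past purchases in the offered category
        "n_bought_brand": len(same_brand),            # past purchases of the offered brand
        "n_bought_company": len(same_comp),           # past purchases from the offered company
        "spend_on_category": same_cat["purchaseamount"].sum(),
    }
```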
Coming back to the propensities: here I'm showing the offers sorted by repeat propensity, and you can see that some offers had a lot of success in being bought again, like offer two, while some others had much less, for example around 20 percent, and this is just a sample; there were other offers that had around five percent. So if you now put everything into context, you have different customers, different offers, different buying behavior, different time periods. In principle you don't have much information about the customer and the offered product, and you know that the offers in the training data are actually quite different from each other, so it is really difficult to get a standard pattern here. And you know that the offers in the test data are going to be different again. All of this made it a difficult problem to solve, an irregular environment.

How did we handle the big data? We did it with indexing. The way I did the indexing was this: I saw that the data were already sorted by customer and time, so I passed through this big file of transactions, and every time I encountered a new customer, I created a new file. So I created a different file for each customer that contained all his or her transactions, and that made it really easy to generate features, because if I had to generate features for a specific customer, I would just access this file and create all the features I wanted at will. This is also very scalable, so I could create threads to do this in parallel and access many customers in parallel. And I did this not only for every customer but also for every category, brand, and company. So every time I wanted to access information, I would just open the right category, the right brand, or the right customer, and I would get the information I wanted, and that made it very quick to handle all these big chunks of data.

But what I think was the most crucial thing is how we handled the irregularity. At the end of the day, this is what determined our victory, because once we got this right, we were able to try all sorts of things with confidence that they would work on the test data. The first thing we tried to do, and this is something I want you to really understand, is to replicate internally what we were being tested on. That's really important: it gives you the room to try all sorts of things, all different permutations and combinations of data techniques, anything you can think of, and really understand what works and what does not. So we tried to do that.

The first attempt didn't go very well. We tried to randomly split the data between train and validation, making certain that each offer was represented proportionally in both the train and the validation set. But was that correct? If you think about it, what we were doing there was saying: I'm building a model with some offers and I'm validating on the same offers. That's fine, maybe we can do well here, but is this what we are really being tested on? No, because in the test data we will have completely different offers. So this didn't work very well; it gave very nice internal results but not very good results on the test data. So we tried something else: can we leave one offer out? I'm showing you roughly what this looks like. For every offer, can we put that offer in the validation data and use all the cases of every other offer to train a model?
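Before I walk through that idea with a concrete offer, here is a minimal sketch of the per-customer indexing I just described. It assumes the transactions file is already sorted by customer id, as ours was; the exact file format is an assumption for illustration:

```python
import csv
import os

def split_by_customer(transactions_path, out_dir, id_col="id"):
    """Stream a transactions file that is sorted by customer id and
    write one small file per customer (illustrative sketch only)."""
    os.makedirs(out_dir, exist_ok=True)
    current_id, out_f, writer = None, None, None
    with open(transactions_path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            cust_id = row[id_col]
            if cust_id != current_id:        # a new customer block starts here
                if out_f:
                    out_f.close()
                current_id = cust_id
                out_f = open(os.path.join(out_dir, f"{cust_id}.csv"), "w", newline="")
                writer = csv.DictWriter(out_f, fieldnames=reader.fieldnames)
                writer.writeheader()
            writer.writerow(row)
    if out_f:
        out_f.close()
```

For category, brand, and company, where the file is not sorted by that key, the same pass works if you append to the corresponding per-key file instead of opening a fresh one.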
So, back to that scheme: if we were to predict offer 16, we would use all customers that received offers 1 to 15 and 17 to 24 to build the model, and then we would make predictions for all the customers that received offer 16. You can see that this is actually quite close to what you are being tested on, because you are building a model with some offers but being tested on another offer that is not there. And you can take the average AUC over all these 24 runs (I say 24 because this is how many offers you really have in the training data), and that average may be much closer to reality, closer to what you are being tested on. And this was true: it gave better results, but we were still not there. I'll show you why.

Consider the following problem. Here I give you a small sample of predictions for offer two along with the actual target. What we see here is a perfect AUC score. Why? Because all our positive cases, those labeled with one and shown in green, have higher scores than the red ones, where the target is zero. So the discrimination here is perfect. We can set a cutoff point, say 0.5, where everything with a score higher than this can safely be called a one, and that is true, and everything with a score lower than this is a zero. So you see, the discrimination is perfect.

Let's now take a sample from offer four. If you remember, offer four had in general a lower propensity: offer two had around 0.5 and offer four had around 0.2, which means its scores are centered much lower. What you can see here is that, again, the AUC is perfect for this sample, because again all the higher scores, labeled in green, have a target of one, and for the lower scores, everything with a score less than 0.18 has a target of zero. The discrimination is perfect. We can find a cutoff point, say 0.18, where everything with a score higher than this can safely be set to one, and that is always true, and vice versa, everything less than 0.18 is a zero, and that is always true.

So we have two sets of scores, and each discriminates really well between the good and the bad cases. However, we are not tested on the AUC of one offer; we are tested on the AUC of all offers together, since the test data contains many offers. So you are interested in a score that generalizes well across any offer. What happens if we try to merge these tables? The AUC is no longer perfect. Why does this happen? Because some of the negative cases of the first offer have higher scores than the positive cases, those with a target equal to one, from the second offer. So you can see that although the discrimination within each offer is really good, the scores don't blend well; you lose something from the ability of your score to discriminate between ones and zeros. The moment we saw this, we knew that just leaving one offer out was not going to be enough. We had to make certain that when we merged all those scores together, the score was still good, that the ability of our model to discriminate was still powerful and didn't degrade. And that is why we used a combination of the previous average AUC over all the offers and the AUC after doing this concatenation. The average of these two AUCs was really the metric we tried to optimize, because we thought it was actually very close to what we were being tested on.
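Here is a tiny numeric sketch of that effect and of the combined metric; the numbers are made up, just mirroring the offer-two and offer-four samples above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Within each offer the ranking is perfect, but the two offers
# live on different score scales (made-up numbers).
offer2_scores = np.array([0.90, 0.80, 0.60, 0.40]); offer2_y = np.array([1, 1, 0, 0])
offer4_scores = np.array([0.30, 0.25, 0.15, 0.10]); offer4_y = np.array([1, 1, 0, 0])

auc2 = roc_auc_score(offer2_y, offer2_scores)   # 1.0, perfect within offer two
auc4 = roc_auc_score(offer4_y, offer4_scores)   # 1.0, perfect within offer four

# Concatenate the two offers: offer two's negatives now outscore
# offer four's positives, so the pooled AUC drops.
y_all = np.concatenate([offer2_y, offer4_y])
s_all = np.concatenate([offer2_scores, offer4_scores])
pooled = roc_auc_score(y_all, s_all)            # 0.75

# The metric we optimized internally: the average of the mean
# per-offer AUC and the AUC of everything concatenated.
combined = 0.5 * (np.mean([auc2, auc4]) + pooled)   # 0.875
print(auc2, auc4, pooled, combined)
```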
And here I can show you the results of all our attempts, and this is with a small subset of features, because at that point we were not interested in creating the best features, we were interested in testing which approach works best. You can see that if you do standard stratified K-fold, you get much nicer results in internal cross-validation, but on the test data the relationship is almost the opposite: the highest score in cross-validation leads to worse results on the test data. And you can see why: internally you are not modeling, not validating, on what you are actually being tested on. Doing leave-one-offer-out gives obviously lower internal cross-validation results and better performance on the test data, but even better was doing leave-one-offer-out plus the concatenation at the end. That internal AUC was lower still, but it actually had the best test results. I believe we could have gotten even better results if we had made certain that we were also validating on new customers. We didn't actually do this, because we saw that this approach already gave good results, but as a means to improve, we could have also made certain that we validate on different customers, because this is what the test was like.
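If I were to sketch what that improvement could have looked like, it would be roughly the following. The column names and the logistic regression are stand-ins of mine, not what we actually ran, and it assumes every offer has both repeaters and non-repeaters so the per-offer AUC is defined:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def loo_offer_unseen_customers(train: pd.DataFrame, feature_cols) -> float:
    """Leave one offer out AND drop the held-out customers from the
    training side, so validation customers are entirely unseen, as in
    the real test set. Assumes columns 'offer', 'id' and 'target'."""
    per_offer_auc, oof_y, oof_s = [], [], []
    for offer in train["offer"].unique():
        valid = train[train["offer"] == offer]
        fit = train[(train["offer"] != offer)
                    & (~train["id"].isin(valid["id"]))]   # unseen customers only
        model = LogisticRegression(max_iter=1000)
        model.fit(fit[feature_cols], fit["target"])
        scores = model.predict_proba(valid[feature_cols])[:, 1]
        per_offer_auc.append(roc_auc_score(valid["target"], scores))
        oof_y.append(valid["target"].to_numpy())
        oof_s.append(scores)
    pooled = roc_auc_score(np.concatenate(oof_y), np.concatenate(oof_s))
    # Same combined metric as before: mean per-offer AUC averaged with the pooled AUC.
    return 0.5 * (float(np.mean(per_offer_auc)) + pooled)
```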