In this video we'll begin our discussion of hash tables; we'll focus first on the support operations, and on some of the canonical applications. So hash tables are insanely useful. If you want to be a serious programmer or a computer scientist you really have no choice but to learn about hash tables. I'm sure many of you have used them in your own programs in the past in fact. Now on the one hand what's funny is they don't actually do that many things in terms of the number of supported operations, but what they do, do they do really, really well. So what is a hash table? Well conceptually, ignoring all of the aspects of the implementation, you may wanna think of a hash table as an array. So one thing that arrays do super well is support immediate random access. So if you're wondering what's the position number seventeen of some array, boom, with a couple of machine instructions you can find out, wanna change the contents of position number 23 in some array? Done, in constant time. So let's think about an application in which you want to remember your friends phone numbers. So if you're lucky your friends parents were all u nu, unusually unimaginative people and all of your friends names are integers let's say between one and 10,000. So if this is the case then you can just maintain an array of link 10,000. And to store the phone number of say, your best friend, 173, you can just use position 173 of this modest sized array. So this array based solution would work great, even if your friends change over time, you gain some here you lose some there, as long as all your friends names happen to be integers between 1-10,000. Now, of course, your friends have more interesting names: Alice, Bob, Carol, whatever. And last names as well. So in principal you could have an array with one position in the array for every conceivable name you might encounter, with at least 30 letters set. But of course this array would be way too big. It would be something like 26 raised to the thirtieth power and you could never implement it. So what you'd really want is you'd want an array of reasonable size, say, you know ballpark the number of friends that you'd ever have, so say in the thousands or something, where it's positions are indexed not by the numbers, not integers. [inaudible] Between one and 10,000, but rather by your friends Names And what you'd like to do is you'd like to have random access to this array based on your friend's name. So you just look up the quote unquote Alice position of this array and. Boom, there would be Alice's phone number in constant time. And this, on a conceptual level is basically what a hash table, can do for you. So there's a lot of magic happening under the hood of a hash table and that's something we'll discuss to some extent in other videos. So you have to have this mapping between the keys that you care about, like your friends' names, and, numerical positions of some array. That's done by what's called a hash function, but properly implemented, this is the kind of functionality that hash tables gives you, So like an array with its positions indexed by the keys that you're storing. So you can think of the purpose of the hash table as to maintain a possibly evolving set of stuff. Where of course the set of things that you're maintaining, you know, will vary with the application. It can be any number of things. So if you're running an e-commerce website, maybe you're keeping track of transactions. You know, again, maybe you're keeping track of people, like for example, your friends and various data about them. So maybe you're keeping track of I-P addresses, for example if you wanna know, who was, were there unique visitors to your websites. And so on. So a little bit more formally, you know, the basic operations, you need to be able to insert stuff into a hash table. In many, but not all applications, you need to be able to delete stuff as well. And typically the most important operation is look-up. And for all these three operation you do it in a key based way. Where as usual a key should just be a unique identifier for the record that you're concerned with. So, for example, for employees you might be using social security numbers. For transactions you might have a transaction ID number. And then IP addresses could act as their own key. And so sometimes all you're doing is keeping track of the keys themselves. So, for example, in IP addresses, maybe you just want to remember a list of IP addresses. You don't actually have any associated data but in many applications, you know, along with the key, is a bunch of other stuff. So along with the employee's social security number, you gotta remember a bunch of other data about that employee. But when you do the insert, when you do the delete, when you do the look up, you do it based. On this key, and then for example, on look up you feed the key into the hash table and the hash table will spit back out all of the data associated with that key. We sometimes hear people refer to data structures that support these operations as a dictionary. So the main thing the hash table is meant to support is look up in the spirit of a dictionary. I find that terminology a little misleading actually. You know, most dictionaries that you'll find are in alphabetical order. So they'll support something like binary search. And I want to emphasis something a hash table does not do is maintain an ordering on the elements that it supports. So if you're storing stuff and you do want to have order based operations, you wanna find the minimum or the maximum, or something like that, a hash table's probably not the right data structure. You want something more. You wanna look at a heap or you wanna look at a, a search tree. But for applications in which all you have to do is basically look stuff up you gotta, you gotta know what's there and what's not, then there should be a light bulb that goes off in your head. And you can say, let me consider a hash table, that's probably the perfect data structure for this application. Now, looking at this menu-supported operations, you may be left kinda unimpressed. Alright, so a hash table, in some sense, doesn't do that many things; but again, what it does, it does really, really well. So, to first order. What hash tables give you is the following amazing guarantee. All of these operations run in constant time. And again this is in the spirit of thinking of a hash table as just like an array. Where its positions are conveniently indexed by your keys, So just like an array supports random access in constant time, you can see if, you know, there's anything in the array position, and what it is. As similarly a hash table will let you look up based on the key in constant time. So what is the fine print? Well, there's basically two caveats. So the first thing is that hash tables are easy to implement badly. And if you implement them badly you will not get this guarantee. So this guarantee is for properly implemented hash tables. Now, of course if you're just using a hash table from a well known library, it's probably a pretty good assumption that it's properly implemented. You'd hope. But in the event that you're forced to come up with your own hash table and your own hash function and unlike many of the other data structures we'll talk about, some of you probably will have to do that at some point in your career. Then you'll get this guarantee only if you implement it well. And we'll talk about exactly what that means in other videos. So the second caveat is that, unlike most of the problems that we've solved in this course, hash tables don't enjoy worst case guarantees. You cannot say for a given hash table that for every possible data set you're gonna get cost and time. What's true is that for non-pathological data, you will get cost and time operations in a properly implemented hash table. So we'll talk about both of these issues a bit more in other videos, but for now just high order bits are, you know, hash tables, constant time performance, subject to a couple of caveats. So now that I've covered the operations that hash tables support and the recommend way to think about them, let's turn our attention to some applications. All of these applications are gonna be in some sense, you know, kinda trivial uses of hash tables, but they're also all really practical. These come up all the time. So the first application we'll discuss, which again is a conical one, is removing duplicates from a bunch of stuff, Also known as the deduplication problem. So in the De-duplication problem, the input is essentially a stream of objects. Where, when I say a stream I have kinda, you know two different things in mind as canonical examples. So first of all you can imagine you have a huge file. So you have, you know, a log of everything that happened on some website you're running. Or all of the transactions that were made in a store on some day, And you do a pass through this huge file. So you're just in the middle of some outer for loop going line by line through this massive file. The other example of a stream that I had in mind, is, where you're getting new data over time. So here, you might imagine that you're running software to be deployed on an internet router. And data packets are coming through this router at a constant extremely fast rate. And so you might be looking at, say, the IP addresses and the sender, and use your data packet which is going through your router. So it would be another example of a stream of objects. And now, what do you gotta do? What you gotta do is you gotta ignore the duplicates. So remember just the distinct objects that you see in this stream. And I hope you find it easy to imagine why you might want to do this task in various applications. So, for example, if you're running a website you might want to keep track of the distinct visitors that you ever saw in a given day or a given week. If you're doing something like a web crawl, you might want to identify duplicate documents and only remember them once. So, for example, it would be annoying if in search results both the top link and the second link both led to identical pages at different URLs, okay, so search engines obviously want to avoid that, so you want to detect duplicate web pages and only report unique ones. And the solution using a hash table is laughably simple. So every time a new object arrives in the stream, you look it up. If it?s there, then it?s a duplicate and you ignore it. If it?s not there, then this is a new object and you remember it. Qed, that's it. And so then after the string completes, so for example after you finish reading some huge file, if you just want to report all of the unique objects, hash tables generally support a linear scan through them and you can just report all of the distinct objects when this stream finishes. So let's move on to a second application slightly less trivial maybe but still quite easy, and this is the subject of Programming Projects number five. So this is a problem called the two sum problem. You're given as input an array of N number. These images are in no particular order. You're also given a target sum, which I'll call T. And what you want to know is are there two integers from amongst these N you are given that sum to T. Now the most obvious and naive way to solve this problem is just to go over all N, choose two pairs of integers in the input, and check each one separately. So that's clearly a quadratic time algorithm. But now, of course, we need to ask, can we do better? And, yes, we can. And first of all let's see what you'd do if you couldn't use any data structures. So if you were clever, but you didn't use any data structures like a hash table, here would be a reasonable improvement over the naive one. So the first step of a better solution is to sort A upfront, For example, using word sort or heap sort, something that runs in end log and time. So you may be asking about the motivation for sorting. Well, again, you know, one thing is just, you know whenever you're trying to do better than N squared; you might think that sorting your data somehow helps. Right and you can sort of do it almost for free in N log N time. Now, why would sorting the array up front help us? Well, then the clever insight is that for each entry of the array a, say the first entry, now we know what we're looking for to achieve this given target, right. If the target that we're trying to get to is summed to 100 and the first entry in the sorted array is 43, then we know we're looking for a 57 somewhere else in. This now sorted array. And we know that searching a sorted array is pretty easy, right. That just binary search. That just takes logarithmic time. So for each of the n array entries, we can look for a complementary. Entry, namely of reach X we can look for T - X using binary search. And to use binary search takes log N time. So the sorting upfront speeds up this entire batch of N searches. So that's why it's a win. So, in the second step, because we do a linear number of binary searches, again, this is just n, the number of searches, times log-n, the time per search. So, this is just another theta of N log N factor. Alright, so that's pretty cool. You, I don't think you could come up with this N log N solution without having some basic, facility with algorithms. This is already a really nice improvement over the naive N squared. But we can do even better. It is no reason we're stuck with an N log N lower bound for the [inaudible] problem. Obviously, because the array is unsorted, we have to look at all the integers. So we're not gonna do better than linear time. But we can do linear time via a hash table. So a good question you might ask at this point is what's the clue about this problem, about this task that suggests we want to use a hash table. Well, so hash tables are going to dramatically speed up any application where the bulk of the word is just repeated look-ups. And if we examine this n log n solution, once we have this idea of doing a search for T minus X for each value of X, we realize actually, you know, the only thing we needed the sorted array for was to support look-ups. That's all binary search here is doing, is just looking stuff up. So we say, ah-ha. All of the work here in step two is from repeated look-ups. We're paying an exorbitant relatively, logarithm per amount of time per look-up, whereas hash tables can do them in cost and time. So, repeated look-ups, ding, ding, ding, let's use a hash table; and indeed that's what gives us linear time in this problem. So from the amazing guarantee of hash tables, we get the following amazing solution for the true [inaudible] problem, although again this is subject to the same fine print about you better use it properly implemented hash table and you better not have pathological data. So rather than sorting, you just insert everything in the array into a hash table. So insertions cost time. So this is gonna be linear time rather than the end log [inaudible] we were paying before. Once all the stuff is in the hash table, we just do the same thing as in the n log-n solution. For each x in the array, we look for its matching elements, t-x in the hash table using the cost and time look-up operation exported by the hash table. And of course if for some X, you do find the matching element T minus X. Then you can just report X and T minus X. That proves that there is indeed a pair of integers of target sum T. If for every single element of the input array A, you fail to find this matching element T minus X in the hash table. Then, for sure there is no pair of integers in the input that sums to T. So this solves the problem correctly. Moreover, constant time insertion, so that means this first step is going to be O of end time. And constant time look-up. So that means that the second step is also gonna be linear time. That leaves subjects to the caveats that we discussed on the previous slide. So it's kind of amazing how many different applications of computer science boil down in their essence to repeated look up operations. Therefore, having a super fast look up operation, like that supported by a hash table, permits these applications to scale to fantastic sizes. It's really amazing, and it drives a lot of modern technology. So let me just mention a couple examples. Again, if you look around or do some research on the web, you'll quickly find many more. So originally what prompted researchers to think hard about data structures that support super fast look ups, was back when people were first building compilers. So this is a long time ago. This is in the fifties or so. And these repeated look ups to figure out, you know, what has and has not been defined before was, was emerging as a bottleneck in compilers. Back in the early days of programming languages. And that was one of the early applications of hash tables. Was to support super fast look ups to speed up compile time. To keep track of the function of variable names and things like that. Hash table technology is also super useful for software on routers in the Internet. So, for example, you might want to block network traffic from certain sources. So, for example, maybe you suspect that a certain IP address has been taken over by spammers and so any traffic coming from that IP address you just want to ignore. And you don't wanna even let it get to the end host, to the computer on someone's desktop, or to someone's mobile device but rather inside the internet. You wanna just drop packets that are coming certain, certain centers. So what is that problem boil down to? Well, you might have a blacklist of IP addresses that you're refusing traffic from and then the tasks faced by the router is really the look up problem. So if data packet comes in at some insanely fast data rate, and when you wanna. You immediately, just look up, is this in the blacklist or not, and if it is in the blacklist then you drop the packet, if it?s not, then you let it go through. So a very different application is for speeding up search algorithms. And when I say a search algorithm, what I'm thinking about here is something like a chess playing program. So something that does game tree exploration. So we've already talked a fair amount about graph search in this class, but in our discussion of breadth first and depth first search, we were thinking about graphs that you could basically write down. You could store them in the main memory of your machine or, in the worst case, on some big cluster. So maybe graphs, you know, about the size of the web graph or possibly smaller. But in a context of something like a chess playing program the graph that you're interested in is way, way, way bigger than the web graph. So what's the graph we care about for a chess playing program? Well, the nodes of the graph are going to correspond to all possible configurations of chess pieces On a chess board. So every chess board that you might ever encounter in a game of chess. So that's a. Massive, massive number of configurations. And you're never gonna be able to write down these vertices. The edges in this graph are going to take you from one configuration to another. And there gonna correspond to legal moves. So if you can move a bishop from. One place to another place, and you get from one configuration to another configuration, there's an edge in the graph corresponding to that move. Now you can't write down this graph. So you can't implement breadth versus depth versus search exactly as we discussed it before. But, you'd still like to do graph exploration, right? So you'd like to have your computer program, reason about the at least short term ramifications of your possible next move. So that will correspond to searching through this graph. Now, how are you gonna, it's remembering graphs search a really important property was you don't want to do redundant work, you don't want to re-explore things you've already explored. That would be silly and might lead into loops and so on. And you can't write down the graph just remembering where you've been, is suddenly a non-trivial problem; but what is remembering where you've been, fundamentally? Fundamentally that's a look up operation. So that is exactly what hash tables are for. So to be a little more concrete, you know, one where you use the hash table in, say, a chess playing program, is you'd stake, take the initial configuration. You would sort of imagine trying all possible moves from this configuration. And then you'd try, you'd sort of have all moves from your opponent and then you'd have all your moves in response. And you would always remember, as you were doing this reasoning, every chessboard configuration you'd ever looked at before and you'd stick in the hash table. And before you go exploring some configuration, you'd look it up in your hash table to see if you've already explored it in the past. And if you have, you don't bother. You've already seen it. All right. So chess playing programs operate by exploring systematically as many configurations as they'd have time for. You know, obviously, in a budget of three minutes or whatever you don't wanna waste any work exploring any given configuration more than once. How do you remember where you've been? Well everything you've explored you stick in a hash table Before you explore a configuration you look it up in a hash table and see if you've already done it. So these of course are just scratching the surface. I just wanted to highlight a couple, you know, fairly different looking applications, you know to convince you that hash tables come up all the time. And the reason they come up all the time is because you know the need for fast look-ups comes up all the time. It's kind of amazing how much technology is being driven just by you know repeated fast look-ups. So as homework I encourage you to just sort of think about you know your own life, or think about technology out there in the world, and come up with some. You know, guesses about where probably hash tables are making something out there running blazingly fast. I think it won't take you more than a few minutes to come up with some good examples.