This lecture is about inverted index construction. In this lecture, we will continue the discussion of system implementation. In particular, we're going to discuss how to construct the inverted index. Constructing the inverted index is actually very easy if the dataset is small: it's easy to build a dictionary and then store the postings in a file. The problem is that when our data cannot fit into memory, we have to use some special method to deal with it. And unfortunately, in most retrieval applications the dataset will be large and generally cannot be loaded into memory at once. There are many approaches to solve this problem; the sort-based method is quite common and works in the four steps shown here. First, you collect local (termID, docID, frequency) tuples; basically, you count the terms in a small set of documents. Then, once you have collected those counts, you sort them by term ID, so that you obtain a partial inverted index; these sorted batches are called runs. You write them into temporary files on disk, and then you merge them in step 3: you do pairwise merging of these runs until you eventually merge all the runs and generate a single inverted index.

So this is an illustration of the method. On the left you see some documents, and on the right we have a term lexicon and a document ID lexicon. These lexicons map string representations of terms or document IDs into integer representations, or map the integers back to the string representations. The reason why we are interested in using integers to represent these IDs is that integers are often easier to handle. For example, integers can be used as indices into an array, and they are also easier to compress. So this is one reason why we tend to map these strings into integers, so that we don't have to carry the strings around.

So how does this approach work? Well, it's very simple. We're going to scan these documents sequentially, parse each document, and count the frequencies of terms. At this stage the tuples come out sorted by document ID, because we process each document sequentially. We first encounter all the terms in the first document, so the document IDs are all 1 in this case. These will be followed by entries with document ID 2, and so on; this ordering arises simply because we process the data in sequential order. At some point, we will run out of memory and will have to write the tuples to disk. Before we do that, we sort them, using whatever memory we have, and this time we sort based on term IDs. Note that here we're using the term ID as the key for sorting, so all the entries that share the same term are grouped together. In this case, you can see that all the IDs of documents that match term 1 are grouped together. We write this sorted run into a temporary file, and that frees the memory to process the next batch of documents. We do this for all the documents, so we will write a lot of temporary files to disk. In the next stage we basically do a merge sort: we merge the runs and keep them sorted. Eventually, we will get a single inverted index where the entries are sorted by term ID, and at the top we see all the entries for the documents that match term ID 1.
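To make the four steps concrete, here is a minimal sketch in Python, under the assumption that small in-memory lists stand in for the temporary run files on disk; the names build_runs and merge_runs are just illustrative, not part of any particular system.

```python
# Sketch of sort-based inverted index construction (runs kept in memory here;
# a real system would write each sorted run to a temporary file on disk).
import heapq
from collections import Counter

def build_runs(docs, tuples_per_run=6):
    """Scan documents sequentially and emit sorted runs of (term_id, doc_id, freq)."""
    lexicon = {}                # term string -> integer term ID
    run, runs = [], []
    for doc_id, text in enumerate(docs, start=1):
        for term, freq in Counter(text.split()).items():
            term_id = lexicon.setdefault(term, len(lexicon) + 1)
            run.append((term_id, doc_id, freq))
        if len(run) >= tuples_per_run:     # pretend memory is full
            runs.append(sorted(run))       # sort the run by term ID, then doc ID
            run = []
    if run:
        runs.append(sorted(run))
    return lexicon, runs

def merge_runs(runs):
    """Merge-sort all runs into one index: term_id -> [(doc_id, freq), ...]."""
    index = {}
    for term_id, doc_id, freq in heapq.merge(*runs):
        index.setdefault(term_id, []).append((doc_id, freq))
    return index

docs = ["news about", "news about organic food campaign", "news of presidential campaign"]
lexicon, runs = build_runs(docs)
print(merge_runs(runs))   # e.g. {1: [(1, 1), (2, 1), (3, 1)], 2: [(1, 1), (2, 1)], ...}
```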
So this is basically how we can construct the inverted index, even though the data cannot all be loaded into memory. Now, we mentioned earlier that because the postings are very large, it's desirable to compress them. So let's now talk a little bit about how we compress the inverted index. Well, the idea of compression in general is to leverage skewed distributions of values, and we generally have to use variable-length encoding instead of the fixed-length encoding that we use by default in a programming language like C++. So how can we leverage the skewed distribution of values to compress them? In general, we use fewer bits to encode the frequent values, at the cost of using longer bit strings to encode the rare values.

So in our case, let's think about how we can compress the TF, the term frequency. If you picture what the inverted index looks like, you will see that the postings contain a lot of term frequencies; those are the frequencies of terms in all those documents. Now, if you think about it, what kind of values are most frequent there? You can probably guess that small numbers occur far more frequently than large numbers. Why? Well, think about the distribution of words: this has to do with Zipf's law, and most words occur only rarely, so we see a lot of small numbers. Therefore, we can use fewer bits for the small but highly frequent integers, at the cost of using more bits for the larger integers. This is a trade-off, of course. If the values were distributed uniformly, this wouldn't save us any space; but because we tend to see many small values, and they are very frequent, we save space on average, even though occasionally, when we see a large number, we have to use a lot of bits.

What about the document IDs that we also see in the postings? They are not distributed in a skewed way, so how can we deal with that? Well, it turns out that we can use a trick called d-gap, which is to store the differences between adjacent document IDs. You can imagine that if a term matches many documents, there will be a long list of document IDs, so when we take the gaps, the differences between adjacent document IDs, those gaps will be small; again, we see a lot of small numbers. Whereas if a term occurs in only a few documents, the gaps may be large, but such large numbers will not be frequent. So this creates a skewed distribution that allows us to compress these values. This is also possible because, in order to decompress these document IDs, we have to process the data sequentially: since we stored the differences, in order to recover the exact document ID we have to first recover the previous document ID and then add the difference to it to restore the current document ID. That is acceptable because we only need sequential access to these document IDs: once we look up a term, we retrieve all the document IDs that match the term and then process them sequentially. So it's very natural, and that's why this trick works.
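As a quick illustration of the d-gap trick just described, here is a small sketch; the helper names to_gaps and from_gaps are hypothetical, and the document IDs are assumed to be sorted in increasing order, as they are in a posting list.

```python
# Sketch of the d-gap trick on a sorted posting list of document IDs.
def to_gaps(doc_ids):
    """Store the first ID as-is, then each difference from the previous ID."""
    return [doc_ids[0]] + [cur - prev for prev, cur in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Sequentially add each gap to the previously recovered ID."""
    doc_ids = [gaps[0]]
    for g in gaps[1:]:
        doc_ids.append(doc_ids[-1] + g)
    return doc_ids

ids = [3, 7, 8, 15, 16, 21]
print(to_gaps(ids))                      # [3, 4, 1, 7, 1, 5] -- mostly small numbers
assert from_gaps(to_gaps(ids)) == ids    # sequential decoding recovers the IDs
```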
There are many different methods for encoding. Binary coding is what we commonly use in just about any programming language; it is basically fixed-length encoding. Unary code, gamma code, and delta code are other possibilities, and there are many more. So let's look at some of them in more detail. Binary coding is really equal-length coding, and that's appropriate for uniformly distributed values.

Unary coding is a variable-length encoding method. Here, an integer x >= 1 is encoded as x - 1 one bits followed by a zero. So, for example, 3 is encoded as two 1s followed by a 0, whereas 5 is encoded as four 1s followed by a 0, and so on. Now you can imagine how many bits we have to use for a large number like 100: exactly 100 bits, the same number of bits as the value itself. So this is very inefficient if you are likely to see some large numbers; imagine that you occasionally see a number like 1,000, and you have to use 1,000 bits. This only works well if you are quite sure there will be no large numbers and almost all the values you see are very small. Now, how do you decode this code? Since these are variable-length encodings, you can't just read a fixed number of bits and stop; you can't say that after 8 bits or 32 bits another code starts. The codes have variable length, so you have to rely on some mechanism to find the boundary. For unary coding it's very easy to see the boundary: a 0 signals the end of a code. You just count how many 1s you have seen, and when you hit a 0 you have finished one number and can start the next.

We just saw that unary coding is too aggressive in rewarding small numbers; if you occasionally see a very big number, it's a disaster. So what about a less aggressive method? Well, gamma coding is one of them. In this method, we use unary coding for a transformed form of x, namely 1 + floor(log2 x). The magnitude of this value is much smaller than the original x, and that's why we can afford to use a unary code for it. So first we have the unary code encoding 1 + floor(log2 x), and this is followed by a uniform code, that is, a binary code (uniform code and binary code are the same thing). We use this uniform code for the remaining part of the value of x, which is precisely x - 2^floor(log2 x). So the unary part encodes 1 + floor(log2 x), and the uniform part encodes the difference between x and 2^floor(log2 x). It's easy to show that this difference needs at most floor(log2 x) bits, and this is easy to understand: if the difference were any larger, the value of floor(log2 x) would itself be higher.

Here are some examples. 3 is encoded as 101. The first two digits, 10, are the unary code for the value 2. Since the unary part encodes 1 + floor(log2 x), and this is 2, we know that floor(log2 x) is 1. Now 3 is larger than 2^1 by 1, so the difference is 1, and that 1 is encoded at the end; that's why we have 101 for 3. Similarly, 5 is encoded as 110 followed by 01. In this case the unary code, 110, encodes 3, so floor(log2 x) is 2. That means we compute the difference between 5 and 2^2, which is 1, and again we have a 1 at the end. But this time we use 2 bits, because with this value of floor(log2 x), numbers such as 4, 6, and 7 would all share the same prefix 110. So in order to differentiate them, we have to use 2 bits at the end; you can imagine that 6 would end in 10 instead of 01 after the 110 prefix.
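Here is a rough sketch of unary and gamma coding that represents codes as character strings of 0s and 1s purely for readability; a real index compressor would of course pack actual bits.

```python
# Unary and gamma codes written as strings of '0' and '1' (illustration only).
def unary(x):
    """Encode x >= 1 as (x - 1) ones followed by a zero: 3 -> '110', 5 -> '11110'."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Unary code of 1 + floor(log2 x), then floor(log2 x) binary bits
    for the remainder x - 2**floor(log2 x)."""
    k = x.bit_length() - 1                       # equals floor(log2 x) for x >= 1
    remainder = format(x - 2 ** k, "b").zfill(k) if k else ""
    return unary(k + 1) + remainder

def decode_gamma(bits, pos=0):
    """Decode one gamma-coded integer starting at pos; return (value, next pos)."""
    k = 0
    while bits[pos] == "1":                      # read the unary prefix: k = floor(log2 x)
        k += 1
        pos += 1
    pos += 1                                     # skip the terminating 0
    remainder = int(bits[pos:pos + k], 2) if k else 0
    return 2 ** k + remainder, pos + k

print(unary(3), unary(5))        # 110 11110
print(gamma(3), gamma(5))        # 101 11001
print(decode_gamma(gamma(5)))    # (5, 5)
```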
It's also true that a gamma code always consists of an odd number of bits, and the bit exactly in the middle is a 0; that 0 marks the end of the unary part. To the left of this 0 there are all 1s, and to the right of it is the binary, or uniform, code. So how do you decode such a code? Well, you again first do unary decoding. Once you hit the 0, you have the unary part, and that also tells you how many further bits you have to read to decode the uniform part. That is how you decode a gamma code.

There is also a delta code, which is basically the same as a gamma code except that you replace the unary prefix with a gamma code. That is even less aggressive than the gamma code in rewarding small integers, which means it's fine if you occasionally see a large number. A large number is fine with the delta code; it's also fine with the gamma code; but it's really a big loss for the unary code. These codes all favor small integers to different degrees, and that also means each of them is appropriate for a certain distribution; none of them is perfect for all distributions. Which method works best depends on the actual distribution of values in your dataset. For inverted index compression, people have found that gamma coding seems to work well.

So how do we decompress the inverted index? I will just briefly talk about this. First, you decode the encoded integers, and we just discussed how to decode unary codes and gamma codes. What about the document IDs that may have been compressed using d-gaps? Well, we do sequential decoding. Suppose the encoded ID list is x1, x2, x3, and so on. We first decode x1 to obtain the first document ID, ID1. Then we decode x2, which is actually the difference between the second ID and the first one, so we add the decoded value of x2 to ID1 to recover the document ID at the second position. This is where you can see the advantage of converting document IDs into integers, because that is what allows us to do this kind of compression. We just repeat this until we have decoded all the document IDs; every time, we use the document ID at the previous position to help recover the document ID at the next position.
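Finally, here is a sketch of the delta code and of the end-to-end decompression of a gamma-coded d-gap list; it reuses the gamma, decode_gamma, to_gaps, and from_gaps helpers from the sketches above, so those definitions are assumed to be in scope.

```python
# Delta code: the unary prefix of the gamma code is itself replaced by a gamma code.
def delta(x):
    """Gamma code of 1 + floor(log2 x), then floor(log2 x) binary bits for the remainder."""
    k = x.bit_length() - 1
    remainder = format(x - 2 ** k, "b").zfill(k) if k else ""
    return gamma(k + 1) + remainder

print(delta(5))    # gamma(3) + '01' = '101' + '01' = '10101'

# End-to-end: compress doc IDs as gamma-coded d-gaps, then decode sequentially.
doc_ids = [3, 7, 8, 15, 16, 21]
bits = "".join(gamma(g) for g in to_gaps(doc_ids))
pos, gaps = 0, []
while pos < len(bits):
    value, pos = decode_gamma(bits, pos)   # decode one gap at a time
    gaps.append(value)
assert from_gaps(gaps) == doc_ids          # each ID recovered from the previous one
```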