[MUSIC] When I'm giving a talk and describing using this slide, well I'll ask is how many people have heard of algebraic optimization. And typically, very few have, even if they're computer scientists, unless it's a room full of database people. But the thing is, that you already understand what this is. Right, you don't have to know databases to know what this is. This is just something you learned in high school, in Algebra class. Okay. So, forget tables for a second, just think about integers. Well, I've got this expression here, and I want you to evaluate this expression. When I tell you z is equal to 4. Okay. So, one thing you might do is just well, say, well four times two is eight. And, 4 times 3 is 12, and so that's 20, and I add 0, and that doesn't change anything, and then I divide by 1. Fine. But if you're clever you might notice that, well adding zero to any number doesn't change it at all, so I'll just ignore that all together. Similarly, dividing any number by 1, or any integer by 1, is the same number. So I'll ignore that as well. And then if you're really clever you might notice that there's a distributivity law here that says when I see this pattern I can pull out the multiplication. And by applying these rules in turn including commutativity law that allows things to be reordered I can simplify this expression down to this. And this just says well now, 2 plus 3 is 5, just multiply 5 times 4 and I get 20, and I've done fewer operations. I've only done two operations instead of five, and I didn't have to do division, which is potentially an expensive operator, if you're thinking about a computer evaluating this. Now, do computers use this kind of symbolic reasoning when they evaluate expressions over integers? No, the answer is no and the reason is that this kind of symbolic reasoning is much, much more expensive than just evaluating the damn thing, right? So, fine, but if the objects that you're manipulating are not small integers but rather, you know, terrabyte size tables, then this kind of symbolic reason is not only valuable but it's absolutely critical. If you do things in the wrong order, if you do wasted work, or you do more operations than you need to over massive tables, you're dead in the water and you'll get nothing done. And so, all databases, all relational databases rather, do this kind of algebraic optimization when you write a query. Right? And so if we think in terms of SQL, if you're familiar with SQL, your query gets translated into a relational algebra expression in terms of selects and projection joins. And then is manipulated according to algebraic rewrite rules, just like you learned in algebra class. And that's why the term algebra is there. And they attempt to simplify the expression. Simplify, the reason I pause is simplify is perhaps not the right word cuz it's not always true that the shorter the expression, the faster it is. We actually use this notion of cost-based optimization, which means we'll try lots of different kinds of equivalent expressions, assign each one of them an estimated cost, and choose the one with the lowest cost. And this is something that all relational databases are doing in one form or another, okay? So fine. So this is the magic trick of query processing in relational databases, and this is a really, really great idea. And the reason why this works is because we understand, very formally, what these operations are and what they mean, okay. And so when you relax this formal model, and start allowing anybody to write any kind of code they want over the data, you lose the ability to do this kind of algebraic optimization. And you leave it up to the programmer to write the best possible algorithm. What I'm hinting at here is, we'll talk about it more later, but when you think about writing large scale data processing pipelines in something like MapReduce. And if you haven't heard of MapReduce, don't worry. We'll talk about it. You're leaving all the work up to the programmer to not only write the logic, but also to do the optimization. And this you can pay, this can impose a penalty. Okay? One final comment about this is the term algebra is not just kind of trying to connote Algebra from high school, it literally is the same thing. And so what when you hear the word algebra what you should be thinking of is this notion of algebraic closure. And what I mean by that is every operation that applies to a table, also returns a table. And so, I can chain these operations together to always get tables. All right. Now, that's the exact same idea that's going on when you talk about operations over integers or real numbers. And you might sort of quibble and say well if I divide an integer by some other integer, I may get a real number and that's true, but there's notions of multi-sorted algebras with different types involved. The point is that this notion of closure, you can't escape the system by applying operations, is always true when you hear the term algebra. We're not making things up. Fine, so here's some relational algebra expressions that if you squint hard enough you can see kind of look like similar expressions over integers. Except instead of addition and multiplication we have stuff like joins and selects. And so what this says, select certain values from a relation, R. And here I'm gonna select other values from the same relation R. That's okay. I can have two different, the relation arc can appear in two different places in the same expression, no problem. And then join them together, then select still other values from the relation R, and join this one. And one way of evaluating this plan is to perform this join first, and then this join second, and indicated by these parentheses. Right? Another way to evaluate this expression is to perform this join first. And then form this join second, indicated by the parentheses. Still another expression is to take the full cross product of all three relations, which I haven't told you what cross product is, this is actually a pretty bad one to do. If you do know what a cross product is, it generates an enormous amount of data and you would never actually want to evaluate this plan. But you could and it's provably equivalent to these other plans, so you know that it returns the same answer. And now all we have to do is figure out which one of these is likely to be the cheapest one and we'll choose that one to run. And this is the kind of reasoning that all databases do internally whenever you write a query. Okay. [MUSIC]