Numpy is the fundamental package for numeric computing with Python. It provides powerful ways to create, store, and manipulate data, which makes it able to seamlessly and speedily integrate with a wide variety of databases and data formats. This is also the foundation that Pandas is built on, which is a high-performance, data-centric package that we're going to learn more about in this course. In this lecture, we're going to talk about creating arrays with certain data types, manipulating arrays, selecting elements from arrays, and loading datasets into arrays. Such functions are useful for manipulating data and understanding the functionalities of other common Python data packages. So, you'll recall that we import a library using the import keyword, and numpy's common abbreviation is np. So let's import numpy as np and import math. Right, arrays are displayed as a list or lists of lists, and can be created from lists as well. When creating an array, we pass a list into the numpy array function as an argument. So, a equals np.array and I'm just going to create a list here, 1, 2, 3, and we'll print out what it looks like. And we can print out the number of dimensions of an array using the ndim attribute, so, print(a.ndim). If we pass a list of lists into a numpy array, we create a multi-dimensional array, for instance a matrix. So, here I'll say b equals np.array and I'm going to create a list, and inside of it I'll pass two other lists. So, 1, 2, 3 is the first list and 4, 5, 6 is the second list, and let's look at b. We can print out the length of each dimension by calling the shape attribute, which returns a tuple, so, b.shape. And we can also check the type of items in the array, so, a.dtype. Now, besides integers, floats are also accepted in a numpy array, so, c equals np.array and we can just put in some floating point numbers here, 2.2, maybe 5, maybe 1.1, and let's do c.dtype.name. All right, so let's look at the data in our array.
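Here's a minimal sketch of those first few steps in one place (the exact printed dtype can vary by platform, e.g. int32 on Windows):

```python
import numpy as np

# A one-dimensional array created from a list
a = np.array([1, 2, 3])
print(a)         # [1 2 3]
print(a.ndim)    # 1

# A list of lists gives a two-dimensional array (a matrix)
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b.shape)   # (2, 3) -- two rows, three columns
print(a.dtype)   # an integer type, e.g. int64 on most platforms

# Floats are accepted too; the integer 5 is stored as 5.0
c = np.array([2.2, 5, 1.1])
print(c.dtype.name)  # float64
```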
So c, note that numpy automatically converts integers like five up to floats, since there's no loss of precision. Numpy will try to give you the best data type format possible to keep your data types homogeneous, which means they're all the same in the array. Sometimes we know the shape of an array that we want to create, but not what we want in it. Numpy offers several functions to create arrays with initial placeholders, such as zeros or ones. Let's create two arrays, both the same shape, but with different filler values. So, I'm going to say d equals np.zeros and I'll give it a shape of (2,3), and print d. And then e equals np.ones, and we'll give it the same shape (2,3) and print e, and then we can see our arrays. We can also generate an array with random numbers. So, np.random.rand, and we give it a shape, (2,3). You'll see zeros, ones, and rand used quite often to create example arrays, especially in Stack Overflow posts and other discussion forums. We can also create a sequence of numbers in an array with the arange function. The first argument is the starting bound, the second argument is the ending bound, and the third argument is the difference between each consecutive number. So, let's create an array of every even number from 10, inclusive, to 50, exclusive. So, f equals np.arange, we're going to start at 10, we're going to end at 50. Remember this is exclusive, and we're going to jump by twos, and let's look at f. If we want to generate a sequence of floats, we use something called linspace. In this function, the third argument isn't the difference between two numbers, but the total number of items that you want to generate. So you want to watch out for that. So, np.linspace(0, 2, 15), and what this really means is we want 15 numbers from 0, inclusive, to 2, inclusive. So, we can do many things on arrays, such as mathematical manipulation, addition, subtraction, squares, exponents, as well as Boolean arrays, which are binary values.
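The placeholder and sequence functions from this part of the lecture look like this:

```python
import numpy as np

d = np.zeros((2, 3))        # 2x3 array filled with 0.0
e = np.ones((2, 3))         # 2x3 array filled with 1.0
r = np.random.rand(2, 3)    # 2x3 array of uniform random floats in [0, 1)

# Every even number from 10 (inclusive) to 50 (exclusive): step is 2
f = np.arange(10, 50, 2)
print(f)   # 10, 12, ..., 48

# 15 evenly spaced floats from 0 to 2; note linspace INCLUDES the endpoint,
# and its third argument is a count, not a step
g = np.linspace(0, 2, 15)
print(g)
```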
And we can also do matrix manipulations, such as product, transpose, inverse, and so forth. So, let's see some of these. Arithmetic operators on arrays apply elementwise. So let's create a couple of arrays. a is np.array, I'll just pass in a list 10, 20, 30, 40, and then b, we'll do np.array([1, 2, 3, 4]). Now let's look at a minus b. So c is equal to a minus b, and let's look at a times b, so d equals a times b. So, with arithmetic manipulation, we can convert current data to the way we want it to be. So here's a real world problem that I faced. I moved down to the United States about six years ago from Canada. In Canada, we use Celsius for temperatures, and my wife still hasn't converted to the US system, which uses Fahrenheit. With numpy I could easily convert a number of Fahrenheit values, say, the weather forecast, to Celsius for her. So, let's create an array of typical Ann Arbor winter Fahrenheit values. So, fahrenheit equals np.array, and sometimes it'll be zero degrees Fahrenheit, maybe minus 10, minus 5, minus 15, or 0. These are not atypical Ann Arbor winter values. And the formula for conversion is the temperature in fahrenheit minus 32, times 5 over 9, and this gives you the temperature in celsius. So, we'll just say celsius equals fahrenheit minus 32, times 5 over 9, and let's look at celsius. Okay, great, so, now she knows it's a little chilly outside this week, but it's not so bad. Another useful and important manipulation is the Boolean array. We can apply an operator on an array, and a Boolean array will be returned, with True being emitted for any element in the original that meets the condition. For instance, if we want a Boolean array checking which Celsius values are greater than minus 20 degrees, we just say celsius greater than minus 20. And there's our Boolean array, True, False, False, False, and True.
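The elementwise arithmetic and the temperature conversion above look like this in code:

```python
import numpy as np

a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])
c = a - b   # elementwise subtraction: [ 9 18 27 36]
d = a * b   # elementwise multiplication: [ 10  40  90 160]

# Fahrenheit to Celsius: (F - 32) * 5/9, applied to every element at once
fahrenheit = np.array([0, -10, -5, -15, 0])
celsius = (fahrenheit - 32) * (5 / 9)
print(celsius)

# A comparison returns a Boolean array, one True/False per element
print(celsius > -20)   # [ True False False False  True]
```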
Here's another example: we could use the modulus operator to check the numbers in the array to see if they're even, so, celsius mod 2 equals zero. Besides elementwise manipulation, it's important to know that numpy supports matrix manipulation. Let's look at the matrix product. If we want the elementwise product, we use the asterisk sign. So a equals np.array, we'll create an array here, and b equals np.array, and we'll create it. And let's print out the product of a times b. If we want to do the matrix product, we use the @ sign instead of the asterisk. So the asterisk is for elementwise, and this is really important. You can think of elementwise comparisons or modifications as the default. But the @ sign is going to use the dot product, so we'll print a @ b. Now, you don't have to worry about complex matrix operations for this course, but it's important to know that numpy is the underpinning of scientific computing libraries in Python, and that it's capable of doing both elementwise operations, the asterisk, as well as matrix-level operations, the @ sign. And there's more on this in subsequent courses. So a few more linear algebra concepts are worth layering in here. You might recall that the product of two matrices is only possible when the inner dimensions of the two matrices are the same. The dimensions refer to the number of elements, both horizontal and vertical, in the rendered matrices that you've been seeing here. So, we can use numpy to quickly see the shape of a matrix. a.shape, for instance, will tell us this one is a two by two matrix. When manipulating arrays of different types, the type of the resulting array will correspond to the more general of the two types. This is called upcasting, and you saw an example of that before, but let's see another one.
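Here's the asterisk-versus-@ distinction with a pair of small matrices (the particular values are just for illustration):

```python
import numpy as np

a = np.array([[1, 1], [0, 1]])
b = np.array([[2, 0], [3, 4]])

print(a * b)   # elementwise product: each cell multiplied independently
print(a @ b)   # matrix (dot) product: rows of a against columns of b

# The inner dimensions must match for @; .shape lets you check quickly
print(a.shape)   # (2, 2)
```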
So, let's create an array of integers. array1 equals np.array, we'll do 1, 2, 3 and 4, 5, 6, and let's print out its data type to make sure that it's actually integers. Now, let's create an array of floats. array2 equals np.array, 7.1, 8.2, 9.1, and then we'll give a second list, 10.4, 11.2 and 12.3. And let's print out its data type. So, integers, int, are whole numbers only, and floating point numbers, float, can have a whole number portion and a decimal portion. The 64 in this example refers to the number of bits that the operating system is reserving to represent the number, which determines the size, or the precision, of the numbers that can be represented. So, let's do addition for the two arrays. array3 is equal to array1 plus array2, and we'll print array3, and then we'll print array3.dtype. So, notice how the items in the resulting array have been upcast into floating point numbers. Now, numpy arrays also have interesting aggregation functions on them, such as sum, max, min, and mean. So, we can print out array3.sum(), array3.max(), array3.min(), and let's try array3.mean(). For two-dimensional arrays, we can do the same thing for each row or column. So, let's create an array with 15 elements, ranging from 1 to 15, with a dimension of 3 by 5. So b equals np.arange(1, 16), and we're going to reshape this immediately to 3 by 5, and let's print b. Now, we often think about two-dimensional arrays as being made up of rows and columns. But you can also think of these arrays as just giant ordered lists of numbers, and the shape of the array, the number of rows and columns, is just an abstraction that we have for a particular purpose. Actually, this is exactly how basic images are stored in computer environments. So, let's take a look at an example and see how numpy comes into play in something like images. For this demonstration, I'm going to use the Python Imaging Library, PIL, and a function to display images in the Jupyter notebook.
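Upcasting, the aggregation functions, and the arange-plus-reshape pattern together look like this (the per-row/per-column axis argument is an extra detail beyond the lecture):

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6]])          # integers
array2 = np.array([[7.1, 8.2, 9.1], [10.4, 11.2, 12.3]])  # floats

array3 = array1 + array2     # the ints are upcast to floats
print(array3.dtype)          # float64

print(array3.sum(), array3.max(), array3.min(), array3.mean())

# 15 elements from 1 to 15, reshaped immediately to 3 rows by 5 columns
b = np.arange(1, 16).reshape(3, 5)
print(b)
print(b.sum(axis=1))   # sum of each row:    [15 40 65]
print(b.sum(axis=0))   # sum of each column: [18 21 24 27 30]
```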
So, from PIL import Image, and from IPython.display import display. And let's just look at the image that I'm going to talk about. So, I'm just going to open this image called Chris.tiff and display it. Now, we can convert this PIL image to a numpy array. So, array equals np.array, and we just pass in the PIL image, and we're going to print array.shape, and then let's look at array. So, we see the shape is 200 by 200, and then we see the values are all uint8. The uint means that they're unsigned integers, so there are no negative numbers, and the 8 means 8 bits per value. This means that each value can be one of 2 to the 8th, that is 256, possible values. But the maximum is actually 255, because we start at zero. All right, that's computer science. For black and white images, black is stored as zero and white is stored as 255. So, if we just wanted to invert this image, we can use the numpy array to do so. Okay, so, let's create an array of the same shape. I'm going to create a mask, I'll call it, with np.full and array.shape, so I want it the same shape as our existing array, but I want it to be full of 255-valued uints. Let's take a look at mask. Okay, so, this is like zeros or ones, but you basically set whatever fill value you want everywhere. Now, let's subtract that to get the modified array. So, we'll create a modified array: we'll take our main array and we'll subtract the mask from it. Remember, this is going to do elementwise subtraction. So, for any value in the array, say a first value of 100, 100 minus 255 means that we'll be left with negative 155 in that cell. And we're going to do that for all cells in the array, and then let's just convert all of the negative values to positive values. So, modified array is equal to modified array times negative one. It's going to take that negative one and do, again, elementwise multiplication. So, negative one times the first cell, negative one times the second cell, and so forth.
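Since the Chris.tiff file from the lecture isn't included here, this sketch uses a tiny hand-made grayscale array as a stand-in (a real image would come from something like np.array(Image.open('Chris.tiff'))); the inversion logic is the same:

```python
import numpy as np

# A tiny 2x2 stand-in for the 200x200 grayscale image in the lecture
image = np.array([[0, 100], [200, 255]], dtype=np.uint8)

# np.full is like zeros or ones, but with any fill value you want.
# The Python int 255 makes mask a default integer array, so the
# subtraction below produces ordinary (possibly negative) integers
# rather than wrapping around at zero like pure uint8 math would.
mask = np.full(image.shape, 255)

inverted = (image - mask) * -1        # e.g. 100 - 255 = -155, then 155
inverted = inverted.astype(np.uint8)  # back to 8-bit pixel values
print(inverted)   # black (0) and white (255) have swapped places
```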
And as a last step, let's tell numpy to set the data type correctly. So, we're going to say modified array is equal to modified_array.astype. This is going to tell numpy, you should really trust us, we know what the data type is in here, and we're going to say np.uint8, and let's look at modified array. All right, so, that looks maybe as we expect. Lastly, let's display this new array. We can do this using the fromarray function in the Python Imaging Library to convert the numpy array into an object that Jupyter can render. So, display(Image.fromarray(modified_array)), cool, okay. Remember how I started talking about how this could just be thought of as a giant array of bytes, and its shape was an abstraction? Well, we could just decide to reshape the array and still render it. PIL interprets the individual rows as lines, so we can change the number of lines and columns if we want to. So, what do you think that would look like? Well, let's take a look. We'll create something new: reshaped equals np.reshape, and we're going to take our modified array, so this is just our array from above, and we're going to reshape it to 100 by 400. Now remember, it was 200 by 200. So, we're changing both the width and the height here, but we're keeping the total number of cells the same. And let's print out, just to convince ourselves that we've actually changed it, the reshaped shape, and then let's display it inline. All right, so, I can't say I find that particularly flattering. By reshaping the array to be only 100 rows high but 400 columns wide, we've essentially doubled the width of the image by taking every other line and stacking them out sideways. And this makes the image look more stretched out too, and maybe adds a little bit of weight to me. Now, this isn't an image manipulation course, but the point was to show you that these numpy arrays are really just abstractions on top of data.
And that data has an underlying format, in this case uint8. But further, we can build abstractions on top of that, such as computer code which renders a byte as either black or white, which has meaning to people. And in some ways, this whole degree is all data and the abstractions that we can build on top of that data, from individual byte representations, through to complex neural networks of functions or interactive visualizations. Your role as a data scientist is to understand what the data means, its context in a collection, and to transform it into a different representation to be used for sense-making. Okay, let's get back to the mechanics of numpy. So, indexing, slicing, and iterating are extremely important for data manipulation and analysis, because these techniques allow us to select data based on conditions, and copy or update the data. First we're going to look at integer indexing. A one-dimensional array works in similar ways to a list. To get an element in a one-dimensional array, we just use an offset index. So, we'll create some array, a bunch of elements, and then we'll say a sub 2 and we get the value 5 out. For multidimensional arrays, we need to use integer array indexing. So, let's create a new multi-dimensional array, this one three by two, and let's look at a. If we want to select one certain element, we can do so by entering the index, which is comprised of two integers, the first being the row and the second being the column. So, a sub 1 comma 1, and remember, in Python we're starting at zero. All right, so there is the value of 4. If we want to get multiple elements, for example one, four and six, and put them into a one-dimensional array, we can enter the indices directly into the array function. So, we can create some new array, np.array, and in that, we're going to pass it a list. And for that list, we're actually taking our other array and plucking out the values that we're interested in.
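The integer indexing examples above, assuming the three-by-two matrix implied by the values the lecture picks out (1, 4 and 6), look like this:

```python
import numpy as np

# One-dimensional indexing works like a list
a = np.array([1, 3, 5, 7])
print(a[2])   # 5

# Two-dimensional: [row, column], counting from zero
m = np.array([[1, 2], [3, 4], [5, 6]])
print(m[1, 1])   # 4

# Pluck out several elements into a new one-dimensional array
picked = np.array([m[0, 0], m[1, 1], m[2, 1]])
print(picked)    # [1 4 6]
```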
We can also do that using another form of array indexing, which essentially zips the first list and the second list up. So, we can take a and we can actually pass it two lists, and it will zip these values up for us, and so we get the one, four and six. Boolean indexing allows us to select arbitrary elements based on conditions. For example, in the matrix that we just talked about, we may want to find elements that are greater than five. So, we set up a condition, a greater than five. So, let's just print: what is a greater than five? This returns a Boolean array showing, for each corresponding index, whether the value is greater than five. And so, here we get a bunch of Falses and a bunch of Trues. We can then place this array of Boolean values like a mask over the original array to return a one-dimensional array of the values relating to the True positions. So, if we do a sub a greater than five, what's happening here is it will take the greater than five operator and broadcast that, comparing across all of the elements of a, creating a new matrix, and then apply that as a mask over the outer a and emit the results. As we'll see, this functionality is essential in the Pandas toolkit, which is the bulk of this course, so we'll be using this a lot. So, slicing is a way to create a sub-array based on the original array. For a one-dimensional array, slicing works in similar ways to a list. To slice, we use the colon. For instance, if we put colon three in the indexing brackets, we get the elements from index zero to index three, so, remember, excluding index three. So, we'll create some array, just numbers zero through five, so that's six numbers, and then we'll print a sub colon three. And so, we just get the first few elements, the first three. By putting two colon four in the brackets, we can get the elements from index two to index four.
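Here are the zipped-lists form, Boolean masking, and one-dimensional slicing together:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Integer array indexing: rows [0,1,2] zipped with columns [0,1,1]
print(a[[0, 1, 2], [0, 1, 1]])   # [1 4 6]

# Boolean indexing: the comparison produces a mask, and applying the
# mask flattens the selected values into a one-dimensional array
print(a > 5)        # True only where the value exceeds 5
print(a[a > 5])     # [6]

# One-dimensional slicing, like lists: start inclusive, stop exclusive
b = np.arange(0, 6)   # [0 1 2 3 4 5]
print(b[:3])          # [0 1 2]
print(b[2:4])         # [2 3]
```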
So, again, excluding index four. So print a sub 2 colon 4, and that will give us just those indexes. For multi-dimensional arrays, it works similarly. Let's see an example. a equals np.array and we're going to create a new list, and in here we're going to put three lists. So, let's take a look at this, and we see that we've got three rows and four columns. First, if we put one argument in the brackets, for example a sub colon two, then we get all of the elements from the first two rows, the zeroth and the oneth. If we add another argument, for example a sub colon two comma one colon three, we're going to get the first two rows, but then the second and third column values only. So, let's give that a try: a, so we want all of the first two rows, and then we want one colon three. So, in multi-dimensional arrays, the first argument is for selecting rows and the second argument is for selecting columns. It's important to realize that a slice of an array is a view into the same data. This is really important; this is often called passing by reference. So, modifying the sub-array will consequently modify the original array as well. Here I'll change the element at position (0,0), which is 2, to 50. Then we can see that the value in the original array is changed to 50 as well. So I'm going to create a sub-array, and I'm going to just use what we used before. So a sub colon two, so grab a couple of rows, and then one colon three, so grab a couple of columns, and I'll print out the value of sub_array sub 0,0, and then I'm going to set that to 50. I'm going to change it to 50, and then we'll print out what the sub-array thinks it is. So, that should be 50, but then we're actually going to print our original array, a, as well. And remember, in a this position is zero comma one, because we've changed which columns we've taken out for our sub-array.
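The view-versus-copy behavior described above can be seen concretely (the particular 3-by-4 values are chosen so the element at the sub-array's (0,0) position is 2, matching the lecture):

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

sub = a[:2, 1:3]    # first two rows, columns 1 and 2
print(sub)          # [[2 3] [6 7]]

# A slice is a VIEW into the same data, not a copy
sub[0, 0] = 50
print(a[0, 1])      # 50 -- the original array changed too

# Use .copy() when you want an independent sub-array instead
safe = a[:2, 1:3].copy()
```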
So, when we took our sub-array and we did one colon three, we got rid of the 0th column of a. So, the zeroth column in the sub-array is the first, or oneth, column in a. Okay, now that we've learned the essentials of numpy, let's use it on a couple of datasets. Here we have a very popular dataset on wine quality, and we're going to only look at red wines. The data fields include fixed acidity, volatile acidity, residual sugar, chlorides and so forth. The important ones here are the alcohol content and the quality; that's how I buy wine anyway. To load a dataset into numpy, we can use the genfromtxt function. We can specify the data file name, the delimiter, which is optional but we often use it, and the number of rows to skip if we have a header row, which is one here. The genfromtxt function has a parameter called dtype for specifying data types for each column, and this parameter is optional. Without specifying the type, all types will be cast to the more general or precise type, so there will be some inference done. So, wines equals np.genfromtxt, we'll take in our CSV file, we'll set our delimiter to semicolon, and we're going to skip our header row, and let's take a look at wines. All right, so a whole bunch of data here about wines. So, recall that we can use integer indexing to get a certain column or row. For example, if we wanted to select the fixed acidity column, which is the first column, we can do so by entering the index into the array. Also remember that for multi-dimensional arrays, the first argument refers to the row and the second argument refers to the column. And if we just give one argument, then we'll get a single-dimensional list back. Okay, so all rows are included, but only the first column from them. We do that by using the integer zero for slicing. So, wines sub colon, that means we want all rows, right? We haven't given any numbers to that parameter. Comma zero, we just want to get the first column.
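Since the wine-quality CSV file itself isn't bundled here, this sketch feeds genfromtxt a tiny inline stand-in with the same semicolon-delimited, header-row format (the column subset and values are illustrative, not the real dataset):

```python
import io
import numpy as np

# A two-row stand-in for the semicolon-delimited red wine CSV
csv = io.StringIO(
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.4;0.7;9.4;5\n"
    "7.8;0.88;9.8;5\n"
)
# delimiter matches the file; skip_header=1 skips the header row;
# with no dtype given, numpy infers a common (float) type
wines = np.genfromtxt(csv, delimiter=";", skip_header=1)
print(wines)
print(wines[:, 0])   # all rows, first column only (fixed acidity)
```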
But if we want to get the same values, but preserve that they sit in their own rows, we can instead write wines sub colon, and then the second argument as zero colon one. So, take a look at these two statements for a moment before we run them. In the first one, we say we want one column, the 0th column. In the second statement, we say we want all the columns between index zero and one, which happens to only be the 0th column, because we never include the end. But the result that numpy gives us after we execute them actually looks different. All the numbers are the same, but the first one gives us a single list of numbers, and the second one preserves the general shape, that this is a single column. So, this is another great example of how the shape of data is actually just an abstraction, which we can layer intentionally on top of the data that we're working with. If we want a range of columns in order, say, columns zero through three, and recall this means the first, second and third columns, since we start at zero and we don't include the trailing index value, we could do that too. So, wines sub colon comma zero colon three. And what if we want several non-consecutive columns? Well, we can place the indices of the columns that we want into a list and pass that list as the second argument. So here's an example: we can take wines, we want all rows, so colon, and then our second argument is actually a list of the indexes that we're interested in. So, we can also do some basic summarization of this dataset. For example, if we wanted to find out the average quality of red wine, we can select the quality column. We could do this in a couple of ways, but the most appropriate is to use the minus one value for the index, as negative numbers mean slicing from the back of the list. And then we just call the aggregation functions on this data. So, we could say wines, the first parameter is colon, the second one is minus one, because we just want the last column, and then we'll take the mean.
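The shape difference between those two statements, plus the range, list-of-indices, and negative-index selections, can be seen on a small stand-in array:

```python
import numpy as np

# A tiny stand-in for the wines array (values are illustrative)
wines = np.array([[7.4, 0.70, 9.4, 5.0],
                  [7.8, 0.88, 9.8, 5.0]])

print(wines[:, 0].shape)    # (2,)   -- a flat list of numbers
print(wines[:, 0:1].shape)  # (2, 1) -- same values, still a column

print(wines[:, 0:3])        # a range of columns: 0, 1 and 2
print(wines[:, [0, 2]])     # non-consecutive columns via a list of indices

print(wines[:, -1].mean())  # mean of the last column (quality)
```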
And just pause here for a minute and think: do you understand what the minus one is? If not, you'll want to revisit some of the basics on Python list and string slicing. All right, so 5.6. Let's take a look at another dataset, this time on graduate school admissions. It has fields such as GRE score, TOEFL score, university rating and so forth, and it has a chance of admission at the end. With this dataset, we can do data manipulation and basic analysis to infer what conditions are associated with higher chances of admission. So, let's take a look. We can specify data field names for genfromtxt to use as it loads the CSV data. And also, we can have numpy try and infer the type of each column by setting the dtype parameter to None, as we've seen. So graduate_admission equals np.genfromtxt, we load the data from dataset/admission-predict.csv. We'll set the dtype to None, we're going to set the delimiter to a comma, we're going to skip our header, and instead we're going to pass it the actual names of the columns. So, I'm going to write them here: serial number, GRE score, TOEFL score, university rating, SOP, letters of recommendation, CGPA, research, and the chance of admission. And let's take a look at what graduate_admission looks like. So, notice that the resulting array is actually a one-dimensional array with 400 tuples in it. Let's look at the shape: it's actually got four hundred tuples, and it's just one dimension. So, we can retrieve a column from the array using the column's name. For example, let's get the CGPA column and only the first five values. We'll take graduate_admission sub CGPA; this tells us we only want that one column, and then we want the first five values, so zero colon five. Since the CGPA in the dataset ranges from one to ten, and in the US it's more common to use a scale of four, a common task might be to convert the GPA by dividing by ten and then multiplying by four.
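Again using a small inline stand-in for the admissions CSV (the rows and field names here are illustrative, not the real file), here is the named-columns, dtype-inference version of genfromtxt and the CGPA rescaling:

```python
import io
import numpy as np

# A two-row, comma-delimited stand-in for the admissions data
csv = io.StringIO(
    "1,337,118,9.65,1,0.92\n"
    "2,324,107,8.87,1,0.76\n"
)
# dtype=None lets numpy infer each column's type; names labels the
# columns, producing a one-dimensional structured array of tuples
grads = np.genfromtxt(
    csv, dtype=None, delimiter=",",
    names=("Serial_No", "GRE_Score", "TOEFL_Score",
           "CGPA", "Research", "Chance_of_Admit"))

print(grads.shape)      # (2,) -- one dimension of tuples
print(grads["CGPA"])    # retrieve a column by its name

# Rescale the 10-point CGPA to a 4-point scale, in place
grads["CGPA"] = grads["CGPA"] / 10 * 4
print(grads["CGPA"])
```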
So, graduate_admission sub CGPA is equal to graduate_admission sub CGPA divided by ten, times four. And let's look at the output of, say, 20 values there. It's important to keep in mind that we've actually changed this data now, right? Since we assigned to graduate_admission sub CGPA, that changes it on the underlying array. So we've normalized this to a four-point scale. Remember Boolean masking? Well, we can use this to find out how many students have research experience, by creating a Boolean mask and passing it to the array indexing operator. So, we'll take graduate_admission sub research and compare it to one. If it's one, a True will be emitted, otherwise a False will be emitted. That creates us a mask, which we then pass into graduate_admission using the indexing operator. So, that will emit only the values where the mask is True and drop all of the False positions. And then we just calculate the length of that array, the same as if it were a list. Now, since we've got the data field chance of admission, which ranges from zero to one, we can try and see if students with a high chance of admission, let's say above 80%, on average have higher GRE scores than those with a lower chance of admission, let's say below 40%. So first we're going to use Boolean masking to pull out only those students that we're interested in, based on their chance of admission, and then we pull out only their GRE scores, and then we're going to print the mean values. So let's print the values. We'll take graduate_admission sub chance of admit compared to 0.8, we're going to broadcast this out, take the GRE score, and look at the mean. And we're going to do the same thing for the 0.4. So, take a moment here to reflect: do you understand what's happening in the calls above? When we do the Boolean masking, we are left with an array with tuples in it still.
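Here's that Boolean masking on a structured array, using a hand-made four-student stand-in for the admissions data (scores and thresholds are illustrative):

```python
import numpy as np

# A small structured array standing in for the admissions dataset
grads = np.array(
    [(337, 118, 9.65, 1, 0.92),
     (316, 104, 8.00, 0, 0.72),
     (322, 110, 8.67, 1, 0.80),
     (314, 103, 8.21, 0, 0.35)],
    dtype=[("GRE_Score", "i8"), ("TOEFL_Score", "i8"),
           ("CGPA", "f8"), ("Research", "i8"), ("Chance_of_Admit", "f8")])

# How many students have research experience? Mask, apply, count.
print(len(grads[grads["Research"] == 1]))   # 2

# Do likely admits have higher GRE scores than unlikely ones?
high = grads[grads["Chance_of_Admit"] > 0.8]   # still an array of tuples
low = grads[grads["Chance_of_Admit"] < 0.4]
print(high["GRE_Score"].mean())
print(low["GRE_Score"].mean())
```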
And underneath, numpy holds a list of the columns we specified, with their names and indexes. And so, we can do graduate_admission sub, graduate_admission sub chance of admit greater than 0.8. In this case, we're creating the Boolean mask and then applying it to the graduate_admission data, and we see that the output is actually still in tuples, because there are many different columns being kept here. And this, I think, is a little bit more clear in Pandas, which we'll talk about later in this course. So, let's also do this with CGPA. I'm just going to copy and paste the above, but change the values to CGPA. Well, I guess one could have expected this: the CGPA and the GRE for students who have a higher chance of being admitted, at least based on our cursory look here, seem to be higher. So that's a bit of a whirlwind tour of numpy, the core scientific computing library in Python. Now, you're going to see a lot more of this kind of discussion using this library, and we'll be focusing in this course on Pandas, which is actually built on top of numpy. Don't worry if this didn't all make sense the first time through; we're going to dig in a lot more over the next couple of weeks with pandas. But it's useful, and the point of this lecture is so that you know that underneath, numpy is used as a library, and the capabilities of numpy are available to you within Pandas as we go forward.