In this video, we introduce you to matplotlib, the state-of-the-art plotting library for Python, and to a set of diagrams that are very useful for exploratory data analysis. Matplotlib is the de facto standard for plotting in Python. It is open source and under active development in the Python community. Throughout this course we will use matplotlib and Python for plotting.

One important aspect when plotting is data size. Plotting libraries run on a single machine and expect a rather small input dataset, in the form of vectors or matrices, in order to render the plot. So once you have too much data, you run into memory or performance problems. The solution is sampling. Sampling takes only a subset of your original data, but due to the inherent randomness in selecting the values actually returned, sampling preserves most of the properties of your original data. Sampling also reduces cost, because in downstream data processing steps only a fraction of the data has to be considered.

So let's start with our first diagram type, box plots. Box plots show many statistical properties of your data at the same time: mean, standard deviation, skew, and outlier content. Basically, a box plot looks a bit like the histogram introduced in the last module, seen from a vertical perspective. It tells you about the distribution of your data. So let's actually create such a box plot using matplotlib. Let's start with the already created DataFrame from week two. Assume we want to obtain some insights into how the voltage behaves. Using Spark SQL, we issue a SQL query to get the voltage values. Note that this virtual table also contains data from other sensors, so some rows might not contain values for voltage. Therefore, we select only rows which contain a value for voltage. Now we have obtained the DataFrame for voltage, so let's see what it exactly contains. It seems to be a list of values. But wait, it seems that the values are somehow wrapped in Row objects.
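The lecture runs this query via Spark SQL on the registered virtual table. As a minimal, self-contained sketch of the same NULL-filtering logic, here is the query expressed with Python's built-in sqlite3 module; the table name `washing` and its sample rows are hypothetical stand-ins for the course's sensor data:

```python
import sqlite3

# Stand-in for the Spark SQL virtual table: rows from other sensors
# have no voltage value, i.e. voltage is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE washing (ts INTEGER, voltage INTEGER)")
conn.executemany(
    "INSERT INTO washing VALUES (?, ?)",
    [(1, 230), (2, None), (3, 227), (4, None), (5, 232)],
)

# Select only rows that actually contain a value for voltage.
rows = conn.execute(
    "SELECT voltage FROM washing WHERE voltage IS NOT NULL"
).fetchall()
print(rows)  # the NULL rows from other sensors are filtered out
```

In Spark the equivalent would be `spark.sql("SELECT voltage FROM washing WHERE voltage IS NOT NULL")`, which returns a DataFrame of Row objects rather than plain tuples.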
As previously mentioned, DataFrames are wrappers around RDDs, so we now access the wrapped RDD and use the RDD API in order to extract the values contained in the Row wrapper objects. So again, we use a lambda function that receives the individual instances of the Row wrapper objects, and in the lambda function we can directly access the wrapped value. Let's check if this works by looking at the first ten results.

Now we apply the most important function of this week, so please make sure you understand why we are doing this. We are using the sample function in order to obtain a random fraction of the original data. This is not strictly necessary here, but imagine that this DataFrame could potentially contain trillions of rows and petabytes of data. There is no way to pass such an amount of data to a plotting library, since the plotting code is always executed on a single machine. In this case, we take a random fraction of ten percent, but if you really have a lot of data, then 0.01, or 0.001, or even less would be appropriate. As a rule of thumb, no more than a few hundred data points should be plotted.

Now, of course, we can call collect on the RDD without any problems, because the sample function only returned a subset of the data. Let's assign the resulting array to a variable called result_array in order to show that it is a plain, old Python data type instead of an RDD. Then we print the contents to the screen by running the notebook. This looks fine. This is a subset of all voltage values coming from the Cloudant NoSQL database. Now we have a meaningful array containing integer values reflecting the voltage of the power source of a washing machine at different points in time, accessible to the local driver as a Python array. So let's plot it. But first, we have to configure the Jupyter Notebook to display images generated by matplotlib directly inline with the code, by setting the inline parameter.
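The sampling step above can be sketched locally. This is not Spark code; it is a plain-Python illustration of what `rdd.sample(False, fraction)` does conceptually: each element is kept independently with the given probability, so the sample preserves the overall distribution. The voltage values are hypothetical stand-ins.

```python
import random

def sample(data, fraction, seed=42):
    """Keep each element independently with the given probability,
    mimicking rdd.sample(False, fraction) on a local list."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

# Stand-in for a dataset far too large to plot directly.
voltages = [226, 228, 229, 230, 231] * 2000  # 10,000 rows

# Like rdd.sample(False, 0.1).collect(): a ten percent random fraction.
result_array = sample(voltages, 0.1)
print(len(result_array))  # roughly 1,000 of the 10,000 rows survive
```

Because only a fraction survives, calling collect afterwards is safe: the driver receives a small, plain Python list rather than the full dataset.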
This is not Python code but an instruction sent to the Jupyter Notebook. Now we import the matplotlib library, and we use the so-called box plot to get a first idea about our data, passing the previously generated result_array as parameter. Now we generate the plot. This is a very powerful diagram because it allows us to see multiple properties, or statistical moments respectively, on a single chart. We can see that the mean is around 228. We get some idea about the standard deviation through the so-called interquartile range, where 50% of all data lies. Finally, outliers are shown as individual points on the plot.

Run charts are another way of visualizing data, maybe the most common way you can think of. You might already have seen run charts on stock market data, and guess what, stock market data is time series data, just like IoT sensor data. Run charts always define the x-axis as time, progressing from left to right. On the y-axis we can see the observed value over time. Sometimes there are also multiple values, because you could plot multiple dimensions using different colors. This works especially well if the value ranges of each dimension are similar.

So for creating a run chart, we need to fetch an additional dimension from our dataset, namely the time dimension. So let's copy the SQL statement from the previous example and modify it accordingly. First, we have to include the timestamp. By the way, in relational algebra, this is called a projection list. We are never guaranteed to receive and store data in the order it has been generated, so let's sort by timestamp. The timestamp refers to what we call event time, the time the actual measurement occurred, in contrast to processing time, the time the data point was entering and being stored in the cloud. Now, let's sample first, because this reduces the processing time of all subsequent steps, since the amount of data is less.
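The box plot step can be sketched as follows. This is a minimal, self-contained example; the voltage values are hypothetical stand-ins for result_array, and the off-screen Agg backend replaces the notebook's `%matplotlib inline` so the script runs outside Jupyter:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; in the notebook %matplotlib inline does this job
import matplotlib.pyplot as plt

# Hypothetical sampled voltages standing in for result_array.
result_array = [226, 227, 228, 228, 229, 230, 231, 228, 205, 252]

fig, ax = plt.subplots()
ax.boxplot(result_array)   # box = interquartile range, line inside = median,
ax.set_ylabel("voltage")   # points beyond the whiskers = outliers
fig.savefig("voltage_boxplot.png")
```

In the notebook one would simply call `plt.boxplot(result_array)` followed by `plt.show()` and the chart appears inline below the cell.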
The SQL statement now returns two columns, therefore we have to unwrap two values from the Row objects and create a tuple out of them. Let's have a look at the result. As we can see, we've obtained a list of tuples containing timestamp and voltage. This is nearly what we want, so let's turn those into two Python arrays. We store the new RDD in a variable because we have to process it twice. One array we want to obtain is result_array_voltage, by simply taking the RDD and using a map function to flatten the tuple down to scalar values. In this case, we just use voltage and get rid of timestamp. By adding the array variable as the last line, we make the notebook print the contents. Nice. Now we do the same for the timestamp. Only two modifications are necessary: first, we have to change the name of the variable; second, we have to access the other value of the tuple. Again, let's run a little test. Okay, we are done.

Now we can plot a run chart. In fact, this is really simple. We just use the already imported matplotlib and pass both arrays as parameters to the plot function. We start with the timestamp; this will be the x-axis. Then we take the voltage as the y-axis. To avoid confusion, let's put labels on both axes, x first and then y. Finally, we generate the plot. And this is it. Somehow, mean, standard deviation, and outliers can be estimated as well, but with much less precision than when using a box plot. The run chart, in contrast, preserves the time domain, therefore we can easily see that we have missing data for a specific interval. This is totally fine, since we have to cope with missing and incorrect data as well. In this case, I've simply stopped and restarted the test data generator, but in real life this could also mean that a sensor device was offline for a moment.

Scatter plots don't draw lines; they put individual data points onto a two- or three-dimensional data space.
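The unwrap-and-plot steps can be sketched like this. The (timestamp, voltage) tuples are hypothetical stand-ins for the sampled RDD contents, and the list comprehensions play the role of the two `rdd.map(...)` lambda functions from the lecture:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen; the notebook uses %matplotlib inline instead
import matplotlib.pyplot as plt

# Hypothetical (timestamp, voltage) tuples standing in for the sampled RDD.
# Note the gap between timestamps 4 and 6: missing data shows up in a run chart.
result = [(1, 230), (2, 228), (3, 231), (4, 229), (6, 230)]

# Same logic as rdd.map(lambda row: row[1]) and rdd.map(lambda row: row[0]):
# flatten each tuple down to one scalar per array.
result_array_voltage = [row[1] for row in result]
result_array_ts = [row[0] for row in result]

fig, ax = plt.subplots()
ax.plot(result_array_ts, result_array_voltage)  # x = time, y = observed value
ax.set_xlabel("timestamp")
ax.set_ylabel("voltage")
fig.savefig("voltage_runchart.png")
```

In the notebook, `plt.plot(result_array_ts, result_array_voltage)` plus the two label calls and `plt.show()` produce the same chart inline.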
Each point reflects a row in the dataset, and each dimension a column. Once plotted, this chart can be used for defining classification boundaries between two IoT events, for example, or clusters of similar data, or for detecting outliers from normal behavior. Let's paste the code from the previous example. We want to plot values for hardness, temperature, and flow rate into one single chart. Each row is represented as one point. Let's add the columns to the projection list. Again, we have to exclude the null values, and we have to do this for all the columns. This includes flow rate as well. Again, we have to unwrap the Row objects. Let's create a Python array for the hardness column. We unwrap hardness from the Row objects and repeat this for the two remaining columns as well. Seems to work, so let's create a plot. Lazy as I am, I'm using an already prepared skeleton for the plotting code. I just have to pass the three arrays to the scatter function of the mplot3d library. To make it more meaningful, we set the labels for each axis accordingly. Let's plot.

We can see two things immediately. First, hardness is very stable within a certain value range, but has some remarkable outliers. Second, the points form a plane; this means that temperature and flow rate are highly correlated. This can be by chance or by some real physical effect. In the former case, we also talk about spurious correlation, because it is not caused by a physical effect.

Finally, let's take another look at histograms. Using histograms, we can get an idea of the distribution of values within a single dimension. We can find value regions of high and low frequency within that particular dimension. In order to explain what I mean, let's jump into Spark. You simply use the hist function of matplotlib and pass an array of values to it. As in the scatter plot, we clearly see that hardness values are concentrated around 80, with a very low number of exceptions.
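Both the 3D scatter plot and the histogram steps can be sketched in one short script. The three arrays are hypothetical stand-ins for the unwrapped hardness, temperature, and flow-rate columns, with flow rate deliberately constructed to correlate with temperature so the points form a plane:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering instead of %matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the "3d" projection

# Hypothetical arrays standing in for the three unwrapped columns.
hardness = [80, 81, 79, 80, 120, 80]            # stable, with one outlier
temperature = [90, 95, 100, 105, 100, 110]
flowrate = [9.0, 9.5, 10.0, 10.5, 10.0, 11.0]   # correlated with temperature

# 3D scatter plot: one point per row, one axis per column.
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(hardness, temperature, flowrate)
ax.set_xlabel("hardness")
ax.set_ylabel("temperature")
ax.set_zlabel("flowrate")
fig.savefig("scatter3d.png")

# Histogram of a single dimension: regions of high and low frequency.
fig2, ax2 = plt.subplots()
ax2.hist(hardness)
ax2.set_xlabel("hardness")
fig2.savefig("hardness_hist.png")
```

The scatter call takes the three arrays positionally as x, y, and z, which is essentially what the prepared skeleton in the lecture does with the unwrapped columns.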
Temperature, in contrast, is very evenly distributed, with a peak around 100. As you have seen, we can visualize one, two, or three dimensions at the same time. Many properties we've learned about in week three can be visually recognized and estimated using plots. Every plot we've covered has a clear purpose, but often playing around with many of them at the same time, using different configurations, pre-aggregations, and dimensions, gives you the best insight into how your data looks. So far, the maximum number of dimensions we could explore has been three. Although we are quite limited by this number, there are other techniques to reduce the number of dimensions of a dataset besides just ignoring them. So let's have a look at how this works.