In this lecture, we're going to take a look at data analysis and AI for IoT networks. As we've discussed previously, we're looking at different design perspectives. And what we see when we're looking at large-scale IoT networks that are creating a lot of data is that AI and machine learning become more and more part of the decision process that's needed to work with the data from such networks. So I'd like to consider some of the characteristics of the data that flows through these types of networks, and some of the tools that we might use, whether they're big data tools that we might use in the cloud, or analytical tools that we would use at the edge devices. So we'll take another look at data again here. And what we see is that, as data flows through an IoT network, it is both structured and unstructured. Structured data is data that ends up in a database or in a spreadsheet. It has a clear structure: we know what each one of the elements is for, we know why we're storing it, it has a reason for being in the structure that it's in. But 80% of the data that flows through these networks is unstructured. These are things that are in text files, or in speech or images or video. If we want to use those as part of our analysis, we really have to apply some fairly advanced techniques, such as machine learning and natural language processing, in order to get our information out of these types of systems. And certainly, IoT network devices generate both of these types of data. Is the data in motion or is it at rest? In an IoT network, often, especially at the edge, we'll see data that's coming in from client-server connections, from sensors that are flowing into bridges. This data is in motion, and we have to deal with it in motion. We may have to analyze it as it's flowing through those devices.
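To make the "data in motion" idea concrete, here's a minimal Python sketch of analyzing readings as they stream past, keeping only a small sliding window in memory rather than storing everything. The function name, window size, and threshold here are illustrative assumptions, not part of any particular IoT toolkit.

```python
from collections import deque

def stream_monitor(readings, window=5, threshold=2.0):
    """Flag readings that deviate sharply from a sliding-window mean.

    Illustrates analyzing data 'in motion': each reading is examined
    as it arrives, and only a small window is kept in memory.
    """
    recent = deque(maxlen=window)
    flagged = []
    for value in readings:
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                flagged.append(value)
        recent.append(value)
    return flagged

# Example: a temperature spike stands out against recent history.
print(stream_monitor([20.1, 20.3, 20.2, 20.4, 20.2, 27.5, 20.3]))  # → [27.5]
```

In a real device the loop body would run inside whatever callback delivers sensor samples, but the shape is the same: bounded memory, decision per arrival.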
In many cases, especially prior to the IoT, we had the luxury that the data would be at rest, that the data we were dealing with would show up in a storage array somewhere, in a database, on a hard drive. It would be relatively easy for us to take a snapshot of that data and work with it, and we still do that. Certainly for big data analysis, we're looking at data coming into these types of storage arrays, and we're working with that data to make decisions based on what we're seeing there. But more typically, these days, we're trying to push some of that intelligence down to the edge of the network in sort of a fog architecture and deal with it in motion. When we're working with it in a big data setting, Hadoop is a tool that you've probably heard of for doing stored data analysis. We'll talk a little bit about Hadoop's role in these types of systems just for your own information. In an IoT system, the analytics have different roles. Sometimes it's strictly descriptive analysis: fairly low complexity, but potentially not all that much value in the data itself. As the complexity of the analysis increases, its value also increases. So as we start to do diagnostics, prediction, and prescriptive control, we start to see that the value of working with this complex data increases. The scale of the data and the volatility of the data all flow into the challenges that we have when we're working with data that's flowing through an IoT network. The value here could be that working with these volumes of streaming data will let us find things like anomalies or trends, or limits or changes. And if we can do those things in real time, clearly, that value increases. So the cloud vendors that work with these data sets continue to try to find ways to give us analysis tools that let us work at the edge, that let us work on large data sets that we gather, and provide visual dashboards of this data so that we can draw more value from it.
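As a rough illustration of pulling a trend, rather than a single anomaly, out of streaming data, the sketch below fits a least-squares slope to a batch of readings. The `min_slope` cutoff is an arbitrary assumption chosen for the example, not a standard value.

```python
def detect_trend(values, min_slope=0.5):
    """Least-squares slope over a batch of readings; a crude trend check.

    A slope above min_slope suggests a rising trend worth reporting
    upstream instead of shipping every raw sample to the cloud.
    """
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den
    return slope, slope > min_slope

slope, rising = detect_trend([1.0, 2.0, 3.0, 4.0, 5.0])
print(slope, rising)  # → 1.0 True
```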
The network analytics that we work with to understand whether or not a network is working appropriately also become a key element of an effective IoT network. The data generally, as we've talked about, gets analyzed as a big data collection, or closer to the edge in some sort of analysis at the device or at a bridge. And we'll talk through that on both sides. One of the things that is showing up in this discussion is machine learning and artificial intelligence, and you have to ask why that's starting to appear here. Again, what's happening is that we're working with larger and larger volumes of data, data from multiple sources that are looking at different aspects of the system. We start to get to a point where all this multivariate data has to be driven into some sort of model, so that we can start to draw conclusions from the data that we're gathering. AI tools like machine learning have been around for some time. But what's different now is, much like we discussed for wearables, that as connected networks grow, as the data volume increases, as the tools improve, and as Moore's law gives us better hardware to work with, the AI tools and the work required to draw conclusions from data using those tools become more and more useful to us. When we say AI, what we're really talking about is some sort of a process that looks at data in some form similarly to what a human would do. It almost models a human capacity for contemplating and judging, and the intent of what it's doing, when it's working with a set of data. The key essences of AI, intentionality, intelligence, and adaptability, are really what make these analysis routines more challenging. Machine learning is an AI method that's becoming more and more common, as we're looking for programs that can access this data and automatically identify rules and patterns to provide some kind of expected output or analysis.
When we look at machine learning, there are really two methods of setting up a machine learning system: one is through supervised learning, and the other is through unsupervised learning. We'll take a look at both of those. For supervised machine learning, here's an example: an IoT security system that's monitoring an area for human activity, maybe a video feed looking for something that's moving in its space. In a supervised machine learning system, we're going to take some known training data sets, and we're going to provide those to the system so that it can infer a function that can make predictions about classifying the data that comes in. Classification is a key ML use: it's trying to figure out whether a data sample belongs to a specific category. In this case, is there a person there or is there not? It differs a little bit from a statistical regression, where we're trying to take independent data to model some predicted or associated numeric outcome. And usually, that requires a statistician to develop models and check their goodness of fit, what type of error there is, and so on. The machine learning system is taking its training data set and doing a good part of that automatically. There are also unsupervised machine learning algorithms that are used when the data for training isn't really as classifiable or as easy to label, where the system has to infer a function using some hidden structure in a set of data. An example of this might be trying to identify defects in a factory, where you're just looking at all the available parameters that are coming from the factory: environmental data, behavior measures of certain machines, and so on. When you discover a defect, you take a snapshot, perhaps, of this environmental data. And you can start to see what associations there might be in this multivariate data that apply to these defects that you're looking for. So again, this is a more organic approach to taking this cloud of data and trying to pull some meaning from it.
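A toy version of the supervised classification idea can be written in a few lines: take labeled training vectors, learn a per-class centroid, and classify new samples by the nearest centroid. A real system would use a proper ML library and far richer features; the (motion energy, heat signature) feature pair here is a made-up assumption for the security-camera example.

```python
def train_centroids(samples):
    """Supervised learning at its simplest: average the labeled
    training vectors per class (e.g. 'person' vs 'empty')."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, f in enumerate(features):
            acc[i] += f
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical features: (motion energy, heat signature)
training = [((0.9, 0.8), "person"), ((0.8, 0.9), "person"),
            ((0.1, 0.2), "empty"), ((0.2, 0.1), "empty")]
model = train_centroids(training)
print(classify(model, (0.85, 0.7)))  # lands near the "person" centroid
```

The unsupervised factory-defect case would drop the labels and instead cluster the multivariate snapshots, then look at which clusters co-occur with defects.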
And certainly, the algorithms and the methods here are both CPU-intensive and data-intensive, and so in a large IoT network, you can see where these things can start to come into play. So again, machine learning in an IoT system is usually used because you have data coming in that requires either some kind of local learning at the edge, or remote learning where you're taking your collected data and trying to draw information from it in the cloud. Typically, we would use this type of data from machine learning algorithms and these learning models to monitor systems, to control behavior, to optimize the operation of the system, or at the highest end, to let a system self-heal or self-optimize when it starts to see issues. When we think about the type of analysis that we would do with big data, we're really talking about categorizing that data in different ways. How quickly does the data show up, what's the velocity of its arrival? What's the variety of data that shows up? And what volume of data do we have to deal with? It can get into extremely massive amounts of data, and you can see that if you consider an IoT network where sensors are regularly picking up data and bringing it to the cloud. The types of data that we might analyze could be from IoT devices, machine data. It could be transactions, social data that perhaps we might be pulling from other networks, or enterprise data from different business sources. When we think about working with big data, we start to slip out of the scope of this class, and we start to think about tools for this, like Hadoop, which provides both the file systems that distribute these huge amounts of data and the distributed processing tools, like MapReduce, that let us do analysis in parallel across these massive data sets. Certainly, Hadoop is used for machine learning, data and text mining, and predictive analytics. AWS provides a set of these tools that can be applied.
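To show the shape of the MapReduce paradigm without Hadoop itself, here's a single-machine sketch: a map phase turns raw sensor log lines into key-value pairs, a shuffle groups them by key, and a reduce phase collapses each group to a statistic. In Hadoop, the map and reduce phases would run in parallel across many nodes; the `"device,value"` log format here is an assumption for the example.

```python
from collections import defaultdict

# Map phase: each raw log line becomes a (device_id, value) pair.
def map_phase(lines):
    for line in lines:
        device, value = line.split(",")
        yield device, float(value)

# Shuffle: group values by key, as MapReduce does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: collapse each device's values to a single statistic.
def reduce_phase(groups):
    return {device: max(values) for device, values in groups.items()}

logs = ["s1,21.0", "s2,19.5", "s1,22.4", "s2,18.9"]
print(reduce_phase(shuffle(map_phase(logs))))  # → {'s1': 22.4, 's2': 19.5}
```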
But again, this is sort of outside of the scope of the device prototyping. Although it may be part of your overall system design, we're going to focus a little bit more on the edge side, where we might actually apply some of these things in our own prototypes. If we start looking at using these tools at the edge, the benefits are that, one, we're reducing the overall data that has to come into the network. If we can start to look at the data at the edge, we can filter it, so that we're not flooding some cloud network with data; we're only pulling the most important things from what we're looking at. By doing analysis at the edge, we can also quickly respond to issues at the device, at the point where we really need the response. And where we're being sensitive to time, we're able to analyze data through the network and react to it as changing conditions occur. So again, there are a lot of benefits to trying to push some of those analytics toward the edge. There are a couple of different models that we'll talk about here, just as examples, and you'll want to dig into them if you want to take them for a spin. When we get into AWS in depth, we'll talk a little bit more about IoT Greengrass, which is essentially an edge runtime that lets us do a variety of analysis functionality. In this particular case, we're using Greengrass for machine learning inference. And so what we might do is collect some data, send it to the cloud to do some of the modeling of that data, and then bring that model back to the device to use live data to respond to what's happening at the device. So we can do filtering, transforming, correlation, and pattern matching at the device, and again, move some of that functionality to the edge. Another method for working at the edge would be to do some machine learning in a more custom way at our device.
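The "filter the data at the edge" idea is often implemented as report-by-exception: only forward a sample when it differs meaningfully from the last value sent upstream. A minimal sketch, with an assumed deadband value:

```python
def filter_at_edge(samples, deadband=0.5):
    """Report-by-exception: forward a sample to the cloud only when it
    differs from the last reported value by more than the deadband."""
    reported = []
    last = None
    for s in samples:
        if last is None or abs(s - last) > deadband:
            reported.append(s)
            last = s
    return reported

# Six raw readings collapse to three cloud messages.
print(filter_at_edge([20.0, 20.1, 20.2, 21.0, 21.1, 25.0]))  # → [20.0, 21.0, 25.0]
```

In a slowly varying signal, this kind of filter can cut upstream traffic dramatically while still capturing every significant change.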
And one of the tools that's bubbling up for this recently is TensorFlow, and specifically TensorFlow Lite. TensorFlow is a very commonly used open-source machine learning system, and recently a version of it was introduced for mobile and IoT devices called TensorFlow Lite. There is a reference here to building a TensorFlow Lite system on a Raspberry Pi. So potentially, you could do things like image classification, object detection, pose estimation, and so on at the device, rather than having to connect to the cloud and move data there. So it localizes decisions to the edge devices, and really is a great way to prototype model-driven systems at the edge of your particular network. So, something else to dig into and consider. In summary, again, this is a perspective on IoT networks. Consider the vast volumes of data that are flowing through them, and the ways that we're using technologies like Hadoop for storage and analysis of big data. While we appreciate that, for our own IoT networks we really want to consider what we could do at the edge, and what types of tools, whether it be AWS Greengrass, or TensorFlow, or other tools, we could use to do some of that analysis at the edge, and make for smarter, faster systems that we could prototype. So, another perspective on the design; take a look at some of these things, and see how you could use them in your own designs.
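Once a TensorFlow Lite interpreter has produced an output tensor on the device (typically via `Interpreter(model_path=...)`, `allocate_tensors()`, and `invoke()` from the `tflite_runtime` package), the decision itself can be made locally. The sketch below shows only that last, model-independent step: turning raw output scores into a label and confidence with a softmax. The scores and labels are made up for illustration.

```python
import math

def top_prediction(scores, labels):
    """Turn raw model outputs (e.g. a TensorFlow Lite output tensor)
    into a (label, confidence) decision at the edge, via softmax."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical output scores from an image classifier on the Pi.
label, conf = top_prediction([0.2, 3.1, 0.4], ["cat", "person", "car"])
print(label, round(conf, 2))  # → person 0.89
```

A prototype could then act on `label` directly at the device and, say, only upload the frame to the cloud when the confidence is low.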