So let's dive in and start to explore this data using the aggregation framework. We'll look at what work lies in front of us in order to transform this data set into something we can work with. In terms of differentiating audiences, one factor is going to be the language of a film. Ideally, it would be easy to segment movies based on language. So let's see what languages are represented here and what cleanup is going to be required.

Here, I have a simple script that makes use of the aggregation framework to summarize the values in the language field. What I'd like to do first is run this, so that we can look at the output as part of the explanation for how this works. In order to run this for yourself, you will need to plug in the connection URI for your free tier cluster. Note that the URI shown here is for my cluster; you'll need to substitute yours. Now, let's run it. You can see from the output that many movies are in multiple languages. That immediately raises a question: how many distinct values are there in this mix of languages? Do we really need to deal with all of them? The other issue that might pop out at you is that the languages are simply comma-separated within a single string for each movie. This makes the field difficult to filter on, regardless of the database you're using.

Since this is our first aggregation, let's discuss the syntax before we go further. We can then return to the issues we've just identified. Here we have a $group stage. The identifier for a stage always begins with a dollar sign, and group is no exception. A group stage groups its input documents by a specified identifier expression and applies any accumulator expressions supplied to each group. This identifier expression stipulates that we want each group produced to be identified by a dictionary containing a single field. The key for this field should be the string "language", and the value should be a distinct $language value.
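For readers following along without the video, here is a minimal sketch of what a stage matching this description might look like in Python. It is reconstructed from the explanation above and is not necessarily the exact script shown on screen; the variable name `pipeline` is an assumption.

```python
# A one-stage pipeline, reconstructed from the spoken description.
pipeline = [
    {
        "$group": {
            # Identifier expression: each group is labeled by a dictionary
            # whose single key is "language" and whose value is one
            # distinct value of the language field.
            "_id": {"language": "$language"},
            # Accumulator expression applied to each group.
            "count": {"$sum": 1},
        }
    }
]
```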
An expression like $language is, in the MongoDB aggregation framework, a field path identifier. It identifies a particular field in the input documents, and its semantics are that the value of that field should be used wherever this placeholder is found. So for every distinct value of language in this collection, this pipeline will create a group and apply the specified accumulator to that group. Using a dictionary to label each of these distinct values with what it represents, namely a language, ensures that the semantics are clear: the value around which each group was created is clear, and it's also clear that the value reflects a language designation. This is important, especially when the output contains a variety of results.

As you know, an aggregation pipeline passes documents from one stage to the next. This group stage is at the very beginning of our pipeline, so it will be applied to the collection on which the aggregate command is run. In this case, aggregate is being run on our movies_initial collection. Each document in the collection will be passed through this group stage.

Group stages allow us to apply any number of accumulators that operate on the documents passed through the stage. $sum is one such accumulator. This expression means that for every document matching the identifier for a group, add one to a running count of the documents grouped around that identifier. There are a number of other accumulators, many of which you will encounter in this Coursera specialization. Here is the MongoDB documentation page that describes the available accumulators. There are several others for arithmetic operations, several that work with lists or arrays, and a couple that enable you to calculate descriptive statistics.

Now, one big advantage of the aggregation framework is that all the work is done within the database server, which has been optimized for the operators the aggregation framework supports.
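To make the grouping and counting semantics concrete, here is a rough plain-Python imitation of what the group stage computes. It only mirrors the logic; the real work happens inside the database server. The sample documents and the function name `group_and_count` are made up for illustration, and edge cases such as documents with no language field are ignored.

```python
# Made-up sample documents; real movies_initial documents have many more fields.
docs = [
    {"title": "A", "language": "English"},
    {"title": "B", "language": "English, French"},
    {"title": "C", "language": "English"},
    {"title": "D", "language": "Japanese"},
]


def group_and_count(documents):
    """Imitate a $group on the language field with a {"$sum": 1} accumulator."""
    counts = {}
    for doc in documents:
        key = doc["language"]  # the value the "$language" field path resolves to
        counts[key] = counts.get(key, 0) + 1  # add one to the group's running count
    # One output document per group, labeled the way the identifier expression asks.
    return [{"_id": {"language": key}, "count": n} for key, n in counts.items()]


result = group_and_count(docs)
```

Note that "English" and "English, French" land in separate groups, which is exactly the comma-separated-string problem raised earlier.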
Doing the work on the server, combined with indexes that support extremely fast lookup operations, means that using MongoDB in this way makes for a very powerful data manipulation and analytics toolset.

Returning to our example here, once all input documents have been processed through this group stage, the stage will emit as output one document for each group. Each of those documents will contain two fields: an _id field specifying the distinct value for language that the group represents, and a count field that contains the number of documents in that group. Since this group stage is both the first and the last stage in this pipeline, it is the output from the group stage that we receive as output from the entire pipeline.

The last statement in this script is our call to the aggregate method. Note that we're using our client connection to access the mflix database attribute of this connection and the movies_initial attribute of the mflix database. Finally, we're calling the aggregate method of movies_initial. aggregate is a method of the collection class, and we're passing it a single argument: the pipeline that we've constructed here. Note that each stage in the pipeline is a document, or more precisely, because this is Python, a dictionary.

Just two other things I want to point out. I'm using the pprint module so that we can print out nicely formatted output. And because we're using Jupyter Notebooks, I'm importing the clear_output function so that we can run this as many times as we like, and each time the prior output will be cleared for us.

Given the way we've structured this pipeline, the result is hundreds of documents, each giving a distinct value for language and the count of documents in the movies_initial collection that have that value. This isn't as useful as it could be, and it doesn't really answer our question of how many distinct language values there are.
In order to do that, we'll need to expand our pipeline with a couple more stages.
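Before the pipeline is expanded, it may help to see the pieces of the script described so far in one place. The following is a sketch under assumptions, not the exact on-screen script: the function names `summarize_languages` and `main` are hypothetical, and the connection URI is left as a parameter because you must supply your own.

```python
import pprint


def summarize_languages(collection):
    """Run the single-stage $group pipeline and return the results as a list."""
    pipeline = [
        {"$group": {"_id": {"language": "$language"}, "count": {"$sum": 1}}}
    ]
    return list(collection.aggregate(pipeline))


def main(uri):
    # Imported here so the helper above can be tried without these packages.
    from IPython.display import clear_output
    from pymongo import MongoClient

    client = MongoClient(uri)  # uri: the connection URI for your own cluster
    clear_output()  # clear the previous run's output in the notebook cell
    pprint.pprint(summarize_languages(client.mflix.movies_initial))
```

In a notebook cell you would then call `main(...)` with your own connection URI as the argument.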