0:16

Now, you can actually download this data.

This is real data that Yelp is making available for

academic purposes, and so you can download that data from this

website that they've given for a dataset challenge.

This data will be in JSON format,

something that we haven't dealt with before.

So we want to convert it into csv, so that we can use the tools that we already know.

So what I've done is provide a script called json2csv_business.py.

This is a Python script that you can download with this course,

and on your command line you can run python, the script name,

and the name of the JSON file that you downloaded from Yelp.
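Spelled out, the invocation being described looks something like this; the JSON filename is an assumption here, since the exact name depends on which version of the Yelp dataset you downloaded:

```shell
# Convert the Yelp business JSON file to CSV using the provided script.
# The input filename is hypothetical -- match it to your actual download.
python json2csv_business.py yelp_academic_dataset_business.json
```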

Now keep in mind that this is a large file, and so

1:15

it's going to take some time to actually get that conversion done.

All right, but you can do this outside of this environment,

and you can have it ready in a csv format.

So that's what will happen once that script is run.

Okay, now given that you've finished this, we're going to switch to RStudio, and

here we'll load up some of the packages first

to start working on this data set, okay.

So the first thing we need to do, because we're doing this step by step.

The first thing we need to do is just look at the data.

To do that we need the ggplot2 library.

We've done this before.

So this is the library for doing different kinds of visualizations.
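In R, that load step is just:

```r
# Load ggplot2, the visualization library used throughout this session.
library(ggplot2)
```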

Now, the next thing to do is load up the data.

2:16

So we're going to load the data, and

this is the business data.

Right, so Yelp has a lot of businesses, and

their data about the reviews of those businesses by the customers, and so

we're going to use the read.csv function, so that allows us to load a csv file.
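Written out, the load step is a sketch like this; the path and filename are assumptions, so point them at wherever your converted file actually lives:

```r
# Load the converted business CSV into a data frame.
# The filename is hypothetical -- use the output of json2csv_business.py.
business_data <- read.csv(file = "yelp_dataset_business.json.csv")
```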

3:06

Enter that, now this is going to take a little bit of time,

because it's a large amount of data right, and so

again depending on how fast your machine is, how much memory you have available,

don't be surprised if this takes several seconds.

3:22

Now the data is loaded in this business data, and so

let's just go ahead and plot it.

And it's so easy to do that in R.

And so we can just tell it that we want to plot this business data.

And how do we want to do it? Well, let's get

a bar chart where the x-axis will have the state

that the business data is from,

that is, where the business is.

And we're going to fill it using gray.
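Put together, the command being dictated is a sketch like this, assuming the data frame has a state column:

```r
# Bar chart of business counts per state, filled in gray.
ggplot(business_data, aes(x = state)) +
  geom_bar(fill = "gray")
```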

4:09

And there you have it, so

now we have a visualization of the data that's available.

And let me just expand this a little bit. If you see here on the x-axis,

you have states from where this data is.

4:26

Now, as it so happens, this particular data that Yelp has made available

is mostly from Arizona.

So don't be surprised that most of the things are in Arizona.

This is not the normal thing for Yelp.

Yelp has data from all kinds of states.

But this particular dataset has most of its businesses located in

Arizona and Nevada.

So that's why, as we're seeing, there's nothing really all that interesting here.

This is simply to practice our skills with R, and

it's very easy to see how things are located.

5:31
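The command for this next plot isn't captured in the transcript, but a bar chart of the star distribution could be sketched like this, following the same pattern as the state plot:

```r
# Distribution of star ratings (1 to 5, in half-star steps)
# across all businesses in the dataset.
ggplot(business_data, aes(x = factor(stars))) +
  geom_bar(fill = "gray")
```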

And so here's the visualization of that.

So one could give stars from 1 to 5.

And so you can see that on the x-axis we have number of stars, and

on the y-axis we have the count from all those businesses.

All right, so there are thousands of businesses represented,

and here's the distribution of stars, and maybe this is not surprising,

maybe this just confirms what we expected, but it's kind of nice to see,

once you have the data loaded, how easy it is to just do these quick visualizations.

6:34

In this case, we're going to use an expanded command for

this ggplot.

So we'll say the data is

business_data,

and we're going to use x = factor(1).

I'll explain this in a second.

We're going to fill with factor(stars), and so

this is our sort of more expanded ggplot command.

And we're going to create a bar chart with geom_bar(width = 1).

7:28

And coord_polar(theta = "y"),

and again I'll explain this in a second.

But let's first see what happens, okay?
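Assembled, the expanded command reads roughly like this:

```r
# Pie chart of star ratings: a single stacked bar (x = factor(1))
# wrapped into a circle by polar coordinates on the count (y) axis.
ggplot(data = business_data, aes(x = factor(1), fill = factor(stars))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")
```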

So, what we created here is

a pie chart where it is using

the counts for this factor.

So the factor in this case happens to be the stars.

And so what it means is it's looking at this particular variable, which is stars.

It takes the stars assigned to each business, uses that as a factor, and

then counts how many things fall under different values of that factor.

Okay, so stars being 1, 1.5, 2 and so on up to 5.

All right, so those are the possible values for this factor,

stars and then what you see here are the actual counts right?

But of course this is a pie chart so everything needs to fit in the circle so

everything is proportional, right?

So here's the blue one that's the largest one with the rating 4.

The light blue one with 3.5 and again,

this is just confirming what we saw before with the bar chart.

So this is just a different visualization to show similar things.

And again, one of the reasons we are doing this is to practice our

skills with the different visualizations we can do.

As we saw before often even just being able to see this could be very

informative, right?

And all it takes is just one command.

9:19

Okay, so next we're going to look at the user data.

Just like we did with the business data,

we're going to look at another file that you downloaded,

or should have downloaded, from Yelp, which is the user data.

So what we just played with was the business data.

Now we want to see things that are user related.

So again, you'll find the data in JSON format.

And I've provided a script called json2csv_user.py and

that script again can be used just like before, on the command line, running a python

command with that script and the name of the JSON file.

10:08

And the resulting file will be a CSV file that we can use

to load up in R just like we did before.

All right, so at this point, I'm assuming that you've been able to convert the JSON

file to a CSV file and now you can load it in R, okay?

So we're going to load it as a different dataset.

10:34

And so we're going to call it, this is the user data.

Actually, we're going to first load the user data, and

that will be similar to what we did before: the read.csv function, with file

equal to the full path to where the file is, yelp

11:02

dataset_user.json.csv.

And so this is the file that has the user-related data, and

we already converted it from JSON to CSV.
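The load step being dictated here is, in full, a sketch like this; again the filename is an assumption, so match it to wherever your json2csv_user.py output landed:

```r
# Load the converted user CSV; half a million rows, so this may be slow.
user_data <- read.csv(file = "yelp_dataset_user.json.csv")
```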

And once we give this command it's going to load it up.

Again, it may take a few moments because this is a large file and

R is trying to load it into the R environment.

So don't be surprised if it takes a little longer.

And this is what we're doing here,

loading up the CSV file in R.

Okay, now we have it loaded.

So now we can start playing with it, just like we did before.

Okay, so let's see,

11:54

let's extract some information from this whole dataset.

It's a huge table, but we don't need all of it.

Some of the fields that it has

are things like the number of cool votes.

And so a customer can vote that something is cool, or funny, or useful.

So let's extract those votes, user votes.

So the entire data is in user data table.

We're going to extract, so remember this is a table, so it's got rows and

it's got columns.

So a user data has two dimensions.

And the dimensions are separated by commas.

If I do this, that means everything comma everything.

So this will get the whole table.

But we don't want the whole table, we want only specific columns, right?

But we do want all the rows.

So I'm going to leave before comma empty, so that means get me everything.

And then after comma I want specific columns.

I'm going to use the c() operator.

Let's say I want cool votes,

I want funny votes.

These are the columns and I want useful votes.

These three columns and all the rows.

So this is what it means: give me all the rows with these three columns,

from the user_data table, into this user_votes, okay?

So now we have this sort of a subset of the data, okay?
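As a sketch, assuming the CSV headers are cool_votes, funny_votes, and useful_votes:

```r
# Keep every row (empty index before the comma), but only the three
# vote columns (the c() vector after the comma).
user_votes <- user_data[, c("cool_votes", "funny_votes", "useful_votes")]
```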

13:42

Now let's ask some questions using this.

Does a user who has more fans get more useful votes, right?

One of the fields that user data has is number of fans for a given user.

So we can find a correlation, right?

We can just use the user_data right away, and say

14:12

funny_votes, and

user_data$fans.

So yes, there is a high correlation, it's positive between funny votes and fans.

So a user who has more fans tends to have more funny votes, okay?

So that's probably not very surprising but

it's very easy to find that kind of correlation.
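The check being typed above is presumably R's cor function; a sketch, assuming those column names:

```r
# Pearson correlation between funny votes and fan count;
# a value near +1 means they rise together.
cor(user_data$funny_votes, user_data$fans)
```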

14:42

Okay, so we're going to do something more with this.

We're going to actually look at how different things are related.

We're going to create a linear model, a regression model, right?

To do some regression analysis, to do some prediction and things like that.

Okay, so let's create a linear model.

So this is a regression model; I'll use linear model and

regression model interchangeably, since they are the same in this case.

So R has a command or a function called lm that creates a linear model.

And what we want to do is see if

the useful_votes that somebody

has are related to things

like review_count and fans.

And well, actually,

I already typed the review_count.

So just this much, and the data is in user_data.

So now, how do I know what these things are?

These are columns, and if you like, you can open the CSV file, just be

careful that it could take up a lot of memory because it is actually pretty big.

The other option is to kind of just go here and

look at the user data.

And see that this is the data that you loaded up, it has half a million entries.

And here are the columns so that's how we know what those things are,

and that's what we are actually working with, okay.

Let's go back to where we were.

And so that's what we just did.

16:54

We ran a regression model where useful_votes is our outcome.

All right, it's a dependent variable.

review_count and fans are our independent variables, okay?

And that's done on this whole dataset.

So this is similar to regression that we did in Python, okay?
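The model fit being described can be sketched as follows; the object name model is my choice, not from the transcript:

```r
# Linear regression: useful_votes as the outcome (dependent variable),
# review_count and fans as the predictors (independent variables).
model <- lm(useful_votes ~ review_count + fans, data = user_data)
```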

Now of course, we need to get the actual coefficients from this.

17:26

Well, let's do that,

coefficients from my linear model, right?

And you can list them, so this is our model.

Okay, and so this is the coefficient that you multiply to review_count.

This is the coefficient that you multiply to fans, right.

And this is the intercept or the constant that you add to this equation, right.

So in other words, useful_votes is

equal to 1.41 times review_count

plus 22.68 times fans, minus 18.25.
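Extracting the coefficients and applying the fitted equation by hand might look like this, assuming the model object was saved under the name model (my choice) and using the coefficient values quoted above:

```r
# List the fitted coefficients: (Intercept), review_count, fans.
coefficients(model)

# Predict useful_votes for a hypothetical user with 100 reviews and
# 10 fans, using the quoted coefficients.
1.41 * 100 + 22.68 * 10 - 18.25  # 349.55
```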

18:22

Before we have seen it with only one variable, in this case,

we have two variables.

But this is a linear regression, so you can have several

factors all added to each other to create the linear regression equation.

So this is our regression equation,

and here we have the coefficient information, right.

So if you know the review_count and if you know the fans,

you can predict useful_votes using this equation, yeah.

Now, let's do something more with this, let's actually visualize this.
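The plotting command isn't captured in the transcript, but a quick look at the skewed review_count distribution could be sketched like this (the binwidth is an arbitrary choice):

```r
# Histogram of reviews written per user; heavily right-skewed,
# since most users write only a handful of reviews.
ggplot(user_data, aes(x = review_count)) +
  geom_histogram(binwidth = 50)
```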

19:42

So we can see, this is a very,

very skewed distribution if you can actually even see it,

right, where you can see how many reviews people write.

Well, not many; almost everybody writes a very small number of reviews,

so most of your data is really concentrated here.

So again, it's not very surprising that most people write very few

reviews, in the single digits or at least fewer than 100,

so users with many reviews are a very, very small percentage of what you see.

20:19

Now, we want to go a little deeper into this to see how people

are distributed in terms of their review_count, their fans,

and so on, and it's not very clear how to analyze that.

Do people have a lot of fans? Do they have different numbers of reviews?

One thing to do in that kind of case is something called clustering.

And so clustering is an unsupervised method, which means that

we don't know the labels.

We don't know how people should be distributed.

We just have all this user data, we have a lot of people in that, these people do

different things like voting on things, rating things, writing reviews.

Some people are more active, some people are less active,

but we don't know exactly what to call them,

we don't know how many categories can we have, how do we divide it up.

So it's an unsupervised method, we're not trying to predict or

put people on specific labels.

21:32

We're just trying to see how they're distributed.

And so the clustering allows us to see how the data is distributed or

how it's organized.

Maybe there is some underlying organization that we're not seeing,

but we could perhaps.

And so we're going to use a technique called K-means.

It's a popular clustering technique where k represents the number of clusters, okay?

And so R gives us a very easy way to do that.

22:00

And so that's what we're going to, right now, work on.

Now R has a function, actually, I'm going to clean this up for now.

R has a function called kmeans, and

you can give it the user_data that we have.

22:21

We're not interested in all the columns, all the fields,

we're interested in only some of them, and so

let's actually look at the data to see which columns we are interested in.

22:39

This is our data.

Name is not all that useful.

When looking at this, the numbers are what's kind of useful,

so review_count, average stars, maybe the dates, and

so the numerical things are useful.

So 3, 4, 5, 6, 7, 8, 9, 10,

11, so we have a total of 11 columns.

And actually you can even see it here.

There are 11 columns and 552,000 rows.

Okay, so the user ID is just a unique random ID, name is name.

So we can eliminate the first two columns.

On the left side, just take columns 3 to 11,

and we'll ask for 3 clusters.

I'm not sure if that's good enough,

but normally three is a good starting point,

and I'll put this in a variable, userCluster.
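The call being described is a sketch like this, assuming columns 3 through 11 are all numeric as the transcript indicates; the set.seed line is my addition for reproducibility:

```r
# Cluster users into 3 groups using the numeric columns 3 through 11
# (dropping the user ID and name, which don't describe behavior).
set.seed(42)  # make the random initialization reproducible
userCluster <- kmeans(user_data[, 3:11], centers = 3)
```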

23:53

Run it, and it's finished running.

What we're going to do is to visualize these clusters, and

once it's done, then it'll be easier to explain what's going on.

So we're going to look at all the user_data;

we're going to have the review_count

as the x-axis and fans as the y-axis.

And we're going to use the cluster colors; again, this will be easier to explain once

we have things in front of us, and so I'm just going to write it for now.

24:46

Sorry, it's going to take a little bit of time because there's a large

amount of data here.

It's not just loading the data, but

it's also creating a visualization of that data.

So again, don't be surprised that this takes some time.

In the meantime, I just want to show you what really happened here.

So what we're doing is, first off, we run this kmeans

25:17

We ask it to take all this data but, only columns 3 to 11 because

the first two columns are not really useful in terms of representing a user.

Right, it's the ID and the name, and those are not very representative of the user.

So, we take all the rows, but only these columns, and we ask for three clusters.

25:51

And then, so the clustering information is stored here in userCluster.

And then we're taking that to plot it.

Now, what we're doing with the plotting is saying that all the data should be plotted.

We're going to do a two-dimensional plot with review_count

as the x dimension, fans as the y dimension, and

we're going to do a point-based, or scatter plot, charting.
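Putting that together, and assuming the k-means result was stored in userCluster as described, the plot could be sketched as:

```r
# Scatter plot of users (review_count vs. fans),
# colored by their k-means cluster assignment.
ggplot(user_data, aes(x = review_count, y = fans)) +
  geom_point(aes(color = factor(userCluster$cluster)))
```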

26:27

And so here's the visualization of those clusters.

I know it took a little while and if it takes too long for you or

for any reason it fails, chances are you don't have enough memory,

enough processing power on your computer.

And in that case, what I would suggest is, instead of using this

whole user_data data, take a sample of it.

So create a subset, and I'll leave that as a homework for you.

We've already seen how to create a subset but this time create a subset based on

some condition that will get you a smaller part of this whole data set.

Because there's a half a million rows and so that's a lot of data points and

so unfortunately this is something where you will need more processing power,

more memory on your computer.

So if that's not the case, then I recommend taking the subset.
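For the homework above, one way to take such a condition-based subset could be sketched like this; the threshold of 10 reviews is an arbitrary example:

```r
# Keep only users with more than 10 reviews, shrinking the
# half-million rows to something a smaller machine can handle.
user_sample <- user_data[user_data$review_count > 10, ]
```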

27:27

But whatever you do, hopefully you will have something like this.

Now, now that we have this clustering visualization I can explain what we did.

Okay, so we have this two dimensional plane.

On the x-axis, review_count, on the y-axis we have fans.

So we know that this is a scatter plot.

So each point shows us a user and corresponding

review count and the number of fans for that user.

Now of course, what we've done in addition to that is actually create the clustering.

So we did the clustering, and if you remember we used k-means; we asked for

three clusters, okay?

28:12

We asked for it to be organized, somehow separated, into three groups.

That's what that k-means did, and now, when we're doing the visualization,

this is what it means.

So we visualized this with the x-axis and y-axis, and then we said to color

each dot using the clustering information, okay?

So userCluster is where our whole clustering result is.

That's where the whole thing was generated, and

its cluster field represents a number for each user.

So, because we asked for three clusters, this is what we have.

We have three different numbers,

that means three different values for the color.

So, those three different values are used as IDs for a color, and

that's how we've seen three different colors here.

And so that's a very nice, easy way to visualize our clustering and

you can see that there's a light blue, dark blue and a medium blue color and so

those things indicate three different groups that this clustering has organized.

And while it's kind of difficult to see, you can imagine that this medium blue

cluster is very sparse; it's people with very high review count and fan count.

And then we have things in the middle with moderate review count and

moderate fan count, and then we have the light

blue with very little review count and fan count.

Okay so that's what we did.

Let's go back to our Clustering.

This is a clustering that we just talked about.

And with that, we conclude this session.

What we saw here is using read.csv function to load CSV data in R.

Now, we're not always lucky to have CSV data, and

if we don't, we saw that in this case we've got the data in JSON,

30:20

but we also had help to convert that JSON into a CSV.

So that's another thing that could happen: if you get data from something else,

in some other format, you will either have to write a program yourself, or,

in most cases, you can find an existing program or script that will

30:39

convert that into CSV that you understand.

We saw how to use the ggplot2 library

to plot the data; it's very easy to create a bar chart or histogram.

And there are a lot of functions, a lot of options of those functions,

we didn't look at all of them, but you know at least where to start.

We did correlation analysis to see if variables are related somehow.

And once we find that there is some correlation, we

31:09

did regression analysis to see how exactly those variables are related.

Right, and remember, regression gives us the coefficient information and

constants that make up what's called the regression model, or regression line.

And using that information you can then calculate the outcome values.

And there are times when

31:33

we just want to see if there's any underlying organization of the data.

And for that clustering is a great way.

It's an unsupervised learning technique.

And the algorithm that we used was k-means.

And what it does is simply provide us

some kind of organization.

In some cases it's very clear, in other cases it's not, but it could become

a starting point where we can start formulating some hypothesis, right?

So with that, we end this session on using R for social media data analysis.