0:08

Welcome to this week's programming tip.

Now, in this specialization,

we focused mainly on writing scripts that use good code and up to now,

good has meant code that's correct,

understandable, and that others can work with easily.

However, as you progress,

you're going to see that going to be another metric by which we

assess the quality of your code and that's it's efficiency.

We're really interested in how fast your code runs on a problem of a particular size.

If you move on into the larger field of computer science,

kind of the core topic in this area are called algorithms and algorithms are

common recipes for solving important computational problems.

One of the ways we measure how good an algorithm is we talk about it's efficiency,

how fast it runs on an input of a given size.

In this video, what I want to do is talk a little bit about some of the kind of

the core concepts that come up when we're talking about efficiency of algorithms.

Then, in a follow up video,

we'll talk about a couple of different approaches to joining

two CSV files in Python and

how this idea of efficiency can kind of sway which approach we take.

Let's go do it.

Okay. Let's talk a little bit about how you measure the efficiency of your code.

There's a whole subject area in computer science that

focuses on this and one of the key things is algorithmic efficiency.

Your code implements an algorithm,

some mathematical recipe for doing something and you want to know how good is it.

This is a Wiki page.

It gives that a very general overview of things.

You might look over just to get some vague ideas of how things work.

There are two ways that you can go about measuring the efficiency of your code.

This is called benchmarking.

This is probably the one that you're most used to and the idea is you take an algorithm,

you write it up in some language, say Python,

and you run your Python code on some particular machine,

and you put in a timing code to see how long a particular portion of

the code took to run and you get a number out.

What happens is when people want to measure how well things work,

they often put what are called benchmarks.

They'll go through and take a common problem and they'll take

a particular language and they'll run it through

their machine and they'll see how long it takes.

If you've ever bought a graphics card for one of your computers,

you'll often do some benchmarking to see how well

your graphics card works on a particular game.

Now, that's kind of a nice way to understand a little bit about how fast your code runs,

but there are some drawbacks here.

The absolute running time for your code depends on a lot of factors

beyond just how good your mathematical algorithm is for solving a problem.

It depends on how fast your computer is,

depends on which language you chose.

Did use Java or C or Python?

It also depends on the particular implementation of

the language that you chose to use here.

If you've used Code Skulptor 3, you'll see it runs a lot slower than regular old Python.

There's another way to measure the efficiency of your code,

the algorithms you're going to use,

and that's a theoretical method.

What that method does is it's going to instead look at the running time of

your code in terms of

how fast the running time increases as the size of the input increases.

Notice that approach of looking at how fast the running time increases as the size

of the input increases is going to wash out a lot of factors,

like how fast your machine is.

If the machine is fast or slow,

you'll see we still see the same growth in the running time.

Which language you choose, same thing,

you'll see really which language you use doesn't really

impact the increase in the running time as you increase the size of the input.

Now, I'm not going to go into all the details but the common way

we measure this in computer science is we use,

it's called Big O notation.

It's basically a Big O of some function kind of gives you a function that

balance the growth and the running time as the size of the input increases and so,

for a few examples here, Big O,

you'll find just means that the running time of your codes are

constant and if you double the size of the input,

the running time doesn't really change.

That's a really good property to have.

Another common one is Big O, again,

it's linear and that means that if you double the size of your input,

the running time of your code doubles. Pretty good.

It means basically your running time is proportional to the amount of input you have.

That's a very reasonable thing.

Something a little slower,

quadratic, that's Big O squared.

That means if you double the size of the input,

then the running time goes up by a factor of four.

It's not the end of the world,

but as you keep doubling,

things get a lot slower,

so it's not great.

These are three examples of some common running times this much.

You see Big O notation in here, Big O(1), great.

Big O of pretty good, Big-O(1) squared.

It's okay but not great.

Okay, let's move on now and we'll talk a little bit about how Python is implemented.

You want to use this Big O notation to understand the efficiency of

some operations that we commonly use inside Python.

Let's move on now to Python and

try to understand the efficiency of the code that you've written in Python.

There's a very helpful page that you should be

aware of that sits out on python.org which is called

the Time Complexity page which talks about some of

the specifics of how fast certain operations run inside Python.

Now, when you first started working with Python,

maybe you used some kind of debugger or stepper and

you kind of tried to understand the running time of your code in terms of

how many statements had to be evaluated to arrive at

a particular result and that was a reasonable starting point for

understanding how Python works but there

is one subtlety that actually exists in Python that you should be aware of.

That executing one statement in Python doesn't always take constant time.

It can take a lot more time based on

the size of the data structure that you're working with.

Remember Python is implemented in terms of so much lower level language like

C and that C code for one line of Python could be doing lots of stuff.

What I want to do is,

I want to talk a little bit about the speed

of operations for two important data types in Python,

lists and dictionaries. So let's do that next.

Okay, let's talk about the running time of various list operations in Python.

Here, we are in the Time Complexity page,

we're looking at lists and here,

you have a collection of FI operations that we can

run on a list of links so there's in items in the list.

The first operation is make a copy of the list.

It turns out that the running time is O,

then you kind of have to go through

the entire list to make that copy, so that makes sense.

Kind of the two operations that are most useful in the list,

we can get a particular item,

we can set a particular item as O(1).

The time it takes to either get or update an item in a list,

like the tenth item doesn't

depend on whether the list is linked to a thousand or linked to a million.

It still takes the same amount of time.

That's a very nice property and that's why lists are very useful.

There's one other operation that you'll be

using fairly frequently with lists and it's down here,

you want to check if something is in the list.

That looks kind of like an innocuous operation.

Is something in a list?

That can't be too hard to do.

Well, it turns out the only way you can check if

something is in a list is actually just go through

the entire list and check that particular item versus every item in the list.

That's O(n), that's linear time.

This is kind of the most common cause of kind of mystery when people are writing

Python code is they use is x in s where s is a long list,

and they think one line of Python must be really fast to implement.

But, then their code runs really slow as a result.

So, let's look at dictionaries now.

We're going to see that dictionaries are going to have

slightly different running times for important operations,

and particularly checking membership.

Okay, to finish off lecture,

let's talk a little bit about the running time of

various operations for dictionaries in Python.

Now, this mentions the important thing about dictionaries is they're

implemented using an algorithmic technique called hashing.

The important thing about hashing is,

if you have a collection of n items,

we can check whether a particular item is in

that collection in constant time, that's O(1).

So, we can get an item in constant time,

we can set an item in constant time.

Notice that was the same thing that we had for lists,

but that although it's not mentioned here,

there's one important difference between lists.

We can check to see if a particular key is in a dictionary in constant time, O(1).

Remember for list, it was O again.

So there's going to be times when using a dictionary is going to

lead to significant improvement in the efficiency of your code.

In fact, that's what we're going to look at next,

is we're going to look at the problem of joining two CSV files.

And, we're going to motivate using dictionaries over lists in this particular case.