In this course, you will learn to design the computer architecture of complex modern microprocessors.

A course from Princeton University

Computer Architecture

From this lesson

Vector Processors and GPUs

This lecture covers the vector processor and optimizations for vector processors.

- David Wentzlaff, Assistant Professor

Electrical Engineering

So, an important aspect of all this vector work is how you compile for it. Well, thankfully, we actually have compilers that can do automatic vectorization. One of the challenges here, if you look at this element-wise multiply, is that you have a loop that's running and another loop that's running, and your compiler needs to figure out that it can merge those loops and run them at the same time. And compilers have actually gotten pretty sophisticated. If you look at the Cray compiler now, it can basically do outer-loop parallelism, it can handle certain types of parallelism with loop-carried dependencies, and it can vectorize all of this. But it requires some pretty deep compiler analysis. This works especially well for things like Fortran codes, where you don't have random pointers pointing to different places.
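
As a rough illustration of what the compiler has to analyze, the element-wise multiply could be written in C along these lines. The function name `vec_mul` and the use of `restrict` are assumptions made for this sketch and are not taken from the lecture slides. `restrict` is the C99 way of promising that the arrays do not overlap, which is roughly the guarantee a Fortran compiler can already assume about array arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise multiply: c[i] = a[i] * b[i] for i in [0, n).
 * Every iteration is independent of the others, so a vectorizing
 * compiler is free to execute many iterations at once.
 * The restrict qualifiers (an assumption of this sketch) promise that
 * a, b, and c never alias, so no store to c can change a or b. */
void vec_mul(size_t n,
             const double *restrict a,
             const double *restrict b,
             double *restrict c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}
```

Without the no-aliasing promise, the compiler generally has to prove for itself that the arrays do not overlap, or emit a runtime check, before it can vectorize the loop.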

C codes get a little bit harder.

So, what if you don't want to execute the same code on all the elements of your vector?

Well, that could be a problem. So, here we have a piece of C code which loops over some big vector. It checks to see whether the value is greater than zero, and only if it's greater than zero does it do this next operation. So, there have been extensions to vector processors that effectively allow predicated, or masked, operations on a per-element basis of the vector. The way you would do this is you would actually load the entire vector and set a mask register, where you have a one or a zero that is the result of this comparison on an element-by-element basis, and then do the operation. You can basically put this together with these bit-by-bit comparisons and have slightly different control flow for the different elements within a vector.

And, just to show the implementation of this, if we look at how to actually implement masking, one way to do it is you actually do every operation. So, let's say you're doing a multiply and your vector length is 64. You do all 64, but you just disable the write to the register file on the ones that have the mask bit turned off. Or, you could have a much fancier implementation which takes out the work that doesn't have to be done. But the control on this is quite a bit harder.
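
To make the masking idea concrete, here is a minimal C model of the simple implementation: compute every element, then suppress the write where the mask bit is zero. The specific operation (multiply `a[i]` by `b[i]` when `a[i] > 0`), the function names, and the `VLEN` constant are assumptions chosen for this sketch. The arrays `mask` and `result` stand in for a hardware mask register and the functional-unit outputs.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 64 /* assumed maximum vector length, as in the 64-element example */

/* The conditional loop as ordinary scalar C. */
void cond_mul_scalar(size_t n, double *a, const double *b)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0.0) {
            a[i] = a[i] * b[i];
        }
    }
}

/* The same loop modelled as three whole-vector steps. */
void cond_mul_masked(size_t n, double *a, const double *b)
{
    assert(n <= VLEN);
    unsigned char mask[VLEN]; /* stands in for the mask register */
    double result[VLEN];      /* stands in for the multiplier outputs */

    /* Step 1: one vector compare sets a 1 or 0 per element. */
    for (size_t i = 0; i < n; i++) {
        mask[i] = (a[i] > 0.0) ? 1 : 0;
    }

    /* Step 2: the multiply runs for every element, masked or not. */
    for (size_t i = 0; i < n; i++) {
        result[i] = a[i] * b[i];
    }

    /* Step 3: the register write is enabled only where the mask is 1. */
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) {
            a[i] = result[i];
        }
    }
}
```

Both functions leave `a` in the same state. The masked version does all `n` multiplies regardless of the mask, which is the wasted work that the fancier implementation tries to remove.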

And I would say that the simple implementation is probably the more common one. Skipping the work is harder to justify, largely because, if you have the resources anyway, say if you have multiple lanes, it might just make sense to go ahead and execute a sort of null operation.

Some other things that are pretty common in vectors are reductions. What I mean by a reduction is, let's say you have this array and you want to add all the elements in the array into a variable.

This is sort of a vector-to-scalar operation. You can't really do this with what we've discussed so far. You can't do a vector operation which will actually operate on all of these values and try to do something useful with them. But what you can do is try some software tricks. One of the software tricks is that you take a whole vector and, instead, call it two vectors. You sort of cut it in half, then overlap the halves and do parallel adds. And then you take the result of that, cut it in half again, overlap those two parts, and do adds. So, you can do lots of parallel adds and effectively build a reduction operation by building a tree of adds. If we have our vector here, we would cut it in half and add this part to this part, and the result would be half the size. We cut that in half and add this part to that part, the result is half the size again, and we keep cutting and doing adds. So, we can use our vector arithmetic to effectively do a reduction.

So, we're about out of time here.
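
The tree of adds described for the reduction can be sketched in C as follows. The function name `tree_sum`, the power-of-two length requirement, and the choice to overwrite the input array are simplifying assumptions of this sketch. The inner loop plays the role of one vector add whose length halves on each pass.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the n elements of v by repeated halving.
 * Assumptions of this sketch: n is a power of two, and v may be
 * overwritten with partial sums. */
double tree_sum(double *v, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0); /* n must be a power of two */

    for (size_t half = n / 2; half >= 1; half /= 2) {
        /* One "vector add" of length half: lower half += upper half.
         * The iterations are independent, so they can run in parallel. */
        for (size_t i = 0; i < half; i++) {
            v[i] += v[i + half];
        }
    }
    return v[0]; /* the last add leaves the total in element 0 */
}
```

For a vector of length 64 this takes six vector adds, of lengths 32, 16, 8, 4, 2, and 1, instead of 63 dependent scalar adds.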

Let me talk about scatter-gather. The idea isn't that deep. The implementation of this can be very hard, though. Say we have something like A of D of i. We want to index based off an index held in a vector. This is called gather. Scatter is the other direction, when you're doing a store with a double indirection, an index of an index. And in the instruction set in your book, they actually have an instruction to do this: LVI here. What that basically does is take each element of vector D here, index into vector C with it, and the loaded value is the result. The problem with this is, of course, that your data is not going to be all nicely laid out in memory. You're going to be sort of jumping around in memory.
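
In scalar C terms, gather and scatter amount to the two loops below. The function names and argument order are assumptions made for this sketch; `c` and `d` follow the vector names used in the lecture, with `d` holding the indices.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: an indexed load. Element i of the result comes from
 * wherever d[i] points within c. */
void gather(size_t n, double *out, const double *c, const size_t *d)
{
    assert(out != NULL && c != NULL && d != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = c[d[i]];
    }
}

/* Scatter: an indexed store. Element i of src is written to
 * wherever d[i] points within out. */
void scatter(size_t n, double *out, const double *src, const size_t *d)
{
    assert(out != NULL && src != NULL && d != NULL);
    for (size_t i = 0; i < n; i++) {
        out[d[i]] = src[i];
    }
}
```

Because the addresses depend on the contents of `d`, consecutive elements can land anywhere in memory, which is where the implementation difficulty comes from.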

Let's stop here for today, and we'll talk a little bit more about vectors and GPUs next time.