TechGeek: API

Saturday, 28 September 2013

OpenCL- Understanding the Framework

With the first post in this series, we have had a basic and formal introduction to OpenCL. We have discussed the need of parallelism in computation and understood idea of OpenCL with a simple analogy. In the previous post we created an OpenCL runtime environment using Python. I am pretty sure, these gave you a good idea of what OpenCL is, and its working! Having the arena set, now we shall try to understand OpenCL in more detail.

Well, I am a huge fan of parallelism (I believe in theory of Parallel universes as well! ), so let us commence our advanced discussion on OpenCL with its wonderful ability of exploiting data level parallelism.

Data Parallelism in OpenCL

Kernel programmes are executed on the device with multiple threads. One should understand how OpenCL manages this. As introduced earlier, total number of Work Items that execute in parallel is represented by N-D domain. In other words, Kernels are executed across a global domain of work items. And these work items are executed in parallel, unlike conventional sequential execution.

Choosing Dimensions

As you see in the figure above(courtesy:AMD), N-D range is represented in 3D. User may specify the dimensions he wishes to use in N D computational domain where N is 1,2 or 3. 1 would represent a vector, 2 would be an Image and 3 would be a volume. By default(in our prev example), it is 1D. Choosing these is left to designer, but choice must be for a better mapping and speed.

//10 million elements in a vector: 1D
global_dim[3]={10000000,1,1}
//An image: 2D
global_dim[3]={1024,1024,1}
// Volume 3D
global_dim[3]={512,512,512}

For the modularity, user can divide the ND range into separate work groups(with same dimensions as ND range). If you are familiar with CUDA, they are analogous to grids and blocks. Say, you want to process an Image of size 1024*1024, you can have work groups to process blocks of size 128*128! Some note worthy points about work groups are,

They have their local dimensions.
They have their Local memory.
Work Items can be in synch within the group, using barriers. But Global work Items can never be in synch.

OpenCL extends Task Level parallelism as well, which makes it robust and well performing in various platforms.

OpenCL Memory Model

In OpenCL, which is very data intensive, memory management is of utmost importance. The programmer must have a very clear picture of memory model, otherwise the program will crash horribly !

(Courtesy: AMD)

GPU(OpenCL Device) will never request data from CPU(host), but only responds to the data requests by host. Buffers with data are sent to computation by the host to the main memory called Global or constant memory. These are accessible to every work item in the context. Please keep in mind that they are NOT synchronised!

The work groups have their local memory, which are accessible to only the work items in the group. With explicit coding by the designer, these can be made in synch to work items of the group. Also, every work item has its private memory for the execution of kernel, specific to the work item.

OpenCL Overview

With these fundamentals, let us have an overview of what really happens when we run an OpenCL code. Observe the figure below.

OpenCL Framework(courtesy: AMD)

Once you write your code and compile it, the code is built on the host. If your code is bug free, then it is built and Kernel program is obtained. Note that these are specific to a context, and context is in control of the host. Host will now create memory objects like buffers or images to manage the inputs and outputs.

After these steps, OpenCL magic starts! A proper command queue is set for the queue of instructions to the OpenCL device. Those who are familiar with processor architecure, it is analogous to scheduling. Inorder of execution is static scheduling, where the instructions are simply executed in order. But in the other case, dynamic scheduling happens where instructions are executed in the order of their dependancies, thus improving the speed by a great extent. This has its own trade-off with hardware requirements , as one can easily see.

AMD boasts that its OpenCL can support multicore AMD CPUs as OpenCL devices as well. This piece of code depicts the command queue creation.

cl_command_queue gpu_q,cpu_q;
gpu_q= clCreateCommandQueue(cntxt,device_gpu,0,&err);
//'cntxt' represents the created context and device_gpu represents device id of gpu
cpu_q= clCreateCommandQueue(cntxt,device_cpu,0,&err);
//device_gpu represents device id of cpu

Above framework gives a very clear overview of OpenCL execution. Now that you have a good idea of OpenCL, go ahead and start your projects with a single motto, "Think Parallel !". I will get back with a next post about Kernel execution and some more OpenCL programs. If you have any doubts, comment box is right below!

CHEERS!

Friday, 20 September 2013

Beginner's Tutorial In PyOpenCL

Hello! Hope you liked previous introductory post. Let us get started with OpenCL environment creation! Before you start, please make sure you have the following programs installed:

Python 2.7 with PyOpenCL module and its dependencies(includes pytools and decorator)
NumPy module for python
OpenCL SDK according to your GPU vendor.

PyOpenCL makes creation of OpenCL environment easier to an extent I can not possibly describe you. Coder gets to concentrate more on writing an efficient Kernel, rather than struggling to create the environment. Before we begin, make sure that you have set environment variables(in windows) like PYOPENCL_CTX accordingly.

A standard and a minimal OpenCL code will have following parts.

Identifying a Platform
Finding the device ID
Creating the context
Creating a command queue in the context
Creating a program source and a kernel entry point
Creating the buffers for data handling
Kernel Program
Build and Launch the Kernel
Read the Output Buffer and clear it(if needed)

These are some standard procedures one has to follow to create the environment. A pyopencl user will have his device identified already by environment variables. For the introduction, we may start from step 3. Let us go ahead and do that,

# import the required modules
import pyopencl as cl
import numpy as np

#this line would create a context
cntxt = cl.create_some_context()
#now create a command queue in the context
queue = cl.CommandQueue(cntxt)

Isn't that pretty simple!? This is the advantage of using pyopencl to create the environment. To give you an idea, let me show you how the same thing can be done in C++.

cl_context context = clCreateContext( NULL,1,
&device,NULL, NULL, NULL);
cl_command_queue queue= clCreateCommandQueue(context,
device,0, NULL );
/*you may improvise the code by adding exceptions,
but let me keep it simple ;) */

Now, having created the context and queue, we need to create the buffers that hold the input and output data. User will dump the input data into the input buffer before passing the control to the Kernel. And as the Kernel is being executed, OpenCL puts the result back into output buffer. One should note that buffers are the link from host instruction to the device level execution.
Let us use the Numpy module to create the array of data,

# create some data array to give as input to Kernel and get output
num1 = np.array(range(10), dtype=np.int32)
num2 = np.array(range(10), dtype=np.int32)
out = np.empty(num1.shape, dtype=np.int32)

# create the buffers to hold the values of the input
num1_buf = cl.Buffer(cntxt, cl.mem_flags.READ_ONLY | 
cl.mem_flags.COPY_HOST_PTR,hostbuf=num1)
num2_buf = cl.Buffer(cntxt, cl.mem_flags.READ_ONLY | 
cl.mem_flags.COPY_HOST_PTR,hostbuf=num2)

# create output buffer
out_buf = cl.Buffer(cntxt, cl.mem_flags.WRITE_ONLY, out.nbytes)

Note that input buffers are made read only and output buffers are write only.

This job would have been quite lengthy in C++. Afterall, it is the mighty Python! And it is its user friendly nature which makes it so popular!

Well, now we come to the most important part, let us write a Kernel to Unleash the power of the OpenCL!

# Kernel Program
code = """
__kernel void frst_prog(__global int* num1, __global int* num2,__global int* out) 
{
    int i = get_global_id(0);
    out[i] = num1[i]*num1[i]+ num2[i]*num2[i];
}
"""

In this simple Kernel, we take the data in by the pointers mentioned as Global int, then we fetch the global ID of the work item (hope you remember the introduction!) every time the Kernel is launched and finally we carry out the mathematical process that we need.

Giving you an overview of what really happens here is, your CPU will send the data to your GPU in multiple threads to exploit the parallelism of GPU where these threads are executed in parallel. (check out this post for better understanding.)

After writing the Kernel, we should compile and launch the Kernel with the help of following code.

# build the Kernel
bld = cl.Program(cntxt, code).build()
# Kernel is now launched
launch = bld.frst_prog(queue, num1.shape, num1_buf,num2_buf,out_buf)
# wait till the process completes
launch.wait()

Now the Kernel is launched and OpenCL does its job by running this Kernel on available devices efficiently. We may read the data from the output buffer,

cl.enqueue_read_buffer(queue, out_buf, out).wait()
# print the output
print "Number1:", num1
print "Number2:", num2
print "Output :", out

If you have followed everything perfectly, if you put together above code and run, then you should see output like this,

If you are a beginner, I suggest you type the code by yourself to get used to specific syntaxes. If you are curious to see the efficiency of OpenCL, create an array of a big size, write a pure python code for the same and find out time of execution(you can use time module). Then do the same for the OpenCL based code. After comparing both, I am sure you will end up exclaiming, "Ah! It was worth the effort!"

With this I will conclude the Introduction to OpenCL. Check out next post about framework of OpenCL. Subscribe to my posts and feel free to leave a comment below!

Follow @hegdekartik7