Friday, 20 September 2013

Beginner's Tutorial In PyOpenCL

     Hello! I hope you liked the previous introductory post. Let us get started with creating the OpenCL environment! Before you start, please make sure you have the following installed:
  • Python 2.7 with the PyOpenCL module and its dependencies (including pytools and decorator)
  • The NumPy module for Python
  • The OpenCL SDK from your GPU vendor
    PyOpenCL makes creating the OpenCL environment far easier than doing it by hand. The coder gets to concentrate on writing an efficient Kernel rather than struggling to set up the environment. Before we begin, make sure that you have set environment variables such as PYOPENCL_CTX accordingly (on Windows).
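
    If you would rather not set the variable system-wide, one option is to set it from Python before the context is created. This is only a minimal sketch: the value "0" (the first platform) is an assumption, and the right index depends on your machine.

```python
import os

# Assumption: "0" selects the first OpenCL platform on this machine.
# A value such as "0:1" would pick device 1 on platform 0 instead.
os.environ["PYOPENCL_CTX"] = "0"

# cl.create_some_context() reads PYOPENCL_CTX, so set it before that call.
ctx_choice = os.environ["PYOPENCL_CTX"]
print("PYOPENCL_CTX = %s" % ctx_choice)
```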

     A standard, minimal OpenCL program has the following parts:
  1. Identifying a platform
  2. Finding the device ID
  3. Creating the context
  4. Creating a command queue in the context
  5. Creating a program source and a kernel entry point
  6. Creating the buffers for data handling
  7. Writing the Kernel program
  8. Building and launching the Kernel
  9. Reading the output buffer and clearing it (if needed)

    These are the standard steps for creating the environment. A PyOpenCL user will already have the device identified through environment variables, so for this introduction we may start from step 3. Let us go ahead and do that:

# import the required modules
import pyopencl as cl
import numpy as np

#this line would create a context
cntxt = cl.create_some_context()
#now create a command queue in the context
queue = cl.CommandQueue(cntxt)

     Isn't that pretty simple!? This is the advantage of using PyOpenCL to create the environment. To give you an idea, let me show you how the same thing is done in C++ with the OpenCL C API.

cl_context context = clCreateContext( NULL,1,
&device,NULL, NULL, NULL);
cl_command_queue queue= clCreateCommandQueue(context,
device,0, NULL );
/*you may improvise the code by adding exceptions,
but let me keep it simple ;) */

    Now, having created the context and queue, we need to create the buffers that hold the input and output data. The user copies the input data into the input buffers before passing control to the Kernel. As the Kernel executes, it writes the result into the output buffer. Note that buffers are the link between the host program and execution on the device.
   Let us use the NumPy module to create the arrays of data:
# create some data array to give as input to Kernel and get output
num1 = np.array(range(10), dtype=np.int32)
num2 = np.array(range(10), dtype=np.int32)
out = np.empty(num1.shape, dtype=np.int32)

# create the buffers to hold the values of the input
num1_buf = cl.Buffer(cntxt, cl.mem_flags.READ_ONLY |
                     cl.mem_flags.COPY_HOST_PTR, hostbuf=num1)
num2_buf = cl.Buffer(cntxt, cl.mem_flags.READ_ONLY |
                     cl.mem_flags.COPY_HOST_PTR, hostbuf=num2)

# create output buffer
out_buf = cl.Buffer(cntxt, cl.mem_flags.WRITE_ONLY, out.nbytes)

   Note that the input buffers are read-only and the output buffer is write-only.

   This job would have been quite lengthy in C++. After all, it is the mighty Python! Its user-friendly nature is what makes it so popular.

   Well, now we come to the most important part: let us write a Kernel to unleash the power of OpenCL!

# Kernel Program
code = """
__kernel void frst_prog(__global int* num1, __global int* num2, __global int* out)
{
    int i = get_global_id(0);
    out[i] = num1[i]*num1[i] + num2[i]*num2[i];
}
"""

  In this simple Kernel, the data comes in through pointers declared as __global int*. Each work item fetches its own global ID (hope you remember the introduction!) and uses it as the index of the element it has to compute.

  To give you an overview of what really happens here: the host (your CPU) sends the data to your GPU, and the GPU runs the Kernel as many work items (threads) in parallel, one per array element. (check out this post for better understanding.)
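
  To make the idea of work items concrete, here is a pure-Python sketch that mimics what the device does. It is only an emulation under stated assumptions: the names frst_prog_work_item and launch are hypothetical, the loop index plays the role of get_global_id(0), and the work items run one after another here, whereas a real GPU runs many of them at once.

```python
def frst_prog_work_item(global_id, num1, num2, out):
    """Body of the kernel for a single work item (hypothetical helper)."""
    i = global_id  # stands in for get_global_id(0)
    out[i] = num1[i] * num1[i] + num2[i] * num2[i]


def launch(global_size, num1, num2, out):
    """Run one work item per index; a real device runs these in parallel."""
    for global_id in range(global_size):
        frst_prog_work_item(global_id, num1, num2, out)


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
result = [0] * len(a)
launch(len(a), a, b, result)
print(result)  # [26, 40, 58, 80]
```

  The global size (here len(a)) decides how many work items are created; in a PyOpenCL kernel launch, the global size argument plays the same role.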

   After writing the Kernel, we compile and launch it with the following code.

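# build the Kernel
bld = cl.Program(cntxt, code).build()
# launch the Kernel: the second argument is the global work size, the third
# is the local work size (None lets OpenCL choose one)
launch = bld.frst_prog(queue, num1.shape, None, num1_buf, num2_buf, out_buf)
# wait till the process completes
launch.wait()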

    Now the Kernel is launched and OpenCL runs it on the available device. We may read the data back from the output buffer:

# copy the result from the output buffer back into the host array
cl.enqueue_copy(queue, out, out_buf).wait()
# print the output
print "Number1:", num1
print "Number2:", num2
print "Output :", out

    If you have followed everything and put the code above together, running it should print both input arrays followed by the output array, in which each element is num1[i]*num1[i] + num2[i]*num2[i].
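
    A quick way to check the numbers is to work them out on the host. A small sketch in plain Python, assuming the same inputs as in the tutorial (both arrays hold 0 to 9):

```python
num1 = list(range(10))
num2 = list(range(10))

# Same arithmetic as the kernel, one element at a time.
expected = [a * a + b * b for a, b in zip(num1, num2)]
print("Expected: %s" % expected)
# Expected: [0, 2, 8, 18, 32, 50, 72, 98, 128, 162]
```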

    If you are a beginner, I suggest you type the code yourself to get used to the syntax. If you are curious about the speed of OpenCL, create a large array, write pure Python code for the same computation and measure its execution time (you can use the time module). Then do the same for the OpenCL-based code. After comparing both, I am sure you will end up exclaiming, "Ah! It was worth the effort!"

    With this I conclude the introduction to OpenCL. Check out the next post about the framework of OpenCL. Subscribe to my posts and feel free to leave a comment below!
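
    As a starting point for that experiment, here is a minimal sketch of the pure-Python side using the time module. The array size n and the helper name sum_of_squares are assumptions for illustration; the PyOpenCL side could be timed the same way around the launch and the read-back.

```python
import time


def sum_of_squares(num1, num2):
    """Pure-Python version of the kernel's arithmetic (hypothetical helper)."""
    return [a * a + b * b for a, b in zip(num1, num2)]


n = 100000  # assumed size; raise it to make the difference more visible
num1 = list(range(n))
num2 = list(range(n))

start = time.time()
out = sum_of_squares(num1, num2)
elapsed = time.time() - start

print("Pure Python: %d elements in %.6f s" % (n, elapsed))
```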


  1. very good, thanks!

  2. I got the following:

    AssertionError: length of argument list (2) and CL-generated number of arguments (3) do not agree

    know what that means?

    1. It is due to a change in the call syntax in newer versions of pyopencl. Change this line as follows:
      launch = bld.frst_prog(queue, num1.shape, None, num1_buf,num2_buf,out_buf)

      Hope that helps.

  3. Great posts man! Really! The problem is that there is not a single good tutorial how to install and get the PyOpenCL working... I mean a complete tutorial like software prerequisites, python versions, architecture .. All this. I know that everything is working fine in Python(x,y) but there you can't use pyinstaller to port the project afterwards. Can you point to some good resource please?

    1. Try to install Ubuntu and then apt-get the pyopencl. Good luck.

  4. Excuse me, what exactly is the function of the input variable 'num1.shape' in bld.frst_prog?

  5. Hi,
    Thanks for the tutorial. But what if I include more kernel codes for image processing, can you provide sample code for multi kernel access??


  6. When I run:

    # build the Kernel
    bld = cl.Program(cntxt, code).build()
    # Kernel is now launched
    launch = bld.frst_prog(queue, num1.shape, num1_buf,num2_buf,out_buf)

    it gives me an error:

    "TypeError: enqueue_knl_frst_prog() missing 1 required positional argument: 'arg2'"

    1. I had the same issue. When calling the function here you need to specify local_size as the third argument. If you pass None there:
      launch = bld.frst_prog(queue, num1.shape, None, num1_buf,num2_buf,out_buf)
      it will work.

  7. Thank you for the post!