Monday, 6 February 2017

Are We on the Verge of the Next Hardware Revolution?

The Intel 8080, an early CPU design from Intel in the 1970s, powered one of the first personal computers, the Altair 8800. The commercial success of desktop computers that followed created a whole new industry of software development. Significant evolutions in hardware design, such as SIMD processors, VLIW architectures and multi-core CPUs, have over time influenced the way efficient software is designed and deployed. The appearance of multi-core CPUs such as the AMD Athlon 64 X2 and IBM POWER4, along with a huge increase in the number of supercomputers over the last decade, drove the growth of the field of parallel computing. RISC-based low-power architectures from ARM were perhaps the major reason for the boom in handheld devices and the associated software innovations and services. In the last few years, we have seen the advent of GPUs in general-purpose computing, changing the way applications are developed for HPC.

The ol' times! [ref]
In essence, advancements in the hardware world profoundly influence the trends of Computer Science and Engineering research. The recent breakthroughs and surge in popularity of Deep Learning clearly manifest this. Therefore, the development of next-generation computing infrastructure will be based on today's needs and will define the way next-generation software is developed, making it relevant to the larger CS research community.

Why do you think there is a revolution coming?

There are a few factors that require us to bring a major change in the way we compute.
  1. Moore's Law Uncertainty: We have seen process nodes shrinking over time, but we are now dangerously close to atomic dimensions. Moore's scaling may not continue to push the Perf/Watt of CPUs any more. I spoke to many researchers who are working on sub-10nm technologies, and all of them expressed skepticism about going further. They explained how challenging the scaling in terms of transistor density is, and how low the yield could be.
  2. Increase in Performance Demand: Fast-growing fields such as Deep Learning demand more compute power than ever, because for them the availability of more compute is the key to better results. The usual routine of improving performance by ~10% per CPU design cycle isn't going to cut it.
  3. Parallelism is Limited: The multi-core era helped us push through the performance barriers posed by saturating single-thread performance. However, as per Amdahl's law, the extent to which a program can be parallelized limits the performance growth achievable with multi-core systems. This is true of most day-to-day raw compute requirements. It is clear that throwing in 1000s of cores isn't a great solution for many applications.
  4. Power Wall: There are two extremes here.
    1. Low-end devices: The growth in handheld devices and the expected boom of IoT devices restrict us in terms of power and silicon area budgets. Many applications that were restricted to high-end computers are coming down to handheld devices, which requires performance scaling within the same power envelope.
    2. High-performance computing: The dream of Exascale computing would remain a dream unless we figure out how to power such a system without needing a thermal power plant for the computer itself [ref]!
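The Amdahl's law argument in point 3 is easy to make concrete. Here is a small Python sketch (illustrative numbers, not a benchmark) showing how the theoretical speedup saturates even with thousands of cores:

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's law: speedup = 1 / ((1 - p) + p / n),
    # where p is the parallelisable fraction and n the core count
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# even for a program that is 95% parallel, the speedup is capped
# at 1 / (1 - 0.95) = 20x, no matter how many cores we throw at it
for n in (4, 64, 1024, 65536):
    print("%6d cores -> %.1fx" % (n, amdahl_speedup(0.95, n)))
```

Going from 1024 to 65536 cores buys almost nothing for such a program, which is exactly why "more cores" alone cannot be the answer.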
Overall, we are certainly at one of the most challenging junctures in computer architecture research, and clearly "status quo" is not an option. There is a need to build radically new architectures and rethink the way we balance the PPA (Power, Performance, Area) metrics. Exciting times!

Towards Heterogeneous Compute

If we look back at the recent trends in SoCs released for applications ranging from sensors to servers, there is one trend that is quite common. Consider the specs of an early ARM-based Qualcomm SoC for mobile devices, the Snapdragon S1: it has an ARM CPU, a GPU and a small video processor besides the modem. However, if we look at the recent Snapdragon 835 SoC, along with the CPU, GPU and video block there is a dedicated image processor, a display control processor, a dedicated audio codec, a powerful DSP in three variants and a security subsystem. There is also talk of adding a dedicated processor for machine learning, named NPU (Neural Processing Unit). The same can be seen even in desktop-class SoCs from Intel and AMD. There is one key takeaway here: things are getting heterogeneous.

Overview of the Snapdragon 820 featuring various special purpose hardware components

To give some perspective, the silicon area occupied by the CPU is less than 15% in a modern mobile SoC!

What's the next big thing?

Well, you must be thinking, "Of course! ASICs are more efficient since they sacrifice flexibility". Yes, that is exactly the point: they are orders of magnitude more efficient, and silicon area isn't a concern as long as the power consumption is low. Most systems today are deployed for specific tasks: IoT systems, automotive infotainment, specialised servers for ML, big data, genome sequencing, etc. If we know what a system will be used for, there isn't a compelling necessity to offer the "flexibility" that CPUs provide (though it should remain programmable for that specific application). Moreover, specialised hardware blocks in the SoC can be power-gated and brought up only when needed.

How efficient are ASICs?

The next generation of computers is going to get a lot more heterogeneous. There will be plenty of special-purpose hardware blocks, each optimised for a different task. To put it another way, the CPU will no longer be a Central "Processing" Unit; rather, it will be a Central "Control" Unit. CPUs are built to handle tasks with lots of conditional branches, not for sheer compute horsepower. The heavy processing will be lifted by the special-purpose hardware, while the CPU runs the OS and the control algorithms.

The Heterogeneous Revolution has already begun, and like all revolutions, we only realise it once it is over! It is high time for computer architecture researchers to explore other fields, offer ASIC solutions, and integrate them into the system. System integration in terms of I/O and software is really the key, and in the coming days we will see SoCs become larger and larger, with ever more specialised functionality.

Thursday, 6 February 2014

Parallel Computing on FPGAs

The Parallel Computing community is on the verge of a major paradigm shift! GPUs and multi-core processors, the workhorses of parallel computing, are now facing a challenge from FPGAs! Modern FPGA development boards, equipped with powerful processors, provide a heterogeneous platform as well.

Are FPGAs better than GPUs for parallel computing? In many ways, yes!! I will put down a few reasons for you.

1. Fine-Grained Parallelism:

FPGAs are famous for their fine-grained parallelism: they can support a high degree of interaction between the threads running in parallel, which makes them ideal for algorithms that are non-SIMD or irregular.

2. Data Path Flexibility:

FPGAs are reprogrammable and reconfigurable, which gives them a big advantage over GPUs: the hardware design itself can change according to the requirement. In fact, the whole emerging field of reconfigurable computing depends on FPGAs. Run-time reconfiguration of hardware is certainly very promising.

3. No More HDL!:

The major drawback of FPGAs used to be the tedious programming they required with Hardware Description Languages like VHDL and Verilog. Altera's SDK for OpenCL is, no doubt, a great tool that lets a parallel programmer devote more time to the algorithmic implementation than HDLs allow. Xilinx's recent release of SDAccel is another such effort.

4. Power Consumption: 

FPGAs consume far less power than their counterparts, the GPUs. In today's world, where low-power solutions are much sought after, FPGAs will surely come in handy!

5. Heterogeneous Platform: 

GPUs perform very poorly on sequential code, hence they are always assisted by an onboard CPU. FPGAs, however, can perform well even if the algorithm is predominantly sequential (Amdahl's law), with some part of their resources configured for that task. Also, the on-board processor creates a heterogeneous platform that delivers high performance on non-SIMD workloads.

6. Reconfigurable Computing on FPGAs:

The fast-emerging field of reconfigurable computing is all about the flexibility offered by FPGAs. Run-time reconfiguration, which can itself be parallelised by dividing up the FPGA area, accelerates performance to a great extent! It lets the designer adapt the hardware to the needs of the executing program. Xilinx's Partial Reconfiguration tool can help designers use FPGAs for run-time reconfiguration.

This lets FPGAs exploit the advantages of both worlds simultaneously: the flexibility of general-purpose computing systems and the power, area and speed benefits of ASICs, delivering very high performance. The wide variety of high-bandwidth I/O capabilities of FPGAs is also an important factor.

Here is an example (courtesy: Altera) of a parallel FIR filter's performance on different platforms. Hope that settles this discussion!

We are already seeing FPGAs used for hardware acceleration in HPC applications and appearing in embedded applications such as Advanced Driver Assistance Systems. Better software support and a more active research community in the field of reconfigurable computing can surely drive a wave of revolution.

Saturday, 28 September 2013

OpenCL- Understanding the Framework

With the first post in this series, we had a basic, formal introduction to OpenCL. We discussed the need for parallelism in computation and understood the idea of OpenCL with a simple analogy. In the previous post we created an OpenCL runtime environment using Python. I am pretty sure these gave you a good idea of what OpenCL is and how it works! With the arena set, we shall now try to understand OpenCL in more detail.

Well, I am a huge fan of parallelism (I believe in the theory of parallel universes as well!), so let us commence our advanced discussion of OpenCL with its wonderful ability to exploit data-level parallelism.


 Data Parallelism in OpenCL  

Kernel programs are executed on the device with multiple threads, and one should understand how OpenCL manages this. As introduced earlier, the total number of Work Items that execute in parallel is represented by an N-D domain. In other words, kernels are executed across a global domain of work items, and these work items run in parallel, unlike conventional sequential execution.


  Choosing Dimensions

As you see in the figure above (courtesy: AMD), the N-D range is represented in 3D. The user may specify the dimensions to use in the N-D computational domain, where N is 1, 2 or 3: 1 would represent a vector, 2 an image and 3 a volume. By default (as in our previous example), it is 1D. The choice is left to the designer, but it should be made for a better mapping and better speed.
// 10 million elements in a vector: 1D
// An image: 2D
// A volume: 3D

For modularity, the user can divide the N-D range into separate work groups (with the same dimensionality as the N-D range). If you are familiar with CUDA, these are analogous to grids and blocks. Say you want to process an image of size 1024*1024: you can have work groups that each process a block of size 128*128! Some noteworthy points about work groups:
  • They have their local dimensions.
  • They have their Local memory.
  • Work items can be synchronised within a group, using barriers. But global work items can never be synchronised with each other.
OpenCL supports task-level parallelism as well, which makes it robust and well performing on various platforms.
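To see how the 1024*1024 image with 128*128 work groups maps onto these IDs, here is a small pure-Python sketch (no OpenCL device needed) mimicking what get_group_id and get_local_id would return for a given global ID:

```python
GLOBAL = (1024, 1024)   # the global N-D range (the whole image)
LOCAL = (128, 128)      # the work-group size

def group_and_local_id(gx, gy):
    # mirrors OpenCL's get_group_id() and get_local_id()
    # for a work item with global ID (gx, gy)
    group = (gx // LOCAL[0], gy // LOCAL[1])
    local = (gx % LOCAL[0], gy % LOCAL[1])
    return group, local

# 8 groups per dimension -> 64 work groups in total
num_groups = (GLOBAL[0] // LOCAL[0]) * (GLOBAL[1] // LOCAL[1])
print("work groups:", num_groups)
print(group_and_local_id(300, 515))   # ((2, 4), (44, 3))
```

The same integer division and remainder is what the OpenCL runtime does for you when it splits the N-D range into work groups.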

  OpenCL Memory Model

In OpenCL, which is very data intensive, memory management is of the utmost importance. The programmer must have a very clear picture of the memory model, otherwise the program will crash horribly!

(Courtesy: AMD)

The GPU (OpenCL device) never requests data from the CPU (host); it only responds to data requests by the host. The host sends buffers of data for computation to the device's main memory, called global or constant memory. These are accessible to every work item in the context. Please keep in mind that they are NOT synchronised!

Work groups have their own local memory, accessible only to the work items in the group. With explicit coding by the designer, it can be kept in sync across the work items of the group. Also, every work item has its own private memory, specific to that work item, for the execution of the kernel.
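As an illustration of local memory and that explicit synchronisation, here is a hypothetical kernel sketch (the name, arguments and logic are illustrative, shown as a Python string in the PyOpenCL style but not executed here). Each work group stages its slice of global data in __local memory and synchronises with a barrier before any work item reads its neighbour's value:

```python
# hypothetical kernel sketch: names and the neighbour-sum logic are illustrative
local_sum_kernel = """
__kernel void local_sum(__global const int* in, __global int* out,
                        __local int* scratch)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    scratch[lid] = in[gid];          // stage data in fast local memory
    barrier(CLK_LOCAL_MEM_FENCE);    // sync the work items of this group

    // now every work item in the group can safely read its neighbour
    int left = (lid > 0) ? scratch[lid - 1] : 0;
    out[gid] = left + scratch[lid];
}
"""
```

Without the barrier, a work item might read scratch[lid - 1] before its neighbour has written it; this is exactly the kind of "explicit coding by the designer" meant above.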

 OpenCL Overview

With these fundamentals, let us have an overview of what really happens when we run OpenCL code. Observe the figure below.

OpenCL Framework(courtesy: AMD)

Once you write your code and compile it, it is built on the host. If your code is bug-free, the build succeeds and a kernel program is obtained. Note that these are specific to a context, and the context is controlled by the host. The host then creates memory objects, such as buffers or images, to manage the inputs and outputs.
After these steps, the OpenCL magic starts! A command queue is set up to queue instructions to the OpenCL device. For those familiar with processor architecture, this is analogous to scheduling: in-order execution is like static scheduling, where the instructions are simply executed in order, while in the other case dynamic scheduling happens, where instructions are executed in the order their dependencies allow, improving speed to a great extent. This has its own trade-off in hardware requirements, as one can easily see.
AMD boasts that its OpenCL implementation supports multicore AMD CPUs as OpenCL devices as well. This piece of code depicts command queue creation.

cl_command_queue gpu_q, cpu_q;
gpu_q = clCreateCommandQueue(cntxt, device_gpu, 0, &err);
// 'cntxt' is the created context; device_gpu is the device id of the GPU
cpu_q = clCreateCommandQueue(cntxt, device_cpu, 0, &err);
// device_cpu is the device id of the CPU

The framework above gives a very clear overview of OpenCL execution. Now that you have a good idea of OpenCL, go ahead and start your projects with a single motto: "Think Parallel!". I will be back with the next post about kernel execution and some more OpenCL programs. If you have any doubts, the comment box is right below!


Friday, 20 September 2013

Beginner's Tutorial In PyOpenCL

Hello! Hope you liked the previous introductory post. Let us get started with OpenCL environment creation! Before you start, please make sure you have the following installed:
  • Python 2.7 with PyOpenCL  module and its dependencies(includes pytools and decorator)
  • NumPy module for python
  • OpenCL SDK according to your GPU vendor.
PyOpenCL makes creation of the OpenCL environment easier to an extent I cannot possibly describe. The coder gets to concentrate on writing an efficient kernel, rather than struggling to create the environment. Before we begin, make sure that you have set environment variables (on Windows) like PYOPENCL_CTX accordingly.

A standard, minimal OpenCL program will have the following parts.
  1. Identifying a Platform
  2. Finding the device ID
  3. Creating the context
  4. Creating a command queue in the context
  5. Creating a program source and a kernel entry point
  6. Creating the buffers for data handling
  7. Kernel Program
  8. Build and Launch the Kernel
  9. Read the Output Buffer and clear it(if needed)

These are the standard steps one follows to create the environment. A PyOpenCL user will have the device identified already via environment variables, so for this introduction we may start from step 3. Let us go ahead and do that:

# import the required modules
import pyopencl as cl
import numpy as np

#this line would create a context
cntxt = cl.create_some_context()
#now create a command queue in the context
queue = cl.CommandQueue(cntxt)

Isn't that pretty simple!? That is the advantage of using PyOpenCL to create the environment. To give you an idea, let me show you how the same thing is done in C/C++.

cl_context context = clCreateContext(NULL, 1, &device,
                                     NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(context,
                                              device, 0, NULL);
/* you may improve the code by adding error checks,
   but let me keep it simple ;) */

Now, having created the context and queue, we need to create the buffers that hold the input and output data. The user dumps the input data into the input buffers before passing control to the kernel, and as the kernel executes, OpenCL puts the result back into the output buffer. Note that buffers are the link from host instructions to device-level execution.
Let us use the NumPy module to create the arrays of data:
# create some data array to give as input to Kernel and get output
num1 = np.array(range(10), dtype=np.int32)
num2 = np.array(range(10), dtype=np.int32)
out = np.empty(num1.shape, dtype=np.int32)

# create the buffers to hold the values of the input
num1_buf = cl.Buffer(cntxt, cl.mem_flags.READ_ONLY |
    cl.mem_flags.COPY_HOST_PTR, hostbuf=num1)
num2_buf = cl.Buffer(cntxt, cl.mem_flags.READ_ONLY |
    cl.mem_flags.COPY_HOST_PTR, hostbuf=num2)

# create output buffer
out_buf = cl.Buffer(cntxt, cl.mem_flags.WRITE_ONLY, out.nbytes)

Note that the input buffers are made read-only and the output buffer write-only.

This job would have been quite lengthy in C++. After all, it is the mighty Python! And it is its user-friendly nature that makes it so popular!

Well, now we come to the most important part: let us write a kernel to unleash the power of OpenCL!

# Kernel Program
code = """
__kernel void frst_prog(__global int* num1, __global int* num2, __global int* out)
{
    int i = get_global_id(0);
    out[i] = num1[i]*num1[i] + num2[i]*num2[i];
}
"""
In this simple kernel, we take the data in via the pointers declared as __global int*, then we fetch the global ID of the work item (hope you remember the introduction!) in each kernel instance, and finally we carry out the mathematical operation we need.

To give you an overview of what really happens here: the CPU sends the data to the GPU, where the work is split across many threads that execute in parallel to exploit the GPU's parallelism. (Check out this post for a better understanding.)

After writing the kernel, we compile and launch it with the following code.

# build the Kernel
bld = cl.Program(cntxt, code).build()
# launch the Kernel (None lets OpenCL choose the work-group size)
launch = bld.frst_prog(queue, num1.shape, None, num1_buf, num2_buf, out_buf)
# wait till the process completes
launch.wait()

Now the kernel is launched, and OpenCL does its job by running it efficiently on the available devices. We may read the data from the output buffer:

cl.enqueue_read_buffer(queue, out_buf, out).wait()
# print the output
print "Number1:", num1
print "Number2:", num2
print "Output :", out

If you put together the above code and run it, you should see output like this:

If you are a beginner, I suggest you type the code yourself to get used to the specific syntax. If you are curious about the efficiency of OpenCL, create a large array, write pure Python code for the same computation and find its execution time (you can use the time module). Then do the same for the OpenCL-based code. After comparing both, I am sure you will end up exclaiming, "Ah! It was worth the effort!"
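As a sketch of that exercise, here is the pure-Python baseline timed with the time module, doing the same maths as our kernel. Since I cannot predict your device's OpenCL timing here, vectorised NumPy stands in for the accelerated path; the numbers will differ on your machine:

```python
import time
import numpy as np

N = 100000
num1 = np.arange(N, dtype=np.int32)
num2 = np.arange(N, dtype=np.int32)

# pure Python: the kernel's maths, one element at a time
start = time.time()
out_py = [int(a) * int(a) + int(b) * int(b) for a, b in zip(num1, num2)]
py_time = time.time() - start

# vectorised NumPy, standing in for the parallel path
# (cast to int64 so the squares do not overflow int32)
start = time.time()
out_np = num1.astype(np.int64) ** 2 + num2.astype(np.int64) ** 2
np_time = time.time() - start

print("pure Python: %.4fs, NumPy: %.4fs" % (py_time, np_time))
```

The gap between the two lines already hints at what moving the inner loop off the interpreter buys you; the OpenCL version pushes the same idea onto the GPU.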

    With this I will conclude the Introduction to OpenCL. Check out next post about framework of OpenCL. Subscribe to my posts and feel free to leave a comment below! 

Wednesday, 18 September 2013

OpenCL- Unleash the Power of your GPU

Hi everyone! In this series of posts, I will introduce you to OpenCL, a cutting-edge API being rigorously developed right now by technology giants like AMD, Nvidia, Intel and IBM. We shall use Python to get introduced to the OpenCL environment and revise our C programming basics to write kernel programs. In this post we shall get familiar with the terminology of OpenCL. Why wait!? Let us get started!

According to Moore's law, the number of transistors on a chip doubles roughly every two years, and it is no surprise that he was indeed correct. When we look back a decade, we realise how fast computing performance has developed. But one should admit that we are reaching the fundamental limits of transistor size! We cannot improve computing performance by packing more transistors into the same chip any more! That is precisely where parallel computing spreads a ray of hope, and OpenCL is the brightest source of hope right now!

The real power of OpenCL lies in kernel programs, which are executed on multiple devices in sync. Suppose you have an AMD processor and an AMD graphics card: you may use OpenCL to improve performance through proper use of all that computing power. For SIMD (Single Instruction, Multiple Data) workloads like graphics rendering, image processing, etc., OpenCL is very handy! The main reasons why OpenCL should be your choice for parallel computing are:

  1. CPU and GPU portability:  OpenCL is a very flexible API that allows you to switch between devices for a better performance.
  2. OpenGL and DirectX compatibility: You can have an interoperability between the Graphic APIs for a better performance in graphics rendering.
  3. It can even run on a computer without an OpenCL enabled GPU! Your CPU will still support it!

    If you are a hard core programmer, not satisfied with speed of execution of your program, or if you are a graphics developer and want to improve rendering performance, or if you are an OpenGL/DirectX developer trying to improve performance, or if(I can go on)....., you are in the damn right place!

    Let us dive right in, and get introduced to some basic terms in OpenCL.

  • Device: Devices represent the computing resources in the computer available to OpenCL, for example the GPU.
  • Kernels: Kernels are the programs executed on the device. A kernel basically contains the main instructions you want to accelerate. It is written in C (with some restrictions and extra keywords). A proper understanding of how OpenCL works helps the coder write efficient kernels.
  • Context: An OpenCL context allows the system to execute kernels and transfer data. If you are adding two vectors, the kernel and host programs together represent the context.
  • Command Queue: Within a specific context, in the runtime environment, command queues are created to instruct the devices to carry out operations.
I shall explain some more basics with a simple analogy, which is what actually made me understand what OpenCL is! Here we go!
Consider a building under construction. Workers are spread across the site; unless they are organised properly, it is a mess. Then comes the saviour, the contractor! He arranges the workers into groups and assigns work to each group. Every group does similar work, but at a different place. Each worker fetches equipment from the main store at the beginning. The contractor knows every work group, and he can also identify each worker individually. Now we are ready to learn more! Keep an eye on the words in quotes!
The building construction is like the OpenCL context. Every individual worker is a 'Work-Item'. The groups formed are 'Work-Groups' containing Work-Items. The contractor knows the work groups by their IDs. Every worker has a 'Global ID' and a 'Local ID' specific to his Work-Group. A worker fetching instruments from the store is analogous to a work item fetching data from 'Global Memory'. Work items hold some instruments in their own 'Private Memory', and work groups keep some instruments local to the group in 'Local Memory'. Finally, the work itself represents your kernel.
Now guess who OpenCL is in this analogy!!?? I am sure you have guessed it right! Check out the next post about creating a basic OpenCL environment in Python!

      You are most welcome to leave a comment below!