Tuesday, October 06, 2015

OpenCL - Part 1

As the title suggests, I've lately been interested in parallel computing, its possible applications, and how you can exploit the full processing potential of a computer.

There are a few options out there, in terms of programming languages and libraries, that let programmers develop their tools and applications using a parallel programming model.

As I have a computer equipped with an AMD video card, my natural choice was OpenCL.

After struggling with the video card drivers, and giving in to Microsoft's invitation to upgrade my OS to Windows 10 (which, by the way, is better than I expected), I was able to install the AMD SDK and run the hello world program. You know, if Hello World compiles and (somehow) displays a hello world message on your screen, you are blessed.

Now, what can I do with so much power? Well, the first thing that came to my mind was to develop an algorithm that can run in parallel, is simple to code, and can take advantage of this programming model. I decided to code a 2D frustum culling algorithm (well, a simplified one), which tells whether a polygon lies inside a truncated triangle (a truncated pyramid in 3D), lies outside it, or is only partially contained in it. Something like the figure:

Green means inside, yellow partially contained, and red, outside
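To make the three-way classification above concrete, here is a minimal sequential sketch in C++. It is not the code from my project: the names `Vec2`, `Classification`, `side` and `classify` are made up for illustration, and it assumes the frustum is a convex quadrilateral whose corners are listed counter-clockwise. It uses the usual conservative half-plane test, so a polygon sitting just outside a corner may be reported as partial rather than outside.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 {
    double x;
    double y;
};

enum class Classification { Inside, Partial, Outside };

// 2D cross product of (b - a) and (p - a). It is positive when p lies to the
// left of the directed edge a -> b, which is the interior side for a
// counter-clockwise frustum.
double side(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Hypothetical helper: classifies one polygon against a convex frustum whose
// four corners are listed counter-clockwise (an assumption of this sketch).
Classification classify(const std::array<Vec2, 4>& frustum,
                        const std::vector<Vec2>& polygon) {
    assert(!polygon.empty());
    bool allInside = true;
    for (std::size_t i = 0; i < frustum.size(); ++i) {
        const Vec2& a = frustum[i];
        const Vec2& b = frustum[(i + 1) % frustum.size()];
        std::size_t outside = 0;
        for (const Vec2& p : polygon) {
            if (side(a, b, p) < 0.0) ++outside;
        }
        // Every vertex is beyond this edge: the polygon cannot touch the frustum.
        if (outside == polygon.size()) return Classification::Outside;
        if (outside > 0) allInside = false;
    }
    // No edge rejected the polygon: it is fully inside only if no vertex
    // was ever found on the outer side of an edge.
    return allInside ? Classification::Inside : Classification::Partial;
}
```

Each polygon is classified without looking at any other polygon, which is what makes the problem a natural fit for a parallel model: in principle one work-item can handle one polygon.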

The interesting part of this development was finding out whether a parallel computing model would improve the performance of this algorithm, and whether the improvement over the sequential version would be significant.

The first results were more than promising. Running in Debug mode on about 2,000,000 polygons, the OpenCL version took 153ms while the sequential version took 2913ms. Quite impressive, huh? That's roughly 19 times faster.

Very excited about my findings, and after some debugging here and there, I decided to recompile the code in Release mode. What a disappointment! For the same sample, OpenCL took 32ms while the sequential algorithm took 11ms, which makes OpenCL about three times slower.

What I hadn't realized is that running in Debug mostly slows down the operations done within my own program, not what goes on inside OpenCL. Both versions got faster in Release, but the OpenCL version gained far less (153ms to 32ms) than the sequential one (2913ms to 11ms).

I will post more details about my findings, but I can already tell you which ones are the worst offenders:

1. Creating Buffers: takes about 6ms. It is probably better to create them only once
2. Writing Buffers: 15ms
3. Reading results: 6ms (enqueueReadBuffer)
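Adding up the three figures above shows why the OpenCL version loses in Release mode. The small sketch below only does that arithmetic. It assumes the three costs were measured in the same Release run as the 32ms total and that they simply add up, which I have not verified; the struct and function names are made up for illustration.

```cpp
#include <cassert>

// Timings in milliseconds, copied from the approximate figures in this post.
struct OpenClTimings {
    int createBuffers;
    int writeBuffers;
    int readResults;
    int total;
};

// Time spent managing buffers rather than running the kernel.
int bufferOverheadMs(const OpenClTimings& t) {
    return t.createBuffers + t.writeBuffers + t.readResults;
}

// What is left for the kernel and everything else.
int remainingMs(const OpenClTimings& t) {
    assert(t.total >= bufferOverheadMs(t));
    return t.total - bufferOverheadMs(t);
}

// Rough estimate of the total if the buffers were created only once,
// assuming the creation cost disappears completely from each run.
int totalWithReusedBuffersMs(const OpenClTimings& t) {
    return t.total - t.createBuffers;
}
```

With the numbers from this post (6, 15, 6 and a total of 32), buffer management accounts for 27ms, which leaves about 5ms for everything else. Creating the buffers only once would bring the total down to roughly 26ms, still well above the 11ms of the sequential version, so the transfers themselves are the real problem.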

Also, it is important to note that the kernel contains a simple algorithm, so its processing time is almost negligible, and that the data is very dynamic (consider a scenario where the polygons are moving). With data that changes all the time, memory transfer between the host and the devices may become a serious bottleneck. Something to look into.
