Monday, November 30, 2015

OpenCL - Part 2

CL_MEM_COPY_HOST_PTR: Copies memory from the host to the device when the buffer is created. In my case the copy was taking about 15ms. Instead, I now rely on a zero-copy approach, which saves that time; creating buffers without the extra copy makes a big difference in performance.

CL_MEM_USE_HOST_PTR: In my case this is the preferred choice, as the buffer uses the memory referenced by the host pointer as its storage. From what I've read, this is also the better option on a GPU, since the allocation can end up in pinned memory.
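As a sketch, the two creation modes differ only in the flag passed to the buffer constructor. This fragment assumes the C++ wrapper (cl.hpp) and an already-created `context`; `positions` and `count` are placeholder names, not from my actual code:

```cpp
// Hypothetical host data: one float2 position per polygon.
std::vector<cl_float2> positions(count);

// CL_MEM_COPY_HOST_PTR: the runtime copies 'positions' into a fresh
// device allocation at creation time (the ~15ms cost mentioned above).
cl::Buffer copied(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                  sizeof(cl_float2) * count, positions.data());

// CL_MEM_USE_HOST_PTR: the runtime adopts the host allocation as the
// buffer's storage; on many GPU drivers it lands in pinned memory,
// so no separate upload is needed.
cl::Buffer shared(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                  sizeof(cl_float2) * count, positions.data());
```

One caveat with CL_MEM_USE_HOST_PTR: the host memory must stay alive, and should not be touched while kernels are running, for the buffer's whole lifetime.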

As a result, we get the following timings (compared to the previous post):

1. Creating buffers: from 6ms down to 9µs
2. Writing buffers: from 15ms down to 6ms
3. Reading results: 6ms (enqueueReadBuffer). This is still an issue.

That's already impressive. In the end, the total varies from 10 to 14ms (compared to the previous 32ms).

Doubling the amount of data keeps the same ratio between the two versions, so there's still no clear advantage for the parallel version.

A further improvement required writing the buffers only once. In this particular test, the polygon list and their positions form a static data structure; what's dynamic is the frustum (the enclosing polygon) that determines which polygons are visible.
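The per-frame work then reduces to uploading only the small frustum buffer. A sketch with the C++ wrapper, where `queue`, `cullKernel`, and the three buffers are assumed to have been created once at startup (all names here are illustrative):

```cpp
// polyBuf    - static polygon data, uploaded a single time at startup
// frustumBuf - small buffer holding the current frustum vertices
// resultBuf  - one classification code per polygon

// Per frame: re-upload only the frustum...
queue.enqueueWriteBuffer(frustumBuf, CL_TRUE, 0,
                         sizeof(cl_float2) * frustumVerts.size(),
                         frustumVerts.data());
// ...run one work-item per polygon...
queue.enqueueNDRangeKernel(cullKernel, cl::NullRange,
                           cl::NDRange(polygonCount), cl::NullRange);
// ...and read back the classification results (blocking).
queue.enqueueReadBuffer(resultBuf, CL_TRUE, 0,
                        sizeof(cl_int) * polygonCount, results.data());
```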

After the first load, which takes around 14ms, all subsequent kernel executions and result reads improve to around 7ms, versus 11ms for the sequential version. That's a small difference, but it becomes more obvious as the sample size grows, as in the following table:

Sample       OpenCL   Sequential
200,000      7ms      11ms
400,000      13ms     20ms
800,000      24ms     41ms
1,600,000    47ms     82ms

Not bad; one could say performance is about 2x using OpenCL. That's good, but honestly I expected more. I hope to cover this in the next post... can more performance be gained?

Some Thoughts:

1. Vectorization: Let's be fair, we saw an improvement because the proposed exercise suits parallel processing perfectly, but we may not see such advantages in other situations. If you can vectorize your application, then you may have a chance in this realm. (I hope "vectorize" is a valid term, as my spell checker keeps squiggling it.)

2. Memory transfer: Although it can be fast, keep in mind that memory transfer can hurt performance badly, especially if you have to transfer memory to/from the device often. In this exercise I wanted to run the calculation every frame (or as fast as possible), but that implies transferring/copying memory constantly, which is not really practical. I will look into other possibilities, like enqueued maps.
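For the record, the mapping alternative I plan to try looks roughly like this; it's untested on my side, and assumes the same illustrative `queue`, `resultBuf`, and `polygonCount` names as before:

```cpp
// Map the result buffer into host address space instead of copying it.
// With CL_MEM_USE_HOST_PTR (or CL_MEM_ALLOC_HOST_PTR) this can be a
// true zero-copy read on some platforms.
cl_int err = CL_SUCCESS;
void* mapped = queue.enqueueMapBuffer(resultBuf, CL_TRUE, CL_MAP_READ,
                                      0, sizeof(cl_int) * polygonCount,
                                      nullptr, nullptr, &err);
const cl_int* codes = static_cast<const cl_int*>(mapped);
// ... inspect codes[0] .. codes[polygonCount - 1] here ...
queue.enqueueUnmapMemObject(resultBuf, mapped);
```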

3. Keep the device busy: Even though there's an important gain using OpenCL, it's better suited for heavy processing tasks, such as Fourier transforms and image processing. So, if the problem allows it, keep the kernels as busy as you can.

4. Different platforms: I've read that performance may vary greatly across different hardware (of course), as well as across different libraries. Some library functions may work better on some platforms than on others (that's what I've heard... internet gossip).

5. One drawback of OpenCL is that it is written for plain C, and most STL classes will not work properly. There's a C++ wrapper, which I currently use for my tests, but I still don't get to taste the power of object-oriented programming when using it.

Tuesday, October 06, 2015

OpenCL - Part 1

As the title suggests, I've been interested lately in parallel computing, its possible applications, and how you can exploit the processing potential of a computer.

There are a few options out there in terms of programming languages/libraries that allow programmers to develop their tools and applications using a parallel programming model.

As I have a computer equipped with an AMD video card, my natural choice was OpenCL.

After struggling with the video card drivers, and giving up on ignoring Microsoft's invitation to upgrade my OS to Windows 10 (which, by the way, is better than I expected), I was able to install the AMD SDK and run the hello world program. You know, if Hello World compiles and displays (somehow) a hello world message on your screen, you are blessed.

Now, what can I do with so much power? Well, the first thing that came to my mind was to develop an algorithm that can run in parallel, is simple to code, and can take advantage of this programming model. I decided to code a 2D frustum culling algorithm (well, a simplified one) which tells whether a polygon lies inside a truncated triangle (a truncated pyramid in 3D), lies outside it, or is partially contained in it. Just something like in the figure:

Green means inside, yellow partially contained, and red, outside
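The classification itself is simple. Here is a plain C++ sketch of the idea (the real thing runs as an OpenCL kernel; the names are illustrative). It treats the frustum as a convex polygon with counter-clockwise vertices and tests only the polygon's vertices against each edge's half-plane, which is part of the simplification:

```cpp
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Cross product of edge a->b with a->p: > 0 when p is to the left of the edge.
static float edgeSide(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True when p lies inside the convex frustum polygon (counter-clockwise vertices).
static bool pointInFrustum(const std::vector<Vec2>& frustum, const Vec2& p) {
    for (std::size_t i = 0; i < frustum.size(); ++i) {
        const Vec2& a = frustum[i];
        const Vec2& b = frustum[(i + 1) % frustum.size()];
        if (edgeSide(a, b, p) < 0.0f) return false;  // outside this half-plane
    }
    return true;
}

enum class CullResult { Inside, Partial, Outside };

// Classify a polygon by counting how many of its vertices fall inside.
CullResult classify(const std::vector<Vec2>& frustum,
                    const std::vector<Vec2>& poly) {
    std::size_t in = 0;
    for (const Vec2& v : poly)
        if (pointInFrustum(frustum, v)) ++in;
    if (in == poly.size()) return CullResult::Inside;
    if (in == 0)           return CullResult::Outside;
    return CullResult::Partial;
}
```

Each polygon's classification is independent of the others, which is exactly why one work-item per polygon maps so naturally onto OpenCL.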

The interesting part of this development was finding out whether a parallel computing model would improve the performance of this algorithm, and by how much compared to the sequential version.

First results were more than promising. Running in Debug mode, applying the algorithm to about 2,000,000 polygons took 153ms with OpenCL, while the linear algorithm took 2913ms. Quite impressive, huh? That's about 20 times faster.

Very excited about my findings, and after some debugging here and there, I decided to recompile the code in Release mode. What a disappointment! Running the algorithm on the same sample, OpenCL took 32ms while the linear algorithm took 11ms.

Of course, I hadn't realized that running in Debug mode mostly affects operations inside my own program, not what goes on inside OpenCL. Still, you can see an improvement in both versions, just not as big on the OpenCL side.

I will post more details about my findings, but I can already tell you the worst offenders:

1. Creating buffers: takes about 6ms. Perhaps it's better to create them only once
2. Writing buffers: 15ms
3. Reading results: 6ms (enqueueReadBuffer)

Also, it is important to note that the kernel contains a simple algorithm, so the processing time is almost negligible, and the data is very dynamic (consider a scenario where the polygons are moving). This may pose a serious memory-transfer problem between the host and the device. Something to research.