Monday, November 30, 2015

OpenCL - Part 2



CL_MEM_COPY_HOST_PTR: Copies memory from the host to the device when the buffer is created. In my case that copy was taking about 15ms. Instead I use a zero-copy buffer, which saves that time. Creating the buffers as zero-copy buffers also makes a big difference in performance.

CL_MEM_USE_HOST_PTR: In my case this is the preferred choice, as the buffer uses the memory referenced by the host pointer as its storage. From what I have read, this should also be the better option on a GPU, as the memory can then be allocated as pinned memory.

As a result, we get the following values (compared to the previous post):

1. Creating Buffers: From 6ms to 9µs
2. Writing Buffers: From 15ms to 6ms
3. Reading results: 6ms (enqueueReadBuffer). This is still an issue.

That's already impressive. In the end, the total time varies from 10 to 14ms (compared to the previous 32ms).

Doubling the amount of data keeps the same ratio between both versions, so there's still no advantage for the parallel version.

A further improvement required me to write the buffers only once. In this particular test, the polygon list and the polygon positions are static data; what's dynamic is the frustum (itself a polygon), which encloses the visible polygons.
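To make the static/dynamic split concrete, here is a minimal sequential sketch. It is not the code from this test: it assumes the frustum is a 2D convex polygon with counter-clockwise vertices and that each polygon is represented by a single position. The names Vec2, insideConvex, cull, staticBytes and perFrameBytes are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D point; the real data layout of the test is not shown in the post.
struct Vec2 { float x, y; };

// True when p lies inside (or on the border of) a convex polygon
// whose vertices are listed counter-clockwise (an assumption of this sketch).
bool insideConvex(const std::vector<Vec2>& frustum, Vec2 p) {
    assert(frustum.size() >= 3);  // a polygon needs at least three vertices
    const std::size_t n = frustum.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Vec2 a = frustum[i];
        const Vec2 b = frustum[(i + 1) % n];
        // 2D cross product of edge a->b with a->p:
        // negative means p is on the right-hand (outer) side of this edge.
        const float cross = (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
        if (cross < 0.0f) return false;
    }
    return true;
}

// Sequential reference: one independent test per position.
// Each iteration reads only its own element, which is what would map to one work-item.
std::vector<unsigned char> cull(const std::vector<Vec2>& positions,
                                const std::vector<Vec2>& frustum) {
    std::vector<unsigned char> visible(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        visible[i] = insideConvex(frustum, positions[i]) ? 1 : 0;
    }
    return visible;
}

// Size of the static input (would be uploaded once).
std::size_t staticBytes(const std::vector<Vec2>& positions) {
    return positions.size() * sizeof(Vec2);
}

// Size of the dynamic input (would be uploaded every frame).
std::size_t perFrameBytes(const std::vector<Vec2>& frustum) {
    return frustum.size() * sizeof(Vec2);
}
```

Under these assumptions the positions are by far the larger input, and they never change, while the frustum is only a handful of vertices. That is why writing the big buffer once and refreshing only the frustum per frame should pay off.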

After the first load, which takes around 14ms, all subsequent kernel executions and result reads improve to around 7ms, vs 11ms for the sequential version. That's a small difference, but it becomes more obvious as the sample size grows, as in the following table:

Sample      OpenCL   Sequential
200,000     7ms      11ms
400,000     13ms     20ms
800,000     24ms     41ms
1,600,000   47ms     82ms
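To read the table as a speedup factor, the small sketch below divides the sequential time by the OpenCL time for each row. The values are copied from the table; the names Timing, tableTimings and speedup are only illustrative.

```cpp
#include <cassert>
#include <vector>

// One row of the table: sample count and both timings in milliseconds.
struct Timing {
    int samples;
    double openclMs;
    double sequentialMs;
};

// Values copied from the table in this post.
std::vector<Timing> tableTimings() {
    return {
        {200000, 7.0, 11.0},
        {400000, 13.0, 20.0},
        {800000, 24.0, 41.0},
        {1600000, 47.0, 82.0},
    };
}

// How many times faster the OpenCL version is for a given row.
double speedup(const Timing& t) {
    assert(t.openclMs > 0.0);  // guard against division by zero
    return t.sequentialMs / t.openclMs;
}
```

For these four rows the factor works out to roughly 1.57, 1.54, 1.71 and 1.74, so the gain grows slightly with the sample size but stays below 2x.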

Not bad: going by the table, OpenCL is roughly 1.5x to 1.7x faster (82ms / 47ms ≈ 1.7 for the largest sample), so getting close to 2x. That's good, but actually I expected more. I hope to cover in the next post whether more performance can be gained.

Some Thoughts:

1. Vectorization: Let's be fair, we saw an improvement because the proposed exercise is perfectly suited to parallel processing. We may not see the same advantage in other situations. If you can vectorize your application, then you may have a chance in this realm. (I hope "vectorize" is a valid term, as the spell checker keeps squiggling it.)

2. Memory transfer: Although it can be fast, keep in mind that memory transfer can hurt performance badly, especially if you have to transfer memory to and from the device often. In the exercise I wanted to do the calculations every frame (or as fast as possible), but this implies transferring or copying memory often. That's not really practical, so I will look into other possibilities, like enqueue maps.

3. Keep the device busy: Even though there's an important gain from using OpenCL, it's better suited to heavy processing tasks such as Fourier transforms and image processing. So, if the problem allows it, keep the kernels as busy as you can.

4. Different platforms: I've read that performance may vary greatly across different hardware (of course) as well as across different libraries. Some library functions may work better on some platforms than on others. (That's what I heard ... the internet gossip.)

5. Plain C: One drawback of using OpenCL is that it is written for plain C, and most STL classes will not work properly. There's a wrapper for C++, which I currently use for my tests, but I still don't get a taste of the power of object-oriented programming when using it.
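On the memory transfer point, the measurements above give a feel for how a one-time upload is amortized. The sketch below is only a back-of-the-envelope model: it assumes the first frame costs 14ms in total, every later frame costs 7ms, and the sequential version costs a constant 11ms per frame. The function names are made up for illustration.

```cpp
#include <cassert>

// Average cost per frame when the first frame pays a one-time upload
// and all later frames run at the steady-state cost.
double amortizedMs(double firstFrameMs, double steadyFrameMs, int frames) {
    assert(frames >= 1);
    return (firstFrameMs + steadyFrameMs * (frames - 1)) / frames;
}

// First frame count at which the upload-once version has a lower total time
// than a baseline with a constant per-frame cost. Returns -1 if that never
// happens within maxFrames.
int breakEvenFrame(double firstFrameMs, double steadyFrameMs,
                   double baselineMs, int maxFrames) {
    for (int n = 1; n <= maxFrames; ++n) {
        const double total = firstFrameMs + steadyFrameMs * (n - 1);
        if (total < baselineMs * n) return n;
    }
    return -1;
}
```

With the numbers assumed here (14, 7 and 11), the upload-once version is already ahead in total time by the second frame (21ms vs 22ms). If the buffers had to be rewritten every frame, that advantage would disappear.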

