Monday, November 30, 2015

OpenCL - Part 2

CL_MEM_COPY_HOST_PTR: Copies memory from the host to the device. In my case it was taking about 15ms. Instead I use a zero buffer operation which saves that time. Also, creating buffers with zero buffer operations makes a big difference in terms of performance.

CL_MEM_USE_HOST_PTR: In my case, this is a preferred choice as the memory will use the memory referenced by the host as the storage. After some reading, this would be a better option when using GPU as memory will be allocated in the Pinned memory.

As a result, we get the following values (compared to the previous post)

1. Creating Buffers: From 6ms to 9microSec
2. Writing Buffers: From 15ms to 6ms
3. Reading results: 6ms (enqueueReadBuffer). This is still an issue.

That's already impressive. in the end, the result varies from 10 to 14ms (compared to the previous 32ms).

Doubling the amount of data keeps the same ratio between both versions, so there's still no advantage on the parallel version.

A further improvement required me to write buffers only once. In this particular test, The polygon list and their positions is a static data structure; what's dynamic is the frustum (The polygon) which encloses visible polygons.

After the first load, which takes around 14ms, all subsequent kernel executions and results reading improve to around 7ms vs 11ms of the sequential version. That's a small difference, but a bit more obvious when the sample size grows, as in the following table:

Sample OpenCL Sequential
200,0007ms 11ms
400,00013ms 20ms
800,00024ms 41ms
1'600,00047ms 82ms

Not bad, one could say that performance is about 2x using openCL. That's good, but actually I expected more. I hope to cover cover in next post ... Can be gained more performance?

Some Thoughts:

1. Vectorization: Let's be fair, we saw an improvement because the proposed exercise suits perfect parallel processing. But we may not see advantages in other situations. If you can vectorize your application then you may have chance in this realm. (I hope Vectorize is a valid term, as it keeps being squiggled)

2. Memory transfer: Although it can be fast keep in mind that memory transfer can hit performance really bad specially if you have to transfer memory from/to the device often. In the exercise, I wanted to do calculations every frame (or as fast as possible), but this implies transferring/copying memory ofte (Not really practical. I will look into other possibilities, like enqueue maps)

3. Keep the Device busy: Even though there's an important gain using OpenCL, it's better suited for heavy processing tasks, like for instance Fourier transforms and image processing. So, if the problem allows it, keep kernels busy as much as you can.

4. Different platforms: I've read that performance may greatly vary in different Hardware (of course) as well as in using different libraries. Some library functions may work better in some platforms than others (That's what I heard ... the internet gossip)

5. One drawback of using OpenCL is that is written for plain C, and most stl classes will not work properly. There's a wrapper for c++, which I currently use for my tests, but I still don't taste the power of Object Oriented programming when using it.

Tuesday, October 06, 2015

OpenCL - Part 1

As the title suggests, I've been interested lately in parallel computing, the possible applications, and how can you explode the processing potential of a computer.

There are few options out there in regards to programming languages/libraries to allow programmers developing their tools and applications using a parallel programming model.

As I have a computer equipped with an AMD video card, my natural choice was OpenCL.

After struggling with the video card drivers, and giving up to ignore the invitation of Microsoft to upgrade my OS to windows 10 (which by the way is better than I expected), I was able to install the [AMD sdk] and run the hello world program. You know, if Hello World compiles, and displays (somehow) a hello world message on your screen, you are blessed.

Now, what can I do with so much power? Well, first thing came to my mind was to develop an algorithm which can run in parallel, that is simple to code, and can take advantage of this programming model. I decided then to code a 2D Frustum culling algorithm (well, a simplified one) which tells whether a polygon lies inside a truncated triangle (truncated pyramid in 3D), lies outside, or is partially contained in the polygon. Just something like in the figure:

Green means inside, yellow partially contained, and red, outside

The interesting part of this development was to find if a Parallel Computing model would improve the performance of this algorithm, and if it showed an interesting improvement over the sequential version.

First results were more than promising. Running the algorithm in Debug Mode, applying the algorithm to about 2'000.000 polygons took for OpenCL 153ms while using a linear algorithm took 2913ms. Quite impressive huh? That's about 20 times faster.

Very excited about my findings, and after some debugging here and there, I decided to recompile the code in Release Mode. What a disappointment!. After running the algorithm for the same sample, results showed that OpenCL took 32ms while the linear algorithm took 11ms.

Of course, I didn't realize that running in Debug would mostly affect operations done within my program, and not what was going on in OpenCL. Still, you can see an improvement in both versions, but not that good from within OpenCL.

I will post more details about my findings, but I can already tell you which ones are the worst offenders:

1. Creating Buffers: takes about 6ms. Perhaps is better to create them only once
2. Writing Buffers: 15ms
3. Reading results: 6ms (enqueueReadBuffer)

Also, it is important to note that the kernel contains a simple algorithm, so the processing time is almost negligible, and the data very dynamic (consider an scenario where the polygons are moving). This may pose a serious issue in memory transfer between the host and the devices. Something to research on.

Thursday, August 02, 2012

Launch Boot loader from MBR

In previous posts I went into the process of building a Boot loader in assembler on the MBR (Master Boot Record).

If you open a binary editor to inspect the contents of the Boot loader created in the previous post you will realize that the assembled code occupies around 203 bytes.

Due to the distribution of the MBR, there are around 440 bytes available for the code. So this simple code used to print basic drive information takes around 50% of the total available.

In modern OS (ie Linux) the MBR is used to launch the Boot loader (Linux) from a specific location in the Disk (where there should be more than enough space). Linux, in particular, uses the MBR to launch it's Boot loader (GRUB)

The following code will launch our boot loader from the MBR. First, our launcher must load the Bootloader from the media (aka Disk Drive) to a specific location in memory. Then we call a "jmp" to transfer the flow of execution to the loaded application.

  1. BITS 16                 ;Tell compiler that this is 16 Bit code
  2. org 07C00h              ;Tell compiler code will be loaded in this memory section
  3. mov si, txtStr
  4. call printTxt
  5. call crlf
  6. mov si, txtStr2
  7. call printTxt
  8. call crlf
  9. mov ah, 02h
  10. mov al, 1               ; Sectors to Read
  11. mov cl, 2               ; Sector to Start Reading [5..0]
  12. mov ch, 0               ; Cylinder [15..6]
  13. mov dh, 0               ; Head Number
  14. mov dl, 80h             ; Drive Number Hard Drive 1
  15. mov bx,0x07E0   ;Where to load Bootloader 0x70E0:00
  16. mov es,bx
  17. mov bx,0x00
  18. int 13h         ;Int 13 is all functions for disks
  19. mov si, txtLaunch
  20. call printTxt
  21. call crlf
  22. jmp 0x07E0:0x00         ;Run Bootloader.
  23.                                                 ; PhysicalAddress = Segment * 16 + Offset
  24. JMP $
  25. %include "utils.asm"
  26. ;AppData
  27. txtStr db "Hello Again..", 0
  28. txtStr2 db "Loading Kernel", 0
  29. txtLaunch db "Launching Kernel", 0
  30. times 510-($-$$) db 0   ;Fill the rest of the Loader with "0"
  31. dw 0AA55h               ;Bootloader Signature

Loading from Disk is achieved in lines 11-22. Please read INT 13H documentation to get more details about the "02h" service. In the code above I will read 1 sector from the (virtual) Hard Drive starting from sector #2 cx[5..0], Cylinder 0 cx[15..6], Head #0. It's important to know what this location means in the logical space on Disk; refer to CHS to LBA mapping. Logical space in this code point to the Sector #1 (It's the second sector, having MBR as the first one).

Bootloader will be stored in memory in the location 0x07E0:0x00 just after the location of the MBR. This area of memory is around 480 KB of available space (See Memory Map).

After it has been loaded in the memory, the code will jump to that location in memory and continue executing (line 28).

The following code is an example of our Boot loader. It simply prints "Bootloader Running" if everything is working correctly. There's not much to say other than the data registry (ds) must be reset, otherwise data will point to wrong locations in memory.

  1. BITS 16         ;Tell compiler that this is 16 Bit code
  2. push cs         ;Reset ds
  3. pop ds
  4. mov si, txtStr
  5. call printTxt
  6. call crlf
  7. JMP $
  8. %include "utils.asm"
  9. ;AppData
  10. txtStr db "Bootloader Running!", 0

There's no special magic during the assembly step. Files are compiled as in previous posts.

Only one question is left to answer. Where is our Boot loader stored? The answer lies in the line 13 on the first piece of code. Binary file must be stored in Sector 1 of the hard drive. The following instruction will do the job for us:

       dd if=test.bin bs=512 of=/dev/sdb seek=1

As usual (It will be usual in the following posts) there are a lot of steps to be done. You can complete them as exercise.

  • Take proper actions on the result of INT 13h. What if it returns an error?
  • How would you load a bigger Boot loader?
  • How would you implement "utils.asm"?

Friday, July 27, 2012

Reading Disk Properties in Assembler - INT 13H

This post is a continuation from the previous post [Writting a Bootloader] which was an attempt to build a Bootloader for the x86 architecture.

Although the executable size is limited to 512 Bytes, there are several important tasks that can be accomplished during booting process. One of them is to read the Geometry from several storage devices.

Interrupt 13h [INT 13h] provides sector-based disk access services using the CHS addressing  method (Cylinder/Head/Sectors). This values provide the physical location of the data (although this is not completely true nowadays).

The code below is an example of this interrupt. It reads the CHS from the first Hard Disk using Interrupt 13h.

Briefly, interrupt parameters are set in AX and DX registers. After interrupt is completed, its values can be read from the appropriate registers (CX, DX). The rest of the code includes helper functions to print numbers and strings to present results properly.

After booting our virtual system using the code, we get the following result:

    Hello Again
    Hard Drive Information
    Disk Geometry (C/H/S): 130/16/63

Comparing these results with fdisk on the same drive:

    Disk /dev/sdb: 67 MB, 67108864 bytes
    255 heads, 63 sectors/track, 8 cylinders, total 131072 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000

I am still trying to figure why these values are so different. Anyway, the total size of the drive taking the total oh Cylinders/Heads/Sectors will be close to the drive size described by fdisk.

Finally, notice that 08h service is limited to drives up to 500 MB. If there's required to read properties from bigger drives, service 48h is the appropriate.

As exercise, following improvements can be done:
  • Report an error if reading fails
  • Read properties from all connected drives
  • Determine drive size in LBA format
  • Modify code to read bigger drives

  1. BITS 16                 ;Tell compiler that this is 16 Bit code
  2. org 07C00h              ;Tell compiler code will be loaded in this memory section
  4. mov si, txtStr
  5. call printTxt
  6. call crlf
  7. mov si, txtStr2
  8. call printTxt
  9. call crlf
  11. mov si, txtDisk
  12. call printTxt
  14. ;Read HD1 parameters
  15. mov AH, 08h     ;Read Drive Parameters
  16. mov DL, 80h     ; First Drive
  17. int 13h
  19. push cx
  20. mov ax, cx      ;Cylinders
  21. and ax, 0C0h
  22. shl ax, 2
  23. and cx, 0FF00h
  24. shr cx, 8
  25. or cx, ax
  26. mov ax, cx
  27. inc ax
  28. call printNum
  29. mov si, txtSlash
  30. call printTxt
  31. pop cx
  33. mov al, dh      ;Num Heads
  34. mov ah, 0
  35. inc ax
  36. call printNum
  37. mov si, txtSlash
  38. call printTxt
  40. and cx, 3Fh ;[5..0] Sectors Per Track
  41. mov ax, cx
  42. call printNum
  43. call crlf
  45. JMP $
  47. ;********************* Utility Functions ******************
  49. printNum:       ;Print a number (ax)
  50.         push cx
  51.         push dx
  52.         mov cx,000Ah    ;divide by 10
  53.         mov bx, sp
  54.         getDigit:
  55.                 mov dx,0        ;puting 0 in the high part of the divided number (DX:AX)
  56.                 div cx          ;DX:AX/cx.  ax=dx:ax/cx  and dx=dx:ax%cx(modulus)
  57.                 push dx
  58.                 cmp ax,0
  59.         jne getDigit
  61.         printNmb:
  62.                 pop ax
  63.                 add al, 30h     ;adding the '0' char for printing
  64.                 mov ah,0eh      ;print char interrupt
  65.                 int 10h
  66.                 cmp bx, sp
  67.         jne printNmb
  69.         pop dx
  70.         pop cx
  71.         RET
  73. printTxt:       ;Print a String (si)
  74.   lodsb
  75.   cmp al, 0
  76.   je exitFn
  77.   mov ah, 0Eh
  78.   int 10h
  79.   jmp printTxt
  80. exitFn:
  81.   RET
  83. crlf:   ;Print End of line
  84.         mov ax, 0E0dh
  85.         int 10h
  86.         mov ax, 0E0ah
  87.         int 10h
  88.         RET
  90. ;AppData
  91. txtStr db "Hello Again", 0
  92. txtStr2 db "Hard Drive Information", 0
  93. txtDisk db "Disk Geometry (C/H/S): ", 0
  94. txtSlash db "/", 0
  95. times 510-($-$$) db 0   ;Fill the rest of the Loader with "0"
  96. dw 0AA55h               ;Bootloader Signature

Thursday, July 26, 2012

Writing a Bootloader

I am starting to think of any reason to Write a Bootloader ...

I've been always interested not only how to write applications but understanding how the whole system works, from the moment the machine boots.

The first steps in computer booting can be described in general as follows:
  • Power Up
  • Start Processing ROM BIOS
  • Execute POS (Power-on self Test)
  • Check hardware devices
  • Look for OS to load

During the last step, BIOS will search for Storage Devices (described in the device Boot Order). These are usually CD-Rom, Hard Drives, USB Drives, Disk Drives, etc.

For each Device (In order), BIOS will look for MBR (Master Boot Record) which corresponds to the first 512 bytes of the Device Image. These 512 Bytes are loaded then into the memory and executed. It's a very small space in memory to be able to achieve many tasks. 

Bootloaders are commonly used to access the kernel loader located in the boot sector of the drive. Other tasks includes initialize critical hardware devices and initialize other resources such as memory, interrupts, clocks and timers.

Hmm looks like I found a reason to create a bootloader ... Just for Fun! (Anyway, I got sick note from Doctor so I have to stay home ... It looks like I am getting Bored!)

Let's get our hands Dirty. Which tools from the Hardware store do we need?
Good news is we don't need any (If you want to go farther, then get some old USB Flash Drive)

Software of course is a must:
  • Any Linux flavor (I used fedora)
  • A Virtualization Application (Virtual Box)
I will run the Bootloader in a Virtual Machine because it will allow me to test easier every change I do to the source code and performing a real test will require me to have a spare machine for tests. (My wife would kill me if She finds her computer broken when She arrives Home)

As my computer is running Windows 7, I have trouble finding appropriate tools to develop a bootloader (It must be compiled as 16Bit Binary File). If you are interested, there's nasm for windows to compile your code, and partcopy to write the MBR of a Device.

Instead a Virtual Environment is also possible to use Bochs which is a x86 (PC) emulator.

First, create the Linux virtual machine in VirtualBox (It's pretty straightforward). After you spent some time installing it, let's create the next Virtual Machine.

I've created a Virtual Machine with 64MB of memory and "Other" as operating system. For the storage I've created a virtual Drive with 64MB of space and type "Shareable". (It will be used inside Linux and our new OS).

Then, Include this Drive in your linux configuration as IDE Primary slave. I will allow linux to recognize it as a valid drive.

Now it's time to code. Run your Linux Virtual Machine and open your favorite text editor.

There are four important requirements for the bootloader:

  • Must be compiled in 16Bit mode
  • Define Memory offset when code is loaded into the physical address.
  • Boot Memory space not used must be cleared.
  • Boot Signature: This is required by the BIOS to determine if the Device is Bootable. Failing to find this signature will cause BIOS to ignore the device.
The following code will generate a basic bootloader:

  1. BITS 16                 ;Tell compiler that this is 16 Bit code
  2. org 07C00h              ;Tell compiler code will be loaded in this memory section
  3. mov si, txtStr
  4. call printTxt
  5. JMP $
  6. printTxt:
  7.   lodsb
  8.   cmp al, 0
  9.   je exitFn
  10.   mov ah, 0Eh
  11.   int 10h
  12.   jmp printTxt
  13. exitFn:
  14.   RET
  15. ;AppData
  16. txtStr db "Hello You!", 0
  17. times 510-($-$$) db 0   ;Fill the rest of the Loader with "0"
  18. dw 0AA55h               ;Bootloader Signature

I will not go into the details of each call. Briefly, it defines code as 16 Bits, tells compiler where in physical memory code will be loaded, prints the string defined in "txtStr", fills the rest of bytes with "0" and finally writes the bootloader signature.

Now I will compile code above using nasm:

     nasm bootload.asm -f bin -o bootload.bin

In the final step I will use the linux "dd" command to copy the generated binary file to the device:

     dd if= bootload.bin bs=512 of=/dev/sdb

Notice that /dev/fd0 is the additional virtual drive we added to the Linux machine. Do not mount the device!

If command above succeeded, go to to Virtual Box Manager and launch the OS where you just copied the Bootloader. Hopefully you will get a screen like this:

If you get this screen, then you can continue your research and go beyond the typical "Hello World" example. Low level programming and OS development is a very interesting topic that is worth exploring.