[CUDA vc++] performance issues


Question

I was recently looking into developing applications with CUDA, but so far I've been unsuccessful.

I pieced together a test program that runs through an array and increments each value by a fixed amount, but its performance was fairly poor compared to the CPU.

I don't usually like asking for code, but I have tried and am completely stuck. I was hoping someone could provide sample code that copies an array of pre-defined values over to the GPU and takes advantage of all the unified shaders on my card (32 on my 8600 GT). Any help will be greatly appreciated :)

I'm trying to code this in Visual Studio 2008, but I heard there were some issues with CUDA not supporting VS2008, so I've downloaded Visual C++ 2005 Express Edition.

The problem I'm having is that it's really slow. I guess I was expecting more, but I can't say for certain; I'm hoping I've done something wrong and more performance can be gained.

Thank you :)
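For reference, here is a minimal sketch of the kind of test described above: copy an array to the GPU, add a fixed amount to every element, and copy it back. The names `incrementKernel` and `incrementOnGpu` and the block size of 256 are my own assumptions, not taken from any SDK sample.

```cuda
#include <cuda_runtime.h>

// Each thread increments one element. The bounds check covers the last,
// partially filled block.
__global__ void incrementKernel(float* data, int n, float amount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += amount;
    }
}

// Hypothetical helper: copies `host` (n > 0 elements) to the GPU, adds
// `amount` to every element, and copies the result back into `host`.
// Returns false on any CUDA error.
bool incrementOnGpu(float* host, int n, float amount)
{
    float* dev = NULL;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threadsPerBlock = 256;  // assumed; a multiple of the warp size
        // Round up so every element gets a thread. On GeForce 8 class
        // hardware this block count must not exceed 65,535.
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        incrementKernel<<<blocks, threadsPerBlock>>>(dev, n, amount);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

One thing to keep in mind when benchmarking something this simple: the kernel does a single add per element, so the two copies across the bus can easily take longer than the computation itself. Timing the kernel separately from the copies should show where the time actually goes.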

8 answers to this question

  • 0

Yeah, I did look through the SDK, and the closest sample I could find to what I was after is cppIntegration, but it still doesn't do exactly what I want. I've changed it a fair bit, but it still seems slow.

Hmm, I just realised I hadn't read the programming guide (I thought I had read it, but it must have been something else).

Please disregard this thread... I'll come back if I get stuck after reading it :)

  • 0

OK, I've gotten a program to work and it runs fairly fast: at least 7x faster than the CPU (Pentium 4 3.2 GHz vs. 8600 GT), though I'm still curious whether it could be faster.

But I've been reading a bit about CUDA and how the GPU works (bear with me please :) ), and I'm sort of confused about exactly how many threads are being executed at the same time.

The 8600 GT has 4 SMs, for a total of 32 unified shaders. I've read that you can only have 768 threads running per SM, so I can only have 3,072 threads ready for execution.

So even though each block can have 512 threads, I can really only have 96 threads per block if I want to use all 8 blocks per SM.

So in my mind I figure, OK, I can execute 3,072 threads simultaneously at any given time. But then I read this:

[snip]since there can be no more than 768 threads in each SM and this amounts to 768/32 = 24 warps.[snip]

To summarize, for the GeForce-8 series processors, there can be up to 24 warps residing in each Streaming Multiprocessor at any point in time. We should also point out that the SMs are designed such that only one of these warps will be actually executed by the hardware at any point in time.

So does this mean only 32 threads can be executed per SM (128 threads for the 8600 GT and 512 threads for the 8800 GTX)?

This grid/thread/warp/block/kernel thing is fairly confusing. I just want a rough idea of how much can be processed at a given time, and to be able to compare the 8600 GT and 8800 GT :)

Also, I'm working on 33,553,920 elements, using 65,535 blocks with a block size of 512. Is this the best way to go about it? If I wanted to process more elements, how would I do so?

Am I able to create multiple grids, to have multiple sets of 65,535 blocks with 512 threads?
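On the question of processing more elements than 65,535 blocks x 512 threads can cover: one common approach is a grid-stride loop, where each thread handles several elements spaced one full grid apart. The sketch below assumes the data is already in device memory. The names `incrementStrided` and `launchIncrementStrided`, the block size of 256, and the 65,535 block cap are illustrative choices on my part.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and then jumps
// ahead by the total number of threads in the grid until it passes n.
__global__ void incrementStrided(float* data, size_t n, float amount)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += stride) {
        data[i] += amount;
    }
}

// Hypothetical launcher: `dev` must point to n floats in device memory.
// The grid is capped at 65,535 blocks (the 1D limit on GeForce 8 class
// hardware); the loop in the kernel picks up whatever the grid cannot
// cover in a single pass.
bool launchIncrementStrided(float* dev, size_t n, float amount)
{
    const int threadsPerBlock = 256;   // assumed block size
    const size_t maxBlocks = 65535;
    size_t needed = (n + threadsPerBlock - 1) / threadsPerBlock;
    int blocks = (int)(needed < maxBlocks ? needed : maxBlocks);
    if (blocks == 0) return true;      // nothing to do

    incrementStrided<<<blocks, threadsPerBlock>>>(dev, n, amount);
    // cudaDeviceSynchronize is the current name; very old toolkits used
    // cudaThreadSynchronize instead.
    return cudaGetLastError() == cudaSuccess
        && cudaDeviceSynchronize() == cudaSuccess;
}
```

As for multiple grids: each kernel launch creates its own grid, so launching the same kernel several times on different slices of the array should also work, as should a 2D grid. The loop above just avoids having to manage the extra launches.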

  • 0
The 8600 GT has 4 SMs, the 8800 GTX has 16 SMs, and one SM can have at most 768 threads running concurrently, so I will leave the mathematics to you ;)

As for the grid/thread/warp/block/kernel thing: grid size = number of blocks in the grid, block size = number of threads per block, 1 warp = 32 threads, and a kernel is just the program segment that runs on the GPU.
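If you would rather read these numbers off your own card than look them up, the runtime API can report them. This is a rough sketch that assumes device 0 and a reasonably recent toolkit (the `maxThreadsPerMultiProcessor` field is not present in the earliest CUDA releases); the function name `printThreadLimits` is made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the per-device limits discussed in this thread for device 0.
// Returns the maximum number of resident threads on the whole device,
// or -1 if the query fails.
int printThreadLimits()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    // Resident warps per SM, e.g. 768 / 32 = 24 on the GeForce 8 series.
    int warpsPerSm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    // Resident threads across the whole device.
    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    printf("device:                 %s\n", prop.name);
    printf("multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
    printf("warp size:              %d\n", prop.warpSize);
    printf("max threads per block:  %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max resident warps/SM:  %d\n", warpsPerSm);
    printf("max resident threads:   %d\n", residentThreads);
    printf("max grid size (x):      %d\n", prop.maxGridSize[0]);
    return residentThreads;
}
```

Calling `printThreadLimits()` once at startup is enough; the values are fixed for a given card.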

  • 0

Cool, thanks. I got the first part, that 768 threads can run concurrently, but further down in the document I was reading it said only 1 warp will be executed by the hardware (I'm assuming per multiprocessor).

That would mean only 32 threads are executed at the same time per multiprocessor: 128 overall for my 8600 GT and 512 overall for the 8800 GTX.

I'm hoping it's executing 768 threads concurrently instead of 32 per multiprocessor :D

The code is working great. I'm just playing around with manipulating large arrays and benchmarking them against running on the CPU. I'll need to find out whether it's possible to store a bitmap array on the video card that is accessible by BitBlt, so I can render an image without having to copy it back to system memory :)

  • 0
As I recall, it should be 1 warp in each PROCESSOR, not each multiprocessor.

  • 0
The document I was reading says:

"... there can be up to 24 warps residing in each Streaming Multiprocessor at any point in time. We should also point out that the SMs are designed such that only one of these warps will be actually executed by the hardware at any point in time."

That would be 384 warps overall on an 8800 GTX.
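For anyone following the numbers, here is the arithmetic from this thread written out. The first three lines use only figures quoted above (768 threads per SM, 32 threads per warp, 4 SMs on the 8600 GT, and the 16 SMs implied by the 384 figure for the 8800 GTX). The last line is my own understanding of the GeForce 8 design, so treat it as an assumption: with 8 shaders per SM (32 shaders / 4 SMs), one warp instruction would take 4 clocks to issue.

```latex
\begin{align*}
\text{resident warps per SM} &= 768 / 32 = 24 \\
\text{resident warps, 8600 GT} &= 4 \times 24 = 96 \quad (4 \times 768 = 3072 \text{ threads}) \\
\text{resident warps, 8800 GTX} &= 16 \times 24 = 384 \quad (16 \times 768 = 12288 \text{ threads}) \\
\text{clocks per warp instruction} &= 32 / 8 = 4
\end{align*}
```

So "resident" and "executing in the same clock" are two different counts: the 768 threads per SM are all in flight, and the hardware switches between their warps to hide memory latency, even though only one warp is being issued at any instant.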

So yeah, ah well, it doesn't matter; I'm having fun tinkering with it. It would be nice if I could store data on the video card and use BitBlt to access it and draw on the screen. Do you know if that's possible? I'm going to try to get a pointer to some memory and hope BitBlt can access it. That would be very interesting :D

This topic is now closed to further replies.