On Fri, Sep 23, 2011 at 11:16 AM, Mark Knecht <markknecht@×××××.com> wrote:
> On Fri, Sep 23, 2011 at 6:49 AM, Michael Mol <mikemol@×××××.com> wrote:
> While I'm not a programmer at all I have been playing with some CUDA
> programming this year. The couple of comments below are based around
> that GPU framework and might differ for others.
>
> 1) I don't think the GPU latencies are much different than CPU
> latencies. A lot of it can be done with DMA so that the CPU is hardly
> involved once the pointers are set up. Of course it depends on the
> system but the GPU is pretty close to the action so it should be quite
> fast getting started.

As long as stuff is done wholly in the GPU, the kind of latency I was
worried about (GPU<->system RAM<->CPU) isn't a problem. The problem is
going to be anything that involves data being passed back and forth,
or decisions needing to be made by the CPU. I concur with James that
CPU+GPU parts will help a great deal in that regard.

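For what it's worth, here's a quick back-of-envelope on that round-trip cost (every number below is my own assumption for illustration, nothing measured): small packets are the killer case, since gigabit line rate at minimum-size frames is on the order of two million packets per second, and one synchronous host<->GPU hop per packet can't come close.

```python
# Back-of-envelope: synchronous host<->GPU round trips vs. packet rate.
# All constants are illustrative assumptions, not measurements.

ROUND_TRIP_S = 10e-6   # assumed cost of one host->GPU->host hop
PACKET_BITS = 64 * 8   # minimum-size Ethernet frame

# One round trip per packet caps the packet rate at:
packets_per_s = 1 / ROUND_TRIP_S
throughput_bps = packets_per_s * PACKET_BITS
print(f"{packets_per_s:.0f} pkt/s, {throughput_bps / 1e6:.0f} Mbit/s")

# Batching amortizes the latency: ship many packets per transfer.
BATCH = 1000
print(f"batched: {BATCH / ROUND_TRIP_S:.0f} pkt/s")
```

Which is why batching, or keeping the whole decision on one side of the bus, matters so much here.
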
> 2) The big deal with GPUs is that they really pay off when you need to
> do a lot of the same calculations on different data in parallel. A
> book I read + some online stuff suggested they didn't pay off speed
> wise until you were doing at least 100 operations in parallel.
>
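That "at least 100" figure is consistent with a simple cost model: the GPU only wins once its fixed launch/transfer overhead is spread over enough parallel work. A toy sketch (all the constants are made up for illustration, not benchmarks):

```python
# Toy cost model: when does a GPU beat a CPU on n identical operations?
# All constants are illustrative assumptions, not benchmarks.

CPU_OP_S = 1e-8        # assumed time per operation on the CPU
GPU_OP_S = 5e-8        # assumed per-op time on one GPU core (slower clock)
GPU_OVERHEAD_S = 1e-6  # assumed fixed kernel-launch/transfer overhead
GPU_CORES = 352        # e.g. a GTX 465

def cpu_time(n):
    return n * CPU_OP_S          # operations run one after another

def gpu_time(n):
    # ceil(n / cores) "waves" of parallel work, plus fixed overhead
    waves = -(-n // GPU_CORES)
    return GPU_OVERHEAD_S + waves * GPU_OP_S

# Find the smallest n where the GPU pulls ahead.
n = 1
while gpu_time(n) >= cpu_time(n):
    n += 1
print(n)  # roughly 100 with these made-up constants
```

Change the overhead or the per-op speeds and the crossover moves, but the shape of the curve is the same: below some batch size the launch overhead dominates.
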
> 3) You do have to get the data into the GPU so for things that used
> fixed data blocks, like shading graphical elements, that data can be
> loaded once and reused over and over. That can be very fast. In my
> case it's financial data getting evaluated 1000 ways so that's
> effective. For data like a packet I don't know how many ways there are
> to evaluate that so I cannot suggest what the value would be.

Yeah, that's the problem. Cache loses its utility the less you have
to revisit the same pieces of data. When they're talking about
multiple gigabits per second of throughput, cache won't be much good
for more than prefetches.

>
> None the less it's an interesting idea and certainly offloads computer
> cycles that might be better used for other things.

Earlier this year, I experimented a little with how one could
implement a Turing-complete language in a branchless style, like on
GPGPUs*. I figure it's doable, but you waste cores and memory on
discarded results. (Similar to when CPUs mispredict branches, but
worse.)

* OK, they're not branchless, but branches kill performance; I recall
my reading of the CUDA manual indicating that code has to be brought
back in step after a branch before any of the results are available.
But that was about two years ago when I read it.

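That's the SIMT model: the lanes of a warp execute in lockstep, so on a divergent branch the hardware runs both sides serially with the inactive lanes masked off, and the lanes only reconverge afterwards. A rough Python sketch of the idea (the warp size of 32 is real CUDA; everything else is a simplification):

```python
# Rough simulation of SIMT branch divergence: all 32 lanes of a warp
# run in lockstep, so a divergent if/else executes BOTH paths serially,
# masking off the inactive lanes each time.

WARP_SIZE = 32

def run_warp(values):
    """Per-lane program: x = x*2 if x is even else x+1."""
    steps = 0
    cond = [v % 2 == 0 for v in values]
    out = list(values)

    # Pass 1: 'then' path, only even lanes active.
    if any(cond):
        steps += 1
        out = [v * 2 if c else v for v, c in zip(out, cond)]

    # Pass 2: 'else' path, only odd lanes active.
    if not all(cond):
        steps += 1
        out = [v + 1 if not c else v for v, c in zip(out, cond)]

    return out, steps  # lanes reconverge here

# Uniform warp: every lane takes the same path -> one pass.
_, steps_uniform = run_warp([2] * WARP_SIZE)
# Divergent warp: both paths execute -> two passes for the same work.
_, steps_divergent = run_warp(list(range(WARP_SIZE)))
print(steps_uniform, steps_divergent)  # prints: 1 2
```

So the discarded-results cost of doing it branchlessly is roughly what divergence costs you anyway: the hardware computes both outcomes and throws one away per lane.
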
>
> My NVidia 465GTX has 352 CUDA cores while the GS8200 has only 8 so
> there can be a huge difference based on what GPU you have available.


--
:wq