Gentoo Archives: gentoo-user

From: Michael Mol <mikemol@×××××.com>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] PacketShader - firewall using GPU
Date: Fri, 23 Sep 2011 15:34:07
Message-Id: CA+czFiCR62tEGYVc=GxmZqu+J94S9kdOhu6mjVMH5dqxAqrEmg@mail.gmail.com
In Reply to: Re: [gentoo-user] PacketShader - firewall using GPU by Mark Knecht
On Fri, Sep 23, 2011 at 11:16 AM, Mark Knecht <markknecht@×××××.com> wrote:
> On Fri, Sep 23, 2011 at 6:49 AM, Michael Mol <mikemol@×××××.com> wrote:
> While I'm not a programmer at all I have been playing with some CUDA
> programming this year. The couple of comments below are based around
> that GPU framework and might differ for others.
>
> 1) I don't think the GPU latencies are much different than CPU
> latencies. A lot of it can be done with DMA so that the CPU is hardly
> involved once the pointers are set up. Of course it depends on the
> system but the GPU is pretty close to the action so it should be quite
> fast getting started.

As long as stuff is done wholly on the GPU, the kind of latency I was
worried about (GPU<->system RAM<->CPU) isn't a problem. The problem is
going to be anything that involves data being passed back and forth,
or decisions needing to be made by the CPU. I concur with James that
CPU+GPU parts will help a great deal in that regard.

> 2) The big deal with GPUs is that they really pay off when you need to
> do a lot of the same calculations on different data in parallel. A
> book I read + some online stuff suggested they didn't pay off speed
> wise until you were doing at least 100 operations in parallel.
>
> 3) You do have to get the data into the GPU so for things that used
> fixed data blocks, like shading graphical elements, that data can be
> loaded once and reused over and over. That can be very fast. In my
> case it's financial data getting evaluated 1000 ways so that's
> effective. For data like a packet I don't know how many ways there are
> to evaluate that so I cannot suggest what the value would be.

Yeah, that's the problem. Cache loses its utility as you revisit the
same pieces of data less and less. When they're talking about multiple
gigabits per second of throughput, cache won't be much good for more
than prefetches.

>
> Nonetheless it's an interesting idea and certainly offloads compute
> cycles that might be better used for other things.

Earlier this year, I experimented a little bit with how one could
implement a Turing-complete language in a branchless style, like on
GPGPUs*. I figure it's doable, but you waste cores and memory on
discarded results. (Similar to when CPUs mispredict branches, but
worse.)

* OK, they're not branchless, but branches kill performance; I recall
my reading of the CUDA manual indicating that code has to be brought
back in step after a branch before any of the results are available.
But that was about two years ago when I read it.

>
> My NVidia 465GTX has 352 CUDA cores while the GS8200 has only 8 so
> there can be a huge difference based on what GPU you have available.


--
:wq