On 13.03.2012 20:07, Stroller wrote:
>
> On 13 March 2012, at 18:18, Michael Mol wrote:
>> ...
>>> So I assume the i586 version is better for you --- unless GCC
>>> suddenly got a lot better at optimizing code.
>>
>> Since when, exactly? GCC isn't the best compiler at optimization,
>> but I fully expect current versions to produce better code for
>> x86-64 than hand-tuned i586. Wider registers, more registers,
>> crypto acceleration instructions and SIMD instructions are all very
>> nice to have. I don't know the specifics of AES, though, or what
>> kind of crypto algorithm it is, so it's entirely possible that one
>> can't effectively parallelize it except in some relatively unique
>> circumstances.
>
> Do you have much experience of writing assembler?
>
> I don't, and I'm not an expert on this, but I've read the odd blog
> article on this subject over the years.
>
> What I've read often has the programmer looking at the machine code
> gcc produces and examining what it does. The compiler might not care
> how many registers it uses, and thus a variable might find itself
> frequently swapped back into RAM; the programmer does not have any
> control over the compiler, and IIRC some flags reserve a register for
> debugging (IIRC -fomit-frame-pointer disables this). I think it's
> possible to use registers more efficiently by swapping them (??) or
> by using bitwise comparisons and other tricks.
>

You recall correctly about the frame pointer.

Concerning the register usage: I'm no expert in this field, either, but
I think the main issue is not simply register allocation but branch and
exception prediction and so on. The compiler can optimize either for a
seamless continuation when the jump happens or for when it doesn't. A
human or a just-in-time compiler can handle these cases better by
predicting the outcome or -- in the case of a JIT -- by analyzing the
outcome of the first few iterations.

OT: IIRC, register reuse is also the main performance problem of
state-of-the-art JavaScript engines at the moment. For the code they
compile at runtime, they are nearly as good as `gcc -O0`, but they have
the same problem concerning registers (GCC with -O0 produces code that
works exactly as you describe above: storing the result to memory after
every computation and loading it again).
|
> Assembler optimisation is only used on sections of code that are at
> the core of a loop - that are called hundreds or thousands (even
> millions?) of times during the program's execution. It's not for
> code, such as reading the .config file or initialisation, which is
> only called once. Because the code in the core of the loop is called
> so often, you don't have to achieve much of an optimisation for the
> aggregate saving to be considerable.
>
> The operations in question may only constitute a few lines of C,
> or a handful of machine operations, so it boils down to an algorithm
> that a human programmer is capable of getting a grip on and
> comprehending. Whilst compilers are clearly more efficient for large
> programs, on this micro scale, humans are more clever and creative
> than machines.
>
> Encryption / decryption is an example of code that lends itself to
> this kind of optimisation. In particular AES was designed, I believe,
> to be amenable to implementation in this way. The reason for that was
> that it was desirable to have it run on embedded devices and on
> dedicated chips. So it boils down to a simple bitswap operation (??)
> - the plaintext is modified by the encryption key, input and output
> as a fast stream. Each byte goes in, each byte goes out, the same
> function performed on each one.
>

Well, sort of. First off, you are right: AES was designed with hardware
implementations in mind.
|
The algorithm boils down to a number of substitution and permutation
networks and XOR operations (I assume that's what you meant by
bitswap). If you look at the portable C code
(/usr/src/linux/crypto/aes_generic.c), you can see that it mostly
consists of lookup tables and XORs.
|
The thing about "each byte goes in, each byte goes out", however, is a
bit wrong. What you are thinking of is a stream cipher like RC4. AES is
a block cipher: it takes a fixed-size (in this case 128-bit) input
block, XORs it with the encryption (sub-)key and shuffles it around
according to the exact algorithm.
|
> Another operation that lends itself to assembler optimisation is
> video decoding - the video is encoded only once, and then may be
> played back hundreds or millions of times by different people. The
> same operations must be repeated a number of times on each frame,
> and c. 25-60 frames are decoded per second, so at least 90,000
> frames per hour. Again, the smallest optimisation is worthwhile.
>
> Stroller.
>
>