Gentoo Archives: gentoo-user

From: Florian Philipp <lists@×××××××××××.net>
To: gentoo-user@l.g.o
Subject: Re: [gentoo-user] hard drive encryption
Date: Tue, 13 Mar 2012 20:03:36
Message-Id: 4F5FA7C0.5020909@binarywings.net
In Reply to: Re: [gentoo-user] hard drive encryption by Stroller
1 Am 13.03.2012 20:07, schrieb Stroller:
2 >
3 > On 13 March 2012, at 18:18, Michael Mol wrote:
4 >> ...
5 >>> So I assume the i586 version is better for you --- unless GCC
6 >>> suddenly got a lot better at optimizing code.
7 >>
8 >> Since when, exactly? GCC isn't the best compiler at optimization,
9 >> but I fully expect current versions to produce better code for
10 >> x86-64 than hand-tuned i586. Wider registers, more registers,
11 >> crypto acceleration instructions and SIMD instructions are all very
12 >> nice to have. I don't know the specifics of AES, though, or what
13 >> kind of crypto algorithm it is, so it's entirely possible that one
14 >> can't effectively parallelize it except in some relatively unique
15 >> circumstances.
16 >
17 > Do you have much experience of writing assembler?
18 >
19 > I don't, and I'm not an expert on this, but I've read the odd blog
20 > article on this subject over the years.
21 >
22 > What I've read often has the programmer looking at the compiled gcc
23 > bytecode and examining what it does. The compiler might not care how
24 > many registers it uses, and thus a variable might find itself
25 > frequently swapped back into RAM; the programmer does not have any
26 > control over the compiler, and IIRC some flags reserve a register for
27 > degugging (IIRC -fomit-frame-pointer disables this). I think it's
28 > possible to use registers more efficiently by swapping them (??) or
29 > by using bitwise comparisons and other tricks.
30 >
31
32 You recall correctly about the frame pointer.
33
34 Concerning the register usage: I'm no expert in this field, either, but
35 I think the main issue is not simply register allocation but branch and
36 exception prediction and so on. The compiler can either optimize for a
37 seamless continuation if the jump happens or if it doesn't. A human or a
38 just-in-time compiler can better handle these cases by predicting the
39 outcome of -- in the case of a JIT -- analyze the outcome of the first
40 few iterations.
41
42 OT: IIRC, register reuse is also the main performance problem of
43 state-of-the-art javascript engines, at the moment. Concerning the code
44 they compile at runtime, they are nearly as good as `gcc -O0` but they
45 have the same problem concerning registers (GCC with -O0 produces code
46 that works exactly as you describe above: Storing the result after every
47 computation and loading it again).
48
49 > Assembler optimisation is only used on sections of code that are at
50 > the core of a loop - that are called hundreds or thousands (even
51 > millions?) of times during the program's execution. It's not for
52 > code, such as reading the .config file or initialisation, which is
53 > only called once. Because the code in the core of the loop is called
54 > so often, you don't have to achieve much of an optimisation for the
55 > aggregate to be much more considerable.
56 >
57 > The operations in question may only be constitute a few lines of C,
58 > or a handful of machine operations, so it boils down to an algorithm
59 > that a human programmer is capable of getting a grip on and
60 > comprehending. Whilst compilers are clearly more efficient for large
61 > programs, on this micro scale, humans are more clever and creative
62 > than machines.
63 >
64 > Encryption / decryption is an example of code that lends itself to
65 > this kind of optimisation. In particular AES was designed, I believe,
66 > to be amenable to implementation in this way. The reason for that was
67 > that it was desirable to have it run on embedded devices and on
68 > dedicated chips. So it boils down to a simple bitswap operation (??)
69 > - the plaintext is modified by the encryption key, input and output
70 > as a fast stream. Each byte goes in, each byte goes out, the same
71 > function performed on each one.
72 >
73
74 Well, sort of. First of, you are right, AES was designed with hardware
75 implementations in mind.
76
77 The algorithm boils down to a number of substitution and permutation
78 networks and XOR operations (I assume that's what you meant with byte
79 swap). If you look at the portable C code
80 (/usr/src/linux/crypto/aes_generic.c), you can see that it mostly
81 consists of lookup tables and XORs.
82
83 The thing about "each byte goes in, each byte goes out", however, is a
84 bit wrong. What you think of is a stream cipher like RC4. AES is a block
85 cipher. These use an (in this case 128 bit long) input string and XOR it
86 with the encryption (sub-)key and shuffle it around according to the
87 exact algorithm.
88
89 > Another operation that lends itself to assembler optimisation is
90 > video decoding - the video is encoded only once, and then may be
91 > played back hundreds or millions of times by different people. The
92 > same operations must be repeated a number of times on each frame,
93 > then c 25 - 60 frames are decoded per second, so at least 90,000
94 > frames per hour. Again, the smallest optimisation is worthwhile.
95 >
96 > Stroller.
97 >
98 >

Attachments

File name MIME type
signature.asc application/pgp-signature