On 13.03.2012 20:38, Michael Mol wrote:
> On Tue, Mar 13, 2012 at 3:07 PM, Stroller
> <stroller@××××××××××××××××××.uk> wrote:
>>
>> On 13 March 2012, at 18:18, Michael Mol wrote:
>>> ...
>>>> So I assume the i586 version is better for you --- unless GCC
>>>> suddenly got a lot better at optimizing code.
>>>
>>> Since when, exactly? GCC isn't the best compiler at optimization,
>>> but I fully expect current versions to produce better code for
>>> x86-64 than hand-tuned i586. Wider registers, more registers,
>>> crypto acceleration instructions and SIMD instructions are all
>>> very nice to have. I don't know the specifics of AES, though, or
>>> what kind of crypto algorithm it is, so it's entirely possible
>>> that one can't effectively parallelize it except in some
>>> relatively unique circumstances.
>>
>> Do you have much experience of writing assembler?
>>
>> I don't, and I'm not an expert on this, but I've read the odd blog
>> article on this subject over the years.
>
> Similar level of experience here. I can read it, even debug it from
> time to time. A few regular bloggers on the subject are like candy.
> And I used to have pagetable.org, Ars's Technopaedia and specsheets
> for early x86 and Motorola processors memorized. For the past couple
> of years, I've been focusing on reading blogs of language and
> compiler authors, academics involved in proving, testing and
> improving them, etc.
>
>>
>> What I've read often has the programmer looking at the assembly
>> code gcc generates and examining what it does. The compiler might
>> not care how many registers it uses, and thus a variable might
>> find itself frequently swapped back into RAM; the programmer does
>> not have any control over the compiler, and IIRC some flags
>> reserve a register for debugging (IIRC -fomit-frame-pointer
>> disables this). I think it's possible to use registers more
>> efficiently by swapping them (??) or by using bitwise comparisons
>> and other tricks.
>
> Sure; it's cheaper to null out a register by XORing it with itself
> than setting it to 0.
>
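As an aside, compilers apply that particular trick on their own. A
minimal example, assuming x86-64 and gcc -O2:

  /* zero.c - return the constant 0 */
  int zero(void)
  {
      return 0;
  }

Compiling with "gcc -O2 -S zero.c" emits (AT&T syntax):

      xorl    %eax, %eax
      ret

i.e. gcc chooses the xor idiom over "movl $0, %eax" by itself: the
encoding is shorter, and modern CPUs treat xor-with-self as a
dependency-breaking zero idiom.
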
>>
>> Assembler optimisation is only used on sections of code that are at
>> the core of a loop - that are called hundreds or thousands (even
>> millions?) of times during the program's execution. It's not for
>> code, such as reading the .config file or initialisation, which is
>> only called once. Because the code in the core of the loop is
>> called so often, you don't have to achieve much of an optimisation
>> for the aggregate saving to be considerable.
>
> Sure; optimize the hell out of the code where you spend most of your
> time. I wasn't aware that gcc passed up on safe optimization
> opportunities, though.
>
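To put rough numbers on the "aggregate" argument: a saving that would
be invisible in initialisation code turns into a measurable speedup
once it is multiplied by the iteration count. A back-of-the-envelope
sketch in C (all numbers invented for illustration):

  #include <stdio.h>

  int main(void)
  {
      const double bytes  = 100e6; /* assumed 100 MB stream        */
      const double hz     = 1e9;   /* assumed 1 GHz clock          */
      const double before = 10.0;  /* cycles per byte, plain C     */
      const double after  = 8.0;   /* cycles per byte, hand-tuned  */

      /* runtime = iterations * cycles-per-iteration / clock rate  */
      printf("before: %.2f s\n", bytes * before / hz); /* 1.00 s */
      printf("after:  %.2f s\n", bytes * after  / hz); /* 0.80 s */
      return 0;
  }

Two cycles saved per byte are nothing in code that runs once, but
here they are a 20% end-to-end win.
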
>>
>> The operations in question may only constitute a few lines of C,
>> or a handful of machine operations, so it boils down to an
>> algorithm that a human programmer is capable of getting a grip on
>> and comprehending. Whilst compilers are clearly more efficient for
>> large programs, on this micro scale, humans are more clever and
>> creative than machines.
>
> I disagree. With defined semantics for the source and target, a
> computer's cleverness is limited only by the computational and
> memory expense of its search algorithms. Humans get through this by
> making a habit of various optimizations, but those habits become
> less useful as additional paths and instructions are added. As
> system complexity increases, humans operate on personally cached
> techniques derived from simpler systems. I would expect very, very
> few people to be intimately familiar with the majority of
> optimization possibilities present on an amdfam10 processor or a
> core2. Compilers aren't necessarily familiar with them, either;
> they're just quicker at discovering them, given knowledge of the
> individual instructions and the rules of language semantics.
>
>>
>> Encryption / decryption is an example of code that lends itself to
>> this kind of optimisation. In particular, AES was designed, I
>> believe, to be amenable to implementation in this way. The reason
>> was that it was desirable to have it run on embedded devices and
>> on dedicated chips. So it boils down to a simple bitswap operation
>> (??) - the plaintext is modified by the encryption key, input and
>> output as a fast stream. Each byte goes in, each byte goes out,
>> the same function performed on each one.
>
> I'd be willing to posit that you're right here, though if there
> isn't a per-byte feedback mechanism, SIMD instructions would come
> into serious play. But I expect there's a per-byte feedback
> mechanism, so parallelization would likely come in the form of
> processing simultaneous streams.
>
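Whether there is feedback is a property of the chaining mode, not of
AES itself: CBC feeds each ciphertext block into the next encryption,
while counter (CTR) mode keys every block off an independent counter
and parallelizes trivially. A rough C sketch of the difference -
aes_encrypt_block() is a hypothetical stand-in for a real single-block
AES primitive, and the counter handling is simplified (assumes 64-bit
size_t, ignores the standard byte order):

  #include <stdint.h>
  #include <string.h>

  /* hypothetical one-block primitive standing in for real AES */
  void aes_encrypt_block(const uint8_t in[16], uint8_t out[16],
                         const void *key);

  /* CBC: iteration i needs the ciphertext of iteration i-1, so the
   * loop is inherently serial. */
  void cbc_encrypt(uint8_t *buf, size_t nblocks, const void *key,
                   const uint8_t iv[16])
  {
      const uint8_t *prev = iv;
      for (size_t i = 0; i < nblocks; i++) {
          for (int j = 0; j < 16; j++)
              buf[16 * i + j] ^= prev[j];      /* feedback step */
          aes_encrypt_block(buf + 16 * i, buf + 16 * i, key);
          prev = buf + 16 * i;                 /* carried forward */
      }
  }

  /* CTR: block i depends only on the counter value i, so iterations
   * are independent - SIMD or multiple cores can work on them
   * simultaneously. */
  void ctr_encrypt(uint8_t *buf, size_t nblocks, const void *key,
                   const uint8_t nonce[8])
  {
      for (size_t i = 0; i < nblocks; i++) {   /* parallelizable */
          uint8_t ctr[16], ks[16];
          memcpy(ctr, nonce, 8);
          memcpy(ctr + 8, &i, 8);              /* simplified counter */
          aes_encrypt_block(ctr, ks, key);     /* keystream block */
          for (int j = 0; j < 16; j++)
              buf[16 * i + j] ^= ks[j];
      }
  }

(Recent x86 CPUs with AES-NI also provide dedicated instructions such
as AESENC, which speed up even a single serial stream.)
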
>>
>> Another operation that lends itself to assembler optimisation is
>> video decoding - the video is encoded only once, and then may be
>> played back hundreds or millions of times by different people. The
>> same operations must be repeated a number of times on each frame,
>> then circa 25-60 frames are decoded per second, so at least 90,000
>> frames per hour. Again, the smallest optimisation is worthwhile.
>
> Absolutely. My position, though, is that compilers are quicker and
> more capable of discovering optimization possibilities than humans
> are, when the target architecture changes. Sure, you've got several
> dozen video codecs in, say, ffmpeg, and perhaps it all boils down
> to fewer than a dozen very common cases of inner loop code. With
> hand-tuned optimization, you'd need to fork your assembly patch for
> each new processor feature that comes out, and then work to find the
> most efficient way to execute code on that processor.
>
> There are also cases where processor features get changed. I don't
> remember the name of the instruction (it had something to do with
> stack operations) in x86, but Intel switched it from a 0-cycle
> instruction to something more expensive. Any code which assumed that
> instruction was a 0-cycle instruction now became less efficient. A
> compiler (presuming it has knowledge of the target processor's
> instruction set properties) would have an easier time coping with
> that change than a human would.
>
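That is essentially what a compiler's cost model is for: instruction
selection consults per-CPU cost tables, so when a vendor re-prices an
instruction, updating one table entry retunes all future output. A
toy illustration (invented numbers, nothing like gcc's real data
structures):

  #include <stdio.h>

  /* two candidate instruction sequences for the same operation */
  enum seq { SEQ_FANCY, SEQ_PLAIN, SEQ_COUNT };

  /* cycles per sequence; the "fancy" one relies on an instruction
   * that the vendor later made more expensive */
  static const int cost_old_cpu[SEQ_COUNT] = { 0, 2 };
  static const int cost_new_cpu[SEQ_COUNT] = { 3, 2 };

  static enum seq pick(const int cost[SEQ_COUNT])
  {
      /* the "compiler" simply picks the cheapest alternative; when
       * a table entry changes, every later compilation adapts */
      return cost[SEQ_FANCY] <= cost[SEQ_PLAIN] ? SEQ_FANCY
                                                : SEQ_PLAIN;
  }

  int main(void)
  {
      printf("old CPU: %s\n",
             pick(cost_old_cpu) == SEQ_FANCY ? "fancy" : "plain");
      printf("new CPU: %s\n",
             pick(cost_new_cpu) == SEQ_FANCY ? "fancy" : "plain");
      return 0;
  }

A human maintaining hand-written assembly has to rediscover and
re-apply that knowledge in every affected routine.
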
> I'm not saying humans are useless; this is just one of those areas
> which is sufficiently complex-yet-deterministic that enough
> knowledge of the source and target environments would give a
> computer the edge over a human in finding the optimal sequence of
> CPU instructions.
>

This thread is becoming ridiculously long. Just as a last side-note:

One of the primary reasons the IA64 architecture failed was that it
relied on the compiler to optimize the code in order to exploit the
massive instruction-level parallelism the CPU offered. Compilers
never became good enough for the job. Of course, that happened in
the nineties and we have much better compilers now (and x86 is
easier for compilers to handle). But on the other hand: that was
Intel's next big thing, and if they couldn't make the compilers work
then, I have no reason to believe in their efficiency now.

Regards,
Florian Philipp