1 |
Matti Nykyri <matti.nykyri@×××.fi> schrieb: |
2 |
|
3 |
>> On Jun 27, 2014, at 0:00, Kai Krakow <hurikhan77@×××××.com> wrote: |
4 |
>> |
5 |
>> Matti Nykyri <Matti.Nykyri@×××.fi> schrieb: |
6 |
>> |
7 |
>>> If you are looking a mathematically perfect solution there is a simple |
8 |
>>> one even if your list is not in the power of 2! Take 6 bits at a time of |
9 |
>>> the random data. If the result is 62 or 63 you will discard the data and |
10 |
>>> get the next 6 bits. This selectively modifies the random data but keeps |
11 |
>>> the probabilities in correct balance. Now the probability for index of |
12 |
>>> 0-61 is 1/62 because the probability to get 62-63 out of 64 if 0. |
13 |
>> |
14 |
>> Why not do just something like this? |
15 |
>> |
16 |
>> index = 0; |
17 |
>> while (true) { |
18 |
>> index = (index + get_6bit_random()) % 62; |
19 |
>> output << char_array[index]; |
20 |
>> } |
21 |
>> |
22 |
>> Done, no bits wasted. Should have perfect distribution also. We also |
23 |
>> don't have to throw away random data just to stay within unaligned |
24 |
>> boundaries. The unalignment is being taken over into the next loop so the |
25 |
>> "error" corrects itself over time (it becomes distributed over the whole |
26 |
>> set). |
27 |
> |
28 |
> Distribution will not be perfect. The same original problem persists. |
29 |
> Probability for index 0 to 1 will be 2/64 and for 2 to 61 it will be 1/64. |
30 |
> Now the addition changes this so that index 0 to 1 reflects to previous |
31 |
> character and not the original index. |
32 |
> |
33 |
> The distribution of like 10GB of data should be quite even but not on a |
34 |
> small scale. The next char will depend on previous char. It is 100% more |
35 |
> likely that the next char is the same or one index above the previous char |
36 |
> then any of the other ones in the series. So it is likely that you will |
37 |
> have long sets of same character. |
38 |
|
39 |
I cannot follow your reasoning here - but I'd like to learn. Actually, I ran |
40 |
this multiple times and never saw long sets of the same character, even no |
41 |
short sets of the same character. The 0 or 1 is always rolled over into the |
42 |
next random addition. I would only get sets of the same character if rand() |
43 |
returned zero multiple times after each other - which wouldn't be really |
44 |
random. ;-) |
45 |
|
46 |
Keep in mind: The last index will be reused whenever you'd enter the |
47 |
function - it won't reset to zero. But still that primitive implementation |
48 |
had a flaw: It will tend to select characters beyond the current offset, if |
49 |
it is >= 1/2 into the complete set, otherwise it will prefer selecting |
50 |
characters before the offset. |
51 |
|
52 |
In my tests I counted how ofter new_index > index and new_index < index, and |
53 |
it had a clear bias for the first. So I added swapping of the selected index |
54 |
with offset=0 in the set. Now the characters will be swapped and start to |
55 |
distribute that flaw. The distribution, however, didn't change. |
56 |
|
57 |
Of course I'm no mathematician, I don't know how I'd calculate the |
58 |
probabilities for my implementation because it is sort of a recursive |
59 |
function (for get_rand()) when looking at it over time: |
60 |
|
61 |
int get_rand() { |
62 |
static int index = 0; |
63 |
return (index = (index + get_6bit_rand()) % 62); |
64 |
} |
65 |
|
66 |
char get_char() { |
67 |
int index = get_rand(); |
68 |
char tmp = chars[index]; |
69 |
chars[index] = chars[0]; |
70 |
return (chars[0] = tmp); |
71 |
} |
72 |
|
73 |
However, get_char() should return evenly distributes results. |
74 |
|
75 |
What this shows, is, that while distribution is even among the result set, |
76 |
the implementation may still be flawed because results could be predictable |
77 |
for a subset of results. Or in other words: Simply looking at the |
78 |
distribution of results is not an indicator for randomness. I could change |
79 |
get_rand() in the following way: |
80 |
|
81 |
int get_rand() { |
82 |
static int index = 0; |
83 |
return (index = (index + 1) % 62); |
84 |
} |
85 |
|
86 |
Results would be distributed even, but clearly it is not random. |
87 |
|
88 |
-- |
89 |
Replies to list only preferred. |