1 |
Matti Nykyri <matti.nykyri@×××.fi> schrieb: |
2 |
|
3 |
> On Jun 29, 2014, at 0:28, Kai Krakow <hurikhan77@×××××.com> wrote: |
4 |
>> |
5 |
>> Matti Nykyri <matti.nykyri@×××.fi> schrieb: |
6 |
>> |
7 |
>>>> On Jun 27, 2014, at 0:00, Kai Krakow <hurikhan77@×××××.com> wrote: |
8 |
>>>> |
9 |
>>>> Matti Nykyri <Matti.Nykyri@×××.fi> schrieb: |
10 |
>>>> |
11 |
>>>>> If you are looking a mathematically perfect solution there is a simple |
12 |
>>>>> one even if your list is not in the power of 2! Take 6 bits at a time |
13 |
>>>>> of the random data. If the result is 62 or 63 you will discard the |
14 |
>>>>> data and get the next 6 bits. This selectively modifies the random |
15 |
>>>>> data but keeps the probabilities in correct balance. Now the |
16 |
>>>>> probability for index of 0-61 is 1/62 because the probability to get |
17 |
>>>>> 62-63 out of 64 if 0. |
18 |
>>>> |
19 |
>>>> Why not do just something like this? |
20 |
>>>> |
21 |
>>>> index = 0; |
22 |
>>>> while (true) { |
23 |
>>>> index = (index + get_6bit_random()) % 62; |
24 |
>>>> output << char_array[index]; |
25 |
>>>> } |
26 |
>>>> |
27 |
>>>> Done, no bits wasted. Should have perfect distribution also. We also |
28 |
>>>> don't have to throw away random data just to stay within unaligned |
29 |
>>>> boundaries. The unalignment is being taken over into the next loop so |
30 |
>>>> the "error" corrects itself over time (it becomes distributed over the |
31 |
>>>> whole set). |
32 |
>>> |
33 |
>>> Distribution will not be perfect. The same original problem persists. |
34 |
>>> Probability for index 0 to 1 will be 2/64 and for 2 to 61 it will be |
35 |
>>> 1/64. Now the addition changes this so that index 0 to 1 reflects to |
36 |
>>> previous character and not the original index. |
37 |
>>> |
38 |
>>> The distribution of like 10GB of data should be quite even but not on a |
39 |
>>> small scale. The next char will depend on previous char. It is 100% more |
40 |
>>> likely that the next char is the same or one index above the previous |
41 |
>>> char then any of the other ones in the series. So it is likely that you |
42 |
>>> will have long sets of same character. |
43 |
>> |
44 |
>> I cannot follow your reasoning here - but I'd like to learn. Actually, I |
45 |
>> ran this multiple times and never saw long sets of the same character, |
46 |
>> even no short sets of the same character. The 0 or 1 is always rolled |
47 |
>> over into the next random addition. I would only get sets of the same |
48 |
>> character if rand() returned zero multiple times after each other - which |
49 |
>> wouldn't be really random. ;-) |
50 |
> |
51 |
> In your example that isn't true. You will get the same character if 6bit |
52 |
> random number is 0 or if it is 62! This is what makes the flaw! |
53 |
> |
54 |
> You will also get the next character if random number is 1 or 63. |
55 |
> |
56 |
> That is why the possibility for 0 and 1 (after modulo 62) is twice as |
57 |
> large compared to all other values (2-61). |
58 |
|
59 |
Ah, now I get it. |
60 |
|
61 |
> By definition random means that the probability for every value should be |
62 |
> the same. So if you have 62 options and even distribution of probability |
63 |
> the probability for each of them is 1/62. |
64 |
|
65 |
Still, the increased probability for single elements should hit different |
66 |
elements each time. So for large sets it will distribute - however, I now |
67 |
get why it's not completely random by definition. |
68 |
|
69 |
>> In my tests I counted how ofter new_index > index and new_index < index, |
70 |
>> and it had a clear bias for the first. So I added swapping of the |
71 |
>> selected index with offset=0 in the set. Now the characters will be |
72 |
>> swapped and start to distribute that flaw. The distribution, however, |
73 |
>> didn't change. |
74 |
> |
75 |
> Try counting how of often new_index = index and new_index = (index + 1) % |
76 |
> 62 and new_index = (index + 2) % 62. With your algorithm the last one |
77 |
> should be significantly less then the first two in large sample. |
78 |
|
79 |
I will try that. It looks like a good approach. |
80 |
|
81 |
-- |
82 |
Replies to list only preferred. |