Given a random number generator (call it BIASED) that generates 1's and 0's with a bias, such that it generates 1's p fraction of the time and 0's (1-p) fraction of the time, write an algorithm that generates 1's and 0's with an even distribution. You do not know p (p is not an input to the algorithm).

Summary: I discuss how I attempted to solve the problem, statistical and performance results of an implementation of the solution, and a problem with the implementation.
Update: based on comments and friends' help, there is a new post summarizing classic / efficient solutions and a second attempt: http://dllahr.blogspot.com/2013/01/2nd-try-unbiased-random-number.html
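For reference, the best-known of those classic solutions is (I believe) the von Neumann trick: draw two bits from BIASED; if they differ, output the first; if they match, discard both and try again. Since P(0 then 1) = P(1 then 0) = p(1-p) for independent draws, conditioning on the pair differing yields each output with probability 1/2. The sketch below is my own illustration, assuming BIASED is available as a callable returning 0 or 1:

```cpp
#include <functional>

// von Neumann extractor: given a biased 0/1 source whose successive
// draws are independent, produce an unbiased bit.
// P(0 then 1) == P(1 then 0) == p*(1-p), so conditioning on the two
// draws differing makes each output value equally likely.
int unbiased_bit(const std::function<int()>& biased) {
    while (true) {
        int a = biased();
        int b = biased();
        if (a != b) return a;  // differing pair: 'a' is unbiased
        // matching pair: discard it and draw again
    }
}
```

Note this works without knowing p, at the cost of a variable number of calls to BIASED per output bit (on average 1/(p(1-p)) calls).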
I assumed that I couldn't just measure p during the start up / initialization of the algorithm. Here's how I attempted to solve this problem:
- create an array with 2 entries
- The first entry is the cumulative number of 0's returned by BIASED
- The second entry is the cumulative number of 1's returned by BIASED
- determine which array entry is smaller - the smaller one indicates the value that we must favor in order to balance the distribution - call this the "target value"
- if the first entry is larger, then BIASED currently appears to be biased towards 0's
- if the second entry is larger, then BIASED currently appears to be biased towards 1's
- during each call, store the results returned by BIASED in the array above
- NOTE: do not update the ratio
- i.e. if the target value was 1, the opposite value is 0
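The steps above can be sketched roughly as follows. This is only my reading of the description - the author's actual implementation is in the linked GitHub repo, and details such as exactly when the counts are updated are my guess; all names here are hypothetical:

```cpp
#include <functional>
#include <utility>

// One plausible reading of the counter-based approach described above
// (hypothetical sketch, not the author's actual code).
class Unbiaser {
public:
    explicit Unbiaser(std::function<int()> biased)
        : biased_(std::move(biased)) {
        counts_[0] = counts_[1] = 0;  // cumulative 0's and 1's from BIASED
    }

    int next() {
        // The value BIASED has produced less often so far is the one
        // to favor to re-balance the output: the "target value".
        int target = (counts_[0] <= counts_[1]) ? 0 : 1;
        // Repeatedly call BIASED until it produces the target value,
        // recording every result in the cumulative counts.
        while (true) {
            int r = biased_();
            ++counts_[r];
            if (r == target) return r;
        }
    }

private:
    std::function<int()> biased_;
    long counts_[2];
};
```

The inner `while` loop here corresponds to the "loop execution" measured in the performance results below.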
I implemented the above in C++ (compiled with Cygwin g++), available at GitHub here
Algorithm results and performance
The source code for the key class is listed below, but first I'd like to present some data from running this. The biased source had a ratio of 3:1 more 0's than 1's. A single test consisted of running the unbiased algorithm 10,000,000 times. The statistics of 200 of these tests are presented in the left columns. For the average, the percentage is the percent difference from the expected value of 5,000,000. For the standard deviation, the percentage is the ratio of the standard deviation to the average. For comparison, the right column shows the same calculation using the stdlib random number generator directly.
What is most striking to me is the difference between the standard deviations of the unbiased algorithm and the standard random number generator. This difference in the standard deviations can easily account for the difference in the averages. The standard deviation from the standard random number generator is in line with the value predicted by a binomial distribution; the unbiased algorithm's is not. I can only offer hand-waving rationalizations for this difference - basically, my hunch is that the array tracking the results of the calls to BIASED is acting like a badly tuned control / feedback loop, continually overshooting the "correct" ratio that would create unbiased output. This may be because of the digital nature of the problem.
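For reference, the binomial prediction mentioned above is easy to compute. A small helper (my own, for illustration):

```cpp
#include <cmath>

// Standard deviation of the number of 1's in n independent draws with
// P(1) = p, per the binomial distribution: sqrt(n * p * (1 - p)).
double binomial_sigma(double n, double p) {
    return std::sqrt(n * p * (1.0 - p));
}
// For n = 10,000,000 fair draws (p = 0.5) this gives roughly 1581,
// i.e. about 0.03% of the expected count of 5,000,000.
```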
A bigger question is how the algorithm behaves as the bias of BIASED is made worse. This is summarized in these charts and the table of data. Within each chart, the total number of loop executions is plotted vs. the ratio of the bias in BIASED. For the first chart the test consisted of 1,000 runs, for the second chart 10,000 runs, and for the third 100,000 runs. The "loop execution" being measured is the heart of the algorithm - the loop that repeatedly calls BIASED to attempt to achieve a balanced result.
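For scale, the per-bit cost of the classic two-draw rejection scheme (the von Neumann trick from the follow-up post's family of solutions) can be written down exactly, which gives a baseline for how any such method's work grows with the bias. The helper below is my own illustration:

```cpp
#include <cmath>

// Expected number of calls to BIASED per unbiased output bit for the
// classic two-draw rejection scheme: each pair of draws "succeeds"
// (the two bits differ) with probability 2*p*(1-p), and each attempt
// costs 2 calls, so the expectation is 2 / (2*p*(1-p)) = 1/(p*(1-p)).
double expected_calls_per_bit(double p) {
    return 1.0 / (p * (1.0 - p));
}
// e.g. p = 0.25 (the 3:1 source above) -> about 5.3 calls per bit;
// p = 1/64 (roughly a 63:1 source)     -> about 65 calls per bit.
```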
What really jumps out from the data is that the unbiased algorithm is actually failing to be truly unbiased. The trend is clear: as the bias ratio increases, the bias of the unbiased output increases as well - although not as much, and in the opposite direction! For example, when BIASED is putting out 63 times more 0's than 1's, unbiased is putting out a ratio of 63:37 1's to 0's.
Most of the above is captured in this class: