Saturday, January 19, 2013

An unbiased random number generator: references / classic solutions and a 2nd try

In a previous post I discussed my first attempt to create "an efficient unbiased" random number generator.  Efficient may not have been the right word to use;  friends have subsequently pointed me towards the classic solution to this problem:
http://stackoverflow.com/questions/1986859/unbiased-random-number-generator-using-a-biased-one
The events (p)(1-p) and (1-p)(p) are equiprobable. Taking them as 0 and 1 respectively and discarding the other two pairs of results you get an unbiased random generator.
Basically, treat a 0 followed by a 1 as a 1; treat a 1 followed by a 0 as  0.  If a 0-0 or a 1-1 occur, re-roll.  This is efficient and easy to implement.  One problem however is that it has an unbounded upper limit on its run time.  Related, as the fraction p gets closer to 1 or 0 the number of re-rolls required increases.  Note that  the above StackOverflow answer links out to a solution which solves these problems!
http://www.eecs.harvard.edu/~michaelm/coinflipext.pdf

I modified the algorithm / code from my previous post to attempt to remove the bias that was still present.  The previous algorithm and the new one both share the benefit of having well-defined upper bound on the run time.  The new algorithm calls BIASED N times, and compares the fraction of 1's returned to the previously obtained distribution of 1's.  If the ratio of the current fraction to the previous fraction is greater than 1.0, then it returns a 1, otherwise a zero.  After each run, it updates the number of calls N to make sure that it is high enough such that the increments in fractions is of the result is close enough to the running fraction of measured of 1's and 0's.

The result is an algorithm that is still biased, and that is much less efficient than the Von Neuman algorithm!  The code is available at github anyway:
https://github.com/dllahr/sandbox/tree/master/intro_to_alg/5.1-3/second_attempt

Update:  A thought occurred to me that my algorithm is basically treating this like a control problem.  I'm using a feedback loop to take the observed bias of BIASED (the error) and compensate accordingly as needed.  The reason it does not work perfectly is that basic control theory says that response should be proportional to error, derivative of the error and the integral of the error:
R = a*E + b*(dE / dt) + c*(integral(E dt))

In both of the algorithms I've written I'm just using the integral of the error.

Why does the above algorithm require many more calls?  I think the short answer is provided from the StackOverflow entry linked to above (1st link), as it describes an enhanced Von Neuman solution to the problem (2nd link):
Further on, the paper develops this into generating multiple unbiased bits from the biased source, essentially using two different ways of generating bits from the bit-pairs, and giving a sketch that this is optimal in the sense that it produces exactly the number of bits that the original sequence had entropy in it.
I am not sure exactly how to apply the above to the algorithm I wrote; at best I would say that converting UNBIASED results to numbers / doubles is wasted information.


Wednesday, January 16, 2013

Attempting to create an Unbiased random number generator (with defined upper bound runtime) from a Biased random number generator

This is based on a problem in Chapter 5 of Introduction to Algorithms.  Basically, the problem asks:
Given a random number generator that generates 1's and 0's that is biased (BIASED) such that it generates 1's p fraction of time and 0's (1-p) fraction of the time, write an algorithm that generates 1's and 0's with an even distribution.  You do not know p (p is not an input to the algorithm).
Summary:  I discuss how I attempted to solve the problem, statistical & performance results of an implementation of the solution, and a problem with the implementation.

Update:  from comments and friends' help, a new post summarizing classic / efficient solutions and a second attempt:  http://dllahr.blogspot.com/2013/01/2nd-try-unbiased-random-number.html

I assumed that I couldn't just measure p as the start up / initialization of the algorithm.  Here's how I attempted to solve this problem:

  1. create an array with 2 entries
    • The first entry is the cumulative number of 0's returned by BIASED
    • The second entry is the cumulative number of 1's returned by BIASED
  2. when generating an unbiased number:
    1. determine which array entry is smaller - the smaller indicates the value that we must favor in order to balance the distribution - call this the "target value"
      • if the first entry is larger than BIASED currently appears to be biased towards 0's
      • if the second entry is larger than BIASED currently appears to be biased towards 1's
    2. take the ratio of the array entries such that the ratio > 1
    3. call BIASED repeatedly until either the target value is returned or the number of calls equals the ratio
      • during each call store the results in the array above
      • NOTE:  do not update the ratio
    4. if the target value was returned from BIASED, return this result
    5. if the instead BIASED was called ratio number of times without returning the target value, return the opposite of the target value
      • i.e. if target value was 1, the opposite value is 0
I implemented the above in c++ (Cygwin g++ to compile), available at github here

Saturday, December 22, 2012

Python code to simplify loading data into SciDB

Summary:  I've written some Python code that simplifies the loading of data from a csv file into SciDB. The programmer specifies for each column in the csv file whether it should be an attribute or a dimension in the SciDB array, and then the code loads it as a raw array, creates the the destination array based on the provided specifications and the data loaded into raw, and then transfers the data from the raw array to the destination array.

I've added the code to this GitHub repository under the directory ScidbLoader:

TODO
  1. when calculating dimension chunk sizes, need to scale by the number of attributes - currently assumes 1 attribute

MIT GPL

All code presented on this blog is Copyright (C) David L. Lahr in the year it was published and released under the MIT GPL:


Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in the
Software without restriction, including without limitation the rights to use, copy,
modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so, subject to the
following conditions:

The above copyright notice and this permission notice shall be included in all copies
or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE
FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.


Sunday, December 16, 2012

Python code to simplify reading SciDB data

I've written some code in python that uses the SciDB python connector to access data in a more straightforward manner.  In summary, you submit a query and get back an iterator over the data.  It is currently very incomplete, and needs:

  1. Currently it just returns the attributes.  It needs to also return the values of the dimensions.   Done!
  2. Ability to reset the iterator - it can currently only be used once
The code is publicly available from this github repository:
https://github.com/dllahr/scidb_python_utils

Saturday, December 8, 2012

Size of a polymer from a random walk

Summary:  From statistical mechanics, the size of a polymer is generally estimated using the statistics of a random walk.  Here I investigate the assumption that the size of the polymer is proportional to the distance between the start and end points of a random walk as it is generally taught in statistical mechanics.

Review of random walk in 1 dimension

Start at the origin of the x-axis (x = 0).  At each step, there is a 50% chance of moving 1 unit to the right, 50% chance of moving 1 unit to the left.

Here are some examples of random walks:

For N steps, the probability of having ended up at position x is given by the binomial distribution:


(from the above page at Wolfram). The full width at half max of the above distribution is:
sqrt(# of steps)

Tuesday, November 13, 2012

Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

 Summary:  I am multiplying 2 matrices in SciDB.  Previously I recorded the calculation ran in 5 hours, but now I am observing / estimating it to run in 33 hours.  This post is a description of my investigation and my attempt to speed up the calculation by reconfiguring SciDB to use more processors, and ultimately running the calculation piecemeal so I could monitor its progress.  
Update:  Using a system with 4 worker instances instead of 1 decreased the time by approximately a factor of 3.
Further Update:  adding the specifier 'dense' to the multiply command increased the speed further by a factor of 1.5

Background

I have 2 matrices in SciDB that I want to multiply:
particleStem_3 is 873637 x 42315
eigVect_3 is 42315 x 100

schema:
[("particleStem_3<count:double> [stem=0:873636,20,0,particle=0:42314,42315,0]")]
[("eigVect_3<value:double> [particle=0:42314,42315,0,eig=0:99,20,0]")]