Saturday, December 8, 2012

Size of a polymer from a random walk

Summary:  From statistical mechanics, the size of a polymer is generally estimated using the statistics of a random walk.  Here I investigate the assumption, as it is generally taught in statistical mechanics, that the size of the polymer is proportional to the distance between the start and end points of a random walk.

Review of random walk in 1 dimension

Start at the origin of the x-axis (x = 0).  At each step, there is a 50% chance of moving 1 unit to the right, 50% chance of moving 1 unit to the left.
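A minimal sketch of such a walk in Python, using the standard random module (the function name is just for illustration):

import random

def random_walk(num_steps):
    """Return the list of positions visited by a 1-D random walk of num_steps steps."""
    pos = 0
    positions = [pos]
    for _ in range(num_steps):
        pos += random.choice((-1, 1))   # 50% chance of stepping left, 50% right
        positions.append(pos)
    return positions

print random_walk(10)   # e.g. [0, 1, 0, -1, 0, 1, 2, 1, 2, 3, 2]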

Here are some examples of random walks:

For N steps, the probability of having ended up at position x is given by the binomial distribution:

P(x) = C(N, (N + x)/2) / 2^N

(from the above page at Wolfram). The width of the above distribution (its full width at half max) is proportional to:
sqrt(# of steps)
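A minimal sketch of how to check this numerically: compute P(x) from the binomial coefficients and take the second moment of the endpoint; the standard deviation of the endpoint comes out to exactly sqrt(N).

import math

def comb(n, k):
    """Binomial coefficient C(n, k) (math.comb is not available in Python 2.7)."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

def endpoint_probability(N, x):
    """Probability that an N-step walk ends at position x (zero if the parity does not match)."""
    if abs(x) > N or (N + x) % 2 != 0:
        return 0.0
    k = (N + x) // 2                     # number of steps to the right needed to end at x
    return comb(N, k) / float(2 ** N)

N = 16
variance = sum(endpoint_probability(N, x) * x * x for x in range(-N, N + 1))
print math.sqrt(variance), math.sqrt(N)  # both are 4.0: the width of the distribution scales as sqrt(N)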


Statistical mechanics: size of polymer

Statistical mechanics describing the size of polymers starts with the basic assumption that you can model a polymer as a series of rigid links that can have any relative orientation.  It is simplified even further by assuming that the polymer exists in only 1 dimension.  Then, putting one end of the polymer at the origin, the positions of the monomers in the polymer are modeled using a random walk in 1 dimension.  One important conclusion from this analysis is that the average size of a polymer N units long will be proportional to the width of the binomial distribution (above):
sqrt(N)
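A quick Monte Carlo sketch illustrates this scaling (the function name and trial count are just for illustration): the average distance between the two ends of the chain grows roughly like sqrt(N); for a 1-dimensional walk the expected value approaches sqrt(2N/pi).

import random, math

def end_to_end_distance(num_steps):
    """Distance between the start and end points of a single 1-D random walk."""
    return abs(sum(random.choice((-1, 1)) for _ in range(num_steps)))

trials = 20000
for N in (4, 16, 64, 256):
    average = sum(end_to_end_distance(N) for _ in range(trials)) / float(trials)
    print N, average, math.sqrt(2.0 * N / math.pi)   # the two right-hand columns roughly agree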

Assumption / problem with the assumption

Modeling the polymer as a random walk means that one end of the polymer is at the origin, and the probability of finding the other end of the polymer at position x is given by the binomial distribution.  Then, the binomial distribution of the end point of the polymer is used to calculate the average size of the polymer.  This assumes that the distance between the start and end points of the polymer is representative of the size of the polymer.  But for a random walk there are many cases where the distance between the start and end point of the walk is less than the greatest extent of the walk.

Here are some examples of individual random walks, illustrating the full width of the walk or size of the polymer and comparing that to the distance between the start and end point of the walks:
For walk 1, note that the distance from the origin to the end point is only 1, compared to the actual width of the walk / size of the polymer, which is 5.

For walk 2, the distance between the origin and the end point (9) is closer to the width (11), but still not the same.
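A small sketch makes the same comparison for a few randomly generated 50-step walks; the end-to-end distance never exceeds the width, and is often much smaller:

import random

def width_and_end_distance(num_steps):
    """Return (width, |end - start|) for one random walk, where width = max position - min position."""
    pos, min_pos, max_pos = 0, 0, 0
    for _ in range(num_steps):
        pos += random.choice((-1, 1))
        min_pos = min(min_pos, pos)
        max_pos = max(max_pos, pos)
    return max_pos - min_pos, abs(pos)

for i in range(5):
    width, end_distance = width_and_end_distance(50)
    print "walk", i + 1, ": width =", width, ", |end - start| =", end_distance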

Testing the assumption

In order to test the above assumption, I calculated the actual widths of polymers modeled by random walks (as defined above).

Interruption:  I subsequently found a derivation showing analytically that the min and max are proportional to the square root of the number of steps - problem solved!  It is in this pdf.  The result is at the end of example 12.3, within the section "Expected Number of Equalizations".

That aside, here is how I initially solved the problem:  I wrote code to generate all of the possible random walks for some # of steps, and for each of these walks calculated the furthest negative position reached (min), the furthest positive position reached (max), and the distance between them, which I take to be the size of the polymer (width).  Initially this was done for walks up to 22 steps - for 22 steps there are 2^22 = 4,194,304 possible walks.  Below is plotted the size of the polymer versus the number of steps:
This looks like it could be that the
average size is proportional to sqrt(# steps)

If this is true, then if I plot the above data as average size vs. sqrt(# steps) it should be linear:


And it is!
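A brute-force sketch of the same check for small step counts (enumerating every walk with itertools rather than the recursive code below): dividing the average width by sqrt(N) gives a roughly constant ratio, which is just another way of saying the plot is linear through the origin.

import itertools, math

for N in (4, 8, 12, 16):
    total_width = 0
    for steps in itertools.product((-1, 1), repeat=N):   # every one of the 2^N possible walks
        pos, min_pos, max_pos = 0, 0, 0
        for step in steps:
            pos += step
            min_pos = min(min_pos, pos)
            max_pos = max(max_pos, pos)
        total_width += max_pos - min_pos
    average_width = total_width / float(2 ** N)
    print N, average_width, average_width / math.sqrt(N)   # the last column is roughly constant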

Notes on code design and more results

I initially ran the calculation by building a tree containing all the possible walks up to a certain # of steps (depth-first), but that was memory intensive.  I reduced the memory requirement by doing the calculation for all walks of a certain # of steps at a time (breadth-first, results shown above), but this required storing the endpoints of all the walks of the previous # of steps, which also ran out of memory.  I then switched back to a depth-first calculation, but did not store the tree in memory; now the constraint is the amount of time I'm willing to wait for the results.  I calculated the results out to 31 steps:

Still linear!  This calculation took just shy of 7 hours to run.

Engineering details and code

I used 3 methods to run this calculation:
  • depth-first, stored tree (didn't actually run calculation, just generated all random walks)
  • breadth-first, stored previous walk endpoints
  • depth-first, store statistics only
The last is the only one that can reach the higher numbers of steps.  I never implemented calculating the results from the stored tree of the first method because its memory requirement was immediately too prohibitive to be useful.  The code is shared below, but first, here is how I estimated the run time for the calculation out to 31 steps.

I used 4 runs which had these timing results:
# steps run time [s]
15 0.5
20 13
21 28.26
22 54


I plotted the above, fit to an exponential, and then used the fit equation to estimate about how many steps could be run in 6 hours.  The result was ~31, so I set the calculation to that number of steps and called it a night.  Here is a graph of the timing results (blue diamonds), the exponential fit (black line), estimate of the time for 31 steps (dashed green line), and the actual run time for 31 steps:

The calculation took almost 7 hours to run, which was over the estimated time by 24%.  It is interesting to note that on the logarithmic scale above, that 24% discrepancy is hardly noticeable - a good warning about judging fits on logarithmic scales.
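For reference, here is one way to do that extrapolation in code (a numpy sketch, not necessarily how the fit above was produced): fit log(run time) vs. # of steps to a line, then solve for the step count whose predicted time hits the 6-hour budget.

import numpy as np

steps = np.array([15.0, 20.0, 21.0, 22.0])
times = np.array([0.5, 13.0, 28.26, 54.0])       # run times in seconds, from the table above

# exponential fit: time = a * exp(b * steps), i.e. log(time) is linear in steps
b, log_a = np.polyfit(steps, np.log(times), 1)

budget = 6.0 * 3600.0                             # 6 hours, in seconds
print "predicted max # of steps:", (np.log(budget) - log_a) / b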

Code

Written in Python, run in Python 2.7.

Common class


# step_data.py - shared by all three methods below (imported as step_data)
class StepData:
    """Position and running min/max extent of a walk after a given number of steps."""

    def __init__(self, pos, prev_min_pos, prev_max_pos, depth):
        self.pos = pos

        # track the furthest positive and negative positions reached so far
        self.max_pos = pos if pos > prev_max_pos else prev_max_pos
        self.min_pos = pos if pos < prev_min_pos else prev_min_pos

        self.depth = depth

    def __repr__(self):
        return "{}, {}, {}, {}".format(self.pos, self.min_pos, self.max_pos, self.depth)

depth-first, no storage

Recursive algorithm; it keeps a running calculation of the width statistics for each number of steps as it performs a depth-first traversal of all possible walks.



import step_data as sd
import math

class Stats:
    """Running sum, sum of squares, and count - enough to compute the mean and standard deviation."""
    sum = None
    sum_sq = None
    count = None
    
    def __init__(self):
        self.sum = 0
        self.sum_sq = 0
        self.count = 0
        
    def add_value(self, value):
        self.sum += value
        self.sum_sq += value*value
        self.count += 1


def add_step(prev_step_data, stats_list):
    """Recursively take one step left and one step right from the previous position,
    accumulating the width (max - min) statistics for each number of steps."""
    depth = prev_step_data.depth + 1
    current_stats = stats_list[depth]
    
    next_pos_tuple = (prev_step_data.pos - 1, prev_step_data.pos + 1)
    
    for next_pos in next_pos_tuple:
        step_data = sd.StepData(next_pos, prev_step_data.min_pos, 
                                     prev_step_data.max_pos, depth)
#        print step_data
        
        width = step_data.max_pos - step_data.min_pos
        current_stats.add_value(width)
        
        if (depth < max_depth):
            add_step(step_data, stats_list)
    

max_depth = 31   # 2^31 walks; this run took just shy of 7 hours (see above)

stats_list = [Stats() for i in range(0, max_depth+1)]

origin = sd.StepData(0, 0, 0, 0)
stats_list[0].add_value(0)

add_step(origin, stats_list)

for depth in range(1, len(stats_list)):
    stat = stats_list[depth]
    
    ave = float(stat.sum) / float(stat.count)
    sdev = math.sqrt((float(stat.sum_sq) / float(stat.count)) - (ave*ave))
    
    print depth, stat.count, ave, sdev


breadth-first calculation, store only previous # of steps


import step_data as sd
import math

# walks of zero steps: just the origin
prev_step_data_list = [sd.StepData(0, 0, 0, 0)]

max_depth = 35   # in practice this ran out of memory (see notes above)
for depth in range(1, max_depth + 1):   # depth = number of steps taken
    step_data_list = list()

    # extend every walk of (depth - 1) steps by one step left and one step right
    for prev_step_data in prev_step_data_list:
        next_pos_tuple = (prev_step_data.pos - 1, prev_step_data.pos + 1)

        for next_pos in next_pos_tuple:
            next_step_data = sd.StepData(next_pos, prev_step_data.min_pos,
                                         prev_step_data.max_pos, depth)
            step_data_list.append(next_step_data)
#            print next_step_data

    # accumulate the width (max - min) statistics over all walks of this many steps
    sum_ = 0
    sum_sq = 0
    for step_data in step_data_list:
        width = step_data.max_pos - step_data.min_pos
        sum_ += width
        sum_sq += width*width

    N = len(step_data_list)
    ave = float(sum_) / float(N)
    sdev = math.sqrt((float(sum_sq) / float(N)) - (ave*ave))
    print depth, N, ave, sdev
#    print

    prev_step_data_list = step_data_list


depth-first calculation, store all random walks, then print the left-most branch of the tree


# tree_node.py (imported below as tree_node)
class TreeNode:
    """A node in the tree of all walks; each node has up to two children (step left, step right)."""

    def __init__(self, data):
        self.data = data
        self.children = list()



import tree_node as tn
import step_data as sd

max_depth = 30   # the full tree of all walks never fit in memory at a useful depth (see notes above)


def add_step(tree_node, prev_depth):
    """Recursively build the tree of all possible walks down to max_depth."""
    depth = prev_depth + 1

    next_pos_tuple = (tree_node.data.pos - 1, tree_node.data.pos + 1)
    
    for next_pos in next_pos_tuple:
        next_step_data = sd.StepData(next_pos, tree_node.data.min_pos, tree_node.data.max_pos, depth)
        next_tree_node = tn.TreeNode(next_step_data)

        tree_node.children.append(next_tree_node)

        if (depth < max_depth):
            add_step(next_tree_node, depth)


origin = sd.StepData(0, 0, 0, 0)
root = tn.TreeNode(origin)

add_step(root, 0)

# follow the left-most child at each level (the walk that steps left every time) and print it
node = root

while (node is not None):
    print node.data.pos, node.data.min_pos, node.data.max_pos, node.data.depth

    node = node.children[0] if len(node.children) > 0 else None



Edit:  Some further questions and answers

Where does the name "random walk" come from?
I do not know the origin of the phrase "random walk".  From a quick search, it appears to simply be how the problem was originally posed by Karl Pearson.

How do you calculate the random walk?
There are computer programs called "random number generators" that can provide you with what is essentially a random number every time you make a request.  You might remember that I built you some Excel spreadsheets once that generated math problems for your students - they used a random number generator within Excel.  I used the same thing to generate the examples of the random walks in this blog post.
Is the point of departure or starting point a given?
Generally the starting point is a given and is fixed at x = 0, but that does not have to be the case.  The other thing that is then specified (usually) is the number of steps taken in the walk.
Is the ending fixed, or is it stretched to infinity?
Generally the number of steps is fixed, but since the walk is random the end point is not fixed.  Sometimes the calculation is done for an infinite number of steps (to prove some other point).  I can also imagine a situation where the endpoint is fixed and the question becomes "which, or how many, of all the possible random walks will reach this endpoint?"
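For that last question, the count of walks that reach a given endpoint follows directly from the same binomial coefficient as above; a minimal sketch (the function name is just for illustration):

import math

def walks_reaching(N, x):
    """Number of N-step walks that end exactly at position x."""
    if abs(x) > N or (N + x) % 2 != 0:
        return 0
    k = (N + x) // 2                    # steps to the right needed to end at x
    return math.factorial(N) // (math.factorial(k) * math.factorial(N - k))

print walks_reaching(10, 2), "of the", 2 ** 10, "possible 10-step walks end at x = 2"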
