Wednesday, August 27, 2014

Clustering vs. Linear Regression - which to use?

I've got some data and I'm trying to make an informed decision whether it is better described by a linear regression or a set of clusters.

My goal here is to compare linear regression and clustering for some cases that are obviously better for one of these or the other, using 2-dimensional data that is easy to visualize.  By comparing these two workhorse methods under these conditions I'm hoping to gain better understanding of each and of how to decide when to use one or the other.

The 3 data sets I used were:
1. "obviously" better described by linear regression
2. "obviously" better described by clustering
3. in between the above 2 extremes
I used R.  The scripts I used are present in this repository:
https://github.com/dllahr/cluster_vs_linear_regression

I got the code for clustering from:

Before we get started:  my friend Phil Montgomery who kindly reviewed this post made a good suggestion that in general, when you have 2 models and you are trying to decide which one to use, you want to compare the statistical likelihood of each.  Usually this is done by comparing different values of parameters for a mathematical model, but it is worth investigating if it has been done for comparison of these two systems.

Data that seems to be better described by linear regression

The above data is uniformly distributed within the rectangle, which is at 45 degrees with respect to the axes.  Since the data is 2-dimensional, a linear regression is just a fancy way of saying fitting it to a line.  Fitting the above to a linear regression yields:

y = 0.98 * x + 0.79

Plotting the regression as a line on top of the data looks like:
A good place to start with clustering seems to be k-means clustering, which in its simplest form takes as input the data and the number of clusters ("k"), and then groups the data into that number of clusters.  Which raises the question:  how many clusters should be used?  (What value of k should be used?) One way to try to answer that is called the "scree test".  In this test, for each number of clusters (k) the variance of the data within each cluster is calculated and summed, and this is repeated for increasing k.  The total variance as a function of k is plotted, and the number of clusters is chosen based on where the bend or elbow in the plot occurs.  For the above data, that plot looks like:
I have to admit to not being able to identify where exactly the bend or elbow is in the above plot.  My first guess would be at 3 or 4.  Zooming in to try to get a better idea of where it should be doesn't help:
The overall pattern / shape of the zoomed out plot appears to repeat itself at this magnification.  As a result, it now looks like the elbow could be at 5 or 6.  I'm guessing we could zoom in again and the apparent location of the elbow would change again...instead I'm just going to choose to use 2 clusters (k = 2) for the purposes of illustration.  Also, since this is a 2-dimensional example, we can see directly how "good" or "bad" this choice is.

Applying k-means clustering to attempt to create 2 clusters yields this:
These examples are interesting, but as a reminder, what we're looking for is an answer to the question:  what quantitative, non-subjective metrics of the fit of the linear regression vs. the results of the clustering can be used to decide that for this data set, a linear regression would probably be a more useful model?

Data that seems to be better described by clusters

The above data is uniformly distributed within each of the 2 squares, which are oriented at 45 degrees with respect to the axes.
Fitting this data to a linear regression yields:
The linear regression is there, but it mainly represents the relative positions of the 2 rectangles, and I would claim that the main feature of interest in this data set is that it is 2 groups, and that their relative positions are secondary (also, once you know that there are 2 groups it is relatively easy to get their relative positions - or you can examine their shapes!).

The screen test for the above looks like:
This plot is strikingly different from the screen test for the ideal linear regression.  At this scale, a very dramatic change occurs at k = 2, making this value a clear candidate to be used.  Of course this result shouldn't be much of a surprise since this a contrived example.  Zooming the plot in we see:
The pattern we saw in the zoomed out plot does not repeat itself here.  We could make the case of an elbow/bend at k = 6, but this plot is nowhere near as sharply "bent" as the zoomed out one, so starting with k = 2 seems reasonable.

Generating the clusters for k = 2 yields:
The clusters are a good representation of this data, as we expected.

Comparing methods using effective cumulative distribution function (ECDF)

For data in one dimension the effective cumulative distribution function is extremely useful.  It is a plot in which you sort your data and then plot the percentile of the data point (y-axis) vs. the value of the data point (x-axis).  It is a good way to look at the distribution of the data points without resorting to creating a histogram and having to choose an arbitrary bin size.

As useful as it is, for even 2-dimensions it becomes problematic.  However, it seems we can use it to evaluate the whether the linear regression or cluster model is appropriate.  The linear regression can be thought of as a mapping from 1 or more dimensions of data down to a single dimension.  In this case, it is 1 to 1 since we just have one independent variable.  We can apply the linear regression equation to our independent variable to generate the "fitted values", and then we can look at the effective cumulative distribution of these values.  Here is what we get when we do this for the data set that was ideal for the linear regression:

The above is mostly consistent with a uniform distribution - except for the tail ends, the percentile increases linearly as the value increases.  The linear regression model applies uniformly throughout the data set.

Now we can do the same thing for the data set which is ideal for clustering:

This plot is pretty clearly characteristic of a bi-modal distribution - the fitted values fall into 2 groups.  This result isn't surprising since that was the design of the data set, but it does illustrate that the ECDF of the fitted values can indicate the presence of clusters and / or that perhaps clustering should be used instead of a linear regression.

Mixed / intermediate case

Here is some data for a mixed / intermediate case in which there are fundamentally two groups but they start to run into each other.  The main points to note are:
1. The ECDF of the linear regression fitted values still has a bi-modal distribution, but rather than having a sharp gap, there are data points in the intermediate region
2. The scree test still has a sharp elbow, but after the elbow does have some curvature - in contrast with the ideal clustering case data in which it was largely flat after the elbow
3. increasing the overlap between the distributions pushes the above two effects further - so for example, there would be more points in the gap of the ECDF, and the scree test elbow would be less sharp / more curved after the eblow