My goal here is to compare linear regression and clustering on cases where one method is obviously the better fit, using 2-dimensional data that is easy to visualize. By comparing these two workhorse methods under these conditions, I hope to better understand each one and how to decide which to use.
The 3 data sets I used were:
- "obviously" better described by linear regression
- "obviously" better described by clustering
- in between the above 2 extremes
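To make the three cases concrete, here is a minimal sketch of what such data sets could look like in R. The sample size, means, slope, noise levels and variable names are all assumptions chosen for illustration; they are not the values used in the scripts in the repository.

```r
# Hypothetical stand-ins for the three kinds of data set described above.
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
n <- 200

# 1. "Obviously" linear: y follows x closely, with a little noise.
x <- runif(n, 0, 10)
linear_data <- data.frame(x = x, y = 2 * x + 1 + rnorm(n, sd = 0.5))

# 2. "Obviously" clustered: two well-separated blobs, no trend inside either.
cluster_data <- data.frame(
  x = c(rnorm(n / 2, mean = 2, sd = 0.5), rnorm(n / 2, mean = 8, sd = 0.5)),
  y = c(rnorm(n / 2, mean = 8, sd = 0.5), rnorm(n / 2, mean = 2, sd = 0.5))
)

# 3. In between: two overlapping blobs stretched along a common, noisy line.
x_mid <- c(rnorm(n / 2, mean = 3, sd = 1), rnorm(n / 2, mean = 7, sd = 1))
between_data <- data.frame(x = x_mid, y = 2 * x_mid + 1 + rnorm(n, sd = 2))

# Fit each "obvious" case with the method that should suit it.
line_fit <- lm(y ~ x, data = linear_data)
km_fit <- kmeans(cluster_data, centers = 2, nstart = 10)

# Fraction of variance explained by the line, and by the cluster centers.
summary(line_fit)$r.squared
km_fit$betweenss / km_fit$totss

# Quick look at all three data sets side by side.
op <- par(mfrow = c(1, 3))
plot(linear_data, main = "linear")
plot(cluster_data, main = "clustered")
plot(between_data, main = "in between")
par(op)
```

Both summary numbers run from 0 to 1, so they give a rough first impression of how well each method describes its "obvious" case, although they are not the same statistic and should not be compared too literally.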
I used R. The scripts are in this repository:
https://github.com/dllahr/cluster_vs_linear_regression
I got the code for clustering from:
http://www.statmethods.net/advstats/cluster.html
Before we get started: my friend Phil Montgomery, who kindly reviewed this post, made a good suggestion. In general, when you have 2 models and are trying to decide which one to use, you want to compare the statistical likelihood of each. Usually this is done to compare different parameter values for a single mathematical model, but it is worth investigating whether it has been done to compare these two methods.
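As a rough illustration of what a likelihood comparison might look like, here is a sketch under some strong assumptions. It is not an established way to compare regression with clustering, and it is not taken from the repository. To put both descriptions on the same footing, it treats each as a model of y: one predicts y from x with a straight line, the other predicts y from the k-means cluster label alone. The data, the choice of 2 clusters and all variable names are made up for the example.

```r
# Hypothetical "in between" data: two overlapping groups along a noisy line.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 200
x <- c(rnorm(n / 2, mean = 3, sd = 1), rnorm(n / 2, mean = 7, sd = 1))
y <- 2 * x + 1 + rnorm(n, sd = 2)
d <- data.frame(x = x, y = y)

# Model A: y described by a straight line in x.
fit_line <- lm(y ~ x, data = d)

# Model B: y described only by which cluster a point falls in.
# Assumption: 2 clusters, found by k-means on the raw (x, y) coordinates.
km <- kmeans(d[, c("x", "y")], centers = 2, nstart = 10)
d$cluster <- factor(km$cluster)
fit_cluster <- lm(y ~ cluster, data = d)

# Gaussian log-likelihood of y under each model (higher is better).
logLik(fit_line)
logLik(fit_cluster)

# AIC penalizes the number of parameters (lower is better). Here both models
# have the same number of parameters, so AIC ranks them the same way as the
# log-likelihood does.
AIC(fit_line, fit_cluster)
```

One caveat: the cluster labels were themselves derived from y, so the likelihood of the cluster model is optimistic. Treat the numbers as a starting point for the comparison, not a verdict.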