Monday, November 12, 2012

Summary of loading data and fitting regression in R

I mainly learned about R from this post about doing linear regression in R:
http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/

Of course the R-manual is great:
http://cran.r-project.org/doc/manuals/r-release/R-intro.html

However, R is a huge system, and when you are getting started the entire manual can be daunting. That is why I prefer to begin with a tutorial and then look up more functionality in the manual as I need it.

Here are some tips and tricks that I commonly use:


list all objects
ls()

remove an object
rm()

remove all objects
rm(list = ls())
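
As a quick illustration of the workspace commands above, here is a minimal sketch. The object names heights and greeting are made up for the example. Keep in mind that rm(list = ls()) removes every object in the workspace, so it is left as a comment here.

```r
heights  <- c(1.62, 1.75, 1.80)   # hypothetical objects, just for the demo
greeting <- "hello"

ls()            # the listing now includes "greeting" and "heights"

rm(heights)     # remove a single object by name
exists("heights", inherits = FALSE)   # FALSE, it is gone
exists("greeting")                    # TRUE, still there

# rm(list = ls())   # would remove everything that ls() reports
```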

load data from csv file
reference:  http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

data <- read.csv("myfile.csv")

  • myfile.csv has headers for each column that are used as variable names
  • if a column header starts with a number (e.g. "9") the corresponding variable name will start with an "X" (e.g. "X9")
  • the variable "data" is a data frame holding all of the file's data; each column header becomes a named column of that object
    • for example data$frac_change accesses the data in column "frac_change" within myfile.csv
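
The points above can be tried out without a real file. This is a minimal sketch that writes a tiny CSV to a temporary file first; the column names and values are made up, and csv_path is just a name chosen for this example. Columns of the resulting data frame are accessed with the $ operator.

```r
# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("frac_change,9,label",
             "0.01,1.5,a",
             "-0.02,2.5,b"),
           csv_path)

data <- read.csv(csv_path)

names(data)         # "frac_change" "X9" "label"  (the header "9" became "X9")
data$frac_change    # 0.01 -0.02
data[["X9"]]        # 1.5 2.5, the same column access written with [[ ]]
```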

make an object the "root" of all subsequent calls
rather than typing data$frac_change, data$X1, etc., you can attach the object "data" so that its columns can be referred to by name alone

attach(data)
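
Here is a small sketch of what attach() does, using a made-up data frame in place of the CSV data. It also shows with(), which gives the same short names for a single expression without changing the search path; that is my own suggestion rather than something from the tutorial.

```r
# Made-up data standing in for the contents of myfile.csv
data <- data.frame(frac_change = c(0.01, -0.02, 0.03),
                   X1          = c(1, 2, 3))

attach(data)
mean_attached <- mean(frac_change)   # the column is found via the search path
detach(data)                         # undo attach() when you are done

# with() evaluates one expression using the columns of data
mean_with <- with(data, mean(frac_change))
```

One thing to watch for: if you change data after attaching it, the attached copy is not updated, which is why detach() or with() can save some confusion.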

calculate linear regression
fit the data in frac_change to a linear combination of the data in X1, X2, X3 and a constant

glm.linear = glm(frac_change ~ X1 + X2 + X3)

the object glm.linear now contains the best fit linear regression model.
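
To see the fit recover known coefficients, here is a sketch on simulated data. The data frame df, the "true" coefficients, and the noise level are all assumptions of this example. It passes data = df to glm() instead of relying on attach(), and compares against lm(), which fits the same model when glm() uses its default gaussian family.

```r
set.seed(1)                        # reproducible made-up data
n  <- 200
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))

# Assumed true model for this demo: intercept 0.5, slopes 2, -1 and 0.3
df$frac_change <- 0.5 + 2 * df$X1 - 1 * df$X2 + 0.3 * df$X3 +
  rnorm(n, sd = 0.01)

# data = df tells glm() where to look up the variables in the formula
glm.linear <- glm(frac_change ~ X1 + X2 + X3, data = df)
coef(glm.linear)                   # estimates close to 0.5, 2, -1, 0.3

# With the default gaussian family this is ordinary least squares
lm.linear <- lm(frac_change ~ X1 + X2 + X3, data = df)
```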

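Also, the function bayesglm (from the arm package) fits the same kind of model, but it places prior distributions on the coefficients, so the estimates are pulled toward those priors rather than being determined by the data alone.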

see details of the linear regression model
glm.linear

or

summary(glm.linear)
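
Beyond reading the printed output, the numbers in the summary can be pulled out for further use. A minimal sketch on a tiny made-up data set (the names df, fit and coef_table are just for this example):

```r
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2))
fit <- glm(y ~ x, data = df)

fit                        # short printout: call, coefficients, deviance
s <- summary(fit)          # adds standard errors, t values and p-values

coef_table <- coef(s)      # the coefficient table as a numeric matrix
colnames(coef_table)       # "Estimate" "Std. Error" "t value" "Pr(>|t|)"
slope <- coef_table["x", "Estimate"]
```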

calculate model's predicted values
predictions <- fitted.values(glm.linear)
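
fitted.values() returns the model's predictions for the rows it was fitted on. For predictions on new inputs, predict() with a newdata argument can be used. A small sketch with made-up numbers (df, fit and new_pred are names chosen for this example):

```r
df  <- data.frame(x = 1:5, y = c(1.2, 1.9, 3.2, 3.8, 5.1))
fit <- glm(y ~ x, data = df)

predictions <- fitted.values(fit)   # one predicted value per row of df

# Predictions for x values that were not in the original data;
# the column name in newdata must match the name used in the formula
new_pred <- predict(fit, newdata = data.frame(x = c(6, 7)))
```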

histogram of data
hist(frac_change)

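hist(frac_change, 24)
(asks for about 24 bins; R treats a single number as a suggestion and picks "pretty" break points, so the actual count can differ)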
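
If the exact number of bins matters, one option is to pass explicit break points instead of a count. A sketch on simulated data (the values and the name edges are assumptions of this example); plot = FALSE is used so that only the bin counts are computed.

```r
set.seed(42)
frac_change <- rnorm(500, sd = 0.01)      # made-up data

# A single number is only a suggestion for the number of bins
h <- hist(frac_change, breaks = 24, plot = FALSE)
length(h$counts)                          # bins actually used, may not be 24

# 25 equally spaced edges give exactly 24 bins
edges   <- seq(min(frac_change), max(frac_change), length.out = 25)
h_exact <- hist(frac_change, breaks = edges, plot = FALSE)
length(h_exact$counts)                    # 24
```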

graph / plot data
plot(predictions, frac_change)

(a good quick graph to show if your predictions match the actual data)
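
Adding the line y = x makes this graph easier to read, since perfect predictions would fall exactly on it. A sketch with simulated values; the plot is written to a temporary PDF (out_file is a name made up here) so that it also runs non-interactively. Drop the pdf() and dev.off() lines to draw on screen.

```r
set.seed(7)
frac_change <- rnorm(100)                        # made-up "actual" values
predictions <- frac_change + rnorm(100, sd = 0.3) # made-up noisy predictions

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(predictions, frac_change)
abline(0, 1)              # reference line with intercept 0 and slope 1
invisible(dev.off())      # close the PDF device
```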

sorting data
sorted = sort(predictions, index.return = TRUE)

index.return = TRUE tells R that you want the indexes (into the original data) of the sorted values returned as well

sorted$x contains the sorted values
sorted$ix contains the indexes into the original data of the sorted values
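
A tiny worked example makes the two parts of the result concrete. The numbers are made up; order() is shown as well because it returns the same index vector directly.

```r
predictions <- c(0.3, -0.1, 0.8, 0.2)

sorted <- sort(predictions, index.return = TRUE)
sorted$x      # -0.1  0.2  0.3  0.8
sorted$ix     # 2 4 1 3

# Indexing the original vector by ix reproduces the sorted values
predictions[sorted$ix]

# order() gives the index vector without the sorted values
order(predictions)   # 2 4 1 3
```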

user-defined functions
I keep this very simple: I just paste the function definition into R, although I know there are much better ways to do it. A function I use frequently is below. It takes as inputs a vector of sort indexes (indexes of sorted values), a vector of values (not necessarily the values that were sorted), and a lower and upper fraction. It determines which of the sort indexes fall within that fractional range and returns the values from the value vector at those indexes. For example, if lowerFrac = 0.0 and upperFrac = 0.10, then the sort indexes corresponding to the bottom 10% are used to pull the values from retrieveVector.


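getSubset <- function(sortIndexes, retrieveVector, lowerFrac, upperFrac) {
  ## sortIndexes - indexes of sorted reference vector, will be used to pick range from retrieveVector
  ## retrieveVector - values to return; indexed by the selected sort indexes
  ## lowerFrac, upperFrac - fractions (0 to 1) of the sorted range to select

  ## convert the fractions to positions in sortIndexes (R indexes start at 1)
  lowerIndex = round(lowerFrac * length(sortIndexes))
  if (lowerIndex == 0) {
    lowerIndex = 1
  }

  upperIndex = round(upperFrac * length(sortIndexes))

  print("lowerIndex   upperIndex")
  print(c(lowerIndex, upperIndex))

  subsetSortIndexes = sortIndexes[lowerIndex:upperIndex]

  ## x = the retrieved values, ix = the original-data indexes they came from
  result = list(x = retrieveVector[subsetSortIndexes], ix = subsetSortIndexes)

  result
}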
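
Here is how I would call it, as a sketch with made-up numbers: sort the predictions, then pull the actual values that correspond to the lowest 20% of predictions. So that the snippet runs on its own it includes a compact copy of getSubset without the print() calls; the vectors predictions and actual are hypothetical.

```r
# Compact copy of getSubset() from above, minus the print() calls
getSubset <- function(sortIndexes, retrieveVector, lowerFrac, upperFrac) {
  lowerIndex <- max(1, round(lowerFrac * length(sortIndexes)))
  upperIndex <- round(upperFrac * length(sortIndexes))
  subsetSortIndexes <- sortIndexes[lowerIndex:upperIndex]
  list(x = retrieveVector[subsetSortIndexes], ix = subsetSortIndexes)
}

predictions <- c(0.5, -0.3, 0.1, 0.9, -0.7, 0.2, 0.4, -0.1, 0.8, 0.0)
actual      <- c(0.4, -0.2, 0.0, 1.0, -0.9, 0.3, 0.5, -0.2, 0.6, 0.1)

sorted <- sort(predictions, index.return = TRUE)

# Actual values for the 20% of rows with the lowest predictions
bottom <- getSubset(sorted$ix, actual, 0.0, 0.2)
bottom$ix     # 5 2
bottom$x      # -0.9 -0.2
```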

Vector manipulation
indexes are 1-based, so to get the first element of a:
a[1]

to get a range of elements - in this case the 2nd through the 5th
a[2:5]
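
A few more indexing forms that I find handy, shown on a made-up vector. The last three lines go slightly beyond the two cases above but follow the same bracket syntax.

```r
a <- c(10, 20, 30, 40, 50, 60)

a[1]           # 10, the first element
a[2:5]         # 20 30 40 50, the 2nd through the 5th
a[length(a)]   # 60, the last element
a[-1]          # 20 30 40 50 60, everything except the first element
a[a > 30]      # 40 50 60, elements selected by a condition
```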


create a vector that is a sequence of integers, e.g. 2 through 11
2:11
creates: 2 3 4 5 6 7 8 9 10 11


create a vector that is a sequence of numbers, e.g. from -0.015 to 0.015 increment by 0.00125:

 seq(-0.015, 0.015, 0.00125)

creates:
-0.01500 -0.01375 -0.01250 -0.01125 -0.01000 -0.00875 -0.00750 -0.00625 -0.00500 -0.00375 -0.00250 -0.00125 0.00000  0.00125  0.00250  0.00375  0.00500  0.00625  0.00750  0.00875  0.01000  0.01125  0.01250  0.01375  0.01500
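
When the number of points matters more than the step size, seq() also accepts length.out. A short sketch; the names s and s2 are just for this example. Because the values are floating point, the two sequences are compared with all.equal() rather than ==.

```r
s <- seq(-0.015, 0.015, 0.00125)            # fixed increment
length(s)                                   # 25

s2 <- seq(-0.015, 0.015, length.out = 25)   # fixed number of points

# Compare with a tolerance; exact equality can fail due to rounding
isTRUE(all.equal(s, s2))                    # TRUE
```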


Graphing
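plot() takes optional arguments that control the axes:
xlim = c(-2, 2) sets the x-axis limits to -2 and 2
ylim = c(-2, 2) does the same but for the y-axis

windows() creates a new window to plot in (Windows only; dev.new() does the same on any platform)
points() and lines() take data like plot() does, but add it to the existing plot instead of starting a new one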
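
Putting these together, here is a sketch that sets both axis limits and then layers a line and some extra points on the same plot. The data are made up, and the plot is written to a temporary PDF (out_file is a name chosen for this example) so that the snippet also runs non-interactively; remove the pdf() and dev.off() lines to draw on screen.

```r
x <- seq(-2, 2, 0.1)

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Start the plot; points outside the limits are simply not shown
plot(x, x^2 - 1, xlim = c(-2, 2), ylim = c(-2, 2))

lines(x, x)                                   # add a line to the same plot
points(c(-1, 0, 1), c(1, 0, -1), pch = 19)    # add three filled points

invisible(dev.off())                          # close the PDF device
```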
