Notes on the R language

R is an environment for statistical computing. This is a cheatsheet with additional notes, mostly gathered during university courses and home exercises.

Typing help(fn) in the R console shows the documentation of the function fn, and example(fn) runs the examples from that documentation.


Vector objects

Basic vectors are created with the c() function. It accepts values or other vectors as arguments and concatenates them into a single flat vector.

Repeated sequences can be generated with the rep(values, counts) function. For example, rep(1:3, c(2, 2, 3)) returns 1, 1, 2, 2, 3, 3, 3.

We can generate closed numeric sequences with the seq(from, to, step) function or with the from:to shorthand, which uses a step of 1. With seq, the end value is included only if the step lands on it exactly.
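A short sketch of the three vector helpers described above. The variable names are only illustrative.

```r
# c() concatenates values and vectors into one flat vector
a <- c(1, 2, 3)
b <- c(a, 10, c(20, 30))      # 1 2 3 10 20 30

# rep() repeats each value the given number of times
r <- rep(1:3, c(2, 2, 3))     # 1 1 2 2 3 3 3

# seq() and the colon shorthand
s1 <- seq(0, 1, 0.25)         # 0 0.25 0.5 0.75 1
s2 <- 3:7                     # 3 4 5 6 7
s3 <- seq(1, 10, 4)           # 1 5 9 (10 is not reached by the step)
```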

Matrix objects

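Matrix objects can be generated with the matrix(values, rows, cols) function or by setting the shape of a vector with dim(x) <- c(rows, cols). In both cases the values are filled column by column; matrix() fills row by row when byrow=TRUE is given.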
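A minimal sketch comparing the two ways of building a matrix. Note that R fills the values column by column by default.

```r
# matrix(values, rows, cols) fills column by column
m <- matrix(1:6, 2, 3)
first_row <- m[1, ]           # 1 3 5

# Reshaping a vector in place gives the same matrix
v <- 1:6
dim(v) <- c(2, 3)
same <- identical(m, v)       # TRUE

# byrow = TRUE fills row by row instead
m_byrow <- matrix(1:6, 2, 3, byrow = TRUE)
first_row_byrow <- m_byrow[1, ]   # 1 2 3
```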

Empirical data

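The built-in functions in R for summary statistics (such as var, sd and cov) use the formulas for sampled data, with n - 1 in the denominator. When the data is given as a frequency table, we can fake the raw data with some c(rep(mark1, af1), rep(mark2, af2)) calls, where each mark is a value (or class mark) and af is its absolute frequency.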

Alternatively, we can apply a correction to the built-in functions to get the empirical (population) result. In common cases (such as variance or covariance) this amounts to replacing the n - 1 denominator with n.

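# Rescales a sample statistic with an n - 1 denominator (var, cov) to use n.
# Not valid for sd: take the square root of the rescaled variance instead.
empirical <- function(fn, x, ...) {
	fn(x, ...) * (length(x) - 1) / length(x)
}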
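As a worked example of both approaches, the sketch below fakes the raw data from a small, made-up frequency table and then rescales the built-in variance. The names marks, freqs and population_var are hypothetical, chosen only for this illustration.

```r
# Made-up frequency table: the value 1 occurs 2 times, 2 occurs 3 times, 3 occurs 5 times
marks <- c(1, 2, 3)
freqs <- c(2, 3, 5)

# Fake the raw data; rep() accepts a vector of counts directly
raw <- rep(marks, freqs)
n <- length(raw)              # 10

# var() divides by n - 1, so rescale it to divide by n
population_var <- var(raw) * (n - 1) / n      # 0.61

# The same value straight from the frequency table
m <- sum(marks * freqs) / n                   # 2.3
direct <- sum(freqs * (marks - m)^2) / n      # 0.61

# The standard deviation needs the square root of the rescaled variance
population_sd <- sqrt(population_var)
```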

I/O

We can redirect the console output to a file with sink(path) and return it to the console with sink() or sink(NULL). To write to both the file and the console, pass split=TRUE: sink(path, split=TRUE).
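A small sketch of redirecting output with sink. The temporary file path is an assumption made to keep the example self-contained; any writable path works.

```r
# Hypothetical output location for this example
path <- tempfile(fileext = ".txt")

sink(path)                            # output now goes to the file
cat("this line goes to the file\n")
print(summary(c(1, 2, 3, 4)))
sink()                                # output returns to the console

# Read the file back to check what was captured
captured <- readLines(path)
```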

Plots can be written to image files by first calling the png(path, width=w, height=h) or jpeg(path, width=w, height=h) functions. Afterwards all plots are written to the file until the device is closed with dev.off().
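A minimal sketch of writing a plot to a PNG file. It assumes the R build has PNG support; the temporary path and the pixel sizes are arbitrary choices for the example.

```r
# Hypothetical output location for this example
path <- tempfile(fileext = ".png")

png(path, width = 640, height = 480)  # open the image device
plot(c(1, 2, 3, 4), c(1, 4, 9, 16),
     main = "Squares", xlab = "x", ylab = "x squared")
dev.off()                             # close the device and finish the file

written <- file.exists(path)
```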

Charts, plots

R has powerful plotting functions to represent different types of data. These functions generally accept the main, xlab and ylab parameters for the title and the axis labels. Simple ones are pie(x), barplot(x), plot(x, y), plot.stepfun(x, pch=16), boxplot(x).

Histograms can be plotted with the hist(x) function. With freq=FALSE the function plots densities (relative frequency divided by bin width, so the total area is 1) instead of counts. Custom break points are passed in the breaks parameter, which is also the second positional argument: hist(x, breaks=c(0, k1, k2, n)) or hist(x, c(0, k1, k2, n)).
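To see what hist computes for custom breaks, the sketch below uses plot=FALSE, which returns the histogram object without drawing it. The data is made up for the example. Dropping plot=FALSE draws the chart.

```r
# Made-up data for the example
x <- c(1, 1, 2, 3, 3, 4, 6, 7, 8, 9)

# Bins are (0, 2], (2, 5] and (5, 10]; the lowest break is included as well
h <- hist(x, breaks = c(0, 2, 5, 10), plot = FALSE)

counts <- h$counts                    # 3 3 4
widths <- diff(h$breaks)              # 2 3 5
density <- h$density                  # 0.15 0.10 0.08

# Density times bin width gives the relative frequency, and these sum to 1
rel_freq <- density * widths          # 0.3 0.3 0.4
```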

Quantiles, percentiles

Quantiles of raw data can be calculated in several ways: fivenum(x) gives Tukey's five numbers (min, lower hinge, median, upper hinge, max, where the hinges are close to the 1st and 3rd quartiles), and quantile(x, probs=c(p1, p2)) gives specific percentiles. The quantile of grouped data can be calculated by faking the raw data (biased quantile) or by implementing its formula as a function.
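A short sketch on made-up data. It also shows that the hinges from fivenum can differ slightly from the quartiles that quantile returns with its default method.

```r
# Made-up data for the example
x <- c(1, 3, 4, 7, 9, 12, 15, 20)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
five <- fivenum(x)                            # 1 3.5 8 13.5 20

# Specific percentiles; the result is a named vector
q <- quantile(x, probs = c(0.25, 0.75))       # 3.75 and 12.75

# The median agrees between both functions
med <- quantile(x, probs = 0.5)               # 8
```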

Measures of correlation

The built-in cov(x, y) function is handy to calculate the covariance of sampled data. We can calculate different correlation coefficients with the cor(x, y, method="method") function, where the method can be one of "pearson" (the default), "spearman" or "kendall".
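A small sketch on made-up data where y grows monotonically but not linearly with x. The rank-based coefficients are then exactly 1, while Pearson's coefficient stays below 1.

```r
# Made-up data: y is a monotone, non-linear function of x
x <- 1:6
y <- x^3

# Sample covariance (n - 1 denominator), checked against the formula
cv <- cov(x, y)
cv_manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

pearson <- cor(x, y, method = "pearson")      # below 1, the relation is not linear
spearman <- cor(x, y, method = "spearman")    # 1, the ranks agree perfectly
kendall <- cor(x, y, method = "kendall")      # 1, every pair is concordant
```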

Linear Regression

R provides a very simple way to fit linear regression models with the lm function: lm(y ~ x) for a single predictor or lm(y ~ x1 + x2) for several. Afterwards, summary(model) reports values like the intercept, the slopes and R squared.
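A minimal sketch of fitting a line to made-up data and reading the values back programmatically instead of from the printed summary.

```r
# Made-up data, roughly following y = 2x
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

model <- lm(y ~ x)

# coef() returns a named vector with "(Intercept)" and "x"
intercept <- coef(model)[["(Intercept)"]]     # about 0.05
slope <- coef(model)[["x"]]                   # about 1.99

# R squared is stored in the summary object
r_squared <- summary(model)$r.squared         # about 0.997
```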

For a model with a single predictor, we can draw the fitted line by calling abline(model) after a plot(x, y) call.

Combinatorics

The gtools package provides simple helper functions for combinatorics: permutations(length(x), k, x) and combinations(length(x), k, x) list the k-element permutations and combinations of the vector x. The base function choose(n, k) provides a quick way to calculate binomial coefficients.
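A short sketch that counts combinations and permutations with base R and, if gtools happens to be installed, checks that the listings have the expected number of rows. The requireNamespace guard is an assumption added so the example also runs without the package.

```r
x <- c("a", "b", "c", "d")
k <- 2

# Number of k-element combinations and permutations of x
n_comb <- choose(length(x), k)                # 6
n_perm <- n_comb * factorial(k)               # 12

# gtools lists them as matrices with one row per result
if (requireNamespace("gtools", quietly = TRUE)) {
  combs <- gtools::combinations(length(x), k, x)
  perms <- gtools::permutations(length(x), k, x)
  stopifnot(nrow(combs) == n_comb, nrow(perms) == n_perm)
}
```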