# Tutorial 1: Introduction to R

## Base R and matrix manipulations

There are a number of ways in which we can work with data in R. Here, we will primarily rely on `data.table`, the other prominent dialect being the `tidyverse`. But to develop an intuition, let's start off with base R, specifically with matrices.

Consider a random sequence of observations:^{26}

`## [1] 1 0 4 3 2 6`
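The output above can be reproduced by combining the six values (read off the printed output) into a sequence; `a` is the name the text uses for it later on:

```r
a <- c(1, 0, 4, 3, 2, 6)  # combine the observations into one sequence
a
```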

We used the `c()` function to combine the set of observations. Without this function, that is, if we were to just list the observations separated by commas, we would have received an error message.

This sequence, unlike a *vector*, has no dimensions.^{27} But we can transform it into an \(n \times 1\) vector, which we will call `b`, using the `as.matrix()` function:

```
## [,1]
## [1,] 1
## [2,] 0
## [3,] 4
## [4,] 3
## [5,] 2
## [6,] 6
```

`## [1] 6 1`
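A sketch of the step above, re-creating `a` from the values shown: `as.matrix()` coerces the dimensionless sequence into a column matrix, and `dim()` reports its dimensions:

```r
a <- c(1, 0, 4, 3, 2, 6)
b <- as.matrix(a)  # coerce to a 6 x 1 column matrix
b
dim(b)             # number of rows, number of columns
```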

The result is a \(6 \times 1\) vector, or a column matrix. To obtain a \(1 \times 6\) vector, or a row matrix, we *transpose* the aforementioned vector using the `t()` function:

```
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 0 4 3 2 6
```

`## [1] 1 6`
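In code (re-creating `b` from the values shown above):

```r
b <- as.matrix(c(1, 0, 4, 3, 2, 6))
t(b)       # transpose: a 1 x 6 row matrix
dim(t(b))
```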

We can create any \(n \times k\) matrix using the `matrix()` function. For example, consider a \(3 \times 2\) matrix:

```
## [,1] [,2]
## [1,] 1 3
## [2,] 0 2
## [3,] 4 6
```
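The matrix above can be built from the sequence `a`; note that `matrix()` fills the elements column by column by default, which matches the printed output:

```r
a <- c(1, 0, 4, 3, 2, 6)
B <- matrix(a, nrow = 3, ncol = 2)  # filled column by column
B
```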

Note that we included the earlier generated sequence of six values, which we assigned to `a`, as the elements of this matrix.

We can add column names and row names to this matrix:

```
## c1 c2
## r1 1 3
## r2 0 2
## r3 4 6
```
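One way to attach the names shown above is via `rownames()` and `colnames()` (re-creating `B` from the values shown):

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3)
rownames(B) <- c("r1", "r2", "r3")  # label the rows
colnames(B) <- c("c1", "c2")        # label the columns
B
```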

If, at this point, we would like to only work with, say, the first column of the matrix, we can call it using its column number or the column name as follows:

```
## r1 r2 r3
## 1 0 4
```
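The extraction above can be done either by column number or by column name:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
B[, "c1"]  # equivalently, B[, 1]
```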

Note that we indicated the column name after the comma within the brackets placed immediately after the matrix. In general, the syntax is `M[r,c]`, where `M` is the name of the matrix (in our example, `B`), `r` is the row number(s) or row name(s), and `c` is the column number(s) or column name(s).

So, for example, if we want to extract a matrix element, say \(b_{3,2}\), we can do this by entering the respective indices in the brackets:

`## [1] 6`
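In code, using either numeric indices or the names attached earlier:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
B[3, 2]  # element b_{3,2}; B["r3", "c2"] gives the same result
```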

Matrix multiplication is done using the `%*%` operator, provided that the two matrices are conformable. For example, we obtain the product of the matrix \(B\) and a new \(2 \times 1\) vector, \(d\), as follows:

```
## [,1]
## r1 -1
## r2 -4
## r3 8
```
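The elements of \(d\) are not printed above; \(d = (5, -2)'\) is the vector consistent with the product shown (e.g., the second row gives \(0 \cdot 5 + 2 \cdot (-2) = -4\)):

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
d <- as.matrix(c(5, -2))  # assumed values, consistent with the output shown
B %*% d                   # (3 x 2) %*% (2 x 1) yields a 3 x 1 matrix
```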

We can add columns (and rows) to the existing matrix using the `cbind()` (and `rbind()`) functions:

```
## c1 c2 c3
## r1 1 3 0
## r2 0 2 1
## r3 4 6 0
```
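The new column's values, \((0, 1, 0)'\), are read off the output above; passing it as a named argument to `cbind()` also sets the column name:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
B <- cbind(B, c3 = c(0, 1, 0))  # append a third column named c3
B
```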

We can invert a(n invertible) matrix using the `solve()` function:

```
## r1 r2 r3
## c1 -1.0000000 0 0.5000000
## c2 0.6666667 0 -0.1666667
## c3 -1.3333333 1 0.3333333
```
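In code, re-creating the augmented \(3 \times 3\) matrix from the values shown above; a quick sanity check is that the inverse times the original recovers the identity matrix:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6, 0, 1, 0), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))
solve(B)                   # the inverse of B
round(solve(B) %*% B, 10)  # check: the identity matrix
```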

## Estimating parameters using OLS

By now, we have covered enough ground to obtain the least squares estimator. For that, we will generate vectors of dependent and independent variables, and then estimate the vector of parameters. Specifically, we will generate a sequence of 200 binary variables, which will serve as our independent variable `x`, and then we will construct the dependent variable `y` using the following formula: \(y = 2 + 0.5x + e\), where `e` is a sequence of 200 standard normal random variables.

```
set.seed(1)                                # fix the seed so results replicate
x <- sample(c(0, 1), 200, replace = TRUE)  # 200 binary draws
set.seed(2)
e <- rnorm(200)                            # 200 standard normal errors
y <- 2 + 0.5 * x + e                       # the data-generating process
X <- cbind(1, x)                           # design matrix with intercept column
b <- solve(t(X) %*% X) %*% t(X) %*% y      # OLS estimator: (X'X)^(-1) X'y
b
```

```
## [,1]
## 1.8699318
## x 0.7639315
```

Note that when generating the random samples of observations, we set seeds (in two instances). We did so to ensure that the results exactly replicate every time we re-run the model. Otherwise, owing to the random sampling, the parameter estimates that I observe will differ from those that you observe and, moreover, they will differ from each other every time you re-run the code.

The foregoing "do it by hand" exercise can be easily replicated using the `lm()` function:

```
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 1.8699 0.7639
```
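The call that produces the output above, re-using the data generated earlier (the fitted coefficients coincide with the manually computed OLS estimates, since `lm()` solves the same least squares problem):

```r
set.seed(1)
x <- sample(c(0, 1), 200, replace = TRUE)
set.seed(2)
y <- 2 + 0.5 * x + rnorm(200)
lm(y ~ x)  # regress y on x, with an intercept included by default
```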

We can apply the `summary()` function to the `lm()` output to see the complete summary of the regression results:

```
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44493 -0.77746 -0.06555 0.80647 2.22089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8699 0.1058 17.674 < 2e-16 ***
## x 0.7639 0.1511 5.054 9.8e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.069 on 198 degrees of freedom
## Multiple R-squared: 0.1143, Adjusted R-squared: 0.1098
## F-statistic: 25.55 on 1 and 198 DF, p-value: 9.8e-07
```
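The summary above can be produced as follows, again re-using the generated data:

```r
set.seed(1)
x <- sample(c(0, 1), 200, replace = TRUE)
set.seed(2)
y <- 2 + 0.5 * x + rnorm(200)
summary(lm(y ~ x))  # coefficients with standard errors, t values, R-squared
```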