# Tutorial 1: Introduction to R

## Base R and matrix manipulations

There are a number of ways in which we can work with data in R. Here, we will primarily rely on `data.table`, the other prominent dialect being the `tidyverse`. But to develop an intuition, let's start off with base R, specifically with matrices.

Consider a random sequence of observations:^{26}

`## [1] 1 0 4 3 2 6`
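The output above can be reproduced by combining the six values (read off the printed output) into a sequence; `a` is the name the text uses for it later on:

```r
a <- c(1, 0, 4, 3, 2, 6)  # combine the observations into one sequence
a
```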

We used the `c()` function to combine the set of observations. Without this function, that is, if we were to just list the observations separated by commas, we would have received an error message.

This sequence, unlike a *vector*, has no dimensions.^{27} But we can transform it into an \(n \times 1\) vector, which we will call `b`, using the `as.matrix()` function:

```
## [,1]
## [1,] 1
## [2,] 0
## [3,] 4
## [4,] 3
## [5,] 2
## [6,] 6
```

`## [1] 6 1`
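A sketch of the step above, re-creating `a` from the values shown: `as.matrix()` coerces the dimensionless sequence into a column matrix, and `dim()` reports its dimensions:

```r
a <- c(1, 0, 4, 3, 2, 6)
b <- as.matrix(a)  # coerce to a 6 x 1 column matrix
b
dim(b)             # number of rows, number of columns
```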

The result is a \(6 \times 1\) vector, or a column matrix. To obtain a \(1 \times 6\) vector, or a row matrix, we *transpose* the aforementioned vector using the `t()` function:

```
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 0 4 3 2 6
```

`## [1] 1 6`
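In code (re-creating `b` from the values shown above):

```r
b <- as.matrix(c(1, 0, 4, 3, 2, 6))
t(b)       # transpose: a 1 x 6 row matrix
dim(t(b))
```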

We can create any \(n \times k\) matrix using the `matrix()` function. For example, consider a \(3 \times 2\) matrix:

```
## [,1] [,2]
## [1,] 1 3
## [2,] 0 2
## [3,] 4 6
```
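The matrix above can be built from the sequence `a`; note that `matrix()` fills the elements column by column by default, which matches the printed output:

```r
a <- c(1, 0, 4, 3, 2, 6)
B <- matrix(a, nrow = 3, ncol = 2)  # filled column by column
B
```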

Note that we included the earlier generated sequence of six values, which we assigned to `a`, as the elements of this matrix.

We can add column names and row names to this matrix:

```
## c1 c2
## r1 1 3
## r2 0 2
## r3 4 6
```
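One way to attach the names shown above is via `rownames()` and `colnames()` (re-creating `B` from the values shown):

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3)
rownames(B) <- c("r1", "r2", "r3")  # label the rows
colnames(B) <- c("c1", "c2")        # label the columns
B
```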

If, at this point, we would like to only work with, say, the first column of the matrix, we can call it using its column number or the column name as follows:

```
## r1 r2 r3
## 1 0 4
```
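The extraction above can be done either by column number or by column name:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
B[, "c1"]  # equivalently, B[, 1]
```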

Note that we indicated the column name after the comma within the brackets placed immediately after the matrix. In general, the syntax is `M[r,c]`, where `M` is the name of the matrix (in our example, `B`), `r` is the row number(s) or row name(s), and `c` is the column number(s) or column name(s).

So, for example, if we want to extract a matrix element, say \(b_{3,2}\), we can do this by entering the respective indices in the brackets:

`## [1] 6`
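In code, using either numeric indices or the names attached earlier:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
B[3, 2]  # element b_{3,2}; B["r3", "c2"] gives the same result
```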

Matrix multiplication is done using the `%*%` operator, provided that the two matrices are conformable. For example, we obtain the product of the matrix \(B\) and a new \(2 \times 1\) vector, \(d\), as follows:

```
## [,1]
## r1 -1
## r2 -4
## r3 8
```
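The elements of \(d\) are not printed above; \(d = (5, -2)'\) is the vector consistent with the product shown (e.g., the second row gives \(0 \cdot 5 + 2 \cdot (-2) = -4\)):

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
d <- as.matrix(c(5, -2))  # assumed values, consistent with the output shown
B %*% d                   # (3 x 2) %*% (2 x 1) yields a 3 x 1 matrix
```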

We can add columns (and rows) to the existing matrix using the `cbind()` (and `rbind()`) functions:

```
## c1 c2 c3
## r1 1 3 0
## r2 0 2 1
## r3 4 6 0
```
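The new column's values, \((0, 1, 0)'\), are read off the output above; passing it as a named argument to `cbind()` also sets the column name:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
B <- cbind(B, c3 = c(0, 1, 0))  # append a third column named c3
B
```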

We can invert a(n invertible) matrix using the `solve()` function:

```
## r1 r2 r3
## c1 -1.0000000 0 0.5000000
## c2 0.6666667 0 -0.1666667
## c3 -1.3333333 1 0.3333333
```
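In code, re-creating the augmented \(3 \times 3\) matrix from the values shown above; a quick sanity check is that the inverse times the original recovers the identity matrix:

```r
B <- matrix(c(1, 0, 4, 3, 2, 6, 0, 1, 0), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))
solve(B)                   # the inverse of B
round(solve(B) %*% B, 10)  # check: the identity matrix
```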

## Estimating parameters using OLS

By now, we have covered enough ground to obtain the least squares estimator. For that, we will generate vectors of dependent and independent variables, and then estimate the vector of parameters. Specifically, we will generate a sequence of 200 binary variables, which will serve as our independent variable `x`, and then we will construct the dependent variable `y` using the following formula: \(y = 2 + 0.5x + e\), where `e` is a sequence of 200 standard normal random variables.

```
set.seed(1)                                # fix the seed so results replicate
x <- sample(c(0, 1), 200, replace = TRUE)  # 200 binary draws
set.seed(2)
e <- rnorm(200)                            # 200 standard normal errors
y <- 2 + 0.5 * x + e                       # the data-generating process
X <- cbind(1, x)                           # design matrix with intercept column
b <- solve(t(X) %*% X) %*% t(X) %*% y      # OLS estimator: (X'X)^(-1) X'y
b
```

```
## [,1]
## 1.8699318
## x 0.7639315
```

Note that when generating the random samples of observations, we set seeds (in two instances). We did so to ensure that the results exactly replicate every time we re-run the model. Otherwise, owing to the random sampling, the parameter estimates that I observe will differ from those that you observe and, moreover, they will differ from each other every time you re-run the code.

The foregoing "do it by hand" exercise can be easily replicated using the `lm()` function:

```
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 1.8699 0.7639
```
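The call that produces the output above, re-using the data generated earlier (the fitted coefficients coincide with the manually computed OLS estimates, since `lm()` solves the same least squares problem):

```r
set.seed(1)
x <- sample(c(0, 1), 200, replace = TRUE)
set.seed(2)
y <- 2 + 0.5 * x + rnorm(200)
lm(y ~ x)  # regress y on x, with an intercept included by default
```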

We can apply the `summary()` function to the `lm()` output to see the complete summary of the regression results:

```
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44493 -0.77746 -0.06555 0.80647 2.22089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8699 0.1058 17.674 < 2e-16 ***
## x 0.7639 0.1511 5.054 9.8e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.069 on 198 degrees of freedom
## Multiple R-squared: 0.1143, Adjusted R-squared: 0.1098
## F-statistic: 25.55 on 1 and 198 DF, p-value: 9.8e-07
```
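The summary above can be produced as follows, again re-using the generated data:

```r
set.seed(1)
x <- sample(c(0, 1), 200, replace = TRUE)
set.seed(2)
y <- 2 + 0.5 * x + rnorm(200)
summary(lm(y ~ x))  # coefficients with standard errors, t values, R-squared
```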