This note is largely based on Applied Statistics with R: https://daviddalpiaz.github.io/appliedstats/
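R has a number of basic data types.
- Numeric, for example `1`, `1.0`, `42.5`
- Logical, with two possible values: `TRUE` and `FALSE`. You can also use `T` and `F`, but this is not recommended. `NA` is also considered logical.
- Character, for example `"a"`, `"Statistics"`, `"1 plus 2."`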
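As a small illustration of the types listed above, `class()` reports the type of a value. The reassignment of `T` is a hypothetical example of one common reason the abbreviations are discouraged: unlike `TRUE` and `FALSE`, they are ordinary variables.

```r
# class() reports the basic type of each value.
class(42.5)          # "numeric"
class(TRUE)          # "logical"
class(NA)            # "logical"
class("Statistics")  # "character"

# T and F are ordinary variables that can be reassigned, which is one
# reason spelling out TRUE and FALSE is safer.
T = 0
isTRUE(T)            # FALSE, because T now holds 0
rm(T)                # remove the override so T refers to TRUE again
```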
R also has a number of basic data structures.

Dimension | Homogeneous | Heterogeneous |
---|---|---|
1 | Vector | List |
2 | Matrix | Data Frame |
3+ | Array | |
Basics of vectors
Many operations in R make heavy use of vectors.
Vectors in R are indexed starting at 1.
A common way to create a vector in R is using the `c()` function, which is short for “combine.”
## [1] 1 3 5 7 8 9
To store this vector in a variable, we can use the assignment operator `=`.
The variable `x` now holds the vector we just created, and we can access the vector by typing `x`.
## [1] 1 3 5 7 8 9
## [1] 1 3 5 7 8 9
Both `=` and `<-` work as assignment operators.
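The printed vectors above are consistent with the following sketch, which creates the same vector with each operator. The name `x2` is introduced here only for comparison.

```r
x = c(1, 3, 5, 7, 8, 9)    # assignment with =
x2 <- c(1, 3, 5, 7, 8, 9)  # assignment with <-
x                          # typing the name prints the vector
identical(x, x2)           # TRUE: both operators stored the same vector
```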
In R code, a line starting with `#` is a comment, which is ignored when you run the code.
The `:` operator creates a sequence of integers between two specified integers.
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
Here R both stores the vector in a variable called `y` and prints `y` to the console.
Use the `seq()` function for a more general sequence.
## [1] 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3
## [20] 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2
The argument labels `from`, `to`, and `by` are optional.
## [1] 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3
## [20] 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2
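The calls behind the sequence outputs above are not shown. A sketch consistent with them follows; the names `s_labeled` and `s_positional` are introduced here for illustration.

```r
# Wrapping an assignment in parentheses stores the value and prints it.
(y = 1:100)

# seq() with labeled arguments, then the same call with positional arguments.
s_labeled = seq(from = 1.5, to = 4.2, by = 0.1)
s_positional = seq(1.5, 4.2, 0.1)
identical(s_labeled, s_positional)  # TRUE
```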
The tools `c()`, `:`, `seq()`, and `rep()` can be combined:
## [1] 1 3 5 7 8 9 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 1 2 3 42
## [26] 2 3 4
The length of a vector can be obtained with the `length()` function.
## [1] 6
## [1] 100
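One expression that reproduces the 28-element combined vector and the two lengths printed above is sketched below. The name `combined` is an assumption made for this example.

```r
x = c(1, 3, 5, 7, 8, 9)
y = 1:100

# rep() repeats its first argument; here the odd numbers 1 to 9, three times.
combined = c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
combined

length(x)  # 6
length(y)  # 100
```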
Use square brackets, `[]`, to obtain a subset of a vector. For example, `x[1]` returns the first element.
## [1] 1 3 5 7 8 9
## [1] 1
## [1] 5
## [1] 1 5 7 8 9
## [1] 1 3 5
## [1] 1 5 7
## [1] TRUE TRUE FALSE TRUE TRUE FALSE
## [1] 1 3 7 8
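The subsetting outputs above match the following calls, given as a sketch since the original code is not shown. The logical vector `z` is inferred from the printed `TRUE`/`FALSE` line.

```r
x = c(1, 3, 5, 7, 8, 9)
x[1]           # first element
x[3]           # third element
x[-2]          # everything except the second element
x[1:3]         # first three elements
x[c(1, 3, 4)]  # elements 1, 3, and 4

# A logical vector keeps the positions that are TRUE.
z = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
x[z]
```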
One of the biggest strengths of R is its use of vectorized operations.
Code that ignores vectorization is a common reason R is perceived as slow.
When a function like `log()` is called on a vector `x`, a vector is returned which has applied the function to each element of the vector `x`.
## [1] 2 3 4 5 6 7 8 9 10 11
## [1] 2 4 6 8 10 12 14 16 18 20
## [1] 2 4 8 16 32 64 128 256 512 1024
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
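The five outputs above are what the following vectorized calls produce on `x = 1:10`. This is a sketch of the likely calls, not necessarily the original code.

```r
x = 1:10
x + 1    # adds 1 to every element
2 * x    # doubles every element
2 ^ x    # 2 raised to each element
sqrt(x)  # square root of each element
log(x)   # natural logarithm of each element
```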
Operator | Summary | Example | Result |
---|---|---|---|
`x < y` | x less than y | `3 < 42` | `TRUE` |
`x > y` | x greater than y | `3 > 42` | `FALSE` |
`x <= y` | x less than or equal to y | `3 <= 42` | `TRUE` |
`x >= y` | x greater than or equal to y | `3 >= 42` | `FALSE` |
`x == y` | x equal to y | `3 == 42` | `FALSE` |
`x != y` | x not equal to y | `3 != 42` | `TRUE` |
`!x` | not x | `!(3 > 42)` | `TRUE` |
`x \| y` | x or y | `(3 > 42) \| TRUE` | `TRUE` |
`x & y` | x and y | `(3 < 4) & (42 > 13)` | `TRUE` |
## [1] FALSE FALSE TRUE TRUE TRUE TRUE
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
## [1] TRUE FALSE TRUE TRUE TRUE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
## [1] 5 7 8 9
## [1] 1 5 7 8 9
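Logical operators are vectorized too. The outputs above are consistent with these comparisons on `x`; treat this as a reconstruction rather than the original code.

```r
x = c(1, 3, 5, 7, 8, 9)
x > 3
x < 3
x == 3
x != 3
x == 3 & x != 3  # never both, so all FALSE
x == 3 | x != 3  # always one or the other, so all TRUE

# A logical comparison can be used directly as a subset.
x[x > 3]
x[x != 3]
```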
R can also be used for matrix calculations.
Matrices can be created using the `matrix` function.
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Note that we are using two different variables: lowercase `x`, which stores a vector, and uppercase `X`, which stores a matrix.
By default, the `matrix` function reorders a vector into columns, but we can also tell R to use rows instead.
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
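The two matrices printed above differ only in fill order. A minimal sketch follows, where `X_byrow` is a name chosen for this example:

```r
x = 1:9
X = matrix(x, nrow = 3, ncol = 3)                      # filled column by column
X_byrow = matrix(x, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row
X
X_byrow
```

Filling by row gives the transpose of the column-filled matrix.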
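We can also create a matrix of a specified dimension where every element is the same, in this case `0`.
## [,1] [,2] [,3] [,4]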
## [1,] 0 0 0 0
## [2,] 0 0 0 0
Like vectors, matrices can be subset using square brackets, `[]`.
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [1] 4
## [1] 1 4 7
## [1] 4 5 6
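A sketch of calls that would give the constant matrix and the three subsets shown above. The name `Z` for the zero matrix is an assumption.

```r
Z = matrix(0, 2, 4)  # 2 rows, 4 columns, every element 0
Z

X = matrix(1:9, nrow = 3, ncol = 3)
X[1, 2]  # element in row 1, column 2
X[1, ]   # entire first row
X[, 2]   # entire second column
```

Leaving the row or column position empty selects every row or column.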
Matrices can also be created by combining vectors as columns, using `cbind`, or combining vectors as rows, using `rbind`.
## [1] 9 8 7 6 5 4 3 2 1
## [1] 1 1 1 1 1 1 1 1 1
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## x 1 2 3 4 5 6 7 8 9
## 9 8 7 6 5 4 3 2 1
## 1 1 1 1 1 1 1 1 1
When using `rbind` and `cbind` you can specify “argument” names that will be used as row names (for `rbind`) or column names (for `cbind`).
## col_1 col_2 col_3
## [1,] 1 9 1
## [2,] 2 8 1
## [3,] 3 7 1
## [4,] 4 6 1
## [5,] 5 5 1
## [6,] 6 4 1
## [7,] 7 3 1
## [8,] 8 2 1
## [9,] 9 1 1
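The row-bound and column-bound matrices above can be reproduced as follows. The name `by_col` is introduced here; the column labels are taken from the printed output.

```r
x = 1:9
rev(x)     # x in reverse order
rep(1, 9)  # the value 1 repeated nine times

# Each vector becomes a row; only the named variable x yields a row name.
rbind(x, rev(x), rep(1, 9))

# Each vector becomes a column, labeled by its argument name.
by_col = cbind(col_1 = x, col_2 = rev(x), col_3 = rep(1, 9))
by_col
```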
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [,1] [,2] [,3]
## [1,] 9 6 3
## [2,] 8 5 2
## [3,] 7 4 1
## [,1] [,2] [,3]
## [1,] 10 10 10
## [2,] 10 10 10
## [3,] 10 10 10
## [,1] [,2] [,3]
## [1,] -8 -2 4
## [2,] -6 0 6
## [3,] -4 2 8
## [,1] [,2] [,3]
## [1,] 9 24 21
## [2,] 16 25 16
## [3,] 21 24 9
## [,1] [,2] [,3]
## [1,] 0.1111111 0.6666667 2.333333
## [2,] 0.2500000 1.0000000 4.000000
## [3,] 0.4285714 1.5000000 9.000000
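Note that `X * Y` is not matrix multiplication; it is element-by-element multiplication (the same holds for `X / Y`).
Matrix multiplication uses `%*%`.
## [,1] [,2] [,3]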
## [1,] 90 54 18
## [2,] 114 69 24
## [3,] 138 84 30
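The arithmetic outputs above are consistent with `X` built from `1:9` and `Y` built from `9:1`. This is a sketch inferred from the printed matrices. For instance, the top-left entry of the matrix product is 1·9 + 4·8 + 7·7 = 90, while the element-wise product gives 1·9 = 9.

```r
X = matrix(1:9, 3, 3)
Y = matrix(9:1, 3, 3)

X + Y    # element-wise sum
X - Y    # element-wise difference
X * Y    # element-wise product, not the matrix product
X / Y    # element-wise quotient
X %*% Y  # matrix product: rows of X times columns of Y
```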
`t()` gives the transpose of a matrix.
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
`solve()` returns the inverse of a square matrix if it is invertible.
## [,1] [,2] [,3]
## [1,] 9 2 -3
## [2,] 2 4 -2
## [3,] -3 -2 16
## [,1] [,2] [,3]
## [1,] 0.12931034 -0.05603448 0.01724138
## [2,] -0.05603448 0.29094828 0.02586207
## [3,] 0.01724138 0.02586207 0.06896552
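To verify that `solve(Z)` returns the inverse, we multiply it by `Z`. We expect the identity matrix; the result differs from it only by floating-point rounding error.
## [,1] [,2] [,3]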
## [1,] 1.000000e+00 -6.245005e-17 0.000000e+00
## [2,] 8.326673e-17 1.000000e+00 5.551115e-17
## [3,] 2.775558e-17 0.000000e+00 1.000000e+00
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## [1] TRUE
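A sketch of the transpose and inverse checks above, with `Z` entered row by row from the printed matrix. `Z_inv` is a name chosen for this example. Because the product is exact only up to rounding, `all.equal()` is used instead of `==`.

```r
X = matrix(1:9, 3, 3)
t(X)  # transpose: rows become columns

Z = matrix(c(9, 2, -3, 2, 4, -2, -3, -2, 16), 3, byrow = TRUE)
Z_inv = solve(Z)
Z_inv %*% Z  # close to, but not exactly, the identity matrix

# all.equal() compares with a small numerical tolerance.
all.equal(Z_inv %*% Z, diag(3))
```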
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [1] 2 3
## [1] 9 12
## [1] 3 7 11
## [1] 3 4
## [1] 1.5 3.5 5.5
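The 2 by 3 matrix and the five summaries printed above match these calls, reconstructed here from the output:

```r
X = matrix(1:6, 2, 3)
X
dim(X)       # number of rows, then number of columns
rowSums(X)   # one sum per row
colSums(X)   # one sum per column
rowMeans(X)  # one mean per row
colMeans(X)  # one mean per column
```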
The `diag()` function can be used in a number of ways. We can extract the diagonal of a matrix.
## [1] 9 4 16
We can create a matrix with specified elements on the diagonal (and `0` on the off-diagonals).
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 2 0 0 0
## [3,] 0 0 3 0 0
## [4,] 0 0 0 4 0
## [5,] 0 0 0 0 5
Or we can create a square matrix of a given dimension with `1` for every element of the diagonal and `0` for the off-diagonals.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
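The three uses of `diag()` described above can be sketched as follows. `Z` is rebuilt here from the matrix printed earlier so the block runs on its own.

```r
Z = matrix(c(9, 2, -3, 2, 4, -2, -3, -2, 16), 3, byrow = TRUE)
diag(Z)    # extract the diagonal of an existing matrix
diag(1:5)  # 5 x 5 matrix with 1 to 5 on the diagonal
diag(5)    # 5 x 5 identity matrix
```

What `diag()` does depends on its argument: a matrix, a vector, or a single number.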
## [[1]]
## [1] 42
##
## [[2]]
## [1] "Hello"
##
## [[3]]
## [1] TRUE
ex_list = list(
  a = c(1, 2, 3, 4),
  b = TRUE,
  c = "Hello!",
  d = function(arg = 42) {print("Hello World!")},
  e = diag(5)
)
Lists can be subset using two syntaxes: the `$` operator and square brackets, `[]`.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
## $a
## [1] 1 2 3 4
##
## $b
## [1] TRUE
## $e
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
##
## $a
## [1] 1 2 3 4
## $e
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
## function(arg = 42) {print("Hello World!")}
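The list outputs above are consistent with the subsetting calls below. This is a sketch; the original calls are not shown, and `ex_list[["e"]]` is added for contrast. Single brackets return a list, while `$` and double brackets return the element itself.

```r
ex_list = list(
  a = c(1, 2, 3, 4),
  b = TRUE,
  c = "Hello!",
  d = function(arg = 42) {print("Hello World!")},
  e = diag(5)
)

ex_list$e             # the element named e (a matrix)
ex_list[1:2]          # a list holding the first two elements
ex_list[c("e", "a")]  # a list holding elements e and a
ex_list["e"]          # a list of length one
ex_list[["e"]]        # the element itself, same as ex_list$e
ex_list$d             # the stored function
ex_list$d()           # calling it prints "Hello World!"
```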
x = 1
y = 3
if (x > y) {
  z = x * y
  print("x is larger than y")
} else {
  z = x + 5 * y
  print("x is less than or equal to y")
}
## [1] "x is less than or equal to y"
## [1] 16
R also has a special function, `ifelse()`, which returns one of two values depending on a condition.
## [1] 1
The real power of `ifelse()` comes from its ability to be applied to vectors.
## [1] "Bar" "Bar" "Bar" "Bar" "Bar" "Foo" "Foo" "Foo"
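A sketch of `ifelse()` on a single condition and on a vector. The vector `fib` and the threshold `6` are assumptions chosen to reproduce the five `"Bar"` and three `"Foo"` values printed above.

```r
ifelse(4 > 3, 1, 0)  # condition is TRUE, so the first value is returned

# Applied to a vector, ifelse() picks a value for each element.
fib = c(1, 1, 2, 3, 5, 8, 13, 21)
ifelse(fib > 6, "Foo", "Bar")
```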
for loop
A `for` loop repeats the same procedure a specified number of times.
## [1] 22 24 26 28 30
This kind of `for` loop is very normal in many programming languages.
In R we would not use a loop; instead we would simply use a vectorized operation.
An explicit `for` loop in R is often much slower than the equivalent vectorized operation.
## [1] 22 24 26 28 30
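Both outputs above can be obtained by doubling the vector `11:15`, once with a loop and once with a vectorized operation. This is a reconstruction; `x_vec` is a name introduced here.

```r
# Loop version: double each element one at a time.
x = 11:15
for (i in 1:5) {
  x[i] = x[i] * 2
}
x

# Vectorized version: double every element in one operation.
x_vec = 11:15
x_vec = x_vec * 2
x_vec
```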
# The following is just a demonstration,
# not the real function in R.
function_name(arg1 = 10, arg2 = 20)
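We can also write our own functions in R.
A function is defined with `function()`, and its body is written inside `{}`.
Here the function is named `standardize`, and it has a single argument, `x`, which is used in the body of the function.
To test it, we take a random sample of size `n = 10` from a normal distribution with a mean of `2` and a standard deviation of `5`.
## [1] -3.1526696 0.6309771 3.9649879 2.3507166 3.7361776 6.9010731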
## [7] 1.3665305 0.9355939 2.4036735 1.5121131
## [1] -1.9986285 -0.5492796 0.7278336 0.1094771 0.6401864 1.8525189
## [7] -0.2675214 -0.4325942 0.1297626 -0.2117551
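The definition of `standardize` is not shown. A minimal sketch consistent with the printed values (each observation minus the sample mean, divided by the sample standard deviation) is given below. The seed is an assumption added for reproducibility, so the numbers will differ from those above.

```r
standardize = function(x) {
  m = mean(x)
  std = sd(x)
  (x - m) / std  # the last expression is the return value
}

set.seed(42)  # assumed seed, only so the sketch is reproducible
test_sample = rnorm(n = 10, mean = 2, sd = 5)
standardized = standardize(x = test_sample)
mean(standardized)  # approximately 0
sd(standardized)    # 1, up to rounding
```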
There are several ways to call a function with a default argument to perform the operation `10^2`, resulting in `100`.
## [1] 100
## [1] 100
## [1] 100
## [1] 100
## [1] 1024
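The function that produced these outputs is not shown. The sketch below assumes a hypothetical `power_of_num` whose second argument defaults to `2`, which yields four calls returning `100` and one returning `1024`.

```r
# Hypothetical function: the name and argument names are assumptions.
power_of_num = function(num, power = 2) {
  num ^ power
}

power_of_num(10)                   # power defaults to 2
power_of_num(10, 2)                # both arguments by position
power_of_num(num = 10, power = 2)  # both arguments by name
power_of_num(power = 2, num = 10)  # named arguments in any order
power_of_num(2, 10)                # by position: 2 ^ 10
```

Named arguments can be given in any order, while unnamed arguments are matched by position.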
To further illustrate a function with a default argument, we will write a function that calculates sample variance two ways.
By default, the function will calculate the unbiased estimate of \(\sigma^2\), which we will call \(s^2\).
\[ s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]
It can also return the biased estimate, which we will call \(\hat{\sigma}^2\).
\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]
get_var = function(x, unbiased = TRUE) {
  # Denominator: n - 1 for the unbiased estimate, n for the biased one.
  if (unbiased) {
    n = length(x) - 1
  } else {
    n = length(x)
  }
  # Scaled sum of squared deviations from the sample mean.
  (1 / n) * sum((x - mean(x)) ^ 2)
}
## [1] 6.815147
## [1] 6.815147
## [1] 6.815147
The unbiased estimate matches the result of R’s built-in function `var()`. Finally, let’s examine the biased estimate of \(\sigma^2\).
## [1] 6.133632
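The two estimates differ only by the factor \((n - 1)/n\). With \(n = 10\), the printed values agree: 6.815147 × 9/10 ≈ 6.133632. The sketch below checks this relationship with a compact version of `get_var`. The seed is an assumption, so the sample differs from the one above.

```r
get_var = function(x, unbiased = TRUE) {
  n = if (unbiased) length(x) - 1 else length(x)
  (1 / n) * sum((x - mean(x)) ^ 2)
}

set.seed(42)  # assumed seed; any numeric sample works
test_sample = rnorm(n = 10, mean = 2, sd = 5)

get_var(test_sample)                    # unbiased, by default
var(test_sample)                        # built-in, same value
get_var(test_sample, unbiased = FALSE)  # biased estimate

# The biased estimate is the unbiased one scaled by (n - 1) / n.
all.equal(get_var(test_sample, unbiased = FALSE),
          get_var(test_sample) * 9 / 10)
```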