1 Data frame

1.1 Acknowledgement

This note is largely based on Applied Statistics with R. https://daviddalpiaz.github.io/appliedstats/

1.2 Introduction

  • A data frame is the most common way that we store and interact with data in this course.
  • A data frame is a list of vectors of equal length.
    • Each vector must contain a single data type.
    • Different vectors can store different data types.
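
As a minimal sketch of the points above, the following builds a small data frame with data.frame(). The name example_data and its columns x, y and z are hypothetical, chosen only for illustration.

```r
# Hypothetical example: three columns (vectors) of equal length,
# each with its own data type.
example_data <- data.frame(
  x = c(1, 3, 5, 7, 9),                        # numeric
  y = c("Hello", "Hi", "Bye", "Ciao", "Hey"),  # character
  z = c(TRUE, FALSE, TRUE, FALSE, TRUE),       # logical
  stringsAsFactors = FALSE                     # keep y as character in R < 4.0
)

example_data                  # print the data frame
sapply(example_data, class)   # one data type per column
```

Each column holds a single type, while the three columns differ from one another.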


  • The write.csv() function saves (or exports) a data frame in .csv format.
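
A short sketch of exporting a data frame could look as follows. The data frame and the path csv_path are hypothetical; the file is written to R's temporary directory so that the example does not touch your working directory.

```r
# Hypothetical data frame to export.
example_data <- data.frame(
  x = c(1, 3, 5),
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Assumed location: a file inside R's temporary directory.
csv_path <- file.path(tempdir(), "example_data.csv")

# row.names = FALSE omits the row-number column from the file.
write.csv(example_data, file = csv_path, row.names = FALSE)

file.exists(csv_path)   # check that the file was created
```

Setting row.names = FALSE is a design choice: without it, write.csv() adds an extra column of row names that usually has to be dropped again when the file is read back.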

1.3 Load csv file

  • We can also import data from various file types into R, as well as use data stored in packages.
  • There are two common ways to read a csv file into R:
    • The read.csv() function, which is available in R by default.
    • The read_csv() function from the readr package. This is faster for larger data.
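
The two readers can be sketched as below. To keep the example self-contained, it first writes a small hypothetical file to R's temporary directory (the path csv_path and the object names are assumptions, not part of the original note). The readr call runs only if that package is installed.

```r
# Create a small CSV file first so that the example is self-contained.
csv_path <- file.path(tempdir(), "example_data.csv")
write.csv(
  data.frame(x = c(1, 3, 5), y = c("a", "b", "c")),
  file = csv_path,
  row.names = FALSE
)

# Base R reader: returns a data.frame.
example_data_from_csv <- read.csv(csv_path)
str(example_data_from_csv)

# readr reader: returns a tibble. Run only if readr is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  example_data_from_csv2 <- readr::read_csv(csv_path)
  print(example_data_from_csv2)
}
```

With a file in the working directory, the file name alone, for example read.csv("example_data.csv"), is enough.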

  • Note: A call such as read.csv("example_data.csv") assumes that the file example_data.csv exists in your current working directory.
  • The current working directory is the folder in which R looks for files and saves files by default. To see it, call getwd():
## [1] "C:/Users/Yuta/Dropbox/Teaching/2020_1_3_4_Applied_Metrics/Note_Github/02_RIntro"
  • If you want to change the working directory, use the setwd() function.
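
A minimal sketch of inspecting and changing the working directory is shown here. Using tempdir() as the target folder is only an assumption for illustration; in practice you would pass the path of your own project folder.

```r
old_wd <- getwd()   # remember the current working directory
old_wd

setwd(tempdir())    # switch to another folder (here, R's temporary directory)
getwd()

setwd(old_wd)       # switch back to where we started
```

Saving the old directory and restoring it at the end keeps the rest of a script unaffected by the change.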

1.4 Examine dataframe

  • Inside the ggplot2 package is a dataset called mpg. By loading the package using the library() function, we can now access mpg.
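
Loading the package can be sketched as follows, assuming ggplot2 is already installed (if not, install.packages("ggplot2") installs it).

```r
library(ggplot2)   # makes the mpg dataset available

class(mpg)         # mpg is a tibble, which is also a data.frame
```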

  • Three things we would generally like to do with data:
    • Look at the raw data.
    • Understand the data. (Where did it come from? What are the variables? Etc.)
    • Visualize the data.
  • To look at the data, we have two useful commands: head() and str()
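
For instance, head() could be used like this. The choice n = 10 is arbitrary.

```r
library(ggplot2)

head(mpg, n = 10)   # first 10 rows; n defaults to 6
```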

  • The function str() will display the “structure” of the data frame.
    • It will display the number of observations and variables, list the variables, give the type of each variable, and show some elements of each variable.
    • This information can also be found in the “Environment” window in RStudio.
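
The call itself is a single line. The exact header that str() prints for a tibble depends on the installed package versions, so it may differ slightly from the output shown in this note.

```r
library(ggplot2)

str(mpg)   # structure: dimensions, variable names, types, first values
```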
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...

  • Use the names() function to obtain the names of the variables in the dataset.
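
Applied to mpg, this is:

```r
library(ggplot2)

names(mpg)   # character vector of column names
```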
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
  • To access one of the variables as a vector, we use the $ operator.
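
For example, the year and hwy columns can be extracted as vectors like this:

```r
library(ggplot2)

mpg$year   # model year of each car, as an integer vector
mpg$hwy    # highway miles per gallon, as an integer vector
```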
## [1] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 2008
## [16] 1999 2008 2008 2008 2008 2008 1999 2008 1999 1999 2008 2008 2008 2008 2008
## [31] 1999 1999 1999 2008 1999 2008 2008 1999 1999 1999 1999 2008 2008 2008 1999
## [46] 1999 2008 2008 2008 2008 1999 1999 2008 2008 2008 1999 1999 1999 2008 2008
## [61] 2008 1999 2008 1999 2008 2008 2008 2008 2008 2008 1999 1999 2008 1999 1999
## [76] 1999 2008 1999 1999 1999 2008 2008 1999 1999 1999 1999 1999 2008 1999 2008
## [91] 1999 1999 2008 2008 1999 1999 2008 2008 2008 1999 1999 1999 1999 1999 2008
## [106] 2008 2008 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 2008
## [121] 2008 2008 2008 2008 1999 1999 2008 2008 2008 2008 1999 2008 2008 1999 1999
## [136] 1999 2008 1999 2008 2008 1999 1999 1999 2008 2008 2008 2008 1999 1999 2008
## [151] 1999 1999 2008 2008 1999 1999 1999 2008 2008 1999 1999 2008 2008 2008 2008
## [166] 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 2008 2008 1999
## [181] 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999
## [196] 1999 2008 2008 1999 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008
## [211] 2008 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 1999 1999
## [226] 2008 2008 1999 1999 2008 2008 1999 1999 2008

## [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23
## [26] 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18
## [51] 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17
## [76] 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33
## [101] 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20
## [126] 17 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25
## [151] 17 17 20 18 26 26 27 28 25 25 24 27 25 26 23 26 26 26 26 25 27 25 27 20 20
## [176] 19 17 20 17 29 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18
## [201] 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26
## [226] 28 29 29 29 28 29 26 26 26

  • We can use the dim(), nrow() and ncol() functions to obtain information about the dimension of the data frame.
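
The three calls, in the order of the output that follows, are:

```r
library(ggplot2)

dim(mpg)    # number of rows and number of columns
nrow(mpg)   # number of rows (observations)
ncol(mpg)   # number of columns (variables)
```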
## [1] 234  11
## [1] 234
## [1] 11

1.5 Subsetting data

  • Subsetting data frames can work much like subsetting matrices using square brackets, [,].
  • Here, we find fuel efficient vehicles earning over 35 miles per gallon and only display manufacturer, model and year.
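
One way to write this with square brackets is sketched here. It assumes that "miles per gallon" refers to the highway variable hwy; the name efficient_cars is hypothetical.

```r
library(ggplot2)

# Rows: cars whose highway mileage exceeds 35.
# Columns: only manufacturer, model and year.
efficient_cars <- mpg[mpg$hwy > 35, c("manufacturer", "model", "year")]
efficient_cars
```

The expression before the comma is a logical vector that selects rows; the character vector after the comma selects columns by name.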

  • An alternative would be to use the subset() function, which has a much more readable syntax.
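
The same selection with subset() might look like this, again assuming that the hwy variable is the mileage of interest. The name efficient_cars is hypothetical.

```r
library(ggplot2)

efficient_cars <- subset(
  mpg,
  subset = hwy > 35,                          # row condition
  select = c("manufacturer", "model", "year") # columns to keep
)
efficient_cars
```

Inside subset(), column names can be used directly, so there is no need to repeat mpg$ in the condition.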

  • Lastly, we could use the filter() and select() functions from the dplyr package, which introduces the %>% operator from the magrittr package.
  • I will assign DataCamp exercises on the dplyr package as a makeup lecture.
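
A sketch of the dplyr version of the same selection is given here, assuming that both dplyr and ggplot2 are installed and that hwy is the mileage variable of interest. The name efficient_cars is hypothetical.

```r
library(ggplot2)
library(dplyr)

efficient_cars <- mpg %>%
  filter(hwy > 35) %>%               # keep rows with highway mpg over 35
  select(manufacturer, model, year)  # keep only these three columns
efficient_cars
```

The %>% operator passes the result on its left as the first argument of the function on its right, so the steps read from top to bottom in the order in which they are applied.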