Introduction
- A data frame is the most common way that we store and interact with data in this course.
example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
y = c(rep("Hello", 9), "Goodbye"),
z = rep(c(TRUE, FALSE), 5))
- A data frame is a list of vectors of equal length.
- Each vector must contain a single data type.
- Different vectors can store different data types.
## x y z
## 1 1 Hello TRUE
## 2 3 Hello FALSE
## 3 5 Hello TRUE
## 4 7 Hello FALSE
## 5 9 Hello TRUE
## 6 1 Hello FALSE
## 7 3 Hello TRUE
## 8 5 Hello FALSE
## 9 7 Hello TRUE
## 10 9 Goodbye FALSE
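The points above can be checked directly. The following is a minimal sketch that rebuilds example_data and inspects it with the base R functions is.list(), sapply(), class() and length(). The variable names column_classes and column_lengths are ours, chosen for illustration.

```r
# Rebuild the example data frame from above
example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
                          y = c(rep("Hello", 9), "Goodbye"),
                          z = rep(c(TRUE, FALSE), 5))

# A data frame is a list whose elements are the columns
is.list(example_data)

# Each column holds a single type, but types differ across columns
column_classes = sapply(example_data, class)
column_classes

# All columns have the same length (the number of rows)
column_lengths = sapply(example_data, length)
column_lengths
```

Note that the class reported for y depends on the R version: from R 4.0.0 on it is character, while earlier versions convert strings to factor by default.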
- The write.csv() function saves (or exports) the data frame in .csv format.
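As a minimal sketch of exporting a data frame, the code below writes a small data frame to a temporary file and reads it back. The temporary path (csv_path) and the shortened data frame are assumptions made for illustration. In practice you would pass a file name such as "example-data.csv" instead.

```r
# A shortened version of the example data frame (for illustration only)
example_data = data.frame(x = c(1, 3, 5),
                          y = c("Hello", "Hello", "Goodbye"))

# Write to a temporary file so nothing in the working directory is overwritten
csv_path = tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file
write.csv(example_data, file = csv_path, row.names = FALSE)

# Read the file back to confirm the round trip
round_trip = read.csv(csv_path)
round_trip
```

Without row.names = FALSE, write.csv() adds the row names as an extra first column, which then shows up as an unnamed variable when the file is read back.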
Load csv file
- We can also import data from various file types into R, as well as use data stored in packages.
- To read a csv file into R, we can use either of the following:
- The read.csv() function, which is available by default in R.
- The read_csv() function from the readr package. This is faster for larger data.
# install.packages("readr")
#library(readr)
#example_data_from_csv = read_csv("example-data.csv")
example_data_from_csv = read.csv("example-data.csv")
- Note: This particular line of code assumes that the file example-data.csv exists in your current working directory.
- The current working directory is the folder in which R looks for files by default. To see it, type
getwd()
## [1] "C:/Users/Yuta/Dropbox/Teaching/2019S_Applied_Econometrics_JPN_ENG/Material_Github"
- If you want to set the working directory, use the setwd() function.
setwd(dir = "directory path" )
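Before reading a file, it can help to check where R will look for it. The sketch below uses only the base R functions getwd(), file.path() and file.exists(). The variable names current_dir and csv_path are ours.

```r
# Show the current working directory
current_dir = getwd()
current_dir

# Build the full path that the relative file name would resolve to
csv_path = file.path(current_dir, "example-data.csv")
csv_path

# Check whether the file is there before trying to read it
file.exists(csv_path)
```

If file.exists() returns FALSE, either move the file into the working directory or change the working directory with setwd().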
Examine dataframe
- The ggplot2 package contains a dataset called mpg. After loading the package with the library() function, we can access mpg.
library(ggplot2)
- Three things we would generally like to do with data:
- Look at the raw data.
- Understand the data. (Where did it come from? What are the variables? Etc.)
- Visualize the data.
- To look at the data, we have two useful commands: head() and str().
- By default, head() shows the first six rows. Here we ask for the first ten.
head(mpg, n = 10)
## # A tibble: 10 x 11
## manufacturer model displ year cyl trans drv cty hwy fl cla~
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p com~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p com~
## 3 audi a4 2 2008 4 manu~ f 20 31 p com~
## 4 audi a4 2 2008 4 auto~ f 21 30 p com~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p com~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p com~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p com~
## 8 audi a4 q~ 1.8 1999 4 manu~ 4 18 26 p com~
## 9 audi a4 q~ 1.8 1999 4 auto~ 4 16 25 p com~
## 10 audi a4 q~ 2 2008 4 manu~ 4 20 28 p com~
- The function str() will display the “structure” of the data frame.
- It will display the number of observations and variables, list the variables, give the type of each variable, and show some elements of each variable.
- This information can also be found in the “Environment” window in RStudio.
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
- The names() function returns the names of the variables in the dataset.
names(mpg)
## [1] "manufacturer" "model" "displ" "year"
## [5] "cyl" "trans" "drv" "cty"
## [9] "hwy" "fl" "class"
- To access one of the variables as a vector, we use the $ operator.
mpg$year
## [1] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008
## [15] 2008 1999 2008 2008 2008 2008 2008 1999 2008 1999 1999 2008 2008 2008
## [29] 2008 2008 1999 1999 1999 2008 1999 2008 2008 1999 1999 1999 1999 2008
## [43] 2008 2008 1999 1999 2008 2008 2008 2008 1999 1999 2008 2008 2008 1999
## [57] 1999 1999 2008 2008 2008 1999 2008 1999 2008 2008 2008 2008 2008 2008
## [71] 1999 1999 2008 1999 1999 1999 2008 1999 1999 1999 2008 2008 1999 1999
## [85] 1999 1999 1999 2008 1999 2008 1999 1999 2008 2008 1999 1999 2008 2008
## [99] 2008 1999 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 2008 2008
## [113] 1999 1999 2008 1999 1999 2008 2008 2008 2008 2008 2008 2008 1999 1999
## [127] 2008 2008 2008 2008 1999 2008 2008 1999 1999 1999 2008 1999 2008 2008
## [141] 1999 1999 1999 2008 2008 2008 2008 1999 1999 2008 1999 1999 2008 2008
## [155] 1999 1999 1999 2008 2008 1999 1999 2008 2008 2008 2008 1999 1999 1999
## [169] 1999 2008 2008 2008 2008 1999 1999 1999 1999 2008 2008 1999 1999 2008
## [183] 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008 1999 1999 1999
## [197] 2008 2008 1999 2008 1999 1999 2008 1999 1999 2008 2008 1999 1999 2008
## [211] 2008 1999 1999 1999 1999 2008 2008 2008 2008 1999 1999 1999 1999 1999
## [225] 1999 2008 2008 1999 1999 2008 2008 1999 1999 2008
mpg$hwy
## [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17
## [24] 26 23 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21
## [47] 23 23 19 18 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16
## [70] 12 15 16 17 15 17 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25
## [93] 26 24 21 22 23 22 20 33 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28
## [116] 26 29 28 27 24 24 24 22 19 20 17 12 19 18 14 15 18 18 15 17 16 18 17
## [139] 19 19 17 29 27 31 32 27 26 26 25 25 17 17 20 18 26 26 27 28 25 25 24
## [162] 27 25 26 23 26 26 26 26 25 27 25 27 20 20 19 17 20 17 29 27 31 31 26
## [185] 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18 20 20 22 17 19 18 20
## [208] 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26 28 29 29 29 28
## [231] 29 26 26 26
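To see what the $ operator returns without printing hundreds of values, the following sketch applies it to a small data frame modelled on example_data. The name x_values is ours, and the comparison with [[ ]] is an addition to the notes, not something they rely on.

```r
# A small data frame with a numeric and a logical column
example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
                          z = rep(c(TRUE, FALSE), 5))

# $ extracts a single column as a plain vector
x_values = example_data$x
is.vector(x_values)

# [[ ]] with the column name extracts the same vector
identical(x_values, example_data[["x"]])

# The extracted vector can be passed directly to other functions
mean(x_values)
```

Because the result is an ordinary vector, expressions such as mean(mpg$hwy) or mpg$hwy > 35 work in the same way.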
- We can use the dim(), nrow() and ncol() functions to obtain information about the dimensions of the data frame.
dim(mpg)
nrow(mpg)
ncol(mpg)
## [1] 234 11
## [1] 234
## [1] 11
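The three functions are related: dim() returns the number of rows followed by the number of columns. A short sketch on the example data frame from the introduction (rebuilt here so the code runs on its own) makes this explicit.

```r
# Rebuild the example data frame: 10 rows, 3 columns
example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
                          y = c(rep("Hello", 9), "Goodbye"),
                          z = rep(c(TRUE, FALSE), 5))

# dim() gives rows first, then columns
dim(example_data)

# nrow() and ncol() return the two parts separately
identical(dim(example_data), c(nrow(example_data), ncol(example_data)))

# Since a data frame is a list of columns, length() equals ncol()
length(example_data) == ncol(example_data)
```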
Subsetting data
- Subsetting data frames works much like subsetting matrices, using square brackets, [,].
- Here, we find fuel-efficient vehicles that achieve more than 35 miles per gallon on the highway (hwy) and display only manufacturer, model and year.
mpg[mpg$hwy > 35, c("manufacturer", "model", "year")]
## # A tibble: 6 x 3
## manufacturer model year
## <chr> <chr> <int>
## 1 honda civic 2008
## 2 honda civic 2008
## 3 toyota corolla 2008
## 4 volkswagen jetta 1999
## 5 volkswagen new beetle 1999
## 6 volkswagen new beetle 1999
- An alternative would be to use the subset() function, which has a much more readable syntax.
subset(mpg, subset = hwy > 35, select = c("manufacturer", "model", "year"))
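The two approaches give the same rows for mpg, because hwy has no missing values there. They differ when the condition evaluates to NA. The sketch below shows this on a small made-up data frame. The name small_cars and its values are hypothetical and serve only as an illustration.

```r
# Hypothetical data with one missing hwy value
small_cars = data.frame(model = c("a", "b", "c", "d"),
                        hwy = c(44, 29, NA, 36),
                        year = c(1999, 2008, 2008, 1999))

# Bracket subsetting returns a row of NAs where the condition is NA
by_brackets = small_cars[small_cars$hwy > 35, c("model", "year")]
by_brackets

# subset() drops rows where the condition is NA
by_subset = subset(small_cars, subset = hwy > 35, select = c("model", "year"))
by_subset
```

With missing data, subset() therefore tends to be the safer choice, or the bracket condition can be wrapped in which().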
- Lastly, we could use the filter() and select() functions from the dplyr package, together with the %>% operator, which dplyr makes available from the magrittr package.
library(dplyr)
mpg %>%
filter(hwy > 35) %>%
select(manufacturer, model, year)
- I will give you a DataCamp assignment on the dplyr package as a makeup lecture.