This lab is designed to give you practice manipulating basic R data structures. In addition, it should give you more practice using R and Markdown.
You have this information about Governor Inslee:
This information was obtained from Wikipedia.
Make a list structured so that you can answer these questions:
Write the code to answer those questions.
Last session you were asked to get some data from the Seattle Open Data portal. I will show you an example of how to use R to find and fix some common data issues.
linkToData='ExampleLab3.xlsx'
library(rio)
## Warning: package 'rio' was built under R version 3.4.4
myData=import(linkToData)
str(myData)
## 'data.frame': 30 obs. of 5 variables:
## $ personID : chr "A145" "A185" "A108" "A172" ...
## $ age : chr "21" "34" "35" NA ...
## $ County : chr "King County" "King County" "King County" "King County" ...
## $ Degree : chr "HighSchoolDiploma" "Bachelor" "Bachelor" "Bachelor" ...
## $ HouseholdSize: chr "2" "1" "1" "1" ...
From the result above, age and HouseholdSize should be numeric, but both were read as text (chr); that is, the numbers in those columns are stored as character strings.
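A quick way to see which columns were read as text is to ask for the class of every column at once. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the lab data); it also shows one reason the type matters: text "numbers" sort alphabetically, not numerically.

```r
# Toy data frame (hypothetical values) mimicking numbers that were read as text
toy <- data.frame(age = c("21", "34", NA),
                  HouseholdSize = c("2", "l", "1"),
                  stringsAsFactors = FALSE)

# Report the class of every column
colClasses <- sapply(toy, class)
print(colClasses)

# Text is sorted character by character, so "10" is placed before "2"
sortedAsText <- sort(c("10", "9", "2"))
print(sortedAsText)
```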
myData[is.na(myData$age),]
## personID age County Degree HouseholdSize
## 4 A172 <NA> King County Bachelor 1
myData[is.na(myData$HouseholdSize),]
## personID age County Degree HouseholdSize
## 28 C148 35 Spokane County Bachelor <NA>
Each column has one missing value.
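Instead of filtering one column at a time, the missing values can be counted for all columns in one step. This is a small sketch using a made-up data frame (`toy` is hypothetical); `is.na()` applied to a data frame returns a logical matrix, and `colSums()` adds up the `TRUE` values per column.

```r
# Toy data frame (hypothetical values) with one missing value in each column
toy <- data.frame(age = c("21", NA, "35"),
                  HouseholdSize = c("2", "1", NA),
                  stringsAsFactors = FALSE)

# Count the missing values in every column
missingPerColumn <- colSums(is.na(toy))
print(missingPerColumn)
```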
myData[is.na(as.numeric(myData$age)),]
## personID age County Degree HouseholdSize
## 4 A172 <NA> King County Bachelor 1
In the previous case, converting the column to numeric produced no additional missing values: the only NA is the one that was already there.
myData[is.na(as.numeric(myData$HouseholdSize)),]
## Warning in `[.data.frame`(myData,
## is.na(as.numeric(myData$HouseholdSize)), : NAs introduced by coercion
## personID age County Degree HouseholdSize
## 18 A195 43 Kitsap County Bachelor l
## 28 C148 35 Spokane County Bachelor <NA>
In the previous case, converting the column to numeric produced ONE extra missing value (row 18), and R warned you that NAs were introduced by coercion. The function as.numeric() returns NA for any text that it cannot interpret as a number.
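The comparison made above (missing before conversion versus missing after conversion) can be written as a single logical test. The sketch below uses a made-up vector `x` (hypothetical values); `suppressWarnings()` is only there to silence the coercion warning, which is expected here.

```r
# Hypothetical values: valid numbers, a lowercase letter "l", and a true missing value
x <- c("2", "l", NA, "44")

# Convert to numeric; the coercion warning is expected, so it is silenced
converted <- suppressWarnings(as.numeric(x))

# TRUE where a value was present as text but could not be read as a number
failedToParse <- !is.na(x) & is.na(converted)
print(x[failedToParse])
```

Values selected this way are the ones worth inspecting by hand, since the original NA is excluded.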
The age column had no issues, so I simply change its data type:
myData$age=as.numeric(myData$age)
In the HouseholdSize column, you first need to fix the offending value (a lowercase letter l where a 1 was presumably intended):
myData$HouseholdSize[18]
## [1] "l"
# then replace it (the column is still text, so the new value is written as text):
myData$HouseholdSize[18]="1"
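Replacing by row number works here, but it breaks if the rows are reordered. One alternative is to select the rows by the bad value itself. This is a sketch on a made-up data frame (`toy` is hypothetical), and it assumes, as above, that the letter "l" was meant to be the digit 1.

```r
# Toy data frame (hypothetical values) with a letter "l" typed instead of a 1
toy <- data.frame(personID = c("A145", "A195", "C148"),
                  HouseholdSize = c("2", "l", NA),
                  stringsAsFactors = FALSE)

# which() ignores NA comparisons, so the truly missing value is left alone
badRows <- which(toy$HouseholdSize == "l")

# Assumption: "l" stands for 1; replace it and then convert the column
toy$HouseholdSize[badRows] <- "1"
toy$HouseholdSize <- as.numeric(toy$HouseholdSize)
print(toy)
```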
Now you can turn the column into numbers:
myData$HouseholdSize=as.numeric(myData$HouseholdSize)
summary(myData[,c('age','HouseholdSize')])
## age HouseholdSize
## Min. :21.00 Min. : 1.000
## 1st Qu.:29.00 1st Qu.: 2.000
## Median :38.00 Median : 2.000
## Mean :36.72 Mean : 3.828
## 3rd Qu.:42.00 3rd Qu.: 3.000
## Max. :49.00 Max. :44.000
## NA's :1 NA's :1
In the previous summary, the age values look fine, but something seems odd in household size: one person in this data set has a household size of 44. Is this an outlier?
boxplot(myData$HouseholdSize,horizontal = TRUE)
The value sits far from all the others. Because household size takes integer values, a frequency table is a useful check (do not do this for variables with decimal values):
table(myData$HouseholdSize)
##
## 1 2 3 4 44
## 7 9 6 6 1
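The values that a boxplot draws as isolated points can also be listed directly with `boxplot.stats()`, which applies the same rule (points beyond 1.5 times the box length from the box). The sketch below rebuilds the household sizes from the frequency table above, so it does not need the original file; the vector name `sizes` is just for illustration.

```r
# Rebuild the non-missing household sizes from the frequency table
sizes <- rep(c(1, 2, 3, 4, 44), times = c(7, 9, 6, 6, 1))

# Values that the boxplot rule flags as outliers
flagged <- boxplot.stats(sizes)$out
print(flagged)

# Positions of the flagged values, useful for looking up the full record
flaggedPositions <- which(sizes %in% flagged)
print(flaggedPositions)
```

A flagged value is only a candidate for review; whether it is a typing error still has to be decided by checking the source.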
My best guess is that this value is actually a 4 that was mistyped; however, you might want to spend some time looking for the original record that produced this row.