Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 3 LAB: Data Structures

Lab Instructions

This lab is designed to give you practice manipulating basic R data structures. In addition, it should give you more practice using R and Markdown.

Part 1: Working with lists

You have this information about Governor Inslee:

The information was obtained from Wikipedia.

Make a list that lets you answer these questions:

  • How old is he? (compute it from his birthday)
  • How long has he been married?
  • How many universities has he attended?
  • In what state was he born?

Write the code to answer those questions.
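As a starting point, here is a minimal sketch of the same kind of list for a made-up person. Every value below (name, dates, universities, state) is hypothetical and does not describe Governor Inslee, and yearsBetween() is a helper written for this illustration, not a built-in R function.

```r
# Hypothetical person; none of these values describe Governor Inslee
person <- list(
  name = "Jane Doe",
  birthday = as.Date("1980-05-17"),
  weddingDate = as.Date("2005-09-03"),
  universities = c("University A", "University B"),
  birthState = "Oregon"
)

# Whole years elapsed between two dates (only completed anniversaries count)
yearsBetween <- function(from, to = Sys.Date()) {
  from <- as.POSIXlt(from)
  to <- as.POSIXlt(to)
  years <- to$year - from$year
  # Subtract one year if this year's anniversary has not happened yet
  beforeAnniversary <- (to$mon < from$mon) |
    (to$mon == from$mon & to$mday < from$mday)
  years - as.integer(beforeAnniversary)
}

yearsBetween(person$birthday)      # age in whole years
yearsBetween(person$weddingDate)   # years married
length(person$universities)        # number of universities attended
person$birthState                  # state of birth
```

Storing the dates with as.Date() rather than as plain text is what makes the arithmetic possible; storing the universities as a vector inside the list lets length() do the counting.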

Part 2: Solving Data issues

Last session you were asked to get some data from the Seattle Open Data portal. I will show you an example of how to use R to find and fix some common data issues.

EXAMPLE:

  1. Open the data:
linkToData='ExampleLab3.xlsx'

library(rio)
## Warning: package 'rio' was built under R version 3.4.4
myData=import(linkToData)
  1. Verify the data type:
str(myData)
## 'data.frame':    30 obs. of  5 variables:
##  $ personID     : chr  "A145" "A185" "A108" "A172" ...
##  $ age          : chr  "21" "34" "35" NA ...
##  $ County       : chr  "King County" "King County" "King County" "King County" ...
##  $ Degree       : chr  "HighSchoolDiploma" "Bachelor" "Bachelor" "Bachelor" ...
##  $ HouseholdSize: chr  "2" "1" "1" "1" ...
  1. Identify problems

From the result above, age and household size should be numbers, but they have been read as text (chr). This suggests that at least one value in each column was written as text in the original file.

  1. Count missing values in numeric data:
myData[is.na(myData$age),]
##   personID  age      County   Degree HouseholdSize
## 4     A172 <NA> King County Bachelor             1
myData[is.na(myData$HouseholdSize),]
##    personID age         County   Degree HouseholdSize
## 28     C148  35 Spokane County Bachelor          <NA>

Each column has one missing value.
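The commands above list the rows that contain a missing value. If you only need the counts, a minimal sketch like the following may help. The toy data frame is made up for illustration and is not the lab file.

```r
# Toy data, made up for illustration (not ExampleLab3.xlsx)
toy <- data.frame(
  age = c("21", NA, "35"),
  HouseholdSize = c("2", "1", NA),
  stringsAsFactors = FALSE
)

# is.na() returns TRUE/FALSE; sum() counts the TRUE values
sum(is.na(toy$age))

# The same count for every column at once
colSums(is.na(toy))
```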

  1. Count missing values in numeric data when turned into numeric:
myData[is.na(as.numeric(myData$age)),]
##   personID  age      County   Degree HouseholdSize
## 4     A172 <NA> King County Bachelor             1

In the previous case, converting the column to numeric produced no additional missing values.

myData[is.na(as.numeric(myData$HouseholdSize)),]
## Warning in `[.data.frame`(myData,
## is.na(as.numeric(myData$HouseholdSize)), : NAs introduced by coercion
##    personID age         County   Degree HouseholdSize
## 18     A195  43  Kitsap County Bachelor             l
## 28     C148  35 Spokane County Bachelor          <NA>

In the previous case, converting the column to numeric produced ONE additional missing value, and R warned you that it introduced NAs. When you apply as.numeric() to text that cannot be read as a number (here, the letter "l"), R returns NA for that value.
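To see this behavior in isolation, here is a small sketch on a made-up vector (the values are assumptions chosen for illustration). It separates values that were already missing from values that became missing only after conversion.

```r
# Toy values, made up for illustration; "l" is a lowercase L, not the digit 1
x <- c("2", "1", "l", NA, "44")

# suppressWarnings() hides the "NAs introduced by coercion" message
converted <- suppressWarnings(as.numeric(x))
converted

# Positions that held some text but could not be read as a number
bad <- which(!is.na(x) & is.na(converted))
bad      # 3
x[bad]   # "l"
```

Text such as "2" or "44" converts without trouble; only the third value fails, and the fourth was missing from the start.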

  1. Make replacements

The first column (age) had no issues, so I simply change its data type:

myData$age=as.numeric(myData$age)

In the second column (HouseholdSize), you first need to fix the bad value:

myData$HouseholdSize[18]
## [1] "l"
# then:
myData$HouseholdSize[18]=1

Now you can turn the column into numbers:

myData$HouseholdSize=as.numeric(myData$HouseholdSize)
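If you have several columns to convert once their bad values are fixed, one possible shortcut is to convert them together with lapply(). This is a sketch on a made-up data frame; the column names mirror the example, but the values are assumptions.

```r
# Toy data, made up for illustration
toy <- data.frame(
  age = c("21", NA, "35"),
  HouseholdSize = c("2", "1", NA),
  County = c("King County", "King County", "Spokane County"),
  stringsAsFactors = FALSE
)

# Names of the columns that should hold numbers
numericColumns <- c("age", "HouseholdSize")

# Apply as.numeric() to each selected column and store the results back
toy[numericColumns] <- lapply(toy[numericColumns], as.numeric)

str(toy)
```

Columns not named in numericColumns, such as County, are left untouched.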
  1. Make a statistical summary of the numeric variables
summary(myData[,c('age','HouseholdSize')])
##       age        HouseholdSize   
##  Min.   :21.00   Min.   : 1.000  
##  1st Qu.:29.00   1st Qu.: 2.000  
##  Median :38.00   Median : 2.000  
##  Mean   :36.72   Mean   : 3.828  
##  3rd Qu.:42.00   3rd Qu.: 3.000  
##  Max.   :49.00   Max.   :44.000  
##  NA's   :1       NA's   :1
  1. Identify weird values

In the previous summary, the age values look fine, but something looks odd in household size: one person in this data set has a household size of 44. Is this an outlier?

boxplot(myData$HouseholdSize,horizontal = TRUE)

It looks too far from the other answers. Since household size takes integer values (do not do this for a variable with decimal values), we can try a frequency table:

table(myData$HouseholdSize)
## 
##  1  2  3  4 44 
##  7  9  6  6  1

My best guess is that this value is actually a 4, so there was some kind of typing mistake; however, you might want to spend some time looking for the original record that produced this row.
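If you prefer not to rely on reading the plot, the values that a boxplot draws as separate points can also be extracted with boxplot.stats(). The sketch below rebuilds the household sizes from the frequency table above (leaving out the missing value); the variable names are my own.

```r
# Household sizes rebuilt from the frequency table above (missing value omitted)
householdSize <- rep(c(1, 2, 3, 4, 44), times = c(7, 9, 6, 6, 1))

# Values beyond the boxplot whiskers (1.5 times the interquartile range by default)
suspicious <- boxplot.stats(householdSize)$out
suspicious

# Positions of those values, so you can inspect the full rows
which(householdSize %in% suspicious)
```

This only flags the value; whether 44 is a typing mistake or a real answer is still something you have to decide by checking the original record.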

YOUR TURN:

  • Use the data you downloaded last week.
  • Using str(), check whether any numeric column has been read as text. If so, please fix it. If there are several such columns, do not use more than three.
  • Identify whether there are weird values in the numeric columns (now all reformatted). If there are several numeric columns, do not use more than three.
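When your data has many text columns, it can help to scan all of them at once before deciding which ones to fix. The following is one possible sketch, run on a made-up data frame; the column names and values are assumptions, so replace toy with your own data.

```r
# Toy data, made up for illustration
toy <- data.frame(
  id = c("A1", "A2", "A3", "A4"),
  age = c("21", NA, "35", "40"),
  size = c("2", "l", NA, "4"),
  stringsAsFactors = FALSE
)

# For each text column, count the non-missing values that fail numeric conversion
failedConversions <- sapply(toy, function(column) {
  if (!is.character(column)) return(0L)
  converted <- suppressWarnings(as.numeric(column))
  sum(!is.na(column) & is.na(converted))
})

failedConversions
```

A text column with zero failures (age) is probably numeric data stored as text; a column with only a few failures (size) is probably numeric with some typing mistakes; a column where every value fails (id) is most likely real text.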

Part 3: Final Project

  • Specify what action or intervention you will begin in two weeks as your experiment.
  • Which of your measures do you expect to change as a result of that experiment (you may expect a change in more than one)?
  • What is your hypothesis? Do you think that measure will increase or decrease? It is recommended that you begin your data collection by tonight, although starting in the next few days will also work.