Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 4: Working with Data Frames

The data frame is a common data structure for people with experience with spreasheets or data tables. It is just that: a collection of rows with several columns. However, each software application has its own way of dealing with data frames.

In this session, we will learn the main strategies that we use to deal with data frames:

  1. Logical Operators.
  2. Control of Execution.
  3. Functions.

Logical Operators

Let me open a simple data frame from an Excel file:

library(rio)
fileName='people.xlsx'
people=import(fileName) # table '1'
people
##    cod ages state education
## 1   A1   27    NY         3
## 2   A2   27    WA         1
## 3   A3   25    WA         3
## 4   A4   31    CO         2
## 5   A5   28    CO         1
## 6   A6   37    NY         3
## 7   A7   37    NY         2
## 8   A8   25    NY         2
## 9   A9   38    CO         1
## 10 A10   28    VA         1
## 11 A11   31    WA         2
## 12 A12   34    WA         3
## 13 A13   27    NY         1
## 14 A14   27    CO         2
## 15 A15   28    NY         3
## 16 A16   28    WA         1
## 17 A17   27    WA         1
## 18 A18   32    WA         1
## 19 A19   40    NY         2
## 20 A20   25    NY         1

The first step is making sure of what data types you have:

str(people)
## 'data.frame':    20 obs. of  4 variables:
##  $ cod      : chr  "A1" "A2" "A3" "A4" ...
##  $ ages     : num  27 27 25 31 28 37 37 25 38 28 ...
##  $ state    : chr  "NY" "WA" "WA" "CO" ...
##  $ education: num  3 1 3 2 1 3 2 2 1 1 ...

Education is represented as a number, let me turn it into a category:

people$education=factor(people$education,
                        levels=c(1,2,3),
                        labels=c("Low","Medium","High"),
                        ordered = T)

The purpose of logical operatos can be divided into:

  • DETECTING COMPLIANCE

How many people are older that 30?

# count:
sum(people$ages>30)
## [1] 8
# count:
nrow(people[people$ages>30,])
## [1] 8
#count:
table(people$ages>30)
## 
## FALSE  TRUE 
##    12     8
# percent:
mean(people$ages>30)
## [1] 0.4
# percent:
nrow(people[people$ages>30,])/nrow(people)
## [1] 0.4
# percent:
prop.table(table(people$ages>30))
## 
## FALSE  TRUE 
##   0.6   0.4
  • SUBSETTING

Case 1: selecting rows

#People from WA

condition=c("WA")
people[people$state %in% condition,]
##    cod ages state education
## 2   A2   27    WA       Low
## 3   A3   25    WA      High
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 16 A16   28    WA       Low
## 17 A17   27    WA       Low
## 18 A18   32    WA       Low
#People from WA and NY

condition=c("WA","NY")
people[people$state %in% condition,]
##    cod ages state education
## 1   A1   27    NY      High
## 2   A2   27    WA       Low
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 13 A13   27    NY       Low
## 15 A15   28    NY      High
## 16 A16   28    WA       Low
## 17 A17   27    WA       Low
## 18 A18   32    WA       Low
## 19 A19   40    NY    Medium
## 20 A20   25    NY       Low
#top two highest levels

condition=c("Medium","High")
people[people$education %in% condition,]
##    cod ages state education
## 1   A1   27    NY      High
## 3   A3   25    WA      High
## 4   A4   31    CO    Medium
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
## 19 A19   40    NY    Medium
#top two highest levels

people[people$education >= "Medium",] # valid for ordinal factors
##    cod ages state education
## 1   A1   27    NY      High
## 3   A3   25    WA      High
## 4   A4   31    CO    Medium
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
## 19 A19   40    NY    Medium

The logical operator “>=” can not be used in text, or nominal variables, as they do not have order.

#People aged 37 OR 25

condition=c(37,25)
people[people$ages %in% condition,]
##    cod ages state education
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 20 A20   25    NY       Low
#People older than 35

condition=35
people[people$ages > condition,]
##    cod ages state education
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 9   A9   38    CO       Low
## 19 A19   40    NY    Medium
# the youngest person/people

condition=min(people$ages)
people[people$ages==condition,] 
##    cod ages state education
## 3   A3   25    WA      High
## 8   A8   25    NY    Medium
## 20 A20   25    NY       Low
# compare
people[which.min(people$ages),] 
##   cod ages state education
## 3  A3   25    WA      High

Multiple conditions bring the challenge of dealing with several data types:

# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,]
##    cod ages state education
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 9   A9   38    CO       Low
## 19 A19   40    NY    Medium
## 20 A20   25    NY       Low
# younger than 27 WITH education level Medium
people[people$ages<27 & people$education=="Medium",]
##   cod ages state education
## 8  A8   25    NY    Medium
# younger than 30 WITH education level 2 or 3
condition=c("Medium","High")
people[people$ages<30 & people$education %in% condition,]
##    cod ages state education
## 1   A1   27    NY      High
## 3   A3   25    WA      High
## 8   A8   25    NY    Medium
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
# older than 30 WITH education level 2 OR 3 FROM NY

people[ people$ages>30& 
           people$education %in% condition &
           people$state=='NY',]
##    cod ages state education
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 19 A19   40    NY    Medium

R has the filter() function which allows the selection of rows in an ‘english language’ approach:

library(dplyr)
filter( people, ages > 30 & 
            education %in% condition & 
            people$state=='NY'  )
##   cod ages state education
## 1  A6   37    NY      High
## 2  A7   37    NY    Medium
## 3 A19   40    NY    Medium

Case 2: selecting columns

This is how you can get some columns:

people[,c('cod','state')] # notice the comma position
##    cod state
## 1   A1    NY
## 2   A2    WA
## 3   A3    WA
## 4   A4    CO
## 5   A5    CO
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 10 A10    VA
## 11 A11    WA
## 12 A12    WA
## 13 A13    NY
## 14 A14    CO
## 15 A15    NY
## 16 A16    WA
## 17 A17    WA
## 18 A18    WA
## 19 A19    NY
## 20 A20    NY

This can be also done by using select (from dplyr)

select(people, c('cod','state'))
##    cod state
## 1   A1    NY
## 2   A2    WA
## 3   A3    WA
## 4   A4    CO
## 5   A5    CO
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 10 A10    VA
## 11 A11    WA
## 12 A12    WA
## 13 A13    NY
## 14 A14    CO
## 15 A15    NY
## 16 A16    WA
## 17 A17    WA
## 18 A18    WA
## 19 A19    NY
## 20 A20    NY

Or like this:

select(people, cod,state)
##    cod state
## 1   A1    NY
## 2   A2    WA
## 3   A3    WA
## 4   A4    CO
## 5   A5    CO
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 10 A10    VA
## 11 A11    WA
## 12 A12    WA
## 13 A13    NY
## 14 A14    CO
## 15 A15    NY
## 16 A16    WA
## 17 A17    WA
## 18 A18    WA
## 19 A19    NY
## 20 A20    NY

Of course, you can use conditions for rows and columns.

# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,c('cod','state')]
##    cod state
## 3   A3    WA
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 19 A19    NY
## 20 A20    NY

You can turn the previous command using the dplyr tools and the pip (%>%) operator:

people %>% filter(ages>35 | ages<27) %>% select(cod,state)
##   cod state
## 1  A3    WA
## 2  A6    NY
## 3  A7    NY
## 4  A8    NY
## 5  A9    CO
## 6 A19    NY
## 7 A20    NY

Note on missing values

It is always a good practice, when you know missing values are present, so that you can anticipate possible unexpected results.

# current value
people$ages[1]
## [1] 27
#turning that value to  missing:
people$ages[1]=NA

# then
people
##    cod ages state education
## 1   A1   NA    NY      High
## 2   A2   27    WA       Low
## 3   A3   25    WA      High
## 4   A4   31    CO    Medium
## 5   A5   28    CO       Low
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 9   A9   38    CO       Low
## 10 A10   28    VA       Low
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 13 A13   27    NY       Low
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
## 16 A16   28    WA       Low
## 17 A17   27    WA       Low
## 18 A18   32    WA       Low
## 19 A19   40    NY    Medium
## 20 A20   25    NY       Low

Let me query:

people[people$ages>27,]
##     cod ages state education
## NA <NA>   NA  <NA>      <NA>
## 4    A4   31    CO    Medium
## 5    A5   28    CO       Low
## 6    A6   37    NY      High
## 7    A7   37    NY    Medium
## 9    A9   38    CO       Low
## 10  A10   28    VA       Low
## 11  A11   31    WA    Medium
## 12  A12   34    WA      High
## 15  A15   28    NY      High
## 16  A16   28    WA       Low
## 18  A18   32    WA       Low
## 19  A19   40    NY    Medium

What is the row at the top?

Now,

people[people$ages<=27,]
##     cod ages state education
## NA <NA>   NA  <NA>      <NA>
## 2    A2   27    WA       Low
## 3    A3   25    WA      High
## 8    A8   25    NY    Medium
## 13  A13   27    NY       Low
## 14  A14   27    CO    Medium
## 17  A17   27    WA       Low
## 20  A20   25    NY       Low

Again the same row appears.

R gave you a row with all missing values. This happens when your query does not control the NA. You may want to use this to create your sub data frame:

people[people$ages>27 & !is.na(people$ages) ,]
##    cod ages state education
## 4   A4   31    CO    Medium
## 5   A5   28    CO       Low
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 9   A9   38    CO       Low
## 10 A10   28    VA       Low
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 15 A15   28    NY      High
## 16 A16   28    WA       Low
## 18 A18   32    WA       Low
## 19 A19   40    NY    Medium

It is different if the query does not refer to a column with missing values:

people[people$education=="High",]
##    cod ages state education
## 1   A1   NA    NY      High
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 12 A12   34    WA      High
## 15 A15   28    NY      High

Go to page beginning


Control of Execution

We will teach soon how to build functions, and for that you may want to have some knowledge on how to program in R. For that, you need to know how you can control the execution of code. This is done via:

  • Conditional execution
  • Repetitive execution

CONDITIONAL EXECUTION is how you tell the computer what part of a code to execute, depending if an event is true or false.

###### INPUT
value=-100

###### CONDITIONAL EXECUTION

if (value >= 0){ 
  # what to do if condition is TRUE
  rootValue=sqrt(value)
  print (rootValue)
} else {  
  # what to do if condition is FALSE
  print('Sorry, I do not compute square roots of negative numbers')
}
## [1] "Sorry, I do not compute square roots of negative numbers"

Notice the use of parenthesis in the condition. Also, the use of {} to enclose the commands. You do not need those curly braces if you have just one command after the condition. If you omitted the whole else{} section, the program will still run, but the program will not get a message when the input is invalid.

REPETITIVE EXECUTION is how you tell the computer to do something many times (and stop when it has to):

values=c(9,25,100)

for (value in values){ # do this for each value in values
  print(sqrt(value))
}
## [1] 3
## [1] 5
## [1] 10

You do not need to show each result, you could save the results.

values=c(9,25,100,500)
rootValues=c() # empty vector
for (value in values){
  rootValues=c(rootValues,sqrt(value)) # updating vector
}

Then, you see what you saved:

# to see the results:
rootValues
## [1]  3.00000  5.00000 10.00000 22.36068

It is evident that combining loops and control of execution, we can make better programs. For example, this code is not controlling well the process:

values=c(9,25,-100)
rootValues=c()
for (value in values){
  rootValues=c(rootValues,sqrt(value))
}
## Warning in sqrt(value): NaNs produced

Then,

# to see the results:
rootValues
## [1]   3   5 NaN

In the last result, R decided what to do when a negative value was input (it also sent a warning). This one does a better job:

values=c(9,25,-100,144,-72)
rootValuesNew=c()

for (value in values){
  if (value >=0){
    rootValuesNew=c(rootValuesNew,sqrt(value))
  }else {
    print('We added a missing value, negative input detected')
    rootValuesNew=c(rootValuesNew,NA)
  }
}
## [1] "We added a missing value, negative input detected"
## [1] "We added a missing value, negative input detected"

Then,

# to see the results:
rootValues
## [1]   3   5 NaN

We are producing an ouput with the same size as input. If we omit the else structure, we will produce an output with smaller size than the input.

You can also use break when you consider the execution should stop:

values=c(9,25,-100,144,-72)
rootValues=c()
for (value in values){
  if (value <0){
    print('We need to stop, invalid value detected')
    break
  }
  rootValues=c(rootValues,sqrt(value))
}
## [1] "We need to stop, invalid value detected"

The code above halted the program, but some results were saved:

rootValues
## [1] 3 5

You can use next when you consider the execution should continue:

values=list(9,NA,'1000',-100,144,-72)


for (value in values){
  if (is.na(value)){
    print('missing value as input')
    next
  }
  
  if (value <0){
    print('negative value as input')
    next
  }
  
  if (is.character(value)){
    print('char as input')
    next
  }
  rootVal=sqrt(value)
  print(paste(rootVal,'is the root of ',value))
}
## [1] "3 is the root of  9"
## [1] "missing value as input"
## [1] "char as input"
## [1] "negative value as input"
## [1] "12 is the root of  144"
## [1] "negative value as input"

Go to page beginning


Functions

We build functions to make the code more readable. Functions plus the data structures and control of execution capabilities you saw before will give you the basic tools you need to develop your own programs.

A function is a three-step process: Input, Transformation, Output. For example, if you need to convert a numeric value from Fahrenheit into Celsius , the input is the value in Fahrenheit, the transformation is the formula, and the output the result of the formula (a value in Celsius).

converterToCelsius=function(valueInFarenheit){ #input
  #transformation
  resultInCelsius= (valueInFarenheit-32)*5/9
  #output
  return (resultInCelsius)}
  • A function has a name (for example: converterToCelsius) to the left of the =;
  • then a definition of the input of the function using the reserved word function;
  • the transformation process between {};
  • the process output (for example: resultInCelsius), which requires the keyword return and ().

You can omit the return command and a function still works; in that case, R will output the last line that was executed; I avoid doing that as it impoverishes reading the code.

Above, we created a function, and after you run it, R has a new function available:

converterToCelsius(100)
## [1] 37.77778

The function input

We control the amount of input in a function:

# this function requires TWO inputs:
XsumY=function(valueX,valueY){
  ###
  resultSum=valueX+valueY
  ###
  return (resultSum)
}

The code above receives two values and outputs their sum. You can see how it works this way:

XsumY(3,10)
## [1] 13

You can have some inputs with default values:

riseToPower=function(base,exponent=2){
  #####
  result=1
  if (exponent > 0){
    for (time in 1:exponent){
      result=result*base
    }
  }
  #####
  return(result)
}

Since you have a default value in the input arguments, you decide if you give that input or not. Let’s see how it works:

# to the power 2 by default
riseToPower(9) 
## [1] 81
# to the power 3
riseToPower(9,3) 
## [1] 729
# using argument names does not require order:
riseToPower(exponent=0,base=9)
## [1] 1

The function output

Our output has been a single value, but it can be several ones; however, you need the right structure.

# one input, and several output in simple data structure:
factors=function(number){
    # empty vector that will collect output
    vectorOfAnswers=c()
    
    # for every value in the sequence...
    for (i in 1:number){
        
        #if the remainder of 'number'/'i' equals zero...
        if ((number %% i) == 0){ 
            
            # ...add 'i' to the vector of factors!
            vectorOfAnswers=c(vectorOfAnswers,i)
    }
  }
  return (vectorOfAnswers) # returning  the vector
}

Testing:

factors(20) 
## [1]  1  2  4  5 10 20

Applying functions to data structures

Imaging you have created a function that takes a value and multiplies it by two, like this:

double=function(x){
    return (2*x)
    }

and you have this vector:

myVector=c(1,2,3)

What will you get here?

double(myVector)
## [1] 2 4 6

If you use a vector as an input in a function in R, R will apply the function to each element.

If you use a list of numbers as input:

myList=list(1,2,3)

You get:

double(myList)

…you get an error. In this case, you can use Map or mapply:

# Map returns a list, and the input can be a vector or a list
Map(double,myList)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6

You can also use mapply:

# mapply returns a vector, and the input can be a vector or a list
mapply(double,myList)
## [1] 2 4 6

Notice that the outputs are returned in different data structures.

As we use data frames most of the time, pay attention on how you use a function and what you get as result:

numberA=c(10,20,30,40,50)
numberB=c(6,7,8,9,10)
dataDF=data.frame(numberA,numberB)
dataDF
##   numberA numberB
## 1      10       6
## 2      20       7
## 3      30       8
## 4      40       9
## 5      50      10

Let’s double each value applying the function directly to the data frame:

double(dataDF)
##   numberA numberB
## 1      20      12
## 2      40      14
## 3      60      16
## 4      80      18
## 5     100      20

As you saw above, the function double was designed to receive as input a simple value (a number). Then, without effort from your side, R itself decided to apply it to each element in the data frame.

Try now simple function as as.character() :

as.character(dataDF)
## [1] "c(10, 20, 30, 40, 50)" "c(6, 7, 8, 9, 10)"

I am pretty sure, this is not what you wanted.

You know that it works in one column (but not in a data frame):

as.character(dataDF$numberA)
## [1] "10" "20" "30" "40" "50"

We can use Map():

Map(as.character,dataDF)
## $numberA
## [1] "10" "20" "30" "40" "50"
## 
## $numberB
## [1] "6"  "7"  "8"  "9"  "10"

Or the very common lapply():

lapply(dataDF,as.character)
## $numberA
## [1] "10" "20" "30" "40" "50"
## 
## $numberB
## [1] "6"  "7"  "8"  "9"  "10"

In both cases, the function worked with a data frame as an input; but the output was a list. If you want a data frame as output:

# selecting columns:
dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.character)

dataDF
##   numberA numberB
## 1      10       6
## 2      20       7
## 3      30       8
## 4      40       9
## 5      50      10

Then, to get the data frame using lapply, you need to specify the columns.

Let me turn the values back to numbers:

dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.numeric)

There are functions that could be applied to columns or rows and get a total from them. Keep in mind that lapply applies a function to columns.

# you are adding the column values here:
as.data.frame(lapply(dataDF,sum))
##   numberA numberB
## 1     150      40

If you need to apply a function by row or by column, the right option is apply:

# you are adding by row:
apply(dataDF,1,sum) # 1 to apply by row (2 for column).
## [1] 16 27 38 49 60

Let me use the data frame people to show tapply

tapply(X=people$ages,INDEX=list(people$education),FUN=mean)
##      Low   Medium     High 
## 28.88889 31.83333       NA

This function applies a function to X, organised by the variables in INDEX. Then:

tapply(X=people$ages,
       INDEX=list(people$education,people$state),
       FUN=mean)
##        CO NY VA   WA
## Low    33 26 28 28.5
## Medium 29 34 NA 31.0
## High   NA NA NA 29.5

The results are not data frames. You can get them using:

people  %>%  
    group_by( education, state )  %>%  
    summarize(mean(ages))  # summarize('means'=mean(ages))
## # A tibble: 9 x 3
## # Groups:   education [3]
##   education state `mean(ages)`
##   <ord>     <chr>        <dbl>
## 1 Low       CO            33  
## 2 Low       NY            26  
## 3 Low       VA            28  
## 4 Low       WA            28.5
## 5 Medium    CO            29  
## 6 Medium    NY            34  
## 7 Medium    WA            31  
## 8 High      NY            NA  
## 9 High      WA            29.5

Manipulating the data frame structure

I have two spreadsheets. One has information on the democracy index: