Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 4: Working with Data Frames

The data frame is a common data structure for people with experience with spreasheets or data tables. It is just that: a collection of rows with several columns. However, each software application has its own way of dealing with data frames.

In this session, we will learn the main strategies that we use to deal with data frames:

  1. Logical Operators.
  2. Control of Execution.
  3. Functions.

Logical Operators

Let me open a simple data frame from an Excel file:

library(rio)
fileName='people.xlsx'
people=import(fileName) # table '1'
people
##    cod ages state education
## 1   A1   27    NY         3
## 2   A2   27    WA         1
## 3   A3   25    WA         3
## 4   A4   31    CO         2
## 5   A5   28    CO         1
## 6   A6   37    NY         3
## 7   A7   37    NY         2
## 8   A8   25    NY         2
## 9   A9   38    CO         1
## 10 A10   28    VA         1
## 11 A11   31    WA         2
## 12 A12   34    WA         3
## 13 A13   27    NY         1
## 14 A14   27    CO         2
## 15 A15   28    NY         3
## 16 A16   28    WA         1
## 17 A17   27    WA         1
## 18 A18   32    WA         1
## 19 A19   40    NY         2
## 20 A20   25    NY         1

The first step is making sure of what data types you have:

str(people)
## 'data.frame':    20 obs. of  4 variables:
##  $ cod      : chr  "A1" "A2" "A3" "A4" ...
##  $ ages     : num  27 27 25 31 28 37 37 25 38 28 ...
##  $ state    : chr  "NY" "WA" "WA" "CO" ...
##  $ education: num  3 1 3 2 1 3 2 2 1 1 ...

Education is represented as a number, let me turn it into a category:

people$education=factor(people$education,
                        levels=c(1,2,3),
                        labels=c("Low","Medium","High"),
                        ordered = T)

The purpose of logical operatos can be divided into:

  • DETECTING COMPLIANCE

How many people are older that 30?

# count:
sum(people$ages>30)
## [1] 8
# count:
nrow(people[people$ages>30,])
## [1] 8
#count:
table(people$ages>30)
## 
## FALSE  TRUE 
##    12     8
# percent:
mean(people$ages>30)
## [1] 0.4
# percent:
nrow(people[people$ages>30,])/nrow(people)
## [1] 0.4
# percent:
prop.table(table(people$ages>30))
## 
## FALSE  TRUE 
##   0.6   0.4
  • SUBSETTING

Case 1: selecting rows

#People from WA

condition=c("WA")
people[people$state %in% condition,]
##    cod ages state education
## 2   A2   27    WA       Low
## 3   A3   25    WA      High
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 16 A16   28    WA       Low
## 17 A17   27    WA       Low
## 18 A18   32    WA       Low
#People from WA and NY

condition=c("WA","NY")
people[people$state %in% condition,]
##    cod ages state education
## 1   A1   27    NY      High
## 2   A2   27    WA       Low
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 13 A13   27    NY       Low
## 15 A15   28    NY      High
## 16 A16   28    WA       Low
## 17 A17   27    WA       Low
## 18 A18   32    WA       Low
## 19 A19   40    NY    Medium
## 20 A20   25    NY       Low
#top two highest levels

condition=c("Medium","High")
people[people$education %in% condition,]
##    cod ages state education
## 1   A1   27    NY      High
## 3   A3   25    WA      High
## 4   A4   31    CO    Medium
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
## 19 A19   40    NY    Medium
#top two highest levels

people[people$education >= "Medium",] # valid for ordinal factors
##    cod ages state education
## 1   A1   27    NY      High
## 3   A3   25    WA      High
## 4   A4   31    CO    Medium
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
## 19 A19   40    NY    Medium

The logical operator “>=” can not be used in text, or nominal variables, as they do not have order.

#People aged 37 OR 25

condition=c(37,25)
people[people$ages %in% condition,]
##    cod ages state education
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 20 A20   25    NY       Low
#People older than 35

condition=35
people[people$ages > condition,]
##    cod ages state education
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 9   A9   38    CO       Low
## 19 A19   40    NY    Medium
# the youngest person/people

condition=min(people$ages)
people[people$ages==condition,] 
##    cod ages state education
## 3   A3   25    WA      High
## 8   A8   25    NY    Medium
## 20 A20   25    NY       Low
# compare
people[which.min(people$ages),] 
##   cod ages state education
## 3  A3   25    WA      High

Multiple conditions bring the challenge of dealing with several data types:

# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,]
##    cod ages state education
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 9   A9   38    CO       Low
## 19 A19   40    NY    Medium
## 20 A20   25    NY       Low
# younger than 27 WITH education level Medium
people[people$ages<27 & people$education=="Medium",]
##   cod ages state education
## 8  A8   25    NY    Medium
# younger than 30 WITH education level 2 or 3
condition=c("Medium","High")
people[people$ages<30 & people$education %in% condition,]
##    cod ages state education
## 1   A1   27    NY      High
## 3   A3   25    WA      High
## 8   A8   25    NY    Medium
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
# older than 30 WITH education level 2 OR 3 FROM NY

people[ people$ages>30& 
           people$education %in% condition &
           people$state=='NY',]
##    cod ages state education
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 19 A19   40    NY    Medium

R has the filter() function which allows the selection of rows in an ‘english language’ approach:

library(dplyr)
filter( people, ages > 30 & 
            education %in% condition & 
            people$state=='NY'  )
##   cod ages state education
## 1  A6   37    NY      High
## 2  A7   37    NY    Medium
## 3 A19   40    NY    Medium

Case 2: selecting columns

This is how you can get some columns:

people[,c('cod','state')] # notice the comma position
##    cod state
## 1   A1    NY
## 2   A2    WA
## 3   A3    WA
## 4   A4    CO
## 5   A5    CO
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 10 A10    VA
## 11 A11    WA
## 12 A12    WA
## 13 A13    NY
## 14 A14    CO
## 15 A15    NY
## 16 A16    WA
## 17 A17    WA
## 18 A18    WA
## 19 A19    NY
## 20 A20    NY

This can be also done by using select (from dplyr)

select(people, c('cod','state'))
##    cod state
## 1   A1    NY
## 2   A2    WA
## 3   A3    WA
## 4   A4    CO
## 5   A5    CO
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 10 A10    VA
## 11 A11    WA
## 12 A12    WA
## 13 A13    NY
## 14 A14    CO
## 15 A15    NY
## 16 A16    WA
## 17 A17    WA
## 18 A18    WA
## 19 A19    NY
## 20 A20    NY

Or like this:

select(people, cod,state)
##    cod state
## 1   A1    NY
## 2   A2    WA
## 3   A3    WA
## 4   A4    CO
## 5   A5    CO
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 10 A10    VA
## 11 A11    WA
## 12 A12    WA
## 13 A13    NY
## 14 A14    CO
## 15 A15    NY
## 16 A16    WA
## 17 A17    WA
## 18 A18    WA
## 19 A19    NY
## 20 A20    NY

Of course, you can use conditions for rows and columns.

# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,c('cod','state')]
##    cod state
## 3   A3    WA
## 6   A6    NY
## 7   A7    NY
## 8   A8    NY
## 9   A9    CO
## 19 A19    NY
## 20 A20    NY

You can turn the previous command using the dplyr tools and the pip (%>%) operator:

people %>% filter(ages>35 | ages<27) %>% select(cod,state)
##   cod state
## 1  A3    WA
## 2  A6    NY
## 3  A7    NY
## 4  A8    NY
## 5  A9    CO
## 6 A19    NY
## 7 A20    NY

Note on missing values

It is always a good practice, when you know missing values are present, so that you can anticipate possible unexpected results.

# current value
people$ages[1]
## [1] 27
#turning that value to  missing:
people$ages[1]=NA

# then
people
##    cod ages state education
## 1   A1   NA    NY      High
## 2   A2   27    WA       Low
## 3   A3   25    WA      High
## 4   A4   31    CO    Medium
## 5   A5   28    CO       Low
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 8   A8   25    NY    Medium
## 9   A9   38    CO       Low
## 10 A10   28    VA       Low
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 13 A13   27    NY       Low
## 14 A14   27    CO    Medium
## 15 A15   28    NY      High
## 16 A16   28    WA       Low
## 17 A17   27    WA       Low
## 18 A18   32    WA       Low
## 19 A19   40    NY    Medium
## 20 A20   25    NY       Low

Let me query:

people[people$ages>27,]
##     cod ages state education
## NA <NA>   NA  <NA>      <NA>
## 4    A4   31    CO    Medium
## 5    A5   28    CO       Low
## 6    A6   37    NY      High
## 7    A7   37    NY    Medium
## 9    A9   38    CO       Low
## 10  A10   28    VA       Low
## 11  A11   31    WA    Medium
## 12  A12   34    WA      High
## 15  A15   28    NY      High
## 16  A16   28    WA       Low
## 18  A18   32    WA       Low
## 19  A19   40    NY    Medium

What is the row at the top?

Now,

people[people$ages<=27,]
##     cod ages state education
## NA <NA>   NA  <NA>      <NA>
## 2    A2   27    WA       Low
## 3    A3   25    WA      High
## 8    A8   25    NY    Medium
## 13  A13   27    NY       Low
## 14  A14   27    CO    Medium
## 17  A17   27    WA       Low
## 20  A20   25    NY       Low

Again the same row appears.

R gave you a row with all missing values. This happens when your query does not control the NA. You may want to use this to create your sub data frame:

people[people$ages>27 & !is.na(people$ages) ,]
##    cod ages state education
## 4   A4   31    CO    Medium
## 5   A5   28    CO       Low
## 6   A6   37    NY      High
## 7   A7   37    NY    Medium
## 9   A9   38    CO       Low
## 10 A10   28    VA       Low
## 11 A11   31    WA    Medium
## 12 A12   34    WA      High
## 15 A15   28    NY      High
## 16 A16   28    WA       Low
## 18 A18   32    WA       Low
## 19 A19   40    NY    Medium

It is different if the query does not refer to a column with missing values:

people[people$education=="High",]
##    cod ages state education
## 1   A1   NA    NY      High
## 3   A3   25    WA      High
## 6   A6   37    NY      High
## 12 A12   34    WA      High
## 15 A15   28    NY      High

Go to page beginning


Control of Execution

We will teach soon how to build functions, and for that you may want to have some knowledge on how to program in R. For that, you need to know how you can control the execution of code. This is done via:

  • Conditional execution
  • Repetitive execution

CONDITIONAL EXECUTION is how you tell the computer what part of a code to execute, depending if an event is true or false.

###### INPUT
value=-100

###### CONDITIONAL EXECUTION

if (value >= 0){ 
  # what to do if condition is TRUE
  rootValue=sqrt(value)
  print (rootValue)
} else {  
  # what to do if condition is FALSE
  print('Sorry, I do not compute square roots of negative numbers')
}
## [1] "Sorry, I do not compute square roots of negative numbers"

Notice the use of parenthesis in the condition. Also, the use of {} to enclose the commands. You do not need those curly braces if you have just one command after the condition. If you omitted the whole else{} section, the program will still run, but the program will not get a message when the input is invalid.

REPETITIVE EXECUTION is how you tell the computer to do something many times (and stop when it has to):

values=c(9,25,100)

for (value in values){ # do this for each value in values
  print(sqrt(value))
}
## [1] 3
## [1] 5
## [1] 10

You do not need to show each result, you could save the results.

values=c(9,25,100,500)
rootValues=c() # empty vector
for (value in values){
  rootValues=c(rootValues,sqrt(value)) # updating vector
}

Then, you see what you saved:

# to see the results:
rootValues
## [1]  3.00000  5.00000 10.00000 22.36068

It is evident that combining loops and control of execution, we can make better programs. For example, this code is not controlling well the process:

values=c(9,25,-100)
rootValues=c()
for (value in values){
  rootValues=c(rootValues,sqrt(value))
}
## Warning in sqrt(value): NaNs produced

Then,

# to see the results:
rootValues
## [1]   3   5 NaN

In the last result, R decided what to do when a negative value was input (it also sent a warning). This one does a better job:

values=c(9,25,-100,144,-72)
rootValuesNew=c()

for (value in values){
  if (value >=0){
    rootValuesNew=c(rootValuesNew,sqrt(value))
  }else {
    print('We added a missing value, negative input detected')
    rootValuesNew=c(rootValuesNew,NA)
  }
}
## [1] "We added a missing value, negative input detected"
## [1] "We added a missing value, negative input detected"

Then,

# to see the results:
rootValues
## [1]   3   5 NaN

We are producing an ouput with the same size as input. If we omit the else structure, we will produce an output with smaller size than the input.

You can also use break when you consider the execution should stop:

values=c(9,25,-100,144,-72)
rootValues=c()
for (value in values){
  if (value <0){
    print('We need to stop, invalid value detected')
    break
  }
  rootValues=c(rootValues,sqrt(value))
}
## [1] "We need to stop, invalid value detected"

The code above halted the program, but some results were saved:

rootValues
## [1] 3 5

You can use next when you consider the execution should continue:

values=list(9,NA,'1000',-100,144,-72)


for (value in values){
  if (is.na(value)){
    print('missing value as input')
    next
  }
  
  if (value <0){
    print('negative value as input')
    next
  }
  
  if (is.character(value)){
    print('char as input')
    next
  }
  rootVal=sqrt(value)
  print(paste(rootVal,'is the root of ',value))
}
## [1] "3 is the root of  9"
## [1] "missing value as input"
## [1] "char as input"
## [1] "negative value as input"
## [1] "12 is the root of  144"
## [1] "negative value as input"

Go to page beginning


Functions

We build functions to make the code more readable. Functions plus the data structures and control of execution capabilities you saw before will give you the basic tools you need to develop your own programs.

A function is a three-step process: Input, Transformation, Output. For example, if you need to convert a numeric value from Fahrenheit into Celsius , the input is the value in Fahrenheit, the transformation is the formula, and the output the result of the formula (a value in Celsius).

converterToCelsius=function(valueInFarenheit){ #input
  #transformation
  resultInCelsius= (valueInFarenheit-32)*5/9
  #output
  return (resultInCelsius)}
  • A function has a name (for example: converterToCelsius) to the left of the =;
  • then a definition of the input of the function using the reserved word function;
  • the transformation process between {};
  • the process output (for example: resultInCelsius), which requires the keyword return and ().

You can omit the return command and a function still works; in that case, R will output the last line that was executed; I avoid doing that as it impoverishes reading the code.

Above, we created a function, and after you run it, R has a new function available:

converterToCelsius(100)
## [1] 37.77778

The function input

We control the amount of input in a function:

# this function requires TWO inputs:
XsumY=function(valueX,valueY){
  ###
  resultSum=valueX+valueY
  ###
  return (resultSum)
}

The code above receives two values and outputs their sum. You can see how it works this way:

XsumY(3,10)
## [1] 13

You can have some inputs with default values:

riseToPower=function(base,exponent=2){
  #####
  result=1
  if (exponent > 0){
    for (time in 1:exponent){
      result=result*base
    }
  }
  #####
  return(result)
}

Since you have a default value in the input arguments, you decide if you give that input or not. Let’s see how it works:

# to the power 2 by default
riseToPower(9) 
## [1] 81
# to the power 3
riseToPower(9,3) 
## [1] 729
# using argument names does not require order:
riseToPower(exponent=0,base=9)
## [1] 1

The function output

Our output has been a single value, but it can be several ones; however, you need the right structure.

# one input, and several output in simple data structure:
factors=function(number){
    # empty vector that will collect output
    vectorOfAnswers=c()
    
    # for every value in the sequence...
    for (i in 1:number){
        
        #if the remainder of 'number'/'i' equals zero...
        if ((number %% i) == 0){ 
            
            # ...add 'i' to the vector of factors!
            vectorOfAnswers=c(vectorOfAnswers,i)
    }
  }
  return (vectorOfAnswers) # returning  the vector
}

Testing:

factors(20) 
## [1]  1  2  4  5 10 20

Applying functions to data structures

Imaging you have created a function that takes a value and multiplies it by two, like this:

double=function(x){
    return (2*x)
    }

and you have this vector:

myVector=c(1,2,3)

What will you get here?

double(myVector)
## [1] 2 4 6

If you use a vector as an input in a function in R, R will apply the function to each element.

If you use a list of numbers as input:

myList=list(1,2,3)

You get:

double(myList)

…you get an error. In this case, you can use Map or mapply:

# Map returns a list, and the input can be a vector or a list
Map(double,myList)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6

You can also use mapply:

# mapply returns a vector, and the input can be a vector or a list
mapply(double,myList)
## [1] 2 4 6

Notice that the outputs are returned in different data structures.

As we use data frames most of the time, pay attention on how you use a function and what you get as result:

numberA=c(10,20,30,40,50)
numberB=c(6,7,8,9,10)
dataDF=data.frame(numberA,numberB)
dataDF
##   numberA numberB
## 1      10       6
## 2      20       7
## 3      30       8
## 4      40       9
## 5      50      10

Let’s double each value applying the function directly to the data frame:

double(dataDF)
##   numberA numberB
## 1      20      12
## 2      40      14
## 3      60      16
## 4      80      18
## 5     100      20

As you saw above, the function double was designed to receive as input a simple value (a number). Then, without effort from your side, R itself decided to apply it to each element in the data frame.

Try now simple function as as.character() :

as.character(dataDF)
## [1] "c(10, 20, 30, 40, 50)" "c(6, 7, 8, 9, 10)"

I am pretty sure, this is not what you wanted.

You know that it works in one column (but not in a data frame):

as.character(dataDF$numberA)
## [1] "10" "20" "30" "40" "50"

We can use Map():

Map(as.character,dataDF)
## $numberA
## [1] "10" "20" "30" "40" "50"
## 
## $numberB
## [1] "6"  "7"  "8"  "9"  "10"

Or the very common lapply():

lapply(dataDF,as.character)
## $numberA
## [1] "10" "20" "30" "40" "50"
## 
## $numberB
## [1] "6"  "7"  "8"  "9"  "10"

In both cases, the function worked with a data frame as an input; but the output was a list. If you want a data frame as output:

# selecting columns:
dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.character)

dataDF
##   numberA numberB
## 1      10       6
## 2      20       7
## 3      30       8
## 4      40       9
## 5      50      10

Then, to get the data frame using lapply, you need to specify the columns.

Let me turn the values back to numbers:

dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.numeric)

There are functions that could be applied to columns or rows and get a total from them. Keep in mind that lapply applies a function to columns.

# you are adding the column values here:
as.data.frame(lapply(dataDF,sum))
##   numberA numberB
## 1     150      40

If you need to apply a function by row or by column, the right option is apply:

# you are adding by row:
apply(dataDF,1,sum) # 1 to apply by row (2 for column).
## [1] 16 27 38 49 60

Let me use the data frame people to show tapply

tapply(X=people$ages,INDEX=list(people$education),FUN=mean)
##      Low   Medium     High 
## 28.88889 31.83333       NA

This function applies a function to X, organised by the variables in INDEX. Then:

tapply(X=people$ages,
       INDEX=list(people$education,people$state),
       FUN=mean)
##        CO NY VA   WA
## Low    33 26 28 28.5
## Medium 29 34 NA 31.0
## High   NA NA NA 29.5

The results are not data frames. You can get them using:

people  %>%  
    group_by( education, state )  %>%  
    summarize(mean(ages))  # summarize('means'=mean(ages))
## # A tibble: 9 x 3
## # Groups:   education [3]
##   education state `mean(ages)`
##   <ord>     <chr>        <dbl>
## 1 Low       CO            33  
## 2 Low       NY            26  
## 3 Low       VA            28  
## 4 Low       WA            28.5
## 5 Medium    CO            29  
## 6 Medium    NY            34  
## 7 Medium    WA            31  
## 8 High      NY            NA  
## 9 High      WA            29.5

Manipulating the data frame structure

I have two spreadsheets. One has information on the democracy index:

The other one about the Human Development Index:

The first spreadsheet has three tables, Full, notFull and scoresALL; while the second has only one.

APPENDING is when you combine blocks of data that have a similar structure.

From the first spreadsheet, we can append the first two tables horizontally:

# read data

full=import("Democracy.xlsx",which = "Full")
notfull=import("Democracy.xlsx",which="notFull")

Then,

all=rbind(full,notfull)
str(all)
## 'data.frame':    167 obs. of  2 variables:
##  $ country: chr  "Norway" "Iceland" "Sweden" "New Zealand" ...
##  $ demType: chr  "Full democracy" "Full democracy" "Full democracy" "Full democracy" ...

We can add the last column, by doing a vertical appending:

# read in
scores=import("Democracy.xlsx",which="scoresALL")

First let’s see the first rows:

head(scores)
##       country demScore
## 1       Nepal     5.18
## 2 Switzerland     9.03
## 3 Afghanistan     2.97
## 4     Albania     5.98
## 5     Algeria     3.50
## 6      Angola     3.62
head(all)
##       country        demType
## 1      Norway Full democracy
## 2     Iceland Full democracy
## 3      Sweden Full democracy
## 4 New Zealand Full democracy
## 5     Denmark Full democracy
## 6     Ireland Full democracy

We can not append vertically if the size or order are different. In this case, the order is.

Then,

all=arrange(all, country) 
scores=arrange(scores, country) 

This has sorted BOTH data frames by the column country.

Now you can do:

demo=cbind(all,scores)

However…

head(demo,10)
##        country          demType     country demScore
## 1  Afghanistan    Authoritarian Afghanistan     2.97
## 2      Albania    Hybrid regime     Albania     5.98
## 3      Algeria    Authoritarian     Algeria     3.50
## 4       Angola    Authoritarian      Angola     3.62
## 5    Argentina Flawed democracy   Argentina     7.02
## 6      Armenia    Hybrid regime     Armenia     4.79
## 7    Australia   Full democracy   Australia     9.09
## 8      Austria   Full democracy     Austria     8.29
## 9   Azerbaijan    Authoritarian  Azerbaijan     2.65
## 10     Bahrain    Authoritarian     Bahrain     2.71

This appending produced a repeated column. Then,

demo=demo[,-c(3)] # without the third.

Now let me get the other data frame:

hdi=import("hdi.xlsx")
str(hdi)
## 'data.frame':    190 obs. of  2 variables:
##  $ COUNTRY: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ hdi    : num  0.498 0.785 0.754 0.858 0.581 0.78 0.825 0.755 0.939 0.908 ...

MERGING has the goal of integrating two data frames that have a common column, the key. Merging will not produced a repeated column, as cbind did.

Let’s do it:

demo_hdi=merge(demo,hdi,by.x = 'country', by.y = 'COUNTRY')
head(demo_hdi)
##       country          demType demScore   hdi
## 1 Afghanistan    Authoritarian     2.97 0.498
## 2     Albania    Hybrid regime     5.98 0.785
## 3     Algeria    Authoritarian     3.50 0.754
## 4      Angola    Authoritarian     3.62 0.581
## 5   Argentina Flawed democracy     7.02 0.825
## 6     Armenia    Hybrid regime     4.79 0.755

Notice some details:

  1. One of the key columns dissappeared, as it becomes redundant.

  2. You do not use cbind this time, because data comes from different sources, and that you are not sure the key columns are the same.

  3. The democray data had 167 rows, and the hdi 190. But the merge produces 161 rows. Those are the countries from demo that found a country with hdi in the second data frame.

From the last point, let’s try to keep all countries from both files:

ALLALL=merge(demo,hdi,by.x = 'country', by.y = 'COUNTRY',
             all.x = T, all.y = T)

This one has 196 countries. This means that there are countries in one data set that are not present in the other. These are the countries that are not present in the DEMOCRACY data frame:

ALLALL[!complete.cases(ALLALL$hdi),"country"]
## [1] "Democratic Republic of the Congo"  "North Korea"                      
## [3] "North Macedonia"                   "Republic of the Congo"            
## [5] "Taiwan Republic of China (Taiwan)" "Timor-Leste"

These are the countries that are not present in the HDI data frame:

ALLALL[!complete.cases(ALLALL$demScore),"country"]
##  [1] "Andorra"                          
##  [2] "Antigua and Barbuda"              
##  [3] "Bahamas"                          
##  [4] "Barbados"                         
##  [5] "Belize"                           
##  [6] "Brunei"                           
##  [7] "Congo, Democratic Republic of the"
##  [8] "Congo, Republic of the"           
##  [9] "Dominica"                         
## [10] "East Timor"                       
## [11] "Grenada"                          
## [12] "Kiribati"                         
## [13] "Liechtenstein"                    
## [14] "Macedonia"                        
## [15] "Maldives"                         
## [16] "Marshall Islands"                 
## [17] "Micronesia"                       
## [18] "Palau"                            
## [19] "Saint Kitts and Nevis"            
## [20] "Saint Lucia"                      
## [21] "Saint Vincent and the Grenadines" 
## [22] "Samoa"                            
## [23] "São Tomé and Príncipe"            
## [24] "Seychelles"                       
## [25] "Solomon Islands"                  
## [26] "South Sudan"                      
## [27] "Tonga"                            
## [28] "Vanuatu"                          
## [29] "World"

In both cases, you see that there are countries or territories that are not common. But also you see that the CONGO countries are present in both countries, but are written differently.

RESHAPING is the process of turning the data frame from wide format to long format, and viceversa. So far we have used only wide format.

head(people)
##   cod ages state education
## 1  A1   NA    NY      High
## 2  A2   27    WA       Low
## 3  A3   25    WA      High
## 4  A4   31    CO    Medium
## 5  A5   28    CO       Low
## 6  A6   37    NY      High

The long format for this data frame is:

library(reshape2)
people_L1=melt(people)
## Using cod, state, education as id variables
people_L1
##    cod state education variable value
## 1   A1    NY      High     ages    NA
## 2   A2    WA       Low     ages    27
## 3   A3    WA      High     ages    25
## 4   A4    CO    Medium     ages    31
## 5   A5    CO       Low     ages    28
## 6   A6    NY      High     ages    37
## 7   A7    NY    Medium     ages    37
## 8   A8    NY    Medium     ages    25
## 9   A9    CO       Low     ages    38
## 10 A10    VA       Low     ages    28
## 11 A11    WA    Medium     ages    31
## 12 A12    WA      High     ages    34
## 13 A13    NY       Low     ages    27
## 14 A14    CO    Medium     ages    27
## 15 A15    NY      High     ages    28
## 16 A16    WA       Low     ages    28
## 17 A17    WA       Low     ages    27
## 18 A18    WA       Low     ages    32
## 19 A19    NY    Medium     ages    40
## 20 A20    NY       Low     ages    25

The melt function assumed that the only variables were ages and agesGroup, as this were the only numeric ones. Then the column value is filled with numeric values, and the column named variable holds the names of the melted variables. The rest were considered identifiers:

This is how you separate the variables from the identifiers:

people_L2=melt(people, id=c('cod','state'))
## Warning: attributes are not identical across measure variables; they will
## be dropped
people_L2
##    cod state  variable  value
## 1   A1    NY      ages   <NA>
## 2   A2    WA      ages     27
## 3   A3    WA      ages     25
## 4   A4    CO      ages     31
## 5   A5    CO      ages     28
## 6   A6    NY      ages     37
## 7   A7    NY      ages     37
## 8   A8    NY      ages     25
## 9   A9    CO      ages     38
## 10 A10    VA      ages     28
## 11 A11    WA      ages     31
## 12 A12    WA      ages     34
## 13 A13    NY      ages     27
## 14 A14    CO      ages     27
## 15 A15    NY      ages     28
## 16 A16    WA      ages     28
## 17 A17    WA      ages     27
## 18 A18    WA      ages     32
## 19 A19    NY      ages     40
## 20 A20    NY      ages     25
## 21  A1    NY education   High
## 22  A2    WA education    Low
## 23  A3    WA education   High
## 24  A4    CO education Medium
## 25  A5    CO education    Low
## 26  A6    NY education   High
## 27  A7    NY education Medium
## 28  A8    NY education Medium
## 29  A9    CO education    Low
## 30 A10    VA education    Low
## 31 A11    WA education Medium
## 32 A12    WA education   High
## 33 A13    NY education    Low
## 34 A14    CO education Medium
## 35 A15    NY education   High
## 36 A16    WA education    Low
## 37 A17    WA education    Low
## 38 A18    WA education    Low
## 39 A19    NY education Medium
## 40 A20    NY education    Low

Notice that the last column has been formatted as character.

Once you have created a long format, you can turn it back into a wide format with dcast():

dcast(people_L1,cod + state + education~ variable)
##    cod state education ages
## 1   A1    NY      High   NA
## 2  A10    VA       Low   28
## 3  A11    WA    Medium   31
## 4  A12    WA      High   34
## 5  A13    NY       Low   27
## 6  A14    CO    Medium   27
## 7  A15    NY      High   28
## 8  A16    WA       Low   28
## 9  A17    WA       Low   27
## 10 A18    WA       Low   32
## 11 A19    NY    Medium   40
## 12  A2    WA       Low   27
## 13 A20    NY       Low   25
## 14  A3    WA      High   25
## 15  A4    CO    Medium   31
## 16  A5    CO       Low   28
## 17  A6    NY      High   37
## 18  A7    NY    Medium   37
## 19  A8    NY    Medium   25
## 20  A9    CO       Low   38

Notice this is the original data frame.

Now, for the second case:

people_L2=melt(people, id=c('cod','state'))
## Warning: attributes are not identical across measure variables; they will
## be dropped
dcast(people_L2,cod + state ~ variable)
##    cod state ages education
## 1   A1    NY <NA>      High
## 2  A10    VA   28       Low
## 3  A11    WA   31    Medium
## 4  A12    WA   34      High
## 5  A13    NY   27       Low
## 6  A14    CO   27    Medium
## 7  A15    NY   28      High
## 8  A16    WA   28       Low
## 9  A17    WA   27       Low
## 10 A18    WA   32       Low
## 11 A19    NY   40    Medium
## 12  A2    WA   27       Low
## 13 A20    NY   25       Low
## 14  A3    WA   25      High
## 15  A4    CO   31    Medium
## 16  A5    CO   28       Low
## 17  A6    NY   37      High
## 18  A7    NY   37    Medium
## 19  A8    NY   25    Medium
## 20  A9    CO   38       Low

Notice this look like the original data frame, but the data types are not the original ones.