The data frame is a common data structure for people with experience with spreasheets or data tables. It is just that: a collection of rows with several columns. However, each software application has its own way of dealing with data frames.
In this session, we will learn the main strategies that we use to deal with data frames:
Let me open a simple data frame from an Excel file:
library(rio)
fileName='people.xlsx'
people=import(fileName) # table '1'
people
## cod ages state education
## 1 A1 27 NY 3
## 2 A2 27 WA 1
## 3 A3 25 WA 3
## 4 A4 31 CO 2
## 5 A5 28 CO 1
## 6 A6 37 NY 3
## 7 A7 37 NY 2
## 8 A8 25 NY 2
## 9 A9 38 CO 1
## 10 A10 28 VA 1
## 11 A11 31 WA 2
## 12 A12 34 WA 3
## 13 A13 27 NY 1
## 14 A14 27 CO 2
## 15 A15 28 NY 3
## 16 A16 28 WA 1
## 17 A17 27 WA 1
## 18 A18 32 WA 1
## 19 A19 40 NY 2
## 20 A20 25 NY 1
The first step is making sure of what data types you have:
str(people)
## 'data.frame': 20 obs. of 4 variables:
## $ cod : chr "A1" "A2" "A3" "A4" ...
## $ ages : num 27 27 25 31 28 37 37 25 38 28 ...
## $ state : chr "NY" "WA" "WA" "CO" ...
## $ education: num 3 1 3 2 1 3 2 2 1 1 ...
Education is represented as a number, let me turn it into a category:
people$education=factor(people$education,
levels=c(1,2,3),
labels=c("Low","Medium","High"),
ordered = T)
The purpose of logical operatos can be divided into:
How many people are older that 30?
# count:
sum(people$ages>30)
## [1] 8
# count:
nrow(people[people$ages>30,])
## [1] 8
#count:
table(people$ages>30)
##
## FALSE TRUE
## 12 8
# percent:
mean(people$ages>30)
## [1] 0.4
# percent:
nrow(people[people$ages>30,])/nrow(people)
## [1] 0.4
# percent:
prop.table(table(people$ages>30))
##
## FALSE TRUE
## 0.6 0.4
Case 1: selecting rows
#People from WA
condition=c("WA")
people[people$state %in% condition,]
## cod ages state education
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 16 A16 28 WA Low
## 17 A17 27 WA Low
## 18 A18 32 WA Low
#People from WA and NY
condition=c("WA","NY")
people[people$state %in% condition,]
## cod ages state education
## 1 A1 27 NY High
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 13 A13 27 NY Low
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 17 A17 27 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
## 20 A20 25 NY Low
#top two highest levels
condition=c("Medium","High")
people[people$education %in% condition,]
## cod ages state education
## 1 A1 27 NY High
## 3 A3 25 WA High
## 4 A4 31 CO Medium
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 14 A14 27 CO Medium
## 15 A15 28 NY High
## 19 A19 40 NY Medium
#top two highest levels
people[people$education >= "Medium",] # valid for ordinal factors
## cod ages state education
## 1 A1 27 NY High
## 3 A3 25 WA High
## 4 A4 31 CO Medium
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 14 A14 27 CO Medium
## 15 A15 28 NY High
## 19 A19 40 NY Medium
The logical operator “>=” can not be used in text, or nominal variables, as they do not have order.
#People aged 37 OR 25
condition=c(37,25)
people[people$ages %in% condition,]
## cod ages state education
## 3 A3 25 WA High
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 20 A20 25 NY Low
#People older than 35
condition=35
people[people$ages > condition,]
## cod ages state education
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 9 A9 38 CO Low
## 19 A19 40 NY Medium
# the youngest person/people
condition=min(people$ages)
people[people$ages==condition,]
## cod ages state education
## 3 A3 25 WA High
## 8 A8 25 NY Medium
## 20 A20 25 NY Low
# compare
people[which.min(people$ages),]
## cod ages state education
## 3 A3 25 WA High
Multiple conditions bring the challenge of dealing with several data types:
# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,]
## cod ages state education
## 3 A3 25 WA High
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 9 A9 38 CO Low
## 19 A19 40 NY Medium
## 20 A20 25 NY Low
# younger than 27 WITH education level Medium
people[people$ages<27 & people$education=="Medium",]
## cod ages state education
## 8 A8 25 NY Medium
# younger than 30 WITH education level 2 or 3
condition=c("Medium","High")
people[people$ages<30 & people$education %in% condition,]
## cod ages state education
## 1 A1 27 NY High
## 3 A3 25 WA High
## 8 A8 25 NY Medium
## 14 A14 27 CO Medium
## 15 A15 28 NY High
# older than 30 WITH education level 2 OR 3 FROM NY
people[ people$ages>30&
people$education %in% condition &
people$state=='NY',]
## cod ages state education
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 19 A19 40 NY Medium
R has the filter() function which allows the selection of rows in an ‘english language’ approach:
library(dplyr)
filter( people, ages > 30 &
education %in% condition &
people$state=='NY' )
## cod ages state education
## 1 A6 37 NY High
## 2 A7 37 NY Medium
## 3 A19 40 NY Medium
Case 2: selecting columns
This is how you can get some columns:
people[,c('cod','state')] # notice the comma position
## cod state
## 1 A1 NY
## 2 A2 WA
## 3 A3 WA
## 4 A4 CO
## 5 A5 CO
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 10 A10 VA
## 11 A11 WA
## 12 A12 WA
## 13 A13 NY
## 14 A14 CO
## 15 A15 NY
## 16 A16 WA
## 17 A17 WA
## 18 A18 WA
## 19 A19 NY
## 20 A20 NY
This can be also done by using select (from dplyr)
select(people, c('cod','state'))
## cod state
## 1 A1 NY
## 2 A2 WA
## 3 A3 WA
## 4 A4 CO
## 5 A5 CO
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 10 A10 VA
## 11 A11 WA
## 12 A12 WA
## 13 A13 NY
## 14 A14 CO
## 15 A15 NY
## 16 A16 WA
## 17 A17 WA
## 18 A18 WA
## 19 A19 NY
## 20 A20 NY
Or like this:
select(people, cod,state)
## cod state
## 1 A1 NY
## 2 A2 WA
## 3 A3 WA
## 4 A4 CO
## 5 A5 CO
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 10 A10 VA
## 11 A11 WA
## 12 A12 WA
## 13 A13 NY
## 14 A14 CO
## 15 A15 NY
## 16 A16 WA
## 17 A17 WA
## 18 A18 WA
## 19 A19 NY
## 20 A20 NY
Of course, you can use conditions for rows and columns.
# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,c('cod','state')]
## cod state
## 3 A3 WA
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 19 A19 NY
## 20 A20 NY
You can turn the previous command using the dplyr tools and the pip (%>%) operator:
people %>% filter(ages>35 | ages<27) %>% select(cod,state)
## cod state
## 1 A3 WA
## 2 A6 NY
## 3 A7 NY
## 4 A8 NY
## 5 A9 CO
## 6 A19 NY
## 7 A20 NY
Note on missing values
It is always a good practice, when you know missing values are present, so that you can anticipate possible unexpected results.
# current value
people$ages[1]
## [1] 27
#turning that value to missing:
people$ages[1]=NA
# then
people
## cod ages state education
## 1 A1 NA NY High
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 4 A4 31 CO Medium
## 5 A5 28 CO Low
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 9 A9 38 CO Low
## 10 A10 28 VA Low
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 13 A13 27 NY Low
## 14 A14 27 CO Medium
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 17 A17 27 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
## 20 A20 25 NY Low
Let me query:
people[people$ages>27,]
## cod ages state education
## NA <NA> NA <NA> <NA>
## 4 A4 31 CO Medium
## 5 A5 28 CO Low
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 9 A9 38 CO Low
## 10 A10 28 VA Low
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
What is the row at the top?
Now,
people[people$ages<=27,]
## cod ages state education
## NA <NA> NA <NA> <NA>
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 8 A8 25 NY Medium
## 13 A13 27 NY Low
## 14 A14 27 CO Medium
## 17 A17 27 WA Low
## 20 A20 25 NY Low
Again the same row appears.
R gave you a row with all missing values. This happens when your query does not control the NA. You may want to use this to create your sub data frame:
people[people$ages>27 & !is.na(people$ages) ,]
## cod ages state education
## 4 A4 31 CO Medium
## 5 A5 28 CO Low
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 9 A9 38 CO Low
## 10 A10 28 VA Low
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
It is different if the query does not refer to a column with missing values:
people[people$education=="High",]
## cod ages state education
## 1 A1 NA NY High
## 3 A3 25 WA High
## 6 A6 37 NY High
## 12 A12 34 WA High
## 15 A15 28 NY High
We will teach soon how to build functions, and for that you may want to have some knowledge on how to program in R. For that, you need to know how you can control the execution of code. This is done via:
CONDITIONAL EXECUTION is how you tell the computer what part of a code to execute, depending if an event is true or false.
###### INPUT
value=-100
###### CONDITIONAL EXECUTION
if (value >= 0){
# what to do if condition is TRUE
rootValue=sqrt(value)
print (rootValue)
} else {
# what to do if condition is FALSE
print('Sorry, I do not compute square roots of negative numbers')
}
## [1] "Sorry, I do not compute square roots of negative numbers"
Notice the use of parenthesis in the condition. Also, the use of {} to enclose the commands. You do not need those curly braces if you have just one command after the condition. If you omitted the whole else{} section, the program will still run, but the program will not get a message when the input is invalid.
REPETITIVE EXECUTION is how you tell the computer to do something many times (and stop when it has to):
values=c(9,25,100)
for (value in values){ # do this for each value in values
print(sqrt(value))
}
## [1] 3
## [1] 5
## [1] 10
You do not need to show each result, you could save the results.
values=c(9,25,100,500)
rootValues=c() # empty vector
for (value in values){
rootValues=c(rootValues,sqrt(value)) # updating vector
}
Then, you see what you saved:
# to see the results:
rootValues
## [1] 3.00000 5.00000 10.00000 22.36068
It is evident that combining loops and control of execution, we can make better programs. For example, this code is not controlling well the process:
values=c(9,25,-100)
rootValues=c()
for (value in values){
rootValues=c(rootValues,sqrt(value))
}
## Warning in sqrt(value): NaNs produced
Then,
# to see the results:
rootValues
## [1] 3 5 NaN
In the last result, R decided what to do when a negative value was input (it also sent a warning). This one does a better job:
values=c(9,25,-100,144,-72)
rootValuesNew=c()
for (value in values){
if (value >=0){
rootValuesNew=c(rootValuesNew,sqrt(value))
}else {
print('We added a missing value, negative input detected')
rootValuesNew=c(rootValuesNew,NA)
}
}
## [1] "We added a missing value, negative input detected"
## [1] "We added a missing value, negative input detected"
Then,
# to see the results:
rootValues
## [1] 3 5 NaN
We are producing an ouput with the same size as input. If we omit the else structure, we will produce an output with smaller size than the input.
You can also use break when you consider the execution should stop:
values=c(9,25,-100,144,-72)
rootValues=c()
for (value in values){
if (value <0){
print('We need to stop, invalid value detected')
break
}
rootValues=c(rootValues,sqrt(value))
}
## [1] "We need to stop, invalid value detected"
The code above halted the program, but some results were saved:
rootValues
## [1] 3 5
You can use next when you consider the execution should continue:
values=list(9,NA,'1000',-100,144,-72)
for (value in values){
if (is.na(value)){
print('missing value as input')
next
}
if (value <0){
print('negative value as input')
next
}
if (is.character(value)){
print('char as input')
next
}
rootVal=sqrt(value)
print(paste(rootVal,'is the root of ',value))
}
## [1] "3 is the root of 9"
## [1] "missing value as input"
## [1] "char as input"
## [1] "negative value as input"
## [1] "12 is the root of 144"
## [1] "negative value as input"
We build functions to make the code more readable. Functions plus the data structures and control of execution capabilities you saw before will give you the basic tools you need to develop your own programs.
A function is a three-step process: Input, Transformation, Output. For example, if you need to convert a numeric value from Fahrenheit into Celsius , the input is the value in Fahrenheit, the transformation is the formula, and the output the result of the formula (a value in Celsius).
converterToCelsius=function(valueInFarenheit){ #input
#transformation
resultInCelsius= (valueInFarenheit-32)*5/9
#output
return (resultInCelsius)}
You can omit the return command and a function still works; in that case, R will output the last line that was executed; I avoid doing that as it impoverishes reading the code.
Above, we created a function, and after you run it, R has a new function available:
converterToCelsius(100)
## [1] 37.77778
We control the amount of input in a function:
# this function requires TWO inputs:
XsumY=function(valueX,valueY){
###
resultSum=valueX+valueY
###
return (resultSum)
}
The code above receives two values and outputs their sum. You can see how it works this way:
XsumY(3,10)
## [1] 13
You can have some inputs with default values:
riseToPower=function(base,exponent=2){
#####
result=1
if (exponent > 0){
for (time in 1:exponent){
result=result*base
}
}
#####
return(result)
}
Since you have a default value in the input arguments, you decide if you give that input or not. Let’s see how it works:
# to the power 2 by default
riseToPower(9)
## [1] 81
# to the power 3
riseToPower(9,3)
## [1] 729
# using argument names does not require order:
riseToPower(exponent=0,base=9)
## [1] 1
Our output has been a single value, but it can be several ones; however, you need the right structure.
# one input, and several output in simple data structure:
factors=function(number){
# empty vector that will collect output
vectorOfAnswers=c()
# for every value in the sequence...
for (i in 1:number){
#if the remainder of 'number'/'i' equals zero...
if ((number %% i) == 0){
# ...add 'i' to the vector of factors!
vectorOfAnswers=c(vectorOfAnswers,i)
}
}
return (vectorOfAnswers) # returning the vector
}
Testing:
factors(20)
## [1] 1 2 4 5 10 20
Imaging you have created a function that takes a value and multiplies it by two, like this:
double=function(x){
return (2*x)
}
and you have this vector:
myVector=c(1,2,3)
What will you get here?
double(myVector)
## [1] 2 4 6
If you use a vector as an input in a function in R, R will apply the function to each element.
If you use a list of numbers as input:
myList=list(1,2,3)
You get:
double(myList)
…you get an error. In this case, you can use Map or mapply:
# Map returns a list, and the input can be a vector or a list
Map(double,myList)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
You can also use mapply:
# mapply returns a vector, and the input can be a vector or a list
mapply(double,myList)
## [1] 2 4 6
Notice that the outputs are returned in different data structures.
As we use data frames most of the time, pay attention on how you use a function and what you get as result:
numberA=c(10,20,30,40,50)
numberB=c(6,7,8,9,10)
dataDF=data.frame(numberA,numberB)
dataDF
## numberA numberB
## 1 10 6
## 2 20 7
## 3 30 8
## 4 40 9
## 5 50 10
Let’s double each value applying the function directly to the data frame:
double(dataDF)
## numberA numberB
## 1 20 12
## 2 40 14
## 3 60 16
## 4 80 18
## 5 100 20
As you saw above, the function double was designed to receive as input a simple value (a number). Then, without effort from your side, R itself decided to apply it to each element in the data frame.
Try now simple function as as.character() :
as.character(dataDF)
## [1] "c(10, 20, 30, 40, 50)" "c(6, 7, 8, 9, 10)"
I am pretty sure, this is not what you wanted.
You know that it works in one column (but not in a data frame):
as.character(dataDF$numberA)
## [1] "10" "20" "30" "40" "50"
We can use Map():
Map(as.character,dataDF)
## $numberA
## [1] "10" "20" "30" "40" "50"
##
## $numberB
## [1] "6" "7" "8" "9" "10"
Or the very common lapply():
lapply(dataDF,as.character)
## $numberA
## [1] "10" "20" "30" "40" "50"
##
## $numberB
## [1] "6" "7" "8" "9" "10"
In both cases, the function worked with a data frame as an input; but the output was a list. If you want a data frame as output:
# selecting columns:
dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.character)
dataDF
## numberA numberB
## 1 10 6
## 2 20 7
## 3 30 8
## 4 40 9
## 5 50 10
Then, to get the data frame using lapply, you need to specify the columns.
Let me turn the values back to numbers:
dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.numeric)
There are functions that could be applied to columns or rows and get a total from them. Keep in mind that lapply applies a function to columns.
# you are adding the column values here:
as.data.frame(lapply(dataDF,sum))
## numberA numberB
## 1 150 40
If you need to apply a function by row or by column, the right option is apply:
# you are adding by row:
apply(dataDF,1,sum) # 1 to apply by row (2 for column).
## [1] 16 27 38 49 60
Let me use the data frame people to show tapply
tapply(X=people$ages,INDEX=list(people$education),FUN=mean)
## Low Medium High
## 28.88889 31.83333 NA
This function applies a function to X, organised by the variables in INDEX. Then:
tapply(X=people$ages,
INDEX=list(people$education,people$state),
FUN=mean)
## CO NY VA WA
## Low 33 26 28 28.5
## Medium 29 34 NA 31.0
## High NA NA NA 29.5
The results are not data frames. You can get them using:
people %>%
group_by( education, state ) %>%
summarize(mean(ages)) # summarize('means'=mean(ages))
## # A tibble: 9 x 3
## # Groups: education [3]
## education state `mean(ages)`
## <ord> <chr> <dbl>
## 1 Low CO 33
## 2 Low NY 26
## 3 Low VA 28
## 4 Low WA 28.5
## 5 Medium CO 29
## 6 Medium NY 34
## 7 Medium WA 31
## 8 High NY NA
## 9 High WA 29.5
I have two spreadsheets. One has information on the democracy index:
The other one about the Human Development Index:
The first spreadsheet has three tables, Full, notFull and scoresALL; while the second has only one.
APPENDING is when you combine blocks of data that have a similar structure.
From the first spreadsheet, we can append the first two tables horizontally:
# read data
full=import("Democracy.xlsx",which = "Full")
notfull=import("Democracy.xlsx",which="notFull")
Then,
all=rbind(full,notfull)
str(all)
## 'data.frame': 167 obs. of 2 variables:
## $ country: chr "Norway" "Iceland" "Sweden" "New Zealand" ...
## $ demType: chr "Full democracy" "Full democracy" "Full democracy" "Full democracy" ...
We can add the last column, by doing a vertical appending:
# read in
scores=import("Democracy.xlsx",which="scoresALL")
First let’s see the first rows:
head(scores)
## country demScore
## 1 Nepal 5.18
## 2 Switzerland 9.03
## 3 Afghanistan 2.97
## 4 Albania 5.98
## 5 Algeria 3.50
## 6 Angola 3.62
head(all)
## country demType
## 1 Norway Full democracy
## 2 Iceland Full democracy
## 3 Sweden Full democracy
## 4 New Zealand Full democracy
## 5 Denmark Full democracy
## 6 Ireland Full democracy
We can not append vertically if the size or order are different. In this case, the order is.
Then,
all=arrange(all, country)
scores=arrange(scores, country)
This has sorted BOTH data frames by the column country.
Now you can do:
demo=cbind(all,scores)
However…
head(demo,10)
## country demType country demScore
## 1 Afghanistan Authoritarian Afghanistan 2.97
## 2 Albania Hybrid regime Albania 5.98
## 3 Algeria Authoritarian Algeria 3.50
## 4 Angola Authoritarian Angola 3.62
## 5 Argentina Flawed democracy Argentina 7.02
## 6 Armenia Hybrid regime Armenia 4.79
## 7 Australia Full democracy Australia 9.09
## 8 Austria Full democracy Austria 8.29
## 9 Azerbaijan Authoritarian Azerbaijan 2.65
## 10 Bahrain Authoritarian Bahrain 2.71
This appending produced a repeated column. Then,
demo=demo[,-c(3)] # without the third.
Now let me get the other data frame:
hdi=import("hdi.xlsx")
str(hdi)
## 'data.frame': 190 obs. of 2 variables:
## $ COUNTRY: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ hdi : num 0.498 0.785 0.754 0.858 0.581 0.78 0.825 0.755 0.939 0.908 ...
MERGING has the goal of integrating two data frames that have a common column, the key. Merging will not produced a repeated column, as cbind did.
Let’s do it:
demo_hdi=merge(demo,hdi,by.x = 'country', by.y = 'COUNTRY')
head(demo_hdi)
## country demType demScore hdi
## 1 Afghanistan Authoritarian 2.97 0.498
## 2 Albania Hybrid regime 5.98 0.785
## 3 Algeria Authoritarian 3.50 0.754
## 4 Angola Authoritarian 3.62 0.581
## 5 Argentina Flawed democracy 7.02 0.825
## 6 Armenia Hybrid regime 4.79 0.755
Notice some details:
One of the key columns dissappeared, as it becomes redundant.
You do not use cbind this time, because data comes from different sources, and that you are not sure the key columns are the same.
The democray data had 167 rows, and the hdi 190. But the merge produces 161 rows. Those are the countries from demo that found a country with hdi in the second data frame.
From the last point, let’s try to keep all countries from both files:
ALLALL=merge(demo,hdi,by.x = 'country', by.y = 'COUNTRY',
all.x = T, all.y = T)
This one has 196 countries. This means that there are countries in one data set that are not present in the other. These are the countries that are not present in the DEMOCRACY data frame:
ALLALL[!complete.cases(ALLALL$hdi),"country"]
## [1] "Democratic Republic of the Congo" "North Korea"
## [3] "North Macedonia" "Republic of the Congo"
## [5] "Taiwan Republic of China (Taiwan)" "Timor-Leste"
These are the countries that are not present in the HDI data frame:
ALLALL[!complete.cases(ALLALL$demScore),"country"]
## [1] "Andorra"
## [2] "Antigua and Barbuda"
## [3] "Bahamas"
## [4] "Barbados"
## [5] "Belize"
## [6] "Brunei"
## [7] "Congo, Democratic Republic of the"
## [8] "Congo, Republic of the"
## [9] "Dominica"
## [10] "East Timor"
## [11] "Grenada"
## [12] "Kiribati"
## [13] "Liechtenstein"
## [14] "Macedonia"
## [15] "Maldives"
## [16] "Marshall Islands"
## [17] "Micronesia"
## [18] "Palau"
## [19] "Saint Kitts and Nevis"
## [20] "Saint Lucia"
## [21] "Saint Vincent and the Grenadines"
## [22] "Samoa"
## [23] "São Tomé and Príncipe"
## [24] "Seychelles"
## [25] "Solomon Islands"
## [26] "South Sudan"
## [27] "Tonga"
## [28] "Vanuatu"
## [29] "World"
In both cases, you see that there are countries or territories that are not common. But also you see that the CONGO countries are present in both countries, but are written differently.
RESHAPING is the process of turning the data frame from wide format to long format, and viceversa. So far we have used only wide format.
head(people)
## cod ages state education
## 1 A1 NA NY High
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 4 A4 31 CO Medium
## 5 A5 28 CO Low
## 6 A6 37 NY High
The long format for this data frame is:
library(reshape2)
people_L1=melt(people)
## Using cod, state, education as id variables
people_L1
## cod state education variable value
## 1 A1 NY High ages NA
## 2 A2 WA Low ages 27
## 3 A3 WA High ages 25
## 4 A4 CO Medium ages 31
## 5 A5 CO Low ages 28
## 6 A6 NY High ages 37
## 7 A7 NY Medium ages 37
## 8 A8 NY Medium ages 25
## 9 A9 CO Low ages 38
## 10 A10 VA Low ages 28
## 11 A11 WA Medium ages 31
## 12 A12 WA High ages 34
## 13 A13 NY Low ages 27
## 14 A14 CO Medium ages 27
## 15 A15 NY High ages 28
## 16 A16 WA Low ages 28
## 17 A17 WA Low ages 27
## 18 A18 WA Low ages 32
## 19 A19 NY Medium ages 40
## 20 A20 NY Low ages 25
The melt function assumed that the only variables were ages and agesGroup, as this were the only numeric ones. Then the column value is filled with numeric values, and the column named variable holds the names of the melted variables. The rest were considered identifiers:
This is how you separate the variables from the identifiers:
people_L2=melt(people, id=c('cod','state'))
## Warning: attributes are not identical across measure variables; they will
## be dropped
people_L2
## cod state variable value
## 1 A1 NY ages <NA>
## 2 A2 WA ages 27
## 3 A3 WA ages 25
## 4 A4 CO ages 31
## 5 A5 CO ages 28
## 6 A6 NY ages 37
## 7 A7 NY ages 37
## 8 A8 NY ages 25
## 9 A9 CO ages 38
## 10 A10 VA ages 28
## 11 A11 WA ages 31
## 12 A12 WA ages 34
## 13 A13 NY ages 27
## 14 A14 CO ages 27
## 15 A15 NY ages 28
## 16 A16 WA ages 28
## 17 A17 WA ages 27
## 18 A18 WA ages 32
## 19 A19 NY ages 40
## 20 A20 NY ages 25
## 21 A1 NY education High
## 22 A2 WA education Low
## 23 A3 WA education High
## 24 A4 CO education Medium
## 25 A5 CO education Low
## 26 A6 NY education High
## 27 A7 NY education Medium
## 28 A8 NY education Medium
## 29 A9 CO education Low
## 30 A10 VA education Low
## 31 A11 WA education Medium
## 32 A12 WA education High
## 33 A13 NY education Low
## 34 A14 CO education Medium
## 35 A15 NY education High
## 36 A16 WA education Low
## 37 A17 WA education Low
## 38 A18 WA education Low
## 39 A19 NY education Medium
## 40 A20 NY education Low
Notice that the last column has been formatted as character.
Once you have created a long format, you can turn it back into a wide format with dcast():
dcast(people_L1,cod + state + education~ variable)
## cod state education ages
## 1 A1 NY High NA
## 2 A10 VA Low 28
## 3 A11 WA Medium 31
## 4 A12 WA High 34
## 5 A13 NY Low 27
## 6 A14 CO Medium 27
## 7 A15 NY High 28
## 8 A16 WA Low 28
## 9 A17 WA Low 27
## 10 A18 WA Low 32
## 11 A19 NY Medium 40
## 12 A2 WA Low 27
## 13 A20 NY Low 25
## 14 A3 WA High 25
## 15 A4 CO Medium 31
## 16 A5 CO Low 28
## 17 A6 NY High 37
## 18 A7 NY Medium 37
## 19 A8 NY Medium 25
## 20 A9 CO Low 38
Notice this is the original data frame.
Now, for the second case:
people_L2=melt(people, id=c('cod','state'))
## Warning: attributes are not identical across measure variables; they will
## be dropped
dcast(people_L2,cod + state ~ variable)
## cod state ages education
## 1 A1 NY <NA> High
## 2 A10 VA 28 Low
## 3 A11 WA 31 Medium
## 4 A12 WA 34 High
## 5 A13 NY 27 Low
## 6 A14 CO 27 Medium
## 7 A15 NY 28 High
## 8 A16 WA 28 Low
## 9 A17 WA 27 Low
## 10 A18 WA 32 Low
## 11 A19 NY 40 Medium
## 12 A2 WA 27 Low
## 13 A20 NY 25 Low
## 14 A3 WA 25 High
## 15 A4 CO 31 Medium
## 16 A5 CO 28 Low
## 17 A6 NY 37 High
## 18 A7 NY 37 Medium
## 19 A8 NY 25 Medium
## 20 A9 CO 38 Low
Notice this look like the original data frame, but the data types are not the original ones.