The data frame is a common data structure for people with experience with spreasheets or data tables. It is just that: a collection of rows with several columns. However, each software application has its own way of dealing with data frames.

In this session, we will learn the main strategies that we use to deal with data frames:

Let me open a simple data frame from an Excel file:

```
library(rio)
fileName='people.xlsx'
people=import(fileName) # table '1'
people
```

```
## cod ages state education
## 1 A1 27 NY 3
## 2 A2 27 WA 1
## 3 A3 25 WA 3
## 4 A4 31 CO 2
## 5 A5 28 CO 1
## 6 A6 37 NY 3
## 7 A7 37 NY 2
## 8 A8 25 NY 2
## 9 A9 38 CO 1
## 10 A10 28 VA 1
## 11 A11 31 WA 2
## 12 A12 34 WA 3
## 13 A13 27 NY 1
## 14 A14 27 CO 2
## 15 A15 28 NY 3
## 16 A16 28 WA 1
## 17 A17 27 WA 1
## 18 A18 32 WA 1
## 19 A19 40 NY 2
## 20 A20 25 NY 1
```

The first step is making sure of what data types you have:

`str(people)`

```
## 'data.frame': 20 obs. of 4 variables:
## $ cod : chr "A1" "A2" "A3" "A4" ...
## $ ages : num 27 27 25 31 28 37 37 25 38 28 ...
## $ state : chr "NY" "WA" "WA" "CO" ...
## $ education: num 3 1 3 2 1 3 2 2 1 1 ...
```

Education is represented as a number, let me turn it into a category:

```
people$education=factor(people$education,
levels=c(1,2,3),
labels=c("Low","Medium","High"),
ordered = T)
```

The purpose of logical operatos can be divided into:

- DETECTING COMPLIANCE

**How many people are older that 30?**

```
# count:
sum(people$ages>30)
```

`## [1] 8`

```
# count:
nrow(people[people$ages>30,])
```

`## [1] 8`

```
#count:
table(people$ages>30)
```

```
##
## FALSE TRUE
## 12 8
```

```
# percent:
mean(people$ages>30)
```

`## [1] 0.4`

```
# percent:
nrow(people[people$ages>30,])/nrow(people)
```

`## [1] 0.4`

```
# percent:
prop.table(table(people$ages>30))
```

```
##
## FALSE TRUE
## 0.6 0.4
```

- SUBSETTING

**Case 1: selecting rows**

```
#People from WA
condition=c("WA")
people[people$state %in% condition,]
```

```
## cod ages state education
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 16 A16 28 WA Low
## 17 A17 27 WA Low
## 18 A18 32 WA Low
```

```
#People from WA and NY
condition=c("WA","NY")
people[people$state %in% condition,]
```

```
## cod ages state education
## 1 A1 27 NY High
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 13 A13 27 NY Low
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 17 A17 27 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
## 20 A20 25 NY Low
```

```
#top two highest levels
condition=c("Medium","High")
people[people$education %in% condition,]
```

```
## cod ages state education
## 1 A1 27 NY High
## 3 A3 25 WA High
## 4 A4 31 CO Medium
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 14 A14 27 CO Medium
## 15 A15 28 NY High
## 19 A19 40 NY Medium
```

```
#top two highest levels
people[people$education >= "Medium",] # valid for ordinal factors
```

```
## cod ages state education
## 1 A1 27 NY High
## 3 A3 25 WA High
## 4 A4 31 CO Medium
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 14 A14 27 CO Medium
## 15 A15 28 NY High
## 19 A19 40 NY Medium
```

The logical operator “>=” can not be used in text, or nominal variables, as they do not have order.

```
#People aged 37 OR 25
condition=c(37,25)
people[people$ages %in% condition,]
```

```
## cod ages state education
## 3 A3 25 WA High
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 20 A20 25 NY Low
```

```
#People older than 35
condition=35
people[people$ages > condition,]
```

```
## cod ages state education
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 9 A9 38 CO Low
## 19 A19 40 NY Medium
```

```
# the youngest person/people
condition=min(people$ages)
people[people$ages==condition,]
```

```
## cod ages state education
## 3 A3 25 WA High
## 8 A8 25 NY Medium
## 20 A20 25 NY Low
```

```
# compare
people[which.min(people$ages),]
```

```
## cod ages state education
## 3 A3 25 WA High
```

**Multiple conditions** bring the challenge of dealing with several data types:

```
# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,]
```

```
## cod ages state education
## 3 A3 25 WA High
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 9 A9 38 CO Low
## 19 A19 40 NY Medium
## 20 A20 25 NY Low
```

```
# younger than 27 WITH education level Medium
people[people$ages<27 & people$education=="Medium",]
```

```
## cod ages state education
## 8 A8 25 NY Medium
```

```
# younger than 30 WITH education level 2 or 3
condition=c("Medium","High")
people[people$ages<30 & people$education %in% condition,]
```

```
## cod ages state education
## 1 A1 27 NY High
## 3 A3 25 WA High
## 8 A8 25 NY Medium
## 14 A14 27 CO Medium
## 15 A15 28 NY High
```

```
# older than 30 WITH education level 2 OR 3 FROM NY
people[ people$ages>30&
people$education %in% condition &
people$state=='NY',]
```

```
## cod ages state education
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 19 A19 40 NY Medium
```

R has the **filter()** function which allows the selection of rows in an ‘english language’ approach:

```
library(dplyr)
filter( people, ages > 30 &
education %in% condition &
people$state=='NY' )
```

```
## cod ages state education
## 1 A6 37 NY High
## 2 A7 37 NY Medium
## 3 A19 40 NY Medium
```

**Case 2: selecting columns**

This is how you can get some columns:

`people[,c('cod','state')] # notice the comma position`

```
## cod state
## 1 A1 NY
## 2 A2 WA
## 3 A3 WA
## 4 A4 CO
## 5 A5 CO
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 10 A10 VA
## 11 A11 WA
## 12 A12 WA
## 13 A13 NY
## 14 A14 CO
## 15 A15 NY
## 16 A16 WA
## 17 A17 WA
## 18 A18 WA
## 19 A19 NY
## 20 A20 NY
```

This can be also done by using **select** (from dplyr)

`select(people, c('cod','state'))`

```
## cod state
## 1 A1 NY
## 2 A2 WA
## 3 A3 WA
## 4 A4 CO
## 5 A5 CO
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 10 A10 VA
## 11 A11 WA
## 12 A12 WA
## 13 A13 NY
## 14 A14 CO
## 15 A15 NY
## 16 A16 WA
## 17 A17 WA
## 18 A18 WA
## 19 A19 NY
## 20 A20 NY
```

Or like this:

`select(people, cod,state)`

```
## cod state
## 1 A1 NY
## 2 A2 WA
## 3 A3 WA
## 4 A4 CO
## 5 A5 CO
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 10 A10 VA
## 11 A11 WA
## 12 A12 WA
## 13 A13 NY
## 14 A14 CO
## 15 A15 NY
## 16 A16 WA
## 17 A17 WA
## 18 A18 WA
## 19 A19 NY
## 20 A20 NY
```

Of course, you can use conditions for rows and columns.

```
# older than 35 OR younger than 27
people[people$ages>35 | people$ages<27,c('cod','state')]
```

```
## cod state
## 3 A3 WA
## 6 A6 NY
## 7 A7 NY
## 8 A8 NY
## 9 A9 CO
## 19 A19 NY
## 20 A20 NY
```

You can turn the previous command using the dplyr tools and the **pip** (%>%) operator:

`people %>% filter(ages>35 | ages<27) %>% select(cod,state)`

```
## cod state
## 1 A3 WA
## 2 A6 NY
## 3 A7 NY
## 4 A8 NY
## 5 A9 CO
## 6 A19 NY
## 7 A20 NY
```

**Note on missing values**

It is always a good practice, when you know missing values are present, so that you can anticipate possible unexpected results.

```
# current value
people$ages[1]
```

`## [1] 27`

```
#turning that value to missing:
people$ages[1]=NA
# then
people
```

```
## cod ages state education
## 1 A1 NA NY High
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 4 A4 31 CO Medium
## 5 A5 28 CO Low
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 8 A8 25 NY Medium
## 9 A9 38 CO Low
## 10 A10 28 VA Low
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 13 A13 27 NY Low
## 14 A14 27 CO Medium
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 17 A17 27 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
## 20 A20 25 NY Low
```

Let me query:

`people[people$ages>27,]`

```
## cod ages state education
## NA <NA> NA <NA> <NA>
## 4 A4 31 CO Medium
## 5 A5 28 CO Low
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 9 A9 38 CO Low
## 10 A10 28 VA Low
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
```

What is the row at the top?

Now,

`people[people$ages<=27,]`

```
## cod ages state education
## NA <NA> NA <NA> <NA>
## 2 A2 27 WA Low
## 3 A3 25 WA High
## 8 A8 25 NY Medium
## 13 A13 27 NY Low
## 14 A14 27 CO Medium
## 17 A17 27 WA Low
## 20 A20 25 NY Low
```

Again the same row appears.

R gave you a row with all missing values. This happens when your query does not control the *NA*. You may want to use this to create your sub data frame:

`people[people$ages>27 & !is.na(people$ages) ,]`

```
## cod ages state education
## 4 A4 31 CO Medium
## 5 A5 28 CO Low
## 6 A6 37 NY High
## 7 A7 37 NY Medium
## 9 A9 38 CO Low
## 10 A10 28 VA Low
## 11 A11 31 WA Medium
## 12 A12 34 WA High
## 15 A15 28 NY High
## 16 A16 28 WA Low
## 18 A18 32 WA Low
## 19 A19 40 NY Medium
```

It is different if the query does not refer to a column with missing values:

`people[people$education=="High",]`

```
## cod ages state education
## 1 A1 NA NY High
## 3 A3 25 WA High
## 6 A6 37 NY High
## 12 A12 34 WA High
## 15 A15 28 NY High
```

We will teach soon how to build functions, and for that you may want to have some knowledge on how to program in R. For that, you need to know how you can control the execution of code. This is done via:

- Conditional execution
- Repetitive execution

**CONDITIONAL EXECUTION** is how you tell the computer what part of a code to execute, depending if an event is true or false.

```
###### INPUT
value=-100
###### CONDITIONAL EXECUTION
if (value >= 0){
# what to do if condition is TRUE
rootValue=sqrt(value)
print (rootValue)
} else {
# what to do if condition is FALSE
print('Sorry, I do not compute square roots of negative numbers')
}
```

`## [1] "Sorry, I do not compute square roots of negative numbers"`

Notice the use of **parenthesis** in the condition. Also, the use of **{}** to enclose the commands. You do not need those curly braces if you have just *one* command after the condition. If you omitted the whole **else{}** section, the program will still run, but the program will not get a message when the input is invalid.

**REPETITIVE EXECUTION** is how you tell the computer to do something many times (and stop when it has to):

```
values=c(9,25,100)
for (value in values){ # do this for each value in values
print(sqrt(value))
}
```

```
## [1] 3
## [1] 5
## [1] 10
```

You do not need to show each result, you could save the results.

```
values=c(9,25,100,500)
rootValues=c() # empty vector
for (value in values){
rootValues=c(rootValues,sqrt(value)) # updating vector
}
```

Then, you see what you saved:

```
# to see the results:
rootValues
```

`## [1] 3.00000 5.00000 10.00000 22.36068`

It is evident that combining *loops* and *control of execution*, we can make better programs. For example, this code is not controlling well the process:

```
values=c(9,25,-100)
rootValues=c()
for (value in values){
rootValues=c(rootValues,sqrt(value))
}
```

`## Warning in sqrt(value): NaNs produced`

Then,

```
# to see the results:
rootValues
```

`## [1] 3 5 NaN`

In the last result, R decided what to do when a negative value was input (it also sent a warning). This one does a better job:

```
values=c(9,25,-100,144,-72)
rootValuesNew=c()
for (value in values){
if (value >=0){
rootValuesNew=c(rootValuesNew,sqrt(value))
}else {
print('We added a missing value, negative input detected')
rootValuesNew=c(rootValuesNew,NA)
}
}
```

```
## [1] "We added a missing value, negative input detected"
## [1] "We added a missing value, negative input detected"
```

Then,

```
# to see the results:
rootValues
```

`## [1] 3 5 NaN`

We are producing an ouput with the same size as input. If we omit the **else** structure, we will produce an output with smaller size than the input.

You can also use **break** when you consider the execution should stop:

```
values=c(9,25,-100,144,-72)
rootValues=c()
for (value in values){
if (value <0){
print('We need to stop, invalid value detected')
break
}
rootValues=c(rootValues,sqrt(value))
}
```

`## [1] "We need to stop, invalid value detected"`

The code above halted the program, but some results were saved:

`rootValues`

`## [1] 3 5`

You can use **next** when you consider the execution should continue:

```
values=list(9,NA,'1000',-100,144,-72)
for (value in values){
if (is.na(value)){
print('missing value as input')
next
}
if (value <0){
print('negative value as input')
next
}
if (is.character(value)){
print('char as input')
next
}
rootVal=sqrt(value)
print(paste(rootVal,'is the root of ',value))
}
```

```
## [1] "3 is the root of 9"
## [1] "missing value as input"
## [1] "char as input"
## [1] "negative value as input"
## [1] "12 is the root of 144"
## [1] "negative value as input"
```

We build functions to make the code more readable. Functions plus the data structures and control of execution capabilities you saw before will give you the basic tools you need to develop your own programs.

A function is a three-step process: Input, Transformation, Output. For example, if you need to convert a numeric value from Fahrenheit into Celsius , the input is the value in Fahrenheit, the transformation is the formula, and the output the result of the formula (a value in Celsius).

```
converterToCelsius=function(valueInFarenheit){ #input
#transformation
resultInCelsius= (valueInFarenheit-32)*5/9
#output
return (resultInCelsius)}
```

- A function has a name (for example:
*converterToCelsius*) to the left of the**=**; - then
*a definition of the input of the function*using the reserved word**function**; - the
*transformation process*between**{}**; - the process
*output*(for example:*resultInCelsius*), which requires the keyword**return**and**()**.

You can omit the *return* command and a function still works; in that case, R will output the last line that was **executed**; I avoid doing that as it impoverishes reading the code.

Above, we created a function, and after you run it, R has a new function available:

`converterToCelsius(100)`

`## [1] 37.77778`

We control the amount of input in a function:

```
# this function requires TWO inputs:
XsumY=function(valueX,valueY){
###
resultSum=valueX+valueY
###
return (resultSum)
}
```

The code above receives two values and outputs their sum. You can see how it works this way:

`XsumY(3,10)`

`## [1] 13`

You can have some inputs with *default* values:

```
riseToPower=function(base,exponent=2){
#####
result=1
if (exponent > 0){
for (time in 1:exponent){
result=result*base
}
}
#####
return(result)
}
```

Since you have a default value in the input arguments, you decide if you give that input or not. Let’s see how it works:

```
# to the power 2 by default
riseToPower(9)
```

`## [1] 81`

```
# to the power 3
riseToPower(9,3)
```

`## [1] 729`

```
# using argument names does not require order:
riseToPower(exponent=0,base=9)
```

`## [1] 1`

Our output has been a single value, but it can be several ones; however, you need the right structure.

```
# one input, and several output in simple data structure:
factors=function(number){
# empty vector that will collect output
vectorOfAnswers=c()
# for every value in the sequence...
for (i in 1:number){
#if the remainder of 'number'/'i' equals zero...
if ((number %% i) == 0){
# ...add 'i' to the vector of factors!
vectorOfAnswers=c(vectorOfAnswers,i)
}
}
return (vectorOfAnswers) # returning the vector
}
```

Testing:

`factors(20) `

`## [1] 1 2 4 5 10 20`

Imaging you have created a function that takes a value and multiplies it by two, like this:

```
double=function(x){
return (2*x)
}
```

and you have this vector:

`myVector=c(1,2,3)`

What will you get here?

`double(myVector)`

`## [1] 2 4 6`

If you use a vector as an input in a function in R, R will apply the function to each element.

If you use a list of numbers as input:

`myList=list(1,2,3)`

You get:

`double(myList)`

…you get an error. In this case, you can use **Map** or **mapply**:

```
# Map returns a list, and the input can be a vector or a list
Map(double,myList)
```

```
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
```

You can also use **mapply**:

```
# mapply returns a vector, and the input can be a vector or a list
mapply(double,myList)
```

`## [1] 2 4 6`

Notice that the outputs are returned in different data structures.

As we use data frames most of the time, pay attention on how you use a function and what you get as result:

```
numberA=c(10,20,30,40,50)
numberB=c(6,7,8,9,10)
dataDF=data.frame(numberA,numberB)
dataDF
```

```
## numberA numberB
## 1 10 6
## 2 20 7
## 3 30 8
## 4 40 9
## 5 50 10
```

Let’s *double* each value applying the function *directly* to the data frame:

`double(dataDF)`

```
## numberA numberB
## 1 20 12
## 2 40 14
## 3 60 16
## 4 80 18
## 5 100 20
```

As you saw above, the function *double* was designed to receive as input a simple value (a number). Then, without effort from your side, R itself decided to apply it to each element in the data frame.

Try now simple function as **as.character()** :

`as.character(dataDF)`

`## [1] "c(10, 20, 30, 40, 50)" "c(6, 7, 8, 9, 10)"`

I am pretty sure, this is not what you wanted.

You know that it works in one column (but not in a data frame):

`as.character(dataDF$numberA)`

`## [1] "10" "20" "30" "40" "50"`

We can use *Map()*:

`Map(as.character,dataDF)`

```
## $numberA
## [1] "10" "20" "30" "40" "50"
##
## $numberB
## [1] "6" "7" "8" "9" "10"
```

Or the very common **lapply()**:

`lapply(dataDF,as.character)`

```
## $numberA
## [1] "10" "20" "30" "40" "50"
##
## $numberB
## [1] "6" "7" "8" "9" "10"
```

In both cases, the function worked with a data frame as an input; but the output was a list. If you want a data frame as output:

```
# selecting columns:
dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.character)
dataDF
```

```
## numberA numberB
## 1 10 6
## 2 20 7
## 3 30 8
## 4 40 9
## 5 50 10
```

Then, to get the data frame using **lapply**, you need to specify the columns.

Let me turn the values back to numbers:

`dataDF[,c(1,2)]=lapply(dataDF[,c(1,2)],as.numeric)`

There are functions that could be applied to columns or rows and get a total from them. Keep in mind that **lapply** applies a function to columns.

```
# you are adding the column values here:
as.data.frame(lapply(dataDF,sum))
```

```
## numberA numberB
## 1 150 40
```

If you need to apply a function by row or by column, the right option is **apply**:

```
# you are adding by row:
apply(dataDF,1,sum) # 1 to apply by row (2 for column).
```

`## [1] 16 27 38 49 60`

Let me use the data frame *people* to show **tapply**

`tapply(X=people$ages,INDEX=list(people$education),FUN=mean)`

```
## Low Medium High
## 28.88889 31.83333 NA
```

This function applies a function to X, organised by the variables in INDEX. Then:

```
tapply(X=people$ages,
INDEX=list(people$education,people$state),
FUN=mean)
```

```
## CO NY VA WA
## Low 33 26 28 28.5
## Medium 29 34 NA 31.0
## High NA NA NA 29.5
```

The results are not data frames. You can get them using:

```
people %>%
group_by( education, state ) %>%
summarize(mean(ages)) # summarize('means'=mean(ages))
```

```
## # A tibble: 9 x 3
## # Groups: education [3]
## education state `mean(ages)`
## <ord> <chr> <dbl>
## 1 Low CO 33
## 2 Low NY 26
## 3 Low VA 28
## 4 Low WA 28.5
## 5 Medium CO 29
## 6 Medium NY 34
## 7 Medium WA 31
## 8 High NY NA
## 9 High WA 29.5
```

I have two spreadsheets. One has information on the democracy index: