Session 3: Data Structures

Before starting, keep in mind the following ideas:

Computers and humans need some structure in their language to communicate.
Unlike with humans, we should not allow the computer to guess what we mean. Talking to the computer therefore has to follow a particular set of rules, so that our orders are unambiguous.
Errors happen when we do not speak clearly to the computer; but it is worse if the computer does something we did not mean.
A data structure is the way the computer organizes pieces of data, so they can be stored, retrieved, used and modified.

We are going to talk about 3 data structures in R:

Lists.
Vectors.
Data frames.

Lists and vectors are simple structures; a data frame is a more complex one (built from the simple ones).

Lists

Lists are containers of values. The values can be of any kind (numbers or non-numbers), and even other containers (simple or complex).

If we use a spreadsheet as a reference, a row is a ‘natural’ list.

Then this can be a list:

DetailStudent=list("Fred Meyers",
                   40,
                   FALSE)

The object DetailStudent temporarily stores the list in the computer. To name a list, use combinations of letters and numbers in a meaningful way (do not start with a number or a special character).

Typing the name of the object DetailStudent, now representing a list, will give you all the contents you saved in there:

DetailStudent

The list above has three elements. However, you may be wondering what each of those elements means. In those situations, it is better to give a name to each element.

DetailStudent=list(fullName="Fred Meyers",
                   age=40,
                   female=FALSE)

# seeing the result
DetailStudent

This list has three elements, which we can also call fields. Each of these, in this case, holds a different data type:

fullName holds characters.
age holds a number.
female holds a logical (Boolean) value.

You can access any of those elements using these approaches:

# position
DetailStudent[[1]]

# name of the field
DetailStudent[['fullName']]

# name of the field
DetailStudent$fullName

If you do not have names for the list fields, you can only access them using positions:

NewList=list('a','b','c','d',1,2,3)
NewList[[1]]
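
The difference between single and double brackets is easy to miss. As a minimal sketch (demoList and its values are made up for illustration, they are not part of the session data), single brackets return a smaller list, while double brackets return the stored value itself:

```r
# hypothetical list, used only for this illustration
demoList = list(fullName="Ana Diaz", age=25)

asList = demoList["fullName"]     # single brackets: still a list (of length 1)
asValue = demoList[["fullName"]]  # double brackets: the value stored in the field

class(asList)   # "list"
class(asValue)  # "character"
```

This is why the examples above use double brackets when the goal is to read or replace one value.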

Once you access an element, you can alter it:

DetailStudent[[1]]='Alfred Mayer'
# Then:
DetailStudent

You can even add a totally NEW field like this:

DetailStudent$city='Seattle'

# show:
DetailStudent

And destroy it by NULLing it, like this:

DetailStudent$city=NULL # alternatively: DetailStudent[[4]]=NULL
DetailStudent
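
A quick way to confirm that a field was added or removed is to look at the field names. The sketch below uses a made-up list (person), not the session's DetailStudent:

```r
# hypothetical list, used only for this illustration
person = list(fullName="Ana Diaz", age=25)

person$city = "Lima"              # add a new field
namesAfterAdd = names(person)
namesAfterAdd                     # "fullName" "age" "city"

person$city = NULL                # remove the field again
namesAfterRemove = names(person)
namesAfterRemove                  # "fullName" "age"
length(person)                    # 2
```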

You can get rid of a list using:

rm(DetailStudent)
DetailStudent # this now gives an error: the object no longer exists

How would you create a list for this person out of his personal information?


cr7=list('FullName'='Cristiano Ronaldo dos Santos Aveiro', 
         'DateOfBirth'='5 February 1985',
         'PlaceOfBirth'='Funchal, Madeira, Portugal',
         'HeightInMeters'=1.89,
         'PlayingPosition'='Forward'
        )

#seeing the result:
cr7

There is nothing wrong with the previous list. But keep in mind that we save data to retrieve it and act (decide) upon its value. For example, can we answer the question:

What is Ronaldo’s playing position?

cr7$PlayingPosition

Great! However, we cannot directly answer:

How old is he?

# what is today? ('today' is not defined in R, so this fails)
today - cr7$DateOfBirth

The right way to get today's date:

Sys.Date()

# Then (this fails too):
Sys.Date() - cr7$DateOfBirth

The problem is that DateOfBirth is not a date; it is simply text.

cr7$DateOfBirth; str(cr7$DateOfBirth)

# updating
# some may need: Sys.setlocale("LC_TIME", "English")

cr7$DateOfBirth=as.Date(cr7$DateOfBirth,format="%d %B %Y");str(cr7$DateOfBirth)

Using the right format will allow you to accomplish what you need:

# then

Sys.Date()-cr7$DateOfBirth

Or, in a simpler way (with the help of the lubridate package):

library(lubridate)

# how many years:
# notice I am using 2 functions: interval and time_length

time_length(interval(cr7$DateOfBirth,Sys.Date()),"years")
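
If lubridate is not available, whole years can also be computed with base R. This is only a sketch: the reference date is fixed (an assumption, chosen so the result is reproducible) and the object names are made up:

```r
birth = as.Date("1985-02-05")      # ISO format needs no 'format' argument
reference = as.Date("2020-02-05")  # hypothetical fixed date instead of Sys.Date()

# difference in calendar years
ageYears = as.numeric(format(reference, "%Y")) - as.numeric(format(birth, "%Y"))

# subtract one year if the birthday has not happened yet in the reference year
if (format(reference, "%m%d") < format(birth, "%m%d")) {
  ageYears = ageYears - 1
}
ageYears  # 35
```

Dividing the number of days by 365.25 is a common shortcut, but it can be off by one year right on a birthday, which is why the sketch compares month and day instead.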


Vectors

Vectors are also containers of values. The values should be of only one type (otherwise, R will silently coerce them to a common type). If we use a spreadsheet as a reference, a column can be a natural vector.

Here, we will create three vectors using the “c(…)” function:

fullnames=c("Fred Meyers","Sarah Jones", "Lou Ferrigno","Sky Turner")
ages=c(40,35, 60,77)
female=c(FALSE,TRUE,TRUE,TRUE)

Each object temporarily holds a vector. Use combinations of letters and numbers in a meaningful way to name a vector (never start with a number or a special character). When typing the name of the object you will get all the contents:

fullnames

ages

female

Each vector is composed of elements with the same type. If you want to access individual elements, you can write:

fullnames[1]

# or
ages[1]

# or
female[1]

You can alter the vector using any of the above mechanisms:

fullnames[1]='Alfred Mayer'
# Then:
fullnames[1]

You can add an element to a vector like this:

elements=c(1,20,3)
elements=c(elements,40) # adding to the same one
elements
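
The base function append does the same job, and its after argument lets you choose where the new value goes. A small sketch with made-up values (the object names are hypothetical):

```r
values = c(1,20,3)                     # hypothetical vector
atEnd = append(values, 40)             # same result as c(values, 40)
inside = append(values, 99, after=1)   # insert 99 after the first position

atEnd    # 1 20 3 40
inside   # 1 99 20 3
```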

You can NOT delete it with NULL:

elements
elements[4]=NULL

Just do this:

# by position
elements
elements2=elements[-2] # vector 'without' position 2
elements2

# by value
elements3=elements[elements!=20]
elements3

You can get rid of those vectors using:

rm(elements2)
elements2

Another operation is to get rid of repeated values; R will not complain if they exist:

weekdays=c('M','T','W','Th','S','Su','Su')
weekdays

Then, use the function unique:

unique(weekdays)
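
Before dropping repeated values, it can help to see which ones are repeated. A minimal sketch using the base functions duplicated and anyDuplicated (the object days is introduced only for this illustration):

```r
days = c('M','T','W','Th','S','Su','Su')  # same values as above, in a new object

repeatedFlags = duplicated(days)   # TRUE marks a value already seen earlier
repeatedValues = days[repeatedFlags]
repeatedValues                     # "Su"

anyDuplicated(days) > 0            # TRUE if at least one repetition exists
```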

Vector elements can have ‘names’, but their contents still need to be homogeneous:

newAges=c("Sam"=50, "Paul"=30, "Jim"="40")
newAges

As you see above, the presence of “40” as an element coerced the other values to characters (the numbers are now text; the quotation marks around them show that). Updating that value will not change the vector type:

newAges["Jim"]=20
newAges

Updating the value will not take away the initial coercion.
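
The coercion follows a fixed order: logical values become numbers, and numbers become text, whenever the more general type is present in the same vector. A short sketch (the values are arbitrary and the object names are made up):

```r
logicalAndInteger = c(TRUE, 2L)
integerAndDouble = c(2L, 3.5)
numberAndText = c(3.5, "a")

class(logicalAndInteger)  # "integer": TRUE becomes 1
class(integerAndDouble)   # "numeric": the integer becomes a double
class(numberAndText)      # "character": the number becomes text
```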

Then, you can explicitly tell R to change the mode of the vector:

storage.mode(newAges)

storage.mode(newAges)='double' # or integer
newAges

The more familiar function as.numeric can be used, but that will also delete the element names:

newAges=as.numeric(newAges)
newAges

Notice that as.numeric turns text into missing values (NA) when the text is not a number:

someData1=c(1,2,3,'4')
as.numeric(someData1)

## [1] 1 2 3 4

But,

someData2=c(1,2,3,'O') # O not 0
as.numeric(someData2)

## Warning: NAs introduced by coercion

## [1]  1  2  3 NA

You can use the is.na function to find out which elements could not be converted:

is.na(as.numeric(someData2))

## Warning: NAs introduced by coercion

## [1] FALSE FALSE FALSE  TRUE
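
To list the offending values themselves, the check can be combined with suppressWarnings. This is a sketch under the assumption that the original vector has no missing values of its own; rawValues and the other names are hypothetical:

```r
rawValues = c(1,2,3,'O')   # same content as someData2; 'O' is a letter

converted = suppressWarnings(as.numeric(rawValues))

# NA after conversion, but not NA before: the text was not a number
notNumbers = is.na(converted) & !is.na(rawValues)

rawValues[notNumbers]   # "O"
which(notNumbers)       # 4
```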

Vectors versus Lists

Let me share some ideas for comparing these two basic structures:

A) Make sure what you have:

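The functions is.vector, is.atomic, is.list, is.character and is.numeric should be used frequently, because we need to be sure of what structure we are dealing with. Keep in mind that, for R, a list is also a kind of vector, so is.vector alone will not tell them apart; is.atomic will:

aList=list(1,2,3)
aVector=c(1,2,3)

is.vector(aVector); is.atomic(aVector); is.list(aVector)

# then:
is.vector(aList); is.atomic(aList); is.list(aList)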

The function str could be another alternative to find out what we have:

str(aVector)

str(aList)

B) Arithmetic:

You will find great differences when doing arithmetic:

# if we have these vectors:
numbers1=c(1,2,3)
numbers2=c(10,20,30)
numbers3=c(5)
numbers4=c(1,10)

Then, these work well:

# adding element by element:
numbers1+numbers2

# adding 5 to all the elements of the other vector:
numbers2+numbers3

# multiplication (element by element):
numbers1*numbers2

# and this kind of multiplication:
numbers1 * numbers3

However, R will give a warning here, because the vectors have different lengths:

numbers1+numbers4 # different size matters!
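
The warning comes from recycling: R repeats the shorter vector until it matches the longer one, and complains when the longer length is not a multiple of the shorter one. The sketch below makes that repetition explicit with rep_len (the object recycled is introduced only for illustration):

```r
numbers1 = c(1,2,3)
numbers4 = c(1,10)

# what R does behind the scenes with the shorter vector
recycled = rep_len(numbers4, length(numbers1))
recycled              # 1 10 1

explicitSum = numbers1 + recycled
explicitSum           # 2 12 4
```

The values are the same ones R returns for numbers1+numbers4, only without the warning, since both vectors now have the same length.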

Comparisons make sense:

numbers1>numbers2

# but this gives a warning again (different lengths):
numbers1>numbers4

Now, let’s see how the previous operations work here. These are our lists:

numbersL1=list(11,22,33)
numbersL2=list(1,2,3)

…the addition cannot be interpreted:

numbersL1+numbersL2

… and neither can the comparisons…

numbersL1>numbersL2

So do not expect either of these to work:

numbersL1*numbersL2

numbersL1*3
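
If the list elements are all numbers, one option is to turn the lists into vectors with unlist, or to apply the operation element by element with lapply. A minimal sketch, reusing the values above (sums and tripled are hypothetical names):

```r
numbersL1 = list(11,22,33)
numbersL2 = list(1,2,3)

# option 1: convert to vectors first, then operate
sums = unlist(numbersL1) + unlist(numbersL2)
sums              # 12 24 36

# option 2: keep the list structure and operate on each element
tripled = lapply(numbersL1, function(x) x*3)
unlist(tripled)   # 33 66 99
```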


Data Frames

Data frames are containers of values. You use a data frame because you need to combine what vectors and lists do. The most common analogy is a data table like the ones in a spreadsheet:

# VECTORS
names=c("Qing", "Françoise", "Raúl", "Bjork")
ages=c(32,33,28,30)
country=c("China", "Senegal", "Spain", "Norway")
education=c("Bach", "Bach", "Master", "PhD")

#DF as a "List" of vectors:
students=data.frame(names,ages,country,education)
students

You see your data frame above. Just by looking at it, you cannot be sure of what you have, so using str is highly recommended:

str(students)

Before version 4.0.0, R turned text vectors into factors (categorical values) by default; since R 4.0.0 the default is stringsAsFactors=FALSE. You can make the choice explicit by writing:

students=data.frame(names,ages,country,education,
                    stringsAsFactors=FALSE)
str(students)

The function str showed you the dimensions of the structure (number of rows and columns); R has alternative ways to get the dimensions:

dim(students)

#also
nrow(students) ; ncol(students)

# and very important:
length(students)

The function length works for vectors and lists, telling you the number of elements. In data frames, it gives you the number of columns, NOT rows.
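
This happens because a data frame is stored as a list of columns of equal length. A short check on a made-up data frame (scores is hypothetical, not part of the session data):

```r
scores = data.frame(id=c(1,2,3), score=c(90,85,70))

is.list(scores)                  # TRUE: a data frame is a list of columns
length(scores) == ncol(scores)   # TRUE
secondColumn = scores[[2]]       # list-style access by position
secondColumn                     # 90 85 70
```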

Data frames work with the function head(), which is very useful to show the top rows of the data frame:

head(students,2) # top 2

Of course, we have tail:

tail(students,2) # last 2

You can access data frame elements in an easy way:

# one particular column
students$names

# two columns using positions
students[,c(1,4)]

# two columns using names of columns
students[,c('names','education')]

Using positions is a compact way to get several adjacent columns:

students[,c(1,3:4)] # ':' is used to facilitate 'from-to' sequence

Of course, you can create a new object with subsets:

studentsNoEd=students[,c(1:3)]
studentsNoEd
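
One detail to keep in mind: asking for a single column with this notation returns a vector, not a data frame. The argument drop=FALSE keeps the data frame structure. A sketch with a made-up data frame (people is hypothetical):

```r
people = data.frame(name=c("Ana","Luis"), age=c(25,31))

asVector = people[,2]              # one column: simplified to a vector
asFrame = people[,2,drop=FALSE]    # one column: still a data frame

is.data.frame(asVector)   # FALSE
is.data.frame(asFrame)    # TRUE
```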

You have a summary function:

summary(students)

If you had the categorical values as factors, you could get a frequency table:

students$country=as.factor(students$country)
students$education=as.factor(students$education)

Then,

summary(students)

You can modify any values in a data frame. Let me create a copy of this data frame to play with:

studentsCopy=students # I make a copy to avoid altering my original dataframe

Now, I can change the age of Qing from 32 to 23:

studentsCopy[1,2]=23
# change is immediate! (you will not get any warning)
studentsCopy[1,]

We can set a column as missing:

studentsCopy$country=NA

studentsCopy

And, delete a column by nulling it:

studentsCopy$ages=NULL

studentsCopy
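
The same $ notation also creates a new column, in the same way a new field was added to a list. A minimal sketch on a made-up data frame (people and ageInMonths are hypothetical names):

```r
people = data.frame(name=c("Ana","Luis"), age=c(25,31))

# new column computed from an existing one
people$ageInMonths = people$age*12

people$ageInMonths   # 300 372
names(people)        # "name" "age" "ageInMonths"
```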

Querying Data Frames:

Once you have a data frame you can start writing interesting queries (notice the use of commas):

Who is the oldest in the group?

students[which.max(students$ages),]

Who is the youngest in the group?

students[which.min(students$ages),]

Who is above 30 and from China?

students[students$ages>30 & students$country=='China',]

Who is not from Norway?

students[students$country!="Norway",]

Who is from one of these places?

Places=c("Peru", "USA", "Spain")
students[students$country %in% Places,]

# the opposite
students[!students$country %in% Places,]

What is the education level of the one who is above 30 years old and from China?

students[students$ages>30 & students$country=='China',]$education
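
Base R also offers the function subset, which lets you write the condition without repeating the name of the data frame. The sketch below uses a made-up data frame (people); R's own documentation presents subset as a convenience for interactive use, so the bracket notation remains the safer choice inside functions:

```r
people = data.frame(name=c("Ana","Luis","Mei"),
                    age=c(25,31,40),
                    country=c("Peru","Spain","China"),
                    stringsAsFactors=FALSE)

# rows that meet the condition, and only the columns in 'select'
older = subset(people, age>30 & country=="China", select=c(name,age))
older
```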

Show me the data ordered by age (decreasing):

students[order(-students$ages),] # use the column inside the data frame
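
The minus sign only works for numeric columns. For text columns, order accepts the argument decreasing=TRUE, and several columns can be given to break ties. A sketch with a made-up data frame (people is hypothetical):

```r
people = data.frame(name=c("Ana","Luis","Mei"),
                    age=c(31,25,31),
                    stringsAsFactors=FALSE)

# text column, from Z to A
byName = people[order(people$name, decreasing=TRUE),]
byName$name          # "Mei" "Luis" "Ana"

# oldest first; ties in age are broken by name (A to Z)
byAgeThenName = people[order(-people$age, people$name),]
byAgeThenName$name   # "Ana" "Mei" "Luis"
```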