Before starting, keep in mind the following ideas:
We are going to talk about three data structures in R: lists, vectors, and data frames.
Lists and vectors are simple structures; a data frame is a more complex one (built from the simple ones).
Lists are containers of values. The values can be of any kind (numbers or non-numbers), and even other containers (simple or complex).
If we take a spreadsheet as a reference, a row is a ‘natural’ list.
Then this can be a list:
DetailStudent=list("Fred Meyers",
40,
FALSE)
The object DetailStudent temporarily stores the list in the computer. To name a list, use combinations of letters and numbers in a meaningful way (do not start with a number or a special character).
Typing the name of the object DetailStudent, now representing a list, will give you all the contents you saved in there:
DetailStudent
The list above has three elements. However, you may be wondering what each of those elements means. In those situations, it is better to give a name to each element.
DetailStudent=list(fullName="Fred Meyers",
age=40,
female=FALSE)
# seeing the result
DetailStudent
This list has three elements, which we can also call fields. Each of these, in this case, holds a different data type: text (character), a number (numeric), and a logical value (TRUE/FALSE).
You can access any of those elements using these approaches:
# position
DetailStudent[[1]]
# name of the field
DetailStudent[['fullName']]
# name of the field, with the $ operator
DetailStudent$fullName
If you do not have names for the list fields, you can only access them using positions:
NewList=list('a','b','c','d',1,2,3)
NewList[[1]]
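A detail worth checking here is the difference between double and single brackets. The sketch below uses a hypothetical list named student (not part of the examples above) to show that double brackets return the element itself, while single brackets return a smaller list:

```r
# A hypothetical list, used only for this illustration
student = list(fullName = "Fred Meyers", age = 40, female = FALSE)

# Double brackets return the element itself
ageValue = student[["age"]]
class(ageValue)      # "numeric"

# Single brackets return a (smaller) list that contains the element
ageList = student["age"]
class(ageList)       # "list"

# Arithmetic works on the element, not on the one-element list
ageValue + 1
```

This is why the examples above use double brackets whenever the value itself is needed.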
Once you access an element, you can alter it:
DetailStudent[[1]]='Alfred Mayer'
# Then:
DetailStudent
You can even add a totally new field like this:
DetailStudent$city='Seattle'
# show:
DetailStudent
And remove it by setting it to NULL, like this:
DetailStudent$city=NULL # alternatively: DetailStudent[[4]]=NULL
DetailStudent
You can get rid of a list using:
rm(DetailStudent)
DetailStudent # error: the object no longer exists
**How would you create a list for this person from his personal information?**
cr7=list('FullName'='Cristiano Ronaldo dos Santos Aveiro',
'DateOfBirth'='5 February 1985',
'PlaceOfBirth'='Funchal, Madeira, Portugal',
'HeightInMeters'=1.89,
'PlayingPosition'='Forward'
)
#seeing the result:
cr7
There is nothing wrong with the previous list. But keep in mind that we store data to retrieve it and act (decide) upon its values. For example, we can ask for his playing position:
cr7$PlayingPosition
Great! However, we cannot directly compute how old he is:
# 'today' is not an object R knows, so this raises an error
today - cr7$DateOfBirth
The right way to get today's date is:
Sys.Date()
# Then,
Sys.Date() - cr7$DateOfBirth
The problem is that DateOfBirth is not a date; it is simply text.
cr7$DateOfBirth; str(cr7$DateOfBirth)
# updating
# some may need: Sys.setlocale("LC_TIME", "English")
cr7$DateOfBirth=as.Date(cr7$DateOfBirth,format="%d %B %Y");str(cr7$DateOfBirth)
Using the right format will allow you to accomplish what you need:
# then
Sys.Date()-cr7$DateOfBirth
Or, in a simpler way (with the help of the lubridate package):
library(lubridate)
# how many years:
# notice I am using 2 functions: interval and time_length
time_length(interval(cr7$DateOfBirth,Sys.Date()),"years")
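If the text date can be stored as year-month-day, the locale issue with month names does not arise. The following is a minimal base R sketch under that assumption; the reference date is fixed (an arbitrary choice, not from the example above) so that the result is reproducible:

```r
# ISO 8601 text (year-month-day) is parsed without a format string
# and does not depend on the language of month names
birth = as.Date("1985-02-05")

# A fixed reference date (an assumption, chosen to keep the result reproducible)
reference = as.Date("2025-02-05")

# Subtracting two dates gives a 'difftime' object measured in days
ageInDays = reference - birth
ageInDays

# Convert to a plain number and approximate the age in years
approxYears = as.numeric(ageInDays) / 365.25
approxYears
```

Dividing by 365.25 is only an approximation; the lubridate approach shown above handles calendar details more carefully.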
Vectors are also containers of values. The values should be of only one type (otherwise, R may silently alter or coerce them). If we take a spreadsheet as a reference, a column can be a natural vector.
Here, we will create three vectors using the “c(…)” function:
fullnames=c("Fred Meyers","Sarah Jones", "Lou Ferrigno","Sky Turner")
ages=c(40,35, 60,77)
female=c(FALSE,TRUE,TRUE,TRUE) # TRUE/FALSE are safer than T/F, which can be overwritten
Each object temporarily holds a vector. To name a vector, use combinations of letters and numbers in a meaningful way (never start with a number or a special character). Typing the name of the object shows all its contents:
fullnames
ages
female
Each vector is composed of elements with the same type. If you want to access individual elements, you can write:
fullnames[1]
# or
ages[1]
# or
female[1]
You can alter the vector using any of the above mechanisms:
fullnames[1]='Alfred Mayer'
# Then:
fullnames[1]
You can add an element to a vector like this:
elements=c(1,20,3)
elements=c(elements,40) # adding to the same one
elements
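Base R also offers the function append, which can place the new value somewhere other than the end. A small sketch (the object names atEnd and inMiddle are only for this illustration):

```r
elements = c(1, 20, 3)

# append() adds at the end by default, like c(elements, 40)
atEnd = append(elements, 40)
atEnd

# 'after' chooses the position after which the new value goes
inMiddle = append(elements, 40, after = 1)
inMiddle
```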
You cannot delete an element of a vector with NULL:
elements
elements[4]=NULL # error: replacement has length zero
Just do this:
# by position
elements
elements2=elements[-2] # vector 'without' position 2
elements2
# by value
elements3=elements[elements!=20]
elements3
You can get rid of those vectors using:
rm(elements2)
elements2 # error: the object no longer exists
Another operation is to get rid of repeated values. R will not complain if they exist:
weekdays=c('M','T','W','Th','S','Su','Su')
weekdays
Then, use the function unique:
unique(weekdays)
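Before dropping repeated values, it may help to see which ones are repeated and how often. A short sketch with the base functions duplicated and table, using a hypothetical vector named days:

```r
# Hypothetical vector with repeated values
days = c("M", "T", "W", "Th", "S", "Su", "Su")

# TRUE marks an element that already appeared earlier in the vector
duplicated(days)

# Which values are repeated?
repeated = unique(days[duplicated(days)])
repeated

# How many times does each value appear?
counts = table(days)
counts
```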
Vector elements can have ‘names’, but their contents still need to be homogeneous:
newAges=c("Sam"=50, "Paul"=30, "Jim"="40")
newAges
As you see above, the presence of “40” as an element coerced the other values to characters (the numbers are now text; the quotation marks around them show that). Updating that value will not change the vector type:
newAges["Jim"]=20
newAges
Updating the value will not take away the initial coercion.
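The direction of the coercion can be checked with class. In this sketch (the names mixed1 to mixed3 are only for illustration), logical values give way to numbers, and both give way to text:

```r
# logical + numeric -> numeric (TRUE becomes 1)
mixed1 = c(TRUE, 5)
class(mixed1)

# numeric + character -> character
mixed2 = c(5, "five")
class(mixed2)

# logical + character -> character
mixed3 = c(TRUE, "five")
mixed3
```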
Then, you could tell explicitly to change the mode of the vector:
storage.mode(newAges)
storage.mode(newAges)='double' # or integer
newAges
The more familiar function as.numeric can be used, but that will also delete the field names:
newAges=as.numeric(newAges)
newAges
Notice that as.numeric coerces text into missing values, if the text is not a number:
someData1=c(1,2,3,'4')
as.numeric(someData1)
## [1] 1 2 3 4
But,
someData2=c(1,2,3,'O') # O not 0
as.numeric(someData2)
## Warning: NAs introduced by coercion
## [1] 1 2 3 NA
You can use the is.na function to find out which elements were turned into missing values by the coercion:
is.na(as.numeric(someData2))
## Warning: NAs introduced by coercion
## [1] FALSE FALSE FALSE TRUE
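When the data may already contain missing values, is.na alone cannot tell them apart from the ones created by coercion. One possible way to locate only the problematic entries is sketched below (someData, asNumbers and badPositions are names made up for this illustration):

```r
someData = c(1, 2, 3, "O", NA)   # "O" is a letter, not zero; NA is already missing

# suppressWarnings() hides the coercion warning for this one call
asNumbers = suppressWarnings(as.numeric(someData))

# Positions that became NA because of coercion (not NA beforehand)
badPositions = which(is.na(asNumbers) & !is.na(someData))
badPositions

# The offending original values
someData[badPositions]
```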
Let me share some ideas for comparing these two basic structures:
A) Make sure what you have:
The functions is.vector, is.list, is.character and is.numeric should be used frequently, because we need to be sure of what structure we are dealing with:
aList=list(1,2,3)
aVector=c(1,2,3)
is.vector(aVector); is.list(aVector)
# then:
is.vector(aList,mode='vector'); is.list(aList)
The function str could be another alternative to find out what we have:
str(aVector)
str(aList)
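Keep in mind that, with its default arguments, is.vector answers TRUE for a list too, because R treats lists as a kind of vector. A sketch of two other checks that separate both structures:

```r
aList = list(1, 2, 3)
aVector = c(1, 2, 3)

# With the default mode, is.vector() is TRUE for both
is.vector(aVector)
is.vector(aList)

# is.atomic() is TRUE only for the simple (one-type) vector
is.atomic(aVector)
is.atomic(aList)

# class() also tells them apart
class(aVector)
class(aList)
```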
B) Arithmetic:
You will find great differences when doing arithmetic:
# if we have these vectors:
numbers1=c(1,2,3)
numbers2=c(10,20,30)
numbers3=c(5)
numbers4=c(1,10)
Then, these work well:
# adding element by element:
numbers1+numbers2
# adding 5 to all the elements of other vector:
numbers2+numbers3
# multiplication (element by element):
numbers1*numbers2
# and this kind of multiplication:
numbers1 * numbers3
However, R will give a warning here, because the length of the longer vector (3) is not a multiple of the length of the shorter one (2):
numbers1+numbers4 # different size matters!
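What R does in these cases is called recycling: the shorter vector is reused until it covers the longer one. A sketch with made-up vectors (longer, shorter) showing when the warning appears:

```r
longer = c(1, 2, 3, 4)
shorter = c(1, 10)

# length 4 is a multiple of length 2: 'shorter' is reused as 1, 10, 1, 10
recycled = longer + shorter
recycled

# length 3 is not a multiple of 2: R still recycles, but warns
# (the warning is silenced here only to keep the output clean)
partial = suppressWarnings(c(1, 2, 3) + shorter)
partial
```

Note that the result is computed in both cases; the warning only signals that the recycling was incomplete.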
Comparisons make sense:
numbers1>numbers2
# but:
numbers1>numbers4
Now, let’s see how the previous operations work with lists. These are our lists:
numbersL1=list(11,22,33)
numbersL2=list(1,2,3)
…the addition cannot be performed:
numbersL1+numbersL2
… and neither can the comparisons…
numbersL1>numbersL2
So do not expect either of these to work:
numbersL1*numbersL2
numbersL1*3
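If the list elements are all single numbers, there are ways around this. The sketch below shows three possible options using base R functions (unlist, Map and lapply); which one fits depends on whether you need a vector or a list as the result:

```r
numbersL1 = list(11, 22, 33)
numbersL2 = list(1, 2, 3)

# Option 1: flatten both lists into vectors, then add
sumAsVector = unlist(numbersL1) + unlist(numbersL2)
sumAsVector

# Option 2: apply "+" pair by pair; the result is still a list
sumAsList = Map(`+`, numbersL1, numbersL2)
str(sumAsList)

# Multiply every element by 3, keeping a list
tripled = lapply(numbersL1, function(x) x * 3)
unlist(tripled)
```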
Data frames are containers of values. You use a data frame because you need to combine what vectors and lists do. The most common analogy is a data table like the ones in a spreadsheet:
# VECTORS
names=c("Qing", "Françoise", "Raúl", "Bjork")
ages=c(32,33,28,30)
country=c("China", "Senegal", "Spain", "Norway")
education=c("Bach", "Bach", "Master", "PhD")
#DF as a "List" of vectors:
students=data.frame(names,ages,country,education)
students
You see your data frame above. Just by looking at it, you cannot be sure of what you have, so using str is highly recommended:
str(students)
Before R 4.0.0, data.frame turned text vectors into factors (categorical values) by default; from 4.0.0 on, text stays as text. You can make the behavior explicit in any version by writing:
students=data.frame(names,ages,country,education,
stringsAsFactors=FALSE)
str(students)
The function str showed you the dimensions of the structure (number of rows and columns); R has alternative ways to get the dimensions:
dim(students)
#also
nrow(students) ; ncol(students)
# and very important:
length(students)
The function length works for vectors and lists, telling you the number of elements. In data frames, it gives you the number of columns, NOT rows.
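The reason is that a data frame is, internally, a list whose elements are the columns. A sketch with a hypothetical data frame named people:

```r
# Hypothetical data frame, used only for this illustration
people = data.frame(name = c("Ana", "Bo", "Cy"),
                    age = c(30, 25, 41))

is.list(people)        # a data frame is a list of columns
length(people)         # number of columns
nrow(people)           # number of rows

# Each column can be pulled out like a list element
people[["age"]]
length(people[["age"]])
```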
Data frames have the function head(), which is very useful to show the top rows of the data frame:
head(students,2) # top 2
Of course, we have tail:
tail(students,2) # last 2
You can access data frame elements in an easy way:
# one particular column
students$names
# two columns using positions
students[,c(1,4)]
# two columns using names of columns
students[,c('names','education')]
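One thing to watch when selecting a single column: the result may be a vector or a data frame, depending on how you ask. A sketch with a hypothetical data frame named people:

```r
people = data.frame(name = c("Ana", "Bo", "Cy"),
                    age = c(30, 25, 41),
                    stringsAsFactors = FALSE)

# One column with [ , "col"] (or with $): the result is a vector
asVector = people[, "name"]
class(asVector)

# drop = FALSE keeps the data frame structure
asDF1 = people[, "name", drop = FALSE]
class(asDF1)

# Single brackets without the comma also return a data frame
asDF2 = people["name"]
class(asDF2)
```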
Using positions is convenient when you need a range of columns:
students[,c(1,3:4)] # ':' is used to facilitate 'from-to' sequence
Of course, you can create a new object with subsets:
studentsNoEd=students[,c(1:3)]
studentsNoEd
You have a summary function:
summary(students)
If you store the categorical variables as factors, summary gives you a frequency table for them:
students$country=as.factor(students$country)
students$education=as.factor(students$education)
Then,
summary(students)
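For a single categorical column, the function table gives the same counts directly. A sketch using a stand-alone vector named education (a copy of the values used above) so it can be run on its own:

```r
education = c("Bach", "Bach", "Master", "PhD")

# table() counts each distinct value, factor or not
eduCounts = table(education)
eduCounts

# A factor with explicit levels controls the order of the categories
eduFactor = factor(education, levels = c("Bach", "Master", "PhD"))
levels(eduFactor)

# proportions instead of counts
prop.table(eduCounts)
```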
You can modify any values in a data frame. Let me create a copy of this data frame to play with:
studentsCopy=students # I make a copy to avoid altering my original dataframe
Now, I can change the age of Qing to 23 replacing 32:
studentsCopy[1,2]=23
# change is immediate! (you will not get any warning)
studentsCopy[1,]
We can set a column as missing:
studentsCopy$country=NA
studentsCopy
And, delete a column by nulling it:
studentsCopy$ages=NULL
studentsCopy
Once you have a data frame you can start writing interesting queries (notice the use of commas):
Who is the oldest in the group?
students[which.max(students$ages),]
Who is the youngest in the group?
students[which.min(students$ages),]
Who is above 30 and from China?
students[students$ages>30 & students$country=='China',]
Who is not from Norway?
students[students$country!="Norway",]
Who is from one of these places?
Places=c("Peru", "USA", "Spain")
students[students$country %in% Places,]
# the opposite
students[!students$country %in% Places,]
What is the education level of the one above 30 years old and from China?
students[students$ages>30 & students$country=='China',]$education
Show me the data ordered by age (decreasing):
# use the column inside the data frame, not the separate 'ages' vector
students[order(-students$ages),]
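The function order also accepts a decreasing argument and more than one sorting key. A sketch with a hypothetical data frame named people, where two rows share the same age:

```r
people = data.frame(name = c("Ana", "Bo", "Cy", "Di"),
                    age = c(30, 25, 30, 41),
                    stringsAsFactors = FALSE)

# decreasing = TRUE avoids the minus sign (which only works for numbers)
byAge = people[order(people$age, decreasing = TRUE), ]
byAge

# Two keys: age decreasing, then name alphabetically to break ties
byAgeName = people[order(-people$age, people$name), ]
byAgeName$name
```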