Contents:
Part 1:
Part 2: Customizing visual elements.
We are very familiar with data frames and data types. From that knowledge we will learn how information can be obtained using a visual approach.
Let me get some data from the website of the Common Core of Data from the US Department of Education. There you can get a data set with detailed information on public schools in the state of Washington:
link='https://github.com/EvansDataScience/VisualAnalytics_2_tabularData/raw/master/data/eduwa.rda'
#getting the data TABLE from the file in the cloud:
load(file=url(link))
Generally speaking, you have either categorical or numerical data in each column, and whatever question you have, you first need to know how the variable you are planning to use has been encoded:
# width = 70 together with strict.width = 'cut' means
# you do not want to see more than 70 characters per row.
str(eduwa,width = 70,strict.width='cut')
## 'data.frame': 2427 obs. of 24 variables:
## $ NCES.School.ID : chr "530486002475" "530270001270" "53091"..
## $ State.School.ID : chr "WA-31025-1656" "WA-06114-1646" "WA-"..
## $ NCES.District.ID : chr "5304860" "5302700" "5309100" "53000"..
## $ State.District.ID : chr "WA-31025" "WA-06114" "WA-34033" "WA"..
## $ Low.Grade : Ord.factor w/ 14 levels "PK"<"KG"<"1"<..: ..
## $ High.Grade : Ord.factor w/ 15 levels "PK"<"KG"<"1"<..: ..
## $ School.Name : chr "10th Street School" "49th Street Ac"..
## $ District : chr "Marysville School District" "Evergr"..
## $ County : chr "Snohomish" "Clark" "Thurston" "Gray"..
## $ Street.Address : chr "7204 27th Ave NE" "14619B NE 49th S"..
## $ City : chr "Marysville" "Vancouver" "Tumwater" "..
## $ State : chr "WA" "WA" "WA" "WA" ...
## $ ZIP : chr "98271" "98682" "98512" "98520" ...
## $ ZIP.4-digit : chr NA "6308" NA "5510" ...
## $ Phone : chr "(360)965-0400" "(360)604-6700" "(36"..
## $ Locale.Code : Factor w/ 12 levels "11","12","13",..: 5 2..
## $ LocaleType : Factor w/ 4 levels "City","Rural",..: 3 1 ..
## $ LocaleSub : Factor w/ 12 levels "City: Small",..: 5 2 ..
## $ Charter : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1..
## $ Title.I.School : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1..
## $ Title.1.School.Wide : Factor w/ 2 levels "No","Yes": 2 NA NA 2 N..
## $ Student.Teacher.Ratio: num 23.4 8.4 21.5 15.9 6.5 15.3 NA 16.3 1..
## $ Free.Lunch : num 28 53 169 292 12 411 48 102 101 268 ...
## $ Reduced.Lunch : num 3 9 40 10 4 23 12 22 23 0 ...
The columns labeled num are numbers (R stores numbers as double-precision numeric by default and labels a column int only when it is stored as integer; notice that the counts in Free.Lunch and Reduced.Lunch appear as num). The columns labeled chr are strings, which are candidates to be key columns: they are not variables themselves, but identifiers of the cases. In this case, the first four are identifiers, as well as the 7th, 10th and 15th columns (school name, address and phone, respectively). Those columns are not to be analyzed statistically, but may be used for annotating (7th and 15th columns) or for geocoding (10th column). Notice that for these data, State is not to be analyzed as it is a constant (all rows are from WA); but it would be if the data were from the whole USA. Then, you see several variables identified as factor or ordered factor, which are categorical variables: they can be analyzed statistically, but not in the same way as numbers.
You can get a clear idea of what a categorical variable contains by producing a simple frequency table:
# absolute values
table(eduwa$LocaleType,exclude = 'nothing')
##
## City Rural Suburb Town <NA>
## 714 505 798 338 72
# relative values
absoluteT=table(eduwa$LocaleType,exclude = 'nothing')
prop.table(absoluteT)
##
## City Rural Suburb Town <NA>
## 0.29419036 0.20807581 0.32880099 0.13926658 0.02966625
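The exclude = 'nothing' argument used above keeps the missing values because it replaces the default exclusion list (which contains NA) with a value that never occurs. A minimal sketch of the documented alternative, the useNA argument of table(), on a made-up vector (the values are illustrative and not taken from eduwa):

```r
# made-up vector with one missing value (not taken from eduwa)
locale <- c("City", "Town", NA, "City", "Rural")

# useNA = "ifany" adds an <NA> column only when missing values are present
withNA <- table(locale, useNA = "ifany")
withNA

# the default call silently drops the missing value
withoutNA <- table(locale)
withoutNA
```

Either spelling should give the same counts; useNA simply states the intention more directly.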
You may want to give a name to the missing values. However, when the column is a factor, you may need something like this:
library(forcats)
eduwa$LocaleType=fct_explicit_na(eduwa$LocaleType, "Unknown")
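Recent forcats releases deprecate fct_explicit_na() in favour of fct_na_value_to_level(), so the call above may produce a warning depending on the installed version. As a sketch that avoids the dependency, the same relabeling can be done in base R; the factor below is made up for illustration:

```r
# made-up factor with one missing value (not taken from eduwa)
localeToy <- factor(c("City", NA, "Town", "City"))

# addNA() turns the missing value into an explicit level
localeToy <- addNA(localeToy)

# rename that NA level to a readable label
levels(localeToy)[is.na(levels(localeToy))] <- "Unknown"
table(localeToy)
```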
The basic option for nominal data is a barplot. Many people tend to use pie charts with categorical data, but this should not be the default option to visualize classification (see this discussion).
Let’s start by calling the library to use:
library(ggplot2)
frTable=as.data.frame(table(eduwa$LocaleType))
names(frTable)=c('Type','Count')
baseNom= ggplot(data = frTable,
aes(x=Type, y=Count))
barNom=baseNom + geom_bar(stat = 'identity')
barNom
frTableProp=as.data.frame(prop.table(table(eduwa$LocaleType)))
names(frTableProp)=c('Type','Percent')
baseNomProp= ggplot(data = frTableProp,
aes(x=Type, y=Percent))
barNomProp=baseNomProp + geom_bar(stat = 'identity')
barNomProp
You should always keep it simple. Then decorate.
For this section, we will use the variable that tells us the highest grade offered in a school. A simple exploration gives:
table(eduwa$High.Grade,exclude = 'nothing')
##
## PK KG 1 2 3 4 5 6 7 8 9 10 11 12 13
## 82 7 6 16 19 45 755 266 11 427 15 7 5 757 9
As this is a categorical variable, the default option is again the bar plot:
ordTable=as.data.frame(table(eduwa$High.Grade,exclude = 'nothing'))
names(ordTable)=c('Grade','Count')
baseOrd = ggplot(ordTable,aes(x=Grade,y=Count))
barOrd=baseOrd + geom_bar(stat = 'identity')
barOrd
The x-values in this variable have an order; that is, the values increase from one level to the next. Whenever we have an ordering, besides concentration we can visualize symmetry: whether there is a bias towards lower or higher values.
Bar plots help you see concentration and symmetry, but we have an alternative way to clearly detect symmetry, via boxplots:
# boxplots do not use frequency tables
# as.numeric turns levels of the factor into numbers
baseOrd2 = ggplot(eduwa, aes(y=as.numeric(High.Grade)))
baseOrdBox = baseOrd2 + geom_boxplot()
baseOrdBox
You have symmetry when the two whiskers are at the same distance from the box, and when the thick line is in the middle of the box. Here the values show a negative asymmetry, as the tail points towards the bottom (the lowest values).
Box plots expect a numeric value as an input, but we have an ordered categorical, so we used the as.numeric() function. However, that eliminated the levels we saw in the previous bar plot; we can put the levels back in our plot:
# the labels use the original ordinal levels
ordLabels= levels(eduwa$High.Grade)
baseOrdBox2 = baseOrdBox + scale_y_continuous(labels=ordLabels,breaks=1:15)
baseOrdBox2
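Why breaks=1:15 lines up with the level names can be seen on a small scale. The sketch below uses a made-up ordered factor with four levels (not the 15 levels of High.Grade): as.numeric() returns the position of each value within levels(), so position i is labeled with the i-th level.

```r
# made-up ordered factor with four levels (not taken from eduwa)
grades <- factor(c("PK", "5", "12", "KG"),
                 levels = c("PK", "KG", "5", "12"),
                 ordered = TRUE)

# as.numeric gives the position of each value among the levels
positions <- as.numeric(grades)
positions

# indexing the levels by position recovers the original labels
recovered <- levels(grades)[positions]
recovered
```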
Box plots carry important statistical information. The beginning and the end of the box indicate the first quartile (q1, the 25th percentile) and the third quartile (q3, the 75th percentile), and the thicker line in the middle represents the median. From the boxplot, we can read these three values directly.
We can find these results with a detailed frequency table; that is, instead of using only the command table as we did before, we can build a table that also includes cumulative and relative frequencies:
x=eduwa$High.Grade
Freq=table(x)
CumulF=cumsum(table(x))
Relative=100*round(prop.table(table(x)),4)
CumulR=cumsum(Relative)
cbind(Freq, CumulF, Relative, CumulR)
## Freq CumulF Relative CumulR
## PK 82 82 3.38 3.38
## KG 7 89 0.29 3.67
## 1 6 95 0.25 3.92
## 2 16 111 0.66 4.58
## 3 19 130 0.78 5.36
## 4 45 175 1.85 7.21
## 5 755 930 31.11 38.32
## 6 266 1196 10.96 49.28
## 7 11 1207 0.45 49.73
## 8 427 1634 17.59 67.32
## 9 15 1649 0.62 67.94
## 10 7 1656 0.29 68.23
## 11 5 1661 0.21 68.44
## 12 757 2418 31.19 99.63
## 13 9 2427 0.37 100.00
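The quartiles can be read off the cumulative column: each one is the first grade whose cumulative share reaches 25%, 50% or 75%. A minimal sketch of that lookup, with the counts copied from the table above; the helper name firstAtLeast is hypothetical:

```r
# counts copied from the frequency table above
counts <- c(PK = 82, KG = 7, "1" = 6, "2" = 16, "3" = 19, "4" = 45,
            "5" = 755, "6" = 266, "7" = 11, "8" = 427, "9" = 15,
            "10" = 7, "11" = 5, "12" = 757, "13" = 9)

# cumulative share of schools up to each grade
cumShare <- cumsum(counts) / sum(counts)

# hypothetical helper: first grade whose cumulative share reaches p
firstAtLeast <- function(p) names(cumShare)[which(cumShare >= p)[1]]

quartiles <- sapply(c(q1 = 0.25, median = 0.50, q3 = 0.75), firstAtLeast)
quartiles
```

With these counts the lookup gives grade 5 for q1, grade 8 for the median and grade 12 for q3.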
Integers represent counts. They could be represented with bar plots if their frequency table had few different values. For example, the variable Reduced.Lunch tells how many students in each school receive lunch at a reduced price.
# how many unique values
length(unique(eduwa$Reduced.Lunch))
## [1] 172
There are too many different values. So, although R could produce a frequency table and a plot, we should not go for the bar plot.
When the frequency table cannot be our first step, we need to turn to statistical measures that help us understand the behavior of the data:
# median close to mean?
# median and mean far from max or min?
# q1 distance to min is similar to q3 distance to max?
# how many missing?
summary(eduwa$Reduced.Lunch)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 25.50 33.53 47.00 301.00 131
Let’s take care of missing values, by removing them:
eduwa_Lunch=eduwa[complete.cases(eduwa$Reduced.Lunch),]
The boxplot helps us clearly identify the values obtained from summary:
# boxplots do not use frequency tables
baseInt= ggplot(eduwa_Lunch,aes(y = Reduced.Lunch))
baseIntBox = baseInt + geom_boxplot()
baseIntBox
The bar plot is not a good option, as it produces a bar for each unique value in the data, counting how many times this value appeared. We have many values, so if we want to use bars, we need to organize the data into intervals. The histogram is the basic plot when intervals are needed; you can use the basic function:
baseInt2= ggplot(eduwa_Lunch,aes(x = Reduced.Lunch))
baseIntHist= baseInt2 + geom_histogram()
baseIntHist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
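The message above asks for an explicit interval width. A minimal sketch of how that could look, assuming intervals of 10 students are a sensible choice; the data frame toyLunch is made up for illustration and is not the eduwa data:

```r
library(ggplot2)

# made-up counts (not taken from eduwa), skewed like Reduced.Lunch
toyLunch <- data.frame(Reduced.Lunch = c(0, 3, 5, 12, 25, 25, 47, 80, 150, 301))

# binwidth fixes the width of every interval; boundary aligns the bins at 0
toyHist <- ggplot(toyLunch, aes(x = Reduced.Lunch)) +
  geom_histogram(binwidth = 10, boundary = 0)
toyHist
```

Choosing binwidth rather than bins keeps the intervals readable (0 to 10, 10 to 20, and so on) regardless of the range of the data.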
A simplistic idea of measurement is that it tells you how many times a particular unit is present in the unit of analysis; this allows for decimal places or even negative values.
Let’s analyze the variable Student.Teacher.Ratio, but organized by county:
# tapply(variable,group,functionToApply)
tapply(eduwa$Student.Teacher.Ratio, eduwa$County, mean)
## Adams Asotin Benton Chelan Clallam
## NA NA NA NA NA
## Clark Columbia Cowlitz Douglas Ferry
## NA NA NA NA NA
## Franklin Garfield Grant Grays Harbor Island
## NA 17.35000 NA NA NA
## Jefferson King Kitsap Kittitas Klickitat
## NA NA NA NA NA
## Lewis Lincoln Mason Okanogan Pacific
## NA 11.56000 NA NA NA
## Pend Oreille Pierce San Juan Skagit Skamania
## 15.47778 NA NA NA 16.37000
## Snohomish Spokane Stevens Thurston Wahkiakum
## NA NA NA NA 18.15000
## Walla Walla Whatcom Whitman Yakima
## NA NA NA NA
Above, I tried to compute the mean for each county, but the function mean() outputs a missing value (NA) as the result when there is at least one NA among the values. So we need to remove the missing values from that column:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
eduwa_ratioST=eduwa[complete.cases(eduwa$Student.Teacher.Ratio),]
meanValuesCounty= eduwa_ratioST %>%
group_by(County) %>%
summarize('means'=mean(Student.Teacher.Ratio))
meanValuesCounty
## # A tibble: 39 x 2
## County means
## <chr> <dbl>
## 1 Adams 14.8
## 2 Asotin 19.1
## 3 Benton 20.4
## 4 Chelan 18.6
## 5 Clallam 19.3
## 6 Clark 19.2
## 7 Columbia 11.3
## 8 Cowlitz 20.4
## 9 Douglas 16.5
## 10 Ferry 16.8
## # … with 29 more rows
Great!
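If a filtered data frame is not needed, a possible shortcut is to pass na.rm = TRUE through tapply(), which forwards extra arguments to the function it applies. A minimal sketch on a made-up data frame (toy is not part of the original data):

```r
# made-up data: county A has one missing ratio
toy <- data.frame(County = c("A", "A", "B", "B"),
                  Ratio  = c(10, NA, 20, 30))

# without na.rm, the mean of county A is NA
plainMeans <- tapply(toy$Ratio, toy$County, mean)
plainMeans

# extra arguments after the function are passed on to mean()
cleanMeans <- tapply(toy$Ratio, toy$County, mean, na.rm = TRUE)
cleanMeans
```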
Let’s compute some statistics:
summary(meanValuesCounty$means)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.30 16.42 18.65 17.96 19.41 23.77
# boxplots do not use frequency tables
baseDec= ggplot(meanValuesCounty,aes(y = means))
baseDecBox = baseDec + geom_boxplot()
baseDecBox
Now let me plot a histogram of those means:
baseDec2= ggplot(meanValuesCounty,aes(x = means))
baseDecHist= baseDec2 + geom_histogram(bins=7) # bins 7 (default 30)
baseDecHist
Titles and captions are important; a title can pose a question to be answered by the plot:
titleText='Do we have counties with less than 15 students per teacher (on average)?'
sourceText='Source: US Department of Education'
xaxisText='Average student-teacher ratio'
yaxisText='Number of counties'
baseDecHist2= baseDecHist + labs(title=titleText,
x = xaxisText,
y = yaxisText,
caption = sourceText)
baseDecHist2
Titles can also guide the reader to recognise the purpose of your plot:
# using \n to break the title into two lines
# (the summary above gives a 3rd quartile of 47 students)
titleText2='Most schools in WA have fewer than 50\nstudents in the Reduced Lunch Program'
sourceText='Source: US Department of Education'
xaxisText='Students in Reduced Lunch Program'
yaxisText='Number of schools'
baseIntHist2= baseIntHist + labs(title=titleText2,
x = xaxisText,
y = yaxisText,
caption = sourceText)
# changing position of titles
baseIntHist3= baseIntHist2 + theme(plot.caption = element_text(hjust = 0),
plot.title = element_text(hjust = 0.5))
baseIntHist3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
They can suggest a decision:
titleText3='WA needs to fully categorize school locations\n(info from 2018)'
sourceText='Source: US Department of Education'
xaxisText='Location of Schools'
yaxisText='%'
barNomProp2= barNomProp + labs(title=titleText3,
x = xaxisText,
y = yaxisText,
caption = sourceText)
barNomProp2
You can use the attributes colour (border) and fill (interior) to change the color of the shapes.
This works in every previous plot. Here you have the barplot:
baseNom + geom_bar(stat = 'identity',
colour='orange', # border
fill='white')
The boxplot:
baseOrd2 + geom_boxplot(colour='green',fill='black')
And the histogram:
baseDec2 + geom_histogram(bins=7,
colour='magenta',
fill='yellow')
Notice that the default area has a grid in gray. You can change the theme to make it simpler.
Here you have no grid:
baseDec2 + geom_histogram(bins=7,
colour='magenta',
fill='yellow') +
theme_classic()
Here you have a minimal grid with no background color:
baseDec2 + geom_histogram(bins=7,
colour='magenta',
fill='yellow') +
theme_minimal()
theme_light() is similar to the previous one, but it draws a box around the grid:
baseDec2 + geom_histogram(bins=7,
colour='magenta',
fill='yellow') +
theme_light()
You should review:
The names of the colors for R. Please note the ones that are friendly to colorblind readers.
Sometimes axes need to be reoriented:
baseDecBox2=baseDecBox + coord_flip()
baseDecBox2
The values and their tick marks on the vertical axis are not needed for the last boxplot:
baseDecBox3=baseDecBox2 +
theme(axis.text.y = element_blank(), # no values in ticks
axis.ticks = element_blank()) # no symbol in ticks
baseDecBox3
Axis default values may need to be customized:
# vector of the summary statistics with one decimal place
statVals=round(as.vector(summary(meanValuesCounty$means)),1)
baseDecBox4=baseDecBox3 +
# customize tick values
scale_y_continuous(breaks=statVals,
limits = c(10, 25)) +
# change angle of tick values
theme(axis.text.x = element_text(angle=45),
panel.grid.minor = element_blank()) # grid only on ticks
baseDecBox4
You may need percents instead of decimals:
library(scales)
barNomProp2 + scale_y_continuous(labels=scales::percent)
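If the default percent labels show more decimals than needed, the scales package also offers label_percent(), which builds a labeling function with a chosen accuracy. A small sketch, assuming a recent version of scales is installed; the two shares are copied from the proportion table above:

```r
library(scales)

# build a labeller that rounds to whole percents
wholePercent <- label_percent(accuracy = 1)

# two shares copied from the proportion table above (City and <NA>)
shareLabels <- wholePercent(c(0.29419036, 0.02966625))
shareLabels
```

The resulting function can be passed to scale_y_continuous(labels = wholePercent) in the same way as scales::percent.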
You may want to add a line to represent a particular value:
meanV=round(mean(meanValuesCounty$means),2)
baseDecHist3=baseDecHist2 + geom_vline(xintercept = meanV,
linetype="dotted",
color = "yellow",
size=1.5) # ggplot2 >= 3.4.0 prefers linewidth=1.5 for lines
baseDecHist3
Reference lines are more effective if we add text:
baseDecHist4=baseDecHist3+ annotate("text", x = meanV+0.5,y=10,
angle = 90,
label = paste("MEAN",meanV),
color="yellow")
baseDecHist4
But annotation can do more than make lines explicit. Let me count how many counties have an average ratio less than 15:
(count_Less15=nrow(meanValuesCounty[meanValuesCounty$means<15,]))
## [1] 5
Let me annotate using a rectangular area:
baseDecHist5= baseDecHist4 + annotate("rect",
#points for rectangle:
xmin = 10, xmax = 15,
ymin = 0, ymax = 5,
fill='red',alpha = .2) +
annotate("text", x= 12.5, y = 4,
label=paste(count_Less15,'counties'))
baseDecHist5
We can use dots instead of bars (position instead of length):
baseNomProp2= baseNomProp + geom_point()
baseNomProp2
We can add lines to reinforce distance:
baseNomProp2+ geom_segment(aes(y = 0,
x = Type,
yend = Percent,
xend = Type), color = "grey50")
We could reorder the categories as they are not ordinal:
frTableProp[order(frTableProp$Percent),]
## Type Percent
## 5 Unknown 0.02966625
## 4 Town 0.13926658
## 2 Rural 0.20807581
## 1 City 0.29419036
## 3 Suburb 0.32880099
You can get:
# saving new order:
tableFreqO=frTableProp[order(frTableProp$Percent),]
baseNomRe = ggplot(tableFreqO, aes(Type,Percent))
lollipop1=baseNomRe + geom_segment(aes(y = 0,
x = Type,
yend = Percent,
xend = Type), color = "gray")
lollipop2 = lollipop1 + geom_point()
lollipop2 + scale_x_discrete(limits=tableFreqO$Type) # key element
These graphs are called lollipops. We can use them to represent the direction of the distance from a particular reference line.
For example, with four location types (leaving Unknown aside), a uniform share would be 25%. Then we can compute a new column, gap:
# new variable
tableFreqO$gap=tableFreqO$Percent-0.25 # 0.25 is uniform share
head(tableFreqO)
## Type Percent gap
## 5 Unknown 0.02966625 -0.22033375
## 4 Town 0.13926658 -0.11073342
## 2 Rural 0.20807581 -0.04192419
## 1 City 0.29419036 0.04419036
## 3 Suburb 0.32880099 0.07880099
Let’s plot this column, instead of Percent:
# plot the new variable
base = ggplot(tableFreqO, aes(Type,gap))
lollipopGap=base + geom_segment(aes(y = 0,
x = Type,
yend = gap,
xend = Type), color = "gray")
lollipopGap1 = lollipopGap + geom_point()
lollipopGap2 = lollipopGap1 +
scale_x_discrete(limits=tableFreqO$Type) # key element
lollipopGap2
We can create another column, a flag to signal if the gap is negative or positive:
# a new column for color
tableFreqO$flag=tableFreqO$gap>0 # TRUE when the gap is positive
head(tableFreqO)
## Type Percent gap flag
## 5 Unknown 0.02966625 -0.22033375 FALSE
## 4 Town 0.13926658 -0.11073342 FALSE
## 2 Rural 0.20807581 -0.04192419 FALSE
## 1 City 0.29419036 0.04419036 TRUE
## 3 Suburb 0.32880099 0.07880099 TRUE
I will redraw the previous plot, but using the extra column to give color to the line:
# add new aesthetics 'color'
base = ggplot(tableFreqO, aes(Type,gap))
lollipopGap1=base + geom_segment(aes(y = 0,
x = Type,
yend = gap,
xend = Type,color=flag)) # no fixed color here: it would override the mapping
lollipopGap2 = lollipopGap1 + geom_point(aes(color=flag)) #adding color
lollipopGap3 = lollipopGap2 + scale_x_discrete(limits=tableFreqO$Type)
lollipopGap3
Since color is mapped to a variable, ggplot creates a legend to explain what this third dimension means in the two-dimensional plot.
Let me annotate the last plot:
lollipopGap4= lollipopGap3 +
geom_text(aes(color=flag,label = round(gap,3)),
nudge_x=0.3) # push text to the right
lollipopGap4
The legend now overlays two symbols, one for the color of the dot and one for the color of the text; we can alter the previous code to avoid that:
lollipopGap4= lollipopGap3 +
geom_text(aes(color=flag,label = round(gap,3)),
nudge_x=0.3,
show.legend = FALSE)
lollipopGap4
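The steps of this lollipop section can be condensed into one self-contained sketch. The shares are copied from the proportion table printed above, and the names shares and lolli are introduced here only for illustration; color is mapped once in the base so that segments, dots and text all follow the flag:

```r
library(ggplot2)

# shares copied from the ordered proportion table above
shares <- data.frame(
  Type = c("Unknown", "Town", "Rural", "City", "Suburb"),
  Percent = c(0.02966625, 0.13926658, 0.20807581, 0.29419036, 0.32880099)
)
shares$gap  <- shares$Percent - 0.25  # distance to the uniform share
shares$flag <- shares$gap > 0         # TRUE when above the uniform share

lolli <- ggplot(shares, aes(x = Type, y = gap, color = flag)) +
  geom_segment(aes(xend = Type, y = 0, yend = gap)) +   # the stick
  geom_point() +                                        # the candy
  geom_text(aes(label = round(gap, 3)),
            nudge_x = 0.3, show.legend = FALSE) +       # value next to the dot
  scale_x_discrete(limits = shares$Type)                # keep the sorted order
lolli
```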
Another alternative to the histogram is the density plot. We had this:
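#baseDec2= ggplot(meanValuesCounty,aes(x = means))
# the axis texts were overwritten by later plots, so set them again
xaxisText='Average student-teacher ratio'
yaxisText='Number of counties'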
baseDecHist= baseDec2 + geom_histogram(bins=7)
baseDecHist2= baseDecHist + labs(title=titleText,
x = xaxisText,
y = yaxisText,
caption = sourceText)
baseDecHist2
Then, we need a couple of steps. First, represent the y values as density:
# ggplot2 >= 3.4.0 prefers aes(y = after_stat(density)) over ..density..
baseDecHistDen= baseDec2 + geom_histogram(aes(y = ..density..),
bins=7)
baseDecHistDen2= baseDecHistDen + labs(title=titleText,
x = xaxisText,
y = 'density',
caption = sourceText)
baseDecHistDen2
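What the density scale means can be checked outside ggplot2: the height of each bar is rescaled so that the areas of all bars add up to 1, which is what lets a density curve share the same axis. A minimal sketch with base R's hist(), using the ten county means printed earlier and breaks chosen here only for illustration:

```r
# the ten county means printed in the tibble above
tenMeans <- c(14.8, 19.1, 20.4, 18.6, 19.3, 19.2, 11.3, 20.4, 16.5, 16.8)

# compute the bins without drawing; the breaks are an illustrative choice
h <- hist(tenMeans, breaks = seq(10, 22, by = 2), plot = FALSE)

# area of each bar = density * bin width; the areas sum to 1
totalArea <- sum(h$density * diff(h$breaks))
totalArea
```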
And now, plot the density:
baseDecHistDen3 = baseDecHistDen2 + geom_density()
baseDecHistDen3
We can improve it by filling the area under the curve:
baseDecHistDen3 = baseDecHistDen2 + geom_density(alpha = .2,
fill="pink")
baseDecHistDen3