Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 5: Intro to Visualization

_____

Contents:

Part 1:

  1. Nominal Visualization.
  2. Ordinal Visualization.
  3. Integers Visualization.
  4. Decimals Visualization.

Part 2: Customizing visual elements.

  1. Titles
  2. Changing color
  3. Working on axes
  4. Reference lines
  5. Annotation
  6. Alternatives

We are very familiar with data frames and data types. From that knowledge we will learn how information can be obtained using a visual approach.

Let me get some data from the website of the Common Core of Data from the US Department of Education. There you can get a data set with detailed information on public schools in the state of Washington:

link='https://github.com/EvansDataScience/VisualAnalytics_2_tabularData/raw/master/data/eduwa.rda'

#getting the data TABLE from the file in the cloud:
load(file=url(link))

Generally speaking, you have either categorical or numerical data in each column, and whatever question you have, you first need to know how the variable you are planning to use has been encoded:

# width = 70 and strict.width = 'cut' mean
# you do not want to see more than 70 characters per row.

str(eduwa,width = 70,strict.width='cut')
## 'data.frame':    2427 obs. of  24 variables:
##  $ NCES.School.ID       : chr  "530486002475" "530270001270" "53091"..
##  $ State.School.ID      : chr  "WA-31025-1656" "WA-06114-1646" "WA-"..
##  $ NCES.District.ID     : chr  "5304860" "5302700" "5309100" "53000"..
##  $ State.District.ID    : chr  "WA-31025" "WA-06114" "WA-34033" "WA"..
##  $ Low.Grade            : Ord.factor w/ 14 levels "PK"<"KG"<"1"<..: ..
##  $ High.Grade           : Ord.factor w/ 15 levels "PK"<"KG"<"1"<..: ..
##  $ School.Name          : chr  "10th Street School" "49th Street Ac"..
##  $ District             : chr  "Marysville School District" "Evergr"..
##  $ County               : chr  "Snohomish" "Clark" "Thurston" "Gray"..
##  $ Street.Address       : chr  "7204 27th Ave NE" "14619B NE 49th S"..
##  $ City                 : chr  "Marysville" "Vancouver" "Tumwater" "..
##  $ State                : chr  "WA" "WA" "WA" "WA" ...
##  $ ZIP                  : chr  "98271" "98682" "98512" "98520" ...
##  $ ZIP.4-digit          : chr  NA "6308" NA "5510" ...
##  $ Phone                : chr  "(360)965-0400" "(360)604-6700" "(36"..
##  $ Locale.Code          : Factor w/ 12 levels "11","12","13",..: 5 2..
##  $ LocaleType           : Factor w/ 4 levels "City","Rural",..: 3 1 ..
##  $ LocaleSub            : Factor w/ 12 levels "City: Small",..: 5 2 ..
##  $ Charter              : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1..
##  $ Title.I.School       : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1..
##  $ Title.1.School.Wide  : Factor w/ 2 levels "No","Yes": 2 NA NA 2 N..
##  $ Student.Teacher.Ratio: num  23.4 8.4 21.5 15.9 6.5 15.3 NA 16.3 1..
##  $ Free.Lunch           : num  28 53 169 292 12 411 48 102 101 268 ...
##  $ Reduced.Lunch        : num  3 9 40 10 4 23 12 22 23 0 ...

The ones that say num are numbers (when R reads a file, a column is typically stored as numeric if decimal values are detected and as integer otherwise; here even the counts are stored as num). The ones that say chr are strings; these are candidates to be key columns, which are not variables themselves but identifiers of the cases. In this case, the first four are identifiers, as are the 7th, 10th and 15th columns (school name, address and phone, respectively). Those columns are not to be analyzed statistically, but may be used for annotating (7th and 15th columns) or for geocoding (10th column). Notice that for these data State is not to be analyzed, as it is a constant (all rows are from WA); it would be if the data covered the whole USA. Then you see several variables identified as factor or ordered factor. These are categorical variables: they can be analyzed statistically, but not in the same way as numbers.


1. Visualization for nominal scales

You can get a clear idea of what a categorical variable holds by producing a simple frequency table:

# absolute values
table(eduwa$LocaleType,exclude = 'nothing')
## 
##   City  Rural Suburb   Town   <NA> 
##    714    505    798    338     72
# relative values
absoluteT=table(eduwa$LocaleType,exclude = 'nothing')
prop.table(absoluteT)
## 
##       City      Rural     Suburb       Town       <NA> 
## 0.29419036 0.20807581 0.32880099 0.13926658 0.02966625
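
The exclude = 'nothing' trick used above works because passing an exclude value that does not contain NA makes table() keep the missing values. Here is a minimal sketch of the more explicit useNA argument; the vector localeToy is made up for illustration and is not part of eduwa.

```r
# hypothetical toy data, standing in for eduwa$LocaleType
localeToy <- factor(c("City", "Rural", NA, "City", "Town", NA))

# the trick used above: exclude a value that never occurs
tabTrick <- table(localeToy, exclude = 'nothing')

# the explicit request: add an NA column only if NAs are present
tabExplicit <- table(localeToy, useNA = 'ifany')

# both tables carry the same counts, missing values included
print(tabTrick)
print(tabExplicit)
```

Either spelling should give the same table; useNA simply states the intention more directly.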

You may want to give a name to the missing values. However, when the column is a factor, you may need something like this:

library(forcats)

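# note: forcats 1.0.0 deprecated fct_explicit_na() in favor of
# fct_na_value_to_level(); this call still runs, but with a warning.
eduwa$LocaleType=fct_explicit_na(eduwa$LocaleType, "Unknown")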
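
If you prefer not to depend on forcats, the same renaming can be sketched in base R. This is only an illustration under the assumption that the column is a factor; localeToy is a made-up vector, not the real eduwa column.

```r
# hypothetical toy factor with missing values
localeToy <- factor(c("City", NA, "Town", NA))

# addNA() turns NA into a proper level of the factor
localeToy2 <- addNA(localeToy)

# rename that NA level
levels(localeToy2)[is.na(levels(localeToy2))] <- "Unknown"

# the missing values now show up under their new name
table(localeToy2)
```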

The basic option for nominal data is a bar plot. Many people tend to use pie charts with categorical data, but this should not be the default option to visualize classification (see this discussion).

Let’s start by calling the library to use:

library(ggplot2)
  • For categorical, create a frequency table as a data frame:
frTable=as.data.frame(table(eduwa$LocaleType))
names(frTable)=c('Type','Count')
  • Create the base object, which is not a plot, just informing the variable to plot:
baseNom= ggplot(data = frTable, 
             aes(x=Type, y=Count)) 
  • Request what “geometry” you want:
barNom=baseNom + geom_bar(stat = 'identity')
barNom

  • For barplots, you may need the proportions instead, so build the table with prop.table() and keep the same geometry:
frTableProp=as.data.frame(prop.table(table(eduwa$LocaleType)))
names(frTableProp)=c('Type','Percent')

baseNomProp= ggplot(data = frTableProp, 
             aes(x=Type, y=Percent))

barNomProp=baseNomProp + geom_bar(stat = 'identity')
barNomProp
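
The frequency-table steps above can also be delegated to ggplot2. The sketch below assumes one row per case and uses a made-up data frame (schoolsToy); when no y aesthetic is given, geom_bar() counts the rows per category on its own.

```r
library(ggplot2)

# hypothetical toy data: one row per school
schoolsToy <- data.frame(Type = c("City", "City", "Town",
                                  "Rural", "City", "Town"))

# no y aesthetic: geom_bar() counts the rows per category itself
barDirect <- ggplot(schoolsToy, aes(x = Type)) + geom_bar()
barDirect

# the counts ggplot computed, one per bar
countsUsed <- layer_data(barDirect)$count
countsUsed
```

Building the table yourself, as done above, is still useful when you want to inspect or reorder the values before plotting.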

You should always keep it simple. Then decorate.

________

2. Visualization for ordinal scales

For this section, we will use the variable that tells us the highest grade offered in a school. A simple exploration gives:

table(eduwa$High.Grade,exclude = 'nothing')
## 
##  PK  KG   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  82   7   6  16  19  45 755 266  11 427  15   7   5 757   9

Being a categorical variable, the default option is again the bar plot:

  • Preparing the data:
ordTable=as.data.frame(table(eduwa$High.Grade,exclude = 'nothing'))
names(ordTable)=c('Grade','Count')
baseOrd = ggplot(ordTable,aes(x=Grade,y=Count))
barOrd=baseOrd + geom_bar(stat = 'identity') 
barOrd

The x-values in this variable have order. That is, there is an increasing level in the values. Whenever we have an ordering, besides concentration we can visualize symmetry: if there is bias towards lower or higher values.

Bar plots help you see concentration and symmetry, but we have an alternative way to clearly detect symmetry, via boxplots:

# boxplots do not use frequency tables

# as.numeric  turns levels of the factor into numbers
baseOrd2 = ggplot(eduwa, aes(y=as.numeric(High.Grade))) 
baseOrdBox = baseOrd2 + geom_boxplot() 

baseOrdBox

You have symmetry when the distance from each whisker to the box is the same and the thick line is in the middle of the box. Here the values show a negative asymmetry, as the tail extends towards the bottom (the lowest values).

Box plots expect a numeric value as an input, but we have an ordered categorical, so we used the as.numeric() function. However, that eliminated the levels we saw in the previous bar plot; we can put the levels back in our plot:

# the labels use the original ordinal levels
ordLabels= levels(eduwa$High.Grade)

baseOrdBox2 = baseOrdBox + scale_y_continuous(labels=ordLabels,breaks=1:15)
baseOrdBox2

Box plots carry important statistical information. The beginning and the end of the box indicate the first quartile (q1, the 25th percentile) and the third quartile (q3, the 75th percentile), and the thicker line in the middle represents the median. From the boxplot, we know:

  • At least 25% of the public schools offer at most 5th grade.
  • At least 50% of the public schools offer at most 8th grade.
  • At least 75% of the public schools offer at most 12th grade. Also, at least 25% of the schools offer 12th grade or higher.

We can find these results with a detailed frequency table; that is, instead of using only the command table as we did before, we can add the cumulative and relative frequencies:

x=eduwa$High.Grade

Freq=table(x)

CumulF=cumsum(table(x))

Relative=100*round(prop.table(table(x)),4)

CumulR=cumsum(Relative)

cbind(Freq, CumulF, Relative, CumulR)
##    Freq CumulF Relative CumulR
## PK   82     82     3.38   3.38
## KG    7     89     0.29   3.67
## 1     6     95     0.25   3.92
## 2    16    111     0.66   4.58
## 3    19    130     0.78   5.36
## 4    45    175     1.85   7.21
## 5   755    930    31.11  38.32
## 6   266   1196    10.96  49.28
## 7    11   1207     0.45  49.73
## 8   427   1634    17.59  67.32
## 9    15   1649     0.62  67.94
## 10    7   1656     0.29  68.23
## 11    5   1661     0.21  68.44
## 12  757   2418    31.19  99.63
## 13    9   2427     0.37 100.00
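
The quartiles read off the cumulative table can also be computed directly. This is a minimal sketch on made-up data (gradesToy), assuming an ordered factor like High.Grade; type = 1 is chosen here because it always returns an observed value, which suits ordinal data.

```r
# hypothetical toy data with the same kind of ordered levels as High.Grade
gradeLevels <- c("PK", "5", "6", "8", "12")
gradesToy <- factor(c("PK", "5", "5", "5", "6", "8", "8", "12", "12", "12"),
                    levels = gradeLevels, ordered = TRUE)

# position of each school's level (1 = lowest level)
codes <- as.numeric(gradesToy)

# type = 1 never interpolates between two levels
qCodes <- quantile(codes, probs = c(0.25, 0.5, 0.75), type = 1)

# translate positions back to the original levels
qLevels <- levels(gradesToy)[qCodes]
qLevels
```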



3. Visualization for integer values

Integers represent counts. They could be represented with bar plots if their frequency table had few different values. For example, the variable Reduced.Lunch reports how many kids in each school get lunch at a reduced price.

# how many unique values
length(unique(eduwa$Reduced.Lunch))
## [1] 172

There are too many different values. So, although R could produce a frequency table and a plot, we should not go for the bar plot.

When the frequency table cannot be our first step, we need to turn to statistical measures that help us understand the behavior of the data:

# median close to mean?
# median and mean far from max or min?
# q1 distance to min is similar to q3 distance to max?
# how many missing?
# how many missing?

summary(eduwa$Reduced.Lunch)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   25.50   33.53   47.00  301.00     131

Let’s take care of missing values, by removing them:

eduwa_Lunch=eduwa[complete.cases(eduwa$Reduced.Lunch),]

The boxplot helps us clearly identify the values obtained from summary:

# boxplots do not use frequency tables
baseInt= ggplot(eduwa_Lunch,aes(y = Reduced.Lunch))  
baseIntBox = baseInt + geom_boxplot() 

baseIntBox
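
To see the numbers behind a boxplot, base R offers boxplot.stats(). The sketch below uses a made-up vector (lunchToy); keep in mind that ggplot2 computes the box from quantiles, so its hinges may differ slightly from the ones reported here.

```r
# hypothetical toy counts, with one very large value
lunchToy <- c(0, 5, 10, 25, 30, 47, 60, 301)

boxToy <- boxplot.stats(lunchToy)

# lower whisker, lower hinge, median, upper hinge, upper whisker
boxToy$stats

# values beyond 1.5 times the box length, drawn as separate points
boxToy$out
```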

The bar plot is not a good option here, as it produces a bar for each unique value in the data, counting how many times that value appeared. Since we have many values, if we want to use bars we need to organize the data into intervals. The histogram is the basic plot when intervals are needed; you can use the basic function:

baseInt2= ggplot(eduwa_Lunch,aes(x = Reduced.Lunch))  
baseIntHist= baseInt2 + geom_histogram()
baseIntHist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
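
The message above asks for an explicit interval width. A minimal sketch, on made-up data (lunchToy) and with an arbitrary width of 10 chosen only for illustration:

```r
library(ggplot2)

# hypothetical toy data, standing in for eduwa_Lunch
lunchToy <- data.frame(Reduced.Lunch = c(0, 3, 8, 12, 25, 26, 47, 51, 120))

# each bar covers 10 students; boundary = 0 makes intervals start at 0
histToy <- ggplot(lunchToy, aes(x = Reduced.Lunch)) +
           geom_histogram(binwidth = 10, boundary = 0)
histToy

# every case falls in exactly one interval
totalCounted <- sum(layer_data(histToy)$count)
```

Which width is "better" depends on the data; it is worth trying a few values and seeing which one shows the shape most clearly.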



4. Visualization for values with decimals

A simple idea of measurement is counting how many times a particular unit fits in the thing being measured; this allows for decimal places and even negative values.

Let’s analyze the variable Student.Teacher.Ratio, but organized by county:

# tapply(variable,group,functionToApply)
tapply(eduwa$Student.Teacher.Ratio, eduwa$County, mean)
##        Adams       Asotin       Benton       Chelan      Clallam 
##           NA           NA           NA           NA           NA 
##        Clark     Columbia      Cowlitz      Douglas        Ferry 
##           NA           NA           NA           NA           NA 
##     Franklin     Garfield        Grant Grays Harbor       Island 
##           NA     17.35000           NA           NA           NA 
##    Jefferson         King       Kitsap     Kittitas    Klickitat 
##           NA           NA           NA           NA           NA 
##        Lewis      Lincoln        Mason     Okanogan      Pacific 
##           NA     11.56000           NA           NA           NA 
## Pend Oreille       Pierce     San Juan       Skagit     Skamania 
##     15.47778           NA           NA           NA     16.37000 
##    Snohomish      Spokane      Stevens     Thurston    Wahkiakum 
##           NA           NA           NA           NA     18.15000 
##  Walla Walla      Whatcom      Whitman       Yakima 
##           NA           NA           NA           NA

Above, I tried to compute the mean for each county, but the function mean() returns a missing value (NA) whenever there is at least one NA among the values. So we first need to drop the missing values in that column:

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
eduwa_ratioST=eduwa[complete.cases(eduwa$Student.Teacher.Ratio),]

meanValuesCounty= eduwa_ratioST  %>%  
                    group_by(County)  %>%  
                        summarize('means'=mean(Student.Teacher.Ratio))
meanValuesCounty
## # A tibble: 39 x 2
##    County   means
##    <chr>    <dbl>
##  1 Adams     14.8
##  2 Asotin    19.1
##  3 Benton    20.4
##  4 Chelan    18.6
##  5 Clallam   19.3
##  6 Clark     19.2
##  7 Columbia  11.3
##  8 Cowlitz   20.4
##  9 Douglas   16.5
## 10 Ferry     16.8
## # … with 29 more rows

Great!
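
As an aside, base R can reach the same kind of result without filtering rows first, because tapply() passes extra arguments on to the function it applies. A minimal sketch on a made-up data frame (ratioToy):

```r
# hypothetical toy data: county A has one missing ratio
ratioToy <- data.frame(County = c("A", "A", "B", "B"),
                       Ratio  = c(10, NA, 20, 30))

# one NA is enough to make the county mean NA
meansWithNA <- tapply(ratioToy$Ratio, ratioToy$County, mean)

# na.rm = TRUE is handed to mean(), which then skips the NA
meansNoNA <- tapply(ratioToy$Ratio, ratioToy$County, mean, na.rm = TRUE)

meansWithNA
meansNoNA
```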

Let’s compute some statistics:

summary(meanValuesCounty$means)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.30   16.42   18.65   17.96   19.41   23.77
# boxplots do not use frequency tables
baseDec= ggplot(meanValuesCounty,aes(y = means))  
baseDecBox = baseDec + geom_boxplot() 

baseDecBox

Now let me plot a histogram of those means:

baseDec2= ggplot(meanValuesCounty,aes(x = means))  
baseDecHist= baseDec2 + geom_histogram(bins=7) # bins 7 (default 30)
baseDecHist



Customizing visualization elements

  • Working on titles

Titles and captions are important. A title can pose the question to be answered by the plot:

titleText='Do we have counties with less than 15 students per teacher (on average)?'
sourceText='Source: US Department of Education'
xaxisText='Average of students-teacher ratio'
yaxisText='Amount of counties'

baseDecHist2= baseDecHist + labs(title=titleText,
                               x = xaxisText, 
                               y = yaxisText,
                               caption = sourceText)

baseDecHist2

Titles can also guide the reader to recognise the purpose of your plot:

# using \n
titleText2='Most schools in WA do not have\nstudents in the Reduced Lunch Program'
sourceText='Source: US Department of Education'
xaxisText='Students in Reduced Lunch Program'
yaxisText='Amount of schools'

baseIntHist2= baseIntHist  + labs(title=titleText2,
                                  x = xaxisText, 
                                  y = yaxisText,
                                  caption = sourceText)

# changing position of titles

baseIntHist3= baseIntHist2 + theme(plot.caption = element_text(hjust = 0), 
                                   plot.title = element_text(hjust = 0.5))

baseIntHist3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

They can suggest a decision:

titleText3='WA needs to fully categorize school locations\n(info from 2018)'
sourceText='Source: US Department of Education'
xaxisText='Location of Schools'
yaxisText='%'

barNomProp2= barNomProp + labs(title=titleText3,
                                  x = xaxisText, 
                                  y = yaxisText,
                                  caption = sourceText)

barNomProp2

  • Changing color

You can use the attributes colour and fill for that purpose.

It works in every previous plot. Here you have the barplot:

baseNom + geom_bar(stat = 'identity',
                   colour='orange', # border
                   fill='white') 

The boxplot:

baseOrd2 + geom_boxplot(colour='green',fill='black') 

And the histogram:

baseDec2 + geom_histogram(bins=7,
                          colour='magenta',
                          fill='yellow')

Notice that the default area has a grid in gray. You can change the theme to make it simpler.

Here you have no grid:

baseDec2 + geom_histogram(bins=7,
                          colour='magenta',
                          fill='yellow') +
           theme_classic()

Here you have a minimal grid with no background color:

baseDec2 + geom_histogram(bins=7,
                          colour='magenta',
                          fill='yellow') +
           theme_minimal()

This one is similar to the previous one, but it has a box around the grid.

baseDec2 + geom_histogram(bins=7,
                          colour='magenta',
                          fill='yellow') +
           theme_light()

  • Working on axes

Sometimes axes need to be reoriented:

baseDecBox2=baseDecBox + coord_flip()
baseDecBox2

The values and their tick marks on the vertical axis are not needed for the last boxplot:

baseDecBox3=baseDecBox2 + 
    theme(axis.text.y = element_blank(), # no values in ticks
          axis.ticks = element_blank())  # no symbol in ticks

baseDecBox3

Axis default values may need to be customized:

# vector of the summary statistics with one decimal place
statVals=round(as.vector(summary(meanValuesCounty$means)),1)

baseDecBox4=baseDecBox3 + 
            # customize tick values
            scale_y_continuous(breaks=statVals, 
                               limits = c(10, 25)) +
            # change angle of tick values
            theme(axis.text.x = element_text(angle=45),
                  panel.grid.minor =   element_blank()) # grid only on ticks

baseDecBox4

You may need percents instead of decimals:

library(scales)
## Warning: package 'scales' was built under R version 3.4.4
barNomProp2 + scale_y_continuous(labels=scales::percent)
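
If you want to control how many decimals the percents show, the scales package accepts an accuracy argument. A small sketch with made-up shares (sharesToy):

```r
library(scales)

# hypothetical shares, as proportions between 0 and 1
sharesToy <- c(0.294, 0.208, 0.0297)

# accuracy = 0.1 asks for one decimal place in the percent
labelsToy <- percent(sharesToy, accuracy = 0.1)
labelsToy

# the same idea as a labelling function for an axis, for example:
# barNomProp2 + scale_y_continuous(labels = label_percent(accuracy = 1))
```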

  • Reference lines

You may want to add a line to represent a particular value:

meanV=round(mean(meanValuesCounty$means),2)
baseDecHist3=baseDecHist2 + geom_vline(xintercept = meanV,
                            linetype="dotted", 
                            color = "yellow", 
                            size=1.5) # for lines, ggplot2 >= 3.4.0 prefers linewidth=1.5
baseDecHist3

  • Annotation

Reference lines are more effective if we add text:

baseDecHist4=baseDecHist3+ annotate("text", x = meanV+0.5,y=10,
                                    angle = 90, 
                                    label = paste("MEAN",meanV),
                                    color="yellow")  
                
baseDecHist4

But annotation can do more than making lines explicit. Let me count how many counties have an average ratio less than 15:

(count_Less15=nrow(meanValuesCounty[meanValuesCounty$means<15,]))
## [1] 5
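
Counting rows of a subset works here because the means have no missing values. As a hedged aside, summing the logical condition is a shorter route and lets you decide what to do with NAs; the vector meansToy below is made up for illustration.

```r
# hypothetical toy means, one of them missing
meansToy <- c(11.3, 14.8, 16.5, NA, 19.1, 20.4)
dfToy <- data.frame(means = meansToy)

# TRUE counts as 1, so the sum is the number of cases below 15
countBySum <- sum(dfToy$means < 15, na.rm = TRUE)

# subsetting keeps a row for the NA comparison, inflating the count
countByRows <- nrow(dfToy[dfToy$means < 15, , drop = FALSE])

countBySum
countByRows
```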

Let me annotate using a rectangular area:

baseDecHist5= baseDecHist4 + annotate("rect", 
                                      #points for rectangle:
                                      xmin = 10, xmax = 15, 
                                      ymin = 0, ymax = 5,
                                      fill='red',alpha = .2) +
                             annotate("text", x= 12.5, y = 4,
                                      label=paste(count_Less15,'counties'))
baseDecHist5

  • Alternatives

We can use dots instead of bars (position instead of length):

baseNomProp2= baseNomProp + geom_point() 
baseNomProp2

We can add lines to reinforce distance:

baseNomProp2+ geom_segment(aes(y = 0,
                               x = Type,
                               yend = Percent,
                               xend = Type), color = "grey50")

We could reorder the categories as they are not ordinal:

frTableProp[order(frTableProp$Percent),]
##      Type    Percent
## 5 Unknown 0.02966625
## 4    Town 0.13926658
## 2   Rural 0.20807581
## 1    City 0.29419036
## 3  Suburb 0.32880099

Using that order, you can get:

# saving new order:
tableFreqO=frTableProp[order(frTableProp$Percent),]


baseNomRe = ggplot(tableFreqO, aes(Type,Percent)) 
lollipop1=baseNomRe + geom_segment(aes(y = 0, 
                                   x = Type, 
                                   yend = Percent, 
                                   xend = Type), color = "gray") 
lollipop2 = lollipop1 + geom_point()
lollipop2 + scale_x_discrete(limits=tableFreqO$Type) # key element
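
An alternative to setting the limits of the scale is to reorder the categories inside the aesthetics with reorder(). This is only a sketch on a made-up data frame (propToy), not a replacement for the approach above.

```r
library(ggplot2)

# hypothetical toy proportions
propToy <- data.frame(Type = c("City", "Rural", "Suburb", "Town"),
                      Percent = c(0.30, 0.20, 0.35, 0.15))

# reorder() sorts the categories by their Percent value
lollipopToy <- ggplot(propToy, aes(x = reorder(Type, Percent), y = Percent)) +
    geom_segment(aes(xend = reorder(Type, Percent),
                     y = 0, yend = Percent), color = "gray") +
    geom_point() +
    labs(x = 'Type')
lollipopToy

# the order the axis will follow, from lowest to highest share
orderUsed <- levels(reorder(propToy$Type, propToy$Percent))
orderUsed
```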

These graphs are called lollipops. We can use them to represent the direction of the distance from a particular reference line.

For example, if we have four locations (leaving Unknown aside), the uniform share will be 25%. Then we can compute a new column, gap:

# new variable
tableFreqO$gap=tableFreqO$Percent-0.25 # 0.25 is uniform share
head(tableFreqO)
##      Type    Percent         gap
## 5 Unknown 0.02966625 -0.22033375
## 4    Town 0.13926658 -0.11073342
## 2   Rural 0.20807581 -0.04192419
## 1    City 0.29419036  0.04419036
## 3  Suburb 0.32880099  0.07880099
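
The 0.25 above is typed by hand. A small sketch of how the uniform share could be derived from the number of categories instead, using made-up shares (sharesToy) that add up to 1:

```r
# hypothetical shares for four locations
sharesToy <- c(Town = 0.15, Rural = 0.20, City = 0.30, Suburb = 0.35)

# with k categories, the uniform share is 1/k
uniformShare <- 1 / length(sharesToy)

# positive gap: above the uniform share; negative gap: below it
gapToy <- sharesToy - uniformShare
gapToy
```

When the shares add up to 1, the gaps add up to zero, which is a quick check that the reference value was computed for the right number of categories.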

Let’s plot this column, instead of Percent:

# plot the new variable
base = ggplot(tableFreqO, aes(Type,gap)) 

lollipopGap=base + geom_segment(aes(y = 0, 
                                   x = Type, 
                                   yend = gap, 
                                   xend = Type), color = "gray") 
lollipopGap1 = lollipopGap + geom_point()
lollipopGap2 = lollipopGap1 + 
    scale_x_discrete(limits=tableFreqO$Type) # key element
lollipopGap2

We can create another column, a flag to signal if the gap is negative or positive:

# a new column for color
tableFreqO$flag=tableFreqO$gap>0 # TRUE when the gap is positive
head(tableFreqO)
##      Type    Percent         gap  flag
## 5 Unknown 0.02966625 -0.22033375 FALSE
## 4    Town 0.13926658 -0.11073342 FALSE
## 2   Rural 0.20807581 -0.04192419 FALSE
## 1    City 0.29419036  0.04419036  TRUE
## 3  Suburb 0.32880099  0.07880099  TRUE

I will redo the previous plot, but using the extra column to give color to the line:

# add new aesthetics 'color'
base = ggplot(tableFreqO, aes(Type,gap)) 
# no fixed color here: a fixed color would override the 'flag' mapping
lollipopGap1=base + geom_segment(aes(y = 0, 
                                   x = Type, 
                                   yend = gap, 
                                   xend = Type,color=flag)) 

lollipopGap2 = lollipopGap1 + geom_point(aes(color=flag)) #adding color
lollipopGap3 = lollipopGap2 + scale_x_discrete(limits=tableFreqO$Type) 
lollipopGap3

Since color is mapped to a variable, ggplot will create a legend to explain what this third dimension means in the two-dimensional plot.

Let me annotate the last plot:

lollipopGap4= lollipopGap3 + 
              geom_text(aes(color=flag,label = round(gap,3)),
                        nudge_x=0.3) # push text to the right

lollipopGap4

The legend is drawing two symbols, one for the color of the text and one for the color of the dot. We can alter the previous code to avoid that:

lollipopGap4= lollipopGap3 + 
              geom_text(aes(color=flag,label = round(gap,3)),
                        nudge_x=0.3,
                        show.legend = FALSE) 

lollipopGap4

Another alternative to the histogram is the density plot. We had this:

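#baseDec2= ggplot(meanValuesCounty,aes(x = means))  
# restoring the axis texts, as they were overwritten for other plots:
xaxisText='Average of students-teacher ratio'
yaxisText='Amount of counties'
baseDecHist= baseDec2 + geom_histogram(bins=7) 
baseDecHist2= baseDecHist + labs(title=titleText,
                               x = xaxisText, 
                               y = yaxisText,
                               caption = sourceText)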

baseDecHist2

Then, we need a couple of steps. First, represent the y values as density:

# after_stat() needs ggplot2 >= 3.3.0; older versions use ..density..
baseDecHistDen= baseDec2 + geom_histogram(aes(y = after_stat(density)),
                                       bins=7) 

baseDecHistDen2= baseDecHistDen + labs(title=titleText,
                               x = xaxisText, 
                               y = 'density',
                               caption = sourceText)

baseDecHistDen2

And now, plot the density:

baseDecHistDen3 = baseDecHistDen2 + geom_density()
baseDecHistDen3

We can improve it:

baseDecHistDen3 = baseDecHistDen2 + geom_density(alpha = .2, 
                                                 fill="pink")
baseDecHistDen3