Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 5: Intro to Visualization

_____

Contents:

Part 1:

  1. Nominal Visualization.
  2. Ordinal Visualization.
  3. Integers Visualization.
  4. Decimals Visualization.

Part 2: Customizing visual elements.

  1. Titles
  2. Changing color
  3. Working on axes
  4. Reference lines
  5. Annotation
  6. Alternatives

We are very familiar with data frames and data types. From that knowledge we will learn how information can be obtained using a visual approach.

Let me get some data from the website of the Common Core of Data from the US Department of Education. There you can get a data set with detailed information on public schools in the state of Washington:

link='https://github.com/EvansDataScience/VisualAnalytics_2_tabularData/raw/master/data/eduwa.rda'

#getting the data TABLE from the file in the cloud:
load(file=url(link))
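
As a side note, load() restores objects under the names they had when they were saved, so it can help to check what actually arrived. The sketch below is only an illustration: it uses a made-up data frame (toySchools) saved to a temporary file, not the course file, and relies on load() invisibly returning the names of the objects it restores.

```r
# Hypothetical toy data, standing in for a real .rda file
toySchools <- data.frame(name = c("A", "B"), students = c(120, 340))

# Save it to a temporary .rda file
tmpFile <- tempfile(fileext = ".rda")
save(toySchools, file = tmpFile)

# Load into a separate environment to avoid overwriting workspace objects
sandbox <- new.env()
loadedNames <- load(tmpFile, envir = sandbox)

loadedNames              # names of the restored objects
dim(sandbox$toySchools)  # rows and columns of the restored data frame
```

Loading into a separate environment is a design choice for the sketch; in the session itself the data frame is loaded straight into the workspace.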

Generally speaking, each column holds either categorical or numerical data, and whatever your question, you first need to know how the variable you are planning to use has been encoded:

# width = 70 together with strict.width = 'cut' means
# you do not want to see more than 70 characters per row.

str(eduwa,width = 70,strict.width='cut')
## 'data.frame':    2427 obs. of  24 variables:
##  $ NCES.School.ID       : chr  "530486002475" "530270001270" "53091"..
##  $ State.School.ID      : chr  "WA-31025-1656" "WA-06114-1646" "WA-"..
##  $ NCES.District.ID     : chr  "5304860" "5302700" "5309100" "53000"..
##  $ State.District.ID    : chr  "WA-31025" "WA-06114" "WA-34033" "WA"..
##  $ Low.Grade            : Ord.factor w/ 14 levels "PK"<"KG"<"1"<..: ..
##  $ High.Grade           : Ord.factor w/ 15 levels "PK"<"KG"<"1"<..: ..
##  $ School.Name          : chr  "10th Street School" "49th Street Ac"..
##  $ District             : chr  "Marysville School District" "Evergr"..
##  $ County               : chr  "Snohomish" "Clark" "Thurston" "Gray"..
##  $ Street.Address       : chr  "7204 27th Ave NE" "14619B NE 49th S"..
##  $ City                 : chr  "Marysville" "Vancouver" "Tumwater" "..
##  $ State                : chr  "WA" "WA" "WA" "WA" ...
##  $ ZIP                  : chr  "98271" "98682" "98512" "98520" ...
##  $ ZIP.4-digit          : chr  NA "6308" NA "5510" ...
##  $ Phone                : chr  "(360)965-0400" "(360)604-6700" "(36"..
##  $ Locale.Code          : Factor w/ 12 levels "11","12","13",..: 5 2..
##  $ LocaleType           : Factor w/ 4 levels "City","Rural",..: 3 1 ..
##  $ LocaleSub            : Factor w/ 12 levels "City: Small",..: 5 2 ..
##  $ Charter              : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1..
##  $ Title.I.School       : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1..
##  $ Title.1.School.Wide  : Factor w/ 2 levels "No","Yes": 2 NA NA 2 N..
##  $ Student.Teacher.Ratio: num  23.4 8.4 21.5 15.9 6.5 15.3 NA 16.3 1..
##  $ Free.Lunch           : num  28 53 169 292 12 411 48 102 101 268 ...
##  $ Reduced.Lunch        : num  3 9 40 10 4 23 12 22 23 0 ...

The columns marked num are numbers. R shows num for values stored as double-precision numeric, even when every value is whole (as in Free.Lunch), and int only for columns stored as integer. The columns marked chr are strings. Several of them are key columns: they are not variables themselves but identifiers of the cases. Here the first four columns are identifiers, as are the 7th, 10th and 15th (school name, address and phone, respectively). Those columns are not to be analyzed statistically, but they may be used for annotating (7th and 15th columns) or for geocoding (10th column). Notice that in these data State is not to be analyzed because it is a constant (all rows are from WA), although it would be if the data covered the whole USA. Finally, several variables are identified as Factor or Ord.factor; these are categorical variables, which can be analyzed statistically but not in the same way as numbers.
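
If you want to check the encoding of each column programmatically rather than by reading str(), a minimal sketch follows. The data frame toy and its column names are made up for illustration; only class(), sapply() and is.integer() from base R are used.

```r
# Hypothetical toy columns mirroring the kinds seen in str(eduwa)
toy <- data.frame(
  id      = c("s1", "s2", "s3"),             # identifier (chr)
  lunch   = c(28, 53, 169),                  # whole values, stored as double
  charter = factor(c("No", "Yes", "No")),    # nominal
  grade   = factor(c("KG", "5", "12"),
                   levels = c("KG", "5", "12"),
                   ordered = TRUE),          # ordinal
  stringsAsFactors = FALSE
)

# First class of each column: character, numeric, factor, ordered
kinds <- sapply(toy, function(col) class(col)[1])
kinds

# Whole numbers typed without the L suffix are still double, not integer
is.integer(toy$lunch)
is.integer(c(28L, 53L, 169L))
```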


1. Visualization for nominal scales

You can get a clear idea of what a categorical variable contains by producing a simple frequency table:

# absolute values ('nothing' is not a level, so no category is dropped and NA is counted)
table(eduwa$LocaleType,exclude = 'nothing')
## 
##   City  Rural Suburb   Town   <NA> 
##    714    505    798    338     72
# relative values
absoluteT=table(eduwa$LocaleType,exclude = 'nothing')
prop.table(absoluteT)
## 
##       City      Rural     Suburb       Town       <NA> 
## 0.29419036 0.20807581 0.32880099 0.13926658 0.02966625
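
A more explicit way to ask table() for the missing values is its useNA argument. The sketch below uses a made-up factor (localeToy), not the eduwa column, to show the difference between the default and useNA = "always".

```r
# Hypothetical toy factor with one missing value
localeToy <- factor(c("City", "Town", NA, "City", "Rural"))

# By default, table() silently drops NA
table(localeToy)

# useNA = "always" adds an explicit <NA> cell, even when its count is zero
countsToy <- table(localeToy, useNA = "always")
countsToy

# Proportions now account for the missing share too
prop.table(countsToy)
```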

You may want to give the missing values a name. Because the column is a factor, NA cannot simply be overwritten with a label that is not already a level; you need something like this:

library(forcats)

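# note: recent forcats releases (1.0.0 and later) deprecate this function
# in favor of fct_na_value_to_level()
eduwa$LocaleType=fct_explicit_na(eduwa$LocaleType, "Unknown")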
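
If you would rather not depend on forcats, the same relabeling can be sketched in base R. This is an alternative, not the approach used in the session, and localeToy2 is a made-up factor.

```r
# Hypothetical toy factor with a missing value
localeToy2 <- factor(c("City", NA, "Town", "City"))

# addNA() turns NA into a proper (last) level
withNA <- addNA(localeToy2)

# Rename that NA level to a readable label
levels(withNA)[is.na(levels(withNA))] <- "Unknown"

table(withNA)
```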

The basic option for nominal data is a bar plot. Many people reach for pie charts with categorical data, but a pie chart should not be the default way to visualize a classification (see this discussion).

Let’s start by calling the library to use:

library(ggplot2)
  • For categorical data, create a frequency table as a data frame:
frTable=as.data.frame(table(eduwa$LocaleType))
names(frTable)=c('Type','Count')
  • Create the base object, which is not yet a plot; it only declares which variables to plot:
baseNom= ggplot(data = frTable, 
             aes(x=Type, y=Count)) 
  • Request what “geometry” you want:
barNom=baseNom + geom_bar(stat = 'identity')
barNom

  • For barplots, you may need the proportions instead, so prepare the data like this:
frTableProp=as.data.frame(prop.table(table(eduwa$LocaleType)))
# note: these values are proportions (0 to 1), despite the column name
names(frTableProp)=c('Type','Percent')

baseNomProp= ggplot(data = frTableProp, 
             aes(x=Type, y=Percent))

barNomProp=baseNomProp + geom_bar(stat = 'identity')
barNomProp
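
Since prop.table() returns shares between 0 and 1, you may want true percentages for labeling. Here is a minimal sketch in base R; the counts are copied from the LocaleType frequency table shown earlier (with the missing values labeled Unknown), and the object names are made up for illustration.

```r
# Counts copied from the LocaleType frequency table shown earlier
localeCounts <- c(City = 714, Rural = 505, Suburb = 798, Town = 338, Unknown = 72)

# Proportions are shares of the total and add up to 1
localeProps <- prop.table(localeCounts)

# Percentages are the same shares on a 0 to 100 scale
localePcts <- round(100 * localeProps, 1)
localePcts
```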

You should always keep it simple. Then decorate.

________

2. Visualization for ordinal scales

For this section, we will use the variable that tells us the highest grade offered in a school. A simple exploration gives:

table(eduwa$High.Grade,exclude = 'nothing')
## 
##  PK  KG   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  82   7   6  16  19  45 755 266  11 427  15   7   5 757   9

Since this is a categorical variable, the default option is again the bar plot:

  • Preparing the data and plotting:
ordTable=as.data.frame(table(eduwa$High.Grade,exclude = 'nothing'))
names(ordTable)=c('Grade','Count')
baseOrd = ggplot(ordTable,aes(x=Grade,y=Count))
barOrd=baseOrd + geom_bar(stat = 'identity') 
barOrd

The x-values in this variable are ordered: the levels increase from left to right. Whenever we have an ordering, besides concentration we can also assess symmetry, that is, whether the values lean towards the lower or the higher levels.
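
Because the levels are ordered, cumulative shares are meaningful, which is not the case for nominal data. The sketch below is one way to see where the schools concentrate; the counts are copied from the High.Grade table above, and the object names are made up for illustration.

```r
# Counts copied from the High.Grade frequency table shown above
gradeLevels <- c("PK", "KG", 1:13)
gradeCounts <- c(82, 7, 6, 16, 19, 45, 755, 266, 11, 427, 15, 7, 5, 757, 9)
names(gradeCounts) <- gradeLevels

# Cumulative share of schools at or below each level
cumShare <- cumsum(gradeCounts) / sum(gradeCounts)
round(cumShare, 3)

# First level at which at least half of the schools are covered
medianLevel <- names(cumShare)[which(cumShare >= 0.5)[1]]
medianLevel
```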

Bar plots help you see concentration and symmetry, but we have an alternative way to clearly detect symmetry, via boxplots:

# boxplots do not use frequency tables

# as.numeric  turns levels of the factor into numbers
baseOrd2 = ggplot(eduwa, aes(y=as.numeric(High.Grade))) 
baseOrdBox = baseOrd2 + geom_boxplot() 

baseOrdBox

You have symmetry when both whiskers are the same distance from the box and the thick line (the median) is in the middle of the box. Here the values show negative asymmetry, as the tail points towards the bottom (the lowest values).
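
To put numbers on that reading, you can rebuild the numeric codes from the frequency table and ask for the five-number summary. This is a sketch under the assumption that the codes 1 to 15 match what as.numeric() assigns to the levels PK, KG, 1, ..., 13; fivenum() returns hinges, which for these data should coincide with the quartiles the box plot draws.

```r
# Counts copied from the High.Grade frequency table; positions 1..15 are
# the codes as.numeric() assigns to the levels PK, KG, 1, ..., 13
gradeCounts <- c(82, 7, 6, 16, 19, 45, 755, 266, 11, 427, 15, 7, 5, 757, 9)
gradeCodes <- rep(seq_along(gradeCounts), times = gradeCounts)

# Minimum, lower hinge, median, upper hinge, maximum
boxStats <- fivenum(gradeCodes)
boxStats

# Distance from the box to each extreme
lowerReach <- boxStats[2] - boxStats[1]
upperReach <- boxStats[5] - boxStats[4]
c(lower = lowerReach, upper = upperReach)
```

A much longer lower reach than upper reach is what the plot shows as a tail towards the bottom.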

Box plots expect a numeric input, but we have an ordered categorical variable, so we used the as.numeric() function. However, that dropped the level labels we saw in the previous bar plot; we can put them back on the axis:

# the labels use the original ordinal levels
ordLabels= levels(eduwa$High.Grade)

baseOrdBox2 = baseOrdBox + scale_y_continuous(labels=ordLabels,breaks=1:15)
baseOrdBox2
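
The relabeling above works because as.numeric() on a factor returns level positions, not the printed labels. A small sketch with a made-up ordered factor (gradeToy) shows the mapping in both directions:

```r
# Hypothetical ordered factor using a few of the High.Grade labels
gradeToy <- factor(c("5", "PK", "12", "5"),
                   levels = c("PK", "KG", "5", "12"),
                   ordered = TRUE)

# as.numeric() returns the position of each level, not the printed label
codes <- as.numeric(gradeToy)
codes

# The labels can be recovered by indexing the levels with those positions
levels(gradeToy)[codes]
```

Note that the label "5" becomes the code 3 here, so the numbers on an unlabeled axis should never be read as grades; that is why breaks and labels are supplied together.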