Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 6: Spatial and Multivariate Plotting


  1. Bivariate Plots.
  2. Multivariate Plots.
  3. Making maps.

Bivariate Plots

We analyze two variables to find out if there might be some kind of association between them. Even though that may be difficult to clearly identify, bivariate analysis still helps reveal signs of association that may serve at least to raise concern.

This time, I will use the data about crime from the Seattle Open Data portal:

link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))

Let’s see what kind of data we have:

str(crime,width = 70,strict.width='cut')
## 'data.frame':    499698 obs. of  17 variables:
##  $ Report.Number              : chr  "20130000244104" "201300002429"..
##  $ Occurred.Date              : Date, format: "2013-07-09" ...
##  $ year                       : num  2013 2013 2013 2013 2013 ...
##  $ month                      : num  7 7 7 7 7 7 7 7 7 7 ...
##  $ weekday                    : Ord.factor w/ 7 levels "Monday"<"Tu"..
##  $ Occurred.Time              : num  1930 1917 1900 1900 1846 ...
##  $ Occurred.DayTime           : Ord.factor w/ 4 levels "day"<"after"..
##  $ Reported.Date              : Date, format: "2013-07-10" ...
##  $ Reported.Time              : num  1722 2052 35 1258 1846 ...
##  $ DaysToReport               : num  1 0 1 1 0 0 0 0 1 0 ...
##  $ crimecat                   : Factor w/ 20 levels "AGGRAVATED ASS"..
##  $ Crime.Subcategory          : Factor w/ 30 levels "AGGRAVATED ASS"..
##  $ Primary.Offense.Description: Factor w/ 144 levels "ADULT-VULNERA"..
##  $ Precinct                   : Factor w/ 5 levels "EAST","NORTH",....
##  $ Sector                     : Factor w/ 23 levels "6804","9512",....
##  $ Beat                       : Factor w/ 64 levels "B1","B2","B3",...
##  $ Neighborhood               : Factor w/ 58 levels "ALASKA JUNCTIO"..

Categorical association

The main way to organize these relationships are the contingency tables. Let’s select a couple of categorical variables:

(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))
##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         3564      5366    4884  7501
##   ARSON                       196       167     191   486
##   BURGLARY                  24139     22288   14121 16082
##   CAR PROWL                 26740     38273   42595 34839
##   DISORDERLY CONDUCT           41        81      67    79
##   DUI                         706       939    2038  8522
##   FAMILY OFFENSE-NONVIOLENT  1748      2516    1217  1120
##   GAMBLE                        4         4       7     2
##   HOMICIDE                     41        46      49   131
##   LIQUOR LAW VIOLATION        112       491     410   606
##   LOITERING                    20        31      25     9
##   NARCOTIC                   2415      6416    3924  4109
##   PORNOGRAPHY                  65        53      17    31
##   PROSTITUTION                115       675    1425  1340
##   RAPE                        332       318     354   854
##   ROBBERY                    2584      4737    4139  5372
##   SEX OFFENSE-OTHER          1501      1759    1014  1776
##   THEFT                     38687     64868   38980 28410
##   TRESPASS                   4848      5184    2598  3289
##   WEAPON                      735      1445     947  1624

The table above shows counts, but in most situations, it is important to see relative values:

# using "pipes" to help readability:
library(magrittr)

CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
        prop.table() %>% #compute proportion and then...
        "*"(100)%>% # multiply by 100 and then...
        round(2) #...round to to decimals

# you get:
CrimeTotal
##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         0.71      1.07    0.98  1.50
##   ARSON                      0.04      0.03    0.04  0.10
##   BURGLARY                   4.83      4.46    2.83  3.22
##   CAR PROWL                  5.35      7.66    8.53  6.98
##   DISORDERLY CONDUCT         0.01      0.02    0.01  0.02
##   DUI                        0.14      0.19    0.41  1.71
##   FAMILY OFFENSE-NONVIOLENT  0.35      0.50    0.24  0.22
##   GAMBLE                     0.00      0.00    0.00  0.00
##   HOMICIDE                   0.01      0.01    0.01  0.03
##   LIQUOR LAW VIOLATION       0.02      0.10    0.08  0.12
##   LOITERING                  0.00      0.01    0.01  0.00
##   NARCOTIC                   0.48      1.28    0.79  0.82
##   PORNOGRAPHY                0.01      0.01    0.00  0.01
##   PROSTITUTION               0.02      0.14    0.29  0.27
##   RAPE                       0.07      0.06    0.07  0.17
##   ROBBERY                    0.52      0.95    0.83  1.08
##   SEX OFFENSE-OTHER          0.30      0.35    0.20  0.36
##   THEFT                      7.75     12.99    7.80  5.69
##   TRESPASS                   0.97      1.04    0.52  0.66
##   WEAPON                     0.15      0.29    0.19  0.33

Those tables show total counts or percents. However, when a table tries to hypothesize a relationship, you should have the independent variable in the columns, and the dependent one in the rows; then, the percent should be calculated by column, to see how the levels of the dependent variable varies by each level of the independent one, and compare along rows.

CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
         prop.table(margin = 2)%>%   # 2 is % by column
         "*"(100)%>%
         round(3)

# you get:
CrimeCol
##                            
##                                day afternoon evening  night
##   AGGRAVATED ASSAULT         3.282     3.447   4.104  6.456
##   ARSON                      0.180     0.107   0.161  0.418
##   BURGLARY                  22.229    14.319  11.866 13.842
##   CAR PROWL                 24.624    24.588  35.794 29.987
##   DISORDERLY CONDUCT         0.038     0.052   0.056  0.068
##   DUI                        0.650     0.603   1.713  7.335
##   FAMILY OFFENSE-NONVIOLENT  1.610     1.616   1.023  0.964
##   GAMBLE                     0.004     0.003   0.006  0.002
##   HOMICIDE                   0.038     0.030   0.041  0.113
##   LIQUOR LAW VIOLATION       0.103     0.315   0.345  0.522
##   LOITERING                  0.018     0.020   0.021  0.008
##   NARCOTIC                   2.224     4.122   3.297  3.537
##   PORNOGRAPHY                0.060     0.034   0.014  0.027
##   PROSTITUTION               0.106     0.434   1.197  1.153
##   RAPE                       0.306     0.204   0.297  0.735
##   ROBBERY                    2.380     3.043   3.478  4.624
##   SEX OFFENSE-OTHER          1.382     1.130   0.852  1.529
##   THEFT                     35.626    41.674  32.756 24.453
##   TRESPASS                   4.464     3.330   2.183  2.831
##   WEAPON                     0.677     0.928   0.796  1.398

The complexity of two variables requires plots, as tables like these will not allow you to discover association patterns easily, even though they are already a summary of two columns. However, you must check the data format the plotting functions require, as most plots will use the contingency table as input (not the raw data).

Let me try a basic bar plot with the contingency table as input:

barplot(CrimeCol)

This plot will need a lot of work, so the base capabilities of R may not be a good strategy; and as before, we will turn to ggplot.

However, when using alternative/more specialized plotting features you may need to convert your table into a dataframe of frequencies, let me create the base proportions table:

df.T=as.data.frame(CrimeCol) # table of proportion based on total
# YOU GET:
head(df.T)
##                 Var1 Var2   Freq
## 1 AGGRAVATED ASSAULT  day  3.282
## 2              ARSON  day  0.180
## 3           BURGLARY  day 22.229
## 4          CAR PROWL  day 24.624
## 5 DISORDERLY CONDUCT  day  0.038
## 6                DUI  day  0.650

We should rename the above table:

names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)
##                Crime Daytime Percent
## 1 AGGRAVATED ASSAULT     day   3.282
## 2              ARSON     day   0.180
## 3           BURGLARY     day  22.229
## 4          CAR PROWL     day  24.624
## 5 DISORDERLY CONDUCT     day   0.038
## 6                DUI     day   0.650

A first option you may have is to reproduce the table:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
base = ggplot(df.T, aes(Daytime,Crime)) 

# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray") 

# add value of Percent as label
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
                                    nudge_x = 0.1, # push the value to the right on the horizontal
                                    size=2)
tablePlot2

…some more work:

# improve the size of dots
tablePlot3 = tablePlot2 + scale_size_continuous(range=c(0,10)) #change 10?

# less ink
tablePlot4 = tablePlot3 + theme_minimal() 

# no legend
tablePlot4 + theme(legend.position="none") 

The plot looks nice, but unless the differences are clearly cut, you may see more noise than information, which distracts and delays decision making. Keep in mind that length of bars are easier to compare than circle areas. You need a barplot:

And, the original relationship Input-Output table can be plotted like this:

base  = ggplot(df.T, aes(x = Crime, y = Percent ) ) 
bars1 = base + geom_bar( stat = "identity" )
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 
bars2 + coord_flip()

The type of crime is not ordinal, then we could reorder the bars:

base  = ggplot(df.T, aes(x = reorder(Crime, Percent), # reordering Crime based on percent value
                         y = Percent ) ) 

bars1 = base + geom_bar( stat = "identity" )

bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 

bars2 + coord_flip() + theme(axis.text.y = element_text(size=4,angle = 45)) 

Numerical correlation

The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but I will produce two by aggregating the original data set using Neigborhood:

  • Aggregating days to report and neighborhood:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
daysByNeigh= crime  %>%  
                    filter(!is.na(Neighborhood)) %>%
                    group_by(Neighborhood)  %>%  
                    summarize(DaysToReport=mean(DaysToReport))  

str(daysByNeigh)
## Classes 'tbl_df', 'tbl' and 'data.frame':    58 obs. of  2 variables:
##  $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ DaysToReport: num  8.97 14.46 7.79 5.26 5.01 ...
  • Aggregating share of crimes by neighborhood
crimesByNeigh=crime  %>%  
                    filter(!is.na(Neighborhood)) %>%
                    group_by(Neighborhood)  %>%  
                    summarize(crimeShare=length(crimecat)/nrow(crime))#  %>% 
    

str(crimesByNeigh)
## Classes 'tbl_df', 'tbl' and 'data.frame':    58 obs. of  2 variables:
##  $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ crimeShare  : num  0.01338 0.00483 0.02114 0.02941 0.02971 ...
  • Merging the two dataframes: Since both data frames have the same neighboorhood, we can make one data frame by merging them:
num_num=merge(daysByNeigh,crimesByNeigh) # 'row name' is the "key"
head(num_num)
##      Neighborhood DaysToReport  crimeShare
## 1 ALASKA JUNCTION     8.968437 0.013378080
## 2            ALKI    14.463768 0.004832919
## 3   BALLARD NORTH     7.788203 0.021136767
## 4   BALLARD SOUTH     5.262384 0.029409763
## 5        BELLTOWN     5.012865 0.029711946
## 6      BITTERLAKE     7.442241 0.019073520

Notice that Neighborhood is a factor, since I will use it to annotate it would be better to turn it into a text:

num_num$Neighborhood=as.character(num_num$Neighborhood)

Once we have the data organized, the clear option is the scatterplot:

base = ggplot(num_num, aes(DaysToReport,crimeShare)) 
plot1= base +  geom_point() 
plot1

We can improve the plot, this time introducing ggrepel:

library(ggrepel)
base = ggplot(num_num, aes(DaysToReport,crimeShare,
                           label=Neighborhood)) # you need this aesthetics!
plot1= base +  geom_point() 
plot1 + geom_text_repel()

We can limit the labels, annotating the ones that represent at least 5% of the crimes in the city:

plot1 + geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
                                         as.character(num_num$Neighborhood), "")))

Notice the difference without ggrepel:

plot1 + geom_text(aes(label=ifelse(crimeShare>=0.05,num_num$Neighborhood, "")))