Session 6: Spatial and Multivariate Plotting

Bivariate Plots.
- Categorical association
- Numerical correlation
Multivariate Plots.
- Without Dimensionality Reduction
- With Dimensionality Reduction
Making maps.

Bivariate Plots

We analyze two variables to find out if there might be some kind of association between them. Even though that may be difficult to clearly identify, bivariate analysis still helps reveal signs of association that may serve at least to raise concern.

This time, I will use the data about crime from the Seattle Open Data portal:

link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))

Let’s see what kind of data we have:

str(crime,width = 70,strict.width='cut')

## 'data.frame':    499698 obs. of  17 variables:
##  $ Report.Number              : chr  "20130000244104" "201300002429"..
##  $ Occurred.Date              : Date, format: "2013-07-09" ...
##  $ year                       : num  2013 2013 2013 2013 2013 ...
##  $ month                      : num  7 7 7 7 7 7 7 7 7 7 ...
##  $ weekday                    : Ord.factor w/ 7 levels "Monday"<"Tu"..
##  $ Occurred.Time              : num  1930 1917 1900 1900 1846 ...
##  $ Occurred.DayTime           : Ord.factor w/ 4 levels "day"<"after"..
##  $ Reported.Date              : Date, format: "2013-07-10" ...
##  $ Reported.Time              : num  1722 2052 35 1258 1846 ...
##  $ DaysToReport               : num  1 0 1 1 0 0 0 0 1 0 ...
##  $ crimecat                   : Factor w/ 20 levels "AGGRAVATED ASS"..
##  $ Crime.Subcategory          : Factor w/ 30 levels "AGGRAVATED ASS"..
##  $ Primary.Offense.Description: Factor w/ 144 levels "ADULT-VULNERA"..
##  $ Precinct                   : Factor w/ 5 levels "EAST","NORTH",....
##  $ Sector                     : Factor w/ 23 levels "6804","9512",....
##  $ Beat                       : Factor w/ 64 levels "B1","B2","B3",...
##  $ Neighborhood               : Factor w/ 58 levels "ALASKA JUNCTIO"..

Categorical association

The main way to organize these relationships are the contingency tables. Let’s select a couple of categorical variables:

(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))

##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         3564      5366    4884  7501
##   ARSON                       196       167     191   486
##   BURGLARY                  24139     22288   14121 16082
##   CAR PROWL                 26740     38273   42595 34839
##   DISORDERLY CONDUCT           41        81      67    79
##   DUI                         706       939    2038  8522
##   FAMILY OFFENSE-NONVIOLENT  1748      2516    1217  1120
##   GAMBLE                        4         4       7     2
##   HOMICIDE                     41        46      49   131
##   LIQUOR LAW VIOLATION        112       491     410   606
##   LOITERING                    20        31      25     9
##   NARCOTIC                   2415      6416    3924  4109
##   PORNOGRAPHY                  65        53      17    31
##   PROSTITUTION                115       675    1425  1340
##   RAPE                        332       318     354   854
##   ROBBERY                    2584      4737    4139  5372
##   SEX OFFENSE-OTHER          1501      1759    1014  1776
##   THEFT                     38687     64868   38980 28410
##   TRESPASS                   4848      5184    2598  3289
##   WEAPON                      735      1445     947  1624

The table above shows counts, but in most situations, it is important to see relative values:

# using "pipes" to help readability:
library(magrittr)

CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
        prop.table() %>% #compute proportion and then...
        "*"(100)%>% # multiply by 100 and then...
        round(2) #...round to to decimals

# you get:
CrimeTotal

##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         0.71      1.07    0.98  1.50
##   ARSON                      0.04      0.03    0.04  0.10
##   BURGLARY                   4.83      4.46    2.83  3.22
##   CAR PROWL                  5.35      7.66    8.53  6.98
##   DISORDERLY CONDUCT         0.01      0.02    0.01  0.02
##   DUI                        0.14      0.19    0.41  1.71
##   FAMILY OFFENSE-NONVIOLENT  0.35      0.50    0.24  0.22
##   GAMBLE                     0.00      0.00    0.00  0.00
##   HOMICIDE                   0.01      0.01    0.01  0.03
##   LIQUOR LAW VIOLATION       0.02      0.10    0.08  0.12
##   LOITERING                  0.00      0.01    0.01  0.00
##   NARCOTIC                   0.48      1.28    0.79  0.82
##   PORNOGRAPHY                0.01      0.01    0.00  0.01
##   PROSTITUTION               0.02      0.14    0.29  0.27
##   RAPE                       0.07      0.06    0.07  0.17
##   ROBBERY                    0.52      0.95    0.83  1.08
##   SEX OFFENSE-OTHER          0.30      0.35    0.20  0.36
##   THEFT                      7.75     12.99    7.80  5.69
##   TRESPASS                   0.97      1.04    0.52  0.66
##   WEAPON                     0.15      0.29    0.19  0.33

Those tables show total counts or percents. However, when a table tries to hypothesize a relationship, you should have the independent variable in the columns, and the dependent one in the rows; then, the percent should be calculated by column, to see how the levels of the dependent variable varies by each level of the independent one, and compare along rows.

CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
         prop.table(margin = 2)%>%   # 2 is % by column
         "*"(100)%>%
         round(3)

# you get:
CrimeCol

##                            
##                                day afternoon evening  night
##   AGGRAVATED ASSAULT         3.282     3.447   4.104  6.456
##   ARSON                      0.180     0.107   0.161  0.418
##   BURGLARY                  22.229    14.319  11.866 13.842
##   CAR PROWL                 24.624    24.588  35.794 29.987
##   DISORDERLY CONDUCT         0.038     0.052   0.056  0.068
##   DUI                        0.650     0.603   1.713  7.335
##   FAMILY OFFENSE-NONVIOLENT  1.610     1.616   1.023  0.964
##   GAMBLE                     0.004     0.003   0.006  0.002
##   HOMICIDE                   0.038     0.030   0.041  0.113
##   LIQUOR LAW VIOLATION       0.103     0.315   0.345  0.522
##   LOITERING                  0.018     0.020   0.021  0.008
##   NARCOTIC                   2.224     4.122   3.297  3.537
##   PORNOGRAPHY                0.060     0.034   0.014  0.027
##   PROSTITUTION               0.106     0.434   1.197  1.153
##   RAPE                       0.306     0.204   0.297  0.735
##   ROBBERY                    2.380     3.043   3.478  4.624
##   SEX OFFENSE-OTHER          1.382     1.130   0.852  1.529
##   THEFT                     35.626    41.674  32.756 24.453
##   TRESPASS                   4.464     3.330   2.183  2.831
##   WEAPON                     0.677     0.928   0.796  1.398

The complexity of two variables requires plots, as tables like these will not allow you to discover association patterns easily, even though they are already a summary of two columns. However, you must check the data format the plotting functions require, as most plots will use the contingency table as input (not the raw data).

Let me try a basic bar plot with the contingency table as input:

barplot(CrimeCol)

This plot will need a lot of work, so the base capabilities of R may not be a good strategy; and as before, we will turn to ggplot.

However, when using alternative/more specialized plotting features you may need to convert your table into a dataframe of frequencies, let me create the base proportions table:

df.T=as.data.frame(CrimeCol) # table of proportion based on total
# YOU GET:
head(df.T)

##                 Var1 Var2   Freq
## 1 AGGRAVATED ASSAULT  day  3.282
## 2              ARSON  day  0.180
## 3           BURGLARY  day 22.229
## 4          CAR PROWL  day 24.624
## 5 DISORDERLY CONDUCT  day  0.038
## 6                DUI  day  0.650

We should rename the above table:

names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)

##                Crime Daytime Percent
## 1 AGGRAVATED ASSAULT     day   3.282
## 2              ARSON     day   0.180
## 3           BURGLARY     day  22.229
## 4          CAR PROWL     day  24.624
## 5 DISORDERLY CONDUCT     day   0.038
## 6                DUI     day   0.650

A first option you may have is to reproduce the table:

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.4

base = ggplot(df.T, aes(Daytime,Crime)) 

# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray") 

# add value of Percent as label
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
                                    nudge_x = 0.1, # push the value to the right on the horizontal
                                    size=2)
tablePlot2

…some more work:

# improve the size of dots
tablePlot3 = tablePlot2 + scale_size_continuous(range=c(0,10)) #change 10?

# less ink
tablePlot4 = tablePlot3 + theme_minimal() 

# no legend
tablePlot4 + theme(legend.position="none")

The plot looks nice, but unless the differences are clearly cut, you may see more noise than information, which distracts and delays decision making. Keep in mind that length of bars are easier to compare than circle areas. You need a barplot:

And, the original relationship Input-Output table can be plotted like this:

base  = ggplot(df.T, aes(x = Crime, y = Percent ) ) 
bars1 = base + geom_bar( stat = "identity" )
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 
bars2 + coord_flip()

The type of crime is not ordinal, then we could reorder the bars:

base  = ggplot(df.T, aes(x = reorder(Crime, Percent), # reordering Crime based on percent value
                         y = Percent ) ) 

bars1 = base + geom_bar( stat = "identity" )

bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 

bars2 + coord_flip() + theme(axis.text.y = element_text(size=4,angle = 45))

Numerical correlation

The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but I will produce two by aggregating the original data set using Neigborhood:

Aggregating days to report and neighborhood:

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.4

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

daysByNeigh= crime  %>%  
                    filter(!is.na(Neighborhood)) %>%
                    group_by(Neighborhood)  %>%  
                    summarize(DaysToReport=mean(DaysToReport))  

str(daysByNeigh)

## Classes 'tbl_df', 'tbl' and 'data.frame':    58 obs. of  2 variables:
##  $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ DaysToReport: num  8.97 14.46 7.79 5.26 5.01 ...

Aggregating share of crimes by neighborhood

crimesByNeigh=crime  %>%  
                    filter(!is.na(Neighborhood)) %>%
                    group_by(Neighborhood)  %>%  
                    summarize(crimeShare=length(crimecat)/nrow(crime))#  %>% 
    

str(crimesByNeigh)

## Classes 'tbl_df', 'tbl' and 'data.frame':    58 obs. of  2 variables:
##  $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ crimeShare  : num  0.01338 0.00483 0.02114 0.02941 0.02971 ...

Merging the two dataframes: Since both data frames have the same neighboorhood, we can make one data frame by merging them:

num_num=merge(daysByNeigh,crimesByNeigh) # 'row name' is the "key"
head(num_num)

##      Neighborhood DaysToReport  crimeShare
## 1 ALASKA JUNCTION     8.968437 0.013378080
## 2            ALKI    14.463768 0.004832919
## 3   BALLARD NORTH     7.788203 0.021136767
## 4   BALLARD SOUTH     5.262384 0.029409763
## 5        BELLTOWN     5.012865 0.029711946
## 6      BITTERLAKE     7.442241 0.019073520

Notice that Neighborhood is a factor, since I will use it to annotate it would be better to turn it into a text:

num_num$Neighborhood=as.character(num_num$Neighborhood)

Once we have the data organized, the clear option is the scatterplot:

base = ggplot(num_num, aes(DaysToReport,crimeShare)) 
plot1= base +  geom_point() 
plot1

We can improve the plot, this time introducing ggrepel:

library(ggrepel)
base = ggplot(num_num, aes(DaysToReport,crimeShare,
                           label=Neighborhood)) # you need this aesthetics!
plot1= base +  geom_point() 
plot1 + geom_text_repel()

We can limit the labels, annotating the ones that represent at least 5% of the crimes in the city:

plot1 + geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
                                         as.character(num_num$Neighborhood), "")))

Notice the difference without ggrepel:

plot1 + geom_text(aes(label=ifelse(crimeShare>=0.05,num_num$Neighborhood, "")))

An alternative, to highlight overlapping of points:

library(hexbin)

## Warning: package 'hexbin' was built under R version 3.4.3

base = ggplot(num_num, aes(DaysToReport,crimeShare)) 
scatp1 = base +  geom_hex(bins = 10)
scatp2= scatp1 + geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
                                                  num_num$Neighborhood,
                                                  ""))) 
scatp2 + scale_fill_distiller(palette ="Greys",direction=1) # try -1

The palettes can be selected from the brewer colors website. Using the same palette as before, we can try a different plot (stat_density_2d):

base = ggplot(num_num, aes(DaysToReport,crimeShare)) 

scatp1 = base +  stat_density_2d(aes(fill = ..density..), 
                                 geom = "raster", contour = FALSE)

scatp2=scatp1+geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
                                               num_num$Neighborhood, "")))

scatp3= scatp2 + scale_fill_distiller(palette="Greys", direction=1) 

scatp4= scatp3 + guides(fill = guide_legend(title = "Number of\nNeighborhoods",
                                    title.theme = element_text(size=8)))

scatp4

The extra space you see can disappear using:

scatp5 = scatp4 +  scale_x_continuous(expand = c(0, 0)) + 
         scale_y_continuous(expand = c(0, 0)) 
scatp5

Go to page beginning

Multivariate Plots

This time, I will use the data about city safety:

library(rio)

## Warning: package 'rio' was built under R version 3.4.4

link="https://github.com/EvansDataScience/data/raw/master/safeCitiesIndexAll.xlsx"

safe=import(link)

The data available are:

names(safe)

##  [1] "city"                         "D_In_PrivacyPolicy"          
##  [3] "D_In_AwarenessDigitalThreats" "D_In_PubPrivPartnerships"    
##  [5] "D_In_TechnologyEmployed"      "D_In_CyberSecurity"          
##  [7] "D_Out_IdentityTheft"          "D_Out_CompInfected"          
##  [9] "D_Out_InternetAccess"         "H_In_EnvironmentPolicies"    
## [11] "H_In_AccessHealthcare"        "H_In_Beds_1000"              
## [13] "H_In_Doctors_1000"            "H_In_AccessFood"             
## [15] "H_In_QualityHealthServ"       "H_Out_AirQuality"            
## [17] "H_Out_WaterQuality"           "H_Out_LifeExpectY"           
## [19] "H_Out_InfMortality"           "H_Out_CancerMortality"       
## [21] "H_Out_AttacksBioChemRad"      "I_In_EnforceTransportSafety" 
## [23] "I_In_PedestrianFriendliness"  "I_In_QualityRoad"            
## [25] "I_In_QualityElectricity"      "I_In_DisasterManagement"     
## [27] "I_Out_DeathsDisaster"         "I_Out_VehicularAccidents"    
## [29] "I_Out_PedestrianDeath"        "I_Out_LiveSlums"             
## [31] "I_Out_AttacksInfrastructure"  "P_In_PoliceEngage"           
## [33] "P_In_CommunityPatrol"         "P_In_StreetCrimeData"        
## [35] "P_In_TechForCrime"            "P_In_PrivateSecurity"        
## [37] "P_In_GunRegulation"           "P_In_PoliticalStability"     
## [39] "P_Out_PettyCrime"             "P_Out_ViolentCrime"          
## [41] "P_Out_OrganisedCrime"         "P_Out_Corruption"            
## [43] "P_Out_DrugUse"                "P_Out_TerroristAttacks"      
## [45] "P_Out_SeverityTerrorist"      "P_Out_GenderSafety"          
## [47] "P_Out_PerceptionSafety"       "P_Out_ThreaTerrorism"        
## [49] "P_Out_ThreatMilitaryConf"     "P_Out_ThreatCivUnrest"

These are several variables telling us information about the safety levels of cities in the world, and are related to Digital, Health, Infrastructure, and Personal dimensions. For each of these dimensions, there are measures of actions taken (In), and results (Out).

Without dimensionality reduction:

Heatmaps will show you the whole data set. First, we need some reshaping:

library(reshape2)

## Warning: package 'reshape2' was built under R version 3.4.3

safeA=melt(safe,
           id.vars = 'city') # the unit of analysis
head(safeA)

##        city           variable value
## 1 Abu Dhabi D_In_PrivacyPolicy    50
## 2 Amsterdam D_In_PrivacyPolicy   100
## 3    Athens D_In_PrivacyPolicy    75
## 4   Bangkok D_In_PrivacyPolicy    25
## 5 Barcelona D_In_PrivacyPolicy   100
## 6   Beijing D_In_PrivacyPolicy    75

The melting changed the direction of the data: the columns were sent into rows. This the data in long format. Now, the heatmap using this format:

base = ggplot(data = safeA, aes(x = variable,
                                y =city)) 

heat1= base +  geom_tile(aes(fill = value)) 
heat1

Here you can see what rows have higher or lower colors on what set of variables. You can add color palette:

#inverse color -1
heat2 = heat1 + scale_fill_distiller(palette = "RdYlGn",direction = 1)  
heat2

The column and row names need some work:

heat2 + theme(axis.text.x = element_text(angle = 90, 
                                         hjust = 1,
                                         size = 4),
              axis.text.y = element_text(size = 4))

The last heatmap above could be ‘ordered’ so that column and row positions can give us more information:

# change in REORDER
base= ggplot(data = safeA, aes(x = reorder(variable, 
                                           value, median, order=TRUE),
                               y =reorder(city,
                                          value, median, order=TRUE)))
# THIS IS THE SAME
base + geom_tile(aes(fill = value)) + 
    scale_fill_distiller(palette = "RdYlGn",direction = 1) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 4),
              axis.text.y = element_text(size = 4))

This is still hard to read. An alternative could be to average each dimension, so you get four columns. These data has that information:

link2="https://github.com/EvansDataScience/data/raw/master/safeCitiesIndex.xlsx"

safe2=import(link2)
head(safe2)

##        city  DIGITAL   HEALTH INFRASTRUCTURE PERSONAL
## 1 Abu Dhabi 68.77421 68.54043       91.36416 78.95145
## 2 Amsterdam 85.79071 79.79323       96.04870 87.42197
## 3    Athens 61.93823 74.57161       82.04507 69.03243
## 4   Bangkok 44.44312 66.63739       68.33223 60.79785
## 5 Barcelona 74.44147 78.54113       96.59329 85.28254
## 6   Beijing 65.37625 67.62942       74.48954 80.76429

This same plot can be done using some additional packages related to ggplot. Let me show you the use of ggiraph, and ggiraphExtra.

This packages does not need that we alter the shape from wide to long, as the heatplot needed.

library(ggiraph)

## Warning: package 'ggiraph' was built under R version 3.4.4

library(ggiraphExtra)

## Warning: package 'ggiraphExtra' was built under R version 3.4.4

base = ggRadar(safe2,aes(colour='city'),legend.position="none") 

plot1 = base + facet_wrap('city',ncol = 10) # ten columns of cities

plot1