Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 6: Spatial and Multivariate Plotting


  1. Bivariate Plots.
  2. Multivariate Plots.
  3. Making maps.

Bivariate Plots

We analyze two variables to find out if there might be some kind of association between them. Even though such an association may be difficult to identify clearly, bivariate analysis still helps reveal signs of it that may at least raise concern.

This time, I will use the data about crime from the Seattle Open Data portal:

link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))

Let’s see what kind of data we have:

str(crime,width = 70,strict.width='cut')
## 'data.frame':    499698 obs. of  17 variables:
##  $ Report.Number              : chr  "20130000244104" "201300002429"..
##  $ Occurred.Date              : Date, format: "2013-07-09" ...
##  $ year                       : num  2013 2013 2013 2013 2013 ...
##  $ month                      : num  7 7 7 7 7 7 7 7 7 7 ...
##  $ weekday                    : Ord.factor w/ 7 levels "Monday"<"Tu"..
##  $ Occurred.Time              : num  1930 1917 1900 1900 1846 ...
##  $ Occurred.DayTime           : Ord.factor w/ 4 levels "day"<"after"..
##  $ Reported.Date              : Date, format: "2013-07-10" ...
##  $ Reported.Time              : num  1722 2052 35 1258 1846 ...
##  $ DaysToReport               : num  1 0 1 1 0 0 0 0 1 0 ...
##  $ crimecat                   : Factor w/ 20 levels "AGGRAVATED ASS"..
##  $ Crime.Subcategory          : Factor w/ 30 levels "AGGRAVATED ASS"..
##  $ Primary.Offense.Description: Factor w/ 144 levels "ADULT-VULNERA"..
##  $ Precinct                   : Factor w/ 5 levels "EAST","NORTH",....
##  $ Sector                     : Factor w/ 23 levels "6804","9512",....
##  $ Beat                       : Factor w/ 64 levels "B1","B2","B3",...
##  $ Neighborhood               : Factor w/ 58 levels "ALASKA JUNCTIO"..

Categorical association

The main way to organize these relationships is the contingency table. Let’s select a couple of categorical variables:

(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))
##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         3564      5366    4884  7501
##   ARSON                       196       167     191   486
##   BURGLARY                  24139     22288   14121 16082
##   CAR PROWL                 26740     38273   42595 34839
##   DISORDERLY CONDUCT           41        81      67    79
##   DUI                         706       939    2038  8522
##   FAMILY OFFENSE-NONVIOLENT  1748      2516    1217  1120
##   GAMBLE                        4         4       7     2
##   HOMICIDE                     41        46      49   131
##   LIQUOR LAW VIOLATION        112       491     410   606
##   LOITERING                    20        31      25     9
##   NARCOTIC                   2415      6416    3924  4109
##   PORNOGRAPHY                  65        53      17    31
##   PROSTITUTION                115       675    1425  1340
##   RAPE                        332       318     354   854
##   ROBBERY                    2584      4737    4139  5372
##   SEX OFFENSE-OTHER          1501      1759    1014  1776
##   THEFT                     38687     64868   38980 28410
##   TRESPASS                   4848      5184    2598  3289
##   WEAPON                      735      1445     947  1624

The table above shows counts, but in most situations, it is important to see relative values:

# using "pipes" to help readability:
library(magrittr)

CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
        prop.table() %>% #compute proportion and then...
        "*"(100)%>% # multiply by 100 and then...
        round(2) #...round to two decimals

# you get:
CrimeTotal
##                            
##                               day afternoon evening night
##   AGGRAVATED ASSAULT         0.71      1.07    0.98  1.50
##   ARSON                      0.04      0.03    0.04  0.10
##   BURGLARY                   4.83      4.46    2.83  3.22
##   CAR PROWL                  5.35      7.66    8.53  6.98
##   DISORDERLY CONDUCT         0.01      0.02    0.01  0.02
##   DUI                        0.14      0.19    0.41  1.71
##   FAMILY OFFENSE-NONVIOLENT  0.35      0.50    0.24  0.22
##   GAMBLE                     0.00      0.00    0.00  0.00
##   HOMICIDE                   0.01      0.01    0.01  0.03
##   LIQUOR LAW VIOLATION       0.02      0.10    0.08  0.12
##   LOITERING                  0.00      0.01    0.01  0.00
##   NARCOTIC                   0.48      1.28    0.79  0.82
##   PORNOGRAPHY                0.01      0.01    0.00  0.01
##   PROSTITUTION               0.02      0.14    0.29  0.27
##   RAPE                       0.07      0.06    0.07  0.17
##   ROBBERY                    0.52      0.95    0.83  1.08
##   SEX OFFENSE-OTHER          0.30      0.35    0.20  0.36
##   THEFT                      7.75     12.99    7.80  5.69
##   TRESPASS                   0.97      1.04    0.52  0.66
##   WEAPON                     0.15      0.29    0.19  0.33

Those tables show total counts or percents. However, when a table tries to hypothesize a relationship, you should place the independent variable in the columns and the dependent one in the rows; then, the percents should be calculated by column, to see how the levels of the dependent variable vary within each level of the independent one, and compare along the rows.

CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
         prop.table(margin = 2)%>%   # 2 is % by column
         "*"(100)%>%
         round(3)

# you get:
CrimeCol
##                            
##                                day afternoon evening  night
##   AGGRAVATED ASSAULT         3.282     3.447   4.104  6.456
##   ARSON                      0.180     0.107   0.161  0.418
##   BURGLARY                  22.229    14.319  11.866 13.842
##   CAR PROWL                 24.624    24.588  35.794 29.987
##   DISORDERLY CONDUCT         0.038     0.052   0.056  0.068
##   DUI                        0.650     0.603   1.713  7.335
##   FAMILY OFFENSE-NONVIOLENT  1.610     1.616   1.023  0.964
##   GAMBLE                     0.004     0.003   0.006  0.002
##   HOMICIDE                   0.038     0.030   0.041  0.113
##   LIQUOR LAW VIOLATION       0.103     0.315   0.345  0.522
##   LOITERING                  0.018     0.020   0.021  0.008
##   NARCOTIC                   2.224     4.122   3.297  3.537
##   PORNOGRAPHY                0.060     0.034   0.014  0.027
##   PROSTITUTION               0.106     0.434   1.197  1.153
##   RAPE                       0.306     0.204   0.297  0.735
##   ROBBERY                    2.380     3.043   3.478  4.624
##   SEX OFFENSE-OTHER          1.382     1.130   0.852  1.529
##   THEFT                     35.626    41.674  32.756 24.453
##   TRESPASS                   4.464     3.330   2.183  2.831
##   WEAPON                     0.677     0.928   0.796  1.398
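
Plots aside, you may also want a formal check of the association between these two variables. A minimal sketch with base R, assuming a chi-square test of independence is appropriate here (note that CrimeTotal was overwritten with percentages, so the counts are recomputed):

# recompute the counts (CrimeTotal now holds percentages)
counts=table(crime$crimecat,crime$Occurred.DayTime)

# chi-square test of independence
testChi=chisq.test(counts)
testChi$p.value # a tiny p-value suggests association

# Cramer's V as effect size: from 0 (no association) to 1 (perfect)
sqrt(testChi$statistic/(sum(counts)*(min(dim(counts))-1)))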

Two variables are complex enough to require plots: tables like these, even though they already summarize two columns, will not let you discover association patterns easily. However, you must check what data format each plotting function requires, as most plots will use the contingency table as input (not the raw data).

Let me try a basic bar plot with the contingency table as input:

barplot(CrimeCol)

This plot would need a lot of work, so base R may not be a good strategy; as before, we will turn to ggplot.

However, alternative or more specialized plotting functions may require that you convert your table into a data frame of frequencies. Let me turn the column-percent table into one:

df.T=as.data.frame(CrimeCol) # the column-percent table as a data frame
# YOU GET:
head(df.T)
##                 Var1 Var2   Freq
## 1 AGGRAVATED ASSAULT  day  3.282
## 2              ARSON  day  0.180
## 3           BURGLARY  day 22.229
## 4          CAR PROWL  day 24.624
## 5 DISORDERLY CONDUCT  day  0.038
## 6                DUI  day  0.650

We should rename the columns of this data frame:

names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)
##                Crime Daytime Percent
## 1 AGGRAVATED ASSAULT     day   3.282
## 2              ARSON     day   0.180
## 3           BURGLARY     day  22.229
## 4          CAR PROWL     day  24.624
## 5 DISORDERLY CONDUCT     day   0.038
## 6                DUI     day   0.650

A first option you may have is to reproduce the table as a plot:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
base = ggplot(df.T, aes(Daytime,Crime)) 

# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray") 

# add value of Percent as label
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
                                    nudge_x = 0.1, # push the value to the right on the horizontal
                                    size=2)
tablePlot2

…some more work:

# improve the size of dots
tablePlot3 = tablePlot2 + scale_size_continuous(range=c(0,10)) #change 10?

# less ink
tablePlot4 = tablePlot3 + theme_minimal() 

# no legend
tablePlot4 + theme(legend.position="none") 

The plot looks nice, but unless the differences are clear-cut, you may see more noise than information, which distracts and delays decision making. Keep in mind that lengths of bars are easier to compare than areas of circles, so you need a barplot.

The column-percent table can be plotted as bars like this:

base  = ggplot(df.T, aes(x = Crime, y = Percent ) ) 
bars1 = base + geom_bar( stat = "identity" )
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 
bars2 + coord_flip()

The type of crime is not ordinal, so we can reorder the bars:

base  = ggplot(df.T, aes(x = reorder(Crime, Percent), # reordering Crime based on percent value
                         y = Percent ) ) 

bars1 = base + geom_bar( stat = "identity" )

bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 

bars2 + coord_flip() + theme(axis.text.y = element_text(size=4,angle = 45)) 
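
If the many near-zero categories distract, here is a quick sketch that keeps only the crimes reaching at least 5% in some daytime (the 5% threshold is an arbitrary choice of mine):

# crimes whose percent reaches 5 in at least one daytime
bigCrimes=unique(df.T[df.T$Percent>=5,'Crime'])

base  = ggplot(df.T[df.T$Crime %in% bigCrimes,],
               aes(x = reorder(Crime, Percent), y = Percent))
bars1 = base + geom_bar(stat = "identity")
bars1 + facet_wrap(~ Daytime, nrow = 1) + coord_flip()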

Numerical correlation

The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but I will produce two by aggregating the original data set by Neighborhood:

  • Aggregating days to report by neighborhood:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
daysByNeigh= crime  %>%  
                    filter(!is.na(Neighborhood)) %>%
                    group_by(Neighborhood)  %>%  
                    summarize(DaysToReport=mean(DaysToReport))  

str(daysByNeigh)
## Classes 'tbl_df', 'tbl' and 'data.frame':    58 obs. of  2 variables:
##  $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ DaysToReport: num  8.97 14.46 7.79 5.26 5.01 ...
  • Aggregating the share of crimes by neighborhood:
crimesByNeigh=crime  %>%  
                    filter(!is.na(Neighborhood)) %>%
                    group_by(Neighborhood)  %>%  
                    summarize(crimeShare=length(crimecat)/nrow(crime))

str(crimesByNeigh)
## Classes 'tbl_df', 'tbl' and 'data.frame':    58 obs. of  2 variables:
##  $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ crimeShare  : num  0.01338 0.00483 0.02114 0.02941 0.02971 ...
  • Merging the two data frames: since both have the same Neighborhood column, we can make one data frame by merging them:
num_num=merge(daysByNeigh,crimesByNeigh) # merged by the common column 'Neighborhood'
head(num_num)
##      Neighborhood DaysToReport  crimeShare
## 1 ALASKA JUNCTION     8.968437 0.013378080
## 2            ALKI    14.463768 0.004832919
## 3   BALLARD NORTH     7.788203 0.021136767
## 4   BALLARD SOUTH     5.262384 0.029409763
## 5        BELLTOWN     5.012865 0.029711946
## 6      BITTERLAKE     7.442241 0.019073520

Notice that Neighborhood is a factor; since I will use it for annotations, it is better to turn it into text:

num_num$Neighborhood=as.character(num_num$Neighborhood)

Once we have the data organized, the clear option is the scatterplot:

base = ggplot(num_num, aes(DaysToReport,crimeShare)) 
plot1= base +  geom_point() 
plot1
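
Since this is correlation analysis, it is worth computing the coefficient itself alongside the plot; a minimal sketch using base R:

# Pearson's r, with a significance test
cor.test(num_num$DaysToReport, num_num$crimeShare)

# Spearman's rho is safer if you suspect outliers or a nonlinear relationship
cor(num_num$DaysToReport, num_num$crimeShare, method='spearman')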

We can improve the plot, this time introducing ggrepel:

library(ggrepel)
base = ggplot(num_num, aes(DaysToReport,crimeShare,
                           label=Neighborhood)) # you need this aesthetics!
plot1= base +  geom_point() 
plot1 + geom_text_repel()

We can limit the labels, annotating the ones that represent at least 5% of the crimes in the city:

plot1 + geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
                                         as.character(num_num$Neighborhood), "")))

Notice the difference without ggrepel:

plot1 + geom_text(aes(label=ifelse(crimeShare>=0.05,num_num$Neighborhood, "")))

An alternative that highlights the overlapping of points:

library(hexbin)
## Warning: package 'hexbin' was built under R version 3.4.3
base = ggplot(num_num, aes(DaysToReport,crimeShare)) 
scatp1 = base +  geom_hex(bins = 10)
scatp2= scatp1 + geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
                                                  num_num$Neighborhood,
                                                  ""))) 
scatp2 + scale_fill_distiller(palette ="Greys",direction=1) # try -1

The palettes can be selected from the ColorBrewer website. Using the same palette as before, we can try a different plot (stat_density_2d):

base = ggplot(num_num, aes(DaysToReport,crimeShare)) 

scatp1 = base +  stat_density_2d(aes(fill = ..density..), 
                                 geom = "raster", contour = FALSE)

scatp2=scatp1+geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
                                               num_num$Neighborhood, "")))

scatp3= scatp2 + scale_fill_distiller(palette="Greys", direction=1) 

scatp4= scatp3 + guides(fill = guide_legend(title = "Number of\nNeighborhoods",
                                    title.theme = element_text(size=8)))

scatp4

The extra space you see can be removed using:

scatp5 = scatp4 +  scale_x_continuous(expand = c(0, 0)) + 
         scale_y_continuous(expand = c(0, 0)) 
scatp5


Multivariate Plots

This time, I will use the data about city safety:

library(rio)
## Warning: package 'rio' was built under R version 3.4.4
link="https://github.com/EvansDataScience/data/raw/master/safeCitiesIndexAll.xlsx"

safe=import(link)

The data available are:

names(safe)
##  [1] "city"                         "D_In_PrivacyPolicy"          
##  [3] "D_In_AwarenessDigitalThreats" "D_In_PubPrivPartnerships"    
##  [5] "D_In_TechnologyEmployed"      "D_In_CyberSecurity"          
##  [7] "D_Out_IdentityTheft"          "D_Out_CompInfected"          
##  [9] "D_Out_InternetAccess"         "H_In_EnvironmentPolicies"    
## [11] "H_In_AccessHealthcare"        "H_In_Beds_1000"              
## [13] "H_In_Doctors_1000"            "H_In_AccessFood"             
## [15] "H_In_QualityHealthServ"       "H_Out_AirQuality"            
## [17] "H_Out_WaterQuality"           "H_Out_LifeExpectY"           
## [19] "H_Out_InfMortality"           "H_Out_CancerMortality"       
## [21] "H_Out_AttacksBioChemRad"      "I_In_EnforceTransportSafety" 
## [23] "I_In_PedestrianFriendliness"  "I_In_QualityRoad"            
## [25] "I_In_QualityElectricity"      "I_In_DisasterManagement"     
## [27] "I_Out_DeathsDisaster"         "I_Out_VehicularAccidents"    
## [29] "I_Out_PedestrianDeath"        "I_Out_LiveSlums"             
## [31] "I_Out_AttacksInfrastructure"  "P_In_PoliceEngage"           
## [33] "P_In_CommunityPatrol"         "P_In_StreetCrimeData"        
## [35] "P_In_TechForCrime"            "P_In_PrivateSecurity"        
## [37] "P_In_GunRegulation"           "P_In_PoliticalStability"     
## [39] "P_Out_PettyCrime"             "P_Out_ViolentCrime"          
## [41] "P_Out_OrganisedCrime"         "P_Out_Corruption"            
## [43] "P_Out_DrugUse"                "P_Out_TerroristAttacks"      
## [45] "P_Out_SeverityTerrorist"      "P_Out_GenderSafety"          
## [47] "P_Out_PerceptionSafety"       "P_Out_ThreaTerrorism"        
## [49] "P_Out_ThreatMilitaryConf"     "P_Out_ThreatCivUnrest"

These variables inform us about the safety levels of cities around the world, along four dimensions: Digital, Health, Infrastructure, and Personal. For each dimension, there are measures of actions taken (In) and results obtained (Out).

Without dimensionality reduction:

Heatmaps will show you the whole data set. First, we need some reshaping:

library(reshape2)
## Warning: package 'reshape2' was built under R version 3.4.3
safeA=melt(safe,
           id.vars = 'city') # the unit of analysis
head(safeA)
##        city           variable value
## 1 Abu Dhabi D_In_PrivacyPolicy    50
## 2 Amsterdam D_In_PrivacyPolicy   100
## 3    Athens D_In_PrivacyPolicy    75
## 4   Bangkok D_In_PrivacyPolicy    25
## 5 Barcelona D_In_PrivacyPolicy   100
## 6   Beijing D_In_PrivacyPolicy    75

The melting changed the direction of the data: the columns were sent into rows. This is the data in long format.
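
As a side note, if you prefer the tidyverse, the same long shape can be obtained with tidyr (an alternative to melt; the rest of this session keeps using the reshape2 result):

library(tidyr)
safeA2=gather(safe,
              variable, value, # names of the new key/value columns
              -city)           # keep city as the identifier

Now, the heatmap using the long format: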

base = ggplot(data = safeA, aes(x = variable,
                                y =city)) 

heat1= base +  geom_tile(aes(fill = value)) 
heat1

Here you can see which rows have higher or lower values on which sets of variables. You can add a color palette:

#inverse color -1
heat2 = heat1 + scale_fill_distiller(palette = "RdYlGn",direction = 1)  
heat2

The column and row names need some work:

heat2 + theme(axis.text.x = element_text(angle = 90, 
                                         hjust = 1,
                                         size = 4),
              axis.text.y = element_text(size = 4))

The last heatmap above could be ‘ordered’ so that column and row positions can give us more information:

# change in REORDER
base= ggplot(data = safeA, aes(x = reorder(variable, 
                                           value, median, order=TRUE),
                               y =reorder(city,
                                          value, median, order=TRUE)))
# THIS IS THE SAME
base + geom_tile(aes(fill = value)) + 
    scale_fill_distiller(palette = "RdYlGn",direction = 1) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 4),
              axis.text.y = element_text(size = 4))

This is still hard to read. An alternative could be to average each dimension, so you get four columns. This data set has that information:

link2="https://github.com/EvansDataScience/data/raw/master/safeCitiesIndex.xlsx"

safe2=import(link2)
head(safe2)
##        city  DIGITAL   HEALTH INFRASTRUCTURE PERSONAL
## 1 Abu Dhabi 68.77421 68.54043       91.36416 78.95145
## 2 Amsterdam 85.79071 79.79323       96.04870 87.42197
## 3    Athens 61.93823 74.57161       82.04507 69.03243
## 4   Bangkok 44.44312 66.63739       68.33223 60.79785
## 5 Barcelona 74.44147 78.54113       96.59329 85.28254
## 6   Beijing 65.37625 67.62942       74.48954 80.76429
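
In case you only had the detailed file, here is a sketch of how such columns could be computed from safe, assuming the D_/H_/I_/P_ prefixes reliably identify each dimension (the file above may use weights or scaling, so the numbers need not match exactly):

# unweighted mean of the columns in each dimension, found by name prefix
dims=c(DIGITAL='^D_',HEALTH='^H_',INFRASTRUCTURE='^I_',PERSONAL='^P_')
safeAvg=data.frame(city=safe$city,
                   sapply(dims,
                          function(p) rowMeans(safe[,grepl(p,names(safe))])))
head(safeAvg)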

A similar visualization of these four dimensions can be produced with some additional packages related to ggplot. Let me show you the use of ggiraph and ggiraphExtra.

These packages do not require reshaping the data from wide to long, as the heatmap did.

library(ggiraph)
## Warning: package 'ggiraph' was built under R version 3.4.4
library(ggiraphExtra)
## Warning: package 'ggiraphExtra' was built under R version 3.4.4
base = ggRadar(safe2,aes(colour='city'),legend.position="none") 

plot1 = base + facet_wrap('city',ncol = 10) # ten columns of cities

plot1 

There are too many units for this plot. However, we can improve it by making a little change to the original data.

The plan is to rank the cities and then turn city into an ordinal factor. That requires:

# get minimum value by row
safe2$min=apply(safe2[,c(2:5)],1,min)

# turn these min values into a ranking
safe2$min=rank(safe2$min)

# order city by ranking and turn that ordering into a factor
cityRk=as.factor(safe2[order(safe2$min),]$city)

# turn city into ordered factor
safe2$city=factor(safe2$city,
                   levels= cityRk,
                   labels = cityRk,
                   ordered = T)

# delete column with ranks
safe2$min=NULL

Notice the data seems the same:

head(safe2)
##        city  DIGITAL   HEALTH INFRASTRUCTURE PERSONAL
## 1 Abu Dhabi 68.77421 68.54043       91.36416 78.95145
## 2 Amsterdam 85.79071 79.79323       96.04870 87.42197
## 3    Athens 61.93823 74.57161       82.04507 69.03243
## 4   Bangkok 44.44312 66.63739       68.33223 60.79785
## 5 Barcelona 74.44147 78.54113       96.59329 85.28254
## 6   Beijing 65.37625 67.62942       74.48954 80.76429

But the structure has changed:

str(safe2)
## 'data.frame':    60 obs. of  5 variables:
##  $ city          : Ord.factor w/ 60 levels "Karachi"<"Jakarta"<..: 34 52 28 9 47 30 18 55 33 8 ...
##  $ DIGITAL       : num  68.8 85.8 61.9 44.4 74.4 ...
##  $ HEALTH        : num  68.5 79.8 74.6 66.6 78.5 ...
##  $ INFRASTRUCTURE: num  91.4 96 82 68.3 96.6 ...
##  $ PERSONAL      : num  79 87.4 69 60.8 85.3 ...

With the data ready, this simpler approach (grouping instead of coloring by city) shows the facets ordered by ranking:

library(ggiraph)
library(ggiraphExtra)
library(snakecase)
## Warning: package 'snakecase' was built under R version 3.4.4
base = ggRadar(safe2,aes(group='city'),legend.position="none") 

plot1 = base + facet_wrap('city',ncol = 10)

plot1 #+ geom_polygon(fill = 'white',col='orange')

Of course, if we had a small number of cases we could plot the layers on top of each other:

some=c("Manila","Lima", "Washington DC","Tokyo")

subSafe=safe2[safe2$city %in% some,]

base = ggRadar(subSafe,aes(group='city'),
               alpha = 0,legend.position="top") 

base #+  theme(legend.title=element_blank())

Areas and color hues do not help us discern differences as easily as lengths do, so the plots above should be used with care when many cases are represented.


With dimensionality reduction:

None of our previous plots applied dimensionality reduction; that is what comes next.

Multidimensional scaling is a technique used to distribute cases in, for example, two dimensions. That will give you a map, where closeness is interpreted as similarity.

  1. Create a ‘distance’ matrix: using all the information (columns) available, each case (row) will be compared with every other one, computing a distance for each pair.
distanceAmong <- dist(safe[,-1]) # euclidean distance is the default method; omitting the name of the city
  2. Apply the technique:
resultMDS <- cmdscale(distanceAmong,eig=TRUE, k=2) # k is the number of dim
  3. Get the coordinates to plot the cases:
dim1 <- resultMDS$points[,1]
dim2 <- resultMDS$points[,2]

#new data frame created
coordinates=data.frame(dim1,dim2,city=safe$city)

head(coordinates)
##          dim1       dim2      city
## 1  -27.583521  -2.299102 Abu Dhabi
## 2 -115.110368  -6.914814 Amsterdam
## 3   -7.052776  36.698576    Athens
## 4  115.815227 -26.422557   Bangkok
## 5  -99.861054  -4.390724 Barcelona
## 6   11.259394   4.009000   Beijing
  4. Show the map:
base= ggplot(coordinates,aes(x=dim1, y=dim2,label=city)) 
base + geom_text(size=2)

Notice that the coordinates do not inform the same way as in the previous scatter plots (it is not the case that Caracas is among the best); what matters is that the closer a city is to another, the more similar it is.
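
You can also check how much of the distance structure the two dimensions preserve; since we called cmdscale with eig=TRUE, it returned a goodness-of-fit measure (values closer to 1 are better):

resultMDS$GOF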

Another key way to reduce dimensionality is cluster analysis. In this case we will group the cities using all the information available per city with the kmeans clustering technique:

library(cluster)
set.seed(123)

# computing clusters
resultKM <- kmeans(safe[,-c(1)], # omitting name of city
                 centers = 3)    # how many clusters

Let’s add the cluster label to our coordinates:

coordinates$cluster=as.factor(resultKM$cluster)

We could use the cluster information in the MDS plot:

base= ggplot(coordinates,aes(x=dim1, y=dim2,label=city,color=cluster)) 
base + geom_text(size=2)
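
The choice of 3 clusters was mine. Since the cluster package is already loaded, here is a sketch to assess that choice using the average silhouette width (closer to 1 means better-separated clusters):

# silhouette needs the cluster labels and the distances among cases
sil=silhouette(resultKM$cluster, distanceAmong)
mean(sil[,'sil_width']) # try other values of 'centers' and compare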



Making Maps

Finding Data to plot

We have a dataset on contributions to Candidates and Political Committees in Washington State for the election year 2016, from the WA state open data portal.

link='https://github.com/EvansDataScience/DataDriven_ManagementAndPolicy/raw/master/Session6/contriWA_2016.RData'
#getting the data TABLE from the file in the cloud:
load(file=url(link))

This data frame has more than three hundred thousand rows, and each row informs of a contribution someone made to a particular campaign that year.

str(contriWA_2016,width = 60, strict.width = 'cut')
## 'data.frame':    374584 obs. of  10 variables:
##  $ id                  : chr  "3982630.rcpt" "3982631.rcp"..
##  $ contributor_state   : chr  "WA" "WA" "WA" "WA" ...
##  $ contributor_zip     : num  98683 98683 98683 98168 9850..
##  $ amount              : num  50 50 50 500 900 900 50 225 ..
##  $ election_year       : int  2016 2016 2016 2016 2016 201..
##  $ party               : Factor w/ 9 levels "","CONSTITUT"..
##  $ cash_or_in_kind     : Factor w/ 2 levels "Cash","In ki"..
##  $ contributor_location: chr  "(45.60817, -122.51972)" "("..
##  $ Lat                 : num  45.6 45.6 45.6 47.5 47 ...
##  $ Lon                 : num  -123 -123 -123 -122 -123 ...

We could plot any of the categorical or numerical columns on a map, as long as we have columns that represent a coordinate or a particular location. Here, we could use the last two columns (Lat/Lon), or the zip code column, to connect this data to a map, and then plot our selection of factor or number.

Keep in mind that if every row of the data to plot is identified by a zip code, we need a map of zip codes.

Getting the Map

Maps come in different formats. The most common is the shapefile, which is in fact a collection of files; that makes it more complicated to read a map from a cloud repository. The recommendation is to keep all the component files of the shapefile in a zipped folder in your cloud repository.

# link to zipped folder
zippedSHP= "https://github.com/EvansDataScience/data/raw/master/WAzips.zip"

The strategy in R will be to download the compressed folder to your computer and unzip it there, using the following code:

library(utils)
temp=tempfile()
download.file(zippedSHP, temp)
unzip(temp)

To know what shapefiles are now in your session folder:

(maps=list.files(pattern = 'shp'))
## [1] "SAEP_ZIP_Code_Tabulation_Areas.shp"

We will use that file name to open the map:

# notice the parameters used in the chunk!!

library(rgdal)
wazipMap <- readOGR("SAEP_ZIP_Code_Tabulation_Areas.shp",stringsAsFactors=F) 

This is your map:

library(tmap)
## Warning: package 'tmap' was built under R version 3.4.4
waZips = tm_shape(wazipMap) + tm_polygons()
waZips
## Warning: package 'sf' was built under R version 3.4.4
## Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3

You can control color like this:

tm_shape(wazipMap) + tm_polygons(border.col = 'blue',
                                 col = 'yellow')

Sometimes you need a dissolved map, that is, one that keeps just the outer limits of the map:

library(rmapshaper)
# This will make just a border of the state
baseMap <- ms_dissolve(wazipMap)

Then:

waBorder = tm_shape(baseMap) + tm_polygons(col = 'white',
                                           lwd = 1)
waBorder

Plotting coordinates:

The data frame contriWA_2016 has columns with coordinates. Let’s turn it into a spatial points data frame, making sure it has the same coordinate system as our map:

library(raster)
## 
## Attaching package: 'raster'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:magrittr':
## 
##     extract
mapCRS=crs(wazipMap) # projection of our map

contriWA_geo <- SpatialPointsDataFrame(contriWA_2016[,c(10:9)], # Lon/Lat
                    contriWA_2016,    #the original data frame
                    proj4string = mapCRS)   # assign a CRS of map 

Our new spatial points dataframe looks the same:

names(contriWA_geo)
##  [1] "id"                   "contributor_state"    "contributor_zip"     
##  [4] "amount"               "election_year"        "party"               
##  [7] "cash_or_in_kind"      "contributor_location" "Lat"                 
## [10] "Lon"

But it is not a simple data frame:

class(contriWA_geo)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

Now, plot the coordinates of the contributors on top of the dissolved map (notice each tm_shape call selects the shape used by the layers that follow it):

waDots1 = waBorder + 
          tm_shape(contriWA_geo) + 
          tm_dots(size = 0.1,col = 'red',alpha=0.5,shape = 20) 

waDots1

Of course, we can add more elements, like a title and its position:

waDots2 = waDots1 + 
          tm_layout(main.title = "Points",
                    main.title.position = 'center') 
waDots2

The compass (position and symbol):

waDots3 = waDots2 +
          tm_compass(position = c('left','TOP'),type = 'arrow')

waDots3

And the scale bar:

waDots4 = waDots3 +
          tm_scale_bar(position=c("RIGHT", "BOTTOM"),width = 0.2) 

waDots4

Nowadays, it is very common to use interactive maps. In that situation, Leaflet is a good option:

library(leaflet)

leaflet(contriWA_geo) %>% # the spatial points data frame
    addTiles() %>%        #basemap
    addCircleMarkers(clusterOptions = markerClusterOptions())

Adding information from a data frame

When you can organize your data so that each row represents a geographical unit, you can plot it on a map. However, in the current format each row represents a contribution; we do not need that. We need a data frame where each row is a zip code, and the amount tells us, for example, the average contribution generated in that location. This is an aggregation process:

library(dplyr)


WA_zip_contri= contriWA_2016  %>%  
                    group_by(contributor_zip)  %>%  
                        summarize('AVE_Amount'=mean(amount))
#see result:
head(WA_zip_contri)
## # A tibble: 6 x 2
##   contributor_zip AVE_Amount
##             <dbl>      <dbl>
## 1           98001      331. 
## 2           98002       94.1
## 3           98003      143. 
## 4           98004     1795. 
## 5           98005     1488. 
## 6           98006      264.

This data frame has the average of contributions for every zip code.

Our map also has interesting information (check the definitions here):

names(wazipMap)
##   [1] "OBJECTID"   "ZCTA5CE10"  "GEOID10"    "MTFCC10"    "FUNCSTAT10"
##   [6] "PARTFLG10"  "INTPTLON10" "INTPTLAT10" "Version"    "POP2000"   
##  [11] "POP2001"    "POP2002"    "POP2003"    "POP2004"    "POP2005"   
##  [16] "POP2006"    "POP2007"    "POP2008"    "POP2009"    "POP2010"   
##  [21] "POP2011"    "POP2012"    "POP2013"    "POP2014"    "POP2015"   
##  [26] "POP2016"    "POP2017"    "HHP2000"    "HHP2001"    "HHP2002"   
##  [31] "HHP2003"    "HHP2004"    "HHP2005"    "HHP2006"    "HHP2007"   
##  [36] "HHP2008"    "HHP2009"    "HHP2010"    "HHP2011"    "HHP2012"   
##  [41] "HHP2013"    "HHP2014"    "HHP2015"    "HHP2016"    "HHP2017"   
##  [46] "GQ2000"     "GQ2001"     "GQ2002"     "GQ2003"     "GQ2004"    
##  [51] "GQ2005"     "GQ2006"     "GQ2007"     "GQ2008"     "GQ2009"    
##  [56] "GQ2010"     "GQ2011"     "GQ2012"     "GQ2013"     "GQ2014"    
##  [61] "GQ2015"     "GQ2016"     "GQ2017"     "HU2000"     "HU2001"    
##  [66] "HU2002"     "HU2003"     "HU2004"     "HU2005"     "HU2006"    
##  [71] "HU2007"     "HU2008"     "HU2009"     "HU2010"     "HU2011"    
##  [76] "HU2012"     "HU2013"     "HU2014"     "HU2015"     "HU2016"    
##  [81] "HU2017"     "OHU2000"    "OHU2001"    "OHU2002"    "OHU2003"   
##  [86] "OHU2004"    "OHU2005"    "OHU2006"    "OHU2007"    "OHU2008"   
##  [91] "OHU2009"    "OHU2010"    "OHU2011"    "OHU2012"    "OHU2013"   
##  [96] "OHU2014"    "OHU2015"    "OHU2016"    "OHU2017"    "Shape__Are"
## [101] "Shape__Len"

The column with the zip code has the name ZCTA5CE10; let’s check its data type:

str(wazipMap$ZCTA5CE10)
##  chr [1:598] "98001" "98002" "98003" "98004" "98005" "98006" "98007" ...

ZCTA5CE10 is already a character, so let’s turn contributor_zip into a character too, to have the same type in both key columns:

WA_zip_contri$contributor_zip=as.character(WA_zip_contri$contributor_zip)

Now that the key columns in both data frames share the same data type, we can merge them.

# As the zip codes in each are under different column names, 
# I tell the merge function what columns to use:

layerContrib=merge(wazipMap,WA_zip_contri,  # use map first
                   by.x='ZCTA5CE10', 
                   by.y='contributor_zip',all.x=F)

There is a new map: layerContrib.
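
Because we set all.x=F, the polygons whose zip code never appears in the contributions table were dropped; a quick check:

nrow(wazipMap@data)     # 598 zip code areas in the original map
nrow(layerContrib@data) # only the ones with contributions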

We will plot the average amounts contributed, classified into 5 quantile groups (quintiles). Let’s follow these steps:

  1. Install and load the necessary packages to manage color and divisions:
library(RColorBrewer)
library(classInt)
  2. Define the variable to plot:
varToPLot=layerContrib$AVE_Amount
  3. Define the color palette (review the available palettes here):
colorForPalette='YlGnBu'
  4. Plot the choropleth; notice we are choosing a particular classification method known as quantile classification:
layer1= waBorder +  
        tm_shape(layerContrib) +
                tm_polygons("AVE_Amount", 
                            style="quantile", # classification method
                            n=5, # number of groups for classification
                            title="Contributions", # title of legend
                            palette=colorForPalette) 

fullMap= layer1 + tm_compass(position = c('left','TOP'),type = 'arrow') +
                  tm_scale_bar(position=c("RIGHT", "BOTTOM"),width = 0.2)

fullMap

We need to adjust the elements:

fullMap +  tm_layout(main.title = "Choropleth",
                     main.title.position = 'center',
                     legend.position = c('RIGHT','center'),
                                    #bottom,left,top,right
                     inner.margins=c(0.1,0,0.1,0.2)) 

Of course, you can use Leaflet:

# function for COLORING quantiles in leaflet
paletteFun=colorQuantile(colorForPalette, 
                         varToPLot,
                         n = 5)

# the base map
base_map = leaflet(baseMap) %>% addPolygons(weight = 3,color = 'red')

final = base_map %>%
         addPolygons(data=layerContrib,
                     weight = 1, #thickness of border
                     opacity =  1, # the closer to 0, the more transparent
                     fillOpacity = 0.7, # color brightness
                     fillColor = ~paletteFun(AVE_Amount)) # coloring

final
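
A legend can be added too; note that legends built from colorQuantile show the quantile brackets as percentages:

final %>% addLegend(position='bottomright',
                    pal=paletteFun,   # same palette function used for the polygons
                    values=varToPLot,
                    title='Contributions',
                    opacity=0.7)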