We analyze two variables to find out whether there is some kind of association between them. Even when that is difficult to identify clearly, bivariate analysis can still reveal signs of association that at least deserve further attention.
This time, I will use the data about crime from the Seattle Open Data portal:
link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))
Let’s see what kind of data we have:
str(crime,width = 70,strict.width='cut')
## 'data.frame': 499698 obs. of 17 variables:
## $ Report.Number : chr "20130000244104" "201300002429"..
## $ Occurred.Date : Date, format: "2013-07-09" ...
## $ year : num 2013 2013 2013 2013 2013 ...
## $ month : num 7 7 7 7 7 7 7 7 7 7 ...
## $ weekday : Ord.factor w/ 7 levels "Monday"<"Tu"..
## $ Occurred.Time : num 1930 1917 1900 1900 1846 ...
## $ Occurred.DayTime : Ord.factor w/ 4 levels "day"<"after"..
## $ Reported.Date : Date, format: "2013-07-10" ...
## $ Reported.Time : num 1722 2052 35 1258 1846 ...
## $ DaysToReport : num 1 0 1 1 0 0 0 0 1 0 ...
## $ crimecat : Factor w/ 20 levels "AGGRAVATED ASS"..
## $ Crime.Subcategory : Factor w/ 30 levels "AGGRAVATED ASS"..
## $ Primary.Offense.Description: Factor w/ 144 levels "ADULT-VULNERA"..
## $ Precinct : Factor w/ 5 levels "EAST","NORTH",....
## $ Sector : Factor w/ 23 levels "6804","9512",....
## $ Beat : Factor w/ 64 levels "B1","B2","B3",...
## $ Neighborhood : Factor w/ 58 levels "ALASKA JUNCTIO"..
The main way to organize these relationships is the contingency table. Let’s select a couple of categorical variables:
(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))
##
## day afternoon evening night
## AGGRAVATED ASSAULT 3564 5366 4884 7501
## ARSON 196 167 191 486
## BURGLARY 24139 22288 14121 16082
## CAR PROWL 26740 38273 42595 34839
## DISORDERLY CONDUCT 41 81 67 79
## DUI 706 939 2038 8522
## FAMILY OFFENSE-NONVIOLENT 1748 2516 1217 1120
## GAMBLE 4 4 7 2
## HOMICIDE 41 46 49 131
## LIQUOR LAW VIOLATION 112 491 410 606
## LOITERING 20 31 25 9
## NARCOTIC 2415 6416 3924 4109
## PORNOGRAPHY 65 53 17 31
## PROSTITUTION 115 675 1425 1340
## RAPE 332 318 354 854
## ROBBERY 2584 4737 4139 5372
## SEX OFFENSE-OTHER 1501 1759 1014 1776
## THEFT 38687 64868 38980 28410
## TRESPASS 4848 5184 2598 3289
## WEAPON 735 1445 947 1624
The table above shows counts, but in most situations, it is important to see relative values:
# using "pipes" to help readability:
library(magrittr)
CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
prop.table() %>% #compute proportion and then...
"*"(100)%>% # multiply by 100 and then...
round(2) #...round to two decimals
# you get:
CrimeTotal
##
## day afternoon evening night
## AGGRAVATED ASSAULT 0.71 1.07 0.98 1.50
## ARSON 0.04 0.03 0.04 0.10
## BURGLARY 4.83 4.46 2.83 3.22
## CAR PROWL 5.35 7.66 8.53 6.98
## DISORDERLY CONDUCT 0.01 0.02 0.01 0.02
## DUI 0.14 0.19 0.41 1.71
## FAMILY OFFENSE-NONVIOLENT 0.35 0.50 0.24 0.22
## GAMBLE 0.00 0.00 0.00 0.00
## HOMICIDE 0.01 0.01 0.01 0.03
## LIQUOR LAW VIOLATION 0.02 0.10 0.08 0.12
## LOITERING 0.00 0.01 0.01 0.00
## NARCOTIC 0.48 1.28 0.79 0.82
## PORNOGRAPHY 0.01 0.01 0.00 0.01
## PROSTITUTION 0.02 0.14 0.29 0.27
## RAPE 0.07 0.06 0.07 0.17
## ROBBERY 0.52 0.95 0.83 1.08
## SEX OFFENSE-OTHER 0.30 0.35 0.20 0.36
## THEFT 7.75 12.99 7.80 5.69
## TRESPASS 0.97 1.04 0.52 0.66
## WEAPON 0.15 0.29 0.19 0.33
Those tables show total counts or percents. However, when a table hypothesizes a relationship, the independent variable should go in the columns and the dependent one in the rows; the percents should then be computed by column, so you can see how the levels of the dependent variable vary within each level of the independent one, and compare along the rows.
CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
prop.table(margin = 2)%>% # 2 is % by column
"*"(100)%>%
round(3)
# you get:
CrimeCol
##
## day afternoon evening night
## AGGRAVATED ASSAULT 3.282 3.447 4.104 6.456
## ARSON 0.180 0.107 0.161 0.418
## BURGLARY 22.229 14.319 11.866 13.842
## CAR PROWL 24.624 24.588 35.794 29.987
## DISORDERLY CONDUCT 0.038 0.052 0.056 0.068
## DUI 0.650 0.603 1.713 7.335
## FAMILY OFFENSE-NONVIOLENT 1.610 1.616 1.023 0.964
## GAMBLE 0.004 0.003 0.006 0.002
## HOMICIDE 0.038 0.030 0.041 0.113
## LIQUOR LAW VIOLATION 0.103 0.315 0.345 0.522
## LOITERING 0.018 0.020 0.021 0.008
## NARCOTIC 2.224 4.122 3.297 3.537
## PORNOGRAPHY 0.060 0.034 0.014 0.027
## PROSTITUTION 0.106 0.434 1.197 1.153
## RAPE 0.306 0.204 0.297 0.735
## ROBBERY 2.380 3.043 3.478 4.624
## SEX OFFENSE-OTHER 1.382 1.130 0.852 1.529
## THEFT 35.626 41.674 32.756 24.453
## TRESPASS 4.464 3.330 2.183 2.831
## WEAPON 0.677 0.928 0.796 1.398
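For contrast, prop.table with margin = 1 computes the percents within each row instead, answering a different question: how each type of crime distributes across the times of day. A minimal sketch:
CrimeRow=table(crime$crimecat,crime$Occurred.DayTime)%>%
    prop.table(margin = 1)%>% # 1 is % by row
    "*"(100)%>%
    round(3)
head(CrimeRow) # each row now sums to (roughly) 100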
The complexity of two variables requires plots: tables like these, even though they already summarize two columns, will not let you discover association patterns easily. However, you must check the data format each plotting function requires; most base plots take the contingency table as input (not the raw data).
Let me try a basic bar plot with the contingency table as input:
barplot(CrimeCol)
This plot would need a lot of work, so base R may not be a good strategy; as before, we will turn to ggplot2.
However, more specialized plotting functions may need the table converted into a data frame of frequencies. Let me create one from the column-percent table:
df.T=as.data.frame(CrimeCol) # column percents as a data frame
# YOU GET:
head(df.T)
## Var1 Var2 Freq
## 1 AGGRAVATED ASSAULT day 3.282
## 2 ARSON day 0.180
## 3 BURGLARY day 22.229
## 4 CAR PROWL day 24.624
## 5 DISORDERLY CONDUCT day 0.038
## 6 DUI day 0.650
We should rename the columns of this data frame:
names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)
## Crime Daytime Percent
## 1 AGGRAVATED ASSAULT day 3.282
## 2 ARSON day 0.180
## 3 BURGLARY day 22.229
## 4 CAR PROWL day 24.624
## 5 DISORDERLY CONDUCT day 0.038
## 6 DUI day 0.650
A first option you may have is to reproduce the table:
library(ggplot2)
base = ggplot(df.T, aes(Daytime,Crime))
# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray")
# add value of Percent as label
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
nudge_x = 0.1, # push the value to the right on the horizontal
size=2)
tablePlot2
…some more work:
# improve the size of dots
tablePlot3 = tablePlot2 + scale_size_continuous(range=c(0,10)) #change 10?
# less ink
tablePlot4 = tablePlot3 + theme_minimal()
# no legend
tablePlot4 + theme(legend.position="none")
The plot looks nice, but unless the differences are clear-cut, you may see more noise than information, which distracts and delays decision making. Keep in mind that bar lengths are easier to compare than circle areas, so you need a bar plot.
The original input-output relationship table can be plotted like this:
base = ggplot(df.T, aes(x = Crime, y = Percent ) )
bars1 = base + geom_bar( stat = "identity" )
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1)
bars2 + coord_flip()
The type of crime is not ordinal, so we can reorder the bars:
base = ggplot(df.T, aes(x = reorder(Crime, Percent), # reordering Crime based on percent value
y = Percent ) )
bars1 = base + geom_bar( stat = "identity" )
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1)
bars2 + coord_flip() + theme(axis.text.y = element_text(size=4,angle = 45))
The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but I will produce two by aggregating the original data set by Neighborhood:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
daysByNeigh= crime %>%
filter(!is.na(Neighborhood)) %>%
group_by(Neighborhood) %>%
summarize(DaysToReport=mean(DaysToReport))
str(daysByNeigh)
## Classes 'tbl_df', 'tbl' and 'data.frame': 58 obs. of 2 variables:
## $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ DaysToReport: num 8.97 14.46 7.79 5.26 5.01 ...
crimesByNeigh=crime %>%
filter(!is.na(Neighborhood)) %>%
group_by(Neighborhood) %>%
summarize(crimeShare=length(crimecat)/nrow(crime))
str(crimesByNeigh)
## Classes 'tbl_df', 'tbl' and 'data.frame': 58 obs. of 2 variables:
## $ Neighborhood: Factor w/ 58 levels "ALASKA JUNCTION",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ crimeShare : num 0.01338 0.00483 0.02114 0.02941 0.02971 ...
num_num=merge(daysByNeigh,crimesByNeigh) # 'row name' is the "key"
head(num_num)
## Neighborhood DaysToReport crimeShare
## 1 ALASKA JUNCTION 8.968437 0.013378080
## 2 ALKI 14.463768 0.004832919
## 3 BALLARD NORTH 7.788203 0.021136767
## 4 BALLARD SOUTH 5.262384 0.029409763
## 5 BELLTOWN 5.012865 0.029711946
## 6 BITTERLAKE 7.442241 0.019073520
Notice that Neighborhood is a factor; since I will use it for annotation, it is better to turn it into text:
num_num$Neighborhood=as.character(num_num$Neighborhood)
Once we have the data organized, the clear option is the scatterplot:
base = ggplot(num_num, aes(DaysToReport,crimeShare))
plot1= base + geom_point()
plot1
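Before decorating the plot, we can quantify the association that correlation analysis is after; a minimal sketch using base R’s cor(), with both the linear (Pearson) and the rank-based (Spearman) coefficients:
# correlation between the two aggregated variables:
cor(num_num$DaysToReport, num_num$crimeShare, method = 'pearson')
cor(num_num$DaysToReport, num_num$crimeShare, method = 'spearman')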
We can improve the plot, this time introducing ggrepel:
library(ggrepel)
base = ggplot(num_num, aes(DaysToReport,crimeShare,
label=Neighborhood)) # you need this aesthetic!
plot1= base + geom_point()
plot1 + geom_text_repel()
We can limit the labels, annotating the ones that represent at least 5% of the crimes in the city:
plot1 + geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
as.character(num_num$Neighborhood), "")))
Notice the difference without ggrepel:
plot1 + geom_text(aes(label=ifelse(crimeShare>=0.05,num_num$Neighborhood, "")))
An alternative, to highlight overlapping of points:
library(hexbin)
base = ggplot(num_num, aes(DaysToReport,crimeShare))
scatp1 = base + geom_hex(bins = 10)
scatp2= scatp1 + geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
num_num$Neighborhood,
"")))
scatp2 + scale_fill_distiller(palette ="Greys",direction=1) # try -1
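If you want to preview the Brewer palettes without leaving R, the RColorBrewer package can draw them all (a quick sketch, assuming the package is installed):
library(RColorBrewer)
display.brewer.all() # draws every palette, labeled by name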
The palettes can be selected from the Brewer colors website as well. Using the same palette as before, we can try a different plot (stat_density_2d):
base = ggplot(num_num, aes(DaysToReport,crimeShare))
scatp1 = base + stat_density_2d(aes(fill = ..density..),
geom = "raster", contour = FALSE)
scatp2=scatp1+geom_text_repel(aes(label=ifelse(crimeShare>=0.05,
num_num$Neighborhood, "")))
scatp3= scatp2 + scale_fill_distiller(palette="Greys", direction=1)
scatp4= scatp3 + guides(fill = guide_legend(title = "Number of\nNeighborhoods",
title.theme = element_text(size=8)))
scatp4
The extra space you see can be removed with:
scatp5 = scatp4 + scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
scatp5
This time, I will use the data about city safety:
library(rio)
link="https://github.com/EvansDataScience/data/raw/master/safeCitiesIndexAll.xlsx"
safe=import(link)
The data available are:
names(safe)
## [1] "city" "D_In_PrivacyPolicy"
## [3] "D_In_AwarenessDigitalThreats" "D_In_PubPrivPartnerships"
## [5] "D_In_TechnologyEmployed" "D_In_CyberSecurity"
## [7] "D_Out_IdentityTheft" "D_Out_CompInfected"
## [9] "D_Out_InternetAccess" "H_In_EnvironmentPolicies"
## [11] "H_In_AccessHealthcare" "H_In_Beds_1000"
## [13] "H_In_Doctors_1000" "H_In_AccessFood"
## [15] "H_In_QualityHealthServ" "H_Out_AirQuality"
## [17] "H_Out_WaterQuality" "H_Out_LifeExpectY"
## [19] "H_Out_InfMortality" "H_Out_CancerMortality"
## [21] "H_Out_AttacksBioChemRad" "I_In_EnforceTransportSafety"
## [23] "I_In_PedestrianFriendliness" "I_In_QualityRoad"
## [25] "I_In_QualityElectricity" "I_In_DisasterManagement"
## [27] "I_Out_DeathsDisaster" "I_Out_VehicularAccidents"
## [29] "I_Out_PedestrianDeath" "I_Out_LiveSlums"
## [31] "I_Out_AttacksInfrastructure" "P_In_PoliceEngage"
## [33] "P_In_CommunityPatrol" "P_In_StreetCrimeData"
## [35] "P_In_TechForCrime" "P_In_PrivateSecurity"
## [37] "P_In_GunRegulation" "P_In_PoliticalStability"
## [39] "P_Out_PettyCrime" "P_Out_ViolentCrime"
## [41] "P_Out_OrganisedCrime" "P_Out_Corruption"
## [43] "P_Out_DrugUse" "P_Out_TerroristAttacks"
## [45] "P_Out_SeverityTerrorist" "P_Out_GenderSafety"
## [47] "P_Out_PerceptionSafety" "P_Out_ThreaTerrorism"
## [49] "P_Out_ThreatMilitaryConf" "P_Out_ThreatCivUnrest"
These variables give information about the safety levels of cities around the world, related to the Digital, Health, Infrastructure, and Personal dimensions. For each dimension, there are measures of actions taken (In) and results (Out).
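The prefixes of the column names encode the dimension and the direction; a small sketch exploiting this naming convention to subset or count the variables:
# columns measuring the Digital dimension (names starting with 'D_'):
names(safe)[grep('^D_', names(safe))]
# how many variables each dimension has (first letter of each name):
table(substr(names(safe)[-1], 1, 1))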
Heatmaps will show you the whole data set. First, we need some reshaping:
library(reshape2)
safeA=melt(safe,
id.vars = 'city') # the unit of analysis
head(safeA)
## city variable value
## 1 Abu Dhabi D_In_PrivacyPolicy 50
## 2 Amsterdam D_In_PrivacyPolicy 100
## 3 Athens D_In_PrivacyPolicy 75
## 4 Bangkok D_In_PrivacyPolicy 25
## 5 Barcelona D_In_PrivacyPolicy 100
## 6 Beijing D_In_PrivacyPolicy 75
The melting changed the direction of the data: the columns were sent into rows. This is the data in long format. Now, the heatmap using this format:
base = ggplot(data = safeA, aes(x = variable,
y =city))
heat1= base + geom_tile(aes(fill = value))
heat1
Here you can see which rows have higher or lower values on which sets of variables. You can add a color palette:
#inverse color -1
heat2 = heat1 + scale_fill_distiller(palette = "RdYlGn",direction = 1)
heat2
The column and row names need some work:
heat2 + theme(axis.text.x = element_text(angle = 90,
hjust = 1,
size = 4),
axis.text.y = element_text(size = 4))
The last heatmap above could be ‘ordered’ so that column and row positions can give us more information:
# change in REORDER
base= ggplot(data = safeA, aes(x = reorder(variable,
value, median, order=TRUE),
y =reorder(city,
value, median, order=TRUE)))
# the plotting code itself is the same as before
base + geom_tile(aes(fill = value)) +
scale_fill_distiller(palette = "RdYlGn",direction = 1) +
theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 4),
axis.text.y = element_text(size = 4))
This is still hard to read. An alternative is to average each dimension, so you get four columns. This data set has that information:
link2="https://github.com/EvansDataScience/data/raw/master/safeCitiesIndex.xlsx"
safe2=import(link2)
head(safe2)
## city DIGITAL HEALTH INFRASTRUCTURE PERSONAL
## 1 Abu Dhabi 68.77421 68.54043 91.36416 78.95145
## 2 Amsterdam 85.79071 79.79323 96.04870 87.42197
## 3 Athens 61.93823 74.57161 82.04507 69.03243
## 4 Bangkok 44.44312 66.63739 68.33223 60.79785
## 5 Barcelona 74.44147 78.54113 96.59329 85.28254
## 6 Beijing 65.37625 67.62942 74.48954 80.76429
This same plot can be made with some additional packages related to ggplot2. Let me show you ggiraph and ggiraphExtra.
These packages do not require reshaping the data from wide to long, as the heatmap did.
library(ggiraph)
library(ggiraphExtra)
base = ggRadar(safe2,aes(colour='city'),legend.position="none")
plot1 = base + facet_wrap('city',ncol = 10) # ten columns of cities
plot1
There are many units in this plot. However, we can improve it by making a small change to the original data.
The plan is to rank the cities and then turn city into an ordered factor. That requires:
# get minimum value by row
safe2$min=apply(safe2[,c(2:5)],1,min)
# turn these min values into a ranking
safe2$min=rank(safe2$min)
# order city by ranking and turn that ordering into a factor
cityRk=as.factor(safe2[order(safe2$min),]$city)
# turn city into ordered factor
safe2$city=factor(safe2$city,
levels= cityRk,
labels = cityRk,
ordered = T)
# delete column with ranks
safe2$min=NULL
Notice the data seems the same:
head(safe2)
## city DIGITAL HEALTH INFRASTRUCTURE PERSONAL
## 1 Abu Dhabi 68.77421 68.54043 91.36416 78.95145
## 2 Amsterdam 85.79071 79.79323 96.04870 87.42197
## 3 Athens 61.93823 74.57161 82.04507 69.03243
## 4 Bangkok 44.44312 66.63739 68.33223 60.79785
## 5 Barcelona 74.44147 78.54113 96.59329 85.28254
## 6 Beijing 65.37625 67.62942 74.48954 80.76429
But the structure has changed:
str(safe2)
## 'data.frame': 60 obs. of 5 variables:
## $ city : Ord.factor w/ 60 levels "Karachi"<"Jakarta"<..: 34 52 28 9 47 30 18 55 33 8 ...
## $ DIGITAL : num 68.8 85.8 61.9 44.4 74.4 ...
## $ HEALTH : num 68.5 79.8 74.6 66.6 78.5 ...
## $ INFRASTRUCTURE: num 91.4 96 82 68.3 96.6 ...
## $ PERSONAL : num 79 87.4 69 60.8 85.3 ...
This is a simpler approach, once the data is ready:
library(ggiraph)
library(ggiraphExtra)
library(snakecase)
base = ggRadar(safe2,aes(group='city'),legend.position="none")
plot1 = base + facet_wrap('city',ncol = 10)
plot1 #+ geom_polygon(fill = 'white',col='orange')
Of course, if we had a small number of cases, we could plot the layers on top of one another:
some=c("Manila","Lima", "Washington DC","Tokyo")
subSafe=safe2[safe2$city %in% some,]
base = ggRadar(subSafe,aes(group='city'),
alpha = 0,legend.position="top")
base #+ theme(legend.title=element_blank())
Areas and color hues do not let us discern differences as easily as lengths do, so the plots above should be used with care when many cases are represented.
None of our previous plots perform dimensionality reduction, and that is what comes next.
Multidimensional scaling is a technique used to distribute cases in, for example, two dimensions. That will give you a map, where closeness is interpreted as similarity.
distanceAmong <- dist(safe[,-1]) # euclidean distance is the default method; omitting the city name
resultMDS <- cmdscale(distanceAmong,eig=TRUE, k=2) # k is the number of dim
dim1 <- resultMDS$points[,1]
dim2 <- resultMDS$points[,2]
#new data frame created
coordinates=data.frame(dim1,dim2,city=safe$city)
head(coordinates)
## dim1 dim2 city
## 1 -27.583521 -2.299102 Abu Dhabi
## 2 -115.110368 -6.914814 Amsterdam
## 3 -7.052776 36.698576 Athens
## 4 115.815227 -26.422557 Bangkok
## 5 -99.861054 -4.390724 Barcelona
## 6 11.259394 4.009000 Beijing
base= ggplot(coordinates,aes(x=dim1, y=dim2,label=city))
base + geom_text(size=2)
Notice that the coordinates do not carry the same meaning as in the scatter plot (it is not the case that Caracas is among the best); what matters is that the closer one city is to another, the more similar it is.
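Before moving on, it is worth checking how faithful this two-dimensional map is; since we called cmdscale with eig=TRUE, the result includes a goodness-of-fit measure we can inspect (a quick sketch):
# proportion of the distance structure captured by the two retained dimensions:
resultMDS$GOF
# the leading eigenvalues tell a similar story:
head(resultMDS$eig)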
Another key way to reduce dimensionality is cluster analysis. Here we will group the cities, using all the information available per city, with the k-means clustering technique:
library(cluster)
set.seed(123)
# computing clusters
resultKM <- kmeans(safe[,-c(1)], # omitting name of city
centers = 3) # how many clusters
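The choice of three clusters is a judgment call. A common informal check is the elbow heuristic: compare the total within-cluster sum of squares across several values of k (a sketch, not a definitive rule):
# total within-cluster sum of squares for k = 1..6:
wss = sapply(1:6, function(k)
    kmeans(safe[,-1], centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = 'b',
     xlab = 'k (number of clusters)', ylab = 'total within-cluster SS')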
Let’s add the cluster label to our coordinates:
coordinates$cluster=as.factor(resultKM$cluster)
We could use the cluster information in the MDS plot:
base= ggplot(coordinates,aes(x=dim1, y=dim2,label=city,color=cluster))
base + geom_text(size=2)
We have a data set on contributions to candidates and political committees in Washington State for the 2016 election year, from the WA State open data portal.
link='https://github.com/EvansDataScience/DataDriven_ManagementAndPolicy/raw/master/Session6/contriWA_2016.RData'
#getting the data TABLE from the file in the cloud:
load(file=url(link))
This data frame has more than 370,000 rows; each row records a single contribution made to a particular campaign in that year.
str(contriWA_2016,width = 60, strict.width = 'cut')
## 'data.frame': 374584 obs. of 10 variables:
## $ id : chr "3982630.rcpt" "3982631.rcp"..
## $ contributor_state : chr "WA" "WA" "WA" "WA" ...
## $ contributor_zip : num 98683 98683 98683 98168 9850..
## $ amount : num 50 50 50 500 900 900 50 225 ..
## $ election_year : int 2016 2016 2016 2016 2016 201..
## $ party : Factor w/ 9 levels "","CONSTITUT"..
## $ cash_or_in_kind : Factor w/ 2 levels "Cash","In ki"..
## $ contributor_location: chr "(45.60817, -122.51972)" "("..
## $ Lat : num 45.6 45.6 45.6 47.5 47 ...
## $ Lon : num -123 -123 -123 -122 -123 ...
We could plot any of the categorical or numerical columns on a map, as long as we have columns that represent a coordinate or a particular location. Here, the last two columns (Lon/Lat) or the zip code column can connect this data to a map, where we then plot the factor or number of our choice.
Keep in mind that if every row of the data is identified by a zip code, we need a map of zip codes.
Maps come in different formats. The most common is the shapefile, which is in fact a collection of files; that makes it more complicated to read a map from a cloud repository. The recommendation is to keep all the shapefile components in a zipped folder in your cloud repository.
# link to zipped folder
zippedSHP= "https://github.com/EvansDataScience/data/raw/master/WAzips.zip"
The strategy in R will be to download the compressed folder into your computer. Then, use the following code to unzip it.
library(utils)
temp=tempfile()
download.file(zippedSHP, temp)
unzip(temp)
To see what shapefiles are now in your session folder:
(maps=list.files(pattern = 'shp'))
## [1] "SAEP_ZIP_Code_Tabulation_Areas.shp"
Use that file name to open the map:
# notice the parameters used to read the file
library(rgdal)
wazipMap <- readOGR("SAEP_ZIP_Code_Tabulation_Areas.shp",stringsAsFactors=F)
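As a side note, the newer sf package can read the same file with st_read (a sketch, assuming sf is installed; the rest of this section keeps the sp object returned by readOGR):
library(sf)
wazip_sf = st_read("SAEP_ZIP_Code_Tabulation_Areas.shp",
                   stringsAsFactors = FALSE)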
This is your map:
library(tmap)
waZips = tm_shape(wazipMap) + tm_polygons()
waZips
You can control color like this:
tm_shape(wazipMap) + tm_polygons(border.col = 'blue',
col = 'yellow')
Sometimes you need a dissolved map, that is, one that keeps only the outer boundary:
library(rmapshaper)
# This will make just a border of the state
baseMap <- ms_dissolve(wazipMap)
Then:
waBorder = tm_shape(baseMap) + tm_polygons(col = 'white',
lwd = 1)
waBorder
The data frame contriWA_2016 has columns with coordinates. Let’s turn it into a spatial points data frame, making sure it uses the same coordinate system as our map:
library(raster)
##
## Attaching package: 'raster'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:magrittr':
##
## extract
mapCRS=crs(wazipMap) # projection of our map
contriWA_geo <- SpatialPointsDataFrame(contriWA_2016[,c(10:9)], # Lon/Lat
contriWA_2016, #the original data frame
proj4string = mapCRS) # assign a CRS of map
Our new spatial points data frame looks the same:
names(contriWA_geo)
## [1] "id" "contributor_state" "contributor_zip"
## [4] "amount" "election_year" "party"
## [7] "cash_or_in_kind" "contributor_location" "Lat"
## [10] "Lon"
But it is not a simple data frame:
class(contriWA_geo)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
Now, plot the coordinates of the contributors on top of the dissolved map (each contribution becomes a point):
waDots1 = waBorder +
tm_shape(contriWA_geo) +
tm_dots(size = 0.1,col = 'red',alpha=0.5,shape = 20)
waDots1
Of course, we can add more elements, like a title and its position:
waDots2 = waDots1 +
tm_layout(main.title = "Points",
main.title.position = 'center')
waDots2
The compass (position and symbol):
waDots3 = waDots2 +
tm_compass(position = c('left','TOP'),type = 'arrow')
waDots3
And the scale bar:
waDots4 = waDots3 +
tm_scale_bar(position=c("RIGHT", "BOTTOM"),width = 0.2)
waDots4
Interactive maps are very common these days; in that situation, Leaflet is a good option:
library(leaflet)
leaflet(contriWA_geo) %>% # the spatial points data frame
addTiles() %>% #basemap
addCircleMarkers(clusterOptions = markerClusterOptions())
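If you need to share the interactive result outside RStudio, the htmlwidgets package can save it as a standalone HTML file (a minimal sketch; the file name is just an example):
library(htmlwidgets)
interMap = leaflet(contriWA_geo) %>%
    addTiles() %>%
    addCircleMarkers(clusterOptions = markerClusterOptions())
saveWidget(interMap, 'contributions_map.html') # example output file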
When your data can be organized so that each row represents a geographical unit, you can plot it on a map. In the current format, however, each row represents a single contribution. We need instead a data frame where each row is a zip code and the amount tells us, for example, the average contribution generated in that location. This is an aggregation process:
library(dplyr)
WA_zip_contri= contriWA_2016 %>%
group_by(contributor_zip) %>%
summarize('AVE_Amount'=mean(amount))
#see result:
head(WA_zip_contri)
## # A tibble: 6 x 2
## contributor_zip AVE_Amount
## <dbl> <dbl>
## 1 98001 331.
## 2 98002 94.1
## 3 98003 143.
## 4 98004 1795.
## 5 98005 1488.
## 6 98006 264.
This data frame has the average of contributions for every zip code.
Our map also has interesting information (check the definitions here):
names(wazipMap)
## [1] "OBJECTID" "ZCTA5CE10" "GEOID10" "MTFCC10" "FUNCSTAT10"
## [6] "PARTFLG10" "INTPTLON10" "INTPTLAT10" "Version" "POP2000"
## [11] "POP2001" "POP2002" "POP2003" "POP2004" "POP2005"
## [16] "POP2006" "POP2007" "POP2008" "POP2009" "POP2010"
## [21] "POP2011" "POP2012" "POP2013" "POP2014" "POP2015"
## [26] "POP2016" "POP2017" "HHP2000" "HHP2001" "HHP2002"
## [31] "HHP2003" "HHP2004" "HHP2005" "HHP2006" "HHP2007"
## [36] "HHP2008" "HHP2009" "HHP2010" "HHP2011" "HHP2012"
## [41] "HHP2013" "HHP2014" "HHP2015" "HHP2016" "HHP2017"
## [46] "GQ2000" "GQ2001" "GQ2002" "GQ2003" "GQ2004"
## [51] "GQ2005" "GQ2006" "GQ2007" "GQ2008" "GQ2009"
## [56] "GQ2010" "GQ2011" "GQ2012" "GQ2013" "GQ2014"
## [61] "GQ2015" "GQ2016" "GQ2017" "HU2000" "HU2001"
## [66] "HU2002" "HU2003" "HU2004" "HU2005" "HU2006"
## [71] "HU2007" "HU2008" "HU2009" "HU2010" "HU2011"
## [76] "HU2012" "HU2013" "HU2014" "HU2015" "HU2016"
## [81] "HU2017" "OHU2000" "OHU2001" "OHU2002" "OHU2003"
## [86] "OHU2004" "OHU2005" "OHU2006" "OHU2007" "OHU2008"
## [91] "OHU2009" "OHU2010" "OHU2011" "OHU2012" "OHU2013"
## [96] "OHU2014" "OHU2015" "OHU2016" "OHU2017" "Shape__Are"
## [101] "Shape__Len"
The column with the zip code is named ZCTA5CE10; let’s check its data type:
str(wazipMap$ZCTA5CE10)
## chr [1:598] "98001" "98002" "98003" "98004" "98005" "98006" "98007" ...
ZCTA5CE10 is already character, so let’s turn contributor_zip into character as well, so both key columns share the same type:
WA_zip_contri$contributor_zip=as.character(WA_zip_contri$contributor_zip)
Now that the key columns share a data type, we can merge.
# As the zip codes in each are under different column names,
# I tell the merge function what columns to use:
layerContrib=merge(wazipMap,WA_zip_contri, # use map first
by.x='ZCTA5CE10',
by.y='contributor_zip',all.x=F)
There is a new map: layerContrib.
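A quick sanity check (a sketch): since all.x=F keeps only the polygons that matched a zip code in our table, every remaining feature should carry an average amount:
length(layerContrib) # number of zip polygons kept after the merge
sum(is.na(layerContrib$AVE_Amount)) # should be 0, given all.x=F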
We will plot the average amounts contributed, organized into five quantile groups (quintiles). Let’s follow these steps:
library(RColorBrewer)
library(classInt)
varToPLot=layerContrib$AVE_Amount
colorForPalette='YlGnBu'
layer1= waBorder +
tm_shape(layerContrib) +
tm_polygons("AVE_Amount",
style="quantile", # classification method
n=5, # number of groups for classification
title="Contributions", # title of legend
palette=colorForPalette)
fullMap= layer1 + tm_compass(position = c('left','TOP'),type = 'arrow') +
tm_scale_bar(position=c("RIGHT", "BOTTOM"),width = 0.2)
fullMap
We need to adjust the elements:
fullMap + tm_layout(main.title = "Choropleth",
main.title.position = 'center',
legend.position = c('RIGHT','center'),
#bottom,left,top,right
inner.margins=c(0.1,0,0.1,0.2))
Of course, you can use Leaflet:
# function for COLORING quantiles in leaflet
paletteFun=colorQuantile(colorForPalette,
varToPLot,
n = 5)
# the base map
base_map = leaflet(baseMap) %>% addPolygons(weight = 3,color = 'red')
final = base_map %>%
addPolygons(data=layerContrib,
weight = 1, #thickness of border
                opacity = 1, # the closer to 0, the more transparent
                fillOpacity = 0.7, # fill color brightness
fillColor = ~paletteFun(AVE_Amount)) # coloring
final
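A legend makes the quantile colors readable; a minimal sketch reusing the same palette function with Leaflet’s addLegend:
final %>% addLegend(position = 'bottomright',
                    pal = paletteFun,
                    values = varToPLot,
                    title = 'Average contribution')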