Bivariate analysis examines two variables together to find out whether they might be associated. An association can be difficult to identify clearly, but this kind of analysis still reveals signs of one that may at least raise concern.
This time, I will use the data about crime from the Seattle Open Data portal:
link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))
Let’s see what kind of data we have:
str(crime,width = 70,strict.width='cut')
## 'data.frame': 499698 obs. of 17 variables:
## $ Report.Number : chr "20130000244104" "201300002429"..
## $ Occurred.Date : Date, format: "2013-07-09" ...
## $ year : num 2013 2013 2013 2013 2013 ...
## $ month : num 7 7 7 7 7 7 7 7 7 7 ...
## $ weekday : Ord.factor w/ 7 levels "Monday"<"Tu"..
## $ Occurred.Time : num 1930 1917 1900 1900 1846 ...
## $ Occurred.DayTime : Ord.factor w/ 4 levels "day"<"after"..
## $ Reported.Date : Date, format: "2013-07-10" ...
## $ Reported.Time : num 1722 2052 35 1258 1846 ...
## $ DaysToReport : num 1 0 1 1 0 0 0 0 1 0 ...
## $ crimecat : Factor w/ 20 levels "AGGRAVATED ASS"..
## $ Crime.Subcategory : Factor w/ 30 levels "AGGRAVATED ASS"..
## $ Primary.Offense.Description: Factor w/ 144 levels "ADULT-VULNERA"..
## $ Precinct : Factor w/ 5 levels "EAST","NORTH",....
## $ Sector : Factor w/ 23 levels "6804","9512",....
## $ Beat : Factor w/ 64 levels "B1","B2","B3",...
## $ Neighborhood : Factor w/ 58 levels "ALASKA JUNCTIO"..
The main way to organize these relationships is the contingency table. Let’s select a couple of categorical variables:
(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))
##
## day afternoon evening night
## AGGRAVATED ASSAULT 3564 5366 4884 7501
## ARSON 196 167 191 486
## BURGLARY 24139 22288 14121 16082
## CAR PROWL 26740 38273 42595 34839
## DISORDERLY CONDUCT 41 81 67 79
## DUI 706 939 2038 8522
## FAMILY OFFENSE-NONVIOLENT 1748 2516 1217 1120
## GAMBLE 4 4 7 2
## HOMICIDE 41 46 49 131
## LIQUOR LAW VIOLATION 112 491 410 606
## LOITERING 20 31 25 9
## NARCOTIC 2415 6416 3924 4109
## PORNOGRAPHY 65 53 17 31
## PROSTITUTION 115 675 1425 1340
## RAPE 332 318 354 854
## ROBBERY 2584 4737 4139 5372
## SEX OFFENSE-OTHER 1501 1759 1014 1776
## THEFT 38687 64868 38980 28410
## TRESPASS 4848 5184 2598 3289
## WEAPON 735 1445 947 1624
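To make the mechanics of `table()` easier to follow, here is a minimal sketch on a tiny made-up data frame. The data frame `toy` and its values are hypothetical and are not taken from the Seattle data; `addmargins()` is a base R helper that appends row and column totals.

```r
# Hypothetical toy data, NOT the Seattle crime records
toy <- data.frame(
  crime   = c("THEFT", "THEFT", "DUI", "THEFT", "DUI", "DUI"),
  daytime = c("day", "night", "night", "day", "night", "day")
)

# cross-tabulate: first argument goes to the rows, second to the columns
counts <- table(toy$crime, toy$daytime)
counts

# same table with row and column totals appended
with_totals <- addmargins(counts)
with_totals
```

The totals make it easy to check that every observation was counted exactly once.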
The table above shows counts, but in most situations, it is important to see relative values:
# using "pipes" to help readability:
library(magrittr)
CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
prop.table() %>% #compute proportion and then...
"*"(100)%>% # multiply by 100 and then...
round(2) #...round to two decimals
# you get:
CrimeTotal
##
## day afternoon evening night
## AGGRAVATED ASSAULT 0.71 1.07 0.98 1.50
## ARSON 0.04 0.03 0.04 0.10
## BURGLARY 4.83 4.46 2.83 3.22
## CAR PROWL 5.35 7.66 8.53 6.98
## DISORDERLY CONDUCT 0.01 0.02 0.01 0.02
## DUI 0.14 0.19 0.41 1.71
## FAMILY OFFENSE-NONVIOLENT 0.35 0.50 0.24 0.22
## GAMBLE 0.00 0.00 0.00 0.00
## HOMICIDE 0.01 0.01 0.01 0.03
## LIQUOR LAW VIOLATION 0.02 0.10 0.08 0.12
## LOITERING 0.00 0.01 0.01 0.00
## NARCOTIC 0.48 1.28 0.79 0.82
## PORNOGRAPHY 0.01 0.01 0.00 0.01
## PROSTITUTION 0.02 0.14 0.29 0.27
## RAPE 0.07 0.06 0.07 0.17
## ROBBERY 0.52 0.95 0.83 1.08
## SEX OFFENSE-OTHER 0.30 0.35 0.20 0.36
## THEFT 7.75 12.99 7.80 5.69
## TRESPASS 0.97 1.04 0.52 0.66
## WEAPON 0.15 0.29 0.19 0.33
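When `prop.table()` is called without a `margin`, every cell is divided by the grand total, so the whole table adds up to 100. The sketch below shows this on a small made-up table; the object names `counts` and `pct_total` and the numbers are assumptions chosen for illustration, and base R is used instead of pipes to keep it self-contained.

```r
# Hypothetical 2x2 table of counts (values chosen so the grand total is 100)
counts <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                          dimnames = list(c("DUI", "THEFT"),
                                          c("day", "night"))))

# share of each cell in the grand total, as a percent
pct_total <- round(prop.table(counts) * 100, 2)
pct_total

# all cells together account for 100 percent
sum(pct_total)
```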
Those tables show total counts or percents. However, when a table is meant to explore a hypothesized relationship, you should put the independent variable in the columns and the dependent one in the rows; then, compute the percents by column to see how the levels of the dependent variable vary within each level of the independent one, and compare along the rows.
CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
prop.table(margin = 2)%>% # 2 is % by column
"*"(100)%>%
round(3)
# you get:
CrimeCol
##
## day afternoon evening night
## AGGRAVATED ASSAULT 3.282 3.447 4.104 6.456
## ARSON 0.180 0.107 0.161 0.418
## BURGLARY 22.229 14.319 11.866 13.842
## CAR PROWL 24.624 24.588 35.794 29.987
## DISORDERLY CONDUCT 0.038 0.052 0.056 0.068
## DUI 0.650 0.603 1.713 7.335
## FAMILY OFFENSE-NONVIOLENT 1.610 1.616 1.023 0.964
## GAMBLE 0.004 0.003 0.006 0.002
## HOMICIDE 0.038 0.030 0.041 0.113
## LIQUOR LAW VIOLATION 0.103 0.315 0.345 0.522
## LOITERING 0.018 0.020 0.021 0.008
## NARCOTIC 2.224 4.122 3.297 3.537
## PORNOGRAPHY 0.060 0.034 0.014 0.027
## PROSTITUTION 0.106 0.434 1.197 1.153
## RAPE 0.306 0.204 0.297 0.735
## ROBBERY 2.380 3.043 3.478 4.624
## SEX OFFENSE-OTHER 1.382 1.130 0.852 1.529
## THEFT 35.626 41.674 32.756 24.453
## TRESPASS 4.464 3.330 2.183 2.831
## WEAPON 0.677 0.928 0.796 1.398
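A quick way to confirm that percents were computed by column is to check that each column adds up to 100. The following is a minimal sketch on a made-up table; `counts` and `pct_col` are hypothetical names and values, not part of the crime data.

```r
# Hypothetical 2x2 table of counts
counts <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                          dimnames = list(c("DUI", "THEFT"),
                                          c("day", "night"))))

# margin = 2 divides each cell by its column total
pct_col <- prop.table(counts, margin = 2) * 100
round(pct_col, 2)

# every column (level of the independent variable) sums to 100
colSums(pct_col)

# comparing along a row: share of THEFT within each time of day
pct_col["THEFT", ]
```

With `margin = 1` the same call would compute percents by row instead, so the row sums (not the column sums) would be 100.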
The complexity of two variables requires plots, as tables like these will not allow you to discover association patterns easily, even though they are already a summary of two columns. However, you must check the data format the plotting functions require, as most plots will use the contingency table as input (not the raw data).
Let me try a basic bar plot with the contingency table as input:
barplot(CrimeCol)
This plot will need a lot of work, so the base capabilities of R may not be a good strategy; and as before, we will turn to ggplot.
However, more specialized plotting functions may need the table converted into a data frame of frequencies. Let me convert the table of column percents:
df.T=as.data.frame(CrimeCol) # table of percents computed by column
# YOU GET:
head(df.T)
## Var1 Var2 Freq
## 1 AGGRAVATED ASSAULT day 3.282
## 2 ARSON day 0.180
## 3 BURGLARY day 22.229
## 4 CAR PROWL day 24.624
## 5 DISORDERLY CONDUCT day 0.038
## 6 DUI day 0.650
We should rename the columns of the data frame above:
names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)
## Crime Daytime Percent
## 1 AGGRAVATED ASSAULT day 3.282
## 2 ARSON day 0.180
## 3 BURGLARY day 22.229
## 4 CAR PROWL day 24.624
## 5 DISORDERLY CONDUCT day 0.038
## 6 DUI day 0.650
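The conversion and the renaming can also be combined: `as.data.frame()` on a table accepts a `responseName` argument that names the value column directly. This is a sketch on a made-up table, with `pct_col` and `long` as hypothetical names; it is an alternative to the two steps above, not a replacement for them.

```r
# Hypothetical table of column percents
counts <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                          dimnames = list(c("DUI", "THEFT"),
                                          c("day", "night"))))
pct_col <- prop.table(counts, margin = 2) * 100

# one row per cell; the value column is named "Percent" instead of "Freq"
long <- as.data.frame(pct_col, responseName = "Percent")

# the two category columns still get the default names Var1 and Var2
names(long)[1:2] <- c("Crime", "Daytime")
long
```

The result has one row for each combination of row and column level, which is the long layout that ggplot expects.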
A first option is to reproduce the table as a plot:
library(ggplot2)
base = ggplot(df.T, aes(Daytime,Crime))
# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray")
# add value of Percent as label
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
nudge_x = 0.1, # push the value to the right on the horizontal
size=2)
tablePlot2