Course: Visual Analytics for Policy and Management

Prof. José Manuel Magallanes, PhD


Part 2: Visualizing Tabular data

Bivariate Case


Contents:

  1. Intro.

  2. Categorical-Categorical case.

  3. Categorical-Numerical case.

  4. Numerical-Numerical case.

Exercises:


We analyze two variables to find out whether there might be some kind of association between them. Even though such an association may be difficult to identify clearly, bivariate analysis still helps reveal signs of it that may at least warrant further attention.

As before, the nature of the data allows for some particular analytical techniques, while also providing some limitations to our inferences. Let’s see what we can visualize with the different combinations of data.

This time, I will use the data about crime from the Seattle Open Data portal:

link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))
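
As a quick sanity check (a hedged suggestion, assuming the RData file defines a single object called crime, as the code below expects), you can verify that the load worked:

# confirm the loaded object exists and is a data frame
exists("crime")
class(crime)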

The data available are:

names(crime)

A quick look will give us:

head(crime)

Let’s see what kind of data we have:

str(crime,width = 70,strict.width='cut')

Go to table of contents.

Categorical-Categorical relationships

The main way to organize these relationships is the contingency table. Let’s select a couple of categorical variables:

(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))

The table above shows counts, but in most situations, it is important to see relative values:

# using "pipes" to help readability:
library(magrittr)
(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
        prop.table() %>% #compute proportion and then...
        "*"(100)%>% # multiply by 100 and then...
        round(2) #...round to two decimals
        )

Those tables show total counts or percentages. However, when a table is meant to suggest a relationship, you should place the independent variable in the columns and the dependent one in the rows; the percentages should then be calculated by column, so you can see how the levels of the dependent variable vary within each level of the independent one, and compare along the rows.

CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
         prop.table(margin = 2)%>%   # 2 is % by column
         "*"(100)%>%
         round(3)

CrimeCol

Two variables are complex enough to require plots: tables like these, even though they already summarize the two columns, will not let you discover association patterns easily. However, you must check what data format each plotting function requires, as many plots take the contingency table as input (not the raw data).

As before, we can use the bar plot with the contingency table as input:

barplot(CrimeCol)

This plot needs a lot of work, so base R may not be the best strategy here.
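
Just to illustrate the kind of work that would be needed, here is a rough base-R sketch (the argument values are guesses you would have to tune for your output device):

# a rough base-R attempt: side-by-side bars, rotated labels, small legend
barplot(CrimeCol,
        beside = TRUE,     # one bar per crime type within each day time
        las = 2,           # rotate the axis labels
        cex.names = 0.6,   # shrink the category labels
        legend.text = rownames(CrimeCol),
        args.legend = list(x = "topright", cex = 0.4))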

However, more specialized plotting functions may require the table to be converted into a data frame of frequencies. Let me start from the table of proportions based on the total:

df.T=as.data.frame(CrimeTotal) # table of proportion based on total
# YOU GET:
head(df.T)

We should rename the columns of that data frame:

names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)

A first option is to reproduce the table as a plot:

library(ggplot2)                           
base = ggplot(df.T, aes(Daytime,Crime)) 
# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray") 
# add value of Percent as label, move it
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
                                    nudge_x = 0.1,
                                    size=2)
tablePlot2

…some more work:

tablePlot3 = tablePlot2 + scale_size_continuous(range=c(0,10)) # try changing the upper limit (10)
tablePlot4 = tablePlot3 + theme_minimal() # less ink
tablePlot4 + theme(legend.position="none") # no legend

The plot looks nice, but unless the differences are clear-cut, you may see more noise than information, which distracts and delays decision making. Keep in mind that bar lengths are easier to compare than circle areas. You need a barplot, but with better tools:

base  = ggplot(df.T, aes(x = Crime, y = Percent ) ) 
bars1 = base + geom_bar( stat = "identity" ) + theme_minimal()
# bar per day time with 'facet'
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 
bars2 

…some more work:

# change the minimal theme
bars3 = bars2 + theme( axis.text.x = element_text(angle = 90,
                                                  hjust = 1,
                                                  size=3 ) )
bars3

And the table that hypothesizes the relationship (the column percentages in CrimeCol) can be plotted like this:

df.C=as.data.frame(CrimeCol)
colnames(df.C)=c('Crime','Daytime','Percent')
#####

base  = ggplot(df.C, aes(x = Crime, y = Percent ) ) 
bars1 = base + geom_bar( stat = "identity" )
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 
bars2 + coord_flip()

The type of crime is not ordinal, so we can reorder the bars:

base  = ggplot(df.C, aes(x = reorder(Crime, Percent), y = Percent ) ) 
bars1 = base + geom_bar( stat = "identity" )
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) 
bars2 + coord_flip() + theme(axis.text.y = element_text(size=4,angle = 45)) 

Exercise 1:
Turn the bars into a lollipop plot with the right components.

Once you have a working plot of two categorical variables, you may consider other plot types:

# heatplot
base  = ggplot(df.C, aes(x = Daytime, y = reorder(Crime, Percent), fill = Percent)) 
heat1 = base +  geom_tile() 
heat2 = heat1 +scale_fill_gradient(low = "white", 
                                   high = "black")
heat2

Some work on the legend:

heat3 = heat2 + theme_classic() 

heat4 = heat3 + theme(axis.text.x = element_text(angle = 90, vjust = 0.6), 
                      legend.title = element_blank(), #no title for legend
                      legend.position="top", 
                      legend.direction="horizontal",
                      legend.key.width=unit(1, "cm"),
                      legend.key.height=unit(1, "cm")) 

heat4 + labs(y="Crime")

Exercise 2:
Change the heatplot to ascending order, where intensity goes from yellow to purple.
_____

Go to table of contents.

Categorical-Numerical relationships

As in the previous case, categorical variables can be used to understand the behavior of numerical variables. Here, the curiosity and experience of the analyst are critical for mining the data to reveal insight, because numerical data have much wider value ranges than categorical data. The data used here will be a good example of this.

The data set has a variable that records the number of days it took someone to report a crime:

summary(crime$DaysToReport)

There are several good categorical variables that can be used to study the behavior of this one. Let’s use Precinct:

tapply(crime$DaysToReport,crime$Precinct,mean)

Above, you see the mean time (in days) it takes people in each precinct to report a crime. You can quickly sketch a plot in your mind just by reading those values, but the plot you imagine may be far from this one:

boxplot(DaysToReport~Precinct,data=crime)

The plot above does not give much insight; there is too much noise. A fuller summary tells us more to consider:

tapply(crime$DaysToReport,crime$Precinct,summary)

From the information above, you know that in each precinct, 75% of crimes are reported within a day or less. If we consider that the expected behavior, we can omit those cases:

boxplot(DaysToReport~Precinct,data=crime,
        subset = DaysToReport>1) #subsetting

No clear structure appears yet. Let me try different thresholds, while also showing how to divide the plotting window:

par(mfrow=c(2,2)) # 2 rows, and two columns, by row:

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=7,
        main="One week or more")

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=30,
        main="One 30-day month or more")

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=180,
        main="180 days or more")

boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=365,
        main="One year or more")
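
One detail worth keeping in mind if you run this interactively: par(mfrow=...) stays in effect for subsequent base-R plots, so reset it once you are done with the grid:

par(mfrow=c(1,1)) # back to a single panel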

Up to this point, you need to be planning a good story. The situation looks different in each panel, but let’s build our story from the crimes that took a year or longer to report.

First, let’s see how many cases we have per precinct:

crimeYear=crime[crime$DaysToReport>=365,]
tapply(crimeYear$DaysToReport,crimeYear$Precinct,length)

The year the crime occurred would be another variable to consider, to see if we can filter more cases:

tapply(crimeYear$DaysToReport,crimeYear$year,length)

If we were to plot by year, several years before 2000 have too few cases (years with two or fewer cases may require a case study; some might even be typing mistakes). Let’s get rid of those old cases:

crimeY2000=crime[(crime$DaysToReport>=365) & (crime$year>=2000),]
tapply(crimeY2000$DaysToReport,crimeY2000$Precinct,length)

Now, we see a better boxplot:

boxplot(DaysToReport~Precinct,data=crimeY2000,
        main="One year or more (from 2000)")

For sure, it would be better if the numerical units were in years:

crimeY2000$YearsToReport=crimeY2000$DaysToReport/365
boxplot(YearsToReport~Precinct,data=crimeY2000,
        main="One year or more (from 2000)")

This is a good data subset, but you still see that the distribution is pretty similar in every precinct. At this stage, you can try visualizing the distribution of the outliers, which start at around five years:

boxplot(YearsToReport~Precinct,data=crimeY2000,subset = YearsToReport>=5,
        main="Five years or more (from 2000)")

If your story is about the similarity of distributions among precincts, you can start polishing the last plots. But if you need to show some variety, you can try another relevant categorical variable.

Let’s try weekday and Occurred.DayTime:

par(mfrow=c(2,1))

boxplot(YearsToReport~weekday,data=crimeY2000,
        main="One year or more BY WEEKDAY (from 2000)",las=2)

boxplot(YearsToReport~Occurred.DayTime,data=crimeY2000,
        main="One year or more BY TIME CRIME OCCURRED (from 2000)",las=2)

Not much variety is perceived, so let’s try crimecat and year:

par(mfrow=c(2,1))

boxplot(YearsToReport~year,data=crimeY2000,
        main="One year or more (from 2000)",las=2)

boxplot(YearsToReport~crimecat,data=crimeY2000,
        main="One year or more (from 2000)",las=2)

Years to report by year looks interesting: there is a decreasing tendency, so we can spend some time improving that plot with ggplot:

# no missing:
crimeYearGG=crimeY2000[complete.cases(crimeY2000$YearsToReport),]

base = ggplot(crimeYearGG,aes(x=factor(year), y=YearsToReport)) 
box  = base + geom_boxplot()
box

It may be that your audience does not know how to read a boxplot. It is a great plot, but it encodes a lot of statistical information. We can go simpler and use lines connecting easy-to-understand summary points of every boxplot:

base  = ggplot(crimeYearGG,aes(x=factor(year), y=YearsToReport))
mins = base + stat_summary(fun.y=min, # function for 'y' is min()
                           geom="line",
                           show.legend = T,size=1,
                           aes(group=1,col='Min'))
mins # just the min values

Let me add the max values:

minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",
                              linetype='dashed',
                              size=1,show.legend = F,
                              aes(group=1,col='Max'))

minsMaxs

Adding the median:

minsMaxsMd= minsMaxs + stat_summary(fun.y=median,
                                    geom="line",size=2,
                                    aes(group=1,col='Median'))
minsMaxsMd

Let’s take control of the colors by customizing the legend:

# Change color of lines:
all1=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red")
                                      )
all1

Now, let’s complete the story by showing how the data we filtered out behaves, that is, the crimes that took less than a year to report since 2000 (we will not include data from earlier years):

# data preparation:

crimeWeek=crime[(crime$DaysToReport<365) & (crime$year>=2000),]

crimeWeek$WeeksToReport=crimeWeek$DaysToReport/7

crimeYearGG2=crimeWeek[complete.cases(crimeWeek$WeeksToReport) &
                       complete.cases(crimeWeek$crimecat),]
#plotting it:
base = ggplot(crimeYearGG2,aes(x=factor(year), y=WeeksToReport)) 
mins = base + stat_summary(fun.y=min,size=1,
                           geom="line", linetype='dashed',show.legend = T,
                           aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",size=1,show.legend = F,
                              aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median,
                                    geom="line",size=2,
                                    aes(group=1,col='Median'))
all2=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red")
                                      )
all2 

We also found variability in the type of crime, so we could try a story with it; first with Years to report (for crimes that took a year or longer to report, after year 2000):

base= ggplot(crimeYearGG,
             aes(x = reorder(crimecat, YearsToReport, FUN = max), # reorder!
                 y=YearsToReport)) 
mins = base + stat_summary(fun.y=min,size=1,
                           geom="line", linetype='dashed',show.legend = T,
                           aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",size=1,show.legend = F,
                              aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median, size=2,
                                    geom="line",
                                    aes(group=1,col='Median'))
all3=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red"))
all3 + coord_flip()

Now, for crimes that took less than a year to report, after year 2000:

base = ggplot(crimeYearGG2,
              aes(x = reorder(crimecat, WeeksToReport, FUN = max),
                  y=WeeksToReport)) 
mins = base + stat_summary(fun.y=min,size=1,
                           geom="line", linetype='dashed',show.legend = T,
                           aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
                              geom="line",size=1,show.legend = F,
                              aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median,size=2,
                                    geom="line",
                                    aes(group=1,col='Median'))
all3=minsMaxsMd + scale_colour_manual(name="Trends",
                                      values=c("blue", "black","red"))
all3+coord_flip()

Exercise 3:
Complete the information needed in the previous plots.

Scientific texts very often mention the mean difference test known as one-way ANOVA, which, beyond describing the data as we have done, seeks to show whether the mean of the numerical variable varies across the levels of the categorical variable.

#making a subset:
anovaData=crimeY2000[crimeY2000$YearsToReport>=5,]

#checking the mean per factor value:
tapply(anovaData$YearsToReport, anovaData$Precinct, mean,na.rm=T)
library(Rmisc)

group.CI(YearsToReport ~ Precinct, 
         data=anovaData, 
         ci = 0.95)
anovaData=anovaData[complete.cases(anovaData),]

# introducing ggpubr
library(ggpubr)
ggline(data=anovaData,x = "Precinct", y = "YearsToReport",add = 'mean_ci',
       error.plot = "pointrange") + scale_y_continuous(breaks=seq(7,10,0.5))
# Compute the analysis of variance
res.aov <- aov(YearsToReport ~ Precinct, data = anovaData)

# Extract the p-value from the ANOVA table
summary(res.aov)[[1]]$Pr[1]
# non parametric
kruskal.test(YearsToReport ~ Precinct, data = anovaData)
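
If those p-values fall below your significance level, a natural follow-up is a post-hoc pairwise comparison to see which precincts actually differ; a hedged sketch using base R functions:

# which pairs of precincts differ?
TukeyHSD(res.aov)                             # parametric, follows aov()
pairwise.wilcox.test(anovaData$YearsToReport,
                     anovaData$Precinct,
                     p.adjust.method = "BH")  # non-parametric counterpart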

Go to table of contents.

Numerical-Numerical relationships

The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using has few numerical columns, but I will produce two by aggregating the original data set by Neighborhood:

  • Aggregating days to report by neighborhood:
# 1. MEAN of days it takes to report a crime by neighborhood
daysByNeigh=tapply(crime$DaysToReport, crime$Neighborhood, mean,na.rm=T)

# you have:
head(daysByNeigh)
  • Aggregating crimes by neighborhood
# 2. PROPORTION of crimes by neighborhood
crimesByNeigh=tapply(crime$crimecat, crime$Neighborhood, length)%>%      
                     prop.table()%>%
                     "*"(100)%>% 
                     round(2) 
head(crimesByNeigh)
  • Converting to data frames: We will turn the result of tapply into a data frame, moving the row names into a column:
library(tibble)
as.data.frame(daysByNeigh)%>%rownames_to_column()

Knowing how it works, we can create two data frames:

daysByNeigh=as.data.frame(daysByNeigh)%>%rownames_to_column()
crimesByNeigh=as.data.frame(crimesByNeigh)%>%rownames_to_column()
  • Merging the two data frames: Since both data frames have the same neighborhoods, we can make one data frame by merging them:
num_num=merge(daysByNeigh,crimesByNeigh) # 'row name' is the "key"
head(num_num)

Once we have the data organized, the clear option is the scatterplot:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
plot1= base +  geom_point() 
plot1
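
Since this subsection is framed as correlation analysis, it also helps to quantify the association the scatterplot suggests; a minimal sketch with base R (Pearson by default, Spearman as a rank-based alternative):

# quantify the association seen in the scatterplot
cor.test(num_num$daysByNeigh, num_num$crimesByNeigh)  # Pearson r with a p-value
cor(num_num$daysByNeigh, num_num$crimesByNeigh,
    method = "spearman", use = "complete.obs")        # rank-based alternative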

We can improve the plot, this time introducing ggrepel:

library(ggrepel)
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,
                           label=rowname)) # you need this aesthetics!
plot1= base +  geom_point() 
plot1 + geom_text_repel()

We can limit the labels, annotating the ones that represent at least 5% of the crimes in the city:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,label=rowname)) 
plot1= base +  geom_point() 
plot1 + geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
                                         num_num$rowname, "")))

Notice the difference without ggrepel:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
scatp1 = base +  geom_point() 
scatp1 + geom_text(aes(label=ifelse(crimesByNeigh>=5,num_num$rowname, "")))

The good thing is that ggrepel makes better use of the space; you can also use it to label just one particular case:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,label=rowname)) 
base +  geom_point() + geom_text_repel(aes(label=ifelse(num_num$rowname=='NORTHGATE',
                                                        num_num$rowname, "")))

An alternative, to highlight overlapping points:

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
scatp1 = base +  geom_hex(bins = 10)
scatp2= scatp1 + geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
                                                  num_num$rowname,
                                                  ""))) 
scatp2 + scale_fill_distiller(palette ="Greys",direction=1) # try -1
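
The "Greys" palette used above belongs to the ColorBrewer family; you can also preview every available palette from within R, a minimal sketch assuming the RColorBrewer package is installed:

# preview all ColorBrewer palettes usable by scale_fill_distiller()
library(RColorBrewer)
display.brewer.all()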

The palettes can be selected from the ColorBrewer website. Using the same palette as before, we can try a different plot (stat_density_2d):

base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh)) 
scatp1 = base +  stat_density_2d(aes(fill = ..density..), 
                                 geom = "raster", contour = FALSE)
scatp2=scatp1+geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
                                               num_num$rowname, "")))
scatp3 = scatp2 +  theme(legend.position='none') 
scatp4= scatp3 + scale_fill_distiller(palette="Greys", direction=1) 
scatp4

The extra space you see can disappear using:

scatp5 = scatp4 +  scale_x_continuous(expand = c(0, 0)) + 
         scale_y_continuous(expand = c(0, 0)) 
scatp5

Exercise 4:
Complete the elements missing in the previous plots.


Go to table of contents.

Back to course schedule menu