Course: Computational Thinking for Governance Analytics

Prof. José Manuel Magallanes, PhD


Session 3: Modeling Governance Complexity in R

In this session, we will learn some basics on modeling, namely two important techniques:

  1. Linear Regression
  2. Logistic Regression

This session will work with a collection of indexes:

link='https://raw.githubusercontent.com/EvansDataScience/data/master/allIndexes.csv'
indexes=read.csv(link,stringsAsFactors = F)

names(indexes)
## [1] "Country"         "ISO"             "corruptionIndex" "scoreEconomy"   
## [5] "scorepress"      "environment"     "presscat"        "environmentCat"

We can check descriptives:

summary(indexes)
##    Country              ISO            corruptionIndex  scoreEconomy  
##  Length:129         Length:129         Min.   :14.00   Min.   :2.920  
##  Class :character   Class :character   1st Qu.:31.00   1st Qu.:6.320  
##  Mode  :character   Mode  :character   Median :40.00   Median :6.920  
##                                        Mean   :45.42   Mean   :6.829  
##                                        3rd Qu.:58.00   3rd Qu.:7.510  
##                                        Max.   :90.00   Max.   :8.810  
##    scorepress     environment       presscat     environmentCat  
##  Min.   : 8.59   Min.   :37.10   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:49.12   1st Qu.:59.25   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :59.63   Median :70.84   Median :2.000   Median :1.0000  
##  Mean   :58.45   Mean   :69.39   Mean   :2.093   Mean   :0.5426  
##  3rd Qu.:66.89   3rd Qu.:81.26   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :80.96   Max.   :90.68   Max.   :3.000   Max.   :1.0000

And data types:

str(indexes,width=70,strict.width='cut')
## 'data.frame':    129 obs. of  8 variables:
##  $ Country        : chr  "New Zealand" "Denmark" "Finland" "Sweden" ..
##  $ ISO            : chr  "NZL" "DNK" "FIN" "SWE" ...
##  $ corruptionIndex: int  90 90 89 88 86 85 84 83 82 81 ...
##  $ scoreEconomy   : num  8.48 7.77 7.75 7.65 8.44 7.67 8.81 7.74 7.9..
##  $ scorepress     : num  79.5 80.7 81 77.2 77.8 ...
##  $ environment    : num  88 89.2 90.7 90.4 86.9 ...
##  $ presscat       : int  3 3 3 3 3 3 1 3 3 3 ...
##  $ environmentCat : int  1 1 1 1 1 1 1 1 1 1 ...

Keep in mind that the last two are categorical. Presscat is a categorical representation of the press freedom index:

table(indexes$presscat)
## 
##  1  2  3 
## 23 71 35

And environmentCat is a categorical representation of the environmental performance index (EPI):

table(indexes$environmentCat)
## 
##  0  1 
## 59 70

The former is ordinal (the higher the better), while the latter is dichotomous (good or bad performance).
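As an aside, categorical versions like these are usually derived from the continuous scores. Here is a minimal sketch of how such a recoding could be done; the 70-point threshold is purely illustrative and is NOT the cutoff actually used to build this data set:

# hypothetical recoding of the continuous score into good/bad performance
# (the real threshold behind 'environmentCat' is not documented here)
myEnvironmentCat = ifelse(indexes$environment > 70, 1, 0)
table(myEnvironmentCat)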

Linear Regression

You need a linear regression when your variable of interest is continuous (a measurement). In this case, I will simply use the EPI as my explanandum (Y):

hist(indexes$environment)

The other variables in the set could be used as our explanans (X), that is, we should have enough theoretical or experiential background to hypothesize that the behavior of Y is affected by these X variables.

You are comfortable with Linear Regression when you can verify that each variable in X has a linear relationship with Y:

explanans=names(indexes)[c(3:5)]

for (x in explanans){
    p=cor.test(indexes[,x],indexes$environment)
    
    if(p$p.value<0.05){
        messageYES=paste('environment index is correlated with',x)
        print(messageYES)
        
    }else{
        messageNO=paste('environment index is NOT correlated with',x)
        print(messageNO)
    }
    
}
## [1] "environment index is correlated with corruptionIndex"
## [1] "environment index is correlated with scoreEconomy"
## [1] "environment index is correlated with scorepress"

All three correlations are statistically significant, though their strength is moderate to low.
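The loop above only reports significance. To gauge the strength of each relationship, you can extract the correlation estimates themselves; a quick sketch:

# Pearson's r of each explanans with the environment index
sapply(explanans, function(x) cor(indexes[,x], indexes$environment))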

It would be great if the correlation between each pair of variables in X were not high:

cor(indexes[explanans])
##                 corruptionIndex scoreEconomy scorepress
## corruptionIndex       1.0000000    0.6959598  0.5986622
## scoreEconomy          0.6959598    1.0000000  0.4119720
## scorepress            0.5986622    0.4119720  1.0000000

Most correlations are moderate; however, the correlation between corruption and economy is close to being considered high (traditionally, a correlation is high once it reaches +/- 0.7).

Given the above exploration, we can go and request a linear regression model:

LinRegEPI = lm(environment ~ corruptionIndex + scoreEconomy + scorepress, data = indexes)

That was it! I just told R that the Y (environment) is to be regressed (~) on all those X variables (corruptionIndex + scoreEconomy + scorepress).

You can see the detailed result using:

summary(LinRegEPI)
## 
## Call:
## lm(formula = environment ~ corruptionIndex + scoreEconomy + scorepress, 
##     data = indexes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0284  -5.8354  -0.0301   7.7106  31.5477 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     29.26063    9.07126   3.226   0.0016 ** 
## corruptionIndex  0.40824    0.07909   5.162 9.35e-07 ***
## scoreEconomy     3.48563    1.47797   2.358   0.0199 *  
## scorepress      -0.03788    0.09164  -0.413   0.6800    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.9 on 125 degrees of freedom
## Multiple R-squared:  0.4655, Adjusted R-squared:  0.4527 
## F-statistic: 36.28 on 3 and 125 DF,  p-value: < 2.2e-16

However, we tend to look for a couple of key indicators:

  • How much is each variable in X influencing Y?

The sign of the coefficient of an X indicates whether it affects Y directly or inversely. The absolute value of the coefficient suggests which one impacts more. You should only be confident of these effects if the p-value in the column Pr(>|t|) is below your significance level (typically 0.05).

  • What measure should I use to evaluate the regression as a whole?
summary(LinRegEPI)$adj.r.squared # from 0 to 1, the closer to 1 the better.
## [1] 0.4526508
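If you prefer to read both indicators programmatically, the coefficients table can be extracted from the summary; a minimal sketch (the name 'coeffsEPI' is just illustrative):

# estimates and a TRUE/FALSE flag for significance at the 0.05 level
coeffsEPI = coef(summary(LinRegEPI))
data.frame(Estimate = coeffsEPI[, 1],
           Significant = coeffsEPI[, 4] < 0.05)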

The way to evaluate a regression does not end here. There may be several issues affecting these results:

  • Residuals, the difference between what the model predicts Y to be (Yp) and the actual value of Y, should be close to a normal distribution. For that, we need to see the points fall along the diagonal of the qqplot (a numeric version of this check appears right after this list).
# normality of residuals?
library(car)
qqPlot(LinRegEPI, main="QQ Plot")

  • Multicollinearity, which occurs when the predictors are highly correlated with one another:
# collinearity?
vif(LinRegEPI) > 4 # problem if some are TRUE
## corruptionIndex    scoreEconomy      scorepress 
##           FALSE           FALSE           FALSE
  • Heteroskedasticity, when the bivariate relationship between the Residuals and the Yp is not constant. That is, the error variance changes with the level of the response, which is signaled when the ncvTest result is significant (the TRUE below means the p-value is under 0.05, so this model does show heteroskedasticity):
# Heteroskedastic?
ncvTest(LinRegEPI)$p<0.05
## [1] TRUE
  • Outliers. Outliers are almost always present. In practice, you cannot address the previous issues without first identifying influential cases and deciding whether they can be controlled:
influencePlot(LinRegEPI,    
              id.method="noteworthy",
              id.n=2, 
              main="Identifying outliers", 
              sub="Circle size is proportial to Cook's Distance")

##        StudRes        Hat       CookD
## 40  -2.5213589 0.06450724 0.105087771
## 64  -0.2481556 0.17029510 0.003183753
## 127  3.3406953 0.18830904 0.598625540
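As promised after the first diagnostic: the normality of the residuals can also be tested numerically, complementing the qqplot. A minimal sketch using base R's Shapiro-Wilk test:

# Shapiro-Wilk test on the residuals of the model:
# p < 0.05 would suggest the residuals are NOT normally distributed
shapiro.test(residuals(LinRegEPI))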

When outliers are prominent, you could ask for robust regression:

library(MASS)
LinRegEPI_R = rlm(environment ~ corruptionIndex + scoreEconomy + scorepress, 
                data = indexes)

summary(LinRegEPI_R)
## 
## Call: rlm(formula = environment ~ corruptionIndex + scoreEconomy + 
##     scorepress, data = indexes)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.8916  -6.5786  -0.3437   7.4061  36.6714 
## 
## Coefficients:
##                 Value   Std. Error t value
## (Intercept)     20.0001  9.2062     2.1725
## corruptionIndex  0.3681  0.0803     4.5858
## scoreEconomy     5.0979  1.5000     3.3987
## scorepress      -0.0354  0.0930    -0.3806
## 
## Residual standard error: 10.66 on 125 degrees of freedom

You don’t see a p-value now, but you can consider a coefficient significant when its t value is greater than two in absolute terms.

Above, you have corrected coefficients. The same coefficients remain significant, and their directions are kept too. The effect of economy (EFI) has increased, while the intercept (the value of Y when all the X are zero) has decreased.
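Since rlm does not report p-values, you can approximate them from the t values. A rough sketch, assuming the 125 residual degrees of freedom reported above (an approximation, not an exact test):

# approximate two-sided p-values from the robust t values
tvals = summary(LinRegEPI_R)$coefficients[, 't value']
2 * pt(abs(tvals), df = 125, lower.tail = FALSE)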

You can also omit the outliers and see the effect on the regression:

outCases=c(40,64,127)
indexes[outCases,]
##       Country ISO corruptionIndex scoreEconomy scorepress environment
## 40     Rwanda RWA              54         7.57      34.94       50.34
## 64      China CHN              40         6.40       8.59       65.10
## 127 Venezuela VEN              17         2.92      44.78       76.23
##     presscat environmentCat
## 40         1              0
## 64         1              0
## 127        1              1
LinRegEPI_OUT = lm(environment ~ corruptionIndex + scoreEconomy + scorepress, data = indexes[-outCases,])

summary(LinRegEPI_OUT)
## 
## Call:
## lm(formula = environment ~ corruptionIndex + scoreEconomy + scorepress, 
##     data = indexes[-outCases, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.4764  -6.0975  -0.3531   7.5999  26.1431 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     17.86344    9.60119   1.861 0.065216 .  
## corruptionIndex  0.38508    0.07815   4.928 2.65e-06 ***
## scoreEconomy     5.85382    1.52971   3.827 0.000206 ***
## scorepress      -0.10233    0.09742  -1.050 0.295616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.26 on 122 degrees of freedom
## Multiple R-squared:  0.5305, Adjusted R-squared:  0.5189 
## F-statistic: 45.95 on 3 and 122 DF,  p-value: < 2.2e-16

Notice the improvement in the Adjusted R-squared.
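A quick side-by-side comparison of the fit with and without the outlying cases:

# adjusted R-squared of the original model vs. the model without outliers
summary(LinRegEPI)$adj.r.squared
summary(LinRegEPI_OUT)$adj.r.squared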

Finally in this section, let me show you what to do when one of the X variables is categorical:

  • Set the reference for the categorical variable. When one of your X variables is categorical, the result will show a coefficient for every level of the category except one. The missing one is the reference; it is needed because each coefficient informs the effect of a particular level compared to the reference.
# The function 'relevel' CAN NOT accept ordinals. 
# This is why I did not set it as ordinal before.
# This variable has 3 levels. Let's choose '1' as the reference.
indexes$presscat=as.factor(indexes$presscat)
indexes$presscat <- relevel(indexes$presscat, ref = 1)
  • Write the equation of the regression as usual:
LinRegEPI_catX <- lm(environment ~ corruptionIndex + scoreEconomy + presscat,data = indexes)

summary(LinRegEPI_catX)
## 
## Call:
## lm(formula = environment ~ corruptionIndex + scoreEconomy + presscat, 
##     data = indexes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.1889  -5.7063  -0.0712   7.7190  30.5783 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     28.68526    8.45163   3.394 0.000925 ***
## corruptionIndex  0.38868    0.09104   4.269 3.87e-05 ***
## scoreEconomy     3.54757    1.49346   2.375 0.019062 *  
## presscat2       -1.62576    2.63061  -0.618 0.537699    
## presscat3       -1.02657    3.79998  -0.270 0.787492    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.93 on 124 degrees of freedom
## Multiple R-squared:  0.4664, Adjusted R-squared:  0.4492 
## F-statistic:  27.1 on 4 and 124 DF,  p-value: 3.648e-16

The result tells you that the change from category 1 to category 2 has no significant effect. The same applies to the change from 1 to 3 (the change from 2 to 3 is not reported).
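If you ever doubt which level is serving as the reference, you can inspect the factor directly; a quick check:

# the first level listed is the reference category
levels(indexes$presscat)
# the dummy variables lm builds for this factor
contrasts(indexes$presscat)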



Logistic Regression

You need a logistic regression when your variable of interest is categorical. In this case, I will simply use the categorical version of EPI as my explanandum (Y):

barplot(table(indexes$environmentCat))

Before, the linear regression model computed the coefficients of X so that we could predict the value of a continuous Y. In this case, since the dependent variable Y takes only two values, ‘0’ and ‘1’, we will use a binary logistic regression. The model will instead help you see which of the X variables increase the ‘odds’ of getting a ‘1’.
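If the term ‘odds’ is unfamiliar: the odds of an event are p/(1-p), and the logistic model works on the log of those odds. A tiny numeric illustration:

# a probability of 0.75 corresponds to odds of 3:
# the event is three times as likely as its complement
p = 0.75
p / (1 - p)       # odds
log(p / (1 - p))  # log-odds (logit), the scale of the raw coefficients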

The way to request this model is very similar to linear regression:

# function 'glm' !
LogitEPI_a = glm(environmentCat ~ corruptionIndex + scoreEconomy, 
                   data = indexes,
                   family = binomial())

# see full results: 
summary(LogitEPI_a)
## 
## Call:
## glm(formula = environmentCat ~ corruptionIndex + scoreEconomy, 
##     family = binomial(), data = indexes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0633  -0.8287   0.1968   0.8611   2.4499  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -5.13431    1.94931  -2.634 0.008441 ** 
## corruptionIndex  0.07698    0.02026   3.800 0.000145 ***
## scoreEconomy     0.29992    0.33465   0.896 0.370134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.89  on 128  degrees of freedom
## Residual deviance: 130.80  on 126  degrees of freedom
## AIC: 136.8
## 
## Number of Fisher Scoring iterations: 5

Instead of the Adjusted R-squared, the glm function offers the Akaike Information Criterion (AIC) as a relative measure of fit. When comparing two models, the one with the smaller AIC is the better of the two. Let’s fit another model:

# remember that presscat is factor
LogitEPI_b =glm(environmentCat ~ corruptionIndex + scoreEconomy + presscat, 
                   data = indexes,
                   family = binomial())
summary(LogitEPI_b)
## 
## Call:
## glm(formula = environmentCat ~ corruptionIndex + scoreEconomy + 
##     presscat, family = binomial(), data = indexes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9418  -0.8366   0.2137   0.8971   2.4145  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -4.95802    1.98720  -2.495  0.01260 * 
## corruptionIndex  0.06897    0.02366   2.915  0.00355 **
## scoreEconomy     0.31725    0.33555   0.945  0.34441   
## presscat2       -0.05980    0.54435  -0.110  0.91253   
## presscat3        0.43061    0.86086   0.500  0.61693   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.89  on 128  degrees of freedom
## Residual deviance: 130.35  on 124  degrees of freedom
## AIC: 140.35
## 
## Number of Fisher Scoring iterations: 5

Now use the AIC:

if (LogitEPI_a$aic < LogitEPI_b$aic){
    print("model 'a' is better")
}else{print("model 'b' is better")}
## [1] "model 'a' is better"
  • How much is each variable in X influencing Y?

The coefficients computed are not directly interpretable; we need to exponentiate them:

#getting coefficients
coeffsa=coef(summary(LogitEPI_a))
coeffsb=coef(summary(LogitEPI_b))

This is what you get for model a:

data.frame(CoefficientExp=exp(coeffsa[,1]),Significant=coeffsa[,4]<0.05)
##                 CoefficientExp Significant
## (Intercept)        0.005891122        TRUE
## corruptionIndex    1.080021339        TRUE
## scoreEconomy       1.349748294       FALSE

This is what you get for model b:

data.frame(CoefficientExp=exp(coeffsb[,1]),Significant=coeffsb[,4]<0.05)
##                 CoefficientExp Significant
## (Intercept)        0.007026814        TRUE
## corruptionIndex    1.071400513        TRUE
## scoreEconomy       1.373352410       FALSE
## presscat2          0.941954534       FALSE
## presscat3          1.538200355       FALSE

Exponentiating the coefficients is the first step toward an adequate interpretation. From there, you know there is a direct effect (the odds of Y=1 increase) if the exponentiated coefficient is greater than one, and an inverse effect if it is less than one, while closeness to 1 means no effect. As before, we should only trust a coefficient if it is significant. Here, only corruption has a significant direct effect on the odds of Y=1.

Keep in mind that if the predictor is categorical, the increase or decrease in the odds is relative to the reference level. In the case above, presscat2 decreases the odds of Y=1 with respect to presscat1 (the reference).
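Exponentiated coefficients are commonly reported with confidence intervals: if the interval contains 1, the effect on the odds is not distinguishable from no effect. A sketch (confint on a glm profiles the likelihood, so it may take a moment):

# odds ratios with 95% confidence intervals;
# an interval crossing 1 means no clear effect on the odds of Y=1
exp(cbind(OddsRatio = coef(LogitEPI_a), confint(LogitEPI_a)))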

An easier way to interpret those values, as the effect on the probability of occurrence of the event 1 (that the environment index is OK), is by computing the marginal effects:

library(margins)
margins_LogitEPI_a = margins(LogitEPI_a) 
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!
marginalSummary=summary(margins_LogitEPI_a)

# just to see the results better:

data.frame(coeff=round(marginalSummary$AME,3),
           lower=round(marginalSummary$lower,3),
           upper=round(marginalSummary$upper,3),
           sig=marginalSummary$p<0.05)
##                 coeff  lower upper   sig
## corruptionIndex 0.013  0.008 0.018  TRUE
## scoreEconomy    0.051 -0.059 0.161 FALSE

Only doing better in corruption policies helps reach good environmental performance: the probability increases on average by 1.3 percentage points for each one-unit increase in the corruption index.

We can have a basic R plot:

plot(margins_LogitEPI_a)

The last plot clarifies why the scoreEconomy index has no significant effect, even though its average marginal effect is larger than that of the corruption index: its confidence interval crosses zero.

  • Do I have a good logistic model?

The critical way to evaluate our models is by understanding the confusion matrix.

The confusion matrix requires two inputs, the actual values and the values predicted by the model:

actualValues=indexes$environmentCat
predictedValues=predict(LogitEPI_a, type = 'response')

Using the values above:

library(InformationValue)

cm=confusionMatrix(actualValues, predictedValues)

# adding names to cm
row.names(cm)=c('PredictedNegative','PredictedPositive')
colnames(cm)=c('ActualNegative','ActualPositive')

# then:
cm
##                   ActualNegative ActualPositive
## PredictedNegative             46             19
## PredictedPositive             13             51

From the table above, we know that the model correctly predicted 46 zeros and 51 ones, that is, 97 matches out of 129. So we made 32 mistakes, which gives the misclassification error rate:

32/129
## [1] 0.248062
misClassError(actualValues, predictedValues)
## [1] 0.2481
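To see what these functions compute under the hood, you can rebuild the confusion matrix by hand, assuming the default 0.5 cutoff on the predicted probabilities; a minimal sketch in base R:

# classify as 1 when the predicted probability exceeds 0.5
predictedClass = ifelse(predictedValues > 0.5, 1, 0)
table(Predicted = predictedClass, Actual = actualValues)
# the share of mismatches is the misclassification error rate
mean(predictedClass != actualValues)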

Another key concept in evaluating this model is the ROC curve:

plotROC(actualValues, predictedValues)

The better the model predicts, the more this curve expands towards the top left corner. You can also see the value of the area under the curve (AUROC). The most the AUROC can be is 1. The model is worthless if the AUROC is 0.5, when the curve is close to the diagonal (from bottom left to top right).

Notice the labels sensitivity and specificity. These are important concepts in the logistic regression context. They are built upon the following counts:

TruePositive=cm['PredictedPositive','ActualPositive']
TrueNegative=cm['PredictedNegative','ActualNegative']
FalsePositive=cm['PredictedPositive','ActualNegative']
FalseNegative=cm['PredictedNegative','ActualPositive']

Sensitivity is the true positive rate (TPR), that is, the probability of a positive prediction given that the case was positive.

# TruePositive/(TruePositive+FalseNegative)
sensitivity(actualValues, predictedValues)
## [1] 0.7285714

Specificity is the true negative rate (TNR), that is, the probability of a negative prediction given that the case was negative.

# TrueNegative/(TrueNegative + FalsePositive)
specificity(actualValues, predictedValues)
## [1] 0.779661
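Using the four counts extracted above, you can verify both rates by hand:

# sensitivity: correct positives over all actual positives
TruePositive / (TruePositive + FalseNegative)
# specificity: correct negatives over all actual negatives
TrueNegative / (TrueNegative + FalsePositive)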

In general, having high values for both rates makes for a more reliable model. The interpretation of these values needs to be contextualized according to the decision-making situation.