In this session we will learn some basics of modeling, namely two important techniques: linear regression and logistic regression.
This session will work with a collection of indexes:
link='https://raw.githubusercontent.com/EvansDataScience/data/master/allIndexes.csv'
indexes=read.csv(link,stringsAsFactors = F)
names(indexes)
## [1] "Country" "ISO" "corruptionIndex" "scoreEconomy"
## [5] "scorepress" "environment" "presscat" "environmentCat"
The corruptionIndex is the Corruption Perception Index (CPI) produced by Transparency International.
The scoreEconomy is the Economic Freedom Index (EFI) produced by Fraser Institute.
The environment and environmentCat represent the Environmental Performance Index (EPI), produced by Yale University and Columbia University in collaboration with the World Economic Forum. The latter is a dichotomous variable.
The scorepress and presscat are data from the World Press Freedom Index (WPFI), produced by Reporters Without Borders. The latter is an ordinal variable.
We can check descriptives:
summary(indexes)
## Country ISO corruptionIndex scoreEconomy
## Length:129 Length:129 Min. :14.00 Min. :2.920
## Class :character Class :character 1st Qu.:31.00 1st Qu.:6.320
## Mode :character Mode :character Median :40.00 Median :6.920
## Mean :45.42 Mean :6.829
## 3rd Qu.:58.00 3rd Qu.:7.510
## Max. :90.00 Max. :8.810
## scorepress environment presscat environmentCat
## Min. : 8.59 Min. :37.10 Min. :1.000 Min. :0.0000
## 1st Qu.:49.12 1st Qu.:59.25 1st Qu.:2.000 1st Qu.:0.0000
## Median :59.63 Median :70.84 Median :2.000 Median :1.0000
## Mean :58.45 Mean :69.39 Mean :2.093 Mean :0.5426
## 3rd Qu.:66.89 3rd Qu.:81.26 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :80.96 Max. :90.68 Max. :3.000 Max. :1.0000
And data types:
str(indexes,width=70,strict.width='cut')
## 'data.frame': 129 obs. of 8 variables:
## $ Country : chr "New Zealand" "Denmark" "Finland" "Sweden" ..
## $ ISO : chr "NZL" "DNK" "FIN" "SWE" ...
## $ corruptionIndex: int 90 90 89 88 86 85 84 83 82 81 ...
## $ scoreEconomy : num 8.48 7.77 7.75 7.65 8.44 7.67 8.81 7.74 7.9..
## $ scorepress : num 79.5 80.7 81 77.2 77.8 ...
## $ environment : num 88 89.2 90.7 90.4 86.9 ...
## $ presscat : int 3 3 3 3 3 3 1 3 3 3 ...
## $ environmentCat : int 1 1 1 1 1 1 1 1 1 1 ...
Keep in mind that the last two are categorical. presscat is a categorical representation of the press freedom index:
table(indexes$presscat)
##
## 1 2 3
## 23 71 35
And environmentCat is a categorical representation of the environment performance index:
table(indexes$environmentCat)
##
## 0 1
## 59 70
The former is ordinal (the higher the better), while the latter is dichotomous (good or bad performance).
You need a linear regression when your variable of interest is continuous (a measurement). In this case, I will simply use the EPI as my explanandum (Y):
hist(indexes$environment)
The other variables in the set could serve as our explanans (X); that is, we should have enough theoretical or experiential background to hypothesize that the behavior of Y is affected by these X.
You can be comfortable with linear regression when you can verify that each variable in X has a linear relationship with Y:
explanans=names(indexes)[c(3:5)] # corruptionIndex, scoreEconomy, scorepress
for (x in explanans){
  # Pearson correlation test between each X and the environment index
  testResult=cor.test(indexes[,x],indexes$environment)
  if(testResult$p.value<0.05){
    messageYES=paste('environment index is correlated with',x)
    print(messageYES)
  }else{
    messageNO=paste('environment index is NOT correlated with',x)
    print(messageNO)
  }
}
## [1] "environment index is correlated with corruptionIndex"
## [1] "environment index is correlated with scoreEconomy"
## [1] "environment index is correlated with scorepress"
The level of correlation is moderate to low.
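To see the magnitudes behind that claim, we can collect the estimates from the same cor.test calls (a minimal sketch):
# Pearson correlation of each X with the environment index
sapply(explanans, function(x) cor.test(indexes[,x],indexes$environment)$estimate)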
We would also like the correlation between each pair of variables in X not to be high (to avoid multicollinearity):
cor(indexes[explanans])
## corruptionIndex scoreEconomy scorepress
## corruptionIndex 1.0000000 0.6959598 0.5986622
## scoreEconomy 0.6959598 1.0000000 0.4119720
## scorepress 0.5986622 0.4119720 1.0000000
Most correlations are moderate; however, the correlation between corruption and economy is close to being considered high (traditionally, a correlation is high when it reaches +/- 0.7).
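A quick programmatic check of that rule of thumb (just a sketch) flags which pairs reach the threshold; here only the diagonal does:
# TRUE marks pairs at or beyond the +/- 0.7 rule of thumb
abs(cor(indexes[explanans])) >= 0.7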
Given the above exploration, we can go and request a linear regression model:
LinRegEPI = lm(environment ~ corruptionIndex + scoreEconomy + scorepress, data = indexes)
That was it! I just told R that the Y (environment) is to be regressed (~) on all those X variables (corruptionIndex + scoreEconomy + scorepress).
You can see the detailed result using:
summary(LinRegEPI)
##
## Call:
## lm(formula = environment ~ corruptionIndex + scoreEconomy + scorepress,
## data = indexes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.0284 -5.8354 -0.0301 7.7106 31.5477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.26063 9.07126 3.226 0.0016 **
## corruptionIndex 0.40824 0.07909 5.162 9.35e-07 ***
## scoreEconomy 3.48563 1.47797 2.358 0.0199 *
## scorepress -0.03788 0.09164 -0.413 0.6800
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 125 degrees of freedom
## Multiple R-squared: 0.4655, Adjusted R-squared: 0.4527
## F-statistic: 36.28 on 3 and 125 DF, p-value: < 2.2e-16
However, we tend to look for a couple of key indicators:
The sign of the coefficient of an X indicates whether it affects Y in a direct or an inverse way. The absolute value of the coefficient suggests which one impacts more. You should only be confident of these effects if the value in the column Pr(>|t|) is below your significance level (typically 0.05).
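If you prefer a programmatic check (a minimal sketch; coefTable is just a helper name), you can extract the coefficient table and test the p-values directly:
# which coefficients have p-values below 0.05?
coefTable=coef(summary(LinRegEPI))
coefTable[,'Pr(>|t|)']<0.05
The second key indicator is the Adjusted R-squared: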
summary(LinRegEPI)$adj.r.squared # from 0 to 1, the closer to 1 the better.
## [1] 0.4526508
The way to evaluate a regression does not end here. There may be several issues affecting these results:
# normality of residuals?
library(car)
qqPlot(LinRegEPI, main="QQ Plot")
# collinearity?
vif(LinRegEPI) > 4 # problem if some are TRUE
## corruptionIndex scoreEconomy scorepress
## FALSE FALSE FALSE
# Heteroskedastic?
ncvTest(LinRegEPI)$p<0.05
## [1] TRUE
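Since the test flags heteroskedasticity, a common remedy (a sketch, assuming the lmtest and sandwich packages are installed) is to keep the same coefficients but recompute their standard errors with a heteroskedasticity-consistent estimator:
library(lmtest)
library(sandwich)
# same point estimates, but robust (HC3) standard errors
coeftest(LinRegEPI, vcov. = vcovHC(LinRegEPI, type = "HC3"))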
# influential cases?
influencePlot(LinRegEPI,
              id.method="noteworthy",
              id.n=2,
              main="Identifying outliers",
              sub="Circle size is proportional to Cook's Distance")
## StudRes Hat CookD
## 40 -2.5213589 0.06450724 0.105087771
## 64 -0.2481556 0.17029510 0.003183753
## 127 3.3406953 0.18830904 0.598625540
When outliers are prominent, you could ask for robust regression:
library(MASS)
LinRegEPI_R = rlm(environment ~ corruptionIndex + scoreEconomy + scorepress,
                  data = indexes)
summary(LinRegEPI_R)
##
## Call: rlm(formula = environment ~ corruptionIndex + scoreEconomy +
## scorepress, data = indexes)
## Residuals:
## Min 1Q Median 3Q Max
## -26.8916 -6.5786 -0.3437 7.4061 36.6714
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 20.0001 9.2062 2.1725
## corruptionIndex 0.3681 0.0803 4.5858
## scoreEconomy 5.0979 1.5000 3.3987
## scorepress -0.0354 0.0930 -0.3806
##
## Residual standard error: 10.66 on 125 degrees of freedom
You don't see a p-value now, but you can treat as significant the coefficients whose t value, in absolute terms, is greater than two (a common rule of thumb).
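You can apply that rule of thumb programmatically (a minimal sketch; coeffsR is just a helper name):
# flag coefficients whose |t| exceeds 2
coeffsR=coef(summary(LinRegEPI_R))
abs(coeffsR[,'t value'])>2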
Above, you have the corrected coefficients. The same coefficients remained significant, and their directions were kept too. The effect of economy (EFI) has increased, while the intercept (the value of Y when every X is zero) has decreased.
You can also omit the outliers and see the effect on the regression:
outCases=c(40,64,127)
indexes[outCases,]
## Country ISO corruptionIndex scoreEconomy scorepress environment
## 40 Rwanda RWA 54 7.57 34.94 50.34
## 64 China CHN 40 6.40 8.59 65.10
## 127 Venezuela VEN 17 2.92 44.78 76.23
## presscat environmentCat
## 40 1 0
## 64 1 0
## 127 1 1
LinRegEPI_OUT = lm(environment ~ corruptionIndex + scoreEconomy + scorepress, data = indexes[-outCases,])
summary(LinRegEPI_OUT)
##
## Call:
## lm(formula = environment ~ corruptionIndex + scoreEconomy + scorepress,
## data = indexes[-outCases, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.4764 -6.0975 -0.3531 7.5999 26.1431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.86344 9.60119 1.861 0.065216 .
## corruptionIndex 0.38508 0.07815 4.928 2.65e-06 ***
## scoreEconomy 5.85382 1.52971 3.827 0.000206 ***
## scorepress -0.10233 0.09742 -1.050 0.295616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.26 on 122 degrees of freedom
## Multiple R-squared: 0.5305, Adjusted R-squared: 0.5189
## F-statistic: 45.95 on 3 and 122 DF, p-value: < 2.2e-16
Notice the new Adjusted R-squared: it improved from 0.453 to 0.519.
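You can put both values side by side (a quick sketch):
# adjusted R-squared with and without the outliers
c(allCases=summary(LinRegEPI)$adj.r.squared,
  noOutliers=summary(LinRegEPI_OUT)$adj.r.squared)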
Finally in this section, let me show you what to do when one of the X variables is categorical:
# The function 'relevel' cannot accept ordered factors.
# This is why I did not set presscat as ordinal before.
# This variable has 3 levels. Let's choose '1' as the reference.
indexes$presscat=as.factor(indexes$presscat)
indexes$presscat <- relevel(indexes$presscat, ref = 1)
LinRegEPI_catX <- lm(environment ~ corruptionIndex + scoreEconomy + presscat,data = indexes)
summary(LinRegEPI_catX)
##
## Call:
## lm(formula = environment ~ corruptionIndex + scoreEconomy + presscat,
## data = indexes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.1889 -5.7063 -0.0712 7.7190 30.5783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.68526 8.45163 3.394 0.000925 ***
## corruptionIndex 0.38868 0.09104 4.269 3.87e-05 ***
## scoreEconomy 3.54757 1.49346 2.375 0.019062 *
## presscat2 -1.62576 2.63061 -0.618 0.537699
## presscat3 -1.02657 3.79998 -0.270 0.787492
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.93 on 124 degrees of freedom
## Multiple R-squared: 0.4664, Adjusted R-squared: 0.4492
## F-statistic: 27.1 on 4 and 124 DF, p-value: 3.648e-16
The result is telling you that the change from category 1 to category 2 has no significant effect. The same applies to the change from 1 to 3 (the contrast from 2 to 3 is not reported).
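If you needed the 2-to-3 contrast, you could refit with '2' as the reference level (a sketch on a copy of the data, so the models below keep '1' as the reference):
# relevel a copy so the summary reports contrasts against level '2'
tmp=indexes
tmp$presscat=relevel(tmp$presscat, ref='2')
summary(lm(environment ~ corruptionIndex + scoreEconomy + presscat, data=tmp))$coefficients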
You need a logistic regression when your variable of interest is categorical. In this case, I will simply use the categorical version of EPI as my explanandum (Y):
barplot(table(indexes$environmentCat))
Before, the linear regression model computed the coefficients of X so that we could predict the value of Y, both being continuous variables. In this case, the dependent variable Y takes only two values, '0' and '1', so we will use a binary logistic regression. The model will instead help you see which of the X variables increase the 'odds' of getting a '1'.
The way to request this model is very similar to linear regression:
# function 'glm' !
LogitEPI_a = glm(environmentCat ~ corruptionIndex + scoreEconomy,
                 data = indexes,
                 family = binomial())
# see full results:
summary(LogitEPI_a)
##
## Call:
## glm(formula = environmentCat ~ corruptionIndex + scoreEconomy,
## family = binomial(), data = indexes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0633 -0.8287 0.1968 0.8611 2.4499
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.13431 1.94931 -2.634 0.008441 **
## corruptionIndex 0.07698 0.02026 3.800 0.000145 ***
## scoreEconomy 0.29992 0.33465 0.896 0.370134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.89 on 128 degrees of freedom
## Residual deviance: 130.80 on 126 degrees of freedom
## AIC: 136.8
##
## Number of Fisher Scoring iterations: 5
Instead of the Adjusted R-squared, the glm function offers the Akaike Information Criterion (AIC) as a relative measure of fit: when you compare two models, the one with the smaller AIC is the better of the two. Let's make another model:
# remember that presscat is a factor
LogitEPI_b = glm(environmentCat ~ corruptionIndex + scoreEconomy + presscat,
                 data = indexes,
                 family = binomial())
summary(LogitEPI_b)
##
## Call:
## glm(formula = environmentCat ~ corruptionIndex + scoreEconomy +
## presscat, family = binomial(), data = indexes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9418 -0.8366 0.2137 0.8971 2.4145
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.95802 1.98720 -2.495 0.01260 *
## corruptionIndex 0.06897 0.02366 2.915 0.00355 **
## scoreEconomy 0.31725 0.33555 0.945 0.34441
## presscat2 -0.05980 0.54435 -0.110 0.91253
## presscat3 0.43061 0.86086 0.500 0.61693
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.89 on 128 degrees of freedom
## Residual deviance: 130.35 on 124 degrees of freedom
## AIC: 140.35
##
## Number of Fisher Scoring iterations: 5
Now use the AIC:
if (LogitEPI_a$aic < LogitEPI_b$aic){
  print("model 'a' is better")
}else{print("model 'b' is better")}
## [1] "model 'a' is better"
The coefficients computed are not directly interpretable; we need to exponentiate them:
#getting coefficients
coeffsa=coef(summary(LogitEPI_a))
coeffsb=coef(summary(LogitEPI_b))
This is what you get for model a:
data.frame(CoefficientExp=exp(coeffsa[,1]),Significant=coeffsa[,4]<0.05)
## CoefficientExp Significant
## (Intercept) 0.005891122 TRUE
## corruptionIndex 1.080021339 TRUE
## scoreEconomy 1.349748294 FALSE
This is what you get for model b:
data.frame(CoefficientExp=exp(coeffsb[,1]),Significant=coeffsb[,4]<0.05)
## CoefficientExp Significant
## (Intercept) 0.007026814 TRUE
## corruptionIndex 1.071400513 TRUE
## scoreEconomy 1.373352410 FALSE
## presscat2 0.941954534 FALSE
## presscat3 1.538200355 FALSE
Exponentiating the resulting coefficients from the model is the first step toward an adequate interpretation. From there, you know there is a direct effect (it increases the odds of Y=1) if the exponentiated coefficient is greater than one, and an inverse effect if it is less than one, while a value close to 1 means no effect. As before, we should only be confident of a coefficient if it is significant. Hence, only corruption has a significant direct effect on the odds of Y=1.
Keep in mind that if the predictor is categorical, the increase or decrease of the odds depends on which level of the category is the reference. In the case above, presscat2 decreases the odds of Y=1 with respect to presscat1 (though not significantly).
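A common complement (a sketch; confint profiles the likelihood, so it may print a waiting message) is to report the odds ratios together with their 95% confidence intervals:
# exponentiated coefficients and confidence limits for model a
exp(cbind(OR=coef(LogitEPI_a), confint(LogitEPI_a)))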
An easier way to interpret those values, as effects on the probability of occurrence of the event 1 (that the environment index is OK), is by computing the marginal effects:
library(margins)
margins_LogitEPI_a = margins(LogitEPI_a)
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!
marginalSummary=summary(margins_LogitEPI_a)
# just to see the results better:
data.frame(coeff=round(marginalSummary$AME,3),
           lower=round(marginalSummary$lower,3),
           upper=round(marginalSummary$upper,3),
           sig=marginalSummary$p<0.05)
## coeff lower upper sig
## corruptionIndex 0.013 0.008 0.018 TRUE
## scoreEconomy 0.051 -0.059 0.161 FALSE
Only doing better in corruption policies helps reach good environmental outcomes: the probability increases on average by 1.3 percentage points for each one-unit increase in the corruption index.
We can have a basic R plot:
plot(margins_LogitEPI_a)
The last plot clarifies why the scoreEconomy index has no significant effect, even though its average marginal effect is higher than that of the corruption index: its confidence interval crosses zero.
A critical way to evaluate our models is by understanding the **confusion matrix**.
The confusion matrix requires two inputs, the actual values and the values predicted by the model:
actualValues=indexes$environmentCat
predictedValues=predict(LogitEPI_a, type = 'response')
Using the values above:
library(InformationValue)
cm=confusionMatrix(actualValues, predictedValues)
# adding names to cm
row.names(cm)=c('PredictedNegative','PredictedPositive')
colnames(cm)=c('ActualNegative','ActualPositive')
# then:
cm
## ActualNegative ActualPositive
## PredictedNegative 46 19
## PredictedPositive 13 51
From the table above, we know that the model correctly predicted 46 zeros and 51 ones, that is, 97 matches out of 129. So we made 32 mistakes, which is interpreted as the misclassification error rate:
32/129
## [1] 0.248062
misClassError(actualValues, predictedValues)
## [1] 0.2481
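Both functions assume a 0.5 probability cutoff by default. InformationValue also offers optimalCutoff, which searches for the threshold that minimizes the misclassification error (a quick sketch; optCut is just a helper name):
# search for the cutoff that minimizes the misclassification error
optCut=optimalCutoff(actualValues, predictedValues, optimiseFor='misclasserror')
misClassError(actualValues, predictedValues, threshold=optCut)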
Another key concept in evaluating this model is the ROC curve:
plotROC(actualValues, predictedValues)
The better the model predicts, the more this curve will expand towards the top left corner. You can also see the value of the area under the curve (AUROC). The most the AUROC can be is 1. The model is worthless if the AUROC is 0.5, when the curve is close to the diagonal (from bottom left to top right).
Notice the labels sensitivity and specificity. These are important concepts in the logistic regression context. They are built upon the following quantities:
TruePositive=cm['PredictedPositive','ActualPositive']
TrueNegative=cm['PredictedNegative','ActualNegative']
FalsePositive=cm['PredictedPositive','ActualNegative']
FalseNegative=cm['PredictedNegative','ActualPositive']
Sensitivity is the true positive rate (TPR), that is, the probability of a positive prediction given that the case was positive.
# TruePositive/(TruePositive+FalseNegative)
sensitivity(actualValues, predictedValues)
## [1] 0.7285714
Specificity is the true negative rate (TNR), that is, the probability of a negative prediction given that the case was negative.
# TrueNegative/(TrueNegative + FalsePositive)
specificity(actualValues, predictedValues)
## [1] 0.779661
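As a sanity check, both rates can be recomputed from the counts defined above; they match the function output:
# sensitivity: 51/(51+19)
TruePositive/(TruePositive+FalseNegative)
# specificity: 46/(46+13)
TrueNegative/(TrueNegative+FalsePositive)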
In general, having both rates at high values makes for a more reliable model. The interpretation of these values needs to be contextualized according to the decision-making situation.