Part 1: Organizing a reproducible environment

Toolbox

We start by preparing the environment we will work in. This environment will have the following characteristics:

It will make our work reproducible.
It will facilitate auditability.
It will ease knowledge management and innovation.
It will automate several steps.

Let me focus on the steps that can be automated, which are described in the figure below:

As the figure shows, several steps can be automated. In any analytics project, good use of tools to automate these steps is welcome, mainly because it saves a lot of time that we can devote to ensuring the quality of our analysis. To achieve this, however, we need to manage several different tools.

This session is devoted to making sure you have the tools that will be used throughout the course. Please follow the instructions below to install them:

Zotero, which will be used to facilitate the citation process. You should first create an account here, and then install the client on your computer using this link. If you are using Firefox, Chrome or Safari, I recommend installing the Zotero Connector.
GitHub, which will allow you to have a repo in the cloud synchronized with a local repo on your hard drive. To use GitHub, please sign up here, and then download and install the client on your computer from this link.
Anaconda, a suite of programs to run Python. Anaconda will ease the use of Python, particularly the Jupyter environment. Please install it from this link.
R and RStudio. To use R, we need both the R platform and RStudio. Please install R first from here. Then, go to this link to get RStudio.
LaTeX, which we will use to produce journal-like papers. This link offers you LaTeX for every operating system. If you are using Windows, I recommend you download MiKTeX. In that case, while running the Basic Installer, do not forget the selection recommended in the image below:

Finally, get an account on Overleaf 2 here. This is not strictly needed, but you may find it useful if you want to share your work with a LaTeX user.
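
Once everything is installed, a quick check from an R session can tell you which command-line tools your system can find. The following is only a minimal sketch: the command names (git, python, jupyter, Rscript, pdflatex) are assumptions about how each tool is exposed on your machine, and desktop applications such as Zotero, RStudio or the GitHub client will not show up this way.

```r
# Command names below are assumptions; adjust them to your installation.
tools <- c(git = "git", python = "python", jupyter = "jupyter",
           R = "Rscript", latex = "pdflatex")

# Sys.which() returns the full path of each command, or "" when it is
# not found on the PATH.
found <- Sys.which(tools)

status <- data.frame(tool = names(tools),
                     command = unname(tools),
                     installed = unname(nzchar(found)))
print(status)
```

A FALSE in the installed column does not necessarily mean the tool is missing; it may simply not be on your PATH (this is common with Anaconda on Windows).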

Basic Steps for reproducibility

We will focus on the production of a simple paper (the simplest ever). Let me describe the steps to follow:

Create a GitHub repository. Go to your GitHub account and create a repo there. The repo will be a free one. Just follow the instructions.
Clone the repo. Once your repo is created, clone it. Make sure the GitHub client is already installed. On the repo page, find the Clone or download button. When you press it, select the option Open in Desktop. You might get a message asking you to confirm that you want to clone the repo in the client (or app). After you confirm the operation, the desktop client will ask you where you want to save the local copy of the cloud repo.
Upload a file to the repo. To send a file from your computer to the cloud repo, you first need to put it in the folder of the cloned repo. Download this file and put it there. Then, go to your client and check that these changes are recognized. You should then commit and push.
Get the link of the data. The data can be accessed now (as long as you have an internet connection). Go to your repo in the cloud, and click on the file name. This will take you to the file contents. Depending on the file type, you may or may not be able to see the values. This is a CSV file, so you will see the contents. Now, get the link to the data by right-clicking on the Download or Raw option (whichever is available) and copying the link address.
Create the following document in R. Go to RStudio and create an R script. The code will be:

# collecting
fileLink="https://github.com/EvansDataScience/data/raw/master/censoredworld.csv"
dataidx=read.csv(fileLink)

# Describing a categorical variable:
tableONI=table(dataidx$ONI)
tableONI

# Using a plot for the categorical:
barplot(tableONI)


# Describing the numerical variables
summary(dataidx[,c(3,4)])

# Using a plot for the numerical:
boxplot(dataidx[,c(3,4)])

## Describing bivariate relationships

# * Numerical and categorical:

boxplot(dataidx$FH~dataidx$Region)

# Boxplots were introduced by Tukey (Tukey, John W. (1977). Exploratory Data Analysis. Addison-Wesley.)

# * Numerical and Numerical
plot(dataidx$FH~dataidx$RWB)

# The scatter plot is thought to have been invented by John Frederick W. Herschel, according to this link: https://qz.com/1235712/the-origins-of-the-scatter-plot-data-visualizations-greatest-invention/
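
The link used at the top of the script follows a regular pattern (account, repo, the word raw, branch, file name), so it can also be assembled in R. The sketch below is only an illustration: raw_github_url and read_csv_cached are hypothetical helper names, not part of the course materials, and the "master" branch is taken from the link above. The second helper keeps a local copy of the file so the script can still run when you have no internet connection.

```r
# Hypothetical helper: assemble a GitHub "raw" link from its parts.
raw_github_url <- function(user, repo, path, branch = "master") {
  paste("https://github.com", user, repo, "raw", branch, path, sep = "/")
}

# Hypothetical helper: download the CSV only if there is no local copy yet,
# then read the local copy.
read_csv_cached <- function(url, local_file) {
  if (!file.exists(local_file)) {
    download.file(url, destfile = local_file, mode = "wb")
  }
  read.csv(local_file)
}

# Rebuilds the same link used in the script above.
fileLink <- raw_github_url("EvansDataScience", "data", "censoredworld.csv")
fileLink

# Uncomment to download (first run) or reuse the local copy (later runs):
# dataidx <- read_csv_cached(fileLink, "censoredworld.csv")
```

For your own repo, replace the account, repo and file names with yours; if your default branch is not called master, pass it through the branch argument.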

Transform the R script. I have a series of templates, and we will use each of them. For that, you need to go to this repo and clone it into your GitHub account.


Course: Visual Analytics for Policy and Management

Prof. José Manuel Magallanes, PhD
