Data Science for Social Good fellows at work

Social Good Summer Blog, issue five

This summer series will highlight weekly blog posts from this year’s UW Data Science for Social Good Fellows.

“From the classroom to the real-world: using data science to approach inequality” by Haowen Zheng, 2018 Data Science for Social Good Fellow

Data Science for Social Good fellows at work in the studio

I am a student fellow in this year’s Data Science for Social Good (DSSG) program, on the Access to Out-of-School Educational Opportunities team at the eScience Institute, University of Washington. Our team aims to identify gaps in students’ access to out-of-school programs across neighborhoods in the city of Denver in order to inform the work of our stakeholders, the Center on Reinventing Public Education (CRPE) and Reschool Colorado, through their “Blueprint 4 Summer” initiative.

I majored in English in college, then transitioned to the social sciences and became a student of applied quantitative research methods in grad school because of my passion for investigating all kinds of social inequalities. I was therefore thrilled by this opportunity to apply the coding, statistical, and domain-related knowledge I learned in classrooms to a complex real-world social problem. Looking back at the past seven weeks, I have learned many important lessons from this process. From the beginning, I discovered that real-life questions and datasets are rarely as easy to work with as the pedagogical research questions and clean, well-structured datasets we see in school. Running into challenges and dealing with all sorts of complex problems in our research is a daily ritual.

The most common problem with real-life datasets is probably that they are large and messy, and that the information needed for research is spread across various data sources. For my team’s project, data came from four different sources. These include open data, like the basic demographic information pulled from the American Community Survey and the Open Denver Catalog; sensitive, anonymized student data provided by the Denver Public Schools (DPS) administration; and private data provided by Reschool, such as what kinds of summer programs people search for on its website, “Blueprint 4 Summer.” It took us a considerable amount of time to familiarize ourselves with the different data structures and clean them up before anyone could conduct any kind of analysis on them. This also required us to try more innovative ways to store and access data. Thanks to our brilliant data scientists Karen and Jose, we were able to host the data on a remote database, which was not only convenient to use but also protected the sensitive data we had.

Beyond organizing the datasets, there were more difficulties to overcome and sometimes judgement calls to make, because the data are not always satisfactory. For example, we planned to plot the DPS student characteristics of interest on a map of Denver, so we could see where students with certain needs are concentrated. However, we only had family addresses for a portion of the students. It turned out that DPS only collected this information for students who applied for school choice (that is, schools or programs that are alternatives to the publicly provided schools to which students are generally assigned based on where their family lives). Thus, we had to ask ourselves how representative this subset was of our target population. To answer this question, we had to do background work, such as consulting relevant reports for usual practices, comparing across samples, and reporting on the differences after analyses. The data’s limitations also encouraged us to think about ways to improve data collection in order to benefit future research.
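One simple way to compare across samples is to check whether each group makes up a similar share of the address subset as it does of the full student roster. The sketch below is only an illustration of that idea, not the analysis the team ran: the records are made up, and the field names (`lunch`, `has_address`) and the helper functions are hypothetical.

```python
from collections import Counter


def group_shares(records, key):
    """Return the share of records falling in each category of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def share_gaps(full, subset, key):
    """Subset share minus full-roster share per category, in percentage points."""
    full_shares = group_shares(full, key)
    sub_shares = group_shares(subset, key)
    return {
        group: round(100 * (sub_shares.get(group, 0.0) - share), 1)
        for group, share in full_shares.items()
    }


# Made-up toy roster: hypothetical fields, not real DPS data.
students = [
    {"lunch": "free_reduced", "has_address": True},
    {"lunch": "free_reduced", "has_address": False},
    {"lunch": "free_reduced", "has_address": False},
    {"lunch": "free_reduced", "has_address": False},
    {"lunch": "paid", "has_address": True},
    {"lunch": "paid", "has_address": True},
    {"lunch": "paid", "has_address": True},
    {"lunch": "paid", "has_address": False},
]

# The subset we could actually map: students with a known address.
with_address = [s for s in students if s["has_address"]]

gaps = share_gaps(students, with_address, "lunch")
print(gaps)
```

A large positive or negative gap for a group suggests that the mapped subset over- or under-represents it, which is worth reporting alongside any map built from that subset.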

Other differences between what I have learned in school and this real-life project quickly emerged when we moved on to presenting and analyzing data. In the classroom settings that I am used to, the audience of traditional social science research is mostly the academic community, which emphasizes and scrutinizes the robustness of advanced statistical methods. For this project, however, pursuing cutting-edge statistics is not necessarily the most useful practice for our stakeholders, who strive to understand patterns of access based on neighborhood and student characteristics. Our goal here is to reduce the complexity of the data to clear patterns that our stakeholders can easily interpret.

Therefore, descriptive analytical strategies, along with simple and comprehensible visualization tools, are probably the most helpful. For example, one set of data could be plotted against other measures of interest to show possible patterns of covariation. It is also useful to create an index that measures ease of access based on the number of out-of-school programs and students’ distance to them. This is where data science comes into play: tools like R Shiny dashboards make it possible to put together interactive, data-driven graphics, and a few lines of R code can pull the distance data we need through Google API calls and quickly calculate the index according to a given function. Compared with traditional software that focuses more narrowly on statistical inference, like Stata and SPSS, relatively newer data science tools like R and Python are more flexible in functionality.
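A minimal sketch of what such an index could look like follows, written in Python rather than R. It is an illustration only, not the function the team used. Two assumptions are swapped in: straight-line (haversine) distance stands in for the distances pulled from Google API calls, and the index is a simple sum in which each program counts for less the farther away it is. The function names, the `decay_km` parameter, and the coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def access_index(home, programs, decay_km=2.0):
    """Sum of distance-discounted programs: more and closer programs score higher.

    Each program contributes exp(-distance / decay_km), so a program at the
    student's doorstep counts as 1 and distant programs count for close to 0.
    The decay scale is an assumed tuning parameter.
    """
    return sum(
        math.exp(-haversine_km(home[0], home[1], p[0], p[1]) / decay_km)
        for p in programs
    )


# Made-up coordinates roughly in the Denver area, for illustration only.
programs = [(39.74, -104.99), (39.75, -104.98), (39.73, -105.00)]
near_home = (39.74, -104.99)
far_home = (39.65, -104.85)

print(access_index(near_home, programs))
print(access_index(far_home, programs))
```

Because the index combines both the number of programs and the distance to each one, it can be averaged by neighborhood and plotted on a map or compared against student characteristics.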

Constantly taking into account the concerns of our stakeholders is a unique characteristic of working on a concrete social issue like this project. Besides adjusting our goals based on those considerations, I enjoy getting connected with people who work in the field and learning how they understand and approach inequality. Two weeks ago, we visited CRPE, one of our important stakeholders and a non-profit think tank that focuses on educational policies. During the talk with their research team, I learned details about the US education system, the nature of out-of-school opportunities, parents’ concerns about their children’s educational opportunities, and the importance of constructing an equitable environment for kids from a young age.

These lessons helped me think more deeply about many important questions, like what factors we should incorporate into the analysis, how mechanisms work to cause disparities in the distribution of resources, and how inequality impacts students’ outcomes at a later stage, especially for students from low-income and minority neighborhoods. These discussions with our passionate stakeholders were inspirational and motivated us to continue our pursuit of social good via data science.

Overall, despite the many challenges we faced while applying data science to achieve social good, all the experiences, including the contemplations, debates, and re-dos, have been an integral part of the journey. I value this opportunity immensely because it has given me the confidence to put my best research foot forward as I move from the safety of the classroom to the complexity of “real life.”


“Looking towards data science for better educational outcomes” by Sreekanth Krishnaiah, 2018 Data Science for Social Good Fellow

The term ‘big data’ is used so often in public discourse that it needs no introduction. The beginning of the 21st century saw the advent of big data, which has not only revolutionized the tech industry but also transformed the way we make choices, live, think, and work. That our professional and personal lives have become so entwined with the prospects of big data may be a cause for concern, but this post is not about that. This post is about how we can leverage the large amounts of data being generated in K-12 classrooms and use data analytics and innovative data science techniques to draw meaningful insights to improve educational outcomes.

The massive amounts of data generated and collected in classrooms offer huge potential for data science research that yields valuable insights. This kind of analysis can serve as a feedback loop to educators in the classroom and help them make data-driven decisions. Classroom data doesn’t just mean traditional academic test scores and subjective behavioural outcomes. It is imperative for teachers to understand measures like students’ engagement level, attention span at different times of the day, use of classroom resources, language and vocabulary use, social habits, and academic performance to ensure better educational and behavioural outcomes. While teachers bank on classroom observations and test scores to determine these measures, data science can help provide additional evidence to back their observations.

Imagine a scenario where we collect information about all the clicks students make while using learning software in the classroom and the texts they write throughout the day, and use devices in the classroom to track speech and document facial expressions. Data science could be used to search the collected data for patterns in student behaviour through a myriad of techniques like functional image analysis, text analysis, statistical inference, machine learning, or, in most cases, just simpler techniques. At a time when researchers are contemplating how to go about measuring student outcomes, we need to empower teachers in the classroom with this level of student information to help them assess not only students’ academic standing but also their mental well-being, empathy, and tolerance levels. This data-driven assessment, along with traditional classroom observations, can be a more robust way to evaluate student behaviour.

Big data can draw attention to underlying patterns that may otherwise have been difficult to discern. With adequate domain expertise, such patterns can drive meaningful, data-driven decisions. Take, for instance, the ‘Out-of-School Resources’ project we are currently working on at the eScience Institute as part of its Data Science for Social Good program over the summer. Using data from the city of Denver, Colorado, we are trying to analyze the distribution of out-of-school resources, which include libraries, parks, playgrounds, athletic fields, rec centers, and summer programs hosted on the BluePrint4summer website by ReSchool. Our work involves studying the demographics of different neighborhoods and understanding the demand for these resources through parents’ internet search data.

The goal of the project is to make recommendations to our stakeholders, the Center on Reinventing Public Education and ReSchool, to ensure that students in the city have improved access to out-of-school resources. The interactive web application we are currently working on will help ReSchool look at the current distribution of summer programs through the lens of equity. This will inform them about the steps they need to take to ensure that every student enjoys equal opportunities to attend programs of their choice irrespective of gender, race, ethnicity, disability status, parents’ economic status, and location. Many non-profit organizations working in Denver have expressed interest in the work we are doing and are looking forward to our final results. My work in the program has helped me further understand the scope and potential of data science in advancing social good.

Data science can also revolutionize the way we conduct educational policy research. Traditionally, research in education has leaned on quantitative and qualitative research methods. With big data, we can develop more sophisticated models and conduct research with much greater breadth, depth, and scale. In the coming years, data science will help achieve data-driven change in education. Calls for educational reform have increased over the last few years, but with very little understanding of how the education system will be affected by such reforms. By coupling data science with the extensive research that already exists within education, we can contribute new insights into the impacts these reforms have on students. We need a foundation on which to base reform activities. In the technology industry, business intelligence serves this role; in education, data science and advanced analytics can serve it similarly.

References:
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, by Chris Anderson
Big Data Comes to School: Implications for Learning, Assessment, and Research
How data and analytics can improve education?
The Future of Big Data and Analytics in K-12 Education