“Documenting data science and documentation in data science: an ethnographic exploration”

Thursday, Jan. 24, 4:30 to 5:20 p.m., Bagley Hall, room 154

Stuart Geiger, ethnographer of computation and computational ethnographer; research staff at the Berkeley Institute for Data Science

[Watch a recording of this seminar on YouTube.]


The collection, curation, and analysis of data have always been as social as they are technical. Even in the most automated, data-driven systems, there are always humans who work behind the scenes, from the software developers and hardware operators who maintain invisible infrastructures to those who collect, label, annotate, clean, validate, merge, and manage data. These activities tend to get far less attention than the headline-grabbing technologies of machine learning and artificial intelligence, but it is crucial to always keep them in view. In this talk, I specifically discuss the central yet often passed-over role of documentation in data science, based on several recent and ongoing studies and projects about the importance of documentation in software packages, datasets, analysis code, research protocols, and research teams. Documentation is often seen as an unglamorous, low-status chore to be left for later, but it is a crucial form of communication, collaboration, and collective sensemaking. Yet documentation can be difficult precisely because of the complex skills involved in writing it well, as well as the many different, sometimes even contradictory, roles it plays for various audiences and stakeholders. In examining the work of documentation as communication, we gain a broader view into many pressing issues in data science, including those around open science, reproducibility, and data ethics.


Stuart Geiger is a staff ethnographer at the UC-Berkeley Institute for Data Science, where he studies the people, platforms, infrastructures, and institutions that support the production of knowledge at scale. His Ph.D. work at the UC-Berkeley School of Information studied the governance and moderation of Wikipedia and Twitter, focusing on the social roles of software developers and data scientists. He is a methodological and disciplinary pluralist, integrating approaches from across the humanities, the interpretivist and quantitative social sciences, and computer, information, and data science. His work has been published in venues including CSCW, CHI, ICWSM, American Behavioral Scientist, Information, Communication & Society, and Big Data & Society. Stuart is also a founding member of UC-Berkeley’s cross-departmental working groups on Data Science Studies, Algorithms in Culture, and Algorithmic Fairness & Opacity.

This seminar is presented as part of the Distinguished Young Academic Data Scientists (DYADS) speaker series. The DYADS initiative promotes networking and speaking opportunities for outstanding postdoctoral scholars from our partner institutions. Through speaking engagements in our high-profile seminar series and meetings with faculty colleagues, the scholars broaden their academic networks and their visibility on other campuses.