Please join us for a UW Data Science Seminar on Tuesday, October 10th from 4:30 to 5:20 p.m. PT. The seminar will feature Nathan TeBlunthuis, a Postdoctoral Research Fellow at the Information School at the University of Michigan.
This event will take place in the Physics/Astronomy Auditorium 102 (PAA A102) on the University of Washington campus.
Title: “Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!”
Abstract: Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video. They have become widely popular measurement devices in computational social science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results when input to downstream statistical analyses—unless such analyses account for these errors. As we show in a systematic literature review of SML applications, scholars largely ignore misclassification bias.
In principle, existing statistical methods can use “gold standard” validation data, such as that created by human annotators, to correct misclassification bias. We introduce and test such methods, including a new method that we design and implement in the R package “misclassification models”, using Monte Carlo simulations designed to reveal each method’s limitations. We also release these simulations. Based on our results, we recommend our new error correction method because it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or those making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
Bio: Dr. Nathan TeBlunthuis is a Postdoctoral Research Fellow in the Information School at the University of Michigan. He was previously at the Department of Communication Studies at Northwestern University and completed his PhD in Communication at the University of Washington. He is a computational social scientist who studies how collective action is organized in projects like Wikipedia, online communities like Reddit, and social movements. An important part of his work is improving the measurement of meaningful communication behaviors from unstructured data such as text and multimedia.
The UW Data Science Seminar is an annual lecture series at the University of Washington that hosts scholars working across applied areas of data science, such as the sciences, engineering, humanities and arts, along with methodological areas in data science, such as computer science, applied math and statistics. Our presenters come from all domain fields and include occasional external speakers from regional partners, governmental agencies and industry.
The 2022-2023 seminars will be held in person, and are free and open to the public.