Scholars and policy makers recognize the need for better and timelier data about contentious collective action, both the peaceful protests that are understood as part of democracy and the violent events that are threats to it. News media provide the only consistent source of information available outside government intelligence agencies and are thus the focus of all scholarly efforts to improve collective action data. Human coding of news sources is time-consuming, so it can never be timely, and it is necessarily limited to a small number of sources, a short time interval, or a limited set of protest "issues" as captured by particular keywords. There have been a number of attempts to address this need through machine coding of electronic versions of news media, but approaches so far remain less than optimal. The goal of this paper is to outline the steps needed to build, test, and validate an open-source system for coding protest events from any electronically available news source using advances from natural language processing and machine learning. Such a system should increase the speed and reduce the labor costs of identifying and coding collective actions in news sources, thus increasing the timeliness of protest data and reducing biases due to excessive reliance on too few news sources. The system will also be open, available for replication, and extendable by future social movement researchers and by social and computational scientists.
Alex Hanna is a PhD candidate in sociology at the University of Wisconsin-Madison. Substantively, they are interested in social movements, media, and the Middle East. Methodologically, they are interested in computational social science, textual analysis, and social network analysis. Alex's work has appeared in both social and computational science venues, including Mobilization, the ANNALS of the American Academy of Political and Social Science, and ICWSM. They also co-founded and contribute regularly to the computational social science blog Bad Hessian, where they write about Python, R, and Twitter.