The results below are based on the search criteria 'content analysis'
Total number of records returned: 7
Extracting Systematic Social Science Meaning from Text
automated content analysis
2008 U.S. Presidential election
We develop two methods of automated content analysis that give approximately unbiased estimates of quantities of theoretical interest to social scientists. With a small sample of documents hand coded into investigator-chosen categories, our methods can give accurate estimates of the proportion of text documents in each category in a larger population. Existing methods successful at maximizing the percent of documents correctly classified allow for the possibility of substantial estimation bias in the category proportions of interest. Our first approach corrects this bias for any existing classifier, with no additional assumptions. Our second method estimates the proportions without the intermediate step of individual document classification, and thereby greatly reduces the required assumptions. For both methods, we also correct statistically, apparently for the first time, for the far less-than-perfect levels of inter-coder reliability that typically characterize human attempts to classify documents, an approach that will normally outperform even population hand coding when that is feasible. We illustrate these methods by tracking the daily opinions of millions of people about candidates for the 2008 presidential nominations in online blogs, data we introduce and make available with this article, and through evaluations in available corpora from other areas, including movie reviews, university web sites, and Enron emails. We also offer easy-to-use software that implements all methods described.
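The bias correction described for the first method can be illustrated in a much-simplified two-category form using the standard sensitivity/specificity identity: the share of documents a classifier labels positive is a linear function of the true proportion, which can be inverted once the error rates are estimated from the hand-coded sample. The function name and numbers below are hypothetical; the paper's actual estimators handle many categories and full misclassification matrices.

```python
# Toy sketch of misclassification-corrected proportion estimation
# (two categories; names and numbers are hypothetical).

def correct_proportions(p_classified, sens, spec):
    """Invert P(classified positive) = sens*p + (1 - spec)*(1 - p)
    to recover the true proportion p of positive documents."""
    return (p_classified - (1 - spec)) / (sens - (1 - spec))

# Suppose hand coding shows the classifier finds 80% of true positives
# (sensitivity) and correctly rejects 90% of negatives (specificity),
# and it labels 45% of the full corpus as positive:
p_true = correct_proportions(0.45, sens=0.80, spec=0.90)
```

Note that the corrected estimate (0.5) differs substantially from the raw classified share (0.45) even for a fairly accurate classifier, which is the estimation bias the abstract describes.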
item response theory
Wordscores is a widely used procedure for inferring policy positions, or scores, for new documents on the basis of scores for words derived from documents with known scores. It is computationally straightforward and requires no distributional assumptions, but it has unresolved practical and theoretical problems: in applications, estimated document scores are on the wrong scale, and because Wordscores does not specify a statistical model, it is unclear what assumptions the method makes about political text or how to tell whether they fit particular applications. The first part of the paper demonstrates that badly scaled document score estimates reflect deeper problems with the method. The second part shows how to understand Wordscores as an approximation to correspondence analysis, which itself approximates a statistical ideal point model for words. Problems with the method are identified with the conditions under which these layers of approximation fail to ensure consistent and unbiased estimation of the parameters of the ideal point model.
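For readers unfamiliar with the procedure under discussion, the basic Wordscores scoring step, as commonly described and simplified here (tokenization and the disputed rescaling step are omitted), looks roughly like this: each word receives a score that is an average of the reference documents' positions, weighted by how concentrated the word is in each reference document, and a new ("virgin") document is scored by the frequency-weighted average of its scored words.

```python
from collections import Counter

def rel_freqs(tokens):
    """Relative frequency of each word within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def word_scores(ref_texts, ref_positions):
    """Score each word as a weighted average of the known positions
    of the reference texts in which it appears."""
    freqs = [rel_freqs(t) for t in ref_texts]
    vocab = set().union(*freqs)
    scores = {}
    for w in vocab:
        f = [fr.get(w, 0.0) for fr in freqs]
        tot = sum(f)
        scores[w] = sum(fi / tot * a for fi, a in zip(f, ref_positions))
    return scores

def score_virgin(tokens, scores):
    """Score a new document as the frequency-weighted mean of its
    scored words (unscored words are dropped)."""
    fv = rel_freqs([t for t in tokens if t in scores])
    return sum(fv[w] * scores[w] for w in fv)
```

Because virgin texts mix words from both poles, their raw scores cluster near the middle of the scale, which is the "wrong scale" problem the abstract refers to.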
Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology
Many people attempt to discover useful information by reading large quantities of unstructured text, but because of known human limitations even experts are ill-suited to succeed at this task. This difficulty has inspired the creation of numerous automated cluster analysis methods to aid discovery. We address two problems that plague this literature. First, the optimal use of any one of these methods requires that it be applied only to a specific substantive area, but the best area for each method is rarely discussed and usually unknowable ex ante. We tackle this problem with mathematical, statistical, and visualization tools that define a search space built from the solutions to all previously proposed cluster analysis methods (and any qualitative approaches one has time to include) and enable a user to explore it and quickly identify useful information. Second, in part because of the nature of unsupervised learning problems, cluster analysis methods are not routinely evaluated in ways that make them vulnerable to being proven suboptimal or less than useful in specific data types. We therefore propose new experimental designs for evaluating these methods. With such evaluation designs, we demonstrate that our computer-assisted approach facilitates more efficient and insightful discovery of useful information than either expert human coders using qualitative or quantitative approaches or existing automated methods. We (will) make available an easy-to-use software package that implements all our suggestions.
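One way to picture the "search space built from the solutions" to many clustering methods is as a set of clusterings compared by a pairwise agreement measure. The sketch below uses the simple Rand index and invented clustering labels; it illustrates the general idea of placing methods' solutions in a common space, not the authors' actual geometry or visualization tools.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree:
    grouped together in both, or separated in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Three hypothetical clusterings of six documents, from different methods:
solutions = {
    "kmeans":  [0, 0, 1, 1, 2, 2],
    "hier":    [0, 0, 1, 1, 1, 2],
    "mixture": [1, 1, 0, 0, 2, 2],  # same partition as kmeans, relabeled
}

# Pairwise similarities define the geometry of a space of clusterings
# that a user could then explore:
sims = {
    (a, b): rand_index(solutions[a], solutions[b])
    for a, b in combinations(solutions, 2)
}
```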
Automated Coding of International Event Data Using Sparse Parsing Techniques
Schrodt, Philip A.
natural language processing
"Event data" record the interactions of political actors reported in sources such as newspapers and news services; this type of data is widely used in research in international relations. Over the past ten years, there has been a shift from coding event data by humans -- typically university students -- to using computerized coding. The automated methods are dramatically faster, enabling data sets to be coded in real time, and provide far greater transparency and consistency than human coding. This paper reviews the experience of the Kansas Event Data System (KEDS) project in developing automated coding using "sparse parsing" machine coding methods, discusses a number of design decisions that were made in creating the program, and assesses features that would improve the effectiveness of these programs.
Do Voters Learn from Presidential Election Campaigns?
Alvarez, R. Michael
random effects panel models
presidential election campaigns
We present a model of voter campaign learning that is based on Bayesian learning models. The model assumes that voters are imperfectly informed and that they incorporate new information into their existing perceptions of candidate issue positions in a systematic manner. Additional information made available to voters about candidate issue positions during the course of a political campaign will lead voters to hold more precise perceptions of the issue positions of the candidates involved. We use panel survey data from the 1976 and 1980 presidential elections, combined with content analyses of media coverage during these same elections. Our primary analysis is conducted using random effects panel models. We find that during each of these campaigns many voters became better informed about the positions of candidates on many issues, and that these changes in voter information are directly related to the information flow during each presidential campaign.
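The Bayesian learning intuition, that each piece of campaign information makes a voter's perception of a candidate's position both more accurate and more precise, can be sketched with standard normal-normal updating. All numbers below are hypothetical.

```python
def bayes_update(prior_mean, prior_prec, signal, signal_prec):
    """Normal-normal Bayesian updating: the posterior mean is a
    precision-weighted average of the prior and the new signal,
    and precision (certainty) strictly increases."""
    post_prec = prior_prec + signal_prec
    post_mean = (prior_prec * prior_mean + signal_prec * signal) / post_prec
    return post_mean, post_prec

# A voter starts with a vague perception (mean 0.0, precision 0.5) and
# receives two campaign messages placing the candidate near 1.0:
mean, prec = 0.0, 0.5
for msg in (1.0, 0.9):
    mean, prec = bayes_update(mean, prec, msg, signal_prec=1.0)
```

After both messages the voter's perceived position has moved most of the way toward the signals, and precision has risen from 0.5 to 2.5, which is the "more precise perceptions" the abstract predicts.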
A Robust Transformation Procedure for Interpreting Political Texts
In a recent article in the American Political Science Review, Laver, Benoit, and Garry propose a new method for conducting content analysis. Their Wordscores approach, by automating text coding procedures, represents a fundamental advance in content analysis and will potentially have a large long-term impact on research across the discipline. In this research note, we contend that the usefulness of this procedure is limited by two significant shortcomings of the transformation procedure the authors use to permit substantive interpretation of results. First, it distorts the metric on which content scores are placed, hindering the ability of scholars to make meaningful comparisons across texts. Second, it is very sensitive to the particular texts that are scored, opening up the possibility that researchers may generate, inadvertently or not, results that depend on the texts they choose to include in their analyses. We propose (and have written program code to implement) a transformation procedure that solves these problems.
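Without presuming the authors' exact procedure, the general idea of a metric-preserving transformation can be sketched as a linear map anchored so that the reference texts, when rescored as if they were virgin texts, return to their originally assigned positions. Everything below (function name, scores, anchors) is an invented illustration of that idea.

```python
def anchored_rescale(raw_scores, raw_ref, true_ref):
    """Linearly rescale raw virgin-text scores so that the two
    reference texts' own raw (rescored) values map back to their
    originally assigned positions."""
    (r1, r2), (a1, a2) = raw_ref, true_ref
    slope = (a2 - a1) / (r2 - r1)
    return [a1 + slope * (s - r1) for s in raw_scores]

# Hypothetical raw scores cluster near the corpus mean; anchoring on
# the reference texts stretches them back onto the original (-1, +1)
# reference metric:
rescaled = anchored_rescale([0.45, 0.50, 0.55],
                            raw_ref=(0.4, 0.6),
                            true_ref=(-1.0, 1.0))
```

Because the map depends only on the reference texts, adding or removing other virgin texts cannot change any document's transformed score, which addresses the sensitivity problem described above.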
An Automated Method of Topic-Coding Legislative Speech Over Time with Application to the 105th-108th U.S. Senate
We describe a method for statistical learning from speech documents that we apply to the Congressional Record in order to gain new insight into the dynamics of the political agenda. Prior efforts to evaluate the attention of elected representatives across topic areas have largely been expensive manual coding exercises, and they are generally limited in one or more respects: short time periods, high levels of temporal aggregation, or coarse topical categories. Conversely, the Congressional Record has scarcely been used for such analyses, largely because it contains too much information to absorb manually. We describe here a method for inferring, through the patterns of word choice in each speech and the dynamics of word choice patterns across time, (a) what the topics of speeches are, and (b) the probability that attention will be paid to any given topic or set of topics over time. We use the model to examine the agenda in the United States Senate from 1997 to 2004, based on a new database of over 70,000 speech documents containing over 70 million words. We estimate the model for 42 topics and provide evidence that the recovered speech topics are both distinctive and inter-related in substantively meaningful ways. We demonstrate further that the dynamics estimated by the model give us leverage on important questions about the evolution of the political agenda.
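As a toy illustration of inferring a speech's topic from word choice (a static unigram model, not the authors' full dynamic model), the per-speech topic posterior can be computed as follows. The topics, word probabilities, and priors are invented for the example.

```python
import math

def topic_posterior(tokens, topics, priors):
    """Posterior probability of each topic for one speech under a
    simple unigram bag-of-words model, computed in log space."""
    logp = {}
    for name, word_probs in topics.items():
        lp = math.log(priors[name])
        for w in tokens:
            lp += math.log(word_probs.get(w, 1e-6))  # tiny floor for unseen words
        logp[name] = lp
    z = max(logp.values())  # subtract max for numerical stability
    norm = sum(math.exp(lp - z) for lp in logp.values())
    return {name: math.exp(lp - z) / norm for name, lp in logp.items()}

# Two invented topics with per-topic word probabilities:
topics = {
    "defense": {"troops": 0.05, "war": 0.04, "budget": 0.01},
    "health":  {"medicare": 0.05, "budget": 0.02, "care": 0.04},
}
post = topic_posterior(["troops", "war", "budget"],
                       topics, priors={"defense": 0.5, "health": 0.5})
```

Aggregating such per-speech posteriors by day or session is one simple way to trace attention to a topic over time, which is the quantity of interest in the abstract.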