The first 2020 U.S. Presidential debate was widely covered as one of the worst in U.S. history, with analysts like CNN’s Jake Tapper left grasping for the words to describe it. Was it truly “a hot mess, inside a dumpster fire, inside a trainwreck”? Using textual analyses, I attempt to quantitatively compare this debate with every other in U.S. history.

My results are built on a dataset of every general election Presidential and Vice Presidential debate. I scraped this data from the Commission on Presidential Debates website and rev.com. The final dataset includes over 13,500 questions and responses organized by debate and speaker, from the first Kennedy-Nixon debate in 1960 to the Harris-Pence VP debate in 2020. You can find the data here.

Simple Sentiment Analysis

One way to analyze these data is to look at how positive or negative the candidates are in their responses — the sentiment of their arguments. In its simplest form, a sentiment analysis counts the number of positive words (“excellent”, “brilliant”, “win”) and negative words (“worst”, “horrible”, “fraud”) used in a text. A text with more positive than negative words will have a more positive average sentiment (and vice versa).

Using the AFINN dictionary, which assigns scores of -5 to 5 to a corpus of nearly 2,500 words, I calculate the average sentiment of every general election debater. The plot below shows the average sentiment of presidential candidates from the two major parties.
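In code, this amounts to a dictionary join and a group-wise average. Below is a minimal sketch in R with the tidytext package, assuming a data frame named debates with illustrative speaker, year, and text columns:

```r
# A sketch of the AFINN-based scoring, assuming a data frame `debates`
# with (illustrative) columns speaker, year, and text.
library(dplyr)
library(tidytext)

afinn <- get_sentiments("afinn")   # columns: word, value (-5 to 5)

debater_sentiment <- debates %>%
  unnest_tokens(word, text) %>%          # one row per word spoken
  inner_join(afinn, by = "word") %>%     # keep only dictionary words
  group_by(speaker, year) %>%
  summarise(avg_sentiment = mean(value), .groups = "drop") %>%
  arrange(avg_sentiment)
```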

Average Sentiment of U.S. Presidential Debaters

At this point, it is important to note that sentiment analysis is an imperfect science, and that any results are dependent on certain assumptions about language. Using the same dictionary of words over time, for example, might be problematic if a word is more emotionally charged today than in the past. Furthermore, any textual analysis might miss important information about factors like the debaters’ tone or non-verbal actions.

Even with these caveats in mind, Trump clearly stands out as a historically negative debater. The average sentiment of his performances is more than twice as negative as even the closest comparisons (Reagan in 1980 and 1984, and Biden in 2020). By contrast, the most positive debate performances were delivered by Gerald Ford, each of the George Bushes in their first terms, and both John F. Kennedy and Richard Nixon in the 1960 debates. In general, sentiment appears to have declined over time, from the first two election years with presidential debates, 1960 and 1976 (there were no presidential debates in the 1964, 1968, and 1972 elections), to the most recent two, 2016 and 2020.

I can also break out sentiment by each individual debate performance. Here, I include VP debates and the three non-major-party candidates who made the debate stage. The graph shows that the four most negative debate performances in U.S. history all belong to Trump.

Average Sentiment by U.S. General Election Debate Performance

Why are Trump’s debates rated as so negative? The simple dictionary-based method allows me to easily look “under the hood” and find out. The tables below compare Trump’s most emotionally charged words with those of John F. Kennedy, one of the more positive debaters. The negative words come first, arranged by sentiment value and then by usage. The positive words follow, arranged similarly so that the most positive, most used words are at the bottom. The rightmost column shows how many times Trump/Kennedy used each word as a percent of all their positive and negative words.

Trump Most Used Negative & Positive Words

| Word | Sentiment Value | Share of Neg & Pos Words (%) |
|---|---|---|
| bastards | -5 | 0.08 |
| fraud | -4 | 0.50 |
| hell | -4 | 0.34 |
| catastrophic | -4 | 0.08 |
| fraudulent | -4 | 0.08 |
| bad | -3 | 4.79 |
| lost | -3 | 1.51 |
| worse | -3 | 1.09 |
| worst | -3 | 1.01 |
| horrible | -3 | 0.92 |
| fantastic | 4 | 0.08 |
| funny | 4 | 0.08 |
| miracle | 4 | 0.08 |
| wins | 4 | 0.17 |
| amazing | 4 | 0.25 |
| winner | 4 | 0.25 |
| winning | 4 | 0.25 |
| win | 4 | 0.50 |
| wonderful | 4 | 0.67 |
| outstanding | 5 | 0.08 |

Kennedy Most Used Negative & Positive Words

| Word | Sentiment Value | Share of Neg & Pos Words (%) |
|---|---|---|
| lost | -3 | 0.50 |
| crisis | -3 | 0.38 |
| worst | -3 | 0.38 |
| killed | -3 | 0.25 |
| withdrawal | -3 | 0.25 |
| bad | -3 | 0.13 |
| destroy | -3 | 0.13 |
| destruction | -3 | 0.13 |
| died | -3 | 0.13 |
| disastrous | -3 | 0.13 |
| popular | 3 | 0.13 |
| praised | 3 | 0.13 |
| prosperous | 3 | 0.13 |
| wealth | 3 | 0.13 |
| vitality | 3 | 0.25 |
| breakthrough | 3 | 0.38 |
| successful | 3 | 0.88 |
| fantastic | 4 | 0.13 |
| triumph | 4 | 0.13 |
| win | 4 | 0.38 |

Trump’s words “bastards”, “fraud”, “hell”, “catastrophic”, and “fraudulent” all had lower sentiment scores than any word Kennedy used in his four debates. In addition, Trump’s frequent use of the word “bad” (which alone makes up nearly five percent of his negative/positive words) drags his sentiment score down. Trump does use more strongly positive words than Kennedy (in particular “wonderful” and variations of “win”), but Kennedy also often used words scored as fairly positive, like “successful”, “breakthrough”, and “win”.
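A breakdown like the tables above can be assembled in a few more lines. This sketch reuses the debates data frame and afinn dictionary from the earlier snippet (column and speaker names are illustrative):

```r
# Each speaker's charged words as a share of all their AFINN-scored words
word_shares <- debates %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word") %>%
  count(speaker, word, value, name = "uses") %>%
  group_by(speaker) %>%
  mutate(share_pct = 100 * uses / sum(uses)) %>%
  ungroup() %>%
  arrange(speaker, value, desc(uses))

# e.g. filter(word_shares, speaker == "Trump") for the first table
```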

One final point: while the first plot showed that Biden’s first 2020 debate performance was among the most negative relative to past candidates’ averages, his earlier individual debate performances haven’t been particularly negative. The graph below highlights Trump’s four presidential debates and Biden’s two VP debates (in 2008 and 2012) and one presidential debate.

Average Sentiment by U.S. General Election Debate Performance – Biden/Trump

Additional Measures of Sentiment and Emotion

One potential problem with the simple dictionary-based method used in the previous analysis is that it overlooks “valence shifter” words, like “not” and “very”, that change a sentence’s meaning or intensity. For example, if someone used the word “bad” we might want to know whether they said something was “not bad” or “very bad”. We might also want to look beyond a simple negative/positive scale at specific emotions like anger or disgust.

The sentimentr package implements solutions to both problems. It uses a similar dictionary-based approach, but with an expanded vocabulary, adjustments for valence shifters, and word tagging for particular emotions.
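A minimal sketch of both pieces, again assuming the illustrative debates data frame:

```r
# sentiment_by() averages sentence-level scores that account for
# valence shifters; emotion_by() scores tagged emotion words.
library(sentimentr)

sent_by_speaker <- with(
  debates,
  sentiment_by(get_sentences(text), by = list(speaker, year))
)
# ave_sentiment is the valence-shifter-aware average per speaker-year

emotion_by_speaker <- with(
  debates,
  emotion_by(get_sentences(text), by = list(speaker, year))
)
# one row per speaker-year-emotion type (anger, disgust, fear, trust, ...)
```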

Using this package, I can first re-create the graph of debater sentiment over time for the two major parties. While some of the trends shift, the main takeaway clearly holds: Trump is a historically negative debater.

Average Sentiment of U.S. Presidential Debaters (sentimentr)

I can then look at particular emotions expressed, beyond simply negative or positive sentiment. This analysis works essentially the same way — certain words are tagged as corresponding to a particular emotion, and the more often these words are used, the stronger the average emotional score. Here, I look at four emotions: disgust, anger, fear, and trust.

Average Emotional Scores of U.S. Presidential Debaters (sentimentr)

The graphs show that Trump’s 2016 debates ranked the highest in disgust and anger, while his 2020 debate ranked the lowest in trust. George W. Bush’s 2004 debate performances, which included discussions of 9/11 and the “war on terror”, are ranked as the most fearful.

Most Commonly Used Bigrams

To better understand the topics discussed in each series of debates, I next construct a series of simple word clouds for each election (where size corresponds to how frequently a term is used). Instead of individual words, I use bigrams, sequences of two adjacent words, which better capture complete themes.
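A minimal sketch of one way to build these counts and clouds, assuming the debates data frame (the stop-word filtering and the wordcloud package are choices made for this sketch, not necessarily the exact pipeline used):

```r
# Bigram frequencies for one election year, sized into a word cloud
library(dplyr)
library(tidyr)
library(tidytext)
library(wordcloud)

bigrams_2020 <- debates %>%
  filter(year == 2020) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("w1", "w2"), sep = " ") %>%
  # dropping stop-word pairs is an assumption about the cleaning step
  filter(!w1 %in% stop_words$word, !w2 %in% stop_words$word) %>%
  count(w1, w2, sort = TRUE) %>%
  mutate(bigram = paste(w1, w2))

with(bigrams_2020, wordcloud(bigram, n, max.words = 75))
```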

Below are clouds from 1960, 1984, 2004, 2016, and 2020. Each provides a window into the issues that were relevant at the time. In 1960 and 1984, for example, foreign policy discussions surrounding the Soviet Union and Latin and South America played a prominent role in the debates. In 2004, debates around health care and Middle East foreign policy took center stage.

You can find clouds for every year of presidential debates here.

Principal Components Analysis

In my final analysis, I convert each candidate’s entire debate text into a two-dimensional measure. While this measure will not be easily interpretable, you can think of candidates that are closer together on these two dimensions as using more “similar” language. This sounds like magic, but it is really a two-step process:

  1. Create a table where every row represents a speaker and every column represents a word. The value of each cell is the number of times the speaker in row \(n\) uses the word in column \(m\).1 To maximize the amount of useful information each cell provides, multiply each column by a measure that is inversely proportional to how frequently the word appears across documents (so a word like “the” is weighted downward).2 This is called a term frequency–inverse document frequency matrix.

  2. Use a Principal Components Analysis to “project” this very big \(N\)x\(M\) table into an \(N\)x\(2\) table that captures as much of the information from the original table as possible.3 A code sketch of both steps appears below.
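Here is a minimal sketch of these two steps, assuming a hypothetical speaker_words data frame of word counts per debate performance (columns performance, word, and n); the log(1+n) normalization and the frequency filters described in the footnotes are omitted for brevity:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Step 1: term frequency-inverse document frequency matrix
tfidf_wide <- speaker_words %>%
  bind_tf_idf(word, performance, n) %>%
  select(performance, word, tf_idf) %>%
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

# Step 2: project the N x M matrix onto its first two principal components
pca <- prcomp(select(tfidf_wide, -performance))
components <- tibble(
  performance = tfidf_wide$performance,
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2]
)
```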

The graphs below show these principal components for every debate performance. Again, the key thing to remember is just that points that are closer together were found to be more similar using this technique.

The first graph plots these components from data that includes the nouns, adjectives, and verbs each speaker used (I exclude proper nouns, such as “Obama”, which are likely to be time-period specific).
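Filtering by part of speech requires a tagger. The sketch below uses the udpipe package as one option (the choice of tagger, and the model download step, are illustrative rather than a record of the exact pipeline used):

```r
# One way to keep only nouns, adjectives, and verbs (dropping proper
# nouns) using the udpipe part-of-speech tagger
library(udpipe)

ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

tagged <- as.data.frame(udpipe_annotate(ud_model, x = debates$text))
content_words <- subset(tagged, upos %in% c("NOUN", "ADJ", "VERB"))
# "PROPN" (proper nouns such as "Obama") is deliberately left out
```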

Principal Components Analysis (Nouns, Adjectives, and Verbs)

On the bottom, one set of debate performances sticks out from the others: George W. Bush and John Kerry’s September 30, 2004 debate. A likely reason? The candidates agreed to focus that debate entirely on foreign policy and homeland security, shaping the vocabulary they used. The two closest performances, from the 2016 Pence-Kaine debate, also included lengthy discussions of foreign policy and national defense topics.

You can also see that the performances grouped together on the left side of the plot are generally from the more distant past. In particular, the far upper left includes every debate performance from Kennedy, Nixon, Ford, and Carter. By contrast, the debate performances in the top right appear to be primarily from the late 90s and early 2000s.

While the exclusion of proper nouns may have reduced the influence of some time-specific trends, the relative importance of certain issues (e.g. inflation, health care) is still likely to vary over time and shape the nouns used. To avoid picking up these time-specific topics, I re-run the analysis including only adjectives and verbs.

Principal Components Analysis (Adjectives and Verbs)

While many of the earliest debates remain off to the left, a new outlying cluster appears in the bottom right: every debate performance of Donald Trump. His debate performances are, well, quantifiably unique.

Thanks for reading! If you enjoyed this, please share it. You can find the code for this document, as well as the cleaned data, in this Github repository.


  1. Here, I normalize this number by using log(1+n), as is common.

  2. It is also common practice to exclude words that appear in very many debates (thus not providing unique information) or very few debates (thus telling us little about how the documents relate to each other). In my analysis I exclude words that appear in more than 90% or less than 10% of speakers’ debate performances.

  3. This might still sound like magic, but I promise it is just linear algebra.