Abstract
A tutorial providing an introduction to bag-of-words models.
This tutorial builds on what we covered in the previous tutorial, Text Mining and Sentiment Analysis. It provides an introduction to bag-of-words models, the term frequency-inverse document frequency (TF-IDF) concept, and topic modelling. Many of the applied tasks and examples are a continuation of the methods and learnings from that tutorial.
The goal is to continue our learning in the natural language processing landscape, essentially looking at a greater selection of methods for examining text data, manipulating it into different forms, and extracting insight from it. The tutorial is broken into three parts, which flow neatly from one to another. A summary of each part is given below:
- PART 1: introduces bag-of-words models; a representation of text that describes the occurrence of words within a document
- PART 2: covers TF-IDF; a method that reflects how important a word is to a document in a collection or corpus
- PART 3: introduces topic modelling; an unsupervised classification approach for documents
The work in this tutorial is largely based on the work of Ian Durbach; his profile is found here. In fact, much of it follows the Data Science for Industry course at the University of Cape Town. Other great resources used in this tutorial are discussed below.
One of these resources is Chapter 3 of Text Mining with R, which covers tf-idf and related topics. In fact, Text Mining with R by Julia Silge and David Robinson is highly recommended, as this book and its approach are closely followed here.
The code in Part 3, the topic modelling lesson, borrows heavily from Chapter 6 of Text Mining with R, and also draws on the material on converting between tidy and non-tidy text formats in Chapter 5.
The document is for the most part very applied in nature, and doesn’t
assume much natural language processing experience. For programming
purposes, it would be useful if you are familiar with the
tidyverse, or at least dplyr specifically.
It must be stressed that this is only a starting point, a hopefully fun foray into the world of text, not a definitive statement of how you should analyze text. In fact, some of the methods demonstrated would likely be too rudimentary for most real-world goals.
This part of our notebook covers bag-of-words models. These are used in NLP settings where a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
In the notebook we:
In a bag-of-words model, a text document is represented by the set of words used in the document. This is a simplified representation that ignores much of the more subtle structure of text, like the order that words appear in and grammar. Frequency counts of the words used in a document can be used to cluster documents or as features in predictive modelling.
For example, say we have a collection of newspaper articles drawn from entertainment and business categories. The set of documents is known as the corpus. The set of all unique words used in any of the articles constitutes our bag-of-words, also called the vocabulary.
We can represent this data as a matrix, in which each article is a row, each word is a column, and the entry in row i and column j contains the number of times word j appears in document i (many of these values will be zero, so we have a sparse matrix). A final column contains our response variable, indicating the type of article (entertainment or business).
We could use the word frequencies as features to build a model that, on the basis of frequencies of different words, predicts whether an article is about business or entertainment. This is a typical kind of use for a bag-of-words model. Some hypothetical data is shown below.
| document | sing | money | the | happy | dividend | … | response |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 0 | 15 | 0 | 0 | … | ent |
| 2 | 0 | 5 | 12 | 2 | 0 | … | bus |
| 3 | 0 | 0 | 3 | 0 | 6 | … | bus |
| … | … | … | … | … | … | … | … |
| 100 | 10 | 0 | 13 | 10 | 2 | … | ent |
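A matrix like this can be sketched in base R on a hypothetical mini-corpus (the documents and words below are made up for illustration, not real article data):

```r
# Hypothetical mini-corpus of three "documents"
docs <- c("sing the happy song",
          "the money the dividend",
          "money money dividend")
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))
# Document-term matrix: rows are documents, columns are vocabulary words
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))
dtm  # e.g. doc2 uses "the" twice, so dtm["doc2", "the"] is 2
```

In practice the vocabulary is far larger and most entries are zero, which is why sparse-matrix representations are usually used instead.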
We’ll start by loading the packages we need, loading the data containing the tweets, and doing some wrangling. Most of this is the same as what we have been doing for the NLP tutorials.
Specifically, we:
This is mostly the same as what we did in the previous tutorial, so refer back to it if anything is not clear. In addition, we take a sample of 1000 tweets before and after he became president. That will be enough to build a model later.
library(tidyverse)
library(stringr)
library(lubridate)
library(tidytext)
library(rpart)
load('trump-tweets.RData')
# make data a tibble
tweets <- as_tibble(tweets)
# parse the date and add some date related variables
tweets <- tweets %>%
mutate(date = parse_datetime(str_sub(tweets$created_at, 5, 30), "%b %d %H:%M:%S %z %Y")) %>%
mutate(is_potus = (date > ymd(20161108))) %>%
mutate(month = make_date(year(date), month(date)))
# take a random sample of 1000 tweets before and after he became president
set.seed(123)
tweets_sample <- tweets %>% group_by(is_potus) %>% slice_sample(n = 1000) %>% ungroup()
# turn into tidy text
replace_reg <- '(http.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
unnest_reg <- "[^\\w_#@'’]"
tidy_tweets <- tweets_sample %>%
filter(is_retweet == FALSE) %>% #remove retweets
mutate(text = str_replace_all(text, replace_reg, '')) %>% #remove stuff we don't want like links
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>% #tokenize
# filter(!word %in% stop_words$word, str_detect(word, '[A-Za-z]')) %>% #LEAVE THE STOP WORDS IN THIS TIME
filter(!str_detect(word, '@real[Dd]onald[Tt]rump')) %>% #remove @realDonaldTrump, which we know is highly predictive
select(date, word, is_potus, favorite_count, id_str, month) #choose the variables we need
Let’s look at the most popular words:
tidy_tweets %>%
group_by(word) %>%
count() %>%
ungroup() %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 the 1694
## 2 to 927
## 3 and 890
## 4 a 747
## 5 of 688
## 6 in 617
## 7 is 614
## 8 for 454
## 9 i 438
## 10 you 413
Now we shall seek to extract bag-of-words data from the text. We put our data into “bag-of-words” form, by:
Here we simplify things by looking only at the frequency of the 200
most popular words. We first find out what the 200 most commonly used
words are. We’re doing this a slightly different way to the previous
block of code, where we first sorted (by the count variable
n, using arrange), and then filtered by rank.
Here we use top_n(), a convenience function that
selects the top or bottom entries in each group, ordered by
wt.
word_bag <- tidy_tweets %>%
group_by(word) %>%
count() %>%
ungroup() %>%
top_n(200, wt = n) %>%
select(-n)
We actually end up with more than 200 words, because the 200 in
top_n() refers to the rank of the last observation that
gets included. All words tied for the last rank (200) get included, and
we end up with more than 200 words. That’s really not a problem here -
because the main goal of the subsetting is just to reduce the full
problem to something more manageable in the tutorial, we don’t really
need to worry if we have exactly 200 words to use as features.
nrow(word_bag)
## [1] 203
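This tie behaviour is easy to demonstrate with a base-R sketch on made-up counts (hypothetical numbers, not the tweet data):

```r
n <- c(9, 7, 7, 5, 5, 5, 3)
# keep everything tied with the 4th-largest count, mimicking top_n(4)
cutoff <- sort(n, decreasing = TRUE)[4]
keep <- n >= cutoff
sum(keep)  # 6 items rather than 4, because of the three-way tie at 5
```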
Now we calculate the number of times each of these words was used in each of the 2000 tweets. Note that because we’re using a subset of words, we might find that some tweets don’t contain any of these words in them. These observations are dropped from the analysis.
tweets_tdf <- tidy_tweets %>%
inner_join(word_bag) %>%
group_by(id_str,word) %>%
count() %>%
group_by(id_str) %>%
mutate(total = sum(n)) %>%
ungroup()
We now reshape the data set to get it into traditional format for
predictive modelling, using pivot_wider(). This means that
each tweet will be in its own row, and each word in its own column. Note
that we’re moving here from a tidy to untidy format, because that is the
format required by rpart, the classification tree
package we’re going to use to do further modelling.
bag_of_words <- tweets_tdf %>%
select(id_str, word, n) %>%
pivot_wider(names_from = word, values_from = n, values_fill = 0) %>%
left_join(tweets %>% select(id_str, is_potus)) %>%
select(id_str, is_potus, everything())
# number of tweets
nrow(bag_of_words)
## [1] 1834
# number of variables (words, plus id and response)
ncol(bag_of_words)
## [1] 205
bag_of_words %>% head(6)
## # A tibble: 6 × 205
## id_str is_po…¹ ameri…² congr…³ love of on people the we you
## <chr> <lgl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1000064605… TRUE 1 1 1 1 1 1 1 1 1
## 2 1000151354… TRUE 0 0 0 1 0 0 0 0 0
## 3 1000174070… TRUE 0 0 0 0 1 0 2 1 0
## 4 1000366478… TRUE 1 0 0 2 0 1 4 0 0
## 5 1000718611… TRUE 0 0 0 0 0 0 2 0 0
## 6 1001099771… TRUE 0 0 0 0 0 0 1 0 0
## # … with 194 more variables: a <int>, and <int>, do <int>, down <int>, i <int>,
## # it <int>, must <int>, no <int>, nothing <int>, obama <int>, s <int>,
## # security <int>, u <int>, with <int>, about <int>, are <int>, be <int>,
## # `if` <int>, `in` <int>, korea <int>, north <int>, that <int>, very <int>,
## # which <int>, will <int>, at <int>, from <int>, good <int>, great <int>,
## # his <int>, house <int>, news <int>, should <int>, this <int>, back <int>,
## # been <int>, by <int>, going <int>, have <int>, help <int>, hunt <int>, …
Each tweet is represented by a row, and each word by a column.
Losing some tweets is actually a bit of a problem. There are two reasons we want the same number of tweets in both “before” and “after” classes of the response. First, many predictive modelling approaches do better with balanced data. Second, it gives us an easy benchmark (50%) to evaluate our results against. So we should check whether the dropped tweets have come disproportionately from one response class and whether the classes are still balanced.
table(bag_of_words$is_potus)
##
## FALSE TRUE
## 975 859
Here we see that whilst only 25 non-presidential tweets were lost by keeping only the 203 most frequent words, 141 presidential tweets were removed. To balance the classes again, we will randomly select tweets from the majority class equal in number to the minority class.
min_class_size <- min(table(bag_of_words$is_potus))
bag_of_words <- bag_of_words %>% group_by(is_potus) %>% sample_n(min_class_size) %>% ungroup()
table(bag_of_words$is_potus)
##
## FALSE TRUE
## 859 859
Here, we shall start building our bag-of-words classifier model.
We’d like to know whether we can use these word frequencies to predict whether a Trump tweet was made while he was president or not.
We have a binary response, and continuous (integer) predictors, so we can use a variety of approaches to model this problem. Here we’ll use a binary classification tree constructed using the CART algorithm, implemented in the rpart package.
Before building the tree, we split our data into training and test sets. I’ve included 70% of the data in the training set and left the rest for testing.
set.seed(321)
training_ids <- bag_of_words %>%
group_by(is_potus) %>%
slice_sample(prop = 0.7) %>%
ungroup() %>%
select(id_str)
training_tweets <- bag_of_words %>%
right_join(training_ids, by = 'id_str') %>%
select(-id_str)
test_tweets <- bag_of_words %>%
anti_join(training_ids, by = 'id_str') %>%
select(-id_str)
We then fit a tree to the training data.
fit <- rpart(is_potus ~ ., training_tweets, method = 'class')
And plot the full tree.
# options(repr.plot.width = 12, repr.plot.height = 10) # set plot size in the notebook
plot(fit, main = 'Full Classification Tree')
text(fit, use.n = TRUE, all = TRUE, cex=.8)
The first split is on the word “and”. This is the key idea here: we are interested in the words themselves, i.e., which key words the model uses to decide whether a tweet was made pre- or post-presidency.
We can assess the accuracy in the training data set by extracting a cross-table of predicted against observed classifications, and calculating the percentage of classifications the model got correct. Because we have balanced class sizes (the same number of tweets before and after presidency), a completely random classifier returns, on average, 50% accuracy.
fittedtrain <- predict(fit, type = 'class')
predtrain <- table(training_tweets$is_potus, fittedtrain)
predtrain
## fittedtrain
## FALSE TRUE
## FALSE 466 135
## TRUE 195 406
paste("Training Accuracy is:", round(sum(diag(predtrain))/sum(predtrain), 3)) # training accuracy
## [1] "Training Accuracy is: 0.725"
We can do the same in the test dataset - this is the accuracy we would be most interested in.
fittedtest <- predict(fit, newdata = test_tweets, type = 'class')
predtest <- table(test_tweets$is_potus, fittedtest)
predtest
## fittedtest
## FALSE TRUE
## FALSE 192 66
## TRUE 95 163
paste("Testing Accuracy is:", round(sum(diag(predtest))/sum(predtest), 3)) # test accuracy
## [1] "Testing Accuracy is: 0.688"
Ultimately this is the procedure for a bag-of-words model. Essentially it uses raw term frequency as an indicator of how important each word is in each document (tweet).
We now introduce the TF-IDF concept. Specifically, in this section we briefly discuss the maths and background of TF-IDF: a way of weighting word frequencies that adjusts for how common a word is across documents (and hence how “special” a word is to a particular document). We then:
What we have seen so far, particularly in the previous tutorial on sentiment analysis and text mining, focused on identifying the frequency of individual terms within a document, along with the sentiments those words carry. It is also important to understand the importance of words within and across documents.
So consider our previous example, Donald Trump’s tweets. We used the words contained in tweets, and specifically the frequencies of those words, to discriminate between tweets made while he was president and those made before. This is standard practice in text mining: we describe a document by the words it contains. Now, how can we decide which words are “important” to a document, by which we mean words that distinguish that document from another?
We have two approaches:
use word frequency. If a word like “apple” appears frequently in a document, the document is more likely to be about a recipe or health than about politics, for example. However, some words that appear frequently are not particularly useful: words like “the”, “a”, and so on. These may appear frequently in a particular document, but are not interesting to us because they also appear frequently in many other documents. We’ve previously encountered these as stop words, and we’ve typically removed them from the analysis to reveal more interesting underlying patterns. This approach relies on term frequency (tf), which measures how frequently a word occurs in a document.
rather than removing the stop words, downweight them. This is what tf-idf (term frequency-inverse document frequency) does. Here “term” is just a general indicator of the unit of text we are interested in, like “token”. Essentially, tf-idf decreases the weight of commonly used words and increases the weight of words that are not used very much in a collection of documents.
Tf-idf is composed of two parts:
The tf part: A term’s frequency increases with how many times it is used in that document. For tf-idf we use relative frequencies: the number of times a word appears, divided by the total number of words in that document.
The idf part: A term’s inverse document frequency is a measure of how many documents (in the corpus) contain that term. It decreases the weight of commonly used words (which all or most documents will use) relative to words that are not used by many documents in the corpus.
Specifically, we are going to replace the term frequencies we were using before with new values called “inverse document frequency weighted term frequencies” or tf-idf for short. So each term in each document will get its own tf-idf “score”, which we denote \(tfidf(\text{term t in document i})\).
Then:
\[tfidf(\text{term t in document i}) = tf(\text{term t in document i}) \times idf(\text{term t})\]
where
\[tf(\text{term t in document i}) = \displaystyle\frac{\text{Number of times term t appears in document i}}{\text{Number of terms in document i}}\]
and
\[idf(\text{term t}) = \ln\biggl(\displaystyle\frac{\text{Number of documents in corpus}}{\text{Number of documents containing term t}}\biggr)\]
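As a quick worked example of these formulas (all numbers made up): suppose a corpus of 100 documents, a term t appearing in 4 of them, and 3 occurrences of t in a document of 60 terms.

```r
n_docs       <- 100  # documents in the corpus
docs_with_t  <- 4    # documents containing term t
count_in_doc <- 3    # times t appears in document i
terms_in_doc <- 60   # total terms in document i

tf  <- count_in_doc / terms_in_doc  # 0.05
idf <- log(n_docs / docs_with_t)    # ln(25), about 3.22
tf * idf                            # about 0.161
```

Had t appeared in all 100 documents instead, idf would be ln(1) = 0 and the tf-idf score would vanish, no matter how often t appeared in the document.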
The log function increases in its argument, and the number of documents in the corpus is fixed, so as the number of documents containing a term increases (the denominator above), the idf part decreases. The log transform ensures that this decrease is steep initially, and then levels off. Below I show how the idf weight changes with the number of documents (out of a total of 100 documents) that contain a term.
# options(repr.plot.width = 8, repr.plot.height = 8) # set plot size in the notebook
plot(1:100, log(100/1:100), type = 'l',
xlab = 'Number of documents (out of 100) containing term t', ylab = 'idf(t)')
So suppose we have 100 documents, and a word that appears in only one of them: its idf score, ln(100) ≈ 4.6, is very high. With only 10 documents, the corresponding score, ln(10) ≈ 2.3, is half that.
As a bonus, we no longer have to carry stop word dictionaries around with us: using idf allows us to include all words and lets idf downweight the most common ones.
Let’s do this now in R. Below we carry out the calculations above to work out tf-idf values for the words (terms) in our tweets (documents).
ndocs <- length(unique(tweets_tdf$id_str))
idf <- tweets_tdf %>%
group_by(word) %>%
summarize(docs_with_word = n()) %>%
ungroup() %>%
mutate(idf = log(ndocs / docs_with_word)) %>% arrange(desc(idf))
tweets_tdf <- tweets_tdf %>%
left_join(idf, by = 'word') %>%
mutate(tf = n/total, tf_idf = tf * idf)
Let’s explore how the weighting affects the terms that come up as “most important”, by looking at one particular tweet.
set.seed(321)
random_tweet <- sample(tweets_tdf$id_str, 1)
tweets %>% filter(id_str == random_tweet) %>% select(text)
## # A tibble: 1 × 1
## text
## <chr>
## 1 Heading to Alabama now, big crowd!
Below we rank words in descending order of importance by the
criterion of word frequency (n). By changing this to
inverse document frequency (idf) and tf-idf
(tf_idf) you can see which words become more or less
important, and get a sense for why.
tweets_tdf %>% filter(id_str == random_tweet) %>% arrange(desc(n))
## # A tibble: 3 × 8
## id_str word n total docs_with_word idf tf tf_idf
## <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 911338971250216960 big 1 3 85 3.07 0.333 1.02
## 2 911338971250216960 now 1 3 111 2.80 0.333 0.935
## 3 911338971250216960 to 1 3 681 0.991 0.333 0.330
We’ve done the tf-idf calculation “from scratch” to get a better
understanding of what is happening. The tidytext
package has a function bind_tf_idf() that does the same
thing.
tweets_tdf <- tweets_tdf %>%
select(-idf, -tf, -tf_idf) %>% #remove the old ones we worked out
bind_tf_idf(word, id_str, n) #replace with values from tidytext
tweets_tdf %>% filter(id_str == random_tweet) %>% arrange(desc(n)) # check same as above
## # A tibble: 3 × 8
## id_str word n total docs_with_word tf idf tf_idf
## <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 911338971250216960 big 1 3 85 0.333 3.07 1.02
## 2 911338971250216960 now 1 3 111 0.333 2.80 0.935
## 3 911338971250216960 to 1 3 681 0.333 0.991 0.330
Tf-idf features often give better accuracy in predictive modelling than using word frequencies. In this section we repeat the earlier analysis, building a classification tree to predict whether a tweet was made before or after becoming president, but replacing the word frequency features with the tf-idf values calculated above.
We start by reshaping the data set:
tfidf <- tweets_tdf %>%
select(id_str, word, tf_idf) %>% # note the change, using tf-idf
pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0) %>%
left_join(tweets %>% select(id_str,is_potus))
We use the same training and test sets as before.
training_tweets <- tfidf %>%
right_join(training_ids, by = 'id_str') %>%
select(-id_str)
test_tweets <- tfidf %>%
anti_join(training_ids, by = 'id_str') %>%
select(-id_str)
We fit a tree to training data:
fit <- rpart(factor(is_potus) ~ ., training_tweets)
Plot the tree we just created:
# options(repr.plot.width = 12, repr.plot.height = 10) # set plot size in the notebook
plot(fit, main='Full Classification Tree')
text(fit, use.n=TRUE, all=TRUE, cex=.8)
And check the accuracy in training and test data sets:
fittedtrain <- predict(fit, type = 'class')
predtrain <- table(training_tweets$is_potus, fittedtrain)
predtrain
## fittedtrain
## FALSE TRUE
## FALSE 479 122
## TRUE 223 378
paste("Training Accuracy is:", round(sum(diag(predtrain))/sum(predtrain), 3)) # training accuracy
## [1] "Training Accuracy is: 0.713"
fittedtest <- predict(fit,newdata=test_tweets,type='class')
predtest <- table(test_tweets$is_potus,fittedtest)
predtest
## fittedtest
## FALSE TRUE
## FALSE 298 76
## TRUE 112 146
paste("Testing Accuracy is:", round(sum(diag(predtest))/sum(predtest), 3)) # test accuracy
## [1] "Testing Accuracy is: 0.703"
We get a very slight improvement in accuracy from replacing word frequency features with ones based on tf-idf.
This example leverages the data provided in the
harrypotter package. This package provides access to the
full texts of the first seven Harry Potter books.
We load the harrypotter package.
# load package
if (!requireNamespace("devtools", quietly = TRUE) ||
    packageVersion("devtools") < "1.6") {
  install.packages("devtools")
}
devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
The seven novels we are working with, provided by the harrypotter package, are:

- philosophers_stone: Harry Potter and the Philosophers Stone (1997)
- chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
- prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
- goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
- order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
- half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
- deathly_hallows: Harry Potter and the Deathly Hallows (2007)
We could have a look at the raw text of the first chapter of
philosophers_stone, however the output (as expected) is
quite lengthy. If you want, you can execute the code yourself. We look
at the data structure below. We see that Harry Potter and the
Philosophers Stone has 17 chapters.
str(philosophers_stone)
## chr [1:17] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfe"| __truncated__ ...
Now, with our data, we need to do what we have always been doing: turning it into tidy format. We need to do this in order to compute term frequencies. The following converts all seven Harry Potter novels into a tibble that has each word, by chapter, by book.
# Convert to Tidy
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
"Deathly Hallows")
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
goblet_of_fire, order_of_the_phoenix, half_blood_prince,
deathly_hallows)
series <- tibble()
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
unnest_tokens(word, text) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series <- rbind(series, clean)
}
# set factor to keep books in order of publication
series$book <- factor(series$book, levels = rev(titles))
series
## # A tibble: 1,089,386 × 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # … with 1,089,376 more rows
From this cleaned-up text we can compute the term frequency for each word. Let’s do this both by book and across the entire Harry Potter series.
book_words <- series %>%
count(book, word, sort = TRUE) %>%
ungroup()
series_words <- book_words %>%
group_by(book) %>%
summarise(total = sum(n))
book_words <- left_join(book_words, series_words)
book_words
## # A tibble: 67,881 × 4
## book word n total
## <fct> <chr> <int> <int>
## 1 Order of the Phoenix the 11740 258763
## 2 Deathly Hallows the 10335 198906
## 3 Goblet of Fire the 9305 191882
## 4 Half-Blood Prince the 7508 171284
## 5 Order of the Phoenix to 6518 258763
## 6 Order of the Phoenix and 6189 258763
## 7 Deathly Hallows and 5510 198906
## 8 Order of the Phoenix of 5332 258763
## 9 Prisoner of Azkaban the 4990 105275
## 10 Goblet of Fire and 4959 191882
## # … with 67,871 more rows
Here, we can see that common and noncontextual words (“the”, “to”, “and”, “of”, etc.) rule the term frequency list. We can visualize the distribution of term frequency for each novel. Here we’ll look at the distribution of n/total.
book_words %>%
mutate(ratio = n / total) %>%
ggplot(aes(ratio, fill = book)) +
geom_histogram(show.legend = FALSE) +
scale_x_log10() +
facet_wrap(~ book, ncol = 2)
Since the distribution is so clustered around 0, we add scale_x_log10() to spread it out. Even so, we see the long right tails for those extremely common words.
Zipf’s law states that within a group or corpus of documents, the frequency of any word is inversely proportional to its rank in a frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(term frequency).
freq_by_rank <- book_words %>%
group_by(book) %>%
mutate(rank = row_number(),
`term freq` = n / total)
ggplot(freq_by_rank, aes(rank, `term freq`, color = book)) +
geom_line() +
scale_x_log10() +
scale_y_log10()
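A common follow-up is to estimate the Zipf exponent by regressing log frequency on log rank. Here is a base-R sketch using a hypothetical frequency vector (not the freq_by_rank data above), chosen to roughly follow a 1/rank pattern:

```r
freq <- c(1000, 480, 330, 260, 195, 170, 140, 125, 110, 100)
rank <- seq_along(freq)
fit  <- lm(log10(freq) ~ log10(rank))
coef(fit)[2]  # slope; Zipf's law predicts a value near -1
```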
Our plot illustrates that the distribution is similar across the seven
books. Let us now look at applying tf-idf.
The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the Harry Potter series.
Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Or put another way, tf-idf helps to find the important words that can provide specific document context.
We can easily compute the idf and tf-idf using the
bind_tf_idf function provided by the tidytext
package.
book_words <- book_words %>%
bind_tf_idf(word, book, n)
book_words
## # A tibble: 67,881 × 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Order of the Phoenix the 11740 258763 0.0454 0 0
## 2 Deathly Hallows the 10335 198906 0.0520 0 0
## 3 Goblet of Fire the 9305 191882 0.0485 0 0
## 4 Half-Blood Prince the 7508 171284 0.0438 0 0
## 5 Order of the Phoenix to 6518 258763 0.0252 0 0
## 6 Order of the Phoenix and 6189 258763 0.0239 0 0
## 7 Deathly Hallows and 5510 198906 0.0277 0 0
## 8 Order of the Phoenix of 5332 258763 0.0206 0 0
## 9 Prisoner of Azkaban the 4990 105275 0.0474 0 0
## 10 Goblet of Fire and 4959 191882 0.0258 0 0
## # … with 67,871 more rows
Notice how the common non-contextual words (“the”, “to”, “and”, etc.) have high tf values but idf and tf-idf values of 0. This is because a term that occurs in every document has an idf of log(1) = 0. We can look at the words that have the highest tf-idf values.
Here we see mainly names for characters in each book that are unique to that book, and therefore used often, but are absent or nearly absent in the other books.
book_words %>%
arrange(desc(tf_idf))
## # A tibble: 67,881 × 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Half-Blood Prince slughorn 335 171284 0.00196 1.25 0.00245
## 2 Deathly Hallows c 1300 198906 0.00654 0.336 0.00220
## 3 Order of the Phoenix umbridge 496 258763 0.00192 0.847 0.00162
## 4 Goblet of Fire bagman 208 191882 0.00108 1.25 0.00136
## 5 Chamber of Secrets lockhart 197 85401 0.00231 0.560 0.00129
## 6 Prisoner of Azkaban lupin 369 105275 0.00351 0.336 0.00118
## 7 Goblet of Fire winky 145 191882 0.000756 1.25 0.000947
## 8 Goblet of Fire champions 84 191882 0.000438 1.95 0.000852
## 9 Deathly Hallows xenophilius 79 198906 0.000397 1.95 0.000773
## 10 Half-Blood Prince mclaggen 65 171284 0.000379 1.95 0.000738
## # … with 67,871 more rows
To understand the most common contextual words in each book we can take a look at the top 15 terms with the highest tf-idf.
book_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))),
book = factor(book, levels = titles)) %>%
group_by(book) %>%
top_n(15, wt = tf_idf) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = book)) +
geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
labs(title = "Highest tf-idf words in the Harry Potter series",
x = NULL, y = "tf-idf") +
facet_wrap(~book, ncol = 2, scales = "free") +
coord_flip()
As you can see, most of these high-ranking tf-idf words are nouns that provide specific context around the most common characters in each individual book.
The examples in Parts 1 and 2 illustrated how bag-of-words models, downweighted using tf-idf, allow us to vectorize documents containing tokens in such a way that we can use predictive modelling, in a supervised learning sense, to classify documents into one of any number of classes, based on the contents of each document.
The TF-IDF statistic computes the frequency of a term adjusted for how rarely it is used. Since the ratio inside the idf’s log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0. So, it is essentially a measure of the ‘importance’ of a particular word for a given document, and it’s often used in combination with other analyses, for example as an input to a text classifier, or to help rank search results.
Topic modelling starts by asking what a document is about. The idea is to work out what a document is all about, and to group together documents that share similar themes, based on their contents.
This notebook covers topic modelling and, in particular, one method used to do topic modelling: latent Dirichlet allocation.
Specifically, we shall:

- Look at how a topic model attempts to “summarize” the document-term matrix
- Describe latent Dirichlet allocation (LDA), a popular method for topic modelling that avoids some of the pitfalls of other methods
- Show how you can do topic modelling with LDA using the topicmodels package
Note that during this tutorial we shall pave the way by introducing a number of methods for doing topic modelling, and also some fundamental modelling “building blocks” like maximum likelihood. We finish off by creating a topic model for a collection of movie reviews.
Topic modelling is a type of bag-of-words model that aims to summarize a document by describing it in terms of a small number of “topics”. The resulting descriptions can then be used for several purposes, for example, to judge similarity between documents and to cluster together similar documents, or for document classification.
As we’ve seen, in the bag-of-words model a document is represented by:
When a number of documents are collected into a corpus, we start by identifying all the words that are used by the documents in the corpus – the “vocabulary” of the corpus. Each document can then be thought of as a (often very) long vector of frequencies (as long as the vocabulary), many of which will be zero. These vectors, stacked upon each other, one per document, give us the document-term matrix.
The “wideness” of the document-term matrix poses a challenge to efficient further analysis, such as clustering and classification. So, we might ask whether we can reduce the dimensionality of the problem by summarizing each document in some way that preserves most of the information contained in the vector of word frequencies, but that makes subsequent analysis easier and more efficient.
This is what topic modelling does: in a nutshell, it summarizes documents by describing them in terms of a smaller number of topics. Over and above improving the efficiency with which further analyses can be done, the resulting summaries contain useful information about what a document is “about”.
The idea of maximum likelihood estimation (MLE) is central to topic modelling. We observe the frequency distribution of words in each document, and we try to maximize the likelihood of seeing that particular set of words by adjusting the topic probabilities. The topics are what we are interested in, but we don’t know their probabilities, so we tweak them until the observed set of words is as likely as possible.
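As a small, self-contained illustration of the maximum likelihood idea (a toy example with made-up counts, not the LDA machinery itself): suppose we observed the word "taxi" 30 times out of 100 tokens, and we want the probability p that makes this observation most likely.

```r
# Negative log-likelihood of observing 30 "successes" in 100 tokens,
# as a function of the unknown probability p
negloglik <- function(p) -dbinom(30, size = 100, prob = p, log = TRUE)

# Minimizing the negative log-likelihood = maximizing the likelihood
fit <- optimize(negloglik, interval = c(0.001, 0.999))
fit$minimum   # very close to 30/100 = 0.3, the maximum likelihood estimate
```

LDA does the same kind of thing at a much larger scale: it tweaks the topic-word and document-topic probabilities until the observed words are as likely as possible.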
We now move away from the Trump data set, containing Donald Trump’s tweets, and move on to a data set of movie reviews.
In this example we construct a topic model on reviews for two different movies, Taxi Driver and Waterworld. Since we know that the main “topics” should be to do with the two movies, we can easily assess whether LDA is doing a good job at separating out topics, and when it fails.
Using web scraping, we have collected 160 reviews of Taxi Driver, and 160 reviews of Waterworld from the IMDb site. Here you can read the synopses for Taxi Driver and Waterworld. Clearly they are quite different types of movies, although they have some things in common, like being about a loner who protects a young girl from harm.
We start by loading required packages as well as the data.
library(tidyverse)
library(tidytext)
library(topicmodels)
load('my_imdb_reviews.RData')
We then do a bit of data cleaning to make subsequent analyses easier.
We change the class of the review variable from factor to
character and add a document ID variable.
reviews <- as_tibble(reviews)
reviews$review <- as.character(reviews$review)
reviews$reviewId <- 1:nrow(reviews)
We then tokenize the reviews into individual words and remove stop words from the data frame.
tidy_reviews <- reviews %>%
unnest_tokens(word, review, token = 'words', to_lower = T) %>%
filter(!word %in% stop_words$word)
We can now count the number of times each word in our vocabulary was used in each review, creating a “long” (and tidy) format of the document-term matrix.
reviews_tdf <- tidy_reviews %>%
group_by(reviewId,word) %>%
count() %>%
ungroup()
Previously we have created our own “wide” versions of the
document-term matrix using the pivot_wider() function. We
did this, for example, when building a bag-of-words model in the Lecture
6.1 notebook. This was fine when building features for a classification tree (using rpart), but some R packages require a particular implementation of the document-term matrix. One common
implementation is the DocumentTermMatrix class provided by
the tm package.
The topicmodels package, which we use to implement
the LDA topic model, requires the document-term matrix to be provided as
a DocumentTermMatrix object. Fortunately,
tidytext provides a number of functions for turning
“ordinary” document-term matrices into DocumentTermMatrix
(or other kinds of) objects, and also for going the other way (tidying
document-term matrices). This is described in more detail here.
Essentially, we create the DocumentTermMatrix object using
cast_dtm() as shown below.
dtm_reviews <- reviews_tdf %>%
cast_dtm(reviewId, word, n)
We can now estimate the parameters of the topic model using LDA. We
use the LDA() function provided by the package
topicmodels. We need to specify the number of latent
variables (topics) we wish to use.
Notice that the object that is returned by LDA() below
has a specific class, the “LDA_VEM” class. This is a so-called “S4”
object. R provides a number of different systems for object oriented
programming: the main ones are an older, more flexible “S3” object
class, and a newer, stricter “S4” class. We don’t need to worry about
these intricacies, but one important practical difference is that the
$ operator (the way you reference variables within an
object e.g. tweets$text) is not defined for the S4 class.
Instead, you use the @ operator.
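A minimal toy example of this difference (the class name Tiny is made up purely for illustration):

```r
library(methods)

# Define a tiny S4 class with one numeric slot
setClass("Tiny", slots = c(x = "numeric"))
obj <- new("Tiny", x = c(1, 2, 3))

obj@x    # S4 slots are accessed with @
# obj$x  # error: $ is not defined for an S4 object like this one
```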
reviews_lda <- LDA(dtm_reviews, k = 2, control = list(seed = 1234))
reviews_lda
## A LDA_VEM topic model with 2 topics.
str(reviews_lda)
## Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
## ..@ alpha : num 0.0317
## ..@ call : language LDA(x = dtm_reviews, k = 2, control = list(seed = 1234))
## ..@ Dim : int [1:2] 320 7992
## ..@ control :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
## .. .. ..@ estimate.alpha: logi TRUE
## .. .. ..@ alpha : num 25
## .. .. ..@ seed : int 1234
## .. .. ..@ verbose : int 0
## .. .. ..@ prefix : chr "/var/folders/y5/7wlh6jzj03185p7x7ggqb_280000gn/T//RtmpY2SYgl/filefdac7a12e0cf"
## .. .. ..@ save : int 0
## .. .. ..@ nstart : int 1
## .. .. ..@ best : logi TRUE
## .. .. ..@ keep : int 0
## .. .. ..@ estimate.beta : logi TRUE
## .. .. ..@ var :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
## .. .. .. .. ..@ iter.max: int 500
## .. .. .. .. ..@ tol : num 1e-06
## .. .. ..@ em :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
## .. .. .. .. ..@ iter.max: int 1000
## .. .. .. .. ..@ tol : num 1e-04
## .. .. ..@ initialize : chr "random"
## ..@ k : int 2
## ..@ terms : chr [1:7992] "actors" "affection" "albert" "alienated" ...
## ..@ documents : chr [1:320] "1" "2" "3" "4" ...
## ..@ beta : num [1:2, 1:7992] -7.07 -7.16 -9.97 -9.72 -7.03 ...
## ..@ gamma : num [1:320, 1:2] 1 1 1 1 1 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:30858] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ j : int [1:30858] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ v : num [1:30858] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nrow: int 320
## .. ..$ ncol: int 7992
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num [1:320] -1328 -959 -2045 -571 -1742 ...
## ..@ iter : int 27
## ..@ logLiks : num(0)
## ..@ n : int 37987
The main things we are interested in are:
beta: these parameters control the probability of a given topic k generating a particular word i.
gamma: this gives the topic “mixtures” for each document.
We’ll first look at the beta parameters and examine which words are most associated with each topic. This will be a good way of checking, in our case, whether the words align into the two topics we are expecting (i.e. one to do with Taxi Driver, the other with Waterworld). More generally, you would do this to assess what the topics are about.
We start by creating a data frame containing the “terms” (the words we tokenized the reviews into), and the beta parameters (one beta parameter per-word-per-topic).
term <- as.character(reviews_lda@terms)
topic1 <- reviews_lda@beta[1,]
topic2 <- reviews_lda@beta[2,]
reviews_topics <- tibble(term = term, topic1 = topic1, topic2 = topic2)
The data frame above is not yet in “tidy” format, because the beta
parameters are spread across two columns. We use the
pivot_longer() function to get the data into a tidy format.
At the same time, we transform the beta parameters into probabilities by
exponentiating each beta.
reviews_topics <- reviews_topics %>%
pivot_longer(c(topic1, topic2), names_to = 'topic', values_to = 'beta') %>%
mutate(beta = exp(beta)) # pr(topic k generates word i) = exp(beta_ik)
head(reviews_topics)
## # A tibble: 6 × 3
## term topic beta
## <chr> <chr> <dbl>
## 1 actors topic1 0.000848
## 2 actors topic2 0.000775
## 3 affection topic1 0.0000468
## 4 affection topic2 0.0000602
## 5 albert topic1 0.000889
## 6 albert topic2 0.0000607
So if a word is generated from topic1, the probability that it is “actors” is the value indicated. Likewise, under topic2, the probability of seeing the word “actors” is the value shown above in the tibble.
The tidytext package provides a function
tidy() that tidies a number of different object types,
including “LDA_VEM” objects. So you can skip the previous step and use a
direct call to tidy() the reviews_lda object
we created earlier. You can use the same approach with
matrix = "gamma" to extract those parameters.
reviews_topics <- tidy(reviews_lda, matrix = 'beta')
head(reviews_topics)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 actors 0.000848
## 2 2 actors 0.000775
## 3 1 affection 0.0000468
## 4 2 affection 0.0000602
## 5 1 albert 0.000889
## 6 2 albert 0.0000607
Finally, we can extract the top 20 terms used in each topic and
arrange these in a tidy format, and plot with a call to
ggplot().
top_terms <- reviews_topics %>%
group_by(topic) %>%
top_n(20, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = 'free') +
coord_flip()
This visualization shows the most frequent words used by each topic, and this gives us a way of understanding what the topics are about. The words used by Topic 2 are mostly to do with Waterworld (the movie title, actors’ names, some plot elements like “mariner”, “land”, etc). The words used by Topic 1 are mostly about Taxi Driver. So, the topic model is clearly picking up that the reviews are largely about two “topics”, and that these topics relate to the movies being reviewed. Also note that some words (like “movie” and “film”) are common to both topics. This is a benefit of LDA: words can be associated with more than one topic.
A different way of drawing out the differences between topics is to look at which words have the greatest difference in beta values between Topic 1 and Topic 2. This could be done in several ways but a good default is to take the advice in TMR and use the log (base 2) ratio of the two betas: \(log_2(\beta_2/\beta_1)\). Any log ratio is symmetrical: the log ratio for \(\beta_2/\beta_1\) is -1 times the log ratio for \(\beta_1/\beta_2\). A base of 2 in the log ratio makes things a bit easier to interpret. The log (base 2) ratio increases by 1 for every doubling of \(\beta_2\) (relative to \(\beta_1\)).
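A quick numeric illustration of this interpretation (toy beta values, not taken from the model):

```r
# If beta_2 = 0.004 and beta_1 = 0.001, beta_2 is four times beta_1:
# two doublings, so the log2 ratio is 2
log2(0.004 / 0.001)   # 2

# Swapping the two betas just flips the sign, reflecting the symmetry
log2(0.001 / 0.004)   # -2
```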
We calculate the log ratios below and plot the words with the biggest positive and negative log ratios. We first filter to relatively common words: those with a beta greater than 1/1000 in at least one topic.
beta_spread <- reviews_topics %>%
mutate(topic = paste0('topic', topic)) %>%
pivot_wider(names_from = topic, values_from = beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread %>%
group_by(direction = log_ratio > 0) %>%
top_n(10, abs(log_ratio)) %>%
ungroup() %>%
mutate(term = reorder(term, log_ratio)) %>%
ggplot(aes(term, log_ratio)) +
geom_col() +
labs(y = 'Log2 ratio of beta in topic 2 / topic 1') +
coord_flip()
The word “dryland” is a very strong indicator of Topic 2. The word “atoll” also has something to do with water, so it makes sense as a strong indicator (an atoll is a ring-shaped coral reef, island, or series of islets). The words at the bottom of the plot are strong indicators of Topic 1 rather than Topic 2.
LDA models each document as a mixture of topics. These mixtures can be used in a straightforward way to assess whether a particular document is “mostly about Topic 1” or “mostly about Topic 2”. In our case, since we have a very clear idea both of what the topics are about (i.e. how topics map onto movies) and what the reviews were supposed to be about we can examine these mixtures and see whether they make sense.
The “mixture” of topics in each document is given by document-topic probabilities referred to collectively as “gamma”. Below we extract these from the object containing the LDA results, and merge the gamma values back into the original data frame containing the reviews.
reviews_gamma <- reviews %>%
left_join(tidy(reviews_lda, matrix = 'gamma') %>%
mutate(reviewId = as.numeric(document)) %>% #some cleaning to make key variable (reviewId) usable
select(-document) %>%
spread(key = topic, value = gamma, sep = '_'))
We know that Topic 1 is clearly about Taxi Driver and Topic 2 is about Waterworld. We can therefore ask: how many of the reviews for each movie were classified as being mostly about “their” topic?
We answer this question below (Taxi Driver has
imdbID = 0075314). Six of the 160 Taxi Driver
reviews were estimated to mostly be about Topic 2, while three of the
160 Waterworld reviews were mostly about Topic 1, according to
LDA.
reviews_gamma %>% group_by(imdbId) %>% summarize(ntopic1 = sum(topic_1 > 0.5))
## # A tibble: 2 × 2
## imdbId ntopic1
## <fct> <int>
## 1 0075314 154
## 2 0114898 3
154 out of 160 is still pretty good. Let’s have a look at a few cases where the topic model got it wrong. First, we can look at a few reviews for Taxi Driver that the model said were mostly about Topic 2 (the other topic, Waterworld).
reviews_gamma %>% filter(imdbId == '0075314') %>% arrange(topic_1) %>% head(7) %>% select(review, topic_1, topic_2)
## # A tibble: 7 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " a nicely made dark and majestic experience see the world t… 0.376 0.624
## 2 " to those who can t figure out the base of this movie you n… 0.388 0.612
## 3 " i had some hopes for this movie since i found it among the … 0.419 0.581
## 4 " when diving into this movie you may think this is about liv… 0.444 0.556
## 5 " a truly disturbing movie travis bickle robert deniro gr… 0.444 0.556
## 6 " my favourite movie number one such a film is once seen in… 0.445 0.555
## 7 " the sicilian connection taxi driver and the winter s tale w… 0.519 0.481
Here we see a Taxi Driver review that, even though it contains words like “taxi” and “driver” quite often, was given a 62% probability of being from Topic 2 by our model.
Compare these to the Taxi Driver reviews that were “most” about Topic 1 (which was, remember, to do with Taxi Driver).
reviews_gamma %>% filter(imdbId == '0075314') %>% arrange(desc(topic_1)) %>% head(3) %>% select(review, topic_1, topic_2)
## # A tibble: 3 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " travis bickle is the sort of person you wouldn t even see … 1.00 6.56e-5
## 2 " watching taxi driver made me realize everything i thought i… 1.00 8.42e-5
## 3 " i watched taxi driver 1976 a couple of years ago though… 1.00 8.72e-5
We notice that these are cases where people write very long, drawn-out reviews: the more words a review contains, the more evidence the model has, and the more confident its topic assignment becomes.
Let’s do the same thing for Waterworld: look at a few reviews that the model said were mostly about Topic 1 (Taxi Driver).
reviews_gamma %>% filter(imdbId != '0075314') %>% arrange(topic_2) %>% head(4) %>% select(review, topic_1, topic_2)
## # A tibble: 4 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " best and most efficient opening sequence in film ever pe… 0.584 0.416
## 2 " i can find something to like in almost any film what s gr… 0.545 0.455
## 3 " this film is mad max combined with planet of the apes and… 0.543 0.457
## 4 " this can go with the chronicles of riddick mad max termi… 0.485 0.515
And finally, compare the reviews above to reviews that were “most” about Topic 2.
reviews_gamma %>% filter(imdbId != '0075314') %>% arrange(desc(topic_2)) %>% head(3) %>% select(review, topic_1, topic_2)
## # A tibble: 3 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " out of sort of like a mad max meets indiana jones … 7.04e-5 1.00
## 2 " it takes gumption to open a movie with the sight of kevin c… 7.99e-5 1.00
## 3 " i ll admit it i liked waterworld or parts of it … 9.12e-5 1.00
We see similar results here: when the reviewer writes a lot, the model assigns the topic with near certainty.
This was just an illustration of how Latent Dirichlet Allocation can be used to divide a corpus of documents into different topics.
For more examples and teaching, you can work through the chapter on Topic Modelling (Chapter 6) of Text Mining with R. There is also a topic modelling part to the case study in Chapter 9.